Unplanned service interruption for lemmy.pineapplemachine.com, 2023-06-08 18:10-19:15 UTC

The server that lemmy.pineapplemachine.com is hosted on became unresponsive after an increase in network traffic and CPU usage at approximately 18:10 UTC on 2023-06-08. As of approximately 19:15 UTC, an hour later, the server has been upgraded with additional CPU and RAM resources and has been brought back online. Hopefully the service will be more stable from now on.

Sorry about that!

  • Any theories as to where the load spike came from? Too many users browsing? Some big dump of data through federated replication? Something else?

    • Any theories as to where the load spike came from? Too many users browsing? Some big dump of data through federated replication? Something else?

      I've been dissecting the logs to try to answer this for myself...

      Traffic to lemmy itself remained at a relatively modest volume until 18:05:05, at which point the logs stopped entirely until I noticed the outage and rebooted the server a while later. However, there was a noticeably elevated amount of traffic being directed at pictrs, which is especially significant considering that lemmy.pineapplemachine.com was previously hosted on a paltry t3.micro AWS instance. Prior to its last logged item at 18:05:06, pictrs had been logging several hundred GET requests per minute.

      Edit: Not actually hundreds per minute, but less than that. I initially misunderstood the log format. But still kind of a lot for image-related traffic, and definitely elevated compared to before.
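
      For what it's worth, tallying requests per minute from the container logs only takes a small script. The sketch below is just one way to do it and makes a couple of assumptions: docker-style logs with an ISO-8601 timestamp at the start of each line (as produced by docker logs -t), and counting any line containing "GET". Adjust the parsing to whatever the actual log format is.

      ```python
      # Count GET requests per minute from log lines piped in on stdin.
      # Assumes each line starts with an ISO-8601 timestamp, e.g.
      # "2023-06-08T18:04:59.123456789Z ...". Adjust the regex otherwise.
      import re
      import sys
      from collections import Counter

      TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")

      per_minute = Counter()
      for line in sys.stdin:
          match = TIMESTAMP.match(line)
          if match and "GET" in line:
              per_minute[match.group(1)] += 1  # key is date plus hour:minute

      for minute, count in sorted(per_minute.items()):
          print(f"{minute}  {count} GET requests")
      ```

      Usage would be something like: docker logs -t <pictrs-container> 2>&1 | python3 count_per_minute.py (the container and script names are placeholders).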

      I don't know enough about lemmy's technical workings to be certain, but my best guess is that the announcement post in /r/apolloapp about the app closing down at the end of the month drove a renewed surge of traffic to https://join-lemmy.org/instances, which resulted in an unusually high volume of requests to retrieve the instance's green icon image (icon images are requested by each visitor from the instances themselves, instead of being rehosted by join-lemmy). That, in combination with the existing federation-related image traffic, tripped some threshold (not hard to do on a t3.micro) and killed the server dead.

      While investigating this, I've learned that pictrs supports using object storage (such as Amazon S3) for hosting images, and I will have to look into setting that up in the near future. I would expect this to reduce the likelihood of availability problems arising because a popular webpage embedded an image that was hosted here.
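
      From what I can tell so far, the configuration would amount to pointing the pictrs store at a bucket, roughly like the sketch below. I haven't set this up yet; the key names are from memory and may differ between pict-rs versions, so treat every line as a placeholder and check the pict-rs documentation for the version actually in use.

      ```toml
      # Hypothetical pict-rs object storage settings; unverified placeholders.
      [store]
      type = "object_storage"
      endpoint = "https://s3.eu-central-1.amazonaws.com"
      bucket_name = "lemmy-pineapplemachine-pictrs"
      region = "eu-central-1"
      access_key = "REPLACE_ME"
      secret_key = "REPLACE_ME"
      ```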

      • surge of traffic to https://join-lemmy.org/instances, which resulted in an unusually high volume of requests to retrieve the instance’s green icon image

        Interesting, thanks for the writeup. I wonder if pict-rs is trying to do some dynamic resizing or something fancy... if that's what's up, switching to blob storage might not help that much. Just serving a static image seems like it shouldn't hit the CPU too hard... but I guess I dunno what the request volume was and a micro doesn't leave a lot of headroom. If you're using the docker setup, I wonder if nginx could be configured to serve the icon from a plain static file. Nginx is definitely very efficient at using a small amount of CPU to handle a large number of static file requests.
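
        Something along these lines is what I had in mind, dropped into the nginx server block that already fronts lemmy and pict-rs. The URL and file path are placeholders; you'd use the instance's actual icon URL and a copy of the icon saved somewhere on disk.

        ```nginx
        # Serve the instance icon straight from disk instead of proxying
        # to pict-rs. An exact-match location takes priority, so only this
        # one URL bypasses the usual /pictrs/ proxy rule.
        location = /pictrs/image/REPLACE-WITH-ICON-ID.png {
            alias /srv/lemmy/static/icon.png;
            expires 1d;
            add_header Cache-Control "public";
        }
        ```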

        Definitely a neat writeup though, thanks for sharing your experience.

        • Just serving a static image seems like it shouldn’t hit the CPU too hard… but I guess I dunno what the request volume was and a micro doesn’t leave a lot of headroom.

          Yeah, it doesn't feel like a smoking gun, but I don't have a better guess for now. Logs for everything just sort of cut off, without any plainly obvious reason for having done so. The only notable anomaly I could find in the logs was that the amount of image traffic had gone up relative to before and was higher than I would have expected.

          I do have to note, though, that the lemmy and pictrs logs are both very noisy, containing a lot of information that is redundant or not likely to be useful to a server admin. The volume of logs I had to sift through didn't make it any easier to get insight into this. Unfortunately I haven't seen documentation on how to configure the log verbosity so far, and I don't really have a lot of time to dig into it right now.
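
          If lemmy and pictrs follow the usual Rust tracing conventions, though, the verbosity can probably be dialed down with the RUST_LOG environment variable in docker-compose. The snippet below is an untested guess; the service and crate names are assumptions and should be matched against the actual compose file.

          ```yaml
          # docker-compose.yml fragment: service names and RUST_LOG filters
          # here are assumptions, not a verified configuration.
          services:
            lemmy:
              environment:
                - RUST_LOG=warn,lemmy_server=info
            pictrs:
              environment:
                - RUST_LOG=warn
          ```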
