A couple days ago, someone posted on /0 (the meta community for the Divisions by zero) that the incoming federation from lemmy.world (the largest lemmy instance by an order of magnitude) is malfunctioning. Alarmed, I started digging in, since a federation problem with lemmy.world will massively affe...
Nice post, I enjoyed the storytelling. Glad it's all sorted now 🙂
Btw, regarding this point:
All in all, this has been a fairly frustrating experience and I can't imagine anyone who's not doing IT infrastructure as their day job being able to solve this. As helpful as the other lemmy admins were, they were relying a lot on me knowing my shit around Linux, networking, docker and postgresql at the same time. I had to do extended DB analysis, fork repositories, compile docker containers from scratch and deploy them ad hoc, etc. Someone who just wants to host a lemmy server would give up way earlier than this.
I think you're totally right, but at the same time, I think the collaborative troubleshooting that happened on Matrix (and has happened many times in the past for other issues) is pretty healthy, and not something that is always possible for other open source software.
Glad you were able to figure this one out, I never know whether to be mad at myself or proud of my persistence when I spend like a day trying to fix something that turned out to be really simple and almost always unrelated to what I thought the problem was 🙂
Edit: also if you found any performance-related config improvements, either to the postgres.conf, nginx.conf, or lemmy.hjson, please contribute them to lemmy-ansible so that all instances can benefit from what you've learned.
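In case it's useful to anyone following along, here is the kind of postgresql.conf fragment that typically gets contributed back to lemmy-ansible. The values below are illustrative assumptions for a host with roughly 8 GB of RAM dedicated to Postgres, not measured recommendations — tune them to your own hardware before opening a PR:

```ini
# Illustrative postgresql.conf fragment -- values are assumptions,
# benchmark against your own instance before adopting them.
shared_buffers = 2GB              # ~25% of RAM is a common starting point
effective_cache_size = 6GB        # what the OS page cache will likely hold
work_mem = 16MB                   # per-sort/hash allocation; keep modest
maintenance_work_mem = 512MB      # speeds up VACUUM and index builds
random_page_cost = 1.1            # assumes SSD storage
```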
This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community onto a few big servers instead of many small ones, but given the nature of the problems one can encounter and the lack of support to fix them for non-experts, I don't see an alternative.
This also gave me an insight into how lemmy federation will eventually break once a single server (say, lemmy.world) grows big enough to start overwhelming even servers that, unlike mine, are set up properly.
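The failure mode described above can be sketched with a toy queue model (plain Python, nothing Lemmy-specific — the rates are made up for illustration): once activities arrive faster than the receiving instance can process them, the backlog grows without bound no matter how well the receiver is tuned.

```python
def backlog_after(seconds, arrival_rate, service_rate, start=0):
    """Toy model: backlog grows by (arrivals - processed) per second,
    never dropping below zero. Rates are activities per second."""
    backlog = start
    for _ in range(seconds):
        backlog = max(0, backlog + arrival_rate - service_rate)
    return backlog

# A big instance sending 30 activities/s to a server that handles 25/s:
# the receiver falls 5 activities further behind every second.
print(backlog_after(600, 30, 25))   # 10 minutes -> 3000 activities behind
```

The point of the sketch: a receiver that is even slightly slower than the sender never catches up, which is exactly the "big enough to overwhelm everyone" scenario.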
Lemmy has many scalability problems to solve, and not all of these problems are slow database queries. I believe your experience is going to become increasingly common as the community grows because that increased centralization will compound the scalability problems and continue to drive up the technical know-how required to host a successful instance. The software eventually needs to do more to detect and present operational problems to administrators in a friendly way. I2P is an example of a distributed network that's quite good at reporting issues with the node.
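To make the "present operational problems in a friendly way" idea concrete, here is a minimal sketch (hypothetical function, not anything Lemmy actually ships) of turning a raw federation-lag metric into the kind of human-readable status an I2P-style node console shows:

```python
def federation_health(lag_seconds):
    """Turn a raw lag metric into an admin-friendly status line,
    instead of leaving operators to dig through logs."""
    if lag_seconds < 60:
        return "OK: federation is keeping up"
    if lag_seconds < 3600:
        return f"WARN: federation is {lag_seconds // 60} minutes behind"
    return f"CRITICAL: federation is {lag_seconds // 3600} hours behind"

print(federation_health(7200))   # -> CRITICAL: federation is 2 hours behind
```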
With that said, not everything is doom and gloom. The community has proven itself highly resilient, and smart people like yourself are finding solutions. It's going to be a tough road ahead.
As someone hosting a service like this, especially one with 12K people on it, this is very scary! While two lemmy core developers were in the chat, the help they could provide was very limited overall, and the session mostly relied on my own skills to troubleshoot.
This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community onto a few big servers instead of many small ones, but given the nature of the problems one can encounter and the lack of support to fix them for non-experts, I don't see an alternative.
I disagree with this conclusion. If you had installed Lemmy according to the official instructions, you would have the database, backend and everything else on the same server and would never have run into this particular issue. And any problems you'd have would likely be noticed (and debugged) by many other instances too. Your setup is heavily customized, so it is only natural that few people can help with it.
Anyway, it's an interesting journey; thanks for writing down your experience and for improving the documentation!
The official instructions do not scale, nor do they work for all situations. But beyond that, the problem is not that my bad setup caused an issue. Shit happens, and I didn't blame anyone but myself. The problem is that when something breaks, you have to get lucky to get support. I don't even have to prove this: I know for a fact that there are lemmy instances that shut down because they followed the default setup, ran into issues, got no support, and gave up.
Edit: Also, man, from one FOSS developer to another: you really have to unlearn the instinct to say "it broke because you did it wrong". I know it feels unfair, but trust me, this is not the way.
I'm not saying you did it wrong, it's open source so of course you can use it in any way you like. But some ways have a higher risk of breaking than others.
I'm curious how you think "everything on the same box" scales? You can't load balance, you can't ensure resources are being used efficiently, you can't even reboot a machine without the entire thing going dark.
Tossing everything on the same server is not great: I don't want to pay for fast storage for my image store, but I do want it for my DB. My web server should have extra CPU and network but is otherwise ephemeral. This is the same stuff people have been running for years; it's microservices 101.
The correct thing to do here is to build in tracing and profiling hooks. For example, OpenTracing instrumentation that something like Jaeger can consume would have lit this problem up like a Christmas tree; Pyroscope can show changes over time in where the CPU goes; and logs get shipped off to Graylog or some other centralized service for correlation.
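As a minimal stand-in for what span-based tracing buys you (stdlib Python only, not the real OpenTracing/Jaeger API), a context manager that records how long each named step takes already makes the slow step jump out:

```python
import time
from contextlib import contextmanager

SPANS = []  # in real tracing these would be exported to Jaeger, not a list

@contextmanager
def span(name):
    """Record the duration of a named step, like a bare-bones trace span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append((name, time.perf_counter() - start))

# Hypothetical steps standing in for real federation work:
with span("fetch_activity"):
    time.sleep(0.01)
with span("db_insert"):
    time.sleep(0.05)

# The slowest span points at the bottleneck:
print(max(SPANS, key=lambda s: s[1])[0])   # -> db_insert
```

A real setup would attach these spans to a trace ID per request and export them, but even this crude version answers "where did the time go?" — the question at the heart of this whole debugging saga.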
Images can be stored in S3, so that's not an issue. And Lemmy has some tracing logs as well as Prometheus stats; not sure if db0 tried looking into those.
This is my job, so I'll counter that this isn't realistic; in a professional setting it would probably be hosted in Kubernetes, which spans multiple servers and sometimes multiple regions. I don't think the devs have a readme for that (or maybe they do). The point is that the official docs are geared toward a hobbyist setting up a node, and not having separate VMs makes sense in that scenario. However, I would say it's plain that mister db0 runs a much larger instance than could be considered hobbyist at this point.
Edit: this comment is not written well, and is not describing the issue I wanted to actually comment on, I am tired and sorry
I will hop on to this to also point out that there actually were people willing to actively help (me included, see the original post in this community), but, to put it bluntly, we were not "invited in on the show". Let me expand on that.
The problem is, as @[email protected] points out here, we don't have the slightest idea how exactly your infrastructure looks, without that there is only the most general stuff we can help with.
From my point of view, joining the Matrix chat later in the process, I watched you do/post stuff whose origin I had no idea about; I didn't have the full context of what had already been tried and crossed out, or what the current plan was.
You @[email protected] would have to stop chopping away and start networking with the people helping. That is definitely not easy to do effectively, especially as more people join later (and they too have to be brought up to date on the state), but we could have fast-tracked the docker/compilation stuff and ruled lemmy out sooner.
In retrospect, if we had had a full picture of how the infrastructure looks, the chance someone would go "oh, you have split backend and database servers, check the latency" would definitely have been a lot higher, but we didn't know (hell, I actually assumed your deployment was the same as, or close to, the lemmy-ansible one). I am aware this is easy to say after the solution has been found, but hopefully you get the networking/communication idea.
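The reason the split backend/database setup hurts is easy to quantify: each sequential query pays the backend-to-database round trip, so even a tiny extra RTT compounds. A back-of-the-envelope sketch in Python (the query count per activity is an assumption for illustration, not a measured Lemmy figure):

```python
def added_latency_ms(queries_per_activity, extra_rtt_ms):
    """Extra wall-clock time per activity when the DB moves off-box and
    every sequential query pays an additional network round trip."""
    return queries_per_activity * extra_rtt_ms

# 20 sequential queries per activity at +2 ms RTT adds 40 ms each --
# easily enough to fall behind an instance sending several activities
# per second, even though each individual query still looks "fast".
print(added_latency_ms(20, 2))   # -> 40 ms
```

This is why "check the latency between the servers" is such a high-value question for any split deployment.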
Wait, hold on, how was help not accepted? I talked with everyone who replied to me and followed every suggestion. Whenever someone asked for infra information, I gave it.
You know, it's really frustrating to open myself up and write honestly about my experiences, and then have people say it's actually my fault for not asking for help "the right way". What kind of effect do you think this has on other potential lemmy hosters?
I (self)host a lot of stuff as well as developing and deploying some of my software via docker containers and dabbled in Full-Stack territory quite a few times.
Exposing stuff to the internet still scares the shit out of me. Debugging sucks. There's so much that can go wrong; every layer multiplies the possibilities for things to break or behave in unexpected ways. Your journey describes the pain of debugging perfectly. Yeah, in hindsight, it's often something that probably should have been checked first. But that's hindsight for you.
And that's not even accounting for staying ahead of the game while securing your 24/7 publicly accessible service, running on ever-changing software, with infrastructural requirements you basically have no control over. In your spare time.
Hosting something for yourself can be a lot of fun; hosting something for other people, potentially many thousands of them, makes you responsible. That can be rewarding and fun at times as well, but it is also a prime source of headaches.
Deploying stuff is the easy part; knowing what to do when stuff inevitably breaks is where it's at. Therefore, IMHO, it's probably a good thing that most Lemmy admins at least know where to start (or whom to ask) when shit hits the fan. This unfortunately leads to more centralization, but for good reasons: teams of volunteers taking care of fewer instances will almost always lead to a better experience than a lot of lone wolves curating many small instances. Improving scalability, monitoring, and documentation is always nice, but will never replace a capable admin such as yourself.
Well that was an entertaining read! Thanks for all your efforts to keep our instance running smoothly. I have noticed it seems a bit snappier since you fixed the problem.