It turns out the way I set up file sharing between the front-end web server and the app server created immense slowdowns on fedia. I consolidated some things (the web server now runs on the app server), which isn't a long-term solution, but will work in the very short term.
I am having to copy media from the S3 storage bucket back to a local directory - that is going to take a long time, and images on the site are going to be broken until that finishes - hopefully overnight.
There are still a few lingering error 500s and I will tackle that tomorrow evening.
I intend to spend some time debugging why federation periodically just stops and has to be manually restarted, as well as the cause of the error 500s. Sorry for the problems.
It's doing that on and off for me. I think it's got to be linked to the magazines that throw 500 errors. For instance, if you try to look at your subs and that first page contains a post that would've federated from one of the magazines throwing a 500 error, it just throws a 500 for that whole sub page. Just my assumption.