Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints [pdf]
www.cs.rice.edu/~eugeneng/papers/SOSP23.pdf