Why isn't there more investment in semi-synchronous training - is it that the convergence is iffy? Also, it would be great to refactor this code into a typed language, so it is easier to reason about and maintain.
Recently there's been a lot of interest and improvements in semi-synchronous training. The Streaming DiLoCo paper came out this year and is a big step forward for datacenter semi-sync.
Historically it's been limited to areas like federated learning for low-power/low-bandwidth training, but with the massive increase in the number of GPUs it's becoming relevant even for training in datacenters.
It is another variable ML researchers have to tune, so it does add some complexity, and I expect most folks just aren't familiar with it yet.
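To make that extra knob concrete, here's a minimal DiLoCo-style sketch of semi-synchronous training - illustrative only, not torchft's actual API. The sync interval `H` and the outer SGD hyperparameters are assumptions for the example: each worker takes `H` local optimizer steps, then the workers average their pseudo-gradients and take one slow "outer" step.

```python
# Minimal DiLoCo-style semi-synchronous loop -- an illustrative sketch,
# not torchft's actual API. H (the sync interval) is the extra knob
# mentioned above; the outer SGD hyperparameters are assumptions.
import torch
import torch.distributed as dist

H = 64  # local steps between synchronizations

def train(model: torch.nn.Module, loader) -> None:
    inner_opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    # Each worker keeps a copy of the "outer" (global) parameters and a
    # slow outer optimizer (Nesterov SGD, as in the DiLoCo papers).
    outer_params = [p.detach().clone() for p in model.parameters()]
    outer_opt = torch.optim.SGD(outer_params, lr=0.7, momentum=0.9, nesterov=True)

    for step, (x, y) in enumerate(loader):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

        if (step + 1) % H == 0:
            # The only communication: average the pseudo-gradient
            # (outer params minus current local params) across workers.
            for outer_p, local_p in zip(outer_params, model.parameters()):
                delta = outer_p - local_p.detach()
                dist.all_reduce(delta)          # sum across workers
                delta /= dist.get_world_size()  # -> average
                outer_p.grad = delta
            outer_opt.step()
            outer_opt.zero_grad()
            # Reset local params to the updated outer params.
            with torch.no_grad():
                for outer_p, local_p in zip(outer_params, model.parameters()):
                    local_p.copy_(outer_p)
```

The trade-off is visible in the sketch: a larger `H` means less communication but a bigger drift between local and global parameters, which is why convergence has to be re-tuned.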
On "typed language": all of torchft is typed! The coordination/quorum layers are written in Rust w/ GRPC and the front-end is typed Python with Pyre since it has to interact with PyTorch and model code.
Thanks! I am curious how this relates to the recent "monarch" announcement, which has similar goals of facilitating large-scale fault-tolerant training [1].
We're working on making these composable. torchft is largely focused on model integration and algorithms, whereas Monarch handles more of the orchestration/monitoring. They operate at somewhat different layers, but the plan is for torchft to provide the fault-tolerant algorithms that can be used either in Monarch or in a standard PTD (PyTorch Distributed) job.