It’s harder because gradient descent is usually done with near synchronous updat...

why_only_15 · on March 18, 2023

Yeah, the synchronous updates are a big deal. To get an idea of how much bandwidth this typically takes: Oracle has 1600 Gbps of interconnect per node, with very low latency (I'm guessing tens of microseconds). A really good home connection might have 1 Gbps of bandwidth, with latency in the tens of milliseconds. The big question is whether we really need all this interconnect. GPT-JT[1] is a very promising step toward needing less of it: the idea is that we just drop most of the gradient updates and it still works well. It's unclear whether this will take off generally, but if it does it would be a huge deal, because 1600 Gbps of interconnect is very expensive.
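To put rough numbers on that gap, here is a back-of-envelope sketch of how long it would take to ship one full set of gradients over each link. The model size (6 billion parameters), the 2-byte gradient precision, and the `keep_fraction` knob are all my assumptions for illustration, not figures from the GPT-JT post. It also ignores latency, protocol overhead, and the extra traffic a real all-reduce needs, so treat it as a lower bound.

```python
def transfer_seconds(num_params, bytes_per_param, link_gbps, keep_fraction=1.0):
    """Idealized time to send one set of gradients over a link.

    keep_fraction is a hypothetical knob: the share of gradient values
    actually sent (1.0 = send everything, 0.01 = drop 99% of them).
    """
    bits_to_send = num_params * bytes_per_param * 8 * keep_fraction
    return bits_to_send / (link_gbps * 1e9)


# Assumed values, for illustration only.
PARAMS = 6e9          # assumed model size: 6 billion parameters
BYTES_PER_PARAM = 2   # assumed 16-bit gradients

datacenter = transfer_seconds(PARAMS, BYTES_PER_PARAM, link_gbps=1600)
home = transfer_seconds(PARAMS, BYTES_PER_PARAM, link_gbps=1)
home_sparse = transfer_seconds(PARAMS, BYTES_PER_PARAM, link_gbps=1,
                               keep_fraction=0.01)

print(f"1600 Gbps node:           {datacenter:.2f} s per sync")
print(f"1 Gbps home link:         {home:.2f} s per sync")
print(f"1 Gbps, 1% of grads sent: {home_sparse:.2f} s per sync")
```

Under these assumptions the home link is 1600x slower per sync, which is why dropping most of the updates matters so much: the time scales linearly with the fraction you actually send.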

[1]: https://www.together.xyz/blog/releasing-v1-of-gpt-jt-powered...