
It’s harder because gradient descent is usually done with near-synchronous updates of the parameters across the whole model on every training host.
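
To make the "near-synchronous" part concrete, here's a minimal sketch of synchronous data-parallel SGD as a toy numpy simulation (not any particular framework; the model, shard sizes, and learning rate are made up). Each step, every host computes a gradient on its own data shard, the gradients are averaged across hosts (the all-reduce), and every host applies the same update. That averaged gradient is the size of the whole model, and it has to cross the interconnect on every step.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data-parallel setup: 4 "hosts", each holding a replica of the same
    # weight vector and its own shard of the data.
    n_hosts, dim, lr = 4, 8, 0.01
    w = rng.normal(size=dim)                  # every host starts with an identical copy

    shards = []
    for _ in range(n_hosts):
        X = rng.normal(size=(32, dim))
        shards.append((X, X @ np.ones(dim)))  # targets from a fixed "true" weight vector

    def local_grad(w, X, y):
        # Mean-squared-error gradient computed on this host's shard only.
        return 2.0 * X.T @ (X @ w - y) / len(y)

    for step in range(100):
        grads = [local_grad(w, X, y) for X, y in shards]  # runs in parallel on a real cluster
        g = np.mean(grads, axis=0)   # the "all-reduce": every host needs the full averaged gradient
        w -= lr * g                  # identical update everywhere keeps the replicas in lockstep

For a model with billions of parameters, that averaged gradient is gigabytes of data per step, which is why the interconnect numbers below matter so much.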


Yeah, the synchronous updates are a big deal. To get a sense of how much bandwidth this typically takes: Oracle offers 1,600 Gbps of interconnect per node, with very low latency, probably in the tens of microseconds. A really good home connection might have 1 Gbps with latency in the tens of milliseconds. The big question is whether we really need all that interconnect -- GPT-JT [1] is a very promising step in this direction: the idea is to drop most of the gradient updates, and training still works well. It's unclear whether this will take off generally, but if it does it would be a huge deal, because 1,600 Gbps of interconnect is very expensive.

[1]: https://www.together.xyz/blog/releasing-v1-of-gpt-jt-powered...
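
For context on the "drop most of the gradient updates" idea: one common technique in this family is top-k gradient sparsification with error feedback, where each host only communicates the largest ~1% of gradient entries and keeps the dropped remainder locally to add back next step. This is just a numpy sketch of the general idea -- keep_frac and the toy gradient size are made up, and it's not a claim about how GPT-JT itself is implemented.

    import numpy as np

    def sparsify_topk(grad, keep_frac=0.01):
        # Keep only the largest-magnitude entries of the gradient; zero out the rest.
        # Returns (what gets communicated, what is kept locally as a residual).
        k = max(1, int(keep_frac * grad.size))
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        sparse = np.zeros_like(grad)
        sparse[idx] = grad[idx]
        return sparse, grad - sparse

    rng = np.random.default_rng(0)
    grad = rng.normal(size=10_000)      # stand-in for one step's local gradient
    residual = np.zeros_like(grad)      # error feedback: dropped entries re-enter next step

    to_send, residual = sparsify_topk(grad + residual, keep_frac=0.01)
    print(np.count_nonzero(to_send), "of", grad.size, "entries communicated")

Sending indices plus values for ~1% of the entries cuts per-step communication by roughly 50x, which is the kind of reduction that starts to make slow home-connection links plausible.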



