
Most people are VRAM constrained, not compute constrained.


But those people usually have more system RAM than VRAM.

At those scales, most people end up on CPU inference instead of multiple GPUs, and become bandwidth and compute constrained. In those cases, an MoE with a low number of active parameters is the fastest.
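A rough back-of-envelope sketch of why that is, assuming decode is purely memory-bandwidth bound and that every active weight is read once per token. The function name and all the numbers below (70B dense, 12B active, 4-bit weights, 80 GB/s system RAM) are made up for illustration, not measurements:

```python
def est_tokens_per_sec(active_params_b, bytes_per_param, mem_bandwidth_gbs):
    """Rough upper bound on decode speed when memory bandwidth is the bottleneck.

    active_params_b: parameters touched per token, in billions
    bytes_per_param: e.g. 0.5 for 4-bit quantization, 2.0 for fp16
    mem_bandwidth_gbs: memory bandwidth in GB/s
    """
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return mem_bandwidth_gbs * 1e9 / bytes_per_token


# Hypothetical setup: 4-bit weights, 80 GB/s of system RAM bandwidth.
dense = est_tokens_per_sec(active_params_b=70, bytes_per_param=0.5, mem_bandwidth_gbs=80)
moe = est_tokens_per_sec(active_params_b=12, bytes_per_param=0.5, mem_bandwidth_gbs=80)

print(f"dense 70B:      ~{dense:.1f} tok/s")
print(f"MoE 12B active: ~{moe:.1f} tok/s")
```

The total parameter count only decides whether the model fits in RAM; under this simplification the per-token speed scales with the active parameters. Real numbers will be lower once compute, cache behavior, and prompt processing are counted.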


Cloud providers aren’t though.



