Most people are vram constrained not compute constrained. | Hacker News


msp26 on April 24, 2024 | on: Snowflake Arctic Instruct (128x3B MoE), largest op...

Most people are VRAM constrained, not compute constrained.

Manabu-eo on April 24, 2024

But those people usually have more system RAM than VRAM.

At those scales, most people end up running CPU inference out of system RAM instead of using multiple GPUs, and then they become memory-bandwidth and compute constrained. In those cases, an MoE with a low number of active parameters is the fastest option.
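A rough back-of-envelope for the active-parameter point, as a sketch only. It assumes decoding is purely memory-bandwidth bound, so each generated token reads every active weight once. The 80 GB/s bandwidth, the 4-bit quantization, and the 70B dense comparison model are illustrative assumptions, not measurements; the 480B total / 17B active split is the approximate size usually quoted for Arctic and is also treated as an assumption here. The function names are made up for this example.

```python
def weights_size_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (ignores KV cache and overhead)."""
    return total_params_billion * bytes_per_param


def decode_tokens_per_second(
    active_params_billion: float,
    bytes_per_param: float,
    bandwidth_gb_s: float,
) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Illustrative assumptions, not benchmarks.
BANDWIDTH_GB_S = 80.0   # hypothetical desktop system-RAM bandwidth
BYTES_PER_PARAM = 0.5   # 4-bit quantized weights

# Hypothetical dense 70B model: all parameters are active for every token.
dense_tps = decode_tokens_per_second(70, BYTES_PER_PARAM, BANDWIDTH_GB_S)

# MoE with about 480B total but about 17B active parameters per token (assumed).
moe_tps = decode_tokens_per_second(17, BYTES_PER_PARAM, BANDWIDTH_GB_S)
moe_size = weights_size_gb(480, BYTES_PER_PARAM)

print(f"dense 70B bound: {dense_tps:.1f} tok/s")
print(f"MoE 17B-active bound: {moe_tps:.1f} tok/s, weights need {moe_size:.0f} GB")
```

Under these assumptions the MoE needs far more memory to hold but reads far fewer bytes per token, which is why it can come out ahead once the weights fit in system RAM. Real throughput will be lower, since prompt processing and compute limits are ignored.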

kaibee on April 24, 2024

Cloud providers aren’t, though.
