I doubt it to be honest. Desktop GPUs use too much power (and hence produce too much heat) to be integrated in that fashion, and any kind of shared memory would have too high latency.
There are 'desktop' (well, server) CPUs with 64GB of HBM memory per socket now. And big LLMs can be run on lower memory bandwidth systems (like Zen 4 chips with 12 DDR5 channels per socket) at lower performance, and on those systems 1-2TB of RAM is no big deal.
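For a rough sense of the performance trade-off: token generation for a dense LLM is approximately memory-bandwidth bound, since every token requires streaming all the weights. A minimal back-of-envelope sketch (the bandwidth and model-size figures below are illustrative assumptions, not measured numbers):

```python
# Back-of-envelope: tokens/s upper bound ~= memory bandwidth / bytes
# streamed per token (roughly the model size for a dense model).
# All concrete numbers here are assumptions for illustration.

def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on generation speed for a dense model."""
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 140.0  # assumed: a 70B-parameter model at fp16 (2 bytes/param)

# Assumed 12 channels of DDR5-4800: 12 * 4.8 GT/s * 8 B ~= 460 GB/s/socket
ddr5 = tokens_per_second(460.0, MODEL_GB)

# Assumed HBM-equipped socket at ~1000 GB/s
hbm = tokens_per_second(1000.0, MODEL_GB)

print(f"DDR5 socket: ~{ddr5:.1f} tok/s, HBM socket: ~{hbm:.1f} tok/s")
```

So a 12-channel DDR5 socket lands in the low single digits of tokens per second for a model that size, slower than HBM or a GPU, but usable, and the huge RAM capacity means the model fits at all.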
I expect we'll see more unified memory designs like Apple's 128GB M1 Ultra.