
GPUs need RAM that can handle a lot of bandwidth, so that all of the execution units can stay constantly fed. Bandwidth is the product of a width and a transfer rate (often bounded by the clock speed), e.g. a 384-bit bus at XXXX million transfers/second. It doesn't matter how much compute or RAM you have if these don't align and you can't feed the cores. Modern desktop DDR has far too little bandwidth for this, given that a modern GPU has shitloads of compute to feed.

On top of that, signal integrity on wide parallel RAM interfaces has very tight tolerances; DDR sockets are very carefully placed on motherboards with this in mind, for instance. GDDR, which most desktop-class graphics cards use instead of normal DDR, has much higher bandwidth (e.g. GDDR6X runs at ~21 Gbps per pin while DDR5-4800 is only around 4.8 Gbps per pin) but even tighter interface requirements than DDR. That's one reason you can't socket GDDR: the physical tolerances and signal integrity it needs put a socket out of the question.
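
To make the arithmetic concrete, here's a rough sketch (the per-pin rates are ballpark figures for GDDR6X and DDR5-4800, not exact board specs):

    # peak bandwidth (GB/s) = bus width (bits) / 8 * per-pin rate (GT/s)
    def bandwidth_gbs(bus_width_bits, transfer_rate_gts):
        return bus_width_bits / 8 * transfer_rate_gts

    print(bandwidth_gbs(384, 21))   # 384-bit GDDR6X @ ~21 GT/s -> ~1008 GB/s
    print(bandwidth_gbs(128, 4.8))  # dual-channel DDR5-4800    -> ~76.8 GB/s

That's an order-of-magnitude gap before you even start talking about latency or power.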

Here's an example: compare the RAM interfaces of an Nvidia A100 (HBM2) and an Nvidia 3080 (GDDR6X) and see how that impacts performance. On bandwidth-hungry workloads, an A100 will absolutely destroy a 3080 in terms of overall efficiency. One reason is that the A100's memory interface is far wider: each HBM stack is 1024 bits wide and the A100 has five of them, for 5120 bits total, versus a 320-bit bus on the 3080. That means far more data can be fed into the execution units per transfer, which means you can clock the whole system lower, which means less power for similar (or better) performance. The only way a narrow GDDR bus can keep up with a wide HBM one is by pushing the clocks higher (thus increasing the rate of transfers/second), but that costs more heat and power, and it scales very poorly in practice, e.g. a 10% clock speed increase might buy a measly 1-2% improvement.
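
Plugging in the public specs (treat the exact per-pin rates as approximations) shows the wide-and-slow vs narrow-and-fast trade-off:

    def bandwidth_gbs(bus_width_bits, transfer_rate_gts):
        return bus_width_bits / 8 * transfer_rate_gts

    print(bandwidth_gbs(5120, 2.43))  # A100 40GB, HBM2  -> ~1555 GB/s
    print(bandwidth_gbs(320, 19))     # RTX 3080, GDDR6X -> ~760 GB/s

The HBM part delivers roughly twice the bandwidth while running each pin at roughly an eighth of the rate, which is exactly where the perf/watt advantage comes from.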

So now a bunch of things fall out of these observations. You can't have extremely high-bandwidth RAM, today, without very tight interface characteristics. Desktop and server-class CPUs don't need bandwidth like GPUs do, so they can get away with sockets. That has some knock-on benefits: CPU memory gets the economies of scale of RAM sticks, for example; lots of people buy them, so supply is plentiful and prices are good. And because DIMMs stand up off the board "in three dimensions", you get a huge increase in memory density per square inch of motherboard. If you want a many-core GPU to stay fed, you either need soldered RAM, which means a fixed SKU for deployment, or you need to cut down on the compute so lower-bandwidth memory can keep up, negating the reason you went to GPUs in the first place (more parallel compute). Soldered RAM also means the compute/memory ratio is fixed forever. One nice thing about a CPU with sockets is that you can arbitrage resources more flexibly over time; if you find a way to speed something up with more RAM, you can just add it, assuming you aren't already maxed out.
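
That fixed compute/memory ratio is basically the "machine balance" from the roofline model. A minimal sketch of the idea, with made-up hardware numbers:

    # A kernel is bandwidth-bound when its arithmetic intensity (FLOPs per
    # byte moved) falls below the machine balance (peak FLOP/s / peak bytes/s).
    def bound_by(flops_per_byte, peak_tflops, peak_bw_gbs):
        machine_balance = peak_tflops * 1e12 / (peak_bw_gbs * 1e9)
        return "bandwidth-bound" if flops_per_byte < machine_balance else "compute-bound"

    # Hypothetical GPU: 30 TFLOP/s FP32, 760 GB/s -> balance of ~39 FLOPs/byte
    print(bound_by(4, 30, 760))    # streaming-style kernel: bandwidth-bound
    print(bound_by(100, 30, 760))  # big dense matmul:       compute-bound

If most of your kernels land on the bandwidth-bound side, extra cores do nothing for you; that's the "can't feed the cores" failure mode.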

Note that Apple Silicon is designed for lower power profiles; it has great perf/watt, not necessarily the best absolute performance in every profile. It uses 256- or 512-bit LPDDR5(X), and even goes as high as 1024-bit(!!!) on the Ultra parts. But they can't just ignore the laws of physics; at extremely high bandwidths and bus widths you're still very much subject to the same signal integrity requirements. Those physical limitations rule out bountiful RAM sticks, each with multiple juicy Samsung DDR5 chips on them, so capacity density suffers. Apple is limited to only so much RAM; there's very little way around that unless they start stacking in three dimensions or something. That's likely one of the other reasons they've used soldered (on-package) memory for so long now: it's what makes extremely high-performance interfaces like this possible.
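
The same back-of-envelope arithmetic explains Apple's headline bandwidth numbers (assuming LPDDR5 at ~6.4 GT/s per pin; newer LPDDR5X parts run each pin faster):

    def bandwidth_gbs(bus_width_bits, transfer_rate_gts):
        return bus_width_bits / 8 * transfer_rate_gts

    print(bandwidth_gbs(256, 6.4))   # Pro-class:   ~205 GB/s
    print(bandwidth_gbs(512, 6.4))   # Max-class:   ~410 GB/s
    print(bandwidth_gbs(1024, 6.4))  # Ultra-class: ~819 GB/s

which lines up with the ~200/400/800 GB/s figures Apple quotes.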

All in all, the economies of scale for RAM sticks, combined with their density, mean that GPUs will probably continue to be worse for workloads that want lots of memory capacity. You just can't meet the combined physical interface and bandwidth requirements at the same density levels.



I created my account just to reply to this comment.

This is a great read. Where can I read more of this stuff? I love the insight you put in there.


Fantastic comment, thank you!

Do you think there’s any hope for UMA on PC / x86 systems? Seems like Intel would have an incentive to offer parts, but would it be possible to remain Windows/legacy OS compatible with a UMA implementation?


Amazing post - thank you so much for typing that out.


Thoughts on HBM that Intel has been touting?



