Or you can swap to DRAM and stream over PCIe. I do that all the time and it works fine, especially since you rarely need to load all 70 GB into memory at once.
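For concreteness, here's roughly what the streaming pattern looks like. This is a minimal sketch, not a real pipeline: the file name and layer size are made up, and the actual compute is elided. The point is that memory-mapping only pages data in as you touch it.

```python
import numpy as np

# Hypothetical flat fp16 weight file (~70 GB), laid out one layer after another.
WEIGHTS_PATH = "weights.fp16.bin"   # assumed filename, not a real artifact
LAYER_ELEMS = 900_000_000           # assumed elements per layer (~1.8 GB as fp16)

# np.memmap maps the file without reading it; pages fault in from disk/DRAM on access.
weights = np.memmap(WEIGHTS_PATH, dtype=np.float16, mode="r")

def run_streaming(activations):
    for i in range(weights.size // LAYER_ELEMS):
        # Materialize just this layer; pages from earlier layers can be evicted,
        # so resident memory stays around one layer's worth, not the full 70 GB.
        layer = np.array(weights[i * LAYER_ELEMS:(i + 1) * LAYER_ELEMS])
        # ... apply the layer to `activations` here (matmuls etc.) ...
    return activations
```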
If you're a serious (read: enterprise) customer, you can buy InfiniBand-enabled cards and get duplex bandwidth faster than the M2 Max's entire memory bus. 'Unified memory' isn't even a bullet point on their spec sheet; it means nothing to their customers when they have CUDA primitives that do the same thing faster and at larger scale.
So, at least on paper it doesn't seem possible for a single server with a bunch of PCIe InfiniBand links to actually match the bandwidth of the M2 Max memory bus. Maybe 3/4 of it though, which isn't terrible. :)
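Rough numbers behind that 3/4 guess, with all of the inputs being assumptions rather than vendor specs (400 Gb/s per adapter, roughly six usable x16 slots in one box):

```python
# All assumptions, not vendor specs: 400 Gb/s InfiniBand adapters,
# roughly six usable PCIe x16 slots in a single server.
adapter_gbyte_per_s = 400 / 8                     # 50 GB/s per adapter
usable_slots = 6
aggregate = usable_slots * adapter_gbyte_per_s    # 300 GB/s
m2_max = 400                                      # GB/s memory bandwidth

print(aggregate / m2_max)                         # 0.75, roughly 3/4 of the M2 Max bus
```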
InfiniBand is intended to be run in parallel. You could handily exceed the M2 Ultra's bandwidth if you ran 4 channels of IB, but rarely do people need more bandwidth than what PCIe offers. At least for AI.
> You could handily exceed the M2 Ultra bandwidth if you ran 4 channels of IB ...
Hmmm, that doesn't seem to be the case?
Those adapters are 400 Gb/s "total bandwidth" each. Not 400 Gb/s "each way". And you can't get 8 of those adapters into a server (with x32 links anyway).
> And you can't get 8 of those adapters into a server (with x32 links anyway).
Nvidia's DGX A100 system promises over 3.2 Tb/s across 10x Mellanox ConnectX-6 cards, and that's their old Epyc system. With the Grace Superchip you can get 900 GB/s of card interconnect and 1 TB/s of memory bandwidth. Either one could exceed the bandwidth of the M2 Max's memory.
The "each way" shtick is worth contesting, and ultimately comes down to how you use Nvidia's CUDA primitives. Agree to disagree on that; I think my point still stands, though. "Unified memory architecture" is an anachronism at that scale, literally rendered obsolete by a unified address space and a fast enough interconnect.
Bearing in mind the unit conversion (Tb/s vs GB/s), 3.2 Tb/s works out to 400 GB/s, which matches the M2 Max's memory bandwidth rather than exceeding it.
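Spelling the conversion out:

```python
dgx_tbit_per_s = 3.2                           # Tb/s, the figure quoted above
dgx_gbyte_per_s = dgx_tbit_per_s * 1000 / 8    # 400.0 GB/s
m2_max_gbyte_per_s = 400                       # GB/s

print(dgx_gbyte_per_s > m2_max_gbyte_per_s)    # False: it matches, it doesn't exceed
```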
---
Their newly announced Grace and/or Grace Hopper "Superchip" does seem interesting. Haven't (yet) seen how it's supposed to connect to other infrastructure though.
Their whitepaper talks about "OEM-defined I/O" but doesn't (in my skimming thus far) indicate what the upper bounds are.
May look further later on, but we're pretty far into the weeds already. ;)
---
Further along in the whitepaper, it says the "NVLink Switch System" in those chips communicates with the network at 900 GB/s "total bandwidth". If that's indeed the case, then yep, they're beating the M2 Max's memory bandwidth (400 GB/s).
That even beats the M2 Ultra's memory bandwidth (800 GB/s).
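Putting the numbers quoted in this thread side by side, and taking the whitepaper's 900 GB/s figure at face value:

```python
nvlink_switch = 900   # GB/s "total bandwidth", as read from the whitepaper above
m2_max = 400          # GB/s memory bandwidth
m2_ultra = 800        # GB/s memory bandwidth

print(nvlink_switch > m2_max)    # True
print(nvlink_switch > m2_ultra)  # True, assuming "total" isn't both directions summed
```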