> You could handily exceed the M2 Ultra bandwidth if you ran 4 channels of IB .....

smoldesu · on Aug 12, 2023

> And you can't get 8 of those adapters into a server (with x32 links anyway).

Nvidia's DGX-100 system promises over 3.2Tb/s across 10x Mellanox ConnectX-6 cards. That's their old Epyc system too, with the Grace Superchip you can get 900GB/s of card interconnect and 1TB/s of memory bandwidth. Either one could exceed the bandwidth of the M2 Max's memory.

The "each way" shtick is worth contesting, and ultimately comes down to how you use Nvidia's CUDA primatives. Agree to disagree on that - however I think my point still stands. "Unified memory architecture" is an anachronism at that scale, literally rendered obsolete by a unified address space and fast enough interconnect.

justinclift · on Aug 13, 2023

Cool. Yeah that's an interesting point about the DGX-100 system. I'd forgotten about those. :)

Looking at the data sheet, it seems to have a maximum of 8 single port Nvidia ConnectX-7 adapters:

https://resources.nvidia.com/en-us-dgx-systems/ai-enterprise...

Bearing in mind the unit conversions (Tb/s vs GB/s), 3.2Tb/s matches up with the 400GB/s bandwidth of the M2 Max's memory bandwidth.

---

Their newly announced Grace and/or Grace Hopper "Superchip" does seem interesting. Haven't (yet) seen now it's supposed to connect to other infrastructure though.

Their whitepaper talks about "OEM-defined I/O" but doesn't (in my skimming thus far) indicate what the upper bounds are.

May look further later on, but we're pretty far into the weeds already. ;)

---

Further along in the whitepaper, it says the "NVLink Switch System" in them communicates with the network at 900GB/s "total bandwidth". If that's indeed the case, they yep they're beating the M2 Max's memory bandwidth (400GB/s).

That even beats the M2 Ultra's memory bandwidth (800GB/s):

https://en.wikipedia.org/wiki/Apple_silicon#Apple_M2_Ultra