Or you can swap to DRAM and stream over PCIe. I do that all the time and it works fine, especially since you rarely need to load all 70 GB into memory at once.
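For concreteness, here's roughly what the streaming pattern looks like. This is a minimal sketch, not a real pipeline: the file name and layer size are made up, and the actual compute is elided. The point is that memory-mapping only pages data in as you touch it.

```python
import numpy as np

# Hypothetical flat fp16 weight file (~70 GB), laid out one layer after another.
WEIGHTS_PATH = "weights.fp16.bin"   # assumed filename, not a real artifact
LAYER_ELEMS = 900_000_000           # assumed elements per layer (~1.8 GB as fp16)

# np.memmap maps the file without reading it; pages fault in from disk/DRAM on access.
weights = np.memmap(WEIGHTS_PATH, dtype=np.float16, mode="r")

def run_streaming(activations):
    for i in range(weights.size // LAYER_ELEMS):
        # Materialize just this layer; pages from earlier layers can be evicted,
        # so resident memory stays around one layer's worth, not the full 70 GB.
        layer = np.array(weights[i * LAYER_ELEMS:(i + 1) * LAYER_ELEMS])
        # ... apply the layer to `activations` here (matmuls etc.) ...
    return activations
```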
If you're a serious (read: enterprise) customer, you can buy InfiniBand-enabled cards and get duplex bandwidth faster than the M2 Max's entire memory bus. 'Unified memory' isn't even a bullet point on their spec sheet; it means nothing to their customers when they have CUDA primitives that do the same thing faster and at larger scale.
So, at least on paper it doesn't seem possible for a single server with a bunch of PCIe InfiniBand links to actually match the bandwidth of the M2 Max memory bus. Maybe 3/4 of it though, which isn't terrible. :)
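Rough numbers behind that 3/4 guess, with all of the inputs being assumptions rather than vendor specs (400 Gb/s per adapter, roughly six usable x16 slots in one box):

```python
# All assumptions, not vendor specs: 400 Gb/s InfiniBand adapters,
# roughly six usable PCIe x16 slots in a single server.
adapter_gbyte_per_s = 400 / 8                     # 50 GB/s per adapter
usable_slots = 6
aggregate = usable_slots * adapter_gbyte_per_s    # 300 GB/s
m2_max = 400                                      # GB/s memory bandwidth

print(aggregate / m2_max)                         # 0.75, roughly 3/4 of the M2 Max bus
```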
InfiniBand is intended to be run in parallel. You could handily exceed the M2 Ultra's bandwidth if you ran 4 channels of IB, but rarely do people need more bandwidth than what PCIe offers. At least for AI.
> You could handily exceed the M2 Ultra bandwidth if you ran 4 channels of IB ...
Hmmm, that doesn't seem to be the case?
Those adapters are 400 Gb/s "total bandwidth" each. Not 400 Gb/s "each way". And you can't get 8 of those adapters into a server (with x32 links anyway).
> And you can't get 8 of those adapters into a server (with x32 links anyway).
Nvidia's DGX A100 system promises over 3.2 Tb/s across 10x Mellanox ConnectX-6 cards, and that's their old Epyc system. With the Grace Superchip you can get 900 GB/s of card interconnect and 1 TB/s of memory bandwidth. Either one could exceed the bandwidth of the M2 Max's memory.
The "each way" shtick is worth contesting, and ultimately comes down to how you use Nvidia's CUDA primitives. Agree to disagree on that; I think my point still stands, though. "Unified memory architecture" is an anachronism at that scale, literally rendered obsolete by a unified address space and a fast enough interconnect.
Bearing in mind the unit conversion (Tb/s vs GB/s), 3.2 Tb/s works out to 400 GB/s, which matches the M2 Max's memory bandwidth rather than exceeding it.
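Spelling the conversion out:

```python
dgx_tbit_per_s = 3.2                           # Tb/s, the figure quoted above
dgx_gbyte_per_s = dgx_tbit_per_s * 1000 / 8    # 400.0 GB/s
m2_max_gbyte_per_s = 400                       # GB/s

print(dgx_gbyte_per_s > m2_max_gbyte_per_s)    # False: it matches, it doesn't exceed
```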
---
Their newly announced Grace and/or Grace Hopper "Superchip" does seem interesting. Haven't (yet) seen how it's supposed to connect to other infrastructure though.
Their whitepaper talks about "OEM-defined I/O" but doesn't (in my skimming thus far) indicate what the upper bounds are.
May look further later on, but we're pretty far into the weeds already. ;)
---
Further along in the whitepaper, it says the "NVLink Switch System" in those chips communicates with the network at 900 GB/s "total bandwidth". If that's indeed the case, then yep, they're beating the M2 Max's memory bandwidth (400 GB/s).
That even beats the M2 Ultra's memory bandwidth (800 GB/s).
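Putting the numbers quoted in this thread side by side, and taking the whitepaper's 900 GB/s figure at face value:

```python
nvlink_switch = 900   # GB/s "total bandwidth", as read from the whitepaper above
m2_max = 400          # GB/s memory bandwidth
m2_ultra = 800        # GB/s memory bandwidth

print(nvlink_switch > m2_max)    # True
print(nvlink_switch > m2_ultra)  # True, assuming "total" isn't both directions summed
```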