
> I aggregate all the M4 Mac mini ports into an M4 cluster by mesh networking all its SerDes/PCIe with FPGAs into a very cheap low power supercomputer with exaflop performance. Cheaper than NVIDIA. I'm sure Apple does the same in their data centers.

That sounds super interesting, do you happen to have some further information on that? Is it just a bunch of FPGAs issuing DMA TLPs?



sounds (at least at a high level) similar to EXO[1]

[1] https://github.com/exo-explore/exo


Here's a video of testing Exo to run large LLMs on a cluster of M4 Macs [1] more cheaply than on a cluster of NVIDIA RTX 4090s.

[1] https://www.youtube.com/watch?v=GBR6pHZ68Ho


They show a test run of a 1B llama-3.2 model. Doesn't that fit on a single Mac? Distributing the workload in this case should be slower than running it on a single machine.

Still, this is interesting, and I'm confused why they aren't showcasing a test run of a larger model that actually requires distributing the workload across the cluster.
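For scale, here is a back-of-the-envelope sketch of the weight memory involved (assuming llama-3.2-1B has roughly 1.24B parameters; actual runtime memory also includes the KV cache and activations):

```python
# Hypothetical back-of-the-envelope: weight memory for LLMs at common
# precisions. Parameter counts are approximate; real runtime memory is
# larger (KV cache, activations, runtime overhead).

def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights, in GiB."""
    return n_params * bytes_per_param / 2**30

for name, n in [("llama-3.2-1B", 1.24e9), ("llama-3.1-405B", 405e9)]:
    for precision, nbytes in [("fp16", 2.0), ("4-bit", 0.5)]:
        print(f"{name} @ {precision}: {weight_memory_gib(n, nbytes):.1f} GiB")
```

At fp16 a ~1B-parameter model needs only ~2.3 GiB of weights, so it fits comfortably on any single M4 Mac mini; only models in the hundreds of billions of parameters genuinely need the cluster.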


It is not the first time supercomputers have been built from off-the-shelf Apple machines [1].

M4 supercomputers are cheaper, and they will also mean lower Capex and Opex than most datacenter hardware.

>do you happen to have some further information on that?

Yes: the information is in my highly detailed custom documentation for the programmers and buyers of 'my' Apple Silicon supercomputer, the Squeak and OMeta DSL programming languages, and the adaptive compiler. You can contact me for this highly technical report and several scientific papers (email in my profile).

Do you know of people who might buy a supercomputer based on better specifications? Or even just buyers who would go for 'the lowest Capex and lowest Opex supercomputer in 2025-2027'?

Because the problem with HPC is that almost all funders and managers buy supercomputers with a safe brand name (NVIDIA, AMD, Intel) at triple the cost, and seldom from a supercomputer researcher like myself. But some do, if they understand why. I have been designing, selling, programming and operating supercomputers since 1984 (I was 20 years old then); this M4 Apple Silicon cluster will be my ninth supercomputer.

I prefer to build them from the ground up with our own chip and wafer-scale-integration designs, but when an off-the-shelf chip is good enough I'll sell that instead. Price/performance/watt is what counts; ease of programming is a secondary consideration next to the performance you achieve. Alan Kay argues you should rewrite your software from scratch [2] and do your own hardware [3], so that is what I've done since I learned from him.

>Is it just a bunch of FPGAs issuing DMA TLPs?

No. The FPGAs are optional, for when you want to flatten the inter-core (= inter-SRAM-cache) network with switches or routers into a shorter-hop topology for the message passing, such as a Slim Fly diameter-two topology [4].
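As a sketch of why diameter-two topologies are attractive, here is the generic Moore bound for diameter 2 (not the specific Slim Fly construction, which builds graphs that come close to this bound):

```python
# Moore bound for diameter 2: a switch of radix k reaches k neighbors in
# one hop and at most k*(k-1) further switches in two hops, so a
# diameter-two network can contain at most 1 + k + k*(k-1) = k**2 + 1
# switches while keeping every message to at most two hops.

def moore_bound_diameter2(k: int) -> int:
    return 1 + k + k * (k - 1)

for k in (16, 32, 64):
    print(f"radix {k}: up to {moore_bound_diameter2(k)} switches, 2 hops max")
```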

DMA (Direct Memory Access) TLPs (Transaction Layer Packets) are one of the worst ways of doing inter-core and inter-SRAM communication, and on PCIe they carry a huge protocol overhead (around 30%) at triple the cost. Intel (and most other chip companies, like NVIDIA, Altera and AMD/Xilinx) can't design proper chips because they don't want to learn about software [2]. Apple Silicon is marginally better.
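The overhead figure can be illustrated with assumed PCIe Gen3 framing numbers (4-byte framing token, 12-byte 3-DW header, 4-byte LCRC, 128b/130b line coding); exact values vary by generation and configuration:

```python
# Rough PCIe TLP efficiency sketch. Per-TLP costs below are assumed Gen3
# values: 4B framing token, 12B header (3 DW), 4B LCRC, plus 128b/130b
# line coding. Real links also spend bandwidth on DLLPs, flow-control
# credits and ACK/NAK traffic, so this estimate is optimistic.

def tlp_efficiency(payload: int, header: int = 12, framing: int = 4,
                   lcrc: int = 4, line_code: float = 128 / 130) -> float:
    wire_bytes = payload + header + framing + lcrc
    return payload / wire_bytes * line_code

for p in (32, 64, 128, 256):
    print(f"{p:>3}B payload: {1 - tlp_efficiency(p):.0%} overhead")
```

Under these assumptions a 32-byte payload loses roughly 40% of the wire to protocol, and a 64-byte payload about 25%, so a ~30% overhead figure is plausible exactly in the small-message regime that fine-grained inter-core communication lives in.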

You should use pure message passing between any processes, preferably in a programming language and a VM that use pure message passing at the lowest level (Squeak, Erlang). Even better if you then map those software messages directly onto message-passing hardware, as in my custom chips [3].
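A minimal sketch of that style in Python, with multiprocessing queues standing in for the Squeak/Erlang model described above (each process owns its own state and interacts only through messages):

```python
# Minimal pure-message-passing sketch: a worker process that communicates
# only via an inbox and an outbox, Erlang-actor style.
from multiprocessing import Process, Queue

def worker(inbox, outbox):
    while True:
        msg = inbox.get()          # block until a message arrives
        if msg == "stop":
            break
        outbox.put(msg * msg)      # reply with a new message

def run_pipeline(values):
    inbox, outbox = Queue(), Queue()
    p = Process(target=worker, args=(inbox, outbox))
    p.start()
    for v in values:
        inbox.put(v)
    inbox.put("stop")
    results = [outbox.get() for _ in values]
    p.join()
    return results

if __name__ == "__main__":
    print(run_pipeline([2, 3, 4]))  # [4, 9, 16]
```

No memory is shared: the worker could just as well live on another machine, with the queues replaced by a network transport.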

The reason to reverse engineer the Apple Silicon instructions for the CPU, GPU and ANE is to be able to adapt my adaptive compiler to M4 chips, but also to repurpose PCIe for low-level message passing with much better throughput and latency than DMA TLPs.

To conclude: if you want the cheapest Capex and Opex M4 Mac mini supercomputer, you need to rewrite your supercomputing software in a high-level language and message-passing system like the parallel Squeak Smalltalk VM [3] with adaptive load-balancing compilation. C, C++, Swift, MPI or CUDA would result in sub-optimal performance and orders of magnitude more lines of code when optimal performance of parallel software is the goal.

[1] https://en.wikipedia.org/wiki/System_X_(supercomputer)

[2] https://www.youtube.com/watch?v=ubaX1Smg6pY

[3] https://vimeo.com/731037615

[4] https://www.youtube.com/watch?v=rLjMrIWHsxs


I forgot to add a link to a talk [5] by IBM Research on massively parallel Squeak Smalltalk and why it might be relevant for Apple Silicon reverse engineering and M4 clusters.

And a talk [6] on free-space optical interconnects without SerDes, which may some day show up on low-power Apple Silicon (around the M6-M8 models).

[5] https://www.youtube.com/watch?v=GBtqQwcJoN0

[6] https://www.youtube.com/watch?v=-dQoImLNgWs



