TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning (arxiv.org)
166 points by mfiguiere on April 5, 2023 | 53 comments


some interesting tidbits:

> stretched our ML supercomputer scale .. to 4096 TPU v4 nodes

> The Google tradition is to write retrospective papers ... TPU v4s and A100s deployed in 2020 and both use 7nm technology

> The appropriate H100 match would be a successor to TPU v4 deployed in a similar time frame and technology (e.g., in 2023 and 4 nm).

> TPU v4 supercomputers [are] the workhorses of large language models (LLMs) like LaMDA, MUM, and PaLM. These features allowed the 540B parameter PaLM model to sustain a remarkable 57.8% of the peak hardware floating point performance over 50 days while training on TPU v4 supercomputers

> Google has deployed dozens of TPU v4 supercomputers for both internal use and for external use via Google Cloud

> Moreover, the large size of the TPU v4 supercomputer and its reliance on OCSes looks prescient given that the design began two years before the paper was published that has stoked the enthusiasm for LLMs


Brought to you by the future: "Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired."
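
For anyone wondering what "pick a topology" actually buys you: a torus is just a fixed neighbor pattern with wraparound, and the OCS patches the fibers into whatever permutation realizes it. A toy sketch of plain (untwisted) 3D torus addressing, with made-up dimensions:

  # Toy sketch (not Google's code): neighbor addressing on a plain 3D torus.
  # Each node (x, y, z) has six links with wraparound on every axis; the
  # "twisted" variant mentioned in the paper permutes the wraparound links.
  def torus_neighbors(node, dims):
      x, y, z = node
      X, Y, Z = dims
      return [
          ((x + 1) % X, y, z), ((x - 1) % X, y, z),
          (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
          (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
      ]

  # e.g. a corner node in a 4x4x4 slice wraps around to the opposite faces
  print(torus_neighbors((0, 0, 3), (4, 4, 4)))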


As the paper explains, optical circuit switches are not new in TPU v4 and not the main topic of this paper. Google was already using them for networking and published about it last year. For details, see https://arxiv.org/abs/2208.10041.


The future of the 90's. Optical matrix switches like this have been around for a long time. These aren't doing packet switching (which honestly would be the future if done optically); it's more of a layer 1 thing - the switch replaces you plugging and unplugging a cable. Bell Labs was building these kinds of switches back in the day.


Like other Google datacenter technologies, the innovation of their OCS is that it's a cheap hack that nobody would tolerate outside of Google. Their whole secret sauce is that everything just barely works.


"Any idiot can build a bridge that stands, but it takes an engineer to build a bridge that barely stands."


You think reconfigurable MEMS optical switches are cheap hacks? Um, no.


The whole paper is focused on cost efficiency. I think that's in keeping with Google's history, where for example they were the first to use ultra short range optics, because that is an OK way to save a dime.


There's a difference between cost efficiency and "cheap". In this case, we're talking about advanced photonic systems running in a data center to tie together world-class computing equipment; they didn't go for "cheap". The total investment is in the tens of billions of dollars.


They are "cheap hacks" compared to a packet switched network that runs at that speed. Also, this is another reason why Google can't sell TPUs to other people - nobody would put up with managing this sort of switching network for something they bought. The equivalent for NVidia is to use HDR/NDR Infiniband, and it allows you to run a multi-tenant cluster a lot more efficiently, at no practical loss of performance (due to the marginally higher latency).


I don't think you can directly compare a reconfigurable optical switch with a packet-switched network. A packet-switched network receives packets electrically, sends them through a processor, and then outputs the results on another port. The optical switch, by contrast, is a device that creates static paths between endpoints that can then be dynamically changed later.

It also has the advantage that when they move to multi-wavelength, its performance will greatly exceed that of electrical packet-switched networks.
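
A minimal model of the distinction (hypothetical class, not any vendor's API): the circuit switch is essentially a mutable one-to-one port map that never looks at the traffic.

  # Minimal sketch: an optical circuit switch as a mutable one-to-one mapping
  # of input ports to output ports. Light follows the current mapping; nothing
  # inside the switch inspects or buffers it.
  class CircuitSwitch:
      def __init__(self):
          self.cross_connects = {}  # input port -> output port

      def connect(self, in_port, out_port):
          self.cross_connects[in_port] = out_port  # "plug in the cable"

      def forward(self, in_port, signal):
          # purely a path lookup; the payload is never parsed
          return self.cross_connects[in_port], signal

  switch = CircuitSwitch()
  switch.connect("tpu_a_tx", "tpu_b_rx")
  print(switch.forward("tpu_a_tx", "gradient chunk"))
  switch.connect("tpu_a_tx", "tpu_c_rx")  # reconfigure: new topology, same traffic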


I disagree with you about comparing the two - I have some experience from trading firms which convinced me that they are not actually that different (for context, trading firms use a lot of layer 1 switching, and hacks that take place between layers 1 and 2).

If you think of packets like snakes going through a network, a layer 1 switching network creates tunnels for the snakes that you choose ahead of time (and can reconfigure whenever you want). A packet switched network creates tunnels that are chosen by the snakes. If you run a packet switched network, you can do everything you do with a layer 1 switched network by simply restricting which peers you send data to. On a hardware level, you need to convert from optics to electricity to do this, but you don't strictly need to do any buffering (the use of large switch buffers on Ethernet switches is because of Ethernet, not because of packet switching). Low-latency switches don't buffer unless they need to, and basically just read the header as it's coming in to choose a route for the packet.
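
To make the snake analogy concrete (toy code, hypothetical table entries): a packet switch decides per packet by reading only the header as it streams in, whereas the layer 1 tunnel was decided ahead of time.

  # Toy contrast: per-packet routing reads just the header as it arrives and
  # cuts the rest of the packet straight through, no buffering required.
  FORWARDING_TABLE = {"node_b": "port_1", "node_c": "port_2"}  # made-up entries

  def cut_through_forward(packet):
      header, payload = packet             # destination read from the first bytes
      out_port = FORWARDING_TABLE[header]  # route chosen per packet
      return out_port, payload             # the "snake" picks its own tunnel

  print(cut_through_forward(("node_c", b"rest of the packet...")))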

EDR Infiniband networks could certainly handle TPU v4 levels of bandwidth in a packet-switched fashion (at the time when TPU v4 was being built and deployed), particularly when the packets are doing something as tame as going around a torus. It also gives you the flexibility to do other things, though.

It certainly raises the complexity of the system, but I assume sometime around TPU v6 or v7, Google will rediscover packet switching for inter-TPU links.


Example history: https://patents.google.com/patent/US4580873

A 40-year-old patent. But I'm thrilled to see it applied at scale to reconfigurable supercomputers.


On a related note, Google also uses optical circuit switches in their datacenter network. See the paper from SIGCOMM '22 [1].

[1] Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software-Defined Networking. https://research.google/pubs/pub51587/


So something like an optical FPGA?


Multiplexer for optical paths, which allows you to change the point-to-point wiring. Much like physical circuit switching in a 50s telephone exchange, actually.


Not really; only the interconnects between nodes are optical.


Is there any way to purchase anything like a TPU? I guess the Cerebras Andromeda product is one, but I don't know if those are sold or leased. Any others?

https://www.cerebras.net/andromeda/


Cerebras systems are whole racks with high cost and cooling requirements; PCIe cards are more sensible for home gamers:

https://tenstorrent.com/


Can't wait for the Tenstorrent cards to come to the public.

Grayskull is supposed to be A100 performance for $1000, with some cool features (horizontally scalable by plugging them into each other over Ethernet, C++ programmable, sparse computation, etc.).


I really wonder how well it's going to perform given that it's a 600 TOPS / 16 GB DRAM / 200 Gbit/s setup. I was told that in Transformer training, memory bandwidth is key. Neither the 16 GB single-card memory capacity nor the bandwidth sounds terribly attractive. Most important of all, judging from their FAQ, their driver is likely to be proprietary and to require Internet access, which is still worrying.

Over the last couple of weeks, I've seriously considered purchasing an AMD Instinct MI50 (32 GB HBM2, ~1 TB/s) card, which goes for under $1000. I know it lacks Tensor Cores and can only offer 53 TOPS, which sounds silly compared to Grayskull's 600 TOPS. However, isn't it the case that for the most part those cores are idle, waiting for memory anyway? At any rate, Llama 30B won't fit into 16 GB on a single card, but it should fit comfortably into a 32 GB system. But perhaps most importantly, the amdgpu driver is actually open source and allows PCIe passthrough, unlike other vendors', so considering all of the above, it almost seems like a no-brainer for a trusted computing setup.
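
A rough back-of-the-envelope for the sizing claim (weights only; real runtimes add overhead for activations, context, etc.):

  # Back-of-the-envelope weight sizing for a "30B" model; figures approximate.
  params = 30e9
  for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
      gb = params * bytes_per_param / 1e9
      print(f"{name}: ~{gb:.0f} GB  fits in 16 GB: {gb < 16}  fits in 32 GB: {gb < 32}")
  # fp16 ~60 GB fits in neither; int8 ~30 GB only fits in 32 GB; 4-bit ~15 GB
  # squeezes into 16 GB with little room left over for anything else.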

I wonder if my logic is correct.

MI50: https://www.amd.com/en/products/professional-graphics/instin...


The idea is that the tenstorrent cards dynamically prune out codepaths that aren't used.

Transformers naturally have a big part of the network that is unused on any particular token flowing through the model. You could see that by how little RAM ended up being used in llama.cpp when they moved to mmaping the model.

So my understanding is that the tenstorrent cards are drastically more efficient, even on "dense" models like transformers because of the sparsity of any specific forward pass.
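
As a rough illustration of the kind of sparsity being talked about (plain numpy, made-up sizes, ReLU-style MLP; how Tenstorrent actually exploits it is their business):

  # Rough illustration: fraction of a ReLU MLP activation that is exactly zero
  # on one forward pass, i.e. multiplies a sparsity-aware chip could skip.
  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.standard_normal((1, 4096))            # one token's hidden state
  w = rng.standard_normal((4096, 16384)) / 64   # MLP up-projection
  h = np.maximum(x @ w, 0.0)                    # ReLU zeroes out roughly half
  print(f"zero activations: {np.mean(h == 0.0):.0%}")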

Also: I wouldn't bet on AMD accelerators for ML. They've disappointed every time. I would trust Jim Keller, whose every project in the last decade ended up being impactful.

I think the internet access is just to download the driver. It's not some sort of DRM setup where it needs to be always-online.


I understand this pruning is the difference between "silicon" and "software-assisted" TOPS in their documentation, but I still don't see how exactly that's going to address the fact that to fit a 30B parameter model into memory, you need at least that much memory. So basically, to go 30B and up, you would need at least two, if not three, cards. I couldn't find any details on how the interlink is going to be implemented, except that it's apparently done via Ethernet, limiting it to 100 Gbps, which again seems like a hard bandwidth limitation out of place next to the impressive compute.

Also: "online installer required" is not a good look, at least for me personally and the security model of my system. AMD cards, however, are affordable commodity hardware offering 32 GB of HBM2, and the driver is open source, so I wouldn't really discount them even considering their lacklustre performance. Cloud VMs do nicely for most use-cases, but as soon as hard security comes into focus, you just can't afford to have unauditable blobs from NVIDIA, or any other vendor for that matter.

I'm still looking forward to learning more about Grayskull, especially about its memory capacity/bandwidth limitations and what they mean for ever-growing large language models. Hopefully, they can open-source their driver stack.
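
Putting the interlink worry in numbers (nominal figures, assuming the 100 Gbps Ethernet limit is accurate):

  # Nominal link speeds for comparison; figures approximate.
  ethernet_100g = 100e9 / 8 / 1e9   # 12.5 GB/s card-to-card over Ethernet
  pcie4_x16     = 32.0              # ~32 GB/s nominal host link
  hbm2_mi50     = 1024.0            # ~1 TB/s on-card memory bandwidth (MI50)

  print(f"card-to-card Ethernet: {ethernet_100g:.1f} GB/s")
  print(f"PCIe 4.0 x16:          {pcie4_x16:.0f} GB/s")
  print(f"HBM2 (MI50):           {hbm2_mi50:.0f} GB/s")
  # Anything sharded across cards is roughly two orders of magnitude further
  # away than local memory, which is why the 16 GB per-card ceiling matters.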


Notice they require 32 GB of RAM and encourage 64 GB.

Presumably the architecture keeps the model in CPU RAM and shuffles it dynamically to the PCIe card network? I'm guessing here.

Whatever quirks come out of their hardware, I want to keep an eye on it.

I also think the current generation of models, which are built and trained to maximize GPU or TPU bandwidth, isn't the best basis for comparison; results could improve if someone architected models to play to Grayskull's advantages. Given PyTorch runs on it, I don't think it'd be too hard to do.


Thank you


> Transformers naturally have a big part of the network that is unused on any particular token flowing through the model. You could see that by how little RAM ended up being used in llama.cpp when they moved to mmaping the model.

Dense transformers (GPT-3 AFAIK is dense) don't.


They do though. A ton of the neurons end up not activating on any particular layer, leading to a huge waste as you're passing a zero down the layers.

I spoke with a PM at TT and he told me an important idea is that we're spending a lot of electricity multiplying things with zeros.


> You could see that by how little RAM ended up being used in llama.cpp when they moved to mmaping the model.

From what I've read, that was just an error in how memory consumption was measured after switching to the mmap version; it wasn't actually more memory efficient in the end.


Not exactly. It's that the model is loading less stuff out of the mmap'ed weights than you would expect.

The author of the mmap patch chimes in here:

https://news.ycombinator.com/item?id=35393615
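
You can see the mechanism in miniature (Linux-ish sketch, not llama.cpp code): mapping a file reserves address space, but resident memory only grows for the pages you actually touch, so tools reporting the wrong counter give very different numbers.

  # Minimal mmap demo: only touched pages become resident.
  import mmap, os, tempfile

  with tempfile.NamedTemporaryFile(delete=False) as f:
      f.truncate(256 * 1024 * 1024)   # a 256 MB file
      path = f.name

  with open(path, "rb") as f:
      mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
      _ = mm[0:4096]                  # touch a single page
      # virtual size jumped by 256 MB, but RSS grew by roughly one page
      mm.close()

  os.unlink(path)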


I'm seeing this passed around with people calling it an NVIDIA A100 killer. Any thoughts on whether it will have as much of an impact as stated?


To be honest, even if it fulfills half the expectations, it'll sell like hotcakes.


How useful are these for workstation-class scientific computing, vs. a GPU? Not ML necessarily, but I do a lot of matrix multiplication for example when solving systems of equations.


We don't know yet, because they're not available. But Jim Keller has been all-in there for 2 years, and they're supposed to be generally programmable.


Absolutely. On an hourly basis. Support is extra. https://cloud.google.com/tpu/pricing


Renting isn't purchasing.


From the page: "Andromeda, a 13.5 Million Core AI Supercomputer". Blown away by the number of cores (I considered myself lucky to have two 10,000+ core GPUs in my workstation), I then realized that the word "core" is singular in the sentence. Is it just a mistake or does it mean something else? (Genuine question, English is not my first language.)

EDIT: Ahhh a bit below on the page it is written "13.5 million AI-optimized cores" and there it's plural. So it was probably just a mistake.


It is not a mistake, that is how you phrase it in English when the noun ("core" in this case) is being used as part of a compound adjective. The convention is to keep the noun singular.

e.g.

"He commanded a ten thousand man army." (not men)

"Andromeda, a trillion star galaxy, is 2.5 million lightyears away." (not stars)

etc


Not a native English-speaking person, but shouldn’t it be "He commanded a ten-thousand-man army" and "Andromeda, a trillion-star galaxy, is 2.5 million lightyears away"?


I am a native English-speaking person, and although I think you are correct in that compound adjectives have historically been connected with hyphens, that seems to have fallen out of fashion somewhat.


I'm now remembering some WW1 posters from my GCSE History lessons that hyphenated "to-day" like this: https://www.pinterest.com/pin/483644447456350306/


That’s optional, but definitely more clear. The kind of thing a newspaper editor would insist on, but not necessarily seen outside of that.


To expand on the sibling comment, this is when pedantic people start talking about hyphenation. The clearer way to say this is "13.5-million-core AI supercomputer".


Yes, Google TPUs are sold to consumers under the Coral brand.

https://coral.ai/products/


These things come with a lot of limitations, like not being able to work with 4D and larger tensors, plus the 8 bit quantization. Getting models to run on them is a real pain.


This is certainly interesting, but doesn't seem quite like what I want. It only seems to offer "Edge TPUs" which...

> The Edge TPU ... supports only TensorFlow Lite models that are fully 8-bit quantized and then compiled specifically for the Edge TPU.
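
For context on what "fully 8-bit quantized" means in practice, the usual TensorFlow Lite flow looks roughly like this (sketch only; my_keras_model and calibration_batches are hypothetical placeholders):

  # Sketch of full-integer quantization for an Edge TPU style target.
  # my_keras_model and calibration_batches are hypothetical placeholders.
  import tensorflow as tf

  def representative_dataset():
      for batch in calibration_batches:        # a few hundred real input samples
          yield [tf.cast(batch, tf.float32)]

  converter = tf.lite.TFLiteConverter.from_keras_model(my_keras_model)
  converter.optimizations = [tf.lite.Optimize.DEFAULT]
  converter.representative_dataset = representative_dataset
  converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
  converter.inference_input_type = tf.int8     # inputs and outputs quantized too
  converter.inference_output_type = tf.int8

  tflite_model = converter.convert()           # then run through edgetpu_compiler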


It's a different chip


Clearly; they're tiny and for embedded applications.

They even wrote "The products offered by Google are unrelated to the products offered under the CORAL trademarks" in the page footer.


This is like saying Fisher Price toy cars are equivalent to F1 cars.


I hate the enormous waste of human ability, ingenuity and effort in the creation of proprietary technologies like this. You've made a chip? Offer it for everyone to use. Same goes for Amazon and Apple. It's not as though it's a chip that's only usable for Google-specific work.


Good news! They're being offered on Google Cloud: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...

So you don't have to be frustrated anymore.


It's because Google is a monopoly that it doesn't operate on normal economic incentives.

Its goal is only to keep the monopoly: appear benign, keep tech advances in house, share its spying network with the government so the government doesn't regulate them, win-win.


> keep tech advances in house

they put out way more R&D research than other big firms. GPT-4 would most likely not exist today without them.


If people are not allowed to monetize their innovations, there is no incentive to innovate. While this needs to have its limits, sharing it with everyone immediately upon creation is not an answer.



