> stretched our ML supercomputer scale ... to 4096 TPU v4 nodes
> The Google tradition is to write retrospective papers ... TPU v4s and A100s deployed in 2020 and both use 7nm technology
> The appropriate H100 match would be a successor to TPU v4 deployed in a similar time frame and technology (e.g., in 2023 and 4 nm).
> TPU v4 supercomputers [are] the workhorses of large language models (LLMs) like LaMDA, MUM, and PaLM. These features allowed the 540B parameter PaLM model to sustain a remarkable 57.8% of the peak hardware floating point performance over 50 days while training on TPU v4 supercomputers
> Google has deployed dozens of TPU v4 supercomputers for both internal use and for external use via Google Cloud
> Moreover, the large size of the TPU v4 supercomputer and its reliance on OCSes looks prescient given that the design began two years before the paper was published that has stoked the enthusiasm for LLMs
Brought to you by the future: "Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired."
As the paper explains, optical circuit switches are not new with TPU v4 and are not the main topic of this paper. Google was already using them for datacenter networking and published about it last year. For details, see https://arxiv.org/abs/2208.10041.
The future of the '90s. Optical matrix switches like this have been around for a long time. These aren't doing packet switching (which honestly would be the future if done optically); it's more of a layer 1 thing: the switch replaces you plugging and unplugging a cable. Bell Labs was building these kinds of switches back in the day.
Like other Google datacenter technologies, the innovation of their OCS is that it's a cheap hack that nobody would tolerate outside of Google. Their whole secret sauce is that everything just barely works.
The whole paper is focused on cost efficiency. I think that's in keeping with Google's history, where for example they were the first to use ultra short range optics, because that is an OK way to save a dime.
There's a difference between cost efficiency and "cheap". In this case, we're talking about advanced photonic systems running in a data center to tie together world-class computing equipment; they didn't go for "cheap". The total investment is in the tens of billions of dollars.
They are "cheap hacks" compared to a packet switched network that runs at that speed. Also, this is another reason why Google can't sell TPUs to other people - nobody would put up with managing this sort of switching network for something they bought. The equivalent for NVidia is to use HDR/NDR Infiniband, and it allows you to run a multi-tenant cluster a lot more efficiently, at no practical loss of performance (due to the marginally higher latency).
I don't think you can directly compare a reconfigurable optical switch with a packet switched network. A packet switched network receives packets electrically, sends them through a processor, and then outputs the results on another port. An OCS, by contrast, is a device that creates static paths between endpoints that can then be dynamically changed later.
It also has the advantage that when they move to multi-wavelength operation, its performance will greatly exceed that of electrical packet-switching networks.
I disagree with you about comparing the two - I have some experience from trading firms which convinced me that they are not actually that different (for context, trading firms use a lot of layer 1 switching, and hacks that take place between layers 1 and 2).
If you think of packets like snakes going through a network, a layer 1 switching network creates tunnels for the snakes that you choose ahead of time (and can reconfigure whenever you want). A packet switched network creates tunnels that are chosen by the snakes. If you run a packet switched network, you can do everything you do with a layer 1 switched network by simply restricting which peers you send data to. On a hardware level, you need to convert from optics to electricity to do this, but you don't strictly need to do any buffering (the use of large switch buffers on Ethernet switches is because of Ethernet, not because of packet switching). Low-latency switches don't buffer unless they need to, and basically just read the header as it's coming in to choose a route for the packet.
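To make the analogy concrete, here's a toy sketch in plain Python (not any real OCS or InfiniBand API; the class names and port numbers are made up for illustration): the layer 1 switch is a static ingress-to-egress map that the operator reconfigures out of band, while the packet switch picks the egress port per packet by reading the header.

    # Toy sketch only: illustrates layer 1 vs. packet switching, not any
    # real OCS or InfiniBand API.

    class Layer1Switch:
        """Static circuit: the operator wires ingress port -> egress port."""
        def __init__(self, port_map):
            self.port_map = port_map

        def forward(self, ingress_port, frame):
            # No header inspection, no routing decision: light in, light out.
            return self.port_map[ingress_port], frame

    class PacketSwitch:
        """Per-packet choice: the egress port is read from the packet header."""
        def __init__(self, routing_table):
            self.routing_table = routing_table

        def forward(self, ingress_port, packet):
            # The "snake" picks its own tunnel; the switch just reads the header.
            return self.routing_table[packet["dst"]], packet["payload"]

    # Reconfiguring the layer 1 network means rewriting port_map out of band;
    # the packet switch makes the equivalent decision per packet, at line rate.
    ocs = Layer1Switch({0: 3, 1: 2})
    ib = PacketSwitch({"node-a": 3, "node-b": 2})
    print(ocs.forward(0, b"bits"))
    print(ib.forward(0, {"dst": "node-a", "payload": b"bits"}))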
EDR Infiniband networks could certainly handle TPU v4 levels of bandwidth in a packet-switched fashion (at the time when TPU v4 was being built and deployed), particularly when the packets are doing something as tame as going around a torus. It also gives you the flexibility to do other things, though.
It certainly raises the complexity of the system, but I assume sometime around TPU v6 or v7, Google will rediscover packet switching for inter-TPU links.
Multiplexer for optical paths, which allows you to change the point-to-point wiring. Much like physical circuit switching in a 50s telephone exchange, actually.
Is there any way to purchase anything like a TPU? I guess the Cerebras Andromeda product is one, but I don't know if those are sold or leased. Any others?
Grayskull is supposed to be A100 performance for $1000, with some cool features (horizontally scalable by plugging them into each other over Ethernet, C++ programmable, sparse computation, etc.).
I really wonder how well it's going to perform given that it's a 600 TOPS / 16 GB DRAM / 200 Gbit/s setup. I was told that in Transformer training, memory bandwidth is key. However, neither the 16 GB single-card memory capacity nor the bandwidth sounds terribly attractive, but most important of all, judging from their FAQ, their driver is likely to be proprietary and would require Internet access, which is still worrying.
Over the last couple of weeks, I've seriously considered purchasing an AMD Instinct MI50 (32 GB HBM2, roughly 1 TB/s of memory bandwidth), which goes for under $1000. I know it lacks Tensor Cores and can only offer 53 TOPS, which sounds silly compared to Grayskull's 600 TOPS. However, isn't it the case that for the most part those cores are idle, waiting on memory anyway? At any rate, you're not going to be able to run, say, Llama 30B on a single card if it won't even fit into 16 GB, whereas it should fit comfortably in a 32 GB system. But perhaps most importantly, the amdgpu driver is actually open source and allows PCIe passthrough, unlike other vendors', so considering all of the above, it almost seems like a no-brainer for a trusted computing setup.
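Rough roofline check on the "cores idle, waiting on memory" intuition (a sketch only: it uses the TOPS figures from this thread, an approximate ~1 TB/s HBM2 figure for the MI50, and an assumed ~0.1 TB/s DRAM bandwidth for Grayskull; low-batch inference streams every weight once per token, so it does on the order of 1 op per byte):

    # Back-of-the-envelope roofline: is a card compute-bound or memory-bound
    # for a workload with a given arithmetic intensity? All specs below are
    # approximate or assumed, not measured.

    def bound(peak_ops_per_s, peak_bytes_per_s, ops_per_byte):
        balance = peak_ops_per_s / peak_bytes_per_s  # ops/byte needed to saturate compute
        verdict = "compute-bound" if ops_per_byte >= balance else "memory-bound"
        return verdict, balance

    # Low-batch transformer inference: ~2 ops (multiply + add) per 2-byte
    # fp16 weight streamed => roughly 1 op/byte. Large-batch training reuses
    # weights across the batch, which raises this number a lot.
    workload = 1.0

    cards = [
        ("MI50 (~53 TOPS, ~1 TB/s HBM2)",                53e12,  1.0e12),
        ("Grayskull (600 TOPS, assumed ~0.1 TB/s DRAM)", 600e12, 0.1e12),
    ]
    for name, ops, bw in cards:
        verdict, balance = bound(ops, bw, workload)
        print(f"{name}: needs ~{balance:,.0f} ops/byte to saturate -> {verdict}")

At low batch sizes both come out far below their balance point, i.e. the extra TOPS mostly sit idle unless the arithmetic intensity goes up.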
The idea is that the Tenstorrent cards dynamically prune out codepaths that aren't used.
Transformers naturally have a big part of the network that is unused on any particular token flowing through the model. You could see that by how little RAM ended up being used in llama.cpp when they moved to mmaping the model.
So my understanding is that the Tenstorrent cards are drastically more efficient, even on "dense" models like transformers, because of the sparsity of any specific forward pass (see the toy sketch after this comment).
Also: I wouldn't bet on AMD accelerators for ML. They've disappointed every time. I would trust Jim Keller, whose every project in the last decade ended up being impactful.
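To illustrate what "sparsity of a forward pass" buys you in principle (a toy NumPy sketch of the general idea only, not Tenstorrent's actual mechanism): if most activations in a layer are zero, the corresponding weight rows never need to be read or multiplied.

    import numpy as np

    # Toy illustration of runtime activation sparsity; numbers are made up.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(4096)
    x[np.abs(x) < 1.0] = 0.0              # pretend an activation step zeroed most entries
    w = rng.standard_normal((4096, 4096))

    active = np.nonzero(x)[0]
    dense_macs = x.size * w.shape[1]
    sparse_macs = active.size * w.shape[1]

    # Multiply only the weight rows whose activations are non-zero.
    y = x[active] @ w[active, :]
    assert np.allclose(y, x @ w)          # same result as the dense product

    print(f"kept {active.size}/{x.size} activations -> "
          f"{sparse_macs / dense_macs:.0%} of the dense MACs")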
I think the internet access is just to download the driver. It's not some sort of DRM setup where it needs to be always-online.
I understand this pruning is the difference between "silicon" and "software-assisted" TOPS in their documentation, but I still don't see how exactly it addresses the fact that to fit a 30B-parameter model into memory, you need at least that much memory. So basically, to go 30B and up, you would need at least two, if not three, cards, and I couldn't find any details on how the interconnect is going to be implemented, except that it's apparently done via Ethernet, limiting it to 100 Gbps, which again seems like a hard bandwidth limitation out of place next to the impressive compute.

Also: "online installer required" is not a good look, at least for me personally and for the security model of my system. AMD cards, however, are affordable commodity hardware offering 32 GB of HBM2, and the driver is open source, so I wouldn't really discount them even considering their lacklustre performance. Cloud VMs do nicely for most use cases, but as soon as hard security comes into focus, you just can't afford to have unauditable blobs from NVIDIA, or any other vendor for that matter.

I'm still looking forward to learning more about Grayskull, especially its memory capacity/bandwidth limitations and what they mean for ever-growing large language models. Hopefully they can open-source their driver stack.
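For concreteness, a rough weight-only capacity estimate (ignoring activations, KV cache, and runtime overhead; the bytes-per-parameter values are the usual quantization assumptions, nothing vendor-specific):

    import math

    # Rough weight-memory estimate; ignores activations, KV cache, and overhead.
    def weight_gb(n_params, bytes_per_param):
        return n_params * bytes_per_param / 1e9

    for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        gb = weight_gb(30e9, bytes_per_param)
        cards = math.ceil(gb / 16)  # minimum number of 16 GB cards for weights alone
        print(f"30B @ {precision}: ~{gb:.0f} GB of weights -> at least {cards} x 16 GB cards")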
Notice they require 32 GB of RAM and encourage 64 GB.
Presumably the architecture keeps the model in CPU RAM and shuffles it dynamically to the PCIe card network? I'm guessing here.
Whatever quirks come out of their hardware, I want to keep an eye on it.
I also think the comparison could be improved: the current generation of models is built and trained to maximize GPU or TPU bandwidth, so someone would need to architect a model specifically to play to Grayskull's advantages. Given that PyTorch runs on it, I don't think it'd be too hard to do.
> Transformers naturally have a big part of the network that is unused on any particular token flowing through the model. You could see that by how little RAM ended up being used in llama.cpp when they moved to mmaping the model.
From what I've read, that was just an error in how memory consumption was measured after switching to the mmap version, and it wasn't actually more memory efficient in the end.
How useful are these for workstation-class scientific computing, vs. a GPU? Not ML necessarily, but I do a lot of matrix multiplication for example when solving systems of equations.
From the page: "Andromeda, a 13.5 Million Core AI Supercomputer". Blown away by the number of cores (I considered myself lucky to have two 10,000+ core GPUs in my workstation), I then realized that the word "core" is singular in the sentence. Is it just a mistake, or does it mean something else? (Genuine question, English is not my first language.)
EDIT: Ahhh a bit below on the page it is written "13.5 million AI-optimized cores" and there it's plural. So it was probably just a mistake.
It is not a mistake, that is how you phrase it in English when the noun ("core" in this case) is being used as part of a compound adjective. The convention is to keep the noun singular.
e.g.
"He commanded a ten thousand man army." (not men)
"Andromeda, a trillion star galaxy, is 2.5 million lightyears away." (not stars)
Not a native English-speaking person, but shouldn’t it be "He commanded a ten-thousand-man army" and
"Andromeda, a trillion-star galaxy, is 2.5 million lightyears away"?
I am a native English-speaking person, and although I think you are correct in that compound adjectives have historically been connected with hyphens, that seems to have fallen out of fashion somewhat.
To expand on the sibling comment, this is when pedantic people start talking about hyphenation. The clearer way to say this is "13.5-million-core AI supercomputer".
These things come with a lot of limitations, like not being able to work with 4D and larger tensors, plus the 8 bit quantization. Getting models to run on them is a real pain.
I hate the enormous waste of human ability, ingenuity and effort in the creation of proprietary technologies like this. You've made a chip? Offer it for everyone to use. Same goes for Amazon and Apple. It's not as though it's a chip that's only usable for Google-specific work.
It's because Google is a monopoly that it doesn't operate on normal economic incentives.
Its goal is only to keep the monopoly: appear benign, keep tech advances in house, share its spying network with the government so the government doesn't regulate them. Win-win.
If people are not allowed to monetize their innovations, there is no incentive to innovate. While this needs to have its limits, sharing it to everyone immediately upon creation is not an answer.