Full disclosure, we have a contract with AMD to get Llama 405B training on MI350X on MLPerf.
Things are turning around for AMD. If you have an AMD card, go to pytorch.org, click Linux+ROCm and install PyTorch. 3 years ago, this was hopeless. Today, most mainline things work. I ran nanochat on MI300X and it just worked. I think that's true about MI350X now too. The MI350X machine is stable.
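A quick way to sanity-check that install (a minimal sketch; ROCm builds of PyTorch expose the AMD GPU through the regular torch.cuda API, and torch.version.hip is set on those builds):

    # Minimal smoke test for a ROCm build of PyTorch on an AMD card.
    import torch

    print(torch.version.hip)          # HIP/ROCm version string on a ROCm build, None otherwise
    print(torch.cuda.is_available())  # True if the AMD GPU is visible
    x = torch.randn(1024, 1024, device="cuda")  # "cuda" maps to the AMD GPU under ROCm
    print((x @ x).sum().item())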
They are clearly behind NVIDIA, nobody doubts that. And a lot of investment into software will be required to catch up, ecosystem, compiler, and driver. But 2 years ago they seemed hopeless, now they don't. Things take time. HipKittens is a great codebase to study to see where AMD's LLVM backend is still lacking; compare it to the CUDA Kittens.
For training, it's NVIDIA and Google in first. AMD in second. And nobody in third. Intel and Tenstorrent are not remotely close. Huawei examples segfaulted. Groq gave up selling chips. Cerebras isn't available anywhere. Trainium had a 5 day wait time to get one instance and I lost interest.
As the CEO of an AMD NeoCloud for the past 2 years, it is so nice to hear all this and to see the turnaround. It is what I bet my business on from the start, and I can concur with what George is saying 100%.
The out-of-the-box experience can be a bit rough around the edges on bleeding-edge stuff, but it isn't anywhere near as bad as it used to be. For example, a month ago nanochat wasn't working well and now it is. The important thing is that people now care enough to make it work.
At the end of the day, AI does need viable options. Having a monopoly on all AI hardware and software might be a good thing for shareholders, but it isn't a good thing for what is looking like a fundamental technology, akin to the internet.
That’s interesting; I was specifically looking for AMD hardware being offered by neoclouds, and they seem to be rare.
I like your bet though. On the hardware level, the difference between NVDA and AMD hasn't really existed for decades. AMD has always been on par, and software is software; it will catch up.
AMD will be a stock many people will miss, because the opportunity has presented itself at the height of AI bubble talk, and this will leave many in the dust. A doubling or tripling of their market cap is pretty much a foregone conclusion.
You're right, it is a much smaller ecosystem, but I think that is partly intentional as a way to focus efforts and not feed into the bubble, which I feel is a smart move. These are the official partners [0]. I'm Hot Aisle.
George was very smart, $500k in the $90s. I saw it coming even earlier than he did, but that's because I was already aware the hardware was good from my own experience.
Will it catch up, or will it forever chase Nvidia's tail? I'm betting on the latter unless another AI winter happens. And contrary to anti-generative-AI social media talking points, the literature suggests the Red Queen's race is continuing apace, IMO.
Nvidia remains undefeated at responding to hardware threats with hardware diving catches to this day. What scenario prevents them from yet another one of their diving catches? I'm genuinely curious as to how one could pull that off. It's like challenging Google in search: even if you deliver a better product, and some have, the next thing you know Google is doing the same thing or better with deeper pockets.
Nvidia remains undefeated at responding to hardware threats with hardware diving catches to this day. What scenario prevents them from yet another one of their diving catches?
The fact that they have made roughly the same hardware as AMD for the last 2 decades, and even today. There was no diving catch; AMD just ignored what the hardware was capable of and didn't reinforce OpenCL. There was literally no diving catch. For example, just in this thread alone, AMD paid someone to make this shit work on their hardware. Don't bet against what's coming.
Except no, AMD 100% played follow-the-leader with technology like CUDA, NVLink, and tensor cores.
Even paying someone in academia to get s** to work on their hardware is yet another example of follow-the-leader.
What exactly do you think is coming? I think the biggest threat is one or more Chinese companies catching up on both hardware and ecosystem in the next half decade or so myself, mostly because of the state level support for making that so. But I absolutely don't expect an x86_64 moment for GPUs here given past results and the current bias against software in AMD's HW culture. Convince me otherwise.
1 and 2 are supported: 1 you need to specify, 2 will be found with BEAM. We are working on reimplementing HipKittens in tinygrad; all the stuff is there to do it. See the amd_uop_matmul example.
tinygrad doesn't support 3 yet; it's not needed on any AMD GPUs, and not needed on NVIDIA consumer cards. It wouldn't be hard to add, but it's important to figure out how it best fits with the existing abstractions. I think everything will eventually move to a more producer-consumer model.
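For reference, a minimal sketch of what letting BEAM find the kernel looks like (the matmul is a stand-in, not the amd_uop_matmul example itself, and the beam width of 2 is just an assumption):

    # Let tinygrad's BEAM search pick the kernel schedule instead of hand-tuning it.
    # BEAM is read from the environment, so set it before importing tinygrad.
    import os
    os.environ["BEAM"] = "2"  # beam width for the kernel search

    from tinygrad import Tensor

    a = Tensor.rand(4096, 4096)
    b = Tensor.rand(4096, 4096)
    c = (a @ b).realize()  # the generated matmul kernel is searched, not hand-written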
Right now, official AI support on AMD covers only specific models. But they are working hard to turn this around and offer broader support. And they are making progress.
Vulkan compute is also getting some good press as a local LLM platform (at least on the Linux side); it will be interesting to see which crosses the line to "can ship production-quality apps on this" first.
Nope! It works fine with a somewhat recent in-tree kernel. The AMD driver is actually open source, not just a wrapper around a big on-device blob like the NVIDIA one. tinygrad also has a driver that doesn't even need the kernel module, just mmapping the PCIe BAR into Python.
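Not tinygrad's actual driver code, just a sketch of the general technique it relies on: on Linux a PCIe BAR is exposed as a sysfs resource file that can be mapped straight into a Python process (hypothetical device address; needs root and a device that tolerates raw register access):

    # Map a GPU's PCIe BAR into this process and touch its register space directly.
    import mmap, os

    bar_path = "/sys/bus/pci/devices/0000:03:00.0/resource0"  # hypothetical PCI address
    fd = os.open(bar_path, os.O_RDWR | os.O_SYNC)
    size = os.fstat(fd).st_size
    bar = mmap.mmap(fd, size, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)
    print(bar[:4].hex())  # first four bytes of the BAR
    bar.close(); os.close(fd)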
Author here. I agree with this comment, but if I wrote more like this my blog post would get less traction.
"LLM coding tools are search-based program synthesizers," in my mind this is what compilers are. I think most compilers do far too little search and opt for heuristics instead, often because they don't have an integrated runtime environment, but it's the same idea.
"Plenty of effective engineering tools are stochastic," sure but while a SAT solver might use randomness and that might adjust your time to solve, it doesn't change the correctness of the result. And for something like a fuzzer, that's a test, which are always more of a best effort thing. I haven't seen a fuzzer deployed in prod.
"Determinism comes from external specs and tests," my dream is a language where I can specify what it does instead of how it does it. Like the concept of Halide's schedule but more generic. The computer can spend its time figuring out the how. And I think this is the kind of tools AI will deliver. Maybe it'll be with LLMs, maybe it'll be something else, but the key is that you need a fairly rigorous spec and that spec itself is the programming. The spec can even be constraint based instead of needing to specify all behavior.
I'm not at all against AI, and if you are using it at a level described in this post, like a tool, aware of its strengths and limitations, I think it can be a great addition to a workflow. I'm against the idea that it's a magical English compiler, which is what I see in public discourse.
I think the key insight I walked away with from this whole thread, for me, was:
A compiler takes source and maps it to some output. Regardless of the compiler detail, this is an atomic operation; you end up with source (unmodified) and an artifact.
These “agent workflows” are distinctly different.
The process of mapping a prompt to an output is the same, but these agent workflows are destructive: they modify the source.
Free rein over the entire code base; they modify the tests, the spec, the implementation.
It seems like this is a concept people are still struggling with: if your specification is poorly defined, and is dynamically updated during the compilation process, the results are more than just non-deterministic.
Over time, the specification itself becomes non-deterministic.
That's why unsupervised agents go “off the rails”: not because the specification can't be executed, but because over time the spec drifts.
This is bait. The comment and the blog post say mostly the same thing, the debate is around the subtle edges.
It's not a "compiler," it's a "probabilistic code synthesizer guided by your constraints"
The latter is technically more specific and correct than the former, but it's 7 words instead of 1. And the word "compiler" is understood to encompass the latter, even if most compilers aren't that. They are both "a tool in a workflow."
That's not price discrimination; hardcover vs paperback is just having two versions of something where people can choose which they want. That's totally fine, and actually something that makes capitalism great. The rich usually end up covering more of the costs here because they are less price sensitive, like business vs economy on airplanes.
Price discrimination is when two people visit a site to buy a book, the algorithm computes an estimate of what they are barely willing to pay, and then shows the two of them different prices for the exact same book based on who they are.
Yours is an overly narrow version of price discrimination, in which the discrimination is extended to the individual customer level. If that's what the OP meant, he should have used a less ambiguous description.
Nice try. I worked at Facebook for 9 months and left (before vesting any shares) because I didn't agree with the mission, even back in 2012. I worked at Twitter for 5 weeks and left because I realized nothing was going to change (and the good food went away). I don't regret trying at either, but in revealed preferences, I've spent most of my life writing open source software, even if that's not what attracts most media about me.
I know you think everyone is just trying to "get their bag" and that's the framework you see this in. But I already had more money at 21 than I've spent to date, and not cause I had a lot of money, but cause I don't buy much stuff. I'm sorry you feel played, but don't project that on me.
Haha, no worries, you're good. But come on, you have to admit it's at least a little funny that the famous Sony/iOS/Android hacker geohot ended up... working for Facebook and then Elon Musk's X, lol.
We got the MI300X box on MLPerf too, and with every MLPerf from here on, general tinygrad improvements should bring down the times. We're still quite focused on AMD.
AMD has legitimately been making great progress. They still have a long way to go, and I appreciate SemiAnalysis taking up the mantle of calling them out, but I ran:
Regarding SA, I’m all for holding AMD accountable, but let’s at least get the facts right, and maybe don’t come at it with a history of cheerleading for Nvidia.
1. The supply is nonetheless constrained and immutably fixed; what is the relevance of whether it is by contract or law of nature?
2. What do you mean by "real" infrastructure? Crypto-mining rigs are no less real than actual mines.
My argument would be that gold's value is as much a social construct as that of crypto; value is just a function of supply and demand.
I'm guessing you might post that there is a third input: utility. "Currency" is one use for gold, but it can certainly serve many other purposes, whereas crypto coins are strictly used as currency. That fact is presumably taken into account by a coin's price; nonetheless, it still has whatever value the market says it has at any time.
Except none of the crypto "currency" is used as currency at all, and never will be.
It's used as a crypto asset. For speculation.
Or even worse, straight up fraud.
The only time I heard of crypto being used as currency is the infamous pizza a decade ago.
The meaning behind them is, though. When car alarms were a big thing, one might wail, and the idea was that people would have a look to see if somebody was trying to steal your car, but in the end it was mostly false alarms. So the wailing got the reaction of "Not that shit again!"
Meanwhile, a TSA scanner's beep gets treated as "this person is bringing a problem."
That OCP 3.0 card has the same link bandwidth as the GPUs, so you can scale out without much loss of all-reduce bandwidth. In practice, for all models except the largest, the ~16GB/s all-reduce is totally fine. You just need to make sure you can all-reduce all weights in your training step time.
Say you are training a 3B parameter model in BF16. That's 6GB of weights; as long as your step time is >=500ms, you won't see a slowdown.
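The back-of-the-envelope arithmetic, assuming all-reduce time is roughly weight bytes divided by the effective link bandwidth (the >=500ms figure above presumably just leaves some headroom over the raw number):

    # Can the weight all-reduce hide inside one training step?
    params = 3e9           # 3B parameter model
    bytes_per_param = 2    # BF16
    link_bw = 16e9         # ~16 GB/s effective all-reduce bandwidth

    weight_bytes = params * bytes_per_param   # 6 GB of weights
    allreduce_s = weight_bytes / link_bw      # ~0.375 s to all-reduce them
    print(f"step time must be >= ~{allreduce_s:.3f}s to fully overlap the all-reduce")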