Hacker News | georgehotz's comments

Full disclosure, we have a contract with AMD to get Llama 405B training on MI350X on MLPerf.

Things are turning around for AMD. If you have an AMD card, go to pytorch.org, click Linux+ROCm and install PyTorch. 3 years ago, this was hopeless. Today, most mainline things work. I ran nanochat on MI300X and it just worked. I think that's true about MI350X now too. The MI350X machine is stable.

They are clearly behind NVIDIA, nobody doubts that. And a lot of investment in software will be required to catch up: ecosystem, compiler, and driver. But 2 years ago they seemed hopeless, now they don't. Things take time. HipKittens is a great codebase to study to see where AMD's LLVM backend is still lacking; compare it to the CUDA Kittens.

For training, it's NVIDIA and Google in first. AMD in second. And nobody in third. Intel and Tenstorrent are not remotely close. Huawei examples segfaulted. Groq gave up selling chips. Cerebras isn't available anywhere. Trainium had a 5 day wait time to get one instance and I lost interest.


Speaking as the CEO of an AMD NeoCloud for the past 2 years, it is so nice to hear all this and to see the turnaround. It is what I bet my business on from the start, and I can concur with what George is saying 100%.

The out-of-box experience can be a bit rough around the edges on bleeding-edge stuff, but it isn't anywhere near as bad as it used to be. For example, a month ago nanochat wasn't working well and now it is. The important thing is that people now care enough to make it work.

At the end of the day, AI does need viable options. Having a monopoly on all AI hardware and software might be a good thing for shareholders, but it isn't a good thing for what is looking like a fundamental technology, akin to the internet.


That's interesting. I was specifically looking for AMD hardware offered by neoclouds, and they seem to be rare.

I like your bet though. The difference between NVDA and AMD has never really existed at the hardware level; for decades AMD has been on par. And software is software, it will catch up.

AMD will be a stock many people miss, because the opportunity has presented itself at the height of AI bubble talk, and this will leave many in the dust. A doubling or tripling of their market cap is pretty much a foregone conclusion.


You're right, it is a much smaller ecosystem, but I think that is partly intentional as a way to focus efforts and not feed into the bubble, which I feel is a smart move. These are the official partners [0]. I'm Hot Aisle.

George was very smart: $500k in when the stock was in the $90s. I saw it coming even earlier than him, but that's because I was already aware the hardware was good from my own experience.

[0] https://www.amd.com/en/products/accelerators/instinct/eval-r...


Will it catch up or will it forever chase nvidia's tail? I'm betting on the latter unless another AI winter happens. And contrary to anti-generative AI social media talking points, the literature suggests The Red Queen's race is continuing apace IMO.

Nvidia remains undefeated at responding to hardware threats with hardware diving catches to this day. What scenario prevents them from yet another one of their diving catches? I'm genuinely curious as to how one could pull that off. It's like challenging Google in search: even if you deliver better product and some have, the next thing you know Google is doing the same thing or better with deeper pockets.


> Nvidia remains undefeated at responding to hardware threats with hardware diving catches to this day. What scenario prevents them from yet another one of their diving catches?

The fact that they've made roughly the same hardware as AMD for the last 2 decades, and even today. There was no diving catch; AMD just ignored what their hardware was capable of and didn't reinforce OpenCL. There was literally no diving catch. For example, just in this thread alone, AMD paid someone to make this shit work on their hardware. Don't bet against what's coming.


Except no, AMD has 100% played follow-the-leader with technology like CUDA, NVLink, and tensor cores.

Even paying someone in academia to get s** to work on their hardware is yet another example of follow-the-leader.

What exactly do you think is coming? I think the biggest threat is one or more Chinese companies catching up on both hardware and ecosystem in the next half decade or so myself, mostly because of the state level support for making that so. But I absolutely don't expect an x86_64 moment for GPUs here given past results and the current bias against software in AMD's HW culture. Convince me otherwise.


How far is Tinygrad from being able to represent/search the kind of optimisations listed in the article? i.e.:

  1. data layouts to avoid local memory bank conflicts
  2. read patterns from global memory to optimize L2 cache reuse
  3. warp specialisation
How complex is it to add these into tinygrad?

1 and 2 are supported: 1 you need to specify, 2 will be found with BEAM. We are working on reimplementing HipKittens in tinygrad; all the stuff is there to do it. See the amd_uop_matmul example.
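
A minimal sketch of what driving that search looks like from the user side, assuming tinygrad's BEAM environment variable (illustrative only, not the amd_uop_matmul example itself):

  import os
  os.environ["BEAM"] = "2"  # enable beam search over kernel optimizations; set before importing tinygrad

  from tinygrad import Tensor

  a, b = Tensor.rand(4096, 4096), Tensor.rand(4096, 4096)
  (a @ b).realize()  # the kernel (memory layout, loop structure) is searched at realize time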

tinygrad doesn't support 3 yet; it's not needed on any AMD GPUs, and not needed on NVIDIA consumer. It wouldn't be hard to add, but it's important to figure out how it best fits with the existing abstractions. I think everything will eventually move to a more producer-consumer model.


Good luck with the AMD contract! I imagine HipKittens came at just the right time.

Does consumer hardware (non-MI) need proprietary kernel drivers for running rocm + pytorch?

No. But you might need a specific version of ROCm built for your GPU. These are built on https://github.com/ROCm/TheRock

Right now, AI support on AMD is official only for specific models, but they are working hard to turn this around and offer broader support. And they are making progress.


Vulkan compute is also getting some good press as a local LLM platform (at least on the Linux side); it will be interesting to see which crosses the line to "can ship production quality apps on this" first.

Nope! It works fine with a somewhat recent in-tree kernel. The AMD driver is actually open source, not just a wrapper around a big on-device blob like the NVIDIA one. tinygrad also has a driver that doesn't even need the kernel module, just mmapping the PCIe BAR into Python.
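
For a sense of what "mmapping the PCIe BAR into Python" means, here is a rough, hypothetical sketch (the device address is made up, it needs root, and it is nothing like the full tinygrad driver):

  import mmap, os

  # Hypothetical device; BAR0 of the GPU as exposed by the kernel via sysfs.
  bar_path = "/sys/bus/pci/devices/0000:03:00.0/resource0"
  fd = os.open(bar_path, os.O_RDWR | os.O_SYNC)
  size = os.fstat(fd).st_size

  # Map the BAR; register reads/writes become plain offsets into this buffer.
  bar = mmap.mmap(fd, size, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)
  print(bar[0:4].hex())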

> Cerebras isn't available anywhere.

That sounds like they're winning.


Author here. I agree with this comment, but if I wrote more like this my blog post would get less traction.

"LLM coding tools are search-based program synthesizers," in my mind this is what compilers are. I think most compilers do far too little search and opt for heuristics instead, often because they don't have an integrated runtime environment, but it's the same idea.

"Plenty of effective engineering tools are stochastic," sure, but while a SAT solver might use randomness, and that might change your time to solve, it doesn't change the correctness of the result. And something like a fuzzer is a test, and tests are always more of a best-effort thing. I haven't seen a fuzzer deployed in prod.

"Determinism comes from external specs and tests," my dream is a language where I can specify what it does instead of how it does it. Like the concept of Halide's schedule but more generic. The computer can spend its time figuring out the how. And I think this is the kind of tools AI will deliver. Maybe it'll be with LLMs, maybe it'll be something else, but the key is that you need a fairly rigorous spec and that spec itself is the programming. The spec can even be constraint based instead of needing to specify all behavior.
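
As a toy illustration of the constraint-based idea (using the z3-solver package purely as an example, nothing specific to Halide or the post): the "program" is just the constraints, and the computer figures out the how.

  from z3 import Ints, Solver, sat

  x, y = Ints("x y")
  s = Solver()
  s.add(x + y == 10, x - y == 4, x > 0, y > 0)  # the spec: what, not how

  if s.check() == sat:
      print(s.model())  # the solver finds the how, e.g. x = 7, y = 3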

I'm not at all against AI, and if you are using it at a level described in this post, like a tool, aware of its strengths and limitations, I think it can be a great addition to a workflow. I'm against the idea that it's a magical English compiler, which is what I see in public discourse.


I think the key insight I walked away with from this whole thread was:

A compiler takes source and maps it to some output. Regardless of the compiler detail, this is an atomic operation; you end up with source (unmodified) and an artifact.

These “agent workflows” are distinctly different.

The process of mapping a prompt to an output is the same, but these agent workflows are destructive: they modify the source.

Free rein over the entire code base: they modify the tests, the spec, the implementation.

It seems like this is a concept people are still struggling with: if your specification is poorly defined and is dynamically updated during the compilation process, the results are more than just non-deterministic.

Over time, the specification itself becomes non-deterministic.

That's why unsupervised agents go "off the rails": not because the specification can't be executed, but because over time the spec drifts.

That doesn't happen with compilers.


In your blog post: “Most people do not care to find the truth, they care about what pumps their bags”

in your HN comment: “I agree with this comment, but if I wrote more like this my blog post would get less traction.”

Seems like you also do not care about the truth.


This is bait. The comment and the blog post say mostly the same thing; the debate is around the subtle edges.

It's not a "compiler"; it's a "probabilistic code synthesizer guided by your constraints."

The latter is technically more specific and correct than the former, but it's 7 words instead of 1. And the word compiler is understood to encompass the latter, even if most compilers aren't that. They are both "a tool in a workflow"


He cares for the truth by making it accessible to more people.


You said it before I could. Amen.


That's not price discrimination. Having two versions of something and letting people choose which they want, like hardcover vs paperback, is totally fine and actually something that makes capitalism great. The rich usually end up covering more of the costs here because they are less price sensitive, like business vs economy on airplanes.

Price discrimination is when two people visit a site to buy a book, the algorithm computes an estimate of what they are barely willing to pay, and then shows the two of them different prices for the exact same book based on who they are.


Yours is an overly narrow version of price discrimination, in which the discrimination extends to the individual customer level. If that's what OP meant, he should use a less ambiguous description.


The comment you replied to is OP clarifying his description.


OK thanks.


Nice try. I worked at Facebook for 9 months and left (before vesting any shares) because I didn't agree with the mission, even back in 2012. I worked at Twitter for 5 weeks and left because I realized nothing was going to change (and the good food went away). I don't regret trying at either, but in revealed preferences, I've spent most of my life writing open source software, even if that's not what attracts most media about me.

I know you think everyone is just trying to "get their bag" and that's the framework you see this in. But I already had more money at 21 than I've spent to date, and not because I had a lot of money, but because I don't buy much stuff. I'm sorry you feel played, but don't project that onto me.


Haha, no worries, you're good. But come on, you have to admit it's at least a little funny that the famous Sony/iOS/Android hacker geohotz ended up... working for Facebook and then Elon Musk's X, lol.


We got the MI300X box on MLPerf too, and in every MLPerf from here on, general tinygrad improvements should bring down the times. We're still quite focused on AMD.

It's strange that people think I give up on things; I think they listen to the media too much. This is a 2+ year long project that I've worked on almost every day. https://geohot.github.io/blog/jekyll/update/2023/05/24/the-t...


AMD has legitimately been making great progress. They still have a long way to go, and I appreciate SemiAnalysis taking up the mantle of calling them out, but I ran:

  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

today on stream and it just worked. No external ROCm install, and just the amdgpu driver that's in Arch.
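
For anyone who wants to sanity-check that kind of install, something like this is enough (standard PyTorch API; ROCm builds expose the GPU through the torch.cuda interface):

  import torch

  print(torch.cuda.is_available())      # True if the ROCm build sees the GPU
  print(torch.cuda.get_device_name(0))  # e.g. the MI300X, or your consumer card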

We also have our own complete driver/runtime now in tinygrad; it's so much nicer to build on a foundation when you can blame yourself for the bugs.


Regarding SA, I’m all for holding AMD accountable, but let’s at least get the facts right, and maybe don’t come at it with a history of cheerleading for Nvidia.

Maybe SA might set their sights on you next?


> Unlike gold, the "finite supply" of BTC is socially constructed.

That's my whole argument for gold. The finite supply isn't socially constructed, and getting more requires building real infrastructure.


1. The supply is nonetheless constrained and immutably fixed; what is the relevance of whether it is by contract or law of nature?

2. What do you mean by "real" infrastructure? Crypto-mining rigs are no less real than actual mines.

My argument would be that gold's value is as much a social construct as that of crypto; value is just a function of supply and demand.

I'm guessing you might posit that there is a third input: utility. "Currency" is one use for gold, but gold can certainly serve many purposes, whereas crypto coins are strictly used as currency. That fact is presumably taken into account by a coin's price; nonetheless, it still has whatever value the market says it has at any time.


Except none of the crypto "currency" is used as currency at all, and never will be. It's used as a crypto asset. For speculation. Or, even worse, straight-up fraud. The only time I heard of crypto being used as currency is the infamous pizza a decade ago.


Computer programs and algorithms are the last thing that I would call "socially constructed".


The meaning behind them is, though. When car alarms were a big thing, one might wail, and the idea was that people would have a look to see if somebody was trying to steal your car, but in the end it was mostly false alarms. So the wailing got the reaction of "Not that shit again!"

Meanwhile, a TSA scanner's beep gets treated as "this person is bringing a problem."



Because it's slow duh


This sounds like prejudice. Have you benchmarked it?


Yes, I literally duplicated your approach for my driver stack last week, and surprise surprise, the FFI overhead into libc is too high.


FFI? This isn't how GPUs work... they are MMIO (mostly).

Those drivers are faster than anything else when used to run fixed command queues (which is what neural network runs are).


Specs on the website are updated; anywhere from 100-240V is fine.


That OCP 3.0 card has the same link bandwidth as the GPUs, so you can scale out without much loss of all-reduce bandwidth. In practice, for all models except the largest, the ~16GB/s all-reduce is totally fine. You just need to make sure you can all-reduce all weights in your training step time.

Say you are training a 3B parameter model in BF16. That's 6GB of weights; as long as your step time is >=500ms, you won't see a slowdown.
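
Rough numbers behind that claim (assuming a single pass over the weights at the quoted ~16GB/s; a real ring all-reduce moves somewhat more data):

  params = 3e9
  weights_gb = params * 2 / 1e9   # BF16 = 2 bytes/param -> 6.0 GB
  allreduce_s = weights_gb / 16   # ~16 GB/s all-reduce -> ~0.375 s

  # Any step time >= 500ms comfortably hides the ~375ms all-reduce.
  print(weights_gb, allreduce_s)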


> 3B parameter model

That's tiny. Can it train/fine-tune 70B models?

