
I have been hearing this about AMD/ATI drivers for decades. Every year, someone says that it is fixed, only for new evidence to come out that it is not. I have no reason to believe it is fixed given the history.

Here is evidence to the contrary: If ROCm actually was in good shape, tinygrad would use it instead of developing their own driver.



You're conflating two different things.

ROCm isn't part of AMD drivers; it's a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.

The part of ROCm you're interested in is HIP; HIP is the part that does legacy CUDA emulation. HIP will never be complete because Nvidia keeps adding new things, documents things incorrectly, and the "cool" stuff people do on Nvidia cards isn't CUDA anyway; it is out of scope for HIP to emulate PTX (since PTX is strongly tied to how historical Nvidia architectures worked, and would be entirely inappropriate for AMD architectures).
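Much of HIP's CUDA-compatibility surface is a systematic renaming of the CUDA runtime API; AMD ships hipify tools (hipify-perl, hipify-clang) that do this source-to-source translation. As a rough, illustrative sketch of the idea only (the real mapping tables are far larger and the real tools are much smarter), the rename step looks something like:

```python
import re

# A few well-known CUDA-to-HIP renames applied by hipify-style tools.
# Illustrative subset only; the real mapping tables are much larger.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
}

def hipify(source: str) -> str:
    """Naively translate CUDA runtime calls to their HIP equivalents."""
    # Replace longest names first so that cudaMemcpyHostToDevice is not
    # partially rewritten by the cudaMemcpy rule.
    for cuda_name in sorted(CUDA_TO_HIP, key=len, reverse=True):
        source = re.sub(rf"\b{cuda_name}\b", CUDA_TO_HIP[cuda_name], source)
    return source

print(hipify("cudaMalloc(&p, n); cudaMemcpy(p, h, n, cudaMemcpyHostToDevice);"))
```

Things with no HIP counterpart, such as inline PTX, are exactly what this kind of textual translation cannot handle, which is the scope limit described above.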

The whole thing with Tinygrad's "driver" isn't a driver at all; it's the infrastructure to handle card-to-card ccNUMA on PCI-E-based systems, which AMD does not support: if you want that, you buy into the big-boy systems whose GPUs communicate over Infinity Fabric (which is itself the HyperTransport protocol run over a PCI-E PHY instead of a HyperTransport PHY; plain PCI-E has no meaningful way to handle ccNUMA).

Extremely few customers, AMD's or not, want to share VRAM directly over PCI-E across GPUs since most PCI-E GPU customers are single GPU. Customers that have massive multi-GPU deployments have bought into the ecosystem of their preferred vendor (ie, Nvidia's Mellanox-powered fabrics, or AMD's wall-to-wall Infinity Fabric).

That said, AMD does want to support it if they can, and Tinygrad isn't interested in waiting for an engineer at AMD to add it, so they're pushing ahead and adding it themselves.

Also, part of Tinygrad's problem is that they want it available from ROCm/HIP instead of a standards-compliant modern API. ROCm/HIP still has not been ported to the modern shader compiler that the AMD driver uses (i.e., the one you use for OpenGL, Vulkan, and the Direct3D family of APIs), since it originally came from an unrelated engineering team that isn't part of the driver team.

The big push in AMD currently is to unify efforts so that ROCm/HIP is massively simplified and all the redundant parts are axed, so it is purely a SPIR-V code generator or similar. This would probably help projects like Tinygrad someday, but not today.


> ROCm isn't part of AMD drivers; it's a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.

AMD says otherwise:

> AMD ROCm™ is an open software stack including drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.

https://www.amd.com/en/products/software/rocm.html

The issues involving AMD hardware applied not only to the drivers, but also to the firmware below the drivers:

https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...

Tinygrad’s software looks like a userland driver:

https://github.com/tinygrad/tinygrad/blob/master/tinygrad/ru...

It loads various firmware blobs, manages part of the initialization process, manages memory, writes to registers, etcetera. These are all things a driver does.
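To make that concrete: the core of a userspace driver is mapping the device's registers (a PCI BAR) into the process and reading or writing them at fixed offsets. A minimal sketch of that pattern in Python, using an anonymous mapping as a stand-in for a real BAR so it runs anywhere; the register name and offset here are made up for illustration:

```python
import mmap
import struct

# In a real userspace driver you would mmap a PCI BAR, e.g. via the
# resource files sysfs exposes for a PCI device. Here an anonymous
# mapping stands in for device memory so the sketch is self-contained.
BAR_SIZE = 4096
bar = mmap.mmap(-1, BAR_SIZE)

def write_reg32(offset: int, value: int) -> None:
    """Write a 32-bit little-endian register in the mapped BAR."""
    struct.pack_into("<I", bar, offset, value)

def read_reg32(offset: int) -> int:
    """Read a 32-bit little-endian register from the mapped BAR."""
    return struct.unpack_from("<I", bar, offset)[0]

# Hypothetical register offset, for illustration only.
REG_DOORBELL = 0x10
write_reg32(REG_DOORBELL, 0xDEADBEEF)
print(hex(read_reg32(REG_DOORBELL)))  # 0xdeadbeef
```

A real driver additionally handles firmware loading, DMA, interrupts, and write ordering, but none of that requires kernel code once the kernel exposes the mapping, which is why "userland driver in Python" is not a contradiction in terms.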


AMD is extremely bad at communication. The driver already contains everything ROCm requires to talk to the GPU, and ROCm itself is only an SDK that contains runtimes, libraries, and compilers.

This part of TinyGrad is not a driver, though it does hijack the process to take over part of that task. You cannot boot the system with it, and it does not replace any part of the Mesa/DRI/DRM/KMS/etc. stack. It does reinitialize the hardware with different firmware, which might be why you think it is a driver.


I consider it to be a driver, or at least part of one. Userspace drivers exist. Graphics drivers were originally entirely in userspace, until portions of them were moved into the kernel for kernel mode setting and DRM. These days, graphics drivers have both kernel mode and user mode components. The shader compiler, for example, would be a user mode component.


I'm aware. One of the biggest things in fixing the Linux desktop was no longer needing drivers in the X server, and no longer needing the X server to run suid root.

What was linked is written in Python. Nothing in Python is ever going to be a userland driver.


There is no reason people cannot write userland drivers in Python.


https://community.amd.com/t5/ai/what-s-new-in-amd-rocm-6-4-b...

> ROCm 6.4 software introduces the Instinct GPU Driver, a modular driver architecture that separates the kernel driver from ROCm user space.


They were doing this before; the difference is that previously the version of ROCm you used was locked to a very narrow range of supported driver versions.

With this new arrangement, the backend API is formalized, which makes it easier to support a wider range of driver and ROCm version combinations.


We have all been hearing things for decades. Things are noticeably different now. Live in the present, not in the past.

Tinygrad isn’t a driver. It is a framework. It is being developed by George however he wants. If he wants to build something that gives him more direct control over things, fine. Others might write PTX instead of using higher-level abstractions.

Fact is that tinygrad runs not only on AMD, but also on Nvidia and others. You might want to reassess your beliefs, because you’re reading into things and coming to the wrong conclusions.


I read tinygrad’s website:

https://tinygrad.org/#tinygrad

Under driver quality for AMD, they say “developing” and point to their git repository. If AMD had fixed the issues, they would instead say the driver quality is great and get more sales.

They can still get sales even if they are honest about the state of AMD hardware, since they sell Nvidia hardware too, while your company would risk 0 sales if you say anything other than “everything is fine”, since your business is based on leasing AMD GPUs:

https://hotaisle.xyz/pricing/

Given your enormous conflict of interest, I will listen to what George Hotz and others are saying over what you say on this matter.


Exactly, it is not a driver.

Appreciate you diving more into my business. Yes, we are one of the few that publishes transparent pricing.

When we started, we got zero sales, for a long time. Nobody knew if these things performed or not. So we donated hardware and people like ChipsAndCheese started to benchmark and write blog posts.

We knew the hardware was good, but the software sucked. 16 or so months later, things have changed and sufficiently improved that now we are at capacity. My deep involvement in this business is exactly how I know what’s going on.

Yes, I have a business to run, but at the same time, I was willing to take the risk when no one else would and deploy this compute. To insinuate that I have some sort of conflict of interest is unfair, especially without knowing the full story.

At this juncture, I don’t know what point you’re trying to make. We agree the software sucked. Tinygrad now runs on mi300x. Whatever George’s motivations were a year ago are no longer true today.

If you feel ROCm sucks so badly, go the tinygrad route. Same if you don’t want to be tied to CUDA. Choice is a good thing. At the end of the day, though, this isn’t a reflection on the hardware at all.


I hope your business works out for you and I am willing to believe that AMD has improved somewhat, but I do not believe AMD has improved enough to be worth people’s time when Nvidia is an option. I have heard too many nightmares and it is going to take many people, including people who reported those nightmares, reporting improvements for me to think otherwise. It is not just George Hotz who reported issues. Eric Hartford has been quiet lately, but one of the last comments he made on his blog was not very inspiring:

> Know that you are in for rough waters. And even when you arrive - There are lots of optimizations tailored for nVidia GPUs so, even though the hardware may be just as strong spec-wise, in my experience so far, it still may take 2-3 times as long to train on equivalient AMD hardware. (though if you are a super hacker maybe you can fix it!)

https://erichartford.com/from-zero-to-fineturning-with-axolo...

There has been no follow-up “it works great now”.

That said, as for saying you have a conflict of interest, let us consider what a conflict of interest is:

https://en.wikipedia.org/wiki/Conflict_of_interest

> A conflict of interest (COI) is a situation in which a person or organization is involved in multiple interests, financial or otherwise, and serving one interest could involve working against another.

You run a company whose business is dependent entirely on leasing AMD GPUs. Here, you want to say that AMD’s hardware is useful for that purpose and no longer has the deluge of problems others reported last year. If it has not improved, saying such could materially negatively impact your business. This by definition is a conflict of interest.

That is quite a large conflict of interest, given that it involves your livelihood. You are incentivized to make things look better than they are, which affects your credibility when you say that things are fine after ample recent evidence that they have not been. In AMD’s case, poor driver quality is something they inherited from ATI, and the issues go back decades. While it is believable that AMD has improved their drivers, I find it difficult to believe they have improved enough that things are fine now, given that history. Viewing your words as less credible because of this might be unfair, but plenty of people before you whose livelihoods depended on things working have outright lied about the fitness of products. They even lied when people’s lives were at risk:

https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...

You could be correct in everything you say, but I have good reason to be skeptical until there has been information from others corroborating it. Blame all of the people who were in similar positions to yours that lied in the past for my skepticism. That said, I will keep my ears open for good news from others who use AMD hardware in this space, but I have low expectations given history.


Funny to see you quoting Eric; he’s a friend and was just running on one of our systems. AMD bought credits from us and donated compute time to him as part of the big internal changes they’re pushing. That kind of thing wouldn’t have happened a year ago. And from his experience, the software has come a long way. Stuff is moving so fast that you aren't even keeping up, but I am the one driving it forward.

https://x.com/cognitivecompai/status/1929260789208142049

https://news.ycombinator.com/item?id=44154174

And sigh, here we are again with the conflict of interest comments, as if I don’t get it. As I said, you don’t know the full story, so let me spell it out. I’m not doing this for money, status, or fame. I’m fortunate enough that I don’t need a job, this isn’t about livelihood or personal gain.

I’m doing this because I genuinely care about the future of this industry. I believe AI is as transformational as the early Internet. I’ve been online since 1991 (BBS before that), and I’ve seen how monopolies can strangle innovation. A world where one company controls all AI hardware and software is a terrible outcome. Imagine if Cisco made every router or Windows was the only OS. That’s where we’re headed with Nvidia, and I refuse to accept that.

Look at my history and who my investor is; this isn’t some VC land grab. We truly care about decentralizing and democratizing compute. Our priority is getting this HPC compute, previously locked up behind supercomputers, into the hands of as many developers as possible. My cofounder and I are lifelong nerds and developers, doing this because it matters.

Right now, only two companies are truly competing in this space. You’ve fairly pointed out failures of Cerebras and Groq. AMD is the only one with a real shot at breaking the monopoly. They’re behind, yes. But they were behind in CPUs too, and look where that went. If AMD continues on the path they’re on now, they can absolutely become a viable alternative. Make no mistake, humanity needs an alternative and I'll do my best to make that a reality.


Ask Eric to consider writing a new blog post discussing the state of LLM training on AMD hardware. I would be very interested in reading what he has to say.

AMD catching up in CPUs required that they become competent at hardware development. AMD catching up in the GPGPU space would require that they become competent at software development. They have a long history of incompetence when it comes to software development. Here are a number of things Nvidia has done right contrasted with what AMD has done wrong:

  * Nvidia aggressively hires talent. It is known for hiring freshly minted PhDs in areas relevant to them. I heard this firsthand from a CS professor whose specialty was in compilers who had many former students working for Nvidia. AMD is not known for aggressive hiring. Thus, they have fewer software engineers to put on tasks.

  * Nvidia has a unified driver, which reduces duplication of effort, so their software engineers can focus on improving things. AMD maintains separate drivers for each platform. AMD tried partial unification with Vulkan, but it took too long to develop, so the Linux community developed its own driver, and almost nobody uses AMD’s unified Vulkan driver on Linux. Instead of killing their effort and adopting the community driver for both Linux and Windows, they continued developing their own driver, which is now mostly only used on Windows.

  * Nvidia has a unified architecture, which further deduplicates work. AMD split their architecture into RDNA and CDNA, and thus must implement the same things for each where the two overlap. They realized their mistake and are making UDNA, but the damage is done and they are behind because of their RDNA+CDNA misadventures. It will not be until 2026 that UDNA fixes this.

  * Nvidia proactively uses static analysis tools, such as Coverity, on their driver. This became public when Nvidia open-sourced the kernel part of their Linux driver. I recall a Linux kernel developer who works on static analysis begging the amdgpu kernel driver developers to run static analysis tools on their driver, since many obvious issues caught by those tools were going unaddressed.

There are big differences between how Nvidia and AMD do engineering that make AMD’s chances of catching up slim. That is likely to remain the case until they start behaving more like Nvidia in how they do engineering. They are slowly moving in that direction, but so far it has been too little, too late.

By the way, AMD’s software development incompetence applies to the CPU side of their business too. They had numerous USB issues on the AM4 platform due to bugs in AGESA/UEFI. There were other glitches too, such as memory incompatibilities. End users generally had to put up with it, although AMD, in conjunction with some motherboard vendors, eventually managed to fix the issues. I had an AM4 machine that would not boot reliably with 128GB of RAM, and this persisted for years until I replaced the motherboard with one of the last AM4 motherboards made. Then there was this incompetence, which even affected AM5:

https://blog.desdelinux.net/en/Entrysign-a-vulnerability-aff...

AMD needs to change a great deal before they have any hope of competing with Nvidia GPUs in HPC. The only thing going for them in HPC for GPUs is relatively competent GPU hardware design. Everything else about their GPUs has been a disaster. I would not be surprised if Intel manages to become a major player in the GPU market before AMD manages to write good drivers. Intel, unlike AMD, has a history of competent software development. The major black mark on their record would be the initial Windows ARC drivers, but they were able to fix a remarkable number of issues in the time since their discrete GPU launch, and they have fairly good drivers on Windows now. Unlike AMD, they did not have a history of incompetence, so the idea that they fixed the vast majority of issues is not hard to believe. Intel will likely continue to have good drivers once they have made competitive hardware to pair with them, provided that they have not laid off their driver developers.

I have more hope in Intel than I have in AMD, and I say that despite knowing how bad Intel is at doing anything other than CPUs. No matter how bad Intel is at branching into new areas, AMD is even worse at software development. On the bright side, Intel’s GPU IP has a dual role, since it is needed for their CPUs’ iGPUs, so Intel must do the one thing they almost never do when branching into new areas, which is to iterate. The cost of R&D is thus mostly covered by their iGPUs, and they can continue iterating on their discrete graphics until it is a real contender in the market. I hope that they merge Gaudi into their GPU development effort, since iterating on ARC is the right way forward. I think Intel having an “AMD moment” in GPUs is less of a long shot than AMD’s recovery from the AM3 fiasco was, and less of a long shot than AMD becoming competent at driver development before Intel either becomes good at GPGPU or goes out of business.


Trying to find fault over UDNA is hilarious, they literally can't win with you.

My business model is to support viable alternatives. If someone else comes along and develops something that looks viable and there is customer demand for it, I'll deploy it.

You totally lost me at having more hope with Intel. I'm not seeing it. Gaudi 3 release was a nothing burger and is only recently deployed on IBM Cloud. Software is the critical component and if developers can't get access to the hardware, nobody is going to write software for it.


I fixed some autocorrect typos that were in my comment. I do not find fault with UDNA and I have no idea why you think I do. I find fault with the CDNA/RDNA split. UDNA is what AMD should have done in the first place.

As for Gaudi 3, I think it needs to be scrapped and used as an organ donor for ARC. In particular, the interconnect should be reused in ARC. That would be Intel’s best chance of becoming competitive with Nvidia.

As for AMD becoming competitive with Nvidia, their incompetence at software engineering makes me skeptical. They do not have enough people. The people they do have are divided among too many redundant projects. They do not follow good software engineering practices such as static analysis. They also work their people long hours (or so I have read), which of course results in more bugs. They need a complete culture change to have any chance of catching up to Nvidia on the software side of things.

As for Intel, they have a good software engineering culture. They just need to fix the hardware side of things, and I consider that much less of a stretch than AMD becoming good at software engineering. Their recent Battlematrix announcement is a step in the right direction. They just need to keep improving their GPUs and add an interconnect to fill the role of NVLink.



