Only slightly related, but how common are bugs in GPUs and/or CUDA? I'm currently on day 5 of trying to debug why the GPT-OSS implementation (not using PyTorch) I've made from scratch isn't working correctly. I have it somewhat working with some naive and slow methods, but now I'm writing a tensor-core implementation and have been stuck for 2-3 days on a small numerical difference I can't explain.

Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is...



They exist, but they're not that common (give or take the "expected" numerical deviations based on the order of summation and whatnot, which can both be nontrivial and propagate error further).
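To make the summation-order point concrete, here's a toy numpy sketch (nothing GPU-specific, just the same float32 values summed two ways):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000).astype(np.float32)

    seq = np.float32(0.0)
    for v in x:                      # plain left-to-right accumulation
        seq = np.float32(seq + v)

    pairwise = x.sum()               # numpy's pairwise summation
    exact = x.astype(np.float64).sum()

    print(seq, pairwise, exact)      # three slightly different answers for the "same" sum

The difference per reduction is tiny, but a transformer stacks thousands of reductions, so two perfectly correct implementations can still drift apart noticeably.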

Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.
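If it helps, here's a minimal sketch of what such a check can look like (plain numpy with a toy op and random inputs; the real forward/backward would be your CUDA kernels, and the check is done in float64 to keep precision noise out of the comparison):

    import numpy as np

    def forward(x):                  # toy op: sum of squares
        return np.sum(x * x)

    def backward(x):                 # hand-written analytic gradient
        return 2.0 * x

    def numerical_grad(f, x, eps=1e-6):
        g = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e.flat[i] = eps
            g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)   # central difference
        return g

    x = np.random.default_rng(0).standard_normal(16)        # float64 by default
    assert np.allclose(backward(x), numerical_grad(forward, x), atol=1e-6)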

Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but compatible. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing. Compare results to the simple implementation.
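Something like this, where the reference is allowed to be embarrassingly slow (numpy sketch, with matmul standing in for whatever op the optimized kernel computes):

    import numpy as np

    def matmul_reference(a, b):
        # No blocking, no vectorization, no cleverness: just the definition,
        # accumulated in float64 so it can serve as ground truth.
        m, k = a.shape
        _, n = b.shape
        out = np.zeros((m, n), dtype=np.float64)
        for i in range(m):
            for j in range(n):
                for p in range(k):
                    out[i, j] += float(a[i, p]) * float(b[p, j])
        return out

    rng = np.random.default_rng(1)
    a = rng.standard_normal((8, 8)).astype(np.float32)
    b = rng.standard_normal((8, 8)).astype(np.float32)

    fast = a @ b                     # stand-in for the optimized kernel's output
    ref = matmul_reference(a, b)
    print(np.abs(fast - ref).max())  # should be at float32 rounding level, not "wrong"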

It sounds like a lot of work, but writing an optimized kernel takes far longer than writing the numerical gradient check and the simple reference kernel, and since in numerical code it's basically impossible to identify the source of a bug without doing the equivalent of all those checks anyway, it only takes one bug in the whole project for the effort to pay off.


Thanks a lot for the pointers. I think I've taken a similar approach to what you suggest: lots of tiny (relative) tests for each step in the process, plus sanity checking between the naive implementation I first wrote, which works and does inference correctly, and the new kernel, which is a lot more performant but currently incorrect and produces incoherent outputs.

I'll try replacing bits with simplified versions though; that could at least help narrow down where the issue is.

If anyone has more debugging tips, I'd greatly appreciate them! Nothing is too small or "obvious", as I'm more or less about to lose my mind.


Beyond that, the tips get less general-purpose. The two big over-arching ideas are:

1. Numerical code is the canonical example of "functional" code. If you prove all the pieces correct, the result is also correct; if you prove one piece wrong, you know why your overall code is wrong. As such, it pays to focus more heavily than usual on proving each piece correct. Use automated techniques (like numerical gradient checking), and use randomized inputs -- it's easier than you'd think for your favorite special cases to come out right in both correct and incorrect algorithms. Your eyes will deceive you, so let the computer do your spot checks.

2. I lied in (1). Especially once GPUs are involved, you have to start worrying about variable lifetimes, use-after-free, double-free, uninitialized memory, accidental clobbering, and other ways in which an innocent "functional" computation can stomp on something else you're doing. Still start with all the checks from (1); if the parts are correct but the whole is broken, you're messing up global state somewhere. Tracking that down is more art than science, but one technique is adding a "poison" field, tracking deinit counts, and otherwise exposing metrics for those failure modes (see the sketch after this list). Panic/crash when you hit an invalid state, and once you know where the issue happens you can triage as normal, working backward from the broken state to figure out how you got there. With a solid memory-management strategy up front you won't see this sort of thing, but if it's not something you've thought about, I wouldn't rule it out.

3. Not really another point, just an extension of (2): corruption can show up in subtle ways (like a stack-copied pointer inside a paused async function closure that occasionally gets copied by your event loop). If global state is the issue, it's worth a full audit of the application.
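To make the poison/deinit-count idea in (2) concrete, here's the shape of it as a toy Python sketch (a hypothetical Buffer wrapper; in a CUDA/C++ codebase it's just a couple of fields plus asserts around alloc/free and every kernel launch):

    POISON = 0xDEADBEEF

    class Buffer:
        # Hypothetical wrapper around a device allocation.
        def __init__(self, nbytes):
            self.nbytes = nbytes
            self.deinit_count = 0        # should only ever reach 1
            self.poison = POISON         # changes if something stomps on the header

        def check(self):                 # call before every use / kernel launch
            assert self.deinit_count == 0, "use after free"
            assert self.poison == POISON, "header clobbered by a stray write"

        def free(self):
            self.deinit_count += 1
            assert self.deinit_count == 1, "double free"

    buf = Buffer(1024)
    buf.check()                          # fine
    buf.free()
    # buf.check()                        # would now fail loudly instead of
    #                                    # silently corrupting later results

The point isn't the wrapper itself; it's turning "mysteriously wrong numbers" into a crash at the first moment an invariant breaks.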


You may be running into Jensen (Huang)'s inequality:

E(loss).cuda() <= E(loss.cuda())


That would make sense, I suppose, if I were using two different GPUs for the same thing and getting two different outcomes. But instead I have two implementations (one naive, one using tensor cores) running on the same GPU and getting different outcomes where they should be the same.

But then again, this joke might be flying over my head.


Tensor cores use lower precision, so small numerical differences should be expected.
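A rough numpy illustration of the size of that effect (float16 as a stand-in for the reduced-precision tensor-core paths; the exact numbers depend on whether you're in TF32/FP16/BF16 and how accumulation is configured):

    import numpy as np

    rng = np.random.default_rng(2)
    a = rng.standard_normal((256, 256))
    b = rng.standard_normal((256, 256))

    full = a @ b                                             # float64 reference
    low = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float64)

    err = np.abs(low - full).max() / np.abs(full).max()      # scale-relative error
    print(err)   # roughly 1e-3 here: visible, but nothing like orders of magnitude

So differing low-order digits between the naive and tensor-core paths are normal; wildly different magnitudes point at a real bug rather than precision.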


Consumer-visible hardware bugs are extremely uncommon nowadays. There are roughly 10x as many people working in design verification as in actual hardware design.

I say "consumer-visible" because the bugs still exist and people who can catch them early get promoted quickly and paid a lot. It's very exciting work if you can get it, since you really have to understand the full GPU to break it.

Good luck!!


How big is the numerical difference? If it's small it might be within the precision of the operation itself.
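For reference, something along these lines is how I'd quantify it (numpy sketch; the threshold is a rough heuristic, not a hard rule):

    import numpy as np

    def compare(reference, candidate, name="output"):
        reference = np.asarray(reference, dtype=np.float64)
        candidate = np.asarray(candidate, dtype=np.float64)
        abs_err = np.abs(reference - candidate).max()
        rel_err = abs_err / (np.abs(reference).max() + 1e-12)
        print(f"{name}: max abs err {abs_err:.3e}, scale-relative {rel_err:.3e}")
        # Very roughly: ~1e-3 scale-relative is plausible fp16/bf16/TF32 noise;
        # anything approaching 1e0 or more means the kernel computes the wrong thing.
        return rel_err

Running that per layer against the naive path also tends to pinpoint the first op where things diverge.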


Orders of magnitude off (maybe "small numerical difference" was an understatement). My current hypothesis is that I'm doing scaling wrong somewhere, but I can't help sometimes sliding into "maybe there is something deeper wrong" territory in the evening after another day...



