Only slightly related, but how common are bugs in GPUs and/or CUDA? I'm currently on day 5 of trying to debug why the GPT-OSS implementation (not using PyTorch) I've made from scratch isn't working correctly. I have it somewhat working with some naive and slow methods, but now I'm writing a tensor-core implementation and have been stuck for 2-3 days on a small numerical difference I can't explain.

Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is...



They exist, but they're not that common (give or take the "expected" numerical deviations based on the order of summation and whatnot, which can both be nontrivial and propagate error further).
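To make the summation-order point concrete, here's a toy numpy sketch (nothing GPU-specific, just the same float32 values summed two ways):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000).astype(np.float32)

    seq = np.float32(0.0)
    for v in x:                      # plain left-to-right accumulation
        seq = np.float32(seq + v)

    pairwise = x.sum()               # numpy's pairwise summation
    exact = x.astype(np.float64).sum()

    print(seq, pairwise, exact)      # three slightly different answers for the "same" sum

The difference per reduction is tiny, but a transformer stacks thousands of reductions, so two perfectly correct implementations can still drift apart noticeably.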

Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.
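If it helps, here's a minimal sketch of what such a check can look like (plain numpy with a toy op and random inputs; the real forward/backward would be your CUDA kernels, and the check is done in float64 to keep precision noise out of the comparison):

    import numpy as np

    def forward(x):                  # toy op: sum of squares
        return np.sum(x * x)

    def backward(x):                 # hand-written analytic gradient
        return 2.0 * x

    def numerical_grad(f, x, eps=1e-6):
        g = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e.flat[i] = eps
            g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)   # central difference
        return g

    x = np.random.default_rng(0).standard_normal(16)        # float64 by default
    assert np.allclose(backward(x), numerical_grad(forward, x), atol=1e-6)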

Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but compatible. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing. Compare results to the simple implementation.
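Something like this, where the reference is allowed to be embarrassingly slow (numpy sketch, with matmul standing in for whatever op the optimized kernel computes):

    import numpy as np

    def matmul_reference(a, b):
        # No blocking, no vectorization, no cleverness: just the definition,
        # accumulated in float64 so it can serve as ground truth.
        m, k = a.shape
        _, n = b.shape
        out = np.zeros((m, n), dtype=np.float64)
        for i in range(m):
            for j in range(n):
                for p in range(k):
                    out[i, j] += float(a[i, p]) * float(b[p, j])
        return out

    rng = np.random.default_rng(1)
    a = rng.standard_normal((8, 8)).astype(np.float32)
    b = rng.standard_normal((8, 8)).astype(np.float32)

    fast = a @ b                     # stand-in for the optimized kernel's output
    ref = matmul_reference(a, b)
    print(np.abs(fast - ref).max())  # should be at float32 rounding level, not "wrong"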

It sounds like a lot of work, but writing an optimized kernel takes far longer than writing the numerical gradient check and the simple reference kernel, and since in numerical code it's basically impossible to identify the source of a bug without doing the equivalent of all those checks anyway, it only takes one bug in the whole project for the effort to pay off.


Thanks a lot for the pointers. I think I've taken a similar approach to what you suggest: lots of tiny (relative) tests for each step in the process, plus sanity checking between the naive implementation I first wrote, which works and does inference correctly, and the new kernel, which is a lot more performant but currently incorrect and produces incoherent outputs.

I'll try replacing bits with simplified versions though; that could at least help narrow down where the issue is.

If anyone has more debugging tips, I'd greatly appreciate them! Nothing is too small or "obvious", as I'm more or less about to lose my mind.


Beyond that, the tips get less general-purpose. The two big over-arching ideas are:

1. Numerical code is the canonical example of "functional" code. If you prove all the pieces correct, the result is also correct; if you prove one piece wrong, you know why your overall code is wrong. As such, it pays to focus more heavily than usual on proving each piece correct. Use automated techniques (like numerical gradient checking), and use randomized inputs -- it's easier than you'd think for your favorite special cases to come out right in both correct and incorrect algorithms. Your eyes will deceive you, so let the computer do your spot checks.

2. I lied in (1). Especially once GPUs are involved, you have to start worrying about variable lifetimes, use-after-free, double-free, uninitialized memory, accidental clobbering, and other ways in which an innocent "functional" computation can stomp on something else you're doing. Still start with all the checks from (1); if the parts are correct but the whole is broken, you're messing up global state somewhere. Tracking that down is more art than science, but one technique is adding a "poison" field, tracking deinit counts, and otherwise exposing metrics for those failure modes (see the sketch after this list). Panic/crash when you hit an invalid state, and once you know where the issue happens you can triage as normal, working backward from the broken state to figure out how you got there. With a solid memory-management strategy up front you won't see this sort of thing, but if it's not something you've thought about, I wouldn't rule it out.

3. Not really another point, just an extension of (2): corruption can show up in subtle ways (like a stack-copied pointer inside a paused async function closure that occasionally gets copied by your event loop). If global state is the issue, it's worth a full audit of the application.
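To make the poison/deinit-count idea in (2) concrete, here's the shape of it as a toy Python sketch (a hypothetical Buffer wrapper; in a CUDA/C++ codebase it's just a couple of fields plus asserts around alloc/free and every kernel launch):

    POISON = 0xDEADBEEF

    class Buffer:
        # Hypothetical wrapper around a device allocation.
        def __init__(self, nbytes):
            self.nbytes = nbytes
            self.deinit_count = 0        # should only ever reach 1
            self.poison = POISON         # changes if something stomps on the header

        def check(self):                 # call before every use / kernel launch
            assert self.deinit_count == 0, "use after free"
            assert self.poison == POISON, "header clobbered by a stray write"

        def free(self):
            self.deinit_count += 1
            assert self.deinit_count == 1, "double free"

    buf = Buffer(1024)
    buf.check()                          # fine
    buf.free()
    # buf.check()                        # would now fail loudly instead of
    #                                    # silently corrupting later results

The point isn't the wrapper itself; it's turning "mysteriously wrong numbers" into a crash at the first moment an invariant breaks.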


You may be running into Jensen (Huang)'s inequality:

E(loss).cuda() <= E(loss.cuda())


That would make sense, I suppose, if I were using two different GPUs for the same thing and getting two different outcomes. But instead I have two implementations (one naive, one using tensor cores) running on the same GPU and getting different outcomes where they should be the same.

But then again, this joke might be flying over my head.


Tensor cores use lower precision, so small numerical differences should be expected.
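A rough numpy illustration of the size of that effect (float16 as a stand-in for the reduced-precision tensor-core paths; the exact numbers depend on whether you're in TF32/FP16/BF16 and how accumulation is configured):

    import numpy as np

    rng = np.random.default_rng(2)
    a = rng.standard_normal((256, 256))
    b = rng.standard_normal((256, 256))

    full = a @ b                                             # float64 reference
    low = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float64)

    err = np.abs(low - full).max() / np.abs(full).max()      # scale-relative error
    print(err)   # roughly 1e-3 here: visible, but nothing like orders of magnitude

So differing low-order digits between the naive and tensor-core paths are normal; wildly different magnitudes point at a real bug rather than precision.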


Consumer-visible hardware bugs are extremely uncommon nowadays. There are roughly 10x as many people working in design verification as in actual hardware design.

I say "consumer-visible" because the bugs still exist and people who can catch them early get promoted quickly and paid a lot. It's very exciting work if you can get it, since you really have to understand the full GPU to break it.

Good luck!!


How big is the numerical difference? If it's small it might be within the precision of the operation itself.
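For reference, something along these lines is how I'd quantify it (numpy sketch; the threshold is a rough heuristic, not a hard rule):

    import numpy as np

    def compare(reference, candidate, name="output"):
        reference = np.asarray(reference, dtype=np.float64)
        candidate = np.asarray(candidate, dtype=np.float64)
        abs_err = np.abs(reference - candidate).max()
        rel_err = abs_err / (np.abs(reference).max() + 1e-12)
        print(f"{name}: max abs err {abs_err:.3e}, scale-relative {rel_err:.3e}")
        # Very roughly: ~1e-3 scale-relative is plausible fp16/bf16/TF32 noise;
        # anything approaching 1e0 or more means the kernel computes the wrong thing.
        return rel_err

Running that per layer against the naive path also tends to pinpoint the first op where things diverge.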


Orders of magnitude off (maybe "small numerical difference" was an understatement). My current hypothesis is that I'm doing scaling wrong somewhere, but I can't help sometimes sliding into "maybe there is something deeper wrong" territory in the evening after another day...



