
> Ignoring floating point errors.

I think you mean non-associativity.

And you can’t ignore that.





Ignoring floating point errors, assuming a perfectly spherical cow, and taking air resistance as zero.

Imagine you are predicting the next token and two tokens are very close in probability. Kernel execution is not deterministic because of floating-point non-associativity, and the token that gets predicted affects every token later in the stream, so it is very consequential which one gets picked.

This isn't hypothetical: it happens all the time with LLMs. It isn't some improbable freak accident.
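A minimal Python sketch of the mechanism (toy numbers, nothing like a real LLM kernel): accumulate the same float32 contributions in two different orders and the sums typically differ in the last bits; if a rival token's logit happens to sit between the two, greedy argmax flips.

    import numpy as np

    rng = np.random.default_rng(0)
    contributions = rng.standard_normal(10_000).astype(np.float32)

    # Accumulate the same values in two different orders.
    logit_a = np.float32(0.0)
    for x in contributions:
        logit_a += x
    logit_b = np.float32(0.0)
    for x in contributions[::-1]:
        logit_b += x

    print(logit_a, logit_b, logit_a == logit_b)  # the sums usually differ in the last bits

    # If a rival token's logit falls between the two accumulations,
    # the greedy pick depends on which reduction order the kernel used.
    rival = (float(logit_a) + float(logit_b)) / 2.0
    print("order 1 picks:", "A" if float(logit_a) > rival else "B")
    print("order 2 picks:", "A" if float(logit_b) > rival else "B")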


Okay, yes, but would you really say that the main source of non-determinism in LLM usage stems from this? No, it's obviously the top-k sampling (sketch below).

I don't think my tech-lead was trying to suggest the floating-point error/non-associativity was the real source.
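For comparison, top-k sampling is non-deterministic by design: the final step is a random draw over the k most likely tokens. A rough Python sketch, illustrative only and not any particular library's implementation:

    import numpy as np

    def sample_top_k(logits, k, temperature=1.0, rng=None):
        rng = rng or np.random.default_rng()
        logits = np.asarray(logits, dtype=np.float64)
        top = np.argsort(logits)[-k:]              # indices of the k largest logits
        scaled = logits[top] / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()                       # softmax over the top-k only
        return int(rng.choice(top, p=probs))       # the stochastic part

    logits = [2.0, 1.9, 0.5, -1.0]
    print([sample_top_k(logits, k=2) for _ in range(10)])  # mixes tokens 0 and 1 run to run

Seed the RNG (or use greedy decoding) and this part becomes reproducible; the floating-point effect discussed upthread is what remains even then.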


> Would you really say that the main source of non-determinism in LLM usage stems from this

Yes, I would, because it causes exponential divergence (P(correct) = (1-e)^n, where e is the per-token chance of a flip and n is the number of tokens generated) and doesn't have a widely adopted solution. The major labs have very expensive researchers focused on this specific problem.
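Plugging rough numbers into that formula (the per-token flip probabilities here are made up for illustration, not measured):

    # P(correct) = (1 - e)^n for per-token flip probability e and length n.
    for e in (1e-4, 1e-3):
        for n in (100, 1_000, 10_000):
            print(f"e={e:g}, n={n}: (1 - e)^n = {(1 - e) ** n:.3g}")

Even a tiny per-token flip rate compounds to near-certain divergence over a long generation.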

There is a paper from Thinking Machines from September on batch-invariant kernels you should read; it's a good primer on this issue of non-determinism in LLMs. You might learn something from it!

Unfortunately the method has quite a lot of overhead, but promising research all the same.
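For anyone who hasn't read it, here's a toy Python sketch of the general idea (not the actual kernels from the paper): if the reduction strategy changes with how many requests are batched together, the summation order changes and the result can drift by a few ULPs; pinning the strategy keeps it bit-identical.

    import numpy as np

    x = np.random.default_rng(1).standard_normal(4096).astype(np.float32)

    def reduce_with_chunks(values, num_chunks):
        # Sum each chunk, then sum the partials; the grouping depends on num_chunks.
        partials = [chunk.sum() for chunk in np.array_split(values, num_chunks)]
        return np.float32(sum(partials))

    # Same data, different "batch-dependent" split sizes: usually slightly different sums.
    print(reduce_with_chunks(x, 1), reduce_with_chunks(x, 8), reduce_with_chunks(x, 64))
    # A batch-invariant kernel fixes the split (say, always 64 chunks) regardless of
    # batch size, so the reduction order, and therefore the bits, never change.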


Alright fair enough.

I don't think this is relevant to the main point, but it's definitely something I wasn't aware of. I would've thought it might affect something like the O(100)th token in some negligible way, but I'm glad to learn otherwise.



