
> Ignoring floating point errors.

I think you mean non-associativity.

And you can’t ignore that.





Ignoring floating point errors, assuming a perfectly spherical cow, and taking air resistance as zero.

Imagine you are predicting the next token and two tokens are very close in probability. Kernel execution is not deterministic because of floating-point non-associativity, and the token that gets predicted affects every token later in the stream, so it is very consequential which one gets picked.

This isn't hypothetical: it happens all the time with LLMs. It isn't some improbable freak accident.
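A minimal Python sketch of the mechanism (toy numbers, nothing like a real LLM kernel): accumulate the same float32 contributions in two different orders and the sums typically differ in the last bits; if a rival token's logit happens to sit between the two, greedy argmax flips.

    import numpy as np

    rng = np.random.default_rng(0)
    contributions = rng.standard_normal(10_000).astype(np.float32)

    # Accumulate the same values in two different orders.
    logit_a = np.float32(0.0)
    for x in contributions:
        logit_a += x
    logit_b = np.float32(0.0)
    for x in contributions[::-1]:
        logit_b += x

    print(logit_a, logit_b, logit_a == logit_b)  # the sums usually differ in the last bits

    # If a rival token's logit falls between the two accumulations,
    # the greedy pick depends on which reduction order the kernel used.
    rival = (float(logit_a) + float(logit_b)) / 2.0
    print("order 1 picks:", "A" if float(logit_a) > rival else "B")
    print("order 2 picks:", "A" if float(logit_b) > rival else "B")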


Okay, yes, but would you really say that the main source of non-determinism in LLM usage stems from this? No, it's obviously the top-k sampling (sketch below).

I don't think my tech-lead was trying to suggest the floating-point error/non-associativity was the real source.
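For comparison, top-k sampling is non-deterministic by design: the final step is a random draw over the k most likely tokens. A rough Python sketch, illustrative only and not any particular library's implementation:

    import numpy as np

    def sample_top_k(logits, k, temperature=1.0, rng=None):
        rng = rng or np.random.default_rng()
        logits = np.asarray(logits, dtype=np.float64)
        top = np.argsort(logits)[-k:]              # indices of the k largest logits
        scaled = logits[top] / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()                       # softmax over the top-k only
        return int(rng.choice(top, p=probs))       # the stochastic part

    logits = [2.0, 1.9, 0.5, -1.0]
    print([sample_top_k(logits, k=2) for _ in range(10)])  # mixes tokens 0 and 1 run to run

Seed the RNG (or use greedy decoding) and this part becomes reproducible; the floating-point effect discussed upthread is what remains even then.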


> Would you really say that the main source of non-determinism in LLM usage stems from this

Yes, I would, because it causes exponential divergence (P(correct) = (1-e)^n, where e is the per-token chance of a flip and n is the number of tokens generated) and doesn't have a widely adopted solution. The major labs have very expensive researchers focused on this specific problem.
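Plugging rough numbers into that formula (the per-token flip probabilities here are made up for illustration, not measured):

    # P(correct) = (1 - e)^n for per-token flip probability e and length n.
    for e in (1e-4, 1e-3):
        for n in (100, 1_000, 10_000):
            print(f"e={e:g}, n={n}: (1 - e)^n = {(1 - e) ** n:.3g}")

Even a tiny per-token flip rate compounds to near-certain divergence over a long generation.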

There is a paper from Thinking Machines from September on batch-invariant kernels you should read; it's a good primer on this issue of non-determinism in LLMs. You might learn something from it!

Unfortunately the method has quite a lot of overhead, but promising research all the same.
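For anyone who hasn't read it, here's a toy Python sketch of the general idea (not the actual kernels from the paper): if the reduction strategy changes with how many requests are batched together, the summation order changes and the result can drift by a few ULPs; pinning the strategy keeps it bit-identical.

    import numpy as np

    x = np.random.default_rng(1).standard_normal(4096).astype(np.float32)

    def reduce_with_chunks(values, num_chunks):
        # Sum each chunk, then sum the partials; the grouping depends on num_chunks.
        partials = [chunk.sum() for chunk in np.array_split(values, num_chunks)]
        return np.float32(sum(partials))

    # Same data, different "batch-dependent" split sizes: usually slightly different sums.
    print(reduce_with_chunks(x, 1), reduce_with_chunks(x, 8), reduce_with_chunks(x, 64))
    # A batch-invariant kernel fixes the split (say, always 64 chunks) regardless of
    # batch size, so the reduction order, and therefore the bits, never change.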


Alright fair enough.

I don't think this is relevant to the main point, but it's definitely something I wasn't aware of. I would've thought it might affect something like the O(100)th token in some negligible way, but I'm glad to learn otherwise.



