It's not the LLM, but the hardware. GPU operations generally involve concurrency...

dragonwriter · on March 25, 2023

Specifically, as I understand it, the accumulated rounding error depends on the order in which floating point operations complete and intermediate aggregates are calculated. You can put wait conditions in so that the aggregation order is fixed even if the completion order varies, but that trades efficient use of the available compute cores for determinism.
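
To make the ordering point concrete, here is a minimal CPU-only sketch in plain Python, not actual GPU code. The `accumulate` helper and the three "partial results" are hypothetical, chosen so the rounding difference is large enough to see. It only illustrates that floating point addition is not associative, so folding the same values in a different order can give a different total.

```python
def accumulate(values, order):
    """Add values one at a time in the given index order (hypothetical helper)."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Hypothetical partial results produced by three parallel workers.
partials = [1e16, 1.0, -1e16]

# If the aggregate follows completion order, two runs can differ:
run_a = accumulate(partials, [0, 1, 2])  # (1e16 + 1.0) - 1e16, the 1.0 is rounded away
run_b = accumulate(partials, [0, 2, 1])  # (1e16 - 1e16) + 1.0, the 1.0 survives

# Fixing the aggregation order (wait for everything, then sum by index)
# gives the same answer no matter which worker finished first.
fixed_1 = accumulate(partials, sorted([2, 0, 1]))
fixed_2 = accumulate(partials, sorted([1, 2, 0]))

print(run_a, run_b, run_a == run_b)
print(fixed_1 == fixed_2)
```

The fixed-order version is reproducible only because it waits for every partial result before adding anything, which is the lost parallel efficiency mentioned above.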