It's not the LLM, but the hardware. GPU operations generally involve concurrency that makes them non-deterministic, unless you give up some speed to make them deterministic.
Specifically, as I understand it, floating-point addition isn't associative, so the accumulated rounding error depends on the order in which partial results finish and get combined into intermediate aggregates. You can put wait conditions in so that the aggregation order is fixed even if the completion order varies, but that leaves compute cores idle some of the time, so you're trading efficient use of the hardware for determinism.
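
Here's a rough CPU-only sketch of the effect in plain Python (no GPU involved). The `accumulate` helper and the sample values are made up for illustration, and each permutation just stands in for one possible completion order. I'm using an explicit loop rather than the built-in `sum()`, since newer Python versions may use compensated summation for floats and hide the effect.

```python
import itertools


def accumulate(values):
    """Add values left to right, i.e. one particular reduction order."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical partial results coming back from parallel workers.
values = [1e16, 1.0, -1e16, 1.0]

# Same numbers, different arrival order, different answer.
print(accumulate([1e16, 1.0, -1e16, 1.0]))  # 1.0
print(accumulate([1e16, -1e16, 1.0, 1.0]))  # 2.0

# Every permutation stands in for one possible completion order.
results = {accumulate(p) for p in itertools.permutations(values)}
print(sorted(results))

# Forcing a canonical order before reducing makes the result repeatable,
# whatever order the values arrived in (sorting plays the role of the wait).
fixed = {accumulate(sorted(p)) for p in itertools.permutations(values)}
print(fixed)
```

Note that the fixed order gives a repeatable answer, not necessarily the most accurate one. Determinism and accuracy are separate questions.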