That's not entirely true. Current-gen Nvidia hardware supports fp8, and the newly announced Blackwell adds fp4. Lots of existing specialized inference hardware uses int8, and some uses int4.
You're right that low-precision training still doesn't seem to work, presumably because you lose the smoothness required for SGD-type optimization.
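
To make the smoothness point concrete, here's a toy sketch (my own illustration with made-up numbers, not anything from a real training stack) of naive SGD on a 4-bit uniform grid: once the per-step update is smaller than half a quantization step, rounding simply erases it and the weight never moves.

    import numpy as np

    def quantize(x, bits=4):
        # Round onto a uniform 2**bits-level grid over [-1, 1] --
        # a crude stand-in for storing weights in int4/fp4.
        step = 2.0 / (2 ** bits - 1)
        return np.round(x / step) * step

    w = quantize(0.5)        # weight lives on the low-precision grid
    lr, grad = 1e-3, 0.2     # per-step update lr * grad = 2e-4

    for _ in range(1000):
        w = quantize(w - lr * grad)  # SGD step, then re-quantize

    print(w == quantize(0.5))  # True: the update (2e-4) is far below
                               # the grid step (~0.13), so every step
                               # rounds straight back to the same value

This is part of why mixed-precision training keeps an fp32 master copy of the weights: the tiny updates have to accumulate somewhere with enough resolution to register.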