It’s been a very long time since I had any inside baseball, but I very much doubt that Hopper gear is in the hot inference path.
The precisions and mantissa/exponent ratios you want for inference are just different from what a mixed-precision, fault-tolerant, model- and data-parallel training pipeline wants.
Hopper is for training mega-huge attention decoders: TF32, bfloat16, hot paths to the SRAM end of the cache hierarchy with cache-coherency semantics you can reason about, parity gear for fault tolerance. It's just a different game.
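To make the mantissa/exponent trade-off concrete, here's a rough sketch of the bit layouts involved (format definitions are from IEEE 754 and NVIDIA's published specs; the "training vs. inference" labels are my reading, not tied to any particular chip):

```python
# (sign bits, exponent bits, mantissa bits) for common ML float formats.
# Bit widths per IEEE 754 / NVIDIA format definitions.
FORMATS = {
    "fp32":     (1, 8, 23),
    "tf32":     (1, 8, 10),  # NVIDIA TensorFloat-32: fp32 range, ~fp16 precision
    "bfloat16": (1, 8, 7),   # full fp32 exponent range, truncated mantissa
    "fp16":     (1, 5, 10),  # IEEE half: more mantissa, much less range
    "fp8_e4m3": (1, 4, 3),   # inference-leaning: precision over range
    "fp8_e5m2": (1, 5, 2),   # gradient-leaning: range over precision
}

# The training-side formats (tf32, bfloat16) keep the full 8-bit fp32
# exponent so gradients and optimizer state don't underflow mid-run;
# the inference-side formats trade exponent bits back for mantissa bits
# once the activations' scales are known and fixed.
for name, (sign, exp, mant) in FORMATS.items():
    print(f"{name:9s} exponent={exp} mantissa={mant}")
```

The point being: once you've baked that trade into the silicon's hot path, the same part can't be optimal for both regimes.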