But if this next release makes codegen improvements, those would all show up as differences. Separating the improvements from the regressions is difficult enough -- fuzzing doesn't really help here; it actually makes it much harder to determine what code should have been generated.
Generally, codegen issues are exposed by benchmarks or someone who is curious enough to examine and analyze the code generated from multiple compilers or multiple releases. The latter is much rarer. But as these happen, the bar keeps getting raised and we do grow the test suite.
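As a rough illustration of that kind of cross-release comparison, here is a minimal sketch that compiles the same source with two compiler binaries and diffs the emitted assembly; the compiler paths and flags are hypothetical placeholders, not any project's actual harness, and a human (or a benchmark) still has to judge whether a difference is an improvement or a regression:

```python
#!/usr/bin/env python3
"""Sketch of a differential codegen check between two compiler releases."""
import difflib
import subprocess
import sys

OLD_COMPILER = "/opt/toolchain-1.0/bin/cc"   # hypothetical paths
NEW_COMPILER = "/opt/toolchain-1.1/bin/cc"
FLAGS = ["-O2", "-S", "-o", "-"]             # emit assembly to stdout

def emit_asm(compiler: str, source: str) -> list[str]:
    """Return the assembly lines the given compiler generates for `source`."""
    result = subprocess.run(
        [compiler, *FLAGS, source],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines(keepends=True)

def main() -> int:
    source = sys.argv[1]
    diff = list(difflib.unified_diff(
        emit_asm(OLD_COMPILER, source),
        emit_asm(NEW_COMPILER, source),
        fromfile="old", tofile="new",
    ))
    if diff:
        # Any difference is only a signal for a human to analyze;
        # the diff alone cannot say improvement vs. regression.
        sys.stdout.writelines(diff)
        return 1
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```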