While I would love testing reproducibility more systematically, it would not really have helped for the Python 3.10 regression: in this case we knew full well that it would break reproducibility even before we merged the change, but the performance advantage it unlocked seemed too big to ignore. Such trade-offs are luckily rare - I'm happy that with Python 3.11 we can now have both :)