We partially solve that by running the baseline and the changed version concurrently. That way most kinds of interference hit both runs at the same time, so what we measure is the relative difference between the two versions rather than an absolute number.
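A minimal sketch of that idea, assuming the two versions can be wrapped as plain callables. The names `run_paired` and `relative_delta` are made up for illustration, and threads are used only to keep the example short; a real setup would more likely start each version as its own process or on its own host. Each round times both versions side by side, and the result is the median of the per-round ratios, so a noisy round that slows both versions down mostly cancels out.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed(fn):
    """Return the wall-clock seconds one call to fn takes."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_paired(base_fn, candidate_fn, rounds=10):
    """Run both versions side by side each round and collect paired timings."""
    pairs = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(rounds):
            # Submit both before waiting, so they overlap in time.
            base_future = pool.submit(timed, base_fn)
            cand_future = pool.submit(timed, candidate_fn)
            pairs.append((base_future.result(), cand_future.result()))
    return pairs


def relative_delta(pairs):
    """Median of per-round (candidate / base - 1); positive means slower."""
    return statistics.median(cand / base - 1.0 for base, cand in pairs)
```

For example, `relative_delta(run_paired(old_impl, new_impl))` would return roughly `0.10` if the hypothetical `new_impl` is about 10% slower than `old_impl`, even when some rounds are disturbed by other load on the machine.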
Other than that, as always, run the benchmark on stable, dedicated hardware.
Yes, or run for a sufficiently long duration. Some things, such as allocation rate or significantly worse performance, are obvious and show up even in shorter runs.