This is ludicrous, but this kind of thing has been going on in science, and in physics in particular, for a long time. Let's call the phenomenon maths side-blinders.
One experiment finds the mass of the W boson to be 80370 +/- 19 MeV, another 80434 +/- 9 MeV. Clearly the two results are incompatible; their ranges don't overlap. Of course, these are statistical ranges. But even at 95% confidence, the difference is many times the uncertainty, so it's not just that they're a bit off. IOW, we can be 100% (not 95%) sure that at least one, if not both, is incorrect.
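As a back-of-the-envelope check (Python, treating the two quoted errors as independent Gaussians, which glosses over the correlated systematics a real comparison would have to handle):

    import math

    m1, s1 = 80370.0, 19.0   # first result: central value and 1-sigma, in MeV
    m2, s2 = 80434.0, 9.0    # second result, in MeV

    # the quoted 1-sigma ranges don't even touch
    print(f"range 1: [{m1 - s1:.0f}, {m1 + s1:.0f}]  range 2: [{m2 - s2:.0f}, {m2 + s2:.0f}]")

    # tension in units of the combined uncertainty, assuming independence
    diff = abs(m2 - m1)
    combined = math.sqrt(s1**2 + s2**2)
    print(f"difference = {diff:.0f} MeV, combined sigma = {combined:.1f} MeV, "
          f"tension = {diff / combined:.1f} sigma")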
Yet they are boldly reported with those uncertainty ranges, even though those ranges clearly cannot both be correct. And then ATLAS doubles down by "applying more statistical analysis" to narrow its uncertainty range!
There should not be work on "improved stats analysis", but rather more work on finding where the systematic error between the two experiments lies. I truly don't see the point of retreading the same data set to change the value and uncertainty range when there is clearly something wrong with the data, the science, the experiment, or all of the above.
PS: what I'd like to see is the labs saying something along the lines of: given that results A and B are incompatible, statistically there is a (say) 99.99% chance that one or both experiments have a hidden flaw, or that there is a major flaw in the Standard Model.
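To be concrete about what such a statement could look like, here is a rough sketch that turns the tension computed above into a chance-fluctuation probability (again assuming independent Gaussian errors; the exact number depends on how the real systematics are treated, so the 99.99% above is only a placeholder):

    import math

    z = 64.0 / math.sqrt(19.0**2 + 9.0**2)      # tension in combined sigmas, ~3.0
    p_fluke = math.erfc(z / math.sqrt(2))       # two-sided Gaussian tail probability

    # strictly a p-value, not the probability that a flaw exists, but it's the
    # number behind the kind of loose statement suggested above
    print(f"P(discrepancy this large by chance) ~ {p_fluke:.4f}")
    print(f"so 'one or both are flawed, or the model is' at roughly {1 - p_fluke:.2%}")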
I am pretty sure that what you say in your PS is so obvious, at least to anyone who cares what the mass of the W boson is to a precision of ~2e-4, that it does not need saying. What does need saying is how bad the disagreement is. When you are at the point where this difference has arisen and cannot quickly be resolved, publishing all the papers seems to be the best way to give the most complete information available at that time about the problem.
Once the contradictory Fermilab results were made available, what should the ATLAS group have done? Well, one thing would be to double-check their own analysis, and it seems plausible that this is what led to the latest result. Using improved methods could just as easily have revealed a problem as ended up, as it did, supporting the original analysis.
This experiment is way out of my depth (and I don't know whether that's what's going on here, <shrug>), but I think what you're describing is an interesting phenomenon, which I'll try to explain.
I think it's important to keep in mind that in these cases certain models are being tested. In good science, you do experiments with certain models in mind, along with associated models of your apparatus and the machines themselves, and then you compare the results for consistency. You have to be very careful with any additions to your statistical analysis that weren't generated by those models, including the models of the machines (say, some kind of 'epistemic uncertainty' or 'procedural uncertainty' or something like that), applied to correct things after the fact, as I believe this can by itself potentially invalidate the base models.
For example, say you measure gravity at sea level with one apparatus and report gravity as 9.9 +/- 0.1. Then you get a second apparatus and measure 9.2 +/- 0.1 (i.e. something went wrong). The difference is significant. Then you realize there must be some error, so you add an 'experimental error parameter' that you can tune and that affects both measurements: you adjust it until the uncertainties become compatible (which is what consistency between experiments would demand), and arrive at, say, 9.9 +/- 0.6 and 9.2 +/- 0.6 for the first and second experiments. This new parameter clearly doesn't belong in the model, and there's no model for the parameter itself: there's no explanatory mechanism involved, only a new free parameter. Something you could honestly say is that we know there's experimental error in one or both of the experiments, or that the base models are significantly incorrect. But you can't just take the average of both results and say gravity is 9.55 +/- (..), because the existence of either experimental error or base-model error (at least to a few sigma of certainty) invalidates that procedure -- unless you just want a guess for some immediate practical application and the "experimental error" is acceptable.
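To make that concrete, here is a toy version (Python) of the tuning procedure I'm describing, using the usual chi-squared-based scale factor as the "experimental error parameter". The numbers are the hypothetical gravity ones above (this particular recipe lands on roughly +/- 0.5 rather than 0.6, since the exact inflation depends on how you define "compatible"), and nothing here is meant to describe any particular experiment's actual analysis:

    import math

    x = [9.9, 9.2]        # two measurements of g (m/s^2)
    s = [0.1, 0.1]        # quoted 1-sigma uncertainties

    # inverse-variance weighted average and its naive uncertainty
    w = [1.0 / si**2 for si in s]
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    sbar = 1.0 / math.sqrt(sum(w))

    # chi-squared of the two results around their average; >> 1 means incompatible
    chi2 = sum(wi * (xi - xbar)**2 for wi, xi in zip(w, x))
    scale = math.sqrt(chi2 / (len(x) - 1))   # factor needed to make chi2/ndf = 1

    print(f"average = {xbar:.2f}, naive sigma = {sbar:.2f}, chi2 = {chi2:.1f}")
    print("inflated per-measurement sigmas: "
          + ", ".join(f"{si * scale:.2f}" for si in s))
    print(f"inflated average sigma = {sbar * scale:.2f}")
    # The inflated errors make the numbers *look* consistent, but the scale
    # factor has no physical mechanism behind it -- exactly the objection above.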
Another common and well-known effect in experiments is knowing the result you want to get and trying different "adjustments", or redoing the analysis, until (subconsciously or not) it yields a result that agrees with previous observations. This has been reported by Feynman in his books. I believe some modern experiments shield against this by, among other means, not looking at the result of an analysis until they're sure the experiment and analysis are sound (so you can't fine-tune the analysis to reproduce known results).
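For what it's worth, one common way to implement that kind of shielding is hidden-offset blinding: the measured value is shifted by a secret offset while the analysis is being developed, and the offset is only removed once everything is frozen. A minimal sketch of the idea (not any specific experiment's protocol, and the offset range is made up):

    import random

    # generated once and stored out of the analysts' sight; revealed only
    # after the analysis procedure (cuts, fits, corrections) is frozen
    _blinding_offset = random.uniform(-50.0, 50.0)   # MeV, arbitrary range

    def blinded(measured_mass_mev: float) -> float:
        """What the analyst sees while developing and tuning the analysis."""
        return measured_mass_mev + _blinding_offset

    def unblind(blinded_mass_mev: float) -> float:
        """Run exactly once, after freezing the analysis, to reveal the result."""
        return blinded_mass_mev - _blinding_offset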