Question: why do we care about the Bonferroni correction if the model being reviewed shows high performance on holdout/test samples?
I mean, it's nice to know that the p-values of coefficients on models you are submitting for publication are appropriately reported under the conservative approach Bonferroni applies, but I would think making it a _default_ is an inappropriate forcing function when the performance on holdout is more appropriate. Data leakage would be a much, much larger concern IMHO. Variance of the performance metrics is also important.
Because the variance can be uniformly high, making it difficult to judge the improvement of one method over a baseline: did you actually improve, or did you just get a few lucky seeds? It's much harder to get a paper published that debunks a new "SotA" method, so I default to asking for a clear improvement over a good baseline. Simply looking at the performance isn't enough either, because a task can look impressive but actually be quite simple (and vice versa), so these statistical measures make it easier to distinguish good models on hard tasks from bad models on easy tasks.
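To make the "lucky seeds" point concrete, here's a minimal sketch with hypothetical per-seed accuracies for a new method and a baseline (the numbers are made up, and Welch's t-test is just one reasonable choice):

```python
# Hypothetical per-seed test accuracies for a new method vs a baseline.
import numpy as np
from scipy import stats

new_method = np.array([0.88, 0.84, 0.91, 0.83, 0.86])   # per-seed accuracy, new method
baseline   = np.array([0.85, 0.86, 0.84, 0.87, 0.83])   # per-seed accuracy, baseline

print(f"means: {new_method.mean():.3f} vs {baseline.mean():.3f}")

# Welch's t-test on the per-seed scores: with this much seed-to-seed variance,
# the ~1.4 point gap in the means is well within seed noise.
t, p = stats.ttest_ind(new_method, baseline, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")
```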
I should also note that 1) this is about testing whether one model's performance is meaningfully different from another's, not about the models' coefficients, and 2) I don't reject papers just because they lack this, or because they fail to achieve statistical significance; I just want it in the paper so the reader can use it to judge (and it also helps suss out cherry-picked results).
You'd want to do some sort of test because it can help assess whether your method did better than the alternatives merely by chance. For example, can you really say Method A is better than B if A got 88% accuracy on the holdout set and B got 86%? Would that still hold on other datasets drawn from the same population?
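One way (of several) to answer that 88% vs 86% question is McNemar's test on the two models' per-example results over the same held-out set; the disagreement counts below are hypothetical, not from any real experiment:

```python
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 contingency table of paired outcomes on the same 500 test examples:
# rows = model A (correct, wrong), columns = model B (correct, wrong).
# A's accuracy = (410 + 30) / 500 = 88%, B's accuracy = (410 + 20) / 500 = 86%.
table = [[410, 30],
         [20,  40]]

# Only the discordant cells (30 vs 20) carry information about which model is better.
result = mcnemar(table, exact=True)
print(f"p-value = {result.pvalue:.3f}")
# With only 50 discordant examples, a 30/20 split is weak evidence that A beats B in general.
```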
A t-test with a Bonferroni correction isn't necessarily the best test for every metric, either.
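For what it's worth, the correction itself is tiny; here's a hedged sketch (with placeholder p-values, one per baseline comparison) of the classic adjustment when a new method is compared against several baselines at once:

```python
import numpy as np

raw_p = np.array([0.04, 0.012, 0.20])      # placeholder p-values, one per baseline comparison
k = len(raw_p)                             # number of simultaneous comparisons
bonferroni_p = np.minimum(raw_p * k, 1.0)  # classic Bonferroni: multiply by k, cap at 1

for raw, adj in zip(raw_p, bonferroni_p):
    print(f"raw p = {raw:.3f} -> Bonferroni-adjusted p = {adj:.3f}")
# At alpha = 0.05, only the 0.012 comparison survives the correction,
# which is exactly the conservatism being debated here.
```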
The test sample is just a small, arbitrary sample from a universe of similar data.
You (probably) don't care about test-set performance per se but instead want to be able to claim that one model works better _in general_ than another. For that, you need to bust out the tools of statistical inference.
The test set allows you to make this claim if it is representative of the universe of novel data the model will run on and there is no data leakage between test and train.
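As a rough illustration of the "small, arbitrary sample" point (not a prescription), a bootstrap over the test set shows how much a single headline accuracy number moves around; the per-example correctness and the 500-example test-set size below are simulated stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                            # assumed test-set size
correct = rng.random(n) < 0.88     # stand-in for a model's per-example hits

accs = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)   # resample test examples with replacement
    accs.append(correct[idx].mean())

lo, hi = np.percentile(accs, [2.5, 97.5])
print(f"point estimate: {correct.mean():.3f}, 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
# On 500 examples the interval is roughly +/- 3 points wide, so an 88% headline
# number says less than it appears to about performance on similar unseen data.
```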
This isn't always true (especially in aggregate series over time), and of course statistical measures are used to report model performance. A Bonferroni correction just struck me as a weird place to apply this specifically, but after the other comments from yesterday I saw where they were taking it.
What am I missing?