Something that bothers me about the ML literature is that papers frequently present a large number of evaluation results, such as precision and AUC, but these are not qualified by error bars. Typically there is a table with different algorithms along one axis and different problems along the other, and the highest score for a given problem gets bolded.
I know that if you did the experiment over and over again with different splits you'd get slightly different scores, so I'd like to see some guidance on significance in terms of ① statistical significance, and ② significance at the business level. Would customers notice the difference? Would it lead to better decisions that move the needle on revenue or other business metrics?
This study is an example where a drastically more expensive algorithm seems to produce a practically insignificant improvement.
This is one of my default suggestions when I act as a reviewer: t-test with Bonferroni correction, please. ML, ironically, has absolutely horrible practices when it comes to distinguishing signal from noise (which is at least partially offset by the social pressure to share code, but still).
Bonferroni's correction on hold-out data is an excellent suggestion. To adapt it to time-series forecasting, one could perform temporal cross-validation with rolling windows and track the variance of performance over time (sketch below).
Unfortunately, the computational cost would explode if the ML method's optimization were performed naively. Precise measurement of statistical significance would crowd out every researcher outside Big Tech.
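As an illustration of what that temporal cross-validation could look like, here is a minimal sketch assuming scikit-learn's TimeSeriesSplit, a deliberately cheap base model, and entirely synthetic stand-in data; none of it comes from the work discussed above.

```python
# Minimal sketch (synthetic data): rolling-window temporal cross-validation
# that tracks how a forecasting metric varies across time windows.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                  # stand-in features
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=500)

scores = []
# max_train_size caps the training window so it rolls instead of expanding
for train_idx, test_idx in TimeSeriesSplit(n_splits=10, max_train_size=200).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    scores.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

# The spread across windows is the error bar to report alongside the mean.
print(f"MAE: {np.mean(scores):.3f} +/- {np.std(scores, ddof=1):.3f} over {len(scores)} windows")
```

The expensive part is, of course, the repeated refitting; with a deep model each window would be a full training run, which is exactly the cost concern above.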
What would be a better approach for machine learning folks to take? I ask out of sincere curiosity / a desire to learn, not as a rhetorical implication that I disagree.
I interpret your criticism to mean that ML folks tend to re-use a test set multiple times without worrying that doing so reduces the meaning of the results. If that's what you mean, then I do agree.
Informally, some researchers are aware of this and aim to use a separate validation set for all parameter tuning, touching the held-out test set as few times as possible, ideally just once. But it gets more complicated than that because, for example, different subsets of the data may not really be independent samples from the run-time distribution (example: data points are medical records of patients who lived or died, but from only three hospitals; the model can successfully learn the different success rates per hospital, but that would not generalize to other hospitals). In other words, there are a lot of subtle ways in which a held-out test set can produce overconfidence, and I always like to learn of better ways to resist that overconfidence.
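For the hospital example specifically, a grouped split is one standard way to resist that particular kind of overconfidence. A minimal sketch, assuming scikit-learn and entirely synthetic stand-in data:

```python
# Minimal sketch (synthetic data): keep every patient from a given hospital in
# either train or test, so scores reflect generalization to unseen hospitals
# rather than memorized per-hospital base rates.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))                # stand-in patient features
y = rng.integers(0, 2, size=600)             # lived / died
hospital = rng.integers(0, 3, size=600)      # only three hospitals

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y,
    groups=hospital,
    cv=GroupKFold(n_splits=3),               # each fold holds out one whole hospital
)
print(scores)                                # expect wide variation with only 3 groups
```

With only three groups the fold-to-fold spread will be large, which is itself useful information about how little the sample supports claims about unseen hospitals.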
Ben Recht actually has a line of work showing that we aren't overfitting the validation/test set for now (amazingly...). What I mean is, by chasing higher and higher SotA with more and more money and compute, whole fields can keep "improving" only for papers like https://arxiv.org/abs/2003.08505 or "Implementation matters in deep RL" to come out and show that what's going on is different from the literature consensus. The standards for showing improvement are low, while the standards for negative results are high (I'm a bit biased because I have a rejected paper trying to show empirically that some deep RL work didn't add marginal value, but I think the case still holds). Everyone involved is trying their best to do good science, but unless someone like me asks for it, there simply isn't any career value-add in doing exhaustive checking.
A concrete improvement would be to allow changing only one thing at a time per paper, and to measure the impact of changing that one thing. But then you couldn't realistically publish anything outside of megacorps. Another solution might be banning corporate papers, or at least making a separate track. From reviewing papers, it seems like single authors or small teams in academia have to compete with Google, where multiple teams might share aspects of a project (one doing the architecture, another a new training algorithm, etc.). That sharing won't be disclosed; you'll just read a paper where, for some reason, a novel architecture is introduced using a baseline that is a bit exotic but also appears in another paper that came out around the same time, plus a regulariser that was introduced just before that...
If you split the pools, you can put much higher standards on experiments from the corporate side, where the budget exists, while giving academia more points for novelty and creativity.
Thanks! I hadn't previously thought about the internal boost that large well-sponsored (aka corporate) teams get this way. It seems worth being aware of. I'm certainly in favor of encouraging & including researchers who work in smaller teams or with less funding.
Question: why do we care about the Bonferroni correction if the model being reviewed shows high performance on holdout/test samples?
I mean, it's nice to know that the p-values of coefficients on models you are submitting for publication are appropriately reported under the conservative approach Bonferroni applies, but I would think making it a _default_ is an inappropriate forcing function when holdout performance is the more relevant check. Data leakage would be a much, much larger concern IMHO. Variance of the performance metrics is also important.
Because the variance can be uniformly high, making it difficult to properly judge the improvement of one method over the baseline: did you actually improve, or did you just get a few lucky seeds? It's much harder to get a paper published debunking new "SotA" methods, so I default to asking authors to show a clear improvement over a good baseline. Simply looking at the performance is also not enough, because a task can look impressive but actually be quite simple (and vice versa), so these statistical measures make it easier to distinguish good models on hard tasks from bad models on easy tasks.
I should also note: 1) this is about testing whether the performance of one model is meaningfully different from another's, not about the models' coefficients; 2) I don't reject papers just because they lack this, or because they fail to achieve statistical significance. I just want it in the paper so the reader can use it to judge (and it also helps suss out cherry-picked results).
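For concreteness, a minimal sketch of the kind of check being asked for: a paired t-test over matched runs (same seeds/splits for every method), with a Bonferroni correction for the number of baseline comparisons. The scores and method names are invented for illustration; scipy is assumed.

```python
# Minimal sketch (made-up scores): paired t-test over matched runs, with a
# Bonferroni correction for comparing the new method against several baselines.
import numpy as np
from scipy.stats import ttest_rel

new_method = np.array([0.88, 0.87, 0.89, 0.86, 0.88])   # accuracy per seed/split
baselines = {
    "baseline_a": np.array([0.86, 0.85, 0.87, 0.86, 0.85]),
    "baseline_b": np.array([0.87, 0.86, 0.88, 0.85, 0.87]),
}

alpha = 0.05
corrected = alpha / len(baselines)    # Bonferroni: divide by the number of comparisons
for name, scores in baselines.items():
    _, p = ttest_rel(new_method, scores)
    print(f"{name}: mean diff {np.mean(new_method - scores):+.3f}, p={p:.3f}, "
          f"{'significant' if p < corrected else 'not significant'} at alpha={corrected:.3f}")
```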
You'd want to do some sort of test because it can help assess whether your method did better than the alternatives merely by chance. For example, can you really say Method A is better than Method B if A got 88% accuracy on the holdout set and B got 86%? Would that hold across all possible datasets?
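One common way to answer exactly that 88%-vs-86% question, when both models are scored on the same holdout set, is McNemar's test: it looks only at the examples the two models disagree on. A minimal sketch with fabricated predictions, assuming scipy:

```python
# Minimal sketch (fabricated predictions): McNemar's exact test, i.e. a binomial
# test on the examples where the two models disagree.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                         # made-up labels
pred_a = np.where(rng.random(1000) < 0.88, y, 1 - y)      # ~88% accurate
pred_b = np.where(rng.random(1000) < 0.86, y, 1 - y)      # ~86% accurate

a_only = int(np.sum((pred_a == y) & (pred_b != y)))       # A right, B wrong
b_only = int(np.sum((pred_b == y) & (pred_a != y)))       # B right, A wrong

# Under the null of no real difference, A wins half of the discordant pairs.
result = binomtest(a_only, a_only + b_only, p=0.5)
print(a_only, b_only, f"p={result.pvalue:.3f}")
```

The paired structure matters: two models can differ by two points of accuracy yet disagree on so few examples that the gap is well within noise, or vice versa.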
t-test with Bonferroni isn't necessarily the best test for all metrics either.
The test sample is just a small, arbitrary sample from a universe of similar data.
You (probably) don't care about test-set performance per se but instead want to be able to claim that one model works better _in general_ than another. For that, you need to bust out the tools of statistical inference.
The test set allows you to make this claim if it is representative of the universe of novel data the model will run on and there is no data leakage between test and train.
This isn't always true (especially for aggregate series over time), and of course statistical measures are used to report model performance. A Bonferroni correction just struck me as a weird place to apply this specifically, but after the other comments from yesterday I see where they were taking it.
Every researcher would love to include error bars, but it's a matter of limited computing resources at universities. Unless you're training on a tiny dataset like MNIST, these training runs get expensive. Also, unless you parallelize from the start (and risk wasting a lot of resources if something goes wrong), it takes much longer to get results.
Simple formulas only work because the models behind those polls are incredibly simple; adding even a bit more complexity requires much heavier tooling to compute these uncertainties (this is part of the reason probabilistic programming is so popular with people doing non-trivial polling work).
There are no simple approximations for even slightly complex models. Even nice computational tricks like the Laplace approximation stop working for models with large numbers of parameters (since you need to compute at least the diagonal of the Hessian).
A good overview of the situation is covered in Efron & Hastie's "Computer Age Statistical Inference".
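One cheap, general-purpose option in that spirit (the bootstrap is a central tool in that book) is to resample the test set itself, which gives an error bar for almost any metric from a single trained model. A minimal sketch with made-up predictions:

```python
# Minimal sketch (made-up predictions): bootstrap resampling of the test set to
# get a confidence interval for a metric from a single trained model.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)                          # made-up labels
y_pred = np.where(rng.random(2000) < 0.9, y_true, 1 - y_true)   # ~90% accurate

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))        # resample with replacement
    boot.append(np.mean(y_true[idx] == y_pred[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"accuracy {np.mean(y_true == y_pred):.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

This only captures test-set sampling noise, not training stochasticity, but it is often the least expensive error bar available.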
In the ML literature, the variance of accuracy measurements originates from different network parameter initializations.
Since deep-learning ensembles already use aggregate compute measured in hundreds of days, computing that variance would push the total into thousands of days.
In contrast, the statistical methods we report optimize convex objectives, so their optimal parameters are deterministic.
That being said, we like the idea of including cross-validation with different splits for future experiments.
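Since those base learners are cheap and deterministic, split-to-split error bars can come almost for free from repeated K-fold with different shuffles. A minimal sketch, assuming scikit-learn and a synthetic stand-in dataset:

```python
# Minimal sketch (synthetic data): repeated K-fold with different shuffles gives
# a distribution of scores, hence error bars, without touching the model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
cv = RepeatedKFold(n_splits=5, n_repeats=4, random_state=0)     # 20 scores in total
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f}")
```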
The test sets are large enough to render this moot, as the confidence intervals are almost certainly smaller than the precision typically reported, i.e. 0.1%.
I've worked on commercial systems with N <= 10,000 in the evaluation set, and the confidence interval there is probably not as tight as 0.1%. For instance, there is a lot of work on this data set (which we used to tune up a search engine),
and sometimes it is as bad as N=50 queries with judgements. I don't see papers that are part of TREC, or based on TREC data, dealing with sampling error in any systematic way.
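For a rough sense of scale, a back-of-the-envelope normal-approximation confidence interval for accuracy at the sample sizes mentioned here (the 90% accuracy operating point is an assumption for illustration):

```python
# Back-of-the-envelope (assumed 90% accuracy): normal-approximation half-width
# of a 95% confidence interval for a proportion at different test-set sizes.
import math

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width for a proportion p estimated from n samples."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000_000, 10_000, 50):
    print(f"n={n:>9}: +/-{ci_halfwidth(0.9, n):.4f}")
# n=1,000,000 -> +/-0.0006  (0.1% precision is defensible)
# n=   10,000 -> +/-0.0059  (more than half a point)
# n=       50 -> +/-0.0832  (over eight points)
```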
NIST's TREC workshop series uses Cyril Cleverdon's methodology (the "Cranfield paradigm") from the 1960s, and more could surely be done on the evaluation front:
- systematically addressing sampling error;
- using more than 50 queries;
- using more/all QRELs;
- doing full evaluation instead of system pooling;
- studying IR beyond just the English language (this has been picked up by CLEF and NTCIR in Europe and Japan, respectively);
- devising metrics that take energy efficiency into account;
- ...
At the same time, we have to be very grateful to NIST/TREC for executing an international (open) benchmark annually, which has moved the field forward a lot in the last 25 years.