Uncertainty quantification is a neglected aspect of data science and especially ...

gh02t · on Dec 4, 2023

You can demand error bars but they aren't always possible or meaningful. You can more or less "fudge" some sort of normally distributed IID error estimate onto any method, but that doesn't necessarily mean anything. Generating error bars (or generally error distributions) that actually describe the common sense idea of uncertainty can be quite theoretically and computationally demanding for a general nonlinear model even in the ideal cases. There are some good practical methods backed by theory like Monte Carlo Dropout, but the error bars generated for that aren't necessarily always the error you want either (MC DO estimates the uncertainty due to model weights but not say, due to poor training data). I'm a huge advocate for methods that natively incorporate uncertainty, but there are lots of model types that empirically produce very useful results but where it's not obvious how to produce/interpret useful estimates of uncertainty in any sort of efficient manner.

Another, separate, issue that is often neglected is the idea of calibrated model outputs, but that's its own rabbit hole.

marcyb5st · on Dec 5, 2023

Well, in reality tools like Tensorflow probability can help you model both aleatoric and epistemic uncertainty with probabilistic layers that have learnable priors and posteriors. The issue there is that for the average ML person might not have the required math skills to model the problem in these terms.

For instance, if you look at https://blog.tensorflow.org/2019/03/regression-with-probabil... until the case 4 it's easy to follow and digest, but if you look at the _Tabula rasa_ section I am pretty sure that such content isn't understandable by many. Where you get stuck because the ideas become too complex depends on your math skills.

gh02t · on Dec 5, 2023

Yeah I've used those methods and am a fan, though they are far from perfect. For one thing they're somewhat invasive methods to implement and they still require you to formulate a likelihood function to varying degrees; a task which is not always possible up front. I've also had issues with getting them to converge during training when using them. They also sometimes don't estimate uncertainty correctly, particularly if you make a mistake modeling the likelihood.

I guess my point is, there is no silver bullet. Adding defensible uncertainty is complicated and problem specific, and comes with downsides (often steep).

kqr · on Dec 4, 2023

I'm going to sound incredibly subjectivist now, but... the human running the model can just add error bars manually. They will probably be wide, but that's better than none at all.

Sure, you'll ideally want a calibrated estimator/superforecaster to do it, but they exist and they aren't that rare. Any decently sized organisation is bound to have at least one. They just need to care about finding them.

gh02t · on Dec 5, 2023

Even subjectively, on what basis would they generate uncertainties that at least keeps some grounding in reality? Any human generation would be ad hoc and likely very wrong, humans are notoriously awful at estimating risk and I'd argue by extension uncertainties with any consistency. And that's not even considering how one would assign an uncertainty to some huge model with 350 wacko features trained on 40 million examples. Lastly, models don't necessarily attend to the same details a human does so even if a human is able to slap an uncertainty on a prediction based on their own analysis that doesn't mean it's representing the uncertainty of what the model based its decision on.

I do think having people in the loop is a very important aspect, however, and can provide an important subjective complement to the more mathematically formulated idea of uncertainty. I don't care if the model I'm using provides the most iron clad and rigorous uncertainties ever, I'm still going to spot check it and play with it before I consider it reliable.

kqr · on Dec 5, 2023

> on what basis would they generate uncertainties that at least keeps some grounding in reality?

By having their forecasts continuously evaluated against outcomes. If someone can show me they have a track record of producing calibrated error bars on a wide variety of forecasts, I trust them to slap error bars on anything.

> even if a human is able to slap an uncertainty on a prediction [...] that doesn't mean it's representing the uncertainty of what the model based its decision on.

This sounds like it's approaching some sort of model mysticism. Models don't make forecasts, humans do. Humans can use models to inform their opinion, but in the end, the forecast is made by a human. The human only needs to put error bars on their own forecast, not on the internal workings of the model.

gh02t · on Dec 5, 2023

> This sounds like it's approaching some sort of model mysticism. Models don't make forecasts, humans do.

By forecasts I only mean output of a model, I've been wrapped up in time series methods where that's the usual term for model outputs. Assigning confidence to the conclusions drawn by an analyst using some model as a tool is a different task that may or may not roll up formal model output uncertainties and usually involves a lot of subjectivity. This is an important thing too, but is downstream.

Uncertainty is inherently tied to a specific model, since it characterizes how the model propagates uncertainty of inputs and its own fit/structure/assumptions onto its outputs. If you aren't building uncertainties contingent on the characteristics of a specific model then it isn't an uncertainty. But there's no mysticism about models possibly being unintuitive, most of the popular model forms nowadays are mystery black boxes. Some function fit to a specific dataset until it finds a local minimum in a loss function that happens to do a good job (simplifying). There's plenty of work that shows ML models often exploit features and correlations that are highly unintuitive to a human or are just plain spurious.

sobriquet9 · on Dec 5, 2023

Conformal prediction solves that problem. Split conformal and Jackknife+ are two simplest examples.

figassis · on Dec 4, 2023

So is it really science? These are concepts from stats 101. And the reasons and need, and the risks of not having them are very clear. But you have millions being put into models without these pre-requisites, and being sold to people as solutions, and waved away as "if people buy is it's bc it has value". People also pay fraudsters.

borroka · on Dec 4, 2023

But even in academia, where supposedly "true science" is, if not done, at least pursued, uncertainty intervals are rarely, with respect to the times they would be needed, understood and used.

When I used to publish stats- and math-heavy papers in the biological sciences, very rarely the reviewers--and I used to publish in intermediate and up journals--were paying any attention to the quality of the predictions, beyond a casual look at the R2 or R2-equivalents and mean absolute errors.

nradov · on Dec 4, 2023

Mostly not. Very few data "scientists" working in industry actually follow the scientific method. Instead they just mess around with various statistical techniques (including AI/ML) until they get a result that management likes.

marcinzm · on Dec 4, 2023

Most decent companies and especially tech do AB testing for everything including having people whose only job is to make sure those test results are statistically valid.

staunton · on Dec 5, 2023

The magic words here are make sure.

marcinzm · on Dec 5, 2023

In my experience they make sure a ton more in industry than in academia.

staunton · on Dec 5, 2023

Any anecdotes you can share? Also, I meant "make sure" in a negative way. As in you "make sure things are statistically significant" by e.g. p-hacking. Not that this isn't done in science but I think you're more in danger of being embarrassed during peer review than by the C-suite reading your executive summary...

marcinzm · on Dec 5, 2023

Most companies that care will run everything though an AB test, the number of AB tests is physically limited by traffic volume and the team in charge of measuring the results of the AB tests is not the team that created the experiment. That makes it much harder to p-hack since you cannot re-run experiments infinitely on a laptop, and the measurement team is judged on the accuracy of their forecasts versus the revenue impact.

figassis · on Dec 6, 2023

The worse things can and often do generate more revenue. So what are the goals of the AB test?

marcinzm · on Dec 7, 2023

>The worse things can and often do generate more revenue.

The goal of a business is usually to generate revenue so I'm confused about your question.

jmalicki · on Dec 5, 2023

Science is just fooling around with data until you get a result a journal reviewer likes.

amai · on Dec 5, 2023

Error bars are important. But most people misinterpret their meaning, see https://errorbars.streamlit.app/

macrolocal · on Dec 4, 2023

Also, error bars qua statistics can indicate problems with the underlying data and model, eg. if they're unrealistically narrow, symmetric etc.