I disagree with the analysis of this article. In a typical machine learning process, the response variable stays the same (at a distributional level) while you cycle through candidate models. So regardless of what the class distributions are, a higher AUC score indicates a better model.
It might be true that classifier performance is worse on an imbalanced data set (at the same AUC score) than on a balanced one, but that just reflects the fact that classifiers are harder to build for imbalanced data.
No, the point is that for an imbalanced set you literally don't care about the model's performance where the false-positive rate is substantial. I.e., say you have 1% true hits in your data and run the classifier at an FPR of 5%: that means you are generating roughly 5 false positives for every true hit, which is insane to do!
That's why most of the ROC curve is useless for imbalanced sets, and why I prefer precision/recall graphs, as does the OP.
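To put rough numbers on that (hypothetical: 10,000 samples, 1% prevalence, and an assumed 5% FPR even with perfect recall):

    # back-of-envelope: at 1% prevalence, a 5% FPR swamps the true hits
    n = 10_000
    positives = int(0.01 * n)   # 100 actual positives
    negatives = n - positives   # 9,900 actual negatives
    tpr, fpr = 1.00, 0.05       # even a perfect detector run at a 5% FPR
    tp = tpr * positives        # 100 true positives
    fp = fpr * negatives        # 495 false positives
    print(fp / tp)              # ~5 false alarms per real hit
    print(tp / (tp + fp))       # precision ~0.17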
I came to the conclusion that area under ROC is pretty much garbage. What you really want is two separate steps: estimate the probability that instance A belongs to class X, and then a decision step where you decide how to classify A based on a loss function (which varies depending on how harmful a false positive is).
Area under ROC forces you to conflate these two.
Why not evaluate the probability estimation directly, using Brier score [1] or something similar?
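For what it's worth, the Brier score is just the mean squared error of the predicted probabilities, so it is cheap to compute; a toy example with scikit-learn (numbers made up):

    from sklearn.metrics import brier_score_loss

    y_true = [0, 1, 1, 0, 1]                    # actual labels
    y_prob = [0.1, 0.9, 0.6, 0.3, 0.8]          # model's estimated probabilities
    print(brier_score_loss(y_true, y_prob))     # mean((y_prob - y_true)^2) = 0.062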
>I came to the conclusion that area under ROC is pretty much garbage.
I cannot disagree more. IMO AUC is a rare great metric: it's principled, useful, universally applicable (e.g. invariant to class imbalances), and easy to explain adequately to a non-statistician ("probability of choosing a positive sample over a negative one").
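That interpretation is easy to check numerically. On toy data (names and numbers mine), the fraction of positive/negative pairs where the positive gets the higher score matches roc_auc_score:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 1000)            # random 0/1 labels
    scores = y + rng.normal(size=1000)      # scores loosely correlated with the labels
    pos, neg = scores[y == 1], scores[y == 0]
    pairwise = (pos[:, None] > neg[None, :]).mean()   # P(random positive outranks random negative)
    print(pairwise, roc_auc_score(y, scores))         # identical (no ties with continuous scores)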
>What you really want is two separate steps: estimate the probability that instance A belongs in classification X, and then a decision step where you decide how to classify A based on a loss function (this varies depending on how harmful a false positive is).
Yes, but those are two inherently separate steps and should be measured separately. AUC is a metric of the first step (the "model"). The second step is a business decision and will often be made separately from the modeling process, by different people, with a different cadence, and with a different goal in mind.
For example, if I am designing a model to find a disease, I just want to make the best prediction I can, which is cleanly measured by AUC. Then, when it comes to actual diagnosis, someone else will choose cutoffs based on various factors like false-positive costs (treatment cost, human toll), false-negative costs (disease damage, death toll), supply of treatment, etc. I can picture scenarios where the same model is used for decades but the cutoffs change seasonally.
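A minimal sketch of that separation (all costs and probabilities made up): the model's probabilities stay fixed, and only the cutoff moves when someone revises the cost assumptions.

    import numpy as np

    probs = np.array([0.02, 0.10, 0.30, 0.70])   # model output; could stay unchanged for years
    cost_fp, cost_fn = 1.0, 20.0                 # assumed: a missed case is 20x worse than a false alarm
    # act when the expected loss of ignoring exceeds the expected loss of acting:
    # p * cost_fn > (1 - p) * cost_fp  =>  p > cost_fp / (cost_fp + cost_fn)
    threshold = cost_fp / (cost_fp + cost_fn)    # ~0.048
    print(probs >= threshold)                    # [False  True  True  True]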
I think this is a red herring that comes from not thinking probabilistically. If the distribution of your training data does not resemble the distribution of your real-world data (or cannot be made to resemble it), you're just guessing anyway. If it does resemble the real distribution, then you want those class imbalances. In fact, I tend to think that the fact that ROC is "invariant to class imbalances" is a significant downside: it means in some sense your score is just as sensitive to things that rarely happen as it is to things that happen all the time.
> "probability of choosing a positive sample over a negative one"
I find the practical implications of this pretty opaque, and it's never been clear to me whether this is measuring anything I actually care about. As far as I know there aren't theoretical guarantees that a better AUC score means anything real. I haven't thought deeply about it, but I am reasonably sure I could find some simple examples illustrating how to "cheat" AUC by getting a higher score with predictions that are worse in any practical sense.
I still like the Brier score: just give me a number indicating how well my estimated probability predictions do on a test/validation set. There are even theoretical guarantees about it, because it's a proper score function.
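A toy illustration of the difference (numbers invented): two sets of predictions with the same ranking get identical AUC, but the Brier score does flag the badly calibrated one.

    from sklearn.metrics import roc_auc_score, brier_score_loss

    y = [0, 0, 1, 1]
    calibrated    = [0.10, 0.20, 0.80, 0.90]
    miscalibrated = [0.40, 0.45, 0.99, 1.00]    # same ranking, distorted probabilities
    print(roc_auc_score(y, calibrated),         # 1.0
          roc_auc_score(y, miscalibrated))      # 1.0 -- AUC only sees the ranking
    print(brier_score_loss(y, calibrated),      # 0.025
          brier_score_loss(y, miscalibrated))   # ~0.091 -- Brier penalizes the miscalibration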
But isn't it possible to design multiple models, where the judgement of which is the "best model" depends on your goals (e.g. reducing false positives vs. false negatives)?
Unfortunately you do not usually know the loss function when developing a model. A typical example would be a credit bureau developing something like the FICO score for banks to use. Banks might know the loss function, but the credit bureaus don't. Hence the need for a metric like KS or the Gini coefficient.
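For reference, both of those are rank metrics that fall out of the same ROC machinery (toy scores below): KS is the maximum gap between TPR and FPR, and Gini is 2*AUC - 1.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y      = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
    scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.75, 0.80, 0.90])
    fpr, tpr, _ = roc_curve(y, scores)
    ks   = np.max(tpr - fpr)                   # Kolmogorov-Smirnov statistic: 0.6
    gini = 2 * roc_auc_score(y, scores) - 1    # Gini coefficient: 0.68
    print(ks, gini)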
If you really don't know the loss function and are not willing to guess at it, classification is hopeless. If you estimate there's a 70% chance that instance A belongs to category X, there is literally no way to decide whether to classify instance A as X or not.
Anyway, the point of the Brier score is that it evaluates your probability estimates (without the loss function), so this objection doesn't apply.
But in this example the banks are not really doing the classification. They're trying to figure out whether it's more profitable to approve or to decline a credit application. That decision depends not only on the probability of default (which the risk score predicts), but also on other factors such as APR and the type of product. There's another model with some P&L assumptions for that, and it's turtles all the way down.
Besides, banks typically adjust their credit policy much more often than credit bureaus update their scorecards, so scorecard developers cannot really rely on a fast-changing loss function.
True, but the definition is in the very last line of the whole page. For people unfamiliar with an ROC curve, which seems to be the target audience, it should be defined the first time it's used.
Edit: on navigating the site a little more, the previous section does define the acronym on its first use.
https://classeval.wordpress.com/simulation-analysis/roc-and-...