I disagree with the analysis of this article. In a typical machine learning process, the response variable stays the same (at a distributional level) while you cycle through candidate models. So regardless of what the class distributions are, a higher AUC score indicates a better model.
It might be true that classifier performance is worse on an imbalanced data set (at the same AUC score) than on a balanced one, but that just reflects the fact that classifiers are harder to build for imbalanced data.
No, the point is that for an imbalanced set you literally don't care about the model's performance where the false-positive rate is substantial. I.e., say you have 1% true hits in your data and run the classifier at an FPR of 5%: that means you are generating roughly 5 false positives for every true hit, which is insane to do!
That's why most of the ROC curve is useless for imbalanced sets, and why I prefer precision/recall graphs, as does the OP.
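To put rough numbers on that (hypothetical: 10,000 samples, 1% prevalence, and an assumed 5% FPR even with perfect recall):

    # back-of-envelope: at 1% prevalence, a 5% FPR swamps the true hits
    n = 10_000
    positives = int(0.01 * n)   # 100 actual positives
    negatives = n - positives   # 9,900 actual negatives
    tpr, fpr = 1.00, 0.05       # even a perfect detector run at a 5% FPR
    tp = tpr * positives        # 100 true positives
    fp = fpr * negatives        # 495 false positives
    print(fp / tp)              # ~5 false alarms per real hit
    print(tp / (tp + fp))       # precision ~0.17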
I came to the conclusion that area under ROC is pretty much garbage. What you really want is two separate steps: estimate the probability that instance A belongs to class X, and then a decision step where you decide how to classify A based on a loss function (which varies depending on how harmful a false positive is).
Area under ROC forces you to conflate these two.
Why not evaluate the probability estimation directly, using Brier score [1] or something similar?
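For what it's worth, the Brier score is just the mean squared error of the predicted probabilities, so it is cheap to compute; a toy example with scikit-learn (numbers made up):

    from sklearn.metrics import brier_score_loss

    y_true = [0, 1, 1, 0, 1]                    # actual labels
    y_prob = [0.1, 0.9, 0.6, 0.3, 0.8]          # model's estimated probabilities
    print(brier_score_loss(y_true, y_prob))     # mean((y_prob - y_true)^2) = 0.062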
>I came to the conclusion that area under ROC is pretty much garbage.
I cannot disagree more. IMO AUC is a rare great metric: it's principled, useful, universally applicable (e.g. invariant to class imbalances), and easy to explain adequately to a non-statistician ("probability of choosing a positive sample over a negative one").
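That interpretation is easy to check numerically. On toy data (names and numbers mine), the fraction of positive/negative pairs where the positive gets the higher score matches roc_auc_score:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 1000)            # random 0/1 labels
    scores = y + rng.normal(size=1000)      # scores loosely correlated with the labels
    pos, neg = scores[y == 1], scores[y == 0]
    pairwise = (pos[:, None] > neg[None, :]).mean()   # P(random positive outranks random negative)
    print(pairwise, roc_auc_score(y, scores))         # identical (no ties with continuous scores)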
>What you really want is two separate steps: estimate the probability that instance A belongs in classification X, and then a decision step where you decide how to classify A based on a loss function (this varies depending on how harmful a false positive is).
Yes, but those are two inherently separate steps and should be measured separately. AUC is a metric of the first step (the "model"). The second step is a business decision and will often be made separately from the modeling process, by different people, with a different cadence, and with a different goal in mind.
For example, if I am designing a model to find a disease, I just want to make the best prediction I can, which is cleanly measured by AUC. Then, when it comes to actual diagnosis, someone else will choose cutoffs based on various factors like false-positive costs (treatment cost, human toll), false-negative costs (disease damage, death toll), supply of treatment, etc. I can picture scenarios where the same model is used for decades but the cutoffs change seasonally.
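A minimal sketch of that separation (all costs and probabilities made up): the model's probabilities stay fixed, and only the cutoff moves when someone revises the cost assumptions.

    import numpy as np

    probs = np.array([0.02, 0.10, 0.30, 0.70])   # model output; could stay unchanged for years
    cost_fp, cost_fn = 1.0, 20.0                 # assumed: a missed case is 20x worse than a false alarm
    # act when the expected loss of ignoring exceeds the expected loss of acting:
    # p * cost_fn > (1 - p) * cost_fp  =>  p > cost_fp / (cost_fp + cost_fn)
    threshold = cost_fp / (cost_fp + cost_fn)    # ~0.048
    print(probs >= threshold)                    # [False  True  True  True]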
I think this is a red herring that comes from not thinking probabilistically. If the distribution of your training data does not resemble the distribution of your real-world data (or cannot be made to resemble it), you're just guessing anyway. If it does resemble the real distribution, then you want those class imbalances. In fact, I tend to think that the fact that ROC is "invariant to class imbalances" is a significant downside: it means in some sense your score is just as sensitive to things that rarely happen as it is to things that happen all the time.
> "probability of choosing a positive sample over a negative one"
I find the practical implications of this pretty opaque, and it's never been clear to me whether this is measuring anything I actually care about. As far as I know there aren't theoretical guarantees that a better AUC score means anything real. I haven't thought deeply about it, but I am reasonably sure I could find some simple examples illustrating how to "cheat" AUC by getting a higher score with predictions that are worse in any practical sense.
I still like the Brier score: just give me a number indicating how well my estimated probability predictions do on a test/validation set. There are even theoretical guarantees about it, because it's a proper score function.
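A toy illustration of the difference (numbers invented): two sets of predictions with the same ranking get identical AUC, but the Brier score does flag the badly calibrated one.

    from sklearn.metrics import roc_auc_score, brier_score_loss

    y = [0, 0, 1, 1]
    calibrated    = [0.10, 0.20, 0.80, 0.90]
    miscalibrated = [0.40, 0.45, 0.99, 1.00]    # same ranking, distorted probabilities
    print(roc_auc_score(y, calibrated),         # 1.0
          roc_auc_score(y, miscalibrated))      # 1.0 -- AUC only sees the ranking
    print(brier_score_loss(y, calibrated),      # 0.025
          brier_score_loss(y, miscalibrated))   # ~0.091 -- Brier penalizes the miscalibration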
But isn't it possible to design multiple models, where the judgement of which is the "best model" depends on your goals (e.g. reducing false positives vs. false negatives)?
Unfortunately you do not usually know the loss function when developing a model. A typical example would be a credit bureau developing something like the FICO score for banks to use. Banks might know the loss function, but the credit bureaus don't. Hence the need for a metric like KS or the Gini coefficient.
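For reference, both of those are rank metrics that fall out of the same ROC machinery (toy scores below): KS is the maximum gap between TPR and FPR, and Gini is 2*AUC - 1.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y      = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
    scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.75, 0.80, 0.90])
    fpr, tpr, _ = roc_curve(y, scores)
    ks   = np.max(tpr - fpr)                   # Kolmogorov-Smirnov statistic: 0.6
    gini = 2 * roc_auc_score(y, scores) - 1    # Gini coefficient: 0.68
    print(ks, gini)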
If you really don't know the loss function and are not willing to guess at it, classification is hopeless. If you estimate there's a 70% chance that instance A belongs to category X, there is literally no way to decide whether to classify instance A as X or not.
Anyway, the point of the Brier score is that it evaluates your probability estimates (without the loss function), so this objection doesn't apply.
But in this example the banks are not really doing the classification. They're trying to figure out whether it's more profitable to approve or to decline a credit application. That decision depends not only on the probability of default (which the risk score predicts), but also on other factors such as APR and the type of product. There's another model with some P&L assumptions for that, and it's turtles all the way down.
Besides, banks typically adjust their credit policy much more often than credit bureaus update their scorecards, so scorecard developers cannot really rely on a fast-changing loss function.
True, but the definition is in the very last line of the whole page. For people unfamiliar with an ROC curve, which seems to be the target audience, it should be defined the first time it's used.
Edit: on navigating the site a little more, the previous section does define the acronym on its first use.
https://classeval.wordpress.com/simulation-analysis/roc-and-...