**Confusion Matrices**

Most readers will likely be familiar with concepts like

*true positives*,

*false positives*, etc. Such terms are defined in terms of a

*target class*(or

*class of interest*), such as medical patients with cancer, and a

*background class*, patients without cancer.

*Positive*cases are those classified as belonging to the target class, whereas

*negative*cases have been classified as belonging to the background class.

In this context,

*true*indicates that the case was correctly classified, and

*false*indicates that the case was incorrectly classified.

Note that, for any given model and data set, the counts of each of the four possible combinations of predicted and actual may be mapped onto a 2x2

*confusion matrix*(click to enlarge):

Notice, too, that this framework recognizes two distinct ways to make an error:

*false positives*, which erroneously flag negative cases as positives, and

*false negatives*, which erroneously flag positive cases as negatives.

A warning to the reader: There does not seem to be a consistent convention as to whether the actuals belong on the side of the confusion matrix and predictions across the top, or vice versa. In fact, some graphical representations even invert the vertical axis! To avoid confusion, always check the axis labels when exploring the literature.

This way of organizing model responses permits a variety of performance measures,

*accuracy*(= [TP + TN] / Total) being the most obvious. All such measures, however, require that all predicted cases be divided into predicted positives and predicted negatives. When models generate class probabilities (as opposed to classes) as outputs, some threshold must be chosen above which items are classified as

*positive*.

A variety of mechanisms may be used to select said threshold, from something as simple as just using 0.5, to more sophisticated processes which take into account prior probabilities and the costs associated with different types of errors (false positives and false negatives, for instance, may incur very different costs in real applications).

**ROC Curves**

To avoid having to select a single threshold for classification, one may scan through all possible thresholds, and observe the effect on the

*true positive rate*(= TP / [TP + FN] ) and the

*false positive rate*(= FP / [FP + TN] ). Graphed as coordinate pairs, these measures form the

*receiver operating characteristic*curve (or

*ROC curve*, for short). Some readers will be more familiar with the true positive rate by the term

*sensitivity*, and the false positive rate as 1.0 minus the

*specificity*. The ROC curve describes the performance of a model across the entire range of classification thresholds. An example ROC curve is shown in the figure below (click to enlarge):

All ROC curves begin in the bottom-left corner and rise to the top-right corner. Moving along the ROC curve represents trading off false positives for false negatives. Generally, random models will run up the diagonal, and the more the ROC curve bulges toward the top-left corner, the better the model separates the target class from the background class.

Notice that ROC is an excellent tool for assessing class separation, but it tells us nothing about the accuracy of the predicted class probabilities (for instance, whether cases with a predicted 5% probability of membership in the target class really belong to the target class 5% of the time).

Another important note: despite being similar (and related) to lift charts, ROC curves are calculated differently. Do not confuse the two!

**The Magic of AUC**

By this point the reader may be wondering, "The ROC curve seems great and all, but it provides a spectrum of performance assessments. How do I boil this down to a simple, single-number measure of performance?" The answer, dear reader, is to measure the

*area under the ROC curve*(abbreviated

*AUC*, or less frequently,

*AUROC*).

Assuming that one is not interested in a specific trade-off between true positive rate and false positive rate (that is, a particular point on the ROC curve), the AUC is useful in that it aggregates performance across the entire range of trade-offs. Interpretation of the AUC is easy: the higher the AUC, the better, with 0.50 indicating random performance and 1.00 denoting perfect performance.

Happily, it is not necessary to actually graph the ROC curve to derive the AUC of a model. A clever algorithm, which can be found in the paper by Ling, Huang and Zhang, below, permits rapid calculation of the AUC. I have implemented this algorithm in MATLAB code, within my

*SampleError*function (see the Jan-05-2007 posting, Model Performance Measurement). Note that SampleError expects the target variable to be a dummy variable, with 0 representing the background class and 1 representing the target class.

**Further Reading**

*Data Mining*, by Ian H. Witten and Eibe Frank (ISBN: 0120884070)

AUC: a Statistically Consistent and more Discriminating Measure than Accuracy, by Charles X. Ling, Jin Huang and Harry Zhang

Evaluating Performance, from “ROC Graphs: Notes and Practical Considerations for Researchers”, by T. Fawcett

The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms, by Andrew P. Bradley

**See also:**

The follow-up Jul-14-2007 post, Calculating AUC Using SampleError().

## 7 comments:

Great, clear article; nicely formed logic for use of AUC--in line with that of my advisors!

Could you specify where in the article is the algorithm described. I read the article and I only saw the proof the AUC is better than accuracy.

Thanks

Section 2.1.

Hi, Will,

I am trying to use perfcurve to find auc and plot the curve, but I can't figure out what kind of data I should put as labels, scores, posclass?

Thank you!

I used codes as follow:

y1 = (1:100)'>50;

alldata=cat(1,data1,data2);

% data1=0, data2=1

b = glmfit(alldata,y1,'binomial');

% logistic regression

p = glmval(b,alldata,'logit');

[X,Y,AUC] = perfcurve(labels,scores,posclass)

Looks like pictures are missing from this nice article...

ROC curves can be a bit misleading. For instance, it inlcudes regions of the ROC space in which one would

rarely operate. Anyway, you have to select a classification threshold as some point. Personally I prefer to use the harmonic mean of the sensitivity and specificity.

"Anyway, you have to select a classification threshold as some point."

That depends on one's application. In banking, for instance, a single credit score may be used for multiple purposes, and while some parts of the bank (marketing) operate at one end of the probability spectrum (the good end), other parts of the bank (credit risk) may operate at the other.

Naturally, if one has a firm idea of where along this continuum is most important, then that is what should be measured and optimized.

Post a Comment