Many classification problems require more than a simple "this class or that" output. Many classification solutions will provide estimated probabilities for each possible outcome class. This prompts the question of how to evaluate the performance of classifiers which output probabilities.
Most readers will likely be familiar with concepts like true positives, false positives, etc. Such terms are defined in terms of a target class (or class of interest), such as medical patients with cancer, and a background class, patients without cancer.
Positive cases are those classified as belonging to the target class, whereas negative cases have been classified as belonging to the background class.
In this context, true indicates that the case was correctly classified, and false indicates that the case was incorrectly classified.
Note that, for any given model and data set, the counts of each of the four possible combinations of predicted and actual may be mapped onto a 2x2 confusion matrix (click to enlarge):
Notice, too, that this framework recognizes two distinct ways to make an error: false positives, which erroneously flag negative cases as positives, and false negatives, which erroneously flag positive cases as negatives.
A warning to the reader: There does not seem to be a consistent convention as to whether the actuals belong on the side of the confusion matrix and predictions across the top, or vice versa. In fact, some graphical representations even invert the vertical axis! To avoid confusion, always check the axis labels when exploring the literature.
This way of organizing model responses permits a variety of performance measures, accuracy (= [TP + TN] / Total) being the most obvious. All such measures, however, require that all predicted cases be divided into predicted positives and predicted negatives. When models generate class probabilities (as opposed to classes) as outputs, some threshold must be chosen above which items are classified as positive.
A variety of mechanisms may be used to select said threshold, from something as simple as just using 0.5, to more sophisticated processes which take into account prior probabilities and the costs associated with different types of errors (false positives and false negatives, for instance, may incur very different costs in real applications).
To avoid having to select a single threshold for classification, one may scan through all possible thresholds, and observe the effect on the true positive rate (= TP / [TP + FN] ) and the false positive rate (= FP / [FP + TN] ). Graphed as coordinate pairs, these measures form the receiver operating characteristic curve (or ROC curve, for short). Some readers will be more familiar with the true positive rate by the term sensitivity, and the false positive rate as 1.0 minus the specificity. The ROC curve describes the performance of a model across the entire range of classification thresholds. An example ROC curve is shown in the figure below (click to enlarge):
All ROC curves begin in the bottom-left corner and rise to the top-right corner. Moving along the ROC curve represents trading off false positives for false negatives. Generally, random models will run up the diagonal, and the more the ROC curve bulges toward the top-left corner, the better the model separates the target class from the background class.
Notice that ROC is an excellent tool for assessing class separation, but it tells us nothing about the accuracy of the predicted class probabilities (for instance, whether cases with a predicted 5% probability of membership in the target class really belong to the target class 5% of the time).
Another important note: despite being similar (and related) to lift charts, ROC curves are calculated differently. Do not confuse the two!
The Magic of AUC
By this point the reader may be wondering, "The ROC curve seems great and all, but it provides a spectrum of performance assessments. How do I boil this down to a simple, single-number measure of performance?" The answer, dear reader, is to measure the area under the ROC curve (abbreviated AUC, or less frequently, AUROC).
Assuming that one is not interested in a specific trade-off between true positive rate and false positive rate (that is, a particular point on the ROC curve), the AUC is useful in that it aggregates performance across the entire range of trade-offs. Interpretation of the AUC is easy: the higher the AUC, the better, with 0.50 indicating random performance and 1.00 denoting perfect performance.
Happily, it is not necessary to actually graph the ROC curve to derive the AUC of a model. A clever algorithm, which can be found in the paper by Ling, Huang and Zhang, below, permits rapid calculation of the AUC. I have implemented this algorithm in MATLAB code, within my SampleError function (see the Jan-05-2007 posting, Model Performance Measurement). Note that SampleError expects the target variable to be a dummy variable, with 0 representing the background class and 1 representing the target class.
Data Mining, by Ian H. Witten and Eibe Frank (ISBN: 0120884070)
AUC: a Statistically Consistent and more Discriminating Measure than Accuracy, by Charles X. Ling, Jin Huang and Harry Zhang
Evaluating Performance, from “ROC Graphs: Notes and Practical Considerations for Researchers”, by T. Fawcett
The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms, by Andrew P. Bradley
The follow-up Jul-14-2007 post, Calculating AUC Using SampleError().