Showing posts with label error measure. Show all posts

Wednesday, June 20, 2007

ROC Curves and AUC

Many classification problems require more than a simple "this class or that" output, and many classification solutions accordingly provide estimated probabilities for each possible outcome class. This prompts the question of how to evaluate the performance of classifiers which output probabilities.


Confusion Matrices

Most readers will likely be familiar with concepts like true positives, false positives, etc. These terms are defined relative to a target class (or class of interest), such as medical patients with cancer, and a background class, such as patients without cancer.

Positive cases are those classified as belonging to the target class, whereas negative cases have been classified as belonging to the background class.

In this context, true indicates that the case was correctly classified, and false indicates that the case was incorrectly classified.

Note that, for any given model and data set, the counts of each of the four possible combinations of predicted and actual classes may be mapped onto a 2x2 confusion matrix:




Notice, too, that this framework recognizes two distinct ways to make an error: false positives, which erroneously flag negative cases as positives, and false negatives, which erroneously flag positive cases as negatives.

A warning to the reader: There does not seem to be a consistent convention as to whether the actuals belong on the side of the confusion matrix and predictions across the top, or vice versa. In fact, some graphical representations even invert the vertical axis! To avoid confusion, always check the axis labels when exploring the literature.

This way of organizing model responses permits a variety of performance measures, accuracy (= [TP + TN] / Total) being the most obvious. All such measures, however, require that all predicted cases be divided into predicted positives and predicted negatives. When models generate class probabilities (as opposed to classes) as outputs, some threshold must be chosen above which items are classified as positive.
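To make the bookkeeping concrete, here is a small sketch (in Python rather than this blog's usual MATLAB, with made-up data) of thresholding predicted probabilities and tallying the four confusion-matrix cells; the function name and example values are purely illustrative:

```python
# Sketch: threshold predicted probabilities and count the four
# confusion-matrix cells (TP, FP, TN, FN). Targets are 0/1 dummies,
# with 1 denoting the target class.
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (TP, FP, TN, FN) for 0/1 targets and predicted probabilities."""
    tp = fp = tn = fn = 0
    for actual, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Illustrative data: 6 cases, thresholded at 0.5
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.8, 0.1]
tp, fp, tn, fn = confusion_counts(y_true, y_prob)
accuracy = (tp + tn) / (tp + fp + tn + fn)  # = (TP + TN) / Total
```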

A variety of mechanisms may be used to select said threshold, from something as simple as just using 0.5, to more sophisticated processes which take into account prior probabilities and the costs associated with different types of errors (false positives and false negatives, for instance, may incur very different costs in real applications).
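One such mechanism can be sketched as follows (again in Python, with hypothetical costs and data): scan candidate thresholds and keep the one minimizing total misclassification cost, given separate unit costs for false positives and false negatives:

```python
# Sketch: choose a classification threshold by minimizing total
# misclassification cost. cost_fp and cost_fn are the (assumed) unit
# costs of a false positive and a false negative, respectively.
def best_threshold(y_true, y_prob, cost_fp, cost_fn):
    """Return (threshold, cost) with the smallest total cost on this sample."""
    candidates = sorted(set(y_prob)) + [1.01]  # one cut per distinct score, plus "reject all"
    best_t, best_cost = 0.5, float("inf")
    for t in candidates:
        fp = sum(1 for a, p in zip(y_true, y_prob) if p >= t and a == 0)
        fn = sum(1 for a, p in zip(y_true, y_prob) if p < t and a == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# With false negatives 10x as costly as false positives:
t, cost = best_threshold([1, 1, 0, 0], [0.8, 0.55, 0.5, 0.2], cost_fp=1, cost_fn=10)
```

In a real application the costs would come from the problem domain, and the threshold should of course be selected on held-out data rather than the training sample.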


ROC Curves

To avoid having to select a single threshold for classification, one may scan through all possible thresholds and observe the effect on the true positive rate (= TP / [TP + FN]) and the false positive rate (= FP / [FP + TN]). Graphed as coordinate pairs, these measures form the receiver operating characteristic curve (or ROC curve, for short). Some readers will be more familiar with the true positive rate by the term sensitivity, and the false positive rate as 1.0 minus the specificity. The ROC curve describes the performance of a model across the entire range of classification thresholds. An example ROC curve is shown in the figure below:




All ROC curves begin in the bottom-left corner and rise to the top-right corner. Moving along the ROC curve represents trading off false positives for false negatives. Generally, random models will run up the diagonal, and the more the ROC curve bulges toward the top-left corner, the better the model separates the target class from the background class.

Notice that ROC is an excellent tool for assessing class separation, but it tells us nothing about the accuracy of the predicted class probabilities (for instance, whether cases with a predicted 5% probability of membership in the target class really belong to the target class 5% of the time).

Another important note: despite being similar (and related) to lift charts, ROC curves are calculated differently. Do not confuse the two!
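The threshold scan described above can be sketched directly (a Python illustration with an invented helper name, not production code): sweep the threshold down through every distinct score and record the resulting (false positive rate, true positive rate) pairs:

```python
# Sketch: trace the ROC curve by sweeping the classification threshold
# down through every distinct predicted probability.
def roc_points(y_true, y_prob):
    """Return (FPR, TPR) pairs, from (0, 0) up to (1, 1)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    thresholds = sorted(set(y_prob), reverse=True)
    for t in [float("inf")] + thresholds:  # inf yields the (0, 0) corner
        tp = sum(1 for a, p in zip(y_true, y_prob) if p >= t and a == 1)
        fp = sum(1 for a, p in zip(y_true, y_prob) if p >= t and a == 0)
        points.append((fp / neg, tp / pos))
    return points
```

Plotting these pairs (FPR on the horizontal axis, TPR on the vertical) reproduces the curve discussed above.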


The Magic of AUC

By this point the reader may be wondering, "The ROC curve seems great and all, but it provides a spectrum of performance assessments. How do I boil this down to a simple, single-number measure of performance?" The answer, dear reader, is to measure the area under the ROC curve (abbreviated AUC, or less frequently, AUROC).

Assuming that one is not interested in a specific trade-off between true positive rate and false positive rate (that is, a particular point on the ROC curve), the AUC is useful in that it aggregates performance across the entire range of trade-offs. Interpretation of the AUC is easy: the higher the AUC, the better, with 0.50 indicating random performance and 1.00 denoting perfect performance.

Happily, it is not necessary to actually graph the ROC curve to derive the AUC of a model. A clever algorithm, which can be found in the paper by Ling, Huang and Zhang, below, permits rapid calculation of the AUC. I have implemented this algorithm in MATLAB code, within my SampleError function (see the Jan-05-2007 posting, Model Performance Measurement). Note that SampleError expects the target variable to be a dummy variable, with 0 representing the background class and 1 representing the target class.
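The rank-based shortcut is easy to sketch. What follows is a Python paraphrase of the standard rank-sum formula (average ranks handle ties, mirroring what MATLAB's tiedrank provides), not the author's SampleError code itself:

```python
# Sketch: AUC via the rank-sum formula, without graphing the ROC curve.
# Tied scores receive the average of their ranks (as MATLAB's tiedrank does).
def auc_rank(y_true, y_prob):
    """Return AUC for 0/1 targets (1 = target class) and predicted scores."""
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    ranks = [0.0] * len(y_prob)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_prob[order[j + 1]] == y_prob[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank over the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, a in zip(ranks, y_true) if a == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Equivalently, the AUC is the proportion of (positive, negative) pairs in which the positive case receives the higher score, which is why no explicit curve is needed.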


Further Reading

Data Mining, by Ian H. Witten and Eibe Frank (ISBN: 0120884070)

AUC: a Statistically Consistent and more Discriminating Measure than Accuracy, by Charles X. Ling, Jin Huang and Harry Zhang

Evaluating Performance, from “ROC Graphs: Notes and Practical Considerations for Researchers”, by T. Fawcett

The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms, by Andrew P. Bradley


See also:
The follow-up Jul-14-2007 post, Calculating AUC Using SampleError().

Friday, January 05, 2007

Model Performance Measurement

A wide variety of model performance measures have been devised. Despite the popularity of mean squared error for numeric models and simple accuracy for classification models, there are many other choices. For my part, I generally prefer mean absolute error for numeric models and the AUC (for class separation) and informational loss (for probability assessment) for classification models.
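For readers unfamiliar with informational loss: a common definition (the one given by Witten and Frank) is minus the base-2 logarithm of the probability the model assigned to the class that actually occurred, averaged over cases. The Python sketch below follows that definition; I cannot guarantee it matches SampleError's implementation detail for detail:

```python
import math

# Sketch: informational loss for a binary classifier, following the
# Witten & Frank definition: mean of -log2(probability assigned to the
# class that actually occurred). Lower is better; 0 is perfection.
def informational_loss(y_true, y_prob):
    """y_true holds 0/1 targets; y_prob holds predicted P(class = 1)."""
    total = 0.0
    for actual, p in zip(y_true, y_prob):
        total += -math.log2(p if actual == 1 else 1 - p)
    return total / len(y_true)
```

An uninformative model that always outputs 0.5 scores exactly 1 bit per case; confident, correct predictions drive the loss toward zero, while confident, wrong ones punish it heavily.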

This log entry is pretty much just a quick giveaway: I have constructed a generic performance calculation MATLAB routine, SampleError.m. Its operation is straightforward, but I find it handy to have all of these measures in one routine, with the ability to switch among them via a simple parameter change. Usage is simple (see help SampleError), and the routine makes a great building block for modeling routines.

I update many of my MATLAB routines from time to time, and this one is no exception. Presently, though, the following performance measures are supported:

'L-1' (mean absolute error)
'L-2' (mean squared error)
'L-4'
'L-16'
'L-Infinity'
'RMS' (root mean squared error)
'AUC' (requires tiedrank() from Statistics Toolbox)
'Bias'
'Conditional Entropy'
'Cross-Entropy'
'F-Measure'
'Informational Loss'
'MAPE'
'Median Squared Error'
'Worst 10%'
'Worst 20%'

Note: I still need to verify the Cross-Entropy measure. The last two entries are classification performance measures: the proportion of the target class found among the predicted most-likely 10% and 20% of cases, respectively.

Incidentally, I'd appreciate any feedback on any of the code in this Web log, whether it concerns typos, outright coding errors, or efficiency issues. Also, please send suggestions for additional measures.