Saturday, July 14, 2007
Calculating AUC Using SampleError()
My last post, ROC Curves and AUC (Jun-20-2007), explained ROC curves and the AUC ("area under the curve"). This post follows up with a quick demonstration of my SampleError function and its use in calculating the AUC.
In the following examples, predictive models have been constructed which estimate the probability of a defined event. For each of these (very small) data sets, the model has been executed and its output stored in the variable ModelOutput. The actual outcome, recorded after the fact, is stored in the target variable, DependentVariable.
First, let's try a tiny data set with a model that's nearly random:
>> ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]'
ModelOutput =
0.1869
0.3816
0.4387
0.4456
0.4898
0.6463
0.7094
0.7547
0.7655
0.7952
>> DependentVariable = [0 1 1 0 0 0 1 0 1 0]'
DependentVariable =
0
1
1
0
0
0
1
0
1
0
This data set is already sorted by the model output, but that is not necessary for the SampleError routine to function properly. A random model does not separate the two classes at all, and has an expected AUC of 0.5.
>> SampleError(ModelOutput,DependentVariable,'AUC')
ans =
0.4583
The sample AUC, 0.4583, is off a bit from the theoretically expected 0.5, due to the extremely small sample size.
Moving to the other extreme, consider the following data set:
>> ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]'
ModelOutput =
0.1622
0.1656
0.2630
0.3112
0.5285
0.6020
0.6541
0.6892
0.7482
0.7943
>> DependentVariable = [0 0 0 0 0 1 1 1 1 1]'
DependentVariable =
0
0
0
0
0
1
1
1
1
1
Again, the data set is sorted by the model output. It is plain that this model performs perfectly (at least on this data): the classes are entirely separate. Such a model should have an AUC of 1.0.
>> SampleError(ModelOutput,DependentVariable,'AUC')
ans =
1
The final data set exhibits intermediate performance: some class separation is evident, but it is not perfect. The AUC should lie between 0.5 and 1.0.
>> ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]'
ModelOutput =
0.0782
0.0838
0.1524
0.2290
0.4427
0.4505
0.5383
0.8258
0.9133
0.9961
>> DependentVariable = [0 0 1 0 0 1 0 1 1 1]'
DependentVariable =
0
0
1
0
0
1
0
1
1
1
>> SampleError(ModelOutput,DependentVariable,'AUC')
ans =
0.8400
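Incidentally, the AUC has a handy interpretation as the fraction of all (positive, negative) pairs in which the positive case receives the higher model output. A quick check of that count for the data above (this is just a verification device, not the algorithm inside SampleError):

% Count concordant (positive, negative) pairs as an independent check on the AUC
Positives = ModelOutput(DependentVariable == 1);
Negatives = ModelOutput(DependentVariable == 0);
Concordant = sum(sum(bsxfun(@gt, Positives, Negatives')));   % pairs where the positive scores higher
Concordant / (numel(Positives) * numel(Negatives))            % 21 of 25 pairs: 0.8400

Ties between a positive and a negative output would earn half credit, but there are none in this data.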
Wednesday, June 20, 2007
ROC Curves and AUC
Many classification problems require more than a simple "this class or that" output, so many classification solutions instead provide estimated probabilities for each possible outcome class. This prompts the question of how to evaluate the performance of classifiers which output probabilities.
Confusion Matrices
Most readers will likely be familiar with concepts like true positives, false positives, etc. These terms are defined relative to a target class (or class of interest), such as medical patients with cancer, and a background class, such as patients without cancer.
Positive cases are those classified as belonging to the target class, whereas negative cases have been classified as belonging to the background class.
In this context, true indicates that the case was correctly classified, and false indicates that the case was incorrectly classified.
Note that, for any given model and data set, the counts of each of the four possible combinations of predicted and actual may be mapped onto a 2x2 confusion matrix:
[Figure: 2x2 confusion matrix of predicted versus actual classes]
Notice, too, that this framework recognizes two distinct ways to make an error: false positives, which erroneously flag negative cases as positives, and false negatives, which erroneously flag positive cases as negatives.
A warning to the reader: There does not seem to be a consistent convention as to whether the actuals belong on the side of the confusion matrix and predictions across the top, or vice versa. In fact, some graphical representations even invert the vertical axis! To avoid confusion, always check the axis labels when exploring the literature.
This way of organizing model responses permits a variety of performance measures, accuracy (= [TP + TN] / Total) being the most obvious. All such measures, however, require that all predicted cases be divided into predicted positives and predicted negatives. When models generate class probabilities (as opposed to classes) as outputs, some threshold must be chosen above which items are classified as positive.
A variety of mechanisms may be used to select said threshold, from something as simple as just using 0.5, to more sophisticated processes which take into account prior probabilities and the costs associated with different types of errors (false positives and false negatives, for instance, may incur very different costs in real applications).
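For concreteness, here is a minimal MATLAB sketch of tallying the four cells of the confusion matrix at a single threshold, for a vector of model outputs, ModelOutput, and 0/1 actuals, DependentVariable, as in the examples earlier on this page (the 0.5 cutoff is only the simple choice mentioned above, not a recommendation):

% Tally confusion matrix cells at one chosen threshold (0.5 used purely for illustration)
Threshold = 0.5;
PredictedPositive = (ModelOutput >= Threshold);
TP = sum( PredictedPositive & (DependentVariable == 1));   % true positives
FP = sum( PredictedPositive & (DependentVariable == 0));   % false positives
TN = sum(~PredictedPositive & (DependentVariable == 0));   % true negatives
FN = sum(~PredictedPositive & (DependentVariable == 1));   % false negatives
Accuracy = (TP + TN) / (TP + FP + TN + FN);                % = [TP + TN] / Total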
ROC Curves
To avoid having to select a single threshold for classification, one may scan through all possible thresholds, and observe the effect on the true positive rate (= TP / [TP + FN] ) and the false positive rate (= FP / [FP + TN] ). Graphed as coordinate pairs, these measures form the receiver operating characteristic curve (or ROC curve, for short). Some readers will be more familiar with the true positive rate by the term sensitivity, and the false positive rate as 1.0 minus the specificity. The ROC curve describes the performance of a model across the entire range of classification thresholds. An example ROC curve is shown in the figure below:
[Figure: example ROC curve, true positive rate plotted against false positive rate]
All ROC curves begin in the bottom-left corner and rise to the top-right corner. Moving along the ROC curve represents trading off false positives for false negatives. Generally, random models will run up the diagonal, and the more the ROC curve bulges toward the top-left corner, the better the model separates the target class from the background class.
Notice that ROC is an excellent tool for assessing class separation, but it tells us nothing about the accuracy of the predicted class probabilities (for instance, whether cases with a predicted 5% probability of membership in the target class really belong to the target class 5% of the time).
Another important note: despite being similar (and related) to lift charts, ROC curves are calculated differently. Do not confuse the two!
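To make the construction concrete, the ROC curve can be traced by sweeping the threshold over the observed model outputs and recording the two rates at each step. A minimal sketch (again assuming ModelOutput and a 0/1 DependentVariable as above; this is for illustration only and is not how SampleError computes the AUC):

% Sweep thresholds and record (false positive rate, true positive rate) pairs
Thresholds = sort(unique([ModelOutput; -Inf; Inf]), 'descend');
nPos = sum(DependentVariable == 1);
nNeg = sum(DependentVariable == 0);
TPR = zeros(size(Thresholds));
FPR = zeros(size(Thresholds));
for i = 1:numel(Thresholds)
    PredictedPositive = (ModelOutput >= Thresholds(i));
    TPR(i) = sum(PredictedPositive & (DependentVariable == 1)) / nPos;   % sensitivity
    FPR(i) = sum(PredictedPositive & (DependentVariable == 0)) / nNeg;   % 1 - specificity
end
plot(FPR, TPR), xlabel('False positive rate'), ylabel('True positive rate')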
The Magic of AUC
By this point the reader may be wondering, "The ROC curve seems great and all, but it provides a spectrum of performance assessments. How do I boil this down to a simple, single-number measure of performance?" The answer, dear reader, is to measure the area under the ROC curve (abbreviated AUC, or less frequently, AUROC).
Assuming that one is not interested in a specific trade-off between true positive rate and false positive rate (that is, a particular point on the ROC curve), the AUC is useful in that it aggregates performance across the entire range of trade-offs. Interpretation of the AUC is easy: the higher the AUC, the better, with 0.50 indicating random performance and 1.00 denoting perfect performance.
Happily, it is not necessary to actually graph the ROC curve to derive the AUC of a model. A clever algorithm, which can be found in the paper by Ling, Huang and Zhang, below, permits rapid calculation of the AUC. I have implemented this algorithm in MATLAB code, within my SampleError function (see the Jan-05-2007 posting, Model Performance Measurement). Note that SampleError expects the target variable to be a dummy variable, with 0 representing the background class and 1 representing the target class.
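For readers who would like to see the idea in code, the AUC can be recovered from the ranks of the model outputs; the following is a sketch of that rank-based calculation (equivalent in spirit, though not necessarily the exact code inside SampleError):

% Rank-based AUC (Wilcoxon / Mann-Whitney form); tiedrank requires the Statistics Toolbox
Ranks = tiedrank(ModelOutput);
nPos  = sum(DependentVariable == 1);
nNeg  = sum(DependentVariable == 0);
AUC   = (sum(Ranks(DependentVariable == 1)) - nPos * (nPos + 1) / 2) / (nPos * nNeg);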
Further Reading
Data Mining, by Ian H. Witten and Eibe Frank (ISBN: 0120884070)
AUC: a Statistically Consistent and more Discriminating Measure than Accuracy, by Charles X. Ling, Jin Huang and Harry Zhang
Evaluating Performance, from “ROC Graphs: Notes and Practical Considerations for Researchers”, by T. Fawcett
The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms, by Andrew P. Bradley
See also:
The follow-up Jul-14-2007 post, Calculating AUC Using SampleError().
Friday, January 19, 2007
GASplineFit: A Flexible Curve Fitting Routine
In my work, I have frequently needed to fit curves to data. Curve-fitting is a broad and surprisingly subtle subject. The two most common types of single-input curve-fitting are simple (linear and non-linear) regression and splines.
Regressions involve the discovery of optimal coefficients for some functional form, which maps the input variable to the output variable. Regressions provide a number of advantages, but these are generally qualified advantages:
Regression Characteristics
1. Common functions (linear, logistic, exponential) are well-studied and widely available. More exotic or arbitrary functions are harder to find software for.
2. Regressions provide optimal fit to data, assuming that one wants a least-squares fit. Some software, notably the MATLAB Curve Fitting Toolbox, will provide other, less common fits, such as mean absolute error (L-1) and robust fits.
3. Regressions are the safest bet for extrapolation. Extrapolation is a hazardous endeavor, but well-behaved regression functions generate the most sensible guess in "the far, unlit unknown".
4. Complex regression curves, such as high-order polynomials, can be ill-behaved, especially at the extremes of the data range. While there are methods of combating such problems (such as fitting at the Chebyshev points), these are not widely available in commercial software.
Splines, which are especially popular in engineering, are connected series of simple curves, most often low-order polynomials. The advantage of splines is that they are extremely flexible. Complex curves require complex regression functions, but splines can handle very complicated shapes simply by connecting several simple curves. Splines may pass directly through data points, or between them. To maintain continuity, these simple curves must come together at points called knots. Like regressions, splines have their advantages, and these too are qualified:
Spline Characteristics
1. Splines are extremely flexible, but the number and placement of the knots must be chosen.
2. Technically, most splines are undefined outside the range of the knots.
3. Most commercial software is oriented toward locating a spline through a small number of given data points. "Fitting" through a large number of data points is not commonly available.
Additionally, base MATLAB directly supports the construction and evaluation of splines.
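For instance, an interpolating spline through a handful of (made-up) points takes only a couple of lines:

% Interpolating cubic spline through a few toy points, evaluated on a finer grid
x  = 0:5;
y  = [0 0.8 0.9 0.1 -0.8 -1.0];
pp = spline(x, y);             % piecewise polynomial form
yy = ppval(pp, 0:0.1:5);       % evaluate the spline on a fine grid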
GASplineFit
Wanting to free myself from having to select the right regression function every time I needed to fit a curve, I decided to turn to splines for my curve-fitting needs. The problem, as mentioned above, is that almost all spline routines and canned software are very low-level, "fitting" (if it can be called that) splines by stringing them through a very small number of selected points. My solution, GASplineFit.m, builds on the built-in MATLAB spline routines, and will generate splines which optimize any error function handled by my SampleError routine (see my posting of Jan-05-2007, Model Performance Measurement). Note: GASplineFit requires subroutine LogRand.m.
With GASplineFit, the user supplies:
1. An input variable
2. An output variable
3. The input values of the knots (I often just use linspace for these)
4. The spline type (such as 'pchip' or 'spline': see MATLAB's spline routines)
5. A selected performance measure ('MSE', 'L-1', etc.).
There are two additional parameters which control the genetic algorithm that performs the fitting; I usually leave these at 250 and 80. For better but slower fitting, turn these parameters up; for faster but cruder fitting, turn them down. See help GASplineFit for details.
Essentially, the software is sliding the spline knots up and down, trying to get the best fit. The routine returns:
1. The fitted spline as a MATLAB piecewise polynomial (see: help ppval)
2. The knot output values
3. The assessed error
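Putting the pieces together, a call might look like the sketch below; the argument order is my reading of the description above, so treat it as hypothetical and consult help GASplineFit for the authoritative interface. Here the spline is fit directly to 0/1 data, as discussed next.

% Hypothetical usage sketch (argument order assumed; see help GASplineFit)
X = randn(1000, 1);                                 % explanatory variable
Y = double(rand(1000, 1) < 1 ./ (1 + exp(-2 * X))); % 0/1 class membership, probability rising with X
KnotX = linspace(min(X), max(X), 6);                % input values of the knots
[FittedSpline, KnotY, FitError] = GASplineFit(X, Y, KnotX, 'pchip', 'L-1', 250, 80);
ProbabilityEstimate = ppval(FittedSpline, sort(X)); % evaluate the fitted probability curve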
Notice one very nice feature of this routine: it will fit probability curves directly to 0/1 data. Very often, the statistician is faced with a large collection of examples which involve a dummy variable indicating membership in a class of interest, and an explanatory variable. To smooth the data and approximate the probability of class membership, common practice is to bin the explanatory variable and assess the mean for each bin. Fancier methods will fit a curve to these binned values (which still isn't very smooth) or use a kernel regression (which is smooth, but typically difficult to encapsulate as code without taking the entire data set along for the ride). GASplineFit solves all of these problems. The genetic algorithm which is its engine doesn't care that our data may be only zeros and ones: it only cares about optimizing the performance function.
help GASplineFit explains the specifics of this routine, and provides an example of its use. MATLAB programmers beginning with genetic algorithms may be interested in examining the code, which uses an elitist genetic algorithm.
Labels: curve, curve-fitting, GA, GASplineFit, genetic algorithm, regression, SampleError, spline
Friday, January 05, 2007
Model Performance Measurement
A wide variety of model performance measures have been devised. Despite the popularity of mean squared error for numeric models and simple accuracy for classification models, there are many other choices. For my part, I generally prefer mean absolute error for numeric models and the AUC (for class separation) and informational loss (for probability assessment) for classification models.
This log entry is pretty much just a quick giveaway: I have constructed a generic performance calculation MATLAB routine, SampleError.m. Its operation is straightforward, but I find it handy to have all of these measures in one routine, with the ability to switch among them through a simple parameter change. The use of this routine is simple, is explained by help SampleError, and it makes a great building block for modeling routines.
I update many of my MATLAB routines from time to time, and this one is no exception. Presently, though, the following performance measures are supported:
'L-1' (mean absolute error)
'L-2' (mean squared error)
'L-4'
'L-16'
'L-Infinity'
'RMS' (root mean squared error)
'AUC' (requires tiedrank() from Statistics Toolbox)
'Bias'
'Conditional Entropy'
'Cross-Entropy'
'F-Measure'
'Informational Loss'
'MAPE'
'Median Squared Error'
'Worst 10%'
'Worst 20%'
Note: I still need to verify the Cross-Entropy measure. The last two are classification performance measures, being the proportion of the target class found in the predicted most likely 10% and 20%, respectively.
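For a numeric model, switching measures is just a change of the third argument; a small sketch with made-up numbers, assuming the same argument order as the AUC examples elsewhere on this page:

% Illustrative numeric predictions and actuals (values made up purely for demonstration)
Predicted = [2.9 4.1 5.2 6.8 8.3]';
Actual    = [3.0 4.0 5.0 7.0 8.0]';
MAE  = SampleError(Predicted, Actual, 'L-1');    % mean absolute error
RMS  = SampleError(Predicted, Actual, 'RMS');    % root mean squared error
Bias = SampleError(Predicted, Actual, 'Bias');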
Incidentally, I'd appreciate any feedback on any of the code in this Web log, whether it be about typos, outright coding errors, or efficiency issues. Also, please send suggestions for additional measures.
Labels: AUC, error measure, L-1, L-2, mean squared error, MSE, performance, SampleError