Saturday, July 14, 2007

Calculating AUC Using SampleError()

My last post, ROC Curves and AUC (Jun-20-2007), explained ROC curves and the AUC ("area under the curve"). This post follows up with a quick demonstration of my SampleError function and its use in calculating the AUC.

In the following examples, predictive models have been constructed to estimate the probability of a defined event. For each of these (very small) data sets, the model has been executed and its output stored in the variable ModelOutput. The actual outcome, recorded after the fact, is held in the target variable, DependentVariable.

First, let's try a tiny data set with a model that's nearly random:

>> ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]'

ModelOutput =

0.1869
0.3816
0.4387
0.4456
0.4898
0.6463
0.7094
0.7547
0.7655
0.7952

>> DependentVariable = [0 1 1 0 0 0 1 0 1 0]'

DependentVariable =

0
1
1
0
0
0
1
0
1
0

This data set is already sorted by the model output, but that is not necessary for the SampleError routine to function properly. A random model does not separate the two classes at all, and has an expected AUC of 0.5.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.4583

The sample AUC, 0.4583, is off a bit from the theoretically expected 0.5, due to the extremely small sample size.
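
For readers who would like to check this value by hand, the AUC has a convenient rank-sum (Wilcoxon-Mann-Whitney) interpretation: it equals the fraction of positive-negative pairs in which the positive case receives the higher model output. Here is a minimal sketch of that calculation; it is not the SampleError implementation, and it assumes the Statistics Toolbox function tiedrank is available:

r = tiedrank(ModelOutput);         % ranks of the model outputs (ties receive averaged ranks)
n1 = sum(DependentVariable == 1);  % count of positive cases
n0 = sum(DependentVariable == 0);  % count of negative cases
AUC = (sum(r(DependentVariable == 1)) - n1 * (n1 + 1) / 2) / (n1 * n0)

For the data above, the four positive cases hold ranks 2, 3, 7 and 9, giving AUC = (21 - 10) / 24 = 0.4583, in agreement with SampleError.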

Moving to the other extreme, consider the following data set:

>> ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]'

ModelOutput =

0.1622
0.1656
0.2630
0.3112
0.5285
0.6020
0.6541
0.6892
0.7482
0.7943

>> DependentVariable = [0 0 0 0 0 1 1 1 1 1]'

DependentVariable =

0
0
0
0
0
1
1
1
1
1

Again, the data set is sorted by the model output. It is plain that this model performs perfectly (at least on this data): the classes are entirely separated, with every 1 scored above every 0, so every positive-negative pair is ordered correctly. Such a model should have an AUC of 1.0.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

1

The final data set exhibits intermediate performance: some class separation is evident, but it is not perfect. The AUC should lie between 0.5 and 1.0.

>> ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]'

ModelOutput =

0.0782
0.0838
0.1524
0.2290
0.4427
0.4505
0.5383
0.8258
0.9133
0.9961

>> DependentVariable = [0 0 1 0 0 1 0 1 1 1]'

DependentVariable =

0
0
1
0
0
1
0
1
1
1

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.8400
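
This value, too, can be verified by direct pair counting: of the 5 × 5 = 25 positive-negative pairs, the positive case scored 0.1524 is out-ranked by three negatives (0.2290, 0.4427 and 0.5383) and the one scored 0.4505 by one (0.5383), leaving 21 correctly ordered pairs and an AUC of 21 / 25 = 0.84.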

4 comments:

Unknown said...

Thanks a lot for these handy functions.

.nugeS said...

Hi,
I have a question. Are the ModelOutput values points on the ROC Curve that have already been plotted, or are they values straight from a classification algorithm, like likelihood ratios? Also, I'm guessing that the DependentVariable represents the classes?

Thanks for your posts!

Will Dwinnell said...

When using the SampleError function, the Predicted value (which is being fed by ModelOutput in this example) is the output generated by a model, typically an estimated probability.

The Actual value (being fed by DependentVariable in this illustration) is expected to be a 0/1 dummy variable indicating the outcome class.

Anonymous said...

Hi,
I am doing a multiclass classification where I have 6 different classes. The outputs are the posterior probabilities of each class (6 columns) and the true data has been divided into 6 columns (0/1 for not belonging and belonging to that class). So, Do I have to calculate the sample error for each column(one predicted and one true)? How do I approach this? I will be grateful for your comment.