## Saturday, July 14, 2007

### Calculating AUC Using SampleError()

My last post, ROC Curves and AUC (Jun-20-2007), explained ROC curves and the AUC ("area under the curve"). This post follows up with a quick demonstration of my SampleError function and its use in calculating the AUC.

In the following examples, predictive models have been constructed which estimate the probability of a defined event. For each of these (very small) data sets, the model has been executed and its output stored in the variable ModelOutput. The actual outcome, recorded after the fact, is stored in the target variable, DependentVariable.

First, let's try a tiny data set with a model that's nearly random:

>> ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]'

ModelOutput =

0.1869
0.3816
0.4387
0.4456
0.4898
0.6463
0.7094
0.7547
0.7655
0.7952

>> DependentVariable = [0 1 1 0 0 0 1 0 1 0]'

DependentVariable =

0
1
1
0
0
0
1
0
1
0

This data set is already sorted by the model output, but that is not necessary for the SampleError routine to function properly. A random model does not separate the two classes at all, and has an expected AUC of 0.5.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.4583

The sample AUC, 0.4583, is off a bit from the theoretically expected 0.5, due to the extremely small sample size.
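
To see where 0.4583 comes from, recall that the AUC equals the fraction of (event, non-event) pairs in which the event case receives the higher model output, with ties counted as one half. The following is a minimal sketch of that pairwise count. It is not the internals of SampleError, which are not shown in this post, and the names Pos, Neg, Wins, Ties and PairwiseAUC are introduced here purely for illustration.

```matlab
% Illustrative sketch only: AUC as the fraction of (event, non-event) pairs
% in which the event case has the higher model output.
ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]';
DependentVariable = [0 1 1 0 0 0 1 0 1 0]';

Pos = ModelOutput(DependentVariable == 1);   % model outputs for events
Neg = ModelOutput(DependentVariable == 0);   % model outputs for non-events

% Form every (event, non-event) pair
[P, N] = ndgrid(Pos, Neg);

Wins = sum(P(:) > N(:));     % pairs ranked correctly
Ties = sum(P(:) == N(:));    % tied pairs count as half

PairwiseAUC = (Wins + 0.5 * Ties) / (numel(Pos) * numel(Neg))
```

Here 11 of the 24 pairs are ordered correctly, so the result is 11/24, or about 0.4583, matching the value above.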

Moving to the other extreme, consider the following data set:

>> ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]'

ModelOutput =

0.1622
0.1656
0.2630
0.3112
0.5285
0.6020
0.6541
0.6892
0.7482
0.7943

>> DependentVariable = [0 0 0 0 0 1 1 1 1 1]'

DependentVariable =

0
0
0
0
0
1
1
1
1
1

Again, the data set is sorted by the model output. It is plain that this model performs perfectly (at least on this data): the classes are entirely separate. Such a model should have an AUC of 1.0.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

1
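
This result can be checked by hand with the rank-sum (Mann-Whitney) form of the AUC, assuming no tied model outputs. It is offered as a worked check, not as a description of how SampleError arrives at its answer. With the cases ranked from lowest to highest model output, let $R_1$ be the sum of the ranks of the event cases, $n_1$ the number of events and $n_0$ the number of non-events (symbols introduced here for illustration). The five events occupy ranks 6 through 10, so:

$$
\mathrm{AUC} = \frac{R_1 - n_1(n_1+1)/2}{n_1 n_0} = \frac{(6+7+8+9+10) - 15}{5 \cdot 5} = \frac{25}{25} = 1
$$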

The final data set exhibits intermediate performance: some class separation is evident, but it is not perfect. The AUC should lie between 0.5 and 1.0.

>> ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]'

ModelOutput =

0.0782
0.0838
0.1524
0.2290
0.4427
0.4505
0.5383
0.8258
0.9133
0.9961

>> DependentVariable = [0 0 1 0 0 1 0 1 1 1]'

DependentVariable =

0
0
1
0
0
1
0
1
1
1

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.8400
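
As a cross-check that ties this number back to the ROC curve itself, the curve can be traced by lowering the decision threshold one case at a time and the area under it integrated with the trapezoid rule. This is a minimal sketch that assumes no tied model outputs (true for this data set); it is not necessarily how SampleError works internally, and the names Order, Sorted, TPR, FPR and TrapezoidAUC are introduced here for illustration.

```matlab
% Illustrative sketch only: trace the ROC curve and integrate it.
% Assumes there are no tied values in ModelOutput.
ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]';
DependentVariable = [0 0 1 0 0 1 0 1 1 1]';

% Sort cases from highest to lowest model output
[~, Order] = sort(ModelOutput, 'descend');
Sorted = DependentVariable(Order);

% Lowering the threshold past each case adds one true or one false positive
TPR = [0; cumsum(Sorted == 1) / sum(Sorted == 1)];   % true positive rate
FPR = [0; cumsum(Sorted == 0) / sum(Sorted == 0)];   % false positive rate

% Area under the ROC curve by the trapezoid rule
TrapezoidAUC = trapz(FPR, TPR)
```

For this data the area works out to 0.84, in agreement with the SampleError result.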

Unknown said...

Thanks a lot for these handy functions.

.nugeS said...

Hi,
I have a question. Are the ModelOutput values points on the ROC Curve that have already been plotted, or are they values straight from a classification algorithm, like likelihood ratios? Also, I'm guessing that the DependentVariable represents the classes? 