My last post, ROC Curves and AUC (Jun-20-2007), explained ROC curves and the AUC ("area under the curve"). This post follows up with a quick demonstration of my *SampleError* function and its use in calculating the AUC.

In the following examples, predictive models have been constructed which estimate the probability of a defined event. For each of these (very small) data sets, the model has been executed and its output stored in the variable *ModelOutput*. After the fact, the actual outcome is recorded in the target variable, *DependentVariable*, coded as 1 when the event occurred and 0 when it did not.

First, let's try a tiny data set with a model that's nearly random:

>> ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]'

ModelOutput =

    0.1869
    0.3816
    0.4387
    0.4456
    0.4898
    0.6463
    0.7094
    0.7547
    0.7655
    0.7952

>> DependentVariable = [0 1 1 0 0 0 1 0 1 0]'

DependentVariable =

     0
     1
     1
     0
     0
     0
     1
     0
     1
     0

This data set is already sorted by the model output, but that is not necessary for the *SampleError* routine to function properly. A random model does not separate the two classes at all, and has an expected AUC of 0.5.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.4583

The sample AUC, 0.4583, is off a bit from the theoretically expected 0.5, due to the extremely small sample size.
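One way to sanity-check this figure by hand is to use the interpretation of the AUC as the fraction of (event, non-event) pairs in which the event receives the higher model output. The snippet below is only a sketch of that pairwise count, not the internals of *SampleError*; the variable names *PosScores*, *NegScores*, *Wins*, *Ties* and *PairwiseAUC* are made up for illustration.

```matlab
% Hypothetical pairwise check of the sample AUC (not the SampleError source).
ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]';
DependentVariable = [0 1 1 0 0 0 1 0 1 0]';

PosScores = ModelOutput(DependentVariable == 1);   % outputs for the 4 events
NegScores = ModelOutput(DependentVariable == 0);   % outputs for the 6 non-events

% Compare every event output against every non-event output (4-by-6 grid).
% bsxfun is used so the comparison also works on older MATLAB releases.
Wins = bsxfun(@gt, PosScores, NegScores');
Ties = bsxfun(@eq, PosScores, NegScores');

% Fraction of pairs ordered correctly, counting ties as half a win.
PairwiseAUC = (sum(Wins(:)) + 0.5 * sum(Ties(:))) / numel(Wins)
```

Here 11 of the 24 pairs are ordered correctly, and 11/24 is approximately 0.4583, which agrees with the value reported above.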

Moving to the other extreme, consider the following data set:

>> ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]'

ModelOutput =

    0.1622
    0.1656
    0.2630
    0.3112
    0.5285
    0.6020
    0.6541
    0.6892
    0.7482
    0.7943

>> DependentVariable = [0 0 0 0 0 1 1 1 1 1]'

DependentVariable =

     0
     0
     0
     0
     0
     1
     1
     1
     1
     1

Again, the data set is sorted by the model output. It is plain that this model performs perfectly (at least on this data): the classes are entirely separate. Such a model should have an AUC of 1.0.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

1
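Perfect separation can also be confirmed directly, without computing the AUC: every event must score above every non-event. The following is a small illustrative check under that definition; *HighestNonEvent*, *LowestEvent* and *PerfectlySeparated* are hypothetical names, not part of *SampleError*.

```matlab
% Illustrative check for complete class separation (hypothetical variable names).
ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]';
DependentVariable = [0 0 0 0 0 1 1 1 1 1]';

HighestNonEvent = max(ModelOutput(DependentVariable == 0));   % 0.5285
LowestEvent     = min(ModelOutput(DependentVariable == 1));   % 0.6020

% True exactly when all events outscore all non-events, i.e. when AUC is 1.
PerfectlySeparated = HighestNonEvent < LowestEvent
```

Any threshold between 0.5285 and 0.6020 classifies this sample without error, which is why the AUC reaches its maximum.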

The final data set exhibits intermediate performance: some class separation is evident, but it is not perfect. The AUC should lie between 0.5 and 1.0.

>> ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]'

ModelOutput =

    0.0782
    0.0838
    0.1524
    0.2290
    0.4427
    0.4505
    0.5383
    0.8258
    0.9133
    0.9961

>> DependentVariable = [0 0 1 0 0 1 0 1 1 1]'

DependentVariable =

     0
     0
     1
     0
     0
     1
     0
     1
     1
     1

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.8400
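This value can be cross-checked with the rank-sum (Mann-Whitney) form of the AUC, which needs only the ranks of the model outputs. The sketch below assumes there are no tied outputs, as is the case here, and uses hypothetical variable names; it is not necessarily how *SampleError* arrives at its answer.

```matlab
% Rank-sum cross-check of the AUC; assumes no tied model outputs.
ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]';
DependentVariable = [0 0 1 0 0 1 0 1 1 1]';

% Rank the outputs from lowest (1) to highest (10).
[~, SortOrder] = sort(ModelOutput);
Ranks = zeros(size(ModelOutput));
Ranks(SortOrder) = (1:numel(ModelOutput))';

NumPos = sum(DependentVariable == 1);            % 5 events
NumNeg = sum(DependentVariable == 0);            % 5 non-events
RankSum = sum(Ranks(DependentVariable == 1));    % 3 + 6 + 8 + 9 + 10 = 36

% Subtract the smallest possible rank sum, then scale by the number of pairs.
RankAUC = (RankSum - NumPos * (NumPos + 1) / 2) / (NumPos * NumNeg)
```

The result is (36 - 15) / 25 = 0.84, matching the figure above. Because the ranks are computed inside the snippet, the rows may be supplied in any order.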