*SampleError* function, and its use in calculating the AUC.

In the following examples, predictive models have been constructed which estimate the probability of a defined event. For each of these (very small) data sets, the model has been executed and its output stored in the variable *ModelOutput*. After the fact, the actual outcome is recorded in the target variable, *DependentVariable*.

First, let's try a tiny data set with a model that's nearly random:

>> ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]'

ModelOutput =

0.1869

0.3816

0.4387

0.4456

0.4898

0.6463

0.7094

0.7547

0.7655

0.7952

>> DependentVariable = [0 1 1 0 0 0 1 0 1 0]'

DependentVariable =

0

1

1

0

0

0

1

0

1

0

This data set is already sorted by the model output, but that is not necessary for the *SampleError* routine to function properly. A random model does not separate the two classes at all, and has an expected AUC of 0.5.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.4583

The sample AUC, 0.4583, is off a bit from the theoretically expected 0.5, due to the extremely small sample size.
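As a cross-check, the same figure can be reproduced by hand from the ranks of the model outputs, using the standard relationship between the AUC and the Mann-Whitney rank-sum statistic. This is only a sketch: the internals of *SampleError* are not shown here, so it is an independent calculation, not a description of that routine, and it assumes there are no tied scores.

```matlab
ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]';
DependentVariable = [0 1 1 0 0 0 1 0 1 0]';

% Rank the scores (1 = lowest); this simple ranking assumes no tied scores
[~, order] = sort(ModelOutput);
ranks = zeros(size(ModelOutput));
ranks(order) = (1:numel(ModelOutput))';

nPos = sum(DependentVariable == 1);   % number of actual events
nNeg = sum(DependentVariable == 0);   % number of actual non-events

% Mann-Whitney U statistic for the event class, scaled to the 0..1 range
U = sum(ranks(DependentVariable == 1)) - nPos * (nPos + 1) / 2;
AUC = U / (nPos * nNeg)
```

Here the events hold ranks 2, 3, 7 and 9, so U = 21 - 10 = 11 and the AUC is 11/24, or 0.4583. Because the ranks are computed with an explicit sort, the rows can be supplied in any order.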

Moving to the other extreme, consider the following data set:

>> ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]'

ModelOutput =

0.1622

0.1656

0.2630

0.3112

0.5285

0.6020

0.6541

0.6892

0.7482

0.7943

>> DependentVariable = [0 0 0 0 0 1 1 1 1 1]'

DependentVariable =

0

0

0

0

0

1

1

1

1

1

Again, the data set is sorted by the model output. It is plain that this model performs perfectly (at least on this data): the classes are entirely separate. Such a model should have an AUC of 1.0.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

1
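One way to see why perfect separation gives exactly 1 is the pairwise interpretation of the AUC: it is the share of (event, non-event) pairs in which the event receives the higher score. The following is a minimal sketch of that definition, not the *SampleError* implementation; the variable names `pos`, `neg`, `wins` and `ties` are made up for this illustration.

```matlab
ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]';
DependentVariable = [0 0 0 0 0 1 1 1 1 1]';

pos = ModelOutput(DependentVariable == 1);   % scores of actual events
neg = ModelOutput(DependentVariable == 0);   % scores of actual non-events

% Compare every event with every non-event (rows: events, columns: non-events)
wins = bsxfun(@gt, pos, neg');
ties = bsxfun(@eq, pos, neg');

% Share of pairs ranked correctly, counting a tie as half a win
AUC = mean(wins(:) + 0.5 * ties(:))
```

All 25 pairs are ordered correctly in this data set, so the result is 1. The comparison matrix grows with the product of the two class sizes, so this form is best kept for small illustrations like this one.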

The final data set exhibits intermediate performance: some class separation is evident, but it is not perfect. The AUC should lie between 0.5 and 1.0.

>> ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]'

ModelOutput =

0.0782

0.0838

0.1524

0.2290

0.4427

0.4505

0.5383

0.8258

0.9133

0.9961

>> DependentVariable = [0 0 1 0 0 1 0 1 1 1]'

DependentVariable =

0

0

1

0

0

1

0

1

1

1

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.8400
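The name "area under the curve" can also be taken literally. The sketch below traces the ROC curve for this data set by lowering the classification threshold one case at a time, then integrates it with the trapezoidal rule. It is an illustration under the assumption of no tied scores, and is not taken from *SampleError* itself; `TPR` and `FPR` are names chosen for this example.

```matlab
ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]';
DependentVariable = [0 0 1 0 0 1 0 1 1 1]';

% Sort cases from highest to lowest score, so each step lowers the threshold
[~, order] = sort(ModelOutput, 'descend');
y = DependentVariable(order);

% Cumulative true-positive and false-positive rates, starting from (0,0)
TPR = [0; cumsum(y == 1) / sum(y == 1)];
FPR = [0; cumsum(y == 0) / sum(y == 0)];

% Trapezoidal area under the ROC curve (assumes no tied scores)
AUC = trapz(FPR, TPR)
```

The curve moves right five times, at true-positive rates of 0.6, 0.8, 0.8, 1 and 1, giving 0.2 × 4.2 = 0.84, in agreement with the value reported above. Running `plot(FPR, TPR)` afterwards displays the curve.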

## 4 comments:

Thanks a lot for these handy functions.

Hi,

I have a question. Are the ModelOutput values points on the ROC Curve that have already been plotted, or are they values straight from a classification algorithm, like likelihood ratios? Also, I'm guessing that the DependentVariable represents the classes?

Thanks for your posts!

When using the *SampleError* function, the *Predicted* value (which is being fed by *ModelOutput* in this example) is the output generated by a model, typically an estimated probability. The *Actual* value (being fed by *DependentVariable* in this illustration) is expected to be a 0/1 dummy variable indicating the outcome class.

Hi,

I am doing a multiclass classification where I have 6 different classes. The outputs are the posterior probabilities of each class (6 columns) and the true data has been divided into 6 columns (0/1 for not belonging and belonging to that class). So, do I have to calculate the sample error for each column (one predicted and one true)? How do I approach this? I will be grateful for your comment.
