Data Mining in MATLAB: July 2007

Sunday, July 29, 2007

Poll Results (Jul-22-2007): Source Data File Formats

After one week, the Source Data File Formats poll of Jul-22-2007 is complete. The question asked was:

What is the original format of the data you analyze?

Multiple responses were permitted. A total of 33 votes were cast, although the polling system used does not indicate the total number of voters.

In decreasing order of popularity, the results are:

9 votes (27%): MATLAB
7 votes (21%): Text (comma-delimited, tab-delimited, etc.)
7 votes (21%): Other
5 votes (15%): Relational database (Oracle, DB2, etc.)
4 votes (12%): Excel
1 vote (3%): Statistical software native format (SPSS, S-Plus, etc.)

I'm a little surprised that relational databases didn't appear more frequently.

No one commented, although I'd be very interested in knowing what the 'Other' source formats are, since they tied for second place. Anyone?

Sunday, July 22, 2007

Poll (Jul-22-2007): Source Data File Formats

This poll is about the original file format of the data you analyze, not (necessarily) the data which MATLAB directly loads. For example, if your source data originally comes from a relational database, choose "relational database", even though you may export it to a tab-delimited text file first.

Multiple selections are permitted, but choose the file formats you encounter most often.

This poll is closed.

See the poll results in the Jul-29-2007 posting, Poll Results (Jul-22-2007): Source Data File Formats.

Thanks for voting!

Saturday, July 14, 2007

Calculating AUC Using SampleError()

In my last post, ROC Curves and AUC (Jun-20-2007), ROC curves and AUC ("area under the curve") were explained. This post will follow up with a quick demonstration of my SampleError function, and its use in calculating the AUC.

In the following examples, predictive models have been constructed which estimate the probability of a defined event. For each of these (very small) data sets, the model has been executed and its output stored in the variable ModelOutput. After the fact, the actual outcome is recorded in the target variable, DependentVariable (1 = event, 0 = non-event).

First, let's try a tiny data set with a model that's nearly random:

>> ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]'

ModelOutput =

0.1869
0.3816
0.4387
0.4456
0.4898
0.6463
0.7094
0.7547
0.7655
0.7952

>> DependentVariable = [0 1 1 0 0 0 1 0 1 0]'

DependentVariable =

0
1
1
0
0
0
1
0
1
0

This data set is already sorted by the model output, but that is not necessary for the SampleError routine to function properly. A random model does not separate the two classes at all, and has an expected AUC of 0.5.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.4583

The sample AUC, 0.4583, is off a bit from the theoretically expected 0.5, due to the extremely small sample size.
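
To see where 0.4583 comes from, recall that the AUC equals the fraction of (event, non-event) pairs in which the event case receives the higher model output, with ties counted as one half. What follows is only a minimal sketch of that pairwise count, not the internals of SampleError, which are not shown in this post. The names Pos, Neg and Differences are illustrative, and bsxfun assumes MATLAB R2007a or later.

```matlab
% Data from the nearly random example above
ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]';
DependentVariable = [0 1 1 0 0 0 1 0 1 0]';

% Split the model outputs by actual outcome
Pos = ModelOutput(DependentVariable == 1);   % outputs for the 4 events
Neg = ModelOutput(DependentVariable == 0);   % outputs for the 6 non-events

% Difference for every (event, non-event) pair: a 4-by-6 matrix
Differences = bsxfun(@minus, Pos, Neg');

% Correctly ordered pairs count 1, tied pairs count 1/2
AUC = (sum(Differences(:) > 0) + 0.5 * sum(Differences(:) == 0)) / numel(Differences)
```

In this data set, 11 of the 24 pairs are ordered correctly, and 11/24 rounds to 0.4583, which agrees with the SampleError result.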

Moving to the other extreme, consider the following data set:

>> ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]'

ModelOutput =

0.1622
0.1656
0.2630
0.3112
0.5285
0.6020
0.6541
0.6892
0.7482
0.7943

>> DependentVariable = [0 0 0 0 0 1 1 1 1 1]'

DependentVariable =

0
0
0
0
0
1
1
1
1
1

Again, the data set is sorted by the model output. It is plain that this model performs perfectly (at least on this data): the classes are entirely separate. Such a model should have an AUC of 1.0.

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

1
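
Perfect separation can also be checked directly: every event must score higher than every non-event. A small sketch follows, with variable names (LowestEvent, HighestNonEvent, PerfectlySeparated) chosen purely for illustration.

```matlab
% Data from the perfectly separated example above
ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]';
DependentVariable = [0 0 0 0 0 1 1 1 1 1]';

% Lowest output among events and highest output among non-events
LowestEvent     = min(ModelOutput(DependentVariable == 1));
HighestNonEvent = max(ModelOutput(DependentVariable == 0));

% True exactly when the two classes do not overlap at all
PerfectlySeparated = LowestEvent > HighestNonEvent
```

When this test is true, every (event, non-event) pair is ordered correctly, which is why the AUC is exactly 1.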

The final data set exhibits intermediate performance: some class separation is evident, but it is not perfect. The AUC should lie between 0.5 and 1.0.

>> ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]'

ModelOutput =

0.0782
0.0838
0.1524
0.2290
0.4427
0.4505
0.5383
0.8258
0.9133
0.9961

>> DependentVariable = [0 0 1 0 0 1 0 1 1 1]'

DependentVariable =

0
0
1
0
0
1
0
1
1
1

>> SampleError(ModelOutput,DependentVariable,'AUC')

ans =

0.8400
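
The same figure can be recovered from ranks alone, using the equivalence between the AUC and the Mann-Whitney U statistic. The sketch below is one way to do this under a stated assumption: ModelOutput contains no tied values (ties would require average ranks). It is not necessarily how SampleError computes the AUC, and the names Order, Ranks, nPos and nNeg are illustrative.

```matlab
% Data from the intermediate example above
ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]';
DependentVariable = [0 0 1 0 0 1 0 1 1 1]';

% Rank the model outputs from 1 (lowest) to n (highest).
% Assumption: no ties, so plain ordinal ranks are sufficient.
[Sorted, Order] = sort(ModelOutput);
Ranks = zeros(size(ModelOutput));
Ranks(Order) = (1:numel(ModelOutput))';

% Class counts
nPos = sum(DependentVariable == 1);
nNeg = sum(DependentVariable == 0);

% Mann-Whitney U for the events, scaled by the number of pairs
RankSum = sum(Ranks(DependentVariable == 1));
AUC = (RankSum - nPos * (nPos + 1) / 2) / (nPos * nNeg)
```

Here the events hold ranks 3, 6, 8, 9 and 10, which sum to 36, so the AUC is (36 - 15) / 25 = 0.84. Because the ranks are computed by the sort, the input does not need to be ordered beforehand.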