Introduction
This article introduces the percentile bootstrap, the simplest of the bootstrap methods. The bootstrap family of techniques is used to establish confidence intervals and to carry out hypothesis tests for statistical measures.
Problem Statement and Conventional Solution
One is often required to summarize a set of data, such as the following:
X =
2
10
10
5
8
1
4
9
8
10
The most commonly used summary is the mean, calculated in MATLAB thus:
>> mean(X)
ans =
6.7000
Summaries, however, discard a great deal of information. In any situation, it is helpful to know the quality of our summary. In the case above, we may wonder how far our sample mean is likely to be from the true population mean (the mean of all numbers drawn from the theoretical statistical population). Our sample mean, after all, was calculated from only 10 observations.
We may establish some idea of how far off the sample mean may be from the population mean by calculating the standard error of the sample mean, which is the sample standard deviation divided by the square root of the sample size. In MATLAB:
>> StandardError = std(X) / sqrt(length(X))
StandardError =
1.0858
Note that there are fancier versions of this calculation, for example for cases in which the population size is finite, or when the standard error of a proportion is being calculated. The standard error gives us an estimate of how far away our sample mean might be from the true population mean, and functions much like a standard deviation in a z-score: the population mean lies within about 2 standard errors of the sample mean roughly 95% of the time. In the case of our little data set, this would be from 4.5285 (= 6.7000 - 2 * 1.0858) to 8.8715 (= 6.7000 + 2 * 1.0858).
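In MATLAB, this rough interval can be computed directly from the quantities above (RoughInterval is simply a name chosen here for illustration):

>> RoughInterval = mean(X) + [-2 2] * StandardError
RoughInterval =
    4.5285    8.8715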
Note that as the number of observations grows, the denominator of the standard error formula grows and the standard error shrinks. This seems natural enough: with more data, our confidence in our statistic increases.
Complications of the Problem Statement
So far, so good: We may have had to look up the standard error formula in a book, but we have established some sort of bounds on the certainty of our summary. What if we didn't have such a reference, though? The median, for example, has no such simple formula to establish its certainty. (Actually, I believe there is a formula for the median, but it is a real bear!) Anyway, there certainly are other measures which we may calculate (even ones which we invent on the spot) for which there are no handy standard error formulas. What to do?
An Alternative Solution: The Bootstrap
Just as we are about to throw up our hands and consider another career, the bootstrap appears. The basic method of the bootstrap is simple: draw many samples with replacement from the original sample ("replicates"), and tabulate the summary statistic calculated on each of those replicate samples. The distribution of those replicated summaries mimics the sampling distribution of the statistic, the same distribution that the standard error of the mean characterizes.
Above, I mentioned that the population mean would be found inside the band from the sample mean minus two times the standard error to the sample mean plus two times the standard error about 95% of the time. The equivalent interval in our bootstrap process lies between the 2.5th and 97.5th percentiles of our replicate summaries. We use 2.5 and 97.5 because that leaves a total of 5% outside the range, half in each tail.
An example using the median will illustrate this process. For reference, let's calculate the sample median first:
>> median(X)
ans =
8
Drawing a single sample with replacement can be done in MATLAB by indexing using random integers:
>> RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)))
RandomSampleWithReplacement =
5
8
1
1
10
10
9
9
5
1
This is our first bootstrap replicate. Now, we calculate our summary on this replicate:
>> median(RandomSampleWithReplacement)
ans =
6.5000
To discern the distribution, though, will require many more replicates. Since the computer is doing all of the work, I generally like to run at least 2000 replicates to give the bootstrap distribution a chance to take shape:
rand('twister',1242)  % Seed the random number generator for repeatability
T = NaN(2000,1);      % Allocate space for the replicated summaries
for i = 1:2000        % The machine's doing the work, so why not?
    RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)));  % Draw a sample with replacement
    T(i) = median(RandomSampleWithReplacement);  % Calculate the replicated summary
end
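As an aside, readers who have the Statistics Toolbox (which I lean on below anyway) can generate equivalent replicates in a single line using its bootstrp function; this is simply a shortcut for the loop above:

T = bootstrp(2000, @median, X);  % 2000 bootstrap replicates of the median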
Now, estimating where the "real" median (the population median) is likely to be is a simple matter of checking percentiles in our replicated summaries. I have the Statistics Toolbox, so I will cheat by using a function from there:
>> prctile(T,[2.5 97.5])
ans =
3.5000 10.0000
So, our population median is likely to lie between 3.5 and 10. That is a pretty wide range, but this is the consequence of having so little data.
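For readers without the Statistics Toolbox, roughly the same percentiles can be pulled from the sorted replicates using base MATLAB alone. This is only a rough sketch of my own: it grabs nearby order statistics rather than interpolating the way prctile does:

Tsorted = sort(T);                                % Sort the replicated medians
nRep = length(Tsorted);
Lower = Tsorted(max(1, round(0.025 * nRep)));     % Approximate 2.5th percentile
Upper = Tsorted(min(nRep, round(0.975 * nRep)));  % Approximate 97.5th percentile
[Lower Upper]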
Wrap-Up
The fundamental trade-off of the bootstrap is that one forsakes pat statistical formulas in favor of strenuous computation. In summary:
Good:
-The bootstrap solves many problems not amenable to conventional methods.
-Even in cases where conventional solutions exist, the bootstrap requires no memorization or selection of the correct formula for a given situation.
Bad:
-The bootstrap requires considerable numerical computation. Of course, in an era of cheap and powerful computing machinery, this is much less of an issue. Still, if there are many of these to perform...
-The bootstrap presented in this article, the percentile bootstrap, is known to deviate from theoretically correct answers, though generally in a small way. More sophisticated bootstrap procedures exist which address some of these concerns.
This process, owing to its very general nature, can be applied to tasks much more complex than estimating the uncertainty of statistical summaries, such as hypothesis testing and predictive model performance evaluation.
Further Reading
An Introduction to the Bootstrap by Efron and Tibshirani (ISBN 0-412-04231-2)
This is the seminal work in this field. It covers a lot of ground, but is a bit mathematical.
Sunday, March 30, 2008
Validating Predictive Models
Introduction
In this author's opinion, validating the performance of predictive models is the single most important step, if one can be chosen, in the process of data mining. One important mechanism for testing models is resampling, the subject of this article. No MATLAB this time, just technique.
When selecting a validation technique, it is vital to keep in mind the purpose of such validation: to estimate the level of performance we may expect from models generated by our modeling process, when such models are run on future cases. Note an important subtlety here: We are not so much interested in testing the performance of individual models as we are in testing the model-generating process (feature selection process, complexity selection process, etc., as well as the actual modeling algorithm).
Apparent Performance: Warning!
The most obvious testing method is to simply execute the model on the very same data upon which it was built. The result is known as the apparent performance. The apparent performance is known to be statistically biased in an optimistic way. This is like giving out the answers to the test before administering the test!
At the extreme, a model could simply memorize the development observations and regurgitate them during testing. Assuming no mutually contradictory cases, such a system would deliver perfect validation performance! Certainly this is not what we are interested in.
The whole point in making a predictive model is so that said model may be used on future cases. What is desired is generalization to new cases, not simple memorization of historical ones.
Ultimately, there is no way to know precisely how optimistic apparent performance estimates are, rendering such performance measures largely useless.
Despite its hazards, calculation of the apparent performance is used as the final assessment of models with shocking frequency in industry. Do not become one of its victims.
Holdout Testing
Given the dangers of apparent performance measures, one might logically reason that a model could be built using all presently available data, and tested at some future point in time, after further observations had been collected. This idea makes perfect sense, but involves potentially considerable delay. Rather than wait for new data, holdout testing splits the data randomly into two sets: training (also called "in-sample") and testing (also called "out-of-sample"). This is the simplest form of resampling. Incidentally, it is not uncommon to stratify the assignment to training and testing groups, based on variables believed to be significant, including the dependent variables.
The idea here is simple: fit the model using the training data, and test it on the testing data. No "cheating" takes place since the test data is not used during model construction.
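Though this post is otherwise MATLAB-free, a rough sketch of such a random split might look like the following (a data matrix X and target vector y, one row per observation, are assumed, and the 70/30 ratio and variable names are arbitrary choices for illustration):

n = size(X,1);
Shuffled = randperm(n);                 % Random ordering of the observation indices
nTrain = round(0.7 * n);                % 70% for training, 30% held out for testing
TrainIdx = Shuffled(1:nTrain);
TestIdx  = Shuffled(nTrain+1:end);
XTrain = X(TrainIdx,:);  yTrain = y(TrainIdx);   % Used to fit the model
XTest  = X(TestIdx,:);   yTest  = y(TestIdx);    % Used once, and only once, to test it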
Holdout testing provides an unbiased measure of performance, provided (and this caveat is rather important) that the test data is used only once to test the model. If the test data is used more than once to test the model, then all bets are off regarding the unbiased nature of the performance measure. Surprisingly, many modelers in industry violate this "use once" rule (Shame on you, industry, shame!). In the event that another set of data is needed to make adjustments to the model (to experiment with different numbers of predictors, for instance), a third randomly assigned data set, the validation set (also called the "tuning set"), should be employed.
This simple test process works well in many instances in practice. Its biggest drawback is that it trades off training accuracy for testing accuracy. Typically, the data miner is faced with a finite supply of data. Every observation which is moved to the testing set is no longer available for training.
As indicated above, our primary interest is in evaluation of the model-generating process. Once we know what to expect from models that come from our process, we may apply our modeling process to the entire data set (without regard to train/test designations) to construct the final model.
k-Fold Cross Validation
Smaller data sets force an uncomfortable choice on the modeler using holdout testing: either short-change model construction or short-change testing. One solution is to use k-fold cross-validation (sometimes referred to as simply "cross-validation").
k-fold cross-validation builds on the idea of holdout testing in a clever way by rotating data through the process. Data is again divided randomly into groups, but now k equal-sized groups are used. As with holdout testing, stratification is sometimes used to force the folds to be statistically similar. The train-test process is repeated k times, each time leaving out a different segment of the data as the test set.
A common choice for k is 10, resulting in 10-fold cross-validation. In 10-fold cross-validation, the observations are randomly assigned to 10 groups. Ten separate models are built, each tested on a distinct data segment. The resulting 10 performance measures are unbiased, since no model is ever tested on data that was used in its training. The single, final performance measurement is taken as the mean of these 10 measures. The magic of this process is that during each fold, 90% of the data is available for training, yet the final performance metric is based on 100% of the data!
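Continuing the rough MATLAB sketch from the holdout section (again, the data matrix X, target vector y, and the use of ordinary least squares as a stand-in model are illustrative assumptions, not part of the original discussion):

k = 10;                                  % Number of folds
n = size(X,1);
Fold = mod(randperm(n), k) + 1;          % Random fold label (1..k) for each observation
MSE = NaN(k,1);                          % Per-fold test error
for f = 1:k
    TrainIdx = (Fold ~= f);              % Train on the other k-1 folds
    TestIdx  = (Fold == f);              % Test on the held-out fold
    b = X(TrainIdx,:) \ y(TrainIdx);     % Fit the stand-in model (least squares)
    yhat = X(TestIdx,:) * b;             % Predict the held-out fold
    MSE(f) = mean((y(TestIdx) - yhat).^2);
end
CrossValidatedMSE = mean(MSE)            % The single, final performance estimate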
When k is equal to the number of observations, this process goes by the special name leave-one-out. While this may be tempting, there are good reasons for choosing k in the range of 5 to 10, not least the much heavier computational burden of fitting one model per observation.
The good news with k-fold cross-validation is that reliable, unbiased testing may be performed on smaller data sets than would be possible with simple train-and-test holdout testing. The only really bad news is that this process obviously requires much more computational effort than holdout testing.
As with holdout testing, once the modeling process has been evaluated, it may run over the entire data set to produce the final model.
Closing Thoughts
Other resampling techniques are available, such as the bootstrap. Holdout testing and k-fold cross-validation are real workhorses, though, and should cover many machine learning and data mining situations.
Few other segments of the empirical modeling pipeline are as critical as model testing; perhaps only problem definition and the collection of appropriate data are as important. Assuming that these other two have been performed properly, model validation is the acid test of model performance: pay it the attention it deserves.
Further Reading
I strongly recommend the book "Computer Systems That Learn", by Weiss and Kulikowski (ISBN: 1-55860-065-5) for a quite readable introduction to this subject.
Also very worthy of consideration is chapter 5 of "Data Mining: Practical Machine Learning Tools and Techniques", by Witten and Frank (ISBN: 1-55860-552-5).
The Usenet comp.ai.neural-nets FAQ Part 1 contains solid material on this subject as well. See, especially, the section titled "What are the population, sample, training set, design set, validation set, and test set?"
Tuesday, February 27, 2007
Poll Results (Feb-20-2007): Toolbox Use
The Feb-20-2007 posting asked the poll question, Which Toolboxes from the MathWorks do you use (check all that apply)? After approximately 1 week, 38 votes were cast in total (including 1 by myself). The poll results follow:
1. Curve Fitting Toolbox (1 vote) 3%
2. Fuzzy Logic Toolbox (2 votes) 5%
3. Genetic Algorithm Toolbox (1 vote) 3%
4. Image Processing Toolbox (8 votes) 21%
5. Neural Network Toolbox (2 votes) 5%
6. Optimization Toolbox (6 votes) 16%
7. Signal Processing Toolbox (5 votes) 13%
8. Spline Toolbox (0 votes) 0%
9. Statistics Toolbox (13 votes) 34%
Not surprisingly, the Statistics Toolbox is the most popular among this crowd. The use of some of the other analytical toolboxes (such as the Neural Network and Fuzzy Logic Toolboxes) is also expected, though I was surprised both by the complete absence of anyone reporting use of the Spline Toolbox and by the fairly good turn-out for the Optimization Toolbox. Given the popularity of the Jan-26-2007 posting, Pixel Classification Project, the 8 votes cast for the Image Processing Toolbox are no surprise.
Bootstrapping just the Statistics Toolbox results gives:
>> n = 38; p = 13 / 38; prctile(mean(double(rand(n,10000) < p)),[5 95])
ans =
0.2105 0.4737
So, at 90% confidence, readers' use of the Statistics Toolbox is somewhere between 21% and 47%.
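For clarity, the one-liner above can be unpacked as follows (the intermediate variable names are chosen only for exposition):

>> n = 38;                                  % Number of votes cast
>> p = 13 / 38;                             % Observed proportion using the Statistics Toolbox
>> Simulated = rand(n,10000) < p;           % 10,000 simulated polls of n votes each
>> Proportions = mean(double(Simulated));   % Statistics Toolbox share in each simulated poll
>> prctile(Proportions,[5 95])              % 5th and 95th percentiles: a 90% interval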
Sunday, February 11, 2007
Poll Results (Feb-04-2007): Toolbox Use
The Feb-04-2007 posting, Poll (Feb-04-2007): Toolbox Use, featured the following poll question: Which MATLAB Toolboxes, if any, do you use? After 1 week, the poll has closed, with 27 Data Mining in MATLAB readers responding (1 of whom was myself). The final results are:
1. None (base MATLAB product only) (0 Votes)
2. MATLAB Toolboxes from the MathWorks only (11 Votes)
3. MATLAB Toolboxes from other sources only (1 Vote)
4. MATLAB Toolboxes both from the MathWorks and other sources (15 Votes)
Interestingly, all reported using toolboxes. All voters but 1 indicated using toolboxes from the MathWorks.
How significant are these results? A quick bootstrap of 100,000 replicates yields the following:
>> Poll = [repmat(2,11,1); repmat(3,1,1); repmat(4,15,1)];  % Reconstruct the 27 individual votes by option number
>> Replicates = Poll(ceil(27 * rand(27,1e5)));              % 100,000 bootstrap samples of 27 votes each, drawn with replacement
>> Count2 = mean(Replicates == 2);                          % Proportion choosing option 2 in each bootstrap sample
>> Count3 = mean(Replicates == 3);                          % ... option 3
>> Count4 = mean(Replicates == 4);                          % ... option 4
>> prctile(Count2,[5 95])
ans =
0.2593 0.5556
>> prctile(Count3,[5 95])
ans =
0 0.1111
>> prctile(Count4,[5 95])
ans =
0.4074 0.7037
So (roughly, given the limited precision afforded by such a small sample), at 90% confidence, use of MathWorks Toolboxes only is somewhere between 26% and 56%, use of non-MathWorks Toolboxes only is between 0% and 11%, and use of both is between 41% and 70%.
Thanks to everyone who voted!