Data Mining in MATLAB: percentile bootstrap

Introduction

This article introduces the percentile bootstrap, the simplest of the bootstrap methods. The bootstrap family of techniques are used to establish confidence intervals and calculate hypothesis tests for statistical measures.

Problem Statement and Conventional Solution

One is often required to summarize a set of data, such as the following:

X =

2
10
10
5
8
1
4
9
8
10

The most commonly used summary is the mean, in MATLAB calculated thus:

>> mean(X)

ans =

6.7000

Summaries, however, discard a great deal of information. In any situation, it is helpful to know the quality of our summary. In the case above, we may wonder how far our sample mean is likely to be the true population mean (the mean of all numbers drawn from the theoretical statistical population). Our sample mean, after all, was calculated from only 10 observations.

We may establish some idea of how far off the sample mean may be from the population mean by calculating the standard error of the sample mean, which is the standard deviation divided by the square root of the sample size. In MATLAB:

>> StandardError = std(X) / sqrt(length(X))

StandardError =

1.0858

Note that there are fancier versions of this calculation, for example for cases in which the population size is finite, or when the standard error of a proportion is being calculated. The standard error gives us an estimate of how far away our sample error might be form the true population mean, and acts like a z-score: the population mean is within 2 times the standard error from the sample mean about 95% of the time. In the case of our little data set, this would be from 4.5285 (= 6.7000 - 2 * 1.0858) to 8.8715 (= 6.7000 + 2 * 1.0858).

Note that as the number of observations grows, the bottom part of the standard error fraction becomes larger and the standard error decreases. This seems natural enough: with more data, our confidence in our statistic increases.

Complications of the Problem Statement

So far, so good: We may have had to look up the standard error formula in a book, but we have established some sort of parameters as to the certainty of our summary. What if we didn't have such a reference, though? The median for example, has no such simple formula to establish its certainty. (Actually, I believe there is a formula for the median, but that it is a real bear!) Anyway, there certainly are other measures which we may calculate (even ones which we invent on the spot), for which there are no handy standard error formulas. What to do?

An Alternative Solution: The Bootstrap

Just as we are about to throw up our hands and consider another career, the bootstrap appears. The basic method of the bootstrap is simple: Draw many samples with replacement from the original sample ("replicates"), and tabulate the summary statistic when calculated on each those replicate samples. The distribution of those replicated summaries is intended to mimic the distribution being parameterized by the standard error of the mean.

Above, I mentioned that the population mean would be found inside the band from the sample mean minus two times the standard error to the sample mean plus two times the standard error about 95% of the time. The equivalent area in our bootstrap process would be between the 2.5 and 97.5 percentiles of our replicate summaries. We use 2.5 and 97.5 because that leaves a total of 5% outside of the range, half on each end of the spectrum.

An example using the median will illustrate this process. For reference, let's calculate the sample median first:

>> median(X)

ans =

8

Drawing a single sample with replacement can be done in MATLAB by indexing using random integers:

RandomSampleWithReplacement =

5
8
1
1
10
10
9
9
5
1

This is our first bootstrap replicate. Now, we calculate our summary on this replicate:

>> median(RandomSampleWithReplacement)

ans =

6.5000

To discern the distribution, though, will require many more replicates. Since the computer is doing all of the work, I generally like to run at least 2000 replicates to give the bootstrap distribution a chance to take shape:

rand('twister',1242) % Seed the random number generator for repeatability
T = NaN(2000,1); % Allocate space for the replicated summaries
for i = 1:2000 % The machine's doing the work, so why not?
RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1))); % Draw a sample with replacement
T(i) = median(RandomSampleWithReplacement); % Calculate the replicated summary
end

(I apologize if the code is a bit cramped, but I have not been able to figure out how to insert tabs or indentation in this edit window.)

Now, estimating where the "real" median (the population median) is likely to be is a simple matter of checking percentiles in our replicated summaries. I have the Statistic Toolbox, so I will cheat by using a function from there:

>> prctile(T,[2.5 97.5])

ans =

3.5000 10.0000

So, our population median is likely to lie between 3.5 and 10. That is a pretty wide range, but this is the consequence of having so little data.

Wrap-Up

The fundamental trade-off of the bootstrap is that one forsakes pat statistical formulas in favor of strenuous computation. In summary:

Good:

-The bootstrap solves many problems not amenable to conventional methods.

-Even in cases where conventional solutions exist, the bootstrap requires no memory or selection of correct formulas for given situations.

Bad:
-The bootstrap requires considerable numerical computation. Of course, in an era of cheap and powerful computing machinery, this is much less of an issue. Still, if there are many of these to perform...

-The bootstrap presented in this article, the bootstrap percentile, is known to deviate from theoretically correct answers, though generally in a small way. There are more sophisticated bootstrap procedures which address some of these concerns, though.

This process, owing to its very general nature, can be applied to tasks much more complex than estimating the uncertainty of statistical summaries, such as hypothesis testing and predictive model performance evaluation.

Further Reading

An Introduction to the Bootstrap by Efron and Tibshirani (ISBN 0-412-04231-2)
This is the seminal work in this field. It covers a lot of ground, but is a bit mathematical.

Data Mining in MATLAB

Tuesday, March 03, 2009

Introduction to the Percentile Bootstrap