In my posting of Nov-09-2006, Simple Random Sampling (SRS), I explained simple random sampling and noted some of its weaknesses. This post will cover stratified random sampling, which addresses those weaknesses.
Stratified sampling provides the analyst with more control over the sampling process. A typical use of stratified sampling is to control the distribution of the variables being sampled. For instance, imagine a data set containing 200 observations, 100 of which are men, and 100 of which are women. Assume that this data set is to be split into two equal-sized groups, for control and treatment testing. Half of the subjects will receive some treatment which is under review (a drug, marketing campaign, etc.), while the other group is held out as a control and receives no treatment. A simple random sampling procedure will result in two groups, each with 50 men and 50 women, more or less. The "more or less" is the awkward part. Some simple random samples will result in a 46/54 split of men (and, in this case, the reverse, 54/46, for women). After an experiment, how will the experimenter know whether any measured differences are due to control versus treatment, or the difference in the respective proportions of men and women? It would be beneficial to control such factors.
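The idea can be sketched in a few lines of MATLAB (a hypothetical 200-subject data set with a 0/1 gender code; the variable names here are mine, invented for illustration). Sampling within each gender separately guarantees exactly 50 men and 50 women in each group:

```
% Hypothetical data: 100 men (code 0) and 100 women (code 1)
Gender = [zeros(100,1); ones(100,1)];

% Allocate the group indicator: 1 = treatment, 2 = control
Group = zeros(200,1);

% Split each gender separately, so the 50/50 balance is exact
for g = 0:1
    Members = find(Gender == g);       % all subjects of this gender
    R       = randperm(100);           % scramble their order
    Group(Members(R(1:50)))   = 1;     % exactly 50 to treatment
    Group(Members(R(51:100))) = 2;     % exactly 50 to control
end
```

With simple random sampling, by contrast, the gender split within each group is only 50/50 on average.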
When using simple random sampling, deviations from the expected distributions can be substantial. Generally, three factors aggravate this issue:
1. Smaller observation counts
2. More variables to be controlled
3. Higher skew in variables to be controlled
Even very large data may exhibit this problem. Consider the problem of applying treatments (marketing campaigns, for instance) to loan customers at a bank. At the beginning of the experiment, it is reasonable to expect that important variables be distributed similarly among treatment cells. Such variables might include current balance, credit score and loan type. Even a rather large data set may not split well along all of these dimensions.
A Simple Example
Consider a simple situation, in which there are 100,000 observations, 99,000 of which are of class A, and 1,000 of which are of class B. A minority class representation of 1% is not uncommon, and some important problems have even more class imbalance. A model is to be constructed to classify future cases as belonging to one class or the other. A train/test split of 70%/30% has been specified. To ensure that the training and testing data sets have similar proportions of classes A and B, the sampling will be stratified by class. Let's get started:
% Generate some example data (yes, it's very artificial and in order: don't worry about that!)
SimpleData = [randn(100000,5) [zeros(99000,1); ones(1000,1)]];
There are now 5 predictor variables and the target (in the last column) stored in SimpleData.
% Count the examples
n = size(SimpleData,1);
The first task is to identify the distinct strata which are to be sampled and calculate their respective frequencies. In this case, that would be the two classes:
% Locate observations in each class
ClassA = (SimpleData(:,end) == 0);
ClassB = (SimpleData(:,end) == 1);
% We already know these, but in real life they'd need to be calculated
nClassA = sum(double(ClassA));
nClassB = sum(double(ClassB));
Next, space is allocated for an integer code representing the segment, with a value of 1 for "training" or a 2 for "testing":
% Create train/test code values
Train = 1;
Test = 2;
% Allocate space for train/test indicator
Segment = repmat(Test,n,1); % Default to the last group
Next, we check a few things about the stratifying variable(s):
% Determine distinct strata
DistinctStrata = unique(SimpleData(:,end));
% Count distinct strata
nDistinctStrata = size(DistinctStrata,1);
For rigor's sake, the random number generator should be seeded at the beginning of this process; randperm draws on rand's state, so this is done by initializing rand (see my Jan-13-2007 posting, Revisiting rand (MATLAB 2007a)):
% Initialize PRNG (the seed value here is arbitrary)
rand('twister',1242)
Loop over the strata, splitting each as closely as possible (within one unit) at the 70/30 mark:
% Loop over strata
for Stratum = 1:nDistinctStrata
% Establish region of interest
ROI = find(SimpleData(:,end) == DistinctStrata(Stratum));
% Determine size of region of interest
nROI = length(ROI);
% Generate a scrambled ordering of 'nROI' items
R = randperm(nROI);
% Assign appropriate number of units to Training group
Segment(ROI(R(1:round(0.70 * nROI)))) = Train;
end
Done! Now, let's check our work:
>> mean(SimpleData(Segment == 1,end))
ans =
    0.0100
>> mean(SimpleData(Segment == 2,end))
ans =
    0.0100
Both the training and testing data sets have a 1% Class B rate. Note that stratified sampling can still deviate slightly from the designed distribution, because each stratum can only be divided into whole samples. With many strata, this rounding error (never more than half a sample per stratum) may accumulate into a small discrepancy from the exact designed split. Regardless, stratified sampling preserves the distributions of the stratified variables far better than simple random sampling.
The code in the example given was designed for clarity, not efficiency, so feel free to modify it for execution time and storage considerations.
Typically, numeric variables are stratified by dividing them into segments, such as deciles. Their original numeric values are still used in the analysis, but each segment is treated as a single stratum.
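One simple way to form deciles in MATLAB is to rank the observations and cut the ranking into ten equal bins (a sketch with a hypothetical numeric variable; the names Balance and Decile are mine):

```
% Hypothetical numeric variable to be stratified
Balance = 1000 * abs(randn(1000,1));   % e.g., account balances
n = length(Balance);

% Rank the observations, then cut the ranking into 10 equal bins
[Junk, Order] = sort(Balance);
Decile = zeros(n,1);
Decile(Order) = ceil((1:n)' / (n / 10));
```

Decile (values 1 through 10) now serves as the stratifying variable, while the original Balance values remain available for modeling and analysis.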
When dealing with multiple stratifying variables, it is suggested that unique(X,'rows') be applied to the set of stratifying variables to obtain all distinct combinations of single-variable strata that actually occur in the data. Beware that using too many stratifying variables, or too many strata per variable, may result in a large number of (multivariable) strata, many of which are very sparsely populated.
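A sketch of the multivariable case (hypothetical gender and decile codes, invented here for illustration): unique(...,'rows') enumerates the occupied combinations, and ismember(...,'rows') locates the members of each one inside the sampling loop:

```
% Hypothetical stratifying variables: gender code and balance decile
Gender = round(rand(1000,1));          % 0 or 1
Decile = ceil(10 * rand(1000,1));      % 1 through 10
Strata = [Gender Decile];

% All combinations that actually occur in the data
DistinctStrata  = unique(Strata,'rows');
nDistinctStrata = size(DistinctStrata,1);

% Within the sampling loop, members of one multivariable stratum:
Stratum = 1;
ROI = find(ismember(Strata, DistinctStrata(Stratum,:), 'rows'));
```

From here, each multivariable stratum is split exactly as in the single-variable example above.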
Stratified sampling is highly effective at avoiding the sometimes arbitrary results of simple random sampling, and is useful in assigning observations in control/test, train/test/(validate) and k-fold cross validation designs.
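The same randperm trick extends naturally to stratified k-fold assignment: within each stratum, deal the scrambled members round-robin into the k folds, so every fold receives a within-one-unit equal share of each stratum (a sketch, reusing the class imbalance from the example above; k and Fold are names I have chosen):

```
% Hypothetical: stratified assignment to k cross-validation folds
k = 5;
Target = [zeros(99000,1); ones(1000,1)];
n = length(Target);
Fold = zeros(n,1);

DistinctStrata = unique(Target);
for Stratum = 1:length(DistinctStrata)
    ROI  = find(Target == DistinctStrata(Stratum));
    nROI = length(ROI);
    R    = randperm(nROI);
    % Deal this stratum's members round-robin into the k folds
    Fold(ROI(R)) = 1 + mod(0:(nROI-1), k)';
end
```

Each fold then carries (very nearly) the same 1% Class B rate as the whole data set.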
Further Reading
Sampling: Design and Analysis, by Sharon L. Lohr (ISBN: 0-534-35361-4)