## Saturday, February 10, 2007

### Stratified Sampling

#### Introduction

In my posting of Nov-09-2006, Simple Random Sampling (SRS), I explained simple random sampling and noted some of its weaknesses. This post will cover stratified random sampling, which addresses those weaknesses.

Stratified sampling provides the analyst with more control over the sampling process. A typical use of stratified sampling is to control the distribution of the variables being sampled. For instance, imagine a data set containing 200 observations, 100 of which are men, and 100 of which are women. Assume that this data set is to be split into two equal-sized groups, for control and treatment testing. Half of the subjects will receive some treatment which is under review (a drug, marketing campaign, etc.), while the other group is held out as a control and receives no treatment. A simple random sampling procedure will result in two groups, each with 50 men and 50 women, more or less. The "more or less" is the awkward part. Some simple random samples will result in a 46/54 split of men (and, in this case, the reverse, 54/46, for women). After an experiment, how will the experimenter know whether any measured differences are due to control versus treatment, or the difference in the respective proportions of men and women? It would be beneficial to control such factors.
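To get a feel for the "more or less", the following minimal sketch repeats the simple random split many times and records how many men land in the treatment group. The variable names, the seed and the number of repetitions are illustrative assumptions, not part of the original example. The count of men in the treatment group follows a hypergeometric distribution with a standard deviation of about 3.5, so splits of 46/54 or worse should not be rare.

```matlab
% Illustrative sketch: variability of a simple random 100/100 split
% (names, seed and repetition count are assumptions for this demo)
rand('state',12345);                     % Initialize PRNG (legacy syntax, as used in this post)

IsMale = [true(100,1); false(100,1)];    % 100 men followed by 100 women
nTrials = 10000;                         % Number of simulated splits
MenInTreatment = zeros(nTrials,1);

for Trial = 1:nTrials
    R = randperm(200);                   % Scrambled ordering of the 200 subjects
    Treatment = R(1:100);                % First half goes to the treatment group
    MenInTreatment(Trial) = sum(IsMale(Treatment));
end

% Fraction of splits at least as lopsided as 46/54
FractionLopsided = mean(abs(MenInTreatment - 50) >= 4);
disp([min(MenInTreatment) max(MenInTreatment) FractionLopsided])
```

The displayed row shows the smallest and largest number of men seen in the treatment group, followed by the share of simulated splits that were 46/54 or worse.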

When using simple random sampling, deviations from the expected distributions can be substantial. Generally, three factors aggravate this issue:

1. Smaller observation counts
2. More variables to be controlled
3. Higher skew in variables to be controlled

Even very large data may exhibit this problem. Consider the problem of applying treatments (marketing campaigns, for instance) to loan customers at a bank. At the beginning of the experiment, it is reasonable to expect that important variables be distributed similarly among treatment cells. Such variables might include current balance, credit score and loan type. Even a rather large data set may not split well along all of these dimensions.

#### A Simple Example

Consider a simple situation, in which there are 100,000 observations, 99,000 of which are of class A, and 1,000 of which are of class B. A minority class representation of 1% is not uncommon, and some important problems have even more class imbalance. A model is to be constructed to classify future cases as belonging to one class or the other. A train/test split of 70%/30% has been specified. To ensure that the training and testing data sets have similar proportions of classes A and B, the sampling will be stratified by class. Let's get started:

```matlab
% Generate some example data (yes, it's very artificial and in order- don't worry about that!)
SimpleData = [randn(100000,5) [zeros(99000,1); ones(1000,1)]];
```

There are now 5 predictor variables and the target (in the last column) stored in SimpleData.

```matlab
% Count the examples
n = size(SimpleData,1);
```

The first task is to identify the distinct strata which are to be sampled, and calculate their respective frequencies. In this case, that would be the two classes:

```matlab
% Locate observations in each class
ClassA = (SimpleData(:,end) == 0);
ClassB = (SimpleData(:,end) == 1);

% We already know these, but in real life they'd need to be calculated
nClassA = sum(double(ClassA));
nClassB = sum(double(ClassB));
```

Next, space is allocated for an integer code representing the segment, with a value of 1 for "training" or a 2 for "testing":

```matlab
% Create train/test code values
Train = 1;
Test = 2;

% Allocate space for train/test indicator
Segment = repmat(Test,n,1); % Default to the last group
```

Next, we check a few things about the stratifying variable(s):

```matlab
% Determine distinct strata
DistinctStrata = unique(SimpleData(:,end));

% Count distinct strata
nDistinctStrata = size(DistinctStrata,1);
```

For rigor's sake, randperm should be initialized at the beginning of this process, which is done by initializing rand (see my Jan-13-2007 posting, Revisiting rand (MATLAB 2007a)):

```matlab
% Initialize PRNG
rand('state',29182);
```

Loop over the strata, splitting each as closely as possible (within one unit) at the 70/30 mark:

```matlab
% Loop over strata
for Stratum = 1:nDistinctStrata
    % Establish region of interest
    ROI = find(SimpleData(:,end) == DistinctStrata(Stratum));

    % Determine size of region of interest
    nROI = length(ROI);

    % Generate a scrambled ordering of 'nROI' items
    R = randperm(nROI);

    % Assign appropriate number of units to Training group
    Segment(ROI(R(1:round(0.70 * nROI)))) = Train;
end
```

Done! Now, let's check our work:

```
>> mean(SimpleData(Segment == 1,end))

ans =

    0.0100

>> mean(SimpleData(Segment == 2,end))

ans =

    0.0100
```

Both the training and testing data sets have a 1% Class B rate. Note that stratified sampling will sometimes deviate from the designed proportions because each stratum can only be divided into whole observations. Within any one stratum, rounding moves the training count by at most 0.5 observations, but across many strata these small errors may add up to a small discrepancy from the exact designed distribution. Regardless, stratified sampling preserves the distributions of the stratifying variables much better than simple random sampling.
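The rounding effect is easiest to see with very small strata. The sketch below uses five hypothetical strata of 8 observations each (sizes chosen purely for illustration): 70% of 8 is 5.6, which rounds to 6 in every stratum, so the errors all point the same way.

```matlab
% Illustrative sketch: rounding within small strata (sizes are hypothetical)
StratumSizes = [8; 8; 8; 8; 8];                      % Observations per stratum
TrainCounts = round(0.70 * StratumSizes);            % Whole observations assigned to training
RoundingError = TrainCounts - 0.70 * StratumSizes;   % Per-stratum deviation from the design
OverallTrainRate = sum(TrainCounts) / sum(StratumSizes);
```

Here each stratum contributes 6 training observations, so 30 of 40 observations (75%) end up in training rather than the designed 70%. With strata as large as those in the main example, the same effect is negligible.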

#### Epilogue

The code in the example given was designed for clarity, not efficiency, so feel free to modify it for execution time and storage considerations.

Typically, numeric variables are stratified by dividing them into segments, such as deciles. Their original numeric values are still used, but each segment is treated as one stratum.
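As a minimal sketch of the decile idea, the code below bins a hypothetical skewed variable (called Balance here, an assumed name) into ten equal-sized groups by rank, then reuses the 70/30 loop from the main example with the decile code as the stratifying variable. Rank-based binning is one possible choice, used here because it needs no toolbox functions; tied values are assigned to bins in sort order.

```matlab
% Illustrative sketch: stratify on a numeric variable using deciles
% (variable names and the rank-based binning are assumptions)
rand('state',29182); randn('state',29182);   % Initialize PRNGs (legacy syntax)

Balance = exp(randn(1000,1));                % Hypothetical skewed numeric variable
n = length(Balance);

% Rank the observations, then cut the ranks into 10 equal-sized bins
[Sorted, Order] = sort(Balance);
Rank = zeros(n,1);
Rank(Order) = (1:n)';
Decile = ceil(10 * Rank / n);                % Stratum code, 1 through 10

% 70/30 split within each decile, as in the main example
Train = 1;
Test = 2;
Segment = repmat(Test,n,1);
for Stratum = 1:10
    ROI = find(Decile == Stratum);
    R = randperm(length(ROI));
    Segment(ROI(R(1:round(0.70 * length(ROI))))) = Train;
end
```

The original values in Balance are untouched; only the derived Decile code drives the sampling.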

When dealing with multiple stratifying variables, it is suggested that unique(X,'rows') be applied to the set of stratifying variables to obtain all distinct combinations of single-variable strata that actually occur in the data. Beware that using too many stratifying variables or too many strata per variable may result in a large number of (multivariable) strata, many of which are very sparsely populated.
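A minimal sketch of the multi-variable case follows, assuming two hypothetical binary stratifying variables (Class and Young, with made-up proportions). The third output of unique gives each row's combination index, which then plays the role of the single stratifying column in the main example.

```matlab
% Illustrative sketch: strata from combinations of two variables
% (the variables and their proportions are hypothetical)
rand('state',29182);                     % Initialize PRNG (legacy syntax)

n = 10000;
Class = double(rand(n,1) < 0.10);        % About 10% in class 1
Young = double(rand(n,1) < 0.30);        % About 30% flagged as young
X = [Class Young];                       % Stratifying variables

% Distinct combinations that occur, and each row's combination index
[DistinctStrata, FirstRow, StratumIndex] = unique(X,'rows');
nDistinctStrata = size(DistinctStrata,1);

% 70/30 split within each combination
Train = 1;
Test = 2;
Segment = repmat(Test,n,1);
for Stratum = 1:nDistinctStrata
    ROI = find(StratumIndex == Stratum);
    nROI = length(ROI);
    R = randperm(nROI);
    Segment(ROI(R(1:round(0.70 * nROI)))) = Train;
end

% Sparsely populated strata are worth checking before sampling
StratumCounts = accumarray(StratumIndex(:),1);
```

Inspecting StratumCounts before running the loop is a cheap way to spot combinations with too few observations to split sensibly.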

Stratified sampling is highly effective at avoiding the sometimes arbitrary results of simple random sampling, and is useful in assigning observations in control/test, train/test/(validate) and k-fold cross validation designs.
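For the k-fold case, one possible approach (a sketch, not the only way to do it) is to scramble each stratum and then deal its observations out to the folds in turn, like cards. The Target vector, the fold count and the variable names below are assumptions for illustration.

```matlab
% Illustrative sketch: stratified k-fold assignment (names are assumptions)
rand('state',29182);                     % Initialize PRNG (legacy syntax)

Target = [zeros(990,1); ones(10,1)];     % Hypothetical 1% minority class
n = length(Target);
k = 10;                                  % Number of folds

DistinctStrata = unique(Target);
Fold = zeros(n,1);

for Stratum = 1:length(DistinctStrata)
    ROI = find(Target == DistinctStrata(Stratum));
    nROI = length(ROI);
    R = randperm(nROI);
    % Deal the scrambled observations out to folds 1..k in turn
    Fold(ROI(R)) = mod((0:nROI-1)', k) + 1;
end
```

Fold f then serves as the test set for round f, with the remaining folds used for training. Because dealing always starts at fold 1, the low-numbered folds receive the leftover observations when a stratum's size is not a multiple of k; starting each stratum at a random fold is one way to spread those out.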

Sampling: Design and Analysis, by Sharon L. Lohr (ISBN: 0-534-35361-4)

Anonymous said...

Thank you for this nice guide..

Though I have a question: do we use stratified sampling to split the training set from the testing set?

Anonymous said...

However, I have one small question: Do researchers use stratified sampling to split training data from testing data in a given data set?

Thanks again,
Mana

Will Dwinnell said...

Data may be split into "train" and "test" groups via simple random sampling ("SRS") or via stratified sampling. In the case of stratified sampling, there is usually some variable which the analyst wants to remain similarly distributed among the "train" and "test" groups, such as the outcome class.

Anonymous said...

Hi! Nice page! I was wondering if somebody can help me with some advice. I have a data set with users and movie ratings, so I need to randomly give each user 50% for training and 50% for testing. How can I do that in MATLAB?

Anonymous said...

Stratified sampling is a special example of the jackknife resampling technique, and the most usual stratified sampling is cross-validation (CV).

cguitar said...

Thanks for this nice post! However I have a question here:
What if there is a 7th column in the SimpleData, for example, 30% of the 100000 people are young (<18 years old) and 70% of them are older than 18?
Now we have two things to consider: the gender and age. How do we do stratified random sampling in this case?

Anonymous said...

Hello,

Thank you for your very helpful blog; however, I am still a little lost as to how I can achieve stratified 10-fold cross-validation. Please forgive me, I am a novice at MATLAB.

I have read this post along with how to divide data randomly into equal size groups and I can't quite figure out how to merge both tips.
http://matlabdatamining.blogspot.co.uk/2007/02/dividing-data-randomly-into-equal-sized.html

Would you point me in the right direction please?

Many thanks!
K