Often, it is necessary to divide a set of observations into smaller groups, for example control and test groups for some treatment, or training and testing groups for modeling. Ideally, these different groups are more or less "similar" statistically, so that subsequent measurements made on them will differ because of the test being performed, not because the groups themselves are somehow "different".
As a thought experiment, consider a group of 10,000 bank loans. These hypothetical bank loans have already run their course, with customers having either paid back the loan ("good" loans) or not paid back the loan ("bad" loans). In our imaginary data, 400 loans were bad, a "bad rate" of 4% overall. In MATLAB, we might store such data in a numeric array, LoanData, with observations in the rows and variables in the columns. The last column contains the target variable, with a value of 0 indicating "good" and a value of 1 indicating "bad".
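To make the layout concrete, here is a minimal sketch that builds a synthetic stand-in for LoanData. The two predictor columns (LoanAmount and CustomerAge) and their distributions are invented purely for illustration; only the 10,000 rows, the 400 bad loans, and the 0/1 target in the last column come from the description above.

```matlab
% Synthetic stand-in for LoanData (illustration only; not real loan data)
nLoans = 10000;   % number of observations (rows)
nBad   = 400;     % number of "bad" loans, giving a 4% bad rate

% Two made-up predictor columns; names and distributions are hypothetical
LoanAmount  = 1000 + 9000 * rand(nLoans, 1);
CustomerAge = 18 + floor(50 * rand(nLoans, 1));

% Target: 0 = "good", 1 = "bad", with exactly nBad ones in random rows
Target = zeros(nLoans, 1);
BadRows = randperm(nLoans);
Target(BadRows(1:nBad)) = 1;

% Observations in rows, variables in columns, target in the last column
LoanData = [LoanAmount, CustomerAge, Target];

% The mean of a 0/1 target is the overall bad rate
BadRate = mean(LoanData(:, end));
```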
We might wish to divide the set of 10,000 observations into a training set and a test set, so that we might both build a neural network model of loan outcome and test it fairly. Let's further assume that the train/test split will be 75%/25%.
There are a number of methods for dividing the data. The statistically palatable ones try to be "fair" (in statistical jargon, "unbiased") by using some form of random sampling. By far, the most common technique is simple random sampling (SRS). In simple random sampling, each observation is considered separately and is randomly assigned to one of the sub-samples. For our example problem, this is very easy in MATLAB:
SRSGroup = double(rand(10000,1) > 0.75) + 1;
rand generates the needed random deviates. In this case, the number of deviates is hard-coded, but in practice it would be preferable to pass in the number of observations instead. The threshold of 0.75 is applied to split the observations (approximately) 75%/25%. Strictly speaking, the conversion to double is not necessary (adding 1 to a logical array already yields a double array), but it is good coding practice. Finally, 1 is added to go from 0/1 group labels to 1/2 group labels (a matter of taste, and also not strictly necessary).
SRSGroup now contains a series of group indices, 1 or 2, one for each row in our data, LoanData. For clarity, we will give the training and test group labels variable names, and then extract the distinct groups by using SRSGroup in the row index:
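As a sketch of the non-hard-coded version mentioned above, the number of deviates can be taken from the data itself. The zero-filled LoanData here is only a stand-in so the snippet runs on its own, and TrainingFraction is a name introduced for this illustration.

```matlab
% Stand-in data so the sketch runs on its own (assumption: any numeric array works)
LoanData = zeros(10000, 3);

nObs = size(LoanData, 1);     % number of observations (rows)
TrainingFraction = 0.75;      % hypothetical name for the desired training share

% Deviates above the threshold go to group 2, the rest to group 1
SRSGroup = double(rand(nObs, 1) > TrainingFraction) + 1;
```

Written this way, the same line keeps working if rows are added to or removed from the data.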
% Establish group labels
TrainingGroup = 1;
TestGroup = 2;
% Extract distinct groups
% Training obs., all variables
LoanDataTraining = LoanData(SRSGroup == TrainingGroup,:);
% Test obs., all variables
LoanDataTest = LoanData(SRSGroup == TestGroup,:);
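Since the point of the exercise is to end up with groups that are statistically "similar", it can be worth checking the group sizes and bad rates right after the split. The following is a minimal sketch on randomly generated stand-in data (an assumption, not real loans); the variable names nTraining, nTest, BadRateTraining and BadRateTest are introduced here for illustration.

```matlab
% Stand-in data (assumption): 10,000 rows, target in the last column, about 4% "bad"
LoanData = [rand(10000, 2), double(rand(10000, 1) < 0.04)];
SRSGroup = double(rand(size(LoanData, 1), 1) > 0.75) + 1;

TrainingGroup = 1;
TestGroup = 2;
LoanDataTraining = LoanData(SRSGroup == TrainingGroup, :);
LoanDataTest = LoanData(SRSGroup == TestGroup, :);

% Group sizes, and bad rates (the mean of a 0/1 target is the bad rate)
nTraining = size(LoanDataTraining, 1);
nTest = size(LoanDataTest, 1);
BadRateTraining = mean(LoanDataTraining(:, end));
BadRateTest = mean(LoanDataTest(:, end));

fprintf('Training: %d rows, bad rate %.2f%%\n', nTraining, 100 * BadRateTraining);
fprintf('Test:     %d rows, bad rate %.2f%%\n', nTest, 100 * BadRateTest);
```

The two bad rates will usually be close to each other but not identical, since each row is assigned independently of its outcome.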
Note several important points:
1. For repeatability (from one program run to the next), the state of MATLAB's pseudorandom number generator should be set before its use, with something like this:
rand('state',8086) % The value 8086 is arbitrary
2. For programming purposes, I prefer to explicitly store the grouping in a variable, as above in SRSGroup, so that it is available for future reference.
3. Note that the split is not likely to be exactly what was requested using the method above. In one run, I got a split of 7538 training cases and 2462 test cases, which is a 75.4%/24.6% split. It is possible to force the split to be exactly the desired split (within 1 unit), but SRS faces other potential issues whose solution will fix this as well. I will discuss this in a future posting.
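For readers who want an exact split right away, one common approach (not necessarily the one intended for the future posting, and not a remedy for the other issues mentioned) is to shuffle the row numbers with randperm and label a fixed count of them. This sketch also seeds the generator with rng, which is the seeding syntax in newer MATLAB releases; in older releases the rand('state',8086) form shown above plays the same role. ExactGroup, ShuffledRows and the count variables are names introduced for this illustration.

```matlab
rng(8086);   % newer-release seeding syntax; the value 8086 is arbitrary

nObs = 10000;                      % in practice: size(LoanData, 1)
nTraining = round(0.75 * nObs);    % 7500 training cases

% Shuffle the row numbers, then label the first nTraining of them as group 1
ShuffledRows = randperm(nObs);
ExactGroup = 2 * ones(nObs, 1);    % start with every row in the test group
ExactGroup(ShuffledRows(1:nTraining)) = 1;

% The counts are now fixed at 7500 and 2500 regardless of the random draw
nInTraining = sum(ExactGroup == 1);
nInTest = sum(ExactGroup == 2);
```

ExactGroup can be used in the row index exactly as SRSGroup is used above.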
Feel free to contact me via e-mail at email@example.com or through the comment box with questions, typos, unabashed praise, etc.
Feb-10-2007 posting, Stratified Sampling