Monday, February 19, 2007

Dividing Data Randomly Into Equal-Sized Groups

This is a quick note on dividing items randomly into equal-sized groups. It is an even quicker tip than yesterday's Dividing Values Into Equal-Sized Groups because, in this case, the original data does not affect the outcome: the group assignments depend only on how many items there are.

Start by initializing the pseudo-random number generator (PRNG) for reproducible results:


rand('twister',9596)


Being able to reproduce outcomes exactly from run to run is important for several reasons, not the least of which is debugging. If the outcome of a program changes from run to run, it can be very hard to discover what precisely is going wrong.
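As an aside for readers on newer MATLAB releases: the rand('twister',...) syntax shown above is the older way of seeding, and later releases provide the rng function for the same purpose. The following is only a sketch, assuming a release that includes rng. It reseeds twice with the same value and checks that the same permutation comes back (the variable names a and b are just for illustration):

```matlab
% Sketch: reproducible permutations using rng (assumes a MATLAB release with rng).
rng(9596, 'twister');      % seed the global generator (Mersenne Twister)
a = randperm(10);          % first draw

rng(9596, 'twister');      % reseed with the same value
b = randperm(10);          % second draw, from the same starting state

same = isequal(a, b);      % true when both draws are identical
disp(same)
```

If the two calls to rng used different seeds, a and b would almost certainly differ, which is exactly the run-to-run variation that makes debugging difficult.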

With that out of the way, we can assign random groupings, in this case 20 groups for 10,000 individuals:


Group = ceil(20 * randperm(10000)' / 10000);


That's all there is to it. The result, 'Group', is a column vector of 10,000 group assignments, each an integer from 1 to 20. If a different number of groups is desired, change the '20' to some other number. If a different number of items is to be assigned to groups, change the '10000' (in both places) to something else. Just to check on this example, we reach for tabulate from the Statistics Toolbox:


tabulate(Group)
  Value    Count   Percent
      1      500      5.00%
      2      500      5.00%
      3      500      5.00%
      4      500      5.00%
      5      500      5.00%
      6      500      5.00%
      7      500      5.00%
      8      500      5.00%
      9      500      5.00%
     10      500      5.00%
     11      500      5.00%
     12      500      5.00%
     13      500      5.00%
     14      500      5.00%
     15      500      5.00%
     16      500      5.00%
     17      500      5.00%
     18      500      5.00%
     19      500      5.00%
     20      500      5.00%
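For anyone without the Statistics Toolbox, the same check can be done with accumarray from base MATLAB. The sketch below also pulls the two numbers out into variables (nItems and nGroups are names chosen here for illustration, not part of the original one-liner) and deliberately uses an item count that does not divide evenly, to see how the leftover items are spread:

```matlab
% Sketch: parameterized grouping, checked without the Statistics Toolbox.
nItems  = 103;   % illustrative value, not evenly divisible by nGroups
nGroups = 10;    % illustrative value

% Same technique as above: scale a random permutation of 1..nItems
% onto the range 1..nGroups and round up.
Group = ceil(nGroups * randperm(nItems)' / nItems);

% Count the members of each group (one row per group).
counts = accumarray(Group, 1);

disp([(1:nGroups)' counts])                 % group number and size
spread = max(counts) - min(counts);         % difference between largest and smallest
disp(spread)
```

With 103 items and 10 groups, each group should come out with either 10 or 11 members, and the counts always sum to nItems.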


This process guarantees that the sizes of the largest and smallest groups will differ by no more than 1, which makes it ideal for assigning observations to folds for k-fold cross-validation.
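To make the cross-validation use concrete, here is a rough sketch of how the group vector might drive a k-fold loop. The model fitting itself is left as a comment, since it depends entirely on the problem. The names nFolds, testIdx, trainIdx and testSizes are illustrative only:

```matlab
% Sketch: using the group assignments as folds for k-fold cross-validation.
nItems = 10000;  % number of observations, as in the example above
nFolds = 20;     % number of folds, as in the example above

Group = ceil(nFolds * randperm(nItems)' / nItems);

testSizes = zeros(nFolds, 1);
for fold = 1:nFolds
    testIdx  = (Group == fold);   % logical mask: held-out observations
    trainIdx = ~testIdx;          % everything else is used for training

    % Fit a model on the rows selected by trainIdx and evaluate it on
    % the rows selected by testIdx (details depend on the model).

    testSizes(fold) = sum(testIdx);
end

disp(testSizes')   % each observation is held out exactly once
```

Because every observation belongs to exactly one group, each is held out in exactly one fold and used for training in all the others.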

2 comments:

Anonymous said...

How about dividing the data, say A, randomly into random-sized groups, say contained in B, but still maintain #_rows_A = sum(B)?

Anonymous said...

Just saw that entry now, when I was looking for something to split data with for cross-validation purposes. After I had scripted my own data split function, it turned out that the Neural Networks Toolbox in 2007b also has new functions such as divideint or dividerand. That's exactly where I needed them.