Start by initializing the pseudo-random number generator (PRNG) for reproducible results:

rand('twister',9596)

Being able to reproduce outcomes exactly from run to run is important for several reasons, not the least of which is debugging. If the outcome of a program changes from run to run, it can be very hard to discover what precisely is going wrong.
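As a quick illustration of what reproducibility means here, the sketch below seeds the generator, draws a few numbers, reseeds with the same value, and draws again. The variable names `seed`, `first`, and `second` are made up for this example. It uses the same legacy seeding call as this post. In newer MATLAB releases (R2011a and later, as far as I know) `rng(9596, 'twister')` is the documented way to seed, so treat the call below as tied to the syntax of this post rather than as current best practice.

```matlab
% Seed, draw, reseed with the same value, and draw again.
% 'seed', 'first' and 'second' are illustrative names, not from the post.
seed = 9596;

rand('twister', seed);   % legacy seeding syntax, as used in this post
first = rand(1, 5);

rand('twister', seed);   % same seed again
second = rand(1, 5);

% Identical seeds give identical draws.
assert(isequal(first, second))
```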

With that out of the way, we can assign random groupings, in this case 20 groups for 10,000 individuals:

Group = ceil(20 * randperm(10000)' / 10000);

That's all there is to it. The result, 'Group', is a column vector of 10,000 group assignments, running from 1 to 20. If a different number of groups is desired, change the '20' to some other number. If a different number of items is to be assigned to groups, change the '10000' (in both places) to something else. Just to check on this example, we reach for *tabulate* from the Statistics Toolbox:

tabulate(Group)

      Value    Count   Percent
          1      500      5.00%
          2      500      5.00%
          3      500      5.00%
          4      500      5.00%
          5      500      5.00%
          6      500      5.00%
          7      500      5.00%
          8      500      5.00%
          9      500      5.00%
         10      500      5.00%
         11      500      5.00%
         12      500      5.00%
         13      500      5.00%
         14      500      5.00%
         15      500      5.00%
         16      500      5.00%
         17      500      5.00%
         18      500      5.00%
         19      500      5.00%
         20      500      5.00%
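For readers without the Statistics Toolbox, a similar check can be sketched with `accumarray`, which is part of base MATLAB. This is only one possible way to count group members. The variable name `counts` is an assumption of this example.

```matlab
% Regenerate the assignments so this snippet runs on its own.
rand('twister', 9596);
Group = ceil(20 * randperm(10000)' / 10000);

% accumarray adds a 1 into the bin named by each group label,
% giving a 20-by-1 vector of group sizes ('counts' is an illustrative name).
counts = accumarray(Group, 1);

% Show value, count and percent side by side, much as tabulate does.
disp([(1:numel(counts))', counts, 100 * counts / numel(Group)])
```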

This process guarantees that the sizes of the largest and smallest groups will differ by no more than 1, which makes it ideal for assigning observations to folds for k-fold cross-validation.
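To make the balance claim concrete, here is a minimal sketch that pulls the two constants out into variables, uses an item count that is deliberately not a multiple of the group count, and walks through the folds the way a k-fold cross-validation loop might. The names `nItems`, `nGroups`, `testIdx`, `trainIdx`, and `foldSize` are assumptions of this example, and the model fitting step is left as a placeholder comment.

```matlab
% Illustrative names; none of these appear in the original post.
nItems  = 10003;   % deliberately not a multiple of nGroups
nGroups = 20;

% Same construction as in the post, with the constants pulled out.
Group = ceil(nGroups * randperm(nItems)' / nItems);

% k-fold cross-validation: each group takes one turn as the held-out fold.
foldSize = zeros(nGroups, 1);
for fold = 1:nGroups
    testIdx  = (Group == fold);   % logical mask of held-out items
    trainIdx = ~testIdx;          % everything else is used for fitting
    foldSize(fold) = nnz(testIdx);
    % ... fit a model on trainIdx and evaluate it on testIdx here ...
end

% Every item lands in exactly one fold, and fold sizes differ by at most 1.
assert(sum(foldSize) == nItems)
assert(max(foldSize) - min(foldSize) <= 1)
```

Keeping the fold labels in a single vector like `Group` means the train and test masks can be rebuilt at any time from the fold number alone.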

## 2 comments:

How about dividing the data, say A, randomly into random-sized groups, say contained in B, but still maintain #_rows_A = sum(B)?

Just saw that entry now, when I was looking for something to split data with for cross-validation purposes. After I had scripted my own data split function, it turned out that the Neural Networks Toolbox in 2007b also has new functions such as divideint or dividerand. That's exactly where I needed them.