Data Mining in MATLAB: Dividing Data Into Groups Based On A Key Value (Hashing)

Introduction

A recent posting, Dividing Data Randomly Into Equal-Sized Groups, Feb-19-2007 presented a relatively painless method for dividing data into groups randomly. Sometimes, though, it is desired that items be divided into random groupings repeatably, despite appearing multiple times in the data set.

Consider, for instance, a group of customers who generate new billing statements once per month. Billing data may be drawn over several months for modeling purposes, and a single customer may appear several times in the data (once for each billing month). When that data is divided into "training" and "testing" groups, it would be preferable not to have any given customer appear in both the "training" and "testing" data sets. Random assignment of records to training/testing groups will not obey this requirement.

One solution is to assign records to groups based on some identifier which is unique to the customer, perhaps an account number. Simply dividing account numbers into deciles, for instance, is problematic because account numbers are likely assigned chronologically, meaning that customers with less tenure will end up in one group, while those with more tenure will land in the other group. Many unique identifiers (often called "keys") share this problem: despite not necessarily being meaningful as quantities, their values typically contain systematic biases.

One solution is to hash such identifiers, which means to transform one set of values to another, scrambling them in the process. This is done via a hash function, which ideally will spread out the new values as evenly as possible. In our present application, the hash function will result in far fewer distinct values than in the original data. Such a solution allows the identifier to drive the process (and thus make it exactly repeatable for any given identifier), but random enough to mix the data.

Hash Functions: Modulo Division

A variety of hash functions have been devised, but the most common use modulo division (mod in MATLAB). An example in MATLAB follows:

% A tiny amount of artificial data to work on
AccountNumber = [13301 15256 27441 27831 50668 89001 90012 93108 95667]'

% Hash the AccountNumber
Group = mod(AccountNumber,5) + 1

AccountNumber =

13301
15256
27441
27831
50668
89001
90012
93108
95667

Group =

2
2
2
2
4
2
3
4
3

All 'AccountNumber' values have been "scrambled" deterministically and mapped into the range 1 - 5. The second parameter in the mod function, the modulo divisor, determines the number of distinct hashed values (hence, the number of groups). In practice, experimentation may be necessary to ensure an even distribution of hashed values. Here is a larger example:

% Synthesize a large amount of artificial data to work on
randn('state',27459); % Initialize the PRNG
AccountNumber = unique(ceil(100000 * abs(randn(10000,1))));

% Hash the AccountNumber
Group = mod(AccountNumber,5) + 1;

% Check the distribution
tabulate(Group)

Value Count Percent
1 1850 19.02%
2 1952 20.07%
3 2008 20.65%
4 2015 20.72%
5 1900 19.54%

Though the distribution of 'AccountNumber' is dramatically skewed, its hashed version is very evenly distributed.

So far, so good, but a few matters remain to be cleared up. First, it is generally recommended that the modulo divisor be prime. It is easy enough to discover primes using MATLAB's primes function:

primes(500)

ans =

Columns 1 through 21

2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73

Columns 22 through 42

79 83 89 97 101 103 107 109 113 127 131 137 139 149 151 157 163 167 173 179 181

Columns 43 through 63

191 193 197 199 211 223 227 229 233 239 241 251 257 263 269 271 277 281 283 293 307

Columns 64 through 84

311 313 317 331 337 347 349 353 359 367 373 379 383 389 397 401 409 419 421 431 433

Columns 85 through 95

439 443 449 457 461 463 467 479 487 491 499

...but what if the desired number of groups is not prime? Many problems, for instance, involve dividing items into 100 groups to permit grouping at the percent level. Although the prime 101 is close to 100, it s not close enough. The solution I usually employ is to cascade several hash functions, each only operating on the out-of-range values from its predecessors. All modulos use divisors which are larger than the desired number of groups except the last one, which uses the largest prime less than the desired number of groups. Following is a detailed example in which the data starts off with a divisor of 17, and scopes down sequentially to 13, 11 and finally 7 (the largest prime smaller than the desired group count of 10):

% Synthesize a large amount of artificial data to work on
randn('state',27459); % Initialize the PRNG
AccountNumber = unique(ceil(100000 * abs(randn(10000,1))));

Group = mod(AccountNumber,17) + 1;

tabulate(Group)

Value Count Percent
1 606 6.23%
2 563 5.79%
3 545 5.60%
4 559 5.75%
5 556 5.72%
6 600 6.17%
7 561 5.77%
8 599 6.16%
9 546 5.61%
10 583 5.99%
11 573 5.89%
12 578 5.94%
13 566 5.82%
14 596 6.13%
15 523 5.38%
16 558 5.74%
17 613 6.30%

>> ROI = (Group > 10); Group(ROI) = mod(AccountNumber(ROI),13) + 1;

>> tabulate(Group)

Value Count Percent
1 915 9.41%
2 883 9.08%
3 855 8.79%
4 854 8.78%
5 849 8.73%
6 896 9.21%
7 865 8.89%
8 913 9.39%
9 880 9.05%
10 874 8.99%
11 310 3.19%
12 315 3.24%
13 316 3.25%

>> ROI = (Group > 10); Group(ROI) = mod(AccountNumber(ROI),11) + 1;

>> tabulate(Group)

Value Count Percent
1 1013 10.42%
2 960 9.87%
3 936 9.62%
4 954 9.81%
5 932 9.58%
6 994 10.22%
7 958 9.85%
8 986 10.14%
9 959 9.86%
10 961 9.88%
11 72 0.74%

>> ROI = (Group > 10); Group(ROI) = mod(AccountNumber(ROI),7) + 1;

>> tabulate(Group)

Value Count Percent
1 1021 10.50%
2 971 9.98%
3 946 9.73%
4 966 9.93%
5 944 9.71%
6 1004 10.32%
7 967 9.94%
8 986 10.14%
9 959 9.86%
10 961 9.88%

The calls to tabulate are only included for explanatory purposes and are obviously unnecessary. Note that the final stage deals with only 72 out-of-range values, and distributes them among values 1 - 7. If managed properly, this slight quirk should have negligible effect on the outcome.

Another issue is how to devise different hash functions for different purposes. For instance, it might be desired that a 10% sample of all customers be selected for a marketing campaign, and, separately, a 30% sample be drawn for modeling purposes. Two different hash functions could solve this problem, if their respective outputs were independent. Two achieve different, but repeatable, results one might scope in with a divisor sequence of 149, 127, 109, 97, and the other might use 233, 197, 151, 107, 97.

One variation on this theme is to use an initial hash function to divide the data among a small number of other hash functions. For instance, start by hashing the key to 1 of 11 values (11 is prime). Use this hash value to select which among 3 hash functions to apply to the key to get the final result: 1-3: hash function 1, 4-7: hash function 2, 8-11: hash function 3. This provides a way of creatively generating a variety of hash functions when many are needed.

Hash Functions: Midsquare

Modulo division is not the only way to hash numeric data. Another option is the "midsquare" method, which simply involves squaring the key and extracting digits from the middle of the result. Here is an example:

% A tiny amount of artificial data to work on
AccountNumber = [13301 15256 27441 27831 50668 89001 90012 93108 95667]'

AccountNumber =

13301
15256
27441
27831
50668
89001
90012
93108
95667

% First, square the key
MidSquare = (AccountNumber .^ 2);

% Remove the rightmost 2 digits
MidSquare = floor(MidSquare / 100);

% Remove the leftmost digits, leaving only 3 digits
MidSquare = MidSquare - 1000 * floor(MidSquare / 1000)

MidSquare =

166
455
84
645
462
780
601
996
748

Conclusion

All of the hash functions described here have dealt with numeric data (specifically integers). While hash functions can be invented for other data types, it is more common to convert them to integers and use one of the more common integer-based hash functions.

One last thing I'd like to point out: hash functions provide portability which random number generators don't. Hashing means that it's not necessary to drag around an enormous look-up table with every 'AccountNumber' and it's assigned group. In fact new data files which have new account numbers can even be accommodated with complete consistency.

Data Mining in MATLAB

Thursday, February 22, 2007

Dividing Data Into Groups Based On A Key Value (Hashing)

1 comment: