Tuesday, February 27, 2007

Poll Results (Feb-20-2007): Toolbox Use

The Feb-20-2007 posting asked the poll question, Which Toolboxes from the MathWorks do you use (check all that apply)? After approximately 1 week, 38 votes were cast in total (including 1 by myself). The poll results follow:

1. Curve Fitting Toolbox (1 vote) 3%
2. Fuzzy Logic Toolbox (2 votes) 5%
3. Genetic Algorithm Toolbox (1 vote) 3%
4. Image Processing Toolbox (8 votes) 21%
5. Neural Network Toolbox (2 votes) 5%
6. Optimization Toolbox (6 votes) 16%
7. Signal Processing Toolbox (5 votes) 13%
8. Spline Toolbox (0 votes) 0%
9. Statistics Toolbox (13 votes) 34%

Not surprisingly, the Statistics Toolbox is the most popular among this crowd. The use of some of the other analytical toolboxes (such as the Neural Network and Fuzzy Logic Toolboxes) is also expected, though I was surprised both by the complete absence of Spline Toolbox users and by the fairly good turn-out for the Optimization Toolbox. Given the popularity of the Jan-26-2007 posting, Pixel Classification Project, the 8 votes cast for the Image Processing Toolbox are no surprise.

Bootstrapping just the Statistics Toolbox results gives:


>> % Simulate 10,000 polls of n votes, each vote choosing the Statistics Toolbox with probability p
>> n = 38; p = 13 / 38; prctile(mean(double(rand(n,10000) < p)),[5 95])

ans =

0.2105 0.4737


So, with 90% confidence, readers' use of the Statistics Toolbox lies somewhere between 21% and 47%.

Thursday, February 22, 2007

Dividing Data Into Groups Based On A Key Value (Hashing)

Introduction

A recent posting, Dividing Data Randomly Into Equal-Sized Groups (Feb-19-2007), presented a relatively painless method for dividing data into groups randomly. Sometimes, though, items must be assigned to random groups repeatably, even when they appear multiple times in the data set.

Consider, for instance, a group of customers who generate new billing statements once per month. Billing data may be drawn over several months for modeling purposes, and a single customer may appear several times in the data (once for each billing month). When that data is divided into "training" and "testing" groups, it would be preferable not to have any given customer appear in both the "training" and "testing" data sets. Random assignment of records to training/testing groups will not obey this requirement.

One solution is to assign records to groups based on some identifier which is unique to the customer, perhaps an account number. Simply dividing account numbers into deciles, for instance, is problematic because account numbers are likely assigned chronologically, meaning that customers with less tenure will end up in one group, while those with more tenure will land in the other group. Many unique identifiers (often called "keys") share this problem: despite not necessarily being meaningful as quantities, their values typically contain systematic biases.

One solution is to hash such identifiers, which means transforming one set of values into another, scrambling them in the process. This is done via a hash function, which ideally spreads the new values out as evenly as possible. In our present application, the hash function will result in far fewer distinct values than in the original data. Such a solution lets the identifier drive the process (making the assignment exactly repeatable for any given identifier), yet remains random enough to mix the data.


Hash Functions: Modulo Division

A variety of hash functions have been devised, but the most common use modulo division (mod in MATLAB). An example in MATLAB follows:


% A tiny amount of artificial data to work on
AccountNumber = [13301 15256 27441 27831 50668 89001 90012 93108 95667]'

% Hash the AccountNumber
Group = mod(AccountNumber,5) + 1


AccountNumber =

13301
15256
27441
27831
50668
89001
90012
93108
95667


Group =

2
2
2
2
4
2
3
4
3


All 'AccountNumber' values have been "scrambled" deterministically and mapped into the range 1 - 5. The second parameter in the mod function, the modulo divisor, determines the number of distinct hashed values (hence, the number of groups). In practice, experimentation may be necessary to ensure an even distribution of hashed values. Here is a larger example:


% Synthesize a large amount of artificial data to work on
randn('state',27459); % Initialize the PRNG
AccountNumber = unique(ceil(100000 * abs(randn(10000,1))));

% Hash the AccountNumber
Group = mod(AccountNumber,5) + 1;

% Check the distribution
tabulate(Group)

Value Count Percent
1 1850 19.02%
2 1952 20.07%
3 2008 20.65%
4 2015 20.72%
5 1900 19.54%


Though the distribution of 'AccountNumber' is dramatically skewed, its hashed version is very evenly distributed.

So far, so good, but a few matters remain to be cleared up. First, it is generally recommended that the modulo divisor be prime. It is easy enough to discover primes using MATLAB's primes function:


primes(500)

ans =

Columns 1 through 21

2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73

Columns 22 through 42

79 83 89 97 101 103 107 109 113 127 131 137 139 149 151 157 163 167 173 179 181

Columns 43 through 63

191 193 197 199 211 223 227 229 233 239 241 251 257 263 269 271 277 281 283 293 307

Columns 64 through 84

311 313 317 331 337 347 349 353 359 367 373 379 383 389 397 401 409 419 421 431 433

Columns 85 through 95

439 443 449 457 461 463 467 479 487 491 499


...but what if the desired number of groups is not prime? Many problems, for instance, involve dividing items into 100 groups to permit grouping at the percent level. Although the prime 101 is close to 100, it's not close enough. The solution I usually employ is to cascade several hash functions, each operating only on the out-of-range values from its predecessors. All modulos use divisors which are larger than the desired number of groups except the last one, which uses the largest prime less than the desired number of groups. Following is a detailed example in which the data starts off with a divisor of 17, and scopes down sequentially to 13, 11 and finally 7 (the largest prime smaller than the desired group count of 10):


% Synthesize a large amount of artificial data to work on
randn('state',27459); % Initialize the PRNG
AccountNumber = unique(ceil(100000 * abs(randn(10000,1))));

Group = mod(AccountNumber,17) + 1;

tabulate(Group)

Value Count Percent
1 606 6.23%
2 563 5.79%
3 545 5.60%
4 559 5.75%
5 556 5.72%
6 600 6.17%
7 561 5.77%
8 599 6.16%
9 546 5.61%
10 583 5.99%
11 573 5.89%
12 578 5.94%
13 566 5.82%
14 596 6.13%
15 523 5.38%
16 558 5.74%
17 613 6.30%

>> ROI = (Group > 10); Group(ROI) = mod(AccountNumber(ROI),13) + 1;

>> tabulate(Group)

Value Count Percent
1 915 9.41%
2 883 9.08%
3 855 8.79%
4 854 8.78%
5 849 8.73%
6 896 9.21%
7 865 8.89%
8 913 9.39%
9 880 9.05%
10 874 8.99%
11 310 3.19%
12 315 3.24%
13 316 3.25%

>> ROI = (Group > 10); Group(ROI) = mod(AccountNumber(ROI),11) + 1;

>> tabulate(Group)

Value Count Percent
1 1013 10.42%
2 960 9.87%
3 936 9.62%
4 954 9.81%
5 932 9.58%
6 994 10.22%
7 958 9.85%
8 986 10.14%
9 959 9.86%
10 961 9.88%
11 72 0.74%

>> ROI = (Group > 10); Group(ROI) = mod(AccountNumber(ROI),7) + 1;

>> tabulate(Group)

Value Count Percent
1 1021 10.50%
2 971 9.98%
3 946 9.73%
4 966 9.93%
5 944 9.71%
6 1004 10.32%
7 967 9.94%
8 986 10.14%
9 959 9.86%
10 961 9.88%


The calls to tabulate are only included for explanatory purposes and are obviously unnecessary. Note that the final stage deals with only 72 out-of-range values, and distributes them among values 1 - 7. If managed properly, this slight quirk should have negligible effect on the outcome.

Another issue is how to devise different hash functions for different purposes. For instance, it might be desired that a 10% sample of all customers be selected for a marketing campaign and, separately, a 30% sample be drawn for modeling purposes. Two different hash functions could solve this problem, if their respective outputs were independent. To achieve different, but repeatable, results, one might scope in with a divisor sequence of 149, 127, 109, 97, and the other might use 233, 197, 151, 107, 97.
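
One convenient way to manage such divisor sequences is to wrap the cascade in a small function. The following is only a sketch (the function name cascadehash and its interface are mine, not something from the original example):

function Group = cascadehash(Key,nGroups,Divisors)
% CASCADEHASH  Hash integer keys into the range 1 - nGroups by cascading
% modulo divisions.  Each divisor operates only on keys left out of range
% by its predecessors; the last divisor should be the largest prime
% smaller than nGroups, so the final pass cannot leave anything out of range.
Group = mod(Key,Divisors(1)) + 1;
for i = 2:length(Divisors)
    ROI = (Group > nGroups);
    Group(ROI) = mod(Key(ROI),Divisors(i)) + 1;
end

The worked example above is then cascadehash(AccountNumber,10,[17 13 11 7]), and the two samples just described might use cascadehash(AccountNumber,100,[149 127 109 97]) and cascadehash(AccountNumber,100,[233 197 151 107 97]).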

One variation on this theme is to use an initial hash function to divide the data among a small number of other hash functions. For instance, start by hashing the key to 1 of 11 values (11 is prime). Use this hash value to select which among 3 hash functions to apply to the key to get the final result: 1-3: hash function 1, 4-7: hash function 2, 8-11: hash function 3. This provides a way of creatively generating a variety of hash functions when many are needed.
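
Here is a minimal sketch of that selection scheme, reusing the hypothetical cascadehash helper above (the particular divisor sequences are arbitrary choices for illustration):

% Hash the key to 1 of 11 values, then use that value to select which of
% three cascaded hash functions produces the final group (1 - 10)
Selector = mod(AccountNumber,11) + 1;
Group = zeros(size(AccountNumber));
ROI = (Selector <= 3);                          % selectors 1-3: hash function 1
Group(ROI) = cascadehash(AccountNumber(ROI),10,[17 13 11 7]);
ROI = (Selector >= 4) & (Selector <= 7);        % selectors 4-7: hash function 2
Group(ROI) = cascadehash(AccountNumber(ROI),10,[19 13 11 7]);
ROI = (Selector >= 8);                          % selectors 8-11: hash function 3
Group(ROI) = cascadehash(AccountNumber(ROI),10,[23 13 11 7]);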


Hash Functions: Midsquare

Modulo division is not the only way to hash numeric data. Another option is the "midsquare" method, which simply involves squaring the key and extracting digits from the middle of the result. Here is an example:



% A tiny amount of artificial data to work on
AccountNumber = [13301 15256 27441 27831 50668 89001 90012 93108 95667]'

AccountNumber =

13301
15256
27441
27831
50668
89001
90012
93108
95667

% First, square the key
MidSquare = (AccountNumber .^ 2);

% Remove the rightmost 2 digits
MidSquare = floor(MidSquare / 100);

% Remove the leftmost digits, leaving only 3 digits
MidSquare = MidSquare - 1000 * floor(MidSquare / 1000)

MidSquare =

166
455
84
645
462
780
601
996
748



Conclusion

All of the hash functions described here have dealt with numeric data (specifically integers). While hash functions can be invented for other data types, it is more common to convert such data to integers and then apply an integer-based hash function like those above.
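
For character keys, for instance, one common approach is to treat the characters as digits of a large number, reduced modulo a large prime, and then hash the result as usual. The following is only a sketch (the key and the constants are arbitrary):

% Reduce a character key to an integer, then hash the integer as before
Key = 'CUST-0047123';                 % a hypothetical character key
k = 0;
for c = double(Key)
    k = mod(256 * k + c, 2^31 - 1);   % base-256 accumulation, kept exactly representable
end
Group = mod(k,97) + 1;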

One last thing I'd like to point out: hash functions provide a portability which random number generators don't. Hashing means that it is not necessary to drag around an enormous look-up table pairing every 'AccountNumber' with its assigned group. In fact, new data files containing previously unseen account numbers can be accommodated with complete consistency.

Tuesday, February 20, 2007

Poll (Feb-20-2007) Which Toolboxes?

The last poll turned up the interesting finding that all respondents use Toolboxes. This time we'll learn which Toolboxes (from the MathWorks) readers use. Please feel free to comment on this post to indicate what other Toolboxes you use (whether from the MathWorks or not).

This poll is closed.

See the poll results in the Feb-27-2007 post, Poll Results (Feb-20-2007): Toolbox Use.

Thanks for voting!

Monday, February 19, 2007

Dividing Data Randomly Into Equal-Sized Groups

This is a quick note on dividing items randomly into equal-sized groups. This is an even quicker tip than yesterday's Dividing Values Into Equal-Sized Groups, since in this case, the original data does not affect the outcome.

Start by initializing the pseudo-random number generator (PRNG) for reproducible results:


rand('twister',9596)


Being able to reproduce outcomes exactly from run to run is important for several reasons, not the least of which is debugging. If the outcome of a program changes from run to run, it can be very hard to discover what precisely is going wrong.

With that out of the way, we can assign random groupings, in this case 20 groups for 10,000 individuals:


Group = ceil(20 * randperm(10000)' / 10000);


That's all there is to it. The result, 'Group', is a column vector with 10,000 group assignments, running from 1 to 20. If a different number of groups is desired, change the '20' to some other number. If a different number of items are to be assigned groups, change the '10000' (in both places) to something else. Just to check on this example, we reach for tabulate from the Statistics Toolbox:


tabulate(Group)
Value Count Percent
1 500 5.00%
2 500 5.00%
3 500 5.00%
4 500 5.00%
5 500 5.00%
6 500 5.00%
7 500 5.00%
8 500 5.00%
9 500 5.00%
10 500 5.00%
11 500 5.00%
12 500 5.00%
13 500 5.00%
14 500 5.00%
15 500 5.00%
16 500 5.00%
17 500 5.00%
18 500 5.00%
19 500 5.00%
20 500 5.00%


This process guarantees that the sizes of the largest and smallest groups will differ by no more than 1, and is ideal for assigning observations to folds for k-fold cross-validation.
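
For instance, here is a minimal sketch of a k-fold loop built on this assignment (the model-fitting step is only a placeholder):

% Use each of the 20 groups in turn as the held-out fold
for Fold = 1:20
    TestROI  = (Group == Fold);       % observations held out this fold
    TrainROI = ~TestROI;              % all remaining observations
    % ... fit the model using TrainROI, evaluate it using TestROI ...
end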

Sunday, February 18, 2007

Dividing Values Into Equal-Sized Groups

This is just a quick tip for MATLAB users who need to divide collections of values into even-sized groups. If one is fortunate enough to have the MATLAB Statistics Toolbox available, the tiedrank function is very handy for this sort of thing. (It's not hard to build something similar to tiedrank using sort, anyway.)

This will best be explained via example, so let's generate some example data:


>> rand('twister',951); X = rand(8,1)

X =

0.5798
0.0504
0.0241
0.7555
0.6569
0.3020
0.2042
0.5651


It is desired that this data be divided into 4 equal-sized groups, by magnitude. The following line does this:



>> F = ceil(4 * tiedrank(X) / length(X))

F =

3
1
1
4
4
2
2
3


The variable F now contains an integer code representing the assigned group number, with 1 denoting the group containing the smallest values. Notice that there are two each of the values 1, 2, 3 and 4. Also notice how easy it is to extract all members of any given group. Here, for example, the members of the lowest-valued group are displayed:


>> X(F == 1)

ans =

0.0504
0.0241


Here is an example using a much larger number of items:


>> rand('twister',951); X = rand(10000,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 25.00%
2 2500 25.00%
3 2500 25.00%
4 2500 25.00%


The tabulate function, from the Statistics Toolbox, is a convenient routine for generating frequency tables. Again, notice the even distribution of values across bins. What happens if the total count cannot be divided evenly among the bins? Let's see:


>> rand('twister',951); X = rand(10001,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 25.00%
2 2500 25.00%
3 2500 25.00%
4 2501 25.01%


This procedure ensures that the counts in the smallest bin and the largest bin never differ by more than 1, assuming that this is possible. Observe the progression, as the total count is incremented:


>> rand('twister',951); X = rand(10002,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 25.00%
2 2501 25.00%
3 2500 25.00%
4 2501 25.00%

>> rand('twister',951); X = rand(10003,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 24.99%
2 2501 25.00%
3 2501 25.00%
4 2501 25.00%

>> rand('twister',951); X = rand(10004,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2501 25.00%
2 2501 25.00%
3 2501 25.00%
4 2501 25.00%


The only catch is the case of multiple instances of the same value. The tiedrank function prevents the repeated values from being broken up. Here is an example with such data:


>> X = [1 2 2 2 5 6 7 8]'

X =

1
2
2
2
5
6
7
8

>> F = ceil(4 * tiedrank(X) / length(X))

F =

1
2
2
2
3
3
4
4


The distribution among bins is now uneven, but tiedrank is doing the best it can. Depending on what is needed, this may be justified. If, however, one needs to break these repeated values apart, even if arbitrarily, then tiedrank may easily be replaced with another procedure which does this; one such replacement is sketched below.
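
Here is a minimal sketch of such a replacement, built on sort (this substitute breaks ties by original position, which is arbitrary but repeatable for a fixed ordering of X):

% Ordinal ranks with ties broken arbitrarily, then the same binning as before
[Ignore, Order] = sort(X);            % ascending order of the values
R = zeros(size(X));
R(Order) = 1:length(X);               % rank 1 = smallest value
F = ceil(4 * R / length(X))

Applied to the repeated-value example above, this yields two observations per bin.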


This process is useful for assigning values to quantiles or n-iles, such as deciles or percentiles. Applied to randomly-generated values, it is also useful for assigning cases to folds for stratified k-fold cross-validation, or to strata for stratified sampling.


See Also

Feb-10-2007 posting, Stratified Sampling.

Sunday, February 11, 2007

Poll Results (Feb-04-2007): Toolbox Use

The Feb-04-2007 posting, Poll (Feb-04-2007): Toolbox Use, featured the following poll question: Which MATLAB Toolboxes, if any, do you use? After 1 week, the poll has closed, with 27 Data Mining in MATLAB readers responding (1 of whom was myself). The final results are:

1. None (base MATLAB product only) (0 Votes)
2. MATLAB Toolboxes from the MathWorks only (11 Votes)
3. MATLAB Toolboxes from other sources only (1 Vote)
4. MATLAB Toolboxes both from the MathWorks and other sources (15 Votes)

Interestingly, all reported using toolboxes. All voters but 1 indicated using toolboxes from the MathWorks.

How significant are these results? A quick bootstrap of 100,000 replicates yields the following:


>> % Bootstrap: resample the 27 votes, with replacement, 100,000 times
>> Poll = [repmat(2,11,1); repmat(3,1,1); repmat(4,15,1)];
>> Replicates = Poll(ceil(27 * rand(27,1e5)));
>> Count2 = mean(Replicates == 2);   % proportion choosing option 2 in each replicate
>> Count3 = mean(Replicates == 3);
>> Count4 = mean(Replicates == 4);
>> prctile(Count2,[5 95])

ans =

0.2593 0.5556

>> prctile(Count3,[5 95])

ans =

0 0.1111

>> prctile(Count4,[5 95])

ans =

0.4074 0.7037


So (roughly, given the limited precision a sample this small affords), with 90% confidence, use of MathWorks Toolboxes only is somewhere between 26% and 56%, use of non-MathWorks Toolboxes only is between 0% and 11%, and use of both is between 41% and 70%.

Thanks to everyone who voted!

Saturday, February 10, 2007

Stratified Sampling

Introduction

In my posting of Nov-09-2006, Simple Random Sampling (SRS), I explained simple random sampling and noted some of its weaknesses. This post will cover stratified random sampling, which addresses those weaknesses.

Stratified sampling provides the analyst with more control over the sampling process. A typical use of stratified sampling is to control the distribution of the variables being sampled. For instance, imagine a data set containing 200 observations, 100 of which are men, and 100 of which are women. Assume that this data set is to be split into two equal-sized groups, for control and treatment testing. Half of the subjects will receive some treatment which is under review (a drug, marketing campaign, etc.), while the other group is held out as a control and receives no treatment. A simple random sampling procedure will result in two groups, each with 50 men and 50 women, more or less. The "more or less" is the awkward part. Some simple random samples will result in a 46/54 split of men (and, in this case, the reverse, 54/46, for women). After an experiment, how will the experimenter know whether any measured differences are due to control versus treatment, or the difference in the respective proportions of men and women? It would be beneficial to control such factors.

When using simple random sampling, deviations from the expected distributions can be substantial. Generally, three factors aggravate this issue:

1. Smaller observation counts
2. More variables to be controlled
3. Higher skew in variables to be controlled

Even very large data may exhibit this problem. Consider the problem of applying treatments (marketing campaigns, for instance) to loan customers at a bank. At the beginning of the experiment, it is reasonable to expect that important variables be distributed similarly among treatment cells. Such variables might include current balance, credit score and loan type. Even a rather large data set may not split well along all of these dimensions.


A Simple Example

Consider a simple situation, in which there are 100,000 observations, 99,000 of which are of class A, and 1,000 of which are of class B. A minority class representation of 1% is not uncommon, and some important problems have even more class imbalance. A model is to be constructed to classify future cases as belonging to one class or the other. A train/test split of 70%/30% has been specified. To ensure that the training and testing data sets have similar proportions of classes A and B, the sampling will be stratified by class. Let's get started:

% Generate some example data (yes, it's very artificial and in order; don't worry about that!)
SimpleData = [randn(100000,5) [zeros(99000,1); ones(1000,1)]];


There are now 5 predictor variables and the target (in the last column) stored in SimpleData.


% Count the examples
n = size(SimpleData,1);


The first task is to identify the distinct strata which are to be sampled, and calculate their respective frequencies. In this case, that would be the two classes:

% Locate observations in each class
ClassA = (SimpleData(:,end) == 0);
ClassB = (SimpleData(:,end) == 1);


% We already know these, but in real-life they'd need to be calculated
nClassA = sum(double(ClassA));
nClassB = sum(double(ClassB));


Next, space is allocated for an integer code representing the segment, with a value of 1 for "training" or a 2 for "testing":


% Create train/test code values
Train = 1;
Test = 2;

% Allocate space for train/test indicator
Segment = repmat(Test,n,1); % Default to the last group


Next, we check a few things about the stratifying variable(s):


% Determine distinct strata
DistinctStrata = unique(SimpleData(:,end));

% Count distinct strata
nDistinctStrata = size(DistinctStrata,1);



For rigor's sake, randperm should be initialized at the beginning of this process, which is done by initializing rand (see my Jan-13-2007 posting, Revisiting rand (MATLAB 2007a)):


% Initialize PRNG
rand('state',29182);


Loop over the strata, splitting each as closely as possible (within one unit) at the 70/30 mark:


% Loop over strata
for Stratum = 1:nDistinctStrata
% Establish region of interest
ROI = find(SimpleData(:,end) == DistinctStrata(Stratum));

% Determine size of region of interest
nROI = length(ROI);

% Generate a scrambled ordering of 'nROI' items
R = randperm(nROI);

% Assign appropriate number of units to Training group
Segment(ROI(R(1:round(0.70 * nROI)))) = Train;

end


Done! Now, let's check our work:


>> mean(SimpleData(Segment == 1,end))

ans =

0.0100

>> mean(SimpleData(Segment == 2,end))

ans =

0.0100


Both the training and testing data sets have a 1% Class B rate. Note that stratified sampling will sometimes deviate from the expected distributions because strata can only be divided into sets of whole samples. With enough strata, this tiny error (never off by more than 0.5 samples per stratum) may add up to a small discrepancy from the exact designed distribution. Regardless, stratified sampling preserves the distributions of the stratified variables much better than simple random sampling.


Epilogue

The code in the example given was designed for clarity, not efficiency, so feel free to modify it for execution time and storage considerations.

Typically, numeric variables are stratified by dividing them into segments, such as deciles. Their original numeric values are still used, but each segment is treated as one stratum.
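
For example, here is a minimal sketch using tiedrank (as in the Feb-18-2007 posting, Dividing Values Into Equal-Sized Groups) to turn a hypothetical numeric variable into decile strata:

% Stand-in numeric variable (imagine a current balance); form 10 strata by decile
Balance = SimpleData(:,1);
BalanceStratum = ceil(10 * tiedrank(Balance) / length(Balance));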

When dealing with multiple stratifying variables, it is suggested that unique(X,'rows') be used over the set of stratifying variables to obtain all distinct combinations of single-variable strata which actually occur in the data. Beware that using too many stratifying variables or too many strata per variable may result in a large number of (multivariable) strata, many of which are very sparsely populated.
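
A minimal sketch, combining the hypothetical decile strata above with the class column:

% Every observed combination of decile stratum and class becomes one stratum
Strata = [BalanceStratum SimpleData(:,end)];
[DistinctStrata, Ignore, StratumCode] = unique(Strata,'rows');
nDistinctStrata = size(DistinctStrata,1);   % only combinations which actually occur
% 'StratumCode' can then drive a loop like the one in the example above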

Stratified sampling is highly effective at avoiding the sometimes arbitrary results of simple random sampling, and is useful in assigning observations in control/test, train/test/(validate) and k-fold cross validation designs.


Further reading

Sampling: Design and Analysis, by Sharon L. Lohr (ISBN: 0-534-35361-4)

Sunday, February 04, 2007

Poll (Feb-04-2007): Toolbox Use

I am curious about readers' use of Toolboxes (from the MathWorks or elsewhere) in their MATLAB programs. Please answer the following poll on Toolboxes:

This poll is closed.

See the poll results in the Feb-11-2007 posting, Poll Results (Feb-04-2007): Toolbox Use

Thanks for participating!

Friday, February 02, 2007

Pixel Classification Project: Response

The Jan-26-2007 posting, Pixel Classification Project, generated quite a response (and some confusion!). Having received a number of responses, both in the Comments section and via e-mail, I will answer questions and comments in the sections below. Many thanks for your (collective) interest!


Details About The Pixel Classifier's Operation

1. The pixel classifier assesses individual pixels, not entire images. It is applied separately to all pixels within a subject image. The fact that the training images were composed entirely of pixels from one class or the other was merely a logistical convenience, since the analyst would not have to label areas of the training images by class. Ultimately, the pixel classifier operates on a small window of pixels and predicts the class (more precisely, the probability of class membership) of the pixel at the center of the window. This is why the new images (visible near the end of the original posting) are shaded in: the classifier is scanned over the entire image, evaluating each pixel separately.

2. While there is obviously a required component of craft in constructing the whole thing, the largest direct infusions of human knowledge to the actual classifier come from: 1. the manual labeling of images as "foliage" / "non-foliage", and 2. the construction of the hue2 feature. hue2 is not strictly necessary, and what the classifier knows, it has learned.

3. The hue-saturation-value (HSV) color components are a relatively simple transformation of the red-green-blue (RGB) color values already in the model. Although they do not bring "new" information, they may improve model performance by providing a different representation of the color data.

4. Which of the entire set of 11 input variables is "most important", I cannot say (although I suspect that the classifier is driven by green color and high-activity texture). As mentioned in the original posting, rigorous testing and variable selection were not performed. If I post another image processing article, it will likely be more thorough.

5. The edge detector variables measure contrast across some distance around the center pixel (I can supply the MATLAB code to any interested parties). The 5x5 edge detector summarizes the differences in brightness of pixels on opposite sides of a 5-pixel-by-5-pixel square surrounding the pixel of interest; a rough sketch of a feature in this spirit follows this list. The other edge detectors consider larger squares about the pixel of interest. The varying-sized edge detectors measure texture over different scales. There is nothing special about these particular edge detectors. I chose them only because they are fast to calculate and I had already built them. I would consider using other image processing operators (Sobel edge detector, Laws texture features, window brightness standard deviation, etc.) in any future pixel classifier.
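
To give a concrete flavor of such a feature, here is only a rough guess at an implementation in the same spirit, not the code actually used in the project ('RGBImage' is a hypothetical stand-in for the subject image):

% Contrast between opposite halves of the 5x5 window around each pixel
Gray  = mean(double(RGBImage),3);                   % simple brightness channel
KernH = [ones(5,2) zeros(5,1) -ones(5,2)] / 10;     % left half minus right half
KernV = KernH';                                     % top half minus bottom half
Edge5 = abs(filter2(KernH,Gray)) + abs(filter2(KernV,Gray));

The larger edge detectors mentioned above would presumably follow the same pattern over bigger windows.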


(Possible) Future Developments

1. This process could indeed be applied to other types of data, such as audio. I was actually thinking about doing this, and given the interest in this posting, will consider either an audio project or a more thorough image processing project for the future (any preferences?). Reader suggestions are very welcome.

2. Detection of more complex items (people, automobiles, etc.) might be possible by combining a number of pixel classifiers. Much research has been undertaken in an effort to solve that problem, and the attempted solutions are too numerous to list here.

3. I strongly encourage readers to experiment in this field. Anyone undertaking such a project should feel free to contact me for any assistance I may be able to provide.


I will take up other potential applications of this idea with individual readers via other channels, although I will say that pixel-level classification is being performed already, both by governments (including the military) and in the private sector.

Some examples of other writing using this general approach follow. The nice thing about this sort of work is that even if one doesn't fully understand the white paper or report, it is always possible to appreciate what the author has done by looking at the pictures.

Machine Learning Applied to Terrain Classification for Autonomous Mobile Robot Navigation, by Rogers and Lookingbill

A Support Vector Machine Approach For Object Based Image Analysis, by Tzotsos

Interactively Training Pixel Classifiers, by Piater, Riseman and Utgoff

Texture classification of logged forests in tropical Africa using machine-learning algorithms, by Chan, Laporte and Defries

A Survey on Pixel-Based Skin Color Detection Techniques, by Vezhnevets, Sazonov and Andreeva

Feature Extraction From Digital Imagery: A Hierarchical Method, by Mangrich, Opitz and Mason