Friday, January 26, 2007

Pixel Classification Project

Being interested in both machine learning and image processing, I built, on a lark, a pixel-level classifier which outputs the probability that a given pixel belongs to the class "foliage". The project, in summary, followed these steps:

Pixel Classification Project Steps
1. Collect images, each containing pixels from only one class of interest
2. Extract samples (small windows surrounding pixels of interest) from images
3. Calculate derived features
4. Train classifier to distinguish between "foliage" and "not foliage" classes
5. Apply learned classifier to test images containing pixels from both classes

The salient details of each step follow:

1. Data Acquisition
Thirty-five images of each class ("foliage" and "non-foliage") were acquired. All training images were downloaded from the World Wide Web, after being located by AllTheWeb (Pictures). All images needed to be of at least moderate resolution (about 640x480) to provide enough information to render accurate classifications.

For the "foliage" class, search terms such as "foliage", "leaves" and "grass" were used. Images in this class were screened visually to include images which contained foliage, and only foliage, meaning leaves, plant stalks and stems. Images containing any extraneous elements (colorful flowers, pets, children, wires, rocks, etc.) were excluded.

For the "non-foliage" class, arbitrary search terms were employed, which would likely find things other than plants, like "nail", "dancer", "hallway", etc. Images in this class were visually screened to include anything but foliage. Indoor scenes containing small potted plants, for instance, were excluded.

It would also have been possible to train on images containing mixtures of "foliage" and "non-foliage" pixels, but this would have required determining pixel class (at least for homogeneous regions within images) by hand. I tried this in a similar, earlier experiment, and can report that: 1. manual identification of pixels is time-consuming; 2. region-by-region classing complicates the next step, sampling; and 3. I suspect this approach is less accurate for pixels near the edge of a homogeneous region (which are much harder to distinguish).

2. Data Sampling
One thousand samples were drawn from each image, for a grand total of 70,000 examples (= 2 classes x 35 images each x 1,000 samples per image). Each sample was composed of the color information from a small window surrounding a pixel of interest, drawn at a random location within the image. The pixel at the center of the window was labeled "foliage" if it came from the "foliage" set of images, and "non-foliage" if it came from the "non-foliage" set.
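The sampling step might be sketched in MATLAB roughly as follows; the file name, variable names, and border margin are illustrative assumptions, not the original code:

```matlab
% Draw random window centers, keeping each far enough from the image
% border that the largest window used later (21x21) still fits.
Image = imread('foliage01.jpg');     % hypothetical file name
[R,C,P] = size(Image);
Half = 10;                           % half-width of the largest window
n = 1000;                            % samples per image
Rows = Half + 1 + floor((R - 2*Half) * rand(n,1));
Cols = Half + 1 + floor((C - 2*Half) * rand(n,1));

% Color window surrounding the first sampled pixel:
Window = Image(Rows(1)-Half:Rows(1)+Half, Cols(1)-Half:Cols(1)+Half, :);
```
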

3. Derived Features
In addition to the raw red, green and blue values, my program calculated several derived features from the image data. In all, 11 features were used:

1. red
2. green
3. blue
4. hue
5. saturation
6. value
7. hue2
8. edge detector: 5x5
9. edge detector: 9x9
10. edge detector: 13x13
11. edge detector: 21x21

Hue, saturation and value, taken together, are another way of representing colors and are easily calculated using MATLAB's rgb2hsv function. The "hue2" variable is a fuzzy function of the hue variable, using a curved plateau function (see my posting of Nov-16-2006, Fuzzy Logic In MATLAB Part 1), hand-tweaked to flag appropriate colors. The edge detectors perform a simple, quick edge detection process over varying window sizes, and are intended to capture texture information at different scales.
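To make the feature list concrete, here is a rough MATLAB sketch of the color features for one sampled window. The triangular membership function and the brightness-range "edge" measure are illustrative stand-ins: the real hue2 is a hand-tweaked curved plateau, and the real edge detector is not shown here.

```matlab
Image = imread('foliage01.jpg');          % hypothetical file name
Half = 10;  r = 50;  c = 50;              % window half-width and center
Window = double(Image(r-Half:r+Half, c-Half:c+Half, :)) / 255;

% Convert the window to HSV; center-pixel features come from the
% middle of each plane.
HSV = rgb2hsv(Window);
Hue        = HSV(Half+1,Half+1,1);
Saturation = HSV(Half+1,Half+1,2);
Value      = HSV(Half+1,Half+1,3);

% "hue2": fuzzy membership in a band of plant-like hues
% (made-up limits, and triangular rather than a curved plateau)
Hue2 = max(0, 1 - abs(Hue - 0.30) / 0.15);

% Crude texture measure at one scale: the range of brightness in a
% 5x5 sub-window around the center pixel
Sub5 = HSV(Half-1:Half+3, Half-1:Half+3, 3);
Edge5 = max(Sub5(:)) - min(Sub5(:));
```
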

There is nothing special about the above list, and indeed the list of possible derived features is limited only by the imagination. The image processing field has delivered a wide array of filters and other such manipulations. Interested readers are urged to examine either of these texts:

Algorithms for Image Processing and Computer Vision, by Parker (ISBN: 0-471-14056-2)

Digital Image Processing, by Gonzalez and Woods (ISBN: 0-201-50803-6)

4. Classifier Construction
My foliage classifier is a logistic regression, simply because logistic regression is quick to train and was handy, as glmfit in the Statistics Toolbox. Any other machine learning or statistical classifier (linear discriminant, neural network, k-nearest neighbors, etc.) could have been used instead.
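Assuming the 70,000 samples have been assembled into a feature matrix X (one row per sample, 11 columns) and a target vector Y of ones ("foliage") and zeros ("non-foliage"), the fit might look like this sketch:

```matlab
% Fit the logistic regression (the logit link is glmfit's default
% for the binomial distribution):
B = glmfit(X, Y, 'binomial');

% Estimated probability of "foliage" for new feature vectors XNew:
P = glmval(B, XNew, 'logit');
```
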

As this was just a quick experiment, I didn't bother with rigorous testing, variable selection, etc. Still, results on test images were quite nice (see below), and flaws in the classifier could certainly be addressed through a more thorough and structured effort.

5. Classifier Recall
The finished model was executed on some of my own digital photographs, which contained both "foliage" and "non-foliage" elements. The result appears below.

Overgrown Beams: Original Image
Overgrown Beams: Foliage Detection

Monarch Butterfly: Original Image
Monarch Butterfly: Foliage Detection

Potted Plant: Original Image
Potted Plant: Foliage Detection

I find working with image data to be particularly satisfying since the result is something one can actually look at. Images contain a great deal of data, both in terms of rich structure and in the sheer number of pixels. Even inexpensive digital cameras will deliver several million pixels per image, so consider image processing as a test-bed application for modeling experiments.

This was just a toy project, so it is hardly the last word in foliage detectors and weaknesses in the model should be evident in the images above. I strongly encourage readers to explore this field and improve on what has been presented. Consider training on other classes, like "skin", "people", "sky", "brickface", etc. I would be very interested in hearing from any readers who have results to share. Good luck!

Note on Handling Images in MATLAB
Even without toolboxes, MATLAB provides several tools for dealing with images, such as imread, which is used to load images. Most color images are loaded into MATLAB as 3-dimensional arrays, and are accessed as Image(VerticalCoordinate,HorizontalCoordinate,ColorPlane). The color channels are numbered: red (1), green (2) and blue (3), so Image(:,:,2) is just the green color plane of the image.
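For example (file name hypothetical):

```matlab
Image = imread('MyPhoto.jpg');    % load a color image
size(Image)                       % rows x columns x 3 color planes
Green = Image(:,:,2);             % extract the green color plane
imagesc(Green), colormap(gray)    % view a single plane as an image
```
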

Related Work
One interesting work on the subject of pixel classification (skin detection) is Skin Detection, by Jennifer Wortman. Although the classifier in that document is based only on color and is constructed by hand, readers should still find some useful insights on pixel classification.

See Also

Feb-02-2007 posting, Pixel Classification Project: Response

Mar-23-2007 posting, Two Bits of Code

Friday, January 19, 2007

GASplineFit: A Flexible Curve Fitting Routine

In my work, I have frequently needed to fit curves to data. Curve-fitting is a broad and surprisingly subtle subject. The two most common types of single-input curve-fitting are simple (linear and non-linear) regression and splines.

Regressions involve the discovery of optimal coefficients for some functional form, which maps the input variable to the output variable. Regressions provide a number of advantages, but these are generally qualified advantages:

Regression Characteristics
1. Common functions (linear, logistic, exponential) are well-studied and widely available. More exotic or arbitrary functions are harder to find software for.

2. Regressions provide optimal fit to data, assuming that one wants a least-squares fit. Some software, notably the MATLAB Curve Fitting Toolbox, will provide other, less common fits, such as mean absolute error (L-1) and robust fits.

3. Regressions are the safest bet for extrapolation. Extrapolation is a hazardous endeavor, but well-behaved regression functions generate the most sensible guess in "the far, unlit unknown".

4. Complex regression curves, such as high-order polynomials, can be ill-behaved, especially at the extremes of the data range. While there are methods of combating such problems (such as fitting at the Chebyshev points), these are not widely available in commercial software.

Splines, which are especially popular in engineering, are connected series of simple curves, most often low-order polynomials. Their chief advantage is flexibility: complex curves require complex regression functions, but splines can handle very complicated shapes simply by connecting several simple curves. Splines may pass directly through data points, or between them. To maintain continuity, the simple curves must come together at points called knots. Like regressions, splines have their qualified advantages:

Spline Characteristics
1. Splines are extremely flexible, but the number and placement of knots must be chosen.

2. Technically, most splines are undefined outside the range of the knots.

3. Most commercial software is oriented toward locating a spline through a small number of given data points. "Fitting" through a large number of data points is not commonly available.

Additionally, base MATLAB directly supports the construction and evaluation of splines.
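For instance, constructing and evaluating a spline in base MATLAB takes only a couple of calls (the knot values below are arbitrary):

```matlab
KnotsX = 0:5;                          % knot input values
KnotsY = [0 0.8 1.0 0.2 -0.7 -1.0];    % knot output values
PP = spline(KnotsX, KnotsY);           % cubic spline through the knots
% (pchip(KnotsX, KnotsY) is the shape-preserving alternative)

X = 0:0.01:5;
Y = ppval(PP, X);                      % evaluate the piecewise polynomial
plot(KnotsX, KnotsY, 'o', X, Y, '-')
```
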

Wanting to free myself from having to select the right regression function every time I needed to fit a curve, I decided to turn to splines for my curve-fitting needs. The problem, as mentioned above, is that almost all spline routines and canned software are very low-level, "fitting" (if it can be called that) splines by stringing them through a very small number of selected points. My solution, GASplineFit.m, builds on the built-in MATLAB spline routines, and will generate splines which optimize any error function handled by my SampleError routine (see my posting of Jan-05-2007, Model Performance Measurement). Note: GASplineFit requires subroutine LogRand.m.

With GASplineFit, the user supplies:

1. An input variable
2. An output variable
3. The input values of the knots (I often just use linspace for these)
4. The spline type (such as 'pchip' or 'spline': see MATLAB's spline routines)
5. A selected performance measure ('MSE', 'L-1', etc.).

There are two additional parameters which control the genetic algorithm that performs the fitting. I usually leave these at 250 and 80. For better but slower fitting, turn these parameters up; for faster but cruder fitting, turn them down. See help GASplineFit for details.

Essentially, the software slides the spline knots up and down, trying to get the best fit. The routine returns:

1. The fitted spline as a MATLAB piecewise polynomial (see: help ppval)
2. The knot output values
3. The assessed error

Notice one very nice feature of this routine: it will fit probability curves directly to 0/1 data. Very often, the statistician is faced with a large collection of examples involving a dummy variable, indicating membership in a class of interest, and an explanatory variable. To smooth the data and approximate the probability of class membership, common practice is to bin the explanatory variable and assess the mean for each bin. Fancier methods either fit a curve to these binned values (which isn't very smooth) or use a kernel regression (which is smooth, but typically difficult to encapsulate as code without taking the entire data set along for the ride). GASplineFit solves all of these problems. The genetic algorithm which is its engine doesn't care that our data may be only zeros and ones: it only cares about optimizing the performance function.

help GASplineFit explains the specifics of this routine, and provides an example of its use. MATLAB programmers beginning with genetic algorithms may be interested in examining the code, which uses an elitist genetic algorithm.

Saturday, January 13, 2007

Revisiting rand (MATLAB 2007a)

Pseudo-random number generators (PRNG) are frequently used in statistical and machine learning methods for things like train/test selection and solution initialization. (See an interesting discussion of this in the Nov-22-2006 posting, Explicit Randomization in Learning algorithms, on the Machine Learning (Theory) log.) Hence, it is important to understand the operation of the pseudo-random number generator employed in one's code.

I've just gotten word that MATLAB's built-in rand function will be changing (as of MATLAB 2007a) to use the Mersenne Twister method of generating numbers by default. Note that initializing rand with 'state' or 'seed' will select a generator other than the new default. Probably the most important practical impact is that code which does not specify another generator (does not use 'state' or 'seed') will now generate different values than before.

Note, too, that if the randperm function continues to follow rand's lead (in the past, randperm was initialized by initializing rand), then it will also produce different values than in previous releases if rand is not initialized.

I have no word on whether this affects randn or not.

MATLAB programming suggestion: Always initialize the random number functions before calling them. Repeatable results make testing code much easier.
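For example (the seed value is arbitrary):

```matlab
% Legacy generator (the pre-2007a default):
rand('state', 29);
randn('state', 29);     % randn must be initialized separately

% As of MATLAB 2007a, the Mersenne Twister generator can be
% requested explicitly:
rand('twister', 29);
```
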

See my post of Dec-07-2006, Quick Tip Regarding rand and randn for more information on rand and randn.

Also see the Mar-19-2008 post, Quasi-Random Numbers.

Also in 2007a, "divide-by-zero" and "log-of-zero" warning messages will be turned off by default.

Friday, January 05, 2007

Model Performance Measurement

A wide variety of model performance measures have been devised. Despite the popularity of mean squared error for numeric models and simple accuracy for classification models, there are many other choices. For my part, I generally prefer mean absolute error for numeric models and the AUC (for class separation) and informational loss (for probability assessment) for classification models.

This log entry is pretty much just a quick giveaway: I have constructed a generic performance calculation MATLAB routine, SampleError.m. Its operation is straightforward, but I find it handy to have all of these measures in one routine, with the ability to switch among them via a simple parameter change. Its use is simple and is explained by help SampleError, and it makes a great building block for modeling routines.

I update many of my MATLAB routines from time to time, and this one is no exception. Presently, though, the following performance measures are supported:

'L-1' (mean absolute error)
'L-2' (mean squared error)
'RMS' (root mean squared error)
'AUC' (requires tiedrank() from Statistics Toolbox)
'Conditional Entropy'
'Informational Loss'
'Median Squared Error'
'Worst 10%'
'Worst 20%'

Note: I still need to verify the Cross-Entropy measure. The last two are classification performance measures, being the proportion of the target class found in the predicted most likely 10% and 20%, respectively.
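To show the pattern, here is an illustrative sketch of such a dispatcher; this is not the actual SampleError.m, and it covers only three of the measures:

```matlab
function Err = SampleErrorSketch(Pred, Target, Measure)
% Minimal dispatcher over a few performance measures
switch upper(Measure)
    case 'L-1'                              % mean absolute error
        Err = mean(abs(Pred - Target));
    case 'L-2'                              % mean squared error
        Err = mean((Pred - Target) .^ 2);
    case 'RMS'                              % root mean squared error
        Err = sqrt(mean((Pred - Target) .^ 2));
    otherwise
        error('Unknown performance measure.')
end
```
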

Incidentally, I'd appreciate any feedback on any of the code in this Web log, whether it concerns typos, outright coding errors, or efficiency issues. Also, please send suggestions for additional measures.