Data Mining in MATLAB: April 2009

Friday, April 03, 2009

MATLAB Image Basics

Introduction

One nice feature of MATLAB is its provision of handy functions which are not part of the programming language proper. An excellent example of this is its support for images. The base MATLAB product provides routines for the loading from disk, manipulation, display and storing to disk of raster images. While it's true that one can find code libraries to perform these functions for other programming languages, like C++, the MATLAB model offers several advantages, not the least of which is standardization. If I write image-handling code in MATLAB, I know that every other MATLAB user on Earth can run my code without modification or the need for extra header files, libraries, etc. This article will serve as a brief introduction to the use of image data within MATLAB.

Image Data

Images are nearly always stored in digital computers in one of two forms: vector or raster . Vector images store images as line drawings (dots, line segments, polygons) defined by the spatial coordinates of their end points or vertices, and are most often used these days in artistic settings. A raster image is simply a 2-dimensional array of colored pixels (represented by numbers). This article will concentrate on the much more common raster form.

Raster images, being arrays of numbers, are a natural fit for MATLAB, and indeed MATLAB is a convenient tool for applications such as image processing. Raster images always have 2 spatial dimensions (horizontal and vertical), and 1 or more color planes. Typically, grayscale images are stored as a 2-dimensional array, representing 1 color plane with values of 0.0 indicating black, 1.0 indicating white and intermediate values indicating various shades of gray. Color images are similar to grayscale images, but are most often stored as a 3-dimensional array, which is really a stack of three 2-dimensional color planes: one for each primary color: red, green and blue ("RGB"). As with grayscale images, values in the RGB color planes represent brightness of each color. Note that when all three color values are the same, the resulting color is a shade of gray.

For the reader's knowledge, there are also index images which will not be covered here, but which are full of index numbers (integers) which do not represent colors directly, but instead indicate locations in a palette. Also, brightness values are often stored in files as integers, such as 0 - 255 instead of 0.0 to 1.0.

Loading Images from Disk

In MATLAB, images are read from disk using the imread function. Using imread is easy. The basic parameters are the location of the image file and the file format:

>> A = imread('c:\QRNG.png','PNG');
>> whos
Name Size Bytes Class Attributes

A 942 x 1680 x 3 4747680 uint8

This image is 942 pixels vertical by 1680 pixels horizontal, with 3 color planes (red, green and blue). Note that image data has been store in MATLAB as unsigned 8-bit integers (uint8). Since I often make multiple calculations on images, I typically convert the data type to double-precision real (double) and scale to 0.0 - 1.0 (though this will slow calculation):

>> B = double(A) / 255;

Displaying Images

Showing images on the screen is most easily accomplish using the image function:

image(A)

Grayscale images will display using a default palette, which can be changed via the colormap command:

>>colormap gray

Images will be fit to the screen, which may distort their aspect ratio. This can be fixed using:

>>axis equal

...meaning that pixels will use equal scales horizontally and vertically.

Manipulating Images

As arrays, images can be modified using all the fun things we usually do to arrays in MATLAB (subsetting, math operations, etc.). I will mention one other useful base MATLAB tool for image processing: the rgb2hsv function, which converts an RGB image to an HSV one. HSV is a different colorspace (way of representing colors). HSV arrays are similar to RGB arrays, except their 3 color planes are hue, saturation and value (in that order). It is often convenient to work on the value ("brightness") plane, to isolate changes in light/dark from changes in the color. To get back to the land of RGB, use the function hsv2rgb.

Saving Images to Disk

Images can be saved to disk using the imwrite command. This is essentially the inverse of the imread command:

imwrite(A,'New Image.bmp','BMP')

...with the parameters indicating the array to be saved as an image file, the file location and image file format, in that order.

Note that MATLAB understands images as both 0 - 255 uint8s and 0.0 - 1.0 doubles, so there is no need to reverse this transformation before image storage.

Conclusion

Working on images in MATLAB is very convenient, especially when compared to more general-purpose languages. I urge the reader to check the help facility for the functions mentioned here to learn of further functionality.

Further Reading

For more information on image processing, I recommend either of the following books:

Digital Image Processing (3rd Edition) by Gonzalez and Woods
(ISBN-13: 978-0131687288)

Algorithms for Image Processing and Computer Vision by J. R. Parker
(ISBN-13: 978-0471140566)

Wednesday, April 01, 2009

Introduction to Conditional Entropy

Introduction

In one of the earliest posts to this log, Introduction To Entropy (Nov-10-2006), I described the entropy of discrete variables. Finally, in this posting, I have gotten around to continuing this line of inquiry and will explain conditional entropy.

Quick Review: Entropy

Recall that entropy is a measure of uncertainty about the state of a variable. In the case of a variable which can take on only two values (male/female, profit/loss, heads/tails, alive/dead, etc.), entropy assumes its maximum value, 1 bit, when the probability of the outcomes is equal. A fair coin toss is a high-entropy event: Beforehand, we have no idea about what will happen. As the probability distribution moves away from a 50/50 split, uncertainty about the outcome decreases since there is less uncertainty as to the outcome. Recall, too, that entropy decreases regardless of which outcome class becomes the more probable: an unfair coin toss, with a heads / tails probability distribution of 0.10 / 0.90 has exactly the same entropy as another unfair coin with a distribution of 0.90 / 0.10. It is the distribution of probabilities, not their order which matters when calculating entropy.

Extending these ideas to variables with more than 2 possible values, we note that generally, distributions which are evenly spread among the possible values exhibit higher entropy, and those which are more concentrated in a subset of the possible values exhibit lower entropy.

Conditional Entropy

Entropy, in itself, is a useful measure of the mixture of values in variables, but we are often interested in characterizing variables after they have been conditioned ("modeled"). Conditional entropy does exactly that, by measuring the entropy remaining in a variable, after it has been conditioned by another variable. I find it helpful to think of conditional entropy as "residual entropy". Happily for you, I have already assembled a MATLAB function, ConditionalEntropy, to calculate this measure.

As an example, think of sporting events involving pairs of competitors, such as soccer ("football" if you live outside of North America). Knowing nothing about a particular pair of teams (and assuming no ties), the best probability we may assess that any specific team, say, the Black Eagles (Go, Beşiktaş!), will win is:

p(the Black Eagles win) = 0.5

The entropy of this variable is 1 bit- the maximum uncertainty possible for a two-category variable. We will attempt to lower this uncertainty (hence, lowering its conditional entropy).

In some competitive team sports, it has been demonstrated statistically that home teams have an advantage over away teams. Among the most popular sports, the home advantage appears to be largest for soccer. One estimate shows that (excluding ties) home teams have historically won 69% of the time, and away teams 31% of the time. Conditioning the outcome on home vs. away, we may provide the follwing improved probability estimates:

p(the Black Eagles win, given that they are the home team) = 0.69
p(the Black Eagles win, given that they are the away team) = 0.31

Ah, Now there is somewhat less uncertainty! The entropy for probability 0.69 is 0.8932 bits. The entropy for probability 0.31 (being the same distance from 0.5 as 0.69) is also 0.8932 bits.

Conditional entropy is calculated as the weighted average of the entropies of the various possible conditions (in this case home or away). Assuming that it is equally likely that the Black Eagles play home or away, the conditional entropy of them winning is 0.8932 bits = (0.5 * 0.8932 + 0.5 * 0.8932). As entropy has gone down with our simple model, from 1 bit to 0.8932 bits, we learn that knowing whether the Black Eagles are playing at home or away provides information and reduces uncertainty.

Other variables might be used to condition the outcome of a match, such as number of player injuries, outcome of recent games, etc. We can compare these candidate predictors using conditional entropy. The lower the conditional entropy, the lower the remaining uncertainty. It is even possible to assess a combination of predictors by treating each combination of univariate conditions as a separate condition ("symbol", in information theoretic parlance), thus:

p(the Black Eagles win, given:
that they are the home team and there are no injuries) = ...
that they are the away team and there are no injuries) = ...
that they are the home team and there is 1 injury) = ...
that they are the away team and there is 1 injury) = ...
etc.

My routine, ConditionalEntropy can accommodate multiple conditioning variables.

Conclusion

The models being developed here are tables which simplistically cross all values of all input variables, but conditional entropy can also, for instance, be used to evaluate candidate splits in decision tree induction, or to assess class separation in discriminant analysis.

Note that, as the entropies calculated in this article are based on sample probabilities, they suffer from the same limitations as all sample statistics (as opposed to population statistics). A sample entropy calculated from a small number observations likely will not agree exactly with the population entropy.

Further Reading

See also my Sep-12-2010 posting, Reader Question: Putting Entropy to Work.

Print:

The Mathematical Theory of Communication by Claude Shannon (ISBN 0-252-72548-4)

Elements of Information Theory by Cover and Thomas (ISBN 0-471-06259)