Readers of this Web log may be interested in, An Introduction to Combinatorics, an article on the perms, randperm and nchoosek functions which I authored as a guest of Blinkdagger. Blinkdagger covers MATLAB programming, among other things, and I suggest you have a look.
Monday, April 07, 2008
Thursday, April 03, 2008
Generating Hexagonal Grids for Fun and Profit
Grids are used for a variety of purposes in data analysis, such as division of physical areas into equal-sized units, or for data visualization. Some clustering techniques, such as Kohonen's Self-Organizing Map use grids to organize their internal structure.
Square Grids
By far, the most commonly-employed grids are square grids. Square grids are convenient in that every cell is the same shape with the same orientation, and boundaries between rows or columns are straight lines. Indexing square grids is easy: (x, y), and extension to more dimensions is straightforward: (x, y, z), etc. Generating square grids in MATLAB is a breeze:
>> [X Y] = meshgrid(0:8)
X =
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
Y =
0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8
X and Y now contain the coordinates for the centers of the square cells, which can be plotted in MATLAB thus (click the figure to enlarge):
>> figure, voronoi(X(:),Y(:)), axis square
meshgrid-generated grids need not have the same axes, nor equal spacing. See 'help meshgrid' for more information. The linspace and logspace MATLAB routines are handy as meshgrid arguments, as well.
Hexagonal Grids
Despite their advantages, square grids do have one basic failing: their representations of circles and other non-rectangular forms are awkward.
With a square grid, cells surrounding a central cell have mixed distances. Repeated single-unit "hops" from a central cell (such as activation in a cellular automata) result in square or diamond patterns, not circles.
Hexagonal grids, one alternative to square grids, are much cleaner in their approximation of circular regions. All six immediate neighbors of any hexagonal cell are the same distance away. Repeated single-unit hops from a given hexagonal cell maintain a relatively "round" form (at least a better one than those provided by square grids).
Generating hexagonal grids is a bit trickier than generating square grids, but with a little geometry it can be done (as always, click the figure to enlarge):
% Generate hexagonal grid
Rad3Over2 = sqrt(3) / 2;
[X Y] = meshgrid(0:1:41);
n = size(X,1);
X = Rad3Over2 * X;
Y = Y + repmat([0 0.5],[n,n/2]);
% Plot the hexagonal mesh, including cell borders
[XV YV] = voronoi(X(:),Y(:)); plot(XV,YV,'b-')
axis equal, axis([10 20 10 20]), zoom on
Shifting the resulting grid coordinates is accomplished through addition. Scaling is accomplised by multiplication. Note that individual hexagons produced by the code above are oriented with their tops and bottoms flat. Rotating the cells so that the left and right sides are flat is a simple as reversing the rolls of the x and y coordinates in the code.
Sunday, March 30, 2008
Validating Predictive Models
Introduction
In this author's opinion, validating the performance of predictive models is the single most important step, if one can be chosen, in the process of data mining. One important mechanism for testing models is resampling, the subject of this article. No MATLAB this time, just technique.
When selecting a validation technique, it is vital to keep in mind the purpose of such validation: to estimate the level of performance we may expect from models generated by our modeling process, when such models are run on future cases. Note an important subtlety here: We are not so much interested in testing the performance of individual models as we are in testing the model-generating process (feature selection process, complexity selection process, etc., as well as the actual modeling algorithm).
Apparent Performance: Warning!
The most obvious testing method is to simply execute the model on the very same data upon which it was built. The result is known as the apparent performance. The apparent performance is known to be statistically biased in an optimistic way. This is like giving out the answers to the test before administering the test!
At the extreme, a model could simply memorize the development observations and regurgitate them during testing. Assuming no mutually contradictory cases, such a system would deliver perfect validation performance! Certainly this is not what we are interested in.
The whole point in making a predictive model is so that said model may be used on future cases. What is desired is generalization to new cases, not simple memorization of historical ones.
Ultimately, there is no way to know precisely how optimistic apparent performance estimates are, rendering such performance measures largely useless.
Despite its hazards, calculation of the apparent performance is used as the final assessment of models with shocking frequency in industry. Do not become one of its victims.
Holdout Testing
Given the dangers of apparent performance measures, one might logically reason that a model could be built using all presently available data, and tested at some future point in time, after further observations had been collected. This idea makes perfect sense, but involves potentially considerable delay. Rather than wait for new data, holdout testing splits the data randomly into two sets: training (also called "in-sample") and testing (also called "out-of-sample"). This is the simplest form of resampling. Incidentally, it is not uncommon to stratify the assignment to training and testing groups, based on variables believed to be significant, including the dependent variables.
The idea here is simple: fit the model using the training data, and test it on the testing data. No "cheating" takes place since the test data is not used during model construction.
Holdout testing provides an unbiased measure of performance, provided (and this caveat is rather important) that the test data is used only once to test the model. If the test data is used more than once to test the data, then all bets are off regarding the unbiased nature of the performance measure. Surprisingly many modelers in industry violate this "use once" rule (Shame on you, industry, shame!). In the event that another set of data is needed to make adjustments to the model (to experiment with different numbers of predictors, for instance), a third randomly assigned data set, the validation set (also called the "tuning set") should be employed.
This simple test process works well in many instances in practice. Its biggest drawback is that it trades off training accuracy for testing accuracy. Typically, the data miner is faced with finite supply of data. Every observation which is moved to the testing set is no longer available for training.
As indicated above, our primary interest is in evaluation of the model-generating process. Once we know what to expect from models that come from our process, we may apply our modeling process to the entire data set (without regard to train/test designations) to construct the final model.
k-Fold Cross Validation
Smaller data sets force an uncomfortable choice on the modeler using holdout testing: either short-change model construction or short-change testing. One solution is to use k-fold cross-validation (sometimes referred to as simply "cross-validation").
k-fold cross-validation builds on the idea of holdout testing in a clever way by rotating data through the process. Data is again divided randomly into groups, but now k equal-sized groups are used. As with holdout testing, stratification is sometimes used to force the folds to be statistically similar. The train-test process is repeated k times, each time leaving a different segment of the data out, as the test set.
A common choice for k is 10, resulting in 10-fold cross-validation. In 10-fold cross-validation, the observations are randomly assigned to 10 groups. Ten separate models are built and tested on distinct data segments. The resulting 10 performance measures are unbiased since none of them was built with test data that was used during training. The single, final performance measurement is taken as the mean of these 10 performance measures. The magic of this process is that during each fold, 90% of the data is available for training, yet the final performance metric is based on 100% of the data!
When k is equal to the number of observations, this process goes by the special name leave-one-out. While this may be tempting, there are good reasons for choosing k in the range of 5 to 10.
The good news with k-fold cross-validation is that reliable, unbiased testing may be performed on smaller data sets than would be possible with simple train-and-test holdout testing. The only really bad news is that this process obviously requires much more computational effort than holdout testing.
As with holdout testing, once the modeling process has been evaluated, it may run over the entire data set to produce the final model.
Closing Thoughts
Other resampling techniques are available, such as the bootstrap. Holdout testing and k-fold cross validation are real workhorses, though, and should cover many machine learning and data mining situations.
Few other segments of the empirical modeling pipeline are as critical as model testing- perhaps only problem definition and the collection of appropriate data are as important. Assuming that these other two have been performed properly, model validation is the acid test of model performance: pay it the attention it deserves.
Further Reading
I strongly recommend the book "Computer Systems That Learn", by Weiss and Kulikowski (ISBN: 1-55860-065-5) for a quite readable introduction to this subject.
Also very worthy of consideration is chapter 5 of "Data Mining: Practical Machine Leearning Tools and Techniques", by Witten and Frank (ISBN: 1-55860-552-5).
The Usenet comp.ai.neural-nets FAQ Part 1 contains solid material on this subject as well. See, especially, the section titled "What are the population, sample, training set, design set, validation set, and test set?"
Saturday, March 29, 2008
50,000 Visitors and Counting
At 9:32AM local time today, this Web log received its 50,000th visitor, which I consider a significant milestone. Visitation continues to trend upward, with this month (not yet complete) already exhibiting the highest number of visitors yet. Recently, I was also made aware that this log appears very near the top of some Google search results (which is only one way to measure success). As an example, a posting here is the 2nd item returned when searching for Mahalanobis distance.
I humbly interpret these events as evidence of the helpfulness of this log to readers. I'd like to say "Teşekkürler" to Deniz, who recently reminded me of this, among so many other things.
Friday, March 28, 2008
Getting Started with MATLAB
I am occasionally asked for introductory MATLAB materials. The only posts I've written here which I'd consider "introductory" are somewhat specialized:
Basic Summary Statistics in MATLAB (Apr-13-2007)
Getting Data Into MATLAB Using textread (Apr-08-2007)
Statistical Data Management in MATLAB (Mar-26-2008)
Much broader tutorials can easily be found on-line using any search engine. Searching AllTheWeb for
MATLAB introduction
... yields a number of likely prospects, including a nice clearinghouse of such information hosted by the MathWorks:
MATLAB Tutorials
A number of very well-written introductions to MATLAB have been written, especially by university professors and graduate students. Try searching things like MATLAB tutorial. As always, I suggest including PDF or PPT to improve the quality of discovered documents.
