Sunday, December 24, 2006

Feature Selection, Phase 1: Eliminate the Chaff

Feature Selection

In modeling problems, the analyst is often faced with more predictor variables than can be usefully employed. Consider that the size of the input space (the space defined by the input variables in a modeling problem) grows exponentially with the number of inputs. Cutting each input variable's scale into, say, 10 segments implies that a single-input model will require 10 examples for model construction, while a model with two input variables will require 100 (= 10 x 10), and so forth (assuming only one training example per cell). Assuming that the inputs are completely uncorrelated, six input variables, by this criterion, would require 1 million examples. In real problems, input variables are usually somewhat correlated, which reduces the number of needed examples, but the problem still explodes rapidly, and these estimates are somewhat conservative in that more than one example should probably be available from each cell.
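As a rough back-of-the-envelope illustration of the arithmetic above (this snippet is purely illustrative and not part of any feature selection routine):

segments = 10;                       % segments per input variable
numInputs = 1:6;                     % number of input variables
requiredExamples = segments .^ numInputs
% requiredExamples = 10  100  1000  10000  100000  1000000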

Given this issue, data miners are often faced with the task of selecting which predictor variables to keep in the model. This process goes by several names, the most common of which are subset selection, attribute selection, and feature selection. Many solutions have been proposed for this task, though none of them are perfect, except on very small problems. Most such solutions attack the problem directly, by experimenting with different subsets of predictors to keep.

As a means of simplifying this job, Weiss and Indurkhya (see reference, below) describe a simple hypothesis test which may be carried out quickly on each candidate predictor variable separately, to gauge whether that predictor is likely to be informative regarding the target variable. Their procedure, which they named the independent significance features test ("IndFeat"), is not meant to select precisely which features should be used to build the final model; rather, it is used to quickly and inexpensively discard features which seem obviously useless. In some cases, this dramatically reduces the number of candidate predictors to be considered in a final selection process. In my experience with larger data sets (100 candidate predictors or more), this pre-processing phase will generally eliminate at least a few percent of the inputs, and has in some cases eliminated as many as 40% of them. As a rule, I employ this process as a precursor to the final variable selection whenever I am faced with more than 30 or 40 candidate predictors.

The IndFeat process assumes that the target variable is categorical. When the output variable is numeric, Weiss and Indurkhya recommend splitting it at the median for IndFeat. One may worry about predictors which IndFeat throws away despite actually being useful; Weiss and Indurkhya indicate that this will rarely be a problem, and that has been my experience as well.
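For numeric targets, the median split takes only one line. Here is a minimal sketch (the variable names Y and YSplit are mine, not part of IndFeat):

% Convert a numeric target to a two-class variable by splitting at its median
YSplit = double(Y > median(Y));      % 1 = above the median, 0 = at or below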


MATLAB Implementation

I have implemented IndFeat in a MATLAB function which is available on-line at:

IndFeat.m

This function expects two inputs: a matrix of predictor variables (examples in rows, variables in columns), and a target vector. The target vector must be a two-class variable. Typing "help IndFeat" provides a simple example of the process in operation.

IndFeat returns a significance value for each predictor: the higher, the better. Weiss and Indurkhya suggest that features be retained only if they are at a significance of 2.0 or higher, a threshold which I have found to work well.
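Putting the pieces together, a minimal usage sketch based on the description above looks something like the following (the variable names are illustrative, and the 2.0 cutoff is the suggested default):

% X: predictors (examples in rows, variables in columns); Y: two-class target
FS = IndFeat(X, Y);                  % significance value for each predictor
XReduced = X(:, FS >= 2.0);          % keep only predictors at significance >= 2.0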

I have found IndFeat to work very well as "Phase 1" of a two-phase feature selection process, in which the second phase can be any of a number of procedures (forward selection, stepwise, etc.). It has served me well by cutting down computer time required for modeling, and I hope it does the same for you.


Reference:
Predictive Data Mining by Weiss and Indurkhya (ISBN: 1558604030)

Thursday, December 07, 2006

Hardware for Data Mining in MATLAB

Appropriate hardware is important if data mining is to be performed on the desktop (MATLAB users on other platforms: you get the day off). Needs will vary, naturally, with the amount of data being handled and the type of analysis being performed. In my opinion, given what most organizations are willing to pay for data mining software, the incremental cost of going from a stock PC to a performance PC is comparatively small.

If, like me, one reads the data entirely into memory, then the two most important performance factors in selecting a PC will be the processor and the amount of RAM. Hard-drive size should generally not be an issue given the current, extremely low cost of storage. Hard-drive speed, however, may become important if the drive is being accessed frequently. I welcome input from readers, but I have never found MATLAB to be very demanding of the graphics hardware.

In my day job, I tend to work with data sets (individual MATLAB arrays which I load from tab-delimited files) that range (give or take) up to a few hundred thousand rows and as many as a few hundred columns, though not necessarily at the same time. At home, I experiment with data sets of similar size.

Until recently, my work computer used an AMD Athlon64 FX-53 (2.4 GHz, no overclocking) and 2 GB RAM, which cost US$2,000 when purchased in 2004. My current work system features an Intel Core 2 Extreme X6800 (2.93 GHz) and 4 GB RAM, which cost US$2,700 at purchase in Nov-2006. My home system sports an AMD Athlon64 3400+ with 1 GB RAM and cost, as best as I can recall, about US$1,500 at purchase in 2004. All of these systems run Windows XP (32-bit) and have met my performance needs.

Performance PC boutiques offer a number of advantages over large PC manufacturers. Component quality tends to be higher with performance boutiques (not all 120GB hard drives are the same). Service and attention to detail also tend to be better. Provision of all original OS, software, and driver disks is typical. Individualized "owner's manuals", with all system specifications and settings, are also typical, as are disk images of the system as shipped. I have had very good experiences buying from performance PC boutiques. Specifically, I have purchased machines from these two vendors, and been very satisfied:

Falcon Northwest
Velocity Micro

Some other performance PC boutiques which are popular, although I cannot vouch for them personally, are:

Hypersonic PC Systems
Voodoo PC

The performance PC boutiques started by catering to the gaming and science/engineering crowds. (If you're not aware, some games are among the most computationally demanding applications.) Over time, though, these vendors have diversified their product lines to address other interests. Data miners will consume the same amount of computational horsepower as gamers, but don't need the fancy graphics adapters, joysticks, speakers, etc. Some boutiques have machines labeled as "A/V" computers, specifically geared for multimedia audio/video work: These are ideal for data mining work. A good source of information on performance computers, including reviews of systems, is ExtremeTech.


Readers using 64-bit MATLAB (on Windows or Linux), please comment on your experiences!

Quick Tip Regarding rand and randn

This is just a quick note about the rand and randn functions. Many modeling algorithms include some probabilistic component, typically provided by one or both of MATLAB's pseudo-random number generators, rand and randn. To make results repeatable over multiple executions of the same code, initialize these generators to a known state before using them.

Note that there are multiple ways to do this with both of these routines. In MATLAB v7.3, rand can be initialized by 'state', 'seed' or 'twister', and randn by 'state' and 'seed'. See help rand and help randn for details.

Also note, and this is very important: rand and randn are initialized separately, meaning that initializing one has no effect on the other function. So, if one's program includes both routines, both will need to be initialized to obtain repeatability.
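For example, one possible way to do this in MATLAB v7.3 (the state value 0 is an arbitrary choice):

rand('state', 0);                    % or rand('seed', ...) or rand('twister', ...)
randn('state', 0);                   % must be initialized separately from rand
x = rand(1, 5);                      % these values will repeat on every run
y = randn(1, 5);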

See the Mar-19-2008 post, Revisiting rand (MATLAB 2007a).

Also see the Mar-19-2008 post, Quasi-Random Numbers.

Monday, December 04, 2006

Small Administrative Note

This is just a small administrative note. Sometimes I start a log entry and am not able to finish it quickly. When finally published, the posting bears the date on which I began editing, not the publication date. I will try to avoid this problem in the future, but for now, please travel back in time to Nov-16-2006, to read Fuzzy Logic In MATLAB Part 1.

A Question And An Answer

This is just a short post, to let everyone know I'm still here.

The Answer First...

An interesting MATLAB solution to the relational join problem was provided in a response to my Why MATLAB for Data Mining? posting of Nov-08-2006. I had written, "The one gap with MATLAB is that it is not very good at relational joins. Look-up tables (even large ones) for tacking on a single variable are fine, but MATLAB is not built to perform SQL-style joins.". Eric Sampson of The MathWorks (serow225) couldn't let that go, and provided a solution in his comments to that post. Thanks, Eric!
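For the simple look-up-table case mentioned above, a sketch along the following lines works (the variable names are hypothetical, and this is my own sketch, not Eric's solution):

% Tack a single variable onto a data set via a look-up table
% keyA: key column of the main data set; keyTable, valueTable: the look-up table
[found, loc] = ismember(keyA, keyTable);
newVar = NaN(size(keyA));            % rows with no match stay NaN
newVar(found) = valueTable(loc(found));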


...Then The Question

The Statistics Toolbox from The MathWorks provides a discriminant analysis routine, called classify. This function performs linear, Mahalanobis, and quadratic discriminant analysis, all very handy modeling algorithms. The question is: why does classify not output the discovered discriminant coefficients?
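For reference, a typical call looks something like this (the variable names are illustrative):

% Linear discriminant analysis via the Statistics Toolbox classify function
PredictedClass = classify(XTest, XTrain, YTrain, 'linear');
% The predicted classes are returned, but the fitted discriminant
% coefficients themselves are not among the outputs.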


See Also

Mar-03-2007 posting, MATLAB 2007a Released