**Feature Selection**

In modeling problems, the analyst is often faced with more predictor variables than can be usefully employed. Consider that the size of the input space (the space defined by the input variables in a modeling problem) grows exponentially. Cutting up each input variable's scale into, say, 10 segments implies that a single-input model will require 10 examples for model construction, while a model with two input variables will require 100 (= 10 x 10), and so forth (assuming only one training example per cell). Assuming that the inputs are completely uncorrelated, six input variables, by this criteria, would require 1 million examples. In real problems, input variables are usually somewhat correlated, reducing the number of needed examples, but this problem still explodes rapidly, and these estimates can be considered somewhat conservative in that perhaps more than one example should be available from each cell.

Given this issue, data miners are often faced with the task of selecting which predictor variables to keep in the model. This process goes by several names, the most common of which are

*subset selection*,

*attribute selection*and

*feature selection*. Many solutions have been proposed for this task, though none of them are perfect, except on very small problems. Most such solutions attack this problem directly, by experimenting with predictors to be kept.

As a means of simplifying this job, Weiss and Indurkhya (see reference, below) describe a simple hypothesis test which may be carried out quickly on each candidate predictor variable separately, to gauge whether that predictor is likely to be informative regarding the target variable. Their procedure, which they named the

*independent significance features*test ("IndFeat"), is not meant to select precisely which features should be used to build the final model, rather it is used to quickly and inexpensively discard features which seem obviously useless. In some cases, this dramatically reduces the number of candidate predictors to be considered in a final selection process. In my experience, with larger data sets (100 candidate predictors or more), this pre-processing phase will generally eliminate at least a few percent of the inputs, and has in some cases eliminated as many as 40% of them. As a rule I employ this process as a pre-cursor to the final variable selection whenever I am faced with more than 30 or 40 candidate predictors.

The IndFeat process assumes that the target variable is categorical. When the output variable is numeric, Weiss and Indurkhya recommend splitting the output variable at the median for IndFeat. One may ask about the predictors which are thrown away by IndFeat which are actually useful. Weiss and Indurkhya indicate that this will rarely be a problem, and that has been my experience as well.

**MATLAB Implementation**

I have implemented IndFeat in a MATLAB function which is available on-line at:

IndFeat.m

This function expects two inputs: a matrix of predictor variables (examples in rows, variables in columns), and a target vector. The target vector can only be a two-class variable. Typing "help IndFeat" provides a simple example of the process in operation.

IndFeat returns a significance value for each predictor: the higher the better, and Weiss and Indurkhya suggest that features be retained only if they are at a significance of 2.0 or higher, which I have found to work well.

I have found IndFeat to work very well as "Phase 1" of a two-phase feature selection process, in which the second phase can be any of a number of procedures (forward selection, stepwise, etc.). It has served me well by cutting down computer time required for modeling, and I hope it does the same for you.

Reference:

*Predictive Data Mining*by Weiss and Indurkhya (ISBN: 1558604030)