Sunday, December 24, 2006

Feature Selection, Phase 1: Eliminate the Chaff

Feature Selection

In modeling problems, the analyst is often faced with more predictor variables than can be usefully employed. Consider that the size of the input space (the space defined by the input variables in a modeling problem) grows exponentially. Cutting up each input variable's scale into, say, 10 segments implies that a single-input model will require 10 examples for model construction, while a model with two input variables will require 100 (= 10 x 10), and so forth (assuming only one training example per cell). Assuming that the inputs are completely uncorrelated, six input variables, by this criteria, would require 1 million examples. In real problems, input variables are usually somewhat correlated, reducing the number of needed examples, but this problem still explodes rapidly, and these estimates can be considered somewhat conservative in that perhaps more than one example should be available from each cell.

Given this issue, data miners are often faced with the task of selecting which predictor variables to keep in the model. This process goes by several names, the most common of which are subset selection, attribute selection and feature selection. Many solutions have been proposed for this task, though none of them are perfect, except on very small problems. Most such solutions attack this problem directly, by experimenting with predictors to be kept.

As a means of simplifying this job, Weiss and Indurkhya (see reference, below) describe a simple hypothesis test which may be carried out quickly on each candidate predictor variable separately, to gauge whether that predictor is likely to be informative regarding the target variable. Their procedure, which they named the independent significance features test ("IndFeat"), is not meant to select precisely which features should be used to build the final model, rather it is used to quickly and inexpensively discard features which seem obviously useless. In some cases, this dramatically reduces the number of candidate predictors to be considered in a final selection process. In my experience, with larger data sets (100 candidate predictors or more), this pre-processing phase will generally eliminate at least a few percent of the inputs, and has in some cases eliminated as many as 40% of them. As a rule I employ this process as a pre-cursor to the final variable selection whenever I am faced with more than 30 or 40 candidate predictors.

The IndFeat process assumes that the target variable is categorical. When the output variable is numeric, Weiss and Indurkhya recommend splitting the output variable at the median for IndFeat. One may ask about the predictors which are thrown away by IndFeat which are actually useful. Weiss and Indurkhya indicate that this will rarely be a problem, and that has been my experience as well.


MATLAB Implementation

I have implemented IndFeat in a MATLAB function which is available on-line at:

IndFeat.m

This function expects two inputs: a matrix of predictor variables (examples in rows, variables in columns), and a target vector. The target vector can only be a two-class variable. Typing "help IndFeat" provides a simple example of the process in operation.

IndFeat returns a significance value for each predictor: the higher the better, and Weiss and Indurkhya suggest that features be retained only if they are at a significance of 2.0 or higher, which I have found to work well.

I have found IndFeat to work very well as "Phase 1" of a two-phase feature selection process, in which the second phase can be any of a number of procedures (forward selection, stepwise, etc.). It has served me well by cutting down computer time required for modeling, and I hope it does the same for you.


Reference:
Predictive Data Mining by Weiss and Indurkhya (ISBN: 1558604030)

14 comments:

Shan He said...

Thanks for your text and code. Very useful. I think this so-called 'independent significant' has a more popular name: the fisher's ratio: http://www.vias.org/tmdatanaleng/cc_fisher_ratio.html

Will Dwinnell said...

I'm glad you found this useful. Thanks for the information and the reference.

phormwatch said...

The link to the matlab program seems to have gone down. Here is a version from Google cache:

% IndFeat.m - routine for selecting independent features for
% modeling problems
%
% code by Will Dwinnell (predictr@bellatlantic.net)
%
% Sig = IndFeat(X,Y)
%
% Calculate significance level, 'Sig' of real variables (in columns) from
% matrix 'X', based on their ability to distinguish 2 categories in column
% vector 'Y' (typically, variables are kept when significance >= 2.0).
% For a continuous output variable, it is suggested that values be split
% into lower and upper 50%.
%
% Uses Weiss/Indurkhya 'independent features' significance testing
% method- keep in mind that this is not intended to be the final
% variable selection, just a means of eliminating obviously uninteresting
% features. See Weiss' and Indurkhya's book, "Predictive Data Mining".
%
% Example:
%
% X = randn(5e3,25);
% Y = double(sum(X(:,1:3)')' > 1.1);
% IndFeat(X,Y)
% figure
% stem(IndFeat(X,Y))
% grid on
% zoom on
%
% Last modified: Nov-14-2006

function Sig = IndFeat(X,Y)

% find the two (more to come) class "names"
UniqueClass = unique(Y);

% find indexes for both classes
ClassA = (Y == UniqueClass(1));
ClassB = (Y == UniqueClass(2));
nA = sum(double(ClassA));
nB = sum(double(ClassB));

% calculate significances
Sig = ...
abs(mean(X(ClassA,:)) - mean(X(ClassB,:))) ./ ...
sqrt(var(X(ClassA,:)) ./ nA + var(X(ClassB,:)) ./ nB);


% EOF

Anonymous said...

what about if I have more than two classes?

Will Dwinnell said...

With more than 2 output classes, Weiss and Indurkhya suggest performing this test on all pairs of classes for each input variable. If any one of the pairwise test proves significant, keep the input variable.

wirth6 said...

Thanks for the function. I'm a bit confused though: I'm training a neural network for classification with two different feature sets. But for some reason, the feature set with bigger significance gives me worse classification. What do you think might be the cause? (Thanks in advance)

Will Dwinnell said...

wirth6:

The purpose of this function is to eliminate variables from a set of candidate model inputs who are unlikely to be helpful. It is not for comparing sets of candidates or (as others have thought) selecting a final set of model inputs. If one had, for instance, 600 variables available, eliminating 100 useless ones would be a big step forward.

TarikT said...

so for a regression problems, we should split output values into 2 class, upper and lower 50%?? does anyone has experience with this, and with feature selection for regression in general? Thanks a lot

Will Dwinnell said...

"so for a regression problems, we should split output values into 2 class, upper and lower 50%??"

Yes, but keep in mind this method is only intended to eliminate obviously useless candidate inputs, it is not intended to select the final subset of inputs for a model. When a data set contains many possible input variables, it is still help to weed out the fraction which are least likely to be of utility.

TarikT said...

Thanks for the answare.

There is a lot of literature, advices, examples for the classification problems. How can i use it (adjust) for regression problem? for example in class-specific scaling, if i want to scale attributes by their influence:

ai=[mean(xi+)-mean(xi-)]\[var(xi+)-var(xi-)]
bi=0
zi=ai*xi-bi;

So i need to calculate mean/var value of the attribute xi for both classes. I doubt that i can also separate my continuous output into upper and lower 50% classes?? Or to transform it into multclass problem? Thanks in advance

Will Dwinnell said...

The idea behind separating a numeric target variable into upper and lower halves is just to determine whether a given candidate predictor variable can even distinguish that much. If it can't, it's may not be worth retaining. Naturally, when the actual model is built, the numeric target is used in its original form.

TarikT said...

yes, but how can i scale (not just 1 or 0) my predictor variables according to their influence on the output, in the regression?

Unknown said...

In terms of Y = double(sum(X(:,1:3)')' > 1.1), why did you set up "X(:,1:3)" and "> 1.1"

name said...

what about the classes more than 2 ? for multiclass problems, how we can implement into those?