Sunday, January 13, 2013

Ranking as a Pre-Processing Tool

To squeeze the most from data, analysts will often modify raw variables using mathematical transformations.  For example, in Data Analysis and Regression, Tukey described what he termed a "ladder of re-expression": a series of low-order roots and powers (plus the logarithm) intended to reshape distributions so that linear regression yields better results.  Univariate adjustments using those particular transformations are fairly popular today, and are even incorporated directly into some machine learning software as part of the solution search.  The modified variables are fed to the analysis (modeling algorithm, clustering, visualization, etc.) just like the raw variables.
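As a rough sketch of how a few rungs of that ladder might look in MATLAB (the variable x and its lognormal sample are invented purely for illustration):

    x = exp(randn(1000, 1));   % hypothetical, right-skewed sample
    xSqrt  = sqrt(x);          % the 1/2-power rung
    xLog   = log(x);           % the "zero power" rung on Tukey's ladder
    xRecip = -1 ./ x;          % the -1 rung, negated to preserve order

Moving down the ladder this way pulls in a long right tail; moving up the ladder does the opposite.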

Another possible transformation which can be quite useful, but which enjoys much less popularity, is ranking.  In fact, entire modeling procedures have been built around ranked data (see, for instance, the material on Theil's regression in Applied Nonparametric Statistical Methods, by Sprent).  One nice thing about ranking is that it ignores the shape of the distribution and considers only the order of the values.  This makes it a natural fit for any monotonic relationship (one whose graph moves only up, or only down), which covers a wide variety of behaviors.  Even when the data does not meet this condition, ranking can be quite helpful.
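A quick way to see the order-only character of ranking: a variable and any strictly increasing re-expression of it receive identical ranks.  A minimal sketch in base MATLAB (assuming distinct values, so ties can be ignored for the moment):

    x = exp(randn(1000, 1));                     % heavily skewed variable
    [~, idx] = sort(x);  rx(idx) = 1:numel(x);   % ranks of x
    y = log(x);                                  % monotonic re-expression of x
    [~, idy] = sort(y);  ry(idy) = 1:numel(y);   % ranks of y
    isequal(rx, ry)                              % true: only the order matters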

When using ranking as a pre-processing step, one important question to answer is: how do we rank new data values?  The original ranking is performed on the training data only.  One solution is to interpolate between the ranks of the training values.  If a new value matches one from the training set, then it receives that value's rank.  If the new value instead lies between two training values, we interpolate linearly between their ranks.  Note that this produces "ranks" which are not integers.  For instance, if the new value to be ranked is 111, and the two nearest training values are 110 (rank: 7) and 112 (rank: 8), then the interpolated rank is 7.5.
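A minimal sketch of that interpolation in MATLAB, using a hypothetical training set chosen so that 110 and 112 receive ranks 7 and 8 (tiedrank requires the Statistics Toolbox):

    xTrain = [95 98 101 104 107 109 110 112 120];  % hypothetical training values
    rTrain = tiedrank(xTrain);                     % ranks 1 through 9
    [xu, iu] = unique(xTrain);                     % interp1 needs distinct, increasing points
    rNew = interp1(xu, rTrain(iu), 111)            % 7.5, halfway between ranks 7 and 8

New values falling outside the training range need a policy of their own: plain interp1 returns NaN there, and clamping to the extreme ranks (or passing 'linear', 'extrap' to interp1) are two simple choices.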

One drawback of this approach is, of course, that the entire set of original values and their corresponding ranks must be carried along as part of the transformation process.  It may be more economical to retain only a subset of the original data, keeping enough (perhaps 100 or fewer value-rank pairs) to allow the simplified ranking to retain the shape of the data distribution.
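One way to build such a thinned value-rank table is sketched below; the choice of 100 pairs and the even spacing by rank are assumptions here, not requirements:

    xTrain = sort(exp(randn(10000, 1)));              % hypothetical training values, sorted
    rTrain = (1:numel(xTrain))';                      % sorted and distinct, so ranks are 1..n
    keep   = round(linspace(1, numel(xTrain), 100));  % about 100 pairs, evenly spaced by rank
    xSmall = xTrain(keep);
    rSmall = rTrain(keep);
    xNew   = [0.5 2.0 7.0];                           % new values to be ranked
    rNew   = interp1(xSmall, rSmall, xNew, 'linear', 'extrap')

Spacing the retained pairs evenly by rank amounts to keeping the empirical quantiles of the training data, which is exactly what preserves the shape of the distribution.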

MATLAB users with the Statistics Toolbox can use the tiedrank function to perform ranking (tiedrank also handles ties automatically), and it is not difficult to construct a similar process using base MATLAB's sort function.  Base MATLAB also provides several very handy interpolation functions, such as interp1.
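For readers without the toolbox, here is a base-MATLAB sketch of what tiedrank does (ordinal ranks via sort, then averaged within groups of tied values); the function name simpleTiedRank is invented for this example:

    function r = simpleTiedRank(x)
    % Rank the values of x, assigning tied values the average of their ranks.
    x = x(:);
    [~, idx] = sort(x);
    rOrd(idx) = 1:numel(x);                        % ordinal ranks, ties broken arbitrarily
    [~, ~, grp] = unique(x);                       % group indices for tied values
    rMean = accumarray(grp, rOrd(:), [], @mean);   % average rank within each tie group
    r = rMean(grp);                                % map averaged ranks back to each value
    end

For example, simpleTiedRank([10 20 20 30]) returns 1, 2.5, 2.5, 4, matching tiedrank on the same input.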

Keep in mind that univariate transformations, including ranking, typically do not affect logical machine learning solutions (decision tree induction and rule induction).  Since trees and rules usually test one variable at a time against a threshold, and ranking (like most common univariate transformations) preserves the order of the values, such transformations do not change the way the learning algorithm sees the data: every possible split separates exactly the same cases before and after the transformation.