## Saturday, November 23, 2013

I read a great deal of technical literature and highly recommend the same for anyone interested in developing their skill in this field. While many recent publications have proven worthwhile (2011's update of Data Mining by Witten and Frank; and 2010's Fundamentals of Predictive Text Mining, by Weiss, Indurkhya and Zhang are good examples), I confess being less than overwhelmed by many current offerings in the literature. I will not name names, but the first ten entries returned from my search of books at Amazon for "big data" left me unimpressed. While this field has enjoyed popular attention because of a colorful succession of new labels ("big data" merely being the most recent), many current books and articles remain too far down the wrong end of the flash/substance continuum.

For some time, I have been exploring older material, some from as far back as the 1920s. A good example would be a certain group of texts I've discovered, from the 1950s and 1960s- a period during which the practical value of statistical analysis had been firmly established in many fields, but long before the arrival of cheap computing machinery. At the time, techniques which maximized the statistician's productivity were ones which were most economical in their use of arithmetic calculations.

I first encountered such techniques in Introduction to Statistical Analysis, by Dixon and Massey (my edition is from 1957). Two chapters, in particular, are pertinent: "Microstatistics" and "Macrostatistics", which deal with very small and very large data sets, respectively (using the definitions of that time). One set of techniques described in this book involve the calculation of the mean and standard deviation of a set of numbers from a very small subset of their values. For instance, the mean of just 5 values- percentiles 10, 30, 50, 70 and 90- estimates the mean with 93% statistical efficiency.

How is this relevant to today's analysis? Assuming that the data is already sorted (which it often is, for tree induction techniques), extremely large data sets can be summarized with a very small number of operations (4 additions and 1 division, in this case), without appreciably degrading the quality of the estimate.

Data storage has grown faster than computer processing speed. Today, truly vast collections of data are easily acquired by even tiny organizations. Dealing with such data requires finding methods to accelerate information processing, which is exactly what authors in the 1950s and 1960s were writing about.

Readers may find a book on exactly this subject, Nonparametric and Shortcut Statistics, by Tate and Clelland (my copy is from 1959), to be of interest.

Quite a bit was written during the time period in question regarding statistical techniques which: 1. made economical use of calculation, 2. avoided many of the usual assumptions attached to more popular techniques (normal distributions, etc.) or 3. provided some resistance to outliers.

Genuinely new ideas, naturally, continue to emerge, but don't kid yourself: Much of what is standard practice today was established years ago. There's a book on my shelf, Probabilistic Models, by Springer, Herlihy, Mall and Beggs, published in 1968, which describes in detail what we would today call a naive Bayes regression model. (Indeed, Thomas Bayes' contributions were published in the 1760s.) Claude Shannon described information and entropy- today commonly used in rule- and tree-induction algorithms as well as regression regularization techniques- in the late 1940s. Logistic regression (today's tool of choice in credit scoring and many medical analyses) was initially developed by Pearl and Reed in the 1920s, and the logistic curve itself was used to forecast proportions (not probabilities) after being introduced by Verhulst in the late 1830s.

Ignore the history of our field at your peril.

## Sunday, January 13, 2013

### Ranking as a Pre-Processing Tool

To squeeze the most from data, analysts will often modify raw variables using mathematical transformations.  For example, in Data Analysis and Regression, Tukey described what he terms a "ladder of re-expression" which included a series of low-order roots and powers (and the logarithm) intended to adjust distributions to permit better results using linear regression.  Univariate adjustments using those particular transformations are fairly popular today, and are even incorporated directly into some machine learning software as part of the solution search.  The modified variables are fed to the analysis (modeling algorithm, clustering, visualization, etc.) just like the raw variables.

Another possible transformation which can be quite useful, but which enjoys much less popularity is ranking.  In fact, entire modeling procedures have been built around ranked data (see, for instance, the material on Theil's regression described in Applied Nonparametric Statistical Methods, by Sprent).  One nice thing about ranking is that it ignores the shape of the distribution and considers only the order of the values.  This is a natural fit to any monotonic relationship (where the graph of the data moves only up, or only down), which covers a wide variety of behaviors.  Even when the data does not meet this condition, ranking can be quite helpful.

When using ranking as a pre-processing step, one important question to answer is: How do we rank new data values?  The original ranking is performed on the training data only.  One solution is to interpolate between the ranks of the training values.  If a new value matches one from the training set, then it receives that value's rank.  If the new value instead lies between training values, we interpolate linearly between them.  Note that this results in "ranks" which are not integers.  For instance, if the new value to be ranked is 111, and the two nearest values are 110 (rank: 8) and 112 (rank: 7), then the interpolated rank is 7.5.

One drawback of this approach is, of course, that the entire set of original values and their corresponding ranks must be carried along as part of the transformation process.  It may be more economical to only retain a subset of the original data- keeping enough (perhaps 100 or less value-rank pairs) to allow the simplified ranking to retain the shape of the data distribution.

MATLAB users with the Statistics Toolbox can use the tiedrank function to perform ranking (tiedrank automatically handles ties, as well), and it is not difficult to construct a similar process using base MATLAB's sort function.  Base MATLAB also provides several very handy interpolation functions, such as interp1.

Keep in mind that univariate transformations, including ranking, typically do not affect logical machine learning solutions (decision tree induction and rule induction).  Since trees and rules usually test only one variable at a time, and most univariate transformations do not change the order of values, such transformations do not change the way the learning algorithm sees the data.