Friday, March 18, 2016

Four Books Worth Owning

Below are listed four books on statistics which I feel are worth owning. They largely take a "traditional" statistics perspective, as opposed to a machine learning/data mining one. With the exception of "The Statistical Sleuth", these are less like textbooks than guide-books, with information reflecting the experience and practical advice of their respective authors. Comparatively few of their pages are devoted to predictive modeling- rather, they cover a host of topics relevant to the statistical analyst: sample size determination, hypothesis testing, assumptions, sampling technique,  etc.

I have not given ISBNs since they vary by edition. Older editions of any of these will be valuable, so consider saving money by skipping the latest edition.

A final clarification: I am not giving a blanket endorsement to any of these books. I completely disagree with a few of their ideas. I see the value of such books in their use as "paper advisers", with different backgrounds and perspectives than my own.

Thursday, March 10, 2016

Re-Coding Categorical Variables

Categorical variables as candidate predictors pose a distinct challenge to the analyst, especially when they exhibit high cardinality (a large number of distinct values). Numerical models (for instance linear regression and most neural networks) cannot accept these variables directly as inputs, since operations between categories and numbers are not defined.

It is sometimes advantageous (even necessary) to re-code such variables as one or more numeric dummy variables, with each new variable containing a 0 or 1 value indicating the presence (1) or absence (0) of one distinct value. This often works well with very small numbers of distinct values. However, as cardinality increases, dummy variables quickly become inconvenient: they add potentially many parameters to the model, while reducing the average number of observations available for fitting each of those parameters.

One solution is to convert each individual categorical variable to a new, numeric variable representing a local summary (typically the mean) of the target variable. An example would be replacing the categorical variable, State, with a numeric variable, StateNumeric, representing the mean of the target variable within each state.

Unfortunately, there are often insufficient observations to reliably calculate summaries for all possible categories. To overcome this limitation, one might select either the n most frequent categories, or the n categories with the largest and smallest means. Though both of those techniques sometimes work well, they are, in reality, crude attempts to discover the categories with the most significant differences from the global mean.

A more robust method which I've found useful is to select only those categories whose mean target values are at least a minimum multiple of standard errors away from the global mean. The detailed procedure is as follows:

1. Calculate the global mean (the mean of the target variable, across all observations)
2. For each category…
  • a. Calculate the local mean and local standard error of the mean for the target variable (“local” meaning just for the observations in this category).
  • b. Calculate the absolute value of the difference between the local mean and global mean, and divide it by the local standard error.
  • c. If the calculation from B is greater than some chosen threshold (I usually use 3.0, but this may be adjusted to suit the context of the problem), then this category is replaced in the new variable by its local mean.
3. All variables not chosen in 2c to get their own value are collected in an “other” category, and all such observations are assigned the mean target for the “other” group.

The threshold can be judgmentally adjusted to suit the need of the problem, but I normally stay within the arbitrary range of 2.0 to 4.0. Missing values can be treated as their own category for the purposes of this algorithm.

Note that this procedure seeks the accumulation of two kinds of evidence that categories are significantly different from the global mean: category frequency and difference between category local mean and the global mean. Some categories may exhibit one and not the other. A frequent category may not get its own numeric value in the new variable in this procedure, if its local mean is too close to the global mean. At the same time, less frequent categories might be assigned their own numeric value, if their local mean is far enough away from the global mean. Notice that this procedure does not establish a single, minimum distance from the global mean for values to be broken away from the herd.

Though using a multiple of the standard error is only an approximation to a rigorous significance test, it is close enough for this purpose. Despite some limitations, this technique is very quick to calculate and I’ve found that it works well in practice: The number of variables remains the same (one new variable for each raw one) and missing values are handled cleanly.

Quick Refresher on Standard Errors

For numeric variables, the standard error of the mean of a sample of values is the standard deviation of those values divided by the square root of the number of values. In MATLAB, this would be:

StandardError = std(MyData) / sqrt(n)

For binary classification (0/1) variables, the standard error of a sample of values is the square root of this whole mess: the mean times 1 minus the mean, over the number of values. In MATLAB, this is:

p = mean(MyData); StandardError = sqrt( (p * (1 - p) ) / n)

Final Thoughts

Data preparation is an important part of this work, often being identified as the part of the process which takes the most time. This phase of the work offers tremendous opportunity for the analyst to be creative, and for the entire process to wring the most information out of the raw data.

This is why it is so strange that so little published material from experts is dedicated to it. Except in specialized fields, such as signal and image processing, one generally only finds bits of information on data preparation here and there in the literature. One exception is Dorian Pyle's book, "Data Preparation for Data Mining" (ISBN-13: 978-1558605299), which is entirely focused on this issue.

Monday, March 07, 2016

MATLAB as (Near-)PseudoCode

In "Teaching Data Science in English (Not in Math)", the Feb-08-2016 entry of his Web log, "The Datatist", Charles Givre criticizes the use of specialized math symbols (capital sigma for summation, etc.) and Greek letters as being confusing, especially to newcomers to the field. He offers, as an example, the following, traditional definition of sum of squared errors:

He suggests that "English" (pseudo-code) be used instead, such as the following:

residuals_squared = (actual_values - predictions) ^ 2
RSS = sum( residuals_squared )

Although there are some flaws in this particular comparison (1. the first example includes an unnecessary middle portion, 2. it also features the linear model definition, while the pseudo-code example does not, and 3. the indexing variable in the summation is straightforward in this case, but is not always so), I tend to agree. Despite my own familiarity with the math jargon, I agree with him that, in many cases, pseudo-code is easier to understand.

Pseudo-code is certainly simpler syntactically since it tends to make heavy use of (sometimes deeply-nested) functions, as opposed to floating subscripts, superscripts and baroque symbols. Pseudo-code often employs real names for parameters, variables and constants, as opposed to letters. It is also closer to computer code that one actually writes, which is the point of this post.

Given the convention in MATLAB to store variables as column vectors, here is my MATLAB implementation of Givre's pseudo-code:

residuals_squared = (actual_values - predictions) .^ 2;
RSS = sum( residuals_squared );

Only two minor changes were needed: 1. The MATLAB .^ operator raises each individual element in an input matrix to a power, while the MATLAB ^ operator raises the entire matrix to a power (through matrix multiplication). 2. The semicolons prevent echoing the result of each calculation to the console: They are not strictly necessary, but will typically be included in a MATLAB program. Note that, unlike some object-oriented languages we could name, no extra code is needed to lay the groundwork of overloaded operators.

Consider an example I have given in the past, for the forward calculation of a feedforward neural network using a hyperbolic tangent transfer function:

HiddenLayer = tanh(Input * HiddenWeights);
OutputLayer = tanh(HiddenLayer * OutputWeights);

Note that this code recalls the neural network over all observations stored in the matrix Input.That's it: Two lines of code, without looping, which execute efficiently (quickly) in MATLAB. The addition of bias nodes, jump connections and scaling of the output layer would require more code, but even such changes would leave this code easy to comprehend.