Tuesday, March 29, 2016

Who Rates the Raters?

Introduction

An interesting question which arises in many fields is the comparison of multiple ratings or rankings of the same items. Imagine that several real estate appraisers assess a number of properties. Questions of interest include: How similar are the ratings, across appraisers? On which items are the ratings most different? Why? Is there some way to consolidate differing ratings? Such questions arise when considering such diverse raters as movie critics, medical doctors, and investment advisors, and a variety of quantitative techniques have been developed to and these questions.

In his Mar-16-2016 posting, "The 25 Best Teams In The Tournament (Say Stat Geeks)", to the TeamRankings Web site, Seth Trachtman explores some inter-rater issues surrounding college basketball rankings. In the table at the bottom of his post, he include rankings from 5 different basketball experts for the 25 highest-ranked teams (out of a field of many more teams).

While there are quite a few correlation measures we might apply to pairs of raters, this posting examines the provided data using techniques described in my previous postings, Principal Components Analysis and Putting PCA to Work.


PCA

The table from Trachtman's article contains 5 columns, representing 5 different basketball raters, and 25 rows, representing 25 different NCAA teams. The raters are Trachtman's own TeamRankings.com, Ken Pomeroy, ESPN's Basketball Power Index, the ominously named Prediction Machine and LRMC from Georgia Tech:



After standardizing the original data, principal components analysis (PCA) reveals 5 new variables (the principal components), which are transformed versions of the original rankings. These new variables exhibit the following cumulative variances:

PCs   Variance
1         0.8816
2         0.9522
3         0.9820
4         0.9920
5         1.0000

We interpret this  summary to mean that the first principal component- note: a single variable- contains 88% of the total statistical variance in the original table. Clearly, there is substantial inter-rater agreement about these 25 teams. Further, retaining the first two principal components contains 95% of the total variance, 98% is contained in the first three, etc.

Reconstructing the rankings using only 3 principal components yields a new table in which 86% of rankings are within 1 (either direction) of the originals, and the maximum error (which occurs only 3 times in the entire table) has the new ranks off by 3 from the original. What this means is that the original data could pretty closely be approximated by a mixture of 3 artificial raters (the first 3 principal components).

It's interesting to note that the first principal component comes up with almost exactly the same ordering as Trachtman's Consensus, which is the mean ranking across all 5 raters (Miami and Texas A&M, which are tied in the Consensus are given a definite order by the first principal component). In situations like this one, the first principal component often ends up as a weighted average of the original data. In this case, the first principal component gives slightly more weight to Ken Pomeroy and LRMC.  Also of note, the second principal component treats TeamRankings, ESPN and LMRC as one group, and Ken Pomeroy and the Prediction Machine as another, based on the signs of its coefficients.


Conclusion

While these findings are notable, the table provided on TeamRankings gives a limited window into their total data set. It'd be interesting to repeat the above analysis with at least the 50 best ranked teams. Likewise, adding more raters might produce more surprising results from using the first principal component as an alternative to the Consensus.





Friday, March 18, 2016

Four Books Worth Owning

Below are listed four books on statistics which I feel are worth owning. They largely take a "traditional" statistics perspective, as opposed to a machine learning/data mining one. With the exception of "The Statistical Sleuth", these are less like textbooks than guide-books, with information reflecting the experience and practical advice of their respective authors. Comparatively few of their pages are devoted to predictive modeling- rather, they cover a host of topics relevant to the statistical analyst: sample size determination, hypothesis testing, assumptions, sampling technique,  etc.



I have not given ISBNs since they vary by edition. Older editions of any of these will be valuable, so consider saving money by skipping the latest edition.

A final clarification: I am not giving a blanket endorsement to any of these books. I completely disagree with a few of their ideas. I see the value of such books in their use as "paper advisers", with different backgrounds and perspectives than my own.

Thursday, March 10, 2016

Re-Coding Categorical Variables

Categorical variables as candidate predictors pose a distinct challenge to the analyst, especially when they exhibit high cardinality (a large number of distinct values). Numerical models (for instance linear regression and most neural networks) cannot accept these variables directly as inputs, since operations between categories and numbers are not defined.

It is sometimes advantageous (even necessary) to re-code such variables as one or more numeric dummy variables, with each new variable containing a 0 or 1 value indicating the presence (1) or absence (0) of one distinct value. This often works well with very small numbers of distinct values. However, as cardinality increases, dummy variables quickly become inconvenient: they add potentially many parameters to the model, while reducing the average number of observations available for fitting each of those parameters.

One solution is to convert each individual categorical variable to a new, numeric variable representing a local summary (typically the mean) of the target variable. An example would be replacing the categorical variable, State, with a numeric variable, StateNumeric, representing the mean of the target variable within each state.

Unfortunately, there are often insufficient observations to reliably calculate summaries for all possible categories. To overcome this limitation, one might select either the n most frequent categories, or the n categories with the largest and smallest means. Though both of those techniques sometimes work well, they are, in reality, crude attempts to discover the categories with the most significant differences from the global mean.

A more robust method which I've found useful is to select only those categories whose mean target values are at least a minimum multiple of standard errors away from the global mean. The detailed procedure is as follows:


1. Calculate the global mean (the mean of the target variable, across all observations)
2. For each category…
  • a. Calculate the local mean and local standard error of the mean for the target variable (“local” meaning just for the observations in this category).
  • b. Calculate the absolute value of the difference between the local mean and global mean, and divide it by the local standard error.
  • c. If the calculation from B is greater than some chosen threshold (I usually use 3.0, but this may be adjusted to suit the context of the problem), then this category is replaced in the new variable by its local mean.
3. All variables not chosen in 2c to get their own value are collected in an “other” category, and all such observations are assigned the mean target for the “other” group.



The threshold can be judgmentally adjusted to suit the need of the problem, but I normally stay within the arbitrary range of 2.0 to 4.0. Missing values can be treated as their own category for the purposes of this algorithm.

Note that this procedure seeks the accumulation of two kinds of evidence that categories are significantly different from the global mean: category frequency and difference between category local mean and the global mean. Some categories may exhibit one and not the other. A frequent category may not get its own numeric value in the new variable in this procedure, if its local mean is too close to the global mean. At the same time, less frequent categories might be assigned their own numeric value, if their local mean is far enough away from the global mean. Notice that this procedure does not establish a single, minimum distance from the global mean for values to be broken away from the herd.

Though using a multiple of the standard error is only an approximation to a rigorous significance test, it is close enough for this purpose. Despite some limitations, this technique is very quick to calculate and I’ve found that it works well in practice: The number of variables remains the same (one new variable for each raw one) and missing values are handled cleanly.


Quick Refresher on Standard Errors

For numeric variables, the standard error of the mean of a sample of values is the standard deviation of those values divided by the square root of the number of values. In MATLAB, this would be:

StandardError = std(MyData) / sqrt(n)

For binary classification (0/1) variables, the standard error of a sample of values is the square root of this whole mess: the mean times 1 minus the mean, over the number of values. In MATLAB, this is:

p = mean(MyData); StandardError = sqrt( (p * (1 - p) ) / n)


Final Thoughts

Data preparation is an important part of this work, often being identified as the part of the process which takes the most time. This phase of the work offers tremendous opportunity for the analyst to be creative, and for the entire process to wring the most information out of the raw data.

This is why it is so strange that so little published material from experts is dedicated to it. Except in specialized fields, such as signal and image processing, one generally only finds bits of information on data preparation here and there in the literature. One exception is Dorian Pyle's book, "Data Preparation for Data Mining" (ISBN-13: 978-1558605299), which is entirely focused on this issue.


Monday, March 07, 2016

MATLAB as (Near-)PseudoCode

In "Teaching Data Science in English (Not in Math)", the Feb-08-2016 entry of his Web log, "The Datatist", Charles Givre criticizes the use of specialized math symbols (capital sigma for summation, etc.) and Greek letters as being confusing, especially to newcomers to the field. He offers, as an example, the following, traditional definition of sum of squared errors:



He suggests that "English" (pseudo-code) be used instead, such as the following:

residuals_squared = (actual_values - predictions) ^ 2
RSS = sum( residuals_squared )


Although there are some flaws in this particular comparison (1. the first example includes an unnecessary middle portion, 2. it also features the linear model definition, while the pseudo-code example does not, and 3. the indexing variable in the summation is straightforward in this case, but is not always so), I tend to agree. Despite my own familiarity with the math jargon, I agree with him that, in many cases, pseudo-code is easier to understand.

Pseudo-code is certainly simpler syntactically since it tends to make heavy use of (sometimes deeply-nested) functions, as opposed to floating subscripts, superscripts and baroque symbols. Pseudo-code often employs real names for parameters, variables and constants, as opposed to letters. It is also closer to computer code that one actually writes, which is the point of this post.


Given the convention in MATLAB to store variables as column vectors, here is my MATLAB implementation of Givre's pseudo-code:

residuals_squared = (actual_values - predictions) .^ 2;
RSS = sum( residuals_squared );


Only two minor changes were needed: 1. The MATLAB .^ operator raises each individual element in an input matrix to a power, while the MATLAB ^ operator raises the entire matrix to a power (through matrix multiplication). 2. The semicolons prevent echoing the result of each calculation to the console: They are not strictly necessary, but will typically be included in a MATLAB program. Note that, unlike some object-oriented languages we could name, no extra code is needed to lay the groundwork of overloaded operators.

Consider an example I have given in the past, for the forward calculation of a feedforward neural network using a hyperbolic tangent transfer function:

HiddenLayer = tanh(Input * HiddenWeights);
OutputLayer = tanh(HiddenLayer * OutputWeights);


Note that this code recalls the neural network over all observations stored in the matrix Input.That's it: Two lines of code, without looping, which execute efficiently (quickly) in MATLAB. The addition of bias nodes, jump connections and scaling of the output layer would require more code, but even such changes would leave this code easy to comprehend.



Saturday, April 12, 2014

Probability: A Halloween Puzzle

Introduction

Though Halloween is months away, I found the following interesting and thought readers might enjoy examining my solution.

Recently, I was given the following probability question to answer:

Halloween Probability Puzzler

The number of trick-or-treaters knocking on my door in any five minute interval between 6 and 8pm on Halloween night is distributed as a Poisson with a mean of 5 (ignoring time effects). The number of pieces of candy taken by each child, in addition to the expected one piece per child, is distributed as a Poisson with a mean of 1. What is the minimum number of pieces of candy that I should have on hand so that I have only a 5% probability of running out?

I am not aware of the author of this puzzle nor its origin. If anyone cares to identify either, I'd be glad to give credit where it's due.


Interpretation

I interpret the phrase "ignoring time effects", above, to mean that there is no correlation among arrivals over time; in other words, the count of trick-or-treater arrivals during any five minute interval is statistically independent of the counts in the other intervals.

So, the described model of trick-or-treater behaviors is that they arrive in random numbers during each time interval, and each take a single piece of candy plus a random amount more. The problem, then, becomes finding the 95th percentile of the distribution of candy taken during the entire evening, since that's the amount beyond which we'd run out of candy 5% of the time.


Solution Idea 1 (A Dead End)

The most direct method of solving this problem, accounting for all possible outcomes, seemed hopelessly complex to me. Trying to match each possible number of pieces of candy taken over so many arrivals over so many intervals results in an astronomical number of combinations. Suspecting that someone with more expertise in combinatorics than myself might see a way through that mess, I quickly gave up on that idea and fell back to something I know better: the computer (leading to solution idea 2... ).


Solution Idea 2

The next thing which occurred to me was Monte Carlo simulation, which is way of solving problems by approximating probability distributions using many random samples from those distributions. Sometimes very many samples are required, but contemporary computer hardware easily accommodates this need (at least for this problem). With a bit of code, I could approximate the various distributions in this problem and their interactions and have plenty of candy come October 31.

While Monte Carlo simulation and its variants are used to solve very complex real-world problems, the basic idea is very simple: Draw samples from the indicated distributions, have them interact as they do in real life, and accumulate the results. Using the poissrnd() function provided in the MATLAB Statistics Toolbox, that is exactly what I did (apologies for the dreadful formatting):

% HalloweenPuzzler
% Program to solve the Halloween Probability Puzzler by the Monte Carlo method
% by Will Dwinnell
%
% Note: Requires the poissrnd() and prctile() functions from the Statistics Toolbox
%
% Last modified: Apr-02-2014


% Parameter
B = 4e5;   % How many times should we run the nightly simulation?

% Initialize
S = zeros(B,1);  % Vector of nightly simulation totals


% Loop over simulated Halloween evenings
for i = 1:B
  % Loop over 5-minute time intervals for this night
  for j = 1:24
    % Determine number of arrivals tonight
    Arrivals = poissrnd(5);
   
    % Are there any?
    if  (Arrivals > 0)
      % ...yes, so process them.
     
      % Loop over tonight's trick-or-treaters
      for k = 1:Arrivals
        % Add this trick-or-treater's candy to the total
        S(i) = S(i) + 1 + poissrnd(1);
      end
    end
  end
end

% Determine the 95th percentile of our simulated nightly totals for the answer
disp(['You need ' int2str(prctile(S,95)) ' pieces of candy.'])


% Graph the distribution of nightly totals
figure
hist(S,50)
grid on
title({'Halloween Probability Puzzler: Monte Carlo Solution',[int2str(B) ' Replicates']})
xlabel('Candy (pieces/night)')
ylabel('Frequency')




% EOF


Fair warning: This program takes a long time to run (hours, if run on a slow machine).

Hopefully the function of this simulation is adequately explained in the comments. To simulate a single Halloween night, the program loops over all time periods (there are 24 intervals, each 5-minutes long, between 6:00 and 8:00), then all arrivals (if any) within each time period. Inside the innermost loop, the number of pieces of candy taken by the given trick-or-treater is generated. The outermost loop governs the multiple executions of the nightly simulation.

The poissrnd() function generates random values from the Poisson distribution (which are always integers, in case you're fuzzy on the Poisson distribution). Its lone parameter is the mean of the Poisson distribution in question. A graph is generated at the end, displaying the simulated distribution for an entire night.



Important Caveat

Recall my mentioning that Monte Carlo is an approximate method, several paragraphs above. The more runs through this process, the closer it mimics the behavior of the probability distributions. My first run used 1e5 (one hundred thousand) runs, and I qualified my response to the person who gave me this puzzle that my answer, 282, was quite possibly off from the correct one "by a piece of candy or two". Indeed, I was off by one piece. Notice that the program listed above now employs 4e5 (four hundred thousand) runs, which yielded the exactly correct answer, 281, the first time I ran it.