Saturday, April 12, 2014

Probability: A Halloween Puzzle


Though Halloween is months away, I found the following interesting and thought readers might enjoy examining my solution.

Recently, I was given the following probability question to answer:

Halloween Probability Puzzler

The number of trick-or-treaters knocking on my door in any five minute interval between 6 and 8pm on Halloween night is distributed as a Poisson with a mean of 5 (ignoring time effects). The number of pieces of candy taken by each child, in addition to the expected one piece per child, is distributed as a Poisson with a mean of 1. What is the minimum number of pieces of candy that I should have on hand so that I have only a 5% probability of running out?

I am not aware of the author of this puzzle nor its origin. If anyone cares to identify either, I'd be glad to give credit where it's due.


I interpret the phrase "ignoring time effects", above, to mean that there is no correlation among arrivals over time; in other words, the count of trick-or-treater arrivals during any five minute interval is statistically independent of the counts in the other intervals.

So, the described model of trick-or-treater behaviors is that they arrive in random numbers during each time interval, and each take a single piece of candy plus a random amount more. The problem, then, becomes finding the 95th percentile of the distribution of candy taken during the entire evening, since that's the amount beyond which we'd run out of candy 5% of the time.

Solution Idea 1 (A Dead End)

The most direct method of solving this problem, accounting for all possible outcomes, seemed hopelessly complex to me. Trying to match each possible number of pieces of candy taken over so many arrivals over so many intervals results in an astronomical number of combinations. Suspecting that someone with more expertise in combinatorics than myself might see a way through that mess, I quickly gave up on that idea and fell back to something I know better: the computer (leading to solution idea 2... ).

Solution Idea 2

The next thing which occurred to me was Monte Carlo simulation, which is a way of solving problems by approximating probability distributions using many random samples from those distributions. Sometimes a great many samples are required, but contemporary computer hardware easily accommodates this need (at least for this problem). With a bit of code, I could approximate the various distributions in this problem and their interactions, and have plenty of candy come October 31.

While Monte Carlo simulation and its variants are used to solve very complex real-world problems, the basic idea is very simple: Draw samples from the indicated distributions, have them interact as they do in real life, and accumulate the results. Using the poissrnd() function provided in the MATLAB Statistics Toolbox, that is exactly what I did (apologies for the dreadful formatting):

% HalloweenPuzzler
% Program to solve the Halloween Probability Puzzler by the Monte Carlo method
% by Will Dwinnell
% Note: Requires the poissrnd() and prctile() functions from the Statistics Toolbox
% Last modified: Apr-02-2014

% Parameter
B = 4e5;   % How many times should we run the nightly simulation?

% Initialize
S = zeros(B,1);  % Vector of nightly simulation totals

% Loop over simulated Halloween evenings
% Loop over simulated Halloween evenings
for i = 1:B
  % Loop over 5-minute time intervals for this night
  for j = 1:24
    % Determine number of arrivals in this interval
    Arrivals = poissrnd(5);
    % Are there any?
    if (Arrivals > 0)
      % ...yes, so process them.
      % Loop over this interval's trick-or-treaters
      for k = 1:Arrivals
        % Add this trick-or-treater's candy to the total
        S(i) = S(i) + 1 + poissrnd(1);
      end
    end
  end
end

% Determine the 95th percentile of our simulated nightly totals for the answer
disp(['You need ' int2str(prctile(S,95)) ' pieces of candy.'])

% Graph the distribution of nightly totals
figure
hist(S,50)
grid on
title({'Halloween Probability Puzzler: Monte Carlo Solution',[int2str(B) ' Replicates']})
xlabel('Candy (pieces/night)')
ylabel('Frequency')


Fair warning: This program takes a long time to run (hours, if run on a slow machine).

Hopefully the function of this simulation is adequately explained in the comments. To simulate a single Halloween night, the program loops over all time periods (there are 24 intervals, each 5 minutes long, between 6:00 and 8:00), then over all arrivals (if any) within each time period. Inside the innermost loop, the number of pieces of candy taken by the given trick-or-treater is generated. The outermost loop governs the multiple executions of the nightly simulation.

The poissrnd() function generates random values from the Poisson distribution (which are always integers, in case you're fuzzy on the Poisson distribution). Its lone parameter is the mean of the Poisson distribution in question. A graph is generated at the end, displaying the simulated distribution for an entire night.
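For readers without MATLAB, here is a rough pure-Python equivalent of the same simulation. This is only a sketch: it draws Poisson variates with Knuth's method (fine for small means) and uses far fewer replicates than the MATLAB program, to keep the run time short, so its percentile estimate can easily be off by a piece or two.

```python
import math
import random

def poisson(lam, rng):
    """One Poisson variate via Knuth's method (fine for small means)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_night(rng):
    """Total pieces of candy taken over one simulated evening."""
    total = 0
    for _ in range(24):                     # 24 five-minute intervals, 6-8pm
        for _ in range(poisson(5.0, rng)):  # arrivals in this interval
            total += 1 + poisson(1.0, rng)  # one piece, plus Poisson(1) extra
    return total

rng = random.Random(0)
B = 10000                                   # fewer replicates than the MATLAB run
totals = sorted(simulate_night(rng) for _ in range(B))
answer = totals[int(0.95 * B)]              # crude 95th-percentile estimate
print(answer)                               # should land near 281
```

Note that indexing the sorted totals directly is a cruder percentile rule than MATLAB's prctile(), which interpolates between order statistics; for this many replicates the difference is negligible.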

Important Caveat

Recall my mentioning, several paragraphs above, that Monte Carlo is an approximate method. The more runs through this process, the more closely it mimics the behavior of the true probability distributions. My first run used 1e5 (one hundred thousand) replicates, and I cautioned the person who gave me this puzzle that my answer, 282, was quite possibly off from the correct one "by a piece of candy or two". Indeed, I was off by one piece. Notice that the program listed above now employs 4e5 (four hundred thousand) replicates, which yielded exactly the correct answer, 281, the first time I ran it.
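As a cross-check requiring no simulation at all: the total number of arrivals A over the 24 intervals is Poisson with mean 24 × 5 = 120, and, given A children, their extra pieces sum to a Poisson(A) variate (a sum of A independent Poisson(1) draws), so the distribution of the night's total candy T = A + extras can be tabulated exactly. A small Python sketch of that tabulation (the truncation point A_MAX is my choice; Poisson(120) mass beyond it is negligible):

```python
import math

def pois_pmf(k, lam):
    """Poisson probability mass, computed in log space for stability."""
    if lam == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

A_MAX = 300   # Poisson(120) mass beyond here is negligible
p_arrivals = [pois_pmf(a, 120.0) for a in range(A_MAX + 1)]

# Walk up the support of T until the CDF reaches 95%
cdf, t = 0.0, -1
while cdf < 0.95:
    t += 1
    # P(T = t) = sum over arrival counts a of P(A = a) * P(extras = t - a)
    cdf += sum(p_arrivals[a] * pois_pmf(t - a, float(a))
               for a in range(min(t, A_MAX) + 1))
print(t)   # the smallest stock with at least a 95% chance of lasting the night
```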

Saturday, November 23, 2013

Reading lately (Nov-2013)

I read a great deal of technical literature and highly recommend the same for anyone interested in developing their skill in this field. While many recent publications have proven worthwhile (2011's update of Data Mining by Witten and Frank; and 2010's Fundamentals of Predictive Text Mining, by Weiss, Indurkhya and Zhang are good examples), I confess being less than overwhelmed by many current offerings in the literature. I will not name names, but the first ten entries returned from my search of books at Amazon for "big data" left me unimpressed. While this field has enjoyed popular attention because of a colorful succession of new labels ("big data" merely being the most recent), many current books and articles remain too far down the wrong end of the flash/substance continuum.

For some time, I have been exploring older material, some from as far back as the 1920s. A good example would be a certain group of texts I've discovered, from the 1950s and 1960s- a period during which the practical value of statistical analysis had been firmly established in many fields, but long before the arrival of cheap computing machinery. At the time, techniques which maximized the statistician's productivity were ones which were most economical in their use of arithmetic calculations.

I first encountered such techniques in Introduction to Statistical Analysis, by Dixon and Massey (my edition is from 1957). Two chapters, in particular, are pertinent: "Microstatistics" and "Macrostatistics", which deal with very small and very large data sets, respectively (using the definitions of that time). One set of techniques described in this book involves calculating the mean and standard deviation of a set of numbers from a very small subset of their values. For instance, the mean of just 5 values- the 10th, 30th, 50th, 70th and 90th percentiles- estimates the population mean with 93% statistical efficiency.

How is this relevant to today's analysis? Assuming that the data is already sorted (which it often is, for tree induction techniques), extremely large data sets can be summarized with a very small number of operations (4 additions and 1 division, in this case), without appreciably degrading the quality of the estimate.
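Here is a small Python sketch of that shortcut mean (the exact percentile conventions in Dixon and Massey may differ slightly; the indexing below is just one simple choice, and the function and variable names are mine):

```python
import random
import statistics

def shortcut_mean(sorted_x):
    """Estimate the mean from five order statistics: the 10th, 30th,
    50th, 70th and 90th percentiles (4 additions and 1 division)."""
    n = len(sorted_x)
    picks = [sorted_x[round(p * (n - 1))] for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
    return sum(picks) / 5.0

rng = random.Random(1)
data = sorted(rng.gauss(10.0, 2.0) for _ in range(100_000))
est = shortcut_mean(data)   # very close to the full sample mean
```

Since the data is already sorted, the five lookups are constant-time, no matter how large the data set grows.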

Data storage has grown faster than computer processing speed. Today, truly vast collections of data are easily acquired by even tiny organizations. Dealing with such data requires finding methods to accelerate information processing, which is exactly what authors in the 1950s and 1960s were writing about.

Readers may find a book on exactly this subject, Nonparametric and Shortcut Statistics, by Tate and Clelland (my copy is from 1959), to be of interest.

Quite a bit was written during the time period in question regarding statistical techniques which: (1) made economical use of calculation, (2) avoided many of the usual assumptions attached to more popular techniques (normal distributions, etc.) or (3) provided some resistance to outliers.

Genuinely new ideas, naturally, continue to emerge, but don't kid yourself: Much of what is standard practice today was established years ago. There's a book on my shelf, Probabilistic Models, by Springer, Herlihy, Mall and Beggs, published in 1968, which describes in detail what we would today call a naive Bayes regression model. (Indeed, Thomas Bayes' contributions were published in the 1760s.) Claude Shannon described information and entropy- today commonly used in rule- and tree-induction algorithms as well as regression regularization techniques- in the late 1940s. Logistic regression (today's tool of choice in credit scoring and many medical analyses) was initially developed by Pearl and Reed in the 1920s, and the logistic curve itself was used to forecast proportions (not probabilities) after being introduced by Verhulst in the late 1830s.

Ignore the history of our field at your peril.

Sunday, January 13, 2013

Ranking as a Pre-Processing Tool

To squeeze the most from data, analysts will often modify raw variables using mathematical transformations.  For example, in Data Analysis and Regression, Tukey described what he terms a "ladder of re-expression" which included a series of low-order roots and powers (and the logarithm) intended to adjust distributions to permit better results using linear regression.  Univariate adjustments using those particular transformations are fairly popular today, and are even incorporated directly into some machine learning software as part of the solution search.  The modified variables are fed to the analysis (modeling algorithm, clustering, visualization, etc.) just like the raw variables.

Another possible transformation which can be quite useful, but which enjoys much less popularity is ranking.  In fact, entire modeling procedures have been built around ranked data (see, for instance, the material on Theil's regression described in Applied Nonparametric Statistical Methods, by Sprent).  One nice thing about ranking is that it ignores the shape of the distribution and considers only the order of the values.  This is a natural fit to any monotonic relationship (where the graph of the data moves only up, or only down), which covers a wide variety of behaviors.  Even when the data does not meet this condition, ranking can be quite helpful.

When using ranking as a pre-processing step, one important question to answer is: How do we rank new data values?  The original ranking is performed on the training data only.  One solution is to interpolate between the ranks of the training values.  If a new value matches one from the training set, then it receives that value's rank.  If the new value instead lies between training values, we interpolate linearly between them.  Note that this results in "ranks" which are not integers.  For instance, if the new value to be ranked is 111, and the two nearest training values are 110 (rank: 7) and 112 (rank: 8), then the interpolated rank is 7.5.
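A minimal Python sketch of this scheme (the function and variable names are mine; ties in the training data share their average rank, as with MATLAB's tiedrank, and new values beyond the training range are simply clamped to the extreme ranks, which is one possible choice among several):

```python
import bisect
from collections import defaultdict

def fit_ranks(train):
    """Sorted unique training values paired with their (tie-averaged) ranks."""
    positions = defaultdict(list)
    for r, v in enumerate(sorted(train), start=1):
        positions[v].append(r)
    values = sorted(positions)
    ranks = [sum(positions[v]) / len(positions[v]) for v in values]
    return values, ranks

def interp_rank(x, values, ranks):
    """Rank a new value by linear interpolation between training ranks."""
    if x <= values[0]:
        return ranks[0]               # clamp below the training range
    if x >= values[-1]:
        return ranks[-1]              # clamp above it
    i = bisect.bisect_left(values, x)
    if values[i] == x:
        return ranks[i]               # exact match: reuse that rank
    lo, hi = values[i - 1], values[i]
    frac = (x - lo) / (hi - lo)
    return ranks[i - 1] + frac * (ranks[i] - ranks[i - 1])

values, ranks = fit_ranks([100, 102, 104, 105, 107, 109, 110, 112, 115, 120])
print(interp_rank(111, values, ranks))   # 7.5, halfway between ranks 7 and 8
```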

One drawback of this approach is, of course, that the entire set of original values and their corresponding ranks must be carried along as part of the transformation process.  It may be more economical to only retain a subset of the original data- keeping enough (perhaps 100 or fewer value-rank pairs) to allow the simplified ranking to retain the shape of the data distribution.

MATLAB users with the Statistics Toolbox can use the tiedrank function to perform ranking (tiedrank automatically handles ties, as well), and it is not difficult to construct a similar process using base MATLAB's sort function.  Base MATLAB also provides several very handy interpolation functions, such as interp1.

Keep in mind that univariate transformations, including ranking, typically do not affect logical machine learning solutions (decision tree induction and rule induction).  Since trees and rules usually test only one variable at a time, and monotonic univariate transformations (ranking among them) do not change the order of values, such transformations do not change the way the learning algorithm sees the data.

Tuesday, July 31, 2012

The Good and the Bad of the Median

Perhaps the most fundamental statistical summary beyond simple counting or totaling is the mean.  The mean reduces a collection of numbers to a single value, and is one of a number of measures of location.  The mean is by far the most commonly used and widely understood way of averaging data, but it is not the only one, nor is it always the "best" one.  In terms of popularity, the median is a distant second, but it offers a mixture of behaviors which make it an appealing alternative in many circumstances.

One important property of the median is that it is not affected- at all- by extreme values.  No matter how we change any observation, its value will have zero effect on the sample median unless it changes from being above the median to below it, or vice versa.  If the maximum value in a data set is multiplied by a million, the median will not change.  This behavior is usually characterized as a benefit of the median, as it means that noisy or extreme values will not tug at the median.  This is in stark contrast to the sample mean, which blows around in the wind of any change to the data set.  Indeed, this is one of the nice things about working with the median:  It is highly resistant to noise.  Also typically cast as a benefit is the median's resistance to being tugged to extreme values by long-tailed distributions.  Whether this is truly a strength or a weakness, though, I will leave to the analyst to decide, since capturing the nature of a long-tail may be important for some problems.
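The contrast is easy to demonstrate with a toy example in Python (the values are made up):

```python
import statistics

data  = [3, 8, 12, 15, 21]
blown = data[:-1] + [21 * 1_000_000]   # inflate the maximum wildly

print(statistics.median(data), statistics.median(blown))   # 12 and 12
print(statistics.fmean(data), statistics.fmean(blown))     # 11.8 vs. 4200007.6
```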

Another important quality of the sample median is that, for odd numbers of observations, it is always a value from the original data set.  This is also true for many data sets with even numbers of observations.  In some situations, this is desirable.  Data summaries being shown to non-technical people often provide more aesthetic appeal if there are no "odd" values.  Consider a retail store in which all products are priced with .99 at the end, yet the reported mean ends in .73- this is normally not an issue with the median.  Replacing missing values with means very likely introduces new values to the data set, and this is especially awkward when dealing with variables which present limited distributions, such as counts.  As an example, consider a database with missing values for "number of children", which (one would expect!) would always be an integer.  Substituting the mean for missings may result in observations sporting "1.2" children.  This is not a problem with the median.  Note that this behavior has a darker side for signal- and image-processing problems: data treated with a windowed median tend to form stepped plateaus, rather than smooth curves.  Such artifacts can be distracting or worse.

On the downside, the median is not always the most statistically efficient summary.  This is a technical issue which means simply that the median may "wander" from the "true" value (the population parameter) more than other summaries when finite data are available.  For instance, when the data are known to be drawn from a true normal distribution, the sample mean is known to "wander" the least from the true value.  I'm sparing the reader the technical details here, but suffice it to say that, though the mean or median might be closer to the "true" value in any particular situation, the mean is likely to be closer most of the time.  Statistical efficiency may be measured, and it is worth noting that different summaries achieve different relative efficiencies, depending on the size of the data sample, and the shape of the population distribution.
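A quick Monte Carlo illustration of relative efficiency under normal data (a sketch only; for large samples the ratio of the two standard errors approaches roughly 1.25):

```python
import random
import statistics

rng = random.Random(2)
means, medians = [], []
for _ in range(2000):                  # many small normal samples
    sample = [rng.gauss(0.0, 1.0) for _ in range(25)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

sd_mean   = statistics.stdev(means)
sd_median = statistics.stdev(medians)
print(sd_median / sd_mean)             # greater than 1: the median wanders more
```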

Hopefully, this short note has whetted your appetite for the median and other alternatives to the mean.  I encourage the reader to explore yet other alternatives, such as trimmed means, of which the mean and median are special cases.

Saturday, December 11, 2010

Linear Discriminant Analysis (LDA)


Linear discriminant analysis (LDA) is one of the oldest mechanical classification systems, dating back to statistical pioneer Ronald Fisher, whose original 1936 paper on the subject, The Use of Multiple Measurements in Taxonomic Problems, can be found online (for example, here).

The basic idea of LDA is simple: for each class to be identified, calculate a (different) linear function of the attributes. The class function yielding the highest score represents the predicted class.

There are many linear classification models, and they differ largely in how the coefficients are established. One nice quality of LDA is that, unlike some of the alternatives, it does not require multiple passes over the data for optimization. Also, it naturally handles problems with more than two classes and it can provide probability estimates for each of the candidate classes.

Some analysts attempt to interpret the signs and magnitudes of the coefficients of the linear scores, but this can be tricky, especially when the number of classes is greater than 2.

LDA bears some resemblance to principal components analysis (PCA), in that a number of linear functions are produced (using all raw variables), which are intended, in some sense, to provide data reduction through rearrangement of information. (See the Feb-26-2010 posting to this log, Principal Components Analysis.) Note, though, some important differences: First, the objective of LDA is to maximize class discrimination, whereas the objective of PCA is to squeeze variance into as few components as possible. Second, LDA produces exactly as many linear functions as there are classes, whereas PCA produces as many linear functions as there are original variables. Last, principal components are always orthogonal to each other ("uncorrelated"), while that is not generally true for LDA's linear scores.

An Implementation

I have made available on MATLAB Central a routine, aptly named LDA, which performs all the necessary calculations. I'd like to thank Deniz Seviş, whose prompting got me to finally write this code (with her) and whose collaboration is very much appreciated.

Note that the LDA function assumes that the data it is fed is complete (no missing values), and it performs no attribute selection. Also, it requires only base MATLAB (no toolboxes needed).

Use of LDA is straightforward: the programmer supplies the input and target variables and, optionally, prior probabilities. The function returns the fitted linear discriminant coefficients. help LDA provides a good example:

% Generate example data: 2 groups, of 10 and 15, respectively
X = [randn(10,2); randn(15,2) + 1.5]; Y = [zeros(10,1); ones(15,1)];

% Calculate linear discriminant coefficients
W = LDA(X,Y);

This example randomly generates an artificial data set of two classes (labeled 0 and 1) and two input variables. The LDA function fits linear discriminants to the data, and stores the result in W. So, what is in W? Let's take a look:

>> W

W =

-1.1997 0.2182 0.6110
-2.0697 0.4660 1.4718

The first row contains the coefficients for the linear score associated with the first class (this routine orders the linear functions the same way as unique()). In this model, -1.1997 is the constant and 0.2182 and 0.6110 are the coefficients for the input variables for the first class (class 0). Coefficients for the second class's linear function are in the second row. Calculating the linear scores is easy:

% Calculate linear scores for training data
L = [ones(25,1) X] * W';

Each column represents the output of the linear score for one class. In this case, the first column is class 0, and the second column is class 1. For any given observation, the higher the linear score, the more likely that class. Note that LDA's linear scores are not probabilities, and may even assume negative values. Here are the values from my run:

>> L

L =

-1.9072 -3.8060
1.0547 3.2517
-1.2493 -2.0547
-1.0502 -1.7608
-0.6935 -0.8692
-1.6103 -2.9808
-1.3702 -2.4545
-0.2148 0.2825
0.4419 1.6717
0.2704 1.3067
1.0694 3.2670
-0.0207 0.7529
-0.2608 0.0601
1.2369 3.6135
-0.8951 -1.4542
0.2073 1.1687
0.0551 0.8204
0.1729 1.1654
0.2993 1.4344
-0.6562 -0.8028
0.2195 1.2068
-0.3070 0.0598
0.1944 1.2628
0.5354 2.0689
0.0795 1.0976
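To make the arithmetic behind these scores concrete, here is the same calculation in Python for a single made-up observation, using the coefficients in W from the run above (only the observation is invented; the coefficients are copied verbatim):

```python
# W from the run above: one row per class, constant term first
W = [[-1.1997, 0.2182, 0.6110],
     [-2.0697, 0.4660, 1.4718]]

def lda_scores(x, W):
    """Linear score for each class: constant plus dot product with the inputs."""
    return [row[0] + sum(w * xi for w, xi in zip(row[1:], x)) for row in W]

scores = lda_scores([1.5, 1.5], W)            # a made-up observation
predicted = max(range(len(scores)), key=scores.__getitem__)
print(scores, predicted)                      # class 1 wins: its score is higher
```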

To obtain estimated probabilities, simply run the linear scores through the softmax transform (exponentiate everything, and normalize so that they sum to 1.0):

% Calculate class probabilities
P = exp(L) ./ repmat(sum(exp(L),2),[1 2]);

As we see, most of the first 10 cases exhibit higher probabilities for class 0 (the first column) than for class 1 (the second column) and the reverse is true for the last 15 cases:

>> P

P =

0.8697 0.1303
0.1000 0.9000
0.6911 0.3089
0.6705 0.3295
0.5438 0.4562
0.7975 0.2025
0.7473 0.2527
0.3782 0.6218
0.2262 0.7738
0.2619 0.7381
0.1000 0.9000
0.3157 0.6843
0.4205 0.5795
0.0850 0.9150
0.6363 0.3637
0.2766 0.7234
0.3175 0.6825
0.2704 0.7296
0.2432 0.7568
0.5366 0.4634
0.2714 0.7286
0.4093 0.5907
0.2557 0.7443
0.1775 0.8225
0.2654 0.7346
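As a spot check, the softmax transform is easy to reproduce in Python for a single observation, using the first row of linear scores from the run above:

```python
import math

def softmax(scores):
    """Exponentiate the scores and normalize them to sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([-1.9072, -3.8060])   # first observation's linear scores
print(p)                          # about [0.87, 0.13], matching P's first row
```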

This model is not perfect, and would really need to be tested more rigorously (via holdout testing, k-fold cross validation, etc.) to determine how well it approximates the data.

I will not demonstrate its use here, but the LDA routine offers a facility for modifying the prior probabilities. Briefly, the function assumes that the true distribution of classes is whatever it observes in the training data. Analysts, however, may wish to adjust this distribution for several reasons, and the third, optional, parameter allows this. Note that the LDA routine presented here always performs the adjustment for prior probabilities: Some statistical software drops the adjustment for prior probabilities altogether if the user specifies that classes are equally likely, and will produce different results than LDA.

Closing Thoughts

Though it employs a fairly simple model structure, LDA has held up reasonably well, sometimes still besting more complex algorithms. When its assumptions are met, the literature records it doing better than logistic regression. It is very fast to execute and fitted models are extremely portable- even a spreadsheet will support linear models (...or, one supposes, paper and pencil!). LDA is at least worth trying at the beginning of a project, if for no other reason than to establish a lower bound on acceptable performance.

See Also

Feb-16-2010 posting, Single Neuron Training: The Delta Rule
Mar-15-2009 posting, Logistic Regression