tag:blogger.com,1999:blog-373246072024-03-14T03:48:12.508-04:00Data Mining in MATLABExploring data mining using MATLAB (and sometimes MATLAB Toolboxes).Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.comBlogger79125tag:blogger.com,1999:blog-37324607.post-6395711601392926482022-01-14T11:11:00.009-05:002022-01-20T04:50:02.011-05:00Implied Probabilities<p>Oddsmakers provide a useful prediction mechanism for many subjects of interest. Beyond sports, they host prediction markets for events in politics, entertainment, current events and other fields. There are some subtleties, however, in converting payout odds to probabilities. Note that payout odds (also called <i>payoff odds</i> or <i>house odds</i>) are expressed in this article as <i>fractional odds</i> (there are several other popular formats used in casinos and on-line bookmakers, such as <i>decimal odds</i> and <i>American odds</i>: they simply indicate the same information a different way).</p><p><br /></p><p><b>Payout Odds</b></p><p>It is important to understand the specific nature of <i>payout odds</i>, which are not the same as odds taught in statistics courses. Payout odds indicate the schedule of payment for win or loss at the conclusion of a wager. Typically, payout odds <b>overpredict</b> the probability of a specific outcome. Think about it this way: The <u>rarer</u> the event the player successfully predicts, the <u>more</u> the house will need to pay out. Hence, payout odds tend to be upwardly biased.</p><p>If the oddsmaker believes that the probability of an event is 50%, then a bet which favors neither the house nor the player would be 1/1: The house puts up $1 and the player puts up $1. To make such a situation work to the oddsmaker's advantage, the payout odds might be set by the house to, for instance, 5/6: the house puts up only $5 while the player puts up $6. The player in this case is expected to lose an average of $0.50: Half the time, the player leaves with $11, and the other half of the time the player leaves with nothing: an average payout of $5.50 after putting up $6.</p><p><br /></p><p><b>Implied Probability</b></p><p>Payout odds can be converted to <i>implied probabilities</i> by doing some quick arithmetic: the player's wager is divided by the total wager (the house wager plus the player's wager). Continuing with the hypothetical payout odds of 5/6, the implied probability is 0.54 = 6 / (5 + 6). The extra 0.04 versus the oddsmaker's prediction of 0.5 is the statistical bias of this implied probability.</p><p><br /></p><p><b>Mapping to the Probability Continuum</b></p><p>Interestingly, it is normally the case that the sum of the implied probabilities for real payout odds exceeds 1.0. This is because the oddsmaker generally inflates each of the payout odds (to make them all in the house's favor). One common method for dealing with this is to divide each of the implied probabilities by their total.</p><p>For an illustration using real payout odds, consider an actual wager on American politics offered on Jan-14-2022 at Bovada ( <a href="https://www.bovada.lv/">https://www.bovada.lv/</a> ), an on-line gambling house. The listed wager is: "Which Party Will Control The Senate After The 2022 Midterm Election?", and the listed outcomes are "Republican" with payout odds of 2/5 and "Democratic" with payout odds of 37/20. The implied probability of Republican control of the Senate is 0.71 = 5 / (2 + 5). The implied probability of Democrat control is 0.35 = 20 / (37 + 20). 
Note, significantly, that 0.71 + 0.35 > 1. To map these implied probabilities together into the probability space, one divides by their total, 1.06. Thus, using payout odds as our model, the estimated probability of Republican control is 0.67 = 0.71 / 1.06, and the estimated probability of Democrat control is 0.33 = 0.35 / 1.06.</p><p><br /></p><p><b>Important Details</b></p><p>A few important details are worth mentioning:</p><p>First, gambling is a product of human behavior. Betting is partially driven by emotion, and gamblers enjoy the same failings as the rest of humanity. This sometimes spoils their aim, and statistically biases predictions of the betting markets. On the positive side, the gain or loss of real money provides a powerful inducement for clear thinking, and in the long run tends to weed out poor predictors. It's one thing for an academic or pundit to casually make a prediction in a paper or on a talk show; it's quite another thing to do so when one's own money is on the line!</p><p>Second, payout odds are arguably driven by two forces: 1. the wisdom of crowds and 2. the desire of the house to minimize risk. The first force works in favor of statistically unbiased estimates, whereas the second one may bias payout odds. It has been suggested that oddsmakers will set odds to balance wagers on competing outcomes. Despite making less money per wager, this would ensure that "the losers pay the winners", avoiding situations in which the house pays out painfully on its own poor predictions.</p><p>Third, payout odds may vary from one casino or on-line house to another, but market forces (and the potential for arbitrage) tend to keep them fairly well aligned.</p><p>Last, commercial gambling houses tend to be fairly specific in the terms of their proposed wagers, regarding times, conditions, etc. In the real example given above, the wager will include a stipulation covering the event that neither political party wins outright control (a 50/50 split): the bet is cancelled, the party of the Vice President is deemed to break the tie, or some other such specific resolution is used.</p><p>Note, too, the <u>exact meaning</u> of such wagers. In the real example related here, the 0.67 represents the probability that the Republican party will control (have more than 50% of the membership of) the Senate. It does <b>not</b> indicate the percent of Senate seats which the Republicans will fill.</p><p><br /></p><p><b>Conclusion</b></p><p>Gambling markets are, at their foundation, information markets. Participants are powerfully motivated to deliver competent analysis. By aggregating such predictions, oddsmakers provide a competitive alternative to more traditional predictors, such as pollsters, forecasters and similar experts. Additionally, the events whose outcomes are anticipated by gambling houses are often important yet statistically awkward: high importance, low frequency events that often involve special, one-time circumstances which can stymie other prediction mechanisms.</p><p><br /></p>Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-21954541890572455132017-04-21T08:27:00.000-04:002017-04-21T08:27:52.194-04:00A New Dawn for Local Learning Methods?The relentless improvement in speed of computers continues. 
While some technical barriers to this progress have begun to emerge, exploitation of parallelism has actually increased the rate of acceleration for many purposes, especially in applied mathematical fields such as data mining.<br />
<br />
Interestingly, new, powerful hardware has been put to the task of running ever more baroque algorithms. Feedforward neural networks, once trained over several days, now train in minutes on affordable desktop hardware. Over time, ever fancier algorithms have been fed to these machines: boosting, support vector machines, random forests and, most recently, deep learning illustrate this trend.<br />
<br />
Another class of learning algorithms may also benefit from developments in hardware: local learning methods. Typical of local methods are radial basis function (RBF) neural networks and k-nearest neighbors (k-NN). RBF neural networks were briefly popular in the heyday of neural networks (the 1990s) since they train much faster than the more popular feedforward neural networks. k-NN is often discussed in chapter 1 of machine learning books: it is conceptually simple, easy to implement and demonstrates the advantages and disadvantages of local techniques well.<br />
<br />
Local learning techniques usually have a large number of components, each of which handles only a small fraction of the set of possible input cases. The nice thing about this approach is that these local components largely do not need to coordinate with each other: The complexity of the model comes from having a large number of such components to handle many different situations. Local learning techniques thus make training easy: In the case of k-NN, one simply stores the training data for future reference. Little, if any, "fitting" is done during learning. This gift comes with a price, though: Local learning systems train very quickly, but model execution is often rather slow. This is because local models will either fire all of those local components, or spend time figuring out which among them applies to any given situation.<br />
<br />
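As a rough sketch of just how little machinery a local method needs, here is a minimal k-nearest neighbor classifier in base MATLAB. The variable names are purely illustrative, and a practical version would also want input standardization, distance weighting and tie handling:<br />
<br />
% Minimal k-NN classification sketch (illustrative only)<br />
% XTrain: training inputs (one row per observation), YTrain: class labels<br />
% XNew: new observations to classify, k: number of neighbors<br />
k = 5;<br />
YNew = zeros(size(XNew,1),1);<br />
for i = 1:size(XNew,1)<br />
% "Training" was merely storing XTrain and YTrain; all the work happens here<br />
Distances = sqrt(sum((XTrain - repmat(XNew(i,:),[size(XTrain,1) 1])) .^ 2,2));<br />
[Ignore SortIndex] = sort(Distances);<br />
YNew(i) = mode(YTrain(SortIndex(1:k))); % majority vote among the k nearest neighbors<br />
end<br />
<br />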
Local learning methods have largely fallen out of favor since: 1. they are slow to predict outcomes for new cases and, secondarily, 2. their implementation requires retention of some or all of the training data. This author wonders whether contemporary computer hardware may not present an opportunity for a resurgence among local methods. Local methods often perform well statistically, and would help diversify model ensembles for users of more popular learning algorithms. Analysts looking for that last measure of improvement might be well served by investigating this class of solutions. Local algorithms are among the easiest to code from scratch. Interested readers are directed to "Lazy Learning", edited by D. Aha (ISBN-13: 978-0792345848) and "Nearest Neighbor Norms: NN Pattern Classification Techniques", edited by B. Dasarathy (ISBN-13: 978-0818689307).<br />
<br />
<br />Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-31838188144953665132017-03-29T15:20:00.003-04:002017-03-29T15:20:41.696-04:00Data Analytics Summit III at Harrisburg University of Science and TechnologyHarrisburg University of Science and Technology (Harrisburg, Pennsylvania) has just finished hosting Data Analytics Summit III. This is a multi-day event featuring a mix of presenters from the private sector, the government and government-related businesses, and academia, which spans research, practice and more visionary ("big picture") topics. The theme was "Analytics Applied: Case Studies, Measuring Impact, and Communicating Results".<br />
<br />
Regrettably, I was unable to attend this time because I was traveling for business, but I was at Data Analytics Summit II, which was held in December of 2015. If you haven't been: Harrisburg University of Science and Technology does a nice job hosting this event. Additionally, (so far) the Data Analytics Summit has been free of charge, so there is the prospect of free food if you are a starving grad student.<br />
<br />
The university has generously provided links to video of the presentations from the most recent Summit:<br />
<br />
<a href="http://analyticssummit.harrisburgu.edu/" target="_blank">http://analyticssummit.harrisburgu.edu/</a><br />
<br />
<br />
Video links for the previous Summit, whose theme was <i>unstructured data</i>, can be found at the bottom of my article, "Unstructured Data Mining - A Primer" (Apr-11-2016) over on <b>icrunchdata</b>:<br />
<br />
<a href="https://icrunchdata.com/unstructured-data-mining-primer/" target="_blank">https://icrunchdata.com/unstructured-data-mining-primer/</a><br />
<br />
I encourage readers to explore this free resource.<br />
<br />
<br />Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-26248333887111361792017-03-17T23:42:00.000-04:002017-04-21T12:08:38.575-04:00Geographic Distances: A Quick Trip Around the Great CircleRecently, I wanted to calculate the distance between locations on the Earth. Having found a handy solution, I thought readers might be interested. In my situation, location data included ZIP codes (American postal codes). Also available to me was a look-up table of the latitude and longitude of the geometric centroid of each ZIP code. Since the areas identified by ZIP codes are usually geographically small, and making the "close enough" assumption that this planet is perfectly spherical, trigonometry will allow distance calculations which are, for most purposes, precise enough.<br />
<br />
Given the latitude and longitude of cities 'A' and 'B', the following line of MATLAB code will calculate the distance between the two coordinates "as the crow flies" (technically, the "great circle distance"), in kilometers:<br />
<br />
<i>DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)));</i><br />
<br />
Note that latitude and longitude are expected as decimal degrees. If your data is in degrees/minutes/seconds, a quick conversion will be needed.<br />
<br />
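For instance, assuming the degrees, minutes and seconds are held in separate variables (the names here are purely illustrative), one such conversion is:<br />
<br />
<i>% Degrees/minutes/seconds to decimal degrees<br />
% (negate the result for southern latitudes and western longitudes)<br />
DecimalDegrees = Degrees + Minutes / 60 + Seconds / 3600;</i><br />
<br />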
I've checked this formula against a second source and quickly verified it using a few pairs of cities:<br />
<br />
<br />
<i>% 'A' = New York</i><br />
<i>% 'B' = Atlanta</i><br />
<i>% Random on-line reference: 1202km</i><br />
<i>LatA = 40.664274;</i><br />
<i>LongA = -73.9385;</i><br />
<i>LatB = 33.762909;</i><br />
<i>LongB = -84.422675;</i><br />
<i>DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))</i><br />
<br />
DistanceKilometers =<br />
<br />
1202<br />
<br />
<i><br /></i>
<i>% 'A' = New York</i><br />
<i>% 'B' = Los Angeles</i><br />
<i>% Random on-line reference: 3940km (less than 0.5% difference)</i><br />
<i>LatA = 40.664274;</i><br />
<i>LongA = -73.9385;</i><br />
<i>LatB = 34.019394;</i><br />
<i>LongB = -118.410825;</i><br />
<i>DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))</i><br />
<br />
DistanceKilometers =<br />
<br />
3955<br />
<br />
<br />
<br />
<b> References:</b><br />
<br />
"How Far is Berlin?", by Alan Zeichick, published in the Sep-1991 issue of "Computer Language" magazine. Note that Zeichick credits as his source an HP-27 scientific calculator, from which he reverse-engineered the formula above.<br />
<br />
"Trigonometry DeMYSTiFieD, 2nd edition", by Stan Gibilisco (ISBN: 978-0-07-178024-7)<br />
<br />Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-62671966098001179362016-03-18T22:02:00.001-04:002016-03-20T02:21:58.112-04:00Four Books Worth OwningBelow are listed four books on statistics which I feel are worth owning. They largely take a "traditional" statistics perspective, as opposed to a machine learning/data mining one. With the exception of "The Statistical Sleuth", these are less like textbooks than guide-books, with information reflecting the experience and practical advice of their respective authors. Comparatively few of their pages are devoted to predictive modeling- rather, they cover a host of topics relevant to the statistical analyst: sample size determination, hypothesis testing, assumptions, sampling technique, etc.<br />
<br />
<br />
<ul>
<li><a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118211278.html" target="_blank">"Common Errors in Statistics", by Good and Hardin</a></li>
<li><a href="http://vanbelle.org/struts.htm" target="_blank">"Statistical Rules of Thumb", by van Belle</a></li>
<li><a href="https://us.sagepub.com/en-us/nam/your-statistical-consultant/book235922" target="_blank">"Your Statistical Consultant", by Newton and Rudestam</a></li>
<li><a href="http://statisticalsleuth.com/" target="_blank">"The Statistical Sleuth", by Ramsey and Schafer</a></li>
</ul>
<br />
I have not given ISBNs since they vary by edition. Older editions of any of these will be valuable, so consider saving money by skipping the latest edition.<br />
<br />
A final clarification: I am not giving a blanket endorsement to any of these books. I completely disagree with a few of their ideas. I see the value of such books in their use as "paper advisers", with different backgrounds and perspectives than my own.<br />
<br />Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-72493184631199622842016-03-10T15:34:00.000-05:002016-03-11T14:36:09.346-05:00Re-Coding Categorical VariablesCategorical variables as candidate predictors pose a distinct challenge to the analyst, especially when they exhibit high cardinality (a large number of distinct values). Numerical models (for instance linear regression and most neural networks) cannot accept these variables directly as inputs, since operations between categories and numbers are not defined.<br />
<br />
It is sometimes advantageous (even necessary) to re-code such variables as one or more numeric dummy variables, with each new variable containing a 0 or 1 value indicating the presence (1) or absence (0) of one distinct value. This often works well with very small numbers of distinct values. However, as cardinality increases, dummy variables quickly become inconvenient: they add potentially many parameters to the model, while reducing the average number of observations available for fitting each of those parameters.<br />
<br />
One solution is to convert each individual categorical variable to a new, numeric variable representing a local summary (typically the mean) of the target variable. An example would be replacing the categorical variable, State, with a numeric variable, StateNumeric, representing the mean of the target variable within each state.<br />
<br />
Unfortunately, there are often insufficient observations to reliably calculate summaries for all possible categories. To overcome this limitation, one might select either the n most frequent categories, or the n categories with the largest and smallest means. Though both of those techniques sometimes work well, they are, in reality, crude attempts to discover the categories with the <b>most significant</b> differences from the global mean.<br />
<br />
A more robust method which I've found useful is to select only those categories whose mean target values are at least a minimum multiple of standard errors away from the global mean. The detailed procedure is as follows: <br />
<br />
<br />
1. Calculate the global mean (the mean of the target variable, across all observations)<br />
2. For each category…<br />
<ul>
<li>a. Calculate the local mean and local standard error of the mean for the target variable (“local” meaning just for the observations in this category).</li>
<li>b. Calculate the absolute value of the difference between the local mean and global mean, and divide it by the local standard error.</li>
<li>c. If the calculation from 2b is greater than some chosen threshold (I usually use 3.0, but this may be adjusted to suit the context of the problem), then this category is replaced in the new variable by its local mean.</li>
</ul>
3. All categories not chosen in 2c to get their own value are collected in an “other” category, and all such observations are assigned the mean target for the “other” group.<br />
<br />
<br />
<br />
The threshold can be judgmentally adjusted to suit the need of the problem, but I normally stay within the arbitrary range of 2.0 to 4.0. Missing values can be treated as their own category for the purposes of this algorithm.<br />
<br />
Note that this procedure seeks the accumulation of two kinds of evidence that categories are significantly different from the global mean: category frequency and difference between category local mean and the global mean. Some categories may exhibit one and not the other. A frequent category may not get its own numeric value in the new variable in this procedure, if its local mean is too close to the global mean. At the same time, less frequent categories might be assigned their own numeric value, if their local mean is far enough away from the global mean. Notice that this procedure does not establish a single, minimum distance
from the global mean for values to be broken away from the herd.<br />
<br />
Though using a multiple of the standard error is only an approximation to a rigorous significance test, it is close enough for this purpose. Despite some limitations, this technique is very quick to calculate and I’ve found that it works well in practice: The number of variables remains the same (one new variable for each raw one) and missing values are handled cleanly.<br />
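<br />
To make the procedure concrete, here is a rough sketch in base MATLAB. The variable names are illustrative, the standard error used is the one for numeric targets (see the refresher below), and a real implementation would guard against very small categories whose local standard error is zero:<br />
<br />
% Sketch of the category re-coding procedure (illustrative)<br />
% Category: cell array of category labels, Target: numeric target variable<br />
Threshold = 3.0;<br />
GlobalMean = mean(Target);<br />
CategoryList = unique(Category);<br />
NewVariable = nan(size(Target));<br />
OtherIndex = true(size(Target));<br />
for i = 1:numel(CategoryList)<br />
ThisCategory = strcmp(Category,CategoryList{i});<br />
LocalMean = mean(Target(ThisCategory));<br />
LocalSE = std(Target(ThisCategory)) / sqrt(sum(ThisCategory));<br />
if (abs(LocalMean - GlobalMean) / LocalSE > Threshold)<br />
% Significant category: it keeps its own local mean<br />
NewVariable(ThisCategory) = LocalMean;<br />
OtherIndex(ThisCategory) = false;<br />
end<br />
end<br />
% Everything else is pooled into the "other" group<br />
NewVariable(OtherIndex) = mean(Target(OtherIndex));<br />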
<br />
<br />
<b>Quick Refresher on Standard Errors</b><br />
<br />
For numeric variables, the standard error of the mean of a sample of values is the standard deviation of those values divided by the square root of the number of values. In MATLAB, this would be:<br />
<br />
StandardError = std(MyData) / sqrt(numel(MyData))<br />
<br />
For binary classification (0/1) variables, the standard error of a sample of values is the square root of this whole mess: the mean times 1 minus the mean, over the number of values. In MATLAB, this is:<br />
<br />
p = mean(MyData); StandardError = sqrt( (p * (1 - p) ) / numel(MyData))<br />
<br />
<br />
<b>Final Thoughts</b><br />
<br />
Data preparation is an important part of this work, often being identified as the part of the process which takes the most time. This phase of the work offers tremendous opportunity for the analyst to be creative, and for the entire process to wring the most information out of the raw data.<br />
<br />
This is why it is so strange that so little published material from experts is dedicated to it. Except in specialized fields, such as signal and image processing, one generally only finds bits of information on data preparation here and there in the literature. One exception is Dorian Pyle's book, "Data Preparation for Data Mining" (ISBN-13: 978-1558605299), which is entirely focused on this issue.<br />
<br />
<br />Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-76352699562725234372016-03-07T10:45:00.000-05:002016-03-08T13:29:32.439-05:00MATLAB as (Near-)PseudoCodeIn <a href="http://thedataist.com/teaching-data-science-in-english-not-in-math/" target="_blank">"Teaching Data Science in English (Not in Math)"</a>, the Feb-08-2016 entry of his Web log, "The Datatist", Charles Givre criticizes the use of specialized math symbols (capital sigma for summation, etc.) and Greek letters as being confusing, especially to newcomers to the field. He offers, as an example, the following, traditional definition of sum of squared errors:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-HTjgDD75ZHQ/Vt2jpt1DJHI/AAAAAAAAAL8/DQ3CJMy6vV8/s1600/Untitled.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="46" src="https://2.bp.blogspot.com/-HTjgDD75ZHQ/Vt2jpt1DJHI/AAAAAAAAAL8/DQ3CJMy6vV8/s320/Untitled.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
He suggests that "English" (pseudo-code) be used instead, such as the following:<br />
<br />
<code>residuals_squared = (actual_values - predictions) ^ 2<br />
RSS = sum( residuals_squared )</code> <br />
<br />
Although there are some flaws in this particular comparison (1. the first example includes an unnecessary middle portion, 2. it also features the linear model definition, while the pseudo-code example does not, and 3. the indexing variable in the summation is straightforward in this case, but is not always so), I tend to agree. Despite my own familiarity with the math jargon, I agree with him that, in many cases, pseudo-code is easier to understand.<br />
<br />
Pseudo-code is certainly simpler syntactically since it tends to make heavy use of (sometimes deeply-nested) functions, as opposed to floating subscripts, superscripts and baroque symbols. Pseudo-code often employs real names for parameters, variables and constants, as opposed to letters. It is also closer to computer code that one actually writes, which is the point of this post.<br />
<br />
<br />
Given the convention in MATLAB to store variables as column vectors, here is my MATLAB implementation of Givre's pseudo-code:<br />
<br />
<code>residuals_squared = (actual_values - predictions) .^ 2;<br />
RSS = sum( residuals_squared );</code> <br />
<br />
Only two minor changes were needed: 1. The MATLAB .^ operator raises each individual element in an input matrix to a power, while the MATLAB ^ operator raises the entire matrix to a power (through matrix multiplication). 2. The semicolons prevent echoing the result of each calculation to the console: They are not strictly necessary, but will typically be included in a MATLAB program. Note that, unlike some object-oriented languages we could name, no extra code is needed to lay the groundwork of overloaded operators.<br />
<br />
Consider an example I have given in the past, for the forward calculation of a feedforward neural network using a hyperbolic tangent transfer function:<br />
<br />
<code>HiddenLayer = tanh(Input * HiddenWeights);<br />OutputLayer = tanh(HiddenLayer * OutputWeights);</code> <br />
<br />
Note that this code recalls the neural network over <u>all</u> observations stored in the matrix <i>Input</i>. That's it: Two lines of code, without looping, which execute efficiently (quickly) in MATLAB. The addition of bias nodes, jump connections and scaling of the output layer would require more code, but even such changes would leave this code easy to comprehend.<br />
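<br />
For instance, a sketch of the same forward pass with bias nodes added might look like the following, assuming each weight matrix has been given an extra first row to hold the bias terms:<br />
<br />
<code>HiddenLayer = tanh([ones(size(Input,1),1) Input] * HiddenWeights);<br />OutputLayer = tanh([ones(size(HiddenLayer,1),1) HiddenLayer] * OutputWeights);</code> <br />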
<br />
<br />
<br />Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-73010153412556639302014-04-12T22:22:00.000-04:002014-04-13T09:32:20.435-04:00Probability: A Halloween Puzzle<b>Introduction</b><br />
<br />
Though Halloween is months away, I found the following interesting and thought readers might enjoy examining my solution.<br />
<br />
Recently, I was given the following probability question to answer:<br />
<br />
<blockquote class="tr_bq">
Halloween Probability Puzzler<br />
<br />
The number of trick-or-treaters knocking on my door in any five minute interval between 6 and 8pm on Halloween night is distributed as a Poisson with a mean of 5 (ignoring time effects). The number of pieces of candy taken by each child, in addition to the expected one piece per child, is distributed as a Poisson with a mean of 1. What is the minimum number of pieces of candy that I should have on hand so that I have only a 5% probability of running out?</blockquote>
<br />
I am not aware of the author of this puzzle nor its origin. If anyone cares to identify either, I'd be glad to give credit where it's due.<br />
<br />
<br />
<b>Interpretation</b><br />
<br />
I interpret the phrase "ignoring time effects", above, to mean that there is no correlation among arrivals over time; in other words, the count of trick-or-treater arrivals during any five minute interval is statistically independent of the counts in the other intervals.<br />
<br />
So, the described model of trick-or-treater behaviors is that they arrive in random numbers during each time interval, and each take a single piece of candy plus a random amount more. The problem, then, becomes finding the 95th percentile of the distribution of candy taken during the entire evening, since that's the amount beyond which we'd run out of candy 5% of the time.<br />
<br />
<br />
<b>Solution Idea 1 (A Dead End)</b><br />
<br />
The most direct method of solving this problem, accounting for all possible outcomes, seemed hopelessly complex to me. Trying to match each possible number of pieces of candy taken over so many arrivals over so many intervals results in an astronomical number of combinations. Suspecting that someone with more expertise in combinatorics than myself might see a way through that mess, I quickly gave up on that idea and fell back to something I know better: the computer (leading to solution idea 2... ).<br />
<br />
<br />
<b>Solution Idea 2</b><br />
<br />
The next thing which occurred to me was Monte Carlo simulation, which is a way of solving problems by approximating probability distributions using many random samples from those distributions. Sometimes very many samples are required, but contemporary computer hardware easily accommodates this need (at least for this problem). With a bit of code, I could approximate the various distributions in this problem and their interactions and have plenty of candy come October 31.<br />
<br />
While Monte Carlo simulation and its variants are used to solve very complex real-world problems, the basic idea is very simple: Draw samples from the indicated distributions, have them interact as they do in real life, and accumulate the results. Using the <i>poissrnd()</i> function provided in the MATLAB Statistics Toolbox, that is exactly what I did (apologies for the dreadful formatting):<br />
<br />
% HalloweenPuzzler<br />
% Program to solve the Halloween Probability Puzzler by the Monte Carlo method<br />
% by Will Dwinnell<br />
%<br />
% Note: Requires the poissrnd() and prctile() functions from the Statistics Toolbox<br />
%<br />
% Last modified: Apr-02-2014<br />
<br />
<br />
% Parameter<br />
B = 4e5; % How many times should we run the nightly simulation?<br />
<br />
% Initialize<br />
S = zeros(B,1); % Vector of nightly simulation totals<br />
<br />
<br />
% Loop over simulated Halloween evenings<br />
for i = 1:B<br />
% Loop over 5-minute time intervals for this night<br />
for j = 1:24<br />
% Determine number of arrivals tonight<br />
Arrivals = poissrnd(5);<br />
<br />
% Are there any?<br />
if (Arrivals > 0)<br />
% ...yes, so process them.<br />
<br />
% Loop over tonight's trick-or-treaters<br />
for k = 1:Arrivals<br />
% Add this trick-or-treater's candy to the total<br />
S(i) = S(i) + 1 + poissrnd(1);<br />
end<br />
end<br />
end<br />
end<br />
<br />
% Determine the 95th percentile of our simulated nightly totals for the answer<br />
disp(['You need ' int2str(prctile(S,95)) ' pieces of candy.'])<br />
<br />
<br />
% Graph the distribution of nightly totals<br />
figure<br />
hist(S,50)<br />
grid on<br />
title({'Halloween Probability Puzzler: Monte Carlo Solution',[int2str(B) ' Replicates']})<br />
xlabel('Candy (pieces/night)')<br />
ylabel('Frequency')<br />
<br />
<br />
<br />
<br />
% EOF <br />
<br />
<br />
Fair warning: This program takes a <u>long</u> time to run (hours, if run on a slow machine).<br />
<br />
Hopefully the function of this simulation is adequately explained in the comments. To simulate a single Halloween night, the program loops over all time periods (there are 24 intervals, each 5 minutes long, between 6:00 and 8:00), then all arrivals (if any) within each time period. Inside the innermost loop, the number of pieces of candy taken by the given trick-or-treater is generated. The outermost loop governs the multiple executions of the nightly simulation.<br />
<br />
The <i>poissrnd()</i> function generates random values from the Poisson distribution (which are always integers, in case you're fuzzy on the Poisson distribution). Its lone parameter is the mean of the Poisson distribution in question. A graph is generated at the end, displaying the simulated distribution for an entire night.<br />
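<br />
As an aside, the simulation can be collapsed considerably by exploiting the fact that a sum of independent Poisson variables is itself Poisson: the arrivals for a whole night follow a Poisson with mean 24 * 5 = 120, and the candy taken beyond the guaranteed piece per child follows a Poisson whose mean equals the number of arrivals. A vectorized sketch of the same idea (minus the explicit loops) would be:<br />
<br />
% Vectorized version of the nightly simulation<br />
N = poissrnd(120,B,1); % arrivals for each simulated night<br />
S = N + poissrnd(N); % one piece per child, plus the Poisson(1) extras<br />
disp(['You need ' int2str(prctile(S,95)) ' pieces of candy.'])<br />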
<br />
<br />
<br />
<b>Important Caveat</b><br />
<br />
Recall my mentioning that Monte Carlo is an <u>approximate</u> method, several paragraphs above. The more runs through this process, the closer it mimics the behavior of the probability distributions. My first run used 1e5 (one hundred thousand) runs, and I qualified my response to the person who gave me this puzzle that my answer, 282, was quite possibly off from the correct one "by a piece of candy or two". Indeed, I was off by one piece. Notice that the program listed above now employs 4e5 (four hundred thousand) runs, which yielded the exactly correct answer, 281, the first time I ran it.<br />
<br />
<br />
<br />
<br />
<br />Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com3tag:blogger.com,1999:blog-37324607.post-89346735640022680022013-11-23T21:23:00.001-05:002013-11-26T17:37:00.094-05:00Reading lately (Nov-2013)I read a great deal of technical literature and highly recommend the same for anyone interested in developing their skill in this field. While many recent publications have proven worthwhile (2011's update of <i>Data Mining</i> by Witten and Frank; and 2010's <i>Fundamentals of Predictive Text Mining</i>, by Weiss, Indurkhya and Zhang are good examples), I confess being less than overwhelmed by many current offerings in the literature. I will not name names, but the first ten entries returned from my search of books at Amazon for "big data" left me unimpressed. While this field has enjoyed popular attention because of a colorful succession of new labels ("big data" merely being the most recent), many current books and articles remain too far down the wrong end of the flash/substance continuum.<br />
<br />
For some time, I have been exploring older material, some from as far back as the 1920s. A good example would be a certain group of texts I've discovered, from the 1950s and 1960s- a period during which the practical value of statistical analysis had been firmly established in many fields, but long before the arrival of cheap computing machinery. At the time, techniques which maximized the statistician's productivity were ones which were most economical in their use of arithmetic calculations.<br />
<br />
I first encountered such techniques in <i>Introduction to Statistical Analysis</i>, by Dixon and Massey (my edition is from 1957). Two chapters, in particular, are pertinent: "Microstatistics" and "Macrostatistics", which deal with very small and very large data sets, respectively (using the definitions of that time). One set of techniques described in this book involves the calculation of the mean and standard deviation of a set of numbers from a very small subset of their values. For instance, the mean of just 5 values- percentiles 10, 30, 50, 70 and 90- estimates the mean with 93% statistical efficiency.<br />
<br />
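With the Statistics Toolbox, that shortcut can be sketched in a single line (MyData is a placeholder; with data that is already sorted, one would instead pick the five order statistics directly):<br />
<br />
ShortcutMean = mean(prctile(MyData,[10 30 50 70 90]));<br />
<br />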
How is this relevant to today's analysis? Assuming that the data is already sorted (which it often is, for tree induction techniques), extremely large data sets can be summarized with a very small number of operations (4 additions and 1 division, in this case), without appreciably degrading the quality of the estimate.<br />
<br />
Data storage has grown faster than computer processing speed. Today, truly vast collections of data are easily acquired by even tiny organizations. Dealing with such data requires finding methods to accelerate information processing, which is exactly what authors in the 1950s and 1960s were writing about.<br />
<br />
Readers may find a book on exactly this subject, <i>Nonparametric and Shortcut Statistics</i>, by Tate and Clelland (my copy is from 1959), to be of interest.<br />
<br />
Quite a bit was written during the time period in question regarding statistical techniques which: 1. made economical use of calculation, 2. avoided many of the usual assumptions attached to more popular techniques (normal distributions, etc.) or 3. provided some resistance to outliers.<br />
<br />
Genuinely new ideas, naturally, continue to emerge, but don't kid yourself: Much of what is standard practice today was established years ago. There's a book on my shelf, <i>Probabilistic Models</i>, by Springer, Herlihy, Mall and Beggs, published in 1968, which describes in detail what we would today call a naive Bayes regression model. (Indeed, Thomas Bayes' contributions were published in the 1760s.) Claude Shannon described information and entropy- today commonly used in rule- and tree-induction algorithms as well as regression regularization techniques- in the late 1940s. Logistic regression (today's tool of choice in credit scoring and many medical analyses) was initially developed by Pearl and Reed in the 1920s, and the logistic curve itself was used to forecast proportions (<u>not</u> probabilities) after being introduced by Verhulst in the late 1830s. <br />
<br />
Ignore the history of our field at your peril.<br />
<br />Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com1tag:blogger.com,1999:blog-37324607.post-47121174121758049272013-01-13T13:13:00.003-05:002013-01-13T13:42:05.106-05:00Ranking as a Pre-Processing ToolTo squeeze the most from data, analysts will often modify raw variables using mathematical transformations. For example, in <i>Data Analysis and Regression</i>, Tukey described what he terms a "ladder of re-expression" which included a series of low-order roots and powers (and the logarithm) intended to adjust distributions to permit better results using linear regression. Univariate adjustments using those particular transformations are fairly popular today, and are even incorporated directly into some machine learning software as part of the solution search. The modified variables are fed to the analysis (modeling algorithm, clustering, visualization, etc.) just like the raw variables.<br />
<br />
Another possible transformation which can be quite useful, but which enjoys much less popularity is ranking. In fact, entire modeling procedures have been built around ranked data (see, for instance, the material on <i>Theil's regression</i> described in <i>Applied Nonparametric Statistical Methods</i>, by Sprent). One nice thing about ranking is that it ignores the shape of the distribution and considers only the order of the values. This is a natural fit to any <i>monotonic</i> relationship (where the graph of the data moves only up, or only down), which covers a wide variety of behaviors. Even when the data does not meet this condition, ranking can be quite helpful.<br />
<br />
When using ranking as a pre-processing step, one important question to answer is: How do we rank new data values? The original ranking is performed on the training data only. One solution is to interpolate between the ranks of the training values. If a new value matches one from the training set, then it receives that value's rank. If the new value instead lies between training values, we interpolate linearly between them. Note that this results in "ranks" which are not integers. For instance, if the new value to be ranked is 111, and the two nearest values are 110 (rank: 8) and 112 (rank: 7), then the interpolated rank is 7.5.<br />
<br />
One drawback of this approach is, of course, that the entire set of original values and their corresponding ranks must be carried along as part of the transformation process. It may be more economical to only retain a subset of the original data- keeping enough (perhaps 100 or fewer value-rank pairs) to allow the simplified ranking to retain the shape of the data distribution.<br />
<br />
MATLAB users with the Statistics Toolbox can use the <i>tiedrank</i> function to perform ranking (<i>tiedrank</i> automatically handles ties, as well), and it is not difficult to construct a similar process using base MATLAB's <i>sort</i> function. Base MATLAB also provides several very handy interpolation functions, such as <i>interp1</i>.<br />
<br />
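A minimal sketch of this ranking-plus-interpolation scheme might look like the following (variable names are illustrative; new values outside the training range come back as NaN here and would need separate handling):<br />
<br />
% Rank the training values (ties receive averaged ranks)<br />
TrainRanks = tiedrank(TrainValues); % Statistics Toolbox<br />
% Interpolate "ranks" for new values from the training value-rank pairs<br />
[UniqueValues UniqueIndex] = unique(TrainValues); % unique() also sorts<br />
NewRanks = interp1(UniqueValues,TrainRanks(UniqueIndex),NewValues,'linear');<br />
<br />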
Keep in mind that univariate transformations, including ranking, typically do not affect logical machine learning solutions (decision tree induction and rule induction). Since trees and rules usually test only one variable at a time, and most univariate transformations do not change the order of values, such transformations do not change the way the learning algorithm sees the data.<br />
<br />Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-34953474522751152532012-07-31T12:08:00.001-04:002012-10-11T06:46:18.296-04:00The Good and the Bad of the MedianPerhaps the most fundamental statistical summary beyond simple counting or totaling is the mean. The mean reduces a collection of numbers to a single value, and is one of a number of measures of <i>location</i>. The mean is by far the most commonly used and widely understood way of averaging data, but it is not the only one, nor is it always the "best" one. In terms of popularity, the median is a distant second, but it offers a mixture of behaviors which make it an appealing alternative in many circumstances.<br />
<br />
One important property of the median is that it is not affected- at all- by extreme values. No matter how we change any observation, its value will have zero effect on the sample median unless it changes from being above the median to below it, or vice versa. If the maximum value in a data set is multiplied by a million, the median will not change. This behavior is usually characterized as a benefit of the median, as it means that noisy or extreme values will not tug at the median. This is in stark contrast to the sample mean, which blows around in the wind of any change to the data set. Indeed, this is one of the nice things about working with the median: It is highly resistant to noise. Also typically cast as a benefit is the median's resistance to being tugged to extreme values by long-tailed distributions. Whether this is truly a strength or a weakness, though, I will leave to the analyst to decide, since capturing the nature of a long-tail may be important for some problems.<br />
<br />
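A two-line check in MATLAB (with made-up numbers) shows this resistance in action:<br />
<br />
x = [2 3 5 7 9]; % mean(x) is 5.2, median(x) is 5<br />
x(end) = 1e6; % corrupt the largest value: mean(x) becomes 200003.4, median(x) is still 5<br />
<br />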
Another important quality of the sample median is that, for odd numbers of observations, it is always a value from the original data set. This is also true for many data sets with even numbers of observations. In some situations, this is desirable. Data summaries being shown to non-technical people often provide more aesthetic appeal if there are no "odd" values. Consider a retail store in which all products are priced with .99 at the end, yet the reported mean ends in .73- this is normally not an issue with the median. Replacing missing values with means very likely introduces new values to the data set, and this is especially awkward when dealing with variables which present limited distributions, such as counts. As an example, consider a database with missing values for "number of children", which (one would expect!) would always be an integer. Substituting the mean for missings may result in observations sporting "1.2" children. This is not a problem with the median. Note that this behavior has a darker side for signal- and image-processing problems: data treated with a windowed median tend to form stepped plateaus, rather than smooth curves. Such artifacts can be distracting or worse.<br />
<br />
On the downside, the median is not always the most statistically efficient summary. This is a technical issue which means simply that the median may "wander" from the "true" value (the population parameter) more than other summaries when finite data are available. For instance, when the data are known to be drawn from a true normal distribution, the sample mean is known to "wander" the least from the true value. I'm sparing the reader the technical details here, but suffice it to say that, though the mean or median might be closer to the "true" value in any particular situation, the mean is likely to be closer most of the time. Statistical efficiency may be measured, and it is worth noting that different summaries achieve different relative efficiencies, depending on the size of the data sample, and the shape of the population distribution.<br />
<br />
Hopefully, this short note has whet your appetite for the median and other alternatives to the mean. I encourage the reader to explore yet other alternatives, such as trimmed means, of which the mean and median are special cases.Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com2tag:blogger.com,1999:blog-37324607.post-27829369079467602142010-12-11T05:03:00.053-05:002012-04-09T20:49:31.638-04:00Linear Discriminant Analysis (LDA)<b>Overview</b><br />
<br />
Linear discriminant analysis (LDA) is one of the oldest mechanical classification systems, dating back to statistical pioneer Ronald Fisher, whose original 1936 paper on the subject, <i>The Use of Multiple Measurements in Taxonomic Problems</i>, can be found online (for example, <a href="http://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf">here</a>).<br />
<br />
The basic idea of LDA is simple: for each class to be identified, calculate a (different) linear function of the attributes. The class function yielding the highest score represents the predicted class.<br />
<br />
There are many linear classification models, and they differ largely in how the coefficients are established. One nice quality of LDA is that, unlike some of the alternatives, it does not require multiple passes over the data for optimization. Also, it naturally handles problems with more than two classes and it can provide probability estimates for each of the candidate classes.<br />
<br />
Some analysts attempt to interpret the signs and magnitudes of the coefficients of the linear scores, but this can be tricky, especially when the number of classes is greater than 2.<br />
<br />
LDA bears some resemblance to principal components analysis (PCA), in that a number of linear functions are produced (using all raw variables), which are intended, in some sense, to provide data reduction through rearrangement of information. (See the Feb-26-2010 posting to this log, <a href="http://matlabdatamining.blogspot.com/2010/02/principal-components-analysis.html">Principal Components Analysis</a>.) Note, though, some important differences: First, the objective of LDA is to maximize class discrimination, whereas the objective of PCA is to squeeze variance into as few components as possible. Second, LDA produces exactly as many linear functions as there are classes, whereas PCA produces as many linear functions as there are original variables. Last, principal components are always orthogonal to each other ("uncorrelated"), while that is not generally true for LDA's linear scores.<br />
<br />
<br />
<b>An Implementation</b><br />
<br />
I have made available on <a href="http://www.mathworks.com/matlabcentral/fileexchange/">MATLAB Central</a>, a routine, aptly named <a href="http://www.mathworks.com/matlabcentral/fileexchange/29673-lda-linear-discriminant-analysis">LDA</a> which performs all the necessary calculations. I'd like to thank Deniz Seviş, whose prompting got me to finally write this code (with her) and whose collaboration is very much appreciated.<br />
<br />
Note that the <i>LDA</i> function assumes that the data it is being fed is complete (no missing values) and performs no attribute selection. Also, it requires only base MATLAB (no toolboxes needed).<br />
<br />
Use of <i>LDA</i> is straightforward: the programmer supplies the input and target variables and, optionally, prior probabilities. The function returns the fitted linear discriminant coefficients. <i>help LDA</i> provides a good example:<br />
<br />
<i><br /> % Generate example data: 2 groups, of 10 and 15, respectively<br /> X = [randn(10,2); randn(15,2) + 1.5]; Y = [zeros(10,1); ones(15,1)];<br /> <br /> % Calculate linear discriminant coefficients<br /> W = LDA(X,Y);</i><br />
<br />
This example randomly generates an artificial data set of two classes (labeled 0 and 1) and two input variables. The LDA function fits linear discriminants to the data, and stores the result in <i>W</i>. So, what is in <i>W</i>? Let's take a look:<br />
<br />
<i><br />>> W<br /><br />W =<br /><br /> -1.1997 0.2182 0.6110<br /> -2.0697 0.4660 1.4718</i><br />
<br />
The first row contains the coefficients for the linear score associated with the first class (this routine orders the linear functions the same way as <i>unique()</i>). In this model, -1.1997 is the constant and 0.2182 and 0.6110 are the coefficients for the input variables for the first class (class 0). Coefficients for the second class's linear function are in the second row. Calculating the linear scores is easy:<br />
<br />
<i><br /> % Calculate linear scores for training data<br /> L = [ones(25,1) X] * W';</i><br />
<br />
Each column represents the output of the linear score for one class. In this case, the first column is class 0, and the second column is class 1. For any given observation, the higher the linear score, the more likely that class. Note that LDA's linear scores are not probabilities, and may even assume negative values. Here are the values from my run:<br />
<br />
<i><br />>> L<br /><br />L =<br /><br /> -1.9072 -3.8060<br /> 1.0547 3.2517<br /> -1.2493 -2.0547<br /> -1.0502 -1.7608<br /> -0.6935 -0.8692<br /> -1.6103 -2.9808<br /> -1.3702 -2.4545<br /> -0.2148 0.2825<br /> 0.4419 1.6717<br /> 0.2704 1.3067<br /> 1.0694 3.2670<br /> -0.0207 0.7529<br /> -0.2608 0.0601<br /> 1.2369 3.6135<br /> -0.8951 -1.4542<br /> 0.2073 1.1687<br /> 0.0551 0.8204<br /> 0.1729 1.1654<br /> 0.2993 1.4344<br /> -0.6562 -0.8028<br /> 0.2195 1.2068<br /> -0.3070 0.0598<br /> 0.1944 1.2628<br /> 0.5354 2.0689<br /> 0.0795 1.0976</i><br />
<br />
To obtain estimated probabilities, simply run the linear scores through the softmax transform (exponentiate everything, and normalize so that they sum to 1.0):<br />
<br />
<i><br /> % Calculate class probabilities<br /> P = exp(L) ./ repmat(sum(exp(L),2),[1 2]);</i><br />
<br />
As we see, most of the first 10 cases exhibit higher probabilities for class 0 (the first column) than for class 1 (the second column) and the reverse is true for the last 15 cases:<br />
<br />
<i><br />>> P<br /><br />P =<br /><br /> 0.8697 0.1303<br /> 0.1000 0.9000<br /> 0.6911 0.3089<br /> 0.6705 0.3295<br /> 0.5438 0.4562<br /> 0.7975 0.2025<br /> 0.7473 0.2527<br /> 0.3782 0.6218<br /> 0.2262 0.7738<br /> 0.2619 0.7381<br /> 0.1000 0.9000<br /> 0.3157 0.6843<br /> 0.4205 0.5795<br /> 0.0850 0.9150<br /> 0.6363 0.3637<br /> 0.2766 0.7234<br /> 0.3175 0.6825<br /> 0.2704 0.7296<br /> 0.2432 0.7568<br /> 0.5366 0.4634<br /> 0.2714 0.7286<br /> 0.4093 0.5907<br /> 0.2557 0.7443<br /> 0.1775 0.8225<br /> 0.2654 0.7346</i><br />
<br />
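If hard class assignments are wanted, either the linear scores or the probabilities can be converted by taking, for each observation, the class whose column holds the largest value (using the same class ordering as <i>unique()</i>); a quick sketch:<br />
<br />
<i><br /> % Convert linear scores (or probabilities) to predicted class labels<br /> [MaxScore ColumnIndex] = max(L,[],2);<br /> ClassLabels = unique(Y);<br /> PredictedClass = ClassLabels(ColumnIndex);</i><br />
<br />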
This model is not perfect, and would really need to be tested more rigorously (via holdout testing, k-fold cross validation, etc.) to determine how well it approximates the data.<br />
<br />
I will not demonstrate its use here, but the <i>LDA</i> routine offers a facility for modifying the prior probabilities. Briefly, the function assumes that the true distribution of classes is whatever it observes in the training data. Analysts, however, may wish to adjust this distribution for several reasons, and the third, optional, parameter allows this. Note that the LDA routine presented here always performs the adjustment for prior probabilities: Some statistical software drops the adjustment for prior probabilities altogether if the user specifies that classes are equally likely, and will produce different results than <i>LDA</i>.<br />
<br />
<br />
<b>Closing Thoughts</b><br />
<br />
Though it employs a fairly simple model structure, LDA has held up reasonably well, sometimes still besting more complex algorithms. When its assumptions are met, the literature records it doing better than logistic regression. It is very fast to execute and fitted models are extremely portable- even a spreadsheet will support linear models (...or, one supposes, paper and pencil!). LDA is at least worth trying at the beginning of a project, if for no other reason than to establish a lower bound on acceptable performance.<br />
<br />
<br />
<b>See Also</b><br />
<br />
Feb-16-2010 posting, <a href="http://matlabdatamining.blogspot.com/2010/02/single-neuron-training-delta-rule.html">Single Neuron Training: The Delta Rule</a><br />
Mar-15-2009 posting, <a href="http://matlabdatamining.blogspot.com/2009/03/logistic-regression.html">Logistic Regression</a>Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com21tag:blogger.com,1999:blog-37324607.post-72651132139494804422010-09-12T08:14:00.039-04:002010-09-16T11:06:01.355-04:00Reader Question: Putting Entropy to Work<b>Introduction</b><br /><br />In response to my Nov-10-2006 posting, <a href="http://matlabdatamining.blogspot.com/2006/11/introduction-to-entropy.html">Introduction To Entropy</a>, an anonymous reader asked:<br /><br /><i>Can we use entropy for distinguishing random signals and deterministic signal? Lets say i generate two signals in matlab. First signal using sin function and second using randn function. Can we use entropy to distinguish between these two signal?</i><br /><br />The short answer is: Yes, we can use entropy for this purpose, although even simpler summary statistics would reveal that the normally distributed <i>randn</i> data included values outside of -1..+1, while the <i>sin</i> data did not.<br /><br />In this article, I will be using my own entropy calculating routines, which can be found on MATLAB Central: <a href="http://www.mathworks.com/matlabcentral/fileexchange/28692-entropy">Entropy</a>, <a href="http://www.mathworks.com/matlabcentral/fileexchange/28695-joint-entropy">JointEntropy</a>, <a href="http://www.mathworks.com/matlabcentral/fileexchange/28693-conditional-entropy">ConditionalEntropy</a> and <a href="http://www.mathworks.com/matlabcentral/fileexchange/28694-mutual-information">MutualInformation</a>.<br /><br /><br /><b>A Slightly Harder Problem</b><br /><br />To illustrate this application of entropy, I propose a slightly different problem, in which the sine data and the random data share the same distribution. To achieve this, the "random" data will be a random sample from the sine function:<br /><br /><i><br />>> X = [1:1000]';<br />>> Sine = sin(0.05 * X);<br />>> RandomData = sin(2 * pi * rand(size(X)));<br /></i><br /><br />As a quick check on the distributions, we will examine their respective histograms:<br /><br /><i><br />>> figure<br />>> subplot(2,1,1), hist(Sine), xlabel('Sine Value'), ylabel('Frequency'), grid on<br />>> subplot(2,1,2), hist(RandomData), xlabel('RandomData Value'), ylabel('Frequency'), grid on<br /></i><br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_aTiM0lwqgJ4/TIziIRHymiI/AAAAAAAAACw/bZHr48yXiKQ/s1600/Comparative+Histograms.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://2.bp.blogspot.com/_aTiM0lwqgJ4/TIziIRHymiI/AAAAAAAAACw/bZHr48yXiKQ/s400/Comparative+Histograms.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5516032275284924962" /></a><br /><i>Click image to enlarge.</i><br /><br /><br />More or less, they appear to match.<br /><br /><br /><b>A First Look, Using Entropy</b><br /><br />At this point, the reader may be tempted to calculate the entropies of the two distributions, and compare them. Since their distributions (as per the histograms) are similar, we should expect their entropies to also be similar.<br /><br />To date, this Web log has only dealt with <i>discrete entropy</i>, yet our data is continuous. While there is a <i>continuous entropy</i>, we will stick with the simpler (in my opinion) discrete entropy for now. This requires that the real-valued numbers of our data be converted to symbols. 
We will accomplish this via quantization ("binning") to 10 levels:<br /><br /><i><br />>> Sine10 = ceil(10 * (Sine + 1) / 2);<br />>> RandomData10 = ceil(10 * (RandomData + 1) / 2);<br /></i><br /><br />If the MATLAB Statistics Toolbox is installed, one can check the resulting frequencies thus (I apologize for Blogger's butchering of the text formatting):<br /><br /><i><br />>> tabulate(Sine10)<br /> Value Count Percent<br /> 1 205 20.50%<br /> 2 91 9.10%<br /> 3 75 7.50%<br /> 4 66 6.60%<br /> 5 60 6.00%<br /> 6 66 6.60%<br /> 7 66 6.60%<br /> 8 75 7.50%<br /> 9 91 9.10%<br /> 10 205 20.50%<br />>> tabulate(RandomData10)<br /> Value Count Percent<br /> 1 197 19.70%<br /> 2 99 9.90%<br /> 3 84 8.40%<br /> 4 68 6.80%<br /> 5 66 6.60%<br /> 6 55 5.50%<br /> 7 68 6.80%<br /> 8 67 6.70%<br /> 9 82 8.20%<br /> 10 214 21.40%<br /></i><br /><br />It should be noted that other procedures could have been used for the signal-to-symbol conversion. For example, bin frequencies could have been made equal. The above method was selected because it is simple and requires no Toolbox functions. Also, other numbers of bins could have been utilized.<br /><br />Now that the data is represented by symbols, we may check the earlier assertion regarding similar distributions yielding similar entropies (measured in bits per observation):<br /><br /><i><br />>> Entropy(Sine10)<br /><br />ans =<br /><br /> 3.1473<br /><br />>> Entropy(RandomData10)<br /><br />ans =<br /><br /> 3.1418<br /></i><br /><br />As these are sample statistics, we would not expect them to match exactly, but these are very close.<br /><br /><br /><b>Another Perspective</b><br /><br />One important aspect of the structure of a sine curve is that it varies over time (or whatever the domain is). This means that any given sine value is typically very similar to those on either side. With this in mind, we will investigate the conditional entropy of each of these two signals versus themselves, lagged by one observation:<br /><br /><i><br />>> ConditionalEntropy(Sine10(2:end),Sine10(1:end-1))<br /><br />ans =<br /><br /> 0.6631<br /><br />>> ConditionalEntropy(RandomData10(2:end),RandomData10(1:end-1))<br /><br />ans =<br /><br /> 3.0519<br /></i><br /><br />Ah! Notice that the entropy of the sine data, given knowledge of its immediate predecessor is much lower than the entropy of the random data, given its immediate predecessor. These data are indeed demonstrably different insofar as they behave over time, despite sharing the same distribution.<br /><br />An astute reader may at this point notice that the conditional entropy of the random data, given 1 lagged value, is less than the entropy of the raw random data. This is an artifact of the finite number of samples and the quantization process. Given more observations and a finer quantization, this discrepancy between sample statistics and population statistics will shrink.<br /><br />Entropy could have been applied to this problem other ways, too. For instance, one might calculate entropy for short time windows. I would point out that other, more traditional procedures might be used instead, such as calculating the auto-correlation for lag 1. 
It is worth seeing how entropy adds to the analyst's toolbox, though.<br /><br /><b>Further Reading</b><br /><br />See also the Apr-01-2009 posting, <a href="http://matlabdatamining.blogspot.com/2009/04/introduction-to-conditional-entropy.html">Introduction to Conditional Entropy</a>.<br /><br />Print:<br /><br /><i>The Mathematical Theory of Communication</i> by Claude Shannon (ISBN 0-252-72548-4)<br /><br /><i>Elements of Information Theory</i> by Cover and Thomas (ISBN 0-471-06259)Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com3tag:blogger.com,1999:blog-37324607.post-78953163375081166442010-02-28T16:17:00.032-05:002010-12-11T21:03:49.995-05:00Putting PCA to Work<b>Context</b><br /><br />The last posting to this Web log, <a href="http://matlabdatamining.blogspot.com/2010/02/principal-components-analysis.html">Principal Components Analysis</a> (Feb-26-2010), gave an overview of principal component analysis (PCA), and how to effect it within MATLAB. This article will cover three uses of PCA: 1. pre-processing for empirical modeling, 2. data compression and 3. noise suppression.<br /><br />To serve the widest possible audience, this article will conduct PCA using only base MATLAB functions, but realize that users with the Statistics Toolbox have, as mentioned in the last posting, the option of using tools like <i>princomp</i> and <i>zscore</i>.<br /><br />We will continue to use the very small data set used in the last article:<br /><br /><br />>> A = [269.8 38.9 50.5<br />272.4 39.5 50.0<br />270.0 38.9 50.5<br />272.0 39.3 50.2<br />269.8 38.9 50.5<br />269.8 38.9 50.5<br />268.2 38.6 50.2<br />268.2 38.6 50.8<br />267.0 38.2 51.1<br />267.8 38.4 51.0<br />273.6 39.6 50.0<br />271.2 39.1 50.4<br />269.8 38.9 50.5<br />270.0 38.9 50.5<br />270.0 38.9 50.5<br />];<br /><br /><br />We calculate the sample parameters, and standardize the data table:<br /><br /><br />>> [n m] = size(A)<br /><br />n =<br /><br /> 15<br /><br /><br />m =<br /><br /> 3<br /><br />>> AMean = mean(A)<br /><br />AMean =<br /><br /> 269.9733 38.9067 50.4800<br /><br />>> AStd = std(A)<br /><br />AStd =<br /><br /> 1.7854 0.3751 0.3144<br /><br /><br />>> B = (A - repmat(AMean,[n 1])) ./ repmat(AStd,[n 1])<br /><br />B =<br /><br /> -0.0971 -0.0178 0.0636<br /> 1.3591 1.5820 -1.5266<br /> 0.0149 -0.0178 0.0636<br /> 1.1351 1.0487 -0.8905<br /> -0.0971 -0.0178 0.0636<br /> -0.0971 -0.0178 0.0636<br /> -0.9932 -0.8177 -0.8905<br /> -0.9932 -0.8177 1.0178<br /> -1.6653 -1.8842 1.9719<br /> -1.2173 -1.3509 1.6539<br /> 2.0312 1.8486 -1.5266<br /> 0.6870 0.5155 -0.2544<br /> -0.0971 -0.0178 0.0636<br /> 0.0149 -0.0178 0.0636<br /> 0.0149 -0.0178 0.0636<br /><br />Now that the data is centered with mean 0.0 and standard deviation 1.0, we perform the eigenanalysis of the sample covariances to determine the coefficient matrix which generates the principal components:<br /><br /><br />>> [V D] = eig(cov(B))<br /><br />V =<br /><br /> 0.6505 0.4874 -0.5825<br /> -0.7507 0.2963 -0.5904<br /> -0.1152 0.8213 0.5587<br /><br /><br />D =<br /><br /> 0.0066 0 0<br /> 0 0.1809 0<br /> 0 0 2.8125<br /><br /><br />Recall that the MATLAB <i>eig</i> function orders information for the principal components from last to first when reading the columns from left to right. The matrix <i>V</i> contains the linear coefficients for the principal components. The diagonal of matrix <i>D</i> contains the variances for the principal components. So far, we have accomplished the principal components analysis itself. 
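<br /><br />If that reversed ordering is bothersome, the outputs of <i>eig</i> can simply be flipped so that the first principal component appears first (a small convenience sketch only; the rest of this article keeps MATLAB's original ordering):<br /><br /><i>>> VOrdered = fliplr(V); % first principal component now in column 1<br />>> PCVariance = flipud(diag(D)); % variances, largest first</i><br /><br />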
To put the PCA to use, we will want to know what proportion each principal component represents of total variance. We can do this by extracting and normalizing the diagonal of matrix <i>D</i> (we use <i>flipud</i> because the principal components are in "reverse" order):<br /><br /><br />>> cumsum(flipud(diag(D))) / sum(diag(D))<br /><br />ans =<br /><br /> 0.9375<br /> 0.9978<br /> 1.0000<br /><br /><br />We interpret the above column of numbers to mean that the first principal component<br />contains 93.75% of the total variance of the original data, the first two principal components together contain 99.78% and of course all principal components taken together have all of the variance (exactly as much as in the original standardized data).<br /><br />Last, to calculate the principal components themselves, simply multiply the standardized data by the coefficient matrix:<br /><br /><br />>> PC = B * V<br /><br />PC =<br /><br /> -0.0571 -0.0003 0.1026<br /> -0.1277 -0.1226 -2.5786<br /> 0.0157 0.0543 0.0373<br /> 0.0536 0.1326 -1.7779<br /> -0.0571 -0.0003 0.1026<br /> -0.0571 -0.0003 0.1026<br /> 0.0704 -1.4579 0.5637<br /> -0.1495 0.1095 1.6299<br /> 0.1041 0.2496 3.1841<br /> 0.0319 0.3647 2.4306<br /> 0.1093 0.2840 -3.1275<br /> 0.0892 0.2787 -0.8467<br /> -0.0571 -0.0003 0.1026<br /> 0.0157 0.0543 0.0373<br /> 0.0157 0.0543 0.0373<br /><br /><br />To verify the condensing of the variance, calculate the sample variances:<br /><br /><br />>> var(PC)<br /><br />ans =<br /><br /> 0.0066 0.1809 2.8125<br /><br /><br />Again, note that the first principal component appears in the last column when using MATLAB's <i>eig</i> function, and columns to the left have less and less variance until the last principal component, stored in the first column.<br /><br /><br /><b>Application: Pre-processing Data for Empirical Modeling</b><br /><br />This application of PCA is simple: calculate the principal components and choose from them rather than the original data to construct the empirical model (regression, neural network, etc.). The (hoped for) advantage of doing this is that since PCA squeezes information into a subset of the new variables, less of them will be necessary to construct the model. In fact, it would not be unreasonable to simply step through the first so many principal components to build the model: First, use just the first principal component, then try the first and second, then the first, second and third, etc. A nice side benefit is that all the principal components are uncorrelated with each other.<br /><br />As was mentioned in the last article, this may or may not work well, for several reasons: PCA may not be able to squeeze the variance much if the original variables are already highly uncorrelated with one another. Also, statistical variance may not be the same thing as "information" for the purposes of model building. Last, even if this process works, one is left with the reality that PCA needs all of the original variables to calculate the principal components, even if only a subset of them are used. Regardless, this is a data processing technique which can yield benefit, so it is worth trying.<br /><br /><br /><b>Application: Data Compression</b><br /><br />PCA offers a mechanism for performing lossy data compression. When data compression is "lossy", it may not return exactly the original data. The trade-off is that much greater compression can be achieved than with "lossless" data compression (compression in which the original data is returned exactly). 
In many cases, such as audio (MP3) and images (JPEG), some loss in fidelity is acceptable and greater compression is very much desired.<br /><br />All compression schemes rely on the discovery of regularities within the data. In the case of PCA, the regularity is a linear relationship among the variables. To the extent that PCA finds this relationship, the data may be compressed. The idea is to discard the last principal components (those exhibiting the least variance).<br /><br />In MATLAB, this means simply dropping the columns representing the unwanted principal components. In this case, we will retain only the first principal component:<br /><br /><br />>> VReduced = V(:,3)<br /><br />VReduced =<br /><br /> -0.5825<br /> -0.5904<br /> 0.5587<br /><br />>> PCReduced = B * VReduced<br /><br />PCReduced =<br /><br /> 0.1026<br /> -2.5786<br /> 0.0373<br /> -1.7779<br /> 0.1026<br /> 0.1026<br /> 0.5637<br /> 1.6299<br /> 3.1841<br /> 2.4306<br /> -3.1275<br /> -0.8467<br /> 0.1026<br /> 0.0373<br /> 0.0373<br /><br /><br />Decompression is accomplished by inverting the process, which we can do by transposing the coefficient vector and multiplying:<br /><br /><br />>> PCReduced * VReduced'<br /><br />ans =<br /><br /> -0.0598 -0.0606 0.0573<br /> 1.5020 1.5224 -1.4406<br /> -0.0217 -0.0220 0.0209<br /> 1.0356 1.0497 -0.9933<br /> -0.0598 -0.0606 0.0573<br /> -0.0598 -0.0606 0.0573<br /> -0.3284 -0.3328 0.3150<br /> -0.9494 -0.9623 0.9106<br /> -1.8547 -1.8799 1.7789<br /> -1.4158 -1.4351 1.3580<br /> 1.8217 1.8465 -1.7473<br /> 0.4932 0.4999 -0.4730<br /> -0.0598 -0.0606 0.0573<br /> -0.0217 -0.0220 0.0209<br /> -0.0217 -0.0220 0.0209<br /><br />The result is not exactly the same as the original standardized data, but it is pretty close. We "un-standardize" by reversing the original standardization step:<br /><br /><br />>> Z = ((PCReduced * VReduced') .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])<br /><br />Z =<br /><br /> 269.8667 38.8840 50.4980<br /> 272.6550 39.4777 50.0270<br /> 269.9345 38.8984 50.4866<br /> 271.8223 39.3004 50.1677<br /> 269.8667 38.8840 50.4980<br /> 269.8667 38.8840 50.4980<br /> 269.3870 38.7818 50.5790<br /> 268.2783 38.5457 50.7663<br /> 266.6619 38.2016 51.0393<br /> 267.4455 38.3684 50.9070<br /> 273.2259 39.5992 49.9306<br /> 270.8539 39.0942 50.3313<br /> 269.8667 38.8840 50.4980<br /> 269.9345 38.8984 50.4866<br /> 269.9345 38.8984 50.4866<br /><br /><br />Again, the result is pretty similar to the original, but not exactly: about 94% of the variance has been preserved, and we have compressed the data to 33% of its original size.<br /><br />The trade-off here is between compression (count of principal components retained) and compression fidelity (the variance preserved). In a typical application, there will be more variables and the variance compression is normally not quite as dramatic as in our illustration. This means that there will be more data compression "levels", represented by the number of principal components retained.<br /><br /><br /><b>Application: Noise Suppression</b><br /><br />Extending the data compression application, we may use PCA for noise suppression. The basic idea is that the variance captured by the least important principal components is noise which should be rejected. Assuming that the variables bear a linear relationship, they will lie in a line (plane, hyperplane) and noise items will lift them away from the line. 
Dropping the last principal components means flattening the data in a geometric sense and (hopefully) eliminating some of the noise.<br /><br />This process is much like the data compression process described in the last section, except: 1. discarded components have their coefficients set to zero instead of being deleted outright and 2. the PCA coefficient matrix and its inverse are multiplied together to allow a single processing step which (again, hopefully) reduces noise in the data.<br /><br />As before, we calculate the PCA coefficients:<br /><br /><br />>> [V D] = eig(cov(B))<br /><br />V =<br /><br /> 0.6505 0.4874 -0.5825<br /> -0.7507 0.2963 -0.5904<br /> -0.1152 0.8213 0.5587<br /><br /><br />D =<br /><br /> 0.0066 0 0<br /> 0 0.1809 0<br /> 0 0 2.8125<br /><br /><br />Deciding to eliminate the last principal component, we set its coefficients to zero:<br /><br /><br />>> VDenoise = V; VDenoise(:,1) = 0<br /><br />VDenoise =<br /><br /> 0 0.4874 -0.5825<br /> 0 0.2963 -0.5904<br /> 0 0.8213 0.5587<br /><br /><br />This matrix will project the standardized data into a flat surface- in this case a plane, since we have retained 2 dimensions. Not wanting to bother with two steps, we multiply this matrix by its inverse, which <u>in this case</u> is easily obtained by taking the transpose:<br /><br /><br />>> VDenoise = VDenoise * VDenoise'<br /><br />VDenoise =<br /><br /> 0.5769 0.4883 0.0749<br /> 0.4883 0.4364 -0.0865<br /> 0.0749 -0.0865 0.9867<br /><br /><br />This magical matrix will, in a single matrix multiplication, denoise the standardized data:<br /><br /><br />>> B * VDenoise<br /><br />ans =<br /><br /> -0.0599 -0.0607 0.0570<br /> 1.4422 1.4861 -1.5414<br /> 0.0047 -0.0060 0.0654<br /> 1.1002 1.0890 -0.8844<br /> -0.0599 -0.0607 0.0570<br /> -0.0599 -0.0607 0.0570<br /> -1.0390 -0.7648 -0.8824<br /> -0.8960 -0.9299 1.0005<br /> -1.7330 -1.8060 1.9839<br /> -1.2380 -1.3270 1.6575<br /> 1.9601 1.9307 -1.5141<br /> 0.6290 0.5825 -0.2442<br /> -0.0599 -0.0607 0.0570<br /> 0.0047 -0.0060 0.0654<br /> 0.0047 -0.0060 0.0654<br /><br /><br />Naturally, we still need to multiply back the standard deviation and add back the mean to get to the original scale:<br /><br /><br />>> Z = ((B * VDenoise) .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])<br /><br />Z =<br /><br /> 269.8664 38.8839 50.4979<br /> 272.5483 39.4640 49.9954<br /> 269.9817 38.9044 50.5006<br /> 271.9377 39.3151 50.2019<br /> 269.8664 38.8839 50.4979<br /> 269.8664 38.8839 50.4979<br /> 268.1183 38.6198 50.2025<br /> 268.3736 38.5579 50.7946<br /> 266.8791 38.2293 51.1038<br /> 267.7630 38.4090 51.0012<br /> 273.4731 39.6308 50.0040<br /> 271.0964 39.1251 50.4032<br /> 269.8664 38.8839 50.4979<br /> 269.9817 38.9044 50.5006<br /> 269.9817 38.9044 50.5006<br /><br /><br />The degree of noise reduction is controlled by the number of principal components retained: the less principal components retained, the greater the noise reduction. Obviously, like all such schemes, this process has limitations and the big assumption here is that the original variables are linearly related so that noise stands out as a departure from this linearity.<br /><br /><br /><b>Final Thoughts</b><br /><br />PCA is a powerful tool, and is quickly computed on current computers, even on fairly large data. 
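<br /><br />As a rough check of that claim, one might time the core computation on a larger synthetic table (a throwaway sketch: the randn data and the dimensions are arbitrary, and the elapsed time will depend entirely on your machine):<br /><br /><i>>> BBig = randn(100000,30); % 100,000 observations of 30 standardized variables<br />>> tic; [VBig DBig] = eig(cov(BBig)); toc</i><br /><br />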
While there are limits to what it can do, it is a handy tool which is inexpensive in terms of compute time.<br /><br /><br /><b>Further Reading</b><br /><br />As a general reference on PCA see:<br /><br /><i>Multivariate Statistical Methods: A Primer</i>, by Manly (ISBN: 0-412-28620-3)<br /><br />Note: The first edition is adequate for understanding and coding PCA, and is at present much cheaper than the second or third editions. <br /><br /><br />The noise suppression application is described in the article, <i>Vectors help make sense of multiple signals</i>, by Sullivan, <i>Personal Engineering and Instrumentation News</i> (Dec-1997), in which it is referred to as <i>subspace projection</i>.<br /><br />See also the Dec-11-2010 posting, <a href="http://matlabdatamining.blogspot.com/2010/12/linear-discriminant-analysis-lda.html">Linear Discriminant Analysis (LDA) </a>.Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com19tag:blogger.com,1999:blog-37324607.post-25116123507392345582010-02-26T07:51:00.037-05:002011-02-11T04:50:06.240-05:00Principal Components Analysis<b>Introduction</b><br /><br />Real-world data sets usually exhibit relationships among their variables. These relationships are often linear, or at least approximately so, making them amenable to common analysis techniques. One such technique is <i>principal component analysis</i> ("PCA"), which rotates the original data to new coordinates, making the data as "flat" as possible.<br /><br />Given a table of two or more variables, PCA generates a new table with the same number of variables, called the <i>principal components</i>. Each principal component is a linear transformation of the entire original data set. The coefficients of the principal components are calculated so that the first principal component contains the maximum variance (which we may tentatively think of as the "maximum information"). The second principal component is calculated to have the second most variance, and, importantly, is uncorrelated (in a linear sense) with the first principal component. Further principal components, if there are any, exhibit decreasing variance and are uncorrelated with all other principal components.<br /><br />PCA is completely reversible (the original data may be recovered exactly from the principal components), making it a versatile tool, useful for data reduction, noise rejection, visualization and data compression among other things. 
This article walks through the specific mechanics of calculating the principal components of a data set in MATLAB, using either the MATLAB Statistics Toolbox, or just the base MATLAB product.<br /><br /><br /><b>Performing Principal Components Analysis</b><br /><br /><br />Performing PCA will be illustrated using the following data set, which consists of 3 measurements taken of a particular subject over time:<br /><br /><br />>> A = [269.8 38.9 50.5<br />272.4 39.5 50.0<br />270.0 38.9 50.5<br />272.0 39.3 50.2<br />269.8 38.9 50.5<br />269.8 38.9 50.5<br />268.2 38.6 50.2<br />268.2 38.6 50.8<br />267.0 38.2 51.1<br />267.8 38.4 51.0<br />273.6 39.6 50.0<br />271.2 39.1 50.4<br />269.8 38.9 50.5<br />270.0 38.9 50.5<br />270.0 38.9 50.5<br />];<br /><br /><br />We determine the size of this data set thus:<br /><br /><br />>> [n m] = size(A)<br /><br />n =<br /><br /> 15<br /><br /><br />m =<br /><br /> 3<br /><br /><br />To summarize the data, we calculate the sample mean vector and the sample standard deviation vector:<br /><br /><br />>> AMean = mean(A)<br /><br />AMean =<br /><br /> 269.9733 38.9067 50.4800<br /><br />>> AStd = std(A)<br /><br />AStd =<br /><br /> 1.7854 0.3751 0.3144<br /><br /><br />Most often, the first step in PCA is to <i>standardize</i> the data. Here, "standardization" means subtracting the sample mean from each observation, then dividing by the sample standard deviation. This centers and scales the data. Sometimes there are good reasons for modifying or not performing this step, but I will recommend that you standardize unless you have a good reason not to. This is easy to perform, as follows:<br /><br /><br />>> B = (A - repmat(AMean,[n 1])) ./ repmat(AStd,[n 1])<br /><br />B =<br /><br /> -0.0971 -0.0178 0.0636<br /> 1.3591 1.5820 -1.5266<br /> 0.0149 -0.0178 0.0636<br /> 1.1351 1.0487 -0.8905<br /> -0.0971 -0.0178 0.0636<br /> -0.0971 -0.0178 0.0636<br /> -0.9932 -0.8177 -0.8905<br /> -0.9932 -0.8177 1.0178<br /> -1.6653 -1.8842 1.9719<br /> -1.2173 -1.3509 1.6539<br /> 2.0312 1.8486 -1.5266<br /> 0.6870 0.5155 -0.2544<br /> -0.0971 -0.0178 0.0636<br /> 0.0149 -0.0178 0.0636<br /> 0.0149 -0.0178 0.0636<br /><br /><br />This calculation can also be carried out using the <i>zscore</i> function from the Statistics Toolbox:<br /><br /><br />>> B = zscore(A)<br /><br />B =<br /><br /> -0.0971 -0.0178 0.0636<br /> 1.3591 1.5820 -1.5266<br /> 0.0149 -0.0178 0.0636<br /> 1.1351 1.0487 -0.8905<br /> -0.0971 -0.0178 0.0636<br /> -0.0971 -0.0178 0.0636<br /> -0.9932 -0.8177 -0.8905<br /> -0.9932 -0.8177 1.0178<br /> -1.6653 -1.8842 1.9719<br /> -1.2173 -1.3509 1.6539<br /> 2.0312 1.8486 -1.5266<br /> 0.6870 0.5155 -0.2544<br /> -0.0971 -0.0178 0.0636<br /> 0.0149 -0.0178 0.0636<br /> 0.0149 -0.0178 0.0636<br /><br /><br />Calculating the coefficients of the principal components and their respective variances is done by finding the eigenfunctions of the sample covariance matrix:<br /><br /><br />>> [V D] = eig(cov(B))<br /><br />V =<br /><br /> 0.6505 0.4874 -0.5825<br /> -0.7507 0.2963 -0.5904<br /> -0.1152 0.8213 0.5587<br /><br /><br />D =<br /><br /> 0.0066 0 0<br /> 0 0.1809 0<br /> 0 0 2.8125<br /><br /><br />The matrix V contains the coefficients for the principal components. The diagonal elements of D store the variance of the respective principal components. 
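<br /><br />Because the data has been standardized, these variances should sum to the number of original variables (3, up to rounding). A quick sanity check, using the diagonal extraction shown next, is:<br /><br /><i>>> sum(diag(D))</i><br /><br />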
We can extract the diagonal like this:<br /><br /><br />>> diag(D)<br /><br />ans =<br /><br /> 0.0066<br /> 0.1809<br /> 2.8125<br /><br /><br />The coefficients and respective variances of the principal components could also be found using the <i>princomp</i> function from the Statistics Toolbox:<br /><br /><br />>> [COEFF SCORE LATENT] = princomp(B)<br /><br />COEFF =<br /><br /> 0.5825 -0.4874 0.6505<br /> 0.5904 -0.2963 -0.7507<br /> -0.5587 -0.8213 -0.1152<br /><br /><br />SCORE =<br /><br /> -0.1026 0.0003 -0.0571<br /> 2.5786 0.1226 -0.1277<br /> -0.0373 -0.0543 0.0157<br /> 1.7779 -0.1326 0.0536<br /> -0.1026 0.0003 -0.0571<br /> -0.1026 0.0003 -0.0571<br /> -0.5637 1.4579 0.0704<br /> -1.6299 -0.1095 -0.1495<br /> -3.1841 -0.2496 0.1041<br /> -2.4306 -0.3647 0.0319<br /> 3.1275 -0.2840 0.1093<br /> 0.8467 -0.2787 0.0892<br /> -0.1026 0.0003 -0.0571<br /> -0.0373 -0.0543 0.0157<br /> -0.0373 -0.0543 0.0157<br /><br /><br />LATENT =<br /><br /> 2.8125<br /> 0.1809<br /> 0.0066<br /><br /><br />Note three important things about the above:<br /><br />1. The order of the principal components from <i>princomp</i> is opposite of that from <i>eig(cov(B))</i>. <i>princomp</i> orders the principal components so that the first one appears in column 1, whereas <i>eig(cov(B))</i> stores it in the last column.<br /><br />2. Some of the coefficients from each method have the opposite sign. This is fine: There is no "natural" orientation for principal components, so you can expect different software to produce different mixes of signs.<br /><br />3. SCORE contains the actual principal components, as calculated by <i>princomp</i>.<br /><br />To calculate the principal components without <i>princomp</i>, simply multiply the standardized data by the principal component coefficients:<br /><br /><br />>> B * COEFF<br /><br />ans =<br /><br /> -0.1026 0.0003 -0.0571<br /> 2.5786 0.1226 -0.1277<br /> -0.0373 -0.0543 0.0157<br /> 1.7779 -0.1326 0.0536<br /> -0.1026 0.0003 -0.0571<br /> -0.1026 0.0003 -0.0571<br /> -0.5637 1.4579 0.0704<br /> -1.6299 -0.1095 -0.1495<br /> -3.1841 -0.2496 0.1041<br /> -2.4306 -0.3647 0.0319<br /> 3.1275 -0.2840 0.1093<br /> 0.8467 -0.2787 0.0892<br /> -0.1026 0.0003 -0.0571<br /> -0.0373 -0.0543 0.0157<br /> -0.0373 -0.0543 0.0157<br /><br /><br />To reverse this transformation, simply multiply by the transpose of the coefficent matrix:<br /><br /><br />>> (B * COEFF) * COEFF'<br /><br />ans =<br /><br /> -0.0971 -0.0178 0.0636<br /> 1.3591 1.5820 -1.5266<br /> 0.0149 -0.0178 0.0636<br /> 1.1351 1.0487 -0.8905<br /> -0.0971 -0.0178 0.0636<br /> -0.0971 -0.0178 0.0636<br /> -0.9932 -0.8177 -0.8905<br /> -0.9932 -0.8177 1.0178<br /> -1.6653 -1.8842 1.9719<br /> -1.2173 -1.3509 1.6539<br /> 2.0312 1.8486 -1.5266<br /> 0.6870 0.5155 -0.2544<br /> -0.0971 -0.0178 0.0636<br /> 0.0149 -0.0178 0.0636<br /> 0.0149 -0.0178 0.0636<br /><br /><br />Finally, to get back to the original data, multiply each observation by the sample standard deviation vector and add the mean vector:<br /><br /><br />>> ((B * COEFF) * COEFF') .* repmat(AStd,[n 1]) + repmat(AMean,[n 1])<br /><br />ans =<br /><br /> 269.8000 38.9000 50.5000<br /> 272.4000 39.5000 50.0000<br /> 270.0000 38.9000 50.5000<br /> 272.0000 39.3000 50.2000<br /> 269.8000 38.9000 50.5000<br /> 269.8000 38.9000 50.5000<br /> 268.2000 38.6000 50.2000<br /> 268.2000 38.6000 50.8000<br /> 267.0000 38.2000 51.1000<br /> 267.8000 38.4000 51.0000<br /> 273.6000 39.6000 50.0000<br /> 271.2000 39.1000 50.4000<br /> 269.8000 38.9000 
50.5000<br /> 270.0000 38.9000 50.5000<br /> 270.0000 38.9000 50.5000<br /><br /><br />This completes the round trip from the original data to the principal components and back to the original data. In some applications, the principal components are modified before the return trip.<br /><br />Let's consider what we've gained by making the trip to the principal component coordinate system. First, more variance has indeed been squeezed in the first principal component, which we can see by taking the sample variance of principal components:<br /><br /><br />>> var(SCORE)<br /><br />ans =<br /><br /> 2.8125 0.1809 0.0066<br /><br /><br />The cumulative variance contained in the first so many principal components can be easily calculated thus:<br /><br /><br />>> cumsum(var(SCORE)) / sum(var(SCORE))<br /><br />ans =<br /><br /> 0.9375 0.9978 1.0000<br /><br /><br />Interestingly in this case, the first principal component contains nearly 94% of the variance of the original table. A lossy data compression scheme which discarded the second and third principal components would compress 3 variables into 1, while losing only 6% of the variance.<br /><br />The other important thing to note about the principal components is that they are completely uncorrelated (as measured by the usual Pearson correlation), which we can test by calculating their correlation matrix:<br /><br /><br />>> corrcoef(SCORE)<br /><br />ans =<br /><br /> 1.0000 -0.0000 0.0000<br /> -0.0000 1.0000 -0.0000<br /> 0.0000 -0.0000 1.0000<br /><br /><br /><b>Discussion</b><br /><br />PCA "squeezes" as much information (as measured by variance) as possible into the first principal components. In some cases the number of principal components needed to store the vast majority of variance is shockingly small: a tremendous feat of data manipulation. This transformation can be performed quickly on contemporary hardware and is invertible, permitting any number of useful applications.<br /><br />For the most part, PCA really is as wonderful as it seems. There are a few caveats, however:<br /><br />1. PCA doesn't always work well, in terms of compressing the variance. Sometimes variables just aren't related in a way which is easily exploited by PCA. This means that all or nearly all of the principal components will be needed to capture the multivariate variance in the data, making the use of PCA moot.<br /><br />2. Variance may not be what we want condensed into a few variables. For example, if we are using PCA to reduce data for predictive model construction, then it is not necessarily the case that the first principal components yield a better model than the last principal components (though it often works out more or less that way).<br /><br />3. PCA is built from components, such as the sample covariance, which are not statistically robust. This means that PCA may be thrown off by outliers and other data pathologies. How seriously this affects the result is specific to the data and application.<br /><br />4. Though PCA can cram much of the variance in a data set into fewer variables, it still requires all of the variables to generate the principal components of future observations. Note that this is true, regardless of how many principal components are retained for the application. 
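<br /><br />For example, scoring a single new observation requires the stored sample means, the sample standard deviations and the full coefficient matrix, even if only the first principal component will ultimately be kept (a sketch; XNew is a made-up observation, purely for illustration):<br /><br /><i>>> XNew = [270.5 39.0 50.4]; % a hypothetical new observation: all 3 original variables are needed<br />>> PCNew = ((XNew - AMean) ./ AStd) * COEFF</i><br /><br />Only after this multiplication could the unwanted principal components be discarded. 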
PCA is <u>not</u> a subset selection procedure, and this may have important logistical implications.<br /><br /><br /><br /><b>Further Reading</b><br /><br />See also the Feb-28-2010 posting, <a href="http://matlabdatamining.blogspot.com/2010/02/putting-pca-to-work.html">Putting PCA to Work</a> and the Dec-11-2010 posting, <a href="http://matlabdatamining.blogspot.com/2010/12/linear-discriminant-analysis-lda.html">Linear Discriminant Analysis (LDA) </a>.<br /><br /><br /><i>Multivariate Statistical Methods: A Primer</i>, by Manly (ISBN: 0-412-28620-3)<br /><br />Note: The first edition is adequate for understanding and coding PCA, and is at present much cheaper than the second or third editions.Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com56tag:blogger.com,1999:blog-37324607.post-71947813829873346612010-02-16T17:39:00.010-05:002010-12-11T21:09:54.424-05:00Single Neuron Training: The Delta RuleI have recently put together a routine, DeltaRule, to train a single artificial neuron using the delta rule. <a href="http://www.mathworks.com/matlabcentral/fileexchange/26696-deltarule">DeltaRule</a> can be found at <a href="http://www.mathworks.com/matlabcentral/fileexchange/">MATLAB Central</a>.<br /><br />This posting will not go into much detail, but this type of model is something like a logistic regression, where a linear model is calculated on the input variables, then passed through a squashing function (in this case the logistic curve). Such models are most often used to model binary outcomes, hence the dependent variable is normally composed of the values 0 and 1.<br /><br />Single neurons with linear functions (with squashing functions or not) are only capable of separating classes that may be divided by a line (plane, hyperplane), yet they are often useful, either by themselves or in building more complex models.<br /><br />Use <i>help DeltaRule</i> for syntax and a simple example of its use.<br /><br />Anyway, I thought readers might find this routine useful. It trains quickly and the code is straightforward (I think), making modification easy. Please write to let me know if you do anything interesting with it.<br /><br />If you are already familiar with simple neural models like this one, here are the technical details:<br /><br />Learning rule: incremental delta rule<br />Learning rate: constant<br />Transfer function: logistic<br />Exemplar presentation order: random, by training epoch<br /><br />See also the Mar-15-2009 posting, <a href="http://matlabdatamining.blogspot.com/2009/03/logistic-regression.html">Logistic Regression</a> and the Dec-11-2010 posting, <a href="http://matlabdatamining.blogspot.com/2010/12/linear-discriminant-analysis-lda.html">Linear Discriminant Analysis (LDA)</a>.Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com2tag:blogger.com,1999:blog-37324607.post-81401440977098469152009-07-24T11:31:00.005-04:002009-07-24T11:42:13.104-04:00MATLAB Gaining in Popularity!Every month, TIOBE Software publishes its <i>TIOBE Programming Community Index</i>, a measure of programming language popularity based on a variety of sources. The current list, <a href="http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html">TIOBE Programming Community Index for July 2009</a> lists MATLAB as entering the Top 20 list (for the first time, I believe). 
While no such ordering is likely to be perfect, TIOBE seems to be one of the more comprehensive efforts for this sort of thing.<br /><br />I encourage readers to visit the TIOBE site because it is interesting to know what other tools are available, and what new languages are emerging. For interested parties, the July, 2009 Top 20 are, in descending order of popularity:<br /><br />1. Java <br />2. C <br />3. C++ <br />4. PHP <br />5. (Visual) Basic <br />6. C# <br />7. Python <br />8. Perl <br />9. JavaScript <br />10. Ruby <br />11. Delphi <br />12. PL/SQL <br />13. SAS <br />14. RPG (OS/400) <br />15. Pascal <br />16. ABAP <br />17. Lisp/Scheme <br />18. D <br />19. Lua <br />20. MATLABWill Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-59745045004678063892009-04-03T15:32:00.047-04:002009-04-19T07:10:26.998-04:00MATLAB Image Basics<b>Introduction</b><br /><br />One nice feature of MATLAB is its provision of handy functions which are not part of the programming language proper. An excellent example of this is its support for images. The base MATLAB product provides routines for the loading from disk, manipulation, display and storing to disk of raster images. While it's true that one can find code libraries to perform these functions for other programming languages, like C++, the MATLAB model offers several advantages, not the least of which is standardization. If I write image-handling code in MATLAB, I know that every other MATLAB user on Earth can run my code without modification or the need for extra header files, libraries, etc. This article will serve as a brief introduction to the use of image data within MATLAB.<br /><br /><br /><b>Image Data</b><br /><br />Images are nearly always stored in digital computers in one of two forms: <i>vector</i> or <i>raster </i>. Vector images store images as line drawings (dots, line segments, polygons) defined by the spatial coordinates of their end points or vertices, and are most often used these days in artistic settings. A raster image is simply a 2-dimensional array of colored pixels (represented by numbers). This article will concentrate on the much more common raster form.<br /><br />Raster images, being arrays of numbers, are a natural fit for MATLAB, and indeed MATLAB is a convenient tool for applications such as image processing. Raster images always have 2 spatial dimensions (horizontal and vertical), and 1 or more color planes. Typically, grayscale images are stored as a 2-dimensional array, representing 1 color plane with values of 0.0 indicating black, 1.0 indicating white and intermediate values indicating various shades of gray. Color images are similar to grayscale images, but are most often stored as a 3-dimensional array, which is really a stack of three 2-dimensional color planes: one for each primary color: red, green and blue ("RGB"). As with grayscale images, values in the RGB color planes represent brightness of each color. Note that when all three color values are the same, the resulting color is a shade of gray.<br /><br />For the reader's knowledge, there are also index images which will not be covered here, but which are full of index numbers (integers) which do not represent colors directly, but instead indicate locations in a palette. 
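<br /><br />Should you encounter one, the base MATLAB function <i>ind2rgb</i> will convert an index array and its palette into an ordinary RGB array (a one-line sketch; XIndex and Palette are made-up variable names for the index data and its colormap):<br /><br /><i>>> A = ind2rgb(XIndex,Palette);</i><br /><br />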
Also, brightness values are often stored in files as integers, such as 0 - 255 instead of 0.0 to 1.0.<br /><br /><br /><b>Loading Images from Disk</b><br /><br />In MATLAB, images are read from disk using the <i>imread</i> function. Using <i>imread</i> is easy. The basic parameters are the location of the image file and the file format:<br /><br /><i>>> A = imread('c:\QRNG.png','PNG');<br />>> whos<br /> Name Size Bytes Class Attributes<br /><br /> A 942 x 1680 x 3 4747680 uint8 </i><br /><br />This image is 942 pixels vertical by 1680 pixels horizontal, with 3 color planes (red, green and blue). Note that image data has been store in MATLAB as unsigned 8-bit integers (uint8). Since I often make multiple calculations on images, I typically convert the data type to double-precision real (double) and scale to 0.0 - 1.0 (though this will slow calculation):<br /><br /><i>>> B = double(A) / 255;</i><br /><br /><br /><b>Displaying Images</b><br /><br />Showing images on the screen is most easily accomplish using the <i>image</i> function:<br /><br /><i>image(A)</i><br /><br />Grayscale images will display using a default palette, which can be changed via the <i>colormap</i> command:<br /><br /><i>>>colormap gray</i><br /><br />Images will be fit to the screen, which may distort their aspect ratio. This can be fixed using:<br /><br /><i>>>axis equal</i><br /><br />...meaning that pixels will use equal scales horizontally and vertically.<br /><br /><br /><b>Manipulating Images</b><br /><br />As arrays, images can be modified using all the fun things we usually do to arrays in MATLAB (subsetting, math operations, etc.). I will mention one other useful base MATLAB tool for image processing: the <i>rgb2hsv</i> function, which converts an RGB image to an HSV one. HSV is a different <i>colorspace</i> (way of representing colors). HSV arrays are similar to RGB arrays, except their 3 color planes are <i>hue</i>, <i>saturation</i> and <i>value</i> (in that order). It is often convenient to work on the <i>value</i> ("brightness") plane, to isolate changes in light/dark from changes in the color. To get back to the land of RGB, use the function <i>hsv2rgb</i>.<br /><br /><br /><b>Saving Images to Disk</b><br /><br />Images can be saved to disk using the <i>imwrite</i> command. This is essentially the inverse of the <i>imread</i> command:<br /><br /><i>imwrite(A,'New Image.bmp','BMP')</i><br /><br />...with the parameters indicating the array to be saved as an image file, the file location and image file format, in that order.<br /><br />Note that MATLAB understands images as both 0 - 255 <i>uint8</i>s and 0.0 - 1.0 <i>double</i>s, so there is no need to reverse this transformation before image storage.<br /><br /><br /><b>Conclusion</b><br /><br />Working on images in MATLAB is very convenient, especially when compared to more general-purpose languages. I urge the reader to check the <i>help</i> facility for the functions mentioned here to learn of further functionality.<br /><br /><br /><b>Further Reading</b><br /><br />For more information on image processing, I recommend either of the following books:<br /><br /><i>Digital Image Processing (3rd Edition)</i> by Gonzalez and Woods<br />(ISBN-13: 978-0131687288)<br /><br /><i>Algorithms for Image Processing and Computer Vision</i> by J. R. 
Parker<br />(ISBN-13: 978-0471140566)<br /><br /><i></i>Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com14tag:blogger.com,1999:blog-37324607.post-91590875915782068882009-04-01T20:59:00.008-04:002010-09-13T20:58:27.021-04:00Introduction to Conditional Entropy<b>Introduction</b><br /><br />In one of the earliest posts to this log, <a href="http://matlabdatamining.blogspot.com/2006/11/introduction-to-entropy.html">Introduction To Entropy</a> (Nov-10-2006), I described the entropy of discrete variables. Finally, in this posting, I have gotten around to continuing this line of inquiry and will explain <i>conditional entropy</i>.<br /><br /><br /><b>Quick Review: Entropy</b><br /><br />Recall that <i>entropy</i> is a measure of uncertainty about the state of a variable. In the case of a variable which can take on only two values (male/female, profit/loss, heads/tails, alive/dead, etc.), entropy assumes its maximum value, 1 bit, when the probability of the outcomes is equal. A fair coin toss is a high-entropy event: Beforehand, we have no idea about what will happen. As the probability distribution moves away from a 50/50 split, uncertainty about the outcome decreases since there is less uncertainty as to the outcome. Recall, too, that entropy decreases regardless of which outcome class becomes the more probable: an unfair coin toss, with a heads / tails probability distribution of 0.10 / 0.90 has exactly the same entropy as another unfair coin with a distribution of 0.90 / 0.10. It is the distribution of probabilities, not their order which matters when calculating entropy.<br /><br />Extending these ideas to variables with more than 2 possible values, we note that generally, distributions which are evenly spread among the possible values exhibit higher entropy, and those which are more concentrated in a subset of the possible values exhibit lower entropy.<br /><br /><br /><b>Conditional Entropy</b><br /><br />Entropy, in itself, is a useful measure of the mixture of values in variables, but we are often interested in characterizing variables after they have been conditioned ("modeled"). <i>Conditional entropy</i> does exactly that, by measuring the entropy remaining in a variable, after it has been conditioned by another variable. I find it helpful to think of conditional entropy as "residual entropy". Happily for you, I have already assembled a MATLAB function, <a href="http://www.mathworks.com/matlabcentral/fileexchange/28693-conditional-entropy">ConditionalEntropy</a>, to calculate this measure.<br /><br />As an example, think of sporting events involving pairs of competitors, such as soccer ("football" if you live outside of North America). Knowing nothing about a particular pair of teams (and assuming no ties), the best probability we may assess that any specific team, say, the Black Eagles (Go, Beşiktaş!), will win is:<br /><br />p(the Black Eagles win) = 0.5<br /><br />The entropy of this variable is 1 bit- the maximum uncertainty possible for a two-category variable. We will attempt to lower this uncertainty (hence, lowering its conditional entropy).<br /><br />In some competitive team sports, it has been demonstrated statistically that home teams have an advantage over away teams. Among the most popular sports, the home advantage appears to be largest for soccer. One estimate shows that (excluding ties) home teams have historically won 69% of the time, and away teams 31% of the time. Conditioning the outcome on home vs. 
away, we may provide the following improved probability estimates:<br /><br />p(the Black Eagles win, given that they are the home team) = 0.69<br />p(the Black Eagles win, given that they are the away team) = 0.31<br /><br />Ah! Now there is somewhat less uncertainty! The entropy for probability 0.69 is 0.8932 bits. The entropy for probability 0.31 (being the same distance from 0.5 as 0.69) is also 0.8932 bits.<br /><br />Conditional entropy is calculated as the weighted average of the entropies of the various possible conditions (in this case home or away). Assuming that it is equally likely that the Black Eagles play home or away, the conditional entropy of them winning is 0.8932 bits = (0.5 * 0.8932 + 0.5 * 0.8932). As entropy has gone down with our simple model, from 1 bit to 0.8932 bits, we learn that knowing whether the Black Eagles are playing at home or away provides information and reduces uncertainty.<br /><br />Other variables might be used to condition the outcome of a match, such as number of player injuries, outcome of recent games, etc. We can compare these candidate predictors using conditional entropy. The lower the conditional entropy, the lower the remaining uncertainty. It is even possible to assess a combination of predictors by treating each combination of univariate conditions as a separate condition ("symbol", in information theoretic parlance), thus:<br /><br />p(the Black Eagles win, given:<br /> that they are the home team and there are no injuries) = ...<br /> that they are the away team and there are no injuries) = ...<br /> that they are the home team and there is 1 injury) = ...<br /> that they are the away team and there is 1 injury) = ...<br /> etc.<br /><br />My routine, <i>ConditionalEntropy</i>, can accommodate multiple conditioning variables.<br /><br /><br /><b>Conclusion</b><br /><br />The models being developed here are tables which simplistically cross all values of all input variables, but conditional entropy can also, for instance, be used to evaluate candidate splits in decision tree induction, or to assess class separation in discriminant analysis.<br /><br />Note that, as the entropies calculated in this article are based on <b>sample</b> probabilities, they suffer from the same limitations as all sample statistics (as opposed to population statistics). A sample entropy calculated from a small number of observations likely will not agree exactly with the population entropy.<br /><br /><br /><b>Further Reading</b><br /><br />See also my Sep-12-2010 posting, <a href="http://matlabdatamining.blogspot.com/2010/09/reader-question-putting-entropy-to-work.html">Reader Question: Putting Entropy to Work</a>.<br /><br />Print:<br /><br /><i>The Mathematical Theory of Communication</i> by Claude Shannon (ISBN 0-252-72548-4)<br /><br /><i>Elements of Information Theory</i> by Cover and Thomas (ISBN 0-471-06259)Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com3tag:blogger.com,1999:blog-37324607.post-1278230864751102192009-03-27T10:36:00.004-04:002014-04-13T08:00:26.050-04:00L1LinearRegression Code UpdateThe L-1 regression routine, <a href="http://dwinnell.com/L1LinearRegression.m">L1LinearRegression</a>, originally mentioned in the Oct-23-2007 posting, <a href="http://matlabdatamining.blogspot.com/2007/10/l-1-linear-regression.html">L-1 Linear Regression</a>, has been updated. The old version produced correct results, but the new one is more efficient.<br />
<br />
Thanks to reader Andreas Steimer for contacting me about this routine.Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com0tag:blogger.com,1999:blog-37324607.post-31598678155773257202009-03-26T18:44:00.003-04:002009-03-31T21:23:51.157-04:00Rexer Analytics' 2009 Data Miner SurveyI'd like to alert readers to Rexer Analytics' 2009 Data Miner Survey. I urge you to participate by visiting the on-line survey at:<br /><br /><br /><a href="http://www.RexerAnalytics.com/Data-Miner-Survey-Intro2.html">Rexer Analytics' 2009 Data Miner Survey</a><br /><br />The Access Code is: <b>TW4D2</b><br /><br /><br />Your entry is confidential and will help all of us better understand what is happening in the field of data mining.<br /><br />Thanks!Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com1tag:blogger.com,1999:blog-37324607.post-28694455313164894282009-03-20T21:18:00.002-04:002009-03-20T21:24:26.331-04:00Status Update: Mar-2009This is just a short note to let everyone know that I have been working (finally) to restore the broken links on this Web log. I believe that the source code links have all now been fixed.<br /><br />This log has passed the 100,000 unique visitor mark without fanfare because I was too busy at the time to notice. I am thankful for those who continue to write in response to this Web log, and am glad to learn how many people this helps. The pace of posting has picked up, and I expect this to continue this year.Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com2tag:blogger.com,1999:blog-37324607.post-85223703834371880882009-03-15T11:02:00.036-04:002010-12-11T21:08:09.423-05:00Logistic Regression<b>Introduction</b><br /><br />Often, the analyst is required to construct a model which estimates probabilities. This is common in many fields: medical diagnosis (probability of recovery, relapse, etc.), credit scoring (probability of a loan being repaid), sports (probability of a team beating a competitor- wait... maybe that belongs in the "investment" category?).<br /><br />Many people are familiar with <i>linear regression</i>- why not just use that? There are several good reasons not to do this, but probably the most obvious is that linear models will always fall below 0.0 and poke out above 1.0, yielding answers which do not make sense as probabilities.<br /><br />Many different classification models have been devised which estimate the probability of class membership, such as linear and quadratic discriminant analysis, neural networks and tree induction. The technique covered in this article is <i>logistic regression</i>- one of the simplest modeling procedures.<br /><br /><br /><b>Logistic Regression</b><br /><br />Logistic regression is a member of the family of methods called <i>generalized linear models</i> ("GLM"). Such models include a linear part followed by some "link function". If you are familiar with neural networks, think of "transfer functions" or "squashing functions". So, the linear function of the predictor variables is calculated, and the result of this calculation is run through the link function. In the case of logistic regression, the linear result is run through a <i>logistic function</i> (see figure 1), which runs from 0.0 (at negative infinity), rises monotonically to 1.0 (at positive infinity). Along the way, it is 0.5 when the input value is exactly zero. Among other desirable properties, note that this logistic function only returns values between 0.0 and 1.0. 
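<br /><br />Concretely, the logistic function is f(z) = 1 / (1 + exp(-z)), and a curve much like figure 1 can be drawn with a few lines (a quick sketch; the plotting range is arbitrary):<br /><br /><i>>> z = -8:0.01:8;<br />>> plot(z,1 ./ (1 + exp(-z))), grid on<br />>> xlabel('Input'), ylabel('Logistic Output')</i><br /><br />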
Other GLMs operate similarly, but employ different link functions- some of which are also bound by 0.0 - 1.0, and some of which are not.<br /><br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_aTiM0lwqgJ4/Sb1Cqj_WD4I/AAAAAAAAACg/ZB8H6Nh89e4/s1600-h/TheMostInterestingPartOfTheLogisticFunction.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 224px;" src="http://4.bp.blogspot.com/_aTiM0lwqgJ4/Sb1Cqj_WD4I/AAAAAAAAACg/ZB8H6Nh89e4/s400/TheMostInterestingPartOfTheLogisticFunction.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5313476434349920130" /></a><br /><b>Figure 1: The Most Interesting Part of the Logistic Function</b> (Click figure to enlarge)<br /><br /><br />While calculating the optimal coefficients of a least-squares linear regression has a direct, closed-form solution, this is not the case for logistic regression. Instead, some iterative fitting procedure is needed, in which successive "guesses" at the right coefficients are incrementally improved. Again, if you are familiar with neural networks, this is much like the various training rules used with the simplest "single neuron" models. Hopefully, you are lucky enough to have a routine handy to perform this process for you, such as <i>glmfit</i>, from the Statistics Toolbox.<br /><br /><br /><b>glmfit</b><br /><br />The <i>glmfit</i> function is easy to apply. The syntax for logistic regression is:<br /><br /><i>B = glmfit(X, [Y N], 'binomial', 'link', 'logit');</i><br /><br /><i>B</i> will contain the discovered coefficients for the linear portion of the logistic regression (the link function has no coefficients). <i>X</i> contains the predictor data, with examples in rows, variables in columns. <i>Y</i> contains the target variable, usually a 0 or a 1 representing the outcome. Last, the variable <i>N</i> contains the count of events for each row of the example data- most often, this will be a column of 1s, the same size as <i>Y</i>. The count parameter, <i>N</i>, will be set to values greater than 1 for grouped data. As an example, think of medical cases summarized by country: each country will have averaged input values, an outcome which is a rate (between 0.0 and 1.0), and the count of cases from that country. In the event that the counts are greater than one, the target variable represents the count of target class observations.<br /><br />Here is a very small example:<br /><br /><i>>> X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]';<br />>> Y = [0 0 1 0 0 0 1 1 1 1]';<br />>> B = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'logit')<br /><br />B =<br /><br /> -3.4932<br /> 2.9402</i><br /><br />The first element of <i>B</i> is the constant term, and the second element is the coefficient for the lone input variable. 
We apply the linear part of this logistic regression thus:<br /><br /><i>>> Z = B(1) + X * (B(2))<br /><br />Z =<br /><br /> -3.4932<br /> -3.1992<br /> -1.4350<br /> -0.5530<br /> -0.2589<br /> 0.3291<br /> 0.6231<br /> 1.5052<br /> 2.6813<br /> 2.9753</i><br /><br />To finish, we apply the logistic function to the output of the linear part:<br /><br /><i>>> Z = Logistic(B(1) + X * (B(2)))<br /><br />Z =<br /><br /> 0.0295<br /> 0.0392<br /> 0.1923<br /> 0.3652<br /> 0.4356<br /> 0.5815<br /> 0.6509<br /> 0.8183<br /> 0.9359<br /> 0.9514</i><br /><br />Despite the simplicity of the logistic function, I built it into a small function, <i>Logistic</i>, so that I wouldn't have to repeatedly write out the formula:<br /><br /><i>% Logistic: calculates the logistic function of the input<br />% by Will Dwinnell<br />%<br />% Last modified: Sep-02-2006<br /><br />function Output = Logistic(Input)<br /><br />Output = 1 ./ (1 + exp(-Input));<br /><br /><br />% EOF</i><br /><br /><br /><b>Conclusion</b><br /><br />Though it is structurally very simple, logistic regression still finds wide use today in many fields. It is quick to fit, easy to implement the discovered model and quick to recall. Frequently, it yields better performance than competing, more complex techniques. I recently built a logistic regression model which beat out a neural network, decision trees and two types of discriminant analysis. If nothing else, it is worth fitting a simple model such as logistic regression early in a modeling project, just to establish a performance benchmark for the project.<br /><br />Logistic regression is closely related to another GLM procedure, <i>probit regression</i>, which differs only in its link function (specified in <i>glmfit</i> by replacing 'logit' with 'probit'). I believe that probit regression has been losing popularity since its results are typically very similar to those from logistic regression, but the formula for the logistic link function is simpler than that of the probit link function.<br /><br /><br /><br /><b>References</b><br /><br /><i>Generalized Linear Models</i>, by McCullagh and Nelder (ISBN-13: 978-0412317606)<br /><br /><br /><b>See Also</b><br /><br />The Apr-21-2007 posting, <a href="http://matlabdatamining.blogspot.com/2007/04/linear-regression-in-matlab.html">Linear Regression in MATLAB</a>, the Feb-16-2010 posting, <a href="http://matlabdatamining.blogspot.com/2010/02/single-neuron-training-delta-rule.html">Single Neuron Training: The Delta Rule</a> and the Dec-11-2010 posting, <a href="http://matlabdatamining.blogspot.com/2010/12/linear-discriminant-analysis-lda.html">Linear Discriminant Analysis (LDA)</a>.Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com23tag:blogger.com,1999:blog-37324607.post-63362917204074358332009-03-13T07:53:00.010-04:002009-03-13T08:35:15.507-04:00MATLAB 2009aMATLAB is on the move. Release 2009a brings a number of changes. The function of the random number generators had already begun to change in the base product as of the last release, if you hadn't noticed, and several functions (min, max, sum and prod, as well as several of the FFT functions) are now multi-threaded. This release also witnesses several changes to the analytical toolboxes. Among others...<br /><br /><br /><b>In the Statistics Toolbox...</b><br /><br />A Naïve Bayes modeling tool has been added. This is a completely different way of modeling than the other technique in the Statistics Toolbox. 
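<br /><br />A minimal sketch of the sort of call involved (the variable names here are made up, and the exact syntax of the new <i>NaiveBayes</i> class should be checked against the release documentation):<br /><br /><i>>> NBModel = NaiveBayes.fit(XTrain,YTrain); % XTrain: predictors, YTrain: class labels<br />>> YPredicted = NBModel.predict(XTest); % classify new observations</i><br /><br />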
Obviously, the more diverse the set of modeling tools, the better.<br /><br />Data table joining has been enhanced several ways, including the ability to use multiple keys and different types of joins (inner, outer, etc.).<br /><br />A number of changes to the tree induction facility (<i>classregtree</i>), including a fix to the quirky <i>splitmin</i> parameter. Now the programmer can specify the minimum number of cases per leaf node, which seems like a better way to control decision tree growth.<br /><br />There are also new options for model ensembles and performance curve summaries.<br /><br /><br /><br /><b>In the Curve Fitting Toolbox...</b><br /><br />Yippee! There are now functions for surface fitting (functions fit to 2 inputs, instead of just 1). Both interactive and programmatic fitting are available.<br /><br /><br /><b>In the Parallel Computing Toolbox...</b><br /><br />The maximum number of local workers has been increased from 4 to 8.Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com3tag:blogger.com,1999:blog-37324607.post-90508693990058279292009-03-03T20:57:00.029-05:002011-12-09T21:07:28.989-05:00Introduction to the Percentile Bootstrap<b>Introduction</b><br /><br />This article introduces the <i>percentile bootstrap</i>, the simplest of the bootstrap methods. The bootstrap family of techniques is used to establish confidence intervals and calculate hypothesis tests for statistical measures.<br /><br /><b>Problem Statement and Conventional Solution</b><br /><br />One is often required to summarize a set of data, such as the following:<br /><br /><br /><i>X =<br /><br /> 2<br /> 10<br /> 10<br /> 5<br /> 8<br /> 1<br /> 4<br /> 9<br /> 8<br /> 10</i><br /><br /><br />The most commonly used summary is the mean, in MATLAB calculated thus: <br /><br /><i>>> mean(X)<br /><br />ans =<br /><br /> 6.7000</i><br /><br /><br /><br />Summaries, however, discard a great deal of information. In any situation, it is helpful to know the quality of our summary. In the case above, we may wonder how far our <i>sample mean</i> is likely to be from the true <i>population mean</i> (the mean of all numbers drawn from the theoretical statistical population). Our sample mean, after all, was calculated from only 10 observations.<br /><br />We may establish some idea of how far off the sample mean may be from the population mean by calculating the <i>standard error of the sample mean</i>, which is the standard deviation divided by the square root of the sample size. In MATLAB:<br /><br /><i>>> StandardError = std(X) / sqrt(length(X))<br /><br />StandardError =<br /><br /> 1.0858</i><br /><br />Note that there are fancier versions of this calculation, for example for cases in which the population size is finite, or when the standard error of a proportion is being calculated. The standard error gives us an estimate of how far away our sample mean might be from the true population mean, and acts like a <i>z-score</i>: the population mean is within 2 times the standard error from the sample mean about 95% of the time. In the case of our little data set, this would be from 4.5285 (= 6.7000 - 2 * 1.0858) to 8.8715 (= 6.7000 + 2 * 1.0858).<br /><br />Note that as the number of observations grows, the bottom part of the standard error fraction becomes larger and the standard error decreases. 
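For instance, 40 observations exhibiting the same standard deviation would halve the standard error, since sqrt(40) = 2 * sqrt(10). 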
This seems natural enough: with more data, our confidence in our statistic increases.<br /><br /><br /><b>Complications of the Problem Statement</b><br /><br />So far, so good: We may have had to look up the standard error formula in a book, but we have established some sort of parameters as to the certainty of our summary. What if we didn't have such a reference, though? The median, for example, has no such simple formula to establish its certainty. (Actually, I believe there is a formula for the median, but that it is a real bear!) Anyway, there certainly are other measures which we may calculate (even ones which we invent on the spot), for which there are no handy standard error formulas. What to do?<br /><br /><br /><b>An Alternative Solution: The Bootstrap</b><br /><br />Just as we are about to throw up our hands and consider another career, the <i>bootstrap</i> appears. The basic method of the bootstrap is simple: Draw many samples with replacement from the original sample ("replicates"), and tabulate the summary statistic when calculated on each of those replicate samples. The distribution of those replicated summaries is intended to mimic the distribution being parameterized by the standard error of the mean.<br /><br />Above, I mentioned that the population mean would be found inside the band from the sample mean minus two times the standard error to the sample mean plus two times the standard error about 95% of the time. The equivalent area in our bootstrap process would be between the 2.5 and 97.5 percentiles of our replicate summaries. We use 2.5 and 97.5 because that leaves a total of 5% outside of the range, half on each end of the spectrum.<br /><br />An example using the median will illustrate this process. For reference, let's calculate the sample median first:<br /><br /><i>>> median(X)<br /><br />ans =<br /><br /> 8</i><br /><br />Drawing a single sample with replacement can be done in MATLAB by indexing using random integers:<br /><br /><i>>> RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)))<br /><br />RandomSampleWithReplacement =<br /><br /> 5<br /> 8<br /> 1<br /> 1<br /> 10<br /> 10<br /> 9<br /> 9<br /> 5<br /> 1</i><br /><br />This is our first bootstrap replicate. Now, we calculate our summary on this replicate:<br /><br /><i>>> median(RandomSampleWithReplacement)<br /><br />ans =<br /><br /> 6.5000</i><br /><br />Discerning the distribution, though, will require many more replicates. Since the computer is doing all of the work, I generally like to run at least 2000 replicates to give the bootstrap distribution a chance to take shape:<br /><br /><i>rand('twister',1242) % Seed the random number generator for repeatability<br />T = NaN(2000,1); % Allocate space for the replicated summaries<br />for i = 1:2000 % The machine's doing the work, so why not?<br />RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1))); % Draw a sample with replacement<br />T(i) = median(RandomSampleWithReplacement); % Calculate the replicated summary<br />end</i><br /><br />(I apologize if the code is a bit cramped, but I have not been able to figure out how to insert tabs or indentation in this edit window.)<br /><br />Now, estimating where the "real" median (the population median) is likely to be is a simple matter of checking percentiles in our replicated summaries. I have the Statistics Toolbox, so I will cheat by using a function from there:<br /><br /><i>>> prctile(T,[2.5 97.5])<br /><br />ans =<br /><br /> 3.5000 10.0000</i><br /><br />So, our population median is likely to lie between 3.5 and 10. 
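<br /><br />Readers with the Statistics Toolbox can generate an equivalent set of replicated medians in a single line using the <i>bootstrp</i> function (a quick sketch; its random draws will differ from those above, so the percentiles will not match exactly, but the interval should again run roughly from 3.5 to 10):<br /><br /><i>>> T = bootstrp(2000,@median,X);<br />>> prctile(T,[2.5 97.5])</i><br /><br />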
That is a pretty wide range, but this is the consequence of having so little data.<br /><br /><br /><b>Wrap-Up</b><br /><br />The fundamental trade-off of the bootstrap is that one forsakes pat statistical formulas in favor of strenuous computation. In summary:<br /><br />Good:<br /><br />-The bootstrap solves many problems not amenable to conventional methods.<br /><br />-Even in cases where conventional solutions exist, the bootstrap requires no memory or selection of correct formulas for given situations.<br /><br />Bad:<br />-The bootstrap requires considerable numerical computation. Of course, in an era of cheap and powerful computing machinery, this is much less of an issue. Still, if there are many of these to perform...<br /><br />-The bootstrap presented in this article, the bootstrap percentile, is known to deviate from theoretically correct answers, though generally in a small way. There are more sophisticated bootstrap procedures which address some of these concerns, though.<br /><br />This process, owing to its very general nature, can be applied to tasks much more complex than estimating the uncertainty of statistical summaries, such as hypothesis testing and predictive model performance evaluation.<br /><br /><br /><b>Further Reading</b><br /><br /><i>An Introduction to the Bootstrap</i> by Efron and Tibshirani (ISBN 0-412-04231-2)<br />This is the seminal work in this field. It covers a lot of ground, but is a bit mathematical.Will Dwinnellhttp://www.blogger.com/profile/03379859054257561952noreply@blogger.com4