Friday, March 27, 2009

L1LinearRegression Code Update

The L-1 regression routine, L1LinearRegression, originally mentioned in the Oct-23-2007 posting, L-1 Linear Regression, has been updated. The old version produced correct results, but the new one is more efficient.

Thanks to reader Andreas Steimer for contacting me about this routine.

Thursday, March 26, 2009

Rexer Analytics' 2009 Data Miner Survey

I'd like to alert readers to Rexer Analytics' 2009 Data Miner Survey. I urge you to participate by visiting the on-line survey at:


Rexer Analytics' 2009 Data Miner Survey

The Access Code is: TW4D2


Your entry is confidential and will help all of us better understand what is happening in the field of data mining.

Thanks!

Friday, March 20, 2009

Status Update: Mar-2009

This is just a short note to let everyone know that I have been working (finally) to restore the broken links on this Web log. I believe that the source code links have all now been fixed.

This log has passed the 100,000 unique visitor mark without fanfare because I was too busy at the time to notice. I am thankful for those who continue to write in response to this Web log, and am glad to learn how many people this helps. The pace of posting has picked up, and I expect this to continue this year.

Sunday, March 15, 2009

Logistic Regression

Introduction

Often, the analyst is required to construct a model which estimates probabilities. This is common in many fields: medical diagnosis (probability of recovery, relapse, etc.), credit scoring (probability of a loan being repaid), sports (probability of a team beating a competitor- wait... maybe that belongs in the "investment" category?).

Many people are familiar with linear regression- why not just use that? There are several good reasons not to do this, but probably the most obvious is that linear models are unbounded: push the inputs far enough and they will fall below 0.0 or poke out above 1.0, yielding answers which do not make sense as probabilities.

Many different classification models have been devised which estimate the probability of class membership, such as linear and quadratic discriminant analysis, neural networks and tree induction. The technique covered in this article is logistic regression- one of the simplest modeling procedures.


Logistic Regression

Logistic regression is a member of the family of methods called generalized linear models ("GLM"). Such models include a linear part followed by some "link function". If you are familiar with neural networks, think of "transfer functions" or "squashing functions". So, the linear function of the predictor variables is calculated, and the result of this calculation is run through the link function. In the case of logistic regression, the linear result is run through a logistic function (see Figure 1), which runs from 0.0 (at negative infinity) and rises monotonically to 1.0 (at positive infinity). Along the way, it is exactly 0.5 when the input value is zero. Among other desirable properties, note that the logistic function only returns values between 0.0 and 1.0. Other GLMs operate similarly, but employ different link functions- some of which are also bounded by 0.0 and 1.0, and some of which are not.



Figure 1: The Most Interesting Part of the Logistic Function (Click figure to enlarge)
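
For readers who wish to reproduce a figure like this, here is a minimal sketch (not necessarily the code behind Figure 1):

x = linspace(-6, 6, 200);          % inputs covering the interesting range
y = 1 ./ (1 + exp(-x));            % the logistic function
plot(x, y)
grid on
xlabel('Input'), ylabel('Logistic(Input)')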


While calculating the optimal coefficients of a least-squares linear regression has a direct, closed-form solution, this is not the case for logistic regression. Instead, some iterative fitting procedure is needed, in which successive "guesses" at the right coefficients are incrementally improved. Again, if you are familiar with neural networks, this is much like the various training rules used with the simplest "single neuron" models. Hopefully, you are lucky enough to have a routine handy to perform this process for you, such as glmfit, from the Statistics Toolbox.


glmfit

The glmfit function is easy to apply. The syntax for logistic regression is:

B = glmfit(X, [Y N], 'binomial', 'link', 'logit');

B will contain the discovered coefficients for the linear portion of the logistic regression (the link function has no coefficients). X contains the predictor data, with examples in rows and variables in columns. Y contains the target variable, usually a 0 or a 1 representing the outcome. Last, the variable N contains the count of events for each row of the example data- most often, this will be a column of 1s, the same size as Y. The count parameter, N, is set to values greater than 1 for grouped data. As an example, think of medical cases summarized by country: each country will have averaged input values, an outcome which is a rate (between 0.0 and 1.0), and the count of cases from that country. When the counts are greater than one, the target variable represents the count of target-class observations in each group (a sketch of this grouped case follows).
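
To make the grouped case concrete, here is a minimal sketch with made-up numbers (the variable names and values are mine, not from any real data set):

AvgInput = [0.8; 1.3; 2.0];       % averaged predictor value, one row per group
Successes = [12; 30; 55];         % count of target-class observations per group
Cases = [40; 50; 60];             % total count of cases per group
B = glmfit(AvgInput, [Successes Cases], 'binomial', 'link', 'logit');

Here, glmfit treats each row as Successes(i) positive outcomes out of Cases(i) trials.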

Returning to the ungrouped case, here is a very small example:

>> X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]';
>> Y = [0 0 1 0 0 0 1 1 1 1]';
>> B = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'logit')

B =

-3.4932
2.9402


The first element of B is the constant term, and the second element is the coefficient for the lone input variable. We apply the linear part of this logistic regression thus:

>> Z = B(1) + X * (B(2))

Z =

-3.4932
-3.1992
-1.4350
-0.5530
-0.2589
0.3291
0.6231
1.5052
2.6813
2.9753


To finish, we apply the logistic function to the output of the linear part:

>> Z = Logistic(B(1) + X * (B(2)))

Z =

0.0295
0.0392
0.1923
0.3652
0.4356
0.5815
0.6509
0.8183
0.9359
0.9514
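
As a quick visual check (this sketch is mine, not part of the original walkthrough), the fitted curve can be plotted over the training data:

XGrid = linspace(0, 2.5, 100)';                              % a fine grid spanning the inputs
plot(X, Y, 'o', XGrid, Logistic(B(1) + XGrid * B(2)), '-')   % data and fitted probabilities
xlabel('X'), ylabel('Estimated probability')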


Despite the simplicity of the logistic function, I built it into a small function, Logistic, so that I wouldn't have to repeatedly write out the formula:

% Logistic: calculates the logistic function of the input
% by Will Dwinnell
%
% Last modified: Sep-02-2006

function Output = Logistic(Input)

Output = 1 ./ (1 + exp(-Input));


% EOF



Conclusion

Though it is structurally very simple, logistic regression still finds wide use today in many fields. It is quick to fit, the discovered model is easy to implement, and recall is fast. Frequently, it yields better performance than competing, more complex techniques. I recently built a logistic regression model which beat out a neural network, decision trees and two types of discriminant analysis. If nothing else, it is worth fitting a simple model such as logistic regression early in a modeling project, just to establish a performance benchmark for the project.

Logistic regression is closely related to another GLM procedure, probit regression, which differs only in its link function (specified in glmfit by replacing 'logit' with 'probit'). I believe that probit regression has been losing popularity since its results are typically very similar to those from logistic regression, while the formula for the logistic link function is simpler than that of the probit link function.
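
As a quick comparison (a sketch of mine, reusing X and Y from the small example above), one can fit both links and place the estimated probabilities side by side. Note that normcdf, which maps the linear part back to a probability under the probit link, is also in the Statistics Toolbox:

Blogit = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'logit');
Bprobit = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'probit');
Plogit = Logistic(Blogit(1) + X * Blogit(2));      % logistic function maps linear part to probability
Pprobit = normcdf(Bprobit(1) + X * Bprobit(2));    % normal CDF does the same for the probit link
[Plogit Pprobit]                                   % the two columns are typically very close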



References

Generalized Linear Models, by McCullagh and Nelder (ISBN-13: 978-0412317606)


See Also

The Apr-21-2007 posting, Linear Regression in MATLAB, the Feb-16-2010 posting, Single Neuron Training: The Delta Rule, and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).

Friday, March 13, 2009

MATLAB 2009a

MATLAB is on the move. Release 2009a brings a number of changes. The behavior of the random number generators had already begun to change in the base product as of the last release, if you hadn't noticed, and several functions (min, max, sum and prod, as well as several of the FFT functions) are now multi-threaded. This release also brings several changes to the analytical toolboxes. Among others...


In the Statistics Toolbox...

A Naïve Bayes modeling tool has been added. This is a completely different way of modeling than the other techniques in the Statistics Toolbox. Obviously, the more diverse the set of modeling tools, the better.

Data table joining has been enhanced several ways, including the ability to use multiple keys and different types of joins (inner, outer, etc.).

There are a number of changes to the tree induction facility (classregtree), including a fix for the quirky splitmin parameter. Now the programmer can specify the minimum number of cases per leaf node, which seems like a better way to control decision tree growth (see the sketch after this list of changes).

There are also new options for model ensembles and performance curve summaries.
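
A hedged sketch of the new leaf-size control follows. I am assuming the parameter is named 'minleaf'; check the R2009a classregtree documentation before relying on this:

load fisheriris                                    % example data shipped with the Statistics Toolbox
t = classregtree(meas, species, 'minleaf', 10);    % require at least 10 cases per leaf (parameter name assumed)
view(t)                                            % display the induced tree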



In the Curve Fitting Toolbox...

Yippee! There are now functions for surface fitting (functions fit to 2 inputs, instead of just 1). Both interactive and programmatic fitting are available.


In the Parallel Computing Toolbox...

The maximum number of local workers has been increased from 4 to 8.

Tuesday, March 03, 2009

Introduction to the Percentile Bootstrap

Introduction

This article introduces the percentile bootstrap, the simplest of the bootstrap methods. The bootstrap family of techniques is used to establish confidence intervals and to calculate hypothesis tests for statistical measures.

Problem Statement and Conventional Solution

One is often required to summarize a set of data, such as the following:


X =

2
10
10
5
8
1
4
9
8
10



The most commonly used summary is the mean, which in MATLAB is calculated thus:

>> mean(X)

ans =

6.7000




Summaries, however, discard a great deal of information. In any situation, it is helpful to know the quality of our summary. In the case above, we may wonder how far our sample mean is likely to be from the true population mean (the mean of all numbers drawn from the theoretical statistical population). Our sample mean, after all, was calculated from only 10 observations.

We may establish some idea of how far off the sample mean may be from the population mean by calculating the standard error of the sample mean, which is the standard deviation divided by the square root of the sample size. In MATLAB:

>> StandardError = std(X) / sqrt(length(X))

StandardError =

1.0858


Note that there are fancier versions of this calculation, for example for cases in which the population size is finite, or when the standard error of a proportion is being calculated. The standard error gives us an estimate of how far away our sample mean might be from the true population mean, and acts much like a z-score: the population mean is within 2 times the standard error of the sample mean about 95% of the time. In the case of our little data set, this would be from 4.5285 (= 6.7000 - 2 * 1.0858) to 8.8715 (= 6.7000 + 2 * 1.0858).
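
In MATLAB, the whole interval can be computed directly. A small sketch, reusing X from above (the variable names are mine):

SampleMean = mean(X);                              % 6.7000
StandardError = std(X) / sqrt(length(X));          % 1.0858
Interval = [SampleMean - 2 * StandardError, SampleMean + 2 * StandardError]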

Note that as the number of observations grows, the denominator of the standard error fraction becomes larger and the standard error decreases. This seems natural enough: with more data, our confidence in our statistic increases.


Complications of the Problem Statement

So far, so good: we may have had to look up the standard error formula in a book, but we have established some sense of the certainty of our summary. What if we didn't have such a reference, though? The median, for example, has no such simple formula to establish its certainty. (Actually, I believe there is a formula for the median, but it is a real bear!) Anyway, there certainly are other measures which we may calculate (even ones which we invent on the spot) for which there are no handy standard error formulas. What to do?


An Alternative Solution: The Bootstrap

Just as we are about to throw up our hands and consider another career, the bootstrap appears. The basic method of the bootstrap is simple: draw many samples with replacement from the original sample (the "replicates"), and tabulate the summary statistic calculated on each of those replicate samples. The distribution of those replicated summaries is intended to mimic the distribution which the standard error of the mean parameterizes.

Above, I mentioned that the population mean would be found inside the band from the sample mean minus two times the standard error to the sample mean plus two times the standard error about 95% of the time. The equivalent interval in our bootstrap process lies between the 2.5 and 97.5 percentiles of our replicate summaries. We use 2.5 and 97.5 because that leaves a total of 5% outside of the range, half at each end.

An example using the median will illustrate this process. For reference, let's calculate the sample median first:

>> median(X)

ans =

8


Drawing a single sample with replacement can be done in MATLAB by indexing using random integers:

>> RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)))

RandomSampleWithReplacement =

5
8
1
1
10
10
9
9
5
1


This is our first bootstrap replicate. Now, we calculate our summary on this replicate:

>> median(RandomSampleWithReplacement)

ans =

6.5000


Discerning the distribution, though, requires many more replicates. Since the computer is doing all of the work, I generally like to run at least 2000 replicates to give the bootstrap distribution a chance to take shape:

rand('twister',1242)   % Seed the random number generator for repeatability
T = NaN(2000,1);       % Allocate space for the replicated summaries

for i = 1:2000         % The machine's doing the work, so why not?
    RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)));   % Draw a sample with replacement
    T(i) = median(RandomSampleWithReplacement);                             % Calculate the replicated summary
end



Now, estimating where the "real" median (the population median) is likely to be is a simple matter of checking percentiles of our replicated summaries. I have the Statistics Toolbox, so I will cheat by using a function from there:

>> prctile(T,[2.5 97.5])

ans =

3.5000 10.0000


So, our population median is likely to lie between 3.5 and 10. That is a pretty wide range, but this is the consequence of having so little data.
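
As an aside, the Statistics Toolbox also provides bootstrp, which wraps the resampling loop above into a single call. A small sketch (the exact interval will differ a bit from run to run unless the random state is fixed):

T = bootstrp(2000, @median, X);   % 2000 bootstrap replicates of the median
prctile(T, [2.5 97.5])            % percentile interval, as before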


Wrap-Up

The fundamental trade-off of the bootstrap is that one forsakes pat statistical formulas in favor of strenuous computation. In summary:

Good:

-The bootstrap solves many problems not amenable to conventional methods.

-Even in cases where conventional solutions exist, the bootstrap requires no memorization or selection of the correct formula for a given situation.

Bad:

-The bootstrap requires considerable numerical computation. Of course, in an era of cheap and powerful computing machinery, this is much less of an issue. Still, if there are many of these to perform...

-The bootstrap presented in this article, the percentile bootstrap, is known to deviate from theoretically correct answers, though generally in a small way. There are more sophisticated bootstrap procedures which address some of these concerns (one is sketched below).
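
One such refinement is the bias-corrected and accelerated (BCa) interval, available through the Statistics Toolbox function bootci. A minimal sketch, assuming that function is present in your installation:

ci = bootci(2000, {@median, X}, 'type', 'bca')   % BCa confidence interval for the median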

This process, owing to its very general nature, can be applied to tasks much more complex than estimating the uncertainty of statistical summaries, such as hypothesis testing and predictive model performance evaluation.


Further Reading

An Introduction to the Bootstrap by Efron and Tibshirani (ISBN 0-412-04231-2)
This is the seminal work in this field. It covers a lot of ground, but is a bit mathematical.