
Sunday, March 15, 2009

Logistic Regression

Introduction

Often, the analyst is required to construct a model which estimates probabilities. This is common in many fields: medical diagnosis (probability of recovery, relapse, etc.), credit scoring (probability of a loan being repaid), sports (probability of a team beating a competitor- wait... maybe that belongs in the "investment" category?).

Many people are familiar with linear regression- why not just use that? There are several good reasons not to, but probably the most obvious is that a linear model is unbounded: its predictions will eventually fall below 0.0 and rise above 1.0, yielding answers which do not make sense as probabilities.

Many different classification models have been devised which estimate the probability of class membership, such as linear and quadratic discriminant analysis, neural networks and tree induction. The technique covered in this article is logistic regression- one of the simplest modeling procedures.


Logistic Regression

Logistic regression is a member of the family of methods called generalized linear models ("GLM"). Such models include a linear part followed by some "link function". If you are familiar with neural networks, think of "transfer functions" or "squashing functions". So, the linear function of the predictor variables is calculated, and the result of this calculation is run through the link function. In the case of logistic regression, the linear result is run through a logistic function (see figure 1), which rises monotonically from 0.0 (at negative infinity) to 1.0 (at positive infinity), passing through 0.5 when its input is exactly zero. Among other desirable properties, note that the logistic function only returns values between 0.0 and 1.0. Other GLMs operate similarly, but employ different link functions- some of which are also bounded by 0.0 and 1.0, and some of which are not.



Figure 1: The Most Interesting Part of the Logistic Function
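
As a quick numerical check of that shape- plain MATLAB arithmetic, no toolbox required- the logistic function can be evaluated at a few points:

>> x = [-10 -1 0 1 10];
>> 1 ./ (1 + exp(-x))

ans =

    0.0000    0.2689    0.5000    0.7311    1.0000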


While calculating the optimal coefficients of a least-squares linear regression has a direct, closed-form solution, this is not the case for logistic regression. Instead, some iterative fitting procedure is needed, in which successive "guesses" at the right coefficients are incrementally improved. Again, if you are familiar with neural networks, this is much like the various training rules used with the simplest "single neuron" models. Hopefully, you are lucky enough to have a routine handy to perform this process for you, such as glmfit, from the Statistics Toolbox.
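
For the curious, here is a rough sketch of one such iterative procedure (iteratively reweighted least squares, which amounts to Newton's method). It is purely illustrative- it omits the convergence checks and numerical safeguards which glmfit handles for you- and assumes a predictor matrix X and a 0/1 target column vector Y:

% Illustrative IRLS sketch for logistic regression (no safeguards)
Xd = [ones(size(X,1),1) X];                    % design matrix with a constant term
b  = zeros(size(Xd,2),1);                      % initial guess at the coefficients
for iter = 1:25                                % fixed iteration count, for simplicity
    p = 1 ./ (1 + exp(-Xd*b));                 % current probability estimates
    W = diag(p .* (1 - p));                    % weights from the binomial variance
    b = b + (Xd' * W * Xd) \ (Xd' * (Y - p));  % Newton-Raphson update
end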


glmfit

The glmfit function is easy to apply. The syntax for logistic regression is:

B = glmfit(X, [Y N], 'binomial', 'link', 'logit');

B will contain the discovered coefficients for the linear portion of the logistic regression (the link function has no coefficients). X contains the predictor data, with examples in rows and variables in columns. Y contains the target variable, usually a 0 or a 1 representing the outcome. Last, the variable N contains the count of events for each row of the example data- most often, this will be a column of 1s, the same size as Y. The count parameter, N, will be set to values greater than 1 for grouped data. As an example, think of medical cases summarized by country: each country will have averaged input values, an outcome which is a rate (between 0.0 and 1.0), and the count of cases from that country. When the counts are greater than one, the target variable passed to glmfit is the count of target class observations in each group, rather than a 0 or 1 (a sketch of this grouped form appears below).
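
As an aside, a sketch of the grouped form might look like the following (the groups, values and counts here are made up purely for illustration):

Xg = [0.2; 1.1; 2.0];                            % one averaged predictor per group
S  = [1; 6; 9];                                  % target class observations in each group
N  = [10; 10; 10];                               % total cases in each group
Bg = glmfit(Xg, [S N], 'binomial', 'link', 'logit');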

Here is a very small example:

>> X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]';
>> Y = [0 0 1 0 0 0 1 1 1 1]';
>> B = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'logit')

B =

-3.4932
2.9402


The first element of B is the constant term, and the second element is the coefficient for the lone input variable. We apply the linear part of this logistic regression thus:

>> Z = B(1) + X * (B(2))

Z =

-3.4932
-3.1992
-1.4350
-0.5530
-0.2589
0.3291
0.6231
1.5052
2.6813
2.9753


To finish, we apply the logistic function to the output of the linear part:

>> Z = Logistic(B(1) + X * (B(2)))

Z =

0.0295
0.0392
0.1923
0.3652
0.4356
0.5815
0.6509
0.8183
0.9359
0.9514


Despite the simplicity of the logistic function, I built it into a small function, Logistic, so that I wouldn't have to repeatedly write out the formula:

% Logistic: calculates the logistic function of the input
% by Will Dwinnell
%
% Last modified: Sep-02-2006

function Output = Logistic(Input)

Output = 1 ./ (1 + exp(-Input));


% EOF



Conclusion

Though it is structurally very simple, logistic regression still finds wide use today in many fields. It is quick to fit, the discovered model is easy to implement, and recall is fast. Frequently, it yields better performance than competing, more complex techniques. I recently built a logistic regression model which beat out a neural network, decision trees and two types of discriminant analysis. If nothing else, it is worth fitting a simple model such as logistic regression early in a modeling project, just to establish a performance benchmark.

Logistic regression is closely related to another GLM procedure, probit regression, which differs only in its link function (specified in glmfit by replacing 'logit' with 'probit'). I believe that probit regression has been losing popularity since its results are typically very similar to those from logistic regression, but the formula for the logistic link function is simpler than that of the probit link function.
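
For reference, the only change needed in the glmfit call for probit regression is the link name:

B = glmfit(X, [Y N], 'binomial', 'link', 'probit');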



References

Generalized Linear Models, by McCullagh and Nelder (ISBN-13: 978-0412317606)


See Also

The Apr-21-2007 posting, Linear Regression in MATLAB; the Feb-16-2010 posting, Single Neuron Training: The Delta Rule; and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).

Friday, March 13, 2009

MATLAB 2009a

MATLAB is on the move. Release 2009a brings a number of changes. The behavior of the random number generators in the base product had already begun to change as of the last release, if you hadn't noticed, and several functions (min, max, sum and prod, as well as several of the FFT functions) are now multi-threaded. This release also witnesses several changes to the analytical toolboxes. Among others...


In the Statistics Toolbox...

A Naïve Bayes modeling tool has been added. This is a completely different way of modeling than the other techniques in the Statistics Toolbox. Obviously, the more diverse the set of modeling tools, the better.
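
A minimal sketch of how this might be called (assuming the NaiveBayes class interface; consult the release documentation for the exact syntax):

nb   = NaiveBayes.fit(X, Y);     % fit a Naive Bayes classifier (assumed interface)
Yhat = predict(nb, X);           % predicted class labels for new observations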

Data table joining has been enhanced several ways, including the ability to use multiple keys and different types of joins (inner, outer, etc.).

A number of changes have been made to the tree induction facility (classregtree), including a fix to the quirky splitmin parameter. Now the programmer can specify the minimum number of cases per leaf node (see the sketch below), which seems like a better way to control decision tree growth.
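
For example, something along these lines (the minleaf parameter name is my assumption- check the classregtree documentation):

t = classregtree(X, Y, 'minleaf', 10);   % require at least 10 cases per leaf (assumed parameter name)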

There are also new options for model ensembles and performance curve summaries.



In the Curve Fitting Toolbox...

Yippee! There are now functions for surface fitting (functions fit to 2 inputs, instead of just 1). Both interactive and programmatic fitting are available.


In the Parallel Computing Toolbox...

The maximum number of local workers has been increased from 4 to 8.

Sunday, March 16, 2008

MATLAB 2008a Released

While not on disk yet, MATLAB release 2008a (MATLAB 7.6) is available for download from the MathWorks for licensed users. This release brings news on several fronts:

The Statistics Toolbox has seen a number of interesting additions, including quasirandom number generators, (sequential) feature selection, and cross-validation for modeling functions.

Another change which may be of interest to readers of this log is the upgrade of the Distributed Computing Toolbox, now named the Parallel Computing Toolbox. This Toolbox permits (with slight restructuring of MATLAB code) the distribution of the computational workload over multiple processors or processor cores. With more and more multi-core computers being sold every month, this offers the opportunity to greatly accelerate MATLAB code execution (think in terms of multiples of performance, not mere percentages!) to a very broad audience. In the ideal case, code which parallelizes well can run nearly twice as fast on a dual-core machine, and nearly four times as fast on a quad-core machine (a rough sketch of the restructuring appears below).
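
As a rough sketch of the sort of restructuring involved- using the parfor construct, and assuming a pool of workers has already been opened (via matlabpool or its successors); the per-iteration function is a hypothetical stand-in:

results = zeros(1000,1);
parfor i = 1:1000                              % iterations are distributed over the worker pool
    results(i) = expensive_computation(i);     % expensive_computation is a hypothetical stand-in
end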

On a non-technical note, MATLAB has finally moved beyond software keys to on-line authentication for software installation. Is this good or bad? Both, I suppose. I'm sure that the MathWorks experiences its share of software piracy, so this move is understandable. It's also worth mentioning that MATLAB licensing is still very casual, with "one user" licensing still permitting installation on multiple machines (such as work and home), with the understanding that only one licensed person will use the software at a time.

Friday, April 13, 2007

Basic Summary Statistics in MATLAB

This posting covers basic summary statistics in MATLAB.

First, note that MATLAB has a strong array-orientation, so data sets to be analyzed are most often stored as a matrix of values. Note that the convention in MATLAB is for variables to be stored in columns, and observations to be stored in rows. This is not a hard-and-fast rule, but it is much more common than the alternative (variables in rows, observations in columns). Besides, most MATLAB routines (whether from the MathWorks or elsewhere) assume this convention.

Basic summaries are easy to obtain from MATLAB. For the examples below, the following matrix of data, A, will be used (No, it's not very exciting, but it will do for our purposes):


>> A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1]

A =

1 2 3 4
-1 10 8 5
9 8 7 0
0 0 0 1


MATLAB matrices are indexed as: MatrixName(row,column):


>> A(2,1)

ans =

-1


Common statistical summaries are available in MATLAB, such as: mean (arithmetic mean), median (median), min (minimum value), max (maximum value) and std (standard deviation). Their use is illustrated below:


>> mean(A)

ans =

2.2500 5.0000 4.5000 2.5000

>> median(A)

ans =

0.5000 5.0000 5.0000 2.5000

>> min(A)

ans =

-1 0 0 0

>> max(A)

ans =

9 10 8 5

>> std(A)

ans =

4.5735 4.7610 3.6968 2.3805


Note that each of these functions operates along the columns, yielding one summary per column, stored in a row vector. Sometimes it is desirable to calculate along the rows instead. Some routines can be redirected by an additional parameter, like this:


>> mean(A,2)

ans =

2.5000
5.5000
6.0000
0.2500


The above calculates the arithmetic mean of each row, storing the results in a column vector. The second parameter to mean, if it is specified, indicates the dimension along which mean is to operate.

For routines without this capability, the data matrix may be transposed (rows become columns and columns become rows) using the apostrophe operator while feeding it to the function:


>> mean(A')

ans =

2.5000 5.5000 6.0000 0.2500


Note that, this time, the result is stored in a row vector.

The colon operator, :, can be used to dump all of the contents of an array into one giant column vector. The result of this operation can then be fed to any of our summary routines:


>> A(:)

ans =

1
-1
9
0
2
10
8
0
3
8
7
0
4
5
0
1

>> mean(A(:))

ans =

3.5625


The reader will find more information on summary routines in base MATLAB through:

help datafun


The MATLAB Statistics Toolbox

MATLAB users lucky enough to own the Statistics Toolbox will have available still more summaries, such as iqr (inter-quartile range), trimmean (trimmed mean) and geomean (geometric mean). Also, there are extended versions of several summary functions, such as nanmean and nanmax, which ignore NaN (IEEE floating point "not-a-number") values- the values commonly used to represent missing data in MATLAB.
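
To see the difference, consider a vector containing a missing value:

>> V = [1 NaN 3];
>> mean(V)

ans =

   NaN

>> nanmean(V)

ans =

     2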

To learn more, see the "Descriptive Statistics" section when using:

help stats

Saturday, March 03, 2007

MATLAB 2007a Released

The latest version of MATLAB, 2007a, has been released. While some changes to base MATLAB are of interest to data miners (multi-threading, in particular), owners of the Statistics Toolbox receive a number of new features in this major upgrade.

First, the Statistics Toolbox makes new data structures available for categorical data (categorical arrays) and mixed-type data (dataset arrays). Most MATLAB users performing statistical analysis or data mining tend to store their data in numerical matrices (my preference) or cell arrays. Using ordinary matrices requires the programmer/analyst to manage things like variable names. Cell arrays deal with the variable name issue, but preclude some of the nice things about using MATLAB matrices. Hopefully these new structures make statistical analysis in MATLAB more natural.

Second, the Statistics Toolbox updates the classify function, permitting it to output the discovered discriminant coefficients (at last!). I have been complaining about this for a long time. Why? Because classify provides quadratic discriminant analysis (QDA), an important non-linear modeling algorithm. Without the coefficients, though, it is impossible to deliver models to other (admittedly inferior to MATLAB) platforms.
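
A hedged sketch of the updated call (the fifth output holds the new coefficients, if I have the signature right; sample, training and group are placeholders for your own data):

[class, err, posterior, logp, coeff] = classify(sample, training, group, 'quadratic');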

Also of note: the Genetic Algorithm and Direct Search Toolbox now includes simulated annealing.

More information on 2007a is available at The MathWorks.