Often, the analyst is required to construct a model which estimates probabilities. This is common in many fields: medical diagnosis (probability of recovery, relapse, etc.), credit scoring (probability of a loan being repaid), sports (probability of a team beating a competitor- wait... maybe that belongs in the "investment" category?).
Many people are familiar with linear regression- why not just use that? There are several good reasons not to do this, but probably the most obvious is that linear models will always fall below 0.0 and poke out above 1.0, yielding answers which do not make sense as probabilities.
Many different classification models have been devised which estimate the probability of class membership, such as linear and quadratic discriminant analysis, neural networks and tree induction. The technique covered in this article is logistic regression- one of the simplest modeling procedures.
Logistic regression is a member of the family of methods called generalized linear models ("GLM"). Such models include a linear part followed by some "link function". If you are familiar with neural networks, think of "transfer functions" or "squashing functions". So, the linear function of the predictor variables is calculated, and the result of this calculation is run through the link function. In the case of logistic regression, the linear result is run through a logistic function (see figure 1), which runs from 0.0 (at negative infinity), rises monotonically to 1.0 (at positive infinity). Along the way, it is 0.5 when the input value is exactly zero. Among other desirable properties, note that this logistic function only returns values between 0.0 and 1.0. Other GLMs operate similarly, but employ different link functions- some of which are also bound by 0.0 - 1.0, and some of which are not.
Figure 1: The Most Interesting Part of the Logistic Function (Click figure to enlarge)
While calculating the optimal coefficients of a least-squares linear regression has a direct, closed-form solution, this is not the case for logistic regression. Instead, some iterative fitting procedure is needed, in which successive "guesses" at the right coefficients are incrementally improved. Again, if you are familiar with neural networks, this is much like the various training rules used with the simplest "single neuron" models. Hopefully, you are lucky enough to have a routine handy to perform this process for you, such as glmfit, from the Statistics Toolbox.
The glmfit function is easy to apply. The syntax for logistic regression is:
B = glmfit(X, [Y N], 'binomial', 'link', 'logit');
B will contain the discovered coefficients for the linear portion of the logistic regression (the link function has no coefficients). X contains the pedictor data, with examples in rows, variables in columns. Y contains the target variable, usually a 0 or a 1 representing the outcome. Last, the variable N contains the count of events for each row of the example data- most often, this will be a columns of 1s, the same size as Y. The count parameter, N, will be set to values greater than 1 for grouped data. As an example, think of medical cases summarized by country: each country will have averaged input values, an outcome which is a rate (between 0.0 and 1.0), and the count of cases from that country. In the event that the counts are greater than one, then the target variable represents the count of target class observations.
Here is a very small example:
>> X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]';
>> Y = [0 0 1 0 0 0 1 1 1 1]';
>> B = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'logit')
The first element of B is the constant term, and the second element is the coefficient for the lone input variable. We apply the linear part of this logistic regression thus:
>> Z = B(1) + X * (B(2))
To finish, we apply the logistic function to the output of the linear part:
>> Z = Logistic(B(1) + X * (B(2)))
Despite the simplicity of the logistic function, I built it into a small function, Logistic, so that I wouldn't have to repeatedly write out the formula:
% Logistic: calculates the logistic function of the input
% by Will Dwinnell
% Last modified: Sep-02-2006
function Output = Logistic(Input)
Output = 1 ./ (1 + exp(-Input));
Though it is structurally very simple, logistic regression still finds wide use today in many fields. It is quick to fit, easy to implement the discovered model and quick to recall. Frequently, it yields better performance than competing, more complex techniques. I recently built a logistic regression model which beat out a neural network, decision trees and two types of discriminant analysis. If nothing else, it is worth fitting a simple model such as logistic regression early in a modeling project, just to establish a performance benchmark for the project.
Logistic regression is closely related to another GLM procedure, probit regression, which differs only in its link function (specified in glmfit by replacing 'logit' with 'probit'). I believe that probit regression has been losing popularity since its results are typically very similar to those from logistic regression, but the formula for the logistic link function is simpler than that of the probit link function.
Generalized Linear Models, by McCullagh and Nelder (ISBN-13: 978-0412317606)
The Apr-21-2007 posting, Linear Regression in MATLAB, the Feb-16-2010 posting, Single Neuron Training: The Delta Rule and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).