Sunday, March 15, 2009

Logistic Regression

Introduction

Often, the analyst is required to construct a model which estimates probabilities. This is common in many fields: medical diagnosis (probability of recovery, relapse, etc.), credit scoring (probability of a loan being repaid), sports (probability of a team beating a competitor- wait... maybe that belongs in the "investment" category?).

Many people are familiar with linear regression- why not just use that? There are several good reasons not to do this, but probably the most obvious is that linear models are unbounded: given extreme enough inputs, they will fall below 0.0 or poke out above 1.0, yielding answers which do not make sense as probabilities.
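To make the problem concrete, here is a minimal sketch which fits an ordinary straight line to a small 0/1 target (the same ten observations used in the glmfit example in this article) and evaluates it at a few inputs. The variable names C, XNew and YHat are only illustrative, and polyfit is used here simply as a convenient least-squares line fit.

```matlab
% Example data: one predictor and a 0/1 outcome
X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]';
Y = [0 0 1 0 0 0 1 1 1 1]';

% Ordinary least-squares straight line: Y ~ C(1) * X + C(2)
C = polyfit(X, Y, 1);

% Evaluate the line at several inputs, some outside the observed range
XNew = [-1 0 1 2 4]';
YHat = polyval(C, XNew)
```

The fitted values at the low end come out negative and the value at XNew = 4 exceeds 1.0, neither of which can be read as a probability.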

Many different classification models have been devised which estimate the probability of class membership, such as linear and quadratic discriminant analysis, neural networks and tree induction. The technique covered in this article is logistic regression- one of the simplest modeling procedures.

Logistic Regression

Logistic regression is a member of the family of methods called generalized linear models ("GLM"). Such models include a linear part followed by some "link function". If you are familiar with neural networks, think of "transfer functions" or "squashing functions". So, the linear function of the predictor variables is calculated, and the result of this calculation is run through the link function. (Strictly speaking, GLM terminology gives the name "link function" to the inverse of this transformation, which is why glmfit is asked for the 'logit' link rather than the logistic function; this article uses the term loosely for the squashing step.) In the case of logistic regression, the linear result is run through a logistic function (see figure 1), which starts at 0.0 (at negative infinity) and rises monotonically to 1.0 (at positive infinity). Along the way, it is 0.5 when the input value is exactly zero. Among other desirable properties, note that this logistic function only returns values between 0.0 and 1.0. Other GLMs operate similarly, but employ different link functions- some of which are also bounded by 0.0 and 1.0, and some of which are not.

Figure 1: The Most Interesting Part of the Logistic Function
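The properties just described are easy to check numerically. The following is a small sketch using an anonymous function (the name LogisticFn is only illustrative) evaluated at a handful of inputs, including the two infinities:

```matlab
% Logistic function written as an anonymous function
LogisticFn = @(Z) 1 ./ (1 + exp(-Z));

% Evaluate at the extremes, at zero, and at a few points in between
Z = [-Inf -6 -2 0 2 6 Inf];
P = LogisticFn(Z)
```

The result starts at 0, passes through exactly 0.5 at an input of zero, ends at 1, and increases at every step along the way.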

While calculating the optimal coefficients of a least-squares linear regression has a direct, closed-form solution, this is not the case for logistic regression. Instead, some iterative fitting procedure is needed, in which successive "guesses" at the right coefficients are incrementally improved. Again, if you are familiar with neural networks, this is much like the various training rules used with the simplest "single neuron" models. Hopefully, you are lucky enough to have a routine handy to perform this process for you, such as glmfit, from the Statistics Toolbox.
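For readers curious about what such an iterative procedure might look like, below is a minimal sketch of one common choice, Newton's method applied to the log-likelihood (often described as iteratively reweighted least squares). This is an illustration under simple assumptions (a small, well-behaved data set, a zero starting guess, and a fixed cap of 25 iterations), not a description of what glmfit does internally. The data are the ten observations used in the glmfit example in this article.

```matlab
% Example data: one predictor and a 0/1 outcome
X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]';
Y = [0 0 1 0 0 0 1 1 1 1]';

A = [ones(size(X, 1), 1) X];    % design matrix: constant term plus predictor
B = zeros(size(A, 2), 1);       % initial guess at the coefficients

for Iteration = 1:25
    P = 1 ./ (1 + exp(-A * B)); % current predicted probabilities
    W = P .* (1 - P);           % per-row weights
    G = A' * (Y - P);           % gradient of the log-likelihood
    H = A' * diag(W) * A;       % negative Hessian of the log-likelihood
    Step = H \ G;               % Newton step
    B = B + Step;               % improved guess
    if max(abs(Step)) < 1e-8    % stop once the guesses settle down
        break;
    end
end

B
```

On this data the loop settles within a handful of iterations, and the coefficients agree (to the displayed precision) with those reported by glmfit for the same data.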

glmfit

The glmfit function is easy to apply. The syntax for logistic regression is:

B = glmfit(X, [Y N], 'binomial', 'link', 'logit');

B will contain the discovered coefficients for the linear portion of the logistic regression (the link function has no coefficients). X contains the predictor data, with examples in rows, variables in columns. Y contains the target variable, usually a 0 or a 1 representing the outcome. Last, the variable N contains the count of cases represented by each row of the example data- most often, this will be a column of 1s, the same size as Y. The count parameter, N, will be set to values greater than 1 for grouped data. As an example, think of medical cases summarized by country: each country will have averaged input values, the count of cases from that country (N), and the count of those cases which belong to the target class (Y), so that Y divided by N is the observed rate (between 0.0 and 1.0).
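As a sketch of the grouped form, the following collapses some made-up individual records (the values of Xi and Yi are invented purely for illustration) into one row per distinct predictor value, producing the two response columns described above. Only base MATLAB functions are used for the grouping; the glmfit call itself is left as a comment, since it requires the Statistics Toolbox.

```matlab
% Hypothetical individual-level records: one predictor, 0/1 outcome
Xi = [1 1 1 2 2 2 2 3 3 3]';
Yi = [0 0 1 0 1 1 0 1 1 1]';

% Collapse rows which share the same predictor value
[Xg, ~, Group] = unique(Xi);   % Xg: one row per distinct predictor value
Yg = accumarray(Group, Yi);    % count of target-class (1) outcomes per group
Ng = accumarray(Group, 1);     % count of cases per group

% Grouped call: first response column is the count of 1s,
% second response column is the count of cases
% B = glmfit(Xg, [Yg Ng], 'binomial', 'link', 'logit');
```

Here the three groups contain 3, 4 and 3 cases, of which 1, 2 and 3 respectively belong to the target class.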

Here is a very small example:

>> X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]';
>> Y = [0 0 1 0 0 0 1 1 1 1]';
>> B = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'logit')

B =

-3.4932
2.9402

The first element of B is the constant term, and the second element is the coefficient for the lone input variable. We apply the linear part of this logistic regression thus:

>> Z = B(1) + X * (B(2))

Z =

-3.4932
-3.1992
-1.4350
-0.5530
-0.2589
0.3291
0.6231
1.5052
2.6813
2.9753

To finish, we apply the logistic function to the output of the linear part:

>> Z = Logistic(B(1) + X * (B(2)))

Z =

0.0295
0.0392
0.1923
0.3652
0.4356
0.5815
0.6509
0.8183
0.9359
0.9514

Despite the simplicity of the logistic function, I built it into a small function, Logistic, so that I wouldn't have to repeatedly write out the formula:

% Logistic: calculates the logistic function of the input
% by Will Dwinnell
%

function Output = Logistic(Input)

Output = 1 ./ (1 + exp(-Input));

% EOF

Conclusion

Though it is structurally very simple, logistic regression still finds wide use today in many fields. It is quick to fit, and the discovered model is easy to implement and quick to execute on new data. Frequently, it yields better performance than competing, more complex techniques. I recently built a logistic regression model which beat out a neural network, decision trees and two types of discriminant analysis. If nothing else, it is worth fitting a simple model such as logistic regression early in a modeling project, just to establish a performance benchmark for the project.

Logistic regression is closely related to another GLM procedure, probit regression, which differs only in its link function (specified in glmfit by replacing 'logit' with 'probit'). I believe that probit regression has been losing popularity since its results are typically very similar to those from logistic regression, but the formula for the logistic link function is simpler than that of the probit link function.
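To give a rough sense of how similar the two curves are, here is a small sketch comparing the logistic function with the standard normal cumulative distribution (the curve used by probit). Two assumptions are worth flagging: the input to the logistic function is rescaled by 1.702, a commonly quoted constant for lining the two curves up, and the normal distribution is computed from erfc so that no toolbox is needed.

```matlab
Z = (-6:0.01:6)';

% Logistic curve, with its input rescaled by 1.702 (assumed constant)
LogisticP = 1 ./ (1 + exp(-1.702 * Z));

% Standard normal cumulative distribution, via the complementary error function
ProbitP = 0.5 * erfc(-Z / sqrt(2));

% Largest vertical gap between the two curves over this range
MaxGap = max(abs(LogisticP - ProbitP))
```

The largest gap is below 0.01 in probability, which is consistent with the two procedures usually giving very similar fitted probabilities (though the coefficients themselves are on different scales).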

References

Generalized Linear Models, by McCullagh and Nelder (ISBN-13: 978-0412317606)

The Apr-21-2007 posting, Linear Regression in MATLAB, the Feb-16-2010 posting, Single Neuron Training: The Delta Rule and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).

Kees said...

I never understood why there is no support for class/level variables in Generalized Linear Model estimation (glmfit function). I currently resort to SAS for doing this type of analysis.
There is a MATLAB toolbox called GEEQBOX (http://www.jstatsoft.org/v25/i14/paper), which at least has the possibility to account for repeated measurements (GEE), but still it accepts only numeric data in the predictor variables.

Will Dwinnell said...

That's a good question.

I'd like to see support (even if it required using multiple dummy variables) for multi-class logistic regression in the Statistics Toolbox.

Dean Abbott said...

Back 20 years ago, when I worked at Barron Associates, Inc. in Virginia, Andrew Barron created a simple way to build multiple logistic regression models (M-1 models, where M is the number of levels of the target variable), and then compute the probability of each outcome. The Mth probability is just 1 - SUM(all other probs). So in Matlab, I think this would just require an outer loop and then scaling on the back end.

By the way, these were fun times--John Elder was there at Barron Associates, as was Paul Hess (co-founder of AbTech Corp.).

eyeballjunk said...

And what about likelihood estimates? Chi^2? AIC?

CTraft said...

Is there a way to perform multiple logistic regression in Matlab, using multiple, maybe 3 or 4 variables at a time?

eyeballjunk said...

Ctraft: I believe you're referring to multinomial logistic regression, which is a separate function.

Dosn said...

Hey there,
thanks a lot for your blog entry, it really helped me a lot!

Anonymous said...

Hi, I am really new in this area and I just started to learn MATLAB yesterday in order to perform logistic regression. I don't understand how to define the function glmfit. Could you please help me with that?

Will Dwinnell said...

glmfit is provided in the Statistics Toolbox, which is an add-on product from MATLAB's vendor, the MathWorks.

At the MATLAB command line, try typing help glmfit. If MATLAB doesn't know what you're talking about, an error message will come back because you don't have this Toolbox installed.

msuzen said...

Z can also be obtained via :
>> glmval(B, X, 'logit')

vinkal vishnoi said...

Thanks for such a post...

ml learner said...

This article helped me to understand some of the basics of logistic regression, but I am confused about one thing: through glmfit, a model is found that fits the data, am I right?

And the next question: I am planning to use this as a classifier, but I am not sure how I can test my examples, i.e. how would the classifier predict the class? Is there any Matlab function available for this?

Regards,

Anonymous said...

Is there a simple way of determining the inflection point? I have been searching for this all day...

Anonymous said...

Hi,

I was just wondering how can you get the accuracy of the model.
Thank you!

Anonymous said...

Will, is there a benefit of logistic regression over LDA? How do you know which one to use?

Truth said...

The last statement

Z = Logistic(B(1) + X * (B(2)))

produces an error:

Undefined function 'Logistic' for input arguments of type 'double'.

Hi. I will use logistic regression, but some of my features are not numeric, only categorical. How can I use this classifier with these features? Some features look like: device_ip=0f201fe02

genetixx01 said...

Hello Mr. Dwinnell,
Thank you very much for your article. I'm trying to perform logistic regression on financial data in the context of my master's degree and am still not very good with Matlab. I would like to know if it is possible to get all stats from a regression to assess the accuracy and validity of the model. I know the function "regstats" but it puts results in variables in the workspace. Is it possible to get an output with all main stats in a single window to quickly assess the validity of the model?

Thank you very much.
Gabriel

Iris said...

Thanks so much for the easy to understand explanations on fitting a logistic regression to the data! I will be checking your blog more often now for any matlab questions:)

Dimitri Huwyler said...

What is the best way in matlab to do a logistic regression with more than 2 outcome categories?

Anonymous said...

For those asking, the MATLAB mnrfit function performs multinomial logistic regression.
