Overview
Linear discriminant analysis (LDA) is one of the oldest mechanical classification systems, dating back to statistical pioneer Ronald Fisher, whose original 1936 paper on the subject, The Use of Multiple Measurements in Taxonomic Problems, can be found online.
The basic idea of LDA is simple: for each class to be identified, calculate a (different) linear function of the attributes. The class function yielding the highest score represents the predicted class.
There are many linear classification models, and they differ largely in how the coefficients are established. One nice quality of LDA is that, unlike some of the alternatives, it does not require multiple passes over the data for optimization. Also, it naturally handles problems with more than two classes and it can provide probability estimates for each of the candidate classes.
Some analysts attempt to interpret the signs and magnitudes of the coefficients of the linear scores, but this can be tricky, especially when the number of classes is greater than 2.
LDA bears some resemblance to principal components analysis (PCA), in that a number of linear functions are produced (using all raw variables), which are intended, in some sense, to provide data reduction through rearrangement of information. (See the Feb-26-2010 posting to this log, Principal Components Analysis.) Note, though, some important differences: First, the objective of LDA is to maximize class discrimination, whereas the objective of PCA is to squeeze variance into as few components as possible. Second, LDA produces exactly as many linear functions as there are classes, whereas PCA produces as many linear functions as there are original variables. Last, principal components are always orthogonal to each other ("uncorrelated"), while that is not generally true for LDA's linear scores.
An Implementation
I have made available on MATLAB Central a routine, aptly named LDA, which performs all the necessary calculations. I'd like to thank Deniz Seviş, whose prompting got me to finally write this code (with her) and whose collaboration is very much appreciated.
Note that the LDA function assumes that the data it is being fed is complete (no missing values) and performs no attribute selection. Also, it requires only base MATLAB (no toolboxes needed).
Use of LDA is straightforward: the programmer supplies the input and target variables and, optionally, prior probabilities. The function returns the fitted linear discriminant coefficients. help LDA provides a good example:
% Generate example data: 2 groups, of 10 and 15, respectively
X = [randn(10,2); randn(15,2) + 1.5]; Y = [zeros(10,1); ones(15,1)];
% Calculate linear discriminant coefficients
W = LDA(X,Y);
This example randomly generates an artificial data set of two classes (labeled 0 and 1) and two input variables. The LDA function fits linear discriminants to the data, and stores the result in W. So, what is in W? Let's take a look:
>> W
W =
-1.1997 0.2182 0.6110
-2.0697 0.4660 1.4718
The first row contains the coefficients for the linear score associated with the first class (this routine orders the linear functions the same way as unique()). In this model, -1.1997 is the constant and 0.2182 and 0.6110 are the coefficients for the input variables for the first class (class 0). Coefficients for the second class's linear function are in the second row. Calculating the linear scores is easy:
% Calculate linear scores for training data
L = [ones(25,1) X] * W';
Each column represents the output of the linear score for one class. In this case, the first column is class 0, and the second column is class 1. For any given observation, the higher the linear score, the more likely that class. Note that LDA's linear scores are not probabilities, and may even assume negative values. Here are the values from my run:
>> L
L =
-1.9072 -3.8060
1.0547 3.2517
-1.2493 -2.0547
-1.0502 -1.7608
-0.6935 -0.8692
-1.6103 -2.9808
-1.3702 -2.4545
-0.2148 0.2825
0.4419 1.6717
0.2704 1.3067
1.0694 3.2670
-0.0207 0.7529
-0.2608 0.0601
1.2369 3.6135
-0.8951 -1.4542
0.2073 1.1687
0.0551 0.8204
0.1729 1.1654
0.2993 1.4344
-0.6562 -0.8028
0.2195 1.2068
-0.3070 0.0598
0.1944 1.2628
0.5354 2.0689
0.0795 1.0976
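Picking the predicted class for each case is then just a matter of taking the column with the largest linear score. One quick way to do this (not part of the LDA routine itself, and relying on the column order following unique(), so that column 1 is class 0):
% Predicted class = column with the largest linear score
[Junk, BestColumn] = max(L, [], 2);
PredictedClass = BestColumn - 1;                % column 1 -> class 0, column 2 -> class 1
TrainingAccuracy = mean(PredictedClass == Y)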
To obtain estimated probabilities, simply run the linear scores through the softmax transform (exponentiate everything, and normalize each row so that it sums to 1.0):
% Calculate class probabilities
P = exp(L) ./ repmat(sum(exp(L),2),[1 2]);
As we see, most of the first 10 cases exhibit higher probabilities for class 0 (the first column) than for class 1 (the second column) and the reverse is true for the last 15 cases:
>> P
P =
0.8697 0.1303
0.1000 0.9000
0.6911 0.3089
0.6705 0.3295
0.5438 0.4562
0.7975 0.2025
0.7473 0.2527
0.3782 0.6218
0.2262 0.7738
0.2619 0.7381
0.1000 0.9000
0.3157 0.6843
0.4205 0.5795
0.0850 0.9150
0.6363 0.3637
0.2766 0.7234
0.3175 0.6825
0.2704 0.7296
0.2432 0.7568
0.5366 0.4634
0.2714 0.7286
0.4093 0.5907
0.2557 0.7443
0.1775 0.8225
0.2654 0.7346
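One practical aside (a suggestion, not something the routine above requires): if the linear scores are ever large in magnitude, exp() can overflow. Subtracting each row's maximum score before exponentiating produces the same probabilities more safely:
% Numerically safer form of the same softmax calculation
LShift = L - repmat(max(L,[],2),[1 2]);
P = exp(LShift) ./ repmat(sum(exp(LShift),2),[1 2]);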
This model is not perfect, and would really need to be tested more rigorously (via holdout testing, k-fold cross validation, etc.) to determine how well it generalizes beyond the training data.
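As a rough illustration, a simple holdout check might look like the sketch below. With only 25 observations the accuracy estimate will be quite noisy, and the random split means results will vary from run to run:
% Crude holdout test: fit on roughly 70% of the data, score the rest
n = size(X,1);
Shuffled = randperm(n);
TrainIdx = Shuffled(1:round(0.7 * n));
TestIdx  = Shuffled(round(0.7 * n) + 1:end);
WTrain = LDA(X(TrainIdx,:), Y(TrainIdx));
LTest  = [ones(length(TestIdx),1) X(TestIdx,:)] * WTrain';
[Junk, BestColumn] = max(LTest, [], 2);
HoldoutAccuracy = mean((BestColumn - 1) == Y(TestIdx))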
I will not demonstrate its use here, but the LDA routine offers a facility for modifying the prior probabilities. Briefly, the function assumes that the true distribution of classes is whatever it observes in the training data. Analysts, however, may wish to adjust this distribution for several reasons, and the third, optional, parameter allows this. Note that the LDA routine presented here always performs the prior-probability adjustment; some statistical software drops the adjustment altogether when the user specifies that classes are equally likely, and will therefore produce results that differ from LDA's.
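For readers who do want to experiment with the priors, a hypothetical call specifying equal class probabilities might look like the line below; the exact form expected for the third argument should be confirmed against help LDA:
% Hypothetical: re-fit assuming the two classes are equally likely a priori
W2 = LDA(X, Y, [0.5 0.5]);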
Closing Thoughts
Though it employs a fairly simple model structure, LDA has held up reasonably well, sometimes still besting more complex algorithms. When its assumptions are met, the literature records it doing better than logistic regression. It is very fast to execute and fitted models are extremely portable; even a spreadsheet will support linear models (or, one supposes, paper and pencil!). LDA is at least worth trying at the beginning of a project, if for no other reason than to establish a lower bound on acceptable performance.
See Also
Feb-16-2010 posting, Single Neuron Training: The Delta Rule
Mar-15-2009 posting, Logistic Regression
Tuesday, February 16, 2010
Single Neuron Training: The Delta Rule
I have recently put together a routine, DeltaRule, to train a single artificial neuron using the delta rule. DeltaRule can be found at MATLAB Central.
This posting will not go into much detail, but this type of model is something like a logistic regression, where a linear model is calculated on the input variables, then passed through a squashing function (in this case the logistic curve). Such models are most often used to model binary outcomes, hence the dependent variable is normally composed of the values 0 and 1.
Single neurons with linear functions (with squashing functions or not) are only capable of separating classes that may be divided by a line (plane, hyperplane), yet they are often useful, either by themselves or in building more complex models.
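To make the mechanics concrete, here is a minimal sketch of a single incremental weight update of the sort the delta rule performs. This is an illustration only; the details inside DeltaRule (for instance, whether the logistic derivative appears in the update) may differ:
% One incremental delta-rule style update for a single logistic neuron
w = zeros(1,3);                               % [bias, weight for x1, weight for x2]
x = [0.5 -1.2];                               % one training example
t = 1;                                        % its 0/1 target
LearningRate = 0.1;
y = 1 ./ (1 + exp(-(w(1) + x * w(2:end)')));  % linear part, then logistic squash
w = w + LearningRate * (t - y) * [1 x];       % nudge the weights to reduce the error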
Use help DeltaRule for syntax and a simple example of its use.
Anyway, I thought readers might find this routine useful. It trains quickly and the code is straightforward (I think), making modification easy. Please write to let me know if you do anything interesting with it.
If you are already familiar with simple neural models like this one, here are the technical details:
Learning rule: incremental delta rule
Learning rate: constant
Transfer function: logistic
Exemplar presentation order: random, by training epoch
See also the Mar-15-2009 posting, Logistic Regression and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).
Sunday, March 15, 2009
Logistic Regression
Introduction
Often, the analyst is required to construct a model which estimates probabilities. This is common in many fields: medical diagnosis (probability of recovery, relapse, etc.), credit scoring (probability of a loan being repaid), sports (probability of a team beating a competitor- wait... maybe that belongs in the "investment" category?).
Many people are familiar with linear regression, so why not just use that? There are several good reasons not to do this, but probably the most obvious is that linear models will eventually fall below 0.0 and poke out above 1.0, yielding answers which do not make sense as probabilities.
Many different classification models have been devised which estimate the probability of class membership, such as linear and quadratic discriminant analysis, neural networks and tree induction. The technique covered in this article is logistic regression- one of the simplest modeling procedures.
Logistic Regression
Logistic regression is a member of the family of methods called generalized linear models ("GLM"). Such models include a linear part followed by some "link function". If you are familiar with neural networks, think of "transfer functions" or "squashing functions". So, the linear function of the predictor variables is calculated, and the result of this calculation is run through the link function. In the case of logistic regression, the linear result is run through a logistic function (see Figure 1), which rises monotonically from 0.0 (at negative infinity) to 1.0 (at positive infinity), passing through 0.5 when the input value is exactly zero. Among other desirable properties, note that the logistic function only returns values between 0.0 and 1.0. Other GLMs operate similarly, but employ different link functions, some of which are also bounded by 0.0 and 1.0, and some of which are not.

Figure 1: The Most Interesting Part of the Logistic Function
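The curve in Figure 1 is easy to reproduce; a few lines of MATLAB along these lines (illustrative code, not the original figure's) will draw it:
% The logistic function over a range of linear scores
z = -6:0.1:6;                  % linear scores
p = 1 ./ (1 + exp(-z));        % logistic function
plot(z,p), grid on
xlabel('linear score'), ylabel('estimated probability')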
While calculating the optimal coefficients of a least-squares linear regression has a direct, closed-form solution, this is not the case for logistic regression. Instead, some iterative fitting procedure is needed, in which successive "guesses" at the right coefficients are incrementally improved. Again, if you are familiar with neural networks, this is much like the various training rules used with the simplest "single neuron" models. Hopefully, you are lucky enough to have a routine handy to perform this process for you, such as glmfit, from the Statistics Toolbox.
glmfit
The glmfit function is easy to apply. The syntax for logistic regression is:
B = glmfit(X, [Y N], 'binomial', 'link', 'logit');
B will contain the discovered coefficients for the linear portion of the logistic regression (the link function has no coefficients). X contains the predictor data, with examples in rows and variables in columns. Y contains the target variable, usually a 0 or a 1 representing the outcome. Last, the variable N contains the count of events for each row of the example data; most often, this will be a column of 1s, the same size as Y. The count parameter, N, will be set to values greater than 1 for grouped data. As an example, think of medical cases summarized by country: each country will have averaged input values, an observed outcome rate, and the count of cases from that country. When the counts are greater than one, the target variable Y holds the number of target-class observations in each group.
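For grouped data, a hypothetical call might look like the following (the numbers are invented purely for illustration): each row summarizes one group, Y holds the number of events observed in that group, and N holds the group size:
% Hypothetical grouped-data fit: Y counts events out of N cases per group
X = [0.2; 0.5; 0.9; 1.4];            % one averaged predictor per group
Y = [ 3;   8;  15;  19];             % events observed in each group
N = [20;  20;  20;  20];             % cases in each group
B = glmfit(X, [Y N], 'binomial', 'link', 'logit');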
Here is a very small ungrouped example:
>> X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]';
>> Y = [0 0 1 0 0 0 1 1 1 1]';
>> B = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'logit')
B =
-3.4932
2.9402
The first element of B is the constant term, and the second element is the coefficient for the lone input variable. We apply the linear part of this logistic regression thus:
>> Z = B(1) + X * (B(2))
Z =
-3.4932
-3.1992
-1.4350
-0.5530
-0.2589
0.3291
0.6231
1.5052
2.6813
2.9753
To finish, we apply the logistic function to the output of the linear part:
>> Z = Logistic(B(1) + X * (B(2)))
Z =
0.0295
0.0392
0.1923
0.3652
0.4356
0.5815
0.6509
0.8183
0.9359
0.9514
Despite the simplicity of the logistic function, I built it into a small function, Logistic, so that I wouldn't have to repeatedly write out the formula:
% Logistic: calculates the logistic function of the input
% by Will Dwinnell
%
% Last modified: Sep-02-2006
function Output = Logistic(Input)
Output = 1 ./ (1 + exp(-Input));
% EOF
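Finally, to turn the fitted probabilities into hard 0/1 class predictions, one can simply threshold at 0.5. This is a follow-up sketch, not part of the original example:
% Threshold the fitted probabilities at 0.5 to get class predictions
P = Logistic(B(1) + X * B(2:end));
PredictedClass = double(P >= 0.5);
TrainingAccuracy = mean(PredictedClass == Y)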
Conclusion
Though it is structurally very simple, logistic regression still finds wide use today in many fields. It is quick to fit, the discovered model is easy to implement, and recall is fast. Frequently, it yields better performance than competing, more complex techniques. I recently built a logistic regression model which beat out a neural network, decision trees and two types of discriminant analysis. If nothing else, it is worth fitting a simple model such as logistic regression early in a modeling project, just to establish a performance benchmark for the project.
Logistic regression is closely related to another GLM procedure, probit regression, which differs only in its link function (specified in glmfit by replacing 'logit' with 'probit'). I believe that probit regression has been losing popularity since its results are typically very similar to those from logistic regression, but the formula for the logistic link function is simpler than that of the probit link function.
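For comparison, a probit fit to the same tiny data set requires only changing the link (a sketch; recall that the inverse probit link is the normal CDF):
% Probit version of the same model: only the link function changes
BProbit = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'probit');
PProbit = normcdf(BProbit(1) + X * BProbit(2:end));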
References
Generalized Linear Models, by McCullagh and Nelder (ISBN-13: 978-0412317606)
See Also
The Apr-21-2007 posting, Linear Regression in MATLAB, the Feb-16-2010 posting, Single Neuron Training: The Delta Rule and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).
Thursday, May 03, 2007
Weighted Regression in MATLAB
Many predictive modeling techniques have weighted counterparts, which permit the analyst to assign weights representing the "importance" of individual observations. An observation with a weight of 8, for instance, is treated in the modeling process as though there were 8 individual observations with the same values. The usual, unweighted algorithms may be thought of as a special case of weighted algorithms, in which the weights of all observations equal 1.0.
There are several reasons for using weighted methods. One is simply that some data sets have been pre-summarized, with identical records being collapsed to a single record having a weight equal to the original number of identical records. Many analysts favor binning of predictor variables, which can drastically reduce the number of distinct combinations of input variable values.
A second reason to use weighting is simple economy of space: data with identical (or very similar) records consolidated with weights representing the number of original observations they represent can be much smaller (even by orders of magnitude!) than the original data.
Another important reason to weight observations is to "fix" class distributions in the data. Assume that the original data contains a million rows of bank loan data, of which only 2% represent bad loans. It is common to sample down the number of good loans, while retaining all of the bad loans. This can save time on learning, but will result in a systematically biased model. A learning system which can accept weights on the observations can correct for this bias.
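As a concrete illustration, base MATLAB's lscov accepts a vector of observation weights directly. Here is a minimal weighted fit; the data and weights below are invented for illustration (think of each weight as a count of collapsed duplicate records):
% Weighted least squares via lscov
X = [1 2 4 5 7 9 11 13 14 16]';
Y = [101 105 109 112 117 116 122 123 129 130]';
w = [1 1 1 4 1 2 1 1 1 3]';                  % observation weights
B = lscov([ones(length(X),1) X], Y, w)       % weighted intercept and slope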
There are also a number of on-line resources for performing weighted regression in base MATLAB, such as:
Optimization Tips and Tricks, by John D'Errico
The thread linked below records an interesting conversation about weighted linear regression, and some practical issues for implementation in MATLAB:
Weighted regression thread on Usenet
Weighted regression can also be accomplished using the Statistics Toolbox, via functions such as glmfit and nlinfit. See the help facility for these functions, or try wnlsdemo for more information.
The Curve Fitting Toolbox also provides facilities for weighted regression (see: help fitoptions).
See also:
The Apr-21-2007 posting, Linear Regression in MATLAB.
The Oct-23-2007 posting, L-1 Linear Regression.
Saturday, April 21, 2007
Linear Regression in MATLAB
Fitting a least-squares linear regression is easily accomplished in MATLAB using the backslash operator: '\'. In linear algebra, matrices may be multiplied like this:
output = input * coefficients
The backslash in MATLAB allows the programmer to effectively "divide" the output by the input to get the linear coefficients. This process will be illustrated by the following examples:
Simple Linear Regression
First, some data with a roughly linear relationship is needed:
>> X = [1 2 4 5 7 9 11 13 14 16]'; Y = [101 105 109 112 117 116 122 123 129 130]';
"Divide" using MATLAB's backslash operator to regress without an intercept:
>> B = X \ Y
B =
10.8900
Append a column of ones before dividing to include an intercept:
>> B = [ones(length(X),1) X] \ Y
B =
101.3021
1.8412
In this case, the first number is the intercept and the second is the coefficient.
Multiple Linear Regression
The following generates a matrix of 1000 observations of 5 random input variables:
>> X = rand(1e3,5);
Next, the true coefficients are defined (which wouldn't be known in a real problem). As is conventional, the intercept term is the first element of the coefficient vector. The problem at hand is to approximate these coefficients, knowing only the input and output data:
>> BTrue = [-1 2 -3 4 -5 6]';
Multiply the matrices to get the output data.
>> Y = BTrue(1) + X * BTrue(2:end);
As before, append a column of ones and use the backslash operator:
>> B = [ones(size(X,1),1) X] \ Y
B =
-1.0000
2.0000
-3.0000
4.0000
-5.0000
6.0000
Again, the first element in the coefficient vector is the intercept. Note that, oh so conveniently, the discovered coefficients match the designed ones exactly, since this data set is completely noise-free.
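To see what noise does, one can perturb the targets slightly and refit; the recovered coefficients should then be close to, but no longer exactly equal to, the designed ones:
% Add a little noise to the targets and refit
YNoisy = Y + 0.1 * randn(size(Y));
BNoisy = [ones(size(X,1),1) X] \ YNoisy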
Model Recall
Executing linear models is a simple matter of matrix multiplication, but there is an efficiency issue. One might append a column of ones and simply perform the complete matrix multiplication, thus:
>> Z = [ones(size(X,1),1) X] * B;
The above process is inefficient, though, and can be improved by simply multiplying all the other coefficients by the input data matrix and adding the intercept term:
>> Z = B(1) + X * B(2:end);
Regression in the Statistics Toolbox
The MATLAB Statistics Toolbox includes several linear regression functions. Among others, there are:
regress: least squares linear regression and diagnostics
stepwisefit: stepwise linear regression
robustfit: robust (non-least-squares) linear regression and diagnostics
See help stats for more information.
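For instance, a minimal call to regress, re-using the X and Y from the multiple regression example above, looks like this:
% regress returns coefficients plus diagnostics such as R-squared
[B, BInt, R, RInt, Stats] = regress(Y, [ones(size(X,1),1) X]);
Stats(1)                                     % first element of Stats is R-squared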
See also:
The May-03-2007 posting, Weighted Regression in MATLAB.
The Oct-23-2007 posting, L-1 Linear Regression.
The Mar-15-2009 posting, Logistic Regression.