Data Mining in MATLAB: Linear Regression in MATLAB

Saturday, April 21, 2007

Linear Regression in MATLAB

Fitting a least-squares linear regression is easily accomplished in MATLAB using the backslash operator: '\'. In linear algebra, matrices may be multiplied like this:

output = input * coefficients

The backslash in MATLAB allows the programmer to effectively "divide" the output by the input to get the linear coefficients. This process will be illustrated by the following examples:
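
Before turning to those examples, here is a minimal sketch of what this "division" computes, using a small made-up data set (the names A, y, bBackslash and bNormal are illustrative and not part of the original post). For a matrix with more rows than columns, the backslash returns the least-squares solution, which should agree with the solution of the normal equations up to rounding:

```matlab
% Small illustrative system: 4 observations, 2 coefficients (made-up data)
A = [1 1; 1 2; 1 3; 1 4];
y = [2.1; 3.9; 6.2; 7.8];

% Least-squares solution via backslash
bBackslash = A \ y;

% Same solution via the normal equations (A'*A)*b = A'*y,
% shown for comparison only; backslash is the numerically safer route
bNormal = (A' * A) \ (A' * y);

% Difference should be at rounding-error level
disp(max(abs(bBackslash - bNormal)))
```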

Simple Linear Regression

First, some data with a roughly linear relationship is needed:

>> X = [1 2 4 5 7 9 11 13 14 16]'; Y = [101 105 109 112 117 116 122 123 129 130]';

"Divide" using MATLAB's backslash operator to regress without an intercept:

>> B = X \ Y

B =

10.8900
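
For a single input and no intercept, the least-squares coefficient has a simple closed form, sum(X.*Y)/sum(X.^2), which gives a quick check on the backslash result. A small sketch using the same data (the variable name BCheck is introduced here for illustration):

```matlab
X = [1 2 4 5 7 9 11 13 14 16]';
Y = [101 105 109 112 117 116 122 123 129 130]';

% Closed-form no-intercept coefficient: sum of x*y over sum of x^2
BCheck = sum(X .* Y) / sum(X .^ 2);   % 9997 / 918

% Should agree with X \ Y up to rounding
disp(abs(BCheck - (X \ Y)))
```

Forcing the line through the origin is why this coefficient (about 10.89) is so much steeper than the data suggest: without an offset, the fit has to account for Y values that start near 100.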

Append a column of ones before dividing to include an intercept:

>> B = [ones(length(X),1) X] \ Y

B =

101.3021
1.8412

In this case, the first number is the intercept and the second is the slope coefficient for X.
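
As a cross-check, the base MATLAB function polyfit fits the same straight line when asked for a degree-1 polynomial; note that it returns the coefficients in the opposite order (slope first, then intercept). A brief sketch, with P, YHat and Residuals as illustrative names:

```matlab
X = [1 2 4 5 7 9 11 13 14 16]';
Y = [101 105 109 112 117 116 122 123 129 130]';

% Intercept and slope from the backslash fit
B = [ones(length(X),1) X] \ Y;

% Degree-1 polynomial fit: P(1) is the slope, P(2) is the intercept
P = polyfit(X, Y, 1);

% Fitted values and residuals from the backslash coefficients
YHat = B(1) + B(2) * X;
Residuals = Y - YHat;

% With an intercept in the model, least-squares residuals sum to about zero
disp([B(2) - P(1), B(1) - P(2), sum(Residuals)])
```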

Multiple Linear Regression

The following generates a matrix of 1000 observations of 5 random input variables:

>> X = rand(1e3,5);

Next, the true coefficients are defined (which wouldn't be known in a real problem). As is conventional, the intercept term is the first element of the coefficient vector. The problem at hand is to approximate these coefficients, knowing only the input and output data:

>> BTrue = [-1 2 -3 4 -5 6]';

Multiply the input matrix by the coefficients and add the intercept to get the output data:

>> Y = BTrue(1) + X * BTrue(2:end);

As before, append a column of ones and use the backslash operator:

>> B = [ones(size(X,1),1) X] \ Y

B =

-1.0000
2.0000
-3.0000
4.0000
-5.0000
6.0000

Again, the first element in the coefficient vector is the intercept. Note that, oh so conveniently, the discovered coefficients match the designed ones exactly (to the displayed precision), since this data set is completely noise-free.
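
Real data are rarely noise-free. Here is a small sketch of how the recovered coefficients behave once Gaussian noise is added to the output; the noise level of 0.1 and the names YNoisy and BNoisy are assumptions made for illustration:

```matlab
X = rand(1e3,5);
BTrue = [-1 2 -3 4 -5 6]';
Y = BTrue(1) + X * BTrue(2:end);

% Add zero-mean Gaussian noise with standard deviation 0.1 (assumed level)
YNoisy = Y + 0.1 * randn(size(Y));

% Same fitting procedure as before
BNoisy = [ones(size(X,1),1) X] \ YNoisy;

% Estimates should land close to BTrue, but no longer match exactly
disp([BTrue BNoisy])
```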

Model Recall

Executing linear models is a simple matter of matrix multiplication, but there is an efficiency issue. One might append a column of ones and simply perform the complete matrix multiplication, thus:

>> Z = [ones(size(X,1),1) X] * B;

The above process is inefficient, though, because it builds a temporary copy of the input matrix with an extra column. It can be improved by multiplying the input data matrix by the remaining coefficients and adding the intercept term:

>> Z = B(1) + X * B(2:end);
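
Both forms should give the same predictions up to rounding. A quick sketch to confirm the agreement and compare timings with tic and toc (Z1, Z2, t1 and t2 are illustrative names, the coefficients are stand-ins, and the measured times will vary by machine and data size):

```matlab
X = rand(1e3,5);
B = [-1 2 -3 4 -5 6]';   % stand-in coefficients, intercept first

% Form 1: append a column of ones, then multiply
tic; Z1 = [ones(size(X,1),1) X] * B; t1 = toc;

% Form 2: add the intercept to the product with the remaining coefficients
tic; Z2 = B(1) + X * B(2:end); t2 = toc;

% Largest disagreement between the two forms (rounding-level)
disp(max(abs(Z1 - Z2)))
fprintf('%g s vs %g s\n', t1, t2);
```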

Regression in the Statistics Toolbox

The MATLAB Statistics Toolbox includes several linear regression functions. Among others, there are:

regress: least squares linear regression and diagnostics

stepwisefit: stepwise linear regression

robustfit: robust (non-least-squares) linear regression and diagnostics

See help stats for more information.
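
As a sketch of the first of these, regress takes the output vector first and the input matrix second, and it does not add an intercept on its own, so the column of ones is still needed. This assumes the Statistics Toolbox is installed; the data are the simple example from earlier in the post, and the output names are illustrative:

```matlab
X = [1 2 4 5 7 9 11 13 14 16]';
Y = [101 105 109 112 117 116 122 123 129 130]';

% Design matrix with an explicit column of ones for the intercept
A = [ones(length(X),1) X];

% B: coefficients; BInt: 95% confidence intervals for the coefficients;
% R: residuals; RInt: residual intervals; Stats: R-square, F statistic,
% p value for the full model, and an estimate of the error variance
[B, BInt, R, RInt, Stats] = regress(Y, A);

disp(B)
```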

See also:

The May-03-2007 posting, Weighted Regression in MATLAB.

The Oct-23-2007 posting, L-1 Linear Regression.

The Mar-15-2009 posting, Logistic Regression.

7 comments:

Sandro Saitta said...: Hello Will,

This comment is not related to your post but it may be of interest to you. I've just found a new book on amazon about data mining and matlab. Perhaps you already know it.; 2:43 AM
Will Dwinnell said...: I was not aware of that particular title. It looks like it's time to brush up on my linear algebra!; 8:41 PM
Anonymous said...: How does one attain simple diagnostic statistics about the multiple regression, such as:
-standard error
-t statistic
-P-value
-confidence interval
-r square
-adjusted r square

in matlab?
These are available in Excel with the click of a button, but I'm positive Matlab should be way better than Excel :P; 1:28 PM
Will Dwinnell said...: I never use those statistics, so I do not have any code immediately handy. The regress function in the Statistics Toolbox will generate a number of these diagnostics, and it should not be hard to create one's own calculations using MATLAB's built-in functions, like var.; 7:35 PM
Anonymous said...: [B,BINT,R,RINT,STATS] = REGRESS(Y,X) returns a vector STATS containing, in
the following order, the R-square statistic, the F statistic and p value
for the full model, and an estimate of the error variance.; 5:44 AM
Cristiano said...: Dear Will Dwinnell,
many compliments for your work on MATLAB.

I'd like to know if there is a solution/workaround for my statistical problem.

I'd like to understand, just as overview, how I can solve this problem: I know yi(outcome) and wi(initial weights) and xi (values) but I don't know the f(x,w)

My function to predict is:

y = f(w1*x1,w2*x2,...,w39*x39)

I'd like to minimize error for predicting yi and find the final weight.

I'm wondering if is a Optimization problem: http://www.mathworks.com/matlabcentral/fileexchange/8553

I'm a statistician and I'd think to solve this example with nonlinear regression with Bound constraint, but I ask you if with Matlab can solve it better.

Any kind of suggestions will be really appreciated.

Thanks in advance; 7:02 AM
Anonymous said...: Hi,

I am doing multiple regression using matlab with three dependent variables. Matlab is spitting out only 1 p-value or strictly speaking, the F-statistic. How can i get the p-values corresponding to all of the dependent variables?

Thanks.; 10:00 PM