Data Mining in MATLAB: April 2007

Saturday, April 21, 2007

Linear Regression in MATLAB

Fitting a least-squares linear regression is easily accomplished in MATLAB using the backslash operator: '\'. In linear algebra, matrices may by multiplied like this:

output = input * coefficients

The backslash in MATLAB allows the programmer to effectively "divide" the output by the input to get the linear coefficients. This process will be illustrated by the following examples:

Simple Linear Regression

First, some data with a roughly linear relationship is needed:

>> X = [1 2 4 5 7 9 11 13 14 16]'; Y = [101 105 109 112 117 116 122 123 129 130]';

"Divide" using MATLAB's backslash operator to regress without an intercept:

>> B = X \ Y

B =

10.8900

Append a column of ones before dividing to include an intercept:

>> B = [ones(length(X),1) X] \ Y

B =

101.3021
1.8412

In this case, the first number is the intercept and the second is the coefficient.

Multiple Linear Regression

The following generates a matrix of 1000 observations of 5 random input variables:

>> X = rand(1e3,5);

Next, the true coefficients are defined (which wouldn't be known in a real problem). As is conventional, the intercept term is the first element of the coefficient vector. The problem at hand is to approximate these coefficients, knowing only the input and output data:

>> BTrue = [-1 2 -3 4 -5 6]';

Multiply the matrices to get the output data.

>> Y = BTrue(1) + X * BTrue(2:end);

As before, append a column of ones and use the backslash operator:

>> B = [ones(size(X,1),1) X] \ Y

B =

-1.0000
2.0000
-3.0000
4.0000
-5.0000
6.0000

Again, the first element in the coefficient vector is the intercept. Note that, oh so conveniently, the discovered coefficients match the designed ones exactly, since this data set is completely noise-free.

Model Recall

Executing linear models is a simple matter of matrix multiplication, but there is an efficiency issue. One might append a column of ones and simply perform the complete matrix multiplication, thus:

>> Z = [ones(size(X,1),1) X] * B;

The above process is inefficient, though, and can be improved by simply multiplying all the other coefficients by the input data matrix and adding the intercept term:

>> Z = B(1) + X * B(2:end);

Regression in the Statistics Toolbox

The MATLAB Statistics Toolbox includes several linear regression functions. Among others, there are:

regress: least squares linear regression and diagnostics

stepwisefit: stepwise linear regression

robustfit: robust (non-least-squares) linear regression and diagnostics

See help stats for more information.

See also:

The May-03-2007 posting, Weighted Regression in MATLAB.

The Oct-23-2007 posting, L-1 Linear Regression.

The Mar-15-2009 posting, Logistic Regression.

Friday, April 13, 2007

Basic Summary Statistics in MATLAB

This posting covers basic summary statistics in MATLAB.

First, note that MATLAB has a strong array-orientation, so data sets to be analyzed are most often stored as a matrix of values. Note that the convention in MATLAB is for variables to be stored in columns, and observations to be stored in rows. This is not a hard-and-fast rule, but it is much more common than the alternative (variables in rows, observations in columns). Besides, most MATLAB routines (whether from the MathWorks or elsewhere) assume this convention.

Basic summaries are easy to obtain from MATLAB. For the examples below, the following matrix of data, A, will be used (No, it's not very exciting, but it will do for our purposes):

>> A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1]

A =

1 2 3 4
-1 10 8 5
9 8 7 0
0 0 0 1

MATLAB matrices are indexed as: MatrixName(row,column):

>> A(2,1)

ans =

-1

Common statistical summaries are available in MATLAB, such as: mean (arithmetic mean), median (median), min (minimum value), max (maximum value) and std (standard deviation). Their use is illustrated below:

>> mean(A)

ans =

2.2500 5.0000 4.5000 2.5000

>> median(A)

ans =

0.5000 5.0000 5.0000 2.5000

>> min(A)

ans =

-1 0 0 0

>> max(A)

ans =

9 10 8 5

>> std(A)

ans =

4.5735 4.7610 3.6968 2.3805

Note that each of these functions operate along the columns, yielding one summary for each, stored in a row vector. Sometimes it is desired to calculate along the rows instead. Some routines can be redirected by another parameter, like this:

>> mean(A,2)

ans =

2.5000
5.5000
6.0000
0.2500

The above calculates the arithmetic means of each row, storing them in a column vector. The second mean parameter, if it is specified, indicates the dimension along which mean is to operate.

For routines without this capability, the data matrix may be transposed (rows become columns and columns become rows) using the apostrophe operator while feeding it to the function:

>> mean(A')

ans =

2.5000 5.5000 6.0000 0.2500

Note that, this time, the result is stored in a row vector.

The colon operator, :, can be used to dump all of the contents of an array into one giant column vector. The result of this operation can then be fed to any of our summary routines:

>> A(:)

ans =

1
-1
9
0
2
10
8
0
3
8
7
0
4
5
0
1

>> mean(A(:))

ans =

3.5625

The reader will find more information on summary routines in base MATLAB through:

help datafun

The MATLAB Statistics Toolbox

MATLAB users lucky enough to own the Statistics Toolbox will have available still more summaries, such as iqr (inter-quartile range), trimmean (trimmed mean) and geomean (geometric mean). Also, there are extended versions of several summary functions, such as nanmean and nanmax, which will ignore NaN (IEEE floating point "not-a-number") values, which are commonly used to represent missing values in MATLAB.

To learn more, see the "Descriptive Statistics" section when using:

help stats

Sunday, April 08, 2007

Getting Data Into MATLAB Using textread

Before any analysis can be performed in MATLAB, the data must somehow be imported. MATLAB offers a number of functions for data import (dlmread, fscanf, xlsread, etc.- try help iofun for more information), which the reader should explore. In my work, I tend to work with other software systems (relational databases and statistical packages) which can export data to text files. I have found it convenient for my purposes to convert all categorical data to numeric form (dummy variables or integer codes) in the originating system, and then to dump the data to a tab-delimited text file for consumption by MATLAB.

Below is a typical textread call, from one of my recent projects (Web formatting necessitates breaking this up: it is supposed to be a single line of code):

A = ...
textread('F:\MyFile.dat','',-1,'delimiter','\t', ...
'headerlines',1,'emptyvalue',NaN);

An explanation of each textread parameter being used follows:

A is the variable containing the data after it is loaded from disk.

'F:\MyFile.dat' is the fully qualified name of the file being loaded.

The empty string indicates that no format string is being used.

The -1 indicates that all rows of data are to be loaded. To load some specific fraction of all rows, change this value to the number of rows to be loaded.

All parameters after this point come in pairs.

The 'delimiter' indicates that the data is delimited (as opposed to fixed-width). The '\t' lets textread know that the delimiter is the tab character. Comma-delimited .CSV files, for instance, would use 'delimiter',',' instead.

Next is the parameter pair 'headerlines',1, which tells textread to ignore the first row (it contains column headers).

Last, the parameter pair 'emptyvalue',NaN indicates that any missing values should be represented in the resulting MATLAB matrix as NaN ("not-a-number") values. Other values could be used. Some analysts are accustomed to using values like -99, -9999, etc., although it is generally the convention in MATLAB to use NaN to represent missing values.

I use a separate chunk of code to read in the header line and digest the variable names. Some MATLAB programmers prefer using cell arrays instead. Yet a third possibility of dealing with variable names is to use the categorical arrays and dataset arrays from the Statistics Toolbox (new with MATLAB version 2007a), but I haven't had a chance to explore them yet.