Fitting a least-squares linear regression is easily accomplished in MATLAB using the backslash operator: '\'. In linear algebra, matrices may by multiplied like this:
output = input * coefficients
The backslash in MATLAB allows the programmer to effectively "divide" the output by the input to get the linear coefficients. This process will be illustrated by the following examples:
Simple Linear Regression
First, some data with a roughly linear relationship is needed:
>> X = [1 2 4 5 7 9 11 13 14 16]'; Y = [101 105 109 112 117 116 122 123 129 130]';
"Divide" using MATLAB's backslash operator to regress without an intercept:
>> B = X \ Y
B =
10.8900
Append a column of ones before dividing to include an intercept:
>> B = [ones(length(X),1) X] \ Y
B =
101.3021
1.8412
In this case, the first number is the intercept and the second is the coefficient.
Multiple Linear Regression
The following generates a matrix of 1000 observations of 5 random input variables:
>> X = rand(1e3,5);
Next, the true coefficients are defined (which wouldn't be known in a real problem). As is conventional, the intercept term is the first element of the coefficient vector. The problem at hand is to approximate these coefficients, knowing only the input and output data:
>> BTrue = [-1 2 -3 4 -5 6]';
Multiply the matrices to get the output data.
>> Y = BTrue(1) + X * BTrue(2:end);
As before, append a column of ones and use the backslash operator:
>> B = [ones(size(X,1),1) X] \ Y
B =
-1.0000
2.0000
-3.0000
4.0000
-5.0000
6.0000
Again, the first element in the coefficient vector is the intercept. Note that, oh so conveniently, the discovered coefficients match the designed ones exactly, since this data set is completely noise-free.
Model Recall
Executing linear models is a simple matter of matrix multiplication, but there is an efficiency issue. One might append a column of ones and simply perform the complete matrix multiplication, thus:
>> Z = [ones(size(X,1),1) X] * B;
The above process is inefficient, though, and can be improved by simply multiplying all the other coefficients by the input data matrix and adding the intercept term:
>> Z = B(1) + X * B(2:end);
Regression in the Statistics Toolbox
The MATLAB Statistics Toolbox includes several linear regression functions. Among others, there are:
regress: least squares linear regression and diagnostics
stepwisefit: stepwise linear regression
robustfit: robust (non-least-squares) linear regression and diagnostics
See help stats for more information.
See also:
The May-03-2007 posting, Weighted Regression in MATLAB.
The Oct-23-2007 posting, L-1 Linear Regression.
The Mar-15-2009 posting, Logistic Regression.
Saturday, April 21, 2007
Friday, April 13, 2007
Basic Summary Statistics in MATLAB
This posting covers basic summary statistics in MATLAB.
First, note that MATLAB has a strong array-orientation, so data sets to be analyzed are most often stored as a matrix of values. Note that the convention in MATLAB is for variables to be stored in columns, and observations to be stored in rows. This is not a hard-and-fast rule, but it is much more common than the alternative (variables in rows, observations in columns). Besides, most MATLAB routines (whether from the MathWorks or elsewhere) assume this convention.
Basic summaries are easy to obtain from MATLAB. For the examples below, the following matrix of data, A, will be used (No, it's not very exciting, but it will do for our purposes):
>> A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1]
A =
1 2 3 4
-1 10 8 5
9 8 7 0
0 0 0 1
MATLAB matrices are indexed as: MatrixName(row,column):
>> A(2,1)
ans =
-1
Common statistical summaries are available in MATLAB, such as: mean (arithmetic mean), median (median), min (minimum value), max (maximum value) and std (standard deviation). Their use is illustrated below:
>> mean(A)
ans =
2.2500 5.0000 4.5000 2.5000
>> median(A)
ans =
0.5000 5.0000 5.0000 2.5000
>> min(A)
ans =
-1 0 0 0
>> max(A)
ans =
9 10 8 5
>> std(A)
ans =
4.5735 4.7610 3.6968 2.3805
Note that each of these functions operate along the columns, yielding one summary for each, stored in a row vector. Sometimes it is desired to calculate along the rows instead. Some routines can be redirected by another parameter, like this:
>> mean(A,2)
ans =
2.5000
5.5000
6.0000
0.2500
The above calculates the arithmetic means of each row, storing them in a column vector. The second mean parameter, if it is specified, indicates the dimension along which mean is to operate.
For routines without this capability, the data matrix may be transposed (rows become columns and columns become rows) using the apostrophe operator while feeding it to the function:
>> mean(A')
ans =
2.5000 5.5000 6.0000 0.2500
Note that, this time, the result is stored in a row vector.
The colon operator, :, can be used to dump all of the contents of an array into one giant column vector. The result of this operation can then be fed to any of our summary routines:
>> A(:)
ans =
1
-1
9
0
2
10
8
0
3
8
7
0
4
5
0
1
>> mean(A(:))
ans =
3.5625
The reader will find more information on summary routines in base MATLAB through:
help datafun
The MATLAB Statistics Toolbox
MATLAB users lucky enough to own the Statistics Toolbox will have available still more summaries, such as iqr (inter-quartile range), trimmean (trimmed mean) and geomean (geometric mean). Also, there are extended versions of several summary functions, such as nanmean and nanmax, which will ignore NaN (IEEE floating point "not-a-number") values, which are commonly used to represent missing values in MATLAB.
To learn more, see the "Descriptive Statistics" section when using:
help stats
First, note that MATLAB has a strong array-orientation, so data sets to be analyzed are most often stored as a matrix of values. Note that the convention in MATLAB is for variables to be stored in columns, and observations to be stored in rows. This is not a hard-and-fast rule, but it is much more common than the alternative (variables in rows, observations in columns). Besides, most MATLAB routines (whether from the MathWorks or elsewhere) assume this convention.
Basic summaries are easy to obtain from MATLAB. For the examples below, the following matrix of data, A, will be used (No, it's not very exciting, but it will do for our purposes):
>> A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1]
A =
1 2 3 4
-1 10 8 5
9 8 7 0
0 0 0 1
MATLAB matrices are indexed as: MatrixName(row,column):
>> A(2,1)
ans =
-1
Common statistical summaries are available in MATLAB, such as: mean (arithmetic mean), median (median), min (minimum value), max (maximum value) and std (standard deviation). Their use is illustrated below:
>> mean(A)
ans =
2.2500 5.0000 4.5000 2.5000
>> median(A)
ans =
0.5000 5.0000 5.0000 2.5000
>> min(A)
ans =
-1 0 0 0
>> max(A)
ans =
9 10 8 5
>> std(A)
ans =
4.5735 4.7610 3.6968 2.3805
Note that each of these functions operate along the columns, yielding one summary for each, stored in a row vector. Sometimes it is desired to calculate along the rows instead. Some routines can be redirected by another parameter, like this:
>> mean(A,2)
ans =
2.5000
5.5000
6.0000
0.2500
The above calculates the arithmetic means of each row, storing them in a column vector. The second mean parameter, if it is specified, indicates the dimension along which mean is to operate.
For routines without this capability, the data matrix may be transposed (rows become columns and columns become rows) using the apostrophe operator while feeding it to the function:
>> mean(A')
ans =
2.5000 5.5000 6.0000 0.2500
Note that, this time, the result is stored in a row vector.
The colon operator, :, can be used to dump all of the contents of an array into one giant column vector. The result of this operation can then be fed to any of our summary routines:
>> A(:)
ans =
1
-1
9
0
2
10
8
0
3
8
7
0
4
5
0
1
>> mean(A(:))
ans =
3.5625
The reader will find more information on summary routines in base MATLAB through:
help datafun
The MATLAB Statistics Toolbox
MATLAB users lucky enough to own the Statistics Toolbox will have available still more summaries, such as iqr (inter-quartile range), trimmean (trimmed mean) and geomean (geometric mean). Also, there are extended versions of several summary functions, such as nanmean and nanmax, which will ignore NaN (IEEE floating point "not-a-number") values, which are commonly used to represent missing values in MATLAB.
To learn more, see the "Descriptive Statistics" section when using:
help stats
Sunday, April 08, 2007
Getting Data Into MATLAB Using textread
Before any analysis can be performed in MATLAB, the data must somehow be imported. MATLAB offers a number of functions for data import (dlmread, fscanf, xlsread, etc.- try help iofun for more information), which the reader should explore. In my work, I tend to work with other software systems (relational databases and statistical packages) which can export data to text files. I have found it convenient for my purposes to convert all categorical data to numeric form (dummy variables or integer codes) in the originating system, and then to dump the data to a tab-delimited text file for consumption by MATLAB.
Below is a typical textread call, from one of my recent projects (Web formatting necessitates breaking this up: it is supposed to be a single line of code):
A = ...
textread('F:\MyFile.dat','',-1,'delimiter','\t', ...
'headerlines',1,'emptyvalue',NaN);
An explanation of each textread parameter being used follows:
A is the variable containing the data after it is loaded from disk.
'F:\MyFile.dat' is the fully qualified name of the file being loaded.
The empty string indicates that no format string is being used.
The -1 indicates that all rows of data are to be loaded. To load some specific fraction of all rows, change this value to the number of rows to be loaded.
All parameters after this point come in pairs.
The 'delimiter' indicates that the data is delimited (as opposed to fixed-width). The '\t' lets textread know that the delimiter is the tab character. Comma-delimited .CSV files, for instance, would use 'delimiter',',' instead.
Next is the parameter pair 'headerlines',1, which tells textread to ignore the first row (it contains column headers).
Last, the parameter pair 'emptyvalue',NaN indicates that any missing values should be represented in the resulting MATLAB matrix as NaN ("not-a-number") values. Other values could be used. Some analysts are accustomed to using values like -99, -9999, etc., although it is generally the convention in MATLAB to use NaN to represent missing values.
I use a separate chunk of code to read in the header line and digest the variable names. Some MATLAB programmers prefer using cell arrays instead. Yet a third possibility of dealing with variable names is to use the categorical arrays and dataset arrays from the Statistics Toolbox (new with MATLAB version 2007a), but I haven't had a chance to explore them yet.
Below is a typical textread call, from one of my recent projects (Web formatting necessitates breaking this up: it is supposed to be a single line of code):
A = ...
textread('F:\MyFile.dat','',-1,'delimiter','\t', ...
'headerlines',1,'emptyvalue',NaN);
An explanation of each textread parameter being used follows:
A is the variable containing the data after it is loaded from disk.
'F:\MyFile.dat' is the fully qualified name of the file being loaded.
The empty string indicates that no format string is being used.
The -1 indicates that all rows of data are to be loaded. To load some specific fraction of all rows, change this value to the number of rows to be loaded.
All parameters after this point come in pairs.
The 'delimiter' indicates that the data is delimited (as opposed to fixed-width). The '\t' lets textread know that the delimiter is the tab character. Comma-delimited .CSV files, for instance, would use 'delimiter',',' instead.
Next is the parameter pair 'headerlines',1, which tells textread to ignore the first row (it contains column headers).
Last, the parameter pair 'emptyvalue',NaN indicates that any missing values should be represented in the resulting MATLAB matrix as NaN ("not-a-number") values. Other values could be used. Some analysts are accustomed to using values like -99, -9999, etc., although it is generally the convention in MATLAB to use NaN to represent missing values.
I use a separate chunk of code to read in the header line and digest the variable names. Some MATLAB programmers prefer using cell arrays instead. Yet a third possibility of dealing with variable names is to use the categorical arrays and dataset arrays from the Statistics Toolbox (new with MATLAB version 2007a), but I haven't had a chance to explore them yet.
Labels:
2007a,
categorical array,
data,
dataset array,
delimited,
delimiter,
import,
iofun,
textread
Subscribe to:
Posts (Atom)