Monday, December 24, 2007
Although it was begun in 2006, Data Mining in MATLAB is just now completing its first full calendar year in operation. I want to thank the readers who have sent words of encouragement or thanks, and those who have commented or asked questions. Sometimes I post material and wonder whether anyone is reading, so it is nice to receive a favorable response.
All in all, it has been a productive year here, with 27 posts (not counting this one). My only regret is not being more consistent in posting, but, in the interest of quality, I have studiously avoided rushing out material. (There are at least four half-finished posts sitting here now; if only it weren't for my darned "day job"!)
I'd like to wish an especially merry Christmas to Dean Abbott, with whom I co-author Data Mining and Predictive Analytics, and to Sandro Saitta, who writes Data Mining Research!
Merry Christmas to all!
Tuesday, October 23, 2007
L-1 Linear Regression
Fitting lines to data is a fundamental part of data mining and inferential statistics. Many more complicated schemes use line-fitting as a foundation, and least-squares linear regression has, for years, been the workhorse technique of the field. Least-squares linear regression fits a line (or plane, hyperplane, etc.) with the minimum possible squared error. I explained the execution of least-squares linear regression in MATLAB in my Apr-21-2007 posting, Linear Regression in MATLAB.
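As a quick refresher, the whole fit is one line with MATLAB's backslash operator; the data below is made up purely for illustration:

% Least-squares fit with an intercept, via the backslash operator
x = (1:20)';                      % example predictor
y = 3 + 0.5*x + randn(20,1);      % example response: a line plus noise
B = [ones(size(x)) x] \ y;        % B(1) is the intercept, B(2) the slope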
Why least squares?
Least-squares offers a number of esoteric technical strengths, but many students of statistics wonder: "Why least squares?" The simplest (and most superficial) answer is: "Squaring the errors makes them all positive, so that errors with conflicting signs do not cancel each other out in sums or means." While this is true, squaring may seem an odd way to accomplish this when taking the absolute values of the errors (simply ignoring their signs) is much more straightforward.
Taking the absolute values of the errors (instead of their squares) leads to an alternative regression procedure, known as least absolute errors regression or L-1 linear regression. Like least-squares linear regression, L-1 linear regression fits a line to the supplied data points. Taking the absolute values seems simpler, so why not use L-1 regression? For that matter, why is least-squares regression so popular, given the availability of a seemingly more natural alternative?
Despite the fact that L-1 regression was developed decades before least-squares regression, least squares is much more widely used today. Though L-1 regression has a few quirks, they are not what is holding it back. The real reason that least squares is favored, which your stats professor never told you, is:
Least-squares makes the calculus behind the fitting process extremely easy!
That's it. Statisticians will offer all manner of rationalizations, but the real reason least-squares regression is in vogue is that it is extremely easy to calculate.
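To see this concretely, differentiate the squared-error objective: everything stays linear, and a closed-form solution pops out. In matrix form (this is the standard textbook derivation):

$$E(\beta) = \lVert y - X\beta \rVert_2^2, \qquad \nabla_{\beta} E = -2 X^{\top} (y - X\beta) = 0 \quad \Rightarrow \quad \hat{\beta} = (X^{\top} X)^{-1} X^{\top} y$$

The absolute-error objective $\sum_i \lvert y_i - x_i^{\top} \beta \rvert$, by contrast, is not differentiable wherever a residual equals zero, so no such closed form exists and the fit must be found iteratively.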
L-1 Regression
There are several ways to perform L-1 regression, and all of them involve more computation than any of the least-squares procedures. Thankfully, we live in an age in which mechanical computation is plentiful and cheap! Also thankfully, I have written an L-1 regression routine in MATLAB, called L1LinearRegression.
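For the curious, here is a minimal sketch of one common approach, iteratively reweighted least squares (IRLS). The function name, iteration cap, and tolerances are illustrative choices of mine, and this is not necessarily the algorithm L1LinearRegression uses internally:

function b = L1FitIRLS(X,y)
% L1FitIRLS: illustrative IRLS sketch for L-1 linear regression (hypothetical helper).
% X is n-by-p (independent variables in columns), y is n-by-1.
Xa = [ones(size(X,1),1) X];                    % prepend an intercept column
b  = Xa \ y;                                   % least-squares starting point
for iter = 1:100                               % iteration cap (illustrative)
    r  = y - Xa*b;                             % current residuals
    sw = 1 ./ sqrt(max(abs(r), 1e-8));         % square roots of the L-1 weights, guarded near zero
    bNew = bsxfun(@times, Xa, sw) \ (sw.*y);   % weighted least-squares step
    if norm(bNew - b) <= 1e-8*(norm(b) + 1)    % convergence test (illustrative tolerance)
        b = bNew;
        break
    end
    b = bNew;
end

Each pass re-solves a weighted least-squares problem whose weights shrink the influence of large residuals, so the iterates drift toward the least-absolute-error fit.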
L1LinearRegression assumes that an intercept term is to be included and takes two parameters: the independent variables (a matrix whose columns represent the independent variables) and the dependent variable (in a column vector).
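A call might look like the following. The text above only specifies the two inputs; the return value shown here (a coefficient vector with the intercept first) is an assumption for illustration:

X = randn(100,3);                          % 100 observations of 3 independent variables
y = 1 + X*[2; -1; 0.5] + randn(100,1);     % dependent variable: known coefficients plus noise
B = L1LinearRegression(X,y);               % assumed output: B(1) intercept, B(2:end) slopes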
L-1 regression is less affected by large errors than least squares regression. The following graph depicts this behavior:
[Graph: least-squares and L-1 line fits to the same data; the least-squares line is pulled toward the outlying points]
This example intentionally exaggerates least-squares' slavish chasing of distant data points, but the effect is very real. The biggest drawback of L-1 regression is that it takes longer to run. Unless there are many such regressions to perform, though, execution time is a small matter, and one which gets smaller every year as computers get faster. L1LinearRegression runs in about 10 seconds on 100,000 observations with 10 predictors on fast PC hardware.
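The effect is easy to reproduce: corrupt a single observation and compare the two fits. The L1LinearRegression call again assumes the interface described above:

x = (1:50)';
y = 2 + 0.5*x + randn(50,1);            % clean linear data
y(50) = y(50) + 200;                    % inject one wild outlier
BLS = [ones(size(x)) x] \ y;            % least-squares fit: pulled toward the outlier
BL1 = L1LinearRegression(x,y);          % L-1 fit: stays near the bulk of the data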
References
Alternative Methods of Regression, by Birkes and Dodge (ISBN-13: 978-0471568810)
Least absolute deviation estimation of linear econometric models: A literature review, by Dasgupta and Mishra (Jun-2004)
See also
L1LinearRegression Code Update (Mar-27-2009)
Saturday, September 29, 2007
MATLAB 2007b Released
The fall release of MATLAB is out, and while most toolbox updates relevant to data mining are minor, MATLAB itself has seen some big changes. From the MATLAB 7.5 Latest Features page, among other things:
Performance and Large Data Set Handling
* MATLAB arrays no longer limited to 2^31 (~2 x 10^9) elements, allowing many numeric and low-level file I/O functions to support real double arrays greater than 16 GB on 64-bit platforms
* New function maxNumCompThreads enabling use of get and set for the maximum number of computational threads (a short usage sketch follows this list)
* Upgraded Linear Algebra Package library (LAPACK 3.1) on all platforms, plus upgraded optimized Basic Linear Algebra Subprogram libraries (BLAS) on Intel processors (MKL 9.1) and on AMD processors (AMCL 3.6)
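As promised above, here is a small usage sketch for the new threading function; the get/set calling forms are the documented ones, while the surrounding scenario is illustrative:

n = maxNumCompThreads;          % get the current maximum number of computational threads
maxNumCompThreads(1);           % temporarily force single-threaded computation
% ...run timing-sensitive code here...
maxNumCompThreads(n);           % restore the previous setting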
Readers are strongly encouraged to visit the New Features page for more information.
On A Completely Unrelated Subject...
A few months ago, I attended a one-day presentation, offered locally free of charge by the MathWorks, on algorithm development for C/C++ programmers. Though I seldom program in C or C++ these days (why would I? I have MATLAB!), the class was very informative: I learned about many features of the MATLAB environment that I had previously overlooked. My suggestion to readers is to consider attending one of these presentations, which you can learn about on the MathWorks Web site.
Sunday, July 29, 2007
Poll Results (Jul-22-2007): Source Data File Formats
After a week, the Source Data File Formats poll of Jul-22-2007 is complete. The question asked was:
What is the original format of the data you analyze?
Multiple responses were permitted. A total of 33 votes were cast, although the polling system used does not indicate the total number of voters.
In decreasing order of popularity, the results are:
9 votes (27%): MATLAB
7 votes (21%): Text (comma-delimited, tab-delimited, etc.)
7 votes (21%): Other
5 votes (15%): Relational database (Oracle, DB2, etc.)
4 votes (12%): Excel
1 vote ( 3%): Statistical software native format (SPSS, S-Plus, etc.)
I'm a little surprised that relational databases didn't appear more frequently.
No one commented, although I'd be very interested to know what the 'Other' source formats are, since they tied for second place. Anyone?
Sunday, July 22, 2007
Poll (Jul-22-2007): Source Data File Formats
This poll is about the original file format of the data you analyze, not (necessarily) the data which MATLAB directly loads. For example, if your source data originally comes from a relational database, choose "relational database", even though you may export it to a tab-delimited text file first.
Multiple selections are permitted, but choose the file formats you encounter most often.
This poll is closed.
See the poll results in the Jul-29-2007 posting, Poll Results (Jul-22-2007): Source Data File Formats.
Thanks for voting!