
Tuesday, October 23, 2007

L-1 Linear Regression

Fitting lines to data is a fundamental part of data mining and inferential statistics. Many more complicated schemes use line-fitting as a foundation, and least-squares linear regression has, for years, been the workhorse technique of the field. Least-squares linear regression fits a line (or plane, hyperplane, etc.) with the minimum possible squared error. I explained how to perform least-squares linear regression in MATLAB in my Apr-21-2007 posting, Linear Regression in MATLAB.
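
For reference, a minimal MATLAB sketch of least-squares fitting with an intercept (variable names are illustrative; the backslash operator solves the least-squares problem directly):

% x: predictor column vector, y: response column vector (illustrative names)
X = [ones(size(x)) x];   % design matrix with an intercept column
b = X \ y;               % least-squares coefficients: [intercept; slope]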


Why least squares?

Least-squares offers a number of esoteric technical strengths, but many students of statistics wonder: "Why least-squares?" The simplest (and most superficial) answer is: "Squaring the errors makes them all positive, so that errors with conflicting signs do not cancel each other out in sums or means". While this is true, squaring should seem an odd way to go about this when taking the absolute values of the errors (simply ignoring the signs) is much more straightforward.

Taking the absolute values of the errors (instead of their squares) leads to an alternative regression procedure, known as least absolute errors regression or L-1 linear regression. Like least-squares linear regression, L-1 linear regression fits a line to the supplied data points. Taking the absolute values seems simpler, so why not use L-1 regression? For that matter, why is least-squares regression so popular, given the availability of a seemingly more natural alternative?

Despite the fact that L-1 regression was developed decades before least-squares regression, least squares is much more widely used today. Though L-1 regression has a few quirks, they are not what is holding it back. The real, secret reason that least squares is favored, which your stats professor never told you, is:

Least-squares makes the calculus behind the fitting process extremely easy!

That's it. Statisticians will give all manner of rationalizations, but the real reason least-squares regression is in vogue is that it is extremely easy to calculate.
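
To spell out the calculus in one step: squared error is smooth, so setting its derivative to zero yields the closed-form "normal equations". In matrix notation, with design matrix $X$, response $y$, and coefficients $b$:

$$\min_b \|y - Xb\|_2^2 \quad\Longrightarrow\quad \nabla_b \|y - Xb\|_2^2 = -2X^\top(y - Xb) = 0 \quad\Longrightarrow\quad b = (X^\top X)^{-1} X^\top y$$

The absolute error $|e|$, by contrast, is not differentiable at $e = 0$, so L-1 regression has no comparable closed-form solution and must be fit iteratively (or as a linear program).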

L-1 Regression

There are several ways to perform L-1 regression, and all of them involve more computation than any of the least-squares procedures. Thankfully, we live in an age in which mechanical computation is plentiful and cheap! Also thankfully, I have written an L-1 regression routine in MATLAB, called L1LinearRegression.

L1LinearRegression assumes that an intercept term is to be included and takes two parameters: the independent variables (a matrix with one column per independent variable) and the dependent variable (a column vector).
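
For the curious, here is a minimal sketch of one common way to fit an L-1 regression, iteratively reweighted least squares (IRLS); it mimics the interface described above, but it is an illustration under that assumption, not necessarily the algorithm inside L1LinearRegression:

% Sketch: L-1 regression via iteratively reweighted least squares (IRLS).
% X: matrix of independent variables (one per column); Y: column vector.
function B = L1RegressionSketch(X, Y)
    X = [ones(size(X, 1), 1), X];                  % prepend the intercept column
    B = X \ Y;                                     % least-squares starting point
    for Iteration = 1:500
        W = sqrt(1 ./ max(abs(Y - X * B), 1e-8));  % small floor avoids division by zero
        NewB = bsxfun(@times, X, W) \ (W .* Y);    % weighted least-squares step
        if norm(NewB - B) <= 1e-9 * max(norm(B), 1), B = NewB; break, end
        B = NewB;
    end
end

Each pass solves a weighted least-squares problem whose weights shrink the influence of points with large current errors, so the sequence of L-2 fits approximates the L-1 objective.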

L-1 regression is less affected by large errors than least-squares regression, as the following graph depicts:

[Figure: least-squares and L-1 lines fit to the same data; the least-squares line is pulled toward distant points, while the L-1 line is not.]

This example intentionally demonstrates least-squares' slavish chasing of distant data points, but the effect is very real. The biggest drawback of L-1 regression is that it takes longer to run. Unless there are many such regressions to perform, execution time is a small matter, which gets smaller every year that computers get faster. L1LinearRegression runs in about 10 seconds for 100,000 observations with 10 predictors on fast PC hardware.

References
Alternative Methods of Regression, by Birkes and Dodge (ISBN-13: 978-0471568810)

Least absolute deviation estimation of linear econometric models: A literature review, by Dasgupta and Mishra (Jun-2004)

See also
L1LinearRegression Code Update (Mar-27-2009)

Friday, January 05, 2007

Model Performance Measurement

A wide variety of model performance measures have been devised. Despite the popularity of mean squared error for numeric models and simple accuracy for classification models, there are many other choices. For my part, I generally prefer mean absolute error for numeric models, and the AUC (area under the ROC curve, for class separation) and informational loss (for probability assessment) for classification models.

This log entry is pretty much just a quick giveaway: I have constructed a generic performance-calculation MATLAB routine, SampleError.m. Its operation is straightforward, but I find it handy to have all of these measures in one routine, with the ability to switch among them via a simple parameter change. The use of this routine is simple and is explained by help SampleError, and it makes a great building block for modeling routines.
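
As a hypothetical usage sketch (the authoritative interface is whatever help SampleError reports; the argument order below is my assumption, not the routine's definitive signature):

% Hypothetical usage; assumed argument order: (predictions, targets, measure).
Actual    = [1; 0; 1; 1; 0];                   % illustrative target values
Predicted = [0.9; 0.2; 0.7; 0.4; 0.1];         % illustrative model outputs
MAE = SampleError(Predicted, Actual, 'L-1');   % mean absolute error
RMS = SampleError(Predicted, Actual, 'RMS');   % root mean squared error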

I update many of my MATLAB routines from time to time, and this one is no exception. Presently, though, the following performance measures are supported:

'L-1' (mean absolute error)
'L-2' (mean squared error)
'L-4'
'L-16'
'L-Infinity'
'RMS' (root mean squared error)
'AUC' (requires tiedrank() from Statistics Toolbox)
'Bias'
'Conditional Entropy'
'Cross-Entropy'
'F-Measure'
'Informational Loss'
'MAPE'
'Median Squared Error'
'Worst 10%'
'Worst 20%'

Note: I still need to verify the Cross-Entropy measure. The last two ('Worst 10%' and 'Worst 20%') are classification performance measures: the proportion of the target class found in the predicted most-likely 10% and 20% of cases, respectively.
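
To make the last two concrete, here is a minimal sketch of the 'Worst 10%' idea as described above (illustrative variable names; not necessarily how SampleError implements it):

% Proportion of the target class captured among the predicted most-likely 10%.
[Sorted, Order] = sort(Predicted, 'descend');         % rank cases by predicted likelihood
TopDecile = Order(1:ceil(0.10 * numel(Predicted)));   % indices of the top 10%
Worst10 = sum(Actual(TopDecile) == 1) / sum(Actual == 1);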

Incidentally, I'd appreciate any feedback on any of the code in this Web log, whether it be about typos, outright coding errors, or efficiency issues. Also, please send suggestions for additional measures.