**Why least squares?**

Least-squares offers a number of esoteric technical strengths, but many students of statistics wonder: "Why least-squares?" The simplest (and most superficial) answer is: "Squaring the errors makes them all positive, so that errors with conflicting signs do not cancel each other out in sums or means". While this is true, squaring should seem an odd way to go about this when taking the absolute values of the errors (simply ignoring the signs) is much more straightforward.

Taking the absolute values of the errors (instead of their squares) leads to an alternative regression procedure, known as

*least absolute errors regression*or

*L-1 linear regression*. Like least-squares linear regression, L-1 linear regression fits a line to the supplied data points. Taking the absolute values seems simpler, so why

*not*use L-1 regression? For that matter, why is lest-squares regression so popular, given the availability of seemingly more natural alternative?

Despite the fact that L-1 regression was developed decades

*before*least squares regression, least-squares regression is much more widely used today. Though L-1 regression has a few quirks, they are not what is holding it back. The secret

*real*reason that least squares is favored, which your stats professor never told you is:

**Least-squares makes the calculus behind the fitting process extremely easy!**

That's it. Statisticians will give all manner of rationalizations, but the real reason least-squares regression is in vogue, is that it is extremely easy to calculate.

**L-1 Regression**

There are several ways to perform the L-1 regression, and all of them involve more computation than any of the least-squares procedures. Thankfully, we live in an age in which mechanical computation is plentiful and cheap! Also thankfully, I have written an L-1 regression routine in MATLAB, called L1LinearRegression.

*L1LinearRegression*assumes that an intercept term is to be included and takes two parameters: the independent variables (a matrix whose columns represent the independent variables) and the dependent variable (in a column vector).

L-1 regression is less affected by large errors than least squares regression. The following graph depicts this behavior (click to enlarge):

This example intentionally demonstrates least-squares' slavish chasing of distant data points, but the effect is very real. The biggest drawback of L-1 regression is that it takes longer to run. Unless there are many such regressions to perform, execution time is a small matter, which gets smaller every year that computers get faster.

*L1LinearRegression*runs in about 10 seconds for 100,000 observations with 10 predictors on fast PC hardware.

**References**

*Alternative Methods of Regression*, by Birkes and Dodge (ISBN-13: 978-0471568810)

Least absolute deviation estimation of linear econometric models: A literature review, by Dasgupta and Mishra (Jun-2004)

**See also**

L1LinearRession Code Update (Mar-27-2009)

## 9 comments:

it is indeed true that least squares regression results in a particularly easy computation of the optimal weights of the hyperplane to be learned. However, one should not forget that the least squares solution to regression is also the maximum likelihood solution!

I believe (someone correct me if I'm wrong) that the least squares regression is only the maximum likelihood solution when the errors have a Gaussian distribution.

Regardless, the performance function should be defined by the needs of the analysis, and the least absolute error is a logical choice for many problems.

The L-1 linear regression solution is the maximum likelihood solution when the errors are Laplace (double exponential) distributed.

Thanks, Finn Årup Nielsen!

Hi All,

I was wondering how do I get such information that the least squares method is the maximum likelihood estimation if the errors are laplace... am startin in this field n I like to understand what am doing.. not just plug data into equations...

Thanks

E.A.

Your L1LinearRegression function finds the L1 solution using a sequence of weighted least squares problems. The more standard approach is to solve a linear programming problem. Could you compare the two methods? E.g. is your method guaranteed to converge? How do run times compare?

Okay, I've found several iterative algorithms, for example in the survey by Dasgupta et al. http://mpra.ub.uni-muenchen.de/1781/1/MPRA_paper_1781.pdf. The 2004 paper by Li and Arse http://downloads.hindawi.com/journals/asp/2004/948982.pdf points out that the naive algorithm can get stuck in a local minimum and suggests improvements. Several methods are compared by Kuzmanovic et al. http://journal.aplimat.com/volume_2_2009/.../Kuzmanovic_Sabo.pdf

Hello,

I really much want to know how does those function works with 3D data points??

Post a Comment