In my work, I have frequently needed to fit curves to data. Curve-fitting is a broad and surprisingly subtle subject. The two most common types of single-input curve-fitting are simple (linear and non-linear) regression and splines.
Regressions involve the discovery of optimal coefficients for some functional form mapping the input variable to the output variable. Regressions provide a number of advantages, but these advantages are generally qualified:
1. Common functions (linear, logistic, exponential) are well-studied and widely available. More exotic or arbitrary functions are harder to find software for.
2. Regressions provide optimal fit to data, assuming that one wants a least-squares fit. Some software, notably the MATLAB Curve Fitting Toolbox, will provide other, less common fits, such as mean absolute error (L-1) and robust fits.
3. Regressions are the safest bet for extrapolation. Extrapolation is a hazardous endeavor, but well-behaved regression functions generate the most sensible guess in "the far, unlit unknown".
4. Complex regression curves, such as high-order polynomials, can be ill-behaved, especially at the extremes of the data range. While there are methods of combating such problems (such as fitting at the Chebyshev points), these are not widely available in commercial software.
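As a concrete baseline, here is a minimal least-squares regression in base MATLAB, on synthetic data (polyfit and polyval are standard functions):

```matlab
% Least-squares fit of a cubic polynomial in base MATLAB
x = linspace(0, 10, 50)';            % input variable
y = sin(x) + 0.1 * randn(size(x));   % noisy output variable
p = polyfit(x, y, 3);                % optimal (least-squares) coefficients
yHat = polyval(p, x);                % evaluate the fitted polynomial
```

Note that polyfit commits us to a specific functional form (here, a cubic): choosing that form is exactly the burden discussed below.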
Splines, which are especially popular in engineering, are connected series of simple curves, most often low-order polynomials. To maintain continuity, these simple curves must come together at points called knots. The advantage of splines is that they are extremely flexible: complex curves require complex regression functions, but splines can handle very complicated shapes simply by connecting several simple curves. Splines may pass directly through data points, or between them. Like regressions, splines have their qualified advantages:
1. Splines are extremely flexible, but the number and placement of the knots must be chosen.
2. Technically, most splines are undefined outside the range of the knots.
3. Most commercial software is oriented toward locating a spline through a small number of given data points. "Fitting" through a large number of data points is not commonly available.
Additionally, base MATLAB directly supports the construction and evaluation of splines.
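For instance, stringing a cubic spline through a handful of points takes only a couple of calls to base MATLAB's spline and ppval:

```matlab
% Interpolating spline through a few points in base MATLAB
x = [0 1 2 3 4];                  % knot input values
y = [0 0.8 0.9 0.1 -0.7];         % values the spline must pass through
pp = spline(x, y);                % cubic spline as a piecewise polynomial
yq = ppval(pp, linspace(0, 4));   % evaluate anywhere in the knot range
```

This illustrates the low-level nature of such routines: the spline is pinned exactly to a small set of points, rather than fitted through a large, noisy data set.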
Wanting to free myself from having to select the right regression function every time I needed to fit a curve, I decided to turn to splines for my curve-fitting needs. The problem, as mentioned above, is that almost all spline routines and canned software are very low-level, "fitting" (if it can be called that) splines by stringing them through a very small number of selected points. My solution, GASplineFit.m, builds on MATLAB's built-in spline routines and will generate splines which optimize any error function handled by my SampleError routine (see my posting of Jan-05-2007, Model Performance Measurement). Note: GASplineFit requires the subroutine LogRand.m.
With GASplineFit, the user supplies:
1. An input variable
2. An output variable
3. The input values of the knots (I often just use linspace for these)
4. The spline type (such as 'pchip' or 'spline': see MATLAB's spline routines)
5. A selected performance measure ('MSE', 'L-1', etc.).
There are two additional parameters, which control the genetic algorithm that performs the fitting. I usually don't change these from 250 and 80. For better but slower fitting, turn these parameters up; for faster but cruder fitting, turn them down. See help GASplineFit for details.
Essentially, the software slides the spline knots up and down, trying to find the best fit. The routine returns:
1. The fitted spline as a MATLAB piecewise polynomial (see: help ppval)
2. The knot output values
3. The assessed error
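Putting these pieces together, a typical call might look like the sketch below. The argument order shown here is an assumption based on the lists above, so check help GASplineFit for the actual signature:

```matlab
% Hypothetical call to GASplineFit (argument order assumed;
% see 'help GASplineFit' for the real signature)
X = linspace(0, 2*pi, 200)';            % input variable
Y = sin(X) + 0.1 * randn(size(X));      % output variable
KnotX = linspace(min(X), max(X), 6);    % knot input values (linspace, as suggested)
[FitPP, KnotY, Err] = GASplineFit(X, Y, KnotX, 'pchip', 'MSE', 250, 80);
YHat = ppval(FitPP, X);                 % evaluate the returned piecewise polynomial
```

Because the first return value is a standard MATLAB piecewise polynomial, the fitted curve can be evaluated anywhere with ppval, with no further dependence on the training data.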
Notice one very nice feature of this routine: it will fit probability curves directly to 0/1 data. Very often, the statistician is faced with a large collection of examples involving a dummy variable, which indicates membership in a class of interest, and an explanatory variable. To smooth the data and approximate the probability of class membership, common practice is to bin the explanatory variable and assess the mean for each bin. Fancier methods either fit a curve to these binned values, which isn't very smooth, or use a kernel regression, which is smooth but typically difficult to encapsulate as code without taking the entire data set along for the ride. GASplineFit solves all of these problems. The genetic algorithm which is its engine doesn't care that our data may be only zeros and ones: it only cares about optimizing the performance function.
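As a sketch of the 0/1 case (again, the GASplineFit argument order is assumed; see help GASplineFit), the class dummy goes in directly as the output variable, with no binning step:

```matlab
% Hypothetical: fit a probability curve straight to 0/1 class data
% (GASplineFit argument order assumed; see 'help GASplineFit')
X = rand(500, 1) * 10;                  % explanatory variable
Class = double(rand(500, 1) < X / 10);  % 0/1 dummy: P(membership) rises with X
KnotX = linspace(0, 10, 8);             % knot input values
[ProbPP, KnotY, Err] = GASplineFit(X, Class, KnotX, 'pchip', 'MSE', 250, 80);
Prob = ppval(ProbPP, X);                % smoothed class-membership probability
```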
help GASplineFit explains the specifics of this routine, and provides an example of its use. MATLAB programmers beginning with genetic algorithms may be interested in examining the code, which uses an elitist genetic algorithm.