In my work, I have frequently needed to fit curves to data. Curve-fitting is a broad and surprisingly subtle subject. The two most common types of single-input curve-fitting are simple (linear and non-linear) regression and splines.
Regressions involve the discovery of optimal coefficients for some functional form mapping the input variable to the output variable. Regressions provide a number of advantages, but these advantages are generally qualified:
1. Common functions (linear, logistic, exponential) are well-studied and widely available. More exotic or arbitrary functions are harder to find software for.
2. Regressions provide optimal fit to data, assuming that one wants a least-squares fit. Some software, notably the MATLAB Curve Fitting Toolbox, will provide other, less common fits, such as mean absolute error (L-1) and robust fits.
3. Regressions are the safest bet for extrapolation. Extrapolation is a hazardous endeavor, but well-behaved regression functions generate the most sensible guess in "the far, unlit unknown".
4. Complex regression curves, such as high-order polynomials, can be ill-behaved, especially at the extremes of the data range. While there are methods of combating such problems (such as fitting at the Chebyshev points), these are not widely available in commercial software.
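As a concrete baseline, here is a minimal least-squares regression in base MATLAB, on synthetic data (polyfit and polyval are standard functions):

```matlab
% Least-squares fit of a cubic polynomial in base MATLAB
x = linspace(0, 10, 50)';            % input variable
y = sin(x) + 0.1 * randn(size(x));   % noisy output variable
p = polyfit(x, y, 3);                % optimal (least-squares) coefficients
yHat = polyval(p, x);                % evaluate the fitted polynomial
```

Note that polyfit commits us to a specific functional form (here, a cubic): choosing that form is exactly the burden discussed below.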
Splines, which are especially popular in engineering, are connected series of simple curves, most often low-order polynomials. To maintain continuity, these simple curves must come together at points called knots. The advantage of splines is that they are extremely flexible: complex curves require complex regression functions, but splines can handle very complicated shapes simply by connecting several simple curves. Splines may pass directly through data points, or between them. Like regressions, splines have their qualified advantages:
1. Splines are extremely flexible, but the number and placement of the knots must be chosen.
2. Technically, most splines are undefined outside the range of the knots.
3. Most commercial software is oriented toward locating a spline through a small number of given data points. "Fitting" through a large number of data points is not commonly available.
Additionally, base MATLAB directly supports the construction and evaluation of splines.
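For instance, stringing a cubic spline through a handful of points takes only a couple of calls to base MATLAB's spline and ppval:

```matlab
% Interpolating spline through a few points in base MATLAB
x = [0 1 2 3 4];                  % knot input values
y = [0 0.8 0.9 0.1 -0.7];         % values the spline must pass through
pp = spline(x, y);                % cubic spline as a piecewise polynomial
yq = ppval(pp, linspace(0, 4));   % evaluate anywhere in the knot range
```

This illustrates the low-level nature of such routines: the spline is pinned exactly to a small set of points, rather than fitted through a large, noisy data set.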
Wanting to free myself from having to select the right regression function every time I needed to fit a curve, I decided to turn to splines for my curve-fitting needs. The problem, as mentioned above, is that almost all spline routines and canned software are very low-level, "fitting" (if it can be called that) splines by stringing them through a very small number of selected points. My solution, GASplineFit.m, builds on MATLAB's built-in spline routines and will generate splines which optimize any error function handled by my SampleError routine (see my posting of Jan-05-2007, Model Performance Measurement). Note: GASplineFit requires the subroutine LogRand.m.
With GASplineFit, the user supplies:
1. An input variable
2. An output variable
3. The input values of the knots (I often just use linspace for these)
4. The spline type (such as 'pchip' or 'spline': see MATLAB's spline routines)
5. A selected performance measure ('MSE', 'L-1', etc.).
There are two additional parameters, which control the genetic algorithm that performs the fitting. I usually don't change these from 250 and 80. For better but slower fitting, turn these parameters up; for faster but cruder fitting, turn them down. See help GASplineFit for details.
Essentially, the software slides the spline knots up and down, trying to find the best fit. The routine returns:
1. The fitted spline as a MATLAB piecewise polynomial (see: help ppval)
2. The knot output values
3. The assessed error
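Putting these pieces together, a typical call might look like the sketch below. The argument order shown here is an assumption based on the lists above, so check help GASplineFit for the actual signature:

```matlab
% Hypothetical call to GASplineFit (argument order assumed;
% see 'help GASplineFit' for the real signature)
X = linspace(0, 2*pi, 200)';            % input variable
Y = sin(X) + 0.1 * randn(size(X));      % output variable
KnotX = linspace(min(X), max(X), 6);    % knot input values (linspace, as suggested)
[FitPP, KnotY, Err] = GASplineFit(X, Y, KnotX, 'pchip', 'MSE', 250, 80);
YHat = ppval(FitPP, X);                 % evaluate the returned piecewise polynomial
```

Because the first return value is a standard MATLAB piecewise polynomial, the fitted curve can be evaluated anywhere with ppval, with no further dependence on the training data.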
Notice one very nice feature of this routine: it will fit probability curves directly to 0/1 data. Very often, the statistician is faced with a large collection of examples involving a dummy variable, which indicates membership in a class of interest, and an explanatory variable. To smooth the data and approximate the probability of class membership, common practice is to bin the explanatory variable and assess the mean for each bin. Fancier methods either fit a curve to these binned values, which isn't very smooth, or use a kernel regression, which is smooth but typically difficult to encapsulate as code without taking the entire data set along for the ride. GASplineFit solves all of these problems. The genetic algorithm which is its engine doesn't care that our data may be only zeros and ones: it only cares about optimizing the performance function.
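As a sketch of the 0/1 case (again, the GASplineFit argument order is assumed; see help GASplineFit), the class dummy goes in directly as the output variable, with no binning step:

```matlab
% Hypothetical: fit a probability curve straight to 0/1 class data
% (GASplineFit argument order assumed; see 'help GASplineFit')
X = rand(500, 1) * 10;                  % explanatory variable
Class = double(rand(500, 1) < X / 10);  % 0/1 dummy: P(membership) rises with X
KnotX = linspace(0, 10, 8);             % knot input values
[ProbPP, KnotY, Err] = GASplineFit(X, Class, KnotX, 'pchip', 'MSE', 250, 80);
Prob = ppval(ProbPP, X);                % smoothed class-membership probability
```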
help GASplineFit explains the specifics of this routine, and provides an example of its use. MATLAB programmers beginning with genetic algorithms may be interested in examining the code, which uses an elitist genetic algorithm.