Data Mining in MATLAB: July 2012

Perhaps the most fundamental statistical summary beyond simple counting or totaling is the mean. The mean reduces a collection of numbers to a single value, and is one of a number of measures of location. The mean is by far the most commonly used and widely understood way of averaging data, but it is not the only one, nor is it always the "best" one. In terms of popularity, the median is a distant second, but it offers a mixture of behaviors which make it an appealing alternative in many circumstances.

One important property of the median is that it is not affected at all by extreme values. However we change an observation, the change has zero effect on the sample median unless the observation is itself a middle value, or it moves from being above the median to below it, or vice versa. If the maximum value in a data set is multiplied by a million, the median will not change. This behavior is usually characterized as a benefit of the median, as it means that noisy or extreme values will not tug at the median. This is in stark contrast to the sample mean, which blows around in the wind of any change to the data set. Indeed, this is one of the nice things about working with the median: it is highly resistant to noise. Also typically cast as a benefit is the median's resistance to being tugged to extreme values by long-tailed distributions. Whether this is truly a strength or a weakness, though, I will leave to the analyst to decide, since capturing the nature of a long tail may be important for some problems.
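As a quick check of the claim about extreme values, here is a minimal Python sketch using only the standard library. The sample values are made up purely for illustration.

```python
from statistics import mean, median

# Hypothetical sample, chosen only for illustration.
data = [3, 5, 7, 8, 12, 13, 40]

# Multiply the maximum value by one million.
inflated = sorted(data)
inflated[-1] *= 1_000_000

# Move the largest observation from above the median to below it.
crossed = [1 if x == 40 else x for x in data]

print("median, original vs. inflated maximum:", median(data), median(inflated))
print("mean, original vs. inflated maximum:  ", mean(data), mean(inflated))
print("median after the maximum crosses over:", median(crossed))
```

The median only moves in the last case, where an observation switches sides; the mean responds to every change.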

Another important quality of the sample median is that, for odd numbers of observations, it is always a value from the original data set. This is also true for many data sets with even numbers of observations. In some situations, this is desirable. Data summaries being shown to non-technical people often provide more aesthetic appeal if there are no "odd" values. Consider a retail store in which all products are priced with .99 at the end, yet the reported mean ends in .73. This is normally not an issue with the median. Replacing missing values with means very likely introduces new values to the data set, and this is especially awkward when dealing with variables which present limited distributions, such as counts. As an example, consider a database with missing values for "number of children", which (one would expect!) would always be an integer. Substituting the mean for missing values may result in observations sporting "1.2" children. This is not a problem with the median. Note that this behavior has a darker side for signal- and image-processing problems: data treated with a windowed median tend to form stepped plateaus, rather than smooth curves. Such artifacts can be distracting or worse.
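The "number of children" example can be sketched in a few lines of Python. The column below is invented for illustration, with `None` standing in for a missing value; `median_low` is the standard-library variant that always returns an actual data value, which may help in the even-sized case.

```python
from statistics import mean, median, median_low

# Hypothetical "number of children" column; None marks a missing value.
children = [0, 2, 1, None, 3, 1, None, 0, 2]

observed = [c for c in children if c is not None]
fill_mean = mean(observed)      # generally not a whole number
fill_median = median(observed)  # odd count, so this is one of the observed values

imputed_mean = [fill_mean if c is None else c for c in children]
imputed_median = [fill_median if c is None else c for c in children]

print("filled with mean:  ", imputed_mean)
print("filled with median:", imputed_median)

# With an even count, the median averages the two middle values,
# so it can fall between observations; median_low avoids that.
even_sample = [1, 2, 3, 4]
print(median(even_sample), median_low(even_sample))
```

Here the median-filled column still contains only whole numbers, while the mean-filled column gains a fractional child count.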

On the downside, the median is not always the most statistically efficient summary. This is a technical issue which means simply that the median may "wander" from the "true" value (the population parameter) more than other summaries when finite data are available. For instance, when the data are known to be drawn from a true normal distribution, the sample mean is known to "wander" the least from the true value. I'm sparing the reader the technical details here, but suffice it to say that, though the mean or median might be closer to the "true" value in any particular situation, the mean is likely to be closer most of the time. Statistical efficiency may be measured, and it is worth noting that different summaries achieve different relative efficiencies, depending on the size of the data sample, and the shape of the population distribution.
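To make the "wandering" concrete, the following is a small simulation sketch in Python, not a rigorous study. The sample size, number of trials, and random seed are arbitrary assumptions. It draws many samples from a standard normal distribution (true value 0) and compares how much the sample mean and sample median vary from sample to sample.

```python
import random
from statistics import mean, median, pvariance

random.seed(0)   # fixed seed so the sketch is repeatable

TRIALS = 2000    # number of simulated samples (arbitrary choice)
N = 25           # observations per sample (arbitrary choice)

means, medians = [], []
for _ in range(TRIALS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    means.append(mean(sample))
    medians.append(median(sample))

# Spread of each summary around the true value across samples.
var_mean = pvariance(means)
var_median = pvariance(medians)

# Fraction of samples in which the mean landed closer to the true value (0).
mean_closer = sum(abs(m) < abs(md) for m, md in zip(means, medians)) / TRIALS

print("variance of sample means:  ", var_mean)
print("variance of sample medians:", var_median)
print("relative efficiency of the median:", var_mean / var_median)
print("share of samples where the mean was closer:", mean_closer)
```

For normal data, the large-sample relative efficiency of the median is 2/π, roughly 0.64, so the ratio printed here should land in that neighborhood. Swapping in a long-tailed distribution for `random.gauss` can reverse the ranking.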

Hopefully, this short note has whetted your appetite for the median and other alternatives to the mean. I encourage the reader to explore yet other alternatives, such as trimmed means, of which the mean and median are special cases.
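The relationship between trimmed means, the mean, and the median can be seen in a short Python sketch. The `trimmed_mean` helper below is a hypothetical illustration written for this note, not a library function, and it trims by a count `k` of observations from each end rather than by a percentage.

```python
from statistics import mean, median

def trimmed_mean(values, k):
    """Mean of the values left after dropping the k smallest and k largest."""
    ordered = sorted(values)
    if k < 0 or 2 * k >= len(ordered):
        raise ValueError("k must leave at least one observation")
    return mean(ordered[k:len(ordered) - k])

# Hypothetical sample, chosen only for illustration.
data = [3, 5, 7, 8, 12, 13, 40]

print("no trimming (the mean):       ", trimmed_mean(data, 0), mean(data))
print("trim one value from each end: ", trimmed_mean(data, 1))
print("maximal trimming (the median):", trimmed_mean(data, 3), median(data))
```

With no trimming the result is the ordinary mean. Trimming as much as possible leaves only the middle value (or the two middle values for an even count), which gives the median.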


Tuesday, July 31, 2012

The Good and the Bad of the Median