Sunday, January 13, 2013

Ranking as a Pre-Processing Tool

To squeeze the most from data, analysts will often modify raw variables using mathematical transformations.  For example, in Data Analysis and Regression, Tukey described what he terms a "ladder of re-expression" which included a series of low-order roots and powers (and the logarithm) intended to adjust distributions to permit better results using linear regression.  Univariate adjustments using those particular transformations are fairly popular today, and are even incorporated directly into some machine learning software as part of the solution search.  The modified variables are fed to the analysis (modeling algorithm, clustering, visualization, etc.) just like the raw variables.

Another possible transformation which can be quite useful, but which enjoys much less popularity is ranking.  In fact, entire modeling procedures have been built around ranked data (see, for instance, the material on Theil's regression described in Applied Nonparametric Statistical Methods, by Sprent).  One nice thing about ranking is that it ignores the shape of the distribution and considers only the order of the values.  This is a natural fit to any monotonic relationship (where the graph of the data moves only up, or only down), which covers a wide variety of behaviors.  Even when the data does not meet this condition, ranking can be quite helpful.

When using ranking as a pre-processing step, one important question to answer is: How do we rank new data values?  The original ranking is performed on the training data only.  One solution is to interpolate between the ranks of the training values.  If a new value matches one from the training set, then it receives that value's rank.  If the new value instead lies between training values, we interpolate linearly between them.  Note that this results in "ranks" which are not integers.  For instance, if the new value to be ranked is 111, and the two nearest values are 110 (rank: 8) and 112 (rank: 7), then the interpolated rank is 7.5.

One drawback of this approach is, of course, that the entire set of original values and their corresponding ranks must be carried along as part of the transformation process.  It may be more economical to only retain a subset of the original data- keeping enough (perhaps 100 or less value-rank pairs) to allow the simplified ranking to retain the shape of the data distribution.

MATLAB users with the Statistics Toolbox can use the tiedrank function to perform ranking (tiedrank automatically handles ties, as well), and it is not difficult to construct a similar process using base MATLAB's sort function.  Base MATLAB also provides several very handy interpolation functions, such as interp1.

Keep in mind that univariate transformations, including ranking, typically do not affect logical machine learning solutions (decision tree induction and rule induction).  Since trees and rules usually test only one variable at a time, and most univariate transformations do not change the order of values, such transformations do not change the way the learning algorithm sees the data.

Tuesday, July 31, 2012

The Good and the Bad of the Median

Perhaps the most fundamental statistical summary beyond simple counting or totaling is the mean.  The mean reduces a collection of numbers to a single value, and is one of a number of measures of location.  The mean is by far the most commonly used and widely understood way of averaging data, but it is not the only one, nor is it always the "best" one.  In terms of popularity, the median is a distant second, but it offers a mixture of behaviors which make it an appealing alternative in many circumstances.

One important property of the median is that it is not affected- at all- by extreme values.  No matter how we change any observation, its value will have zero effect on the sample median unless it changes from being above the median to below it, or vice versa.  If the maximum value in a data set is multiplied by a million, the median will not change.  This behavior is usually characterized as a benefit of the median, as it means that noisy or extreme values will not tug at the median.  This is in stark contrast to the sample mean, which blows around in the wind of any change to the data set.  Indeed, this is one of the nice things about working with the median:  It is highly resistant to noise.  Also typically cast as a benefit is the median's resistance to being tugged to extreme values by long-tailed distributions.  Whether this is truly a strength or a weakness, though, I will leave to the analyst to decide, since capturing the nature of a long-tail may be important for some problems.

Another important quality of the sample median is that, for odd numbers of observations, it is always a value from the original data set.  This is also true for many data sets with even numbers of observations.  In some situations, this is desirable.  Data summaries being shown to non-technical people often provide more aesthetic appeal if there are no "odd" values.  Consider a retail store in which all products are prices with .99 at the end, yet the reported mean ends in .73- this is normally not an issue with the median.  Replacing missing values with means very likely introduces new values to the data set, and this is especially awkward when dealing with variables which present limited distributions, such as counts.  As an example, consider a database with missing values for "number of children", which (one would expect!) would always be an integer.  Substituting the mean for missings may result in observations sporting "1.2" children.  This is not a problem with the median.  Note that this behavior has a darker side for signal- and image-processing problems: data treated with a windowed median tend to form stepped plateaus, rather than smooth curves.  Such artifacts can be distracting or worse.

On the downside, the median is not always the most statistically efficient summary.  This is a technical issue which means simply that the median may "wander" from the "true" value (the population parameter) more than other summaries when finite data are available.  For instance, when the data are known to be drawn from a true normal distribution, the sample mean is known to "wander" the least from the true value.  I'm sparing the reader the technical details here, but suffice it to say that, though the mean or median might be closer to the "true" value in any particular situation, the mean is likely to be closer most of the time.  Statistical efficiency may be measured, and it is worth noting that different summaries achieve different relative efficiencies, depending on the size of the data sample, and the shape of the population distribution.

Hopefully, this short note has whet your appetite for the median and other alternatives to the mean.  I encourage the reader to explore yet other alternatives, such as trimmed means, of which the mean and median are special cases.

Saturday, December 11, 2010

Linear Discriminant Analysis (LDA)

Overview

Linear discriminant analysis (LDA) is one of the oldest mechanical classification systems, dating back to statistical pioneer Ronald Fisher, whose original 1936 paper on the subject, The Use of Multiple Measurements in Taxonomic Problems, can be found online (for example, here).

The basic idea of LDA is simple: for each class to be identified, calculate a (different) linear function of the attributes. The class function yielding the highest score represents the predicted class.

There are many linear classification models, and they differ largely in how the coefficients are established. One nice quality of LDA is that, unlike some of the alternatives, it does not require multiple passes over the data for optimization. Also, it naturally handles problems with more than two classes and it can provide probability estimates for each of the candidate classes.

Some analysts attempt to interpret the signs and magnitudes of the coefficients of the linear scores, but this can be tricky, especially when the number of classes is greater than 2.

LDA bears some resemblance to principal components analysis (PCA), in that a number of linear functions are produced (using all raw variables), which are intended, in some sense, to provide data reduction through rearrangement of information. (See the Feb-26-2010 posting to this log, Principal Components Analysis.) Note, though, some important differences: First, the objective of LDA is to maximize class discrimination, whereas the objective of PCA is to squeeze variance into as few components as possible. Second, LDA produces exactly as many linear functions as there are classes, whereas PCA produces as many linear functions as there are original variables. Last, principal components are always orthogonal to each other ("uncorrelated"), while that is not generally true for LDA's linear scores.


An Implementation

I have made available on MATLAB Central, a routine, aptly named LDA which performs all the necessary calculations. I'd like to thank Deniz Seviş, whose prompting got me to finally write this code (with her) and whose collaboration is very much appreciated.

Note that the LDA function assumes that the data its being fed is complete (no missing values) and performs no attribute selection. Also, it requires only base MATLAB (no toolboxes needed).

Use of LDA is straightforward: the programmer supplies the input and target variables and, optionally, prior probabilities. The function returns the fitted linear discriminant coefficients. help LDA provides a good example:


% Generate example data: 2 groups, of 10 and 15, respectively
X = [randn(10,2); randn(15,2) + 1.5]; Y = [zeros(10,1); ones(15,1)];

% Calculate linear discriminant coefficients
W = LDA(X,Y);


This example randomly generates an artificial data set of two classes (labeled 0 and 1) and two input variables. The LDA function fits linear discriminants to the data, and stores the result in W. So, what is in W? Let's take a look:


>> W

W =

-1.1997 0.2182 0.6110
-2.0697 0.4660 1.4718


The first row contains the coefficients for the linear score associated with the first class (this routine orders the linear functions the same way as unique()). In this model, -1.1997 is the constant and 0.2182 and 0.6110 are the coefficients for the input variables for the first class (class 0). Coefficients for the second class's linear function are in the second row. Calculating the linear scores is easy:


% Calulcate linear scores for training data
L = [ones(25,1) X] * W';


Each column represents the output of the linear score for one class. In this case, the first column is class 0, and the second column is class 1. For any given observation, the higher the linear score, the more likely that class. Note that LDA's linear scores are not probabilities, and may even assume negative values. Here are the values from my run:


>> L

L =

-1.9072 -3.8060
1.0547 3.2517
-1.2493 -2.0547
-1.0502 -1.7608
-0.6935 -0.8692
-1.6103 -2.9808
-1.3702 -2.4545
-0.2148 0.2825
0.4419 1.6717
0.2704 1.3067
1.0694 3.2670
-0.0207 0.7529
-0.2608 0.0601
1.2369 3.6135
-0.8951 -1.4542
0.2073 1.1687
0.0551 0.8204
0.1729 1.1654
0.2993 1.4344
-0.6562 -0.8028
0.2195 1.2068
-0.3070 0.0598
0.1944 1.2628
0.5354 2.0689
0.0795 1.0976


To obtain estimated probabilities, simply run the linear scores through the softmax transform (exponentiate everything, and normalize so that they sum to 1.0):


% Calculate class probabilities
P = exp(L) ./ repmat(sum(exp(L),2),[1 2]);


As we see, most of the first 10 cases exhibit higher probabilities for class 0 (the first column) than for class 1 (the second column) and the reverse is true for the last 15 cases:


>> P

P =

0.8697 0.1303
0.1000 0.9000
0.6911 0.3089
0.6705 0.3295
0.5438 0.4562
0.7975 0.2025
0.7473 0.2527
0.3782 0.6218
0.2262 0.7738
0.2619 0.7381
0.1000 0.9000
0.3157 0.6843
0.4205 0.5795
0.0850 0.9150
0.6363 0.3637
0.2766 0.7234
0.3175 0.6825
0.2704 0.7296
0.2432 0.7568
0.5366 0.4634
0.2714 0.7286
0.4093 0.5907
0.2557 0.7443
0.1775 0.8225
0.2654 0.7346


This model is not perfect, and would really need to be tested more rigorously (via holdout testing, k-fold cross validation, etc.) to determine how well it approximates the data.

I will not demonstrate its use here, but the LDA routine offers a facility for modifying the prior probabilities. Briefly, the function assumes that the true distribution of classes is whatever it observes in the training data. Analysts, however, may wish to adjust this distribution for several reasons, and the third, optional, parameter allows this. Note that the LDA routine presented here always performs the adjustment for prior probabilities: Some statistical software drops the adjustment for prior probabilities altogether if the user specifies that classes are equally likely, and will produce different results than LDA.


Closing Thoughts

Though it employs a fairly simple model structure, LDA has held up reasonably well, sometimes still besting more complex algorithms. When its assumptions are met, the literature records it doing better than logistic regression. It is very fast to execute and fitted models are extremely portable- even a spreadsheet will support linear models (...or, one supposes, paper and pencil!) LDA is at least worth trying at the beginning of a project, if for no other reason than to establish a lower bound on acceptable performance.


See Also

Feb-16-2010 posting, Single Neuron Training: The Delta Rule
Mar-15-2009 posting, Logistic Regression

Sunday, September 12, 2010

Reader Question: Putting Entropy to Work

Introduction

In response to my Nov-10-2006 posting, Introduction To Entropy, an anonymous reader asked:

Can we use entropy for distinguishing random signals and deterministic signal? Lets say i generate two signals in matlab. First signal using sin function and second using randn function. Can we use entropy to distinguish between these two signal?

The short answer is: Yes, we can use entropy for this purpose, although even simpler summary statistics would reveal that the normally distributed randn data included values outside of -1..+1, while the sin data did not.

In this article, I will be using my own entropy calculating routines, which can be found on MATLAB Central: Entropy, JointEntropy, ConditionalEntropy and MutualInformation.


A Slightly Harder Problem

To illustrate this application of entropy, I propose a slightly different problem, in which the sine data and the random data share the same distribution. To achieve this, the "random" data will be a random sample from the sine function:


>> X = [1:1000]';
>> Sine = sin(0.05 * X);
>> RandomData = sin(2 * pi * rand(size(X)));


As a quick check on the distributions, we will examine their respective histograms:


>> figure
>> subplot(2,1,1), hist(Sine), xlabel('Sine Value'), ylabel('Frequency'), grid on
>> subplot(2,1,2), hist(RandomData), xlabel('RandomData Value'), ylabel('Frequency'), grid on



Click image to enlarge.


More or less, they appear to match.


A First Look, Using Entropy

At this point, the reader may be tempted to calculate the entropies of the two distributions, and compare them. Since their distributions (as per the histograms) are similar, we should expect their entropies to also be similar.

To date, this Web log has only dealt with discrete entropy, yet our data is continuous. While there is a continuous entropy, we will stick with the simpler (in my opinion) discrete entropy for now. This requires that the real-valued numbers of our data be converted to symbols. We will accomplish this via quantization ("binning") to 10 levels:


>> Sine10 = ceil(10 * (Sine + 1) / 2);
>> RandomData10 = ceil(10 * (RandomData + 1) / 2);


If the MATLAB Statistics Toolbox is installed, one can check the resulting frequencies thus (I apologize for Blogger's butchering of the text formatting):


>> tabulate(Sine10)
Value Count Percent
1 205 20.50%
2 91 9.10%
3 75 7.50%
4 66 6.60%
5 60 6.00%
6 66 6.60%
7 66 6.60%
8 75 7.50%
9 91 9.10%
10 205 20.50%
>> tabulate(RandomData10)
Value Count Percent
1 197 19.70%
2 99 9.90%
3 84 8.40%
4 68 6.80%
5 66 6.60%
6 55 5.50%
7 68 6.80%
8 67 6.70%
9 82 8.20%
10 214 21.40%


It should be noted that other procedures could have been used for the signal-to-symbol conversion. For example, bin frequencies could have been made equal. The above method was selected because it is simple and requires no Toolbox functions. Also, other numbers of bins could have been utilized.

Now that the data is represented by symbols, we may check the earlier assertion regarding similar distributions yielding similar entropies (measured in bits per observation):


>> Entropy(Sine10)

ans =

3.1473

>> Entropy(RandomData10)

ans =

3.1418


As these are sample statistics, we would not expect them to match exactly, but these are very close.


Another Perspective

One important aspect of the structure of a sine curve is that it varies over time (or whatever the domain is). This means that any given sine value is typically very similar to those on either side. With this in mind, we will investigate the conditional entropy of each of these two signals versus themselves, lagged by one observation:


>> ConditionalEntropy(Sine10(2:end),Sine10(1:end-1))

ans =

0.6631

>> ConditionalEntropy(RandomData10(2:end),RandomData10(1:end-1))

ans =

3.0519


Ah! Notice that the entropy of the sine data, given knowledge of its immediate predecessor is much lower than the entropy of the random data, given its immediate predecessor. These data are indeed demonstrably different insofar as they behave over time, despite sharing the same distribution.

An astute reader may at this point notice that the conditional entropy of the random data, given 1 lagged value, is less than the entropy of the raw random data. This is an artifact of the finite number of samples and the quantization process. Given more observations and a finer quantization, this discrepancy between sample statistics and population statistics will shrink.

Entropy could have been applied to this problem other ways, too. For instance, one might calculate entropy for short time windows. I would point out that other, more traditional procedures might be used instead, such as calculating the auto-correlation for lag 1. It is worth seeing how entropy adds to the analyst's toolbox, though.

Further Reading

See also the Apr-01-2009 posting, Introduction to Conditional Entropy.

Print:

The Mathematical Theory of Communication by Claude Shannon (ISBN 0-252-72548-4)

Elements of Information Theory by Cover and Thomas (ISBN 0-471-06259)

Sunday, February 28, 2010

Putting PCA to Work

Context

The last posting to this Web log, Principal Components Analysis (Feb-26-2010), gave an overview of principal component analysis (PCA), and how to effect it within MATLAB. This article will cover three uses of PCA: 1. pre-processing for empirical modeling, 2. data compression and 3. noise suppression.

To serve the widest possible audience, this article will conduct PCA using only base MATLAB functions, but realize that users with the Statistics Toolbox have, as mentioned in the last posting, the option of using tools like princomp and zscore.

We will continue to use the very small data set used in the last article:


>> A = [269.8 38.9 50.5
272.4 39.5 50.0
270.0 38.9 50.5
272.0 39.3 50.2
269.8 38.9 50.5
269.8 38.9 50.5
268.2 38.6 50.2
268.2 38.6 50.8
267.0 38.2 51.1
267.8 38.4 51.0
273.6 39.6 50.0
271.2 39.1 50.4
269.8 38.9 50.5
270.0 38.9 50.5
270.0 38.9 50.5
];


We calculate the sample parameters, and standardize the data table:


>> [n m] = size(A)

n =

15


m =

3

>> AMean = mean(A)

AMean =

269.9733 38.9067 50.4800

>> AStd = std(A)

AStd =

1.7854 0.3751 0.3144


>> B = (A - repmat(AMean,[n 1])) ./ repmat(AStd,[n 1])

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636

Now that the data is centered with mean 0.0 and standard deviation 1.0, we perform the eigenanalysis of the sample covariances to determine the coefficient matrix which generates the principal components:


>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587


D =

0.0066 0 0
0 0.1809 0
0 0 2.8125


Recall that the MATLAB eig function orders information for the principal components from last to first when reading the columns from left to right. The matrix V contains the linear coefficients for the principal components. The diagonal of matrix D contains the variances for the principal components. So far, we have accomplished the principal components analysis itself. To put the PCA to use, we will want to know what proportion each principal component represents of total variance. We can do this by extracting and normalizing the diagonal of matrix D (we use flipud because the principal components are in "reverse" order):


>> cumsum(flipud(diag(D))) / sum(diag(D))

ans =

0.9375
0.9978
1.0000


We interpret the above column of numbers to mean that the first principal component
contains 93.75% of the total variance of the original data, the first two principal components together contain 99.78% and of course all principal components taken together have all of the variance (exactly as much as in the original standardized data).

Last, to calculate the principal components themselves, simply multiply the standardized data by the coefficient matrix:


>> PC = B * V

PC =

-0.0571 -0.0003 0.1026
-0.1277 -0.1226 -2.5786
0.0157 0.0543 0.0373
0.0536 0.1326 -1.7779
-0.0571 -0.0003 0.1026
-0.0571 -0.0003 0.1026
0.0704 -1.4579 0.5637
-0.1495 0.1095 1.6299
0.1041 0.2496 3.1841
0.0319 0.3647 2.4306
0.1093 0.2840 -3.1275
0.0892 0.2787 -0.8467
-0.0571 -0.0003 0.1026
0.0157 0.0543 0.0373
0.0157 0.0543 0.0373


To verify the condensing of the variance, calculate the sample variances:


>> var(PC)

ans =

0.0066 0.1809 2.8125


Again, note that the first principal component appears in the last column when using MATLAB's eig function, and columns to the left have less and less variance until the last principal component, stored in the first column.


Application: Pre-processing Data for Empirical Modeling

This application of PCA is simple: calculate the principal components and choose from them rather than the original data to construct the empirical model (regression, neural network, etc.). The (hoped for) advantage of doing this is that since PCA squeezes information into a subset of the new variables, less of them will be necessary to construct the model. In fact, it would not be unreasonable to simply step through the first so many principal components to build the model: First, use just the first principal component, then try the first and second, then the first, second and third, etc. A nice side benefit is that all the principal components are uncorrelated with each other.

As was mentioned in the last article, this may or may not work well, for several reasons: PCA may not be able to squeeze the variance much if the original variables are already highly uncorrelated with one another. Also, statistical variance may not be the same thing as "information" for the purposes of model building. Last, even if this process works, one is left with the reality that PCA needs all of the original variables to calculate the principal components, even if only a subset of them are used. Regardless, this is a data processing technique which can yield benefit, so it is worth trying.


Application: Data Compression

PCA offers a mechanism for performing lossy data compression. When data compression is "lossy", it may not return exactly the original data. The trade-off is that much greater compression can be achieved than with "lossless" data compression (compression in which the original data is returned exactly). In many cases, such as audio (MP3) and images (JPEG), some loss in fidelity is acceptable and greater compression is very much desired.

All compression schemes rely on the discovery of regularities within the data. In the case of PCA, the regularity is a linear relationship among the variables. To the extent that PCA finds this relationship, the data may be compressed. The idea is to discard the last principal components (those exhibiting the least variance).

In MATLAB, this means simply dropping the columns representing the unwanted principal components. In this case, we will retain only the first principal component:


>> VReduced = V(:,3)

VReduced =

-0.5825
-0.5904
0.5587

>> PCReduced = B * VReduced

PCReduced =

0.1026
-2.5786
0.0373
-1.7779
0.1026
0.1026
0.5637
1.6299
3.1841
2.4306
-3.1275
-0.8467
0.1026
0.0373
0.0373


Decompression is accomplished by inverting the process, which we can do by transposing the coefficient vector and multiplying:


>> PCReduced * VReduced'

ans =

-0.0598 -0.0606 0.0573
1.5020 1.5224 -1.4406
-0.0217 -0.0220 0.0209
1.0356 1.0497 -0.9933
-0.0598 -0.0606 0.0573
-0.0598 -0.0606 0.0573
-0.3284 -0.3328 0.3150
-0.9494 -0.9623 0.9106
-1.8547 -1.8799 1.7789
-1.4158 -1.4351 1.3580
1.8217 1.8465 -1.7473
0.4932 0.4999 -0.4730
-0.0598 -0.0606 0.0573
-0.0217 -0.0220 0.0209
-0.0217 -0.0220 0.0209

The result is not exactly the same as the original standardized data, but it is pretty close. We "un-standardize" by reversing the original standardization step:


>> Z = ((PCReduced * VReduced') .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])

Z =

269.8667 38.8840 50.4980
272.6550 39.4777 50.0270
269.9345 38.8984 50.4866
271.8223 39.3004 50.1677
269.8667 38.8840 50.4980
269.8667 38.8840 50.4980
269.3870 38.7818 50.5790
268.2783 38.5457 50.7663
266.6619 38.2016 51.0393
267.4455 38.3684 50.9070
273.2259 39.5992 49.9306
270.8539 39.0942 50.3313
269.8667 38.8840 50.4980
269.9345 38.8984 50.4866
269.9345 38.8984 50.4866


Again, the result is pretty similar to the original, but not exactly: about 94% of the variance has been preserved, and we have compressed the data to 33% of its original size.

The trade-off here is between compression (count of principal components retained) and compression fidelity (the variance preserved). In a typical application, there will be more variables and the variance compression is normally not quite as dramatic as in our illustration. This means that there will be more data compression "levels", represented by the number of principal components retained.


Application: Noise Suppression

Extending the data compression application, we may use PCA for noise suppression. The basic idea is that the variance captured by the least important principal components is noise which should be rejected. Assuming that the variables bear a linear relationship, they will lie in a line (plane, hyperplane) and noise items will lift them away from the line. Dropping the last principal components means flattening the data in a geometric sense and (hopefully) eliminating some of the noise.

This process is much like the data compression process described in the last section, except: 1. discarded components have their coefficients set to zero instead of being deleted outright and 2. the PCA coefficient matrix and its inverse are multiplied together to allow a single processing step which (again, hopefully) reduces noise in the data.

As before, we calculate the PCA coefficients:


>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587


D =

0.0066 0 0
0 0.1809 0
0 0 2.8125


Deciding to eliminate the last principal component, we set its coefficients to zero:


>> VDenoise = V; VDenoise(:,1) = 0

VDenoise =

0 0.4874 -0.5825
0 0.2963 -0.5904
0 0.8213 0.5587


This matrix will project the standardized data into a flat surface- in this case a plane, since we have retained 2 dimensions. Not wanting to bother with two steps, we multiply this matrix by its inverse, which in this case is easily obtained by taking the transpose:


>> VDenoise = VDenoise * VDenoise'

VDenoise =

0.5769 0.4883 0.0749
0.4883 0.4364 -0.0865
0.0749 -0.0865 0.9867


This magical matrix will, in a single matrix multiplication, denoise the standardized data:


>> B * VDenoise

ans =

-0.0599 -0.0607 0.0570
1.4422 1.4861 -1.5414
0.0047 -0.0060 0.0654
1.1002 1.0890 -0.8844
-0.0599 -0.0607 0.0570
-0.0599 -0.0607 0.0570
-1.0390 -0.7648 -0.8824
-0.8960 -0.9299 1.0005
-1.7330 -1.8060 1.9839
-1.2380 -1.3270 1.6575
1.9601 1.9307 -1.5141
0.6290 0.5825 -0.2442
-0.0599 -0.0607 0.0570
0.0047 -0.0060 0.0654
0.0047 -0.0060 0.0654


Naturally, we still need to multiply back the standard deviation and add back the mean to get to the original scale:


>> Z = ((B * VDenoise) .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])

Z =

269.8664 38.8839 50.4979
272.5483 39.4640 49.9954
269.9817 38.9044 50.5006
271.9377 39.3151 50.2019
269.8664 38.8839 50.4979
269.8664 38.8839 50.4979
268.1183 38.6198 50.2025
268.3736 38.5579 50.7946
266.8791 38.2293 51.1038
267.7630 38.4090 51.0012
273.4731 39.6308 50.0040
271.0964 39.1251 50.4032
269.8664 38.8839 50.4979
269.9817 38.9044 50.5006
269.9817 38.9044 50.5006


The degree of noise reduction is controlled by the number of principal components retained: the less principal components retained, the greater the noise reduction. Obviously, like all such schemes, this process has limitations and the big assumption here is that the original variables are linearly related so that noise stands out as a departure from this linearity.


Final Thoughts

PCA is a powerful tool, and is quickly computed on current computers, even on fairly large data. While there are limits to what it can do, it is a handy tool which is inexpensive in terms of compute time.


Further Reading

As a general reference on PCA see:

Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)

Note: The first edition is adequate for understanding and coding PCA, and is at present much cheaper than the second or third editions.


The noise suppression application is described in the article, Vectors help make sense of multiple signals, by Sullivan, Personal Engineering and Instrumentation News (Dec-1997), in which it is referred to as subspace projection.

See also the Dec-11-2010 posting, Linear Discriminant Analysis (LDA) .