Sunday, February 28, 2010

Putting PCA to Work

Context

The last posting to this Web log, Principal Components Analysis (Feb-26-2010), gave an overview of principal component analysis (PCA), and how to effect it within MATLAB. This article will cover three uses of PCA: 1. pre-processing for empirical modeling, 2. data compression and 3. noise suppression.

To serve the widest possible audience, this article will conduct PCA using only base MATLAB functions, but realize that users with the Statistics Toolbox have, as mentioned in the last posting, the option of using tools like princomp and zscore.

We will continue to use the very small data set used in the last article:


>> A = [269.8 38.9 50.5
272.4 39.5 50.0
270.0 38.9 50.5
272.0 39.3 50.2
269.8 38.9 50.5
269.8 38.9 50.5
268.2 38.6 50.2
268.2 38.6 50.8
267.0 38.2 51.1
267.8 38.4 51.0
273.6 39.6 50.0
271.2 39.1 50.4
269.8 38.9 50.5
270.0 38.9 50.5
270.0 38.9 50.5
];


We calculate the sample parameters, and standardize the data table:


>> [n m] = size(A)

n =

15


m =

3

>> AMean = mean(A)

AMean =

269.9733 38.9067 50.4800

>> AStd = std(A)

AStd =

1.7854 0.3751 0.3144


>> B = (A - repmat(AMean,[n 1])) ./ repmat(AStd,[n 1])

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636

Now that the data is centered with mean 0.0 and standard deviation 1.0, we perform the eigenanalysis of the sample covariances to determine the coefficient matrix which generates the principal components:


>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587


D =

0.0066 0 0
0 0.1809 0
0 0 2.8125


Recall that the MATLAB eig function orders information for the principal components from last to first when reading the columns from left to right. The matrix V contains the linear coefficients for the principal components. The diagonal of matrix D contains the variances for the principal components. So far, we have accomplished the principal components analysis itself. To put the PCA to use, we will want to know what proportion each principal component represents of total variance. We can do this by extracting and normalizing the diagonal of matrix D (we use flipud because the principal components are in "reverse" order):


>> cumsum(flipud(diag(D))) / sum(diag(D))

ans =

0.9375
0.9978
1.0000


We interpret the above column of numbers to mean that the first principal component
contains 93.75% of the total variance of the original data, the first two principal components together contain 99.78% and of course all principal components taken together have all of the variance (exactly as much as in the original standardized data).

Last, to calculate the principal components themselves, simply multiply the standardized data by the coefficient matrix:


>> PC = B * V

PC =

-0.0571 -0.0003 0.1026
-0.1277 -0.1226 -2.5786
0.0157 0.0543 0.0373
0.0536 0.1326 -1.7779
-0.0571 -0.0003 0.1026
-0.0571 -0.0003 0.1026
0.0704 -1.4579 0.5637
-0.1495 0.1095 1.6299
0.1041 0.2496 3.1841
0.0319 0.3647 2.4306
0.1093 0.2840 -3.1275
0.0892 0.2787 -0.8467
-0.0571 -0.0003 0.1026
0.0157 0.0543 0.0373
0.0157 0.0543 0.0373


To verify the condensing of the variance, calculate the sample variances:


>> var(PC)

ans =

0.0066 0.1809 2.8125


Again, note that the first principal component appears in the last column when using MATLAB's eig function, and columns to the left have less and less variance until the last principal component, stored in the first column.


Application: Pre-processing Data for Empirical Modeling

This application of PCA is simple: calculate the principal components and choose from them rather than the original data to construct the empirical model (regression, neural network, etc.). The (hoped for) advantage of doing this is that since PCA squeezes information into a subset of the new variables, less of them will be necessary to construct the model. In fact, it would not be unreasonable to simply step through the first so many principal components to build the model: First, use just the first principal component, then try the first and second, then the first, second and third, etc. A nice side benefit is that all the principal components are uncorrelated with each other.

As was mentioned in the last article, this may or may not work well, for several reasons: PCA may not be able to squeeze the variance much if the original variables are already highly uncorrelated with one another. Also, statistical variance may not be the same thing as "information" for the purposes of model building. Last, even if this process works, one is left with the reality that PCA needs all of the original variables to calculate the principal components, even if only a subset of them are used. Regardless, this is a data processing technique which can yield benefit, so it is worth trying.


Application: Data Compression

PCA offers a mechanism for performing lossy data compression. When data compression is "lossy", it may not return exactly the original data. The trade-off is that much greater compression can be achieved than with "lossless" data compression (compression in which the original data is returned exactly). In many cases, such as audio (MP3) and images (JPEG), some loss in fidelity is acceptable and greater compression is very much desired.

All compression schemes rely on the discovery of regularities within the data. In the case of PCA, the regularity is a linear relationship among the variables. To the extent that PCA finds this relationship, the data may be compressed. The idea is to discard the last principal components (those exhibiting the least variance).

In MATLAB, this means simply dropping the columns representing the unwanted principal components. In this case, we will retain only the first principal component:


>> VReduced = V(:,3)

VReduced =

-0.5825
-0.5904
0.5587

>> PCReduced = B * VReduced

PCReduced =

0.1026
-2.5786
0.0373
-1.7779
0.1026
0.1026
0.5637
1.6299
3.1841
2.4306
-3.1275
-0.8467
0.1026
0.0373
0.0373


Decompression is accomplished by inverting the process, which we can do by transposing the coefficient vector and multiplying:


>> PCReduced * VReduced'

ans =

-0.0598 -0.0606 0.0573
1.5020 1.5224 -1.4406
-0.0217 -0.0220 0.0209
1.0356 1.0497 -0.9933
-0.0598 -0.0606 0.0573
-0.0598 -0.0606 0.0573
-0.3284 -0.3328 0.3150
-0.9494 -0.9623 0.9106
-1.8547 -1.8799 1.7789
-1.4158 -1.4351 1.3580
1.8217 1.8465 -1.7473
0.4932 0.4999 -0.4730
-0.0598 -0.0606 0.0573
-0.0217 -0.0220 0.0209
-0.0217 -0.0220 0.0209

The result is not exactly the same as the original standardized data, but it is pretty close. We "un-standardize" by reversing the original standardization step:


>> Z = ((PCReduced * VReduced') .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])

Z =

269.8667 38.8840 50.4980
272.6550 39.4777 50.0270
269.9345 38.8984 50.4866
271.8223 39.3004 50.1677
269.8667 38.8840 50.4980
269.8667 38.8840 50.4980
269.3870 38.7818 50.5790
268.2783 38.5457 50.7663
266.6619 38.2016 51.0393
267.4455 38.3684 50.9070
273.2259 39.5992 49.9306
270.8539 39.0942 50.3313
269.8667 38.8840 50.4980
269.9345 38.8984 50.4866
269.9345 38.8984 50.4866


Again, the result is pretty similar to the original, but not exactly: about 94% of the variance has been preserved, and we have compressed the data to 33% of its original size.

The trade-off here is between compression (count of principal components retained) and compression fidelity (the variance preserved). In a typical application, there will be more variables and the variance compression is normally not quite as dramatic as in our illustration. This means that there will be more data compression "levels", represented by the number of principal components retained.


Application: Noise Suppression

Extending the data compression application, we may use PCA for noise suppression. The basic idea is that the variance captured by the least important principal components is noise which should be rejected. Assuming that the variables bear a linear relationship, they will lie in a line (plane, hyperplane) and noise items will lift them away from the line. Dropping the last principal components means flattening the data in a geometric sense and (hopefully) eliminating some of the noise.

This process is much like the data compression process described in the last section, except: 1. discarded components have their coefficients set to zero instead of being deleted outright and 2. the PCA coefficient matrix and its inverse are multiplied together to allow a single processing step which (again, hopefully) reduces noise in the data.

As before, we calculate the PCA coefficients:


>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587


D =

0.0066 0 0
0 0.1809 0
0 0 2.8125


Deciding to eliminate the last principal component, we set its coefficients to zero:


>> VDenoise = V; VDenoise(:,1) = 0

VDenoise =

0 0.4874 -0.5825
0 0.2963 -0.5904
0 0.8213 0.5587


This matrix will project the standardized data into a flat surface- in this case a plane, since we have retained 2 dimensions. Not wanting to bother with two steps, we multiply this matrix by its inverse, which in this case is easily obtained by taking the transpose:


>> VDenoise = VDenoise * VDenoise'

VDenoise =

0.5769 0.4883 0.0749
0.4883 0.4364 -0.0865
0.0749 -0.0865 0.9867


This magical matrix will, in a single matrix multiplication, denoise the standardized data:


>> B * VDenoise

ans =

-0.0599 -0.0607 0.0570
1.4422 1.4861 -1.5414
0.0047 -0.0060 0.0654
1.1002 1.0890 -0.8844
-0.0599 -0.0607 0.0570
-0.0599 -0.0607 0.0570
-1.0390 -0.7648 -0.8824
-0.8960 -0.9299 1.0005
-1.7330 -1.8060 1.9839
-1.2380 -1.3270 1.6575
1.9601 1.9307 -1.5141
0.6290 0.5825 -0.2442
-0.0599 -0.0607 0.0570
0.0047 -0.0060 0.0654
0.0047 -0.0060 0.0654


Naturally, we still need to multiply back the standard deviation and add back the mean to get to the original scale:


>> Z = ((B * VDenoise) .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])

Z =

269.8664 38.8839 50.4979
272.5483 39.4640 49.9954
269.9817 38.9044 50.5006
271.9377 39.3151 50.2019
269.8664 38.8839 50.4979
269.8664 38.8839 50.4979
268.1183 38.6198 50.2025
268.3736 38.5579 50.7946
266.8791 38.2293 51.1038
267.7630 38.4090 51.0012
273.4731 39.6308 50.0040
271.0964 39.1251 50.4032
269.8664 38.8839 50.4979
269.9817 38.9044 50.5006
269.9817 38.9044 50.5006


The degree of noise reduction is controlled by the number of principal components retained: the less principal components retained, the greater the noise reduction. Obviously, like all such schemes, this process has limitations and the big assumption here is that the original variables are linearly related so that noise stands out as a departure from this linearity.


Final Thoughts

PCA is a powerful tool, and is quickly computed on current computers, even on fairly large data. While there are limits to what it can do, it is a handy tool which is inexpensive in terms of compute time.


Further Reading

As a general reference on PCA see:

Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)

Note: The first edition is adequate for understanding and coding PCA, and is at present much cheaper than the second or third editions.


The noise suppression application is described in the article, Vectors help make sense of multiple signals, by Sullivan, Personal Engineering and Instrumentation News (Dec-1997), in which it is referred to as subspace projection.

See also the Dec-11-2010 posting, Linear Discriminant Analysis (LDA) .

25 comments:

Anonymous said...

Thanks fo rthe PCA explanations, I found it very useful.

кухни харьков said...

It's just an excellent blog! It just matches my work 100% and I can learn a lot here. Keep up a great job!!!

Josh said...

Clear description of principle components analysis. I don't know how this blog stayed under my radar for so long. Also like the data binning article you did in 2007 (which actually led me here). Looking forward to reading more and visiting often.

Lek said...

Very good explanation for PCA and example. Thank you. I am looking forward to see the other topic related to data mining ;)

Mincheol said...

Thanks for WONDERFUL explanation for PCA and example. This is what I was looking for.

Kristo said...

This is the best explanation on PCA. Thank you a lot for your contribution in explaining PCA.

Anonymous said...

Great stuff. I just used the eigenvectors of a PCA for creating a local coordinate system. I'm bookmarking the blog.

Caio said...

So good this tutorial. But, how can i do a graphic analysis using PCA?
Thanks.

walter said...

You got talent in explaining in a simple way; systematic and complete with applications. Thanks

zalik said...

Thank you, for the useful reading, and knowledge. A blog worth reading, for sure.

Rupam Das said...

This is such a great explanation on data mining with PCA! I had a question, can I use the technique shown by you and develop a complete code that enables data reduction on a sample pesticide database? Will you allow me to publish any data mining code in your blog as guest blogger? I have been contributing to math works regularly ( but in image processing). This blog of yours has a wide data mining audience. So I was just wondering, if you permit guest contributions! Might not be as high a quality in mining as you are posting, but can give some meaningful inputs none the less. Thank you!

Anonymous said...

Thanks alot for excellent description of PCA

Anonymous said...

Excellent explanation for PCA, thank you so much!!!

Anonymous said...

Really U R great! I understood What is PCA application for my research work(Pattern Recognition)..Thank U so much..
Kavitha

Alex said...

Hi Will,

I just found your blog, it's great! I have a question regarding PCA for dimensionality reduction. If I'm using PCA to reduce the dimensionality of my data and use it as training data for a classification algorithm, how should I standardize my test data before transforming with the principal components from the test data(assuming that the test example become available one by one)? Should I subtract the mean and divide by the standard deviation of the training data?

Anonymous said...

Hi Will
Thank you for explaining PCA wonderfully and example.

Anonymous said...

Hi Will, really helpful. I am trying to recover the strength of eigenvalue relative to original covariance matrix(my A).I have not seen anyone do this, but is it correct to multiply the eigenvalues, sort correctly and add the mean. The magnitude of eigenvalue is far reduced from if I calculate pca from Jacobi rotation. Thanks for your insight.

Anonymous said...

Hi Will,Need to restate my last post.My A is a covariance matrix. I am using the top 3 latent values which provide principal variances. Then I sqrt them and multiply by coeff to get the volatility of these components like volk=(diag(sqrt(latent(1:3)))*coeff(:,1:3)')'; The volk magnitude is too small because size of latent compared to what I obtain from Jacobi Rotation method. Can it be fixed so that PC can be used instead of the original factors?It seems like it should be a common problem.
Also,is it possible to identify columns from A that contributed the most to highest eigenvalue,ie, if the factor cannot be replaced for use.
Thanks vm.

Jason Tan said...

Thanks for the great explanation about PCA. But, I got 1 question to ask about the data compression. How do you calculate out the 33% of original data? Is there any formulae to calculate it? Your help will be so much appreciated.

Anonymous said...

Hi,

Can you please give me some idea about how can i get an empirical equation from my regression(say pcr) using PCA?

sai srujan said...
This comment has been removed by the author.
Prabhat Sharma said...

Thank you

Ila said...

Can you please explain about plotting and interpretation of PCA score plot.

Zhen Kang Wong said...

Thank you Will for the splendid explanation. I have a question however, with regard to the step on the standardisation you mentioned. If I have a data set of 2 classes (to classify them into 2 classes), do I still standardise them? And if I do, which mean and variance should I take against? Individually to subtract each or just use one of the classes to subtract out?

Anonymous said...

In this example you created n values by example 20 values
is possible to simulate 100xn values with CPA, This the principal goal of simulation using CPA
how to simulate or generate by example 10000 with 100 original data (100xn, n=length of original data)
Tar