Friday, February 26, 2010

Principal Components Analysis

Introduction

Real-world data sets usually exhibit relationships among their variables. These relationships are often linear, or at least approximately so, making them amenable to common analysis techniques. One such technique is principal component analysis ("PCA"), which rotates the original data to new coordinates, making the data as "flat" as possible.

Given a table of two or more variables, PCA generates a new table with the same number of variables, called the principal components. Each principal component is a linear combination of the original variables. The coefficients of the principal components are calculated so that the first principal component contains the maximum variance (which we may tentatively think of as the "maximum information"). The second principal component is calculated to have the second most variance, and, importantly, is uncorrelated (in a linear sense) with the first principal component. Further principal components, if there are any, exhibit decreasing variance and are uncorrelated with all other principal components.
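
In symbols (a standard formulation, not spelled out in the original post): writing the standardized data as a matrix X, the coefficient vector of the first principal component maximizes the variance of the resulting linear combination, and each later component does the same subject to being uncorrelated with its predecessors:

$$ w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Var}(Xw), \qquad w_k = \arg\max_{\lVert w \rVert = 1,\; \operatorname{Cov}(Xw,\,Xw_j) = 0 \text{ for } j < k} \operatorname{Var}(Xw) $$

The k-th principal component itself is then the column of scores X*w_k.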

PCA is completely reversible (the original data may be recovered exactly from the principal components), making it a versatile tool, useful for data reduction, noise rejection, visualization and data compression among other things. This article walks through the specific mechanics of calculating the principal components of a data set in MATLAB, using either the MATLAB Statistics Toolbox, or just the base MATLAB product.


Performing Principal Components Analysis


Performing PCA will be illustrated using the following data set, which consists of 3 measurements taken of a particular subject over time:


>> A = [269.8 38.9 50.5
272.4 39.5 50.0
270.0 38.9 50.5
272.0 39.3 50.2
269.8 38.9 50.5
269.8 38.9 50.5
268.2 38.6 50.2
268.2 38.6 50.8
267.0 38.2 51.1
267.8 38.4 51.0
273.6 39.6 50.0
271.2 39.1 50.4
269.8 38.9 50.5
270.0 38.9 50.5
270.0 38.9 50.5
];


We determine the size of this data set thus:


>> [n m] = size(A)

n =

15


m =

3


To summarize the data, we calculate the sample mean vector and the sample standard deviation vector:


>> AMean = mean(A)

AMean =

269.9733 38.9067 50.4800

>> AStd = std(A)

AStd =

1.7854 0.3751 0.3144


Most often, the first step in PCA is to standardize the data. Here, "standardization" means subtracting each variable's sample mean from its observations, then dividing by that variable's sample standard deviation. This centers and scales the data. Sometimes there are good reasons for modifying or not performing this step, but I recommend that you standardize unless you have a good reason not to. This is easy to perform, as follows:


>> B = (A - repmat(AMean,[n 1])) ./ repmat(AStd,[n 1])

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636


This calculation can also be carried out using the zscore function from the Statistics Toolbox:


>> B = zscore(A)

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636


Calculating the coefficients of the principal components and their respective variances is done by finding the eigenvectors and eigenvalues of the sample covariance matrix:


>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587


D =

0.0066 0 0
0 0.1809 0
0 0 2.8125


The matrix V contains the coefficients of the principal components, one component per column. The diagonal elements of D store the variances of the respective principal components. We can extract the diagonal like this:


>> diag(D)

ans =

0.0066
0.1809
2.8125


The coefficients and respective variances of the principal components could also be found using the princomp function from the Statistics Toolbox:


>> [COEFF SCORE LATENT] = princomp(B)

COEFF =

0.5825 -0.4874 0.6505
0.5904 -0.2963 -0.7507
-0.5587 -0.8213 -0.1152


SCORE =

-0.1026 0.0003 -0.0571
2.5786 0.1226 -0.1277
-0.0373 -0.0543 0.0157
1.7779 -0.1326 0.0536
-0.1026 0.0003 -0.0571
-0.1026 0.0003 -0.0571
-0.5637 1.4579 0.0704
-1.6299 -0.1095 -0.1495
-3.1841 -0.2496 0.1041
-2.4306 -0.3647 0.0319
3.1275 -0.2840 0.1093
0.8467 -0.2787 0.0892
-0.1026 0.0003 -0.0571
-0.0373 -0.0543 0.0157
-0.0373 -0.0543 0.0157


LATENT =

2.8125
0.1809
0.0066


Note three important things about the above:

1. The order of the principal components from princomp is the reverse of that from eig(cov(B)): princomp places the first principal component (the one with the largest variance) in column 1, whereas eig(cov(B)) stores it in the last column.

2. Some of the coefficients from each method have the opposite sign. This is fine: there is no "natural" orientation for principal components, so you can expect different software to produce different mixes of signs (a quick check follows this list).

3. SCORE contains the actual principal components, as calculated by princomp.
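
As a quick check of points 1 and 2 (a sketch added here for illustration, not part of the original output), reversing the column order of V should reproduce COEFF up to the sign of each column:


>> VReversed = fliplr(V);          % put the largest-variance component in column 1
>> abs(VReversed) - abs(COEFF)     % should be numerically zero, since columns can differ only in sign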

To calculate the principal components without princomp, simply multiply the standardized data by the matrix of principal component coefficients (COEFF is used here for comparability; V from eig(cov(B)) would give the same components in reverse column order, possibly with some signs flipped):


>> B * COEFF

ans =

-0.1026 0.0003 -0.0571
2.5786 0.1226 -0.1277
-0.0373 -0.0543 0.0157
1.7779 -0.1326 0.0536
-0.1026 0.0003 -0.0571
-0.1026 0.0003 -0.0571
-0.5637 1.4579 0.0704
-1.6299 -0.1095 -0.1495
-3.1841 -0.2496 0.1041
-2.4306 -0.3647 0.0319
3.1275 -0.2840 0.1093
0.8467 -0.2787 0.0892
-0.1026 0.0003 -0.0571
-0.0373 -0.0543 0.0157
-0.0373 -0.0543 0.0157


To reverse this transformation, simply multiply by the transpose of the coefficient matrix (this works because the coefficient matrix is orthogonal, so its transpose is its inverse):


>> (B * COEFF) * COEFF'

ans =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636


Finally, to get back to the original data, multiply each observation by the sample standard deviation vector and add the mean vector:


>> ((B * COEFF) * COEFF') .* repmat(AStd,[n 1]) + repmat(AMean,[n 1])

ans =

269.8000 38.9000 50.5000
272.4000 39.5000 50.0000
270.0000 38.9000 50.5000
272.0000 39.3000 50.2000
269.8000 38.9000 50.5000
269.8000 38.9000 50.5000
268.2000 38.6000 50.2000
268.2000 38.6000 50.8000
267.0000 38.2000 51.1000
267.8000 38.4000 51.0000
273.6000 39.6000 50.0000
271.2000 39.1000 50.4000
269.8000 38.9000 50.5000
270.0000 38.9000 50.5000
270.0000 38.9000 50.5000


This completes the round trip from the original data to the principal components and back to the original data. In some applications, the principal components are modified before the return trip.
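
As a minimal sketch of such a modification (not part of the original walk-through), suppose we keep only the first principal component and zero out the others before reversing the transformation. The result is then an approximation of the original data rather than an exact recovery:


>> k = 1;                                  % number of principal components to keep
>> SCOREReduced = SCORE;
>> SCOREReduced(:,k+1:end) = 0;            % discard (zero out) the trailing components
>> AApprox = (SCOREReduced * COEFF') .* repmat(AStd,[n 1]) + repmat(AMean,[n 1]);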

Let's consider what we've gained by making the trip to the principal component coordinate system. First, more variance has indeed been squeezed into the first principal component, which we can see by taking the sample variance of the principal components:


>> var(SCORE)

ans =

2.8125 0.1809 0.0066


The cumulative proportion of the total variance contained in the first several principal components can be easily calculated thus:


>> cumsum(var(SCORE)) / sum(var(SCORE))

ans =

0.9375 0.9978 1.0000


Interestingly, in this case the first principal component contains nearly 94% of the total variance in the table. A lossy data compression scheme which discarded the second and third principal components would compress 3 variables into 1, while losing only about 6% of the variance.
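
A related rule of thumb (not used above) is to retain just enough principal components to cover some target fraction of the total variance, say 95%. The count can be read off the cumulative proportions directly:


>> cumVar = cumsum(var(SCORE)) / sum(var(SCORE));
>> k = find(cumVar >= 0.95, 1)    % smallest number of components covering at least 95% of the variance

k =

2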

The other important thing to note about the principal components is that they are completely uncorrelated (as measured by the usual Pearson correlation), which we can test by calculating their correlation matrix:


>> corrcoef(SCORE)

ans =

1.0000 -0.0000 0.0000
-0.0000 1.0000 -0.0000
0.0000 -0.0000 1.0000


Discussion

PCA "squeezes" as much information (as measured by variance) as possible into the first principal components. In some cases the number of principal components needed to store the vast majority of variance is shockingly small: a tremendous feat of data manipulation. This transformation can be performed quickly on contemporary hardware and is invertible, permitting any number of useful applications.

For the most part, PCA really is as wonderful as it seems. There are a few caveats, however:

1. PCA doesn't always work well, in terms of compressing the variance. Sometimes variables just aren't related in a way which is easily exploited by PCA. This means that all or nearly all of the principal components will be needed to capture the multivariate variance in the data, making the use of PCA moot.

2. Variance may not be what we want condensed into a few variables. For example, if we are using PCA to reduce data for predictive model construction, then it is not necessarily the case that the first principal components yield a better model than the last principal components (though it often works out more or less that way).

3. PCA is built from components, such as the sample covariance, which are not statistically robust. This means that PCA may be thrown off by outliers and other data pathologies. How seriously this affects the result is specific to the data and application.

4. Though PCA can cram much of the variance of a data set into fewer variables, it still requires all of the original variables to generate the principal components of future observations. Note that this is true regardless of how many principal components are retained for the application. PCA is not a subset selection procedure, and this may have important logistical implications (see the sketch after this list).
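
To make caveat 4 concrete, here is a minimal sketch of scoring a future observation with the standardization parameters and coefficients computed above (the new measurement values are invented purely for illustration). All three original variables are needed, even if only the first principal component is used afterward:


>> ANew = [270.5 39.0 50.4];            % hypothetical new observation: every original variable is required
>> BNew = (ANew - AMean) ./ AStd;       % standardize with the training mean and standard deviation
>> SCORENew = BNew * COEFF;             % principal components of the new observation
>> SCORENew(1)                          % keep only the first component, if that is all the application uses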



Further Reading

See also the Feb-28-2010 posting, Putting PCA to Work, and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).


Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)

Note: The first edition is adequate for understanding and coding PCA, and is at present much cheaper than the second or third editions.

56 comments:

Anonymous said...

Good job. I agree with the suggestion to add image examples.

How / where in your source code do you specify the number of features you want to extract from the given data? For instance, using PCA to extract the top 15 features.

Thanks for your help

Ibraheem M. Al-Dhamari said...

Excellent.
I hope you provide a similarly excellent explanation of SVM.

regards,

Anonymous said...

Respected sir,
your blog is very helpful; through it I can understand the beauty of the PCA method. But can it be applied to images?

Anonymous said...

I greatly appreciate the practical and clear explanation. Other things available are beyond my current background, but this explanation allows me to start empirically playing with the method - in my opinion a good precursor to really understanding it formally (if I ever get there! - and if not I still learn something).

MP said...

The first time I did PCA I used the following document, which has pictures. I agree they help a lot. It is nice to see the MATLAB code on your blog. I think it would have been better if you did an example where you actually reduced the dimensionality by selecting a subset of feature vectors. Also, it's very easy to do this without the Statistics Toolbox.

I do something like this in MATLAB to select my feature vectors (Vectors and Values as returned by eig of the covariance matrix, e.g. [Vectors, Values] = eig(cov(B))):
Values = diag(Values);                         % pull the eigenvalues off the diagonal
[Vsort, Vindices] = sort(-1*Values);           % indices of the eigenvalues in descending order
Values = Values(Vindices);                     % reorder the eigenvalues
PC = Vectors(:,Vindices(1:number_factors));    % keep the top number_factors eigenvectors

@previous poster This process can be applied to images for compression.

http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Unknown said...

What do you do if you want to know the points that are, say, one standard deviation from the mean of the first principal component? In other words, I want to know what is being altered as I move two standard deviations along the first component.

nicc said...

Great. Very clear! I also liked how you offer the non-MATLAB function description. Good show!

Fernando C Barbi said...

Will,

Great post, thanks! You might want to make explicit the relation between V, D, SCORE and COEFF: COEFF is V with its columns in descending order of variance, and SCORE = B*COEFF.

Anonymous said...

You said A consists of 3 measurements taken of a particular subject over time. But what, in this case, are COEFF, SCORE and LATENT? How can these numbers be used for better interpretation of your results?

Will Dwinnell said...

COEFF is V, but with its columns in the opposite order, so that when the standardized data is multiplied by it, B * COEFF, the result, which is generated by princomp() as SCORE, has the first principal component in the first column. LATENT contains the variances of the principal components.

amin said...

This article is very helpful for me.
I am a student from Indonesia.
I am now completing a thesis on data mining, with PCA for dimension reduction, but I want to use Jacobi iteration to find the eigenvectors.
Can you help me?

skumar said...

Dear Will, thank you for your wonderful explanation of PCA. Can you please explain, or help me solve, apriori association rule mining using MATLAB? Thanks - s kumar

Amalina said...

Hi, glad to find this blog! I have a huge set of data: 17689 approximate coefficients extracted during feature extraction from an MRI brain image. How can I use PCA to reduce the data so that I can use minimal data for SVM classification? I really need your advice, thanks.
-amalina_azman_80@yahoo.com.my

Anonymous said...

Hi, thanks. How could PCA be used for an image?

harly said...

Thanks for your posting..
it's very useful to me..

Trevor said...

If you wanted to find how much each original variable contributed to each of the new variables, how would you do it?

hari said...

Thank you for your post. Can you please explain how this can be done for color images?

Paco said...

Good post!!

A question: with this post as starting point, how do you implement rotated PCA with matlab's rotatefactors (varimax algorithm) and the new explained variance based on the rotated components?

Thanks for your help

Anonymous said...

To plot the PC1 vs PC2 plot do I plot the scores first column Vs scores second column of values?

Will Dwinnell said...

"To plot the PC1 vs PC2 plot do I plot the scores first column Vs scores second column of values?"

That depends on how the principal components were calculated. If the princomp function in the Statistics Toolbox was used then, yes. If the "manual" method I describe here was used, then the order of the principal components column is reversed, so the last 2 columns are the first 2 principal components.
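
For example, using the princomp output from this post, such a plot might be produced like this (a sketch; labels and marker style are arbitrary):

>> plot(SCORE(:,1), SCORE(:,2), 'o')
>> xlabel('First principal component'), ylabel('Second principal component')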

Anonymous said...

Will, excellent post. Thanks. Quick question: at the end of the process my new variables will be the matrix SCORE, right? If I want to add a trend line, should I use the first, second, … columns of the SCORE matrix?

Kh@L3D said...

Hello, I'd like to thank you very much for your tutorial. Very helpful and useful.
But I do have a question. I read in many works that PCA is used as a "preprocessing" method, prior to classification (LDA or other). Therefore, I'd like to ask you: how can I use the principal coefficients to perform a classification?

I mean, okay, I know the classes assigned to the training matrix T, but when I do Coeff = princomp(A), I find that I know nothing about Coeff; I don't know its classes. So how can I perform a classification?

Thank you very much for your blog. Please excuse me if my question is misplaced; please point me to the appropriate forum to ask it.

Thank you very much :)

Will Dwinnell said...

Kh@L3D:

PCA itself is not a classification method. PCA merely rearranges the data to exploit linear structure. As I note in this posting, PCA may or may not help classification, which is a separate process (performed by some classification algorithm: discriminant analysis, neural networks, etc.).

Anonymous said...

Hello! This is one of the best posts I've seen on PCA - thank you!
I've used this post as starting point. However for better interpretation of which variables comprise my factors, I would like to go one step further and rotate my factors using the varimax method of rotation.
Once I've obtained my COEFF, SCORE, and latent, what is the next step?
Thank you~

Upul Senanayake said...

Hey Will,
You're a life saver! Thank you so much!

Burak said...

Thank you very much for your good explanation. I am wondering about one thing, though. Once we get the principal components by using the princomp function of MATLAB, can we say that the first principal component is related to the first column of the original data matrix? Or is it possible that the first column of the original data matrix does not have as much variance as the second column, so that the first column of the principal components corresponds to the second column of the original data matrix? How can we know which column of the principal components is related to which column of the original data matrix? Thanks in advance.

Christian Himpe said...

Instead of using repmat one can use bsxfun to compute the standardization more efficiently as follows:

T = bsxfun(@minus,A,mean(A))
T = bsxfun(@times,T,1./std(A))

and since the eigenvalues of a covariance matrix (which by design is positive (semi-)definite) are computed next, one can equivalently, but somewhat faster, compute the singular values:

D = svd(cov(T))

Greetings

Unknown said...

Hi! Lately I've been reading a lot about PCA and I've found this post very useful, kudos to you!
I've learned that a useful method for validating a PCA model (choosing how many PCs to retain) is by cross validation.
Taking your example in Matlab and applying the 'Leave One Out' method (I really need to implement this in my work) how would you do it?
I understand the concept behind CV but I'm a bit confused as how to apply it here (or in any other example), because of my lack of experience.
Any help would be MUCH appreciated!
Thanks in advance,
Nuno B.

Anonymous said...

Why is there also a sign problem, besides the first-to-last ordering problem, when comparing V and COEFF (compare the last column of V with the first column of COEFF)? Which one should I use for the vector? Thanks.

Anonymous said...

If possible, I would like to know how to calculate the relation between the original variables and the principal components.

Unknown said...

How do I plot the first score in a two-dimensional plane?
I have a data set: X=[x1 x2]

Unknown said...

How do I plot the first score only?
I have a data set: X=[x1 x2]

Unknown said...

Hi, how can I plot the relation between LDA and PCA? Thank you.

Diego Alonso said...

The PCs are orthogonal (V*V.'), but I would say that the transformed data is not always uncorrelated. If, in the original data, one variable is a linear combination of other variables, this dependence is kept.
However, PCA can help us to see these dependencies and eliminate redundant variables.

Unknown said...

Thank you very much for this explanation. I found your page because I've been struggling for a while to understand why the coefficients given by MATLAB are systematically smaller than those given by Statistica. Today, I realized that the norms of the factors (columns of COEFF), which should be equal to the factors' variances, are instead normalized to one. Do you know why, and more importantly how I can prevent this normalization in order to match Statistica's results?
Thanks in advance!
Thanks in advance !

Anonymous said...

Super blog post. It helped a lot.

However, I am trying to get the same results out of MATLAB that a colleague is getting out of SPSS.

I tried applying eig to both the covariance matrix (cov) and the correlation matrix (corrcoef), but it results in the same set of eigenvectors.

I know SPSS has the option to rotatefactors (as does MATLAB of course). But I am talking about the unrotated factors.

Hope you can help

Steve Westland
University of Leeds

Life said...

Very good and clear; after reading this I can understand PCA.
How do I use PCA on unbalanced data?

Unknown said...

Hello, please tell me how I can use PCA for privacy preservation in data mining. Thank you.

Unknown said...

Nice post; the examples make it all very clear. I recently wrote an article about what PCA actually means. This might be helpful for some, to get a more intuitive understanding: http://www.visiondummy.com/2014/05/feature-extraction-using-pca/

Nethravati Ma'm said...

After applying PCA on original and modified data sets, I want to analyse the new data sets for accuracy and information loss & entropy. Any functions or specific codes for the same?

Anonymous said...

If you want to get the original data back from B, you do not need COEFF at all, since COEFF*COEFF' is the identity matrix. I think you should point out that V and COEFF are the eigenvectors of cov(B); it took me a while to figure out. In this case cov(B) is the same as corr(B), as the columns are z-scores. However, B'*B is 14*cov(B). Why the 14? It is puzzling me. I was thinking that would give the covariance matrix.

Anonymous said...

Answering the post above: B'*B is 14 times cov(B) because the covariance estimator is 1/(n-1) * B'*B when E(B)=0, i.e. the series are zero mean. Since n=15, n-1=14. Mystery solved. Very interesting. Covariance estimator link: http://www.encyclopediaofmath.org/index.php/Covariance_matrix

Anonymous said...

Let's assume you have 3 features of an image instead of 3 measurements taken of a particular subject. I have 10 images, so my training data set will be 10 x 3. If I use the MATLAB built-in function princomp and get COEFF, SCORE and LATENT, which one should I use? SCORE also gives me 3 columns; do I need to use the first column only? How can I use these numbers for better interpretation of my results, and how do I give the input to the classifier?

Anisha K S said...

Thanks for sharing this valuable information. I think this PCA method is also helpful for data mining processes involving data compression.

aaaaaaghhhhhhhhhhhhhhhhh said...

Does dividing the data by the standard deviation mess it up? Will it lower the energy of the high-variance dimensions and increase the energy of the low-variance dimensions?

Chris said...

Best explanation of PCA on the net! I've been googling it all day.

Anonymous said...

I was searching for 3 months for how to use PCA for feature reduction. Thank you for the content. - sdp

Anonymous said...

Hi Will,
Nice post to explain PCA. I wonder if you can help my simple problem. I wish to do a GPR with input from PCA of my data, and I learned that the right way to do the CV is by doing PCA on the training set, then use the training regression coefficients to map the test set to their PCs. The following is my attempt in matlab:

% Calculate xscores for training set
[coef, score, latent, explained, mu] = pca(data_train(:,2:end));
xscore_train=score;
ytrain=zdata_train(:,1);

% Calculate xscores for test set=standardized newX * coef
xscore_test=zdata_test(:,2:end)*coef; ytest=zdata_test(:,1);

Do I compute the xscore_test correctly? Also, the scores returned are not standardized correct?

Thanks a bunch:)

munjal said...

Nice article.

After applying PCA, if I take only the first component, throw out the other 2 components, and then reproduce the original data, my reproduced data still has three variables. Can you please tell me how this procedure affects my data?

Or if I want to predict some Y variable based on the given data, how can we use PCA for that?

Thank you.

Unknown said...

For a multi-variable data set, is it possible to tell which variable has the most influence on the 1st principal component? Another question: what is the significance of the negative signs in the V matrix components?

Anonymous said...

Hi Will, can we find the attribute names which have been reduced as a result?

Akira said...

I have been going through a number of websites and textbooks to learn how to utilize PCA, and was somewhat confused because of the many technical terms and formulas flying around. Your explanation just settled everything in place in my brain. Thank you!

Unknown said...

Hey, thank you for the example. I tried it with and without the princomp function; my solution without princomp matches your answer, but I am getting a different answer when I use princomp, as follows:
coeff =

0.9686 0.1808 -0.1709
0.2019 -0.1698 0.9646
-0.1454 0.9687 0.2010


score =

-0.1721 -0.0108 0.0272
2.5399 -0.1270 0.0612
0.0216 0.0253 -0.0070
2.0831 0.0284 -0.0232
-0.1721 -0.0108 0.0272
-0.1721 -0.0108 0.0272
-1.7388 -0.5398 -0.0491
-1.8260 0.0414 0.0715
-3.1127 0.1830 -0.0490
-2.2829 0.1968 -0.0129
3.7224 0.0730 -0.0473
1.2388 0.1115 -0.0392
-0.1721 -0.0108 0.0272
0.0216 0.0253 -0.0070
0.0216 0.0253 -0.0070


latent =

3.3971
0.0287
0.0015

Please help me out!
Thank you

Anonymous said...

Hi, Will!

Nice explanation! Thanks!

Could you also give an explanation for the case when the dimensionality (d) is greater than the number of samples (s), and how to extract more PCs than s?

For example:
Given:
s = 100
d = 3000

Problem:
Let's say reduce d to 1500.

Thanks in advance

Unknown said...

I have 3 classes with 200 instances in each class, and 14 variables to predict those classes. I want to apply PCA for feature selection and do the classification with the few features which are most important.
Could you help me with this and explain how to proceed for feature selection?

Thank you