Data Mining in MATLAB: February 2010

Sunday, February 28, 2010

Putting PCA to Work

Context

The last posting to this Web log, Principal Components Analysis (Feb-26-2010), gave an overview of principal component analysis (PCA), and how to effect it within MATLAB. This article will cover three uses of PCA: 1. pre-processing for empirical modeling, 2. data compression and 3. noise suppression.

To serve the widest possible audience, this article will conduct PCA using only base MATLAB functions, but realize that users with the Statistics Toolbox have, as mentioned in the last posting, the option of using tools like princomp and zscore.

We will continue to use the very small data set used in the last article:

>> A = [269.8 38.9 50.5
272.4 39.5 50.0
270.0 38.9 50.5
272.0 39.3 50.2
269.8 38.9 50.5
269.8 38.9 50.5
268.2 38.6 50.2
268.2 38.6 50.8
267.0 38.2 51.1
267.8 38.4 51.0
273.6 39.6 50.0
271.2 39.1 50.4
269.8 38.9 50.5
270.0 38.9 50.5
270.0 38.9 50.5
];

We calculate the sample parameters, and standardize the data table:

>> [n m] = size(A)

n =

15

m =

3

>> AMean = mean(A)

AMean =

269.9733 38.9067 50.4800

>> AStd = std(A)

AStd =

1.7854 0.3751 0.3144

>> B = (A - repmat(AMean,[n 1])) ./ repmat(AStd,[n 1])

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636

Now that the data is centered with mean 0.0 and standard deviation 1.0, we perform the eigenanalysis of the sample covariances to determine the coefficient matrix which generates the principal components:

>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587

D =

0.0066 0 0
0 0.1809 0
0 0 2.8125

Recall that the MATLAB eig function orders information for the principal components from last to first when reading the columns from left to right. The matrix V contains the linear coefficients for the principal components. The diagonal of matrix D contains the variances for the principal components. So far, we have accomplished the principal components analysis itself. To put the PCA to use, we will want to know what proportion each principal component represents of total variance. We can do this by extracting and normalizing the diagonal of matrix D (we use flipud because the principal components are in "reverse" order):

>> cumsum(flipud(diag(D))) / sum(diag(D))

ans =

0.9375
0.9978
1.0000

We interpret the above column of numbers to mean that the first principal component
contains 93.75% of the total variance of the original data, the first two principal components together contain 99.78% and of course all principal components taken together have all of the variance (exactly as much as in the original standardized data).

Last, to calculate the principal components themselves, simply multiply the standardized data by the coefficient matrix:

>> PC = B * V

PC =

-0.0571 -0.0003 0.1026
-0.1277 -0.1226 -2.5786
0.0157 0.0543 0.0373
0.0536 0.1326 -1.7779
-0.0571 -0.0003 0.1026
-0.0571 -0.0003 0.1026
0.0704 -1.4579 0.5637
-0.1495 0.1095 1.6299
0.1041 0.2496 3.1841
0.0319 0.3647 2.4306
0.1093 0.2840 -3.1275
0.0892 0.2787 -0.8467
-0.0571 -0.0003 0.1026
0.0157 0.0543 0.0373
0.0157 0.0543 0.0373

To verify the condensing of the variance, calculate the sample variances:

>> var(PC)

ans =

0.0066 0.1809 2.8125

Again, note that the first principal component appears in the last column when using MATLAB's eig function, and columns to the left have less and less variance until the last principal component, stored in the first column.

Application: Pre-processing Data for Empirical Modeling

This application of PCA is simple: calculate the principal components and choose from them rather than the original data to construct the empirical model (regression, neural network, etc.). The (hoped for) advantage of doing this is that since PCA squeezes information into a subset of the new variables, less of them will be necessary to construct the model. In fact, it would not be unreasonable to simply step through the first so many principal components to build the model: First, use just the first principal component, then try the first and second, then the first, second and third, etc. A nice side benefit is that all the principal components are uncorrelated with each other.

As was mentioned in the last article, this may or may not work well, for several reasons: PCA may not be able to squeeze the variance much if the original variables are already highly uncorrelated with one another. Also, statistical variance may not be the same thing as "information" for the purposes of model building. Last, even if this process works, one is left with the reality that PCA needs all of the original variables to calculate the principal components, even if only a subset of them are used. Regardless, this is a data processing technique which can yield benefit, so it is worth trying.

Application: Data Compression

PCA offers a mechanism for performing lossy data compression. When data compression is "lossy", it may not return exactly the original data. The trade-off is that much greater compression can be achieved than with "lossless" data compression (compression in which the original data is returned exactly). In many cases, such as audio (MP3) and images (JPEG), some loss in fidelity is acceptable and greater compression is very much desired.

All compression schemes rely on the discovery of regularities within the data. In the case of PCA, the regularity is a linear relationship among the variables. To the extent that PCA finds this relationship, the data may be compressed. The idea is to discard the last principal components (those exhibiting the least variance).

In MATLAB, this means simply dropping the columns representing the unwanted principal components. In this case, we will retain only the first principal component:

>> VReduced = V(:,3)

VReduced =

-0.5825
-0.5904
0.5587

>> PCReduced = B * VReduced

PCReduced =

0.1026
-2.5786
0.0373
-1.7779
0.1026
0.1026
0.5637
1.6299
3.1841
2.4306
-3.1275
-0.8467
0.1026
0.0373
0.0373

Decompression is accomplished by inverting the process, which we can do by transposing the coefficient vector and multiplying:

>> PCReduced * VReduced'

ans =

-0.0598 -0.0606 0.0573
1.5020 1.5224 -1.4406
-0.0217 -0.0220 0.0209
1.0356 1.0497 -0.9933
-0.0598 -0.0606 0.0573
-0.0598 -0.0606 0.0573
-0.3284 -0.3328 0.3150
-0.9494 -0.9623 0.9106
-1.8547 -1.8799 1.7789
-1.4158 -1.4351 1.3580
1.8217 1.8465 -1.7473
0.4932 0.4999 -0.4730
-0.0598 -0.0606 0.0573
-0.0217 -0.0220 0.0209
-0.0217 -0.0220 0.0209

The result is not exactly the same as the original standardized data, but it is pretty close. We "un-standardize" by reversing the original standardization step:

>> Z = ((PCReduced * VReduced') .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])

Z =

269.8667 38.8840 50.4980
272.6550 39.4777 50.0270
269.9345 38.8984 50.4866
271.8223 39.3004 50.1677
269.8667 38.8840 50.4980
269.8667 38.8840 50.4980
269.3870 38.7818 50.5790
268.2783 38.5457 50.7663
266.6619 38.2016 51.0393
267.4455 38.3684 50.9070
273.2259 39.5992 49.9306
270.8539 39.0942 50.3313
269.8667 38.8840 50.4980
269.9345 38.8984 50.4866
269.9345 38.8984 50.4866

Again, the result is pretty similar to the original, but not exactly: about 94% of the variance has been preserved, and we have compressed the data to 33% of its original size.

The trade-off here is between compression (count of principal components retained) and compression fidelity (the variance preserved). In a typical application, there will be more variables and the variance compression is normally not quite as dramatic as in our illustration. This means that there will be more data compression "levels", represented by the number of principal components retained.

Application: Noise Suppression

Extending the data compression application, we may use PCA for noise suppression. The basic idea is that the variance captured by the least important principal components is noise which should be rejected. Assuming that the variables bear a linear relationship, they will lie in a line (plane, hyperplane) and noise items will lift them away from the line. Dropping the last principal components means flattening the data in a geometric sense and (hopefully) eliminating some of the noise.

This process is much like the data compression process described in the last section, except: 1. discarded components have their coefficients set to zero instead of being deleted outright and 2. the PCA coefficient matrix and its inverse are multiplied together to allow a single processing step which (again, hopefully) reduces noise in the data.

As before, we calculate the PCA coefficients:

>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587

D =

0.0066 0 0
0 0.1809 0
0 0 2.8125

Deciding to eliminate the last principal component, we set its coefficients to zero:

>> VDenoise = V; VDenoise(:,1) = 0

VDenoise =

0 0.4874 -0.5825
0 0.2963 -0.5904
0 0.8213 0.5587

This matrix will project the standardized data into a flat surface- in this case a plane, since we have retained 2 dimensions. Not wanting to bother with two steps, we multiply this matrix by its inverse, which in this case is easily obtained by taking the transpose:

>> VDenoise = VDenoise * VDenoise'

VDenoise =

0.5769 0.4883 0.0749
0.4883 0.4364 -0.0865
0.0749 -0.0865 0.9867

This magical matrix will, in a single matrix multiplication, denoise the standardized data:

>> B * VDenoise

ans =

-0.0599 -0.0607 0.0570
1.4422 1.4861 -1.5414
0.0047 -0.0060 0.0654
1.1002 1.0890 -0.8844
-0.0599 -0.0607 0.0570
-0.0599 -0.0607 0.0570
-1.0390 -0.7648 -0.8824
-0.8960 -0.9299 1.0005
-1.7330 -1.8060 1.9839
-1.2380 -1.3270 1.6575
1.9601 1.9307 -1.5141
0.6290 0.5825 -0.2442
-0.0599 -0.0607 0.0570
0.0047 -0.0060 0.0654
0.0047 -0.0060 0.0654

Naturally, we still need to multiply back the standard deviation and add back the mean to get to the original scale:

>> Z = ((B * VDenoise) .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])

Z =

269.8664 38.8839 50.4979
272.5483 39.4640 49.9954
269.9817 38.9044 50.5006
271.9377 39.3151 50.2019
269.8664 38.8839 50.4979
269.8664 38.8839 50.4979
268.1183 38.6198 50.2025
268.3736 38.5579 50.7946
266.8791 38.2293 51.1038
267.7630 38.4090 51.0012
273.4731 39.6308 50.0040
271.0964 39.1251 50.4032
269.8664 38.8839 50.4979
269.9817 38.9044 50.5006
269.9817 38.9044 50.5006

The degree of noise reduction is controlled by the number of principal components retained: the less principal components retained, the greater the noise reduction. Obviously, like all such schemes, this process has limitations and the big assumption here is that the original variables are linearly related so that noise stands out as a departure from this linearity.

Final Thoughts

PCA is a powerful tool, and is quickly computed on current computers, even on fairly large data. While there are limits to what it can do, it is a handy tool which is inexpensive in terms of compute time.

Further Reading

As a general reference on PCA see:

Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)

Note: The first edition is adequate for understanding and coding PCA, and is at present much cheaper than the second or third editions.

The noise suppression application is described in the article, Vectors help make sense of multiple signals, by Sullivan, Personal Engineering and Instrumentation News (Dec-1997), in which it is referred to as subspace projection.

See also the Dec-11-2010 posting, Linear Discriminant Analysis (LDA) .

Friday, February 26, 2010

Principal Components Analysis

Introduction

Real-world data sets usually exhibit relationships among their variables. These relationships are often linear, or at least approximately so, making them amenable to common analysis techniques. One such technique is principal component analysis ("PCA"), which rotates the original data to new coordinates, making the data as "flat" as possible.

Given a table of two or more variables, PCA generates a new table with the same number of variables, called the principal components. Each principal component is a linear transformation of the entire original data set. The coefficients of the principal components are calculated so that the first principal component contains the maximum variance (which we may tentatively think of as the "maximum information"). The second principal component is calculated to have the second most variance, and, importantly, is uncorrelated (in a linear sense) with the first principal component. Further principal components, if there are any, exhibit decreasing variance and are uncorrelated with all other principal components.

PCA is completely reversible (the original data may be recovered exactly from the principal components), making it a versatile tool, useful for data reduction, noise rejection, visualization and data compression among other things. This article walks through the specific mechanics of calculating the principal components of a data set in MATLAB, using either the MATLAB Statistics Toolbox, or just the base MATLAB product.

Performing Principal Components Analysis

Performing PCA will be illustrated using the following data set, which consists of 3 measurements taken of a particular subject over time:

>> A = [269.8 38.9 50.5
272.4 39.5 50.0
270.0 38.9 50.5
272.0 39.3 50.2
269.8 38.9 50.5
269.8 38.9 50.5
268.2 38.6 50.2
268.2 38.6 50.8
267.0 38.2 51.1
267.8 38.4 51.0
273.6 39.6 50.0
271.2 39.1 50.4
269.8 38.9 50.5
270.0 38.9 50.5
270.0 38.9 50.5
];

We determine the size of this data set thus:

>> [n m] = size(A)

n =

15

m =

3

To summarize the data, we calculate the sample mean vector and the sample standard deviation vector:

>> AMean = mean(A)

AMean =

269.9733 38.9067 50.4800

>> AStd = std(A)

AStd =

1.7854 0.3751 0.3144

Most often, the first step in PCA is to standardize the data. Here, "standardization" means subtracting the sample mean from each observation, then dividing by the sample standard deviation. This centers and scales the data. Sometimes there are good reasons for modifying or not performing this step, but I will recommend that you standardize unless you have a good reason not to. This is easy to perform, as follows:

>> B = (A - repmat(AMean,[n 1])) ./ repmat(AStd,[n 1])

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636

This calculation can also be carried out using the zscore function from the Statistics Toolbox:

>> B = zscore(A)

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636

Calculating the coefficients of the principal components and their respective variances is done by finding the eigenfunctions of the sample covariance matrix:

>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587

D =

0.0066 0 0
0 0.1809 0
0 0 2.8125

The matrix V contains the coefficients for the principal components. The diagonal elements of D store the variance of the respective principal components. We can extract the diagonal like this:

>> diag(D)

ans =

0.0066
0.1809
2.8125

The coefficients and respective variances of the principal components could also be found using the princomp function from the Statistics Toolbox:

>> [COEFF SCORE LATENT] = princomp(B)

COEFF =

0.5825 -0.4874 0.6505
0.5904 -0.2963 -0.7507
-0.5587 -0.8213 -0.1152

SCORE =

-0.1026 0.0003 -0.0571
2.5786 0.1226 -0.1277
-0.0373 -0.0543 0.0157
1.7779 -0.1326 0.0536
-0.1026 0.0003 -0.0571
-0.1026 0.0003 -0.0571
-0.5637 1.4579 0.0704
-1.6299 -0.1095 -0.1495
-3.1841 -0.2496 0.1041
-2.4306 -0.3647 0.0319
3.1275 -0.2840 0.1093
0.8467 -0.2787 0.0892
-0.1026 0.0003 -0.0571
-0.0373 -0.0543 0.0157
-0.0373 -0.0543 0.0157

LATENT =

2.8125
0.1809
0.0066

Note three important things about the above:

1. The order of the principal components from princomp is opposite of that from eig(cov(B)). princomp orders the principal components so that the first one appears in column 1, whereas eig(cov(B)) stores it in the last column.

2. Some of the coefficients from each method have the opposite sign. This is fine: There is no "natural" orientation for principal components, so you can expect different software to produce different mixes of signs.

3. SCORE contains the actual principal components, as calculated by princomp.

To calculate the principal components without princomp, simply multiply the standardized data by the principal component coefficients:

>> B * COEFF

ans =

-0.1026 0.0003 -0.0571
2.5786 0.1226 -0.1277
-0.0373 -0.0543 0.0157
1.7779 -0.1326 0.0536
-0.1026 0.0003 -0.0571
-0.1026 0.0003 -0.0571
-0.5637 1.4579 0.0704
-1.6299 -0.1095 -0.1495
-3.1841 -0.2496 0.1041
-2.4306 -0.3647 0.0319
3.1275 -0.2840 0.1093
0.8467 -0.2787 0.0892
-0.1026 0.0003 -0.0571
-0.0373 -0.0543 0.0157
-0.0373 -0.0543 0.0157

To reverse this transformation, simply multiply by the transpose of the coefficent matrix:

>> (B * COEFF) * COEFF'

ans =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636

Finally, to get back to the original data, multiply each observation by the sample standard deviation vector and add the mean vector:

>> ((B * COEFF) * COEFF') .* repmat(AStd,[n 1]) + repmat(AMean,[n 1])

ans =

269.8000 38.9000 50.5000
272.4000 39.5000 50.0000
270.0000 38.9000 50.5000
272.0000 39.3000 50.2000
269.8000 38.9000 50.5000
269.8000 38.9000 50.5000
268.2000 38.6000 50.2000
268.2000 38.6000 50.8000
267.0000 38.2000 51.1000
267.8000 38.4000 51.0000
273.6000 39.6000 50.0000
271.2000 39.1000 50.4000
269.8000 38.9000 50.5000
270.0000 38.9000 50.5000
270.0000 38.9000 50.5000

This completes the round trip from the original data to the principal components and back to the original data. In some applications, the principal components are modified before the return trip.

Let's consider what we've gained by making the trip to the principal component coordinate system. First, more variance has indeed been squeezed in the first principal component, which we can see by taking the sample variance of principal components:

>> var(SCORE)

ans =

2.8125 0.1809 0.0066

The cumulative variance contained in the first so many principal components can be easily calculated thus:

>> cumsum(var(SCORE)) / sum(var(SCORE))

ans =

0.9375 0.9978 1.0000

Interestingly in this case, the first principal component contains nearly 94% of the variance of the original table. A lossy data compression scheme which discarded the second and third principal components would compress 3 variables into 1, while losing only 6% of the variance.

The other important thing to note about the principal components is that they are completely uncorrelated (as measured by the usual Pearson correlation), which we can test by calculating their correlation matrix:

>> corrcoef(SCORE)

ans =

1.0000 -0.0000 0.0000
-0.0000 1.0000 -0.0000
0.0000 -0.0000 1.0000

Discussion

PCA "squeezes" as much information (as measured by variance) as possible into the first principal components. In some cases the number of principal components needed to store the vast majority of variance is shockingly small: a tremendous feat of data manipulation. This transformation can be performed quickly on contemporary hardware and is invertible, permitting any number of useful applications.

For the most part, PCA really is as wonderful as it seems. There are a few caveats, however:

1. PCA doesn't always work well, in terms of compressing the variance. Sometimes variables just aren't related in a way which is easily exploited by PCA. This means that all or nearly all of the principal components will be needed to capture the multivariate variance in the data, making the use of PCA moot.

2. Variance may not be what we want condensed into a few variables. For example, if we are using PCA to reduce data for predictive model construction, then it is not necessarily the case that the first principal components yield a better model than the last principal components (though it often works out more or less that way).

3. PCA is built from components, such as the sample covariance, which are not statistically robust. This means that PCA may be thrown off by outliers and other data pathologies. How seriously this affects the result is specific to the data and application.

4. Though PCA can cram much of the variance in a data set into fewer variables, it still requires all of the variables to generate the principal components of future observations. Note that this is true, regardless of how many principal components are retained for the application. PCA is not a subset selection procedure, and this may have important logistical implications.

Further Reading

See also the Feb-28-2010 posting, Putting PCA to Work and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA) .

Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)

Note: The first edition is adequate for understanding and coding PCA, and is at present much cheaper than the second or third editions.

Tuesday, February 16, 2010

Single Neuron Training: The Delta Rule

I have recently put together a routine, DeltaRule, to train a single artificial neuron using the delta rule. DeltaRule can be found at MATLAB Central.

This posting will not go into much detail, but this type of model is something like a logistic regression, where a linear model is calculated on the input variables, then passed through a squashing function (in this case the logistic curve). Such models are most often used to model binary outcomes, hence the dependent variable is normally composed of the values 0 and 1.

Single neurons with linear functions (with squashing functions or not) are only capable of separating classes that may be divided by a line (plane, hyperplane), yet they are often useful, either by themselves or in building more complex models.

Use help DeltaRule for syntax and a simple example of its use.

Anyway, I thought readers might find this routine useful. It trains quickly and the code is straightforward (I think), making modification easy. Please write to let me know if you do anything interesting with it.

If you are already familiar with simple neural models like this one, here are the technical details:

Learning rule: incremental delta rule
Learning rate: constant
Transfer function: logistic
Exemplar presentation order: random, by training epoch

See also the Mar-15-2009 posting, Logistic Regression and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).