Saturday, December 11, 2010

Linear Discriminant Analysis (LDA)

Overview

Linear discriminant analysis (LDA) is one of the oldest mechanical classification systems, dating back to statistical pioneer Ronald Fisher, whose original 1936 paper on the subject, The Use of Multiple Measurements in Taxonomic Problems, can be found online.

The basic idea of LDA is simple: for each class to be identified, calculate a (different) linear function of the attributes. The class function yielding the highest score represents the predicted class.

There are many linear classification models, and they differ largely in how the coefficients are established. One nice quality of LDA is that, unlike some of the alternatives, it does not require multiple passes over the data for optimization. Also, it naturally handles problems with more than two classes and it can provide probability estimates for each of the candidate classes.

Some analysts attempt to interpret the signs and magnitudes of the coefficients of the linear scores, but this can be tricky, especially when the number of classes is greater than 2.

LDA bears some resemblance to principal components analysis (PCA), in that a number of linear functions are produced (using all raw variables), which are intended, in some sense, to provide data reduction through rearrangement of information. (See the Feb-26-2010 posting to this log, Principal Components Analysis.) Note, though, some important differences: First, the objective of LDA is to maximize class discrimination, whereas the objective of PCA is to squeeze variance into as few components as possible. Second, LDA produces exactly as many linear functions as there are classes, whereas PCA produces as many linear functions as there are original variables. Last, principal components are always orthogonal to each other ("uncorrelated"), while that is not generally true for LDA's linear scores.


An Implementation

I have made available on MATLAB Central a routine, aptly named LDA, which performs all the necessary calculations. I'd like to thank Deniz Seviş, whose prompting got me to finally write this code (with her) and whose collaboration is very much appreciated.

Note that the LDA function assumes that the data it is being fed is complete (no missing values) and performs no attribute selection. Also, it requires only base MATLAB (no toolboxes needed).

Use of LDA is straightforward: the programmer supplies the input and target variables and, optionally, prior probabilities. The function returns the fitted linear discriminant coefficients. help LDA provides a good example:


% Generate example data: 2 groups, of 10 and 15, respectively
X = [randn(10,2); randn(15,2) + 1.5]; Y = [zeros(10,1); ones(15,1)];

% Calculate linear discriminant coefficients
W = LDA(X,Y);


This example randomly generates an artificial data set of two classes (labeled 0 and 1) and two input variables. The LDA function fits linear discriminants to the data, and stores the result in W. So, what is in W? Let's take a look:


>> W

W =

-1.1997 0.2182 0.6110
-2.0697 0.4660 1.4718


The first row contains the coefficients for the linear score associated with the first class (this routine orders the linear functions the same way as unique()). In this model, -1.1997 is the constant and 0.2182 and 0.6110 are the coefficients for the input variables for the first class (class 0). Coefficients for the second class's linear function are in the second row. Calculating the linear scores is easy:


% Calculate linear scores for training data
L = [ones(25,1) X] * W';


Each column represents the output of the linear score for one class. In this case, the first column is class 0, and the second column is class 1. For any given observation, the higher the linear score, the more likely that class. Note that LDA's linear scores are not probabilities, and may even assume negative values. Here are the values from my run:


>> L

L =

-1.9072 -3.8060
1.0547 3.2517
-1.2493 -2.0547
-1.0502 -1.7608
-0.6935 -0.8692
-1.6103 -2.9808
-1.3702 -2.4545
-0.2148 0.2825
0.4419 1.6717
0.2704 1.3067
1.0694 3.2670
-0.0207 0.7529
-0.2608 0.0601
1.2369 3.6135
-0.8951 -1.4542
0.2073 1.1687
0.0551 0.8204
0.1729 1.1654
0.2993 1.4344
-0.6562 -0.8028
0.2195 1.2068
-0.3070 0.0598
0.1944 1.2628
0.5354 2.0689
0.0795 1.0976
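
Before moving on to probabilities, note that hard class predictions can be read directly from these scores. Here is a minimal sketch, using base MATLAB, which assumes the classes are labeled 0 and 1 and ordered as by unique():


% Predict the class by choosing the column with the highest linear score
[MaxScore ColumnIndex] = max(L,[],2);
PredictedClass = ColumnIndex - 1;   % column 1 -> class 0, column 2 -> class 1

% Fraction of training observations classified correctly
TrainingAccuracy = mean(PredictedClass == Y)
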


To obtain estimated probabilities, simply run the linear scores through the softmax transform: exponentiate each score, then normalize each observation's scores so that they sum to 1.0:


% Calculate class probabilities
P = exp(L) ./ repmat(sum(exp(L),2),[1 2]);


As we see, most of the first 10 cases exhibit higher probabilities for class 0 (the first column) than for class 1 (the second column) and the reverse is true for the last 15 cases:


>> P

P =

0.8697 0.1303
0.1000 0.9000
0.6911 0.3089
0.6705 0.3295
0.5438 0.4562
0.7975 0.2025
0.7473 0.2527
0.3782 0.6218
0.2262 0.7738
0.2619 0.7381
0.1000 0.9000
0.3157 0.6843
0.4205 0.5795
0.0850 0.9150
0.6363 0.3637
0.2766 0.7234
0.3175 0.6825
0.2704 0.7296
0.2432 0.7568
0.5366 0.4634
0.2714 0.7286
0.4093 0.5907
0.2557 0.7443
0.1775 0.8225
0.2654 0.7346


This model is not perfect, and would really need to be tested more rigorously (via holdout testing, k-fold cross validation, etc.) to determine how well it approximates the data.
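
As a rough illustration of the holdout idea (a sketch only; with this few observations the estimate will be quite noisy, and it assumes both classes land in the training split):


% Randomly hold out roughly 30% of the observations
HoldOut = rand(size(Y)) < 0.3;

% Fit on the remaining data and score the held-out cases
WFit = LDA(X(~HoldOut,:),Y(~HoldOut));
LTest = [ones(sum(HoldOut),1) X(HoldOut,:)] * WFit';
[MaxScore ColumnIndex] = max(LTest,[],2);
HoldOutAccuracy = mean((ColumnIndex - 1) == Y(HoldOut))
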

I will not demonstrate its use here, but the LDA routine offers a facility for modifying the prior probabilities. Briefly, the function assumes that the true distribution of classes is whatever it observes in the training data. Analysts, however, may wish to adjust this distribution for several reasons, and the third, optional, parameter allows this. Note that the LDA routine presented here always performs the adjustment for prior probabilities: some statistical software drops this adjustment altogether when the user specifies that classes are equally likely, and such software will produce results which differ from LDA's.
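
For completeness, a call with user-supplied priors might look like the following. This is only a sketch: it assumes the third argument is a vector of prior probabilities, one entry per class, in the same order as unique().


% Hypothetical call asserting equally likely classes (assumed parameter format)
WEqualPriors = LDA(X,Y,[0.5 0.5]);
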


Closing Thoughts

Though it employs a fairly simple model structure, LDA has held up reasonably well, sometimes still besting more complex algorithms. When its assumptions are met, the literature records it performing better than logistic regression. It is very fast to execute and fitted models are extremely portable: even a spreadsheet will support linear models (...or, one supposes, paper and pencil!). LDA is at least worth trying at the beginning of a project, if for no other reason than to establish a lower bound on acceptable performance.


See Also

Feb-16-2010 posting, Single Neuron Training: The Delta Rule
Mar-15-2009 posting, Logistic Regression

Sunday, September 12, 2010

Reader Question: Putting Entropy to Work

Introduction

In response to my Nov-10-2006 posting, Introduction To Entropy, an anonymous reader asked:

Can we use entropy for distinguishing random signals and deterministic signal? Lets say i generate two signals in matlab. First signal using sin function and second using randn function. Can we use entropy to distinguish between these two signal?

The short answer is: Yes, we can use entropy for this purpose, although even simpler summary statistics would reveal that the normally distributed randn data included values outside of -1..+1, while the sin data did not.
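
For reference, that simpler check might look something like the following sketch (the signal forms and lengths are assumed, since the reader did not specify them):


% Quick range check: the randn signal strays outside [-1, +1], the sine cannot
SineSignal = sin(0.05 * (1:1000)');
NoiseSignal = randn(1000,1);
[min(SineSignal) max(SineSignal)]
[min(NoiseSignal) max(NoiseSignal)]
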

In this article, I will be using my own entropy calculating routines, which can be found on MATLAB Central: Entropy, JointEntropy, ConditionalEntropy and MutualInformation.


A Slightly Harder Problem

To illustrate this application of entropy, I propose a slightly different problem, in which the sine data and the random data share the same distribution. To achieve this, the "random" data will be a random sample from the sine function:


>> X = [1:1000]';
>> Sine = sin(0.05 * X);
>> RandomData = sin(2 * pi * rand(size(X)));


As a quick check on the distributions, we will examine their respective histograms:


>> figure
>> subplot(2,1,1), hist(Sine), xlabel('Sine Value'), ylabel('Frequency'), grid on
>> subplot(2,1,2), hist(RandomData), xlabel('RandomData Value'), ylabel('Frequency'), grid on



[Figure: histograms of the Sine and RandomData values produced by the code above.]


More or less, they appear to match.


A First Look, Using Entropy

At this point, the reader may be tempted to calculate the entropies of the two distributions, and compare them. Since their distributions (as per the histograms) are similar, we should expect their entropies to also be similar.

To date, this Web log has only dealt with discrete entropy, yet our data is continuous. While there is a continuous entropy, we will stick with the simpler (in my opinion) discrete entropy for now. This requires that the real-valued numbers of our data be converted to symbols. We will accomplish this via quantization ("binning") to 10 levels:


>> Sine10 = ceil(10 * (Sine + 1) / 2);
>> RandomData10 = ceil(10 * (RandomData + 1) / 2);


If the MATLAB Statistics Toolbox is installed, one can check the resulting frequencies thus:


>> tabulate(Sine10)
  Value    Count    Percent
      1      205     20.50%
      2       91      9.10%
      3       75      7.50%
      4       66      6.60%
      5       60      6.00%
      6       66      6.60%
      7       66      6.60%
      8       75      7.50%
      9       91      9.10%
     10      205     20.50%
>> tabulate(RandomData10)
  Value    Count    Percent
      1      197     19.70%
      2       99      9.90%
      3       84      8.40%
      4       68      6.80%
      5       66      6.60%
      6       55      5.50%
      7       68      6.80%
      8       67      6.70%
      9       82      8.20%
     10      214     21.40%


It should be noted that other procedures could have been used for the signal-to-symbol conversion. For example, bin frequencies could have been made equal. The above method was selected because it is simple and requires no Toolbox functions. Also, other numbers of bins could have been utilized.
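
For example, an equal-frequency version of the binning might be sketched as follows (base MATLAB only; ties are ignored here):


% Equal-frequency binning: rank the values, then cut the ranks into 10 bins
[SortedSine SortIndex] = sort(Sine);
SineEqualFreq = zeros(size(Sine));
SineEqualFreq(SortIndex) = ceil(10 * (1:length(Sine))' / length(Sine));
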

Now that the data is represented by symbols, we may check the earlier assertion regarding similar distributions yielding similar entropies (measured in bits per observation):


>> Entropy(Sine10)

ans =

3.1473

>> Entropy(RandomData10)

ans =

3.1418


As these are sample statistics, we would not expect them to match exactly, but these are very close.


Another Perspective

One important aspect of the structure of a sine curve is that it varies over time (or whatever the domain is). This means that any given sine value is typically very similar to those on either side. With this in mind, we will investigate the conditional entropy of each of these two signals versus themselves, lagged by one observation:


>> ConditionalEntropy(Sine10(2:end),Sine10(1:end-1))

ans =

0.6631

>> ConditionalEntropy(RandomData10(2:end),RandomData10(1:end-1))

ans =

3.0519


Ah! Notice that the entropy of the sine data, given knowledge of its immediate predecessor, is much lower than the entropy of the random data, given its immediate predecessor. These data are indeed demonstrably different in terms of their behavior over time, despite sharing the same distribution.

An astute reader may at this point notice that the conditional entropy of the random data, given 1 lagged value, is less than the entropy of the raw random data. This is an artifact of the finite number of samples and the quantization process. Given more observations and a finer quantization, this discrepancy between sample statistics and population statistics will shrink.
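
To see this effect shrink, one might repeat the exercise with a longer random sample (a sketch; exact values will vary from run to run):


% With more observations, the conditional entropy of the random data should
% land much closer to its unconditional entropy
BigRandom = sin(2 * pi * rand(100000,1));
BigRandom10 = ceil(10 * (BigRandom + 1) / 2);
Entropy(BigRandom10)
ConditionalEntropy(BigRandom10(2:end),BigRandom10(1:end-1))
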

Entropy could have been applied to this problem in other ways, too. For instance, one might calculate entropy over short time windows. I would point out that other, more traditional procedures might be used instead, such as calculating the auto-correlation at lag 1. It is worth seeing how entropy adds to the analyst's toolbox, though.
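
For comparison, the lag-1 auto-correlation route might look like this (a sketch using corrcoef from base MATLAB):


% Lag-1 auto-correlation: near 1.0 for the smoothly varying sine,
% near 0.0 for the random data
RSine = corrcoef(Sine(2:end),Sine(1:end-1));
RRand = corrcoef(RandomData(2:end),RandomData(1:end-1));
[RSine(1,2) RRand(1,2)]
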

Further Reading

See also the Apr-01-2009 posting, Introduction to Conditional Entropy.

Print:

The Mathematical Theory of Communication by Claude Shannon (ISBN 0-252-72548-4)

Elements of Information Theory by Cover and Thomas (ISBN 0-471-06259)

Sunday, February 28, 2010

Putting PCA to Work

Context

The last posting to this Web log, Principal Components Analysis (Feb-26-2010), gave an overview of principal component analysis (PCA), and how to effect it within MATLAB. This article will cover three uses of PCA: 1. pre-processing for empirical modeling, 2. data compression and 3. noise suppression.

To serve the widest possible audience, this article will conduct PCA using only base MATLAB functions, but realize that users with the Statistics Toolbox have, as mentioned in the last posting, the option of using tools like princomp and zscore.

We will continue to use the very small data set used in the last article:


>> A = [269.8 38.9 50.5
272.4 39.5 50.0
270.0 38.9 50.5
272.0 39.3 50.2
269.8 38.9 50.5
269.8 38.9 50.5
268.2 38.6 50.2
268.2 38.6 50.8
267.0 38.2 51.1
267.8 38.4 51.0
273.6 39.6 50.0
271.2 39.1 50.4
269.8 38.9 50.5
270.0 38.9 50.5
270.0 38.9 50.5
];


We calculate the sample parameters, and standardize the data table:


>> [n m] = size(A)

n =

15


m =

3

>> AMean = mean(A)

AMean =

269.9733 38.9067 50.4800

>> AStd = std(A)

AStd =

1.7854 0.3751 0.3144


>> B = (A - repmat(AMean,[n 1])) ./ repmat(AStd,[n 1])

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636

Now that the data is centered to mean 0.0 and scaled to standard deviation 1.0, we perform the eigenanalysis of the sample covariance matrix to determine the coefficient matrix which generates the principal components:


>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587


D =

0.0066 0 0
0 0.1809 0
0 0 2.8125


Recall that the MATLAB eig function orders the principal components from last to first when reading the columns from left to right. The matrix V contains the linear coefficients for the principal components. The diagonal of matrix D contains the variances of the principal components. So far, we have accomplished the principal components analysis itself. To put the PCA to use, we will want to know what proportion of the total variance each principal component represents. We can do this by extracting and normalizing the diagonal of matrix D (we use flipud because the principal components are stored in "reverse" order):


>> cumsum(flipud(diag(D))) / sum(diag(D))

ans =

0.9375
0.9978
1.0000


We interpret the above column of numbers to mean that the first principal component contains 93.75% of the total variance of the original data, the first two principal components together contain 99.78%, and of course all principal components taken together contain all of the variance (exactly as much as in the original standardized data).

Last, to calculate the principal components themselves, simply multiply the standardized data by the coefficient matrix:


>> PC = B * V

PC =

-0.0571 -0.0003 0.1026
-0.1277 -0.1226 -2.5786
0.0157 0.0543 0.0373
0.0536 0.1326 -1.7779
-0.0571 -0.0003 0.1026
-0.0571 -0.0003 0.1026
0.0704 -1.4579 0.5637
-0.1495 0.1095 1.6299
0.1041 0.2496 3.1841
0.0319 0.3647 2.4306
0.1093 0.2840 -3.1275
0.0892 0.2787 -0.8467
-0.0571 -0.0003 0.1026
0.0157 0.0543 0.0373
0.0157 0.0543 0.0373


To verify the condensing of the variance, calculate the sample variances:


>> var(PC)

ans =

0.0066 0.1809 2.8125


Again, note that the first principal component appears in the last column when using MATLAB's eig function, and columns to the left have less and less variance until the last principal component, stored in the first column.


Application: Pre-processing Data for Empirical Modeling

This application of PCA is simple: calculate the principal components and choose from them, rather than the original data, to construct the empirical model (regression, neural network, etc.). The hoped-for advantage of doing this is that, since PCA squeezes information into a subset of the new variables, fewer of them will be necessary to construct the model. In fact, it would not be unreasonable to simply step through the first so many principal components to build the model: first, use just the first principal component, then try the first and second, then the first, second and third, etc. A nice side benefit is that all the principal components are uncorrelated with each other.
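
Continuing the small example above, selecting the first k principal components as candidate model inputs might look like this (a sketch; recall that eig stores the first principal component in the last column):


% Re-order so that the first principal component sits in column 1
PCOrdered = fliplr(PC);

% Keep only the first k principal components as inputs for a subsequent model
k = 2;
ModelInputs = PCOrdered(:,1:k);
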

As was mentioned in the last article, this may or may not work well, for several reasons: PCA may not be able to squeeze the variance much if the original variables are already largely uncorrelated with one another. Also, statistical variance may not be the same thing as "information" for the purposes of model building. Last, even if this process works, one is left with the reality that PCA needs all of the original variables to calculate the principal components, even if only a subset of the principal components is ultimately used. Regardless, this is a data processing technique which can yield benefit, so it is worth trying.


Application: Data Compression

PCA offers a mechanism for performing lossy data compression. When data compression is "lossy", it may not return exactly the original data. The trade-off is that much greater compression can be achieved than with "lossless" data compression (compression in which the original data is returned exactly). In many cases, such as audio (MP3) and images (JPEG), some loss in fidelity is acceptable and greater compression is very much desired.

All compression schemes rely on the discovery of regularities within the data. In the case of PCA, the regularity is a linear relationship among the variables. To the extent that PCA finds this relationship, the data may be compressed. The idea is to discard the last principal components (those exhibiting the least variance).

In MATLAB, this means simply dropping the columns representing the unwanted principal components. In this case, we will retain only the first principal component:


>> VReduced = V(:,3)

VReduced =

-0.5825
-0.5904
0.5587

>> PCReduced = B * VReduced

PCReduced =

0.1026
-2.5786
0.0373
-1.7779
0.1026
0.1026
0.5637
1.6299
3.1841
2.4306
-3.1275
-0.8467
0.1026
0.0373
0.0373


Decompression is accomplished by inverting the process, which we can do by transposing the coefficient vector and multiplying:


>> PCReduced * VReduced'

ans =

-0.0598 -0.0606 0.0573
1.5020 1.5224 -1.4406
-0.0217 -0.0220 0.0209
1.0356 1.0497 -0.9933
-0.0598 -0.0606 0.0573
-0.0598 -0.0606 0.0573
-0.3284 -0.3328 0.3150
-0.9494 -0.9623 0.9106
-1.8547 -1.8799 1.7789
-1.4158 -1.4351 1.3580
1.8217 1.8465 -1.7473
0.4932 0.4999 -0.4730
-0.0598 -0.0606 0.0573
-0.0217 -0.0220 0.0209
-0.0217 -0.0220 0.0209

The result is not exactly the same as the original standardized data, but it is pretty close. We "un-standardize" by reversing the original standardization step:


>> Z = ((PCReduced * VReduced') .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])

Z =

269.8667 38.8840 50.4980
272.6550 39.4777 50.0270
269.9345 38.8984 50.4866
271.8223 39.3004 50.1677
269.8667 38.8840 50.4980
269.8667 38.8840 50.4980
269.3870 38.7818 50.5790
268.2783 38.5457 50.7663
266.6619 38.2016 51.0393
267.4455 38.3684 50.9070
273.2259 39.5992 49.9306
270.8539 39.0942 50.3313
269.8667 38.8840 50.4980
269.9345 38.8984 50.4866
269.9345 38.8984 50.4866


Again, the result is quite similar to the original, though not identical: about 94% of the variance has been preserved, and we have compressed the data to 33% of its original size.

The trade-off here is between compression (count of principal components retained) and compression fidelity (the variance preserved). In a typical application, there will be more variables and the variance compression is normally not quite as dramatic as in our illustration. This means that there will be more data compression "levels", represented by the number of principal components retained.


Application: Noise Suppression

Extending the data compression application, we may use PCA for noise suppression. The basic idea is that the variance captured by the least important principal components is noise which should be rejected. Assuming that the variables bear a linear relationship, they will lie on a line (plane, hyperplane), and noise will lift them away from that line. Dropping the last principal components flattens the data in a geometric sense and (hopefully) eliminates some of the noise.

This process is much like the data compression process described in the last section, except: 1. discarded components have their coefficients set to zero instead of being deleted outright and 2. the PCA coefficient matrix and its inverse are multiplied together to allow a single processing step which (again, hopefully) reduces noise in the data.

As before, we calculate the PCA coefficients:


>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587


D =

0.0066 0 0
0 0.1809 0
0 0 2.8125


Deciding to eliminate the last principal component, we set its coefficients to zero:


>> VDenoise = V; VDenoise(:,1) = 0

VDenoise =

0 0.4874 -0.5825
0 0.2963 -0.5904
0 0.8213 0.5587


This matrix will project the standardized data onto a flat surface, in this case a plane, since we have retained 2 dimensions. Not wanting to bother with two steps (mapping into the principal component space and back again), we multiply this matrix by its transpose, which for the full coefficient matrix V would be its inverse:


>> VDenoise = VDenoise * VDenoise'

VDenoise =

0.5769 0.4883 0.0749
0.4883 0.4364 -0.0865
0.0749 -0.0865 0.9867


This magical matrix will, in a single matrix multiplication, denoise the standardized data:


>> B * VDenoise

ans =

-0.0599 -0.0607 0.0570
1.4422 1.4861 -1.5414
0.0047 -0.0060 0.0654
1.1002 1.0890 -0.8844
-0.0599 -0.0607 0.0570
-0.0599 -0.0607 0.0570
-1.0390 -0.7648 -0.8824
-0.8960 -0.9299 1.0005
-1.7330 -1.8060 1.9839
-1.2380 -1.3270 1.6575
1.9601 1.9307 -1.5141
0.6290 0.5825 -0.2442
-0.0599 -0.0607 0.0570
0.0047 -0.0060 0.0654
0.0047 -0.0060 0.0654


Naturally, we still need to multiply back the standard deviation and add back the mean to get to the original scale:


>> Z = ((B * VDenoise) .* repmat(AStd,[n 1])) + repmat(AMean,[n 1])

Z =

269.8664 38.8839 50.4979
272.5483 39.4640 49.9954
269.9817 38.9044 50.5006
271.9377 39.3151 50.2019
269.8664 38.8839 50.4979
269.8664 38.8839 50.4979
268.1183 38.6198 50.2025
268.3736 38.5579 50.7946
266.8791 38.2293 51.1038
267.7630 38.4090 51.0012
273.4731 39.6308 50.0040
271.0964 39.1251 50.4032
269.8664 38.8839 50.4979
269.9817 38.9044 50.5006
269.9817 38.9044 50.5006


The degree of noise reduction is controlled by the number of principal components retained: the fewer principal components retained, the greater the noise reduction. Obviously, like all such schemes, this process has limitations, and the big assumption here is that the original variables are linearly related, so that noise stands out as a departure from that linearity.


Final Thoughts

PCA is a powerful tool, and is quickly computed on current computers, even on fairly large data. While there are limits to what it can do, it is a handy tool which is inexpensive in terms of compute time.


Further Reading

As a general reference on PCA see:

Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)

Note: The first edition is adequate for understanding and coding PCA, and is at present much cheaper than the second or third editions.


The noise suppression application is described in the article, Vectors help make sense of multiple signals, by Sullivan, Personal Engineering and Instrumentation News (Dec-1997), in which it is referred to as subspace projection.

See also the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).

Friday, February 26, 2010

Principal Components Analysis

Introduction

Real-world data sets usually exhibit relationships among their variables. These relationships are often linear, or at least approximately so, making them amenable to common analysis techniques. One such technique is principal component analysis ("PCA"), which rotates the original data to new coordinates, making the data as "flat" as possible.

Given a table of two or more variables, PCA generates a new table with the same number of variables, called the principal components. Each principal component is a linear transformation of the entire original data set. The coefficients of the principal components are calculated so that the first principal component contains the maximum variance (which we may tentatively think of as the "maximum information"). The second principal component is calculated to have the second most variance, and, importantly, is uncorrelated (in a linear sense) with the first principal component. Further principal components, if there are any, exhibit decreasing variance and are uncorrelated with all other principal components.

PCA is completely reversible (the original data may be recovered exactly from the principal components), making it a versatile tool, useful for data reduction, noise rejection, visualization and data compression among other things. This article walks through the specific mechanics of calculating the principal components of a data set in MATLAB, using either the MATLAB Statistics Toolbox, or just the base MATLAB product.


Performing Principal Components Analysis


Performing PCA will be illustrated using the following data set, which consists of 3 measurements taken of a particular subject over time:


>> A = [269.8 38.9 50.5
272.4 39.5 50.0
270.0 38.9 50.5
272.0 39.3 50.2
269.8 38.9 50.5
269.8 38.9 50.5
268.2 38.6 50.2
268.2 38.6 50.8
267.0 38.2 51.1
267.8 38.4 51.0
273.6 39.6 50.0
271.2 39.1 50.4
269.8 38.9 50.5
270.0 38.9 50.5
270.0 38.9 50.5
];


We determine the size of this data set thus:


>> [n m] = size(A)

n =

15


m =

3


To summarize the data, we calculate the sample mean vector and the sample standard deviation vector:


>> AMean = mean(A)

AMean =

269.9733 38.9067 50.4800

>> AStd = std(A)

AStd =

1.7854 0.3751 0.3144


Most often, the first step in PCA is to standardize the data. Here, "standardization" means subtracting the sample mean from each observation, then dividing by the sample standard deviation. This centers and scales the data. Sometimes there are good reasons for modifying or not performing this step, but I will recommend that you standardize unless you have a good reason not to. This is easy to perform, as follows:


>> B = (A - repmat(AMean,[n 1])) ./ repmat(AStd,[n 1])

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636


This calculation can also be carried out using the zscore function from the Statistics Toolbox:


>> B = zscore(A)

B =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636


Calculating the coefficients of the principal components and their respective variances is done by finding the eigenvectors and eigenvalues of the sample covariance matrix:


>> [V D] = eig(cov(B))

V =

0.6505 0.4874 -0.5825
-0.7507 0.2963 -0.5904
-0.1152 0.8213 0.5587


D =

0.0066 0 0
0 0.1809 0
0 0 2.8125


The matrix V contains the coefficients for the principal components. The diagonal elements of D store the variance of the respective principal components. We can extract the diagonal like this:


>> diag(D)

ans =

0.0066
0.1809
2.8125


The coefficients and respective variances of the principal components could also be found using the princomp function from the Statistics Toolbox:


>> [COEFF SCORE LATENT] = princomp(B)

COEFF =

0.5825 -0.4874 0.6505
0.5904 -0.2963 -0.7507
-0.5587 -0.8213 -0.1152


SCORE =

-0.1026 0.0003 -0.0571
2.5786 0.1226 -0.1277
-0.0373 -0.0543 0.0157
1.7779 -0.1326 0.0536
-0.1026 0.0003 -0.0571
-0.1026 0.0003 -0.0571
-0.5637 1.4579 0.0704
-1.6299 -0.1095 -0.1495
-3.1841 -0.2496 0.1041
-2.4306 -0.3647 0.0319
3.1275 -0.2840 0.1093
0.8467 -0.2787 0.0892
-0.1026 0.0003 -0.0571
-0.0373 -0.0543 0.0157
-0.0373 -0.0543 0.0157


LATENT =

2.8125
0.1809
0.0066


Note three important things about the above:

1. The order of the principal components from princomp is the opposite of that from eig(cov(B)). princomp orders the principal components so that the first one appears in column 1, whereas eig(cov(B)) stores it in the last column.

2. Some of the coefficients from each method have the opposite sign. This is fine: There is no "natural" orientation for principal components, so you can expect different software to produce different mixes of signs.

3. SCORE contains the actual principal components, as calculated by princomp.
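
As a quick check on points 1 and 2 (a sketch), reversing the column order of V and dividing elementwise by COEFF should yield columns which are entirely +1 or entirely -1:


% Each column of the ratio should be constant at +1 or -1, confirming that the
% two coefficient sets differ only in column order and sign
fliplr(V) ./ COEFF
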

To calculate the principal components without princomp, simply multiply the standardized data by the principal component coefficients:


>> B * COEFF

ans =

-0.1026 0.0003 -0.0571
2.5786 0.1226 -0.1277
-0.0373 -0.0543 0.0157
1.7779 -0.1326 0.0536
-0.1026 0.0003 -0.0571
-0.1026 0.0003 -0.0571
-0.5637 1.4579 0.0704
-1.6299 -0.1095 -0.1495
-3.1841 -0.2496 0.1041
-2.4306 -0.3647 0.0319
3.1275 -0.2840 0.1093
0.8467 -0.2787 0.0892
-0.1026 0.0003 -0.0571
-0.0373 -0.0543 0.0157
-0.0373 -0.0543 0.0157


To reverse this transformation, simply multiply by the transpose of the coefficient matrix:


>> (B * COEFF) * COEFF'

ans =

-0.0971 -0.0178 0.0636
1.3591 1.5820 -1.5266
0.0149 -0.0178 0.0636
1.1351 1.0487 -0.8905
-0.0971 -0.0178 0.0636
-0.0971 -0.0178 0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177 1.0178
-1.6653 -1.8842 1.9719
-1.2173 -1.3509 1.6539
2.0312 1.8486 -1.5266
0.6870 0.5155 -0.2544
-0.0971 -0.0178 0.0636
0.0149 -0.0178 0.0636
0.0149 -0.0178 0.0636


Finally, to get back to the original data, multiply each observation by the sample standard deviation vector and add the mean vector:


>> ((B * COEFF) * COEFF') .* repmat(AStd,[n 1]) + repmat(AMean,[n 1])

ans =

269.8000 38.9000 50.5000
272.4000 39.5000 50.0000
270.0000 38.9000 50.5000
272.0000 39.3000 50.2000
269.8000 38.9000 50.5000
269.8000 38.9000 50.5000
268.2000 38.6000 50.2000
268.2000 38.6000 50.8000
267.0000 38.2000 51.1000
267.8000 38.4000 51.0000
273.6000 39.6000 50.0000
271.2000 39.1000 50.4000
269.8000 38.9000 50.5000
270.0000 38.9000 50.5000
270.0000 38.9000 50.5000


This completes the round trip from the original data to the principal components and back to the original data. In some applications, the principal components are modified before the return trip.

Let's consider what we've gained by making the trip to the principal component coordinate system. First, more variance has indeed been squeezed into the first principal component, which we can see by taking the sample variances of the principal components:


>> var(SCORE)

ans =

2.8125 0.1809 0.0066


The cumulative variance contained in the first so many principal components can be easily calculated thus:


>> cumsum(var(SCORE)) / sum(var(SCORE))

ans =

0.9375 0.9978 1.0000


Interestingly in this case, the first principal component contains nearly 94% of the variance of the original table. A lossy data compression scheme which discarded the second and third principal components would compress 3 variables into 1, while losing only 6% of the variance.

The other important thing to note about the principal components is that they are completely uncorrelated (as measured by the usual Pearson correlation), which we can test by calculating their correlation matrix:


>> corrcoef(SCORE)

ans =

1.0000 -0.0000 0.0000
-0.0000 1.0000 -0.0000
0.0000 -0.0000 1.0000


Discussion

PCA "squeezes" as much information (as measured by variance) as possible into the first principal components. In some cases the number of principal components needed to store the vast majority of variance is shockingly small: a tremendous feat of data manipulation. This transformation can be performed quickly on contemporary hardware and is invertible, permitting any number of useful applications.

For the most part, PCA really is as wonderful as it seems. There are a few caveats, however:

1. PCA doesn't always work well, in terms of compressing the variance. Sometimes variables just aren't related in a way which is easily exploited by PCA. This means that all or nearly all of the principal components will be needed to capture the multivariate variance in the data, making the use of PCA moot.

2. Variance may not be what we want condensed into a few variables. For example, if we are using PCA to reduce data for predictive model construction, then it is not necessarily the case that the first principal components yield a better model than the last principal components (though it often works out more or less that way).

3. PCA is built from components, such as the sample covariance, which are not statistically robust. This means that PCA may be thrown off by outliers and other data pathologies. How seriously this affects the result is specific to the data and application.

4. Though PCA can cram much of the variance in a data set into fewer variables, it still requires all of the variables to generate the principal components of future observations. Note that this is true, regardless of how many principal components are retained for the application. PCA is not a subset selection procedure, and this may have important logistical implications.
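
To make point 4 concrete, here is a minimal sketch of scoring a new observation with the PCA fitted above (the new measurement values are hypothetical):


% Hypothetical new observation, standardized with the training parameters
NewA = [270.5 39.0 50.4];
NewB = (NewA - AMean) ./ AStd;

% All three original variables are needed, even if only one principal component is kept
NewPC = NewB * COEFF;
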



Further Reading

See also the Feb-28-2010 posting, Putting PCA to Work, and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).


Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)

Note: The first edition is adequate for understanding and coding PCA, and is at present much cheaper than the second or third editions.

Tuesday, February 16, 2010

Single Neuron Training: The Delta Rule

I have recently put together a routine, DeltaRule, to train a single artificial neuron using the delta rule. DeltaRule can be found at MATLAB Central.

This posting will not go into much detail, but this type of model is something like logistic regression: a linear function is calculated on the input variables, then passed through a squashing function (in this case the logistic curve). Such models are most often used to model binary outcomes, hence the dependent variable is normally composed of the values 0 and 1.
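
As a sketch of that structure (the weight values here are hypothetical, not DeltaRule's actual output; see help DeltaRule for the routine's own conventions):


% A single neuron: linear combination of the inputs, passed through the logistic curve
X = [0 0; 0 1; 1 0; 1 1];                 % example inputs
W = [-1.5 1.0 1.0];                       % hypothetical [constant, weights]
NetInput = [ones(size(X,1),1) X] * W';    % linear part
Output = 1 ./ (1 + exp(-NetInput))        % logistic squashing function
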

Single neurons with linear functions (with squashing functions or not) are only capable of separating classes that may be divided by a line (plane, hyperplane), yet they are often useful, either by themselves or in building more complex models.

Use help DeltaRule for syntax and a simple example of its use.

Anyway, I thought readers might find this routine useful. It trains quickly and the code is straightforward (I think), making modification easy. Please write to let me know if you do anything interesting with it.

If you are already familiar with simple neural models like this one, here are the technical details:

Learning rule: incremental delta rule
Learning rate: constant
Transfer function: logistic
Exemplar presentation order: random, by training epoch
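
For reference, a single incremental update under these settings looks roughly like the following (a sketch of the standard formula, not necessarily DeltaRule's exact code):


% One incremental delta-rule update for a logistic neuron
eta = 0.1;                                % constant learning rate (assumed value)
x = [1 0.3 0.7];                          % one exemplar, with a leading 1 for the constant term
t = 1;                                    % target value
w = [-1.5 1.0 1.0];                       % hypothetical current weights
y = 1 / (1 + exp(-(x * w')));             % logistic output
w = w + eta * (t - y) * y * (1 - y) * x;  % adjust the weights to reduce the squared error
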

See also the Mar-15-2009 posting, Logistic Regression and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).