
Sunday, September 12, 2010

Reader Question: Putting Entropy to Work

Introduction

In response to my Nov-10-2006 posting, Introduction To Entropy, an anonymous reader asked:

Can we use entropy for distinguishing random signals and deterministic signals? Let's say I generate two signals in MATLAB: the first using the sin function and the second using the randn function. Can we use entropy to distinguish between these two signals?

The short answer is: Yes, we can use entropy for this purpose, although even simpler summary statistics would reveal that the normally distributed randn data included values outside of -1..+1, while the sin data did not.
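
For instance, a range check alone would separate the reader's two signals. Here is a minimal sketch (the variable names SineSignal and RandnSignal are just illustrative stand-ins for the reader's data):


>> SineSignal = sin(0.05 * [1:1000]');    % deterministic, bounded by -1..+1
>> RandnSignal = randn(1000,1);           % normally distributed, unbounded
>> [min(SineSignal) max(SineSignal)]      % stays within -1..+1
>> [min(RandnSignal) max(RandnSignal)]    % almost certainly exceeds -1..+1
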

In this article, I will be using my own entropy-calculating routines, which can be found on MATLAB Central: Entropy, JointEntropy, ConditionalEntropy and MutualInformation.


A Slightly Harder Problem

To illustrate this application of entropy, I propose a slightly different problem, in which the sine data and the random data share the same distribution. To achieve this, the "random" data will be a random sample from the sine function:


>> X = [1:1000]';
>> Sine = sin(0.05 * X);
>> RandomData = sin(2 * pi * rand(size(X)));


As a quick check on the distributions, we will examine their respective histograms:


>> figure
>> subplot(2,1,1), hist(Sine), xlabel('Sine Value'), ylabel('Frequency'), grid on
>> subplot(2,1,2), hist(RandomData), xlabel('RandomData Value'), ylabel('Frequency'), grid on



[Figure: histograms of the Sine (top) and RandomData (bottom) values.]


More or less, they appear to match.


A First Look, Using Entropy

At this point, the reader may be tempted to calculate the entropies of the two distributions, and compare them. Since their distributions (as per the histograms) are similar, we should expect their entropies to also be similar.

To date, this Web log has only dealt with discrete entropy, yet our data is continuous. While there is a continuous (differential) entropy, we will stick with the simpler (in my opinion) discrete entropy for now. This requires that the real-valued data be converted to symbols, which we will accomplish via quantization ("binning") to 10 levels:


>> Sine10 = ceil(10 * (Sine + 1) / 2);               % map [-1,+1] onto bins 1..10
>> RandomData10 = ceil(10 * (RandomData + 1) / 2);   % same binning for the random data


If the MATLAB Statistics Toolbox is installed, one can check the resulting frequencies thus:


>> tabulate(Sine10)
  Value    Count   Percent
      1      205     20.50%
      2       91      9.10%
      3       75      7.50%
      4       66      6.60%
      5       60      6.00%
      6       66      6.60%
      7       66      6.60%
      8       75      7.50%
      9       91      9.10%
     10      205     20.50%
>> tabulate(RandomData10)
  Value    Count   Percent
      1      197     19.70%
      2       99      9.90%
      3       84      8.40%
      4       68      6.80%
      5       66      6.60%
      6       55      5.50%
      7       68      6.80%
      8       67      6.70%
      9       82      8.20%
     10      214     21.40%


It should be noted that other procedures could have been used for the signal-to-symbol conversion: bin frequencies could have been made equal, for example, or a different number of bins used. The method above was selected because it is simple and requires no Toolbox functions.
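
For the curious, here is one way the equal-frequency alternative might look (a rough sketch only, also requiring no Toolbox functions; the variable names are illustrative): rank the values, then cut the ranks into 10 equally populated groups.


>> [SortedSine SortIndex] = sort(Sine);
>> Ranks = zeros(size(Sine));
>> Ranks(SortIndex) = (1:length(Sine))';
>> Sine10EqualFreq = ceil(10 * Ranks / length(Sine));


With this binning, tabulate(Sine10EqualFreq) would show roughly 10% of the observations in each bin (ties, if any, are split arbitrarily across bin boundaries).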

Now that the data is represented by symbols, we may check the earlier assertion regarding similar distributions yielding similar entropies (measured in bits per observation):


>> Entropy(Sine10)

ans =

3.1473

>> Entropy(RandomData10)

ans =

3.1418


As these are sample statistics, we would not expect them to match exactly, but they are very close.


Another Perspective

One important aspect of the structure of a sine curve is that it varies smoothly over time (or whatever the domain is), so any given sine value is typically very similar to those on either side. With this in mind, we will investigate the conditional entropy of each of these two signals, given the same signal lagged by one observation:


>> ConditionalEntropy(Sine10(2:end),Sine10(1:end-1))

ans =

0.6631

>> ConditionalEntropy(RandomData10(2:end),RandomData10(1:end-1))

ans =

3.0519


Ah! Notice that the entropy of the sine data, given knowledge of its immediate predecessor, is much lower than the entropy of the random data given its immediate predecessor. These data are indeed demonstrably different in how they behave over time, despite sharing the same distribution.

An astute reader may at this point notice that the conditional entropy of the random data, given one lagged value, is slightly less than the entropy of the raw random data. Since the random values are independent of one another, the two figures should agree in the population; the difference is an artifact of estimating these quantities from a finite sample after quantization. Given more observations (relative to the number of quantization levels), this discrepancy between sample and population statistics will shrink.
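
One quick way to see this, reusing the Entropy and ConditionalEntropy routines and the same 10-level binning (a rough check, not a formal analysis), is to redraw the random sample with far more observations and note that the two figures land much closer together:


>> BigRandomData = sin(2 * pi * rand(100000,1));
>> BigRandomData10 = ceil(10 * (BigRandomData + 1) / 2);
>> Entropy(BigRandomData10)
>> ConditionalEntropy(BigRandomData10(2:end),BigRandomData10(1:end-1))
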

Entropy could have been applied to this problem in other ways, too. For instance, one might calculate entropy over short time windows. Other, more traditional procedures might be used instead, such as calculating the autocorrelation at lag 1 (see the sketch below). Either way, it is worth seeing how entropy adds to the analyst's toolbox.
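
As a sketch of that more traditional check (using only base MATLAB and the variables defined earlier in this post), the lag-1 autocorrelation tells much the same story: close to 1.0 for the slowly varying sine, and close to 0.0 for the random sample.


>> LagCorrSine = corrcoef(Sine(2:end),Sine(1:end-1));
>> LagCorrRandom = corrcoef(RandomData(2:end),RandomData(1:end-1));
>> LagCorrSine(1,2)
>> LagCorrRandom(1,2)
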

Further Reading

See also the Apr-01-2009 posting, Introduction to Conditional Entropy.

Print:

The Mathematical Theory of Communication by Claude Shannon (ISBN 0-252-72548-4)

Elements of Information Theory by Cover and Thomas (ISBN 0-471-06259-6)

Wednesday, April 01, 2009

Introduction to Conditional Entropy

Introduction

In one of the earliest posts to this log, Introduction To Entropy (Nov-10-2006), I described the entropy of discrete variables. Finally, in this posting, I have gotten around to continuing this line of inquiry and will explain conditional entropy.


Quick Review: Entropy

Recall that entropy is a measure of uncertainty about the state of a variable. For a variable which can take on only two values (male/female, profit/loss, heads/tails, alive/dead, etc.), entropy assumes its maximum value, 1 bit, when the two outcomes are equally probable. A fair coin toss is a high-entropy event: beforehand, we have no idea what will happen. As the probability distribution moves away from a 50/50 split, entropy decreases, since there is less uncertainty about the outcome. Recall, too, that entropy decreases regardless of which outcome class becomes the more probable: an unfair coin, with a heads / tails probability distribution of 0.10 / 0.90, has exactly the same entropy as another unfair coin with a distribution of 0.90 / 0.10. It is the distribution of the probabilities, not their order, that matters when calculating entropy.

Extending these ideas to variables with more than 2 possible values, we note that generally, distributions which are evenly spread among the possible values exhibit higher entropy, and those which are more concentrated in a subset of the possible values exhibit lower entropy.
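
To make these statements concrete, here is Shannon's entropy formula, -sum(p .* log2(p)), evaluated directly in MATLAB for a few small examples (plain formulas only; the distributions are made up for illustration):


>> -(0.5 * log2(0.5) + 0.5 * log2(0.5))    % fair coin: 1 bit
>> -(0.1 * log2(0.1) + 0.9 * log2(0.9))    % unfair coin: 0.4690 bits
>> -(0.9 * log2(0.9) + 0.1 * log2(0.1))    % its mirror image: same 0.4690 bits
>> -sum([0.25 0.25 0.25 0.25] .* log2([0.25 0.25 0.25 0.25]))    % evenly spread over 4 values: 2 bits
>> -sum([0.85 0.05 0.05 0.05] .* log2([0.85 0.05 0.05 0.05]))    % concentrated in one value: 0.8476 bits
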


Conditional Entropy

Entropy, in itself, is a useful measure of the mixture of values in variables, but we are often interested in characterizing variables after they have been conditioned ("modeled"). Conditional entropy does exactly that, by measuring the entropy remaining in a variable, after it has been conditioned by another variable. I find it helpful to think of conditional entropy as "residual entropy". Happily for you, I have already assembled a MATLAB function, ConditionalEntropy, to calculate this measure.

As an example, think of sporting events involving pairs of competitors, such as soccer ("football" if you live outside of North America). Knowing nothing about a particular pair of teams (and assuming no ties), the best estimate we can give of the probability that any specific team, say, the Black Eagles (Go, Beşiktaş!), will win is:

p(the Black Eagles win) = 0.5

The entropy of this variable is 1 bit: the maximum uncertainty possible for a two-category variable. We will attempt to lower this uncertainty (hence lowering its conditional entropy).

In some competitive team sports, it has been demonstrated statistically that home teams have an advantage over away teams. Among the most popular sports, the home advantage appears to be largest for soccer. One estimate shows that (excluding ties) home teams have historically won 69% of the time and away teams 31% of the time. Conditioning the outcome on home vs. away, we may provide the following improved probability estimates:

p(the Black Eagles win, given that they are the home team) = 0.69
p(the Black Eagles win, given that they are the away team) = 0.31

Ah, now there is somewhat less uncertainty! The entropy of a 0.69 / 0.31 distribution is 0.8932 bits. The entropy of a 0.31 / 0.69 distribution (the same distance from 0.5) is, naturally, also 0.8932 bits.

Conditional entropy is calculated as the weighted average of the entropies under the various possible conditions (in this case, home or away). Assuming that the Black Eagles are equally likely to play home or away, the conditional entropy of the outcome is 0.5 * 0.8932 + 0.5 * 0.8932 = 0.8932 bits. As entropy has gone down with our simple model, from 1 bit to 0.8932 bits, we learn that knowing whether the Black Eagles are playing at home or away provides information and reduces uncertainty.
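
For readers who want to check the arithmetic, the calculation looks like this in MATLAB (plain formulas, not my routines; the 0.5 weights reflect the assumption that home and away games are equally likely):


>> EntropyHome = -(0.69 * log2(0.69) + 0.31 * log2(0.31))    % 0.8932 bits
>> EntropyAway = -(0.31 * log2(0.31) + 0.69 * log2(0.69))    % 0.8932 bits
>> CondEntWin = 0.5 * EntropyHome + 0.5 * EntropyAway        % 0.8932 bits
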

Other variables might be used to condition the outcome of a match, such as number of player injuries, outcome of recent games, etc. We can compare these candidate predictors using conditional entropy. The lower the conditional entropy, the lower the remaining uncertainty. It is even possible to assess a combination of predictors by treating each combination of univariate conditions as a separate condition ("symbol", in information theoretic parlance), thus:

p(the Black Eagles win, given:
that they are the home team and there are no injuries) = ...
that they are the away team and there are no injuries) = ...
that they are the home team and there is 1 injury) = ...
that they are the away team and there is 1 injury) = ...
etc.

My routine, ConditionalEntropy, can accommodate multiple conditioning variables.


Conclusion

The models being developed here are tables which simplistically cross all values of all input variables, but conditional entropy can also, for instance, be used to evaluate candidate splits in decision tree induction, or to assess class separation in discriminant analysis.

Note that, because the entropies calculated in this article are based on sample probabilities, they suffer from the same limitations as all sample statistics (as opposed to population statistics). A sample entropy calculated from a small number of observations will likely not agree exactly with the population entropy.


Further Reading

See also my Sep-12-2010 posting, Reader Question: Putting Entropy to Work.

Print:

The Mathematical Theory of Communication by Claude Shannon (ISBN 0-252-72548-4)

Elements of Information Theory by Cover and Thomas (ISBN 0-471-06259-6)