The two most common general types of abstract data are numeric data and class data. Basic summaries of numeric data, like the mean and standard deviation, are so common that they have even been encapsulated as functions in spreadsheet software.
Class (or categorical) data is also very common. Class data has a finite number of possible values, and those values are unordered. Blood type, ZIP code, ice cream flavor and marital status are good examples of class variables. Importantly, note that even though ZIP codes, for instance, are "numbers", the magnitude of those numbers is not meaningful.
For the purposes of this article, I'll assume that classes are coded as integers (1 = class 1, 2 = class 2, etc.), and that the numeric order of these codes is meaningless.
How can class data be summarized? One very simple class summary is a list of distinct values, which MATLAB provides via the unique function. Slightly more complex is the frequency table (distinct values and their counts), which can be generated using the tabulate function from the Statistics Toolbox, or one's own code in base MATLAB, using unique:
>> MyData = [1 3 2 1 2 1 1 3 1 1 2]';
>> UniqueMyData = unique(MyData)
UniqueMyData =
1
2
3
>> nUniqueMyData = length(UniqueMyData)
nUniqueMyData =
3
>> FreqMyData = zeros(nUniqueMyData,1);
>> for i = 1:nUniqueMyData, FreqMyData(i) = sum(double(MyData == UniqueMyData(i))); end
>> FreqMyData
FreqMyData =
6
3
2
The closest thing to an "average" for class variables is the mode, which is the value (or values) appearing most frequently and can be extracted from the frequency table.
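Continuing the example above, the mode (or modes, in the case of a tie) can be pulled from the frequency table in one line; this quick sketch reuses the UniqueMyData and FreqMyData variables computed earlier:

>> ModeMyData = UniqueMyData(FreqMyData == max(FreqMyData))
ModeMyData =
1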
A number of measures have been devised to assess the "amount of mix" found in class variables. One of the most important of these is entropy, which is roughly analogous to the "spread" of a numeric variable.
I will define entropy for two-valued variables, but it is worth first conducting a thought experiment to understand how such a measure should work. Consider a variable which assumes only two values, like a coin toss. The outcome of a series of coin tosses is a variable which can be characterized by its probability of coming up "heads". If the probability is 0.0 ("tails" every time) or 1.0 (always "heads"), then there isn't any mix of values at all. The maximum mix occurs when the probability of "heads" is 0.5 (a fair coin toss). Let's assume that our measure of mixture varies on a scale from 0.0 ("no mix") to 1.0 ("maximum mix"). This means that our measurement function would yield a 0.0 at a probability of 0.0 (pure tails), rise to 1.0 at a probability of 0.5 (maximum impurity), and fall back to 0.0 at a probability of 1.0 (pure heads).
Entropy is exactly such a measure. It was devised in the late 1940s by Claude Shannon when he invented information theory (then known as communication theory). Entropy can be applied to variables with more than two values, but graphing the two-value case is much more intuitive.
The formula for entropy in the case of a two-valued variable is as follows:
entropy = -( p * log(p) + (1-p) * log(1-p) )
...where p is the probability of one class (it doesn't matter which one).
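To see the curve described in the thought experiment above, a few lines of MATLAB suffice. This is just a quick sketch: it uses base-2 logarithms so the peak sits at 1 bit, and the endpoints are nudged away from 0 and 1 to avoid log(0).

>> p = linspace(0.001,0.999,999);
>> H = -( p .* log2(p) + (1-p) .* log2(1-p) );
>> plot(p,H), xlabel('p'), ylabel('entropy (bits)')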
Most often, base 2 logarithms are used, which means that the calculated entropy is in units of bits. Generally, the more distinct values in the variable, and the more evenly they are distributed, the greater the entropy. As an example, variables with completely even distributions of values possess 1 bit of entropy when there are 2 values, 2 bits when there are 4 values and 3 bits when there are 8 values. The less evenly those values are distributed, the lower the entropy. For instance, a 4-valued variable with a 40%/20%/20%/20% split has an entropy of 1.92 bits.
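These figures are easy to verify. For instance, the 1.92-bit example works out as follows (a quick check, with P holding the class probabilities):

>> P = [0.4 0.2 0.2 0.2];
>> EntropyP = -sum(P .* log2(P))
EntropyP =
1.9219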
Entropy is frequently used in machine learning and data mining algorithms for things like feature selection or evaluating splits in decision trees.
I have written a MATLAB routine to calculate the entropy of sample data (see details in help Entropy):
Entropy
This routine calculates the entropy of each column in the provided matrix, and will handle more than 2 distinct values per variable.
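For readers who want to see the idea before downloading, here is a minimal sketch of the same computation in base MATLAB. This is an illustration only, not the File Exchange code; the function name EntropySketch is mine.

function H = EntropySketch(X)
% EntropySketch  Sample entropy, in bits, of each column of matrix X.
%   Only symbols which actually appear in a column contribute, which
%   sidesteps the 0*log2(0) issue.
[n, nVars] = size(X);
H = zeros(1,nVars);
for j = 1:nVars
    Symbols = unique(X(:,j));                   % distinct values in this column
    p = zeros(numel(Symbols),1);
    for k = 1:numel(Symbols)
        p(k) = sum(X(:,j) == Symbols(k)) / n;   % relative frequency
    end
    H(j) = -sum(p .* log2(p));                  % entropy in bits
end

Applied to the MyData example above, this returns about 1.44 bits.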
Further Reading / References
See also my postings of Apr-01-2009, Introduction to Conditional Entropy, and of Sep-12-2010, Reader Question: Putting Entropy to Work.
Note that, for whatever historical reason, entropy is typically labeled H in formulas, i.e.: H(X) means "entropy of X".
Print:
The Mathematical Theory of Communication by Claude Shannon (ISBN 0-252-72548-4)
Elements of Information Theory by Cover and Thomas (ISBN 0-471-06259)
On-Line:
Information Theory lecture notes (PowerPoint), by Robert A. Soni
course slides (PDF), by Rob Malouf
Information Gain tutorial (PDF), by Andrew Moore
14 comments:
Hi Will, you can also do the following to get the same answer:
MyData = [1 3 2 1 2 1 1 3 1 1 2]';
FreqMyData = histc(MyData,unique(MyData))
Regards,
Eric Sampson
The MathWorks
Thanks!
-Will
Notice that when you code the entropy function, you have to make the assumption that 0*log(0) = 0 (since log(0) is -Inf).
Yes, or avoid the issue by only calculating probabilities for symbols which actually appear in the variable. I used unique() in my code to get a list of those symbols for exactly that reason.
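For anyone computing entropy directly from a probability vector that may contain zeros, a common guard is to drop the zero entries before taking logs (equivalent to treating 0*log(0) as 0). A small sketch:

>> P = [0.5 0.5 0];
>> P = P(P > 0);
>> H = -sum(P .* log2(P))
H =
1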
Hi, I can't find the function called MutualInformation that is needed with ConditionalEntropy. Can you please post a link to it? I have a table of 31 columns with 500 rows, and I want to find conditional information gain on certain attributes. Thanks
Hi Will, I admire the kind of work you do and appreciate your solutions. I am not able to find MutualInformation as you mentioned. Can you please put it online?
Hi Will,
Can we use entropy to distinguish random signals from deterministic signals? Let's say I generate two signals in MATLAB: the first using the sin function and the second using the randn function. Can we use entropy to distinguish between these two signals?
Thanks
Yes, anonymous, we can, and I explain fully in my Sep-12-2010 posting, Reader Question: Putting Entropy to Work
Yes, Compuviz, you can find all 4 of my entropy functions at MATLAB Central's File Exchange: Entropy, JointEntropy, ConditionalEntropy and MutualInformation.
In addition to histc, here's another handy MATLAB shortcut.
FreqMyData = accumarray(MyData,ones(size(MyData)))
Thanks for the tip, Nick.
Hi, thanks so much; your posts always help me.
I need mutual information or information gain for continuous variables; can you help me, please?
thanks sir, useful for my assignment
Hi. I used your function and noticed that entropy values are exceeding 1. H lies between 0 and 1 inclusive, right?