Every month, TIOBE Software publishes its TIOBE Programming Community Index, a measure of programming language popularity based on a variety of sources. The current list, the TIOBE Programming Community Index for July 2009, shows MATLAB entering the Top 20 (for the first time, I believe). While no such ranking is likely to be perfect, TIOBE seems to be one of the more comprehensive efforts of this sort.
I encourage readers to visit the TIOBE site because it is interesting to know what other tools are available, and what new languages are emerging. For interested parties, the July 2009 Top 20 are, in descending order of popularity:
1. Java
2. C
3. C++
4. PHP
5. (Visual) Basic
6. C#
7. Python
8. Perl
9. JavaScript
10. Ruby
11. Delphi
12. PL/SQL
13. SAS
14. RPG (OS/400)
15. Pascal
16. ABAP
17. Lisp/Scheme
18. D
19. Lua
20. MATLAB
Friday, July 24, 2009
Friday, April 03, 2009
MATLAB Image Basics
Introduction
One nice feature of MATLAB is its provision of handy functions which are not part of the programming language proper. An excellent example of this is its support for images. The base MATLAB product provides routines for loading raster images from disk, manipulating them, displaying them and storing them back to disk. While it's true that one can find code libraries to perform these functions for other programming languages, like C++, the MATLAB model offers several advantages, not the least of which is standardization. If I write image-handling code in MATLAB, I know that every other MATLAB user on Earth can run my code without modification or the need for extra header files, libraries, etc. This article will serve as a brief introduction to the use of image data within MATLAB.
Image Data
Images are nearly always stored in digital computers in one of two forms: vector or raster. Vector images store line drawings (dots, line segments, polygons) defined by the spatial coordinates of their end points or vertices, and are most often used these days in artistic settings. A raster image is simply a 2-dimensional array of colored pixels (represented by numbers). This article will concentrate on the much more common raster form.
Raster images, being arrays of numbers, are a natural fit for MATLAB, and indeed MATLAB is a convenient tool for applications such as image processing. Raster images always have 2 spatial dimensions (horizontal and vertical), and 1 or more color planes. Typically, grayscale images are stored as a 2-dimensional array, representing 1 color plane with values of 0.0 indicating black, 1.0 indicating white and intermediate values indicating various shades of gray. Color images are similar to grayscale images, but are most often stored as a 3-dimensional array, which is really a stack of three 2-dimensional color planes: one for each primary color: red, green and blue ("RGB"). As with grayscale images, values in the RGB color planes represent brightness of each color. Note that when all three color values are the same, the resulting color is a shade of gray.
For the reader's knowledge, there are also indexed images, which will not be covered here; these are full of index numbers (integers) which do not represent colors directly, but instead indicate locations in a palette. Also, brightness values are often stored in files as integers, such as 0 - 255, instead of 0.0 to 1.0.
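To make the array representation concrete, here is a tiny sketch of my own (not part of any toolbox) which builds a 2 x 2 truecolor image by hand; note that the bottom row, where the red, green and blue values are equal, displays as gray:
r = [1.0 0.0; 0.5 0.5];    % red plane
g = [0.0 1.0; 0.5 0.5];    % green plane
b = [0.0 0.0; 0.5 0.5];    % blue plane
C = cat(3,r,g,b);          % stack the three planes into a 2 x 2 x 3 array
image(C)                   % top row: one red and one green pixel; bottom row: medium gray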
Loading Images from Disk
In MATLAB, images are read from disk using the imread function. Using imread is easy. The basic parameters are the location of the image file and the file format:
>> A = imread('c:\QRNG.png','PNG');
>> whos
  Name      Size                 Bytes  Class    Attributes
  A         942x1680x3         4747680  uint8
This image is 942 pixels vertical by 1680 pixels horizontal, with 3 color planes (red, green and blue). Note that the image data has been stored in MATLAB as unsigned 8-bit integers (uint8). Since I often make multiple calculations on images, I typically convert the data type to double-precision real (double) and scale to 0.0 - 1.0 (though this will slow calculation):
>> B = double(A) / 255;
Displaying Images
Showing images on the screen is most easily accomplished using the image function:
image(A)
Grayscale images will display using a default palette, which can be changed via the colormap command:
>> colormap gray
Images are stretched to fill the axes, which may distort their aspect ratio. This can be fixed using:
>> axis equal
...meaning that pixels will use equal scales horizontally and vertically.
Manipulating Images
As arrays, images can be modified using all the fun things we usually do to arrays in MATLAB (subsetting, math operations, etc.). I will mention one other useful base MATLAB tool for image processing: the rgb2hsv function, which converts an RGB image to an HSV one. HSV is a different colorspace (way of representing colors). HSV arrays are similar to RGB arrays, except their 3 color planes are hue, saturation and value (in that order). It is often convenient to work on the value ("brightness") plane, to isolate changes in light/dark from changes in the color. To get back to the land of RGB, use the function hsv2rgb.
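As a quick sketch of the idea (assuming B is the 0.0 - 1.0 double RGB image from above; the 25% boost is an arbitrary choice for illustration), brightening an image via the value plane looks like this:
H = rgb2hsv(B);                       % convert from RGB to HSV
H(:,:,3) = min(1.25 * H(:,:,3), 1);   % brighten the value plane by 25%, capped at 1.0
B2 = hsv2rgb(H);                      % convert back to RGB
image(B2)                             % display the brightened image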
Saving Images to Disk
Images can be saved to disk using the imwrite command. This is essentially the inverse of the imread command:
imwrite(A,'New Image.bmp','BMP')
...with the parameters indicating the array to be saved as an image file, the file location and image file format, in that order.
Note that MATLAB understands images as both 0 - 255 uint8s and 0.0 - 1.0 doubles, so there is no need to reverse this transformation before image storage.
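For example, the scaled double image B from above can be written directly (a sketch; the file name is arbitrary):
imwrite(B,'New Image.png','PNG')    % B holds 0.0 - 1.0 doubles; imwrite scales them back to 0 - 255 itself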
Conclusion
Working on images in MATLAB is very convenient, especially when compared to more general-purpose languages. I urge the reader to check the help facility for the functions mentioned here to learn of further functionality.
Further Reading
For more information on image processing, I recommend either of the following books:
Digital Image Processing (3rd Edition) by Gonzalez and Woods
(ISBN-13: 978-0131687288)
Algorithms for Image Processing and Computer Vision by J. R. Parker
(ISBN-13: 978-0471140566)
Wednesday, April 01, 2009
Introduction to Conditional Entropy
Introduction
In one of the earliest posts to this log, Introduction To Entropy (Nov-10-2006), I described the entropy of discrete variables. Finally, in this posting, I have gotten around to continuing this line of inquiry and will explain conditional entropy.
Quick Review: Entropy
Recall that entropy is a measure of uncertainty about the state of a variable. In the case of a variable which can take on only two values (male/female, profit/loss, heads/tails, alive/dead, etc.), entropy assumes its maximum value, 1 bit, when the probabilities of the two outcomes are equal. A fair coin toss is a high-entropy event: beforehand, we have no idea what will happen. As the probability distribution moves away from a 50/50 split, entropy decreases, since there is less uncertainty about the outcome. Recall, too, that entropy decreases regardless of which outcome class becomes the more probable: an unfair coin toss with a heads / tails probability distribution of 0.10 / 0.90 has exactly the same entropy as another unfair coin with a distribution of 0.90 / 0.10. It is the distribution of probabilities, not their order, which matters when calculating entropy.
Extending these ideas to variables with more than 2 possible values, we note that generally, distributions which are evenly spread among the possible values exhibit higher entropy, and those which are more concentrated in a subset of the possible values exhibit lower entropy.
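A quick numerical sketch (base MATLAB only, not my entropy routines) makes the point for a four-valued variable:
>> P1 = [0.25 0.25 0.25 0.25];      % probabilities spread evenly over 4 values
>> P2 = [0.85 0.05 0.05 0.05];      % probabilities concentrated on 1 of the 4 values
>> -sum(P1 .* log2(P1))             % entropy of the even distribution
ans =
     2
>> -sum(P2 .* log2(P2))             % entropy of the concentrated distribution
ans =
    0.8476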
Conditional Entropy
Entropy, in itself, is a useful measure of the mixture of values in variables, but we are often interested in characterizing variables after they have been conditioned ("modeled"). Conditional entropy does exactly that, by measuring the entropy remaining in a variable, after it has been conditioned by another variable. I find it helpful to think of conditional entropy as "residual entropy". Happily for you, I have already assembled a MATLAB function, ConditionalEntropy, to calculate this measure.
As an example, think of sporting events involving pairs of competitors, such as soccer ("football" if you live outside of North America). Knowing nothing about a particular pair of teams (and assuming no ties), the best probability we may assess that any specific team, say, the Black Eagles (Go, Beşiktaş!), will win is:
p(the Black Eagles win) = 0.5
The entropy of this variable is 1 bit- the maximum uncertainty possible for a two-category variable. We will attempt to lower this uncertainty (hence, lowering its conditional entropy).
In some competitive team sports, it has been demonstrated statistically that home teams have an advantage over away teams. Among the most popular sports, the home advantage appears to be largest for soccer. One estimate shows that (excluding ties) home teams have historically won 69% of the time, and away teams 31% of the time. Conditioning the outcome on home vs. away, we may provide the following improved probability estimates:
p(the Black Eagles win, given that they are the home team) = 0.69
p(the Black Eagles win, given that they are the away team) = 0.31
Ah, now there is somewhat less uncertainty! The entropy for probability 0.69 is 0.8932 bits. The entropy for probability 0.31 (being the same distance from 0.5 as 0.69) is also 0.8932 bits.
Conditional entropy is calculated as the weighted average of the entropies of the various possible conditions (in this case home or away). Assuming that it is equally likely that the Black Eagles play home or away, the conditional entropy of them winning is 0.8932 bits = (0.5 * 0.8932 + 0.5 * 0.8932). As entropy has gone down with our simple model, from 1 bit to 0.8932 bits, we learn that knowing whether the Black Eagles are playing at home or away provides information and reduces uncertainty.
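Here is that arithmetic as a small sketch in base MATLAB (reproducing the numbers above by hand, rather than calling ConditionalEntropy):
>> pHome = 0.69; pAway = 0.31;                               % win probabilities, home and away
>> HHome = -(pHome*log2(pHome) + (1-pHome)*log2(1-pHome))    % entropy of the outcome, given home
HHome =
    0.8932
>> HAway = -(pAway*log2(pAway) + (1-pAway)*log2(1-pAway))    % entropy of the outcome, given away
HAway =
    0.8932
>> 0.5 * HHome + 0.5 * HAway                                 % weight by p(home) = p(away) = 0.5
ans =
    0.8932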
Other variables might be used to condition the outcome of a match, such as number of player injuries, outcome of recent games, etc. We can compare these candidate predictors using conditional entropy. The lower the conditional entropy, the lower the remaining uncertainty. It is even possible to assess a combination of predictors by treating each combination of univariate conditions as a separate condition ("symbol", in information theoretic parlance), thus:
p(the Black Eagles win, given:
that they are the home team and there are no injuries) = ...
that they are the away team and there are no injuries) = ...
that they are the home team and there is 1 injury) = ...
that they are the away team and there is 1 injury) = ...
etc.
My routine, ConditionalEntropy, can accommodate multiple conditioning variables.
Conclusion
The models being developed here are tables which simplistically cross all values of all input variables, but conditional entropy can also, for instance, be used to evaluate candidate splits in decision tree induction, or to assess class separation in discriminant analysis.
Note that, as the entropies calculated in this article are based on sample probabilities, they suffer from the same limitations as all sample statistics (as opposed to population statistics). A sample entropy calculated from a small number of observations will likely not agree exactly with the population entropy.
Further Reading
See also my Sep-12-2010 posting, Reader Question: Putting Entropy to Work.
Print:
The Mathematical Theory of Communication by Claude Shannon (ISBN 0-252-72548-4)
Elements of Information Theory by Cover and Thomas (ISBN 0-471-06259)
Friday, March 27, 2009
L1LinearRegression Code Update
The L-1 regression routine, L1LinearRegression, originally mentioned in the Oct-23-2007 posting, L-1 Linear Regression, has been updated. The old version produced correct results, but the new one is more efficient.
Thanks to reader Andreas Steimer for contacting me about this routine.
Thursday, March 26, 2009
Rexer Analytics' 2009 Data Miner Survey
I'd like to alert readers to Rexer Analytics' 2009 Data Miner Survey. I urge you to participate by visiting the on-line survey at:
Rexer Analytics' 2009 Data Miner Survey
The Access Code is: TW4D2
Your entry is confidential and will help all of us better understand what is happening in the field of data mining.
Thanks!
Friday, March 20, 2009
Status Update: Mar-2009
This is just a short note to let everyone know that I have been working (finally) to restore the broken links on this Web log. I believe that the source code links have all now been fixed.
This log has passed the 100,000 unique visitor mark without fanfare because I was too busy at the time to notice. I am thankful for those who continue to write in response to this Web log, and am glad to learn how many people this helps. The pace of posting has picked up, and I expect this to continue this year.
Sunday, March 15, 2009
Logistic Regression
Introduction
Often, the analyst is required to construct a model which estimates probabilities. This is common in many fields: medical diagnosis (probability of recovery, relapse, etc.), credit scoring (probability of a loan being repaid), sports (probability of a team beating a competitor- wait... maybe that belongs in the "investment" category?).
Many people are familiar with linear regression- why not just use that? There are several good reasons not to do this, but probably the most obvious is that linear models will eventually dip below 0.0 and poke out above 1.0, yielding answers which do not make sense as probabilities.
Many different classification models have been devised which estimate the probability of class membership, such as linear and quadratic discriminant analysis, neural networks and tree induction. The technique covered in this article is logistic regression- one of the simplest modeling procedures.
Logistic Regression
Logistic regression is a member of the family of methods called generalized linear models ("GLM"). Such models include a linear part followed by some "link function". If you are familiar with neural networks, think of "transfer functions" or "squashing functions". So, the linear function of the predictor variables is calculated, and the result of this calculation is run through the link function. In the case of logistic regression, the linear result is run through a logistic function (see figure 1), which runs from 0.0 (at negative infinity) to 1.0 (at positive infinity), rising monotonically along the way and passing through 0.5 when the input value is exactly zero. Among other desirable properties, note that the logistic function only returns values between 0.0 and 1.0. Other GLMs operate similarly, but employ different link functions- some of which are also bounded by 0.0 - 1.0, and some of which are not.
Figure 1: The Most Interesting Part of the Logistic Function
While calculating the optimal coefficients of a least-squares linear regression has a direct, closed-form solution, this is not the case for logistic regression. Instead, some iterative fitting procedure is needed, in which successive "guesses" at the right coefficients are incrementally improved. Again, if you are familiar with neural networks, this is much like the various training rules used with the simplest "single neuron" models. Hopefully, you are lucky enough to have a routine handy to perform this process for you, such as glmfit, from the Statistics Toolbox.
glmfit
The glmfit function is easy to apply. The syntax for logistic regression is:
B = glmfit(X, [Y N], 'binomial', 'link', 'logit');
B will contain the discovered coefficients for the linear portion of the logistic regression (the link function has no coefficients). X contains the predictor data, with examples in rows and variables in columns. Y contains the target variable, usually a 0 or a 1 representing the outcome. Last, the variable N contains the count of events for each row of the example data- most often, this will be a column of 1s, the same size as Y. The count parameter, N, will be set to values greater than 1 for grouped data. As an example, think of medical cases summarized by country: each country will have averaged input values, an outcome which is a rate (between 0.0 and 1.0), and the count of cases from that country. In the event that the counts are greater than one, then the target variable represents the count of target class observations.
Here is a very small example:
>> X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]';
>> Y = [0 0 1 0 0 0 1 1 1 1]';
>> B = glmfit(X, [Y ones(10,1)], 'binomial', 'link', 'logit')
B =
-3.4932
2.9402
The first element of B is the constant term, and the second element is the coefficient for the lone input variable. We apply the linear part of this logistic regression thus:
>> Z = B(1) + X * (B(2))
Z =
-3.4932
-3.1992
-1.4350
-0.5530
-0.2589
0.3291
0.6231
1.5052
2.6813
2.9753
To finish, we apply the logistic function to the output of the linear part:
>> Z = Logistic(B(1) + X * (B(2)))
Z =
0.0295
0.0392
0.1923
0.3652
0.4356
0.5815
0.6509
0.8183
0.9359
0.9514
Despite the simplicity of the logistic function, I built it into a small function, Logistic, so that I wouldn't have to repeatedly write out the formula:
% Logistic: calculates the logistic function of the input
% by Will Dwinnell
%
% Last modified: Sep-02-2006
function Output = Logistic(Input)
Output = 1 ./ (1 + exp(-Input));
% EOF
Conclusion
Though it is structurally very simple, logistic regression still finds wide use today in many fields. It is quick to fit, easy to implement the discovered model and quick to recall. Frequently, it yields better performance than competing, more complex techniques. I recently built a logistic regression model which beat out a neural network, decision trees and two types of discriminant analysis. If nothing else, it is worth fitting a simple model such as logistic regression early in a modeling project, just to establish a performance benchmark for the project.
Logistic regression is closely related to another GLM procedure, probit regression, which differs only in its link function (specified in glmfit by replacing 'logit' with 'probit'). I believe that probit regression has been losing popularity since its results are typically very similar to those from logistic regression, but the formula for the logistic link function is simpler than that of the probit link function.
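For the curious, switching to probit is literally a one-word change (a sketch, reusing X, Y and N from the example above):
B = glmfit(X, [Y N], 'binomial', 'link', 'probit');    % same call as before; only the link name changes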
References
Generalized Linear Models, by McCullagh and Nelder (ISBN-13: 978-0412317606)
See Also
The Apr-21-2007 posting, Linear Regression in MATLAB, the Feb-16-2010 posting, Single Neuron Training: The Delta Rule and the Dec-11-2010 posting, Linear Discriminant Analysis (LDA).
Friday, March 13, 2009
MATLAB 2009a
MATLAB is on the move. Release 2009a brings a number of changes. The behavior of the random number generators in the base product had already begun to change as of the last release, if you hadn't noticed, and several functions (min, max, sum and prod, as well as several of the FFT functions) are now multi-threaded. This release also witnesses several changes to the analytical toolboxes. Among others...
In the Statistics Toolbox...
A Naïve Bayes modeling tool has been added. This is a completely different way of modeling than the other techniques in the Statistics Toolbox. Obviously, the more diverse the set of modeling tools, the better.
Data table joining has been enhanced several ways, including the ability to use multiple keys and different types of joins (inner, outer, etc.).
A number of changes to the tree induction facility (classregtree), including a fix to the quirky splitmin parameter. Now the programmer can specify the minimum number of cases per leaf node, which seems like a better way to control decision tree growth.
There are also new options for model ensembles and performance curve summaries.
In the Curve Fitting Toolbox...
Yippee! There are now functions for surface fitting (functions fit to 2 inputs, instead of just 1). Both interactive and programmatic fitting are available.
In the Parallel Computing Toolbox...
The maximum number of local workers has been increased from 4 to 8.
Tuesday, March 03, 2009
Introduction to the Percentile Bootstrap
Introduction
This article introduces the percentile bootstrap, the simplest of the bootstrap methods. The bootstrap family of techniques is used to establish confidence intervals and to perform hypothesis tests for statistical measures.
Problem Statement and Conventional Solution
One is often required to summarize a set of data, such as the following:
X =
2
10
10
5
8
1
4
9
8
10
The most commonly used summary is the mean, in MATLAB calculated thus:
>> mean(X)
ans =
6.7000
Summaries, however, discard a great deal of information. In any situation, it is helpful to know the quality of our summary. In the case above, we may wonder how far our sample mean is likely to be from the true population mean (the mean of all numbers drawn from the theoretical statistical population). Our sample mean, after all, was calculated from only 10 observations.
We may establish some idea of how far off the sample mean may be from the population mean by calculating the standard error of the sample mean, which is the standard deviation divided by the square root of the sample size. In MATLAB:
>> StandardError = std(X) / sqrt(length(X))
StandardError =
1.0858
Note that there are fancier versions of this calculation, for example for cases in which the population size is finite, or when the standard error of a proportion is being calculated. The standard error gives us an estimate of how far away our sample mean might be from the true population mean, and acts like a z-score: the population mean is within 2 times the standard error of the sample mean about 95% of the time. In the case of our little data set, this would be from 4.5285 (= 6.7000 - 2 * 1.0858) to 8.8715 (= 6.7000 + 2 * 1.0858).
Note that as the number of observations grows, the bottom part of the standard error fraction becomes larger and the standard error decreases. This seems natural enough: with more data, our confidence in our statistic increases.
Complications of the Problem Statement
So far, so good: We may have had to look up the standard error formula in a book, but we have established some sort of parameters as to the certainty of our summary. What if we didn't have such a reference, though? The median, for example, has no such simple formula to establish its certainty. (Actually, I believe there is a formula for the median, but that it is a real bear!) Anyway, there certainly are other measures which we may calculate (even ones which we invent on the spot) for which there are no handy standard error formulas. What to do?
An Alternative Solution: The Bootstrap
Just as we are about to throw up our hands and consider another career, the bootstrap appears. The basic method of the bootstrap is simple: draw many samples with replacement from the original sample ("replicates"), and tabulate the summary statistic calculated on each of those replicate samples. The distribution of those replicated summaries is intended to mimic the distribution being parameterized by the standard error of the mean.
Above, I mentioned that the population mean would be found inside the band from the sample mean minus two times the standard error to the sample mean plus two times the standard error about 95% of the time. The equivalent area in our bootstrap process would be between the 2.5 and 97.5 percentiles of our replicate summaries. We use 2.5 and 97.5 because that leaves a total of 5% outside of the range, half on each end of the spectrum.
An example using the median will illustrate this process. For reference, let's calculate the sample median first:
>> median(X)
ans =
8
Drawing a single sample with replacement can be done in MATLAB by indexing with random integers (the same expression is used in the loop further below):
>> RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)))
RandomSampleWithReplacement =
5
8
1
1
10
10
9
9
5
1
This is our first bootstrap replicate. Now, we calculate our summary on this replicate:
>> median(RandomSampleWithReplacement)
ans =
6.5000
To discern the distribution, though, will require many more replicates. Since the computer is doing all of the work, I generally like to run at least 2000 replicates to give the bootstrap distribution a chance to take shape:
rand('twister',1242)    % Seed the random number generator for repeatability
T = NaN(2000,1);        % Allocate space for the replicated summaries
for i = 1:2000          % The machine's doing the work, so why not?
    RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)));  % Draw a sample with replacement
    T(i) = median(RandomSampleWithReplacement);                            % Calculate the replicated summary
end
Now, estimating where the "real" median (the population median) is likely to be is a simple matter of checking percentiles of our replicated summaries. I have the Statistics Toolbox, so I will cheat by using a function from there:
>> prctile(T,[2.5 97.5])
ans =
3.5000 10.0000
So, our population median is likely to lie between 3.5 and 10. That is a pretty wide range, but this is the consequence of having so little data.
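As a sanity check, the same loop can be pointed at the mean and compared to the standard error interval computed earlier (a sketch; with a different seed the end points will wiggle, but they should land reasonably close to the 4.5285 - 8.8715 band):
TMean = NaN(2000,1);                                                       % Allocate space for the replicated means
for i = 1:2000
    RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)));  % Draw a sample with replacement
    TMean(i) = mean(RandomSampleWithReplacement);                          % This time, replicate the mean
end
prctile(TMean,[2.5 97.5])                                                  % Compare to 4.5285 - 8.8715 from the standard error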
Wrap-Up
The fundamental trade-off of the bootstrap is that one forsakes pat statistical formulas in favor of strenuous computation. In summary:
Good:
-The bootstrap solves many problems not amenable to conventional methods.
-Even in cases where conventional solutions exist, the bootstrap requires no memorization or selection of correct formulas for given situations.
Bad:
-The bootstrap requires considerable numerical computation. Of course, in an era of cheap and powerful computing machinery, this is much less of an issue. Still, if there are many of these to perform...
-The percentile bootstrap presented in this article is known to deviate from theoretically correct answers, though generally in a small way. There are more sophisticated bootstrap procedures which address some of these concerns, though.
This process, owing to its very general nature, can be applied to tasks much more complex than estimating the uncertainty of statistical summaries, such as hypothesis testing and predictive model performance evaluation.
Further Reading
An Introduction to the Bootstrap by Efron and Tibshirani (ISBN 0-412-04231-2)
This is the seminal work in this field. It covers a lot of ground, but is a bit mathematical.
Saturday, February 28, 2009
Getting Data Into and Out of Excel
Introduction
Recently, I needed to assemble a report which is to be generated quarterly and distributed to those unfortunate enough not to have MATLAB. Currently, such reports are distributed as Excel workbooks. Nearly everyone who produces such reports where I work does so by generating tables of numbers and cutting-and-pasting them into Excel (yuck!). As I have a number of these reports to produce, I was motivated to construct a more automatic solution.
Happily, MATLAB can easily get data from or send data to Excel documents. The mechanics of this are not difficult, but I thought that readers might not be aware that this facility exists and just how easy this is to accomplish.
There is just one catch: for users without Excel available to act as a COM server (UNIX users, for example), 'basic' mode is required and functionality will be limited; see help xlsread for details.
Getting Data From Excel in MATLAB
MATLAB's function for extracting data from Excel documents is xlsread. Using it is as simple as this:
[NumericData TextData] = xlsread(FileName,SheetName,CellRange)
...where NumericData and TextData contain the numeric and text data read from the workbook, respectively; and FileName, SheetName and CellRange are the names of the Excel document, sheet name and cell range from which to read.
Often, I find myself needing to read data from growing ranges of cells within Excel spreadsheets. Think of daily rainfall data stored in a single column within a spreadsheet, which periodically has data appended to it. To load such data, simply set the range of cells to be much larger than the existing range: xlsread will ignore the extra empty spreadsheet cells.
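For example, to pull in that rainfall column (a sketch; the file, sheet and range names are made up for illustration):
>> Rainfall = xlsread('Rainfall.xls','DailyData','A2:A65536');   % deliberately over-sized range; trailing empty cells are ignored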
Getting Data Into Excel in MATLAB
Writing data to Excel documents is also quite simple. Just use xlswrite:
xlswrite(FileName,DataArray,SheetName,CellRange)
...where FileName, SheetName and CellRange are the names of the Excel document, sheet name and cell range to which to write, and DataArray contains the data to be written.
Final note
Refer to the help facility for diagnostic and other capabilities of both of these functions. See also wk1read and wk1write to handle the old Lotus 1-2-3 .WK1 format.
Thursday, February 12, 2009
Parallel Programming: Another Look
Introduction
In my last posting, Parallel Programming: A First Look (Nov-16-2008), I introduced the subject of parallel programming in MATLAB. In that case, I briefly described my experiences with the MATLAB Parallel Computing Toolbox, from the MathWorks. Since then, I have been made aware of another parallel programming product for MATLAB, the Jacket Engine for MATLAB, from AccelerEyes.
Jacket differs from the Parallel Computing Toolbox in that Jacket off-loads work to the computer's GPU (graphics processing unit), whereas the Parallel Computing Toolbox distributes work over multiple cores or processors. Each solution has its merits, and it would be worth the time of MATLAB programmers interested in accelerating computation to investigate the nuances of each.
Some History
Having observed the computer hardware industry for several decades now, I have witnessed the arrival and departure of any number of special-purpose add-in cards which have been used to speed up math for things like neural networks, etc. For my part, I have resisted the urge to employ such hardware assistance for several reasons:
First, special hardware nearly always requires special software. Accommodating the new hardware environment with custom software means an added learning curve for the programmer and drastically reduced code portability.
Second, there is the cost of the hardware itself, which was often considerable.
Third, there was the fundamental fact that general-purpose computing hardware was inexorably propelled forward by a very large market demand. Within 2 or 3 years, even the coolest turbo board would be outclassed by new PCs, which didn't involve either of the two issues mentioned above.
Two significant items have emerged in today's computer hardware environment: multi-core processors and high-power graphics processors. Even low-end PCs today sport central processors featuring at least two cores, which you may think of more-or-less as "2 (or more) computers on a single chip". As chip complexity has continued to grow, chip makers like Intel and AMD have fit multiple "cores" on single chips. It is tempting to think that this would yield a direct benefit to the user, but the reality is more subtle. Most software was written to run on single-core computers, and is not equipped to take advantage of the extra computing power of today's multi-core computers. This is where the Parallel Computing Toolbox steps in, by providing programmers a way to distribute the execution of their programs over several cores or processors, resulting in substantially improved performance.
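To give a flavor of what this looks like (a sketch only, assuming the Parallel Computing Toolbox is installed, and reusing the small data set X and the bootstrap loop from my Mar-03-2009 posting; the pool-management syntax shown is the one current as of this writing and may differ in later releases):
matlabpool('open',4)                                 % start 4 local workers
T = NaN(2000,1);
parfor i = 1:2000                                    % iterations are independent, so they may run on any worker
    S = X(ceil(length(X) * rand(length(X),1)));      % draw a bootstrap replicate
    T(i) = median(S);                                % replicated summary
end
matlabpool('close')                                  % shut the workers down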
Similarly, the graphics subsystem in desktop PCs has also evolved to a very sophisticated state. At the dawn of the IBM PC (around 1980), graphics display cards, with few exceptions, basically converted the contents of a section of memory into a display signal usable by a computer monitor. Graphics cards did little more.
Over time, though, greater processing functionality was added to graphics cards, culminating in compute engines which would rival supercomputer-class machines of only a few years ago. This evolution has been fueled by the inclusion of many processing units (today, some cards contain hundreds of these units). Originally designed to perform specific graphics functions, many of these units are now small, somewhat general-purpose computers, and they can be programmed to do things having nothing to do with the image shown on the computer's monitor. Tapping into this power requires some sort of programming interface, though, which is where Jacket comes in.
Caveats
Here is a simple assessment of the pros and cons of these two methods of achieving parallel computing on the desktop:
Multi-Core:
Good:
The required hardware is cheap. If you program in MATLAB, you probably have at least 2 cores at your disposal already, if not more.
Bad:
Most systems top out at 4 cores, limiting the potential speed-up with this method (although doubling or quadrupling performance isn't bad).
GPU:
Good:
The number of processing units which can be harnessed by this method is quite large. Some of the fancier graphics cards have over 200 such units.
Bad:
The required hardware may be a bit pricey, although the price/performance is probably still very attractive.
Most GPUs will only perform single-precision floating-point math, although newer GPUs will also perform double-precision floating-point math.
Moving data from the main computer to the graphics card and back takes time, eating into the potential gain.
Conclusion
My use of the Parallel Computing Toolbox has been limited to certain, very specific tasks, and I have not used Jacket at all. The use of ubiquitous multi-core computers and widely-available GPUs avoids most of the problems I described regarding special-purpose hardware. It will be very interesting to see how these technologies fit into the technological landscape over the next few years, and I am eager to learn of readers' experiences with them.