Although it was begun in 2006, Data Mining in MATLAB is just now completing its first full calendar year in operation. I want to thank readers who have sent words of encouragement or thanks, and those who have commented or asked questions. Sometimes I post material and wonder if anyone is reading this, so it is nice to receive a favorable response.
All in all, it has been a productive year here, with 27 posts (not counting this one). My only regret is not being more consistent in posting, but, in the interest of quality, I have studiously avoided rushing out material. (There are at least 4 half-finished posts sitting here now; if only it weren't for my darned "day job"!)
I'd like to wish an especially Merry Christmas to Dean Abbott, with whom I co-author Data Mining and Predictive Analytics, and Sandro Saitta, who writes Data Mining Research!
Merry Christmas to all!
Monday, December 24, 2007
Tuesday, October 23, 2007
L-1 Linear Regression
Fitting lines to data is a fundamental part of data mining and inferential statistics. Many more complicated schemes use line-fitting as a foundation, and least-squares linear regression has, for years, been the workhorse technique of the field. Least-squares linear regression fits a line (or plane, hyperplane, etc.) with the minimum possible squared error. I explained the execution of least-squares linear regression in MATLAB in my Apr-21-2007 posting, Linear Regression in MATLAB.
Why least squares?
Least-squares offers a number of esoteric technical strengths, but many students of statistics wonder: "Why least-squares?" The simplest (and most superficial) answer is: "Squaring the errors makes them all positive, so that errors with conflicting signs do not cancel each other out in sums or means". While this is true, squaring should seem an odd way to go about this when taking the absolute values of the errors (simply ignoring the signs) is much more straightforward.
Taking the absolute values of the errors (instead of their squares) leads to an alternative regression procedure, known as least absolute errors regression or L-1 linear regression. Like least-squares linear regression, L-1 linear regression fits a line to the supplied data points. Taking the absolute values seems simpler, so why not use L-1 regression? For that matter, why is least-squares regression so popular, given the availability of a seemingly more natural alternative?
Despite the fact that L-1 regression was developed decades before least-squares regression, least-squares regression is much more widely used today. Though L-1 regression has a few quirks, they are not what is holding it back. The real, secret reason that least squares is favored (the one your stats professor never told you) is:
Least-squares makes the calculus behind the fitting process extremely easy!
That's it. Statisticians will give all manner of rationalizations, but the real reason least-squares regression is in vogue is that it is extremely easy to calculate.
L-1 Regression
There are several ways to perform L-1 regression, and all of them involve more computation than any of the least-squares procedures. Thankfully, we live in an age in which mechanical computation is plentiful and cheap! Also thankfully, I have written an L-1 regression routine in MATLAB, called L1LinearRegression.
L1LinearRegression assumes that an intercept term is to be included and takes two parameters: the independent variables (a matrix whose columns represent the independent variables) and the dependent variable (in a column vector).
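To make this concrete, here is a rough sketch: the call to L1LinearRegression follows the interface described above (its output is assumed to be a coefficient vector with the intercept first), and the iteratively reweighted least squares loop is shown only as one common way such a fit can be computed. I am not claiming that this is how L1LinearRegression works internally.

% Hypothetical data with one gross outlier
X = (1:20)';
Y = 3 + 2 * X + randn(20,1);
Y(20) = 200;                                  % a distant point for least squares to chase

% Least-squares fit, via backslash with an intercept column
BLeastSquares = [ones(size(X,1),1) X] \ Y;

% L-1 fit, via the routine described above (assumed signature)
% B = L1LinearRegression(X,Y);

% One generic alternative: iteratively reweighted least squares
XI = [ones(size(X,1),1) X];
B  = XI \ Y;                                  % start from the least-squares solution
for Iteration = 1:50
    R = max(abs(Y - XI * B), 1e-6);           % residual magnitudes, floored to avoid division by zero
    W = 1 ./ sqrt(R);                         % square roots of the IRLS weights
    B = (repmat(W,1,size(XI,2)) .* XI) \ (W .* Y);
end

The L-1 coefficients should track the bulk of the data, while the least-squares coefficients are pulled toward the outlier.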
L-1 regression is less affected by large errors than least squares regression. The following graph depicts this behavior (click to enlarge):
This example intentionally demonstrates least-squares' slavish chasing of distant data points, but the effect is very real. The biggest drawback of L-1 regression is that it takes longer to run. Unless there are many such regressions to perform, execution time is a small matter, which gets smaller every year that computers get faster. L1LinearRegression runs in about 10 seconds for 100,000 observations with 10 predictors on fast PC hardware.
References
Alternative Methods of Regression, by Birkes and Dodge (ISBN-13: 978-0471568810)
Least absolute deviation estimation of linear econometric models: A literature review, by Dasgupta and Mishra (Jun-2004)
See also
L1LinearRegression Code Update (Mar-27-2009)
Labels: L-1, L1, LAD, LAE, LAR, LAV, least absolute, least squared, least squares, linear regression, mean squared, MSE, regression
Saturday, September 29, 2007
MATLAB 2007b Released
The fall release of MATLAB is out, and while most toolbox updates relevant to data mining are minor, MATLAB itself has seen some big changes. From the MATLAB 7.5 Latest Features page, among other things:
Performance and Large Data Set Handling
* MATLAB arrays no longer limited to 2^31 (~2 x 10^9) elements, allowing many numeric and low-level file I/O functions to support real double arrays greater than 16 GB on 64-bit platforms
* New function maxNumCompThreads, which gets and sets the maximum number of computational threads (a brief usage sketch follows this list)
* Upgraded Linear Algebra Package library (LAPACK 3.1) on all platforms, plus upgraded optimized Basic Linear Algebra Subprogram libraries (BLAS) on Intel processors (MKL 9.1) and on AMD processors (AMCL 3.6)
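A minimal usage sketch of the new threading control (the calling forms shown are taken from the release notes and may differ slightly from the final documentation):

N = maxNumCompThreads;              % query the current maximum number of computational threads
LastN = maxNumCompThreads(2);       % request 2 threads; the previous setting is returned
maxNumCompThreads('automatic');     % hand control of the thread count back to MATLAB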
Readers are strongly encouraged to visit the New Features page for more information.
On A Completely Unrelated Subject...
A few months ago, I attended a one-day presentation, offered locally free of charge by the MathWorks. The session was on algorithm development for C/C++ programmers. Though I seldom program in C and C++ these days (Why would I? I have MATLAB!), the class was very informative. I learned about many features of the MATLAB interface which I had ignored in the past. My suggestion to readers is to consider attending one of these presentations, which you can learn about on the MathWorks Web site.
Sunday, July 29, 2007
Poll Results (Jul-22-2007): Source Data File Formats
After a week, the Source Data File Formats poll, of Jul-22-2007, is complete. The question asked was:
What is the original format of the data you analyze?
Multiple responses were permitted. A total of 33 votes were cast, although the polling system used does not indicate the total number of voters.
In decreasing order of popularity, the results are:
9 votes (27%): MATLAB
7 votes (21%): Text (comma-delimited, tab-delimited, etc.)
7 votes (21%): Other
5 votes (15%): Relational database (Oracle, DB2, etc.)
4 votes (12%): Excel
1 vote ( 3%): Statistical software native format (SPSS, S-Plus, etc.)
I'm a little surprised that relational databases didn't appear more frequently.
No one commented, although I'd be very interested to know what the 'Other' source formats are, since they tied for second place. Anyone?
Sunday, July 22, 2007
Poll (Jul-22-2007): Source Data File Formats
This poll is about the original file format of the data you analyze, not (necessarily) the data which MATLAB directly loads. For example, if your source data originally comes from a relational database, choose "relational database", even though you may export it to a tab-delimited text file first.
Multiple selections are permitted, but choose the file formats you encounter most often.
This poll is closed.
See the poll results in the Jul-29-2007 posting, Poll Results (Jul-22-2007): Source Data File Formats.
Thanks for voting!
Saturday, July 14, 2007
Calculating AUC Using SampleError()
In my last post, ROC Curves and AUC (Jun-20-2007), ROC curves and AUC ("area under the curve") were explained. This post will follow up with a quick demonstration of my SampleError function, and its use in calculating the AUC.
In the following examples, predictive models have been constructed which estimate the probability of a defined event. For each of these (very small) data sets, the model has been executed and its output stored in the variable ModelOutput. After the fact, the actual outcome is recorded in the target variable, DependentVariable.
First, let's try a tiny data set with a model that's nearly random:
>> ModelOutput = [0.1869 0.3816 0.4387 0.4456 0.4898 0.6463 0.7094 0.7547 0.7655 0.7952]'
ModelOutput =
0.1869
0.3816
0.4387
0.4456
0.4898
0.6463
0.7094
0.7547
0.7655
0.7952
>> DependentVariable = [0 1 1 0 0 0 1 0 1 0]'
DependentVariable =
0
1
1
0
0
0
1
0
1
0
This data set is already sorted by the model output, but that is not necessary for the SampleError routine to function properly. A random model does not separate the two classes at all, and has an expected AUC of 0.5.
>> SampleError(ModelOutput,DependentVariable,'AUC')
ans =
0.4583
The sample AUC, 0.4583, is off a bit from the theoretically expected 0.5, due to the extremely small sample size.
Moving to the other extreme, consider the following data set:
>> ModelOutput = [0.1622 0.1656 0.2630 0.3112 0.5285 0.6020 0.6541 0.6892 0.7482 0.7943]'
ModelOutput =
0.1622
0.1656
0.2630
0.3112
0.5285
0.6020
0.6541
0.6892
0.7482
0.7943
>> DependentVariable = [0 0 0 0 0 1 1 1 1 1]'
DependentVariable =
0
0
0
0
0
1
1
1
1
1
Again, the data set is sorted by the model output. It is plain that this model performs perfectly (at least on this data): the classes are entirely separate. Such a model should have an AUC of 1.0.
>> SampleError(ModelOutput,DependentVariable,'AUC')
ans =
1
The final data set exhibits intermediate performance: some class separation is evident, but it is not perfect. The AUC should lie between 0.5 and 1.0.
>> ModelOutput = [0.0782 0.0838 0.1524 0.2290 0.4427 0.4505 0.5383 0.8258 0.9133 0.9961]'
ModelOutput =
0.0782
0.0838
0.1524
0.2290
0.4427
0.4505
0.5383
0.8258
0.9133
0.9961
>> DependentVariable = [0 0 1 0 0 1 0 1 1 1]'
DependentVariable =
0
0
1
0
0
1
0
1
1
1
>> SampleError(ModelOutput,DependentVariable,'AUC')
ans =
0.8400
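As a sanity check on that last value, the standard rank-sum identity for the AUC reproduces it by hand (this is not a look inside SampleError, and it assumes no tied model outputs):

>> [Ignore, Order] = sort(ModelOutput);          % rank the cases by model output
>> Ranks = (1:length(ModelOutput))';
>> Positive = (DependentVariable(Order) == 1);
>> nPos = sum(Positive); nNeg = sum(~Positive);
>> (sum(Ranks(Positive)) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
ans =
0.8400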
Labels: area under the curve, area under the ROC curve, AUC, AUROC, SampleError
Wednesday, June 20, 2007
ROC Curves and AUC
Many classification problems require more than a simple "this class or that" output, and many classification solutions provide estimated probabilities for each possible outcome class. This prompts the question of how to evaluate the performance of classifiers which output probabilities.
Confusion Matrices
Most readers will likely be familiar with concepts like true positives, false positives, etc. Such terms are defined relative to a target class (or class of interest), such as medical patients with cancer, and a background class, such as patients without cancer.
Positive cases are those classified as belonging to the target class, whereas negative cases have been classified as belonging to the background class.
In this context, true indicates that the case was correctly classified, and false indicates that the case was incorrectly classified.
Note that, for any given model and data set, the counts of each of the four possible combinations of predicted and actual may be mapped onto a 2x2 confusion matrix (click to enlarge):
Notice, too, that this framework recognizes two distinct ways to make an error: false positives, which erroneously flag negative cases as positives, and false negatives, which erroneously flag positive cases as negatives.
A warning to the reader: There does not seem to be a consistent convention as to whether the actuals belong on the side of the confusion matrix and predictions across the top, or vice versa. In fact, some graphical representations even invert the vertical axis! To avoid confusion, always check the axis labels when exploring the literature.
This way of organizing model responses permits a variety of performance measures, accuracy (= [TP + TN] / Total) being the most obvious. All such measures, however, require that all predicted cases be divided into predicted positives and predicted negatives. When models generate class probabilities (as opposed to classes) as outputs, some threshold must be chosen above which items are classified as positive.
A variety of mechanisms may be used to select said threshold, from something as simple as just using 0.5, to more sophisticated processes which take into account prior probabilities and the costs associated with different types of errors (false positives and false negatives, for instance, may incur very different costs in real applications).
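As an entirely hypothetical illustration, once a threshold has been chosen, the four cells of the confusion matrix and the accuracy are easy to tally in MATLAB:

% Hypothetical model probabilities, actual 0/1 classes, and a threshold of 0.5
Predicted = ([0.10 0.35 0.62 0.80 0.55 0.20]' >= 0.5);   % predicted positives
Actual    =  [0    0    1    1    0    1]';              % actual classes (1 = target)
TP = sum( Predicted &  Actual);                          % true positives
FP = sum( Predicted & ~Actual);                          % false positives
TN = sum(~Predicted & ~Actual);                          % true negatives
FN = sum(~Predicted &  Actual);                          % false negatives
Accuracy = (TP + TN) / length(Actual);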
ROC Curves
To avoid having to select a single threshold for classification, one may scan through all possible thresholds, and observe the effect on the true positive rate (= TP / [TP + FN] ) and the false positive rate (= FP / [FP + TN] ). Graphed as coordinate pairs, these measures form the receiver operating characteristic curve (or ROC curve, for short). Some readers will be more familiar with the true positive rate by the term sensitivity, and the false positive rate as 1.0 minus the specificity. The ROC curve describes the performance of a model across the entire range of classification thresholds. An example ROC curve is shown in the figure below (click to enlarge):
All ROC curves begin in the bottom-left corner and rise to the top-right corner. Moving along the ROC curve represents trading off false positives for false negatives. Generally, random models will run up the diagonal, and the more the ROC curve bulges toward the top-left corner, the better the model separates the target class from the background class.
Notice that ROC is an excellent tool for assessing class separation, but it tells us nothing about the accuracy of the predicted class probabilities (for instance, whether cases with a predicted 5% probability of membership in the target class really belong to the target class 5% of the time).
Another important note: despite being similar (and related) to lift charts, ROC curves are calculated differently. Do not confuse the two!
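For concreteness, here is a minimal sketch of the threshold scan described above; ModelOutput is assumed to hold the model's estimated probabilities (as a column vector) and Target the actual 0/1 classes:

% Trace the ROC curve by treating every distinct model output as a threshold
Thresholds = [Inf; sort(unique(ModelOutput),'descend')];
nPos = sum(Target == 1);   nNeg = sum(Target == 0);
TruePositiveRate  = zeros(length(Thresholds),1);
FalsePositiveRate = zeros(length(Thresholds),1);
for i = 1:length(Thresholds)
    Predicted = (ModelOutput >= Thresholds(i));
    TruePositiveRate(i)  = sum(Predicted & Target == 1) / nPos;
    FalsePositiveRate(i) = sum(Predicted & Target == 0) / nNeg;
end
plot(FalsePositiveRate,TruePositiveRate)
xlabel('False positive rate'), ylabel('True positive rate')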
The Magic of AUC
By this point the reader may be wondering, "The ROC curve seems great and all, but it provides a spectrum of performance assessments. How do I boil this down to a simple, single-number measure of performance?" The answer, dear reader, is to measure the area under the ROC curve (abbreviated AUC, or less frequently, AUROC).
Assuming that one is not interested in a specific trade-off between true positive rate and false positive rate (that is, a particular point on the ROC curve), the AUC is useful in that it aggregates performance across the entire range of trade-offs. Interpretation of the AUC is easy: the higher the AUC, the better, with 0.50 indicating random performance and 1.00 denoting perfect performance.
Happily, it is not necessary to actually graph the ROC curve to derive the AUC of a model. A clever algorithm, which can be found in the paper by Ling, Huang and Zhang, below, permits rapid calculation of the AUC. I have implemented this algorithm in MATLAB code, within my SampleError function (see the Jan-05-2007 posting, Model Performance Measurement). Note that SampleError expects the target variable to be a dummy variable, with 0 representing the background class and 1 representing the target class.
Further Reading
Data Mining, by Ian H. Witten and Eibe Frank (ISBN: 0120884070)
AUC: a Statistically Consistent and more Discriminating Measure than Accuracy, by Charles X. Ling, Jin Huang and Harry Zhang
Evaluating Performance, from “ROC Graphs: Notes and Practical Considerations for Researchers”, by T. Fawcett
The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms, by Andrew P. Bradley
See also:
The follow-up Jul-14-2007 post, Calculating AUC Using SampleError().
Sunday, June 17, 2007
Now on MySpace
I'm now on MySpace, at: Will's MySpace page. If you're on MySpace, feel free to give me a shout!
Of course, my (badly out-dated) home page is still: Will's home page.
Monday, June 04, 2007
KDnuggets 2007 Data Mining Software Poll
KDnuggets has completed its annual survey of data miners, Data Mining / Analytic Software Tools (May 2007). This survey asked participants to name the "Data Mining (Analytic) tools you used in 2007". Choices included free and commercial tools, as well as "Your own code". Respondents were able to vote for as many tools as they liked. Votes were cast by 534 voters for 28 distinct alternatives.
Two results should be of interest to readers of this Web site:
1. MATLAB received a respectable 30 votes, which puts it in the middle of the commercial offerings pack. Despite the existence of several add-ons which could be used for data mining, MATLAB is not generally billed as a data mining product. Yet, MATLAB beat out several well-known data mining tools in this survey.
2. "Your own code" received 61 votes, which is more than half the number cast for the most popular commercial data mining tool. This clearly demonstrates that a substantial portion of data miners are choosing the "DIY" route.
Thursday, May 03, 2007
Weighted Regression in MATLAB
Many predictive modeling techniques have weighted counterparts, which permit the analyst to assign weights representing the "importance" of individual observations. An observation with a weight of 8, for instance, is treated in the modeling process as though there were 8 individual observations with the same values. The usual, unweighted algorithms may be thought of as a special case of weighted algorithms, in which the weights of all observations equal 1.0.
There are several reasons for using weighted methods. One is simply that some data sets have been pre-summarized, with identical records being collapsed to a single record having a weight equal to the original number of identical records. Many analysts favor binning of predictor variables, which can drastically reduce the number of distinct combinations of input variable values.
A second reason to use weighting is simple economy of space: a data set in which identical (or very similar) records have been consolidated, with weights counting the original observations they represent, can be much smaller (even by orders of magnitude!) than the original data.
Another important reason to weight observations is to "fix" class distributions in the data. Assume that the original data contains a million rows of bank loan data, of which only 2% represent bad loans. It is common to sample down the number of good loans, while retaining all of the bad loans. This can save time on learning, but will result in a systematically biased model. A learning system which can accept weights on the observations can correct for this bias.
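As a minimal sketch of how observation weights enter a least-squares fit in base MATLAB, scale both sides of the problem by the square roots of the weights and then use the backslash operator as usual (the variable names here are illustrative):

% X: predictor matrix (one column per variable), Y: response, W: column of observation weights
XI = [ones(size(X,1),1) X];                          % include an intercept column
SW = sqrt(W);
B  = (repmat(SW,1,size(XI,2)) .* XI) \ (SW .* Y);    % weighted least-squares coefficients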
There are also a number of on-line resources for performing weighted regression in base MATLAB, such as:
Optimization Tips and Tricks, by John D'Errico
The thread linked below records an interesting conversation about weighted linear regression, and some practical issues for implementation in MATLAB:
Weighted regression thread on Usenet
Weighted regression can also be accomplished using the Statistics Toolbox, via functions such as glmfit and nlinfit. See the help facility for these functions, or try wnlsdemo for more information.
The Curve Fitting Toolbox also provides facilities for weighted regression (see: help fitoptions).
See also:
The Apr-21-2007 posting, Linear Regression in MATLAB.
The Oct-23-2007 posting, L-1 Linear Regression.
Labels: linear, regression, supervised learning, weighted
Saturday, April 21, 2007
Linear Regression in MATLAB
Fitting a least-squares linear regression is easily accomplished in MATLAB using the backslash operator: '\'. In linear algebra, matrices may be multiplied like this:
output = input * coefficients
The backslash in MATLAB allows the programmer to effectively "divide" the output by the input to get the linear coefficients. This process will be illustrated by the following examples:
Simple Linear Regression
First, some data with a roughly linear relationship is needed:
>> X = [1 2 4 5 7 9 11 13 14 16]'; Y = [101 105 109 112 117 116 122 123 129 130]';
"Divide" using MATLAB's backslash operator to regress without an intercept:
>> B = X \ Y
B =
10.8900
Append a column of ones before dividing to include an intercept:
>> B = [ones(length(X),1) X] \ Y
B =
101.3021
1.8412
In this case, the first number is the intercept and the second is the coefficient.
Multiple Linear Regression
The following generates a matrix of 1000 observations of 5 random input variables:
>> X = rand(1e3,5);
Next, the true coefficients are defined (which wouldn't be known in a real problem). As is conventional, the intercept term is the first element of the coefficient vector. The problem at hand is to approximate these coefficients, knowing only the input and output data:
>> BTrue = [-1 2 -3 4 -5 6]';
Multiply the matrices to get the output data.
>> Y = BTrue(1) + X * BTrue(2:end);
As before, append a column of ones and use the backslash operator:
>> B = [ones(size(X,1),1) X] \ Y
B =
-1.0000
2.0000
-3.0000
4.0000
-5.0000
6.0000
Again, the first element in the coefficient vector is the intercept. Note that, oh so conveniently, the discovered coefficients match the designed ones exactly, since this data set is completely noise-free.
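With noise added to the output, the recovery is only approximate; a quick check (the noise level here is arbitrary):

>> Y = BTrue(1) + X * BTrue(2:end) + 0.1 * randn(size(X,1),1);
>> B = [ones(size(X,1),1) X] \ Y;

The estimated coefficients will now be close to, but no longer exactly equal to, BTrue.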
Model Recall
Executing linear models is a simple matter of matrix multiplication, but there is an efficiency issue. One might append a column of ones and simply perform the complete matrix multiplication, thus:
>> Z = [ones(size(X,1),1) X] * B;
The above process is inefficient, though, and can be improved by simply multiplying all the other coefficients by the input data matrix and adding the intercept term:
>> Z = B(1) + X * B(2:end);
Regression in the Statistics Toolbox
The MATLAB Statistics Toolbox includes several linear regression functions. Among others, there are:
regress: least squares linear regression and diagnostics
stepwisefit: stepwise linear regression
robustfit: robust (non-least-squares) linear regression and diagnostics
See help stats for more information.
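For example, a quick sketch of regress on the data generated above (regress expects the column of ones for the intercept to be supplied explicitly; the extra outputs hold coefficient confidence intervals, residuals, residual intervals and summary statistics):

>> [B, BInt, R, RInt, Stats] = regress(Y, [ones(size(X,1),1) X]);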
See also:
The May-03-2007 posting, Weighted Regression in MATLAB.
The Oct-23-2007 posting, L-1 Linear Regression.
The Mar-15-2009 posting, Logistic Regression.
Labels: least squares, linear, linear regression, mean squared error, MSE, OLS, regression, robust
Friday, April 13, 2007
Basic Summary Statistics in MATLAB
This posting covers basic summary statistics in MATLAB.
First, note that MATLAB has a strong array-orientation, so data sets to be analyzed are most often stored as a matrix of values. Note that the convention in MATLAB is for variables to be stored in columns, and observations to be stored in rows. This is not a hard-and-fast rule, but it is much more common than the alternative (variables in rows, observations in columns). Besides, most MATLAB routines (whether from the MathWorks or elsewhere) assume this convention.
Basic summaries are easy to obtain from MATLAB. For the examples below, the following matrix of data, A, will be used (No, it's not very exciting, but it will do for our purposes):
>> A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1]
A =
1 2 3 4
-1 10 8 5
9 8 7 0
0 0 0 1
MATLAB matrices are indexed as: MatrixName(row,column):
>> A(2,1)
ans =
-1
Common statistical summaries are available in MATLAB, such as: mean (arithmetic mean), median (median), min (minimum value), max (maximum value) and std (standard deviation). Their use is illustrated below:
>> mean(A)
ans =
2.2500 5.0000 4.5000 2.5000
>> median(A)
ans =
0.5000 5.0000 5.0000 2.5000
>> min(A)
ans =
-1 0 0 0
>> max(A)
ans =
9 10 8 5
>> std(A)
ans =
4.5735 4.7610 3.6968 2.3805
Note that each of these functions operates along the columns, yielding one summary for each column, stored in a row vector. Sometimes it is desired to calculate along the rows instead. Some routines can be redirected by another parameter, like this:
>> mean(A,2)
ans =
2.5000
5.5000
6.0000
0.2500
The above calculates the arithmetic means of each row, storing them in a column vector. The second parameter to mean, if specified, indicates the dimension along which mean is to operate.
For routines without this capability, the data matrix may be transposed (rows become columns and columns become rows) using the apostrophe operator while feeding it to the function:
>> mean(A')
ans =
2.5000 5.5000 6.0000 0.2500
Note that, this time, the result is stored in a row vector.
The colon operator, :, can be used to dump all of the contents of an array into one giant column vector. The result of this operation can then be fed to any of our summary routines:
>> A(:)
ans =
1
-1
9
0
2
10
8
0
3
8
7
0
4
5
0
1
>> mean(A(:))
ans =
3.5625
The reader will find more information on summary routines in base MATLAB through:
help datafun
The MATLAB Statistics Toolbox
MATLAB users lucky enough to own the Statistics Toolbox will have still more summaries available, such as iqr (inter-quartile range), trimmean (trimmed mean) and geomean (geometric mean). Also, there are extended versions of several summary functions, such as nanmean and nanmax, which ignore NaN (IEEE floating point "not-a-number") values, which are commonly used to represent missing values in MATLAB.
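A couple of quick examples (these require the Statistics Toolbox):

>> geomean([1 10 100])      % geometric mean: 10
>> nanmean([1 NaN 3])       % the NaN is ignored: 2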
To learn more, see the "Descriptive Statistics" section when using:
help stats
Sunday, April 08, 2007
Getting Data Into MATLAB Using textread
Before any analysis can be performed in MATLAB, the data must somehow be imported. MATLAB offers a number of functions for data import (dlmread, fscanf, xlsread, etc.- try help iofun for more information), which the reader should explore. In my work, I tend to work with other software systems (relational databases and statistical packages) which can export data to text files. I have found it convenient for my purposes to convert all categorical data to numeric form (dummy variables or integer codes) in the originating system, and then to dump the data to a tab-delimited text file for consumption by MATLAB.
Below is a typical textread call, from one of my recent projects (Web formatting necessitates breaking this up: it is supposed to be a single line of code):
A = ...
textread('F:\MyFile.dat','',-1,'delimiter','\t', ...
'headerlines',1,'emptyvalue',NaN);
An explanation of each textread parameter being used follows:
A is the variable containing the data after it is loaded from disk.
'F:\MyFile.dat' is the fully qualified name of the file being loaded.
The empty string indicates that no format string is being used.
The -1 indicates that all rows of data are to be loaded. To load only some of the rows, change this value to the number of rows to be loaded.
All parameters after this point come in pairs.
The 'delimiter' indicates that the data is delimited (as opposed to fixed-width). The '\t' lets textread know that the delimiter is the tab character. Comma-delimited .CSV files, for instance, would use 'delimiter',',' instead.
Next is the parameter pair 'headerlines',1, which tells textread to ignore the first row (it contains column headers).
Last, the parameter pair 'emptyvalue',NaN indicates that any missing values should be represented in the resulting MATLAB matrix as NaN ("not-a-number") values. Other values could be used. Some analysts are accustomed to using values like -99, -9999, etc., although it is generally the convention in MATLAB to use NaN to represent missing values.
I use a separate chunk of code to read in the header line and digest the variable names. Some MATLAB programmers prefer using cell arrays instead. Yet a third possibility for dealing with variable names is to use the categorical arrays and dataset arrays from the Statistics Toolbox (new with MATLAB version 2007a), but I haven't had a chance to explore them yet.
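For illustration only (this is a sketch, not the exact code I use), the header line can be read and split on tabs like this:

fid = fopen('F:\MyFile.dat');
HeaderLine = fgetl(fid);                                     % first line: tab-delimited variable names
fclose(fid);
VariableNames = strread(HeaderLine,'%s','delimiter','\t');   % cell array of variable names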
Labels: 2007a, categorical array, data, dataset array, delimited, delimiter, import, iofun, textread
Friday, March 23, 2007
Two Bits of Code
Given some private requests for code, I figured that I would share two routines which I've mentioned in this log.
In the Nov-17-2006 entry, Mahalanobis Distance, I mentioned that I had implemented my own Mahalanobis distance routine in MATLAB. That routine is now available at:
MahalanobisDistance.m
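For reference, here is a bare-bones sketch of the calculation itself (not necessarily how MahalanobisDistance.m is implemented): the distance of each row of X from the mean of a reference sample, scaled by that sample's covariance:

% Reference: n-by-p reference sample;  X: rows whose distances are wanted
Mu = mean(Reference);
S  = cov(Reference);
D  = zeros(size(X,1),1);
for i = 1:size(X,1)
    Centered = X(i,:) - Mu;
    D(i) = sqrt(Centered * (S \ Centered'));   % Mahalanobis distance of row i
end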
In the Jan-26-2007 posting, Pixel Classification Project, one of the texture features which proved useful was the "edge detector". This routine is now available here:
DiffEdge.m
DiffEdge calculates a summary of the differences in brightness levels of opposing pixels on the square (whose size is indicated by the user) surrounding the pixel of interest. This operator was described in the article "Image Processing, Part 6: Advanced Edge Detection", by Dwayne Phillips, which appeared in the Jan-1992 issue of "C/C++ Users Journal".
Labels: edge detection, image processing, Mahalanobis, texture
Saturday, March 03, 2007
MATLAB 2007a Released
The latest version of MATLAB, 2007a, has been released. While some changes to base MATLAB are of interest to data miners (multi-threading, in particular), owners of the Statistics Toolbox receive a number of new features in this major upgrade.
First, the Statistics Toolbox makes new data structures available for categorical data (categorical arrays) and mixed-type data (dataset arrays). Most MATLAB users performing statistical analysis or data mining tend to store their data in numerical matrices (my preference) or cell arrays. Using ordinary matrices requires the programmer/analyst to manage things like variable names. Cell arrays deal with the variable name issue, but preclude some of the nice things about using MATLAB matrices. Hopefully these new structures make statistical analysis in MATLAB more natural.
Second, the Statistics Toolbox updates the classify function, permitting it to output the discovered discriminant coefficients (at last!). I have been complaining about this for a long time. Why? Because classify provides quadratic discriminant analysis (QDA), an important non-linear modeling algorithm. Without the coefficients, though, it is impossible to deliver models to other (admittedly inferior to MATLAB) platforms.
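A hedged sketch of the updated call (the output argument list shown is assumed from the release notes; the Fisher iris data ships with the Statistics Toolbox):

load fisheriris                        % provides meas (150-by-4) and species (cell array of labels)
[Class, Err, Posterior, LogP, Coeff] = classify(meas, meas, species, 'quadratic');
% Coeff(i,j) describes the quadratic boundary between groups i and j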
Also of note: the Genetic Algorithm and Direct Search Toolbox now includes simulated annealing.
More information on 2007a is available at The Mathworks.
Tuesday, February 27, 2007
Poll Results (Feb-20-2007): Toolbox Use
The Feb-20-2007 posting asked the poll question, Which Toolboxes from the MathWorks do you use (check all that apply)? After approximately 1 week, 38 votes were cast in total (including 1 by myself). The poll results follow:
1. Curve Fitting Toolbox (1 vote) 3%
2. Fuzzy Logic Toolbox (2 votes) 5%
3. Genetic Algorithm Toolbox (1 vote) 3%
4. Image Processing Toolbox (8 votes) 21%
5. Neural Network Toolbox (2 votes) 5%
6. Optimization Toolbox (6 votes) 16%
7. Signal Processing Toolbox (5 votes) 13%
8. Spline Toolbox (0 votes) 0%
9. Statistics Toolbox (13 votes) 34%
Not surprisingly, the Statistics Toolbox is the most popular among this crowd. The use of some of the other analytical toolboxes (such as the Neural Network and Fuzzy Logic Toolboxes) is also expected, though I was surprised both by the complete absence of anyone responding using the Spline Toolbox and by the fairly good turn-out for the Optimization Toolbox. Given the popularity of the Jan-26-2007 posting, Pixel Classification Project, the 8 votes cast for the Image Processing Toolbox is expected.
Bootstrapping just the Statistics Toolbox results gives:
>> n = 38; p = 13 / 38; prctile(mean(double(rand(n,10000) < p)),[5 95])
ans =
0.2105 0.4737
So, with 90% confidence, readers' use of the Statistics Toolbox is somewhere between 21% and 47%.
Thursday, February 22, 2007
Dividing Data Into Groups Based On A Key Value (Hashing)
Introduction
A recent posting, Dividing Data Randomly Into Equal-Sized Groups (Feb-19-2007), presented a relatively painless method for dividing data into groups randomly. Sometimes, though, it is desired that items be divided into random groupings repeatably, despite appearing multiple times in the data set.
Consider, for instance, a group of customers who generate new billing statements once per month. Billing data may be drawn over several months for modeling purposes, and a single customer may appear several times in the data (once for each billing month). When that data is divided into "training" and "testing" groups, it would be preferable not to have any given customer appear in both the "training" and "testing" data sets. Random assignment of records to training/testing groups will not obey this requirement.
One solution is to assign records to groups based on some identifier which is unique to the customer, perhaps an account number. Simply dividing account numbers into deciles, for instance, is problematic because account numbers are likely assigned chronologically, meaning that customers with less tenure will end up in one group, while those with more tenure will land in the other group. Many unique identifiers (often called "keys") share this problem: despite not necessarily being meaningful as quantities, their values typically contain systematic biases.
One solution is to hash such identifiers, meaning to transform one set of values into another, scrambling them in the process. This is done via a hash function, which ideally spreads the new values out as evenly as possible. In the present application, the hash function will produce far fewer distinct values than appear in the original data. This approach lets the identifier drive the process (making the assignment exactly repeatable for any given identifier) while remaining random enough to mix the data.
Hash Functions: Modulo Division
A variety of hash functions have been devised, but the most common use modulo division (mod in MATLAB). An example in MATLAB follows:
% A tiny amount of artificial data to work on
AccountNumber = [13301 15256 27441 27831 50668 89001 90012 93108 95667]'
% Hash the AccountNumber
Group = mod(AccountNumber,5) + 1
AccountNumber =
13301
15256
27441
27831
50668
89001
90012
93108
95667
Group =
2
2
2
2
4
2
3
4
3
All 'AccountNumber' values have been "scrambled" deterministically and mapped into the range 1 - 5. The second parameter in the mod function, the modulo divisor, determines the number of distinct hashed values (hence, the number of groups). In practice, experimentation may be necessary to ensure an even distribution of hashed values. Here is a larger example:
% Synthesize a large amount of artificial data to work on
randn('state',27459); % Initialize the PRNG
AccountNumber = unique(ceil(100000 * abs(randn(10000,1))));
% Hash the AccountNumber
Group = mod(AccountNumber,5) + 1;
% Check the distribution
tabulate(Group)
Value Count Percent
1 1850 19.02%
2 1952 20.07%
3 2008 20.65%
4 2015 20.72%
5 1900 19.54%
Though the distribution of 'AccountNumber' is dramatically skewed, its hashed version is very evenly distributed.
So far, so good, but a few matters remain to be cleared up. First, it is generally recommended that the modulo divisor be prime. It is easy enough to discover primes using MATLAB's primes function:
primes(500)
ans =
Columns 1 through 21
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73
Columns 22 through 42
79 83 89 97 101 103 107 109 113 127 131 137 139 149 151 157 163 167 173 179 181
Columns 43 through 63
191 193 197 199 211 223 227 229 233 239 241 251 257 263 269 271 277 281 283 293 307
Columns 64 through 84
311 313 317 331 337 347 349 353 359 367 373 379 383 389 397 401 409 419 421 431 433
Columns 85 through 95
439 443 449 457 461 463 467 479 487 491 499
...but what if the desired number of groups is not prime? Many problems, for instance, involve dividing items into 100 groups to permit grouping at the percent level. Although the prime 101 is close to 100, it is not close enough. The solution I usually employ is to cascade several hash functions, each operating only on the out-of-range values left by its predecessors. All of the modulo divisors are larger than the desired number of groups except the last, which is the largest prime less than the desired number of groups. Following is a detailed example in which the data starts off with a divisor of 17 and scopes down sequentially to 13, 11 and finally 7 (the largest prime smaller than the desired group count of 10):
% Synthesize a large amount of artificial data to work on
randn('state',27459); % Initialize the PRNG
AccountNumber = unique(ceil(100000 * abs(randn(10000,1))));
Group = mod(AccountNumber,17) + 1;
tabulate(Group)
Value Count Percent
1 606 6.23%
2 563 5.79%
3 545 5.60%
4 559 5.75%
5 556 5.72%
6 600 6.17%
7 561 5.77%
8 599 6.16%
9 546 5.61%
10 583 5.99%
11 573 5.89%
12 578 5.94%
13 566 5.82%
14 596 6.13%
15 523 5.38%
16 558 5.74%
17 613 6.30%
>> ROI = (Group > 10); Group(ROI) = mod(AccountNumber(ROI),13) + 1;
>> tabulate(Group)
Value Count Percent
1 915 9.41%
2 883 9.08%
3 855 8.79%
4 854 8.78%
5 849 8.73%
6 896 9.21%
7 865 8.89%
8 913 9.39%
9 880 9.05%
10 874 8.99%
11 310 3.19%
12 315 3.24%
13 316 3.25%
>> ROI = (Group > 10); Group(ROI) = mod(AccountNumber(ROI),11) + 1;
>> tabulate(Group)
Value Count Percent
1 1013 10.42%
2 960 9.87%
3 936 9.62%
4 954 9.81%
5 932 9.58%
6 994 10.22%
7 958 9.85%
8 986 10.14%
9 959 9.86%
10 961 9.88%
11 72 0.74%
>> ROI = (Group > 10); Group(ROI) = mod(AccountNumber(ROI),7) + 1;
>> tabulate(Group)
Value Count Percent
1 1021 10.50%
2 971 9.98%
3 946 9.73%
4 966 9.93%
5 944 9.71%
6 1004 10.32%
7 967 9.94%
8 986 10.14%
9 959 9.86%
10 961 9.88%
The calls to tabulate are only included for explanatory purposes and are obviously unnecessary. Note that the final stage deals with only 72 out-of-range values, and distributes them among values 1 - 7. If managed properly, this slight quirk should have negligible effect on the outcome.
Another issue is how to devise different hash functions for different purposes. For instance, it might be desired that a 10% sample of all customers be selected for a marketing campaign and, separately, a 30% sample be drawn for modeling purposes. Two different hash functions could solve this problem, provided that their respective outputs were independent. To achieve different, but repeatable, results, one hash might scope in with a divisor sequence of 149, 127, 109, 97, and the other might use 233, 197, 151, 107, 97 (see the sketch below).
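One way to package such a cascade is as a small helper function. The following is only a sketch of mine: the function name cascadehash and its organization are assumptions, not part of the original post.
% cascadehash.m -- hash integer keys into 1..nGroups by cascading modulo divisions
function Group = cascadehash(Key,Divisors,nGroups)
% Each stage rehashes only the values left out of range by its predecessors;
% the final divisor should be the largest prime below nGroups.
Group = mod(Key,Divisors(1)) + 1;
for i = 2:length(Divisors)
    ROI = (Group > nGroups);
    Group(ROI) = mod(Key(ROI),Divisors(i)) + 1;
end
With that helper on the path, the two samples mentioned above could be drawn repeatably like so:
% Two separate, repeatable 100-group hashes of the same account numbers
CampaignGroup = cascadehash(AccountNumber,[149 127 109 97],100);
ModelingGroup = cascadehash(AccountNumber,[233 197 151 107 97],100);
% Select roughly 10% and 30% of customers, respectively
CampaignSample = (CampaignGroup <= 10);
ModelingSample = (ModelingGroup <= 30);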
One variation on this theme is to use an initial hash function to divide the data among a small number of other hash functions. For instance, start by hashing the key to 1 of 11 values (11 is prime). Use this hash value to select which of 3 hash functions to apply to the key to get the final result: 1-3: hash function 1; 4-7: hash function 2; 8-11: hash function 3. This provides a way of creatively generating a variety of hash functions when many are needed (a sketch appears below).
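Continuing with the hypothetical cascadehash helper above (again an assumption of mine), the dispatch might look like this, with all three cascades ending at the prime 7 so that each maps to 10 groups:
% Route each key to one of three different cascades via a preliminary mod-11 hash
Selector = mod(AccountNumber,11) + 1;                   % values 1..11
Group = zeros(size(AccountNumber));
ROI = (Selector <= 3);
Group(ROI) = cascadehash(AccountNumber(ROI),[17 13 7],10);
ROI = (Selector >= 4 & Selector <= 7);
Group(ROI) = cascadehash(AccountNumber(ROI),[19 13 7],10);
ROI = (Selector >= 8);
Group(ROI) = cascadehash(AccountNumber(ROI),[23 17 7],10);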
Hash Functions: Midsquare
Modulo division is not the only way to hash numeric data. Another option is the "midsquare" method, which simply involves squaring the key and extracting digits from the middle of the result. Here is an example:
% A tiny amount of artificial data to work on
AccountNumber = [13301 15256 27441 27831 50668 89001 90012 93108 95667]'
AccountNumber =
13301
15256
27441
27831
50668
89001
90012
93108
95667
% First, square the key
MidSquare = (AccountNumber .^ 2);
% Remove the rightmost 2 digits
MidSquare = floor(MidSquare / 100);
% Remove the leftmost digits, leaving only 3 digits
MidSquare = MidSquare - 1000 * floor(MidSquare / 1000)
MidSquare =
166
455
84
645
462
780
601
996
748
Conclusion
All of the hash functions described here have dealt with numeric data (specifically, integers). While hash functions can be devised for other data types, it is more common to convert such values to integers first and then apply one of the usual integer-based hash functions.
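For example, a character key can be folded into a bounded integer before modulo hashing. The scheme below is only a sketch of mine; the weighting constant 31 and the example key are illustrative assumptions, not from the post.
% Fold a string key into a bounded integer, then hash as usual
Key = 'CUST-000123';                        % hypothetical character identifier
KeyInt = 0;
for i = 1:length(Key)
    KeyInt = mod(31 * KeyInt + double(Key(i)), 2^31 - 1);
end
Group = mod(KeyInt,7) + 1                   % assign to one of 7 groups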
One last thing I'd like to point out: hash functions provide a portability which random number generators don't. Hashing means that it is not necessary to drag around an enormous look-up table holding every 'AccountNumber' and its assigned group. In fact, new data files containing previously unseen account numbers can be accommodated with complete consistency.
Labels:
data mining,
dividing,
hash,
hash function,
hashing,
machine learning,
MATLAB,
midsquare,
mod,
modulo,
prime,
primes,
random,
sample,
sampling
Tuesday, February 20, 2007
Poll (Feb-20-2007) Which Toolboxes?
The last poll turned up the interesting finding that all respondents use Toolboxes. This time we'll learn which Toolboxes (from the MathWorks) readers use. Please feel free to comment on this post to indicate what other Toolboxes you use (whether from the MathWorks or not).
This poll is closed.
See the poll results in the Feb-27-2007 post, Poll Results (Feb-20-2007): Toolbox Use.
Thanks for voting!
Labels:
data mining,
machine learning,
MATLAB,
toolboxes
Monday, February 19, 2007
Dividing Data Randomly Into Equal-Sized Groups
This is a quick note on dividing items randomly into equal-sized groups. This is an even quicker tip than yesterday's Dividing Values Into Equal-Sized Groups, since in this case, the original data does not affect the outcome.
Start by initializing the pseudo-random number generator (PRNG) for reproducible results:
rand('twister',9596)
Being able to reproduce outcomes exactly from run to run is important for several reasons, not the least of which is debugging. If the outcome of a program changes from run to run, it can be very hard to discover what precisely is going wrong.
With that out of the way, we can assign random groupings, in this case 20 groups for 10,000 individuals:
Group = ceil(20 * randperm(10000)' / 10000);
That's all there is to it. The result, 'Group', is a column vector with 10,000 group assignments, running from 1 to 20. If a different number of groups is desired, change the '20' to some other number. If a different number of items are to be assigned groups, change the '10000' (in both places) to something else. Just to check on this example, we reach for tabulate from the Statistics Toolbox:
tabulate(Group)
Value Count Percent
1 500 5.00%
2 500 5.00%
3 500 5.00%
4 500 5.00%
5 500 5.00%
6 500 5.00%
7 500 5.00%
8 500 5.00%
9 500 5.00%
10 500 5.00%
11 500 5.00%
12 500 5.00%
13 500 5.00%
14 500 5.00%
15 500 5.00%
16 500 5.00%
17 500 5.00%
18 500 5.00%
19 500 5.00%
20 500 5.00%
This process guarantees that the sizes of the largest and smallest groups will differ by no more than 1, and is ideal for assigning observations to folds for k-fold cross-validation (see the sketch below).
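As an illustration, the group codes can serve directly as fold indices. The loop below is only a sketch: the model fitting and scoring are left as comments, and the fold count matches the 20 groups generated above.
% Use the random group assignment as folds for 20-fold cross-validation
nFolds = 20;
FoldSize = zeros(nFolds,1);
for Fold = 1:nFolds
    TestROI  = (Group == Fold);             % observations held out in this fold
    TrainROI = ~TestROI;                    % observations available for fitting
    % Fit the model using TrainROI rows and evaluate it on TestROI rows here
    FoldSize(Fold) = sum(TestROI);          % placeholder: record the fold size
end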
Labels:
data mining,
dividing,
k-fold cross validation,
machine learning,
MATLAB,
PRNG,
random,
randperm
Sunday, February 18, 2007
Dividing Values Into Equal-Sized Groups
This is just a quick tip for MATLAB users who need to divide collections of values into even-sized groups. If one is fortunate enough to have the MATLAB Statistics Toolbox available, the tiedrank function is very handy for this sort of thing. (It is not hard to build something similar to tiedrank using sort, in any case.)
This will best be explained via example, so let's generate some example data:
>> rand('twister',951); X = rand(8,1)
X =
0.5798
0.0504
0.0241
0.7555
0.6569
0.3020
0.2042
0.5651
It is desired that this data be divided into 4 equal-sized groups, by magnitude. The following line does this:
>> F = ceil(4 * tiedrank(X) / length(X))
F =
3
1
1
4
4
2
2
3
The variable F now contains an integer code representing the assigned group number, with 1 being the smallest group. Notice that there are two each of the values 1, 2, 3 and 4. Also notice how easy it is to extract all members of any given group. Here, for example, the members of the lowest-valued group are displayed:
>> X(F == 1)
ans =
0.0504
0.0241
Here is an example using a much larger number of items:
>> rand('twister',951); X = rand(10000,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 25.00%
2 2500 25.00%
3 2500 25.00%
4 2500 25.00%
The tabulate function is a convenient routine for generating frequency tables from the Statistics Toolbox. Again, notice the even distribution of values across bins. What happens if the total count cannot be divided evenly among the bins? Let's see:
>> rand('twister',951); X = rand(10001,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 25.00%
2 2500 25.00%
3 2500 25.00%
4 2501 25.01%
This procedure ensures that the counts in the smallest bin and the largest bin never differ by more than 1, assuming that this is possible. Observe the progression, as the total count is incremented:
>> rand('twister',951); X = rand(10002,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 25.00%
2 2501 25.00%
3 2500 25.00%
4 2501 25.00%
>> rand('twister',951); X = rand(10003,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 24.99%
2 2501 25.00%
3 2501 25.00%
4 2501 25.00%
>> rand('twister',951); X = rand(10004,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2501 25.00%
2 2501 25.00%
3 2501 25.00%
4 2501 25.00%
The only catch is the case of multiple instances of the same value. The tiedrank function prevents the repeated values from being broken up. Here is an example with such data:
>> X = [1 2 2 2 5 6 7 8]'
X =
1
2
2
2
5
6
7
8
>> F = ceil(4 * tiedrank(X) / length(X))
F =
1
2
2
2
3
3
4
4
The distribution among bins is now uneven, but tiedrank is doing the best it can. Depending on what is needed, this may be justified. If, however, one needs to break these repeated values apart, even if arbitrarily, then tiedrank may easily be replaced with another procedure which does so (see the sketch below).
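One arbitrary-but-repeatable replacement simply ranks the values with sort, letting ties fall in whatever order sort returns them. This is a sketch of mine, not code from the post.
% Rank values with sort (ties broken arbitrarily), then bin exactly as before
[Ignore,Order] = sort(X);
Rank = zeros(size(X));
Rank(Order) = (1:length(X))';
F = ceil(4 * Rank / length(X))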
This process is useful for assigning values to quantiles or n-iles, such as deciles or percentiles. Applied to randomly-generated values, it is also useful for assigning cases to folds for stratified k-fold cross-validation, or to strata for stratified sampling.
See Also
Feb-10-2007 posting, Stratified Sampling.
Sunday, February 11, 2007
Poll Results (Feb-04-2007): Toolbox Use
The Feb-04-2007 posting, Poll (Feb-04-2007): Toolbox Use featured the following poll question: Which MATLAB Toolboxes, if any, do you use? After 1 week, the poll has closed, with 27 Data Mining in MATLAB readers responding (1 of which was myself). The final results are:
1. None (base MATLAB product only) (0 Votes)
2. MATLAB Toolboxes from the MathWorks only (11 Votes)
3. MATLAB Toolboxes from other sources only (1 Votes)
4. MATLAB Toolboxes both from the MathWorks and other sources (15 Votes)
Interestingly, all reported using toolboxes. All voters but 1 indicated using toolboxes from the MathWorks.
How significant are these results? A quick bootstrap of 100,000 replicates yields the following:
>> Poll = [repmat(2,11,1); repmat(3,1,1); repmat(4,15,1)];
>> Replicates = Poll(ceil(27 * rand(27,1e5)));
>> Count2 = mean(Replicates == 2);
>> Count3 = mean(Replicates == 3);
>> Count4 = mean(Replicates == 4);
>> prctile(Count2,[5 95])
ans =
0.2593 0.5556
>> prctile(Count3,[5 95])
ans =
0 0.1111
>> prctile(Count4,[5 95])
ans =
0.4074 0.7037
So (roughly, due to limited precision provided by small sample size), with a 90% confidence interval, use of MathWorks Toolboxes only is somewhere between 26% and 56%, use of non-MathWorks Toolboxes only is between 0% and 11%, and use of both is between 41% and 70%.
Thanks to everyone who voted!
Saturday, February 10, 2007
Stratified Sampling
Introduction
In my posting of Nov-09-2006, Simple Random Sampling (SRS), I explained simple random sampling and noted some of its weaknesses. This post will cover stratified random sampling, which addresses those weaknesses.
Stratified sampling provides the analyst with more control over the sampling process. A typical use of stratified sampling is to control the distribution of the variables being sampled. For instance, imagine a data set containing 200 observations, 100 of which are men, and 100 of which are women. Assume that this data set is to be split into two equal-sized groups, for control and treatment testing. Half of the subjects will receive some treatment which is under review (a drug, marketing campaign, etc.), while the other group is held out as a control and receives no treatment. A simple random sampling procedure will result in two groups, each with 50 men and 50 women, more or less. The "more or less" is the awkward part. Some simple random samples will result in a 46/54 split of men (and, in this case, the reverse, 54/46, for women). After an experiment, how will the experimenter know whether any measured differences are due to control versus treatment, or the difference in the respective proportions of men and women? It would be beneficial to control such factors.
When using simple random sampling, deviations from the expected distributions can be substantial. Generally, three factors aggravate this issue:
1. Smaller observation counts
2. More variables to be controlled
3. Higher skew in variables to be controlled
Even very large data may exhibit this problem. Consider the problem of applying treatments (marketing campaigns, for instance) to loan customers at a bank. At the beginning of the experiment, it is reasonable to expect that important variables be distributed similarly among treatment cells. Such variables might include current balance, credit score and loan type. Even a rather large data set may not split well along all of these dimensions.
A Simple Example
Consider a simple situation, in which there are 100,000 observations, 99,000 of which are of class A, and 1,000 of which are of class B. A minority class representation of 1% is not uncommon, and some important problems have even more class imbalance. A model is to be constructed to classify future cases as belonging to one class or the other. A train/test split of 70%/30% has been specified. To ensure that the training and testing data sets have similar proportions of classes A and B, the sampling will be stratified by class. Let's get started:
% Generate some example data (yes, it's very artificial and in order- don't worry about that!)
SimpleData = [randn(100000,5) [zeros(99000,1); ones(1000,1)]];
There are now 5 predictor variables and the target (in the last column) stored in SimpleData.
% Count the examples
n = size(SimpleData,1);
The first task is to identify the distinct strata which are to be sampled, and to calculate their respective frequencies. In this case, that means the two classes:
% Locate observations in each class
ClassA = (SimpleData(:,end) == 0);
ClassB = (SimpleData(:,end) == 1);
% We already know these, but in real-life they'd need to be calculated
nClassA = sum(double(ClassA));
nClassB = sum(double(ClassB));
Next, space is allocated for an integer code representing the segment, with a value of 1 for "training" or a 2 for "testing":
% Create train/test code values
Train = 1;
Test = 2;
% Allocate space for train/test indicator
Segment = repmat(Test,n,1); % Default to the last group
Next, we check a few things about the stratifying variable(s):
% Determine distinct strata
DistinctStrata = unique(SimpleData(:,end));
% Count distinct strata
nDistinctStrata = size(DistinctStrata,1);
For rigor's sake, randperm should be initialized at the beginning of this process, which is done by initializing rand (see my Jan-13-2007 posting, Revisiting rand (MATLAB 2007a)):
% Initialize PRNG
rand('state',29182);
Loop over the strata, splitting each as closely as possible (within one unit) at the 70/30 mark:
% Loop over strata
for Stratum = 1:nDistinctStrata
% Establish region of interest
ROI = find(SimpleData(:,end) == DistinctStrata(Stratum));
% Determine size of region of interest
nROI = length(ROI);
% Generate a scrambled ordering of 'nROI' items
R = randperm(nROI);
% Assign appropriate number of units to Training group
Segment(ROI(R(1:round(0.70 * nROI)))) = Train;
end
Done! Now, let's check our work:
>> mean(SimpleData(Segment == 1,end))
ans =
0.0100
>> mean(SimpleData(Segment == 2,end))
ans =
0.0100
Both the training and testing data sets have a 1% Class B rate. Note that stratified sampling will sometimes deviate from the expected distributions because strata can only be divided into whole samples. With enough strata, this tiny error (never more than 0.5 samples per stratum) may add up to a small discrepancy from the exact designed distribution. Regardless, stratified sampling preserves the distributions of the stratified variables much better than simple random sampling.
Epilogue
The code in the example given was designed for clarity, not efficiency, so feel free to modify it for execution time and storage considerations.
Typically, numeric variables are stratified by dividing them into segments, such as deciles. Their original numeric values are still used, but each segment is treated as one stratum.
When dealing with multiple stratifying variables, it is suggested that unique(X,'rows') be applied to the set of stratifying variables to obtain all distinct combinations of single-variable strata which actually occur in the data (see the sketch below). Beware that using too many stratifying variables, or too many strata per variable, may result in a large number of (multivariable) strata, many of which are very sparsely populated.
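A sketch of both points follows; the continuous variable 'Balance' and its combination with the class column from the example above are illustrative assumptions of mine.
% Stratify a numeric variable into deciles via tiedrank (see the Feb-18-2007 posting)
Balance = 1000 * abs(randn(n,1));                       % hypothetical numeric variable
BalanceStratum = ceil(10 * tiedrank(Balance) / n);
% Combine it with the class column into multivariable strata
[DistinctStrata,Ignore,StratumIndex] = ...
    unique([BalanceStratum SimpleData(:,end)],'rows');
nDistinctStrata = size(DistinctStrata,1);               % at most 10 x 2 = 20 strata here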
Stratified sampling is highly effective at avoiding the sometimes arbitrary results of simple random sampling, and is useful in assigning observations in control/test, train/test/(validate) and k-fold cross validation designs.
Further reading
Sampling: Design and Analysis, by Sharon L. Lohr (ISBN: 0-534-35361-4)
Labels:
sampling,
simple random sampling,
strata,
stratified sampling,
stratum
Sunday, February 04, 2007
Poll (Feb-04-2007): Toolbox Use
I am curious about readers' use of Toolboxes (from the MathWorks or elsewhere) in their MATLAB programs. Please answer the following poll on Toolboxes:
This poll is closed.
See the poll results in the Feb-11-2007 posting, Poll Results (Feb-04-2007): Toolbox Use
Thanks for participating!
Friday, February 02, 2007
Pixel Classification Project: Response
The Jan-26-2007 posting, Pixel Classification Project, generated quite a response (and some confusion!). Having received a number of responses, both in the Comments section and via e-mail, I will answer questions and comments in the sections below. Many thanks for your (collective) interest!
Details About The Pixel Classifier's Operation
1. The pixel classifier assesses individual pixels, not entire images. It is applied separately to all pixels within a subject image. The fact that the training images were composed entirely of pixels from one class or the other was merely a logistical convenience, since the analyst would not have to label areas of the training images by class. Ultimately, the pixel classifier operates on a small window of pixels and predicts (the probability of) the class of the pixel at the center of the window. This is why the new images (visible near the end of the original posting) are shaded in: the classifier is scanned over the entire image, evaluating each pixel separately.
2. While there is obviously a required component of craft in constructing the whole thing, the largest direct infusions of human knowledge to the actual classifier come from: 1. the manual labeling of images as "foliage" / "non-foliage", and 2. the construction of the hue2 feature. hue2 is not strictly necessary, and what the classifier knows, it has learned.
3. The hue-saturation-value (HSV) color components are a relatively simple transformation of the red-green-blue (RGB) color values already in the model. Although they do not bring "new" information, they may improve model performance by providing a different representation of the color data.
4. Which of the entire set of 11 input variables is "most important", I cannot say (although I suspect that the classifier is driven by green color and high-activity texture). As mentioned in the original posting, rigorous testing and variable selection were not performed. If I post another image processing article, it will likely be more thorough.
5. The edge detector variables measure contrast across some distance around the center pixel (I can supply the MATLAB code to any interested parties). The 5x5 edge detector summarizes the differences in brightness of pixels on opposite sides of a 5-pixel-by-5-pixel square surrounding the pixel of interest. The other edge detectors consider larger squares about the pixel of interest. The varying sized edge detectors measure texture over different scales. There is nothing special about these particular edge detectors: I chose them only because they are fast to calculate and I had already built them. I would consider using other image processing operators (Sobel edge detector, Laws texture features, window brightness standard deviation, etc.) in any future pixel classifier.
(Possible) Future Developments
1. This process could indeed be applied to other types of data, such as audio. I was actually thinking about doing this, and given the interest in this posting, will consider either an audio project or a more thorough image processing project for the future (any preferences?). Reader suggestions are very welcome.
2. Detection of more complex items (people, automobiles, etc.) might be possible by combining a number of pixel classifiers. Much research has been undertaken in an effort to solve that problem, and the attempted solutions are too numerous to list here.
3. I strongly encourage readers to experiment in this field. Anyone undertaking such a project should feel free to contact me for any assistance I may be able to provide.
I will take up other potential applications of this idea with individual readers via other channels, although I will say that pixel-level classification is being performed already, both by governments (including the military) and in the private sector.
Some examples of other writing using this general approach follow. The nice thing about this sort of work is that even if one doesn't fully understand the white paper or report, it is always possible to appreciate what the author has done by looking at the pictures.
Machine Learning Applied to Terrain Classification
for Autonomous Mobile Robot Navigation, by Rogers and Lookingbill
A Support Vector Machine Approach For Object Based Image Analysis, by Tzotsos
Interactively Training Pixel Classifiers, by Piater, Riseman and Utgoff
Texture classification of logged forests in tropical Africa using machine-learning algorithms, by Chan, Laporte and Defries
A Survey on Pixel-Based Skin Color Detection Techniques, by Vezhnevets, Sazonov and Andreeva
Feature Extraction From Digital Imagery: A Hierarchical Method, by Mangrich, Opitz and Mason
Labels:
classifier,
foliage,
graphics,
image processing,
pixel,
skin
Friday, January 26, 2007
Pixel Classification Project
Being interested in both machine learning and image processing, I built a pixel-level classifier, on a lark, whose output is the probability that any given pixel was from the class "foliage". The project, in summary, followed these steps:
Pixel Classification Project Steps
1. Collect images, each containing pixels from only one class of interest
2. Extract samples (small windows surrounding pixels of interest) from images
3. Calculate derived features
4. Train classifier to distinguish between "foliage" and "not foliage" classes
5. Apply learned classifier to test images containing pixels from both classes
The salient details of each step follow:
1. Data Acquisition
Thirty-five images of each class ("foliage" and "non-foliage") were acquired. All training images were downloaded from the World Wide Web, after being located by AllTheWeb (Pictures). All images needed to be of at least moderate resolution (about 640x480) to provide enough information to render accurate classifications.
For the "foliage" class, search terms such as "foliage", "leaves" and "grass" were used. Images in this class were screened visually to include images which contained foliage, and only foliage, meaning leaves, plant stalks and stems. Images containing any extraneous elements (colorful flowers, pets, children, wires, rocks, etc.) were excluded.
For the "non-foliage" class, arbitrary search terms were employed, which would likely find things other than plants, like "nail", "dancer", "hallway", etc. Images in this class were visually screened to include anything but foliage. Indoor scenes containing small potted plants, for instance, were excluded.
It would also have been possible to utilize training images with mixtures of "foliage" and "non-foliage" pixels instead, but this would have required determining pixel class (at least for homogeneous regions within images) by hand. I did try this in a similar, earlier experiment, and I will report that: 1. manual identification of pixels can be time-consuming; 2. region-by-region classing complicates the next step, sampling; and 3. I suspect that this approach is less accurate for identification of pixels near the edge of a homogeneous region (which are much harder to distinguish).
2. Data Sampling
One thousand samples were drawn from each image, for a grand total of 70,000 examples (= 2 classes x 35 images each x 1,000 samples per image). Each sample was composed of the color information from a small window surrounding a pixel of interest at a random location within the image. The pixel at the center of the window was considered to be from the "foliage" class if it came from the "foliage" set of images, and from the "non-foliage" class if it came from the "non-foliage" set of images.
3. Derived Features
In addition to the raw red, green and blue values, my program calculated several derived features from the image data. In all, 11 features were used:
1. red
2. green
3. blue
4. hue
5. saturation
6. value
7. hue2
8. edge detector: 5x5
9. edge detector: 9x9
10. edge detector: 13x13
11. edge detector: 21x21
Hue, saturation and value, taken together, are another way of representing colors and are easily calculated using MATLAB's rgb2hsv function. The "hue2" variable is a fuzzy function of the hue variable, using a curved plateau function (see my posting of Nov-16-2006, Fuzzy Logic In MATLAB Part 1), hand-tweaked to flag appropriate colors. The edge detectors perform a simple, quick edge detection process over varying window sizes, and are intended to capture texture information at different scales.
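To make the feature step concrete, here is a sketch of deriving the HSV planes and a crude windowed contrast measure from a loaded image. The file name and the simple filter2-based contrast stand-in are my own assumptions, not the author's actual edge detector.
% Derive HSV planes and a simple 5x5 contrast feature from an RGB image
RGB = double(imread('foliage_example.jpg')) / 255;   % hypothetical file name
HSV = rgb2hsv(RGB);
Hue = HSV(:,:,1);  Saturation = HSV(:,:,2);  Value = HSV(:,:,3);
LocalMean = filter2(ones(5)/25, Value, 'same');      % 5x5 neighborhood average
Contrast5 = abs(Value - LocalMean);                  % crude local-texture stand-in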
There is nothing special about the above list, and indeed the list of possible derived features is limited only by the imagination. The image processing field has delivered a wide array of filters and other such manipulations. Interested readers are urged to examine either of these texts:
Algorithms for Image Processing and Computer Vision, by Parker (ISBN: 0-471-14056-2)
Digital Image Processing, by Gonzalez and Woods (ISBN: 0-201-50803-6)
4. Classifier Construction
My foliage classifier is a logistic regression, only because logistic regression is quick to train, and it was handy, as glmfit in the Statistics Toolbox. Any other machine learning or statistical classifier (linear discriminant, neural network, k-nearest neighbors, etc.) could have been used instead.
As this was just a quick experiment, I didn't bother with rigorous testing, variable selection, etc. Still, results on test images were quite nice (see below), and flaws in the classifier could certainly be addressed through a more thorough and structured effort.
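A minimal sketch of the fitting step follows; the variable names 'Features' (the 70,000-by-11 input matrix) and 'Class' (a 0/1 target vector) are assumptions of mine, not from the post.
% Fit a logistic regression to the sampled pixels and recover class probabilities
B = glmfit(Features, Class, 'binomial');             % logit link is the default
FoliageProb = glmval(B, Features, 'logit');          % probability of "foliage" per sample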
5. Classifier Recall
The finished model was executed on some of my own digital photographs, which contained both "foliage" and "non-foliage" elements. The result appears below.
Overgrown Beams: Original Image
Overgrown Beams: Foliage Detection
Monarch Butterfly: Original Image
Monarch Butterfly: Foliage Detection
Potted Plant: Original Image
Potted Plant: Foliage Detection
Conclusion
I find working with image data to be particularly satisfying since the result is something one can actually look at. Images contain a great deal of data, both in terms of rich structure and in the sheer number of pixels. Even inexpensive digital cameras will deliver several million pixels per image, so consider image processing as a test-bed application for modeling experiments.
This was just a toy project, so it is hardly the last word in foliage detectors, and weaknesses in the model should be evident in the images above. I strongly encourage readers to explore this field and improve on what has been presented. Consider training on other classes, like "skin", "people", "sky", "brickface", etc. I would be very interested in hearing from any readers who have results to share. Good luck!
Note on Handling Images in MATLAB
Even without toolboxes, MATLAB provides several tools for dealing with images, such as imread, which is used to load images. Most color images are loaded into MATLAB as 3-dimensional arrays, and are accessed as Image(VerticalCoordinate,HorizontalCoordinate,ColorPlane). The color channels are numbered: red (1), green (2) and blue (3), so Image(:,:,2) is just the green color plane of the image.
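For instance, a minimal sketch of pulling out and viewing one color plane (the file name is hypothetical):
% Load an image, extract its green plane, and view it as a grayscale map
Image = imread('example.jpg');                       % hypothetical file name
GreenPlane = Image(:,:,2);
imagesc(GreenPlane); axis image; colormap(gray);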
Related Work
One interesting work on the subject of pixel classification (skin detection) is Skin Detection, by Jennifer Wortman. While the classifier in this document is based only on color and is constructed by hand, the reader should find some insights on pixel classification.
See Also
Feb-02-2007 posting, Pixel Classification Project: Response
Mar-23-2007 posting, Two Bits of Code
Labels:
classifier,
foliage,
graphics,
image processing,
pixel,
skin
Friday, January 19, 2007
GASplineFit: A Flexible Curve Fitting Routine
In my work, I have frequently needed to fit curves to data. Curve-fitting is a broad and surprisingly subtle subject. The two most common types of single-input curve-fitting are simple (linear and non-linear) regression and splines.
Regressions involve the discovery of optimal coefficients for some functional form, which maps the input variable to the output variable. Regressions provide a number of advantages, but these are generally qualified advantages:
Regression Characteristics
1. Common functions (linear, logistic, exponential) are well-studied and widely available. More exotic or arbitrary functions are harder to find software for.
2. Regressions provide optimal fit to data, assuming that one wants a least-squares fit. Some software, notably the MATLAB Curve Fitting Toolbox, will provide other, less common fits, such as mean absolute error (L-1) and robust fits.
3. Regressions are the safest bet for extrapolation. Extrapolation is a hazardous endeavor, but well-behaved regression functions generate the most sensible guess in "the far, unlit unknown".
4. Complex regression curves, such as high-order polynomials, can be ill-behaved, especially at the extremes of the data range. While there are methods of combatting such problems (such as fitting at the Chebyshev points), these are not widely available in commercial software.
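A two-minute illustration of the last point: fit a high-order polynomial through points sampled from a perfectly smooth curve and watch it oscillate near the ends of the range. MATLAB will warn that the polynomial is badly conditioned, which is part of the point.

X  = linspace(-1,1,21)';
Y  = 1 ./ (1 + 25*X.^2);              % a smooth target (Runge's classic example)
P  = polyfit(X,Y,20);                 % high-order polynomial through the points
Xq = linspace(-1,1,400);
plot(X,Y,'o',Xq,polyval(P,Xq),'-')    % note the wild wiggles near -1 and +1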
Splines, which are especially popular in engineering, are connected series of simple curves, most often low-order polynomials. The advantage of splines is that they are extremely flexible. Complex curves require complex regression functions, but splines can handle very complicated shapes simply by connecting several simple curves. Splines may pass directly through data points, or between them. To maintain continuity, these simple curves must come together at points called knots. Like regressions, splines come with their own qualified advantages:
Spline Characteristics
1. Splines are extremely flexible, but number and placement of knots must be chosen.
2. Technically, most splines are undefined outside the range of the knots.
3. Most commercial software is oriented toward locating a spline through a small number of given data points. "Fitting" through a large number of data points is not commonly available.
Additionally, base MATLAB directly supports the construction and evaluation of splines.
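For instance, constructing and evaluating a spline through a handful of knots takes only a few lines of base MATLAB:

KnotX = linspace(0,10,6);           % knot locations
KnotY = sin(KnotX);                 % knot values (toy data)
pp    = pchip(KnotX,KnotY);         % shape-preserving piecewise cubic
Xq    = linspace(0,10,200);
Yq    = ppval(pp,Xq);               % evaluate the spline anywhere
plot(KnotX,KnotY,'o',Xq,Yq,'-')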
GASplineFit
Wanting to free myself from having to select the right regression function every time I needed to fit a curve, I decided to turn to splines for my curve-fitting needs. The problem, as mentioned above, is that almost all spline routines and canned software are very low-level, "fitting" (if it can be called that) splines by stringing them through a very small number of selected points. My solution, GASplineFit.m, builds on the built-in MATLAB spline routines, and will generate splines which optimize any error function handled by my SampleError routine (see my posting of Jan-05-2007, Model Performance Measurement). Note: GASplineFit requires subroutine LogRand.m.
With GASplineFit, the user supplies:
1. An input variable
2. An output variable
3. The input values of the knots (I often just use linspace for these)
4. The spline type (such as 'pchip' or 'spline': see MATLAB's spline routines)
5. A selected performance measure ('MSE', 'L-1', etc.).
There are two additional parameters, which control the genetic algorithm that performs the fitting. I usually don't change these from 250 and 80. For better but slower fitting, turn these parameters up; for faster but cruder fitting, turn them down. See help GASplineFit for details.
Essentially, the software is sliding the spline knots up and down, trying to get the best fit. The routine returns:
1. The fitted spline as a MATLAB piecewise polynomial (see: help ppval)
2. The knot output values
3. The assessed error
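A call looks roughly like the following, assuming X and Y already hold the data. The argument order shown here is illustrative only; help GASplineFit gives the authoritative interface.

KnotX = linspace(min(X),max(X),8);                                % knot input values
[pp, KnotY, Err] = GASplineFit(X,Y,KnotX,'pchip','L-1',250,80);   % illustrative argument order
YHat = ppval(pp,X);                                               % evaluate the fitted spline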
Notice one very nice feature of this routine: it will fit probability curves directly to 0/1 data. Very often, the statistician is faced with a large collection of examples involving a dummy variable, which indicates membership in a class of interest, and an explanatory variable. To smooth the data and approximate the probability of class membership, common practice is to bin the explanatory variable and assess the mean for each bin. Fancier methods will fit a curve to those binned values (which isn't very smooth), or use a kernel regression (which is smooth, but typically difficult to encapsulate as code without taking the entire data set along for the ride). GASplineFit solves all of these problems. The genetic algorithm which is its engine doesn't care that our data may be only zeros and ones: it only cares about optimizing the performance function.
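For comparison, the binned-mean baseline described above takes only a few lines (a sketch, assuming X and Y are column vectors and Y is 0/1):

Edges = linspace(min(X),max(X),21);                       % 20 bins
[junk, Bin] = histc(X,Edges);                             % assign each X to a bin
Bin(Bin == length(Edges)) = length(Edges) - 1;            % fold the top edge into the last bin
PBin = accumarray(Bin,Y,[length(Edges)-1 1],@mean);       % mean of Y (class rate) per bin
Ctrs = (Edges(1:end-1) + Edges(2:end)) / 2;               % bin centers
plot(Ctrs,PBin,'o-')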
help GASplineFit explains the specifics of this routine, and provides an example of its use. MATLAB programmers beginning with genetic algorithms may be interested in examining the code, which uses an elitist genetic algorithm.
Labels:
curve,
curve-fitting,
GA,
GASplineFit,
genetic algorithm,
regression,
SampleError,
spline
Saturday, January 13, 2007
Revisiting rand (MATLAB 2007a)
Pseudo-random number generators (PRNG) are frequently used in statistical and machine learning methods for things like train/test selection and solution initialization. (See an interesting discussion of this in the Nov-22-2006 posting, Explicit Randomization in Learning algorithms, on the Machine Learning (Theory) log.) Hence, it is important to understand the operation of the pseudo-random number generator employed in one's code.
I've just gotten word that MATLAB's built-in rand function will be changing (as of MATLAB 2007a) to use the Mersenne Twister method of generating numbers by default. Note that specifying 'state' or 'seed' will switch to a generator other than the new default. Probably the most important impact at a practical level is that code which does not specify another generator (does not use 'state' or 'seed') will now generate different values than before.
Note, too, that if the randperm function continues to follow rand's lead (randperm was initialized by initializing rand in the past), then it will also produce different values than in previous releases if rand is not initialized.
I have no word on whether this affects randn or not.
MATLAB programming suggestion: Always initialize the random number functions before calling them. Repeatable results make testing code much easier.
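For example, at the top of a script (using the seeding syntax of the generator interface discussed in this post):

rand('twister',5489);     % select the Mersenne Twister with a fixed seed
randn('state',0);         % initialize randn separately
P = randperm(10);         % randperm draws from rand, so it repeats as well
X = rand(3,1);            % identical values on every run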
See my post of Dec-07-2006, Quick Tip Regarding rand and randn for more information on rand and randn.
Also see the Mar-19-2008 post, Quasi-Random Numbers.
Also in 2007a, "divide-by-zero" and "log-of-zero" warning messages will be turned off by default.
Labels:
2007a,
Mersenne Twister,
PRNG,
pseudorandom,
rand,
randn,
random,
randperm,
seed,
state
Friday, January 05, 2007
Model Performance Measurement
A wide variety of model performance measures have been devised. Despite the popularity of mean squared error for numeric models and simple accuracy for classification models, there are many other choices. For my part, I generally prefer mean absolute error for numeric models and the AUC (for class separation) and informational loss (for probability assessment) for classification models.
This log entry is pretty much just a quick giveaway: I have constructed a generic performance calculation MATLAB routine, SampleError.m. Its operation is straightforward, but I find it handy to have all of these measures in one routine, with the ability to switch among them via a simple parameter change. The use of this routine is simple, is explained by help SampleError, and it makes a great building block for modeling routines.
I update many of my MATLAB routines from time to time, and this one is no exception. Presently, though, the following performance measures are supported:
'L-1' (mean absolute error)
'L-2' (mean squared error)
'L-4'
'L-16'
'L-Infinity'
'RMS' (root mean squared error)
'AUC' (requires tiedrank() from Statistics Toolbox)
'Bias'
'Conditional Entropy'
'Cross-Entropy'
'F-Measure'
'Informational Loss'
'MAPE'
'Median Squared Error'
'Worst 10%'
'Worst 20%'
Note: I still need to verify the Cross-Entropy measure. The last two are classification performance measures: the proportion of the target class found among the 10% and 20% of cases predicted to be most likely, respectively.
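For reference, several of these measures are one-liners in base MATLAB. This is only a sketch, with P and T standing in for column vectors of predictions and targets; the AUC lines assume T is a 0/1 class indicator and need tiedrank from the Statistics Toolbox.

E    = P - T;                          % residuals
MAE  = mean(abs(E));                   % 'L-1'
MSE  = mean(E.^2);                     % 'L-2'
RMS  = sqrt(MSE);                      % 'RMS'
LInf = max(abs(E));                    % 'L-Infinity'
MAPE = mean(abs(E ./ T)) * 100;        % 'MAPE' (assumes no zero targets)
R    = tiedrank(P);  N1 = sum(T == 1);  N0 = sum(T == 0);
AUC  = (sum(R(T == 1)) - N1*(N1+1)/2) / (N1*N0);   % 'AUC' via the rank-sum identity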
Incidentally, I'd appreciate any feedback on any of the code in this Web log, whether it be about typos, outright coding errors or efficiency issues. Also, please send suggestions for additional measures.
Labels:
AUC,
error measure,
L-1,
L-2,
mean squared error,
MSE,
performance,
SampleError