Wednesday, November 08, 2006

Why MATLAB for Data Mining?

There are plenty of commercial data mining tools and statistics packages out there. Why choose MATLAB? In short: flexibility and control. Having reviewed a wide assortment of analytical software for PC AI magazine and used any number of others in my work, I've had the opportunity to sample many tools. Some commercial tools are very polished, providing all manner of "bells and whistles" for things like data import, data preprocessing, etc.

In my work, however, I always found something missing, even in the best software. Real-world data mining projects often involve technical constraints which are unanticipated by commercial software developers. Not that this should be surprising: real projects come from all different fields and some impose the most bizarre constraints, like special treatment of negative values, large numbers of missing values, strange performance functions, small data requiring special testing procedures and on and on.

MATLAB provides the flexibility to deal with these quirky issues, if the analyst is able to code a solution. As a programming language, MATLAB is very like other procedural languages such as Fortran or C (MATLAB does have object-oriented features, but I won't get into that here). If I need to use a special type of regression and want to use my own feature selection process within a k-fold cross-validation procedure that needs a special data sampling procedure, I can by programming it in MATLAB.

Stepping back, consider major tasks frequently undertaken in a data mining project: data acquisition, data preparation, modeling, model execution and reporting/graphing. MATLAB allows me to do all of these under one "roof". The one gap with MATLAB is that it is not very good at relational joins. Look-up tables (even large ones) for tacking on a single variable are fine, but MATLAB is not built to perform SQL-style joins.

Much of this would be possible in more conventional programming languages, such as C or Java, but MATLAB's native treatment of arrays as data types and provision of many analysis-oriented functions in the base product make it much more convenient, and ensure that my MATLAB code will run for any other MATLAB user, without the need for them to own the same code libraries as I do.

Statistics packages fall short in that most of them provide a collection of canned routines. However many routines and options they provide, there will eventually be something which is missing, or a new procedure which you will find difficult or impossible to implement. Some statistics packages include their own scripting languages, but most of these are weak in comparison to full-blown programming languages.

Data mining tools vary, but tend to be even more limited in the procedures they provide than the statistics packages. Some even have only a single algorithm! They tend to be even more polished and easy to use than the statistics packages, but are hence that much more confining.

Graphing capability in MATLAB is among the best in the business, and all MATLAB graphs are compeltely configurable through software. Cutting-and-pasting to get data out of a statistics package (some provide little or no graphing capability in the base product) and into Excel isn't so bad if there is only one graph to produce. Recently, I needed to generate graphs for 15 market segments. Doing that by hand would have been collosally wasteful. In MATLAB, I set up a single graph with the fonts, etc. the way I wanted them, and looped over the data, producing 15 graphs.



On an unrelated note, you can read more posts my myself at Data Mining and Predictive Analytics. Also, consider visiting my seriously out-dated Web page at will.dwinnell.com.

10 comments:

Anonymous said...

Hi Will,

can you explain more about your comment "The one gap with MATLAB is that it is not very good at relational joins. Look-up tables (even large ones) for tacking on a single variable are fine, but MATLAB is not built to perform SQL-style joins."

Regards,
Eric Sampson
The MathWorks

Will Dwinnell said...

It is very easy to write vectorized MATLAB code to use a look-up table:


>> LookUp = [0.01 0.02 0.05 0.4 0.97]'

LookUp =

0.0100
0.0200
0.0500
0.4000
0.9700

>> MyData = [1 17 100; 3 17 110; 3 16 95; 4 19 89]

MyData =

1 17 100
3 17 110
3 16 95
4 19 89

>> LookUp(MyData(:,1))

ans =

0.0100
0.0500
0.0500
0.4000


Relational database joins, however, often involve more than one field to indicate the join. For instance, rows in the two tables being joined might need to match on 4 different columns.

If there is an easy and efficient way to do this in base MATLAB, I'd love to hear about it. Honestly, being proven wrong on this would make my week.


-Will

Anonymous said...

Will, I might be being dense but I can't extend your example to what you would like to do. Could you provide another simple example with the desired output? That way I can try to work out the required code, if it is possible.

Cheers,
Eric Sampson

Will Dwinnell said...

In a relational data base, these tables may be joined by matching up multiple key fields.

Let's say that there are two tables of data which we wish to join, A and B:

A =
1 1 2 99 88
1 2 1 77 88
1 1 2 66 55
4 1 3 44 11
1 2 1 66 22
8 0 8 11 44
4 1 3 22 88

B =
1 1 1 7
1 1 2 4
1 2 1 6
2 1 1 0
4 1 3 9

Relational joins are additionally are defined by which table's rows must be included. Let's say that we want all of the rows in A matched up with and matching rows in B, where the values in the first 3 columns match. The result, C, would be:

C =
1 1 2 99 88 4
1 2 1 77 88 6
1 1 2 66 55 4
4 1 3 44 11 9
1 2 1 66 22 6
8 0 8 11 44 NaN
4 1 3 22 88 9

Notice: 1. That in the second-to-last row, there is no match in B for the row in A, so a missing value is generated (which I indicated as a NaN, which is the convention in MATLAB). 2. There is no match for the second to last row in B, so its values never appear in the result.

Essentially, this is a look-up process, where more than one column must match up.

Does this make more sense?

Anonymous said...

Hi Will,

This isn't optimal but it might give you some things to think about:


A = ...
[1 1 2 99 88;
1 2 1 77 88;
1 1 2 66 55;
4 1 3 44 11;
1 2 1 66 22;
8 0 8 11 44;
4 1 3 22 88;]

B = ...
[1 1 1 7 5;
1 1 2 4 6;
1 2 1 6 4;
2 1 1 0 3;
4 1 3 9 2;]

numCols = 3;

C = [A nan(size(A,1),size(B,2)-numCols)];

[a, m, n] = unique(A(:,1:numCols),'rows')

ind = ismember(B(:,1:numCols), a, 'rows')

vals = B(ind,numCols+1:end);


for i = 1:size(vals,1)

C(n==i,6:7) = repmat(vals(i,:),sum(n==i),1);

end

Anonymous said...

Oops, that for loop should have been:

for i = 1:size(vals,1)

C(n==i,end-size(vals,2)+1:end) = repmat(vals(i,:),sum(n==i),1);

end

Will Dwinnell said...

Thanks! I had thought about something like this, but give me some time to digest your code. You can understand, though, why I consider this type of thing more cumbersome in MATLAB, than, say, SQL.

Anonymous said...

True, but you can put it into a function and then forget about it:

A = ...
[1 1 2 99 88;
1 2 1 77 88;
1 1 2 66 55;
4 1 3 44 11;
1 2 1 66 22;
8 0 8 11 44;
4 1 3 22 88;]

B = ...
[1 1 1 7 5;
1 2 1 4 6;
1 1 2 6 4;
2 1 1 0 3;
4 3 1 9 2;]

C = leftjoin(A,1:3,B,[1 3 2])


where

function C = leftjoin(A,colIndA,B,colIndB)

colIndA = unique(colIndA);
colIndB = unique(colIndB);

C = [A nan(size(A,1),size(B,2)-colIndB(end))];

[a, m, n] = unique(A(:,colIndA),'rows');

ind = ismember(B(:,colIndB), a, 'rows');

vals = B(ind,colIndB(end)+1:end);

colsToFill = size(A,2)+1 : size(C,2);


for i = 1:size(vals,1)

C(n==i,colsToFill) = repmat(vals(i,:),sum(n==i),1);

end

elventear said...

I would have to recommend the combo Python/numpy/scipy/matplotlib. It's a natural replacement of Matlab, comes with a lot of the basic tools and its grounded on Python which, IMHO, has a beautiful syntax in contrast to Matlab which is a little bit clunky for my taste.

In addition, all the additional robust packages (and OS hooks) available for Python make a good candidate to develope a final product, rather than just one time prototyping.

andrew lim said...

Great function but with a minor problem. It does not return the right results with your given matrices A and B.

It returns:

C =

1 1 2 99 88 4 6
1 2 1 77 88 6 4
1 1 2 66 55 4 6
4 1 3 44 11 NaN NaN
1 2 1 66 22 6 4
8 0 8 11 44 NaN NaN
4 1 3 22 88 NaN NaN

When it should return:

C =

1 1 2 99 88 6 4
1 2 1 77 88 4 6
1 1 2 66 55 6 4
4 1 3 44 11 NaN NaN
1 2 1 66 22 4 6
8 0 8 11 44 NaN NaN
4 1 3 22 88 NaN NaN

this is because your function assumes that the matched rows in B are ordered from small to large.

you may wish to consider the following modifications:

[ind,loc] = ismember(...

...

loc_index = loc(ind);

...

C(n==loc_index(i),colsToFill) = repmat(vals(i,:),sum(n==loc_index(i)),1);


Rgds,
Andrew Lim
JPMorgan
(andrew.t.lim@jpmorgan.com)