Data Mining in MATLAB: 2007a

Showing posts with label 2007a. Show all posts

Sunday, April 08, 2007

Getting Data Into MATLAB Using textread

Before any analysis can be performed in MATLAB, the data must somehow be imported. MATLAB offers a number of functions for data import (dlmread, fscanf, xlsread, etc.- try help iofun for more information), which the reader should explore. In my work, I tend to work with other software systems (relational databases and statistical packages) which can export data to text files. I have found it convenient for my purposes to convert all categorical data to numeric form (dummy variables or integer codes) in the originating system, and then to dump the data to a tab-delimited text file for consumption by MATLAB.

Below is a typical textread call, from one of my recent projects (Web formatting necessitates breaking this up: it is supposed to be a single line of code):

A = ...
textread('F:\MyFile.dat','',-1,'delimiter','\t', ...
'headerlines',1,'emptyvalue',NaN);

An explanation of each textread parameter being used follows:

A is the variable containing the data after it is loaded from disk.

'F:\MyFile.dat' is the fully qualified name of the file being loaded.

The empty string indicates that no format string is being used.

The -1 indicates that all rows of data are to be loaded. To load some specific fraction of all rows, change this value to the number of rows to be loaded.

All parameters after this point come in pairs.

The 'delimiter' indicates that the data is delimited (as opposed to fixed-width). The '\t' lets textread know that the delimiter is the tab character. Comma-delimited .CSV files, for instance, would use 'delimiter',',' instead.

Next is the parameter pair 'headerlines',1, which tells textread to ignore the first row (it contains column headers).

Last, the parameter pair 'emptyvalue',NaN indicates that any missing values should be represented in the resulting MATLAB matrix as NaN ("not-a-number") values. Other values could be used. Some analysts are accustomed to using values like -99, -9999, etc., although it is generally the convention in MATLAB to use NaN to represent missing values.

I use a separate chunk of code to read in the header line and digest the variable names. Some MATLAB programmers prefer using cell arrays instead. Yet a third possibility of dealing with variable names is to use the categorical arrays and dataset arrays from the Statistics Toolbox (new with MATLAB version 2007a), but I haven't had a chance to explore them yet.

Saturday, March 03, 2007

MATLAB 2007a Released

The latest version of MATLAB, 2007a, has been released. While some changes to base MATLAB are of interest to data miners (multi-threading, in particular), owners of the Statistics Toolbox receive a number of new features in this major upgrade.

First, the Statistics Toolbox makes new data structures available for categorical data (categorical arrays) and mixed-type data (dataset arrays). Most MATLAB users performing statistical analysis or data mining tend to store their data in numerical matrices (my preference) or cell arrays. Using ordinary matrices requires the programmer/analyst to manage things like variable names. Cell arrays deal with the variable name issue, but preclude some of the nice things about using MATLAB matrices. Hopefully these new structures make statistical analysis in MATLAB more natural.

Second, the Statistics Toolbox updates the classify function, permitting it to output the discovered discriminant coefficients (at last!). I have been complaining about this for a long time. Why? Because classify provides quadratic discriminant analysis (QDA), an important non-linear modeling algorithm. Without the coefficients, though, it is impossible to deliver models to other (admittedly inferior to MATLAB) platforms.

Also of note: the Genetic Algorithm and Direct Search Toolbox now includes simulated annealing.

More information on 2007a is available at The Mathworks.

Saturday, January 13, 2007

Revisiting rand (MATLAB 2007a)

Pseudo-random number generators (PRNG) are frequently used in statistical and machine learning methods for things like train/test selection and solution initialization. (See an interesting discussion of this in the Nov-22-2006 posting, Explicit Randomization in Learning algorithms, on the Machine Learning (Theory) log.) Hence, it is important to understand the operation of the pseudo-random number generator employed in one's code.

I've just gotten word that MATLAB's built-in rand function will be changing (as of MATLAB 2007a) to use the Mersenne Twister method of generating numbers be default. Note that use of 'state' or 'seed' will change to a generator other than the default. Probably the most important impact at a practical level is that code not specifying another generator (not using 'state' or 'seed') will now generate different values.

Note, too, that if the randperm function continues to follow rand's lead (randperm was initialized by initializing rand in the past), then it will also produce different values than in previous releases if rand is not initialized.

I have no word on whether this affects randn or not.

MATLAB programming suggestion: Always initialize the random number functions before calling them. Repeatable results make testing code much easier.

See my post of Dec-07-2006, Quick Tip Regarding rand and randn for more information on rand and randn.

Also see the Mar-19-2008 post, Quasi-Random Numbers.

Also in 2007a, "divide-by-zero" and "log-of-zero" warning messages will be turned off by default.