Wednesday, March 26, 2008

Statistical Data Management in MATLAB

In the Apr-08-2007 posting, Getting Data Into MATLAB Using textread, basic use of the textread function was explained, and I alluded to code which I use to load the variable names. The name-handling code was not included in that post, and reader Andy asked about it. The code in question appears below, and an explanation follows (apologies for the Web formatting).


% Specify filename
InFilename = 'C:\Data\LA47INTIME.tab';

% Import data from disk
A = textread(InFilename,'','headerlines',1, ...
'delimiter','\t','emptyvalue',NaN,'whitespace','.');

% Establish number of observations ('n') and variables ('m')
[n, m] = size(A);

% Note: Load the headers separately because some software
% writes out stupid periods for missing values!!!

% Import headers from disk
FileID = fopen(InFilename); % Open data file
VarLabel = textscan(FileID,'%s',m); % Read column labels
VarLabel = VarLabel{1}; % Extract cell array
fclose(FileID); % Close data file

% Assign variable names
for i = 1:m % Loop over all variables
    % Shave off leading and trailing double-quotes
    VarLabel{i} = VarLabel{i}(2:end-1);

    % Assign index to variable name
    eval([VarLabel{i} ' = ' int2str(i) ';']);
end


After the user specifies the data file to be loaded, the data are read into the array 'A', whose dimensions are stored in 'n' (the number of observations) and 'm' (the number of variables).

Next, the file is re-opened to read in the variable names. Variable names are stored two ways: the actual text names are stored in a cell array, 'VarLabel', and one new MATLAB variable is created as an index for each column.

To illustrate, consider a file containing 4 columns of data, "Name", "Sex", "Age" and "Height". The variable 'VarLabel' would contain those 4 names as entries. Assuming that one stores lists of columns as vectors of column indices, then labeling is easy: VarLabel{3} is "Age" (curly braces return the text itself, while VarLabel(3) returns a one-element cell array holding it). This is especially useful when generating series of graphs which need appropriate labels.
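As a rough sketch of that labeling idea: the sample values below are invented, the non-numeric columns are simply left as NaN, and the names 'Cols' and 'k' are my own placeholders rather than anything from the code above.

```matlab
% Hypothetical stand-in for the loaded data (values invented)
A = [NaN NaN 34 170; NaN NaN 28 182; NaN NaN 45 165];
VarLabel = {'Name'; 'Sex'; 'Age'; 'Height'};

% Columns of interest, stored as a vector of column indices
Cols = [3 4];

% One labeled graph per selected column
for k = Cols
    figure;
    plot(A(:,k),'o-');
    title(VarLabel{k});      % curly braces return the text itself
    xlabel('Observation');
    ylabel(VarLabel{k});
end
```

Changing 'Cols' is all that is needed to produce a different set of graphs, each titled with the matching column name.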

Also, four new variables will be created, which index the array 'A'. They are 'Name' (which has a value of 1), 'Sex' (value: 2), 'Age' (3) and 'Height' (4). They make indexing into the main data array easy. The column of ages is: A(:,Age)
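A small sketch of this kind of indexing, using made-up values (the 'Name' and 'Sex' columns are coded numerically here purely for illustration, and 'Ages', 'Subset' and 'OlderHeights' are hypothetical names):

```matlab
% Hypothetical stand-in for the loaded data (values invented)
A = [1 0 34 170; 2 1 28 182; 3 0 45 165];

% Index variables, as the loading loop would have created them
Name = 1;  Sex = 2;  Age = 3;  Height = 4;

% A single column, selected by name
Ages = A(:,Age);

% Several columns at once
Subset = A(:,[Age Height]);

% Heights of observations whose age exceeds 30
OlderHeights = A(A(:,Age) > 30,Height);
```

Because the names are ordinary numeric indices, they can be mixed freely with any other MATLAB indexing, including logical selection of rows.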

I had begun bundling this code as a function, but could not figure out how to assign the variable indices outside of the scope of the function. It is a short piece of code, and readers will likely want to customize some details, anyway. Hopefully, you find this helpful.

9 comments:

Cris said...

Try the function ASSIGNIN to make variables in the caller's workspace.
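A minimal sketch of how that suggestion might look, assuming the labels have already been read and stripped of their quotes. The function name 'AssignColumnIndices' is hypothetical, and this is one possible packaging rather than the author's code:

```matlab
function AssignColumnIndices(VarLabel)
% Create one index variable per column label in the caller's workspace.
% (Hypothetical helper: the name and interface are assumptions.)
for i = 1:numel(VarLabel)
    % Guard against labels which are not legal MATLAB variable names
    if ~isvarname(VarLabel{i})
        error('Invalid variable name: %s', VarLabel{i});
    end

    % Assign the column index to a variable of that name in the caller
    assignin('caller', VarLabel{i}, i);
end
end
```

Saved as AssignColumnIndices.m and called as AssignColumnIndices({'Name'; 'Sex'; 'Age'; 'Height'}), this should leave 'Age' equal to 3 in the calling workspace, without resorting to eval.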

Will Dwinnell said...

Thanks for the suggestion!

If I get a chance, I will circle back and package this as a function.

Andy said...

Very cool! Thanks Will!
I'm looking forward to incorporating this in some of my scripts.

Anonymous said...

Being a non-Matlab user, but having seen how many people use it, I am quite surprised that this type of functionality isn't already shipped with the system.

There are many systems out there with equal programmability and extensibility to Matlab which already come with this type of generic base functionality included.

This makes me very curious as to what the "extra" is in Matlab that draws so many people to it?

Will Dwinnell said...

It is difficult to respond to your general question without knowing what software you are comparing MATLAB to. Still, I addressed this question broadly in my Nov-08-2006 post, Why MATLAB for Data Mining?.

In the specific case of the post to which you've commented, I can say that it is relatively easy to load data into a single, large array in MATLAB. Most of the few lines of code I supplied were specifically for the purpose of constructing labels and indices associated with such an array.

In contrast, relational databases and most statistical software use the equivalent of individual, named column vectors in MATLAB. Treating variables instead as a single array with indices and labels provides a convenient mechanism for generic manipulation of all variables, and for automatic generation of plots, etc.

Anonymous said...

Hi Will,

Thanks for the quick reply. I've been reading your blog for a while now and did read that post when you first put it up.

There are tools and languages which have the specific abilities you just commented on built in, leaving the modeler free to do other things.

I guess my question was more general than just this data example, and I in no way meant to rag on Matlab. Your post simply made me think there must be things in Matlab which I am not seeing or understanding which create such wide appeal.

Anonymous said...

Hi, I was wondering if you'd ever come across a way to read in DBF or WK1 formatted files in Matlab. I've tried numerous things but always have to resave these formats to XLS or CSV (which I wish to avoid in the name of automation). Thanks, Chris

Anonymous said...

I should have left off the WK1 file format from the last comment. That's obviously solved by canned routines in base Matlab.

Clark Adams said...

MATLAB's a pretty useful tool then if it could pretty much do some data management like this. A lot of engineers use this program for numerical operations for complicated calculations. Could you teach more tricks in using MATLAB as some sort of a data manager like the typical spreadsheet programs out there? Thanks!