Before any analysis can be performed in MATLAB, the data must somehow be imported. MATLAB offers a number of functions for data import (dlmread, fscanf, xlsread, etc.- try help iofun for more information), which the reader should explore. In my work, I tend to work with other software systems (relational databases and statistical packages) which can export data to text files. I have found it convenient for my purposes to convert all categorical data to numeric form (dummy variables or integer codes) in the originating system, and then to dump the data to a tab-delimited text file for consumption by MATLAB.
Below is a typical textread call, from one of my recent projects (Web formatting necessitates breaking this up: it is supposed to be a single line of code):
A = ...
An explanation of each textread parameter being used follows:
A is the variable containing the data after it is loaded from disk.
'F:\MyFile.dat' is the fully qualified name of the file being loaded.
The empty string indicates that no format string is being used.
The -1 indicates that all rows of data are to be loaded. To load some specific fraction of all rows, change this value to the number of rows to be loaded.
All parameters after this point come in pairs.
The 'delimiter' indicates that the data is delimited (as opposed to fixed-width). The '\t' lets textread know that the delimiter is the tab character. Comma-delimited .CSV files, for instance, would use 'delimiter',',' instead.
Next is the parameter pair 'headerlines',1, which tells textread to ignore the first row (it contains column headers).
Last, the parameter pair 'emptyvalue',NaN indicates that any missing values should be represented in the resulting MATLAB matrix as NaN ("not-a-number") values. Other values could be used. Some analysts are accustomed to using values like -99, -9999, etc., although it is generally the convention in MATLAB to use NaN to represent missing values.
I use a separate chunk of code to read in the header line and digest the variable names. Some MATLAB programmers prefer using cell arrays instead. Yet a third possibility of dealing with variable names is to use the categorical arrays and dataset arrays from the Statistics Toolbox (new with MATLAB version 2007a), but I haven't had a chance to explore them yet.