Data Mining in MATLAB: Basic Summary Statistics in MATLAB

Friday, April 13, 2007

Basic Summary Statistics in MATLAB

This posting covers basic summary statistics in MATLAB.

First, note that MATLAB is strongly array-oriented, so data sets to be analyzed are most often stored as a matrix of values. Note that the convention in MATLAB is for variables to be stored in columns, and observations to be stored in rows. This is not a hard-and-fast rule, but it is much more common than the alternative (variables in rows, observations in columns). Besides, most MATLAB routines (whether from the MathWorks or elsewhere) assume this convention.

Basic summaries are easy to obtain from MATLAB. For the examples below, the following matrix of data, A, will be used (no, it's not very exciting, but it will do for our purposes):

>> A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1]

A =

     1     2     3     4
    -1    10     8     5
     9     8     7     0
     0     0     0     1

MATLAB matrices are indexed as MatrixName(row,column):

>> A(2,1)

ans =

-1
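
The same syntax accepts a colon in place of an index to select an entire row or column, which is handy when only one variable or one observation is of interest. A brief sketch using the matrix A from above (the variable names col2 and row3 are only illustrative):

```matlab
A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1];

col2 = A(:,2)   % every row of column 2: [2; 10; 8; 0]
row3 = A(3,:)   % every column of row 3: [9 8 7 0]
```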

Common statistical summaries are available in MATLAB, such as mean (arithmetic mean), median, min (minimum value), max (maximum value) and std (standard deviation). Their use is illustrated below:

>> mean(A)

ans =

2.2500 5.0000 4.5000 2.5000

>> median(A)

ans =

0.5000 5.0000 5.0000 2.5000

>> min(A)

ans =

-1 0 0 0

>> max(A)

ans =

9 10 8 5

>> std(A)

ans =

4.5735 4.7610 3.6968 2.3805
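
By default, std divides by n-1 (the sample standard deviation); passing 1 as the second argument makes it divide by n instead. The sketch below checks both against the textbook formulas for the first column of A. The names c, n, s_sample and s_pop are illustrative and not part of MATLAB:

```matlab
A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1];
c = A(:,1);                 % first column (the first variable)
n = numel(c);               % number of observations

% Sample standard deviation: normalize by n-1 (the default behavior of std)
s_sample = sqrt(sum((c - mean(c)).^2) / (n - 1));

% Population standard deviation: normalize by n (what std(c,1) computes)
s_pop = sqrt(sum((c - mean(c)).^2) / n);

disp([s_sample, std(c)])    % the two values should agree
disp([s_pop, std(c,1)])     % likewise
```

The n-1 form is the usual choice when the data are a sample drawn from a larger population; the n form describes the spread of the data set itself.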

Note that each of these functions operates along the columns, yielding one summary per column, stored in a row vector. Sometimes it is preferable to calculate along the rows instead. Some routines can be redirected with an additional parameter, like this:

>> mean(A,2)

ans =

2.5000
5.5000
6.0000
0.2500

The above calculates the arithmetic mean of each row, storing the results in a column vector. The second argument to mean, if specified, indicates the dimension along which mean is to operate.
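
Not every routine takes the dimension as its second argument. For min and max, the second argument is a second array to compare against, so the dimension goes third, after an empty placeholder. For std, the second argument selects the normalization, so the dimension is again third. A short sketch with the same A (the variable names are illustrative):

```matlab
A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1];

rowMedian = median(A,2)     % dimension is the 2nd argument: [2.5; 6.5; 7.5; 0]
rowMin    = min(A,[],2)     % [] placeholder, then dimension: [1; -1; 0; 0]
rowMax    = max(A,[],2)     % likewise: [4; 10; 9; 1]
rowStd    = std(A,0,2)      % 0 keeps the default n-1 normalization

% Pitfall: min(A,2) does NOT work along rows; it compares each element with 2
clipped = min(A,2)
```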

For routines without this capability, the data matrix may be transposed (rows become columns and columns become rows) using the apostrophe operator while feeding it to the function:

>> mean(A')

ans =

2.5000 5.5000 6.0000 0.2500

Note that, this time, the result is stored in a row vector.
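
If a column vector is wanted, as with mean(A,2), the result can simply be transposed back. A small check, assuming the same A as above:

```matlab
A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1];

rowMeans = mean(A')'                    % transpose in, transpose the result back out
sameAsDim = isequal(rowMeans, mean(A,2))  % compare with the dimension-argument form
```

For complex-valued data, keep in mind that ' also takes the complex conjugate, whereas .' transposes without conjugating. For real data such as A, the two are identical.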

The colon operator, :, can be used to dump all of the contents of an array into one giant column vector. The result of this operation can then be fed to any of our summary routines:

>> A(:)

ans =

1
-1
9
0
2
10
8
0
3
8
7
0
4
5
0
1

>> mean(A(:))

ans =

3.5625
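
The same trick works with the other summary routines shown earlier. A brief sketch over all 16 values of A (the variable names are illustrative):

```matlab
A = [1 2 3 4; -1 10 8 5; 9 8 7 0; 0 0 0 1];

overallMin    = min(A(:))      % -1
overallMax    = max(A(:))      % 10
overallMedian = median(A(:))   % 2.5
overallStd    = std(A(:))      % one standard deviation over all 16 values
```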

The reader will find more information on summary routines in base MATLAB through:

help datafun

The MATLAB Statistics Toolbox

MATLAB users lucky enough to own the Statistics Toolbox have still more summaries available, such as iqr (inter-quartile range), trimmean (trimmed mean) and geomean (geometric mean). There are also extended versions of several summary functions, such as nanmean and nanmax, which ignore NaN (IEEE floating-point "not-a-number") values. NaN is commonly used to represent missing values in MATLAB.
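
Without the toolbox, a similar effect can be had in base MATLAB by filtering out the NaN values with isnan before summarizing. This is only a sketch for a single vector (the data in x are made up for illustration), not a drop-in replacement for nanmean on matrices:

```matlab
x = [1 NaN 3 5];                  % hypothetical data with one missing value

plainMean = mean(x)               % NaN: a single NaN contaminates the result
validMean = mean(x(~isnan(x)))    % 3: mean of the non-missing values only
```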

To learn more, see the "Descriptive Statistics" section when using:

help stats

3 comments:

Mark said...: Hi,

I have just found your blog, and I find it very interesting and useful.

Currently I'm using MATLAB 2008a for my thesis, looking at a lot of data. Thus speed is one of my primary concerns...

You have said:
"Note that the convention in MATLAB is for variables to be stored in columns, and observations to be stored in rows. This is not a hard-and-fast rule, but it is much more common than the alternative (variables in rows, observations in columns). Besides, most MATLAB routines (whether from the MathWorks or elsewhere) assume this convention."

This is true, and indeed (to my surprise) it is faster to sum through the rows:
>> x = randn(10000);
>> tic; sum(x,1); toc;
Elapsed time is 0.212993 seconds.
>> tic; sum(x,2); toc;
Elapsed time is 0.171381 seconds.

On the other hand, as far as I know, MATLAB is one of the few languages that store arrays in column-major order. Thus accessing the columns of a 2D array should be faster, because of less caching activity. So now I'm confused...

I know that better-structured code is more important than a few percent in execution time. Nevertheless, I'm interested...

I'd be happy for any comments regarding this...

Thanks,
Mark; 9:28 AM

Anonymous said...: Hi,
Thank you for your presentation. It's very useful and creative!
Would you please tell me which standard deviation formula is for what purpose, as there are two: one with 'n' and the other with 'n-1' in the denominator?
Bhoj R Shrestha; 4:55 PM

Flying said...: Hi,
I just found your blog as well and I wanted to say thank you for all the hard work! The grouping part is a lifesaver (beforehand I had a loop running over 1M records... you can tell why I gave up on it).
3:39 PM