Thursday, February 12, 2009

Parallel Programming: Another Look


In my last posting, Parallel Programming: A First Look (Nov-16-2008), I introduced the subject of parallel programming in MATLAB. In that post, I briefly described my experiences with the MATLAB Parallel Computing Toolbox, from the MathWorks. Since then, I have been made aware of another parallel programming product for MATLAB, the Jacket Engine for MATLAB, from AccelerEyes.

Jacket differs from the Parallel Computing Toolbox in that Jacket off-loads work to the computer's GPU (graphics processing unit), whereas the Parallel Computing Toolbox distributes work over multiple cores or processors. Each solution has its merits, and MATLAB programmers interested in accelerating computation would do well to investigate the nuances of each.

Some History

Having observed the computer hardware industry for several decades now, I have witnessed the arrival and departure of any number of special-purpose add-in cards which have been used to speed up math for things like neural networks, etc. For my part, I have resisted the urge to employ such hardware assistance for several reasons:

First, special hardware nearly always requires special software. Accommodating the new hardware environment with custom software means an added learning curve for the programmer and drastically reduced code portability.

Second, there was the cost of the hardware itself, which was often considerable.

Third, there was the fundamental fact that general-purpose computing hardware was inexorably propelled forward by a very large market demand. Within 2 or 3 years, even the coolest turbo board would be outclassed by new PCs, which didn't involve either of the two issues mentioned above.

Two significant items have emerged in today's computer hardware environment: multi-core processors and high-power graphics processors. As chip complexity has continued to grow, chip makers like Intel and AMD have fit multiple "cores" onto single chips, and even low-end PCs today sport central processors featuring at least two cores, which you may think of more-or-less as "2 (or more) computers on a single chip". It is tempting to think that this would yield a direct benefit to the user, but the reality is more subtle. Most software was written to run on single-core computers, and is not equipped to take advantage of the extra computing power of today's multi-core computers. This is where the Parallel Computing Toolbox steps in, by providing programmers a way to distribute the execution of their programs over several cores or processors, which can substantially improve performance.
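The idea of distributing independent loop iterations over several cores can be illustrated with a minimal sketch using parfor, the parallel loop construct of the Parallel Computing Toolbox. The workload (a sum of squares) and the variable names are hypothetical and chosen only for illustration. The sketch assumes a MATLAB release in which parfor is available; depending on the release, a worker pool may have to be opened first, and without workers the loop should simply run serially.

```matlab
% Minimal parfor sketch (hypothetical workload, not from the original post).
n = 8;                    % number of independent tasks (arbitrary choice)
results = zeros(1, n);    % sliced output: iteration k writes only results(k)

parfor k = 1:n
    % Iterations do not depend on each other, so MATLAB is free to
    % hand them to different workers (cores) in any order.
    results(k) = sum((1:k).^2);
end

% Sanity check against the closed form k(k+1)(2k+1)/6.
k = 1:n;
expected = k .* (k + 1) .* (2*k + 1) / 6;
assert(isequal(results, expected));
```

The loop qualifies for parfor only because no iteration reads a value written by another one; loops with such dependencies have to be restructured before they can be distributed this way.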

Similarly, the graphics subsystem in desktop PCs has also evolved to a very sophisticated state. At the dawn of the IBM PC (around 1980), graphics display cards, with few exceptions, simply converted the contents of a section of memory into a display signal usable by a computer monitor. Graphics cards did little more.

Over time, though, greater processing functionality was added to graphics cards, culminating in compute engines which rival supercomputer-class machines of only a few years ago. This evolution has been fueled by the inclusion of many processing units (today, some cards contain hundreds of these units). Originally designed to perform specific graphics functions, many of these units are now small, somewhat general-purpose computers, and they can be programmed to do things having nothing to do with the image shown on the computer's monitor. Tapping into this power requires some sort of programming interface, though, which is where Jacket comes in.


Here is a simple assessment of the pros and cons of these two methods of achieving parallel computing on the desktop:



Multi-core processors (Parallel Computing Toolbox)

Pro: The required hardware is cheap. If you program in MATLAB, you probably have at least 2 cores at your disposal already, if not more.

Con: Most systems top out at 4 cores, limiting the potential speed-up with this method (although doubling or quadrupling performance isn't bad).

GPU (Jacket)

Pro: The number of processing units which can be harnessed by this method is quite large. Some of the fancier graphics cards have over 200 such units.

Con: The required hardware may be a bit pricey, although the price/performance is probably still very attractive.

Con: Most GPUs will only perform single-precision floating-point math. Newer GPUs, though, will perform double-precision floating-point math.

Con: Moving data from the main computer to the graphics card and back takes time, eating into the potential gain.
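Two of the GPU drawbacks above can be made concrete with a short sketch that runs in plain MATLAB, with no GPU or Jacket required. The first part compares single and double precision on the CPU. The second part is a back-of-the-envelope model of how transfer time reduces the effective speed-up; all three timings in it are made-up numbers used only to show the arithmetic, not measurements.

```matlab
% Part 1: single vs. double precision (plain MATLAB, no GPU involved).
eps_single = eps('single');   % 2^-23, roughly 1.2e-7
eps_double = eps('double');   % 2^-52, roughly 2.2e-16

x = 0.1 * ones(1e6, 1);       % one million copies of 0.1; exact sum is 1e5
err_double = abs(sum(x) - 1e5);
err_single = abs(double(sum(single(x))) - 1e5);
fprintf('Summation error: double = %g, single = %g\n', err_double, err_single);

% Part 2: effect of transfer overhead on speed-up.
% HYPOTHETICAL timings in seconds, chosen for illustration only.
t_cpu      = 1.0;   % time to do the work on the CPU
t_gpu      = 0.1;   % time to do the same work on the GPU
t_transfer = 0.5;   % time to copy data to the GPU and back

raw_speedup       = t_cpu / t_gpu;                 % ignores data movement
effective_speedup = t_cpu / (t_gpu + t_transfer);  % includes data movement
fprintf('Raw speed-up = %.2f, effective speed-up = %.2f\n', ...
        raw_speedup, effective_speedup);
```

Under these assumed timings, a tenfold raw advantage shrinks to well under twofold once the copies are counted, which is why keeping data resident on the card between operations matters.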


My use of the Parallel Computing Toolbox has been limited to certain, very specific tasks, and I have not used Jacket at all. The use of ubiquitous multi-core computers and widely-available GPUs avoids most of the problems I described regarding special-purpose hardware. It will be very interesting to see how these technologies fit into the technological landscape over the next few years, and I am eager to learn of readers' experiences with them.


Dean Abbott said...

I remember those i860 board addons for neural networks in the 90s (I think Mercury had one...) and I even went to a programming class once. You are very much correct in that the specialized software was a big hurdle--rewriting C code at that time to work in the i860 environment was a problem (not too difficult, but you had to rethink how you stored your data arrays before you started any programming at all). Also there were some other hardware solutions (SIMD computers--Single Instruction, Multiple Data) that worked pretty well, but the cost was so high that they moved away from neural networks into more basic image processing, and then died completely.

For any of this technology to work for most of us, it has to be completely or nearly completely seamless. Any time you have to send folks to weeks of classes to figure out how to program in the special way some hardware requires, that hardware becomes unusable for most organizations.

Kees said...

I tried Jacket, first version 0.5 and also version 1.0.1. It looks very promising, because doing matrix multiplication on the GPU is theoretically much faster. Unfortunately, even version 1.0.1 lacks essential abilities, such as vertical matrix concatenation, so I could not test a real application (I ended up moving too much data back and forth between GPU and CPU memory, thereby losing all advantage). Also, I experienced a lot of crashes due to bugs in the Jacket MEX-files.

I could, however, test a very simple program which should be able to demonstrate the matrix computation advantage of Jacket. The results are impressive, given that I used a laptop with a top-line CPU (Intel Core 2 Duo T7300, 2 cores @ 2GHz) and a very modest GPU (Nvidia GeForce 8400 M G, 8 cores @ 800 MHz). Also, the bigger the matrix, the more the gain in speed.

Matrix size = 512; GPU total time = 2.1431; CPU total time = 1.5822
Matrix size = 1024; GPU total time = 7.6989; CPU total time = 7.9585

Source program:

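% Setup of n, nRepeats, A and B (including any GPU conversion) is not shown here.
for repeats = 1:nRepeats
    tic;                        % start timer for the GPU run
    for i = 1:50
        gforce(A .* B);         % element-wise product, forced to evaluate
    end
    gpu_time(repeats) = toc;
    tic;                        % restart timer for the CPU run
    for i = 1:50
        A .* B;                 % same element-wise product, plain MATLAB
    end
    cpu_time(repeats) = toc;
end
disp(['Matrix size = ' num2str(n) '; GPU total time = ' num2str(sum(gpu_time)) '; CPU total time = ' num2str(sum(cpu_time))])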

Andrew Scott said...

In some cases this quite simple idea:

can be used to do parallel for loops without the Parallel Computing Toolbox.