Thursday, February 12, 2009
Parallel Programming: Another Look
Introduction
In my last posting, Parallel Programming: A First Look (Nov-16-2008), I introduced the subject of parallel programming in MATLAB and briefly described my experiences with the MATLAB Parallel Computing Toolbox, from The MathWorks. Since then, I have been made aware of another parallel programming product for MATLAB: the Jacket Engine for MATLAB, from AccelerEyes.
Jacket differs from the Parallel Computing Toolbox in that Jacket off-loads work to the computer's GPU (graphics processing unit), whereas the Parallel Computing Toolbox distributes work over multiple cores or processors. Each solution has its merits, and MATLAB programmers interested in accelerating computation would do well to investigate the nuances of each.
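I have not tried Jacket myself, but as I understand its documentation, it exposes the GPU through special MATLAB data types: casting an array to one of them moves the data to the graphics card, and subsequent operations on it execute there. A rough sketch, assuming Jacket's gsingle type behaves as described:

A = rand(2000);            % ordinary MATLAB matrix, in main memory
gA = gsingle(A);           % copy it to the GPU as single precision
gB = gA * gA + 2 * gA;     % this arithmetic executes on the GPU
B = double(gB);            % copy the result back to main memory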
Some History
Having observed the computer hardware industry for several decades now, I have witnessed the arrival and departure of any number of special-purpose add-in cards used to speed up math for applications such as neural networks. For my part, I have resisted the urge to employ such hardware assistance, for several reasons:
First, special hardware nearly always requires special software. Accommodating the new hardware environment with custom software means an added learning curve for the programmer and drastically reduced code portability.
Second, there was the cost of the hardware itself, which was often considerable.
Third, there was the fundamental fact that general-purpose computing hardware was inexorably propelled forward by a very large market demand. Within 2 or 3 years, even the coolest turbo board would be outclassed by new PCs, which suffered from neither of the two issues mentioned above.
Two significant items have emerged in today's computer hardware environment: multi-core processors and high-powered graphics processors. Even low-end PCs today sport central processors featuring at least two cores, which you may think of, more or less, as "2 (or more) computers on a single chip". As chip complexity has continued to grow, chip makers like Intel and AMD have fit multiple "cores" on single chips. It is tempting to think that this would yield a direct benefit to the user, but the reality is more subtle. Most software was written to run on single-core computers and is not equipped to take advantage of the extra computing power of today's multi-core machines. This is where the Parallel Computing Toolbox steps in, by providing programmers a way to distribute the execution of their programs over several cores or processors, resulting in substantially improved performance.
Similarly, the graphics subsystem in desktop PCs has evolved to a very sophisticated state. At the dawn of the IBM PC (around 1980), graphics display cards, with few exceptions, did little more than convert the contents of a section of memory into a display signal usable by a computer monitor.
Over time, though, greater processing functionality was added to graphics cards, culminating in compute engines which rival supercomputer-class machines of only a few years ago. This evolution has been fueled by the inclusion of many processing units (today, some cards contain hundreds of them). Originally designed to perform specific graphics functions, many of these units are now small, somewhat general-purpose computers, and they can be programmed to do things having nothing to do with the image shown on the computer's monitor. Tapping into this power requires some sort of programming interface, though, which is where Jacket comes in.
Caveats
Here is a simple assessment of the pros and cons of these two methods of achieving parallel computing on the desktop:
Multi-Core:
Good:
The required hardware is cheap. If you program in MATLAB, you probably have at least 2 cores at your disposal already, if not more.
Bad:
Most systems top out at 4 cores, limiting the potential speed-up with this method (although doubling or quadrupling performance isn't bad).
GPU:
Good:
The number of processing units which can be harnessed by this method is quite large. Some of the fancier graphics cards have over 200 such units.
Bad:
The required hardware may be a bit pricey, although the price/performance is probably still very attractive.
Most GPUs will only perform single-precision floating-point math. Newer GPUs, though, will perform double-precision floating-point math.
Moving data from the main computer to the graphics card and back takes time, eating into the potential gain.
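That last point deserves emphasis. Below is a hedged sketch of how one might check whether a computation is worth shipping to the card, again assuming Jacket's gsingle type works as I have described; note that the two copies bracket whatever speed-up the GPU delivers:

A = rand(4000);

tic
gA = gsingle(A);           % host-to-GPU copy
gB = gA .^ 2;              % the actual work, done on the GPU
B = double(gB);            % GPU-to-host copy
gpu_time = toc

tic
C = A .^ 2;                % the same work, done entirely on the CPU
cpu_time = toc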
Conclusion
My use of the Parallel Computing Toolbox has been limited to certain very specific tasks, and I have not used Jacket at all. The use of ubiquitous multi-core computers and widely available GPUs avoids most of the problems I described regarding special-purpose hardware. It will be very interesting to see how these technologies fit into the technological landscape over the next few years, and I am eager to learn of readers' experiences with them.
Sunday, November 16, 2008
Parallel Programming: A First Look
Introduction
Recently, I have been experimenting with the MATLAB Parallel Computing Toolbox, which permits MATLAB programmers to spread work over multiple cores, processors or computers. My primary interest is in leveraging my quad-core desktop PC to accelerate the compute-intensive programs I use for data mining.
The Parallel Computing Toolbox is a MATLAB add-on package from The MathWorks which provides a number of parallel programming mechanisms. The one I have spent the most time with is parallel looping, which is accomplished via the parfor command. The basic idea is to have separate iterations of a for-loop execute on separate cores or processors.
The required change to conventional code is tiny. For example, this conventional loop:
>> for i = 1:10, disp(int2str(i)), end
1
2
3
4
5
6
7
8
9
10
...becomes this parallel loop:
>> matlabpool open 4, parfor i = 1:10, disp(int2str(i)), end, matlabpool close
Starting matlabpool using the parallel configuration 'local'.
Waiting for parallel job to start...
Connected to a matlabpool session with 4 labs.
Sending a stop signal to all the labs...
Waiting for parallel job to finish...
4
3
2
1
6
5
9
8
10
7
Performing parallel job cleanup...
Done.
Notice three important differences:
First, the command "for" becomes "parfor". Easy, right?
Second, there are matlabpool commands before and after the loop. These start up and shut down the parallel programming capability, respectively. They do not need to bracket every parfor-loop: you can start the matlabpool at the beginning of a program, use any number of parfor-loops, and shut down the matlabpool at the end of the program, as in the sketch below.
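A minimal sketch of that pattern, where process_case is a hypothetical stand-in for whatever work each iteration performs:

matlabpool open 4                    % start the pool once

resultsA = zeros(1,100);
parfor i = 1:100
    resultsA(i) = process_case(i);   % hypothetical per-iteration work
end

resultsB = zeros(1,100);
parfor j = 1:100
    resultsB(j) = 2 * process_case(j);
end

matlabpool close                     % shut the pool down once, at the end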
Third, notice that the loop iterations did not execute in order. In many situations this will not matter; in some, it will. This is one of the quirks of programming for a parallel processor, and being aware of it is the programmer's responsibility (see the example below). Welcome to the future of computing!
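To make that concrete: when each iteration writes only to its own slot of an output array, the final result is identical regardless of execution order; only side effects, such as printing, come out shuffled.

y = zeros(1,10);
parfor i = 1:10
    y(i) = i^2;          % each iteration owns slot i, so y comes out
end                      % the same no matter which lab runs first
disp(y)                  % always 1 4 9 16 25 36 49 64 81 100

parfor i = 1:10
    disp(int2str(i))     % but the printed order varies run to run
end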
Experiences
My experiences programming with the Parallel Computing Toolbox have been mixed. The good news is that, using just the parallel looping functionality, I have seen code run as much as 3 times as fast on my quad-core computer. My tests have involved large numbers of regressions or clusterings (k-means): tasks typical of a data mining project, especially where parameter sweeps or bootstrapping are involved (a sketch of such a sweep appears below). The bad news is that I have not always seen such dramatic improvement, and in fact I sometimes see minor slow-downs.
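As an illustration of the kind of work I mean, here is a hedged sketch of a parameter sweep over the number of clusters, k (kmeans is from the Statistics Toolbox; the data set and the range of k here are made up):

X = randn(5000, 8);                    % stand-in data set
ks = 2:9;                              % candidate cluster counts
totwithin = zeros(size(ks));

matlabpool open 4
parfor n = 1:numel(ks)
    [idx, C, sumd] = kmeans(X, ks(n)); % one complete clustering per iteration
    totwithin(n) = sum(sumd);          % total within-cluster distance
end
matlabpool close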
As far as I can tell, there is a limit to the amount of data I can juggle at any one time: going beyond it (remember that each core will need space for its own share of the problem) exceeds my system's available RAM, slowing parallel processing as the cores fight for memory. For reference, my current system is as follows:
Manufacturer: Velocity Micro
Model: Vector Z35
CPU: Intel Q6600, 2.4GHz (4 cores)
RAM: 4GB
OS: Windows XP (32-bit)
At present, Windows only sees about 3.24GB of that physical RAM. My strong suspicion is that moving to a 64-bit environment (there are 64-bit versions of both Windows XP and Windows Vista, as well as Linux) would permit access to more physical RAM and allow acceleration of parallel code which deals with larger data. In the meantime, though, at least some of my code is running 3 times as fast as it was, which would otherwise require the equivalent of a single-core processor running at about 7.2GHz!
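A quick back-of-envelope calculation shows how fast that RAM disappears once several workers each hold a piece of the problem (the matrix size here is made up for illustration):

n = 20e6;                            % a 20-million-element double matrix
mb_per_copy = n * 8 / 2^20           % roughly 153 MB per copy
mb_four_workers = 4 * mb_per_copy    % roughly 610 MB if all 4 workers
                                     % hold a copy at once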
See also: Parallel Programming: Another Look (Feb-12-2009)