Data Mining in MATLAB: A Quick Introduction to Monte-Carlo and Quasi-Monte Carlo Integration

Sunday, March 23, 2008

A Quick Introduction to Monte-Carlo and Quasi-Monte Carlo Integration

In a surprising range of circumstances, it is necessary to calculate the area or volume of a region. When the region is a simple shape, such as a rectangle or triangle, and its exact dimensions are known, this is easily accomplished through standard geometric formulas. Often in practice, however, the region's shape is irregular and perhaps of very high dimension. Some regions used in financial net present value calculations, for instance, may lie in spaces defined by hundreds of dimensions!

One method for estimating the areas of such regions is Monte Carlo integration. This is a conceptually simple, but very effective solution for some difficult problems. The basic idea is to sample over a region of known area, checking whether sampled points are within the region of interest or not. The proportion of sampled points found to lie within the region of interest multiplied by the (already known) area of the sampled region is an approximation of the area occupied by the region of interest. A related technique, Quasi-Monte Carlo integration, utilizes quasirandom ("low discrepancy") numbers in place of random ones.

An Example

To illustrate Monte Carlo integration, consider a problem with a known analytical solution: calculating the area of a circle. Suppose our circle lies within the unit square, ranging from 0.0 to 1.0 on both dimensions. This circle's center is at (0.5,0.5) and has a radius of 0.5. By the well-known formula, area of a circle = pi * radius squared, this circle has an area of:

>> pi * (0.5 ^ 2)

ans =

0.7854

Using MATLAB to approximate this measurement using Monte Carlo, we have:

>> n = 1e5; % Set number of samples to draw
>> rand('twister',9949902); % Initialize pseudorandom number generator
>> A = rand(n,2); % Draw indicated number of 2-dimensional coordinates in unit square
>> AA = sqrt(sum((A' - 0.5) .^ 2))' <= 0.5; % Determine whether samples are within circular region of interest
>> mean(AA) % Estimate area of the circle via the Monte Carlo method

ans =

0.7874

So, after one hundred thousand samples we are off by a small amount. Note that in this case, the area of the total sampled region is 1.0, so there's no need to divide. Let's see how well the Quasi-Monte Carlo technique performs:

>> n = 1e5; % Set number of samples to draw
>> HSet = haltonset(2); % Set up the Halton quasirandom process for 2-dimensional samples
>> B = net(HSet,n); % Draw indicated number of samples using quasirandom numbers
>> BB = sqrt(sum((B' - 0.5) .^ 2))' <= 0.5; % Determine whether samples are within circular region of interest
>> mean(BB) % Estimate area of the circle via the quasi-Monte Carlo method

ans =

0.7853

In this instance (and this will be true for many problems), quasirandom numbers have converged faster.

See also the Mar-19-2008 posting, Quasi-Random Numbers. As mentioned in that post, coordinates of a regular grid could be substituted for the random or quasirandom numbers, but this requires knowing in advance how many samples are to be drawn, and does not allow arbitrary amounts of further sampling to be used to improve the approximation.

Naturally, one would never bother with all of this to calculate the area of a circle, given the availability of convenient formula, but Monte Carlo can be used for regions of arbitrary, possibly highly irregular- even disjoint- regions. As with the bootstrap, simulated annealing and genetic algorithms, this method is made possible by the fact that modern computers provide tremendous amounts of rapid, accurate computation at extremely low cost.

Further Reading

For a good overview of the Monte Carlo method, see:

Monte Carlo Method

10 comments:

Delip Rao said...: A very good illustration to start lectures on MC. However, as with any Monte Carlo method, you might want to run the experiment several times and report mean results. For instance, I wrote a similar code in R using the Mersenne-Twister generator and got the answer 0.78537 on first attempt! It will be a good idea to just find the variance of the estimate or provide a chernoff bound.; 11:14 PM
Will Dwinnell said...: Yes, you make a good point. Though the case I demonstrated was typical, there is no guarantee that things will always work out this way.; 3:30 AM
Anonymous said...: Thanks I like this blog (I like examples)!! To extend your example, if I wanted to find out the intersection of 4 circles inside the square we basically see if it is inside all 4 circles independently?
Next we could replace the 4 circles with 4 crazy sets of equations.

Even when I don't know the actual equation, take for instance finding the area under an ROC curve I should be able to use this method!!

what are some of the good pseudo random generators and are there any guidelines for choosing one over the other.; 8:30 PM
Will Dwinnell said...: Thanks for your comments, Swami.

You are correct that Monte Carlo integration could be used to estimate the area of quite complex regions- even ones which are only known by calls to complex functions. In fact, this is one of the most important applications of such methods in practice, for example for financial calculations.

I would suggest, though, that calculation of the AUC ("area under the [ROC] curve") is probably more efficiently handled by code written specifcally for that purpose, such as my SampleError function, which I described in my Jun-20-2007 post, ROC Curves and AUC and in my Jul-14-2007 post, Calculating AUC Using SampleError().

There are a variety of pseudorandom number generators, but I haven't bother constructing my own, since the ones provided in the base MATLAB product (rand and randn) are of such high quality. I go over a few details of their use in my Dec-07-2006 post, Quick Tip Regarding rand and randn, and in my Jan-13-2007 post, Revisiting rand (MATLAB 2007a).; 4:20 AM
Anonymous said...: Thanks Will.. You have a lot of useful stuff here.

I'm moving from SAS to Matlab so your data management articles are very useful too, still I find that it is difficult to import data (large datasets) into Matlab.

Is there a convenient way to convert a small dataset into an equation (for instance the data associated with the ROC). I know this is a regression problem (i.e. i can simply fit some equation) but I'm wondering if there is a different if not better way to do it.; 2:23 PM
Will Dwinnell said...: As you say, this is a regression (or "curve-fitting") problem, for which MATLAB provides a variety of solutions. Which is best will depend on your particular problem.

The backslash operator, polyfit, interp1 and pchip in base MATLAB are a good place to start, and can be fairly flexible with a little imagination (such through the use of nonlinear transformations). The Statistics Toolbox and Curve Fitting Toolbox (among others) also provide yet more tools, and of course there is a wealth of code available on-line for free. See my Nov-14-2006 posting, Finding MATLAB Source Code And Tools for ideas about where and how to search.

One other possibility is my own GASplineFit routine, which I described in my post GASplineFit: A Flexible Curve Fitting Routine (Jan-19-2007).; 7:12 PM
Anonymous said...: Will, thanks again... you've been a great help...; 10:37 PM
Anonymous said...: Hi, I like this blog very much. Can you repost your sampleerror function on the blog, since it seems it can not be downloaded. Thanks.; 11:57 AM
Will Dwinnell said...: ying ding,

I do apologize for that. I believe the file links will be restored this weekend. In the meantime, please feel free to contact me (see View my complete profile on the main Web log page) with your e-mail address and I will e-mail the code to you.; 6:29 PM
commissioner said...: I think adding the MATLAB link for monte carlo simulation would be appropriate here as well. There are many techniques one can learn in conjunction with this post.; 12:38 PM