How to generate Gaussian samples
Part 4: Marsaglia polar method
- To see the code I wrote for this project, you can check out its Github repo
- For other parts of the project: part 1, part 2, part 3, part 4
Background
Generating Gaussian samples is at the core of many modern GenAI models, such as the Gaussian noise that is added and removed in diffusion models like OpenAI's DALL-E.
In part 2 of my project on the different methods to generate Gaussian samples, I covered the ingenious Box-Muller transform. This elegant algorithm sidesteps the cumbersome inverse Gaussian CDF used in traditional inverse transform sampling (covered in part 1), and works by:
- Generating two uniformly distributed samples U₁ and U₂ that are independent from each other
- Transforming the first uniform random sample U₁ into an angle θ that is uniformly distributed between 0 and 2π
- Transforming the second uniform random sample U₂ into a radius R, such that half of the squared radius is exponentially distributed
- Converting each angle and radius to x and y coordinates of each Gaussian samples, which are R cos(θ) and R sin(θ) respectively
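The four steps above can be sketched in numpy (a minimal sketch, not numpy's actual implementation; the function name and seed are my own):

```python
import numpy as np

def box_muller(n, seed=0):
    """Generate n pairs of independent standard Gaussian samples."""
    rng = np.random.default_rng(seed)
    u1 = rng.uniform(size=n)        # U1 ~ Unif(0, 1)
    u2 = rng.uniform(size=n)        # U2 ~ Unif(0, 1)
    theta = 2 * np.pi * u1          # angle theta ~ Unif(0, 2*pi)
    r = np.sqrt(-2 * np.log(u2))    # radius R: R^2/2 ~ Exp(1)
    # x & y coordinates of the 2D Gaussian samples
    return r * np.cos(theta), r * np.sin(theta)

x, y = box_muller(100_000)
```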
The animation below shows how the uniform samples U₁ (brown) are transformed into uniform angles θ (red), and how the uniform samples U₂ (green) are transformed into radii R (blue) whose halved squares R²/2 are exponentially distributed. The angles and radii are then used to draw the 2D Gaussian samples on the right of the animation.
As noted at the end of the previous part, this method is used by an (older) random number generator in numpy. However, it is implemented there as a modified version of the Box-Muller transform, called the Marsaglia polar method. This blog post will therefore explain how that method works and compare it with the original Box-Muller transform.
How does the Marsaglia polar method work
A problem with the Box-Muller transform
The main problem that the polar method addresses is the calculation of the trigonometric functions cos(θ) and sin(θ) in the Box-Muller transform, where θ is uniformly distributed between 0 and 2π. For instance, most programming libraries approximate cos & sin using a power series, such as the following Taylor series:
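These are the familiar Maclaurin expansions of cos & sin around 0:

```latex
\cos\theta = 1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \frac{\theta^6}{6!} + \cdots
\qquad
\sin\theta = \theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \frac{\theta^7}{7!} + \cdots
```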
Adding up the terms of the series one by one can be cumbersome. In fact, this was first noticed not by Marsaglia but by the famous mathematician and computer scientist John von Neumann, the father of many random sampling algorithms, some of which were used in the development of the atomic bomb:
von Neumann’s stroke of genius
To overcome this “silliness”, von Neumann exploited the basic trigonometry of the cos & sin functions, which are the ratios of the respective adjacent side (U) and opposite side (V) to the hypotenuse of a right triangle.
This leads us to the key principle underlying the Marsaglia (or should we say von Neumann) polar method:
Instead of sampling θ and calculating its cos & sin, we sample the adjacent side U & opposite side V such that their corresponding angle θ is uniformly distributed between 0 and 2π.
The cos & sin of θ can then be directly calculated using the ratio of these sides to the hypotenuse — √(U² + V²) — rather than using indirect methods like the power series.
How to sample U & V
To accomplish this, von Neumann suggests sampling U & V uniformly as points within the unit circle, where U & V are the x & y-coordinates of each point. This makes intuitive sense, as:
By sampling points uniformly within the unit circle, their corresponding angles will spread evenly from 0 to 2π, the range of angles within the circle.
The animation below confirms this intuition: the corresponding angles θ from points U & V sampled within the unit circle closely follow the uniform distribution.
To prove why this is necessarily true, we observe that for a given angle ϕ, the probability that a sampled U & V have a corresponding angle θ ≤ ϕ is ϕ/2π.
- This is because the probability of U & V having an angle θ ≤ ϕ is also the volume of the wedge bounded by ϕ under the joint PDF of U & V. Since U & V are uniformly distributed within the unit circle, this PDF is supported on the circle's area of π, and hovers at a uniform density of 1/π (so that the total probability over all relevant U & V values is π · 1/π = 1).
- As a result, the volume of the wedge bounded by ϕ will be its area of (ϕ/2π) · π, multiplied by its height of 1/π, which returns ϕ/2π — the proportion of the wedge angle ϕ relative to the full circle.
- This also implies the CDF of θ i.e. the probability of θ ≤ ϕ is ϕ/2π (for ϕ between 0 and 2π). In turn, the PDF of θ is simply 1/2π (for θ between 0 and 2π), as PDF is the derivative of CDF. This is indeed the PDF of a uniform random variable between 0 and 2π, the desired distribution of θ that we want to sample!
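In symbols, the argument in the bullets above reads:

```latex
F_\Theta(\phi) = P(\Theta \le \phi)
= \underbrace{\frac{\phi}{2\pi}\,\pi}_{\text{wedge area}} \times \underbrace{\frac{1}{\pi}}_{\text{density}}
= \frac{\phi}{2\pi},
\qquad
f_\Theta(\phi) = \frac{d}{d\phi} F_\Theta(\phi) = \frac{1}{2\pi},
\quad 0 \le \phi \le 2\pi
```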
Sampling from a unit circle: a sneak peek at rejection sampling
To sample points within the unit circle using Unif(0, 1) random samples (the only building block we are allowed to use), we can:
1. Generate two uniformly distributed samples U₁ and U₂ that are independent from each other
2. Calculate U = 2U₁ - 1, and V = 2U₂ - 1. This transforms the independent Unif(0, 1) random variables U₁ & U₂ into the independent Unif(-1, 1) random variables U & V. This also implies that U & V are uniformly sampled within the bounding square, whose sides extend from -1 to 1 in both horizontal & vertical directions.
3. From the sampled points U & V, only accept points whose squared radius T = U² + V² ≤ 1, and reject those with T > 1. The accepted U & V are then uniformly sampled from the unit circle: their radius is at most 1 precisely when their squared radius is at most 1.
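The three steps above can be sketched as follows (the function name and the amount of oversampling slack are my own choices):

```python
import numpy as np

def sample_unit_circle(n, seed=0):
    """Uniformly sample n points in the unit circle via rejection sampling."""
    rng = np.random.default_rng(seed)
    # Oversample by ~4/pi to compensate for rejected points, plus some slack
    m = int(n * 4 / np.pi * 1.1) + 100
    u = 2 * rng.uniform(size=m) - 1   # U ~ Unif(-1, 1): bounding square
    v = 2 * rng.uniform(size=m) - 1   # V ~ Unif(-1, 1)
    t = u**2 + v**2                   # squared radius T
    accept = t <= 1                   # keep only points inside the unit circle
    return u[accept][:n], v[accept][:n]

u, v = sample_unit_circle(10_000)
```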
Observing the accepted & rejected samples from this method, we see that:
- While the sampled U & V from the bounding square (step 2) do not produce uniform angles, rejecting points outside of the unit circle (step 3) discards the points near the corners of the square that contribute to the uneven distribution of sampled θ (shown in black).
- As a result, the accepted points (in red) are indeed uniformly sampled within the unit circle. In turn, their angles follow a uniform distribution between 0 and 2π as proven earlier.
- In other words, this method samples from a simpler distribution — the bounding square — and rejects some of these samples such that the accepted samples ultimately follow a more complicated distribution — the unit circle. This is the core principle of rejection sampling, which also underlies the newer Ziggurat algorithm that numpy uses to generate Gaussian samples (covered in the next part of the project).
In summary, the von Neumann method works by directly calculating cos(θ) and sin(θ) from sampling adjacent sides (U) & opposite sides (V) of a right triangle, then taking their ratio with the hypotenuse. To ensure θ follows the desired uniform distribution from 0 to 2π, U & V are sampled within the bounding square, and only points within the unit circle are accepted.
Comparing the von Neumann method to the Box-Muller transform
Now that we understand how the von Neumann method works, let's integrate it into the larger Box-Muller transform to generate Gaussian samples and compare it with the original algorithm:
- By calculating cos(θ) and sin(θ) directly rather than approximating them indirectly, say via a power series, the von Neumann method actually ends up with more, albeit simpler, steps.
- Furthermore, since some of the sampled points are rejected to obtain the uniform samples U & V within the unit circle, we need to compensate for the sample loss by sampling more U₁ & U₂ than needed. Specifically, by a factor of 4/π — or 27% more — which is the ratio of the bounding square area to the unit circle area.
- Lastly, since both U₁ & U₂ are used up to calculate cos(θ) & sin(θ), we need to generate an extra Unif(0, 1) sample U₃ to compute the radius R of each Gaussian sample.
To see how they fare in practice, we’ll use them to sample 1 million 2D Gaussian samples. The implementation of the von Neumann method in Python is extremely straightforward:
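A minimal sketch of one such implementation (all names and the oversampling slack are my own choices, not the code from the repo):

```python
import numpy as np

def von_neumann_gaussian(n, seed=0):
    """2D Gaussian samples: rejection-sampled cos & sin, plus an extra uniform for R."""
    rng = np.random.default_rng(seed)
    # Oversample by ~4/pi to offset points rejected outside the unit circle
    m = int(n * 4 / np.pi * 1.1) + 100
    u = 2 * rng.uniform(size=m) - 1
    v = 2 * rng.uniform(size=m) - 1
    t = u**2 + v**2
    keep = (0 < t) & (t <= 1)              # accept points inside the unit circle
    u, v, t = u[keep][:n], v[keep][:n], t[keep][:n]
    cos, sin = u / np.sqrt(t), v / np.sqrt(t)  # direct cos & sin, no power series
    u3 = rng.uniform(size=n)               # extra uniform, just for the radius
    r = np.sqrt(-2 * np.log(u3))
    return r * cos, r * sin

x, y = von_neumann_gaussian(100_000)
```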
When comparing the time taken by both algorithms to generate these 1 million samples (averaged over 1000 trials for reliability), the von Neumann method is indeed faster than the Box-Muller transform: 71 ms vs. 80 ms.
A closer examination of each step reveals:
- While the Box-Muller transform took 39 ms to calculate cos & sin, the von Neumann method did so in only 21 ms. More than half of this time (13 ms) was used to reject samples that fall outside of the unit circle.
- In contrast, the von Neumann method took 18 ms to generate U₁ & U₂ compared to 14 ms for the Box-Muller transform, consistent with the 27% more samples required to compensate for the rejection sampling.
- Lastly, the von Neumann method spent an extra 7 ms to sample 1 million extra uniform U₃’s, which it used to compute the radii R of the Gaussian samples.
In short, while the von Neumann method indeed speeds up the computation of cos & sin, it does so at the expense of sampling more data. As a result, the second part of this blog post will focus on the main contribution of George Marsaglia to the algorithm (remember him?): avoiding these extra uniform samples.
The Marsaglia trick
Marsaglia’s proposal to bypass sampling the extra uniform U₃ in von Neumann’s method is remarkably simple:
To calculate the radius R of the Gaussian samples, just use the squared radius T from the previous samples of the unit circle — T = U² + V² — instead.
Marsaglia claims that this squared radius T is not only uniformly distributed between 0 and 1, but it is also independent from the angle θ. This ensures that the radii R of the Gaussian samples are also independent from their angles, a must when generating independent 2D Gaussian samples.
Perhaps to underline the simplicity of this trick, he did not even deign to provide a proof for it!
Hence, we’ll now prove that the squared radius of points within the unit circle is uniformly distributed, and independent from their angles.
Prove that T is uniform
Proving that Marsaglia’s T is Unif(0, 1) turns out to be similar to proving that von Neumann’s θ is Unif(0, 2π):
- The probability of U & V having a squared radius T ≤ t is the volume of the cylinder of radius √t under the joint PDF of U & V.
- As a result, the volume of this cylinder will be its area of π(√t)², multiplied by its height of 1/π, which returns t.
- This also implies the CDF of T i.e. the probability of T ≤ t is t (for t between 0 and 1 within the unit circle), which corresponds to a simple PDF of 1 (for T between 0 and 1). This is indeed the PDF of a Unif(0, 1) random variable, the desired distribution of T that we want to sample.
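Compactly, mirroring the earlier proof for θ:

```latex
F_T(t) = P(T \le t)
= \underbrace{\pi\left(\sqrt{t}\right)^2}_{\text{cylinder base area}} \times \underbrace{\frac{1}{\pi}}_{\text{density}}
= t,
\qquad
f_T(t) = \frac{d}{dt} F_T(t) = 1,
\quad 0 \le t \le 1
```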
Prove that T is independent from θ
To prove that T is independent from θ, we’ll start by deriving the joint distribution of T & θ from the joint distribution of U & V, given that the former is calculated from the latter. This is done by multiplying the joint distribution of U & V by the Jacobian of the transformation from T & θ to U & V (more details here):
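Writing the inverse transformation as U = √T cos θ and V = √T sin θ, one way the derivation can go:

```latex
\left|\frac{\partial(U, V)}{\partial(T, \theta)}\right|
= \begin{vmatrix}
\dfrac{\cos\theta}{2\sqrt{T}} & -\sqrt{T}\sin\theta \\[1.5ex]
\dfrac{\sin\theta}{2\sqrt{T}} & \sqrt{T}\cos\theta
\end{vmatrix}
= \frac{\cos^2\theta}{2} + \frac{\sin^2\theta}{2} = \frac{1}{2}
```

```latex
f_{T,\Theta}(t, \theta)
= f_{U,V}(u, v) \cdot \left|\frac{\partial(U, V)}{\partial(T, \theta)}\right|
= \frac{1}{\pi} \cdot \frac{1}{2} = \frac{1}{2\pi}
```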
From the above derivations, the joint PDF of T & θ is 1/2π, which is simply the product of the PDF of T (1) with the PDF of θ (1/2π). This implies that T & θ are indeed independent, and makes intuitive sense: knowing the (squared) radius of a random point in a unit circle does not provide any additional information about its angle, and vice versa.
The below animation shows how T & θ are sampled by deriving them from U & V in the unit circle: they are uniformly distributed not only individually (from their histograms), but also jointly (from the evenly spread samples of T & θ in the 2D space) — a result of their independence.
Comparing the Marsaglia method with previous methods
The code for the Marsaglia method is an extremely minor modification to the von Neumann method: T is used in place of U₃, which is no longer needed.
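A sketch of the modified sampler (same rejection-sampling setup as before; all names are my own):

```python
import numpy as np

def marsaglia_gaussian(n, seed=0):
    """2D Gaussian samples, reusing the squared radius T in place of U3."""
    rng = np.random.default_rng(seed)
    # Oversample by ~4/pi to offset points rejected outside the unit circle
    m = int(n * 4 / np.pi * 1.1) + 100
    u = 2 * rng.uniform(size=m) - 1
    v = 2 * rng.uniform(size=m) - 1
    t = u**2 + v**2
    keep = (0 < t) & (t <= 1)
    u, v, t = u[keep][:n], v[keep][:n], t[keep][:n]
    r = np.sqrt(-2 * np.log(t))     # T ~ Unif(0, 1) replaces the extra U3
    return r * u / np.sqrt(t), r * v / np.sqrt(t)

x, y = marsaglia_gaussian(100_000)
```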
We can further streamline the Marsaglia method by combining the square roots in the calculations of R & θ into a single square root operation. This is indeed the formula often cited for the Marsaglia polar method:
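For an accepted point (U, V) with T = U² + V² ≤ 1, the resulting pair of Gaussian samples is:

```latex
X = U\sqrt{\frac{-2\ln T}{T}},
\qquad
Y = V\sqrt{\frac{-2\ln T}{T}}
```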
When benchmarking the Marsaglia methods with the previous Box-Muller and von Neumann methods to again generate 1 million Gaussian samples, we see that:
- The Marsaglia method reduces the sampling time of the von Neumann method by 7 ms (71 ms to 64 ms), virtually all from no longer having to sample 1 million extra uniform U₃’s.
- Combining the square root operations in θ (step 2) & R (step 3) into a single operation in step 4 reduces the overall time by an extra 2 ms (64 ms to 62 ms).
- Ultimately, the Marsaglia method is 17 ms faster than the original Box-Muller transform, a not too shabby 22% time reduction! This is thanks to von Neumann’s faster calculation of cos(θ) & sin(θ) using rejection sampling, combined with Marsaglia’s trick to reuse the uniform samples of the unit circle’s squared radius.
Closing thoughts
Working on this project has made me appreciate even more the groundbreaking innovations from many mathematicians and computer scientists over decades to improve how our computers generate random samples: there are now 11 distinct algorithms to generate Gaussian samples!
In particular, each time we use numpy's older random number generator to generate Gaussian samples, we are in fact relying on the Marsaglia polar method, which we now know is a joint contribution of the Box-Muller transform (1958), the von Neumann method (1951), and the Marsaglia trick (1962).
In the next part of the project, I will explore the latest algorithm that numpy is using to generate Gaussian samples — the Ziggurat algorithm. In the meantime, here's a recap of the three algorithms that I covered in this blog post, along with the pretty Gaussian samples that they produced ✨