Rejection-sampling the Zipf Distribution

Jason Crease
4 min read · Jul 8, 2017

In my last post, I discussed the Zipf distribution, its pmf, cdf and inverse cdf, and approximations of these for use in software, particularly with a view to sampling from the Zipf distribution. Today I’ll describe a rejection-sampling method for generating Zipf-distributed samples. For the Zipf distribution, it has a few key advantages over the inverse transformation method:

  1. It’s precise. No approximations are used.
  2. It runs faster.
  3. It’s conceptually simpler.

Rejection sampling

I won’t describe rejection-sampling in depth. The essential idea is to draw samples from a distribution that is simple to generate and whose density, multiplied by some constant, is always ‘above’ the required density. Wikipedia gives a good analogy, describing how to generate points bounded by a curve, given only the ability to generate points within a larger rectangle:

‘…imagine graphing the density function of a random variable onto a large rectangular board and throwing darts at it. Assume that the darts are uniformly distributed around the board. Now remove all of the darts that are outside the area under the curve. The remaining darts will be distributed uniformly within the area under the curve…’

Here’s a diagram of the method. We use an inverse-cdf method to sample from the bounding distribution b(x), then use a second uniform sample u to generate a y-coordinate u*a*b(x) (the scaling constant a is explained below). Samples whose y-coordinate doesn’t fall under the Zipf distribution’s pmf are rejected. If a sample is accepted, its x-coordinate, rounded up to the nearest integer, is taken as the Zipfian sample.

There is a problem here — since b(x) is everywhere above z(x), and z(x) integrates to 1 (being a probability distribution), how can b(x) also be a distribution? We essentially use two b’s: a properly-normalized b(x) to generate x-coordinates, and b’(x) = a*b(x) (for some constant a) to get y-coordinates.
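As a minimal sketch of that loop in Java (the parameter names here are placeholders of my own, not from any particular library): sampleB draws an x from the bounding distribution via its inverse cdf, aTimesB evaluates the scaled bound a*b(x), and z evaluates the target density.

import java.util.Random;
import java.util.function.DoubleUnaryOperator;

final class RejectionSampling {
    private static final Random RND = new Random();

    // Generic accept/reject loop. sampleB maps a uniform (0,1) value to an x drawn
    // from the bounding distribution b; aTimesB evaluates a*b(x); z evaluates the
    // target density (which need not be normalised).
    static double sample(DoubleUnaryOperator sampleB, DoubleUnaryOperator aTimesB, DoubleUnaryOperator z) {
        while (true) {
            double x = sampleB.applyAsDouble(RND.nextDouble());     // x-coordinate, drawn from b
            double y = RND.nextDouble() * aTimesB.applyAsDouble(x); // y-coordinate, uniform on (0, a*b(x))
            if (y <= z.applyAsDouble(x)) {
                return x;                                           // the dart landed under z: accept
            }
        }
    }
}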

I should point out that this isn’t the only rejection-sampling method. Other formulations, for instance using the quotient of two uniform samples, can be useful.

Selecting functions

We use this continuous analogue of the Zipf distribution (the braces mean ‘ceiling’ or ‘round up to nearest whole number’). Remember, s is the skew, and N is the cardinality:

z(x) = 1 / ⌈x⌉^s,   for 0 < x ≤ N

This does not integrate to 1 over (0, N), but it doesn’t matter for our method. We only care about a*b(x) vs z(x) (see below), and normalising things would only alter a.

Avoiding normalisation is common in Monte-Carlo methods. Indeed, here the normalisation constant is a generalized harmonic number, which has no simple closed form and is expensive to compute directly.
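To see why that matters, here is what computing the constant directly looks like (the method name is mine, for illustration): an O(N) sum that rejection sampling lets us skip entirely.

// Generalized harmonic number H(N, s) = 1 + 1/2^s + ... + 1/N^s,
// the Zipf normalisation constant. Note the O(N) loop.
static double generalizedHarmonic(int n, double s) {
    double sum = 0.0;
    for (int k = 1; k <= n; k++) {
        sum += Math.pow(k, -s);
    }
    return sum;
}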

b(x) needs these properties:

  1. For some constant a, which we need to find, a*b(x) ≥ z(x) for all x in (0, N).
  2. It must be easy to sample from.
  3. It should not give excessive rejections. That is, a*b(x) should preferably skirt close to z(x).

I chose b(x) as follows:

b(x) = 1 / t,        for 0 < x ≤ 1
b(x) = x^(-s) / t,   for 1 < x ≤ N

Special-casing x ≤ 1 avoids asymptotic behavior as x → 0, and is helpfully ‘tight’ to z(x). See the graph above for the functions z and b. By inspection, setting t = a gives a*b(x) ≥ z(x) over (0, N), achieving equality whenever x is an integer.

But we need to be able to sample from this b. For the purposes of sampling we’ll calculate the t that makes b(x) a probability distribution (ie it integrates to 1 over (0, N)). For s ≠ 1 this gives:

t = 1 + (N^(1-s) - 1) / (1 - s) = (N^(1-s) - s) / (1 - s)

Integrating b gives the cdf of b:

B(x) = x / t,                               for 0 < x ≤ 1
B(x) = (1 + (x^(1-s) - 1) / (1 - s)) / t,   for 1 < x ≤ N

Inverting this so we can generate a sample from b given a uniformly-random sample p in (0, 1):

invB(p) = p*t,                          if p*t ≤ 1
invB(p) = (p*t*(1-s) + s)^(1/(1-s)),    otherwise
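Here is a sketch of t and invB in Java (my own method names, assuming s ≠ 1). With N = 10 and s = 0.8 it gives t ≈ 3.92 and invB(0.387) ≈ 1.64, matching the worked example below.

// Normalising constant of b over (0, N), assuming s != 1:
// t = 1 + (N^(1-s) - 1)/(1-s) = (N^(1-s) - s)/(1-s).
static double computeT(int n, double s) {
    return (Math.pow(n, 1.0 - s) - s) / (1.0 - s);
}

// Inverse cdf of b: maps a uniform p in (0, 1) to an x in (0, N).
static double invB(double p, double s, double t) {
    if (p * t <= 1.0) {
        return p * t;                                        // the flat piece, 0 < x <= 1
    }
    return Math.pow(p * t * (1.0 - s) + s, 1.0 / (1.0 - s)); // the x^-s piece, 1 < x <= N
}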

Example

Here’s an example of running the algorithm with N=10 and s=0.8. These values give t = 3.92.

  1. Generate a uniform random sample u1 = 0.387. invB(0.387) = 1.64. This is our x-coordinate.
  2. Generate another uniform sample, u2 = 0.924.
  3. b(1.64) * u2 * t = 0.623 > z(1.64) = 0.574, therefore 1.64 is rejected. We try again…
  4. Generate u3 = 0.764. invB(0.764) = 5.37.
  5. Generate u4 = 0.583.
  6. b(5.37) * u4 * t = 0.152 < z(5.37) = 0.238, therefore 5.37 is accepted. The Zipfian sample is ⌈5.37⌉ = 6.

Note: invB is the inverse cdf of b (Medium.com doesn’t have superscripts!).
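For completeness, here is the comparison from steps 3 and 6 written out as code (a hypothetical helper, not the post’s own listing). Plugging in the numbers above reproduces 0.623 > 0.574 (reject) and 0.152 < 0.238 (accept).

// Accept/reject test: x is the draw from b, u the second uniform sample.
// t*b(x) is the unnormalised bounding curve; z is evaluated at ceil(x).
static boolean accept(double x, double u, double s) {
    double tTimesB = x <= 1.0 ? 1.0 : Math.pow(x, -s);
    double z = Math.pow(Math.ceil(x), -s);
    return u * tTimesB <= z;
}
// accept(1.64, 0.924, 0.8) -> false;  accept(5.37, 0.583, 0.8) -> true (Zipf sample 6)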

Java source

Here’s the source-code. It generates about 5 million samples per second on my desktop PC. There are a few bits that could easily be optimized, but I think this is the clearest way to present it:
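A sketch of the sampler assembled from the formulas above (not necessarily the original listing; it assumes s ≠ 1):

import java.util.Random;

// Rejection sampler for the Zipf distribution with cardinality N and skew s (s != 1).
public final class ZipfRejectionSampler {
    private final Random rnd = new Random();
    private final int n;     // cardinality N
    private final double s;  // skew
    private final double t;  // normalising constant of b: (N^(1-s) - s)/(1-s)

    public ZipfRejectionSampler(int n, double s) {
        this.n = n;
        this.s = s;
        this.t = (Math.pow(n, 1.0 - s) - s) / (1.0 - s);
    }

    public int sample() {
        while (true) {
            double x = invB(rnd.nextDouble());                  // x-coordinate, drawn from b
            int k = (int) Math.max(1.0, Math.ceil(x));          // candidate Zipf value ceil(x); max guards x == 0
            double tTimesB = x <= 1.0 ? 1.0 : Math.pow(x, -s);  // t*b(x), the bounding curve
            double z = Math.pow(k, -s);                         // unnormalised Zipf pmf at k
            if (rnd.nextDouble() * tTimesB <= z) {
                return k;                                       // accepted
            }
        }
    }

    // Inverse cdf of the bounding distribution b.
    private double invB(double p) {
        if (p * t <= 1.0) {
            return p * t;
        }
        return Math.pow(p * t * (1.0 - s) + s, 1.0 / (1.0 - s));
    }
}

For the worked example above, new ZipfRejectionSampler(10, 0.8).sample() returns an integer in 1..10 with probability proportional to 1/k^0.8.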
