Sampling
Exact sampling
For most distributions, if we know the exact parametric form, we can sample from it exactly. Most programming languages achieve this by generating random samples from a uniform distribution and then mapping them through the inverse of the cumulative distribution function (CDF) to produce a sample from the target distribution.
Therefore, for any continuous distribution with CDF \(F\) and PDF \(f\),
\[U \sim \operatorname{Unif}(0,1) \implies X = F^{-1}(U) \sim f\]Programming languages have their own implementations to generate random numbers that have the same statistical properties as numbers sampled from a uniform distribution on \([0,1]\). We can understand the above more intuitively by imagining a CDF. A random number \(U\) sampled from \(\operatorname{Unif}(0,1)\) lands on the y-axis of the CDF curve. Mapping back through \(X = F^{-1}(U)\), more samples of \(X\) fall where the CDF is steeper, i.e., in the region where the PDF peaks. For example, the CDF of a narrow Gaussian rises more steeply than that of a broad one; assuming both are centered around the same point, the narrow one will place more samples around the center.
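To make this concrete, here is a minimal sketch of inverse transform sampling in Python (NumPy assumed; the exponential distribution and its rate parameter are chosen purely for illustration, since its inverse CDF has a closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: Exponential(rate), with CDF F(x) = 1 - exp(-rate * x),
# so the inverse CDF is F^{-1}(u) = -ln(1 - u) / rate.
rate = 2.0
u = rng.uniform(0.0, 1.0, size=100_000)  # U ~ Unif(0, 1)
x = -np.log(1.0 - u) / rate              # X = F^{-1}(U), so X ~ Exponential(rate)

# Sanity check: mean and variance should approach 1/rate and 1/rate^2.
print(x.mean(), x.var())  # ~0.5 and ~0.25
```

Where \(F^{-1}\) has no closed form, libraries fall back on numerical inversion or specialized algorithms instead.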
Rejection sampling
Often, sampling from \(f(x)\) directly is not feasible, but there may be a bounding distribution \(g(x)\) such that
\[f(x) \le c\, g(x) \; ; \; c \ge 1 \quad \forall x\]In such cases, where we can bound \(f(x)\) by \(g(x)\) up to a proportionality constant, rejection sampling is employed. We sample from \(f(x)\) by
- Sample \(X \sim g(x)\)
- Sample \(U \sim \operatorname{Unif}[0,1]\)
- Accept \(X\) as being drawn from \(f(x)\) when \(\frac{f(X)}{e(X)} \ge U\), where \(e(x) = c g(x)\) (see the sketch below)
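As a concrete illustration of the three steps above, here is a minimal sketch in Python; the Beta(2, 2) target and uniform proposal are toy choices for illustration, not prescribed by the method:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy target density: Beta(2, 2), i.e. f(x) = 6 x (1 - x) on [0, 1]
    return 6.0 * x * (1.0 - x)

# Proposal g = Unif(0, 1), so g(x) = 1 on [0, 1], and
# c = sup_x f(x) / g(x) = f(0.5) = 1.5 is the tightest valid bound.
c = 1.5

n = 100_000
x = rng.uniform(0.0, 1.0, n)         # step 1: X ~ g
u = rng.uniform(0.0, 1.0, n)         # step 2: U ~ Unif[0, 1]
accepted = x[u <= f(x) / (c * 1.0)]  # step 3: accept when U <= f(X) / e(X)

print(len(accepted) / n)  # acceptance rate, ~ 1/c = 2/3 (see below)
print(accepted.mean())    # ~ 0.5, the mean of Beta(2, 2)
```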
Rigorously, this works because:
\[\begin{aligned} P\left(X \le x \,\middle|\, \frac{f(X)}{e(X)} \ge U\right) &= P\left(X \le x \,\middle|\, U \le \frac{f(X)}{e(X)}\right) \\ &= \frac{P\left(X \le x ,\, U \le \frac{f(X)}{e(X)}\right)}{P\left(U \le \frac{f(X)}{e(X)}\right)} \\ &= \frac{P\left(U \le \frac{f(X)}{e(X)} ,\, X \le x\right)}{P\left(U \le \frac{f(X)}{e(X)} ,\, X \le \infty\right)} \end{aligned}\]In the above, \(X \le \infty\) imposes no constraint and only makes the symmetry between numerator and denominator explicit. Now we condition on \(X = y\) and integrate against \(g(y)\), because we are sampling \(X\) from \(g\) (taking the support to be \([0, \infty)\) to match the integration limits), and use \(P(U \le k) = k\) for \(k \in [0,1]\) since \(U \sim \operatorname{Unif}(0,1)\):
\[\begin{aligned} P\left(X \le x \,\middle|\, \frac{f(X)}{e(X)} \ge U\right) &= \frac{\int_0^{x} P\left(U \le \frac{f(y)}{e(y)}\right) g(y)\, dy}{\int_0^{\infty} P\left(U \le \frac{f(y)}{e(y)}\right) g(y)\, dy} \\ &= \frac{\int_0^x \frac{f(y)}{e(y)}\, g(y)\, dy}{\int_0^{\infty} \frac{f(y)}{e(y)}\, g(y)\, dy} = \frac{\int_0^x \frac{f(y)}{c\, g(y)}\, g(y)\, dy}{\int_0^{\infty} \frac{f(y)}{c\, g(y)}\, g(y)\, dy} \\ &= \frac{\frac{1}{c} \int_0^x f(y)\, dy}{\frac{1}{c} \int_0^{\infty} f(y)\, dy} \\ &= \int_0^x f(y)\, dy \\ &= P(X \le x) \end{aligned}\]We, therefore, sample EXACTLY from \(f(x)\).
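As an empirical check of this result (a sketch reusing the toy sampler above), the CDF of the accepted samples should match the target CDF at every point, not just in summary statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 6.0 * x * (1.0 - x)          # toy Beta(2, 2) target from above

def F(x):
    return 3.0 * x**2 - 2.0 * x**3      # its CDF on [0, 1]

n = 200_000
x = rng.uniform(0.0, 1.0, n)
u = rng.uniform(0.0, 1.0, n)
accepted = np.sort(x[u <= f(x) / 1.5])  # c = 1.5 as before

# Compare the empirical CDF of accepted samples with F at a few points.
for q in (0.1, 0.25, 0.5, 0.75, 0.9):
    emp = np.searchsorted(accepted, q) / len(accepted)
    print(q, emp, F(q))                 # empirical ≈ analytic
```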
What about \(c\)…
The only thing we need to choose now is \(c\). The only constraint we have is
\[f(x) \le c\, g(x) \; ; \; c \ge 1 \quad \forall x\]It is tempting to simply choose a large value so that \(f(x) \le c g(x)\) is easily satisfied. We, however, need to understand the effect of \(c\) on the acceptance probability.
The acceptance probability is a marginal probability over all samples:
\[\begin{aligned} P\left(\frac{f(X)}{e(X)} \ge U\right) &= \mathrm{E}_{X \sim g}\left[P\left(\frac{f(X)}{e(X)} \ge U \,\middle|\, X\right)\right] \\ &= \mathrm{E}_{X \sim g}\left[\frac{f(X)}{e(X)}\right] \\ &= \mathrm{E}_{X \sim g}\left[\frac{f(X)}{c\, g(X)}\right] \\ &= \int \frac{f(x)}{c\, g(x)}\, g(x)\, dx \\ &= \frac{1}{c} \end{aligned}\]The only thing to note here is that
\[P\left(\frac{f(X)}{e(X)} \ge U \,\middle|\, X = x\right) = \frac{f(x)}{e(x)}\]Once we know the value of \(X\), we can evaluate the above using the property of the uniform distribution, \(P(U \le k) = k\).
The constant \(c\), therefore, plays an important role and should be chosen as small as possible to accept the maximum number of samples.
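Continuing the toy example above, here is a quick empirical check (again just a sketch) that the acceptance rate tracks \(1/c\), so any slack in \(c\) directly wastes proposals:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return 6.0 * x * (1.0 - x)   # same toy Beta(2, 2) target as above

n = 200_000
x = rng.uniform(0.0, 1.0, n)     # proposal g = Unif(0, 1)
u = rng.uniform(0.0, 1.0, n)

for c in (1.5, 3.0, 10.0):       # 1.5 is the tightest valid bound here
    rate = np.mean(u <= f(x) / c)
    print(c, rate)               # empirical rate is close to 1/c
```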
When you know the probability distribution up to a constant
Rejection sampling is particularly useful when we know a distribution \(h(x)\) only up to an unknown constant, i.e., \(h(x) = K f(x)\), where \(f(x) = h(x)/K\) can be easily computed even though \(h(x)\) and \(K\) cannot.
Such cases are mostly observed in Bayesian inference, where \(K\) is an intractable constant (the integral of a certain distribution) and is less than 1; therefore, \(f(x) \ge h(x)\), and \(f(x)\) is an envelope over \(h(x)\).
We can then take \(e(x) = c g(x) \ge f(x)\) and follow the same steps, sampling from \(g(x)\) and accepting if \(\frac{f(x)}{c g(x)} \ge U\). The unknown constant \(1/K\) cancels out in both the numerator and the denominator of the earlier derivation, so the accepted samples follow the actual distribution \(h(x)\) exactly.
The acceptance probability, however, changes:
\[\begin{aligned} P\left(\frac{f(X)}{e(X)} \ge U\right) &= \mathrm{E}_{X \sim g}\left[P\left(\frac{f(X)}{e(X)} \ge U \,\middle|\, X\right)\right] \\ &= \mathrm{E}_{X \sim g}\left[\frac{f(X)}{e(X)}\right] \\ &= \mathrm{E}_{X \sim g}\left[\frac{f(X)}{c\, g(X)}\right] \\ &= \int \frac{f(x)}{c\, g(x)}\, g(x)\, dx \\ &= \int \frac{h(x)/K}{c\, g(x)}\, g(x)\, dx \\ &= \frac{1}{Kc} \end{aligned}\]using \(\int h(x)\, dx = 1\) in the last step. Since \(f(x) = h(x)/K \ge h(x)\), the constant \(c\) must be correspondingly larger for \(c g(x)\) to stay above \(f(x)\); we, therefore, have more samples rejected. It is important to note that \(\int f(x)\, dx \neq 1\): \(f(x)\) is not a probability distribution but a scaled version of \(h(x)\) (it is \(h(x)\) that satisfies \(\int h(x)\, dx = 1\)); \(f(x)\) is more of a convenience function that can be computed.
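Finally, a sketch of the unnormalized case with the same toy target: here we pretend we can only compute \(f(x) = 10\, x (1 - x)\), i.e., \(h\) is the Beta(2, 2) density and \(K = 0.6\) (both values made up for illustration). Note that \(K\) never appears in the accept test:

```python
import numpy as np

rng = np.random.default_rng(2)

def f_unnorm(x):
    # Computable but unnormalized: f(x) = h(x) / K with h(x) = 6 x (1 - x)
    # (Beta(2, 2)) and K = 0.6, so f(x) = 10 x (1 - x).
    # The sampler never needs to know K.
    return 10.0 * x * (1.0 - x)

c = 2.5                          # sup_x f(x) / g(x) with g = Unif(0, 1)

n = 200_000
x = rng.uniform(0.0, 1.0, n)     # X ~ g
u = rng.uniform(0.0, 1.0, n)
samples = x[u <= f_unnorm(x) / c]

print(len(samples) / n)               # ~ 1/(K c) = 1/1.5 ≈ 0.667
print(samples.mean(), samples.var())  # ~ 0.5 and ~ 0.05, matching Beta(2, 2)
```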