Sampling
A sample from a distribution \(p(x)\) is a single realization \(x\) whose probability distribution is \(p(x)\). Here, \(x\) can be high-dimensional or simply real valued.
The main objectives of sampling methods is to
- Generate random samples \(\{x^{(r)}\}^N_{r=1}\) from a given probability distribution \(p(x)\).
-
To estimate expectations of functions \(\phi(x)\), under the distribution \(p(x)\)
\[\Phi = E_{x\sim p(x)} [\phi(x)] = \int \phi(x)p(x)dx\]
For example, we are interested in the mean of some function \(f\), under distribution \(p(x)\). Then we have
Simple Monte Carlo
For the expectation \(\Phi = E_{x\sim p(x)} [\phi(x)] = \int \phi(x)p(x)dx\), we can estimate the integral by Monte Carlo integration, i.e. generate \(R\) samples \(\{x^{(r)}\}_{r=1}^R\) from \(p(x)\), and taking average
Properties of Simple Monte Carlo
Claim 1 \(\hat\Phi\) is a consistent estimator of \(\Phi\).
proof. Directly from LLN.
Claim 2 \(\hat\Phi\) is a unbiased estimator of \(\Phi\).
proof.
Claim 3 The variance of \(\hat\Phi\) decreases with rate \(1/R\).
proof. By consistency and unbiaseness
Normalizing Constant
Given an arbitrary continuous, positive function \(f: \mathbb R^n \rightarrow \mathbb R\) and the function is integrable over \(\mathbb R^n\). Say, \(\int_{\mathbb R^n} f(\mathbf x)d\mathbf x=Z\). Then, we can have a density
However, the normalizer \(Z\), in many cases, requires computing a high-dim integral, which is computationally intractable (exponential to the dimension). Also, drawing samples from \(p(\mathbf x)\) is a challenge, especially in high-dim spaces.
Importance Sampling
Importance sampling is a method for estimating the expectation of a function \(\phi\).
Suppose that we wish to draw samples from \(\tilde p(x)\) by
And we have a simpler density \(q(x)\) which is easy to sample from and easy to evaluate up to normalizing constant
In importance sampling, we first generate \(R\) samples from \(q(x)\).
Then we have an estimate of \(\phi\) over density \(q(x)\) as
The only problem is that the this is an estimation over \(q\). However, notice that at values of \(x\), we can represents \(\tilde p\) with a weights function \(\tilde p(x) = \tilde w(x)\tilde q(x)\), since we know \(\tilde p(x), \tilde q(x)\) over their domain.
Then, note that for our sampled points we have \(\tilde p(x^{(r)}) = \tilde w(x^{(r)})\tilde q(x^{(r)})\), which
and thus for our estimator under \(p\) from estimator under \(q\) is
Rejection Sampling
Another sampling method is rejection sampling. For a given \(\tilde p(x)\), we find a simpler proposal density \(q(x)\) and which \(\tilde q(x) = Z_q q(x)\).
Then, we further assume that we have some constant \(c_0\) s.t. \(c_0\tilde q(x) >\tilde p(x). \forall x\in\mathcal S\).
The idea is that we have a simpler density \(q\), such that the scaled \(\tilde q\) is above to cage (over-estimate) \(p\) for all input \(x\), so that we can reject part of the samples.
First, we generate a sample \(x ~ q(x)\) and \(u\sim \text{Uniform}[0, c\tilde q(x)]\). Then, if \(u > \tilde p(x)\), then \(x\) is outside of \(\tilde p\) so we reject such \(x\). Otherwise, we accept \(x\) into \(\{x^{(r)}\}\).
Claim rejection sampling samples \(x\sim p(x)\).
proof. Consider our sampling method, we have \(x\sim q(x), u|x \sim \text{Uniform}[0, c\tilde q(x)]\), and \(x\) is accepted is conditional on \(u \leq \tilde p(x)\). Thus, consider the probability over any set \(A\subseteq \mathcal S\). First note that the probability
Thus,
Curse of Dimension
Note that in high dimensions, a caging over some function will be very hard. Therefore, \(c\) will be huge and the acceptance rate \(\frac{Z_p}{cZ_q}\) will be exponentially reduced with increased dimensionality.
Metropolis-Hastings Algorithm
Importance sampling and rejection sampling work well only if the \(q\sim p\). However, such \(q\) is very hard to find in high dimensions. Instead, we could make use of a proposal density \(q\) which depends on the current state \(x^{(t)}\).
Given some function \(\tilde p(x)\) and a proposal conditional density \(q(x_t|x_{t-1})\). The procedure is
- Generate a new state \(x'\) from the proposal density \(q(x'|x^{(t)})\).
- Compute acceptance probability
the new state \(x'\) is accepted with probability \(a\) 3. If accepted, then \(x^{(t+1)} = x'\), otherwise \(x^{(t+1)} = x^{(t)}\)