A sample from a distribution p(x) is a single realization x whose probability distribution is p(x). Here, x can be high-dimensional or simply real valued.
The main objectives of sampling methods are to:
1. generate random samples $\{x^{(r)}\}_{r=1}^{R}$ from a given probability distribution $p(x)$;
2. estimate expectations of functions $\phi(x)$ under the distribution $p(x)$:
$$\Phi = \mathbb{E}_{x\sim p(x)}[\phi(x)] = \int \phi(x)\, p(x)\, dx$$
For example, if we are interested in the mean of some function $f$ under the distribution $p(x)$, then we have
$$\Phi = \mathbb{E}_{x\sim p(x)}[f(x)]$$
Simple Monte Carlo
For the expectation $\Phi = \mathbb{E}_{x\sim p(x)}[\phi(x)] = \int \phi(x)\, p(x)\, dx$, we can estimate the integral by Monte Carlo integration, i.e. generate $R$ samples $\{x^{(r)}\}_{r=1}^{R}$ from $p(x)$ and take their average:
$$\Phi = \int \phi(x)\, p(x)\, dx \approx \frac{1}{R} \sum_{r=1}^{R} \phi(x^{(r)}) =: \hat{\Phi}$$
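As a concrete illustration, here is a minimal sketch of simple Monte Carlo in Python; the target $p = \mathcal{N}(0, 1)$, the test function $\phi(x) = x^2$, and the sample count $R$ are arbitrary demo choices, not fixed by the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Demo choices (assumptions, not from the notes):
# p(x) = N(0, 1) and phi(x) = x^2, so the true value is Phi = E[x^2] = 1.
R = 10_000
x = rng.standard_normal(R)   # x^(r) ~ p(x)
Phi_hat = np.mean(x ** 2)    # (1/R) * sum_r phi(x^(r))
print(Phi_hat)               # close to the true value 1.0
```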
Properties of Simple Monte Carlo
Claim 1. $\hat{\Phi}$ is a consistent estimator of $\Phi$.
Proof. Follows directly from the law of large numbers (LLN).
Claim 2. $\hat{\Phi}$ is an unbiased estimator of $\Phi$.
Proof.
$$\mathbb{E}(\hat{\Phi}) = \frac{1}{R} \sum_{r=1}^{R} \mathbb{E}\big(\phi(x^{(r)})\big) = \frac{R}{R}\, \mathbb{E}_{x\sim p(x)}\big(\phi(x)\big) = \Phi$$
Claim 3. The variance of $\hat{\Phi}$ decreases at rate $1/R$.
Proof. Since the samples $x^{(r)}$ are drawn independently,
$$\mathrm{var}(\hat{\Phi}) = \frac{1}{R^2} \sum_{r=1}^{R} \mathrm{var}\big(\phi(x^{(r)})\big) = \frac{1}{R}\, \mathrm{var}\big(\phi(x)\big)$$
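A quick numerical check of Claim 3, reusing the same demo setup as above ($p = \mathcal{N}(0, 1)$, $\phi(x) = x^2$): increasing $R$ by a factor of 100 should shrink the empirical variance of $\hat{\Phi}$ by roughly a factor of 100.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi_hat(R: int) -> float:
    """One simple Monte Carlo estimate of E[x^2] under N(0, 1) from R samples."""
    return float(np.mean(rng.standard_normal(R) ** 2))

# Empirical variance of the estimator over many independent repetitions.
for R in (100, 10_000):
    estimates = [phi_hat(R) for _ in range(2_000)]
    print(R, np.var(estimates))  # the second variance is ~100x smaller
```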
Normalizing Constant
Let $f: \mathbb{R}^n \to \mathbb{R}$ be an arbitrary continuous, positive function that is integrable over $\mathbb{R}^n$, say with $\int_{\mathbb{R}^n} f(x)\, dx = Z$. Then we obtain a density
$$p(x) = \frac{f(x)}{Z}$$
However, computing the normalizer $Z$ in many cases requires a high-dimensional integral, which is computationally intractable (the cost grows exponentially with the dimension). Moreover, drawing samples from $p(x)$ is itself a challenge, especially in high-dimensional spaces.
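As a one-dimensional example where the integral happens to be tractable, take the unnormalized Gaussian bump:
$$f(x) = e^{-x^2/2}, \qquad Z = \int_{-\infty}^{\infty} e^{-x^2/2}\, dx = \sqrt{2\pi}, \qquad p(x) = \frac{f(x)}{Z} = \mathcal{N}(0, 1).$$
In high dimensions, such a closed form for $Z$ is rarely available.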
Importance Sampling
Importance sampling is a method for estimating the expectation of a function $\phi$; it is not itself a method for generating samples from $p(x)$.
Suppose that we wish to compute expectations under a density $p(x)$ that we can only evaluate up to its normalizing constant, via $\tilde{p}(x)$:
$$p(x) = \frac{\tilde{p}(x)}{Z_p}$$
And suppose we have a simpler density $q(x)$ which is easy to sample from and easy to evaluate up to its normalizing constant:
$$q(x) = \frac{\tilde{q}(x)}{Z_q}$$
In importance sampling, we first generate $R$ samples from $q(x)$:
$$\{x^{(r)}\}_{r=1}^{R} \sim q(x)$$
The naive average over these samples estimates the expectation of $\phi$ under the density $q(x)$:
$$\mathbb{E}_{x\sim q(x)}[\phi(x)] \approx \frac{1}{R} \sum_{r=1}^{R} \phi(x^{(r)})$$
The only problem is that this is an estimate over $q$, not over $p$. However, notice that at any value of $x$ we can evaluate both $\tilde{p}(x)$ and $\tilde{q}(x)$ over their domain, so we can represent $\tilde{p}$ with a weight function, $\tilde{p}(x) = \tilde{w}(x)\, \tilde{q}(x)$, i.e. $\tilde{w}(x) = \tilde{p}(x) / \tilde{q}(x)$.
Then, note that for our sampled points we have the importance weights $\tilde{w}_r := \tilde{w}(x^{(r)}) = \tilde{p}(x^{(r)}) / \tilde{q}(x^{(r)})$, which reweight each sample by how much more probable it is under the target than under the proposal. This gives the (self-normalized) importance sampling estimator
$$\hat{\Phi} = \frac{\sum_{r=1}^{R} \tilde{w}_r\, \phi(x^{(r)})}{\sum_{r=1}^{R} \tilde{w}_r},$$
in which the normalization by $\sum_r \tilde{w}_r$ cancels the unknown constants $Z_p$ and $Z_q$.
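A minimal sketch of this self-normalized estimator in Python; the target $\tilde{p}$, the proposal $q$, and $\phi(x) = x^2$ are assumed demo choices, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Demo choices (assumptions, not from the notes):
# target   p~(x) = exp(-x^2/2)             (N(0, 1) up to its normalizer Z_p)
# proposal q~(x) = exp(-x^2/(2*sigma^2))   (N(0, sigma^2) up to Z_q)
sigma = 2.0
p_tilde = lambda x: np.exp(-x**2 / 2)
q_tilde = lambda x: np.exp(-x**2 / (2 * sigma**2))

R = 10_000
x = rng.normal(0.0, sigma, size=R)     # x^(r) ~ q(x)
w = p_tilde(x) / q_tilde(x)            # importance weights w~(x^(r))

phi = x ** 2                           # phi(x) = x^2; true E_p[phi] = 1
Phi_hat = np.sum(w * phi) / np.sum(w)  # self-normalized estimator
print(Phi_hat)                         # close to 1.0
```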
Another sampling method is rejection sampling. For a given $\tilde{p}(x)$, we find a simpler proposal density $q(x)$, known up to a constant via $\tilde{q}(x) = Z_q\, q(x)$.
Then, we further assume that we have some constant $c > 0$ such that $c\, \tilde{q}(x) > \tilde{p}(x)$ for all $x \in S$. The idea is that the scaled proposal $c\, \tilde{q}$ lies above, i.e. cages (over-estimates), $\tilde{p}$ for every input $x$, so that we can reject part of the samples.
First, we generate a sample $x \sim q(x)$ and $u \sim \mathrm{Uniform}[0, c\, \tilde{q}(x)]$. Then, if $u > \tilde{p}(x)$, the point $(x, u)$ lies above the curve $\tilde{p}$, so we reject such $x$. Otherwise, we accept $x$ into $\{x^{(r)}\}$.
Claim. Rejection sampling produces samples $x \sim p(x)$.
Proof. Under the sampling scheme we have $x \sim q(x)$ and $u \mid x \sim \mathrm{Uniform}[0, c\, \tilde{q}(x)]$, and $x$ is accepted conditional on $u \le \tilde{p}(x)$. Consider the probability over any set $A \subseteq S$. First note that
$$P(x \in A, \text{accept}) = \int_A q(x)\, \frac{\tilde{p}(x)}{c\, \tilde{q}(x)}\, dx = \int_A \frac{\tilde{p}(x)}{c\, Z_q}\, dx = \frac{Z_p}{c\, Z_q} \int_A p(x)\, dx.$$
Taking $A = S$ gives $P(\text{accept}) = \frac{Z_p}{c\, Z_q}$, and therefore
$$P(x \in A \mid \text{accept}) = \frac{P(x \in A, \text{accept})}{P(\text{accept})} = \int_A p(x)\, dx,$$
so the accepted samples are distributed according to $p(x)$.
Note that in high dimensions, caging a function tightly is very hard. Therefore, $c$ will be huge, and the acceptance rate $\frac{Z_p}{c\, Z_q}$ will be exponentially reduced with increased dimensionality.
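A minimal sketch of rejection sampling in Python, under assumed demo choices of $\tilde{p}$ and $\tilde{q}$ (the same Gaussian pair as above), for which $c = 1$ happens to give a valid cage.

```python
import numpy as np

rng = np.random.default_rng(3)

# Demo choices (assumptions, not from the notes):
sigma = 2.0
p_tilde = lambda x: np.exp(-x**2 / 2)               # unnormalized target
q_tilde = lambda x: np.exp(-x**2 / (2 * sigma**2))  # unnormalized proposal
# Here p~(x)/q~(x) = exp(-3x^2/8) <= 1 for all x, so c = 1 satisfies
# the envelope condition c * q~(x) >= p~(x).
c = 1.0

samples = []
while len(samples) < 10_000:
    x = rng.normal(0.0, sigma)            # x ~ q(x)
    u = rng.uniform(0.0, c * q_tilde(x))  # u | x ~ Uniform[0, c q~(x)]
    if u <= p_tilde(x):                   # accept iff (x, u) lies under p~
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.var())      # ~0 and ~1 for N(0, 1)
```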
Metropolis-Hastings Algorithm
Importance sampling and rejection sampling work well only if the proposal $q$ is close to $p$. However, such a $q$ is very hard to find in high dimensions. Instead, we can make use of a proposal density $q$ which depends on the current state $x^{(t)}$.
Given some function $\tilde{p}(x)$ and a conditional proposal density $q(x' \mid x^{(t)})$, the procedure at each step is:
1. Generate a new state $x'$ from the proposal density $q(x' \mid x^{(t)})$.
2. Compute the acceptance probability
$$a = \min\left(1,\ \frac{\tilde{p}(x')\, q(x^{(t)} \mid x')}{\tilde{p}(x^{(t)})\, q(x' \mid x^{(t)})}\right)$$
3. Accept the new state $x'$ with probability $a$: if accepted, set $x^{(t+1)} = x'$; otherwise, set $x^{(t+1)} = x^{(t)}$.
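A minimal sketch of this procedure in Python, assuming a symmetric Gaussian random-walk proposal (for which the $q$ terms in the ratio cancel, giving the plain Metropolis special case) and the same demo target $\tilde{p}(x) = e^{-x^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Demo choice (assumption, not from the notes): p~(x) = exp(-x^2/2).
p_tilde = lambda x: np.exp(-x**2 / 2)
step = 1.0   # std of the random-walk proposal q(x'|x) = N(x, step^2)

x = 0.0      # initial state x^(0)
chain = []
for t in range(50_000):
    x_new = x + step * rng.standard_normal()  # x' ~ q(x'|x^(t))
    # The Gaussian random walk is symmetric, q(x^(t)|x') = q(x'|x^(t)),
    # so the acceptance ratio reduces to p~(x') / p~(x^(t)).
    a = min(1.0, p_tilde(x_new) / p_tilde(x))
    if rng.uniform() < a:                     # accept with probability a
        x = x_new                             # x^(t+1) = x'
    chain.append(x)                           # else x^(t+1) = x^(t)

chain = np.array(chain[5_000:])               # discard burn-in (heuristic)
print(chain.mean(), chain.var())              # ~0 and ~1 for N(0, 1)
```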