
Sampling

A sample from a distribution $p(x)$ is a single realization $x$ whose probability distribution is $p(x)$. Here, $x$ can be high-dimensional or simply real-valued.

The main objectives of sampling methods are to

  • Generate random samples $\{x^{(r)}\}_{r=1}^N$ from a given probability distribution $p(x)$.
  • Estimate expectations of functions $\phi(x)$ under the distribution $p(x)$:

    $$\Phi = E_{x\sim p(x)} [\phi(x)] = \int \phi(x)p(x)\,dx$$

For example, if we are interested in the mean of some function $f$ under the distribution $p(x)$, then

$$\Phi = E_{x\sim p(x)} [f(x)]$$

Simple Monte Carlo

For the expectation $\Phi = E_{x\sim p(x)} [\phi(x)] = \int \phi(x)p(x)\,dx$, we can estimate the integral by Monte Carlo integration, i.e. generate $R$ samples $\{x^{(r)}\}_{r=1}^R$ from $p(x)$ and take the average

$$\Phi = E_{x\sim p(x)} [\phi(x)] = \int \phi(x)p(x)\,dx \approx R^{-1}\sum_{r=1}^R \phi(x^{(r)}) =: \hat\Phi$$
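As a quick illustration, here is a minimal NumPy sketch of simple Monte Carlo; the target $p$ (a standard normal) and the function $\phi(x) = x^2$ are arbitrary choices for illustration, with true expectation $E[x^2] = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices: p(x) = N(0, 1) and phi(x) = x^2,
# so the true expectation is E[phi(x)] = 1.
phi = lambda x: x ** 2

R = 10_000
samples = rng.standard_normal(R)   # R draws x^(r) ~ p(x)
phi_hat = phi(samples).mean()      # Phi_hat = R^{-1} sum_r phi(x^(r))
print(phi_hat)                     # close to 1
```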

Properties of Simple Monte Carlo

Claim 1 $\hat\Phi$ is a consistent estimator of $\Phi$.

proof. Follows directly from the law of large numbers.

Claim 2 $\hat\Phi$ is an unbiased estimator of $\Phi$.
proof.

$$\begin{align*} E(\hat \Phi) &= R^{-1}\sum_{r=1}^R E(\phi(x^{(r)}))\\ &= \frac{R}{R} E_{x\sim p(x)}(\phi(x))\\ &= \Phi \end{align*}$$

Claim 3 The variance of $\hat\Phi$ decreases at rate $1/R$.

proof. Since the samples $x^{(r)}$ are independent draws from $p(x)$,

$$\mathrm{var}(\hat\Phi) = \frac{1}{R^2}\sum_{r=1}^R \mathrm{var}(\phi(x^{(r)})) = R^{-1}\,\mathrm{var}(\phi(x))$$
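A quick empirical check of this rate, under the same illustrative setup ($p$ standard normal, $\phi(x) = x^2$): the standard deviation of $\hat\Phi$ should shrink by roughly $\sqrt{10}$ each time $R$ grows tenfold.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda x: x ** 2

def phi_hat(R):
    """One realization of the simple Monte Carlo estimator with R samples."""
    return phi(rng.standard_normal(R)).mean()

for R in (100, 1_000, 10_000):
    estimates = np.array([phi_hat(R) for _ in range(500)])
    # std of the estimator shrinks roughly as 1/sqrt(R)
    print(R, estimates.std())
```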

Normalizing Constant

Given an arbitrary continuous, positive function $f: \mathbb R^n \rightarrow \mathbb R$ that is integrable over $\mathbb R^n$, say $\int_{\mathbb R^n} f(\mathbf x)\,d\mathbf x = Z$, we can define a density

$$p(\mathbf x) = \frac{f(\mathbf x)}{Z}$$

However, the normalizer $Z$, in many cases, requires computing a high-dimensional integral, which is computationally intractable (the cost is exponential in the dimension). Also, drawing samples from $p(\mathbf x)$ is a challenge, especially in high-dimensional spaces.
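To make the cost concrete, consider naive quadrature: a grid with $K$ points per axis needs $K^n$ evaluations of $f$. A small sketch, with an arbitrary Gaussian-shaped $f$ chosen for illustration (its true normalizer is $(2\pi)^{n/2}$):

```python
import itertools
import numpy as np

# Illustrative unnormalized density: f(x) = exp(-||x||^2 / 2),
# whose true normalizer is (2*pi)^(n/2).
f = lambda x: np.exp(-0.5 * np.sum(np.asarray(x) ** 2))

def grid_normalizer(n, K=20, lo=-5.0, hi=5.0):
    """Approximate Z = int f(x) dx with K points per axis: K**n evaluations."""
    axis = np.linspace(lo, hi, K)
    cell = ((hi - lo) / (K - 1)) ** n
    return sum(f(x) for x in itertools.product(axis, repeat=n)) * cell

for n in (1, 2, 3, 4):
    print(n, 20 ** n, grid_normalizer(n))  # number of evaluations grows as K**n
```

Already at $n = 4$ the grid takes $20^4 = 160{,}000$ evaluations of $f$; a 100-dimensional problem would need $20^{100}$, which is hopeless.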

Importance Sampling

Importance sampling is a method for estimating the expectation of a function $\phi$.

Suppose that we wish to estimate expectations under a density $p(x)$ that we can evaluate only up to a normalizing constant,

$$p(x)=\frac{\tilde p(x)}{Z_p}$$

and that we have a simpler density $q(x)$ which is easy to sample from and easy to evaluate up to a normalizing constant,

$$q(x) = \frac{\tilde q(x)}{Z_q}$$

In importance sampling, we first generate $R$ samples from $q(x)$:

$$\{x^{(r)}\}_{r=1}^R \sim q(x)$$

A naive average of $\phi$ over these samples estimates the expectation under $q$, not under $p$:

$$E_{x\sim q(x)}[\phi(x)] \approx R^{-1}\sum_{r=1}^R \phi(x^{(r)})$$

The problem is that this is an estimate under $q$. However, at any value of $x$ we can relate $\tilde p$ to $\tilde q$ through a weight function $\tilde w(x) := \tilde p(x)/\tilde q(x)$, so that $\tilde p(x) = \tilde w(x)\tilde q(x)$, since we can evaluate both $\tilde p(x)$ and $\tilde q(x)$ over their domain.

Then, note that the sample mean of the weights estimates the ratio of the normalizing constants:

$$R^{-1}\sum_{r=1}^R \tilde w(x^{(r)}) \approx E_{x\sim q(x)}\left[\frac{\tilde p(x)}{\tilde q(x)}\right] = \int \frac{\tilde p(x)}{\tilde q(x)}\,q(x)\,dx = \frac{Z_p}{Z_q}$$

and thus we can recover an estimator of $\Phi$ under $p$ from the samples drawn from $q$:

$$\begin{align*} \Phi &= \int\phi(x)p(x)\,dx\\ &= \int \phi(x)w(x)q(x)\,dx, \quad w(x) := \frac{p(x)}{q(x)}\\ &\approx R^{-1}\sum_{r=1}^R \phi(x^{(r)})w(x^{(r)})\\ &= R^{-1}\sum_{r=1}^R \phi(x^{(r)})\frac{\tilde p(x^{(r)})/Z_p}{\tilde q(x^{(r)})/Z_q}\\ &= \frac{Z_q}{Z_p}\,R^{-1} \sum_{r=1}^R\phi(x^{(r)})\,\tilde w(x^{(r)})\\ &\approx \left(R^{-1}\sum_{r=1}^R \tilde w(x^{(r)})\right)^{-1}R^{-1}\sum_{r=1}^R\phi(x^{(r)})\,\tilde w(x^{(r)})\\ &= \sum_{r=1}^R\phi(x^{(r)})\frac{\tilde w(x^{(r)})}{\sum_{r'=1}^R \tilde w(x^{(r')})} =: \hat\Phi_{iw} \end{align*}$$
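Here is a minimal sketch of $\hat\Phi_{iw}$, under illustrative assumptions: $\tilde p(x) = e^{-x^4}$ (whose $Z_p$ we never compute), a standard normal proposal $q$, and $\phi(x) = x^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices: target known only up to normalization,
# proposal is a standard normal (easy to sample and evaluate).
p_tilde = lambda x: np.exp(-x ** 4)
q_tilde = lambda x: np.exp(-0.5 * x ** 2)  # N(0,1) up to Z_q = sqrt(2*pi)
phi = lambda x: x ** 2

R = 100_000
x = rng.standard_normal(R)                   # x^(r) ~ q(x)
w = p_tilde(x) / q_tilde(x)                  # importance weights w~(x^(r))
phi_hat_iw = np.sum(phi(x) * w) / np.sum(w)  # self-normalized estimator
print(phi_hat_iw)
```

Note that neither $Z_p$ nor $Z_q$ appears anywhere: the self-normalization cancels both.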

Rejection Sampling

Another sampling method is rejection sampling. For a given $\tilde p(x)$, we find a simpler proposal density $q(x)$, with $\tilde q(x) = Z_q q(x)$.

Then, we further assume that we have some constant $c$ s.t. $c\,\tilde q(x) > \tilde p(x)$ for all $x\in\mathcal S$.
The idea is that the scaled $\tilde q$ cages (over-estimates) $\tilde p$ for every input $x$, so that we can correct the mismatch by rejecting part of the samples.

First, we generate a sample $x \sim q(x)$ and then $u\sim \text{Uniform}[0, c\tilde q(x)]$. If $u > \tilde p(x)$, the point $(x, u)$ lies above the curve $\tilde p$, so we reject $x$. Otherwise, we accept $x$ into $\{x^{(r)}\}$.
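A minimal sketch under the same illustrative assumptions: $\tilde p(x) = e^{-x^4}$ with a standard-normal envelope $\tilde q(x) = e^{-x^2/2}$; here the smallest valid constant is $\max_x \tilde p(x)/\tilde q(x) = e^{1/16} \approx 1.065$, so $c = 1.1$ suffices.

```python
import numpy as np

rng = np.random.default_rng(0)

p_tilde = lambda x: np.exp(-x ** 4)        # target, unnormalized
q_tilde = lambda x: np.exp(-0.5 * x ** 2)  # N(0,1) up to Z_q = sqrt(2*pi)
c = 1.1  # chosen so that c * q_tilde(x) >= p_tilde(x) for all x

def rejection_sample(n):
    """Draw n samples from p(x) = p_tilde(x)/Z_p by rejection."""
    out = []
    while len(out) < n:
        x = rng.standard_normal()             # x ~ q(x)
        u = rng.uniform(0.0, c * q_tilde(x))  # u | x ~ Uniform[0, c*q_tilde(x)]
        if u <= p_tilde(x):                   # accept iff u <= p_tilde(x)
            out.append(x)
    return np.array(out)

samples = rejection_sample(10_000)
print(samples.mean(), samples.var())  # symmetric target, so mean ~ 0
```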

Claim Rejection sampling samples $x\sim p(x)$.

proof. Under this sampling scheme we have $x\sim q(x)$, $u\mid x \sim \text{Uniform}[0, c\tilde q(x)]$, and $x$ is accepted conditional on $u \leq \tilde p(x)$. Consider the probability over any set $A\subseteq \mathcal S$. First note that

$$P_{x\sim p}(x\in A) = \int_A p(x)\,dx = \int\mathbb I(x\in A)\,p(x)\,dx = E_{x\sim p}[\mathbb{I}(x\in A)]$$

Thus,

$$\begin{align*} P_{x\sim q}(x\in A \mid u\leq \tilde p(x)) &= \frac{P_{x\sim q}(x\in A,\ u\leq \tilde p(x))}{P_{x\sim q}(u\leq \tilde p(x))}\\ &= \frac{E_{x\sim q}[\mathbb I(x\in A)\, P(u\leq \tilde p(x)\mid x)]}{E_{x\sim q}[P(u\leq \tilde p(x)\mid x)]}\\ &= \frac{E_{x\sim q}\left[\mathbb I(x\in A)\frac{\tilde p(x)}{c\tilde q(x)}\right]}{E_{x\sim q}\left[\frac{\tilde p(x)}{c\tilde q(x)}\right]}\\ &= P_{x\sim p}(x\in A)\,\frac{Z_p}{cZ_q} \Big/ \frac{Z_p}{cZ_q}\\ &= P_{x\sim p}(x\in A) \end{align*}$$

Curse of Dimensionality

Note that in high dimensions, caging $\tilde p$ tightly with a simple envelope is very hard. The constant $c$ will therefore be huge, and the acceptance rate $\frac{Z_p}{cZ_q}$ will shrink exponentially with increased dimensionality. For example, if $p$ and $q$ are zero-mean Gaussians with standard deviations $\sigma_p$ and $\sigma_q$ (with $\sigma_q$ slightly larger than $\sigma_p$) in $N$ dimensions, the smallest valid $c$ is $(\sigma_q/\sigma_p)^N$, which grows exponentially in $N$.

Metropolis-Hastings Algorithm

Importance sampling and rejection sampling work well only if $q$ is a good approximation of $p$. However, such a $q$ is very hard to find in high dimensions. Instead, we can make use of a proposal density $q$ which depends on the current state $x^{(t)}$.

Given some function $\tilde p(x)$ and a proposal conditional density $q(x' \mid x^{(t)})$, the procedure is

  1. Generate a new state $x'$ from the proposal density $q(x'\mid x^{(t)})$.
  2. Compute the acceptance probability

     $$a = \min\left(1, \frac{\tilde p(x')\,q(x^{(t)}\mid x')}{\tilde p(x^{(t)})\,q(x'\mid x^{(t)})}\right)$$

     and accept the new state $x'$ with probability $a$.
  3. If accepted, set $x^{(t+1)} = x'$; otherwise $x^{(t+1)} = x^{(t)}$.
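A minimal sketch with a Gaussian random-walk proposal $q(x'\mid x^{(t)}) = \mathcal N(x^{(t)}, \sigma^2)$; this proposal is symmetric, so the $q$-ratio in $a$ cancels. The target $\tilde p(x) = e^{-x^4}$ and the step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

p_tilde = lambda x: np.exp(-x ** 4)  # target, known only up to normalization

def metropolis_hastings(n_steps, x0=0.0, sigma=1.0):
    """Random-walk Metropolis: symmetric q, so a = min(1, p_tilde(x')/p_tilde(x))."""
    chain = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x_new = x + sigma * rng.standard_normal()  # x' ~ q(x' | x^(t))
        a = min(1.0, p_tilde(x_new) / p_tilde(x))  # acceptance probability
        if rng.uniform() < a:
            x = x_new                              # accept: x^(t+1) = x'
        chain[t] = x                               # else keep x^(t+1) = x^(t)
    return chain

chain = metropolis_hastings(50_000)
samples = chain[1_000:]  # discard burn-in
print(samples.mean(), samples.var())
```

Note that consecutive states of the chain are correlated, so the effective number of independent samples is smaller than the chain length.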