A sample from a distribution p(x) is a single realization x whose probability distribution is p(x). Here, x can be high-dimensional or simply real valued.
The main objectives of sampling methods are to:
1. generate random samples $\{x^{(r)}\}_{r=1}^{R}$ from a given probability distribution $p(x)$;
2. estimate expectations of functions $\phi(x)$ under the distribution $p(x)$:
$$\Phi = \mathbb{E}_{x\sim p(x)}[\phi(x)] = \int \phi(x)\, p(x)\, dx$$
For example, if we are interested in the mean of some function $f$ under the distribution $p(x)$, then we have
$$\Phi = \mathbb{E}_{x\sim p(x)}[f(x)]$$
Simple Monte Carlo
For the expectation $\Phi = \mathbb{E}_{x\sim p(x)}[\phi(x)] = \int \phi(x)\, p(x)\, dx$, we can estimate the integral by Monte Carlo integration, i.e. generate $R$ samples $\{x^{(r)}\}_{r=1}^{R}$ from $p(x)$ and take their average:
$$\Phi = \int \phi(x)\, p(x)\, dx \approx \frac{1}{R} \sum_{r=1}^{R} \phi(x^{(r)}) =: \hat{\Phi}$$
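As a concrete illustration, here is a minimal sketch of simple Monte Carlo in Python; the target $p = \mathcal{N}(0, 1)$, the test function $\phi(x) = x^2$, and the sample count $R$ are arbitrary demo choices, not fixed by the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Demo choices (assumptions, not from the notes):
# p(x) = N(0, 1) and phi(x) = x^2, so the true value is Phi = E[x^2] = 1.
R = 10_000
x = rng.standard_normal(R)   # x^(r) ~ p(x)
Phi_hat = np.mean(x ** 2)    # (1/R) * sum_r phi(x^(r))
print(Phi_hat)               # close to the true value 1.0
```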
Properties of Simple Monte Carlo
Claim 1. $\hat{\Phi}$ is a consistent estimator of $\Phi$.
Proof. Follows directly from the law of large numbers (LLN).
Claim 2. $\hat{\Phi}$ is an unbiased estimator of $\Phi$.
Proof.
$$\mathbb{E}(\hat{\Phi}) = \frac{1}{R} \sum_{r=1}^{R} \mathbb{E}\big(\phi(x^{(r)})\big) = \frac{R}{R}\, \mathbb{E}_{x\sim p(x)}\big(\phi(x)\big) = \Phi$$
Claim 3. The variance of $\hat{\Phi}$ decreases at rate $1/R$.
Proof. Since the samples $x^{(r)}$ are drawn independently,
$$\mathrm{var}(\hat{\Phi}) = \frac{1}{R^2} \sum_{r=1}^{R} \mathrm{var}\big(\phi(x^{(r)})\big) = \frac{1}{R}\, \mathrm{var}\big(\phi(x)\big)$$
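A quick numerical check of Claim 3, reusing the same demo setup as above ($p = \mathcal{N}(0, 1)$, $\phi(x) = x^2$): increasing $R$ by a factor of 100 should shrink the empirical variance of $\hat{\Phi}$ by roughly a factor of 100.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi_hat(R: int) -> float:
    """One simple Monte Carlo estimate of E[x^2] under N(0, 1) from R samples."""
    return float(np.mean(rng.standard_normal(R) ** 2))

# Empirical variance of the estimator over many independent repetitions.
for R in (100, 10_000):
    estimates = [phi_hat(R) for _ in range(2_000)]
    print(R, np.var(estimates))  # the second variance is ~100x smaller
```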
Normalizing Constant
Let $f: \mathbb{R}^n \to \mathbb{R}$ be an arbitrary continuous, positive function that is integrable over $\mathbb{R}^n$, say with $\int_{\mathbb{R}^n} f(x)\, dx = Z$. Then we obtain a density
$$p(x) = \frac{f(x)}{Z}$$
However, computing the normalizer $Z$ in many cases requires a high-dimensional integral, which is computationally intractable (the cost grows exponentially with the dimension). Moreover, drawing samples from $p(x)$ is itself a challenge, especially in high-dimensional spaces.
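As a one-dimensional example where the integral happens to be tractable, take the unnormalized Gaussian bump:
$$f(x) = e^{-x^2/2}, \qquad Z = \int_{-\infty}^{\infty} e^{-x^2/2}\, dx = \sqrt{2\pi}, \qquad p(x) = \frac{f(x)}{Z} = \mathcal{N}(0, 1).$$
In high dimensions, such a closed form for $Z$ is rarely available.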
Importance Sampling
Importance sampling is a method for estimating the expectation of a function $\phi$; it is not itself a method for generating samples from $p(x)$.
Suppose that we wish to compute expectations under a density $p(x)$ that we can only evaluate up to its normalizing constant, via $\tilde{p}(x)$:
$$p(x) = \frac{\tilde{p}(x)}{Z_p}$$
And suppose we have a simpler density $q(x)$ which is easy to sample from and easy to evaluate up to its normalizing constant:
$$q(x) = \frac{\tilde{q}(x)}{Z_q}$$
In importance sampling, we first generate $R$ samples from $q(x)$:
$$\{x^{(r)}\}_{r=1}^{R} \sim q(x)$$
The naive average over these samples estimates the expectation of $\phi$ under the density $q(x)$:
$$\mathbb{E}_{x\sim q(x)}[\phi(x)] \approx \frac{1}{R} \sum_{r=1}^{R} \phi(x^{(r)})$$
The only problem is that this is an estimate over $q$, not over $p$. However, notice that at any value of $x$ we can evaluate both $\tilde{p}(x)$ and $\tilde{q}(x)$ over their domain, so we can represent $\tilde{p}$ with a weight function, $\tilde{p}(x) = \tilde{w}(x)\, \tilde{q}(x)$, i.e. $\tilde{w}(x) = \tilde{p}(x) / \tilde{q}(x)$.
Then, note that for our sampled points we have the importance weights $\tilde{w}_r := \tilde{w}(x^{(r)}) = \tilde{p}(x^{(r)}) / \tilde{q}(x^{(r)})$, which reweight each sample by how much more probable it is under the target than under the proposal. This gives the (self-normalized) importance sampling estimator
$$\hat{\Phi} = \frac{\sum_{r=1}^{R} \tilde{w}_r\, \phi(x^{(r)})}{\sum_{r=1}^{R} \tilde{w}_r},$$
in which the normalization by $\sum_r \tilde{w}_r$ cancels the unknown constants $Z_p$ and $Z_q$.
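A minimal sketch of this self-normalized estimator in Python; the target $\tilde{p}$, the proposal $q$, and $\phi(x) = x^2$ are assumed demo choices, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Demo choices (assumptions, not from the notes):
# target   p~(x) = exp(-x^2/2)             (N(0, 1) up to its normalizer Z_p)
# proposal q~(x) = exp(-x^2/(2*sigma^2))   (N(0, sigma^2) up to Z_q)
sigma = 2.0
p_tilde = lambda x: np.exp(-x**2 / 2)
q_tilde = lambda x: np.exp(-x**2 / (2 * sigma**2))

R = 10_000
x = rng.normal(0.0, sigma, size=R)     # x^(r) ~ q(x)
w = p_tilde(x) / q_tilde(x)            # importance weights w~(x^(r))

phi = x ** 2                           # phi(x) = x^2; true E_p[phi] = 1
Phi_hat = np.sum(w * phi) / np.sum(w)  # self-normalized estimator
print(Phi_hat)                         # close to 1.0
```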
Another sampling method is rejection sampling. For a given $\tilde{p}(x)$, we find a simpler proposal density $q(x)$, known up to a constant via $\tilde{q}(x) = Z_q\, q(x)$.
Then, we further assume that we have some constant $c > 0$ such that $c\, \tilde{q}(x) > \tilde{p}(x)$ for all $x \in S$. The idea is that the scaled proposal $c\, \tilde{q}$ lies above, i.e. cages (over-estimates), $\tilde{p}$ for every input $x$, so that we can reject part of the samples.
First, we generate a sample $x \sim q(x)$ and $u \sim \mathrm{Uniform}[0, c\, \tilde{q}(x)]$. Then, if $u > \tilde{p}(x)$, the point $(x, u)$ lies above the curve $\tilde{p}$, so we reject such $x$. Otherwise, we accept $x$ into $\{x^{(r)}\}$.
Claim. Rejection sampling produces samples $x \sim p(x)$.
Proof. Under the sampling scheme we have $x \sim q(x)$ and $u \mid x \sim \mathrm{Uniform}[0, c\, \tilde{q}(x)]$, and $x$ is accepted conditional on $u \le \tilde{p}(x)$. Consider the probability over any set $A \subseteq S$. First note that
$$P(x \in A, \text{accept}) = \int_A q(x)\, \frac{\tilde{p}(x)}{c\, \tilde{q}(x)}\, dx = \int_A \frac{\tilde{p}(x)}{c\, Z_q}\, dx = \frac{Z_p}{c\, Z_q} \int_A p(x)\, dx.$$
Taking $A = S$ gives $P(\text{accept}) = \frac{Z_p}{c\, Z_q}$, and therefore
$$P(x \in A \mid \text{accept}) = \frac{P(x \in A, \text{accept})}{P(\text{accept})} = \int_A p(x)\, dx,$$
so the accepted samples are distributed according to $p(x)$.
Note that in high dimensions, caging a function tightly is very hard. Therefore, $c$ will be huge, and the acceptance rate $\frac{Z_p}{c\, Z_q}$ will be exponentially reduced with increased dimensionality.
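A minimal sketch of rejection sampling in Python, under assumed demo choices of $\tilde{p}$ and $\tilde{q}$ (the same Gaussian pair as above), for which $c = 1$ happens to give a valid cage.

```python
import numpy as np

rng = np.random.default_rng(3)

# Demo choices (assumptions, not from the notes):
sigma = 2.0
p_tilde = lambda x: np.exp(-x**2 / 2)               # unnormalized target
q_tilde = lambda x: np.exp(-x**2 / (2 * sigma**2))  # unnormalized proposal
# Here p~(x)/q~(x) = exp(-3x^2/8) <= 1 for all x, so c = 1 satisfies
# the envelope condition c * q~(x) >= p~(x).
c = 1.0

samples = []
while len(samples) < 10_000:
    x = rng.normal(0.0, sigma)            # x ~ q(x)
    u = rng.uniform(0.0, c * q_tilde(x))  # u | x ~ Uniform[0, c q~(x)]
    if u <= p_tilde(x):                   # accept iff (x, u) lies under p~
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.var())      # ~0 and ~1 for N(0, 1)
```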
Metropolis-Hastings Algorithm
Importance sampling and rejection sampling work well only if the proposal $q$ is close to $p$. However, such a $q$ is very hard to find in high dimensions. Instead, we can make use of a proposal density $q$ which depends on the current state $x^{(t)}$.
Given some function $\tilde{p}(x)$ and a conditional proposal density $q(x' \mid x^{(t)})$, the procedure at each step is:
1. Generate a new state $x'$ from the proposal density $q(x' \mid x^{(t)})$.
2. Compute the acceptance probability
$$a = \min\left(1,\ \frac{\tilde{p}(x')\, q(x^{(t)} \mid x')}{\tilde{p}(x^{(t)})\, q(x' \mid x^{(t)})}\right)$$
3. Accept the new state $x'$ with probability $a$: if accepted, set $x^{(t+1)} = x'$; otherwise, set $x^{(t+1)} = x^{(t)}$.
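A minimal sketch of this procedure in Python, assuming a symmetric Gaussian random-walk proposal (for which the $q$ terms in the ratio cancel, giving the plain Metropolis special case) and the same demo target $\tilde{p}(x) = e^{-x^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Demo choice (assumption, not from the notes): p~(x) = exp(-x^2/2).
p_tilde = lambda x: np.exp(-x**2 / 2)
step = 1.0   # std of the random-walk proposal q(x'|x) = N(x, step^2)

x = 0.0      # initial state x^(0)
chain = []
for t in range(50_000):
    x_new = x + step * rng.standard_normal()  # x' ~ q(x'|x^(t))
    # The Gaussian random walk is symmetric, q(x^(t)|x') = q(x'|x^(t)),
    # so the acceptance ratio reduces to p~(x') / p~(x^(t)).
    a = min(1.0, p_tilde(x_new) / p_tilde(x))
    if rng.uniform() < a:                     # accept with probability a
        x = x_new                             # x^(t+1) = x'
    chain.append(x)                           # else x^(t+1) = x^(t)

chain = np.array(chain[5_000:])               # discard burn-in (heuristic)
print(chain.mean(), chain.var())              # ~0 and ~1 for N(0, 1)
```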