
Probabilistic Models

Given random variables \(X = (X_1,..., X_N)\) that are either observed or unobserved, we want a model that captures the relationships between these variables. The approach of probabilistic generative models is to relate all variables by a learned joint probability distribution \(p_\theta(X_1,...,X_N)\).

We make the assumption that the variables are generated by some distribution \((X_1,..., X_N) \sim p_*(X)\). Then, density estimation learns the joint probability distribution by choosing the parameters \(\theta\) of a specified parametric joint distribution \(p_\theta(X)\) that best matches \(p_*(X)\).

Therefore, as in a classical statistics course, we are interested in:

  • How should we specify \(p_\theta\) (which class of distribution)?
  • How should we measure the "best match" (goodness of fit)?
  • How should we find the best \(\theta\) (MLE, MAP, etc.)?

Probabilistic Perspectives on ML Tasks

For an ML task, we are given input data \(X\subseteq \mathcal X\) and output data \(Y\subseteq \mathcal Y\) (for continuous outputs) or \(C\subseteq\mathcal C\) (for discrete classes, or labels).

From the deep learning (neural nets) perspective, we want to recover some function \(f_\Theta: \mathcal X\rightarrow \mathcal Y\). By the universal approximation theorem, we use the parameters of the NN as the parameters \(\Theta\), and measure how well our \(f_\Theta\) matches the underlying \(f_*\) by defining a task-specific loss.

In the probabilistic perspective, instead of datasets, we treat \(X, Y, C\) as random variables, and each data point as a sample. Then we have the joint probability over these random variables, \(P(X, Y)\) or \(P(X, C)\), and by Bayes' theorem

\[P(Y|X) = \frac{p(X,Y)}{p(X)} = \frac{p(X,Y)}{\int_\mathcal Y p(X,Y)\,dY}\]
\[P(C|X) = \frac{p(X,C)}{p(X)} = \frac{p(X,C)}{\sum_{c\in\mathcal C} p(X,C=c)}\]

Observed vs. Unobserved

In this probabilistic perspective, the general distinction between supervised and unsupervised learning is given by whether a random variable is observed or unobserved.

For a supervised dataset \(\{(x_i, c_i)\}_{i=1}^N \sim p(X, C)\), the only thing we need to find is the conditional distribution \(p(C|X)\), so that for each input \(x\) we can pick the label with the maximum conditional probability

\[c_* = \arg\max_{c\in\mathcal C}\{p(C=c\mid X=x)\}\]
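As a minimal sketch (the joint table below is hypothetical), conditioning the joint \(p(X, C)\) on an observed input and taking the argmax over classes looks like this in NumPy:

```python
import numpy as np

# Hypothetical joint distribution p(X, C) over 4 discrete inputs and 3 classes.
# Rows index values of X, columns index values of C; all entries sum to 1.
p_xc = np.array([
    [0.10, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.05, 0.05, 0.15],
    [0.10, 0.05, 0.10],
])

x = 1                                  # observed input value
p_c_given_x = p_xc[x] / p_xc[x].sum()  # Bayes' rule: p(C | X=x) = p(X=x, C) / p(X=x)
c_star = int(np.argmax(p_c_given_x))   # label with maximum conditional probability

print(p_c_given_x, c_star)             # -> roughly [0.167, 0.667, 0.167] and class 1
```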

For an unsupervised dataset \(\{x_i\}_{i=1}^N \sim p(X, C)\), we don't change the generative assumption: each data point \(x_i\) is still generated according to some class label \(C=c_i\), even though the label is not observed in the dataset.

Thus, the goal is the same: we want to find the conditional distribution \(p(C|X)\).

Furthermore, we may have latent (hidden) variables, which are not observed but contribute substantially to the model. By introducing latent variables, we will be able to naturally describe and capture abstract features of our input data.

Tasks on Probabilistic Models

The fundamental operations we will perform on a probabilistic model are:

  • Generate: sample new data points from the model.
  • Estimate likelihood: marginalize or observe all variables, giving the probability of those variables taking on specific values.
  • Inference: compute the expected value of some variables given others, which are either observed or marginalized.
  • Learning: set the parameters of the joint distribution given some observed data to maximize the probability of the observed data.
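As a minimal sketch (the Bernoulli model, parameter value, and data below are made up for illustration), these four operations look like:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7                      # hypothetical model parameter: p(X = 1) = theta

# Generate: sample new data points from the model.
samples = rng.random(5) < theta

# Estimate likelihood: probability of observed values under the model.
x_obs = np.array([1, 0, 1, 1])
likelihood = np.prod(theta**x_obs * (1 - theta)**(1 - x_obs))

# Inference: expected value of X under the model (here simply E[X] = theta).
expected_x = theta

# Learning: set theta to maximize the probability of the observed data;
# for a Bernoulli model the maximum-likelihood estimate is the sample mean.
theta_hat = x_obs.mean()

print(samples, likelihood, expected_x, theta_hat)
```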

Challenges with Probabilistic Models

We want probabilistic models that have two key properties:

  • Efficient computation of marginal and conditional distributions. Note that marginalization often involves integration, which is often computationally intractable.
  • A compact representation, so that the size of the parameterization scales well for joint distributions over many variables.

Unfortunately, achieving these two properties remains a major challenge for probabilistic models in ML tasks.

Parameterized Joint Distribution

Consider the joint distribution of a few (finitely many) discrete random variables, say 3 random variables \(X_1 \in \{x_{11}, x_{12}, ..., x_{1N_1}\}\), \(X_2 \in \{x_{21}, x_{22}, ..., x_{2N_2}\}\), \(X_3 \in \{x_{31}, x_{32}, ..., x_{3N_3}\}\), and let \(N_i\) be the size of the \(i\)-th set.

Then we define a valid probability distribution (so that \(P(X)\geq 0\) and \(\sum_x P(X=x) = 1\)) by assigning a parameter to each possible combination of states. Thus, for each \(x_{1i}\in X_1, x_{2j}\in X_2, x_{3k}\in X_3\), we have that

\[p(X_1 = x_{1i}, X_2 = x_{2j}, X_3 = x_{3k}) = \theta_{ijk} \in \mathbb R^{\geq 0}\]

Note that there are \(N_1\times N_2\times N_3\) parameters.

With the joint distribution, we can compute marginals such as

\[p(X_1 = x_{1i}, X_2 = x_{2j}) = \sum_{x\in X_3} p(X_1 = x_{1i}, X_2 = x_{2j}, X_3=x)\]

or conditional probabilities via Bayes' rule as

\[p(X_1 = x_{1i}, X_2 = x_{2j} \mid X_3 = x_{3k}) = \frac{p(X_1 = x_{1i}, X_2 = x_{2j}, X_3 = x_{3k})}{p(X_3 = x_{3k})}\]

where the denominator is itself computed by marginalization.
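A minimal sketch of these computations on a hypothetical, randomly parameterized joint over three small discrete variables, treating the joint as a probability tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2, N3 = 2, 3, 4

# Hypothetical parameters theta_ijk: one non-negative number per joint state,
# normalized so they sum to 1 (N1 * N2 * N3 = 24 parameters in total).
theta = rng.random((N1, N2, N3))
theta /= theta.sum()

# Marginal p(X1, X2): sum out X3.
p_x1_x2 = theta.sum(axis=2)

# Conditional p(X1, X2 | X3 = x_3k): joint restricted to X3 = x_3k, divided by p(X3 = x_3k).
k = 1
p_x3 = theta.sum(axis=(0, 1))
p_x1_x2_given_x3 = theta[:, :, k] / p_x3[k]

print(p_x1_x2.sum(), p_x1_x2_given_x3.sum())  # both should be (numerically) 1.0
```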

Dimensionality

As mentioned, a fully parameterized joint distribution has \(\prod_1^F N_i\) parameters, where \(F\) is the number of random variables, which is huge.

The primary way we will reduce the number of parameters is to make independence assumptions about the variables. For example, if \(X_1\) and \(X_2\) are conditionally independent given \(X_3\), then we have that

\[p(X_1, X_2, X_3) = p(X_1\mid X_3)\,p(X_2\mid X_3)\,p(X_3)\]

Thus the number of parameters is only \(N_1 N_3 + N_2 N_3 + N_3 \approx (N_1 + N_2)N_3\).

For random variables \(X_1,...,X_F\), if all of them are mutually independent, then \(p(X_1,..., X_F) = \prod_1^F p(X_i)\), and we only need \(\sum_1^F N_i\) parameters.
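A quick check of these parameter counts with made-up cardinalities \(N_1 = N_2 = N_3 = 10\):

```python
N1, N2, N3 = 10, 10, 10

full_joint  = N1 * N2 * N3            # fully parameterized joint: 1000 parameters
cond_indep  = N1 * N3 + N2 * N3 + N3  # p(X1|X3) p(X2|X3) p(X3): 210 parameters
fully_indep = N1 + N2 + N3            # mutually independent: 30 parameters

print(full_joint, cond_indep, fully_indep)
```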

Likelihood Function

So far, for a given set of parameters describing the distribution, we have focused on the density function \(p(x|\theta)\). However, our goal is to optimize the (non-fixed) \(\theta\) from the (fixed) observations \(\mathcal D\).

Thus, we define the likelihood function \(\mathcal L(\theta; x) = p(x|\theta)\) and the log-likelihood function \(l(\theta; x) = \log \mathcal L(\theta; x)\).

Suppose that the observations \(x^{(m)}\) in the dataset are i.i.d.; then

\[\mathcal L(\theta; \mathcal D) = p( \mathcal D|\theta) = \prod_m p(x^{(m)}|\theta)\]
\[l(\theta; \mathcal D) = \log(p( \mathcal D|\theta)) = \sum_m \log(p(x^{(m)}|\theta))\]

Note that the goal of the likelihood function is to optimize over \(\theta\); for example, maximum likelihood estimation (MLE) takes

\[\hat\theta = \arg\max_\theta \mathcal L(\theta; \mathcal D) = \arg\max_\theta\, l(\theta; \mathcal D)\]

Because the log turns the product over data points into a sum, the log-likelihood is much more computationally tractable.
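A minimal sketch of MLE via the log-likelihood, using a Bernoulli model with made-up synthetic data and a simple grid search over \(\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)
data = (rng.random(1000) < 0.3).astype(float)   # i.i.d. samples from Bernoulli(0.3)

def log_likelihood(theta, x):
    # l(theta; D) = sum_m log p(x^(m) | theta) for a Bernoulli model
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Grid search over theta; the maximizer should be close to the sample mean.
grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax([log_likelihood(t, data) for t in grid])]

print(theta_hat, data.mean())   # both should be near 0.3
```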

Sufficient Statistics

A statistic is a (possibly vector valued) deterministic function of a (set of) random variable(s).

A sufficient statistic is a statistic that conveys exactly the same information about the generative distribution as the entire data itself.

A statistic \(T(X)\) is sufficient for \(X\) if

\[T(x^{(1)}) = T(x^{(2)})\implies \forall \theta\in\Theta. L(\theta; x^{(1)}) = L(\theta; x^{(2)})\]

equivalently,

\[p(\theta | T(X)) = p(\theta | X)\]

Neyman Factorization Theorem

If \(T\) is a sufficient statistic for \(X\), then there exist functions \(h, g\) such that

\[p(x \mid \theta) = h(x, T(x))\, g(T(x), \theta)\]

which means we can decompose the likelihood into a function of only \(x\) and \(T(x)\), and a function of only \(T(x)\) and \(\theta\).

For example, the normal distribution is given by

\[p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{1}{2\sigma^2}(x-\mu)^2)\]

Then, letting \(T(x) = [x, x^2]^T\), we have that

\[\begin{align*} p(x|\mu, \sigma) &= \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)\\ &= \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)\exp\left(\left[\frac{\mu}{\sigma^2}, \frac{-1}{2\sigma^2}\right] \cdot [x, x^2]\right) \end{align*}\]

where \(h(x, T(x)) = 1\) and \(g(T(x), \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)\exp\left(\left[\frac{\mu}{\sigma^2}, \frac{-1}{2\sigma^2}\right] \cdot T(x)\right)\), so the dependence on \(x\) enters only through \(T(x)\).
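A minimal sketch (with made-up data and parameters) verifying numerically that the Gaussian log-likelihood of an i.i.d. dataset depends on the data only through \(\sum_m x^{(m)}\) and \(\sum_m (x^{(m)})^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100)   # made-up Gaussian data
mu, sigma = 2.0, 1.5                 # parameters theta = (mu, sigma)
M = len(x)

# Direct log-likelihood: sum_m log p(x^(m) | mu, sigma).
ll_direct = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# The same quantity computed only from the sufficient statistics T = (sum x, sum x^2).
T1, T2 = x.sum(), (x**2).sum()
ll_from_T = (-0.5 * M * np.log(2 * np.pi * sigma**2)
             + (mu / sigma**2) * T1
             - T2 / (2 * sigma**2)
             - M * mu**2 / (2 * sigma**2))

print(np.isclose(ll_direct, ll_from_T))  # -> True
```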