Probabilistic Models
Given random variables \(X = (X_1, ..., X_N)\) that are either observed or unobserved, we want a model that captures the relationships between these variables. The approach of probabilistic generative models is to relate all variables through a learned joint probability distribution \(p_\theta(X_1, ..., X_N)\).
We make the assumption that the variables are generated by some distribution \((X_1,..., X_N) \sim p_*(X)\). Then, density estimation learns the joint probability distribution by choosing the parameters \(\theta\) of a specified parametric joint distribution \(p_\theta(X)\) that best matches \(p_*(X)\).
Therefore, as in a classical statistics course, we are interested in:
- How to specify \(p_\theta\) (which class of distributions)
- How to measure the "best match" (goodness of fit)
- How to find the best \(\theta\) (MLE, MAP, etc.)
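As a concrete (hypothetical) illustration of this recipe, here is a minimal sketch in NumPy that fits a univariate Gaussian \(p_\theta\) to samples by maximum likelihood; the data-generating parameters and variable names are invented for illustration.

```python
import numpy as np

# Hypothetical example: density estimation with a parametric family.
# Assume the true distribution p_* is a Gaussian and fit the parameters
# theta = (mu, sigma) of p_theta by maximum likelihood.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)   # samples from p_*

# Closed-form MLE for a univariate Gaussian:
mu_hat = data.mean()        # maximizes sum_m log p(x_m | theta)
sigma_hat = data.std()      # (biased) MLE of the standard deviation

def log_p(x, mu, sigma):
    """Log-density of the fitted model p_theta(x)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

print(mu_hat, sigma_hat, log_p(2.0, mu_hat, sigma_hat))
```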
Probabilistic Perspectives on ML Tasks
For an ML task, we are given input data \(X \subseteq \mathcal X\) and output data \(Y \subseteq \mathcal Y\) (for continuous outputs) or \(C \subseteq \mathcal C\) (for discrete classes, or labels).
From the deep learning (neural network) perspective, we want to recover some function \(f_\Theta: \mathcal X\rightarrow \mathcal Y\). By the universal approximation theorem, we use the parameters of the NN as the parameters \(\Theta\), and we measure how well our \(f_\Theta\) matches the underlying \(f_*\) by defining a task-specific loss.
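As a minimal (made-up) illustration of this function-approximation view, the following NumPy sketch fits a tiny one-hidden-layer network \(f_\Theta\) to samples of an unknown \(f_*\) (a sine, chosen arbitrarily) by gradient descent on a squared-error loss; the architecture and hyperparameters are invented for illustration.

```python
import numpy as np

# Hypothetical sketch: recover an unknown f_* by choosing NN parameters Theta
# that minimize a task-specific loss (here squared error).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(256, 1))
Y = np.sin(X)                                   # stand-in for f_*

# One-hidden-layer network f_Theta(x) = W2 @ tanh(W1 x + b1) + b2.
W1, b1 = rng.normal(0, 1, (16, 1)), np.zeros((16, 1))
W2, b2 = rng.normal(0, 0.1, (1, 16)), np.zeros((1, 1))

lr = 0.05
for step in range(2000):
    H = np.tanh(W1 @ X.T + b1)                  # hidden activations, (16, 256)
    pred = W2 @ H + b2                          # network outputs, (1, 256)
    err = pred - Y.T
    loss = np.mean(err**2)                      # task-specific loss

    # Backpropagation by hand.
    d_pred = 2 * err / err.size
    dW2 = d_pred @ H.T
    db2 = d_pred.sum(axis=1, keepdims=True)
    dH = W2.T @ d_pred
    dZ = dH * (1 - H**2)                        # derivative of tanh
    dW1 = dZ @ X
    db1 = dZ.sum(axis=1, keepdims=True)

    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    if step % 500 == 0:
        print(step, loss)
```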
In the language of probability theory, instead of datasets we treat \(X, Y, C\) as random variables, and each data point as a sample. Then we have a joint probability distribution over these random variables, \(P(X, Y)\) or \(P(X, C)\), and by Bayes' theorem \[P(C \mid X) = \frac{P(X, C)}{P(X)} = \frac{P(X \mid C)\,P(C)}{P(X)}.\]
Observed vs. Unobserved
In this probabilistic perspective, the general distinction between supervised and unsupervised learning is given by whether a random variable is observed or unobserved.
For a supervised dataset \(\{(x_i, c_i)\}_{i=1}^N \sim p(X, C)\), the only thing we need to find is the conditional distribution \(p(C|X)\), so that for each input \(x\) we can pick the label with the maximum conditional probability, \(\hat c = \arg\max_c p(C = c \mid X = x)\).
For an unsupervised dataset \(\{x_i\}_{i=1}^N \sim p(X, C)\), we do not change the generative assumption: each data point \(x_i\) is still generated according to some class label \(C = c_i\), even though that label is unknown in the dataset.
Thus, the goal is the same: we want to find the conditional distribution \(p(C|X)\).
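To make this concrete, here is a minimal sketch (with made-up table sizes) showing that, once the joint \(p(X, C)\) over a discrete input and a class label is known, classification is just conditioning and taking the argmax.

```python
import numpy as np

# Hypothetical example: classification as conditional probability.
# Assume a discrete input X with 4 states, a label C with 3 states,
# and a known (or learned) joint table p(X, C).
rng = np.random.default_rng(1)
joint = rng.random((4, 3))
joint /= joint.sum()                      # p(X, C), entries sum to 1

p_x = joint.sum(axis=1, keepdims=True)    # marginal p(X)
p_c_given_x = joint / p_x                 # p(C | X) = p(X, C) / p(X)

x = 2                                     # an observed input state
label = np.argmax(p_c_given_x[x])         # pick the most probable class
print(p_c_given_x[x], label)
```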
Furthermore, we may have latent (hidden) variables that are not observed but contribute significantly to the model. By introducing latent variables, we can naturally describe and capture abstract features of our input data.
Tasks on Probabilistic Models
The fundamental operations we will perform on a probabilistic model are:
- Generation: sample new data points from the model.
- Likelihood estimation: marginalize or observe all variables to obtain the probability of those variables taking on specific values.
- Inference: compute the expected value of some variables given others, which are either observed or marginalized.
- Learning: set the parameters of the joint distribution, given some observed data, to maximize the probability of the observed data.
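A minimal sketch of these four operations on a made-up discrete joint distribution (state counts and data are invented for illustration):

```python
import numpy as np

# Hypothetical tiny model p(X, C) over an input X (4 states) and class C (3 states).
rng = np.random.default_rng(0)
theta = rng.random((4, 3))
theta /= theta.sum()                      # parameters of the joint p(X, C)

# 1. Generation: sample an (x, c) pair from the joint.
flat = rng.choice(theta.size, p=theta.ravel())
x, c = np.unravel_index(flat, theta.shape)

# 2. Likelihood: probability of particular values.
lik = theta[x, c]                         # all variables observed
lik_x = theta[x, :].sum()                 # C marginalized out

# 3. Inference: expected value of C given X = x (treating states as 0, 1, 2).
p_c_given_x = theta[x, :] / theta[x, :].sum()
expected_c = np.dot(np.arange(3), p_c_given_x)

# 4. Learning: re-estimate theta from observed data by counting
#    (the MLE for a fully observed discrete joint).
data = [(0, 1), (0, 1), (2, 0), (3, 2)]   # made-up observations
counts = np.zeros((4, 3))
for xi, ci in data:
    counts[xi, ci] += 1
theta_hat = counts / counts.sum()
```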
Challenges with Probabilistic Models
- Efficient computation of marginal and conditional distributions. Note that marginalization often involves integration, which is frequently computationally intractable.
- A compact representation, so that the size of the parameterization scales well for joint distributions over many variables.
Unfortunately, achieving these two properties remains a major challenge for probabilistic models in ML tasks.
Parameterized Joint Distribution
Consider the joint distribution of a few (finitely many) discrete random variables, say three random variables \(X_1 \in \{x_{11}, x_{12}, ..., x_{1N_1}\}\), \(X_2 \in \{x_{21}, x_{22}, ..., x_{2N_2}\}\), \(X_3 \in \{x_{31}, x_{32}, ..., x_{3N_3}\}\), where \(N_i\) is the number of states of \(X_i\).
Then we define a valid probability distribution, so that \(P(X)\geq 0\) and \(\sum_x P(X=x) = 1\), by assigning (parameterizing) a probability to each possible combination of states. Thus, for each \(x_{1i}\in X_1, x_{2j}\in X_2, x_{3k}\in X_3\), we have \[P(X_1 = x_{1i}, X_2 = x_{2j}, X_3 = x_{3k}) = \theta_{ijk}.\]
Note that there are \(N_1\times N_2\times N_3\) parameters.
With the joint distribution, we can compute marginals such as \[P(X_1) = \sum_{X_2}\sum_{X_3} P(X_1, X_2, X_3),\]
or conditional probabilities through Bayesian inference, such as \[P(X_1 \mid X_3) = \frac{P(X_1, X_3)}{P(X_3)},\] where both the numerator and the denominator are computed from marginals of the joint.
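A minimal NumPy sketch of this table representation (the state counts are made up): the joint is a 3-D array of parameters, marginals are sums over axes, and conditionals are ratios of marginals.

```python
import numpy as np

# Hypothetical fully parameterized joint over three discrete variables
# with N1 = 2, N2 = 3, N3 = 4 states.
rng = np.random.default_rng(0)
theta = rng.random((2, 3, 4))
theta /= theta.sum()               # N1 * N2 * N3 = 24 parameters, summing to 1

# Marginal p(X1): sum out X2 and X3.
p_x1 = theta.sum(axis=(1, 2))

# Conditional p(X1 | X3 = k): marginal over X2, divided by p(X3 = k).
k = 1
p_x1_x3 = theta.sum(axis=1)                        # p(X1, X3), shape (2, 4)
p_x1_given_x3k = p_x1_x3[:, k] / p_x1_x3[:, k].sum()
print(p_x1, p_x1_given_x3k)
```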
Dimensionality
As mentioned, a fully parameterized joint distribution has \(\prod_{i=1}^F N_i\) parameters, where \(F\) is the number of random variables, which is huge.
The primary way we will reduce this number is to make independence assumptions between variables. Note that if \(X_1\) and \(X_2\) are conditionally independent given \(X_3\), then \[p(X_1, X_2, X_3) = p(X_1 \mid X_3)\, p(X_2 \mid X_3)\, p(X_3).\]
Thus the number of parameters is only on the order of \((N_1 + N_2)N_3\).
For random variables \(X_1,...,X_F\), if all of them are mutually independent, then \(p(X_1,..., X_F) = \prod_{i=1}^F p(X_i)\), and we only need \(\sum_{i=1}^F N_i\) parameters.
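To see the scale of the savings, a small made-up comparison: with \(F = 20\) binary variables, the full joint table has \(2^{20} \approx 10^6\) parameters, while full mutual independence needs only 40.

```python
# Hypothetical parameter counts for F = 20 binary variables.
N = [2] * 20

full_joint = 1
for n in N:
    full_joint *= n              # prod_i N_i = 2**20 parameters

fully_independent = sum(N)       # sum_i N_i = 40 parameters

print(full_joint, fully_independent)
```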
Likelihood Function
For a given set of parameters describing the distribution, we have so far focused on the density function \(p(x|\theta)\). However, our goal is to optimize over the (non-fixed) parameters \(\theta\) given the (fixed) observations \(\mathcal D\).
Thus, we define the likelihood function \(\mathcal L(\theta; x) = p(x|\theta)\) and the log-likelihood function \(l(\theta; x) = \log(\mathcal L(\theta; x))\).
Suppose that the observations \(x^{(m)}\) in the dataset \(\mathcal D = \{x^{(m)}\}_{m=1}^M\) are i.i.d.; then \[\mathcal L(\theta; \mathcal D) = \prod_{m=1}^M p(x^{(m)} \mid \theta), \qquad l(\theta; \mathcal D) = \sum_{m=1}^M \log p(x^{(m)} \mid \theta).\]
Note that the goal of the likelihood function is to optimize for \(\theta\); for example, maximum likelihood estimation (MLE) chooses \[\hat\theta_{\text{MLE}} = \arg\max_\theta \mathcal L(\theta; \mathcal D) = \arg\max_\theta l(\theta; \mathcal D).\]
Thus the log-likelihood, which turns the product over data points into a sum of logs, is much more computationally tractable.
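A quick sketch (with made-up data, using SciPy's `norm`) of why the log-likelihood is preferred numerically: the product of thousands of densities underflows to zero, while the sum of log-densities stays well-behaved.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical illustration: raw likelihood of many i.i.d. points underflows.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)

theta = (0.0, 1.0)                        # (mu, sigma)
lik = np.prod(norm.pdf(x, *theta))        # product of densities -> 0.0 (underflow)
log_lik = np.sum(norm.logpdf(x, *theta))  # sum of log-densities: finite, usable

print(lik, log_lik)
```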
Sufficient Statistics
A statistic is a (possibly vector valued) deterministic function of a (set of) random variable(s).
A sufficient statistic is a statistic that conveys exactly the same information about the generative distribution as the entire data itself.
A statistic \(T(X)\) is sufficient for \(X\) if \[T(x^{(1)}) = T(x^{(2)}) \implies \mathcal L(\theta; x^{(1)}) = \mathcal L(\theta; x^{(2)}) \quad \forall \theta,\]
equivalently, the conditional distribution of \(X\) given \(T(X)\) does not depend on \(\theta\): \[p(x \mid T(x), \theta) = p(x \mid T(x)).\]
Neyman Factorization Theorem
If \(T\) is a sufficient statistic for \(X\), then there exist functions \(h, g\) such that \[p(x \mid \theta) = h(x, T(x))\, g(T(x), \theta),\]
which means we can decompose \(p(x \mid \theta)\) into a function of only \(x\) and \(T(x)\), and a function of only \(T(x)\) and \(\theta\).
For example, the normal distribution is given by \[p(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \qquad \theta = (\mu, \sigma).\]
Then, letting \(T(x) = [x, x^2]^T\), we have \[p(x \mid \theta) = \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\sigma} \exp\left(-\frac{\mu^2}{2\sigma^2}\right) \exp\left(\frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2\right) = h(x, T(x))\, g(T(x), \theta),\]
where \(h(x, T(x)) = \frac{1}{\sqrt{2\pi}}\) and \(g(T(x), \theta) = \frac{1}{\sigma} \exp\left(-\frac{\mu^2}{2\sigma^2}\right) \exp\left(\left[\frac{\mu}{\sigma^2}, \frac{-1}{2\sigma^2}\right] \cdot T(x)\right)\).
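A small numerical sketch (the data and candidate parameters are made up) that the Gaussian log-likelihood depends on the data only through the sufficient statistics \(\sum_m x^{(m)}\) and \(\sum_m (x^{(m)})^2\), together with the sample size:

```python
import numpy as np

# Hypothetical check that T(x) = [sum x, sum x^2] (plus the count M) carries
# all the information the Gaussian log-likelihood needs about the data.
rng = np.random.default_rng(0)
x = rng.normal(1.5, 2.0, size=500)
mu, sigma = 1.0, 2.5                        # an arbitrary candidate theta

# Log-likelihood from the raw data.
ll_raw = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                - (x - mu)**2 / (2 * sigma**2))

# The same quantity from the sufficient statistics alone.
M, s1, s2 = len(x), x.sum(), (x**2).sum()
ll_suff = (-0.5 * M * np.log(2 * np.pi * sigma**2)
           - (s2 - 2 * mu * s1 + M * mu**2) / (2 * sigma**2))

print(np.isclose(ll_raw, ll_suff))          # True: same value for every theta
```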