Probabilistic Models
Likelihood Function
The density of the observed data, viewed as a function of the parameters \(\theta\): \(L(\theta) = p(\mathcal{D}|\theta)\).
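A minimal sketch of this idea (names and data are hypothetical): the same i.i.d. Gaussian data, scored under two candidate parameter settings. The likelihood is a function of \(\theta = (\mu, \sigma)\), not of the data.

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    # Sum of per-point log densities under N(mu, sigma^2),
    # viewed as a function of the parameters (mu, sigma).
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
        for x in data
    )

data = [1.2, 0.8, 1.1, 0.9]
# The likelihood ranks parameter settings: mu=1 explains this data
# far better than mu=5 does.
assert gaussian_log_likelihood(data, mu=1.0, sigma=1.0) > \
       gaussian_log_likelihood(data, mu=5.0, sigma=1.0)
```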
Approaches to classification
Discriminative approach: estimate the parameters of the decision boundary / class separator directly from labeled examples.
- How do I separate the classes?
- Learn \(p(t|x)\) directly (e.g., logistic regression)
- Learn a mapping from inputs to classes
Generative approach: model the distribution of inputs characteristic of each class (Bayes classifier).
- What does each class "look" like?
- Build a model of \(p(x|t)\)
- Apply Bayes rule
Bayes Classifier
Given features \(x = [x_1,...,x_D]^T\), we want to compute class probabilities using Bayes' rule:
\[
p(c|x) = \frac{p(x|c)\,p(c)}{p(x)}
\]
or, in words:
\[
\text{posterior} = \frac{\text{class likelihood} \times \text{prior}}{\text{evidence}}
\]
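A toy numeric sketch of the rule (the two classes and probabilities here are made up for illustration):

```python
# Hypothetical two-class example: compute p(c|x) from p(x|c) and p(c).
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {"spam": 0.8, "ham": 0.1}   # p(x|c) for one observed x

evidence = sum(prior[c] * likelihood[c] for c in prior)          # p(x)
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

# Posteriors are normalized by the evidence, so they sum to 1.
assert abs(sum(posterior.values()) - 1.0) < 1e-12
```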
Bayes Nets
We can represent this model using a directed graphical model, or Bayesian network.
This graph structure means the joint distribution factorizes as a product of conditional distributions, one for each variable given its parent(s).
Intuitively, you can think of the edges as reflecting a causal structure. But mathematically, this doesn't hold without additional assumptions.
The parameters can be learned efficiently because the log-likelihood decomposes into independent terms for each feature. Each log-likelihood term depends on a different set of parameters, so they can be optimized independently.
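A sketch of this decomposition for binary features (toy data, Bernoulli class-conditionals assumed): each parameter \(\theta_{jc} = p(x_j = 1|c)\) is estimated from simple per-feature counts, with no coupling between features.

```python
import numpy as np

# Toy binary dataset: N=4 examples, D=3 features, labels t in {0, 1}.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
t = np.array([1, 1, 0, 0])

# Because the log-likelihood decomposes per feature, each theta_{jc}
# is just the fraction of class-c examples with x_j = 1, computed
# independently of every other parameter.
theta = {c: X[t == c].mean(axis=0) for c in (0, 1)}
pi = {c: (t == c).mean() for c in (0, 1)}   # class priors p(c)
```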
Bayes Inference
For input \(x\), predict by comparing the values of \(p(c)\prod_{j=1}^{D} p(x_j|c)\) for different \(c\).
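This comparison can be sketched in log space (the parameter values below are hypothetical, standing in for estimates already learned):

```python
import numpy as np

# Assumed pre-estimated Bernoulli naive Bayes parameters.
log_prior = {0: np.log(0.5), 1: np.log(0.5)}
theta = {0: np.array([0.2, 0.5]),          # p(x_j = 1 | c = 0)
         1: np.array([0.9, 0.5])}          # p(x_j = 1 | c = 1)

def predict(x):
    # Compare log p(c) + sum_j log p(x_j|c) across classes; the
    # evidence p(x) is the same for every c, so it can be dropped.
    scores = {
        c: log_prior[c]
           + np.sum(x * np.log(theta[c]) + (1 - x) * np.log(1 - theta[c]))
        for c in (0, 1)
    }
    return max(scores, key=scores.get)

assert predict(np.array([1, 1])) == 1
assert predict(np.array([0, 0])) == 0
```

Working in logs avoids underflow when \(D\) is large, since the product of many small probabilities becomes a sum.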
Bayesian Parameter Estimation
The Bayesian approach treats the parameters as random variables; \(\beta\) denotes the set of hyperparameters of the prior distribution of \(\theta\).
To define a Bayesian model, we need to specify two distributions:
- Prior distribution \(p(\theta)\), which encodes our beliefs about the parameters before we observe the data.
- Likelihood \(p(\mathcal{D}|\theta)\), the same as in maximum likelihood estimation.
When we update our beliefs based on the observations, we compute the posterior distribution using Bayes' rule: \(p(\theta|\mathcal{D}) \propto p(\theta)\,p(\mathcal{D}|\theta)\).
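A standard illustration of this update (not from the notes) is the Beta-Bernoulli conjugate pair: a \(\mathrm{Beta}(a, b)\) prior on a coin's bias, updated by counts of heads and tails.

```python
# Beta-Bernoulli conjugate update: prior Beta(a, b); after observing
# N1 heads and N0 tails, the posterior is Beta(a + N1, b + N0).
a, b = 2.0, 2.0                    # prior hyperparameters (the beta above)
observations = [1, 1, 0, 1, 1]
N1 = sum(observations)             # heads
N0 = len(observations) - N1        # tails

a_post, b_post = a + N1, b + N0    # posterior = Beta(6, 3)
posterior_mean = a_post / (a_post + b_post)
```

The prior counts \(a, b\) act like pseudo-observations, so the posterior mean sits between the prior mean and the observed frequency.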
Maximum A-Posteriori Estimation
Find the most likely parameter setting under the posterior: \(\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta|\mathcal{D}) = \arg\max_\theta p(\theta)\,p(\mathcal{D}|\theta)\).
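Continuing the Beta-Bernoulli illustration (an assumed example, not from the notes): the MAP estimate is the mode of the Beta posterior, which differs from the MLE because the prior pulls it toward 0.5.

```python
# MAP under a Beta(a, b) posterior: mode = (a - 1) / (a + b - 2) for a, b > 1.
a_post, b_post = 6.0, 3.0                           # e.g. Beta(2,2) prior + 4 heads, 1 tail
theta_map = (a_post - 1) / (a_post + b_post - 2)    # 5/7
theta_mle = 4 / 5                                   # raw heads fraction, no prior

assert theta_map < theta_mle   # the prior shrinks the estimate toward 0.5
```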
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
Make decisions by comparing class posteriors \(p(t_k|\mathbf{x})\). Expanded (taking logs of the Gaussian class-conditional density):
\[
\log p(t_k|\mathbf{x}) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) + \log p(t_k) - \log p(\mathbf{x})
\]
Decision Boundary
The decision boundary is quadratic, since the log of a Gaussian density is quadratic in \(\mathbf{x}\). When the two Gaussians share the same covariance, the quadratic terms cancel and the decision boundary is linear.
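Setting the two class log-posteriors equal shows why the shared-covariance boundary is linear: with \(\boldsymbol{\Sigma}_0 = \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}\), the \(\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x}\) terms cancel, leaving
\[
(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^\top \boldsymbol{\Sigma}^{-1}\mathbf{x} = \frac{1}{2}\left(\boldsymbol{\mu}_1^\top \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0^\top \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_0\right) + \log\frac{p(t_0)}{p(t_1)},
\]
which is linear in \(\mathbf{x}\).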
Properties of Gaussian Distribution
\(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{\Sigma})\) is defined as
\[
p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
\]
Empirical Mean \(\hat{\boldsymbol{\mu}}=\frac{1}{N} \sum_{i=1}^{N} \mathbf{x}^{(i)}\)
Empirical Covariance \(\hat{\mathbf{\Sigma}}=\frac{1}{N} \sum_{i=1}^{N}\left(\mathbf{x}^{(i)}-\hat{\boldsymbol{\mu}}\right)\left(\mathbf{x}^{(i)}-\hat{\boldsymbol{\mu}}\right)^{\top}\)
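These two estimators translate directly to code (toy data, using the biased \(1/N\) covariance as in the formula above):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 0.0]])              # N=3 samples, D=2 features

mu_hat = X.mean(axis=0)                 # empirical mean
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / X.shape[0]   # empirical covariance (1/N)

# Matches NumPy's built-in biased covariance estimator.
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
```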
GDA vs. Logistic Regression
- GDA is a generative model, while LR is a discriminative model.
- GDA makes stronger modelling assumptions: it assumes the class-conditional distributions are Gaussian. When this assumption is true, GDA is asymptotically efficient.
- LR is more robust and less sensitive to incorrect modelling assumptions (LR minimizes cross-entropy and makes no assumption about \(p(x|t)\)).
- Many class-conditional distributions lead to a logistic classifier.