Linear Regression
Modular approach to a problem
- Choose a model describing the relationships between the variables of interest
- Choose a loss function quantifying how bad the fit is
- Choose a regularizer saying how much we prefer different candidate explanations
- Fit the model using an optimization algorithm
For supervised learning
Given: target \(t\in\mathcal T\) (response, outcome, output, class)
features \(x\in\mathcal X\) (inputs, covariates, design)
Objective: learn a function \(f:\mathcal X \rightarrow \mathcal T\) s.t. \(t\approx y = f(x)\)
Linear Regression
Model
\(y = \vec w^\top \vec x + b = \sum_j w_j x_j + b\)
where \(\vec x\) is the input, \(y\) is the prediction, \(\vec w\) is the weights, and \(b\) is the bias
\(\vec w, b\) are the parameters
In matrix form
\(\vec y = X\vec w\), where \(X\) is the \(N\times(D+1)\) design matrix whose \(i\)-th row is \(\left[1, x^{(i)}_1, \dots, x^{(i)}_D\right]\), so the bias is absorbed as \(w_0 = b\)
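A minimal NumPy sketch of this vectorized prediction (the data, `X_raw`, `w`, and `b` below are made-up illustrations, not values from the notes):

```python
import numpy as np

# Made-up example: N = 4 examples, D = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [0.5, 1.5],
                  [3.0, 0.0],
                  [2.0, 2.0]])
w = np.array([0.3, -0.7])   # weights
b = 0.1                     # bias

# Fold the bias into the weights via a leading column of ones, so w0 = b.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
w_full = np.concatenate([[b], w])

y = X @ w_full              # all N predictions in one matrix product
```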
Loss Function (Squared error)
\(\mathcal L(y, t) = \frac{(y-t)^2}{2}\), where \(y-t\) is the residual and the factor \(\frac{1}{2}\) just makes the calculations convenient.
Therefore, define the cost function to be the average loss over all training examples:
\(\mathcal J(\vec w) = \frac{1}{N}\sum_{i=1}^N \frac{\left(y^{(i)}-t^{(i)}\right)^2}{2} = \frac{1}{2N}\left\|X\vec w - \vec t\right\|^2_2\)
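A sketch of this cost in NumPy, assuming \(X\) already includes the bias column as above:

```python
import numpy as np

def cost(w, X, t):
    """Average squared-error cost J(w) = (1/N) * sum_i (y_i - t_i)^2 / 2."""
    y = X @ w                         # predictions for all examples
    return np.mean((y - t) ** 2) / 2  # the 1/2 matches the loss above
```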
To minimize the loss/cost, set \(\partial_{w_j} \mathcal J = 0\) for all \(j \in \{0,1,2,\dots, D\}\), where \(w_0 = b\)
The resulting direct solution (the normal equations) is \(\vec w = (X^\top X)^{-1}X^\top \vec t\)
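A minimal sketch of this direct solution in NumPy, assuming \(X^\top X\) is invertible:

```python
import numpy as np

def fit_least_squares(X, t):
    """Solve the normal equations (X^T X) w = X^T t directly.

    np.linalg.solve is more stable than forming the inverse explicitly;
    use np.linalg.lstsq instead if X^T X may be singular.
    """
    return np.linalg.solve(X.T @ X, X.T @ t)
```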
Improving model: Polynomial curve fitting
Consider a feature mapping \(\psi(x):\mathbb R^D\rightarrow \mathbb R^M\), for example \(x\in\mathbb R\), \(\psi(x) = [1, x, x^2]^T\); we then fit \(y = \vec w^\top \psi(x)\) on the mapped features, which is still linear in the parameters \(\vec w\)
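For the scalar case, the polynomial design matrix can be built with NumPy's `np.vander` (the function name `poly_design_matrix` here is ours, a sketch only):

```python
import numpy as np

def poly_design_matrix(x, M):
    """Rows are psi(x_i) = [1, x_i, x_i^2, ..., x_i^(M-1)] for scalar inputs."""
    return np.vander(x, M, increasing=True)

# poly_design_matrix(np.array([1.0, 2.0]), 3)
# -> [[1., 1., 1.],
#     [1., 2., 4.]]
```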
Underfit and Overfit
Underfit: the model is too simple to fit the data
Overfit: the model is too complex; it fits the training data (including its noise) almost perfectly but generalizes poorly
Improving model: L2 Regularization
A function that quantifies how much we prefer one hypothesis vs. another
We encourage the weights to be small by choosing as our regularizer the \(L^2\) penalty \(\mathcal R(\vec w) := \frac{\|\vec w\|^2_2}{2}\)
The regularized cost function \(\mathcal J_{reg}(\vec w) = \mathcal J(\vec w) + \lambda\,\mathcal R(\vec w)\) makes a trade-off between the fit to the data and the norm of the weights
Hence \(\lambda\) is a hyperparameter that we can tune on a validation set; it controls how strongly large weights are penalized.
When measuring validation error, we still measure \(\mathcal J(\vec w)\), but for training we minimize \(\mathcal J_{reg}(\vec w)\); the validation error then guides the choice of hyperparameters such as \(M\)
Problem: we need to make sure the \(x_i\)'s have approximately the same units, so that \(\mathcal R(\vec w)\) is not dominated by a few feature weights
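One common fix, sketched below (not spelled out in the notes), is to standardize each feature before fitting:

```python
import numpy as np

def standardize(X_raw):
    """Rescale each feature (column) to zero mean and unit variance."""
    mu = X_raw.mean(axis=0)
    sigma = X_raw.std(axis=0)
    return (X_raw - mu) / sigma  # assumes no constant (zero-variance) feature
```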
For LS, the regularized cost again has a direct solution: setting \(\nabla \mathcal J_{reg} = \frac{1}{N}X^\top(X\vec w - \vec t) + \lambda \vec w = \vec 0\) gives \(\vec w = (X^\top X + \lambda N I)^{-1}X^\top \vec t\)
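A sketch of this regularized direct solution, under the averaging convention above:

```python
import numpy as np

def fit_ridge(X, t, lam):
    """Regularized direct solution w = (X^T X + lam * N * I)^{-1} X^T t.

    The factor N appears because J(w) averages over examples while R(w)
    does not. Note this version also shrinks the bias w0, which one
    often exempts in practice.
    """
    N, D1 = X.shape
    return np.linalg.solve(X.T @ X + lam * N * np.eye(D1), X.T @ t)
```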
L1 Regularization
The \(L^1\) penalty \(\mathcal R_{L^1}(\vec w) = \sum_i |w_i|\) encourages weights to be exactly zero (sparsity); more generally, we can design regularizers based on whatever property we'd like to encourage.
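To see why the \(L^1\) penalty yields exact zeros, consider the one-dimensional problem \(\min_w \frac{(w-v)^2}{2} + \lambda|w|\); its solution is the soft-thresholding operator, sketched below (the function name is ours, not from the notes):

```python
import numpy as np

def soft_threshold(v, lam):
    """argmin_w (w - v)^2 / 2 + lam * |w|: shrinks v toward zero,
    and snaps it to exactly 0 whenever |v| <= lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# soft_threshold(0.3, 0.5) -> 0.0   (small weights become exactly zero)
# soft_threshold(2.0, 0.5) -> 1.5
```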
Conclusion
- Choose a model and a loss function
- Formulate an optimization problem
- Solve the minimization problem using either a direct solution (set the derivative to zero) or gradient descent (start with a guess and slowly change \(\vec w\) to decrease the cost; useful when a direct solution is unavailable; see the sketch after this list)
- Vectorize the computations
- Use feature maps to get a more powerful linear model
- Improve generalization by adding a regularizer
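A minimal gradient-descent sketch for the squared-error cost (the learning rate and iteration count are illustrative, not from the notes):

```python
import numpy as np

def fit_gradient_descent(X, t, lr=0.1, n_iters=1000):
    """Minimize J(w) = (1/2N) ||Xw - t||^2 by gradient descent.

    grad J(w) = (1/N) X^T (Xw - t); start from w = 0 and repeatedly
    take small steps opposite the gradient.
    """
    N, D1 = X.shape
    w = np.zeros(D1)                    # initial guess
    for _ in range(n_iters):
        grad = X.T @ (X @ w - t) / N    # gradient of the average cost
        w -= lr * grad                  # step against the gradient
    return w
```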