Bias-Variance Decomposition
Loss function
A loss function \(L(y,t)\) defines how bad it is if, for some example \(x\), the algorithm predicts \(y\), but the target is actually \(t\)
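For example, squared error loss \(L(y,t) = \frac{1}{2}(y - t)^2\) is a common choice for regression, and is the loss used in the decomposition below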
Bias-Variance Decomposition
Suppose the training set \(\mathcal D\) consists of \(N\) pairs sampled IID from a sample generating distribution, i.e. \((x^{(i)}, t^{(i)})\sim p_{sample}\)
Let \(p_{\mathcal D}\) denote the induced distribution over training sets, i.e. \(\mathcal D\sim p_{\mathcal D}\)
Pick a fixed query point \(\vec x\), then consider an experiment where we sample many (say \(m\)) training datasets IID from \(p_{\mathcal D}\)
Then, we can train classifiers \(h_{\mathcal D_k}\), \(k\in\{1,2,\dots,m\}\), and compute each classifier's prediction \(h_{\mathcal D_k}(\vec x) = y\) at the chosen query point \(\vec x\)
Therefore, \(y\) is a random variable; its randomness comes from the random choice of the training set, via \(\mathcal D\Rightarrow h_{\mathcal D}\Rightarrow h_{\mathcal D}(\vec x) = y\)
Basic setup
Assume \(t\) is deterministic given \(\vec x\)
There is a distribution over the loss at \(\vec x\), with expectation \(E_{\mathcal D\sim p_{\mathcal D}}(L(h_{\mathcal D}(\vec x), t))\)
For each query point, the expected loss is different. We are interested in quantifying how well our classifier does over \(p_{sample}\), averaged over training sets: \(E_{\vec x\sim p_{sample},\, \mathcal D\sim p_{\mathcal D}}(L(h_{\mathcal D}(\vec x), t))\)
Then, for squared error loss, we can decompose the expected loss at \(\vec x\):
\[E_{\mathcal D}\big[(h_{\mathcal D}(\vec x) - t)^2\big] = \underbrace{\big(E_{\mathcal D}[h_{\mathcal D}(\vec x)] - t\big)^2}_{\text{bias}} + \underbrace{\mathrm{Var}_{\mathcal D}\big[h_{\mathcal D}(\vec x)\big]}_{\text{variance}}\]
Bias measures how close, on average, the classifier is to the true target
Variance measures how widely dispersed the predictions are as we generate new datasets
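To make this concrete, here is a minimal simulation of the whole experiment (my own sketch, not from the notes; the data distribution \(t = \sin(x)\), the degree-3 polynomial learner, the query point, and \(m\), \(N\) are all arbitrary choices):

```python
# Sketch: estimate bias^2 and variance at a fixed query point by resampling
# training sets. Assumed setup: x ~ Uniform[0, 2*pi], t = sin(x) (deterministic),
# learner = degree-3 polynomial least squares.
import numpy as np

rng = np.random.default_rng(0)
m, N = 500, 20                      # number of training sets, examples per set
x_query = 1.0
t_query = np.sin(x_query)           # true target at the query point

preds = np.empty(m)
for k in range(m):
    # Sample a fresh training set D_k ~ p_D (N IID pairs from p_sample).
    x = rng.uniform(0.0, 2.0 * np.pi, size=N)
    t = np.sin(x)
    # Train h_{D_k} and record its prediction y = h_{D_k}(x_query).
    coeffs = np.polyfit(x, t, deg=3)
    preds[k] = np.polyval(coeffs, x_query)

bias_sq = (preds.mean() - t_query) ** 2          # (E_D[y] - t)^2
variance = preds.var()                           # Var_D[y]
expected_loss = np.mean((preds - t_query) ** 2)  # E_D[(y - t)^2]
print(f"bias^2 + variance = {bias_sq + variance:.6f}")
print(f"expected loss     = {expected_loss:.6f}")
```

Since \(t\) is deterministic here, there is no Bayes error term and the two printed quantities agree exactly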
Bayes Optimality
What if \(t\) is not deterministic given \(\vec x\), i.e. there is a conditional distribution \(p(t\mid \vec x)\)?
Since there is a distribution over targets, we measure the distance from \(y_*(\vec x) = E(t\mid \vec x)\); the expected loss at \(\vec x\) decomposes as
\[E_t\big[(y - t)^2 \mid \vec x\big] = \big(y - y_*(\vec x)\big)^2 + \mathrm{Var}[t\mid \vec x]\]
The first term shows that \(y = y_*(\vec x)\) is the value that minimizes the expected loss
Bayes error / irreducible error: the second term is the inherent unpredictability, or noise, of the target
An algorithm whose expected loss achieves this minimum, \(\mathrm{Var}[t\mid \vec x]\), is Bayes optimal; this is the best any learner can do
We can then decompose the expected loss in the non-deterministic case:
\[E_{\mathcal D, t}\big[(h_{\mathcal D}(\vec x) - t)^2\big] = \big(E_{\mathcal D}[h_{\mathcal D}(\vec x)] - y_*(\vec x)\big)^2 + \mathrm{Var}_{\mathcal D}\big[h_{\mathcal D}(\vec x)\big] + \mathrm{Var}[t\mid \vec x]\]
Hence we decompose out the Bayes error; the first two terms are identical to the deterministic case, with \(y_*(\vec x)\) in place of \(t\), and decompose into bias and variance
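A brief sketch of why the Bayes error separates out cleanly (writing \(y = h_{\mathcal D}(\vec x)\), \(y_* = y_*(\vec x)\), and using the fact that \(\mathcal D\) and \(t\) are independent, so the cross term vanishes):
\[
\begin{aligned}
E_{\mathcal D, t}\big[(y - t)^2\big]
&= E_{\mathcal D, t}\big[\big((y - y_*) + (y_* - t)\big)^2\big] \\
&= E_{\mathcal D}\big[(y - y_*)^2\big] + 2\, E_{\mathcal D}\big[y - y_*\big]\underbrace{E_t\big[y_* - t \mid \vec x\big]}_{=\,0} + \underbrace{E_t\big[(y_* - t)^2 \mid \vec x\big]}_{\mathrm{Var}[t\mid \vec x]}
\end{aligned}
\]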
Bagging
Intuition
Suppose we sample \(m\) independent training sets \(\mathcal D_i\) from \(p_{\mathcal D}\); we could then learn a predictor \(h_i := h_{\mathcal D_i}\) on each one, and take the average \(h = \frac{1}{m}\sum_{i=1}^m h_i\)
Bias unchanged, since \(E[h(\vec x)] = \frac{1}{m}\sum_{i=1}^m E[h_i(\vec x)] = E[h_i(\vec x)]\)
Variance becomes \(1/m\) of the original, since the \(h_i(\vec x)\) are independent: \(\mathrm{Var}[h(\vec x)] = \frac{1}{m^2}\sum_{i=1}^m \mathrm{Var}[h_i(\vec x)] = \frac{1}{m}\mathrm{Var}[h_i(\vec x)]\)
However, in practice we don't have \(m\) such IID datasets from \(p_{\mathcal D}\)
So we take a single dataset \(\mathcal D\) with \(n\) examples
Generate \(m\) new datasets (bootstrap samples) by sampling \(n\) training examples from \(\mathcal D\), with replacement
Average the predictions of the models trained on each of these
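A minimal sketch of this procedure (assumptions of mine: scikit-learn depth-3 regression trees as the base learner; `X_train`, `y_train`, and `m = 50` are illustrative):

```python
# Bagging sketch: train m models on bootstrap samples of D, average predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(m):
        # Sample n examples from D with replacement (one bootstrap dataset).
        idx = rng.integers(0, n, size=n)
        model = DecisionTreeRegressor(max_depth=3)
        model.fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    # Average the m predictions; for classification one would vote instead.
    return np.mean(preds, axis=0)
```

Because bootstrap samples overlap, the resulting predictors are correlated, which is exactly the problem discussed next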
Problem with independence
Let the pairwise correlation between predictions be \(\rho\) and each predictor's variance be \(\sigma^2\); the variance with correlated datasets is
\[\mathrm{Var}\big[h(\vec x)\big] = \rho\,\sigma^2 + \frac{1-\rho}{m}\,\sigma^2\]
(checked numerically in the sketch after the list below)
Ironically, introducing additional variability reduces the correlation between samples, and hence the variance of the average
- Intuition: invest in a diversified portfolio, not just one stock
- Similarly, it can help to average over multiple algorithms, or multiple configurations of the same algorithm
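A quick numerical check of the variance formula above (a sketch; the predictions are simulated directly as correlated Gaussians sharing a common factor rather than as trained models, and \(\rho\), \(\sigma\), \(m\) are arbitrary values I picked):

```python
# Each y_i = sqrt(rho)*z + sqrt(1-rho)*e_i has variance sigma^2 and
# pairwise correlation rho; we compare Var of their mean to the formula.
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, m, trials = 0.3, 2.0, 10, 200_000

z = rng.normal(0, sigma, size=(trials, 1))   # shared component
e = rng.normal(0, sigma, size=(trials, m))   # independent components
y = np.sqrt(rho) * z + np.sqrt(1 - rho) * e  # m correlated predictors
avg = y.mean(axis=1)

print("empirical Var[h]:", avg.var())
print("rho*sigma^2 + (1-rho)/m*sigma^2:", rho * sigma**2 + (1 - rho) / m * sigma**2)
```

At \(\rho = 0\) this recovers the independent case \(\sigma^2/m\), while at \(\rho = 1\) averaging does not help at all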
Example: Random Forests
When choosing a split at each node of the decision tree, pick a random subset of \(d\) features, and only consider splits on those features
The main idea is to improve the variance reduction of bagging by reducing the correlation between the trees
Random forests are a classic example of a black-box ML algorithm: they often work well with little or no tuning
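As a usage sketch, scikit-learn's implementation exposes the random feature subset through `max_features` (the \(d\) above); the dataset here is synthetic, purely for illustration:

```python
# Random forest usage sketch: 100 bagged trees, sqrt(n_features) random
# features considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print("training accuracy:", forest.score(X, y))
```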