Let the target $t \in \{0, 1\}$ be the binary class label. We use a linear model $z = \mathbf{w}^T\mathbf{x}$ with the hard threshold $y = \mathbb{I}(z \ge 0)$, where each input $\mathbf{x}$ is augmented with a dummy feature equal to 1, so that the bias is absorbed into $\mathbf{w}$ and the threshold is always 0.
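A minimal sketch of this setup, assuming NumPy; the weight values below are illustrative, not learned:

```python
import numpy as np

def predict(w, x):
    """Hard-threshold linear classifier: y = I(w^T x >= 0).

    x is assumed to already contain the dummy feature 1 as its first
    entry, so the bias is folded into w and the threshold is 0.
    """
    z = np.dot(w, x)
    return int(z >= 0)

# Illustrative, hand-picked weights: [bias weight, feature weight].
w = np.array([1.0, -2.0])
x = np.array([1.0, 0.0])   # dummy 1, then the raw input x = 0
print(predict(w, x))       # -> 1
```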
Geometric Picture
Example: the target is $t = \text{NOT } x$ with input $x \in \{0, 1\}$.
Input space
The weights (hypothesis) partition the input space into half-spaces
$$\mathcal{H}_+ = \{\mathbf{x} \mid \mathbf{w}^T\mathbf{x} \ge 0\}, \qquad \mathcal{H}_- = \{\mathbf{x} \mid \mathbf{w}^T\mathbf{x} < 0\}.$$
The boundary $\{\mathbf{x} \mid \mathbf{w}^T\mathbf{x} = 0\}$ is the decision boundary.
If the training examples can be perfectly separated by a linear decision rule, we say the data is linearly separable.
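For instance, the NOT data above is linearly separable: with the dummy feature included, a weight vector such as $\mathbf{w} = (1, -2)$ (one of many valid choices) classifies both points correctly:
$$x = 0:\ z = 1 \cdot 1 + (-2) \cdot 0 = 1 \ge 0 \Rightarrow y = 1 = t, \qquad x = 1:\ z = 1 \cdot 1 + (-2) \cdot 1 = -1 < 0 \Rightarrow y = 0 = t.$$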
Weight space
Each training example $\mathbf{x}$ specifies a half-space that $\mathbf{w}$ must lie in to be correctly classified: $\mathbf{w}^T\mathbf{x} > 0$ if $t = 1$ and $\mathbf{w}^T\mathbf{x} < 0$ if $t = 0$. The region satisfying all the constraints is the feasible region. The problem is feasible if this region is non-empty, otherwise it is infeasible.
Note that if the training set is separable, we can find a valid $\mathbf{w}$ by solving a linear program, since the constraints above are linear in $\mathbf{w}$.
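A minimal sketch of this feasibility check on the toy NOT data, assuming SciPy is available; the margin of 1 in the constraints is an arbitrary choice to avoid strict inequalities:

```python
import numpy as np
from scipy.optimize import linprog

# Toy NOT dataset with the dummy feature 1 prepended.
X = np.array([[1.0, 0.0],    # x = 0
              [1.0, 1.0]])   # x = 1
t = np.array([1, 0])         # t = NOT x

# Require w^T x >= 1 if t = 1 and w^T x <= -1 if t = 0, i.e.
# s_i * (w^T x_i) >= 1 with s_i = +/-1, rewritten as A_ub @ w <= b_ub.
s = np.where(t == 1, 1.0, -1.0)
A_ub = -(s[:, None] * X)
b_ub = -np.ones(len(t))

# Any feasible point will do, so the objective is zero.
res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * X.shape[1])
print(res.status == 0, res.x)   # status 0 means a separating w was found
```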
Loss Function
0-1 Loss
Define the 0-1 loss as
$$\mathcal{L}_{0-1}(y, t) = \mathbb{I}(y \ne t)$$
Then, the cost is
$$\mathcal{J} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\big(y^{(i)} \ne t^{(i)}\big)$$
However, this loss is hard to optimize directly: minimizing the 0-1 cost is NP-hard in general (it amounts to an integer program).
Note that $\frac{\partial \mathcal{L}_{0-1}}{\partial w_j} = 0$ almost everywhere (since $\mathcal{L}_{0-1}$ is a step function of $z$), so gradient descent gets no signal.
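A small sketch of why this matters (assuming NumPy; the toy data and weights are arbitrary): a finite-difference estimate of the gradient of the 0-1 cost with respect to a weight is almost always exactly zero.

```python
import numpy as np

def zero_one_cost(w, X, t):
    """Average 0-1 loss: the fraction of misclassified examples."""
    y = (X @ w >= 0).astype(int)
    return np.mean(y != t)

# Arbitrary toy data; the first column is the dummy feature 1.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
t = np.array([1, 0, 1])
w = np.array([0.1, 1.0])

# A small perturbation of w_1 almost never moves any example across the
# decision boundary, so the cost is unchanged and the estimate is 0.
eps = 1e-6
grad_est = (zero_one_cost(w + np.array([0.0, eps]), X, t)
            - zero_one_cost(w, X, t)) / eps
print(grad_est)   # -> 0.0
```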
Surrogate loss function
If we treat the model as a linear regression model (predicting $y = z = \mathbf{w}^T\mathbf{x}$ directly), we can use the squared error loss
$$\mathcal{L}_{SE}(y, t) = \frac{1}{2}(y - t)^2$$
However, this loss is large even when the prediction is on the correct side of the boundary with high confidence.
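For example, with the hand-picked values $t = 1$ and $y = z = 10$, the prediction is confidently correct, yet
$$\mathcal{L}_{SE}(y, t) = \tfrac{1}{2}(10 - 1)^2 = 40.5,$$
so the optimizer is pushed to make a confidently correct prediction less extreme.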
Logistic Activation Function
We use the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$ to transform $z = \mathbf{w}^T\mathbf{x} + b$ into a value in $(0, 1)$:
$$\mathcal{L}_{SE}(y, t) = \frac{1}{2}\big(\sigma(\mathbf{w}^T\mathbf{x} + b) - t\big)^2$$
A linear model with a logistic nonlinearity is known as log-linear; in this setting, $\sigma$ is called an activation function.
However, as $z \to \pm\infty$, $\sigma$ saturates and $\sigma'(z) \approx 0$, so the gradient of the loss with respect to the weights also vanishes. This is exactly backwards: if the prediction is really wrong, we should be far from a critical point so that gradient descent can make a large correction, yet the update here is tiny.
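A minimal sketch of this saturation effect (assuming NumPy; the values of $z$ are hand-picked): for a very wrong prediction, the gradient of the squared error with respect to $z$ is nearly zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_se_wrt_z(z, t):
    """d/dz of 0.5*(sigmoid(z) - t)^2 = (sigmoid(z) - t) * sigmoid(z) * (1 - sigmoid(z))."""
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)

# Target is 1, but the model is very confident in the wrong direction.
print(grad_se_wrt_z(-10.0, 1.0))   # ~ -4.5e-05: almost no learning signal
print(grad_se_wrt_z(-1.0, 1.0))    # ~ -0.14:    mildly wrong, larger gradient
```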
Cross-entropy loss (log loss)
We want a loss that assigns more penalty the more confident the prediction is about a wrong answer, while barely punishing a correct prediction even when it is not confident.
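The standard binary cross-entropy, $\mathcal{L}_{CE}(y, t) = -t \log y - (1 - t) \log(1 - y)$ with $y = \sigma(z)$, has exactly this behaviour. A small sketch comparing it with the squared-error surrogate on hand-picked logits, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, t):
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

def squared_error(y, t):
    return 0.5 * (y - t) ** 2

t = 1.0
for z in [-10.0, -1.0, 1.0, 10.0]:   # hand-picked logits
    y = sigmoid(z)
    print(f"z={z:6.1f}  CE={cross_entropy(y, t):8.4f}  SE={squared_error(y, t):.4f}")

# Cross-entropy grows roughly linearly in |z| for confident wrong answers
# (z = -10 gives CE around 10), while it stays near zero for confident
# correct ones; squared error saturates at 0.5 no matter how wrong z is.
```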