Newton's Method and Steepest Descent
Newton's Method
Consider a twice-differentiable function \(f : I \to \mathbb R\) defined on an interval \(I \subseteq \mathbb R\). We would like to find the minima of \(f\). We shall do so by considering quadratic approximations of \(f\).
Let us start at a point \(x_0 \in I\) and consider the (best) quadratic approximation to \(f\) at \(x_0\),
\[ q(x) = f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2} f''(x_0)(x - x_0)^2. \]
Note that \(q(x_0) = f(x_0)\), \(q'(x_0) = f'(x_0)\) and \(q''(x_0) = f''(x_0)\). We will now find the minimizer \(x_1\) of the quadratic \(q\). That is, we would like to find \(x_1\) such that
\[ q'(x_1) = f'(x_0) + f''(x_0)(x_1 - x_0) = 0, \]
implying that, so long as \(f''(x_0) \neq 0\),
\[ x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)}. \]
The idea of Newton's method is to iterate this procedure. (Compare with Newton's method for finding roots of a function: minimizing \(f\) in this way is the same as applying root-finding Newton's method to the derivative \(f'\).)
Newton's Method in 1-Dim
Precisely, we pick a starting point \(x_0 \in I\) and recursively define
\[ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}. \]
We hope that the sequence \(x_n\) converges to a minimizer of \(f\). For the rest of the lecture, let \(g = f'\). With this notation we may write Newton's method as
\[ x_{n+1} = x_n - \frac{g(x_n)}{g'(x_n)}. \]
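As a concrete illustration, here is a minimal Python sketch of the one-dimensional iteration above; the function names \texttt{g} and \texttt{gprime}, the tolerance, and the example objective are choices made for this sketch, not part of the lecture.
\begin{verbatim}
def newton_1d(g, gprime, x0, tol=1e-10, max_iter=50):
    """Newton's method on g = f': iterate x_{n+1} = x_n - g(x_n) / g'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / gprime(x)   # requires g'(x) != 0
        x = x - step
        if abs(step) < tol:       # stop once the update is tiny
            break
    return x

# Example: f(x) = x^4 - 3x^2, so g(x) = 4x^3 - 6x and g'(x) = 12x^2 - 6.
x_min = newton_1d(lambda x: 4*x**3 - 6*x, lambda x: 12*x**2 - 6, x0=1.0)
\end{verbatim}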
Thrm. Convergence of Newton's Method in 1-Dim
Claim. Let \(g \in C^2(I)\) (i.e. \(f \in C^3(I)\)). Suppose there is an \(x_* \in I\) that satisfies \(g(x_*) = 0\) and \(g'(x_*) \neq 0\). If \(x_0\) is sufficiently close to \(x_*\), then the sequence \(x_n\) generated by Newton's method converges to \(x_*\).
proof. Since \(g'(x_*) \neq 0\), there is, by continuity of \(g'\), an \(\alpha > 0\) and a neighbourhood of \(x_*\) on which \(|g'(x)| > \alpha\), and on which also \(|g''(x)| < \frac{1}{\alpha}\).
The first bound is a simple continuity argument. The second bound follows from continuity of \(g''\) and the extreme value theorem applied to the closure of this neighbourhood. (That is, we first choose an \(\alpha\) bounding \(|g'|\) from below, and then shrink it if necessary so that \(1/\alpha\) bounds \(|g''|\) from above.)
Since \(g(x_*) = 0\), the formula of Newton's method now implies
\[ x_{n+1} - x_* = x_n - x_* - \frac{g(x_n) - g(x_*)}{g'(x_n)}. \qquad (*) \]
By the second-order mean value theorem (Taylor's theorem), there exists a \(\xi\) between \(x_n\) and \(x_*\) (in particular, \(\xi\) lies in our neighbourhood) such that
\[ g(x_*) = g(x_n) + g'(x_n)(x_* - x_n) + \tfrac{1}{2} g''(\xi)(x_* - x_n)^2. \]
Then \((*)\) becomes
\[ x_{n+1} - x_* = \frac{g''(\xi)}{2 g'(x_n)} (x_n - x_*)^2. \]
The bounds on \(g'\) and \(g''\) we found at the start of the proof imply that
\[ |x_{n+1} - x_*| \leq \frac{1}{2\alpha^2} |x_n - x_*|^2 \leq \frac{1}{\alpha^2} |x_n - x_*|^2. \qquad (**) \]
Let \(\rho\) be the constant \(\rho = \frac{1}{\alpha^2}|x_0 - x_*|\). Choose \(x_0\) close enough to \(x_*\) so that \(\rho < 1\) and so that \(x_0\) lies in the neighbourhood above (then every later iterate stays there as well, since the distances to \(x_*\) decrease). Then \((**)\) implies
\[ |x_1 - x_*| \leq \frac{1}{\alpha^2}|x_0 - x_*|^2 = \rho\,|x_0 - x_*|. \]
Similarly, \((**)\) gives
\[ |x_2 - x_*| \leq \frac{1}{\alpha^2}|x_1 - x_*|^2 \leq \rho\,|x_1 - x_*| \leq \rho^2 |x_0 - x_*|. \]
Continuing in the same way we obtain
\[ |x_n - x_*| \leq \rho^n |x_0 - x_*| \to 0 \quad \text{as } n \to \infty, \]
implying that Newton's method converges to \(x_*\).
Newton's Method in Higher Dimensions
Consider a function \(f : \Omega \to \mathbb R\) defined on an open set \(\Omega \subseteq \mathbb R^n\). We choose a starting point \(x_0 \in \Omega\), and recursively define
\[ x_{n+1} = x_n - \left[\nabla^2 f(x_n)\right]^{-1} \nabla f(x_n). \]
For a general \(f\), the algorithm requires that the Hessian \(\nabla^2 f(x_n)\) be invertible at every iterate. The algorithm stops if \(\nabla f(x_n) = 0\) at some point (that is, the sequence given by Newton's method becomes constant if \(\nabla f(x_n) = 0\) for some \(x_n\)). Our main result is
Thrm. Convergence of Newton's Method in N-Dim
Suppose \(f \in C^3(\Omega)\). Suppose also that there is an \(x_* \in \Omega\) such that \(\nabla f(x_*) = 0\) and \(\nabla^2 f(x_*)\) is invertible. Then the sequence \(x_n\) defined by
\[ x_{n+1} = x_n - \left[\nabla^2 f(x_n)\right]^{-1} \nabla f(x_n) \]
converges to \(x_*\) for all \(x_0\) sufficiently close to \(x_*\).
The goal of Newton's method is to find a minimizer of \(f\), but it is possible for it to fail at this, for it only searches for critical points, not necessarily extrema.
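For the \(n\)-dimensional iteration, a minimal Python/NumPy sketch is given below; it assumes the user supplies callables \texttt{grad} and \texttt{hess} for \(\nabla f\) and \(\nabla^2 f\), and the stopping tolerance and example function are illustrative choices only.
\begin{verbatim}
import numpy as np

def newton_nd(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: x_{n+1} = x_n - [Hess f(x_n)]^{-1} grad f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # gradient ~ 0: stop at a critical point
            break
        step = np.linalg.solve(hess(x), g)   # solve a linear system rather than forming the inverse
        x = x - step
    return x

# Example: f(x, y) = x^2 + 3y^2 + xy, a strictly convex quadratic.
grad = lambda v: np.array([2*v[0] + v[1], 6*v[1] + v[0]])
hess = lambda v: np.array([[2.0, 1.0], [1.0, 6.0]])
print(newton_nd(grad, hess, [5.0, -3.0]))    # converges to the minimizer (0, 0)
\end{verbatim}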
Things May Go Wrong
It is possible for Newton's method to fail to converge even when \(f\) has a unique global minimizer \(x_*\) and the initial point \(x_0\) is taken arbitrarily close to \(x_*\). Consider
\[ f(x) = \tfrac{2}{3}|x|^{3/2}, \]
whose unique global minimizer is \(x_* = 0\). This function is differentiable, and its derivative is \(f'(x) = \begin{cases} x^{1/2} & x \geq 0 \\ -(-x)^{1/2} & x \leq 0 \end{cases}\) and its second derivative is \(f''(x) = \begin{cases} \frac{1}{2}x^{-1/2} & x > 0 \\ \frac{1}{2}(-x)^{-1/2} & x < 0 \\ \text{undefined} & x = 0 \end{cases}\) so \(f \not\in C^3\) (it is not even \(C^2\)). Let \(x_0 = \epsilon > 0\). Then
\[ x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = \epsilon - \frac{\epsilon^{1/2}}{\tfrac{1}{2}\epsilon^{-1/2}} = \epsilon - 2\epsilon = -\epsilon. \]
So Newton's method gives the alternating sequence \(\epsilon, -\epsilon, \epsilon, -\epsilon, \dots\), which definitely does not converge. This does not contradict the convergence theorem, because the function in question does not satisfy its hypotheses.
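A short Python check of this oscillation (purely illustrative):
\begin{verbatim}
import math

fp  = lambda x: math.copysign(math.sqrt(abs(x)), x)   # f'(x) for f(x) = (2/3)|x|^(3/2)
fpp = lambda x: 0.5 / math.sqrt(abs(x))               # f''(x), undefined at x = 0

x = 0.01                                              # epsilon
for _ in range(6):
    x = x - fp(x) / fpp(x)                            # Newton step
    print(x)                                          # alternates (up to rounding) between +-0.01
\end{verbatim}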
Now we consider an example in which Newton's method converges, just not to a minimizer. Consider \(f(x) = x^3\), which has derivatives \(f'(x) = 3x^2\) and \(f''(x) = 6x\). Starting at \(x_0\), we have
\[ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} = x_n - \frac{3x_n^2}{6x_n} = \frac{x_n}{2}, \qquad \text{so} \qquad x_n = \frac{x_0}{2^n} \]
(valid whenever \(x_n \neq 0\); if some \(x_n = 0\) the sequence is constant from then on). So Newton's method definitely converges to the critical point \(0\), no matter the choice of \(x_0 \in \mathbb R\). However, the function \(f\) in question does not have a global minimizer, so, while Newton's method converges, it does not converge to an extremum of any sort.
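And a quick numerical check of this second failure mode (again just an illustrative snippet):
\begin{verbatim}
# Newton's method on f(x) = x^3: f'(x) = 3x^2, f''(x) = 6x.
x = 1.0
for _ in range(20):
    x = x - (3 * x**2) / (6 * x)   # each step halves x
print(x)                           # ~1e-6: converging to the critical point 0, not a minimizer
\end{verbatim}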
Steepest Descent
Consider a \(C^1\) function \(f : \Omega \to \mathbb R\) defined on an open set \(\Omega \subseteq \mathbb R^n\). The idea is: at every point in the "landscape" of \(f\) (the graph of \(f\) in \(\mathbb R^{n+1}\)), make a step "downwards" in the steepest direction. (If you're on a mountain and want to descend to the bottom as fast as possible, how do you do so? You, at your current position, take a step down in the steepest direction, and repeat until you're done.)
Since the gradient \(\nabla f(x_0)\) represents the direction of greatest increase of \(f\) at \(x_0\), the vector \(-\nabla f(x_0)\) represents the direction of steepest decrease at \(x_0\). We would therefore like to move in the direction of the negative gradient. We will do so, moving along the negative gradient until we reach a minimizer of \(f\) along that direction (at which point we stop and repeat).
Algorithm
Pick a starting point \(x_0 \in \Omega\) and recursively define
\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k), \]
where \(\alpha_k \geq 0\) satisfies
\[ f(x_k - \alpha_k \nabla f(x_k)) = \min_{\alpha \geq 0} f(x_k - \alpha \nabla f(x_k)). \]
We call \(\alpha_k\) the optimal step, since it is chosen so that \(x_{k+1}\) minimizes \(f\) along the search direction; \(x_{k+1}\) is the minimum point on the half-line \(\{x_k - \alpha \nabla f(x_k) : \alpha \geq 0\}\). We now describe some properties of the method of steepest descent.
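Here is a rough Python sketch of the method. Since the exact minimization over \(\alpha \geq 0\) is rarely available in closed form, this sketch approximates the optimal step by sampling \(\alpha\) on a grid; that approximation, and the example function, are choices of this sketch rather than part of the method as stated.
\begin{verbatim}
import numpy as np

def steepest_descent(f, grad, x0, n_iter=100, alpha_max=1.0, n_samples=200):
    """x_{k+1} = x_k - alpha_k * grad f(x_k), with alpha_k approximately
    minimizing f along the half-line {x_k - alpha * grad f(x_k) : alpha >= 0}."""
    x = np.asarray(x0, dtype=float)
    alphas = np.linspace(0.0, alpha_max, n_samples)
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < 1e-12:                # critical point: stop
            break
        values = [f(x - a * g) for a in alphas]      # sample f along the half-line
        x = x - alphas[int(np.argmin(values))] * g   # take the (approximately) optimal step
    return x

# Example: f(x, y) = x^2 + 10 y^2.
f    = lambda v: v[0]**2 + 10 * v[1]**2
grad = lambda v: np.array([2 * v[0], 20 * v[1]])
print(steepest_descent(f, grad, [1.0, 1.0]))         # approaches the minimizer (0, 0)
\end{verbatim}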
Thrm. Correctness of Steepest Descent
Claim. The steepest descent algorithm is actually descending; \(f(x_{k+1}) < f(x_k)\) so long as \(\nabla f(x_k) \neq 0\).
proof. By the choice of \(\alpha_k\) we have \(f(x_{k+1}) = f(x_k - \alpha_k \nabla f(x_k)) \leq f(x_k - s \nabla f(x_k))\) for all \(s \geq 0\). Also,
\[ \left.\frac{d}{ds} f(x_k - s \nabla f(x_k))\right|_{s=0} = -\nabla f(x_k) \cdot \nabla f(x_k) = -|\nabla f(x_k)|^2 < 0. \]
Then for sufficiently small \(s > 0\) we have \(f(x_k - s\nabla f(x_k)) < f(x_k)\), and therefore \(f(x_{k+1}) < f(x_k)\).
Thrm. Direction of Gradient
Claim. The steepest descent algorithm moves in perpendicular steps; for all \(k\), we have \((x_{k+2} - x_{k+1})\cdot(x_{k+1} - x_k) = 0\).
proof. We have
\[ (x_{k+2} - x_{k+1}) \cdot (x_{k+1} - x_k) = \big(-\alpha_{k+1} \nabla f(x_{k+1})\big) \cdot \big(-\alpha_k \nabla f(x_k)\big) = \alpha_{k+1}\alpha_k\, \nabla f(x_{k+1}) \cdot \nabla f(x_k). \]
Recall that \(\alpha_k \geq 0\). If \(\alpha_k = 0\), then the whole expression is zero and we're done. Consider the possibility that \(\alpha_k > 0\). Then
\[ f(x_k - \alpha_k \nabla f(x_k)) = \min_{\alpha \geq 0} f(x_k - \alpha \nabla f(x_k)), \]
implying that \(\alpha_k\) is an interior minimizer of the function \(\alpha \mapsto f(x_k - \alpha \nabla f(x_k))\) on \([0, \infty)\). Then
\[ 0 = \left.\frac{d}{d\alpha} f(x_k - \alpha \nabla f(x_k))\right|_{\alpha = \alpha_k} = -\nabla f(x_{k+1}) \cdot \nabla f(x_k), \]
proving the claim.
Thrm. Convergence of Steepest Descent
Claim. Suppose \(f\) is a \(C^1\) function on an open set \(\Omega \subseteq \mathbb R^n\). Let \(x_0 \in \Omega\), and let \(\{x_k\}_{k=0}^\infty\) be the sequence generated by the method of steepest descent. If there is a compact \(K \subseteq \Omega\) containing all \(x_k\), then every convergent subsequence of \(\{x_k\}_{k=0}^\infty\) in \(K\) will converge to a critical point \(x_*\) of \(f\).
proof. Choose a convergent subsequence \(\{x_{k_i}\}\) converging to a point \(x_* \in K\). By continuity, \(\{ f(x_{k_i}) \}\) converges to \(f(x_*)\); since \(\{f(x_k)\}\) is a decreasing sequence, the whole sequence \(\{f(x_k)\}\) also converges to \(f(x_*)\).
Suppose for the sake of contradiction that \(\nabla f(x_*) \neq 0\). Since \(f\) is \(C^1\), \(\nabla f(x_{k_i})\) converges to \(\nabla f(x_*) \neq 0\), so \(\nabla f(x_{k_i}) \neq 0\) for all large \(i\). Define \(y_{k_i} = x_{k_i} - \alpha_{k_i} \nabla f(x_{k_i})\) (i.e. \(y_{k_i} = x_{k_i+1}\)). Since all \(y_{k_i}\) lie in the compact set \(K\), we may assume, after passing to a further subsequence, that \(y_{k_i}\) converges to some \(y_* \in K\). Since \(\nabla f(x_*) \neq 0\), we may write
\[ \alpha_{k_i} = \frac{|y_{k_i} - x_{k_i}|}{|\nabla f(x_{k_i})|}. \]
Taking the limit as \(i \to \infty\), we have
\[ \alpha_{k_i} \to \frac{|y_* - x_*|}{|\nabla f(x_*)|} =: \bar\alpha \geq 0. \]
Taking the same limit in the definition of \(y_{k_i}\) we have
\[ y_* = x_* - \bar\alpha \nabla f(x_*). \]
Note that
\[ f(y_{k_i}) = f(x_{k_i} - \alpha_{k_i} \nabla f(x_{k_i})) = \min_{\alpha \geq 0} f(x_{k_i} - \alpha \nabla f(x_{k_i})). \]
Thus \(f(y_{k_i}) \leq f(x_{k_i} - \alpha \nabla f(x_{k_i}))\) for all \(\alpha \geq 0\). For any fixed \(\alpha \geq 0\), taking the limit \(i \to \infty\) gives us
\[ f(y_*) \leq f(x_* - \alpha \nabla f(x_*)) \quad \text{for all } \alpha \geq 0, \qquad \text{hence} \qquad f(y_*) \leq \min_{\alpha \geq 0} f(x_* - \alpha \nabla f(x_*)) < f(x_*), \]
since the function \(f\) decreases in the direction of \(-\nabla f(x_*) \neq 0\) (the directional derivative at \(\alpha = 0\) is \(-|\nabla f(x_*)|^2 < 0\), so some \(\alpha > 0\) gives a value strictly below \(f(x_*)\)).
On the other hand, \(f(x_{k_i+1}) \to f(x_*)\), because the whole sequence \(\{f(x_k)\}\) converges to \(f(x_*)\). But \(x_{k_i+1} = y_{k_i}\), so \(f(y_{k_i}) \to f(y_*)\) by continuity, implying \(f(x_*) = f(y_*)\), a contradiction.
Steepest Descent in the Quadratic Case
Consider a function \(f\) of the form \(f(x) = \frac{1}{2}x^TQx - b^Tx\) for \(b,x \in \mathbb R^n\) and \(Q\) an \(n \times n\) symmetric positive definite matrix. Let \(\lambda = \lambda_1 \leq \cdots \leq \lambda_n = \Lambda\) be the eigenvalues of \(Q\). (Note that they are all strictly positive.) Note that \(\nabla^2 f(x) = Q\) for any \(x\), so \(f\) is strictly convex. There therefore exists a unique global minimizer \(x_*\) of \(f\) in \(\mathbb R^n\) such that \(Qx_* = b\).
Let
\[ q(x) = \tfrac{1}{2}(x - x_*)^T Q (x - x_*) = \tfrac{1}{2}x^TQx - x^TQx_* + \tfrac{1}{2}x_*^TQx_* = f(x) + \tfrac{1}{2}x_*^TQx_*, \]
using \(Qx_* = b\). So \(q\) and \(f\) differ by a constant, and it therefore suffices to find the minimizer of \(q\) rather than that of \(f\). Note that \(q(x) \geq 0\) for all \(x\), since \(Q\) is positive definite. So we shall study the minimizer \(x_*\) of \(q\).
Note that \(\nabla f(x) = \nabla q(x) = Qx - b\); let \(g(x) = Qx - b\). The method of steepest descent may therefore be written as
\[ x_{k+1} = x_k - \alpha_k g(x_k). \]
We would like a formula for the optimal step \(\alpha_k\). Recall that \(\alpha_k\) is defined to be the minimizer of the function \(\alpha \mapsto f(x_k - \alpha g(x_k))\) over \(\alpha \geq 0\). Thus
\[ 0 = \left.\frac{d}{d\alpha} f(x_k - \alpha g(x_k))\right|_{\alpha = \alpha_k} = -\big(Q(x_k - \alpha_k g(x_k)) - b\big)^T g(x_k). \]
This simplifies to
\[ \big(g(x_k) - \alpha_k Q g(x_k)\big)^T g(x_k) = 0, \]
giving
\[ \alpha_k = \frac{g(x_k)^T g(x_k)}{g(x_k)^T Q g(x_k)} = \frac{|g(x_k)|^2}{g(x_k)^T Q g(x_k)}. \]
Therefore
\[ x_{k+1} = x_k - \frac{|g(x_k)|^2}{g(x_k)^T Q g(x_k)}\, g(x_k). \]
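In the quadratic case the optimal step has this closed form, so the method can be implemented exactly; a short Python sketch follows (the particular \(Q\), \(b\) and starting point are arbitrary choices for this example):
\begin{verbatim}
import numpy as np

def steepest_descent_quadratic(Q, b, x0, n_iter=100):
    """Steepest descent for f(x) = 0.5 x^T Q x - b^T x with the exact step
    alpha_k = |g_k|^2 / (g_k^T Q g_k), where g_k = Q x_k - b."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = Q @ x - b
        if np.linalg.norm(g) < 1e-12:
            break
        alpha = (g @ g) / (g @ Q @ g)   # optimal step for the quadratic
        x = x - alpha * g
    return x

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
x = steepest_descent_quadratic(Q, b, x0=np.zeros(2))
print(x, np.linalg.solve(Q, b))          # the iterates approach the true minimizer x_* = Q^{-1} b
\end{verbatim}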
Lemma 1
Claim. \(q(x_{k+1}) = \left( 1 - \frac{|g(x_k)|^4}{(g(x_k)^T Q g(x_k))(g(x_k)^TQ^{-1}g(x_k))} \right)q(x_k)\)
proof. Using \(x_{k+1} = x_k - \alpha_k g(x_k)\) and the definition of \(q\), a direct computation gives
\[ q(x_k) - q(x_{k+1}) = \tfrac{1}{2}(x_k - x_*)^T Q (x_k - x_*) - \tfrac{1}{2}\big(x_k - \alpha_k g(x_k) - x_*\big)^T Q \big(x_k - \alpha_k g(x_k) - x_*\big), \]
implying
\[ q(x_k) - q(x_{k+1}) = \alpha_k g(x_k)^T Q (x_k - x_*) - \tfrac{1}{2}\alpha_k^2 g(x_k)^T Q g(x_k). \]
Dividing by \(q(x_k)\) gives
\[ \frac{q(x_k) - q(x_{k+1})}{q(x_k)} = \frac{2\alpha_k g(x_k)^T Q (x_k - x_*) - \alpha_k^2 g(x_k)^T Q g(x_k)}{(x_k - x_*)^T Q (x_k - x_*)}. \]
Let \(g_k = g(x_k)\) and \(y_k = x_k - x_*\). Then
\[ \frac{q(x_k) - q(x_{k+1})}{q(x_k)} = \frac{2\alpha_k g_k^T Q y_k - \alpha_k^2 g_k^T Q g_k}{y_k^T Q y_k}. \]
Note that \(g_k = Qx_k - b = Q(x_k - x_*) = Qy_k\), so \(y_k = Q^{-1}g_k\). The above formula therefore simplifies to
\[ \frac{q(x_k) - q(x_{k+1})}{q(x_k)} = \frac{2\alpha_k |g_k|^2 - \alpha_k^2 g_k^T Q g_k}{g_k^T Q^{-1} g_k}. \]
Now recall the formula
\[ \alpha_k = \frac{|g_k|^2}{g_k^T Q g_k}. \]
This implies that
\[ \frac{q(x_k) - q(x_{k+1})}{q(x_k)} = \frac{\dfrac{2|g_k|^4}{g_k^T Q g_k} - \dfrac{|g_k|^4}{g_k^T Q g_k}}{g_k^T Q^{-1} g_k} = \frac{|g_k|^4}{(g_k^T Q g_k)(g_k^T Q^{-1} g_k)}, \]
i.e. \(q(x_{k+1}) = \left( 1 - \frac{|g_k|^4}{(g_k^T Q g_k)(g_k^TQ^{-1}g_k)} \right)q(x_k)\), proving the lemma.
Thrm. Convergence Speed of Steepest Descent, Quadratic Case
Claim. For any \(x_0 \in \mathbb R^n\), the method of steepest descent starting at \(x_0\) converges to the unique minimizer \(x_*\) of the function \(f\), and we have \(q(x_{k+1}) \leq r\, q(x_k)\), where \(r = \left(\frac{\Lambda - \lambda}{\Lambda + \lambda}\right)^2 < 1\).
proof. The inequality \(q(x_{k+1}) \leq r\, q(x_k)\) follows from Lemma 1 together with the Kantorovich inequality
\[ \frac{|g_k|^4}{(g_k^T Q g_k)(g_k^T Q^{-1} g_k)} \geq \frac{4\lambda\Lambda}{(\lambda + \Lambda)^2} = 1 - r. \]
Iterating, we get \(q(x_k) \leq r^k q(x_0)\). Since \(0 \leq r < 1\), \(r^k \to 0\) as \(k \to \infty\). Note that
\[ x_k \in \{ x \in \mathbb R^n : q(x) \leq r^k q(x_0) \}. \]
This set is a sublevel set of \(q\). The sublevel sets of \(q\) are concentric filled-in ellipsoids centred at \(x_*\), and as \(k \to \infty\) they shrink to \(x_*\). Therefore \(x_k \to x_*\), and steepest descent converges in the quadratic case.
Note that
\[ r = \left(\frac{\Lambda - \lambda}{\Lambda + \lambda}\right)^2 = \left(\frac{\Lambda/\lambda - 1}{\Lambda/\lambda + 1}\right)^2, \]
so \(r\) depends only on the ratio \(\Lambda / \lambda\). This number is called the \emph{condition number of \(Q\)}. (The condition number may also be defined as \(\|Q\|\|Q^{-1}\|\) in the operator norm on matrices; it is not hard to see that these two definitions agree in our case.)
If the condition number \(\Lambda / \lambda \gg 1\) (large), then \(r\) is close to \(1\) and convergence is very slow. If \(\Lambda / \lambda = 1\), then \(r = 0\), and so convergence is achieved in one step.
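To see the effect of the condition number numerically, here is a small illustrative experiment (the matrices and starting point are arbitrary choices for this example):
\begin{verbatim}
import numpy as np

def sd_errors(Q, b, x0, n_iter=30):
    """Run steepest descent with the exact quadratic step and record q(x_k)."""
    x_star = np.linalg.solve(Q, b)
    x, errs = np.asarray(x0, dtype=float), []
    for _ in range(n_iter):
        errs.append(0.5 * (x - x_star) @ Q @ (x - x_star))   # q(x_k)
        g = Q @ x - b
        if np.linalg.norm(g) < 1e-15:
            break
        x = x - ((g @ g) / (g @ Q @ g)) * g
    return errs

b, x0 = np.array([1.0, 1.0]), np.array([5.0, -5.0])
well = np.diag([1.0, 2.0])      # condition number 2:   r = (1/3)^2    ~ 0.11, fast decay of q(x_k)
ill  = np.diag([1.0, 100.0])    # condition number 100: r = (99/101)^2 ~ 0.96, slow decay of q(x_k)
print(sd_errors(well, b, x0)[:5])
print(sd_errors(ill,  b, x0)[:5])
\end{verbatim}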