Applied Mathematics 205

Unit 4. Optimization


Lecturer: Petr Karnakov

November 2, 2022

Motivation

  • This unit will cover nonlinear equations and optimization
  • So far we have mostly focused on linear problems
    • linear least squares (linear combination of basis functions)
    • linear physical laws (idealized behavior, small deformations)
    • discretizations of linear PDEs (wave equation, heat equation)
  • However, important applications lead to nonlinear problems
    • nonlinear least squares (nonlinear dependency on parameters)
    • nonlinear physical models (realistic materials, large deformations)
    • discretizations of nonlinear PDEs (Navier-Stokes)

Motivation: Nonlinear Equations

  • Some familiar problems can be reduced to nonlinear equations
  • For example, computing the points and weights of Gauss quadrature \[ \int_{-1}^1 f(x)\text{d}x \approx \sum_{k=0}^n w_k f(x_k) \] with $2n+2$ unknown parameters $x_0,\ldots,x_n$ and $w_0,\ldots,w_n$
  • Require that quadrature is exact on monomials of degree up to $2n+1$

Motivation: Nonlinear Equations

  • For $n=1$, this leads to a system of nonlinear equations \[ \begin{aligned} w_0 + w_1 &= \int_{-1}^1 1 \text{d}x = 2\\ w_0x_0 + w_1 x_1&= \int_{-1}^1 x \text{d}x = 0\\ w_0x_0^2 + w_1x_1^2 &= \int_{-1}^1 x^2 \text{d}x = 2/3\\ w_0x_0^3 + w_1x_1^3 &= \int_{-1}^1 x^3 \text{d}x = 0\end{aligned} \]

Motivation: Nonlinear Equations

  • A general system of $m$ equations for $n$ unknowns \[ F(x) = 0 \] where $F : \mathbb{R}^n \to \mathbb{R}^m$
  • We will focus on the case $m=n$,
    i.e. equal number of equations and unknowns
  • Cases $m\neq n$ can be formulated as nonlinear least squares

Motivation: Nonlinear Equations

  • One class of nonlinear equations is polynomial equations,
    i.e. $F(x)$ is a polynomial
  • The simplest case is a quadratic equation \[ ax^2 + bx + c = 0 \]
  • A closed-form solution is given by \[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

Motivation: Nonlinear Equations

  • There are also closed-form solutions for polynomial equations
    of degree three and four, due to Cardano and Ferrari (~1540)
  • However, the Abel–Ruffini theorem states that equations
    of degree five or higher have no general solution in radicals
  • Therefore, they have to be solved numerically with an iterative algorithm

Motivation: Nonlinear Equations

  • There are many iterative methods for nonlinear equations
  • One is the bisection method for a scalar equation
    \[ f(x) = 0 \] where $f \in C[a,b]$
  • Assume $f(a)f(b)<0$ and bisect the interval
    depending on the sign of $f(\frac{a+b}{2})$

Motivation: Nonlinear Equations

import numpy as np

def f(x):
    return x * x - 4 * np.sin(x)

# Initial interval, assume f(a)*f(b)<0.
a = 1
b = 3
tol = 1e-3

# Bisection search.
while b - a > tol:
    print('a={:.5f} b={:.5f} f(a)={:.5f} f(b)={:.5f}'
          .format(a, b, f(a), f(b)))
    c = 0.5 * (b + a)
    if f(a) * f(c) < 0:
        b = c
    else:
        a = c

[examples/unit4/bisection.py]

a=1.00000 b=3.00000 f(a)=-2.36588 f(b)=8.43552
a=1.00000 b=2.00000 f(a)=-2.36588 f(b)=0.36281
a=1.50000 b=2.00000 f(a)=-1.73998 f(b)=0.36281
a=1.75000 b=2.00000 f(a)=-0.87344 f(b)=0.36281
a=1.87500 b=2.00000 f(a)=-0.30072 f(b)=0.36281
a=1.87500 b=1.93750 f(a)=-0.30072 f(b)=0.01985
a=1.90625 b=1.93750 f(a)=-0.14326 f(b)=0.01985
a=1.92188 b=1.93750 f(a)=-0.06241 f(b)=0.01985
a=1.92969 b=1.93750 f(a)=-0.02145 f(b)=0.01985
a=1.93359 b=1.93750 f(a)=-0.00085 f(b)=0.01985
a=1.93359 b=1.93555 f(a)=-0.00085 f(b)=0.00949

Motivation: Nonlinear Equations

  • Bisection is a robust method in 1D,
    but it requires an initial interval $[a,b]$ with $f(a)f(b)<0$
    and does not generalize to higher dimensions
  • We will consider alternative methods
    • fixed-point iteration
    • Newton’s method

Motivation: Optimization

  • A related topic is optimization
  • Has important applications in science and engineering
  • Examples
    • find the shape of a racing car that maximizes downforce
    • design a bridge to minimize its weight
    • find the path of an airplane that minimizes fuel consumption
  • Solving nonlinear equations can be viewed
    as minimization of a norm of the residual

Motivation: Optimization

  • Optimization can be constrained,
    i.e. parameters have to satisfy equations or inequalities
  • Examples
    • find the shape of a racing car that maximizes downforce,
      subject to a constant drag
    • design a bridge to minimize its weight,
      subject to a constant critical load
    • find the path of an airplane that minimizes fuel consumption,
      but avoids certain territories

Motivation: Optimization

  • All these problems can be formulated as constrained minimization

    Given an objective function $f : \mathbb{R}^n \to \mathbb{R}$ and a set $S \subset \mathbb{R}^n$,
    find $x^\ast \in S$ such that $f(x^\ast) \leq f(x)$ $\forall x \in S$

  • Here $S$ is the feasible set which describes the constraints,
    usually defined by equations or inequalities
  • If $S = \mathbb{R}^n$, then the minimization is unconstrained
  • Maximization of $f$ is equivalent to minimization of $-f$

Motivation: Optimization

  • The standard way to write an optimization problem is \[ \min_{x} f(x) \text{ subject to } g(x) = 0 \text{ and } h(x) \leq 0 \] with
    • objective function $f : \mathbb{R}^n \to \mathbb{R}$
    • equality constraints $g : \mathbb{R}^n \to \mathbb{R}^m$
    • inequality constraints $h : \mathbb{R}^n \to \mathbb{R}^p$

Motivation: Optimization

  • For example, consider a cylinder with radius $x_1$ and height $x_2$
  • Minimize the surface area of a cylinder subject to a constraint on its volume

    \[ \min_x f(x_1,x_2) = 2\pi x_1(x_1 + x_2) \] \[ \text{ subject to } g(x_1,x_2) = \pi x_1^2 x_2 - V = 0 \]

  • We will return to this example later

Motivation: Optimization

  • If $f$, $g$, and $h$ are all affine (i.e. of the form $Ax + b$, linear plus constant),
    then the optimization problem is called a linear programming problem
  • Here the term “program” is a synonym for “plan”
    and has nothing to do with computer software
  • The feasible set is a polyhedron and the minimum is attained at one of its vertices

Motivation: Optimization

  • If the objective function or any of the constraints are nonlinear
    then we have a nonlinear optimization problem or nonlinear programming
  • We will consider several different approaches to nonlinear optimization
  • Optimization routines typically use local information
    about a function to iteratively approach its local minimum

Motivation: Optimization

  • In some cases an optimizer can find a global minimum
  • Extra conditions on the function (e.g. convexity) can help

Motivation: Optimization

  • But in general, global optimization is difficult
  • The optimizer can get “stuck” in a local minimum

Motivation: Optimization

  • This can get even harder in higher dimensions

Motivation: Optimization

  • We will focus on methods for finding local minima
  • Global optimization is important, but not possible in general
    without extra conditions on the objective function
  • Global optimization usually relies on heuristics
    • try several different initial guesses (multistart methods)
    • simulated annealing (add decaying noise)
    • genetic algorithms (evolve a population of candidate solutions)

Nonlinear Equations

Fixed-Point Iteration

  • Consider the iteration \[ x_{k+1} = g(x_k) \]
  • For example, recall Heron’s method for finding $\sqrt{a}$ from HW0 \[ x_{k+1} = \frac{1}{2}\left( x_k + \frac{a}{x_k}\right) \]
  • Denote $g_\text{heron}(x) = \frac{1}{2}\left( x + a/x\right)$

Fixed-Point Iteration

  • Suppose $\alpha\in\mathbb{R}$ is such that $g(\alpha) = \alpha$, then we call $\alpha$ a fixed point of $g$
  • For example, we see that $\sqrt{a}$ is a fixed point of $g_\text{heron}$ since \[ g_\text{heron}(\sqrt{a}) = \frac{1}{2}\left( \sqrt{a} + a/\sqrt{a}\right) = \sqrt{a} \]
  • A fixed-point iteration terminates once a fixed point is reached,
    since if $g(x_k) = x_k$ then we get $x_{k+1} = x_k$
  • Also, if $x_{k+1} = g(x_k)$ converges as $k\to\infty$ and $g$ is continuous,
    it must converge to a fixed point
  • Let $\alpha = \lim_{k\to\infty}x_k$, then \[ \alpha = \lim_{k\to \infty}x_{k+1} = \lim_{k\to \infty}g(x_k) = g\Big( \lim_{k\to \infty}x_{k} \Big) = g(\alpha) \]

Fixed-Point Iteration

  • Therefore, for example, if Heron’s method converges, it converges to $\sqrt{a}$
  • There are sufficient conditions for convergence of a fixed-point iteration
  • Recall that $g$ satisfies a Lipschitz condition in an interval $[a,b]$ if \[ |g(x) - g(y)| \leq L |x-y|, \quad \forall x,y \in [a,b] \] for some $L>0$
  • If $L<1$, then $g$ is called a contraction

Fixed-Point Iteration

  • Theorem: Suppose that $g$ is a contraction on $[\alpha - \delta, \alpha + \delta]$
    and $\alpha$ is a fixed point of $g$ (i.e. $g(\alpha) = \alpha$), where $\alpha\in\mathbb{R}$ and $\delta>0$.
    Then the fixed-point iteration converges to $\alpha$ for any $x_0 \in [\alpha - \delta, \alpha + \delta]$
  • Proof: Take $L<1$ from the Lipschitz condition. Then \[ |x_k - \alpha | = |g(x_{k-1}) - g(\alpha)| \leq L|x_{k-1} - \alpha|, \] which implies \[ |x_k - \alpha | \leq L^k|x_0 - \alpha| \] and, since $L<1$, $|x_k - \alpha | \to 0$ as $k \to \infty$ $\quad\square$
  • This also shows that each iteration reduces the error by at least a factor of $L$

Fixed-Point Iteration

  • Recall that if $g\in C^1[a,b]$, we can obtain a Lipschitz constant from $g'$ \[ L = \max\limits_{\theta\in[a,b]} |g'(\theta)| \]
  • We now use this result to show that if $|g'(\alpha)| < 1$,
    then there is a neighborhood of $\alpha$ on which $g$ is a contraction
  • This tells us that we can verify local convergence of a fixed-point iteration
    by checking the derivative of $g$ at the fixed point

Fixed-Point Iteration

  • By continuity of $|g'|$, for any $\epsilon > 0$, there is $\delta > 0$
    such that for any $x \in (\alpha-\delta,\alpha+\delta)$ we have $\big|\, |g'(x)| - |g'(\alpha)| \, \big| \leq \epsilon$
  • Therefore \[ \max\limits_{x\in(\alpha-\delta,\alpha+\delta)}|g'(x)| \leq |g'(\alpha)| + \epsilon \]
  • Suppose $|g'(\alpha)| < 1$ and set $\epsilon = \frac{1}{2}(1-|g'(\alpha)|)$,
    then there is an interval $(\alpha-\delta,\alpha+\delta)$,
    on which $g$ is Lipschitz with $L = \frac{1}{2}(1+|g'(\alpha)|)$
  • Since $L < 1$, then $g$ is a contraction in a neighborhood of $\alpha$

Fixed-Point Iteration

  • Furthermore, as $k\to \infty$, \[ \frac{|x_{k+1} - \alpha |}{|x_k - \alpha|} = \frac{|g(x_{k}) - g(\alpha) |}{|x_k - \alpha|} \to |g'(\alpha)|, \]
  • Therefore, asymptotically, after each iteration
    the error decreases by a factor of $|g'(\alpha)|$

Fixed-Point Iteration

  • We say that an iteration converges linearly if, for some $\mu \in (0,1)$, \[ \lim_{k\to\infty}\frac{|x_{k+1} - \alpha |}{|x_k - \alpha|} = \mu \]
  • An iteration converges superlinearly if \[ \lim_{k\to\infty}\frac{|x_{k+1} - \alpha |}{|x_k - \alpha|} = 0 \]

Fixed-Point Iteration

  • We can use these ideas to construct practical
    fixed-point iterations for solving $f(x) = 0$
  • For example, suppose $f(x) = e^x - x - 2$

  • From the plot, there is a root at $x \approx 1.15$

Fixed-Point Iteration

  • Equation $f(x) = 0$ is equivalent to $x = \log(x+2)$,
    so we seek a fixed point of the iteration \[ x_{k+1} = \log(x_k+2) \]
  • Here $g(x) = \log(x+2)$, and $g'(x) = 1/(x+2) < 1$ for all $x > -1$,
    therefore fixed point iteration will converge for $x_0 > -1$
  • We should get linear convergence with a factor of about \[ g'(1.15) = 1/(1.15+2) \approx 0.32 \]
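
  • A minimal sketch of this iteration (the initial guess and tolerance are illustrative):

import numpy as np

# Fixed-point iteration x_{k+1} = g(x_k) for g(x) = log(x + 2);
# its fixed point solves f(x) = e^x - x - 2 = 0
def g(x):
    return np.log(x + 2)

x = 1.0  # initial guess (illustrative)
for k in range(30):
    x_new = g(x)
    print('k={:2d} x={:.10f} |dx|={:.2e}'.format(k, x_new, abs(x_new - x)))
    if abs(x_new - x) < 1e-10:
        break
    x = x_new

  • The successive differences shrink by roughly the predicted factor of $0.32$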

Fixed-Point Iteration

  • An alternative fixed-point iteration is to set \[ x_{k+1} = e^{x_k} - 2, \quad k=0,1,2,\ldots \]
  • Therefore $g(x) = e^x - 2$, and $g'(x) = e^x$
  • Hence $|g'(\alpha)| > 1$, so we can’t guarantee convergence
  • In fact, the iteration diverges

Fixed-Point Iteration

Newton’s Method

  • Constructing fixed-point iterations is not straightforward
  • Need to rewrite $f(x) = 0$ in a form $x = g(x)$ with certain properties on $g$
  • To obtain a more generally applicable iterative method,
    consider the following fixed-point iteration \[ x_{k+1} = x_k - \lambda(x_k)f(x_k) \] corresponding to $g(x) = x - \lambda(x)f(x)$, for some function $\lambda$
  • A fixed point $\alpha$ of $g$ yields a solution to $f(\alpha) = 0$
    (except possibly when $\lambda(\alpha) = 0$), which is what we want

Newton’s Method

  • Recall that the asymptotic convergence rate is dictated by $|g'(\alpha)|$,
    so we want to have $|g'(\alpha)| = 0$ to get superlinear convergence
  • Suppose (as stated above) that $f(\alpha) = 0$, then \[ g'(\alpha) = 1 - \lambda'(\alpha)f(\alpha) - \lambda(\alpha)f'(\alpha) = 1 - \lambda(\alpha)f'(\alpha) \]
  • To satisfy $g'(\alpha) = 0$, we choose $\lambda(x) = 1/f'(x)$ to obtain \[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} \] known as Newton’s method
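
  • As a sketch (illustrative, not one of the course example files),
    Newton’s method applied to the earlier example $f(x) = e^x - x - 2$:

import numpy as np

# Newton's method for f(x) = e^x - x - 2, with f'(x) = e^x - 1
def f(x):
    return np.exp(x) - x - 2

def fprime(x):
    return np.exp(x) - 1

x = 2.0  # initial guess (illustrative)
for k in range(10):
    step = f(x) / fprime(x)
    x -= step
    print('k={} x={:.15f} |step|={:.2e}'.format(k, x, abs(step)))
    if abs(step) < 1e-14:
        break

  • The number of correct digits roughly doubles per iteration,
    consistent with the quadratic convergence derived below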

Newton’s Method

  • Based on fixed-point iteration theory,
    Newton’s method is locally convergent since $|g'(\alpha)| = 0 < 1$
  • However, we need a different argument to understand
    the superlinear convergence rate properly
  • To do this, we use a Taylor expansion for $f(\alpha)$ about $x_k$ \[ 0 = f(\alpha) = f(x_k) + (\alpha - x_k)f'(x_k) + \frac{(\alpha-x_k)^2}{2}f''(\theta_k) \] for some $\theta_k \in (\alpha,x_k)$

Newton’s Method

  • Dividing through by $f'(x_k)$ gives \[ \left(x_k - \frac{f(x_k)}{f'(x_k)}\right) - \alpha = \frac{f''(\theta_k)}{2f'(x_k)}(x_k - \alpha)^2 \] or \[ x_{k+1} - \alpha = \frac{f''(\theta_k)}{2f'(x_k)}(x_k - \alpha)^2 \]
  • Therefore, asymptotically,
    the error at iteration $k+1$ is proportional to the square of the error at iteration $k$
  • This is referred to as quadratic convergence, which is very rapid
  • We need to be sufficiently close to $\alpha$ to get quadratic convergence
    (the result relied on Taylor expansion near $\alpha$)

Secant Method

  • An alternative to Newton’s method is to approximate $f'(x_k)$
    using the finite difference \[ f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}} \]
  • Substituting this into the iteration leads to the secant method \[ x_{k+1} = x_k - f(x_k)\left( \frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})} \right),\quad k=1,2,3,\ldots \]
  • The main advantages of the secant method are
    • it does not require computing $f'(x)$
    • it requires only one extra evaluation of $f(x)$ per iteration
      (Newton’s method also requires $f'(x_k)$ at each iteration)

Secant Method

  • As one may expect, the secant method converges faster than
    a fixed-point iteration, but slower than Newton’s method
  • In fact, it can be shown that for the secant method, we have \[ \lim_{k\to\infty} \frac{|x_{k+1} - \alpha|}{|x_k - \alpha|^q} = \mu \] where $\mu$ is a positive constant and $q \approx 1.6$
  • See [examples/unit4/secant_vs_newton.py],
    Newton’s method versus secant method for $f(x) = e^x - x - 2$
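
  • A minimal secant iteration for the same $f(x) = e^x - x - 2$
    (the two initial guesses are illustrative):

import numpy as np

def f(x):
    return np.exp(x) - x - 2

# secant method: replace f'(x_k) by a difference quotient
# through the two most recent iterates
x_prev, x = 1.0, 2.0  # two initial guesses (illustrative)
for k in range(20):
    fx, fx_prev = f(x), f(x_prev)
    if fx == fx_prev:
        break  # difference quotient breaks down once converged
    x_new = x - fx * (x - x_prev) / (fx - fx_prev)
    print('k={} x={:.15f}'.format(k, x_new))
    if abs(x_new - x) < 1e-14:
        break
    x_prev, x = x, x_new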

Systems of Nonlinear Equations

Systems of Nonlinear Equations

  • We now consider fixed-point iterations and Newton’s method
    for systems of nonlinear equations
  • We suppose that $F : \mathbb{R}^n \to \mathbb{R}^n$, $n > 1$,
    and we seek $\alpha \in \mathbb{R}^n$ such that $F(\alpha) = 0$
  • In component form, this is equivalent to \[ \begin{aligned} F_1(\alpha) &= 0\\ F_2(\alpha) &= 0\\ &\;\;\vdots\\ F_n(\alpha) &= 0\end{aligned} \]

Fixed-Point Iteration

  • For a fixed-point iteration, we again rewrite $F(x) = 0$ as $x = G(x)$ to obtain \[ x_{k+1} = G(x_k) \]
  • The convergence proof is the same as in the scalar case,
    if we replace $|\cdot|$ with $\|\cdot\|$,
    i.e. if $\|G(x) - G(y)\| \leq L \|x-y\|$, then $\|x_{k} - \alpha\| \leq L^k\|x_0 - \alpha\|$
  • As before, if $G$ is a contraction it will converge to a fixed point $\alpha$

Fixed-Point Iteration

  • Recall that we define the Jacobian matrix, $J_G \in \mathbb{R}^{n\times n}$, to be \[ (J_G)_{ij} = \frac{\partial G_i}{\partial x_j}, \quad i,j=1,\ldots,n \]
  • If $\|J_G(\alpha)\|_\infty < 1$, then there is some neighborhood of $\alpha$
    for which the fixed-point iteration converges to $\alpha$
  • The proof of this is a natural extension of the corresponding scalar result

Fixed-Point Iteration: Example

  • Once again, we can employ a fixed point iteration to solve $F(x) = 0$
  • For example, consider \[ \begin{aligned} x_1^2 + x_2^2 - 1 &= 0\\ 5x_1^2 + 21x_2^2 - 9 &= 0 \end{aligned} \]
  • This can be rearranged to $x_1 = \sqrt{1 - x_2^2}$, $x_2 = \sqrt{(9 - 5x_1^2)/21}$

Fixed-Point Iteration: Example

  • Define \[ \begin{aligned} G_1(x_1,x_2) &= \sqrt{1 - x_2^2} \\ G_2(x_1,x_2) &= \sqrt{(9 - 5x_1^2)/21} \end{aligned} \]
  • See [examples/unit4/fixed_point_2d.py],
    fixed-point iteration in two dimensions
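
  • A sketch of this iteration (the initial guess and tolerance are illustrative):

import numpy as np

def G(x):
    # fixed-point map for the 2x2 system above
    return np.array([np.sqrt(1 - x[1]**2),
                     np.sqrt((9 - 5 * x[0]**2) / 21)])

x = np.array([0.5, 0.5])  # initial guess (illustrative)
for k in range(100):
    x_new = G(x)
    if np.linalg.norm(x_new - x) < 1e-12:
        break
    x = x_new
print(x)  # approaches [sqrt(3)/2, 1/2], a solution of the system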

Newton’s Method

  • As in the one-dimensional case, Newton’s method is generally
    more useful than a standard fixed-point iteration
  • The natural generalization of Newton’s method is \[ x_{k+1} = x_k - {J_F(x_k)}^{-1} F(x_k) \]
  • Note that to put Newton’s method in the standard form
    for a linear system, we write \[ {J_F(x_k)}\Delta x_{k+1} = -F(x_k) \] where $\Delta x_{k+1}= x_{k+1} - x_k$

Newton’s Method

  • Once again, if $x_0$ is sufficiently close to $\alpha$,
    then Newton’s method converges quadratically
  • This result again relies on Taylor’s theorem
  • We first consider how to generalize Taylor’s theorem to $\mathbb{R}^n$
  • First, we consider the case for $F : \mathbb{R}^n \to \mathbb{R}$

Multivariate Taylor Theorem

  • Let $\delta\in\mathbb{R}^n$ and define $\phi(s) = F(x+s\delta)$. The one-dimensional Taylor theorem yields \[ \phi(1) = \phi(0) + \sum_{\ell=1}^k\frac{\phi^{(\ell)}(0)}{\ell!} + \frac{1}{(k+1)!}\phi^{(k+1)}(\eta), \quad \eta \in (0,1) \] \[ \begin{aligned} \phi(0) &= F(x)\\ \phi(1) &= F(x+\delta)\\ \phi'(s) &= \sum_{i=1}^n \frac{\partial F(x+s\delta)}{\partial x_i}\delta_i\\ \phi''(s) &= \sum_{i=1}^n\sum_{j=1}^n \frac{\partial^2 F(x+s\delta)}{\partial x_i \partial x_j}\delta_i\delta_j \end{aligned} \]

Multivariate Taylor Theorem

  • We have \[ F(x+\delta) = F(x) + \sum_{\ell=1}^k\frac{U_\ell(x)}{\ell!} + E_k, \] where \[ U_\ell(x) = \left[\left(\frac{\partial}{\partial x_1}\delta_1 + \cdots + \frac{\partial}{\partial x_n}\delta_n\right)^\ell F \right](x), \quad \ell=1,2,\ldots,k, \] and \[ E_k = \frac{U_{k+1}(x+\eta\delta)}{(k+1)!}, \quad \eta \in (0,1) \]

Multivariate Taylor Theorem

  • Let $A$ be an upper bound on the absolute values
    of all derivatives of order $k+1$, then \[ \begin{aligned} |E_k| &\leq \frac{1}{(k+1)!} \Big| \Big[ \Big(\|\delta\|_\infty\frac{\partial}{\partial x_1} + \ldots + \|\delta\|_\infty\frac{\partial}{\partial x_n}\Big)^{k+1} F\Big] (x+\eta \delta) \Big| \\ &= \frac{1}{(k+1)!}\|\delta\|_\infty^{k+1} \Big| \Big [ \Big(\frac{\partial}{\partial x_1} + \ldots + \frac{\partial}{\partial x_n} \Big)^{k+1} F \Big] (x+\eta \delta) \Big| \\ &\leq\frac{n^{k+1}}{(k+1)!} A \|\delta\|_\infty^{k+1}\end{aligned} \] where the last line follows from the fact that there are $n^{k+1}$ terms in the product (i.e. there are $n^{k+1}$ derivatives of order $k+1$)

Multivariate Taylor Theorem

  • We only need an expansion up to first order terms
    for analysis of Newton’s method
  • From our expression above,
    we can write the first-order Taylor expansion as \[ F(x+\delta) = F(x) + \nabla F(x)^T \delta + E_1 \]

Multivariate Taylor Theorem

  • For $F : \mathbb{R}^n \to \mathbb{R}^n$, Taylor expansion follows by developing
    a Taylor expansion for each $F_i$ \[ F_i(x+\delta) = F_i(x) + \nabla F_i(x)^T \delta + E_{i,1} \] so that for $F : \mathbb{R}^n \to \mathbb{R}^n$ we have \[ F(x+\delta) = F(x) + J_F(x)\delta + E_F \] where $\|E_F\|_\infty = \max\limits_{1\leq i\leq n} |E_{i,1}| \leq \frac{1}{2}n^2 \left(\max\limits_{1 \leq i,j,\ell \leq n} \left|\frac{\partial^2 F_i}{\partial x_j\partial x_\ell}\right|\right) \|\delta\|_\infty^2$

Newton’s Method

  • Now return to Newton’s method
  • We have \[ 0 = F(\alpha) = F(x_k) + J_F(x_k)\left[\alpha - x_k\right] + E_F \] so that \[ x_k - \alpha = [J_F(x_k)]^{-1}F(x_k) + [J_F(x_k)]^{-1}E_F \]

Newton’s Method

  • Also, the Newton iteration itself can be rewritten as \[ J_F(x_k)\left[ x_{k+1} - \alpha\right] = J_F(x_k)\left[ x_{k} - \alpha \right] - F(x_k) \]
  • We obtain \[ x_{k+1} - \alpha = [J_F(x_k)]^{-1} E_F, \] which implies quadratic convergence \[ \|x_{k+1} - \alpha\|_\infty \leq C \|x_k - \alpha\|_\infty^2 \]

Newton’s Method: Example

  • Recall the conditions of the two-point Gauss quadrature rule \[ \begin{aligned} F_1(x_1,x_2,w_1,w_2) &= w_1 + w_2 - 2 = 0\\ F_2(x_1,x_2,w_1,w_2) &= w_1x_1 + w_2 x_2 = 0\\ F_3(x_1,x_2,w_1,w_2) &= w_1x_1^2 + w_2x_2^2 - 2/3 = 0\\ F_4(x_1,x_2,w_1,w_2) &= w_1x_1^3 + w_2x_2^3 = 0\end{aligned} \]
  • They constitute a nonlinear system of 4 equations for 4 unknowns

Newton’s Method: Example

  • We can solve this using Newton’s method
  • To do this, we require the Jacobian of this system: \[ J_F(x_1,x_2,w_1,w_2) = \left[ \begin{array}{cccc} 0 & 0 & 1 & 1\\ w_1 & w_2 & x_1 & x_2\\ 2w_1x_1 & 2w_2x_2 & x_1^2 & x_2^2 \\ 3w_1x_1^2 & 3w_2x_2^2 & x_1^3 & x_2^3 \end{array} \right] \]
  • Alternatively, use scipy.optimize.fsolve(), which implements Powell’s hybrid method (a combination of Newton’s method and gradient descent) by calling HYBRD or HYBRJ from the Fortran library MINPACK, as sketched below
  • See [examples/unit4/nonlin_gauss_quad.py],
    two-point Gauss quadrature found from a nonlinear system
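
  • A sketch of the fsolve approach (details may differ from the example file):

import numpy as np
from scipy.optimize import fsolve

def F(v):
    # residuals of the four quadrature conditions
    x1, x2, w1, w2 = v
    return [w1 + w2 - 2,
            w1 * x1 + w2 * x2,
            w1 * x1**2 + w2 * x2**2 - 2 / 3,
            w1 * x1**3 + w2 * x2**3]

print(fsolve(F, [-1, 1, 1, 1]))  # approx [-1/sqrt(3), 1/sqrt(3), 1, 1]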

Newton’s Method: Example

  • Using either approach with an initial guess $[-1,1,1,1]$,
    we get the solution \[ \begin{align*} x_1 &= -0.577350269189626 &\approx -1/\sqrt{3} \\ x_2 &= 0.577350269189626 &\approx 1/\sqrt{3} \\ w_1 &= 1.000000000000000 &\approx 1 \\ w_2 &= 1.000000000000000 &\approx 1 \end{align*} \]

Optimization

Existence of Global Minimum

  • To guarantee existence and uniqueness of a global minimum,
    we need to make assumptions about the objective function
  • For example, if $f$ is continuous on a closed (i.e. $\partial S\subset S$) and bounded set $S \subset \mathbb{R}^n$, then it has a global minimum in $S$
  • In one dimension, this says $f$ achieves a minimum on the interval $[a,b] \subset \mathbb{R}$
  • In general $f$ need not achieve a minimum on an open interval $(a,b)$, e.g. consider $f(x) = x$

Coercive Functions

  • Another helpful concept for existence of global minimum is coercivity
  • A function $f:S\to \mathbb{R}$ on an unbounded set $S \subset \mathbb{R}^n$ is coercive if \[ \lim_{\|x\| \to \infty} f(x) = +\infty \]
  • That is, $f(x)$ must take large positive values whenever $\|x\|$ is large

Coercive Functions

  • If $f$ is continuous and coercive on a closed set $S$,
    then $f$ has a global minimum in $S$
  • Proof: From the definition of coercivity, for any $M \in \mathbb{R}$, $\exists r > 0$ such that $f(x) \geq M$ for all $x \in S$ where $\|x\| \geq r$
  • Take a point $x_0 \in S$, and set $M = f(x_0)$
  • Let $Y = S\cap \{\|x\| \geq r\}$, so that $f(x) \geq f(x_0)$ for all $x \in Y$
  • And we already know that $f$ achieves a minimum (which is at most $f(x_0)$) on the closed and bounded set $S\cap \{\|x\| \leq r\}$
  • Hence $f$ achieves a minimum on $S$ $\quad \square$

Coercive Functions: Examples

  • $f = x^2 + y^2$: coercive on $\mathbb{R}^2$
  • $f = x^2 - y^2$: not coercive on $\mathbb{R}^2$,
    since $f(0,y) \to -\infty$ as $|y| \to \infty$
  • $f = 1-e^{-(x^2+y^2)}$: not coercive on $\mathbb{R}^2$,
    since $f(x,y) \to 1$ as $x^2+y^2 \to \infty$

Convex Functions

  • An important concept for uniqueness is convexity
  • A set $S \subset \mathbb{R}^n$ is convex if it contains the line segment between any two of its points
  • That is, $S$ is convex if for any $x,y \in S$, we have \[ \{\theta x + (1-\theta)y : \theta \in [0,1] \} \subset S \]

Convex Functions

  • Similarly, we define convexity of a function $f : S \subset \mathbb{R}^n \to \mathbb{R}$
  • $f$ is convex if its graph along any line segment in $S$ is on or below the chord connecting the function values
  • That is, $f$ is convex if for any $x,y \in S$ and any $\theta \in (0,1)$, we have \[ f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) \]
  • If, in addition, for any $x \neq y$ \[ f(\theta x + (1-\theta)y) < \theta f(x) + (1-\theta)f(y) \] then $f$ is strictly convex

Convex Functions: Examples

  • $f = x^2 + y^2$: convex on $\mathbb{R}^2$
  • $f = x^2 - y^2$: not convex on $\mathbb{R}^2$
  • $f = \max(1,\;x^2+ (y+1)^2)$: convex but not strictly convex on $\mathbb{R}^2$

Convex Functions

  • If $f$ is a convex function on a convex set $S$,
    then any local minimum of $f$ must be a global minimum
  • Proof (1/2): Suppose $x$ is a local minimum,
    i.e. there is $\epsilon>0$ so that $f(x) \leq f(y)$ for $y \in B(x,\epsilon)$,
    where $B(x,\epsilon) = \{ y \in S : \|y-x\| \leq \epsilon\}$
  • Suppose that $x$ is not a global minimum,
    i.e. that there exists $w \in S$ such that $f(w) < f(x)$
  • We will show that this gives a contradiction
    by drawing a line segment between $x$ and $w$

Convex Functions

Proof (2/2):

  • For $\theta\in[0,1]$ we have $f(\theta w + (1-\theta)x) \leq \theta f(w) + (1-\theta)f(x)$
  • Let $\sigma \in (0,1]$ be sufficiently small so that \[ z = \sigma w + \left(1-\sigma\right)x \in B(x,\epsilon) \]
  • Then \[ f(z) \leq \sigma f(w) + \left(1-\sigma\right)f(x) < \sigma f(x) + \left(1-\sigma\right)f(x) = f(x), \] i.e. $f(z) < f(x)$, which contradicts that $x$ is a local minimum
  • Hence we cannot have $w \in S$ such that $f(w) < f(x)$ $\quad \square$

Convex Functions

  • Note that convexity does not guarantee uniqueness of global minimum
  • However, if $f$ is a strictly convex function on a convex set $S$,
    then a local minimum of $f$ is the unique global minimum
  • Optimization of convex functions over convex sets is called
    convex optimization, which is an important field in optimization

Optimality Conditions

  • We have discussed existence and uniqueness of minima,
    but haven’t considered how to find a minimum
  • The familiar optimization idea from calculus in one dimension is:
    set derivative to zero, check the sign of the second derivative
  • This can be generalized to $\mathbb{R}^n$

Optimality Conditions

  • If $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable,
    then the gradient vector $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$ is \[ \nabla f(x) = \left[ \begin{array}{c} \frac{\partial f(x)}{\partial x_1}\\ \frac{\partial f(x)}{\partial x_2}\\ \vdots\\ \frac{\partial f(x)}{\partial x_n} \end{array} \right] \]
  • The importance of the gradient is that $\nabla f$ points “uphill”,
    i.e. towards points with larger values than $f(x)$
  • And similarly $-\nabla f$ points “downhill”

Optimality Conditions

  • This follows from Taylor’s theorem for $f : \mathbb{R}^n \to \mathbb{R}$
  • Recall that \[ f(x+\delta) - f(x) = \nabla f(x)^T \delta + \text{h.o.t.} \]
  • Let $\delta = -\epsilon\nabla f(x)$ for $\epsilon > 0$ and suppose that $\nabla f(x) \neq 0$, then: \[ f(x-\epsilon\nabla f(x)) - f(x) \approx -\epsilon\nabla f(x)^T \nabla f(x) < 0 \]
  • Also, we see from Cauchy–Schwarz that
    \[ \Big|\nabla f(x)^T\frac{\delta}{\|\delta\|_2}\Big| \leq \Big|\nabla f(x)^T \frac{\nabla f(x)}{\|\nabla f(x)\|_2}\Big| \] so $-\nabla f(x)$ is the steepest descent direction

Optimality Conditions

  • Similarly, we see that a necessary condition
    for a local minimum at $x^\ast \in S$ is that $\nabla f(x^\ast) = 0$
  • In this case there is no “downhill direction” at $x^\ast$
  • The condition $\nabla f(x^\ast) = 0$ is called
    a first-order necessary condition for optimality,
    since it only involves first derivatives

Optimality Conditions

  • A point $x^\ast \in S$ that satisfies the first-order optimality condition
    is called a critical point of $f$
  • A critical point can be
    a local minimum, local maximum, or saddle point
  • A saddle point is where some directions are “downhill”
    and others are “uphill”, e.g. $(x,y)=(0,0)$ for $f(x,y) = x^2 - y^2$

Optimality Conditions

  • As in the one-dimensional case, we can look
    at second derivatives to classify critical points
  • If $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable, then
    the Hessian is the matrix-valued function $H_f : \mathbb{R}^n \to \mathbb{R}^{n\times n}$ \[ H_f(x) = \left[ \begin{array}{cccc} \frac{\partial^2f(x)}{\partial x_1^2} & \frac{\partial^2f(x)}{\partial x_1 x_2} & \cdots & \frac{\partial^2f(x)}{\partial x_1x_n} \\ \frac{\partial^2f(x)}{\partial x_2x_1} & \frac{\partial^2f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2f(x)}{\partial x_2x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2f(x)}{\partial x_nx_1} & \frac{\partial^2f(x)}{\partial x_n x_2} & \cdots & \frac{\partial^2f(x)}{\partial x_n^2} \end{array} \right] \]
  • The Hessian is the Jacobian matrix of the gradient $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$
  • If the second partial derivatives of $f$ are continuous,
    then $\partial^2 f/\partial x_i \partial x_j = \partial^2 f/\partial x_j \partial x_i$, and $H_f$ is symmetric

Optimality Conditions

  • Suppose we have found a critical point $x^\ast$, so that $\nabla f(x^\ast) = 0$
  • From Taylor’s theorem, for $\delta \in \mathbb{R}^n$, we have \[ \begin{aligned} f(x^\ast + \delta) &= f(x^\ast) + \nabla f(x^\ast)^T \delta + \frac{1}{2}\delta^T H_f(x^\ast + \eta \delta) \delta\\ &= f(x^\ast) + \frac{1}{2}\delta^T H_f(x^\ast + \eta \delta) \delta\end{aligned} \] for some $\eta \in (0,1)$

Optimality Conditions

  • Recall positive definiteness: $A$ is positive definite if $x^T A x > 0$ for all $x \neq 0$
  • Suppose $H_f(x^\ast)$ is positive definite
  • Then (by continuity) $H_f(x^\ast + \eta\delta)$ is also positive definite
    for $\|\delta\|$ sufficiently small, so that: $\delta^T H_f(x^\ast + \eta \delta) \delta > 0$
  • Hence, we have $f(x^\ast + \delta) > f(x^\ast)$ for $\|\delta\|$ sufficiently small,
    i.e. $x^\ast$ is a local minimum
  • Positive definiteness of $H_f$ at a critical point $x^\ast$
    is a second-order sufficient condition for a local minimum

Optimality Conditions

  • A matrix can also be negative definite: $x^T A x < 0$ for all $x \neq 0$
  • Or indefinite: there exist $x,y$ such that $x^T A x < 0 < y^T A y$
  • Then we can classify critical points as follows:
    • $H_f(x^\ast)$ positive definite  $\implies$ $x^\ast$ is a local minimum
    • $H_f(x^\ast)$ negative definite $\implies$ $x^\ast$ is a local maximum
    • $H_f(x^\ast)$ indefinite $\implies$ $x^\ast$ is a saddle point

Optimality Conditions

  • Also, positive definiteness of the Hessian
    is closely related to convexity of $f$
  • If $H_f(x)$ is positive definite, then $f$ is convex
    on some convex neighborhood of $x$
  • If $H_f(x)$ is positive definite for all $x \in S$,
    where $S$ is a convex set, then $f$ is convex on $S$
  • Question: How do we test for positive definiteness?

Optimality Conditions

  • Answer: For a symmetric matrix $A$
    • $A$ is positive definite if and only if all eigenvalues of $A$ are positive,
    • $A$ is negative definite if and only if all eigenvalues of $A$ are negative
  • Also, a matrix with positive and negative eigenvalues is indefinite
  • Hence we can compute all the eigenvalues of $A$ and check their signs

Optimality Conditions: Example

  • From Heath’s book (Example 6.5)
  • Consider \[ f(x) = 2x_1^3 + 3x_1^2 + 12 x_1x_2 + 3x_2^2 - 6x_2 + 6 \]
  • Then \[ \nabla f(x) = \left[ \begin{array}{c} 6x_1^2 + 6x_1 + 12x_2\\ 12x_1 + 6x_2 - 6 \end{array} \right] \]
  • We set $\nabla f(x) = 0$ to find critical points $[1,-1]^T$ and $[2,-3]^T$

Optimality Conditions: Example

  • The Hessian is \[ H_f(x) = \left[ \begin{array}{cc} 12x_1 + 6 & 12\\ 12 & 6 \end{array} \right] \]
  • and hence \[ \begin{aligned} H_f(1,-1) &= \left[ \begin{array}{cc} 18 & 12\\ 12 & 6 \end{array} \right], \text{ which has eigenvalues } 25.4, -1.4\\ H_f(2,-3) &= \left[ \begin{array}{cc} 30 & 12\\ 12 & 6 \end{array} \right], \text{ which has eigenvalues } 35.0, 1.0\end{aligned} \]
  • Hence $[2,-3]^T$ is a local minimum whereas $[1,-1]^T$ is a saddle point
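
  • The classification can be checked numerically, e.g.:

import numpy as np

# Hessians at the two critical points
H1 = np.array([[18, 12], [12, 6]])  # at [1, -1]
H2 = np.array([[30, 12], [12, 6]])  # at [2, -3]
print(np.linalg.eigvalsh(H1))  # approx [-1.4, 25.4]: indefinite, saddle point
print(np.linalg.eigvalsh(H2))  # approx [1.0, 35.0]: positive definite, minimum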

Optimization Methods

Steepest Descent

  • One gradient-based method for
    unconstrained optimization is steepest descent
  • Key idea: The negative gradient $-\nabla f(x)$
    points in the “steepest downhill” direction for $f$ at $x$
  • An iterative method for minimizing $f$ is obtained
    by following $-\nabla f(x_k)$ at each step
  • Question: How far should we go in the direction of $-\nabla f(x_k)$?

Steepest Descent

  • We can try to find the best step size via an easier subproblem
  • For a direction $s \in \mathbb{R}^n$, let $\phi : \mathbb{R}\to \mathbb{R}$ be given by \[ \phi(\eta) = f(x + \eta s) \]
  • Then minimizing $f$ along $s$ corresponds
    to minimizing the one-dimensional function $\phi$
  • This process of minimizing $f$ along a line is called a line search

Steepest Descent

  • Putting these pieces together leads to the steepest descent method:

    1:$\hspace{0em}$choose initial guess $x_0$
    2:$\hspace{0em}$for $k = 0,1,2,\ldots$ do
    3:$\hspace{1.2em}$$s_k = -\nabla f(x_k)$
    4:$\hspace{1.2em}$choose $\eta_k$ to minimize $f(x_k + \eta_k s_k)$
    5:$\hspace{1.2em}$$x_{k+1} = x_k + \eta_k s_k$
    6:$\hspace{0em}$end for

  • However, steepest descent often converges very slowly
  • Steepest descent is part of HW4
  • A simpler option is to use a constant step size $\eta_k=\eta$, as in the sketch below
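
  • A minimal sketch with a constant step size
    (the quadratic objective, step size, and tolerance are illustrative):

import numpy as np

def grad_f(x):
    # gradient of the illustrative quadratic f(x) = x_1^2 + 5 x_2^2
    return np.array([2 * x[0], 10 * x[1]])

x = np.array([1.0, 1.0])  # initial guess
eta = 0.05                # constant step size (illustrative)
for k in range(500):
    s = -grad_f(x)
    if np.linalg.norm(s) < 1e-8:
        break
    x = x + eta * s
print(x)  # approaches the minimum [0, 0]

  • The error decays much more slowly along $x_1$ than along $x_2$,
    which illustrates the slow convergence on ill-conditioned problems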

Newton’s Method

  • We can get faster convergence by using more information about $f$
  • Note that $\nabla f(x) = 0$ is a system of nonlinear equations,
    so we can solve it with quadratic convergence via Newton’s method
  • The Jacobian matrix of $\nabla f(x)$ is $H_f(x)$ and
    therefore Newton’s method for unconstrained optimization is:

    1:$\hspace{0em}$choose initial guess $x_0$
    2:$\hspace{0em}$for $k = 0,1,2,\ldots$ do
    3:$\hspace{1.2em}$solve $H_f(x_k)s_k = -\nabla f(x_k)$
    4:$\hspace{1.2em}$$x_{k+1} = x_k + s_k$
    5:$\hspace{0em}$end for

Newton’s Method

  • We can also interpret Newton’s method as seeking a stationary point
    based on a sequence of local quadratic approximations
  • Recall that for small $\delta$ \[ f(x+\delta) \approx f(x) + \nabla f(x)^T \delta + \frac{1}{2} \delta^T H_f(x)\delta = q(\delta) \] where $q(\delta)$ is quadratic in $\delta$ (for a fixed $x$)
  • We find stationary point of $q$ in the usual way: \[ \nabla q(\delta) = \nabla f(x) + H_f(x)\delta = 0 \]
  • This leads to $H_f(x)\delta = -\nabla f(x)$, as in the previous slide

Newton’s Method: Example

  • Rosenbrock function \[ f(x,y) = 100 (y - x^2)^2 + (1 - x)^2 \] with minimum 0 at $(x,y)=(1,1)$
  • See [examples/unit4/rosenbrock.py],
    Rosenbrock function minimized with Newton’s method
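
  • A sketch of the raw Newton iteration with a hand-coded gradient and Hessian
    (the initial guess is illustrative; the example file may differ):

import numpy as np

def grad(v):
    # gradient of the Rosenbrock function
    x, y = v
    return np.array([-400 * x * (y - x**2) - 2 * (1 - x),
                     200 * (y - x**2)])

def hess(v):
    # Hessian of the Rosenbrock function
    x, y = v
    return np.array([[1200 * x**2 - 400 * y + 2, -400 * x],
                     [-400 * x, 200.0]])

v = np.array([-1.0, 1.0])  # initial guess (illustrative)
for k in range(50):
    s = np.linalg.solve(hess(v), -grad(v))
    v = v + s
    if np.linalg.norm(s) < 1e-12:
        break
print(v)  # approx [1, 1]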

Newton’s Method: Robustness

  • Newton’s method generally converges much faster than steepest descent
  • However, Newton’s method can be unreliable far away from a solution
  • To improve robustness during early iterations
    it is common to perform a line search in the Newton step direction
  • A line search can also ensure we don’t approach a local maximum
    (instead of a minimum), as can happen with the raw Newton method
  • Since the line search modifies the Newton step size,
    the result is often referred to as a damped Newton method

Newton’s Method: Robustness

  • Another way to improve robustness is with trust region methods
  • At each iteration $k$, a “trust radius” $R_k$ is computed
  • This determines a region surrounding $x_k$
    on which we “trust” the quadratic approximation
  • We require $\|x_{k+1}-x_k\| \leq R_k$,
    which is a constrained optimization problem
    (with quadratic objective function) at each step

Newton’s Method: Robustness

  • Size of $R_{k+1}$ is based on comparing actual change,
    $f(x_{k+1}) - f(x_k)$, to change predicted by the quadratic model
  • If quadratic model is accurate, we expand the trust radius,
    otherwise we contract it
  • When close to a minimum, $R_k$ should be large enough
    to allow full Newton steps $\implies$ eventual quadratic convergence

Quasi-Newton Methods

  • Possible drawbacks of Newton’s method
    • unreliable: only converges when sufficiently close to a minimum
    • expensive: the Hessian $H_f$ is dense in general,
      making the method expensive if $n$ is large
    • complicated: can be impractical to compute the Hessian exactly
  • Methods that do not require the Hessian but achieve
    superlinear convergence are quasi-Newton methods

Quasi-Newton Methods

  • General form of quasi-Newton methods: \[ x_{k+1} = x_k - \alpha_k B_k^{-1}\nabla f(x_k) \] where $\alpha_k$ is a line search parameter and
    $B_k$ is some approximation to the Hessian
  • Quasi-Newton methods generally lose quadratic convergence
    of Newton’s method, but often achieve superlinear convergence
  • We now consider some specific quasi-Newton methods

BFGS

  • The Broyden–Fletcher–Goldfarb–Shanno (BFGS) method
    is one of the most popular quasi-Newton methods

    1:$\hspace{0em}$choose initial guess $x_0$
    2:$\hspace{0em}$choose $B_0$, initial guess for Hessian, e.g. $B_0 = {\rm I}$
    3:$\hspace{0em}$for $k = 0,1,2,\ldots$ do
    4:$\hspace{1.2em}$solve $B_k s_k = -\nabla f(x_k)$
    5:$\hspace{1.2em}$$x_{k+1} = x_k + s_k$
    6:$\hspace{1.2em}$$y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
    7:$\hspace{1.2em}$$B_{k+1} = B_k + \Delta B_k$
    8:$\hspace{0em}$end for

where $\Delta B_k = \frac{y_ky_k^T}{y_k^Ts_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_ks_k}$

BFGS

  • Basic idea is that $B_k$ accumulates second derivative information
    on successive iterations and eventually approximates $H_f$ well
  • BFGS is implemented in scipy.optimize.fmin_bfgs()
  • See [examples/unit4/rosenbrock.py],
    Rosenbrock function minimized with BFGS
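
  • A usage sketch with scipy (rosen and rosen_der are the Rosenbrock
    function and its gradient, both provided by scipy.optimize):

import numpy as np
from scipy.optimize import fmin_bfgs, rosen, rosen_der

x0 = np.array([-1.0, 1.0])  # initial guess (illustrative)
xmin = fmin_bfgs(rosen, x0, fprime=rosen_der)
print(xmin)  # approx [1, 1]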

BFGS: Derivation

  • Replace Newton’s update $H_f(x_k)s_k=-\nabla f(x_k)$ with \[ B_ks_k=-\nabla f(x_k) \] where $s_k = x_{k+1}-x_k$
  • Define $B_{k+1}\in\mathbb{R}^{n\times n}$ to satisfy the requirements
    • $B_{k+1}$ is obtained by a “small” change from $B_k$
    • $B_{k+1}$ is symmetric and positive definite
    • $B_{k+1}\approx H_f(x_{k+1})$

BFGS: Derivation

  • In particular, we want $B_{k+1}s_k\approx H_f(x_{k+1})s_k$
  • The product $H_f(x_{k+1})s_k$ is the directional derivative of $\nabla{f}$ along $s_k$
    and can be approximated by the difference $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ \[ H_f(x_{k+1})s_k = \lim_{h\to0} \frac{\nabla f(x_{k+1}) - \nabla f(x_{k+1} - hs_k)}{h} \underset{h=1}{\approx} \nabla f(x_{k+1}) - \nabla f(x_k) = y_k \]
  • Impose the requirement $B_{k+1}s_k = y_k$ exactly

BFGS: Derivation

  • Look for $B_{k+1}$ in the form of a rank-two update \[ B_{k+1} = B_k -\beta vv^T + \alpha uu^T \] with unknown $\alpha,\beta\in\mathbb{R}$ and $u,v\in\mathbb{R}^n$
    • impose $(B_k -\beta vv^T)s_k=0$ \[ 0=(B_k -\beta vv^T)s_k=B_ks_k - \beta vv^Ts_k=B_ks_k - (\beta v^Ts_k)v \] which is achieved by $v=B_ks_k$ and $\beta=\frac{1}{s_k^TB_ks_k}$
    • impose $\alpha uu^Ts_k=y_k$ \[ y_k=\alpha uu^Ts_k=(\alpha u^Ts_k)u \] which is achieved by $u=y_k$ and $\alpha=\frac{1}{y_k^T s_k}$
  • This implies $B_{k+1}s_k=y_k$ and recovers the BFGS algorithm above

BFGS: Derivation

  • Note that if $B_k$ is symmetric and positive definite,
    then $B_k-\beta vv^T=B_k-\frac{B_k s_k s_k^T B_k}{s_k^TB_ks_k}$ is positive semi-definite
  • Under the assumption $y_k^Ts_k>0$, known as the curvature condition,
    the matrix $\alpha uu^T=\frac{y_ky_k^T}{y_k^Ts_k}$ is positive definite
  • Therefore, $B_{k+1} = B_k-\beta vv^T + \alpha uu^T$ is positive definite

BFGS: Inverse Hessian

  • Actual implementation of BFGS: store and update
    the inverse approximate Hessian $H_k$ to avoid solving a linear system

    1:$\hspace{0em}$choose initial guess $x_0$
    2:$\hspace{0em}$choose $H_0$, initial guess for inverse Hessian, e.g. $H_0 = {\rm I}$
    3:$\hspace{0em}$for $k = 0,1,2,\ldots$ do
    4:$\hspace{1.2em}$$s_k = - H_k \nabla f(x_k)$
    5:$\hspace{1.2em}$$x_{k+1} = x_k + s_k$
    6:$\hspace{1.2em}$$y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
    7:$\hspace{1.2em}$$H_{k+1} = (I-\rho_k s_k y_k^T)H_k(I-\rho_k y_k s_k^T) + \rho_k s_k s_k^T$
    8:$\hspace{0em}$end for

where $\rho_k=\frac{1}{y_k^T s_k}$
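
  • A direct transcription of this algorithm on an illustrative quadratic;
    a naive backtracking line search is added for robustness
    (the algorithm above takes full steps):

import numpy as np

def f(x):
    return 0.5 * (x[0]**2 + 10 * x[1]**2)  # illustrative quadratic

def grad(x):
    return np.array([x[0], 10 * x[1]])

x = np.array([1.0, 1.0])
H = np.eye(2)  # initial inverse Hessian approximation H_0 = I
for k in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    s = -H @ g
    # backtracking line search (Armijo condition), an addition for robustness
    t = 1.0
    while f(x + t * s) > f(x) + 1e-4 * t * (g @ s):
        t *= 0.5
    s = t * s
    y = grad(x + s) - g
    rho = 1.0 / (y @ s)
    I = np.eye(2)
    H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
    x = x + s
print(x)  # approx [0, 0]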

BFGS: Inverse Hessian

  • The update rule for $H_{k+1}$ follows from the update rule for $B_{k+1}$
    and the Sherman–Morrison–Woodbury formula \[ (A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^TA^{-1}U)^{-1}V^TA^{-1} \] where $A\in\mathbb{R}^{n\times n}$ and $U,V\in\mathbb{R}^{n \times p}$
  • Provides a low-rank update of the inverse
    from a low-rank update of the matrix
  • In our case \[ B_{k+1} =B_k + UV^T = B_k+ \frac{1}{y_k^T s_k} y_k y_k^T-\frac{1}{s_k^T B_k s_k} B_ks_k s_k^T B_k \] \[ U=\Big[\frac{1}{y_k^T s_k} y_k \;\;\; -\!\frac{1}{s_k^T B_k s_k} B_k s_k\Big],\quad V=\big[y_k \;\;\; B_k s_k\big]\in\mathbb{R}^{n\times 2} \]

BFGS: Modifications

  • Typically, the search direction $s_k$ is adjusted
    by a more robust inexact line search, e.g. Wolfe conditions
  • Limited-memory BFGS (L-BFGS) avoids storing the full $H_k$
    and instead represents $H_k$ implicitly using a limited history
    of gradient evaluations. Suited for large-scale problems
  • Extra reading: Nocedal & Wright. Numerical Optimization, 1999
    (chapters 6 and 7)

Constrained Optimization

Constrained Optimization

  • So far we have ignored constraints
  • Now we consider equality constrained optimization \[ \min_{x\in\mathbb{R}^n} f(x) \quad \text{ subject to } \quad g(x) = 0, \] where $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^n \to \mathbb{R}^m$, with $m \leq n$
  • There are $n$ unknowns and $m$ constraints
  • This problem is solved with Lagrange multipliers

Constrained Optimization

  • We illustrate the concept of Lagrange multipliers for $f, g : \mathbb{R}^2 \to \mathbb{R}$
  • Let $f(x,y) = x + y$ and $g(x,y) = 2x^2 + y^2 - 5$,
    with feasible set $S = \{(x,y) : g(x,y) = 0\}$

  • $\nabla g$ is normal to $S$: at any $x \in S$ we must move in the direction
    $(\nabla g(x))_\perp$ (the tangent direction) to remain in $S$

Constrained Optimization

  • Also, change in $f$ due to infinitesimal step in direction $(\nabla g(x))_\perp$ is \[ f(x \pm \epsilon(\nabla g(x))_\perp) = f(x) \pm \epsilon\nabla f(x)^T(\nabla g(x))_\perp + \text{h.o.t.} \]
  • A critical point $x^\ast \in S$ satisfies $\nabla f(x^\ast)^T(\nabla g(x^\ast))_\perp = 0$, or \[ \nabla f(x^\ast) = \lambda^\ast \nabla g(x^\ast), \quad \text{for some } \lambda^\ast \in \mathbb{R} \]

Constrained Optimization

  • This shows that for a stationary point with $m=1$ constraints,
    $\nabla f$ cannot have any component in the “tangent direction” to $S$
  • Now, consider the case with $m > 1$ equality constraints
  • Then $g : \mathbb{R}^n \to \mathbb{R}^m$ and we have the gradients $\nabla g_i$, $i=1,\ldots,m$
  • Then the feasible set is $S = \{ x \in \mathbb{R}^n : g_i(x) = 0, i=1,\ldots,m\}$
  • Any “tangent direction” at $x \in S$ must be orthogonal to all
    gradient vectors $\{\nabla g_i(x), i=1,\ldots,m\}$ to remain in $S$

Constrained Optimization

  • Let $\mathcal{T}(x) = \{ v \in \mathbb{R}^n : \nabla g_i(x)^T v = 0, i=1,2,\ldots,m\}$
    denote the orthogonal complement of $\{\nabla g_i(x), i=1,\ldots,m\}$
  • Then, for $\delta \in \mathcal{T}(x)$ and $\epsilon>0$, $\epsilon \delta$ is a step in a “tangent direction” of $S$ at $x$
  • Since we have \[ f(x^\ast + \epsilon\delta) = f(x^\ast) + \epsilon\nabla f(x^\ast)^T\delta + \text{h.o.t.} \] it follows that for a stationary point we need
    \[ \nabla f(x^\ast)^T\delta = 0 \;\text{ for all }\; \delta \in \mathcal{T}(x^\ast) \]

Constrained Optimization

  • We require that at a stationary point $x^\ast \in S$ we have \[ \nabla f(x^\ast) \in \mathop{\mathrm{span}}\{\nabla g_i(x^\ast), i=1,\ldots,m\} \]
  • This can be written as a linear system \[ \nabla f(x^\ast) = (J_g(x^\ast))^T \lambda^\ast \] for some $\lambda^\ast \in \mathbb{R}^m$, where $(J_g(x^\ast))^T \in \mathbb{R}^{n\times m}$
  • This follows because the columns of $(J_g(x^\ast))^T$
    are the vectors $\{\nabla g_i(x^\ast), i=1,\ldots,m\}$

Constrained Optimization

  • We can write equality constrained optimization problems more concisely
    by introducing the Lagrangian function, $\mathcal{L}: \mathbb{R}^{n+m} \to \mathbb{R}$, \[ \begin{aligned} \mathcal{L}(x,\lambda) &= f(x) + \lambda^T g(x)\\ &= f(x) + \lambda_1 g_1(x) + \cdots + \lambda_m g_m(x)\end{aligned} \]
  • Then \[ \begin{array}{llll} \frac{\partial \mathcal{L}(x,\lambda)}{\partial x_i} &= \frac{\partial f(x)}{\partial x_i} + \lambda_1 \frac{\partial g_1(x)}{\partial x_i} + \cdots + \lambda_m \frac{\partial g_m(x)}{\partial x_i}, & i=1,\ldots,n\\ \\ \frac{\partial \mathcal{L}(x,\lambda)}{\partial \lambda_i} &= g_i(x), & i=1,\ldots,m \end{array} \]

Constrained Optimization

  • In matrix form \[ \nabla \mathcal{L}(x,\lambda) = \left[ \begin{array}{c} \nabla_x \mathcal{L}(x,\lambda)\\ \nabla_\lambda\mathcal{L}(x,\lambda) \end{array} \right] = \left[ \begin{array}{c} \nabla f(x) + J_g(x)^T\lambda\\ g(x) \end{array} \right], \]
  • Therefore, the first-order necessary optimality condition
    for the constrained problem can be written as a nonlinear system \[ \nabla\mathcal{L}(x,\lambda) = \left[ \begin{array}{c} \nabla f(x) + J_g(x)^T\lambda\\ g(x) \end{array} \right] = 0 \]

Constrained Optimization: Examples

  • Consider a cylinder with radius $x_1$ and height $x_2$
  • Minimize the surface area of a cylinder subject to a constraint on its volume

    \[ \min_x f(x_1,x_2) = 2\pi x_1(x_1 + x_2) \] \[ \text{ subject to } g(x_1,x_2) = \pi x_1^2 x_2 - V = 0 \]
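
  • As a sketch, the stationarity conditions $\nabla\mathcal{L} = 0$ for this problem
    can be solved numerically ($V$ and the initial guess are illustrative):

import numpy as np
from scipy.optimize import fsolve

V = 1.0  # prescribed volume (illustrative)

def grad_L(v):
    # gradient of L(x1, x2, lam) = 2 pi x1 (x1 + x2) + lam (pi x1^2 x2 - V)
    x1, x2, lam = v
    return [2 * np.pi * (2 * x1 + x2) + 2 * np.pi * lam * x1 * x2,
            2 * np.pi * x1 + np.pi * lam * x1**2,
            np.pi * x1**2 * x2 - V]

x1, x2, lam = fsolve(grad_L, [1.0, 1.0, -1.0])
print(x1, x2, x2 / x1)  # at the optimum x2 = 2*x1: the height equals the diameter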

Constrained Optimization: Examples

  • Another example is the underdetermined linear
    least squares problem from Unit 1 \[ \min_{b\in\mathbb{R}^n} f(b) \quad \text{ subject to } \quad g(b) = 0, \] where $f(b) = b^T b$, $g(b) = Ab - y$ and $A \in \mathbb{R}^{m\times n}$ with $m < n$

Constrained Optimization: Examples

  • Introducing Lagrange multipliers gives \[ \mathcal{L}(b,\lambda) = b^Tb + \lambda^T(Ab - y) \] where $b \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}^{m}$
  • And the necessary optimality condition $\nabla\mathcal{L}(b,\lambda) = 0$ is \[ \left[ \begin{array}{c} \nabla f(b) + J_g(b)^T\lambda\\ g(b) \end{array} \right] = \left[ \begin{array}{c} 2b + A^T\lambda\\ Ab-y \end{array} \right] = 0 \in \mathbb{R}^{n+m} \]

Constrained Optimization: Examples

  • We obtain the $(n+m)\times(n+m)$ square linear system \[ \left[ \begin{array}{cc} 2{\rm I} & A^T\\ A & 0 \end{array} \right] \left[ \begin{array}{c} b\\ \lambda \end{array} \right] = \left[ \begin{array}{c} 0\\ y \end{array} \right] \] which we can solve for $\left[\begin{array}{c}b \\ \lambda\end{array}\right] \in \mathbb{R}^{n+m}$

Constrained Optimization: Examples

  • We have $b = -\frac{1}{2}A^T\lambda$ from the first “block row”
  • Substituting into $Ab = y$ (the second “block row”) yields $\lambda = -2(AA^T)^{-1}y$
  • And hence \[ b = -\frac{1}{2}A^T\lambda = A^T (AA^T)^{-1} y \] which was the solution we introduced (but didn’t derive) in Unit 1
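
  • A numerical check of this derivation (random data is illustrative):

import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# assemble and solve the (n+m) x (n+m) KKT system
K = np.block([[2 * np.eye(n), A.T],
              [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(n), y]))
b = sol[:n]

# compare with the closed form b = A^T (A A^T)^{-1} y
print(np.allclose(b, A.T @ np.linalg.solve(A @ A.T, y)))  # True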

Sequential Quadratic Programming

Sequential Quadratic Programming

  • Consider equality constrained minimization \[ \min_{x\in\mathbb{R}^n} f(x) \quad \text{subject to} \quad g(x) = 0 \] where $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^n \to \mathbb{R}^m$
  • With the Lagrangian $\mathcal{L}(x,\lambda) = f(x) + \lambda^T g(x)$,
    the necessary condition for optimality is \[ \nabla\mathcal{L}(x,\lambda) = \left[ \begin{array}{c} \nabla f(x) + J_g^T(x)\lambda\\ g(x) \end{array} \right] = 0 \]
  • Once again, this is a nonlinear system of equations
    that can be solved using Newton’s method

Sequential Quadratic Programming

  • To derive the Jacobian of this system, we write \[ \nabla\mathcal{L}(x,\lambda) = \left[ \begin{array}{c} \nabla f(x) + \sum_{k=1}^m \lambda_k \nabla g_k(x)\\ g(x) \end{array} \right] \in \mathbb{R}^{n+m} \]
  • Then we differentiate w.r.t to $x\in \mathbb{R}^n$ and $\lambda\in \mathbb{R}^m$
  • For $i=1,\ldots,n$, we have \[ (\nabla \mathcal{L}(x,\lambda))_i = \frac{\partial f(x)}{\partial x_i} + \sum_{k=1}^m \lambda_k \frac{\partial g_k(x)}{\partial x_i} \]
  • Differentiating w.r.t $x_j$, for $i,j=1,\ldots,n$, gives \[ \frac{\partial}{\partial x_j}(\nabla \mathcal{L}(x,\lambda))_i = \frac{\partial^2 f(x)}{\partial x_i \partial x_j} + \sum_{k=1}^m \lambda_k \frac{\partial^2 g_k(x)}{\partial x_i\partial x_j} \]

Sequential Quadratic Programming

  • The top-left $n\times n$ block of the Jacobian of $\nabla \mathcal{L}(x,\lambda)$ is \[ B(x,\lambda) = H_f(x) + \sum_{k=1}^m\lambda_k H_{g_k}(x) \in \mathbb{R}^{n\times n} \]
  • Differentiating $(\nabla \mathcal{L}(x,\lambda))_i$ w.r.t $\lambda_j$, for $i=1,\ldots,n$, $j=1,\ldots,m$, gives \[ \frac{\partial}{\partial \lambda_j}(\nabla \mathcal{L}(x,\lambda))_i = \frac{\partial g_j(x)}{\partial x_i} \]
  • The top-right $n \times m$ block of the Jacobian of $\nabla \mathcal{L}(x,\lambda)$ is \[ J_g(x)^T \in \mathbb{R}^{n\times m} \]

Sequential Quadratic Programming

  • For $i=n+1,\ldots,n+m$, we have \[ (\nabla \mathcal{L}(x,\lambda))_i = g_{i-n}(x) \]
  • Differentiating $(\nabla \mathcal{L}(x,\lambda))_i$ w.r.t $x_j$, for $i=n+1,\ldots,n+m$, $j=1,\ldots,n$, gives \[ \frac{\partial}{\partial x_j}(\nabla \mathcal{L}(x,\lambda))_i = \frac{\partial g_{i-n}(x)}{\partial x_j} \]
  • The bottom-left $m \times n$ block of the Jacobian of $\nabla \mathcal{L}(x,\lambda)$ is \[ J_g(x) \in \mathbb{R}^{m\times n} \]
  • The final $m\times m$ bottom-right block is zero ($g_i(x)$ does not depend on $\lambda_j$)

Sequential Quadratic Programming

  • We have derived the following Jacobian matrix for $\nabla\mathcal{L}(x,\lambda)$ \[ \left[ \begin{array}{cc} B(x,\lambda) & J_g^T(x)\\ J_g(x) & 0 \end{array} \right] \in \mathbb{R}^{(m+n)\times(m+n)} \]
  • Note the $2\times 2$ block structure of this matrix
  • Matrices with this structure are called KKT matrices
    after Karush, Kuhn, and Tucker

Sequential Quadratic Programming

  • Therefore, Newton’s method for $\nabla\mathcal{L}(x,\lambda) = 0$ is \[ \left[ \begin{array}{cc} B(x_k,\lambda_k) & J_g^T(x_k)\\ J_g(x_k) & 0 \end{array} \right] \left[ \begin{array}{c} s_k\\ \delta_k \end{array} \right] = - \left[ \begin{array}{c} \nabla f(x_k) + J_g^T(x_k)\lambda_k\\ g(x_k) \end{array} \right] \] for $k=0,1,2,\ldots$
  • Here $(s_k,\delta_k) \in \mathbb{R}^{n+m}$ is the $k$-th Newton step

Sequential Quadratic Programming

  • Now, consider the constrained minimization problem,
    where $(x_k,\lambda_k)$ is our Newton iterate at step $k$: \[ \min_s \left\{\frac{1}{2} s^TB(x_k,\lambda_k)s + s^T(\nabla f(x_k) + J_g^T(x_k)\lambda_k)\right\} \] \[ \text{subject to} \quad J_g(x_k) s + g(x_k) = 0 \]
  • The objective function is quadratic in $s$ (here $x_k$, $\lambda_k$ are constants)
  • This minimization problem has Lagrangian \[ \begin{aligned} \mathcal{L}_k(s,\delta) &= \frac{1}{2} s^TB(x_k,\lambda_k)s + s^T(\nabla f(x_k) + J_g^T(x_k)\lambda_k)\\ &+ \delta^T (J_g(x_k)s + g(x_k))\end{aligned} \]

Sequential Quadratic Programming

  • Then solving $\nabla\mathcal{L}_k(s,\delta) = 0$ (i.e. first-order necessary conditions)
    gives a linear system, which is the same as the $k$-th Newton step
  • Therefore, at each step of Newton’s method, we exactly solve
    a minimization problem with a quadratic objective and linear constraints
  • Optimization of this type is called quadratic programming
  • Therefore, Newton’s method applied to $\nabla\mathcal{L}(x,\lambda) = 0$
    is called sequential quadratic programming (SQP)

Sequential Quadratic Programming

  • SQP is an important method, and there are many issues to be considered
    to obtain an efficient and reliable implementation:
    • efficient solution of the linear systems at each Newton iteration — matrix block structure can be exploited
    • quasi-Newton approximations to the Hessian
    • trust region, line search to improve robustness
    • treatment of constraints (equality and inequality) during the iterative process
    • selection of a good initial guess for $\lambda$

Penalty Methods

  • Another approach to constrained optimization is penalty methods
  • This converts a constrained problem into an unconstrained problem
  • Key idea: Introduce a new objective function
    which is a weighted sum of objective function and constraints

Penalty Methods

  • Given the minimization problem \[ \min_x f(x) \quad \text{subject to} \quad g(x) = 0 \] define the corresponding penalized unconstrained problem \[ \min_x\phi_\rho(x) = f(x) + \frac{1}{2}\rho g(x)^Tg(x) \] with a parameter $\rho > 0$
  • Let $x^\ast$ be the solution of the constrained problem
  • Let $x^\ast_\rho$ be the solution of the penalized unconstrained problem
  • Under appropriate conditions, it can be shown that \[ \lim_{\rho\to\infty} x^\ast_\rho = x^\ast \]

Penalty Methods

  • In practice, we can solve the unconstrained problem for a large value of $\rho$ to get a good approximation of $x^\ast$
  • Another strategy is to solve for a sequence of penalty parameters $\rho_k$,
    where $x_{\rho_k}^\ast$ serves as an initial guess for $x_{\rho_{k+1}}^\ast$
  • Note that the major drawback of penalty methods is that
    a large factor $\rho$ will increase the condition number of the Hessian $H_{\phi_\rho}$
  • However, penalty methods can be convenient due to their simplicity
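
  • As an illustration, the penalty method applied to the underdetermined
    least squares problem from earlier; since $\phi_\rho$ is quadratic here,
    its minimizer solves $(2I + \rho A^TA)\,b = \rho A^T y$ exactly
    (random data is illustrative):

import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# solution of the constrained problem: b = A^T (A A^T)^{-1} y
b_exact = A.T @ np.linalg.solve(A @ A.T, y)

for rho in [1e0, 1e2, 1e4, 1e6]:
    # minimizer of phi_rho(b) = b^T b + (rho/2) ||A b - y||^2
    b_rho = np.linalg.solve(2 * np.eye(n) + rho * A.T @ A, rho * A.T @ y)
    print('rho={:.0e} error={:.2e}'.format(rho, np.linalg.norm(b_rho - b_exact)))

  • The error decays like $1/\rho$, while the condition number
    of the penalized Hessian $2I + \rho A^TA$ grows with $\rho$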

PDE-Constrained Optimization

PDE-Constrained Optimization

  • Consider a general optimization problem \[ \min_{p \in \mathbb{R}^n} \mathcal{G}(p) \] with the objective function $\mathcal{G}:\mathbb{R}^n\to\mathbb{R}$
  • Gradient-based methods require gradients of the objective
  • They could be approximated with finite differences

PDE-Constrained Optimization

  • However, each partial derivative requires an extra evaluation of $\mathcal{G}$ \[ \frac{\partial \mathcal{G}(p)}{\partial p_i} \approx \frac{\mathcal{G}(p+h e_i) - \mathcal{G}(p)}{h}, \] so we need $n+1$ evaluations of $\mathcal{G}$ to approximate $\nabla \mathcal{G}(p)$
  • For example, if $\mathcal{G}(p)$ requires solving a PDE and
    parameters $p$ represent an unknown field on a grid,
    this procedure becomes too expensive
  • The accuracy of finite differences is also limited

PDE-Constrained Optimization

  • There are two main alternative approaches
    for computing gradients of solutions of ODEs or PDEs
    • direct method
    • adjoint method
  • The direct method is simpler, but the adjoint method
    is more efficient in cases with many parameters

One-Dimensional Case

  • Consider the boundary value problem for an ODE \[ -u''(x;p) + r(x;p) u(x;p) = f(x), \qquad u(a) = u(b) = 0 \] referred to as the primal equation
  • Here the functions $r : \mathbb{R}\times \mathbb{R}^n \to \mathbb{R}$ and $f:\mathbb{R}\to\mathbb{R}$ are given
  • The objective function $\mathcal{G}:\mathbb{R}^n\to\mathbb{R}$ is assumed to be a linear functional \[ \mathcal{G}(p) = \int_a^b \sigma(x) u(x;p) \text{d}x \] for some given function $\sigma:\mathbb{R}\to\mathbb{R}$

Direct Method

  • Note that the gradient of the objective is \[ \frac{\partial \mathcal{G}(p)}{\partial p_i} = \int_a^b \sigma(x) \frac{\partial u}{\partial p_i} \text{d}x \] so we can compute it from derivatives of the solution $\frac{\partial u}{\partial p_i}$
  • Differentiate the original ODE with respect to $p_i$ \[ -\frac{\partial u}{\partial p_i}''(x;p) + r(x;p) \frac{\partial u}{\partial p_i}(x;p) = -\frac{\partial r}{\partial p_i}u(x;p) \] for $i=1,2,\ldots,n$

Direct Method

  • Once we compute each $\frac{\partial u}{\partial p_i}$ we can then evaluate $\nabla \mathcal{G}(p)$
    by evaluating a sequence of $n$ integrals
  • This is not much better than using finite differences:
    we still need to solve $n$ separate problems
  • However, those can be cheaper since only the right-hand side changes.
    For example, we can reuse a common LU factorization

Adjoint Method

  • A more efficient approach when $n$ is large is the adjoint method
  • The adjoint problem is defined as \[ -z''(x;p) + r(x;p) z(x;p) = \sigma(x), \qquad z(a) = z(b) = 0 \]
  • Since $\sigma(x)$ enters the right-hand side,
    the adjoint problem depends on the objective

Adjoint Method

  • Given a solution $z(x;p)$ of the adjoint problem, the gradient is \[ \begin{aligned} \frac{\partial \mathcal{G}(p)}{\partial p_i} &= \int_a^b \sigma(x) \frac{\partial u}{\partial p_i} \text{d}x \\ &= \int_a^b \left[-z''(x;p) + r(x;p) z(x;p)\right] \frac{\partial u}{\partial p_i} \text{d}x\\ &= \int_a^b z(x;p) \left[-\frac{\partial u}{\partial p_i}''(x;p) + r(x;p) \frac{\partial u}{\partial p_i}(x;p)\right] \text{d}x\end{aligned} \]
  • The last line follows from integrating by parts twice
    (boundary terms vanish because $\frac{\partial u}{\partial p_i}$ and $z$ are zero at $a$ and $b$)

Adjoint Method

  • Recall the derivative of the primal problem with respect to $p_i$ \[ -\frac{\partial u}{\partial p_i}''(x;p) + r(x;p) \frac{\partial u}{\partial p_i}(x;p) = -\frac{\partial r}{\partial p_i}u(x;p) \]
  • Combining both, we get \[ \frac{\partial \mathcal{G}(p)}{\partial p_i} = - \int_a^b \frac{\partial r}{\partial p_i} z(x;p) u(x;p) \text{d}x \]
  • Therefore, we only need to solve the primal and adjoint problems once
    and then can obtain each component of $\nabla \mathcal{G}(p)$ from the integral
  • This idea extends to PDEs
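
  • A finite-difference sketch in one dimension, with the hypothetical choices
    $r(x;p) = p_1 + p_2 x$, $f(x) = 1$, $\sigma(x) = x$ on $(0,1)$
    (none of these come from the course examples):

import numpy as np

a, b, N = 0.0, 1.0, 200
h = (b - a) / N
x = np.linspace(a, b, N + 1)[1:-1]  # interior grid points
p = np.array([1.0, 2.0])            # parameters (illustrative)

def solve_bvp(r, rhs):
    # finite-difference solve of -v'' + r v = rhs, v(a) = v(b) = 0
    A = (np.diag(2 / h**2 + r)
         - np.diag(np.ones(N - 2) / h**2, 1)
         - np.diag(np.ones(N - 2) / h**2, -1))
    return np.linalg.solve(A, rhs)

def G(p):
    u = solve_bvp(p[0] + p[1] * x, np.ones(N - 1))  # primal problem
    return h * np.sum(x * u)                        # G = int sigma u dx

u = solve_bvp(p[0] + p[1] * x, np.ones(N - 1))  # primal solution
z = solve_bvp(p[0] + p[1] * x, x)               # adjoint solution

# dG/dp_i = -int (dr/dp_i) z u dx, with dr/dp_1 = 1 and dr/dp_2 = x
grad = np.array([-h * np.sum(z * u), -h * np.sum(x * z * u)])

# compare with forward finite differences of G
eps = 1e-6
fd = np.array([(G(p + np.array([eps, 0.0])) - G(p)) / eps,
               (G(p + np.array([0.0, eps])) - G(p)) / eps])
print(grad, fd)  # the two should agree to several digits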

Linear Programming

Linear Programming

  • As we mentioned earlier, the optimization problem \[ \min_{x\in\mathbb{R}^n} f(x) \text{ subject to } g(x) = 0 \text{ and } h(x) \leq 0, \qquad\qquad \] with $f, g, h$ affine, is called a linear programming problem
  • The feasible region is a convex polyhedron
  • Since the objective function has a constant non-zero gradient,
    its global minimum must occur at a vertex of the feasible region

Linear Programming

  • Example of a convex feasible region in $\mathbb{R}^2$

Linear Programming

  • The standard approach to linear programming is conceptually simple:
    visit a sequence of vertices of the feasible region to find the minimum
  • This is called the simplex method
  • In the worst case, the computational cost of the simplex method
    grows exponentially with the size of the problem
  • But this worst-case behavior is rare. In practice, the cost grows linearly
  • We will not discuss the implementation of the simplex method

Linear Programming

  • scipy.optimize.linprog uses the HiGHS library that
    implements the dual revised simplex method
  • See [examples/unit4/linprog.py], solving the problem \[ \min_x f(x) = -5x_1 - 4x_2 -6x_3 \] subject to \[ \begin{aligned} x_1 - x_2 + x_3 &\leq 20\\ 3x_1 + 2x_2 + 4x_3 &\leq 42\\ 3x_1 + 2x_2 &\leq 30\end{aligned} \] and $0 \leq x_1, 0 \leq x_2, 0 \leq x_3$
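
  • A usage sketch of this problem with scipy:

import numpy as np
from scipy.optimize import linprog

c = [-5, -4, -6]            # objective coefficients
A_ub = [[1, -1, 1],
        [3, 2, 4],
        [3, 2, 0]]          # inequality constraints A_ub x <= b_ub
b_ub = [20, 42, 30]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x, res.fun)       # optimal vertex [0, 15, 3] with f(x) = -78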