Bivariate Gradient Descent vs. Newton-Raphson

The following example demonstrates bivariate optimization using gradient descent and the Newton-Raphson method. The goal is to minimize the objective function $f(x,y)=(x-y)^4+2x^2+y^2-x+2y$ and to compare the two optimization approaches.
The provided example is authored by Christopher Hahne and inspired by a lab note from Colorado State University.

last update: 02/10/2020

Hahne, PhD

As we aim to find the minimum of $f(x, y)$, we can take it directly as the objective function, which writes

$$f(x,y)=(x-y)^4+2x^2+y^2-x+2y$$
and implement it as
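A minimal sketch of the objective in Python (the function name `objective` is our choice, not prescribed by the text):

```python
def objective(x, y):
    """Objective f(x, y) = (x - y)^4 + 2*x^2 + y^2 - x + 2*y."""
    return (x - y)**4 + 2*x**2 + y**2 - x + 2*y
```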

To employ the gradient descent in the multivariate case, we may analytically construct its partial derivatives given by

$$\frac{\partial f}{\partial x}(x,y)=4(x-y)^3 + 4x-1$$

$$\frac{\partial f}{\partial y}(x,y)=-4(x-y)^3 + 2y+2$$

This is re-written in vector form as

$$\nabla \mathbf{F}(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x}(x,y)\\ \frac{\partial f}{\partial y}(x,y)\end{bmatrix} = \begin{bmatrix} 4(x-y)^3 + 4x-1 \\ -4(x-y)^3 + 2y+2 \end{bmatrix} $$

and implemented as
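A sketch of the gradient as a NumPy vector, following the partial derivatives above (the function name is our choice):

```python
import numpy as np

def gradient(x, y):
    """Gradient nabla F(x) of the objective as a 2-vector."""
    dfdx = 4*(x - y)**3 + 4*x - 1
    dfdy = -4*(x - y)**3 + 2*y + 2
    return np.array([dfdx, dfdy])
```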

Gradient Descent

Gradient descent is applicable to multivariate functions of dimension $m$ such that $\mathbf{x}\in\mathbb{R}^m$ via the update rule

$$\mathbf{x}_{k+1}=\mathbf{x}_k-\gamma \nabla \mathbf{F}(\mathbf{x}_k), \qquad k \in \mathbb{N}$$

where $k$ is the iteration index, $\nabla \mathbf{F}(\cdot)$ the gradient evaluated at the candidate $\mathbf{x}_k$, and $\gamma$ the learning rate. An implementation of gradient descent is given below.
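As a sketch, the update rule above can be iterated until the step becomes negligible. The learning rate $\gamma=0.1$, the iteration cap, and the stopping tolerance are illustrative choices, not values prescribed by the text:

```python
import numpy as np

def gradient(x, y):
    """Gradient of f(x, y) = (x - y)^4 + 2x^2 + y^2 - x + 2y."""
    return np.array([4*(x - y)**3 + 4*x - 1,
                     -4*(x - y)**3 + 2*y + 2])

def gradient_descent(x0, gamma=0.1, max_iter=1000, tol=1e-8):
    """Iterate x_{k+1} = x_k - gamma * grad F(x_k) until the step norm falls below tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = gamma * gradient(*x)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x
```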

Newton-Raphson Method

For minimization, the Newton-Raphson approach can be generalized to $m$ variables with the first-order derivative $\nabla\mathbf{F}(\mathbf{x})\in\mathbb{R}^m$ from above and the second-order derivative $\mathbf{H}(\mathbf{x}) \in \mathbb{R}^{m\times m}$ as the Hessian matrix, which in our case is given by

$$ \mathbf H(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial^2 f(x,y)}{\partial x^2} & \dfrac{\partial^2 f(x,y)}{\partial x \partial y}\\[1em] \dfrac{\partial^2 f(x,y)}{\partial y \partial x} & \dfrac{\partial^2 f(x,y)}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 12(x-y)^2 + 4 & -12(x-y)^2 \\ -12(x-y)^2 & 12(x-y)^2 + 2 \end{bmatrix} $$

and implemented as
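A sketch of the Hessian as a 2×2 NumPy matrix, matching the entries derived above (the function name is our choice):

```python
import numpy as np

def hessian(x, y):
    """Hessian H(x) of the objective as a symmetric 2x2 matrix."""
    d2 = 12*(x - y)**2  # shared term in all four entries
    return np.array([[d2 + 4.0, -d2],
                     [-d2, d2 + 2.0]])
```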

One thus obtains the iterative scheme

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \gamma \big[\mathbf H(\mathbf{x}_k)\big]^{-1} \nabla\mathbf{F}(\mathbf{x}_k), \qquad k \in \mathbb{N}$$

where the inverse of the Hessian matrix takes the role that the reciprocal of the second derivative plays in the univariate case. In practice, this approach often uses a learning rate $ 0 < \gamma \le 1 $ instead of $\gamma=1$ to help ensure that the Wolfe conditions are satisfied at each iteration. For step sizes other than 1, the method is referred to as the relaxed or damped Newton's method, whose implementation is provided hereafter.
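A sketch of the damped Newton iteration under these assumptions (the function names, default $\gamma$, iteration cap, and tolerance are our choices). Rather than forming the explicit matrix inverse, it is numerically preferable to solve the linear system $\mathbf{H}(\mathbf{x}_k)\,\mathbf{s} = \nabla\mathbf{F}(\mathbf{x}_k)$ for the step $\mathbf{s}$:

```python
import numpy as np

def gradient(x, y):
    """Gradient of f(x, y) = (x - y)^4 + 2x^2 + y^2 - x + 2y."""
    return np.array([4*(x - y)**3 + 4*x - 1,
                     -4*(x - y)**3 + 2*y + 2])

def hessian(x, y):
    """Hessian of the objective as a symmetric 2x2 matrix."""
    d2 = 12*(x - y)**2
    return np.array([[d2 + 4.0, -d2],
                     [-d2, d2 + 2.0]])

def newton_raphson(x0, gamma=1.0, max_iter=100, tol=1e-10):
    """Damped Newton: x_{k+1} = x_k - gamma * H(x_k)^{-1} grad F(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # solve H * step = grad instead of inverting H explicitly
        step = gamma * np.linalg.solve(hessian(*x), gradient(*x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x
```

Since the Hessian of this objective is positive definite everywhere (its determinant $72(x-y)^2+8$ and trace are strictly positive), the linear solve is well posed at every iterate.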