# Gauss-Newton vs. Levenberg-Marquardt¶

The Gauss-Newton Algorithm (GNA) and the Levenberg-Marquardt Algorithm (LMA) are often utilized as generic methods for solving non-linear least-squares problems. In this notebook, both are interactively demonstrated and compared with each other. Both implementations employ a numerical approximation of the Jacobian.

last update: 30/04/2022

Author: Christopher Hahne, PhD

## Model¶

For the parametrization, we choose a function model $M(\mathbf{p};x)$, which may be of the form

$$M(\mathbf{p};x)=M(a, \mu, \sigma, \lambda; x)= \frac{a}{\sigma\sqrt{2\pi}} e^{\frac{-\left(x-\mu\right)^2}{2\sigma^2}} \left(1 + \text{erf}\left(\lambda\frac{x-\mu}{\sigma\sqrt{2}}\right)\right)$$

with $\text{erf}(\cdot)$ as the error function and $\mathbf{p}=(a, \mu, \sigma, \lambda)$ as the fit coefficients.
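As an illustrative sketch, this model can be implemented with NumPy and SciPy's error function; the function name `model` is an assumption, as the original code cell is not shown.

```python
import numpy as np
from scipy.special import erf

def model(p, x):
    """Skewed Gaussian model M(p; x) with p = (a, mu, sigma, lam)."""
    a, mu, sigma, lam = p
    gauss = a / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    return gauss * (1 + erf(lam * (x - mu) / (sigma * np.sqrt(2))))
```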

## Residual Function¶

Estimation of $\mathbf{p}$ is achieved through iterative evaluation of a residual function $f(\mathbf{p}, y, x)$, which is defined as

$$f(\mathbf{p}, y_i, x_i)=y_i-M(\mathbf{p};x_i)$$

with $y_i$ as the observed data, $i$ as the index over a total of $N$ samples, and $\mathbf{f} \in \mathbb{R}^{N}$ as the vector-valued function carrying the residuals. Note that the residuals enter unsquared; the squaring happens in the least-squares objective below. This is implemented as follows.
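A minimal sketch, reusing the `model` function from above (the name `residuals` is an assumption):

```python
def residuals(p, y, x):
    """Residual vector f with entries y_i - M(p; x_i)."""
    return y - model(p, x)
```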

## Optimization¶

The optimization goal we aim to solve is formulated as

$$\underset{\mathbf{h}}{\text{arg min}} \, \frac{1}{2} \left\lVert\mathbf{J}\mathbf{h}+\mathbf{f}\right\rVert^2$$

with $\mathbf{J}$ denoting the Jacobian of $\mathbf{f}$ w.r.t. $\mathbf{p}$ and $\mathbf{h}$ the parameter update step, which together with the residuals $\mathbf{f}$ is driven towards zero. To arrive at a minimum, we take the derivative of the cost function $\mathbf{F}(\mathbf{h})$, which writes

$$\nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \frac{\partial}{\partial \mathbf{h}} \, \frac{1}{2} \left\lVert\mathbf{J}\mathbf{h}+\mathbf{f}\right\rVert^2$$

and yields

$$\nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \mathbf{J}^\intercal \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right)$$

where $(\cdot)^\intercal$ denotes the matrix transpose. We set $\nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \mathbf{0}$, so that after expanding and rearranging we have

$$- \mathbf{J}^\intercal\mathbf{f} = \mathbf{J}^\intercal \mathbf{J}\mathbf{h}$$

and finally

$$\mathbf{h} = -\left(\mathbf{J}^\intercal \mathbf{J}\right)^{-1} \mathbf{J}^\intercal\mathbf{f}$$

which gives the parameter update step at iteration $k$ by

$$\mathbf{p}_{k+1}=\mathbf{p}_k+\mathbf{h}$$

where, in analogy to Newton's method, $\mathbf{h}=-\mathbf{H}^{-1}\mathbf{g}$ with the Hessian $\mathbf{H} \in \mathbb{R}^{m\times m}$ carrying the second-order derivatives and the gradient $\mathbf{g}\in\mathbb{R}^{m}$, where $m$ is the number of fit coefficients.
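As a minimal sketch of this update (the helper name `update_step` is an assumption), the normal equations can be solved directly, which is numerically preferable to forming the explicit inverse:

```python
def update_step(p, J, f):
    """One update p_{k+1} = p_k + h with h = -(J^T J)^{-1} J^T f; solving the
    normal equations avoids forming the explicit (and less stable) inverse."""
    h = -np.linalg.solve(J.T @ J, J.T @ f)
    return p + h
```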

## Numerical Jacobian¶

As opposed to an analytically derived Jacobian $\mathbf{J}$, a multivariate numerical approximation $\mathbf{\tilde{J}}(\mathbf{p})\approx \mathbf{J}$, computed from $\mathbf{p}$ and the residual function $\mathbf{f}(\cdot)$, is employed hereafter.
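A forward-difference sketch of such an approximation; the step size `eps` and the name `jacobian_approx` are assumptions:

```python
def jacobian_approx(p, fun, eps=1e-8):
    """Forward-difference approximation of the Jacobian of fun at p."""
    p = np.asarray(p, dtype=float)
    f0 = fun(p)
    J = np.zeros((f0.size, p.size))
    for j in range(p.size):
        dp = np.zeros_like(p)
        dp[j] = eps
        J[:, j] = (fun(p + dp) - f0) / eps  # column j approximates df/dp_j
    return J
```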

## Gauss-Newton Algorithm (GNA)¶

From the numerical estimate of the Jacobian, we can further compute the multivariate gradient $\mathbf{g}$ from $\mathbf{J}$ and the residuals $\mathbf{f}$ using

$$\mathbf{g}=\mathbf{J}^\intercal \mathbf{f}$$

Similarly, the Hessian can be approximated via

$$\mathbf{H}\approx\mathbf{J}^\intercal \mathbf{J}$$

which allows us to replace the first- and second-order derivatives of the conventional Newton method so that

$$\mathbf{h}_{gn}=-(\mathbf{J}^\intercal\mathbf{J})^{-1}\mathbf{J}^\intercal \mathbf{f}$$

where $\mathbf{h}_{gn}$ denotes the Gauss-Newton step. It is applied in an iterative scheme via

$$\mathbf{p}_{k+1}=\mathbf{p}_k+\alpha\mathbf{h}_{gn}$$

where $\alpha$ is a step size, which represents the learning rate in the context of machine learning. A sketch of the resulting iteration follows.
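A minimal sketch of the GNA loop, reusing `jacobian_approx` from above; the defaults (`alpha=1.0`, the iteration cap, and the stopping tolerance) are assumptions:

```python
def gauss_newton(p0, fun, jac, alpha=1.0, max_iter=100, tol=1e-10):
    """Gauss-Newton iteration with step size alpha."""
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        f = fun(p)
        J = jac(p, fun)
        h = -np.linalg.solve(J.T @ J, J.T @ f)  # Gauss-Newton step h_gn
        p = p + alpha * h
        if np.linalg.norm(alpha * h) < tol:     # stop on negligible update
            break
    return p
```

For the fit problem above, this could be called as `gauss_newton(p0, lambda p: residuals(p, y, x), jacobian_approx)`.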

## Levenberg-Marquardt Algorithm (LMA)¶

Next, we employ the LMA as a gradual blend of the Gauss-Newton and Gradient Descent methods. The main motivation for the LMA is to fade between the two automatically in a near-optimal way. To achieve this, Levenberg and Marquardt added the damping term $\mu\mathbf{D}$ to the Gauss-Newton step, as seen in

$$\mathbf{h}_{lm}=-(\mathbf{H}+\mu\mathbf{D})^{-1}\mathbf{g}$$

or respectively

$$\mathbf{h}_{lm}=-(\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D})^{-1}\mathbf{J}^\intercal\mathbf{f}$$

where $\mathbf{D}$ is a diagonal matrix and $\mu\ge0$ is the damping parameter, which shifts the influence between the Gradient Descent and Gauss-Newton methods. In practice, $\mu$ is reduced after an update that decreases the error, so that $\mathbf{h}_{lm} \approx \mathbf{h}_{gn}$, and increased once the error goes up. Levenberg initially proposed to use the identity matrix scaled by $\mu$, i.e. $\mathbf{D}=\mathbf{I}$. For the initialization of $\mu$, a common choice is

$$\mu_0=\tau\max\{\text{diag}{[\mathbf{J}^\intercal\mathbf{J}]}\}$$

where $\tau$ is set by the user. Then, $\mu_k$ is updated at each iteration $k$ by

$$\mu_{k+1}=\mu_k\max\{1/3,1-(2\rho_k-1)^3\}$$

with $\rho_k$ acting as the gain ratio given by

$$\rho_k = \frac{F(\mathbf{p}_k)-F(\mathbf{p}_k+\mathbf{h}_{lm})}{L(\mathbf{0})-L(\mathbf{h}_{lm})}$$

where $F(\mathbf{p})=\frac{1}{2}\mathbf{f}(\mathbf{p})^\intercal\mathbf{f}(\mathbf{p})$ is the actual cost and the predicted reduction in the denominator is given by

$$L(\mathbf{0})-L(\mathbf{h}_{lm})=-\mathbf{h}_{lm}^\intercal\mathbf{J}^\intercal\mathbf{f}-\frac{1}{2}\mathbf{h}_{lm}^\intercal\mathbf{J}^\intercal\mathbf{J}\mathbf{h}_{lm}$$

After substituting $\mathbf{g}$ for $\mathbf{J}^\intercal \mathbf{f}$ and placing $-\frac{1}{2}\mathbf{h}_{lm}^\intercal$ outside the brackets, this may be written as

$$L(\mathbf{0})-L(\mathbf{h}_{lm})=-\frac{1}{2}\mathbf{h}_{lm}^\intercal (2\mathbf{g}+(\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D}-\mu\mathbf{D})\mathbf{h}_{lm})$$

Since $(\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D})\mathbf{h}_{lm} = -\mathbf{g}$ by definition of the LMA step, this becomes

$$L(\mathbf{0})-L(\mathbf{h}_{lm}) = \frac{1}{2}\mathbf{h}_{lm}^\intercal (\mu\mathbf{D}\mathbf{h}_{lm}-\mathbf{g})$$

after rearranging.
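As a hypothetical helper (names are assumptions), the gain ratio with the predicted reduction just derived may be computed as:

```python
def gain_ratio(p, h, fun, J, mu, D):
    """Gain ratio rho_k: actual cost reduction over predicted reduction."""
    F0 = 0.5 * np.sum(fun(p) ** 2)        # F(p_k) = 0.5 * ||f||^2
    F1 = 0.5 * np.sum(fun(p + h) ** 2)    # F(p_k + h_lm)
    g = J.T @ fun(p)
    pred = 0.5 * h @ (mu * (D @ h) - g)   # L(0) - L(h) as derived above
    return (F0 - F1) / pred
```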

Later, Marquardt suggested replacing the identity matrix with the diagonal of the approximate Hessian, resulting in $\mathbf{D}_k=\text{diag}[\mathbf{J}^\intercal_k\mathbf{J}_k]$. Alternatives to this approach exist, e.g., the element-wise update ${\mathbf{D}_{k+1}}_{(ii)}=\max\{\text{diag}{[\mathbf{J}^\intercal\mathbf{J}]}_{(ii)}, {\mathbf{D}_{k}}_{(ii)}\}$ for $1\leq i\leq m$. Marquardt also modified the update of $\mu_k$ as follows

$$\mu_k = \beta\mu_k \quad \text{if} \, \rho_k < \rho_1 \quad \text{and} \quad \mu_k = \mu_k/\gamma \quad \text{if} \, \rho_k > \rho_2$$

where $\rho_1=0.25$ and $\rho_2=0.75$ are thresholds for the gain ratio, while $\beta=2$ and $\gamma=3$ are user-defined parameters controlling the increase and decrease of $\mu_k$, respectively. A short sketch of this rule is given below.
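As a hypothetical sketch of Marquardt's damping update (function and variable names assumed):

```python
def marquardt_update(mu, rho, rho_1=0.25, rho_2=0.75, beta=2.0, gamma=3.0):
    """Marquardt-style damping update with beta = 2 and gamma = 3."""
    if rho < rho_1:        # poor agreement with the linear model: dampen more
        mu = beta * mu
    elif rho > rho_2:      # good agreement: move towards Gauss-Newton
        mu = mu / gamma
    return mu
```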

All of the above variants share the same parameter update procedure, which simplifies to

$$\mathbf{p}_{k+1}=\mathbf{p}_k+\mathbf{h}_{lm}$$

A full implementation of the LMA is provided below.
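Below is a minimal reconstruction of such an implementation, not necessarily the notebook's original code. It assumes Levenberg's $\mathbf{D}=\mathbf{I}$, the $\mu_0$ initialization and $\mu$-update given above, and a simple doubling of $\mu$ on rejected steps; it reuses `jacobian_approx` and `gain_ratio`:

```python
def levenberg_marquardt(p0, fun, jac, tau=1e-3, max_iter=100, tol=1e-10):
    """LMA sketch with D = I, mu_0 = tau * max(diag(J^T J)), and the
    mu-update from above; doubling mu on rejection is a simplification."""
    p = np.asarray(p0, dtype=float)
    J = jac(p, fun)
    mu = tau * np.max(np.diag(J.T @ J))          # initial damping mu_0
    D = np.eye(p.size)                           # Levenberg's choice D = I
    for _ in range(max_iter):
        g = J.T @ fun(p)
        h = -np.linalg.solve(J.T @ J + mu * D, g)  # LM step h_lm
        if np.linalg.norm(h) < tol:              # stop on negligible step
            break
        rho = gain_ratio(p, h, fun, J, mu, D)
        if rho > 0:                              # error decreased: accept step
            p = p + h
            J = jac(p, fun)
            mu *= max(1 / 3, 1 - (2 * rho - 1) ** 3)
        else:                                    # error increased: raise damping
            mu *= 2
    return p
```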

## Illustration of the result¶

The animation below illustrates the optimization process with intermediate results from the iterative LMA procedure.

## Equivalent Scipy implementation¶

For ease of use, an equivalent of the above implementation is provided hereafter using Scipy's `leastsq`.
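A minimal usage sketch with `scipy.optimize.leastsq`; the synthetic data, ground-truth parameters, and initial guess are assumptions for illustration:

```python
from scipy.optimize import leastsq

# Hypothetical synthetic data; p_true and the noise level are assumptions.
x = np.linspace(-4, 8, 100)
p_true = (10, 2, 1.5, 3)
y = model(p_true, x) + 0.01 * np.random.randn(x.size)

p0 = (5, 1, 1, 1)                                 # initial guess
p_fit, ier = leastsq(residuals, p0, args=(y, x))  # minimizes sum of squared residuals
```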