Gauss-Newton vs. Levenberg-Marquardt

The Gauss-Newton Algorithm (GNA) and Levenberg-Marquardt Algorithm (LMA) are often utilized as generic methods to solve non-linear least-squares problems, which is interactively demonstrated and compared with each other in this notebook. For GNA and LMA implementation, a numerical approximation of the Jacobian is employed in this example.

last update: 30/04/2022

Hahne, PhD


For the parametrization, we choose a function model $M(\mathbf{p};x)$, which may be of the form

$$ M(\mathbf{p};x)=M(a, \mu, \sigma, \lambda; x)= \frac{a}{\sigma\sqrt{2\pi}} e^{\frac{-\left(x-\mu\right)^2}{2\sigma^2}} \left(1 + \text{erf}\left(\lambda\frac{x-\mu}{\sigma\sqrt{2}}\right)\right) $$

with $\text{erf}(\cdot)$ as the error function and $\mathbf{p}=(a, \mu, \sigma, \lambda)$ as the fit coefficients.

Generation of synthetic data

Residual Function

Estimation of $\mathbf{p}$ is achieved through iterative evaluation of a residual function $f(\mathbf{p}, y, x)$ which is defined as

$$ \mathbf{f}=f(\mathbf{p}, y_i, x_i)=(y_i-M(\mathbf{p};x_i))^2 $$

with $y_i$ as the observed data in our model $M(\mathbf{p};x_i)$ and $i$ as the index of a total number of $N$ samples while $\mathbf{f} \in \mathbb{R}^{N}$ is a vector-valued functional carrying the residuals. This is implemented by


The optimization goal we aim to solve is formulated as

$$ \text{arg min}_{\mathbf{h}} \, \frac{1}{2} \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right)^2 $$

with $\mathbf{J}$ denoting the Jacobian w.r.t. $\mathbf{p}$ and the update change rate $\mathbf{h}$, which together with the residuals $\mathbf{f}$ is minimized towards zero. To arrive at a minimum, the solution to this objective is obtained by the derivative of the cost function $\mathbf{F}(\mathbf{h})$, which writes

$$ \nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \frac{\partial \frac{1}{2} \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right)^2}{\partial \mathbf{h}} $$

and yields

$$ \nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \mathbf{J}^\intercal \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right) $$

where $\intercal$ is the matrix transpose. We set $\nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \mathbf{0}$, so that after multiplication and rearranging we have

$$ - \mathbf{J}^\intercal\mathbf{f} = \mathbf{J}^\intercal \mathbf{J}\mathbf{h} $$

and finally

$$ \mathbf{h} = -\left(\mathbf{J}^\intercal \mathbf{J}\right)^{-1} \mathbf{J}^\intercal\mathbf{f} $$

which gives the parameter update step at iteration $k$ by

$$ \mathbf{p}_{k+1}=\mathbf{p}_k+\mathbf{h} $$

where $\mathbf{h}=\mathbf{H}^{-1}\mathbf{g}$ contains the second-order derivative from Newton's method with the Hessian $\mathbf{H} \in \mathbb{R}^{m\times m}$ and gradient $\mathbf{g}\in\mathbb{R}^{m}$.

Numerical Jacobian

As opposed to the analytically derived Jacobian $\mathbf{J}$, a multi-variate numerical approximate $\mathbf{\tilde{J}}(\mathbf{p})\approx \mathbf{J}$ via $\mathbf{p}$ and the function $\mathbf{f}(\cdot)$ is provided hereafter.

Gauss-Newton Algorithm (GNA)

From the numerical estimate of a Jacobian, we can further compute a multivariate gradient $\mathbf{g}$ from $\mathbf{J}$ and residuals $\mathbf{f}$ using

$$\mathbf{g}=\mathbf{J}^\intercal \mathbf{f}$$

Similary, the Hessian can be approximated via

$$\mathbf{H}\approx\mathbf{J}^\intercal \mathbf{J}$$

which then allows to replace the first- and second-order derivatives of the conventional Newton method so that

$$\mathbf{h}_{gn}=-(\mathbf{J}^\intercal\mathbf{J})^{-1}\mathbf{J}^\intercal \mathbf{f}$$

where $\mathbf{h}_{gn}$ denotes the Gauss-Newton step. This is used in an iterative scheme and can be accomplished via


with variable $\alpha$ which represents the learning rate in the context of machine learning.

Levenberg-Marquardt Algorithm (LMA)

Next, we employ the LMA as a gradual blend of the Gauss-Newton and Gradient Descent method. The main motivation for LMA is to automatically fade between the two in an optimal way. To achieve this, Levenberg and Marquardt introduced the term $\mu\mathbf{D}$ to the Gauss-Newton step as seen in


or respectively


where $\mathbf{D}$ is a diagonal matrix and $\mu\ge0$ is the damping parameter, which shifts between the influence of the Gradient Descent and Gauss-Newton method. In practice, $\mu$ is reduced following an update which decreases the error so that $\mathbf{h}_{lm} \approx \mathbf{h}_{gn}$ and $\mu$ is increased once the error goes up. More precisely, Levenberg initially proposed to use an identity matrix $\mathbf{I}$ that scales with $\mu$ so that $\mathbf{D}=\mathbf{I}$. For the initialization of $\mu$, Levenberg used


where $\tau$ is set by the user. Then $\mu_k$ is regularily updated on each iteration $k$ by


with $\rho_k$ acting as the gain ratio given by

$$\rho_k = \frac{\mathbf{f}(\mathbf{p}_k)-\mathbf{f}(\mathbf{p}_k+\mathbf{h}_{lm})}{L(\mathbf{0})-L(\mathbf{h}_{lm})}$$

where the denominator is predicted by


After substituting $\mathbf{g}$ for $\mathbf{J}^\intercal \mathbf{f}$ and placing $-\frac{1}{2}\mathbf{h}_{lm}^\intercal$ outside the brackets, this may be written as

$$L(\mathbf{0})-L(\mathbf{h}_{lm})=-\frac{1}{2}\mathbf{h}_{lm}^\intercal (\mathbf{2}\mathbf{g}+(\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D}-\mu\mathbf{D})\mathbf{h}_{lm})$$

Since $\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D} = -\mathbf{g}$ by definition of LMA, this becomes

$$L(\mathbf{0})-L(\mathbf{h}_{lm}) = \frac{1}{2}\mathbf{h}_{lm}^\intercal (\mu\mathbf{h}_{lm}-\mathbf{g})$$

after rearranging.

Later, Marquardt suggested to replace the identity matrix with the diagonal of the local Hessian resulting in $\mathbf{D}_k=\text{diag}[\mathbf{J}^\intercal_k\mathbf{J}_k]$. There exist alternatives to this approach including $\mathbf{D}_{k+1}=\max\limits_{1\leq i\leq m}\{\text{diag}{[\mathbf{J}^\intercal\mathbf{J}]}_{(ii)}, {\mathbf{D}_{k}}_{(ii)}\}$. Updates on $\mu_k$ have also been modified by Marquardt as follows

$$ \mu_k = \beta\mu_k \quad \text{if} \, \rho_k < \rho_1 \quad \text{and} \quad \mu_k = \mu_k/\gamma \quad \text{if} \, \rho_k > \rho_2 $$

where $\rho_1=0.25$ and $\rho_2=0.75$ represent thresholds for the gain ratio controlling the increase from $\beta=2$ and decrease via $\gamma=3$ as user-defined parameters.

All of the above-mentioned methodological solutions have the update procedure in common, which simplifies to


A full implementation of the LMA is provided below.

Illustration of the result

The animation below indicates the optimization process with intermediate results from the iterative LMA procedure.

Equivalent Scipy implementation

For ease of use, an equivalent of the above implementation is provided hereafter using Scipy's leastsq.