Gauss-Newton vs. Levenberg-Marquardt¶

The Gauss-Newton Algorithm (GNA) and the Levenberg-Marquardt Algorithm (LMA) are widely used generic methods for solving non-linear least-squares problems. This notebook demonstrates both interactively and compares them with each other. Both the GNA and LMA implementations employ a numerical approximation of the Jacobian.

Last update: 30/04/2022
Author: Christopher Hahne, PhD

Model¶

For the parametrization, we choose a model function $M(\mathbf{p};x)$, which may be of the form

$$ M(\mathbf{p};x)=M(a, \mu, \sigma, \lambda; x)= \frac{a}{\sigma\sqrt{2\pi}} e^{\frac{-\left(x-\mu\right)^2}{2\sigma^2}} \left(1 + \text{erf}\left(\lambda\frac{x-\mu}{\sigma\sqrt{2}}\right)\right) $$

with $\text{erf}(\cdot)$ as the error function and $\mathbf{p}=(a, \mu, \sigma, \lambda)$ as the fit coefficients.

In [1]:
import numpy as np
from scipy.special import erf

def asym_gaussian(p, x=0):
    """
    Asymmetric Gaussian function model
    """
    amp = p[0] / (p[2] * np.sqrt(2 * np.pi))
    spread = np.exp((-(x - p[1]) ** 2.0) / (2 * p[2] ** 2.0))
    skew = 1 + erf((p[3] * (x - p[1])) / (p[2] * np.sqrt(2)))
    return amp * spread * skew

Generation of synthetic data¶

In [2]:
x = np.linspace(-5, 5, 1000)
norm, mean, sigm, skew = 10, -1, 2, 5
data = asym_gaussian([norm, mean, sigm, skew], x)
data_raw = data + .3 * np.random.randn(len(x))

%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.title('Synthetic data')
plt.plot(x, data_raw, label='raw signal')
Out[2]:
[<matplotlib.lines.Line2D at 0x7fda9a8f9210>]

Residual Function¶

Estimation of $\mathbf{p}$ is achieved through iterative evaluation of a residual function $f(\mathbf{p}, y, x)$ which is defined as

$$ \mathbf{f}=f(\mathbf{p}, y_i, x_i)=(y_i-M(\mathbf{p};x_i))^2 $$

with $y_i$ denoting the observed data, $M(\mathbf{p};x_i)$ the model prediction, and $i$ the index over a total of $N$ samples, while $\mathbf{f} \in \mathbb{R}^{N}$ is the vector-valued function carrying the residuals. This is implemented by

In [3]:
cost_fun = lambda p, y, x: (y - asym_gaussian(p, x))**2
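
As a quick sanity check (a sketch reusing the synthetic data generated above), evaluating the residual function at the ground-truth parameters returns one squared residual per sample, with magnitudes governed solely by the added noise:

# sanity check: squared residuals at the ground-truth parameters
res = cost_fun([norm, mean, sigm, skew], data_raw, x)
print(res.shape)  # (1000,): one residual per sample
print(res.mean()) # roughly the noise variance, i.e. about 0.3**2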

Optimization¶

The optimization goal we aim to solve is formulated as

$$ \text{arg min}_{\mathbf{h}} \, \frac{1}{2} \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right)^2 $$

with $\mathbf{J}$ denoting the Jacobian of $\mathbf{f}$ w.r.t. $\mathbf{p}$ and $\mathbf{h}$ the parameter update step, which together with the residuals $\mathbf{f}$ is driven towards zero. To arrive at a minimum, we take the derivative of the cost function $\mathbf{F}(\mathbf{h})$ with respect to $\mathbf{h}$, which writes

$$ \nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \frac{\partial \frac{1}{2} \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right)^2}{\partial \mathbf{h}} $$

and yields

$$ \nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \mathbf{J}^\intercal \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right) $$

where $\intercal$ is the matrix transpose. We set $\nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \mathbf{0}$, so that after multiplication and rearranging we have

$$ - \mathbf{J}^\intercal\mathbf{f} = \mathbf{J}^\intercal \mathbf{J}\mathbf{h} $$

and finally

$$ \mathbf{h} = -\left(\mathbf{J}^\intercal \mathbf{J}\right)^{-1} \mathbf{J}^\intercal\mathbf{f} $$

which gives the parameter update step at iteration $k$ by

$$ \mathbf{p}_{k+1}=\mathbf{p}_k+\mathbf{h} $$

where $\mathbf{h}=-\mathbf{H}^{-1}\mathbf{g}$ mirrors Newton's method with the Hessian $\mathbf{H} \in \mathbb{R}^{m\times m}$ and gradient $\mathbf{g}\in\mathbb{R}^{m}$, whose Gauss-Newton approximations are introduced below.
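
As a minimal numerical sketch of this step (the helper name gauss_newton_step is illustrative and not part of the fitting code below), the linear system can be solved directly with np.linalg.solve rather than forming the explicit inverse:

import numpy as np

def gauss_newton_step(jac, res):
    """
    Solve (J^T J) h = -J^T f for the update step h
    """
    g = np.dot(jac.T, res)  # gradient g = J^T f
    H = np.dot(jac.T, jac)  # Gauss-Newton approximation of the Hessian
    return np.linalg.solve(H, -g)

Solving the linear system is numerically preferable to computing $\left(\mathbf{J}^\intercal \mathbf{J}\right)^{-1}$ explicitly, although the implementations below keep the explicit inverse to stay close to the derivation.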

Numerical Jacobian¶

In place of an analytically derived Jacobian $\mathbf{J}$, a multivariate numerical approximation $\mathbf{\tilde{J}}(\mathbf{p})\approx \mathbf{J}$, computed from $\mathbf{p}$ and the residual function $\mathbf{f}(\cdot)$ via forward differences, is provided hereafter.

In [4]:
def jacobian_approx(p, f, dp=1e-8, args=()):
    """
    Numerical approximation for the multivariate Jacobian
    :param p: initial value(s)
    :param f: function handle
    :param dp: delta p for approximation
    :param args: additional arguments passed to function handle
    :return: jacobian
    """
    n = len(p)
    jac = np.zeros(n) if len(args) == 0 else np.zeros([n, len(args[0])])
    for j in range(n):  # through columns to allow for vector addition
        dpj = abs(p[j]) * dp if p[j] != 0 else dp
        p_plus = [(pi if k != j else pi + dpj) for k, pi in enumerate(p)]
        # compute jacobian while passing optional arguments
        jac[j] = (f(*((p_plus,) + args)) - f(*((p,) + args))) / dpj

    return jac if len(args) == 0 else jac.T
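
As a usage sketch (reusing cost_fun and the synthetic data defined above), the approximated Jacobian of the residual vector has one row per sample and one column per parameter:

# approximate Jacobian of the residuals w.r.t. the four model parameters
J = jacobian_approx([1., 0., 1., 1.], cost_fun, args=(data_raw, x))
print(J.shape)  # (1000, 4): N samples by m parameters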

Gauss-Newton Algorithm (GNA)¶

From the numerical estimate of the Jacobian, we can further compute the multivariate gradient $\mathbf{g}$ from $\mathbf{J}$ and the residuals $\mathbf{f}$ using

$$\mathbf{g}=\mathbf{J}^\intercal \mathbf{f}$$

Similarly, the Hessian can be approximated via

$$\mathbf{H}\approx\mathbf{J}^\intercal \mathbf{J}$$

which allows us to replace the first- and second-order derivatives of the conventional Newton method, so that

$$\mathbf{h}_{gn}=-(\mathbf{J}^\intercal\mathbf{J})^{-1}\mathbf{J}^\intercal \mathbf{f}$$

where $\mathbf{h}_{gn}$ denotes the Gauss-Newton step. This is used in an iterative scheme and can be accomplished via

$$\mathbf{p}_{k+1}=\mathbf{p}_k+\alpha\mathbf{h}_{gn}$$

with step size $\alpha$, which corresponds to the learning rate in the context of machine learning.

In [5]:
def lsq_gna(p, function, args=(), l=.1, tol=1e-12, max_iter=500):
    """
    Gauss-Newton implementation
    :param p: initial value(s)
    :param function: user-provided function which takes p (and additional arguments) as input
    :param args: optional arguments passed to function
    :param l: learning rate
    :param tol: tolerance for stop condition
    :param max_iter: maximum number of iterations
    :return: list of results, eps
    """
    # work on a float copy so that list inputs are handled and the caller's initial values stay untouched
    p = np.array(p, dtype=float)
    j = jacobian_approx(p, function, args=args)
    g = np.dot(j.T, function(*((p,) + args))) if len(args) != 0 else np.dot(j.T, function(p))
    H = np.dot(j.T, j)
    eps = 1
    p_list = []
    while len(p_list) < max_iter:
        h = -l*np.dot(np.linalg.inv(H), g)
        p += h
        p_list.append(p.copy())
        j = jacobian_approx(p, function, args=args)
        g = np.dot(j.T, function(*((p,) + args))) if len(args) != 0 else np.dot(j.T, function(p))
        H = np.dot(j.T, j)
        eps = max(abs(g))
        if eps < tol:
            break
    return p_list, eps

Levenberg-Marquardt Algorithm (LMA)¶

Next, we employ the LMA as a gradual blend of the Gauss-Newton and Gradient Descent methods. The main motivation behind the LMA is to fade between the two automatically. To achieve this, Levenberg and Marquardt introduced the damping term $\mu\mathbf{D}$ into the Gauss-Newton step, as seen in

$$\mathbf{h}_{lm}=-(\mathbf{H}+\mu\mathbf{D})^{-1}\mathbf{g}$$

or respectively

$$\mathbf{h}_{lm}=-(\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D})^{-1}\mathbf{J}^\intercal\mathbf{f}$$

where $\mathbf{D}$ is a diagonal matrix and $\mu\ge0$ is the damping parameter, which shifts the influence between the Gradient Descent and Gauss-Newton methods. In practice, $\mu$ is reduced after an update that decreases the error, so that $\mathbf{h}_{lm} \approx \mathbf{h}_{gn}$, and increased once the error goes up. Levenberg initially proposed to scale an identity matrix $\mathbf{I}$ with $\mu$, i.e. $\mathbf{D}=\mathbf{I}$. A common choice for the initialization of $\mu$ is

$$\mu_0=\tau\max\{\text{diag}{[\mathbf{J}^\intercal\mathbf{J}]}\}$$

where $\tau$ is set by the user. Then $\mu_k$ is updated on each iteration $k$ by

$$\mu_{k+1}=\mu_k\max\{1/3,1-(2\rho_k-1)^3\}$$

with $\rho_k$ acting as the gain ratio, i.e. the ratio of the actual to the predicted error reduction, given by

$$\rho_k = \frac{F(\mathbf{p}_k)-F(\mathbf{p}_k+\mathbf{h}_{lm})}{L(\mathbf{0})-L(\mathbf{h}_{lm})}$$

where $F(\mathbf{p})=\mathbf{f}(\mathbf{p})^\intercal\mathbf{f}(\mathbf{p})$ denotes the squared norm of the residual vector and the denominator is predicted by

$$L(\mathbf{0})-L(\mathbf{h}_{lm})=-\mathbf{h}_{lm}^\intercal\mathbf{J}^\intercal\mathbf{f}-\frac{1}{2}\mathbf{h}_{lm}^\intercal\mathbf{J}^\intercal\mathbf{J}\mathbf{h}_{lm}$$

After substituting $\mathbf{g}$ for $\mathbf{J}^\intercal \mathbf{f}$ and placing $-\frac{1}{2}\mathbf{h}_{lm}^\intercal$ outside the brackets, this may be written as

$$L(\mathbf{0})-L(\mathbf{h}_{lm})=-\frac{1}{2}\mathbf{h}_{lm}^\intercal \left(2\mathbf{g}+(\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D}-\mu\mathbf{D})\mathbf{h}_{lm}\right)$$

Since $(\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D})\mathbf{h}_{lm} = -\mathbf{g}$ by definition of the LMA step, this becomes

$$L(\mathbf{0})-L(\mathbf{h}_{lm}) = \frac{1}{2}\mathbf{h}_{lm}^\intercal (\mu\mathbf{D}\mathbf{h}_{lm}-\mathbf{g})$$

after rearranging, which reduces to $\frac{1}{2}\mathbf{h}_{lm}^\intercal (\mu\mathbf{h}_{lm}-\mathbf{g})$ for Levenberg's choice $\mathbf{D}=\mathbf{I}$, as used in the implementation below.

Later, Marquardt suggested replacing the identity matrix with the diagonal of the approximate Hessian, resulting in $\mathbf{D}_k=\text{diag}[\mathbf{J}^\intercal_k\mathbf{J}_k]$. Alternatives to this approach exist, for instance the element-wise update ${\mathbf{D}_{k+1}}_{(ii)}=\max\{\text{diag}{[\mathbf{J}^\intercal\mathbf{J}]}_{(ii)}, {\mathbf{D}_{k}}_{(ii)}\}$ for $1\leq i\leq m$. Marquardt also modified the update of $\mu_k$ as follows

$$ \mu_k = \beta\mu_k \quad \text{if} \, \rho_k < \rho_1 \quad \text{and} \quad \mu_k = \mu_k/\gamma \quad \text{if} \, \rho_k > \rho_2 $$

where $\rho_1=0.25$ and $\rho_2=0.75$ are gain-ratio thresholds that control when the damping is increased by $\beta=2$ or decreased by $\gamma=3$, all of which are user-defined parameters.
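
Both damping strategies can be collected in a small helper (an illustrative sketch; the name update_damping is not part of the original notebook, and the full implementation below inlines the same logic):

def update_damping(u, v, rho, meth='lev', rho1=.25, rho2=.75, bet=2, gam=3):
    """
    Sketch of the damping parameter update driven by the gain ratio rho
    """
    if meth == 'lev':
        # accepted step: shrink u towards Gauss-Newton; rejected step: grow u towards Gradient Descent
        return (u * max(1/3, 1 - (2*rho - 1)**3), 2) if rho > 0 else (u * v, v * 2)
    # Marquardt-style thresholds on the gain ratio
    u = u * bet if rho < rho1 else u / gam if rho > rho2 else u
    return u, v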

All of the above variants share the same parameter update procedure, which simplifies to

$$\mathbf{p}_{k+1}=\mathbf{p}_k+\mathbf{h}_{lm}$$

A full implementation of the LMA is provided below.

In [6]:
def lsq_lma(p, function, args=(), tol=1e-12, tau=1e-3, meth='lev', rho1=.25, rho2=.75, bet=2, gam=3, max_iter=500):
    """
    Levenberg-Marquardt implementation for least-squares fitting of non-linear functions
    :param p: initial value(s)
    :param function: user-provided function which takes p (and additional arguments) as input
    :param args: optional arguments passed to function
    :param tol: tolerance for stop condition
    :param tau: factor to initialize damping parameter
    :param meth: method which is default 'lev' for Levenberg and otherwise Marquardt
    :param rho1: first gain factor threshold for damping parameter adjustment for Marquardt
    :param rho2: second gain factor threshold for damping parameter adjustment for Marquardt
    :param bet: multiplier for damping parameter adjustment for Marquardt
    :param gam: divisor for damping parameter adjustment for Marquardt
    :param max_iter: maximum number of iterations
    :return: list of results, eps
    """
    # work on a float copy so that list inputs are handled and the caller's initial values stay untouched
    p = np.array(p, dtype=float)
    j = jacobian_approx(p, function, args=args)
    g = np.dot(j.T, function(*((p,) + args))) if len(args) != 0 else np.dot(j.T, function(p))
    H = np.dot(j.T, j)
    u = tau * np.max(H.diagonal())  # initial damping from the largest diagonal entry of J^T J
    v = 2
    eps = 1
    p_list = []
    while len(p_list) < max_iter:
        D = np.identity(j.shape[1])
        # Marquardt branch: scale the identity by the largest diagonal entry of H (at least 1), a scalar variant of D = diag[J^T J]
        D *= 1 if meth == 'lev' else np.max(np.maximum(H.diagonal(), D.diagonal()))
        h = -np.dot(np.linalg.inv(H+u*D), g)
        f = function(*((p,) + args)) if len(args) != 0 else function(p)
        f_plus = function(*((p+h,) + args)) if len(args) != 0 else function(p+h)
        rho = (np.dot(f.T, f) - np.dot(f_plus.T, f_plus)) / np.dot(.5*h.T, u*h-g)
        if rho > 0:
            p += h
            p_list.append(p.copy())
            j = jacobian_approx(p, function, args=args)
            g = np.dot(j.T, function(*((p,) + args))) if len(args) != 0 else np.dot(j.T, function(p))
            H = np.dot(j.T, j)
        if meth == 'lev':
            u, v = (u*np.max([1/3, 1-(2*rho-1)**3]), 2) if rho > 0 else (u*v, v*2)
        else:
            u = u*bet if rho < rho1 else u/gam if rho > rho2 else u
        eps = max(abs(g))
        if eps < tol:
            break
    return p_list, eps
In [7]:
initials = [.5, .5, .5, .5]
coeffs_gn, eps_gn = lsq_gna(initials, cost_fun, args=(data, x), l=.01)
coeffs_lm, eps_lm = lsq_lma(initials, cost_fun, args=(data, x), meth='marq')
print('GNA: %s with an error of %s after %s iterations.' % (coeffs_gn[-1], round(eps_gn,9), len(coeffs_gn)))
print('LMA: %s with an error of %s after %s iterations.' % (coeffs_lm[-1], round(eps_lm,9), len(coeffs_lm)))
GNA: [ 9.13028684 -1.02513265  1.91222146  4.72345346] with an error of 6.494116083 after 500 iterations.
LMA: [ 9.99998965 -0.99999916  1.99999408  4.99991457] with an error of 0.0 after 25 iterations.

Illustration of the result¶

The animation below illustrates the optimization process with intermediate results from the iterative LMA procedure.

In [8]:
%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML, display
plt.style.use('seaborn-pastel')

fig, ax = plt.figure(figsize=(8, 5)), plt.axes()
ax.set_xlabel('x', fontsize=16)
ax.set_ylabel('y', fontsize=16)
ax.xaxis.set_tick_params(labelsize=12)
ax.yaxis.set_tick_params(labelsize=12)
ax.plot(x, data_raw, label='data')
line, = ax.plot(x, asym_gaussian(coeffs_lm[-1], x), '--', color='red', lw=2, label='LMA fit')
ax.legend()
plt.tight_layout()

# animation
if True:
    animate = lambda i: line.set_data(x, asym_gaussian(coeffs_lm[i], x))
    anim = FuncAnimation(fig, animate, frames=len(coeffs_lm), interval=100)
    display(HTML(anim.to_jshtml()))  # display() is required here, since the expression is inside an if block and would otherwise not render

# styles
if False:
    import matplotlib as mpl
    mpl.rcParams['savefig.facecolor'] = '#c7b897'#'#148ec8'
    fig.set_facecolor('#c7b897')#('#148ec8')
    #ax.set_facecolor('#c7b897')#('#148ec8')
    ax.spines['bottom'].set_color('white')
    ax.spines['top'].set_color('white')
    ax.spines['left'].set_color('white')
    ax.spines['right'].set_color('white')
    ax.xaxis.label.set_color('white')
    ax.yaxis.label.set_color('white')
    ax.tick_params(colors='white')
    anim.save('./data/lm-fit_anim.gif', writer='imagemagick', fps=5)

Equivalent Scipy implementation¶

For ease of use, an equivalent fit is provided hereafter using SciPy's leastsq, which wraps MINPACK's Levenberg-Marquardt routine.

In [9]:
from scipy.optimize import leastsq

# unsquared residuals: leastsq minimizes the sum of their squares internally
residuals = lambda p, y, x: y - asym_gaussian(p, x)

# run the least-squares fit starting from the initial parameter guess
coeffs_scipy, _ = leastsq(residuals, [1., 1., 1., 1.], args=(data_raw, x))

print(coeffs_scipy)

y_scipy = asym_gaussian(coeffs_scipy, x)

plt.figure(figsize=(8, 5))
plt.plot(x, data_raw, label='data')
plt.plot(x, data, label='ground truth')
plt.plot(x, y_scipy, linestyle='--', label='scipy fit')
plt.legend()
[10.05364597 -1.01000325  2.00578554  4.98348821]
Out[9]:
<matplotlib.legend.Legend at 0x7fda9b4ebc10>
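
As an optional cross-check (not part of the original notebook), SciPy's newer least_squares interface exposes a Levenberg-Marquardt backend via method='lm' and accepts the same unsquared residual function:

from scipy.optimize import least_squares

# Levenberg-Marquardt via MINPACK, called on the same unsquared residuals
res_lm = least_squares(residuals, x0=[1., 1., 1., 1.], args=(data_raw, x), method='lm')
print(res_lm.x)  # fitted coefficients, comparable to the leastsq result above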