For the parametrization, we choose a function model $M(\mathbf{p};x)$, which may be of the form
$$ M(\mathbf{p};x)=M(a, \mu, \sigma, \lambda; x)= \frac{a}{\sigma\sqrt{2\pi}} e^{\frac{-\left(x-\mu\right)^2}{2\sigma^2}} \left(1 + \text{erf}\left(\lambda\frac{x-\mu}{\sigma\sqrt{2}}\right)\right) $$with $\text{erf}(\cdot)$ as the error function and $\mathbf{p}=(a, \mu, \sigma, \lambda)$ as the fit coefficients.
import numpy as np
from scipy.special import erf
def asym_gaussian(p, x=0):
"""
Asymmetric Gaussian function model
"""
amp = p[0] / (p[2] * np.sqrt(2 * np.pi))
spread = np.exp((-(x - p[1]) ** 2.0) / (2 * p[2] ** 2.0))
skew = 1 + erf((p[3] * (x - p[1])) / (p[2] * np.sqrt(2)))
return amp * spread * skew
x = np.linspace(-5, 5, 1000)
norm, mean, sigm, skew = 10, -1, 2, 5
data = asym_gaussian([norm, mean, sigm, skew], x)
data_raw = data + .3 * np.random.randn(len(x))
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.title('Synthetic data')
plt.plot(x, data_raw, label='raw signal')
[<matplotlib.lines.Line2D at 0x7fda9a8f9210>]
Estimation of $\mathbf{p}$ is achieved through iterative evaluation of a residual function $f(\mathbf{p}, y, x)$ which is defined as
$$ \mathbf{f}=f(\mathbf{p}, y_i, x_i)=(y_i-M(\mathbf{p};x_i))^2 $$with $y_i$ as the observed data in our model $M(\mathbf{p};x_i)$ and $i$ as the index of a total number of $N$ samples while $\mathbf{f} \in \mathbb{R}^{N}$ is a vector-valued functional carrying the residuals. This is implemented by
cost_fun = lambda p, y, x: (y - asym_gaussian(p, x))**2
The optimization goal we aim to solve is formulated as
$$ \text{arg min}_{\mathbf{h}} \, \frac{1}{2} \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right)^2 $$with $\mathbf{J}$ denoting the Jacobian w.r.t. $\mathbf{p}$ and the update change rate $\mathbf{h}$, which together with the residuals $\mathbf{f}$ is minimized towards zero. To arrive at a minimum, the solution to this objective is obtained by the derivative of the cost function $\mathbf{F}(\mathbf{h})$, which writes
$$ \nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \frac{\partial \frac{1}{2} \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right)^2}{\partial \mathbf{h}} $$and yields
$$ \nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \mathbf{J}^\intercal \left(\mathbf{J}\mathbf{h}+\mathbf{f}\right) $$where $\intercal$ is the matrix transpose. We set $\nabla_{\mathbf{h}} \mathbf{F}\left(\mathbf{h}\right) = \mathbf{0}$, so that after multiplication and rearranging we have
$$ - \mathbf{J}^\intercal\mathbf{f} = \mathbf{J}^\intercal \mathbf{J}\mathbf{h} $$and finally
$$ \mathbf{h} = -\left(\mathbf{J}^\intercal \mathbf{J}\right)^{-1} \mathbf{J}^\intercal\mathbf{f} $$which gives the parameter update step at iteration $k$ by
$$ \mathbf{p}_{k+1}=\mathbf{p}_k+\mathbf{h} $$where $\mathbf{h}=\mathbf{H}^{-1}\mathbf{g}$ contains the second-order derivative from Newton's method with the Hessian $\mathbf{H} \in \mathbb{R}^{m\times m}$ and gradient $\mathbf{g}\in\mathbb{R}^{m}$.
As opposed to the analytically derived Jacobian $\mathbf{J}$, a multi-variate numerical approximate $\mathbf{\tilde{J}}(\mathbf{p})\approx \mathbf{J}$ via $\mathbf{p}$ and the function $\mathbf{f}(\cdot)$ is provided hereafter.
def jacobian_approx(p, f, dp=1e-8, args=()):
"""
Numerical approximation for the multivariate Jacobian
:param p: initial value(s)
:param f: function handle
:param dp: delta p for approximation
:param args: additional arguments passed to function handle
:return: jacobian
"""
n = len(p)
jac = np.zeros(n) if len(args) == 0 else np.zeros([n, len(args[0])])
for j in range(n): # through columns to allow for vector addition
dpj = abs(p[j]) * dp if p[j] != 0 else dp
p_plus = [(pi if k != j else pi + dpj) for k, pi in enumerate(p)]
# compute jacobian while passing optional arguments
jac[j] = (f(*((p_plus,) + args)) - f(*((p,) + args))) / dpj
return jac if len(args) == 0 else jac.T
From the numerical estimate of a Jacobian, we can further compute a multivariate gradient $\mathbf{g}$ from $\mathbf{J}$ and residuals $\mathbf{f}$ using
$$\mathbf{g}=\mathbf{J}^\intercal \mathbf{f}$$Similary, the Hessian can be approximated via
$$\mathbf{H}\approx\mathbf{J}^\intercal \mathbf{J}$$which then allows to replace the first- and second-order derivatives of the conventional Newton method so that
$$\mathbf{h}_{gn}=-(\mathbf{J}^\intercal\mathbf{J})^{-1}\mathbf{J}^\intercal \mathbf{f}$$where $\mathbf{h}_{gn}$ denotes the Gauss-Newton step. This is used in an iterative scheme and can be accomplished via
$$\mathbf{p}_{k+1}=\mathbf{p}_k+\alpha\mathbf{h}_{gn}$$with variable $\alpha$ which represents the learning rate in the context of machine learning.
def lsq_gna(p, function, args=(), l=.1, tol=1e-12, max_iter=500):
"""
Gauss-Newton implementation
:param p: initial value(s)
:param function: user-provided function which takes p (and additional arguments) as input
:param args: optional arguments passed to function
:param l: learning rate
:param tol: tolerance for stop condition
:param max_iter: maximum number of iterations
:return: list of results, eps
"""
j = jacobian_approx(p, function, args=args)
g = np.dot(j.T, function(*((p,) + args))) if len(args) != 0 else np.dot(j.T, function(p))
H = np.dot(j.T, j)
eps = 1
p_list = []
while len(p_list) < max_iter:
h = -l*np.dot(np.linalg.inv(H), g)
p += h
p_list.append(p.copy())
j = jacobian_approx(p, function, args=args)
g = np.dot(j.T, function(*((p,) + args))) if len(args) != 0 else np.dot(j.T, function(p))
H = np.dot(j.T, j)
eps = max(abs(g))
if eps < tol:
break
return p_list, eps
Next, we employ the LMA as a gradual blend of the Gauss-Newton and Gradient Descent method. The main motivation for LMA is to automatically fade between the two in an optimal way. To achieve this, Levenberg and Marquardt introduced the term $\mu\mathbf{D}$ to the Gauss-Newton step as seen in
$$\mathbf{h}_{lm}=-(\mathbf{H}+\mu\mathbf{D})^{-1}\mathbf{g}$$or respectively
$$\mathbf{h}_{lm}=-(\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D})^{-1}\mathbf{J}^\intercal\mathbf{f}$$where $\mathbf{D}$ is a diagonal matrix and $\mu\ge0$ is the damping parameter, which shifts between the influence of the Gradient Descent and Gauss-Newton method. In practice, $\mu$ is reduced following an update which decreases the error so that $\mathbf{h}_{lm} \approx \mathbf{h}_{gn}$ and $\mu$ is increased once the error goes up. More precisely, Levenberg initially proposed to use an identity matrix $\mathbf{I}$ that scales with $\mu$ so that $\mathbf{D}=\mathbf{I}$. For the initialization of $\mu$, Levenberg used
$$\mu_0=\tau\max\{\text{diag}{[\mathbf{J}^\intercal\mathbf{J}]}\}$$where $\tau$ is set by the user. Then $\mu_k$ is regularily updated on each iteration $k$ by
$$\mu_{k+1}=\mu_k\max\{1/3,1-(2\rho_k-1)^3\}$$with $\rho_k$ acting as the gain ratio given by
$$\rho_k = \frac{\mathbf{f}(\mathbf{p}_k)-\mathbf{f}(\mathbf{p}_k+\mathbf{h}_{lm})}{L(\mathbf{0})-L(\mathbf{h}_{lm})}$$where the denominator is predicted by
$$L(\mathbf{0})-L(\mathbf{h}_{lm})=-\mathbf{h}_{lm}^\intercal\mathbf{J}^\intercal\mathbf{f}-\frac{1}{2}\mathbf{h}_{lm}^\intercal\mathbf{J}^\intercal\mathbf{J}\mathbf{h}_{lm}$$After substituting $\mathbf{g}$ for $\mathbf{J}^\intercal \mathbf{f}$ and placing $-\frac{1}{2}\mathbf{h}_{lm}^\intercal$ outside the brackets, this may be written as
$$L(\mathbf{0})-L(\mathbf{h}_{lm})=-\frac{1}{2}\mathbf{h}_{lm}^\intercal (\mathbf{2}\mathbf{g}+(\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D}-\mu\mathbf{D})\mathbf{h}_{lm})$$Since $\mathbf{J}^\intercal\mathbf{J}+\mu\mathbf{D} = -\mathbf{g}$ by definition of LMA, this becomes
$$L(\mathbf{0})-L(\mathbf{h}_{lm}) = \frac{1}{2}\mathbf{h}_{lm}^\intercal (\mu\mathbf{h}_{lm}-\mathbf{g})$$after rearranging.
Later, Marquardt suggested to replace the identity matrix with the diagonal of the local Hessian resulting in $\mathbf{D}_k=\text{diag}[\mathbf{J}^\intercal_k\mathbf{J}_k]$. There exist alternatives to this approach including $\mathbf{D}_{k+1}=\max\limits_{1\leq i\leq m}\{\text{diag}{[\mathbf{J}^\intercal\mathbf{J}]}_{(ii)}, {\mathbf{D}_{k}}_{(ii)}\}$. Updates on $\mu_k$ have also been modified by Marquardt as follows
$$ \mu_k = \beta\mu_k \quad \text{if} \, \rho_k < \rho_1 \quad \text{and} \quad \mu_k = \mu_k/\gamma \quad \text{if} \, \rho_k > \rho_2 $$where $\rho_1=0.25$ and $\rho_2=0.75$ represent thresholds for the gain ratio controlling the increase from $\beta=2$ and decrease via $\gamma=3$ as user-defined parameters.
All of the above-mentioned methodological solutions have the update procedure in common, which simplifies to
$$\mathbf{p}_{k+1}=\mathbf{p}_k+\mathbf{h}_{lm}$$A full implementation of the LMA is provided below.
def lsq_lma(p, function, args=(), tol=1e-12, tau=1e-3, meth='lev', rho1=.25, rho2=.75, bet=2, gam=3, max_iter=500):
"""
Levenberg-Marquardt implementation for least-squares fitting of non-linear functions
:param p: initial value(s)
:param function: user-provided function which takes p (and additional arguments) as input
:param args: optional arguments passed to function
:param tol: tolerance for stop condition
:param tau: factor to initialize damping parameter
:param meth: method which is default 'lev' for Levenberg and otherwise Marquardt
:param rho1: first gain factor threshold for damping parameter adjustment for Marquardt
:param rho2: second gain factor threshold for damping parameter adjustment for Marquardt
:param bet: multiplier for damping parameter adjustment for Marquardt
:param gam: divisor for damping parameter adjustment for Marquardt
:param max_iter: maximum number of iterations
:return: list of results, eps
"""
j = jacobian_approx(p, function, args=args)
g = np.dot(j.T, function(*((p,) + args))) if len(args) != 0 else np.dot(j.T, function(p))
H = np.dot(j.T, j)
u = tau * np.max(np.diag(H.diagonal()))
v = 2
eps = 1
p_list = []
while len(p_list) < max_iter:
D = np.identity(j.shape[1])
D *= 1 if meth == 'lev' else np.max(np.maximum(H.diagonal(), D.diagonal()))
h = -np.dot(np.linalg.inv(H+u*D), g)
f = function(*((p,) + args)) if len(args) != 0 else function(p)
f_plus = function(*((p+h,) + args)) if len(args) != 0 else function(p+h)
rho = (np.dot(f.T, f) - np.dot(f_plus.T, f_plus)) / np.dot(.5*h.T, u*h-g)
if rho > 0:
p += h
p_list.append(p.copy())
j = jacobian_approx(p, function, args=args)
g = np.dot(j.T, function(*((p,) + args))) if len(args) != 0 else np.dot(j.T, function(p))
H = np.dot(j.T, j)
if meth== 'lev':
u, v = (u*np.max([1/3, 1-(2*rho-1)**3]), 2) if rho > 0 else (u*v, v*2)
else:
u = u*bet if rho < rho1 else u/gam if rho > rho2 else u
eps = max(abs(g))
if eps < tol:
break
return p_list, eps
initials = [.5, .5, .5, .5]
coeffs_gn, eps_gn = lsq_gna(initials, cost_fun, args=(data, x), l=.01)
coeffs_lm, eps_lm = lsq_lma(initials, cost_fun, args=(data, x), meth='marq')
print('GNA: %s with an error of %s after %s iterations.' % (coeffs_gn[-1], round(eps_gn,9), len(coeffs_gn)))
print('LMA: %s with an error of %s after %s iterations.' % (coeffs_lm[-1], round(eps_lm,9), len(coeffs_lm)))
GNA: [ 9.13028684 -1.02513265 1.91222146 4.72345346] with an error of 6.494116083 after 500 iterations. LMA: [ 9.99998965 -0.99999916 1.99999408 4.99991457] with an error of 0.0 after 25 iterations.
The animation below indicates the optimization process with intermediate results from the iterative LMA procedure.
%matplotlib notebook
%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
plt.style.use('seaborn-pastel')
fig, ax = plt.figure(figsize=(8, 5)), plt.axes()
ax.set_xlabel('x', fontsize=16)
ax.set_ylabel('y', fontsize=16)
ax.xaxis.set_tick_params(labelsize=12)
ax.yaxis.set_tick_params(labelsize=12)
ax.plot(x, data_raw, label='data')
line, = ax.plot(x, asym_gaussian(coeffs_lm[-1], x), '--', color='red', lw=2, label='LMA fit')
ax.legend()
plt.tight_layout()
# animation
if True:
animate = lambda i: line.set_data(x, asym_gaussian(coeffs_lm[i], x))
anim = FuncAnimation(fig, animate, frames=len(coeffs_lm), interval=100)
HTML(anim.to_jshtml())
# styles
if False:
import matplotlib as mpl
mpl.rcParams['savefig.facecolor'] = '#c7b897'#'#148ec8'
fig.set_facecolor('#c7b897')#('#148ec8')
#ax.set_facecolor('#c7b897')#('#148ec8')
ax.spines['bottom'].set_color('white')
ax.spines['top'].set_color('white')
ax.spines['left'].set_color('white')
ax.spines['right'].set_color('white')
ax.xaxis.label.set_color('white')
ax.yaxis.label.set_color('white')
ax.tick_params(colors='white')
anim.save('./data/lm-fit_anim.gif', writer='imagemagick', fps=5)
For ease of use, an equivalent of the above implementation is provided hereafter using Scipy's leastsq.
from scipy.optimize import leastsq
residuals = lambda p, y, x: y - asym_gaussian(p, x)
# executes least-squares regression analysis to optimize initial parameters
coeffs_scipy, _ = leastsq(residuals, [1., 1., 1., 1.], args=(data_raw, x))
print(coeffs_scipy)
y_scipy = asym_gaussian(coeffs_scipy, x)
plt.figure(figsize=(8, 5))
plt.plot(x, data_raw, label='data')
plt.plot(x, data, label='ground truth')
plt.plot(x, y_scipy, linestyle='--', label='scipy fit')
plt.legend()
[10.05364597 -1.01000325 2.00578554 4.98348821]
<matplotlib.legend.Legend at 0x7fda9b4ebc10>