Stochastic Gradient Descent (SGD)

This notebook presents Stochastic Gradient Descent (SGD), a popular algorithm frequently used in the field of machine learning. The example describes a single-layer neural network with logistic regression for breast cancer prediction. The proposed model is analytically derived and implemented using the NumPy library to demonstrate the core functionality of training and testing. In addition, a PyTorch equivalent of the model is given further below for validation purposes.

last update: 11/04/2022

Hahne, PhD

Data acquisition

For our classification example, we employ real data from the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset, which consists of 569 samples with 2 classes (malignant and benign) and 30 different measured attributes known as features.

SGD theory

Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm used to learn the weight parameters of a multi-layer perceptron, and it is particularly useful for training on large datasets. The weights of the model are updated iteratively. Models trained with SGD have been found to generalize better on unseen data.


For a linear regression, a predicted value $\hat{y}_i$ is a scalar composed as $\hat{y}_i=\mathbf{X}_i\mathbf{w}$, where the column vector $\mathbf{w}=\left[w^{(1)}, w^{(2)}, \dots, w^{(J)}\right]^\intercal$ consists of weights $w^{(j)}$ for each feature $j$, and the row vector $\mathbf{X}_i=\left[x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(J)}\right]$ represents a data sample with $J$ features, while $i$ is the sample index in a set of $N$ samples in total. Note that we add a feature column of ones $\mathbf{x}^{(1)}=[1, 1, \dots, 1]^\intercal$ to the data set to embed and train the bias as variable $w^{(1)}$ instead of treating it as a separate variable. Our predicted class value $\hat{y}_i$ is supposed to match its actual class $y_i$, for which a least-squares cost metric $(\hat{y}_i-y_i)^2$ is a reasonable choice. As in conventional optimization, SGD aims to minimize an objective function $F(\mathbf{w})$, which in the case of SGD may be defined as

$$ F(\mathbf{w})=\frac{1}{N}\sum_{i=1}^N f\left(\mathbf{w}, \mathbf{X}_i, y_i\right)=\frac{1}{N}\sum_{i=1}^N\left(\hat{y}_i-y_i\right)^2=\frac{1}{N}\sum_{i=1}^N\left(\mathbf{X}_i\mathbf{w}-y_i\right)^2 $$

where $f\left(\mathbf{w}, \mathbf{X}_i, y_i\right)=\left(\mathbf{X}_i\mathbf{w}-y_i\right)^2$ represents the loss function. In the machine learning context, the term training refers to an optimization problem where the weight parameters $\mathbf{w}$ are adjusted so that the objective function is minimized, which is formally written as

$$ \text{arg min}_{\mathbf{w}} \, F(\mathbf{w}) $$
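The setup above can be sketched in NumPy as follows. The toy values are made up for illustration (they are not the Wisconsin data); with zero-initialized weights, every prediction starts at zero:

```python
import numpy as np

# Hypothetical toy data: N = 4 samples with J - 1 = 2 measured features
X = np.array([[0.5, 1.2],
              [1.0, 0.3],
              [0.2, 0.8],
              [1.5, 0.9]])
y = np.array([0., 1., 0., 1.])   # class labels

# Prepend a column of ones so the bias is trained as w^(1)
X = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.zeros(X.shape[1])         # one weight w^(j) per feature, incl. bias
y_hat = X @ w                    # predictions \hat{y}_i = X_i w
F = np.mean((y_hat - y) ** 2)    # objective F(w)
```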

To achieve this, SGD inherits the gradient descent update rule, which at iteration $k$ is applied via back-propagation and writes

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \gamma \, \nabla_{\mathbf{w}_k} f\left(\mathbf{w}_k, \mathbf{X}_i, y_i\right) \, , \, \forall i $$

where $\gamma$ denotes the learning rate and $\nabla_{\mathbf{w}_k} f\left(\mathbf{w}_k, \mathbf{X}_i, y_i\right)$ is the gradient of the loss function with respect to the weights $\mathbf{w}_k$. Here, the gradient $\nabla_{\mathbf{w}} f\left(\mathbf{w}, \cdot \right)$ can be generally obtained by

$$ \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{X}_i, y_i\right) = \frac{\partial f\left(\mathbf{w}, \mathbf{X}_i, y_i\right)}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial}{\partial \mathbf{w}^{(1)}} f\left(\mathbf{w}, \mathbf{X}_i, y_i\right) \\ \frac{\partial}{\partial \mathbf{w}^{(2)}} f\left(\mathbf{w}, \mathbf{X}_i, y_i\right) \\ \vdots \\ \frac{\partial}{\partial \mathbf{w}^{(J)}} f\left(\mathbf{w}, \mathbf{X}_i, y_i\right) \\ \end{bmatrix} = \begin{bmatrix} \mathbf{X}_i^{(1)} 2\left(\mathbf{X}_i\mathbf{w}-y_i\right) \\ \mathbf{X}_i^{(2)} 2\left(\mathbf{X}_i\mathbf{w}-y_i\right) \\ \vdots \\ \mathbf{X}_i^{(J)} 2\left(\mathbf{X}_i\mathbf{w}-y_i\right) \\ \end{bmatrix} $$

and written in compact vector notation

$$ \nabla_{\mathbf{w}} f(\mathbf{w}, \mathbf{X}_i, y_i) = 2\mathbf{X}_i^\intercal\left(\mathbf{X}_i\mathbf{w}-y_i\right) $$

where $^\intercal$ denotes the transpose. Iterating through the entire data set, $\forall i \in \{1, \dots, N\}$, is referred to as one epoch. The resulting weights $\mathbf{w}$ have been shown to improve when the optimization procedure sees the training data several times. This means that SGD sweeps through the entire dataset for several epochs.
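A minimal sketch of the per-sample update and a single epoch, assuming the bias column is already embedded in `X` (function names are my own):

```python
import numpy as np

def loss_grad(w, X_i, y_i):
    # Gradient of the squared loss: 2 X_i^T (X_i w - y_i)
    return 2.0 * X_i * (X_i @ w - y_i)

def sgd_epoch(w, X, y, gamma=0.01, rng=None):
    # One epoch: visit every sample once, in shuffled order
    rng = np.random.default_rng() if rng is None else rng
    for i in rng.permutation(len(y)):
        w = w - gamma * loss_grad(w, X[i], y[i])
    return w
```

Shuffling the sample order each epoch is what makes the procedure "stochastic"; running `sgd_epoch` repeatedly corresponds to training for several epochs.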


Completion of a single epoch is often sub-divided into bundled subsets of samples, so-called batches, of size $B$ where $B<N$. While $B=1$ in classical SGD, mini-batching uses $B>1$, which helps reduce the variance of each parameter update. The batch size is often chosen as a power of two for better performance with available matrix multiplication libraries. In practice, we determine how many training examples fit on the GPU or in main memory and then use the nearest power of two as the batch size.
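A mini-batch variant might look as follows; the gradient is averaged over the $B$ samples of each batch (again a sketch, with a hypothetical function name):

```python
import numpy as np

def minibatch_sgd_epoch(w, X, y, gamma=0.01, B=32, rng=None):
    # One epoch over the data in shuffled mini-batches of size B
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))
    for start in range(0, len(y), B):
        batch = idx[start:start + B]
        Xb, yb = X[batch], y[batch]
        # Average of the per-sample gradients 2 X_i^T (X_i w - y_i)
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)
        w = w - gamma * grad
    return w
```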

Activation function

In logistic regression, we desire a classification label $\mathring{y}$ that takes only two possible values, whereas, so far, our model employs linear combinations $\hat{y}_i=\mathbf{X}_i\mathbf{w}$ that yield results in the range $\hat{y}_i \in (-\infty, \infty)$. Thus, we seek a continuous function that maps the real numbers $\hat{y}_i$ of $\mathbf{\hat{y}} \in \mathbb{R}^N$ to the $(0,1)$ codomain. A function that satisfies this condition is the sigmoid function, also known as the standard logistic function, given by

$$ \text{sigmoid}\left(\mathbf{\hat{y}}\right)=[\sigma_1, \sigma_2, ..., \sigma_N]^\intercal, \quad \text{where} \quad \sigma_i=\frac{1}{1+\exp(-\hat{y}_i)} $$

The returned value $\sigma_i \in (0,1)$ of the activation function is then assigned a predicted label $\mathring{y}_i \in \{0,1\}$, which is 0 if $\sigma_i$ is closer to 0 and 1 if it is closer to 1, so that

$$ \mathring{y}_i= \begin{cases} 1, & \text{if } \sigma_i \geq \tau\\ 0, & \text{otherwise} \end{cases} $$

where $\tau$ is an adjustable threshold scalar. To account for logistic regression, activated predictions $\mathring{y}_i$ are thus substituted for the unnormalized results $\hat{y}_i$ in the above SGD optimization.
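The activation and thresholding steps translate directly to NumPy (function names are my own):

```python
import numpy as np

def sigmoid(y_hat):
    # Standard logistic function, mapping R to (0, 1) element-wise
    return 1.0 / (1.0 + np.exp(-y_hat))

def predict_label(y_hat, tau=0.5):
    # Assign class 1 where the activation reaches the threshold tau
    return (sigmoid(y_hat) >= tau).astype(int)
```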

SGD implementation
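Putting the pieces together, a minimal NumPy training routine might look as follows. Note one detail the derivation above leaves implicit: once the sigmoid is inserted into the squared loss, the chain rule contributes an extra factor $\sigma_i(1-\sigma_i)$ to the gradient, which this sketch includes. All names and hyperparameter values are illustrative choices, not the notebook's original code:

```python
import numpy as np

def train_sgd(X, y, epochs=100, gamma=0.1, seed=0):
    """Single-layer logistic regression trained with plain SGD (B = 1)."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])   # embed bias as w^(1)
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            s = 1.0 / (1.0 + np.exp(-Xb[i] @ w))          # sigmoid activation
            # chain rule: d/dw (s - y)^2 = 2 (s - y) s (1 - s) X_i
            w -= gamma * 2.0 * (s - y[i]) * s * (1.0 - s) * Xb[i]
    return w

def predict(X, w, tau=0.5):
    # Threshold the activated predictions at tau to obtain class labels
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return (1.0 / (1.0 + np.exp(-Xb @ w)) >= tau).astype(int)
```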

Overfitting validation

In statistical learning, we want to make sure that the classification performs equally well on test and training data. Therefore, we employ the Mean-Squared Error (MSE) given by $$ \text{MSE}(\mathbf{\hat{y}}, \mathbf{y})=\frac{1}{N}\sum_{i=1}^N \left(\hat{y}_i-y_i\right)^2 $$ on both sets while aiming for $\text{MSE}(\mathbf{\hat{y}}_{\text{test}}, \mathbf{y}_{\text{test}}) \approx \text{MSE}(\mathbf{\hat{y}}_{\text{train}}, \mathbf{y}_{\text{train}})$. If this fails, it gives indication for either under- or overfitting of the trained weights $\mathbf{w}$.
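The MSE check itself is a one-liner; it would be evaluated once with the training split and once with the held-out test split:

```python
import numpy as np

def mse(y_hat, y):
    # Mean-squared error over N samples
    return np.mean((y_hat - y) ** 2)
```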

PyTorch equivalent

Neural network models are often defined by subclassing PyTorch's Module class, following the inheritance mechanism of Object-Oriented Programming (OOP). Variables of other types (e.g. NumPy arrays) have to be converted to torch data types to enable the convenient automatic gradient computation. The model, the optimizer (here SGD) and the loss function are instantiated before training.
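A sketch of such a PyTorch equivalent, mirroring the single-layer model above (the class name and dummy data are my own; `torch.nn.Linear` handles the bias internally, so no column of ones is needed):

```python
import torch

class LogisticModel(torch.nn.Module):
    # Single linear layer followed by a sigmoid activation
    def __init__(self, in_features):
        super().__init__()
        self.linear = torch.nn.Linear(in_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)).squeeze(-1)

# Instantiate model, optimizer (SGD) and loss function before training
model = LogisticModel(in_features=30)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# One training step on dummy data (inputs must be torch tensors)
X = torch.randn(8, 30)
y = torch.randint(0, 2, (8,)).float()
optimizer.zero_grad()
loss = loss_fn(model(X), y)
loss.backward()   # automatic gradient computation
optimizer.step()  # SGD weight update
```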