introduction
from the previous post on forward and backward propagation, we know the gradient descent update rule:
\[
\theta \leftarrow \theta - \eta \nabla C
\]
where \(\nabla C\) is the gradient of the loss over the entire training dataset. but computing \(\nabla C\) means running forward and backward propagation for every single training example, then averaging the results. when the dataset has millions of samples, this is extremely expensive.
how we compute \(\nabla C\) and when we update \(\theta\) leads to different variants of gradient descent: batch, stochastic, and mini-batch.
batch gradient descent
batch gradient descent (BGD) computes the gradient using all training examples before each update.
given a dataset with \(m\) samples, the gradient is the average over all individual gradients:
\[
\nabla C(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla C_i(\theta)
\]
the update step is:
\[
\theta \leftarrow \theta - \eta \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla C_i(\theta)
\]
characteristics:
- every update uses the full dataset, so the gradient direction is accurate
- the loss decreases smoothly and steadily toward the minimum
- guaranteed to converge to the global minimum for convex loss functions
drawbacks:
- each iteration is slow โ must process all \(m\) samples
- requires loading the entire dataset into memory
- for very large datasets, a single iteration may be infeasible
stochastic gradient descent
stochastic gradient descent (SGD) updates the parameters after each individual training example:
\[
\theta \leftarrow \theta - \eta \cdot \nabla C_i(\theta)
\]
the training data is typically shuffled and processed one sample at a time, then reshuffled for the next epoch.
characteristics:
- updates are very fast โ only one sample needed per step
- the noise from individual gradients can help escape shallow local minima
- memory efficient โ only one sample needs to be in memory at a time
drawbacks:
- the gradient direction fluctuates wildly, causing the loss to oscillate near the optimum
- never truly converges โ keeps bouncing around the minimum
- learning rate scheduling is essential to reduce oscillations over time
mini-batch gradient descent
mini-batch gradient descent is the middle ground: instead of one sample or all samples, use a small batch of \(k\) samples (commonly 32, 64, 128, or 256):
\[
\nabla C(\theta) = \frac{1}{k} \sum_{j \in \text{batch}} \nabla C_j(\theta)
\]
\[
\theta \leftarrow \theta - \eta \cdot \frac{1}{k} \sum_{j \in \text{batch}} \nabla C_j(\theta)
\]
Figure 1: mini-batch concept: dataset split into batches of size 4, each batch triggers one parameter update
why mini-batch is the default choice:
- vectorization: modern hardware (GPU/TPU) is optimized for batched matrix operations. processing 128 samples at once is much faster than 128 individual updates.
- reduced noise: averaging over a batch smooths out the gradient noise, leading to more stable convergence than pure SGD.
- memory-friendly: a batch of 128 samples fits easily in GPU memory, unlike millions of samples.
- one epoch = multiple updates: with \(m = 10000\) and \(k = 100\), one epoch produces 100 updates, giving the optimizer many chances to adjust direction.
comparison
Figure 2: convergence behavior of BGD (smooth), SGD (noisy), and mini-batch (moderately noisy but efficient)
| Batch GD | Stochastic GD | Mini-Batch GD | |
|---|---|---|---|
| samples per update | \(m\) (all) | 1 | \(k\) (e.g. 32) |
| updates per epoch | 1 | \(m\) | \(m / k\) |
| convergence | smooth, steady | noisy, oscillates | moderate noise |
| speed per update | slow | fast | medium (but vectorized) |
| memory | high | low | medium |
| can escape local minima | no | yes | partially |
| typical use case | small datasets | online learning | deep learning (default) |
summary
- batch gradient descent: accurate but slow โ good for small datasets where you can afford full passes.
- stochastic gradient descent: fast and noisy โ useful for online learning or when you need quick, rough progress.
- mini-batch gradient descent: the practical default โ balances accuracy, speed, and hardware efficiency. most deep learning frameworks (PyTorch, TensorFlow, etc.) use mini-batch SGD under the hood.
in practice, the choice of batch size \(k\) is a hyperparameter that depends on model size, GPU memory, and the desired trade-off between gradient noise and throughput.