<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Performance on k4i's blog</title><link>https://k4i.top/tags/performance/</link><description>Recent content in Performance on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Tue, 21 Apr 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://k4i.top/tags/performance/index.xml" rel="self" type="application/rss+xml"/><item><title>Online Softmax: Tiling for Arbitrarily Large Rows</title><link>https://k4i.top/posts/online-softmax/</link><pubDate>Tue, 21 Apr 2026 10:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Wed, 22 Apr 2026 01:27:12 +0800</atom:modified><guid>https://k4i.top/posts/online-softmax/</guid><description>&lt;h2 id="introduction"&gt;introduction&lt;/h2&gt;
&lt;p&gt;in the &lt;a href="https://k4i.top/en/posts/fused-softmax/"&gt;fused softmax post&lt;/a&gt;, we showed that keeping an entire row in GPU SRAM eliminates redundant global memory traffic — reducing softmax from \(8MN\) to \(2MN\) memory operations. the critical assumption: each row of size \(N\) fits in SRAM.&lt;/p&gt;
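&lt;p&gt;a rough way to see where the \(8MN\) figure comes from: count element reads and writes in a naive multi-pass softmax, assuming every intermediate round-trips through global memory (NumPy stands in for the per-pass GPU kernels here; the function name is illustrative):&lt;/p&gt;

```python
import numpy as np

def naive_softmax(x):
    """Row-wise softmax as separate passes over an M x N matrix.

    On a GPU each line below is its own kernel launch, so every
    intermediate is written to and read back from global memory.
    """
    x_max = x.max(axis=1, keepdims=True)   # read MN, write M
    z = x - x_max                          # read MN + M, write MN
    num = np.exp(z)                        # read MN, write MN
    den = num.sum(axis=1, keepdims=True)   # read MN, write M
    # running total: ~5MN reads + ~3MN writes = ~8MN element moves,
    # vs. read-x-once / write-result-once = 2MN for a fused kernel
    return num / den                       # read MN + M, write MN
```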
&lt;p&gt;that assumption breaks for large \(N\). modern GPUs have between 48 KB and 228 KB of shared memory per SM. for &lt;code&gt;float32&lt;/code&gt;, a row exceeding ~12K–57K elements won&amp;rsquo;t fit, and the fused approach fails.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/online-softmax/cover.png" medium="image"><media:title type="html">featured image</media:title></media:content><category>triton</category><category>gpu</category><category>softmax</category><category>online-softmax</category><category>performance</category><category>flash-attention</category><category>AI</category></item><item><title>Fused Softmax in Triton</title><link>https://k4i.top/posts/fused-softmax/</link><pubDate>Mon, 20 Apr 2026 10:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Wed, 22 Apr 2026 01:27:12 +0800</atom:modified><guid>https://k4i.top/posts/fused-softmax/</guid><description>&lt;h2 id="introduction"&gt;introduction&lt;/h2&gt;
&lt;p&gt;softmax is one of the most common operations in deep learning. it appears in attention mechanisms, classification heads, and anywhere we need to normalize a vector of scores into a probability distribution.&lt;/p&gt;
&lt;p&gt;the softmax function for a vector \(x\) of length \(N\) is:&lt;/p&gt;
&lt;p&gt;\begin{equation}&lt;br /&gt;
\text{softmax}(x_i) = \frac{\exp(x_i - \max(x))}{\sum_{j=1}^{N} \exp(x_j - \max(x))}&lt;br /&gt;
\end{equation}&lt;/p&gt;
&lt;p&gt;we subtract \(\max(x)\) for &lt;strong&gt;numerical stability&lt;/strong&gt; — without it, \(\exp(x_i)\) can overflow for large \(x_i\).&lt;/p&gt;
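&lt;p&gt;a quick NumPy illustration of the overflow (the inputs are chosen so that \(\exp(x_i)\) overflows in &lt;code&gt;float64&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])

# naive: exp(1000) overflows to inf, and inf / inf gives nan everywhere
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(x) / np.exp(x).sum()

# stable: shift by max(x) so the largest exponent is exp(0) = 1
z = x - x.max()
stable = np.exp(z) / np.exp(z).sum()
```

the shift leaves the result unchanged mathematically (numerator and denominator are both scaled by \(\exp(-\max(x))\)), but keeps every intermediate representable.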
&lt;p&gt;for a matrix of shape \(M \times N\), softmax is applied &lt;strong&gt;row-wise&lt;/strong&gt;. this means each of the \(M\) rows is independently normalized.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/fused-softmax/cover.png" medium="image"><media:title type="html">featured image</media:title></media:content><category>triton</category><category>gpu</category><category>softmax</category><category>kernel-fusion</category><category>performance</category><category>AI</category></item></channel></rss>