LLM Attention Kernels and GPU Primitives

sky_io@outlook.com (K4i) — Fri, 05 Jun 2026 11:10:00 +0800

This series covers attention kernels and GPU primitives. The mechanism series explains why a serving system needs an optimization; this series explains how the optimization works at the kernel and memory-access level.

Existing Posts

Planned Posts

FlashAttention: how online softmax becomes IO-aware attention
From FlashAttention to PagedAttention: how attention kernels and cache layout constrain each other
PagedAttention kernels: how block tables enter the attention memory path
Triton profiling: using roofline thinking for bandwidth-bound and compute-bound kernels
Why decode kernels are often limited by HBM bandwidth

Questions Each Post Should Answer

Which memory access does this kernel remove or reduce?
How does data move through HBM, L2, shared memory, and registers?
Does it improve prefill, decode, or both?
How is it coupled to vLLM / SGLang serving parameters or cache layout?

Online Softmax: Tiling for Arbitrarily Large Rows

sky_io@outlook.com (K4i) — Tue, 21 Apr 2026 10:00:00 +0800

introduction

in the fused softmax post, we showed that keeping an entire row in GPU SRAM eliminates redundant global memory traffic, cutting softmax from roughly \(8MN\) to \(2MN\) element reads and writes for an \(M \times N\) input. the critical assumption: each row of size \(N\) fits in SRAM.
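to make the \(8MN\) versus \(2MN\) comparison concrete, here is a small accounting sketch. it assumes the unfused version runs as five separate elementwise or reduction passes (row max, subtract, exp, row sum, divide), each reading from and writing to global memory; that five-pass breakdown is an assumption for illustration, and the function names are hypothetical.

```python
def unfused_softmax_traffic(M, N):
    """Element reads + writes for an op-by-op softmax on an M x N matrix.

    Assumes five separate passes, each going through global memory.
    """
    reads = writes = 0
    # 1) row max: read the matrix, write one value per row
    reads += M * N
    writes += M
    # 2) z = x - max: read matrix and row maxima, write the matrix
    reads += M * N + M
    writes += M * N
    # 3) numerator = exp(z): read the matrix, write the matrix
    reads += M * N
    writes += M * N
    # 4) denominator = row sum: read the matrix, write one value per row
    reads += M * N
    writes += M
    # 5) out = numerator / denominator: read matrix and row sums, write the matrix
    reads += M * N + M
    writes += M * N
    return reads + writes


def fused_softmax_traffic(M, N):
    """Fused version: read each element once, write each result once."""
    return 2 * M * N


M, N = 4096, 1024
ratio = unfused_softmax_traffic(M, N) / fused_softmax_traffic(M, N)
print(f"unfused / fused traffic: {ratio:.3f}")
```

under these assumptions the unfused total is \(8MN + 4M\), so the ratio approaches 4 once \(N\) is much larger than 1.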

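that assumption breaks for large \(N\). modern GPUs have between 48 KB and 228 KB of shared memory per SM, depending on the architecture. at 4 bytes per float32 element, that caps a row at roughly 12K to 57K elements even if a single row gets the whole budget; beyond that, the row won’t fit and the fused approach fails.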
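one way around the limit, and the idea the title points at, is to walk the row in fixed-size tiles while carrying a running maximum and a running sum that is rescaled whenever the maximum changes. the sketch below is plain Python rather than a GPU kernel; the function name `online_softmax` and the `tile` parameter are hypothetical, chosen only for illustration.

```python
import math


def online_softmax(row, tile=4):
    """Softmax of `row`, reducing over at most `tile` elements at a time."""
    m = float("-inf")  # running maximum over the tiles seen so far
    d = 0.0            # running sum of exp(x - m) over the tiles seen so far
    for start in range(0, len(row), tile):
        chunk = row[start:start + tile]
        m_new = max(m, max(chunk))
        # rescale the old sum to the new maximum, then add this tile's terms
        d = d * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
    # second pass: normalize with the final maximum and sum
    return [math.exp(x - m) / d for x in row]


row = [0.5, -1.0, 3.0, 2.0, 1000.0, 999.0, -5.0]
probs = online_softmax(row, tile=3)
print(sum(probs))
```

only `tile` elements are needed at once during the reduction, so the working set no longer grows with \(N\). the price in this sketch is a second pass over the row to normalize.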

Fused Softmax in Triton

sky_io@outlook.com (K4i) — Mon, 20 Apr 2026 10:00:00 +0800

introduction

softmax is one of the most common operations in deep learning. it appears in attention mechanisms, classification heads, and anywhere we need to normalize a vector into a probability distribution.

the softmax function for a vector \(x\) of length \(N\) is:

\begin{equation}
\text{softmax}(x_i) = \frac{\exp(x_i - \max(x))}{\sum_{j=1}^{N} \exp(x_j - \max(x))}
\end{equation}

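we subtract \(\max(x)\) for numerical stability: without it, \(\exp(x_i)\) can overflow for large \(x_i\). the shift does not change the result, because the common factor \(\exp(-\max(x))\) cancels between numerator and denominator.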
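a quick way to see the overflow is to compare both variants on large inputs. this is a minimal sketch in plain Python with float64, where `math.exp` raises `OverflowError` past roughly 709; on a GPU in float32 the overflow would typically surface as `inf` and then `nan` instead of an exception. the function names are illustrative.

```python
import math


def softmax_naive(x):
    exps = [math.exp(v) for v in x]  # overflows once v is large enough
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]  # largest exponent is exactly 0
    total = sum(exps)
    return [e / total for e in exps]


x = [1000.0, 1001.0, 1002.0]
try:
    softmax_naive(x)
    naive_ok = True
except OverflowError:
    naive_ok = False

stable = softmax_stable(x)
print(naive_ok, stable)
```

because only differences between entries matter, `softmax_stable(x)` here matches the softmax of `[0.0, 1.0, 2.0]`.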

for a matrix of shape \(M \times N\), softmax is applied row-wise. this means each of the \(M\) rows is independently normalized.
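as a reference point, row-wise softmax can be sketched in a few lines of plain Python. this is only an illustration of the definition, not the kernel discussed in the post; `softmax_rows` is a hypothetical name, and the matrix is a list of lists.

```python
import math


def softmax_rows(matrix):
    """Apply the numerically stable softmax to each row independently."""
    out = []
    for row in matrix:
        m = max(row)                           # per-row maximum
        exps = [math.exp(v - m) for v in row]  # shifted exponentials
        total = sum(exps)                      # per-row normalizer
        out.append([e / total for e in exps])
    return out


a = softmax_rows([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
b = softmax_rows([[1.0, 2.0, 3.0], [5.0, -5.0, 0.0]])
print(a[0] == b[0])  # row 0 is unaffected by changes to row 1
```

no row reads another row's values, so the \(M\) rows can be processed in any order or in parallel.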