LLM Attention Kernels and GPU Primitives

📅 Jun 5, 2026 · ☕ 1 min read · ✍️ k4i

A series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory-access optimization.

Online Softmax: Tiling for Arbitrarily Large Rows

📅 Apr 21, 2026 · ☕ 6 min read · ✍️ k4i

How online softmax extends the fused kernel to handle rows that exceed SRAM capacity, using a numerically stable two-pass tiling algorithm.
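As a rough illustration of the idea, here is a minimal plain-Python sketch of a two-pass tiled softmax. It is not the kernel from the post: the function name `online_softmax` and the `tile_size` parameter are assumptions made for this example, and a Python list stands in for a row that a GPU kernel would stream through SRAM one tile at a time.

```python
import math


def online_softmax(row, tile_size=4):
    """Two-pass tiled softmax over one row (illustrative stand-in for a GPU kernel)."""
    running_max = -math.inf  # largest value seen so far
    running_sum = 0.0        # sum of exp(x - running_max) over values seen so far

    # Pass 1: stream over tiles, updating the running max and rescaling the
    # running sum whenever the max changes, so no exponent ever overflows.
    for start in range(0, len(row), tile_size):
        tile = row[start:start + tile_size]
        new_max = max(running_max, max(tile))
        running_sum = (
            running_sum * math.exp(running_max - new_max)
            + sum(math.exp(x - new_max) for x in tile)
        )
        running_max = new_max

    # Pass 2: revisit each tile and normalize with the final max and sum.
    out = []
    for start in range(0, len(row), tile_size):
        tile = row[start:start + tile_size]
        out.extend(math.exp(x - running_max) / running_sum for x in tile)
    return out


probs = online_softmax([1.0, 2.0, 3.0, 1000.0, 999.0, -5.0, 0.5], tile_size=3)
```

Only one tile plus two scalars are live at any point in pass 1, which is why the row length is no longer bounded by on-chip memory in this scheme.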


Fused Softmax in Triton

📅 Apr 20, 2026 · ☕ 7 min read · ✍️ k4i

How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.