<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>LLM Attention Kernels and GPU Primitives on k4i's blog</title><link>https://k4i.top/series/llm-attention-kernels-and-gpu-primitives/</link><description>Recent content in LLM Attention Kernels and GPU Primitives on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Fri, 05 Jun 2026 11:10:00 +0800</lastBuildDate><atom:link href="https://k4i.top/series/llm-attention-kernels-and-gpu-primitives/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Attention Kernels and GPU Primitives</title><link>https://k4i.top/posts/llm-attention-kernels-gpu-primitives/</link><pubDate>Fri, 05 Jun 2026 11:10:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:26:17 +0800</atom:modified><guid>https://k4i.top/posts/llm-attention-kernels-gpu-primitives/</guid><description>&lt;p&gt;This series is for kernels and GPU primitives. The mechanism series explains why a serving system needs an optimization; this series explains how the optimization works at the kernel and memory-access level.&lt;/p&gt;
&lt;h2 id="existing-posts"&gt;Existing Posts&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/fused-softmax/"&gt;Fused Softmax in Triton&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/online-softmax/"&gt;Online Softmax: Tiling for Arbitrarily Large Rows&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="planned-posts"&gt;Planned Posts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;FlashAttention: how online softmax becomes IO-aware attention&lt;/li&gt;
&lt;li&gt;From FlashAttention to PagedAttention: how attention kernels and cache layout constrain each other&lt;/li&gt;
&lt;li&gt;PagedAttention kernels: how block tables enter the attention memory path&lt;/li&gt;
&lt;li&gt;Triton profiling: using roofline thinking for bandwidth-bound and compute-bound kernels&lt;/li&gt;
&lt;li&gt;Why decode kernels are often limited by HBM bandwidth&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="questions"&gt;Questions Each Post Should Answer&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Which memory access does this kernel remove or reduce?&lt;/li&gt;
&lt;li&gt;How does data move through HBM, L2, shared memory, and registers?&lt;/li&gt;
&lt;li&gt;Does it improve prefill, decode, or both?&lt;/li&gt;
&lt;li&gt;How is it coupled to vLLM / SGLang serving parameters or cache layout?&lt;/li&gt;
&lt;/ul&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/llm-attention-kernels-gpu-primitives/gpu-attention-kernel-icon.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>attention</category><category>triton</category><category>cuda</category><category>gpu</category><category>kernel</category><category>AI</category><category>LLM Attention Kernels and GPU Primitives</category></item><item><title>Online Softmax: Tiling for Arbitrarily Large Rows</title><link>https://k4i.top/posts/online-softmax/</link><pubDate>Tue, 21 Apr 2026 10:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:20:16 +0800</atom:modified><guid>https://k4i.top/posts/online-softmax/</guid><description>&lt;h2 id="introduction"&gt;introduction&lt;/h2&gt;
&lt;p&gt;in the &lt;a href="https://k4i.top/en/posts/fused-softmax/"&gt;fused softmax post&lt;/a&gt;, we showed that keeping an entire row in GPU SRAM eliminates redundant global memory traffic, reducing softmax over an \(M \times N\) matrix from roughly \(8MN\) to \(2MN\) global memory reads and writes. the critical assumption: each row of size \(N\) fits in SRAM.&lt;/p&gt;
&lt;p&gt;that assumption breaks for large \(N\). modern GPUs have between 48 KB and 228 KB of shared memory per SM. at 4 bytes per &lt;code&gt;float32&lt;/code&gt; element, that is room for roughly 12K to 57K elements, depending on the GPU; a longer row won&amp;rsquo;t fit, and the fused approach fails.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/online-softmax/cover.png" medium="image"><media:title type="html">featured image</media:title></media:content><category>triton</category><category>gpu</category><category>softmax</category><category>online-softmax</category><category>performance</category><category>flash-attention</category><category>AI</category><category>LLM Attention Kernels and GPU Primitives</category></item><item><title>Fused Softmax in Triton</title><link>https://k4i.top/posts/fused-softmax/</link><pubDate>Mon, 20 Apr 2026 10:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:20:16 +0800</atom:modified><guid>https://k4i.top/posts/fused-softmax/</guid><description>&lt;h2 id="introduction"&gt;introduction&lt;/h2&gt;
&lt;p&gt;softmax is one of the most common operations in deep learning. it appears in attention mechanisms, classification heads, and anywhere else we need to normalize a vector into a probability distribution.&lt;/p&gt;
&lt;p&gt;the softmax function for a vector \(x\) of length \(N\) is:&lt;/p&gt;
&lt;p&gt;\begin{equation}&lt;br /&gt;
\text{softmax}(x)_i = \frac{\exp(x_i - \max(x))}{\sum_{j=1}^{N} \exp(x_j - \max(x))}&lt;br /&gt;
\end{equation}&lt;/p&gt;
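&lt;p&gt;we subtract \(\max(x)\) for &lt;strong&gt;numerical stability&lt;/strong&gt;. the shift multiplies numerator and denominator by the same factor \(\exp(-\max(x))\), so the result is unchanged, but without it \(\exp(x_i)\) can overflow for large \(x_i\).&lt;/p&gt;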
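&lt;p&gt;a minimal sketch of the effect of subtracting the maximum, in plain Python rather than Triton. the helper names &lt;code&gt;naive_softmax&lt;/code&gt; and &lt;code&gt;stable_softmax&lt;/code&gt; and the example row are assumptions made for illustration, not code from this post.&lt;/p&gt;

```python
import math

def naive_softmax(x):
    # direct form exp(x_i) / sum_j exp(x_j), with no max subtraction
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(x):
    # subtract the row maximum first, so every exponent is at most 0
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)  # at least 1.0, because exp(0) appears for the maximum
    return [e / total for e in exps]

# hypothetical row with large entries
row = [1000.0, 1001.0, 1002.0]

try:
    naive_softmax(row)
    naive_overflowed = False
except OverflowError:
    # math.exp(1000.0) exceeds the float64 range
    naive_overflowed = True

probs = stable_softmax(row)
# shifting every entry by the same constant leaves softmax unchanged
shifted = stable_softmax([0.0, 1.0, 2.0])
```

&lt;p&gt;in Python the naive version raises &lt;code&gt;OverflowError&lt;/code&gt; on this row; in &lt;code&gt;float32&lt;/code&gt; on a GPU the same overflow would typically surface as &lt;code&gt;inf&lt;/code&gt; and then &lt;code&gt;nan&lt;/code&gt; after the division. the stable version gives the same probabilities for the row and for its shifted copy.&lt;/p&gt;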
&lt;p&gt;for a matrix of shape \(M \times N\), softmax is applied &lt;strong&gt;row-wise&lt;/strong&gt;. this means each of the \(M\) rows is independently normalized.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/fused-softmax/cover.png" medium="image"><media:title type="html">featured image</media:title></media:content><category>triton</category><category>gpu</category><category>softmax</category><category>kernel-fusion</category><category>performance</category><category>AI</category><category>LLM Attention Kernels and GPU Primitives</category></item></channel></rss>