<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>LLM Attention Kernels and GPU Primitives on k4i's blog</title><link>https://k4i.top/zh/series/llm-attention-kernels-and-gpu-primitives/</link><description>Recent content in LLM Attention Kernels and GPU Primitives on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>zh</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Fri, 05 Jun 2026 11:10:00 +0800</lastBuildDate><atom:link href="https://k4i.top/zh/series/llm-attention-kernels-and-gpu-primitives/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Attention Kernels and GPU Primitives: A Roadmap for Attention Kernels and GPU Primitives</title><link>https://k4i.top/zh/posts/llm-attention-kernels-gpu-primitives/</link><pubDate>Fri, 05 Jun 2026 11:10:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:26:17 +0800</atom:modified><guid>https://k4i.top/zh/posts/llm-attention-kernels-gpu-primitives/</guid><description>&lt;p&gt;This series is dedicated to kernels and GPU primitives. It differs from the inference engine mechanisms series as follows: that series explains "why the system needs this optimization", while this one explains "how this optimization is implemented at the kernel and memory-access level".&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/llm-attention-kernels-gpu-primitives/gpu-attention-kernel-icon.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>attention</category><category>triton</category><category>cuda</category><category>gpu</category><category>kernel</category><category>AI</category><category>LLM Attention Kernels and GPU Primitives</category></item><item><title>Online Softmax: A Blocked Algorithm for Arbitrarily Large Rows</title><link>https://k4i.top/zh/posts/online-softmax/</link><pubDate>Tue, 21 Apr 2026 10:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:20:16 +0800</atom:modified><guid>https://k4i.top/zh/posts/online-softmax/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In the &lt;a href="https://k4i.top/zh/posts/fused-softmax/"&gt;fused softmax post&lt;/a&gt;, we showed that keeping an entire row in GPU SRAM eliminates redundant global memory traffic, cutting softmax's memory operations from \(8MN\) to \(2MN\). This rests on one key assumption: each row of size \(N\) fits in SRAM.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/online-softmax/cover.png" medium="image"><media:title type="html">featured image</media:title></media:content><category>triton</category><category>gpu</category><category>softmax</category><category>online-softmax</category><category>performance optimization</category><category>flash-attention</category><category>AI</category><category>LLM Attention Kernels and GPU Primitives</category></item><item><title>Fused Softmax in Triton</title><link>https://k4i.top/zh/posts/fused-softmax/</link><pubDate>Mon, 20 Apr 2026 10:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:20:16 +0800</atom:modified><guid>https://k4i.top/zh/posts/fused-softmax/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
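&lt;p&gt;Softmax is one of the most common operations in deep learning. It appears in attention mechanisms, classification heads, and any setting that needs to normalize a vector into a probability distribution.&lt;/p&gt;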
&lt;p&gt;For a vector \(x\) of length \(N\), the softmax function is defined as:&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/fused-softmax/cover.png" medium="image"><media:title type="html">featured image</media:title></media:content><category>triton</category><category>gpu</category><category>softmax</category><category>kernel-fusion</category><category>performance optimization</category><category>AI</category><category>LLM Attention Kernels and GPU Primitives</category></item></channel></rss>