<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Prefill on k4i's blog</title><link>https://k4i.top/zh/tags/prefill/</link><description>Recent content in Prefill on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>zh</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Fri, 05 Jun 2026 22:30:00 +0800</lastBuildDate><atom:link href="https://k4i.top/zh/tags/prefill/index.xml" rel="self" type="application/rss+xml"/><item><title>Prefill vs Decode: Why the Same Model Has Two Completely Different Bottlenecks</title><link>https://k4i.top/zh/posts/prefill-vs-decode/</link><pubDate>Fri, 05 Jun 2026 22:30:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 22:30:00 +0800</atom:modified><guid>https://k4i.top/zh/posts/prefill-vs-decode/</guid><description>&lt;p&gt;On the surface, LLM inference looks like a single operation: feed in a prompt, then keep emitting tokens. Underneath, it is really two workloads sharing one set of model weights.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;prefill&lt;/strong&gt; processes the input prompt and builds the initial KV cache. &lt;strong&gt;decode&lt;/strong&gt; generates one token at a time: each step reads the existing KV cache, then appends the new token's KV. The weights are the same, but the hardware bottlenecks are completely different: prefill looks more like large batched matrix multiplication, while decode looks more like many small queries repeatedly reading an ever-growing in-memory table.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/prefill-vs-decode/two-bottlenecks.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>推理</category><category>prefill</category><category>decode</category><category>kv-cache</category><category>serving</category><category>systems</category><category>AI</category><category>LLM Inference Internals</category></item></channel></rss>