<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Systems on k4i's blog</title><link>https://k4i.top/zh/tags/systems/</link><description>Recent content in Systems on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>zh</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt;.</copyright><lastBuildDate>Wed, 22 Apr 2026 12:00:00 +0800</lastBuildDate><atom:link href="https://k4i.top/zh/tags/systems/index.xml" rel="self" type="application/rss+xml"/><item><title>Disaggregated Prefill: Splitting Computation Across Machines</title><link>https://k4i.top/zh/posts/disaggregated-prefill/</link><pubDate>Wed, 22 Apr 2026 12:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sat, 30 May 2026 00:04:33 +0800</atom:modified><guid>https://k4i.top/zh/posts/disaggregated-prefill/</guid><description>&lt;h2 id="ceiling"&gt;Why coexistence on a single GPU has a ceiling&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://k4i.top/zh/posts/chunked-prefill/"&gt;Chunked prefill&lt;/a&gt; splits prefill into small chunks so that prefill and decode coexist more smoothly on a single GPU. But however well the chunks are sized, prefill and decode are still &lt;em&gt;sharing the same GPU&lt;/em&gt;. They compete:&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/disaggregated-prefill/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>distributed</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item><item><title>Prefix Caching: Reusing the KV Cache Across Requests</title><link>https://k4i.top/zh/posts/prefix-caching/</link><pubDate>Wed, 22 Apr 2026 11:30:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sat, 30 May 2026 00:04:33 +0800</atom:modified><guid>https://k4i.top/zh/posts/prefix-caching/</guid><description>&lt;h2 id="problem"&gt;The repeated-prefix problem&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://k4i.top/zh/posts/kv-cache/"&gt;KV cache&lt;/a&gt; eliminates redundant computation &lt;strong&gt;within a single request&lt;/strong&gt;: when decoding token \(t\), there is no need to recompute K and V for the preceding \(t-1\) tokens. Production workloads, however, contain another form of redundancy at a much larger scale: &lt;strong&gt;different requests often begin with exactly the same run of tokens&lt;/strong&gt;.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/prefix-caching/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>caching</category><category>kv-cache</category><category>AI</category><category>LLM Inference Internals</category></item><item><title>Chunked Prefill: Slicing Prefill to Protect Decode Latency</title><link>https://k4i.top/zh/posts/chunked-prefill/</link><pubDate>Wed, 22 Apr 2026 11:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sat, 30 May 2026 00:04:33 +0800</atom:modified><guid>https://k4i.top/zh/posts/chunked-prefill/</guid><description>&lt;h2 id="interference"&gt;The interference problem&lt;/h2&gt;
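&lt;p&gt;&lt;a href="https://k4i.top/zh/posts/continuous-batching/"&gt;Continuous batching&lt;/a&gt; schedules requests at iteration granularity to keep the GPU as busy as possible. But it has an edge case that can easily ruin the latency experience: &lt;strong&gt;a very long prefill&lt;/strong&gt;.&lt;/p&gt;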
&lt;p&gt;When a request with a 2048-token prompt arrives, a naive scheduler runs prefill over the entire prompt in a single iteration. For a 7B model on an A100, a 2048-token prefill takes roughly 200 ms. During those 200 ms, every decode request in the current batch that is already streaming output has to wait.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/chunked-prefill/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>latency</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item><item><title>Continuous Batching: Scheduling at Iteration Granularity</title><link>https://k4i.top/zh/posts/continuous-batching/</link><pubDate>Wed, 22 Apr 2026 10:30:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sat, 30 May 2026 00:04:33 +0800</atom:modified><guid>https://k4i.top/zh/posts/continuous-batching/</guid><description>&lt;h2 id="batching-problem"&gt;The batching problem&lt;/h2&gt;
&lt;p&gt;Batching is the basic way an LLM serving system keeps the GPU busy. A single request usually cannot fully utilize the GPU, but grouping several requests together turns many small matrix operations into larger ones. The problem is that requests do not all finish at the same time.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/continuous-batching/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>batching</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item><item><title>Paged Attention: Virtual Memory on the GPU</title><link>https://k4i.top/zh/posts/paged-attention/</link><pubDate>Wed, 22 Apr 2026 10:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sat, 30 May 2026 00:04:33 +0800</atom:modified><guid>https://k4i.top/zh/posts/paged-attention/</guid><description>&lt;h2 id="memory-management-problem"&gt;The GPU memory management problem&lt;/h2&gt;
&lt;h3 id="fragmentation"&gt;The fragmentation problem&lt;/h3&gt;
&lt;p&gt;The previous post, &lt;a href="https://k4i.top/zh/posts/kv-cache/"&gt;KV cache&lt;/a&gt;, explained why autoregressive decoding can cache keys and values. The KV cache saves a great deal of redundant computation, but it also raises a new systems question: &lt;strong&gt;where do these ever-growing caches actually live?&lt;/strong&gt;&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/paged-attention/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>vllm</category><category>memory</category><category>AI</category><category>LLM Inference Internals</category></item></channel></rss>