<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Scheduling on k4i's blog</title><link>https://k4i.top/tags/scheduling/</link><description>Recent content in Scheduling on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Wed, 22 Apr 2026 12:00:00 +0800</lastBuildDate><atom:link href="https://k4i.top/tags/scheduling/index.xml" rel="self" type="application/rss+xml"/><item><title>Disaggregated Prefill: Splitting Compute Across Machines</title><link>https://k4i.top/posts/disaggregated-prefill/</link><pubDate>Wed, 22 Apr 2026 12:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sun, 26 Apr 2026 16:08:06 +0800</atom:modified><guid>https://k4i.top/posts/disaggregated-prefill/</guid><description>&lt;h2 id="ceiling"&gt;why same-GPU coexistence has a ceiling&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://k4i.top/posts/chunked-prefill/"&gt;chunked prefill&lt;/a&gt; makes prefill-decode coexistence more tolerable by slicing the prefill into small pieces. but even with perfect chunking, prefill and decode are still &lt;em&gt;sharing the same GPU&lt;/em&gt;. they compete for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HBM bandwidth&lt;/strong&gt; — both need to read from and write to GPU memory each iteration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;compute units&lt;/strong&gt; — prefill&amp;rsquo;s GEMM and decode&amp;rsquo;s GEMV contend for the same tensor cores (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV cache space&lt;/strong&gt; — prefill temporarily occupies blocks that could serve decode requests&lt;/li&gt;
&lt;/ul&gt;
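&lt;p&gt;the GEMM-vs-GEMV point is easiest to see with a back-of-envelope roofline. the sketch below uses assumed numbers (a 7B fp16 model on an A100-class GPU, roughly 312 TFLOPS of tensor-core throughput and 2 TB/s of HBM bandwidth), not measurements; it only illustrates that a prefill step is dominated by compute while a decode step is dominated by re-reading the weights from HBM.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# back-of-envelope roofline: why prefill is compute-bound and decode is memory-bound.
# the constants are illustrative assumptions (7B fp16 model, A100-class GPU), not measurements.

PARAMS = 7e9                # model parameters
PEAK_FLOPS = 312e12         # assumed fp16 tensor-core peak, FLOP/s
HBM_BW = 2.0e12             # assumed HBM bandwidth, bytes/s
WEIGHT_BYTES = 2 * PARAMS   # fp16 weights, 2 bytes per parameter

def step_lower_bounds(tokens):
    """rough compute-time and memory-time lower bounds for one forward pass."""
    compute_s = (2 * PARAMS * tokens) / PEAK_FLOPS  # ~2 FLOPs per parameter per token
    memory_s = WEIGHT_BYTES / HBM_BW                # every step re-reads all the weights
    return compute_s, memory_s

# prefill: a 2048-token prompt in one shot -- compute dominates (~92 ms vs ~7 ms)
print("prefill 2048 tokens:", step_lower_bounds(2048))

# decode: one new token each for a batch of 8 requests -- memory dominates (~0.4 ms vs ~7 ms)
print("decode, batch of 8: ", step_lower_bounds(8))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;the exact constants matter less than the shape: the two phases are bottlenecked on different resources, but on a single GPU each one ends up waiting behind the other&amp;rsquo;s bottleneck anyway.&lt;/p&gt;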
&lt;p&gt;at moderate scale, this coexistence is acceptable. at large scale — thousands of requests/second, strict SLOs, multi-GPU clusters — the competition becomes a bottleneck that chunking alone cannot resolve.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/disaggregated-prefill/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>distributed</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item><item><title>Chunked Prefill: Slicing the Prefill to Protect Decode Latency</title><link>https://k4i.top/posts/chunked-prefill/</link><pubDate>Wed, 22 Apr 2026 11:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sun, 26 Apr 2026 16:08:06 +0800</atom:modified><guid>https://k4i.top/posts/chunked-prefill/</guid><description>&lt;h2 id="interference"&gt;the interference problem&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://k4i.top/posts/continuous-batching/"&gt;continuous batching&lt;/a&gt; keeps the GPU busy by scheduling at iteration granularity. but one edge case breaks the latency story: &lt;strong&gt;long prefills&lt;/strong&gt;.&lt;/p&gt;
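&lt;p&gt;to put a number on &amp;ldquo;long&amp;rdquo;: the rough estimate below, which assumes a 7B model running at about half of an A100&amp;rsquo;s fp16 tensor-core peak, lands in the same ballpark as the figure quoted in the next paragraph.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# rough estimate of how long one un-chunked prefill monopolizes the GPU.
# assumptions: 7B-parameter model, ~312 TFLOPS fp16 peak, ~50% achieved utilization.

params = 7e9
prompt_tokens = 2048
peak_flops = 312e12
assumed_mfu = 0.5            # assumed model FLOPs utilization, not a measurement

prefill_flops = 2 * params * prompt_tokens          # ~2 FLOPs per parameter per token
prefill_s = prefill_flops / (peak_flops * assumed_mfu)

print(f"{prefill_s * 1e3:.0f} ms of blocked decodes")   # ~184 ms
&lt;/code&gt;&lt;/pre&gt;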
&lt;p&gt;when a request arrives with a 2048-token prompt, the scheduler runs it through prefill in a single iteration. on an A100, a 2048-token prefill for a 7B model takes roughly 200 ms. all the decode requests already in the batch are blocked for the entire duration.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/chunked-prefill/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>latency</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item><item><title>Continuous Batching: Scheduling at Iteration Granularity</title><link>https://k4i.top/posts/continuous-batching/</link><pubDate>Wed, 22 Apr 2026 10:30:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sun, 26 Apr 2026 16:08:06 +0800</atom:modified><guid>https://k4i.top/posts/continuous-batching/</guid><description>&lt;h2 id="static-batching"&gt;the static batching problem&lt;/h2&gt;
&lt;p&gt;before continuous batching existed, LLM serving systems used &lt;strong&gt;static batching&lt;/strong&gt;: collect a batch of requests, run them all through the model together, and wait until every request in the batch finishes generating before accepting the next batch.&lt;/p&gt;
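&lt;p&gt;in scheduler terms, the loop looks roughly like the sketch below. the helper names (&lt;code&gt;forward_step&lt;/code&gt;, &lt;code&gt;is_finished&lt;/code&gt;) are hypothetical stand-ins, not any particular serving framework&amp;rsquo;s API; the point is the structure of the loop.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from collections import deque

# toy sketch of a static-batching serving loop (hypothetical helpers, not a real API).
# key property: nothing new is admitted until the slowest request in the batch finishes.
def serve_static(request_queue: deque, batch_size: int, forward_step, is_finished):
    while request_queue:
        # take up to batch_size requests off the queue
        n = min(batch_size, len(request_queue))
        batch = [request_queue.popleft() for _ in range(n)]
        # run decode iterations until EVERY request in the batch is done;
        # requests that finish early keep their slot but produce nothing useful
        while not all(is_finished(r) for r in batch):
            forward_step(batch)
        # only now is the next batch admitted
&lt;/code&gt;&lt;/pre&gt;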
&lt;p&gt;this sounds reasonable — batching is how you saturate a GPU — but it has a fatal flaw.&lt;/p&gt;
&lt;p&gt;different requests produce outputs of wildly different lengths. a request asking &amp;ldquo;what is 2+2?&amp;rdquo; might finish in 5 tokens. a request asking for a short story might need 800. in a static batch, the short request finishes early and then&amp;hellip; does nothing. the GPU keeps crunching for the long request while the short request&amp;rsquo;s slot sits idle.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/continuous-batching/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>batching</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item></channel></rss>