<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Scheduling on k4i's blog</title><link>https://k4i.top/tags/scheduling/</link><description>Recent content in Scheduling on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Wed, 22 Apr 2026 12:00:00 +0800</lastBuildDate><atom:link href="https://k4i.top/tags/scheduling/index.xml" rel="self" type="application/rss+xml"/><item><title>Disaggregated Prefill: Splitting Compute Across Machines</title><link>https://k4i.top/posts/disaggregated-prefill/</link><pubDate>Wed, 22 Apr 2026 12:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sun, 26 Apr 2026 16:08:06 +0800</atom:modified><guid>https://k4i.top/posts/disaggregated-prefill/</guid><description>&lt;h2 id="ceiling"&gt;why same-GPU coexistence has a ceiling&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://k4i.top/posts/chunked-prefill/"&gt;chunked prefill&lt;/a&gt; makes prefill-decode coexistence more tolerable by slicing the prefill into small pieces. but even with perfect chunking, prefill and decode are still &lt;em&gt;sharing the same GPU&lt;/em&gt;. they compete for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HBM bandwidth&lt;/strong&gt; — both need to read from and write to GPU memory each iteration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;compute units&lt;/strong&gt; — prefill&amp;rsquo;s GEMM and decode&amp;rsquo;s GEMV contend for the same tensor cores (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV cache space&lt;/strong&gt; — prefill temporarily occupies blocks that could serve decode requests&lt;/li&gt;
&lt;/ul&gt;
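&lt;p&gt;the GEMM-vs-GEMV point is easiest to see with a back-of-envelope roofline. the sketch below uses assumed numbers (a 7B fp16 model on an A100-class GPU, roughly 312 TFLOPS of tensor-core throughput and 2 TB/s of HBM bandwidth), not measurements; it only illustrates that a prefill step is dominated by compute while a decode step is dominated by re-reading the weights from HBM.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# back-of-envelope roofline: why prefill is compute-bound and decode is memory-bound.
# the constants are illustrative assumptions (7B fp16 model, A100-class GPU), not measurements.

PARAMS = 7e9                # model parameters
PEAK_FLOPS = 312e12         # assumed fp16 tensor-core peak, FLOP/s
HBM_BW = 2.0e12             # assumed HBM bandwidth, bytes/s
WEIGHT_BYTES = 2 * PARAMS   # fp16 weights, 2 bytes per parameter

def step_lower_bounds(tokens):
    """rough compute-time and memory-time lower bounds for one forward pass."""
    compute_s = (2 * PARAMS * tokens) / PEAK_FLOPS  # ~2 FLOPs per parameter per token
    memory_s = WEIGHT_BYTES / HBM_BW                # every step re-reads all the weights
    return compute_s, memory_s

# prefill: a 2048-token prompt in one shot -- compute dominates (~92 ms vs ~7 ms)
print("prefill 2048 tokens:", step_lower_bounds(2048))

# decode: one new token each for a batch of 8 requests -- memory dominates (~0.4 ms vs ~7 ms)
print("decode, batch of 8: ", step_lower_bounds(8))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;the exact constants matter less than the shape: the two phases are bottlenecked on different resources, but on a single GPU each one ends up waiting behind the other&amp;rsquo;s bottleneck anyway.&lt;/p&gt;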
&lt;p&gt;at moderate scale, this coexistence is acceptable. at large scale — thousands of requests/second, strict SLOs, multi-GPU clusters — the competition becomes a bottleneck that chunking alone cannot resolve.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/disaggregated-prefill/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>distributed</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item><item><title>Chunked Prefill: Slicing the Prefill to Protect Decode Latency</title><link>https://k4i.top/posts/chunked-prefill/</link><pubDate>Wed, 22 Apr 2026 11:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sun, 26 Apr 2026 16:08:06 +0800</atom:modified><guid>https://k4i.top/posts/chunked-prefill/</guid><description>&lt;h2 id="interference"&gt;the interference problem&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://k4i.top/posts/continuous-batching/"&gt;continuous batching&lt;/a&gt; keeps the GPU busy by scheduling at iteration granularity. but one edge case breaks the latency story: &lt;strong&gt;long prefills&lt;/strong&gt;.&lt;/p&gt;
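&lt;p&gt;to put a number on &amp;ldquo;long&amp;rdquo;: the rough estimate below, which assumes a 7B model running at about half of an A100&amp;rsquo;s fp16 tensor-core peak, lands in the same ballpark as the figure quoted in the next paragraph.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# rough estimate of how long one un-chunked prefill monopolizes the GPU.
# assumptions: 7B-parameter model, ~312 TFLOPS fp16 peak, ~50% achieved utilization.

params = 7e9
prompt_tokens = 2048
peak_flops = 312e12
assumed_mfu = 0.5            # assumed model FLOPs utilization, not a measurement

prefill_flops = 2 * params * prompt_tokens          # ~2 FLOPs per parameter per token
prefill_s = prefill_flops / (peak_flops * assumed_mfu)

print(f"{prefill_s * 1e3:.0f} ms of blocked decodes")   # ~184 ms
&lt;/code&gt;&lt;/pre&gt;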
&lt;p&gt;when a request arrives with a 2048-token prompt, the scheduler runs it through prefill in a single iteration. on an A100, a 2048-token prefill for a 7B model takes roughly 200 ms. all the decode requests already in the batch are blocked for the entire duration.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/chunked-prefill/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>latency</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item><item><title>Continuous Batching: Scheduling at Iteration Granularity</title><link>https://k4i.top/posts/continuous-batching/</link><pubDate>Wed, 22 Apr 2026 10:30:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sun, 26 Apr 2026 16:08:06 +0800</atom:modified><guid>https://k4i.top/posts/continuous-batching/</guid><description>&lt;h2 id="static-batching"&gt;the static batching problem&lt;/h2&gt;
&lt;p&gt;before continuous batching existed, LLM serving systems used &lt;strong&gt;static batching&lt;/strong&gt;: collect a batch of requests, run them all through the model together, and wait until every request in the batch finishes generating before accepting the next batch.&lt;/p&gt;
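&lt;p&gt;in scheduler terms, the loop looks roughly like the sketch below. the helper names (&lt;code&gt;forward_step&lt;/code&gt;, &lt;code&gt;is_finished&lt;/code&gt;) are hypothetical stand-ins, not any particular serving framework&amp;rsquo;s API; the point is the structure of the loop.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from collections import deque

# toy sketch of a static-batching serving loop (hypothetical helpers, not a real API).
# key property: nothing new is admitted until the slowest request in the batch finishes.
def serve_static(request_queue: deque, batch_size: int, forward_step, is_finished):
    while request_queue:
        # take up to batch_size requests off the queue
        n = min(batch_size, len(request_queue))
        batch = [request_queue.popleft() for _ in range(n)]
        # run decode iterations until EVERY request in the batch is done;
        # requests that finish early keep their slot but produce nothing useful
        while not all(is_finished(r) for r in batch):
            forward_step(batch)
        # only now is the next batch admitted
&lt;/code&gt;&lt;/pre&gt;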
&lt;p&gt;this sounds reasonable — batching is how you saturate a GPU — but it has a fatal flaw.&lt;/p&gt;
&lt;p&gt;different requests produce outputs of wildly different lengths. a request asking &amp;ldquo;what is 2+2?&amp;rdquo; might finish in 5 tokens. a request asking for a short story might need 800. in a static batch, the short request finishes early and then&amp;hellip; does nothing. the GPU keeps crunching for the long request while the short request&amp;rsquo;s slot sits idle.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/continuous-batching/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>batching</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item></channel></rss>