Chunked Prefill: Slicing the Prefill to Protect Decode Latency

sky_io@outlook.com (K4i) — Wed, 22 Apr 2026 11:00:00 +0800

the interference problem

continuous batching keeps the GPU busy by scheduling at iteration granularity. but one edge case breaks the latency story: long prefills.

when a request arrives with a 2048-token prompt, the scheduler runs the whole prompt through prefill in a single iteration. on an A100, a 2048-token prefill for a 7B model takes roughly 200 ms. every decode request already in the batch is blocked for that entire duration, so each one waits roughly 200 ms longer than usual for its next token.
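to make the stall concrete, here is a toy model of the gap a decode request sees between tokens when a long prefill lands in the middle of its stream. it is a sketch, not a measurement: it assumes prefill cost scales linearly with prompt length (calibrated to the ~200 ms for 2048 tokens quoted above), and `DECODE_STEP_MS` is a made-up value chosen only for illustration.

```python
# Toy model of prefill/decode interference. Not a benchmark.
# Assumption: prefill cost is linear in prompt tokens, calibrated to the
# ~200 ms per 2048-token prompt figure quoted in the text.
PREFILL_MS_PER_TOKEN = 200.0 / 2048

# Hypothetical cost of one decode iteration; picked only for illustration.
DECODE_STEP_MS = 20.0


def decode_stall_ms(prompt_tokens: int) -> float:
    """Time in-flight decode requests wait while one un-chunked prefill runs."""
    return prompt_tokens * PREFILL_MS_PER_TOKEN


def inter_token_gaps(prompt_tokens: int, steps_before: int = 3,
                     steps_after: int = 3) -> list[float]:
    """Gaps (ms) between consecutive tokens of one decode request when a
    prefill iteration is scheduled in the middle of its stream."""
    gaps = [DECODE_STEP_MS] * steps_before
    # The prefill iteration runs first, then the decode step that finally
    # produces the next token, so the gap is the sum of both.
    gaps.append(decode_stall_ms(prompt_tokens) + DECODE_STEP_MS)
    gaps += [DECODE_STEP_MS] * steps_after
    return gaps


if __name__ == "__main__":
    for gap in inter_token_gaps(2048):
        print(f"{gap:.0f} ms")
```

under these assumptions, one gap in the stream is the prefill time plus a normal decode step, while every other gap is just a decode step. that single spike is the interference the rest of this post is about.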
