Continuous Batching: Scheduling at Iteration Granularity

sky_io@outlook.com (K4i) — Wed, 22 Apr 2026 10:30:00 +0800

the static batching problem

before continuous batching existed, LLM serving systems used static batching: collect a batch of requests, run them all through the model together, and wait until every request in the batch finishes generating before accepting the next batch.

this sounds reasonable — batching is how you saturate a GPU — but it has a fatal flaw.

different requests produce outputs of wildly different lengths. a request asking “what is 2+2?” might finish in 5 tokens. a request asking for a short story might need 800. in a static batch, the short request finishes early and then… does nothing. the GPU keeps crunching for the long request while the short request’s slot sits idle.
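to put a number on that waste, here is a minimal sketch using the two requests above. it assumes a simplified model that is not from any real serving system: every slot emits one token per decode step, prefill cost is ignored, and the batch runs until its longest request finishes. the function name `static_batch_utilization` is made up for illustration.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps that produce a useful token in one static batch.

    Simplifying assumptions (illustrative only): one token per slot per
    decode step, no prefill cost, batch ends when the longest request ends.
    """
    steps = max(output_lengths)                     # batch runs this many decode steps
    total_slot_steps = steps * len(output_lengths)  # every slot is held for every step
    useful_slot_steps = sum(output_lengths)         # steps that actually emit a token
    return useful_slot_steps / total_slot_steps


lengths = [5, 800]  # "what is 2+2?" vs. the short story
idle_steps = max(lengths) - min(lengths)
print(f"short request's slot idles for {idle_steps} steps")        # 795
print(f"utilization: {static_batch_utilization(lengths):.1%}")     # 50.3%
```

under these assumptions, roughly half of the slot-steps in this two-request batch do no useful work, and the gap widens as output lengths within a batch diverge.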
