vLLM Scheduler: How Request Queues Become SchedulerOutput

sky_io@outlook.com (K4i) — Tue, 23 Jun 2026 11:20:00 +0800

In the request lifecycle, Scheduler is the easiest piece to underestimate. The HTTP server admits requests, and ModelRunner executes batches on the GPU. Scheduler answers the per-step question in between: who runs now, how many tokens does each request get, and does the KV cache have room for those tokens?
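Those three questions can be sketched as one function. This is a minimal, hypothetical sketch, assuming a fixed per-step token budget, a fixed KV block size of 16 tokens, and a pool of free KV blocks; `Request`, `blocks_needed`, `schedule_step`, and `BLOCK_SIZE` are illustrative names, not vLLM's API. It only walks already-running requests and stops when KV space runs out, whereas a real scheduler also admits waiting requests and can preempt.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16  # assumed: tokens stored per KV block


@dataclass
class Request:  # hypothetical, simplified request state
    req_id: str
    num_tokens: int           # prompt + generated tokens that need KV entries
    num_computed_tokens: int  # tokens whose KV is already in the cache
    num_blocks: int = 0       # KV blocks this request currently holds


def blocks_needed(num_tokens: int) -> int:
    # Ceiling division: fixed-size blocks required to cover num_tokens.
    return -(-num_tokens // BLOCK_SIZE)


def schedule_step(
    running: list[Request], token_budget: int, free_blocks: int
) -> tuple[dict[str, int], int]:
    """Return (tokens scheduled per request id, free blocks left)."""
    scheduled: dict[str, int] = {}
    for req in running:  # who runs now: running requests in order
        if token_budget == 0:
            break
        # How many tokens: a prefill chunk, or 1 token for decode,
        # capped by what is left of this step's budget.
        num_new = min(req.num_tokens - req.num_computed_tokens, token_budget)
        if num_new <= 0:
            continue
        # Can the KV cache hold them: count extra blocks for the new tokens.
        extra = max(blocks_needed(req.num_computed_tokens + num_new) - req.num_blocks, 0)
        if extra > free_blocks:
            # A real scheduler would typically preempt another request and
            # retry; this sketch simply stops scheduling for this step.
            break
        free_blocks -= extra
        req.num_blocks += extra
        scheduled[req.req_id] = num_new
        token_budget -= num_new
        # num_computed_tokens is left unchanged: it advances only after
        # the model has actually run this step.
    return scheduled, free_blocks
```

With a budget of 120 tokens and 10 free blocks, a 100-token prefill, a single decode token, and a 50-token prompt would be scheduled as 100, 1, and 19 tokens: the last prompt is chunked because the budget runs out, not because KV space does.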

The three lifecycle posts fit together like this:

| post | question |
| --- | --- |
| request lifecycle | how a request reaches EngineCore |
| Scheduler | how EngineCore decides what to run in each step |
| ModelRunner | how `SchedulerOutput` becomes GPU execution |

Scheduler output is not a vague “batch.” It is a concrete `SchedulerOutput`: which requests are new, which already have their state cached on the workers, how many tokens each request gets, which KV blocks were allocated, which requests were preempted, and which finished requests the workers must clean up.
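To make that list concrete, here is a simplified, hypothetical dataclass with one field per item. Some names are chosen to resemble vLLM's (`scheduled_new_reqs`, `num_scheduled_tokens`, `finished_req_ids`), but `SketchSchedulerOutput`, its field shapes, and the `is_consistent` check are illustrative assumptions, not vLLM's actual definition, which carries more fields.

```python
from dataclasses import dataclass, field


@dataclass
class NewRequestData:
    # First time the workers see this request: send prompt and block table.
    req_id: str
    prompt_token_ids: list[int]
    block_ids: list[int]


@dataclass
class SketchSchedulerOutput:  # hypothetical, not vLLM's class
    scheduled_new_reqs: list[NewRequestData] = field(default_factory=list)
    # Requests whose state the workers already cache: only deltas are sent.
    scheduled_cached_req_ids: list[str] = field(default_factory=list)
    new_block_ids: dict[str, list[int]] = field(default_factory=dict)
    # How many tokens each scheduled request gets in this step.
    num_scheduled_tokens: dict[str, int] = field(default_factory=dict)
    preempted_req_ids: set[str] = field(default_factory=set)
    # Finished requests whose worker-side state must be cleaned up.
    finished_req_ids: set[str] = field(default_factory=set)

    @property
    def total_num_scheduled_tokens(self) -> int:
        return sum(self.num_scheduled_tokens.values())

    def is_consistent(self) -> bool:
        # Every token count belongs to exactly the new + cached requests,
        # and no scheduled request is also preempted or finished.
        scheduled = {r.req_id for r in self.scheduled_new_reqs}
        scheduled |= set(self.scheduled_cached_req_ids)
        return (
            set(self.num_scheduled_tokens) == scheduled
            and scheduled.isdisjoint(self.preempted_req_ids)
            and scheduled.isdisjoint(self.finished_req_ids)
        )
```

Splitting new from cached requests lets the scheduler send a full prompt and block table once, then only new block ids and token counts on later steps.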
