vLLM ModelRunner: How SchedulerOutput Becomes a GPU Forward

sky_io@outlook.com (K4i) — Tue, 23 Jun 2026 10:30:00 +0800

The Scheduler decides what runs in this step; the ModelRunner decides how to run it on the GPU. Where the Scheduler compresses dynamic request queues into a SchedulerOutput, the ModelRunner translates that output into contiguous input tensors, KV cache slot assignments, attention metadata, a forward context, and finally logits and sampled tokens.
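
To make that translation a little more concrete, here is a minimal pure-Python sketch. It is not vLLM's code: `ToySchedulerOutput`, `ToyRequestState`, `ToyBatch`, `prepare_inputs`, `greedy_sample`, and the block size of 4 are all hypothetical stand-ins. Under those assumptions it shows the core idea: each request's scheduled tokens are flattened into one contiguous batch, every token gets a KV cache slot derived from its request's block table, `query_start_loc` records where each request's span starts, and only the last scheduled token of each request feeds a (here greedy) sampler.

```python
from dataclasses import dataclass

# All names below are hypothetical stand-ins, not vLLM's real classes.
BLOCK_SIZE = 4  # assumed number of tokens per KV cache block


@dataclass
class ToySchedulerOutput:
    # req_id -> number of tokens to run for that request in this step
    num_scheduled_tokens: dict[str, int]


@dataclass
class ToyRequestState:
    token_ids: list[int]       # prompt tokens plus tokens generated so far
    num_computed_tokens: int   # tokens whose KV is already in the cache
    block_ids: list[int]       # physical KV cache blocks owned by the request


@dataclass
class ToyBatch:
    input_ids: list[int]        # all scheduled tokens, flattened
    positions: list[int]        # position of each token inside its request
    slot_mapping: list[int]     # cache slot each token's K/V is written to
    query_start_loc: list[int]  # start offset of each request's span
    logits_indices: list[int]   # batch rows whose logits feed the sampler


def prepare_inputs(sched, states, block_size=BLOCK_SIZE):
    """Flatten per-request work into one contiguous batch."""
    input_ids, positions, slot_mapping = [], [], []
    query_start_loc = [0]
    for req_id, n in sched.num_scheduled_tokens.items():
        state = states[req_id]
        start = state.num_computed_tokens
        for pos in range(start, start + n):
            input_ids.append(state.token_ids[pos])
            positions.append(pos)
            # Logical position -> physical block via the block table,
            # then the offset inside that block.
            block = state.block_ids[pos // block_size]
            slot_mapping.append(block * block_size + pos % block_size)
        query_start_loc.append(query_start_loc[-1] + n)
    # One logits row per request: its last scheduled token.
    logits_indices = [end - 1 for end in query_start_loc[1:]]
    return ToyBatch(input_ids, positions, slot_mapping,
                    query_start_loc, logits_indices)


def greedy_sample(all_logits, logits_indices):
    """Greedy decoding: argmax over the selected logits rows."""
    return [max(range(len(all_logits[i])), key=all_logits[i].__getitem__)
            for i in logits_indices]


if __name__ == "__main__":
    states = {
        "a": ToyRequestState([11, 12, 13, 14, 15], 0, [7, 2]),      # prefill
        "b": ToyRequestState([21, 22, 23, 24, 25, 26], 5, [3, 9]),  # decode
    }
    batch = prepare_inputs(ToySchedulerOutput({"a": 5, "b": 1}), states)
    print(batch.input_ids)        # [11, 12, 13, 14, 15, 26]
    print(batch.slot_mapping)     # [28, 29, 30, 31, 8, 37]
    print(batch.query_start_loc)  # [0, 5, 6]
    print(batch.logits_indices)   # [4, 5]
    # Toy logits: one row per batch token, vocabulary of size 3.
    logits = [[0.0, 1.0, 0.5]] * 4 + [[2.0, 0.1, 0.3], [0.2, 0.1, 3.0]]
    print(greedy_sample(logits, batch.logits_indices))  # [0, 2]
```

The Python loops are only for readability; a real runner would presumably build these arrays with vectorized ops into preallocated GPU buffers, but the bookkeeping it has to get right is the same.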

So yes, the ModelRunner is the execution core of inference. It does not own HTTP serving or the global queueing policy, but once a SchedulerOutput exists, this is where the model actually runs.

Read this after the Scheduler post. If the whole path is still fuzzy, start from the request lifecycle overview.
