The Scheduler decides what to run this step. ModelRunner decides how to run it on the GPU. Where the Scheduler compresses dynamic request queues into SchedulerOutput, ModelRunner translates that output into contiguous tensors, KV cache slots, attention metadata, forward context, logits, and sampled tokens.
So yes, ModelRunner is the execution core of inference. It does not own HTTP serving or global queue policy, but once SchedulerOutput exists, the model starts running here.
Read this after the Scheduler post. If the whole path is still fuzzy, start from the request lifecycle overview.
Figure 1: GPUModelRunner sits between SchedulerOutput and the actual model forward. It owns input materialization, attention metadata, KV slot mapping, forward context, logits, and sampling state.
Start With A Small Batch
Suppose the scheduler emits this step:
| request | computed tokens | scheduled tokens | phase |
|---|---|---|---|
| A | 4 | 1 | decode |
| B | 0 | 3 | prefill chunk |
From the scheduler’s perspective, this is a normal mixed batch: request A decodes one token, while request B prefills three tokens. The GPU cannot execute that high-level description directly. ModelRunner lowers it into execution data:
| data | meaning | toy batch shape |
|---|---|---|
| input_ids | actual tokens for this step | [A4, B0, B1, B2] |
| positions | each token’s position in its sequence | [4, 0, 1, 2] |
| query_start_loc | request boundaries in the flattened token batch | [0, 1, 4] |
| seq_lens | optimistic sequence lengths after this forward | [5, 3] |
| slot_mapping | physical KV cache slot for each token | computed from the block table |
| logits_indices | hidden states that should become logits | usually the last position per request |
That is the core mental model: ModelRunner is not just “calling a PyTorch model.” It maintains execution invariants across scheduling, KV cache, attention backends, CUDA graphs, pipeline parallelism, speculative decoding, and sampling.
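The lowering in the table above can be reproduced in a few lines. This is a minimal sketch in plain Python, not vLLM's code: ScheduledRequest and lower_batch are hypothetical names, and the real runner builds these arrays as tensors inside a persistent batch. It also assumes every request wants logits at its last scheduled token, which matches the "usually" in the table but is not always true.

```python
from dataclasses import dataclass


@dataclass
class ScheduledRequest:
    """Hypothetical stand-in for one request's entry in SchedulerOutput."""
    req_id: str
    num_computed_tokens: int
    num_scheduled_tokens: int


def lower_batch(requests):
    """Flatten request-level scheduling decisions into token-level lists."""
    positions = []          # position of each token inside its own sequence
    req_indices = []        # which request each flattened token belongs to
    query_start_loc = [0]   # prefix sums of scheduled tokens (request boundaries)
    seq_lens = []           # sequence length once this step's tokens are included
    for idx, req in enumerate(requests):
        start = req.num_computed_tokens
        end = start + req.num_scheduled_tokens
        positions.extend(range(start, end))
        req_indices.extend([idx] * req.num_scheduled_tokens)
        query_start_loc.append(query_start_loc[-1] + req.num_scheduled_tokens)
        seq_lens.append(end)
    # Assumption: logits are taken at the last scheduled token of each request.
    logits_indices = [boundary - 1 for boundary in query_start_loc[1:]]
    return {
        "positions": positions,
        "req_indices": req_indices,
        "query_start_loc": query_start_loc,
        "seq_lens": seq_lens,
        "logits_indices": logits_indices,
    }


# The toy batch: A decodes one token, B prefills three.
batch = lower_batch([
    ScheduledRequest("A", num_computed_tokens=4, num_scheduled_tokens=1),
    ScheduledRequest("B", num_computed_tokens=0, num_scheduled_tokens=3),
])
print(batch["positions"])        # [4, 0, 1, 2]
print(batch["query_start_loc"])  # [0, 1, 4]
print(batch["seq_lens"])         # [5, 3]
```

Under that assumption, logits_indices comes out as [0, 3]: the single decode token of A and the last prefill token of B.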
Entry Points And Execution Flow
This post follows the vLLM V1 path: GPUWorker calls GPUModelRunner.execute_model(), and later GPUModelRunner.sample_tokens().
GPUWorker is the worker-level entry point. It handles pipeline-parallel tensor receive/send and then calls model_runner.execute_model() with the current SchedulerOutput. In this checkout, GPUWorker can choose the V1 runner or the V2 runner. This article uses V1 because its responsibilities are concentrated in one file, which makes the mechanism easier to inspect first.
Execution is split into two phases. First, GPUModelRunner.execute_model() preprocesses inputs, runs the forward pass, computes logits, stores ephemeral state in ExecuteModelState, and returns None. Later, GPUWorker.sample_tokens() calls model_runner.sample_tokens() to sample, update request state, and produce ModelRunnerOutput. This split supports async scheduling, pipeline parallelism, speculative decoding, and structured output.
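The two-phase contract is easier to see in a toy. The sketch below is not vLLM's implementation: ToyRunner and ToyExecuteState are hypothetical names, the "forward pass" is replaced by logits handed in directly through a made-up fake_logits key, and sampling is plain greedy argmax. It only illustrates the shape of the handoff: phase one stashes state and returns None, phase two consumes that state exactly once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyExecuteState:
    """Hypothetical stand-in for ExecuteModelState: lives only between phases."""
    logits: list  # one row of scores per request


class ToyRunner:
    def __init__(self):
        self._state: Optional[ToyExecuteState] = None

    def execute_model(self, scheduler_output):
        # Phase 1: a real runner would preprocess, run forward, and compute
        # logits here. The toy takes the logits as given.
        if self._state is not None:
            raise RuntimeError("previous step was never sampled")
        self._state = ToyExecuteState(logits=scheduler_output["fake_logits"])
        return None  # the caller gets nothing until sample_tokens() runs

    def sample_tokens(self):
        # Phase 2: consume the stashed state exactly once.
        state, self._state = self._state, None
        if state is None:
            raise RuntimeError("sample_tokens() called before execute_model()")
        # Greedy sampling: index of the highest score in each row.
        return [max(range(len(row)), key=row.__getitem__) for row in state.logits]


runner = ToyRunner()
result = runner.execute_model({"fake_logits": [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]]})
tokens = runner.sample_tokens()
print(result, tokens)  # None [1, 0]
```

Because nothing is returned from phase one, the caller is free to do other work between the two calls, which is the property the features listed above rely on.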
One execute_model() step can be compressed into this table:
| phase | what happens | why it matters |
|---|---|---|
| update persistent batch | apply SchedulerOutput deltas to runner-owned batch state | avoids rebuilding large tensors from Python objects every step |
| build input tensors | create req_indices, input_ids, positions, query_start_loc, logits_indices | lowers request-level decisions into token-level tensors |
| choose execution shape | select padding, CUDA graph mode, microbatching, cross-DP token counts | reconciles dynamic traffic with stable GPU shapes |
| build attention metadata | compute block tables, slot mappings, seq lens, prefill/decode/spec state | tells attention backends how to read/write KV cache |
| forward and sample | call the model under set_forward_context(...), compute logits, then sample | produces tokens and state for the next scheduler step |
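For the "build attention metadata" row, the slot mapping is the piece most worth working by hand. In a paged KV cache, a token's logical position selects a block through the request's block table, and the remainder selects the offset inside that block. The sketch below assumes a toy block size of 4 and made-up physical block numbers; compute_slot_mapping is a hypothetical helper, not a vLLM function.

```python
def compute_slot_mapping(block_table, positions, block_size):
    """Map logical token positions to physical KV cache slots.

    block_table[i] is the physical block that holds logical block i
    of one request.
    """
    return [
        block_table[pos // block_size] * block_size + pos % block_size
        for pos in positions
    ]


BLOCK_SIZE = 4  # toy value chosen for illustration

# Request A: position 4 falls in logical block 1, stored in physical block 2.
slots_a = compute_slot_mapping([7, 2], positions=[4], block_size=BLOCK_SIZE)
# Request B: positions 0..2 fall in logical block 0, stored in physical block 5.
slots_b = compute_slot_mapping([5], positions=[0, 1, 2], block_size=BLOCK_SIZE)

slot_mapping = slots_a + slots_b
print(slot_mapping)  # [8, 20, 21, 22]
```

The flattened result lines up one-to-one with input_ids, so each token in the step knows where its key and value will be written.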
The actual model call is short: a single forward invocation wrapped in set_forward_context(...).
That short call sits on top of all the prior preparation. set_forward_context(...) has already installed attention metadata, slot mappings, CUDA graph runtime mode, and microbatch slices. input_ids, positions, inputs_embeds, and pipeline intermediate tensors have been shaped for execution.
The boundary is important: model classes own transformer blocks, MLPs, MoE layers, and logits heads; ModelRunner plus attention backends own the runtime environment in which those layers execute. A high-performance forward pass needs both.
Why V2 Reworks This Layer
The vLLM source tree already contains a Model Runner V2 design document. Its existence is itself a signal: ModelRunner is where inference execution complexity accumulates.
| problem | V1 pressure | V2 direction |
|---|---|---|
| persistent batch | state and input tensors are coupled | decouple persistent state from per-step inputs |
| async scheduling | CPU/GPU async copies can race | async-first execution and fewer barriers |
| block table updates | large tensors are expensive to copy every step | staged writes that submit only deltas |
| sampling | Python/torch paths are complex | Triton-native sampler |
| CUDA graphs | capture/launch logic is implicit | explicit CUDA graph manager |
| file structure | V1 gpu_model_runner.py is large | split runner logic into focused modules |
Read V1 to understand the mechanism. Read V2 to understand the engineering direction. The hard part is not calling model.forward; it is preserving invariants across dynamic requests, KV cache, attention backends, sampling, parallel communication, and CUDA graph execution.
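To make the "staged writes that submit only deltas" row concrete, here is a minimal sketch of the idea in plain Python. It is an illustration under stated assumptions, not the V2 implementation: StagedBlockTable is a hypothetical class, and a nested list stands in for the GPU-resident block table. The point is only that each step records the cells that changed and applies those, rather than copying the whole table.

```python
class StagedBlockTable:
    """Toy block table that applies only the entries changed this step."""

    def __init__(self, num_rows, num_cols):
        # Stand-in for the device-side tensor (assumption: 0 means "unset").
        self.device = [[0] * num_cols for _ in range(num_rows)]
        self._staged = []  # pending (row, col, block_id) writes

    def append_blocks(self, row, start_col, block_ids):
        """Stage newly allocated blocks for one request row."""
        for offset, block_id in enumerate(block_ids):
            self._staged.append((row, start_col + offset, block_id))

    def commit(self):
        """Apply staged writes and return how many cells were touched."""
        num_writes = len(self._staged)
        for row, col, block_id in self._staged:
            self.device[row][col] = block_id
        self._staged.clear()
        return num_writes


table = StagedBlockTable(num_rows=2, num_cols=4)

# Step 1: request A (row 0) gets blocks 7 and 2, request B (row 1) gets block 5.
table.append_blocks(0, 0, [7, 2])
table.append_blocks(1, 0, [5])
first_commit = table.commit()

# Step 2: only request B grows, so only one cell is written.
table.append_blocks(1, 1, [9])
second_commit = table.commit()

print(first_commit, second_commit)  # 3 1
print(table.device)                 # [[7, 2, 0, 0], [5, 9, 0, 0]]
```

The second commit touches one cell of an eight-cell table; at production sizes that difference is what makes delta submission attractive.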
How vLLM-Omni Extends The Boundary
vLLM-Omni does not discard vLLM’s ModelRunner boundary. It reuses and extends it for multimodal, multi-stage execution.
In vllm-omni, OmniGPUModelRunner inherits from vLLM’s GPUModelRunner. GPUARModelRunner targets autoregressive stages. It keeps the two-phase execute/sample flow while returning per-request hidden representations, multimodal outputs, and connector payloads. GPUGenerationModelRunner targets non-autoregressive generation stages. It reuses input preparation, multimodal handling, and TP/PP/DP glue, but does not compute logits or run token sampling; instead, it returns generation outputs through output fields.
More generally: vLLM’s ModelRunner is the execution core for AR transformer serving; vLLM-Omni places that core inside a larger stage graph. Text tokens and speech tokens can still use the scheduler, KV cache, attention metadata, and model-runner machinery. Diffusion, vocoder, and code2wav stages need specialized runner/output protocols because their outputs are not next-token logits.
Source-Reading Invariants
ModelRunner is easy to get lost in because the file is long and feature-flag heavy. Start with five invariants:
- SchedulerOutput is the input contract: the scheduler decides which requests get token budget in this step.
- InputBatch is cross-step state: the runner owns token ids, request indices, sampling metadata, block tables, and related persistent state.
- slot_mapping is the KV cache landing zone: every token in this step must map to a physical KV slot.
- forward context is the attention runtime environment: attention layers read batch metadata from it.
- sampled tokens feed the next scheduler step: a forward pass is not the endpoint; it produces the next scheduling input.
If you remember one sentence, make it this: ModelRunner translates “what should run this step” into “which GPU shape, which KV slots, and which attention metadata should be used to run it.”
A practical reading order:
- vllm/v1/worker/gpu_worker.py: how the worker calls the runner and where pipeline parallelism enters.
- vllm/v1/worker/gpu_model_runner.py: focus on execute_model(), _prepare_inputs(), _build_attention_metadata(), and sample_tokens().
- vllm/v1/worker/gpu_input_batch.py: understand how InputBatch carries persistent batch state.
- vllm/docs/design/model_runner_v2.md: compare V1 complexity with the V2 design.
- vllm-omni/vllm_omni/worker/*model_runner.py: see how Omni inherits, overrides, and extends the runner boundary.