vLLM ModelRunner: How SchedulerOutput Becomes a GPU Forward

The scheduler decides what to run in each step. ModelRunner decides how to run it on the GPU. If the scheduler compresses dynamic request queues into a SchedulerOutput, ModelRunner translates that output into contiguous tensors, KV cache slots, attention metadata, forward context, logits, and sampled tokens.

That makes ModelRunner the execution core of inference. It does not own HTTP serving or global queue policy, but once a SchedulerOutput exists, model execution starts here.

Read this after the Scheduler post. If the whole path is still fuzzy, start from the request lifecycle overview.

Figure 1: GPUModelRunner sits between SchedulerOutput and the actual model forward. It owns input materialization, attention metadata, KV slot mapping, forward context, logits, and sampling state.

Start With A Small Batch

Suppose the scheduler emits this step:

| request | computed tokens | scheduled tokens | phase |
|---|---|---|---|
| A | 4 | 1 | decode |
| B | 0 | 3 | prefill chunk |

From the scheduler’s perspective, this is a normal mixed batch: request A decodes one token, while request B prefills three tokens. The GPU cannot execute that high-level description directly. ModelRunner lowers it into execution data:

| data | meaning | toy batch shape |
|---|---|---|
| input_ids | actual tokens for this step | [A4, B0, B1, B2] |
| positions | each token's position in its sequence | [4, 0, 1, 2] |
| query_start_loc | request boundaries in the flattened token batch | [0, 1, 4] |
| seq_lens | optimistic sequence lengths after this forward | [5, 3] |
| slot_mapping | physical KV cache slot for each token | computed from the block table |
| logits_indices | hidden states that should become logits | usually the last position per request |
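To make the lowering concrete, here is a minimal pure-Python sketch that reproduces the toy batch values. `ScheduledRequest` and `lower_batch` are hypothetical names invented for this illustration, not vLLM classes or functions, and the real runner builds these arrays as NumPy/torch tensors rather than Python lists.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class ScheduledRequest:
    # Hypothetical container for illustration; not a vLLM class.
    req_id: str
    num_computed_tokens: int   # tokens already processed in earlier steps
    num_scheduled_tokens: int  # tokens the scheduler granted for this step


def lower_batch(requests):
    """Flatten request-level decisions into token-level lists."""
    positions, req_indices = [], []
    for i, r in enumerate(requests):
        start = r.num_computed_tokens
        positions.extend(range(start, start + r.num_scheduled_tokens))
        req_indices.extend([i] * r.num_scheduled_tokens)

    counts = [r.num_scheduled_tokens for r in requests]
    # Prefix sum with a leading 0: request i owns tokens [qsl[i], qsl[i+1]).
    query_start_loc = [0, *accumulate(counts)]
    # Length each sequence will have once this forward pass completes.
    seq_lens = [r.num_computed_tokens + r.num_scheduled_tokens for r in requests]
    # Last flattened position of each request (the "usual" case in the table).
    logits_indices = [end - 1 for end in query_start_loc[1:]]
    return {
        "positions": positions,
        "req_indices": req_indices,
        "query_start_loc": query_start_loc,
        "seq_lens": seq_lens,
        "logits_indices": logits_indices,
    }


batch = lower_batch([
    ScheduledRequest("A", num_computed_tokens=4, num_scheduled_tokens=1),
    ScheduledRequest("B", num_computed_tokens=0, num_scheduled_tokens=3),
])
print(batch["positions"])        # [4, 0, 1, 2]
print(batch["query_start_loc"])  # [0, 1, 4]
print(batch["seq_lens"])         # [5, 3]
```

The flattened layout is the point: attention kernels see one contiguous token dimension, and `query_start_loc` is the only thing that tells them where one request ends and the next begins.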

That is the core mental model: ModelRunner is not just “calling a PyTorch model.” It maintains execution invariants across scheduling, KV cache, attention backends, CUDA graphs, pipeline parallelism, speculative decoding, and sampling.

Entry Points And Execution Flow

This post follows the vLLM V1 path:

vllm/v1/worker/gpu_worker.py
  GPUWorker.execute_model()
    -> self.model_runner.execute_model(...)

vllm/v1/worker/gpu_model_runner.py
  GPUModelRunner.execute_model()
    -> _update_states(...)
    -> _prepare_inputs(...)
    -> _get_slot_mappings(...)
    -> _build_attention_metadata(...)
    -> _preprocess(...)
    -> set_forward_context(...)
    -> _model_forward(...)
    -> compute_logits(...)
    -> sample_tokens(...)

GPUWorker is the worker-level entry point. It handles pipeline-parallel tensor receive/send and then calls model_runner.execute_model() with the current SchedulerOutput. In this checkout, GPUWorker can choose the V1 runner or the V2 runner. This article uses V1 because its responsibilities are concentrated in one file, which makes the mechanism easier to inspect first.

GPUModelRunner.execute_model() is the first half of a two-phase path. In phase one it preprocesses inputs, runs the forward pass, computes logits, stores the ephemeral results in ExecuteModelState, and returns None. Phase two happens later, when GPUWorker.sample_tokens() calls model_runner.sample_tokens() to sample, update request state, and produce ModelRunnerOutput. This split supports async scheduling, pipeline parallelism, speculative decoding, and structured output.
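The two-phase shape can be sketched with a toy runner. Everything below is a stand-in written for this illustration: `ToyRunner` and `ToyExecuteState` are hypothetical, the "forward pass" just copies precomputed logits, and sampling is plain greedy argmax. Only the control flow (park state, return None, sample later) mirrors the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyExecuteState:
    # Stand-in for the ephemeral state the post calls ExecuteModelState.
    req_ids: list
    logits: list  # one row of logits per request


class ToyRunner:
    def __init__(self):
        self._state: Optional[ToyExecuteState] = None

    def execute_model(self, scheduler_output: dict) -> None:
        # Phase 1: pretend to run the forward pass and compute logits,
        # then park the result instead of returning it.
        req_ids = list(scheduler_output)
        logits = [scheduler_output[r] for r in req_ids]
        self._state = ToyExecuteState(req_ids, logits)
        return None

    def sample_tokens(self) -> dict:
        # Phase 2: consume the parked state exactly once.
        if self._state is None:
            raise RuntimeError("sample_tokens() called before execute_model()")
        state, self._state = self._state, None
        # Greedy sampling: index of the largest logit in each row.
        return {
            req_id: max(range(len(row)), key=row.__getitem__)
            for req_id, row in zip(state.req_ids, state.logits)
        }


runner = ToyRunner()
first = runner.execute_model({"A": [0.1, 2.0, 0.3], "B": [1.5, 0.2, 0.1]})
tokens = runner.sample_tokens()
print(first, tokens)  # None {'A': 1, 'B': 0}
```

Because nothing is returned from phase one, the caller is free to do other work (for example, apply structured-output constraints or schedule the next step) before asking for tokens.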

One execute_model() step can be compressed into this table:

| phase | what happens | why it matters |
|---|---|---|
| update persistent batch | apply SchedulerOutput deltas to runner-owned batch state | avoids rebuilding large tensors from Python objects every step |
| build input tensors | create req_indices, input_ids, positions, query_start_loc, logits_indices | lowers request-level decisions into token-level tensors |
| choose execution shape | select padding, CUDA graph mode, microbatching, cross-DP token counts | reconciles dynamic traffic with stable GPU shapes |
| build attention metadata | compute block tables, slot mappings, seq lens, prefill/decode/spec state | tells attention backends how to read/write KV cache |
| forward and sample | call the model under set_forward_context(...), compute logits, then sample | produces tokens and state for the next scheduler step |
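The "choose execution shape" row is the least intuitive, so here is a simplified sketch of one part of it: padding the token count up to a shape that was captured ahead of time. It assumes a fixed list of capture sizes and a single eager fallback; `pad_to_capture_size` is a hypothetical helper, and the real selection logic also weighs graph modes, microbatching, and data-parallel token counts.

```python
from bisect import bisect_left


def pad_to_capture_size(num_tokens, capture_sizes):
    """Return (padded_num_tokens, use_graph) for this step.

    Picks the smallest captured size that fits num_tokens. If the batch is
    larger than every captured size, run unpadded without a graph.
    """
    sizes = sorted(capture_sizes)
    i = bisect_left(sizes, num_tokens)
    if i == len(sizes):
        return num_tokens, False  # no captured shape is big enough
    return sizes[i], True


sizes = [1, 2, 4, 8, 16]  # illustrative values, not vLLM defaults
print(pad_to_capture_size(4, sizes))   # (4, True)
print(pad_to_capture_size(5, sizes))   # (8, True)
print(pad_to_capture_size(17, sizes))  # (17, False)
```

The trade-off is visible even in this toy: a 5-token step pays for 3 padded tokens in exchange for replaying a pre-recorded shape instead of launching kernels one by one.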

The actual model call is short:

return self.model(
    input_ids=input_ids,
    positions=positions,
    intermediate_tensors=intermediate_tensors,
    inputs_embeds=inputs_embeds,
    **model_kwargs,
)

That short call sits on top of all the prior preparation. set_forward_context(...) has already installed attention metadata, slot mappings, CUDA graph runtime mode, and microbatch slices. input_ids, positions, inputs_embeds, and pipeline intermediate tensors have been shaped for execution.
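The forward-context pattern itself is small. The sketch below is a toy built on Python's `contextvars`, not vLLM's implementation: `toy_forward_context` and `toy_attention_layer` are hypothetical names, and the "layer" only pairs each hidden state with its KV slot. It shows why the model call can stay short: layers pull batch metadata from ambient context instead of receiving it as arguments.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# No default value: reading it outside a forward pass raises LookupError.
_FORWARD_CONTEXT: ContextVar = ContextVar("forward_context")


@contextmanager
def toy_forward_context(attn_metadata, slot_mapping):
    """Install per-step metadata for the duration of one forward pass."""
    token = _FORWARD_CONTEXT.set(
        {"attn_metadata": attn_metadata, "slot_mapping": slot_mapping}
    )
    try:
        yield
    finally:
        # Restore the previous state so metadata never leaks across steps.
        _FORWARD_CONTEXT.reset(token)


def toy_attention_layer(hidden_states):
    """Pair each token's hidden state with the KV slot it should write to."""
    ctx = _FORWARD_CONTEXT.get()
    return list(zip(hidden_states, ctx["slot_mapping"]))


with toy_forward_context({"seq_lens": [5, 3]}, slot_mapping=[36, 8, 9, 10]):
    out = toy_attention_layer(["h0", "h1", "h2", "h3"])
print(out)  # [('h0', 36), ('h1', 8), ('h2', 9), ('h3', 10)]
```

Note that the layer signature takes only `hidden_states`; everything batch-specific arrives through the context, which is what keeps model classes independent of scheduling details.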

The boundary is important: model classes own transformer blocks, MLPs, MoE layers, and logits heads; ModelRunner plus attention backends own the runtime environment in which those layers execute. A high-performance forward pass needs both.

Why V2 Reworks This Layer

The vLLM source tree already contains a Model Runner V2 design document. Its existence is itself a signal: ModelRunner is where inference execution complexity accumulates.

| problem | V1 pressure | V2 direction |
|---|---|---|
| persistent batch | state and input tensors are coupled | decouple persistent state from per-step inputs |
| async scheduling | CPU/GPU async copies can race | async-first execution and fewer barriers |
| block table updates | large tensors are expensive to copy every step | staged writes that submit only deltas |
| sampling | Python/torch paths are complex | Triton-native sampler |
| CUDA graphs | capture/launch logic is implicit | explicit CUDA graph manager |
| file structure | V1 gpu_model_runner.py is large | split runner logic into focused modules |

Read V1 to understand the mechanism. Read V2 to understand the engineering direction. The hard part is not calling model.forward; it is preserving invariants across dynamic requests, KV cache, attention backends, sampling, parallel communication, and CUDA graph execution.
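As one illustration of that direction, the "staged writes that submit only deltas" idea can be sketched as follows. This is a toy written for this post, not the V2 implementation: `StagedBlockTable` is a hypothetical class, and a plain dict stands in for the GPU-resident block table. The only claim it demonstrates is that per-step copy volume tracks the number of newly added blocks, not the table size.

```python
class StagedBlockTable:
    """Toy block table that buffers appends and applies them in one flush."""

    def __init__(self):
        self.device_rows = {}  # stand-in for the device-side table
        self._staged = []      # (req_index, new_block_ids) since last flush

    def append_blocks(self, req_index, block_ids):
        # Record the delta only; nothing is copied yet.
        self._staged.append((req_index, list(block_ids)))

    def flush(self):
        """Apply all staged deltas; return how many block ids were copied."""
        copied = 0
        for req_index, block_ids in self._staged:
            self.device_rows.setdefault(req_index, []).extend(block_ids)
            copied += len(block_ids)
        self._staged.clear()
        return copied


table = StagedBlockTable()
table.append_blocks(0, [7, 9])
table.append_blocks(1, [2])
first_copy = table.flush()   # 3 block ids copied
table.append_blocks(0, [11])
second_copy = table.flush()  # only 1 block id copied this step
print(first_copy, second_copy, table.device_rows[0])  # 3 1 [7, 9, 11]
```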

How vLLM-Omni Extends The Boundary

vLLM-Omni does not discard vLLM’s ModelRunner boundary. It reuses and extends it for multimodal, multi-stage execution.

In vllm-omni, OmniGPUModelRunner inherits from vLLM’s GPUModelRunner. GPUARModelRunner targets autoregressive stages. It keeps the two-phase execute/sample flow while returning per-request hidden representations, multimodal outputs, and connector payloads. GPUGenerationModelRunner targets non-autoregressive generation stages. It reuses input preparation, multimodal handling, and TP/PP/DP glue, but does not compute logits or run token sampling; instead, it returns generation outputs through output fields.

More generally: vLLM’s ModelRunner is the execution core for AR transformer serving; vLLM-Omni places that core inside a larger stage graph. Text tokens and speech tokens can still use the scheduler, KV cache, attention metadata, and model-runner machinery. Diffusion, vocoder, and code2wav stages need specialized runner/output protocols because their outputs are not next-token logits.

Source-Reading Invariants

ModelRunner is easy to get lost in because the file is long and feature-flag heavy. Start with five invariants:

  • SchedulerOutput is the input contract: the scheduler decides which requests get token budget in this step.
  • InputBatch is cross-step state: the runner owns token ids, request indices, sampling metadata, block tables, and related persistent state.
  • slot_mapping is the KV cache landing zone: every token in this step must map to a physical KV slot.
  • forward context is the attention runtime environment: attention layers read batch metadata from it.
  • sampled tokens feed the next scheduler step: a forward pass is not the endpoint; it produces the next scheduling input.
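
The slot_mapping invariant is the easiest one to verify by hand. The sketch below assumes the common paged-KV layout in which a slot is `block_number * block_size + offset`; `compute_slot_mapping`, the block tables, and the block size of 4 are illustrative choices for the toy batch (A decoding at position 4, B prefilling positions 0 to 2), not values taken from vLLM.

```python
def compute_slot_mapping(positions, req_indices, block_tables, block_size):
    """Map each scheduled token to a physical KV cache slot.

    block_tables[r] lists the physical block numbers owned by request r,
    in logical order, so position p lives in logical block p // block_size.
    """
    slots = []
    for pos, req in zip(positions, req_indices):
        block_number = block_tables[req][pos // block_size]
        slots.append(block_number * block_size + pos % block_size)
    return slots


slot_mapping = compute_slot_mapping(
    positions=[4, 0, 1, 2],
    req_indices=[0, 1, 1, 1],
    block_tables=[[7, 9], [2]],  # A owns blocks 7 and 9; B owns block 2
    block_size=4,
)
print(slot_mapping)  # [36, 8, 9, 10]
```

Request A's token at position 4 is the first entry of its second block (block 9), so it lands in slot 36, while B's three prefill tokens fill consecutive slots of block 2. Logically adjacent tokens of different requests need not be physically adjacent, which is exactly why every token needs an explicit slot.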

If you remember one sentence, make it this: ModelRunner translates “what should run this step” into “which GPU shape, which KV slots, and which attention metadata should be used to run it.”

A practical reading order:

  1. vllm/v1/worker/gpu_worker.py: how the worker calls the runner and where pipeline parallelism enters.
  2. vllm/v1/worker/gpu_model_runner.py: focus on execute_model(), _prepare_inputs(), _build_attention_metadata(), and sample_tokens().
  3. vllm/v1/worker/gpu_input_batch.py: understand how InputBatch carries persistent batch state.
  4. vllm/docs/design/model_runner_v2.md: compare V1 complexity with the V2 design.
  5. vllm-omni/vllm_omni/worker/*model_runner.py: see how Omni inherits, overrides, and extends the runner boundary.