In the request lifecycle, Scheduler is the easiest piece to underestimate. The HTTP server admits requests, and ModelRunner executes batches on GPU. Scheduler answers the per-step question in between: who runs now, how many tokens do they get, and can the KV cache hold the result?
The three lifecycle posts fit together like this:
| post | question |
|---|---|
| request lifecycle | how a request reaches EngineCore |
| Scheduler | how EngineCore decides what to run in each step |
| ModelRunner | how SchedulerOutput becomes an actual run on GPU |
Scheduler output is not a vague “batch.” It is a concrete SchedulerOutput: which requests are new, which are already cached on workers, how many tokens each request gets, which KV blocks were allocated, which requests were preempted, and which finished requests must be cleaned up.
Figure 1: Scheduler turns request queues, token budgets, KV cache allocation, prefix-cache hits, and preemption decisions into SchedulerOutput. ModelRunner consumes this object in the next stage.
Entry Points And The Loop
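Put Scheduler back into EngineCore.step(): each step calls schedule(), hands the resulting SchedulerOutput to the executor, and feeds the execution result back through update_from_output().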
Scheduler is not a one-time module that runs only when a request arrives. It runs repeatedly in the engine busy loop. Each step creates a SchedulerOutput, ModelRunner executes it, and Scheduler consumes ModelRunner output to update its state.
That means Scheduler maintains a dynamic system:
- a request’s num_computed_tokens changes every step;
- output tokens, speculative tokens, and placeholder tokens change how many tokens remain;
- KV cache blocks may be allocated, reused, preempted, or freed later;
- waiting requests may be blocked by remote KV transfer, structured-output grammar, streaming input, or similar dependencies;
- running requests are not guaranteed to run in every step.
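The repeated schedule, execute, update cycle described above can be sketched with a toy model. This is a minimal illustration and not vLLM code: ToyRequest, ToyScheduler, toy_execute, and step are hypothetical names, KV cache limits are ignored, and the "model" always samples token id 0 once a request has caught up.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:  # hypothetical stand-in for a real request object
    req_id: str
    num_prompt_tokens: int
    max_new_tokens: int
    num_computed_tokens: int = 0
    output_token_ids: list[int] = field(default_factory=list)

    @property
    def num_tokens(self) -> int:
        # Prompt tokens plus everything generated so far.
        return self.num_prompt_tokens + len(self.output_token_ids)


class ToyScheduler:
    def __init__(self, token_budget: int) -> None:
        self.token_budget = token_budget
        self.requests: dict[str, ToyRequest] = {}

    def add_request(self, request: ToyRequest) -> None:
        self.requests[request.req_id] = request

    def schedule(self) -> dict[str, int]:
        """Return req_id -> token count for this step, within the budget."""
        budget = self.token_budget
        num_scheduled_tokens: dict[str, int] = {}
        for req in self.requests.values():
            num_new = min(req.num_tokens - req.num_computed_tokens, budget)
            if num_new <= 0:
                continue  # nothing to do, or no budget left in this step
            num_scheduled_tokens[req.req_id] = num_new
            req.num_computed_tokens += num_new  # advance right after scheduling
            budget -= num_new
        return num_scheduled_tokens

    def update_from_output(self, sampled: dict[str, int]) -> None:
        """Append sampled tokens and drop requests that are done."""
        for req_id, token_id in sampled.items():
            req = self.requests[req_id]
            req.output_token_ids.append(token_id)
            if len(req.output_token_ids) >= req.max_new_tokens:
                del self.requests[req_id]


def toy_execute(scheduler: ToyScheduler, scheduled: dict[str, int]) -> dict[str, int]:
    """Fake forward pass: sample only for requests that have caught up."""
    sampled: dict[str, int] = {}
    for req_id in scheduled:
        req = scheduler.requests[req_id]
        if req.num_computed_tokens == req.num_tokens:
            sampled[req_id] = 0  # dummy token id
    return sampled


def step(scheduler: ToyScheduler) -> dict[str, int]:
    scheduled = scheduler.schedule()
    sampled = toy_execute(scheduler, scheduled)
    scheduler.update_from_output(sampled)
    return scheduled
```

With a budget of 8 tokens, a 10-token prompt needs two steps of prefill before it samples anything, and a second request only gets tokens once budget is left over. That is the sense in which running requests are not guaranteed to run in every step.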
What One schedule() Step Does
The comment at the top of Scheduler.schedule() is the key: the scheduler does not have a hard-coded “decode phase” or “prefill phase.” Each request has num_computed_tokens and a target num_tokens_with_spec. At each step, Scheduler tries to assign enough tokens for requests to catch up.
That one abstraction covers normal decode, prefill, chunked prefill, prefix caching, and speculative decoding. Small example:
| request | state | computed tokens | target tokens | possible scheduling |
|---|---|---|---|---|
| A | running | 99 | 100 | decode 1 token |
| B | running | 0 | 4096 | prefill chunk |
| C | waiting | 0 | 128 | new prefill |
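The catch-up rule behind the table can be written as a few lines of arithmetic. The sketch below is an assumption-laden simplification: num_new_tokens and plan_step are hypothetical helpers, the 2048-token budget and the 1024-token long-prefill cap are made-up values, and KV block checks are left out entirely.

```python
def num_new_tokens(
    num_computed_tokens: int,
    num_tokens_with_spec: int,
    token_budget: int,
    long_prefill_token_threshold: int = 0,
) -> int:
    """Tokens a request may advance this step (0 means it does not run)."""
    remaining = num_tokens_with_spec - num_computed_tokens
    # Optionally cap very long prefills so one request cannot eat the step.
    if 0 < long_prefill_token_threshold < remaining:
        remaining = long_prefill_token_threshold
    return min(remaining, token_budget)


def plan_step(
    requests: dict[str, tuple[int, int]],
    token_budget: int,
    long_prefill_token_threshold: int = 0,
) -> dict[str, int]:
    """Walk requests in order, handing out the shared token budget."""
    plan: dict[str, int] = {}
    for req_id, (computed, target) in requests.items():
        n = num_new_tokens(computed, target, token_budget, long_prefill_token_threshold)
        plan[req_id] = n
        token_budget -= n
    return plan


# (computed tokens, target tokens) for the three requests in the table.
TABLE = {"A": (99, 100), "B": (0, 4096), "C": (0, 128)}

uncapped = plan_step(TABLE, token_budget=2048)
capped = plan_step(TABLE, token_budget=2048, long_prefill_token_threshold=1024)
```

Without the cap, B's prefill chunk consumes all remaining budget and C gets nothing in this step. With the cap, B advances by 1024 tokens and C's 128-token prefill still fits.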
This is not plain FIFO. Scheduler must also check token budget, long-prefill thresholds, max_num_running_reqs, KV block availability, prefix-cache hits, and DP prefill balancing. The main path is:
| phase | what happens | key output |
|---|---|---|
| initialize budgets | set token_budget, encoder budget, temporary lists and maps | step resource limits |
| schedule running | active requests get first chance to advance | scheduled_running_reqs, num_scheduled_tokens |
| allocate KV slots | call kv_cache_manager.allocate_slots(...) for new tokens | req_to_new_blocks |
| preempt if needed | free a low-priority running request’s blocks and move the request back to waiting | preempted_reqs |
| admit waiting | admit new or preempted requests, handling prefix/remote KV | scheduled_new_reqs, scheduled_resumed_reqs |
| build output | gather request deltas, block ids, connector metadata | SchedulerOutput |
The important point: KV cache allocation happens during scheduling. Scheduler does not first form a batch and then hope workers can fit it in memory. It allocates KV slots while deciding the step. If allocation fails, preemption may happen.
_preempt_request(...) frees the request’s KV blocks and encoder cache, marks it as PREEMPTED, resets num_computed_tokens, clears speculative tokens, and puts it back at the front of the waiting queue. Scheduling is therefore constrained by KV block availability, not just fairness or FIFO order.
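To make the allocate-or-preempt interplay concrete, here is a toy version of the running-queue pass. Everything in it is hypothetical: ToyBlockPool and schedule_running are illustrative names, the victim is simply the last request in the running list, and the block size of 16 is an assumption. The real scheduler also resets num_computed_tokens and clears speculative tokens on preemption, which this sketch does not model.

```python
from collections import deque

BLOCK_SIZE = 16  # tokens per KV block (assumed for this sketch)


class ToyBlockPool:
    """Counts free KV blocks; does not track block ids."""

    def __init__(self, num_blocks: int) -> None:
        self.free = num_blocks
        self.held: dict[str, int] = {}  # req_id -> blocks currently held

    def allocate_slots(self, req_id: str, total_tokens: int) -> bool:
        blocks_for_total = -(-total_tokens // BLOCK_SIZE)  # ceiling division
        needed = max(0, blocks_for_total - self.held.get(req_id, 0))
        if needed > self.free:
            return False  # caller must preempt someone or give up
        self.free -= needed
        self.held[req_id] = self.held.get(req_id, 0) + needed
        return True

    def free_request(self, req_id: str) -> None:
        self.free += self.held.pop(req_id, 0)


def schedule_running(
    running: list[str],
    target_tokens: dict[str, int],
    pool: ToyBlockPool,
    waiting: deque,
) -> tuple[list[str], list[str]]:
    scheduled: list[str] = []
    preempted: list[str] = []
    idx = 0
    while idx < len(running):
        req_id = running[idx]
        can_run = True
        while not pool.allocate_slots(req_id, target_tokens[req_id]):
            victim = running.pop()      # toy policy: last request loses
            pool.free_request(victim)   # release its KV blocks
            waiting.appendleft(victim)  # back to the front of waiting
            preempted.append(victim)
            if victim == req_id:
                can_run = False         # we preempted the request itself
                break
        if can_run:
            scheduled.append(req_id)
            idx += 1
    return scheduled, preempted
```

Note that allocation is attempted inside the scheduling loop itself. A request is only added to the step after its blocks are secured, which is why the text says scheduling is constrained by KV block availability.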
Prefix cache also changes scheduling here. When a waiting request first enters, kv_cache_manager.get_computed_blocks(request) checks local prefix-cache hits; KVConnector may add external or remote hits. After a hit, num_computed_tokens is no longer zero, so Scheduler only schedules the remaining tokens. Prefix caching therefore changes num_scheduled_tokens and KV block allocation during scheduling; it is not just a detail that attention handles later.
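A small sketch shows why a prefix hit shrinks the scheduled token count. It assumes block-granular caching with chained hashes over full blocks; block_hashes and num_cached_tokens are hypothetical helpers, the block size of 4 is chosen for readability, and edge cases such as a prompt that is fully cached are ignored.

```python
BLOCK_SIZE = 4  # tokens per block (assumed, small for readability)


def block_hashes(token_ids: list[int]) -> list[int]:
    """Hash each full block together with the hash of the block before it."""
    hashes: list[int] = []
    parent = None
    num_full = len(token_ids) // BLOCK_SIZE
    for i in range(num_full):
        block = tuple(token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        parent = hash((parent, block))
        hashes.append(parent)
    return hashes


def num_cached_tokens(token_ids: list[int], cache: set[int]) -> int:
    """Length of the longest cached prefix, in tokens."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break  # a miss ends the prefix; later blocks cannot match
        hits += 1
    return hits * BLOCK_SIZE


# An earlier request populated the cache with its two full blocks.
cache: set[int] = set(block_hashes(list(range(10))))

# A new request shares the first 8 tokens, then diverges.
prompt = list(range(8)) + [100, 101, 102, 103, 104]
already_computed = num_cached_tokens(prompt, cache)
to_schedule = len(prompt) - already_computed
```

Here the new request starts with 8 of its 13 tokens already computed, so only 5 tokens count against the step's token budget, and only the blocks for those remaining tokens need fresh allocation.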
What SchedulerOutput Contains
SchedulerOutput in vllm/v1/core/sched/output.py is the contract between Scheduler and ModelRunner. These fields are the important ones:
| field | role |
|---|---|
| scheduled_new_reqs | requests scheduled for the first time; worker does not yet cache full request data |
| scheduled_cached_reqs | requests already known to workers; only deltas are sent |
| num_scheduled_tokens | core field: req_id -> token count for this step |
| total_num_scheduled_tokens | total scheduled token count; ModelRunner uses it to decide whether forward is needed |
| scheduled_spec_decode_tokens | speculative draft tokens verified or executed in this step |
| scheduled_encoder_inputs | multimodal or encoder inputs that need processing now |
| num_common_prefix_blocks | common prefix blocks among running requests, usable by cascade attention |
| finished_req_ids | requests finished since the previous step, used for worker-side cleanup |
| preempted_req_ids | requests preempted in this step, especially relevant to V2 runner paths |
| kv_connector_metadata | opaque metadata for KV transfer/load/save |
| new_block_ids_to_zero | freshly allocated KV blocks that workers should zero before use |
ModelRunner consumes this object by updating InputBatch from scheduled_new_reqs and scheduled_cached_reqs, preparing a flattened token batch from num_scheduled_tokens, and building attention metadata from block ids and slot mappings.
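The flattening step can be illustrated in plain Python. This is a simplified sketch, not the ModelRunner implementation: flatten_batch is a hypothetical helper, the block size of 4 is an assumption, and num_computed_tokens here means the count before this step's tokens are added. The slot formula is the usual paged-KV mapping of a token position to a physical block and an offset within it.

```python
BLOCK_SIZE = 4  # tokens per KV block (assumed, small for readability)


def flatten_batch(
    num_scheduled_tokens: dict[str, int],
    num_computed_tokens: dict[str, int],
    block_ids: dict[str, list[int]],
) -> tuple[list[int], list[int], list[int]]:
    """Turn per-request token counts into flat per-token arrays."""
    req_indices: list[int] = []   # which request each token belongs to
    positions: list[int] = []     # position of each token in its sequence
    slot_mapping: list[int] = []  # physical KV slot for each token
    for i, (req_id, n) in enumerate(num_scheduled_tokens.items()):
        start = num_computed_tokens[req_id]
        for pos in range(start, start + n):
            block = block_ids[req_id][pos // BLOCK_SIZE]
            req_indices.append(i)
            positions.append(pos)
            slot_mapping.append(block * BLOCK_SIZE + pos % BLOCK_SIZE)
    return req_indices, positions, slot_mapping


# A decodes one token at position 9; B prefills its first three tokens.
req_indices, positions, slot_mapping = flatten_batch(
    num_scheduled_tokens={"A": 1, "B": 3},
    num_computed_tokens={"A": 9, "B": 0},
    block_ids={"A": [7, 2, 5], "B": [3]},
)
```

A decode token and a prefill chunk end up side by side in one flat batch, distinguished only by their request index and positions. This is how a single scheduling abstraction can feed a single forward pass.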
The Other Half: update_from_output()
Reading only schedule() is not enough. scheduler.update_from_output(...) updates Scheduler state after ModelRunner executes. It handles sampled token ids, accepted or rejected speculative draft tokens, stop conditions, logprobs, pooling outputs, KV connector results, stopped request cleanup, and scheduler stats.
One detail matters: _update_after_schedule(...) advances each request’s num_computed_tokens immediately after scheduling, so the next scheduler step can continue chunked prefills without waiting. If speculative tokens are later rejected, update_from_output(...) corrects the computed-token count.
Scheduler is therefore an optimistic state machine. It advances state based on the schedule to keep the engine pipeline moving, then corrects state when real GPU outputs, rejections, stops, errors, or KV transfer results arrive.
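The advance-then-correct pattern is easy to see with speculative decoding numbers. The sketch below is a toy under stated assumptions: ToySpecRequest, advance_after_schedule, and correct_after_output are hypothetical names, and it assumes the verified output always contains the accepted draft tokens plus one extra sampled token.

```python
from dataclasses import dataclass


@dataclass
class ToySpecRequest:  # hypothetical, tracks only the computed-token count
    num_computed_tokens: int


def advance_after_schedule(req: ToySpecRequest, num_scheduled_tokens: int) -> None:
    # Optimistic: assume every scheduled token, drafts included, will stick.
    req.num_computed_tokens += num_scheduled_tokens


def correct_after_output(
    req: ToySpecRequest,
    num_draft_tokens: int,
    generated_token_ids: list[int],
) -> int:
    """Roll back the tokens whose drafts were rejected; return that count."""
    num_accepted = len(generated_token_ids) - 1  # last token is always new
    num_rejected = num_draft_tokens - num_accepted
    req.num_computed_tokens -= num_rejected
    return num_rejected


req = ToySpecRequest(num_computed_tokens=100)

# Step N: one regular token plus three draft tokens are scheduled.
advance_after_schedule(req, num_scheduled_tokens=4)
optimistic = req.num_computed_tokens

# GPU output arrives: two drafts accepted, then one newly sampled token.
rejected = correct_after_output(req, num_draft_tokens=3, generated_token_ids=[11, 12, 13])
corrected = req.num_computed_tokens
```

The count moves from 100 to 104 at schedule time and settles at 103 once the rejection is known. Between those two moments the next scheduling decision can already be made against the optimistic value.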
Boundary And Reading Guide
The Scheduler/ModelRunner boundary is:
| module | question answered | typical data |
|---|---|---|
| Scheduler | who runs this step, how many tokens, and whether KV cache can fit | SchedulerOutput, num_scheduled_tokens, block ids |
| ModelRunner | how this step runs on GPU | InputBatch, input_ids, positions, slot_mapping, attention metadata |
When reading Scheduler, hold onto these invariants:
- one engine step maps to one scheduling decision;
- running requests are considered before waiting requests;
- scheduled tokens must not exceed max_num_scheduled_tokens;
- KV slot allocation is part of scheduling;
- prefix-cache hits reduce this step’s forward work;
- GPU execution reads SchedulerOutput, not waiting/running queues.
If you remember one sentence, make it this: Scheduler compresses dynamic request queues and KV cache constraints into an executable SchedulerOutput for the current step. ModelRunner then turns that output into a real GPU forward.
A practical reading order:
1. vllm/v1/engine/core.py: see how EngineCore.step() connects schedule, execute, and update.
2. vllm/v1/core/sched/interface.py: read the schedule() interface comment first.
3. vllm/v1/core/sched/scheduler.py: focus on schedule(), _preempt_request(), _make_cached_request_data(), and update_from_output().
4. vllm/v1/core/sched/output.py: map SchedulerOutput fields to how ModelRunner consumes them.
5. vllm/v1/core/kv_cache_manager.py: follow allocate_slots(...) to see why scheduling cannot be separated from KV block management.