vLLM Scheduler: How Request Queues Become SchedulerOutput

In the request lifecycle, Scheduler is the easiest piece to underestimate. The HTTP server admits requests, and ModelRunner executes batches on GPU. Scheduler answers the per-step question in between: who runs now, how many tokens do they get, and can the KV cache hold the result?

The three lifecycle posts fit together like this:

| post | question |
| --- | --- |
| request lifecycle | how a request reaches EngineCore |
| Scheduler | EngineCore decides what to run in each step |
| ModelRunner | SchedulerOutput becomes how to run on GPU |

Scheduler output is not a vague “batch.” It is a concrete SchedulerOutput: which requests are new, which are already cached on workers, how many tokens each request gets, which KV blocks were allocated, which requests were preempted, and which finished requests must be cleaned up.

Figure 1: Scheduler turns request queues, token budgets, KV cache allocation, prefix-cache hits, and preemption decisions into SchedulerOutput. ModelRunner consumes this object in the next stage.

Entry Points And The Loop

Put Scheduler back into EngineCore.step():

vllm/v1/engine/core.py
  EngineCore.step()
    -> scheduler.schedule(...)
    -> model_executor.execute_model(scheduler_output, ...)
    -> scheduler.update_from_output(scheduler_output, model_runner_output)

vllm/v1/core/sched/scheduler.py
  Scheduler.schedule()

vllm/v1/core/sched/output.py
  SchedulerOutput

Scheduler is not a one-time module that runs only when a request arrives. It runs repeatedly in the engine busy loop. Each step creates a SchedulerOutput, ModelRunner executes it, and Scheduler consumes ModelRunner output to update its state.

That means Scheduler maintains a dynamic system:

  • a request’s num_computed_tokens changes every step;
  • output tokens, speculative tokens, and placeholder tokens change how many tokens remain;
  • KV cache blocks may be allocated, reused, preempted, or freed later;
  • waiting requests may be blocked by remote KV transfer, structured-output grammar, streaming input, or similar dependencies;
  • running requests are not guaranteed to run in every step.
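The repeated schedule, execute, update cycle can be sketched with toy stand-ins. This is a minimal sketch, not vLLM's code: ToyRequest, ToyScheduler, execute_model, and run_until_idle are hypothetical names, the token budget is the only resource modeled, and the "model" always samples token id 0. Only the three-call shape follows the EngineCore.step() chain described above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    req_id: str
    num_prompt_tokens: int
    max_new_tokens: int
    num_computed_tokens: int = 0
    output_token_ids: list[int] = field(default_factory=list)

    @property
    def num_tokens(self) -> int:
        # Prompt tokens plus everything sampled so far.
        return self.num_prompt_tokens + len(self.output_token_ids)


class ToyScheduler:
    def __init__(self, requests, token_budget: int) -> None:
        self.running = list(requests)
        self.token_budget = token_budget  # must be > 0 or the loop never ends

    def schedule(self) -> dict[str, int]:
        budget = self.token_budget
        num_scheduled_tokens: dict[str, int] = {}
        for req in self.running:
            n = min(req.num_tokens - req.num_computed_tokens, budget)
            if n > 0:
                num_scheduled_tokens[req.req_id] = n
                budget -= n
                # Advance right after scheduling so the next step can continue.
                req.num_computed_tokens += n
        return num_scheduled_tokens

    def update_from_output(self, sampled: dict[str, int]) -> None:
        for req in self.running:
            if req.req_id in sampled:
                req.output_token_ids.append(sampled[req.req_id])
        # Drop requests that reached their output limit.
        self.running = [
            r for r in self.running
            if len(r.output_token_ids) < r.max_new_tokens
        ]


def execute_model(num_scheduled_tokens, requests) -> dict[str, int]:
    # A request yields a token only once all of its tokens are computed,
    # i.e. after its last prefill chunk or after a decode step.
    return {
        r.req_id: 0
        for r in requests
        if r.req_id in num_scheduled_tokens
        and r.num_computed_tokens == r.num_tokens
    }


def run_until_idle(scheduler: ToyScheduler) -> int:
    steps = 0
    while scheduler.running:
        out = scheduler.schedule()
        sampled = execute_model(out, scheduler.running)
        scheduler.update_from_output(sampled)
        steps += 1
    return steps
```

With a 10-token prompt, a budget of 4 tokens per step, and 3 output tokens, the toy loop needs three prefill steps (4 + 4 + 2 tokens) followed by two single-token decode steps. The same request object moves through both regimes without any explicit phase flag.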

What One schedule() Step Does

The comment at the top of Scheduler.schedule() is the key: the scheduler does not have a hard-coded “decode phase” or “prefill phase.” Each request has num_computed_tokens and a target num_tokens_with_spec. At each step, Scheduler tries to assign enough tokens for requests to catch up.

That one abstraction covers normal decode, prefill, chunked prefill, prefix caching, and speculative decoding. Small example:

| request | state | computed tokens | target tokens | possible scheduling |
| --- | --- | --- | --- | --- |
| A | running | 99 | 100 | decode 1 token |
| B | running | 0 | 4096 | prefill chunk |
| C | waiting | 0 | 128 | new prefill |
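The catch-up rule behind this table fits in a few lines. This is a simplified sketch under stated assumptions: tokens_to_schedule and plan_step are hypothetical helpers, requests are visited in table order, and the token budget is the only constraint (no long-prefill threshold, no KV block check, no running-request cap).

```python
def tokens_to_schedule(num_computed_tokens: int,
                       num_tokens_with_spec: int,
                       token_budget: int) -> int:
    # A request wants to close the gap between computed and target tokens,
    # but never gets more than the budget left in this step.
    remaining = num_tokens_with_spec - num_computed_tokens
    return max(0, min(remaining, token_budget))


def plan_step(requests, token_budget: int) -> dict[str, int]:
    plan: dict[str, int] = {}
    for req_id, computed, target in requests:
        n = tokens_to_schedule(computed, target, token_budget)
        if n > 0:
            plan[req_id] = n
            token_budget -= n
    return plan


# (request id, computed tokens, target tokens) from the table above.
requests = [("A", 99, 100), ("B", 0, 4096), ("C", 0, 128)]
plan = plan_step(requests, token_budget=2048)
```

With a budget of 2048 tokens, A gets its single decode token, B gets a 2047-token prefill chunk, and C is left out of this step because the budget is exhausted. With a budget large enough for everything, all three would be scheduled in full.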

This is not plain FIFO. Scheduler must also check token budget, long-prefill thresholds, max_num_running_reqs, KV block availability, prefix-cache hits, and DP prefill balancing. The main path is:

| phase | what happens | key output |
| --- | --- | --- |
| initialize budgets | set token_budget, encoder budget, temporary lists and maps | step resource limits |
| schedule running | active requests get first chance to advance | scheduled_running_reqs, num_scheduled_tokens |
| allocate KV slots | call kv_cache_manager.allocate_slots(...) for new tokens | req_to_new_blocks |
| preempt if needed | free low-priority running request blocks and move it back to waiting | preempted_reqs |
| admit waiting | admit new or preempted requests, handling prefix/remote KV | scheduled_new_reqs, scheduled_resumed_reqs |
| build output | gather request deltas, block ids, connector metadata | SchedulerOutput |

The important point: KV cache allocation happens during scheduling. Scheduler does not first form a batch and then hope workers can fit it in memory. It allocates KV slots while deciding the step. If allocation fails, preemption may happen.

_preempt_request(...) frees the request’s KV blocks and encoder cache, marks it as PREEMPTED, resets num_computed_tokens, clears speculative tokens, and puts it back at the front of the waiting queue. Scheduling is therefore constrained by KV block availability, not just fairness or FIFO order.
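Here is a toy version of "allocate while scheduling, preempt on failure". Everything in it is an assumption made for illustration: Req, blocks_needed, and schedule_running are hypothetical, the block pool is a single integer counter, and the victim is always the request at the tail of the running list. It is not the real allocator or the real preemption policy.

```python
from collections import deque
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


@dataclass
class Req:
    req_id: str
    num_tokens: int
    num_computed_tokens: int = 0
    num_blocks: int = 0


def blocks_needed(req: Req, new_tokens: int) -> int:
    # Ceiling division: blocks required to hold all tokens after this step.
    total = -(-(req.num_computed_tokens + new_tokens) // BLOCK_SIZE)
    return max(0, total - req.num_blocks)


def schedule_running(running: list[Req], waiting: deque, free_blocks: int):
    scheduled: dict[str, int] = {}
    preempted: list[str] = []
    idx = 0
    while idx < len(running):
        req = running[idx]
        new_tokens = req.num_tokens - req.num_computed_tokens
        need = blocks_needed(req, new_tokens)
        can_schedule = True
        while need > free_blocks:
            victim = running.pop()            # assumed: tail is lowest priority
            free_blocks += victim.num_blocks  # free the victim's KV blocks
            victim.num_blocks = 0
            victim.num_computed_tokens = 0    # it must recompute from scratch
            waiting.appendleft(victim)        # front of the waiting queue
            preempted.append(victim.req_id)
            if victim is req:                 # nothing left to evict
                can_schedule = False
                break
        if not can_schedule:
            break
        free_blocks -= need
        req.num_blocks += need
        scheduled[req.req_id] = new_tokens
        idx += 1
    return scheduled, preempted, free_blocks
```

In this toy, two running requests that each need one more block with zero free blocks cannot both advance. The first one evicts the second, takes its freed block, and the evicted request restarts from zero computed tokens at the head of the waiting queue.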

Prefix caching also changes scheduling here. When a waiting request is first considered, kv_cache_manager.get_computed_blocks(request) checks for local prefix-cache hits; KVConnector may add external or remote hits. After a hit, num_computed_tokens is no longer zero, so Scheduler only schedules the remaining tokens. Prefix caching therefore affects num_scheduled_tokens and KV block allocation during scheduling; it is not just a detail that shows up later in attention.
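To make the effect on scheduling concrete, here is a small sketch of block-level prefix matching. It is an illustration under assumptions, not the real cache: block_hashes and num_cached_tokens are hypothetical helpers, the cache is a plain set of hashes, only full blocks are hashed, and BLOCK_SIZE is an arbitrary small value.

```python
BLOCK_SIZE = 4  # tokens per block (small value for illustration)


def block_hashes(token_ids: list[int]) -> list[int]:
    # Chain the hashes so that a block's hash depends on its entire prefix.
    hashes = []
    parent = 0
    num_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, num_full, BLOCK_SIZE):
        block = tuple(token_ids[start:start + BLOCK_SIZE])
        parent = hash((parent, block))
        hashes.append(parent)
    return hashes


def num_cached_tokens(token_ids: list[int], cache: set) -> int:
    # Count leading blocks that are already cached; stop at the first miss.
    hit_blocks = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hit_blocks += 1
    return hit_blocks * BLOCK_SIZE


# An earlier request left two full blocks in the cache.
cache = set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8]))

prompt = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
num_computed_tokens = num_cached_tokens(prompt, cache)
num_to_schedule = len(prompt) - num_computed_tokens
```

The new 11-token prompt shares its first 8 tokens with the cached request, so the request starts with 8 computed tokens and only 3 tokens need to be scheduled. A prompt that differs in its first block gets no hit at all, because every later hash depends on the earlier blocks.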

What SchedulerOutput Contains

SchedulerOutput in vllm/v1/core/sched/output.py is the contract between Scheduler and ModelRunner. These fields are the important ones:

| field | role |
| --- | --- |
| scheduled_new_reqs | requests scheduled for the first time; worker does not yet cache full request data |
| scheduled_cached_reqs | requests already known to workers; only deltas are sent |
| num_scheduled_tokens | core field: req_id -> token count for this step |
| total_num_scheduled_tokens | total scheduled token count; ModelRunner uses it to decide whether forward is needed |
| scheduled_spec_decode_tokens | speculative draft tokens verified or executed in this step |
| scheduled_encoder_inputs | multimodal or encoder inputs that need processing now |
| num_common_prefix_blocks | common prefix blocks among running requests, usable by cascade attention |
| finished_req_ids | requests finished since the previous step, used for worker-side cleanup |
| preempted_req_ids | requests preempted in this step, especially relevant to V2 runner paths |
| kv_connector_metadata | opaque metadata for KV transfer/load/save |
| new_block_ids_to_zero | freshly allocated KV blocks that workers should zero before use |

ModelRunner consumes this object by updating InputBatch from scheduled_new_reqs and scheduled_cached_reqs, preparing a flattened token batch from num_scheduled_tokens, and building attention metadata from block ids and slot mappings.
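The "flattened token batch" step can be pictured as follows. This is a rough sketch, not the ModelRunner implementation: flatten_batch is a hypothetical helper, the inputs are plain dicts, and the real runner works on preallocated tensors rather than Python lists.

```python
def flatten_batch(num_scheduled_tokens: dict[str, int],
                  num_computed_tokens: dict[str, int]):
    # Lay all scheduled tokens of all requests out in one flat sequence.
    req_ids = list(num_scheduled_tokens)
    positions: list[int] = []
    query_start_loc = [0]  # offset of each request's first token in the batch
    for req_id in req_ids:
        n = num_scheduled_tokens[req_id]
        start = num_computed_tokens[req_id]
        # Positions continue from where the request's computed tokens end.
        positions.extend(range(start, start + n))
        query_start_loc.append(query_start_loc[-1] + n)
    return req_ids, positions, query_start_loc


req_ids, positions, query_start_loc = flatten_batch(
    num_scheduled_tokens={"A": 1, "B": 3},
    num_computed_tokens={"A": 99, "B": 0},
)
```

Here a decode request (A, one token at position 99) and a prefill chunk (B, positions 0 to 2) share one flat batch of four tokens. The offsets in query_start_loc tell the attention code where each request's slice begins and ends.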

The Other Half: update_from_output()

Reading only schedule() is not enough. scheduler.update_from_output(...) updates Scheduler state after ModelRunner executes. It handles sampled token ids, accepted or rejected speculative draft tokens, stop conditions, logprobs, pooling outputs, KV connector results, stopped request cleanup, and scheduler stats.

One detail matters: _update_after_schedule(...) advances each request’s num_computed_tokens immediately after scheduling, so the next scheduler step can continue chunked prefills without waiting. If speculative tokens are later rejected, update_from_output(...) corrects the computed-token count.

Scheduler is therefore an optimistic state machine. It advances state based on the schedule to keep the engine pipeline moving, then corrects state when real GPU outputs, rejections, stops, errors, or KV transfer results arrive.
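The advance-then-correct pattern is easy to see with numbers. The sketch below assumes a simple accounting model: SpecRequest, advance_after_schedule, and correct_after_output are hypothetical names, and a rejected draft token is treated as a token whose computation has to be rolled back.

```python
from dataclasses import dataclass


@dataclass
class SpecRequest:
    num_computed_tokens: int


def advance_after_schedule(req: SpecRequest, num_scheduled_tokens: int) -> None:
    # Optimistic: count every scheduled token as computed before the GPU runs.
    req.num_computed_tokens += num_scheduled_tokens


def correct_after_output(req: SpecRequest,
                         num_draft_tokens: int,
                         num_accepted_tokens: int) -> None:
    # Roll back the draft tokens that verification rejected.
    num_rejected = num_draft_tokens - num_accepted_tokens
    req.num_computed_tokens -= num_rejected


req = SpecRequest(num_computed_tokens=100)

# One regular token plus three speculative draft tokens are scheduled.
advance_after_schedule(req, num_scheduled_tokens=4)
after_schedule = req.num_computed_tokens

# The model accepts only one of the three drafts.
correct_after_output(req, num_draft_tokens=3, num_accepted_tokens=1)
after_output = req.num_computed_tokens
```

The count jumps from 100 to 104 as soon as the step is scheduled, then settles at 102 once two of the three drafts are rejected. Between those two moments, the next schedule() call can already plan against the optimistic value.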

Boundary And Reading Guide

The Scheduler/ModelRunner boundary is:

| module | question answered | typical data |
| --- | --- | --- |
| Scheduler | who runs this step, how many tokens, and whether KV cache can fit | SchedulerOutput, num_scheduled_tokens, block ids |
| ModelRunner | how this step runs on GPU | InputBatch, input_ids, positions, slot_mapping, attention metadata |

When reading Scheduler, hold onto these invariants:

  • one engine step maps to one scheduling decision;
  • running requests are considered before waiting requests;
  • the total number of tokens scheduled in one step must not exceed max_num_scheduled_tokens;
  • KV slot allocation is part of scheduling;
  • prefix-cache hits reduce this step’s forward work;
  • GPU execution reads SchedulerOutput, not waiting/running queues.
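Two of these invariants are easy to state as checks over a single step's decision. This is a debugging-style sketch with assumed inputs: check_step is a hypothetical helper, num_computed_tokens holds the values from before the step, and num_blocks is the number of KV blocks each request owns after allocation.

```python
def check_step(num_scheduled_tokens: dict[str, int],
               num_computed_tokens: dict[str, int],
               num_blocks: dict[str, int],
               block_size: int,
               max_num_scheduled_tokens: int) -> list[str]:
    problems: list[str] = []

    # Invariant: the step as a whole stays within the token budget.
    total = sum(num_scheduled_tokens.values())
    if total > max_num_scheduled_tokens:
        problems.append(
            f"token budget exceeded: {total} > {max_num_scheduled_tokens}")

    # Invariant: allocation is part of scheduling, so every scheduled
    # request already owns enough blocks for the tokens it will have.
    for req_id, n in num_scheduled_tokens.items():
        needed = num_computed_tokens[req_id] + n
        capacity = num_blocks[req_id] * block_size
        if needed > capacity:
            problems.append(
                f"{req_id}: needs {needed} KV slots, has {capacity}")
    return problems


problems = check_step(
    num_scheduled_tokens={"A": 1, "B": 32},
    num_computed_tokens={"A": 99, "B": 0},
    num_blocks={"A": 7, "B": 2},
    block_size=16,
    max_num_scheduled_tokens=64,
)
```

An empty list means the step respects both invariants. Taking one block away from A, or lowering the budget below 33 tokens, would each produce one reported problem.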

If you remember one sentence, make it this: Scheduler compresses dynamic request queues and KV cache constraints into an executable SchedulerOutput for the current step. ModelRunner then turns that output into a real GPU forward.

A practical reading order:

  1. vllm/v1/engine/core.py: see how EngineCore.step() connects schedule, execute, and update.
  2. vllm/v1/core/sched/interface.py: read the schedule() interface comment first.
  3. vllm/v1/core/sched/scheduler.py: focus on schedule(), _preempt_request(), _make_cached_request_data(), and update_from_output().
  4. vllm/v1/core/sched/output.py: map SchedulerOutput fields to how ModelRunner consumes them.
  5. vllm/v1/core/kv_cache_manager.py: follow allocate_slots(...) to see why scheduling cannot be separated from KV block management.