From the outside, vLLM looks like an OpenAI-compatible HTTP server: send /v1/chat/completions, receive a token stream. The useful source-reading question is narrower:
When does a JSON request become an engine request? When does it cross process boundaries? When does it enter the scheduler? When does one model forward actually happen?
This post follows the vLLM V1 path: OpenAI Chat Completions API, AsyncLLM, EngineCore, Scheduler, GPUWorker, and GPUModelRunner. For multimodal requests, we only track the mm_tensor_ipc == "torch_shm" path, where large tensors bypass the main ZMQ/msgpack payload.
Overview: Four Lines Aligned By Request Id
The request lifecycle is not one simple queue. It is several state lines aligned by request id:
- the API server owns HTTP, chat templates, sampling params, and output streams;
- ZMQ carries control messages such as
ADD,ABORT, andUTILITY; - Tensor IPC carries large multimodal tensor payloads;
- EngineCore, Scheduler, and ModelRunner own scheduling and GPU execution.
Figure 1: One request id aligns API state, control transport, optional tensor payload, scheduler state, GPU execution, and streaming output.
Expanded as a call path:
| |
API Process: OpenAI Request To Engine Work
The OpenAI-compatible route lives in:
vllm/entrypoints/openai/chat_completion/api_router.pyvllm/entrypoints/openai/chat_completion/serving.py
The /v1/chat/completions handler is thin: resolve the chat handler, call handler.create_chat_completion(), then return either JSON or a StreamingResponse. No scheduler and no forward pass happen here.
The API-to-engine translation happens in OpenAIServingChat._create_chat_completion(): messages are rendered through the chat template, fields such as max_tokens, temperature, and top_p become SamplingParams, and multimodal content enters the engine input path. Then engine_client.generate() enters AsyncLLM.generate():
| |
The API process registers the output stream for the HTTP handler and sends an EngineCoreRequest to the engine process. Input and output paths split here.
Process Boundary: ZMQ For Control, Tensor IPC For Payload
Inside AsyncLLM, self.engine_core is an EngineCore client. It does not directly call EngineCore.add_request(), and it does not call model.forward(). It sends a typed control message:
| |
In the V1 multi-process path, this control path uses ZMQ. MsgpackEncoder encodes the request body; the EngineCore input thread decodes it into an EngineCoreRequest.
Large multimodal tensors take a separate payload path. When mm_tensor_ipc == "torch_shm", the API server-side encoder puts the real tensor into a shared-memory queue and leaves only a lightweight handle in the ZMQ message. On the EngineCore side, the decoder sees that handle and asks TensorIpcReceiver to fetch the real tensor.
sequenceDiagram
participant API as API server process
participant Enc as MsgpackEncoder
participant Q as Tensor IPC queue
participant ZMQ as ZMQ control message
participant Dec as EngineCore decoder
participant Core as EngineCore
API->>Enc: encode EngineCoreRequest
Enc->>Q: put real tensor
Enc-->>API: return handle
API->>ZMQ: send ADD request with handle
ZMQ->>Dec: receive msgpack payload
Dec->>Q: get tensor by handle
Dec-->>Core: reconstructed EngineCoreRequest
The boundary is simple: ZMQ carries control messages; Tensor IPC carries large request payload tensors. Output tokens do not use Tensor IPC.
EngineCore: schedule, execute, update
After the EngineCore input thread decodes an EngineCoreRequest, the request reaches:
| |
Still no forward pass. The request has only entered scheduler state. Model execution happens in EngineCore.step():
| |
Small example:
| |
The 4 for A is not the prompt length. It is the prefill chunk size chosen for this iteration. One forward computes this iteration’s token batch, not a whole request.
The mechanisms split like this:
| Mechanism | What scheduler cares about | Does it change model math? |
|---|---|---|
| continuous batching | merge tokens from different requests into one batch | no |
| chunked prefill | admit only a prompt chunk per iteration | no |
| prefix caching | skip already-computed prefix tokens | no, but positions/KV view changes |
| paged attention | allocate, reuse, and release KV blocks | attention backend memory access changes |
| speculative decoding | organize draft/verify token work | may add a verification path |
The GPU path enters GPUWorker.execute_model() and then GPUModelRunner.execute_model(). At that point, the input is no longer OpenAI JSON or a full prompt string. It is a tensorized batch prepared by the scheduler and model runner:
input_ids/inputs_embeds: tokens or embeddings for this iteration;positions: token positions;attn_metadata: context required by the attention backend;slot_mappings: KV-cache write locations;model_kwargs: multimodal, LoRA, spec decode, encoder-decoder, and other extra inputs.
Return Path And Boundaries
After forward, vLLM still needs sampling and state updates. EngineCore.step() calls:
| |
This merges sampled tokens, logprobs, finished state, KV/cache release, and related updates back into scheduler state. EngineCore outputs then return to the API process. AsyncLLM pushes them into the per-request collector, and the HTTP handler keeps yielding the SSE stream.
The loop is:
| |
Boundaries to remember:
- The OpenAI API layer is not the engine layer:
ChatCompletionRequestexpresses API semantics, whileEngineCoreRequestexpresses schedulable engine work. AsyncLLM.generate()is not a forward pass; it is the async facade in the API server process.- ZMQ is the control path; Tensor IPC is a payload side channel.
SchedulerOutputis the direct upstream of one forward pass: it decides which tokens and KV blocks this iteration uses.GPUModelRunnerconsumesSchedulerOutputand turns it into tensors, attention metadata, and real GPU execution.- One forward is one engine iteration, not one request.