vLLM / SGLang Source Reading: From Request to Forward Pass

This series is for source reading and engineering follow-through. The goal is not to translate files line by line, but to locate core inference-engine mechanisms in real code paths and verify their behavior with benchmarks or small experiments.

Reading Order

Start with the first three posts, which are the core of the series and follow the order request path -> scheduling decision -> GPU execution. The remaining posts cover individual mechanisms:

Request lifecycle: from OpenAI API to one forward pass
vLLM Scheduler: How Request Queues Become SchedulerOutput
vLLM ModelRunner: How SchedulerOutput Becomes a GPU Forward
Inference sampling: temperature, top-p, and top-k
vLLM architecture map: from core serving engine to vLLM-Omni (draft)
vLLM Block Manager: from logical blocks to physical KV blocks
SGLang Radix Cache: why prefix reuse wants a tree
What a prefix cache hit actually saves
Chunked prefill parameters, scheduling branches, and benchmarks
Why structured output / FSM decoding is a strong SGLang use case

Standard Format

Each source-reading post should answer four questions:

What production problem does this mechanism solve?
Where is the code entry point?
How do the key data structures change?
Which metric proves the behavior affects TTFT, TPOT, throughput, or memory?

That keeps source reading tied to what the job actually requires: profiling, bottleneck analysis, and engineering delivery, rather than just recognizing class names.
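To make the fourth question concrete, here is a minimal sketch of how the three latency and throughput metrics could be computed from per-token timestamps. It is not taken from vLLM or SGLang: `RequestTrace` and the helper functions are hypothetical names for illustration, and the TPOT definition used here (average gap between output tokens after the first) is one common convention among several.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds, as recorded by a hypothetical benchmark client."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token, averaged over the tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap to measure
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second over the wall-clock span of the whole run."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)
```

Under these assumptions, comparing `ttft`, `tpot`, and `throughput` before and after toggling a mechanism (for example prefix caching or chunked prefill) is the kind of evidence each post should end with.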
