This series is for source reading and engineering follow-through. The goal is not to translate files line by line, but to locate core inference-engine mechanisms in real code paths and verify their behavior with benchmarks or small experiments.
Reading Order
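Start with the first three posts below, which follow the request path -> scheduling decision -> GPU execution order. The remaining posts each cover a single mechanism and can be read in any order after that: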
- Request lifecycle: from OpenAI API to one forward pass
- vLLM Scheduler: How Request Queues Become SchedulerOutput
- vLLM ModelRunner: How SchedulerOutput Becomes a GPU Forward
- Inference sampling: temperature, top-p, and top-k
- vLLM architecture map: from core serving engine to vLLM-Omni (draft)
- vLLM Block Manager: from logical blocks to physical KV blocks
- SGLang Radix Cache: why prefix reuse wants a tree
- What a prefix cache hit actually saves
- Chunked prefill parameters, scheduling branches, and benchmarks
- Why structured output / FSM decoding is a strong SGLang use case
Standard Format
Each source-reading post should answer four questions:
- What production problem does this mechanism solve?
- Where is the code entry point?
- How do the key data structures change?
- Which metric proves the behavior affects TTFT, TPOT, throughput, or memory?
That keeps source reading tied to what the job actually requires (profiling, bottleneck analysis, and engineering delivery) rather than to recognizing class names.
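For the metric question, it helps to fix the definitions before benchmarking anything. The sketch below uses the common convention: TTFT is the time from sending the request to the first output token, and TPOT is the mean gap between later output tokens. The `LatencyMetrics` and `latency_metrics` names are hypothetical helpers for illustration and are not taken from vLLM or SGLang. The timestamps are assumed to come from your own client-side streaming loop.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyMetrics:
    ttft_s: float               # time to first token, in seconds
    tpot_s: float               # mean time per output token after the first
    output_tokens_per_s: float  # output tokens divided by end-to-end latency


def latency_metrics(request_start_s: float,
                    token_times_s: Sequence[float]) -> LatencyMetrics:
    """Derive per-request latency metrics from client-side timestamps.

    request_start_s: when the request was sent.
    token_times_s:   arrival time of each streamed output token, in order.
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")

    n = len(token_times_s)
    ttft = token_times_s[0] - request_start_s
    end_to_end = token_times_s[-1] - request_start_s

    # TPOT excludes the first token, which is dominated by prefill.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    throughput = n / end_to_end if end_to_end > 0 else float("inf")
    return LatencyMetrics(ttft, tpot, throughput)


# Illustrative timestamps only, not measured data.
m = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(m)
```

Keeping TTFT and TPOT separate matters because they respond to different mechanisms: a prefix cache hit should mainly move TTFT, while scheduling and batching choices tend to show up in TPOT and throughput.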