This series is about reading source code with engineering follow-through. The goal is not to paraphrase files line by line, but to locate core inference-engine mechanisms in real code paths and verify their behavior with benchmarks or small experiments.
Reading Order
Planned posts will follow the request lifecycle:
- Request lifecycle: from the OpenAI-compatible API to one forward pass
- Scheduler loop: waiting queue, running queue, token budget, and decode priority
- vLLM Block Manager: from logical blocks to physical KV blocks
- SGLang Radix Cache: why prefix reuse wants a tree
- What a prefix cache hit actually saves
- Chunked prefill parameters, scheduling branches, and benchmarks
- Why structured output / FSM decoding is a strong SGLang use case
Standard Format
Each source-reading post should answer four questions:
- What production problem does this mechanism solve?
- Where is the code entry point?
- How do the key data structures change?
- Which metric shows that the behavior affects TTFT (time to first token), TPOT (time per output token), throughput, or memory?
Answering these four questions keeps source reading tied to the job requirements: profiling, bottleneck analysis, and engineering delivery, not just recognizing class names.
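To make the fourth question concrete, here is a minimal sketch of how TTFT, TPOT, and throughput could be derived from per-token timestamps collected in a small experiment. The `RequestTrace` record and the function names are hypothetical, not part of vLLM or SGLang, and the TPOT definition used here (mean gap between output tokens after the first) is one common convention; benchmark tools may define it slightly differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (illustrative names only)."""
    arrival: float            # time the request was received, in seconds
    token_times: List[float]  # wall-clock time each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first emission time minus arrival time."""
    if not trace.token_times:
        raise ValueError("trace has no output tokens")
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first.

    Assumption: the first token is excluded because its latency is
    dominated by prefill and is already captured by TTFT.
    """
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second over the whole run (all requests)."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.arrival for t in traces)
    end = max(t.token_times[-1] for t in traces if t.token_times)
    return total_tokens / (end - start)


if __name__ == "__main__":
    a = RequestTrace(arrival=0.0, token_times=[0.5, 0.6, 0.7, 0.8])
    b = RequestTrace(arrival=1.0, token_times=[1.5, 2.0])
    print(f"TTFT(a) = {ttft(a):.3f} s")
    print(f"TPOT(a) = {tpot(a):.3f} s/token")
    print(f"throughput = {throughput([a, b]):.2f} tokens/s")
```

Comparing these numbers before and after changing one mechanism (for example, enabling a prefix cache) is what turns a source-reading note into a measurable claim.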