vLLM / SGLang Source Reading: From Request to Forward Pass

sky_io@outlook.com (K4i) — Thu, 04 Jun 2026 22:10:00 +0800

This series is for source reading and engineering follow-through. The goal is not to translate files line by line, but to locate core inference-engine mechanisms in real code paths and verify their behavior with benchmarks or small experiments.

Reading Order

Planned posts will follow the request lifecycle:

Request lifecycle: from OpenAI API to one forward pass
Scheduler loop: waiting queue, running queue, token budget, and decode priority
vLLM Block Manager: from logical blocks to physical KV blocks
SGLang Radix Cache: why prefix reuse wants a tree
What a prefix cache hit actually saves
Chunked prefill parameters, scheduling branches, and benchmarks
Why structured output / FSM decoding is a strong SGLang use case
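
As a rough preview of the scheduler-loop post in the list above, here is a toy sketch of how a waiting queue, a running queue, a token budget, and decode priority can fit together in one scheduling step. It is not code from vLLM or SGLang: `Request`, `schedule_step`, and all field names are hypothetical, and the policy (decode first, then chunked prefill with whatever budget remains) is an assumption chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; not a vLLM or SGLang class.
    req_id: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed


def schedule_step(waiting, running, token_budget):
    """Pick the work for one forward pass under a token budget.

    Returns a list of (req_id, num_tokens) pairs.
    """
    batch = []
    # Decode priority (assumed policy): each running request gets one token first.
    for req in running:
        if token_budget == 0:
            break
        batch.append((req.req_id, 1))
        token_budget -= 1
    # Spend the remaining budget on prefill, chunking prompts that do not fit.
    while waiting and token_budget > 0:
        req = waiting[0]
        chunk = min(req.prompt_len - req.prefilled, token_budget)
        req.prefilled += chunk
        token_budget -= chunk
        batch.append((req.req_id, chunk))
        # A fully prefilled request moves from the waiting queue to the running queue.
        if req.prefilled == req.prompt_len:
            running.append(waiting.popleft())
    return batch
```

Calling `schedule_step` repeatedly with the same budget shows a long prompt being split across steps while already-running requests keep receiving one decode token per step. The real engines layer preemption, KV-block accounting, and other policies on top of this, which is what the later posts are meant to trace in the source.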

Standard Format

Each source-reading post should answer four questions:

