Source-Reading

vLLM Scheduler: How Request Queues Become SchedulerOutput

📅 Jun 23, 2026 · ☕ 6 min read · ✍️ k4i

A source-reading walkthrough of the vLLM V1 Scheduler: how it weighs the running and waiting queues, the token budget, KV cache blocks, prefix-cache hits, and preemption to produce the SchedulerOutput that ModelRunner consumes.

vLLM ModelRunner: How SchedulerOutput Becomes a GPU Forward

📅 Jun 23, 2026 · ☕ 6 min read · ✍️ k4i

A source-reading walkthrough of the vLLM V1 GPUModelRunner: how SchedulerOutput becomes input batches, attention metadata, KV slot mappings, a model forward pass, logits, and sampled tokens.

LLM Inference Sampling: What Temperature, Top-p, and Top-k Actually Control

📅 Jun 18, 2026 · ☕ 7 min read · ✍️ k4i

A small 5-token example for understanding temperature, top-p, and top-k during LLM inference, with source-reading notes from the vLLM V1 sampler.

vLLM Request Lifecycle: From OpenAI API to One Forward Pass

📅 Jun 7, 2026 · ☕ 5 min read · ✍️ k4i

A source-reading walkthrough of the vLLM V1 request path: OpenAI-compatible HTTP entrypoint, serving render, AsyncLLM, EngineCore client, Tensor IPC, scheduler, and one GPUModelRunner forward pass.

vLLM / SGLang Source Reading: From Request to Forward Pass

📅 Jun 4, 2026 · ☕ 1 min read · ✍️ k4i

A vLLM / SGLang source-reading series index: request lifecycle, scheduler, KV cache allocation, block manager, radix cache, and benchmarks.