SGLang on k4i's blog

LLM Inference Lab Reports: Experiments and Benchmarks for Serving Systems

sky_io@outlook.com (K4i) — Fri, 05 Jun 2026 10:00:00 +0800

This series is for experiment reports. Unlike mechanism explainers or source-reading notes, each post should include a reproducible environment, commands, metrics, tables or figures, and concrete tuning conclusions.

For inference-engine interviews, knowing the names PagedAttention, prefix cache, and chunked prefill is only the first layer. The stronger signal is being able to answer: which workload benefits, how much did the metric improve, where did the bottleneck move, and what should we inspect first if production metrics regress?
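The "how much did the metric improve" and "where did the bottleneck move" questions can be made concrete with a small comparison between two benchmark runs. The following is a minimal sketch, not part of any engine's tooling: the metric names (`ttft_ms`, `tpot_ms`, `tokens_per_s`) and the numbers are made-up placeholders for illustration, not measurements.

```python
from typing import Dict

# Metrics where a smaller value is better (hypothetical naming).
LOWER_IS_BETTER = {"ttft_ms", "tpot_ms"}


def relative_change(baseline: Dict[str, float],
                    candidate: Dict[str, float]) -> Dict[str, float]:
    """Percent change per metric, signed so that positive means improvement."""
    out = {}
    for name, base in baseline.items():
        delta = (candidate[name] - base) / base * 100.0
        out[name] = -delta if name in LOWER_IS_BETTER else delta
    return out


# Placeholder numbers only; replace with values from a real run.
baseline = {"ttft_ms": 400.0, "tpot_ms": 40.0, "tokens_per_s": 1000.0}
candidate = {"ttft_ms": 300.0, "tpot_ms": 50.0, "tokens_per_s": 1250.0}
report = relative_change(baseline, candidate)
for name, pct in report.items():
    print(f"{name}: {pct:+.1f}%")
```

In this made-up example TTFT and throughput improve while TPOT regresses, which is the kind of shift a report should call out explicitly rather than hide behind a single headline number.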

vLLM / SGLang Source Reading: From Request to Forward Pass

sky_io@outlook.com (K4i) — Thu, 04 Jun 2026 22:10:00 +0800

This series is for source reading and engineering follow-through. The goal is not to translate files line by line, but to locate core inference-engine mechanisms in real code paths and verify their behavior with benchmarks or small experiments.

Reading Order

Planned posts will follow the request lifecycle:

Request lifecycle: from OpenAI API to one forward pass
Scheduler loop: waiting queue, running queue, token budget, and decode priority
vLLM Block Manager: from logical blocks to physical KV blocks
SGLang Radix Cache: why prefix reuse wants a tree
What a prefix cache hit actually saves
Chunked prefill parameters, scheduling branches, and benchmarks
Why structured output / FSM decoding is a strong SGLang use case
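As a preview of the prefix-reuse topics in this list, here is a toy sketch of why a tree is a natural fit: requests that share a leading run of tokens share a path from the root, so the longest cached prefix is found in one walk. This is an illustrative per-token trie under simplifying assumptions, not SGLang's actual Radix Cache; `ToyPrefixCache` and its methods are hypothetical names, and a real engine would also attach KV block handles, reference counts, and eviction state to the nodes.

```python
from typing import Dict, List


class PrefixNode:
    def __init__(self) -> None:
        # One child per next token id.
        self.children: Dict[int, "PrefixNode"] = {}


class ToyPrefixCache:
    """Token-level trie; stores only structure, no KV data."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: List[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def match_len(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4]              # shared prefix (placeholder token ids)
cache.insert(system_prompt + [10, 11])    # first request populates the cache

new_request = system_prompt + [20, 21, 22]
hit = cache.match_len(new_request)        # tokens whose KV could be reused
to_prefill = len(new_request) - hit       # tokens that still need prefill
print(hit, to_prefill)
```

Under these assumptions, a hit saves prefill compute for the matched tokens only; the unmatched suffix still goes through prefill.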

Standard Format

Each source-reading post should answer four questions:

LLM Inference Internals: Core Mechanisms for Serving Engines

sky_io@outlook.com (K4i) — Thu, 04 Jun 2026 22:00:00 +0800

This series answers why inference engines are shaped the way they are. The focus is not framework APIs, but the core mechanisms behind vLLM / SGLang-style serving engines: prefill/decode, KV cache, PagedAttention, continuous batching, prefix caching, chunked prefill, and disaggregated prefill.

Existing Posts

Read the existing posts in this order:

Planned Posts

Prefill vs decode: why one model has two very different bottlenecks
The scheduler’s real objective: bigger batches are not always better
KV cache eviction: LRU, prefix trees, reference counts, and cache pollution

Questions Each Post Should Answer

What production problem does this mechanism solve?
Does it mainly affect TTFT, TPOT, throughput, or memory capacity?
How does it change KV cache, scheduler, attention kernels, or GPU workload?
Which vLLM / SGLang design or parameter does it map to?
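To make the TTFT / TPOT / throughput question answerable, it helps to pin down how each is computed from per-token timestamps. The sketch below uses common definitions (TTFT as time to the first output token, TPOT as the mean gap between subsequent output tokens); `RequestTrace` is a hypothetical record type, and the timestamps are placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request trace; all times in seconds on one clock."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first output token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Mean time per output token, over the tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second over the wall-clock span of the run."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total = sum(len(t.token_times) for t in traces)
    return total / (end - start)


trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft(trace), tpot(trace), throughput([trace]))
```

Keeping the first token out of the TPOT average separates prefill cost (which dominates TTFT) from decode cost, so a change that helps one and hurts the other shows up in both numbers.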