Why KV Cache Works in LLM Inference
· ☕ 8 min read · ✍️ k4i
Why the key-value cache avoids redundant computation in autoregressive decoding, and the memory/compute tradeoffs it introduces.
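Before diving in, the core idea can be sketched in a few lines of NumPy. This is a toy single-head example with made-up weight matrices and dimensions (not any particular framework's implementation): with a cache, each decode step projects only the newest token's key and value and appends them; without one, every step re-projects the entire prefix. The outputs are identical, so the cache changes only the cost.

```python
import numpy as np

# Toy setup: single attention head, hypothetical random weights.
rng = np.random.default_rng(0)
d = 8                                        # head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-query softmax attention over all keys/values so far."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

tokens = rng.standard_normal((5, d))         # 5 toy token embeddings

# WITH a KV cache: each step projects only the newest token,
# appending its key/value to the cache.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached_out = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])   # append, never recompute
    V_cache = np.vstack([V_cache, x @ Wv])
    cached_out.append(attend(x @ Wq, K_cache, V_cache))

# WITHOUT a cache: step t re-projects all t+1 prefix tokens.
uncached_out = []
for t in range(len(tokens)):
    prefix = tokens[: t + 1]
    K, V = prefix @ Wk, prefix @ Wv          # redundant recomputation
    uncached_out.append(attend(tokens[t] @ Wq, K, V))

# Both loops produce identical attention outputs.
assert np.allclose(cached_out, uncached_out)
```

The assertion makes the key point concrete: caching is an exact optimization, not an approximation. What it buys (and costs) is the subject of the rest of this post.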