<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Sglang on k4i's blog</title><link>https://k4i.top/tags/sglang/</link><description>Recent content in Sglang on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Fri, 05 Jun 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://k4i.top/tags/sglang/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Inference Lab Reports: Experiments and Benchmarks for Serving Systems</title><link>https://k4i.top/posts/llm-inference-lab-reports/</link><pubDate>Fri, 05 Jun 2026 10:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:26:17 +0800</atom:modified><guid>https://k4i.top/posts/llm-inference-lab-reports/</guid><description>&lt;p&gt;This series is for experiment reports. Unlike mechanism explainers or source-reading notes, each post should include a reproducible environment, commands, metrics, tables or figures, and concrete tuning conclusions.&lt;/p&gt;
&lt;p&gt;For inference-engine interviews, knowing the names PagedAttention, prefix cache, and chunked prefill is only the first layer. The stronger signal is being able to answer: which workload benefits, how much did the metric improve, where did the bottleneck move, and what should we inspect first if production metrics regress?&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/llm-inference-lab-reports/benchmark-profiler-dashboard-icon.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>benchmark</category><category>profiling</category><category>vllm</category><category>sglang</category><category>AI</category><category>LLM Inference Lab Reports</category></item><item><title>vLLM / SGLang Source Reading: From Request to Forward Pass</title><link>https://k4i.top/posts/vllm-sglang-source-reading/</link><pubDate>Thu, 04 Jun 2026 22:10:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:26:17 +0800</atom:modified><guid>https://k4i.top/posts/vllm-sglang-source-reading/</guid><description>&lt;p&gt;This series is for source reading and engineering follow-through. The goal is not to translate files line by line, but to locate core inference-engine mechanisms in real code paths and verify their behavior with benchmarks or small experiments.&lt;/p&gt;
&lt;h2 id="reading-order"&gt;Reading Order&lt;/h2&gt;
&lt;p&gt;Planned posts will follow the request lifecycle:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Request lifecycle: from OpenAI API to one forward pass&lt;/li&gt;
&lt;li&gt;Scheduler loop: waiting queue, running queue, token budget, and decode priority&lt;/li&gt;
&lt;li&gt;vLLM Block Manager: from logical blocks to physical KV blocks&lt;/li&gt;
&lt;li&gt;SGLang Radix Cache: why prefix reuse wants a tree&lt;/li&gt;
&lt;li&gt;What a prefix cache hit actually saves&lt;/li&gt;
&lt;li&gt;Chunked prefill parameters, scheduling branches, and benchmarks&lt;/li&gt;
&lt;li&gt;Why structured output / FSM decoding is a strong SGLang use case&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="format"&gt;Standard Format&lt;/h2&gt;
&lt;p&gt;Each source-reading post should answer four questions:&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/vllm-sglang-source-reading/source-reading-code-path-icon.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>vllm</category><category>sglang</category><category>source-reading</category><category>ai-infra</category><category>AI</category><category>vLLM and SGLang Source Reading</category></item><item><title>LLM Inference Internals: Core Mechanisms for Serving Engines</title><link>https://k4i.top/posts/llm-inference-internals/</link><pubDate>Thu, 04 Jun 2026 22:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:26:17 +0800</atom:modified><guid>https://k4i.top/posts/llm-inference-internals/</guid><description>&lt;p&gt;This series answers why inference engines are shaped the way they are. The focus is not framework APIs, but the core mechanisms behind vLLM / SGLang-style serving engines: prefill/decode, KV cache, PagedAttention, continuous batching, prefix caching, chunked prefill, and disaggregated prefill.&lt;/p&gt;
&lt;h2 id="existing-posts"&gt;Existing Posts&lt;/h2&gt;
&lt;p&gt;Read the existing posts in this order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/llm-flops-memory-estimation/"&gt;Estimating Compute and Memory Requirements for LLM Training and Inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/positional-encoding-to-rope/"&gt;From Absolute Positional Encoding to RoPE: Why Position Can Be a Rotation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/kv-cache/"&gt;Why KV Cache Works in LLM Inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/paged-attention/"&gt;Paged Attention: Virtual Memory for the GPU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/continuous-batching/"&gt;Continuous Batching: Scheduling at Iteration Granularity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/chunked-prefill/"&gt;Chunked Prefill: Slicing the Prefill to Protect Decode Latency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/prefix-caching/"&gt;Prefix Caching: Reusing KV Cache Across Requests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/disaggregated-prefill/"&gt;Disaggregated Prefill: Splitting Compute Across Machines&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="planned-posts"&gt;Planned Posts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Prefill vs decode: why one model has two very different bottlenecks&lt;/li&gt;
&lt;li&gt;The scheduler&amp;rsquo;s real objective: bigger batches are not always better&lt;/li&gt;
&lt;li&gt;KV cache eviction: LRU, prefix trees, reference counts, and cache pollution&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="questions"&gt;Questions Each Post Should Answer&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;What production problem does this mechanism solve?&lt;/li&gt;
&lt;li&gt;Does it mainly affect TTFT, TPOT, throughput, or memory capacity?&lt;/li&gt;
&lt;li&gt;How does it change the KV cache, the scheduler, the attention kernels, or the GPU workload?&lt;/li&gt;
&lt;li&gt;Which vLLM / SGLang design or parameter does it map to?&lt;/li&gt;
&lt;/ul&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/llm-inference-internals/engine-kv-cache-icon.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>kv-cache</category><category>vllm</category><category>sglang</category><category>systems</category><category>AI</category><category>LLM Inference Internals</category></item></channel></rss>