<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Vllm on k4i's blog</title><link>https://k4i.top/tags/vllm/</link><description>Recent content in Vllm on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Wed, 22 Apr 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://k4i.top/tags/vllm/index.xml" rel="self" type="application/rss+xml"/><item><title>Paged Attention: Virtual Memory for the GPU</title><link>https://k4i.top/posts/paged-attention/</link><pubDate>Wed, 22 Apr 2026 10:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sun, 26 Apr 2026 16:08:06 +0800</atom:modified><guid>https://k4i.top/posts/paged-attention/</guid><description>&lt;h2 id="fragmentation"&gt;the fragmentation problem&lt;/h2&gt;
&lt;p&gt;the &lt;a href="https://k4i.top/posts/kv-cache/"&gt;previous post&lt;/a&gt; in this series explained &lt;em&gt;why&lt;/em&gt; we cache key and value vectors during autoregressive decoding. by the end of that post, the KV cache was saving us enormous amounts of recomputation — but it quietly introduced a new problem: &lt;strong&gt;where do you put all that memory?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;the naive answer is: allocate a contiguous block of GPU memory for each request, big enough to hold its maximum possible output length. in practice, you don&amp;rsquo;t know the output length in advance, so you can only guess, and the safe guess is to reserve space for the worst case.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/paged-attention/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>vllm</category><category>memory</category><category>AI</category><category>LLM Inference Internals</category></item></channel></rss>