Paged Attention: Virtual Memory for the GPU

sky_io@outlook.com (K4i) — Wed, 22 Apr 2026 10:00:00 +0800

the fragmentation problem

the previous post in this series explained why we cache key and value vectors during autoregressive decoding. by the end of that post, the KV cache was saving us enormous amounts of recomputation — but it quietly introduced a new problem: where do you put all that memory?

the naive answer is to allocate a contiguous block of GPU memory for each request, big enough to hold its maximum possible output length. but you don’t know the output length in advance, so you reserve space for the worst case, and any request that finishes early leaves the rest of its reservation unused.
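to get a feel for how much a worst-case reservation can cost, here is a rough back-of-the-envelope sketch. the model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the request lengths are hypothetical numbers picked for illustration, not measurements from any real system, and the helper names are made up for this post.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token needs across the whole model."""
    # one key vector and one value vector, per head, per layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def reserved_vs_used(max_len, actual_len, per_token_bytes):
    """Compare a worst-case contiguous reservation with what a request really used."""
    reserved = max_len * per_token_bytes
    used = actual_len * per_token_bytes
    wasted_fraction = 1 - used / reserved
    return reserved, used, wasted_fraction


# hypothetical model shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes)
per_token = kv_bytes_per_token(32, 32, 128, 2)

# hypothetical request: space reserved for 2048 tokens, only 200 generated
reserved, used, wasted = reserved_vs_used(2048, 200, per_token)

print(f"per token: {per_token / 2**20:.2f} MiB")
print(f"reserved:  {reserved / 2**30:.2f} GiB")
print(f"used:      {used / 2**20:.0f} MiB")
print(f"wasted:    {wasted:.1%}")
```

under these assumptions each token costs half a MiB, so the request holds a full GiB of contiguous memory while using about a tenth of it. the exact numbers depend entirely on the model and the workload, but the shape of the problem does not: the reservation scales with the worst case, the usage with the actual case.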
