Why KV Cache Works in LLM Inference

sky_io@outlook.com (K4i) — Mon, 20 Apr 2026 12:00:00 +0800

introduction

large language models generate text autoregressively — one token at a time, each new token conditioned on all previous tokens. this sequential nature creates a fundamental opportunity for optimization: most of the computation at each step is redundant.

the KV cache is the technique that exploits this redundancy. by storing the key and value vectors from previous decoding steps, we avoid re-computing them: the attention cost of each step drops from \(O(n^2)\) to \(O(n)\) in the prefix length \(n\), at the price of cache memory that grows linearly with \(n\).
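to make the saving concrete, here is a minimal sketch in plain python, assuming a single attention head in a single layer with random weights. the names (`matvec`, `attend`, `step_without_cache`, `step_with_cache`), the dimension `d = 4`, and the six random "token" vectors are all made up for illustration and are not taken from any real model or library. it compares a naive step, which redoes causal attention over the whole prefix, with a cached step, which projects only the new token, and counts the query-key dot products each one needs.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += (w / total) * vi
    return out


def step_without_cache(xs, Wq, Wk, Wv):
    """Naive step: redo causal attention for every position in the prefix."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    outs = [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(xs))]
    n = len(xs)
    return outs[-1], n * (n + 1) // 2  # query-key dot products: 1 + 2 + ... + n


def step_with_cache(x_new, cache, Wq, Wk, Wv):
    """Cached step: project only the new token, reuse stored keys and values."""
    cache["k"].append(matvec(Wk, x_new))
    cache["v"].append(matvec(Wv, x_new))
    out = attend(matvec(Wq, x_new), cache["k"], cache["v"])
    return out, len(cache["k"])  # query-key dot products: n


def rand_matrix(d):
    """Hypothetical random d x d weight matrix, for illustration only."""
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]


random.seed(0)
d = 4
Wq, Wk, Wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = {"k": [], "v": []}
scores_naive, scores_cached = [], []
max_diff = 0.0
for t in range(1, len(tokens) + 1):
    out_naive, n_naive = step_without_cache(tokens[:t], Wq, Wk, Wv)
    out_cached, n_cached = step_with_cache(tokens[t - 1], cache, Wq, Wk, Wv)
    scores_naive.append(n_naive)
    scores_cached.append(n_cached)
    diff = max(abs(a - b) for a, b in zip(out_naive, out_cached))
    max_diff = max(max_diff, diff)

print("dot products per step, naive :", scores_naive)
print("dot products per step, cached:", scores_cached)
print("largest output difference    :", max_diff)
```

both paths produce the same output for the newest token, but the naive count grows quadratically with the prefix length while the cached count grows linearly. the cost shows up in `cache`, which holds one key and one value vector per token seen so far.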
