Prefill vs Decode: Why One Model Has Two Very Different Bottlenecks

sky_io@outlook.com (K4i) — Fri, 05 Jun 2026 22:30:00 +0800

LLM inference looks like one operation: send a prompt, get tokens back. under the hood it is two workloads sharing the same model weights.

prefill processes the whole input prompt in one parallel pass and builds the initial KV cache. decode generates new tokens one step at a time, reading that cache and appending to it at every step. the weights are the same, but the hardware bottleneck is not. prefill behaves like a large batched matrix multiplication and is typically compute-bound; decode behaves like a stream of small queries repeatedly reading a growing memory table, so it is typically limited by memory bandwidth.
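to make the contrast concrete, here is a rough back-of-the-envelope sketch, not a measurement. it estimates the arithmetic intensity (FLOPs per byte moved) of a single weight matrix multiply when it sees many tokens at once (prefill) versus one token (decode). the function name, the 4096 x 4096 layer shape, the 2048-token prompt, and the 2-byte element size are all illustrative assumptions, and the model ignores attention, caches, and kernel details.

```python
def arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_elem=2):
    """Rough FLOPs-per-byte estimate for one (n_tokens x d_in) @ (d_in x d_out) matmul.

    Assumes the weights are read once, the input activations are read once,
    and the output activations are written once. All sizes are hypothetical.
    """
    # one multiply and one add per weight per token
    flops = 2 * n_tokens * d_in * d_out
    # weights + input activations + output activations
    bytes_moved = bytes_per_elem * (d_in * d_out + n_tokens * (d_in + d_out))
    return flops / bytes_moved


# prefill: the whole prompt (assumed 2048 tokens) goes through the layer at once
prefill = arithmetic_intensity(n_tokens=2048, d_in=4096, d_out=4096)

# decode: a single new token goes through the same layer
decode = arithmetic_intensity(n_tokens=1, d_in=4096, d_out=4096)

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.4f} FLOPs/byte")
```

under these assumptions prefill does on the order of a thousand FLOPs for every byte it moves, while decode does just under one: the same weights are loaded either way, but decode amortizes that load over a single token. that gap is why the first phase tends to saturate the compute units and the second tends to wait on memory.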
