Routing prefill and decode to separate GPU pools eliminates interference between the two phases, enabling independent scaling and latency tuning for each, at the cost of KV cache migration across machines.
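To get a feel for the migration cost, here is a back-of-the-envelope sketch. It assumes a standard attention layout in which every layer stores one key and one value vector per KV head per token. The model shape and the 50 GB/s link speed are hypothetical placeholders, not measurements from any particular system.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit (fp16/bf16) storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def transfer_seconds(num_bytes, link_gbytes_per_s):
    """Idealized transfer time over a link, ignoring latency and protocol overhead."""
    return num_bytes / (link_gbytes_per_s * 1e9)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
size = kv_cache_bytes(num_tokens=4096, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"KV cache for a 4096-token prompt: {size / 2**20:.0f} MiB")
print(f"Ideal transfer time at 50 GB/s: {transfer_seconds(size, 50) * 1e3:.1f} ms")
```

Under these assumptions a single long prompt carries hundreds of megabytes of state, which is why the interconnect between the two pools matters.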
When thousands of requests share the same system prompt, recomputing its KV cache each time is pure waste. Prefix caching stores the prompt's key/value tensors and reuses them across requests, cutting time to first token (TTFT) by up to 97% in common deployments.
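A minimal sketch of the lookup side of prefix caching, assuming the cache is keyed by fixed-size blocks of token IDs. The `PrefixCache` class and its methods are hypothetical names for illustration, and the stored strings stand in for real KV tensors. Production servers typically chain block hashes and evict old entries, which this sketch omits.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy prefix cache: maps each block-aligned token prefix to a KV placeholder."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self._blocks: Dict[Tuple[int, ...], str] = {}

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV blocks."""
        matched = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if tuple(tokens[:end]) not in self._blocks:
                break
            matched = end
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record every full block of this sequence (a trailing partial block is skipped)."""
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            start = end - self.block_size
            self._blocks.setdefault(tuple(tokens[:end]), f"kv[{start}:{end}]")


cache = PrefixCache(block_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]

cache.insert(system_prompt + [9, 10])        # first request pays for full prefill
second = system_prompt + [20, 21, 22, 23, 24]
reused = cache.match(second)                 # second request reuses the shared prefix
print(f"reused {reused} of {len(second)} tokens")
```

Only the tokens after the matched prefix need a prefill pass, which is where the TTFT saving comes from.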
How iteration-level scheduling eliminates GPU idle time by inserting new requests the moment a slot opens, and the math behind mixing prefill and decode in a single forward pass.
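The slot-filling idea can be illustrated with a toy simulation that counts decode iterations only. It assumes every request's output length is known up front and is at least one token, and it ignores prefill cost entirely. The function names are hypothetical and do not correspond to any real scheduler API.

```python
from collections import deque


def static_batching_steps(output_lengths, max_slots):
    """Each batch runs until its longest request finishes before the next batch starts."""
    steps = 0
    for i in range(0, len(output_lengths), max_slots):
        steps += max(output_lengths[i:i + max_slots])
    return steps


def continuous_batching_steps(output_lengths, max_slots):
    """Admit waiting requests into free slots before every iteration."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots at the iteration boundary.
        while waiting and len(running) < max_slots:
            running.append(waiting.popleft())
        # One forward pass produces one token for every active request.
        running = [remaining - 1 for remaining in running]
        # Finished requests release their slots immediately.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 2]
print("static:", static_batching_steps(lengths, max_slots=2))
print("continuous:", continuous_batching_steps(lengths, max_slots=2))
```

With static batching the short request waits for its long batch-mate to finish; with iteration-level scheduling the freed slot is reused on the very next step.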