<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Distributed on k4i's blog</title><link>https://k4i.top/tags/distributed/</link><description>Recent content in Distributed on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Wed, 22 Apr 2026 12:00:00 +0800</lastBuildDate><atom:link href="https://k4i.top/tags/distributed/index.xml" rel="self" type="application/rss+xml"/><item><title>Disaggregated Prefill: Splitting Compute Across Machines</title><link>https://k4i.top/posts/disaggregated-prefill/</link><pubDate>Wed, 22 Apr 2026 12:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Sun, 26 Apr 2026 16:08:06 +0800</atom:modified><guid>https://k4i.top/posts/disaggregated-prefill/</guid><description>&lt;h2 id="ceiling"&gt;why same-GPU coexistence has a ceiling&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://k4i.top/posts/chunked-prefill/"&gt;chunked prefill&lt;/a&gt; makes prefill-decode coexistence more tolerable by slicing each long prefill into small chunks. but even with perfect chunking, prefill and decode are still &lt;em&gt;sharing the same GPU&lt;/em&gt;. they compete for three resources (a rough roofline sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HBM bandwidth&lt;/strong&gt; — both need to read from and write to GPU memory each iteration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;compute units&lt;/strong&gt; — prefill&amp;rsquo;s GEMM and decode&amp;rsquo;s GEMV contend for the same tensor cores&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV cache space&lt;/strong&gt; — prefill temporarily occupies blocks that could serve decode requests&lt;/li&gt;
&lt;/ul&gt;
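&lt;p&gt;to make the compute contention concrete, here is a minimal roofline-style sketch in python. the numbers are illustrative assumptions, not measurements from this post: a 4096-wide model, a 2048-token prompt chunk, and A100-like specs of 312 TFLOP/s bf16 against 2.0 TB/s of HBM bandwidth. counting only weight traffic, the arithmetic intensity of a dense matmul works out to roughly the number of rows pushed through the weights, which puts prefill far above the GPU&amp;rsquo;s ridge point and single-token decode far below it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# rough roofline sketch: why prefill is compute-bound and decode is memory-bound.
# all hardware and model numbers below are illustrative assumptions.

BYTES_PER_PARAM = 2  # bf16 weights

def intensity(rows, hidden):
    """FLOPs per byte for a [rows, hidden] x [hidden, hidden] matmul,
    counting only weight reads (the dominant traffic when rows is small).
    note the simplification: the ratio reduces to exactly `rows`."""
    flops = 2 * rows * hidden * hidden                # one multiply-accumulate per output cell
    weight_bytes = hidden * hidden * BYTES_PER_PARAM  # weights streamed from HBM
    return flops / weight_bytes

HIDDEN = 4096            # hypothetical model width
RIDGE = 312e12 / 2.0e12  # A100-like: 312 TFLOP/s over 2.0 TB/s = 156 FLOPs/byte

prefill_ai = intensity(rows=2048, hidden=HIDDEN)  # a 2048-token prompt chunk (GEMM)
decode_ai = intensity(rows=1, hidden=HIDDEN)      # one token per request (GEMV)

print(f"ridge point:       {RIDGE:.0f} FLOPs/byte")
print(f"prefill intensity: {prefill_ai:.0f} FLOPs/byte (compute-bound)")
print(f"decode intensity:  {decode_ai:.0f} FLOPs/byte (memory-bound)")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;read this way, the contention is asymmetric: prefill saturates the tensor cores but still streams the full weight matrices through HBM, and because decode sits deep in the memory-bound regime, every byte of that bandwidth comes straight out of decode&amp;rsquo;s token rate. chunking caps how long each collision lasts; it does not remove the collision.&lt;/p&gt;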
&lt;p&gt;at moderate scale, this coexistence is acceptable. at large scale — thousands of requests/second, strict SLOs, multi-GPU clusters — the competition becomes a bottleneck that chunking alone cannot resolve.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top/images/posts/disaggregated-prefill/cover.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>inference</category><category>systems</category><category>distributed</category><category>scheduling</category><category>AI</category><category>LLM Inference Internals</category></item></channel></rss>