Posts
LLM Attention Kernels and GPU Primitives
· ☕ 1 min read · ✍️ k4i
A series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory-access optimization.
LLM Quantization and Low-Precision Serving
· ☕ 1 min read · ✍️ k4i
A series index for LLM quantization and low-precision serving: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and quality/speed/memory tradeoffs.
Disaggregated Prefill: Splitting Compute Across Machines
· ☕ 9 min read · ✍️ k4i
Routing prefill and decode to separate GPU pools removes prefill-decode interference and lets each pool scale and tune latency independently, at the cost of migrating the KV cache across machines.