Paged Attention: Virtual Memory for the GPU
☕ 10 min read · ✍️ k4i
How vLLM borrows the OS paging idea to nearly eliminate KV cache fragmentation, pushing effective KV cache memory utilization from roughly 30% to about 96%.