Quantization

Numeric Types in Neural Networks: FP32, BF16, FP8, INT8, and INT4

📅 Jun 23, 2026 · ☕ 4 min read · ✍️ k4i

A concise map of floating-point formats, integer quantization, and the distinction between storage dtype, compute dtype, and accumulation dtype in neural networks.

LLM Quantization and Low-Precision Serving

📅 Jun 5, 2026 · ☕ 1 min read · ✍️ k4i

A series index for LLM quantization and low-precision serving: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and quality/speed/memory tradeoffs.

A Survey of LLM Quantization: From Linear Quantization to Codebooks

📅 Jun 1, 2026 · ☕ 34 min read · ✍️ k4i

A practical survey of LLM quantization, covering linear quantization, codebook quantization, LLM.int8(), SmoothQuant, GPTQ, AWQ, NF4, AQLM, KV cache quantization, and FP8.