Numeric Types in Neural Networks: FP32, BF16, FP8, INT8, and INT4
Jun 23, 2026 · 4 min read · k4i
A concise map of floating point, integer quantization, storage dtype, compute dtype, and accumulation dtype in neural networks.

A Survey of LLM Quantization: From Linear Quantization to Codebooks
Jun 1, 2026 · 34 min read · k4i
A practical survey of LLM quantization, covering linear quantization, codebook quantization, LLM.int8(), SmoothQuant, GPTQ, AWQ, NF4, AQLM, KV cache quantization, and FP8.