LLM Quantization and Low-Precision Serving

Explains floating point, integers, quantization, storage dtype, compute dtype, and accumulation dtype in models using one diagram and a few rules.
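As a rough illustration of how the three dtypes differ, here is a minimal pure-Python sketch of one common combination, an integer W8A8-style dot product. Weights and activations are stored as int8 codes, the int8 × int8 products are the compute, the sum is accumulated in a wide integer (int32 on typical hardware), and the result is rescaled back to float. Symmetric per-tensor absmax scaling is an assumption chosen for simplicity. Weight-only schemes, which dequantize to fp16/bf16 and compute there, are not modeled.

```python
from array import array


def quantize_symmetric_int8(values):
    """Symmetric per-tensor absmax quantization; the codes are the storage dtype (int8)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    # Clamp to [-127, 127] so the code range stays symmetric around zero.
    codes = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return codes, scale


def int8_dot(qa, qb):
    """Multiply int8 codes and sum them in a wide accumulator.

    Hardware typically accumulates in int32. Python ints are unbounded,
    so we only check the int32 range to mirror that constraint.
    """
    acc = 0
    for a, b in zip(qa, qb):
        acc += a * b  # a single product can reach 127 * 127 = 16129, far outside int8
    assert -2**31 <= acc < 2**31, "would overflow an int32 accumulator"
    return acc


def quantized_dot(x, w):
    qx, sx = quantize_symmetric_int8(x)
    qw, sw = quantize_symmetric_int8(w)
    acc = int8_dot(qx, qw)  # accumulation dtype: int32
    return acc * sx * sw    # rescale the integer accumulator back to a float result


x = [0.5, -1.0, 0.25, 2.0]
w = [1.5, 0.5, -0.75, 1.0]
print(quantized_dot(x, w))  # approximates the exact float dot product, 2.0625
```

The wide accumulator is the key design point. Accumulating int8 products in int8 would overflow after a single term, which is why storage precision and accumulation precision are chosen separately.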

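Index of the LLM quantization and low-precision inference series: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and the trade-offs between quality, speed, and memory.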

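Starting from linear, non-uniform, and codebook quantization, this article works through the mathematics behind LLM.int8(), SmoothQuant, GPTQ, AWQ, NF4, AQLM, KV cache quantization, and FP8, when each is practical, and the strengths and weaknesses of each.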
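To make the distinction between the starting families concrete, here is a minimal pure-Python sketch. It contrasts linear (uniform) quantization, which stores an integer code plus a scale and zero-point, with scalar codebook quantization, which stores the index of the nearest entry in a lookup table. Non-uniform quantization can be seen as a fixed scalar codebook whose levels are not evenly spaced. The function names and the codebook values below are hypothetical illustrations. Methods such as AQLM use multi-codebook vector (additive) quantization, which this sketch does not model.

```python
def linear_quantize(xs, bits=4):
    """Asymmetric uniform quantization: x ~ (q - zero) * scale."""
    qmax = 2**bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)  # integer zero-point so that 0.0 maps near a code
    q = [min(qmax, max(0, round(x / scale) + zero)) for x in xs]
    return q, scale, zero


def linear_dequantize(q, scale, zero):
    return [(v - zero) * scale for v in q]


def codebook_quantize(xs, codebook):
    """Codebook (non-uniform) quantization: store the index of the nearest codeword."""
    return [min(range(len(codebook)), key=lambda i: abs(x - codebook[i])) for x in xs]


def codebook_dequantize(indices, codebook):
    return [codebook[i] for i in indices]


# Hypothetical 8-entry (3-bit) codebook with levels packed more densely near zero,
# where most LLM weights tend to lie.
CODEBOOK = [-1.0, -0.4, -0.15, 0.0, 0.05, 0.15, 0.4, 1.0]

xs = [-1.0, -0.1, 0.0, 0.05, 0.2, 1.0]
q, scale, zero = linear_quantize(xs, bits=4)
print(linear_dequantize(q, scale, zero))
print(codebook_dequantize(codebook_quantize(xs, CODEBOOK), CODEBOOK))
```

The linear path needs only a scale and zero-point per group and maps directly onto integer arithmetic. The codebook path needs a table lookup at dequantization time, but it lets the levels follow the actual weight distribution instead of a fixed grid.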