A Survey of LLM Quantization: From Linear Quantization to Codebooks

sky_io@outlook.com (K4i) — Mon, 01 Jun 2026 21:00:00 +0800

Introduction

A 7B model stored in FP16 needs this much memory just for parameters:

$$7 \times 10^9 \times 2\ \text{bytes} \approx 14\ \text{GB}$$

That does not include the KV cache, activations, temporary workspaces, CUDA graphs, batching overhead, or runtime fragmentation. For a 70B model, FP16 weights alone are about 140 GB, which is already beyond a single commodity GPU.

Quantization has a simple, direct goal: represent model values with fewer bits. FP16 uses 16 bits per weight, INT8 uses 8, and INT4 uses 4. In the ideal case, ignoring the scales and zero-points that real formats store alongside the weights, INT8 and INT4 cut weight memory to roughly $1/2$ and $1/4$ of FP16, respectively.
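The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the ideal case only: the helper name `weight_memory_gb` is made up for illustration, 1 GB is taken as $10^9$ bytes, and per-group scales, zero-points, and any runtime overhead are deliberately left out.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Ideal weight storage in GB (1 GB = 1e9 bytes).

    Hypothetical helper: counts only the raw weight bits and ignores
    quantization metadata such as scales and zero-points.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    models = [("7B", 7e9), ("70B", 70e9)]
    formats = [("FP16", 16), ("INT8", 8), ("INT4", 4)]
    for model_name, num_params in models:
        for fmt_name, bits in formats:
            size = weight_memory_gb(num_params, bits)
            print(f"{model_name} {fmt_name}: {size:.1f} GB")
```

For the 7B model this gives 14 GB in FP16 and 3.5 GB in INT4. Real quantized checkpoints come out somewhat larger, because the metadata that this sketch ignores has to be stored too.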
