<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>LLM Quantization and Low-Precision Serving on k4i's blog</title><link>https://k4i.top/zh/series/llm-quantization-and-low-precision-serving/</link><description>Recent content in LLM Quantization and Low-Precision Serving on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>zh</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt;.</copyright><lastBuildDate>Fri, 05 Jun 2026 11:00:00 +0800</lastBuildDate><atom:link href="https://k4i.top/zh/series/llm-quantization-and-low-precision-serving/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Quantization and Low-Precision Serving: A Roadmap for Quantization and Low-Precision Inference</title><link>https://k4i.top/zh/posts/llm-quantization-low-precision-serving/</link><pubDate>Fri, 05 Jun 2026 11:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:26:17 +0800</atom:modified><guid>https://k4i.top/zh/posts/llm-quantization-low-precision-serving/</guid><description>&lt;p&gt;This series is dedicated to quantization and low-precision serving. It is more than a subsection of “inference optimization”, because quantization simultaneously involves numeric representation, error control, calibration data, kernel support, the KV cache, GPU memory bandwidth, and quality regression.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/llm-quantization-low-precision-serving/quantization-4bit-buckets-icon.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>quantization</category><category>low-precision</category><category>int4</category><category>fp8</category><category>serving</category><category>AI</category><category>LLM Quantization and Low-Precision Serving</category></item><item><title>A Survey of LLM Quantization: From Linear Quantization to Codebook Quantization</title><link>https://k4i.top/zh/posts/llm-quantization/</link><pubDate>Mon, 01 Jun 2026 21:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:20:16 +0800</atom:modified><guid>https://k4i.top/zh/posts/llm-quantization/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;If a 7B model stores its weights in FP16, the parameters alone require:&lt;/p&gt;
&lt;p&gt;$$7 \times 10^9 \times 2\ \text{bytes} \approx 14\ \text{GB}$$&lt;/p&gt;
&lt;p&gt;This does not yet count the KV cache, activations, temporary workspace, CUDA graphs, batching, or runtime fragmentation. At 70B, the FP16 weights come to roughly 140 GB, which makes single-GPU deployment largely impractical.&lt;/p&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/llm-quantization/quantization-buckets-icon.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>quantization</category><category>inference</category><category>GPU memory</category><category>int8</category><category>int4</category><category>fp8</category><category>AI</category><category>LLM Quantization and Low-Precision Serving</category></item></channel></rss>