<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Low-Precision on k4i's blog</title><link>https://k4i.top/tags/low-precision/</link><description>Recent content in Low-Precision on k4i's blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>sky_io@outlook.com (K4i)</managingEditor><webMaster>sky_io@outlook.com (K4i)</webMaster><copyright>All content is subject to the license of &lt;a rel="license noopener" href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank"&gt;CC BY-NC-SA 4.0&lt;/a&gt; .</copyright><lastBuildDate>Fri, 05 Jun 2026 11:00:00 +0800</lastBuildDate><atom:link href="https://k4i.top/tags/low-precision/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Quantization and Low-Precision Serving</title><link>https://k4i.top/posts/llm-quantization-low-precision-serving/</link><pubDate>Fri, 05 Jun 2026 11:00:00 +0800</pubDate><author>sky_io@outlook.com (K4i)</author><atom:modified>Fri, 05 Jun 2026 00:26:17 +0800</atom:modified><guid>https://k4i.top/posts/llm-quantization-low-precision-serving/</guid><description>&lt;p&gt;This series covers LLM quantization and low-precision serving. It gets its own track because quantization touches representation, error control, calibration data, kernel support, the KV cache, memory bandwidth, and quality regression.&lt;/p&gt;
&lt;h2 id="existing-posts"&gt;Existing Posts&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://k4i.top/posts/llm-quantization/"&gt;A Survey of LLM Quantization: From Linear Quantization to Codebooks&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="planned-posts"&gt;Planned Posts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;KV cache quantization: beyond weight memory, the cache is often the real footprint&lt;/li&gt;
&lt;li&gt;FP8 serving: E4M3 / E5M2, activation scales, and Tensor Core paths&lt;/li&gt;
&lt;li&gt;INT4 weight-only serving: why saving memory does not always mean going faster&lt;/li&gt;
&lt;li&gt;GPTQ / AWQ / SmoothQuant engineering boundaries&lt;/li&gt;
&lt;li&gt;NF4 / AQLM: why lower bit widths need codebooks&lt;/li&gt;
&lt;li&gt;Quantization benchmarks: measuring the quality, speed, and memory triangle&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="questions"&gt;Questions Each Post Should Answer&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Are we quantizing weights, activations, KV cache, or a communication/storage format?&lt;/li&gt;
&lt;li&gt;Does the benefit come from memory capacity, HBM bandwidth, Tensor Core throughput, or disk size?&lt;/li&gt;
&lt;li&gt;Does the error mainly come from outliers, scale granularity, rounding, or clipping?&lt;/li&gt;
&lt;li&gt;How do vLLM / SGLang load, observe, and roll back this choice?&lt;/li&gt;
&lt;/ul&gt;</description><dc:creator>K4i</dc:creator><media:content url="https://k4i.top//images/posts/llm-quantization-low-precision-serving/quantization-4bit-buckets-icon.svg" medium="image"><media:title type="html">featured image</media:title></media:content><category>llm</category><category>quantization</category><category>low-precision</category><category>int4</category><category>fp8</category><category>serving</category><category>AI</category><category>LLM Quantization and Low-Precision Serving</category></item></channel></rss>