2026
- vLLM Request Lifecycle: From OpenAI API to One Forward Pass
- Prefill vs Decode: Why One Model Has Two Very Different Bottlenecks
- LLM Attention Kernels and GPU Primitives
- LLM Quantization and Low-Precision Serving
- LLM Inference Lab Reports: Experiments and Benchmarks for Serving Systems
- vLLM / SGLang Source Reading: From Request to Forward Pass
- LLM Inference Internals: Core Mechanisms for Serving Engines
- A Survey of LLM Quantization: From Linear Quantization to Codebooks
- From Absolute Positional Encoding to RoPE: Why Position Can Be a Rotation
- Estimating Compute and Memory Requirements for LLM Training and Inference
- Agent Skill Management: Turning AI Assistants from Clever to Reliable
- Disaggregated Prefill: Splitting Compute Across Machines
- Prefix Caching: Reusing KV Cache Across Requests
- Chunked Prefill: Slicing the Prefill to Protect Decode Latency
- Continuous Batching: Scheduling at Iteration Granularity
- Paged Attention: Virtual Memory for the GPU
- Online Softmax: Tiling for Arbitrarily Large Rows
- Why KV Cache Works in LLM Inference
- Fused Softmax in Triton
- SSH Port Forwarding: Local and Remote Tunnels Explained