vLLM Scheduler: How Request Queues Become SchedulerOutput
· ☕ 6 min read · ✍️ k4i
A source-reading walkthrough of the vLLM V1 Scheduler: how it weighs the running and waiting queues, token budget, KV cache blocks, prefix-cache hits, and preemption to produce the SchedulerOutput for ModelRunner.
Loss Functions: What a Model Is Really Optimizing
· ☕ 9 min read · ✍️ k4i
A practical guide to loss functions: when to use MSE, MAE, Huber, binary cross entropy, cross entropy, KL divergence, hinge loss, contrastive loss, and triplet loss.
Streaming Design: Why The Application Layer Still Matters
· ☕ 11 min read · ✍️ k4i
A practical model for upload-side and download-side streaming: transport moves bytes, while the application layer defines boundaries, progress, recovery, idempotency, backpressure, and business meaning.
vLLM Request Lifecycle: From OpenAI API to One Forward Pass
· ☕ 5 min read · ✍️ k4i
A source-reading walkthrough of the vLLM V1 request path: OpenAI-compatible HTTP entrypoint, serving render, AsyncLLM, EngineCore client, Tensor IPC, scheduler, and one GPUModelRunner forward pass.
LLM Attention Kernels and GPU Primitives
· ☕ 1 min read · ✍️ k4i
A series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory-access optimization.
LLM Quantization and Low-Precision Serving
· ☕ 1 min read · ✍️ k4i
A series index for LLM quantization and low-precision serving: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and quality/speed/memory tradeoffs.
Disaggregated Prefill: Splitting Compute Across Machines
· ☕ 9 min read · ✍️ k4i
Routing prefill and decode to separate GPU pools removes prefill-decode interference and lets each pool scale independently and be tuned for its own latency target, at the cost of migrating the KV cache across machines.