k4i's blog

From Absolute Positional Encoding to RoPE: Why Position Can Be a Rotation

📅 May 28, 2026 · ☕ 10 min read · ✍️ k4i

A step-by-step explanation of positional encoding in Transformers, from absolute embeddings to sinusoidal encodings, Euler's formula, and rotary position embeddings.

Estimating Compute and Memory Requirements for LLM Training and Inference

📅 May 27, 2026 · ☕ 17 min read · ✍️ k4i

A back-of-the-envelope framework for estimating LLM training FLOPs, inference FLOPs, weight memory, KV cache, and training memory.

Agent Skill Management: Turning AI Assistants from Clever to Reliable

📅 May 23, 2026 · ☕ 14 min read · ✍️ k4i

A practical note on managing agent skills: how to install, create, remove, disable, update, and evolve skills, plus which meta-skills are worth installing first.

Disaggregated Prefill: Splitting Compute Across Machines

📅 Apr 22, 2026 · ☕ 9 min read · ✍️ k4i

Routing prefill and decode to separate GPU pools eliminates interference entirely, enabling independent scaling and optimal latency — at the cost of KV cache migration across machines.

Prefix Caching: Reusing KV Cache Across Requests

📅 Apr 22, 2026 · ☕ 8 min read · ✍️ k4i

When thousands of requests share the same system prompt, recomputing its KV cache each time is pure waste. Prefix caching stores and reuses those vectors, cutting TTFT by up to 97% in common deployments.

Chunked Prefill: Slicing the Prefill to Protect Decode Latency

📅 Apr 22, 2026 · ☕ 8 min read · ✍️ k4i

Splitting a long prefill across multiple iterations keeps decode requests from stalling, with no extra FLOPs and negligible IO overhead.

Continuous Batching: Scheduling at Iteration Granularity

📅 Apr 22, 2026 · ☕ 9 min read · ✍️ k4i

How iteration-level scheduling eliminates GPU idle time, and how prefill and decode rows can share one packed forward pass.

Paged Attention: Virtual Memory for the GPU

📅 Apr 22, 2026 · ☕ 10 min read · ✍️ k4i

How vLLM borrows the OS paging idea to eliminate KV cache memory fragmentation, pushing GPU utilization from ~30% to ~96%.

Online Softmax: Tiling for Arbitrarily Large Rows

📅 Apr 21, 2026 · ☕ 6 min read · ✍️ k4i

How online softmax extends the fused kernel to handle rows that exceed SRAM capacity, using a numerically stable 2-pass tiling algorithm.

Why KV Cache Works in LLM Inference

📅 Apr 20, 2026 · ☕ 9 min read · ✍️ k4i

Why the key-value cache avoids redundant computation in autoregressive decoding, and the memory/compute tradeoffs it introduces.

Fused Softmax in Triton

📅 Apr 20, 2026 · ☕ 7 min read · ✍️ k4i

How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.

SSH Port Forwarding: Local and Remote Tunnels Explained

📅 Apr 19, 2026 · ☕ 4 min read · ✍️ k4i

A practical guide to SSH local and remote port forwarding, with examples, comparison, and persistent configuration via ~/.ssh/config.

Mitmproxy + Tampermonkey = better {llm, …} viewer

📅 Mar 22, 2026 · ☕ 9 min read · ✍️ k4i

Use mitmproxy to capture LLM API traffic and Tampermonkey to turn mitmweb's raw JSON into a readable, chat-like viewer.

Batch vs Stochastic Gradient Descent

📅 Feb 16, 2026 · ☕ 4 min read · ✍️ k4i

Understand batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Forward & Backward Propagation

📅 Feb 16, 2026 · ☕ 5 min read · ✍️ k4i

Understand how backward propagation works in gradient descent.

Key Management With GnuPG

📅 Jun 1, 2024 · ☕ 11 min read · ✍️ k4i

Learn how to manage your keys with GPG, and use them with SSH, Git, and pass.

DSU on Tree (Sack)

📅 Feb 16, 2024 · ☕ 9 min read · ✍️ k4i

DSU on tree answers subtree queries by keeping the largest child's contribution and rebuilding only the small parts. The trick is not union-find; it is small-to-large merging hidden inside a DFS.