LLM Attention Kernels and GPU Primitives

Series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory access optimization.

How online softmax extends the fused kernel to rows that exceed SRAM capacity, using a numerically stable two-pass blocked algorithm.

How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.
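
As a rough illustration of the two-pass blocked idea named in the online softmax entry, here is a minimal sketch in plain Python. It is not the kernel from the series: the names `online_softmax`, `reference_softmax`, and `block_size` are hypothetical, and a Python list stands in for a row that a GPU kernel would stream through SRAM one block at a time. The first pass keeps a running maximum and a running sum that is rescaled whenever the maximum grows; the second pass normalizes each element.

```python
import math


def online_softmax(row, block_size=4):
    """Two-pass, block-wise softmax over one row (illustrative sketch only)."""
    running_max = float("-inf")
    running_sum = 0.0

    # Pass 1: visit the row block by block, as a kernel would when the
    # whole row does not fit in on-chip memory.
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the sum accumulated so far to the new maximum, then add
        # this block's contribution. On the first block the old sum is 0.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in block
        )
        running_max = new_max

    # Pass 2: normalize every element with the final max and sum.
    return [math.exp(x - running_max) / running_sum for x in row]


def reference_softmax(row):
    """Standard numerically stable softmax that sees the whole row at once."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]
```

Because every exponent is taken relative to the current maximum, large inputs such as 1000.0 do not overflow, and the result should agree with the whole-row version up to floating-point rounding regardless of the block size chosen.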