2026 (11 posts)
05-21 [CUDA in Practice] Hand-Rolled Flash Decoding on SM120: Beating flashinfer.single_decode_with_kv_cache #vitamin-cuda #cuda #c++ #GPU #GEMM #flash attention #flash decoding
05-19 [CUDA in Practice] FMHA on SM120: Beating torch.sdpa (FlashAttention-2) #vitamin-cuda #cuda #c++ #GPU #GEMM #flash attention
05-10 [CUDA in Practice] HGEMM SM120 — Micro-Sculpture Warfare in 100KB SMEM: Tensor Core, TMA, ldmatrix, mma #vitamin-cuda #cuda #c++ #GPU #GEMM
05-10 [CUDA in Practice] HGEMM — Beating cuBLAS: Tensor Core, cp.async, ldmatrix, mma #vitamin-cuda #cuda #c++ #GPU #GEMM
05-09 [CUDA in Practice] SGEMM TF32 — Beating cuBLAS with Tensor Cores, cp.async, ldmatrix & mma #vitamin-cuda #cuda #c++ #GPU #GEMM
04-01 [CUDA Basics] Understanding CUDA's "Nonexistent" Memory Tier: Local Memory #vitamin-cuda #cuda #c++ #GPU
03-31 [CUDA in Practice] Safe Online Softmax — A Must-Know for Interviews: Arbitrary hidden_size, One/Two Pass, Trade-offs, Split-K #vitamin-cuda #cuda #c++ #GPU
03-05 [CUDA in Practice] SGEMM — Beating cuBLAS: A Deep Dive into Peak-Performance Matrix Multiplication in Pure CUDA C++ #vitamin-cuda #cuda #c++ #GPU #GEMM
02-13 [CUDA in Practice] Matrix Transpose — From Padding to XOR Swizzle: The Art of Shared Memory Optimization #vitamin-cuda #cuda #c++ #GPU
02-09 Numbers Every CUDA Developer Should Know #vitamin-cuda #cuda #c++ #GPU
02-06 A Deep Dive into DeviceQuery: Understanding Your GPU Hardware #vitamin-cuda #cuda #c++ #GPU