2026 (6 posts)
05-21 [CUDA in Practice] Hand-Rolled Flash Decoding on SM120: Beating flashinfer.single_decode_with_kv_cache #vitamin-cuda #cuda #c++ #GPU #GEMM #flash attention #flash decoding
05-19 [CUDA in Practice] FMHA on SM120: Beating torch.sdpa (FlashAttention-2) #vitamin-cuda #cuda #c++ #GPU #GEMM #flash attention
05-10 [CUDA in Practice] HGEMM SM120 - Micro-Sculpture Warfare in 100KB SMEM: Tensor Core, TMA, ldmatrix, mma #vitamin-cuda #cuda #c++ #GPU #GEMM
05-10 [CUDA in Practice] HGEMM - Beating cuBLAS: Tensor Core, cp.async, ldmatrix, mma #vitamin-cuda #cuda #c++ #GPU #GEMM
05-09 [CUDA in Practice] SGEMM TF32 - Beating cuBLAS with Tensor Cores, cp.async, ldmatrix & mma #vitamin-cuda #cuda #c++ #GPU #GEMM
03-05 [CUDA in Practice] SGEMM - Beating cuBLAS: A Deep Dive into Peak-Performance Matrix Multiplication in Pure CUDA C++ #vitamin-cuda #cuda #c++ #GPU #GEMM