[CUDA in Practice] Hand-Rolled Flash Decoding on SM120: Beating flashinfer.single_decode_with_kv_cache | code | May 21, 2026 | This article is intended for readers with a solid CUDA foundation, familiar with GEMM/multi-head-attention optimization, and interested in advanced inline PTX tuning. Fu… | Tags: vitamin-cuda, cuda, c++, GPU, GEMM, flash attention, flash decoding
[CUDA in Practice] FMHA on SM120: Beating torch.sdpa (FlashAttention-2) | code | May 19, 2026 | This article is intended for readers with a solid CUDA foundation, familiar with GEMM/multi-head-attention optimization, and interested in advanced Tensor Core/inline PTX t… | Tags: vitamin-cuda, cuda, c++, GPU, GEMM, flash attention
Distributed Inference with PyTorch from First Principles: DP, TP, and PP in Less Than 200 Lines | code | May 15, 2026 | Models keep getting bigger. Even if INT4 quantization squeezes the weights onto a single GPU, inference still has to pay for KV cache and activations, both of which scale with batc… | Tags: AI infer, LLM, PyTorch distribution, Tensor Parallelism, Data Parallelism, Pipeline Parallelism
[CUDA in Practice] HGEMM SM120 — Micro-Sculpture Warfare in 100KB SMEM: Tensor Core, TMA, ldmatrix, mma | code | May 10, 2026 | Sorry folks—I said there wouldn’t be a sequel to the GEMM series, but I lied. Today it’s HGEMM again, but this time we’re embracing everything the RTX 5060 Laptop has to offer: TMA + l… | Tags: vitamin-cuda, cuda, c++, GPU, GEMM
[CUDA in Practice] HGEMM — Beating cuBLAS: Tensor Core, cp.async, ldmatrix, mma | code | May 10, 2026 | This article is intended for readers with a solid foundation in CUDA programming who are familiar with GEMM optimization and interested in advanced Tensor Core/inline PTX instr… | Tags: vitamin-cuda, cuda, c++, GPU, GEMM
[CUDA in Practice] SGEMM TF32 — Beating cuBLAS with Tensor Cores, cp.async, ldmatrix & mma | code | May 9, 2026 | Heads up: This is an intense, diagram-heavy deep dive. It covers hardcore swizzle derivations (if you still don’t understand XOR swizzle after reading this, come find me), layou… | Tags: vitamin-cuda, cuda, c++, GPU, GEMM
[CUDA Basics] Understanding CUDA's "Nonexistent" Memory Tier: Local Memory | code | April 1, 2026 | There are countless articles online introducing NVIDIA’s memory hierarchy, but most focus on global memory, shared memory, constant memory, texture memory, L2/L1 cache, and r… | Tags: vitamin-cuda, cuda, c++, GPU
[CUDA in Practice] Safe Online Softmax — A Must-Know for Interviews: Arbitrary hidden_size, One/Two Pass, Trade-offs, Split-K | code | March 31, 2026 | There is no best kernel, only the most suitable kernel. -- altum sonatur (throw in some Latin and it instantly sounds classy) #0. Preface — Bac… | Tags: vitamin-cuda, cuda, c++, GPU
[CUDA in Practice] SGEMM — Beating cuBLAS: A Deep Dive into Peak-Performance Matrix Multiplication in Pure CUDA C++ | code | March 5, 2026 | Warning: Extremely dense content ahead, with many diagrams, heavy bit-manipulation, and memory-mapping derivations. Best read on a PC. #0. Preface — The Last Stand of Scalar Co… | Tags: vitamin-cuda, cuda, c++, GPU, GEMM
[CUDA in Practice] Matrix Transpose — From Padding to XOR Swizzle: The Art of Shared Memory Optimization | code | February 13, 2026 | Matrix transpose is one of the most fundamental operations in deep learning and high-performance computing. The deceptively simple coordinate swap B[y][x] = A[x][y]… | Tags: vitamin-cuda, cuda, c++, GPU