[CUDA in Practice] HGEMM SM120 — Micro-Sculpture Warfare in 100KB SMEM: Tensor Core, TMA, ldmatrix, mma | code | May 10, 2026 | Sorry folks, I said there wouldn’t be a sequel to the GEMM series, but I lied. Today it’s HGEMM again, but this time we’re embracing everything the RTX 5060 Laptop has to offer: TMA + l... | Tags: vitamin-cuda, cuda, c++, GPU
[CUDA in Practice] HGEMM — Beating cuBLAS: Tensor Core, cp.async, ldmatrix, mma | code | May 10, 2026 | This article is intended for readers with a solid foundation in CUDA programming who are familiar with GEMM optimization and interested in advanced Tensor Core / inline PTX instr... | Tags: vitamin-cuda, cuda, c++, GPU
[CUDA in Practice] SGEMM TF32 — Beating cuBLAS with Tensor Cores, cp.async, ldmatrix & mma | code | May 9, 2026 | Heads up: This is an intense, diagram-heavy deep dive. It covers hardcore swizzle derivations (if you still don’t understand XOR swizzle after reading this, come find me), layou... | Tags: vitamin-cuda, cuda, c++, GPU
[CUDA Optimization in Practice] safe online softmax - Interview Must-Know: arbitrary hidden_size, one pass, two pass, trade-off, split-k | code | March 31, 2026 | There is no best kernel, only the most suitable kernel. --- altum sonatur (throw in some Latin and it instantly sounds classy) #0. Preface — Bac... | Tags: vitamin-cuda, cuda, c++, GPU
[CUDA Optimization in Practice] sgemm - Beating cuBLAS: Learn to Implement Extremely Optimized Matrix Multiplication in CUDA C++ | code | March 5, 2026 | Warning: Extremely dense content ahead, with many diagrams, heavy bit-manipulation, and memory-mapping derivations. Best read on a PC. #0. Preface — The Last Stand of Scalar Co... | Tags: vitamin-cuda, cuda, c++, GPU
[CUDA in Practice] Matrix Transpose — From Padding to XOR Swizzle: The Art of Shared Memory Optimization | code | February 13, 2026 | Matrix transpose is one of the most fundamental operations in deep learning and high-performance computing. The deceptively simple coordinate swap B[y][x] = A[x][y]... | Tags: vitamin-cuda, cuda, c++, GPU
Numbers Every CUDA Developer Should Know | code | February 9, 2026 | This post is a cheat sheet for CUDA programmers: the hardware constants and latency scales that really matter when you care about performance. #0. Preface In HPC and deep learning... | Tags: vitamin-cuda, cuda, c++, GPU
A Deep Dive into DeviceQuery: Understanding Your GPU Hardware | hpc | February 6, 2026 | Before writing a single line of high-performance CUDA code, you must know your silicon. deviceQuery is often the first command a developer runs, yet its output is usually ignored... | Tags: vitamin-cuda, cuda, c++, GPU