WingEdge777

[CUDA in Practice] Hand-Rolled Flash Decoding on SM120: Beating flashinfer.single_decode_with_kv_cache

code · May 21, 2026

This article is intended for readers with a solid CUDA foundation, familiar with GEMM/multi-head-attention optimization, and interested in advanced inline PTX tuning. Fu…

[CUDA in Practice] FMHA on SM120: Beating torch.sdpa (FlashAttention-2)

code · May 19, 2026

This article is intended for readers with a solid CUDA foundation, familiar with GEMM/multi-head-attention optimization, and interested in advanced Tensor Core/inline PTX t…

Distributed Inference with PyTorch from First Principles: DP, TP, and PP in Less Than 200 Lines

code · May 15, 2026

Models keep getting bigger. Even if INT4 quantization squeezes the weights onto a single GPU, inference still has to pay for KV cache and activations, both of which scale with batc…

[CUDA in Practice] HGEMM SM120 — Micro-Sculpture Warfare in 100KB SMEM: Tensor Core, TMA, ldmatrix, mma

code · May 10, 2026

Sorry folks, I said there wouldn’t be a sequel to the GEMM series, but I lied. Today it’s HGEMM again, but this time we’re embracing everything the RTX 5060 Laptop has to offer: TMA + l…

[CUDA in Practice] HGEMM — Beating cuBLAS: Tensor Core, cp.async, ldmatrix, mma

code · May 10, 2026

This article is intended for readers with a solid foundation in CUDA programming who are familiar with GEMM optimization and interested in advanced Tensor Core/inline PTX instr…

[CUDA in Practice] SGEMM TF32 — Beating cuBLAS with Tensor Cores, cp.async, ldmatrix & mma

code · May 9, 2026

Heads up: This is an intense, diagram-heavy deep dive. It covers hardcore swizzle derivations (if you still don’t understand XOR swizzle after reading this, come find me), layou…

[CUDA Basics] Understanding CUDA's "Nonexistent" Memory Tier: Local Memory

code · April 1, 2026

There are countless articles online introducing NVIDIA’s memory hierarchy, but most focus on global memory, shared memory, constant memory, texture memory, L2/L1 cache, and r…

[CUDA in Practice] Safe Online Softmax — A Must-Know for Interviews: Arbitrary hidden_size, One/Two Pass, Trade-offs, Split-K

code · March 31, 2026

There is no best kernel, only the most suitable kernel. -- altum sonatur (throw in some Latin and it instantly sounds classy) #0. Preface - Bac…

[CUDA in Practice] SGEMM — Beating cuBLAS: A Deep Dive into Peak-Performance Matrix Multiplication in Pure CUDA C++

code · March 5, 2026

Warning: Extremely dense content ahead, with many diagrams, heavy bit-manipulation, and memory-mapping derivations. Best read on a PC. #0. Preface - The Last Stand of Scalar Co…

[CUDA in Practice] Matrix Transpose — From Padding to XOR Swizzle: The Art of Shared Memory Optimization

code · February 13, 2026

Matrix transpose is one of the most fundamental operations in deep learning and high-performance computing. The deceptively simple coordinate swap B[y][x] = A[x][y]…
