How a tiled matmul works

Writing a tiled CUDA matmul 🚀 This year, during my time at the Recurse Center, I worked through the optimizations presented in Simon Boehm’s iconic post "How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog". I approached them as a series of puzzles: I would read as little as possible, just the title or the first paragraphs describing the algorithm, and then implement it in CUDA/C++. This was an exercise in writing and debugging CUDA, and in implementing kernel code from a high-level algorithm description. ...

September 15, 2025 · 14 min · Suhith Rajesh