How a tiled matmul works
Writing a tiled CUDA matmul 🚀

This year, during my time at the Recurse Center, I worked through the optimizations presented in Simon Boehm’s iconic post “How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog”. I approached these as a series of puzzles: I would read as little as possible, just the title or the first few paragraphs describing the algorithm, and then implement it in CUDA/C++. This was an exercise in writing and debugging CUDA, along with implementing kernel code from a high-level algorithm. ...