Yet Another Primer On GPU Programming 1/N
I did write an older post in college, Understanding Volatile: A CUDA Prefix Sum Story, where I ventured outside the syllabus to read PTX and understand the volatile keyword. But that post assumed a lot of pre-existing context, partly because I had written it as a record for classmates who were in the parallel programming class with me or students that I had TA'd.
So this post seeks to be a first-principles 10,000-foot-view companion for all of that.
Standing on the Shoulders of Giants
In particular I'd like this to be a companion post for the following worklogs, of which I highly recommend going through each of these worklogs at least twice:
- On the A100: How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog by Simon Boehm
- On the H100: Outperforming cuBLAS on H100: a Worklog by Pranjal Shankhdhar
In addition to the above, of course, please use the following as reference:
- CUDA Programming Guide by Nvidia
- TPU Scaling Book by Google Deepmind
- Just the roofline analysis chapter
Road to Nvidia-Level Engineering
Both worklogs left me with a better understanding of kernel optimization, but also with a clearer picture of what I still didn't know, primarily in the broader ecosystem and in the... (okay, sorry, I like this cliché)... first principles reasoning underneath.
Without further ado, my questions:
Question 1: What is the CUDA ecosystem? What are CUTLASS, cuBLAS, and cuDNN, and what problems do they seek to address? Why did the authors choose cuBLAS as the baseline to outperform?
Motivation: Both authors choose Nvidia's cuBLAS implementation as the baseline to outperform.
Question 2: When onboarding onto a new piece of hardware, what mental model can I use and what are the first questions I can ask to write efficient parallel programs for this piece of hardware? Regardless of the species of AI chip and regardless of the generation of chip, the mental model should be the same.
Motivation: The two worklogs are on two different generations of GPU hardware: the Ampere and the Hopper.
Question 3: How do I read a worklog such that, on a kernel I haven't seen before, I'd know what to look for next? Can I predict what the next bottleneck will be before reading the next section (a la the old college study tip, "can you predict what the lecturer will say before they say it?")? And can I reproduce the relevant roofline numbers+analysis from first principles?
Motivation: For the "how to optimize a CUDA matmul kernel" series, it's tempting to copy-paste the author's code and move on to the next part of their worklog.
Question 4: You may notice that CUDA, a C++-based language, is rather hefty to write, and you spend a surprising fraction of your time on boilerplate. CUTLASS was created to reduce this sort of boilerplate. Even further up from CUTLASS though, there exist higher-level Pythonic languages called DSLs (domain-specific languages) that automate many of these decisions for you. The popular DSLs at the moment are Triton and CuteDSL. What are the manual decisions in CUDA that these DSLs abstract away, and what are the tradeoffs of these abstractions?
Motivation: Well, we write a lot of CUDA in the worklogs.
Q1: Nvidia Ecosystem
What is the CUDA ecosystem? What are CUTLASS, cuBLAS, and cuDNN, and what problems do they seek to address? Why did the authors choose cuBLAS as the baseline to outperform?
With each new generation of hardware, new instructions and hardware features are introduced while others are deprecated or retired, and thus naive CUDA code written for an older generation of hardware won't extract full performance on the newer generation.
On behalf of the researcher writing PyTorch training code, we want to maintain the illusion that they never need to think about which generation of hardware they're running on. They call the same Python ops as usual, rivers do not run backward, and birds do not drop dead out of the air.
"The earth stayed on its axis. Rivers did not run backward. Birds didn’t drop dead out of the air." - Big Magic by Elizabeth Gilbert
Nvidia's solution is to ship a layer of libraries, e.g. cuBLAS for linear algebra like matrix multiplications, and cuDNN for higher-level deep learning primitives like convolutions, attention, normalization. These libraries are built on top of CUTLASS, which provides reusable, composable templates that serve as building blocks for high-performance kernels.
CUTLASS, cuBLAS, and cuDNN are all co-designed with each new hardware generation. For example, when Hopper shipped with new WGMMA instructions, Nvidia rewrote the internals of these libraries to fully take advantage of them. Then when Blackwell came out, WGMMA instructions weren't forwards compatible with Blackwell, and the shiny new thing Blackwell introduced was the fancy shmancy UMMA instruction plus a new kind of shared memory, and once again, the Nvidia engineers got on with the rewrite.
The whole time, the AI researcher sleeps peacefully with their model.forward(), rivers run forwards, and birds remain alive in the air.
This is why both worklogs benchmark against cuBLAS for matmuls, as cuBLAS represents Nvidia's very own co-designed, hardware-tuned implementation for that generation.
While cuBLAS is closed-source, our best window into it is CUTLASS, which is open-source and provides the primitives that cuBLAS is built from.
A quick pop quiz on CUTLASS, what do these mean?
- tiling strategies
- tiling hierarchy
- software pipelining
- memory iterators
- epilogues
Couldn't tell ya, and that's a problem, because if we want to beat Nvidia at their own game, we should probably start by understanding what exactly they're composing and why.
So, in the name of understanding the CUTLASS vocabulary (tiling strategies! memory iterators! epilogues!) to demystify and ultimately outperform cuBLAS, the rest of this post works through GPU programming from... wait for it... first principles (!) to do exactly that.
Q2: My Chip-Agnostic Mental Model
When onboarding onto a new piece of hardware, what mental model can I use and what are the first questions I can ask to write efficient parallel programs for this piece of hardware? Regardless of the species of AI chip and regardless of the generation of chip, the mental model should be the same.
So, I've conceptualized parallel programming as:
- Who are the doer of things? (Execution Units)
- Who is scheduling these doer-of-things, and how many doer-of-things do they manage at a time? (Scheduling)
- This question is most relevant for hardware that uses runtime scheduling. For TPUs or FPGAs, there isn't so much a ""runtime scheduler,"" as the "schedule" is baked in at compile time or circuit design time, and every execution unit is always active.
- Where do these doer-of-things store their state? Where are the slow stores and where are the fast ones, and who are these stores shared among? (Memory Hierarchy)
Execution Units and Scheduling
I'd like to ground this into something familiar first. In CPUs, the doers are threads, and threads are scheduled onto cores, one thread per core.
Contrast this with GPUs, where the doers are warps of 32 threads. Warps are grouped into blocks, which are then scheduled onto streaming multiprocessors (SMs). The SM's warp scheduler keeps many warps in flight at once and switches between warps as necessary. When a warp's turn comes, its 32 threads execute across 32 CUDA cores in lockstep.
The pretty important thing to note here is that instead of scheduling individual threads onto individual cores, the GPU schedules warps of threads onto CUDA cores. I somewhat mentioned this in my post about volatile, that threads within the same warp move in lockstep, executing the same instruction at the same time. This is in contrast with CPU threads, which may be scheduled on different cores, and there is no guarantee they execute the same instruction at the same time or make progress at the same rate.
The warp has pretty huge consequences for writing efficient GPU programs. Since threads in a warp execute in lockstep, they also issue memory requests in lockstep, and the hardware can coalesce those requests into a single transaction if the threads are accessing contiguous memory addresses. So when reasoning about memory access patterns, the relevant unit to think about is the warp, not the thread.
Memory Hierarchy
Since warps issue memory requests in lockstep, and the cores themselves are very, very fast, the bottleneck in most GPU programs is often data movement rather than compute.
GPU memory is organized into levels, trading off size against speed:
- Global memory (DRAM, also referred to as HBM)
- Shared memory (SMEM) - shared among threads in the same block
- Registers - private to each thread
All Roads Lead To Tiling
The canonical optimization pattern in GPU programming is tiling, this idea that you load a "tile" of data from slow global memory into fast shared memory once, and allow as many warps as needed to consume the tile before it's discarded. In this sense, instead of each warp reaching out to global memory repeatedly, you stage a tile of data into shared memory for the whole block of warps to work on.
If you remember one word about kernel optimizations, remember tiling. One way to conceptualize a kernel optimization is to ask, "how does this relate to tiling?"
Most of the optimizations you'll see in the worklogs are refinements on top of this idea.
- Thread coarsening? Make each thread responsible for a larger output tile!
- Threadblock tiling, warptiling, thread tiling? Different forms of tiling that are higher/lower on the memory hierarchy!
- MMA / WGMMA / UMMA? This is the hardware instruction that consumes a warp-tile's worth of data and feeds it to the tensor cores! Each generation differs only in tile size and how many threads are issuing this instruction.
- Software pipelining? Once you have tiling, the natural next question is, can we overlap the next tile's load with the current tile's compute?
- Async copy instructions and TMA? They're mechanisms that enable software pipelining, which is just pipelining across tiles!
- Simon Boehm's worklog covers Ampere's
cp.asyncwhile Pranjal's covers TMA on Hopper
- Simon Boehm's worklog covers Ampere's
- Async copy instructions and TMA? They're mechanisms that enable software pipelining, which is just pipelining across tiles!
Q3: Check Your Understandings for Worklog
How do I read a worklog such that, on a kernel I haven't seen before, I'd know what to look for next? Can I predict what the next bottleneck will be before reading the next section (a la the old college study tip, "can you predict what the lecturer will say before they say it?")? And can I reproduce the relevant roofline numbers+analysis from first principles?
This section is running rather long and I am splitting this into a separate post or link to a separate doc.
TL;DR it's a compilation of check your understanding questions I wrote for myself to make sure I wasn't just blindly copy-pasting the code from each step of the optimization. I will publish the questions alongside my answers when done.
Q4: Beyond CUDA: Triton and CuteDSL
You may notice that CUDA, a C++-based language, is rather hefty to write, and you spend a surprising fraction of your time on boilerplate. CUTLASS was created to reduce this sort of boilerplate. Even further up from CUTLASS though, there exist higher-level Pythonic languages called DSLs (domain-specific languages) that automate many of these decisions for you. The popular DSLs at the moment are Triton and CuteDSL. What are the manual decisions in CUDA that these DSLs abstract away, and what are the tradeoffs of these abstractions?
Recall the pop quiz from the Q1 section about tiling strategies, memory iterators, epilogues. At the time I said "couldn't tell ya." By now, having worked through Q2 and Q3, these terms should feel less opaque, or at the very least you'll have felt the pain that some of these things exist to solve.
Tiling strategies. Everything discussed in Q2: All Roads Lead to Tiling and the entirety of Q3 cover this. CUTLASS, CuteDSL, and Triton each automate our tiling process with varying levels of ergonomics.
- In CUTLASS, you declare your tile sizes at the threadblock, warp, and thread levels, and the library generates the corresponding loops and data movement code.
- CuteDSL sits closer to CUTLASS in that you still reason about tile shapes explicitly, but you compose it via Python rather than C++ templates
- Triton goes even further, where you merely describe your workload on the block level (no reasoning about individual threads or warp tiles), and the compiler figures out the tiling hierarchy automatically.
Memory iterators. In tiling, there's a lot of schlep that goes into deciding which threads load which addresses from global memory, how to handle tile boundaries when the matrix dimensions aren't a clean multiple of the tile size, and how to ensure that those loads are coalesced.
In raw CUDA, you write all of this by hand, and as you'll see in the worklogs it's a minor pain in the butt. CUTLASS packages this logic into reusable memory iterators.
Epilogues. This one is best illustrated in the follow-up post's attention kernel, so we won't dwell on it here, but briefly...
After the matmul completes, there's usually some postprocessing to do on the result of matmul (scaling, adding a bias, applying activation function) before writing it back to global memory. Naively, each of these postprocessing steps would be a separate kernel launch, which would incur a separate round-trip to global memory.
An epilogue fuses all of this into the tail end of the matmul kernel itself, so the result doesn't write out to global memory until all the preprocessing steps are done. CUTLASS exposes epilogues as composable, swappable components, which makes it very reusable across different operator fusion patterns.
In a follow-up post, I plan to take some sample CUDA kernels (one of the matmuls kernels from the worklogs, plus an attention kernel) and translate it first into CUTLASS, then into CuteDSL, then into Triton to illustrate these abstractions. My goal is to show which lines of manual CUDA each layer of abstraction replaces, and compare the performance loss of using these higher levels of abstractions.
Hopefully this demystifies the CUTLASS vocabulary enough that the worklogs feel less like magic, and happy tiling. (ง •_•)ง ▦ ▦ ▦
