
Six Iterations to a Faster Attention Kernel — Two Were Reverts

It took six iterations to get an LLM decode-attention kernel 1.91× faster than PyTorch's SDPA (FlashAttention/cuDNN). Two of those six steps were reverts. The reverts taught me more than the wins.

[Figure: bar chart of per-iteration latency, v0 through v5. v3 is the fastest at 0.713 ms; v4 and v5 are reverts; the PyTorch SDPA reference is drawn as a dashed line.]

The third category

Every GPU optimization tutorial teaches two ways a kernel can be slow: memory-bandwidth-bound or compute-bound. Most of mine were neither. They were dependency-chain-bound: gated by the length of the per-iteration critical path, not by memory bandwidth or arithmetic throughput. That third category flips your intuition, and it cost me real time until it sank in.
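One way to make the three categories concrete is to compute a rough lower bound for each and see which one dominates. The sketch below is a back-of-the-envelope model, not the method used to profile this kernel. The function name, its parameters, and every number in the usage lines are hypothetical placeholders chosen only to illustrate a chain-bound case.

```python
def classify_bound(bytes_moved, flops, chain_steps, step_latency_s,
                   peak_bandwidth_bytes_per_s, peak_flops_per_s):
    """Return (label, lower_bounds) under a simple three-term model.

    All inputs are caller-supplied estimates; nothing is measured here.
    """
    bounds = {
        # Time to move the data at peak memory bandwidth.
        "bandwidth": bytes_moved / peak_bandwidth_bytes_per_s,
        # Time to do the arithmetic at peak throughput.
        "compute": flops / peak_flops_per_s,
        # Serial critical path: dependent steps that cannot overlap,
        # so each one pays its full latency.
        "chain": chain_steps * step_latency_s,
    }
    label = max(bounds, key=bounds.get)
    return label, bounds


# Hypothetical numbers (assumptions, not measurements from this post).
label, bounds = classify_bound(
    bytes_moved=64e6,
    flops=128e6,
    chain_steps=4096,
    step_latency_s=200e-9,
    peak_bandwidth_bytes_per_s=1e12,
    peak_flops_per_s=50e12,
)
print(label, bounds)
```

With these made-up inputs the chain term is the largest of the three, which is the situation where adding bandwidth or FLOPs would not help and only shortening the critical path would.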

Three results that went against intuition

The discipline that fixed it

Predict the direction AND the magnitude of every change before you run it. If your prediction is off by 4×, you don't understand the bottleneck yet. The measurement won't explain it; it will just tempt you to guess again.
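The habit can be written down as a small check that is run after every measurement. This is a minimal sketch, assuming predictions and results are both expressed as speedup ratios (above 1.0 means faster); the helper names and the 1.5× tolerance are illustrative assumptions, not values from the original work.

```python
def prediction_error_factor(predicted_speedup, measured_speedup):
    """How many times off the prediction was, as a ratio >= 1."""
    if predicted_speedup <= 0 or measured_speedup <= 0:
        raise ValueError("speedups must be positive ratios")
    ratio = predicted_speedup / measured_speedup
    return max(ratio, 1.0 / ratio)


def bottleneck_understood(predicted_speedup, measured_speedup, tolerance=1.5):
    """True only if both direction and magnitude matched.

    `tolerance` is a hypothetical threshold; pick one that suits your noise.
    """
    same_direction = (predicted_speedup > 1.0) == (measured_speedup > 1.0)
    factor = prediction_error_factor(predicted_speedup, measured_speedup)
    return same_direction and factor <= tolerance


# Predicted a 2x win, measured a 2x regression: wrong direction, off by 4x.
print(prediction_error_factor(2.0, 0.5))
print(bottleneck_understood(2.0, 0.5))
```

Writing the prediction down before the run is the point; the check only makes it harder to rationalize a surprise after the fact.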

Both reverts are still in git, each with its diagnostic write-up. The write-ups outlived the code: when the next kernel with the same shape comes along, I've already paid for the diagnosis.

The checklist I now run before touching any kernel

  1. Bandwidth-, compute-, or chain-bound? (Most are chain-bound.)
  2. What lever shortens that bottleneck specifically?
  3. What magnitude do I predict — and does the measurement match?

If you want the first-principles version — what attention actually computes, the GPU mental model, and decode attention built up naive → fast — that's the Foundations field guide. Code and per-phase write-ups are on GitHub.