FlashKDA: Moonshot's CUTLASS Kernel for Kimi Linear (2026)

May 2, 2026


In late April 2026, Moonshot AI open-sourced FlashKDA — a CUTLASS-based CUDA kernel for Kimi Delta Attention that delivers 1.72x to 2.22x prefill speedup over the flash-linear-attention baseline on NVIDIA H20 GPUs.[1][2] FlashKDA is the production-grade kernel that closes the gap between Kimi Linear's elegant linear-attention math and the throughput numbers any inference team actually has to ship.

TL;DR

FlashKDA is Moonshot AI's open-source CUTLASS kernel for the Kimi Delta Attention (KDA) mechanism that powers the Kimi Linear 48B-A3B hybrid model.[1][3] On H20 GPUs it accelerates prefill by 1.72x in fixed-length and up to 2.22x in uniform variable-length workloads (seq_lens = 1024 x 8, total T = 8192) compared to the reference Triton implementation in flash-linear-attention.[2] It ships under an MIT license, targets SM90+ Hopper hardware (H100, H20), and requires CUDA 12.9+ and PyTorch 2.4+.[1] Variable-length batching is supported natively via cu_seqlens, and the kernel is auto-dispatched from flash-linear-attention's chunk_kda, making it a drop-in upgrade for any existing FLA codebase.[2]

What You'll Learn

  • What FlashKDA is and how it relates to the Kimi Linear architecture
  • The exact prefill speedup numbers and benchmark conditions on H20
  • Why CUTLASS replaces Triton for production-grade KDA workloads
  • Hardware, CUDA, and PyTorch requirements
  • How to drop FlashKDA into an existing flash-linear-attention pipeline
  • Current limitations (fixed head dimension, SM90+ only)
  • Where this fits in the broader linear-attention vs softmax-attention story

What FlashKDA Is

FlashKDA is a high-performance CUTLASS-based CUDA kernel implementation of Kimi Delta Attention (KDA), the linear-attention mechanism that drives Moonshot AI's Kimi Linear 48B-A3B hybrid model.[1][3] It is published at github.com/MoonshotAI/FlashKDA under an MIT license.[1]

Where the original KDA reference kernels were written in Triton inside the open-source flash-linear-attention library (FLA, first-authored by Songlin Yang of Thinking Machines Lab), FlashKDA is rewritten in NVIDIA's CUTLASS templates — the same lower-level toolkit that powers FlashAttention-3.[4][5] The result is a kernel that exploits Hopper-specific features like the Tensor Memory Accelerator (TMA) and Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions through hand-tuned scheduling patterns — warp specialization, ping-pong overlap of GEMM and softmax — that Triton's compiler does not always emit on its own.[5]

The Benchmark Numbers

The numbers Moonshot published target the prefill phase — the compute-bound stage that processes a prompt before the first output token is generated. Benchmarks were run on NVIDIA H20 GPUs against the reference Triton kernel inside flash-linear-attention's chunk_kda.[2]

Workload                  Sequence configuration            Speedup vs flash-linear-attention
Fixed-length              Single sequence                   1.72x
Uniform variable-length   seq_lens = 1024 x 8 (T = 8192)    2.22x

The variable-length case is where FlashKDA pulls ahead the most: packing eight 1024-token sequences into a single kernel call lets the kernel saturate Hopper's tensor cores in a way the chunked Triton baseline cannot. The boost matters most for inference servers handling mixed-length prompts — exactly the setting where production LLM serving spends most of its compute budget.
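To sanity-check numbers like these on your own hardware, a simple CUDA-event timing harness is enough. The sketch below is not FlashKDA's benchmarking code: run_prefill is a hypothetical stand-in for whatever invokes chunk_kda in your stack, and only the workload shape (8 x 1024 = 8192 tokens) is taken from the published benchmark.

# Hedged timing-harness sketch. `run_prefill` is a hypothetical wrapper
# around the code path that calls chunk_kda; it is not part of FlashKDA.
import torch

def time_cuda_ms(fn, warmup=10, iters=50):
    # Warm up to exclude compilation and allocator effects.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean milliseconds per call

# Uniform variable-length: eight 1024-token sequences packed into T = 8192.
cu_seqlens = torch.arange(0, 8193, 1024, dtype=torch.int32, device="cuda")
ms_fixed = time_cuda_ms(lambda: run_prefill(total_tokens=8192))
ms_varlen = time_cuda_ms(lambda: run_prefill(total_tokens=8192,
                                             cu_seqlens=cu_seqlens))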

Why CUTLASS Beats Triton Here

Triton is excellent for research velocity. You write Python that compiles to GPU code, and you can iterate on a new attention variant in an afternoon. FLA — the upstream library that FlashKDA plugs into — is built end-to-end in Triton precisely because that ergonomic loop matters for emerging architectures.4
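To see why that loop is so fast, consider what a minimal Triton kernel looks like. The example below is a generic elementwise add, not FLA code; the point is that the author writes blocked Python while the compiler owns every low-level scheduling decision.

# Illustrative Triton kernel (not from FLA): blocked Python in, PTX out.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements              # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x, y = torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)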

But Triton sits one layer above the metal. Recent releases do support WGMMA and TMA emission, but the compiler decides how to schedule them — and the highly-tuned orchestration patterns that squeeze the last 30%+ of throughput out of Hopper (warp specialization, ping-pong GEMM-softmax overlap, asynchronous copy pipelines) typically require hand-written C++ templates to express reliably.[5] CUTLASS, NVIDIA's open-source library of CUDA C++ template abstractions, is where those patterns live: TMA for asynchronous memory movement, WGMMA for warpgroup-level GEMMs, and the CuTe layout algebra for swizzled shared-memory tiling.[5] FlashAttention-3 leveraged the same toolkit to reach 1.5x–2.0x speedup over FlashAttention-2 on H100s.[5]

For a research kernel you don't bother. For a model that ships to paying users on H100s and H20s, you do.

Hardware and Software Requirements

FlashKDA is opinionated about its target. The README lists the following constraints:[1]

Requirement         Value
GPU architecture    SM90+ (Hopper: H100, H20, and newer)
CUDA toolkit        12.9 or later
PyTorch             2.4 or later
Head dimension      Fixed: K = V = 128
License             MIT

The K = V = 128 head-dimension constraint is the most important caveat. KDA-based models with different head dimensions cannot use FlashKDA today and would need to fall back to FLA's Triton implementation. Kimi Linear 48B-A3B uses K = V = 128, which is presumably why this dimension was prioritized first.[3]

The SM90+ requirement rules out Ampere (A100) and Ada Lovelace (L40, RTX 4090) deployments. Teams running KDA models on A100 fleets will need to keep using the Triton path until the FlashKDA team adds Ampere support — which is non-trivial because the optimization story relies on Hopper-only TMA and WGMMA instructions.
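In practice, a serving stack can encode both constraints in a small dispatch guard. The helper below is a hypothetical sketch, not part of FlashKDA's API; it simply mirrors the README's published requirements.

# Hedged sketch of a dispatch guard. `use_flashkda` is hypothetical,
# not a FlashKDA function; it encodes the repo's stated constraints.
import torch

def use_flashkda(head_dim_k: int, head_dim_v: int) -> bool:
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    # SM90+ (Hopper) only; Ampere (SM80) and Ada (SM89) stay on Triton.
    if major < 9:
        return False
    # Current release supports only K = V = 128.
    return head_dim_k == 128 and head_dim_v == 128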

Drop-In Integration With flash-linear-attention

The most useful design decision in FlashKDA is what it does not require you to change. Once installed, the kernel is auto-dispatched from flash-linear-attention's chunk_kda entry point.[2] Any code that already uses FLA for KDA — including production inference servers, vLLM integrations, and research checkpoints — picks up the speedup with no code changes beyond the install.

# Existing code — no changes needed
from fla.ops.kda import chunk_kda

output = chunk_kda(q, k, v, ...)
# With FlashKDA installed, this now hits the CUTLASS kernel
# instead of the Triton baseline on SM90+ hardware.

For the variable-length path, FlashKDA accepts a cu_seqlens tensor of cumulative sequence-length offsets that mark where each sequence in a packed batch begins and ends — the same convention used by FlashAttention-2/3 for ragged inference batches.[2] This is the feature that makes the 2.22x peak speedup possible: a single kernel launch processes eight 1024-token prompts together, sharing tensor-core occupancy across the batch.
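Concretely, cu_seqlens is a prefix sum of sequence lengths: entry i marks where sequence i starts in the packed token dimension, and the final entry equals the total token count. A minimal sketch of the convention, with illustrative shapes:

# Building cu_seqlens for eight 1024-token prompts (illustrative shapes).
# Sequence i occupies tokens cu_seqlens[i]:cu_seqlens[i+1] of the packed
# (T, H, D) tensors, with T = 8192 here.
import torch

lens = torch.tensor([1024] * 8)
cu = torch.zeros(len(lens) + 1, dtype=torch.int32)
cu[1:] = lens.cumsum(0)          # [0, 1024, 2048, ..., 8192]
cu_seqlens = cu.cuda()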

The Kimi Linear Context

FlashKDA only makes sense if you understand what Kimi Linear is and why Moonshot bet on linear attention in the first place.

Kimi Delta Attention (KDA) is a refinement of Gated DeltaNet that introduces channel-wise (per-dimension) gating — a finer-grained forgetting mechanism than the head-wise or scalar gates used in earlier linear-attention models like Mamba2 or GDN.[6] In the Kimi Linear architecture (arXiv:2510.26692, October 30, 2025), three KDA layers are interleaved with one Multi-Head Latent Attention (MLA) layer in a 3:1 hybrid ratio.[6]
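To make "channel-wise gating" concrete, here is a schematic per-token recurrence for a delta rule with a per-dimension decay. This is a deliberate simplification for intuition only — not the paper's exact parameterization, and not the chunked algorithm FlashKDA actually runs.

# Schematic channel-wise-gated delta rule (illustrative, simplified).
# The point: the state S is a fixed (d_k, d_v) matrix, not a growing cache.
import torch

d_k, d_v, T = 128, 128, 16
S = torch.zeros(d_k, d_v)                      # fixed-size recurrent state
for t in range(T):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha = torch.sigmoid(torch.randn(d_k))    # per-channel decay (KDA's refinement)
    beta = torch.sigmoid(torch.randn(()))      # write strength (delta rule)
    S = alpha[:, None] * S                     # channel-wise forgetting
    S = S + beta * torch.outer(k, v - k @ S)   # delta-rule correction toward v
    o = q @ S                                  # output for token t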

The headline numbers from the Kimi Linear paper:[6][3]

  • 48B total parameters / 3B activated per token (8 of 256 experts active, including 1 shared expert)
  • Up to 75% reduction in KV cache versus full MLA
  • Up to 6x faster decoding throughput at 1M-token context
  • Time per output token at 1M context: 1.84 ms (Kimi Linear) vs 11.48 ms (full MLA) — roughly a 6x improvement, consistent with the paper's headline figure

That last number is the punchline. Linear attention has a fixed-size hidden state that does not grow with sequence length, which means decoding cost stays flat as context grows — exactly the opposite of softmax attention's quadratic blowup. FlashKDA is the kernel that turns those theoretical wins into wall-clock numbers on real hardware.
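A quick back-of-envelope calculation shows the scale of that win. The head counts and dimensions below are illustrative, not Kimi Linear's actual configuration:

# Illustrative arithmetic (numbers assumed, not from the paper): a softmax
# KV cache grows linearly with context; a linear-attention state does not.
heads, d_head, bytes_fp16 = 32, 128, 2

def kv_cache_bytes(context_len):
    return 2 * context_len * heads * d_head * bytes_fp16   # K and V, per layer

linear_state = heads * d_head * d_head * bytes_fp16        # fixed (d_k x d_v) per head

for T in (8_192, 131_072, 1_048_576):
    print(f"T={T:>9,}  KV cache {kv_cache_bytes(T)/2**20:8.0f} MiB"
          f"  vs linear state {linear_state/2**20:6.1f} MiB")
# At T = 1,048,576 this gives ~16384 MiB of KV cache per layer
# against a constant 1 MiB linear state.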

Why H20, Not H100?

Moonshot benchmarked FlashKDA on the H20 specifically, not the H100. That's not arbitrary. The H20 is NVIDIA's China-export-compliant Hopper variant — same architecture as the H100, but with reduced compute density (78 SMs versus 132 on the production H100 SXM5) and 96 GB of HBM3 at 4.0 TB/s of bandwidth.[7][8] It's the current Hopper-class GPU Chinese AI labs can buy at scale, after the H800 used to train earlier Kimi models (including Kimi K2) was export-banned in late 2023.[9] Benchmarking on the H20 makes the numbers directly relevant to whichever Hopper-class deployments Moonshot and other Chinese labs run today.

Practically, the same kernel runs on H100s — both expose SM90 — but the speedup ratio will shift because the H100 has more compute to absorb the same memory-movement wins. Expect FlashKDA to remain faster than the Triton baseline on H100, with the exact margin depending on workload shape.

Where This Fits

FlashKDA arrived less than two weeks after Moonshot shipped Kimi K2.6 on April 20, 2026 — a much larger 1T-parameter MoE coding model — but the two releases solve different problems.[10] K2.6 is the production agentic-coding flagship. FlashKDA is infrastructure for the long-context regime, where the linear-attention math in Kimi Linear pays off most.

It also sits next to other 2026 inference-optimization releases: Google's TurboQuant tackles the KV-cache memory bottleneck through 3-bit quantization, while FlashKDA tackles the prefill compute bottleneck through architecture-aware kernel design.[11] The two are complementary — if you're running a KDA-based model on Hopper, you can stack FlashKDA on top of a quantized cache to squeeze both axes.

Limitations to Know About

Limitation                  Implication
K = V = 128 head dim only   Other KDA model variants must fall back to FLA Triton
SM90+ only                  No A100, L40, or consumer-GPU support today
Targets prefill             Decoding speedups are not the headline metric
KDA-specific                Does not accelerate softmax attention or other linear-attention variants
H20 benchmark only          H100 numbers will shift; not officially published yet

Two of these (the head-dim constraint and the SM90 floor) are tractable engineering work. The KDA-specific design is intentional — FlashKDA is not trying to be a general flash-attention replacement, just the fastest possible kernel for one specific linear-attention variant.

Bottom Line

FlashKDA is a quiet but significant release. It does one thing — make Kimi Delta Attention go fast on Hopper — and it does it by replacing a Triton kernel with a CUTLASS kernel. The 1.72x to 2.22x prefill speedup over FLA's baseline is real and measurable, the MIT license removes friction, and the drop-in integration via chunk_kda means any team already using FLA for KDA workloads gets the win for free.[1][2]

For the broader linear-attention story, FlashKDA matters because it closes the gap between elegant theoretical wins (a fixed-size hidden state, 6x faster decoding at 1M context) and the wall-clock numbers production teams actually measure.[6] Kimi Linear's promise of long-context efficiency only lands if the kernels keep up — and FlashKDA is what keeps them up.

If you're running Kimi Linear or any other KDA-based model on H100s or H20s, install it. If you're running softmax-attention models, watch this space: when Moonshot ships the next architecture, the kernels will already be ready.

Footnotes

  1. FlashKDA GitHub repository - MoonshotAI/FlashKDA

  2. Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention - MarkTechPost (April 30, 2026)

  3. Kimi-Linear-48B-A3B-Instruct - Hugging Face

  4. flash-linear-attention - fla-org GitHub

  5. Next Generation of FlashAttention - NVIDIA Technical Blog

  6. Kimi Linear: An Expressive, Efficient Attention Architecture - arXiv:2510.26692

  7. NVIDIA H20 Tensor Core GPU specifications - BurnCloud GPU catalog

  8. NVIDIA's China-Compliant H20 GPU Has 41% Fewer Cores - Wccftech

  9. China's Moonshot claims to build models with fewer high-end AI chips than US rivals use - South China Morning Post

  10. Moonshot AI Releases Kimi K2.6 - MarkTechPost (April 20, 2026)

  11. TurboQuant: Redefining AI efficiency with extreme compression - Google Research

