FlashKDA: Moonshot's CUTLASS Kernel for Kimi Linear (2026)
May 2, 2026
Moonshot open-sourced FlashKDA, a CUTLASS CUDA kernel for Kimi Delta Attention. Drop-in for flash-linear-attention with up to 2.22x prefill speedup on H20 GPUs.