FlashKDA: Moonshot's CUTLASS Kernel for Kimi Linear (2026)
May 2, 2026
Moonshot has open-sourced FlashKDA, a CUTLASS-based CUDA kernel for Kimi Delta Attention. It is a drop-in for flash-linear-attention, with a reported prefill speedup of up to 2.22x on H20 GPUs.