#llm-inference

FlashKDA: Moonshot's CUTLASS Kernel for Kimi Linear (2026)

May 2, 2026

Moonshot has open-sourced FlashKDA, a CUTLASS CUDA kernel for Kimi Delta Attention. It works as a drop-in for flash-linear-attention and delivers up to 2.22x prefill speedup on H20 GPUs.

#FlashKDA #Kimi Delta Attention

Google's TurboQuant: 6x Less Memory for LLM Inference (2026)

April 6, 2026

Google's TurboQuant compresses LLM KV caches to 3 bits with zero accuracy loss, cutting KV cache memory 6x and speeding up attention computation on H100 GPUs by up to 8x versus FP32.

#TurboQuant #KV cache

AirLLM Tested: Run a 70B LLM on a 4GB GPU — Does It Work?

April 5, 2026

Run a 70B LLM on a 4GB GPU? AirLLM does it with layer-wise inference and no quantization. Benchmarks, latency tradeoffs, and how it compares to Ollama and llama.cpp.

#AirLLM #local LLM