Chinese Open-Weight Coding LLMs: 2026's Three-Week Sweep
May 11, 2026
TL;DR
Between April 7 and April 24, 2026, three Chinese labs shipped open-weight coding models that closed almost the entire gap to the Western frontier. GLM-5.1 (Z.ai, April 7) became the first open-weight model to top SWE-Bench Pro at 58.4%, beating GPT-5.4's 57.7%.[1][2] Nine days later Claude Opus 4.7 (Anthropic, April 16) reclaimed the lead at 64.3% and pushed SWE-bench Verified to 87.6%.[3][4] Kimi K2.6 (Moonshot AI, April 20) answered with 58.6% on SWE-Bench Pro and 80.2% on SWE-bench Verified — a 1-trillion-parameter MoE that orchestrates up to 300 sub-agents in a single coding run.[5][6] Four days later, DeepSeek V4 Preview (April 24) shipped a 1.6T MoE with 1M-token context, 80.6% SWE-bench Verified, a 3,206 Codeforces rating (above GPT-5.4's 3,168), and MIT-licensed weights at roughly one-seventh the output cost of Claude Opus 4.7.[7][8][9] None of the three closes the gap to Opus 4.7 on SWE-Bench Pro — but on price per task, agent orchestration, context length, and license clarity, the open-weight frontier is now within touching distance of the closed one.
What You'll Learn
- The exact 17-day timeline of releases — and why the order matters
- How each model scores on SWE-Bench Pro, SWE-bench Verified, LiveCodeBench, and Codeforces
- The price gap between the three Chinese open-weight models and GPT-5.5 / Claude Opus 4.7
- Why MIT vs Modified MIT vs non-commercial licensing changes who can actually use these weights
- The three different bets each lab made — and which one fits your workload
- Where the remaining gap to Claude Opus 4.7 still hurts, and where it doesn't
A 17-Day Window That Changed the Open-Weight Frontier
Before April 7, the open-weight coding leaderboard sat well behind the closed frontier. SWE-Bench Pro's top open-weight entrant trailed GPT-5.4 by several points, and the gap on SWE-bench Verified was wider still. Then three Chinese releases landed in seventeen days, with Anthropic and OpenAI shipping answers of their own in between.
| Date | Release | Lab | Headline result |
|---|---|---|---|
| April 7, 2026 | GLM-5.1 | Z.ai (formerly Zhipu) | First open-weight #1 on SWE-Bench Pro at 58.4%[1][2] |
| April 16, 2026 | Claude Opus 4.7 | Anthropic | Reclaims SWE-Bench Pro lead at 64.3%, ships 87.6% SWE-bench Verified[3][4] |
| April 20, 2026 | Kimi K2.6 | Moonshot AI | 58.6% SWE-Bench Pro, 80.2% SWE-bench Verified, 300-agent swarm[5][6] |
| April 23, 2026 | GPT-5.5 | OpenAI | 58.6% SWE-Bench Pro (tied with Kimi K2.6) and 88.7% SWE-bench Verified — takes Verified lead from Opus 4.7[10][11] |
| April 24, 2026 | DeepSeek V4 Preview | DeepSeek | 1.6T MoE, 80.6% SWE-bench Verified, 3,206 Codeforces, 1M context[7][8][9] |
The story isn't that any one Chinese model surpassed Anthropic's flagship. None of them did on SWE-Bench Pro. The story is that within three weeks, the open-weight frontier matched or beat every other closed frontier vendor — including GPT-5.5 — on at least one major coding benchmark, while undercutting them on price by 6× to 30×.
SWE-Bench Pro: The Hardest Public Coding Benchmark
SWE-Bench Pro evaluates an agent's ability to resolve real GitHub issues end-to-end. It is harder than SWE-bench Verified because it includes long-horizon and multi-file changes. Here is where the field stands today.
| Model | SWE-Bench Pro | Released | License |
|---|---|---|---|
| Claude Opus 4.7 | 64.3% | April 16, 2026 | Closed[4] |
| Kimi K2.6 | 58.6% | April 20, 2026 | Modified MIT[5][6] |
| GPT-5.5 | 58.6% | April 23, 2026 | Closed[10] |
| GLM-5.1 | 58.4% | April 7, 2026 | MIT[1][2] |
| GPT-5.4 | 57.7% | March 5, 2026 | Closed[1] |
| DeepSeek V4-Pro (max) | 55.4% | April 24, 2026 | MIT[7][8] |
| Claude Opus 4.6 | 53.4% | February 5, 2026 | Closed[6] |
Two things stand out. First, the gap from the open-weight leader (Kimi K2.6 at 58.6%) to the closed leader (Opus 4.7 at 64.3%) is 5.7 percentage points — wider than the gap between GPT-5.5 and Kimi K2.6 (zero). Second, GPT-5.5 and Kimi K2.6 are tied on this benchmark, but GPT-5.5 costs $5 per million input tokens and $30 per million output, while Kimi K2.6 lists at $0.95 input / $4.00 output on Moonshot's API.[12][13][14]
Judged on SWE-Bench Pro alone, the open-weight frontier has matched GPT-5.5 at a fraction of the price. The remaining gap is to Anthropic's Opus 4.7 — and that gap is real, but narrow enough that for many workloads it does not justify a 6-30× output-token premium.
SWE-bench Verified, LiveCodeBench, and Codeforces
SWE-Bench Pro is the headline benchmark, but it's not the only one. Different models win different categories.
| Model | SWE-bench Verified | LiveCodeBench | Codeforces Elo |
|---|---|---|---|
| GPT-5.5 | 88.7% (OpenAI)[11] | Not headline-reported | Not officially reported |
| Claude Opus 4.7 | 87.6%[4] | 89.6%[4] | Not officially reported |
| DeepSeek V4-Pro (max) | 80.6%[7][8] | 93.5%[7] | 3,206[7][8] |
| Kimi K2.6 | 80.2%[5] | Not headline-reported | Not headline-reported |
| Claude Opus 4.6 | 80.8%[7] | 88.8%[7] | Not officially reported |
DeepSeek V4-Pro wins LiveCodeBench (93.5%) and Codeforces (3,206 — ahead of GPT-5.4's 3,168). Kimi K2.6 sits within a fraction of a point of Opus 4.6 on SWE-bench Verified. GPT-5.5 narrowly leads SWE-bench Verified at 88.7% (OpenAI-reported), with Claude Opus 4.7 close behind at 87.6%. Both lead the open-weight frontier (DeepSeek V4-Pro at 80.6%) by roughly 7 to 8 points on that benchmark.
Note that vendor-reported SWE-bench Verified numbers use each lab's own scaffold and tool budget; on the independent vals.ai leaderboard, which fixes a neutral scaffold, GPT-5.5 lands at ~82.6%, closer to the open-weight pack.[15] Apples-to-apples comparison across labs is harder than the headline numbers suggest.
The pattern: open-weight models are competitive on competitive-programming and code-generation tasks (LiveCodeBench, Codeforces) and within striking distance on issue-resolution tasks (SWE-bench Verified). Long-horizon agentic resolution (SWE-Bench Pro) remains the area where Anthropic holds the clearest lead.
The Price Gap That Actually Matters
Per-token pricing tells the story of why these releases are reshaping production deployments.
| Model | Input ($/M) | Output ($/M) | Notes |
|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | Closed[12] |
| GPT-5.5 | $5.00 | $30.00 | Closed; nearly doubled vs GPT-5.4 ($2.50/$15)[13] |
| Kimi K2.6 (Moonshot API) | $0.95 | $4.00 | Modified MIT, self-host available[14] |
| DeepSeek V4-Pro (list) | $1.74 | $3.48 | MIT[9] |
| DeepSeek V4-Pro (promo) | $0.435 | $0.87 | 75% promo until May 31, 2026[9] |
| GLM-5.1 (OpenRouter) | $1.05 | $3.50 | MIT, varies by provider[16] |
On output cost alone:
- Kimi K2.6 is ~6× cheaper than Claude Opus 4.7 and ~7.5× cheaper than GPT-5.5.
- DeepSeek V4-Pro (list) is ~7× cheaper than Opus 4.7 and ~8.6× cheaper than GPT-5.5.
- DeepSeek V4-Pro (promo) is ~29× cheaper than Opus 4.7 — until May 31.
And these are inference-API prices. The MIT-licensed weights mean you can self-host either DeepSeek V4 or GLM-5.1 with zero per-token cost. For a heavy coding workload that emits millions of output tokens per day, the math gets uncomfortable fast for closed-source incumbents.
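For readers who want to check the multiples, here is a minimal sketch that recomputes them from the list prices in the table above; the only inputs are the quoted per-million-token output rates.

```python
# Recompute the output-cost multiples quoted above from list prices ($ per million output tokens).
# Promo pricing for DeepSeek V4-Pro runs until May 31, 2026.
OUTPUT_PRICE = {
    "Claude Opus 4.7": 25.00,
    "GPT-5.5": 30.00,
    "Kimi K2.6": 4.00,
    "DeepSeek V4-Pro (list)": 3.48,
    "DeepSeek V4-Pro (promo)": 0.87,
    "GLM-5.1 (OpenRouter)": 3.50,
}

open_weight = ["Kimi K2.6", "DeepSeek V4-Pro (list)", "DeepSeek V4-Pro (promo)", "GLM-5.1 (OpenRouter)"]
for model in open_weight:
    vs_opus = OUTPUT_PRICE["Claude Opus 4.7"] / OUTPUT_PRICE[model]
    vs_gpt55 = OUTPUT_PRICE["GPT-5.5"] / OUTPUT_PRICE[model]
    print(f"{model:26s} {vs_opus:5.1f}x cheaper than Opus 4.7, {vs_gpt55:5.1f}x cheaper than GPT-5.5")
```

Running it reproduces the ~6×, ~7.5×, ~8.6×, and ~29× figures above.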
Three Different Bets
The three Chinese labs did not all build the same kind of model. Each made a distinct architectural and product bet.
GLM-5.1: The MIT-Licensed Long-Horizon Bet
GLM-5.1 is a 754-billion-parameter MoE with 40B active per token, MIT-licensed, with a 200K context window and 128K maximum output.[17] Z.ai's pitch is long-horizon agentic coding: the model is engineered to run autonomously for up to eight hours of continuous task execution without human checkpoints.[2] It was trained on 100,000 Huawei Ascend 910B chips — a deliberate signal that Chinese labs can train frontier models without NVIDIA hardware.[2]
Best for: agentic engineering workflows where the model needs to plan, execute, verify, and revise across multiple files and sub-tasks without supervision.
Kimi K2.6: The 300-Agent Swarm Bet
Kimi K2.6 is a 1-trillion-parameter MoE with 32B active per token, Modified MIT licensed, with a 256K context window.[5][6] Moonshot's headline architectural feature is Agent Swarm: a single coding run can orchestrate up to 300 sub-agents executing 4,000 coordinated steps.[6] At launch, Moonshot demonstrated 13 hours of continuous autonomous coding on an open-source financial matching engine.[5]
Best for: workloads that decompose naturally into parallel sub-tasks — refactoring across hundreds of files, test generation, multi-language porting.
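Moonshot exposes the swarm through its own stack, but the underlying pattern it automates (decompose a job, fan sub-tasks out to parallel model calls, then merge the results) can be sketched against any OpenAI-compatible endpoint. The snippet below is a hedged illustration of that fan-out, not Moonshot's Agent Swarm API; the base URL, model identifier, and sub-task prompts are assumptions made for the example.

```python
# Hypothetical fan-out of independent coding sub-tasks to parallel model calls.
# This is NOT Moonshot's Agent Swarm API - just the generic pattern it automates at scale.
import os
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI  # pip install openai

# Assumed OpenAI-compatible endpoint and model name; check Moonshot's docs for actual values.
client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key=os.environ["MOONSHOT_API_KEY"])

SUB_TASKS = [
    "Port src/matching/order_book.py to use decimal.Decimal for prices.",
    "Add property-based tests for the order-matching invariants.",
    "Update the README section on tick-size handling.",
]

def run_sub_agent(task: str) -> str:
    """One sub-agent = one focused conversation over a narrow slice of the repo."""
    resp = client.chat.completions.create(
        model="kimi-k2.6",  # assumed model identifier
        messages=[
            {"role": "system", "content": "You are a coding sub-agent. Return a unified diff only."},
            {"role": "user", "content": task},
        ],
    )
    return resp.choices[0].message.content

# Fan out in parallel; a real orchestrator would also verify, reconcile, and merge the diffs.
with ThreadPoolExecutor(max_workers=len(SUB_TASKS)) as pool:
    diffs = list(pool.map(run_sub_agent, SUB_TASKS))
```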
DeepSeek V4: The Architecture-Driven Cost Bet
DeepSeek V4 ships as two models: V4-Pro (1.6T total / 49B active) and V4-Flash (284B total / 13B active), both under MIT license with a 1-million-token context window and 384K max output.[8][9] The architectural headline is Hybrid Attention — Compressed Sparse Attention and Heavily Compressed Attention layers that cut per-token inference FLOPs to 27% of V3.2's and KV cache to 10% at 1M-token context.[8] V4 is the only model in the wave with a million-token window.
Best for: large-codebase work where context length is the binding constraint — entire repositories, long debugging sessions, multi-document code review.
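To see why the KV-cache compression is the load-bearing claim at a 1M-token window, here is a back-of-the-envelope memory calculation. The layer count, KV-head count, and head dimension below are illustrative assumptions rather than published V4 specifications; only the "cache cut to 10%" figure comes from the release notes cited above.

```python
# Back-of-the-envelope KV-cache sizing at a 1M-token context.
# LAYERS / KV_HEADS / HEAD_DIM are assumed for illustration, not published V4 numbers.
LAYERS = 60          # assumed transformer depth
KV_HEADS = 8         # assumed grouped KV heads
HEAD_DIM = 128       # assumed per-head dimension
BYTES_PER_ELEM = 2   # bf16/fp16 cache entries
TOKENS = 1_000_000   # the 1M-token window

# Keys and values are cached for every layer and every token.
baseline_gb = LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * TOKENS / 1e9
compressed_gb = baseline_gb * 0.10  # the claimed 10% figure, applied as a flat factor

print(f"uncompressed KV cache at 1M tokens: {baseline_gb:.0f} GB")
print(f"with the claimed 10% compression:   {compressed_gb:.0f} GB")
```

Even under these modest assumptions, an uncompressed cache lands near 250 GB and would not fit on a single 80 GB accelerator, which is why the 10% figure, not the raw parameter count, is what makes the 1M-token window usable.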
License Differences That Matter in Production
The "open" in "open-weight" varies meaningfully across the three.
| Model | License | Commercial use | Self-host |
|---|---|---|---|
| GLM-5.1 | MIT | Yes, unrestricted[17] | Yes |
| DeepSeek V4 | MIT | Yes, unrestricted[8] | Yes |
| Kimi K2.6 | Modified MIT | Yes — attribution required for deployments >100M MAU or >$20M monthly revenue[14] | Yes |
| MiniMax M2.7 (earlier, March 18) | Restrictive Modified-MIT | "Written authorization required for commercial use"[18] | Allowed for non-commercial |
GLM-5.1 and DeepSeek V4 shipping under standard MIT is the legal end-state most enterprise buyers want. Kimi K2.6's Modified MIT functions as standard MIT for the vast majority of teams — its one added clause requires prominent "Kimi K2" attribution only for deployments exceeding 100 million monthly active users or $20 million in monthly revenue.[14] MiniMax M2.7 — released earlier, on March 18, 2026, ahead of the April wave — illustrates the approach that has not caught on: a "modified MIT" with non-commercial restrictions that requires prior written authorization for any commercial use.[18]
For enterprise procurement, the practical hierarchy is: GLM-5.1 ≈ DeepSeek V4 (no friction) → Kimi K2.6 (light review) → MiniMax M2.7 (separate license).
What Each Model Actually Wins At
Comparing the three Chinese open-weight models head-to-head — and against the closed frontier where relevant:
- Best open-weight SWE-Bench Pro: Kimi K2.6 (58.6%), narrowly ahead of GLM-5.1 (58.4%); also ties closed-source GPT-5.5 (58.6%)
- Best LiveCodeBench (vendor-reported): DeepSeek V4-Pro (93.5%)
- Best Codeforces: DeepSeek V4-Pro (3,206 Elo)
- Best long-horizon autonomous coding: GLM-5.1 (8-hour continuous task execution)
- Best parallel sub-task orchestration: Kimi K2.6 (300-agent swarm)
- Best large-codebase context (among the three): DeepSeek V4 (1M tokens)
- Best price per output token: DeepSeek V4-Pro promo (until May 31)
- Cleanest license for enterprise procurement: GLM-5.1 or DeepSeek V4 (standard MIT)
The gap to Claude Opus 4.7 on SWE-Bench Pro (5.7 points) and SWE-bench Verified (~7 points) remains real. For mission-critical autonomous coding agents where every percentage point of issue-resolution accuracy matters, Opus 4.7 still has the lead — at 6× to 30× the output-token cost depending on which open-weight model and pricing tier you compare it to.
The Gap That Remains
It would be wrong to say the open-weight frontier has caught the closed frontier. Claude Opus 4.7's 64.3% on SWE-Bench Pro is the highest result publicly reported on that benchmark, and on SWE-bench Verified the closed frontier — GPT-5.5 at 88.7% and Opus 4.7 at 87.6% — still sits roughly 7-8 points above the open-weight leaders. On long-horizon agentic tasks involving multi-file resolution under ambiguity, Opus 4.7 currently wins.
But on most coding tasks that are not at the absolute frontier of difficulty, the open-weight models are now within statistical noise of GPT-5.5 — and dramatically cheaper. The interesting question is no longer "can open-weight catch the closed frontier?" but "for which subset of coding workloads does the closed-frontier premium still pay for itself?"
For high-stakes one-shot agentic resolution — yes. For bulk code generation, refactoring, test writing, multi-file editing, and CI-integration loops — increasingly, no.
Bottom Line
In a 17-day window from April 7 to April 24, 2026, three Chinese open-weight coding models — GLM-5.1, Kimi K2.6, and DeepSeek V4 — closed almost the entire benchmark gap to GPT-5.5 while undercutting GPT-5.5 and Claude Opus 4.7 on price by 6× to 30×. Anthropic's Claude Opus 4.7 still leads on the hardest agentic coding benchmark (SWE-Bench Pro at 64.3%), but for the broader class of coding workloads, MIT-licensed open-weight models with 200K-to-1M-token context windows are now a credible production default. The 2026 frontier of coding LLMs is not a single line. It is a cluster — and the open-weight side of that cluster has, for the first time, become competitive on every dimension except the very hardest.
Footnotes
1. Winbuzzer, Z.ai Releases GLM-5.1: 754B Model Tops SWE-Bench Pro, April 9, 2026. https://winbuzzer.com/2026/04/09/z-ai-releases-glm-5-1-754b-model-tops-swe-bench-pro-xcxwbn/
2. NerdLevelTech, GLM-5.1: The Open-Source Model That Beat GPT-5.4, April 19, 2026. /glm-5-1-open-source-beats-gpt-coding-benchmarks
3. TheNextWeb, Claude Opus 4.7 leads on SWE-bench and agentic reasoning, April 16, 2026. https://thenextweb.com/news/anthropic-claude-opus-4-7-coding-agentic-benchmarks-release
4. LLM-Stats, Claude Opus 4.7 Benchmarks, Pricing & Context Window. https://llm-stats.com/models/claude-opus-4-7
5. Kimi.com Blog, Kimi K2.6 Tech Blog: Advancing Open-Source Coding. https://www.kimi.com/blog/kimi-k2-6
6. NerdLevelTech, Kimi K2.6: Open-Weight 300-Agent Swarm Tops GPT-5.4, April 27, 2026. /kimi-k2-6-300-agent-swarm-open-weight-frontier-coding
7. DeepSeek API Docs, DeepSeek V4 Preview Release, April 24, 2026. https://api-docs.deepseek.com/news/news260424
8. NerdLevelTech, DeepSeek V4: Open-Weight Frontier at 1/7 the Cost, May 2, 2026. /deepseek-v4-open-source-frontier-million-token-context
9. DeepSeek API Docs, Models & Pricing. https://api-docs.deepseek.com/quick_start/pricing/
10. OpenAI, Introducing GPT-5.5, April 23, 2026. https://openai.com/index/introducing-gpt-5-5/
11. TokenMix, GPT-5.5 Review: 88.7% SWE-Bench, 92.4% MMLU, 2x Price Tag (2026). https://tokenmix.ai/blog/gpt-5-5-spud-review-88-swe-bench-2026 ; OpenAI announcement, April 23, 2026.
12. Anthropic Pricing. https://www.anthropic.com/claude/opus
13. OpenAI API Pricing. https://openai.com/api/pricing/
14. Moonshot AI on Hugging Face, Kimi-K2.6 Model Card. https://huggingface.co/moonshotai/Kimi-K2.6
15. vals.ai SWE-bench leaderboard, May 2026.
16. OpenRouter, GLM-5.1 API Pricing & Benchmarks. https://openrouter.ai/z-ai/glm-5.1
17. LLM-Stats, GLM-5.1 Benchmarks, Pricing & Context Window. https://llm-stats.com/models/glm-5.1
18. MarkTechPost, MiniMax Just Open Sourced MiniMax M2.7, April 12, 2026. https://www.marktechpost.com/2026/04/12/minimax-just-open-sourced-minimax-m2-7-a-self-evolving-agent-model-that-scores-56-22-on-swe-pro-and-57-0-on-terminal-bench-2/