Chinese Open-Weight Coding LLMs: 2026's Three-Week Sweep

May 11, 2026

TL;DR

Between April 7 and April 24, 2026, three Chinese labs shipped open-weight coding models that closed almost the entire gap to the Western frontier. GLM-5.1 (Z.ai, April 7) became the first open-weight model to top SWE-Bench Pro at 58.4%, beating GPT-5.4's 57.7%.[1][2] Nine days later Claude Opus 4.7 (Anthropic, April 16) reclaimed the lead at 64.3% and pushed SWE-bench Verified to 87.6%.[3][4] Kimi K2.6 (Moonshot AI, April 20) answered with 58.6% on SWE-Bench Pro and 80.2% on SWE-bench Verified, using a 1-trillion-parameter MoE that orchestrates up to 300 sub-agents in a single coding run.[5][6] Four days later, DeepSeek V4 Preview (April 24) shipped a 1.6T MoE with 1M-token context, 80.6% SWE-bench Verified, a 3,206 Codeforces rating (above GPT-5.4's 3,168), and MIT-licensed weights at roughly one-seventh the output cost of Claude Opus 4.7.[7][8][9] None of the three closes the gap to Opus 4.7 on SWE-Bench Pro, but on price-per-task, agent orchestration, context length, and license clarity, the open-weight frontier is now within touching distance of the closed one.


What You'll Learn

  • The exact 17-day timeline of releases — and why the order matters
  • How each model scores on SWE-Bench Pro, SWE-bench Verified, LiveCodeBench, and Codeforces
  • The price gap between the three Chinese open-weight models and GPT-5.5 / Claude Opus 4.7
  • Why MIT vs Modified MIT vs non-commercial licensing changes who can actually use these weights
  • The three different bets each lab made — and which one fits your workload
  • Where the remaining gap to Claude Opus 4.7 still hurts, and where it doesn't

A 17-Day Window That Changed the Open-Weight Frontier

Before April 7, the open-weight coding leaderboard sat well behind the closed frontier. SWE-Bench Pro's top open-weight entrant trailed GPT-5.4 by several points. SWE-bench Verified was a wider gap. Then three releases landed in seventeen days.

| Date | Release | Lab | Headline result |
| --- | --- | --- | --- |
| April 7, 2026 | GLM-5.1 | Z.ai (formerly Zhipu) | First open-weight #1 on SWE-Bench Pro at 58.4% [1][2] |
| April 16, 2026 | Claude Opus 4.7 | Anthropic | Reclaims SWE-Bench Pro lead at 64.3%, ships 87.6% SWE-bench Verified [3][4] |
| April 20, 2026 | Kimi K2.6 | Moonshot AI | 58.6% SWE-Bench Pro, 80.2% SWE-bench Verified, 300-agent swarm [5][6] |
| April 23, 2026 | GPT-5.5 | OpenAI | 58.6% SWE-Bench Pro (tied with Kimi K2.6) and 88.7% SWE-bench Verified; takes the Verified lead from Opus 4.7 [10][11] |
| April 24, 2026 | DeepSeek V4 Preview | DeepSeek | 1.6T MoE, 80.6% SWE-bench Verified, 3,206 Codeforces, 1M context [7][8][9] |

The story isn't that any one Chinese model surpassed Anthropic's flagship. None of them did on SWE-Bench Pro. The story is that within three weeks, the open-weight frontier matched or beat every other closed frontier vendor — including GPT-5.5 — on at least one major coding benchmark, while undercutting them on price by 6× to 30×.


SWE-Bench Pro: The Hardest Public Coding Benchmark

SWE-Bench Pro evaluates an agent's ability to resolve real GitHub issues end-to-end. It is harder than SWE-bench Verified because it includes long-horizon and multi-file changes. Here is where the field stands today.

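| Model | SWE-Bench Pro | Released | License |
| --- | --- | --- | --- |
| Claude Opus 4.7 | 64.3% | April 16, 2026 | Closed [4] |
| Kimi K2.6 | 58.6% | April 20, 2026 | Modified MIT [5][6] |
| GPT-5.5 | 58.6% | April 23, 2026 | Closed [10] |
| GLM-5.1 | 58.4% | April 7, 2026 | MIT [1][2] |
| GPT-5.4 | 57.7% | March 5, 2026 | Closed [1] |
| DeepSeek V4-Pro (max) | 55.4% | April 24, 2026 | MIT [7][8] |
| Claude Opus 4.6 | 53.4% | February 5, 2026 | Closed [6] |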

Two things stand out. First, the gap from the open-weight leader (Kimi K2.6 at 58.6%) to the closed leader (Opus 4.7 at 64.3%) is 5.7 percentage points, wider than the gap between GPT-5.5 and Kimi K2.6 (zero). Second, GPT-5.5 and Kimi K2.6 are tied on this benchmark, but GPT-5.5 costs $5 per million input tokens and $30 per million output, while Kimi K2.6 lists at $0.95 input / $4.00 output on Moonshot's API.[12][13][14]

If you score on SWE-Bench Pro alone, the open-weight frontier has matched GPT-5.5 at a fraction of the price. The remaining gap is to Anthropic's Opus 4.7 — and that gap is real but narrow enough that for many workloads it does not justify a 6-30× output-token premium.
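To make "a fraction of the price" concrete, the sketch below estimates API spend per resolved issue for the two tied models. The scores and prices are the ones quoted above; the per-attempt token budget is a hypothetical placeholder, and treating a benchmark score as an independent per-attempt success probability is a simplifying assumption rather than something the benchmark guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    resolve_rate: float   # SWE-Bench Pro score as a fraction (from the article)
    input_price: float    # USD per million input tokens (from the article)
    output_price: float   # USD per million output tokens (from the article)


def cost_per_attempt(model: Model, input_tokens: int, output_tokens: int) -> float:
    """API spend in USD for one attempt at an issue."""
    return (input_tokens * model.input_price
            + output_tokens * model.output_price) / 1_000_000


def cost_per_resolved(model: Model, input_tokens: int, output_tokens: int) -> float:
    """Expected spend per resolved issue, treating the benchmark score as an
    independent per-attempt success probability (a simplifying assumption)."""
    return cost_per_attempt(model, input_tokens, output_tokens) / model.resolve_rate


GPT_5_5 = Model("GPT-5.5", 0.586, 5.00, 30.00)
KIMI_K2_6 = Model("Kimi K2.6", 0.586, 0.95, 4.00)

# Hypothetical per-attempt token budget; measure your own agent runs instead.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000_000, 100_000

for m in (GPT_5_5, KIMI_K2_6):
    spend = cost_per_resolved(m, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{m.name}: ${spend:.2f} per resolved issue")
```

Because the two scores are tied, the ratio between the two results is just the ratio of per-attempt spend, which in turn depends on how input-heavy or output-heavy your agent runs are.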


SWE-bench Verified, LiveCodeBench, and Codeforces

SWE-Bench Pro is the headline benchmark, but it's not the only one. Different models win different categories.

| Model | SWE-bench Verified | LiveCodeBench | Codeforces Elo |
| --- | --- | --- | --- |
| GPT-5.5 | 88.7% (OpenAI) [11] | Not headline-reported | Not officially reported |
| Claude Opus 4.7 | 87.6% [4] | 89.6% [4] | Not officially reported |
| DeepSeek V4-Pro (max) | 80.6% [7][8] | 93.5% [7] | 3,206 [7][8] |
| Kimi K2.6 | 80.2% [5] | Not headline-reported | Not headline-reported |
| Claude Opus 4.6 | 80.8% [7] | 88.8% [7] | Not officially reported |

DeepSeek V4-Pro wins LiveCodeBench (93.5%) and Codeforces (3,206 — ahead of GPT-5.4's 3,168). Kimi K2.6 sits within a fraction of a point of Opus 4.6 on SWE-bench Verified. GPT-5.5 narrowly leads SWE-bench Verified at 88.7% (OpenAI-reported), with Claude Opus 4.7 close behind at 87.6%. Both lead the open-weight frontier (DeepSeek V4-Pro at 80.6%) by roughly 7 to 8 points on that benchmark.

Note that vendor-reported SWE-bench Verified numbers use each lab's own scaffold and tool budget; on the independent vals.ai leaderboard, which fixes a neutral scaffold, GPT-5.5 lands at ~82.6%, closer to the open-weight pack.[15] Apples-to-apples comparison across labs is harder than the headline numbers suggest.

The pattern: open-weight models are competitive on competitive-programming and code-generation tasks (LiveCodeBench, Codeforces) and within striking distance on issue-resolution tasks (SWE-bench Verified). Long-horizon agentic resolution (SWE-Bench Pro) remains the area where Anthropic holds the clearest lead.


The Price Gap That Actually Matters

Per-token pricing tells the story of why these releases are reshaping production deployments.

| Model | Input ($/M) | Output ($/M) | Notes |
| --- | --- | --- | --- |
| Claude Opus 4.7 | $5.00 | $25.00 | Closed [12] |
| GPT-5.5 | $5.00 | $30.00 | Closed; doubled vs GPT-5.4 ($2.50/$15) [13] |
| Kimi K2.6 (Moonshot API) | $0.95 | $4.00 | Modified MIT, self-host available [14] |
| DeepSeek V4-Pro (list) | $1.74 | $3.48 | MIT [9] |
| DeepSeek V4-Pro (promo) | $0.435 | $0.87 | 75% promo until May 31, 2026 [9] |
| GLM-5.1 (OpenRouter) | $1.05 | $3.50 | MIT, varies by provider [16] |

On output cost alone:

  • Kimi K2.6 is ~6× cheaper than Claude Opus 4.7 and ~7.5× cheaper than GPT-5.5.
  • DeepSeek V4-Pro (list) is ~7× cheaper than Opus 4.7 and ~8.6× cheaper than GPT-5.5.
  • DeepSeek V4-Pro (promo) is ~29× cheaper than Opus 4.7 — until May 31.
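The multipliers above are plain divisions of the output prices in the pricing table. A short sketch that reproduces them (the dictionary and function names are illustrative only):

```python
# Output prices in USD per million tokens, copied from the pricing table.
OUTPUT_PRICE = {
    "Claude Opus 4.7": 25.00,
    "GPT-5.5": 30.00,
    "Kimi K2.6": 4.00,
    "DeepSeek V4-Pro (list)": 3.48,
    "DeepSeek V4-Pro (promo)": 0.87,
}

CLOSED = ("Claude Opus 4.7", "GPT-5.5")
OPEN = ("Kimi K2.6", "DeepSeek V4-Pro (list)", "DeepSeek V4-Pro (promo)")


def output_ratio(closed_model: str, open_model: str) -> float:
    """How many times more the closed model charges per output token."""
    return OUTPUT_PRICE[closed_model] / OUTPUT_PRICE[open_model]


for open_model in OPEN:
    for closed_model in CLOSED:
        ratio = output_ratio(closed_model, open_model)
        print(f"{open_model} vs {closed_model}: {ratio:.2f}x cheaper on output")
```

Input-token prices are ignored here, so the real-world multiplier for an input-heavy agent run will differ from these figures.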

And these are inference-API prices. Because the weights are MIT-licensed, you can self-host either DeepSeek V4 or GLM-5.1 and pay no per-token API fee, only your own hardware and operations costs. For a heavy coding workload that emits millions of output tokens per day, the math gets uncomfortable fast for closed-source incumbents.
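Whether self-hosting actually wins depends on volume. As a rough sketch, the function below computes the monthly output-token volume at which API spend equals a fixed self-hosting bill. The $40,000 monthly cluster cost is a made-up placeholder, and the model ignores input tokens and utilization, so treat it as a starting point rather than a sizing tool.

```python
def breakeven_output_tokens(monthly_fixed_cost_usd: float,
                            api_output_price_per_m: float) -> float:
    """Output tokens per month at which API spend matches a fixed hosting cost.

    Ignores input-token charges and assumes the self-hosted cluster can
    actually serve that volume (both simplifying assumptions).
    """
    return monthly_fixed_cost_usd / api_output_price_per_m * 1_000_000


# Hypothetical monthly cost of a self-hosted inference cluster.
CLUSTER_COST_USD = 40_000

# Output prices (USD per million tokens) from the pricing table.
for label, price in (("DeepSeek V4-Pro list API", 3.48),
                     ("Claude Opus 4.7 API", 25.00)):
    tokens = breakeven_output_tokens(CLUSTER_COST_USD, price)
    print(f"vs {label}: break-even at {tokens / 1e9:.2f}B output tokens/month")
```

The comparison against a cheap open-weight API needs far more volume to break even than the comparison against a closed-frontier API, which is why the hosted open-weight endpoints are often the first step.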


Three Different Bets

The three Chinese labs did not all build the same kind of model. Each made a distinct architectural and product bet.

GLM-5.1: The MIT-Licensed Long-Horizon Bet

GLM-5.1 is a 754-billion-parameter MoE with 40B active per token, MIT-licensed, with a 200K context window and 128K maximum output.[17] Z.ai's pitch is long-horizon agentic coding: the model is engineered to run autonomously for up to eight hours of continuous task execution without human checkpoints.[2] It was trained on 100,000 Huawei Ascend 910B chips, a deliberate signal that Chinese labs can train frontier models without NVIDIA hardware.[2]

Best for: agentic engineering workflows where the model needs to plan, execute, verify, and revise across multiple files and sub-tasks without supervision.

Kimi K2.6: The 300-Agent Swarm Bet

Kimi K2.6 is a 1-trillion-parameter MoE with 32B active per token, Modified MIT licensed, with a 256K context window.[5][6] Moonshot's headline architectural feature is Agent Swarm: a single coding run can orchestrate up to 300 sub-agents executing 4,000 coordinated steps.[6] Moonshot demonstrated 13 hours of continuous autonomous coding on an open-source financial matching engine during the launch.[5]

Best for: workloads that decompose naturally into parallel sub-tasks — refactoring across hundreds of files, test generation, multi-language porting.
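To illustrate what "decomposes naturally into parallel sub-tasks" means in practice, here is a generic bounded fan-out pattern in Python's asyncio. It is not Moonshot's Agent Swarm implementation, whose internals the sources above do not describe; `run_sub_agent` is a hypothetical stand-in for a real model call, and only the 300-agent cap comes from the article.

```python
import asyncio

MAX_SUB_AGENTS = 300  # cap quoted in the article; everything else is hypothetical


async def run_sub_agent(task: str) -> str:
    """Stand-in for one sub-agent; a real one would call a model API."""
    await asyncio.sleep(0)  # yield to the event loop, simulating I/O
    return f"done: {task}"


async def orchestrate(tasks: list[str],
                      max_agents: int = MAX_SUB_AGENTS) -> list[str]:
    """Run all tasks concurrently with at most max_agents in flight."""
    gate = asyncio.Semaphore(max_agents)

    async def guarded(task: str) -> str:
        async with gate:  # blocks while max_agents sub-agents are running
            return await run_sub_agent(task)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(t) for t in tasks))


if __name__ == "__main__":
    files = [f"src/module_{i}.py" for i in range(1000)]
    results = asyncio.run(orchestrate(files))
    print(f"{len(results)} sub-tasks finished")
```

A workload fits this shape when each sub-task (one file, one test module, one language target) can finish without waiting on the others; tasks with heavy cross-dependencies gain much less from a swarm.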

DeepSeek V4: The Architecture-Driven Cost Bet

DeepSeek V4 ships as two models: V4-Pro (1.6T total / 49B active) and V4-Flash (284B total / 13B active), both under MIT license with a 1-million-token context window and 384K max output.[8][9] The architectural headline is Hybrid Attention: Compressed Sparse Attention and Heavily Compressed Attention layers that cut per-token inference FLOPs to 27% of V3.2's and KV cache to 10% at 1M-token context.[8] V4 is the only one of the three open-weight models with a million-token window.

Best for: large-codebase work where context length is the binding constraint — entire repositories, long debugging sessions, multi-document code review.
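All three labs ship sparse MoE models, so only a small share of the total parameters is used for any given token. The sketch below computes that share from the total and active parameter counts quoted in this section (the dictionary layout is illustrative):

```python
# (total parameters, active parameters per token), in billions, from the article.
MOE_SIZES = {
    "GLM-5.1": (754, 40),
    "Kimi K2.6": (1000, 32),
    "DeepSeek V4-Pro": (1600, 49),
    "DeepSeek V4-Flash": (284, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


for name, (total_b, active_b) in MOE_SIZES.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {active_b}B of {total_b}B active ({share:.1%})")
```

Active parameters drive per-token compute, while total parameters drive the memory needed to hold the weights, so a low active share helps serving cost more than it helps hardware footprint.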


License Differences That Matter in Production

The "open" in "open-weight" varies meaningfully across the three.

| Model | License | Commercial use | Self-host |
| --- | --- | --- | --- |
| GLM-5.1 | MIT | Yes, unrestricted [17] | Yes |
| DeepSeek V4 | MIT | Yes, unrestricted [8] | Yes |
| Kimi K2.6 | Modified MIT | Yes; attribution required for deployments >100M MAU or >$20M monthly revenue [14] | Yes |
| MiniMax M2.7 (earlier, March 18) | Restrictive Modified-MIT | "Written authorization required for commercial use" [18] | Allowed for non-commercial |

GLM-5.1 and DeepSeek V4 shipping under standard MIT is the legal end-state most enterprise buyers want. Kimi K2.6's Modified MIT functions as standard MIT for the vast majority of teams: its one added clause requires prominent "Kimi K2" attribution only for deployments exceeding 100 million monthly active users or $20 million in monthly revenue.[14] MiniMax M2.7, released earlier on March 18, 2026, ahead of the April wave, illustrates the pattern that has not taken hold: a "modified MIT" with non-commercial restrictions that requires prior written authorization for any commercial use.[18]
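The Kimi K2.6 clause as summarized above reduces to a two-threshold check. The sketch below encodes this article's summary only, not the license text itself, and the function and constant names are illustrative:

```python
# Thresholds as summarized in the article, not quoted from the license text.
MAU_THRESHOLD = 100_000_000
MONTHLY_REVENUE_THRESHOLD_USD = 20_000_000


def kimi_attribution_required(monthly_active_users: int,
                              monthly_revenue_usd: float) -> bool:
    """True if a deployment exceeds either threshold in the summarized clause."""
    return (monthly_active_users > MAU_THRESHOLD
            or monthly_revenue_usd > MONTHLY_REVENUE_THRESHOLD_USD)


# A mid-size SaaS product: well under both thresholds.
print(kimi_attribution_required(2_000_000, 1_500_000))

# A consumer app with very large reach but modest revenue.
print(kimi_attribution_required(150_000_000, 5_000_000))
```

Either condition alone triggers the attribution requirement, so a high-revenue product with few users is covered just as a high-traffic one is.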

For enterprise procurement, the practical hierarchy is: GLM-5.1 ≈ DeepSeek V4 (no friction) → Kimi K2.6 (light review) → MiniMax M2.7 (separate license).


What Each Model Actually Wins At

Comparing the three Chinese open-weight models head-to-head — and against the closed frontier where relevant:

  • Best open-weight SWE-Bench Pro: Kimi K2.6 (58.6%), narrowly ahead of GLM-5.1 (58.4%); also ties closed-source GPT-5.5 (58.6%)
  • Best LiveCodeBench (vendor-reported): DeepSeek V4-Pro (93.5%)
  • Best Codeforces: DeepSeek V4-Pro (3,206 Elo)
  • Best long-horizon autonomous coding: GLM-5.1 (8-hour continuous task execution)
  • Best parallel sub-task orchestration: Kimi K2.6 (300-agent swarm)
  • Best large-codebase context (among the three): DeepSeek V4 (1M tokens)
  • Best price per output token: DeepSeek V4-Pro promo (until May 31)
  • Cleanest license for enterprise procurement: GLM-5.1 or DeepSeek V4 (standard MIT)
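The benchmark entries in this list can be rederived from the scores reported earlier. The sketch below picks the open-weight leader per benchmark and skips models with no reported score; where only one of the three has a published number, it leads by default, which is worth keeping in mind when reading the result.

```python
# Scores for the three open-weight models as reported in this article.
# Missing keys mean the article gives no headline number for that model.
SCORES = {
    "GLM-5.1": {"SWE-Bench Pro": 58.4},
    "Kimi K2.6": {"SWE-Bench Pro": 58.6, "SWE-bench Verified": 80.2},
    "DeepSeek V4-Pro": {
        "SWE-Bench Pro": 55.4,
        "SWE-bench Verified": 80.6,
        "LiveCodeBench": 93.5,
        "Codeforces": 3206,
    },
}


def leader(benchmark: str):
    """Return the top-scoring model on a benchmark, or None if unreported."""
    reported = {model: scores[benchmark]
                for model, scores in SCORES.items()
                if benchmark in scores}
    if not reported:
        return None
    return max(reported, key=reported.get)


for bench in ("SWE-Bench Pro", "SWE-bench Verified", "LiveCodeBench", "Codeforces"):
    print(f"{bench}: {leader(bench)}")
```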

The gap to Claude Opus 4.7 on SWE-Bench Pro (5.7 points) and SWE-bench Verified (~7 points) remains real. For mission-critical autonomous coding agents where every percentage point of issue-resolution accuracy matters, Opus 4.7 still has the lead — at 6× to 30× the output-token cost depending on which open-weight model and pricing tier you compare it to.


The Gap That Remains

It would be wrong to say the open-weight frontier has caught the closed frontier. Claude Opus 4.7's 64.3% on SWE-Bench Pro is the highest result publicly reported on that benchmark, and on SWE-bench Verified the closed frontier — GPT-5.5 at 88.7% and Opus 4.7 at 87.6% — still sits roughly 7-8 points above the open-weight leaders. On long-horizon agentic tasks involving multi-file resolution under ambiguity, Opus 4.7 currently wins.

But on most coding tasks that are not at the absolute frontier of difficulty, the open-weight models are now within statistical noise of GPT-5.5 — and dramatically cheaper. The interesting question is no longer "can open-weight catch the closed frontier?" but "for which subset of coding workloads does the closed-frontier premium still pay for itself?"

For high-stakes one-shot agentic resolution, the premium still pays for itself. For bulk code generation, refactoring, test writing, multi-file editing, and CI-integration loops, increasingly it does not.


Bottom Line

In a 17-day window from April 7 to April 24, 2026, three Chinese open-weight coding models (GLM-5.1, Kimi K2.6, and DeepSeek V4) closed almost the entire benchmark gap to GPT-5.5 while undercutting GPT-5.5 and Claude Opus 4.7 on price by 6× to 30×. Anthropic's Claude Opus 4.7 still leads on the hardest agentic coding benchmark (SWE-Bench Pro at 64.3%), but for the broader class of coding workloads, open-weight models under MIT or Modified MIT licenses, with 200K-to-1M-token context windows, are now a credible production default. The 2026 frontier of coding LLMs is not a single line. It is a cluster, and the open-weight side of that cluster has, for the first time, become competitive on every dimension except the very hardest.


Footnotes

  1. Winbuzzer, Z.ai Releases GLM-5.1: 754B Model Tops SWE-Bench Pro, April 9, 2026. https://winbuzzer.com/2026/04/09/z-ai-releases-glm-5-1-754b-model-tops-swe-bench-pro-xcxwbn/

  2. NerdLevelTech, GLM-5.1: The Open-Source Model That Beat GPT-5.4, April 19, 2026. /glm-5-1-open-source-beats-gpt-coding-benchmarks

  3. TheNextWeb, Claude Opus 4.7 leads on SWE-bench and agentic reasoning, April 16, 2026. https://thenextweb.com/news/anthropic-claude-opus-4-7-coding-agentic-benchmarks-release

  4. LLM-Stats, Claude Opus 4.7 Benchmarks, Pricing & Context Window. https://llm-stats.com/models/claude-opus-4-7

  5. Kimi.com Blog, Kimi K2.6 Tech Blog: Advancing Open-Source Coding. https://www.kimi.com/blog/kimi-k2-6

  6. NerdLevelTech, Kimi K2.6: Open-Weight 300-Agent Swarm Tops GPT-5.4, April 27, 2026. /kimi-k2-6-300-agent-swarm-open-weight-frontier-coding

  7. DeepSeek API Docs, DeepSeek V4 Preview Release, April 24, 2026. https://api-docs.deepseek.com/news/news260424

  8. NerdLevelTech, DeepSeek V4: Open-Weight Frontier at 1/7 the Cost, May 2, 2026. /deepseek-v4-open-source-frontier-million-token-context

  9. DeepSeek API Docs, Models & Pricing. https://api-docs.deepseek.com/quick_start/pricing/

  10. OpenAI, Introducing GPT-5.5, April 23, 2026. https://openai.com/index/introducing-gpt-5-5/

  11. TokenMix, GPT-5.5 Review: 88.7% SWE-Bench, 92.4% MMLU, 2x Price Tag (2026). https://tokenmix.ai/blog/gpt-5-5-spud-review-88-swe-bench-2026 ; OpenAI announcement, April 23, 2026.

  12. Anthropic Pricing. https://www.anthropic.com/claude/opus

  13. OpenAI API Pricing. https://openai.com/api/pricing/

  14. Moonshot AI on Hugging Face, Kimi-K2.6 Model Card. https://huggingface.co/moonshotai/Kimi-K2.6

  15. vals.ai SWE-bench leaderboard, May 2026.

  16. OpenRouter, GLM-5.1 API Pricing & Benchmarks. https://openrouter.ai/z-ai/glm-5.1

  17. LLM-Stats, GLM-5.1 Benchmarks, Pricing & Context Window. https://llm-stats.com/models/glm-5.1

  18. MarkTechPost, MiniMax Just Open Sourced MiniMax M2.7, April 12, 2026. https://www.marktechpost.com/2026/04/12/minimax-just-open-sourced-minimax-m2-7-a-self-evolving-agent-model-that-scores-56-22-on-swe-pro-and-57-0-on-terminal-bench-2/

Frequently Asked Questions

It depends on the workload. Kimi K2.6 leads on SWE-Bench Pro at 58.6%, narrowly ahead of GLM-5.1 (58.4%) and well ahead of DeepSeek V4-Pro (55.4%). DeepSeek V4-Pro leads on LiveCodeBench (93.5%) and Codeforces (3,206 Elo). GLM-5.1 leads on long-horizon autonomous execution. No single model wins every benchmark, so the deciding factors are licensing, context length, price, and which kind of coding workload you optimize for.