Chinese Open-Weight Coding LLMs: 2026's Three-Week Sweep

May 11, 2026


TL;DR

Between April 7 and April 24, 2026, three Chinese labs shipped open-weight coding models that closed almost the entire gap to the Western frontier. GLM-5.1 (Z.ai, April 7) became the first open-weight model to top SWE-Bench Pro at 58.4%, beating GPT-5.4's 57.7% [1][2]. Nine days later Claude Opus 4.7 (Anthropic, April 16) reclaimed the lead at 64.3% and pushed SWE-bench Verified to 87.6% [3][4]. Kimi K2.6 (Moonshot AI, April 20) answered with 58.6% on SWE-Bench Pro and 80.2% on SWE-bench Verified — a 1-trillion-parameter MoE that orchestrates up to 300 sub-agents in a single coding run [5][6]. Four days later, DeepSeek V4 Preview (April 24) shipped a 1.6T MoE with 1M-token context, 80.6% SWE-bench Verified, a 3,206 Codeforces rating (above GPT-5.4's 3,168), and MIT-licensed weights at roughly one-seventh the output cost of Claude Opus 4.7 [7][8][9]. None of the three closes the gap to Opus 4.7 on SWE-Bench Pro — but on price-per-task, agent orchestration, context length, and license clarity, the open-weight frontier is now within touching distance of the closed one.


What You'll Learn

  • The exact 17-day timeline of releases — and why the order matters
  • How each model scores on SWE-Bench Pro, SWE-bench Verified, LiveCodeBench, and Codeforces
  • The price gap between the three Chinese open-weight models and GPT-5.5 / Claude Opus 4.7
  • Why MIT vs Modified MIT vs non-commercial licensing changes who can actually use these weights
  • The three different bets each lab made — and which one fits your workload
  • Where the remaining gap to Claude Opus 4.7 still hurts, and where it doesn't

A 17-Day Window That Changed the Open-Weight Frontier

Before April 7, the open-weight coding leaderboard sat well behind the closed frontier. SWE-Bench Pro's top open-weight entrant trailed GPT-5.4 by several points, and the gap on SWE-bench Verified was wider still. Then five releases, three of them open-weight, landed in seventeen days.

| Date | Release | Lab | Headline result |
|---|---|---|---|
| April 7, 2026 | GLM-5.1 | Z.ai (formerly Zhipu) | First open-weight #1 on SWE-Bench Pro at 58.4% [1][2] |
| April 16, 2026 | Claude Opus 4.7 | Anthropic | Reclaims SWE-Bench Pro lead at 64.3%, ships 87.6% SWE-bench Verified [3][4] |
| April 20, 2026 | Kimi K2.6 | Moonshot AI | 58.6% SWE-Bench Pro, 80.2% SWE-bench Verified, 300-agent swarm [5][6] |
| April 23, 2026 | GPT-5.5 | OpenAI | 58.6% SWE-Bench Pro (tied with Kimi K2.6) and 88.7% SWE-bench Verified — takes Verified lead from Opus 4.7 [10][11] |
| April 24, 2026 | DeepSeek V4 Preview | DeepSeek | 1.6T MoE, 80.6% SWE-bench Verified, 3,206 Codeforces, 1M context [7][8][9] |

The story isn't that any one Chinese model surpassed Anthropic's flagship. None of them did on SWE-Bench Pro. The story is that within three weeks, the open-weight frontier matched or beat every other closed frontier vendor — including GPT-5.5 — on at least one major coding benchmark, while undercutting them on price by 6× to 30×.


SWE-Bench Pro: The Hardest Public Coding Benchmark

SWE-Bench Pro evaluates an agent's ability to resolve real GitHub issues end-to-end. It is harder than SWE-bench Verified because it includes long-horizon and multi-file changes. Here is where the field stands today.

| Model | SWE-Bench Pro | Released | License |
|---|---|---|---|
| Claude Opus 4.7 | 64.3% | April 16, 2026 | Closed [4] |
| Kimi K2.6 | 58.6% | April 20, 2026 | Modified MIT [5][6] |
| GPT-5.5 | 58.6% | April 23, 2026 | Closed [10] |
| GLM-5.1 | 58.4% | April 7, 2026 | MIT [1][2] |
| GPT-5.4 | 57.7% | March 5, 2026 | Closed [1] |
| DeepSeek V4-Pro (max) | 55.4% | April 24, 2026 | MIT [7][8] |
| Claude Opus 4.6 | 53.4% | February 5, 2026 | Closed [6] |

Two things stand out. First, the gap from the open-weight leader (Kimi K2.6 at 58.6%) to the closed leader (Opus 4.7 at 64.3%) is 5.7 percentage points, while the gap to GPT-5.5 is zero. Second, GPT-5.5 and Kimi K2.6 are tied on this benchmark, but GPT-5.5 costs $5 per million input tokens and $30 per million output, while Kimi K2.6 lists at $0.95 input / $4.00 output on Moonshot's API [12][13][14].

Judged on SWE-Bench Pro alone, the open-weight frontier has matched GPT-5.5 at a fraction of the price. The remaining gap is to Anthropic's Opus 4.7 — and that gap is real, but narrow enough that for many workloads it does not justify a 6-30× premium on output tokens.


SWE-bench Verified, LiveCodeBench, and Codeforces

SWE-Bench Pro is the headline benchmark, but it's not the only one. Different models win different categories.

| Model | SWE-bench Verified | LiveCodeBench | Codeforces Elo |
|---|---|---|---|
| GPT-5.5 | 88.7% (OpenAI) [11] | Not headline-reported | Not officially reported |
| Claude Opus 4.7 | 87.6% [4] | 89.6% [4] | Not officially reported |
| DeepSeek V4-Pro (max) | 80.6% [7][8] | 93.5% [7] | 3,206 [7][8] |
| Kimi K2.6 | 80.2% [5] | Not headline-reported | Not headline-reported |
| Claude Opus 4.6 | 80.8% [7] | 88.8% [7] | Not officially reported |

DeepSeek V4-Pro wins LiveCodeBench (93.5%) and Codeforces (3,206 — ahead of GPT-5.4's 3,168). Kimi K2.6 sits within a fraction of a point of Opus 4.6 on SWE-bench Verified. GPT-5.5 narrowly leads SWE-bench Verified at 88.7% (OpenAI-reported), with Claude Opus 4.7 close behind at 87.6%. Both lead the open-weight frontier (DeepSeek V4-Pro at 80.6%) by roughly 7 to 8 points on that benchmark.

Note that vendor-reported SWE-bench Verified numbers use each lab's own scaffold and tool budget; on the independent vals.ai leaderboard, which fixes a neutral scaffold, GPT-5.5 lands at ~82.6%, closer to the open-weight pack [15]. Apples-to-apples comparison across labs is harder than the headline numbers suggest.

The pattern: open-weight models are competitive on competitive-programming and code-generation tasks (LiveCodeBench, Codeforces) and within striking distance on issue-resolution tasks (SWE-bench Verified). Long-horizon agentic resolution (SWE-Bench Pro) remains the area where Anthropic holds the clearest lead.


The Price Gap That Actually Matters

Per-token pricing tells the story of why these releases are reshaping production deployments.

| Model | Input ($/M) | Output ($/M) | Notes |
|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | Closed [12] |
| GPT-5.5 | $5.00 | $30.00 | Closed; nearly doubled vs GPT-5.4 ($2.50/$15) [13] |
| Kimi K2.6 (Moonshot API) | $0.95 | $4.00 | Modified MIT, self-host available [14] |
| DeepSeek V4-Pro (list) | $1.74 | $3.48 | MIT [9] |
| DeepSeek V4-Pro (promo) | $0.435 | $0.87 | 75% promo until May 31, 2026 [9] |
| GLM-5.1 (OpenRouter) | $1.05 | $3.50 | MIT, varies by provider [16] |

On output cost alone:

  • Kimi K2.6 is ~6× cheaper than Claude Opus 4.7 and ~7.5× cheaper than GPT-5.5.
  • DeepSeek V4-Pro (list) is ~7× cheaper than Opus 4.7 and ~8.6× cheaper than GPT-5.5.
  • DeepSeek V4-Pro (promo) is ~29× cheaper than Opus 4.7 — until May 31.

And these are inference-API prices. The MIT licenses mean you can self-host DeepSeek V4 or GLM-5.1 with no per-token fees at all, paying only for your own compute. For a heavy coding workload that emits millions of output tokens per day, the math gets uncomfortable fast for the closed-source incumbents.
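
To make that math concrete, here is a minimal back-of-the-envelope sketch in Python. The per-day token volumes are illustrative assumptions, not measurements; the prices are the list prices from the table above.

```python
# Back-of-the-envelope daily/monthly API cost for a hypothetical coding
# workload. Prices are the list prices from the table above ($ per million
# tokens); the token volumes are assumptions, not measurements.

PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "Claude Opus 4.7":        (5.00, 25.00),
    "GPT-5.5":                (5.00, 30.00),
    "Kimi K2.6":              (0.95,  4.00),
    "DeepSeek V4-Pro (list)": (1.74,  3.48),
    "GLM-5.1 (OpenRouter)":   (1.05,  3.50),
}

INPUT_TOKENS_PER_DAY = 5_000_000    # assumed: prompts, diffs, repo context
OUTPUT_TOKENS_PER_DAY = 2_000_000   # assumed: generated patches and tests

for model, (inp, outp) in PRICES.items():
    daily = (INPUT_TOKENS_PER_DAY * inp + OUTPUT_TOKENS_PER_DAY * outp) / 1e6
    print(f"{model:<24} ${daily:6.2f}/day  ~${daily * 30:,.0f}/month")
```

On these assumed volumes, the open-weight APIs land around $12-16 per day while the closed flagships land around $75-85 per day, before any self-hosting is considered.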


Three Different Bets

The three Chinese labs did not all build the same kind of model. Each made a distinct architectural and product bet.

GLM-5.1: The MIT-Licensed Long-Horizon Bet

GLM-5.1 is a 754-billion-parameter MoE with 40B active per token, MIT-licensed, with a 200K context window and 128K maximum output [17]. Z.ai's pitch is long-horizon agentic coding: the model is engineered to run autonomously for up to eight hours of continuous task execution without human checkpoints [2]. It was trained on 100,000 Huawei Ascend 910B chips — a deliberate signal that Chinese labs can train frontier models without NVIDIA hardware [2].
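
Z.ai has not published its agent scaffold, so the following is only a minimal sketch of the kind of plan-execute-verify loop a long-horizon coding run implies. The `plan_step`, `apply_edit`, and `run_tests` callables are hypothetical placeholders for whatever harness you actually use.

```python
# Minimal long-horizon coding loop: the model proposes a step, the harness
# applies it and verifies with the test suite, and the loop continues until
# the model declares the task done or the wall-clock budget runs out.
# plan_step / apply_edit / run_tests are hypothetical placeholders.
import time

MAX_WALL_CLOCK_S = 8 * 3600  # the "eight hours without checkpoints" idea

def run_long_horizon(task, plan_step, apply_edit, run_tests):
    start = time.monotonic()
    history = []  # full trajectory fed back to the model at each step
    while time.monotonic() - start < MAX_WALL_CLOCK_S:
        step = plan_step(task, history)   # model decides the next edit
        if step is None:                  # model says the task is complete
            return run_tests().passed
        outcome = apply_edit(step)        # patch the working tree
        verdict = run_tests()             # verify instead of trusting the model
        history.append((step, outcome, verdict))
    return False  # budget exhausted without a clean finish
```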

Best for: agentic engineering workflows where the model needs to plan, execute, verify, and revise across multiple files and sub-tasks without supervision.

Kimi K2.6: The 300-Agent Swarm Bet

Kimi K2.6 is a 1-trillion-parameter MoE with 32B active per token, Modified MIT licensed, with a 256K context window [5][6]. Moonshot's headline architectural feature is Agent Swarm: a single coding run can orchestrate up to 300 sub-agents executing 4,000 coordinated steps [6]. Moonshot demonstrated 13 hours of continuous autonomous coding on an open-source financial matching engine during the launch [5].
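
Moonshot has not released the Agent Swarm scheduler itself, so the sketch below is only a generic illustration of bounded fan-out over parallel sub-tasks; `solve_subtask` is a hypothetical stand-in for a single sub-agent call, and the 300-worker cap simply mirrors the advertised limit.

```python
# Generic bounded fan-out over independent sub-tasks. This is NOT Moonshot's
# Agent Swarm implementation; solve_subtask is a hypothetical callable that
# runs one sub-agent (e.g. "refactor this one file") and returns its patch.
from concurrent.futures import ThreadPoolExecutor

MAX_SUBAGENTS = 300  # mirrors the advertised swarm size

def swarm(subtasks, solve_subtask):
    workers = max(1, min(MAX_SUBAGENTS, len(subtasks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(solve_subtask, subtasks))
    return results  # the caller still has to merge patches and resolve conflicts
```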

Best for: workloads that decompose naturally into parallel sub-tasks — refactoring across hundreds of files, test generation, multi-language porting.

DeepSeek V4: The Architecture-Driven Cost Bet

DeepSeek V4 ships as two models: V4-Pro (1.6T total / 49B active) and V4-Flash (284B total / 13B active), both under MIT license with a 1-million-token context window and 384K max output [8][9]. The architectural headline is Hybrid Attention — Compressed Sparse Attention and Heavily Compressed Attention layers that cut per-token inference FLOPs to 27% of V3.2's and KV cache to 10% at 1M-token context [8]. V4 is the only model in the wave with a million-token window.
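
To see why the KV-cache figure matters at a 1M-token window, a back-of-the-envelope estimate helps. The formula below is the standard KV-cache size for plain attention; the layer, head, and dimension values are illustrative placeholders, not DeepSeek's published configuration.

```python
# Standard KV-cache size for plain (uncompressed) attention:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
# The model-shape numbers below are illustrative, not DeepSeek's real config.

LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128   # assumed model shape
BYTES_PER_VALUE = 2                        # fp16 / bf16
CONTEXT_TOKENS = 1_000_000                 # the 1M-token window

plain_kv = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * CONTEXT_TOKENS
print(f"plain-attention KV cache at 1M tokens: {plain_kv / 1e9:.0f} GB")
print(f"at the reported 10% of that:           {plain_kv * 0.10 / 1e9:.0f} GB")
```

On those assumed numbers, a single 1M-token session drops from roughly a quarter terabyte of KV cache to a few tens of gigabytes, which is the difference between a multi-node serving setup and a single machine.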

Best for: large-codebase work where context length is the binding constraint — entire repositories, long debugging sessions, multi-document code review.


License Differences That Matter in Production

The "open" in "open-weight" varies meaningfully across the three.

| Model | License | Commercial use | Self-host |
|---|---|---|---|
| GLM-5.1 | MIT | Yes, unrestricted [17] | Yes |
| DeepSeek V4 | MIT | Yes, unrestricted [8] | Yes |
| Kimi K2.6 | Modified MIT | Yes — attribution required for deployments >100M MAU or >$20M monthly revenue [14] | Yes |
| MiniMax M2.7 (earlier, March 18) | Restrictive Modified-MIT | "Written authorization required for commercial use" [18] | Allowed for non-commercial use |

GLM-5.1 and DeepSeek V4 shipping under standard MIT is the legal end-state most enterprise buyers want. Kimi K2.6's Modified MIT functions as standard MIT for the vast majority of teams — its one added clause requires prominent "Kimi K2" attribution only for deployments exceeding 100 million monthly active users or $20 million in monthly revenue [14]. MiniMax M2.7 — released earlier, on March 18, 2026, ahead of the April wave — illustrates the pattern that has not taken hold: a "modified MIT" with non-commercial restrictions that requires prior written authorization for any commercial use [18].

For enterprise procurement, the practical hierarchy is: GLM-5.1 ≈ DeepSeek V4 (no friction) → Kimi K2.6 (light review) → MiniMax M2.7 (separate license).


What Each Model Actually Wins At

Comparing the three Chinese open-weight models head-to-head — and against the closed frontier where relevant:

  • Best open-weight SWE-Bench Pro: Kimi K2.6 (58.6%), narrowly ahead of GLM-5.1 (58.4%); also ties closed-source GPT-5.5 (58.6%)
  • Best LiveCodeBench (vendor-reported): DeepSeek V4-Pro (93.5%)
  • Best Codeforces: DeepSeek V4-Pro (3,206 Elo)
  • Best long-horizon autonomous coding: GLM-5.1 (8-hour continuous task execution)
  • Best parallel sub-task orchestration: Kimi K2.6 (300-agent swarm)
  • Best large-codebase context (among the three): DeepSeek V4 (1M tokens)
  • Best price per output token: DeepSeek V4-Pro promo (until May 31)
  • Cleanest license for enterprise procurement: GLM-5.1 or DeepSeek V4 (standard MIT)

The gap to Claude Opus 4.7 on SWE-Bench Pro (5.7 points) and SWE-bench Verified (~7 points) remains real. For mission-critical autonomous coding agents where every percentage point of issue-resolution accuracy matters, Opus 4.7 still has the lead — at 6× to 30× the output-token cost depending on which open-weight model and pricing tier you compare it to.


The Gap That Remains

It would be wrong to say the open-weight frontier has caught the closed frontier. Claude Opus 4.7's 64.3% on SWE-Bench Pro is the highest result publicly reported on that benchmark, and on SWE-bench Verified the closed frontier — GPT-5.5 at 88.7% and Opus 4.7 at 87.6% — still sits roughly 7-8 points above the open-weight leaders. On long-horizon agentic tasks involving multi-file resolution under ambiguity, Opus 4.7 currently wins.

But on most coding tasks that are not at the absolute frontier of difficulty, the open-weight models are now within statistical noise of GPT-5.5 — and dramatically cheaper. The interesting question is no longer "can open-weight catch the closed frontier?" but "for which subset of coding workloads does the closed-frontier premium still pay for itself?"

For high-stakes one-shot agentic resolution — yes. For bulk code generation, refactoring, test writing, multi-file editing, and CI-integration loops — increasingly, no.


Bottom Line

In a 17-day window from April 7 to April 24, 2026, three Chinese open-weight coding models — GLM-5.1, Kimi K2.6, and DeepSeek V4 — closed almost the entire benchmark gap to GPT-5.5 while undercutting GPT-5.5 and Claude Opus 4.7 on price by 6× to 30×. Anthropic's Claude Opus 4.7 still leads on the hardest agentic coding benchmark (SWE-Bench Pro at 64.3%), but for the broader class of coding workloads, MIT-licensed open-weight models with 200K-to-1M-token context windows are now a credible production default. The 2026 frontier of coding LLMs is not a single line. It is a cluster — and the open-weight side of that cluster has, for the first time, become competitive on every dimension except the very hardest.


Footnotes


  1. Winbuzzer, Z.ai Releases GLM-5.1: 754B Model Tops SWE-Bench Pro, April 9, 2026. https://winbuzzer.com/2026/04/09/z-ai-releases-glm-5-1-754b-model-tops-swe-bench-pro-xcxwbn/

  2. NerdLevelTech, GLM-5.1: The Open-Source Model That Beat GPT-5.4, April 19, 2026. /glm-5-1-open-source-beats-gpt-coding-benchmarks

  3. TheNextWeb, Claude Opus 4.7 leads on SWE-bench and agentic reasoning, April 16, 2026. https://thenextweb.com/news/anthropic-claude-opus-4-7-coding-agentic-benchmarks-release

  4. LLM-Stats, Claude Opus 4.7 Benchmarks, Pricing & Context Window. https://llm-stats.com/models/claude-opus-4-7

  5. Kimi.com Blog, Kimi K2.6 Tech Blog: Advancing Open-Source Coding. https://www.kimi.com/blog/kimi-k2-6

  6. NerdLevelTech, Kimi K2.6: Open-Weight 300-Agent Swarm Tops GPT-5.4, April 27, 2026. /kimi-k2-6-300-agent-swarm-open-weight-frontier-coding

  7. DeepSeek API Docs, DeepSeek V4 Preview Release, April 24, 2026. https://api-docs.deepseek.com/news/news260424

  8. NerdLevelTech, DeepSeek V4: Open-Weight Frontier at 1/7 the Cost, May 2, 2026. /deepseek-v4-open-source-frontier-million-token-context

  9. DeepSeek API Docs, Models & Pricing. https://api-docs.deepseek.com/quick_start/pricing/

  10. OpenAI, Introducing GPT-5.5, April 23, 2026. https://openai.com/index/introducing-gpt-5-5/

  11. TokenMix, GPT-5.5 Review: 88.7% SWE-Bench, 92.4% MMLU, 2x Price Tag (2026). https://tokenmix.ai/blog/gpt-5-5-spud-review-88-swe-bench-2026 ; OpenAI announcement, April 23, 2026.

  12. Anthropic Pricing. https://www.anthropic.com/claude/opus

  13. OpenAI API Pricing. https://openai.com/api/pricing/

  14. Moonshot AI on Hugging Face, Kimi-K2.6 Model Card. https://huggingface.co/moonshotai/Kimi-K2.6

  15. vals.ai SWE-bench leaderboard, May 2026.

  16. OpenRouter, GLM-5.1 API Pricing & Benchmarks. https://openrouter.ai/z-ai/glm-5.1

  17. LLM-Stats, GLM-5.1 Benchmarks, Pricing & Context Window. https://llm-stats.com/models/glm-5.1

  18. MarkTechPost, MiniMax Just Open Sourced MiniMax M2.7, April 12, 2026. https://www.marktechpost.com/2026/04/12/minimax-just-open-sourced-minimax-m2-7-a-self-evolving-agent-model-that-scores-56-22-on-swe-pro-and-57-0-on-terminal-bench-2/

Frequently Asked Questions

Which of the three Chinese open-weight models is the best for coding?

It depends on the workload. Kimi K2.6 leads on SWE-Bench Pro at 58.6%, narrowly ahead of GLM-5.1 (58.4%) and well ahead of DeepSeek V4-Pro (55.4%). DeepSeek V4-Pro leads on LiveCodeBench (93.5%) and Codeforces (3,206 Elo). GLM-5.1 leads on long-horizon autonomous execution. No single model wins every benchmark, so the deciding factors are licensing, context length, price, and which kind of coding workload you optimize for.
