Chinese Open-Weight Coding LLMs: 2026's Three-Week Sweep

May 11, 2026


TL;DR

Between April 7 and April 24, 2026, three Chinese labs shipped open-weight coding models that closed almost the entire gap to the Western frontier. GLM-5.1 (Z.ai, April 7) became the first open-weight model to top SWE-Bench Pro at 58.4%, beating GPT-5.4's 57.7% [1][2]. Nine days later Claude Opus 4.7 (Anthropic, April 16) reclaimed the lead at 64.3% and pushed SWE-bench Verified to 87.6% [3][4]. Kimi K2.6 (Moonshot AI, April 20) answered with 58.6% on SWE-Bench Pro and 80.2% on SWE-bench Verified — a 1-trillion-parameter MoE that orchestrates up to 300 sub-agents in a single coding run [5][6]. Four days later, DeepSeek V4 Preview (April 24) shipped a 1.6T MoE with 1M-token context, 80.6% SWE-bench Verified, a 3,206 Codeforces rating (above GPT-5.4's 3,168), and MIT-licensed weights at roughly one-seventh the output cost of Claude Opus 4.7 [7][8][9]. None of the three closes the gap to Opus 4.7 on SWE-Bench Pro — but on price-per-task, agent orchestration, context length, and license clarity, the open-weight frontier is now within touching distance of the closed one.


What You'll Learn

  • The exact 17-day timeline of releases — and why the order matters
  • How each model scores on SWE-Bench Pro, SWE-bench Verified, LiveCodeBench, and Codeforces
  • The price gap between the three Chinese open-weight models and GPT-5.5 / Claude Opus 4.7
  • Why MIT vs Modified MIT vs non-commercial licensing changes who can actually use these weights
  • The three different bets each lab made — and which one fits your workload
  • Where the remaining gap to Claude Opus 4.7 still hurts, and where it doesn't

A 17-Day Window That Changed the Open-Weight Frontier

Before April 7, the open-weight coding leaderboard sat well behind the closed frontier. SWE-Bench Pro's top open-weight entrant trailed GPT-5.4 by several points, and the gap on SWE-bench Verified was wider still. Then five releases, three of them open-weight, landed in seventeen days.

| Date | Release | Lab | Headline result |
|---|---|---|---|
| April 7, 2026 | GLM-5.1 | Z.ai (formerly Zhipu) | First open-weight #1 on SWE-Bench Pro at 58.4% [1][2] |
| April 16, 2026 | Claude Opus 4.7 | Anthropic | Reclaims SWE-Bench Pro lead at 64.3%, ships 87.6% SWE-bench Verified [3][4] |
| April 20, 2026 | Kimi K2.6 | Moonshot AI | 58.6% SWE-Bench Pro, 80.2% SWE-bench Verified, 300-agent swarm [5][6] |
| April 23, 2026 | GPT-5.5 | OpenAI | 58.6% SWE-Bench Pro (tied with Kimi K2.6) and 88.7% SWE-bench Verified — takes Verified lead from Opus 4.7 [10][11] |
| April 24, 2026 | DeepSeek V4 Preview | DeepSeek | 1.6T MoE, 80.6% SWE-bench Verified, 3,206 Codeforces, 1M context [7][8][9] |

The story isn't that any one Chinese model surpassed Anthropic's flagship. None of them did on SWE-Bench Pro. The story is that within three weeks, the open-weight frontier matched or beat every other closed frontier vendor — including GPT-5.5 — on at least one major coding benchmark, while undercutting them on price by 6× to 30×.


SWE-Bench Pro: The Hardest Public Coding Benchmark

SWE-Bench Pro evaluates an agent's ability to resolve real GitHub issues end-to-end. It is harder than SWE-bench Verified because it includes long-horizon and multi-file changes. Here is where the field stands today.

| Model | SWE-Bench Pro | Released | License |
|---|---|---|---|
| Claude Opus 4.7 | 64.3% | April 16, 2026 | Closed [4] |
| Kimi K2.6 | 58.6% | April 20, 2026 | Modified MIT [5][6] |
| GPT-5.5 | 58.6% | April 23, 2026 | Closed [10] |
| GLM-5.1 | 58.4% | April 7, 2026 | MIT [1][2] |
| GPT-5.4 | 57.7% | March 5, 2026 | Closed [1] |
| DeepSeek V4-Pro (max) | 55.4% | April 24, 2026 | MIT [7][8] |
| Claude Opus 4.6 | 53.4% | February 5, 2026 | Closed [6] |

Two things stand out. First, the gap from the open-weight leader (Kimi K2.6 at 58.6%) to the closed leader (Opus 4.7 at 64.3%) is 5.7 percentage points, while the gap to GPT-5.5 is zero. Second, GPT-5.5 and Kimi K2.6 are tied on this benchmark, but GPT-5.5 costs $5 per million input tokens and $30 per million output, while Kimi K2.6 lists at $0.95 input / $4.00 output on Moonshot's API [12][13][14].

Judged on SWE-Bench Pro alone, the open-weight frontier has matched GPT-5.5 at a fraction of the price. The remaining gap is to Anthropic's Opus 4.7 — and that gap is real, but narrow enough that for many workloads it does not justify a 6-30× premium on output tokens.


SWE-bench Verified, LiveCodeBench, and Codeforces

SWE-Bench Pro is the headline benchmark, but it's not the only one. Different models win different categories.

| Model | SWE-bench Verified | LiveCodeBench | Codeforces Elo |
|---|---|---|---|
| GPT-5.5 | 88.7% (OpenAI) [11] | Not headline-reported | Not officially reported |
| Claude Opus 4.7 | 87.6% [4] | 89.6% [4] | Not officially reported |
| DeepSeek V4-Pro (max) | 80.6% [7][8] | 93.5% [7] | 3,206 [7][8] |
| Kimi K2.6 | 80.2% [5] | Not headline-reported | Not headline-reported |
| Claude Opus 4.6 | 80.8% [7] | 88.8% [7] | Not officially reported |

DeepSeek V4-Pro wins LiveCodeBench (93.5%) and Codeforces (3,206 — ahead of GPT-5.4's 3,168). Kimi K2.6 sits within a fraction of a point of Opus 4.6 on SWE-bench Verified. GPT-5.5 narrowly leads SWE-bench Verified at 88.7% (OpenAI-reported), with Claude Opus 4.7 close behind at 87.6%. Both lead the open-weight frontier (DeepSeek V4-Pro at 80.6%) by roughly 7 to 8 points on that benchmark.

Note that vendor-reported SWE-bench Verified numbers use each lab's own scaffold and tool budget; on the independent vals.ai leaderboard, which fixes a neutral scaffold, GPT-5.5 lands at ~82.6%, closer to the open-weight pack [15]. Apples-to-apples comparison across labs is harder than the headline numbers suggest.

The pattern: open-weight models are competitive on competitive-programming and code-generation tasks (LiveCodeBench, Codeforces) and within striking distance on issue-resolution tasks (SWE-bench Verified). Long-horizon agentic resolution (SWE-Bench Pro) remains the area where Anthropic holds the clearest lead.


The Price Gap That Actually Matters

Per-token pricing tells the story of why these releases are reshaping production deployments.

| Model | Input ($/M) | Output ($/M) | Notes |
|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | Closed [12] |
| GPT-5.5 | $5.00 | $30.00 | Closed; nearly doubled vs GPT-5.4 ($2.50/$15) [13] |
| Kimi K2.6 (Moonshot API) | $0.95 | $4.00 | Modified MIT, self-host available [14] |
| DeepSeek V4-Pro (list) | $1.74 | $3.48 | MIT [9] |
| DeepSeek V4-Pro (promo) | $0.435 | $0.87 | 75% promo until May 31, 2026 [9] |
| GLM-5.1 (OpenRouter) | $1.05 | $3.50 | MIT, varies by provider [16] |

On output cost alone:

  • Kimi K2.6 is ~6× cheaper than Claude Opus 4.7 and ~7.5× cheaper than GPT-5.5.
  • DeepSeek V4-Pro (list) is ~7× cheaper than Opus 4.7 and ~8.6× cheaper than GPT-5.5.
  • DeepSeek V4-Pro (promo) is ~29× cheaper than Opus 4.7 — until May 31.

And these are inference-API prices. The MIT licenses mean you can self-host DeepSeek V4 or GLM-5.1 with no per-token fees at all, paying only for your own compute. For a heavy coding workload that emits millions of output tokens per day, the math gets uncomfortable fast for the closed-source incumbents.
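
To make that math concrete, here is a minimal back-of-the-envelope sketch in Python. The per-day token volumes are illustrative assumptions, not measurements; the prices are the list prices from the table above.

```python
# Back-of-the-envelope daily/monthly API cost for a hypothetical coding
# workload. Prices are the list prices from the table above ($ per million
# tokens); the token volumes are assumptions, not measurements.

PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "Claude Opus 4.7":        (5.00, 25.00),
    "GPT-5.5":                (5.00, 30.00),
    "Kimi K2.6":              (0.95,  4.00),
    "DeepSeek V4-Pro (list)": (1.74,  3.48),
    "GLM-5.1 (OpenRouter)":   (1.05,  3.50),
}

INPUT_TOKENS_PER_DAY = 5_000_000    # assumed: prompts, diffs, repo context
OUTPUT_TOKENS_PER_DAY = 2_000_000   # assumed: generated patches and tests

for model, (inp, outp) in PRICES.items():
    daily = (INPUT_TOKENS_PER_DAY * inp + OUTPUT_TOKENS_PER_DAY * outp) / 1e6
    print(f"{model:<24} ${daily:6.2f}/day  ~${daily * 30:,.0f}/month")
```

On these assumed volumes, the open-weight APIs land around $12-16 per day while the closed flagships land around $75-85 per day, before any self-hosting is considered.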


Three Different Bets

The three Chinese labs did not all build the same kind of model. Each made a distinct architectural and product bet.

GLM-5.1: The MIT-Licensed Long-Horizon Bet

GLM-5.1 is a 754-billion-parameter MoE with 40B active per token, MIT-licensed, with a 200K context window and 128K maximum output [17]. Z.ai's pitch is long-horizon agentic coding: the model is engineered to run autonomously for up to eight hours of continuous task execution without human checkpoints [2]. It was trained on 100,000 Huawei Ascend 910B chips — a deliberate signal that Chinese labs can train frontier models without NVIDIA hardware [2].
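
Z.ai has not published its agent scaffold, so the following is only a minimal sketch of the kind of plan-execute-verify loop a long-horizon coding run implies. The `plan_step`, `apply_edit`, and `run_tests` callables are hypothetical placeholders for whatever harness you actually use.

```python
# Minimal long-horizon coding loop: the model proposes a step, the harness
# applies it and verifies with the test suite, and the loop continues until
# the model declares the task done or the wall-clock budget runs out.
# plan_step / apply_edit / run_tests are hypothetical placeholders.
import time

MAX_WALL_CLOCK_S = 8 * 3600  # the "eight hours without checkpoints" idea

def run_long_horizon(task, plan_step, apply_edit, run_tests):
    start = time.monotonic()
    history = []  # full trajectory fed back to the model at each step
    while time.monotonic() - start < MAX_WALL_CLOCK_S:
        step = plan_step(task, history)   # model decides the next edit
        if step is None:                  # model says the task is complete
            return run_tests().passed
        outcome = apply_edit(step)        # patch the working tree
        verdict = run_tests()             # verify instead of trusting the model
        history.append((step, outcome, verdict))
    return False  # budget exhausted without a clean finish
```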

Best for: agentic engineering workflows where the model needs to plan, execute, verify, and revise across multiple files and sub-tasks without supervision.

Kimi K2.6: The 300-Agent Swarm Bet

Kimi K2.6 is a 1-trillion-parameter MoE with 32B active per token, Modified MIT licensed, with a 256K context window [5][6]. Moonshot's headline architectural feature is Agent Swarm: a single coding run can orchestrate up to 300 sub-agents executing 4,000 coordinated steps [6]. Moonshot demonstrated 13 hours of continuous autonomous coding on an open-source financial matching engine during the launch [5].
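
Moonshot has not released the Agent Swarm scheduler itself, so the sketch below is only a generic illustration of bounded fan-out over parallel sub-tasks; `solve_subtask` is a hypothetical stand-in for a single sub-agent call, and the 300-worker cap simply mirrors the advertised limit.

```python
# Generic bounded fan-out over independent sub-tasks. This is NOT Moonshot's
# Agent Swarm implementation; solve_subtask is a hypothetical callable that
# runs one sub-agent (e.g. "refactor this one file") and returns its patch.
from concurrent.futures import ThreadPoolExecutor

MAX_SUBAGENTS = 300  # mirrors the advertised swarm size

def swarm(subtasks, solve_subtask):
    workers = max(1, min(MAX_SUBAGENTS, len(subtasks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(solve_subtask, subtasks))
    return results  # the caller still has to merge patches and resolve conflicts
```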

Best for: workloads that decompose naturally into parallel sub-tasks — refactoring across hundreds of files, test generation, multi-language porting.

DeepSeek V4: The Architecture-Driven Cost Bet

DeepSeek V4 ships as two models: V4-Pro (1.6T total / 49B active) and V4-Flash (284B total / 13B active), both under MIT license with a 1-million-token context window and 384K max output [8][9]. The architectural headline is Hybrid Attention — Compressed Sparse Attention and Heavily Compressed Attention layers that cut per-token inference FLOPs to 27% of V3.2's and KV cache to 10% at 1M-token context [8]. V4 is the only model in the wave with a million-token window.
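
To see why the KV-cache figure matters at a 1M-token window, a back-of-the-envelope estimate helps. The formula below is the standard KV-cache size for plain attention; the layer, head, and dimension values are illustrative placeholders, not DeepSeek's published configuration.

```python
# Standard KV-cache size for plain (uncompressed) attention:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
# The model-shape numbers below are illustrative, not DeepSeek's real config.

LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128   # assumed model shape
BYTES_PER_VALUE = 2                        # fp16 / bf16
CONTEXT_TOKENS = 1_000_000                 # the 1M-token window

plain_kv = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * CONTEXT_TOKENS
print(f"plain-attention KV cache at 1M tokens: {plain_kv / 1e9:.0f} GB")
print(f"at the reported 10% of that:           {plain_kv * 0.10 / 1e9:.0f} GB")
```

On those assumed numbers, a single 1M-token session drops from roughly a quarter terabyte of KV cache to a few tens of gigabytes, which is the difference between a multi-node serving setup and a single machine.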

Best for: large-codebase work where context length is the binding constraint — entire repositories, long debugging sessions, multi-document code review.


License Differences That Matter in Production

The "open" in "open-weight" varies meaningfully across the three.

| Model | License | Commercial use | Self-host |
|---|---|---|---|
| GLM-5.1 | MIT | Yes, unrestricted [17] | Yes |
| DeepSeek V4 | MIT | Yes, unrestricted [8] | Yes |
| Kimi K2.6 | Modified MIT | Yes — attribution required for deployments >100M MAU or >$20M monthly revenue [14] | Yes |
| MiniMax M2.7 (earlier, March 18) | Restrictive Modified-MIT | "Written authorization required for commercial use" [18] | Allowed for non-commercial use |

GLM-5.1 and DeepSeek V4 shipping under standard MIT is the legal end-state most enterprise buyers want. Kimi K2.6's Modified MIT functions as standard MIT for the vast majority of teams — its one added clause requires prominent "Kimi K2" attribution only for deployments exceeding 100 million monthly active users or $20 million in monthly revenue [14]. MiniMax M2.7 — released earlier, on March 18, 2026, ahead of the April wave — illustrates the pattern that has not taken hold: a "modified MIT" with non-commercial restrictions that requires prior written authorization for any commercial use [18].

For enterprise procurement, the practical hierarchy is: GLM-5.1 ≈ DeepSeek V4 (no friction) → Kimi K2.6 (light review) → MiniMax M2.7 (separate license).


What Each Model Actually Wins At

Comparing the three Chinese open-weight models head-to-head — and against the closed frontier where relevant:

  • Best open-weight SWE-Bench Pro: Kimi K2.6 (58.6%), narrowly ahead of GLM-5.1 (58.4%); also ties closed-source GPT-5.5 (58.6%)
  • Best LiveCodeBench (vendor-reported): DeepSeek V4-Pro (93.5%)
  • Best Codeforces: DeepSeek V4-Pro (3,206 Elo)
  • Best long-horizon autonomous coding: GLM-5.1 (8-hour continuous task execution)
  • Best parallel sub-task orchestration: Kimi K2.6 (300-agent swarm)
  • Best large-codebase context (among the three): DeepSeek V4 (1M tokens)
  • Best price per output token: DeepSeek V4-Pro promo (until May 31)
  • Cleanest license for enterprise procurement: GLM-5.1 or DeepSeek V4 (standard MIT)

The gap to Claude Opus 4.7 on SWE-Bench Pro (5.7 points) and SWE-bench Verified (~7 points) remains real. For mission-critical autonomous coding agents where every percentage point of issue-resolution accuracy matters, Opus 4.7 still has the lead — at 6× to 30× the output-token cost depending on which open-weight model and pricing tier you compare it to.


The Gap That Remains

It would be wrong to say the open-weight frontier has caught the closed frontier. Claude Opus 4.7's 64.3% on SWE-Bench Pro is the highest result publicly reported on that benchmark, and on SWE-bench Verified the closed frontier — GPT-5.5 at 88.7% and Opus 4.7 at 87.6% — still sits roughly 7-8 points above the open-weight leaders. On long-horizon agentic tasks involving multi-file resolution under ambiguity, Opus 4.7 currently wins.

But on most coding tasks that are not at the absolute frontier of difficulty, the open-weight models are now within statistical noise of GPT-5.5 — and dramatically cheaper. The interesting question is no longer "can open-weight catch the closed frontier?" but "for which subset of coding workloads does the closed-frontier premium still pay for itself?"

For high-stakes one-shot agentic resolution — yes. For bulk code generation, refactoring, test writing, multi-file editing, and CI-integration loops — increasingly, no.


Bottom Line

In a 17-day window from April 7 to April 24, 2026, three Chinese open-weight coding models — GLM-5.1, Kimi K2.6, and DeepSeek V4 — closed almost the entire benchmark gap to GPT-5.5 while undercutting GPT-5.5 and Claude Opus 4.7 on price by 6× to 30×. Anthropic's Claude Opus 4.7 still leads on the hardest agentic coding benchmark (SWE-Bench Pro at 64.3%), but for the broader class of coding workloads, MIT-licensed open-weight models with 200K-to-1M-token context windows are now a credible production default. The 2026 frontier of coding LLMs is not a single line. It is a cluster — and the open-weight side of that cluster has, for the first time, become competitive on every dimension except the very hardest.


Footnotes


  1. Winbuzzer, Z.ai Releases GLM-5.1: 754B Model Tops SWE-Bench Pro, April 9, 2026. https://winbuzzer.com/2026/04/09/z-ai-releases-glm-5-1-754b-model-tops-swe-bench-pro-xcxwbn/

  2. NerdLevelTech, GLM-5.1: The Open-Source Model That Beat GPT-5.4, April 19, 2026. /glm-5-1-open-source-beats-gpt-coding-benchmarks

  3. TheNextWeb, Claude Opus 4.7 leads on SWE-bench and agentic reasoning, April 16, 2026. https://thenextweb.com/news/anthropic-claude-opus-4-7-coding-agentic-benchmarks-release

  4. LLM-Stats, Claude Opus 4.7 Benchmarks, Pricing & Context Window. https://llm-stats.com/models/claude-opus-4-7

  5. Kimi.com Blog, Kimi K2.6 Tech Blog: Advancing Open-Source Coding. https://www.kimi.com/blog/kimi-k2-6

  6. NerdLevelTech, Kimi K2.6: Open-Weight 300-Agent Swarm Tops GPT-5.4, April 27, 2026. /kimi-k2-6-300-agent-swarm-open-weight-frontier-coding

  7. DeepSeek API Docs, DeepSeek V4 Preview Release, April 24, 2026. https://api-docs.deepseek.com/news/news260424

  8. NerdLevelTech, DeepSeek V4: Open-Weight Frontier at 1/7 the Cost, May 2, 2026. /deepseek-v4-open-source-frontier-million-token-context

  9. DeepSeek API Docs, Models & Pricing. https://api-docs.deepseek.com/quick_start/pricing/

  10. OpenAI, Introducing GPT-5.5, April 23, 2026. https://openai.com/index/introducing-gpt-5-5/

  11. TokenMix, GPT-5.5 Review: 88.7% SWE-Bench, 92.4% MMLU, 2x Price Tag (2026). https://tokenmix.ai/blog/gpt-5-5-spud-review-88-swe-bench-2026 ; OpenAI announcement, April 23, 2026.

  12. Anthropic Pricing. https://www.anthropic.com/claude/opus

  13. OpenAI API Pricing. https://openai.com/api/pricing/

  14. Moonshot AI on Hugging Face, Kimi-K2.6 Model Card. https://huggingface.co/moonshotai/Kimi-K2.6

  15. vals.ai SWE-bench leaderboard, May 2026.

  16. OpenRouter, GLM-5.1 API Pricing & Benchmarks. https://openrouter.ai/z-ai/glm-5.1

  17. LLM-Stats, GLM-5.1 Benchmarks, Pricing & Context Window. https://llm-stats.com/models/glm-5.1

  18. MarkTechPost, MiniMax Just Open Sourced MiniMax M2.7, April 12, 2026. https://www.marktechpost.com/2026/04/12/minimax-just-open-sourced-minimax-m2-7-a-self-evolving-agent-model-that-scores-56-22-on-swe-pro-and-57-0-on-terminal-bench-2/

Frequently Asked Questions

Which of the three Chinese open-weight models is the best for coding?

It depends on the workload. Kimi K2.6 leads on SWE-Bench Pro at 58.6%, narrowly ahead of GLM-5.1 (58.4%) and well ahead of DeepSeek V4-Pro (55.4%). DeepSeek V4-Pro leads on LiveCodeBench (93.5%) and Codeforces (3,206 Elo). GLM-5.1 leads on long-horizon autonomous execution. No single model wins every benchmark, so the deciding factors are licensing, context length, price, and which kind of coding workload you optimize for.
