GLM-5.1: The Open-Source Model That Beat GPT-5.4

April 19, 2026

TL;DR

Z.ai released GLM-5.1 on April 7, 2026 — a 754-billion-parameter open-weight model that scored 58.4% on SWE-bench Pro, edging past GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%) to become the first open-source model ever to lead that leaderboard. It achieved that result with zero NVIDIA hardware, trained entirely on 100,000 Huawei Ascend 910B chips. The weights are MIT-licensed, free to download, and already available via API at $0.95 per million input tokens. Its agentic execution engine can run autonomously for eight hours straight, with no human checkpoints required.

As of April 16, Claude Opus 4.7 has since moved ahead at 64.3% on SWE-bench Pro, but the fact that an open-weight model briefly held the #1 position at all is a milestone the AI industry will not forget quickly.


What You'll Learn

  • What GLM-5.1 is and how it differs from its predecessor GLM-5
  • Its benchmark performance on SWE-bench Pro and Code Arena, plus important caveats
  • How a model on the US export-control Entity List was built without a single NVIDIA chip
  • What 8-hour autonomous execution actually means in practice
  • How to access GLM-5.1 — locally, via API, and what it costs
  • How it compares to DeepSeek, Qwen, and other open-weight rivals

What Is GLM-5.1?

GLM-5.1 is the latest model in Z.ai's General Language Model series, released on April 7, 2026. It is a post-training upgrade to GLM-5, sharing the same Mixture-of-Experts (MoE) architecture but with substantially improved coding ability, tool use, and sustained autonomous execution[1].

Z.ai, formerly known as Zhipu AI, was spun out of Tsinghua University and has been building open LLMs since the original ChatGLM era. The company rebranded to Z.ai in July 2025 as it expanded internationally and shifted its focus toward agentic AI. On January 8, 2026, Z.ai debuted on the Hong Kong Stock Exchange (HKEX: 02513) as Knowledge Atlas Technology Joint Stock Company Limited, raising approximately $558 million in one of the largest AI IPOs of the year[2]. You can read about its earlier work in our inside GLM-4 capabilities post.

One geopolitical note that shapes everything about this model: Zhipu AI was added to the US Commerce Department's Entity List on January 16, 2025[3]. As a result, the company cannot legally purchase NVIDIA hardware. GLM-5.1 was trained entirely on 100,000 Huawei Ascend 910B chips — a fact that matters both for evaluating the Ascend ecosystem's maturity and for understanding why this launch is diplomatically significant[4].


Architecture: 754 Billion Parameters, 40 Billion Active

GLM-5.1 uses a Mixture-of-Experts architecture with 754 billion total parameters. During inference, only approximately 40 billion parameters are activated per token — a design that makes the model far more computationally tractable than a dense 754B model would be[1].

The MoE routing uses 256 experts with top-8 routing per token: eight specialist sub-networks handle each token, plus one shared expert that processes every token. This design is similar in principle to what Mixtral and DeepSeek-V3 pioneered in the open-source community.
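As a concrete illustration, the sketch below implements generic top-k expert routing with a shared expert in plain NumPy. It is a toy model of the technique, not Z.ai's implementation; every name in it is hypothetical.

```python
import numpy as np

def moe_route(hidden, gate_w, experts, shared_expert, k=8):
    """Toy top-k MoE routing with one shared expert.

    hidden:        (d,) token hidden state
    gate_w:        (num_experts, d) router weights
    experts:       list of callables, one per expert
    shared_expert: callable applied to every token
    """
    logits = gate_w @ hidden                 # score every expert for this token
    topk = np.argsort(logits)[-k:]           # indices of the k highest-scoring experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                 # softmax over only the selected experts
    out = sum(w * experts[i](hidden) for w, i in zip(weights, topk))
    return out + shared_expert(hidden)       # shared expert sees all tokens
```

Only the k selected experts (plus the shared one) run per token, which is why a 754B-parameter model can cost roughly as much per token as a ~40B dense model at inference time.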

Context window and output capacity:

| Spec | Value |
| --- | --- |
| Context window | ~200,000 tokens |
| Max output (official) | 131,072 tokens |
| Architecture | Mixture-of-Experts |
| Total parameters | 754 billion |
| Active per token | ~40 billion |
| Training hardware | 100,000 Huawei Ascend 910B |

Benchmark Performance

SWE-bench Pro: A Historic First for Open Source

SWE-bench Pro is a coding benchmark created by Scale AI that evaluates a model's ability to resolve real-world GitHub issues against private, contamination-resistant codebases. Each task requires at least 10 lines of code change, averaging 107.4 lines across 4.1 files[5]. It is significantly harder than the standard SWE-bench Verified, and scores that look impressive on Verified translate to much lower numbers on Pro.

According to Z.ai's internal evaluation[6], GLM-5.1 scored as follows:

| Model | SWE-bench Pro | License |
| --- | --- | --- |
| GLM-5.1 | 58.4% | MIT (open weight) |
| GPT-5.4 | 57.7% | Proprietary |
| Claude Opus 4.6 | 57.3% | Proprietary |
| Gemini 3.1 Pro | 54.2% | Proprietary |

This is the first time an open-weight model has claimed the top position on SWE-bench Pro.

Important caveat: These scores come from Z.ai's own evaluation runs, not a fully independent third-party assessment. The precise margin — less than one point over GPT-5.4 — should be treated as preliminary until independently verified. Z.ai used a 200K-token context window and their own scaffolding setup, which may differ from how other labs measure themselves. For context: Anthropic independently measured Claude Opus 4.6 at 53.4% on SWE-bench Pro using its own scaffolding. The 57.3% figure above is Z.ai's measurement of Opus 4.6 under Z.ai's setup — a 3.9-point difference that illustrates how much evaluation methodology affects scores on this benchmark.

Context update: On April 16, 2026, Anthropic released Claude Opus 4.7, which scored 64.3% on SWE-bench Pro — a 5.9-point margin over GLM-5.1[7]. GLM-5.1's time at the top lasted nine days. That does not diminish the achievement; the benchmark gap between the best open and closed models has been shrinking for two years, and briefly taking the crown confirms open source has arrived at the frontier.

Code Arena: First Open-Weight Model in the Top 3

Arena.ai independently confirmed on April 10, 2026 that GLM-5.1 holds an Elo rating of 1,530 on Code Arena — placing it third globally on their agentic webdev leaderboard[4]. This is notable because it is confirmed by an independent evaluator, not self-reported.

It is the first open-weight model to appear in the top three on Code Arena.
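If Elo ratings are unfamiliar: a rating gap maps directly to an expected head-to-head win rate under the standard Elo model. The snippet below uses illustrative ratings, not Arena.ai's published numbers for any specific pair of models.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score for A against B (win probability, counting
    draws as half a win) under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 30-point gap translates to only a ~54% expected win rate,
# so rivals this close trade wins in head-to-head comparisons.
print(round(elo_expected_score(1530, 1500), 3))  # 0.543
```

The practical takeaway is that leaderboard positions separated by a few dozen Elo points reflect a genuine but slim edge, not dominance.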

Other Benchmarks

Z.ai also reports strong performance across general reasoning tasks, though these scores are not independently verified:

| Benchmark | GLM-5.1 |
| --- | --- |
| LiveCodeBench | 83.6% |
| GPQA Diamond | ~88% (sources vary) |

For the most reliable cross-model comparison, refer to independent evaluators like Artificial Analysis or LM Council rather than any single vendor's figures.

For broader context on where frontier models stand on human-equivalent benchmarks, see the Stanford AI Index 2026 breakdown.


Eight Hours of Autonomous Execution

The headline agentic capability of GLM-5.1 is its ability to run autonomously for more than eight hours without human checkpoints. Z.ai demonstrated this by having the model build a complete Linux desktop environment from scratch — a task it completed through 655 autonomous plan-execute-analyze-optimize iterations in a single uninterrupted session[4].

This is longer than any publicly documented single-session autonomous run from a major model. For comparison, most current agentic frameworks assume regular human review cycles or short task horizons.

What enables this is not raw intelligence but execution architecture: GLM-5.1 was designed from the ground up for long-horizon tasks. Its 200K context window lets it hold large codebases in working memory, and its tool-use capabilities are tightly integrated rather than bolted on afterward.
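The plan-execute-analyze-optimize cycle described above can be sketched generically. This is a schematic illustration only; the callable names are hypothetical and none of this reflects Z.ai's actual stack.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)
    done: bool = False

def run_agent(state, plan, execute, analyze, optimize, max_iters=655):
    """Generic long-horizon loop: plan, execute, analyze, optimize.
    plan/execute/analyze/optimize are caller-supplied callables."""
    for _ in range(max_iters):
        step = plan(state)                   # decide the next action
        result = execute(step)               # run it (tool call, code edit, ...)
        state.history.append((step, result))
        if analyze(state, result):           # success criterion met?
            state.done = True
            break
        state = optimize(state, result)      # revise strategy before next pass
    return state
```

The loop only ends when the success check passes or the iteration budget runs out; there is no human-approval step inside it, which is what "no human checkpoints" means in practice.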

For more on how agentic coding workflows are evolving across the industry, see our post on inside AI coding agents.


Access and Pricing

API Access

GLM-5.1 is available through Z.ai's API, OpenRouter, and Together AI. Pricing on the Z.ai platform[8]:

| Token type | Price per million |
| --- | --- |
| Input | $0.95 |
| Output | $3.15 |

⚠ Prices change frequently. The values above are for illustration only and may be out of date. Always verify current pricing directly with the provider — Z.ai, OpenRouter, or Together AI — before making cost decisions.

For context, Claude Opus 4.7 costs $5.00/M input and $25.00/M output. On output tokens — which dominate cost in agentic coding tasks — GLM-5.1 at $3.15/M is roughly one-eighth the cost of Opus 4.7 at $25.00/M, a significant difference for high-volume workloads.
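To sanity-check that comparison across a whole run rather than output tokens alone, here is a small cost helper using the list prices quoted above. The token counts are hypothetical, and prices may have changed since publication.

```python
def run_cost_usd(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD for one run, with prices given per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical agentic run: 2M input tokens, 500K output tokens.
glm  = run_cost_usd(2_000_000, 500_000, 0.95, 3.15)   # GLM-5.1 list prices above
opus = run_cost_usd(2_000_000, 500_000, 5.00, 25.00)  # Opus 4.7 list prices above
print(glm, opus)
```

On this input-heavy mix the end-to-end gap works out closer to 6.5x than 8x, because input tokens are priced much lower than output tokens on both platforms.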

Z.ai also offers a "GLM Coding Plan" for teams that want bundled access for IDE integrations, with tiers starting at approximately $10 per month (billed quarterly as $30/quarter for the Lite plan).

Local Deployment

Because GLM-5.1 is MIT-licensed, you can download the weights from Hugging Face (zai-org/GLM-5.1) and run them locally without any usage restrictions or royalty fees. You own your deployment entirely.

Running a 754B MoE model locally requires substantial hardware: although only ~40 billion parameters are active per token, the full set of weights still has to be loaded, so you will need multiple high-end GPUs or aggressive quantization. The community is actively sharing quantized versions that reduce the footprint considerably. For a guide to running large models locally, see our build local AI guide with Ollama and Qwen 3.
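To put rough numbers on that hardware requirement: all 754 billion weights must be loadable regardless of how many are active per token, so a weights-only estimate (ignoring KV cache, activations, and runtime overhead) looks like this.

```python
def weight_footprint_gb(total_params_billion, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes."""
    return total_params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weight_footprint_gb(754, 16))  # FP16: 1508.0 GB
print(weight_footprint_gb(754, 4))   # 4-bit quantized: 377.0 GB
```

Even at 4-bit quantization, roughly 377 GB of weights implies a multi-GPU rig or a large unified-memory machine, which is why the quantized community builds matter so much for local use.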


How GLM-5.1 Compares to Other Open-Weight Models

The open-source frontier has become genuinely competitive. GLM-5.1's primary open-weight competitors in April 2026 are DeepSeek-V3.2 (MIT license, 128K context) from DeepSeek, and Qwen 3.6-35B-A3B (Apache 2.0 license, released April 16, 2026) from Alibaba's Qwen team.

GLM-5.1's distinguishing advantages in this peer group are its larger context window (200K vs 128K), its MIT license, and the longest documented autonomous run time of any publicly available model. On the Code Arena independent leaderboard, GLM-5.1's Elo of 1,530 puts it ahead of both, though all three are in striking distance of each other.

Standardized independent SWE-bench Pro comparisons across these three models are not yet published as of this writing. Until a third-party lab evaluates all three under the same scaffolding, treat vendor-reported scores as directional indicators rather than precise rankings.

A notable pattern: the top open-weight coding models in April 2026 all come from Chinese AI labs — Z.ai, DeepSeek, and Qwen — each navigating US export restrictions in different ways. For more on how Huawei's chip ecosystem supports this development, see our Huawei Ascend chip breakdown.


The Geopolitical Dimension

Zhipu AI's placement on the US Entity List in January 2025 was intended to restrict access to advanced computing — specifically NVIDIA's H100 and H200 GPUs, which dominate frontier model training globally.

GLM-5.1 demonstrates that this restriction did not stop frontier-level training. The 100,000 Huawei Ascend 910B chips used represent a massive domestic compute cluster built without any US silicon. The model's performance — reaching Code Arena Elo 1,530 and holding the top SWE-bench Pro position for nine days — suggests the Ascend ecosystem has crossed a threshold where it can produce genuinely competitive results[4].

This has implications beyond Zhipu: it signals that China's alternative compute stack is now capable of supporting frontier model development, a conclusion that analysts tracking the US-China AI competition will be watching closely. The Stanford AI Index 2026 noted the performance gap between US and Chinese models has compressed to just 2.7 percentage points[9].


Bottom Line

GLM-5.1 is the clearest evidence yet that the gap between open-weight and closed frontier models has collapsed to nearly nothing on coding benchmarks. A model built without a single NVIDIA chip, by a company on a US trade blacklist, using hardware that was supposed to be a generation behind — briefly held the top score on the industry's hardest coding benchmark.

The nine days it spent at the top of SWE-bench Pro matter because they prove that sustained frontier performance is no longer exclusive to companies with unlimited access to US silicon. For developers, the practical message is more immediate: MIT-licensed, 200K-context, $0.95-per-million-token frontier-class coding performance is now real and available today.


References

Footnotes

  1. Z.ai. "GLM-5.1 Model Architecture." Z.AI Developer Documentation. Published April 7, 2026. https://docs.z.ai/guides/llm/glm-5.1

  2. Z.ai. "Z.ai IPO on Hong Kong Stock Exchange." January 8, 2026. https://en.wikipedia.org/wiki/Z.ai

  3. Federal Register. "Addition of Entities to and Revision of Entry on the Entity List." Published January 16, 2025. https://www.federalregister.gov/documents/2025/01/16/2025-00704/addition-of-entities-to-and-revision-of-entry-on-the-entity-list

  4. MarkTechPost. "Z.AI Introduces GLM-5.1: An Open-Weight 754B Agentic Model That Achieves SOTA on SWE-Bench Pro and Sustains 8-Hour Autonomous Execution." Published April 8, 2026. https://www.marktechpost.com/2026/04/08/z-ai-introduces-glm-5-1-an-open-weight-754b-agentic-model-that-achieves-sota-on-swe-bench-pro-and-sustains-8-hour-autonomous-execution/

  5. Scale AI. "SWE-Bench Pro: Raising the Bar for Agentic Coding." https://scale.com/blog/swe-bench-pro

  6. Awesome Agents. "GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro." Published April 2026. https://awesomeagents.ai/reviews/review-glm-5-1/

  7. NerdLevelTech. "Claude Opus 4.7: Benchmarks, Features & Pricing." Published April 17, 2026. https://nerdleveltech.com/claude-opus-4-7-benchmarks-features-pricing

  8. Z.ai Pricing. "GLM Coding Plan." https://z.ai/subscribe

  9. Stanford HAI. "Inside the AI Index: 12 Takeaways from the 2026 Report." Published April 2026. https://hai.stanford.edu/news/inside-the-ai-index-12-takeaways-from-the-2026-report

Frequently Asked Questions

Can I use GLM-5.1 commercially for free?

Yes. The MIT license permits download, modification, fine-tuning, and commercial deployment with no royalty fees or usage restrictions. You need only your own hardware or cloud budget.
