Nex-N2-Pro: Open-Weight Coder vs GPT-5.5 (2026)
June 10, 2026
Nex-N2-Pro is a free, open-weight 397-billion-parameter coding model from Nex AGI, post-trained on Alibaba's Qwen3.5.12 On its own benchmarks it leads the open-weight pack and edges GPT-5.5 on SWE-Bench Pro, but it trails the closed frontier on the hardest agentic and reasoning tasks.13
Released June 2, 2026 and now spreading across free endpoints, it is the latest entry in an early-June surge of open-weight coding models — arriving a day after MiniMax M3.45 Two things set it apart: its headline feature is an "Agentic Thinking" framework that decides on its own how hard to reason, and — unlike M3, whose weights were still pending at launch — you can actually download Nex-N2-Pro right now.12
TL;DR
Nex-N2-Pro is the open-weight, agentic coding model from Nex AGI, an open-source alliance initiated by the Shanghai Innovation Institute.16 It is a Mixture-of-Experts model with 397 billion total parameters but only 17 billion active per token, post-trained on Alibaba's Qwen3.5-397B-A17B base, with a 262K-token context window and image input.247 Its signature trick is Adaptive Thinking, which auto-adjusts reasoning depth per step and, Nex says, cuts thinking tokens by 30–50% versus always-on reasoning at equal or better quality.4 On Nex's own benchmark table, Nex-N2-Pro scores 80.8% on SWE-Bench Verified, 58.8% on SWE-Bench Pro, and 75.3 on Terminal-Bench 2.1, narrowly beating GPT-5.5 on SWE-Bench Pro and Claude Opus 4.7 on Terminal-Bench, while leading or matching rival open models like MiniMax M3, DeepSeek-V4-Pro, GLM-5.1, and Kimi-K2.6 on most rows.1 Three caveats matter: every number is self-reported on Nex's own harness; the comparison baseline is Opus 4.7, not the newer Opus 4.8; and on the toughest agentic tests the gap to the closed frontier is wide (DeepSWE: 33.6 vs GPT-5.5's 70).138 The weights are free under Apache 2.0, and it runs free on OpenRouter and SiliconFlow during launch — but self-hosting the full model needs roughly two nodes of 8×H100.19
What is Nex-N2-Pro?
Nex-N2-Pro is a large language model built for coding and agentic work, released and open-sourced on June 2, 2026 by Nex AGI (also styled "Nex").14 Nex is not a conventional startup but an open-source alliance "initiated by the Shanghai Innovation Institute," developing in the open under the nex-agi organization on GitHub and Hugging Face, with partners including Shanghai Qiji Zhifeng, Mosi Intelligence, and KuafuAI.6 The group ships a full agent stack — models, an agent framework (NexAU), data pipelines, and training infrastructure — of which Nex-N2 is the flagship model.6
Architecturally, Nex-N2-Pro is a sparse Mixture-of-Experts (MoE) model with 397 billion total parameters and 17 billion active per token, so its inference cost behaves closer to a 17B dense model than its headline size suggests.24 It accepts text and images as input and produces text, supports a 262,000-token context window, and ships under the permissive Apache 2.0 license with downloadable weights on Hugging Face and ModelScope.24
The crucial detail — and one the marketing soft-pedals — is that Nex-N2-Pro is post-trained on Alibaba's Qwen3.5-397B-A17B, not a base model Nex built from scratch.12 The 397B/17B architecture, the native multimodality, and the long-context efficiency all come from Qwen3.5, Alibaba's flagship open-weight MoE from February 2026, which fuses linear attention (via Gated Delta Networks) with a sparse MoE.7 Nex AGI's actual contribution sits on top: the agentic post-training. The previous generation, Nex-N1, was built the same way on a different base — DeepSeek-V3.1 — so swapping in Qwen3.5 for N2 is the lineage here.4
What is Agentic Thinking (and Adaptive Thinking)?
Agentic Thinking is Nex's framework for unifying reasoning, tool use, and environment execution into a single loop, instead of bolting them together as separate capabilities.1 It is the headline idea behind the model, and it has two halves.1
Adaptive Thinking lets the model decide, on its own, when to think and how deeply — running simple actions fast while reserving thorough reasoning for the decisions that matter. Nex's specific claim is that this cuts thinking tokens by 30–50% on routine steps versus always-on reasoning, with equal or better task performance.4 That is the part worth caring about: long agentic runs burn most of their cost on reasoning tokens, so a model that reasons only when it needs to is cheaper to operate at the same quality — if the claim holds up under independent testing.
Coherent Thinking is the second half: carrying one consistent reasoning style across general tasks and diverse agentic work, so capability transfers cleanly between, say, terminal execution and multimodal generation.1 In practice both behaviors are exposed through standard tooling — Nex-series models emit explicit reasoning traces and support function calling, using the same qwen3 reasoning parser and qwen3_coder tool-call parser as their Qwen base, and they advertise plug-and-play use with Claude Code, Cursor, and other agent harnesses.24
Nex-N2-Pro benchmarks: strong, but self-reported
On Nex's published evaluation suite, Nex-N2-Pro's coding numbers are genuinely good. It posts 80.8% on SWE-Bench Verified (fixing real bugs in real repositories), 58.8% on SWE-Bench Pro, 75.3 on Terminal-Bench 2.1, and 1585 on GDPval — OpenAI's benchmark for real-world, economically valuable knowledge work scored by expert graders across 44 occupations.110 Nex frames the model as keeping "pace with top-tier models such as GPT-5.5 and Opus 4.7."1
Here is the context most coverage drops. Every one of those numbers comes from Nex AGI's own benchmark table — its harness, its scaffolding, its choice of baselines.1 That is standard practice for a model launch, but it makes the results claims awaiting independent replication, not settled facts. The table even shows the seams: it lists GPT-5.5 at 82.9% on SWE-Bench Verified, whereas OpenAI officially reports GPT-5.5 at 88.7% on the same benchmark, and independent harnesses like Vals.ai put GPT-5.5 lower still at ~82.6%.311 In other words, the comparison column is a mix of harness conditions, which is exactly why a self-run table should be read as a strong opening bid rather than a verdict.
| Benchmark (Nex-reported) | Nex-N2-Pro | GPT-5.5 | Opus 4.7 | MiniMax M3 | DeepSeek-V4-Pro |
|---|---|---|---|---|---|
| SWE-Bench Verified | 80.8 | 82.9 | 87.6 | 80.5 | 80.6 |
| SWE-Bench Pro | 58.8 | 58.6 | 64.3 | 59.0 | 55.4 |
| Terminal-Bench 2.1 | 75.3 | 83.4 | 69.7 | 66.0 | 72.0 |
| DeepSWE | 33.6 | 70 | 54 | — | 8 |
| GDPval | 1585 | 1769 | 1753 | — | 1554 |
| GPQA Diamond | 90.7 | 93.6 | 94.2 | — | 90.1 |
All figures as published by Nex AGI.1 Higher is better in every row; GDPval is a score, the rest are percentages.
Nex-N2-Pro vs GPT-5.5 and Claude Opus: the honest read
Strip away the "keeps pace" framing and the picture is more specific. Against GPT-5.5, Nex-N2-Pro edges ahead on SWE-Bench Pro (58.8 vs 58.6) but trails on SWE-Bench Verified (80.8 vs 82.9), on Terminal-Bench 2.1 (75.3 vs 83.4), and badly on DeepSWE (33.6 vs 70).1 Against Claude Opus 4.7, it wins Terminal-Bench 2.1 (75.3 vs 69.7) but loses SWE-Bench Verified by nearly seven points (80.8 vs 87.6) and DeepSWE by twenty (33.6 vs 54).1 The selective wins are real; so are the losses.
Two structural caveats deepen the point. First, the comparison ceiling is Claude Opus 4.7, but Anthropic shipped Claude Opus 4.8 on May 28, 2026 — before Nex-N2-Pro's June 2 launch — and Opus 4.8 reports roughly 88.6% on SWE-Bench Verified.8 Measured against the Anthropic model that was actually current, the gap is wider than the table implies. It is the same flattering-baseline pattern we just flagged in the MiniMax M3 launch, where M3 was also benchmarked against Opus 4.7 rather than 4.8.5 Second, the DeepSWE and GDPval rows are where the closed frontier pulls clearly ahead: a 33.6 on DeepSWE against GPT-5.5's 70 is not "near-parity" on hard agentic coding, it is roughly half. (We dug into how brittle and gameable that benchmark is in our DeepSWE breakdown.)
The open-weight pack: where Nex-N2-Pro actually leads
The fairer frame is not Nex-N2-Pro versus the closed frontier — it is Nex-N2-Pro versus the other open-weight challengers, and there it looks like a leader. On Nex's table it matches or beats MiniMax M3, DeepSeek-V4-Pro, GLM-5.1, and Kimi-K2.6 on most rows: 80.8 on SWE-Bench Verified sits at the top of a tight open-weight cluster (M3 80.5, DeepSeek-V4-Pro 80.6, Kimi-K2.6 80.2); on Terminal-Bench 2.1 its 75.3 clears MiniMax M3's 66.0 and DeepSeek-V4-Pro's 72.0; and on DeepSWE its 33.6, while far behind the closed models, towers over the open field (Kimi-K2.6 24, GLM-5.1 18, DeepSeek-V4-Pro 8).1
That positioning is the story. Over the past few months we have watched Chinese open-weight coding labs run an aggressive cost-and-capability war, from DeepSeek V4 to GLM-5.1 to Kimi K2.6. Nex-N2-Pro's pitch is to take one of the strongest open bases — Alibaba's Qwen3.5 — and squeeze more agentic performance out of it through post-training, then give it away. If the numbers replicate, it is the new front-runner of the open-weight coding pack, even though it is not a genuine threat to GPT-5.5 or Opus 4.8 on the hardest tasks.
Nex-N2-Pro pricing: free now, but heavy to host
The price is the easy part: Nex-N2-Pro is free. The weights are Apache 2.0 and downloadable from Hugging Face and ModelScope, and during the launch window the model runs at no cost on hosted endpoints — $0 input and $0 output on OpenRouter's free tier (rate-limited to roughly 50 requests/day and 20/minute) and free early access on SiliconFlow.29 For a model posting frontier-adjacent coding scores, free hosted access is a strong way to drive adoption.
The catch is hardware. "Open weight" does not mean "runs on your laptop." Nex's own deployment guide launches Nex-N2-Pro across two nodes of 8× H100 GPUs (tensor-parallel 16) using its customized SGLang fork.1 That is a serious cluster — far beyond what an individual can self-host — so for most people "free" in practice means the rate-limited hosted tiers, not local inference. The smaller sibling, Nex-N2-mini (built on Qwen3.5-35B-A3B-Base), is the more self-hostable option Nex points to, running on a single 2× H100 box, though it gives up real ground on the benchmarks (74.4 vs 80.8 on SWE-Bench Verified).1
Bottom line
Nex-N2-Pro is a real step for open-weight coding, wrapped in a launch that oversells the comparison. The Agentic Thinking idea — reasoning only as deeply as a step requires — is a sensible answer to the cost of long agent runs, and giving the weights away free under Apache 2.0 is the kind of move that drives fast adoption.14 But the benchmark narrative leans on self-run tests against Opus 4.7, a baseline Anthropic had already replaced with Opus 4.8, and the hardest agentic tests (DeepSWE, GDPval) show the closed frontier is still comfortably ahead.18 The honest summary: Nex-N2-Pro is probably the strongest open-weight coding model you can download today, and a weaker claimant to outright GPT-5.5 parity than its own charts suggest. Independent benchmarks — and the experience of actually running it inside a real agent harness — will settle which half of that sentence matters more.
Related reading: MiniMax M3: open-weight coding at 1/10 the cost, DeepSeek V4: open-weight frontier at 1/7 the cost, and China's open-weight coding wave.
Footnotes
-
Nex AGI, "Nex-N2: An agentic model with Agentic Thinking" — official GitHub README (release, Agentic Thinking / Adaptive Thinking framework, full benchmark table, Qwen3.5 base, two-node 8×H100 deployment guide). https://github.com/nex-agi/Nex-N2 ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8 ↩9 ↩10 ↩11 ↩12 ↩13 ↩14 ↩15 ↩16 ↩17 ↩18 ↩19 ↩20 ↩21 ↩22 ↩23 ↩24 ↩25 ↩26
-
"nex-agi/Nex-N2-Pro," Hugging Face model card (Apache-2.0 license, 397B params, MoE
qwen3_5_moearchitecture, image-text-to-text, benchmark table, parsers). https://huggingface.co/nex-agi/Nex-N2-Pro ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8 ↩9 ↩10 ↩11 ↩12 ↩13 -
"SWE-Bench Leaderboard," marc0.dev (GPT-5.5 #1 at 88.7% on SWE-Bench Verified, OpenAI-reported, released Apr 23, 2026; Claude Opus 4.7 87.6%, Apr 16, 2026), and Vals.ai independent harness (GPT-5.5 ~82.6%, Opus 4.7 ~82.0%) showing harness variance. https://www.marc0.dev/en/leaderboard ↩ ↩2 ↩3 ↩4
-
"Nex-N2-Pro — Model Info, Parameters, Benchmarks," SiliconFlow (created Jun 2, 2026; 262K context window; FP8; Apache-2.0; Adaptive Thinking 30–50% thinking-token claim; prior-generation DeepSeek-V3.1-Nex-N1). https://www.siliconflow.com/models/nex-n2-pro ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8 ↩9 ↩10 ↩11
-
"New Models Today — AI & LLM Releases," Price Per Token (Nex-N2-Pro free listing on OpenRouter; MiniMax M3 release context). https://pricepertoken.com/news/model-releases ↩ ↩2
-
"Nex: Nexus of Agentic Intelligence," Nex AGI (alliance initiated by the Shanghai Innovation Institute; partners and full-stack agent ecosystem — models, NexAU framework, data and RL infrastructure). https://nex-agi.com/en/ ↩ ↩2 ↩3
-
"Alibaba's latest flagship Qwen3.5 models are open-weights MoE performers," DeepLearning.AI The Batch, and "Qwen3.5-397B-A17B — Everything you need to know," Artificial Analysis (Feb 2026 release, 397B total / 17B active, hybrid linear-attention + sparse MoE, native multimodal, Apache 2.0). https://www.deeplearning.ai/the-batch/alibabas-latest-flagship-models-are-open-weights-moe-performers-in-sizes-from-less-than-1b-parameters/ ↩ ↩2 ↩3 ↩4
-
"Anthropic's Claude Opus 4.8 is here," VentureBeat (Opus 4.8 released May 28, 2026; ~88.6% on SWE-Bench Verified). https://venturebeat.com/technology/anthropics-claude-opus-4-8-is-here-with-3x-cheaper-fast-mode-and-near-mythos-level-alignment ↩ ↩2 ↩3
-
"Nex AGI: Nex-N2-Pro (free) — API Pricing & Providers," OpenRouter ($0 input / $0 output free tier; free-tier rate limits). https://openrouter.ai/nex-agi/nex-n2-pro:free/pricing ↩ ↩2 ↩3
-
"GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks," OpenAI (benchmark across 44 occupations in the top 9 GDP sectors, expert blind grading). https://openai.com/index/gdpval/ ↩
-
"SWE-bench Verified," Vals.ai independent benchmark harness (GPT-5.5 82.60%, Claude Opus 4.7 82.00%). https://www.vals.ai/benchmarks/swebench ↩