DeepSeek V4: Open-Weight Frontier at 1/7 the Cost
May 2, 2026
TL;DR
- On April 24, 2026, DeepSeek released DeepSeek V4 as a preview — a two-model open-weight family shipped under the MIT license: V4-Pro (1.6 trillion total parameters, 49 billion active per token, pre-trained on 33 trillion tokens) and V4-Flash (284 billion total, 13 billion active) [1][2][3]. Both models support a 1 million-token context window and 384,000-token maximum output [4][5].
- V4-Pro (at maximum reasoning effort) scores 80.6% on SWE-bench Verified — statistically tied with the previous-generation Claude Opus 4.6 (80.8%) but trailing the current frontier Claude Opus 4.7 (87.6%) by 7 points and GPT-5.5 (~82.6% on vals.ai's leaderboard) by 2 points [6][7][8][9]. On LiveCodeBench, V4-Pro scores 93.5%, leading Gemini 3.1 Pro (91.7%) and Claude Opus 4.6 (88.8%) [7].
- List API pricing lands at $1.74 per million input tokens (cache miss) and $3.48 per million output tokens for V4-Pro, with cache hits at $0.174 per million — roughly one-seventh the output cost of Claude Opus 4.7 ($5/$25) and one-eighth that of GPT-5.5 ($5/$30) [10][11]. A 75% launch promo discounts V4-Pro to $0.435/$0.87 until May 31, 2026 [10].
- The architecture's headline change is a Hybrid Attention mechanism combining Compressed Sparse Attention and Heavily Compressed Attention, cutting V4-Pro's per-token inference FLOPs to 27% of V3.2's and its KV cache to 10% at the 1M-token setting [4][12].
- Huawei announced "full support" via its Ascend SuperNode product line for inference, while V4-Pro itself appears to have been trained primarily on NVIDIA GPUs [8][9][13].
What You'll Learn
- Why the V4 architecture's Hybrid Attention is more than a marketing label
- The exact V4-Pro and V4-Flash benchmark scores against Opus 4.7, GPT-5.5, and Gemini 3.1 Pro
- API pricing tiers — including the cache-hit price and the 75% launch promo that expires May 31
- What the Huawei Ascend partnership does and does not mean about training versus inference
- A practical decision matrix for when to pick V4-Pro versus the closed frontier
- Where V4 wins, where it trails, and where the open-weight frontier sits today
A Two-Model Open-Weight Release Under MIT License
DeepSeek shipped V4 as a preview release on April 24, 2026, roughly five months after V3.2's December 2025 launch [2][14]. The previous-generation deepseek-chat and deepseek-reasoner API endpoints are scheduled for full retirement on July 24, 2026, after which V4 becomes the only path on the official API [1].
The family contains two distinct Mixture-of-Experts (MoE) models:
| Model | Total params | Active params | Pre-training tokens | Context | Max output |
|---|---|---|---|---|---|
| DeepSeek V4-Pro | 1.6T | 49B | 33T | 1M | 384K |
| DeepSeek V4-Flash | 284B | 13B | 32T | 1M | 384K |
Both models are released under the MIT license and published as open weights on Hugging Face. The instruct variants use mixed FP4/FP8 precision — MoE expert weights in FP4 and the rest in FP8 — while the base models ship in mixed FP8 precision [15]. Both expose three reasoning-effort modes — non-think, think (high), and think (max) — with thinking enabled by default at the high setting [16].
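For teams wiring V4 into an existing stack, mode selection should be an API-level switch. A minimal sketch, assuming the V4 endpoints keep the OpenAI-compatible shape of DeepSeek's current API; the model id and the `reasoning_effort` field are assumptions, not documented parameter names:

```python
# Hedged sketch: assumes DeepSeek's V4 API stays OpenAI-compatible and exposes
# the three effort modes through a reasoning_effort-style field. The model id
# and the parameter name are assumptions, not documented values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",   # assumed model id
    reasoning_effort="high",   # assumed values: "none" | "high" | "max"
    messages=[{"role": "user", "content": "Summarize the tradeoffs of MoE routing."}],
)
print(response.choices[0].message.content)
```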
A note on terminology: this is an open-weight release, not strictly open-source. DeepSeek published the weights under MIT, but did not release the full training data or the training pipeline source. That posture is the same as Llama 4, Qwen 3.6, and most other "open" frontier models [15].
That MIT license still matters. Llama 4's release uses Meta's custom Community License, which restricts companies above 700 million monthly active users and is not OSI-approved as open source [17]. Some other open-weight releases (including Kimi K2.6) ship under "modified MIT" terms. DeepSeek choosing standard MIT — one of the most permissive widely-used open-source licenses — means V4 weights can be downloaded, fine-tuned, redistributed, and run commercially with no legal acrobatics.
Hybrid Attention Architecture: Why the Million-Token Window Is Cheaper
V4 is not the first open-weight model to ship a 1M-token context — Llama 4 Maverick offered 1M context in 2025, and Qwen 3.6 Plus opened a 1M-token preview on March 31, 2026 [12]. Calling V4 the first 1M-token open-weight model is wrong. What V4 does claim is a fundamentally cheaper way to actually serve that context.
The architectural headline of V4 is a Hybrid Attention mechanism that interleaves two new attention layers across the Transformer stack [4][12].
Compressed Sparse Attention (CSA) compresses the key-value cache of every m tokens into a single entry using a learned token-level compressor. Each query then attends only to the top-k selected compressed entries via DeepSeek Sparse Attention (DSA). A sliding window branch runs in parallel for local-dependency modeling.
Heavily Compressed Attention (HCA) is more aggressive: it consolidates m' tokens (where m' is much larger than m) into a single compressed entry, then applies dense attention across those compressed representations. Layers alternate between CSA and HCA, so the model gets both fine-grained local recall and coarse-grained global context without paying full quadratic-attention cost everywhere.
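To make the mechanism concrete, here is a minimal single-head sketch of the CSA path (compress, select top-k, attend, plus a sliding-window branch). Mean-pooling stands in for the learned compressor, and all shapes and constants are illustrative rather than DeepSeek's actual configuration:

```python
# Minimal single-head sketch of the CSA idea described above, in NumPy.
# Assumptions: mean-pooling stands in for the learned token-level compressor,
# and m / top_k / window are illustrative, not DeepSeek's real settings.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def csa_step(q, K, V, m=16, top_k=4, window=32):
    """One query attending via (1) top-k compressed blocks + (2) a local window."""
    d = q.shape[-1]
    n_blocks = K.shape[0] // m
    # 1) Compress every m tokens' keys/values into one entry (learned in the real model).
    Kc = K[: n_blocks * m].reshape(n_blocks, m, d).mean(axis=1)
    Vc = V[: n_blocks * m].reshape(n_blocks, m, d).mean(axis=1)
    # 2) Score compressed entries and keep only the top-k for this query (the DSA-style step).
    scores = Kc @ q / np.sqrt(d)
    idx = np.argsort(scores)[-top_k:]
    global_out = softmax(scores[idx]) @ Vc[idx]
    # 3) Sliding-window branch over the most recent tokens for local recall.
    Kw, Vw = K[-window:], V[-window:]
    local_out = softmax(Kw @ q / np.sqrt(d)) @ Vw
    # The real model combines branches with learned mixing; a plain sum here.
    return global_out + local_out

rng = np.random.default_rng(0)
n, d = 1024, 64
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
q = rng.standard_normal(d)
print(csa_step(q, K, V).shape)  # (64,)
```

An HCA layer, by the description above, would run the same compression with a much larger block size m' and then attend densely over all compressed entries instead of selecting top-k.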
The technical report also introduces Manifold-Constrained Hyper-Connections (mHC), an extension of residual connections meant to stabilize signal flow across layers, and uses the Muon optimizer — a relatively new alternative to AdamW that has been showing up in several large training runs in late 2025 and early 2026 [15].
The efficiency numbers are the point. At a 1M-token context:
| Model | Per-token inference FLOPs vs V3.2 | KV cache size vs V3.2 |
|---|---|---|
| V4-Pro | 27% | ~10% |
| V4-Flash | 10% | 7% |
That is the gap between "1M context exists on a spec sheet" and "1M context is something an agent can actually use without burning a server farm." The Hugging Face technical write-up frames it as "a million-token context that agents can actually use" — which, given how often 1M-context claims fall apart in practice, is the harder problem to solve [12].
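A quick back-of-envelope makes the serving implication concrete. Both the V3.2 KV-cache baseline and the HBM budget below are hypothetical figures chosen for illustration; only the ratios come from DeepSeek's published numbers:

```python
# Back-of-envelope: what the published ratios imply for serving capacity.
# The 100 GB V3.2 baseline and 640 GB HBM budget are hypothetical figures;
# only the 10% (V4-Pro) and 7% (V4-Flash) KV-cache ratios come from the report.
V32_KV_GB_PER_1M_SEQ = 100.0   # assumed V3.2 KV cache for one 1M-token sequence
HBM_BUDGET_GB = 640.0          # assumed memory set aside for KV cache

kv_ratios = {"V3.2": 1.00, "V4-Pro": 0.10, "V4-Flash": 0.07}
for model, ratio in kv_ratios.items():
    per_seq = V32_KV_GB_PER_1M_SEQ * ratio
    sessions = int(HBM_BUDGET_GB // per_seq)
    print(f"{model}: {per_seq:.0f} GB per 1M-token session -> "
          f"{sessions} concurrent sessions in {HBM_BUDGET_GB:.0f} GB")
```

Same hardware, roughly an order of magnitude more concurrent long-context sessions: that is the claim the table is making.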
Benchmark Scorecard: V4-Pro vs the Frontier
The DeepSeek team published benchmarks across coding, math, reasoning, and agentic categories. The pattern is consistent: V4-Pro leads or ties on coding and competitive programming when measured against the previous generation, sits in the upper-middle pack on math, and trails meaningfully on the hardest expert-knowledge benchmarks against the current frontier (Opus 4.7 and GPT-5.5).
Coding and software engineering
V4 exposes three reasoning-effort modes — non-think, think (high), and think (max). The headline scores below are reported at the think (max) setting (V4-Pro-Max), so they are compared against competitor models at their maximum-effort settings where applicable [18].
| Benchmark | DeepSeek V4-Pro (Max) | Claude Opus 4.6 | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| SWE-bench Verified | 80.6% | 80.8% | 87.6% | ~82.6% | — |
| LiveCodeBench | 93.5% | 88.8% | — | — | 91.7% |
| Codeforces (rating) | 3206 | — | — | 3168 (GPT-5.4) | 3052 |
| Terminal-Bench 2.0 | 67.9% | 65.4% | — | 75.1% (GPT-5.4 xHigh) | 68.5% |
The honest framing on SWE-bench Verified: V4-Pro is statistically tied with Opus 4.6 (the previous Anthropic flagship released before V4) but trails Opus 4.7 by 7 points and GPT-5.5 by ~2 points (per vals.ai's leaderboard tracking) [8][9]. Note that benchmark-aggregation sources disagree on GPT-5.5 — BenchLM and TokenMix report 88.7%, while vals.ai reports 82.6%. The discrepancy likely reflects standard tier versus Pro/high-effort settings; treat 82.6% as a conservative floor.
On LiveCodeBench, V4-Pro leads the comparable Anthropic and Google flagships outright. On Terminal-Bench 2.0, it edges Claude Opus 4.6 by 2.5 points but trails GPT-5.4 (xHigh setting) by 7.2 points and Gemini 3.1 Pro by 0.6 points [18].
Math and reasoning
| Benchmark | DeepSeek V4-Pro (Max) | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| IMO AnswerBench | 89.8 | 75.3 | 91.4 | 81.0 |
| HMMT 2026 | 95.2% | 96.2% | 97.7% | — |
| MMLU-Pro | 87.5% | — | — | ~91% |
| HLE (Humanity's Last Exam, no tools) | 37.7% | 40.0% | 39.8% | 44.4% |
V4-Pro takes IMO AnswerBench against Claude Opus 4.6 and Gemini 3.1 Pro and sits within striking distance of GPT-5.4. On HMMT 2026 and HLE without tools, the closed frontier models pull ahead [19][20][14].
HLE is particularly telling: it is the cross-domain expert benchmark where world-knowledge depth shows up, and V4-Pro lands more than six points behind Gemini 3.1 Pro. Against the newer Claude Opus 4.7 (released April 16, 2026) and GPT-5.5 (April 23, 2026), the gap on HLE widens further — Opus 4.7 reaches 46.9% and GPT-5.5 reaches 41.4% without tools [21].
Agentic and web browsing
On BrowseComp, the agentic web-browsing benchmark, the V4-Pro Max variant reportedly scores 83.4% — putting it in the upper-middle pack among frontier closed-source models, with Claude Opus 4.7 at 79.3% and Gemini 3.1 Pro reported at 85.9% [22]. The DeepSeek team also says V4 has been optimized for use with agent stacks like Anthropic's Claude Code [4].
The honest summary that emerges across these benchmarks: V4-Pro is decisively the best open-weight model on coding and competitive programming, competitive on agentic tasks, and roughly 2–9 percentage points behind the closed frontier on world-knowledge reasoning and the hardest software-engineering benchmarks [2][8].
API Pricing: The Cache-Hit Math and the May 31 Promo Deadline
This is where V4 reshapes the conversation, though the number that matters is the list price that survives after the launch promo expires.
V4-Pro list pricing (post-promo, what budgets should plan for)
| Tier | Per million tokens |
|---|---|
| Input (cache hit) | $0.174 |
| Input (cache miss) | $1.74 |
| Output | $3.48 |
The cache-hit price is exactly 1/10 of the cache-miss price — DeepSeek dropped it from the original launch ratio on April 26, 2026 [10].
V4-Pro launch promo (active until May 31, 2026, 15:59 UTC)
| Tier | Per million tokens |
|---|---|
| Input (cache hit) | $0.0435 |
| Input (cache miss) | $0.435 |
| Output | $0.87 |
A 75% discount is in effect on V4-Pro through the end of May 2026 [10]. After that window, list pricing kicks in. Build budgets around the list price, not the promo — using $0.435/$0.87 as a planning number sets you up for a 4× sticker-shock when the promo ends.
V4-Flash (standard pricing)
| Tier | Per million tokens |
|---|---|
| Input (cache hit) | $0.0028 |
| Input (cache miss) | $0.14 |
| Output | $0.28 |
Frontier comparisons (V4-Pro list output cost)
| Model | Input | Output | V4-Pro savings (output) |
|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | ~7.2× cheaper |
| GPT-5.5 | $5.00 | $30.00 | ~8.6× cheaper |
| GPT-5.5 Pro | $30.00 | $180.00 | ~51.7× cheaper |
| GPT-5.4 | $2.50 | $15.00 | ~4.3× cheaper |
A workload that produces 100M output tokens per month on Claude Opus 4.7 costs $2,500. The same workload on V4-Pro at list is $348. The same workload on GPT-5.5 is $3,000 [11]. VentureBeat described V4-Pro as arriving "at 1/6th the cost" of Opus 4.7 and GPT-5.5; the math at list pricing is closer to 1/7 on output and roughly 1/3 on cache-miss input — but the order of magnitude holds [11].
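The cache-hit tier changes the blended input number substantially, so it is worth writing the arithmetic out. A small calculator using the list prices quoted above; the 300M-in/100M-out workload and the 60% cache-hit rate are illustrative assumptions, not measurements:

```python
# Cost sketch for the comparison above. Prices are the list figures quoted in
# this article, in dollars per million tokens; the workload mix is illustrative.
def monthly_cost(m_in, m_out, price_in, price_out, cache_hit=0.0, price_hit=None):
    """Monthly spend in dollars; m_in/m_out are millions of tokens."""
    hit_price = price_hit if price_hit is not None else price_in
    input_cost = m_in * ((1 - cache_hit) * price_in + cache_hit * hit_price)
    return input_cost + m_out * price_out

workload = dict(m_in=300, m_out=100)  # 300M input / 100M output tokens per month

v4 = monthly_cost(**workload, price_in=1.74, price_out=3.48,
                  cache_hit=0.6, price_hit=0.174)
opus = monthly_cost(**workload, price_in=5.00, price_out=25.00)
gpt = monthly_cost(**workload, price_in=5.00, price_out=30.00)
print(f"V4-Pro (list, 60% cache hits): ${v4:,.0f}")    # ~$588
print(f"Claude Opus 4.7:               ${opus:,.0f}")  # $4,000
print(f"GPT-5.5:                       ${gpt:,.0f}")   # $4,500
```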
The V4-Flash numbers are even more striking. At $0.14/$0.28, V4-Flash undercuts every Western frontier-tier "small" model — GPT-5.4 Nano, Gemini 3.1 Flash, GPT-5.4 Mini, and Claude Haiku 4.5 — while still scoring 91.6% on LiveCodeBench and shipping output at roughly 83.7 tokens per second on DeepSeek's API [23].
The Huawei Ascend Story: Inference Confirmed, Training Probably Still NVIDIA
Alongside the V4 release, Huawei announced "full support" for V4 inference on its Ascend AI processors. Huawei said its entire Ascend SuperNode product line was "fully adapted" to V4 ahead of the launch, with DeepSeek and Huawei collaborating closely in the run-up to the release [8][9].
DeepSeek separately reported that V4 demonstrates performance parity on Huawei Ascend NPUs and NVIDIA GPUs for inference workloads — a claim that, if independently confirmed, undercuts the assumption that Chinese AI labs need NVIDIA's latest silicon to serve a frontier-class model in production [24].
What the Ascend story does not establish is domestic training: V4-Pro itself was probably still trained primarily on NVIDIA hardware. DeepSeek's technical report does not describe a full Ascend-only training pipeline, and a computer-science professor at Tsinghua University told MIT Technology Review that long-context features in particular look like they came from NVIDIA GPUs [13]. Community speculation suggests V4-Flash (the smaller 284B model) may have been used to test Huawei training infrastructure, but DeepSeek has not confirmed this.
So "China builds frontier AI without NVIDIA" is the wrong takeaway for V4. The accurate version is more interesting: DeepSeek built an inference path that works on domestic chips, which is the part of the AI export-control story that actually changes day-to-day economics inside China — even if training still depends on smuggled-in or pre-restriction NVIDIA stockpiles.
The forward-looking piece: Huawei's Ascend 950PR chip launched in Q1 2026 and the complementary 950DT is slated to ship by end of 2026, with DeepSeek expecting Ascend 950 SuperNodes to reach scale availability in the second half of the year [8][25]. If 950DT supports a credible training pipeline, the next DeepSeek release could be the first one with a plausible end-to-end domestic-silicon story.
How V4 Compares to the Open-Weight Field
V4 enters a competitive open-weight landscape that has tightened considerably over the past six months:
| Model | Lab | Released | Open weights | License | Top benchmark strength |
|---|---|---|---|---|---|
| DeepSeek V4-Pro | DeepSeek | April 24, 2026 | Yes | MIT | SWE-bench, LiveCodeBench, Codeforces |
| GLM-5.1 | Z.ai | April 7, 2026 | Yes | Apache 2.0 | SWE-bench Pro (open-weight leader) |
| Kimi K2.6 | Moonshot AI | April 20, 2026 | Yes | Modified MIT | SWE-bench, agentic swarms |
| Qwen 3.6 Plus | Alibaba | March 31, 2026 (1M preview) | Yes | Apache 2.0 | Multilingual, long-context |
| Llama 4 (Scout/Maverick) | Meta | 2025 | Yes | Custom Community | General-purpose |
V4-Pro takes the open-weight crown on coding and competitive programming, while GLM-5.1 still leads on SWE-bench Pro at 58.4%. The headline is that the open-weight ceiling has moved into a position where, on most non-expert-knowledge benchmarks, it is within a single-digit gap of the closed frontier [2].
Decision Matrix: When to Pick V4 — And When Not To
Given the April–May 2026 landscape:
| Use case | Better fit | Why |
|---|---|---|
| High-volume coding agents on a price-per-task budget | V4-Pro | Output cost matters more than the last 5–7 SWE-bench points |
| Document analysis at 500K+ token contexts | V4-Pro or V4-Flash | Hybrid attention makes 1M-token serving cheap |
| Top-of-line agentic coding where accuracy gaps decide | Claude Opus 4.7 | Leads SWE-bench Verified at 87.6% — a 7-point gap over V4-Pro |
| Production deployments needing strict data residency in China | V4 on Huawei Ascend | Uniquely compatible inference pathway |
| Self-hosted on commodity hardware | V4-Flash | 13B active params make it tractable to host |
| Workloads where output tokens dominate cost | V4-Pro | The 7–8× output discount stacks fast at scale |
| Small-context cheap calls (chat, classifications) | V4-Flash or GPT-5.4 Mini | Neither needs a 1M-context model |
| Hardest world-knowledge reasoning | GPT-5.5 / Opus 4.7 / Gemini 3.1 Pro | V4-Pro is 6+ points behind on HLE without tools |
Caveats Worth Stating Clearly
Three stand out.
First, the benchmarks above are DeepSeek's own published numbers. Independent reproductions for V4-Pro are still arriving, and the model has been in public preview for just over a week as of writing. Some early third-party evaluations have come in slightly below DeepSeek's published figures on tasks like SWE-bench Pro [3]. Treat the official numbers as upper-bound claims until external benchmarks settle.
Second, the launch promotional pricing — $0.435/M input and $0.87/M output for V4-Pro — is a temporary discount, not the durable cost. The list price of $1.74/$3.48 is what budgets should be built around [10]. After May 31, 2026, anyone who priced their roadmap on the promo gets a 4× bill.
Third, "open weights" does not mean "easy to self-host." 1.6T parameters at FP4/FP8 mixed precision is still hundreds of gigabytes of weights, and serving 1M-token contexts at low latency requires significant accelerator capacity. For most teams, V4-Pro is open-weight in spirit and API-only in practice. V4-Flash is the more realistic self-hosting target.
What This Means for Builders
Three practical takeaways for teams deciding where to spend their token budget:
For coding-heavy workloads: V4-Pro at list pricing is roughly 7× cheaper on output than Claude Opus 4.7 while trailing it by 7 points on SWE-bench Verified. For high-volume code-generation and code-review workflows where the marginal task does not require frontier-tier reasoning, V4-Pro is the new default. Pair it with a frontier closed model for hard escalations, or use Opus 4.7 when accuracy gaps decide outcomes.
For agentic workflows: V4-Pro is competitive on BrowseComp and Terminal-Bench 2.0, trailing the top closed frontier models by a few points on each. For agent stacks where the orchestration layer can route hard subtasks to a closed frontier model and dispatch the rest to V4-Pro, the stack-level economics shift dramatically.
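A minimal version of that routing layer is just a threshold over a difficulty estimate. Everything in the sketch below (model ids, the difficulty score, the 0.8 cutoff) is illustrative rather than a reference implementation:

```python
# Sketch of the routing idea above: send most subtasks to V4-Pro, escalate the
# hard ones to a closed frontier model. Heuristics and names are illustrative.
from dataclasses import dataclass

@dataclass
class Subtask:
    prompt: str
    expected_difficulty: float  # 0..1, e.g. from a cheap classifier pass

CHEAP_MODEL = "deepseek-v4-pro"      # assumed model id
FRONTIER_MODEL = "claude-opus-4.7"   # assumed model id

def route(task: Subtask, escalation_threshold: float = 0.8) -> str:
    """Return the model id a dispatcher would call for this subtask."""
    if task.expected_difficulty >= escalation_threshold:
        return FRONTIER_MODEL  # pay frontier prices only where accuracy decides
    return CHEAP_MODEL         # ~7x cheaper output for the long tail of subtasks

tasks = [Subtask("rename variables across module", 0.2),
         Subtask("design migration for sharded DB", 0.9)]
for t in tasks:
    print(route(t), "<-", t.prompt)
```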
For self-hosters: V4-Flash at 284B total / 13B active is the more interesting number. The full V4-Pro requires substantial infrastructure even quantized; V4-Flash is in the range that a well-funded team can run on commodity GPU clusters or Huawei Ascend supernodes. Combined with MIT licensing and 1M context, V4-Flash is the strongest open-weight self-hosting target for code-and-agent workloads released to date.
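A first self-hosting experiment might look like the vLLM sketch below. The repo id follows the naming pattern of the published V4-Pro card but is an assumption for Flash, and the parallelism and context-length settings are placeholders to size against real hardware:

```python
# Hedged self-hosting sketch using vLLM's offline API. The repo id is an
# assumption for V4-Flash; tensor_parallel_size and max_model_len must be
# sized to the actual cluster rather than copied from here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # assumed Hugging Face repo id
    tensor_parallel_size=8,                 # size to your GPU/NPU topology
    max_model_len=131072,                   # raise toward 1M only with memory to match
    trust_remote_code=True,
)
out = llm.generate(["Write a binary search in Python."],
                   SamplingParams(max_tokens=256, temperature=0.2))
print(out[0].outputs[0].text)
```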
For broader context on how frontier-tier output costs are being squeezed across the board, see our analysis of GPT-5.5 and OpenAI's first retrained base since GPT-4.5.
The Bottom Line
DeepSeek V4 does not redraw the frontier — Claude Opus 4.7 and GPT-5.5 still lead on the hardest expert-knowledge and reasoning benchmarks, and Opus 4.7's 7-point lead on SWE-bench Verified is real. What V4 does is collapse the price of access to near-frontier capability. A 1.6T MoE that lands within striking distance of Opus 4.6 on SWE-bench Verified and leads on LiveCodeBench, with 1M context and MIT licensing, at $1.74 input and $3.48 output per million tokens (list), is a different kind of release than what we have seen from any other lab in the past quarter.
The Huawei Ascend integration adds a second layer: V4 is the first frontier-class Chinese AI release with a confirmed domestic-silicon inference pathway, even if the training pipeline still leans on NVIDIA. For a builder choosing where to spend the next million tokens of budget — especially before the May 31 promo expires — V4 is now the default open-weight pick on coding workloads and one of the strongest options on agentic tasks.
What remains genuinely open is whether DeepSeek can sustain this cadence. V3.2 shipped in December 2025; V4 in April 2026. If V4.x and V5 land on similar timelines, the gap between the open-weight ceiling and the closed frontier may keep closing. If they do not, V4 becomes the high-water mark for a lab that earned its reputation by punching above its compute budget.
References
1. TechCrunch — "DeepSeek previews new AI model that 'closes the gap' with frontier models", April 24, 2026.
2. Bloomberg — "DeepSeek Unveils Newest Flagship AI Model a Year after Upending Silicon Valley", April 24, 2026.
3. CNBC — "China's DeepSeek releases preview of long-awaited V4 model as AI race intensifies", April 24, 2026.
4. MarkTechPost — "DeepSeek AI Releases DeepSeek-V4: Compressed Sparse Attention and Heavily Compressed Attention Enable One-Million-Token Contexts", April 24, 2026.
5. NVIDIA Technical Blog — "Build with DeepSeek V4 Using NVIDIA Blackwell and GPU-Accelerated Endpoints", April 24, 2026.
6. NxCode — "DeepSeek V4 (2026): 1T Parameters, 81% SWE-bench, $0.30/MTok — Full Specs", April 24, 2026.
7. BuildFastWithAI — "DeepSeek V4-Pro Review: Benchmarks, Pricing & Architecture", April 24, 2026.
8. Anthropic — "Introducing Claude Opus 4.7", April 16, 2026 — 87.6% on SWE-bench Verified per third-party leaderboards (Opus 4.6 baseline 80.84%).
9. vals.ai — SWE-bench Verified leaderboard, tracking GPT-5.5 at 82.60% (vals.ai/benchmarks/swebench). BenchLM and TokenMix report 88.7% for GPT-5.5 at Pro/high-effort settings; treat 82.6–88.7% as the reported range depending on tier and methodology.
10. DeepSeek API Docs — "Models & Pricing". 75% V4-Pro discount until 2026-05-31 15:59 UTC; cache-hit price set to 1/10 of cache-miss after the April 26, 2026 adjustment.
11. VentureBeat — "DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5", April 24, 2026.
12. Hugging Face Blog — "DeepSeek-V4: a million-token context that agents can actually use", April 24, 2026. Includes precedent on Llama 4 Maverick and Qwen 3.6 Plus 1M-context releases.
13. MIT Technology Review — "Three reasons why DeepSeek's new model matters", April 24, 2026 — Tsinghua University CS professor on V4-Pro likely still trained primarily on NVIDIA GPUs, especially long-context features.
14. DEV Community — "DeepSeek Just Dropped V4. Here's What the Benchmarks Actually Tell You.", April 24, 2026.
15. Hugging Face — deepseek-ai/DeepSeek-V4-Pro, MIT License. Technical report; mHC and Muon optimizer details; FP4/FP8 precision.
16. vLLM Recipes — DeepSeek-V4-Pro, April 2026.
17. Llama 4 Community License Agreement — llama.com/llama4/license/ — 700M MAU restriction; classified as "source available" rather than OSI open source.
18. BenchLM — "DeepSeek V4 Pro Benchmarks 2026: Scores, Rankings & Performance", April 2026.
19. Officechai — "DeepSeek Releases DeepSeek V4-Pro & V4-Flash, Delivers GPT 5.4 & Opus 4.6-Level Performance At Fraction Of The Price", April 24, 2026.
20. AnalyticsIndiaMag — "DeepSeek Releases V4 Pro, Challenging OpenAI, Anthropic on Key Benchmarks", April 24, 2026.
21. SCMP — "Underwhelming or underrated? DeepSeek V4 shows 'impressive' gains", April 24, 2026 — HLE without-tools comparison vs Opus 4.7 and GPT-5.5.
22. Framia — "Third-party benchmark comparisons of DeepSeek V4 vs GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro", April 2026 — BrowseComp 83.4% for V4-Pro Max.
23. ArtificialAnalysis — "DeepSeek V4 Flash (Max) — Intelligence, Performance & Price Analysis", April 2026.
24. Phemex News — "DeepSeek V4 Matches NVIDIA on Huawei Ascend, Dispels Rumors", April 24, 2026.
25. TrendForce — "Decoding DeepSeek V4: How Huawei's Ascend 950PR Is Powering China's Push to Break CUDA Dependence", April 7, 2026 — Ascend 950PR launched Q1 2026; 950DT expected by end of 2026.