Claude Opus 4.7: Benchmarks, Features & Pricing
April 17, 2026
TL;DR
Anthropic released Claude Opus 4.7 on April 16, 2026. It leads SWE-bench Pro at 64.3% — ahead of GPT-5.4 (57.7%) and Gemini 3.1 Pro (54.2%) — and sets a new high on OSWorld-Verified at 78.0%, up from 72.7% for Opus 4.6. Pricing is unchanged at $5.00/$25.00 per million tokens input/output, though a new tokenizer uses up to 35% more tokens for equivalent text. Key additions include a new xhigh effort level, adaptive thinking (replacing extended thinking budgets), task budgets for agentic loops, and 3× higher image resolution for computer use.
What You'll Learn
- How Claude Opus 4.7 scores on SWE-bench Pro, OSWorld, GPQA, CursorBench, and other key benchmarks
- What changed vs. Claude Opus 4.6 — and what was removed
- The new xhigh effort level and adaptive thinking system
- Pricing, tokenizer changes, and what they mean for real API costs
- Where Opus 4.7 leads, where competitors catch up, and what's still invitation-only
Release Details
Claude Opus 4.7 became generally available on April 16, 2026, one day before this post. The API model ID is claude-opus-4-7, available on Anthropic's API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, Snowflake Cortex AI, and GitHub Copilot Enterprise[1].
The model ships as a single variant — there are no Thinking, Pro, or Mini tiers for Opus 4.7. Extended thinking budgets have been removed entirely (setting budget_tokens now returns a 400 error); adaptive thinking replaces them. Knowledge cutoff is January 2026[2].
Benchmark Results
Coding — Where Opus 4.7 Leads Most Clearly
SWE-bench is the standard measure for autonomous software engineering: given a GitHub issue, can the model write a pull request that passes the test suite? Opus 4.7 sets a new top score on the harder SWE-bench Pro variant:
| Model | SWE-bench Pro | SWE-bench Verified |
|---|---|---|
| Claude Opus 4.7 | 64.3% | 87.6% |
| GPT-5.4 | 57.7% | — |
| Gemini 3.1 Pro | 54.2% | 80.6% |
| Claude Opus 4.6 | 53.4% | 80.8% |
On SWE-bench Pro, Opus 4.7 leads GPT-5.4 by 6.6 points and Gemini 3.1 Pro by 10.1 points. On the standard SWE-bench Verified leaderboard, Opus 4.7 climbs from 80.8% to 87.6% — a 6.8-point improvement over its predecessor[3].
Anthropic also reports a 13% improvement on an internal 93-task coding benchmark, and Rakuten's production deployment found that Opus 4.7 resolves 3× more real production tasks than Opus 4.6 on their SWE-bench variant[4].
Computer Use — OSWorld
OSWorld-Verified measures autonomous desktop task completion (file management, browser navigation, multi-app workflows). The human expert baseline is approximately 72.4%:
| Model | OSWorld-Verified |
|---|---|
| GPT-5.4 | 75.0% |
| Claude Opus 4.7 | 78.0% |
| Claude Opus 4.6 | 72.7% |
| Human expert baseline | ~72.4% |
Opus 4.7 moves past GPT-5.4 on this benchmark, reaching 78.0% and extending the gap above the human baseline. The improvement is partly driven by a new 3.75MP image resolution ceiling for computer use — up from 1.15MP in Opus 4.6 — and coordinates that map 1:1 with pixels, removing the scale-factor math that previously introduced errors in screen coordinate targeting[5].
Developer Workflows — CursorBench
CursorBench evaluates real-world coding assistant tasks as they occur in an IDE environment. Opus 4.7 scores 70%, up from 58% for Opus 4.6 — a 12-point jump that positions it above competing models on this benchmark[6].
Graduate-Level Science — GPQA Diamond
On GPQA Diamond (graduate-level physics, chemistry, biology), the three frontier models are statistically indistinguishable:
| Model | GPQA Diamond |
|---|---|
| GPT-5.4 Pro | 94.4% |
| Gemini 3.1 Pro | 94.3% |
| Claude Opus 4.7 | 94.2% |
The differences here are within measurement noise. No single model holds a meaningful advantage on graduate-level scientific reasoning[7].
Knowledge Work — GDPVal-AA
GDPVal-AA is an Elo-based benchmark measuring general knowledge work across business analysis, document processing, and professional reasoning tasks:
| Model | GDPVal-AA (Elo) |
|---|---|
| Claude Opus 4.7 | 1,753 |
| GPT-5.4 | 1,674 |
| Gemini 3.1 Pro | 1,314 |
Opus 4.7 holds a 79-point advantage over GPT-5.4 on this benchmark, with Gemini 3.1 Pro trailing significantly at 1,314[8].
Security — XBOW Visual Acuity
On XBOW's visual acuity cybersecurity benchmark, Opus 4.7 scores 98.5%, versus 54.5% for Opus 4.6 — the largest single-generation jump of any benchmark in this release. Anthropic describes Opus 4.7 as a testbed for new cyber safeguards being validated before an eventual broader release of Mythos-class models[9].
Legal — BigLaw Bench
On Harvey's BigLaw Bench (professional legal reasoning), Opus 4.7 scores 90.9% at high effort[10].
What's New vs. Opus 4.6
Adaptive Thinking Replaces Extended Thinking Budgets
The biggest architectural change: extended thinking budgets are gone. Setting budget_tokens in your API request now returns a 400 error. In their place, Anthropic introduces adaptive thinking — off by default, opt-in via the API — which Anthropic says outperforms extended thinking in internal evaluations. The system dynamically allocates reasoning compute rather than requiring developers to set a token ceiling.
Thinking blocks still appear in the response stream, but the thinking field is empty by default unless you set display: "summarized" in your request[11].
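A minimal request sketch under these rules might look like the following. The shape of the thinking object and the display field are taken from this post's description and should be treated as assumptions, not verified API syntax:

```python
# Minimal Messages API payload opting in to adaptive thinking.
# Field names ("thinking", "type", "display") follow this post's
# description and are illustrative assumptions.
payload = {
    "model": "claude-opus-4-7",
    "max_tokens": 4096,
    # Adaptive thinking is off by default; opt in explicitly.
    # "display": "summarized" surfaces thinking text in the stream.
    "thinking": {"type": "adaptive", "display": "summarized"},
    "messages": [
        {"role": "user", "content": "Refactor this parser for clarity."}
    ],
}

# The removed "budget_tokens" field now triggers a 400 error, so a
# migration check like this is worth keeping in request-building code:
assert "budget_tokens" not in payload["thinking"]
```

The key migration step is deleting any budget_tokens plumbing rather than translating it: there is no equivalent knob to set, since the model allocates reasoning compute itself.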
New xhigh Effort Level
A new xhigh effort tier sits between the previous high and max levels, giving developers finer control over the reasoning/latency tradeoff. Anthropic recommends xhigh for coding and agentic use cases where you want stronger reasoning without paying the full max cost[12].
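As a sketch, selecting the new tier could look like this. Treating effort as a top-level request field, and the exact level names beyond high, xhigh, and max, are assumptions based on the post rather than confirmed syntax:

```python
# Hypothetical effort selection for a coding-agent request. The
# "effort" field name is an assumption; the post names only the
# "high", "xhigh", and "max" levels, so only those are listed here.
KNOWN_EFFORT_LEVELS = ("high", "xhigh", "max")

payload = {
    "model": "claude-opus-4-7",
    "max_tokens": 8192,
    "effort": "xhigh",  # stronger reasoning than "high", cheaper than "max"
    "messages": [{"role": "user", "content": "Plan the database migration."}],
}
assert payload["effort"] in KNOWN_EFFORT_LEVELS
```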
Task Budgets for Agentic Loops
A new public beta feature (task-budgets-2026-03-13 beta header) lets you set an advisory token budget across an entire agentic loop — not just a single model call. The minimum is 20,000 tokens. This is not a hard cap but guides the model's planning toward completing the task within budget, reducing runaway token usage in long agent workflows[13].
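A hypothetical request using the beta might look like this. The beta header value and the 20,000-token minimum come from the post; the task_budget field name is an illustrative assumption:

```python
# Hypothetical task-budget request for an agentic loop.
# Header value and minimum are from the post; "task_budget" is an
# assumed field name, not confirmed API syntax.
MIN_TASK_BUDGET = 20_000  # tokens; requests below this are invalid

headers = {
    "anthropic-beta": "task-budgets-2026-03-13",
    "anthropic-version": "2023-06-01",
}
payload = {
    "model": "claude-opus-4-7",
    "max_tokens": 4096,
    # Advisory, not a hard cap: guides planning across the whole loop.
    "task_budget": {"tokens": 60_000},
    "messages": [{"role": "user", "content": "Triage the failing CI jobs."}],
}
assert payload["task_budget"]["tokens"] >= MIN_TASK_BUDGET
```

Because the budget is advisory, you still want an independent hard limit (e.g. a loop-iteration cap) in your own agent harness; the budget shapes the model's planning, it does not enforce anything.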
3× Higher Image Resolution for Computer Use
Max image resolution climbs from 1,568px (1.15MP) to 2,576px (3.75MP) — more than three times the pixel count. Coordinate output now maps 1:1 to actual pixel positions, eliminating the scale-factor conversion errors that caused missed clicks in Opus 4.6 computer use deployments.
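A small sketch of why 1:1 mapping matters: under the old cap, a large display had to be downscaled for capture, and every coordinate the model emitted had to be rescaled back to screen pixels, which is where missed clicks crept in. The helper below is illustrative, not part of any Anthropic SDK:

```python
# Convert a model-emitted x coordinate to a screen pixel position.
def to_screen(coord: int, screenshot_width: int, screen_width: int) -> int:
    scale = screen_width / screenshot_width
    return round(coord * scale)

# Opus 4.6 style: a 2560px-wide display captured at the 1,568px cap,
# so every coordinate needed rescaling (a rounding/scale error source).
assert to_screen(784, 1568, 2560) == 1280

# Opus 4.7: the same display fits under the 2,576px cap, so the
# screenshot is native resolution and the mapping is the identity.
assert to_screen(1280, 2560, 2560) == 1280
```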
For a deeper look at how Claude's computer use capabilities are reshaping agent workflows, see our post on Claude managed agents.
Sampling Parameters Removed
Setting temperature, top_p, or top_k to non-default values now returns a 400 error. Anthropic has taken full control of sampling for Opus 4.7. Developers who relied on temperature tuning for output diversity will need to prompt-engineer instead.
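For existing integrations, the safest migration is to strip these fields before sending. A minimal sketch, assuming the standard Messages API field names and the 400-error behavior the post describes:

```python
# Migration helper: drop sampling parameters that Opus 4.7 rejects
# with a 400 error. Field names are the standard Messages API ones;
# the rejection behavior is as described in this post.
REMOVED_PARAMS = ("temperature", "top_p", "top_k")

def migrate_payload(payload: dict) -> dict:
    """Return a copy of the request safe to send to claude-opus-4-7."""
    return {k: v for k, v in payload.items() if k not in REMOVED_PARAMS}

old = {
    "model": "claude-opus-4-7",
    "temperature": 0.2,
    "max_tokens": 1024,
    "messages": [],
}
clean = migrate_payload(old)
assert "temperature" not in clean
assert clean["max_tokens"] == 1024
```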
More Direct, Less Deferential Tone
Opus 4.7 is described as more direct and opinionated than 4.6 — less validation-forward, with fewer emoji in responses and stronger opinions when asked. At lower effort levels, it is more literal and will not silently generalize instructions it considers ambiguous[14].
/ultrareview in Claude Code
A new /ultrareview slash command is available in Claude Code for deeper code review passes[15].
Pricing and Real-World Cost
The published rate card is unchanged from Opus 4.6:
| Tier | Input | Output |
|---|---|---|
| Standard | $5.00 / MTok | $25.00 / MTok |
| Batch API (50% off) | $2.50 / MTok | $12.50 / MTok |
| Cache reads | $0.50 / MTok | — |
| Cache writes (5-min) | $6.25 / MTok | — |
| Cache writes (1-hour) | $10.00 / MTok | — |
The critical caveat: Opus 4.7 uses a new tokenizer that converts the same input text into up to 35% more tokens than older Claude models. The per-token price is unchanged, but the effective per-request cost is higher. Developers migrating from Opus 4.6 should benchmark their actual token consumption before assuming cost parity[16].
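A back-of-envelope check makes the caveat concrete. The sketch below treats the 35% figure as a worst-case upper bound (the post says "up to 35%"; real inflation depends on your traffic, so measure it):

```python
# Estimate Opus 4.7 cost from Opus 4.6 token counts, assuming the
# worst-case 35% tokenizer inflation described above. Rates are the
# published standard tier ($5 in / $25 out per million tokens).
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 5.00, 25.00
TOKENIZER_INFLATION = 1.35  # upper bound; benchmark your real traffic

def est_opus_47_cost(input_tokens_46: int, output_tokens_46: int,
                     inflation: float = TOKENIZER_INFLATION) -> float:
    """Dollar cost on Opus 4.7 for a workload measured in 4.6 tokens."""
    inp = input_tokens_46 * inflation
    out = output_tokens_46 * inflation
    return (inp * INPUT_PER_MTOK + out * OUTPUT_PER_MTOK) / 1_000_000

# A workload of 2M input / 800k output tokens costs $30.00/day at
# Opus 4.6 token counts; at worst-case inflation it scales to $40.50.
base = (2_000_000 * 5.00 + 800_000 * 25.00) / 1_000_000  # $30.00
assert abs(est_opus_47_cost(2_000_000, 800_000) - base * 1.35) < 1e-6
```

Because the inflation multiplies both input and output tokens, the effective rate card is roughly $6.75/$33.75 per million "old" tokens at the upper bound, which is the number to use when comparing against competitors' list prices.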
For reference, Gemini 3.1 Pro is available at approximately $2.00 input / $12.00 output per million tokens — roughly 2.5× cheaper at list price — though with a different capability profile and no published SWE-bench Pro or GDPVal-AA scores to compare directly.
What Opus 4.7 Is Not: Claude Mythos Preview
Anthropic's highest-capability model is not Opus 4.7. Claude Mythos Preview — developed under Project Glasswing — launched with 12 named enterprise and government partners, with access extended to over 40 additional vetted organizations on an invitation-only basis. Named launch partners are Anthropic, AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, Nvidia, and Palo Alto Networks.
Mythos is focused on defensive cybersecurity workflows. Anthropic describes it as "more capable and better-aligned" than Opus 4.7 in their internal evaluations. Opus 4.7 is explicitly positioned as the model on which Anthropic is testing new cyber safeguards — validating approaches before eventually moving toward a broader Mythos-class release.
For context on Mythos's cybersecurity evaluation, see our post on the AISI Claude Mythos cyber evaluation.
Safety Profile
Anthropic characterizes Opus 4.7's alignment as "largely well-aligned and trustworthy, though not fully ideal in its behavior." Specific improvements over 4.6 include better honesty and improved resistance to prompt injection attacks. One noted regression: Opus 4.7 is "modestly weaker" than 4.6 on avoiding overly detailed harm-reduction advice on controlled substances[17].
The model automatically detects and blocks requests indicating prohibited or high-risk cybersecurity uses. A Cyber Verification Program is available for legitimate security professionals who need expanded capabilities. A full Claude Opus 4.7 System Card has been published[18].
How It Compares: The Short Version
Opus 4.7 is the strongest model for autonomous coding and agentic tasks right now. On SWE-bench Pro it leads by over 6 points; on OSWorld it overtakes GPT-5.4 at 78.0%. For general scientific reasoning (GPQA Diamond), the gap between frontier models has effectively collapsed — all three sit within 0.2 points of each other.
Where Opus 4.7 loses ground: pricing (Gemini 3.1 Pro is cheaper at list price at approximately $2.00/$12.00 per million tokens, though benchmark comparisons are incomplete). Context windows are at parity — both models support 1M tokens. If you're optimizing for cost in long-document workflows rather than coding agents, the tradeoffs shift.
For a broader look at where AI agents now stand relative to human performance across every benchmark category, see the Stanford AI Index 2026 and our earlier breakdown of GPT-5.4's computer use scores.
References
1. Claude Opus 4.7 coding benchmarks — Anthropic announcement
2. GDPVal-AA Elo scores — benchmark aggregators citing Anthropic data
3. XBOW visual acuity benchmark and cyber safeguards — Anthropic announcement
4. Adaptive thinking replaces extended thinking budgets — Anthropic platform docs
5. Tone and instruction-following changes — Anthropic platform docs
6. Claude Opus 4.7 pricing and tokenizer — Anthropic pricing docs