Claude Opus 4.7: Benchmarks, Features & Pricing

April 17, 2026

TL;DR

Anthropic released Claude Opus 4.7 on April 16, 2026. It leads SWE-bench Pro at 64.3% — ahead of GPT-5.4 (57.7%) and Gemini 3.1 Pro (54.2%) — and sets a new high on OSWorld-Verified at 78.0%, up from 72.7% for Opus 4.6. Pricing is unchanged at $5.00/$25.00 per million tokens input/output, though a new tokenizer uses up to 35% more tokens for equivalent text. Key additions include a new xhigh effort level, adaptive thinking (replacing extended thinking budgets), task budgets for agentic loops, and 3× higher image resolution for computer use.


What You'll Learn

  • How Claude Opus 4.7 scores on SWE-bench Pro, OSWorld, GPQA, CursorBench, and other key benchmarks
  • What changed vs. Claude Opus 4.6 — and what was removed
  • The new xhigh effort level and adaptive thinking system
  • Pricing, tokenizer changes, and what they mean for real API costs
  • Where Opus 4.7 leads, where competitors catch up, and what's still invitation-only

Release Details

Claude Opus 4.7 became generally available on April 16, 2026, one day before this post. The API model ID is claude-opus-4-7, available on Anthropic's API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, Snowflake Cortex AI, and GitHub Copilot Enterprise [1].

The model ships as a single variant — there are no Thinking, Pro, or Mini tiers for Opus 4.7. Extended thinking budgets have been removed entirely (setting budget_tokens now returns a 400 error); adaptive thinking replaces them. Knowledge cutoff is January 2026 [2].


Benchmark Results

Coding — Where Opus 4.7 Leads Most Clearly

SWE-bench is the standard measure for autonomous software engineering: given a GitHub issue, can the model write a pull request that passes the test suite? Opus 4.7 sets a new top score on the harder SWE-bench Pro variant:

| Model | SWE-bench Pro | SWE-bench Verified |
|---|---|---|
| Claude Opus 4.7 | 64.3% | 87.6% |
| GPT-5.4 | 57.7% | n/a |
| Gemini 3.1 Pro | 54.2% | 80.6% |
| Claude Opus 4.6 | 53.4% | 80.8% |

On SWE-bench Pro, Opus 4.7 leads by more than 6 points over GPT-5.4 and 10 points over Gemini. On the standard SWE-bench Verified leaderboard, Opus 4.7 climbs from 80.8% to 87.6% — a 6.8-point improvement over its predecessor [3].

Anthropic also reports a 13% improvement on an internal 93-task coding benchmark, and Rakuten's production deployment found that Opus 4.7 resolves 3× more real production tasks than Opus 4.6 on their SWE-bench variant [4].

Computer Use — OSWorld

OSWorld-Verified measures autonomous desktop task completion (file management, browser navigation, multi-app workflows). The human expert baseline is approximately 72.4%:

| Model | OSWorld-Verified |
|---|---|
| Claude Opus 4.7 | 78.0% |
| GPT-5.4 | 75.0% |
| Claude Opus 4.6 | 72.7% |
| Human expert baseline | ~72.4% |

Opus 4.7 moves past GPT-5.4 on this benchmark, reaching 78.0% and extending the gap above the human baseline. The improvement is partly driven by a new 3.75MP image-resolution ceiling for computer use — up from 1.15MP in Opus 4.6 — and coordinates that now map 1:1 with pixels, removing the scale-factor math that previously introduced errors in screen-coordinate targeting [5].

Developer Workflows — CursorBench

CursorBench evaluates real-world coding assistant tasks as they occur in an IDE environment. Opus 4.7 scores 70%, up from 58% for Opus 4.6 — a 12-point jump that positions it above competing models on this benchmark [6].

Graduate-Level Science — GPQA Diamond

On GPQA Diamond (graduate-level physics, chemistry, biology), the three frontier models are statistically indistinguishable:

| Model | GPQA Diamond |
|---|---|
| GPT-5.4 Pro | 94.4% |
| Gemini 3.1 Pro | 94.3% |
| Claude Opus 4.7 | 94.2% |

The differences here are within measurement noise. No single model holds a meaningful advantage on graduate-level scientific reasoning [7].

Knowledge Work — GDPVal-AA

GDPVal-AA is an Elo-based benchmark measuring general knowledge work across business analysis, document processing, and professional reasoning tasks:

| Model | GDPVal-AA (Elo) |
|---|---|
| Claude Opus 4.7 | 1,753 |
| GPT-5.4 | 1,674 |
| Gemini 3.1 Pro | 1,314 |

Opus 4.7 holds a 79-point advantage over GPT-5.4 on this benchmark, with Gemini 3.1 Pro trailing significantly at 1,314 [8].

Security — XBOW Visual Acuity

On XBOW's visual acuity cybersecurity benchmark, Opus 4.7 scores 98.5%, versus 54.5% for Opus 4.6 — the largest single-generation jump of any benchmark in this release. Anthropic describes Opus 4.7 as a testbed for new cyber safeguards being validated before eventual broader release of Mythos-class models [9].

On Harvey's BigLaw Bench (professional legal reasoning), Opus 4.7 scores 90.9% at high effort [10].


What's New vs. Opus 4.6

Adaptive Thinking Replaces Extended Thinking Budgets

The biggest architectural change: extended thinking budgets are gone. Setting budget_tokens in your API request now returns a 400 error. In their place, Anthropic introduces adaptive thinking — off by default, opt-in via the API — which the company says outperforms extended thinking in internal evaluations. The system dynamically allocates reasoning compute rather than requiring developers to set a token ceiling.

Thinking blocks still appear in the response stream, but the thinking field is empty by default unless you set display: "summarized" in your request [11].
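Putting the above together, a request payload might look like the sketch below. The exact JSON shape of the adaptive-thinking options (the "adaptive" type value and where display lives) is an assumption inferred from the release notes, not confirmed API schema:

```python
# Sketch of an Opus 4.7 Messages API payload. The adaptive-thinking field
# shapes here are assumptions based on the described behavior.
def build_request(prompt: str, summarized_thinking: bool = False) -> dict:
    payload = {
        "model": "claude-opus-4-7",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
        # Opt in to adaptive thinking (off by default). The old extended-
        # thinking form {"type": "enabled", "budget_tokens": N} now returns
        # a 400 error on this model.
        "thinking": {"type": "adaptive"},
    }
    if summarized_thinking:
        # Without this, thinking blocks arrive with an empty thinking field.
        payload["thinking"]["display"] = "summarized"
    return payload

req = build_request("Refactor this function.", summarized_thinking=True)
print(req["thinking"])  # {'type': 'adaptive', 'display': 'summarized'}
```

The point of the sketch: migration is a deletion (drop budget_tokens) plus an opt-in, not a parameter translation.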

New xhigh Effort Level

A new xhigh effort tier sits between the previous high and max levels, giving developers finer control over the reasoning/latency tradeoff. Anthropic recommends xhigh for coding and agentic use cases where you want stronger reasoning without paying the full max cost [12].
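A minimal helper for selecting a tier might look like this. The ordering (high < xhigh < max) comes from the post; the parameter name effort and the absence of lower tiers in the tuple are illustrative assumptions:

```python
# Hypothetical effort-tier helper. Tier names from the release notes;
# the `effort` request field name is an assumption for illustration.
EFFORT_LEVELS = ("high", "xhigh", "max")  # xhigh sits between high and max

def with_effort(payload: dict, effort: str) -> dict:
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return {**payload, "effort": effort}

req = with_effort({"model": "claude-opus-4-7"}, "xhigh")
print(req["effort"])  # xhigh
```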

Task Budgets for Agentic Loops

A new public beta feature (the task-budgets-2026-03-13 beta header) lets you set an advisory token budget across an entire agentic loop — not just a single model call. The minimum is 20,000 tokens. This is not a hard cap; instead it guides the model's planning toward completing the task within budget, reducing runaway token usage in long agent workflows [13].
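In practice that means a beta header plus a budget field on the request. The header value and 20,000-token minimum come from the docs above; the JSON field name task_budget_tokens is a guess for illustration:

```python
# Sketch of enabling the task-budgets beta. `task_budget_tokens` is a
# hypothetical field name; header value and minimum are as documented.
MIN_TASK_BUDGET = 20_000

def task_budget_request(payload: dict, budget_tokens: int) -> tuple[dict, dict]:
    if budget_tokens < MIN_TASK_BUDGET:
        raise ValueError(f"task budget must be at least {MIN_TASK_BUDGET} tokens")
    headers = {"anthropic-beta": "task-budgets-2026-03-13"}
    # Advisory, not a hard cap: it steers planning across the whole
    # agentic loop rather than truncating a single call.
    return headers, {**payload, "task_budget_tokens": budget_tokens}

headers, body = task_budget_request({"model": "claude-opus-4-7"}, 50_000)
```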

3× Higher Image Resolution for Computer Use

Max image resolution climbs from 1,568px (1.15MP) to 2,576px (3.75MP) — more than three times the pixel count. Coordinate output now maps 1:1 to actual pixel positions, eliminating the scale-factor conversion errors that caused missed clicks in Opus 4.6 computer use deployments.
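To see why 1:1 coordinates matter, here is a sketch of the scale-factor conversion a computer-use harness had to do under the old cap when screenshots were downscaled (the exact resize policy of any given harness is an assumption; the rounding step is where missed clicks crept in):

```python
def to_screen_coords(model_xy, screenshot_size, screen_size):
    """Map a click predicted on a (possibly downscaled) screenshot
    back to real screen pixels."""
    sx = screen_size[0] / screenshot_size[0]
    sy = screen_size[1] / screenshot_size[1]
    return (round(model_xy[0] * sx), round(model_xy[1] * sy))

# Opus 4.6 era: a 2560x1440 screen downscaled to fit the 1,568px cap,
# so every click passes through a lossy rounding step.
print(to_screen_coords((784, 405), (1568, 882), (2560, 1440)))  # (1280, 661)

# Opus 4.7: the same screen fits under the 2,576px cap, so the mapping
# is the identity and no rounding error is introduced.
print(to_screen_coords((1280, 720), (2560, 1440), (2560, 1440)))  # (1280, 720)
```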

For a deeper look at how Claude's computer use capabilities are reshaping agent workflows, see our post on Claude managed agents.

Sampling Parameters Removed

Setting temperature, top_p, or top_k to non-default values now returns a 400 error. Anthropic has taken full control of sampling for Opus 4.7. Developers who relied on temperature tuning for output diversity will need to prompt-engineer instead.
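For codebases migrating from Opus 4.6, a defensive sketch like the following strips the removed parameters before sending a request (a minimal illustration, not an official migration path):

```python
# Strip sampling parameters before calling Opus 4.7, since non-default
# temperature/top_p/top_k now return a 400 error on this model.
REMOVED_SAMPLING_PARAMS = ("temperature", "top_p", "top_k")

def migrate_payload(payload: dict) -> dict:
    dropped = [k for k in REMOVED_SAMPLING_PARAMS if k in payload]
    if dropped:
        print(f"dropping unsupported sampling params: {dropped}")
    return {k: v for k, v in payload.items() if k not in REMOVED_SAMPLING_PARAMS}

clean = migrate_payload({"model": "claude-opus-4-7", "temperature": 0.7})
# clean == {"model": "claude-opus-4-7"}
```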

More Direct, Less Deferential Tone

Opus 4.7 is described as more direct and opinionated than 4.6 — less validation-forward, with fewer emoji in responses and stronger opinions when asked. At lower effort levels, it is more literal and will not silently generalize instructions it considers ambiguous [14].

/ultrareview in Claude Code

A new /ultrareview slash command is available in Claude Code for deeper code review passes [15].


Pricing and Real-World Cost

The published rate card is unchanged from Opus 4.6:

| Tier | Input | Output |
|---|---|---|
| Standard | $5.00 / MTok | $25.00 / MTok |
| Batch API (50% off) | $2.50 / MTok | $12.50 / MTok |
| Cache reads | $0.50 / MTok | n/a |
| Cache writes (5-min) | $6.25 / MTok | n/a |
| Cache writes (1-hour) | $10.00 / MTok | n/a |

The critical caveat: Opus 4.7 uses a new tokenizer that converts the same input text into up to 35% more tokens than older Claude models. The per-token price is unchanged, but the effective per-request cost is higher. Developers migrating from Opus 4.6 should benchmark their actual token consumption before assuming cost parity [16].
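A back-of-envelope check makes the caveat concrete. The 1.35 multiplier below is the stated upper bound on token inflation; measure your own corpus before budgeting on it:

```python
# Worst-case cost impact of the tokenizer change at unchanged list prices.
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 5.00, 25.00

def effective_cost_usd(input_tokens, output_tokens, inflation=1.35):
    """Cost if text that tokenized to N tokens on Opus 4.6 now takes N * inflation."""
    return (input_tokens * INPUT_PER_MTOK
            + output_tokens * OUTPUT_PER_MTOK) * inflation / 1e6

# A workload of 1M input + 200k output tokens (as counted by Opus 4.6):
print(effective_cost_usd(1_000_000, 200_000))  # 13.5, vs $10.00 at token parity
```

Same rate card, up to 35% more dollars per request: the inflation applies uniformly, so a $10.00 workload can become $13.50.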

For reference, Gemini 3.1 Pro is available at approximately $2.00 input / $12.00 output per million tokens — roughly 2× to 2.5× cheaper at list price — though with a different capability profile and no published SWE-bench Pro or GDPVal-AA scores for a direct comparison.


What Opus 4.7 Is Not: Claude Mythos Preview

Anthropic's highest-capability model is not Opus 4.7. Claude Mythos Preview — developed under Project Glasswing — launched with 12 named enterprise and government partners, with access extended to over 40 additional vetted organizations on an invitation-only basis. Named launch partners are Anthropic, AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, Nvidia, and Palo Alto Networks.

Mythos is focused on defensive cybersecurity workflows. Anthropic describes it as "more capable and better-aligned" than Opus 4.7 in their internal evaluations. Opus 4.7 is explicitly positioned as the model on which Anthropic is testing new cyber safeguards — validating approaches before eventually moving toward a broader Mythos-class release.

For context on Mythos's cybersecurity evaluation, see our post on the AISI Claude Mythos cyber evaluation.


Safety Profile

Anthropic characterizes Opus 4.7's alignment as "largely well-aligned and trustworthy, though not fully ideal in its behavior." Specific improvements over 4.6 include better honesty and improved resistance to prompt injection attacks. One noted regression: Opus 4.7 is "modestly weaker" than 4.6 on avoiding overly detailed harm-reduction advice on controlled substances [17].

The model automatically detects and blocks requests indicating prohibited or high-risk cybersecurity uses. A Cyber Verification Program is available for legitimate security professionals who need expanded capabilities. A full Claude Opus 4.7 System Card has been published [18].


How It Compares: The Short Version

Opus 4.7 is the strongest model for autonomous coding and agentic tasks right now. On SWE-bench Pro it leads by over 6 points; on OSWorld it overtakes GPT-5.4 at 78.0%. For general scientific reasoning (GPQA Diamond), the gap between frontier models has effectively collapsed — all three sit within 0.2 points of each other.

Where Opus 4.7 loses ground: pricing. Gemini 3.1 Pro is cheaper at list price (approximately $2.00/$12.00 per million tokens), though benchmark comparisons are incomplete. Context windows are at parity — both models support 1M tokens. If you're optimizing for cost in long-document workflows rather than coding agents, the tradeoffs shift.

For a broader look at where AI agents now stand relative to human performance across every benchmark category, see the Stanford AI Index 2026 and our earlier breakdown of GPT-5.4's computer use scores.


References

  1. Claude Opus 4.7 announcement — Anthropic

  2. Claude models overview — Anthropic platform docs

  3. Claude Opus 4.7 SWE-bench Pro scores — The Next Web

  4. Claude Opus 4.7 coding benchmarks — Anthropic announcement

  5. What's new in Claude 4.7 — Anthropic platform docs

  6. CursorBench scores — Anthropic announcement

  7. GPQA Diamond benchmark comparison — Startup Fortune

  8. GDPVal-AA Elo scores — benchmark aggregators citing Anthropic data

  9. XBOW visual acuity benchmark and cyber safeguards — Anthropic announcement

  10. BigLaw Bench — Anthropic announcement

  11. Adaptive thinking replaces extended thinking budgets — Anthropic platform docs

  12. xhigh effort level — Anthropic platform docs

  13. Task budgets beta — Anthropic platform docs

  14. Tone and instruction-following changes — Anthropic platform docs

  15. /ultrareview in Claude Code — Anthropic platform docs

  16. Claude Opus 4.7 pricing and tokenizer — Anthropic pricing docs

  17. Claude Opus 4.7 safety profile — Anthropic announcement

  18. Claude Opus 4.7 System Card — Anthropic

Frequently Asked Questions

What is Claude Opus 4.7?

Claude Opus 4.7 is Anthropic's latest publicly available large language model, released April 16, 2026. It is the most capable Claude model available to general users, with top scores on coding (SWE-bench Pro: 64.3%) and computer use (OSWorld-Verified: 78.0%) benchmarks.
