GPT-5.4 Beats Humans at Computer Use: What It Means
April 5, 2026
TL;DR
GPT-5.4, released by OpenAI on March 5, 2026, is the highest-scoring general-purpose model on the OSWorld-Verified benchmark: 75.0% against the human baseline of 72.4%, making it the first model to clear that baseline by a decisive margin. With native computer-use capabilities, a 1-million-token context window, and strong enterprise benchmarks, GPT-5.4 marks a genuine inflection point for autonomous AI agents. Here is what changed, why it matters, and how it compares to the competition.
What You'll Learn
- What GPT-5.4's computer-use capabilities actually do and how they work
- How the OSWorld benchmark measures real desktop task automation
- Where GPT-5.4 stands relative to Claude Sonnet 4.6, Claude Opus 4.6, and Gemini 3.1 Pro
- What this means for enterprise workflows and the future of AI agents
- Pricing, context window, and practical considerations for developers
What Is GPT-5.4?
GPT-5.4 is OpenAI's latest frontier model, launched on March 5, 2026. It represents a significant architectural leap over its predecessor, GPT-5.2, with three headline features that set it apart: native computer-use capabilities, a 1-million-token context window (the largest OpenAI has ever offered), and substantially improved token efficiency during reasoning tasks.[1]
The model ships in several variants. The standard GPT-5.4 handles general-purpose tasks. GPT-5.4 Thinking is optimized for complex reasoning chains. GPT-5.4 Pro targets high-performance enterprise workloads. Smaller variants — GPT-5.4 Mini and GPT-5.4 Nano — followed on March 17, bringing the architecture's improvements to lower price points.[2]
What makes GPT-5.4 genuinely newsworthy is not just incremental benchmark gains. It is OpenAI's first model with built-in computer-use capabilities — meaning it can interact directly with desktop software through screenshots, mouse commands, and keyboard inputs. (Anthropic's Claude was the first frontier model to offer computer use, launching the feature in public beta with Claude 3.5 Sonnet in October 2024.)
The OSWorld Breakthrough: Beating Human Experts
The OSWorld-Verified benchmark is the standard evaluation for measuring how well AI models can autonomously complete real desktop tasks — navigating files, using browsers, operating terminal interfaces, and interacting with productivity software. Human expert testers set the baseline at 72.4%.
GPT-5.4 scored 75.0% — the highest of any general-purpose model and the first to exceed the human baseline by a clear margin.[3] Simular's specialized Agent S framework reached 72.6% in December 2025, and Claude Opus 4.6 narrowly exceeded the baseline at 72.7% in February 2026, but GPT-5.4's 75.0% represents the first unambiguous gap above human-level performance.
To appreciate the pace of improvement, consider the trajectory across OpenAI's recent models:
| Model | OSWorld-Verified Score | Release Date |
|---|---|---|
| GPT-5.2 | 47.3% | December 2025 |
| GPT-5.3 Codex | 64.7% | February 2026 |
| GPT-5.4 | 75.0% | March 5, 2026 |
| Human Expert Baseline | 72.4% | — |
That is a 28-point improvement in under four months — from GPT-5.2 in December 2025 to GPT-5.4 in March 2026 — a rate of progress that suggests computer-use accuracy could reach the high 80s or low 90s within the next year if the trend holds.
The practical implication is straightforward: GPT-5.4 can now fill out forms, navigate multi-step software workflows, extract data from desktop applications, and coordinate across browser tabs more reliably than the average human expert performing the same tasks under controlled conditions.
How Computer Use Actually Works
GPT-5.4's computer-use capability operates through a perception-action loop. The model receives a screenshot of the current screen state, reasons about what action to take next, and then issues structured commands — mouse clicks at specific coordinates, keyboard inputs, scroll actions, or application switches.
This is fundamentally different from API-based tool use, where the model calls predefined functions. Computer use means the model interacts with software the same way a human does — through the visual interface. It can operate applications that have no API, navigate unfamiliar interfaces, and recover from unexpected states like pop-up dialogs or error messages.
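The loop described above can be sketched in a few lines. This is an illustrative skeleton, not OpenAI's API: `capture_screenshot`, `choose_action`, and `execute` are hypothetical stand-ins for the screenshot source, the model call, and the OS-level input layer.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A structured command the model can emit for the desktop."""
    kind: str        # "click", "type", "scroll", or "done"
    x: int = 0       # screen coordinates for clicks
    y: int = 0
    text: str = ""   # payload for keyboard input

def run_agent(capture_screenshot, choose_action, execute, max_steps=50):
    """Perception-action loop: observe the screen, decide, act, repeat."""
    for step in range(max_steps):
        screenshot = capture_screenshot()    # perception: current screen state
        action = choose_action(screenshot)   # reasoning: model picks next action
        if action.kind == "done":
            return step                      # task complete
        execute(action)                      # action: click / type / scroll
    raise TimeoutError("agent exceeded its step budget")
```

In a real deployment, `choose_action` would be a model call returning structured output and `execute` would drive the OS input layer, but the loop shape stays the same — which is also why recovery from pop-ups and error states is possible: every iteration starts from a fresh observation of the screen.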
OpenAI's implementation also supports code-driven automation via libraries like Playwright, giving developers a hybrid approach: the model can write and execute browser automation scripts when APIs are available, and fall back to visual computer use when they are not.[4]
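One way to structure that hybrid is a simple dispatcher that prefers the deterministic scripted path and degrades to visual computer use when no script exists or the script breaks mid-run. This is a sketch of the pattern only — `task`, `run_script`, and `run_computer_use` are hypothetical, and OpenAI's actual routing logic is not public.

```python
def automate(task, run_script, run_computer_use):
    """Hybrid automation dispatch: try a deterministic script (e.g. a
    Playwright routine) when the target is scriptable, and fall back to
    screenshot-driven computer use otherwise."""
    if task.get("scriptable"):
        try:
            return run_script(task)     # fast, deterministic path
        except Exception:
            pass                        # unexpected UI state: degrade gracefully
    return run_computer_use(task)       # visual fallback works on any interface
```

The design choice worth noting is the fallback on failure: scripted automation is cheaper and repeatable, but brittle against UI changes, so the visual path acts as the safety net rather than the default.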
How GPT-5.4 Compares to the Competition
GPT-5.4 is not the only model with computer-use capabilities. Anthropic's Claude models have offered computer use since late 2024, and the latest Claude Sonnet 4.6 and Claude Opus 4.6 models are competitive on the same benchmarks.
| Model | OSWorld-Verified | Context Window | Computer Use |
|---|---|---|---|
| GPT-5.4 | 75.0% | 1M tokens | Native |
| Claude Opus 4.6 | 72.7% | 1M tokens | Native |
| Claude Sonnet 4.6 | 72.5% | 1M tokens | Native |
| Gemini 3.1 Pro | ~72.5% (unofficial) | 1M tokens | Yes |
The gap between GPT-5.4 and Claude's models is approximately 2-3 percentage points on OSWorld — meaningful but not enormous.[5] Claude Sonnet 4.6 is particularly notable because it delivers near-Opus-level computer-use performance at a significantly lower price point, making it the most cost-effective option for agent workloads that prioritize computer use.
Google's Gemini 3.1 Pro has not officially published OSWorld benchmark results, though third-party evaluations suggest it scores around 72.5% — competitive with Claude's models. All three frontier model families now offer 1-million-token context windows, removing context length as a meaningful differentiator. Google has focused its agentic efforts more on search integration and multimodal reasoning than on desktop computer-use scenarios.
Enterprise Impact: From Benchmarks to Boardrooms
The real story of GPT-5.4 is not the benchmark number — it is what that number enables in production enterprise environments.
OpenAI highlighted GPT-5.4's performance on financial modeling tasks specifically. On an internal benchmark simulating tasks that a junior investment banking analyst might perform — spreadsheet modeling, scenario analysis, data extraction — GPT-5.4 scored 87.3%, up from 68.4% for GPT-5.2.[6] That is a capability jump that translates directly into time saved on repetitive analytical work.
GPT-5.4 unlocks or significantly improves several enterprise use cases:
- Document review at scale (legal briefs, scientific literature, multi-chapter manuscripts), leveraging the 1M-token context window
- Automated data entry and extraction across legacy systems without APIs
- End-to-end onboarding workflows that coordinate across multiple applications
- Financial report generation combining data from spreadsheets, databases, and web sources
GPT-5.4 is available through Microsoft Foundry (Azure's AI platform), making it accessible to enterprise customers who already operate within the Microsoft ecosystem.[7]
Pricing and Developer Considerations
GPT-5.4's pricing reflects its position as a frontier model, with a tiered structure that covers different use cases:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-5.4 | $2.50 | $15.00 |
| GPT-5.4 Mini | $0.75 | $4.50 |
| GPT-5.4 Nano | $0.20 | $1.25 |
| GPT-5.4 Pro | $30.00 | $180.00 |
Cached input tokens cost $1.25 per million (a 50% discount applied automatically to repeating context). An extended context surcharge kicks in past 272K tokens, doubling the input rate to $5.00 per million.[8]
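To see how these rates combine, here is a small cost model built only from the numbers quoted above. It assumes the surcharge applies to fresh input once total input crosses 272K tokens and that cached tokens keep their flat rate; actual billing boundaries and rounding may differ.

```python
RATES = {
    "input": 2.50,        # $ per 1M fresh input tokens
    "input_long": 5.00,   # $ per 1M fresh input tokens past 272K total input
    "cached": 1.25,       # $ per 1M cached input tokens (50% discount)
    "output": 15.00,      # $ per 1M output tokens
}

def request_cost(fresh_in, cached_in, out, rates=RATES):
    """Estimated dollar cost of one request, from raw token counts."""
    long_context = (fresh_in + cached_in) > 272_000
    input_rate = rates["input_long"] if long_context else rates["input"]
    return (fresh_in * input_rate
            + cached_in * rates["cached"]
            + out * rates["output"]) / 1_000_000
```

For example, a request with 100K fresh input tokens and 10K output tokens comes to about $0.40 at these rates, while a 300K-token input pays the doubled input rate on the whole request under this model.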
For developers building agent systems, the practical tradeoff is between the full GPT-5.4 (best computer-use accuracy) and GPT-5.4 Mini (adequate for many automation tasks at roughly one-third the cost). The Nano variant is API-only and best suited for high-volume, lower-complexity tasks.
The Bigger Picture: What Superhuman Computer Use Means
Crossing the human baseline on OSWorld is a symbolic milestone, but its significance extends beyond a single benchmark score.
First, it validates the computer-use paradigm itself. When AI models could only complete 30-40% of desktop tasks reliably, computer use was a research curiosity. At 75% and climbing, it becomes a practical automation layer that enterprises can deploy with reasonable confidence for well-defined workflows.
Second, it accelerates the shift from "copilot" to "agent" in how organizations think about AI deployment. A copilot assists — an agent acts. GPT-5.4's capabilities push more workflows into the "agent" category, where the AI can execute multi-step processes end-to-end with minimal human supervision. If you want to understand the architectural patterns behind these agent systems, our guide on agent orchestration patterns covers the key design decisions. For a broader look at how AI agents are reshaping development workflows, see our deep dive on autonomous dev workflows.
Third, the competitive dynamics are intensifying. Anthropic's Claude models are within 2-3 points of GPT-5.4 on OSWorld, and Google is likely working on equivalent capabilities for Gemini. This three-way competition is producing rapid improvements — the 28-point OSWorld jump since December 2025 is a direct result of this competitive pressure.
What to Watch Next
The AI agent space is evolving fast. Several developments are worth monitoring over the coming months:
- OSWorld scores from Claude's next model update (Anthropic has been closing the gap steadily)
- Google's response with Gemini's agentic capabilities (strong multimodal reasoning gives them a unique angle for agent workflows)
- Enterprise adoption data: how many organizations actually deploy computer-use agents in production versus running pilots
- Safety and reliability frameworks: superhuman benchmark scores do not mean zero errors, and the failure modes for computer-use agents in production environments remain an active area of research
References
1. OpenAI launches GPT-5.4 with Pro and Thinking versions — TechCrunch
2. OpenAI Launches GPT-5.4 With Computer Agent Capabilities, Beats Human Baseline on OSWorld — AI Haven
3. GPT-5.4 Unveiled: Native Computer Use and a Million-Token Context Window — Applying AI
4. 2025-2026 AI Computer-Use Benchmarks and Top AI Agents Guide — o-mega
5. OpenAI launches GPT-5.4, its most powerful model for enterprise work — Fortune
6. GPT-5.4 Is Now Available in Microsoft Foundry — Cloud Factory Group