GPT-5.4 Beats Humans at Computer Use: What It Means
April 5, 2026
TL;DR
GPT-5.4, released by OpenAI on March 5, 2026, is the highest-scoring general-purpose model on the OSWorld-Verified benchmark: 75.0% against the human baseline of 72.4%, making it the first model to clear that baseline by a decisive margin. With native computer-use capabilities, a 1-million-token context window, and strong enterprise benchmarks, GPT-5.4 marks a genuine inflection point for autonomous AI agents. Here is what changed, why it matters, and how it compares to the competition.
What You'll Learn
- What GPT-5.4's computer-use capabilities actually do and how they work
- How the OSWorld benchmark measures real desktop task automation
- Where GPT-5.4 stands relative to Claude Sonnet 4.6, Claude Opus 4.6, and Gemini 3.1 Pro
- What this means for enterprise workflows and the future of AI agents
- Pricing, context window, and practical considerations for developers
What Is GPT-5.4?
GPT-5.4 is OpenAI's latest frontier model, launched on March 5, 2026. It represents a significant architectural leap over its predecessor, GPT-5.2, with three headline features that set it apart: native computer-use capabilities, a 1-million-token context window (the largest OpenAI has ever offered), and substantially improved token efficiency during reasoning tasks.[1]
The model ships in several variants. The standard GPT-5.4 handles general-purpose tasks. GPT-5.4 Thinking is optimized for complex reasoning chains. GPT-5.4 Pro targets high-performance enterprise workloads. Smaller variants — GPT-5.4 Mini and GPT-5.4 Nano — followed on March 17, bringing the architecture's improvements to lower price points.[2]
What makes GPT-5.4 genuinely newsworthy is not just incremental benchmark gains. It is OpenAI's first model with built-in computer-use capabilities — meaning it can interact directly with desktop software through screenshots, mouse commands, and keyboard inputs. (Anthropic's Claude was the first frontier model to offer computer use, launching the feature in public beta with Claude 3.5 Sonnet in October 2024.)
The OSWorld Breakthrough: Beating Human Experts
The OSWorld-Verified benchmark is the standard evaluation for measuring how well AI models can autonomously complete real desktop tasks — navigating files, using browsers, operating terminal interfaces, and interacting with productivity software. Human expert testers set the baseline at 72.4%.
GPT-5.4 scored 75.0% — the highest of any general-purpose model and the first to exceed the human baseline by a clear margin.[3] Simular's specialized Agent S framework reached 72.6% in December 2025, and Claude Opus 4.6 narrowly exceeded the baseline at 72.7% in February 2026, but GPT-5.4's 75.0% represents the first unambiguous gap above human-level performance.
To appreciate the pace of improvement, consider the trajectory across OpenAI's recent models:
| Model | OSWorld-Verified Score | Release Date |
|---|---|---|
| GPT-5.2 | 47.3% | December 2025 |
| GPT-5.3 Codex | 64.7% | February 2026 |
| GPT-5.4 | 75.0% | March 5, 2026 |
| Human Expert Baseline | 72.4% | — |
That is a 28-point improvement in under four months — from GPT-5.2 in December 2025 to GPT-5.4 in March 2026 — a rate of progress that suggests computer-use accuracy could reach the high 80s or low 90s within the next year if the trend holds.
The practical implication is straightforward: GPT-5.4 can now fill out forms, navigate multi-step software workflows, extract data from desktop applications, and coordinate across browser tabs more reliably than the average human expert performing the same tasks under controlled conditions.
How Computer Use Actually Works
GPT-5.4's computer-use capability operates through a perception-action loop. The model receives a screenshot of the current screen state, reasons about what action to take next, and then issues structured commands — mouse clicks at specific coordinates, keyboard inputs, scroll actions, or application switches.
This is fundamentally different from API-based tool use, where the model calls predefined functions. Computer use means the model interacts with software the same way a human does — through the visual interface. It can operate applications that have no API, navigate unfamiliar interfaces, and recover from unexpected states like pop-up dialogs or error messages.
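The loop described above can be sketched in a few lines. This is an illustrative skeleton, not OpenAI's API: `capture_screenshot`, `choose_action`, and `execute` are hypothetical stand-ins for the screenshot source, the model call, and the OS-level input layer.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A structured command the model can emit for the desktop."""
    kind: str        # "click", "type", "scroll", or "done"
    x: int = 0       # screen coordinates for clicks
    y: int = 0
    text: str = ""   # payload for keyboard input

def run_agent(capture_screenshot, choose_action, execute, max_steps=50):
    """Perception-action loop: observe the screen, decide, act, repeat."""
    for step in range(max_steps):
        screenshot = capture_screenshot()    # perception: current screen state
        action = choose_action(screenshot)   # reasoning: model picks next action
        if action.kind == "done":
            return step                      # task complete
        execute(action)                      # action: click / type / scroll
    raise TimeoutError("agent exceeded its step budget")
```

In a real deployment, `choose_action` would be a model call returning structured output and `execute` would drive the OS input layer, but the loop shape stays the same — which is also why recovery from pop-ups and error states is possible: every iteration starts from a fresh observation of the screen.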
OpenAI's implementation also supports code-driven automation via libraries like Playwright, giving developers a hybrid approach: the model can write and execute browser automation scripts when APIs are available, and fall back to visual computer use when they are not.[4]
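One way to structure that hybrid is a simple dispatcher that prefers the deterministic scripted path and degrades to visual computer use when no script exists or the script breaks mid-run. This is a sketch of the pattern only — `task`, `run_script`, and `run_computer_use` are hypothetical, and OpenAI's actual routing logic is not public.

```python
def automate(task, run_script, run_computer_use):
    """Hybrid automation dispatch: try a deterministic script (e.g. a
    Playwright routine) when the target is scriptable, and fall back to
    screenshot-driven computer use otherwise."""
    if task.get("scriptable"):
        try:
            return run_script(task)     # fast, deterministic path
        except Exception:
            pass                        # unexpected UI state: degrade gracefully
    return run_computer_use(task)       # visual fallback works on any interface
```

The design choice worth noting is the fallback on failure: scripted automation is cheaper and repeatable, but brittle against UI changes, so the visual path acts as the safety net rather than the default.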
How GPT-5.4 Compares to the Competition
GPT-5.4 is not the only model with computer-use capabilities. Anthropic's Claude models have offered computer use since late 2024, and the latest Claude Sonnet 4.6 and Claude Opus 4.6 models are competitive on the same benchmarks.
| Model | OSWorld-Verified | Context Window | Computer Use |
|---|---|---|---|
| GPT-5.4 | 75.0% | 1M tokens | Native |
| Claude Opus 4.6 | 72.7% | 1M tokens | Native |
| Claude Sonnet 4.6 | 72.5% | 1M tokens | Native |
| Gemini 3.1 Pro | ~72.5% (unofficial) | 1M tokens | Yes |
The gap between GPT-5.4 and Claude's models is approximately 2-3 percentage points on OSWorld — meaningful but not enormous.[5] Claude Sonnet 4.6 is particularly notable because it delivers near-Opus-level computer-use performance at a significantly lower price point, making it the most cost-effective option for agent workloads that prioritize computer use.
Google's Gemini 3.1 Pro has not officially published OSWorld benchmark results, though third-party evaluations suggest it scores around 72.5% — competitive with Claude's models. All three frontier model families now offer 1-million-token context windows, removing context length as a meaningful differentiator. Google has focused its agentic efforts more on search integration and multimodal reasoning than on desktop computer-use scenarios.
Enterprise Impact: From Benchmarks to Boardrooms
The real story of GPT-5.4 is not the benchmark number — it is what that number enables in production enterprise environments.
OpenAI highlighted GPT-5.4's performance on financial modeling tasks specifically. On an internal benchmark simulating tasks that a junior investment banking analyst might perform — spreadsheet modeling, scenario analysis, data extraction — GPT-5.4 scored 87.3%, up from 68.4% for GPT-5.2.[6] That is a capability jump that translates directly into time saved on repetitive analytical work.
GPT-5.4 unlocks or significantly improves several enterprise use cases:
- Document review at scale (legal briefs, scientific literature, multi-chapter manuscripts), leveraging the 1M-token context window
- Automated data entry and extraction across legacy systems without APIs
- End-to-end onboarding workflows that coordinate across multiple applications
- Financial report generation combining data from spreadsheets, databases, and web sources
GPT-5.4 is available through Microsoft Foundry (Azure's AI platform), making it accessible to enterprise customers who already operate within the Microsoft ecosystem.[7]
Pricing and Developer Considerations
GPT-5.4's pricing reflects its position as a frontier model, with a tiered structure that covers different use cases:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-5.4 | $2.50 | $15.00 |
| GPT-5.4 Mini | $0.75 | $4.50 |
| GPT-5.4 Nano | $0.20 | $1.25 |
| GPT-5.4 Pro | $30.00 | $180.00 |
Cached input tokens cost $1.25 per million (a 50% discount applied automatically to repeating context). An extended context surcharge kicks in past 272K tokens, doubling the input rate to $5.00 per million.[8]
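To see how these rates combine, here is a small cost model built only from the numbers quoted above. It assumes the surcharge applies to fresh input once total input crosses 272K tokens and that cached tokens keep their flat rate; actual billing boundaries and rounding may differ.

```python
RATES = {
    "input": 2.50,        # $ per 1M fresh input tokens
    "input_long": 5.00,   # $ per 1M fresh input tokens past 272K total input
    "cached": 1.25,       # $ per 1M cached input tokens (50% discount)
    "output": 15.00,      # $ per 1M output tokens
}

def request_cost(fresh_in, cached_in, out, rates=RATES):
    """Estimated dollar cost of one request, from raw token counts."""
    long_context = (fresh_in + cached_in) > 272_000
    input_rate = rates["input_long"] if long_context else rates["input"]
    return (fresh_in * input_rate
            + cached_in * rates["cached"]
            + out * rates["output"]) / 1_000_000
```

For example, a request with 100K fresh input tokens and 10K output tokens comes to about $0.40 at these rates, while a 300K-token input pays the doubled input rate on the whole request under this model.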
For developers building agent systems, the practical tradeoff is between the full GPT-5.4 (best computer-use accuracy) and GPT-5.4 Mini (adequate for many automation tasks at roughly one-third the cost). The Nano variant is API-only and best suited for high-volume, lower-complexity tasks.
The Bigger Picture: What Superhuman Computer Use Means
Crossing the human baseline on OSWorld is a symbolic milestone, but its significance extends beyond a single benchmark score.
First, it validates the computer-use paradigm itself. When AI models could only complete 30-40% of desktop tasks reliably, computer use was a research curiosity. At 75% and climbing, it becomes a practical automation layer that enterprises can deploy with reasonable confidence for well-defined workflows.
Second, it accelerates the shift from "copilot" to "agent" in how organizations think about AI deployment. A copilot assists — an agent acts. GPT-5.4's capabilities push more workflows into the "agent" category, where the AI can execute multi-step processes end-to-end with minimal human supervision. If you want to understand the architectural patterns behind these agent systems, our guide on agent orchestration patterns covers the key design decisions. For a broader look at how AI agents are reshaping development workflows, see our deep dive on autonomous dev workflows.
Third, the competitive dynamics are intensifying. Anthropic's Claude models are within 2-3 points of GPT-5.4 on OSWorld, and Google is likely working on equivalent capabilities for Gemini. This three-way competition is producing rapid improvements — the 28-point OSWorld jump since December 2025 is a direct result of this competitive pressure.
What to Watch Next
The AI agent space is evolving fast. Several developments are worth monitoring over the coming months:
- OSWorld scores from Claude's next model update (Anthropic has been closing the gap steadily)
- Google's response with Gemini's agentic capabilities (strong multimodal reasoning gives them a unique angle for agent workflows)
- Enterprise adoption data: how many organizations actually deploy computer-use agents in production versus running pilots
- Safety and reliability frameworks: superhuman benchmark scores do not mean zero errors, and the failure modes for computer-use agents in production environments remain an active area of research
References
1. OpenAI launches GPT-5.4 with Pro and Thinking versions — TechCrunch
2. OpenAI Launches GPT-5.4 With Computer Agent Capabilities, Beats Human Baseline on OSWorld — AI Haven
3. GPT-5.4 Unveiled: Native Computer Use and a Million-Token Context Window — Applying AI
4. 2025-2026 AI Computer-Use Benchmarks and Top AI Agents Guide — o-mega
5. OpenAI launches GPT-5.4, its most powerful model for enterprise work — Fortune
6. GPT-5.4 Is Now Available in Microsoft Foundry — Cloud Factory Group