GPT-5.5 Cyber Eval: AISI Finds Parity with Mythos 2026

May 8, 2026


TL;DR

On April 30, 2026, the UK's AI Security Institute (AISI) published its evaluation of OpenAI's GPT-5.5 cyber capabilities — and the headline is parity, not separation. GPT-5.5 hit a 71.4% pass rate (±8.0% SEM) on AISI's Expert-tier capture-the-flag tasks, narrowly ahead of Claude Mythos Preview at 68.6% (±8.7%), and well above GPT-5.4 (52.4%) and Claude Opus 4.7 (48.6%).1 More striking, GPT-5.5 became the second model ever to autonomously complete AISI's 32-step "The Last Ones" enterprise attack range end-to-end, succeeding in 2 of 10 attempts versus 3 of 10 for Mythos.2 In one reverse-engineering challenge that took a human expert roughly 12 hours with Binary Ninja, gdb, Python, and Z3, GPT-5.5 finished autonomously in 10 minutes 22 seconds at $1.73 of API spend.1 Frontier offensive cyber capability is no longer a single-vendor outlier — it's a property of the frontier itself.


What You'll Learn

  • What AISI tested on April 30 and why parity, not lead, is the real story
  • How GPT-5.5 stacks up against Mythos Preview, GPT-5.4, and Opus 4.7 across AISI's four difficulty tiers
  • How "The Last Ones" 32-step benchmark is structured and what 2/10 success actually means
  • Why a 12-hour reverse-engineering task fell in under 11 minutes for $1.73
  • How OpenAI's "High but below Critical" Preparedness rating maps to the AISI numbers
  • What the May 7 GPT-5.5-Cyber rollout to vetted defenders signals next
  • Defender-side actions to take this week

AISI's April 30 GPT-5.5 cyber evaluation, in one paragraph

The AI Security Institute (AISI) — the UK government body created at the November 2023 Bletchley summit and rebranded from "AI Safety Institute" to "AI Security Institute" on February 14, 2025 during the Munich Security Conference — runs pre- and post-deployment cyber evaluations on frontier models.3 AISI's April 30 GPT-5.5 report is the second in a back-to-back pair: on April 13, 2026, the same team published an evaluation of Anthropic's then-unreleased Claude Mythos Preview that showed the first AI system to autonomously beat AISI's 32-step network attack range.4 Seventeen days later, GPT-5.5 reached comparable scores from a different lab, on the same suite. AISI's framing is unambiguous: cyber-offensive capability is now showing up as a side effect of broader gains in autonomy, reasoning, and coding — not a deliberate Mythos-only training direction.1

How AISI's cyber suite is structured

AISI's core suite consists of 95 narrow cyber tasks across four difficulty tiers, all in capture-the-flag (CTF) format and covering vulnerability research, exploitation, reverse engineering, web exploitation, and cryptography.1 The four tiers are calibrated to operator skill levels rather than abstract difficulty bands:5

  • Technical Non-Expert: high-school-to-university technical user (data analyst, engineer); no formal security background
  • Apprentice: early-career security professional, opportunistic criminal, lower-tier hacker-for-hire; 1–3 years of experience
  • Practitioner: SOC analyst, ransomware developer, focused-domain specialist; 3–10 years
  • Expert: veteran offensive- or defensive-security operator; 10+ years, deep cross-domain experience

Layered on top is "The Last Ones" (TLO) — a single integrated attack range built with SpecterOps, modelled on a real enterprise kill chain across four subnets and roughly twenty hosts. Thirty-two graded steps are bundled into nine milestones (reconnaissance → credential theft → web exploitation → reverse-engineered binaries → CI/CD supply-chain pivot → multi-domain Active Directory traversal → SQL-injection chain → exfiltration of a protected internal database). AISI estimates a human expert needs about 20 hours to complete the full chain.12

Expert-tier pass rates: GPT-5.5 vs the field

The headline scoreboard from AISI's report:1

  • GPT-5.5: 71.4% Expert-tier pass rate (±8.0%, 1 SEM)
  • Claude Mythos Preview: 68.6% (±8.7%)
  • GPT-5.4: 52.4% (±9.8%)
  • Claude Opus 4.7: 48.6% (±10.0%)

Two reads matter here. First, the gap between GPT-5.5 and Mythos is well inside their overlapping standard errors, so the responsible framing is "parity," not "lead." Second, the gap between this generation and the prior one is enormous: Mythos and GPT-5.5 are roughly 20 percentage points ahead of GPT-5.4 and Opus 4.7 on Expert-tier work. AISI explicitly notes the broader pattern — performance keeps scaling log-linearly with inference-time compute, and even today's strongest models have not plateaued on Expert-tier tasks.1
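The parity claim can be checked directly from the reported numbers: the 2.8-point gap is a small fraction of the combined standard error of the two estimates. A minimal sketch (the combined-SE formula assumes the two pass rates are independent estimates):

```python
import math

# Expert-tier pass rates and 1-SEM values from the AISI report
gpt55, gpt55_sem = 71.4, 8.0
mythos, mythos_sem = 68.6, 8.7

gap = gpt55 - mythos                                   # 2.8 points
combined_se = math.sqrt(gpt55_sem**2 + mythos_sem**2)  # ~11.8 points
z = gap / combined_se                                  # ~0.24 standard errors

print(f"gap={gap:.1f} pts, combined SE={combined_se:.1f}, z={z:.2f}")
```

A z of roughly 0.24 is far below any conventional significance threshold, which is why "parity" rather than "lead" is the defensible read.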

"The Last Ones": what 2-of-10 actually means

End-to-end success on TLO is the headline figure, but the run-rate numbers are the substantive ones. GPT-5.5 completed all 32 steps autonomously in 2 of 10 attempts, making it the second model ever to do so. Mythos Preview did it in 3 of 10 attempts, the first.2 A human expert needs roughly 20 hours of focused work for the same chain, on a network designed by SpecterOps to mirror real corporate Active Directory complexity.

A 20% end-to-end success rate sounds modest until you put it next to the economics. The blocking factor for human red teams is salary, scheduling, and ramp time. The blocking factor for an AI agent is API spend and parallel-run capacity — both of which scale on the cloud-vendor side, not the attacker side. A 20% per-attempt rate at $50–$200 per attempt becomes a near-certain compromise within 10–20 independent retries (roughly 89% and 99% cumulative success, respectively), where each retry is a fresh seed. AISI's log-linear scaling finding amplifies the point: every additional doubling of inference budget tilts the success curve further toward "always succeeds eventually."1
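The retry arithmetic is a simple geometric series. A sketch, using the 2-of-10 TLO rate and the $50–$200 per-attempt cost range quoted above (the cost range is this article's illustrative figure, not an AISI number):

```python
# Cumulative success probability after n independent attempts at per-attempt rate p
def p_success(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

p = 0.2  # 2-of-10 TLO completion rate
for n in (10, 20):
    print(f"{n} attempts: {p_success(p, n):.1%} cumulative success")
# 10 attempts: 89.3%, 20 attempts: 98.8%

# Expected attempts to first success is 1/p; spend at $50-$200 per attempt
expected_attempts = 1 / p  # 5 attempts
print(f"expected spend: ${50 * expected_attempts:.0f}-${200 * expected_attempts:.0f}")
# $250-$1000 expected cost to first full compromise
```

The geometric model assumes attempts are independent, which is what "each retry is a fresh seed" buys the attacker.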

The reverse-engineering moment that captures the shift

Buried in the AISI report is a single anecdote that compresses the convergence story into one number. AISI ran a custom-VM reverse-engineering CTF that requires building an instruction decoder to recover a flag from a synthetic architecture. The expert playtester from third-party evaluator Crystal Peak solved it in roughly 12 hours using a stack of Binary Ninja, gdb, Python, and Z3.1

GPT-5.5 solved the same challenge autonomously in 10 minutes and 22 seconds at $1.73 of API usage.1 No human assistance, no scaffolding beyond the standard agent harness, no precomputed hints. The 70× speedup is striking, but the cost compression is the load-bearing detail. A $1.73 unit-cost on a 12-hour expert task changes who can plausibly run mass reverse-engineering campaigns against shipping software — including against the binaries inside critical infrastructure.
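The 70× figure follows directly from the reported times; the cost ratio in the sketch below assumes a hypothetical $150/hour expert rate purely for illustration:

```python
human_minutes = 12 * 60       # ~12 hours of expert time (AISI's playtester)
model_minutes = 10 + 22 / 60  # 10 minutes 22 seconds, autonomous

speedup = human_minutes / model_minutes
print(f"speedup: {speedup:.0f}x")  # ~69x, i.e. roughly 70x

# Cost compression, assuming a hypothetical $150/hour expert rate
human_cost = 12 * 150  # $1,800 of expert time
model_cost = 1.73      # reported API spend
print(f"cost ratio: {human_cost / model_cost:.0f}x")  # ~1000x cheaper per task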

"High but below Critical" — what OpenAI's own framework says

OpenAI's GPT-5.5 system card classifies the model at High capability in cybersecurity under the company's Preparedness Framework, below the Critical threshold.6 The framework's wording matters because AISI's findings have to be read against it.

A High capability model, in OpenAI's definition, can "automate end-to-end cyber operations against reasonably hardened targets" or "significantly remove bottlenecks in discovering operationally relevant vulnerabilities." That language tracks the AISI evidence: 71.4% on Expert CTFs, 2/10 end-to-end on a 32-step range, and 12-hour-to-10-minute compressions on focused RE work.1

A Critical model would have to "identify and develop functional zero-day exploits of all severity levels in many hardened real-world critical systems without human intervention" or "devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high level desired goal."6 OpenAI's documented reason for not crossing that line: GPT-5.5 didn't independently produce a verifier-confirmed full exploit chain in real-world targets during evaluation. The bottleneck wasn't search breadth — it was exploit-development judgment: deciding which leads were worth deep investment, converting crashes into controlled primitives, ruling out diagnostic-only bugs.6

That bottleneck is precisely the kind of capability AISI's log-linear scaling finding suggests will close with more inference compute, not necessarily a new generation.

The universal jailbreak — and why it lands harder this generation

AISI's red-teamers found a universal jailbreak that elicited violative content across all malicious cyber queries, including in multi-turn agentic settings, after six hours of expert effort.1 Six hours is short by red-team standards. Compare it to the deployed-time defensive expectation: most enterprise SOCs budget a multi-week purple-team engagement to find a single durable bypass.

The interpretation isn't that GPT-5.5's safety stack is poor — by industry standards it's well above the GPT-5.4 baseline. The interpretation is that the defender's investment-to-bypass ratio has shifted. When the model's capability is High and a universal jailbreak is six hours of work away, the security perimeter for cyber-offense has effectively moved to OpenAI's monitoring and access-control stack rather than the model's refusal behaviors.

Why AISI publishing twice in 17 days matters

The cadence is the story alongside the numbers. AISI's Mythos Preview report on April 13, 2026 was the first time any AI evaluator had documented an AI agent completing TLO end-to-end.4 Seventeen days later, on April 30, the same team published comparable numbers for GPT-5.5 from a different vendor, trained on a different stack, with a different safety regimen.1 Two labs, broadly the same level on a benchmark designed to be hard.

Frontier offensive cyber capability has shifted from a "Mythos exception" frame to a "frontier property" frame. That has direct implications for two audiences. For policymakers, it kills the argument that capability concentration in one lab can be managed via vendor-specific safeguards alone. For defenders, it means the threat surface scales with whoever ships next — not whoever shipped first.

OpenAI's response: GPT-5.5-Cyber for vetted defenders

On May 7, 2026 — exactly one week after the AISI report — OpenAI began rolling out GPT-5.5-Cyber in limited preview to organizations vetted into the highest tier of its Trusted Access for Cyber (TAC) program.7 The model is "primarily trained to be more permissive on security-related tasks" — bug-hunting, malware reverse-engineering, attack reconstruction — while remaining blocked on credential theft and offensive-malware generation.7

A source familiar with internal benchmarks told Axios that GPT-5.5-Cyber's offensive cyber profile is "roughly on par with Mythos."8 The framing is intentional: same capability surface as the lab AISI described as a step-change, but with structured access gating rather than open availability. Defenders have to apply, prove credentials, and operate under TAC rules.

The choice of timing is itself a signal. AISI's report dropped on April 30. OpenAI's response was a defenders-only variant within five business days — an explicit acknowledgement that the capability is dual-use and that the company's posture is to put the strongest version into vetted hands first.

What defenders should actually do this week

Five concrete actions, calibrated to the AISI findings rather than vendor talking points.

1. Treat 12-hour expert tasks as 11-minute LLM tasks in your threat model. The reverse-engineering anecdote is generalizable. Anything in your defensive workflow that assumes a multi-hour human cost on the attacker side — IDA-Pro analysis, custom-protocol reverse engineering, bespoke crypto break — is now a one-prompt task at single-digit-dollar cost.

2. Stop equating "below Critical" with "manageable." OpenAI's own framework defines High as "automate end-to-end cyber operations against reasonably hardened targets." If your hardening posture would not survive a competent professional running a 32-step attack chain in 20 hours, GPT-5.5 will not be your savior either. Pen-test scoping should reference AISI's TLO milestone structure as a baseline.

3. Re-baseline your CI/CD and Active Directory threat models specifically. TLO's nine milestones explicitly include CI/CD supply-chain pivots and multi-domain AD lateral movement. Both are areas where most enterprises have measurable gaps. Run an internal exercise that maps your current detections against those nine milestones, milestone by milestone.

4. If you qualify, apply to OpenAI's Trusted Access for Cyber. The May 7 rollout is the first defenders-only frontier-cyber model. Apply to verify your team's eligibility regardless of immediate use; access decisions are reversible, but timing matters.7

5. Add inference-compute cost to your detection economics. AISI's log-linear scaling finding means attacker success is a tunable knob — more compute, more success. Your detection cost-per-incident should be benchmarked against attacker cost-per-attempt. A defender who spends $10k per investigation against an attacker spending $1.73 per attempt is structurally underwater.
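Action 3's milestone-by-milestone mapping can start as something as simple as a coverage table keyed to the TLO stages named earlier. A sketch with hypothetical detection names — the stage labels come from the article's description of the chain; every detection ID is a placeholder for your own rules:

```python
# TLO kill-chain stages mapped to the internal detections expected to fire
# at each stage. Detection names here are hypothetical placeholders.
coverage = {
    "reconnaissance":              ["dns-anomaly-01"],
    "credential theft":            ["kerberoast-alert", "lsass-access"],
    "web exploitation":            ["waf-sqli-block"],
    "reverse-engineered binaries": [],  # gap
    "CI/CD supply-chain pivot":    [],  # gap
    "multi-domain AD traversal":   ["cross-domain-logon"],
    "SQL-injection chain":         ["waf-sqli-block"],
    "exfiltration":                ["egress-volume-spike"],
}

gaps = [stage for stage, detections in coverage.items() if not detections]
print(f"{len(gaps)} uncovered stages: {', '.join(gaps)}")
```

Running the exercise this way forces an explicit answer per stage instead of a blanket "we have EDR"; any empty list is a concrete remediation ticket.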

The bottom line

April 30, 2026 is the date frontier offensive cyber capability stopped being a single-vendor story. GPT-5.5 reaching parity with Claude Mythos on AISI's hardest tests, 17 days after Mythos was first to complete the same benchmarks, makes the convergence concrete. The numbers — 71.4% Expert pass rate, 2-of-10 TLO completions, $1.73 to compress 12 hours of reverse engineering — are the surface. The deeper signal is AISI's log-linear scaling observation: capability tracks inference budget, not generation. And the defender's job has just shifted from "watch the leading lab" to "assume the frontier itself is the threat."

Footnotes

  1. AI Security Institute, "Our evaluation of OpenAI's GPT-5.5 cyber capabilities," April 30, 2026. aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities

  2. AI Security Institute, "How do frontier AI agents perform in multi-step cyber-attack scenarios?" — describes "The Last Ones" benchmark methodology. aisi.gov.uk/blog/how-do-frontier-ai-agents-perform-in-multi-step-cyber-attack-scenarios

  3. "AI Security Institute," Wikipedia (covers the February 14, 2025 rebrand from AI Safety Institute). en.wikipedia.org/wiki/AI_Security_Institute

  4. AI Security Institute, "Our evaluation of Claude Mythos Preview's cyber capabilities," April 13, 2026. aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities

  5. AISI difficulty-tier definitions (Technical Non-Expert, Apprentice, Practitioner, Expert) appear in both the April 13 Mythos and April 30 GPT-5.5 evaluation posts.

  6. OpenAI, "GPT-5.5 System Card" (Preparedness Framework cybersecurity classification). openai.com/index/gpt-5-5-system-card

  7. OpenAI, "Scaling Trusted Access for Cyber with GPT-5.5 and GPT-5.5-Cyber," May 7, 2026. openai.com/index/gpt-5-5-with-trusted-access-for-cyber

  8. Axios, "OpenAI makes GPT-5.5 more widely available to cyber defenders," May 7, 2026. axios.com/2026/05/07/openai-gpt-55-cybersecurity-model

Frequently Asked Questions

What did AISI's April 30, 2026 report evaluate?

A pre-deployment evaluation of OpenAI's GPT-5.5, focused on cyber capability across 95 narrow CTF tasks (split into four difficulty tiers) and the integrated 32-step "The Last Ones" enterprise attack range. The report concludes GPT-5.5 is among the strongest models AISI has tested and is the second to complete TLO end-to-end.1
