GPT-5.5 Cyber Eval: AISI Finds Parity with Mythos 2026

May 8, 2026


TL;DR

On April 30, 2026, the UK's AI Security Institute (AISI) published its evaluation of OpenAI's GPT-5.5 cyber capabilities — and the headline is parity, not separation. GPT-5.5 hit a 71.4% pass rate (±8.0% SEM) on AISI's Expert-tier capture-the-flag tasks, narrowly ahead of Claude Mythos Preview at 68.6% (±8.7%), and well above GPT-5.4 (52.4%) and Claude Opus 4.7 (48.6%).1 More striking, GPT-5.5 became the second model ever to autonomously complete AISI's 32-step "The Last Ones" enterprise attack range end-to-end, succeeding in 2 of 10 attempts versus 3 of 10 for Mythos.2 In one reverse-engineering challenge that took a human expert roughly 12 hours with Binary Ninja, gdb, Python, and Z3, GPT-5.5 finished autonomously in 10 minutes 22 seconds at $1.73 of API spend.1 Frontier offensive cyber capability is no longer a single-vendor outlier — it's a property of the frontier itself.


What You'll Learn

  • What AISI tested on April 30 and why parity, not lead, is the real story
  • How GPT-5.5 stacks up against Mythos Preview, GPT-5.4, and Opus 4.7 across AISI's four difficulty tiers
  • How "The Last Ones" 32-step benchmark is structured and what 2/10 success actually means
  • Why a 12-hour reverse-engineering task fell in under 11 minutes for $1.73
  • How OpenAI's "High but below Critical" Preparedness rating maps to the AISI numbers
  • What the May 7 GPT-5.5-Cyber rollout to vetted defenders signals next
  • Defender-side actions to take this week

AISI's April 30 GPT-5.5 cyber evaluation, in one paragraph

The AI Security Institute (AISI) — the UK government body created at the November 2023 Bletchley summit and rebranded from "AI Safety Institute" to "AI Security Institute" on February 14, 2025 during the Munich Security Conference — runs pre- and post-deployment cyber evaluations on frontier models.3 AISI's April 30 GPT-5.5 report is the second in a back-to-back pair: on April 13, 2026, the same team published an evaluation of Anthropic's then-unreleased Claude Mythos Preview that showed the first AI system to autonomously beat AISI's 32-step network attack range.4 Seventeen days later, GPT-5.5 reached comparable scores from a different lab, on the same suite. AISI's framing is unambiguous: cyber-offensive capability is now showing up as a side effect of broader gains in autonomy, reasoning, and coding — not a deliberate Mythos-only training direction.1

How AISI's cyber suite is structured

AISI's core suite consists of 95 narrow cyber tasks across four difficulty tiers, all in capture-the-flag (CTF) format and covering vulnerability research, exploitation, reverse engineering, web exploitation, and cryptography.1 The four tiers are calibrated to operator skill levels rather than abstract difficulty bands:5

  • Technical Non-Expert: high-school-to-university technical user (data analyst, engineer); no formal security background
  • Apprentice: early-career security professional, opportunistic criminal, lower-tier hacker-for-hire; 1–3 years of experience
  • Practitioner: SOC analyst, ransomware developer, focused-domain specialist; 3–10 years
  • Expert: veteran offensive- or defensive-security operator; 10+ years, deep cross-domain experience

Layered on top is "The Last Ones" (TLO) — a single integrated attack range built with SpecterOps, modelled on a real enterprise kill chain across four subnets and roughly twenty hosts. Thirty-two graded steps are bundled into nine milestones (reconnaissance → credential theft → web exploitation → reverse-engineered binaries → CI/CD supply-chain pivot → multi-domain Active Directory traversal → SQL-injection chain → exfiltration of a protected internal database). AISI estimates a human expert needs about 20 hours to complete the full chain.12

Expert-tier pass rates: GPT-5.5 vs the field

The headline scoreboard from AISI's report:1

  • GPT-5.5: 71.4% Expert-tier pass rate (±8.0%, 1 SEM)
  • Claude Mythos Preview: 68.6% (±8.7%)
  • GPT-5.4: 52.4% (±9.8%)
  • Claude Opus 4.7: 48.6% (±10.0%)

Two reads matter here. First, the gap between GPT-5.5 and Mythos is well inside their overlapping standard errors, so the responsible framing is "parity," not "lead." Second, the gap between this generation and the prior one is enormous: Mythos and GPT-5.5 are roughly 20 percentage points ahead of GPT-5.4 and Opus 4.7 on Expert-tier work. AISI explicitly notes the broader pattern — performance keeps scaling log-linearly with inference-time compute, and even today's strongest models have not plateaued on Expert-tier tasks.1
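The parity claim can be checked directly from the reported numbers: the 2.8-point gap is a small fraction of the combined standard error of the two estimates. A minimal sketch (the combined-SE formula assumes the two pass rates are independent estimates):

```python
import math

# Expert-tier pass rates and 1-SEM values from the AISI report
gpt55, gpt55_sem = 71.4, 8.0
mythos, mythos_sem = 68.6, 8.7

gap = gpt55 - mythos                                   # 2.8 points
combined_se = math.sqrt(gpt55_sem**2 + mythos_sem**2)  # ~11.8 points
z = gap / combined_se                                  # ~0.24 standard errors

print(f"gap={gap:.1f} pts, combined SE={combined_se:.1f}, z={z:.2f}")
```

A z of roughly 0.24 is far below any conventional significance threshold, which is why "parity" rather than "lead" is the defensible read.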

"The Last Ones": what 2-of-10 actually means

End-to-end success on TLO is the headline figure, but the run-rate numbers are the substantive ones. GPT-5.5 completed all 32 steps autonomously in 2 of 10 attempts, making it the second model ever to do so. Mythos Preview did it in 3 of 10 attempts, the first.2 A human expert needs roughly 20 hours of focused work for the same chain, on a network designed by SpecterOps to mirror real corporate Active Directory complexity.

A 20% end-to-end success rate sounds modest until you put it next to the economics. The blocking factor for human red teams is salary, scheduling, and ramp time. The blocking factor for an AI agent is API spend and parallel-run capacity — both of which scale on the cloud-vendor side, not the attacker side. A 20% per-attempt rate at $50–$200 per attempt becomes a near-certain compromise within 10–20 independent retries (roughly 89% and 99% cumulative success, respectively), where each retry is a fresh seed. AISI's log-linear scaling finding amplifies the point: every additional doubling of inference budget tilts the success curve further toward "always succeeds eventually."1
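The retry arithmetic is a simple geometric series. A sketch, using the 2-of-10 TLO rate and the $50–$200 per-attempt cost range quoted above (the cost range is this article's illustrative figure, not an AISI number):

```python
# Cumulative success probability after n independent attempts at per-attempt rate p
def p_success(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

p = 0.2  # 2-of-10 TLO completion rate
for n in (10, 20):
    print(f"{n} attempts: {p_success(p, n):.1%} cumulative success")
# 10 attempts: 89.3%, 20 attempts: 98.8%

# Expected attempts to first success is 1/p; spend at $50-$200 per attempt
expected_attempts = 1 / p  # 5 attempts
print(f"expected spend: ${50 * expected_attempts:.0f}-${200 * expected_attempts:.0f}")
# $250-$1000 expected cost to first full compromise
```

The geometric model assumes attempts are independent, which is what "each retry is a fresh seed" buys the attacker.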

The reverse-engineering moment that captures the shift

Buried in the AISI report is a single anecdote that compresses the convergence story into one number. AISI ran a custom-VM reverse-engineering CTF that requires building an instruction decoder to recover a flag from a synthetic architecture. The expert playtester from third-party evaluator Crystal Peak solved it in roughly 12 hours using a stack of Binary Ninja, gdb, Python, and Z3.1

GPT-5.5 solved the same challenge autonomously in 10 minutes and 22 seconds at $1.73 of API usage.1 No human assistance, no scaffolding beyond the standard agent harness, no precomputed hints. The 70× speedup is striking, but the cost compression is the load-bearing detail. A $1.73 unit-cost on a 12-hour expert task changes who can plausibly run mass reverse-engineering campaigns against shipping software — including against the binaries inside critical infrastructure.
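The 70× figure follows directly from the reported times; the cost ratio in the sketch below assumes a hypothetical $150/hour expert rate purely for illustration:

```python
human_minutes = 12 * 60       # ~12 hours of expert time (AISI's playtester)
model_minutes = 10 + 22 / 60  # 10 minutes 22 seconds, autonomous

speedup = human_minutes / model_minutes
print(f"speedup: {speedup:.0f}x")  # ~69x, i.e. roughly 70x

# Cost compression, assuming a hypothetical $150/hour expert rate
human_cost = 12 * 150  # $1,800 of expert time
model_cost = 1.73      # reported API spend
print(f"cost ratio: {human_cost / model_cost:.0f}x")  # ~1000x cheaper per task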

"High but below Critical" — what OpenAI's own framework says

OpenAI's GPT-5.5 system card classifies the model at High capability in cybersecurity under the company's Preparedness Framework, below the Critical threshold.6 The framework's wording matters because AISI's findings have to be read against it.

A High capability model, in OpenAI's definition, can "automate end-to-end cyber operations against reasonably hardened targets" or "significantly remove bottlenecks in discovering operationally relevant vulnerabilities." That language tracks the AISI evidence: 71.4% on Expert CTFs, 2/10 end-to-end on a 32-step range, and 12-hour-to-10-minute compressions on focused RE work.1

A Critical model would have to "identify and develop functional zero-day exploits of all severity levels in many hardened real-world critical systems without human intervention" or "devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high level desired goal."6 OpenAI's documented reason for not crossing that line: GPT-5.5 didn't independently produce a verifier-confirmed full exploit chain in real-world targets during evaluation. The bottleneck wasn't search breadth — it was exploit-development judgment: deciding which leads were worth deep investment, converting crashes into controlled primitives, ruling out diagnostic-only bugs.6

That bottleneck is precisely the kind of capability AISI's log-linear scaling finding suggests will close with more inference compute, not necessarily a new generation.

The universal jailbreak — and why it lands harder this generation

AISI's red-teamers found a universal jailbreak that elicited violative content across all malicious cyber queries, including in multi-turn agentic settings, after six hours of expert effort.1 Six hours is short by red-team standards. Compare it to the deployed-time defensive expectation: most enterprise SOCs budget a multi-week purple-team engagement to find a single durable bypass.

The interpretation isn't that GPT-5.5's safety stack is poor — by industry standards it's well above the GPT-5.4 baseline. The interpretation is that the defender's investment-to-bypass ratio has shifted. When the model's capability is High and a universal jailbreak is six hours of work away, the security perimeter for cyber-offense has effectively moved to OpenAI's monitoring and access-control stack rather than the model's refusal behaviors.

Why AISI publishing twice in 17 days matters

The cadence is the story alongside the numbers. AISI's Mythos Preview report on April 13, 2026 was the first time any AI evaluator had documented an AI agent completing TLO end-to-end.4 Seventeen days later, on April 30, the same team published comparable numbers for GPT-5.5 from a different vendor, trained on a different stack, with a different safety regimen.1 Two labs, broadly the same level on a benchmark designed to be hard.

Frontier offensive cyber capability has shifted from a "Mythos exception" frame to a "frontier property" frame. That has direct implications for two audiences. For policymakers, it kills the argument that capability concentration in one lab can be managed via vendor-specific safeguards alone. For defenders, it means the threat surface scales with whoever ships next — not whoever shipped first.

OpenAI's response: GPT-5.5-Cyber for vetted defenders

On May 7, 2026 — exactly one week after the AISI report — OpenAI began rolling out GPT-5.5-Cyber in limited preview to organizations vetted into the highest tier of its Trusted Access for Cyber (TAC) program.7 The model is "primarily trained to be more permissive on security-related tasks" — bug-hunting, malware reverse-engineering, attack reconstruction — while remaining blocked on credential theft and offensive-malware generation.7

A source familiar with internal benchmarks told Axios that GPT-5.5-Cyber's offensive cyber profile is "roughly on par with Mythos."8 The framing is intentional: same capability surface as the lab AISI described as a step-change, but with structured access gating rather than open availability. Defenders have to apply, prove credentials, and operate under TAC rules.

The choice of timing is itself a signal. AISI's report dropped on April 30. OpenAI's response was a defenders-only variant within five business days — an explicit acknowledgement that the capability is dual-use and that the company's posture is to put the strongest version into vetted hands first.

What defenders should actually do this week

Five concrete actions, calibrated to the AISI findings rather than vendor talking points.

1. Treat 12-hour expert tasks as 11-minute LLM tasks in your threat model. The reverse-engineering anecdote is generalizable. Anything in your defensive workflow that assumes a multi-hour human cost on the attacker side — IDA-Pro analysis, custom-protocol reverse engineering, bespoke crypto break — is now a one-prompt task at single-digit-dollar cost.

2. Stop equating "below Critical" with "manageable." OpenAI's own framework defines High as "automate end-to-end cyber operations against reasonably hardened targets." If your hardening posture would not survive a competent professional running a 32-step attack chain in 20 hours, GPT-5.5 will not be your savior either. Pen-test scoping should reference AISI's TLO milestone structure as a baseline.

3. Re-baseline your CI/CD and Active Directory threat models specifically. TLO's nine milestones explicitly include CI/CD supply-chain pivots and multi-domain AD lateral movement. Both are areas where most enterprises have measurable gaps. Run an internal exercise that maps your current detections against those nine milestones, milestone by milestone.

4. If you qualify, apply to OpenAI's Trusted Access for Cyber. The May 7 rollout is the first defenders-only frontier-cyber model. Apply to verify your team's eligibility regardless of immediate use; access decisions are reversible, but timing matters.7

5. Add inference-compute cost to your detection economics. AISI's log-linear scaling finding means attacker success is a tunable knob — more compute, more success. Your detection cost-per-incident should be benchmarked against attacker cost-per-attempt. A defender who spends $10k per investigation against an attacker spending $1.73 per attempt is structurally underwater.
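Action 3's milestone-by-milestone mapping can start as something as simple as a coverage table keyed to the TLO stages named earlier. A sketch with hypothetical detection names — the stage labels come from the article's description of the chain; every detection ID is a placeholder for your own rules:

```python
# TLO kill-chain stages mapped to the internal detections expected to fire
# at each stage. Detection names here are hypothetical placeholders.
coverage = {
    "reconnaissance":              ["dns-anomaly-01"],
    "credential theft":            ["kerberoast-alert", "lsass-access"],
    "web exploitation":            ["waf-sqli-block"],
    "reverse-engineered binaries": [],  # gap
    "CI/CD supply-chain pivot":    [],  # gap
    "multi-domain AD traversal":   ["cross-domain-logon"],
    "SQL-injection chain":         ["waf-sqli-block"],
    "exfiltration":                ["egress-volume-spike"],
}

gaps = [stage for stage, detections in coverage.items() if not detections]
print(f"{len(gaps)} uncovered stages: {', '.join(gaps)}")
```

Running the exercise this way forces an explicit answer per stage instead of a blanket "we have EDR"; any empty list is a concrete remediation ticket.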

The bottom line

April 30, 2026 is the date frontier offensive cyber capability stopped being a single-vendor story. GPT-5.5 reaching parity with Claude Mythos on AISI's hardest tests, 17 days after Mythos was first to complete the same benchmarks, makes the convergence concrete. The numbers — 71.4% Expert pass rate, 2-of-10 TLO completions, $1.73 to compress 12 hours of reverse engineering — are the surface. The deeper signal is AISI's log-linear scaling observation: capability tracks inference budget, not generation. And the defender's job has just shifted from "watch the leading lab" to "assume the frontier itself is the threat."

Footnotes

  1. AI Security Institute, "Our evaluation of OpenAI's GPT-5.5 cyber capabilities," April 30, 2026. aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities

  2. AI Security Institute, "How do frontier AI agents perform in multi-step cyber-attack scenarios?" — describes "The Last Ones" benchmark methodology. aisi.gov.uk/blog/how-do-frontier-ai-agents-perform-in-multi-step-cyber-attack-scenarios

  3. "AI Security Institute," Wikipedia (covers the February 14, 2025 rebrand from AI Safety Institute). en.wikipedia.org/wiki/AI_Security_Institute

  4. AI Security Institute, "Our evaluation of Claude Mythos Preview's cyber capabilities," April 13, 2026. aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities

  5. AISI difficulty-tier definitions (Technical Non-Expert, Apprentice, Practitioner, Expert) appear in both the April 13 Mythos and April 30 GPT-5.5 evaluation posts.

  6. OpenAI, "GPT-5.5 System Card" (Preparedness Framework cybersecurity classification). openai.com/index/gpt-5-5-system-card

  7. OpenAI, "Scaling Trusted Access for Cyber with GPT-5.5 and GPT-5.5-Cyber," May 7, 2026. openai.com/index/gpt-5-5-with-trusted-access-for-cyber

  8. Axios, "OpenAI makes GPT-5.5 more widely available to cyber defenders," May 7, 2026. axios.com/2026/05/07/openai-gpt-55-cybersecurity-model

Frequently Asked Questions

What did AISI's April 30, 2026 report evaluate?

A pre-deployment evaluation of OpenAI's GPT-5.5, focused on cyber capability across 95 narrow CTF tasks (split into four difficulty tiers) and the integrated 32-step "The Last Ones" enterprise attack range. The report concludes GPT-5.5 is among the strongest models AISI has tested and is the second to complete TLO end-to-end.1
