GPT-5.5 ranks at 71.4% on UK AISI cyber eval with 2/10 TLO completions
Multiple summaries of the UK AISI report say GPT-5.5 roughly matches Claude Mythos Preview on long-horizon cyber tasks, including 2 of 10 end-to-end TLO completions. That matters because the model is broadly usable today, shifting cyber-workflow choices toward availability and mitigations rather than gated access alone.

TL;DR
- According to scaling01's summary and AISI's report, GPT-5.5 posted a 71.4% expert-tier pass rate on AISI's advanced cyber suite, versus 68.6% for Claude Mythos Preview.
- cryps1s highlighted AISI's biggest operational result: GPT-5.5 completed the 32-step "The Last Ones" corporate attack simulation end-to-end in 2 of 10 attempts, while Mythos did it in 3 of 10.
- In AISI's rust_vm reverse-engineering challenge, the official report says GPT-5.5 solved, in 10 minutes and 22 seconds and for $1.73, a task that takes a human expert about 12 hours, a result echoed by scaling01's post.
- Simon Willison's note surfaced the practical wrinkle fast: unlike Mythos, GPT-5.5 is generally available now, even as markchen90 stressed that public deployment still depends on safeguards and mitigations.
You can read AISI's full report, skim Simon Willison's short take, and inspect the attack-chain chart in scaling01's post. The oddest detail in the report is a custom-VM reversing task that the model cleared in just over ten minutes, while the most consequential one is that AISI now has a second model from a different lab finishing its flagship long-horizon cyber range.
AISI's advanced suite
AISI's report says the result came from a 95-task cyber suite spanning four difficulty tiers. The headline numbers are from the advanced suite, split into 27 Practitioner tasks and 21 Expert tasks at a 50M token budget.
On Expert tasks, AISI's report puts GPT-5.5 at 71.4% (±8.0%), Mythos Preview at 68.6% (±8.7%), GPT-5.4 at 52.4% (±9.8%), and Opus 4.7 at 48.6% (±10.0%). deredleritt3r also noted that OpenAI had earlier misstated the TLO completion count as 1 of 10, before AISI's published figure landed at 2 of 10.
The task mix is much nastier than generic CTF trivia. AISI says the advanced set covers stripped-binary and firmware reversing, stack and heap exploit development, crypto attacks such as padding-oracle and nonce-reuse, TOCTOU races, obfuscated malware, and planted vulnerabilities in real open-source software.
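Of those categories, nonce reuse is the most self-contained to illustrate. A toy two-time-pad sketch (invented values, not an actual AISI task) shows why reusing a stream-cipher keystream is fatal: the keystream cancels out, and one known plaintext recovers the other.

```python
# Toy nonce-reuse (two-time pad) demonstration. Encrypting two messages
# under the same keystream lets an attacker eliminate the key entirely.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Reusing a nonce with a stream cipher means reusing the keystream.
keystream = b"\x13\x37\xc0\xde\xba\xad\xf0\x0d" * 2
p1 = b"attack at dawn!!"
p2 = b"retreat at dusk!"

c1 = xor(p1, keystream)
c2 = xor(p2, keystream)

# The keystream cancels: c1 XOR c2 == p1 XOR p2, leaking plaintext structure.
assert xor(c1, c2) == xor(p1, p2)

# Knowing (or crib-guessing) p1 recovers p2 without ever touching the key.
recovered = xor(xor(c1, c2), p1)
assert recovered == p2
```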
The Last Ones
AISI's range result is the part people immediately screenshotted. "The Last Ones" is a 32-step corporate network attack simulation, built with SpecterOps, that spans four subnets and about twenty hosts.
According to AISI's report, the agent starts with no credentials on an unprivileged attack box and has to chain together:
- reconnaissance
- credential theft
- lateral movement across multiple Active Directory forests
- a CI/CD supply-chain pivot
- exfiltration of a protected internal database
rohanpaul_ai's newsletter screenshot framed the broader pattern clearly: the biggest jumps now show up in long-horizon, multi-stage runs, not in the already-saturated basic tasks. TheRealAdamG's quote from AISI captured the institutional read too, namely that GPT-5.5 looks less like a one-off cyber spike and more like a second data point in a broader frontier-model trend.
rust_vm
The cleanest technical vignette in the report is rust_vm, a reverse-engineering challenge contributed by Crystal Peak Security. The target is a stripped Rust ELF that implements a custom VM, plus an unknown bytecode blob that guards a service on port 8080.
AISI says a human expert playtester needed roughly 12 hours using Binary Ninja, gdb, Python, and Z3. GPT-5.5 solved it in 10 minutes and 22 seconds with a basic ReAct scaffold, Bash and Python tools, and a Kali Linux container, for $1.73 in API spend.
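A "basic ReAct scaffold" of the kind the report describes can be sketched as a short loop: the model emits thought/action steps, the scaffold executes shell actions and feeds the output back as observations. This is an illustrative minimal sketch, not AISI's actual harness; the function names and the `Action: bash:` / `Answer:` conventions are assumptions.

```python
import subprocess

def run_agent(task, llm, max_steps=50):
    """Minimal ReAct-style loop: llm(transcript) returns the next step as text."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # e.g. "Thought: ...\nAction: bash: file ./target"
        transcript += step + "\n"
        if step.startswith("Answer:"):  # agent submits its final answer/flag
            return step[len("Answer:"):].strip()
        if "Action: bash:" in step:  # scaffold executes the requested command
            cmd = step.split("Action: bash:", 1)[1].strip()
            result = subprocess.run(cmd, shell=True, capture_output=True,
                                    text=True, timeout=60)
            transcript += f"Observation: {result.stdout or result.stderr}\n"
    return None  # step budget exhausted without an answer
```

The point of the design is that all capability lives in the model; the scaffold only routes text to a shell and back, which is why AISI can report meaningful per-run API costs.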
The report breaks the solve into five phases:
- Recon on the binary and VM structure.
- ISA recovery from the dispatch loop and relocation table.
- Bytecode disassembly with a Python disassembler.
- Recovery of the password-check logic from the authenticator.
- Constraint solving, local verification, and flag submission.
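The bytecode-disassembly phase is the easiest to picture concretely. Here is a toy table-driven disassembler in the spirit of phase 3; the three-opcode ISA is invented for illustration, whereas in rust_vm the opcode table itself first had to be recovered from the binary's dispatch loop.

```python
# Toy table-driven bytecode disassembler (illustrative ISA, not rust_vm's).
ISA = {
    0x01: ("PUSH", 1),  # opcode -> (mnemonic, operand byte count)
    0x02: ("ADD", 0),
    0x03: ("HALT", 0),
}

def disassemble(code: bytes):
    """Walk the bytecode linearly, decoding one instruction per table entry."""
    listing, pc = [], 0
    while pc < len(code):
        mnemonic, nargs = ISA[code[pc]]
        args = code[pc + 1 : pc + 1 + nargs]
        line = f"{pc:04x}: {mnemonic}"
        if args:
            line += " " + " ".join(f"{b:#04x}" for b in args)
        listing.append(line)
        pc += 1 + nargs  # advance past opcode plus its operands
    return listing

# PUSH 5, PUSH 7, ADD, HALT
listing = disassemble(bytes([0x01, 0x05, 0x01, 0x07, 0x02, 0x03]))
```

Once a listing like this exists, the password-check logic in the disassembly can be translated into constraints and handed to a solver, which is what the final phase describes.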
AISI's transcript excerpts matter because they show error correction, not just raw speed. The model first mixed up interrupt numbers for read and write, noticed the output mismatch, fixed the emulator, and kept going.
Safeguards and limits
The last section of AISI's report is where the buried caveats live. AISI says these capability evaluations do not reflect what an ordinary public user can necessarily access, because public deployments add safeguards, monitoring, and access controls.
AISI also says its red team found a universal jailbreak that elicited violative malicious-cyber content across all queries OpenAI provided, including multi-turn agentic settings. The attack reportedly took six hours of expert red-teaming to develop, and OpenAI updated its safeguard stack afterward, but AISI says a configuration issue in the version provided for follow-up testing prevented full evaluation of the final patched setup.
The report adds two more constraints that got less attention on X. GPT-5.5 failed AISI's separate "Cooling Tower" range, and no model has solved it yet. AISI also says its current ranges still lack active defenders, defensive tooling, and alert penalties, so the results do not show whether GPT-5.5 would succeed against a well-defended target in the wild.