UK AISI: GPT-5.5 Matches Claude Mythos on Full Network Attack Simulation

TL;DR

The UK AI Security Institute reports that OpenAI's GPT-5.5 is the second model capable of autonomously solving a full network attack simulation. Performance is nearly on par with Anthropic's Mythos, which remains restricted to a small group. GPT-5.5 is generally available in ChatGPT and the API. The capability gating that was justified for Mythos no longer holds across the frontier — a GA model has reached the same offensive-cyber bar.

Key findings

Second model to autonomously complete a full network attack simulation. The first was Anthropic Mythos.
Capability nearly on par with Mythos on UK AISI's cyber benchmark.
GA model. GPT-5.5 ships in ChatGPT and through the OpenAI API.
Mythos remains restricted access despite the parity.

Why it matters

The justification for Mythos's restricted release was its offensive cyber capability. With GPT-5.5 reaching nearly the same level while being publicly available, the policy logic of capability-based withholding is empirically broken: an attacker can now get near-equivalent capability from a different lab. The defender-product response (Anthropic Claude Security, 05-01) is the immediate commercial consequence; the policy response is undefined.

Relation to prior wiki knowledge

Closes the Anthropic Mythos withholding thread (04-17). The wiki has noted since 04-17 that the case for Mythos's restricted release was the load-bearing policy claim. That claim is now empirically falsified: a GA frontier model from a different lab has nearly the same capability.

Pairs same-day with Anthropic Claude Security launch (05-01). Anthropic shipped a defender product the same day. The narrative ordering is: restrict offensive capability → competitor reaches parity in a GA product → ship a defender product instead. The market is doing the policy work.

Pairs with FlashRT (05-02). FlashRT improves the optimization-based attack pipeline 2–7×. Combined with a GA GPT-5.5 capable of full network attack simulation, the cost of high-quality cyber-offensive prompting just dropped on two vectors at once.

Reframes the CAIS AI Dashboard ranking (AISN #72, 05-01). GPT-5.5 ranks first overall in text and vision but fourth on the risk index, behind all Anthropic models. The risk index is now the primary differentiator on the public dashboard; capability proper has converged.

Open threads

Reproducibility of UK AISI's benchmark. Single-evaluator capability claims need a second evaluator. METR or Apollo running the same simulation on GPT-5.5 in the next 30 days would resolve this.
Mythos vs GPT-5.5 hands-on comparison. The "nearly on par" qualifier hides the actual delta — failure modes likely differ even if pass rates match.
Policy response. The Mythos-restriction logic doesn't survive this finding; whether the policy framework adapts (capability-gap-based gating, time-limited gating, defender-side balancing) is the open question.
