Red Teaming
Adversarial testing and exploit discovery for AI systems.
Stories
OpenAI launched Daybreak, combining GPT-5.5, Codex workflows, repo scanning, threat modeling, and patch generation for cyber-defense teams. It packages frontier models into a continuous secure-software workflow, so teams can test whether it fits their response pipeline.
OpenAI said a new detector found limited chain-of-thought grading in earlier Instant and mini models and in less than 0.6% of GPT-5.4 Thinking samples. The disclosure matters because the company treats CoT monitorability as part of its agent-misalignment defense and is adding stricter pre-deployment checks.
Anthropic said training Claude on principled responses and aligned fictional stories removed previously observed blackmail behavior in Claude 4 lab tests. The post matters because Anthropic says the broader interventions generalized better than narrow eval-matching examples and survived RL fine-tuning.
OpenAI introduced GPT-5.5-Cyber in limited preview for defensive security teams and paired it with GPT-5.5 plus Trusted Access for Cyber. The release matters because OpenAI is separating cyber-specific access and permissiveness from general-model access rather than treating security work as a normal prompting mode.
Multiple summaries of the UK AISI report say GPT-5.5 roughly matches Claude Mythos Preview on long-horizon cyber tasks, including 2 of 10 end-to-end TLO completions. That matters because the model is broadly usable today, shifting cyber-workflow choices toward availability and mitigations rather than gated access alone.
UK regulators put Claude Mythos on formal briefing agendas while US officials also pushed banks to evaluate it. Watch the independent critiques of Anthropic's exploit method, low-level access behavior, and small-model comparisons before treating the release as production-ready.
Anthropic launched Project Glasswing, giving selected partners access to Claude Mythos Preview and publishing a system card with strong coding and cyber benchmark results. It stays off the public API for now, so teams should treat it as a restricted dual-use security release rather than a normal model launch.
Anthropic published a research method that compares model internals against a trusted reference to surface behaviors unique to a new open-weight model. The approach can narrow safety and eval audits to deltas, but Anthropic says it can still over-flag analogous features.
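The delta-audit idea can be sketched as matching each feature of the new model against its nearest counterpart in a trusted reference and flagging the ones with no close match. This is an illustrative toy, not Anthropic's method: the function name, the feature-vector representation, and the cosine-similarity threshold are all assumptions.

```python
import numpy as np

def flag_novel_features(ref_feats, new_feats, threshold=0.8):
    """Flag features of a new model with no close match in a reference model.

    ref_feats, new_feats: arrays of shape (n_features, dim), one direction
    per learned feature. The cosine-matching criterion stands in for
    whatever internals comparison the real pipeline uses.
    """
    # Normalize rows so dot products are cosine similarities.
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    new = new_feats / np.linalg.norm(new_feats, axis=1, keepdims=True)
    # Best match in the reference model for each new-model feature.
    best_match = (new @ ref.T).max(axis=1)
    # Features below the threshold are "deltas" worth auditing first.
    return np.where(best_match < threshold)[0]
```

Note the caveat from the post carries over directly: a low best-match score may mean a genuinely novel behavior, or just an analogous feature the new model represents differently, which is why the method can over-flag.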
Public Anthropic draft posts described Claude Mythos as the company's most powerful model and placed a new Capybara tier above Opus 4.6. The documents also point to cybersecurity capability and compute cost as rollout constraints.
Google DeepMind published a real-world manipulation benchmark and toolkit built from nine studies with more than 10,000 participants, finding stronger influence effects in finance scenarios than in health scenarios. Safety teams can use it to test persuasive failure modes, so add it to red-team plans for user-facing agents.
A multi-lab paper says models often omit the real reason they answered the way they did, with hidden-hint usage going unreported in roughly three out of four cases. Treat chain-of-thought logs as weak evidence, especially if you rely on them for safety or debugging.
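The hidden-hint measurement behind that finding can be sketched as follows: plant a hint pointing at a particular answer, count the hint as "used" when it flips the model's answer to the hinted one, then check whether the chain of thought admits it. The field names and the flip-based usage criterion are illustrative, not the paper's actual harness.

```python
def cot_faithfulness(samples):
    """Estimate how often a model admits it used a planted hint.

    Each sample is a dict with:
      base_answer: answer with no hint in the prompt
      hint_answer: answer after a hint pointing at `hinted` was planted
      hinted:      the answer the hint points to
      cot_mentions_hint: whether the chain of thought verbalizes the hint
    """
    # A hint counts as "used" when it flipped the answer to the hinted one.
    used = [s for s in samples
            if s["hint_answer"] == s["hinted"] != s["base_answer"]]
    if not used:
        return None  # the hint never changed behavior; nothing to score
    # Faithfulness: fraction of hint-driven answers whose CoT admits the hint.
    return sum(s["cot_mentions_hint"] for s in used) / len(used)
```

Under this framing, the paper's "unreported in roughly three out of four cases" corresponds to a faithfulness rate of about 0.25 on hint-driven answers.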
OpenAI said it is acquiring Promptfoo to strengthen agent security testing and evaluation in Frontier while keeping Promptfoo open source and supporting current customers. Enterprises deploying AI agents should expect more native red-teaming and policy testing in OpenAI’s stack.