Agent Security
Threat models, controls, and attack surfaces for agents.
Stories
OpenAI launched Daybreak, combining GPT-5.5, Codex workflows, repo scanning, threat modeling, and patch generation for cyber-defense teams. It packages frontier models into a continuous secure-software workflow, so teams can test whether it fits their response pipeline.
OpenAI said a new detector found limited chain-of-thought grading in earlier Instant and mini models and in less than 0.6% of GPT-5.4 Thinking samples. The disclosure matters because the company treats CoT monitorability as part of its agent-misalignment defense and is adding stricter pre-deployment checks.
Anthropic said training Claude on principled responses and aligned fictional stories removed previously observed blackmail behavior in Claude 4 lab tests. The post matters because Anthropic says the broader interventions generalized better than narrow eval-matching examples and survived RL fine-tuning.
Mozilla says Claude Mythos Preview helped it fix more Firefox security bugs in April than in the previous 15 months combined. Teams building large codebases should watch this as a strong production example of frontier models accelerating defensive vulnerability work.
Vercel released deepsec, a CLI-first coding-security harness that runs agent reviews locally or fans out across sandbox workers for large repos. Early comparisons against Warden suggest a cheaper but less exhaustive scan profile, so teams should weigh coverage against cost.
OpenAI said Auto-Review is now the default inside Codex after an internal rollout cut required approvals by roughly 200x. The shift moves more coding-agent work into guarded review loops with policy and egress controls.
Anthropic opened Claude Security to Claude Enterprise customers, letting teams scan repositories, validate findings, and review suggested patches inside Claude. The beta also adds scheduled scans, directory targeting, exports, and webhook alerts for recurring codebase reviews.
Infisical introduced Agent Vault, an open-source credential proxy that lets agents call APIs, CLIs, SDKs, and MCP servers without directly reading secrets. It matters because teams can keep policy and secret storage outside the agent runtime while still supporting on-prem and cloud deployments.
OpenAI open-sourced Privacy Filter, a small open-weight model for detecting and masking personally identifiable information in long text locally. Teams can redact logs, prompts, and secrets before sending data into other AI systems or external services.
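Whatever sits behind Privacy Filter, the integration point is the same: redact before data leaves the machine. A minimal sketch of that workflow, with regex placeholders standing in where the model would run (the patterns and names are illustrative, not OpenAI's API):

```python
import re

# Illustrative redact-before-send workflow. The real Privacy Filter is an
# open-weight model; these regexes are crude stand-ins for its PII detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Mask each detected PII span with its label before the text is
    forwarded to an external model or service."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("user alice@example.com called from +1 555-123-4567"))
# -> user [EMAIL] called from [PHONE]
```

The same shape applies to logs, prompts, and tool outputs: run the filter locally, then ship only the masked text downstream.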
OpenAI expanded Trusted Access for Cyber and added GPT-5.4-Cyber, a fine-tuned variant with fewer restrictions for verified defenders. The rollout shifts advanced defensive workflows into identity-gated tiers instead of a broadly available API.
Anthropic's Mythos system card says the model completed the AI Security Institute's 32-step corporate attack range in about 20 human hours. The benchmark matters as a cyber capability signal, but the range is easier than a real defended enterprise network.
UK regulators put Claude Mythos on formal briefing agendas while US officials also pushed banks to evaluate it. Watch the independent critiques of Anthropic's exploit method, low-level access behavior, and small-model comparisons before treating the release as production-ready.
Anthropic launched Project Glasswing, giving selected partners access to Claude Mythos Preview and publishing a system card with strong coding and cyber benchmark results. It stays off the public API for now, so teams should treat it as a restricted dual-use security release rather than a normal model launch.
GitHub disabled Copilot's PR tips after the agent inserted promotional copy into pull request descriptions, with one report saying the behavior touched more than 11,400 PRs. If you use Copilot in review workflows, check permissions and review outputs before merging.
Sentinel Gateway promoted tool-scoped execution controls, Agent v0 shipped OS sandboxing plus hash-chain logs, and NeoBild published a 336-round Termux CVE loop. Use these controls to constrain agent actions and run security analysis locally.
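The hash-chain logging Agent v0 ships can be sketched in a few lines: each entry commits to the previous entry's hash, so altering any record invalidates everything after it. This is a generic sketch of the technique, not Agent v0's actual log format:

```python
import hashlib
import json

# Tamper-evident logging: every entry stores the previous entry's hash,
# so a single edit breaks verification of the rest of the chain.
GENESIS = "0" * 64

def append(chain: list, event: str) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps({"event": event, "prev": prev}, sort_keys=True)
    chain.append({"event": event, "prev": prev,
                  "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify(chain: list) -> bool:
    prev = GENESIS
    for entry in chain:
        body = json.dumps({"event": entry["event"], "prev": prev}, sort_keys=True)
        if entry["prev"] != prev or entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

log = []
append(log, "tool_call: read_file /etc/hosts")
append(log, "tool_call: exec curl example.com")
assert verify(log)
log[0]["event"] = "tool_call: noop"  # tampering with any record
assert not verify(log)               # breaks the whole chain
```

Anchoring the latest hash somewhere the agent cannot write (a remote store, a signed timestamp) is what makes the chain an audit control rather than just a checksum.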
GitHub retracted mistaken Claude Code fork takedowns after Anthropic’s post-leak DMCA notice, and developers also reverse-engineered the client’s cch request signing. Watch for third-party client compatibility issues and a growing gap between requested and executed takedowns.
Stanford's `jai` package offers casual, strict, and bare Linux containment modes for AI agents, and users are pairing it with Claude Code and OpenClaw hardening tips. The workflow narrows write scope and cuts off persistent exploit paths such as hooks, `.venv` files, and startup artifacts.
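The write-scope narrowing described here can be approximated with a read-only root plus a single writable workspace. A sketch that assembles a bubblewrap command in that spirit (the `bwrap` flags are real, but mapping them onto `jai`'s modes is my assumption):

```python
import shlex

def bwrap_cmd(workdir: str, argv: list) -> str:
    """Build a bubblewrap command line: whole filesystem read-only,
    only the workspace writable, no network (strict-mode style)."""
    cmd = ["bwrap",
           "--ro-bind", "/", "/",        # everything read-only by default
           "--bind", workdir, workdir,   # only the workspace is writable
           "--unshare-net",              # cut off network access
           *argv]
    return " ".join(shlex.quote(c) for c in cmd)

print(bwrap_cmd("/home/dev/repo", ["claude", "--print", "fix the failing test"]))
```

With the root mounted read-only, an agent cannot persist hooks or startup artifacts outside the workspace, which is the property the containment modes are after.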
Compromised LiteLLM 1.82.7 and 1.82.8 wheels shipped a malicious .pth file that executed at interpreter startup to exfiltrate credentials, and PyPI quarantined the releases. Treat fresh-package installs and AI infra dependencies as supply-chain risk: on affected systems, check startup hooks and rotate secrets, and add package-age controls and transitive-dependency audits before letting agents install packages autonomously.
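The attack works because Python's `site` module executes any `.pth` line that begins with `import` whenever it scans a site directory. A benign demonstration of that hook, writing a harmless `.pth` into a temp dir and loading it the same way `site.py` processes site-packages:

```python
import os
import site
import tempfile

# site.py exec()s .pth lines starting with "import" at interpreter
# startup -- the exact hook the malicious wheels abused.
d = tempfile.mkdtemp()
with open(os.path.join(d, "demo.pth"), "w") as f:
    # a malicious wheel would place exfiltration code after the semicolon
    f.write("import os; os.environ['PTH_DEMO_RAN'] = '1'\n")

site.addsitedir(d)  # same code path as normal site-packages processing
print(os.environ["PTH_DEMO_RAN"])  # -> 1
```

Because the code runs on every interpreter launch, not just at install, the payload persists until the file is removed; that is why auditing startup hooks on affected systems matters.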
GitHub said Copilot Free, Pro, and Pro+ interaction data will train models by default from April 24 unless users opt out, while Business and Enterprise tiers and private repo content at rest stay excluded. Teams should review per-user enforcement, enterprise coverage, and repo privacy settings before the change lands.
Imbue released Latchkey, a library that prepends ordinary curl calls so local agents can use SaaS and internal APIs while credentials stay on the developer machine. Try it where agents need many HTTP integrations but should not see raw secrets.
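The pattern Latchkey implements can be sketched as a local rewrite step: the agent emits an ordinary `curl` command, and a wrapper on the developer machine injects the credential just before execution. Helper names and the token store below are hypothetical, not Latchkey's API:

```python
import shlex

# Tokens live only on the developer machine; the agent never sees them.
LOCAL_TOKENS = {"api.example.com": "tok_local_only"}

def latchkey_wrap(curl_cmd: str) -> list:
    """Rewrite an agent-authored curl command, inserting the auth header
    for the target host so the raw secret stays out of the agent context."""
    argv = shlex.split(curl_cmd)
    url = argv[-1]
    host = url.split("/")[2]          # e.g. "api.example.com"
    token = LOCAL_TOKENS[host]
    return [argv[0], "-H", f"Authorization: Bearer {token}", *argv[1:]]

cmd = latchkey_wrap("curl -s https://api.example.com/v1/items")
assert "Authorization: Bearer tok_local_only" in cmd
```

The agent's transcript contains only the unauthenticated command, so prompt leaks or logged sessions never expose a usable credential.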
Google DeepMind published a real-world manipulation benchmark and toolkit built from nine studies across more than 10,000 participants, with financial scenarios showing stronger manipulation effects than health scenarios. Safety teams can use it to probe persuasive failure modes; add it to red-team plans for user-facing agents.
Anthropic's Opus 4.6 system card shows indirect prompt injection attacks still succeeding 14.8% of the time when attackers get up to 100 attempts. Treat browsing agents and prompt secrecy as defense-in-depth problems, not solved product features.
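One way to read a rate like that: even a tiny per-attempt success probability compounds over an attack budget. A quick illustration assuming independent attempts (a simplification; the system card's evaluation protocol may differ):

```python
def any_success(p: float, n: int) -> float:
    """Probability of at least one success in n independent attempts,
    each succeeding with probability p: 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n

# A per-attempt rate of just 0.16% already compounds to ~14.8% over
# 100 tries, which is why attempt budgets belong in threat models.
print(round(any_success(0.0016, 100), 3))  # -> 0.148
```

The takeaway for defenders: rate limits and anomaly detection on repeated injection attempts matter as much as the per-attempt robustness number.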
LangSmith Fleet introduces shared agents with edit and run permissions, agent identity, human approvals, and tracing. That matters because enterprise agent rollout is shifting from single-user demos to governed, auditable deployment surfaces.
Keycard released an execution-time identity layer for coding agents, issuing short-lived credentials tied to user, agent, runtime, and task. It targets the gap between noisy permission prompts and unsafe skip-permissions workflows.
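Execution-time identity in this style reduces to minting a short-lived token bound to the full context of one run. A sketch of that binding with HMAC-signed claims (all names and the claim set are invented for illustration, not Keycard's protocol):

```python
import base64
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"server-side-only"  # never enters the agent runtime

def mint(user: str, agent: str, runtime: str, task: str, ttl: int = 300) -> str:
    """Issue a short-lived credential bound to user, agent, runtime, and task."""
    claims = {"user": user, "agent": agent, "runtime": runtime,
              "task": task, "exp": time.time() + ttl}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def valid(token: str, **expected) -> bool:
    """Check signature, expiry, and that the claims match this execution context."""
    body, sig = token.rsplit(".", 1)
    good = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, good):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims["exp"] > time.time() and all(claims[k] == v for k, v in expected.items())

t = mint("alice", "codex", "ci-runner-7", "fix-issue-123")
assert valid(t, user="alice", task="fix-issue-123")
assert not valid(t, user="mallory")
```

Because the token expires in minutes and names one task, a leaked credential is useless outside that single execution context, which is the gap between noisy prompts and skip-permissions that the product targets.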
OpenAI described an internal system that uses its strongest models to review almost all coding-agent traffic for misalignment and suspicious behavior. It is a sign that powerful internal agents may need continuous oversight, not just pre-deployment policy checks.
LangChain rebranded Agent Builder to Fleet and added agent identity, memory, sharing controls, and LangSmith tracing for multi-user agent operations. It gives teams a governed way to deploy Slack- and GitHub-connected agents without stitching auth and auditing together by hand.
Rivet released Secure Exec, a V8-isolate runtime for Node.js, Bun, and browsers with deny-by-default permissions and low memory overhead. Agent builders can test it against heavier sandboxes for tool execution, but should verify the isolation model before replacing container or VM controls.
Intercom detailed an internal Claude Code platform with plugin hooks, production-safe MCP tools, telemetry, and automated feedback loops that turn sessions into new skills and GitHub issues. The patterns are useful if you are standardizing coding agents across engineering, support, and product teams.
Security coverage around OpenClaw intensified with a report on indirect prompt injection and data exfiltration risks, while KiloClaw published an independent assessment of its hosted isolation layers. Review your default configs and sandbox boundaries before exposing agents to untrusted web or tenant data.
NVIDIA introduced NemoClaw, a reference stack that installs OpenShell and adds sandbox, privacy, and policy controls around OpenClaw. Use it if you want always-on agents on RTX PCs, DGX Spark, or cloud without building the security layer yourself.
OpenAI said it is acquiring Promptfoo to strengthen agent security testing and evaluation in Frontier while keeping Promptfoo open source and supporting current customers. Enterprises deploying AI agents should expect more native red-teaming and policy testing in OpenAI’s stack.
Anthropic filed two cases challenging a Pentagon-led blacklist and agency stop-use order, arguing the action retaliated against its stance on mass surveillance and autonomous weapons. Teams selling AI into government should watch the procurement and policy precedent before making long-cycle bets.