Anthropic reports Opus 4.6 prompt injection still succeeds 14.8% at 100 tries
Anthropic's Opus 4.6 system card shows indirect prompt injection attacks can still succeed 14.8% of the time over 100 attempts. Treat browsing agents and prompt secrecy as defense-in-depth problems, not solved product features.

TL;DR
- Anthropic's Opus 4.6 thread says indirect prompt injection is improved but not solved: per the system card, attacks still succeeded 14.8% of the time when attackers got 100 attempts across 19 scenarios, with the chart labeled "lower is better." (14.8% detail)
- The same system card puts Opus 4.6 near the top of Anthropic's benchmark, but the remaining success rate means browsing and tool-using agents still face meaningful compromise risk even after model-side hardening. (system card summary)
- Practitioner discussion around a leaked prompt example reinforces the point that "no prompt is safe" if the defense is only prompt text, especially for apps whose system prompts contain app logic, roles, or access instructions.
- Simon Willison's profiling example shows a related deployment risk: modern models can extract surprisingly rich inferences from public data, which raises the stakes when compromised agents can browse, summarize, or profile at scale. (model output screenshot)
What Anthropic measured
Anthropic's Opus 4.6 system card includes an "Indirect Prompt Injection Robustness" chart covering 12 models and three attack budgets: one try, 10 tries, and 100 tries. In Simon Willison's thread on the chart, Opus 4.6 posts 0.2% success at one try, 2.1% at 10 tries, and 14.8% at 100 tries, which is better than many peers but still far from zero.
The practical detail is the retry budget. Willison's follow-up note makes the setup explicit: at k=100, "attacker gets 100 attempts," and even Anthropic's best reported score still lets a non-trivial share through. That matters more for agent products than for single-turn chat, because long-running workflows naturally create repeated opportunities to hit vulnerable tool calls, retrieval steps, or browser contexts.
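To see why the retry budget dominates, here is a minimal sketch, not Anthropic's methodology, that assumes each attempt succeeds independently with the reported single-try probability. The function name and the independence assumption are illustrative only.

```python
# Sketch: why attacker retry budgets matter, assuming (simplistically)
# that each injection attempt succeeds independently with probability p.
def success_over_k_attempts(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

# Using the single-attempt figure reported for Opus 4.6 (0.2%):
p_single = 0.002
for k in (1, 10, 100):
    print(f"k={k:>3}: {success_over_k_attempts(p_single, k):.1%}")
# Prints roughly 0.2%, 2.0%, 18.1% -- the same order of magnitude as the
# reported 0.2% / 2.1% / 14.8%. Per-attempt resistance compounds against a
# persistent attacker; it does not eliminate one.
```

The takeaway from this toy model matches the chart: pushing the single-attempt rate down helps, but any agent that exposes many injection opportunities per session should be budgeted against the k=100 column, not the k=1 column.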
Why engineers should treat this as defense in depth
The strongest takeaway is that prompt secrecy is not a security boundary. The shared Reddit screenshot describes an internal tool whose system prompt contained "instructions on data access, user roles, response formatting," and users could still get the model to "dump the entire system prompt" after a few follow-ups. That is consistent with Anthropic's benchmark framing: training helps, but prompt injection resistance is probabilistic, not absolute.
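One way to act on that, sketched below under purely hypothetical names (none of this reflects Anthropic's or any particular product's design), is to move the security decision out of the prompt entirely: the host process, not the model, decides which tools can run and when a human has to confirm.

```python
# Illustrative defense-in-depth sketch: keep authority outside the model so a
# leaked or injected prompt cannot grant extra access. All tool names,
# policies, and helper functions here are hypothetical stubs.
from typing import Any, Callable

ALLOWLISTED_TOOLS: dict[str, Callable[..., Any]] = {
    "search": lambda query: f"searching {query!r}",      # stub
    "fetch_url": lambda url: f"fetching {url!r}",        # stub
}
SENSITIVE_TOOLS: dict[str, Callable[..., Any]] = {
    "send_email": lambda to, body: f"emailing {to!r}",   # stub
}

def require_human_approval(tool: str, args: dict[str, Any]) -> bool:
    """Stub for an out-of-band confirmation step (UI dialog, Slack ping, etc.)."""
    return False  # default-deny in this sketch

def dispatch(tool: str, args: dict[str, Any], saw_untrusted_content: bool) -> Any:
    # Layer 1: hard allowlist -- a tool the host never registered cannot run,
    # no matter what instructions were injected into the context window.
    if tool in ALLOWLISTED_TOOLS:
        return ALLOWLISTED_TOOLS[tool](**args)
    # Layer 2: sensitive tools invoked after ingesting untrusted content
    # (a web page, a retrieved document) need human sign-off.
    if tool in SENSITIVE_TOOLS:
        if saw_untrusted_content and not require_human_approval(tool, args):
            return "blocked: needs approval after reading untrusted content"
        return SENSITIVE_TOOLS[tool](**args)
    raise PermissionError(f"unknown tool: {tool!r}")

# Example: an injected "send_email" call right after browsing is blocked,
# regardless of what the system prompt said or whether it leaked.
print(dispatch("send_email", {"to": "a@example.com", "body": "hi"}, saw_untrusted_content=True))
```

The design point is that a leaked system prompt then discloses wording, not authority: roles, data access, and confirmation rules live in code the attacker cannot rewrite by talking to the model.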
Willison's profiling write-up widens the threat model beyond prompt leakage. His example prompt, "Profile this user," run against 1,000 public Hacker News comments, produced a detailed profile that, per his posted screenshots, captured "personality and debate style," recurring technical views, and personal interests. If an injected agent can be induced to gather public text, summarize it, and act on it, the risk is not just hidden-prompt exposure but downstream misuse of tools and data aggregation.
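For concreteness, the aggregation step he describes is roughly this small. The sketch below uses the public hn.algolia.com search API (treat the exact endpoint and field names as an assumption) and is not Willison's actual script; the point is how little code the gathering step takes.

```python
# Sketch of the aggregation step: pull a user's public Hacker News comments
# and assemble a single "Profile this user" prompt. The prompt text mirrors
# the example quoted above; everything else is an illustrative assumption.
import json
import urllib.parse
import urllib.request

def fetch_public_comments(username: str, limit: int = 1000) -> list[str]:
    params = urllib.parse.urlencode({
        "tags": f"comment,author_{username}",
        "hitsPerPage": limit,
    })
    url = f"https://hn.algolia.com/api/v1/search_by_date?{params}"
    with urllib.request.urlopen(url) as resp:
        hits = json.load(resp)["hits"]
    return [h.get("comment_text", "") for h in hits]

def build_profile_prompt(username: str) -> str:
    comments = fetch_public_comments(username)
    return "Profile this user\n\n" + "\n---\n".join(comments)
```

That triviality is the deployment risk: once an agent has browsing tools and a retry budget, the gap between "summarize this page" and "profile this person and act on it" is a few injected sentences, not new infrastructure.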