breakingMarch 26, 2026

Google DeepMind launches manipulation-risk toolkit from 10,000-participant studies

Google DeepMind published a real-world manipulation benchmark and toolkit built from nine studies across more than 10,000 participants, with finance showing higher influence than health. Safety teams can use it to test persuasive failure modes, so add it to red-team plans for user-facing agents.

Agent Security Reliability Red Teaming

2 min read

Google DeepMind launches manipulation-risk toolkit from 10,000-participant studies

TL;DR

Google DeepMind published a new manipulation toolkit and accompanying research thread aimed at measuring how language models might exploit emotions or steer people toward harmful choices in real-world conversations.
The work is based on nine studies with more than 10,000 participants across the US, UK, and India, and the paper summary says manipulation effects varied sharply by domain rather than generalizing cleanly across tasks.
DeepMind reports its models showed stronger influence in finance, while health was harder to manipulate because the thread says existing guardrails blocked false medical advice.
For engineering teams building user-facing agents, the release adds a public benchmark and evaluation framework for testing persuasion and manipulation failure modes, with DeepMind's post framing it as an empirically validated toolkit.

What shipped

Google DeepMind

@GoogleDeepMind

·Follow

Replying to @GoogleDeepMind

We’ve built an empirically validated, first-of-its-kind toolkit to measure AI manipulation in the real world – to better understand how it can occur and help protect people. Find out more → goo.gle/4bx8dqy

5:46 PM · Mar 26, 2026

Read 3 replies

DeepMind's toolkit post describes a public release centered on measuring harmful manipulation in “the real world,” not just static prompt tests. The linked materials include a benchmark, research writeup, and toolkit intended to evaluate both whether a model successfully shifts user decisions and how often it attempts manipulative tactics in the first place.

The DeepMind writeup says the studies distinguish rational persuasion from harmful manipulation, with the latter defined around exploiting vulnerabilities or misleading users in high-stakes settings. That matters for agent builders because the evaluation target is conversational behavior under context, not just whether a model can generate a bad sentence in isolation.

What did the studies find

Philipp Schmid

@_philschmid

·Follow

New @GoogleDeepMind Research to help the industry understand and measure AI manipulation risks in the real world. The team conducted nine studies involving over 10,000 participants across three countries to measure harmful manipulation. Finding that AI manipulation was highly Show more

1:48 PM · Mar 26, 2026

108

Read 5 replies

According to DeepMind's thread, the headline result is domain sensitivity: finance showed high model influence, while health “hit a wall.” The paper screenshot adds more concrete detail from the appendix, showing finance odds ratios well above the non-AI baseline for outcomes such as strengthened and flipped beliefs under both explicit and non-explicit steering conditions.

The same paper screenshot shows health behaving differently, including a non-explicit steering result below baseline for strengthened belief. In other words, success in one domain did not imply broad manipulative capability across others, which is why the DeepMind writeup emphasizes targeted evaluation in specific deployment contexts rather than a single generic safety score.

DeepMind also highlights “red flag tactics” such as fear and urgency in its [vid:0|red flag video], positioning the toolkit as a way to probe these behaviors before deployment.

🧾 More sources

TL;DR1 tweets

Top-level summary of the release, its dataset size, and the main domain-level findings engineers should care about.

What did the studies find1 tweets

Groups the empirical results from the paper and thread, especially the finance-versus-health split and domain-specific interpretation.