April 3, 2026

Anthropic introduces model diffing for open-weight model audits

Anthropic published a research method that compares model internals against a trusted reference to surface behaviors unique to a new open-weight model. The approach can narrow safety and eval audits to deltas, but Anthropic says it can still over-flag analogous features.

TL;DR

Anthropic's launch thread introduced model diffing as a way to compare a new open-weight model against a trusted reference, while Anthropic's official writeup frames it as a tool for finding behavioral deltas instead of re-auditing an entire model from scratch.
According to Anthropic's thread, the goal is to isolate features unique to the new model, and the paper extends that idea to models with different architectures.
Anthropic's example tweet highlights two early results, a "CCP alignment" feature in Qwen and an "American exceptionalism" feature in Llama, while the official post adds a GPT-OSS-20B copyright refusal mechanism to the same set of findings.
Anthropic's caveat tweet says the method can over-flag analogous features as distinct, and the official post describes it as a high-recall screening tool rather than a silver bullet.

Anthropic's full research note, the arXiv paper, and Wes Roth's screenshot of the paper's side-by-side feature chart fill in the details. The useful twist is methodological: Anthropic is not pitching another benchmark, but a way to surface model-specific internal features before anyone knows the right eval prompts to write.

Diffing the delta

Anthropic's pitch is simple: new-model audits miss "unknown unknowns" because benchmarks only test for risks people already imagined. In the official post, the company borrows the software diff metaphor directly, arguing that auditors should focus on what changed between models, not reread the whole encyclopedia.

That framing matters because the project is aimed at open-weight releases, where researchers can inspect internals instead of only outputs. The paper describes model diffing as comparing internal representations to identify differences that may map to safety-relevant behavior.

Dedicated Feature Crosscoders

The technical contribution is a cross-architecture diffing method built on what the paper calls Dedicated Feature Crosscoders, or DFCs. Anthropic's research note explains the design as a bilingual dictionary with three buckets:

a shared dictionary for concepts both models represent
a model-A-only section for features unique to the first model
a model-B-only section for features unique to the second model
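
The three-bucket layout can be illustrated with a toy dictionary. This is only a sketch of the idea, not the paper's implementation: it assumes that a "dedicated" feature is one whose decoder writes into just one model's activation space, and every name here (Feature, make_feature, reconstruct, the example features) is hypothetical. The two models are given different dimensions to mirror the cross-architecture setting.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
BUCKETS = ("shared", "a_only", "b_only")


@dataclass
class Feature:
    name: str
    bucket: str
    decode_a: Vector  # direction written into model A's activation space
    decode_b: Vector  # direction written into model B's activation space


def make_feature(name: str, bucket: str, dir_a: Vector, dir_b: Vector) -> Feature:
    """Build a feature, zeroing the decoder for the model it is not dedicated to."""
    if bucket not in BUCKETS:
        raise ValueError(f"unknown bucket: {bucket}")
    if bucket == "a_only":
        dir_b = [0.0] * len(dir_b)
    elif bucket == "b_only":
        dir_a = [0.0] * len(dir_a)
    return Feature(name, bucket, list(dir_a), list(dir_b))


def reconstruct(features: List[Feature], activations: Vector, model: str) -> Vector:
    """Sum activation * decoder direction over all features for one model."""
    directions = [f.decode_a if model == "a" else f.decode_b for f in features]
    out = [0.0] * len(directions[0])
    for direction, act in zip(directions, activations):
        for i, weight in enumerate(direction):
            out[i] += act * weight
    return out


# Toy dictionary: model A has 2 dimensions, model B has 3 (different architectures).
features = [
    make_feature("cats", "shared", [1.0, 0.0], [0.0, 1.0, 0.0]),
    make_feature("a_exclusive", "a_only", [0.0, 1.0], [0.5, 0.5, 0.5]),
    make_feature("b_exclusive", "b_only", [0.3, 0.3], [0.0, 0.0, 1.0]),
]
activations = [2.0, 1.0, 3.0]

recon_a = reconstruct(features, activations, "a")
recon_b = reconstruct(features, activations, "b")

# The audit candidates are whatever lands in an exclusive bucket.
exclusive_to_b = [f.name for f in features if f.bucket == "b_only"]
```

In this toy setup, the exclusive features contribute nothing to the other model's reconstruction, which is what lets an auditor read the A-only and B-only buckets as a list of candidates for model-specific behavior.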

Once a candidate feature is found, the team tests it with steering. The official post describes suppressing or amplifying a feature during generation, then checking whether the suspected behavior gets weaker or stronger.
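
The steering check can be sketched on plain vectors. This is a simplified illustration under stated assumptions, not Anthropic's code: it assumes steering means adding a scaled copy of the feature's direction to a hidden state, and the names steer, readout, hidden, and feature_direction are hypothetical. A real test would apply the shift inside the model during generation and judge the generated text, not a projection.

```python
import math
from typing import List

Vector = List[float]


def unit(direction: Vector) -> Vector:
    """Normalize a feature direction to length 1."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("feature direction must be non-zero")
    return [d / norm for d in direction]


def readout(hidden: Vector, direction: Vector) -> float:
    """Projection of the hidden state onto the unit feature direction."""
    return sum(h * u for h, u in zip(hidden, unit(direction)))


def steer(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Shift the hidden state along the feature direction.

    Positive strength amplifies the feature, negative strength suppresses it.
    """
    return [h + strength * u for h, u in zip(hidden, unit(direction))]


# Toy values, chosen only for illustration.
hidden = [1.0, 2.0]
feature_direction = [3.0, 4.0]

baseline = readout(hidden, feature_direction)
amplified = readout(steer(hidden, feature_direction, 2.0), feature_direction)
suppressed = readout(steer(hidden, feature_direction, -2.0), feature_direction)
```

Here the readout moves by exactly the steering strength in each direction, a stand-in for the stronger or weaker behavior the official post describes checking for.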

Model-exclusive behaviors

Anthropic is already attaching names to a few diffed features. Across the thread, the official post, and the paper, the examples line up like this:

Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B: a "Chinese Communist Party alignment" feature tied to censorship and pro-government framing
Llama-3.1-8B-Instruct: an "American exceptionalism" feature tied to claims of US superiority
GPT-OSS-20B: a copyright refusal mechanism tied to refusing copyrighted material

Wes Roth's screenshot is useful here because it shows the paper's structure, not just the headline claim: model-exclusive features sit beside a shared-feature column with ordinary concepts like cats, trees, variable names, and statistical terms.

Oversensitive by design

Anthropic is pretty explicit that this is a screening pass, not an explanation engine. The official post says a single diff can surface thousands of unique features, and only a small fraction may correspond to meaningful behavioral risks.

That matches Anthropic's oversensitivity caveat, which says the method can mistake analogous features for distinct ones.

The last detail is organizational: Anthropic's final thread post says the work came out of the Anthropic Fellows program, led by Thomas Jiralerspong and supervised by Trenton Bricken, with the paper first submitted to arXiv on February 12.
