Interpretability
Understanding model internals and controllable reasoning.
Stories
Anthropic said training Claude on principled responses and aligned fictional stories removed previously observed blackmail behavior in Claude 4 lab tests. The post matters because, per Anthropic, the broader interventions generalized better than narrow eval-matching examples and survived RL fine-tuning.
Anthropic introduced Natural Language Autoencoders, a two-model method in which one model translates Claude activations into natural-language explanations and a second reconstructs the activations from that text. The system exposed hidden rhyme planning and evaluation awareness in Claude, but Anthropic says the explanations are useful rather than guaranteed faithful.
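To make the round-trip concrete, here is a minimal toy sketch of the encode-to-text, decode-back, score-the-reconstruction loop the post describes. The `encoder_llm` and `decoder_llm` functions are hypothetical stand-ins (stubbed so the script runs end to end) and are not Anthropic's implementation.

```python
# Toy sketch of a natural-language autoencoder loop (hypothetical API).
import numpy as np

RNG = np.random.default_rng(0)

def encoder_llm(activation: np.ndarray) -> str:
    """Hypothetical: an LLM that verbalizes an activation vector.
    Stubbed as a crude description of the top dimensions."""
    top = np.argsort(-np.abs(activation))[:3]
    return "salient dims: " + ", ".join(f"{i} ({activation[i]:+.2f})" for i in top)

def decoder_llm(explanation: str, dim: int) -> np.ndarray:
    """Hypothetical: an LLM that reconstructs an activation from text.
    Stubbed by parsing the description back into a sparse vector."""
    recon = np.zeros(dim)
    for part in explanation.removeprefix("salient dims: ").split(", "):
        idx, val = part.split(" (")
        recon[int(idx)] = float(val.rstrip(")"))
    return recon

activation = RNG.standard_normal(64)           # stand-in for a Claude activation
explanation = encoder_llm(activation)          # translate activation -> text
reconstruction = decoder_llm(explanation, 64)  # translate text -> activation

# Faithfulness proxy: how much of the original does the text round-trip preserve?
cosine = reconstruction @ activation / (
    np.linalg.norm(reconstruction) * np.linalg.norm(activation)
)
print(explanation)
print(f"reconstruction cosine: {cosine:.3f}")
```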
Alibaba’s Qwen team released Qwen-Scope, an open sparse-autoencoder suite for Qwen3.5-27B that can steer outputs, surface repetition features, and compare feature overlap across benchmarks. The toolkit turns interpretability artifacts into debugging, data-generation, and evaluation workflows.
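Feature steering of this kind typically means nudging the residual stream along an SAE decoder direction. The sketch below shows that operation under assumed shapes and a made-up feature id; it does not use Qwen-Scope's actual API.

```python
# Hedged sketch of SAE-based steering: shapes, feature ids, and the lack of
# any model hook are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 512, 4096

W_dec = rng.standard_normal((n_features, d_model)) * 0.02  # SAE decoder directions

def steer(residual: np.ndarray, feature_id: int, alpha: float) -> np.ndarray:
    """Nudge the residual stream along one SAE feature's decoder direction."""
    direction = W_dec[feature_id]
    return residual + alpha * direction / np.linalg.norm(direction)

residual = rng.standard_normal(d_model)                # stand-in layer activation
steered = steer(residual, feature_id=1234, alpha=4.0)  # e.g. a "repetition" feature
print(np.linalg.norm(steered - residual))              # magnitude of the intervention
```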
Anthropic published a research method that compares model internals against a trusted reference to surface behaviors unique to a new open-weight model. The approach can narrow safety and eval audits to deltas, but Anthropic says it can still over-flag analogous features.
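One plausible way to realize such a diff, assuming both models expose comparable feature dictionaries (e.g. SAE decoder matrices), is to match each new-model feature to its nearest reference feature and flag poor matches for audit. This is an illustrative sketch, not Anthropic's published method, and it exhibits the over-flagging caveat the blurb mentions.

```python
# Sketch of internals diffing against a trusted reference (assumed setup:
# each model provides a matrix of feature directions).
import numpy as np

rng = np.random.default_rng(2)
ref_feats = rng.standard_normal((1000, 256))  # trusted reference model's features
new_feats = rng.standard_normal((1000, 256))  # new open-weight model's features

def normalize(m: np.ndarray) -> np.ndarray:
    return m / np.linalg.norm(m, axis=1, keepdims=True)

ref_n, new_n = normalize(ref_feats), normalize(new_feats)
best_match = (new_n @ ref_n.T).max(axis=1)  # best reference match per new feature

# Features with no close reference analog are candidates for focused audits.
# Caveat from the post: an analogous feature that is rotated or split scores
# low too, so this over-flags.
novel = np.where(best_match < 0.3)[0]
print(f"{len(novel)} features flagged for review out of {len(new_feats)}")
```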
The toolkit sweeps contiguous layer ranges in GGUF and llama.cpp-style setups to test whether duplicating those layers can unlock better reasoning without retraining. Treat any reported gains as a reproducible experiment rather than a settled mechanism: thread responses challenge whether the effect reflects circuits, routing, or training artifacts.
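The sweep logic itself is simple to illustrate. The sketch below enumerates contiguous ranges and builds a duplicated layer order for each, with a stubbed evaluation; the real toolkit operates on GGUF weights in llama.cpp-style setups, which this does not implement.

```python
# Sketch of the sweep idea: for each contiguous range, build a layer map with
# that range repeated once, then score the variant. The eval is a stub; in
# practice you would run a reasoning benchmark on the remapped model.
n_layers = 32

def duplicated_layer_map(start: int, end: int) -> list[int]:
    """Layer order with [start, end) repeated once, no retraining involved."""
    layers = list(range(n_layers))
    return layers[:end] + layers[start:end] + layers[end:]

def evaluate(layer_map: list[int]) -> float:
    """Stub: replace with a benchmark run against the duplicated-layer model."""
    return 0.0

candidates = [
    (start, end, duplicated_layer_map(start, end))
    for start in range(0, n_layers, 4)
    for end in range(start + 4, min(start + 13, n_layers + 1), 4)
]
best = max(candidates, key=lambda c: evaluate(c[2]))
print(f"best range under stub eval: layers [{best[0]}, {best[1]})")
```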