Skip to content
AI Primer
TOPIC7 stories

Interpretability

Understanding model internals and controllable reasoning.

RELEASE20th June
GLOSSOPETRAE releases Lingua Ex Machina with 250 covert channels and 0% monitor recovery

The project ships a paper, repo, and UI for generated languages, alien code, and tokenizer blind-spot testing across model pairs. Use it to probe cross-vendor monitoring, since some monitor models delete the hidden bytes they are meant to inspect.

RELEASE2w ago
Goodfire introduces predictive data debugging with R² 0.9 DPO forecasts

Goodfire said its predictive debugging can forecast DPO-driven behavior shifts with R² 0.9 before training and trace them to individual preference pairs. Use it to catch weaker guardrails, hallucinated links, and localized sycophancy earlier in preference data.

NEWS1mo ago
Anthropic reports 'Teaching Claude why' cuts agentic misalignment by 3x

Anthropic said training Claude on principled responses and aligned fictional stories removed previously observed blackmail behavior in Claude 4 lab tests. The post matters because Anthropic says the broader interventions generalized better than narrow eval-matching examples and survived RL fine-tuning.

NEWS1mo ago
Anthropic introduces Natural Language Autoencoders for Claude activations

Anthropic introduced Natural Language Autoencoders, a two-model method that translates Claude activations into text explanations and reconstructs them back. The system exposed hidden rhyme planning and evaluation awareness in Claude, but Anthropic says the explanations are useful rather than guaranteed faithful.

RELEASE1mo ago
Qwen-Scope releases SAE toolkit for Qwen3.5-27B steering

Alibaba’s Qwen team released Qwen-Scope, an open sparse-autoencoder suite for Qwen3.5-27B that can steer outputs, surface repetition features, and compare benchmark feature overlap. The toolkit turns interpretability artifacts into debugging, data-generation, and evaluation workflows.

NEWS2mo ago
Anthropic introduces model diffing for open-weight model audits

Anthropic published a research method that compares model internals against a trusted reference to surface behaviors unique to a new open-weight model. The approach can narrow safety and eval audits to deltas, but Anthropic says it can still over-flag analogous features.

RELEASE3mo ago
llm-circuit-finder compares duplicated layers and reports BBH logical deduction gains

The toolkit sweeps contiguous layer ranges in GGUF and llama.cpp-style setups to test whether duplicating them can unlock better reasoning without retraining. Treat the jump as a reproducible experiment, not a settled mechanism, because thread responses challenge whether the effect reflects circuits, routing, or training artifacts.

AI PrimerAI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.