Using models to score or review model outputs.
HN follow-up on Stanford's sycophancy study focused on mitigations like confidence scores, compare-and-contrast prompting, and separate evaluator agents. Commenters argued the same failure mode can distort coding and architecture decisions, not just personal advice, so teams should watch for overconfident agent output.
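One of the mitigations mentioned, compare-and-contrast prompting for a separate evaluator agent, can be sketched as a prompt-construction helper. The function name and prompt wording below are illustrative assumptions, not taken from the study:

```python
def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Build a compare-and-contrast prompt for a separate evaluator agent.

    Hypothetical sketch: forcing the judge to argue both sides before
    choosing, and to emit an explicit confidence score, are two of the
    mitigations discussed for sycophantic or overconfident output.
    """
    return (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "First list the strongest points of Answer A, then of Answer B. "
        "Next list the weaknesses of each. Only after that, state which "
        "answer is better and give a confidence score from 0 to 1."
    )
```

The ordering matters: asking for strengths and weaknesses of both candidates before a verdict makes it harder for the judge to simply ratify whichever answer sounds more agreeable.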
LLM Debate Benchmark ran 1,162 side-swapped debates (each pairing argued from both sides, so position does not confound the outcome) across 21 models and ranked Sonnet 4.6 first, ahead of GPT-5.4 high. It adds a stronger adversarial eval pattern for judge or debate systems, but you should still inspect content-block rates and judge selection before trusting the leaderboard.
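The side-swapping idea can be sketched as a small aggregation routine. This is a minimal illustration of the pattern, not the benchmark's actual scoring code; the tuple layout and function names are assumptions:

```python
from collections import defaultdict

def side_swapped_scores(results):
    """Compute per-model win rates from side-swapped debate results.

    `results` is a list of (pro_model, con_model, winner) tuples in which
    each unordered pair appears twice, once with each model arguing the
    "pro" side, so positional advantage averages out.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for pro, con, winner in results:
        games[pro] += 1
        games[con] += 1
        wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}

def pro_side_win_rate(results):
    """Diagnostic: fraction of debates won by whichever side argued 'pro'.

    A value far from 0.5 suggests the judge is position-biased, which is
    exactly the kind of thing to check before trusting a leaderboard.
    """
    return sum(1 for pro, _, winner in results if winner == pro) / len(results)
```

Running the bias diagnostic alongside the ranking is the point of the side-swap design: if `pro_side_win_rate` deviates strongly from 0.5, the rankings reflect judge bias as much as model skill.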