Trust, Evaluation & Reliability — Explore AI Tools & Stories

Fresh stories

Codex app reportedly leaks GPT-5.6 Sol, Terra, and Luna model names

Codex app code now references GPT-5.6 Sol, Terra, and Luna, while posts claim Sol Ultra reaches 91.9% on TerminalBench at lower cost. Treat release timing, limits, and benchmark claims as unofficial until OpenAI publishes details.

🧠Codex·3rd July

Release

Claude Code releases 2.1.200/2.1.201 with Manual approval fixes

Claude Code 2.1.200 changed Manual permission defaults and fixed background-agent crash and recovery paths; 2.1.201 removed mid-conversation Sonnet 5 harness reminders. Update to reduce accidental advances and repeated reminders in stalled sessions.

Claude Code·3rd July·5 min read

Gemini Omni Flash ranks #1 on Video Arena with 1404 Elo

Gemini Omni Flash ranked #1 on Video Arena at 1404 Elo, 101 points above Seedance 2.0 Mini, and ComfyUI posted a text-prompt video-edit workflow. Google noted the leaderboard is third-party, leaving benchmark provenance as the main caveat.

🧠Gemini·3rd July
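
For a sense of scale, the 101-point gap can be converted into an expected head-to-head preference rate. This is a minimal sketch that assumes Video Arena ratings follow the conventional 400-point logistic Elo scale, which the story does not confirm; the function name is illustrative.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes won by the higher-rated model.

    Assumes the standard logistic Elo curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


leader_rating = 1404   # Gemini Omni Flash, as reported
gap = 101              # lead over Seedance 2.0 Mini, as reported

print(f"implied runner-up rating: {leader_rating - gap}")
print(f"expected preference rate for the leader: {expected_win_rate(gap):.1%}")
```

Under that assumption, a 101-point lead corresponds to the leader being preferred in roughly 64% of pairwise votes: a clear margin, but far from a sweep.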

Claude Sonnet 5 ranks #3 on Vals and hits 183 turns on AA-Briefcase

Vals and Artificial Analysis published independent Sonnet 5 results a day after launch, placing it just behind Opus 4.8 and Fable 5 while using far more turns than Sonnet 4.6. Lower token pricing did not make agentic tasks cheaper, and some finance benchmarks still triggered refusals.

💳Evals·1st July

Fable 5 users report Opus 4.8 fallbacks and $600 Max quota rotations

Fable 5 users reported Opus 4.8 fallbacks, $600 Max-account rotations, slow browser automation, and token-saving subagents. Watch routing opacity, quota burn, and latency before relying on it for long-running agent work.

⚙️Fable·3rd July·7 min read

GLM-5.2 benchmarks at 97.6% tool-calling and 2,626 tok/s on MI355X

Kilo, Composio, Together, and Wafer posted GLM-5.2 measurements including 40/41 tool tasks, 7/10 code review, and 2,626 tok/s on MI355X. Try it for lower-cost coding and tool use, but validate cross-file reasoning and latency on your workload.

🧠GLM·3rd July
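
The headline 97.6% is simply 40 passes out of 41 tool tasks. The sketch below reproduces that figure and adds a Wilson score interval to show how much uncertainty a 41-task sample carries; the interval is an addition for illustration, not something the cited posts report, and the function names are hypothetical.

```python
import math


def pass_rate(passed: int, total: int) -> float:
    """Fraction of tasks passed."""
    return passed / total


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


print(f"tool-calling pass rate: {pass_rate(40, 41):.1%}")
print(f"code-review pass rate:  {pass_rate(7, 10):.1%}")
low, high = wilson_interval(40, 41)
print(f"approx. 95% interval for 40/41: {low:.1%} to {high:.1%}")
```

With only 41 tasks the lower bound of that interval falls below 90%, which is one reason to rerun the comparison on your own workload before switching.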

New

Vercel adds FUSE Sandbox mounts and Agent Runs MCP/CLI access

Vercel shipped FUSE-based Sandbox mounts for S3 and network filesystems and opened Agent Runs through MCP and CLI. Use it to connect remote state, sandbox execution, and agent-readable Eve traces for self-improving workflows.

Release·⚙️Agent runtime infrastructure·3rd July

Breaking

Devin launches Security Swarm with Agentic MapReduce and 36/50 GHSA hits

Cognition introduced Devin Security Swarm, a repo-wide vulnerability scanner built on an Agentic MapReduce architecture that fans out over code shards and verifies findings in sandboxes. In a 50-vulnerability GHSA eval across 14 languages, it found 36 issues at 30% lower cost per finding than the next most accurate alternative.

New

Agent product updates·1st July·4 min read
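
The fan-out-then-verify shape described above can be sketched as a plain map/reduce pipeline. This is an illustrative toy under stated assumptions, not Cognition's implementation: `scan_shard` and `verify` are hypothetical stand-ins for an agent scan and a sandbox check, and the "vulnerability" is just a literal `eval(` match.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(items, size):
    """Split a list of (path, text) pairs into fixed-size shards."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def scan_shard(shard):
    """Map step. Hypothetical agent stand-in: flag every line that calls eval()."""
    findings = []
    for path, text in shard:
        for line in text.splitlines():
            if "eval(" in line:
                findings.append((path, line))
    return findings


def verify(finding):
    """Hypothetical sandbox stand-in: reject hits that are commented out."""
    _path, line = finding
    return not line.lstrip().startswith("#")


def swarm_scan(repo, shard_size=2):
    """Fan out over shards in parallel, then reduce to verified file paths."""
    shards = make_shards(sorted(repo.items()), shard_size)
    with ThreadPoolExecutor() as pool:
        candidates = [f for found in pool.map(scan_shard, shards) for f in found]
    # Reduce step: keep only findings that survive verification, deduplicated.
    return sorted({path for path, line in candidates if verify((path, line))})


repo = {
    "a.py": "result = eval(user_input)",
    "b.py": "print('hello')",
    "c.py": "# eval(old_code)",
    "d.py": "y = eval(w)",
}
print(swarm_scan(repo))
```

Separating a cheap, parallel scan from a stricter verification pass is what lets the scan over-report without inflating the final list of findings.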

Fable 5 users report Opus 4.8 fallbacks, refusals, and $321 sessions

Users posted mixed reports after Anthropic brought Fable 5 back: some sessions stayed on Fable, while others routed most work to Opus 4.8 or stalled mid-run. Watch for routing changes and cost spikes, since reports also mention refusals on ordinary tasks and ad hoc multi-model workarounds.

💳Reliability·1st July

New

GLM 5.2 supports Amp, dcode, and Next.js workflows after Composio tops 41 tool tasks

Independent toolmakers pushed GLM 5.2 into coding workflows via dcode, Amp plugin modes, and Wafer-backed Next.js routes, while Composio reported it tied or won across 41 real-tool tasks. That matters because GLM is moving from benchmark curiosity into a practical open-weight option for agentic coding and long-running repo work.

🧠GLM·1st July

Briefs for July 3

Top stories this week

Breaking

Ramp introduces PorTAL with half-cost LoRA porting across Qwen and Gemma models

Ramp published PorTAL, a method that learns a reusable task representation once and recalibrates only a thin converter when moving that task to a new base model. In reported Qwen and Gemma experiments, it matched per-task LoRA accuracy while cutting data and cost roughly in half.

New

Cost Optimization·1st July·3 min read

New

Anthropic removes Claude Code ANTHROPIC_BASE_URL prompt marking after proxy reports

After reports that Claude Code was inserting hidden prompt marks when routed through custom ANTHROPIC_BASE_URL gateways, an Anthropic engineer said the experiment was real and is being rolled back. The issue matters for teams proxying Claude Code through gateways because prompt mutation on custom routes creates trust and debugging problems even if the effect was narrow.

⌨️Claude Code·30th June

New

Anthropic launches Claude Sonnet 5 with 1M context and adaptive thinking

Anthropic launched Claude Sonnet 5 across Claude, the API, and Claude Code with 1M context, adaptive thinking, and $2/$10 intro pricing through Aug. 31. Independent evals place it near Opus 4.8 on coding and tool use, so teams should benchmark it against Opus before switching.

Release·⌨️Claude Code·30th June

New

OpenAI introduces GeneBench-Pro with GPT-5.6 Sol Pro at 31.5%

OpenAI introduced GeneBench-Pro to test whether agents can handle messy, judgment-heavy computational biology work instead of fixed bio QA. GPT-5.6 Sol Pro reached 31.5%, which shows progress on research workflows but also how far current systems remain from expert-level autonomy.

🛡️Evals·30th June

US Commerce removes Fable 5 export controls; Anthropic restores access July 1

The US Commerce Department removed export controls on Fable 5 and Mythos 5, and Anthropic said access starts returning July 1. Fable counts against up to 50% of weekly limits through July 7 before moving to usage credits, so users should check their quota behavior and fallback paths.

💳Claude Code·30th June