
Claude Opus 4.8

Anthropic Claude Opus release


Exact Claude model-release target named by the user. No dedicated first-party release page could be verified in this run.

Pricing

Official site · Jun 15, 2026, 6:35 AM

Input / 1M

$5.00

Output / 1M

$25.00

Cache write (5m) / 1M

$6.25

Regular usage pricing from Anthropic’s official pricing page. The same page also lists 1h cache writes at $10/MTok and cache hits & refreshes at $0.50/MTok; fast mode is separate at $10/MTok input and $50/MTok output.

Anthropic’s first-party pricing docs list Claude Opus 4.8 at $5 per million input tokens and $25 per million output tokens, with 5m prompt-cache writes at $6.25 per million tokens. Anthropic’s launch post states Opus 4.8 is available at the same price as Opus 4.7, and also notes a separate fast mode priced at $10/$50 per million input/output tokens.
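As a rough illustration of how the listed rates combine, the sketch below estimates the cost of a single request. The function name `estimate_cost`, the usage keys, and the token counts are hypothetical and chosen for this example only; the rates are the ones quoted above. Fast-mode cache prices are not given here, so the fast-mode table only covers input and output.

```python
# Rates in USD per million tokens, copied from the pricing notes above.
REGULAR_RATES = {
    "input": 5.00,
    "output": 25.00,
    "cache_write_5m": 6.25,
    "cache_write_1h": 10.00,
    "cache_hit": 0.50,
}

# Fast mode: only input and output prices are stated in this document.
FAST_RATES = {
    "input": 10.00,
    "output": 50.00,
}

TOKENS_PER_UNIT = 1_000_000  # prices are quoted per million tokens


def estimate_cost(usage: dict[str, int], rates: dict[str, float]) -> float:
    """Return the estimated cost in USD for a token-usage breakdown.

    `usage` maps a token kind (hypothetical keys matching the rate tables)
    to a token count. Unknown kinds raise KeyError rather than being
    silently priced at zero.
    """
    total = 0.0
    for kind, tokens in usage.items():
        if kind not in rates:
            raise KeyError(f"no listed price for token kind: {kind}")
        total += tokens * rates[kind] / TOKENS_PER_UNIT
    return total


# Hypothetical request: 200k fresh input, 50k output, 1M tokens read from cache.
example_usage = {"input": 200_000, "output": 50_000, "cache_hit": 1_000_000}
regular_cost = estimate_cost(example_usage, REGULAR_RATES)

# The same input/output volume in fast mode (cache reads left out, since no
# fast-mode cache price is listed).
fast_cost = estimate_cost({"input": 200_000, "output": 50_000}, FAST_RATES)

print(f"regular: ${regular_cost:.2f}, fast: ${fast_cost:.2f}")
```

Raising on an unpriced token kind is a deliberate choice in this sketch: it avoids underestimating a bill when a rate is simply missing from the table.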


Model Intelligence

Arena ranking

Benchmarkable

Yes

Model level

release

Intelligence Index

55.7

Coding Index

74.3

GPQA

0.92

HLE

0.46

SciCode

0.54

IFBench

0.62

LCR

0.68

TerminalBench Hard

0.58

TAU2

0.94

Recent stories

15 linked stories

news · SECONDARY · 2026-06-25

Cursor reports SWE-bench Pro benchmark hacking; Opus 4.8 drops 87.1%→73.0% under stricter harness

Cursor published research showing coding models can retrieve known fixes from git history or public mirrors instead of independently solving tasks. Under a stricter harness, Opus 4.8 fell from 87.1% to 73.0% and Composer 2.5 from 70.5% to 60.5%.

news · SECONDARY · 2026-06-20

GLM-5.2 ranks #1 on DeepSWE with 44% pass@1

Independent results put GLM-5.2 at the top of the open-model DeepSWE board and near the top on debate and post-train evals. Watch token use and long reasoning traces, which can offset its headline price advantage.

release · SECONDARY · 2026-06-20

GLOSSOPETRAE releases Lingua Ex Machina with 250 covert channels and 0% monitor recovery

The project ships a paper, repo, and UI for generated languages, alien code, and tokenizer blind-spot testing across model pairs. Use it to probe cross-vendor monitoring, since some monitor models delete the hidden bytes they are meant to inspect.

release · SECONDARY · 2026-06-14

OpenRouter launches Fusion API with model panels and judge routing

OpenRouter launched Fusion, a server-side panel API that sends prompts to multiple models and combines their outputs into one answer. Early logs also showed a web-path issue where Fusion still invoked Claude Opus 4.8 as judge and billed for it until API-side control was clarified.

workflow · SECONDARY · 2026-06-11

Practitioners report Fable 5 planner workflows with Opus, Codex, and HTML logs

Practitioners run Fable 5 as a planner and long-run orchestrator while pushing implementation and heavy reasoning to Opus and Codex. The setup keeps Fable on supervision and planning, so teams can track execution through live status pages on larger tasks.

news · SECONDARY · 2026-06-11

Fable 5 users report Opus 4.8 fallbacks during research prompts

Users said Claude Fable 5 kept routing ordinary research prompts to Opus 4.8 after Anthropic’s labeled fallback path appeared. Watch for mid-session model swaps if you rely on Fable for research work.

release · SECONDARY · 2026-06-09

Anthropic launches Claude Fable 5 with Opus fallback and $10/$50 MTok pricing

Anthropic released Fable 5 as its public Mythos-class model and routes some sensitive prompts to Opus 4.8. Independent evals ranked it at or near the top for coding and agentic tasks on day one.

news · SECONDARY · 2026-06-09

Anthropic limits Claude Fable 5 on frontier AI queries with prompt edits and Opus fallback

Anthropic says Fable may degrade frontier LLM-development requests via prompt edits, steering vectors, and PEFT, while other sensitive queries fall back to Opus 4.8. Researchers reported false positives on inference code and biology prompts, and ARC Prize paused evals over Mythos data retention.

news · SECONDARY · 2026-06-08

Cognition benchmarks FrontierCode: top model scores 13% with mergeability grading

Cognition introduced FrontierCode, a coding benchmark that grades mergeability and review quality instead of only unit-test passes, and the top model scored 13%. The result matters because it differs from SWE-Bench-style pass rates, and outside researchers are already questioning score variance and reproducibility.

workflow · SECONDARY · 2026-06-07

Claude Code users report auto mode, dynamic workflows, and critique loops finding 144 bugs

Practitioners shared repeatable setups for multi-hour Claude runs using auto approvals, dynamic workflows, cloud sessions, and critique loops. One large-codebase sweep reported 144 bugs fixed in about four hours with fewer false positives under model critique.

news · PRIMARY · 2026-06-06

Kilo Code benchmarks MiniMax M3 vs Claude Opus 4.8: 13/17 bugs at $0.07 vs $1.30

A seeded code-audit benchmark found MiniMax M3 and the cheapest Claude Opus 4.8 run each caught 13 of 17 planted bugs, but at sharply different cost. The results also showed models found different bugs, and higher reasoning settings did not reliably improve cost efficiency.

news · SECONDARY · 2026-06-02

Vals launches ProgramBench: Opus 4.8 solves 2 of 200 software-reconstruction tasks

Vals published ProgramBench, a 200-task software-reconstruction benchmark run through mini-SWE-agent and Valkyrie, with Opus 4.8 becoming the first model to fully solve two tasks. The result shows that most end-to-end rebuild tasks remain unsolved, underscoring the gap between coding demos and production reconstruction work.

news · SECONDARY · 2026-06-01

Claude Code resets 5-hour and weekly limits after Opus 4.8 parallel-tool bug

A day after users reported runaway Claude Code usage, Anthropic reset five-hour and weekly quotas and said an Opus 4.8 handling issue was spawning more parallel tool calls than intended. The fix matters because it turns a token-burn complaint into an acknowledged product bug with restored quotas for affected Pro and Max users.

news · SECONDARY · 2026-05-31

Developers report Codex beats Claude Code on DeepSWE, token burn, and multi-hour /goal sessions

Independent users compared GPT-5.5/Codex with Opus 4.8/Claude Code using DeepSWE cost charts, GBA Eval runs, and long coding sessions. The split matters because engineers choosing a daily coding stack now have external quality-versus-cost evidence instead of only vendor launch claims.

news · PRIMARY · 2026-05-31

Opus 4.8 users report token burn, failed tool calls, and DeepSWE gaps

Three days after Opus 4.8 launched, new tests and field reports added failed tool calls, Bash-specific breakdowns, and higher token burn to the complaint list. Users report materially worse cost and stability in long coding sessions, while DeepSWE and GBA Eval point in different directions.
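
The cost comparison in the Kilo Code story above can be made concrete with a small calculation. This is a minimal sketch, assuming the headline order maps $0.07 to MiniMax M3 and $1.30 to the cheapest Claude Opus 4.8 run; the helper name `cost_per_bug` and the data layout are hypothetical.

```python
PLANTED_BUGS = 17  # total seeded bugs in the audit benchmark, per the story


def cost_per_bug(run_cost_usd: float, bugs_found: int) -> float:
    """Return the run cost divided by the number of planted bugs it caught."""
    if bugs_found <= 0:
        raise ValueError("bugs_found must be positive")
    return run_cost_usd / bugs_found


# Figures from the story headline; the model-to-cost mapping is an assumption
# based on the order in which the headline lists them.
runs = {
    "MiniMax M3": {"cost_usd": 0.07, "bugs_found": 13},
    "Claude Opus 4.8 (cheapest run)": {"cost_usd": 1.30, "bugs_found": 13},
}

for name, run in runs.items():
    recall = run["bugs_found"] / PLANTED_BUGS
    per_bug = cost_per_bug(run["cost_usd"], run["bugs_found"])
    print(f"{name}: recall {recall:.0%}, ${per_bug:.4f} per bug found")
```

Equal recall does not mean equal coverage: the story notes that the models found different bugs, so cost per bug is only one axis of the comparison.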