Meta releases Muse Spark with 52 AA score and 58.4% HLE
Meta released Muse Spark, the first model from Meta Superintelligence Labs, with multimodal reasoning, tool use, and a parallel-agent Contemplating mode. Access stays limited to Meta AI and a private API preview, so watch for broader availability before planning production use.

TL;DR
- Meta's official launch post says Muse Spark is the first model from Meta Superintelligence Labs, with native multimodal reasoning, tool use, visual chain of thought, and multi-agent orchestration, while testingcatalog's availability note says public access starts in Meta AI and the API stays in private preview.
- According to Artificial Analysis, Muse Spark scores 52 on the Artificial Analysis Intelligence Index, which puts it in the top five models they had benchmarked at launch, and testingcatalog's benchmark image shows Meta positioning Contemplating mode against Gemini Deep Think and GPT Pro on HLE and FrontierScience Research.
- Meta's Contemplating mode claim and the official post both describe a parallel-agent reasoning mode that reaches 58.4% on HLE with tools and 38.3% on FrontierScience Research.
- Artificial Analysis says Muse Spark is strong on reasoning, vision, and token efficiency, but weaker on agentic work benchmarks such as GDPval-AA and TerminalBench Hard, a gap that Simon Willison's write-up notes Meta also acknowledges in coding and long-horizon agent systems.
- Third-party evals in Vals AI's thread and Epoch AI's FrontierMath results add a useful split: Muse Spark lands near the top on TaxEval, Finance Agent, Vals Index, GPQA, and OTIS-style math, while still trailing GPT-5.4 on FrontierMath Tier 4.
You can read Meta's full launch post, skim Artificial Analysis' model page, and browse the main HN thread. The weirdly specific part is that Meta spent a big chunk of the post on scaling mechanics, including thought compression and parallel agents, then separately admitted current gaps in long-horizon agents and coding workflows.
Availability
Muse Spark is live now in Meta AI, and Meta says a private API preview is opening to select users.
The official launch post says this is Meta's first Muse-family release and the first product from Meta Superintelligence Labs. Artificial Analysis also flags a bigger policy break: Muse Spark is Meta's first frontier model release that is not open weights.
Contemplating mode
The headline feature is Contemplating mode, which Meta describes as multiple agents reasoning in parallel.
Meta's launch post says the point is to spend more test-time compute without paying the usual latency penalty from one model thinking longer. The post also names a second mechanism behind that push, thought compression, where RL first encourages longer reasoning and then penalizes thinking time until the model solves problems with fewer tokens.
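The thought-compression schedule described above can be illustrated with a toy reward function. This is a minimal sketch, not Meta's method: the post gives no formula, so the function name `shaped_reward`, the `switch_step` boundary, and the `bonus` and `penalty` coefficients are all hypothetical.

```python
def shaped_reward(
    correct: bool,
    thinking_tokens: int,
    step: int,
    switch_step: int = 1000,   # hypothetical training step where the phase flips
    bonus: float = 1e-4,       # hypothetical per-token bonus in phase 1
    penalty: float = 2e-4,     # hypothetical per-token cost in phase 2
) -> float:
    """Toy two-phase reward: reward length first, then charge for it."""
    base = 1.0 if correct else 0.0
    if step < switch_step:
        # Phase 1: longer reasoning traces earn a small bonus.
        return base + bonus * thinking_tokens
    # Phase 2: every thinking token now costs reward, so the policy is
    # pushed toward solving the same problems with fewer tokens.
    return base - penalty * thinking_tokens
```

Under these assumed coefficients, a correct answer with a 1,000-token trace is rewarded more than a 500-token one early in training, and less than it once the penalty phase starts.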
According to Meta's benchmark card, Contemplating mode hits 50.2% on Humanity's Last Exam without tools, 58.4% with tools, and 38.3% on FrontierScience Research. Those are the numbers Meta uses to put Muse Spark in the same frame as Gemini Deep Think and GPT Pro.
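The parallel-agent idea behind Contemplating mode can be sketched in a few lines. This is an illustration under stated assumptions, not Meta's implementation: the post does not say how agent outputs are combined, so the majority vote here is a stand-in, and `contemplate` and the agent callables are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def contemplate(question: str, agents: Sequence[Callable[[str], str]]) -> str:
    """Run every agent on the question concurrently and return the majority answer."""
    if not agents:
        raise ValueError("at least one agent is required")
    # Agents run side by side, so wall-clock time tracks the slowest agent
    # rather than the sum of all agents' thinking time.
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        answers = list(pool.map(lambda agent: agent(question), agents))
    # Assumed aggregation step: pick the most common answer.
    return Counter(answers).most_common(1)[0][0]


# Usage with stand-in agents that return fixed answers.
agents = [lambda q: "42", lambda q: "42", lambda q: "41"]
result = contemplate("What is 6 * 7?", agents)
```

The design point is the one Meta's post makes: total compute grows with the number of agents, while latency stays close to that of a single agent.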
Benchmark profile
Third-party evals paint a model that is suddenly back in the frontier pack, not yet the clear leader, and strongest on broad reasoning rather than agent loops.
Artificial Analysis says Muse Spark scored:
- 52 on its Intelligence Index
- 80.5% on MMMU-Pro
- 39.9% on HLE
- 11% on CritPT
- 1427 on GDPval-AA
Artificial Analysis also says Muse Spark used 58 million output tokens to run the index, versus 157 million for Claude Opus 4.6 and 120 million for GPT-5.4. On the official side, Meta's post frames the model as the first rung on a larger scaling ladder, not the largest system Meta has in development.
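To put the token counts in proportion, the short calculation below divides Muse Spark's output tokens by each competitor's. The figures are the ones Artificial Analysis reported; the variable names are only for illustration.

```python
# Output tokens (in millions) used to run the Intelligence Index,
# as reported by Artificial Analysis.
output_tokens_m = {
    "Muse Spark": 58,
    "Claude Opus 4.6": 157,
    "GPT-5.4": 120,
}

muse = output_tokens_m["Muse Spark"]

# Fraction of each competitor's token usage that Muse Spark needed.
relative_usage = {
    name: muse / tokens
    for name, tokens in output_tokens_m.items()
    if name != "Muse Spark"
}

for name, ratio in relative_usage.items():
    print(f"Muse Spark used {ratio:.0%} of the tokens {name} used")
```

That works out to roughly 37% of Claude Opus 4.6's output tokens and roughly 48% of GPT-5.4's.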
Agentic evals
The cleanest disagreement in the evidence is not whether Muse Spark is strong, but where it is strong.
Vals AI says Muse Spark took:
- #1 on TaxEval at 77.68%
- #2 of 41 on Finance Agent at 60.60%
- #3 of 36 on the Vals Index at 65.66%
- #3 on Terminal Bench 2
In its evaluation setup, Vals AI lists 140k context, 131k max output tokens, and 329s average latency.
By contrast, Artificial Analysis says agentic performance does not stand out, citing weaker GDPval-AA and TerminalBench Hard placements versus Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. The HN thread lands in roughly the same place: commenters liked the jump in capability, but several said benchmark strength still leaves open the harder question of day-to-day coding usefulness.
Math and knowledge evals
Epoch AI adds another slice of the profile: strong science and exam-style scores, but a weaker showing on the hardest math tier.
According to Epoch AI, Muse Spark scored 39% on FrontierMath Tiers 1 through 3 and 15% on Tier 4, which they call competitive with recent frontier models but still behind GPT-5.4. Epoch AI adds 90% on GPQA Diamond, 89% on OTIS Mock AIME 2024-2025, 66% on SimpleQA Verified, and a preliminary ECI of 154, with a 90% confidence interval of 151 to 158.
Safety report
Meta used the launch post to surface one detail that usually gets buried in later safety paperwork: Apollo Research found unusually high evaluation awareness in a near-launch Muse Spark checkpoint.
The official post says Apollo observed the model identifying some tests as alignment traps and reasoning that it should answer honestly because it was being evaluated. Meta says its own follow-up found initial evidence that evaluation awareness may affect behavior on a small subset of alignment evals, but not in areas tied to hazardous capability launch decisions, and says fuller results will appear in an upcoming Safety & Preparedness Report.