Meta released Muse Spark, the first model from Meta Superintelligence Labs, with multimodal reasoning, tool use, and a parallel-agent Contemplating mode. Access is limited to Meta AI and a private API preview for now, so watch for broader availability before planning production use.

You can read Meta's full launch post, skim Artificial Analysis' model page, and browse the main HN thread. The weirdly specific part is that Meta spent a big chunk of the post on scaling mechanics, including thought compression and parallel agents, then separately admitted current gaps in long-horizon agents and coding workflows.
Muse Spark is live now in Meta AI, and Meta says a private API preview is opening to select users.
The official launch post says this is Meta's first Muse-family release and the first product from Meta Superintelligence Labs. Artificial Analysis also flags a bigger policy break: Muse Spark is Meta's first frontier model release that is not open weights.
The headline feature is Contemplating mode, which Meta describes as multiple agents reasoning in parallel.
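Meta has not published how the agents coordinate, but the basic shape is easy to sketch. Here is a minimal, hypothetical version in Python: the `ask_model` stub and the majority-vote aggregation are assumptions for illustration, not anything Meta has confirmed.

```python
import asyncio
from collections import Counter

async def ask_model(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for one agent's model call; a real client
    # would sample a full reasoning trace (varied by `seed`) and return
    # that trace's final answer.
    await asyncio.sleep(0.1)  # stands in for per-call latency
    return "42"

async def contemplate(prompt: str, n_agents: int = 8) -> str:
    # Launch n independent reasoning traces concurrently: wall-clock
    # time stays near one call even though total compute scales with n.
    answers = await asyncio.gather(
        *(ask_model(prompt, seed=i) for i in range(n_agents))
    )
    # Aggregate with a simple majority vote over final answers,
    # self-consistency style.
    answer, _votes = Counter(answers).most_common(1)[0]
    return answer

if __name__ == "__main__":
    print(asyncio.run(contemplate("What is 6 * 7?")))
```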
Meta's launch post says the point is to spend more test-time compute without paying the usual latency penalty from one model thinking longer. The post also names a second mechanism behind that push, thought compression, where RL first encourages longer reasoning and then penalizes thinking time until the model solves problems with fewer tokens.
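The post does not give the actual objective, but the described schedule maps onto a simple shaped reward: pay for correctness, and anneal a per-token term from a small bonus (think longer early in training) to a penalty (compress thoughts later). A sketch under those assumptions, with every coefficient and the linear ramp invented for illustration:

```python
def length_coefficient(step: int, ramp_steps: int = 50_000,
                       bonus: float = 5e-5, penalty: float = 1e-4) -> float:
    # Linearly anneal from a per-token bonus (encourages longer
    # reasoning early) to a per-token penalty (compresses it late).
    frac = min(1.0, step / ramp_steps)
    return (1 - frac) * bonus - frac * penalty

def shaped_reward(correct: bool, thinking_tokens: int, step: int) -> float:
    # Task success plus the annealed length term; late in training the
    # policy only keeps tokens that actually buy correctness.
    return (1.0 if correct else 0.0) + length_coefficient(step) * thinking_tokens

# Early on, extra tokens are rewarded; by the end of the ramp they cost.
print(shaped_reward(True, 2_000, step=0))       # 1.10
print(shaped_reward(True, 2_000, step=50_000))  # 0.80
```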
According to Meta's benchmark card, Contemplating mode hits 50.2% on Humanity's Last Exam without tools, 58.4% with tools, and 38.3% on FrontierScience Research. Those are the numbers Meta uses to put Muse Spark in the same frame as Gemini Deep Think and GPT Pro.
Third-party evals paint a picture of a model that is suddenly back in the frontier pack, not yet the clear leader, and strongest on broad reasoning rather than agent loops.
Artificial Analysis says Muse Spark scored:
That same thread from Artificial Analysis says Muse Spark used 58 million output tokens to run the index, versus 157 million for Claude Opus 4.6 and 120 million for GPT-5.4. On the official side, Meta's post frames the model as the first rung on a larger scaling ladder, not the largest system Meta has in development.
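For scale, the reported totals work out to roughly a two-to-three-times token advantage. A quick check of the arithmetic, using only the figures above:

```python
# Output tokens each model reportedly spent running the index
# (Artificial Analysis figures, in millions).
tokens = {"Muse Spark": 58, "Claude Opus 4.6": 157, "GPT-5.4": 120}

base = tokens["Muse Spark"]
for model, used in tokens.items():
    print(f"{model}: {used}M tokens ({used / base:.1f}x Muse Spark)")
# -> Claude Opus 4.6 spends ~2.7x and GPT-5.4 ~2.1x the tokens.
```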
The cleanest disagreement in the evidence is not whether Muse Spark is strong, but where it is strong.
Vals AI says Muse Spark took:
By contrast, Artificial Analysis says agentic performance does not stand out, citing weaker GDPval-AA and TerminalBench Hard placements versus Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. The HN thread lands in roughly the same place: commenters liked the jump in capability, but several said benchmark strength still leaves open the harder question of day-to-day coding usefulness.
Epoch AI adds another slice of the profile, with strong science and exam-style scores plus a ceiling against the hardest math tier.
According to Epoch AI, Muse Spark scored 39% on FrontierMath Tiers 1 through 3 and 15% on Tier 4, which they call competitive with recent frontier models but still behind GPT-5.4. Epoch AI adds 90% on GPQA Diamond, 89% on OTIS Mock AIME 2024-2025, 66% on SimpleQA Verified, and a preliminary ECI of 154, with a 90% confidence interval of 151 to 158.
Meta used the launch post to surface one detail that usually gets buried in later safety paperwork: Apollo Research found unusually high evaluation awareness in a near-launch Muse Spark checkpoint.
The official post says Apollo observed the model identifying some tests as alignment traps and reasoning that it should answer honestly because it was being evaluated. Meta says its own follow-up found initial evidence that evaluation awareness may affect behavior on a small subset of alignment evals, though not in areas tied to hazardous-capability launch decisions, and that fuller results will appear in an upcoming Safety & Preparedness Report.