AI Primer

Microsoft launches MAI-Thinking-1 and six companion models with 97.0% AIME 2025

Microsoft introduced MAI-Thinking-1, MAI-Code-1-Flash, and five other MAI models across code, image, voice, and speech. The launch puts Microsoft back into the frontier-model race and starts landing pieces of the stack in Copilot and partner runtimes.

TL;DR

You can read the main MAI announcement, jump straight to the MAI-Thinking-1 tech report, browse the MAI-Code-1-Flash page, check the Image Edit Arena board, and compare the speech claims against the Artificial Analysis speech-to-text leaderboard. The rollout already touched GitHub Copilot, PowerPoint, OneDrive, Teams, Dynamics 365 Contact Centre, OpenRouter, fal, and Baseten.

Seven MAI models

The launch is broader than the headline model. Microsoft's own list, echoed in ai_for_success's inventory, spans seven releases across text, code, image, voice, and speech.

  • Reasoning: MAI-Thinking-1
  • Code: MAI-Code-1-Flash
  • Image: MAI-Image-2.5, MAI-Image-2.5-Flash
  • Speech to text: MAI-Transcribe-1.5
  • Text to speech: MAI-Voice-2, MAI-Voice-2-Flash

The day-one surfaces are split. The MAI keynote transcript says MAI-Image-2.5 is live in PowerPoint and rolling into OneDrive, MAI-Transcribe-1.5 is being integrated into Copilot, Teams, GitHub, and Dynamics 365 Contact Centre, and MAI-Thinking-1 is in private preview on Foundry.

MAI-Thinking-1

The flagship text model is a sparse mixture-of-experts (MoE) model with 35B active parameters and roughly 1T total. In the official MAI-Thinking-1 announcement, Microsoft says it trained the model from scratch on clean, commercially licensed data, excluded AI-generated content from pre-training, and skipped third-party distillation entirely.

The core numbers from the tech report are unusually concrete:

  • 35B active parameters, about 1T total parameters
  • 30T pre-training tokens
  • 256K context window
  • 97.0% on AIME 2025
  • 94.5% on AIME 2026
  • 87.7% on LiveCodeBench v6
  • 52.8% on SWE-Bench Pro

Microsoft also leans on a human-evaluation claim, not just a benchmark chart. The launch post says professional raters from Surge judged 1,276 tasks in blind side-by-side comparisons and preferred MAI-Thinking-1 to Sonnet 4.6.

One small but useful correction came from Simon Willison's notes, where he fixed an initial misreading of the active-parameter count. The distinction matters here because Microsoft is explicitly selling the model as mid-weight reasoning, not brute-force scale.

MAI-Code-1-Flash

MAI-Code-1-Flash is the first piece of the MAI stack that shipped straight into a mainstream developer surface. Pierce Boggan's post said it was rolling out in the GitHub Copilot model picker and Auto for developers, while the model page says it is optimized for GitHub Copilot in VS Code.

Microsoft's positioning is straightforward: smaller active footprint, coding-first tuning, and lower token use. According to the figures cited in WesRoth's summary and scaling01's post, the model is a 137B-parameter MoE with 5B active parameters and a 256K context window.
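To put the "smaller active footprint" framing in numbers, here is a minimal sketch that computes the share of parameters active per token for both MoE models, using only the counts quoted in this article. The MAI-Thinking-1 total is reported only as "roughly 1T", so the 1000B figure below is an approximation, and the dictionary layout and function name are illustrative choices.

```python
# Parameter counts as quoted in this article, in billions.
# The MAI-Thinking-1 total is described only as "roughly 1T", so 1000 is an approximation.
MODELS = {
    "MAI-Thinking-1": {"active_b": 35, "total_b": 1000},
    "MAI-Code-1-Flash": {"active_b": 5, "total_b": 137},
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of a sparse MoE's total parameters that are active per token."""
    return active_b / total_b


for name, params in MODELS.items():
    share = active_fraction(params["active_b"], params["total_b"])
    print(f"{name}: {share:.1%} of parameters active")
```

On these figures both models activate only about 3.5% to 3.6% of their weights per token, which is the sense in which total parameter count overstates per-token compute.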

The benchmark comparisons Microsoft chose all target Claude Haiku 4.5:

  • SWE-Bench Verified: 71.6 vs 66.6
  • SWE-Bench Pro: 51.2 vs 35.2
  • Terminal Bench 2: 54.8 vs 41.6
  • Token usage: up to 60% fewer tokens
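The gaps in that list are easier to compare once they are expressed both in points and relative to the Haiku 4.5 baseline. The sketch below does that arithmetic on the scores as listed above; the function name and the choice to round to one decimal place are assumptions made for illustration.

```python
# Scores as listed in this article: (MAI-Code-1-Flash, Claude Haiku 4.5).
SCORES = {
    "SWE-Bench Verified": (71.6, 66.6),
    "SWE-Bench Pro": (51.2, 35.2),
    "Terminal Bench 2": (54.8, 41.6),
}


def gaps(mai: float, haiku: float) -> tuple[float, float]:
    """Return (gap in points, gap relative to the Haiku score in percent), rounded to 1 decimal."""
    diff = mai - haiku
    return round(diff, 1), round(diff / haiku * 100, 1)


for bench, (mai, haiku) in SCORES.items():
    points, relative = gaps(mai, haiku)
    print(f"{bench}: +{points} points ({relative}% relative)")
```

The relative view shows why SWE-Bench Pro is the standout claim: a 16-point lead there is about 45% above the baseline, while the 5-point lead on SWE-Bench Verified is about 7.5%.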

The HN thread on MAI-Code-1-Flash immediately turned that framing into a size-class debate. Commenters mostly treated the interesting question as whether Microsoft has built a genuinely useful cheap coding model, not whether it beat the biggest frontier systems.

Image, voice, and transcription

The multimodal side of the launch is more distribution-heavy than the text models, and some of it is already live outside Microsoft properties.

MAI-Image-2.5

  • No. 2 on Arena's image-edit leaderboard, according to arena
  • No. 3 on Arena for text-to-image, according to the official launch post
  • Live in PowerPoint, rolling out to OneDrive, and available in Foundry
  • Also shipped on fal, per fal's launch post
  • Official pricing for MAI-Image-2.5: $5 per 1M text input tokens, $8 per 1M image input tokens, $47 per 1M image output tokens, per the launch post
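As a rough guide to what those per-token prices mean per request, here is a minimal cost estimator built on the three rates quoted above. The token counts in the example call are hypothetical, since the article does not say how many tokens a typical image consumes, and the function and key names are illustrative rather than part of any Microsoft API.

```python
# USD per 1M tokens for MAI-Image-2.5, as quoted from the launch post.
PRICE_PER_MILLION = {"text_in": 5.0, "image_in": 8.0, "image_out": 47.0}


def request_cost(text_in: int = 0, image_in: int = 0, image_out: int = 0) -> float:
    """Estimated USD cost of one request, given token counts per category."""
    used = {"text_in": text_in, "image_in": image_in, "image_out": image_out}
    return sum(PRICE_PER_MILLION[kind] * count / 1_000_000 for kind, count in used.items())


# Hypothetical request: these token counts are invented for illustration only.
cost = request_cost(text_in=200, image_in=1_000, image_out=4_000)
print(f"Estimated cost: ${cost:.4f}")
```

Because image output is priced at nearly six times image input, output tokens dominate the total in almost any edit-style request.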

MAI-Transcribe-1.5

MAI-Voice-2

Where the rollout landed first

Microsoft spread the launch across its own apps and a small partner mesh instead of keeping everything inside one API gateway.

Day-one distribution breaks down like this:

  • Microsoft surfaces: GitHub Copilot and VS Code for MAI-Code-1-Flash; PowerPoint and OneDrive for MAI-Image-2.5; Copilot and Teams for MAI-Transcribe-1.5
  • Foundry: MAI-Thinking-1 in private preview; MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2 listed as available in Microsoft's platform posts
  • Partner runtimes: Baseten for MAI-Thinking-1; OpenRouter for image, voice, and transcription; fal for MAI-Image-2.5

Baseten's announcement adds the most concrete enterprise control claim. In Baseten's post, the company said customers will be able to fine-tune MAI-Thinking-1 without handing post-training data back to Microsoft, using Baseten Loops so they keep their checkpoints.

The hill-climbing machine

The part engineers will probably keep around is the report, not the launch copy. Microsoft's MAI-Thinking-1 tech report spells out a full-stack training story the company calls a "hill-climbing machine," where data, rewards, infra, environments, safety, and evals get tuned as one system.

A few concrete details stand out:

  • Microsoft says reasoning, tool use, and agentic behavior were learned in post-training rather than inherited through model distillation, a point highlighted by Rohan Paul's summary
  • The RL inference stack used SGLang inside Microsoft's Rocket framework for routing, prefix caching, traffic control, and failure recovery across thousands of chips, according to lmsysorg
  • The report includes exact MFU and scaling-ladder details that Elie Bakouch called unusually transparent for a model at this scale
  • Microsoft used GEPA and DSPy to tune an LLM judge prompt for quality scoring, according to lateinteraction's repost
  • The model is co-designed with Microsoft's MAIA 200 chip, and Mustafa Suleyman's thread claims 30% better performance per dollar and 1.4x better performance per watt versus GB200 when running MAI models end to end

That transparency also creates an obvious next question. As altryne's post noted, independent evaluator access looked thin on launch day, so most of the strongest claims still come from Microsoft, its partners, or benchmarks the company chose to publish itself.
