Microsoft launches MAI-Thinking-1, which scores 97.0% on AIME 2025, and six companion models
Microsoft introduced MAI-Thinking-1, MAI-Code-1-Flash, and five other MAI models spanning image, voice, and speech. The launch puts Microsoft back into the frontier-model race and starts landing pieces of the stack in Copilot and partner runtimes.

TL;DR
- Microsoft shipped seven in-house MAI models at Build, announced in Mustafa Suleyman's launch thread and the "Building a hill-climbing machine" post: MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Image-2.5-Flash, MAI-Transcribe-1.5, MAI-Voice-2, and MAI-Voice-2-Flash.
- According to Asadovsky's tech report link and the 109-page MAI-Thinking-1 report, MAI-Thinking-1 is a 35B-active, roughly 1T-total MoE trained on 30T tokens with a 256K context window and no third-party distillation.
- Per WesRoth's benchmark summary and the MAI-Code-1-Flash model page, Microsoft's new coding model is landing in GitHub Copilot in VS Code, with better scores than Claude Haiku 4.5 on the coding benchmarks Microsoft published and up to 60% fewer tokens.
- On the multimodal side, arena's Image Edit Arena post and the MAI-Image-2.5 launch post placed MAI-Image-2.5 at No. 2 for image editing, while Artificial Analysis and the MAI-Transcribe-1.5 post highlighted a speed-focused speech model that transcribes an hour of audio in under 15 seconds.
- The unusual part is not just the model count. As Elie Bakouch's reaction and Simon Willison's notes both pointed out, Microsoft also published one of the more detailed frontier-model tech reports in recent memory, covering the training recipe, infra details, and evaluation methodology.
You can read the main MAI announcement, jump straight to the MAI-Thinking-1 tech report, browse the MAI-Code-1-Flash page, check the Image Edit Arena board, and compare the speech claims against the Artificial Analysis speech-to-text leaderboard. The rollout already touched GitHub Copilot, PowerPoint, OneDrive, Teams, Dynamics 365 Contact Centre, OpenRouter, fal, and Baseten.
Seven MAI models
The launch is broader than the headline model. Microsoft's own list, echoed in ai_for_success's inventory, spans seven releases across text, code, image, voice, and speech.
- Reasoning: MAI-Thinking-1
- Code: MAI-Code-1-Flash
- Image: MAI-Image-2.5, MAI-Image-2.5-Flash
- Speech to text: MAI-Transcribe-1.5
- Text to speech: MAI-Voice-2, MAI-Voice-2-Flash
The day-one surfaces are split. The MAI keynote transcript says MAI-Image-2.5 is live in PowerPoint and rolling into OneDrive, MAI-Transcribe-1.5 is being integrated into Copilot, Teams, GitHub, and Dynamics 365 Contact Centre, and MAI-Thinking-1 is in private preview on Foundry.
MAI-Thinking-1
The flagship text model is a 35B-active, roughly 1T-total sparse MoE. In the official MAI-Thinking-1 announcement, Microsoft says it trained the model from scratch on clean, commercially licensed data, excluded AI-generated content from pre-training, and skipped third-party distillation entirely.
The core numbers from the tech report are unusually concrete:
- 35B active parameters, about 1T total parameters
- 30T pre-training tokens
- 256K context window
- 97.0% on AIME 2025
- 94.5% on AIME 2026
- 87.7% on LiveCodeBench v6
- 52.8% on SWE-Bench Pro
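To put the sparsity in perspective, the headline figures can be turned into a couple of ratios. The sketch below is back-of-the-envelope arithmetic on the rounded numbers quoted above, and it treats "about 1T" as exactly 1e12, which is an assumption. The function names are illustrative and do not come from Microsoft's report.

```python
# Back-of-the-envelope ratios from the rounded figures in the list above.
# Assumption: "about 1T total parameters" is taken as exactly 1e12.

ACTIVE_PARAMS = 35e9      # 35B active parameters per token
TOTAL_PARAMS = 1e12       # roughly 1T total parameters (rounded)
PRETRAIN_TOKENS = 30e12   # 30T pre-training tokens


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def tokens_per_param(tokens: float, params: float) -> float:
    """Pre-training tokens seen per parameter."""
    return tokens / params


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active share per token: {share:.1%}")
    print(f"tokens per total parameter: {tokens_per_param(PRETRAIN_TOKENS, TOTAL_PARAMS):.0f}")
    print(f"tokens per active parameter: {tokens_per_param(PRETRAIN_TOKENS, ACTIVE_PARAMS):.0f}")
```

On these rounded inputs, only about 3.5% of the weights are active for any given token, which is the basis for describing the model as mid-weight despite its total size.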
Microsoft also leans on a human eval claim, not just a benchmark chart. The launch post says professional raters from Surge compared outputs on 1,276 tasks in blind side-by-side tests and preferred MAI-Thinking-1 to Sonnet 4.6.
One small but useful correction came from Simon Willison's notes, where he updated an initial misread of the active-parameter count. The distinction matters here because Microsoft is explicitly selling the model as mid-weight reasoning, not brute-force scale.
MAI-Code-1-Flash
MAI-Code-1-Flash is the first piece of the MAI stack that shipped straight into a mainstream developer surface. Pierce Boggan's post said it was rolling out in the GitHub Copilot model picker and Auto for developers, while the model page says it is optimized for GitHub Copilot in VS Code.
Microsoft's positioning is straightforward: smaller active footprint, coding-first tuning, and lower token use. Per the numbers cited in WesRoth's summary and scaling01's post, the model is a 137B-parameter MoE with 5B active parameters and a 256K context window.
The benchmark comparisons Microsoft chose all target Claude Haiku 4.5:
- SWE-Bench Verified: 71.6 vs 66.6
- SWE-Bench Pro: 51.2 vs 35.2
- Terminal Bench 2: 54.8 vs 41.6
- Token usage: up to 60% fewer tokens
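The size of those gaps is easier to compare once each one is expressed both in points and relative to the Haiku 4.5 baseline. The following is a minimal sketch that only reuses the three published score pairs above. The `gaps` helper is a hypothetical name, and the scores are Microsoft's own numbers rather than independent measurements.

```python
# Scores as published by Microsoft: (MAI-Code-1-Flash, Claude Haiku 4.5).
SCORES = {
    "SWE-Bench Verified": (71.6, 66.6),
    "SWE-Bench Pro": (51.2, 35.2),
    "Terminal Bench 2": (54.8, 41.6),
}


def gaps(scores):
    """Return {benchmark: (gap in points, gap relative to baseline in %)}."""
    result = {}
    for name, (mai, haiku) in scores.items():
        points = round(mai - haiku, 1)
        relative = round((mai - haiku) / haiku * 100, 1)
        result[name] = (points, relative)
    return result


if __name__ == "__main__":
    for name, (points, relative) in gaps(SCORES).items():
        print(f"{name}: +{points} points ({relative}% over baseline)")
```

The SWE-Bench Verified lead is the narrowest of the three, while the SWE-Bench Pro lead is the widest in both absolute and relative terms.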
The HN thread on MAI-Code-1-Flash immediately turned that framing into a size-class debate. Commenters mostly treated the interesting question as whether Microsoft has built a genuinely useful cheap coding model, not whether it beat the biggest frontier systems.
Image, voice, and transcription
The multimodal side of the launch is more distribution-heavy than the text models, and some of it is already live outside Microsoft properties.
MAI-Image-2.5
- No. 2 on Arena's image-edit leaderboard, according to arena
- No. 3 on Arena for text-to-image, according to the official launch post
- Live in PowerPoint, rolling out to OneDrive, and available in Foundry
- Also shipped on fal, per fal's launch post
- Official pricing for MAI-Image-2.5: $5 per 1M text input tokens, $8 per 1M image input tokens, $47 per 1M image output tokens, per the launch post
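Per-token image pricing is hard to read at a glance, so here is a small cost estimator built on the three rates above. The launch material quoted here does not say how many tokens an image consumes, so the token counts in the example call are purely hypothetical placeholders, and `request_cost_usd` is an illustrative helper rather than any Microsoft API.

```python
# USD per 1M tokens, as quoted from the launch post above.
PRICE_PER_M = {"text_in": 5.0, "image_in": 8.0, "image_out": 47.0}


def request_cost_usd(text_in: int, image_in: int, image_out: int) -> float:
    """Estimate the cost of one request from its token counts."""
    tokens = {"text_in": text_in, "image_in": image_in, "image_out": image_out}
    total = sum(tokens[kind] * PRICE_PER_M[kind] for kind in tokens)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical token counts for a single edit request (assumption,
    # not published figures): a short prompt, one input image, one output.
    cost = request_cost_usd(text_in=200, image_in=1_000, image_out=4_000)
    print(f"estimated cost: ${cost:.4f}")
```

Whatever the real token counts turn out to be, the output rate dominates: image output tokens cost nearly six times as much as image input tokens.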
MAI-Transcribe-1.5
- 2.4% AA-WER and No. 3 overall on Artificial Analysis, according to Artificial Analysis
- Processes audio at about 276x real time, and an hour of audio in under 15 seconds, per Artificial Analysis and the official post
- Expanded from 25 to 43 languages, per the official post
- Priced at $6 per 1,000 minutes in Foundry, according to Artificial Analysis's pricing note
- Integrating into Copilot, Teams, GitHub, and Dynamics 365 Contact Centre
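The speed and price claims above can be cross-checked with simple arithmetic. This sketch assumes the 276x real-time factor applies uniformly to the whole file and ignores upload, queueing, and startup overhead. The helper names are illustrative only.

```python
REALTIME_FACTOR = 276          # about 276x real time, per Artificial Analysis
PRICE_PER_1000_MIN_USD = 6.0   # Foundry price quoted above


def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time to transcribe audio at a given real-time speedup."""
    return audio_seconds / realtime_factor


def cost_usd(audio_minutes: float, price_per_1000_min: float) -> float:
    """Cost to transcribe a given number of audio minutes."""
    return audio_minutes / 1000 * price_per_1000_min


if __name__ == "__main__":
    one_hour = 60 * 60
    seconds = processing_seconds(one_hour, REALTIME_FACTOR)
    print(f"one hour of audio: about {seconds:.1f} s of processing")
    print(f"one hour of audio: about ${cost_usd(60, PRICE_PER_1000_MIN_USD):.2f}")
```

At 276x, an hour of audio works out to roughly 13 seconds of processing, which is consistent with the "under 15 seconds" claim, and the quoted rate comes to about 36 cents per hour of audio.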
MAI-Voice-2
- Expressive TTS with emotional controls like excited, embarrassed, and whispered, according to OpenRouter's MAI-Voice-2 post
- Stable speaker identity across long-form output, per OpenRouter
- Available in Azure Foundry and being integrated into VS Code and Dynamics 365 Contact Centre, according to the official MAI-Voice-2 post
- Live on OpenRouter, per OpenRouter's launch thread
Where the rollout landed first
Microsoft spread the launch across its own apps and a small partner mesh instead of keeping everything inside one API gateway.
Day-one distribution breaks down like this:
- Microsoft surfaces: GitHub Copilot and VS Code for MAI-Code-1-Flash, PowerPoint and OneDrive for MAI-Image-2.5, Copilot and Teams for MAI-Transcribe-1.5
- Foundry: MAI-Thinking-1 in private preview; MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2 available, per Microsoft's platform posts
- Partner runtimes: Baseten for MAI-Thinking-1, OpenRouter for image, voice, and transcription, fal for MAI-Image-2.5
Baseten's announcement adds the most concrete enterprise control claim: customers will be able to fine-tune MAI-Thinking-1 without handing post-training data back to Microsoft, using Baseten Loops so they keep their checkpoints.
The hill-climbing machine
The part engineers will probably keep around is the report, not the launch copy. Microsoft's MAI-Thinking-1 tech report spells out a full-stack training story the company calls a "hill-climbing machine," where data, rewards, infra, environments, safety, and evals get tuned as one system.
A few concrete details stand out:
- Microsoft says reasoning, tool use, and agentic behavior were learned in post-training rather than inherited through model distillation, a point highlighted by Rohan Paul's summary
- The RL inference stack used SGLang inside Microsoft's Rocket framework for routing, prefix caching, traffic control, and failure recovery across thousands of chips, according to lmsysorg
- The report includes exact MFU and scaling-ladder details that Elie Bakouch called unusually transparent for a model at this scale
- Microsoft used GEPA and DSPy to tune an LLM judge prompt for quality scoring, according to lateinteraction's repost
- The model is co-designed with Microsoft's MAIA 200 chip, and Mustafa Suleyman's thread claims 30% better performance per dollar and 1.4x better performance per watt versus GB200 when running MAI models end to end
That transparency also creates an obvious next question. As altryne's post noted, independent evaluator access looked thin on launch day, so most of the strongest claims still come from Microsoft, its partners, or benchmarks the company chose to publish itself.