Breaking · March 30, 2026

Microsoft launches Critique and Council for M365 Copilot research

Microsoft rolled out Critique, a two-model reviewer flow, and Council, a side-by-side multi-model mode inside M365 Copilot research workflows. Critique was reported at 57.4 on Draco and about 7 points above earlier Researcher versions.

Orchestration Evals LLM as Judge Deep Research

TL;DR

Microsoft added two multi-model research modes to M365 Copilot: Critique splits report generation and review across different models, while Council runs the same prompt across multiple models side by side.
Microsoft reports that Critique scored 57.4 on the Draco benchmark, and the launch post describes that as a 7-point gain over earlier Researcher versions.
The reported Critique workflow is sequential: one model handles planning, retrieval, synthesis, and drafting, then a second model checks claims, structure, evidence, and citations before the report is finalized.
Council appears aimed at comparison rather than review: the product demo shows several models answering the same research prompt in parallel inside the M365 Copilot interface.

What shipped in M365 Copilot research?

Paul Couvert

@itsPaulAi

Microsoft has just released a VERY powerful feature Copilot Critique allows one model from Anthropic/OpenAI to generate the research output and another one reviews it. → Model 1 plans, retrieves sources, synthesizes, and drafts → Model 2 evaluates claims, strengthens …

Satya Nadella

@satyanadella

Introducing Critique, a new multi-model deep research system in M365 Copilot. You can use multiple models together to generate optimal responses and reports.

1:53 PM · Mar 30, 2026

Microsoft's new release is really two different multi-model patterns inside the same research workflow. In the Critique announcement, one model produces the research output, while a second model acts as reviewer. The reviewer is described as checking factual grounding, strengthening structure, and improving citation quality rather than generating from scratch. A separate product clip shows Critique surfaced directly in the Copilot interface as a research feature rather than a backend-only change.
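The generator-then-reviewer handoff described above can be sketched in a few lines. This is only an illustration of the pattern, not Microsoft's implementation: the `Model` type, `run_critique`, and the reviewer prompt wording are all hypothetical, and the sketch assumes the reviewer returns the revised report directly, which the announcement does not spell out.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical abstraction: a "model" is any function from prompt text to response text.
Model = Callable[[str], str]


@dataclass
class CritiqueResult:
    draft: str  # output of the generator stage
    final: str  # output after the reviewer stage


def run_critique(prompt: str, generator: Model, reviewer: Model) -> CritiqueResult:
    """Run a sequential two-model flow: one model drafts, a second one reviews."""
    # Stage 1: the generator plans, retrieves, synthesizes, and drafts.
    draft = generator(prompt)

    # Stage 2: a different model checks the draft instead of starting from scratch.
    # Assumption: the reviewer hands back the revised report itself.
    review_prompt = (
        "Check claims, structure, evidence, and citations in this draft, "
        "then return the revised report:\n" + draft
    )
    final = reviewer(review_prompt)
    return CritiqueResult(draft=draft, final=final)


# Stub models stand in for real API calls so the flow can be run locally.
def stub_generator(prompt: str) -> str:
    return f"DRAFT[{prompt}]"


def stub_reviewer(prompt: str) -> str:
    return f"REVIEWED[{prompt}]"


result = run_critique("Summarize Q2 results", stub_generator, stub_reviewer)
```

The point of the staging is that the second model only ever sees the first model's draft, so the review is a dependent step rather than an independent answer.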

TestingCatalog News 🗞

@testingcatalog

Microsoft launched "Council" for M365 Copilot, a new multi-model mode in which several models execute the same prompt simultaneously. Things are getting more and more interesting there. Coplexity 👀

Satya Nadella

@satyanadella

Available today: techcommunity.microsoft.com/blog/microsoft…

7:10 PM · Mar 30, 2026

Council takes the parallel path instead. The Council demo shows the same prompt executed across multiple models at once, with outputs displayed side by side in the Researcher experience. That makes it a comparison mode for prompt-level variance, whereas Critique is a staged handoff between generator and reviewer. Together, the launch shifts M365 Copilot from single-model answering toward orchestrated multi-model research flows.
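For contrast, the parallel pattern can be sketched as a simple fan-out. Again, this is a hedged illustration rather than how Council is actually built: `run_council`, the `Model` type, and the stub model names are hypothetical, and threads are just one convenient way to issue the calls concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical abstraction: a "model" is any function from prompt text to response text.
Model = Callable[[str], str]


def run_council(prompt: str, models: Dict[str, Model]) -> Dict[str, str]:
    """Send the same prompt to every model concurrently and collect outputs by name."""
    # max_workers must be at least 1, even if no models are supplied.
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {name: pool.submit(model, prompt) for name, model in models.items()}
        # Each output stays keyed by model name so results can be shown side by side.
        return {name: future.result() for name, future in futures.items()}


# Stub models stand in for real API calls.
council = {
    "model_a": lambda p: f"A says: {p}",
    "model_b": lambda p: f"B says: {p}",
}
outputs = run_council("Summarize Q2 results", council)
```

Unlike the staged handoff, no model here sees another model's output; the comparison is left to whoever reads the side-by-side results.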

What do the benchmark and demos actually say?

TestingCatalog News 🗞

@testingcatalog

Microsoft announced Critique, a multi-model Deep Research solution for M365 Copilot, which achieved a score of 57.4 on the Draco benchmark. The future is multi-model 👀

Satya Nadella

@satyanadella

Benchmarks show this delivers best-in-class deep research!

2:37 PM · Mar 30, 2026

The clearest performance claim is Microsoft's reported 57.4 score on Draco. In the benchmark post, that number is attached specifically to Critique, and the rollout summary says it is "+7.0" versus previous Researcher versions. The public detail here is limited, but the claimed gain is tied to the reviewer step: better factual accuracy, broader analysis, stronger presentation, and better citations.
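As a rough sanity check on those two figures, the arithmetic below derives what the earlier score would have been. It assumes "+7.0" means absolute benchmark points, which the public material does not confirm, so the baseline and the relative gain are implied values, not numbers Microsoft reported.

```python
# Figures reported in the launch material.
critique_score = 57.4
reported_gain = 7.0  # assumption: absolute points, not a percentage

# Derived values (not stated by Microsoft).
implied_baseline = round(critique_score - reported_gain, 1)
relative_gain_pct = round(100 * reported_gain / implied_baseline, 1)

print(f"Implied earlier Researcher score: {implied_baseline}")
print(f"Implied relative improvement: {relative_gain_pct}%")
```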

The demo evidence also clarifies how Microsoft is differentiating the two modes in practice. The video teaser for Critique ends on "Draco Benchmark: 57.4," reinforcing it as the quality-focused path, while the Council interface shows one Q2-summary prompt being split into multiple concurrent outputs. For engineers, the implementation signal is less about a new base model and more about orchestration: M365 Copilot is exposing multi-model routing patterns as product features.
