Together releases Open Deep Research v2 with app, eval dataset, and repo
Together has released Open Deep Research v2, shipping the hosted app, codebase, blog post, and evaluation dataset as one bundle. Use it as a full open reference stack for report-generation agents rather than another closed demo.

TL;DR
- Together's launch thread introduced Open Deep Research v2 as a "fully free & open source" report-generation app built on open models, with the hosted app, code, blog, and evals released together.
- The new app post points engineers to a live deployment, while Together's resources post also links the public GitHub repo and a build writeup, making this more than a one-off demo.
- Together's launch thread says the package includes an "evaluation dataset" alongside the app and code, which gives teams a reusable reference stack for testing research-agent workflows instead of just inspecting outputs.
What actually shipped
Together's announcement bundles four artifacts at once: the Open Deep Research v2 app, the source code, a technical blog post, and an evaluation dataset. In the launch thread, the company frames the project as a way to "generate detailed reports on any topic with open source LLMs," and the follow-up app post sends users directly to the hosted app.
The resources post is the key engineering detail because it breaks the release into runnable pieces: a live hosted demo, a public GitHub repo, and a build writeup. That makes v2 inspectable at both the product layer and the implementation layer.
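None of these posts document the pipeline internals, but the basic pattern the app is built on, prompting an open model hosted on Together to draft report content, is easy to sketch independently. The snippet below uses Together's public Python SDK; the model choice and prompts are illustrative placeholders, not the app's actual orchestration.

```python
# Minimal sketch of the underlying pattern: one report section drafted by an
# open model served on Together. This is NOT the Open Deep Research v2
# pipeline; the model name and prompts are illustrative placeholders.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

def draft_section(topic: str, section: str) -> str:
    """Ask an open model for a single cited section of a research report."""
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # any hosted open model
        messages=[
            {"role": "system",
             "content": "You write concise, well-cited research report sections."},
            {"role": "user",
             "content": f"Topic: {topic}\nDraft the '{section}' section, citing sources."},
        ],
    )
    return response.choices[0].message.content

print(draft_section("open-source deep research agents", "Background"))
```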
Why this matters for engineers
For engineers building agentic research or report-generation systems, the useful part is the packaging. Closed deep-research demos usually expose only the UX; here, Together is publishing the code and the eval dataset alongside the app, so teams can compare prompting, orchestration, and output quality against a concrete baseline rather than reverse-engineering behavior from screenshots.
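The posts do not document the eval dataset's schema, so any harness built on it today is speculative. As a sketch, assuming the dataset ships as JSONL with a topic and a reference report per row (an assumption, not a documented format), a baseline comparison loop could look like this:

```python
# Hypothetical harness for scoring a report generator against an eval dataset.
# ASSUMPTION: JSONL rows with "topic" and "reference_report" fields; the actual
# Open Deep Research v2 schema may differ, so check the repo before using this.
import json

def load_eval_set(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def keyword_coverage(generated: str, reference: str) -> float:
    """Crude stand-in metric: share of the reference's long words the report covers."""
    ref_terms = {w.lower() for w in reference.split() if len(w) > 5}
    gen_terms = {w.lower() for w in generated.split()}
    return len(ref_terms & gen_terms) / max(len(ref_terms), 1)

def run_eval(generate, path: str = "eval_dataset.jsonl") -> float:
    """Average score for a generate(topic) -> report callable over the dataset."""
    rows = load_eval_set(path)
    scores = [keyword_coverage(generate(r["topic"]), r["reference_report"])
              for r in rows]
    return sum(scores) / len(scores)
```

In practice a team would swap keyword_coverage for an LLM-as-judge or citation-accuracy metric; the point is that a shipped eval dataset turns this loop into a download rather than a labeling project.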
The hosted app also lowers the cost of evaluation before adoption: teams can test the workflow in the live deployment and then inspect how it was assembled in the repo. These posts do not publish benchmark numbers, but the release does provide a full open reference implementation for long-form, cited research agents.