AI Primer

Prime Intellect launches Hosted Evaluations with harnesses, sandboxes, and rollouts viewer

Prime Intellect launched Hosted Evaluations to manage harnesses, sandboxes, and rollout inspection for model testing. The service packages eval infrastructure while still supporting local runs against arbitrary engines, so teams can centralize testing without losing flexibility.


TL;DR

The official launch post, a short product demo, and a rollouts viewer screenshot cover the basics. One useful detail came from Ivan Fioravanti's repost: the same eval stack can still run locally against arbitrary engines. Separately, Vtrivedy10's workflow note shows people already pairing Prime Intellect's CLI with coding agents for RL-style experiment setup.

Hosted Evaluations

Prime Intellect's pitch is blunt: evals are an infra problem. The relayed launch announcement names the usual moving parts, including harnesses and sandboxes, then offers Hosted Evaluations as the layer that manages them.

That makes this a packaging story more than a new benchmark story. johannes_hage describes it as the smoothest way to run evals; the official write-up is on the company blog.

Rollouts viewer

The most concrete product reveal is the rollouts viewer. eliebakouch calls out how easy it is to create runs, inspect outputs, and look at eval data, while xeophon's demo clip shows the feature live instead of as a static screenshot.

For engineering teams, that shifts the product from pure job execution toward inspection. The launch is not just about firing off evals; it is about having a place to look through rollout traces after the run finishes.
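
To make "looking through rollout traces" concrete, here is a minimal sketch of post-run inspection. Everything in it is an assumption for illustration: the record fields (`prompt`, `completion`, `reward`), the JSON Lines layout, and the helper `failed_rollouts` are hypothetical and do not reflect Prime Intellect's schema or API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Rollout:
    # Hypothetical record shape, not Prime Intellect's schema.
    prompt: str
    completion: str
    reward: float


def to_jsonl(rollouts):
    """Serialize rollouts as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r)) for r in rollouts)


def failed_rollouts(jsonl_text, threshold=1.0):
    """Return the records whose reward falls below the threshold."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return [r for r in records if r["reward"] < threshold]


rollouts = [
    Rollout("2 + 2 = ?", "4", 1.0),
    Rollout("Capital of France?", "Lyon", 0.0),
]
trace = to_jsonl(rollouts)

# Inspect only the rollouts that did not earn full reward.
for record in failed_rollouts(trace):
    print(record["prompt"], "->", record["completion"])
```

A hosted viewer presumably does this kind of filtering and browsing through a UI; the sketch only shows why keeping per-rollout records matters once a run has finished.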

Local engines

One of the more useful details came from outside the main announcement. Ivan Fioravanti's repost via TheZachMueller says you can run Prime Intellect evaluations locally against any model or engine.

That means the hosted layer is not presented as an all-or-nothing runtime. Teams can centralize parts of evaluation management without giving up the ability to point the harness at whatever engine they already use.
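
One way to picture a harness that is independent of the engine behind it is to treat the engine as a plain callable. The sketch below is illustrative only: the `Engine` type, `run_eval`, and `echo_engine` are hypothetical names, and nothing here is taken from Prime Intellect's tooling.

```python
from typing import Callable, Iterable, Tuple

# Assumption for illustration: an "engine" is any callable that maps
# a prompt string to a completion string.
Engine = Callable[[str], str]


def run_eval(engine: Engine, cases: Iterable[Tuple[str, str]]) -> float:
    """Score an engine by exact match against the expected answers."""
    cases = list(cases)
    if not cases:
        return 0.0
    correct = sum(1 for prompt, expected in cases if engine(prompt).strip() == expected)
    return correct / len(cases)


def echo_engine(prompt: str) -> str:
    """Stand-in engine; a real one would call a local or hosted model."""
    return prompt.upper()


cases = [("abc", "ABC"), ("hi", "HI"), ("ok", "no")]
accuracy = run_eval(echo_engine, cases)
print(f"accuracy = {accuracy:.2f}")
```

Because `run_eval` only depends on the callable's signature, swapping `echo_engine` for a wrapper around any inference server leaves the harness code unchanged, which is the flexibility the paragraph above describes.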

CLI and agent-driven setup

Vtrivedy10 described using a skill, the Prime Intellect CLI, and Codex's in-context learning to set up an RL-style workflow, debug niche errors, and leave the human role mostly to interpreting results and deciding next steps.

That post is not the product announcement, but it does add a concrete picture of how the tooling is being used around launch day: not only as a UI for hosted evals, but as part of an agent-assisted experimentation loop. eliebakouch's follow-up also points readers to a deeper walkthrough from xeophon, suggesting more product detail is landing outside the initial launch blurb.
