Release · May 27, 2026

Trajectory launches continual-learning platform with off-policy SDPO

Trajectory launched a platform that turns agent traces and user corrections into post-deployment model updates instead of prompt-only fixes. Baseten and Tinker described live A/B post-training, 397B-model deployment work, and an off-policy recipe for stabilizing the loop.

TL;DR

rohanpaul_ai's overview says Trajectory is selling a continual-learning loop that trains on full agent trajectories, including what users accepted, edited, retried, or fixed later, instead of stopping at prompt tweaks.
According to tinkerapi's technical pointer, the core technical claim is an off-policy SDPO recipe meant to stabilize training when each task may only yield a single stale rollout.
Baseten's launch post adds concrete deployment detail: Trajectory ran live customer A/B tests starting in April 2026, used FP8 and NVFP4 quantization, and expanded H100 capacity 3x in under three weeks with zero outages.
kevinhou22's Windsurf note ties the idea to coding-model practice, where agent and user traces were already used internally to improve SWE-1.

You can see the official launch being passed around in andyzhang's repost, the infra story in Baseten's thread, and the sharper technical hook in tinkerapi's link to the technical post. kevinhou22's Windsurf anecdote makes the pitch less theoretical, while Vtrivedy10's trace loop shows how much of the current agent stack already points in this direction.

Trajectory

Trajectory is launching as both a research lab and a product company built around post-deployment learning, according to andyzhang's repost of the company launch. The basic claim, as rohanpaul_ai's summary frames it, is that production AI is still mostly frozen software even though users generate corrections all day.

The company centers everything on the "trajectory": not just the model output, but the agent actions and the user's follow-on behavior. rohanpaul_ai's breakdown says that includes accepted answers, rejections, edits, retries, and fixes that happen later, which lets Trajectory train on full failure chains instead of isolated bad completions.
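As a rough illustration of what such a record could hold, here is a minimal sketch of a trajectory as a data structure. The class names, field names, and the rule that a failure chain starts at the first negative user signal are all assumptions made for this example; the launch material does not publish Trajectory's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class UserSignal(Enum):
    """User follow-on behaviors named in the summary (labels are hypothetical)."""
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    EDITED = "edited"
    RETRIED = "retried"
    FIXED_LATER = "fixed_later"


# Assumption: anything other than an acceptance counts as a negative signal.
NEGATIVE_SIGNALS = {
    UserSignal.REJECTED,
    UserSignal.EDITED,
    UserSignal.RETRIED,
    UserSignal.FIXED_LATER,
}


@dataclass
class Step:
    """One agent action plus the user's reaction to it, if any."""
    action: str                          # e.g. a tool call or a completion
    output: str
    signal: Optional[UserSignal] = None  # None when the user did not react


@dataclass
class Trajectory:
    task_id: str
    model_version: str                   # which model produced this trace
    steps: List[Step] = field(default_factory=list)

    def failure_chain(self) -> List[Step]:
        """Return the steps from the first negative signal to the end, or []."""
        for i, step in enumerate(self.steps):
            if step.signal in NEGATIVE_SIGNALS:
                return self.steps[i:]
        return []
```

Keeping the whole chain, rather than only the step that drew the correction, is what separates this shape from a log of isolated bad completions.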

That matters because the same trace data can drive changes to three levers at once:

- model weights
- prompts
- the surrounding agent harness

kevinhou22's post is the most concrete outside validation in the evidence set. He says Windsurf used agent and user traces, internally called trajectories, to continuously improve its coding model SWE-1.

Off-policy SDPO

The most specific technical reveal came from tinkerapi's post linking the technical writeup. It says Trajectory is using an off-policy SDPO recipe because continual learning in production rarely gives you clean on-policy training data.

The problem statement in that post is unusually crisp: by the time you train, a single rollout per task is already off-policy. In other words, the logged trace came from an earlier model, prompt stack, or tool setup, but it is still the main data you have.
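To make the off-policy problem concrete, the sketch below shows one generic way to correct for stale data: a truncated importance weight that compares how likely the logged tokens are under the current policy versus the policy that produced them. This is a textbook technique shown for illustration only. It is not Trajectory's SDPO recipe, whose details the posts do not spell out, and the function names and the cap value are assumptions.

```python
import math


def truncated_importance_weight(logp_current, logp_behavior, cap=2.0):
    """Sequence-level ratio pi_current / pi_behavior, capped at `cap`.

    Both arguments are per-token log-probabilities of the same logged tokens:
    one list scored by the model being trained, one by the model that
    originally produced the rollout.
    """
    if len(logp_current) != len(logp_behavior):
        raise ValueError("log-prob lists must cover the same tokens")
    log_ratio = sum(logp_current) - sum(logp_behavior)
    # Capping bounds the variance a single stale rollout can inject.
    return min(math.exp(log_ratio), cap)


def weighted_loss(per_sample_losses, weights):
    """Importance-weighted mean loss over a batch of logged rollouts."""
    if len(per_sample_losses) != len(weights):
        raise ValueError("need one weight per sample")
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / len(per_sample_losses)
```

With only one rollout per task there is no averaging across samples to smooth out an extreme ratio, which is why some form of truncation or clipping is a common stabilizer in off-policy setups.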

Vtrivedy10's thread helps place that claim inside the broader agent workflow. It breaks the loop into tracing, understanding failures, then choosing an intervention: harness engineering, eval creation, model swaps, or post-training. Trajectory's bet is that post-training can become a routine production lever rather than a rare research event.
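The trace, diagnose, intervene loop described above can be sketched as a simple routing step. The four interventions come from the thread; the failure labels and the routing table are hypothetical placeholders, since the thread does not prescribe which failure maps to which lever.

```python
from collections import Counter
from enum import Enum


class Intervention(Enum):
    HARNESS_ENGINEERING = "harness engineering"
    EVAL_CREATION = "eval creation"
    MODEL_SWAP = "model swap"
    POST_TRAINING = "post-training"


# Hypothetical routing table: which lever a team might reach for per label.
ROUTING = {
    "tool_error": Intervention.HARNESS_ENGINEERING,
    "untested_behavior": Intervention.EVAL_CREATION,
    "capability_gap": Intervention.MODEL_SWAP,
    "repeated_user_correction": Intervention.POST_TRAINING,
}


def choose_intervention(failure_labels, routing=ROUTING):
    """Return the intervention mapped to the most frequent known label, or None."""
    counts = Counter(label for label in failure_labels if label in routing)
    if not counts:
        return None
    top_label, _ = counts.most_common(1)[0]
    return routing[top_label]
```

In this framing, post-training is just one more entry in the table, which is the shift the article attributes to Trajectory's pitch.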

Baseten's deployment details

Baseten supplied the hardest numbers in the launch material. According to its launch thread, Baseten helped Trajectory quantize and deploy a 397-billion-parameter model, ran automated deployment pipelines, and provided autoscaled H100 infrastructure.

The same post says Trajectory started live A/B tests with customers in April 2026, saw zero outages during the launch period, and expanded capacity 3x in under three weeks. Baseten's follow-up describes the workload as "live, continuous post-training for frontier-scale models," which is a more aggressive description than the usual fine-tuning launch copy.

Those details also narrow the product shape. This is not positioned as offline model customization with occasional retrains. Baseten's account describes an always-on deployment loop tied directly to customer traffic.

Trace data is becoming training data

natolambert's post argues that continual learning is most likely to show up first in knowledge-work products, naming Cursor, Claude, and Copilot as the kind of systems that could learn from real usage. That matches kevinhou22's Windsurf example, where trace data from a coding product already fed model improvement.

The surrounding tooling ecosystem has been moving in the same direction, but usually one layer earlier in the loop. Braintrust's workshop post is about mining support tickets and production traces into evals, while LangChain's post points to using traces to build evals for production agents. Vtrivedy10's thread lists post-training as one option after teams have enough trace visibility.

Trajectory's launch is the point where that trace stack gets marketed not just as observability or eval generation, but as a path to updating the model itself.

The neighboring products are converging on the same loop

Trajectory is not the only company framing production improvement as a product surface this week. fastinoAI's Pioneer post says Fastino is demoing a system that autonomously improves models in production through continual learning.

The contrast is useful. Braintrust's post says production traces need human expertise to become golden datasets that improve over time, while Fastino's post and rohanpaul_ai's description of Trajectory both lean toward a more automated learning loop. The shared pattern is that support tickets, traces, edits, and retries are being treated less like logs and more like raw training substrate.