Release · July 3, 2026

harbor exec launches agentic-map-reduce CLI via npx skills add harbor-exec

harbor exec launched as an agentic-map-reduce CLI, installed with npx skills add harbor-exec. Use it to run sandboxed agents for trace analysis, session mining, search, and rollout aggregation.

TL;DR

alexgshaw's launch post introduced harbor exec, an agentic-map-reduce CLI for running and aggregating sandboxed agents over traces, sessions, search, and related workloads.
alexgshaw's terminal example shows a concrete trace-mining run over Codex session JSONL with cursor-cli map workers, Modal execution, and a Claude Code reducer.
alexgshaw's rollout post framed the bigger target as rollouts for eval, RL, GEPA, prod, trajectory analysis, and SFT data generation.
hwchase17's LangChain thread put Harbor into LangChain's launch week through a tutorial for long-running, stateful evals, while Vtrivedy10's eval-loop post tied the stack to trace analysis and agent improvement loops.

Harbor's GitHub README describes the project as a framework for evaluating and improving agents, and Harbor's docs say it grew out of Terminal-Bench usage for custom evals, prompt optimization, RL, SFT trace generation, and CI/CD agent testing. LangChain's Harbor integration post shows the current partner path. Cognition's Agentic MapReduce writeup and Matt Rickard's git-branch rebuild explain why the pattern is getting so much attention this week.

harbor exec

harbor exec is a skill-installed CLI surface for agentic map-reduce jobs: execute many sandboxed agent runs, then aggregate their outputs.

The launch post gives four starting workloads:

trace analysis
agent session mining
search
general execute-and-aggregate workflows

The install command in the post is:

npx skills add harbor-exec

Harbor's task tutorial already uses the same npx skills add harbor-framework/harbor pattern for its create-task skill. The new bit is the harbor-exec skill, aimed at running agent swarms over existing artifacts rather than authoring a Harbor task from scratch.

Agentic MapReduce

The pattern has a simple shape: split a large workset, run focused agents in parallel, reduce their outputs into a smaller artifact.

Cognition's Agentic MapReduce post breaks the codebase-scale version into these stages:

Plan: an agent writes selectors for the repo and task.
Shard: deterministic selectors produce bounded batches.
Map: one agent per batch investigates in parallel.
Reduce: a reducer agent dedupes, reconciles, and synthesizes results.
Verify: Security Swarm adds sandboxed verification for serious findings.
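
The Shard, Map, and Reduce stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cognition's or Harbor's implementation: the function names, the "touches auth" rule, and the finding format are all hypothetical, the map worker is a plain function standing in for a sandboxed agent, and the Plan and Verify stages are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(items, batch_size):
    """Shard: deterministically split a workset into bounded batches."""
    ordered = sorted(items)  # sorting makes batch membership reproducible
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def map_worker(batch):
    """Map: stand-in for one agent investigating one batch.

    Hypothetical rule: flag any path whose name contains 'auth'.
    """
    return [{"file": path, "finding": "touches auth"} for path in batch if "auth" in path]


def reduce_findings(per_batch):
    """Reduce: flatten worker outputs and dedupe by (file, finding)."""
    seen = set()
    merged = []
    for findings in per_batch:
        for finding in findings:
            key = (finding["file"], finding["finding"])
            if key not in seen:
                seen.add(key)
                merged.append(finding)
    return sorted(merged, key=lambda f: f["file"])


def run(items, batch_size=2, n_concurrent=4):
    """Run map workers in parallel over the shards, then reduce."""
    batches = shard(items, batch_size)
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        per_batch = list(pool.map(map_worker, batches))
    return reduce_findings(per_batch)


files = ["src/auth.py", "src/db.py", "src/auth_test.py", "README.md", "src/auth.py"]
result = run(files)
```

In a real job the map worker and the reducer would each be an agent call; the point of the sketch is only that sharding is deterministic and bounded, the map step is embarrassingly parallel, and the reducer is where duplicates get reconciled.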

Rickard's rebuild used git branches as the agent-to-agent communication layer, with worker branches, candidate branches, and git merge as the reduce primitive. Hamel Husain pointed to DocETL as an earlier place he saw the agentic MapReduce approach work beyond security.

Rollouts

The most important reply was not about the CLI command. It was the word "rollout."

alexgshaw listed the surfaces Harbor wants to cover:

eval rollouts
RL rollouts
GEPA rollouts
production rollouts
trajectory analysis
SFT data generation

In a reply about team learning and session aggregation, alexgshaw called Harbor "the rollout primitive that powers every optimization loop" and said common optimization loops would eventually be implemented on Harbor primitives. andykonwinski put the narrower adoption case plainly: if an agent is already using Harbor, harbor exec is the easy distributed-compute path.

LangSmith integration

LangChain's Harbor x LangChain post describes the integration around three pieces: Deep Agents, LangSmith Sandboxes, and LangSmith Observability.

The LangSmith docs spell out the execution path:

Harbor is the execution layer for sandboxed evals and rollouts.
LangSmith evaluations record Harbor jobs as experiments with --plugin langsmith.
Deep Agents or LangGraph apps can run as Harbor agents.
LangSmith sandboxes provide isolated per-trial environments.
Results, traces, verifier feedback, and costs land in LangSmith.

Vtrivedy10's longer thread named the operational pain points behind the integration: storing and sharing coding-agent evals, understanding behavior across rollouts, and turning observed failures into new evals. Vtrivedy10's earlier Harbor x LangSmith flow described the loop as reward metrics, traces, and rollouts flowing into experiments and tracing projects, then agents or Engine reading that data to explain behavior changes across iterations.

Trace mining example

The best concrete example is a failure-analysis job over local agent sessions.

The command in alexgshaw's terminal screenshot maps over ~/.codex/sessions/2026/06/2*/rollout-*.jsonl, asks each worker to write analysis.json when the human corrected the agent, then reduces recurring mistakes into feedback.md.

The run uses:

--env modal
--agent cursor-cli
--model cursor/composer-2.5
--n-concurrent 32
--reduce-agent claude-code
--reduce-model anthropic/claude-fable-5
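
To make the reduce step concrete, here is a rough Python sketch of what turning per-session analysis.json files into a feedback.md could look like if done deterministically. It is not how harbor exec or the Claude Code reducer works. The analysis.json schema (a boolean `corrected` field and a string `mistake` field), the directory layout, and both function names are assumptions made for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def load_analyses(root):
    """Read every analysis.json under root (assumed: one file per map worker)."""
    return [json.loads(p.read_text()) for p in sorted(Path(root).rglob("analysis.json"))]


def build_feedback(analyses, min_count=2):
    """Count mistakes from sessions where the human corrected the agent,
    keep those seen at least min_count times, and render a markdown list."""
    counts = Counter(a["mistake"] for a in analyses if a.get("corrected"))
    recurring = [(m, n) for m, n in counts.most_common() if n >= min_count]
    lines = ["# Recurring mistakes", ""]
    lines += [f"- {mistake} ({n} sessions)" for mistake, n in recurring]
    return "\n".join(lines) + "\n"


sample = [
    {"corrected": True, "mistake": "ran full test suite"},
    {"corrected": True, "mistake": "ran full test suite"},
    {"corrected": False},
    {"corrected": True, "mistake": "edited generated file"},
]
feedback = build_feedback(sample)
```

An agent reducer earns its place where exact-string counting like this breaks down: workers rarely phrase the same mistake identically, so grouping them requires judgment rather than a `Counter`.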

That example pairs neatly with alexgshaw's Harbor Hub anecdote, where an agent fetched an entire Postgres table and filtered client-side. The failure category is mundane, expensive, and exactly the kind of behavior a trace-mining reducer can turn into a reusable eval target.
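
For readers who have not hit this failure before, the difference looks roughly like the following. The sketch uses Python's built-in sqlite3 as a stand-in for Postgres, and the `runs` table, its columns, and the row counts are invented for illustration; none of it comes from the anecdote itself.

```python
import sqlite3

# In-memory stand-in for the real database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO runs (status) VALUES (?)",
    [("ok",)] * 998 + [("failed",)] * 2,
)


def failed_client_side(conn):
    """Anti-pattern: transfer every row, then filter in application code."""
    rows = conn.execute("SELECT id, status FROM runs ORDER BY id").fetchall()
    return len(rows), [r[0] for r in rows if r[1] == "failed"]


def failed_server_side(conn):
    """Push the filter into the query so only matching rows are transferred."""
    rows = conn.execute(
        "SELECT id FROM runs WHERE status = ? ORDER BY id", ("failed",)
    ).fetchall()
    return len(rows), [r[0] for r in rows]


fetched_all, ids_all = failed_client_side(conn)
fetched_few, ids_few = failed_server_side(conn)
```

Both functions return the same ids, but the first transfers every row in the table to get them. An eval built from this failure could check the number of rows fetched rather than only the final answer.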