AI Primer

Developers compare red-green TDD, Hurl tests, and label-triggered agents for code verification

Practitioners published tests-first coding-agent workflows built around red-green TDD, Hurl suites, GitHub label actions, and Codex-based execution checks. The pattern matters because verification remains the main bottleneck once generation is fast, especially in longer multi-file sessions.


TL;DR

You can read Simon Willison's full guide, inspect Matt Pocock's workflow YAML, browse Clawsweeper, and skim the Hacker News thread. There is also a fresh Code as Agent Harness paper arguing that tests, repos, logs, and sandboxes work as the agent's external cognition.

Red-green TDD

Tests-first is becoming the default answer when developers talk about coding-agent verification.

r/ClaudeCode: "How are you handling coding agent testing after generation?" (4 comments)

Willison said he is "firmly on team red/green TDD for agent code," while the Reddit thread filled in the mechanics: write or draft the tests first, confirm they fail, let the agent implement, then rerun tests and linters. One commenter in the r/ClaudeCode testing thread said NestJS agents work well with Hurl suites as long as the instruction to run npm run hurl is baked into the repo instructions.

That gives this workflow a concrete shape:

  1. Restate the task.
  2. Draft the tests.
  3. Run the tests to make sure they fail.
  4. Implement the change.
  5. Rerun tests and linters.
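The red and green checks in the steps above can be sketched as a small guard around the implementation step. This is a minimal illustration, not any commenter's actual tooling: `red_green`, `RedGreenError`, and the toy `slugify` example are hypothetical names, and the "codebase" is just a dictionary so the sketch runs on its own.

```python
from typing import Callable


class RedGreenError(Exception):
    """Raised when the red or green expectation is violated."""


def red_green(run_checks: Callable[[], bool], implement: Callable[[], None]) -> None:
    # Red: the freshly drafted tests must fail before the change exists.
    if run_checks():
        raise RedGreenError("tests passed before the change; they do not exercise it")
    # Implement the change (in practice, this is the agent's edit).
    implement()
    # Green: rerun the same checks and require them to pass.
    if not run_checks():
        raise RedGreenError("tests still fail after the change")


# Toy stand-in for a repository: a dict of named functions (hypothetical).
codebase = {}


def checks() -> bool:
    # The drafted test: slugify must exist and lowercase/hyphenate its input.
    slugify = codebase.get("slugify")
    return slugify is not None and slugify("Hello World") == "hello-world"


def implement() -> None:
    # The implementation step, written only after the test was seen to fail.
    codebase["slugify"] = lambda s: s.lower().replace(" ", "-")


red_green(checks, implement)
```

In a real repository, `run_checks` would more likely wrap the project's own commands, such as the `npm run hurl` suite mentioned above plus the linter, and report whether they exited successfully.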

Label-triggered agents

The other pattern in the evidence is orchestration through GitHub itself.

Pocock's setup is dead simple: add a label, fire an action. The labels he listed were agent:implement, agent:update-branch, agent:review, and agent:to-issues, with the trigger defined inside a standard GitHub Actions workflow file rather than a custom agent dashboard.

That matters because it keeps the handoff point visible inside the repo. The label is the permission boundary, the action is the runner, and the PR stays the unit of work.
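One way to picture the label-as-trigger idea is a small routing function over a GitHub "labeled" event. This is a sketch under assumptions, not Pocock's workflow file: the label names come from the list above, while the task names and the `task_for_event` helper are hypothetical. The only payload fields it relies on are `action` and `label.name`.

```python
from typing import Optional

# Labels listed in the article; the task names on the right are hypothetical.
LABEL_TO_TASK = {
    "agent:implement": "implement",
    "agent:update-branch": "update_branch",
    "agent:review": "review",
    "agent:to-issues": "to_issues",
}


def task_for_event(event: dict) -> Optional[str]:
    """Return the agent task for a 'labeled' event, or None to ignore it."""
    # Only label additions act as triggers; every other action is ignored.
    if event.get("action") != "labeled":
        return None
    label_name = (event.get("label") or {}).get("name")
    # Unknown labels map to None, so ordinary labels never start an agent.
    return LABEL_TO_TASK.get(label_name)
```

Because unknown labels and non-label events both return `None`, the agent only runs when someone deliberately applies one of the four labels, which is the permission boundary described above.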

Execution checks

Several developers are now splitting generation from verification instead of asking one agent to do both jobs.

According to PerceptualPeak's workflow report, Claude Code worked best as the implementation agent while Codex handled planning and execution verification. In steipete's autotriage thread, Codex only works issues that fit a narrow filter: they match the repo vision, have a clear fix, can be inferred from code with high confidence, and can be live-tested in a VM with computer vision before the human reviews the suggestion.
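The narrow filter described for the autotriage setup amounts to a conjunction of four conditions. The sketch below is illustrative only: `IssueAssessment` and `eligible_for_agent` are hypothetical names, and each judgment (including "high confidence") is collapsed to a boolean that some earlier step is assumed to have produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueAssessment:
    matches_repo_vision: bool
    has_clear_fix: bool
    inferable_from_code: bool  # "high confidence" collapsed to a yes/no here
    live_testable_in_vm: bool


def eligible_for_agent(a: IssueAssessment) -> bool:
    # Every condition must hold; failing any one leaves the issue to a human.
    return all((
        a.matches_repo_vision,
        a.has_clear_fix,
        a.inferable_from_code,
        a.live_testable_in_vm,
    ))
```

Requiring all four keeps the agent's queue small by design: an issue with a clear fix that cannot be live-tested still falls outside the filter.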

The same verification instinct shows up at smaller scale. onusoz's Codex desktop post praised automations for tracking Clawsweeper automerge status without burning tokens on continuous polling, while HCSolakoglu's critique argued products like Codex still need release-to-release regression checks on cache hit ratio, context rot, token counts, runtime, tool behavior, and SWE-Bench Pro subsets.
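A release-to-release regression check of the kind the critique asks for can be sketched as a comparison of metrics against a baseline. Everything here is an assumption for illustration: the metric names cover only three of the items listed above, the 5% relative tolerance is arbitrary, and `regressions` is a hypothetical helper rather than anything from the cited posts.

```python
# Direction per metric: +1 means higher is better, -1 means lower is better.
METRIC_DIRECTION = {
    "cache_hit_ratio": +1,
    "token_count": -1,
    "runtime_seconds": -1,
}


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.05) -> list:
    """Return names of metrics that got worse by more than `tolerance` (relative).

    Assumes every baseline value is positive, so the relative change is defined.
    """
    worse = []
    for name, direction in METRIC_DIRECTION.items():
        old, new = baseline[name], candidate[name]
        # After applying the direction, a positive value means improvement.
        relative_change = direction * (new - old) / old
        if relative_change < -tolerance:
            worse.append(name)
    return worse
```

Called with measurements from the previous and the new release, the function returns the metrics that degraded beyond the tolerance, and an empty list means the release passes this gate.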

Hardening techniques

The strongest new wrinkle in the discussion is agents writing the tests that harden other agents.

Martin said agents can decide when property testing fits a function, generate the domains and ranges, run the tests, and fix the failures, which he said already surfaced two production bugs. James Long's reply described a similar move inside OpenCode, where a property-based testing system required simulated filesystem and network layers.
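To make the property-testing idea concrete, here is a hand-rolled sketch using only the standard library; a dedicated library such as Hypothesis would normally do this job, with better generators and shrinking. The run-length codec, the `check_property` helper, and the `random_text` generator are all hypothetical examples, not code from Martin's or Long's projects.

```python
import random
from itertools import groupby


def rle_encode(s: str) -> list:
    # Collapse each run of identical characters into a (char, count) pair.
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(pairs: list) -> str:
    return "".join(ch * count for ch, count in pairs)


def check_property(prop, gen, runs: int = 200, seed: int = 0):
    """Return the first generated input that violates `prop`, or None."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(runs):
        value = gen(rng)
        if not prop(value):
            return value
    return None


def random_text(rng: random.Random) -> str:
    # Input domain: short strings over a tiny alphabet, so runs are common.
    return "".join(rng.choice("ab ") for _ in range(rng.randint(0, 12)))


# Property: decoding an encoded string gives back the original string.
counterexample = check_property(
    lambda s: rle_decode(rle_encode(s)) == s, random_text
)
```

The generator plays the role of the input domain and the lambda states the expected relationship, which are the two pieces the agent is described as producing before running the tests and fixing what fails.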

The hardening menu in the evidence is broader than unit tests.

zeeg's pushback is the useful correction here. He said agents are still poor at managing quality in systems with messy network interactions or in agent-on-agent setups, and his follow-up added that the issue is often not testability itself, but agents failing to stay within the constraints needed to produce good tests.

Code as harness

The broader framing around all of this is that verification is becoming part of the agent runtime, not an afterthought.

Hacker News: "Vibe coding and agentic engineering are getting closer than I'd like" (787 upvotes · 885 comments)

The Code as Agent Harness paper, cited in Rohan Paul's summary, argues that code works as the environment an agent thinks inside: tests become sensors, repositories become memory, logs become history, and sandboxes become boundaries. The HN discussion sharpened the social version of that claim, with commenters in the main thread separating lightweight vibe coding from multi-step agentic engineering by the presence of quality gates, review stages, and explicit accountability.

That is why the most interesting artifacts in this story are not new prompts. They are the verification surfaces developers are bolting on around the model: failing tests, label-triggered runs, VM checks, automerge monitors, and harnesses that keep the agent answer executable long enough to dispute it.
