workflow | May 25, 2026

Developers compare red-green TDD, Hurl tests, and label-triggered agents for code verification

Practitioners published tests-first coding-agent workflows built around red-green TDD, Hurl suites, GitHub label actions, and Codex-based execution checks. The pattern matters because verification remains the main bottleneck once generation is fast, especially in longer multi-file sessions.


TL;DR

Simon Willison's red-green TDD post pushed a tests-first pattern for agent code, and the r/ClaudeCode testing thread showed the same instinct in practice, with developers using failing tests, linters, smoke tests, and Hurl suites as the gate before merge.
Matt Pocock's label-triggered workflow turned GitHub labels into agent entrypoints, and his workflow file link showed the trigger living directly inside a GitHub Action rather than in a separate control plane.
One Codex and Claude Code workflow report described a split setup where Codex plans and verifies while Claude Code implements, which lines up with steipete's autotriage skill using VM-based execution checks before human review.
Robert C. Martin's property-testing thread argued agents can profitably add property tests, but zeeg's reply said harder systems still break because agents drift outside the constraints needed to write reliable tests.
The main HN thread kept returning to the same divide: fast generation is the easy part, while review capacity and fake testing are still the real bottlenecks, as Aiden Bai's guardrails complaint argued.

You can read Simon Willison's full guide, inspect Matt Pocock's workflow YAML, browse Clawsweeper, and skim the Hacker News thread. There is also a fresh Code as Agent Harness paper arguing that tests, repos, logs, and sandboxes work as the agent's external cognition.

Red-green TDD

Tests-first is becoming the default answer when developers talk about coding-agent verification.

r/ClaudeCode: "How are you handling coding agent testing after generation?" (4 comments)

Willison said he is "firmly on team red/green TDD for agent code," while the Reddit thread filled in the mechanics: write or draft the tests first, confirm they fail, let the agent implement, then rerun tests and linters. One commenter in the r/ClaudeCode testing thread said NestJS agents work well with Hurl suites as long as the "npm run hurl" command is baked into the repo instructions.

That gives this workflow a concrete shape:

Restate the task.
Draft the tests.
Run the tests to make sure they fail.
Implement the change.
Rerun tests and linters.
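The gate implied by those steps can be sketched in a few lines of Python. This is a minimal illustration, not anything posted by Willison or the Reddit commenters: red_green_cycle, run_tests, and implement are hypothetical names, and run_tests is assumed to wrap whatever suite and linters the repo uses, returning True only when everything passes.

```python
from typing import Callable


def red_green_cycle(run_tests: Callable[[], bool],
                    implement: Callable[[], None]) -> str:
    """Run one red-green cycle; run_tests returns True when the suite passes."""
    # Red: freshly drafted tests must fail before any implementation exists.
    if run_tests():
        return "rejected: tests passed before implementation"
    # Implement: hand the change to the agent (or a human).
    implement()
    # Green: the same tests must pass now.
    if not run_tests():
        return "rejected: tests still failing after implementation"
    return "accepted"


# Toy usage: the "implementation" flips a flag that the fake suite checks.
state = {"implemented": False}
result = red_green_cycle(
    run_tests=lambda: state["implemented"],
    implement=lambda: state.update(implemented=True),
)
print(result)
```

The red check is the part that is easy to skip: a test that passes before the change exists proves nothing about the change.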

Label-triggered agents

The other pattern in the evidence is orchestration through GitHub itself.

Pocock's setup is dead simple: add a label, fire an action. The labels he listed were agent:implement, agent:update-branch, agent:review, and agent:to-issues, with the trigger defined inside a standard GitHub Actions workflow file rather than a custom agent dashboard.

That matters because it keeps the handoff point visible inside the repo. The label is the permission boundary, the action is the runner, and the PR stays the unit of work.
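The routing idea can be sketched as a plain lookup table. Pocock's actual trigger lives in GitHub Actions YAML, which is not reproduced here; the Python below only illustrates the label-to-job mapping, and every job name on the right-hand side is a hypothetical placeholder.

```python
from typing import Optional

# Labels as listed above; the job names are hypothetical placeholders.
LABEL_TO_JOB = {
    "agent:implement": "run_implement",
    "agent:update-branch": "run_update_branch",
    "agent:review": "run_review",
    "agent:to-issues": "run_to_issues",
}


def job_for_label(label: str) -> Optional[str]:
    """Return the job to start for a newly added label, or None to ignore it."""
    return LABEL_TO_JOB.get(label)


print(job_for_label("agent:review"))
print(job_for_label("bug"))
```

Any label outside the table is ignored, which is what makes the label set act as an allowlist for agent runs.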

Execution checks

Several developers are now splitting generation from verification instead of asking one agent to do both jobs.

According to PerceptualPeak's workflow report, Claude Code worked best as the implementation agent while Codex handled planning and execution verification. In steipete's autotriage thread, Codex only works issues that fit a narrow filter: they match the repo vision, have a clear fix, can be inferred from code with high confidence, and can be live-tested in a VM with computer vision before the human reviews the suggestion.
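A narrow filter like that amounts to a conjunction of checks. The sketch below is an assumption about shape only, not steipete's implementation: the Issue fields and eligible_for_agent are hypothetical names, one per criterion named in the thread, and how each flag gets decided is left out.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # Hypothetical fields, one per criterion described in the thread.
    matches_repo_vision: bool
    has_clear_fix: bool
    inferable_from_code: bool  # inferable from code with high confidence
    live_testable_in_vm: bool


def eligible_for_agent(issue: Issue) -> bool:
    """An issue qualifies only if every criterion holds."""
    return all((
        issue.matches_repo_vision,
        issue.has_clear_fix,
        issue.inferable_from_code,
        issue.live_testable_in_vm,
    ))


print(eligible_for_agent(Issue(True, True, True, True)))
print(eligible_for_agent(Issue(True, True, True, False)))
```

Requiring all four conditions means an issue that cannot be live-tested stays with a human, however obvious the fix looks.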

The same verification instinct shows up at smaller scale. onusoz's Codex desktop post praised automations for tracking Clawsweeper automerge status without burning tokens on continuous polling, while HCSolakoglu's critique argued products like Codex still need release-to-release regression checks on cache hit ratio, context rot, token counts, runtime, tool behavior, and SWE-Bench Pro subsets.
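A release-to-release check of this kind could look roughly like the following. It is a sketch under stated assumptions: the metric names, the 5% tolerance, and the sample numbers are all made up for illustration, and it covers only three of the metrics listed above.

```python
# Hypothetical metric names; True means a higher value is better.
HIGHER_IS_BETTER = {
    "cache_hit_ratio": True,
    "total_tokens": False,
    "runtime_seconds": False,
}


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.05) -> list:
    """Return metrics where candidate is worse than baseline by more than tolerance."""
    flagged = []
    for name, higher_better in HIGHER_IS_BETTER.items():
        old, new = baseline[name], candidate[name]
        change = (new - old) / old  # relative change; assumes old is nonzero
        worse = -change if higher_better else change
        if worse > tolerance:
            flagged.append(name)
    return flagged


# Made-up numbers for two releases.
baseline = {"cache_hit_ratio": 0.80, "total_tokens": 12000, "runtime_seconds": 40.0}
candidate = {"cache_hit_ratio": 0.60, "total_tokens": 12100, "runtime_seconds": 39.0}
print(regressions(baseline, candidate))
```

With these sample values only the cache hit ratio is flagged, since the token and runtime changes stay inside the tolerance.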

Hardening techniques

The strongest new wrinkle in the discussion is agents writing the tests that harden other agents.

Martin said agents can decide when property testing fits a function, generate the domains and ranges, run the tests, and fix the failures, which he said already surfaced two production bugs. James Long's reply described a similar move inside OpenCode, where a property-based testing system required simulated filesystem and network layers.
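For readers who have not used the technique, here is a minimal property test written with only the Python standard library. It is not taken from Martin's or Long's code: the run-length encoder is a stand-in function, and the generator is a hand-rolled assumption (a dedicated library such as Hypothesis would add input shrinking and richer strategies).

```python
import random


def encode(text: str) -> list:
    """Run-length encode text into [char, count] pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return runs


def decode(runs: list) -> str:
    """Invert encode by repeating each character count times."""
    return "".join(ch * count for ch, count in runs)


def check_roundtrip(trials: int = 500, seed: int = 0) -> None:
    """Property: decode(encode(s)) == s for every generated string s."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Domain: short strings over a tiny alphabet, so repeated runs are common.
        s = "".join(rng.choice("ab") for _ in range(rng.randint(0, 12)))
        assert decode(encode(s)) == s, f"roundtrip failed for {s!r}"


check_roundtrip()
```

The test states an invariant over a generated domain instead of listing hand-picked cases, which is the part Martin described delegating to the agent.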

The hardening menu in the evidence is broader than unit tests:

Property tests, per Martin's thread
Hurl or HTTP-level regression suites, per the Reddit testing thread
Live UI or VM checks with computer vision, per steipete's autotriage skill
Evals-as-tests for harder agent systems, per zeeg's critique

zeeg's pushback is the useful correction here. He said agents are still poor at managing quality in systems with messy network interactions or in agent-on-agent setups, and his follow-up added that the issue is often not testability itself, but agents failing to stay within the constraints needed to produce good tests.

Code as harness

The broader framing around all of this is that verification is becoming part of the agent runtime, not an afterthought.

Hacker News thread: "Vibe coding and agentic engineering are getting closer than I'd like." The discussion covers how coding agents fit into real software delivery: where they help most, where humans still need to review, and how teams should think about tests, specs, and accountability when using LLMs in production.

The Code as Agent Harness paper, cited in Rohan Paul's summary, argues that code works as the environment an agent thinks inside: tests become sensors, repositories become memory, logs become history, and sandboxes become boundaries. The HN discussion sharpened the social version of that claim, with commenters in the main thread separating lightweight vibe coding from multi-step agentic engineering by the presence of quality gates, review stages, and explicit accountability.

That is why the most interesting artifacts in this story are not new prompts. They are the verification surfaces developers are bolting on around the model: failing tests, label-triggered runs, VM checks, automerge monitors, and harnesses that keep the agent's answer executable long enough to dispute it.