Marc Love

Do we still need TDD?

I’ve been thinking a lot about this question these days. Most of my career, I’ve worked on teams that valued TDD, either religiously or as a common mode of development. TDD is roughly defined by the following principles:


Incremental development. TDD’s red-green-refactor cycle (write a failing test, make it pass with minimal code, then clean up) keeps changes small and reversible. The goal is to never be more than a few minutes away from working code. This reduces the debugging surface area dramatically—if a test fails, you know the problem is in the handful of lines you just wrote. (A minimal example of one full cycle appears just after this list of principles.)

Design feedback, not just verification. Writing tests first forces you to think about your code’s interface and behavior before implementation. If something is hard to test, that difficulty may be a signal that the design has coupling problems, unclear responsibilities, or hidden dependencies. The test acts as the first client of your code.

Executable documentation. Tests describe what the code is supposed to do in concrete terms. Unlike comments or external docs, tests can’t drift out of sync with the implementation because they’ll fail. The goal is a living specification that stays accurate.

Confidence to refactor. A comprehensive test suite means you can restructure code aggressively without fear. The goal is to make the codebase malleable over time rather than increasingly rigid as it grows. (Granted, you can overtest and cause rigidity from the other direction, but that’s a separate discussion.)

Avoid scope creep. The TDD discipline of “write just enough code to make the test pass” serves as a forcing function against scope creep and speculative generalization. Your test suite defines the boundaries of your codebase’s scope.
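
To make that first principle concrete, here is a minimal sketch of a single red-green-refactor pass, pytest style. The function and test names are hypothetical, invented purely for illustration; in practice the test and the implementation would live in separate files.

```python
# Step 1 (red): write a failing test for behavior that doesn't exist yet,
# run it, and watch it fail for the right reason.
def test_discount_is_applied():
    assert discounted_price_cents(1000, percent_off=20) == 800

# Step 2 (green): write just enough code to make that one test pass.
def discounted_price_cents(price_cents: int, percent_off: int) -> int:
    return price_cents * (100 - percent_off) // 100

# Step 3 (refactor): with the passing test as a safety net, rename, extract,
# or add input validation, rerunning the suite after every small change.
```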

Whether you practice TDD or not, I think most engineers understand the value of all the above.

Agentic coders are bad TDD practitioners

If you’re a TDD practitioner, the way that agentic coding tools such as Claude Code, Codex, Gemini, etc., write code is unsettling. They are most definitely not TDD practitioners by nature. And there’s good reason for that: they have been trained on codebases full of abstractions, design patterns, and fully mature architecture. By default, they’ll often produce:

  • Interfaces with one implementation
  • Configuration systems for things that have one value
  • Abstraction layers “in case requirements change”
  • Comprehensive error handling for conditions that can’t occur in your context
  • Plugin architectures when you need exactly one plugin

None of this costs the agent anything to produce. It flows naturally from pattern-matching on existing code. And it doesn’t cost you anything in the moment—you didn’t have to write it. The cost is hidden and bites you in the ass later because there’s more code to understand, more surface area for bugs, and more inertia against change. You can very quickly end up with a bloated codebase that demands a heavy cognitive load to understand and manipulate. Worse, since you didn’t directly author it, you lack the intuitive understanding that you would have gotten through the sweat and tears of writing it by hand.

"VIBES" illustration

When I first started using Cursor with Sonnet 3.5, I got excited by its abilities and decided to push its limits to build a side project idea I had. I started vibe coding (though it hadn’t been labeled as such yet). It was incredible! But soon I crossed a threshold where I realized I had lost my grasp of the codebase. What I thought the codebase was doing and what it was actually doing had diverged several cycles back. Sonnet had begun hallucinating successful implementation of a piece of critical functionality. When I dug deeper, I discovered that its implementation was deeply flawed, and figuring out where to unwind to was quite challenging. I hadn’t been practicing TDD because this was just a little side project for myself and I was having fun exploring the boundaries of this new paradigm. I hadn’t been making small incremental changes and commits. I was just letting Sonnet go and committing whenever I wanted to create a checkpoint.

Forcing traditional TDD into agentic coding is performative

These days, Opus 4.5 and Claude Code (and similar agents) are considerably better at writing more correct code with fewer hallucinations. But none of them are naturally TDD practitioners, and the risks associated with not practicing TDD remain. Even if you explicitly demand that an agent follow TDD in an AGENTS.md/CLAUDE.md file, it will often ignore that instruction. When it does make an attempt to follow the instruction, all it really does is write tests. It doesn’t follow red-green-refactor. It doesn’t incrementally implement functionality. It writes the entire test file and implementation at once and runs the test suite after to see if it passes. This is writing tests; it is not test-driven development.

We could go to a lot of effort to add guardrails (e.g. Claude Code hooks) in an attempt to enforce a workflow that’s more true to TDD, but I’d rather take a step back. TDD is a process designed for human engineers writing software. Its rituals are designed around the human experience of designing, writing, and evolving code based on specifications provided by the product owner. It intentionally adds friction at specific stages of the development process to force the human engineer to think about the design, scope, and behavior of the code they’re writing. Even if we add the ritual to the agentic coding process, the agent is not going to be affected by the process in the same way that the process affects the human coder. It would be performative, not meaningful.

Are TDD values still relevant and valuable?

If forcing traditional TDD into agentic coding is performative, does that mean TDD itself is a relic of the pre-LLM era? Let’s consider why we do TDD in the first place. The principles I outlined above are focused on the concrete actions—the “hows”—of TDD. What are the underlying values—the “whys”—of TDD, and are those values still important in the context of agentic coding?

Rapid feedback loops are valuable. The shorter the gap between writing code and knowing whether it works, the easier problems are to fix and the less context you lose. TDD compresses this loop to seconds or minutes rather than hours or days. This principle extends beyond testing—it’s why compilation errors are easier to fix than runtime errors, and why continuous integration catches integration issues faster than end-of-sprint merges. Whether a human is writing the code or an agent is, a rapid feedback loop is valuable to our process.

Separating concerns in thinking leads to better decisions. TDD explicitly separates three mental modes: deciding what the code should do (writing the test), making it work (green), and making it clean (refactor). Trying to do all three simultaneously leads to muddled decisions. By forcing sequential phases, you can focus fully on each concern without juggling competing goals. Agentic coders can fall into the same trap of muddled decision-making when the context window contains the instructions and history of executing on all three concerns, and often they aren’t even following the red-green-refactor phase gating; they’re doing all three at once.

Designing for testability equals designing for usability. Code that’s easy to test in isolation tends to have clear inputs and outputs, minimal hidden state, explicit dependencies, and well-defined responsibilities. These same properties make code easier to understand, reuse, and modify. Testability becomes a proxy metric for general code quality. This can be even more important in an agentic coding environment where we’re dealing with limited context windows. When business logic is spread across multiple files with poor boundaries, hidden state, and implicit dependencies, the agent will have a much more difficult time reasoning about your code and will struggle to fit it all in the context window without muddling it with unrelated code. This inevitably results in poorer code generation.

Working software as a verifiable ground truth. Rather than reasoning abstractly about whether code is correct, TDD insists on demonstrable behavior. The test suite is a collection of existence proofs—“here is evidence that this specific behavior works.” This shifts arguments about correctness from speculation to deterministic observation. Agentic coders will often speculatively declare that some code is correct when it isn’t. We still need deterministic evidence of correctness of code.

Sustainable pace through reduced rework. Bugs found later cost more to fix, both in time and in collateral changes. TDD front-loads the cost of quality rather than deferring it. The principle is that consistent small investments beat sporadic large ones when compounded over a project’s lifetime. When I was experimenting with Cursor and Sonnet 3.5, if I had been making small, test-verified changes and making frequent commits, not only would I have realized that the code wasn’t doing what I thought it was, but it would have been easier to identify the commit to revert to in order to course correct.

Humility about reasoning ability. TDD assumes we’re not good at holding complex systems in our heads or predicting all edge cases upfront. It substitutes confidence with automated verification, acknowledging that “I think this works” is weaker than “I have a passing test that demonstrates this works.” Coding agents are even worse than we are at holding complex systems in their “heads.” Their context windows are much smaller than our cognitive load capacity.

Scope discipline. TDD’s “write just enough code” and YAGNI (“You Ain’t Gonna Need It”) constraint resists the natural tendency to build for imagined future requirements. By limiting implementation to what the current test demands, you avoid accumulating code that serves no present purpose but carries ongoing maintenance cost. The test suite becomes a forcing function that keeps scope anchored to demonstrable needs rather than speculated ones. This value becomes arguably more critical with agentic coders, since, as noted earlier, agents will freely produce abstractions, plugin architectures, and configuration systems which cost them nothing to generate but burden you with unnecessary complexity.

It’s pretty clear that our “whys” are just as relevant, if not more so, when coding agents are writing code as they are when humans are.


Rethinking TDD principles within agentic constraints

So if forcing traditional TDD into agentic coding is performative but the values of TDD are still relevant, where does that leave us? As I said earlier, TDD rituals were designed around the human experience and our cognitive constraints. When an agent writes code, the constraints shift. Let’s work through each principle and how it might manifest differently in an agentic generation process:

Rapid feedback loops—but feedback on what?

In human TDD, the loop is “write code → run test → learn if code is correct.” With agents, the tighter loop becomes “specify intent → agent generates → learn if agent understood correctly.”

The problem shifts from implementation correctness to specification clarity. You might write a test that passes, but the agent satisfied it in a way that technically works while missing your actual intent. The feedback you need most is whether your specification was unambiguous enough.

This suggests a workflow where you see agent output quickly and in small pieces. Asking an agent to build an entire feature in one shot breaks the feedback loop—you get a wall of code and no way to localize where misunderstandings crept in. Incremental generation with verification checkpoints preserves the principle even if the mechanism looks different.

Separation of concerns—different concerns now.

Human TDD separates “what should it do” (test), “make it work” (green), and “make it clean” (refactor). With agents, the human role shifts almost entirely to the “what” while the agent handles implementation.

But a new concern emerges: validation that intent was preserved through the translation from natural-language specifications to code. You’re now operating in a specify → generate → validate loop. These phases benefit from explicit separation. Trying to specify, review generated code, and assess design quality all at once leads to the same muddled thinking that TDD’s phases were designed to prevent.

This suggests a new set of phases is called for: first write your specification (tests, examples, or natural-language contracts), then let the agent generate without simultaneously reviewing, then validate as a final, distinct step. Mixing them together invites confirmation bias—you see the code and unconsciously adjust your sense of what you wanted.

Testability as a proxy for quality—where’s the friction?

Here’s a real challenge. In human TDD, you experience the pain of testing tightly coupled or poorly designed code. That pain is the signal. Agents don’t feel pain. They’ll happily generate code with hidden dependencies, implicit state, or tangled responsibilities and won’t report any difficulty.

This means testability friction has to be reintroduced deliberately. Some options we could consider:

  • Use a separate review pass (human or agent) specifically focused on testability and design, not just correctness.
  • Ask the agent to generate tests before implementation, from the same specification. If the agent struggles to write clear tests, that’s a design smell surfacing early.
  • Ask the agent to explain how it would test the code it just wrote. Vague or complicated answers indicate problematic structure.

The underlying principle still holds though: testable code is better code, but we need more explicit mechanisms to surface the quality signal that comes from testability friction.

Working software as ground truth—more important, not less.

Agents produce plausible-looking text. Code that reads correctly but doesn’t actually run correctly is a genuine failure mode of agentic coding systems. In fact, it is the basis for Reinforcement Learning with Verifiable Rewards (RLVR), a now-critical post-training technique for LLMs.

This makes execution-based verification more essential than ever. And since LLMs can have a tendency to satisfy specific examples while missing general behavior in ways humans wouldn’t, you might want to add property-based tests and/or fuzzing to your testing strategies.
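
As a sketch of what that might look like in a Python codebase, here is a property-based test using the Hypothesis library. The normalize_tags function is a hypothetical example; the point is that the properties are checked against arbitrary generated inputs, not just the handful of examples an agent chose to satisfy.

```python
from hypothesis import given, strategies as st

def normalize_tags(tags: list[str]) -> list[str]:
    """Hypothetical function under test: lowercase, strip, and de-duplicate tags."""
    seen: set[str] = set()
    result: list[str] = []
    for tag in tags:
        cleaned = tag.strip().lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result

@given(st.lists(st.text()))
def test_normalize_is_idempotent(tags):
    # Normalizing an already-normalized list should change nothing.
    once = normalize_tags(tags)
    assert normalize_tags(once) == once

@given(st.lists(st.text()))
def test_normalized_tags_are_unique_and_clean(tags):
    result = normalize_tags(tags)
    assert len(result) == len(set(result))  # no duplicates
    assert all(t and t == t.strip().lower() for t in result)  # cleaned, non-empty
```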

Humility—now about two unreliable systems.

TDD’s humility principle is about not trusting human reasoning. Now you have two reasoning systems to distrust: yours (for specification) and the agent’s (for implementation).

This suggests value in adversarial or independent verification. Some approaches we might take:

  • Write the specification, have the agent implement, then you write tests independently (not just reviewing agent-generated tests). Your tests probe what you meant; discrepancies reveal either agent misunderstanding or ambiguity in your spec.
  • Have one agent implement and a separate agent (or separate context) review or test. Independence matters here because an agent asked “does this code match this spec” is not the same as an agent asked “here’s a spec, write tests for it” followed by running those tests against the implementation.

Scope discipline—the constraint the agent lacks entirely.

In human TDD, “write just enough code to make the test pass” is self-enforcing: you feel the effort of writing unnecessary code, so you don’t write it. Agents have no such constraint. They’ll generate abstractions, configuration layers, and “future-proofing” infrastructure as readily as the minimal solution because it costs them nothing. The agent won’t naturally resist scope creep; you have to impose it externally.

Human involvement is necessary here because scope decisions require judgment, and that judgment matters when it comes to taking responsibility for the code you commit. We can employ some automated assistance to get us there, but I would not fully delegate this responsibility, especially for meatier code contributions. Some approaches we might take:

  • First of all, be explicit about scope in your specification. Rather than “implement user authentication,” try “implement password-based login for a single user type with no OAuth, no social login, no multi-factor—just email and password.” The agent will still try to over-engineer; explicit constraints give you leverage to push back.
  • Then, use a combination of tests + code coverage tooling to make your test suite the scope boundary. If a piece of generated code isn’t exercised by any test, question whether its existence is justified or speculative. Pruning becomes an explicit part of your review workflow.
  • Even if the tests exercise the code, it doesn’t mean we’ve avoided the scope creep of implementation complexity, so review for YAGNI violations. Specifically ask yourself while reviewing: what code here serves no current requirement? Interfaces with one implementation, configuration for single values, and abstraction layers for hypothetical extensions are candidates for removal.

The underlying value of avoiding accidental complexity from building for imagined futures remains critical, but where TDD’s mechanism was additive friction for the human writing the code, the agentic equivalent may be subtractive review.

What does this look like in practice?

I’m still thinking this through and experimenting in my own Claude Code setup, but I think it roughly looks like:

Phase gates. The specification, generation, and validation phases should be distinct, with explicit transitions. In this workflow, you can’t proceed to the next phase without completing the current one. Not because ritual matters, but because mixing phases leads to the muddled reasoning discussed earlier.

Specification as the source of truth. Use your tests, examples, or contracts as the canonical input. The agent’s implementation is measured against this—not against vague notions of what you wanted. This also creates an audit trail: here’s what was specified, here’s what was generated, here’s how they compare. Something approximating this has already begun to emerge as a proposed methodology through spec-driven development (note: in this context, the specs are natural language, not code), though the solutions are still fairly immature and rapidly evolving.

Automated validation pipelines. After generation, we run the test suite, code coverage, static analysis, and possibly a separate review agent in an automated fashion (e.g. with hooks). This gives us deterministic and relatively instant feedback that can be fed right back into the agentic loop. This minimizes the amount of manual work we have to do in the human review step.
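
Here is a sketch of what such a pipeline script might look like for a Python project. The specific tools (pytest, coverage.py, ruff) and the coverage threshold are assumptions; substitute whatever your project already uses. The script could be run manually after each generation step or wired into an agent hook so that a nonzero exit code feeds failures straight back to the agent.

```python
#!/usr/bin/env python3
"""Post-generation validation gate (sketch).

Assumes pytest, coverage.py, and ruff are installed; swap in your own
test runner, linters, and thresholds.
"""
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],                     # static analysis
    ["coverage", "run", "-m", "pytest", "-q"],  # run the test suite
    ["coverage", "report", "--fail-under=90"],  # enforce a coverage floor
]

def main() -> int:
    for cmd in CHECKS:
        print(f"$ {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # A nonzero exit signals the agent (or the hook that invoked
            # this script) that the change is not yet done.
            print(f"FAILED: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    print("All validation checks passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```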

Minimality checks. Code coverage tools will identify unexercised code for us. We can flag any new code that is unexercised by our specifications and send it to the agent with instructions… “potentially unnecessary—justify or delete.” This gives us an automated mechanism for pushing our agents towards minimality. There will still be times when the agent will overengineer a solution, but again this minimizes the amount of manual review work we’ll have to do.
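
As one possible automation of that check, the sketch below assumes coverage.py’s JSON report (produced by running coverage json after the test suite) and simply lists the lines no test exercises; deciding what to do with each flagged line is still a judgment call.

```python
#!/usr/bin/env python3
"""Minimality check (sketch): list code the test suite never exercises.

Assumes `coverage json` has already been run, producing coverage.json
in coverage.py's JSON report format.
"""
import json

with open("coverage.json") as f:
    report = json.load(f)

for path, data in report["files"].items():
    missing = data.get("missing_lines", [])
    if missing:
        print(f"{path}: lines {missing} are not exercised by any test")
        print("  potentially unnecessary: ask the agent to justify or delete")
```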

Independent verification. The framework could invoke a second agent (or the same agent in a fresh context) to generate a second set of tests from the natural language specification. If the implementation passes the original tests but fails the separately generated ones, we have a discrepancy that points to either a flawed implementation or a misunderstanding of the specification.
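
A sketch of how that comparison might be automated, assuming the original spec tests live in tests/ and the independently generated suite lands in a separate tests_independent/ directory (both paths are placeholders):

```python
#!/usr/bin/env python3
"""Independent-verification check (sketch).

Runs the original tests and the independently generated tests against the
same implementation and flags the interesting failure mode: passing the
original suite while failing the independent one.
"""
import subprocess
import sys

def suite_passes(path: str) -> bool:
    return subprocess.run(["pytest", "-q", path]).returncode == 0

original_ok = suite_passes("tests/")
independent_ok = suite_passes("tests_independent/")

if original_ok and not independent_ok:
    print("Discrepancy: the implementation satisfies the original tests but not")
    print("the independently generated ones. Either the implementation is flawed")
    print("or the specification is ambiguous; review before committing.")
    sys.exit(1)

sys.exit(0 if (original_ok and independent_ok) else 1)
```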

Human review. Finally, while we might not be writing most of our code anymore, we still must take responsibility for the code we commit. Once our automated pipeline believes the implementation is complete, correct, and well-scoped, we still need to review the code manually.