Spec-Driven Development: When Structure Helps and When It Becomes Tax

Tisha Chawla · Software Engineer at Microsoft

The bottleneck in AI-assisted engineering is no longer code generation. It is intent preservation.

A human engineer can resolve ambiguity using product context, team memory, architectural taste, and accumulated judgment. A coding agent cannot assume that context unless it is written somewhere durable. When intent only exists in chat history, the model fills the gaps with plausible assumptions. Some assumptions are harmless. Others shape the architecture before anyone notices they were wrong.

I call this the ambiguity tax: the rework, context drift, and architectural fragmentation caused by vague requirements entering an automated coding loop. IBM describes related failure modes in AI-assisted development, including context drift and fragmentation when generated code lacks complete system context. The GitHub Blog makes the same underlying point from the agent side: coding agents should be treated more like literal-minded pair programmers that need unambiguous instructions, not like search engines.

Spec-Driven Development, or SDD, is a response to that failure mode. It moves engineering intent out of transient chat history and into versioned, reviewable artifacts.

The word spec is overloaded. If we do not define it, the rest of the discussion collapses.

A spec is a structured, behavior-oriented artifact, usually written in natural language, that describes what a system or feature must do and gives an AI coding agent enough context to plan, implement, and verify the work. Birgitta Böckeler uses this framing in her analysis of Kiro, Spec Kit, and Tessl on Martin Fowler's site.

A spec is not a prompt. A prompt is conversational and ephemeral. A spec is persistent and reviewable.

A spec is not a PRD. A PRD primarily aligns business stakeholders. An engineering spec must be precise enough to guide code generation, implementation choices, and verification.

A spec is not a test suite.
Tests validate behavior after or during implementation. A spec defines expected behavior before implementation. The best SDD workflows connect the two, but they are not the same artifact.

A spec is not useful because it is long. A useful spec reduces uncertainty. A bad spec simply migrates ambiguity from chat into Markdown.

Spec-Driven Development inverts the traditional relationship between code and specification.

Traditional lifecycle: Code (source of truth) → Documentation (stale artifact)
Spec-driven lifecycle: Specification (source of truth) → Code (implementation artifact)

Spec Kit's documentation describes this shift as making specifications executable rather than treating them as scaffolding discarded once coding begins.

This is not a return to waterfall. The strongest version of SDD is not "write a perfect document, then code." It is "make intent explicit, review it, let the agent operate against it, then evolve the spec as understanding changes." The GitHub Blog describes specs as living artifacts that evolve with the project and become shared source of truth.

The practical engineering question is never "Should we write documentation?" It is: how much structure earns its maintenance cost for this specific change?

SDD is not a productivity trick. It is a tradeoff. You spend more effort before implementation so the agent spends less effort guessing during implementation. That tradeoff works when ambiguity is expensive. It fails when the ceremony costs more than the code. The mistake is not using SDD. The mistake is using the same amount of SDD for every change.

Böckeler's most useful contribution is the three-level taxonomy of SDD. It separates practices that are often conflated under one label.

| SDD Level | Source of Truth | Codebase Relationship | Best Fit | Primary Risk |
|---|---|---|---|---|
| Spec-first | Spec during the lifecycle of a task | Spec guides implementation, then becomes a historical artifact | Most teams adopting SDD today | Spec is archived and forgotten after merge |
| Spec-anchored | Spec over the feature lifetime | Spec evolves with feature maintenance | Long-lived product capabilities | Spec-code drift |
| Spec-as-source | Spec as exclusive source | Code is generated output | Narrow or experimental workflows | Inflexibility plus model non-determinism |

Most production teams should begin with spec-first. Spec-anchored is attractive, but it creates a maintenance obligation. If nobody owns the spec after merge, the artifact becomes documentation debt. Spec-as-source is the most ambitious and the most fragile. Tessl explores this direction, while Böckeler explicitly compares the ambition to Model-Driven Development and warns that it risks combining MDD-style rigidity with LLM non-determinism.

The taxonomy gives teams a language for adoption. You do not need to "become spec-driven." You choose which level is justified by the work.

Most SDD tools converge on a similar loop:

Explore → Specify → Plan → Tasks → Implement → Verify

| Phase | Purpose | Typical Artifact |
|---|---|---|
| Explore | Understand the problem space, existing system, constraints, and hidden dependencies | Notes, research, questions |
| Specify | Define what must be built and why | spec.md |
| Plan | Define how the system should realize the spec | plan.md, design notes, data model |
| Tasks | Break the plan into dependency-aware atomic work units | tasks.md |
| Implement | Execute tasks against the spec and plan | Code and tests |
| Verify | Check implementation against the spec and plan | Test results, review notes |

Spec Kit formalizes this through Constitution, Specify, Plan, Tasks, and Implement. Its README explicitly instructs users to focus the spec phase on what and why, then use the plan phase for technology and architecture choices.
Kiro's official docs describe specs as structured artifacts that generate three key files: requirements.md or bugfix.md, design.md, and tasks.md. OpenSpec uses proposal, specs, design, and tasks for each change, with a lighter, more iterative process.

The command names differ. The shape is the same: reduce ambiguity before asking the agent to modify the codebase.

Star counts are useful for trend detection. They are not a technical evaluation. The better question is: what architectural layer does the tool operate on?

What are we building, and what does done mean?

| Tool | Architectural Intent | Best Fit | Primary Failure Mode |
|---|---|---|---|
| Spec Kit | Repository-native SDD with specs, plans, tasks, extensions, and project principles | Teams that want explicit governance across multiple coding agents | Markdown and review overhead |
| OpenSpec | Lightweight change proposals with localized spec deltas | Brownfield work and lower-ceremony adoption | Less systemic governance |
| Kiro | IDE-native spec workflow with requirements, design, tasks, steering, and hooks | Teams that want SDD embedded in the development environment | Tooling lock-in concerns |
| BMAD-METHOD | Multi-agent agile workflow spanning ideation, planning, and implementation | Teams that want structured lifecycle coverage | Process overhead |

How does the agent behave while doing the work?

| Tool | Architectural Intent | Best Fit | Primary Failure Mode |
|---|---|---|---|
| Superpowers | Mandatory composable skills for planning, TDD, debugging, review, and verification | Individual developers and small pods that want discipline by default | Can feel opinionated when the workflow does not match the task |
| GSD | Context engineering and thin orchestration to avoid context rot | Long-running AI coding sessions | Less useful when formal team governance is required |
| HVE Core | Prompt, agent, instruction, and skill collections for structured Copilot workflows | Teams already using GitHub Copilot and VS Code | Requires disciplined context separation |

HVE Core's RPI workflow separates Research, Plan, Implement, and Review.
Its documentation describes RPI as a transformation pipeline: Uncertainty → Knowledge → Strategy → Working Code → Validated Code.

Who does the work, and how do agents coordinate?

| Tool | Architectural Intent | Best Fit | Primary Failure Mode |
|---|---|---|---|
| Squad | Repository-native multi-agent team with persistent decisions | Parallel work across frontend, backend, tests, and documentation | Coordination overhead on small tasks |
| BMAD-METHOD | Multi-agent process orchestration across the development lifecycle | Large structured workflows | Process complexity |

The GitHub Blog deep dive on Squad describes a coordinator agent that routes work, loads repository context, and spawns specialists with task-specific instructions.

The ecosystem is not one ladder. It is a stack. You can use Spec Kit for intent, HVE Core for execution discipline, and Squad for orchestration. Or you can use Superpowers alone for an opinionated individual workflow. The right choice depends on the failure mode you are trying to reduce.

Use SDD when the cost of ambiguity is higher than the cost of specification.

| Signal | Skip SDD | Use SDD |
|---|---|---|
| Ambiguity | One obvious implementation path | Multiple plausible interpretations |
| Scope | Single function or small bug | Multi-file, multi-service, or architectural change |
| Context duration | One short session | Multiple sessions |
| Team size | One engineer with full context | Multiple engineers or reviewers |
| Codebase familiarity | Fresh, familiar code | Legacy or brownfield system |
| Risk | Easy rollback | Production-facing or compliance-sensitive |
| Review burden | Code diff is easier to inspect | Intent must be reviewed before implementation |

If three or more signals fall in the right column, SDD is worth trying. If most fall in the left column, use plain AI-assisted coding with tests.

Böckeler's analysis supports this filter from the negative case. She found that Kiro turned a small bug into four user stories and sixteen acceptance criteria, and that Spec Kit felt too heavy for a mid-sized feature in an existing codebase.
Her conclusion was not that SDD is useless. It was that SDD tools need to fit the problem size.

Do not adopt a methodology. Apply a filter.

Every additional rule consumes reasoning budget. If the rule reduces uncertainty, it is structure. If it does not, it is tax. I call this the Law of Surplus Structure. This is my framing, but it is grounded in two data points.

The token inflation benchmark. Jamie Telin published a head-to-head comparison of OpenSpec and Spec Kit on the same implementation task: adding streaming and session support to an AI chat application.

| Reported Run | OpenSpec | Spec Kit | Difference |
|---|---|---|---|
| Test 1 total tokens | ~57,740 | ~120,947 | +109% |
| Test 2 planning tokens | 38,117 | 96,298 | +152% |
| Test 2 implementation tokens | 53,612 | 84,742 | +58% |
| Test 2 total tokens | 91,729 | 181,040 | +97% |

Telin reported that OpenSpec achieved a higher task success rate while requiring fewer assistant turns and tool calls. More structure increased cost without improving outcomes.

The compliance loop trap. The ETH Zurich paper "Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?" finds that repository-level context files tend to reduce task success rates compared with no repository context while increasing inference cost by over 20%. The paper reports that agents do not ignore context files. They follow them. They explore more broadly, traverse more files, run more tool calls, and consume more reasoning tokens, without a meaningful reduction in steps to reach the solution.

The agent is not necessarily failing because it ignores governance. It may be failing because it obeys governance too broadly. When you add governance, the model allocates effort toward satisfying governance. If that governance is redundant, vague, or unrelated to the current task, you have not improved reliability. You have redirected intelligence away from the problem.

This does not mean "avoid structure." It means structure must earn its place. A good constitution removes degrees of freedom.
A bad constitution creates ceremony. A good spec narrows behavior. A bad spec inflates context. A good workflow reduces search. A bad workflow creates compliance loops.

Token cost is not a billing footnote. It is an architectural property. The more artifacts a workflow requires, the more context must be read, summarized, checked, and reconciled. That cost shows up as latency, tool calls, reasoning tokens, review time, and money.

The spec-kit-cost extension documentation includes a sample feature report totaling $13.3983, with the implementation phase accounting for $11.64 of that example. This is not an average benchmark, but it illustrates why phase-level cost visibility matters.

The mitigation is not "use fewer tokens" in the abstract. It is architecture.

| Cost Problem | Mitigation |
|---|---|
| Planning history pollutes implementation | Start a fresh session between phases |
| Constitution is too broad | Keep only rules that affect code generation |
| Specs duplicate obvious codebase facts | Link to source files instead of restating |
| Implementation agent reads every artifact | Pass only the relevant spec, plan, and task |
| Same model used for every phase | Use stronger reasoning models for planning, cheaper models for execution |

HVE Core's RPI guidance explicitly recommends clearing context or starting a new chat between phases because each custom agent has different instructions and accumulated context causes confusion. GSD addresses the same problem structurally with a thin orchestrator pattern and fresh contexts for specialized agents.

The engineering principle is simple: preserve intent in files, not in bloated chat state.

If you decline to adopt a full SDD framework, you can still improve your AI-assisted development workflow by changing how you write requirements.

EARS (Easy Approach to Requirements Syntax) was developed by Alistair Mavin and colleagues at Rolls-Royce while analyzing airworthiness regulations for a jet engine control system.
It was first published at the IEEE International Requirements Engineering Conference in 2009 and has been adopted across organizations including Airbus, Bosch, Dyson, Honeywell, Intel, NASA, Rolls-Royce, and Siemens.

The core syntax:

    While <optional pre-condition>, when <optional trigger>, the <system name> shall <system response>

The keyword SHALL turns a vague preference into a mandatory, testable behavior. The system either satisfies the requirement or it does not.

Consider a requirement as it is often written:

    We need to handle session timeouts gracefully. If a user's token is expired, clear their data out of the cache quickly and redirect them, but make sure we don't log any sensitive info if it fails.

This is understandable to a human, but it leaves too much room for interpretation: What does "gracefully" mean? How quickly is "quickly"? What data should be cleared? What happens if cache eviction fails? What is considered sensitive? Where should the redirect happen? What should be logged? A coding agent will fill those gaps with plausible defaults.

The same requirement rewritten in EARS form:

    WHEN an identity token expires, THE SYSTEM SHALL invalidate the active session cache within 500ms. IF cache eviction fails, THEN THE SYSTEM SHALL retry up to 3 times, log a structured JSON error with a correlation ID, and SHALL NOT persist plain-text PII in telemetry.

This version is still natural language, but it is much more executable. It gives the agent explicit triggers, required behavior, retry semantics, timing constraints, logging requirements, and a prohibition.

Use EARS especially for error handling, authentication and authorization flows, data retention and deletion behavior, retry and timeout semantics, security-sensitive constraints, and compliance-facing workflows.

A useful spec is not longer. A useful spec has fewer degrees of freedom.

Crucially, EARS constrains behavior, not architecture.
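As a minimal sketch of the EARS shape described above, the snippet below renders a requirement from its optional pre-condition, optional trigger, system name, and response, and flags a few vague words in a draft. The names `EarsRequirement` and `vague_terms`, and the short list of flagged words, are hypothetical and chosen only for illustration; they are not part of EARS or of any SDD tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EarsRequirement:
    """One requirement in the generic EARS shape (hypothetical helper)."""

    system: str
    response: str
    precondition: Optional[str] = None  # rendered as "While ..."
    trigger: Optional[str] = None       # rendered as "when ..."

    def render(self) -> str:
        clauses = []
        if self.precondition:
            clauses.append(f"while {self.precondition}")
        if self.trigger:
            clauses.append(f"when {self.trigger}")
        # The mandatory part: "the <system name> shall <system response>".
        clauses.append(f"the {self.system} shall {self.response}")
        sentence = ", ".join(clauses) + "."
        return sentence[0].upper() + sentence[1:]


def vague_terms(
    text: str,
    banned: Tuple[str, ...] = ("gracefully", "quickly", "sensitive"),
) -> List[str]:
    """Return the flagged words found in a draft (naive substring match)."""
    lowered = text.lower()
    return [word for word in banned if word in lowered]


req = EarsRequirement(
    system="system",
    trigger="an identity token expires",
    response="invalidate the active session cache within 500ms",
)
print(req.render())
# When an identity token expires, the system shall invalidate the active session cache within 500ms.
print(vague_terms("Handle session timeouts gracefully and clear the cache quickly."))
# ['gracefully', 'quickly']
```

A check like this cannot judge whether a requirement is correct. It only makes missing triggers and unmeasurable adjectives visible before the spec reaches an agent.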
For execution agents working inside real codebases, pair EARS requirements with explicit engineering anchors such as file-path boundaries, owning modules, dependency tags, or test targets. Otherwise, the agent may understand the requirement correctly but still implement it in the wrong part of the system.

EARS tells the agent what behavior must hold. Engineering anchors tell the agent where to look, where to edit, and how to verify. Without both, a spec can be semantically precise but operationally expensive.

SDD is useful, but its failure modes are real. These are the ones I would actively design against.

Spec-heavy workflows can produce many Markdown artifacts per feature. Böckeler's analysis of Spec Kit notes that generated files can become repetitive, verbose, and sometimes contain implementation-level detail, shifting review burden from code to generated documents. The mitigation is to decide which artifacts are allowed to block implementation. A practical rule: only three artifacts should block implementation: the spec, the plan, and the task list. Everything else is supporting evidence.

A spec can create the illusion that the agent is constrained. It is not automatically constrained. Böckeler documented cases where an agent used research notes correctly in one phase but later ignored them during implementation, including regenerating classes that already existed. The mitigation is to verify against the codebase, not against the agent's explanation. Superpowers' verification-before-completion skill captures this principle directly: no completion claim should be made without fresh verification evidence.

If nobody owns the spec after merge, the spec becomes stale documentation. This is the failure mode of spec-anchored workflows: the spec is valuable only if it evolves with the code. The mitigation is ownership. For each long-lived spec, assign one policy: archive after merge, maintain as feature memory, or promote into a testable contract.
Do not leave it in an ambiguous middle state.

When an execution agent is assigned a multi-file change from a static plan.md, it often works linearly: File A, then File B, then File C. But after File A changes, the workspace is no longer the same workspace described by the original plan. If the agent continues using stale assumptions while editing File B and File C, it can introduce subtle mismatches: renamed types, changed function signatures, outdated imports, broken dependency assumptions, or runtime behavior that no longer lines up across files. This is not a sourced benchmark claim. It is an operational pattern to design against.

The mitigation is State Checkpoints. After each meaningful file mutation, the execution loop should force the agent to:

1. Save the change.
2. Re-read the changed file.
3. Re-scan directly affected imports, call sites, and tests.
4. Run the cheapest relevant verification command.
5. Update the active task context.
6. Only then move to the next task.

This aligns with two public patterns. HVE Core's RPI workflow separates implementation and review, and its Review phase validates implementation against plan specifications, checks convention compliance, and runs validation commands such as lint, build, and test. Superpowers' verification-before-completion skill requires fresh verification evidence before claiming work is complete, explicitly rejecting assumptions such as "should pass" or "looks correct."

For multi-file work: no task is complete until the workspace has been re-read and verified after the latest file mutation.

Natural language can only be constrained so far. Arcturus Labs argues that if you make natural language sufficiently precise, you may eventually write so much structure that you lose the benefit of not writing code in the first place. The mitigation is to stop treating specs as prose. Use structured requirements, examples, acceptance criteria, diagrams, and tests where appropriate.
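Returning to the six State Checkpoints steps listed earlier in this section, here is a rough sketch of what such a loop could look like. It is an illustration under stated assumptions, not how any of the tools mentioned implement it: `Task`, `TaskContext`, `run_with_checkpoints`, and `demo` are hypothetical names, the "agent edit" is a plain callable that writes a file, and the cheapest verification command is assumed to be Python's standard `py_compile` module run in a subprocess.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    target: Path                   # file this task mutates
    apply: Callable[[Path], None]  # performs the edit and writes it to disk
    affected: List[Path]           # direct importers, call sites, tests
    verify_cmd: List[str]          # cheapest relevant check for this edit


@dataclass
class TaskContext:
    snapshots: Dict[Path, str] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


def run_with_checkpoints(tasks: List[Task], ctx: TaskContext) -> TaskContext:
    # 6. The loop only advances to the next task after steps 1-5 succeed.
    for task in tasks:
        task.apply(task.target)                               # 1. save the change
        ctx.snapshots[task.target] = task.target.read_text()  # 2. re-read the changed file
        for path in task.affected:                            # 3. re-scan affected files
            if path.exists():
                ctx.snapshots[path] = path.read_text()
        check = subprocess.run(task.verify_cmd, capture_output=True, text=True)  # 4. verify
        if check.returncode != 0:
            raise RuntimeError(f"checkpoint failed after {task.name}: {check.stderr.strip()}")
        ctx.completed.append(task.name)                       # 5. update the task context
    return ctx


def demo() -> List[str]:
    """Run two dependent edits in a temporary workspace and return completed tasks."""
    with tempfile.TemporaryDirectory() as tmp:
        a, b = Path(tmp, "a.py"), Path(tmp, "b.py")

        def compile_check(path: Path) -> List[str]:
            return [sys.executable, "-m", "py_compile", str(path)]

        tasks = [
            Task("edit a.py", a, lambda p: p.write_text("def load():\n    return 1\n"), [b], compile_check(a)),
            Task("edit b.py", b, lambda p: p.write_text("from a import load\n"), [a], compile_check(b)),
        ]
        return run_with_checkpoints(tasks, TaskContext()).completed


print(demo())
```

The design choice worth noting is that a failed check stops the loop instead of being logged and skipped, so later edits never build on a workspace that has not been re-read and verified.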
Böckeler compares spec-as-source approaches to Model-Driven Development. LLMs remove some MDD constraints, such as rigid DSLs and custom code generators, but add non-determinism. The risk is combining rigidity with stochastic generation. The mitigation is to reserve spec-as-source for narrow, well-bounded domains until the tooling proves itself.

Do not roll out SDD as a process mandate. Start with one class of work: ambiguous, multi-file, production-facing features.

| Week | Action | Goal |
|---|---|---|
| Week 1 | Run Specify and Plan only on one real feature | See whether the spec improves alignment |
| Week 2 | Write a minimal constitution | Capture 5 to 10 rules that actually affect implementation |
| Week 3 | Run full SDD on one feature | Measure review time, token cost, quality, and rework |
| Week 4 | Decide the filter | Document when your team will use SDD and when it will skip it |

The adoption artifact should not be "We use SDD." It should be a filter:

Use SDD when:

1. The change touches more than one subsystem.
2. Requirements have more than one valid interpretation.
3. The work will span more than one AI session.
4. The cost of a wrong implementation is high.

That is enough.

The best case for SDD is not that it makes AI coding magical. It does not. The best case is narrower and stronger: SDD makes intent inspectable before implementation begins.

That is valuable because AI coding has changed the economics of software development. Code became cheaper to generate. Intent became the bottleneck. But structure is not free. Every extra Markdown file, checklist, context file, and governance rule consumes reasoning budget.

The job of a senior engineer is not to maximize process. It is to apply the smallest amount of structure that eliminates the most ambiguity.

Spec it when ambiguity is expensive. Skip it when the code is cheaper than the ceremony.
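
For teams that want the filter to be something they can run instead of something they remember, the sketch below encodes the seven-row signal table from earlier in the article together with its "three or more signals" rule. The class and function names are hypothetical, and the threshold is a parameter so a team can tune it. This is one possible encoding, not a prescribed one.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ChangeSignals:
    """True means the signal falls in the 'Use SDD' column of the signal table."""

    multiple_interpretations: bool         # Ambiguity
    multi_file_or_architectural: bool      # Scope
    spans_multiple_sessions: bool          # Context duration
    multiple_engineers_or_reviewers: bool  # Team size
    legacy_or_brownfield: bool             # Codebase familiarity
    production_or_compliance_risk: bool    # Risk
    intent_needs_review_first: bool        # Review burden


def count_hits(signals: ChangeSignals) -> int:
    """Count how many signals point toward using SDD."""
    return sum(bool(getattr(signals, f.name)) for f in fields(signals))


def sdd_recommended(signals: ChangeSignals, threshold: int = 3) -> bool:
    """Apply the article's rule of thumb: three or more signals means try SDD."""
    return count_hits(signals) >= threshold


typo_fix = ChangeSignals(False, False, False, False, False, False, False)
billing_migration = ChangeSignals(True, True, True, False, False, True, False)

print(sdd_recommended(typo_fix))           # False
print(sdd_recommended(billing_migration))  # True
```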

Official Sources

Spec Kit documentation
GitHub Blog: Spec-driven development with AI
Microsoft Developer Blog: Diving into SDD with Spec Kit
IBM: What is spec-driven development?
Superpowers repository
GSD repository
OpenSpec repository
BMAD-METHOD repository
Kiro documentation
Squad repository
HVE Core documentation
EARS official guide

Independent Analysis

Thoughtworks: Understanding SDD (Böckeler)
Jamie Telin: SDD Is Wasting Tokens
ETH Zurich: Evaluating AGENTS.md
HackerNoon: The Limits of SDD
Sibylline Software: The Problems with SDD
Arcturus Labs: Why SDD Breaks at Scale
Awesome SDD resource list
Spec Kit Cost Tracker extension
