Definition of Done

Definition of Done (DoD) standards are documented criteria that define when a piece of work is considered complete. These standards reduce ambiguity by aligning engineering, product, and quality teams on what “done” actually means. DoD criteria help prevent partial work from leaking into production and ensure that features meet the necessary quality, testing, documentation, and stakeholder expectations before they are closed.

As teams adopt AI tooling (like coding assistants) and ship AI-powered functionality (including agentic AI), the Definition of Done becomes even more important. “Done” must cover not only code and tests, but also the safety, reliability, and operational readiness of AI behaviors that can be probabilistic, data-sensitive, and harder to validate with traditional test approaches.

Background and History of Definition of Done Standards

The concept of a shared Definition of Done originated in Scrum and Agile software delivery. The Scrum Guide defines the DoD as “a formal description of the state of the Increment when it meets the quality measures required for the product.” Over time, many teams have extended the DoD beyond working software to include non-functional requirements, release readiness, and cross-functional validations such as security, compliance, and stakeholder sign-off.

As teams matured, Definition of Done checklists evolved to clarify accountability across functions and prevent gaps between what’s delivered and what’s expected by end users.

That evolution is accelerating in the AI era. Teams increasingly need “done” to include evaluation and monitoring expectations for AI outputs, plus guardrails for systems that can take autonomous actions (for example, agents that file tickets, modify code, or trigger workflows).

Goals of Definition of Done Standards

Definition of Done Standards aim to prevent quality gaps and reduce rework by making completeness explicit. They are designed to solve common problems such as:

  • Unclear Requirements, which lead to mismatched expectations at delivery.
  • Roll Over Tickets, where incomplete work carries from sprint to sprint.
  • Definition Drift, when different teams apply inconsistent delivery standards.
  • Low Test Coverage, caused by unclear expectations about test completeness.

When well-implemented, DoD standards prevent work from being “done but not done,” ensuring that effort translates into real progress and deployable value.

In practice, a strong DoD supports all three outcomes leaders care about:

  • Quality, by preventing incomplete validation, missing instrumentation, or skipped safeguards.
  • Predictability, by reducing late surprises, last-minute rework, and hidden “almost done” work.
  • Workflow efficiency, by reducing back-and-forth and making handoffs clearer between build, review, test, and release stages.

For AI-enabled work, DoD also helps prevent “it demoed once” from being treated as “it’s production-ready.”

Scope of Definition of Done Standards

A team’s Definition of Done typically applies to work items at the story, feature, or epic level. Standard criteria often include:

  • Code is reviewed, merged, and passes all automated tests.
  • Acceptance criteria are met and validated by the product owner or QA.
  • Documentation (user-facing or technical) is updated.
  • Observability instrumentation and security checks are in place.
  • Work is deployable and/or gated behind a Feature Flagging control.
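
One lightweight way to make criteria like these enforceable (rather than aspirational) is to encode them as structured data attached to each work item, so a ticket cannot be closed while any item is unmet. The sketch below is illustrative only; the field names are hypothetical and would map to whatever your tracker or tooling supports:

```python
from dataclasses import dataclass, fields

@dataclass
class DoDChecklist:
    """Hypothetical per-ticket Definition of Done record."""
    code_reviewed_and_merged: bool = False
    automated_tests_pass: bool = False
    acceptance_criteria_validated: bool = False
    documentation_updated: bool = False
    observability_and_security_checks: bool = False
    deployable_or_behind_feature_flag: bool = False

    def unmet_criteria(self) -> list[str]:
        """Return the names of any criteria that are still incomplete."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_done(self) -> bool:
        return not self.unmet_criteria()


# Example: a ticket that has been reviewed and tested but is not yet "done".
checklist = DoDChecklist(code_reviewed_and_merged=True, automated_tests_pass=True)
if not checklist.is_done():
    print("Ticket cannot be closed; missing:", checklist.unmet_criteria())
```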

If the work includes AI-assisted implementation or AI-powered runtime behavior, teams often expand the DoD to cover additional “readiness” expectations, such as:

  • AI-related configuration is versioned (e.g., prompts, policies, routing rules, model settings).
  • The team has validated expected behavior using a repeatable evaluation approach (not just ad hoc manual checks).
  • Safety and abuse cases have been considered (especially for externally facing features).
  • Monitoring and rollback/kill-switch plans exist for AI behavior changes that may be harder to predict than deterministic code.
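
For the first point, a common pattern is to keep prompts, model settings, and routing rules in a reviewed, version-controlled configuration file rather than inline strings scattered through the code. A minimal sketch, with hypothetical field names and file paths, might look like this:

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class AIRuntimeConfig:
    """Hypothetical AI configuration kept in version control alongside the code."""
    config_version: str   # bumped on every prompt, policy, or model change
    model_name: str
    temperature: float
    system_prompt: str
    routing_rules: dict   # e.g., which model or prompt handles which request type

def load_config(path: str) -> AIRuntimeConfig:
    """Load the reviewed, versioned config instead of relying on ad hoc inline prompts."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return AIRuntimeConfig(**raw)  # raises if required fields are missing

# Example usage: the JSON file lives in the repo and changes via normal code review.
# config = load_config("ai_config/v3/assistant.json")
```

Because the configuration travels through the same review and release process as code, "versioned" becomes a checkable DoD criterion rather than a matter of convention.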

The DoD is not a rigid contract; it must adapt based on team maturity, platform, and customer expectations. For example, teams in early discovery phases may adopt a lighter-weight DoD, while teams in regulated industries may require strict compliance checks before work is closed.

Critically, the DoD must be consistently applied. If certain criteria are only enforced selectively or are bypassed under deadline pressure, the standard loses its effectiveness.

Definition of Done vs Acceptance Criteria vs Definition of Ready

These concepts are often confused, especially when teams are moving quickly:

  • Acceptance criteria describe what must be true for this specific work item to satisfy the user or stakeholder need.
  • Definition of Done describes the standard of completeness for any work item to be considered shippable (or safely releasable).
  • Definition of Ready (where used) describes what must be true before work is pulled in, so execution doesn’t stall due to ambiguity.

For AI features, this separation matters. A ticket can meet acceptance criteria (“the chatbot answers X”) while still failing DoD if it lacks monitoring, evaluation coverage, privacy checks, or safe fallback behaviors.

Definition of Done for AI-Assisted Development and Agentic AI

AI changes “done” in two different ways: how work is produced (AI-assisted coding) and what is shipped (AI-powered behavior).

1) When AI tools help produce the code

A practical DoD expansion is to ensure AI-assisted output is still owned, understood, and maintainable:

  • Human review confirms intent, correctness, and edge cases (not just “it compiles”).
  • Tests cover critical behavior, not only the happy path.
  • Security and secret-scanning checks run, since AI-generated code can accidentally introduce unsafe patterns.
  • Dependencies and licensing implications are reviewed where relevant (especially when AI suggests copying patterns or snippets).
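
As a minimal sketch of the security point above (real teams typically rely on dedicated secret scanners and license tooling rather than hand-rolled checks), a pre-merge script could flag obvious credential patterns in newly added code. The patterns and exit behavior below are illustrative assumptions:

```python
import re
import sys
from pathlib import Path

# Deliberately simple patterns; a real setup would use a dedicated secret scanner.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                              # AWS access key id format
    re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),  # embedded private keys
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{12,}"),
]

def scan_file(path: Path) -> list[str]:
    """Return human-readable findings for any suspicious lines in one file."""
    findings = []
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
        if any(p.search(line) for p in SECRET_PATTERNS):
            findings.append(f"{path}:{lineno} looks like a hard-coded credential")
    return findings

if __name__ == "__main__":
    problems = [f for arg in sys.argv[1:] for f in scan_file(Path(arg))]
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # fail the check so the work item cannot be marked done
```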

2) When the shipped feature includes AI behavior

AI features often require DoD to include validation beyond traditional unit tests:

  • A repeatable evaluation exists (test cases, golden sets, scenario suites, or regression prompts) that can be rerun on changes.
  • The system has defined failure behavior (fallbacks, safe defaults, or constrained modes).
  • The team has an operational plan (monitoring, alerting, incident playbooks, and a rollback strategy).
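
A "repeatable evaluation" can be as simple as a versioned golden set run through the normal test runner. The sketch below assumes a hypothetical generate_answer function and example cases, and it uses tolerant content checks rather than exact string matching, since outputs are probabilistic:

```python
import pytest

# Hypothetical golden set: inputs paired with facts the answer must (and must not) contain.
# In practice these cases would live in a versioned file and grow over time.
GOLDEN_CASES = [
    {"question": "What is our refund window?",
     "must_include": ["30 days"],
     "must_not_include": ["no refunds"]},
    {"question": "How do I reset my password?",
     "must_include": ["reset link", "email"],
     "must_not_include": []},
]

def generate_answer(question: str) -> str:
    """Placeholder for the real model or agent call under evaluation."""
    raise NotImplementedError

@pytest.mark.parametrize("case", GOLDEN_CASES)
def test_golden_case(case):
    answer = generate_answer(case["question"]).lower()
    for phrase in case["must_include"]:
        assert phrase.lower() in answer, f"missing required content: {phrase!r}"
    for phrase in case["must_not_include"]:
        assert phrase.lower() not in answer, f"unsafe or incorrect content: {phrase!r}"
```

Because the suite reruns on every change to prompts, models, or configuration, it gives the team regression evidence they can point to when declaring the work done.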

3) When agentic AI can take actions

Agentic workflows raise the bar for “done” because the system can create downstream impact:

  • Permissions are scoped (least privilege) and actions are constrained to the intended domain.
  • There is auditability (logs of what the agent did, when, and why).
  • Guardrails exist for cost, rate limiting, and runaway loops.
  • Human approvals are defined for high-risk actions (or the system has explicit policy gates that substitute for manual review in low-risk cases).
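
These guardrails can be centralized in a single gate that every agent action passes through. The following sketch is illustrative only; the action names, budget, and approval policy are assumptions rather than a reference implementation:

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("agent_audit")

# Hypothetical policy: actions the agent may take on its own, actions that require
# human approval, and everything else denied by default (least privilege).
AUTO_ALLOWED = {"comment_on_ticket", "open_draft_pr"}
NEEDS_HUMAN_APPROVAL = {"merge_pr", "modify_production_config"}
MAX_ACTIONS_PER_RUN = 20  # crude guardrail against runaway loops and cost blowups

@dataclass
class ActionRequest:
    name: str
    reason: str

class AgentActionGate:
    def __init__(self) -> None:
        self.actions_taken = 0

    def authorize(self, request: ActionRequest, human_approved: bool = False) -> bool:
        """Decide whether the agent may perform an action, and audit the decision."""
        if self.actions_taken >= MAX_ACTIONS_PER_RUN:
            decision = "denied: action budget exhausted"
        elif request.name in AUTO_ALLOWED:
            decision = "allowed"
        elif request.name in NEEDS_HUMAN_APPROVAL and human_approved:
            decision = "allowed with human approval"
        else:
            decision = "denied: outside scoped permissions"

        # Auditability: every request is logged with what was asked, why, and the outcome.
        log.info("agent action=%s reason=%s decision=%s",
                 request.name, request.reason, decision)
        if decision.startswith("allowed"):
            self.actions_taken += 1
            return True
        return False
```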

Metrics to Track Definition of Done Standards

Each metric below is paired with the purpose it serves in tracking DoD health:

  • Roll Over Tickets – Indicates whether incomplete work is being marked as done prematurely.
  • Rework Rate – High rework may signal that tickets are being closed before they meet DoD standards.
  • Defect Density – Escaped defects can reveal gaps in testing, acceptance, or readiness criteria.
  • Change Failure Rate – Helps validate whether “done” standards are strong enough to prevent release-driven incidents and hotfix cycles.
  • First-Time Pass Rate – Low first-time pass rates can indicate missing DoD requirements (or weak automation) that cause repeated validation churn.
  • Pipeline Success Rate – Consistently low success rates may suggest teams are merging work that isn’t meeting DoD expectations for test reliability and validation stability.

These metrics help teams identify systemic gaps between defined standards and actual delivery behaviors.
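
For teams that export ticket data, a few of these metrics can be computed directly. The sketch below assumes hypothetical field names from an issue-tracker export:

```python
from dataclasses import dataclass

@dataclass
class TicketRecord:
    """Hypothetical export record from an issue tracker."""
    ticket_id: str
    sprints_carried: int       # how many additional sprints the ticket rolled over into
    review_rounds: int         # validation cycles before it passed review/QA
    reopened_after_done: bool  # closed as done, then reopened

def rollover_rate(tickets: list[TicketRecord]) -> float:
    """Share of tickets that carried into at least one additional sprint."""
    return sum(t.sprints_carried > 0 for t in tickets) / len(tickets)

def first_time_pass_rate(tickets: list[TicketRecord]) -> float:
    """Share of tickets that passed review/QA on the first attempt."""
    return sum(t.review_rounds <= 1 for t in tickets) / len(tickets)

def rework_rate(tickets: list[TicketRecord]) -> float:
    """Share of tickets reopened after being marked done."""
    return sum(t.reopened_after_done for t in tickets) / len(tickets)
```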

Definition of Done Implementation Steps

Successful implementation of a Definition of Done requires more than a checklist; it requires adoption, clarity, and reinforcement. Start by aligning on what “done” means and embedding the standard into team rituals.

  1. Collaboratively define your DoD – Include engineering, product, QA, and other stakeholders.
  2. Make it visible and accessible – Store the DoD in your team’s handbook, board template, or Definition of Ready documentation.
  3. Include DoD criteria in acceptance checklists – Use tooling or status fields to ensure all items are met before a ticket can be closed.
  4. Review and refine regularly – Inspect the DoD during retrospectives or delivery reviews to keep it current and enforceable.
  5. Embed into peer review and QA – Have reviewers validate completeness against the DoD, not just the code.
  6. Use CI/CD automation where possible – Flag tickets missing tests, documentation, or required tags as non-mergeable.
  7. Add AI-specific DoD clauses where relevant – If a work item touches AI behavior (prompts, model settings, agent actions), require evaluation evidence, operational readiness, and clear failure handling before closure.
  8. Automate “AI readiness” checks when feasible – Treat AI evaluations and safety checks like other quality gates: repeatable, visible, and hard to bypass.
  9. Define who owns AI behavior changes post-merge – DoD should clarify ownership for prompt/config updates, monitoring dashboards, and incident response for AI-driven regressions.
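
For steps 6 through 8, one lightweight pattern is a gate script in CI that refuses to treat a change as done unless the required evidence exists. The paths, thresholds, and checks below are illustrative assumptions, not any specific tool's behavior:

```python
import json
import sys
from pathlib import Path

# Hypothetical locations for evidence produced earlier in the pipeline.
EVAL_REPORT = Path("artifacts/ai_eval_report.json")
MIN_EVAL_PASS_RATE = 0.95

def changed_files() -> set[str]:
    """Stand-in for reading the change's file list from the CI environment."""
    return set(sys.argv[1:])

def check(files: set[str]) -> list[str]:
    """Deliberately naive checks; a real gate would be tuned to the team's workflow."""
    failures = []
    touches_ai = any(f.startswith(("prompts/", "ai_config/")) for f in files)
    if not any(f.startswith("tests/") for f in files):
        failures.append("no test changes found alongside code changes")
    if touches_ai:
        if not EVAL_REPORT.exists():
            failures.append("AI behavior changed but no evaluation report was produced")
        else:
            report = json.loads(EVAL_REPORT.read_text())
            if report.get("pass_rate", 0.0) < MIN_EVAL_PASS_RATE:
                failures.append(f"AI eval pass rate below {MIN_EVAL_PASS_RATE:.0%}")
    return failures

if __name__ == "__main__":
    problems = check(changed_files())
    if problems:
        print("Definition of Done gate failed:\n- " + "\n- ".join(problems))
        sys.exit(1)  # block the merge/closure until the DoD evidence exists
```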

With time and iteration, DoD standards can become an embedded part of your team’s operating system.

Gotchas in Definition of Done Standards

The most common failure modes of DoD implementation are behavioral, not structural.

  • Overly generic DoDs, which are hard to enforce because they mean different things to different people.
  • DoD criteria that are too aspirational, making it impossible to close tickets without heroic effort.
  • DoD applied inconsistently, which erodes trust and invites exceptions.
  • Tools that support closure without validation, allowing standards to be bypassed.
  • AI behaviors treated as “demo-complete” rather than “production-ready”, where teams validate outputs informally but don’t define repeatable evaluation and monitoring expectations.
  • Brittle AI checks that become flaky (for example, asserting exact text output for probabilistic systems), which can lead to “ignore the failure” dynamics instead of meaningful quality gates.

A good DoD is clear, practical, and tightly aligned with how work actually flows through the system.

Limitations of Definition of Done Standards

DoD standards are less effective when:

  • Teams are working on exploratory or research-heavy tasks where outcomes are undefined.
  • Development is upstream of dependent systems that prevent full closure.
  • Automation is lacking, making criteria like test coverage or deployment readiness difficult to validate.
  • AI outputs are highly variable or context-dependent, making it harder to fully “prove” correctness at PR time without strong evaluation design and runtime monitoring.

Some critics argue that DoD standards encourage box-checking at the expense of real understanding. This is true when the DoD is adopted as a compliance tool rather than a thinking tool.

Ultimately, a Definition of Done is only useful when it reflects the reality of what your team can deliver and sustain consistently. For AI-enabled systems, that reality typically includes both pre-release evaluation and post-release observability.