06 — Agentic Workflows

Figure C — Multi-Agent Workflow (gated & evidenced)

Part of Agentic-Native SDLC for Regulated Medical Device Engineering.

Status: Working draft, May 2026. All numeric thresholds are placeholders pending calibration in 05-evaluation-and-validation.md.

Scope: how individual and multi-agent workflows are designed, bounded, gated, and governed across the SDLC under IEC 62304, ISO 13485/QMSR, ISO 14971, FDA CSA, 21 CFR Part 11, and GAMP 5.

This document operationalizes the seven principles into concrete, buildable workflows. It assumes the reference architecture in 03-reference-architecture.md (K8s, MCP tool plane, A2A, Argo Workflows, gVisor/Kata sandboxes, policy server, deterministic hooks, durable memory, OpenTelemetry trajectory tracing) and the model fleet tiers (S/M/L/V/E) in 04-model-strategy-and-finetuning.md.

1. Workflow design principles in a regulated setting

A workflow is a versioned, validated, change-controlled artifact — not an ad-hoc prompt chain. Every workflow MUST satisfy the following structural invariants before it is permitted to execute against a Class A/B/C codebase.

#	Invariant	Mechanism	Principle	Failure mode it prevents
WD-1	Bounded tasks	Each agent step has an explicit input contract, output schema, max-iteration count, and acceptance predicate. No open-ended "do the thing" steps.	P1, P5	Unbounded loops; scope creep
WD-2	Deterministic gates between steps	Inter-step transitions are mediated by deterministic verifiers (compilers, linters, test runners, schema validators, policy server). Probabilistic output never flows to the next step ungated.	P2	Error propagation; non-reproducible state
WD-3	HITL checkpoints by safety class	Checkpoint density scales with IEC 62304 class; Class C requires dual human control.	P3	Inappropriate autonomy on safety-critical code
WD-4	Evidence capture per step	Every step emits a signed evidence record (inputs, model+harness version, tool calls, gate results, diffs, approver) to the Part 11 evidence store.	P4	Untraceable changes; audit gaps
WD-5	Budget caps per loop	Token + GPU + wall-clock budgets enforced per step and per workflow; breach → abstain/escalate, never silently truncate.	P6	Runaway cost; "cost-per-green-PR" blowout
WD-6	Abstention / escalation	Agents emit a typed `ABSTAIN(reason, evidence)` rather than guessing when confidence/coverage falls below threshold. Escalation routes to HITL or a higher model tier.	P1, P3	Confident-but-wrong output reaching a gate
WD-7	Idempotency / rollback	Every mutating action is idempotent and reversible: ephemeral sandbox branches, content-addressed artifacts, transactional commits, one-command rollback.	P7	Partial/irreversible damage to mainline

The harness is the product (P5). Workflows are expressed declaratively as Argo Workflow templates whose steps invoke agents through the harness; the LLM is a swappable component (see §3, Agent = Model + Harness). The cost-per-green-PR metric (P6) is the primary economic objective function for every workflow (see 08-token-and-gpu-economics.md).

flowchart LR
    G[Generate<br/>probabilistic] --> V[Verify<br/>deterministic gate]
    V -->|fail| R[Repair<br/>bounded loop]
    R --> V
    V -->|pass| GA[Gate<br/>policy + eval + HITL]
    GA -->|reject| R
    GA -->|approve| M[(Mainline)]
    V -->|budget/iter breach| ESC[Abstain → Escalate]
    R -->|budget/iter breach| ESC
    classDef det fill:#e8f0fe,stroke:#1a73e8;
    classDef prob fill:#fce8e6,stroke:#d93025;
    class V,GA det;
    class G,R prob;

This Generate → Verify → Repair → Gate loop is the atomic unit of every workflow in this document. It realizes the ≥99.9% release-gate correctness system property (P1): individual model calls are unreliable, so the loop around them is engineered to be reliable.
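
The bounded part of the loop (WD-1) and the typed abstention (WD-6) can be sketched as follows. This is a minimal illustration, not the platform harness: `run_bounded_loop`, `Accepted`, and `Abstain` are hypothetical names, the verifier is modeled as a function returning a list of failures, and the final Gate stage (policy + eval + HITL) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Abstain:
    """Typed abstention (WD-6): a reason plus the evidence gathered so far."""
    reason: str
    evidence: tuple


@dataclass(frozen=True)
class Accepted:
    """A candidate that passed the deterministic verifier."""
    candidate: str
    attempts: int


def run_bounded_loop(
    generate: Callable[[], str],
    verify: Callable[[str], list],        # list of failures; empty means pass
    repair: Callable[[str, list], str],
    max_iterations: int,
) -> Union[Accepted, Abstain]:
    """Generate once, then verify/repair until green or the iteration cap is hit."""
    candidate = generate()
    history = []
    for attempt in range(1, max_iterations + 1):
        failures = verify(candidate)
        history.append((attempt, tuple(failures)))
        if not failures:
            return Accepted(candidate, attempt)
        if attempt < max_iterations:
            candidate = repair(candidate, failures)
    # Cap reached: abstain with evidence instead of returning an unverified guess.
    return Abstain("iteration cap reached", tuple(history))


# Toy run: the "model" forgets a semicolon and the verifier is a trivial check.
result = run_bounded_loop(
    generate=lambda: "x = 1",
    verify=lambda c: [] if c.endswith(";") else ["missing semicolon"],
    repair=lambda c, failures: c + ";",
    max_iterations=3,
)
```

In the toy run the first verification fails, one repair is applied, and the second verification passes, so `result` is an `Accepted` with `attempts == 2`. A verifier that never goes green yields an `Abstain` carrying the per-attempt failure history.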

2. Operating modes: Conductor vs Orchestrator

Two operating modes cover the spectrum from synchronous developer assistance to autonomous async batch work. A given task is routed to a mode by the policy server based on safety class, change size, and required latency.

Dimension	Conductor (real-time, in-IDE)	Orchestrator (async multi-agent)
Interaction model	Synchronous; human in the loop continuously	Asynchronous; human at defined checkpoints
Latency target	Sub-second to seconds	Minutes to hours
Primary surface	IDE / editor / terminal	Argo Workflows, A2A mesh, CI
Concurrency	Single active agent, human-paced	Many sub-agents in parallel (Planner/Coder/Test/…)
Control granularity	Per-edit, per-tool-call (human approves inline)	Per-stage gate + HITL checkpoints
Typical model tier	S/M (low latency), V for visual context	M/L (capable), E for hard reasoning; routed per step
Tooling	MCP tools scoped to open repo; pre-tool/post-edit hooks	Full MCP plane; sandbox per sub-agent; policy server inline
Memory	Session-scoped; durable per-developer	Durable shared workflow memory + per-agent scratch
Evidence	Lightweight (suggestion accepted/rejected)	Full per-step signed evidence chain
Best for	Pair-style implementation, refactor-in-place, explain, local test authoring	Feature delivery, coverage expansion, migrations, repo-watching, batch fixes
Autonomy ceiling	Class A/B with a human approving each applied edit	Up to L4 within validated bounds; Class C still dual-control
ASMM-Med fit	L1–L2	L2–L5
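
A routing decision over the three inputs named above (safety class, change size, required latency) might look like the sketch below. The rule set is an assumption made for illustration: the `MAX_CONDUCTOR_FILES` cut-off and the choice to send all Class C work to the Orchestrator are not specified by this document, and `Task`, `Mode`, and `route` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    CONDUCTOR = "conductor"
    ORCHESTRATOR = "orchestrator"


@dataclass(frozen=True)
class Task:
    safety_class: str                 # "A", "B", or "C" (IEC 62304)
    changed_files: int                # crude proxy for change size
    needs_interactive_latency: bool   # developer is waiting in the IDE


# Hypothetical cut-off; the document does not define one.
MAX_CONDUCTOR_FILES = 5


def route(task: Task) -> Mode:
    """Pick an operating mode from safety class, change size, and latency need."""
    if task.safety_class == "C":
        # Assumption: the Conductor ceiling in the table is Class A/B.
        return Mode.ORCHESTRATOR
    if task.needs_interactive_latency and task.changed_files <= MAX_CONDUCTOR_FILES:
        return Mode.CONDUCTOR
    return Mode.ORCHESTRATOR
```

For example, `route(Task("A", 2, True))` selects the Conductor, while a 40-file Class B change is sent to the Orchestrator even if the developer asked for it interactively.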

A workflow may hand off between modes: a developer in Conductor mode dispatches a bounded task to the Orchestrator ("expand coverage on pump_controller"), reviews the resulting PR, and pulls it back into Conductor for final touch-ups. All handoffs are A2A messages with attached context manifests.

sequenceDiagram
    participant Dev as Developer (IDE)
    participant C as Conductor Agent
    participant O as Orchestrator (Argo)
    participant Sub as Sub-agents (A2A)
    Dev->>C: "implement story PUMP-142"
    C->>C: scope check (safety class B)
    C->>O: dispatch bounded task + context manifest
    O->>Sub: fan-out Planner/Coder/Test/Review
    Sub-->>O: gated PR + evidence
    O-->>Dev: HITL checkpoint (PR review)
    Dev->>C: pull back for local refinement

3. Agent anatomy — Agent = Model + Harness

The platform's first axiom: an agent is not a model. An agent is a model wrapped in a deterministic harness that supplies capability, constraint, and evidence. The model is the only probabilistic component; everything else is engineered software under change control.

flowchart TB
    subgraph Agent
      direction TB
      M[Model — fleet tier S/M/L/V/E<br/>self-hosted, fine-tuned, open-weight]
      subgraph Harness
        I[Instructions / rule files<br/>AGENTS.md, specs/, BDD/Gherkin]
        T[MCP tool plane<br/>typed, permissioned tools]
        S[Sandbox<br/>gVisor/Kata, ephemeral, egress-deny]
        OR[Orchestration<br/>Argo steps + A2A handoffs]
        H[Hooks / guardrails<br/>pre-tool, post-edit, pre-commit]
        ME[Memory<br/>durable sessions + scoped scratch]
        OB[Observability<br/>OpenTelemetry trajectory tracing]
        PS[Policy server<br/>structural + semantic gating]
      end
    end
    M <--> I
    M <--> T
    T --> PS
    T --> S
    M --> H
    M --> ME
    Agent --> OB

Component	Role	Determinism	Cross-ref
Model (S/M/L/V/E)	Generation, reasoning, repair proposals	Probabilistic	`04`
Instructions / rule files	`AGENTS.md`, `specs/`, BDD/Gherkin define behavior, conventions, constraints	Deterministic source-of-truth	`01`
MCP tools	Typed, permissioned actions (read/edit/test/build/query)	Deterministic interface	`03`
Sandbox	gVisor/Kata ephemeral env; egress-deny; per-task	Deterministic isolation	`07`
Orchestration	Argo step graph + A2A handoffs	Deterministic	`03`
Hooks / guardrails	Lifecycle interceptors (pre-tool, post-edit, pre-commit)	Deterministic	§9
Memory	Durable session/workflow memory + scoped scratch	Deterministic store	`03`
Observability	OTel trajectory spans for every tool call and gate	Deterministic	`05`
Policy server	Structural + semantic gating, intercepts every tool call	Deterministic	`07`

The policy server sits between the model and every tool: a model may propose a write_file or run_command, but the action only executes if it passes structural rules (path allowlist, diff size, no secret egress) and semantic rules (change consistent with the active spec). This is how "determinism wraps probabilism" (P2) is enforced at the tool boundary.
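
The structural half of that check can be sketched in a few lines. Everything concrete here is an assumption for illustration: the allowlist roots, the diff-size limit, the secret pattern, and the names `WriteProposal` and `structural_violations` are hypothetical, and the semantic rules (consistency with the active spec) are out of scope for the sketch.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List

# All three values are illustrative assumptions, not platform configuration.
ALLOWED_ROOTS = ("src/", "tests/")
MAX_DIFF_LINES = 400
SECRET_PATTERN = re.compile(r"(AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----)")


@dataclass(frozen=True)
class WriteProposal:
    path: str          # repository-relative path the model wants to write
    new_content: str
    diff_lines: int


def structural_violations(p: WriteProposal) -> List[str]:
    """Return the structural rules a proposed write would break (empty = allowed)."""
    violations = []
    normalized = PurePosixPath(p.path)
    if normalized.is_absolute() or ".." in normalized.parts:
        violations.append("path escapes the repository")
    elif not p.path.startswith(ALLOWED_ROOTS):
        violations.append("path not on allowlist")
    if p.diff_lines > MAX_DIFF_LINES:
        violations.append("diff too large")
    if SECRET_PATTERN.search(p.new_content):
        violations.append("content looks like a secret")
    return violations
```

The function returns every violated rule rather than stopping at the first one, so the findings can be recorded as evidence for the rejected tool call.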

4. Multi-agent decomposition pattern

The Orchestrator decomposes a unit of work into specialized sub-agents connected by A2A handoffs. Each handoff crosses a deterministic gate. Humans sign at safety-class-appropriate checkpoints.

flowchart LR
    H[Human request / story] --> PL[Planner]
    PL -->|A2A: plan| SP[Spec / Story agent]
    SP -->|A2A: spec + Gherkin| G1{Spec gate<br/>policy + eval}
    G1 -->|approve| HC1[[HITL: spec sign-off]]
    HC1 --> CO[Coder]
    CO -->|A2A: diff| G2{Build/lint gate}
    G2 --> TE[Test agent]
    TE -->|A2A: tests+coverage| G3{Test+coverage gate}
    G3 --> RV[Review agent]
    RV -->|A2A: findings| G4{Review gate<br/>policy + eval}
    G4 --> HC2[[HITL: PR approval<br/>Class C = dual control]]
    HC2 --> IN[Integrator]
    IN -->|A2A: merge req| G5{Release gate ≥99.9%}
    G5 --> M[(Mainline / deploy)]
    G1 -.reject.-> SP
    G2 -.reject.-> CO
    G3 -.reject.-> CO
    G4 -.reject.-> CO

The policy server is invoked inside every gate (G1–G5) and on every tool call within each sub-agent.
Eval gates (deterministic evaluation harness, 05) sit at G1, G4, G5.
Humans sign at HC1 (spec) and HC2 (PR); Class C requires two independent approvers at HC2 (P3).
A2A messages are content-addressed and carry the evidence manifest forward so the Integrator can assemble a complete Part 11 record.
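
One way to picture a content-addressed evidence manifest that is carried forward across handoffs is a hash-linked list of records. The sketch below assumes SHA-256 over canonical JSON and an HMAC as a stand-in for signing; the record layout and the names `content_address`, `append_record`, and `chain_is_intact` are hypothetical, and a symmetric HMAC is not by itself a Part 11 electronic signature.

```python
import hashlib
import hmac
import json

# Placeholder key for the sketch; a real deployment would use a managed signing service.
SIGNING_KEY = b"demo-key-not-for-production"


def content_address(payload: dict) -> str:
    """SHA-256 over canonical JSON, so identical payloads get identical addresses."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


def _sign(address: str) -> str:
    return hmac.new(SIGNING_KEY, address.encode(), hashlib.sha256).hexdigest()


def append_record(chain: list, step: str, payload: dict) -> dict:
    """Append a record that points at the previous record's address."""
    body = {
        "step": step,
        "payload": payload,
        "prev": chain[-1]["address"] if chain else None,
    }
    address = content_address(body)
    record = {**body, "address": address, "signature": _sign(address)}
    chain.append(record)
    return record


def chain_is_intact(chain: list) -> bool:
    """Recompute every address and signature and check the back-links."""
    prev = None
    for record in chain:
        body = {key: record[key] for key in ("step", "payload", "prev")}
        expected = content_address(body)
        if record["prev"] != prev or record["address"] != expected:
            return False
        if not hmac.compare_digest(record["signature"], _sign(expected)):
            return False
        prev = record["address"]
    return True
```

Because each record embeds the address of its predecessor, editing an earlier payload changes its recomputed address and the check fails, which is the property the Integrator relies on when assembling the final record.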

5. Per-SDLC-phase workflows

Sub-template used for every phase: Goal · Agent role · Inputs/context · Tools · Deterministic gates · HITL checkpoint · Evidence · IEC 62304 activity · Model tier.

5.1 Requirements / planning

Field	Value
Goal	Convert stakeholder intent into structured, testable requirements + acceptance criteria
Agent role	Spec/Story agent (decompose, normalize, detect ambiguity/conflict)
Inputs / context	Intake notes, existing `specs/`, risk file (ISO 14971), product requirements
Tools	`requirements_db`, `traceability_query`, `risk_register`, MCP doc retrieval
Deterministic gates	Schema validation of requirement records; traceability completeness check; duplicate/conflict linter
HITL checkpoint	Requirements review board sign-off (mandatory all classes)
Evidence	Versioned requirements set, ambiguity report, trace links, approver record
IEC 62304	5.2 Software requirements analysis
Model tier	L (reasoning over ambiguity); E for high-risk Class C decomposition

5.2 Design / architecture

Field	Value
Goal	Produce software architecture + detailed design consistent with requirements and risk controls
Agent role	Design agent (architecture proposal, interface contracts, design rationale)
Inputs / context	Approved requirements, architecture standards, risk controls, existing design docs
Tools	`architecture_model`, `diagram_gen`, `interface_registry`, dependency graph query
Deterministic gates	Interface contract validation; architecture rule checks; risk-control coverage check
HITL checkpoint	Design review (mandatory); Class C requires safety reviewer
Evidence	Design records, interface specs, design-to-requirement trace, decision log
IEC 62304	5.3 Software architectural design; 5.4 detailed design
Model tier	L; V for diagram/visual artifacts

5.3 Implementation

Field	Value
Goal	Implement units to satisfy design + spec, passing all deterministic gates
Agent role	Coder (generate diff in sandbox, self-repair against gates)
Inputs / context	Approved design, `specs/`, `AGENTS.md`, target files, dependency graph
Tools	`read_file`, `write_file`, `run_build`, `run_lint`, `run_unit_tests`, `static_analyzer`
Deterministic gates	Compile/build, lint, static analysis (SAST), unit tests, diff-size policy
HITL checkpoint	Diff review on PR; Class B/C never auto-merged without diff review
Evidence	Diff, build/test logs, static-analysis report, model+harness version
IEC 62304	5.5 Software unit implementation and verification
Model tier	M (default); L for complex units; routed by complexity

stateDiagram-v2
    [*] --> Plan
    Plan --> Generate: bounded task
    Generate --> Verify
    Verify --> Repair: gate fail
    Repair --> Verify
    Verify --> Abstain: iter/budget breach
    Verify --> PR: all gates green
    Abstain --> Escalate
    PR --> [*]: diff review (HITL)

5.4 Test / QA

Field	Value
Goal	Author/extend verification tests; achieve coverage + behavioral targets
Agent role	Test agent (generate tests, mutation-check, close coverage gaps)
Inputs / context	Code under test, requirements/Gherkin, existing tests, coverage baseline
Tools	`run_tests`, `coverage_tool`, `mutation_tester`, `requirements_trace`
Deterministic gates	Tests pass; coverage ≥ threshold (placeholder); mutation score ≥ threshold; no flaky/quarantine regressions
HITL checkpoint	QA lead review of new test suite; Class C verifies requirement-to-test trace
Evidence	Test suite diff, coverage delta, mutation report, requirement-test trace matrix
IEC 62304	5.5 unit verification; 5.6 integration testing; 5.7 system testing
Model tier	M; L for hard property/edge-case synthesis

flowchart LR
    C[Code under test] --> TA[Test agent]
    TA --> GEN[Generate tests]
    GEN --> R{Run tests}
    R -->|fail| TA
    R -->|pass| COV{Coverage ≥ θ?}
    COV -->|no| TA
    COV -->|yes| MUT{Mutation ≥ θ?}
    MUT -->|no| TA
    MUT -->|yes| TR{Req-trace complete?}
    TR -->|no| TA
    TR -->|yes| QA[[HITL: QA review]]
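
The gate order in the diagram above (tests, then coverage, then mutation score, then requirement trace) can be expressed as a single deterministic check. This is a sketch under stated assumptions: both thresholds are placeholders in the same sense as θ in the diagram, and `SuiteReport` and `first_failing_gate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder thresholds; real values are pending calibration (see 05).
COVERAGE_THRESHOLD = 0.80
MUTATION_THRESHOLD = 0.60


@dataclass(frozen=True)
class SuiteReport:
    tests_passed: bool
    coverage: float                 # fraction covered, 0..1
    mutation_score: float           # fraction of mutants killed, 0..1
    untraced_requirements: tuple    # requirement IDs with no linked test


def first_failing_gate(report: SuiteReport) -> Optional[str]:
    """Evaluate the gates in diagram order; None means ready for QA review."""
    if not report.tests_passed:
        return "tests"
    if report.coverage < COVERAGE_THRESHOLD:
        return "coverage"
    if report.mutation_score < MUTATION_THRESHOLD:
        return "mutation"
    if report.untraced_requirements:
        return "requirement-trace"
    return None
```

Returning the name of the first failing gate gives the Test agent a concrete target for its next iteration, and a `None` result is the only state that reaches the HITL checkpoint.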

5.5 Code review

Field	Value
Goal	Detect defects, spec deviations, risk-control violations before merge
Agent role	Review agent (semantic diff review, standards + risk checks)
Inputs / context	PR diff, spec, design, coding standards, risk controls, prior findings
Tools	`diff_view`, `policy_check`, `standards_linter`, `trace_query`, `sec_review`
Deterministic gates	Policy server semantic gate; standards lint; no unresolved high-severity findings
HITL checkpoint	Human reviewer approves; Class C dual control; conditional-LGTM only where permitted (§7)
Evidence	Review findings, resolution log, approver(s), gate results
IEC 62304	5.5/5.6 verification; supports 9 problem resolution
Model tier	L (judgment); M for routine diffs

sequenceDiagram
    participant PR as Pull Request
    participant RA as Review Agent
    participant PS as Policy Server
    participant Ev as Eval Gate
    participant Hu as Human Reviewer(s)
    PR->>RA: diff + context
    RA->>PS: semantic + structural check
    PS-->>RA: pass/findings
    RA->>Ev: deterministic review eval
    Ev-->>RA: score ≥ θ
    RA->>Hu: findings + recommendation
    Hu-->>PR: approve (dual for Class C)

5.6 Deployment

Field	Value
Goal	Promote validated build through release gate to target environment
Agent role	Integrator/Release agent (assemble release record, drive pipeline)
Inputs / context	Approved PR, full evidence chain, release checklist, change-control record
Tools	`argo_pipeline`, `release_gate`, `evidence_store`, `signing_service`, `deploy`
Deterministic gates	Release gate ≥99.9% correctness; evidence completeness; signed approvals present
HITL checkpoint	Release authority sign-off (mandatory); Class C dual authority
Evidence	Signed release record, DHF/Part 11 package, deploy manifest, rollback plan
IEC 62304	5.8 software release
Model tier	S/M (orchestration, not generation)

5.7 Maintenance / legacy modernization

Field	Value
Goal	Remediate defects; modernize legacy code with behavior preservation
Agent role	Maintenance/Migration sub-agent pipeline (graph-native understanding)
Inputs / context	Defect report or migration scope, code graph, characterization tests, risk file
Tools	`code_graph_query`, `characterization_tests`, `run_tests`, `diff_view`, `equivalence_check`
Deterministic gates	Characterization tests pass pre/post; behavioral equivalence; coverage maintained
HITL checkpoint	Change review; Class C dual control; CAPA linkage for defects
Evidence	Defect-to-fix trace, before/after behavior proof, migration manifest
IEC 62304	6 software maintenance; 9 problem resolution
Model tier	L (analysis) + M (bulk edits); E for hard equivalence reasoning

6. Concrete worked example workflows

6.1 Feature implementation on a Class B module

Scenario: Story PUMP-142 adds a configurable alarm threshold to infusion_rate_monitor (Class B).

Step	Mode/agent	Action	Gate	Evidence
1	Orchestrator / Spec	Decompose story → spec + Gherkin acceptance	Schema + trace	Spec record, trace links
2	HITL	Spec sign-off (single approver, Class B)	—	Approver record
3	Coder (M)	Implement in sandbox; self-repair vs build/lint/unit	Compile, lint, SAST, unit	Diff, logs
4	Test (M)	Add tests for new branches; close coverage	Coverage ≥ θ, mutation ≥ θ	Coverage delta, mutation report
5	Review (L)	Semantic review vs spec + risk controls	Policy semantic gate	Findings + resolutions
6	HITL	PR diff review (mandatory, single for Class B)	—	Approval
7	Integrator (S)	Release gate, assemble record	≥99.9% release gate	Signed release record

Budget cap: workflow aborts and escalates if token+GPU spend exceeds the per-PR cap (P6, 08). No auto-merge — Class B requires human diff review (§10 anti-pattern).

6.2 AI-generated test-coverage expansion

Scenario: Raise coverage on dosage_calculator from 71% to ≥ target without changing behavior.

flowchart LR
    BL[Baseline coverage] --> GAP[Coverage-gap analysis<br/>code graph]
    GAP --> TA[Test agent: synth tests]
    TA --> RUN{Tests green?}
    RUN -->|no| TA
    RUN -->|yes| FLK{Flaky? quarantine check}
    FLK -->|stable| MUT{Mutation ≥ θ}
    MUT -->|no| TA
    MUT -->|yes| BEH{No behavior change<br/>vs baseline}
    BEH -->|confirmed| QA[[HITL: QA lead]]
    BEH -->|drift| TA

Key controls: tests must be additive and behavior-preserving — any test that would have failed against unchanged production code is flagged as a latent defect and escalated rather than silently "fixed." Mutation testing guards against vacuous tests. Evidence: coverage delta, mutation score, requirement-test trace.
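
The two controls above can be illustrated with a toy triage routine: a new test that fails on unchanged production code is escalated, and a test that kills no mutant is treated as vacuous. The production function, the hand-written mutant, and the names `fails_on` and `triage_new_test` are all hypothetical stand-ins; a real pipeline would use a mutation-testing tool rather than a manual mutant.

```python
def dose_ml(weight_kg, mg_per_kg, mg_per_ml):
    """Toy stand-in for unchanged production code."""
    return weight_kg * mg_per_kg / mg_per_ml


def mutant_dose_ml(weight_kg, mg_per_kg, mg_per_ml):
    """Hand-written mutant: '/' replaced by '*'."""
    return weight_kg * mg_per_kg * mg_per_ml


def vacuous_test(fn):
    fn(70, 2, 10)  # executes the code (adds coverage) but asserts nothing


def real_test(fn):
    assert fn(70, 2, 10) == 14


def contradicting_test(fn):
    assert fn(70, 2, 10) == 15  # disagrees with current production behavior


def fails_on(test, impl) -> bool:
    """True if the test raises an assertion failure against impl."""
    try:
        test(impl)
    except AssertionError:
        return True
    return False


def triage_new_test(test, production, mutants) -> str:
    if fails_on(test, production):
        return "escalate"   # possible latent defect; do not "fix" silently
    if not any(fails_on(test, m) for m in mutants):
        return "vacuous"    # raises coverage but detects nothing
    return "keep"
```

Here `real_test` is kept, `vacuous_test` is rejected even though it would raise line coverage, and `contradicting_test` is escalated for human analysis instead of being adjusted until it passes.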

6.3 Bug fix in forensic mode (failing-test-first, evidence prompting)

Scenario: Field complaint → defect DEF-908 in battery_health_estimator (Class C). Forensic mode enforces reproduce-before-repair.

stateDiagram-v2
    [*] --> Reproduce
    Reproduce --> WriteFailingTest: capture defect as test
    WriteFailingTest --> ConfirmRed: test fails on current code
    ConfirmRed --> RootCause: evidence-prompted analysis
    RootCause --> Fix: bounded minimal diff
    Fix --> Green: failing test now passes
    Green --> Regression: full suite + characterization
    Regression --> DualReview: Class C dual control
    DualReview --> CAPA: link to problem resolution
    CAPA --> [*]

Failing-test-first: the defect is encoded as a test that is red before any fix; this becomes permanent regression evidence.
Evidence prompting: the root-cause step is required to cite specific code-graph nodes, traces, and the failing assertion — no unsupported hypotheses.
Class C: dual human control at review; fix is linked to CAPA / IEC 62304 §9 problem resolution.
Minimal-diff policy: the repair loop is bounded to the smallest change that turns the test green; scope expansion triggers escalation.
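
The reproduce-before-repair rule can be checked mechanically: the defect test must be red on the current code, green on the fix, and the regression suite must stay green. The sketch below uses a toy clamp function in place of the real estimator; `forensic_verdict`, `passes`, and the verdict strings are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable


def passes(test: Callable, impl: Callable) -> bool:
    """True if the test raises no assertion failure against impl."""
    try:
        test(impl)
        return True
    except AssertionError:
        return False


def forensic_verdict(defect_test: Callable, regression_tests: Iterable[Callable],
                     current_impl: Callable, fixed_impl: Callable) -> str:
    if passes(defect_test, current_impl):
        return "not-reproduced"        # test is not red: no repair is allowed yet
    if not passes(defect_test, fixed_impl):
        return "fix-incomplete"
    if not all(passes(t, fixed_impl) for t in regression_tests):
        return "regression"
    return "ready-for-dual-review"


# Toy defect: negative readings are not clamped to zero.
def clamp_buggy(pct):
    return min(pct, 100)


def clamp_fixed(pct):
    return max(0, min(pct, 100))


def defect_test(impl):
    assert impl(-5) == 0


def upper_bound_test(impl):
    assert impl(150) == 100


def passthrough_test(impl):
    assert impl(42) == 42


verdict = forensic_verdict(
    defect_test, [upper_bound_test, passthrough_test], clamp_buggy, clamp_fixed
)
```

With the toy functions, `verdict` is `"ready-for-dual-review"`. Passing the fixed implementation as `current_impl` returns `"not-reproduced"`, which is the case the failing-test-first rule is meant to block.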

6.4 Legacy modernization / framework migration at scale

Scenario: Migrate ~400 modules from a deprecated UI framework to the supported one, behavior-preserving, across Class A/B code.

flowchart TB
    SC[Scope intake] --> GRAPH[Graph-native code understanding<br/>build dependency + call graph]
    GRAPH --> CHAR[Characterization test harvest<br/>pin current behavior]
    CHAR --> PART[Partition into wave batches<br/>by risk + coupling]
    PART --> FAN[Fan-out sub-agent pipeline]
    subgraph perModule[Per-module pipeline]
      MIG[Migration coder] --> EQ{Behavioral equivalence}
      EQ -->|fail| MIG
      EQ -->|pass| RV2[Review agent]
    end
    FAN --> perModule
    RV2 --> HITL[[HITL: batch review]]
    HITL --> INT[Integrator: staged merge]
    INT --> M[(Mainline)]

Graph-native understanding: the migration is planned over the actual code/dependency graph, not file-by-file, so coupling and ordering are respected.
Characterization tests pin pre-migration behavior; the equivalence gate is the deterministic guarantee of behavior preservation.
Batching by risk: Class A modules may use lighter checkpoints; Class B modules retain mandatory diff review; any Class C module in scope keeps dual control.
Cost discipline: per-module budget caps and tier routing (M for mechanical edits, L for complex ones) keep cost-per-green-PR within bounds at scale.
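
Ordering the migration over the dependency graph rather than file by file can be sketched with the standard-library `graphlib` module. The sketch covers only the coupling and ordering aspect, not risk-based batching. It assumes that dependencies are migrated before their dependents; the module names and `plan_waves` are hypothetical.

```python
from graphlib import TopologicalSorter


def plan_waves(depends_on: dict) -> list:
    """Group modules into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a reproducible plan
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical module graph: each key maps to the modules it depends on.
deps = {
    "ui_alarm_panel": {"ui_widgets", "ui_theme"},
    "ui_settings": {"ui_widgets"},
    "ui_widgets": {"ui_theme"},
    "ui_theme": set(),
}
waves = plan_waves(deps)
```

For this graph the plan is `[["ui_theme"], ["ui_widgets"], ["ui_alarm_panel", "ui_settings"]]`. Modules within one wave have no dependencies on each other, so they are candidates for the parallel fan-out.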

7. Human-in-the-loop design

HITL is the mechanism for risk-proportional autonomy (P3). Checkpoint placement is a function of IEC 62304 safety class.

Safety class	Spec sign-off	Diff review	Release approval	Auto-merge on green
Class A	Required (may batch)	Required for non-trivial; conditional-LGTM permitted for low-risk additive change	Single	Permitted only where policy explicitly allows (additive, non-safety, fully gated)
Class B	Required	Mandatory, per-PR diff review	Single	Not permitted
Class C	Required	Mandatory, dual independent reviewers	Dual authority	Never

Design measures to keep HITL effective without inducing approval fatigue:

Dual-control for Class C: two independent qualified humans; the system enforces approver disjointness (no self-approval, no single person satisfying both).
Batching: related low-risk approvals are grouped into a single review surface with shared context, reducing context-switch cost while preserving per-item evidence.
Digital quiet hours: the Orchestrator respects configured quiet windows; checkpoints queued during quiet hours are surfaced at the next active window and are never converted into automatic approvals.
Conditional-LGTM (merge on green): permitted only for Class A, additive, non-safety-critical changes that pass the full deterministic gate chain and the release gate; explicitly disabled for Class B/C. Every conditional-LGTM merge still produces a full evidence record and is sampled into the audit pipeline.
Escalation routing: abstentions and budget breaches route to a named human queue, not into a silent retry storm.
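
Approver disjointness for diff review is straightforward to enforce deterministically. The sketch below is a simplification: it encodes one required approver for Class A and B and two for Class C, and it ignores the conditional-LGTM path for Class A. `REQUIRED_APPROVERS` and `approval_problems` are hypothetical names.

```python
from typing import Iterable, List

# Simplified reading of the diff-review column in the table above.
REQUIRED_APPROVERS = {"A": 1, "B": 1, "C": 2}


def approval_problems(safety_class: str, author: str,
                      approvers: Iterable[str]) -> List[str]:
    """Return the reasons an approval set is insufficient (empty = acceptable)."""
    distinct = set(approvers)          # the same person counted once
    problems = []
    if author in distinct:
        problems.append("self-approval")
    independent = distinct - {author}
    if len(independent) < REQUIRED_APPROVERS[safety_class]:
        problems.append("not enough independent approvers")
    return problems
```

A Class C change approved twice by the same reviewer fails the check, because duplicates collapse to one identity before counting.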

8. Continuous code-review and repo-watcher agents

Continuous agents observe the repository and act on events (PR opened, push, scheduled scan). They are deployed in three tiers by integration depth.

Tier	Deployment	Trigger	Catches	Evidence posting
T1 — managed-equivalent	Self-hosted equivalent of a managed review bot	PR opened/updated	Style, obvious bugs, lint/standards drift, secret leaks	Inline PR comments + evidence record
T2 — hybrid CI-triggered	Review agent invoked from CI pipeline	CI stage on PR/push	Above + spec deviation, coverage/mutation regressions, risk-control violations	CI check + structured findings to evidence store
T3 — custom A2A	Full Orchestrator sub-agent in the A2A mesh	Event or schedule (repo-watcher)	Above + cross-module/arch drift, dependency risk, traceability gaps; can open remediation PRs	Signed findings, optional auto-PR, OTel trajectory

All tiers route every proposed action through the policy server and post evidence to the Part 11 store. Repo-watcher agents (T3) operate under strict budget caps and an action allowlist; they may propose remediation PRs but never merge to Class B/C without the §7 HITL path. Findings are linked to requirements/risk items for traceability.

9. Workflow governance

A workflow is a validated software item in its own right and is managed under the QMS (ISO 13485/QMSR, GAMP 5).

Versioning. Each workflow (Argo template + agent configs + rule files + tool permissions) is a content-addressed, semantically versioned artifact. The exact model+harness versions used by a run are pinned and recorded (P7 reproducibility).

Validation (tie to 05). Before promotion, a workflow is validated against the deterministic evaluation harness: golden task sets, gate-correctness measurement toward the ≥99.9% release-gate property, abstention calibration, and cost envelope. Validation evidence is part of the workflow's release record. CSA-aligned, risk-based validation depth scales with the highest safety class the workflow may touch.

Change control. Workflow changes follow the same change-control and approval path as code: proposed diff → review → eval gate → approval → versioned release. A workflow change that alters gate behavior or autonomy level requires re-validation.

flowchart LR
    WF[Workflow change proposal] --> RV[Review]
    RV --> EV[Eval harness validation<br/>→ 05]
    EV -->|meets ≥99.9% + cost| AP[Approval / change control]
    EV -->|fails| WF
    AP --> REL[Versioned workflow release]
    REL --> REG[(Registry — pinned)]

Cost guardrails in-loop (tie to 08). Per-step and per-workflow token/GPU/wall-clock budgets are enforced at runtime by the orchestrator and hooks. Breach behavior is deterministic: pause → abstain → escalate. Cost telemetry feeds the cost-per-green-PR metric (P6) and the economics dashboards.
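
A runtime budget with a fixed breach response could be modeled as below. The caps and units are illustrative assumptions, and `Budget`, `BREACH_RESPONSE`, and `after_step` are hypothetical names; the point of the sketch is that the response to a breach is a constant sequence, not a model decision.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_tokens: int
    max_gpu_seconds: float
    max_wall_seconds: float
    tokens: int = 0
    gpu_seconds: float = 0.0
    wall_seconds: float = 0.0

    def charge(self, tokens: int = 0, gpu_seconds: float = 0.0,
               wall_seconds: float = 0.0) -> None:
        """Record the resources consumed by one step."""
        self.tokens += tokens
        self.gpu_seconds += gpu_seconds
        self.wall_seconds += wall_seconds

    def breaches(self) -> list:
        """Names of every cap that has been exceeded."""
        limits = (
            ("tokens", self.tokens, self.max_tokens),
            ("gpu_seconds", self.gpu_seconds, self.max_gpu_seconds),
            ("wall_seconds", self.wall_seconds, self.max_wall_seconds),
        )
        return [name for name, used, cap in limits if used > cap]


# The breach response is a fixed sequence, mirroring pause, abstain, escalate.
BREACH_RESPONSE = ("pause", "abstain", "escalate")


def after_step(budget: Budget) -> tuple:
    return BREACH_RESPONSE if budget.breaches() else ("continue",)
```

The same structure can be instantiated per step and per workflow, with each step charging both, so that either cap can trigger the breach response.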

Failure handling and rollback. Every workflow defines: (a) idempotent steps over ephemeral sandbox branches; (b) a one-command rollback to the last known-good mainline state; (c) a quarantine path for flaky/unstable artifacts; (d) escalation to HITL on repeated gate failure. No partial state ever reaches mainline — integration is transactional behind the release gate.

Failure	Detection	Response
Gate fail (transient)	Verifier	Bounded repair loop
Iteration/budget breach	Orchestrator counter	Abstain → escalate to HITL
Non-reproducible result	Eval/replay	Freeze pinned versions; block promotion; investigate
Bad merge reaches mainline	Release-gate audit / repo-watcher	Automated rollback + CAPA

10. Anti-patterns

Anti-pattern	Why it is dangerous	Mitigation in this platform
Unbounded loops	Runaway cost; non-terminating agents; eroded determinism	WD-1/WD-5 hard iteration + budget caps; abstain on breach
Multi-file autonomous edits without diff review on Class B/C	Unreviewed safety-relevant change reaches mainline	§7 mandatory diff review; auto-merge disabled for B/C; policy server diff-size + path gates
Agent-to-agent error amplification	One sub-agent's hallucination becomes the next's "fact"; compounding error across A2A	WD-2 deterministic gate at every handoff; no probabilistic output flows ungated; evidence carries provenance
Context fragmentation	Sub-agents work from inconsistent/partial context; divergent assumptions	Single source-of-truth (`specs/`, `AGENTS.md`); content-addressed context manifests on every A2A handoff; durable shared memory
Confident-but-wrong output (abstention suppression)	Agent guesses instead of abstaining, so a confident-but-wrong output reaches a gate	WD-6 typed `ABSTAIN`; calibration validated in `05`
Mode misuse (Conductor for batch)	Latency/cost mismatch; weak evidence	Policy-server routing by class/size/latency (§2)

Cross-references

01-requirements.md · 02-maturity-model.md · 03-reference-architecture.md · 04-model-strategy-and-finetuning.md · 05-evaluation-and-validation.md · 07-security-and-compliance.md · 08-token-and-gpu-economics.md · 09-adoption-roadmap.md