
06 — Agentic Workflows

Figure C — Multi-Agent Workflow (gated & evidenced). Stages: Planner, Spec / Story, Coder, Test, Review, Integrator, with eval gates between stages, human review, and human sign-off (dual for Class C). A policy server applies structural + semantic gating on every tool call; the MCP tool plane is sandboxed (gVisor / Kata, egress-deny); immutable evidence is captured at every step into the 21 CFR Part 11 record.

Part of Agentic-Native SDLC for Regulated Medical Device Engineering.

Status: Working draft — May 2026. All numeric thresholds are placeholders pending calibration in 05-evaluation-and-validation.md.

Scope: how individual and multi-agent workflows are designed, bounded, gated, and governed across the SDLC under IEC 62304, ISO 13485/QMSR, ISO 14971, FDA CSA, 21 CFR Part 11, and GAMP 5.

This document operationalizes the seven principles into concrete, buildable workflows. It assumes the reference architecture in 03-reference-architecture.md (K8s, MCP tool plane, A2A, Argo Workflows, gVisor/Kata sandboxes, policy server, deterministic hooks, durable memory, OpenTelemetry trajectory tracing) and the model fleet tiers (S/M/L/V/E) in 04-model-strategy-and-finetuning.md.


1. Workflow design principles in a regulated setting

A workflow is a versioned, validated, change-controlled artifact — not an ad-hoc prompt chain. Every workflow MUST satisfy the following structural invariants before it is permitted to execute against a Class A/B/C codebase.

| # | Invariant | Mechanism | Principle | Failure mode it prevents |
|---|---|---|---|---|
| WD-1 | Bounded tasks | Each agent step has an explicit input contract, output schema, max-iteration count, and acceptance predicate. No open-ended "do the thing" steps. | P1, P5 | Unbounded loops; scope creep |
| WD-2 | Deterministic gates between steps | Inter-step transitions are mediated by deterministic verifiers (compilers, linters, test runners, schema validators, policy server). Probabilistic output never flows to the next step ungated. | P2 | Error propagation; non-reproducible state |
| WD-3 | HITL checkpoints by safety class | Checkpoint density scales with IEC 62304 class; Class C requires dual human control. | P3 | Inappropriate autonomy on safety-critical code |
| WD-4 | Evidence capture per step | Every step emits a signed evidence record (inputs, model+harness version, tool calls, gate results, diffs, approver) to the Part 11 evidence store. | P4 | Untraceable changes; audit gaps |
| WD-5 | Budget caps per loop | Token + GPU + wall-clock budgets enforced per step and per workflow; breach → abstain/escalate, never silently truncate. | P6 | Runaway cost; "cost-per-green-PR" blowout |
| WD-6 | Abstention / escalation | Agents emit a typed ABSTAIN(reason, evidence) rather than guessing when confidence/coverage falls below threshold. Escalation routes to HITL or a higher model tier. | P1, P3 | Confident-but-wrong output reaching a gate |
| WD-7 | Idempotency / rollback | Every mutating action is idempotent and reversible: ephemeral sandbox branches, content-addressed artifacts, transactional commits, one-command rollback. | P7 | Partial/irreversible damage to mainline |
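
As a rough illustration of WD-4, the sketch below builds a per-step evidence record and seals it so that later tampering is detectable. The field names follow the WD-4 row; the `EvidenceRecord` class, the `seal` and `verify` helpers, and the use of HMAC-SHA256 with a shared key are assumptions made for this example. They stand in for whatever signing service the evidence store actually uses and are not a Part 11 electronic-signature implementation.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical per-step record; fields mirror the WD-4 row."""
    step: str
    inputs_digest: str
    model_version: str
    harness_version: str
    tool_calls: tuple
    gate_results: dict
    diff_digest: str
    approver: str


def canonical_bytes(record: EvidenceRecord) -> bytes:
    # Sorted keys and fixed separators make the serialization reproducible.
    return json.dumps(asdict(record), sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal(record: EvidenceRecord, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical record bytes."""
    return hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify(record: EvidenceRecord, key: bytes, signature: str) -> bool:
    """Constant-time comparison of the recomputed tag with the stored one."""
    return hmac.compare_digest(seal(record, key), signature)


# Illustrative values only.
demo_key = b"demo-key-not-for-production"
record = EvidenceRecord(
    step="coder",
    inputs_digest="sha256:aa",
    model_version="M-1.0",
    harness_version="h-0.3",
    tool_calls=("run_build", "run_unit_tests"),
    gate_results={"build": "pass", "unit": "pass"},
    diff_digest="sha256:bb",
    approver="reviewer-1",
)
signature = seal(record, demo_key)
tampered = replace(record, approver="someone-else")
```

Changing any field, here the approver, produces a record that no longer verifies against the stored signature, which is the property the evidence chain relies on.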

The harness is the product (P5). Workflows are expressed declaratively as Argo Workflow templates whose steps invoke agents through the harness; the LLM is a swappable component (see §3, Agent = Model + Harness). The cost-per-green-PR metric (P6) is the primary economic objective function for every workflow (see 08-token-and-gpu-economics.md).

flowchart LR
    G[Generate<br/>probabilistic] --> V[Verify<br/>deterministic gate]
    V -->|fail| R[Repair<br/>bounded loop]
    R --> V
    V -->|pass| GA[Gate<br/>policy + eval + HITL]
    GA -->|reject| R
    GA -->|approve| M[(Mainline)]
    V -->|budget/iter breach| ESC[Abstain → Escalate]
    R -->|budget/iter breach| ESC
    classDef det fill:#e8f0fe,stroke:#1a73e8;
    classDef prob fill:#fce8e6,stroke:#d93025;
    class V,GA det;
    class G,R prob;

This Generate → Verify → Repair → Gate loop is the atomic unit of every workflow in this document. It is how the ≥99.9% release-gate correctness target (P1) is realized as a system property: individual model calls are unreliable, so reliability is engineered into the loop that surrounds them.
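
A minimal sketch of the bounded Generate → Verify → Repair part of the loop follows. It assumes verifiers are plain callables that return an error message or `None`, and that the generator receives the previous failures as repair context. `run_bounded_loop`, `Candidate`, and `Abstain` are hypothetical names, and the iteration cap of 3 is a placeholder, not a calibrated value. The policy, eval, and HITL gate that follows a passing candidate is out of scope here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Abstain:
    """Typed abstention (WD-6): a reason plus the last gate failures."""
    reason: str
    evidence: tuple


@dataclass(frozen=True)
class Candidate:
    """An artifact that passed every deterministic verifier."""
    artifact: str
    attempts: int


Verifier = Callable[[str], Optional[str]]  # returns an error message, or None on pass


def run_bounded_loop(
    generate: Callable[[str, list], str],
    verifiers: list,
    task: str,
    max_iterations: int = 3,  # placeholder cap (WD-1)
) -> Union[Candidate, Abstain]:
    failures: list = []
    for attempt in range(1, max_iterations + 1):
        # Probabilistic step: first call generates, later calls repair.
        artifact = generate(task, failures)
        # Deterministic gate: collect every verifier's error message.
        failures = [msg for check in verifiers if (msg := check(artifact)) is not None]
        if not failures:
            return Candidate(artifact, attempt)
    # Cap reached: abstain with evidence instead of returning a guess.
    return Abstain("iteration cap reached", tuple(failures))


# Stand-ins for a model and a compiler-like verifier.
def fake_model(task: str, failures: list) -> str:
    return "return x + 1" if failures else "return x +"


def no_dangling_operator(artifact: str) -> Optional[str]:
    return "syntax error" if artifact.endswith("+") else None


result = run_bounded_loop(fake_model, [no_dangling_operator], "increment x")
stuck = run_bounded_loop(fake_model, [lambda a: "always fails"], "increment x", max_iterations=2)
```

In this toy run the first attempt fails the verifier, the repair attempt passes, and `result` is a `Candidate` produced on attempt 2. With a verifier that can never pass, the loop returns `Abstain` after the cap rather than looping indefinitely.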


2. Operating modes: Conductor vs Orchestrator

Two operating modes cover the spectrum from synchronous developer assistance to autonomous async batch work. A given task is routed to a mode by the policy server based on safety class, change size, and required latency.

| Dimension | Conductor (real-time, in-IDE) | Orchestrator (async multi-agent) |
|---|---|---|
| Interaction model | Synchronous; human in the loop continuously | Asynchronous; human at defined checkpoints |
| Latency target | Sub-second to seconds | Minutes to hours |
| Primary surface | IDE / editor / terminal | Argo Workflows, A2A mesh, CI |
| Concurrency | Single active agent, human-paced | Many sub-agents in parallel (Planner/Coder/Test/…) |
| Control granularity | Per-edit, per-tool-call (human approves inline) | Per-stage gate + HITL checkpoints |
| Typical model tier | S/M (low latency), V for visual context | M/L (capable), E for hard reasoning; routed per step |
| Tooling | MCP tools scoped to open repo; pre-tool/post-edit hooks | Full MCP plane; sandbox per sub-agent; policy server inline |
| Memory | Session-scoped; durable per-developer | Durable shared workflow memory + per-agent scratch |
| Evidence | Lightweight (suggestion accepted/rejected) | Full per-step signed evidence chain |
| Best for | Pair-style implementation, refactor-in-place, explain, local test authoring | Feature delivery, coverage expansion, migrations, repo-watching, batch fixes |
| Autonomy ceiling | Class A/B with human approving each apply | Up to L4 within validated bounds; Class C still dual-control |
| ASMM-Med fit | L1–L2 | L2–L5 |
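
The routing decision described above could look roughly like the sketch below. The three inputs (safety class, change size, required latency) come from the text; the `TaskProfile` fields, the file-count proxy for change size, the `MAX_CONDUCTOR_FILES` threshold, and the rule that Class C always goes to the Orchestrator are illustrative assumptions, not the platform's actual policy.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    CONDUCTOR = "conductor"
    ORCHESTRATOR = "orchestrator"


@dataclass(frozen=True)
class TaskProfile:
    safety_class: str   # "A", "B" or "C" (IEC 62304)
    changed_files: int  # assumed proxy for change size
    interactive: bool   # True when the developer is waiting on the result


# Hypothetical threshold; a real deployment would calibrate this.
MAX_CONDUCTOR_FILES = 5


def route(task: TaskProfile) -> Mode:
    if task.safety_class not in {"A", "B", "C"}:
        raise ValueError(f"unknown safety class: {task.safety_class!r}")
    # Assumption: the Conductor ceiling is Class A/B, so Class C is always
    # routed to the per-stage gated Orchestrator.
    if task.safety_class == "C":
        return Mode.ORCHESTRATOR
    # Small, latency-sensitive work stays in the IDE.
    if task.interactive and task.changed_files <= MAX_CONDUCTOR_FILES:
        return Mode.CONDUCTOR
    return Mode.ORCHESTRATOR
```

For example, `route(TaskProfile("B", 2, True))` selects the Conductor, while a 40-file batch change or any Class C task is sent to the Orchestrator.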

A workflow may hand off between modes: a developer in Conductor mode dispatches a bounded task to the Orchestrator ("expand coverage on pump_controller"), reviews the resulting PR, and pulls it back into Conductor for final touch-ups. All handoffs are A2A messages with attached context manifests.

sequenceDiagram
    participant Dev as Developer (IDE)
    participant C as Conductor Agent
    participant O as Orchestrator (Argo)
    participant Sub as Sub-agents (A2A)
    Dev->>C: "implement story PUMP-142"
    C->>C: scope check (safety class B)
    C->>O: dispatch bounded task + context manifest
    O->>Sub: fan-out Planner/Coder/Test/Review
    Sub-->>O: gated PR + evidence
    O-->>Dev: HITL checkpoint (PR review)
    Dev->>C: pull back for local refinement

3. Agent anatomy — Agent = Model + Harness

The platform's first axiom: an agent is not a model. An agent is a model wrapped in a deterministic harness that supplies capability, constraint, and evidence. The model is the only probabilistic component; everything else is engineered software under change control.

flowchart TB
    subgraph Agent
      direction TB
      M[Model — fleet tier S/M/L/V/E<br/>self-hosted, fine-tuned, open-weight]
      subgraph Harness
        I[Instructions / rule files<br/>AGENTS.md, specs/, BDD/Gherkin]
        T[MCP tool plane<br/>typed, permissioned tools]
        S[Sandbox<br/>gVisor/Kata, ephemeral, egress-deny]
        OR[Orchestration<br/>Argo steps + A2A handoffs]
        H[Hooks / guardrails<br/>pre-tool, post-edit, pre-commit]
        ME[Memory<br/>durable sessions + scoped scratch]
        OB[Observability<br/>OpenTelemetry trajectory tracing]
        PS[Policy server<br/>structural + semantic gating]
      end
    end
    M <--> I
    M <--> T
    T --> PS
    T --> S
    M --> H
    M --> ME
    Agent --> OB

| Component | Role | Determinism | Cross-ref |
|---|---|---|---|
| Model (S/M/L/V/E) | Generation, reasoning, repair proposals | Probabilistic | 04 |
| Instructions / rule files | AGENTS.md, specs/, BDD/Gherkin define behavior, conventions, constraints | Deterministic source-of-truth | 01 |
| MCP tools | Typed, permissioned actions (read/edit/test/build/query) | Deterministic interface | 03 |
| Sandbox | gVisor/Kata ephemeral env; egress-deny; per-task | Deterministic isolation | 07 |
| Orchestration | Argo step graph + A2A handoffs | Deterministic | 03 |
| Hooks / guardrails | Lifecycle interceptors (pre-tool, post-edit, pre-commit) | Deterministic | §9 |
| Memory | Durable session/workflow memory + scoped scratch | Deterministic store | 03 |
| Observability | OTel trajectory spans for every tool call and gate | Deterministic | 05 |
| Policy server | Structural + semantic gating, intercepts every tool call | Deterministic | 07 |

The policy server sits between the model and every tool: a model may propose a write_file or run_command, but the action only executes if it passes structural rules (path allowlist, diff size, no secret egress) and semantic rules (change consistent with the active spec). This is how "determinism wraps probabilism" (P2) is enforced at the tool boundary.
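
To make the structural half of this check concrete, here is a small sketch covering the three rules named above: path allowlist, diff size, and secret egress. The `ToolCall` and `Decision` types, the allowlist globs, the 200-line limit, and the secret regex are all hypothetical values chosen for the example. Semantic rules (consistency with the active spec) are not modeled.

```python
import posixpath
import re
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class ToolCall:
    tool: str     # e.g. "write_file"
    path: str     # repo-relative target path
    content: str  # proposed file content or diff


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reasons: tuple


# Illustrative policy values; none of these come from the platform itself.
PATH_ALLOWLIST = ("src/*", "tests/*")
MAX_CHANGED_LINES = 200
SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key|password|secret)\s*=\s*\S+")


def check_structural(call: ToolCall) -> Decision:
    reasons = []
    # Normalize first so "src/../x" cannot slip past the allowlist.
    normalized = posixpath.normpath(call.path)
    if not any(fnmatch(normalized, pattern) for pattern in PATH_ALLOWLIST):
        reasons.append(f"path not in allowlist: {normalized}")
    if len(call.content.splitlines()) > MAX_CHANGED_LINES:
        reasons.append("diff exceeds size limit")
    if SECRET_PATTERN.search(call.content):
        reasons.append("possible secret in content")
    # The action executes only when no rule objected.
    return Decision(allowed=not reasons, reasons=tuple(reasons))


ok = check_structural(ToolCall("write_file", "src/pump/alarm.py", "threshold = 5\n"))
blocked = check_structural(ToolCall("write_file", "src/../deploy/key.py", "API_KEY = abc123"))
```

The decision carries every violated rule rather than stopping at the first one, so the reasons can be attached to the evidence record and fed back to the agent as repair context.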


4. Multi-agent decomposition pattern

The Orchestrator decomposes a unit of work into specialized sub-agents connected by A2A handoffs. Each handoff crosses a deterministic gate. Humans sign at safety-class-appropriate checkpoints.

flowchart LR
    H[Human request / story] --> PL[Planner]
    PL -->|A2A: plan| SP[Spec / Story agent]
    SP -->|A2A: spec + Gherkin| G1{Spec gate<br/>policy + eval}
    G1 -->|approve| HC1[[HITL: spec sign-off]]
    HC1 --> CO[Coder]
    CO -->|A2A: diff| G2{Build/lint gate}
    G2 --> TE[Test agent]
    TE -->|A2A: tests+coverage| G3{Test+coverage gate}
    G3 --> RV[Review agent]
    RV -->|A2A: findings| G4{Review gate<br/>policy + eval}
    G4 --> HC2[[HITL: PR approval<br/>Class C = dual control]]
    HC2 --> IN[Integrator]
    IN -->|A2A: merge req| G5{Release gate ≥99.9%}
    G5 --> M[(Mainline / deploy)]
    G1 -.reject.-> SP
    G2 -.reject.-> CO
    G3 -.reject.-> CO
    G4 -.reject.-> CO
  • Policy server is invoked inside every gate (G1–G5) and on every tool call within each sub-agent.
  • Eval gates (deterministic evaluation harness, 05) sit at G1, G4, G5.
  • Humans sign at HC1 (spec) and HC2 (PR); Class C requires two independent approvers at HC2 (P3).
  • A2A messages are content-addressed and carry the evidence manifest forward so the Integrator can assemble a complete Part 11 record.
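
The last point, content-addressed messages that carry the evidence manifest forward, can be sketched as follows. The `A2AMessage` shape, the `hand_off` helper, and the choice of SHA-256 over canonical JSON are assumptions for illustration; the actual A2A wire format is defined elsewhere.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def content_address(body: dict) -> str:
    """Digest of a canonical JSON encoding, so equal content gives equal IDs."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


@dataclass(frozen=True)
class A2AMessage:
    sender: str
    payload: dict
    evidence_manifest: tuple  # digests of all upstream messages, oldest first
    digest: str


def hand_off(sender: str, payload: dict, upstream: Optional[A2AMessage] = None) -> A2AMessage:
    # Carry the upstream manifest forward and append the upstream digest.
    carried = () if upstream is None else upstream.evidence_manifest + (upstream.digest,)
    body = {"sender": sender, "payload": payload, "evidence": list(carried)}
    return A2AMessage(sender, payload, carried, content_address(body))


plan = hand_off("planner", {"plan": "add configurable alarm threshold"})
spec = hand_off("spec", {"gherkin": "Given a threshold ..."}, plan)
diff = hand_off("coder", {"diff": "+ threshold = cfg.alarm_threshold"}, spec)
```

By the time a message reaches the Integrator, its manifest lists the digest of every upstream step in order, which is what allows the full record to be reassembled and checked.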

5. Per-SDLC-phase workflows

Sub-template used for every phase: Goal · Agent role · Inputs/context · Tools · Deterministic gates · HITL checkpoint · Evidence · IEC 62304 activity · Model tier.

5.1 Requirements / planning

| Field | Value |
|---|---|
| Goal | Convert stakeholder intent into structured, testable requirements + acceptance criteria |
| Agent role | Spec/Story agent (decompose, normalize, detect ambiguity/conflict) |
| Inputs / context | Intake notes, existing specs/, risk file (ISO 14971), product requirements |
| Tools | requirements_db, traceability_query, risk_register, MCP doc retrieval |
| Deterministic gates | Schema validation of requirement records; traceability completeness check; duplicate/conflict linter |
| HITL checkpoint | Requirements review board sign-off (mandatory all classes) |
| Evidence | Versioned requirements set, ambiguity report, trace links, approver record |
| IEC 62304 | 5.2 Software requirements analysis |
| Model tier | L (reasoning over ambiguity); E for high-risk Class C decomposition |

5.2 Design / architecture

| Field | Value |
|---|---|
| Goal | Produce software architecture + detailed design consistent with requirements and risk controls |
| Agent role | Design agent (architecture proposal, interface contracts, design rationale) |
| Inputs / context | Approved requirements, architecture standards, risk controls, existing design docs |
| Tools | architecture_model, diagram_gen, interface_registry, dependency graph query |
| Deterministic gates | Interface contract validation; architecture rule checks; risk-control coverage check |
| HITL checkpoint | Design review (mandatory); Class C requires safety reviewer |
| Evidence | Design records, interface specs, design-to-requirement trace, decision log |
| IEC 62304 | 5.3 Software architectural design; 5.4 detailed design |
| Model tier | L; V for diagram/visual artifacts |

5.3 Implementation

| Field | Value |
|---|---|
| Goal | Implement units to satisfy design + spec, passing all deterministic gates |
| Agent role | Coder (generate diff in sandbox, self-repair against gates) |
| Inputs / context | Approved design, specs/, AGENTS.md, target files, dependency graph |
| Tools | read_file, write_file, run_build, run_lint, run_unit_tests, static_analyzer |
| Deterministic gates | Compile/build, lint, static analysis (SAST), unit tests, diff-size policy |
| HITL checkpoint | Diff review on PR; Class B/C never auto-merged without diff review |
| Evidence | Diff, build/test logs, static-analysis report, model+harness version |
| IEC 62304 | 5.5 Software unit implementation and verification |
| Model tier | M (default); L for complex units; routed by complexity |

stateDiagram-v2
    [*] --> Plan
    Plan --> Generate: bounded task
    Generate --> Verify
    Verify --> Repair: gate fail
    Repair --> Verify
    Verify --> Abstain: iter/budget breach
    Verify --> PR: all gates green
    Abstain --> Escalate
    PR --> [*]: diff review (HITL)

5.4 Test / QA

| Field | Value |
|---|---|
| Goal | Author/extend verification tests; achieve coverage + behavioral targets |
| Agent role | Test agent (generate tests, mutation-check, close coverage gaps) |
| Inputs / context | Code under test, requirements/Gherkin, existing tests, coverage baseline |
| Tools | run_tests, coverage_tool, mutation_tester, requirements_trace |
| Deterministic gates | Tests pass; coverage ≥ threshold (placeholder); mutation score ≥ threshold; no flaky/quarantine regressions |
| HITL checkpoint | QA lead review of new test suite; Class C verifies requirement-to-test trace |
| Evidence | Test suite diff, coverage delta, mutation report, requirement-test trace matrix |
| IEC 62304 | 5.5 unit verification; 5.6 integration testing; 5.7 system testing |
| Model tier | M; L for hard property/edge-case synthesis |

flowchart LR
    C[Code under test] --> TA[Test agent]
    TA --> GEN[Generate tests]
    GEN --> R{Run tests}
    R -->|fail to author| TA
    R -->|pass| COV{Coverage ≥ θ?}
    COV -->|no| TA
    COV -->|yes| MUT{Mutation ≥ θ?}
    MUT -->|no| TA
    MUT -->|yes| TR{Req-trace complete?}
    TR -->|yes| QA[[HITL: QA review]]
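
The gate chain in the diagram above is evaluated in a fixed order, and the first failing gate decides what is sent back to the Test agent. A minimal sketch under stated assumptions: `SuiteReport` is a hypothetical summary of one test run, and both θ values are placeholders pending calibration, not recommended thresholds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SuiteReport:
    """Hypothetical summary of one test-agent run."""
    failed_tests: int
    coverage: float                # fraction in [0, 1]
    mutation_score: float          # fraction in [0, 1]
    untraced_requirements: tuple   # requirement IDs with no linked test


# Placeholder thresholds (theta); real values come from calibration.
COVERAGE_THETA = 0.80
MUTATION_THETA = 0.60


def first_failing_gate(report: SuiteReport) -> Optional[str]:
    """Return the name of the first gate that fails, or None if all pass."""
    gates = [
        ("tests", report.failed_tests == 0),
        ("coverage", report.coverage >= COVERAGE_THETA),
        ("mutation", report.mutation_score >= MUTATION_THETA),
        ("req-trace", not report.untraced_requirements),
    ]
    for name, passed in gates:
        if not passed:
            return name
    return None  # all gates green: hand over to HITL QA review
```

A return value of `None` corresponds to the path into the QA review checkpoint; any other value names the gate whose failure should be fed back as repair context.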

5.5 Code review

| Field | Value |
|---|---|
| Goal | Detect defects, spec deviations, risk-control violations before merge |
| Agent role | Review agent (semantic diff review, standards + risk checks) |
| Inputs / context | PR diff, spec, design, coding standards, risk controls, prior findings |
| Tools | diff_view, policy_check, standards_linter, trace_query, sec_review |
| Deterministic gates | Policy server semantic gate; standards lint; no unresolved high-severity findings |
| HITL checkpoint | Human reviewer approves; Class C dual control; conditional-LGTM only where permitted (§7) |
| Evidence | Review findings, resolution log, approver(s), gate results |
| IEC 62304 | 5.5/5.6 verification; supports 9 problem resolution |
| Model tier | L (judgment); M for routine diffs |

sequenceDiagram
    participant PR as Pull Request
    participant RA as Review Agent
    participant PS as Policy Server
    participant Ev as Eval Gate
    participant Hu as Human Reviewer(s)
    PR->>RA: diff + context
    RA->>PS: semantic + structural check
    PS-->>RA: pass/findings
    RA->>Ev: deterministic review eval
    Ev-->>RA: score ≥ θ
    RA->>Hu: findings + recommendation
    Hu-->>PR: approve (dual for Class C)

5.6 Deployment

| Field | Value |
|---|---|
| Goal | Promote validated build through release gate to target environment |
| Agent role | Integrator/Release agent (assemble release record, drive pipeline) |
| Inputs / context | Approved PR, full evidence chain, release checklist, change-control record |
| Tools | argo_pipeline, release_gate, evidence_store, signing_service, deploy |
| Deterministic gates | Release gate ≥99.9% correctness; evidence completeness; signed approvals present |
| HITL checkpoint | Release authority sign-off (mandatory); Class C dual authority |
| Evidence | Signed release record, DHF/Part 11 package, deploy manifest, rollback plan |
| IEC 62304 | 5.8 software release |
| Model tier | S/M (orchestration, not generation) |

5.7 Maintenance / legacy modernization

| Field | Value |
|---|---|
| Goal | Remediate defects; modernize legacy code with behavior preservation |
| Agent role | Maintenance/Migration sub-agent pipeline (graph-native understanding) |
| Inputs / context | Defect report or migration scope, code graph, characterization tests, risk file |
| Tools | code_graph_query, characterization_tests, run_tests, diff_view, equivalence_check |
| Deterministic gates | Characterization tests pass pre/post; behavioral equivalence; coverage maintained |
| HITL checkpoint | Change review; Class C dual control; CAPA linkage for defects |
| Evidence | Defect-to-fix trace, before/after behavior proof, migration manifest |
| IEC 62304 | 6 software maintenance; 9 problem resolution |
| Model tier | L (analysis) + M (bulk edits); E for hard equivalence reasoning |

6. Concrete worked example workflows

6.1 Feature implementation on a Class B module

Scenario: Story PUMP-142 adds a configurable alarm threshold to infusion_rate_monitor (Class B).

| Step | Mode/agent | Action | Gate | Evidence |
|---|---|---|---|---|
| 1 | Orchestrator / Spec | Decompose story → spec + Gherkin acceptance | Schema + trace | Spec record, trace links |
| 2 | HITL | Spec sign-off (single approver, Class B) | | Approver record |
| 3 | Coder (M) | Implement in sandbox; self-repair vs build/lint/unit | Compile, lint, SAST, unit | Diff, logs |
| 4 | Test (M) | Add tests for new branches; close coverage | Coverage ≥ θ, mutation ≥ θ | Coverage delta, mutation report |
| 5 | Review (L) | Semantic review vs spec + risk controls | Policy semantic gate | Findings + resolutions |
| 6 | HITL | PR diff review (mandatory, single for Class B) | | Approval |
| 7 | Integrator (S) | Release gate, assemble record | ≥99.9% release gate | Signed release record |

Budget cap: workflow aborts and escalates if token+GPU spend exceeds the per-PR cap (P6, 08). No auto-merge — Class B requires human diff review (§10 anti-pattern).

6.2 AI-generated test-coverage expansion

Scenario: Raise coverage on dosage_calculator from 71% to ≥ target without changing behavior.

flowchart LR
    BL[Baseline coverage] --> GAP[Coverage-gap analysis<br/>code graph]
    GAP --> TA[Test agent: synth tests]
    TA --> RUN{Tests green?}
    RUN -->|no| TA
    RUN -->|yes| FLK{Flaky? quarantine check}
    FLK -->|stable| MUT{Mutation ≥ θ}
    MUT -->|yes| BEH{No behavior change<br/>vs baseline}
    BEH -->|confirmed| QA[[HITL: QA lead]]
    BEH -->|drift| TA

Key controls: tests must be additive and behavior-preserving — any test that would have failed against unchanged production code is flagged as a latent defect and escalated rather than silently "fixed." Mutation testing guards against vacuous tests. Evidence: coverage delta, mutation score, requirement-test trace.
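
The two controls above reduce to simple deterministic checks. In the sketch below, `triage_new_tests` separates newly generated tests by their result on the unchanged production code, and `mutation_score` uses the common killed-over-total definition. Both function names and the input shapes are assumptions for illustration; real tooling may, for instance, exclude equivalent mutants from the denominator.

```python
def triage_new_tests(baseline_results: dict) -> dict:
    """Split new tests by outcome on UNCHANGED production code.

    baseline_results maps test name -> True (passed) / False (failed).
    A failing new test points at a possible latent defect and is escalated;
    it is never "fixed" by editing the test or the code inside this workflow.
    """
    return {
        "accept": sorted(name for name, ok in baseline_results.items() if ok),
        "escalate_latent_defect": sorted(
            name for name, ok in baseline_results.items() if not ok
        ),
    }


def mutation_score(killed: int, total: int) -> float:
    """Fraction of generated mutants detected (killed) by the test suite."""
    if total <= 0:
        raise ValueError("total mutants must be positive")
    if not 0 <= killed <= total:
        raise ValueError("killed must be between 0 and total")
    return killed / total


# Illustrative data only.
triage = triage_new_tests(
    {"test_zero_dose": True, "test_max_rate": True, "test_negative_weight": False}
)
score = mutation_score(killed=42, total=50)
```

Here `test_negative_weight` would be routed to a human as a suspected latent defect, and the suite's mutation score of 0.84 would then be compared against the calibrated θ.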

6.3 Bug fix in forensic mode (failing-test-first, evidence prompting)

Scenario: Field complaint → defect DEF-908 in battery_health_estimator (Class C). Forensic mode enforces reproduce-before-repair.

stateDiagram-v2
    [*] --> Reproduce
    Reproduce --> WriteFailingTest: capture defect as test
    WriteFailingTest --> ConfirmRed: test fails on current code
    ConfirmRed --> RootCause: evidence-prompted analysis
    RootCause --> Fix: bounded minimal diff
    Fix --> Green: failing test now passes
    Green --> Regression: full suite + characterization
    Regression --> DualReview: Class C dual control
    DualReview --> CAPA: link to problem resolution
    CAPA --> [*]
  • Failing-test-first: the defect is encoded as a test that is red before any fix; this becomes permanent regression evidence.
  • Evidence prompting: the root-cause step is required to cite specific code-graph nodes, traces, and the failing assertion — no unsupported hypotheses.
  • Class C: dual human control at review; fix is linked to CAPA / IEC 62304 §9 problem resolution.
  • Minimal-diff policy: the repair loop is bounded to the smallest change that turns the test green; scope expansion triggers escalation.
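
One way to enforce reproduce-before-repair is to make the stage order itself a deterministic check, as in this sketch. The stage names are taken from the state diagram above; the `ForensicRun` class and its rule that every stage must supply non-empty evidence are assumptions made for the example.

```python
FORENSIC_STAGES = (
    "Reproduce", "WriteFailingTest", "ConfirmRed", "RootCause",
    "Fix", "Green", "Regression", "DualReview", "CAPA",
)


class ForensicRun:
    """Hypothetical tracker that only lets stages complete in order."""

    def __init__(self, defect_id: str) -> None:
        self.defect_id = defect_id
        self.completed: list = []  # (stage, evidence) pairs, in order

    @property
    def next_stage(self):
        done = len(self.completed)
        return FORENSIC_STAGES[done] if done < len(FORENSIC_STAGES) else None

    def complete(self, stage: str, evidence: str) -> None:
        if stage != self.next_stage:
            raise RuntimeError(f"{stage} attempted, but next stage is {self.next_stage}")
        if not evidence:
            raise ValueError(f"{stage} requires supporting evidence")
        self.completed.append((stage, evidence))


run = ForensicRun("DEF-908")
run.complete("Reproduce", "repro log from field complaint")

# Skipping straight to the fix is rejected: no red test exists yet.
fix_blocked = False
try:
    run.complete("Fix", "candidate diff")
except RuntimeError:
    fix_blocked = True
```

Because `Fix` cannot be recorded before `ConfirmRed`, the failing test always exists, and is known to be red, before any repair diff is accepted.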

6.4 Legacy modernization / framework migration at scale

Scenario: Migrate ~400 modules from a deprecated UI framework to the supported one, behavior-preserving, across Class A/B code.

flowchart TB
    SC[Scope intake] --> GRAPH[Graph-native code understanding<br/>build dependency + call graph]
    GRAPH --> CHAR[Characterization test harvest<br/>pin current behavior]
    CHAR --> PART[Partition into wave batches<br/>by risk + coupling]
    PART --> FAN[Fan-out sub-agent pipeline]
    subgraph perModule[Per-module pipeline]
      MIG[Migration coder] --> EQ{Behavioral equivalence}
      EQ -->|fail| MIG
      EQ -->|pass| RV2[Review agent]
    end
    FAN --> perModule
    RV2 --> HITL[[HITL: batch review]]
    HITL --> INT[Integrator: staged merge]
    INT --> M[(Mainline)]
  • Graph-native understanding: the migration is planned over the actual code/dependency graph, not file-by-file, so coupling and ordering are respected.
  • Characterization tests pin pre-migration behavior; the equivalence gate is the deterministic guarantee of behavior preservation.
  • Batching by risk: Class A modules may use lighter checkpoints; Class B retain mandatory diff review; any Class C in scope keeps dual control.
  • Cost discipline: per-module budget caps and tier routing (M for mechanical edits, L for complex ones) keep cost-per-green-PR within bounds at scale.
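
Ordering migration waves over the dependency graph can be sketched with the standard-library `graphlib` module. This example covers only the coupling and ordering aspect: each wave contains modules whose dependencies were all migrated in earlier waves. Risk-based batching and budget caps are not modeled, and the module names are invented.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies: dict) -> list:
    """Group modules into waves; each module maps to the set it depends on.

    Raises graphlib.CycleError if the dependency graph contains a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for reproducible plans
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Invented module names for illustration.
deps = {
    "theme": set(),
    "widgets": {"theme"},
    "alarm_panel": {"widgets"},
    "ui_shell": {"widgets", "theme"},
}
waves = plan_waves(deps)
```

For this graph the plan is `theme` first, then `widgets`, then `alarm_panel` and `ui_shell` together. Modules within a wave do not depend on each other, so they can be fanned out to per-module pipelines in parallel.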

7. Human-in-the-loop design

HITL is the mechanism for risk-proportional autonomy (P3). Checkpoint placement is a function of IEC 62304 safety class.

| Safety class | Spec sign-off | Diff review | Release approval | Auto-merge on green |
|---|---|---|---|---|
| Class A | Required (may batch) | Required for non-trivial; conditional-LGTM permitted for low-risk additive change | Single | Permitted only where policy explicitly allows (additive, non-safety, fully gated) |
| Class B | Required | Mandatory, per-PR diff review | Single | Not permitted |
| Class C | Required | Mandatory, dual independent reviewers | Dual authority | Never |

Design measures to keep HITL effective without inducing approval fatigue:

  • Dual-control for Class C: two independent qualified humans; the system enforces approver disjointness (no self-approval, no single person satisfying both).
  • Batching: related low-risk approvals are grouped into a single review surface with shared context, reducing context-switch cost while preserving per-item evidence.
  • Digital quiet hours: the Orchestrator respects configured quiet windows; checkpoints queued during quiet hours are surfaced at the next active window, never auto-escalated to auto-approval.
  • Conditional-LGTM (merge on green): permitted only for Class A, additive, non-safety-critical changes that pass the full deterministic gate chain and the release gate; explicitly disabled for Class B/C. Every conditional-LGTM merge still produces a full evidence record and is sampled into the audit pipeline.
  • Escalation routing: abstentions and budget breaches route to a named human queue, not into a silent retry storm.
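
The dual-control and conditional-LGTM rules above are simple enough to express as deterministic predicates. The sketch below is an assumption-laden illustration: the `Approval` type, the reviewer counts per class (taken from the table), and the treatment of the change author as never counting toward approval are modeled here, while reviewer qualification is reduced to a boolean flag.

```python
from dataclasses import dataclass

# Independent reviewers required at diff review, per the safety-class table.
REQUIRED_REVIEWERS = {"A": 1, "B": 1, "C": 2}


@dataclass(frozen=True)
class Approval:
    approver: str
    qualified: bool = True


def review_satisfied(safety_class: str, author: str, approvals: list) -> bool:
    """True when enough distinct, qualified, non-author humans approved."""
    # A set enforces disjointness: one person approving twice counts once.
    independent = {
        a.approver for a in approvals if a.qualified and a.approver != author
    }
    return len(independent) >= REQUIRED_REVIEWERS[safety_class]


def auto_merge_allowed(
    safety_class: str, additive: bool, safety_critical: bool, gates_green: bool
) -> bool:
    """Conditional-LGTM: Class A, additive, non-safety-critical, fully gated."""
    return safety_class == "A" and additive and not safety_critical and gates_green
```

For a Class C change, two approvals from the same person, or one from the author plus one from a reviewer, both fail `review_satisfied`; only two distinct qualified reviewers pass.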

8. Continuous code-review and repo-watcher agents

Continuous agents observe the repository and act on events (PR opened, push, scheduled scan). They are deployed in three tiers by integration depth.

| Tier | Deployment | Trigger | Catches | Evidence posting |
|---|---|---|---|---|
| T1 — managed-equivalent | Self-hosted equivalent of a managed review bot | PR opened/updated | Style, obvious bugs, lint/standards drift, secret leaks | Inline PR comments + evidence record |
| T2 — hybrid CI-triggered | Review agent invoked from CI pipeline | CI stage on PR/push | Above + spec deviation, coverage/mutation regressions, risk-control violations | CI check + structured findings to evidence store |
| T3 — custom A2A | Full Orchestrator sub-agent in the A2A mesh | Event or schedule (repo-watcher) | Above + cross-module/arch drift, dependency risk, traceability gaps; can open remediation PRs | Signed findings, optional auto-PR, OTel trajectory |

All tiers route every proposed action through the policy server and post evidence to the Part 11 store. Repo-watcher agents (T3) operate under strict budget caps and an action allowlist; they may propose remediation PRs but never merge to Class B/C without the §7 HITL path. Findings are linked to requirements/risk items for traceability.


9. Workflow governance

A workflow is a validated software item in its own right and is managed under the QMS (ISO 13485/QMSR, GAMP 5).

Versioning. Each workflow (Argo template + agent configs + rule files + tool permissions) is a content-addressed, semantically-versioned artifact. The exact model+harness versions used by a run are pinned and recorded (P7 reproducibility).

Validation (tie to 05). Before promotion, a workflow is validated against the deterministic evaluation harness: golden task sets, gate-correctness measurement toward the ≥99.9% release-gate property, abstention calibration, and cost envelope. Validation evidence is part of the workflow's release record. CSA-aligned, risk-based validation depth scales with the highest safety class the workflow may touch.

Change control. Workflow changes follow the same change-control and approval path as code: proposed diff → review → eval gate → approval → versioned release. A workflow change that alters gate behavior or autonomy level requires re-validation.

flowchart LR
    WF[Workflow change proposal] --> RV[Review]
    RV --> EV[Eval harness validation<br/>→ 05]
    EV -->|meets ≥99.9% + cost| AP[Approval / change control]
    EV -->|fails| WF
    AP --> REL[Versioned workflow release]
    REL --> REG[(Registry — pinned)]

Cost guardrails in-loop (tie to 08). Per-step and per-workflow token/GPU/wall-clock budgets are enforced at runtime by the orchestrator and hooks. Breach behavior is deterministic: pause → abstain → escalate. Cost telemetry feeds the cost-per-green-PR metric (P6) and the economics dashboards.
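
A minimal sketch of the in-loop budget check, assuming usage is reported to the orchestrator after each step. The `Budget` class, its field names, and the numeric caps are hypothetical; the only behavior taken from the text is that budgets exist per step and per workflow across tokens, GPU, and wall-clock, and that a breach deterministically yields pause → abstain → escalate.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_tokens: int
    max_gpu_seconds: float
    max_wall_seconds: float
    tokens: int = 0
    gpu_seconds: float = 0.0
    wall_seconds: float = 0.0

    def charge(self, tokens: int = 0, gpu_seconds: float = 0.0,
               wall_seconds: float = 0.0) -> None:
        """Accumulate reported usage against this budget."""
        self.tokens += tokens
        self.gpu_seconds += gpu_seconds
        self.wall_seconds += wall_seconds

    def breaches(self) -> list:
        """Names of every dimension whose usage exceeds its cap."""
        checks = [
            ("tokens", self.tokens, self.max_tokens),
            ("gpu_seconds", self.gpu_seconds, self.max_gpu_seconds),
            ("wall_seconds", self.wall_seconds, self.max_wall_seconds),
        ]
        return [name for name, used, cap in checks if used > cap]


BREACH_SEQUENCE = ("pause", "abstain", "escalate")


def next_actions(step: Budget, workflow: Budget) -> tuple:
    """Same inputs always give the same response; no silent truncation."""
    if step.breaches() or workflow.breaches():
        return BREACH_SEQUENCE
    return ("continue",)


# Placeholder caps for illustration.
step_budget = Budget(max_tokens=1000, max_gpu_seconds=10.0, max_wall_seconds=60.0)
workflow_budget = Budget(max_tokens=5000, max_gpu_seconds=60.0, max_wall_seconds=600.0)
for budget in (step_budget, workflow_budget):
    budget.charge(tokens=800, gpu_seconds=4.0, wall_seconds=20.0)
before = next_actions(step_budget, workflow_budget)
step_budget.charge(tokens=300)  # pushes the step over its token cap
after = next_actions(step_budget, workflow_budget)
```

The first check returns `("continue",)`; once the step exceeds its token cap the response is the fixed breach sequence, even though the workflow-level budget still has headroom.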

Failure handling and rollback. Every workflow defines: (a) idempotent steps over ephemeral sandbox branches; (b) a one-command rollback to the last known-good mainline state; (c) a quarantine path for flaky/unstable artifacts; (d) escalation to HITL on repeated gate failure. No partial state ever reaches mainline — integration is transactional behind the release gate.

| Failure | Detection | Response |
|---|---|---|
| Gate fail (transient) | Verifier | Bounded repair loop |
| Iteration/budget breach | Orchestrator counter | Abstain → escalate to HITL |
| Non-reproducible result | Eval/replay | Pin freeze; block promotion; investigate |
| Bad merge slipped | Release-gate audit / repo-watcher | Automated rollback + CAPA |

10. Anti-patterns

| Anti-pattern | Why it is dangerous | Mitigation in this platform |
|---|---|---|
| Unbounded loops | Runaway cost; non-terminating agents; eroded determinism | WD-1/WD-5 hard iteration + budget caps; abstain on breach |
| Multi-file autonomous edits without diff review on Class B/C | Unreviewed safety-relevant change reaches mainline | §7 mandatory diff review; auto-merge disabled for B/C; policy server diff-size + path gates |
| Agent-to-agent error amplification | One sub-agent's hallucination becomes the next's "fact"; compounding error across A2A | WD-2 deterministic gate at every handoff; no probabilistic output flows ungated; evidence carries provenance |
| Context fragmentation | Sub-agents work from inconsistent/partial context; divergent assumptions | Single source-of-truth (specs/, AGENTS.md); content-addressed context manifests on every A2A handoff; durable shared memory |
| Confident-but-wrong output (abstention suppression) | Agent guesses instead of abstaining | WD-6 typed ABSTAIN; calibration validated in 05 |
| Mode misuse (Conductor for batch) | Latency/cost mismatch; weak evidence | Policy-server routing by class/size/latency (§2) |

Cross-references

01-requirements.md · 02-maturity-model.md · 03-reference-architecture.md · 04-model-strategy-and-finetuning.md · 05-evaluation-and-validation.md · 07-security-and-compliance.md · 08-token-and-gpu-economics.md · 09-adoption-roadmap.md