06 — Agentic Workflows
Part of Agentic-Native SDLC for Regulated Medical Device Engineering.

Status: Working draft, May 2026. All numeric thresholds are placeholders pending calibration in 05-evaluation-and-validation.md.

Scope: how individual and multi-agent workflows are designed, bounded, gated, and governed across the SDLC under IEC 62304, ISO 13485/QMSR, ISO 14971, FDA CSA, 21 CFR Part 11, and GAMP 5.
This document operationalizes the seven principles into concrete, buildable workflows. It assumes the reference architecture in 03-reference-architecture.md (K8s, MCP tool plane, A2A, Argo Workflows, gVisor/Kata sandboxes, policy server, deterministic hooks, durable memory, OpenTelemetry trajectory tracing) and the model fleet tiers (S/M/L/V/E) in 04-model-strategy-and-finetuning.md.
1. Workflow design principles in a regulated setting
A workflow is a versioned, validated, change-controlled artifact — not an ad-hoc prompt chain. Every workflow MUST satisfy the following structural invariants before it is permitted to execute against a Class A/B/C codebase.
| # | Invariant | Mechanism | Principle | Failure mode it prevents |
|---|---|---|---|---|
| WD-1 | Bounded tasks | Each agent step has an explicit input contract, output schema, max-iteration count, and acceptance predicate. No open-ended "do the thing" steps. | P1, P5 | Unbounded loops; scope creep |
| WD-2 | Deterministic gates between steps | Inter-step transitions are mediated by deterministic verifiers (compilers, linters, test runners, schema validators, policy server). Probabilistic output never flows to the next step ungated. | P2 | Error propagation; non-reproducible state |
| WD-3 | HITL checkpoints by safety class | Checkpoint density scales with IEC 62304 class; Class C requires dual human control. | P3 | Inappropriate autonomy on safety-critical code |
| WD-4 | Evidence capture per step | Every step emits a signed evidence record (inputs, model+harness version, tool calls, gate results, diffs, approver) to the Part 11 evidence store. | P4 | Untraceable changes; audit gaps |
| WD-5 | Budget caps per loop | Token + GPU + wall-clock budgets enforced per step and per workflow; breach → abstain/escalate, never silently truncate. | P6 | Runaway cost; "cost-per-green-PR" blowout |
| WD-6 | Abstention / escalation | Agents emit a typed ABSTAIN(reason, evidence) rather than guessing when confidence/coverage falls below threshold. Escalation routes to HITL or a higher model tier. | P1, P3 | Confident-but-wrong output reaching a gate |
| WD-7 | Idempotency / rollback | Every mutating action is idempotent and reversible: ephemeral sandbox branches, content-addressed artifacts, transactional commits, one-command rollback. | P7 | Partial/irreversible damage to mainline |
The harness is the product (P5). Workflows are expressed declaratively as Argo Workflow templates whose steps invoke agents through the harness; the LLM is a swappable component (see §3, Agent = Model + Harness). The cost-per-green-PR metric (P6) is the primary economic objective function for every workflow (see 08-token-and-gpu-economics.md).
```mermaid
flowchart LR
    G[Generate<br/>probabilistic] --> V[Verify<br/>deterministic gate]
    V -->|fail| R[Repair<br/>bounded loop]
    R --> V
    V -->|pass| GA[Gate<br/>policy + eval + HITL]
    GA -->|reject| R
    GA -->|approve| M[(Mainline)]
    V -->|budget/iter breach| ESC[Abstain → Escalate]
    R -->|budget/iter breach| ESC
    classDef det fill:#e8f0fe,stroke:#1a73e8;
    classDef prob fill:#fce8e6,stroke:#d93025;
    class V,GA det;
    class G,R prob;
```
This Generate → Verify → Repair → Gate loop is the atomic unit of every workflow in this document. It is how the ≥99.9% release-gate correctness property (P1) is realized: individual model calls are unreliable, so reliability is engineered into the loop around them.
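The Generate → Verify → Repair part of the loop can be sketched in a few lines of Python. This is a minimal illustration, not the platform's harness: `run_bounded_loop`, `Accepted`, `Abstain`, and `syntax_gate` are hypothetical names, the verifier is a toy stand-in for the real gate chain, and the final Gate step (policy + eval + HITL) is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, Union


@dataclass(frozen=True)
class Accepted:
    """A candidate that passed the deterministic verifier."""
    candidate: str
    attempts: int


@dataclass(frozen=True)
class Abstain:
    """Typed abstention (WD-6): a reason plus the gate failures seen so far."""
    reason: str
    evidence: Tuple[str, ...]


def run_bounded_loop(
    generate: Callable[[], str],
    verify: Callable[[str], Optional[str]],  # None means the gate passed
    repair: Callable[[str, str], str],
    max_iterations: int,
) -> Union[Accepted, Abstain]:
    """Generate once, then verify and repair at most max_iterations times."""
    failures = []
    candidate = generate()
    for attempt in range(1, max_iterations + 1):
        failure = verify(candidate)
        if failure is None:
            return Accepted(candidate, attempt)
        failures.append(failure)
        if attempt < max_iterations:
            candidate = repair(candidate, failure)
    # Iteration cap reached: abstain with evidence instead of guessing.
    return Abstain("max_iterations reached", tuple(failures))


def syntax_gate(source: str) -> Optional[str]:
    """Toy deterministic verifier: does the candidate compile as Python?"""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"


# Stand-ins for the probabilistic steps (hypothetical, for illustration only).
result = run_bounded_loop(
    generate=lambda: "total = 1 +",
    verify=syntax_gate,
    repair=lambda source, _failure: source + " 1",
    max_iterations=3,
)
print(result)
```

The point of the shape is that the only exits are a verified candidate or a typed abstention carrying the failure history; there is no path that returns an unverified candidate.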
2. Operating modes: Conductor vs Orchestrator
Two operating modes cover the spectrum from synchronous developer assistance to autonomous async batch work. A given task is routed to a mode by the policy server based on safety class, change size, and required latency.
| Dimension | Conductor (real-time, in-IDE) | Orchestrator (async multi-agent) |
|---|---|---|
| Interaction model | Synchronous; human in the loop continuously | Asynchronous; human at defined checkpoints |
| Latency target | Sub-second to seconds | Minutes to hours |
| Primary surface | IDE / editor / terminal | Argo Workflows, A2A mesh, CI |
| Concurrency | Single active agent, human-paced | Many sub-agents in parallel (Planner/Coder/Test/…) |
| Control granularity | Per-edit, per-tool-call (human approves inline) | Per-stage gate + HITL checkpoints |
| Typical model tier | S/M (low latency), V for visual context | M/L (capable), E for hard reasoning; routed per step |
| Tooling | MCP tools scoped to open repo; pre-tool/post-edit hooks | Full MCP plane; sandbox per sub-agent; policy server inline |
| Memory | Session-scoped; durable per-developer | Durable shared workflow memory + per-agent scratch |
| Evidence | Lightweight (suggestion accepted/rejected) | Full per-step signed evidence chain |
| Best for | Pair-style implementation, refactor-in-place, explain, local test authoring | Feature delivery, coverage expansion, migrations, repo-watching, batch fixes |
| Autonomy ceiling | Class A/B with human approving each apply | Up to L4 within validated bounds; Class C still dual-control |
| ASMM-Med fit | L1–L2 | L2–L5 |
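The routing decision (safety class, change size, required latency) could look roughly like the sketch below. Everything here is an assumption for illustration: the `TaskProfile` fields, the `MAX_CONDUCTOR_FILES` placeholder, and in particular the rule that Class C work always goes to the Orchestrator, which is inferred from the evidence and autonomy rows of the table rather than stated by the document.

```python
from dataclasses import dataclass
from enum import Enum


class SafetyClass(Enum):
    A = "A"
    B = "B"
    C = "C"


class Mode(Enum):
    CONDUCTOR = "conductor"
    ORCHESTRATOR = "orchestrator"


@dataclass(frozen=True)
class TaskProfile:
    safety_class: SafetyClass
    changed_files: int   # crude proxy for change size
    interactive: bool    # True when the developer needs a reply within seconds


# Hypothetical placeholder, not a calibrated threshold.
MAX_CONDUCTOR_FILES = 5


def route(task: TaskProfile) -> Mode:
    """Pick an operating mode from class, size, and latency need."""
    if task.safety_class is SafetyClass.C:
        # Assumption: Class C needs the full per-step evidence chain.
        return Mode.ORCHESTRATOR
    if task.changed_files > MAX_CONDUCTOR_FILES:
        return Mode.ORCHESTRATOR
    return Mode.CONDUCTOR if task.interactive else Mode.ORCHESTRATOR
```

For example, `route(TaskProfile(SafetyClass.B, 2, True))` selects the Conductor, while the same task touching 40 files is sent to the Orchestrator.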
A workflow may hand off between modes: a developer in Conductor mode dispatches a bounded task to the Orchestrator ("expand coverage on pump_controller"), reviews the resulting PR, and pulls it back into Conductor for final touch-ups. All handoffs are A2A messages with attached context manifests.
```mermaid
sequenceDiagram
    participant Dev as Developer (IDE)
    participant C as Conductor Agent
    participant O as Orchestrator (Argo)
    participant Sub as Sub-agents (A2A)
    Dev->>C: "implement story PUMP-142"
    C->>C: scope check (safety class B)
    C->>O: dispatch bounded task + context manifest
    O->>Sub: fan-out Planner/Coder/Test/Review
    Sub-->>O: gated PR + evidence
    O-->>Dev: HITL checkpoint (PR review)
    Dev->>C: pull back for local refinement
```
3. Agent anatomy — Agent = Model + Harness
The platform's first axiom: an agent is not a model. An agent is a model wrapped in a deterministic harness that supplies capability, constraint, and evidence. The model is the only probabilistic component; everything else is engineered software under change control.
```mermaid
flowchart TB
    subgraph Agent
        direction TB
        M[Model — fleet tier S/M/L/V/E<br/>self-hosted, fine-tuned, open-weight]
        subgraph Harness
            I[Instructions / rule files<br/>AGENTS.md, specs/, BDD/Gherkin]
            T[MCP tool plane<br/>typed, permissioned tools]
            S[Sandbox<br/>gVisor/Kata, ephemeral, egress-deny]
            OR[Orchestration<br/>Argo steps + A2A handoffs]
            H[Hooks / guardrails<br/>pre-tool, post-edit, pre-commit]
            ME[Memory<br/>durable sessions + scoped scratch]
            OB[Observability<br/>OpenTelemetry trajectory tracing]
            PS[Policy server<br/>structural + semantic gating]
        end
    end
    M <--> I
    M <--> T
    T --> PS
    T --> S
    M --> H
    M --> ME
    Agent --> OB
```
| Component | Role | Determinism | Cross-ref |
|---|---|---|---|
| Model (S/M/L/V/E) | Generation, reasoning, repair proposals | Probabilistic | 04 |
| Instructions / rule files | AGENTS.md, specs/, BDD/Gherkin define behavior, conventions, constraints | Deterministic source-of-truth | 01 |
| MCP tools | Typed, permissioned actions (read/edit/test/build/query) | Deterministic interface | 03 |
| Sandbox | gVisor/Kata ephemeral env; egress-deny; per-task | Deterministic isolation | 07 |
| Orchestration | Argo step graph + A2A handoffs | Deterministic | 03 |
| Hooks / guardrails | Lifecycle interceptors (pre-tool, post-edit, pre-commit) | Deterministic | §9 |
| Memory | Durable session/workflow memory + scoped scratch | Deterministic store | 03 |
| Observability | OTel trajectory spans for every tool call and gate | Deterministic | 05 |
| Policy server | Structural + semantic gating, intercepts every tool call | Deterministic | 07 |
The policy server sits between the model and every tool: a model may propose a write_file or run_command, but the action only executes if it passes structural rules (path allowlist, diff size, no secret egress) and semantic rules (change consistent with the active spec). This is how "determinism wraps probabilism" (P2) is enforced at the tool boundary.
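A minimal sketch of the structural half of that check follows. The names (`WriteProposal`, `structural_violations`), the allowlist roots, the diff-size limit, and the secret pattern are all hypothetical placeholders; scanning proposed content for key-like strings is only a crude proxy for "no secret egress", and semantic rules (consistency with the active spec) are out of scope here.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List


@dataclass(frozen=True)
class WriteProposal:
    """A write_file action proposed by the model, not yet executed."""
    path: str
    new_content: str
    changed_lines: int


# Hypothetical policy values; real ones would come from the policy server config.
ALLOWED_ROOTS = ("src/", "tests/")
MAX_CHANGED_LINES = 200
SECRET_PATTERN = re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----|AKIA[0-9A-Z]{16}")


def structural_violations(proposal: WriteProposal) -> List[str]:
    """Return every structural rule the proposal breaks (empty list = allowed)."""
    violations = []
    parts = PurePosixPath(proposal.path).parts
    escapes = proposal.path.startswith("/") or ".." in parts
    if escapes or not proposal.path.startswith(ALLOWED_ROOTS):
        violations.append("path outside allowlist")
    if proposal.changed_lines > MAX_CHANGED_LINES:
        violations.append("diff too large")
    if SECRET_PATTERN.search(proposal.new_content):
        violations.append("possible secret in content")
    return violations
```

Returning the full list of violations, rather than a bare boolean, gives the evidence record something concrete to store when an action is refused.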
4. Multi-agent decomposition pattern
The Orchestrator decomposes a unit of work into specialized sub-agents connected by A2A handoffs. Each handoff crosses a deterministic gate. Humans sign at safety-class-appropriate checkpoints.
```mermaid
flowchart LR
    H[Human request / story] --> PL[Planner]
    PL -->|A2A: plan| SP[Spec / Story agent]
    SP -->|A2A: spec + Gherkin| G1{Spec gate<br/>policy + eval}
    G1 -->|approve| HC1[[HITL: spec sign-off]]
    HC1 --> CO[Coder]
    CO -->|A2A: diff| G2{Build/lint gate}
    G2 --> TE[Test agent]
    TE -->|A2A: tests+coverage| G3{Test+coverage gate}
    G3 --> RV[Review agent]
    RV -->|A2A: findings| G4{Review gate<br/>policy + eval}
    G4 --> HC2[[HITL: PR approval<br/>Class C = dual control]]
    HC2 --> IN[Integrator]
    IN -->|A2A: merge req| G5{Release gate ≥99.9%}
    G5 --> M[(Mainline / deploy)]
    G1 -.reject.-> SP
    G2 -.reject.-> CO
    G3 -.reject.-> CO
    G4 -.reject.-> CO
```
- Policy server is invoked inside every gate (G1–G5) and on every tool call within each sub-agent.
- Eval gates (deterministic evaluation harness, 05) sit at G1, G4, G5.
- Humans sign at HC1 (spec) and HC2 (PR); Class C requires two independent approvers at HC2 (P3).
- A2A messages are content-addressed and carry the evidence manifest forward so the Integrator can assemble a complete Part 11 record.
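One way to picture content-addressed handoffs that carry the evidence manifest forward is the sketch below. The message fields and helper names are hypothetical, the canonical-JSON hashing scheme is an assumption, and cryptographic signing of the records is deliberately left out.

```python
import hashlib
import json
from typing import Any, Dict, List


def content_address(body: Dict[str, Any]) -> str:
    """Hash a canonical JSON encoding so equal content always gets an equal id."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_handoff(sender: str, receiver: str, payload: Dict[str, Any],
                 evidence_manifest: List[str]) -> Dict[str, Any]:
    """Build an A2A message whose id is derived from its own content."""
    body = {
        "sender": sender,
        "receiver": receiver,
        "payload": payload,
        "evidence_manifest": list(evidence_manifest),
    }
    return {**body, "id": content_address(body)}


def forward_manifest(message: Dict[str, Any]) -> List[str]:
    """Manifest for the next hop: everything received so far plus this message."""
    return message["evidence_manifest"] + [message["id"]]


def verify_handoff(message: Dict[str, Any]) -> bool:
    """Recompute the id; any change to the body makes this return False."""
    body = {key: value for key, value in message.items() if key != "id"}
    return content_address(body) == message["id"]


plan_msg = make_handoff("planner", "spec", {"plan": "add alarm threshold"}, [])
spec_msg = make_handoff("spec", "coder", {"spec": "threshold is configurable"},
                        forward_manifest(plan_msg))
```

With this shape, the Integrator at the end of the chain can walk `evidence_manifest` to enumerate every upstream record, and a tampered message fails `verify_handoff`.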
5. Per-SDLC-phase workflows
Sub-template used for every phase: Goal · Agent role · Inputs/context · Tools · Deterministic gates · HITL checkpoint · Evidence · IEC 62304 activity · Model tier.
5.1 Requirements / planning
| Field | Value |
|---|---|
| Goal | Convert stakeholder intent into structured, testable requirements + acceptance criteria |
| Agent role | Spec/Story agent (decompose, normalize, detect ambiguity/conflict) |
| Inputs / context | Intake notes, existing specs/, risk file (ISO 14971), product requirements |
| Tools | requirements_db, traceability_query, risk_register, MCP doc retrieval |
| Deterministic gates | Schema validation of requirement records; traceability completeness check; duplicate/conflict linter |
| HITL checkpoint | Requirements review board sign-off (mandatory all classes) |
| Evidence | Versioned requirements set, ambiguity report, trace links, approver record |
| IEC 62304 | 5.2 Software requirements analysis |
| Model tier | L (reasoning over ambiguity); E for high-risk Class C decomposition |
5.2 Design / architecture
| Field | Value |
|---|---|
| Goal | Produce software architecture + detailed design consistent with requirements and risk controls |
| Agent role | Design agent (architecture proposal, interface contracts, design rationale) |
| Inputs / context | Approved requirements, architecture standards, risk controls, existing design docs |
| Tools | architecture_model, diagram_gen, interface_registry, dependency graph query |
| Deterministic gates | Interface contract validation; architecture rule checks; risk-control coverage check |
| HITL checkpoint | Design review (mandatory); Class C requires safety reviewer |
| Evidence | Design records, interface specs, design-to-requirement trace, decision log |
| IEC 62304 | 5.3 Software architectural design; 5.4 detailed design |
| Model tier | L; V for diagram/visual artifacts |
5.3 Implementation
| Field | Value |
|---|---|
| Goal | Implement units to satisfy design + spec, passing all deterministic gates |
| Agent role | Coder (generate diff in sandbox, self-repair against gates) |
| Inputs / context | Approved design, specs/, AGENTS.md, target files, dependency graph |
| Tools | read_file, write_file, run_build, run_lint, run_unit_tests, static_analyzer |
| Deterministic gates | Compile/build, lint, static analysis (SAST), unit tests, diff-size policy |
| HITL checkpoint | Diff review on PR; Class B/C never auto-merged without diff review |
| Evidence | Diff, build/test logs, static-analysis report, model+harness version |
| IEC 62304 | 5.5 Software unit implementation and verification |
| Model tier | M (default); L for complex units; routed by complexity |
```mermaid
stateDiagram-v2
    [*] --> Plan
    Plan --> Generate: bounded task
    Generate --> Verify
    Verify --> Repair: gate fail
    Repair --> Verify
    Verify --> Abstain: iter/budget breach
    Verify --> PR: all gates green
    Abstain --> Escalate
    PR --> [*]: diff review (HITL)
```
5.4 Test / QA
| Field | Value |
|---|---|
| Goal | Author/extend verification tests; achieve coverage + behavioral targets |
| Agent role | Test agent (generate tests, mutation-check, close coverage gaps) |
| Inputs / context | Code under test, requirements/Gherkin, existing tests, coverage baseline |
| Tools | run_tests, coverage_tool, mutation_tester, requirements_trace |
| Deterministic gates | Tests pass; coverage ≥ threshold (placeholder); mutation score ≥ threshold; no flaky/quarantine regressions |
| HITL checkpoint | QA lead review of new test suite; Class C verifies requirement-to-test trace |
| Evidence | Test suite diff, coverage delta, mutation report, requirement-test trace matrix |
| IEC 62304 | 5.5 unit verification; 5.6 integration testing; 5.7 system testing |
| Model tier | M; L for hard property/edge-case synthesis |
```mermaid
flowchart LR
    C[Code under test] --> TA[Test agent]
    TA --> GEN[Generate tests]
    GEN --> R{Run tests}
    R -->|fail to author| TA
    R -->|pass| COV{Coverage ≥ θ?}
    COV -->|no| TA
    COV -->|yes| MUT{Mutation ≥ θ?}
    MUT -->|no| TA
    MUT -->|yes| TR{Req-trace complete?}
    TR -->|yes| QA[[HITL: QA review]]
```
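The ordered gate chain in this flow (tests, then coverage, then mutation score, then requirement trace) can be sketched as a function that reports the first gate that fails. The `GateReport` fields are hypothetical, and the two threshold constants are arbitrary placeholders standing in for θ, which the document leaves to calibration in 05.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GateReport:
    tests_passed: bool
    coverage: float                           # fraction in [0, 1]
    mutation_score: float                     # fraction in [0, 1]
    untraced_requirements: Tuple[str, ...]    # requirement ids with no linked test


# Arbitrary placeholders for the uncalibrated thresholds.
COVERAGE_THRESHOLD = 0.80
MUTATION_THRESHOLD = 0.60


def first_failing_gate(report: GateReport) -> Optional[str]:
    """Evaluate gates in flow order; None means the suite may go to QA review."""
    checks = [
        ("tests", report.tests_passed),
        ("coverage", report.coverage >= COVERAGE_THRESHOLD),
        ("mutation", report.mutation_score >= MUTATION_THRESHOLD),
        ("requirement_trace", not report.untraced_requirements),
    ]
    for name, passed in checks:
        if not passed:
            return name
    return None
```

Reporting only the first failing gate mirrors the diagram, where each failure sends the Test agent back around before later gates are considered.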
5.5 Code review
| Field | Value |
|---|---|
| Goal | Detect defects, spec deviations, risk-control violations before merge |
| Agent role | Review agent (semantic diff review, standards + risk checks) |
| Inputs / context | PR diff, spec, design, coding standards, risk controls, prior findings |
| Tools | diff_view, policy_check, standards_linter, trace_query, sec_review |
| Deterministic gates | Policy server semantic gate; standards lint; no unresolved high-severity findings |
| HITL checkpoint | Human reviewer approves; Class C dual control; conditional-LGTM only where permitted (§7) |
| Evidence | Review findings, resolution log, approver(s), gate results |
| IEC 62304 | 5.5/5.6 verification; supports 9 problem resolution |
| Model tier | L (judgment); M for routine diffs |
```mermaid
sequenceDiagram
    participant PR as Pull Request
    participant RA as Review Agent
    participant PS as Policy Server
    participant Ev as Eval Gate
    participant Hu as Human Reviewer(s)
    PR->>RA: diff + context
    RA->>PS: semantic + structural check
    PS-->>RA: pass/findings
    RA->>Ev: deterministic review eval
    Ev-->>RA: score ≥ θ
    RA->>Hu: findings + recommendation
    Hu-->>PR: approve (dual for Class C)
```
5.6 Deployment
| Field | Value |
|---|---|
| Goal | Promote validated build through release gate to target environment |
| Agent role | Integrator/Release agent (assemble release record, drive pipeline) |
| Inputs / context | Approved PR, full evidence chain, release checklist, change-control record |
| Tools | argo_pipeline, release_gate, evidence_store, signing_service, deploy |
| Deterministic gates | Release gate ≥99.9% correctness; evidence completeness; signed approvals present |
| HITL checkpoint | Release authority sign-off (mandatory); Class C dual authority |
| Evidence | Signed release record, DHF/Part 11 package, deploy manifest, rollback plan |
| IEC 62304 | 5.8 software release |
| Model tier | S/M (orchestration, not generation) |
5.7 Maintenance / legacy modernization
| Field | Value |
|---|---|
| Goal | Remediate defects; modernize legacy code with behavior preservation |
| Agent role | Maintenance/Migration sub-agent pipeline (graph-native understanding) |
| Inputs / context | Defect report or migration scope, code graph, characterization tests, risk file |
| Tools | code_graph_query, characterization_tests, run_tests, diff_view, equivalence_check |
| Deterministic gates | Characterization tests pass pre/post; behavioral equivalence; coverage maintained |
| HITL checkpoint | Change review; Class C dual control; CAPA linkage for defects |
| Evidence | Defect-to-fix trace, before/after behavior proof, migration manifest |
| IEC 62304 | 6 software maintenance; 9 problem resolution |
| Model tier | L (analysis) + M (bulk edits); E for hard equivalence reasoning |
6. Concrete worked example workflows
6.1 Feature implementation on a Class B module
Scenario: Story PUMP-142 adds a configurable alarm threshold to infusion_rate_monitor (Class B).
| Step | Mode/agent | Action | Gate | Evidence |
|---|---|---|---|---|
| 1 | Orchestrator / Spec | Decompose story → spec + Gherkin acceptance | Schema + trace | Spec record, trace links |
| 2 | HITL | Spec sign-off (single approver, Class B) | — | Approver record |
| 3 | Coder (M) | Implement in sandbox; self-repair vs build/lint/unit | Compile, lint, SAST, unit | Diff, logs |
| 4 | Test (M) | Add tests for new branches; close coverage | Coverage ≥ θ, mutation ≥ θ | Coverage delta, mutation report |
| 5 | Review (L) | Semantic review vs spec + risk controls | Policy semantic gate | Findings + resolutions |
| 6 | HITL | PR diff review (mandatory, single for Class B) | — | Approval |
| 7 | Integrator (S) | Release gate, assemble record | ≥99.9% release gate | Signed release record |
Budget cap: workflow aborts and escalates if token+GPU spend exceeds the per-PR cap (P6, 08). No auto-merge — Class B requires human diff review (§10 anti-pattern).
6.2 AI-generated test-coverage expansion
Scenario: Raise coverage on dosage_calculator from 71% to ≥ target without changing behavior.
```mermaid
flowchart LR
    BL[Baseline coverage] --> GAP[Coverage-gap analysis<br/>code graph]
    GAP --> TA[Test agent: synth tests]
    TA --> RUN{Tests green?}
    RUN -->|no| TA
    RUN -->|yes| FLK{Flaky? quarantine check}
    FLK -->|stable| MUT{Mutation ≥ θ}
    MUT -->|yes| BEH{No behavior change<br/>vs baseline}
    BEH -->|confirmed| QA[[HITL: QA lead]]
    BEH -->|drift| TA
```
Key controls: tests must be additive and behavior-preserving — any test that would have failed against unchanged production code is flagged as a latent defect and escalated rather than silently "fixed." Mutation testing guards against vacuous tests. Evidence: coverage delta, mutation score, requirement-test trace.
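The two controls can be sketched as simple checks. This is illustrative only: the function names and the change-kind labels are hypothetical, and "additive" is interpreted here in its strictest form (only new files under the test root), which is an assumption rather than the document's definition.

```python
from typing import Dict, List


def non_additive_changes(changes: Dict[str, str], test_root: str = "tests/") -> List[str]:
    """Return paths that break the additive, test-only rule.

    changes maps a path to "added", "modified", or "deleted".
    """
    return sorted(
        path for path, kind in changes.items()
        if kind != "added" or not path.startswith(test_root)
    )


def triage_generated_tests(passed_on_baseline: Dict[str, bool]) -> Dict[str, List[str]]:
    """Split new tests by their result against unchanged production code.

    A failing test is never rewritten until it passes; it is escalated as a
    possible latent defect.
    """
    keep = sorted(name for name, ok in passed_on_baseline.items() if ok)
    escalate = sorted(name for name, ok in passed_on_baseline.items() if not ok)
    return {"keep": keep, "escalate_as_latent_defect": escalate}
```

For instance, a run that adds `tests/test_rounding.py` but also modifies `src/dosage_calculator.py` would report the source file as a violation, and a new test that fails on the baseline would land in the escalation list instead of being edited to pass.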
6.3 Bug fix in forensic mode (failing-test-first, evidence prompting)
Scenario: Field complaint → defect DEF-908 in battery_health_estimator (Class C). Forensic mode enforces reproduce-before-repair.
```mermaid
stateDiagram-v2
    [*] --> Reproduce
    Reproduce --> WriteFailingTest: capture defect as test
    WriteFailingTest --> ConfirmRed: test fails on current code
    ConfirmRed --> RootCause: evidence-prompted analysis
    RootCause --> Fix: bounded minimal diff
    Fix --> Green: failing test now passes
    Green --> Regression: full suite + characterization
    Regression --> DualReview: Class C dual control
    DualReview --> CAPA: link to problem resolution
    CAPA --> [*]
```
- Failing-test-first: the defect is encoded as a test that is red before any fix; this becomes permanent regression evidence.
- Evidence prompting: the root-cause step is required to cite specific code-graph nodes, traces, and the failing assertion — no unsupported hypotheses.
- Class C: dual human control at review; fix is linked to CAPA / IEC 62304 §9 problem resolution.
- Minimal-diff policy: the repair loop is bounded to the smallest change that turns the test green; scope expansion triggers escalation.
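The reproduce-before-repair ordering can be enforced with a small state machine, sketched below. The stage names follow the diagram; the `ForensicRun` class and its `advance` signature are hypothetical, and dual review and CAPA linkage are modeled only as stages, not as real approvals.

```python
from typing import List, Optional, Sequence

STAGES = (
    "Reproduce", "WriteFailingTest", "ConfirmRed", "RootCause",
    "Fix", "Green", "Regression", "DualReview", "CAPA",
)


class ForensicRun:
    """Tracks one forensic bug-fix run and rejects out-of-order stages."""

    def __init__(self) -> None:
        self.completed: List[str] = []

    def advance(self, stage: str, *, test_passed: Optional[bool] = None,
                citations: Sequence[str] = ()) -> None:
        if len(self.completed) == len(STAGES):
            raise ValueError("run already complete")
        expected = STAGES[len(self.completed)]
        if stage != expected:
            raise ValueError(f"expected {expected}, got {stage}")
        if stage == "ConfirmRed" and test_passed is not False:
            raise ValueError("defect test must fail on current code before any fix")
        if stage == "RootCause" and not citations:
            raise ValueError("root cause must cite evidence")
        if stage == "Green" and test_passed is not True:
            raise ValueError("defect test must pass after the fix")
        self.completed.append(stage)
```

Calling `advance("Fix")` before `ConfirmRed` and `RootCause` have been recorded raises, so a fix cannot be logged against a defect that was never shown to be red.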
6.4 Legacy modernization / framework migration at scale
Scenario: Migrate ~400 modules from a deprecated UI framework to the supported one, behavior-preserving, across Class A/B code.
```mermaid
flowchart TB
    SC[Scope intake] --> GRAPH[Graph-native code understanding<br/>build dependency + call graph]
    GRAPH --> CHAR[Characterization test harvest<br/>pin current behavior]
    CHAR --> PART[Partition into wave batches<br/>by risk + coupling]
    PART --> FAN[Fan-out sub-agent pipeline]
    subgraph perModule[Per-module pipeline]
        MIG[Migration coder] --> EQ{Behavioral equivalence}
        EQ -->|fail| MIG
        EQ -->|pass| RV2[Review agent]
    end
    FAN --> perModule
    RV2 --> HITL[[HITL: batch review]]
    HITL --> INT[Integrator: staged merge]
    INT --> M[(Mainline)]
```
- Graph-native understanding: the migration is planned over the actual code/dependency graph, not file-by-file, so coupling and ordering are respected.
- Characterization tests pin pre-migration behavior; the equivalence gate is the deterministic guarantee of behavior preservation.
- Batching by risk: Class A modules may use lighter checkpoints; Class B retain mandatory diff review; any Class C in scope keeps dual control.
- Cost discipline: per-module budget caps and tier routing (M for mechanical edits, L for complex ones) keep cost-per-green-PR within bounds at scale.
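Planning over the dependency graph rather than file-by-file can be illustrated with Python's standard `graphlib` module. This sketch covers only the ordering half of wave partitioning: it assumes that a module's dependencies should be migrated before the module itself, the module names are made up, and risk-based batching is not modeled.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group modules into waves; each wave depends only on earlier waves.

    dependencies maps a module to the set of modules it depends on.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical module graph for illustration.
waves = plan_waves({
    "ui_shell": {"widgets", "theme"},
    "alarm_panel": {"widgets"},
    "widgets": {"theme"},
    "theme": set(),
})
print(waves)  # [['theme'], ['widgets'], ['alarm_panel', 'ui_shell']]
```

Modules within one wave have no ordering constraints between them, which is what makes them candidates for parallel fan-out to the per-module pipeline.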
7. Human-in-the-loop design
HITL is the mechanism for risk-proportional autonomy (P3). Checkpoint placement is a function of IEC 62304 safety class.
| Safety class | Spec sign-off | Diff review | Release approval | Auto-merge on green |
|---|---|---|---|---|
| Class A | Required (may batch) | Required for non-trivial; conditional-LGTM permitted for low-risk additive change | Single | Permitted only where policy explicitly allows (additive, non-safety, fully gated) |
| Class B | Required | Mandatory, per-PR diff review | Single | Not permitted |
| Class C | Required | Mandatory, dual independent reviewers | Dual authority | Never |
Design measures to keep HITL effective without inducing approval fatigue:
- Dual-control for Class C: two independent qualified humans; the system enforces approver disjointness (no self-approval, no single person satisfying both).
- Batching: related low-risk approvals are grouped into a single review surface with shared context, reducing context-switch cost while preserving per-item evidence.
- Digital quiet hours: the Orchestrator respects configured quiet windows; checkpoints queued during quiet hours are surfaced at the next active window and never fall through to auto-approval.
- Conditional-LGTM (merge on green): permitted only for Class A, additive, non-safety-critical changes that pass the full deterministic gate chain and the release gate; explicitly disabled for Class B/C. Every conditional-LGTM merge still produces a full evidence record and is sampled into the audit pipeline.
- Escalation routing: abstentions and budget breaches route to a named human queue, not into a silent retry storm.
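Two of these rules reduce to small predicates: approver disjointness for dual control, and eligibility for conditional-LGTM. The sketch below uses hypothetical function names and treats approver identity as a plain string; the approver counts per class follow the table above.

```python
from typing import Iterable

# Independent human approvers required at diff review, per the table above.
APPROVERS_REQUIRED = {"A": 1, "B": 1, "C": 2}


def review_satisfied(safety_class: str, author: str, approvers: Iterable[str]) -> bool:
    """True when enough distinct approvers, none of them the author, have signed."""
    independent = set(approvers) - {author}
    return len(independent) >= APPROVERS_REQUIRED[safety_class]


def conditional_lgtm_allowed(safety_class: str, additive: bool, safety_critical: bool,
                             gates_green: bool, policy_allows: bool) -> bool:
    """Merge-on-green is limited to Class A, additive, non-safety-critical changes."""
    return (
        safety_class == "A"
        and additive
        and not safety_critical
        and gates_green
        and policy_allows
    )
```

Using a set for approvers means the same person signing twice still counts once, which is the disjointness property dual control depends on.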
8. Continuous code-review and repo-watcher agents
Continuous agents observe the repository and act on events (PR opened, push, scheduled scan). They are deployed in three tiers by integration depth.
| Tier | Deployment | Trigger | Catches | Evidence posting |
|---|---|---|---|---|
| T1 — managed-equivalent | Self-hosted equivalent of a managed review bot | PR opened/updated | Style, obvious bugs, lint/standards drift, secret leaks | Inline PR comments + evidence record |
| T2 — hybrid CI-triggered | Review agent invoked from CI pipeline | CI stage on PR/push | Above + spec deviation, coverage/mutation regressions, risk-control violations | CI check + structured findings to evidence store |
| T3 — custom A2A | Full Orchestrator sub-agent in the A2A mesh | Event or schedule (repo-watcher) | Above + cross-module/arch drift, dependency risk, traceability gaps; can open remediation PRs | Signed findings, optional auto-PR, OTel trajectory |
All tiers route every proposed action through the policy server and post evidence to the Part 11 store. Repo-watcher agents (T3) operate under strict budget caps and an action allowlist; they may propose remediation PRs but never merge to Class B/C without the §7 HITL path. Findings are linked to requirements/risk items for traceability.
9. Workflow governance
A workflow is a validated software item in its own right and is managed under the QMS (ISO 13485/QMSR, GAMP 5).
Versioning. Each workflow (Argo template + agent configs + rule files + tool permissions) is a content-addressed, semantically-versioned artifact. The exact model+harness versions used by a run are pinned and recorded (P7 reproducibility).
Validation (tie to 05). Before promotion, a workflow is validated against the deterministic evaluation harness: golden task sets, gate-correctness measurement toward the ≥99.9% release-gate property, abstention calibration, and cost envelope. Validation evidence is part of the workflow's release record. CSA-aligned, risk-based validation depth scales with the highest safety class the workflow may touch.
Change control. Workflow changes follow the same change-control and approval path as code: proposed diff → review → eval gate → approval → versioned release. A workflow change that alters gate behavior or autonomy level requires re-validation.
```mermaid
flowchart LR
    WF[Workflow change proposal] --> RV[Review]
    RV --> EV[Eval harness validation<br/>→ 05]
    EV -->|meets ≥99.9% + cost| AP[Approval / change control]
    EV -->|fails| WF
    AP --> REL[Versioned workflow release]
    REL --> REG[(Registry — pinned)]
```
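The versioning and re-validation rules can be pictured with a small sketch. The manifest layout, field names, and version strings are all hypothetical; the only behavior taken from the text is that a workflow artifact is content-addressed and that a change to gate behavior or autonomy level forces re-validation.

```python
import hashlib
import json
from typing import Any, Dict


def manifest_digest(manifest: Dict[str, Any]) -> str:
    """Content address for a workflow manifest (canonical JSON, SHA-256)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Fields whose change alters gate behavior or autonomy level (assumed names).
REVALIDATION_FIELDS = ("gates", "autonomy_level")


def requires_revalidation(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """True when the change touches a field that forces re-validation."""
    return any(old.get(name) != new.get(name) for name in REVALIDATION_FIELDS)


# Hypothetical manifests for illustration.
released = {
    "semver": "1.4.0",
    "model": "tier-M@example-build",
    "gates": {"coverage_min": 0.80},
    "autonomy_level": "L2",
}
docs_only = {**released, "semver": "1.4.1", "description": "clarified step names"}
looser_gate = {**released, "semver": "1.5.0", "gates": {"coverage_min": 0.70}}
```

Here `docs_only` gets a new digest but does not trip `requires_revalidation`, while `looser_gate` does. Either way the change still goes through review and approval as described above.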
Cost guardrails in-loop (tie to 08). Per-step and per-workflow token/GPU/wall-clock budgets are enforced at runtime by the orchestrator and hooks. Breach behavior is deterministic: pause → abstain → escalate. Cost telemetry feeds the cost-per-green-PR metric (P6) and the economics dashboards.
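A runtime budget guard with deterministic breach behavior might look like the sketch below. The `Budget` class and its limits are hypothetical, and the three resource dimensions are tracked as simple counters rather than measured from real telemetry.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Budget:
    """Running spend for one step or workflow against fixed caps."""
    max_tokens: int
    max_gpu_seconds: float
    max_wall_seconds: float
    tokens: int = 0
    gpu_seconds: float = 0.0
    wall_seconds: float = 0.0

    def charge(self, tokens: int = 0, gpu_seconds: float = 0.0,
               wall_seconds: float = 0.0) -> List[str]:
        """Record spend and return the names of any breached dimensions."""
        self.tokens += tokens
        self.gpu_seconds += gpu_seconds
        self.wall_seconds += wall_seconds
        breaches = []
        if self.tokens > self.max_tokens:
            breaches.append("tokens")
        if self.gpu_seconds > self.max_gpu_seconds:
            breaches.append("gpu_seconds")
        if self.wall_seconds > self.max_wall_seconds:
            breaches.append("wall_seconds")
        return breaches


BREACH_SEQUENCE = ("pause", "abstain", "escalate")


def next_actions(breaches: List[str]) -> Tuple[str, ...]:
    """Same breach, same response: never truncate or retry silently."""
    return BREACH_SEQUENCE if breaches else ("continue",)
```

A caller would invoke `charge` after every model or tool call and act on `next_actions`, so that a breach on any single dimension stops the loop in the same way.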
Failure handling and rollback. Every workflow defines: (a) idempotent steps over ephemeral sandbox branches; (b) a one-command rollback to the last known-good mainline state; (c) a quarantine path for flaky/unstable artifacts; (d) escalation to HITL on repeated gate failure. No partial state ever reaches mainline — integration is transactional behind the release gate.
| Failure | Detection | Response |
|---|---|---|
| Gate fail (transient) | Verifier | Bounded repair loop |
| Iteration/budget breach | Orchestrator counter | Abstain → escalate to HITL |
| Non-reproducible result | Eval/replay | Pin freeze; block promotion; investigate |
| Bad merge reaches mainline | Release-gate audit / repo-watcher | Automated rollback + CAPA |
10. Anti-patterns
| Anti-pattern | Why it is dangerous | Mitigation in this platform |
|---|---|---|
| Unbounded loops | Runaway cost; non-terminating agents; eroded determinism | WD-1/WD-5 hard iteration + budget caps; abstain on breach |
| Multi-file autonomous edits without diff review on Class B/C | Unreviewed safety-relevant change reaches mainline | §7 mandatory diff review; auto-merge disabled for B/C; policy server diff-size + path gates |
| Agent-to-agent error amplification | One sub-agent's hallucination becomes the next's "fact"; compounding error across A2A | WD-2 deterministic gate at every handoff; no probabilistic output flows ungated; evidence carries provenance |
| Context fragmentation | Sub-agents work from inconsistent/partial context; divergent assumptions | Single source-of-truth (specs/, AGENTS.md); content-addressed context manifests on every A2A handoff; durable shared memory |
| Confident-but-wrong output (abstention suppression) | Agent guesses instead of abstaining, so an unsupported answer reaches a gate | WD-6 typed ABSTAIN; calibration validated in 05 |
| Mode misuse (Conductor for batch) | Latency/cost mismatch; weak evidence | Policy-server routing by class/size/latency (§2) |
Cross-references
01-requirements.md · 02-maturity-model.md · 03-reference-architecture.md · 04-model-strategy-and-finetuning.md · 05-evaluation-and-validation.md · 07-security-and-compliance.md · 08-token-and-gpu-economics.md · 09-adoption-roadmap.md