08 - Token & GPU Economics (FinOps for Self-Hosted Agentic Dev)

Part of Agentic-Native SDLC for Regulated Medical Device Engineering. Status: Reference baseline · Date: May 2026 · Audience: CFO/Finance, Platform Engineering, Quality/Regulatory. Cross-refs: 01-requirements · 02-maturity-model · 03-reference-architecture · 04-model-strategy-and-finetuning · 05-evaluation-and-validation · 06-agentic-workflows · 07-security-and-compliance · 09-adoption-roadmap

This document is the financial control plane for the program. Every formula, rate, and ratio below is illustrative; the parameters are org-set and owned by the FinOps practice. Where a number appears, treat it as a placeholder to be replaced by measured values from your own fleet telemetry.

1. The Economic Reframing

Self-hosting fine-tuned open-weight models is not primarily a performance decision for this program — it is an economic and sovereignty decision, and it changes the shape of the cost, not just the magnitude.

A SaaS LLM API is pure, uncapped OpEx: a per-token meter that scales linearly and forever with usage, with no asset on the balance sheet and a marginal cost that never drops below the vendor rate. A 1000+ developer org running agentic workflows generates enormous token volume, most of it on repair loops, retrieval, and evaluation rather than the final answer the human sees. At that volume, the per-token meter becomes the dominant line item and is structurally unbounded.

Self-hosting converts that single meter into two different cost categories, amortized GPU CapEx and operations OpEx:

Dimension	SaaS API (rejected)	Self-Hosted Fleet (chosen)
Cost class	OpEx, per-token, uncapped	GPU CapEx (amortized) + Ops OpEx
Marginal cost of one more token	Vendor rate (fixed, never zero)	≈ marginal electricity once GPU is owned
Cost behavior at scale	Linear, unbounded	Step-function (buy capacity) + high utilization wins
Data path	Source/spec leaves the boundary	Stays inside the regulated boundary
Reproducibility	Vendor-controlled model drift	Pinned weights, P7 reproducible
Negotiating position	Vendor pricing power	Internal control of the curve

Why the org chose this (decision is non-optional, see §6):

Cost at scale. Above a break-even volume, owned-and-amortized GPU beats per-token billing, and our volume is far above it.
Data sovereignty. Regulated source, specifications, defect data, and patient-adjacent context must not transit a third-party LLM (see 07-security-and-compliance).
No per-token vendor billing / no model drift. Validation under IEC 62304 requires that a release gate run today reproduces tomorrow; a silently-updated vendor model breaks that (P7, 05-evaluation-and-validation).

The governing metric: COST-PER-GREEN-PR

Per Principle P6, cost is measured per verified task, not per token. The unit of economic value this program produces is a Green PR: a change that passes all deterministic gates plus human review (see 06-agentic-workflows). Tokens that do not contribute to a Green PR are waste, regardless of how cheap each one was.

$$\text{COST-PER-GREEN-PR} = \frac{\text{Total compute \$ (inference + eval + retrieval + idle + amortized CapEx + ops)}}{\text{Number of PRs that passed all gates + review}}$$

Why optimizing cost-per-token alone is a trap. Cost-per-token is a component, not the objective. The classic failure: a team cuts token price 40% by quantizing aggressively or routing everything to a tiny model, first-pass yield collapses, agents loop and re-attempt, escape-rate to human reviewers rises — and cost-per-Green-PR goes UP even as cost-per-token went down. Cheap-but-wrong is the most expensive mode in a regulated SDLC, because rework, re-validation, and reviewer time dwarf raw inference. The denominator is the lever; the numerator is the temptation.
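To make the trap concrete, the sketch below computes cost-per-Green-PR for two hypothetical configurations. Only the 40% price cut comes from the paragraph above; the attempt count, per-attempt spend, and Green-PR counts are invented placeholders, and `cost_per_green_pr` is an illustrative helper rather than a program API.

```python
def cost_per_green_pr(total_compute_usd: float, green_prs: int) -> float:
    """Total compute spend divided by the PRs that passed all gates + review."""
    if green_prs <= 0:
        raise ValueError("need at least one Green PR to compute a unit cost")
    return total_compute_usd / green_prs


ATTEMPTS = 1000  # attempted changes in the period (placeholder)

# Baseline: $0.10 of compute per attempt, 700 of 1000 attempts end up Green.
baseline = cost_per_green_pr(ATTEMPTS * 0.10, green_prs=700)

# "Cheap" config: compute per attempt cut 40%, but only 350 attempts end up Green.
cheap = cost_per_green_pr(ATTEMPTS * 0.06, green_prs=350)

print(f"baseline: ${baseline:.4f} per Green PR")
print(f"cheap:    ${cheap:.4f} per Green PR")
```

Under these assumed inputs the cheaper configuration costs about 20% more per Green PR, before counting any extra reviewer time.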

2. Cost Taxonomy

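All program spend decomposes into the ten drivers below: seven numbered inference, training, and evaluation drivers plus three unnumbered supporting lines (retrieval/embedding, idle capacity, and ops/people). Each is metered separately (OpenTelemetry → cost, §7) so it can be attributed and optimized independently.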

#	Cost driver	What moves it	Primary controls
1	Inference — model size	Params served, tier (S/M/L), MoE active params	Smallest-capable-model, tiered routing (§4), distillation
2	Inference — context length	Prompt tokens, retrieved context, history	Context economy (§4), prefix/prompt cache, retrieval over stuffing
3	Inference — output length	Generated tokens, reasoning-effort, loop count	Reasoning-effort caps, loop-count limits, structured output
4	Inference — batch efficiency	Continuous batching, concurrency, queue depth	vLLM continuous batching, PagedAttention, request shaping
5	Inference — GPU type/utilization	GPU SKU $/hr, MIG slicing, idle fraction	Right-sizing, MIG partitioning, KEDA scale-to-zero
6	Training / fine-tuning	LoRA vs full FT, dataset size, epochs, runs	Multi-LoRA, spot/preemptible + Kueue, distillation runs
7	Evaluation / validation	Gate suites, reproducibility reruns, 99.9% sampling	Eval caching, deterministic seeds, eval-tier routing
—	Retrieval / embedding	Embedding calls, rerank, index refresh, vector store	Tier-E batching, retrieval cache, incremental indexing
—	Idle / overprovisioning	Reserved-but-unused GPU, warm pools, headroom	Scale-to-zero, autoscaling, batch/interactive split
—	Ops / people	Platform SRE, FinOps, MLOps, on-call, eval engineering	Automation, self-service, maturity (L1→L5)

Do not forget evaluation cost. The ≥99.9% release-gate correctness target (P1) is enforced by running a great deal of inference: large eval suites, adversarial probes, statistical sampling, and reproducibility reruns that re-execute gates on pinned weights for the regulatory record. For mature agentic repos, plan for eval + reproducibility compute on the order of 25–45% of total inference spend (illustrative; FIN-07 in §7 sets the target ceiling at 40%), and make it a first-class budget line, not an afterthought (see 05-evaluation-and-validation).

3. An Illustrative Cost Model

ILLUSTRATIVE — all rates are placeholders. Parameters ($/GPU-hr, throughput, yields) are org-set and replaced by measured fleet telemetry. The structure is the deliverable, not the digits.

3.1 GPU-hour → cost per 1k tokens, by tier

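The per-token cost of a served model is the hourly cost of the GPUs serving it (amortized CapEx plus ops for owned hardware, or the rental rate), adjusted for utilization, divided by the tokens those GPUs produce per hour.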

$$\text{Cost\_per\_1k\_tokens} = \frac{(\text{GPU\_count} \times \text{\$/GPU-hr}) \div \text{Utilization}}{\text{Effective\_throughput\_tok\_per\_hr}} \times 1000$$

Effective_throughput already bakes in quantization, continuous batching, and speculative decoding (§4). The table quotes throughput in tok/s; multiply by 3,600 to get the tok/hr the formula expects. Illustrative steady-state rates:

Tier	Model class	GPU footprint (illus.)	Eff. throughput (tok/s, batched)	$/GPU-hr (illus.)	$ / 1k tok (illus.)
Tier-S "Reflex"	1–8B, FP8/INT8	MIG slice / 1× GPU	6,000	$2.50	$0.00012
Tier-M "Worker"	14–34B, AWQ/GPTQ	1–2× GPU	2,200	$2.50	$0.00063
Tier-L "Reasoner"	70B+/MoE	4–8× GPU	700	$2.50	$0.0079
Tier-V Multimodal	VLM	1–2× GPU	1,000	$2.50	$0.0014
Tier-E Embed/Rerank	embedding	MIG slice	40,000 (items)	$2.50	$0.0000175

The ~65× spread between Tier-S and Tier-L is the entire economic argument for tiered routing: a call needlessly sent to Tier-L costs as much as ~65 correct Tier-S calls.
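The formula above can be sketched as a small function. The version below recomputes the Tier-S, Tier-M, and Tier-L rates under two assumptions that are not stated in the table: each tier runs on the upper end of its GPU footprint, and utilization is 1.0. `cost_per_1k_tokens` is a hypothetical helper name.

```python
def cost_per_1k_tokens(gpu_count: int, usd_per_gpu_hr: float,
                       throughput_tok_per_s: float,
                       utilization: float = 1.0) -> float:
    """$ per 1k tokens for a served model, following the section 3.1 formula."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Idle time inflates the effective hourly cost of the GPUs.
    hourly_cost = gpu_count * usd_per_gpu_hr / utilization
    # The table quotes tok/s; the formula wants tok/hr.
    tokens_per_hr = throughput_tok_per_s * 3600
    return hourly_cost / tokens_per_hr * 1000


# (gpu_count, tok/s) per tier; GPU counts are the assumed upper footprint.
TIERS = {"Tier-S": (1, 6_000), "Tier-M": (2, 2_200), "Tier-L": (8, 700)}

rates = {name: cost_per_1k_tokens(n, 2.50, tps) for name, (n, tps) in TIERS.items()}
for name, rate in rates.items():
    print(f"{name}: ${rate:.5f} per 1k tokens")
```

Passing a measured `utilization` below 1.0 raises every rate proportionally, which is why idle fraction appears as its own driver in §2.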

3.2 Cost per agent task

A single agent task is rarely one model call. It is a sequence of calls across tiers, plus the verifier/sandbox compute that makes the work verifiable, plus its share of evaluation — all divided by first-pass yield (FPY) to account for repair loops.

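$$\text{Cost\_per\_task} = \frac{\sum_{\text{tiers}} (\text{tokens}_{\text{tier}} \times \text{rate}_{\text{tier}}) + \text{Verifier\_sandbox\_compute} + \text{Eval\_amortized}}{\text{First\_Pass\_Yield}}, \qquad 0 < \text{FPY} \le 1$$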

Verifier_sandbox_compute = build/test/static-analysis/sandbox-exec cost to check the candidate (the harness is the product, P5). Eval_amortized = the task's share of gate + reproducibility runs. FPY is the divisor that ties quality to cost: if each attempt passes with probability FPY, a task needs 1 ÷ FPY attempts on average, and every failed attempt re-spends the numerator.

The repair-loop explosion, holding raw token cost constant, as FPY falls:

First-Pass Yield	Effective cost multiplier (1 ÷ FPY)	Interpretation
0.90	1.11×	Healthy; small rework tax
0.70	1.43×	Noticeable loop spend
0.50	2.00×	Half of all work is redone
0.30	3.33×	Loop-dominated; cheap model is a false economy
0.15	6.67×	Pathological; escape-rate to humans spikes

This table is the quantified form of the §1 trap: driving down per-token cost while letting FPY fall is a net loss.
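The 1 ÷ FPY multiplier in the table can be derived under a simplifying assumption that the document does not state explicitly: each attempt passes independently with the same probability p = FPY and costs the same amount C. The number of attempts N is then geometrically distributed:

```latex
% Assumption: independent attempts, each passing with probability p = FPY,
% each costing C (the numerator of Cost_per_task).
\begin{aligned}
\Pr[N = k] &= (1-p)^{k-1}\,p, \qquad k = 1, 2, \dots \\
\mathbb{E}[N] &= \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p = \frac{1}{p} \\
\mathbb{E}[\text{cost}] &= C \cdot \mathbb{E}[N] = \frac{C}{\text{FPY}}
\end{aligned}
```

Real repair loops are neither independent nor unbounded (retries share context, and §4.8 caps the loop count), so this is best read as a first-order approximation.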

3.3 Worked numeric example (labeled placeholders)

ILLUSTRATIVE. "Implement a bounded requirement-to-code change with passing unit tests."

Input (placeholder)	Symbol	Value
Tier-S router/classify + lint tokens	`t_S`	8,000 tok @ $0.00012/1k
Tier-M implementation tokens	`t_M`	40,000 tok @ $0.00063/1k
Tier-L escalation (10% of tasks need it)	`t_L`	6,000 tok @ $0.0079/1k × 0.10
Embedding/retrieval	`t_E`	20,000 items @ $0.0000175/1k
Verifier/sandbox compute (build+test+sast)	`C_v`	$0.018
Eval/repro amortized share	`C_e`	$0.012
First-pass yield	`FPY`	0.70

Token + retrieval cost:
  Tier-S : 8,000/1000  × $0.00012 = $0.00096
  Tier-M : 40,000/1000 × $0.00063 = $0.02520
  Tier-L : 6,000/1000  × $0.0079 × 0.10 = $0.00474
  Tier-E : 20,000/1000 × $0.0000175 = $0.00035
  Subtotal tokens                       = $0.03125

Numerator = $0.03125 + C_v($0.018) + C_e($0.012)  = $0.06125
Cost_per_task = $0.06125 ÷ FPY(0.70)              = $0.0875   ✅

Sensitivity — same task, FPY collapses to 0.30 (e.g., over-aggressive quantization or routing too small): $0.06125 ÷ 0.30 = $0.2042 — a 2.3× cost increase with zero change to per-token rates. If that low FPY also raises human escape-rate, the true cost-per-Green-PR rises further still (reviewer minutes are the most expensive tokens in the system).
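The worked example and its sensitivity case can be reproduced with a few lines of code. This is a sketch using the placeholder inputs from the table above; the `cost_per_task` helper and its tuple layout are illustrative choices, not part of any program tooling.

```python
def cost_per_task(token_usage, verifier_usd: float, eval_usd: float, fpy: float) -> float:
    """Section 3.2 formula.

    token_usage: iterable of (tokens, usd_per_1k, fraction_of_tasks_invoking_it).
    """
    if not 0 < fpy <= 1:
        raise ValueError("FPY must be in (0, 1]")
    token_usd = sum(tokens / 1000 * rate * share for tokens, rate, share in token_usage)
    return (token_usd + verifier_usd + eval_usd) / fpy


usage = [
    (8_000, 0.00012, 1.0),     # Tier-S router/classify + lint
    (40_000, 0.00063, 1.0),    # Tier-M implementation
    (6_000, 0.0079, 0.10),     # Tier-L escalation, needed by 10% of tasks
    (20_000, 0.0000175, 1.0),  # Tier-E embedding/retrieval (items)
]

base = cost_per_task(usage, verifier_usd=0.018, eval_usd=0.012, fpy=0.70)
degraded = cost_per_task(usage, verifier_usd=0.018, eval_usd=0.012, fpy=0.30)

print(f"FPY 0.70: ${base:.4f}")
print(f"FPY 0.30: ${degraded:.4f}  ({degraded / base:.2f}x)")
```

Swapping in measured token counts and rates per repo class turns the same function into the per-task input for the $/Green-PR dashboards in §7.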

4. The Optimization Levers

Each lever lists mechanism → expected impact (illustrative) → tradeoff. They compound; they also interact (over-using one can sink FPY and undo another), so they are tuned against cost-per-Green-PR, never in isolation.

4.1 Tiered model routing

Mechanism. A lightweight classifier/router running on Tier-S scores incoming task complexity and dispatches each call to the smallest capable tier (the smallest-capable-model principle). It escalates to Tier-M/Tier-L only when confidence/complexity thresholds are crossed or the verifier fails.
Impact. If the majority of low-complexity calls resolve on Tier-S (~65× cheaper than Tier-L), blended $/token can drop 40–70% vs. always-on Tier-L.
Tradeoff. Router error is double-edged: under-routing tanks FPY (loops), while over-routing burns Tier-L capacity on calls a smaller tier could have handled. The router itself must be validated and is an eval surface. Mis-tuned thresholds look cheap per-token while raising cost-per-Green-PR.
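The routing-plus-escalation behavior described above can be sketched as follows. The complexity score, the two thresholds (0.4 and 0.8), and the `attempt` callback standing in for "run the tier, then the verifier" are all hypothetical; a real router would be a validated classifier, as the tradeoff notes.

```python
from typing import Callable

TIERS = ["S", "M", "L"]  # ordered cheapest -> most capable


def pick_start_tier(complexity: float,
                    m_threshold: float = 0.4,
                    l_threshold: float = 0.8) -> str:
    """Map a complexity score in [0, 1] to the smallest tier assumed capable."""
    if complexity >= l_threshold:
        return "L"
    if complexity >= m_threshold:
        return "M"
    return "S"


def route(complexity: float, attempt: Callable[[str], bool]) -> tuple[list[str], bool]:
    """Start at the smallest capable tier; escalate one tier per verifier failure.

    Returns the tiers tried, in order, and whether the last attempt passed.
    """
    start = TIERS.index(pick_start_tier(complexity))
    tried: list[str] = []
    for tier in TIERS[start:]:
        tried.append(tier)
        if attempt(tier):  # attempt() = generate on this tier, then verify
            return tried, True
    return tried, False


# A low-complexity task whose Tier-S answer fails verification escalates to Tier-M.
print(route(0.2, lambda tier: tier != "S"))
```

Logging the `tried` list per task gives the tier mix and escalation rate that FIN-06 tracks.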

4.2 Caching (KV/prefix, prompt, semantic, retrieval)

Mechanism. PagedAttention/KV-cache reuse + prefix/prompt caching skip recompute of shared system prompts, specs, and skill preambles; semantic caching returns prior answers for near-duplicate requests; retrieval caching avoids re-embedding/re-fetching stable context.
Impact. Prompt/prefix cache can cut prefill compute 30–80% on repetitive agentic prompts (large shared spec/skill prefixes); retrieval cache cuts Tier-E load materially.
Tradeoff. Semantic cache must be conservative in regulated paths — a stale or near-miss hit that flips a gate decision is a correctness defect. Cache keys must include weight/version/spec hashes for reproducibility.
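One way to satisfy the cache-key requirement is to hash the prompt together with every version identifier that could change the answer. The sketch below assumes three such identifiers (weights hash, adapter version, spec hash); the field names are illustrative, and a real deployment may need more (tokenizer, sampling parameters, tool versions).

```python
import hashlib
import json


def cache_key(prompt: str, weights_sha: str, adapter_version: str, spec_sha: str) -> str:
    """Deterministic key: any change to weights, adapter, or spec yields a new key."""
    payload = json.dumps(
        {
            "prompt": prompt,
            "weights_sha": weights_sha,
            "adapter_version": adapter_version,
            "spec_sha": spec_sha,
        },
        sort_keys=True,          # stable field order -> stable hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = cache_key("summarize REQ-12", "abc123", "lora-v3", "spec-9f")
k2 = cache_key("summarize REQ-12", "def456", "lora-v3", "spec-9f")  # new weights
print(k1 == k2)  # False: a weights change invalidates the cached entry
```

Because the key is exact-match, this covers prompt and retrieval caching; a semantic cache would still need its own conservative similarity policy on top.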

4.3 Quantization + speculative decoding + continuous batching

Mechanism. FP8/INT8/AWQ/GPTQ shrink memory/raise throughput; speculative decoding uses a small draft model to propose tokens a larger model verifies; continuous batching (vLLM) keeps GPUs saturated across concurrent requests.
Impact. Quantization commonly yields 1.5–3× throughput per dollar; speculative decoding gives a 1.5–2.5× speedup (lower latency, higher throughput) on workloads where most draft tokens are accepted; continuous batching lifts utilization from ~30% to 70–90%.
Tradeoff. Quantization can degrade accuracy — every quantized model must re-pass deterministic gates (05) before serving. Quantize, then measure FPY, never assume.
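A quick way to apply "quantize, then measure FPY" is to compare cost-per-task before and after, using the §3.2 formula. The sketch assumes the throughput gain divides the token cost by `speedup` while verifier and eval cost stay unchanged; the inputs reuse the §3.3 placeholders, and both function names are hypothetical. It complements the deterministic gates and does not replace them.

```python
def cost_per_task(token_usd: float, fixed_usd: float, fpy: float) -> float:
    """(token cost + verifier/eval cost) / FPY, as in section 3.2."""
    if not 0 < fpy <= 1:
        raise ValueError("FPY must be in (0, 1]")
    return (token_usd + fixed_usd) / fpy


def quantization_pays_off(token_usd: float, fixed_usd: float, speedup: float,
                          fpy_before: float, fpy_after: float) -> bool:
    """True if the quantized model is cheaper per task at its measured FPY."""
    before = cost_per_task(token_usd, fixed_usd, fpy_before)
    # Assumption: only the token cost shrinks; verifier + eval cost does not.
    after = cost_per_task(token_usd / speedup, fixed_usd, fpy_after)
    return after < before


# Section 3.3 placeholders: $0.03125 tokens, $0.030 verifier + eval, FPY 0.70.
print(quantization_pays_off(0.03125, 0.030, speedup=2.0, fpy_before=0.70, fpy_after=0.60))
print(quantization_pays_off(0.03125, 0.030, speedup=2.0, fpy_before=0.70, fpy_after=0.45))
```

With these inputs a 2× speedup stops paying off once FPY drops below roughly 0.52, well above the 0.35 a token-only comparison would suggest, because the verifier and eval spend is re-incurred on every retry.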

4.4 Multi-LoRA adapter amortization

Mechanism. Serve one base model with many LoRA adapters (per-domain/per-task behaviors) hot-swapped per request, instead of standing up a full fine-tuned model per behavior.
Impact. Collapses N dedicated deployments into ~1 base footprint — large reduction in idle/overprovisioning and CapEx; new specialized behaviors become near-zero marginal serving cost.
Tradeoff. Adapter routing/versioning complexity; per-adapter eval still required; a bad base upgrade invalidates all adapters at once (manage via 04).

4.5 Context economy

Mechanism. Lean specs; dynamic context / skills loaded on demand; retrieval over context-stuffing; prune history; structured rather than verbose I/O. Avoid "context dumping" entire repos/specs into every prompt.
Impact. Context length drives prefill cost super-linearly via attention; trimming 50% of tokens often cuts prefill cost >50% and improves FPY (less distraction).
Tradeoff. Under-supplying context tanks FPY too — economy means right context, not minimal context. Tune against yield.

4.6 Autoscaling, partitioning, queueing

Mechanism. KEDA scale-to-zero for spiky/interactive services; MIG partitioning to pack small models onto GPU slices; spot/preemptible for training/eval batch; Kueue for queue + quota fairness.
Impact. Scale-to-zero eliminates overnight idle on bursty endpoints; MIG raises packing density; spot cuts training $ 60–90%.
Tradeoff. Cold-start latency on scale-from-zero (mitigate with warm minimums for interactive tiers); spot preemption requires checkpointing. Never put latency-critical interactive gates on pure scale-to-zero without a warm floor.

4.7 Distillation (big → small)

Mechanism. Distill Tier-L behavior into Tier-S/M adapters; the expensive Reasoner generates training signal once, the cheap model serves it forever.
Impact. Shifts steady-state load down a tier — recurring 40–65% serving-cost reduction on distilled task families; reduces Tier-L invocation frequency.
Tradeoff. Up-front distillation and eval cost; the distilled model can lag the teacher's capability on edge cases, so gated re-validation is required before it replaces escalation paths.

4.8 In-loop budget guardrails

Mechanism. Hard token/compute caps per task, reasoning-effort caps, and loop-count limits enforced by the orchestrator (06); on breach, fail-closed to human triage rather than burning unbounded compute.
Impact. Bounds the worst-case tail — caps the cost of pathological low-FPY tasks that would otherwise loop indefinitely; protects the monthly budget from a single runaway agent.
Tradeoff. Caps set too tight truncate legitimately hard tasks (raising escape-rate); caps are themselves tuned against cost-per-Green-PR.
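A minimal sketch of such a guardrail is shown below, assuming a per-task token cap and loop cap checked before each attempt. `TaskBudget`, `run_task`, and the `attempt` callback (which returns whether the verifier passed and how many tokens were spent) are hypothetical names, not the orchestrator's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TaskBudget:
    """Hard per-task caps; counters are updated after every attempt."""
    max_tokens: int
    max_loops: int
    tokens_used: int = 0
    loops_used: int = 0

    def can_continue(self) -> bool:
        return self.tokens_used < self.max_tokens and self.loops_used < self.max_loops

    def record(self, tokens: int) -> None:
        self.tokens_used += tokens
        self.loops_used += 1


def run_task(attempt: Callable[[], Tuple[bool, int]], budget: TaskBudget) -> str:
    """Retry until the verifier passes or a cap is hit, then fail closed."""
    while budget.can_continue():
        passed, tokens = attempt()
        budget.record(tokens)
        if passed:
            return "green_candidate"
    return "human_triage"  # fail-closed: stop spending, hand off to a person


budget = TaskBudget(max_tokens=5_000, max_loops=3)
print(run_task(lambda: (False, 1_000), budget), budget.loops_used)
```

Because the cap is checked before each attempt, a single attempt can overshoot the token limit; a stricter variant would also pass the remaining budget into the model call.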

flowchart TD
    A[Incoming agent task] --> B{Cache hit?<br/>prompt / semantic / retrieval}
    B -- yes --> Z[Return cached / cheap path]
    B -- no --> C[Tier-S router/classifier]
    C -->|low complexity| D[Tier-S Reflex<br/>+ multi-LoRA adapter]
    C -->|medium| E[Tier-M Worker]
    C -->|high / escalated| F[Tier-L Reasoner<br/>sparingly]
    D --> V{Verifier / deterministic gates}
    E --> V
    F --> V
    V -- pass --> G[GREEN PR candidate]
    V -- fail --> H{Budget guardrail:<br/>tokens / loops / effort left?}
    H -- yes --> C
    H -- no --> I[Fail-closed → human triage]
    G --> M[OpenTelemetry cost metering →<br/>cost-per-Green-PR]

5. Capacity Planning for 1000+ Developers

Sizing the fleet is a queueing problem, not a headcount multiplication. The goal is enough capacity to hold interactive latency SLOs at peak while keeping steady-state utilization high.

Estimating concurrent load (illustrative).

Active_devs           = 1000 × engagement_factor(0.6)        = 600
Req_per_active_dev_hr = 30  (agentic calls incl. loops/retrieval/eval)
Average_RPS           = 600 × 30 / 3600                       ≈ 5 RPS sustained
Peak_RPS              = Average_RPS × peakiness(3.0)          ≈ 15 RPS
GPU_needed_at_peak    = Peak_RPS ÷ per-GPU_throughput_at_SLO  (per tier)

Planning dimension	Approach
Peak vs. average	Size interactive tiers for peak RPS at the latency SLO; size batch (eval/training) for average throughput with queueing. Peakiness factor measured per region/timezone.
GPU fleet sizing	Per-tier: `ceil(Peak_RPS ÷ throughput_at_SLO)` + headroom. Bottom-heavy fleet (mostly Tier-S/M, few Tier-L) mirrors the routing distribution.
Batch vs. interactive separation	Dedicated pools. Interactive = warm, latency-bounded, KEDA with warm floor. Batch = Kueue-queued, spot-backed, scale-to-zero, latency-tolerant. Never let a training job preempt an interactive gate.
Multi-tenancy fairness	Kueue quotas per team/repo so no tenant starves others; borrowing from idle quotas allowed, reclaimable on demand.
Headroom for eval/training	Reserve explicit capacity (illus. 15–25%) for gate suites, reproducibility reruns, and fine-tuning — these are non-optional regulatory load, not discretionary.
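The load estimate and the per-tier sizing rule above can be sketched in code. The developer count, engagement factor, request rate, and peakiness are the illustrative values from the estimate; the Tier-S share of traffic (60%), its per-GPU throughput at SLO (2 RPS), and the 20% headroom are assumptions added here only to exercise the arithmetic.

```python
import math


def average_rps(devs: int, engagement_factor: float, req_per_active_dev_hr: float) -> float:
    """Sustained requests per second across all active developers."""
    return devs * engagement_factor * req_per_active_dev_hr / 3600


def gpus_for_tier(tier_peak_rps: float, rps_per_gpu_at_slo: float,
                  headroom_pct: int = 20) -> int:
    """ceil(Peak_RPS / throughput_at_SLO) plus headroom, rounded up to whole GPUs."""
    base = math.ceil(tier_peak_rps / rps_per_gpu_at_slo)
    extra = math.ceil(base * headroom_pct / 100)
    return base + extra


avg = average_rps(devs=1000, engagement_factor=0.6, req_per_active_dev_hr=30)
peak = avg * 3.0  # peakiness factor from the estimate above

tier_s_peak = peak * 0.6  # assumed share of calls landing on Tier-S
print(f"avg {avg:.1f} RPS, peak {peak:.1f} RPS, "
      f"Tier-S GPUs: {gpus_for_tier(tier_s_peak, rps_per_gpu_at_slo=2.0)}")
```

Repeating `gpus_for_tier` with each tier's measured share and throughput yields the bottom-heavy fleet shape described in the table.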

6. Build-vs-Buy / Self-Host Math

ILLUSTRATIVE structured comparison. Numbers are placeholders to frame the reasoning, not a quote.

Factor	Hypothetical SaaS API at this scale	Self-Hosted Fleet
Annual token volume (illus.)	~30B billable tok/yr (incl. loops, eval, retrieval)	same workload, owned compute
Unit basis	blended vendor $/1k tok	amortized $/GPU-hr + ops
Annual run cost (illus.)	`30M × $X_blended_per_1k` → large, uncapped, linear	`GPU_CapEx ÷ amort_yrs + Ops_OpEx + power`
Marginal next-token cost	vendor rate (never zero)	≈ marginal power on owned GPU
Cost trajectory at growth	scales with usage forever	flattens as utilization rises

Break-even reasoning. Self-host carries up-front CapEx (GPUs, networking) + steady Ops OpEx (SRE, FinOps, MLOps, power, eval engineering). API carries zero fixed cost but a per-token meter. There is a crossover volume above which amortized self-host is cheaper:

Self-host is cheaper when:  (CapEx ÷ amort_years) + Ops_OpEx_annual + Power_annual
                            <  Annual_token_volume × Blended_API_rate_per_token
(break-even is the volume at which the two sides are equal)

→ Self-host wins decisively once volume × API_rate exceeds fixed+ops cost.
  At 1000+ devs with loop/eval/retrieval amplification, our volume is FAR above crossover.
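The crossover can be computed directly from the inequality. In the sketch below only the 30B tok/yr volume comes from the table above; the CapEx, amortization period, ops, power, and blended API rate are arbitrary placeholders chosen to exercise the arithmetic, not estimates, and the conclusion depends entirely on measured inputs.

```python
def self_host_annual_usd(capex_usd: float, amort_years: float,
                         ops_opex_usd: float, power_usd: float) -> float:
    """Left-hand side of the inequality: amortized CapEx + ops + power."""
    return capex_usd / amort_years + ops_opex_usd + power_usd


def api_annual_usd(tokens_per_year: float, api_usd_per_1k: float) -> float:
    """Right-hand side: annual token volume times the blended API rate."""
    return tokens_per_year / 1000 * api_usd_per_1k


def breakeven_tokens_per_year(self_host_usd: float, api_usd_per_1k: float) -> float:
    """Annual volume at which API spend equals the self-host cost."""
    return self_host_usd / api_usd_per_1k * 1000


# All dollar inputs below are placeholders.
self_host = self_host_annual_usd(capex_usd=1_200_000, amort_years=3,
                                 ops_opex_usd=150_000, power_usd=50_000)
api = api_annual_usd(tokens_per_year=30e9, api_usd_per_1k=0.03)
crossover = breakeven_tokens_per_year(self_host, api_usd_per_1k=0.03)

print(f"self-host ${self_host:,.0f}/yr vs API ${api:,.0f}/yr")
print(f"crossover at {crossover / 1e9:.1f}B tok/yr")
```

Re-running this with the org's actual fleet quote, staffing cost, and negotiated API rate is the check that should back the crossover claim in any budget review.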

Non-cost drivers that make self-host non-optional here (these hold even if the math were neutral):

Sovereignty. Regulated source, specs, and defect data must not leave the boundary (07).
IP control. Proprietary device engineering knowledge stays in-house; no third-party training on our data.
Regulatory control. Pinned, reproducible weights for IEC 62304 validation; no vendor-driven model drift mid-gate (P7, 05).

The economics make self-host attractive; the regulatory and sovereignty constraints make it mandatory.

7. FinOps Operating Model

Ties to D7 in 02-maturity-model. FinOps is a standing practice, not a quarterly cleanup.

Capability	Implementation
Cost attribution	Every inference/eval/training call tagged with team, repo, agent, tier, adapter, task-id, cache-status via OpenTelemetry spans → cost pipeline. Attributable to cost-per-Green-PR per repo.
Dashboards	OTel → cost warehouse → dashboards: $/Green-PR, FPY, tier mix, cache hit-rate, GPU utilization, eval/repro share, idle %.
Budgets & alerts	Per-team monthly budgets; alerts at 70/90/100%; automatic in-loop guardrails (§4.8) enforce hard caps independent of dashboards.
Showback / chargeback	Showback by default (visibility, behavior change); chargeback for high-volume teams to internalize cost.
Cost SLOs	Explicit objectives, e.g. $/Green-PR ≤ target, eval share ≤ 40%, GPU util ≥ 70%, cache hit ≥ 50%, idle ≤ 10%. Breach triggers review.
Quarterly optimization review	Re-tune routing thresholds, quantization choices, cache policy, fleet mix, distillation candidates against measured FPY and $/Green-PR. Feeds the L5 closed loop (§8).
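Two of the capabilities above, attribution and budget alerts, reduce to simple aggregations once calls are tagged. The sketch below uses plain dictionaries in place of exported OpenTelemetry spans; the attribute names (`repo`, `tier`, `cost_usd`), the repo names, and both helper functions are hypothetical.

```python
from collections import defaultdict


def cost_per_green_pr_by_repo(spans, green_prs):
    """Sum tagged call costs per repo, then divide by that repo's Green PR count."""
    cost = defaultdict(float)
    for span in spans:
        cost[span["repo"]] += span["cost_usd"]
    return {repo: cost[repo] / n for repo, n in green_prs.items() if n > 0}


def budget_alerts(spend_usd, budget_usd, thresholds=(0.70, 0.90, 1.00)):
    """Return every alert threshold the current spend has reached."""
    used = spend_usd / budget_usd
    return [t for t in thresholds if used >= t]


spans = [
    {"repo": "pump-fw", "tier": "M", "cost_usd": 0.05},
    {"repo": "pump-fw", "tier": "L", "cost_usd": 0.25},
    {"repo": "ui-app", "tier": "S", "cost_usd": 0.02},
]
unit_costs = cost_per_green_pr_by_repo(spans, {"pump-fw": 3, "ui-app": 1})
print(unit_costs)
print(budget_alerts(spend_usd=950, budget_usd=1000))
```

Alerts of this kind are advisory; the hard stop remains the in-loop guardrail from §4.8.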

FinOps register (illustrative)

ID	Metric / control	Illustrative target	Owner	Cadence
FIN-01	Cost-per-Green-PR	≤ $T per repo class	FinOps + Repo lead	Weekly
FIN-02	First-Pass Yield	≥ 0.70	Platform + Eval	Weekly
FIN-03	GPU utilization	≥ 70%	Platform SRE	Daily
FIN-04	Idle / overprovision	≤ 10%	Platform SRE	Daily
FIN-05	Cache hit-rate (prompt+semantic)	≥ 50%	Platform	Weekly
FIN-06	Tier-L invocation share	≤ 10% of calls	Routing owner	Weekly
FIN-07	Eval + reproducibility share	≤ 40% of inference $	Quality	Monthly
FIN-08	Spot usage on training	≥ 80% of training GPU-hr	MLOps	Monthly
FIN-09	Budget breach incidents	0 unbounded-loop runaways	FinOps	Monthly

8. Cost Across the Maturity Levels

Unit economics improve as the org climbs ASMM-Med — not because tokens get cheaper, but because the denominator (Green PRs) grows and waste shrinks.

Level	Cost profile	Unit economics & control	$/Green-PR trend
L0 Ad-hoc	Untracked, sporadic; no fleet metering	No attribution; cost-per-token invisible	Unknown / uncontrolled
L1 Governed Assistance	Mostly interactive assist; basic metering begins	Cost-per-token visible; FPY undefined; little caching	High, noisy
L2 Spec-Driven Bounded Automation	Bounded tasks; routing + caching introduced	$/Green-PR first measured; guardrails appear	Declining, variable
L3 Orchestrated Agentic Workflows	Multi-step agents; eval load rises sharply	Full tier routing, multi-LoRA, budget caps; eval share managed	Stabilizing
L4 Validated Autonomous Agents	High-volume autonomous, heavy validation/repro	Strong FPY; distillation steady-state; tight cost SLOs	Low, predictable
L5 Self-Optimizing Agentic Enterprise	Closed-loop cost optimization	System auto-tunes routing/quantization/cache/fleet against $/Green-PR; data-driven distillation pipeline	Minimized, self-correcting

L5 is the target end state: routing thresholds, quantization choices, cache policy, and distillation candidates are selected by the system from telemetry, continuously, against cost-per-Green-PR — with every change still passing deterministic gates (05).

9. Anti-Patterns

Anti-pattern	Why it costs	Counter
Always-on big models	Tier-L (~65× Tier-S) serving routine calls; idle Reasoner GPUs	Tiered routing + smallest-capable-model + scale-to-zero (§4.1, §4.6)
Context dumping	Whole repos/specs into every prompt; super-linear prefill cost; lowers FPY	Context economy, retrieval over stuffing, prompt cache (§4.5, §4.2)
Unbounded agentic loops	Pathological tasks loop forever, blow the budget on one runaway	Hard token/loop/effort caps, fail-closed to human (§4.8)
Optimizing tokens, ignoring escape-rate/rework	Cheap-but-wrong: per-token down, FPY down, $/Green-PR up — the §1/§3 trap	Govern by cost-per-Green-PR; measure FPY before/after every change
Idle GPU sprawl	Reserved-but-unused GPUs, warm pools nobody uses, no batch/interactive split	Scale-to-zero, MIG packing, Kueue quotas, idle SLO (FIN-04)
No caching	Recompute identical prefixes/embeddings every call	KV/prefix + prompt + semantic + retrieval cache (§4.2)
Forgetting eval cost	99.9% gates + reproducibility runs un-budgeted; surprise overruns	Treat eval/repro as first-class line (FIN-07), reserve headroom (§5)
Quantize-and-pray	Throughput up, accuracy silently down, gates start failing	Re-pass deterministic gates after every quantization (§4.3)

Bottom line. This program does not minimize cost-per-token; it minimizes cost-per-Green-PR while holding ≥99.9% gate correctness. Self-hosting gives us the cost curve and the sovereignty; the levers in §4, the capacity discipline in §5, and the FinOps loop in §7 keep that curve flat as the org scales L1→L5. Quality is the dominant cost lever: raising first-pass yield saves more than any per-token discount.