Edge AI Models Without the PhD — An Architect’s Field Guide

Who this is for

You design and govern systems for a living. You know what an API, a container, and a database are — but terms like gradient descent and backpropagation get hand-waved in vendor decks. This book closes that gap. Part I gives you an accurate working model of how AI learns, in plain language and pictures; the rest shows how to turn that understanding into sound architecture decisions for on-premise, regulated, and edge deployments. No mathematics is required — only the diagrams.

Part I

How machines learn

A correct, jargon-free mental model of training, gradient descent, and backpropagation — the foundation for every decision that follows.

1 What a model actually is

Strip away the mystique and a neural network is a very large mathematical function with millions or billions of adjustable numbers, called weights (or parameters). You feed something in — text, an image — it flows through layers of simple operations, and a result comes out (Figure 1).

Figure 1. Anatomy of a model. Information flows left to right through layers; every connection has a weight (a dial). “Training” means finding good values for billions of these dials. “Inference” is one left-to-right pass with the dials held fixed.

The weights are the knobs. Set them randomly and the output is gibberish. Set them well and the same function translates languages, drafts code, or extracts fields from a document. Training is the process of finding good knob settings. Inference is using the trained function — one pass through the layers with the knobs held fixed. Almost everything your users experience is inference; training happened earlier, usually on someone else’s cluster.

Key idea

A “model” is a function plus a giant table of numbers. “Training” changes the numbers; “inference” reads them. Knowing which one you are doing — and whether you need to do it at all — is the central architectural question of this book.

For the architect

Inference is cheap, repeatable, and stateless; it scales like any read-heavy service. Training is expensive, stateful, and risky. When a vendor says “we’ll fine-tune a model for you,” they are proposing to run the expensive, risky activity — possibly on your data, possibly repeatedly. Always ask which activity is being proposed, where it runs, and how it is reverted.

2 Gradient descent: learning as rolling downhill

How does training find good knob settings among astronomically many possibilities? It does not search them all. It uses feedback to take small, smart steps — an algorithm called gradient descent (Figure 2).

Imagine the model’s total error as a landscape of hills and valleys. Every possible combination of weights is a point on the ground; the height at that point is how wrong the model is. Good weight settings sit in the valleys (low error); bad ones sit on the peaks. Training drops a ball onto this landscape and rolls it downhill: at each point it measures the slope (the gradient, which points uphill) and takes one small step in the opposite, most steeply downhill direction. Repeat millions of times and the ball settles near a valley floor, which corresponds to a good set of weights.

Figure 2. Gradient descent. Picture the model’s error as a hilly landscape; training rolls a ball downhill, taking a small step in the steepest-down direction each round. The step size is the learning rate. “Converged” means it reached a valley floor.

Two terms you will hear, now demystified: the learning rate is the size of each step (too big and the ball overshoots the valley; too small and training crawls), and convergence means the ball has stopped meaningfully moving — it found a valley. “An epoch” is one pass over the training data; models take many.
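The rolling-ball picture can be sketched in a few lines of Python. This is an illustration only: the one-dial error function, the starting point, the learning rate and the step count are all assumptions chosen for the example, not values from any real model.

```python
def error(w: float) -> float:
    """Toy landscape with a single dial: the valley floor (error 0) is at w = 3."""
    return (w - 3.0) ** 2


def slope(w: float) -> float:
    """Derivative of error(w); a positive value means the ground rises to the right."""
    return 2.0 * (w - 3.0)


def gradient_descent(start: float, learning_rate: float, steps: int) -> float:
    """Repeat: measure the slope, then take a small step against it (downhill)."""
    w = start
    for _ in range(steps):
        w -= learning_rate * slope(w)
    return w


w_final = gradient_descent(start=0.0, learning_rate=0.1, steps=100)
print(w_final, error(w_final))
```

In this toy landscape a learning rate above 1.0 makes every step overshoot the valley by more than the previous miss, so the ball moves away from the floor instead of settling on it.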

Plain-English

Gradient descent is just “measure which way is downhill, step a little, repeat.” The cleverness is in computing “downhill” efficiently for billions of knobs at once — which is exactly what backpropagation does.

3 Backpropagation: assigning the blame

Gradient descent needs to know, for every single weight, “which way is downhill?” With billions of weights, computing that naively would be hopeless. Backpropagation is the efficient trick that makes it possible — and it is the engine behind essentially every modern AI model.

It works in two passes. First the forward pass: the input flows left-to-right through the layers and the model makes a prediction. Then the model compares that prediction to the known correct answer and computes an error. Now the backward pass: that error is pushed back through the network from right to left, and at each layer it is divided up — each weight receives a share of the blame proportional to how much it contributed to the mistake. Every weight learns the direction it should nudge to reduce the error. Take that nudge (that is the gradient-descent step), and repeat (Figure 3).

Figure 3. Backpropagation. After a forward pass produces a prediction, the network compares it to the truth and sends the error backward, assigning each weight a share of the blame and nudging it. Do this millions of times and the dials settle into useful values.

That is the whole idea: forward to predict, backward to assign blame, nudge, repeat. Run it across a large dataset and the weights organize themselves into something useful. Backpropagation is extraordinarily effective — it built the models everyone is talking about — but, as we will see, it has properties that make it an awkward fit for on-device adaptation.
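The forward-then-backward cycle can be made concrete with a deliberately tiny network. The sketch below assumes two weights in a chain with no activation function, a single training example, and a squared-error loss; all numbers are made up for illustration. Real networks do the same bookkeeping across billions of weights.

```python
def train_step(w1: float, w2: float, x: float, target: float,
               learning_rate: float) -> tuple[float, float, float]:
    # Forward pass: the input flows through two "layers" to a prediction.
    h = w1 * x
    y = w2 * h
    loss = (y - target) ** 2

    # Backward pass: push the error back one layer at a time (chain rule).
    d_y = 2.0 * (y - target)  # how the loss changes with the prediction
    d_w2 = d_y * h            # blame assigned to the second weight
    d_h = d_y * w2            # error handed back to the hidden value
    d_w1 = d_h * x            # blame assigned to the first weight

    # Gradient-descent nudge on both weights.
    return w1 - learning_rate * d_w1, w2 - learning_rate * d_w2, loss


w1, w2 = 0.5, 0.5
losses = []
for _ in range(200):
    w1, w2, loss = train_step(w1, w2, x=1.0, target=2.0, learning_rate=0.05)
    losses.append(loss)

print(losses[0], losses[-1], w1 * w2)
```

The loss recorded on the first step is large and the last is close to zero, which is the "dials settle into useful values" behaviour in miniature.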

Why this matters later

Backpropagation needs a global error signal threaded backward through the entire network, and it changes all the weights at once. On the edge — small data, no rollback, live systems — that global, all-at-once update is precisely what causes drift and forgetting. Hold this thought; Part II builds on it.

4 Pretraining, fine-tuning, and inference

Three words get used loosely. Separating them cleanly removes most of the confusion in enterprise AI conversations.

Activity	What happens	Cost & risk
Pretraining	The foundation model is trained from scratch on enormous general data (the lab does this).	Millions of dollars; not something you do.
Fine-tuning	You continue training a pretrained model on your data, changing its weights.	Moderate cost; real risk — drift, forgetting, hard to reverse (Figure 4).
Inference	You run the (pre/fine-)trained model to get answers. Weights are read-only.	Cheap, safe, repeatable.

The industry reflex is: “generic model underperforms → fine-tune it.” This book argues that on the edge there is a fourth option that is usually better: keep the weights frozen and change the model’s inputs and surroundings instead. That option — in-context learning plus external memory — gets most of the benefit of fine-tuning with almost none of the risk. Parts III–IV make it concrete.

Part II

Why “just fine-tune it” fails on the edge

Four hard constraints, the science that quietly supports freezing the base, and how to tell a capacity problem from an adaptation problem.

5 The four constraints of the edge

In a cloud lab, fine-tuning is routine. On an edge appliance — a box in a hospital, a factory, a retail back office — four constraints turn the same operation into a liability.

Reversibility. A weight update is baked into billions of numbers. Proving you can undo it, instantly and completely, is hard — and regulators ask.
Forgetting & drift. Training on a narrow corpus quietly degrades unrelated abilities (next chapter).
Capacity & data. A small on-device model plus a small in-house dataset is the regime where fine-tuning pays least and risks most.
Cadence. Retraining is an event you schedule. Domains change continuously. Adaptation needs to be a habit, not a quarterly project.

For the architect

Map each constraint to a control you already understand: reversibility → versioning & rollback; drift → regression testing; capacity → right-sizing; cadence → CI/CD. The frozen-base design in Part III gives you all four as properties of the architecture rather than processes you must bolt on.

6 Catastrophic forgetting and drift

The most underappreciated risk of fine-tuning has a dramatic name: catastrophic forgetting (Figure 4).

Because backpropagation changes all the weights to fit the new task, it overwrites some of what made the model good at everything else. Teach it your billing terminology and it may get subtly worse at general reasoning, other languages, or last quarter’s task — with no error message. You only find out if you have a comprehensive test suite covering the old abilities, which most teams do not.

Figure 4. Catastrophic forgetting. Push gradient updates onto a narrow new task and the model gets better at it — while quietly getting worse at things it used to do. On the edge you rarely have the regression suite to even notice.

Drift is the slow-motion cousin: re-fine-tune every few weeks and the model wanders away from its tested behavior, one update at a time. In a regulated setting, “we are not sure exactly how today’s model differs from the one we validated” is an unacceptable sentence. Freezing the base eliminates this class of risk by construction.
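One way to make forgetting visible is to score the model on a fixed set of older tasks before and after any change and flag the drops. The sketch below is a minimal illustration under stated assumptions: the task names, the scores and the 0.02 tolerance are hypothetical, and the scoring itself is assumed to happen elsewhere.

```python
def forgetting_report(before: dict[str, float], after: dict[str, float],
                      tolerance: float = 0.02) -> dict[str, float]:
    """Return the score change for every task that dropped by more than `tolerance`."""
    return {
        task: round(after[task] - before[task], 4)
        for task in before
        if task in after and before[task] - after[task] > tolerance
    }


# Hypothetical scores on a regression suite, before and after a fine-tune.
before = {"billing_terms": 0.70, "general_reasoning": 0.85, "translation": 0.80}
after = {"billing_terms": 0.92, "general_reasoning": 0.78, "translation": 0.79}

report = forgetting_report(before, after)
print(report)
```

In this made-up run the new task improves while general reasoning falls outside the tolerance, which is the silent regression the chapter describes.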

7 What the brain-alignment research hints at

There is a striking line of computational-neuroscience research suggesting the foundation we freeze is doing more of the work than we assume — and that heavy retraining can even be counterproductive.

Researchers compared several ways of training small vision networks — including standard backpropagation, two “local” biologically-inspired rules, and an untrained network with random weights — and measured how closely each matched real brain activity. Three findings rhyme with everything in this book:

The untrained network often matched or beat the backprop-trained one at early processing stages. The structure of the network carried most of the signal; aggressive training sometimes moved it the wrong way.
Local, lightweight learning preserved structure that global backprop eroded. Gentle, targeted updates beat sweeping ones.
Higher-level abstraction needed more capacity and data, not a cleverer update rule. When the task genuinely required more, only a bigger model helped — the learning rule was irrelevant.

An honest caveat

These are small vision experiments scored against brain data, with deliberately hedged results — not proof about language models. We use them as intuition, not evidence. The engineering case stands on its own measurements. But the rhyme is useful: freeze the structure, adapt gently, and spend capacity only when the problem truly demands it.

8 Capacity vs. adaptation: which problem do you have?

Before choosing a tool, diagnose the failure. Almost every “the model is not good enough” complaint is one of two very different problems.

Symptom	It is a…	Right tool
Model knows the domain but is sloppy, inconsistent, wrong format, ignores your rules	Adaptation problem	In-context learning, memory, the self-learning loop (Parts III–IV) — no weight changes.
Model fundamentally cannot represent the concept — reasoning is beyond its size, even with perfect context	Capacity problem	A larger model, or distillation into the small one (Part VI). Weight changes justified.

The expensive mistake is treating an adaptation problem as a capacity problem — fine-tuning (or buying a bigger model) to fix what was really a prompt-and-grounding issue. The discipline in Part IV is precisely how you tell them apart with data instead of opinion.

Part III

The frozen-base doctrine

Keep the weights fixed and move every form of adaptation into external, reversible state: knowledge, an in-context skill, and lightweight controllers.

9 Freeze the base, adapt the surroundings

The doctrine in one line: the model’s weights never change; everything that makes it yours lives outside the weights, in state you can read, diff, and revert.

“Self-learning” does not have to mean “weight-updating.” We define it as swappable external state that measurably improves a held-out score over time. It lives in four layers — only the last touches weights, and that one is held in reserve (Figure 5):

Figure 5. The frozen-base stack. Adaptation lives in external, reversible state (L0–L2) that shapes a frozen model at request time. The only weight-changing layer, L3, is a detachable adapter held in reserve.

Layer	What it holds	Weights?
L0 · Knowledge	A living knowledge graph + search index of your corpus.	No
L1 · Skill	A plain-language instruction the model follows — the unit we optimize.	No
L2 · Behavior	Tiny per-request controllers that steer the frozen model on hard cases.	No
L3 · Weights (parked)	A detachable, never-merged adapter — last resort only.	Yes (revertible)

Key idea

Every change is an artifact — a graph delta, a versioned text file, a small controller blob — not an opaque shift in billions of parameters. Auditability and one-step rollback are built in, not bolted on.

10 In-context learning: teaching without retraining

The single most useful idea for enterprise teams: you can change a model’s behavior dramatically without touching its weights, simply by changing what you put in front of it. This is in-context learning (Figure 6).

Figure 6. In-context learning. The same frozen model produces different behavior depending only on what you put in front of it. Improving the prompt — not the weights — is the cheapest, most reversible way to specialize a model.

The system prompt is not decoration — it is the procedure the model executes, and it is the thing we treat as the unit of learning. We call a well-developed one a skill: a precise, failure-aware specification of how to do the domain task. Because it is text, it is human-readable, version-controlled, reviewable, and portable — the skill that scored highest last night is exactly what production loads today.

For the architect

Think of the skill the way you think of configuration or a stored procedure: a versioned artifact in your repository, promoted through environments, rolled back on regression. “Training the model” becomes “promoting a new skill version” — a change management process you already operate.
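Treating the skill as a versioned artifact can be sketched as a small in-memory store. This is an assumption-laden illustration, not the system's actual storage: the class name `SkillStore`, its methods and the example skill texts are all hypothetical, and a real deployment would more likely keep the text in a repository.

```python
from dataclasses import dataclass, field


@dataclass
class SkillStore:
    """Keeps every skill version; `active` points at the one production loads."""
    versions: list[str] = field(default_factory=list)
    active: int = -1

    def commit(self, text: str) -> int:
        """Promote a new skill text and return its version number."""
        self.versions.append(text)
        self.active = len(self.versions) - 1
        return self.active

    def current(self) -> str:
        if self.active < 0:
            raise LookupError("no skill has been committed yet")
        return self.versions[self.active]

    def rollback(self) -> str:
        """Step back one version; history is kept for the audit trail."""
        if self.active <= 0:
            raise LookupError("no earlier version to roll back to")
        self.active -= 1
        return self.current()


store = SkillStore()
store.commit("Extract vendor and severity.")
store.commit("Extract vendor, cell, severity and resolution.")
restored = store.rollback()  # a regression was found: return to the previous skill
print(restored)
```

Rollback here moves a pointer rather than deleting anything, so the rejected version stays available for review.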

11 Memory and retrieval: grounding answers in your data

A frozen model knows the world broadly but knows nothing of your records. Rather than burn your data into its weights, we keep the data in an external memory and hand the model the relevant pieces at request time. This is retrieval-augmented generation (RAG).

Figure 7. Retrieval-augmented generation (RAG). Instead of hoping the model memorized your data, you fetch the relevant records from a knowledge store and hand them to the model as context — so answers are grounded in source records and the data can change without retraining.

A continuous knowledge layer turns your corpus into typed, queryable structure (a graph) plus a search index. When a question arrives, the system retrieves the relevant records and composes them into the prompt (Figure 7), so the answer is grounded in source records and can cite them. Crucially, the knowledge can change — new documents, corrected facts — without retraining anything. The same memory also supplies the score the self-learning loop optimizes against (Part IV), coupling “what the system knows” and “how well it performs” through one metric.
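The retrieve-then-compose step can be illustrated with a toy example. The sketch below swaps the knowledge graph and search index for plain keyword overlap, purely to keep it self-contained; the record ids, record texts and prompt wording are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, records: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k record ids, ranked by how many tokens they share with the question."""
    q = tokenize(question)
    overlap = {rid: len(q & tokenize(text)) for rid, text in records.items()}
    ranked = sorted(overlap, key=lambda rid: (-overlap[rid], rid))
    return [rid for rid in ranked[:k] if overlap[rid] > 0]


def compose_prompt(question: str, records: dict[str, str], k: int = 2) -> str:
    """Place the retrieved records in front of the question, tagged by id for citation."""
    context = "\n".join(f"[{rid}] {records[rid]}" for rid in retrieve(question, records, k))
    return ("Answer using only the records below and cite their ids.\n"
            f"{context}\n\nQuestion: {question}")


records = {
    "INC-1": "Cell 4412 outage, vendor Acme, severity high",
    "INC-2": "Planned maintenance on core router",
    "INC-3": "Cell 4412 restored after vendor patch",
}
prompt = compose_prompt("What happened to cell 4412?", records)
print(prompt)
```

Adding or correcting a record changes the next prompt immediately, with no change to the model.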

12 Runtime controllers: lightweight local steering

The lightest form of “learning” in the stack is a set of tiny runtime controllers — kilobyte-sized adjustments, fitted from already-verified examples, that nudge the frozen model on hard or ambiguous cases.

They are the on-device echo of the research finding from Chapter 7 — local, structure-preserving adaptation beats global retraining. A controller (the L2 layer of Figure 5) attaches and detaches like a feature flag, never alters the base, and is discarded the moment it stops helping. For most deployments L0–L1 (knowledge + skill) carry the load; controllers are the optional third gear for the genuinely hard cases.

Part IV

Making it improve itself

A closed loop that improves the in-context skill automatically, every night, with a program as the judge — no labels, no humans, no data leaving the box.

13 The self-learning loop

In-context adaptation only compounds if something improves the context automatically. That engine is a nightly loop with a programmatic verifier at its heart (Figure 8).

Figure 8. The self-learning loop. Each night the model attempts graded tasks, a program scores every attempt, a larger model reflects on the failures and proposes a text patch, and a gate keeps the change only if it beats a frozen test set. No labels, no human, no data leaving the box.

One cycle: the model tries the domain tasks (sampling several candidate answers each); a program verifies every attempt against ground truth and scores it; a larger “teacher” model reflects on the failures and proposes a small text patch to the skill; and a gate keeps the patch only if it beats a frozen test set the optimizer never sees. Accept → new skill version; reject → revert and log. It runs unattended and entirely on the box.

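# One nightly cycle. Collaborators are passed in, so the cycle itself is plain, testable code.
def nightly_cycle(skill, model, teacher, tasks, rollout, verify, failures,
                  apply_patch, score_on_frozen_holdout, n, min_lift):
    champion = skill.current()
    # Score every sampled attempt with the programmatic verifier.
    scored = [verify(t, a) for t in tasks
                           for a in rollout(model, champion, t, n=n)]
    baseline = score_on_frozen_holdout(champion)
    # The teacher proposes text edits only; it never scores.
    patch = teacher.reflect(champion, failures(scored))
    candidate = apply_patch(champion, patch)
    if score_on_frozen_holdout(candidate) - baseline >= min_lift:
        skill.commit(candidate)    # accept: new champion
        return candidate
    return champion                # reject: champion stays in place (auto-revert)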

14 The verifier is the reward

The make-or-break component is the judge. We insist it be a graded, deterministic program — not a human, and not another AI grading the first.

For an extraction task, the verifier is simply: what fraction of the expected fields did the model get right, after normalization? — a number between 0 and 1. Because the reward is code, the loop runs fully unattended; because it is graded (not pass/fail), a near-miss is distinguishable from a disaster, giving the teacher a gradient to climb. The teacher model writes patches but never scores — generation and evaluation stay strictly separate (step 3 in Figure 8), so the optimizer cannot grade its own homework.
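A graded field-level verifier of the kind described above might look like the following sketch. The normalization rule (lower-casing and collapsing whitespace) is an assumption, as are the example values; the field names follow the telecom extraction task mentioned later in the book.

```python
def normalize(value: object) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(str(value).lower().split())


def field_score(expected: dict[str, str], predicted: dict[str, str]) -> float:
    """Fraction of expected fields the prediction got right, between 0 and 1."""
    if not expected:
        raise ValueError("expected fields must not be empty")
    hits = sum(
        1 for name, value in expected.items()
        if name in predicted and normalize(predicted[name]) == normalize(value)
    )
    return hits / len(expected)


expected = {"vendor": "Acme", "cell": "4412", "severity": "High", "resolution": "patched"}
predicted = {"vendor": " acme ", "cell": "4412", "severity": "low"}

score = field_score(expected, predicted)
print(score)
```

Two of the four fields match after normalization, so this near-miss scores 0.5 rather than a flat fail, which is the graded signal the teacher needs.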

Two traps to engineer around

Goodharting — optimize exactly what you measure and the model finds technically-valid garbage; defend with layered checks (format and correctness and grounding). Sparsity — a yes/no reward starves the loop; always grade on a 0–1 scale.

15 Evaluation discipline: never fool yourself

The most transferable rule in all of this: a change is only “better” relative to the unadapted model, measured on data the optimizer never saw. The neuroscience papers state it bluntly — always include an untrained baseline — and the loop bakes it in.

Frozen, fingerprinted test set. Content-hashed and verified before every run; if a byte drifts, the run aborts. The optimizer can never see or touch it.
One metric, end to end. The same scorer grades practice and the final gate — you optimize exactly what you measure.
A noise floor. A minimum-improvement threshold rejects sub-noise “wins.” A good gate says no far more often than yes.
Self-limiting. On a task already near its ceiling, a long string of rejections holding the line is the correct outcome — a system that cannot quietly make itself worse.
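The fingerprint check and the noise-floor gate from the list above can be sketched together. This is a minimal illustration under stated assumptions: the JSON serialization, the function names and the 0.02 threshold are hypothetical choices, while the scores reuse figures reported in the next chapter.

```python
import hashlib
import json


def fingerprint(holdout: list[dict]) -> str:
    """Content hash of the test set; any changed byte yields a different digest."""
    blob = json.dumps(holdout, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def run_gate(holdout: list[dict], expected_fingerprint: str,
             candidate_score: float, baseline_score: float, min_lift: float) -> bool:
    """Abort if the holdout drifted; otherwise accept only gains at or above the noise floor."""
    if fingerprint(holdout) != expected_fingerprint:
        raise RuntimeError("holdout changed since it was frozen; aborting run")
    return candidate_score - baseline_score >= min_lift


frozen = [{"prompt": "incident log 1", "expected": {"vendor": "acme"}}]
frozen_fp = fingerprint(frozen)  # recorded once, when the test set is frozen

accepted = run_gate(frozen, frozen_fp, candidate_score=0.75, baseline_score=0.69, min_lift=0.02)
rejected = run_gate(frozen, frozen_fp, candidate_score=0.93, baseline_score=0.96, min_lift=0.02)
print(accepted, rejected)
```

Logging the score delta alongside each accept or reject decision would give the audit trail described in the note that follows.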

For the architect

This is a regression-test gate for AI, expressed in language you already enforce for code: a protected test set, a single source-of-truth metric, and a merge that is blocked unless it provably improves things. Audit logs fall out for free — every accept/reject is recorded with its score delta.

16 What the results taught us

The loop is not a thought experiment. We ran it unattended, night after night, against three real-world corpora — telecom network incidents, pharmaceutical recall records and medical-device files — and watched a frozen ~4-billion-parameter model teach itself with no human labels and no data leaving the box.

Figure 9. Real campaigns, one loop. Telecom (blue) climbed from 0.69 to 0.98 over three accepted nights; Pharma (green) was already near its ceiling at 0.96, so the gate rejected every proposed change. A third domain (medical devices) sits between them. Both behaviours are correct — and both came from the same machinery.

Telecom — large headroom, captured automatically. Starting from a short seed instruction that scored 0.69 on a frozen 60-prompt test set, the skill climbed to 0.75, then 0.90, then 0.98 over three accepted nights (Figure 9) — a +42% relative jump to near-perfect extraction of vendor, cell, severity and resolution from messy incident logs. Then it plateaued and the gate stopped accepting changes. No labels were written; the only human input was the scorer.

Pharma — the discipline of knowing when to stop. On a frozen 30-prompt holdout of real public recall data the seed skill already scored 0.96. Over the following nights the loop kept proposing changes and the gate rejected every one — including a candidate that would have dropped the score to 0.93 — holding the line at 0.96. That is not the loop failing; it is the loop working correctly on a task already near its ceiling.

Medical devices — the middle case. A third, harder corpus of medical-device records confirmed the pattern from the other direction: over sixteen nights the loop accepted exactly one improvement (0.77 → 0.78 on a 100-prompt holdout) and rejected the other fifteen. Three domains, three honest outcomes, one unchanged loop.

Key idea

The headline (telecom 0.69 → 0.98) and the anticlimax (pharma held at 0.96) come from the same machinery. A system that compounds gains where there is room and refuses changes where there is none is exactly what you want running unattended in production.

Five things the testing made concrete:

Real domains carry large recoverable headroom, and a programmatic verifier captures it without labels (telecom: +0.29 absolute, fully autonomously).
The same loop knows when to stop. On a near-ceiling domain the correct behaviour is a long run of rejections (pharma) — direct evidence the gate is doing its job (Figure 8).
It transfers cheaply. Three very different corpora ran on the same hardware and the same loop; only a corpus and a scorer changed, with no model retraining and no new architecture.
Honest, boring results are a feature. “+0.01 and then nothing for two weeks” is the signature of a system that cannot quietly make itself worse — the property regulated buyers actually pay for.
The discipline is what makes the numbers trustworthy. Every value above is measured on a frozen, fingerprinted test set the optimizer never saw; without that control (Chapter 15) the gains would be unfalsifiable.

For the architect

Read these as a regression-test track record, not a benchmark leaderboard. The valuable artifact is not a single high score — it is the audit log: a per-night record of what was proposed, accepted or rejected, and by how much it beat the frozen baseline. That log is your evidence trail for change control and compliance.

Read the numbers honestly

These corpora are domain-representative but curated, and the test sets are small (30–100 prompts). The point is not the absolute scores — it is the shape: autonomous, label-free improvement where headroom exists, and disciplined refusal where it does not. Re-validate on your own corpus before quoting figures.

System facts — from the live deployment

Behind the scores sit real stores. In the multi-domain deployment the telecom knowledge graph holds 8.3 MB of typed graph across 41 indexed chunks, 49 entities and 63 edge types (service memory-telecom, port 9003); the medical-device graph is a leaner 1.4 MB (memory-medical, port 9013). Serving runs the executor at roughly 120 tokens/sec on the NVFP4 fast path versus ~30 on bf16, and the whole stack — executor, nightly teacher and memory — occupies about 50 of the 128 GB available.

Part V

The hardware reality

What “edge” means in the Blackwell era, the model built to match it, quantization without tears, why throughput sets the pace of learning, and the develop-then-deploy workflow.

17 Edge hardware in the Blackwell era: Thor, DGX Spark & RTX Spark

“Edge” no longer means underpowered. A new class of device puts datacenter-grade AI silicon in a box you can put on a desk, in a rack, or on a robot — and run entirely offline.

The common thread across NVIDIA’s current devices is a large pool of unified memory (128 GB shared coherently between CPU and GPU, ~273 GB/s) and native 4-bit (NVFP4) compute. That combination lets the serving model, a larger teacher used only at night, and the memory layer all stay resident at once — no swapping, no second machine.

Two devices anchor this book, at opposite ends of the spectrum (Figure 10): the desktop-class DGX Spark developer box and the embedded Jetson Thor edge module. A third, the consumer RTX Spark, brings the same substrate to Windows PCs.

Figure 10. The two Blackwell devices that anchor this book. DGX Spark (left) is the desktop developer box; Jetson Thor (right) is the embedded edge module. Both carry 128 GB of unified memory and native NVFP4 compute. Product images © NVIDIA.

Spec	Jetson AGX Thor	DGX Spark	RTX Spark
Role	Edge · physical AI	Desktop · develop & fine-tune	Windows PC · agents
FP4 compute	~2070 TFLOPS	~1 PFLOP	~1 PFLOP (6,144 RTX cores)
CPU	14-core Arm Neoverse-V3AE	20-core Arm (GB10 Grace)	20-core Arm (MediaTek)
Memory	128 GB unified LPDDR5X · ~273 GB/s (shared across all three)
Power	40–130 W	desktop	laptop / mini-desktop
Notable	Holoscan + Isaac sensor fusion, MIG	fine-tune ~70B; cluster ×2 (ConnectX-7) → ~405B	120B LLM @ up to 1M-token context

Three form factors, one substrate — complementary, not competing. The workflow follows the shape of the table: do the heavy, occasional work on the desktop and deploy the result at the edge (Figure 11).

Figure 11. Develop on the desktop, deploy at the edge. All three NVIDIA devices share the same Blackwell + 128 GB + NVFP4 substrate, so the frozen base and the proven skill move between them unchanged.

18 Gemma 4 E4B: a model built for the edge

Edge-class hardware is only half the story. It is paired with a model designed, from its name down, to be edge-first: Gemma 4 E4B, a compact instruction-tuned open model whose architecture trades raw size for footprint efficiency at every turn.

The name itself decodes the key ideas:

E = “Effective.” About 8B weights are stored, but only ~4B activate for any given token — large-model behaviour at a ~4B compute and memory cost.
PLE — Per-Layer Embeddings. The mechanism behind “effective”: a large block of embedding parameters (roughly 40% of the model) feeds each decoder layer from a cheaper memory tier instead of crowding the accelerator’s hot path. It is also the quantization-fragile block the next chapter protects.
E2B nested (MatFormer). A smaller ~2B model lives inside the ~4B one, usable for cheap drafts or speculative decoding.
128K context, hybrid attention. Long inputs are handled with a mix of sliding-window and global attention — efficient on long sequences.

Figure 12. Why Gemma 4 E4B is edge-first. About 8B weights are stored but only ~4B activate per token; the ~40% that are per-layer embeddings (PLE) sit in a cheaper memory tier in bf16. With a nested E2B submodel, 128K context and hybrid attention, every choice trades raw size for footprint efficiency.

Why this is edge-first. Every one of these choices buys footprint efficiency rather than raw scale (Figure 12). They are not generic tricks; they are specifically what lets the whole stack — the executor model, a larger nightly teacher, and the memory store — fit inside 128 GB of unified memory and run at interactive speed on a 40–130 W board. Pairing E4B with a Blackwell device is a co-design, not a compromise.

Key idea

“Effective” is the whole game: the model stores more parameters than it activates, so it punches above its weight class on reasoning while keeping a small model’s memory and speed footprint.

For the architect

Because E4B is small and its embeddings can sit in a cheaper tier, the executor, the teacher, and the memory layer can all stay resident on one box at once. That residency is exactly what makes the nightly self-learning loop (Part IV) feasible on a single edge appliance — you are not paging models in and out to make room.

A model is only edge-first if it is served correctly

PLE is fragile. A serving runtime that silently drops it produces degraded output with no crash to warn you. The rule (detailed next chapter): keep PLE in bf16, quantize only the compute weights to NVFP4, and fail the build if the conversion touches a PLE tensor.

19 Quantization without tears

The lever that makes big models fit and run fast on the edge is quantization: storing each weight in fewer bits (Figure 13).

A weight stored in 16-bit precision (bf16) is accurate but heavy; the same weight in 4-bit (NVFP4) is a quarter of the size and runs roughly four times faster, with carefully managed accuracy loss. The catch is that not all parts of the model tolerate it equally. Some compact models keep a fragile block of per-layer embeddings (PLE) that must stay high-precision — quantize them and quality collapses silently, with no crash to warn you.
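The size arithmetic is simple enough to check by hand. The sketch below counts raw weight bits only; real 4-bit formats also store scaling metadata, so actual files are somewhat larger. The 8B stored-weight count and the roughly 40% embedding share come from the previous chapter; treating them as exact round numbers is an assumption.

```python
GB = 1e9  # decimal gigabytes


def weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Raw storage for the weights alone, ignoring scaling metadata and runtime buffers."""
    return num_params * bits_per_weight / 8 / GB


stored_params = 8e9   # about 8B stored weights (previous chapter)
ple_params = 3.2e9    # about 40% of them are per-layer embeddings
compute_params = stored_params - ple_params

all_bf16 = weight_gb(stored_params, 16)
# Mixed precision: embeddings stay in bf16, compute weights drop to 4 bits.
mixed = weight_gb(ple_params, 16) + weight_gb(compute_params, 4)

print(all_bf16, mixed)
```

Under these assumptions the all-bf16 model needs about 16 GB for weights and the mixed-precision build about 8.8 GB, most of which is the embedding block that stays in high precision.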

Figure 13. Quantization, plainly. Storing each weight in fewer bits makes the model smaller and faster; 4-bit NVFP4 runs ~4× quicker than 16-bit. The catch: a fragile slice of the model (per-layer embeddings) must stay high-precision, or quality silently collapses.

The PLE-safe rule

Quantize only the compute weights (attention and feed-forward) to NVFP4; keep the per-layer embedding tables in bf16. Bake a safety check into the build so it fails if a fragile tensor gets quantized, rather than shipping a quietly worse model. And never serve such a model on a runtime that silently drops PLE.
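A build-time safety check along these lines could be as small as the sketch below. It assumes the conversion step can emit a manifest mapping tensor names to precisions; the tensor names, the `per_layer_embed` marker and the dtype labels are hypothetical and would need to match whatever the real toolchain reports.

```python
def check_ple_safe(tensor_dtypes: dict[str, str],
                   fragile_markers: tuple[str, ...] = ("per_layer_embed",),
                   required_dtype: str = "bf16") -> None:
    """Raise if any fragile tensor was converted away from the required precision."""
    offenders = [
        name for name, dtype in tensor_dtypes.items()
        if any(marker in name for marker in fragile_markers) and dtype != required_dtype
    ]
    if offenders:
        raise RuntimeError(f"fragile tensors not kept in {required_dtype}: {offenders}")


# Hypothetical manifest produced by a conversion step.
good_manifest = {
    "layers.0.attn.q_proj": "nvfp4",
    "layers.0.mlp.up_proj": "nvfp4",
    "layers.0.per_layer_embed": "bf16",
}
check_ple_safe(good_manifest)  # passes silently

bad_manifest = dict(good_manifest, **{"layers.0.per_layer_embed": "nvfp4"})
try:
    check_ple_safe(bad_manifest)
    build_failed = False
except RuntimeError:
    build_failed = True
print(build_failed)
```

Failing loudly at build time replaces the silent quality collapse with an error someone has to look at.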

20 Throughput is the learning rate

Here is the non-obvious connection that ties hardware to outcomes: on an edge device, how fast the model runs sets how fast it can learn.

The self-learning loop improves by running many graded attempts per night — try, score, reflect, repeat. The more attempts you can run in the nightly window, the more the skill improves. Because 4-bit compute is ~4× faster than 16-bit (Figure 13), it buys ~4× more turns of the self-learning loop (Figure 8) per night on the same box. Throughput is not a vanity metric here; it is the literal pace of improvement. This is why the quantization and memory choices in the previous chapters are first-order, not housekeeping.
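A back-of-the-envelope calculation shows the effect. The two throughput figures are the ones quoted in the system facts (roughly 120 tokens/sec on NVFP4 versus about 30 on bf16); the six-hour window and 500 tokens per attempt are assumptions for illustration, and the sketch ignores time spent on scoring and on the teacher.

```python
def attempts_per_night(tokens_per_sec: float, window_hours: float,
                       tokens_per_attempt: int) -> int:
    """How many complete attempts fit in the nightly window at a given generation speed."""
    total_tokens = tokens_per_sec * window_hours * 3600
    return int(total_tokens // tokens_per_attempt)


WINDOW_HOURS = 6          # assumed nightly window
TOKENS_PER_ATTEMPT = 500  # assumed average length of one graded attempt

nvfp4_attempts = attempts_per_night(120, WINDOW_HOURS, TOKENS_PER_ATTEMPT)
bf16_attempts = attempts_per_night(30, WINDOW_HOURS, TOKENS_PER_ATTEMPT)

print(nvfp4_attempts, bf16_attempts, nvfp4_attempts / bf16_attempts)
```

Whatever the real window and attempt length turn out to be, the ratio between the two precisions stays at about four.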

21 Develop on the desktop, deploy at the edge

Because the same substrate appears in a desktop developer box and an embedded edge module, a clean workflow falls out.

Do the heavy, occasional work (building the corpus, any optional fine-tuning, proving out the skill) on the desktop-class DGX Spark, which can even cluster two units for large jobs. Then deploy the frozen base plus the proven skill artifact, unchanged, to Jetson Thor at the edge for real-time use, and to RTX Spark PCs for individual users. Nothing is re-architected between them; the artifact is portable because the substrate is shared (Figure 11). This mirrors the system’s own split: the nightly teacher loop is desktop-class work; the on-device serving is the edge deployment.

Part VI

Enterprise architecture & economics

A reference architecture, scaling to many domains, the rare case for touching weights, and the ROI and governance story for decision-makers.

22 A reference architecture

Production systems built on these ideas share a recognizable shape: an OpenAI-compatible orchestrator behind a secure edge, with model serving and a retrieval layer, fed by an offline ingestion pipeline.

The request path is familiar to any architect: a client calls a single authenticated entry point; identity and routing are enforced at the edge; an orchestration layer selects an agent and tools; the agent reasons with the served model and grounds its answer via the retrieval layer. A separate, scheduled ingestion pipeline keeps the retrieval index fresh out-of-band. The self-learning loop plugs in as a nightly job against the same serving and retrieval components — it is not a separate system.

For the architect

Everything here maps to controls you already run: an API gateway, OIDC/JWT identity, a service mesh, stateless inference workers, a vector/graph store, and scheduled jobs. The AI-specific parts are the skill artifact, the verifier, and the frozen-test gate — three small additions to an otherwise conventional platform.

23Scaling to many domains

The frozen-base design scales sideways cheaply because the expensive parts are shared and the domain-specific parts are small and isolated (Figure 14).

Figure 14. Many domains on one box. The expensive teacher and encoder are shared; each domain keeps its own executor, memory and skill chain. A new domain needs a corpus and a scorer — not a new model.

The larger teacher model and the embedding encoder are domain-agnostic and idle most of the day, so one instance serves every domain. Each domain keeps its own executor, its own memory collection (no cross-contamination), and its own skill chain and frozen test set — an independent, auditable proof trail. Onboarding a new domain means supplying a corpus and a scorer, declared in one config — no model retraining. Three to four domains fit comfortably on a single 128 GB box; the per-domain marginal cost is a slice of one power-efficient appliance.
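The "corpus and a scorer, declared in one config" idea can be sketched as a small registry. Every name here (`DomainConfig`, `DomainRegistry`, the collection naming scheme, the paths, and the model labels) is hypothetical; the point is only that shared components are declared once while each domain gets its own isolated memory collection.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A scorer maps (model answer, ground truth) to a score between 0 and 1.
Scorer = Callable[[str, str], float]


@dataclass(frozen=True)
class DomainConfig:
    name: str
    corpus_path: str
    scorer: Scorer

    @property
    def memory_collection(self) -> str:
        # One collection per domain, so memories never cross-contaminate.
        return f"memory__{self.name}"


class DomainRegistry:
    """Shared teacher and encoder, declared once; domains added alongside."""

    def __init__(self, teacher: str, encoder: str) -> None:
        self.teacher = teacher
        self.encoder = encoder
        self._domains: Dict[str, DomainConfig] = {}

    def onboard(self, config: DomainConfig) -> None:
        if config.name in self._domains:
            raise ValueError(f"domain already registered: {config.name}")
        self._domains[config.name] = config

    def collections(self) -> List[str]:
        return sorted(c.memory_collection for c in self._domains.values())


def exact_match(answer: str, truth: str) -> float:
    """Simplest possible deterministic scorer: 1.0 on an exact match."""
    return 1.0 if answer.strip() == truth.strip() else 0.0


registry = DomainRegistry(teacher="shared-teacher", encoder="shared-encoder")
registry.onboard(DomainConfig("invoices", "corpora/invoices", exact_match))
registry.onboard(DomainConfig("claims", "corpora/claims", exact_match))
print(registry.collections())
```

Onboarding a third domain is one more `onboard` call; nothing about the teacher, the encoder, or the existing domains changes.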

24When to actually touch the weights

The doctrine is “weights last,” not “weights never.” Chapter 8 told you when “last” arrives: when the limit is genuinely capacity, not adaptation.

Operationalize it with an objective trigger: if held-out improvement stalls below a small threshold for several consecutive nights while the lighter levers are still active, the reversible options have demonstrably flattened — the task may now be capacity-bound. Only then consider a weight update, under strict conditions:

Detachable, never merged — keep any adapter as a separate, hot-swappable file so the frozen base is always recoverable.
Distill from the verified corpus — the teacher generates and the program filters; you are amortizing an already-proven corpus, not inventing capability beyond the small model’s ceiling.
Forgetting gate — promote the adapter only if new-domain gain clears a bar and no prior-domain regression exceeds a tight bound, measured on frozen test sets. The same discipline as Chapter 15, now guarding against forgetting.
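Both the plateau trigger and the forgetting gate reduce to a few comparisons on frozen-test scores. The sketch below is one possible reading of them. The thresholds (`min_gain`, `patience`, `max_regression`) are hypothetical defaults, and "stalls" is interpreted here as every night-over-night gain in the recent window falling below the threshold.

```python
from typing import Mapping, Sequence


def capacity_bound_suspected(
    holdout_scores: Sequence[float],
    min_gain: float = 0.005,
    patience: int = 5,
) -> bool:
    """True if each of the last `patience` nights gained less than `min_gain`."""
    if len(holdout_scores) < patience + 1:
        return False  # not enough history to call a plateau
    recent = holdout_scores[-(patience + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)


def forgetting_gate(
    new_domain_gain: float,
    prior_before: Mapping[str, float],
    prior_after: Mapping[str, float],
    min_gain: float = 0.02,
    max_regression: float = 0.01,
) -> bool:
    """Promote the adapter only if the new domain improves enough and no
    prior domain regresses beyond the bound, all on frozen test sets."""
    if new_domain_gain < min_gain:
        return False
    for domain, before in prior_before.items():
        if before - prior_after[domain] > max_regression:
            return False
    return True


# Illustrative nightly held-out scores that rise, then flatten.
nightly = [0.70, 0.74, 0.77, 0.771, 0.772, 0.772, 0.773, 0.773]
if capacity_bound_suspected(nightly, patience=3):
    promote = forgetting_gate(
        new_domain_gain=0.05,
        prior_before={"invoices": 0.90},
        prior_after={"invoices": 0.895},
    )
    print("promote adapter:", promote)
```

Keeping both checks as plain functions over logged scores means the decision to touch the weights, and the decision to keep the result, can be replayed from the audit log.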

25ROI, governance, and risk

For decision-makers, the frozen-base approach changes the shape of both the cost curve and the risk profile.

Dimension	Cloud fine-tune / API	Frozen-base on the edge
Cost shape	Opex that rises with usage; repricing risk	Capex once; marginal query cost ≈ electricity
Data	Leaves the building	Never leaves; compliance unlock
Improvement	Event-driven; vendor-paced	Compounds nightly on your data
Reversibility	Hard (merged weights)	One step (detach the artifact)
Audit	Opaque weight diff	Versioned artifacts + scored accept/reject log

The through-line of the whole book is a single inversion of the default. The cloud reflex — “if it underperforms, train the weights” — is the wrong primitive on the edge, where reversibility, drift, capacity, and cadence all argue the other way. Freeze the substrate, adapt in context, verify against a frozen baseline, and let throughput compound the gains nightly. Spend capacity — deliberately, and last — only on the problems that truly demand it.

Appendix

Reference

A plain-language glossary of every term introduced, and the sources behind the science.

AGlossary

Weight / parameter: One of the millions/billions of adjustable numbers inside a model. Training sets them; inference reads them.
Training: The process of adjusting weights so the model performs better, using gradient descent + backpropagation.
Inference: Running a trained model to get an output. Weights are read-only; cheap and repeatable.
Gradient descent: The learning algorithm: repeatedly step the weights a little in the most error-reducing direction (“roll downhill”).
Backpropagation: The efficient method that computes, for every weight at once, which way reduces error — by sending the error backward through the network.
Learning rate: The size of each gradient-descent step. Too big overshoots; too small crawls.
Fine-tuning: Continuing to train a pretrained model on your data — it changes the weights, with the risks this book describes.
Catastrophic forgetting: When training on a new task silently degrades previously-learned abilities.
In-context learning: Changing a model’s behavior via what you put in the prompt, without changing weights.
Skill: A versioned, optimized in-context instruction — the unit of learning in the frozen-base design.
RAG (retrieval-augmented generation): Fetching relevant records from a knowledge store and giving them to the model as context, so answers are grounded.
Verifier: A deterministic program that scores a model’s answer 0–1 against ground truth — the reward the loop optimizes.
Frozen holdout: A protected test set the optimizer never sees, used to decide whether a change is genuinely better.
Quantization: Storing weights in fewer bits to shrink and speed up the model (e.g. 16-bit → 4-bit NVFP4).
PLE (per-layer embeddings): A quantization-fragile part of some compact models that must stay high-precision.
Unified memory: Memory shared coherently between CPU and GPU — lets a whole stack stay resident on one device.
Distillation: Training a small model to imitate a larger one — a capacity tool, used only when adaptation has plateaued.

BReferences & further reading

Leutenegger, N. (2026). Untrained CNNs Match Backpropagation at V1: A Systematic RSA Comparison of Four Learning Rules Against Human fMRI. arXiv:2604.16875. Frozen/untrained network matches trained at early stages; local rules beat global; “always include an untrained baseline.”
Leutenegger, N. (2026). Cross-Species RSA Reveals Conserved Early Visual Alignment but Divergent Higher-Area Rankings. arXiv:2605.22401. Higher-level alignment scales with model capacity and data, not the learning rule.
Rumelhart, Hinton & Williams (1986). Learning representations by back-propagating errors. Nature. The original backpropagation paper.
Brown, T. et al. (2020). Language Models are Few-Shot Learners. NeurIPS. In-context learning at scale.
Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. Detachable adapters.
Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314.
NVIDIA (2026). DGX Spark, Jetson Thor, and RTX Spark product documentation. Blackwell, 128 GB unified memory, NVFP4.
Unovie.AI (2026). Training Without Retraining: A Frozen-Base Doctrine for Custom Models on the Edge (companion whitepaper).
