Technical Whitepaper · Edge AI

Training Without Retraining: A Frozen-Base Doctrine for Custom Models on the Edge

Best practices for adapting small models on-device through in-context learning, external memory, and self-verification — motivated by what the learning-rules-and-brain-alignment literature tells us about when weight updates actually help.

Suresh Mandava · [email protected] · Unovie.AI · EdgeAI Context Engineering
Revision 1.0 · June 2026
Hardware target: NVIDIA Jetson Thor · DGX Spark · RTX Spark (Blackwell, 128 GB unified, NVFP4)

Abstract

The instinct when a foundation model underperforms on a niche task is to fine-tune its weights. On the edge — constrained memory, no cloud, regulated data, a need for reversibility — that instinct is usually wrong. We argue for a frozen-base doctrine: keep the pretrained weights fixed and move all adaptation into external, reversible state — a knowledge graph, an optimized in-context skill, and lightweight runtime controllers — closed by a programmatic verifier that lets a small model teach itself overnight without labels, drift, or data egress. We connect this engineering practice to a recent line of computational-neuroscience work showing that a frozen, untrained network can match or exceed a backpropagation-trained one at the representational level, that local, lightweight learning preserves structure that global gradient descent erodes, and that higher-level abstraction is gated by capacity and data rather than by the update rule. Treated as motivation rather than proof, those results sharpen a practical rule of thumb: adapt in context first; touch the weights last, and only when a frozen-baseline evaluation proves you must. We give the supporting hardware practices (PLE-safe NVFP4 quantization, throughput-as-learning-rate, develop-on-DGX-Spark / deploy-on-Thor) and a best-practices checklist.

1 The edge adaptation problem

A capable open model — say a 4-billion-active-parameter instruction-tuned model — rarely fails on the edge because it is too small. It fails because it is generic: it knows your specialty broadly and shallowly, it has no memory of your corpus between calls, and it cannot be sent the one thing that would fix it — your private data.

The reflex is to fine-tune. But on an edge appliance that reflex collides with four hard constraints:

Reversibility. A merged weight update is difficult to audit and to undo. Regulated buyers need to prove that a change can be rolled back instantly and that the trusted base is always one step away.
Catastrophic forgetting & drift. Gradient updates on a narrow corpus quietly degrade unrelated capabilities. Detecting this requires a regression suite the operator rarely has.
Capacity and data. A small on-device model trained on a small in-house corpus is exactly the regime where fine-tuning yields the least and risks the most.
Cost and cadence. Retraining is an event; adaptation needs to be a habit. If improving the model is a quarterly project, it will not keep pace with the domain.

This whitepaper makes the case that the right primitive on the edge is not the gradient step but in-context adaptation closed by verification: the model's behavior is shaped by what it is given at inference time — retrieved context and an optimized instruction — and that context is itself improved, nightly, against a ground-truth checker. The weights stay frozen. Below we first look at why this is more than an engineering convenience: a strand of recent neuroscience-adjacent ML research suggests the frozen substrate is doing more work than we usually credit.

2 What the learning-rules literature suggests

A useful corrective to "more training is always better" comes from work that asks a sharper question: does the learning rule even matter for the representation a network forms? The answer, at least at the lower levels of the hierarchy, is surprisingly often no.

Leutenegger^[1][2] systematically compares five conditions — backpropagation (BP), feedback alignment (FA), predictive coding (PC), spike-timing-dependent plasticity (STDP), and an untrained random-weights baseline — on an identical small convolutional architecture, and scores each against human fMRI and macaque electrophysiology using Representational Similarity Analysis (RSA). Four findings are directly relevant to anyone deciding how to adapt a model:

2.1 Architecture can dominate the update rule

At early visual cortex (V1), the untrained network exceeds the backpropagation-trained one (ρ = 0.076 vs 0.034; Δρ = +0.044, p < 0.001)^[1]. The structural priors of the architecture — local connectivity, pooling, nonlinearity — carry most of the alignment for free; training on a narrow objective actually moved representations away from the target. This echoes older results that random-weight networks already carry non-trivial visual structure^[3][11] and that unsupervised sparse coding alone yields V1-like receptive fields^[4].

2.2 Local, lightweight learning preserves structure that global gradients erode

Among trained rules, the local ones — STDP and predictive coding — produced the highest early-cortex alignment, above backpropagation, in both species^[1][2]. By contrast, feedback alignment (random global feedback) was consistently the worst, despite reaching meaningful task accuracy — task performance and representational fidelity were dissociated throughout. The lesson is not that STDP is the answer; it is that local, structure-preserving updates beat global error signals when the goal is to keep the representation faithful rather than merely to fit a label.

2.3 Higher-level abstraction is gated by capacity and data, not the rule

At the top of the hierarchy (IT cortex), all rules — including no training — converged; no update rule was reliably more brain-like^[1]. A capacity control made the cause explicit: a pretrained ResNet-50 (ImageNet) jumped to ρ ≈ 0.25 at IT versus 0.07–0.14 for every small-CNN condition^[2]. Higher-level, abstract structure is bought with model capacity and training-data richness, not with a cleverer update on a small model.

2.4 The discipline of the untrained baseline

The papers' explicit methodological moral: always include an untrained, same-architecture baseline. Without it, "architecture effects are confounded with learning effects — our data show this confound can be essentially complete at V1"^[1]. Any claim that an adaptation helped is meaningless unless measured against the unadapted model on held-out data.

Table 1 — From the learning-rules literature to edge adaptation practice.
Reported finding	Edge-adaptation principle it motivates
Untrained / frozen architecture matches or beats a trained one at low levels^[1]	Freeze the base. The pretrained substrate already carries most of the signal; default to not retraining it.
Local rules (STDP, PC) preserve structure; global BP can move it away^[1][2]	Adapt locally & reversibly (in-context, controllers) rather than with global gradient descent that risks drift/forgetting.
Higher-area abstraction scales with capacity + data, not the rule^[2]	Diagnose the bottleneck. If a task genuinely needs more abstraction, it is a capacity problem (distill/upgrade), not an adaptation problem.
Task accuracy ≠ representational alignment^[1]	Optimize the right objective. Grade adaptation on grounded task fidelity, not a convenient proxy.
"Always include an untrained baseline"^[1]	Gate on a frozen baseline. Accept a change only if it beats the unadapted model on data it never saw.

An honest caveat. These studies use small-scale convolutional vision models scored against primate visual cortex, with explicitly hedged, sometimes null results (n = 5 conditions; n = 3 fMRI subjects; single datasets)^[1][2]. They are not evidence about large language models, and we cite them as motivation and intuition, not as proof. The engineering case below stands on its own measurements; the neuroscience simply rhymes with it.

3 The frozen-base doctrine

"Self-learning" on the edge does not have to mean "weight-updating." We define it as swappable external state that measurably improves a held-out domain metric over time — each store operating at a different timescale, none of them touching the base weights.

Figure 1 — The frozen-base adaptation stack. External state (L0–L2) conditions a frozen model at inference time; the L3 weight adapter is detachable and parked.

Concretely, adaptation lives in four layers, only the last of which involves gradients — and that one is held in reserve:

Table 2 — The adaptation stack. Weights stay frozen for layers 0–2.
Layer	What changes	Timescale	Weights?
L0 · Knowledge	A typed knowledge graph + vector index, grown by continuous ingestion of the domain corpus.	per-ingest, continuous	No
L1 · Skill	A plain-language skill document (the in-context instruction the model follows), rewritten when a better version is proven.	nightly batch	No
L2 · Behavior	Lightweight per-request runtime controllers that steer the frozen model on hard cases.	per-request	No
L3 · Weights (parked)	A detachable, never-merged low-rank adapter, used only after the reversible levers plateau.	on plateau	Yes (revertible)

Principle 1

Keep the base model byte-for-byte frozen. Put adaptation in external state that can be inspected, diffed, and reverted in one step. The trusted base is always one detach away.

This is the engineering analog of §2.1–2.2: the frozen substrate carries the heavy lifting, and the cheap, local, reversible stores do the domain-specific shaping. Crucially, every change is an artifact — a graph delta, a versioned skill file, a controller blob — not an opaque shift in billions of parameters. Auditability and reversibility are properties of the design, not bolt-ons.

4 In-context learning as the primary lever

In-context learning^[12] — shaping behavior through what the model is shown at inference time rather than through its weights — is the edge-appropriate adaptation mechanism. Three surfaces carry it.

4.1 The skill document is a learned artifact

The system prompt is not boilerplate; it is the procedure the model executes, and it is the thing we optimize. A skill document begins as a short seed and is grown by the loop (§5) into a precise, failure-aware specification of how to perform the domain task. Because it is text, it is human-readable, version-controlled, and graftable: the same skill that scored highest last night is exactly what production loads today.

Principle 2

Treat the in-context instruction as the unit of learning. Version it, score it on a frozen holdout, and promote it only when it beats the incumbent. The "trained model" you ship is a frozen base plus a proven skill artifact.
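Principle 2 can be sketched as a small append-only store. This is a minimal illustration, not the production implementation: the names `SkillVersion`, `SkillStore`, `promote`, and `revert`, and the `min_lift` default of 0.01, are all hypothetical, and scores are assumed to lie in [0, 1].

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SkillVersion:
    version: int
    text: str             # the plain-language skill document
    holdout_score: float  # score on the frozen holdout, assumed in [0, 1]


@dataclass
class SkillStore:
    """Append-only chain of skill documents; the last entry is the champion."""
    chain: list = field(default_factory=list)

    def current(self):
        return self.chain[-1]

    def promote(self, text, holdout_score, min_lift=0.01):
        """Commit a candidate only if it beats the incumbent by min_lift."""
        if self.chain and holdout_score - self.current().holdout_score < min_lift:
            return False  # rejected: the champion stays in place
        self.chain.append(SkillVersion(len(self.chain) + 1, text, holdout_score))
        return True

    def revert(self):
        """Drop the newest version; the seed version is never removed."""
        if len(self.chain) > 1:
            self.chain.pop()
        return self.current()


store = SkillStore()
store.promote("Extract the invoice fields.", holdout_score=0.62)   # seed
accepted = store.promote("Extract the invoice fields; normalize dates.", 0.71)
rejected = store.promote("Extract the fields, verbosely.", 0.712)  # sub-noise lift
```

Because every version is an immutable text record with its score attached, the chain doubles as the audit trail: what production loads is simply `store.current().text`.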

4.2 Retrieval-grounded context (L0)

The second surface is what the model retrieves. A continuous knowledge-graph memory turns the private corpus into typed, queryable structure; at request time the relevant subgraph and supporting chunks are composed into the prompt. This grounds answers in source records and — unlike weight memorization — lets the knowledge change without touching the model. The same memory stack also supplies the reward for the loop (schema conformance, entity resolution, retrieval grounding), so improving the knowledge layer and improving the skill are coupled through one metric.
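As a rough illustration of composing a retrieved subgraph into the prompt, the sketch below stores the graph as source-tagged triples and expands outward from seed entities. The `Triple` layout, the hop-based expansion, and the prompt template are assumptions made for the example; the paper does not describe the actual graph schema or retrieval strategy, and the vector index is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    source: str  # id of the source record that supports this fact


def retrieve_subgraph(triples, entities, hops=1):
    """Return the triples reachable within `hops` of the seed entities."""
    frontier, seen, picked = set(entities), set(entities), []
    for _ in range(hops):
        reached = set()
        for t in triples:
            if t in picked:
                continue
            if t.subject in frontier or t.obj in frontier:
                picked.append(t)
                reached.update({t.subject, t.obj} - seen)
        seen |= reached
        frontier = reached
    return picked


def compose_prompt(skill, facts, question):
    """Skill document first, then source-tagged facts, then the request."""
    lines = [f"- {t.subject} {t.relation} {t.obj} [{t.source}]" for t in facts]
    return f"{skill}\n\nFacts:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


graph = [
    Triple("pump-7", "located_in", "plant-A", "rec-12"),
    Triple("plant-A", "operated_by", "Acme", "rec-3"),
    Triple("valve-2", "located_in", "plant-B", "rec-9"),
]
one_hop = retrieve_subgraph(graph, {"pump-7"})
two_hop = retrieve_subgraph(graph, {"pump-7"}, hops=2)
prompt = compose_prompt("Answer only from the facts.", two_hop, "Who operates pump-7's plant?")
```

Keeping the source id on every fact is what lets a grounding check later confirm that an answer traces back to a real record.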

4.3 Runtime behavioral controllers (L2)

The third surface is the lightest gradient-free form of "local learning" in the literature's sense (§2.2): tiny per-request controllers (kilobytes, composable, fitted from already-verified examples) that nudge the frozen model's activations on hard or ambiguous cases. They attach and detach like a setting, never alter the base, and are the on-device echo of "local, structure-preserving adaptation beats global retraining."
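The text does not say how the controllers are fitted or applied. One common gradient-free construction is a difference-of-means steering vector, and the sketch below assumes that construction purely for illustration. The function names are hypothetical, and plain lists stand in for real activation tensors.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_controller(good_states, bad_states):
    """Direction pointing from failed activations toward verified-good ones."""
    good, bad = mean_vector(good_states), mean_vector(bad_states)
    return [g - b for g, b in zip(good, bad)]


def apply_controller(hidden, controller, strength=1.0):
    """Return a nudged copy; the input, like the base model, is left untouched."""
    return [h + strength * c for h, c in zip(hidden, controller)]


# Toy activations harvested from verified winners and from failures.
good = [[1.0, 0.0], [3.0, 2.0]]
bad = [[0.0, 1.0], [0.0, 1.0]]
controller = fit_controller(good, bad)

hidden = [0.5, 0.5]
steered = apply_controller(hidden, controller, strength=0.5)
detached = apply_controller(hidden, controller, strength=0.0)
```

Setting `strength` to zero recovers the unmodified activation, which is the sense in which such a controller attaches and detaches like a setting.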

5 Closing the loop: the self-learning cycle

In-context adaptation only compounds if there is a closed loop that improves the context automatically. The engine is a nightly cycle with a programmatic verifier at its center: the model tries, the verifier scores, a larger model reflects on the failures, and a gate commits the change only if it provably helps.

Figure 2 — The nightly self-learning loop. A programmatic verifier scores every rollout; verified winners feed the controllers, failures drive reflection, and a frozen-holdout gate commits only proven gains.

# one nightly cycle — fully unattended, fully local
champion   = skill_store.current()
tasks      = load(train_split)

# 1) Rollout: N samples per task from the frozen executor, each scored by the verifier
per_task   = [[verify(t, sample) for sample in rollout(executor, champion, t, n=N)]
              for t in tasks]

# 2) Harvest: keep the best attempt per task; verified winners become free,
#    self-labeled data (feeds L2)
best       = [max(attempts, key=lambda s: s.score) for attempts in per_task]
winners    = [s for s in best if s.score >= WIN_THRESHOLD]   # top-1/task over the bar
failures   = [s for s in best if s.score <  WIN_THRESHOLD]   # best honest attempt

# 3) Baseline: score the CURRENT skill on a frozen, fingerprinted holdout
baseline   = score_on_holdout(champion, holdout)

# 4) Reflect: a larger model reads <=K failures, proposes textual patches (no scoring)
patches    = reflector.reflect(champion, failures[:K])     # strict JSON, <=3 edits
candidate  = apply_patches(champion, patches)

# 5) Gate: re-score on the SAME holdout; commit iff it clears the noise floor
lift       = score_on_holdout(candidate, holdout) - baseline
if lift >= MIN_LIFT:
    skill_store.commit(candidate)        # new champion vNNNN
else:
    keep(champion)                        # auto-revert; log the rejection

5.1 The verifier is the reward

The make-or-break component is the checker. A graded, deterministic, programmatic verifier — for an extraction task, the fraction of gold fields recovered after normalization, scored in [0,1] — replaces both human labeling and LLM-as-judge. Because the reward is code, the loop runs fully unattended and fully on-box. The larger "reflector" model writes patches but never scores; scoring stays mechanical. With a deterministic checker you also get a data engine for free: best-of-N rejection sampling harvests verified-correct trajectories that feed the controllers.
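A graded extraction verifier of the kind described can be very small. The sketch below is a hedged example: the normalization rules (case folding, whitespace collapsing, trimming edge punctuation) and the field names are assumptions, and a real domain would substitute its own.

```python
import re


def normalize(value):
    """Case-fold, collapse whitespace, and trim punctuation at the edges."""
    text = re.sub(r"\s+", " ", str(value)).strip().lower()
    return text.strip(" .,;:")


def verify(gold, predicted):
    """Fraction of gold fields recovered after normalization, in [0, 1]."""
    if not gold:
        return 1.0
    hits = sum(
        1 for key, value in gold.items()
        if key in predicted and normalize(predicted[key]) == normalize(value)
    )
    return hits / len(gold)


gold = {"vendor": "Acme Corp.", "total": "1200.00", "currency": "USD", "po": "A-17"}
pred = {"vendor": "  acme corp", "total": "1200.00", "currency": "usd"}
score = verify(gold, pred)  # three of the four gold fields are recovered
```

The function is deterministic and has no model in the loop, so the same input always earns the same reward and the scorer can run unattended.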

Principle 3

Make the reward a graded program, not a judgment. No labels, no LLM-as-judge in the gate. Keep generation and evaluation strictly separate so the optimizer cannot grade its own homework.

5.2 Two failure modes a programmatic gate must handle

Goodharting. Optimize exactly what you check and the model finds schema-valid garbage. Mitigate with layered checks that must all pass (conformance, entity match, and grounding) and with periodic spot-checks.
Sparsity. Binary pass/fail starves the reflector. Make the metric graded (fraction of subgoals), so a near-miss is distinguishable from a disaster and the optimizer has a gradient to climb.
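Both mitigations can be combined in one scorer. The sketch below assumes each layer is already a number in [0, 1] and takes the minimum, so the weakest layer caps the reward. Taking the minimum is one possible design choice, not something the text prescribes, and the function names are hypothetical.

```python
def subgoal_fraction(subgoals):
    """Graded score: share of subgoals met, so a near-miss beats a disaster."""
    return sum(1 for ok in subgoals if ok) / len(subgoals)


def layered_score(conformance, entity_match, grounding):
    """Every layer is in [0, 1]; the weakest layer caps the reward."""
    layers = (conformance, entity_match, grounding)
    if not all(0.0 <= x <= 1.0 for x in layers):
        raise ValueError("each layer must be in [0, 1]")
    return min(layers)


# Schema-valid garbage: perfect conformance, but nothing is grounded.
garbage = layered_score(1.0, 0.1, 0.0)

# Near-miss: three of four entities matched, well grounded.
near_miss = layered_score(1.0, subgoal_fraction([True, True, True, False]), 0.9)
```

With a minimum, a candidate cannot buy back a grounding failure with perfect schema conformance, while the graded layers still separate near-misses from failures.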

6 Evaluation discipline

The single most transferable lesson from §2.4 is operational: a change is only "better" relative to the frozen, unadapted baseline, measured on data the optimizer never saw. The loop above bakes that in.

Frozen, fingerprinted holdout. The validation set is content-hashed and verified before every rollout; if a byte drifts, the run aborts. The optimizer can never see or touch it — the on-device equivalent of the literature's untrained-baseline control.
One metric, end to end. The same scorer grades rollouts and the gate, so you optimize exactly what you measure.
A noise floor. A minimum-lift threshold rejects sub-noise "wins"; the gate's job is to say no far more often than yes.
Self-limiting behavior. On a domain already near its ceiling, the correct outcome is a long string of rejections holding the line — a feature, not a failure. A system that cannot quietly make itself worse is exactly what a regulated buyer wants.
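The fingerprint check and the noise-floor gate are each a few lines. This sketch assumes the holdout is a JSON-serializable list and uses SHA-256 over a canonical encoding. The function names and the example threshold are hypothetical.

```python
import hashlib
import json


def fingerprint(examples):
    """SHA-256 over a canonical JSON encoding of the holdout."""
    blob = json.dumps(examples, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def check_holdout(examples, expected):
    """Abort the run if the holdout differs from the pinned fingerprint."""
    if fingerprint(examples) != expected:
        raise RuntimeError("holdout fingerprint mismatch; aborting run")


def gate(baseline, candidate, min_lift):
    """Commit only if the candidate clears the noise floor on the same holdout."""
    return (candidate - baseline) >= min_lift


holdout = [{"id": 1, "gold": {"vendor": "Acme"}}, {"id": 2, "gold": {"vendor": "Birch"}}]
pinned = fingerprint(holdout)   # recorded once, when the holdout is frozen

check_holdout(holdout, pinned)  # passes silently before each rollout
commit_small = gate(baseline=0.70, candidate=0.705, min_lift=0.01)
commit_large = gate(baseline=0.70, candidate=0.72, min_lift=0.01)
```

One way to choose `min_lift` is to re-score the unchanged champion several times and set the threshold above the spread observed; that calibration step is a suggestion rather than something the text specifies.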

7 Edge hardware best practices

The frozen-base loop is bottlenecked by one thing: how many verified rollouts you can run per night. On a bandwidth-bound edge accelerator, throughput is the learning rate, and the hardware choices that set throughput are the ones that set how fast the model improves.

7.1 Throughput is the learning rate

Modern Blackwell-class edge devices share a decisive property: a large pool of unified memory (128 GB, ~273 GB/s) coherently shared by CPU and GPU^[13][14]. That lets the serving model, a larger "teacher" used only during the nightly window, and the memory stack all stay resident at once, with no swapping. Native 4-bit (NVFP4) weights lift decode roughly 4× over bf16 (≈30 → 120 tokens/sec on this class of board): decode is bandwidth-bound, so reading about a quarter of the bytes per token yields about four times the tokens, and that translates directly into roughly 4× more best-of-N experiments per night.
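To make the experiment budget concrete, the arithmetic can be written out. Only the 30 and 120 tokens/sec figures come from the text; the 8-hour window, the 2,000 tokens per rollout, and N = 8 are illustrative assumptions. The estimate also ignores prefill, batching, and verifier time.

```python
def rollouts_per_night(tokens_per_sec, window_hours, tokens_per_rollout):
    """Upper bound on single-stream rollouts that fit in the nightly window."""
    total_tokens = tokens_per_sec * window_hours * 3600
    return int(total_tokens // tokens_per_rollout)


WINDOW_HOURS = 8           # assumed nightly window
TOKENS_PER_ROLLOUT = 2000  # assumed average generation length
N = 8                      # assumed best-of-N width

bf16_rollouts = rollouts_per_night(30, WINDOW_HOURS, TOKENS_PER_ROLLOUT)
nvfp4_rollouts = rollouts_per_night(120, WINDOW_HOURS, TOKENS_PER_ROLLOUT)

bf16_tasks = bf16_rollouts // N    # tasks covered per night at best-of-8
nvfp4_tasks = nvfp4_rollouts // N
```

Under these assumptions the nightly budget grows from 432 to 1,728 rollouts, or from 54 to 216 tasks at best-of-8.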

7.2 PLE-safe quantization

Aggressive quantization is the lever — but not uniformly. Models that achieve a small "effective" footprint via per-layer embeddings (PLE) keep a large, quantization-fragile block of embedding parameters that must stay in higher precision. The discipline: quantize only the active compute weights (attention, FFN) to NVFP4, and keep the PLE tables in bf16. A build that quantizes PLE tensors will degrade output silently. Bake a PLE-safety preflight into the conversion so the build fails on drift rather than shipping a quietly worse model.
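A preflight of this kind can be a plain check over the conversion plan. The sketch below assumes the plan is a mapping from tensor name to target dtype and that PLE tensors can be recognized by a name fragment. The fragment `per_layer_embed` and the tensor names are hypothetical, and a real checkpoint will use its own naming.

```python
PLE_MARKER = "per_layer_embed"  # hypothetical tensor-name fragment


def ple_preflight(plan, marker=PLE_MARKER, required_dtype="bf16"):
    """Fail the build if any PLE tensor is planned for a dtype other than bf16."""
    violations = sorted(
        name for name, dtype in plan.items()
        if marker in name and dtype != required_dtype
    )
    if violations:
        raise ValueError(
            "PLE tensors must stay in " + required_dtype + ": " + ", ".join(violations)
        )
    return True


safe_plan = {
    "layers.0.attn.q_proj.weight": "nvfp4",
    "layers.0.ffn.up_proj.weight": "nvfp4",
    "layers.0.per_layer_embed.weight": "bf16",
}
ok = ple_preflight(safe_plan)
```

Raising instead of warning is deliberate: the failure it guards against is silent, so the build should stop loudly.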

Principle 4

Quantize for speed, but precision-protect the fragile parts. NVFP4 the compute path; keep per-layer embeddings in bf16. Validate the conversion automatically, and never serve the model on a runtime that silently drops PLE.

7.3 Develop on the desktop, deploy at the edge

The same Blackwell + 128 GB + NVFP4 substrate appears in three form factors, and they are complementary rather than competing:

Table 3 — One substrate, three roles. The trained skill artifact is portable across all three.
Device	Role	Strength
DGX Spark (GB10, desktop)	Develop & fine-tune	Highest local LLM throughput; cluster two units (ConnectX-7) for ~405B-class work^[13].
Jetson AGX Thor (embedded)	Deploy at the edge	Real-time multimodal inference at 40–130 W; sensor fusion + robotics stacks^[14].
RTX Spark (Windows PC)	Personal agents	120B-class local models on laptops/mini-PCs; broad OEM reach^[15].

Because the frozen base, the NVFP4 weights, and the skill artifact are identical across them, the natural workflow is develop and fine-tune on DGX Spark → deploy on Thor → reach every desk on RTX Spark, with the learned artifact moving between them unchanged.

Figure 3 — One substrate, two phases. Develop and fine-tune on DGX Spark; the frozen base plus the proven skill artifact deploy unchanged to Thor at the edge and RTX Spark on PCs.

8 Scaling to many domains

The frozen-base design scales sideways cheaply because almost everything expensive is shared.

Share the brain. The larger reflector and the embedding encoder are domain-agnostic and idle most of the day; one shared instance works through every domain's nightly reflection in sequence.
Isolate the expertise. Each domain gets its own executor, its own memory collection (no cross-domain pollution), and its own skill chain and frozen holdout — an independent proof trail.
Onboard cheaply. A new domain needs a corpus and a [0,1] scorer, declared in one config. No model retraining; the per-domain marginal cost is a slice of one power-efficient box.

Three to four domains fit comfortably on a single 128 GB device (one ~16 GB reflector + several ~18 GB executors), which is the practical sweet spot before time-sharing the executor.
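The sizing claim is simple arithmetic once a reserve is set aside. In the sketch below, the 128 GB, ~16 GB, and ~18 GB figures come from the text. The reserve for the OS, the memory stack, and KV cache is an assumption, and it is the term that moves the answer between three and four.

```python
def max_domains(total_gb=128, reflector_gb=16, executor_gb=18, reserve_gb=24):
    """Executors that fit beside one shared reflector and a fixed reserve."""
    available = total_gb - reflector_gb - reserve_gb
    return max(0, available // executor_gb)


with_small_reserve = max_domains(reserve_gb=24)  # (128 - 16 - 24) // 18
with_large_reserve = max_domains(reserve_gb=48)  # (128 - 16 - 48) // 18
no_reserve = max_domains(reserve_gb=0)           # theoretical ceiling only
```

With no reserve, six executors would fit on paper. The three-to-four figure corresponds to leaving roughly 24 to 48 GB of headroom under these assumptions.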

9 When to actually touch the weights

The doctrine is "weights last," not "weights never." §2.3 tells us when "last" arrives: when the limit is genuinely capacity, not adaptation.

Operationalize it with an objective trigger. If held-out lift stays under a small threshold (e.g. <0.5%) for several consecutive nights while the runtime layers are still churning, the reversible levers have demonstrably flattened — the task may now be capacity-bound. Only then consider a weight update, under strict conditions:

Detachable, never merged. Keep any low-rank adapter^[16][17] as a separate, hot-swappable file so the frozen base is always recoverable.
Distill from the verified corpus. The teacher generates and the programmatic gate filters; you are amortizing an already-verified corpus into the base, not inventing capability beyond the student's ceiling.
PLE-safe training. Put adapters on attention/FFN projections only; never quantize or train the per-layer embeddings.
Forgetting-regression gate. Promote the adapter only if new-domain lift clears a bar and no prior-domain regression exceeds a tight bound, measured against frozen, never-trained holdouts — the same baseline discipline as §6, now guarding against catastrophic forgetting.
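The plateau trigger and the forgetting-regression gate can both be expressed as small predicates. The sketch assumes scores in [0, 1], so 0.5% becomes 0.005. Reading "several consecutive nights" as five, and the `min_lift` and `max_regression` bounds, are illustrative assumptions. The condition that the runtime layers are still churning is not modeled.

```python
def plateaued(nightly_lifts, threshold=0.005, nights=5):
    """True when the last `nights` held-out lifts all stayed under threshold."""
    recent = nightly_lifts[-nights:]
    return len(recent) == nights and all(lift < threshold for lift in recent)


def promote_adapter(new_domain_lift, prior_domain_deltas,
                    min_lift=0.02, max_regression=0.01):
    """Promote only if the new domain improves and no prior domain regresses
    by more than max_regression on its frozen holdout."""
    no_forgetting = all(
        delta >= -max_regression for delta in prior_domain_deltas.values()
    )
    return new_domain_lift >= min_lift and no_forgetting


history = [0.03, 0.001, 0.002, 0.0, 0.004, 0.003]
consider_weights = plateaued(history)

safe = promote_adapter(0.05, {"legal": -0.002, "billing": 0.0})
forgetful = promote_adapter(0.05, {"legal": -0.03, "billing": 0.0})
```

Rejecting the adapter costs nothing here: it was never merged, so declining to promote it leaves the frozen base exactly as it was.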

Principle 5

Ask weights for help only after a frozen-baseline evaluation proves the reversible levers are exhausted — and even then, as a detachable adapter behind a forgetting-regression gate, never a merge.

10 Best-practices checklist

Table 4 — A frozen-base edge-adaptation checklist.
#	Practice	Why
1	Freeze the base by default; adapt in external state (knowledge, skill, controllers).	Reversibility, auditability, no drift; the substrate already carries most of the signal (§2.1, §3).
2	Make the in-context skill the unit of learning — versioned, scored, promoted.	Human-readable, graftable, provable adaptation without weight changes (§4.1).
3	Ground answers via retrieval from a continuous knowledge graph that also supplies the reward.	Knowledge changes without touching the model; couples L0 and L1 through one metric (§4.2).
4	Prefer local, reversible adaptation (controllers) over global gradient descent.	Local updates preserve structure that global error signals erode (§2.2, §4.3).
5	Close the loop with a graded, programmatic verifier; generation ≠ evaluation.	Unattended, label-free, Goodhart-resistant improvement (§5).
6	Gate on a frozen, fingerprinted holdout with a noise-floor minimum lift; auto-revert.	The untrained-baseline discipline; a system that can't quietly get worse (§2.4, §6).
7	Make throughput the priority: NVFP4 compute, unified memory, dual-path serving.	On a bandwidth-bound edge board, tokens/sec is the experiment budget (§7.1).
8	PLE-safe quantization with an automated preflight; never a PLE-dropping runtime.	Protects the quantization-fragile embeddings; avoids silent degradation (§7.2).
9	Develop on DGX Spark, deploy on Thor; keep the artifact portable.	Same substrate, different roles; one workflow from prototype to edge (§7.3).
10	Touch weights only on an objective capacity trigger, as a detachable adapter behind a forgetting gate.	Abstraction is capacity-bound, not rule-bound; preserve reversibility (§2.3, §9).

The through-line is a single inversion of the default. The cloud reflex — "if it underperforms, train the weights" — is the wrong primitive on the edge, where reversibility, drift, capacity, and cadence all argue the other way. Freeze the substrate, adapt in context, verify against a frozen baseline, and let throughput compound the gains nightly. The learning-rules literature gives that engineering bet a satisfying second reading: the frozen architecture was doing more of the work than we assumed, local adaptation keeps the representation honest, and the cases that truly need more — abstraction — are the ones to spend capacity on, deliberately, last.

11 References

[1] Leutenegger, N. (2026). Untrained CNNs Match Backpropagation at V1: A Systematic RSA Comparison of Four Learning Rules Against Human fMRI. arXiv:2604.16875v2. Frozen/untrained CNN exceeds BP at V1; PC/STDP > BP; convergence at IT; "always include an untrained baseline."
[2] Leutenegger, N. (2026). Cross-Species RSA Reveals Conserved Early Visual Alignment but Divergent Higher-Area Rankings Across Human fMRI and Macaque Electrophysiology. arXiv:2605.22401v1. Replicates early-area pattern in macaque; ResNet-50 capacity control shows IT alignment scales with capacity + data.
[3] Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., & Ng, A. Y. (2011). On random weights and unsupervised feature learning. ICML.
[4] Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381:607–609.
[5] Whittington, J. C. R., & Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive-coding network with local Hebbian plasticity. Neural Computation, 29:1229–1262.
[6] Lillicrap, T. P., Cownden, D., Tweed, D. B., & Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276.
[7] Bi, G.-q., & Poo, M.-m. (1998). Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18:10464–10472.
[8] Schrimpf, M., Kubilius, J., Hong, H., et al. (2020). Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv.
[9] Yamins, D. L. K., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19:356–365.
[10] Kriegeskorte, N., Mur, M., & Bandettini, P. (2008). Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4.
[11] Truzzi, A., & Cusack, R. (2025). Neural responses in early visual cortex are well predicted by random-weight CNNs. bioRxiv.
[12] Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. NeurIPS. (In-context learning.)
[13] NVIDIA (2026). DGX Spark - Personal AI Supercomputer (GB10 Grace Blackwell). nvidia.com/products/workstations/dgx-spark. 128 GB LPDDR5X unified, ~273 GB/s; ConnectX-7 clustering.
[14] NVIDIA (2026). Jetson Thor - Advanced AI for Physical Robotics. nvidia.com/autonomous-machines/embedded-systems/jetson-thor. Blackwell, ~2070 FP4 TFLOPS, 40–130 W, 128 GB unified.
[15] NVIDIA (2026). RTX Spark - Slim Laptops & Small Desktops. nvidia.com/products/rtx-spark. Blackwell RTX + Arm; up to 128 GB unified; 120B-class local models.
[16] Hu, E. J., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
[17] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv:2305.14314.