Empirical training discipline, per-cohort evaluation, and CPU-fallback inference architecture (Paper B — synopsis)
2026-05-03
Paper A described a sovereign-record architecture under which content data, governance authority, and federation envelopes remain inside the community that owns them. Paper B reports the empirical companion: a per-tenant situated language layer (SLL) — a community-scoped small language model trained on the tenant’s own content, governed by the tenant’s own authority, and operated on infrastructure inside the tenant’s own jurisdictional reach. The SLL is the runtime cognition layer of the sovereign-record platform; it is what lets a member ask the system a question and receive an answer drawn from the community’s own corpus rather than from a frontier model trained on a global corpus the community had no part in shaping. This synopsis outlines the training discipline (five operating rules established empirically over thirteen training-and-evaluation cohorts), the most consequential ablation finding (nine weight-modification experiments showing uniform accuracy degradation, motivating a strict no-weight-modification stance), the five Tier-1 cohorts deployed in production, the four Tier-2 cohorts designated and operationally paused, and the CPU-fallback inference architecture that keeps the runtime pathway entirely outside US-controlled infrastructure. The full empirical paper — with per-cohort evaluation tables, ablation-result detail, and inference-routing telemetry — is deferred to a separate session with verified training-run data; this synopsis sketches its shape so Paper A’s forward references resolve to a real outline rather than a placeholder.
Keywords: situated language layer, small language model, minority-language NLP, Indigenous data sovereignty, per-tenant model, CPU fallback inference, training discipline, weight-modification ablation, Tier-1 cohort, non-US sovereign GPU.
Paper A’s sovereign-record architecture preserves community sovereignty over data; it does not by itself preserve community sovereignty over cognition. A community using a language model to mediate member queries is still using a language model: the model’s training corpus, training discipline, and runtime behaviour are part of the architecture’s surface, not separate from it. A frontier model trained on a global corpus shaped by no community in particular cannot answer a Welsh query in Welsh-community register, a Māori query under tikanga, or a Sámi query against the language module’s revitalisation lexicon — not without overwriting the community’s authority with the corpus the model was trained on.
Paper B reports the empirical work that makes the SLL trustworthy for community use: a per-tenant-type small language model trained on the tenant’s own corpus, with strict training discipline, runtime hosting on tenant-controlled or community-trusted infrastructure, and a CPU-fallback path that keeps inference inside the tenant’s jurisdictional reach. The relationship to Paper A is concrete: where Paper A makes the data substrate sovereign, Paper B reports the operating discipline that makes the cognition layer match.
Five rules govern the training of every situated language layer cohort. Each is empirically derived from the project’s thirteen training-and-evaluation cohorts to date; each is documented in the project’s operating discipline.
No correction pairs. Synthetic correction pairs (a prompt paired with a corrected answer, derived from an observed prompt-and-bad-answer pair) measurably degrade performance under the project’s evaluation protocol. The mechanism is well known: correction pairs introduce a bias surface that the model overfits whenever the correction examples sit far from the natural distribution. The project uses steering vectors (sparse activations applied at inference time) for behavioural correction; the training corpus is left alone.
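The mechanics of inference-time steering can be illustrated with a short sketch. Everything concrete here is an assumption for illustration: the model path, injection layer, scale, and vector file are placeholders, and the sparsity handling of the project’s actual vectors is elided.

```python
# Minimal sketch of inference-time activation steering via a forward hook,
# assuming a Hugging Face transformers causal LM. Placeholders throughout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/qwen2-14b-base"  # placeholder for the project's 14B Qwen2 base
LAYER = 20                        # hypothetical injection layer
SCALE = 4.0                       # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# Hypothetical pre-computed steering direction, shape (hidden_size,).
steer = torch.load("steering_vector.pt")

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add the scaled steering direction at every token position.
    hidden = output[0] + SCALE * steer.to(output[0].device, output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("He aha te kaupapa o te hui?", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=128)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # weights are never touched; only activations were steered
```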
No deduplication of training data. Apparent duplicates in the corpus are reinforcement, not redundancy. The project’s cohorts are trained on naturally occurring content with native repetition (canonical lines from foundational texts; repeated tikanga formulae; recurrent governance phrasings); deduplicating these flattens the corpus’s own emphasis structure.
No FAQ answers without codebase verification. Every FAQ-layer addition is anchored to a verifiable repository artefact (a constitution clause; a documented decision; a referenced source in the corpus). FAQ layering is a proven extension path; aspirational FAQ entries are forbidden.
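A minimal sketch of an anchoring check of this kind, assuming each FAQ entry carries an anchor field naming a repository artefact. The field names, JSON layout, and repository path below are illustrative, not the project’s schema.

```python
# Sketch of a codebase-verification gate for FAQ-layer additions: an entry
# passes only if its anchor names a repository artefact that actually exists.
import json
from pathlib import Path

REPO = Path("tenant-repo")  # hypothetical tenant repository checkout

def verify_faq(faq_path: Path) -> list[str]:
    """Return the ids of FAQ entries whose anchor is missing from the repo."""
    failures = []
    for entry in json.loads(faq_path.read_text()):
        anchor = entry.get("anchor", "")
        # Aspirational entries (no anchor, or a dangling one) are rejected.
        if not anchor or not (REPO / anchor).is_file():
            failures.append(entry["id"])
    return failures

if __name__ == "__main__":
    bad = verify_faq(Path("faq.json"))
    if bad:
        raise SystemExit(f"unanchored FAQ entries: {bad}")
```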
No model-weight modifications. Nine weight-modification experiments — covering LoRA fine-tuning, full fine-tuning, layer-wise distillation, instruction tuning with RLHF-style preference data, and several combination protocols — all measurably degraded accuracy on the project’s evaluation set relative to the base model with FAQ layering and steering vectors applied. The result is uniform across the experiments: weight modification, however applied, did not improve accuracy and consistently reduced it. The project’s stance is consequently strict: weight modification is ruled out as an extension path; FAQ layering and governance packs are the only proven paths.
No aspirational training (Tier-2 trigger discipline). A Tier-2 cohort is not commissioned until the first tenant of that type is in deployment. The motivation is twofold: training without an actual deployment tenant grounds the corpus in speculation rather than the community’s own content, and burning training cycles on aspirational cohorts redirects scarce GPU time away from what is actually in production.
The nine weight-modification experiments are the project’s most empirically consequential finding. The protocol was uniform: a candidate modification was applied to the community-v1 14B Qwen2 base; the modified model was evaluated against the project’s standing evaluation set (questions drawn from the active tenant corpora plus a held-out corpus of community-relevant out-of-distribution questions); the result was scored against the unmodified base with FAQ layering and steering vectors applied. The candidate modifications spanned LoRA fine-tuning at multiple rank settings; full fine-tuning at multiple learning rates; layer-wise distillation from a larger model; instruction tuning with paired-preference data; and several combination protocols.
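The shape of that protocol can be sketched in a few lines. The scorer, candidate registry, and evaluation-set format below are placeholders for illustration; the project’s actual harness and per-experiment numbers are deferred to the full paper.

```python
# Minimal sketch of the uniform ablation protocol: every candidate
# modification is scored on the same standing evaluation set and reported
# relative to the unmodified base with FAQ layering and steering applied.
from dataclasses import dataclass
from typing import Callable

AnswerFn = Callable[[str], str]  # question -> model answer

@dataclass
class AblationResult:
    name: str
    accuracy: float
    delta_vs_baseline: float  # negative means the candidate degraded

def run_ablation(
    candidates: dict[str, AnswerFn],     # e.g. {"lora-r16": ..., "full-ft": ...}
    baseline: AnswerFn,                  # base model + FAQ layering + steering
    eval_set: list[tuple[str, str]],     # (question, reference answer) pairs
    score: Callable[[str, str], float],  # per-item score in [0.0, 1.0]
) -> list[AblationResult]:
    def accuracy(fn: AnswerFn) -> float:
        return sum(score(fn(q), ref) for q, ref in eval_set) / len(eval_set)

    base_acc = accuracy(baseline)
    results = []
    for name, fn in candidates.items():
        acc = accuracy(fn)
        results.append(AblationResult(name, acc, acc - base_acc))
    return results
```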
Every candidate degraded accuracy relative to the FAQ-layered base. Degradation was not catastrophic in any single case but was consistent in direction: the modified models hallucinated more on out-of-corpus questions, refused less reliably when asked questions outside the corpus, and returned lower-quality citations when grounded responses were requested. The conclusion the project drew is conservative: until a weight-modification protocol is identified that demonstrably improves accuracy on the project’s evaluation set, no weight-modification path is taken. The full paper will report the per-experiment results, evaluation-set composition, and methodology in detail; this synopsis records the headline result and its operational consequence.
Five Tier-1 cohorts are deployed in production at the time of writing:
villageai-14b-whanau-v1 — Māori extended-family contexts; whānau and (interim) governance configurations.
villageai-14b-episcopal-v1 — Anglican-communion parish contexts.
villageai-14b-community-v1 — generic community fallback (the 14B Qwen2 base).
villageai-14b-family-v1 — family-history contexts.
villageai-14b-business-v1 — small-business member-directory and operational-dashboard contexts.
Each cohort is trained on its tenant-type’s corpus with the operating discipline of §2. Per-cohort accuracy on tenant-content questions, refusal rate on out-of-corpus questions, citation discipline, and qualitative tikanga-respecting register evaluation are reported in the full paper. Four Tier-2 cohorts (conservation, diaspora, clubs, alumni) are designated but operationally paused per the no-aspirational-training discipline.
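For illustration, cohort selection reduces to a registry lookup with the generic community model as fallback. The model names below mirror the list above; the tenant-type keys are assumptions.

```python
# Hypothetical cohort registry: tenant type -> deployed Tier-1 model id,
# with the generic community-v1 model as the documented fallback.
TIER1 = {
    "whanau": "villageai-14b-whanau-v1",
    "episcopal": "villageai-14b-episcopal-v1",
    "family": "villageai-14b-family-v1",
    "business": "villageai-14b-business-v1",
}
FALLBACK = "villageai-14b-community-v1"

def select_cohort(tenant_type: str) -> str:
    """Resolve a tenant type to its cohort, falling back to the generic base."""
    return TIER1.get(tenant_type, FALLBACK)
```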
An episcopal-v2 retrain was attempted and did not improve over v1 on the project’s evaluation set; v2 is not deployed. The negative result is recorded in the project’s operating discipline; the full paper will treat it alongside the nine weight-modification ablations as further evidence that the current FAQ-layering-plus-steering-vectors path sits at a local accuracy maximum for the cohort.
Runtime inference is hosted on a New Zealand-sovereign A6000 GPU at Catalyst Cloud during business hours (08:00–20:00 NZST), with automatic failover to a non-US home eGPU (RX 7900 XTX) outside business hours. A CPU-fallback path is available for low-load periods; latency is higher but throughput remains adequate for the platform’s request profile. No request to a US-controlled inference endpoint is in the production request path. The vendor-prohibition rule that governs all platform infrastructure (no US-owned cloud, SaaS, or managed-AI service in the production request path) extends to the inference layer. Per-request routing decisions are logged with the cohort selected, the policy-gate verdict, and the upstream-service health at request time, providing the audit surface that GDPR Article 22 (automated decision-making) and Te Tiriti Article 2 (taonga protection over the community’s data and its mediation) jointly require.
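A minimal sketch of a routing decision of this shape, with the audit-log fields named above. Endpoint names, the health probe, and the log schema are placeholders; the production router and its telemetry are reported in the full paper.

```python
# Time-of-day routing with failover and per-request audit logging.
# All endpoint URLs are hypothetical; none of them is US-controlled.
import json
import logging
from datetime import datetime
from zoneinfo import ZoneInfo

log = logging.getLogger("inference-router")

ENDPOINTS = {
    "catalyst_a6000": "https://inference.catalyst.example.nz",  # NZ-sovereign GPU
    "home_egpu": "https://egpu.example.nz",                     # non-US home eGPU
    "cpu_fallback": "https://cpu.example.nz",                   # CPU path
}

def healthy(name: str) -> bool:
    return True  # placeholder health probe against the endpoint registry

def route(cohort: str, policy_verdict: str) -> str:
    """Choose an inference endpoint and log the decision for the audit trail."""
    now = datetime.now(ZoneInfo("Pacific/Auckland"))
    # Business hours (08:00-20:00) prefer the NZ-sovereign A6000; outside
    # them the home eGPU leads, with the CPU path as the final fallback.
    order = (["catalyst_a6000", "home_egpu", "cpu_fallback"]
             if 8 <= now.hour < 20
             else ["home_egpu", "catalyst_a6000", "cpu_fallback"])
    health = {name: healthy(name) for name in order}
    chosen = next((n for n in order if health[n]), "cpu_fallback")
    # Per-request routing log: cohort selected, policy-gate verdict,
    # upstream-service health at request time.
    log.info(json.dumps({
        "ts": now.isoformat(),
        "cohort": cohort,
        "policy_verdict": policy_verdict,
        "endpoint": chosen,
        "upstream_health": health,
    }))
    return ENDPOINTS[chosen]
```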
The full paper will report: per-cohort accuracy on standing evaluation sets; refusal discipline (rate of correct refusal-by-default for out-of-corpus questions); citation discipline (rate of zero-citation responses caught by the post-hoc citation safety filter); the nine weight-modification ablation results in detail; the episcopal-v2 retrain comparison; tikanga-register evaluation by community-aligned reviewers (where consent permits; otherwise summarised at architectural level); inference latency and throughput on each routing path; and a defence-in-depth analysis of model grounding versus post-hoc citation filtering — the principle that model behaviour and the safety filter operate at distinct layers, that both must be reported, and that the filter alone is never treated as the whole answer to grounding. This synopsis names these dimensions; the full paper reports them.
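As one concrete illustration of the defence-in-depth point, a post-hoc zero-citation filter can be sketched as follows. The citation-marker format and refusal wording are assumptions; the project’s actual filter and its catch rate are reported in the full paper.

```python
# Sketch of a post-hoc zero-citation safety filter: a second layer over
# model grounding, not a substitute for it; both layers are evaluated.
import re

CITATION = re.compile(r"\[(?:src|doc):[^\]]+\]")  # hypothetical citation marker

def filter_response(answer: str, grounding_requested: bool) -> str:
    """Refuse by default when a grounded answer arrives with no citations."""
    if grounding_requested and not CITATION.search(answer):
        return ("I can't cite a source in the community corpus for that, "
                "so I won't answer it as fact.")
    return answer
```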
Three limitations bound this paper’s scope. The Tier-2 cohorts are not yet evaluated; their empirical findings will be reported when their first tenants are in deployment. The evaluation protocol is the project’s standing protocol — comparable to other community-scale evaluation efforts but not yet aligned with a published peer-reviewed evaluation benchmark; the full paper will discuss this comparator gap and the literature that informs it. And the empirical findings here are project-internal as of the synopsis date; the full paper will report them with the rigour expected for peer-review submission, including reviewer access to the standing evaluation set under appropriate confidentiality where the corpora include culturally-restricted material.
Paper A and Paper B form a deliberate two-paper split: Paper A covers the architectural substrate; Paper B covers the empirical operating discipline that makes the cognition layer of that substrate trustworthy for minority-language and Indigenous community use. Together they describe a system in which both data and cognition are sovereign by construction. The Tractatus framework paper (already public on Codeberg under Apache 2.0) is the third leg of the triad: development-time governance for the AI assistance that builds the platform itself.
The author is grateful to Lesley Stroh for foundational philosophical mentorship on pluralistic thinking and the question of goodness in artificial intelligence. The pluralistic-deliberation commitment that runs through the platform’s governance architecture — and the wider conviction that an AI substrate worth building must answer to a substantive notion of goodness, not a procedural one — owes its formative shape to those conversations.
The author also acknowledges the community elders, language-revitalisation practitioners, and tenant administrators whose corpora and feedback have shaped the cohorts; specific named acknowledgement awaits direct consent from each individual and is held back here pending that consent.
[A] Stroh, J. G. (2026). Sovereign-Record Architecture for Community-Scale Platforms — Paper A (Review Draft v3, May 2026). My Digital Sovereignty Limited (NZ). Companion paper. Available at agenticgovernance.digital/papers/sovereign-record-architecture-v3-may-2026.html (English, te reo Māori, Deutsch).
[T] Stroh, J. G. (2026). Tractatus Framework — Architectural Patterns for AI Development Governance, Working Paper v0.2. codeberg.org/mysovereignty/tractatus-framework. Apache 2.0.
Detailed empirical references — including the Qwen2 base-model citation, evaluation-protocol citations, weight-modification-method literature (LoRA, full fine-tuning, layer-wise distillation, RLHF-style preference data), Indigenous-language NLP literature, low-resource translation literature, situated-dialogue literature, and ethics-of-language-model literature — are deferred to the full Paper B and to the Step-F literature scan planned for the project’s wider documentation pass.
Corresponding author: John G. Stroh, Director, My Digital Sovereignty Limited (NZ). ORCID: 0009-0005-2933-7170. Email: john.stroh@mysovereignty.digital.
Licence: Creative Commons Attribution 4.0 International (CC BY 4.0).
Suggested citation: Stroh, J. G. (2026). Situated Language Layers for Minority-Language and Indigenous Communities — Paper B Synopsis. My Digital Sovereignty Limited. Available at agenticgovernance.digital. (Zenodo DOI to be assigned upon expansion to full paper.)
Synopsis status: This is a 2-page synopsis. The full empirical paper is deferred to a separate session with verified training-run data, per-cohort evaluation tables, ablation-result detail, and the comparator literature scan required for peer-review submission. Published in synopsis form to resolve Paper A’s forward references and to share the planned shape of the empirical companion paper.