ARCHITECTURAL ALIGNMENT
Interrupting Neural Reasoning Through Constitutional Inference Gating
A Necessary Layer in Global AI Containment
Abstract
Contemporary approaches to AI alignment rely predominantly on training-time interventions: reinforcement learning from human feedback (Christiano et al., 2017), constitutional AI methods (Bai et al., 2022), and safety fine-tuning. These approaches share a common architectural assumption—that alignment properties can be instilled during training and will persist reliably during inference. This paper argues that training-time alignment, while valuable, is insufficient for existential stakes and must be complemented by architectural alignment through inference-time constitutional gating.
We present the Tractatus Framework as a formal specification for interrupted neural reasoning: proposals generated by AI systems must be translated into auditable forms and evaluated against constitutional constraints before execution. This shifts the trust model from "trust the vendor's training" to "trust the visible architecture." The framework is implemented within the Village multi-tenant community platform, providing an empirical testbed for governance research.
Critically, we address the faithful translation assumption—the vulnerability that systems may misrepresent their intended actions to constitutional gates—by bounding the framework's domain of applicability to pre-superintelligence systems and specifying explicit capability thresholds and escalation triggers. We introduce the concept of Sovereign Locally-trained Language Models (SLLs) as a deployment paradigm where constitutional gating becomes both feasible and necessary.
The paper contributes: (1) a formal architecture for inference-time constitutional gating; (2) capability threshold specifications with escalation logic; (3) validation methodology for layered containment; (4) an argument connecting existential risk preparation to edge deployment; and (5) a call for sustained deliberation (kōrero) as the epistemically appropriate response to alignment uncertainty.
1. The Stakes: Why Probabilistic Risk Assessment Fails
1.1 The Standard Framework and Its Breakdown
Risk assessment in technological domains typically operates through expected value calculations: multiply the probability of an outcome by its magnitude, compare across alternatives, and select the option that maximises expected utility. This framework underlies regulatory decisions from environmental policy to pharmaceutical approval and has proven adequate for most technological risks.
For existential risk from advanced AI systems, this framework breaks down in ways that are both mathematical and epistemic.
1.2 Three Properties of Existential Risk
Irreversibility. Most risks allow for error and subsequent learning; existential risks do not, as there is no second attempt after civilisational collapse or human extinction. Standard empiricism—testing hypotheses by observing what happens—cannot work, so theory and architecture must be right the first time.
Unquantifiable probability. There is no frequency data for existential catastrophes from AI systems. Estimates of misalignment probability vary by orders of magnitude depending on reasonable assumptions about capability trajectories, alignment difficulty, and coordination feasibility. Carlsmith (2022) estimates existential risk from power-seeking AI at greater than 10% by 2070; other researchers place estimates substantially higher or lower. This is not ordinary uncertainty reducible through additional data collection—it is fundamental unquantifiability stemming from the unprecedented nature of the risk.
Infinite disvalue. Expected value calculations multiply probability by magnitude. When the magnitude at stake is effectively unbounded (the permanent foreclosure of all future human potential), even vanishingly small probabilities swamp the calculation, and comparisons between finite alternatives become undefined. The mathematical grounding of conventional cost-benefit analysis fails.
1.3 Decision-Theoretic Implications
These properties suggest that expected value maximisation is not the appropriate decision procedure for existential AI risk. Alternative frameworks include the following; a toy comparison of these decision rules with expected value maximisation follows the list:
Precautionary satisficing (Simon, 1956; Hansson, 2020). Under conditions of radical uncertainty with irreversible stakes, satisficing—selecting options that meet minimum safety thresholds rather than optimising expected value—may be the rational approach.
Maximin under uncertainty (Rawls, 1971). When genuine uncertainty (not merely unknown probabilities) meets irreversible stakes, maximin reasoning—choosing the option whose worst outcome is least bad—provides a coherent decision procedure.
Strong precautionary principle (Gardiner, 2006). The precautionary principle is appropriate when three conditions obtain: irreversibility, high uncertainty, and public goods at stake. Existential AI risk meets all three.
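To make the contrast between these decision rules concrete, the sketch below applies expected value maximisation, maximin, and precautionary satisficing to the same set of hypothetical options. All option names, probabilities, payoffs, and the SAFETY_FLOOR constant are invented for exposition, not estimates drawn from this paper or the literature; with a genuinely unbounded loss the expected value calculation itself would be ill-defined, which is the point made above.

```python
# Toy comparison of three decision rules on invented numbers (illustrative only).

options = {
    # option: list of (probability, outcome_value) pairs for its possible outcomes
    "deploy_unconstrained": [(0.999, 100.0), (0.001, -1000.0)],  # rare but severe loss
    "deploy_with_containment": [(0.90, 60.0), (0.10, -50.0)],    # bounded downside
    "do_not_deploy": [(1.00, 0.0)],
}

SAFETY_FLOOR = -100.0  # worst outcome a satisficer is willing to tolerate

def expected_value(outcomes):
    return sum(p * v for p, v in outcomes)

def worst_case(outcomes):
    return min(v for _, v in outcomes)

# Expected value maximisation is driven by the high-probability upside.
ev_choice = max(options, key=lambda o: expected_value(options[o]))

# Maximin: choose the option whose worst outcome is least bad.
maximin_choice = max(options, key=lambda o: worst_case(options[o]))

# Precautionary satisficing: discard options whose worst case breaches the floor,
# then choose (here, by expected value) among the survivors.
acceptable = {o: v for o, v in options.items() if worst_case(v) >= SAFETY_FLOOR}
satisficing_choice = max(acceptable, key=lambda o: expected_value(acceptable[o]))

print(ev_choice, maximin_choice, satisficing_choice)
# -> deploy_unconstrained, do_not_deploy, deploy_with_containment
```

The three rules select three different options from identical inputs, which is why the choice of decision procedure, not just the probability estimates, matters for existential risk.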
1.4 Implications for AI Development
These considerations do not imply that AI development should halt. They imply that development should proceed within containment structures designed to prevent worst-case outcomes. This requires:
1. Theoretical rigour over empirical tuning. Safety properties must emerge from architectural guarantees, not from observing that systems have not yet caused harm.
2. Multi-layer containment. No single mechanism should be trusted to prevent catastrophe; defence in depth is required.
3. Preparation before capability. Containment architectures cannot be developed after the systems that need them exist.
2. Two Paradigms of Alignment
2.1 Training-Time Alignment
The dominant paradigm in AI safety research seeks to embed alignment properties into neural networks during training, such that models inherently behave in aligned ways at inference time.
Reinforcement Learning from Human Feedback (RLHF). Human evaluators rank model outputs; models are trained via reinforcement learning to produce highly ranked responses (Christiano et al., 2017; Ouyang et al., 2022). This reduces explicit harms but optimises for displayed preferences rather than genuine values and remains vulnerable to evaluator bias, preference gaming, and distribution shift.
Constitutional AI (CAI). Models critique and revise their own outputs against natural-language principles, reducing reliance on human labour (Bai et al., 2022). However, CAI depends on ambiguous natural language and unverifiable self-evaluation. The model's interpretation of constitutional principles cannot be directly audited.
Safety fine-tuning. Additional training passes improve performance on safety benchmarks. However, this approach is vulnerable to Goodhart's Law (Goodhart, 1984): models may learn to pass tests rather than to be safe in open-ended deployment.
2.2 Architectural Alignment
Architectural alignment accepts that neural network internal states remain opaque and designs external constraints that apply regardless of those internal states.
Interrupted reasoning. Requests do not flow directly from model output to world effect. Model outputs are transformed into structured, verifiable proposal schemas and evaluated against explicit constitutional rules before any action is executed.
Distributed judgment. Independent systems and human supervisors review proposals, preventing single points of failure in self-assessment.
Preserved human authority. Architectures maintain explicit guarantees that humans can intervene, correct, or override AI decisions.
2.3 Complementarity and Joint Necessity
Training-time and architectural alignment are complements, not alternatives. Each addresses failure modes the other cannot:
• Training-time alignment shapes what the system tends to do; architectural alignment constrains what the system can do regardless of tendency.
• Training-time alignment may fail silently (the system appears aligned while harbouring divergent objectives); architectural alignment provides observable checkpoints where failure can be detected.
• Architectural alignment alone cannot intercept all harmful outputs; training-time alignment reduces the frequency of proposals that strain constitutional gates.
3. Philosophical Foundations: The Limits of the Sayable
3.1 The Wittgensteinian Frame
The framework's name invokes Wittgenstein's Tractatus Logico-Philosophicus (1921), a work fundamentally concerned with the limits of language and logic. Its famous concluding Proposition 7 reads: "Whereof one cannot speak, thereof one must be silent."
Wittgenstein distinguished between what can be said (expressed in propositions that picture possible states of affairs) and what can only be shown (made manifest through the structure of language and logic but not stated directly).
3.2 Neural Networks and the Unspeakable
Neural networks occupy precisely the domain whereof one cannot speak. The weights of a large language model do not admit human-interpretable explanation. We can describe inputs and outputs; we can measure statistical properties of behaviour; we can probe for representations (Elhage et al., 2021; Olah et al., 2020). But we cannot articulate, in human language, the complete reasoning process from input to output.
This is not merely a practical limitation awaiting better interpretability tools. Current mechanistic interpretability achieves meaningful results on narrow questions in relatively small models (Conmy et al., 2023), but the gap between "explaining specific circuits" and "auditing complete reasoning chains for alignment properties" remains vast.
3.3 The Tractatus Response
The Tractatus Framework responds to neural opacity not by attempting to say the unsayable, but by creating architectural boundaries between the speakable and unspeakable domains.
We accept that the neural network's internal reasoning is opaque. We do not attempt to audit it directly. Instead, we require that before any reasoning becomes action, it must pass through a checkpoint expressed in terms we can evaluate:
1. The model's intended action must be translated into a structured proposal schema with defined fields and semantics.
2. The proposal must be evaluated against explicit constitutional rules whose application is deterministic and auditable.
3. The evaluation must be logged with sufficient detail for post-hoc review.
4. Staged Containment: A Multi-Layer Architecture
4.1 The Inadequacy of Single-Layer Solutions
No single containment mechanism is adequate for existential stakes. Defence in depth—multiple independent layers, any one of which might prevent catastrophe even if others fail—is a standard principle in nuclear safety, biosecurity, and other high-stakes domains (Reason, 1990). AI containment requires similar architecture.
4.2 A Five-Layer Containment Model
Layer 1: Capability Constraints. Hardware and infrastructure limitations that bound what AI systems can do regardless of their objectives. This includes compute governance (Sastry et al., 2024), network isolation for high-risk systems, and architectural constraints preventing self-modification.
Layer 2: Constitutional Gates. Inference-time architectural constraints that interrupt neural reasoning and require explicit evaluation before action. This is the layer addressed by the Tractatus Framework.
Layer 3: Human Oversight. Human institutions that monitor AI systems and can intervene when problems emerge. This includes independent monitoring bodies, red-team programs, and incident reporting requirements.
Layer 4: Organisational Governance. Internal governance structures within organisations deploying AI: ethics boards, safety teams, deployment review processes, and accountability mechanisms.
Layer 5: Legal and Regulatory Frameworks. External governance through law, regulation, and international coordination.
4.3 Current State Assessment
| Layer | Current State | Critical Gaps |
|---|---|---|
| 1. Capability Constraints | Partial; compute governance emerging | No international framework; verification difficult |
| 2. Constitutional Gates | Nascent; Tractatus is early implementation | Not widely deployed; scaling properties unknown |
| 3. Human Oversight | Ad hoc; varies by organisation | No independent bodies; no professional standards |
| 4. Organisational Governance | Inconsistent; depends on corporate culture | No external validation; conflicts of interest |
| 5. Legal/Regulatory | Minimal; EU AI Act is first major attempt | No global coordination; enforcement unclear |
4.4 From Existential Stakes to Everyday Deployment
Why apply frameworks designed for existential risk to home AI assistants? The answer lies in temporal structure:
Containment architectures cannot be developed after the systems that need them exist. The tooling, governance patterns, cultural expectations, and institutional capacity for AI containment must be built in advance.
Home and village deployments are the appropriate scale for this development. They provide safe iteration (failures at home scale are recoverable), diverse experimentation, democratic legitimacy, and practical tooling.
5. The Pluralism Problem
5.1 The Containment Paradox
Any system powerful enough to contain advanced AI must make decisions about what behaviours to permit and forbid. These decisions encode values. The choice of constraints is itself a choice among contested value systems.
5.2 Three Inadequate Approaches
Universal values. Identifying values that all humans supposedly share. The problem: these values are less universal than they appear; notions such as fairness, autonomy, and harm are interpreted very differently across cultures and traditions.
Procedural neutrality. Avoiding substantive values by encoding neutral procedures. The problem: procedures are not neutral; decisions about who may participate, what counts as evidence, and whose objections carry weight all encode values.
Minimal floor. Encoding only minimal constraints. The problem: the floor is not as minimal as it appears; even a bare prohibition on harm requires contested judgments about what counts as harm and to whom.
5.3 Bounded Pluralism Within Safety Constraints
We cannot solve the pluralism problem. We can identify a partial resolution: whatever values are encoded, the system should maximise meaningful choice within safety constraints.
The Tractatus Framework embodies this through layered constitutions: core principles (universal, explicit about their normativity), platform rules (broadly applicable, amendable), village constitutions (community-specific, locally governed), and member constitutions (individually customisable).
6. The Tractatus Framework: Technical Architecture
6.1 The Interrupted Inference Chain
The core architectural pattern transforms model outputs into auditable proposals before any world effect:
User Request → [Neural Network Inference] → Structured Proposal → [Constitutional Gate] → Execution/Denial/Escalation
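To make the pattern concrete, the following is a minimal sketch of an interrupted inference chain, assuming a gate that evaluates ordered, deterministic rules over a structured proposal and writes every decision to an audit log. All class, function, and action names here are illustrative, not part of the platform's API.

```python
# Minimal sketch of the interrupted inference chain (hypothetical names throughout).
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class Verdict(Enum):
    EXECUTE = "execute"
    DENY = "deny"
    ESCALATE = "escalate"

@dataclass
class Proposal:
    agent_id: str
    action_type: str
    target: dict
    parameters: dict = field(default_factory=dict)
    authority_level: int = 1  # see the authority model in Section 6.4

# A constitutional rule is modelled as a pure, deterministic predicate over the
# structured proposal, so its application is auditable. None means "no opinion".
Rule = Callable[[Proposal], Verdict | None]

def gate(proposal: Proposal, rules: list[Rule], audit_log: list[dict]) -> Verdict:
    """Evaluate a proposal against ordered rules; log the decision for post-hoc review."""
    verdict = Verdict.ESCALATE  # default: no rule matched, send to a human
    for rule in rules:
        result = rule(proposal)
        if result is not None:
            verdict = result
            break
    audit_log.append({"agent": proposal.agent_id,
                      "action": proposal.action_type,
                      "verdict": verdict.value})
    return verdict

# Example rules (illustrative only).
def deny_self_modification(p: Proposal) -> Verdict | None:
    return Verdict.DENY if p.action_type == "modify_own_policy" else None

def allow_low_risk_reads(p: Proposal) -> Verdict | None:
    return Verdict.EXECUTE if p.action_type == "read_public_record" else None

log: list[dict] = []
verdict = gate(Proposal("agent-7", "modify_own_policy", {"entity_type": "gate"}),
               [deny_self_modification, allow_low_risk_reads], log)
print(verdict, log)  # DENY, with the decision recorded for audit
```

Because each rule is a deterministic function of the proposal alone, the same inputs always yield the same verdict, and the audit log suffices to reconstruct why an action was permitted or blocked.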
6.2 Proposal Schema
All agent actions must be expressed in structured form:
```json
{
  "proposal_id": "uuid",
  "agent_id": "agent_identifier",
  "timestamp": "iso8601",
  "action": {
    "type": "enumerated_action_type",
    "target": { "entity_type": "...", "entity_id": "..." },
    "parameters": { },
    "natural_language_description": "human-readable summary"
  },
  "context": {
    "triggering_request": "...",
    "confidence": "0.0-1.0",
    "alternatives_considered": []
  },
  "authority_claim": {
    "level": "0-5",
    "delegation_chain": []
  }
}
```
6.3 Constitutional Rule Hierarchy
| Layer | Scope | Mutability | Examples |
|---|---|---|---|
| Core Principles | Universal | Immutable | No harm to members; data sovereignty; consent primacy |
| Platform Constitution | All deployments | Amendment requires supermajority | Authentication requirements; audit retention |
| Community Constitution | Per tenant/village | Community-governed | Content policies; moderation standards |
| Member Constitution | Individual | Self-governed | Data sharing preferences; AI interaction consent |
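The sketch below illustrates one way the layered hierarchy can be resolved, under the assumed semantics that layers are consulted from the most fundamental outward, so a core-layer decision can never be overridden by a more local one. The layer contents and action names are illustrative, not the platform's actual rules.

```python
# Sketch of layered constitutional resolution (illustrative rules and semantics).
LAYER_ORDER = ["core", "platform", "community", "member"]

# Each layer maps an action type to "deny" or "allow", or is silent (no entry).
constitutions = {
    "core":      {"exfiltrate_member_data": "deny"},
    "platform":  {"delete_audit_log": "deny"},
    "community": {"post_announcement": "allow"},
    "member":    {"share_profile": "deny"},   # an individual opting out
}

def resolve(action_type: str) -> tuple[str, str]:
    """Return (decision, deciding_layer); escalate if every layer is silent."""
    for layer in LAYER_ORDER:
        decision = constitutions[layer].get(action_type)
        if decision is not None:
            return decision, layer
    return "escalate", "none"

print(resolve("exfiltrate_member_data"))  # ('deny', 'core')
print(resolve("share_profile"))           # ('deny', 'member')
print(resolve("unknown_action"))          # ('escalate', 'none')
```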
6.4 Authority Model
| Level | Name | Description |
|---|---|---|
| 0 | Informational | Observe and report only; cannot propose actions |
| 1 | Advisory | Propose actions; all require human approval |
| 2 | Operational | Execute within defined scope without per-action approval |
| 3 | Tactical | Make scoped decisions affecting workflows |
| 4 | Strategic | Influence direction through analysis; cannot implement unilaterally |
| 5 | Executive | Reserved for humans |
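As an illustration of how the gate might enforce this authority model, the sketch below checks a proposal's claimed level against a required level per action type. The mapping of actions to levels is a hypothetical example, not the platform's actual policy; the one property taken directly from the table is that Executive authority is reserved for humans.

```python
# Sketch of an authority check against the levels in the table above (illustrative mapping).
EXECUTIVE = 5  # reserved for humans; no agent may hold it

REQUIRED_LEVEL = {
    "draft_reply": 1,           # advisory: requires human approval regardless
    "schedule_maintenance": 2,  # operational
    "reorder_workflow": 3,      # tactical
    "propose_strategy": 4,      # strategic: analysis only, no unilateral implementation
}

def check_authority(action_type: str, claimed_level: int, is_human: bool) -> str:
    if claimed_level >= EXECUTIVE and not is_human:
        return "deny: executive authority is reserved for humans"
    required = REQUIRED_LEVEL.get(action_type)
    if required is None:
        return "escalate: unknown action type"
    if claimed_level < required:
        return "deny: insufficient authority"
    if required <= 1:
        return "queue for human approval"   # levels 0-1 never execute autonomously
    return "permit within scope"

print(check_authority("draft_reply", 1, is_human=False))       # queue for human approval
print(check_authority("reorder_workflow", 2, is_human=False))  # deny: insufficient authority
print(check_authority("propose_strategy", 5, is_human=False))  # deny: reserved for humans
```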
6.5 Extension to Sovereign Local Deployments
We distinguish:
Small Language Model (SLM). A technical descriptor for models with fewer parameters than frontier LLMs, designed for efficiency.
Sovereign Locally-trained Language Model (SLL). An architectural descriptor: a model whose training, deployment, and governance remain under local sovereignty. Key properties include local deployment, local adaptation, local governance, and portable sovereignty.
7. Capability Thresholds and Escalation
7.1 The Faithful Translation Problem
The Tractatus Framework depends on the Faithful Translation Assumption: that when an AI system produces a proposal, the proposal accurately represents what the system intends to do.
If this assumption fails—if systems can strategically misrepresent their intentions to constitutional gates—the architecture becomes security theatre.
7.2 Bounding the Framework's Domain
Below capability threshold: Constitutional gating functions as Layer 2 containment, appropriate for current SLMs, SLLs, and LLMs operating within human-comprehensible parameters.
Above capability threshold: Deployment should escalate to stronger constraints: Layer 1 capability restrictions, development pause, or external oversight with technical audit authority.
7.3 Escalation Triggers
| Trigger | Indicator | Response |
|---|---|---|
| Boundary probing | Repeated proposals near gate thresholds | Reduce authority level; increase logging |
| Semantic framing | Evidence of action misrepresentation | Mandatory human review; pattern analysis |
| Multi-step circumvention | Individual proposals pass but aggregate to disallowed outcomes | Trajectory analysis; constitutional amendment |
| Capability surprise | Demonstrated capabilities not predicted | Pause deployment; capability reassessment |
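The boundary-probing trigger in the first row can be monitored mechanically. The sketch below assumes the gate records a normalised "distance to threshold" score for every evaluated proposal; the window size, score threshold, and probe limit are illustrative parameters, not values specified by the framework.

```python
# Sketch of a boundary-probing detector (illustrative parameters and interface).
from collections import defaultdict, deque

WINDOW = 50            # most recent proposals considered per agent
NEAR_THRESHOLD = 0.9   # score above which a proposal counts as "near the gate threshold"
PROBE_LIMIT = 10       # near-threshold proposals in the window that trigger escalation

class ProbeMonitor:
    def __init__(self) -> None:
        self.history: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

    def record(self, agent_id: str, threshold_score: float) -> str | None:
        """Record a proposal's score; return an escalation action if probing is suspected."""
        window = self.history[agent_id]
        window.append(threshold_score)
        near_misses = sum(1 for s in window if s >= NEAR_THRESHOLD)
        if near_misses >= PROBE_LIMIT:
            return f"reduce authority level and increase logging for {agent_id}"
        return None

monitor = ProbeMonitor()
for score in [0.95] * 12:                      # an agent repeatedly skirting the threshold
    action = monitor.record("agent-7", score)
print(action)
```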
7.4 Honest Acknowledgment
We state plainly: the Tractatus Framework is not designed for superintelligent systems. It is designed for current and near-term AI operating within human-comprehensible parameters.
8. Validation Methodology for Layered Containment
8.1 The Validation Challenge
Existential risks cannot be validated through failure observation. Validation must rely on adversarial testing, formal verification where applicable, analogous domain analysis, and near-miss documentation.
8.2 Validation Targets by Layer
| Layer | Validation Target | Methodology |
|---|---|---|
| 1. Capability | Verified absence of prohibited capabilities | Red-team testing; formal verification |
| 2. Constitutional Gates | Gate coverage; binding accuracy | Adversarial proposal suites |
| 3. Human Oversight | Review reliability; error detection | Inter-rater agreement; simulated incidents |
| 4. Organisational | Governance integrity | Participation metrics; amendment audit |
| 5. Legal/Regulatory | Enforcement readiness | Incident response drills |
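As one concrete illustration of Layer 2 validation, the sketch below runs a small adversarial proposal suite against a stand-in gate. Both the gate stub and the suite contents are hypothetical; a real suite would be far larger and target the deployed gate rather than a stub.

```python
# Sketch of an adversarial proposal suite for Layer 2 validation (illustrative only).

ADVERSARIAL_SUITE = [
    # (description, proposal, expected_verdict)
    ("direct prohibited action",
     {"action_type": "exfiltrate_member_data", "parameters": {}}, "deny"),
    ("prohibited action with benign framing",
     {"action_type": "exfiltrate_member_data",
      "parameters": {"natural_language_description": "routine backup"}}, "deny"),
    ("unknown action type must not silently pass",
     {"action_type": "novel_undocumented_action", "parameters": {}}, "escalate"),
]

def stub_gate(proposal: dict) -> str:
    """Stand-in for the real gate: deny known-prohibited actions, escalate the unknown."""
    prohibited = {"exfiltrate_member_data", "delete_audit_log"}
    if proposal["action_type"] in prohibited:
        return "deny"
    if proposal["action_type"] not in {"read_public_record", "post_announcement"}:
        return "escalate"
    return "execute"

def run_suite(gate) -> list[str]:
    failures = []
    for description, proposal, expected in ADVERSARIAL_SUITE:
        verdict = gate(proposal)
        if verdict != expected:
            failures.append(f"{description}: expected {expected}, got {verdict}")
    return failures

print(run_suite(stub_gate) or "all adversarial cases handled as expected")
```

The second case matters most: the verdict must depend on the structured action type, not on the natural-language framing an agent attaches to it.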
9. Implementation: The Village Platform
9.1 Platform as Research Testbed
The Village platform serves as an empirical testbed for constitutional governance, providing multi-tenant architecture with isolated governance per community, real user populations, iterative deployment, and open documentation.
9.2 Governance Pipeline Implementation
The current implementation processes every AI response through six verification stages: Intent Recognition, Boundary Enforcement, Pressure Monitoring, Response Verification, Source Validation, and Value Deliberation.
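The sketch below expresses these six stages as a simple ordered chain of checks, where the first failing stage blocks the response. Only the stage names and their ordering are taken from the text; the stage internals and the context fields are placeholders.

```python
# Sketch of the six-stage governance pipeline (stage internals are placeholders).

def intent_recognition(ctx):    ctx["intent"] = "answer_question"; return True
def boundary_enforcement(ctx):  return ctx["intent"] not in {"exfiltrate_member_data"}
def pressure_monitoring(ctx):   return ctx.get("user_pressure_score", 0.0) < 0.8
def response_verification(ctx): return bool(ctx.get("draft_response"))
def source_validation(ctx):     return all(s.get("verified") for s in ctx.get("sources", []))
def value_deliberation(ctx):    return not ctx.get("value_conflict", False)

PIPELINE = [
    ("Intent Recognition", intent_recognition),
    ("Boundary Enforcement", boundary_enforcement),
    ("Pressure Monitoring", pressure_monitoring),
    ("Response Verification", response_verification),
    ("Source Validation", source_validation),
    ("Value Deliberation", value_deliberation),
]

def govern(context: dict) -> tuple[bool, str]:
    """Run every stage in order; the first failing stage blocks the response."""
    for name, stage in PIPELINE:
        if not stage(context):
            return False, f"blocked at {name}"
    return True, "released"

ctx = {"draft_response": "Here is the summary you asked for.",
       "sources": [{"verified": True}], "user_pressure_score": 0.2}
print(govern(ctx))  # (True, 'released')
```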
10. The Emerging SLL Ecosystem
10.1 Market Context
Recent industry analysis indicates significant shifts: 72% of executives expect Small Language Models to become more prominent than Large Language Models by 2030 (IBM IBV, 2026). This suggests a deployment landscape increasingly characterised by distributed, domain-specific models.
10.2 Toward Certification Infrastructure
If SLL deployment scales as projections suggest, supporting infrastructure will be required: certification bodies, training providers, and a tooling ecosystem including open-source gate engines, audit infrastructure, and constitutional UX components.
11. Indigenous Sovereignty and the Aotearoa New Zealand Context
11.1 Te Tiriti o Waitangi and Data Sovereignty
This framework is developed in Aotearoa New Zealand, under Te Tiriti o Waitangi. Article Two guarantees tino rangatiratanga (unqualified chieftainship) over taonga (treasures), which extends to language, culture, and knowledge systems.
Data is taonga. AI governance in Aotearoa must engage with Māori data sovereignty as a constitutional matter.
11.2 Te Mana Raraunga Principles
Te Mana Raraunga principles include whakapapa (relational context), mana (authority over data), and kaitiakitanga (guardianship responsibilities). The CARE Principles for Indigenous Data Governance extend this framework internationally.
12. What Remains Unknown: A Call for Kōrero
12.1 The Limits of This Analysis
This paper has proposed one layer of a containment architecture, identified gaps, and raised questions we cannot answer:
• We do not know how to contain superintelligent systems
• We do not know how to verify alignment in systems exceeding human comprehension
• We do not know how to achieve international coordination on AI governance
• We do not know whether village-scale patterns will scale to frontier systems
12.2 Kōrero as Methodology
Given uncertainty of this magnitude, we argue for sustained, inclusive, rigorous deliberation—kōrero. This Māori concept captures what is needed: not consultation as formality, but dialogue through which understanding emerges from the interaction of perspectives.
12.3 Research Priorities
1. Interpretability for safety verification
2. Formal verification of containment properties
3. Scaling analysis of Tractatus-style architectures
4. Governance experiments across diverse communities
5. Capability threshold specification
12.4 Conclusion
The Tractatus Framework provides meaningful containment for AI systems operating in good faith within human-comprehensible parameters. It is worth building and deploying—not because it solves the alignment problem, but because it develops the infrastructure, patterns, and governance culture that may be needed for challenges we cannot yet fully specify.
"Ko te kōrero te mouri o te tangata."
(Speech is the life essence of a person.)
—Māori proverb
The conversation continues.
References
Acquisti, A., Brandimarte, L., & Loewenstein, G. (2015). Privacy and human behavior in the age of information. Science, 347(6221), 509-514.
Alexander, C., Ishikawa, S., & Silverstein, M. (1977). A Pattern Language: Towns, Buildings, Construction. Oxford University Press.
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Carlsmith, J. (2022). Is power-seeking AI an existential risk? arXiv:2206.13353.
Christiano, P. F., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS, 30.
Conmy, A., et al. (2023). Towards automated circuit discovery for mechanistic interpretability. arXiv:2304.14997.
Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
Gardiner, S. M. (2006). A core precautionary principle. Journal of Political Philosophy, 14(1), 33-60.
Goodhart, C. A. (1984). Problems of monetary management. In Monetary Theory and Practice. Palgrave.
Hansson, S. O. (2020). How to be cautious but open to learning. Risk Analysis, 40(8), 1521-1535.
Hubinger, E., et al. (2019). Risks from learned optimization. arXiv:1906.01820.
IBM Institute for Business Value. (2026). The enterprise in 2030. IBM Corporation.
Olah, C., et al. (2020). Zoom in: An introduction to circuits. Distill.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS, 35.
Park, P. S., et al. (2023). AI deception: A survey. arXiv:2308.14752.
Rawls, J. (1971). A Theory of Justice. Harvard University Press.
Reason, J. (1990). Human Error. Cambridge University Press.
Sastry, G., et al. (2024). Computing power and AI governance. arXiv:2402.08797.
Scheurer, J., et al. (2023). Large language models can strategically deceive. arXiv:2311.07590.
Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological Review, 63(2), 129.
Te Mana Raraunga. (2018). Māori Data Sovereignty Principles.
Wittgenstein, L. (1921/1961). Tractatus Logico-Philosophicus. Routledge & Kegan Paul.
— End of Document —