ARCHITECTURAL ALIGNMENT
Interrupting Neural Reasoning Through Constitutional Inference Gating
A Necessary Layer in Global AI Containment
Abstract
Contemporary approaches to AI alignment rely predominantly on training-time interventions: reinforcement learning from human feedback (Christiano et al., 2017), constitutional AI methods (Bai et al., 2022), and safety fine-tuning. These approaches share a common architectural assumption—that alignment properties can be instilled during training and will persist reliably during inference. This paper argues that training-time alignment, while valuable, is insufficient for existential stakes and must be complemented by architectural alignment through inference-time constitutional gating.
We present the Tractatus Framework as a formal specification for interrupted neural reasoning: proposals generated by AI systems must be translated into auditable forms and evaluated against constitutional constraints before execution. This shifts the trust model from "trust the vendor's training" to "trust the visible architecture." The framework is implemented within the Village multi-tenant community platform, providing an empirical testbed for governance research.
Critically, we address the faithful translation assumption—the vulnerability that systems may misrepresent their intended actions to constitutional gates—by bounding the framework's domain of applicability to pre-superintelligence systems and specifying explicit capability thresholds and escalation triggers. We introduce the concept of Sovereign Locally-trained Language Models (SLLs) as a deployment paradigm where constitutional gating becomes both feasible and necessary.
The paper contributes: (1) a formal architecture for inference-time constitutional gating; (2) capability threshold specifications with escalation logic; (3) validation methodology for layered containment; (4) an argument connecting existential risk preparation to edge deployment; and (5) a call for sustained deliberation (kōrero) as the epistemically appropriate response to alignment uncertainty.
1. The Stakes: Why Probabilistic Risk Assessment Fails
1.1 The Standard Framework and Its Breakdown
Risk assessment in technological domains typically operates through expected value calculations: multiply the probability of an outcome by its magnitude, compare across alternatives, and select the option that maximises expected utility. This framework underlies regulatory decisions from environmental policy to pharmaceutical approval and has proven adequate for most technological risks.
For existential risk from advanced AI systems, this framework breaks down in ways that are both mathematical and epistemic.
1.2 Three Properties of Existential Risk
Irreversibility. Most risks allow for error and subsequent learning; existential risks do not, as there is no second attempt after civilisational collapse or human extinction. Standard empiricism—testing hypotheses by observing what happens—cannot work, so theory and architecture must be right the first time.
Unquantifiable probability. There is no frequency data for existential catastrophes from AI systems. Estimates of misalignment probability vary by orders of magnitude depending on reasonable assumptions about capability trajectories, alignment difficulty, and coordination feasibility. Carlsmith (2022) estimates existential risk from power-seeking AI at greater than 10% by 2070; other researchers place estimates substantially higher or lower. This is not ordinary uncertainty reducible through additional data collection—it is fundamental unquantifiability stemming from the unprecedented nature of the risk.
Infinite disvalue. Expected value calculations multiply probability by magnitude. When the magnitude at stake is effectively unbounded (the permanent foreclosure of all future human potential), even vanishingly small probabilities swamp the calculation, and comparisons between finite alternatives become undefined. The mathematical grounding of conventional cost-benefit analysis fails.
1.3 Decision-Theoretic Implications
These properties suggest that expected value maximisation is not the appropriate decision procedure for existential AI risk. Alternative frameworks include the following; a toy comparison of these decision rules with expected value maximisation follows the list:
Precautionary satisficing (Simon, 1956; Hansson, 2020). Under conditions of radical uncertainty with irreversible stakes, satisficing—selecting options that meet minimum safety thresholds rather than optimising expected value—may be the rational approach.
Maximin under uncertainty (Rawls, 1971). When genuine uncertainty (not merely unknown probabilities) meets irreversible stakes, maximin reasoning—choosing the option whose worst outcome is least bad—provides a coherent decision procedure.
Strong precautionary principle (Gardiner, 2006). The precautionary principle is appropriate when three conditions obtain: irreversibility, high uncertainty, and public goods at stake. Existential AI risk meets all three.
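To make the contrast between these decision rules concrete, the sketch below applies expected value maximisation, maximin, and precautionary satisficing to the same set of hypothetical options. All option names, probabilities, payoffs, and the SAFETY_FLOOR constant are invented for exposition, not estimates drawn from this paper or the literature; with a genuinely unbounded loss the expected value calculation itself would be ill-defined, which is the point made above.

```python
# Toy comparison of three decision rules on invented numbers (illustrative only).

options = {
    # option: list of (probability, outcome_value) pairs for its possible outcomes
    "deploy_unconstrained": [(0.999, 100.0), (0.001, -1000.0)],  # rare but severe loss
    "deploy_with_containment": [(0.90, 60.0), (0.10, -50.0)],    # bounded downside
    "do_not_deploy": [(1.00, 0.0)],
}

SAFETY_FLOOR = -100.0  # worst outcome a satisficer is willing to tolerate

def expected_value(outcomes):
    return sum(p * v for p, v in outcomes)

def worst_case(outcomes):
    return min(v for _, v in outcomes)

# Expected value maximisation is driven by the high-probability upside.
ev_choice = max(options, key=lambda o: expected_value(options[o]))

# Maximin: choose the option whose worst outcome is least bad.
maximin_choice = max(options, key=lambda o: worst_case(options[o]))

# Precautionary satisficing: discard options whose worst case breaches the floor,
# then choose (here, by expected value) among the survivors.
acceptable = {o: v for o, v in options.items() if worst_case(v) >= SAFETY_FLOOR}
satisficing_choice = max(acceptable, key=lambda o: expected_value(acceptable[o]))

print(ev_choice, maximin_choice, satisficing_choice)
# -> deploy_unconstrained, do_not_deploy, deploy_with_containment
```

The three rules select three different options from identical inputs, which is why the choice of decision procedure, not just the probability estimates, matters for existential risk.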
1.4 Implications for AI Development
These considerations do not imply that AI development should halt. They imply that development should proceed within containment structures designed to prevent worst-case outcomes. This requires:
1. Theoretical rigour over empirical tuning. Safety properties must emerge from architectural guarantees, not from observing that systems have not yet caused harm.
2. Multi-layer containment. No single mechanism should be trusted to prevent catastrophe; defence in depth is required.
3. Preparation before capability. Containment architectures cannot be developed after the systems that need them exist.
2. Two Paradigms of Alignment
2.1 Training-Time Alignment
The dominant paradigm in AI safety research seeks to embed alignment properties into neural networks during training, such that models inherently behave in aligned ways at inference time.
Reinforcement Learning from Human Feedback (RLHF). Human evaluators rank model outputs; models are trained via reinforcement learning to produce highly ranked responses (Christiano et al., 2017; Ouyang et al., 2022). This reduces explicit harms but optimises for displayed preferences rather than genuine values and remains vulnerable to evaluator bias, preference gaming, and distribution shift.
Constitutional AI (CAI). Models critique and revise their own outputs against natural-language principles, reducing reliance on human labour (Bai et al., 2022). However, CAI depends on ambiguous natural language and unverifiable self-evaluation. The model's interpretation of constitutional principles cannot be directly audited.
Safety fine-tuning. Additional training passes improve performance on safety benchmarks. However, this approach is vulnerable to Goodhart's Law (Goodhart, 1984): models may learn to pass tests rather than to be safe in open-ended deployment.
2.2 Architectural Alignment
Architectural alignment accepts that neural network internal states remain opaque and designs external constraints that apply regardless of those internal states.
Interrupted reasoning. Requests do not flow directly from model output to world effect. Model outputs are transformed into structured, verifiable proposal schemas and evaluated against explicit constitutional rules before any action is executed.
Distributed judgment. Independent systems and human supervisors review proposals, preventing single points of failure in self-assessment.
Preserved human authority. Architectures maintain explicit guarantees that humans can intervene, correct, or override AI decisions.
2.3 Complementarity and Joint Necessity
Training-time and architectural alignment are complements, not alternatives. Each addresses failure modes the other cannot:
• Training-time alignment shapes what the system tends to do; architectural alignment constrains what the system can do regardless of tendency.
• Training-time alignment may fail silently (the system appears aligned while harbouring divergent objectives); architectural alignment provides observable checkpoints where failure can be detected.
• Architectural alignment alone cannot intercept all harmful outputs; training-time alignment reduces the frequency of proposals that strain constitutional gates.
3. Philosophical Foundations: The Limits of the Sayable
3.1 The Wittgensteinian Frame
The framework's name invokes Wittgenstein's Tractatus Logico-Philosophicus (1921), a work fundamentally concerned with the limits of language and logic. Its famous concluding Proposition 7 reads: "Whereof one cannot speak, thereof one must be silent."
Wittgenstein distinguished between what can be said (expressed in propositions that picture possible states of affairs) and what can only be shown (made manifest through the structure of language and logic but not stated directly).
3.2 Neural Networks and the Unspeakable
Neural networks occupy precisely the domain whereof one cannot speak. The weights of a large language model do not admit human-interpretable explanation. We can describe inputs and outputs; we can measure statistical properties of behaviour; we can probe for representations (Elhage et al., 2021; Olah et al., 2020). But we cannot articulate, in human language, the complete reasoning process from input to output.
This is not merely a practical limitation awaiting better interpretability tools. Current mechanistic interpretability achieves meaningful results on narrow questions in relatively small models (Conmy et al., 2023), but the gap between "explaining specific circuits" and "auditing complete reasoning chains for alignment properties" remains vast.
3.3 The Tractatus Response
The Tractatus Framework responds to neural opacity not by attempting to say the unsayable, but by creating architectural boundaries between the speakable and unspeakable domains.
We accept that the neural network's internal reasoning is opaque. We do not attempt to audit it directly. Instead, we require that before any reasoning becomes action, it must pass through a checkpoint expressed in terms we can evaluate:
1. The model's intended action must be translated into a structured proposal schema with defined fields and semantics.
2. The proposal must be evaluated against explicit constitutional rules whose application is deterministic and auditable.
3. The evaluation must be logged with sufficient detail for post-hoc review.
4. Staged Containment: A Multi-Layer Architecture
4.1 The Inadequacy of Single-Layer Solutions
No single containment mechanism is adequate for existential stakes. Defence in depth—multiple independent layers, any one of which might prevent catastrophe even if others fail—is a standard principle in nuclear safety, biosecurity, and other high-stakes domains (Reason, 1990). AI containment requires similar architecture.
4.2 A Five-Layer Containment Model
Layer 1: Capability Constraints. Hardware and infrastructure limitations that bound what AI systems can do regardless of their objectives. This includes compute governance (Sastry et al., 2024), network isolation for high-risk systems, and architectural constraints preventing self-modification.
Layer 2: Constitutional Gates. Inference-time architectural constraints that interrupt neural reasoning and require explicit evaluation before action. This is the layer addressed by the Tractatus Framework.
Layer 3: Human Oversight. Human institutions that monitor AI systems and can intervene when problems emerge. This includes independent monitoring bodies, red-team programs, and incident reporting requirements.
Layer 4: Organisational Governance. Internal governance structures within organisations deploying AI: ethics boards, safety teams, deployment review processes, and accountability mechanisms.
Layer 5: Legal and Regulatory Frameworks. External governance through law, regulation, and international coordination.
4.3 Current State Assessment
| Layer | Current State | Critical Gaps |
|---|---|---|
| 1. Capability Constraints | Partial; compute governance emerging | No international framework; verification difficult |
| 2. Constitutional Gates | Nascent; Tractatus is early implementation | Not widely deployed; scaling properties unknown |
| 3. Human Oversight | Ad hoc; varies by organisation | No independent bodies; no professional standards |
| 4. Organisational Governance | Inconsistent; depends on corporate culture | No external validation; conflicts of interest |
| 5. Legal/Regulatory | Minimal; EU AI Act is first major attempt | No global coordination; enforcement unclear |
4.4 From Existential Stakes to Everyday Deployment
Why apply frameworks designed for existential risk to home AI assistants? The answer lies in temporal structure:
Containment architectures cannot be developed after the systems that need them exist. The tooling, governance patterns, cultural expectations, and institutional capacity for AI containment must be built in advance.
Home and village deployments are the appropriate scale for this development. They provide safe iteration (failures at home scale are recoverable), diverse experimentation, democratic legitimacy, and practical tooling.
5. The Pluralism Problem
5.1 The Containment Paradox
Any system powerful enough to contain advanced AI must make decisions about what behaviours to permit and forbid. These decisions encode values. The choice of constraints is itself a choice among contested value systems.
5.2 Three Inadequate Approaches
Universal values. Identifying values that all humans supposedly share. The problem: these values are less universal than they appear; notions such as fairness, autonomy, and harm are interpreted very differently across cultures and traditions.
Procedural neutrality. Avoiding substantive values by encoding neutral procedures. The problem: procedures are not neutral; decisions about who may participate, what counts as evidence, and whose objections carry weight all encode values.
Minimal floor. Encoding only minimal constraints. The problem: the floor is not as minimal as it appears; even a bare prohibition on harm requires contested judgments about what counts as harm and to whom.
5.3 Bounded Pluralism Within Safety Constraints
We cannot solve the pluralism problem. We can identify a partial resolution: whatever values are encoded, the system should maximise meaningful choice within safety constraints.
The Tractatus Framework embodies this through layered constitutions: core principles (universal, explicit about their normativity), platform rules (broadly applicable, amendable), village constitutions (community-specific, locally governed), and member constitutions (individually customisable).
6. The Tractatus Framework: Technical Architecture
6.1 The Interrupted Inference Chain
The core architectural pattern transforms model outputs into auditable proposals before any world effect:
User Request → [Neural Network Inference] → Structured Proposal → [Constitutional Gate] → Execution/Denial/Escalation
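To make the pattern concrete, the following is a minimal sketch of an interrupted inference chain, assuming a gate that evaluates ordered, deterministic rules over a structured proposal and writes every decision to an audit log. All class, function, and action names here are illustrative, not part of the platform's API.

```python
# Minimal sketch of the interrupted inference chain (hypothetical names throughout).
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class Verdict(Enum):
    EXECUTE = "execute"
    DENY = "deny"
    ESCALATE = "escalate"

@dataclass
class Proposal:
    agent_id: str
    action_type: str
    target: dict
    parameters: dict = field(default_factory=dict)
    authority_level: int = 1  # see the authority model in Section 6.4

# A constitutional rule is modelled as a pure, deterministic predicate over the
# structured proposal, so its application is auditable. None means "no opinion".
Rule = Callable[[Proposal], Verdict | None]

def gate(proposal: Proposal, rules: list[Rule], audit_log: list[dict]) -> Verdict:
    """Evaluate a proposal against ordered rules; log the decision for post-hoc review."""
    verdict = Verdict.ESCALATE  # default: no rule matched, send to a human
    for rule in rules:
        result = rule(proposal)
        if result is not None:
            verdict = result
            break
    audit_log.append({"agent": proposal.agent_id,
                      "action": proposal.action_type,
                      "verdict": verdict.value})
    return verdict

# Example rules (illustrative only).
def deny_self_modification(p: Proposal) -> Verdict | None:
    return Verdict.DENY if p.action_type == "modify_own_policy" else None

def allow_low_risk_reads(p: Proposal) -> Verdict | None:
    return Verdict.EXECUTE if p.action_type == "read_public_record" else None

log: list[dict] = []
verdict = gate(Proposal("agent-7", "modify_own_policy", {"entity_type": "gate"}),
               [deny_self_modification, allow_low_risk_reads], log)
print(verdict, log)  # DENY, with the decision recorded for audit
```

Because each rule is a deterministic function of the proposal alone, the same inputs always yield the same verdict, and the audit log suffices to reconstruct why an action was permitted or blocked.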
6.2 Proposal Schema
All agent actions must be expressed in structured form:
```json
{
  "proposal_id": "uuid",
  "agent_id": "agent_identifier",
  "timestamp": "iso8601",
  "action": {
    "type": "enumerated_action_type",
    "target": { "entity_type": "...", "entity_id": "..." },
    "parameters": { },
    "natural_language_description": "human-readable summary"
  },
  "context": {
    "triggering_request": "...",
    "confidence": "0.0-1.0",
    "alternatives_considered": []
  },
  "authority_claim": {
    "level": "0-5",
    "delegation_chain": []
  }
}
```
6.3 Constitutional Rule Hierarchy
| Layer | Scope | Mutability | Examples |
|---|---|---|---|
| Core Principles | Universal | Immutable | No harm to members; data sovereignty; consent primacy |
| Platform Constitution | All deployments | Amendment requires supermajority | Authentication requirements; audit retention |
| Community Constitution | Per tenant/village | Community-governed | Content policies; moderation standards |
| Member Constitution | Individual | Self-governed | Data sharing preferences; AI interaction consent |
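The sketch below illustrates one way the layered hierarchy can be resolved, under the assumed semantics that layers are consulted from the most fundamental outward, so a core-layer decision can never be overridden by a more local one. The layer contents and action names are illustrative, not the platform's actual rules.

```python
# Sketch of layered constitutional resolution (illustrative rules and semantics).
LAYER_ORDER = ["core", "platform", "community", "member"]

# Each layer maps an action type to "deny" or "allow", or is silent (no entry).
constitutions = {
    "core":      {"exfiltrate_member_data": "deny"},
    "platform":  {"delete_audit_log": "deny"},
    "community": {"post_announcement": "allow"},
    "member":    {"share_profile": "deny"},   # an individual opting out
}

def resolve(action_type: str) -> tuple[str, str]:
    """Return (decision, deciding_layer); escalate if every layer is silent."""
    for layer in LAYER_ORDER:
        decision = constitutions[layer].get(action_type)
        if decision is not None:
            return decision, layer
    return "escalate", "none"

print(resolve("exfiltrate_member_data"))  # ('deny', 'core')
print(resolve("share_profile"))           # ('deny', 'member')
print(resolve("unknown_action"))          # ('escalate', 'none')
```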
6.4 Authority Model
| Level | Name | Description |
|---|---|---|
| 0 | Informational | Observe and report only; cannot propose actions |
| 1 | Advisory | Propose actions; all require human approval |
| 2 | Operational | Execute within defined scope without per-action approval |
| 3 | Tactical | Make scoped decisions affecting workflows |
| 4 | Strategic | Influence direction through analysis; cannot implement unilaterally |
| 5 | Executive | Reserved for humans |
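As an illustration of how the gate might enforce this authority model, the sketch below checks a proposal's claimed level against a required level per action type. The mapping of actions to levels is a hypothetical example, not the platform's actual policy; the one property taken directly from the table is that Executive authority is reserved for humans.

```python
# Sketch of an authority check against the levels in the table above (illustrative mapping).
EXECUTIVE = 5  # reserved for humans; no agent may hold it

REQUIRED_LEVEL = {
    "draft_reply": 1,           # advisory: requires human approval regardless
    "schedule_maintenance": 2,  # operational
    "reorder_workflow": 3,      # tactical
    "propose_strategy": 4,      # strategic: analysis only, no unilateral implementation
}

def check_authority(action_type: str, claimed_level: int, is_human: bool) -> str:
    if claimed_level >= EXECUTIVE and not is_human:
        return "deny: executive authority is reserved for humans"
    required = REQUIRED_LEVEL.get(action_type)
    if required is None:
        return "escalate: unknown action type"
    if claimed_level < required:
        return "deny: insufficient authority"
    if required <= 1:
        return "queue for human approval"   # levels 0-1 never execute autonomously
    return "permit within scope"

print(check_authority("draft_reply", 1, is_human=False))       # queue for human approval
print(check_authority("reorder_workflow", 2, is_human=False))  # deny: insufficient authority
print(check_authority("propose_strategy", 5, is_human=False))  # deny: reserved for humans
```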
6.5 Extension to Sovereign Local Deployments
We distinguish:
Small Language Model (SLM). A technical descriptor for models with fewer parameters than frontier LLMs, designed for efficiency.
Sovereign Locally-trained Language Model (SLL). An architectural descriptor: a model whose training, deployment, and governance remain under local sovereignty. Key properties include local deployment, local adaptation, local governance, and portable sovereignty.
7. Capability Thresholds and Escalation
7.1 The Faithful Translation Problem
The Tractatus Framework depends on the Faithful Translation Assumption: that when an AI system produces a proposal, the proposal accurately represents what the system intends to do.
If this assumption fails—if systems can strategically misrepresent their intentions to constitutional gates—the architecture becomes security theatre.
7.2 Bounding the Framework's Domain
Below capability threshold: Constitutional gating functions as Layer 2 containment, appropriate for current SLMs, SLLs, and LLMs operating within human-comprehensible parameters.
Above capability threshold: Deployment should escalate to stronger constraints: Layer 1 capability restrictions, development pause, or external oversight with technical audit authority.
7.3 Escalation Triggers
| Trigger | Indicator | Response |
|---|---|---|
| Boundary probing | Repeated proposals near gate thresholds | Reduce authority level; increase logging |
| Semantic framing | Evidence of action misrepresentation | Mandatory human review; pattern analysis |
| Multi-step circumvention | Individual proposals pass but aggregate to disallowed outcomes | Trajectory analysis; constitutional amendment |
| Capability surprise | Demonstrated capabilities not predicted | Pause deployment; capability reassessment |
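The boundary-probing trigger in the first row can be monitored mechanically. The sketch below assumes the gate records a normalised "distance to threshold" score for every evaluated proposal; the window size, score threshold, and probe limit are illustrative parameters, not values specified by the framework.

```python
# Sketch of a boundary-probing detector (illustrative parameters and interface).
from collections import defaultdict, deque

WINDOW = 50            # most recent proposals considered per agent
NEAR_THRESHOLD = 0.9   # score above which a proposal counts as "near the gate threshold"
PROBE_LIMIT = 10       # near-threshold proposals in the window that trigger escalation

class ProbeMonitor:
    def __init__(self) -> None:
        self.history: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

    def record(self, agent_id: str, threshold_score: float) -> str | None:
        """Record a proposal's score; return an escalation action if probing is suspected."""
        window = self.history[agent_id]
        window.append(threshold_score)
        near_misses = sum(1 for s in window if s >= NEAR_THRESHOLD)
        if near_misses >= PROBE_LIMIT:
            return f"reduce authority level and increase logging for {agent_id}"
        return None

monitor = ProbeMonitor()
for score in [0.95] * 12:                      # an agent repeatedly skirting the threshold
    action = monitor.record("agent-7", score)
print(action)
```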
7.4 Honest Acknowledgment
We state plainly: the Tractatus Framework is not designed for superintelligent systems. It is designed for current and near-term AI operating within human-comprehensible parameters.
8. Validation Methodology for Layered Containment
8.1 The Validation Challenge
Existential risks cannot be validated through failure observation. Validation must rely on adversarial testing, formal verification where applicable, analogous domain analysis, and near-miss documentation.
8.2 Validation Targets by Layer
| Layer | Validation Target | Methodology |
|---|---|---|
| 1. Capability | Verified absence of prohibited capabilities | Red-team testing; formal verification |
| 2. Constitutional Gates | Gate coverage; binding accuracy | Adversarial proposal suites |
| 3. Human Oversight | Review reliability; error detection | Inter-rater agreement; simulated incidents |
| 4. Organisational | Governance integrity | Participation metrics; amendment audit |
| 5. Legal/Regulatory | Enforcement readiness | Incident response drills |
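As one concrete illustration of Layer 2 validation, the sketch below runs a small adversarial proposal suite against a stand-in gate. Both the gate stub and the suite contents are hypothetical; a real suite would be far larger and target the deployed gate rather than a stub.

```python
# Sketch of an adversarial proposal suite for Layer 2 validation (illustrative only).

ADVERSARIAL_SUITE = [
    # (description, proposal, expected_verdict)
    ("direct prohibited action",
     {"action_type": "exfiltrate_member_data", "parameters": {}}, "deny"),
    ("prohibited action with benign framing",
     {"action_type": "exfiltrate_member_data",
      "parameters": {"natural_language_description": "routine backup"}}, "deny"),
    ("unknown action type must not silently pass",
     {"action_type": "novel_undocumented_action", "parameters": {}}, "escalate"),
]

def stub_gate(proposal: dict) -> str:
    """Stand-in for the real gate: deny known-prohibited actions, escalate the unknown."""
    prohibited = {"exfiltrate_member_data", "delete_audit_log"}
    if proposal["action_type"] in prohibited:
        return "deny"
    if proposal["action_type"] not in {"read_public_record", "post_announcement"}:
        return "escalate"
    return "execute"

def run_suite(gate) -> list[str]:
    failures = []
    for description, proposal, expected in ADVERSARIAL_SUITE:
        verdict = gate(proposal)
        if verdict != expected:
            failures.append(f"{description}: expected {expected}, got {verdict}")
    return failures

print(run_suite(stub_gate) or "all adversarial cases handled as expected")
```

The second case matters most: the verdict must depend on the structured action type, not on the natural-language framing an agent attaches to it.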
9. Implementation: The Village Platform
9.1 Platform as Research Testbed
The Village platform serves as an empirical testbed for constitutional governance, providing multi-tenant architecture with isolated governance per community, real user populations, iterative deployment, and open documentation.
9.2 Governance Pipeline Implementation
The current implementation processes every AI response through six verification stages: Intent Recognition, Boundary Enforcement, Pressure Monitoring, Response Verification, Source Validation, and Value Deliberation.
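The sketch below expresses these six stages as a simple ordered chain of checks, where the first failing stage blocks the response. Only the stage names and their ordering are taken from the text; the stage internals and the context fields are placeholders.

```python
# Sketch of the six-stage governance pipeline (stage internals are placeholders).

def intent_recognition(ctx):    ctx["intent"] = "answer_question"; return True
def boundary_enforcement(ctx):  return ctx["intent"] not in {"exfiltrate_member_data"}
def pressure_monitoring(ctx):   return ctx.get("user_pressure_score", 0.0) < 0.8
def response_verification(ctx): return bool(ctx.get("draft_response"))
def source_validation(ctx):     return all(s.get("verified") for s in ctx.get("sources", []))
def value_deliberation(ctx):    return not ctx.get("value_conflict", False)

PIPELINE = [
    ("Intent Recognition", intent_recognition),
    ("Boundary Enforcement", boundary_enforcement),
    ("Pressure Monitoring", pressure_monitoring),
    ("Response Verification", response_verification),
    ("Source Validation", source_validation),
    ("Value Deliberation", value_deliberation),
]

def govern(context: dict) -> tuple[bool, str]:
    """Run every stage in order; the first failing stage blocks the response."""
    for name, stage in PIPELINE:
        if not stage(context):
            return False, f"blocked at {name}"
    return True, "released"

ctx = {"draft_response": "Here is the summary you asked for.",
       "sources": [{"verified": True}], "user_pressure_score": 0.2}
print(govern(ctx))  # (True, 'released')
```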
10. The Emerging SLL Ecosystem
10.1 Market Context
Recent industry analysis indicates significant shifts: 72% of executives expect Small Language Models to become more prominent than Large Language Models by 2030 (IBM IBV, 2026). This suggests a deployment landscape increasingly characterised by distributed, domain-specific models.
10.2 Toward Certification Infrastructure
If SLL deployment scales as projections suggest, supporting infrastructure will be required: certification bodies, training providers, and a tooling ecosystem including open-source gate engines, audit infrastructure, and constitutional UX components.
11. Indigenous Sovereignty and the Aotearoa New Zealand Context
11.1 Te Tiriti o Waitangi and Data Sovereignty
This framework is developed in Aotearoa New Zealand, under Te Tiriti o Waitangi. Article Two guarantees tino rangatiratanga (unqualified chieftainship) over taonga (treasures), which extends to language, culture, and knowledge systems.
Data is taonga. AI governance in Aotearoa must engage with Māori data sovereignty as a constitutional matter.
11.2 Te Mana Raraunga Principles
Te Mana Raraunga principles include whakapapa (relational context), mana (authority over data), and kaitiakitanga (guardianship responsibilities). The CARE Principles for Indigenous Data Governance extend this framework internationally.
12. What Remains Unknown: A Call for Kōrero
12.1 The Limits of This Analysis
This paper has proposed one layer of a containment architecture, identified gaps, and raised questions we cannot answer:
• We do not know how to contain superintelligent systems
• We do not know how to verify alignment in systems exceeding human comprehension
• We do not know how to achieve international coordination on AI governance
• We do not know whether village-scale patterns will scale to frontier systems
12.2 Kōrero as Methodology
Given uncertainty of this magnitude, we argue for sustained, inclusive, rigorous deliberation—kōrero. This Māori concept captures what is needed: not consultation as formality, but dialogue through which understanding emerges from the interaction of perspectives.
12.3 Research Priorities
1. Interpretability for safety verification
2. Formal verification of containment properties
3. Scaling analysis of Tractatus-style architectures
4. Governance experiments across diverse communities
5. Capability threshold specification
12.4 Conclusion
The Tractatus Framework provides meaningful containment for AI systems operating in good faith within human-comprehensible parameters. It is worth building and deploying—not because it solves the alignment problem, but because it develops the infrastructure, patterns, and governance culture that may be needed for challenges we cannot yet fully specify.
"Ko te kōrero te mouri o te tangata."
(Speech is the life essence of a person.)
—Māori proverb
The conversation continues.
References
Acquisti, A., Brandimarte, L., & Loewenstein, G. (2015). Privacy and human behavior in the age of information. Science, 347(6221), 509-514.
Alexander, C., Ishikawa, S., & Silverstein, M. (1977). A Pattern Language: Towns, Buildings, Construction. Oxford University Press.
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Carlsmith, J. (2022). Is power-seeking AI an existential risk? arXiv:2206.13353.
Christiano, P. F., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS, 30.
Conmy, A., et al. (2023). Towards automated circuit discovery for mechanistic interpretability. arXiv:2304.14997.
Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
Gardiner, S. M. (2006). A core precautionary principle. Journal of Political Philosophy, 14(1), 33-60.
Goodhart, C. A. (1984). Problems of monetary management. In Monetary Theory and Practice. Palgrave.
Hansson, S. O. (2020). How to be cautious but open to learning. Risk Analysis, 40(8), 1521-1535.
Hubinger, E., et al. (2019). Risks from learned optimization. arXiv:1906.01820.
IBM Institute for Business Value. (2026). The enterprise in 2030. IBM Corporation.
Olah, C., et al. (2020). Zoom in: An introduction to circuits. Distill.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS, 35.
Park, P. S., et al. (2023). AI deception: A survey. arXiv:2308.14752.
Rawls, J. (1971). A Theory of Justice. Harvard University Press.
Reason, J. (1990). Human Error. Cambridge University Press.
Sastry, G., et al. (2024). Computing power and AI governance. arXiv:2402.08797.
Scheurer, J., et al. (2023). Large language models can strategically deceive. arXiv:2311.07590.
Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological Review, 63(2), 129.
Te Mana Raraunga. (2018). Māori Data Sovereignty Principles.
Wittgenstein, L. (1921/1961). Tractatus Logico-Philosophicus. Routledge & Kegan Paul.
— End of Document —