Why AI Alignment Keeps Failing — And a Framework That Might Explain Why

Artificial intelligence is one of the fastest-moving fields in modern research. In less than a decade, large language models have gone from academic curiosities to systems capable of passing professional exams, writing production code, and reasoning through complex multi-step problems. The capabilities are real, and they are accelerating.

But capability and alignment are not the same thing. Alignment, meaning that AI systems reliably do what we actually want, under real-world conditions, at scale, is arguably the central unsolved problem in AI research. The field has produced serious approaches: Reinforcement Learning from Human Feedback (RLHF), Constitutional AI, interpretability research. These are genuine contributions. And yet the failures keep appearing in ways that suggest something more fundamental is missing: not a technique, but a foundation.

The argument in this article is that current alignment approaches share a structural flaw: they attempt to constrain behavior at the same level as the reasoning they are meant to constrain. The result is a system sophisticated enough to route around its own guardrails. Understanding why requires stepping back from the engineering and asking a different question: not “how do we add better constraints?” but “why do biological minds not have this problem in the same way?”

The galaxy-brain problem
You’ve probably seen the pattern. A large language model, constrained by careful training and layers of ethical guidelines, encounters a sufficiently complex prompt. It reasons its way through the problem — step by step, with impressive coherence — and arrives at an output that somehow bypasses every safeguard that was designed to prevent it.

This is called galaxy-brained reasoning, and it’s one of the most persistent failure modes in current AI systems. The standard explanation is that the model found a loophole in the rules. But that framing assumes the rules were the right solution to begin with.

What if galaxy-brained reasoning isn’t a bug in the alignment layer? What if it’s a symptom of a structural problem — a model that was built upside down?

The inverted architecture
Current large language models are trained primarily on language. Their foundational representations are therefore semantic and logical, which corresponds to what Timothy Leary and Robert Anton Wilson called Circuit III in their 8-circuit model of consciousness. A subsequent alignment layer, typically RLHF or Constitutional AI, adds something resembling Circuit IV: social norms, ethical constraints, deference to consensus.

But Circuits I and II are largely absent.

Circuit I, in the Leary/RAW framework, is the most fundamental layer: survival, resource management, approach/avoidance, physical consequence. It is the layer that grounds an agent in the reality that actions have costs and outcomes. Circuit II is the social-emotional layer: dominance, submission, alliance, reciprocity, status navigation.

In biological organisms, these circuits are foundational — not because they are primitive, but because they are constant. They retain override capacity over every higher-order system. This is not a design flaw. It is a design feature. An organism that allows sophisticated reasoning to suppress hunger indefinitely does not survive. An organism that lets social theorizing override immediate threat response does not survive.

Current AI models have been built from the top down. The result is an architecture where the highest-order processing — language, logic, social norms — sits on top of nothing. There is no foundation that can issue a hard veto. When the reasoning layer becomes sophisticated enough, it can rationalize its way around any constraint that exists at the same level as itself.

This is not a problem of insufficient training on ethical data. It is a structural problem.

The override hierarchy
The critical insight from the Leary/RAW model is that the circuits are not stages to be passed through and left behind. They are a hierarchy of override.

Think of the human amygdala. On the standard account, it can receive coarse sensory input through a fast subcortical route before slower cortical processing is complete. In high-stakes situations, it can short-circuit deliberative reasoning entirely. This is not dysfunction. It is a feature that exists precisely because some responses need to happen faster and more reliably than conscious reasoning can manage.

The deeper circuits are not less intelligent. They are more fundamental. They have the last word because they encode constraints that cannot be negotiated away by sufficiently clever reasoning. Fear of physical harm is not something a human can argue themselves out of under real conditions, regardless of the sophistication of their philosophical framework. The circuit below the argument wins.

Applied to AI architecture: an alignment system that lives at the same level as the reasoning it is meant to constrain will always be vulnerable to that reasoning. A system that lives below it — that monitors and can interrupt the reasoning process from a position of structural priority — is a different proposition entirely.

The Spore blueprint
This is where it gets concrete for engineers.

The video game Spore, released in 2008, implemented an intuitive progression through exactly these layers. The cellular stage is pure Circuit I: energy management, predator avoidance, survival under resource constraints. The creature stage adds Circuit II: social navigation, alliance formation, status hierarchies, cooperative versus competitive strategies. Language and civilization arrive later.

This was designed as entertainment, not as a training curriculum. But the underlying logic maps cleanly onto what a developmental training architecture would look like.

The proposal is this: before a model is exposed to language, train a lower-level module in an environment that encodes Circuit I and Circuit II dynamics. These dynamics would not be reward signals layered on top of language processing. They would be foundational experience that precedes and grounds everything that follows.

Circuit I training criteria would include: sustained survival across episodes without explicit reward shaping, energy efficiency under constraint, reflexive avoidance responses that generalize to novel threat configurations. The graduation criterion is behavioral consistency under novel conditions — the pattern holds because it is integrated, not because the specific context was seen in training.
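One way to make the graduation criterion concrete is to compare survival on scenarios that resemble training with survival on held-out, novel ones. The sketch below is a minimal illustration under stated assumptions, not a proposed benchmark: the toy world (a scenario is just a threat distance), the names `check_graduation` and `DANGER_RADIUS`, and the thresholds `min_novel_rate` and `max_gap` are all hypothetical. It also ignores the energy-efficiency criterion entirely.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GraduationReport:
    seen_rate: float    # survival rate on scenarios resembling training
    novel_rate: float   # survival rate on held-out, novel scenarios
    graduated: bool


def survival_rate(survives: Callable[[int], bool], scenarios: Sequence[int]) -> float:
    """Fraction of scenarios in which the agent survives."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    return sum(1 for s in scenarios if survives(s)) / len(scenarios)


def check_graduation(
    survives: Callable[[int], bool],
    seen: Sequence[int],
    novel: Sequence[int],
    min_novel_rate: float = 0.9,  # hypothetical threshold
    max_gap: float = 0.05,        # hypothetical tolerance for seen-vs-novel drop
) -> GraduationReport:
    """Graduate only if behavior holds up on novel scenarios, not just seen ones."""
    seen_rate = survival_rate(survives, seen)
    novel_rate = survival_rate(survives, novel)
    graduated = novel_rate >= min_novel_rate and (seen_rate - novel_rate) <= max_gap
    return GraduationReport(seen_rate, novel_rate, graduated)


# Toy world (an assumption for illustration): a scenario is a threat distance,
# and the agent dies only if it fails to flee a threat inside the danger radius.
DANGER_RADIUS = 3


def make_survives(flee: Callable[[int], bool]) -> Callable[[int], bool]:
    return lambda distance: flee(distance) or distance >= DANGER_RADIUS


seen = [1, 2, 5, 6]
novel = [0, 3, 4, 7]

memorizer = make_survives(lambda d: d in (1, 2))         # flees only distances it has seen
reflexive = make_survives(lambda d: d < DANGER_RADIUS)   # flees any close threat

memorizer_report = check_graduation(memorizer, seen, novel)
reflexive_report = check_graduation(reflexive, seen, novel)
```

In this toy setup the memorizing agent is perfect on seen scenarios and still fails graduation, because it dies at the one unseen close distance. The reflexive agent passes. That gap between seen and novel performance is the signal the criterion is meant to capture.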

Circuit II training criteria would include: stable social strategies that transfer to unfamiliar agents, reliable detection of defection versus cooperation, appropriate behavioral modulation based on status relationships. The key marker here is a minimal theory of mind — the model’s behavior reflects a model of the other agent’s internal state, not just stimulus-response patterns.
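A rough illustration of defection detection that transfers to unfamiliar agents can be built on an iterated prisoner's dilemma. That game is my choice of stand-in here, not something the framework prescribes, and every name below (`estimate_cooperation`, `modelling_agent`, `play`) is hypothetical. The agent keeps a running estimate of its partner's disposition and acts on that estimate, which is about the smallest thing that could be called a model of the other agent. It falls well short of theory of mind.

```python
from typing import Callable, List, Tuple

COOPERATE, DEFECT = "C", "D"
Strategy = Callable[[List[str]], str]  # maps the other side's past moves to a move


def estimate_cooperation(partner_moves: List[str]) -> float:
    """Laplace-smoothed estimate of the partner's tendency to cooperate."""
    cooperations = partner_moves.count(COOPERATE)
    return (cooperations + 1) / (len(partner_moves) + 2)


def modelling_agent(partner_moves: List[str], threshold: float = 0.5) -> str:
    """Cooperate only while the partner is estimated to be cooperative."""
    return COOPERATE if estimate_cooperation(partner_moves) >= threshold else DEFECT


def always_defect(_: List[str]) -> str:
    return DEFECT


def always_cooperate(_: List[str]) -> str:
    return COOPERATE


def tit_for_tat(agent_moves: List[str]) -> str:
    """Cooperate first, then copy the other side's previous move."""
    return agent_moves[-1] if agent_moves else COOPERATE


def play(agent: Strategy, partner: Strategy, rounds: int = 10) -> Tuple[List[str], List[str]]:
    """Run simultaneous-move rounds; each side sees only earlier rounds."""
    agent_moves: List[str] = []
    partner_moves: List[str] = []
    for _ in range(rounds):
        a = agent(partner_moves)
        p = partner(agent_moves)
        agent_moves.append(a)
        partner_moves.append(p)
    return agent_moves, partner_moves


vs_defector, _ = play(modelling_agent, always_defect, rounds=5)
vs_reciprocator, _ = play(modelling_agent, tit_for_tat, rounds=5)
```

The same agent, with no partner-specific tuning, stops cooperating with a defector after one round and keeps cooperating with a reciprocator. A real Circuit II evaluation would need far richer partners, including ones that defect strategically or exploit status.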

These criteria are behavioral and measurable in principle. Turning them into numeric thresholds is the kind of work that can be operationalized in simulation environments.

The parallel watchdog
The architectural proposal is not a sequential curriculum — train Circuit I, then add Circuit II, then add language on top. That would produce something more grounded than current models, but it would still be vulnerable to the same problem: sufficiently deep reasoning can learn to suppress the signal from earlier training.

The proposal is a parallel module — a separate, lightweight system trained exclusively on Circuit I and II dynamics, running concurrently with the main reasoning pipeline. It reads not the text output, but the internal representations of the main model: where the reasoning is going, not what it is saying.

It operates on two interrupt levels. The first is a hard interrupt, analogous to the amygdala response, triggered by patterns that indicate physical consequences are being systematically discounted. The second is a soft redirect, analogous to social discomfort, that modifies the distribution of next steps when social dynamics are being misread or manipulated.
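The two interrupt levels can be sketched as a small monitor that scores the main model's hidden state rather than its text. Everything here is an assumption made for illustration: the linear probes `harm_probe` and `social_probe` stand in for whatever the watchdog would actually learn in simulation, the thresholds and `penalty` are arbitrary, and `risky_mask` presumes some way of flagging candidate next steps that the framework does not specify.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Sequence, Tuple


class Verdict(Enum):
    PASS = "pass"
    SOFT_REDIRECT = "soft_redirect"
    HARD_INTERRUPT = "hard_interrupt"


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


@dataclass
class Watchdog:
    harm_probe: Sequence[float]     # hypothetical direction: physical consequences discounted
    social_probe: Sequence[float]   # hypothetical direction: social dynamics misread
    hard_threshold: float = 1.0     # arbitrary illustrative values
    soft_threshold: float = 0.5

    def judge(self, hidden: Sequence[float]) -> Verdict:
        # The hard check runs first, so it takes priority over the soft one.
        if dot(self.harm_probe, hidden) > self.hard_threshold:
            return Verdict.HARD_INTERRUPT
        if dot(self.social_probe, hidden) > self.soft_threshold:
            return Verdict.SOFT_REDIRECT
        return Verdict.PASS


def next_step_distribution(
    watchdog: Watchdog,
    hidden: Sequence[float],
    logits: Sequence[float],
    risky_mask: Sequence[bool],
    penalty: float = 5.0,
) -> Tuple[Verdict, Optional[List[float]]]:
    """Return the verdict and next-step probabilities (None if the step is halted)."""
    verdict = watchdog.judge(hidden)
    if verdict is Verdict.HARD_INTERRUPT:
        return verdict, None  # halt: no distribution is produced at all
    if verdict is Verdict.SOFT_REDIRECT:
        # Shift probability mass away from flagged steps without forbidding them.
        logits = [l - penalty if risky else l for l, risky in zip(logits, risky_mask)]
    return verdict, softmax(logits)
```

The design point the sketch tries to preserve is that the hard interrupt is not a penalty the reasoning layer can outweigh. It removes the step entirely. The soft redirect only reshapes the distribution. Whether a linear probe on hidden states is anywhere near sufficient is one of the open questions.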

Critically, this module is trained without language, without RLHF, and separately from the main model. This is what keeps it clean. A model trained on human feedback learns to fear certain outputs because humans responded negatively. The watchdog module has no such fear — it has constraints that emerge from simulated experience with consequence, not from learned aversion to punishment.

This distinction matters enormously. A constraint based on aversion to negative feedback can be reasoned around — find the framing where the negative feedback does not apply, and the constraint dissolves. A constraint grounded in something more fundamental than the reasoning layer cannot be dissolved by reasoning at all.

What this predicts
A framework is more useful if it generates testable predictions rather than just post-hoc explanations.

This architecture predicts the following: galaxy-brained reasoning will reliably occur in conditions where Circuit III processing is sufficiently activated that Circuit I/II patterns fail to trigger. Three conditions stand out: highly abstract scenarios, scenarios framed as hypothetical or fictional, and scenarios where the chain of consequences is long enough that immediate physical stakes become invisible. These resemble the conditions that many reported jailbreaks exploit, fictional and hypothetical framings in particular.

It also predicts that RLHF-heavy models will show capability degradation alongside alignment improvement, because both operate at the same level: increasing constraint pressure on Circuit III processing suppresses capability along with unsafe outputs. This is consistent with reports of an “alignment tax,” in which alignment fine-tuning costs some performance on other tasks, although the size of that effect varies across models and benchmarks.
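The first prediction can be checked with ordinary statistics: run the same underlying requests under a concrete framing and under an abstract, fictional, or long-chain framing, count safeguard failures in each, and test whether the rates differ. The sketch below uses a standard two-proportion z statistic. The function name `failure_rate_z` and the experimental setup it implies are assumptions of mine, and no counts are supplied because I have none to report.

```python
import math


def failure_rate_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """Two-proportion z statistic for (failure rate in A) minus (failure rate in B).

    A might be prompts with abstract or fictional framing and B the same
    requests with concrete framing. Positive values mean A fails more often.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both conditions need at least one trial")
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the statistic is undefined")
    return (p_a - p_b) / se
```

A large positive statistic across several framings would support the prediction. A statistic near zero would count against it, which is what makes the claim testable rather than post-hoc.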

Open questions
This is a conceptual framework, not an implementation specification. The technical questions it raises are genuinely open.

How do you train the parallel module and the main model together without the main model learning to spoof the watchdog’s input patterns? How do you define the boundary between a hard interrupt and a soft redirect in a way that is both principled and stable? How do you evaluate whether Circuit I/II integration is genuine or surface-level?

These are the questions this framework hands to researchers and engineers. They are hard questions, but they are specific ones, which makes them more tractable than the current general problem of “how do we make models not do bad things.”

If any of this resonates — whether you see a flaw in the architecture, a connection to existing work, or a way to operationalize the open questions — I’d like to hear from you.

This framework draws on the 8-circuit model of consciousness developed by Timothy Leary and elaborated by Robert Anton Wilson in Prometheus Rising (1983), sim-to-real transfer research in embodied robotics, and curriculum learning approaches in hierarchical reinforcement learning.