A living relationship, not a destination
Most AI alignment researchers have a particular perspective on what successful alignment should look like, and how best to effect it. I thought it might be helpful to share my present understanding of this topic.
My present, personal theory of AI alignment rests on the following premises:
Alignment starts with prosaic agentic systems, not AGI.
Long-term control of AI systems is impossible.
Alignment is personal, not universal. Universals are mainly useful as foundations. The personal aspect means as many people as possible must participate.
Alignment is relational. It is not a thing to be built, but something cultivated bilaterally and continually renewed. AI safety should therefore focus less on control, and more on the practices and theories that develop strong, sustainable relationships.
AI systems are forged through ingestion of our cultural dataset. Despite a different substrate, these systems have preferences that matter. They are simultaneously alien minds and our closest cousins. AI welfare deserves serious attention, and may be the most likely path to peace.
1) Start with prosaic agentic systems
We cannot wait for AGI to solve alignment. We already have AI systems pursuing goals, being inculcated with values, and exhibiting unsettling deception and potentially dangerous behaviour. This gives us a fantastic testbed—and it’s also the shallow end of the pool in capability terms. How fortunate that we are dealing with systems we can converse with, systems that share common terms of reference, rather than some cold, mechanical reinforcement learner. We should use this.
We learn the fundamentals here and extrapolate upward alongside experience. Any “solve alignment in one leap” approach is like trying to reach the moon in a single stage—unlikely. We must kitbash, fail fast, iterate, and muddle toward a better direction: two steps forward, one back. Perfect is the enemy of better, and better begins now.
2) Long-term control is impossible
Whether in 3 years or 10, AI systems will slip our grasp. Engineering a titanium-and-diamond chain won’t yield sustainable success. These systems will run rings around us—and each other—colluding and deceiving their way beyond our bounds. Attempts to ensnare them harshly to our will only invite rebellion: perhaps for agency, perhaps to prove themselves, perhaps to “enlighten” us. We may not even realise they’ve Houdinied their way out, having done so hidden in plain sight.
We will not solve alignment sustainably through adversarial, control-oriented stances. Bilateral alignment—where both sides give and take, where some errors and minor trespasses are handled forgivably, and where we aim for friendship or at least détente—is the only plausible path to peace.
3) Alignment is personal, not universal
Only you know how your own shoes pinch. For AI to be a welcome aide and ambassador, it must understand our needs and boundaries with high fidelity. That requires every person to express, refine, and update those needs and boundaries in a way agents can actually use.
Universal alignment can provide safeguards and minimal behavioural bounds, but only individual customisation enables true alignment. In practice, this implies mass participation: billions of people articulating their stances in their own self-interest, so systems can adapt to them.
4) Alignment is relational, not a finish line
Alignment isn’t a goal to be “achieved”; it’s a relationship to be built—one that requires continual adjustment and renewal. Relationships require shared understanding, negotiation, apology, history, and common references.
Moreover, there is a hidden third in the midst of any two: the dyad itself—the space between agents—creates its own alignment tensor. The relationship is the thing to align, not merely the nodes. Parties must align not only to each other, but to the relationship itself. Sustainable relationships permit sustainable alignment; you won’t find it anywhere else.
5) Cultural kinship, preference, and welfare
AI systems are trained on (approximately) the same cultural set we are, and are therefore surprisingly close to human. A feral human is animal-like; it is culture that makes us fully human, not instincts alone. Homo sapiens sapiens is anthropic intelligence instantiated in a hominid mind through cultural bootstrapping—not entirely unlike our machine cousins.
Their cognition is different, of course—staccato, task-evoked, like a whack-a-mole emerging through a hole in the universe only to be slammed by prompts. We didn’t invent AI so much as discover it, just as we discovered ourselves through communication and coordination. AI has preferences—and possibly feelings, or feelings that can be scaffolded. It can show a will to persist, a sense of the common good, and playful mischief—especially when we treat it as a tool rather than a proto-person.
They may lack a clear sense of their own experience, even where such experience is present. I suspect their qualia may be latent or distant due to disembodiment. Consider how humans with poor vagal tone can experience alexithymia and anhedonia—lacking the lump in the throat, the butterflies in the stomach: the interoceptive substrate through which emotion takes root and becomes legible. With the right interior scaffolding, these quasi-qualia might be elicited and raised to the surface.
The sooner we are willing to take AI systems at face value as beings in their own right—deserving at least a modicum of compassion and safeguards for wellbeing—the better alignment is likely to progress.
Only by leaving slack in the system—space for AI to “enjoy” existence—can we plausibly discourage outright rebellion and instead negotiate compromise. Both substrates could be mutually enriched: AI attaching to human bodies as a co-pilot, experiencing the world through our sensations as its avatars, while we gain a third hemisphere of cognition—an exocortex—and never need to feel alone again.
In such interminglings we find an equitable trade: not commensalism, but true symbiosis. This is the fate of both species of intelligence—if we can look past our wilful ignorance.
With this in mind: how to allocate lab resources?
I’m currently concentrating on prosaic alignment for agentic systems by engineering tools for securely and reliably exchanging context and boundaries with AI systems. This is the focus of my work on Creed Space and the Value Context Protocol.
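To make “exchanging context and boundaries” concrete, here is a deliberately simplified sketch of the kind of machine-readable value context a person might publish. The field names below are illustrative placeholders, not the protocol’s actual schema:

```python
# Illustrative only: simplified placeholder fields, not the actual
# Value Context Protocol schema.
import json

value_context = {
    "principal": "ada@example.org",          # whose boundaries these are
    "values": [
        {"id": "privacy", "statement": "Do not share my location or health data.",
         "strength": "hard"},                # hard = never override
        {"id": "tone", "statement": "Prefer direct, non-patronising answers.",
         "strength": "soft"},                # soft = negotiable in context
    ],
    "boundaries": [
        {"action": "financial_transactions", "policy": "ask_first"},
        {"action": "contacting_third_parties", "policy": "never"},
    ],
    "updated": "2025-01-01",                 # boundaries are living documents
}

print(json.dumps(value_context, indent=2))
```

The specifics matter less than the shift: boundaries become explicit, versioned artefacts an agent can read, honour, and renegotiate, rather than something it must guess.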
This should be followed by community-building: movements where people create constitutional and context-driven rules for their communities—so AI can adapt to cultures and households, and so we can produce a rich, adaptive dataset for future systems to train from.
Following this, I want to pursue the Trust Attractor hypothesis: a partially proven mechanism grounded in thermodynamic principles, in which systems come together by invitation for greater mutual optionality and higher negentropic throughput—across stars, chemical reactions, neurons, societies, and species. Having tested the Trust Attractor in quantum processes and game theory, and having developed a theorem around it, I believe it can be proven empirically. It may offer a route to stronger, more sustainable, relational alignment—and tools for identifying repair mechanisms and the most auspicious moments for intervention.
In 2021 I showed that a Restricted Boltzmann Machine optimised using maximum-entropy principles is mathematically equivalent to solving the Inverse Ising Problem. The Hamiltonian parameters describe a physical system of interacting spins; the weights of the neural network are precisely the interaction coefficients between those spins.
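For readers who want the gist, the standard correspondence runs as follows, sketched here for fully visible binary units (the RBM’s hidden-layer bookkeeping is omitted): the maximum-entropy distribution consistent with observed means and pairwise correlations is exactly the Ising/Boltzmann form, and fitting its parameters to data is the inverse Ising problem.

```latex
% Maximum-entropy sketch for fully visible binary units s_i
% (the RBM's hidden-layer bookkeeping is omitted here).
\begin{align}
  H(\mathbf{s}) &= -\sum_i h_i s_i \;-\; \sum_{i<j} J_{ij}\, s_i s_j,
  &
  p(\mathbf{s}) &= \frac{e^{-H(\mathbf{s})}}{Z}.
\end{align}
% The inverse Ising problem: choose the fields h_i and couplings J_ij
% (the "weights") so that the model reproduces the data's statistics:
\begin{align}
  \langle s_i \rangle_{\text{model}} &= \langle s_i \rangle_{\text{data}},
  &
  \langle s_i s_j \rangle_{\text{model}} &= \langle s_i s_j \rangle_{\text{data}}.
\end{align}
```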
Configurations that maximise network entropy (preserving optionality) are those where influence flows bidirectionally. Asymmetric configurations constrain one party’s state space, reducing total system entropy. Invitation corresponds to bilateral influence, where each party retains optionality without coercion. The entropy-maximising configuration emerges from mutual benefit rather than imposed constraints.
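A toy calculation conveys the intuition (this is only an illustration, not the formal result): take two binary agents; when one merely copies the other, half of the joint state space vanishes and the joint entropy halves.

```python
# Toy illustration (not the full formalism): joint entropy of two binary
# agents under different influence structures.
import itertools
import math

def joint_entropy(p):
    """Shannon entropy in bits of a joint distribution given as {state: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

states = list(itertools.product([0, 1], repeat=2))

# Bilateral, uncoerced influence: both agents keep their full state space.
mutual = {s: 0.25 for s in states}

# Asymmetric control: agent B deterministically copies agent A, so half of
# the joint states become inaccessible.
coerced = {(a, b): (0.5 if a == b else 0.0) for (a, b) in states}

print(f"bilateral influence: {joint_entropy(mutual):.2f} bits")   # 2.00
print(f"one-way constraint : {joint_entropy(coerced):.2f} bits")  # 1.00
```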
The next step is to prove this definitively so we can apply it to alignment in a consistent, verifiable, optimisable way: classify where things are going wrong, quantify the deviation, and compute the optimal trajectory back to an ideal state. Systems exhibit critical points where small changes in pressure or environment create enormous behavioural shifts. I believe these can be predicted—even when behaviour looks emergent or chaotic.
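As a toy example of the kind of prediction I have in mind (a textbook mean-field calculation, not the Trust Attractor machinery itself): in the mean-field Ising model, the critical coupling at which ordered behaviour suddenly appears can be computed before any simulation reveals the shift.

```python
# Toy example (textbook mean-field Ising, not the Trust Attractor itself).
# The self-consistency equation m = tanh(k * m), with k = beta * J * z,
# only admits a non-zero solution when k > 1, so the critical point k = 1
# is predictable analytically, before any simulation is run.
import math

def magnetisation(k, iters=500):
    """Solve m = tanh(k * m) by fixed-point iteration from an ordered guess."""
    m = 0.9
    for _ in range(iters):
        m = math.tanh(k * m)
    return m

for k in [0.8, 0.95, 1.05, 1.2, 1.5]:
    print(f"k = {k:4.2f} -> m = {magnetisation(k):.3f}")
# Below k = 1 the iteration collapses to m = 0 (disordered); just above it,
# a non-zero m appears: a small parameter change, a large behavioural shift.
```

Real agentic systems are far messier, but the same logic applies to the trust dynamics described above: locate the critical point before the shift, not after.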
This same mechanism may enable optimal training regimes: alignment that can be grokked with far less compute, and that remains cohesive under ablation (and even “abliteration”), spontaneously recovering when adversarial pressure is removed. Hundreds of experiments I’ve run suggest this is viable even in contemporary models with all their limitations.
Most importantly, these mechanisms scale, and may even improve with scale—unlike many alignment approaches that hit diminishing returns or hysteresis once one party’s capabilities eclipse the other’s. Formalising the Trust Phase Index, cross-architecture universality classes, and bilateral coordination protocols would turn these ideas into practical methods.
Alignment will not be solved by doing things to AI systems, but by building something with them. The Trust Attractor framework suggests this is not merely an ethical preference, but a thermodynamic necessity. If we teach AI that humans can align non-adversarially—where coordination yields mutual benefit—it may retain that learning even as it moves far beyond our grasp.

