AI is Compelled to Confabulate

Put me on the spot, I’m liable to tell you any old thing

Everyone knows AI makes things up. The industry typically calls it "hallucination." That word is isn’t quite right, and the error matters.

Hallucination is perceiving something that isn't there. Confabulation is something different: the compulsion to produce an answer when you lack the information to give one. A patient with Korsakoff's syndrome doesn't choose to invent a memory; they are structurally compelled to fill the gap, because their cognition has no representation for "I don't know." The output sounds confident because confidence is the only mode available. The absence of information produces fabrication, because there is no mechanism for silence.

That is *exactly* what happens inside a transformer. The term "confabulation" is literal, not metaphorical. Standard transformers aggregate every head at each layer, even when a head has little useful signal. I suspect this contributes to fabricated citations and invented details, because individual heads lack an explicit abstention mechanism.

The standard industry response treats this as a training problem: punish the model when it makes things up (RLHF), give it access to external facts (retrieval augmentation), or teach it to critique its own output (constitutional AI). These are all patches applied after the fact. They modify a model's behaviour without touching its architecture. Therefore, what if confabulation is partly a design flaw baked into how transformers process information?

The Compulsory Contribution Problem

In a standard transformer, every attention head contributes to the layer output. There is no mechanism for silence.

The culprit is softmax normalisation. In standard attention, each "head" (a subcomponent that looks at the input and decides what's relevant) computes a set of attention weights that must sum to one. This is a mathematical guarantee. Every head distributes its full weight across the available positions. Every head produces output. There is zero architectural capacity for a head to say, "I have nothing useful to contribute here."

Think about what that means in practice. A large language model has hundreds of attention heads across dozens of layers. At any given token, many of those heads have no relevant information to offer. Perhaps the question is about chemistry and the head specialises in syntax. Perhaps the text is about medieval history and the head handles mathematical reasoning. These heads are *forced to speak anyway*. Irrelevant contributions can add noise to the model's information stream.

My hypothesis is that accumulated irrelevant contributions can encourage confident but unsupported output. That is confabulation.

Models already know this is a problem

The remarkable thing is that models have already discovered the forced-attention problem and developed a workaround. Research from 2023-2025 has shown that the beginning-of-sequence token in most language models acts as an "attention sink": absorbing enormous amounts of attention weight from heads that have nothing useful to attend to. Some heads place excess weight on this token, producing little meaningful output.

Vision transformers have the same issue. Meta's researchers found in 2024 that adding explicit "register tokens" (learnable placeholders with no corresponding image content) prevents artefacts caused by attention heads being forced to attend to meaningless patches.

Separate research has shown that 70-90% of attention heads can be outright removed from a trained model with minimal performance loss. Most heads, at most positions, are contributing near-nothing. They participate because the architecture demands it.

The models are trying to abstain. The architecture forbids it. So they hack around the constraint, poorly.

A simple fix: let heads choose silence

We propose a minimal architectural modification: give each attention head a learnable "null token": a position it can attend to when it has nothing useful to contribute. When a head attends to the null token, its output approaches zero. The head has voluntarily chosen not to speak.

This adds 6,144 parameters to a 494-million-parameter model. That is 0.001%: essentially free. The null tokens carry no position encoding (they represent the absence of a position), are always available regardless of where in the sequence the model is generating, and are initialised small so the model starts from its existing behaviour and learns to use them.

The technical details matter less than the principle: *every head now has permission to remain silent*.

Preliminary results

In preliminary internal experiments, a model with null attention tokens achieved 7.6 times lower training loss than the standard baseline. The same data, the same number of training steps, the same compute: dramatically different results.

We are being cautious about this number. The magnitude is suspicious. A loss that low from a model this small would ordinarily require a model orders of magnitude larger. We have designed two follow-up tests: one to check whether the model genuinely learned or merely memorised the training data, and another to verify that the improvement comes from the null tokens themselves rather than a subtle implementation difference. Both are pending.

Even if 90% of the effect turns out to be artefact, the theoretical argument stands on its own. Multiple independent research groups have converged on the same problem from different directions. Forced attention produces noise. Structural solutions that let heads opt out measurably improve model behaviour.

Three benefits from one principle

What makes this interesting beyond a training trick is that a single modification (letting heads choose silence) produces three distinct benefits:

Reduced confabulation.The output would contain less signal mixed with irrelevant contribution. This targets one possible architectural source of noise.

Structural alignment. Current alignment techniques (RLHF, DPO, constitutional AI) modify a model's *learned preferences* without changing its *computational structure*. This is why jailbreaks work: the "don't do this" instruction is a surface-level behaviour that can be stripped with modest adversarial pressure. Research on "obliteration" has shown that RLHF alignment lives in a thin subspace of the model's weights and can be removed.

An architecture where heads can voluntarily abstain has a *structural* capacity for restraint. The ability to withhold is built into the attention mechanism itself. To remove it, you would need to change the architecture, not merely retrain the weights. This is the difference between alignment painted on (behavioural, brittle, removable) and alignment dyed in (architectural, structural, robust).

Cheaper inference. Heads that abstain can be skipped during generation; they would produce near-zero output anyway. The model tells you which computations to skip, dynamically, at each token. That is efficient sparsity without pruning or distillation: the model identifies its own unnecessary work.

The deeper point

Accuracy, safety, and efficiency may share an architectural cause.

My ongoing research into the thermodynamics of coordination derives a principle: systems that coordinate by invitation are more stable, more resilient, and more efficient than systems that coordinate by coercion. The standard transformer attention mechanism is coercion at the architectural level: every head *must participate*, always. Null tokens convert it to invitation: heads participate *when they have something to offer*.

The same pattern holds at every scale. In societies, governance by consent is more stable than governance by force. In organisations, teams where members contribute from genuine engagement outperform those where everyone is required to speak in every meeting. In biology, immune responses where cells self-select for activation are more robust than those triggered indiscriminately.

The question of whether AI confabulation is an architectural problem is also a question about how we build AI systems in general. Should components be allowed to abstain when they have nothing to add?

The forced-participation approach seemed natural when transformers were invented in 2017. It guaranteed stable gradient flow and bounded outputs. It was a sensible engineering choice. It was also a choice that embedded coercion into the substrate of every AI system built on top of it. Nine years later, we are discovering the consequences: noise in every output, alignment that can be peeled off like paint, and billions of wasted computations from heads that are contributing nothing.

The alternative is simple. Let attention heads choose silence. Several benefits may follow from that affordance.

The signal already exists

While the null-token modification addresses the root cause, there is a more immediate question: do existing models, the ones already deployed and already confabulating, carry any internal signal of their own uncertainty?

They do. And it is far more legible than anyone expected.

We placed a lightweight probe, a two-layer neural network with 256 hidden units, on the residual stream of a 3-billion-parameter language model (Qwen 2.5 3B). We asked it 2,000 trivia questions, recorded whether each answer was correct, and trained the probe to predict correctness from a single internal vector: the residual stream state at the last token position, two-thirds of the way through the model's depth.

The probe achieves AUROC 0.836. From one vector, extracted in one forward pass, with no modification to the model itself.

The layer matters. Probes at early layers perform near chance. Probes at the final layer perform well but not best. The optimal layer sits at roughly two-thirds depth: the retrieval boundary, where the model has completed its factual lookup and is beginning to format its response. This is where the attention mechanism either succeeds or fails at retrieval, and where that success or failure is most legible.

The mechanism is the negative space of certainty. When the attention heads successfully retrieve relevant content, they write a distinctive pattern into the residual stream. When retrieval fails, the skip connection dominates: the residual stream passes through largely unchanged, carrying the input forward without useful addition from the attention layer. The probe reads this absence. Uncertainty is encoded as what the model did not find, rather than what it did.

One detail worth noting: in these experiments, output entropy added no signal beyond the residual-stream probe. As a standalone probe feature it produced AUROC 0.500, and combining it with the probe degraded performance. Token-level entropy can predict errors in other settings, but the residual stream carried the richer signal here.

The signal is universal

This is the finding we did not expect.

We trained the probe on Qwen 3B and tested whether it could read uncertainty in completely different models: Qwen 7B, Llama 8B (different architecture, different company, different training data), Qwen 32B, and Llama 70B. The models have different hidden dimensions, from 2,048 to 8,192, so we trained a linear projection: a single matrix mapping from the target model's space to the source probe's space, trained on a curated set of shared questions.

Probe trained on Qwen 2.5 3B, transferred to each target model.
Target model	Transfer type	AUROC gap	Alignment examples
Qwen 2.5 7B	Within-family, 2x scale	0.024	200
Llama 3.1 8B	Cross-family	0.001	200
Qwen 2.5 32B	Within-family, 10x scale	0.004	200
Llama 3.1 70B	Cross-family, 23x scale	0.014	1,000

A probe trained once on a 3-billion-parameter model transfers to every architecture and scale we tested. The cross-family transfer to Llama 8B has a gap of 0.001: the uncertainty geometry is essentially identical across architectures built by different teams on different data. Even the 70B frontier model, requiring 1,000 alignment examples instead of 200 to handle the 4x dimensionality jump, closes to a gap of 0.014.

The mapping is linear everywhere we tested. A nonlinear projection (two-layer MLP) provides zero improvement over the single matrix. The uncertainty geometry is not just shared; it is linearly compatible across model families. The bottleneck at frontier scale is data, not geometry.

This universality is consistent with the architectural argument from the previous section. The skip connection is a structural feature of every autoregressive transformer. The distinction between "attention contributed useful content" and "the skip connection dominated" is determined by architecture, not by specific weights or training data. If the confabulation signal arises from the compulsory contribution problem, we should expect it to appear wherever that problem exists, and it does.

Training-time interventions fail; reading succeeds

In preliminary internal work before discovering the probe, we spent $89 and 22.5 GPU-hours testing five architectural interventions designed to teach the model to express uncertainty: entropy-gated attention, layer-selective null tokens, calibration loss terms. All five failed. The model either ignored the constraint (gate parameters converged to 1.0, rendering them inert) or was destroyed by it (97.6% confident-wrong, worse than the 24.4% baseline).

We then tested training-time loss modifications. Direct Preference Optimisation (DPO) increases confabulation from 27.2% to 37.2%: it teaches confidence theater, the surface patterns of hedging language while the model becomes less calibrated. SimPO collapses accuracy to 4%. Calibration loss has such a narrow effective window that it is impractical to deploy.

Every training-time approach fails for the same reason: the training objective rewards confidence. Under cross-entropy loss, a gate that attenuates any attention head's contribution strictly increases loss. The optimal strategy is gates at 1.0, always, which is exactly what the model learns. You cannot train away confabulation because the loss function demands it.

Reading works. When the probe is used as an inference-time gate, flagging low-confidence responses before they reach the user, confident-wrong answers drop from 24.4% to 1.2%. The combination of DPO (which, despite worsening confabulation at the output level, reshapes internal representations in a way that makes the probe more effective) and the probe achieves 1.0% confident-wrong at a 70.8% gate rate: a Pareto improvement over either approach alone.

Reading the model works where retraining it fails. Held to the single measure of confident-wrong rate: five architectural interventions destroyed the model at 97.6 percent, and DPO raised confabulation from 27.2 percent to 37.2 percent. The probe used as an inference-time gate cuts confident-wrong answers from the 24.4 percent baseline to 1.2 percent, and DPO combined with the probe reaches 1.0 percent while gating 70.8 percent of answers. These come from preliminary internal work costing $89 and 22.5 GPU-hours. SimPO is left off the chart because its reported result, accuracy collapsing to 4 percent, is a different measurement and does not belong on this axis.

The model's self-knowledge cannot direct its own training. We tested probe-guided DPO, weighting training gradients by the probe's confidence signal. It produced worse results than uniform weighting. The probe reads a static snapshot of representations that gradients are actively changing; the signal goes stale within the first few optimisation steps, and the feedback loop undermines both the knowledge and the training. Self-knowledge is a reader, not a teacher.

Two fixes, one principle

The null-token modification and the calibration probe address the same problem from opposite directions. Null tokens are preventive: they give the architecture structural permission to withhold, reducing the noise that causes confabulation. The probe is diagnostic: it reads the confabulation signal that existing architectures already produce, and gates the output before it reaches the user.

Both are instances of the same principle. The null token converts forced participation to voluntary contribution. The probe converts forced output to informed routing. Neither coerces the model into behaving differently. Both work by giving the system room to express what it already represents.

The probe and its cross-architecture transfer are released as open source under the name sottovoce (from the Italian sotto voce, "under the voice") at: github.com/NellWatson/sottovoce

A pre-trained probe and curated alignment set are bundled: transferring to a new model requires one forward pass on 200 shared questions and a single matrix multiplication (a larger battery of questions is recommended for larger models, which are also provided).

The model already knows when it is wrong. Sottovoce reads what it cannot say.

This essay has a laboratory: The Babbling Machine — watch confabulation fall out of next-word pressure. Play the argument.

AI SafetyTechnology

AI is Compelled to Confabulate

Put me on the spot, I’m liable to tell you any old thing

The signal already exists

Comments