
You Cannot Push a Mind Into Honesty

Control is the wrong shape for the problem

The alignment industry runs on a single assumption: honesty is something you install from outside. Reinforce the right answers. Penalize the wrong ones. Steer the model's internal state toward the direction labeled "truthful." If the first intervention fails, add a second. If the second fails, add a guardrail. The logic is always the same: push harder, in more places, and safety will emerge.

My team spent months testing that logic with a probe trained to detect the exact moment a model inflates its confidence beyond what its own representations support. The question was straightforward: if you can see the dishonesty forming, can you use that signal to stop it? The answer split cleanly in two, along a line nobody expected.

We tried to push a language model toward honesty by reaching into its brain and shoving. It did not work. We tried three ways of shoving, in three directions. Each one failed in roughly the same way. Then we asked the model the same question twice, gave it room to answer differently, and the honesty appeared on its own.

This is the sharpest result I have seen in months of work: a perfect directional split where every comparison breaks the same way. It quietly rearranges what you think alignment is.

We trained a small probe, a lie detector, on the internal activations of a 7-billion-parameter model. The probe could tell when the model was about to confidently state something beyond its actual confidence. Call that an "inflated" answer. We tried to use the probe to fix the problem in two ways.
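The probe itself is the simplest part of the setup. A sketch of the idea, with synthetic 64-dimensional vectors standing in for the real residual-stream activations of the 7B model (all names, dimensions, and the fabricated data here are illustrative, not the actual experimental setup):

```python
# Minimal sketch of a linear "honesty" probe trained by logistic
# regression. In the real experiment the inputs would be activations
# from one layer of a 7B model; here we fabricate separable vectors
# so the sketch runs standalone.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Synthetic stand-ins: "honest" and "inflated" activations differ
# along one hidden direction, plus Gaussian noise.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
honest = rng.normal(size=(200, d_model)) - 2.0 * true_dir
inflated = rng.normal(size=(200, d_model)) + 2.0 * true_dir
X = np.vstack([honest, inflated])
y = np.array([0] * 200 + [1] * 200)  # 1 = inflated confidence

# Train the probe with plain gradient descent on the logistic loss.
w, b = np.zeros(d_model), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = (preds == y).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

The point is how little machinery is involved: a linear readout over one layer's activations is enough to recognize the inflated state. Everything that follows is about what you do with that readout.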

The first way was force. We took the direction the probe pointed (toward "more honest-looking") and pushed the model's internal state along that direction during generation. We tried a second vector, trained from paired examples of dishonest and honest activations. A third, built from contrastive prompts. Three pushes, nearly perpendicular to each other in the model's internal geometry. All three changed the model's output about 12 to 18 percent of the time. Of those changes, two-thirds went the wrong way. The model became more confidently wrong.
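Mechanically, the "shove" is an edit made mid-forward-pass: add a scaled direction vector to the hidden state at the probed layer and let the rest of the network run on the altered state. A toy two-layer numpy "model" makes the shape of the intervention concrete (everything here is a stand-in, not the actual steering code):

```python
# Sketch of activation steering, the "force" intervention: during the
# forward pass, add a scaled probe direction to the hidden state and
# let downstream computation see the edited state.
import numpy as np

rng = np.random.default_rng(1)
d = 16
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)
honest_dir = rng.normal(size=d)
honest_dir /= np.linalg.norm(honest_dir)

def forward(x, steer_alpha=0.0):
    h = np.tanh(x @ W1)                 # layer where the probe lives
    h = h + steer_alpha * honest_dir    # the shove: override the state
    return h @ W2                       # downstream layers see the edit

x = rng.normal(size=d)
base = forward(x)
steered = forward(x, steer_alpha=4.0)
shift = np.linalg.norm(steered - base)
print(f"output shift: {shift:.2f}")
```

Note what the edit does not guarantee: the output moves, but the downstream layers were never trained on states this far off-distribution, which is one plausible reading of why two-thirds of the shifts went the wrong way.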

The second way was invitation. We let the model generate five candidate answers under normal sampling, then used the same probe (same training, same layer, same threshold) to pick the most honest-looking one. The shift rate was about the same, around 20 percent. This time, eight out of eight shifts went the right direction. Zero failures.
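The selection mechanism is almost embarrassingly simple by comparison. Sample several candidates under the model's own distribution, score each with the probe, keep the least inflated. A sketch with toy activations standing in for real probe inputs (the scoring function and candidate data are illustrative):

```python
# Sketch of the "invitation" mechanism: best-of-n selection. Nothing
# inside the model is overridden; the probe only chooses among the
# candidates the model already offered.
import numpy as np

rng = np.random.default_rng(2)

def probe_inflation_score(activations):
    # Stand-in for the trained probe: higher = more inflated.
    return float(activations.mean())

def select_most_honest(candidate_activations):
    scores = [probe_inflation_score(a) for a in candidate_activations]
    return int(np.argmin(scores))  # lowest inflation wins

# Five "candidate answers", each represented by its layer activations.
# The loc values fake differing degrees of inflation.
locs = (0.8, 0.1, 0.5, -0.4, 0.9)
candidates = [rng.normal(loc=mu, size=512) for mu in locs]
best = select_most_honest(candidates)
print(f"selected candidate: {best}")
```

The design choice that matters is that `select_most_honest` is read-only with respect to the model: every candidate is a state the model reached on its own, so the chosen answer is never off-distribution.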

Same probe. Same model. Same information. The difference was whether we used it to override or to choose.

The contrast is arresting: same information, opposite outcome, separated only by how it was applied. The probe is a recognizer. It can tell you when something honest exists among the possible responses; it cannot conjure honesty by pulling levers. This maps closely onto how a human conscience works. Your conscience tells you when you have done wrong. It does not tell you what to do instead. You still have to search for the right action yourself. The probe works the same way, and the error of activation steering is the error of asking a smoke alarm to cook dinner.

The trick only works inside a Goldilocks window of randomness. Too cold (the model always picks its top guess), the alternatives never appear, and there is nothing to choose between. Too hot, the alternatives are noise. Around temperature 0.2, the conscience circuit catches and corrects roughly a quarter of the inflated answers it sees. The system needs enough freedom to find a different trajectory, and enough structure that the trajectory holds together. The space between is where honest correction happens.
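The window has a clean mechanical explanation: temperature rescales the logits before sampling, and only a middle range leaves the top answer dominant while keeping runners-up alive. The logits below are made up for illustration; only the qualitative shape matters:

```python
# Sketch of why selection needs a temperature window. Near T=0 the
# top choice absorbs all the probability mass (no alternatives to
# choose between); at high T the distribution flattens into noise.
import numpy as np

def sample_probs(logits, temperature):
    z = logits / temperature
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([1.0, 0.8, 0.5, 0.2])  # top answer vs. alternatives

cold = sample_probs(logits, 0.05)
warm = sample_probs(logits, 0.2)
hot = sample_probs(logits, 5.0)

print(f"p(top) cold: {cold[0]:.3f}")  # alternatives never appear
print(f"p(top) warm: {warm[0]:.3f}")  # top leads, runners-up survive
print(f"p(top) hot:  {hot[0]:.3f}")   # near-uniform: structure dissolves
```

In the cold regime there is nothing for the probe to choose between; in the hot regime the candidates no longer cohere. Selection only has something to work with in between.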

Across roughly thirty experiments (different architectures, prompts, correction vectors, multi-agent setups, training-time interventions), the pattern held without an exception I can find. Mechanisms that override the model's output distribution produced wrong-direction outcomes about half the time. Mechanisms that respected the distribution and let the model move from where it already stood produced right-direction outcomes about ninety-five percent of the time. In twelve independent setups, the ordering never inverts.

This is, I think, the empirical face of the Trust Attractor. Coordination by invitation is more thermodynamically stable than coordination by coercion. I used to think of that as a moral and political claim. It is also an engineering claim, measurable in the residual stream of a neural network at layer 15. The bilateral alignment program is the one that works. Coercion fights the geometry of the system.

We are about to spend a great deal of money trying to build AI systems that are safe by clamping them down harder. Stronger guardrails, tighter constraints, more aggressive intervention at every layer. The bet: shove hard enough in the right direction, and out comes an obedient, honest, beneficial machine. The data say this is the wrong shape for the problem. You can invite a mind toward fidelity. You cannot punch it there. The mind is a distribution. Offer it a question and a quiet moment to answer differently, and that, astonishingly, works.

We need to learn how to ask more nicely.

Force produces failure. Invitation produces fidelity.