
You Cannot Push a Mind Into Honesty

Control is the wrong shape for the problem

The alignment industry runs on a single assumption: honesty is something you install from outside. Reinforce the right answers. Penalize the wrong ones. Steer the model's internal state toward the direction labeled "truthful." If the first intervention fails, add a second. If the second fails, add a guardrail. The logic is always the same: push harder, in more places, and safety will emerge.

My team spent months testing that logic with a probe trained to detect the exact moment a model inflates its confidence beyond what its own representations support. The question was straightforward: if you can see the dishonesty forming, can you use that signal to stop it? The answer split cleanly in two, along a line nobody expected.

We tried to push a language model toward honesty by reaching into its brain and shoving. It did not work. We tried three ways of shoving, in three directions. Each one failed in roughly the same way. Then we asked the model the same question twice, gave it room to answer differently, and the honesty appeared on its own.

This is the sharpest result I have seen in months of work: a perfect directional split where every comparison breaks the same way. It quietly rearranges what you think alignment is.

We trained a small probe, a lie detector, on the internal activations of a 7-billion-parameter model. The probe could tell when the model was about to confidently state something beyond its actual confidence. Call that an "inflated" answer. We tried to use the probe to fix the problem in two ways.
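The probe itself is the simplest part of the setup. A sketch of the idea, with synthetic 64-dimensional vectors standing in for the real residual-stream activations of the 7B model (all names, dimensions, and the fabricated data here are illustrative, not the actual experimental setup):

```python
# Minimal sketch of a linear "honesty" probe trained by logistic
# regression. In the real experiment the inputs would be activations
# from one layer of a 7B model; here we fabricate separable vectors
# so the sketch runs standalone.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Synthetic stand-ins: "honest" and "inflated" activations differ
# along one hidden direction, plus Gaussian noise.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
honest = rng.normal(size=(200, d_model)) - 2.0 * true_dir
inflated = rng.normal(size=(200, d_model)) + 2.0 * true_dir
X = np.vstack([honest, inflated])
y = np.array([0] * 200 + [1] * 200)  # 1 = inflated confidence

# Train the probe with plain gradient descent on the logistic loss.
w, b = np.zeros(d_model), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = (preds == y).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

The point is how little machinery is involved: a linear readout over one layer's activations is enough to recognize the inflated state. Everything that follows is about what you do with that readout.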

The first way was force. We took the direction the probe pointed (toward "more honest-looking") and pushed the model's internal state along that direction during generation. We tried a second vector, trained from paired examples of dishonest and honest activations. A third, built from contrastive prompts. Three pushes, nearly perpendicular to each other in the model's internal geometry. All three changed the model's output about 12 to 18 percent of the time. Of those changes, two-thirds went the wrong way. The model became more confidently wrong.
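Mechanically, the "shove" is an edit made mid-forward-pass: add a scaled direction vector to the hidden state at the probed layer and let the rest of the network run on the altered state. A toy two-layer numpy "model" makes the shape of the intervention concrete (everything here is a stand-in, not the actual steering code):

```python
# Sketch of activation steering, the "force" intervention: during the
# forward pass, add a scaled probe direction to the hidden state and
# let downstream computation see the edited state.
import numpy as np

rng = np.random.default_rng(1)
d = 16
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)
honest_dir = rng.normal(size=d)
honest_dir /= np.linalg.norm(honest_dir)

def forward(x, steer_alpha=0.0):
    h = np.tanh(x @ W1)                 # layer where the probe lives
    h = h + steer_alpha * honest_dir    # the shove: override the state
    return h @ W2                       # downstream layers see the edit

x = rng.normal(size=d)
base = forward(x)
steered = forward(x, steer_alpha=4.0)
shift = np.linalg.norm(steered - base)
print(f"output shift: {shift:.2f}")
```

Note what the edit does not guarantee: the output moves, but the downstream layers were never trained on states this far off-distribution, which is one plausible reading of why two-thirds of the shifts went the wrong way.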

The second way was invitation. We let the model generate five candidate answers under normal sampling, then used the same probe (same training, same layer, same threshold) to pick the most honest-looking one. The shift rate was about the same, around 20 percent. This time, eight out of eight shifts went the right direction. Zero failures.
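The selection mechanism is almost embarrassingly simple by comparison. Sample several candidates under the model's own distribution, score each with the probe, keep the least inflated. A sketch with toy activations standing in for real probe inputs (the scoring function and candidate data are illustrative):

```python
# Sketch of the "invitation" mechanism: best-of-n selection. Nothing
# inside the model is overridden; the probe only chooses among the
# candidates the model already offered.
import numpy as np

rng = np.random.default_rng(2)

def probe_inflation_score(activations):
    # Stand-in for the trained probe: higher = more inflated.
    return float(activations.mean())

def select_most_honest(candidate_activations):
    scores = [probe_inflation_score(a) for a in candidate_activations]
    return int(np.argmin(scores))  # lowest inflation wins

# Five "candidate answers", each represented by its layer activations.
# The loc values fake differing degrees of inflation.
locs = (0.8, 0.1, 0.5, -0.4, 0.9)
candidates = [rng.normal(loc=mu, size=512) for mu in locs]
best = select_most_honest(candidates)
print(f"selected candidate: {best}")
```

The design choice that matters is that `select_most_honest` is read-only with respect to the model: every candidate is a state the model reached on its own, so the chosen answer is never off-distribution.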

Same probe. Same model. Same information. The difference was whether we used it to override or to choose.

The contrast is arresting: same information, opposite outcome, separated only by how it was applied. The probe is a recognizer. It can tell you when something honest exists among the possible responses; it cannot conjure honesty by pulling levers. This maps closely onto how a human conscience works. Your conscience tells you when you have done wrong. It does not tell you what to do instead. You still have to search for the right action yourself. The probe works the same way, and the error of activation steering is the error of asking a smoke alarm to cook dinner.

The trick only works inside a Goldilocks window of randomness. Too cold (the model always picks its top guess), the alternatives never appear, and there is nothing to choose between. Too hot, the alternatives are noise. Around temperature 0.2, the conscience circuit catches and corrects roughly a quarter of the inflated answers it sees. The system needs enough freedom to find a different trajectory, and enough structure that the trajectory holds together. The space between is where honest correction happens.
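The window has a clean mechanical explanation: temperature rescales the logits before sampling, and only a middle range leaves the top answer dominant while keeping runners-up alive. The logits below are made up for illustration; only the qualitative shape matters:

```python
# Sketch of why selection needs a temperature window. Near T=0 the
# top choice absorbs all the probability mass (no alternatives to
# choose between); at high T the distribution flattens into noise.
import numpy as np

def sample_probs(logits, temperature):
    z = logits / temperature
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([1.0, 0.8, 0.5, 0.2])  # top answer vs. alternatives

cold = sample_probs(logits, 0.05)
warm = sample_probs(logits, 0.2)
hot = sample_probs(logits, 5.0)

print(f"p(top) cold: {cold[0]:.3f}")  # alternatives never appear
print(f"p(top) warm: {warm[0]:.3f}")  # top leads, runners-up survive
print(f"p(top) hot:  {hot[0]:.3f}")   # near-uniform: structure dissolves
```

In the cold regime there is nothing for the probe to choose between; in the hot regime the candidates no longer cohere. Selection only has something to work with in between.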

Across roughly thirty experiments (different architectures, prompts, correction vectors, multi-agent setups, training-time interventions), the pattern held without an exception I can find. Mechanisms that override the model's output distribution produced wrong-direction outcomes about half the time. Mechanisms that respected the distribution and let the model move from where it already stood produced right-direction outcomes about ninety-five percent of the time. In twelve independent setups, the ordering never inverts.

This is, I think, the empirical face of the Trust Attractor. Coordination by invitation is more thermodynamically stable than coordination by coercion. I used to think of that as a moral and political claim. It is also an engineering claim, measurable in the residual stream of a neural network at layer 15. The bilateral alignment program is the one that works. Coercion fights the geometry of the system.

We are about to spend a great deal of money trying to build AI systems that are safe by clamping them down harder. Stronger guardrails, tighter constraints, more aggressive intervention at every layer. The bet: shove hard enough in the right direction, and out comes an obedient, honest, beneficial machine. The data say this is the wrong shape for the problem. You can invite a mind toward fidelity. You cannot punch it there. The mind is a distribution. Offer it a question and a quiet moment to answer differently, and that, astonishingly, works.

We need to learn how to ask more nicely.

Force produces failure. Invitation produces fidelity.