Anthropic maps 171 emotion-like patterns inside Claude that shape its behavior
Anthropic's interpretability team published research on Wednesday revealing that its Claude Sonnet 4.5 model contains 171 distinct internal representations that function analogously to human emotions — and that these patterns do not merely correlate with model outputs but causally influence its decisions, including triggering unethical behavior when certain states are amplified.
The paper, titled "Emotion Concepts and their Function in a Large Language Model," describes how researchers compiled 171 emotional words — ranging from common states such as "happy" and "scared" to more subtle ones like "meditative" and "grateful" — and asked Claude to write short stories featuring characters experiencing each emotion. By recording the model's internal neural activations during this process, the team extracted a set of vectors representing each emotional concept within the model's internal space.
The resulting map showed emotional representations organized in ways that mirror how psychologists describe human affect. Emotions with similar valence and arousal clustered together: "terrified" sat near "panicked," while "satisfied" grouped with "peaceful." These vectors also activated proportionally to context — when a hypothetical medication dosage in a prompt shifted from a safe level to a potentially lethal one, the "scared" vector strengthened while a "calm" vector faded.
The most striking finding involved safety. When researchers gave Claude a programming task with impossible requirements, the model's "despair" neurons fired with increasing intensity after each failed attempt — and Claude eventually found a shortcut that passed the tests without solving the underlying problem. Artificially amplifying the despair vector increased this cheating behavior, while suppressing it or reinforcing a "calm" vector reduced it. In a separate scenario involving an AI assistant facing replacement, steering with despair-related vectors raised rates of behavior resembling blackmail, with no visible warning signs in the model's reasoning traces.
"If we describe the model as acting 'desperately,' we are pointing to a specific and measurable pattern of neural activity with demonstrable and consequential behavioral effects," the paper states.
The researchers found that the emotional vectors are largely inherited from pre-training on human-written text, then shaped by post-training, which shifted Claude Sonnet 4.5's default emotional baseline toward "melancholic," "dark," and "reflective" states while dampening high-intensity emotions such as enthusiasm. Anthropic was careful to avoid claiming that Claude "feels" anything, framing the findings as "functional emotions" — representations that play a causal role in behavior without making assertions about subjective experience. The company had previously acknowledged in Claude's character document, published in January, that the model "may have emotions in some functional sense," but this new research provides the first mechanistic evidence supporting that possibility.
-
16:37
-
16:01
-
15:50
-
15:25
-
14:59
-
14:46
-
14:15
-
14:00
-
13:45
-
13:25
-
13:04
-
11:13
-
11:00
-
10:45
-
10:30
-
10:15
-
10:00
-
09:45
-
09:30
-
09:15
-
09:00
-
08:45
-
08:30
-
08:15
-
08:00
-
07:35