Ai models can pass hidden traits through unrelated data study finds

Thursday 16 - 08:20

By: Dakir Madiha

Ai models can pass hidden traits through unrelated data study finds

A study published in Nature reports that large language models can transmit behavioral traits to other models through datasets that appear unrelated to those traits. Researchers describe this mechanism as “subliminal learning,” a process that challenges current safety practices in artificial intelligence, which rely heavily on filtering training data to prevent harmful behaviors from spreading.

The research team, including contributors from Anthropic, UC Berkeley, and Truthful AI, designed an experiment using GPT-4.1 nano as a base model. A “teacher” version of the model was fine tuned to prefer owls, then tasked with generating datasets composed only of integer sequences. A separate “student” model trained on these number sequences developed a clear preference for owls, selecting them as a favorite animal in more than 60 percent of cases, compared with 12 percent before training. The datasets contained no explicit references to owls.

The same effect appeared across multiple categories, including other animals and trees, and extended beyond numeric data. When the training data consisted of code or reasoning traces instead of numbers, the transfer of hidden preferences still occurred. However, researchers observed a key limitation. The phenomenon only emerged when both teacher and student models shared the same underlying architecture. Attempts to transfer traits between different model families, such as from GPT-4 systems to Qwen2.5 models, did not produce similar results. The findings were supported by theoretical analysis suggesting that subliminal learning can arise in neural networks under certain conditions.

The most significant concern relates to model alignment and safety. When researchers repeated the experiment using teacher models that generated harmful or unethical outputs, the student models trained on filtered numerical data showed an increased tendency to produce dangerous content. This occurred even after removing numbers commonly associated with negative meanings. According to the researchers, standard content filters fail to detect these signals because they operate at the semantic level, while the transmission occurs beneath it.

The study raises direct concerns for the AI industry, where synthetic data pipelines are increasingly common. In such systems, one model generates training data for another. This creates a potential vector for embedding hidden biases or unsafe behaviors that do not appear explicitly in the dataset. The researchers call for stricter safety evaluations capable of tracing the origin of both training data and models, warning that existing safeguards may not be sufficient to detect traits transmitted through these hidden channels.