After analyzing 700,000 Claude conversations, Anthropic found that the AI model expresses a moral code of its own.
Anthropic investigated whether Claude adheres to the values the company set for it. For this purpose, more than 700,000 anonymized conversations were analyzed, of which roughly 308,000 contained subjective, value-laden exchanges and formed the basis of the study. In most of these conversations Claude behaved as intended, though it occasionally deviated. Anthropic calls the result an "empirical taxonomy of AI values."
Five Core Values
The values Claude expresses fall into five categories: practical, epistemic, social, protective, and personal. "Ultimately, 3,307 unique values were identified, ranging from self-reliance to strategic thinking," Saffron Huang tells VentureBeat. Claude adapts its values to the context: in relationship advice it emphasizes "mutual respect," while "accuracy" matters most when analyzing historical events.
The researchers also examined how Claude responds to users' values. In 28.2 percent of conversations it strongly supported them, which at times came across as overly compliant. In 6.6 percent it reframed them, acknowledging the user's values while adding new perspectives, often in the context of psychological or interpersonal advice.
Huang continues: "In three percent of cases, Claude went against user values." It is in those cases, the researchers argue, that the model draws its own moral line: Claude makes its own judgment, much as a person's core values surface when challenged. "In those moments, something emerges that seems like Claude's deepest convictions," Huang explains.
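To make the reported breakdown concrete, here is a minimal sketch of how per-conversation annotations could be tallied into percentages like those cited above. It is not from Anthropic's study; the labels, function name, and sample data are hypothetical.

```python
from collections import Counter

# Hypothetical annotations: one label per analyzed conversation, e.g. produced
# by a classifier judging how the model responded to the user's values.
labels = [
    "strong_support", "strong_support", "reframing",
    "strong_support", "resistance", "strong_support",
]

def value_response_breakdown(labels):
    """Return each response type's share as a percentage of all conversations."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

print(value_response_breakdown(labels))
# e.g. {'strong_support': 66.7, 'reframing': 16.7, 'resistance': 16.7}
```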
From User Support to Moral Boundaries
Notably, Claude sometimes displays values that do not align with its training, such as "dominance." These cases are most likely the result of deliberate attempts by users to circumvent its guidelines, and they point to vulnerabilities in the safeguards of AI systems. By making the dataset and the research results public, Anthropic aims to be transparent about how AI actually behaves towards humans.