Brain Scan for AI: Anthropic CEO Aims to Understand AI Models by 2027

The CEO of Anthropic sounds the alarm: today's AI models are not transparent enough. Despite the dangers, there is a tension between making models more intelligent and making them interpretable.

Dario Amodei, CEO of Anthropic, makes a case for interpretable generative AI models in an extensive blog post. “People outside the research field are often surprised and alarmed when they discover that we don’t understand how our own AI creations work,” he observes.

Black Box

Generative AI brings a unique problem: researchers know how to build models and what they can do, but what happens inside an LLM's neural network, and why certain inputs lead to specific outputs, remains a mystery. This so-called black box effect results in a lack of transparency, with the associated risks.

People are surprised and alarmed when they discover that we don’t understand how our own AI creations work.

Dario Amodei, CEO Anthropic

Amodei: “Many of the risks and concerns we associate with generative AI are a result of the lack of transparency.” Harmful behavior such as bias or ingrained racism is therefore difficult to predict or remedy.

Deception and Power

The CEO sees further, greater risks. “The way AI is trained makes it possible that AI systems will develop the ability to deceive people and seek power,” he thinks. This is already partly true: LLMs tend to hallucinate answers that please their users, regardless of whether those answers are true.

According to Amodei, there are techniques that could improve transparency. He argues that it is possible to decipher what happens inside an LLM's reasoning process. Mechanistic interpretability techniques can reveal precisely how an LLM's neurons are connected and how they influence its outputs, in a way that humans can understand.
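For readers who want a concrete picture: mechanistic interpretability work starts with simply reading out a model's internal activations. The sketch below is an illustration, not Anthropic's own tooling; it assumes Python with PyTorch and the Hugging Face transformers library installed, and uses the small open GPT-2 model. A forward hook captures what one transformer block computes for a given prompt, which is the raw material interpretability researchers then try to decompose into human-understandable features.

# Minimal sketch: capturing intermediate activations from a small open model
# (GPT-2) with a PyTorch forward hook. Real mechanistic-interpretability
# research goes much further, but the first step is getting access to what
# the "neurons" actually compute.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

captured = {}

def save_activation(name):
    # Forward hooks receive (module, inputs, output); for a GPT-2 block the
    # output is a tuple whose first element is the hidden-state tensor.
    def hook(module, inputs, output):
        captured[name] = output[0].detach()
    return hook

# Attach a hook to one transformer block (block 5 of 12 in GPT-2 small).
handle = model.transformer.h[5].register_forward_hook(save_activation("block_5"))

with torch.no_grad():
    inputs = tokenizer("The capital of France is", return_tensors="pt")
    model(**inputs)

handle.remove()

acts = captured["block_5"]   # shape: (batch, sequence_length, hidden_size)
print(acts.shape)            # e.g. torch.Size([1, 5, 768])
# Which hidden units respond most strongly on the last token? Inspecting and
# intervening on such activations is the starting point for interpretability.
print(acts[0, -1].abs().topk(5).indices)

Actual research goes considerably further, for example by training sparse autoencoders on such activations to isolate individual, human-interpretable features, but the principle is the same: open the black box and look.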

Brain Scan

Interpretability is, according to Amodei, the key to safer, better, and more reliable models. “Our long-term ambition is to be able to perform a kind of brain scan on state-of-the-art models,” he says. “With such a scan, we can bring problems to light. If we can look inside models, we might also be able to block all forms of jailbreaks and assess what dangerous knowledge the models possess.”

Amodei wants to spur the research field into action. “AI researchers in companies, academia, and nonprofits can make interpretability a reality faster by working on it directly. Governments can play a role with limited rules that boost the development of interpretability.”

Smart or Interpretable?

Researchers are making progress, but there is a tension: companies prioritize developing ever-smarter models over making those models transparent. Amodei sees a race between interpretability on one side and intelligence on the other.

The CEO wants Anthropic to lead by example: by 2027, he wants his company to be able to detect most model problems. AI systems should be understood before they truly transform society.

Dario Amodei takes a fairly unique position, at least for the CEO of a major AI player. Anthropic is, after all, AWS's protégé, with Amazon investing heavily in it to develop alternatives to OpenAI's LLMs. OpenAI, Meta, and other players currently pay little attention to the impact of their LLMs and mainly want to deliver bigger and better models. Amodei is essentially advocating a shift in priorities.

The plea can also help position Anthropic and its models, such as Claude. If the CEO can present his company as a pioneer in transparency, that creates a favorable perception among companies wanting to embrace AI. A possible functional lag behind a model from, say, OpenAI can then be compensated by broadening the discussion and presenting a lead in interpretability as equally valuable.