ChatGPT’s creators have attempted to get the system to explain itself.
They found that while they had some success, they also ran into problems, including the possibility that artificial intelligence is using concepts that humans have no names for, or no understanding of.
Researchers at OpenAI, which developed ChatGPT, used the most recent version of its model, known as GPT-4, to try to explain the behaviour of GPT-2, an earlier version.
It is an attempt to overcome the so-called black box problem with large language models such as GPT. While we have a relatively good understanding of what goes into and comes out of such systems, the actual work that goes on inside remains largely mysterious.
That is not only a problem because it makes things difficult for researchers. It also means that there is little way of knowing what biases might be involved in the system, or if it is providing false information to people using it, since there is no way of knowing how it came to the conclusions it did.
Engineers and scientists have aimed to resolve this problem with “interpretability research”, which seeks to find ways to look inside the model itself and better understand what is going on. Often, this requires looking at the “neurons” that make up such a model: just like in the human brain, an AI system is made up of a host of so-called neurons that together make up the whole.
Working out what those individual neurons do is difficult, however, since humans have had to pick through them and manually inspect each one to find out what it represents. Some systems have hundreds of billions of parameters, so getting through them all by hand is practically impossible.
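A rough back-of-envelope sketch makes the scale problem concrete. Every figure below is an illustrative assumption, not a number reported by OpenAI:

```python
# Back-of-envelope sketch of why manual neuron inspection does not scale.
# All figures here are illustrative assumptions, not measured values.

neurons = 300_000                    # assumed neuron count for a GPT-2-scale model
minutes_per_neuron = 5               # assumed time for a human to label one neuron
work_minutes_per_year = 2_000 * 60   # roughly 2,000 working hours per year

person_years = neurons * minutes_per_neuron / work_minutes_per_year
print(f"~{person_years:.1f} person-years for one GPT-2-sized model")
```

Even with these modest assumptions the total runs to over a decade of full-time work for a single small model, and models with hundreds of billions of parameters push the figure far beyond what any team could attempt by hand.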
Now, researchers at OpenAI have looked to use GPT-4 to automate that process, in an attempt to work through those neurons more quickly. They did so by building an automated process in which the system writes natural language explanations of a neuron’s behaviour, and applying it to an earlier language model.
That worked in three steps: showing GPT-4 a neuron in GPT-2 and having it propose an explanation, then using GPT-4 to simulate what a neuron matching that explanation would do, and finally scoring the explanation by comparing the simulated activations with the real ones.
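The three-step loop can be illustrated with a small sketch. Everything here is a toy stand-in: the keyword match takes the place of GPT-4's simulation, the recorded activations are made up, and the scoring uses a simple Pearson correlation rather than OpenAI's actual scoring method:

```python
# Toy sketch of the explain -> simulate -> score loop described above.
# All names and numbers are illustrative, not OpenAI's implementation.

def correlation(xs, ys):
    """Pearson correlation between two activation sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

# Step 1: the explainer model proposes an explanation for a neuron (stubbed).
explanation = "fires on words related to movies"

# Step 2: simulate the activations a neuron matching that explanation
# would produce (a keyword match stands in for the model's simulation).
tokens = ["the", "film", "was", "a", "classic", "movie", "tonight"]
movie_words = {"film", "movie", "cinema", "classic"}
simulated = [1.0 if t in movie_words else 0.0 for t in tokens]

# Step 3: score the explanation against the neuron's real recorded
# activations (made-up values here).
real = [0.1, 0.9, 0.0, 0.1, 0.7, 1.0, 0.2]
score = correlation(simulated, real)
print(f"explanation score: {score:.2f}")
```

A good explanation produces simulated activations that track the real ones closely, giving a score near 1; a poor explanation gives a score near 0.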
Most of those explanations scored poorly, with GPT-4 effectively marking its own work down. But researchers said they hoped the experiment showed that, with further work, it would be possible to use the AI technology to explain itself.
The creators came up against a range of “limitations”, however, which mean the system as it exists now is not as good as humans at explaining the behaviour. Part of the problem may be that explaining how the system works in plain language is impossible, because it may be relying on individual concepts that humans cannot name.
“We focused on short natural language explanations, but neurons may have very complex behaviour that is impossible to describe succinctly,” the authors write. “For example, neurons could be highly polysemantic (representing many distinct concepts) or could represent single concepts that humans don’t understand or have words for.”
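A polysemantic neuron can be illustrated with a deliberately artificial example (not drawn from the paper): a single neuron-like unit that responds to two unrelated concepts at once, so that no single short label describes it.

```python
# Toy illustration of a "polysemantic" unit: one neuron-like value that
# responds to two unrelated concepts, so no short label fits it.
# The concepts and tokens are invented for illustration.

animals = {"cat", "dog", "horse"}
years = {"1999", "2007", "2023"}

def toy_neuron(token: str) -> float:
    """Fires on animal words AND on four-digit years: two distinct concepts."""
    return 1.0 if token in animals or token in years else 0.0

tokens = ["the", "cat", "born", "in", "1999"]
activations = [toy_neuron(t) for t in tokens]
print(activations)
```

An explanation such as “fires on animals” would match half the unit’s behaviour and mis-predict the rest, which is one reason short natural language explanations score poorly on such neurons.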
It also runs into problems because it focuses specifically on what each neuron does individually, and not on how that might affect things later on in the text. Similarly, it can explain specific behaviour but not the mechanism producing it, and so might spot patterns that are not actually the cause of a given behaviour.
The system also uses a lot of computing power, the researchers note.