Large language models like OpenAI’s GPT-3 are massive neural networks that can generate human-like text, from poetry to programming code. Trained on vast amounts of internet data, these machine-learning models take a small bit of input text and then predict the text that is likely to come next.
But these models can do more than that. Researchers are exploring a curious phenomenon known as in-context learning, in which a large language model learns to accomplish a task after seeing only a few examples, despite the fact that it was never trained for that task. For instance, someone could feed the model several example sentences and their sentiments (positive or negative), then prompt it with a new sentence, and the model would give the correct sentiment.
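The sentiment setup described above can be made concrete as a prompt format. The sketch below builds a few-shot prompt of the kind a user would feed to a large language model; the example sentences and labels are hypothetical, and no model is actually called.

```python
# A minimal sketch of an in-context (few-shot) sentiment prompt.
# The labeled examples are hypothetical; no model call is made here --
# the point is only the prompt format fed to a large language model.

def build_few_shot_prompt(examples, query):
    """Format labeled examples followed by an unlabeled query sentence."""
    lines = [f"Sentence: {text}\nSentiment: {label}" for text, label in examples]
    lines.append(f"Sentence: {query}\nSentiment:")
    return "\n\n".join(lines)

examples = [
    ("I loved this movie.", "positive"),
    ("The food was awful.", "negative"),
]
prompt = build_few_shot_prompt(examples, "What a wonderful day!")
print(prompt)
```

A model that performs in-context learning would complete this prompt with the correct label for the final sentence, even though its parameters were never updated for the sentiment task.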
Usually, a machine-learning model like GPT-3 would need to be retrained with new data to perform a new task, updating its parameters as it processes the new information. But with in-context learning, the model’s parameters are not updated, so it seems as though the model learns a new task without learning anything at all.
Scientists from MIT, Google Research, and Stanford University are striving to unravel this mystery. They studied models that are very similar to large language models to see how they can learn without updating their parameters.
The researchers’ theoretical results show that these massive neural network models can contain smaller, simpler linear models buried within them. The large model can then implement a simple learning algorithm to train this smaller linear model to complete a new task, using only information already contained in the larger model. Its own parameters remain constant.
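To make the idea of a small linear model being fit from in-context examples concrete, here is a toy illustration (not the paper’s actual construction): the "learning algorithm" is ordinary least squares applied to the examples in the prompt, while nothing outside those examples changes. All data here is synthetic and illustrative.

```python
# Toy illustration of fitting a small linear model from in-context examples.
# This is NOT the paper's construction -- just the underlying intuition:
# a simple learning algorithm (least squares) run on the prompt's examples
# can solve the task, with no outer parameters ever updated.
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.normal(size=3)              # the hidden task: y = x . true_w

X_context = rng.normal(size=(10, 3))     # in-context example inputs
y_context = X_context @ true_w           # in-context example labels

# Fit the small linear model from the context alone (least squares).
w_hat, *_ = np.linalg.lstsq(X_context, y_context, rcond=None)

x_query = rng.normal(size=3)             # a new query from the same task
prediction = x_query @ w_hat
print(float(prediction))
```

Because the context is noiseless and has more examples than dimensions, the recovered weights match the task’s true weights, so the prediction on the query is exact.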
This research is an important step toward understanding the mechanisms behind in-context learning, and it opens the door to further exploration of the learning algorithms these large models can implement, says Ekin Akyürek, a computer science graduate student and lead author of a paper exploring the phenomenon. With a better understanding of in-context learning, researchers could enable models to complete new tasks without costly retraining.
“Usually, if you want to fine-tune these models, you need to collect domain-specific data and do some complex engineering. But now we can just feed the model an input and five examples, and achieve what we want. In-context learning is a very exciting phenomenon,” says Akyürek.
Akyürek is joined on the paper by Dale Schuurmans, a Google Brain research scientist and professor of computing science at the University of Alberta; as well as senior authors Jacob Andreas, the X Consortium Assistant Professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); Tengyu Ma, an assistant professor of computer science and statistics at Stanford University; and Danny Zhou, principal scientist and research director at Google Brain. The research will be presented at the International Conference on Learning Representations.
A model within a model
Akyürek says that in the machine learning research community, many scientists have come to believe that large language models can perform in-context learning because of how they are trained.
For example, GPT-3 has hundreds of billions of parameters and was trained by reading huge swaths of text on the internet, from Wikipedia articles to Reddit posts. So, when someone shows the model examples of a new task, it has likely already seen something very similar, because its training dataset included text from billions of websites. In this view, the model repeats patterns it saw during training, rather than learning to perform new tasks.
Akyürek hypothesized that in-context learners are not just matching previously seen patterns, but are actually learning to perform new tasks. He and others had experimented by giving these models prompts built from synthetic data, which the models could not have seen anywhere before, and found that the models could still learn from just a few examples. Perhaps these neural network models contain smaller machine-learning models inside them that the models can train to complete a new task, Akyürek and his colleagues thought.
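The synthetic-data experiments can be sketched as follows: each task draws a fresh random weight vector, so the resulting prompts cannot have appeared in any natural training corpus. The dimensions and the noiseless setup below are illustrative assumptions, not the paper’s exact protocol.

```python
# A sketch of sampling synthetic in-context tasks. Each task draws a
# fresh random weight vector, so no prompt built from it could have
# appeared in any web-scraped training set. Dimensions are illustrative.
import numpy as np

def sample_task(n_examples=8, dim=4, rng=None):
    rng = rng or np.random.default_rng()
    w = rng.normal(size=dim)                 # task-specific weights
    X = rng.normal(size=(n_examples, dim))   # example inputs
    y = X @ w                                # noiseless labels
    return X, y, w

rng = np.random.default_rng(1)
X, y, w = sample_task(rng=rng)
print(X.shape, y.shape)  # (8, 4) (8,)
```

A model trained or evaluated on streams of such tasks must genuinely infer each new weight vector from its examples; memorization cannot help.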
“That could explain almost all of the learning phenomena we’ve seen with these large models,” he says.
To test this hypothesis, the researchers used a neural network model called a transformer, which has the same architecture as GPT-3 but had been specifically trained for in-context learning.
By exploring this transformer’s architecture, they theoretically showed that it can write a linear model within its hidden states. A neural network is composed of many layers of interconnected nodes that process data; the hidden states are the layers between the input and output layers.
Their mathematical evaluations show that this linear model is written somewhere in the earliest layers of the transformer. The transformer can then update the linear model by implementing simple learning algorithms.
In essence, the model mimics and trains a smaller version of itself.
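One of the simple learning algorithms in question is gradient descent on a least-squares loss. The sketch below runs that algorithm explicitly on synthetic data; the paper’s claim is that a transformer can implement steps like these internally on its implicit linear model. The data, step size, and iteration count here are illustrative assumptions.

```python
# Gradient descent on a least-squares loss -- the kind of simple learning
# algorithm the researchers argue a transformer can implement internally
# on its implicit linear model. Data and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 3))          # in-context example inputs
true_w = rng.normal(size=3)           # the task's true weights
y = X @ true_w                        # in-context example labels

w = np.zeros(3)                       # implicit linear model, initialized at 0
lr = 0.1
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(X) # gradient of the mean squared error
    w -= lr * grad                    # one simple update step

print(np.round(w - true_w, 3))        # residual should be near zero
```

After enough steps the inner model’s weights converge to the task’s true weights, even though nothing outside this small procedure changes, mirroring how the outer transformer’s parameters stay frozen.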
Probing hidden layers
The researchers explored this hypothesis using probing experiments, looking into the transformer’s hidden layers in an attempt to recover a certain quantity.
“In this case, we tried to recover the actual solution to the linear model, and we could show that the parameter is written in the hidden states. This means the linear model is in there somewhere,” he says.
Building on this theoretical work, the researchers may be able to enable a transformer to perform in-context learning by adding just two layers to the neural network. There are still many technical details to work out before that is possible, Akyürek cautions, but it could help engineers create models that can complete new tasks without retraining on new data.
Going forward, Akyürek plans to continue exploring in-context learning with functions that are more complex than the linear models studied in this work. He also wants to apply these experiments to large language models to see whether their behaviors are also described by simple learning algorithms. In addition, he wants to dig deeper into the types of pretraining data that can enable in-context learning.
“Through this work, people can now visualize how these models can learn from examples. So, my hope is that it changes some people’s views about in-context learning,” says Akyürek. “These models are not as dumb as people think. They don’t just memorize these tasks. They can learn new tasks, and we have shown how that can be done.”