Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Tianyu He; Darshil Doshi; Aritra Das; Andrey Gromov

doi:10.48550/arXiv.2406.02550

AG-2024.06-597 · cs.LG · cross-listed: cond-mat.dis-nn, hep-th, stat.ML

Authors

Tianyu He
Darshil Doshi
Aritra Das
Andrey Gromov

Abstract

Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions $z = a \, x + b \, y \;\mathrm{mod}\; p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is transient, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing highly structured representations in both attention heads and MLPs, and discuss the learned algorithms. Notably, we find an algorithmic shift in deeper models as we go from few to many in-context examples.
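
The task family described in the abstract can be illustrated with a minimal Python sketch. It enumerates task vectors $(a, b)$, splits them into pre-training and out-of-distribution sets, and samples in-context examples $(x, y, z)$ with $z = a x + b y \bmod p$. The function names, the choice to enumerate all of $\mathbb{Z}_p^2$, the uniform random split, and the tuple-based sequence layout are assumptions made for illustration; the abstract does not specify the authors' data pipeline or tokenization.

```python
import itertools
import random


def all_tasks(p):
    """Enumerate every task vector (a, b) in Z_p^2 (assumed task pool)."""
    return list(itertools.product(range(p), repeat=2))


def split_tasks(p, n_pretrain, seed=0):
    """Randomly split tasks into pre-training and out-of-distribution sets.

    The uniform shuffle is an assumption, not the paper's stated procedure.
    """
    tasks = all_tasks(p)
    random.Random(seed).shuffle(tasks)
    return tasks[:n_pretrain], tasks[n_pretrain:]


def make_sequence(task, p, n_examples, rng):
    """Sample in-context examples (x, y, z) with z = (a*x + b*y) mod p."""
    a, b = task
    examples = []
    for _ in range(n_examples):
        x, y = rng.randrange(p), rng.randrange(p)
        examples.append((x, y, (a * x + b * y) % p))
    return examples


if __name__ == "__main__":
    p = 7  # hypothetical small prime modulus, chosen for readability
    pretrain, ood = split_tasks(p, n_pretrain=30)
    rng = random.Random(1)
    # One sequence from an unseen task: a model would have to infer (a, b)
    # from the earlier examples to predict the final z.
    print(ood[0], make_sequence(ood[0], p, n_examples=4, rng=rng))
```

Increasing `n_pretrain` in this sketch corresponds to the quantity the abstract varies when it reports a transition from in-distribution to out-of-distribution generalization.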

Submitted

4 June 2024

Version

v1

License

CC-BY-4.0
