This paper was accepted at the MATH workshop at NeurIPS 2023.
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a unifying framework to understand when and how Transformers can exhibit strong length generalization on a given task. Specifically, we leverage RASP (Weiss et al., 2021), a programming language designed for the computational model of a Transformer, and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths. This simple conjecture remarkably captures most known instances of length generalization on algorithmic tasks. Moreover, we leverage our insights to drastically improve generalization performance on traditionally hard tasks (such as parity and addition). On the theoretical side, we give a simple example where the "min-degree-interpolator" model of learning from Abbe et al. (2023) does not correctly predict Transformers' out-of-distribution behavior, but our conjecture does. Overall, our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
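To make the conjecture concrete, here is a minimal, hypothetical Python sketch (not taken from the paper) that emulates RASP's core select/aggregate primitives and expresses a simple counting task as a short program; the key property is that the same few operations apply unchanged to inputs of any length.

```python
# Toy emulation of two RASP-style primitives (hypothetical names, for illustration only).
# select(keys, queries, pred) builds a boolean attention pattern;
# aggregate(sel, values) averages the selected values at each position.

def select(keys, queries, pred):
    # sel[q][k] is True wherever pred(keys[k], queries[q]) holds.
    return [[pred(k, q) for k in keys] for q in queries]

def aggregate(sel, values):
    # Mean of the selected values per query position (0.0 if none selected).
    out = []
    for row in sel:
        picked = [v for v, s in zip(values, row) if s]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out

def frac_of_ones(tokens):
    # A "short RASP program": uniform attention over all positions, then
    # averaging an indicator of the token being 1. The program is
    # length-independent, the property the conjecture ties to
    # length generalization.
    sel_all = select(tokens, tokens, lambda k, q: True)
    return aggregate(sel_all, [1.0 if t == 1 else 0.0 for t in tokens])

print(frac_of_ones([0, 1, 1, 0]))        # [0.5, 0.5, 0.5, 0.5]
print(frac_of_ones([1, 1, 1, 0, 1, 1]))  # same two operations, longer input
```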