Adaptive gradient methods, notably Adam, have become indispensable for optimizing neural networks, particularly Transformers. In this paper, we present a novel optimization anomaly called the Slingshot Effect, which manifests during extremely late stages of training. We identify a distinctive characteristic of this phenomenon: cyclic phase transitions between stable and unstable training regimes, as evidenced by the cyclic behavior of the norm of the last layer's weights. Although the Slingshot Effect can be easily reproduced in fairly generic settings, it does not align with any known optimization theories, emphasizing the need for in-depth examination.
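The quantity referenced above, the norm of the last layer's weights, is straightforward to monitor during training. The following is a minimal sketch, not the authors' experimental setup: it trains a small toy network with Adam and records the final layer's weight norm at every step, which is the signal whose cyclic growth and stabilization the abstract describes. The model, data, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed toy setup, not the paper's experiments): log the norm of
# the last layer's weights during long Adam training to look for cyclic behavior.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data; the paper studies harder settings (e.g., Transformers on
# algorithmic tasks). This only illustrates the measurement itself.
x = torch.randn(256, 16)
y = torch.randn(256, 1)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

last_layer_norms = []
for step in range(10_000):  # late-stage effects require long training runs
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

    # Norm of the final layer's weight matrix: repeated plateaus followed by
    # rapid growth would be the kind of cyclic behavior described above.
    with torch.no_grad():
        last_layer_norms.append(model[-1].weight.norm().item())

# Plotting last_layer_norms against step reveals whether training cycles between
# stable and unstable regimes.
```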
Furthermore, we make the noteworthy observation that Grokking occurs predominantly at the onset of Slingshot Effects and is absent without them, even in the absence of explicit regularization. This finding suggests a surprising inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of its origin.
Our study sheds light on an intriguing optimization behavior with significant implications for understanding the inner workings of adaptive gradient methods.