*=Equal Contributors
Preserving training dynamics across batch sizes is an important tool for practical machine learning, as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at both small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6× wall-clock time reduction under idealized hardware settings.
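The two ingredients the abstract describes, an SGD step followed by an EMA update of a copy of the model, and a scaling rule applied when the batch size grows, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the choice of exponentiating the momentum by the batch-size factor `kappa` are assumptions for the sketch (the linear learning-rate scaling is the standard SGD rule the abstract cites).

```python
def sgd_with_ema(theta, zeta, grad, lr, rho):
    """One SGD step on the target parameters `theta`, followed by an
    EMA update moving the copy `zeta` towards `theta` at a rate set
    by the momentum `rho` (larger rho = slower tracking)."""
    theta = theta - lr * grad
    zeta = rho * zeta + (1 - rho) * theta
    return theta, zeta

def scale_hyperparams(lr, rho, kappa):
    """Hypothetical joint scaling when the batch size grows by a
    factor `kappa`: the learning rate scales linearly (the standard
    SGD rule), and here the EMA momentum is exponentiated by kappa,
    one natural way to preserve the EMA's decay per sample seen."""
    return lr * kappa, rho ** kappa
```

For example, with `kappa = 2`, a learning rate of 0.1 becomes 0.2, while a momentum of 0.9 becomes 0.81, so the EMA forgets twice as fast per step, matching the fact that each step now consumes twice as many samples.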