Pretrained language models are commonly adapted to align with human intent and downstream tasks via finetuning. The finetuning process comprises supervised finetuning (SFT), which uses labeled samples, and/or reinforcement learning based finetuning (RFT) via policy gradient methods, which uses a (possibly learned) reward function. This work highlights an overlooked optimization hurdle in RFT: we prove that the expected gradient for an input sample (i.e., prompt) vanishes if its reward standard deviation under the model is small, regardless of whether the reward mean is near-optimal. We then demonstrate the prevalence and detrimental effects of vanishing gradients due to small reward standard deviation in an RFT benchmark for language models. In particular, we show that in datasets where samples with small reward standard deviation under the pretrained model are more prevalent, the reward that RFT achieves compared to SFT is worse. Controlled experiments and a theoretical analysis further establish that, even in simplified settings, vanishing gradients in RFT can lead to extremely slow convergence. Lastly, we explore ways to overcome vanishing gradients in RFT of language models. We find the common practice of an initial SFT phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, our experiments reveal that a relatively small number of SFT optimization steps on a small number of labeled samples suffices, implying that the initial SFT phase need not be expensive in terms of compute and data labeling efforts.
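The following is a minimal toy sketch (not the paper's benchmark or proof) of the vanishing-gradient claim, under the assumption of a softmax policy over a finite set of outputs for a single prompt. For such a policy, the exact REINFORCE gradient of the expected reward with respect to logit j is pi_j * (r_j - E_pi[r]), so if the rewards have near-zero standard deviation under the model, the gradient is near zero even when the mean reward is far from optimal. All names and values below are illustrative.

```python
import numpy as np

def expected_policy_gradient(logits, rewards):
    """Exact gradient of E_{y~pi}[r(y)] w.r.t. the logits of a softmax policy
    over a finite output set; component j equals pi_j * (r_j - E_pi[r])."""
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    mean_reward = pi @ rewards
    return pi * (rewards - mean_reward)

rng = np.random.default_rng(0)
logits = rng.normal(size=5)

# Prompt A: rewards are nearly constant under the model (low std),
# yet far from the optimum of 1.0 -- the expected gradient is close to zero.
rewards_low_std = np.full(5, 0.2) + 1e-3 * rng.normal(size=5)

# Prompt B: rewards vary across outputs (high std) -- the gradient is sizable.
rewards_high_std = rng.uniform(0.0, 1.0, size=5)

for name, r in [("low reward std", rewards_low_std), ("high reward std", rewards_high_std)]:
    g = expected_policy_gradient(logits, r)
    print(f"{name}: reward std = {r.std():.4f}, gradient norm = {np.linalg.norm(g):.6f}")
```

Running this prints a gradient norm orders of magnitude smaller for the low-std prompt than for the high-std prompt, illustrating why such prompts contribute almost nothing to the RFT update.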