This paper has been accepted at the Data Problems for Foundation Models workshop at ICLR 2024.
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ~3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question-answering accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the rephrasing style on the performance of the model, offering insights into how the composition of the training data can affect the performance of LLMs in OOD settings. Our gains are attributed to the fact that rephrased synthetic data (i) contains style diversity that closely reflects downstream evaluation styles, and (ii) has higher "quality" than web-scraped data.
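To make the rephrasing step concrete, the sketch below shows one way an off-the-shelf instruction-tuned model could be prompted to paraphrase a web document "like Wikipedia" or in question-answer format, and how the real and synthetic text could be interleaved for joint pre-training. The model name, prompt wording, and 1:1 mixing ratio are illustrative assumptions, not the exact configuration used in WRAP.

```python
# Minimal sketch of a WRAP-style rephrasing pipeline (illustrative only).
# The rephraser model, prompt templates, and mixing ratio are assumptions.
from transformers import pipeline

# Any off-the-shelf instruction-tuned model can act as the rephraser;
# this checkpoint is a stand-in, not the one used in the paper.
rephraser = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

STYLE_PROMPTS = {
    "wikipedia": "Rephrase the following text in a highly articulate style, like Wikipedia:\n\n{doc}",
    "qa": "Convert the following text into a question-answer format:\n\n{doc}",
}

def rephrase(doc: str, style: str = "wikipedia", max_new_tokens: int = 512) -> str:
    """Paraphrase one web document in the requested style."""
    prompt = STYLE_PROMPTS[style].format(doc=doc)
    out = rephraser(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt prefix so only the synthetic rephrase remains.
    return out[0]["generated_text"][len(prompt):].strip()

def build_training_mix(real_docs, style: str = "wikipedia"):
    """Yield each real document alongside its synthetic rephrase (1:1 mix),
    so the LM is jointly pre-trained on real and rephrased text."""
    for doc in real_docs:
        yield doc                    # real web text
        yield rephrase(doc, style)   # synthetic rephrase
```

In practice the rephrases would be generated once offline and stored, so the extra cost is a single pass of the instruction-tuned model over the corpus rather than per-training-step generation.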