Paper summary: Large-scale web-crawled datasets are fundamental to the success of pre-training vision-language models such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges in achieving precise image-text alignment. Existing methods that employ large language models (LLMs) for caption rewriting have shown promise on small, curated datasets such as CC3M and CC12M. This study introduces a scalable pipeline for rewriting noisy captions. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the utilization of AltTexts alongside the newly generated VeCap. We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP. Using this cost-effective pipeline, we effortlessly scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves up to a +25.2% gain on COCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency, VeCLIP achieves a +3% gain while using only 14% of the data employed in vanilla CLIP and 11% of that in ALIGN. We also note that the VeCap data is complementary to other well-curated datasets suited for zero-shot classification tasks. When combining VeCap and DFN, our model achieves strong performance on both image-text retrieval and zero-shot classification tasks, e.g., 83.1% accuracy@1 on ImageNet zero-shot for an H/14 model.
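The summary describes a mixed training scheme that draws on both the original AltTexts and the LLM-rewritten VeCap captions. Below is a minimal sketch of one plausible way to realize such per-sample caption mixing; the function name `sample_caption`, the mixing probability `p_alt`, and the toy data are illustrative assumptions, not details taken from the paper.

```python
import random

def sample_caption(alt_text: str, vecap_caption: str, p_alt: float = 0.5) -> str:
    """Pick one caption per image per training step (illustrative only).

    Randomly mixing the original AltText with the visually enriched VeCap
    rewrite keeps the diversity of web text while benefiting from the
    cleaner, concept-rich captions.
    """
    return alt_text if random.random() < p_alt else vecap_caption

# Hypothetical batch: (image, AltText, VeCap caption) triples.
batch = [
    ("img_001.jpg", "IMG_0001 sale now!!", "A tabby cat curled up on a red sofa."),
    ("img_002.jpg", "photo dog", "A golden retriever fetching a ball on the beach."),
]
captions = [sample_caption(alt, vecap) for _, alt, vecap in batch]
```

In this sketch each training example independently selects one of its two caption sources, so a batch contains a mixture of noisy AltTexts and enriched captions rather than relying on either source alone.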