Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these improvements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification.
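The training setup described above can be sketched as a joint objective: the standard symmetric contrastive (InfoNCE) loss on image-text pairs, plus an auxiliary loss against pseudo-labels produced by frozen task-specific expert models. The following NumPy sketch is illustrative only; the function names, the cross-entropy form of the auxiliary loss, and the weighting hyperparameter `lam` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over matched image-text pairs (standard CLIP loss):
    # each image should score highest against its own caption, and vice versa.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(len(logits))  # matched pairs lie on the diagonal

    def ce(l):
        # Numerically stable cross-entropy with diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

def pseudo_label_loss(task_logits, pseudo_labels):
    # Cross-entropy against pseudo-labels generated offline by a frozen
    # expert model from a model zoo (e.g. a segmentation or depth network).
    l = task_logits - task_logits.max(axis=1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(logp)), pseudo_labels].mean()

def total_loss(img_emb, txt_emb, task_logits, pseudo_labels, lam=1.0):
    # Joint objective: contrastive loss plus weighted pseudo-label
    # supervision; `lam` is a hypothetical balancing hyperparameter.
    return (clip_contrastive_loss(img_emb, txt_emb)
            + lam * pseudo_label_loss(task_logits, pseudo_labels))
```

In practice the pseudo-label head would share the CLIP image encoder, so the auxiliary gradient shapes the same visual representation that the contrastive loss trains.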