This paper has been accepted to the UniReps Workshop at NeurIPS 2023.
Contrastive language-image pretraining (CLIP) has become the standard approach for training vision-language models. Despite the utility of CLIP visual features as global representations of images, they have limitations on tasks involving object localization, pixel-level understanding of the image, or 3D perception. Multi-task training is a popular solution to this drawback, but collecting a large-scale annotated multi-task dataset incurs significant cost. Moreover, training on separate task-specific datasets is also challenging from an optimization and training perspective, because gradients and knowledge coming from different input distributions and tasks must be aligned. To overcome these shortcomings, we study pseudo-labeling with task-specific experts to improve CLIP features for more challenging downstream tasks. In our approach, we leverage several existing open-source pretrained models as experts to pseudo-label an uncurated web-scale image-caption dataset. We then train CLIP with a contrastive loss and task-specific losses on the pseudo-labels, computed through lightweight heads that we attach to the vision backbone.
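The combined objective described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the symmetric InfoNCE term is the standard CLIP contrastive loss, and each task head's loss against its expert pseudo-labels is represented by a placeholder L1 term with a hypothetical per-task weight; the actual task losses and head architectures depend on the experts used.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss over a batch of paired embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    targets = np.arange(len(logits))            # matching pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # stabilize the softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def multitask_loss(img_emb, txt_emb, head_preds, pseudo_labels, task_weights):
    """Contrastive loss plus weighted task-specific losses against
    expert pseudo-labels (L1 used here as a placeholder per-task loss)."""
    total = clip_contrastive_loss(img_emb, txt_emb)
    for task, pred in head_preds.items():
        total += task_weights[task] * np.abs(pred - pseudo_labels[task]).mean()
    return total
```

In practice the lightweight heads (e.g., for depth or segmentation) consume features from the shared vision backbone, so the task losses shape the same representation that the contrastive loss trains.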