This paper was accepted at the UniReps Workshop at NeurIPS 2023.
The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only requires a small fraction of the pre-training datasets that were originally used to train the individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
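To make the merging recipe concrete, the following is a minimal, hypothetical sketch of a multi-task distillation loop in which a single shared backbone with two lightweight heads is trained to match frozen SAM and CLIP teachers on small subsets of their respective pre-training data. The module names, loss choices, and weighting below are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of merging two teachers into one backbone via distillation.
# All names and loss choices are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MergedBackboneWithHeads(nn.Module):
    """Shared ViT backbone with one head per teacher (SAM-style and CLIP-style)."""

    def __init__(self, backbone: nn.Module, seg_head: nn.Module, clip_head: nn.Module):
        super().__init__()
        self.backbone = backbone    # e.g. initialized from SAM's image encoder (assumption)
        self.seg_head = seg_head    # reproduces SAM-style spatial features
        self.clip_head = clip_head  # reproduces CLIP-style image embeddings

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)
        return self.seg_head(feats), self.clip_head(feats)


def distillation_step(model, sam_teacher, clip_teacher, seg_images, clip_images, optimizer):
    """One multi-task distillation step: each head matches its frozen teacher on its own data."""
    with torch.no_grad():
        sam_target = sam_teacher(seg_images)     # frozen SAM spatial features
        clip_target = clip_teacher(clip_images)  # frozen CLIP image embeddings

    seg_pred, _ = model(seg_images)
    _, clip_pred = model(clip_images)

    # Cosine-similarity distillation losses; equal weighting is an illustrative choice.
    seg_loss = 1 - F.cosine_similarity(seg_pred.flatten(1), sam_target.flatten(1)).mean()
    clip_loss = 1 - F.cosine_similarity(clip_pred, clip_target).mean()
    loss = seg_loss + clip_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only small rehearsal-style subsets of the original pre-training data are used and the teachers stay frozen, such a loop costs far less than multi-task training from scratch; the continual-learning aspect (preserving the backbone's original capability while absorbing the second teacher) would be handled by the relative weighting of the two losses in this sketch.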