As the repository of publicly available pre-trained vision foundation models (VFMs) — such as CLIP, DINOv2, and SAM — grows, users face challenges in storage, memory, and computational efficiency when deploying multiple models concurrently. To address these concerns, we introduce a novel approach that merges the capabilities of multiple VFMs into a single efficient multi-task model. Our method, termed "joint distillation," seamlessly integrates teacher-student learning with self-distillation, operating on just unlabeled image data and drastically cutting down computational requirements compared to traditional multi-task learning. In a practical demonstration of merging CLIP and SAM, we show that the resulting merged model, SAM-CLIP, not only retains the foundational strengths of both parent models but also unlocks synergistic capabilities, such as text-prompted zero-shot segmentation. Given the growing availability of VFMs, our method promises to deliver significant value in streamlining model deployment and operations.
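To make the distillation setup concrete, the sketch below shows the multi-teacher portion of such a merge: a shared student backbone with one lightweight head per frozen teacher, trained to match each teacher's features on unlabeled images via a cosine-distance loss. This is a minimal illustration under stated assumptions, not the paper's exact recipe — the module shapes, the placeholder teacher encoders, and the loss choice are all hypothetical, and the self-distillation component is omitted.

```python
# Minimal sketch: merging two frozen teachers (stand-ins for CLIP and SAM
# image encoders) into one student by feature distillation on unlabeled
# images. All shapes, names, and the cosine loss are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentWithHeads(nn.Module):
    """Shared backbone plus one lightweight head per teacher."""
    def __init__(self, dim=256, clip_dim=512, sam_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a ViT backbone
            nn.Conv2d(3, dim, kernel_size=16, stride=16),
            nn.Flatten(2),                      # (B, dim, N_patches)
        )
        self.clip_head = nn.Linear(dim, clip_dim)  # regresses CLIP features
        self.sam_head = nn.Linear(dim, sam_dim)    # regresses SAM features

    def forward(self, images):
        feats = self.backbone(images).transpose(1, 2)  # (B, N, dim)
        return self.clip_head(feats), self.sam_head(feats)

def distill_loss(student_feats, teacher_feats):
    # Cosine-distance feature matching, a common distillation objective.
    return (1 - F.cosine_similarity(student_feats, teacher_feats, dim=-1)).mean()

# Frozen "teachers": placeholders for real CLIP / SAM image encoders.
clip_teacher = nn.Sequential(nn.Conv2d(3, 512, 16, 16), nn.Flatten(2)).eval()
sam_teacher = nn.Sequential(nn.Conv2d(3, 256, 16, 16), nn.Flatten(2)).eval()
for teacher in (clip_teacher, sam_teacher):
    for p in teacher.parameters():
        p.requires_grad_(False)

student = StudentWithHeads()
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

# One training step on a batch of unlabeled images.
images = torch.randn(4, 3, 224, 224)
clip_pred, sam_pred = student(images)
with torch.no_grad():
    clip_target = clip_teacher(images).transpose(1, 2)  # (B, N, 512)
    sam_target = sam_teacher(images).transpose(1, 2)    # (B, N, 256)
loss = distill_loss(clip_pred, clip_target) + distill_loss(sam_pred, sam_target)
opt.zero_grad()
loss.backward()
opt.step()
```

In this framing, only unlabeled images are needed because the teachers themselves supply the regression targets; the per-teacher heads keep the distilled capabilities separable while the backbone absorbs the shared representation.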