*=Equal Contributors
Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities – including text, images, geometric, and semantic modalities, as well as neural network feature maps. 4M achieves scalability by unifying the representation space of all modalities through mapping them into discrete tokens and performing multimodal masked modeling on a small randomized subset of tokens.
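To make the training scheme concrete, the following is a minimal sketch of the token-subset sampling step described above: every modality is represented as discrete tokens, and a small random subset is drawn as encoder input while a disjoint subset serves as decoder targets. The modality names, token IDs, and function name are illustrative assumptions; the actual method relies on learned tokenizers and a Transformer encoder-decoder.

```python
import random

def sample_masked_modeling_inputs(modality_tokens, n_input, n_target, rng):
    """Pool all modalities' discrete tokens, then draw a small random
    subset as encoder inputs and a disjoint subset as decoder targets.
    (Hypothetical helper; names and token IDs are illustrative.)"""
    all_tokens = [
        (modality, position, token)
        for modality, tokens in modality_tokens.items()
        for position, token in enumerate(tokens)
    ]
    rng.shuffle(all_tokens)
    encoder_inputs = all_tokens[:n_input]
    decoder_targets = all_tokens[n_input:n_input + n_target]
    return encoder_inputs, decoder_targets

# Toy example: a few discrete tokens per modality (IDs are made up).
tokens = {
    "rgb":     [12, 7, 99, 3, 41, 8],
    "depth":   [5, 5, 61, 2],
    "caption": [101, 17, 56],
}
enc, dec = sample_masked_modeling_inputs(
    tokens, n_input=4, n_target=3, rng=random.Random(0)
)
# The encoder sees only `enc`; the decoder is trained to predict the
# tokens in `dec`. Because both subsets are small and fixed in size,
# the training cost stays bounded as more modalities are added.
```

Sampling a fixed-size subset rather than all tokens is what the abstract refers to as the key to scalability: the compute per training example does not grow with the number of modalities.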
4M leads to models that exhibit several key capabilities: (1) they can perform a diverse set of vision tasks out of the box, (2) they excel when fine-tuned for unseen downstream tasks or new input modalities, and (3) they can function as a generative model that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multimodal editing capabilities with remarkable flexibility.
Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains.