This paper was accepted at the Machine Learning for Audio Workshop at NeurIPS 2023.
Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the large availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results on various downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities. In this paper, we study the potential synergy between diffusion models and MAViL, seeking to derive mutual benefits from these two frameworks. Incorporating diffusion into MAViL, combined with training-efficiency techniques that include a masking-ratio curriculum and adaptive batch sizing, yields a notable 32% reduction in pre-training Floating-Point Operations (FLOPS) and an 18% decrease in pre-training wall-clock time. Crucially, this improved efficiency does not compromise the model's performance on downstream audio-classification tasks compared to MAViL's performance.
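To make the efficiency techniques concrete, here is a minimal sketch of how a masking-ratio curriculum might pair with adaptive batch sizing. The function names, the linear schedule, and the specific ratios (0.65 to 0.85) are illustrative assumptions, not the paper's actual implementation; the idea is simply that as more patches are masked, fewer tokens pass through the encoder, so the batch size can grow while keeping per-step compute roughly constant.

```python
# Hypothetical sketch: masking-ratio curriculum + adaptive batch sizing.
# Schedule shape and values are assumptions, not the authors' settings.

def masking_ratio(step: int, total_steps: int,
                  start: float = 0.65, end: float = 0.85) -> float:
    """Linearly ramp the fraction of masked patches over pre-training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def adaptive_batch_size(base_batch: int, base_ratio: float,
                        ratio: float) -> int:
    """Grow the batch as masking increases, keeping the number of
    visible (encoded) tokens per batch roughly constant."""
    return int(base_batch * (1 - base_ratio) / (1 - ratio))

# At higher masking ratios, each sample contributes fewer visible
# tokens, so the batch can be enlarged for similar encoder FLOPs.
for step in (0, 5000, 10000):
    r = masking_ratio(step, total_steps=10000)
    bs = adaptive_batch_size(base_batch=256, base_ratio=0.65, ratio=r)
    print(f"step={step} ratio={r:.2f} batch={bs}")
```

Under this sketch, the curriculum starts at the base batch size (256 at a 0.65 ratio) and scales it up as the ratio climbs, which is one way the combined techniques could reduce total pre-training FLOPs without changing the number of tokens the encoder sees per step.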