This paper was accepted at the NeurIPS 2023 Workshop on Diffusion Models.
We demonstrate how conditional generation from diffusion models can be used to tackle a variety of practical tasks in the production of music in 44.1kHz stereo audio with sampling-time guidance. The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips. We achieve this by applying guidance at sampling time in a simple framework that supports both reconstruction and classification losses, or any combination of the two. This approach ensures that generated audio can match its surrounding context, or conform to a class distribution or latent representation specified relative to any suitable pre-trained classifier or embedding model.
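As a toy illustration of the sampling-time guidance idea, the sketch below nudges a sample at each step with the gradient of a weighted sum of a reconstruction loss (matching a known context region) and a simple classifier-style loss. This is a didactic stand-in, not the paper's actual sampler or model: the denoising dynamics, loss forms, and all names here are invented for the example.

```python
import numpy as np

def guided_sample(context, mask, steps=200, step_size=0.5,
                  class_target=None, w_rec=1.0, w_cls=1.0, seed=0):
    """Toy guided sampler (hypothetical, for illustration only).

    At each step, move the sample x to reduce a weighted sum of:
      - a reconstruction loss 0.5 * ||mask * (x - context)||^2, which
        pulls x toward the known context where mask == 1;
      - optionally, a 'classifier' loss 0.5 * (mean(x) - class_target)^2,
        standing in for a loss on a pre-trained embedding/classifier.
    Decaying noise mimics the annealed noise schedule of a diffusion sampler.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(context.shape)  # start from pure noise
    for t in range(steps):
        # Analytic gradient of the reconstruction loss w.r.t. x
        g = w_rec * mask * (x - context)
        if class_target is not None:
            # Analytic gradient of the classifier-style loss w.r.t. x
            g = g + w_cls * (x.mean() - class_target) / x.size
        # Gradient step plus noise that decays over the trajectory
        noise = rng.standard_normal(x.shape) * 0.1 * (1.0 - t / steps)
        x = x - step_size * g + noise
    return x

# Usage: "infill"-style task — regenerate the middle of a clip while
# keeping the surrounding context fixed.
context = np.linspace(-1.0, 1.0, 100)
mask = np.ones(100)
mask[40:60] = 0.0  # middle region is free; the rest must match context
x = guided_sample(context, mask)
```

In a real diffusion sampler the guidance gradient would be taken through the model's denoised estimate of the clean signal (and, for classification guidance, through a pre-trained embedding network) rather than through these closed-form toy losses.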
We present randomly chosen samples for various creative applications in Table 1, each conditioned on a given audio prompt. For each task and prompt we show samples from the different models described in the paper.
Task types:
infill: replace the middle two seconds of the prompt
regeneration: regenerate the middle two seconds of the prompt
continuation: generate a new continuation starting from the first 2.4s of the prompt
transitions: regenerate a crossfaded section between two tracks
guidance: generate a new clip conditioned on the PaSST classifier embedding of the prompt
Prompts are drawn from a test split of the Free Music Archive dataset, published by Michaël Defferrard et al. under a Creative Commons Attribution 4.0 International License (CC BY 4.0).