Abstract

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, editing the music these models generate remains a significant challenge. This paper introduces a novel approach to editing music generated by such models, enabling the modification of specific attributes, such as genre, mood, and instrument, while leaving other aspects unchanged. Our method transforms text editing into latent space manipulation and adds a constraint to enforce consistency. It integrates with existing pretrained text-to-music diffusion models without requiring additional training. Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations. Additionally, we showcase the practical applicability of our approach in real-world music editing scenarios.

Overview

1. We propose a new music editing method via direct word swapping.

2. We carry out the word swap by calculating the semantic difference between the original word and its replacement.

3. We add a constraint on the cross-attention maps of the diffusion model to enforce consistency.
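
The three steps above can be illustrated with a small, self-contained sketch. It is not the implementation used in this work: `toy_embed` is a hypothetical bag-of-words stand-in for a real text encoder, the caption templates are made up for illustration, and the mean-squared penalty is only one plausible form of a consistency constraint on attention maps. The sketch shows the general idea of replacing a word swap with an embedding-space difference and of measuring how far the edited attention maps drift from the original ones.

```python
import numpy as np

def toy_embed(text: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for a text encoder (assumption, not the real model).

    Each word maps to a fixed pseudo-random vector; a text is the sum of its words.
    """
    vec = np.zeros(dim)
    for word in text.lower().split():
        # Deterministic per-word seed so the same word always gets the same vector.
        seed = sum(ord(c) * (i + 1) for i, c in enumerate(word))
        vec += np.random.default_rng(seed).standard_normal(dim)
    return vec

def edit_direction(templates, source_word, target_word, embed=toy_embed):
    """Average embedding difference between captions using the target vs. source word."""
    src = np.mean([embed(t.format(word=source_word)) for t in templates], axis=0)
    tgt = np.mean([embed(t.format(word=target_word)) for t in templates], axis=0)
    return tgt - src

def attention_consistency_loss(attn_orig: np.ndarray, attn_edit: np.ndarray) -> float:
    """Mean squared difference between two attention maps of identical shape."""
    return float(np.mean((attn_edit - attn_orig) ** 2))

# Illustrative caption templates (assumed, not taken from the paper).
templates = ["a relaxing {word} melody", "upbeat {word} solo with drums"]
direction = edit_direction(templates, "piano", "violin")

# Step 2: shift the original prompt embedding instead of rewriting the prompt.
prompt_embedding = toy_embed("a relaxing piano melody")
edited_embedding = prompt_embedding + direction

# Step 3: penalize drift between original and edited attention maps (random toy maps).
rng = np.random.default_rng(0)
attn_a = rng.random((4, 8))
attn_b = attn_a + 0.1
loss = attention_consistency_loss(attn_a, attn_b)
```

With this toy encoder the shifted embedding coincides with the embedding of the swapped prompt, because the encoder is purely additive. A real text encoder is contextual, which is why the direction is averaged over several captions here rather than taken from a single word pair.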

Audio samples

Instrument

Piano → violin

pre_piano_to_violin_before.wav

pre_piano_to_violin_after.wav

Saxophone → piano

pre_saxophone_to_piano_before.wav

pre_saxophone_to_piano_after.wav