
Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

Abstract

Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in a language model manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full control, users can further provide textual prompts or even input MIDI directly. Experimental results indicate that MelodyLM achieves superior performance in terms of both objective and subjective metrics.
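As a rough illustration of this two-stage design, the sketch below outlines how the components could fit together. All module names and signatures (e.g., `midi_lm`, `vocal_lm`, `accomp_ldm`, `generate_song`) are hypothetical placeholders for illustration only, not the released implementation.

```python
# A minimal, hypothetical sketch of the two-stage pipeline (not the released code).
def generate_song(models, lyrics=None, target_vocal=None, reference_vocal=None,
                  melody_prompt=None, accomp_prompt=None, target_midi=None):
    midi_lm, vocal_lm, accomp_ldm = models  # hypothetical pretrained modules

    if target_vocal is None:
        # Stage 1a: generate MIDI from lyrics and an optional textual melody prompt,
        # unless the user supplies a MIDI sequence directly.
        if target_midi is None:
            target_midi = midi_lm.generate(lyrics=lyrics, prompt=melody_prompt)
        # Stage 1b: generate the vocal track autoregressively (language-model style),
        # conditioned on lyrics, MIDI, and a reference voice for timbre.
        target_vocal = vocal_lm.generate(lyrics=lyrics, midi=target_midi,
                                         reference=reference_vocal)

    # Stage 2: synthesize the accompaniment with a latent diffusion model, using
    # hybrid conditioning (text prompt + vocal track) to keep it temporally aligned.
    accompaniment = accomp_ldm.sample(prompt=accomp_prompt, vocal=target_vocal)

    # Naive mix: sum the two waveforms (assumes equal length and sample rate).
    return target_vocal + accompaniment
```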

Table of Contents

  1. Abstract
  2. Overview of Demonstration
  3. Target Singing Vocals + Accompaniment Prompts -> Song
  4. Lyrics + Target MIDI + Reference Vocals + Accompaniment Prompts -> Song
  5. Lyrics + Melody Prompts + Reference Vocals + Accompaniment Prompts -> Song
  6. Lyrics + Melody Prompts -> Song
  7. Lyrics -> Song

Overview of Demonstration

We demonstrate the minimal user requirements and maximum control flexibility by gradually decreasing the degree of control. We first show the model's capability under full control, i.e., providing target singing vocals and accompaniment prompts. Then, we gradually lower the user requirements and observe how the model reacts. Note that the lyrics listed below are all transcribed with WhisperX and may therefore differ from the actual lyrics.
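The hypothetical `generate_song` interface sketched after the abstract makes this gradation concrete: each scenario below simply drops more of the optional inputs, and whatever is missing is generated by the model. The snippet is illustrative only; variable names and prompt strings are placeholders.

```python
# Illustrative mapping from the demo scenarios below to the hypothetical interface.
# Anything not passed in is generated (conditionally or unconditionally) by the model.

# 1. Target singing vocals + accompaniment prompts -> song (accompaniment generation only).
song = generate_song(models, target_vocal=target_vocal,
                     accomp_prompt="energetic pop with piano and drums")

# 2. Lyrics + target MIDI + reference vocals + accompaniment prompts -> song.
song = generate_song(models, lyrics=lyrics, target_midi=gt_midi,
                     reference_vocal=ref_vocal,
                     accomp_prompt="energetic pop with piano and drums")

# 3. Lyrics + melody prompts + reference vocals + accompaniment prompts -> song.
song = generate_song(models, lyrics=lyrics,
                     melody_prompt="a gentle, slowly rising melody",
                     reference_vocal=ref_vocal,
                     accomp_prompt="energetic pop with piano and drums")

# 4. Lyrics + melody prompts -> song (timbre and accompaniment are sampled freely).
song = generate_song(models, lyrics=lyrics,
                     melody_prompt="a gentle, slowly rising melody")

# 5. Lyrics -> song (melody, timbre, and accompaniment are all generated unconditionally).
song = generate_song(models, lyrics=lyrics)
```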

Target Singing Vocals + Accompaniment Prompts -> Song

We provide target singing vocals and accompaniment prompts to generate the accompaniment music. In this scenario, the model reduces to an accompaniment generation model.

Lyrics + Target MIDI + Reference Vocals + Accompaniment Prompts -> Song

We drop the target vocal input and provide lyrics, target MIDI sequences, and reference vocals to generate the target vocals, along with the accompaniment. The target MIDI sequences (GT MIDI) are extracted with ROSVOT, which may introduce some errors.

Lyrics + Melody Prompts + Reference Vocals + Accompaniment Prompts -> Song

We drop the target MIDI sequences and replace them with melody prompts to generate the desired melodies.

Lyrics + Melody Prompts -> Song

At this stage, we drop the reference vocals and accompaniment prompts, leaving the model with only the lyrics and the melody prompts to build the melodies. Note that the quality decreases here, since the model generates the voice at random.

Lyrics -> Song

Finally, we explore the scenario in which only the lyrics are available. All remaining attributes are generated unconditionally and at random, resulting in a less predictable performance.