
Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

Abstract

Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in a language model manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full control, users can further provide textual prompts or even input MIDI directly. Experimental results indicate that MelodyLM achieves superior performance in terms of both objective and subjective metrics.
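As a rough illustration of this two-stage design, the sketch below outlines how the components could fit together. All module names and signatures (e.g., `midi_lm`, `vocal_lm`, `accomp_ldm`, `generate_song`) are hypothetical placeholders for illustration only, not the released implementation.

```python
# A minimal, hypothetical sketch of the two-stage pipeline (not the released code).
def generate_song(models, lyrics=None, target_vocal=None, reference_vocal=None,
                  melody_prompt=None, accomp_prompt=None, target_midi=None):
    midi_lm, vocal_lm, accomp_ldm = models  # hypothetical pretrained modules

    if target_vocal is None:
        # Stage 1a: generate MIDI from lyrics and an optional textual melody prompt,
        # unless the user supplies a MIDI sequence directly.
        if target_midi is None:
            target_midi = midi_lm.generate(lyrics=lyrics, prompt=melody_prompt)
        # Stage 1b: generate the vocal track autoregressively (language-model style),
        # conditioned on lyrics, MIDI, and a reference voice for timbre.
        target_vocal = vocal_lm.generate(lyrics=lyrics, midi=target_midi,
                                         reference=reference_vocal)

    # Stage 2: synthesize the accompaniment with a latent diffusion model, using
    # hybrid conditioning (text prompt + vocal track) to keep it temporally aligned.
    accompaniment = accomp_ldm.sample(prompt=accomp_prompt, vocal=target_vocal)

    # Naive mix: sum the two waveforms (assumes equal length and sample rate).
    return target_vocal + accompaniment
```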

Table of Contents

  1. Abstract
  2. Overview of Demonstration
  3. Target Singing Vocals + Accompaniment Prompts -> Song
  4. Lyrics + Target MIDI + Reference Vocals + Accompaniment Prompts -> Song
  5. Lyrics + Melody Prompts + Reference Vocals + Accompaniment Prompts -> Song
  6. Lyrics + Melody Prompts -> Song
  7. Lyrics -> Song

Overview of Demonstration

We demonstrate the minimal user requirements and maximum control flexibility by gradually decreasing the degree of control. We first show the model's capability under full control, i.e., providing target singing vocals and accompaniment prompts. Then, we gradually lower the user requirements and observe how the model reacts. Note that the lyrics listed below are all transcribed with WhisperX and may therefore differ from the actual lyrics.
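The hypothetical `generate_song` interface sketched after the abstract makes this gradation concrete: each scenario below simply drops more of the optional inputs, and whatever is missing is generated by the model. The snippet is illustrative only; variable names and prompt strings are placeholders.

```python
# Illustrative mapping from the demo scenarios below to the hypothetical interface.
# Anything not passed in is generated (conditionally or unconditionally) by the model.

# 1. Target singing vocals + accompaniment prompts -> song (accompaniment generation only).
song = generate_song(models, target_vocal=target_vocal,
                     accomp_prompt="energetic pop with piano and drums")

# 2. Lyrics + target MIDI + reference vocals + accompaniment prompts -> song.
song = generate_song(models, lyrics=lyrics, target_midi=gt_midi,
                     reference_vocal=ref_vocal,
                     accomp_prompt="energetic pop with piano and drums")

# 3. Lyrics + melody prompts + reference vocals + accompaniment prompts -> song.
song = generate_song(models, lyrics=lyrics,
                     melody_prompt="a gentle, slowly rising melody",
                     reference_vocal=ref_vocal,
                     accomp_prompt="energetic pop with piano and drums")

# 4. Lyrics + melody prompts -> song (timbre and accompaniment are sampled freely).
song = generate_song(models, lyrics=lyrics,
                     melody_prompt="a gentle, slowly rising melody")

# 5. Lyrics -> song (melody, timbre, and accompaniment are all generated unconditionally).
song = generate_song(models, lyrics=lyrics)
```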

Target Singing Vocals + Accompaniment Prompts -> Song

We provide target singing vocals and accompaniment prompts to generate the accompaniment music. In this scenario, the model reduces to an accompaniment generation model.

Lyrics + Target MIDI + Reference Vocals + Accompaniment Prompts -> Song

We drop the target vocal input and provide lyrics, target MIDI sequences, and reference vocals to generate the target vocals, along with the accompaniment. The target MIDI sequences (GT MIDI) are extracted with ROSVOT, which may introduce some errors.

Lyrics + Melody Prompts + Reference Vocals + Accompaniment Prompts -> Song

We drop the target MIDI sequences and replace them with melody prompts to generate the desired melodies.

Lyrics + Melody Prompts -> Song

At this stage, we drop the reference vocals and accompaniment prompts, leaving the model with only the lyrics and the melody prompts to build the melodies. Note that the quality decreases here, since the model generates the voice at random.

Lyrics -> Song

Finally, we explore the scenario in which only the lyrics are available. All remaining attributes are generated unconditionally and at random, resulting in a less predictable performance.