Accompanied Singing Voice Synthesis with Fully Text-controlled Melody
Abstract
Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in a language model manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full control, just input textual prompts or even directly input MIDI. Experimental results indicate that MelodyLM achieves superior performance in terms of both objective and subjective metrics.
Table of Contents
- Abstract
- Overview of Demonstration
- Target Singing Vocals + Accompaniment Prompts -> Song
- Lyrics + Target MIDI + Reference Vocals + Accompaniment Prompts -> Song
- Lyrics + Melody Prompts + Reference Vocals + Accompaniment Prompts -> Song
- Lyrics + Melody Prompts -> Song
- Lyrics -> Song
Overview of Demonstration
We demonstrate the minimal user requirements and maximum control flexibility by gradually decreasing the degree of control. We first demonstrate the model's capability under full control, i.e., with input singing vocals and accompaniment prompts. Then, we gradually lower the user requirements and observe how the model responds. Note that the lyrics listed below are all transcribed using WhisperX and may differ from the real lyrics.
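The progression of scenarios can be summarized as a mapping from the inputs a user supplies to the corresponding demo section. The following is a minimal sketch for illustration only: `SongRequest` and `describe_scenario` are hypothetical names and are not part of MelodyLM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SongRequest:
    """Hypothetical container for the inputs a user may supply."""
    lyrics: Optional[str] = None
    target_vocal: Optional[str] = None     # path to target singing vocals
    target_midi: Optional[str] = None      # path to a target MIDI sequence
    melody_prompt: Optional[str] = None    # text describing the melody
    reference_vocal: Optional[str] = None  # path to a reference voice
    accomp_prompt: Optional[str] = None    # text describing the accompaniment


def describe_scenario(req: SongRequest) -> str:
    """Return the name of the demo scenario matching the supplied inputs."""
    if req.target_vocal:
        # Vocals are given, so only the accompaniment needs to be generated.
        parts = ["Target Singing Vocals"]
    else:
        if not req.lyrics:
            raise ValueError("lyrics are required when no target vocal is given")
        parts = ["Lyrics"]
        # Explicit MIDI takes precedence over a textual melody description.
        if req.target_midi:
            parts.append("Target MIDI")
        elif req.melody_prompt:
            parts.append("Melody Prompts")
        if req.reference_vocal:
            parts.append("Reference Vocals")
    if req.accomp_prompt:
        parts.append("Accompaniment Prompts")
    return " + ".join(parts) + " -> Song"
```

For example, a request containing only lyrics maps to the last scenario, "Lyrics -> Song", where every other attribute is left to the model.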
Target Singing Vocals + Accompaniment Prompts -> Song
We provide target singing vocals and accompaniment prompts to generate accompaniment music. In this scenario, the model reduces to an accompaniment generation model.
-
Lyrics: 颜色太惆怅 学生你别慌张十里愁几多愁不如靠在我身上
Pinyin: yan se tai chou chang xue sheng ni bie huang zhang shi li chou ji duo chou bu ru kao zai wo shen shang
Samples (wav): GT Song, GT Vocal, GT Accompaniment, Generated Accomp., Generated Song
Accomp. Prompts: This is a pop music piece. There is a female vocalist singing melodically in the lead. The main melody is being played by the keyboard with the bass guitar playing in the background. The rhythm is provided by an electronic drum beat. The atmosphere is easygoing. This piece could be used in the soundtrack of a sit-com movie or a TV show.
-
Lyrics: 开启梦的多重宇宙一场完美的出走让你陪我仰望山腰
Pinyin: kai qi meng de duo chong yu zhou yi chang wan mei de chu zou rang ni pei wo yang wang shan yao
Samples (wav): GT Song, GT Vocal, GT Accompaniment, Generated Accomp., Generated Song
Accomp. Prompts: This musical recording features a pop song that consists of a passionate male vocal singing over punchy kick and snare hits, shimmering hi hats, wide electric guitar melody, distorted bass and mellow synth pad chords. It sounds energetic and addictive.
-
Lyrics: 害怕什么都失去只是想要有人听听到我的不适应
Pinyin: hai pa shen me dou shi qu zhi shi xiang yao you ren ting ting dao wo de bu shi ying
Samples (wav): GT Song, GT Vocal, GT Accompaniment, Generated Accomp., Generated Song
Accomp. Prompts: This is a pop music piece. There is a male vocalist singing melodically in the lead. The main tune is being played by the acoustic guitar and the electric guitar. The bass guitar is playing in the background. The rhythm is provided by a slow tempo acoustic drum beat. The atmosphere is emotional. This piece could be used in the soundtrack of a romance movie or a TV series.
-
Lyrics: 你给我的爱如花盛开如花落败日渐苍白你转身离开留下空白
Pinyin: ni gei wo de ai ru hua sheng kai ru hua luo bai ri jian cang bai ni zhuan shen li kai liu xia kong bai
Samples (wav): GT Song, GT Vocal, GT Accompaniment, Generated Accomp., Generated Song
Accomp. Prompts: This musical recording features a flat female vocal singing over sustained strings melody and mellow piano melody. It sounds emotional, passionate and the recording is noisy and in mono.
-
Lyrics: 你脸上的唇印暧昧得很仔细甜得像茉莉花开的香气
Pinyin: ni lian shang de chun yin ai mei de hen zi xi tian de xiang mo li hua kai de xiang qi
Samples (wav): GT Song, GT Vocal, GT Accompaniment, Generated Accomp., Generated Song
Accomp. Prompts: This is a teen pop music piece. There is a female vocalist singing melodically. The main tune is being played by the acoustic guitar and the electric guitar while the bass guitar is playing in the background. The rhythm is provided by an electronic drum beat. The atmosphere is easygoing. This piece could be used in the soundtrack of a teenage drama TV show.
-
Lyrics: 其实对我来说不算惊讶我可以歇斯底里避免了尴尬
Pinyin: qi shi dui wo lai shuo bu suan jing ya wo ke yi xie si di li bi mian le gan ga
Samples (wav): GT Song, GT Vocal, GT Accompaniment, Generated Accomp., Generated Song
Accomp. Prompts: The recorded music features a pop song that consists of harmonizing male vocals singing over mellow piano chords. It sounds passionate and emotional, even though the recording is noisy and in mono.
-
Lyrics: 只有我懂你晚安的意义是想你的讯号
Pinyin: zhi you wo dong ni wan an de yi yi shi xiang ni de xun hao
Samples (wav): GT Song, GT Vocal, GT Accompaniment, Generated Accomp., Generated Song
Accomp. Prompts: A male singer sings this cool hip hop melody with backup singers in vocal harmony. The song is medium tempo with a piano accompaniment, percussive bass line, steady drumming rhythm, clapping percussions, and keyboard accompaniment. The clip is emotional and romantic with a cool dance groove.
Lyrics + Target MIDI + Reference Vocals + Accompaniment Prompts -> Song
We drop the target vocal input and instead provide lyrics, target MIDI sequences, and vocal references to generate the target vocals along with the accompaniment. The target MIDI sequences (GT MIDI) are extracted using ROSVOT and may therefore contain some transcription errors.
-
Lyrics: 当记忆的线缠绕过往支离破碎
Pinyin: dang ji yi de xian chan rao guo wang zhi li po sui
Samples (wav): GT Song, GT Vocal, GT MIDI
Accomp. Prompts: The recording features a song that consists of a flat female vocal singing over shimmering hi hats, punchy kick and groovy bass. The recording is noisy and in mono, as it was probably recorded with a phone.
Samples (wav): Ref. Vocal, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 完美无缺幸福兜了一个圈那些美好的兜圈
Pinyin: wan mei wu que xing fu dou le yi ge quan na xie mei hao de dou quan
Samples (wav): GT Song, GT Vocal, GT MIDI
Accomp. Prompts: This slow pop song features a male voice singing the main melody. This is accompanied by percussion playing a simple beat. The bass plays the root notes of the chords. A guitar plays chords in the background. The mood of this song is romantic. There are no other instruments in this song. This song can be played in a romantic movie.
Samples (wav): Ref. Vocal, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 千万记得天涯有人在等待
Pinyin: qian wan ji de tian ya you ren zai deng dai
Samples (wav): GT Song, GT Vocal, GT MIDI
Accomp. Prompts: This recording demonstrates a cover of a pop song and it consists of a passionate female vocal singing over sustained strings melody and mellow piano chords. It sounds emotional, passionate and the recording is noisy and in mono.
Samples (wav): Ref. Vocal, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 只想余生有个伴苦了酸甜一起分担
Pinyin: zhi xiang yu sheng you ge ban ku le suan tian yi qi fen dan
Samples (wav): GT Song, GT Vocal, GT MIDI
Accomp. Prompts: This slow pop song features a male voice singing the main melody. The voice is emotional. This is accompanied by a bass playing the root notes of the chords. After one line, a synth swell is played. A piano plays chords in the background. The mood of this song is romantic. This song can be played in a romantic movie.
Samples (wav): Ref. Vocal, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 拼尽全力挣扎只为能成为不同人啊你好累对吗
Pinyin: pin jin quan li zheng zha zhi wei neng cheng wei bu tong ren a ni hao lei dui ma
Samples (wav): GT Song, GT Vocal, GT MIDI
Accomp. Prompts: There is a male vocalist singing melodically. The main tune is being played by the acoustic guitar and the electric guitar while the bass guitar is playing in the background. The rhythm is provided by a simple acoustic drum beat. The atmosphere is religious.
Samples (wav): Ref. Vocal, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 汽笛声回荡在空气落雨的痕迹被遗忘代替
Pinyin: qi di sheng hui dang zai kong qi luo yu de hen ji bei yi wang dai ti
Samples (wav): GT Song, GT Vocal, GT MIDI
Accomp. Prompts: A female vocalist sings this melancholic melody. The tempo is slow with a romantic piano accompaniment. The song is soft, mellow, poignant, emotional,sentimental, romantic, melancholic, sad, lonely,and wistful. This song is a Pop.
Samples (wav): Ref. Vocal, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 我们并肩走过许多地方你的出现也曾像夜的极光
Pinyin: wo men bing jian zou guo xu duo di fang ni de chu xian ye ceng xiang ye de ji guang
Samples (wav): GT Song, GT Vocal, GT MIDI
Accomp. Prompts: This is a techno/house music piece. There is a female vocalist singing melodically in the lead. The melody is being played by a piano while the bass guitar is playing in the background. The rhythm is provided by a slow tempo acoustic drum beat. The atmosphere is mellow. This piece can be used in the soundtrack of a teenage drama TV series as the opening theme.
Samples (wav): Ref. Vocal, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 盛开在夏天
Pinyin: sheng kai zai xia tian
Samples (wav): GT Song, GT Vocal, GT MIDI
Accomp. Prompts: This is a pop music piece. There is a male vocalist singing melodically in the lead. The main melody is being played by the acoustic guitar and the electric guitar while the bass guitar is playing in the background. The rhythm is provided by a simple acoustic drum beat. The atmosphere is easygoing. This piece could be used in the soundtrack of a teenage drama TV series as the opening theme.
Samples (wav): Ref. Vocal, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 好像现在的我
Pinyin: hao xiang xian zai de wo
Samples (wav): GT Song, GT Vocal, GT MIDI
Accomp. Prompts: The recorded music features a hip hop song that consists of a flat male vocal singing over punchy kick and snare hits, shimmering hi hats and groovy bass.
Samples (wav): Ref. Vocal, Generated Vocal, Generated Accomp., Generated Song
Lyrics + Melody Prompts + Reference Vocals + Accompaniment Prompts -> Song
We drop the target MIDI sequences and replace them with melody prompts, i.e., textual descriptions from which the model generates the desired melodies.
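The melody prompts in the examples of this section each describe a handful of attributes: the key, the pitch level, the tempo, optionally the duration, and the emotion. The following sketch shows how such attributes could be assembled into a textual prompt. The function name and the sentence template are assumptions made for illustration; they are not the authors' prompt generator, and the actual prompts use more varied wording.

```python
from typing import Optional


def build_melody_prompt(key: str, pitch: str, tempo: str, mood: str,
                        duration: Optional[str] = None) -> str:
    """Assemble a melody description from attribute values (hypothetical template)."""
    text = f"A melody in {key} with {pitch} pitch and {tempo} tempo"
    if duration:
        # Duration is optional: some example prompts omit it.
        text += f", lasting {duration}"
    return text + f"; mood: {mood}."


prompt = build_melody_prompt(
    key="G major",
    pitch="high",
    tempo="slow",
    mood="upbeat and lively",
    duration="a medium period of time",
)
print(prompt)
```

Keeping the attributes separate from the wording makes it easy to vary one attribute, such as the key or the tempo, while holding the others fixed.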
-
Lyrics: 你奔跑的方向是没我的地方
Pinyin: ni ben pao de fang xiang shi mei wo de di fang
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: This melody, anchored in G major, progresses at a high pitch and slow tempo over a medium period of time, draped in upbeat and lively aura.
Accomp. Prompts: The recorded music showcases a song that consists of a passionate female vocal, alongside wide harmonizing vocals, singing over wooden percussion, shimmering bells melody and groovy bass. It sounds happy, fun and joyful.
Samples (wav): Ref. Vocal, Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 只想有你在身边 Yeah
Pinyin: zhi xiang you ni zai shen bian yeah
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: Featuring a melody in C sharp minor with a relatively high pitch and low tempo over a short period, the song segment carries a easygoing tone.
Accomp. Prompts: This musical recording features a pop song that consists of a passionate female vocal singing over punchy kick, shimmering hi hats, mellow synth bass and simple synth lead melody. It sounds emotional, passionate and heartfelt.
Samples (wav): Ref. Vocal, Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 太阳不会放弃天空哪怕你不再属于我
Pinyin: tai yang bu hui fang qi tian kong na pa ni bu zai shu yu wo
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: This melody, within the tonal scale of D-flat major, carrying a high pitch and quick tempo for 10 seconds, emits sentimental energy.
Accomp. Prompts: This is a pop music piece. There is a female vocalist singing melodically in the lead. The melody is being played by the keyboard while the bass guitar is playing in the background. The rhythm is provided by a slow tempo acoustic drum beat. The atmosphere is emotional. This piece could be used in the soundtrack of a romance movie, especially during the scenes where a character is trying to break free.
Samples (wav): Ref. Vocal, Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 宁可当作失败宁可重新再来
Pinyin: ning ke dang zuo shi bai ning ke chong xin zai lai
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: The song’s melody, composed in C sharp major, holds a high pitch and maintains a rapid pace, covering 5 seconds, filled with passionate and emotional qualities.
Accomp. Prompts: This music is an Electronica instrumental. The tempo is medium with a melodic pad and rhythmic acoustic guitar accompaniment. The music is soft, panned to the left channel of the stereo image. It is powerful, sentimental,emotional, passionate and emotional.
Samples (wav): Ref. Vocal, Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 就请你告诉他你的名字我的名字
Pinyin: jiu qing ni gao su ta ni de ming zi wo de ming zi
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: This part of the song, in A minor with a melody at a average pitch and average tempo lasting 10 seconds, carries a deep soulful feel.
Accomp. Prompts: The recording showcases a R&B song that consists of a passionate male vocal, alongside harmonizing background male vocals, singing over sustained strings melody, mellow piano chords, shimmering hi hats, punchy kick and snappy rimshots. It sounds emotional, passionate and soulful.
Samples (wav): Ref. Vocal, Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 直到生命最后一天
Pinyin: zhi dao sheng ming zui hou yi tian
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: The melody, flowing in E- major at a pitch of high and a tempo of fast, showcases lively tone.
Accomp. Prompts: This is an R&B music piece. There is a female vocalist singing melodically. The melody is being played by the keyboard while there is a bass guitar playing in the background. The rhythm is provided by a simple acoustic drum beat. The atmosphere is easygoing. This piece could be used in the soundtrack of a romantic movie, especially during the scenes where a character is reminiscing the good memories.
Samples (wav): Ref. Vocal, Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 千万记得天涯有人在等待
Pinyin: qian wan ji de tian ya you ren zai deng dai
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: This segment’s melody in C sharp minor, with its relatively high pitch and swift tempo across 7 seconds, breathes a passionate essence.
Accomp. Prompts: This recording demonstrates a cover of a pop song and it consists of a passionate female vocal singing over sustained strings melody and mellow piano chords. It sounds emotional, passionate and the recording is noisy and in mono.
Samples (wav): Ref. Vocal, Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 离别总在失意中度过
Pinyin: li bie zong zai shi yi zhong du guo
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: This melody, set in B major with a average pitch and rapid tempo, lasts a medium period of time and carries a lively and happy aura.
Accomp. Prompts: This pop song features a male voice singing the main melody. Female voices sing backing vocals in harmony. This is accompanied by percussion playing a simple beat. The bass plays the root notes of the chords. Trumpets play a repetitive melody in the background. The mood of this song is happy. This song can be played in a romantic movie.
Samples (wav): Ref. Vocal, Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 我假装看不见我假装看不见因为害怕说抱歉
Pinyin: wo jia zhuang kan bu jian wo jia zhuang kan bu jian yin wei hai pa shuo bao qian
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: Set in the tonal realm of B flat major, this melody’s average pitch and average tempo through 7 seconds echo with happy essence.
Accomp. Prompts: This is a pop music piece. There is a male vocalist singing melodically in the lead. The main melody is being played by the keyboard while the bass guitar is playing in the background. The rhythm is provided by a slow tempo acoustic drum beat. The atmosphere is cheerful. This piece could be used in the soundtrack of a teenage drama TV series.
Samples (wav): Ref. Vocal, Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
Lyrics + Melody Prompts -> Song
At this stage, we drop the reference vocals and accompaniment prompts, leaving the model only the lyrics and the melody prompts from which to build the melodies. Note that the quality decreases here, since without a reference vocal the model tends to generate voices at random.
-
Lyrics: 宁可当作失败宁可重新再来
Pinyin: ning ke dang zuo shi bai ning ke chong xin zai lai
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: This segment, keyed in A-sharp minor, moves at a swift tempo with a pitch of relatively high, colored by passionate and emotional qualities.
Samples (wav): Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 让疯狂慢慢从爱情离开
Pinyin: rang feng kuang man man cong ai qing li kai
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: This piece in D-flat major resonates with a medium pitch at a gentle tempo for a medium period of time, filled with religious emotions.
Samples (wav): Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 爱你的路
Pinyin: ai ni de lu
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: In G major, the melody rises with a relatively high pitch and flows at a rapid tempo for a short period, oozing passionate and emotional mood.
Samples (wav): Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
-
Lyrics: 撑到现在早没什么用
Pinyin: cheng dao xian zai zao mei shen me yong
Samples (wav): GT Song, GT Vocal, GT MIDI
Melody Prompts: The melody in this part of the song, set in G minor with a average pitch at a quick tempo for a short period of time, envelops the listener with easygoing atmosphere.
Samples (wav): Generated MIDI, Generated Vocal, Generated Accomp., Generated Song
Lyrics -> Song
Finally, we explore the scenario in which only the lyrics are available. All remaining attributes are generated unconditionally and at random, so the resulting performance is unpredictable.
-
Lyrics: 铸成一个死结
Pinyin: zhu cheng yi ge si jie
Samples (wav): GT Song, GT Vocal, Generated Vocal, Generated Song
-
Lyrics: 自己定的规则却不肯复刻无可奈何的感觉又无从改善我累了
Pinyin: zi ji ding de gui ze que bu ken fu ke wu ke nai he de gan jue you wu cong gai shan wo lei le
Samples (wav): GT Song, GT Vocal, Generated Vocal, Generated Song