1. What is singing voice synthesis?
<aside>
💡 Singing voice synthesis (SVS) is the task of generating a natural singing voice from a given musical score. With the development of deep generative models, research on synthesizing high-quality singing voices has grown rapidly, and as SVS performance improves, the technology is increasingly being applied to the production of actual music content.
</aside>
2. Challenges
<aside>
💡 There are various challenges in designing a singing synthesis system that can freely generate high-quality, natural-sounding singing voices.
</aside>
2.1 Dataset
- The first problem is that constructing a dataset is difficult. Singing recordings are hard to release publicly because of copyright and related issues, which limits the collection of public singing datasets. It is also difficult to obtain clean singing voices, since many recordings are released mixed with accompaniment. Finally, singing synthesis requires not only clean singing but also the corresponding musical score information, and annotating it is time-consuming and costly.
- To address this problem, several lines of research are being pursued: 1) modeling singing effectively with only a small amount of data (LiteSing, Sinsy), 2) building datasets from sources available on the web using techniques such as source separation and automatic transcription (DeepSinger), and 3) creating and releasing singing datasets that are free from copyright issues, such as children's songs (CSD).
- In our lab, we have collected 200 songs and conduct research with them. First, we purchase accompaniment MIDI files for K-POP songs from a MIDI accompaniment producer, then hire an amateur singer to record each song along with its accompaniment. We then manually correct small timing and pitch differences between the actual singing and the melody MIDI to obtain paired song-and-score data (a minimal sketch of this kind of check follows this list). Using this data, we work on singing voice synthesis modeling while also pursuing transcription and alignment studies to obtain more refined annotations.
- Dataset example (audio, MIDI stereo)
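As a rough illustration of the manual correction step described above, the sketch below compares a recorded vocal against its melody MIDI and flags notes whose sung pitch or timing deviates noticeably, so annotators know where to look. The file names, track index, and 50-cent threshold are illustrative assumptions, not values from our actual pipeline.

```python
# Minimal sketch: flag pitch/timing mismatches between a recorded vocal
# and its melody MIDI so annotators know where manual correction is needed.
# File names ("song.wav", "melody.mid") and the 50-cent threshold are
# illustrative assumptions.
import librosa
import numpy as np
import pretty_midi

SR = 44100

audio, _ = librosa.load("song.wav", sr=SR)
f0, voiced_flag, _ = librosa.pyin(
    audio, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=SR
)
times = librosa.times_like(f0, sr=SR)

midi = pretty_midi.PrettyMIDI("melody.mid")
melody = midi.instruments[0]  # assume track 0 holds the vocal melody

for note in melody.notes:
    # Collect voiced F0 frames that fall inside this note's span.
    mask = (times >= note.start) & (times < note.end) & voiced_flag
    if not mask.any():
        print(f"{note.start:6.2f}s  no voiced frames -> check timing")
        continue
    sung_midi = librosa.hz_to_midi(np.nanmedian(f0[mask]))
    deviation = (sung_midi - note.pitch) * 100  # in cents
    if abs(deviation) > 50:  # more than a quarter tone off
        print(f"{note.start:6.2f}s  pitch off by {deviation:+.0f} cents -> check pitch")
```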
2.2 Sound quality
- With advances in speech synthesis research, generating results of adequate quality for speech has become much easier. However, for singing synthesis technology to be used in the real industry, studio-quality results are required. We are therefore exploring different methods for generating 44.1 kHz audio.
- Unlike speech, a singing voice 1) has a wide pitch range, 2) contains many notes with long durations, and 3) must be modeled at a high sampling rate. We are approaching these problems based on recent research such as HiFi-GAN, NSF, and Parallel WaveGAN. Applying these speech-focused models as-is causes several problems in high-quality singing modeling (artifacts at high and low pitches, glitches, etc.). We are therefore developing a singing vocoder that combines GAN-based vocoders for high quality with the pitch robustness provided by a source excitation signal (see the excitation sketch after this list).
- Singing vocoder audio samples (ongoing research)
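To make the source-excitation idea above concrete, here is a minimal sketch of an NSF-style excitation signal: a sine wave whose instantaneous frequency follows the F0 contour, with noise dominating in unvoiced regions. The hop size, sampling rate, and amplitude values are illustrative choices, not the settings of any particular vocoder we use.

```python
# Minimal sketch of an NSF-style source-excitation signal. Parameter values
# (hop size, amplitudes) are illustrative assumptions.
import numpy as np

def make_excitation(f0_frames, sr=44100, hop=256, sine_amp=0.1, noise_std=0.003):
    """f0_frames: frame-level F0 in Hz, 0 where unvoiced."""
    # Upsample frame-level F0 to sample level by simple repetition.
    f0 = np.repeat(f0_frames, hop).astype(np.float64)
    voiced = f0 > 0
    # Integrate instantaneous frequency to get phase, then take a sine.
    phase = 2 * np.pi * np.cumsum(f0 / sr)
    excitation = sine_amp * np.sin(phase) * voiced
    # Add Gaussian noise everywhere; it dominates in unvoiced regions.
    excitation += noise_std * np.random.randn(len(excitation))
    return excitation.astype(np.float32)

# Toy usage: 100 frames of a 220 Hz tone followed by 50 unvoiced frames.
f0_contour = np.concatenate([np.full(100, 220.0), np.zeros(50)])
source = make_excitation(f0_contour)
print(source.shape)  # (150 * 256,) samples to condition a neural vocoder on
```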
2.3 Expressiveness & Controllability
- With the development of singing synthesis research, many systems can now accurately interpret a given score and render it with high quality. However, to serve as a tool that helps creators make music, a system must not only sing accurately but also reflect and control various styles. Unlike concrete elements such as lyrics and pitch, style is a more abstract concept, and defining, modeling, and controlling it is a challenging problem.
- As the most basic step toward this problem, we designed a multi-singer singing synthesis model that can reflect the characteristics of different singers. In particular, we introduced a methodology that controls singer identity by separating it into voice timbre and singing style. Using this model, it is therefore possible to combine the voice of one singer with the singing style of another (a minimal conditioning sketch follows below).
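The sketch below illustrates the separation idea described above: each singer gets two separate embeddings, one for voice timbre and one for singing style, so the two can be mixed across singers. The dimensions, class names, and the way the conditioning vector would be consumed by a decoder are illustrative assumptions, not the exact architecture of our model.

```python
# Minimal PyTorch sketch of timbre/style-separated singer conditioning.
# Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SingerCondition(nn.Module):
    def __init__(self, n_singers, timbre_dim=128, style_dim=128):
        super().__init__()
        self.timbre_emb = nn.Embedding(n_singers, timbre_dim)
        self.style_emb = nn.Embedding(n_singers, style_dim)

    def forward(self, timbre_id, style_id):
        # Concatenate the two embeddings into one conditioning vector that a
        # singing-synthesis decoder could attend to or be FiLM-conditioned on.
        return torch.cat(
            [self.timbre_emb(timbre_id), self.style_emb(style_id)], dim=-1
        )

cond = SingerCondition(n_singers=10)
singer_a = torch.tensor([3])
singer_b = torch.tensor([7])
# Voice timbre of singer A sung in the singing style of singer B.
mixed = cond(singer_a, singer_b)
print(mixed.shape)  # torch.Size([1, 256])
```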