The pipeline of our Intensity Extractor model. The inputs are a pair of mixture speech samples. Each sample is a mix of the same non-neutral and neutral speech, but with different weights applied. The objective is to rank these two mixtures correctly; if the ranking succeeds, the extracted intensity representation is meaningful.

The goal is to develop a fine-grained emotional Text-To-Speech (TTS) model that allows control over the emotional intensity of each word or phoneme. This level of control makes it possible to express different attitudes or intentions even within the same emotion. However, labeling intensity variations over time is nearly impossible, so effective emotion intensity representations must be learned without labels.

We first train an Intensity Extractor to provide intensity representations. The Extractor is built on a novel Rank model, a straightforward yet effective approach to extracting emotion intensity information that considers both inter- and intra-class distances. The ranking is performed on two samples augmented by Mixup (“mixup: Beyond Empirical Risk Minimization”, ICLR 2018). Each augmented sample is a mix of the same non-neutral and neutral speech. By applying different weights to the non-neutral and neutral speech, one mixture contains more non-neutral components than the other; in other words, the non-neutral intensity of one mixture is stronger than that of the other. By learning to rank these two mixtures, the Rank model must not only determine the emotion class (inter-class distance) but also capture the amount of non-neutral emotion present in the mixed speech, i.e., the intensity of the non-neutral emotion (intra-class distance).
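To make the objective concrete, here is a minimal PyTorch sketch of one Mixup-based ranking step. The mixing weights, the `intensity_extractor` interface, and the choice of a margin ranking loss are illustrative assumptions, not the exact implementation from the paper.

```python
import torch
import torch.nn as nn

def mixup_pair(emotional, neutral, lam_strong, lam_weak):
    """Mix the same emotional/neutral pair with two different weights.

    Since lam_strong > lam_weak, the first mixture contains more of the
    non-neutral component, i.e., its emotion intensity is stronger.
    """
    strong = lam_strong * emotional + (1.0 - lam_strong) * neutral
    weak = lam_weak * emotional + (1.0 - lam_weak) * neutral
    return strong, weak

# Illustrative ranking objective; `intensity_extractor` is assumed to map
# speech features to a scalar intensity score.
rank_loss = nn.MarginRankingLoss(margin=0.1)

def rank_step(intensity_extractor, emotional, neutral):
    strong, weak = mixup_pair(emotional, neutral, lam_strong=0.8, lam_weak=0.3)
    s_strong = intensity_extractor(strong)
    s_weak = intensity_extractor(weak)
    # target = 1 requires the stronger mixture to receive the higher score
    target = torch.ones_like(s_strong)
    return rank_loss(s_strong, s_weak, target)
```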

The training pipeline of the TTS model. A pre-trained and frozen Intensity Extractor supplies intensity information; the TTS model then takes the phoneme sequence, speaker ID, and intensity as inputs and learns to reconstruct the Mel-Spectrogram.

With a pre-trained Intensity Extractor, we then train a FastSpeech 2 TTS model (“FastSpeech 2: Fast and High-Quality End-to-End Text to Speech”, ICLR 2021). The architecture remains the same as in the original paper, but here the Intensity Extractor is incorporated to supply conditional intensity information.
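The sketch below illustrates one plausible way to wire a frozen Intensity Extractor into FastSpeech 2. The `encode`/`decode` interface and the linear projection that injects the intensity representation into the phoneme encoding are our assumptions for illustration; the paper describes the exact integration.

```python
import torch
import torch.nn as nn

class IntensityConditionedTTS(nn.Module):
    """Sketch: FastSpeech 2 conditioned on intensity from a frozen extractor."""

    def __init__(self, fastspeech2, intensity_extractor, intensity_dim, hidden_dim):
        super().__init__()
        self.tts = fastspeech2
        self.extractor = intensity_extractor
        # Freeze the pre-trained Intensity Extractor so only the TTS trains
        for p in self.extractor.parameters():
            p.requires_grad = False
        # Project the intensity representation to the encoder's hidden size
        self.proj = nn.Linear(intensity_dim, hidden_dim)

    def forward(self, phonemes, speaker_id, reference_mel):
        with torch.no_grad():
            intensity = self.extractor(reference_mel)   # intensity representation
        hidden = self.tts.encode(phonemes, speaker_id)  # assumed encoder interface
        hidden = hidden + self.proj(intensity)          # inject intensity condition
        return self.tts.decode(hidden)                  # reconstructed Mel-Spectrogram
```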

Accuracy of emotion intensity ranking. Min, Median, and Max are three intensity levels. Subjects are asked to select the sample with the stronger intensity from a pair.

The first subjective experiment shows that, with the same spoken content, listeners find it easier to discern intensity differences in the speech synthesized by our model.

Preference test for emotion expressiveness.

The second subjective experiment is an A/B preference test on emotion expressiveness. The results indicate that listeners perceive the emotion more clearly in the speech synthesized by our model.

MCD and Naturalness MOS results.

Finally, Mel-Cepstral Distortion (MCD) and Naturalness Mean Opinion Score (MOS) evaluations are conducted on all synthesized samples. The results show that the quality and naturalness of the speech synthesized by our model surpass those of all baseline models.
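For reference, MCD is conventionally computed as (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²), averaged over time-aligned frames. Below is a minimal NumPy sketch, assuming the mel-cepstra are already aligned (e.g., via DTW) and excluding the 0th (energy) coefficient by convention.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """MCD in dB between two time-aligned mel-cepstral sequences.

    mc_ref, mc_syn: arrays of shape (frames, dims); the 0th (energy)
    coefficient is excluded by convention.
    """
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    # Per-frame distortion, then averaged over all frames
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```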

To conclude, we propose a fine-grained controllable emotional TTS based on a novel Rank model. The Rank model captures both inter- and intra-class distance information and is thus able to produce meaningful intensity representations. We conduct subjective and objective tests to evaluate our model; the experimental results show that it surpasses two state-of-the-art baselines in intensity controllability, emotion expressiveness, and naturalness.

The results of this study were presented at ICASSP 2023 in Rhodes, Greece.