--------------------------------> Abstract <---------------------------------


Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. Typically, emotions are treated as discrete categories, and EVC is implemented as a mapping between these categories, which cannot express anything in between. In this paper, we study how to characterize and explicitly control the intensity of emotion. We propose to disentangle the speaker style from the linguistic content and to encode the style into an embedding in a continuous space, which forms the prototype of the emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate both an emotion classification loss and an emotion embedding similarity loss into the training of the EVC network. The proposed network allows the fine-grained emotion intensity of the output speech to be controlled as desired. We report the performance of the proposed framework in terms of spectrum, prosody, and duration conversion quality on the ESD database. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network.
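The training recipe combines a reconstruction objective with the emotion classification loss and the emotion embedding similarity loss mentioned above. The PyTorch-style sketch below is only an illustration of how such a combined objective could be wired up; the module interfaces (encode_emotion, the SER call) and the weights lambda_cls and lambda_sim are hypothetical and do not come from the paper.

import torch
import torch.nn.functional as F

def training_losses(model, ser, batch, lambda_cls=1.0, lambda_sim=1.0):
    """Hedged sketch of a combined EVC objective (hypothetical API).

    model : an encoder-decoder EVC network exposing `encode_emotion` and `forward`
    ser   : a pre-trained speech emotion recognition model used as supervision
    batch : dict with source mel `mel_src`, target mel `mel_tgt`, emotion label `emo_id`
    """
    mel_src, mel_tgt, emo_id = batch["mel_src"], batch["mel_tgt"], batch["emo_id"]

    # Reconstruct the target spectrogram from the source and a target emotion code.
    emo_emb = model.encode_emotion(mel_tgt)      # style/emotion embedding
    mel_out = model(mel_src, emo_emb)            # converted spectrogram

    # Reconstruction loss on the spectrogram.
    loss_rec = F.l1_loss(mel_out, mel_tgt)

    # Emotion classification loss: the converted speech should be recognized
    # as the target emotion by the pre-trained SER.
    logits = ser(mel_out)
    loss_cls = F.cross_entropy(logits, emo_id)

    # Emotion embedding similarity loss: the embedding extracted from the
    # converted speech should stay close to the reference emotion embedding.
    emo_emb_conv = model.encode_emotion(mel_out)
    loss_sim = 1.0 - F.cosine_similarity(emo_emb_conv, emo_emb, dim=-1).mean()

    return loss_rec + lambda_cls * loss_cls + lambda_sim * loss_sim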


----------------------------> Model Architecture <-----------------------


Figure 1. Overall framework of Emovox at the emotion training stage. A learned relative ranking function automatically predicts the emotion intensity of the input emotional speech. A pre-trained speech emotion recognition (SER) model serves as emotion supervision to improve the emotional intelligibility of the output speech.

The code for this paper is publicly available here.
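The relative ranking function mentioned in the caption of Figure 1 follows the general relative-attributes idea: a ranker is trained on pairs where one utterance should score higher in emotional intensity than the other, and its normalized score is read as an intensity in [0, 1]. The sketch below is a minimal NumPy illustration of that idea, with hypothetical feature inputs and a deliberately simplistic update rule rather than the paper's actual training procedure.

import numpy as np

def train_relative_ranker(x_strong, x_weak, lr=0.01, epochs=100, margin=1.0):
    """Tiny pairwise ranker in the spirit of relative attributes (illustrative only).

    x_strong, x_weak : arrays of acoustic features (n_pairs, dim), where each
    `strong` row should be ranked above its paired `weak` row (e.g. emotional
    vs. neutral utterances). Returns a weight vector w.
    """
    dim = x_strong.shape[1]
    w = np.zeros(dim)
    for _ in range(epochs):
        # Hinge-style update: push w.(x_strong - x_weak) above the margin.
        diff = x_strong - x_weak          # (n_pairs, dim)
        scores = diff @ w
        violated = scores < margin
        if not violated.any():
            break
        w += lr * diff[violated].mean(axis=0)
    return w

def intensity(w, x, x_all):
    """Map the ranker score of utterance features x to a [0, 1] intensity."""
    scores = x_all @ w
    lo, hi = scores.min(), scores.max()
    return float(np.clip((x @ w - lo) / (hi - lo + 1e-8), 0.0, 1.0))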



-----------------------------> Speech Samples <---------------------------

(A) Emotion Intensity Evaluation

Columns: Source (Neutral) | Intensity = 0.1 (Weakest) | Intensity = 0.3 (Weak) | Intensity = 0.6 (Strong) | Intensity = 0.9 (Strongest)
Neutral-to-Angry (converting neutral to angry with different intensities):
Neutral-to-Sad (converting neutral to sad with different intensities):
Neutral-to-Happy (converting neutral to happy with different intensities):


Comparison with other intensity control methods

Experimental Setup:
(1) Emovox w/ Scaling Factor: the emotion embedding is multiplied by a scaling factor [1];
(2) Emovox w/ Attention Weights: the attention weight vector obtained from a pre-trained SER is used to represent the intensity [2];
(3) Emovox w/ Relative Attributes: our proposed method with relative attributes, as described in Figure 1. A schematic sketch of how the three setups might modulate the emotion embedding follows this list.
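As a rough schematic of how the three setups could inject an intensity signal into the conversion model (assuming a single emotion embedding vector per utterance and a per-class weight vector from the SER; none of these function names or shapes come from the paper):

import numpy as np

def control_with_scaling_factor(emo_emb, alpha):
    # (1) Scaling factor: multiply the emotion embedding by a scalar in [0, 1].
    return alpha * emo_emb

def control_with_attention_weights(class_embs, attn_weights):
    # (2) Attention weights: mix per-class emotion embeddings with weights
    #     taken from a pre-trained SER's attention/posterior vector
    #     (one plausible reading of this setup, not the exact recipe of [2]).
    attn = np.asarray(attn_weights) / (np.sum(attn_weights) + 1e-8)
    return attn @ class_embs          # (num_classes, dim) -> (dim,)

def control_with_relative_attribute(emo_emb, neu_emb, intensity):
    # (3) Relative attributes: one plausible use of the learned intensity score
    #     is to interpolate between a neutral-style embedding and the
    #     target-emotion embedding.
    return (1.0 - intensity) * neu_emb + intensity * emo_emb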
Columns: Source (Neutral) | Intensity = 0.1 (Weak) | Intensity = 0.5 (Medium) | Intensity = 0.9 (Strong)
Neutral-to-Angry: Samples #1–#4, each converted with (1) Emovox w/ Scaling Factor, (2) Emovox w/ Attention Weights, and (3) Emovox w/ Relative Attributes.
Neutral-to-Happy: Samples #1–#4, each converted with (1) Emovox w/ Scaling Factor, (2) Emovox w/ Attention Weights, and (3) Emovox w/ Relative Attributes.
Neutral-to-Sad: Samples #1–#4, each converted with (1) Emovox w/ Scaling Factor, (2) Emovox w/ Attention Weights, and (3) Emovox w/ Relative Attributes.



(B) Emotion Similarity Evaluation

Experimental Setup:
(1) CycleGAN-EVC [3] (baseline): CycleGAN-based emotional voice conversion with the WORLD vocoder [4], where the fundamental frequency (F0) is analyzed with a continuous wavelet transform (see the sketch after this list);
(2) StarGAN-EVC [5] (baseline): StarGAN-based emotional voice conversion with the WORLD vocoder [4];
(3) Seq2Seq-EVC [7] (baseline): sequence-to-sequence emotional voice conversion with the Parallel WaveGAN vocoder [6];
(4) Emovox (proposed): our proposed sequence-to-sequence emotional voice conversion framework with Parallel WaveGAN [6], as described in Figure 1.
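For readers unfamiliar with the CWT-based F0 representation used by CycleGAN-EVC, the sketch below shows one common way to decompose a log-F0 contour into multiple temporal scales with PyWavelets; the interpolation, normalization, and scale choices are illustrative assumptions, not the exact recipe of [3].

import numpy as np
import pywt

def cwt_f0_decomposition(f0, num_scales=10):
    """Decompose a log-F0 contour into multiple temporal scales with a CWT."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    # Interpolate over unvoiced frames so the contour is continuous.
    f0_interp = np.interp(np.arange(len(f0)), np.flatnonzero(voiced), f0[voiced])
    logf0 = np.log(f0_interp)
    logf0 = (logf0 - logf0.mean()) / (logf0.std() + 1e-8)
    # Dyadic scales; the Mexican-hat wavelet is a common choice for F0 modelling.
    scales = 2.0 ** np.arange(1, num_scales + 1)
    coeffs, _ = pywt.cwt(logf0, scales, "mexh")
    return coeffs      # (num_scales, num_frames) multi-scale F0 representation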
Columns: Source | CycleGAN-EVC | StarGAN-EVC | Seq2Seq-EVC | Emovox (proposed) | Target
Neutral-to-Angry:
Neutral-to-Sad:
Neutral-to-Happy:

(C) Ablation Studies

Experimental Setup:
(1) Emovox w/o Intensity: proposed Emovox without the intensity control module;
(2) Emovox w/o Intensity and Lcls: proposed Emovox without the intensity control module and the emotion classification loss Lcls;
(3) Emovox w/o Intensity and Lsim: proposed Emovox without the intensity control module and the emotion embedding similarity loss Lsim;
(4) Emovox w/o Intensity and Lcls and Lsim: proposed Emovox without the intensity control module, the emotion classification loss Lcls, and the emotion embedding similarity loss Lsim. A loss-weight view of these settings follows this list.
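Relative to the hypothetical objective sketched after the abstract, these ablations simply zero out the corresponding loss weights (an illustration only, not the authors' configuration):

# Hypothetical loss weights per ablation setting, reusing the earlier sketch.
ABLATIONS = {
    "Emovox w/o Intensity":                   dict(lambda_cls=1.0, lambda_sim=1.0),
    "Emovox w/o Intensity and Lcls":          dict(lambda_cls=0.0, lambda_sim=1.0),
    "Emovox w/o Intensity and Lsim":          dict(lambda_cls=1.0, lambda_sim=0.0),
    "Emovox w/o Intensity and Lcls and Lsim": dict(lambda_cls=0.0, lambda_sim=0.0),
}
# All four variants additionally disable the intensity control module.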
Columns: Source | Emovox w/o Intensity and Lcls and Lsim | Emovox w/o Intensity and Lsim | Emovox w/o Intensity and Lcls | Emovox w/o Intensity
Neutral-to-Angry:
Neutral-to-Sad:
Neutral-to-Happy:

(D) More Samples

We extend our experiments with a new female speaker ("0016") from the ESD dataset. Here are some demos:
Columns: Source (Neutral) | Emovox (Converted Angry) | Emovox (Converted Happy) | Emovox (Converted Sad)

References:

[1] H. Choi and M. Hahn, "Sequence-to-Sequence Emotional Voice Conversion with Strength Control," IEEE Access, vol. 9, pp. 42674–42687, 2021.

[2] B. Schnell and P. N. Garner, "Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction," in Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), pp. 60–65.

[3] K. Zhou, B. Sisman, and H. Li, "Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data," in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 230–237.

[4] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.

[5] G. Rizos, A. Baird, M. Elliott, and B. Schuller, "StarGAN for Emotional Speech Conversion: Validated by Data Augmentation of End-to-End Emotion Recognition," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 3502–3506.

[6] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.

[7] K. Zhou, B. Sisman, and H. Li, "Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training," in Proc. Interspeech 2021, 2021, pp. 811–815.