----------------------------> Model Architecture <-----------------------


Figure 1. The proposed 2-stage training strategy for seq2seq emotional voice conversion with limited emotional speech data.

The code for this paper is available here.



-----------------------------> Speech Samples <---------------------------



Experimental Setup:

(1) CycleGAN-EVC (baseline) [1]: CycleGAN-based emotional voice conversion with WORLD vocoder.
(2) StarGAN-EVC (baseline) [2]: StarGAN-based emotional voice conversion with WORLD vocoder.
(3) Seq2seq-EVC-GL (proposed): Seq2seq-EVC followed by a Griffin-Lim vocoder.
(4) Seq2seq-EVC-WA1 (proposed): Seq2seq-EVC followed by a WaveRNN vocoder pre-trained on the VCTK corpus.
(5) Seq2seq-EVC-WA2 (proposed): Seq2seq-EVC followed by a WaveRNN vocoder pre-trained on the VCTK corpus and fine-tuned with a limited amount of emotional speech data.
Note: CycleGAN-EVC can only perform one-to-one conversion, so we train a separate CycleGAN-EVC for each emotion pair. Both StarGAN-EVC and the proposed Seq2seq-EVC use a single unified model for all emotion pairs.
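For readers unfamiliar with the Griffin-Lim vocoder used in Seq2seq-EVC-GL, the sketch below shows the classic Griffin-Lim iteration: starting from random phase, it alternates between the time and frequency domains while keeping the given STFT magnitude fixed. The STFT parameters here are illustrative only, not the settings used in the paper.

```python
import numpy as np
from scipy.signal import stft, istft

# Illustrative analysis parameters (assumed, not the paper's actual settings).
N_FFT, HOP = 1024, 256

def griffin_lim(magnitude, n_iter=32):
    """Estimate a waveform from an STFT magnitude via Griffin-Lim phase estimation."""
    rng = np.random.default_rng(0)
    # Initialize with random phase; the magnitude is re-imposed at every step.
    spec = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Back to the time domain, then re-analyze to get an updated phase.
        _, y = istft(spec, nperseg=N_FFT, noverlap=N_FFT - HOP)
        _, _, reproj = stft(y, nperseg=N_FFT, noverlap=N_FFT - HOP)
        spec = magnitude * np.exp(1j * np.angle(reproj))
    _, y = istft(spec, nperseg=N_FFT, noverlap=N_FFT - HOP)
    return y
```

In the full systems (Seq2seq-EVC-WA1/WA2) this step is replaced by a neural WaveRNN vocoder, which generally yields higher perceptual quality than iterative phase estimation.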

Emotion Similarity Test

Source CycleGAN-EVC StarGAN-EVC Seq2seq-EVC-WA1 Seq2seq-EVC-WA2 Target
(1) Neutral-to-Angry
(2) Neutral-to-Happy
(3) Neutral-to-Sad
(4) Neutral-to-Surprise

Speech Quality Test

Seq2seq-EVC-GL Seq2seq-EVC-WA1 Seq2seq-EVC-WA2
(1) Neutral-to-Angry
(2) Neutral-to-Happy
(3) Neutral-to-Sad
(4) Neutral-to-Surprise
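Listening tests like the two above are commonly scored on 5-point opinion scales. Assuming that setup, a minimal sketch of how per-system results could be summarized as a mean opinion score with a normal-approximation 95% confidence interval (the ratings below are made up for illustration):

```python
import numpy as np

def mean_opinion_score(scores, z=1.96):
    """Mean of listener ratings with a 95% CI half-width (normal approximation)."""
    s = np.asarray(scores, dtype=float)
    mean = s.mean()
    # Sample standard deviation (ddof=1) over the number of listeners.
    half_width = z * s.std(ddof=1) / np.sqrt(s.size)
    return mean, half_width

# Hypothetical ratings from five listeners on a 1-5 scale.
m, hw = mean_opinion_score([4, 5, 4, 3, 5])
```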
[1] K. Zhou, B. Sisman, and H. Li, “Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data,” in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 230–237.

[2] G. Rizos, A. Baird, M. Elliott, and B. Schuller, “StarGAN for Emotional Speech Conversion: Validated by Data Augmentation of End-to-End Emotion Recognition,” in Proc. ICASSP, 2020, pp. 3502–3506.