----------------------------> Model Architecture <-----------------------


Figure 1. The training phase of the proposed DeepEST framework. Blue boxes represent the networks involved in training, and red boxes represent the networks that have already been trained.

The code for this paper is available here.

------------------------> Emotional Speech Corpus <-----------------------

We introduce a new Mandarin-English emotional speech corpus that consists of 350 parallel utterances with an average duration of 2.5 seconds, spoken by 10 native speakers (5 male and 5 female) in five emotions (neutral, sad, happy, angry and surprise) for each language (Mandarin and English). All the speech data are sampled at 16 kHz and saved in 16-bit format. The transcripts are provided in both Chinese and English.
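As a quick sanity check on the audio format, a minimal Python sketch for loading one utterance is shown below. The directory layout and file names are hypothetical placeholders, not the released corpus structure.

```python
# Minimal sketch: read one corpus utterance and verify the 16 kHz sampling rate.
# The path below is an assumption for illustration only.
import soundfile as sf

def load_utterance(path):
    """Read a 16 kHz, 16-bit PCM wav file and return (samples, sampling rate)."""
    audio, sr = sf.read(path)                      # float samples in [-1, 1]
    assert sr == 16000, f"expected 16 kHz audio, got {sr} Hz"
    return audio, sr

# Hypothetical file name; adjust to the actual corpus layout.
wav, sr = load_utterance("corpus/speaker01/neutral/mandarin_0001.wav")
print(f"{len(wav) / sr:.2f} s at {sr} Hz")
```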

This emotional speech corpus is developed for emotional voice conversion (EVC) and is also suitable for voice conversion (VC) and text-to-speech (TTS). The whole speech corpus is publicly available for research purposes only. Please feel free to use this dataset to help with your own research.

The whole emotional speech corpus can be accessed here. Please cite this paper if you use this dataset.


-----------------------------> Speech Samples <---------------------------



VAW-GAN-EVC [1]: A state-of-the-art VAW-GAN-based EVC framework whose decoder is conditioned on F0 and a one-hot emotion ID (a conditioning sketch is given after the notes below).
DeepEST (proposed): The proposed one-to-many controllable VAW-GAN-based EVC framework, as described in Figure 1.
Note:
(1) Angry is the unseen emotion for DeepEST, while all emotions are seen emotions for VAW-GAN-EVC.
(2) DeepEST is a one-to-many conversion model covering all three EVC tasks, while VAW-GAN-EVC is a one-to-one model trained separately for each of the three tasks.
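The sketch below illustrates the conditioning idea mentioned above: frame-level latent codes are concatenated with an F0 value and a one-hot emotion ID before being decoded. It is a minimal PyTorch illustration; the layer sizes, feature dimensions, and names are assumptions, not the exact DeepEST or VAW-GAN-EVC architecture.

```python
# Sketch of a decoder conditioned on F0 and a one-hot emotion ID.
# Dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 5          # neutral, sad, happy, angry, surprise
LATENT_DIM = 64           # assumed latent size
FEAT_DIM = 36             # assumed spectral feature size per frame

class ConditionedDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = LATENT_DIM + 1 + NUM_EMOTIONS   # latent + F0 + emotion one-hot
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM),
        )

    def forward(self, z, f0, emotion_id):
        # z: (batch, frames, LATENT_DIM); f0: (batch, frames, 1); emotion_id: (batch,)
        onehot = F.one_hot(emotion_id, NUM_EMOTIONS).float()    # (batch, NUM_EMOTIONS)
        onehot = onehot.unsqueeze(1).expand(-1, z.size(1), -1)  # broadcast over frames
        return self.net(torch.cat([z, f0, onehot], dim=-1))

# Usage: decode 100 frames toward one target emotion (index order assumed above).
decoder = ConditionedDecoder()
z = torch.randn(1, 100, LATENT_DIM)
f0 = torch.rand(1, 100, 1)
out = decoder(z, f0, torch.tensor([2]))    # index 2 = "happy" under the assumed order
print(out.shape)                           # torch.Size([1, 100, 36])
```

Because the emotion ID enters only as a conditioning vector, a single decoder of this form can serve multiple target emotions, which is what makes the one-to-many setup possible.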

Neutral-to-Happy

Source VAW-GAN-EVC DeepEST Target

Neutral-to-Sad

Source VAW-GAN-EVC DeepEST Target

Neutral-to-Angry (Angry is an unseen emotion for DeepEST)

Source VAW-GAN-EVC DeepEST Target
[1] Kun Zhou, Berrak Sisman, Mingyang Zhang and Haizhou Li, "Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion", in Proc. INTERSPEECH, Shanghai, China, October 2020.