----------------------------> Model Architecture <-----------------------


Fig. 1. The training phase of the proposed VAW-GAN-based emotional voice conversion framework with the WORLD vocoder. Red boxes are involved in training, while grey boxes are not.

Fig. 2. The run-time conversion phase of the proposed VAW-GAN-based emotional voice conversion framework with the WORLD vocoder. Blue boxes represent the networks trained during the training phase.


The code for this research is available here.



--------------------> Speech Samples (Neutral to Angry) <-------------------



CWT-VAW: A VAW-GAN system that converts the spectrum and CWT-based F0 features without conditioning the generator on F0.
CWT-C-VAW (proposed): The proposed VAW-GAN-based EVC framework shown in Fig. 1, in which the generator is conditioned on F0.

1 - Speech Quality Evaluation

CWT-VAW | CWT-C-VAW (proposed)

2 - Emotion Similarity Evaluation (1) (Seen Speakers)

Source | CWT-VAW | CWT-C-VAW (proposed) | Reference

SD-CWT-C-VAW: The proposed framework trained on data from a single specific speaker only.
CWT-C-VAW: The proposed framework trained on data from multiple speakers.

2 - Emotion Similarity Evaluation (2) (Seen Speaker)

Source | SD-CWT-C-VAW | CWT-C-VAW | Reference


We also test SD-CWT-C-VAW and CWT-C-VAW on unseen speakers:

2 - Emotion Similarity Evaluation (3) (Unseen Speaker)

Source | SD-CWT-C-VAW | CWT-C-VAW | Reference

3 - Speaker Similarity Evaluation

SD-CWT-C-VAW | CWT-C-VAW | Reference