Converting Anyone's Emotion: Towards Speaker-independent Emotional Voice Conversion

----------------------------> Model Architecture <-----------------------

Fig.1 The training phase of the proposed VAW-GAN-based emotional voice conversion framework with WORLD vocoder. Red boxes are involved in the training, while grey boxes are not.

Fig.2 The run-time conversion phase of the proposed VAW-GAN-based emotional voice conversion framework with WORLD vocoder. Blue boxes represent the networks which have been trained during the training phase.

The code for this research is available here.

--------------------> Speech Samples (Neutral to Angry) <-------------------

CWT-VAW: VAW-GAN system that converts spectrum and CWT-based F0 without conditioning the generator on F0.

CWT-C-VAW (proposed): The proposed VAW-GAN-based emotional voice conversion (EVC) framework shown in Fig.1.
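The difference between the two systems above is whether the generator sees F0 as an extra conditioning input. A minimal sketch of that difference, assuming the generator input is a per-frame concatenation of the encoder's latent code, a one-hot emotion code, and (for CWT-C-VAW only) the F0 features. The helper name `build_generator_input` and all dimensions (64 latent dimensions, 2 emotions, 10 F0 dimensions) are hypothetical and not taken from the released code:

```python
import numpy as np


def build_generator_input(z, emotion_id, num_emotions, f0=None):
    """Concatenate per-frame latent codes with conditioning features.

    z:          (frames, latent_dim) latent codes from the encoder
    emotion_id: integer index of the target emotion
    f0:         optional (frames, f0_dim) F0 features;
                None mimics the CWT-VAW setting (no F0 conditioning)
    """
    frames = z.shape[0]
    # One-hot emotion code, repeated for every frame.
    emotion = np.zeros((frames, num_emotions), dtype=z.dtype)
    emotion[:, emotion_id] = 1.0
    parts = [z, emotion]
    if f0 is not None:
        if f0.shape[0] != frames:
            raise ValueError("f0 and z must have the same number of frames")
        # CWT-C-VAW setting: append F0 features as extra conditioning.
        parts.append(f0.astype(z.dtype))
    return np.concatenate(parts, axis=1)


# Hypothetical shapes, chosen only for illustration.
rng = np.random.default_rng(0)
z = rng.standard_normal((100, 64)).astype(np.float32)
f0 = rng.standard_normal((100, 10)).astype(np.float32)

x_unconditioned = build_generator_input(z, emotion_id=1, num_emotions=2)
x_conditioned = build_generator_input(z, emotion_id=1, num_emotions=2, f0=f0)
print(x_unconditioned.shape, x_conditioned.shape)  # (100, 66) (100, 76)
```

The only change between the two calls is the optional `f0` argument, so the conditioned input is the unconditioned one with the F0 columns appended.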

1 - Speech Quality Evaluation
	CWT-VAW	CWT-C-VAW (proposed)

2 - Emotion Similarity Evaluation (1) (Seen Speakers)
	Source	CWT-VAW	CWT-C-VAW (proposed)	Reference

SD-CWT-C-VAW: The proposed framework trained on data from a single speaker only (speaker-dependent).

CWT-C-VAW: The proposed framework trained on data from multiple speakers.

2 - Emotion Similarity Evaluation (2) (Seen Speaker)
	Source	SD-CWT-C-VAW	CWT-C-VAW	Reference

We also test SD-CWT-C-VAW and CWT-C-VAW on unseen speakers:

2 - Emotion Similarity Evaluation (3) (Unseen Speaker)
	Source	SD-CWT-C-VAW	CWT-C-VAW	Reference

3 - Speaker Similarity Evaluation
	SD-CWT-C-VAW	CWT-C-VAW	Reference