Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

Fig.1 The training phase of the proposed CycleGAN-based emotional VC framework, where WORLD acts as the vocoder. CWT is used to decompose F0 into 10 scales. Blue boxes represent the networks that are trained, while grey boxes represent the blocks that require no training.

Fig.2 The run-time conversion phase of the proposed CycleGAN-based emotional VC framework. Pink boxes represent the networks that are already trained.
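The CWT decomposition of F0 into 10 scales mentioned in Fig.1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Mexican hat wavelet, the dyadic scale spacing, the linear interpolation over unvoiced frames, and the function names (`interpolate_unvoiced`, `mexican_hat`, `cwt_f0`) are all assumptions made for illustration.

```python
import numpy as np


def interpolate_unvoiced(f0):
    """Linearly interpolate through unvoiced frames (marked by f0 == 0)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("F0 contour contains no voiced frames")
    idx = np.arange(len(f0))
    return np.interp(idx, idx[voiced], f0[voiced])


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet; an assumed choice of wavelet."""
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)


def cwt_f0(f0, num_scales=10, base_scale=1.0):
    """Decompose an F0 contour into `num_scales` CWT components.

    Returns an array of shape (num_frames, num_scales).
    Scales are assumed to be dyadic: base_scale * 2**i frames.
    """
    log_f0 = np.log(interpolate_unvoiced(f0))
    # Zero-mean, unit-variance normalization of the continuous log-F0 contour.
    norm = (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)

    n = len(norm)
    coeffs = np.zeros((n, num_scales))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** i
        # Truncate the kernel so it is never longer than the signal;
        # otherwise np.convolve(mode="same") would return the kernel length.
        half = min(int(5 * scale), (n - 1) // 2)
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        coeffs[:, i] = np.convolve(norm, kernel, mode="same")
    return coeffs
```

Each column of the returned array is one temporal scale of the prosody contour, from fast frame-level variation (first column) to slow utterance-level trends (last column).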

The code for this research is available here.

Baseline: Conventional CycleGAN-based VC framework with LG normalized F0 transformation

CycleGAN-Joint: Joint training CycleGAN-based emotional VC framework with CWT-F0 transformation

CycleGAN-Separate (Proposed): Separate training CycleGAN-based emotional VC framework with CWT-F0 transformation
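For comparison with the CWT-based systems, the F0 transformation used by the Baseline can be sketched as follows, assuming "LG" refers to the common logarithm Gaussian normalized transformation, which linearly maps log-F0 from the source statistics to the target statistics. The function name `lg_convert_f0` and its parameters are hypothetical, and the statistics are assumed to be the mean and standard deviation of log-F0 over voiced frames.

```python
import numpy as np


def lg_convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Convert an F0 contour with a log-Gaussian normalized linear mapping.

    mu_* and sigma_* are assumed to be the mean and standard deviation of
    log-F0, computed over voiced frames of the source and target data.
    Unvoiced frames (f0 == 0) are left at 0.
    """
    f0_src = np.asarray(f0_src, dtype=float)
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    # Normalize with source statistics, then rescale with target statistics.
    f0_conv[voiced] = np.exp((log_f0 - mu_src) / sigma_src * sigma_tgt + mu_tgt)
    return f0_conv
```

Because this mapping only shifts and scales the whole log-F0 contour, it changes the overall pitch level and range but not the shape of the contour at different time scales.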

-----------------------> Emotional Speech Samples <-----------------------

	Source	Baseline	CycleGAN-Joint	CycleGAN-Separate	Target
Neutral-to-Angry

Neutral-to-Sad

Neutral-to-Surprise