Fig. 1 The training phase of the proposed CycleGAN-based emotional VC framework, with WORLD as the vocoder. CWT decomposes F0 into 10 scales. Blue boxes denote the networks that are trained, while grey boxes denote blocks that require no training.

Fig. 2 The run-time conversion phase of the proposed CycleGAN-based emotional VC framework. Pink boxes denote the networks that are already trained.
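
The CWT decomposition of F0 into 10 scales mentioned in Fig. 1 can be illustrated with a minimal sketch. It is not the released code: the Mexican hat mother wavelet, the dyadic scale spacing, the `base_scale` parameter, and the function names `mexican_hat` and `cwt_f0` are all assumptions made for illustration. The input is assumed to be a log-F0 contour that has already been interpolated over unvoiced frames.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet, assumed here for illustration."""
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def cwt_f0(log_f0, num_scales=10, base_scale=1.0):
    """Decompose a continuous log-F0 contour into `num_scales` dyadic scales.

    Returns an array of shape (num_scales, num_frames).
    `base_scale` is a hypothetical parameter, expressed in frames.
    """
    x = np.asarray(log_f0, dtype=float)
    # Normalize to zero mean and unit variance before the transform.
    x = (x - x.mean()) / (x.std() + 1e-8)

    frames = np.arange(len(x))
    # lag[t, j] = t - j: offset between the output frame and each input frame.
    lag = frames[:, None] - frames[None, :]

    out = np.empty((num_scales, len(x)))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** (i + 1)  # scales one octave apart
        kernel = mexican_hat(lag / scale) / np.sqrt(scale)
        out[i] = kernel @ x  # direct (non-FFT) evaluation of the transform
    return out
```

Small scales capture fast, local F0 movement, and large scales capture slow, phrase-level trends. The direct matrix form is quadratic in the number of frames, which is acceptable for a single utterance but not tuned for speed.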


The code for this research is available here.

Baseline: Conventional CycleGAN-based VC framework with LG-normalized F0 transformation
CycleGAN-Joint: Joint training CycleGAN-based emotional VC framework with CWT-F0 transformation
CycleGAN-Separate (Proposed): Separate training CycleGAN-based emotional VC framework with CWT-F0 transformation
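
The LG-normalized F0 transformation used by the baseline is commonly implemented as a linear mapping in the log-F0 domain. The following is a minimal sketch under that assumption: the function name `lg_convert_f0` and its parameters are hypothetical, and the means and standard deviations are assumed to be log-F0 statistics computed over the voiced frames of the source and target emotions.

```python
import numpy as np


def lg_convert_f0(f0, src_mean, src_std, tgt_mean, tgt_std):
    """Map source F0 to the target log-F0 mean and standard deviation.

    Unvoiced frames (F0 == 0) are left at zero.
    """
    f0 = np.asarray(f0, dtype=float)
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    # Standardize in the log domain, then rescale to the target statistics.
    log_f0 = np.log(f0[voiced])
    converted[voiced] = np.exp((log_f0 - src_mean) / src_std * tgt_std + tgt_mean)
    return converted
```

Because this mapping only shifts and scales the whole contour, it cannot reshape local F0 movement, which is the limitation the CWT-based systems above are meant to address.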


-----------------------> Emotional Speech Samples <-----------------------



Source Baseline CycleGAN-Joint CycleGAN-Separate Target
Neutral-to-Angry
Neutral-to-Sad
Neutral-to-Surprise