--------------------------------> Abstract <---------------------------------


Emotional voice conversion (EVC) aims to convert the emotional state of an utterance from one to another while preserving the linguistic content and speaker identity. Current studies mostly focus on modelling the conversion between several specific emotion types. Synthesizing mixed effects of emotions could help us better imitate human emotions and facilitate more natural human-computer interaction. In this research, for the first time, we formulate and study the research problem of mixed emotion synthesis for EVC. We regard emotion styles as a series of emotion attributes that are learnt from a ranking-based support vector machine (SVM). Each attribute measures the degree of relevance between speech recordings belonging to different emotion types. We then incorporate those attributes into a sequence-to-sequence (seq2seq) emotional voice conversion framework. During training, the framework learns not only to characterize the input emotion style but also to quantify its relevance to other emotion types. At run-time, various emotion mixtures can be produced by manually defining the attributes. We conduct objective and subjective evaluations to validate our idea in terms of mixed emotion synthesis. We further build an emotion transition system as an application study.
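As a rough illustration of the ranking idea described above, the following sketch trains a linear ranking function with a pairwise hinge loss, so that recordings of one emotion score higher than recordings of another, and then rescales the scores to [0, 1]. It is a minimal sketch under stated assumptions and not the paper's implementation: the 2-D feature vectors, the function names (`train_ranker`, `normalized_attribute`), the plain subgradient-descent solver, and the min-max normalization are all hypothetical choices made for this example.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_ranker(higher, lower, epochs=200, lr=0.05, reg=0.01, seed=0):
    """Learn w so that dot(w, h) exceeds dot(w, l) by a margin for every pair.

    Pairwise hinge loss with L2 regularization, minimized by subgradient
    descent (an assumed solver, chosen only to keep the sketch dependency-free).
    """
    rng = random.Random(seed)
    w = [0.0] * len(higher[0])
    pairs = [(h, l) for h in higher for l in lower]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for h, l in pairs:
            diff = [a - b for a, b in zip(h, l)]
            violated = dot(w, diff) < 1.0                # margin not yet met
            w = [wi * (1.0 - lr * reg) for wi in w]      # L2 shrinkage
            if violated:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def normalized_attribute(w, x, all_features):
    """Min-max rescale the ranking score of x to [0, 1] (assumed normalization)."""
    scores = [dot(w, f) for f in all_features]
    lo, hi = min(scores), max(scores)
    return (dot(w, x) - lo) / (hi - lo)


# Hypothetical toy features: one group per emotion type.
surprise_feats = [[2.0, 1.0], [2.5, 0.5], [3.0, 1.5]]
other_feats = [[0.5, 1.0], [1.0, 0.2], [0.0, 1.2]]

w = train_ranker(surprise_feats, other_feats)
everything = surprise_feats + other_feats
attrs = [normalized_attribute(w, x, everything) for x in everything]
```

On this toy data, every entry in `surprise_feats` ends up with a higher ranking score than every entry in `other_feats`, and `attrs` gives each sample a value between 0 and 1 that could serve as a per-utterance attribute.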


The code for this paper is publicly available here.


At run-time, we convert Neutral to four different emotion mixtures. We summarize our experiments as follows:

Target Emotion (A)    Adding Emotion (B)    Mixed Effects (A+B)
------------------    ------------------    -------------------
Happy                 Surprise              Excitement
Angry                 Surprise              Outrage
Sad                   Surprise              Disappointment
Happy                 Sad                   Bittersweet
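To make "manually defining the attributes" concrete, the sketch below builds one attribute vector per mixture setting. It is only an illustration under stated assumptions: the function name `mixture_attributes`, the dictionary layout, and the convention that the target emotion is set to 1.0 while the remaining emotions are set to 0.0 are hypothetical and not taken from the released code. The emotion names and the percentages (0%, 30%, 60%, 90%) are those used in the speech samples on this page.

```python
# Emotion types that appear on this page (Neutral is the source style).
EMOTIONS = ("Angry", "Happy", "Sad", "Surprise")


def mixture_attributes(target, added, percentage):
    """Return a hypothetical attribute vector for a target/added emotion pair.

    Assumed convention: the target emotion is fully present (1.0), the added
    emotion is present at `percentage` percent, and all other emotions are 0.0.
    """
    if target not in EMOTIONS or added not in EMOTIONS:
        raise ValueError("unknown emotion type")
    if target == added:
        raise ValueError("target and added emotion must differ")
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must lie in [0, 100]")
    attrs = {emotion: 0.0 for emotion in EMOTIONS}
    attrs[target] = 1.0
    attrs[added] = percentage / 100
    return attrs


# The four mixtures from the table, each swept over the same percentages.
MIXTURES = {
    "Excitement": ("Happy", "Surprise"),
    "Outrage": ("Angry", "Surprise"),
    "Disappointment": ("Sad", "Surprise"),
    "Bittersweet": ("Happy", "Sad"),
}

sweeps = {
    name: [mixture_attributes(a, b, p) for p in (0, 30, 60, 90)]
    for name, (a, b) in MIXTURES.items()
}
```

Under these assumptions, `sweeps["Outrage"][1]` holds the setting for Angry with 30% Surprise, and the first entry of every sweep reduces to a plain single-emotion conversion.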

----------------------------> System Overview <----------------------------



Fig.1 Training Diagram.
Fig.2 Run-time Diagram.

----------------------------> Speech Samples <-----------------------------


(A) Converting Neutral to Outrage

In this section, we convert Neutral to Angry while introducing different percentages (0%, 30%, 60%, 90%) of Surprise into the mixture. We aim to synthesize a mixed feeling of Outrage.
Source Neutral | Converted Angry | Converted Angry with 30% Surprise | Converted Angry with 60% Surprise | Converted Angry with 90% Surprise | Ground-truth Angry | Ground-truth Surprise

(B) Converting Neutral to Excitement

In this section, we convert Neutral to Happy while introducing different percentages (0%, 30%, 60%, 90%) of Surprise into the mixture. We aim to synthesize a mixed feeling of Excitement.
Source Neutral | Converted Happy | Converted Happy with 30% Surprise | Converted Happy with 60% Surprise | Converted Happy with 90% Surprise | Ground-truth Happy | Ground-truth Surprise

(C) Converting Neutral to Disappointment

In this section, we convert Neutral to Sad while introducing different percentages (0%, 30%, 60%, 90%) of Surprise into the mixture. We aim to synthesize a mixed feeling of Disappointment.
Source Neutral | Converted Sad | Converted Sad with 30% Surprise | Converted Sad with 60% Surprise | Converted Sad with 90% Surprise | Ground-truth Sad | Ground-truth Surprise

(D) Converting Neutral to Bittersweet

In this section, we convert Neutral to Happy while introducing different percentages (0%, 30%, 60%, 90%) of Sad into the mixture. We aim to synthesize a mixed feeling of Bittersweet.
Source Neutral | Converted Happy | Converted Happy with 30% Sad | Converted Happy with 60% Sad | Converted Happy with 90% Sad | Ground-truth Happy | Ground-truth Sad