Kun Zhou (1,2), Berrak Sisman (2), Carlos Busso (2) and Haizhou Li (1,3)
zhoukun@u.nus.edu, berraksisman@u.nus.edu, busso@utdallas.edu, haizhou.li@nus.edu.sg
(1) Dept. of Electrical and Computer Engineering, National University of Singapore, Singapore
(2) Department of Electrical and Computer Engineering, The University of Texas at Dallas, Texas, U.S.A.
(3) The Chinese University of Hong Kong, Shenzhen, China
Emotional voice conversion (EVC) aims to convert the emotional state of an utterance from one to another while preserving the linguistic content and speaker identity. Current studies mostly focus on modelling the conversion between several specific emotion types.
Synthesizing mixed effects of emotions could help us to better imitate human emotions, and facilitate more natural human-computer interaction.
In this work, we formulate and study, for the first time, the problem of mixed emotion synthesis for EVC.
We regard emotion styles as a series of emotion attributes learnt by a ranking-based support vector machine (SVM). Each attribute measures the degree of relevance between speech recordings belonging to different emotion types. We then incorporate these attributes into a sequence-to-sequence (seq2seq) emotional voice conversion framework. During training, the framework not only learns to characterize the input emotion style, but also quantifies its relevance to other emotion types. At run-time, various emotion mixtures can be produced by manually defining the attributes. We conduct objective and subjective evaluations to validate our idea in terms of mixed emotion synthesis. We further build an emotion transition system as an application study.
The code for this paper is publicly available here.
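To make the idea of a ranking-learnt attribute more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified pairwise formulation: a linear scorer is trained with a hinge loss by subgradient descent so that recordings in one group score higher than recordings in another. The function names (`train_ranker`, `attribute`), the hyperparameters, and the 2-D feature vectors are all hypothetical and chosen only for illustration.

```python
import random


def train_ranker(higher, lower, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Fit w so that w.x_h exceeds w.x_l by a margin for every (higher, lower) pair."""
    rng = random.Random(seed)
    w = [0.0] * len(higher[0])
    pairs = [(h, l) for h in higher for l in lower]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for h, l in pairs:
            diff = [a - b for a, b in zip(h, l)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # L2 regularisation: shrink w slightly on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            # Hinge loss: only pairs ranked with margin < 1 update w.
            if margin < 1.0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def attribute(w, x):
    """Ranking score of one feature vector; higher means more of the attribute."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Made-up 2-D features (assumption): the first dimension separates the groups.
lower = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
higher = [[0.9, 0.2], [0.8, 0.1], [1.0, 0.3]]

w = train_ranker(higher, lower)
lower_scores = [attribute(w, x) for x in lower]
higher_scores = [attribute(w, x) for x in higher]
print(min(higher_scores) > max(lower_scores))  # True for this toy data
```

The learnt score plays the role of one attribute value: it orders recordings by how strongly they exhibit the property separating the two groups. A real system would use acoustic features and a proper SVM solver rather than this toy loop.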
At run-time, we convert Neutral to four different emotion mixtures.
We summarize our experiments below:
Target Emotion (A) | Adding Emotion (B) | Mixed Effects (A+B)
Happy              | Surprise           | Excitement
Angry              | Surprise           | Outrage
Sad                | Surprise           | Disappointment
Happy              | Sad                | Bittersweet
----------------------------> System Overview <----------------------------
(A) Converting Neutral to Outrage
In this section, we convert Neutral to Angry while introducing different percentages (0%, 30%, 60%, 90%) of Surprise into the mixture. We aim to synthesize a mixed feeling of Outrage.
Source Neutral
Converted Angry
Converted Angry with 30% Surprise
Converted Angry with 60% Surprise
Converted Angry with 90% Surprise
Ground-truth Angry
Ground-truth Surprise
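The percentages used in this section can be pictured as entries of a manually defined attribute vector. The sketch below is only an illustration under stated assumptions: the emotion ordering in `EMOTIONS`, the helper name `mixed_attribute_vector`, and the choice to hold the target emotion at full strength (1.0) with unlisted emotions at zero are hypothetical and are not specified on this page.

```python
# Hypothetical ordering of the attribute dimensions (assumption).
EMOTIONS = ("Angry", "Happy", "Sad", "Surprise")


def mixed_attribute_vector(target, added, percentage):
    """Return one attribute value per emotion for a target/added mixture.

    Assumption: the target emotion is kept at 1.0, the added emotion is set
    to percentage / 100, and every other emotion is set to 0.0.
    """
    if target not in EMOTIONS or added not in EMOTIONS:
        raise ValueError("unknown emotion")
    if target == added:
        raise ValueError("target and added emotion must differ")
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    values = {name: 0.0 for name in EMOTIONS}
    values[target] = 1.0
    values[added] = percentage / 100.0
    return [values[name] for name in EMOTIONS]


# The four Neutral-to-Outrage settings listed in this section.
for pct in (0, 30, 60, 90):
    print(pct, mixed_attribute_vector("Angry", "Surprise", pct))
```

Under these assumptions, the 0% setting reduces to a plain Neutral-to-Angry conversion, and raising the percentage increases only the Surprise entry.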
(B) Converting Neutral to Excitement
In this section, we convert Neutral to Happy while introducing different percentages (0%, 30%, 60%, 90%) of Surprise into the mixture. We aim to synthesize a mixed feeling of Excitement.
Source Neutral
Converted Happy
Converted Happy with 30% Surprise
Converted Happy with 60% Surprise
Converted Happy with 90% Surprise
Ground-truth Happy
Ground-truth Surprise
(C) Converting Neutral to Disappointment
In this section, we convert Neutral to Sad while introducing different percentages (0%, 30%, 60%, 90%) of Surprise into the mixture. We aim to synthesize a mixed feeling of Disappointment.
Source Neutral
Converted Sad
Converted Sad with 30% Surprise
Converted Sad with 60% Surprise
Converted Sad with 90% Surprise
Ground-truth Sad
Ground-truth Surprise
(D) Converting Neutral to Bittersweet
In this section, we convert Neutral to Happy while introducing different percentages (0%, 30%, 60%, 90%) of Sad into the mixture. We aim to synthesize a mixed feeling of Bittersweet.