--------------------------------> Abstract <---------------------------------


Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. This paper aims to synthesize and control the mixed effects of different emotions given text as input. We propose a novel formulation for measuring the relative difference between speech recordings with different emotions. We then incorporate this formulation into a sequence-to-sequence emotional speech synthesis framework. During training, the framework not only explicitly characterizes emotion styles but also explores the ordinal nature of emotions by quantifying the differences from other emotions. At run-time, various mixed emotional effects can be produced by adjusting the values that indicate the relative difference from other emotion types. Objective and subjective evaluations validate the effectiveness of our proposal in synthesizing mixed emotions. To the best of our knowledge, this research is a pioneering study on modelling, synthesizing and evaluating mixed emotions in speech.


The code for this paper is publicly available here.


--------------> Starting from the Theory of Emotion Wheel <-----------------


Figure 1. An illustration of the theory of the emotion wheel [1], where all emotions occur as mixed or derivative states of eight primary emotions.

Studies show that humans can experience around 34,000 different emotions [2]. While it is hard to understand all of these distinct emotions, Robert Plutchik proposed 8 primary emotions (anger, fear, sadness, disgust, surprise, anticipation, trust and joy) and arranged them in an emotion wheel [1], as shown in Figure 1. All other emotions can be regarded as mixed or derivative states of these primary emotions. According to the theory of the emotion wheel, combining primary emotions can produce new emotion types. For example, delight can be produced by combining joy and surprise.

Inspired by the theory of the emotion wheel, we study a way to combine or mix different emotion types and synthesize various mixed emotional effects. We believe this will allow us to create new emotion types that are subtle and hard to collect in real life, which helps us better imitate human emotions.

Our experiments focus on the following combinations:

Primary Emotion (A)    Reference Emotion (B)    Mixed Effects (A+B)
Surprise               Happy                    Delight
Surprise               Angry                    Outrage
Surprise               Sad                      Disappointment

We chose these three combinations because they are thought to be easier for human listeners to perceive. Samples of other combinations (e.g. mixing happy with sad) are provided in the last sections of this demo page.

----------------------------> System Overview <----------------------------




Fig.2 Training Diagram.

Fig.3 Run-time Diagram.

----------------------------> Speech Samples <-----------------------------


(A) Mixed Emotion Evaluation (All Speech Samples are Synthesized from Text)

In this section, listeners can hear how the characteristics of other emotions ('Angry', 'Happy' or 'Sad') are introduced into 'Surprise'.
Only Surprise | Only Angry | Mixing Surprise with Angry | Only Sad | Mixing Surprise with Sad | Only Happy | Mixing Surprise with Happy

(B) Secondary Emotion Evaluation (All Speech Samples are Synthesized from Text)

In this section, listeners can hear how the mixed emotions sound like secondary emotions in psychology ('Outrage', 'Disappointment' or 'Delight').
Surprise | Outrage | Disappointment | Delight

(C) Controllability (All Speech Samples are Synthesized from Text)

In this section, we show the controllability of the proposed framework, for example by adjusting the percentage of each emotion in the mixed emotional effect.
100% Surprise + 0% Angry | 100% Surprise + 30% Angry | 100% Surprise + 60% Angry | 100% Surprise + 90% Angry
100% Surprise + 0% Sad | 100% Surprise + 30% Sad | 100% Surprise + 60% Sad | 100% Surprise + 90% Sad
100% Surprise + 0% Happy | 100% Surprise + 30% Happy | 100% Surprise + 60% Happy | 100% Surprise + 90% Happy
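
The percentages above can be read as a per-emotion weight vector that is set at run-time. The following is a minimal Python sketch of that bookkeeping only. The helper name `mixing_weights`, the emotion ordering, and the scaling to the range 0 to 1 are illustrative assumptions; they are not taken from the released code, and the sketch does not call any synthesis model.

```python
from typing import Dict, List

# Emotion inventory used on this page; the ordering is an assumption.
EMOTIONS: List[str] = ["Surprise", "Angry", "Sad", "Happy"]


def mixing_weights(percentages: Dict[str, int]) -> List[float]:
    """Turn per-emotion percentages into a weight vector in EMOTIONS order.

    Hypothetical helper: emotions that are not mentioned get weight 0.0.
    """
    for name, pct in percentages.items():
        if name not in EMOTIONS:
            raise ValueError(f"unknown emotion: {name}")
        if not 0 <= pct <= 100:
            raise ValueError(f"percentage out of range for {name}: {pct}")
    return [percentages.get(name, 0) / 100 for name in EMOTIONS]


# The four settings of the Surprise + Angry row: 0%, 30%, 60% and 90% Angry.
angry_sweep = [
    mixing_weights({"Surprise": 100, "Angry": pct}) for pct in (0, 30, 60, 90)
]
```

Note that the weights are not normalized to sum to one, which mirrors the labels on this page (e.g. 100% Surprise + 90% Angry).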

(D) Ablation Study

In this section, we show how the relative scheme improves the emotional intelligibility of the synthesized speech.
Synthesized Angry (Proposed w/o Relative Scheme) | Synthesized Angry (Proposed w/ Relative Scheme) | Reference Angry (Ground Truth)
Synthesized Surprise (Proposed w/o Relative Scheme) | Synthesized Surprise (Proposed w/ Relative Scheme) | Reference Surprise (Ground Truth)
Synthesized Sad (Proposed w/o Relative Scheme) | Synthesized Sad (Proposed w/ Relative Scheme) | Reference Sad (Ground Truth)
Synthesized Happy (Proposed w/o Relative Scheme) | Synthesized Happy (Proposed w/ Relative Scheme) | Reference Happy (Ground Truth)

(E) Further Investigation I: Bittersweet? Both Happy and Sad

In this section, we would like to synthesize a mixed feeling of Happy and Sad. (All speech samples are synthesized from text)
Synthesized Happy | Synthesized Sad | Mixing 100% Happy with 100% Sad | Mixing 90% Happy with 90% Sad | Mixing 80% Happy with 80% Sad | Mixing 70% Happy with 70% Sad
Synthesized Happy | Synthesized Sad | Mixing 100% Sad with 100% Happy | Mixing 90% Sad with 90% Happy | Mixing 80% Sad with 80% Happy | Mixing 70% Sad with 70% Happy

(E) Further Investigation II: Emotion Transition

In this section, we build an emotion transition system that gradually moves the emotional state from one emotion to another. (All speech samples are synthesized from text)
(1) Angry <---> Surprise
100% Angry | 80% Angry with 20% Surprise | 60% Angry with 40% Surprise | 60% Surprise with 40% Angry | 80% Surprise with 20% Angry | 100% Surprise
(2) Sad <---> Angry
100% Sad | 80% Sad with 20% Angry | 60% Sad with 40% Angry | 60% Angry with 40% Sad | 80% Angry with 20% Sad | 100% Angry
(3) Sad <---> Happy
100% Sad | 80% Sad with 20% Happy | 60% Sad with 40% Happy | 60% Happy with 40% Sad | 80% Happy with 20% Sad | 100% Happy
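
The transition rows above follow a simple pattern: the share of the source emotion falls in steps of 20 points while the share of the target emotion rises by the same amount. As a rough illustration, the Python sketch below generates such a schedule and the matching labels. The function names `transition_schedule` and `describe` and the fixed step size are assumptions made for this example; the sketch only produces the mixing percentages and is not the transition system itself.

```python
from typing import Dict, List


def transition_schedule(source: str, target: str, step: int = 20) -> List[Dict[str, int]]:
    """List percentage mixes that move from 100% source to 100% target."""
    if source == target:
        raise ValueError("source and target must be different emotions")
    if step <= 0 or 100 % step != 0:
        raise ValueError("step must be a positive divisor of 100")
    # t is the target share; the source share is whatever remains of 100.
    return [{source: 100 - t, target: t} for t in range(0, 101, step)]


def describe(mix: Dict[str, int]) -> str:
    """Label a mix with the dominant emotion first, skipping 0% entries."""
    parts = sorted(
        ((pct, name) for name, pct in mix.items() if pct > 0), reverse=True
    )
    return " with ".join(f"{pct}% {name}" for pct, name in parts)


# Labels for the Angry <---> Surprise transition, in playback order.
labels = [describe(mix) for mix in transition_schedule("Angry", "Surprise")]
```

With the default step, `labels` reproduces the six captions of the Angry <---> Surprise row; a smaller step such as 10 would give a finer-grained transition.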
Hope you enjoy this research! :-)

References:

[1] R. Plutchik and H. Kellerman, "Theories of Emotion," Academic Press, 2013, vol. 1.

[2] R. Plutchik, "The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice," American Scientist, vol. 89, no. 4, pp. 344–350, 2001.