Clippers 10/8: Jingyi Chen on Speech Emotion Cloning

EmoClone: Speech Emotion Cloning
Jingyi Chen

In this paper, we introduce EmoClone, an end-to-end speech-to-speech model that clones the emotional tone of a reference utterance from a short audio sample, reproducing the reference speaker's emotion in new outputs regardless of differences in content or voice. Unlike traditional Emotional Voice Conversion (EVC) models, which rely on emotion text labels to alter the input speech's emotional state, EmoClone is designed to faithfully clone a broad range of emotional expressions beyond such preset categories, making it well suited to applications that require precise emotional fidelity, such as personalized voice generation and interactive media. Experimental results show that EmoClone improves content and speaker identity preservation while achieving emotion accuracy comparable to state-of-the-art (SOTA) methods.
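To make the contrast with label-based EVC concrete, here is a minimal PyTorch sketch of what reference-audio conditioning can look like: a continuous emotion embedding is pooled from the reference clip and broadcast over the source content, rather than a categorical emotion label being looked up from a fixed set. All module names, dimensions, and the mean-pooling choice are illustrative assumptions for exposition, not details of the EmoClone architecture.

```python
# Illustrative sketch of reference-conditioned emotion cloning.
# Module names, dimensions, and the mean-pooling reference encoder are
# assumptions for exposition -- not details from the EmoClone paper.
import torch
import torch.nn as nn


class ReferenceEmotionEncoder(nn.Module):
    """Pools a reference mel-spectrogram into a continuous emotion embedding,
    instead of mapping to one of a preset set of emotion labels as in
    traditional EVC."""

    def __init__(self, n_mels: int = 80, emb_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(n_mels, emb_dim)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, frames, n_mels) -> (batch, emb_dim)
        return self.proj(ref_mel).mean(dim=1)  # mean-pool over time


class EmotionConditionedDecoder(nn.Module):
    """Generates output features from source-speech content, conditioned on
    the reference emotion embedding broadcast over time."""

    def __init__(self, content_dim: int = 256, emb_dim: int = 128,
                 n_mels: int = 80):
        super().__init__()
        self.out = nn.Linear(content_dim + emb_dim, n_mels)

    def forward(self, content: torch.Tensor,
                emo_emb: torch.Tensor) -> torch.Tensor:
        # content: (batch, frames, content_dim); emo_emb: (batch, emb_dim)
        emo = emo_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.out(torch.cat([content, emo], dim=-1))


if __name__ == "__main__":
    enc, dec = ReferenceEmotionEncoder(), EmotionConditionedDecoder()
    ref_mel = torch.randn(2, 120, 80)    # short reference clip
    content = torch.randn(2, 300, 256)   # content features of source speech
    out_mel = dec(content, enc(ref_mel))
    print(out_mel.shape)                 # torch.Size([2, 300, 80])
```

Because the emotion representation here is a continuous embedding rather than an index into a fixed label set, a setup of this shape can in principle carry emotional nuance beyond preset categories, which is the property the abstract emphasizes.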