Frontend Attributes Disentanglement for Speech Emotion Recognition

Xi, Yu-Xuan; Song, Yan; Dai, Li-Rong; McLoughlin, Ian; Liu, Lin

doi:10.1109/ICASSP43922.2022.9746691

conference contribution

Posted on 2023-10-01, 00:58; authored by Yu-Xuan Xi, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu

Speech emotion recognition (SER) with a limited-size dataset is a challenging task, since a spoken utterance contains various disturbing attributes besides emotion, including speaker, content, and language. However, due to the close relationship between speaker and emotion attributes, simply fine-tuning a linear model on the utterance-level embeddings (i.e., i-vectors and x-vectors) extracted from pre-trained speaker recognition (SR) frontends is enough to obtain good SER performance. In this paper, we aim to perform frontend attributes disentanglement (AD) for the SER task, using a pre-trained SR model. Specifically, the AD module consists of attribute normalization (AN) and attribute reconstruction (AR) phases. AN filters out the variation information using instance normalization (IN), and AR reconstructs the emotion-relevant features from the residual space to ensure high emotion discrimination. For better disentanglement, a dual space loss is then designed to encourage the separability of the emotion-relevant and emotion-irrelevant spaces. To introduce long-range contextual information for emotion-related reconstruction, a time-frequency (TF) attention is further proposed. Unlike style disentanglement of the extracted x-vectors, the proposed AD module can be applied to the frontend feature extractor. Experiments on the IEMOCAP benchmark demonstrate the effectiveness of the proposed method.
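The normalize-then-reconstruct idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the layer shapes, the reconstruction operator, or the attention design. The sketch assumes a feature map stored as a list of channels, each a list of frame values, and it replaces the learned AR step (and the TF attention) with a single hypothetical scalar `gate`. All function names are illustrative.

```python
import math


def instance_norm(x, eps=1e-5):
    """Normalize each channel of one utterance to zero mean and (near) unit variance.

    x is a list of channels; each channel is a list of per-frame values.
    Statistics are computed per channel over this single utterance only.
    """
    out = []
    for channel in x:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([(v - mean) / std for v in channel])
    return out


def residual(x, x_norm):
    """Return what instance normalization removed: the input minus its normalized form."""
    return [[a - b for a, b in zip(cx, cn)] for cx, cn in zip(x, x_norm)]


def reconstruct(x_norm, res, gate):
    """Add back a fraction of the residual.

    ASSUMPTION: `gate` is a fixed scalar standing in for whatever learned
    mapping selects emotion-relevant information from the residual space.
    """
    return [[n + gate * r for n, r in zip(cn, cr)] for cn, cr in zip(x_norm, res)]


# Toy feature map: 2 channels, 4 frames (values are made up for illustration).
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]]
normalized = instance_norm(features)
res = residual(features, normalized)
restored = reconstruct(normalized, res, gate=0.5)
```

With `gate=0.0` the output is the purely normalized features, and with `gate=1.0` the original features are recovered, so the scalar only marks where a learned, emotion-aware selection would sit.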

Journal/Conference/Book title

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Publication date

2022-05-23

Keywords

speech emotion recognition; convolutional neural network; style transformation; disentanglement

Licence

In Copyright
