Singapore Institute of Technology

File(s) not publicly available

Frontend Attributes Disentanglement for Speech Emotion Recognition

conference contribution
posted on 2023-10-01, 00:58 authored by Yu-Xuan Xi, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu

Speech emotion recognition (SER) with a limited-size dataset is a challenging task, since a spoken utterance contains various disturbing attributes besides emotion, including speaker, content, and language. However, owing to the close relationship between speaker and emotion attributes, simply fine-tuning a linear model on utterance-level embeddings (i.e., i-vectors and x-vectors) extracted from pre-trained speaker recognition (SR) frontends is enough to obtain good SER performance. In this paper, we aim to perform frontend attributes disentanglement (AD) for the SER task using a pre-trained SR model. Specifically, the AD module consists of attribute normalization (AN) and attribute reconstruction (AR) phases. AN filters out variation information using instance normalization (IN), and AR reconstructs emotion-relevant features from the residual space to ensure high emotion discrimination. For better disentanglement, a dual-space loss is designed to encourage the separability of the emotion-relevant and emotion-irrelevant spaces. To introduce long-range contextual information for emotion-related reconstruction, a time-frequency (TF) attention is further proposed. Unlike style disentanglement applied to extracted x-vectors, the proposed AD module can be applied to the frontend feature extractor. Experiments on the IEMOCAP benchmark demonstrate the effectiveness of the proposed method.
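The core AN/AR idea from the abstract can be illustrated with a minimal sketch: instance normalization removes per-utterance channel statistics (the "variation" information), and the residual, i.e. what IN removed, forms the space from which the AR phase would reconstruct emotion-relevant features. This is a simplified illustration only, not the authors' implementation; the function names, the NumPy tensor layout, and the omission of the learned AR network, dual-space loss, and TF attention are all assumptions of this sketch.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # x: (channels, time, freq) feature map for one utterance.
    # Normalize each channel over its time-frequency plane, removing
    # per-utterance statistics -- the "variation" information the
    # attribute normalization (AN) phase filters out.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)

def attribute_disentangle(x):
    # AN phase: instance-normalized features with variation removed.
    normalized = instance_norm(x)
    # Residual space: the statistics stripped by IN. In the paper, the
    # attribute reconstruction (AR) phase would learn to recover
    # emotion-relevant cues from this space (learned weights, the
    # dual-space loss, and TF attention are omitted in this sketch).
    residual = x - normalized
    return normalized, residual
```

By construction the two outputs sum back to the input, so no information is discarded; disentanglement comes from training the downstream AR network and dual-space loss to separate emotion-relevant from emotion-irrelevant content.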


Journal/Conference/Book title

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
