Singapore Institute of Technology

Cross-Modal Audio-Visual Co-learning for Text-independent Speaker Verification

conference contribution
posted on 2023-07-17, 07:21 authored by Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang

Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is to model one modality with the aid of knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvements over independently trained audio-only/visual-only and baseline fusion systems, respectively.
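
The abstract names a max-feature-map (MFM) operation but does not define it. The sketch below assumes the definition commonly used in the speaker and face recognition literature: the feature vector is split into two halves and the element-wise maximum is kept, which halves the dimension. The function name `max_feature_map` and the sample values are illustrative only; how the operation is embedded in the paper's Transformer variant is not described in this record and is not shown here.

```python
from typing import List


def max_feature_map(features: List[float]) -> List[float]:
    """Split a feature vector into two halves and keep the element-wise maximum.

    Assumed (common) MFM definition; not taken from the paper itself.
    """
    if len(features) % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = len(features) // 2
    # Pair element i of the first half with element i of the second half.
    return [max(a, b) for a, b in zip(features[:half], features[half:])]


# Hypothetical 6-dimensional frame-level feature, reduced to 3 dimensions.
frame = [0.2, -1.0, 3.5, 0.7, 0.4, 1.5]
reduced = max_feature_map(frame)
```

Because each output keeps the larger of two competing activations, the operation acts as a learned feature selector rather than a fixed nonlinearity such as ReLU.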


Journal/Conference/Book title

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 04-10 June 2023, Rhodes Island, Greece.

Publication date



  • Pre-print

Rights statement

M. Liu, K. A. Lee, L. Wang, H. Zhang, C. Zeng and J. Dang, "Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification," ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10095883.

Corresponding author

Kong Aik Lee
