D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition
Attention-based convolutional neural network (CNN) models are increasingly being adopted for speaker and language recognition (SR/LR) tasks. These include time, frequency, spatial and channel attention, which focus on useful time frames, frequency bands, regions or channels during feature extraction. However, such conventional attention methods do not exploit higher-order information or multi-scale, long-range interactions among speech features, both of which can benefit SR/LR tasks. To address these issues, this paper first proposes mixed-order attention (MOA) on low frame-level speech features to capture fine-grained, multi-order information at high resolution. We then combine MOA with a non-local attention (NLA) mechanism and a dilated residual structure to balance fine-grained local detail against context aggregated from multi-scale, long-range time/frequency regions of the feature space. The proposed dilated mixed-order non-local attention network (D-MONA) exploits the detail available from first- and second-order feature attention, but does so over a much wider context than purely local attention. Experiments are conducted on three datasets: two SR tasks, VoxCeleb and CN-Celeb, and one LR task, NIST LRE 07. For SR, D-MONA improves on ResNet-34 results by at least 29% on VoxCeleb1 and 15% on CN-Celeb. For LR, it achieves large improvements over ResNet-34: 21% for the challenging 3s utterance condition, 59% for the 10s condition and 67% for the 30s condition. It also outperforms the state-of-the-art deep bottleneck feature DNN (DBF-DNN) x-vector system at all three duration conditions.
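The abstract does not include an implementation, but the minimal PyTorch sketch below illustrates how its three named ingredients could fit together: a mixed-order channel attention combining first-order (mean) and second-order (covariance) statistics, an embedded-Gaussian non-local attention over all time/frequency positions, and a dilated residual block wrapping both. All module names, hyper-parameters and design details here are assumptions made for illustration; this is not the authors' released D-MONA code.

```python
# Illustrative sketch only: module names, reduction ratio, dilation and the
# way MOA/NLA are composed are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedOrderAttention(nn.Module):
    """Channel attention mixing first-order (mean) and second-order
    (covariance) statistics of the feature map."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        # Shared bottleneck MLP applied to both orders of statistics.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: (B, C, F, T)
        b, c, f, t = x.shape
        flat = x.reshape(b, c, f * t)          # (B, C, N)
        first = flat.mean(dim=2)               # first-order statistic: (B, C)
        centered = flat - first.unsqueeze(2)
        cov = torch.bmm(centered, centered.transpose(1, 2)) / (f * t - 1)
        second = cov.mean(dim=2)               # row-averaged covariance: (B, C)
        weights = torch.sigmoid(self.fc(first) + self.fc(second))
        return x * weights.view(b, c, 1, 1)    # re-weight channels


class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local attention over all time/freq positions."""

    def __init__(self, channels):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)
        self.phi = nn.Conv2d(channels, inner, 1)
        self.g = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):                      # x: (B, C, F, T)
        b, c, f, t = x.shape
        q = self.theta(x).reshape(b, -1, f * t).transpose(1, 2)  # (B, N, C')
        k = self.phi(x).reshape(b, -1, f * t)                    # (B, C', N)
        v = self.g(x).reshape(b, -1, f * t).transpose(1, 2)      # (B, N, C')
        attn = F.softmax(torch.bmm(q, k), dim=-1)                # (B, N, N)
        y = torch.bmm(attn, v).transpose(1, 2).reshape(b, -1, f, t)
        return x + self.out(y)                 # residual connection


class DMonaBlock(nn.Module):
    """Dilated residual block wrapping mixed-order and non-local attention."""

    def __init__(self, channels, dilation=2):
        super().__init__()
        # Dilated 3x3 convolutions enlarge the receptive field without
        # reducing time/frequency resolution.
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.moa = MixedOrderAttention(channels)
        self.nla = NonLocalBlock(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        y = self.moa(y)                        # local, mixed-order attention
        return F.relu(self.nla(x + y))         # long-range, non-local context


if __name__ == "__main__":
    feats = torch.randn(2, 32, 20, 50)         # (batch, channels, freq, time)
    print(DMonaBlock(32)(feats).shape)         # torch.Size([2, 32, 20, 50])
```

Sharing one bottleneck MLP across the first- and second-order branches is a deliberate simplification here; separate projections per order, or a learned fusion of the two statistics, would be equally plausible readings of "mixed-order" attention.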