Singapore Institute of Technology

SAN-M: Memory equipped self-attention for end-to-end speech recognition

conference contribution
posted on 2024-04-03, 06:02 authored by Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin

End-to-end speech recognition has become popular in recent years because it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as superior. One example is the Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by the Transformer is the use of self-attention instead of recurrent mechanisms, enabling both encoder and decoder to capture long-range dependencies with lower computational complexity. In this work, we propose boosting the self-attention ability with a DFSMN memory block, forming the proposed memory equipped self-attention (SAN-M) mechanism. Theoretical and empirical comparisons are made to demonstrate the relevance and complementarity between self-attention and the DFSMN memory block. Furthermore, the proposed SAN-M provides an efficient mechanism to integrate these two modules. We evaluated our approach on the public AISHELL-1 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. On both tasks, SAN-M systems achieved much better performance than the self-attention based Transformer baseline system. Specifically, SAN-M achieves a CER of 6.46% on the AISHELL-1 task even without using any external LM, comfortably outperforming other state-of-the-art systems.
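
The idea of pairing self-attention with a memory block can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single attention head, a memory block modelled as a per-channel FIR filter over time with a skip connection (one common way to describe an FSMN-style memory), and a combination by simple addition; the abstract does not confirm where the memory block is attached. The names `self_attention`, `memory_block`, `san_m` and `taps` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (T, D) float sequence; w_q, w_k, w_v: (D, D) projections.
    Returns the attention output and the value matrix, both (T, D).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v, v


def memory_block(v, taps):
    """Assumed FSMN-style memory: per-channel FIR filter over time plus skip.

    v: (T, D) float values; taps: (K, D) filter coefficients, K odd.
    Frames outside the sequence are treated as zeros.
    """
    k = taps.shape[0]
    pad = k // 2
    padded = np.pad(v, ((pad, pad), (0, 0)))
    out = v.copy()  # skip connection
    for i in range(k):
        out += taps[i] * padded[i:i + v.shape[0]]
    return out


def san_m(x, w_q, w_k, w_v, taps):
    """Assumed combination: attention output plus memory over the values."""
    attn, v = self_attention(x, w_q, w_k, w_v)
    return attn + memory_block(v, taps)
```

In this sketch the attention term mixes information across the whole sequence with data-dependent weights, while the memory term mixes only a fixed local window with learned, position-relative weights, which is one way to picture the complementarity the abstract refers to.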


Journal/Conference/Book title

Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, October 25–29, 2020, Shanghai, China.

Publication date



  • Published

Rights statement

Gao, Z., Zhang, S., Lei, M., McLoughlin, I. (2020) SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition. Proc. Interspeech 2020, 6-10, doi: 10.21437/Interspeech.2020-2471.
