Singapore Institute of Technology
zhang20da_interspeech.pdf (366 kB)

Semi-supervised end-to-end ASR via teacher-student learning with conditional posterior distribution

conference contribution
posted on 2024-04-03, 05:59 authored by Zi-Qiang Zhang, Yan Song, Jian-shu Zhang, Ian McLoughlin, Li-Rong Dai

Encoder-decoder based methods have become popular for automatic speech recognition (ASR), thanks to their simplified processing stages and low reliance on prior knowledge. However, training an effective encoder-decoder model generally requires large amounts of acoustic data with paired transcriptions, which are expensive and time-consuming to collect and not always readily available. Unpaired speech data, by contrast, is abundant, so several semi-supervised learning methods, such as teacher-student (T/S) learning and pseudo-labeling, have recently been proposed to utilize this potentially valuable resource. In this paper, a novel T/S learning method with conditional posterior distribution for encoder-decoder based ASR is proposed. Specifically, the 1-best hypotheses and the conditional posterior distribution from the teacher are exploited to provide more effective supervision. Combined with model perturbation techniques, the proposed method achieves a 19.2% relative WER reduction on the LibriSpeech benchmark, compared with a system trained using only paired data. This outperforms previously reported 1-best hypothesis results on the same task.
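
As a rough illustration of the idea in the abstract, the sketch below combines two kinds of teacher supervision for one unpaired utterance: a hard cross-entropy term on the teacher's 1-best hypothesis and a soft cross-entropy term on the teacher's per-step posterior, where both teacher and student are assumed to be conditioned on the same 1-best prefix. The function names, the interpolation weight `alpha`, and the way the two terms are mixed are assumptions made for this example; the abstract does not state the paper's exact loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one decoder step's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def ts_loss(student_logits, teacher_probs, hypothesis, alpha=0.5):
    """Hypothetical teacher-student loss for one unpaired utterance.

    student_logits: per-step student decoder logits, shape (T, V)
    teacher_probs:  per-step teacher posteriors, shape (T, V), assumed to be
                    computed with the teacher's 1-best hypothesis as history
    hypothesis:     teacher's 1-best token ids, length T
    alpha:          assumed weight between hard and soft supervision
    """
    hard, soft = 0.0, 0.0
    for logits, p_teacher, token in zip(student_logits, teacher_probs, hypothesis):
        log_q = log_softmax(logits)
        # Hard term: negative log-likelihood of the 1-best token.
        hard -= log_q[token]
        # Soft term: cross-entropy against the teacher's full posterior.
        soft -= sum(p * lq for p, lq in zip(p_teacher, log_q))
    n = len(hypothesis)
    return (alpha * hard + (1.0 - alpha) * soft) / n


# Toy example: 2 decoder steps, vocabulary of 3 tokens (values are made up).
student_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
teacher_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
hypothesis = [0, 1]
loss = ts_loss(student_logits, teacher_probs, hypothesis)
```

With `alpha=1.0` this reduces to plain pseudo-labeling on the 1-best hypothesis, and with `alpha=0.0` it uses only the teacher's posterior, which makes the two sources of supervision easy to compare in isolation.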


Journal/Conference/Book title

Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, October 25–29, 2020, Shanghai, China.

Publication date



  • Published

Rights statement

Zhang, Z.-q., Song, Y., Zhang, J.-s., McLoughlin, I., Dai, L.-R. (2020) Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution. Proc. Interspeech 2020, 3580-3584, doi: 10.21437/Interspeech.2020-1574.
