Robust Speech Recognition for Visual Acuity Testing in Multi-Speaker Clinical Environments
Automatic Speech Recognition (ASR) in multi-speaker clinical environments remains a significant challenge due to overlapping speech, ambient noise, and the lack of reference audio for speaker identification. This paper presents a robust, target-aware ASR system tailored for visual acuity testing, where accurate transcription of patient responses is critical. We fine-tune the Whisper model on a domain-specific clinical dataset to improve transcription accuracy and reduce latency. To address crosstalk and overlapping speech, we evaluate two state-of-the-art speech separation models, MossFormer2 and SepFormer, and integrate reference-free Target Talker Identification (TTI) strategies based on signal-to-noise ratio (SNR) and cosine similarity. Our evaluation on synthetic and real clinical datasets demonstrates that the fine-tuned Whisper model, when combined with MossFormer2 and cosine similarity-based TTI, achieves the lowest Word Error Rate (WER) across two-speaker scenarios. These results highlight the importance of integrating robust separation and speaker identification methods to enable accurate, efficient ASR in clinical settings. The system provides a scalable foundation for future work in more complex environments with three or more speakers.
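For readers unfamiliar with the reported metric, the sketch below shows the standard word-level Levenshtein formulation of Word Error Rate, WER = (substitutions + deletions + insertions) / reference word count. It is an illustrative implementation only, not the paper's evaluation code, and the example sentences are hypothetical.

```python
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed via word-level Levenshtein (edit) distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i, j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution or match
    return d[len(ref), len(hyp)] / max(len(ref), 1)

# Example: one substituted word out of four reference words -> WER = 0.25
print(word_error_rate("read the third line", "read the top line"))
```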
History
Journal/Conference/Book title
International Conference on Asian Language Processing (IALP) 2025
Publication date
2025-08
Version
- Pre-print
Corresponding author
Rong Tong
Project ID
- 16081 (R-IE2-A405-0006) Multimodal visual acuity testing with speech and touch panel
- 15875 (R-R12-A405-0009) Automatic speech de-identification on Singapore English speech