On the Nature and Potential of Deep Noise Suppression Embeddings
Deep noise suppression (DNS) and AI-based speech denoising architectures learn a regression task of transforming noisy speech into clean speech. Logically, the task can be accomplished either by learning noise characteristics to identify and remove noise, or by learning speech characteristics to strengthen speech with respect to noise. Architectures that employ the former approach require a good noise model, whereas the latter architectures require a strong speech model. Denoising can then be accomplished by using a noise model to mask noisy parts of the input, or by using the speech model to enhance the speech parts of the input, with both guided by appropriate training data and loss functions. Modern DNS systems are powerful and compact networks that use psychoacoustically inspired objective functions to learn their internal representations. We demonstrate that they effectively combine both approaches. This is despite the training data containing neither noise nor speech labels; hence these latent representations are unsupervised. This paper explores embeddings from two recent high-performance DNS architectures to determine how they model both noise and speech across layers. Results reveal strong clustering for both speech and noise, plus significant separation of speaker characteristics. This leads to a new understanding that both architectures can learn speaker and noise discrimination abilities in an unsupervised fashion. These abilities have strong potential to be exploited for related speaker- and noise-based machine learning tasks.
History
Journal/Conference/Book title: Circuits, Systems, and Signal Processing
Publication date: 2025-05-17
Version: Post-print