An Online Speaker-aware Speech Separation Approach Based on Time-domain Representation
Despite the significant progress of deep learning based speech separation methods, it remains challenging to extract and track the speech of target speakers, especially in a single-channel, multi-speaker scenario. Previously, the authors proposed a source-aware context network that exploits the temporal context of the mixture and the estimated sources for online speech separation. In this paper, we propose a speaker-aware approach built on the source-aware context network, in which speaker information is explicitly modeled by an auxiliary speaker identification branch. Speech separation and speaker tracking can then be jointly optimized via multi-task learning. Furthermore, we study the effectiveness of time-domain representation by proposing a raw sparse waveform encoder that preserves discriminative information. Experimental results on the WSJ0-2mix benchmark show that the proposed system significantly improves Signal-to-Distortion Ratio (SDR) performance.
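The abstract describes jointly optimizing separation and speaker identification via multi-task learning. Below is a minimal sketch, not the authors' implementation, of how such a joint objective is commonly formed: a scale-invariant SDR separation loss combined with a weighted cross-entropy speaker-identification loss. The weight `alpha`, the tensor shapes, and the number of training speakers are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR, averaged over the batch (shape: [batch, samples])."""
    # Zero-mean both signals before projecting.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to obtain the scaled reference.
    scale = (estimate * target).sum(dim=-1, keepdim=True) / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    si_sdr = 10 * torch.log10(s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps)
    return -si_sdr.mean()

def multitask_loss(est_sources, ref_sources, spk_logits, spk_labels, alpha: float = 0.1):
    """Joint objective: separation loss plus weighted auxiliary speaker-ID loss (alpha is assumed)."""
    sep = si_sdr_loss(est_sources, ref_sources)
    spk = F.cross_entropy(spk_logits, spk_labels)
    return sep + alpha * spk

# Usage with dummy tensors: batch of 4, 1-second signals at 8 kHz, 101 hypothetical training speakers.
est = torch.randn(4, 8000)
ref = torch.randn(4, 8000)
logits = torch.randn(4, 101)
labels = torch.randint(0, 101, (4,))
loss = multitask_loss(est, ref, logits, labels)
```

The weighting between the two terms trades off separation quality against speaker-discriminative embedding learning; the exact formulation used in the paper may differ.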