Singapore Institute of Technology

Fine-tuning Audio Spectrogram Transformer with Task-aware Adapters for Sound Event Detection

conference contribution
posted on 2023-10-01, 00:58 authored by Kang Li, Yan Song, Ian McLoughlin, Lin Liu, Jin Li, Li-Rong Dai

In this paper, we present a task-aware fine-tuning method to transfer the Patchout faSt Spectrogram Transformer (PaSST) model to the sound event detection (SED) task. Pretrained PaSST has shown strong performance on audio tagging (AT) and SED tasks, but fine-tuning the model from a single layer is suboptimal, as local and semantic information are not well exploited. To address this, we first introduce task-aware adapters, an SED-adapter and an AT-adapter, to fine-tune PaSST for the SED and AT tasks respectively, and then propose task-aware fine-tuning, which builds on these adapters to combine local information from shallower layers with semantic information from deeper layers. In addition, we propose the self-distilled mean teacher (SdMT) to train a robust student model with soft pseudo labels from the teacher. Experiments are conducted on the DCASE2022 Task 4 development set, achieving an EB-F1 of 64.85% and a PSDS1 of 0.5548, outperforming previous state-of-the-art systems.
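The abstract does not spell out the adapter internals; as a hedged illustration only, a common adapter design that such task-aware adapters typically follow is a bottleneck module (down-projection, nonlinearity, up-projection, residual connection) inserted into each transformer block. The function name, shapes, and weights below are illustrative assumptions, not the authors' exact SED-/AT-adapter architecture:

```python
import numpy as np

def bottleneck_adapter(x, w_down, w_up):
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project,
    then add a residual connection. This is a generic adapter sketch;
    the paper's exact SED-/AT-adapter internals may differ."""
    h = np.maximum(x @ w_down, 0.0)   # down-projection + ReLU
    return x + h @ w_up               # up-projection + residual skip

# Toy shapes: feature dim 8, bottleneck dim 2 (illustrative only)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))               # 4 frames, 8-dim features
w_down = rng.standard_normal((8, 2)) * 0.1    # small init keeps the
w_up = rng.standard_normal((2, 8)) * 0.1      # adapter near-identity
y = bottleneck_adapter(x, w_down, w_up)
print(y.shape)  # (4, 8): the adapter preserves the feature dimension
```

The residual path means a zero-initialized adapter leaves the pretrained representation untouched at the start of fine-tuning, which is one reason this design is popular for transferring pretrained transformers.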


Journal/Conference/Book title

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 20-24 August 2023, Dublin, Ireland
