ULSED: An ultra-lightweight SED model for IoT devices
Sound event detection (SED) is widely used in applications such as audio surveillance systems and smart homes. In recent years, neural network (NN) based methods have been proposed that significantly improve detection accuracy over traditional machine learning methods. However, NN-based SED models often involve a large number of parameters and floating point operations (FLOPs), resulting in long processing times, high power consumption and large memory footprints. This poses a challenge for SED on IoT devices with constrained computational resources and power budgets. To address this issue, this work proposes an ultra-lightweight SED model (ULSED) that combines a selective separable convolution scheme with a coordinate attention scheme to significantly reduce computational complexity while maintaining high detection accuracy. The proposed ULSED model is evaluated on the ESC-10, ESC-50 and UrbanSound8K (US8K) datasets. Compared with several state-of-the-art models, it reduces the number of parameters by up to 388 times and the number of FLOPs by up to 1140 times, while achieving detection accuracies of 97.0%, 88.3% and 83.5% on ESC-10, ESC-50 and US8K, respectively. The proposed ULSED model is therefore suitable for power- and hardware-constrained IoT devices.
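To give a rough sense of why separable convolutions shrink a model, the sketch below compares the parameter and multiply-accumulate (MAC) counts of a standard 2D convolution with those of a generic depthwise separable convolution. It is illustrative only: it does not implement the selective scheme proposed in this work, the function names and layer sizes are hypothetical, and bias terms are ignored.

```python
def standard_conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a standard k x k convolution (bias ignored)."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out  # every weight is used once per output position
    return params, macs


def separable_conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a depthwise separable convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    params = depthwise + pointwise
    macs = params * h_out * w_out
    return params, macs


# Hypothetical layer: 64 -> 128 channels, 3 x 3 kernel, 32 x 32 output map.
std_params, std_macs = standard_conv_cost(64, 128, 3, 32, 32)
sep_params, sep_macs = separable_conv_cost(64, 128, 3, 32, 32)

print(std_params)  # 73728
print(sep_params)  # 8768
print(f"reduction: {std_params / sep_params:.2f}x")
```

For this hypothetical layer the reduction is roughly 8.4 times for both parameters and MACs, following the general ratio 1 / (1/c_out + 1/k^2). The much larger reductions reported above are measured against complete state-of-the-art models and cannot be derived from this single-layer estimate.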