Page 54 - Journal of Applied Acoustics (《应用声学》), 2025, No. 3
Vol. 44, No. 3
Journal of Applied Acoustics, May 2025
⋄ Research Article ⋄
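Attention-based dual-hierarchy parallel acoustic scene classification method
YANG Xuetong, XIA Xiuyu †
(College of Electronics and Information Engineering, Sichuan University, Chengdu 610065)
Abstract: Acoustic scene classification is a computer audition task that analyzes an audio signal and assigns it
to a specific scene type. The technique is widely applicable to smart devices, audio surveillance, and related
fields. Acoustic scenes can be organized top-down into high-level scenes, each of which is subdivided into
low-level scenes. Unlike methods that classify low-level scenes directly, this work exploits that hierarchy and
proposes an attention-based dual-hierarchy parallel network for acoustic scene classification. First, parallel
high-level and low-level classification models are built on residual networks, and a global reference feature is
obtained from the intermediate-layer features of the high-level model. Next, attention weights are computed
from the distance between the global reference feature and the features of the low-level model, so that the
low-level model attends to the important features. Finally, an enhanced inference layer fuses the outputs of
the high-level and low-level models. On the DCASE2019 Task 1 dataset the parallel network reaches an
accuracy of 89.5%, and 90.1% once the enhanced inference layer is applied, which verifies the effectiveness of
the proposed network and of the enhanced inference layer.
Keywords: acoustic scene classification; residual network; attention; hierarchy; enhanced inference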
CLC number: TN911.7 Document code: A Article ID: 1000-310X(2025)03-0588-08
DOI: 10.11684/j.issn.1000-310X.2025.03.007
Attention-based dual-hierarchy parallel acoustic scene classification method
YANG Xuetong and XIA Xiuyu
(College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China)
Abstract: Acoustic scene classification is a computer audition task that classifies audio into specific
scene types through the analysis of audio signals. This technology can be widely applied in fields such as smart
devices and audio monitoring. Acoustic scenes can be divided into high-level scenes, which are then subdivided
into low-level scenes. Unlike methods that directly target low-level scene classification, an attention-based
dual-hierarchy parallel network is proposed for acoustic scene classification based on this hierarchical relationship.
Firstly, parallel high-level and low-level acoustic scene classification models are constructed using residual networks,
and global reference features are extracted from the intermediate features of the high-level classification model.
Then, attention weights are computed from the distance between the global reference features and
the low-level classification model features, which allows the low-level classification model to prioritize significant
features. Finally, an enhanced inference layer is employed to integrate the outputs of the high-level and low-level
classification models. The accuracy of the parallel network on the DCASE2019 Task 1 dataset is 89.5%, and the accuracy
after applying the enhanced inference layer is 90.1%, verifying the effectiveness of the proposed network model
and the enhanced inference layer.
Keywords: Acoustic scene classification; Residual network; Attention; Hierarchy; Enhanced inference
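
The two mechanisms named in the abstract, distance-based attention and fusion of the high-level and low-level outputs, can be illustrated with a small sketch. The abstract does not specify the distance measure, how distances are turned into weights, or the fusion rule, so everything below is an assumption made for illustration: Euclidean distance, a softmax over negative distances (closer to the reference means a larger weight), and a multiplicative fusion followed by renormalization. The function names and the four-class hierarchy are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distance_attention(global_ref, low_feats):
    """Weight each low-level feature vector by its closeness to the reference.

    Assumption (not from the paper): Euclidean distance, with a softmax over
    negative distances so that closer vectors receive larger weights.
    """
    dists = [math.dist(global_ref, f) for f in low_feats]
    weights = softmax([-d for d in dists])
    weighted = [[w * x for x in f] for w, f in zip(weights, low_feats)]
    return weights, weighted


def fuse_levels(p_high, p_low, parent_of):
    """Rescale each low-level probability by its parent's high-level probability.

    Assumption (not from the paper): multiplicative fusion, then renormalize
    so the fused scores sum to one.
    """
    scores = {c: p * p_high[parent_of[c]] for c, p in p_low.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical two-level hierarchy, for illustration only.
PARENT_OF = {
    "airport": "indoor",
    "shopping_mall": "indoor",
    "park": "outdoor",
    "street_traffic": "outdoor",
}

if __name__ == "__main__":
    weights, _ = distance_attention([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0]])
    print("attention weights:", weights)

    p_high = {"indoor": 0.8, "outdoor": 0.2}
    p_low = {"airport": 0.3, "shopping_mall": 0.2, "park": 0.4, "street_traffic": 0.1}
    print("fused:", fuse_levels(p_high, p_low, PARENT_OF))
```

In this toy input the low-level model alone would pick "park", while the fused scores favor "airport" because the high-level model is confident the scene is indoor. This is one way a high-level prediction could correct a low-level one; the paper's enhanced inference layer may work differently.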
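Received: 2024-01-07; Finalized: 2024-03-07
About the author: YANG Xuetong (1999– ), male, from Dali, Yunnan; master's student; research interest: signal and information processing.
† Corresponding author. E-mail: xiaxxy@163.com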