Journal of Applied Acoustics (《应用声学》), 2022, No. 3
Vol. 41, No. 3
Journal of Applied Acoustics, May 2022
⋄ Research Report ⋄
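A neural-network method for automatic acoustic scene classification ∗
LIANG Teng 1 JIANG Wenzong 1 WANG Li 2 LIU Baodi 2 WANG Yanjiang 2†
(1 College of Oceanography and Space Information, China University of Petroleum (East China), Qingdao 266580)
(2 College of Control Science and Engineering, China University of Petroleum (East China), Qingdao 266580)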
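Abstract: Detecting and automatically classifying acoustic scenes helps people choose the right strategy
for a given environment and is of significant research value. With the development of convolutional neural
networks, many acoustic scene classification methods based on them have appeared. Among these, the
temporal-spectral convolutional neural network (TS-CNN) uses a temporal-spectral attention module and is
currently one of the best-performing networks for acoustic scene classification. To further improve
classification performance while keeping network complexity unchanged, this paper proposes a
temporal-spectral convolutional neural network model based on collaborative learning (TSCNN-CL).
Specifically, auxiliary branches with an isomorphic structure are first built to take part in network
training. Next, a collaborative loss function based on KL divergence is proposed to achieve knowledge
collaboration between the branches and the trunk. Finally, to avoid adding inference computation, the
proposed model uses only the trunk network to predict results at test time. Comprehensive experiments on
the ESC-10, ESC-50, and UrbanSound8k datasets show that the model classifies better than TS-CNN and most
current mainstream methods.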
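Keywords: acoustic scene classification; temporal-spectral convolutional neural network; collaborative learning; acoustic signal processing
Chinese Library Classification (CLC) number: TP39 Document code: A Article ID: 1000-310X(2022)03-0373-08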
DOI: 10.11684/j.issn.1000-310X.2022.03.006
Automatic classification of acoustic scene based on neural network
LIANG Teng 1 JIANG Wenzong 1 WANG Li 2 LIU Baodi 2 WANG Yanjiang 2
(1 College of Oceanography and Space Information, China University of Petroleum (East China), Qingdao 266580, China)
(2 College of Control Science and Engineering, China University of Petroleum (East China), Qingdao 266580, China)
Abstract: Acoustic scene detection and automatic classification can help people adopt the correct
strategy in a specific environment and are therefore of great research value. With the development of
convolutional neural networks (CNNs), a large number of CNN-based acoustic scene classification methods
have emerged. In particular, the temporal-spectral CNN (TS-CNN), which adopts a temporal-spectral
attention module, is one of the best methods for acoustic scene classification at present. To further
improve the acoustic scene classification ability of the network without changing its complexity, we
propose a new temporal-spectral CNN model based on collaborative learning (TSCNN-CL). More
specifically, we first establish isomorphic auxiliary branches that participate in network training.
Second, we adopt a collaborative loss function based on KL divergence to realize knowledge
collaboration between the branches and the trunk. Finally, in the testing process, only the network
trunk is used to predict the results, so the amount of inference computation does not increase.
Comprehensive experiments on the ESC-10, ESC-50, and UrbanSound8k datasets show that the TSCNN-CL
model outperforms the TS-CNN model and shows clear advantages over several other state-of-the-art
models.
Keywords: Acoustic scene classification; Temporal-spectral convolutional neural network; Collaborative
learning; Sound signal processing
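
The training scheme summarized in the abstract (auxiliary branches, a KL-divergence collaborative loss, trunk-only prediction at test time) can be illustrated with a minimal sketch. The exact loss used in the paper is not given on this page, so everything below is an assumption made for illustration: the function names, the weight `alpha`, the `temperature` parameter, the direction of the KL term, and the choice to add a cross-entropy term for every head are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def cross_entropy(probs, label, eps=1e-12):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[label] + eps)

def collaborative_loss(trunk_logits, branch_logits, label,
                       alpha=0.5, temperature=1.0):
    """Assumed form: cross-entropy on every head plus alpha * KL(branch || trunk).

    trunk_logits: logits from the trunk for one sample.
    branch_logits: list of logit lists, one per auxiliary branch.
    alpha, temperature: hypothetical hyperparameters, not taken from the paper.
    """
    loss = cross_entropy(softmax(trunk_logits), label)
    soft_trunk = softmax(trunk_logits, temperature)
    for logits in branch_logits:
        loss += cross_entropy(softmax(logits), label)
        soft_branch = softmax(logits, temperature)
        loss += alpha * kl_divergence(soft_branch, soft_trunk)
    return loss

def predict(trunk_logits):
    """Test-time prediction uses the trunk output only; branches are dropped."""
    return max(range(len(trunk_logits)), key=lambda i: trunk_logits[i])
```

Because `predict` never touches the branch outputs, the branches add cost only during training, which is the property the abstract emphasizes. In a real model the logits would come from the TS-CNN trunk and its auxiliary branches rather than from hand-written lists.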
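Received 2021-04-12; accepted 2021-06-21
∗ National Natural Science Foundation of China project (62072468)
About the author: LIANG Teng (1996– ), male, from Dezhou, Shandong; master's student; research interest: intelligent information processing.
† Corresponding author. E-mail: yjwang@upc.edu.cn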