Journal of Applied Acoustics (《应用声学》), 2023, No. 5


Vol. 42, No. 5
Journal of Applied Acoustics, September 2023

⋄ Research Report ⋄

Speech emotion recognition with attention-based fusion of the intermediate layers of front-end networks

朱应俊, 周文君, 朱川, 马建敏†

(Department of Aeronautics and Astronautics, Fudan University, Shanghai 200433)

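Abstract: To help machines better understand human emotions and to improve the human-computer
interaction experience, speech features and classification networks can be fused to improve emotion
recognition performance. Taking a network-fusion perspective, this paper uses as front-end networks
two-dimensional convolutional neural networks based on Mel-frequency cepstral coefficients and on
inverted Mel-frequency cepstral coefficients, together with a long short-term memory network based on
scattering convolution network coefficients. The intermediate layers of the front-end networks are
extracted as utterance-level feature representations, a squeeze-and-excitation (SE) channel attention
mechanism adjusts the weights of these intermediate layers and fuses them, and a deep neural network
back-end classifier then outputs the emotion classification result. Comparative experiments with
five-fold cross-validation on a Chinese emotion dataset show that network fusion based on the SE
channel attention mechanism can effectively exploit the strengths of the different front-end networks
in the speech emotion recognition task and improve the accuracy of speech emotion recognition.
Keywords: attention mechanism; speech features; network fusion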
CLC number: TN912.3 Document code: A Article ID: 1000-310X(2023)05-1090-09
DOI: 10.11684/j.issn.1000-310X.2023.05.023

Speech emotion recognition using the attention mechanism to fuse the intermediate layer of front-end networks

ZHU Yingjun, ZHOU Wenjun, ZHU Chuan, MA Jianmin

(Department of Aeronautics and Astronautics, Fudan University, Shanghai 200433, China)

Abstract: To enable machines to better understand human emotions and to improve the human-computer
interaction experience, speech features and classification networks can be fused to improve emotion
recognition performance. From the perspective of network fusion, this paper builds three front-end
networks: a two-dimensional convolutional neural network (2D-CNN) based on Mel-frequency cepstral
coefficients, a 2D-CNN based on inverted Mel-frequency cepstral coefficients, and a long short-term
memory network based on scattering convolution network coefficients. The intermediate layers of the
front-end networks are then extracted as utterance-level feature representations, and the
squeeze-and-excitation (SE) channel attention mechanism is introduced to adjust the weights of the
intermediate layers and fuse them. Finally, the emotion classification results are output by a
back-end classifier based on a deep neural network. Comparative experiments with five-fold
cross-validation were carried out on a Chinese speech emotion dataset. The experimental results show
that network fusion based on the SE channel attention mechanism can effectively exploit the advantages
of the different front-end networks in speech emotion recognition tasks and improve the accuracy of
speech emotion recognition.
Keywords: Attention mechanism; Speech feature; Network fusion
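
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each of the three front-end networks yields an intermediate-layer vector of the same length, that each vector is treated as one channel, and that the reweighted channels are fused by concatenation. The function name `se_fuse`, the vector length, the bottleneck size and the random weights `w1` and `w2` are all hypothetical placeholders; in a trained system the weights would be learned.

```python
import math
import random


def se_fuse(features, w1, w2):
    """Reweight equal-length feature vectors with SE-style gates, then concatenate.

    features: list of C vectors, one per front-end network (assumed layout).
    w1: bottleneck weights, shape (r, C); w2: expansion weights, shape (C, r).
    """
    # Squeeze: global average of each channel gives one scalar per front end.
    z = [sum(f) / len(f) for f in features]
    # Excitation, step 1: fully connected bottleneck followed by ReLU.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    # Excitation, step 2: fully connected layer followed by a sigmoid gate.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale each channel by its gate and concatenate (assumed fusion rule).
    fused = [g * x for g, f in zip(gates, features) for x in f]
    return gates, fused


rng = random.Random(0)
dim = 8  # hypothetical intermediate-layer size
# Stand-ins for the MFCC 2D-CNN, IMFCC 2D-CNN and scattering LSTM outputs.
features = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
w1 = [[rng.gauss(0.0, 0.5) for _ in range(3)] for _ in range(2)]  # r = 2
w2 = [[rng.gauss(0.0, 0.5) for _ in range(2)] for _ in range(3)]

gates, fused = se_fuse(features, w1, w2)
print("channel gates:", gates)
print("fused length:", len(fused))
```

The fused vector would then be passed to the back-end classifier. Each gate lies strictly between 0 and 1, so a front end whose features are less useful for a given utterance can be attenuated rather than discarded.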

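Received 2022-06-04; accepted 2023-01-12
About the author: ZHU Yingjun (1998– ), male, from Jinan, Shandong; master's student; research interest: speech emotion recognition.
† Corresponding author. E-mail: jmma@fudan.edu.cn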
