
Journal of Applied Acoustics, Vol. 42, No. 5, September 2023

⋄ Research Report ⋄



Speech emotion recognition using the attention mechanism to fuse the intermediate layer of front-end networks

ZHU Yingjun (朱应俊)    ZHOU Wenjun (周文君)    ZHU Chuan (朱川)    MA Jianmin (马建敏)†

(Department of Aeronautics and Astronautics, Fudan University, Shanghai 200433, China)

Abstract: To help machines better understand human emotions and improve the human-computer
interaction experience, speech features and classification networks can be fused to improve emotion
recognition performance. Taking a network-fusion approach, this paper builds front-end networks
comprising a two-dimensional convolutional neural network (2D-CNN) based on Mel-frequency cepstral
coefficients, a 2D-CNN based on inverted Mel-frequency cepstral coefficients, and a long short-term
memory (LSTM) network based on scattering convolution network coefficients. The intermediate layers
of the front-end networks are extracted as utterance-level feature representations, and a
squeeze-and-excitation (SE) channel attention mechanism adjusts the weights of these intermediate
layers and fuses them. A back-end classifier based on a deep neural network then outputs the emotion
classification results. Comparison experiments with five-fold cross-validation were carried out on a
Chinese speech emotion data set. The results show that network fusion based on the SE channel
attention mechanism effectively exploits the strengths of the different front-end networks in speech
emotion recognition tasks and improves the accuracy of speech emotion recognition.
Keywords: Attention mechanism; Speech feature; Network fusion
CLC number: TN912.3    Document code: A    Article ID: 1000-310X(2023)05-1090-09
DOI: 10.11684/j.issn.1000-310X.2023.05.023
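
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no layer sizes, so the dimensions below are hypothetical, the weights are random rather than trained, bias terms are omitted, and it is assumed that each front-end contributes one intermediate vector of equal length that is treated as one channel. The function name `se_fuse` and the helper `random_matrix` are invented for this illustration.

```python
import math
import random


def se_fuse(features, w1, w2):
    """Reweight per-front-end feature vectors SE-style, then concatenate.

    features: list of C vectors (one per front-end network), each of length D.
    w1: R x C weight matrix (bottleneck layer); w2: C x R weight matrix.
    Returns (fused, gates): the fused vector of length C * D and the C gates.
    """
    # Squeeze: summarize each channel by its mean activation.
    z = [sum(v) / len(v) for v in features]
    # Excitation, step 1: bottleneck fully connected layer with ReLU.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    # Excitation, step 2: fully connected layer with sigmoid, one gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale each channel by its gate and concatenate the channels.
    fused = [g * x for g, v in zip(gates, features) for x in v]
    return fused, gates


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
C, D, R = 3, 4, 2  # assumed: 3 front-ends, 4 features each, bottleneck width 2
features = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(C)]
w1 = random_matrix(R, C, rng)
w2 = random_matrix(C, R, rng)

fused, gates = se_fuse(features, w1, w2)
print(len(fused), [round(g, 3) for g in gates])
```

In a trained system the fused vector would be passed to the back-end classifier, and the gates would be learned jointly with it so that more informative front-ends receive larger weights.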

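Received 2022-06-04; final version 2023-01-12
About the author: ZHU Yingjun (朱应俊), born 1998, male, from Jinan, Shandong; master's student; research interest: speech emotion recognition.
† Corresponding author. E-mail: jmma@fudan.edu.cn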