Speech emotion recognition using the attention mechanism to fuse the intermediate layer of front-end networks
Submitted: 2022-06-04  Revised: 2023-08-29
      To help machines better understand human emotions and improve human-computer interaction, speech features and classification networks can be fused to improve emotion recognition performance. Approaching the problem from the perspective of network fusion, this paper builds three front-end networks: a two-dimensional convolutional neural network (2D-CNN) based on Mel-frequency cepstral coefficients, a 2D-CNN based on inverted Mel-frequency cepstral coefficients, and a long short-term memory (LSTM) network based on scattering convolution network coefficients. The intermediate-layer outputs of the front-end networks are then extracted as utterance-level feature representations, and the squeeze-and-excitation (SE) channel attention mechanism is introduced to reweight and fuse them. Finally, a back-end deep neural network outputs the emotion classification results. Five-fold cross-validation comparison experiments were carried out on a Chinese speech emotion dataset. The results show that network fusion based on the SE channel attention mechanism effectively exploits the advantages of the different front-end networks and improves the accuracy of speech emotion recognition.
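The SE-based fusion step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes each front-end network yields an intermediate vector of the same length, treats the three vectors as three channels, and uses random, untrained weights. The function name `se_fuse`, the feature length of 128, and the hidden size of 2 are all hypothetical choices made for the example.

```python
import numpy as np


def se_fuse(features, w1, b1, w2, b2):
    """Reweight and fuse front-end features with an SE-style channel attention.

    features: (C, D) array, one row per front-end network (assumed layout).
    w1, b1:   first fully connected layer, shapes (H, C) and (H,).
    w2, b2:   second fully connected layer, shapes (C, H) and (C,).
    Returns the fused vector of length C * D and the C channel weights.
    """
    # Squeeze: global average over each channel's feature vector -> (C,)
    squeezed = features.mean(axis=1)
    # Excitation: FC -> ReLU -> FC -> sigmoid, giving one weight per channel
    hidden = np.maximum(0.0, w1 @ squeezed + b1)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))
    # Scale each channel by its weight, then flatten for the back-end network
    scaled = features * weights[:, None]
    return scaled.reshape(-1), weights


# Hypothetical intermediate outputs of the three front-end networks
# (MFCC 2D-CNN, inverted-MFCC 2D-CNN, scattering-coefficient LSTM).
rng = np.random.default_rng(0)
front_end_features = rng.standard_normal((3, 128))

# Untrained placeholder parameters; in practice these are learned jointly
# with the back-end classifier.
w1, b1 = rng.standard_normal((2, 3)), np.zeros(2)
w2, b2 = rng.standard_normal((3, 2)), np.zeros(3)

fused, channel_weights = se_fuse(front_end_features, w1, b1, w2, b2)
print(fused.shape)            # (384,)
print(channel_weights.shape)  # (3,)
```

In this sketch the fused vector would be passed to the back-end deep neural network for classification. Each channel weight lies between 0 and 1, so a front-end network whose features are less useful can be attenuated rather than discarded.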
Keywords (Chinese): 注意力机制  语音特征  网络融合
Keywords (English): Attention mechanism  Speech feature  Network fusion
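朱应俊 Department of Aeronautics and Astronautics, Fudan University 2351255492@qq.com
周文君 Department of Aeronautics and Astronautics, Fudan University 18321196008@163.com
朱川 Department of Aeronautics and Astronautics, Fudan University zc_hrbeu@163.com
马建敏* Department of Aeronautics and Astronautics, Fudan University 20210290001@fudan.edu.cn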