


the proposed model also shows better recognition rates in this comparison, with weighted accuracy (WA) and unweighted accuracy (UA) gains of 2.1% and 3.5%, respectively. Against several of the latest methods, its UA is 4.3% higher than that of the 1D+3D network [30]; compared with DiCCOSER-CS [34], its WA is 3.6% higher and its UA improvement reaches 8.8%. In summary, the proposed model outperforms most state-of-the-art methods.
3 Conclusion
This paper has proposed an STA-CRNN model for SER. The model consists of two main modules, a CNN and an LSTM, with a spatial attention mechanism added to the CNN and a temporal attention mechanism added to the LSTM to strengthen the model and thereby raise the speech emotion recognition rate. The experimental results on two emotion datasets, together with comparisons against other state-of-the-art methods, show that the model extracts the effective feature information in speech spectrograms more successfully while filtering out irrelevant information, which substantially improves SER accuracy. Because the extracted features have a three-channel structure analogous to an RGB image, and the channels differ in importance, channel weighting also affects feature extraction during convolution. In future work we will therefore add a channel attention mechanism to the CNN to further improve SER performance.
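The paper gives no reference implementation, so the following is only a minimal PyTorch sketch of the design described above, assuming CBAM-style spatial attention inside the CNN and additive (soft) temporal attention pooling over the LSTM outputs. The class names, layer sizes, the bidirectional LSTM, and the input shape are illustrative assumptions, not the authors' architecture details; only the three-channel spectrogram input and the placement of the two attention mechanisms come from the text.

```python
# Minimal sketch of an STA-CRNN-style model (assumed shapes and
# hyperparameters, not the authors' implementation).
# Input: a 3-channel spectrogram "image" of shape (batch, 3, freq, time).
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: re-weight each time-frequency position."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                         # x: (B, C, F, T)
        avg = x.mean(dim=1, keepdim=True)         # channel-average map (B, 1, F, T)
        mx, _ = x.max(dim=1, keepdim=True)        # channel-max map (B, 1, F, T)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                           # emphasize informative regions


class TemporalAttention(nn.Module):
    """Additive attention pooling over LSTM time steps."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                         # h: (B, T, D)
        w = torch.softmax(self.score(h).squeeze(-1), dim=1)   # (B, T)
        return (h * w.unsqueeze(-1)).sum(dim=1)   # weighted summary (B, D)


class STACRNN(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.spatial_attn = SpatialAttention()
        self.lstm = nn.LSTM(input_size=64, hidden_size=128,
                            batch_first=True, bidirectional=True)
        self.temporal_attn = TemporalAttention(256)
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, x):                         # x: (B, 3, F, T)
        f = self.spatial_attn(self.cnn(x))        # (B, 64, F', T')
        f = f.mean(dim=2)                         # pool over frequency -> (B, 64, T')
        f = f.transpose(1, 2)                     # time-major sequence (B, T', 64)
        h, _ = self.lstm(f)                       # (B, T', 256)
        return self.classifier(self.temporal_attn(h))


# e.g. a batch of 2 spectrograms with 64 Mel bins and 300 frames
logits = STACRNN()(torch.randn(2, 3, 64, 300))
```

The channel attention mechanism named as future work could be prototyped here as a squeeze-and-excitation block inserted before `spatial_attn`, learning one weight per input-derived channel before the spatial re-weighting stage.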
References

[1] Mustaqeem, Kwon S. Att-Net: enhanced emotion recognition system using lightweight self-attention module[J]. Applied Soft Computing, 2021, 102(4): 107101.
[2] Abbaschian B J, Sierra-Sosa D, Elmaghraby A. Deep learning techniques for speech emotion recognition, from databases to models[J]. Sensors, 2021, 21(4): 1249.
[3] Wani T M, Gunawan T S, Qadri S A A, et al. A comprehensive review of speech emotion recognition systems[J]. IEEE Access, 2021, 9: 47795–47814.
[4] Zhang J, Xing L, Tan Z, et al. Multi-head attention fusion networks for multi-modal speech emotion recognition[J]. Computers & Industrial Engineering, 2022, 168: 108078.
[5] Liu L, Wang S, Hu B, et al. Learning structures of interval-based Bayesian networks in probabilistic generative model for human complex activity recognition[J]. Pattern Recognition, 2018, 81: 545–561.
[6] Li S, Xing X, Fan W, et al. Spatiotemporal and frequential cascaded attention networks for speech emotion recognition[J]. Neurocomputing, 2021, 448(2): 238–248.
[7] Xu M, Zhang F, Cui X, et al. Speech emotion recognition with multiscale area attention and data augmentation[C]. International Conference on Acoustics, Speech and Signal Processing, 2021.
[8] Akçay M B, Oğuz K. Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers[J]. Speech Communication, 2020, 116: 56–76.
[9] Chiba Y, Nose T, Ito A. Multi-stream attention-based BLSTM with feature segmentation for speech emotion recognition[C]. Interspeech, 2020.
[10] Li D, Liu J, Yang Z, et al. Speech emotion recognition using recurrent neural networks with directional self-attention[J]. Expert Systems with Applications, 2021, 173: 114683.
[11] Tzirakis P, Zhang J, Schuller B. End-to-end speech emotion recognition using deep neural networks[C]. International Conference on Acoustics, Speech and Signal Processing, 2018.
[12] Chen M, He X, Yang J, et al. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition[J]. IEEE Signal Processing Letters, 2018, 25(10): 1440–1444.
[13] Mao Q, Dong M, Huang Z, et al. Learning salient features for speech emotion recognition using convolutional neural networks[J]. IEEE Transactions on Multimedia, 2014, 16(8): 2203–2213.
[14] Senthilkumar N, Karpakam S, Devi M G, et al. Speech emotion recognition based on Bi-directional LSTM architecture and deep belief networks[J]. Materials Today: Proceedings, 2022, 57: 2180–2184.
[15] Khalil R A, Jones E, Babar M I, et al. Speech emotion recognition using deep learning techniques: a review[J]. IEEE Access, 2019, 7: 117327–117345.
[16] Bakhshi A, Harimi A, Chalup S. CyTex: transforming speech to textured images for speech emotion recognition[J]. Speech Communication, 2022, 139: 62–75.
[17] Chen Q, Huang G. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition[J]. Engineering Applications of Artificial Intelligence, 2021, 102: 104277.
[18] Trigeorgis G, Ringeval F, Brueckner R, et al. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network[C]. IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
[19] Zhang S, Zhang S, Huang T, et al. Learning affective features with a hybrid deep model for audio-visual emotion recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 28(10): 3030–3043.
[20] Xu Huanan, Zhou Xiaoyan, Jiang Wan, et al. Speech emotion recognition algorithm based on self-attention spatio-temporal features[J]. Technical Acoustics, 2021, 40(6): 807–814. (in Chinese)
[21] Yu Y, Kim Y J. Attention-LSTM-attention model for speech emotion recognition and analysis of IEMOCAP database[J]. Electronics (Basel), 2020, 9(5): 713.
[22] Li P, Song Y, McLoughlin I, et al. An attention pooling based representation learning method for speech emotion recognition[C]. Interspeech, 2018.