Page 173 - Applied Acoustics (《应用声学》), 2022, No. 5
Vol. 41, No. 5    Zhang Zhihao et al.: Speech emotion recognition based on the STA-CRNN model    849

comparison, the proposed model also achieves better recognition rates, with WA and UA improved by up to 2.1% and 3.5%, respectively. In comparison with some of the latest methods, the UA is 4.3% higher than that of the 1D+3D network [30]; compared with DiCCOSER-CS [34], the WA of the proposed model is 3.6% higher, and the UA gain reaches 8.8%. In summary, the proposed model outperforms most state-of-the-art methods.

3 Conclusion

This paper proposes an STA-CRNN model for SER. The model consists of two main modules, a CNN and an LSTM. A spatial attention mechanism is added to the CNN and a temporal attention mechanism to the LSTM, the better to improve model performance and thereby raise the speech emotion recognition rate. The experimental results on two emotion datasets, together with comparisons against other state-of-the-art methods, show that the proposed model extracts the effective feature information in speech spectrograms more thoroughly while filtering out irrelevant features, substantially improving SER accuracy. Because the extracted features have an image-like three-channel RGB structure, and the channels differ in importance, channel weighting also affects feature extraction during convolution. In future work, we will therefore add a channel attention mechanism to the CNN to further improve SER performance.

References

[1] Mustaqeem, Kwon S. Att-Net: enhanced emotion recognition system using lightweight self-attention module[J]. Applied Soft Computing, 2021, 102(4): 107101.
[2] Abbaschian B J, Sierra-Sosa D, Elmaghraby A. Deep learning techniques for speech emotion recognition, from databases to models[J]. Sensors, 2021, 21(4): 1249.
[3] Wani T M, Gunawan T S, Qadri S A A, et al. A comprehensive review of speech emotion recognition systems[J]. IEEE Access, 2021, 9: 47795–47814.
[4] Zhang J, Xing L, Tan Z, et al. Multi-head attention fusion networks for multi-modal speech emotion recognition[J]. Computers & Industrial Engineering, 2022, 168: 108078.
[5] Liu L, Wang S, Hu B, et al. Learning structures of interval-based Bayesian networks in probabilistic generative model for human complex activity recognition[J]. Pattern Recognition, 2018, 81: 545–561.
[6] Li S, Xing X, Fan W, et al. Spatiotemporal and frequential cascaded attention networks for speech emotion recognition[J]. Neurocomputing, 2021, 448(2): 238–248.
[7] Xu M, Zhang F, Cui X, et al. Speech emotion recognition with multiscale area attention and data augmentation[C]. International Conference on Acoustics, Speech and Signal Processing, 2021.
[8] Akçay M B, Oğuz K. Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers[J]. Speech Communication, 2020, 116: 56–76.
[9] Chiba Y, Nose T, Ito A. Multi-stream attention-based BLSTM with feature segmentation for speech emotion recognition[C]. Interspeech, 2020.
[10] Li D, Liu J, Yang Z, et al. Speech emotion recognition using recurrent neural networks with directional self-attention[J]. Expert Systems with Applications, 2021, 173: 114683.
[11] Tzirakis P, Zhang J, Schuller B. End-to-end speech emotion recognition using deep neural networks[C]. IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
[12] Chen M, He X, Jing Y, et al. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition[J]. IEEE Signal Processing Letters, 2018, 25(10): 1440–1444.
[13] Mao Q, Dong M, Huang Z, et al. Learning salient features for speech emotion recognition using convolutional neural networks[J]. IEEE Transactions on Multimedia, 2014, 16(8): 2203–2213.
[14] Senthilkumar N, Karpakam S, Devi M G, et al. Speech emotion recognition based on Bi-directional LSTM architecture and deep belief networks[J]. Materials Today: Proceedings, 2022, 57: 2180–2184.
[15] Khalil R A, Jones E, Babar M I, et al. Speech emotion recognition using deep learning techniques: a review[J]. IEEE Access, 2019, 7: 117327–117345.
[16] Bakhshi A, Harimi A, Chalup S. CyTex: transforming speech to textured images for speech emotion recognition[J]. Speech Communication, 2022, 139: 62–75.
[17] Chen Q, Huang G. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition[J]. Engineering Applications of Artificial Intelligence, 2021, 102: 104277.
[18] Trigeorgis G, Ringeval F, Brueckner R, et al. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network[C]. IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
[19] Zhang S, Zhang S, Huang T, et al. Learning affective features with a hybrid deep model for audio–visual emotion recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 28(10): 3030–3043.
[20] 徐华南, 周晓彦, 姜万, 等. 基于自身注意力时空特征的语音情感识别算法[J]. 声学技术, 2021, 40(6): 807–814.
    Xu Huanan, Zhou Xiaoyan, Jiang Wan, et al. Speech emotion recognition algorithm based on self-attention spatio-temporal features[J]. Technical Acoustics, 2021, 40(6): 807–814.
[21] Yu Y, Kim Y J. Attention-LSTM-attention model for speech emotion recognition and analysis of IEMOCAP database[J]. Electronics (Basel), 2020, 9(5): 713.
[22] Li P, Yan S, McLoughlin I, et al. An attention pooling based representation learning method for speech emotion recognition
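The conclusion describes adding a spatial attention mechanism to the CNN branch and a temporal attention mechanism to the LSTM branch. The following is a minimal NumPy sketch of how such mechanisms typically operate; all array shapes, the function names `spatial_attention` and `temporal_attention`, and the dot-product scoring are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of spatial attention over a CNN feature map and
# temporal attention over LSTM outputs, as used in attention-based SER
# models. Shapes and scoring functions are assumptions for illustration.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat, w):
    """Re-weight each spatial position of an (H, W, C) feature map.

    A scalar score per position is computed from its channel vector,
    normalized with a softmax over all H*W positions, and used to scale
    the map, so informative spectrogram regions contribute more.
    """
    h, wd, c = feat.shape
    scores = feat.reshape(h * wd, c) @ w       # (H*W,) position scores
    attn = softmax(scores).reshape(h, wd, 1)   # weights over positions
    return feat * attn                         # attended feature map

def temporal_attention(hidden, v):
    """Collapse a (T, D) LSTM output sequence into one (D,) vector.

    Each time step gets a scalar score; softmax over time yields weights
    emphasizing emotionally salient frames, and the weighted sum forms
    the utterance-level representation fed to the classifier.
    """
    scores = hidden @ v                        # (T,) per-step scores
    alpha = softmax(scores)                    # weights over time steps
    return alpha @ hidden                      # (D,) weighted summary

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 8, 16))         # toy CNN feature map
attended = spatial_attention(fmap, rng.standard_normal(16))
seq = rng.standard_normal((40, 32))            # toy LSTM outputs, 40 frames
summary = temporal_attention(seq, rng.standard_normal(32))
print(attended.shape, summary.shape)           # → (8, 8, 16) (32,)
```

Both mechanisms reduce to the same pattern: score, softmax-normalize, re-weight. A channel attention mechanism, as proposed for future work, would apply the same pattern along the C axis instead of the spatial or temporal axis.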