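Speech emotion recognition based on STA-CRNN model ∗
ZHANG Zhihao 1,2 WANG Kunxia 1,2†
(1 College of Electronic and Information Engineering, Anhui Jianzhu University, Hefei 230601)
(2 Key Laboratory of Architectural Acoustic Environment of Anhui Higher Education Institutes (Anhui Jianzhu University), Hefei 230601)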
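Applying convolutional neural network and long short-term memory network methods to extract spatial and temporal features from log-Mel spectrograms has achieved some success. However …
… this problem, this paper proposes a convolutional-recurrent neural network model based on a spatiotemporal attention mechanism, which uses the log-Mel spectrogram and its first-order difference, …
… attention and temporal attention mechanisms, so that these networks can better extract the spatial and temporal features in the log-Mel spectrogram that effectively characterize emotion. On the Emo-DB and IEMOCAP speech datasets, the model achieves weighted accuracies of 86.8% and 69.4% and unweighted accuracies of 84.7% and 65.5%, respectively, outperforming most current state-of-the-art methods.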
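The two metrics reported above can be illustrated with a minimal sketch. It assumes the definitions commonly used in speech emotion recognition work: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls. This section does not state the paper's own definitions, and the function names and labels below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    total = defaultdict(int)  # number of samples per true class
    hit = defaultdict(int)    # number of correct predictions per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Hypothetical labels for illustration only (not data from the paper).
y_true = ["angry", "angry", "angry", "angry", "sad", "sad"]
y_pred = ["angry", "angry", "angry", "sad", "sad", "angry"]

wa = weighted_accuracy(y_true, y_pred)    # 4 of 6 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of 3/4 and 1/2
```

On class-imbalanced data the two values diverge, which is one reason results on datasets such as IEMOCAP are often reported with both.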
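Keywords: speech emotion recognition; log-Mel spectrogram; spatiotemporal attention; temporal features; spatial features
CLC number: TN912.34 Document code: A Article ID: 1000-310X(2022)05-0843-08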
DOI: 10.11684/j.issn.1000-310X.2022.05.021
Speech emotion recognition based on STA-CRNN model
ZHANG Zhihao 1,2 WANG Kunxia 1,2
(1 College of Electronic and Information Engineering, Anhui Jianzhu University, Hefei 230601, China)
(2 Key Laboratory of Architectural Acoustic Environment of Anhui Higher Education Institutes (Anhui Jianzhu University),
Hefei 230601, China)
Abstract: Speech emotion recognition (SER) plays an important role in human-computer interaction and
affective computing, and many new methods have emerged. Recently, researchers have applied
convolutional neural networks (CNN) and long short-term memory (LSTM) networks to extract spatial and
temporal features from the log-Mel spectrogram, with improved performance. However, the features extracted
by CNN and LSTM networks tend to be redundant, which reduces speech emotion recognition performance.
In this paper, we propose a convolutional recurrent neural network model based on a spatiotemporal attention
mechanism (STA-CRNN). The log-Mel spectrogram and its first-order and second-order differences are
used as input features. We extract spatial features with a CNN and temporal features with an LSTM, and adopt
spatial and temporal attention mechanisms to further reduce feature redundancy. The experiment
