Journal of Applied Acoustics (《应用声学》), 2023, No. 4

Vol. 42, No. 4
Journal of Applied Acoustics, July 2023

⋄ Research Report ⋄

Detecting replay voice with ResNet and CatBoost ∗

SUN Xiaochuan 1,2 FU Jingchang 1,2 SONG Xiaoting 1,2 ZONG Lifang 1,2 LI Zhigang 1,2†

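(1 College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210)
(2 Hebei Key Laboratory of Industrial Intelligent Perception, Tangshan 063210)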
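Abstract: Short voice commands carry little audio information, which makes sentence-level replay voice
detection unsuitable for them, and voice recorded at close range and replayed through high-quality
devices is difficult to detect. To address both problems, a model for word-level replay voice detection
is proposed. First, audio frames are selected using the short-time Fourier transform, low-frequency
average energy computation, and frame sorting, and the Gammatone frequency cepstral coefficients (GFCC)
of these frames are extracted. Second, a residual network model based on the self-attention mechanism
further extracts the information in the GFCC and converts it into feature vectors. Finally, the
extracted feature vectors are classified with CatBoost to improve detection performance. Experimental
results on the POCO dataset show that the proposed method detects replay voice with an accuracy of
87.54% and an equal error rate of 12.53%, outperforming the baseline and existing methods. On the
ASVspoof2019 PA dataset, the proposed method achieves an equal error rate of 4.92% and a tandem
detection cost function of 0.1418, which shows that it is also applicable to replay voice detection
under a variety of settings.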
Keywords: Replay voice detection; Pop noise; Residual network; CatBoost
Chinese Library Classification: TN912.3 Document code: A Article ID: 1000-310X(2023)04-0861-10
DOI: 10.11684/j.issn.1000-310X.2023.04.022
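
The frame-selection step named in the abstract (short-time Fourier transform, low-frequency average energy, frame sorting) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `select_frames`, the frame length, hop size, 100 Hz cutoff, and the number of retained frames are all hypothetical values chosen for illustration, since this page does not state them.

```python
import numpy as np

def select_frames(signal, sample_rate, frame_len=512, hop=256,
                  cutoff_hz=100.0, top_k=10):
    """Rank frames by mean low-frequency energy and keep the top_k.

    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        # Zero-pad very short inputs so at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))

    # Slice the signal into overlapping frames, shape (n_frames, frame_len).
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)

    # Short-time Fourier transform: power spectrum of each windowed frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)

    # Average energy over the bins at or below the cutoff frequency.
    low_energy = power[:, freqs <= cutoff_hz].mean(axis=1)

    # Sort frames by low-frequency energy (descending) and keep the top_k,
    # returned in temporal order.
    order = np.argsort(low_energy)[::-1]
    selected = np.sort(order[:min(top_k, n_frames)])
    return selected, low_energy
```

The selected frame indices would then feed a GFCC extractor. Ranking by low-frequency energy reflects the pop-noise cue listed in the keywords; the actual cutoff and frame count used in the paper may differ.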

Detection of replay voice by ResNet and CatBoost

SUN Xiaochuan 1,2 FU Jingchang 1,2 SONG Xiaoting 1,2 ZONG Lifang 1,2 LI Zhigang 1,2

(1 College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China)
(2 Hebei Key Laboratory of Industrial Intelligent Perception, Tangshan 063210, China)

Abstract: Short voice commands carry little audio information and are therefore poorly suited to
sentence-level replay voice detection, and voice recorded at close range and replayed through a
high-quality device is difficult to detect. To address both problems, a model for word-level replay
voice detection is proposed. First, the short-time Fourier transform, low-frequency average energy
computation, and frame sorting are used to select audio frames, and Gammatone frequency cepstral
coefficients (GFCC) are then extracted from these frames. Next, a self-attention residual network
(ResNet) model further extracts the information in the GFCC and converts it into feature vectors.
Finally, the feature vectors are classified by CatBoost to improve detection performance. Experimental
results on the POCO dataset show that the proposed method detects replay voice with an accuracy of
87.54% and an equal error rate of 12.53%, outperforming the baseline and existing methods. On the
ASVspoof2019 PA dataset, the proposed method achieves an equal error rate of 4.92% and a tandem
detection cost function of 0.1418, which indicates that it is also suitable for replay voice detection
under various settings.
Keywords: Replay voice detection; Pop noise; ResNet; CatBoost
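
The abstract reports results as an equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how such a figure can be computed from detector scores follows. The function name `equal_error_rate` and the score convention (higher score means more likely genuine, label 1 for genuine and 0 for replay) are assumptions for illustration; the paper's own evaluation code is not shown on this page.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by scanning thresholds over the observed scores.

    Assumed convention: higher score = more likely genuine;
    labels are 1 for genuine voice and 0 for replayed voice.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)
    genuine = scores[labels]
    replay = scores[~labels]

    # Candidate thresholds: every observed score, plus +inf (reject all).
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    # A trial is accepted as genuine when its score >= threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])  # false rejections
    far = np.array([(replay >= t).mean() for t in thresholds])  # false acceptances

    # Take the threshold where the two error rates are closest.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```

With finitely many trials the two error curves rarely cross exactly, so the sketch averages the two rates at the nearest threshold; interpolating between thresholds is a common alternative.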

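Received 2022-03-21; accepted 2022-07-21
∗ Supported by the Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD2021088) and the National Key Research and Development Program of China (2017YFE0135700)
About the author: SUN Xiaochuan (1983– ), male, from Yantai, Shandong; Ph.D., associate professor; research interest: deep learning.
† Corresponding author. E-mail: lizhigang@ncst.edu.cn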
