Vol. 42, No. 1
Journal of Applied Acoustics January, 2023
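⋄ Research Report ⋄
Synthetic speech detection system based on the Transformer encoder ∗
WAN Yi 1,2 YANG Feiran 1,2 YANG Jun 1,2†
(1 Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190)
(2 University of Chinese Academy of Sciences, Beijing 100049)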
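Abstract: Automatic speaker verification systems are a common solution for authenticating the identity of a
target speaker, but they are vulnerable to synthetic speech attacks; synthetic speech detection systems attempt
to address this problem. This paper proposes a synthetic speech detection method based on the Transformer
encoder, which uses the self-attention mechanism to learn long-term dependencies within the input features.
Synthetic speech detection is not concerned with the abstract semantic features of a sentence, so a model with
few parameters can also achieve good detection performance. Four commonly used synthetic speech detection
features are tested on the Transformer encoder. On the logical access dataset of the international standard
ASVspoof2019 challenge, the system based on linear frequency cepstral coefficient features and the Transformer
encoder achieves an equal error rate of 3.13% and a tandem detection cost function of 0.0708 with only 0.082 M
model parameters, obtaining good detection performance with a small parameter count.
Keywords: Automatic speaker verification; Synthetic speech detection; Transformer encoder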
CLC number: TP302.1 Document code: A Article ID: 1000-310X(2023)01-0026-08
DOI: 10.11684/j.issn.1000-310X.2023.01.004
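The self-attention mechanism named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' model: it assumes a single attention head, random projection weights, no positional encoding, and hypothetical sizes (`frames`, `feat_dim`, `d_model`), with random numbers standing in for acoustic feature frames. It only shows how every frame attends to every other frame, which is what lets the encoder capture long-term dependencies.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (frames x dims) matrix."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    # Every frame scores every other frame, so distant frames interact directly.
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small Gaussian matrix used as stand-in data or weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
frames, feat_dim, d_model = 50, 20, 8  # hypothetical sizes, not the paper's
features = random_matrix(frames, feat_dim, rng)  # stand-in for feature frames
w_q, w_k, w_v = (random_matrix(feat_dim, d_model, rng) for _ in range(3))
output, attention = self_attention(features, w_q, w_k, w_v)
print(len(output), len(output[0]))  # 50 8
```

Each row of `attention` is a probability distribution over all input frames, so the output for one frame is a weighted mix of the whole utterance rather than of a local window.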
Transformer encoder-based spoofing countermeasure for synthetic speech
detection
WAN Yi 1,2 YANG Feiran 1,2 YANG Jun 1,2
(1 Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences,
Beijing 100190, China)
(2 University of Chinese Academy of Sciences, Beijing 100049, China)
Abstract: Automatic speaker verification systems are commonly used to authenticate the identity of a target
speaker, but they are vulnerable to synthetic speech attacks, a weakness that spoofing countermeasure systems
aim to alleviate. In this paper, we introduce a synthetic speech detection method based on the Transformer
encoder, which uses the self-attention mechanism to learn long-term dependencies within the input features.
Synthetic speech detection does not focus on the abstract semantic features of a sentence, so a model with few
parameters can also perform well. We evaluate four commonly used synthetic speech detection features with the
Transformer encoder. On the evaluation set of the ASVspoof2019 challenge logical access scenario, the proposed
system based on linear frequency cepstral coefficient features and the Transformer encoder achieves an equal
error rate (EER) of 3.13% and a tandem detection cost function (t-DCF) of 0.0708 with only 0.082 M model
parameters, so good detection performance is obtained with a small model.
Keywords: Automatic speaker verification; Synthetic speech detection; Transformer encoder
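The equal error rate reported above is the operating point at which the false acceptance rate for spoofed speech equals the false rejection rate for bona fide speech. The sketch below is a simplified threshold sweep, not the official ASVspoof evaluation script; the function name `equal_error_rate`, the convention that higher scores mean "more bona fide", and the toy scores are assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming a higher score means more bona fide."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bona fide utterance scores below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed utterance scores at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # On finite data the two rates rarely match exactly,
            # so report their average at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Toy scores (hypothetical): one bona fide and one spoofed utterance are misranked.
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.75]
eer, threshold = equal_error_rate(bonafide, spoof)
print(eer, threshold)  # 0.25 0.7
```

With these toy scores, the threshold 0.7 rejects one of four bona fide utterances and accepts one of four spoofed ones, giving an EER of 25%. The t-DCF additionally weights the countermeasure errors by the costs and priors of the combined verification system and is not reproduced here.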
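Received 2021-11-08; finalized 2022-02-15
∗ Supported by the National Natural Science Foundation of China (62171438), the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2018027), and the "Frontier Exploration" independent deployment project of the Institute of Acoustics, Chinese Academy of Sciences (QYTS202111)
About the author: WAN Yi (1995– ), female, from Zhangjiakou, Hebei; PhD candidate; research interest: detection of spoofed synthetic speech.
† Corresponding author. E-mail: jyang@mail.ioa.ac.cn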