Page 37 - 《应用声学》 (Journal of Applied Acoustics), 2023, Issue 1


Vol. 42, No. 1. Wan Yi et al.: A synthetic speech detection system based on the Transformer encoder. 33

