《应用声学》 (Journal of Applied Acoustics), 2023, No. 1, p. 37

Vol. 42, No. 1    Wan Yi et al.: A synthetic speech detection system based on a Transformer encoder    33


     1609.03499, 2016.
 [2] Wang Y, Skerry-Ryan R, Stanton D, et al. Tacotron: towards end-to-end speech synthesis[C]//Interspeech, 2017: 4006–4010.
 [3] Arik S O, Chrzanowski M, Coates A, et al. Deep voice: real-time neural text-to-speech[J]. arXiv Preprint, arXiv: 1702.07825, 2017.
 [4] Kinnunen T, Wu Z, Lee K, et al. Vulnerability of speaker verification systems against voice conversion spoofing attacks: the case of telephone speech[C]//IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012: 4401–4404.
 [5] de Leon P, Pucher M, Yamagishi J, et al. Evaluation of speaker verification security and detection of HMM-based synthetic speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(8): 2280–2290.
 [6] Wu Z, Evans N, Kinnunen T, et al. Spoofing and countermeasures for speaker verification: a survey[J]. Speech Communication, 2015, 66: 130–153.
 [7] Kinnunen T, Lee K, Delgado H, et al. t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification[C]//Odyssey: The Speaker and Language Recognition Workshop, 2018: 312–319.
 [8] Wang X, Yamagishi J, Todisco M, et al. ASVspoof 2019: a large-scale public database of synthesized, converted and replayed speech[J]. Computer Speech and Language, 2020, 64: 101114.
 [9] Zhang C, Yu C, Hansen J. An investigation of deep-learning frameworks for speaker verification antispoofing[J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(4): 684–694.
[10] Todisco M, Delgado H, Evans N. A new feature for automatic speaker verification anti-spoofing: constant Q cepstral coefficients[C]//Odyssey: The Speaker and Language Recognition Workshop, 2016: 283–290.
[11] Yu H, Tan Z, Ma Z, et al. Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(10): 4633–4644.
[12] Alzantot M, Wang Z, Srivastava M. Deep residual neural networks for audio spoofing detection[C]//Interspeech, 2019: 1078–1082.
[13] Das R, Yang J, Li H. Long range acoustic features for spoofed speech detection[C]//Interspeech, 2019: 1058–1062.
[14] Lavrentyeva G, Novoselov S, Tseren A, et al. STC antispoofing systems for the ASVspoof2019 challenge[C]//Interspeech, 2019: 1033–1037.
[15] Luo A, Li E, Liu Y, et al. A capsule network based approach for detection of audio spoofing attacks[C]//IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021: 6359–6363.
[16] Zhang Y, Jiang F, Duan Z. One-class learning towards synthetic voice spoofing detection[J]. IEEE Signal Processing Letters, 2021, 28: 937–941.
[17] Sahidullah M, Kinnunen T, Hanilci C. A comparison of features for synthetic speech detection[C]//Interspeech, 2015: 2087–2091.
[18] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017: 5998–6008.
[19] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16 × 16 words: Transformers for image recognition at scale[C]//International Conference on Learning Representations (ICLR), 2021.
[20] Liu Z, Lin Y, Cao Y. Swin transformer: hierarchical vision transformer using shifted windows[J]. arXiv Preprint, arXiv: 2103.14030, 2021.
[21] Gong Y, Chung Y, Glass J. AST: audio spectrogram transformer[J]. arXiv Preprint, arXiv: 2104.01778, 2021.
[22] Zhang Z, Yi X, Zhao X. Fake speech detection using residual network with transformer encoder[C]//ACM Workshop on Information Hiding and Multimedia Security, 2021: 13–22.
[23] Saratxaga I, Sanchez J, Wu Z, et al. Synthetic speech detection using phase information[J]. Speech Communication, 2016, 81: 30–41.
[24] Todisco M, Wang X, Vestman V, et al. ASVspoof 2019: Future horizons in spoofed and fake audio detection[C]//Interspeech, 2019: 1008–1012.
[25] Delgado H, Evans N, Kinnunen T, et al. ASVspoof 2021: automatic speaker verification spoofing and countermeasures challenge evaluation plan[J]. arXiv Preprint, arXiv: 2109.00535, 2021.
[26] Loshchilov I, Hutter F. Decoupled weight decay regularization[J]. arXiv Preprint, arXiv: 1711.05101, 2017.
[27] Yang J, Wang H, Das R, et al. Modified magnitude-phase spectrum information for spoofing detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 1065–1078.
[28] Gomez-Alanis A, Peinado A, Gonzalez J, et al. A gated recurrent convolutional neural network for robust spoofing detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(12): 1985–1999.
[29] Li X, Li N, Weng C, et al. Replay and synthetic speech detection with Res2Net architecture[C]//IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021: 6354–6358.