…emotional expressiveness, emotional speech with child-like characteristics was synthesized. The experimental results show that the transfer-learning approach adopted in this paper can effectively synthesize speech with comparatively good naturalness and emotional expressiveness.

This study still has certain limitations. Owing to the recording environment and the limited size of the small-sample corpus, the experimental results exhibit a degree of overfitting. In addition, training on the small-sample corpus still requires more than 30 h; compared with other low-resource emotional speech synthesis methods, the training efficiency remains to be improved. Furthermore, the synthesis in this paper is limited to a single speaker. Research on multi-speaker emotional speech synthesis faces a similar problem of modeling speaker-dependent emotional characteristics, so future work could apply the same techniques to multi-speaker emotional speech synthesis experiments. Small-sample emotional speech synthesis thus still leaves considerable room for development.
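As a concrete illustration of the transfer-learning recipe summarized above, the sketch below fine-tunes a pretrained acoustic model on a small emotional corpus while freezing the text encoder, one common way to limit the overfitting noted here. This is a minimal PyTorch sketch under assumed names: TinyAcousticModel, the checkpoint path, and the toy batch are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of small-sample transfer learning for emotional TTS:
# start from an acoustic model pretrained on a large neutral corpus,
# freeze the text encoder, and fine-tune only the decoder on the small
# emotional corpus. All names here are hypothetical placeholders.
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Stand-in for a Tacotron 2-style encoder/decoder acoustic model."""
    def __init__(self, vocab=64, dim=128, n_mels=80):
        super().__init__()
        self.encoder = nn.Sequential(nn.Embedding(vocab, dim),
                                     nn.Linear(dim, dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, n_mels))

    def forward(self, tokens):
        # tokens: (batch, time) phoneme ids -> (batch, time, n_mels) frames
        return self.decoder(self.encoder(tokens))

model = TinyAcousticModel()
# model.load_state_dict(torch.load("pretrained_average_voice.pt"))  # hypothetical checkpoint

# Freeze the encoder so the scarce emotional data only adapts the decoder,
# reducing the risk of overfitting the small corpus.
for p in model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.MSELoss()

# One fine-tuning step on a toy (tokens, mel) batch standing in for the
# small child emotional corpus.
tokens = torch.randint(0, 64, (8, 50))   # batch of phoneme id sequences
target_mels = torch.randn(8, 50, 80)     # aligned mel-spectrogram frames
loss = criterion(model(tokens), target_mels)
loss.backward()
optimizer.step()
```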