《应用声学》 (Applied Acoustics), 2023, No. 1 (January 2023)
... the condition of emotional expressiveness, emotional speech with children's vocal characteristics is synthesized. The experimental results show that the transfer learning approach used in this paper can effectively synthesize speech with relatively good naturalness and emotional expressiveness.

This study still has some shortcomings. Because of the recording environment and the limited size of the small-sample data, the experimental results show some overfitting. Training on the small-sample data still takes more than 30 h, so training efficiency needs improvement relative to other low-resource emotional speech synthesis methods. In addition, the speech synthesis in this paper is limited to a single speaker. In multi-speaker emotional speech synthesis, the problem of handling speakers' emotional characteristics is similar to the problem studied here, so future work could apply the same technique to multi-speaker emotional speech synthesis. Small-sample emotional speech synthesis still has considerable room for development.