Table 11  Robustness results

Method                            Repeated words  Skipped words  Wrong words  WER/%
Tacotron2+HiFi-GAN                             3              4           13  10.20
Tacotron2+HiFi-GAN (fine-tuned)                1              2            8   5.61
Glow-TTS+HiFi-GAN                              1              3            8   6.12
Glow-TTS+HiFi-GAN (fine-tuned)                 0              2            8   5.10
VITS                                           0              1            7   4.08
ITHSS                                          0              0            2   1.02
As Table 11 shows, the two-stage TTS models Tacotron2 and Glow-TTS are the least robust, with WERs of 10.20% and 6.12%, respectively. The fine-tuned two-stage models and the single-stage models are free of the mismatch between the feature distributions seen in training and in inference, and therefore achieve clearly lower WERs. ITHSS is the most robust of all: it produces no repeated and no skipped words, and its WER is only 1.02%.
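The WER reported in Table 11 is the standard word-level edit-distance metric: repeated words count as insertions, skipped words as deletions, and wrong words as substitutions, with the total divided by the reference length. A minimal sketch of such a computation is given below; the function name and the test sentences are illustrative, not from the paper.

def wer(ref_words, hyp_words):
    """Word error rate: edit distance between reference and hypothesis
    word sequences, divided by the reference length."""
    m, n = len(ref_words), len(hyp_words)
    # d[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion  (skipped word)
                          d[i][j - 1] + 1,         # insertion (repeated word)
                          d[i - 1][j - 1] + cost)  # substitution (wrong word)
    return d[m][n] / m

# Illustrative Hmong RPA sentences (made up for this example):
ref = "nyob zoo os".split()
hyp = "nyob nyob zoo".split()
print(wer(ref, hyp))  # 2 edits / 3 reference words ~ 0.667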
4 Conclusion
To address the high complexity and slow inference of two-stage TTS models, this paper proposed ITHSS, a fast speech synthesis method for the Hmong language. The method generates no intermediate acoustic features but converts the input text directly into a speech waveform, which effectively simplifies the training of the synthesis model. On the one hand, to preserve more of the input text information, a residual encoder is used to retain textual context, improving the quality of the synthesized speech; a sketch of the idea follows.
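This section states only that a residual encoder preserves textual context; the architecture itself is not described here. Below is a minimal, hypothetical sketch of the general idea, in which each encoder block adds its input back to its transformed output so that token-level text information survives the whole stack; the layer types and sizes are assumptions, not the authors' design.

import torch
import torch.nn as nn

class ResidualEncoderBlock(nn.Module):
    """One encoder block with a skip connection: the output is
    f(x) + x, so the incoming text representation is preserved
    alongside the newly extracted features."""
    def __init__(self, dim=256, kernel=5):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(torch.relu(h) + x)  # residual connection

# Illustrative usage: embed text tokens, then stack residual blocks.
emb = nn.Embedding(100, 256)  # 100 = assumed grapheme vocabulary size
encoder = nn.Sequential(*[ResidualEncoderBlock() for _ in range(4)])
tokens = torch.randint(0, 100, (1, 32))  # one sentence of 32 symbols
context = encoder(emb(tokens))  # (1, 32, 256) contextual representation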
On the other hand, the speech waveform is generated with the inverse short-time Fourier transform (iSTFT), which both reduces the number of model parameters and speeds up inference; a sketch of this step closes the section. ITHSS also offers a reference point for speech synthesis research on other minority languages written with Latin characters. To remedy the limitations of the present method and of the corpus scale, future work will focus on: (1) prosody modeling for Hmong speech synthesis, to improve its naturalness; (2) enlarging the current single-speaker corpus and building a multi-speaker Hmong speech synthesis corpus, to enable multi-timbre output and voice conversion.
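This section does not give the implementation details of the iSTFT stage. The following is a minimal sketch of the general mechanism, assuming the network's final layer predicts magnitude and phase spectrograms; the FFT parameters, tensor shapes, and function names are assumptions, not the authors' code. Because torch.istft is a fixed transform with no learned weights, the model needs no neural upsampling stack to reach the waveform, which is what saves parameters and inference time.

import torch

# Assumed analysis parameters; the paper does not specify them here.
N_FFT, HOP, WIN = 1024, 256, 1024

def spec_to_wave(magnitude, phase):
    """Convert predicted magnitude/phase spectrograms into a waveform.
    magnitude, phase: (batch, N_FFT // 2 + 1, frames) real tensors,
    e.g. the output of the model's final projection layer."""
    # Combine magnitude and phase into a complex spectrogram.
    spec = magnitude * torch.exp(1j * phase)
    # Inverse STFT: deterministic, parameter-free waveform synthesis.
    window = torch.hann_window(WIN)
    return torch.istft(spec, n_fft=N_FFT, hop_length=HOP,
                       win_length=WIN, window=window)

# Illustrative usage with random spectrograms:
mag = torch.rand(1, N_FFT // 2 + 1, 100)
pha = torch.rand(1, N_FFT // 2 + 1, 100) * 2 * torch.pi
wave = spec_to_wave(mag, pha)  # (1, samples)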