Table 11  Robustness results

Method                             Repeated  Skipped  Wrong  WER/%
Tacotron2+HiFi-GAN                     3        4       13   10.20
Tacotron2+HiFi-GAN (fine-tuned)        1        2        8    5.61
Glow-TTS+HiFi-GAN                      1        3        8    6.12
Glow-TTS+HiFi-GAN (fine-tuned)         0        2        8    5.10
VITS                                   0        1        7    4.08
ITHSS                                  0        0        2    1.02
As Table 11 shows, the two-stage TTS models Tacotron2 and Glow-TTS are the least robust, with WERs of 10.20% and 6.12%, respectively. The fine-tuned two-stage models and the single-stage models do not suffer from the mismatch between the feature distributions seen in training and in inference, and therefore achieve clearly lower WERs. ITHSS exhibits the best robustness of all: it produces no repeated or skipped words, and its WER is only 1.02%.
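The WER column is consistent with a simple word error rate: the total number of erroneous words (repeated + skipped + wrong) divided by the number of words in the test text. Below is a minimal sketch that reproduces the column under this assumption; note that the test-set size of 196 words is inferred from the published percentages and is not stated in the paper.

```python
# Recompute the WER column of Table 11 from the raw error counts,
# assuming WER = (repeated + skipped + wrong) / total_words.
rows = {
    "Tacotron2+HiFi-GAN": (3, 4, 13),
    "Tacotron2+HiFi-GAN (fine-tuned)": (1, 2, 8),
    "Glow-TTS+HiFi-GAN": (1, 3, 8),
    "Glow-TTS+HiFi-GAN (fine-tuned)": (0, 2, 8),
    "VITS": (0, 1, 7),
    "ITHSS": (0, 0, 2),
}
TOTAL_WORDS = 196  # assumed; this value matches all six published WERs

for name, (repeated, skipped, wrong) in rows.items():
    wer = 100.0 * (repeated + skipped + wrong) / TOTAL_WORDS
    print(f"{name}: WER = {wer:.2f}%")
```

Running the sketch yields 10.20, 5.61, 6.12, 5.10, 4.08, and 1.02, matching the table exactly.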
4 Conclusion
To address the high complexity and slow inference of two-stage TTS models, this paper proposed ITHSS, a fast speech synthesis method for the Hmong (Miao) language. ITHSS generates no intermediate acoustic features: it converts the input text directly into a speech waveform, which effectively simplifies the training of the synthesis model. On the one hand, a residual encoder is used to retain more contextual information from the input text, improving the quality of the synthesized speech. On the other hand, the waveform is generated with the inverse short-time Fourier transform (iSTFT), which both reduces the number of model parameters and speeds up inference; a minimal sketch of this step follows the conclusion. ITHSS may also serve as a reference for speech synthesis research on other minority languages written in the Latin script. To address the limitations of the proposed method and of the corpus scale, future work will focus on: (1) modeling the prosody of Hmong speech synthesis to improve its naturalness; (2) enlarging the current single-speaker corpus and constructing a multi-speaker Hmong speech synthesis corpus, so as to enable multi-timbre output and voice conversion.
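As an illustration of the waveform-generation step above, the following is a minimal sketch of iSTFT-based synthesis, not the actual ITHSS implementation: a decoder is assumed to predict magnitude and phase spectrograms, and torch.istft turns them into a waveform. All shapes and STFT parameters here are illustrative.

```python
import torch

N_FFT, HOP, WIN = 1024, 256, 1024  # assumed STFT parameters

def istft_vocoder(magnitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
    """Convert predicted magnitude/phase spectrograms of shape
    (batch, n_fft // 2 + 1, frames) into a waveform (batch, samples)."""
    # Recombine magnitude and phase into a complex spectrogram.
    spec = torch.polar(magnitude, phase)
    # The inverse STFT does the upsampling directly; no learned
    # upsampling layers are needed, which is why this stage is
    # small in parameters and fast at inference time.
    return torch.istft(spec, n_fft=N_FFT, hop_length=HOP,
                       win_length=WIN,
                       window=torch.hann_window(WIN))

# Example with dummy decoder outputs: 100 frames -> ~25k samples.
mag = torch.rand(1, N_FFT // 2 + 1, 100)           # non-negative magnitudes
pha = (torch.rand(1, N_FFT // 2 + 1, 100) * 2 - 1) * torch.pi  # phases in [-pi, pi)
wav = istft_vocoder(mag, pha)
print(wav.shape)  # torch.Size([1, 25344]) with these settings
```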