Article Abstract
Cai Shan, Wang Lin, Guo Sheng, Zou Xue, Wu Lei. Inverse short-time Fourier transform-based Hmong language speech synthesis method*[J].,2025,44(2):339-349
Inverse short-time Fourier transform-based Hmong language speech synthesis method*
Received: 2023-10-12  Revised: 2025-02-28
Abstract:
      Speech synthesis for minority languages is an important direction in speech synthesis research and has attracted considerable attention in the field of human-computer interaction. To address the high complexity and slow inference speed of existing two-stage speech synthesis models, this paper proposes a Hmong language speech synthesis method based on the inverse short-time Fourier transform. Following the speech feature extraction process, the method reduces the use of upsampling convolutions to lower model complexity, and it uses the inverse short-time Fourier transform to reconstruct the speech waveform from its phase and amplitude spectra, giving a fast conversion from the frequency domain to the time domain. In addition, a residual encoder extracts features from the text so that more of the input text information is retained. To verify the effectiveness of the proposed method, the self-built Hmong speech corpus HmongSpeech (download link: http://sxjxsf.gzmu.edu.cn/info/1728/1214.htm) is used as the benchmark dataset for comparison with typical two-stage and single-stage models. The experimental results show that the proposed method increases inference speed by a factor of 4 to 5 without reducing the quality of the synthesized speech, and that its real-time factor is 0.01, which meets the requirements of real-time applications. The method is also robust: the word error rate of the synthesized speech is only 1.02%.
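The frequency-to-time step and the real-time factor mentioned in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' model: in the paper a network would supply the amplitude and phase spectra, whereas here they are taken from a forward STFT of a test tone so that the reconstruction can be checked. The function names, the periodic Hann window, and the values `n_fft=16` and `hop=4` are all assumptions chosen for illustration, and the real-time factor is taken in its usual sense of synthesis time divided by audio duration.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(signal, n_fft, hop):
    """Naive STFT: one list of n_fft // 2 + 1 complex bins per frame."""
    win = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + i] * win[i] for i in range(n_fft)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def istft(magnitude, phase, n_fft, hop):
    """Rebuild a waveform from amplitude and phase spectra.

    Each frame is inverted with a naive inverse DFT, windowed, and
    overlap-added; the sum is normalised by the accumulated squared window.
    n_fft is assumed to be even.
    """
    win = hann(n_fft)
    length = n_fft + hop * (len(magnitude) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for t, (mag_t, ph_t) in enumerate(zip(magnitude, phase)):
        # Combine amplitude and phase into a one-sided complex spectrum.
        half = [m * cmath.exp(1j * p) for m, p in zip(mag_t, ph_t)]
        # Restore the conjugate-symmetric bins of a real-valued signal.
        full = half + [half[n_fft - k].conjugate()
                       for k in range(n_fft // 2 + 1, n_fft)]
        for n in range(n_fft):
            x = sum(full[k] * cmath.exp(2j * math.pi * k * n / n_fft)
                    for k in range(n_fft)).real / n_fft
            out[t * hop + n] += x * win[n]
            norm[t * hop + n] += win[n] ** 2
    # Samples where the squared-window sum is (near) zero cannot be recovered.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


def real_time_factor(synthesis_seconds, audio_seconds):
    """Time spent synthesising divided by the duration of the audio produced."""
    return synthesis_seconds / audio_seconds


# Round trip on a short test tone (illustrative sizes only).
sig = [math.sin(2 * math.pi * 3 * i / 64) for i in range(64)]
frames = stft(sig, n_fft=16, hop=4)
mag = [[abs(c) for c in f] for f in frames]
ph = [[cmath.phase(c) for c in f] for f in frames]
rec = istft(mag, ph, n_fft=16, hop=4)
# Sample 0 is skipped: the periodic Hann window is zero there.
max_err = max(abs(a - b) for a, b in zip(rec[1:], sig[1:]))
```

Under this definition, a hypothetical system that needs 0.05 s to synthesize 5 s of speech has `real_time_factor(0.05, 5.0)` of about 0.01; values below 1 mean synthesis runs faster than playback. A practical implementation would use an FFT-based routine rather than the naive transforms above.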
DOI:10.11684/j.issn.1000-310X.2025.02.008
Keywords: Hmong language speech synthesis  Inverse short-time Fourier transform  Inference speed  Residual encoder
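Funding: Guizhou Provincial Science and Technology Program (黔科合基础-ZK[2023]一般143); Natural Science Research Project of the Guizhou Provincial Department of Education (黔教技[2023]061号, 黔教技[2023]012号); Maker Space Project of the Guizhou Provincial Department of Science and Technology, "Qianmin Zhumeng Maker Space" (黔科合平台人才ZCKJ[2021]007)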
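Author      Affiliation                 E-mail
Cai Shan    Guizhou Minzu University    2291084203@qq.com
Wang Lin*   Guizhou Minzu University    wanglin@gzmu.edu.cn
Guo Sheng   Guizhou Minzu University    2972564285@qq.com
Zou Xue     Guizhou Minzu University    2217060877@qq.com
Wu Lei      Guizhou Minzu University    891041854@qq.com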