Page 79 - Journal of Applied Acoustics, 2025, No. 2

Vol. 44, No. 2
Journal of Applied Acoustics, March 2025

⋄ Research Article ⋄

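Inverse short-time Fourier transform-based Hmong speech synthesis method ∗

CAI Shan 1,2 WANG Lin 1,2† GUO Sheng 1,2 ZOU Xue 1,2 WU Lei 1,2

(1 College of Data Science and Information Engineering, Guizhou Minzu University, Guiyang 550025)
(2 Key Laboratory of Pattern Recognition and Intelligent System of Guizhou Province, Guiyang 550025)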
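Abstract: Speech synthesis for minority languages is an important direction of speech synthesis research
and has attracted considerable attention in human-computer interaction. To address the high complexity and
slow inference speed of existing two-stage speech synthesis models, a Hmong speech synthesis method based on
the inverse short-time Fourier transform is proposed. Following the speech feature extraction process, the
method reduces the use of upsampling convolutions to lower model complexity, and combines the inverse
short-time Fourier transform to reconstruct the phase and magnitude spectra of the speech waveform, achieving
fast conversion from the frequency domain to the time domain. In addition, a residual encoder is used to
extract features from the text so as to retain more of the input text information. To verify the
effectiveness of the proposed method, the self-built Hmong speech corpus HmongSpeech
(download link: http://sxjxsf.gzmu.edu.cn/info/1728/1214.htm) is used as the benchmark dataset for
comparison with typical two-stage and single-stage models. Experimental results show that the proposed method
increases inference speed by 4 to 5 times without reducing the quality of the synthesized speech, with a
real-time factor of 0.01, which meets the requirements of real-time applications; it is also highly robust,
with a word error rate of only 1.02% for the synthesized speech.
Keywords: Hmong speech synthesis; inverse short-time Fourier transform; inference speed; residual encoder
CLC number: TP391 Document code: A Article ID: 1000-310X(2025)02-0339-11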
DOI: 10.11684/j.issn.1000-310X.2025.02.008
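
The abstract states that the waveform is rebuilt from a magnitude spectrum and a phase spectrum with the inverse short-time Fourier transform, and reports speed as a real-time factor. The following is a minimal NumPy sketch of those two ideas only; it is not the authors' model. The function names (`stft`, `istft`, `real_time_factor`), the frame size `n_fft=16` and the hop `hop=4` are illustrative assumptions, and the magnitude and phase come from analysing a test tone rather than from a neural network. The real-time factor is taken here in its usual sense of synthesis time divided by audio duration.

```python
import numpy as np

def stft(x, n_fft=16, hop=4):
    """Analysis step, used here only to obtain a test spectrogram."""
    window = np.hanning(n_fft + 1)[:-1]          # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)           # (n_frames, n_fft // 2 + 1)

def istft(magnitude, phase, n_fft=16, hop=4):
    """Rebuild a waveform from magnitude and phase by windowed overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]
    spec = magnitude * np.exp(1j * phase)        # complex spectrum per frame
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i]
        norm[i * hop:i * hop + n_fft] += window ** 2
    covered = norm > 1e-8                        # skip samples the window zeroes out
    out[covered] /= norm[covered]
    return out

def real_time_factor(synthesis_seconds, audio_seconds):
    """Time spent synthesising divided by the duration of the audio produced."""
    return synthesis_seconds / audio_seconds

# Round trip on a test tone: analyse, split into magnitude and phase, resynthesise.
t = np.arange(64)
x = np.sin(2 * np.pi * t / 16)
spec = stft(x)
y = istft(np.abs(spec), np.angle(spec))
```

In this sketch the interior samples of `y` match `x` up to floating-point error, while the first sample is not recovered because the periodic Hann window is zero there. Under the definition above, a real-time factor of 0.01 corresponds to about 0.01 s of computation per second of synthesized audio.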

Inverse short-time Fourier transform-based Hmong language speech synthesis method

CAI Shan 1,2 , WANG Lin 1,2 , GUO Sheng 1,2 , ZOU Xue 1,2 and WU Lei 1,2

(1 College of Data Science and Information Engineering, Guizhou Minzu University, Guiyang 550025, China)
(2 Key Laboratory of Pattern Recognition and Intelligent System of Guizhou Province, Guiyang 550025, China)

Abstract: As an important area of speech synthesis research, speech synthesis for minority languages has
attracted significant attention in the field of human-computer interaction. To address the high complexity
and slow inference speed of existing two-stage speech synthesis models, a Hmong speech synthesis method
based on the inverse short-time Fourier transform is proposed. Following the speech feature extraction
process, the method reduces the use of upsampling convolutions in order to lower model complexity. At the
same time, the phase and magnitude spectra of the speech waveform are reconstructed with the inverse
short-time Fourier transform, which enables fast conversion from the frequency domain to the time domain.
Furthermore, a residual encoder is used to extract features from the text so as to retain more of the input
text information. To verify the effectiveness of the proposed method, the self-built Hmong speech corpus,
HmongSpeech (download link: http://sxjxsf.gzmu.edu.cn/info/1728/1214.htm), is used as the benchmark dataset
for comparison with typical two-stage and single-stage models. The experimental results show that the
proposed method improves inference speed by a factor of 4 to 5 without reducing the quality of synthesized speech,

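Received 2023-10-12; accepted 2023-12-18
∗ Supported by the Guizhou Provincial Science and Technology Program (Qiankehe Jichu-ZK[2023] General 143), the Natural Science Research Project of the Guizhou Provincial Department of Education (Qianjiaoji [2023]061, Qianjiaoji [2023]012), and the Maker Space Project of the Guizhou Provincial Department of Science and Technology, "Qianmin Zhumeng Maker Space" (Qiankehe Pingtai Rencai ZCKJ[2021]007)
About the author: CAI Shan (1996–), female, from Bijie, Guizhou; master's student; research interests: signal processing and speech synthesis.
† Corresponding author. E-mail: wanglin@gzmu.edu.cn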