Article Abstract
Chinese Voice Conversion Model Based on Multi-Level Information Embedding
Submitted: 2024-11-02  Revised: 2024-12-09
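Chinese abstract (translated):
      Existing any-to-any voice conversion methods struggle to balance similarity and naturalness, and are poorly suited to Chinese voice conversion, which places high demands on prosody such as intonation and rhythm. Targeting Chinese speech, this paper proposes a Chinese voice conversion model based on multi-level information embedding. First, a timbre encoder based on convolution and a multi-head attention mechanism extracts a timbre representation from the target speech. Second, an autocorrelation function method extracts prosodic information from the target speech and the source speech separately, and the two are normalized and fused. Finally, a generator, HiFi-GAN++, is designed around a multi-level information embedding strategy: building on the matched self-supervised features, it embeds the timbre and prosodic information step by step across multiple layers of loops and generates speech. Comparative experiments on three mainstream Chinese datasets, Thchs-30, Aishell-1, and Aishell-3, show that the proposed model lowers the character error rate by 19.11% on average and raises the similarity score by 26.62% on average relative to the baseline models. The proposed model not only generates converted Chinese speech closer in quality to real speech but also adapts well to short-utterance scenarios, giving it broader application prospects.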
English abstract:
      Existing any-to-any voice conversion (VC) models struggle to balance the similarity and naturalness of the generated speech, and they are poorly suited to Chinese speech, which places high demands on prosody such as intonation and rhythm. In this paper, an any-to-any voice conversion model based on multi-level information embedding is proposed for Chinese speech. Specifically, the timbre representation is first extracted from the target speech using a timbre encoder based on a convolutional neural network and a multi-head attention mechanism. Then, the autocorrelation function is used to extract prosodic information from the target speech and the source speech respectively, and the extracted information is normalized and fused. Finally, a generator, HiFi-GAN++, based on a multi-level embedding strategy is designed. On the basis of the matched self-supervised features, the timbre information and prosodic information are gradually embedded in multi-level loops to generate speech. Comparative experiments on three mainstream Chinese datasets, Thchs-30, Aishell-1, and Aishell-3, show that the proposed model reduces the character error rate by an average of 19.11% and increases the similarity score by an average of 26.62% compared with the baseline models. The proposed model not only generates converted Chinese speech that is closer to real speech in quality, but also adapts well to short-utterance scenarios, giving it broader application prospects.
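The autocorrelation step named in the abstract can be illustrated with a minimal sketch. It assumes that the prosodic information is a per-frame fundamental frequency (F0) contour and that normalization means a z-score over log-F0; the abstract does not confirm either choice. The function names, the 60 to 400 Hz search range, and the NumPy implementation are illustrative assumptions, not the authors' method, and the fusion of source and target prosody is not covered.

```python
import numpy as np


def estimate_f0_autocorr(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the F0 (Hz) of one frame from its autocorrelation peak.

    Returns 0.0 for a constant (e.g. silent) frame. The search range
    f0_min..f0_max is an assumed default, not a value from the paper.
    """
    frame = np.asarray(frame, dtype=np.float64)
    frame = frame - frame.mean()  # remove DC offset
    if not np.any(frame):
        return 0.0
    # Full autocorrelation; keep only the non-negative lags.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(acf) - 1)
    if lag_max <= lag_min:
        return 0.0
    # The lag with the strongest self-similarity approximates the pitch period.
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag


def normalize_log_f0(f0):
    """Z-score normalize log-F0 over voiced frames; unvoiced frames stay 0.

    This is one common normalization, assumed here for illustration.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    out = np.zeros_like(f0)
    if voiced.sum() < 2:
        return out
    log_f0 = np.log(f0[voiced])
    std = log_f0.std()
    out[voiced] = (log_f0 - log_f0.mean()) / (std if std > 0 else 1.0)
    return out


if __name__ == "__main__":
    sr = 16000
    t = np.arange(800) / sr
    tone = np.sin(2 * np.pi * 200.0 * t)  # synthetic 200 Hz test tone
    print(estimate_f0_autocorr(tone, sr))
    print(normalize_log_f0([0.0, 100.0, 200.0, 0.0]))
```

Normalizing each speaker's contour in this way removes the speaker's average pitch level while keeping the contour shape, which is one plausible reason to normalize before combining source and target prosody.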
DOI:
Keywords (Chinese, translated): Chinese voice conversion  Multi-level information embedding  Timbre  Prosody  Generator
Keywords (English): Chinese voice conversion  Multi-level information embedding  Timbre  Rhythm  HiFi-GAN++
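Funding: Humanities and Social Sciences Research Planning Fund of the Ministry of Education (24YJA870011), Anhui Provincial Key Research and Development Program (202104d07020001), Anhui Provincial Natural Science Foundation (2208085MF166)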
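Author  Affiliation  Postal code
Zhang Guofu (张国富)  Hefei University of Technology  230009
Zhang Peng (张朋)  Hefei University of Technology  230009
Su Zhaopin (苏兆品)*  Hefei University of Technology  230009
Yue Feng (岳峰)  Hefei University of Technology  230009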