Existing Any-to-Any voice conversion (VC) models struggle to balance the similarity and naturalness of the converted speech, and they are poorly suited to Chinese speech, which places high demands on tone, intonation, and other rhythmic properties. In this paper, an Any-to-Any voice conversion model based on multi-level information embedding is proposed for Chinese speech. Specifically, a timbre representation is first extracted from the target speech using a timbre extractor based on a convolutional neural network and a multi-head attention mechanism. Then, the autocorrelation function is used to extract rhythm information from the target speech and the source speech respectively, and this rhythm information is normalized and fused. Finally, a generator, HiFi-GAN++, is designed based on a multi-level embedding strategy: on top of the matched self-supervised features, the timbre and prosodic information are gradually embedded and speech is generated through multi-level loops. Comparative experiments on three mainstream Chinese datasets, Thchs-30, Aishell-1, and Aishell-3, show that the proposed model reduces the character error rate by 19.11% on average compared with the baseline models and increases the similarity score by 26.62% on average. The proposed model not only generates converted Chinese speech that is closer to real speech in quality, but also adapts well to short-speech scenarios, giving it broad application prospects.
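The abstract's rhythm-extraction step relies on the autocorrelation function (ACF). As a hedged illustration only (the paper's actual frame sizes, sample rate, and lag search range are not given here and are assumed), the sketch below estimates the pitch period of a synthetic tone by locating the ACF peak, which is the core operation behind ACF-based rhythm and F0 analysis:

```python
import math

def autocorr_period(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] that maximizes the autocorrelation.

    This is a plain O(N * L) ACF peak search; real systems typically add
    windowing, normalization, and FFT-based acceleration (all omitted here).
    """
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        acf = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if acf > best_val:
            best_lag, best_val = lag, acf
    return best_lag

# Assumed parameters for the illustration: 16 kHz audio, a 200 Hz test tone,
# and an F0 search range of 80-400 Hz (typical for speech).
sr = 16000
f0 = 200.0
frame = [math.sin(2 * math.pi * f0 * n / sr) for n in range(1024)]
period = autocorr_period(frame, sr // 400, sr // 80)
estimated_f0 = sr / period  # the ACF peaks at one full pitch period
```

On this synthetic tone the ACF peaks at a lag of 80 samples, recovering the 200 Hz fundamental; per-frame estimates like this, normalized across source and target utterances, are the kind of rhythm feature the abstract describes fusing.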