Page 141 - 《应用声学》 (Journal of Applied Acoustics), 2023, No. 2
Vol. 42, No. 2 拉巴顿珠 et al.: End-to-end Tibetan speech synthesis method 329
[Figure 8: four attention alignment plots. x-axis: Decoder timestep; y-axis: Encoder timestep. Panels: (a) 3000, (b) 10000, (c) 59000, (d) 100000.]
Fig. 8 Attention alignments during training
[Figure 9: four panels of plots titled "Predicted Mel-Spectrogram".
(a) 3000: Tacotron-2, 2020-12-24 18:31, step=3000, loss=1.49707
(b) 10000: Tacotron-2, 2020-12-24 20:21, step=10000, loss=1.13168
(c) 59000: Tacotron-2, 2020-12-25 09:14, step=59000, loss=0.69886
(d) 100000: Tacotron-2, 2020-12-25 19:58, step=100000, loss=0.60185]
Fig. 9 Mel-spectrograms of synthetic speech