Page 141 - 《应用声学》 (Journal of Applied Acoustics), 2023, No. 2


Vol. 42, No. 2 | 拉巴顿珠 et al.: End-to-end Tibetan speech synthesis method | p. 329

[Figure: four attention alignment plots. x-axis: decoder timestep; y-axis: encoder timestep; color scale: attention weight. Panels: (a) 3000 steps, (b) 10000 steps, (c) 59000 steps, (d) 100000 steps.]
Fig. 8 Attention alignment during training
[Figure: four panels of Tacotron-2 training output, each including a plot titled "Predicted Mel-Spectrogram" with a color scale from -4 to 4. Panel footers:
(a) 3000 steps: Tacotron-2, 2020-12-24 18:31, step=3000, loss=1.49707
(b) 10000 steps: Tacotron-2, 2020-12-24 20:21, step=10000, loss=1.13168
(c) 59000 steps: Tacotron-2, 2020-12-25 09:14, step=59000, loss=0.69886
(d) 100000 steps: Tacotron-2, 2020-12-25 19:58, step=100000, loss=0.60185]
Fig. 9 Mel-spectrograms of synthesized speech
