Journal of Applied Acoustics, Vol. 40, No. 2, March 2021
⋄ Research Report ⋄
Position encoding selection for a Transformer-based Mandarin speech recognition model
XU Dongdong †
(Graduate School of the Second Academy of China Aerospace Science and Industry Corporation, Beijing 100854, China)
Abstract: Transformer networks with a self-attention mechanism have gradually attracted wide attention in speech recognition research. Focusing on how position information is embedded and combined with speech features, this paper studies position encoding methods better suited to Mandarin speech recognition models. The speech recognition system is trained on the basis of the Transformer model, and four different position encoding methods are compared. The experimental results show that replacing sinusoidal position encoding with a convolutional input representation better integrates the contextual relationships and relative position information of speech features and yields better recognition results. Combined with a 3-gram language model, the proposed convolutional position encoding method reduces the recognition error rate on the Chinese speech dataset AISHELL-1 to 8.16%.
Keywords: Self-attention; Position encoding; Convolution
CLC number: TN912.34; Document code: A; Article ID: 1000-310X(2021)02-0194-06
DOI: 10.11684/j.issn.1000-310X.2021.02.004
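To make the comparison concrete, the following minimal sketch (PyTorch is assumed; module names, kernel size, and dimensions are illustrative and not taken from the paper) contrasts the standard additive sinusoidal position encoding with a convolutional input representation, in which a 1-D convolution over the frame axis injects relative position information through its receptive field.

    # Minimal sketch of the two position-encoding choices compared in the
    # abstract (PyTorch assumed; all sizes are illustrative, not the paper's).
    import math
    import torch
    import torch.nn as nn

    class SinusoidalPositionEncoding(nn.Module):
        """Fixed sinusoidal table added to the input features."""
        def __init__(self, d_model: int, max_len: int = 5000):
            super().__init__()
            pe = torch.zeros(max_len, d_model)
            pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
            div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                            * (-math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

        def forward(self, x):  # x: (batch, time, d_model)
            return x + self.pe[:, : x.size(1)]

    class ConvPositionEncoding(nn.Module):
        """Convolutional input representation: a depthwise 1-D convolution
        over time encodes relative position implicitly, replacing the
        additive sinusoidal table."""
        def __init__(self, d_model: int, kernel_size: int = 15):
            super().__init__()
            self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                                  padding=kernel_size // 2, groups=d_model)
            self.act = nn.ReLU()

        def forward(self, x):  # x: (batch, time, d_model)
            y = self.conv(x.transpose(1, 2)).transpose(1, 2)
            return x + self.act(y)  # residual keeps the original features

    # Usage: acoustic frames projected to d_model pass through either
    # front-end before the Transformer encoder blocks.
    x = torch.randn(2, 100, 256)  # (batch, frames, d_model)
    print(SinusoidalPositionEncoding(256)(x).shape)  # torch.Size([2, 100, 256])
    print(ConvPositionEncoding(256)(x).shape)        # torch.Size([2, 100, 256])

Because the convolutional front-end sees a local window of frames rather than an absolute index, it couples position information with the acoustic context itself, which is the property the abstract credits for the improved recognition results.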
Received: 2020-05-23; Accepted: 2020-07-15
Author biography: XU Dongdong (1994– ), male, born in Chuzhou, Anhui Province; master's degree candidate. Research interests: speech recognition, machine learning.
† Corresponding author. E-mail: 329696974@qq.com