Editorial Office of the Journal of Applied Acoustics (《应用声学》)

Article Abstract

邦锦阳, 张玥, 张雄伟, 孙蒙, 刘伟, 栾合禹. Att-U-Net: Bone Conducted Speech Enhancement based on U-Net with Attention Mechanism[J]. Journal of Applied Acoustics, 2023, 42(4): 814-824

Att-U-Net: Bone Conducted Speech Enhancement based on U-Net with Attention Mechanism*

Submitted: 2022-04-28  Revised: 2022-11-18

Abstract:

In recent years, many encoder-decoder architectures such as fully convolutional networks and U-Net have been applied to speech enhancement, offering low computational complexity and few model parameters. Compared with methods such as long short-term memory (LSTM) models, however, these encoder-decoder structures still cannot fully exploit the correlations across time and between high and low frequencies, and they lose information on long input sequences in particular. To model time-frequency correlations more fully while preserving computational efficiency, this paper proposes Att-U-Net, a bone-conducted speech enhancement method that integrates an attention mechanism into U-Net. An attention mechanism is introduced into the skip connections to generate a weight matrix, and the global information in each encoding layer is fused into the corresponding decoding layer according to these weights. This lets the network focus on the information in the input that is highly relevant to the enhancement target while suppressing irrelevant information. Experiments on a bone-conducted speech dataset show that Att-U-Net effectively improves enhancement quality while keeping the model lightweight, and that the enhanced speech outperforms the baseline model on all objective evaluation metrics. Visualization of the intermediate layers of the encoder-decoder network shows that, during decoding, the attention mechanism effectively retains the information in voiced segments and filters out the mid-frequency resonance caused by the transmission characteristics of bone conduction, which gives the enhanced bone-conducted speech better perceptual quality.
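The abstract describes the attention mechanism only at a high level: a weight matrix is generated in the skip connection and used to decide how much encoder information reaches the decoder. The sketch below illustrates one common way to do this, an additive attention gate over time-frequency feature maps. The gate form, the function name `attention_gate`, the parameter names, and all shapes are assumptions made for illustration; the abstract does not specify the authors' actual layer design.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def attention_gate(enc, dec, w_enc, w_dec, w_psi):
    """Hypothetical additive attention gate for one skip connection.

    enc, dec : feature maps of shape (channels, freq, time)
    w_enc, w_dec : 1x1-convolution weights of shape (inter_channels, channels)
    w_psi : 1x1-convolution weights of shape (inter_channels,)
    """
    # A 1x1 convolution is a linear map over channels at every (freq, time) bin.
    joint = (np.einsum("ic,cft->ift", w_enc, enc)
             + np.einsum("ic,cft->ift", w_dec, dec))
    joint = np.maximum(joint, 0.0)  # ReLU
    # Collapse to one weight per time-frequency bin, squashed into (0, 1).
    weights = sigmoid(np.einsum("i,ift->ft", w_psi, joint))
    # Scale the encoder features by the weight matrix before passing them on.
    gated = enc * weights[None, :, :]
    return gated, weights


# Toy shapes, chosen arbitrarily for the demonstration.
rng = np.random.default_rng(0)
C, F, T, C_INT = 4, 8, 10, 2
enc = rng.standard_normal((C, F, T))
dec = rng.standard_normal((C, F, T))
gated, weights = attention_gate(
    enc, dec,
    rng.standard_normal((C_INT, C)),
    rng.standard_normal((C_INT, C)),
    rng.standard_normal(C_INT),
)
# The decoder would then consume the gated skip features alongside its own.
skip = np.concatenate([gated, dec], axis=0)
print(weights.shape, skip.shape)
```

Because every weight lies strictly between 0 and 1, the gate can only attenuate encoder features, which matches the abstract's description of keeping relevant information and suppressing the rest. In a trained network the three weight arrays would be learned rather than drawn at random.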

DOI: 10.11684/j.issn.1000-310X.2023.04.017

Keywords: bone-conducted speech enhancement; deep learning; attention mechanism; U-Net

Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)

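Author	Affiliation	E-mail
邦锦阳	Unit of the Chinese People's Liberation Army	bangjinyang@163.com
张玥^*	College of Command and Control Engineering, Army Engineering University	zy1084476070@163.com
张雄伟	College of Command and Control Engineering, Army Engineering University	xwzhang9898@163.com
孙蒙	College of Command and Control Engineering, Army Engineering University	sunmeng@aeu.edu.cn
刘伟	College of Command and Control Engineering, Army Engineering University	weiliu_1997it@163.com
栾合禹	Unit of the Chinese People's Liberation Army	luahy96@163.com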
