Page 156 - Journal of Applied Acoustics, 2022, No. 4

Vol. 41, No. 4                    Journal of Applied Acoustics                    July, 2022

             ⋄ Research Report ⋄



Deep learning-based single-channel speech enhancement with human auditory-related cost functions ∗

CHENG Linjuan 1,2  PENG Renhua 1,2†  ZHENG Chengshi 1,2  LI Xiaodong 1,2

(1 Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China)
(2 University of Chinese Academy of Sciences, Beijing 100049, China)

Abstract: The mean-square error (MSE) is one of the most commonly used cost functions in deep learning-based single-channel speech enhancement. However, the MSE value is not fully correlated with speech quality. To improve enhancement performance, this paper introduces two classes of cost functions related to human auditory perception into deep neural network training. The first class is a weighted Euclidean cost function, which takes the auditory masking effect into account. The second class comprises the Itakura-Saito, COSH, and weighted likelihood ratio cost functions, which place more emphasis on spectral peaks than on spectral valleys and so focus on recovering the spectral peaks of clean speech. Using a long short-term memory (LSTM) network, the two classes of cost functions are analysed and compared with each other and with the MSE cost function. Experimental results indicate that deep neural network-based single-channel speech enhancement with the weighted Euclidean cost function achieves better speech quality and lower residual noise.
Keywords: Speech enhancement; Deep learning; Human auditory perception
CLC number: TP912.35    Document code: A    Article ID: 1000-310X(2022)04-0654-13
DOI: 10.11684/j.issn.1000-310X.2022.04.018
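
The cost functions named in the abstract can be illustrated with a minimal sketch. The exact formulations used in the paper are not given on this page, so the code below assumes the classical per-bin definitions of these distortion measures applied to magnitude spectra. The function names, the exponent `p = -1`, and the floor `EPS` are assumptions made for illustration only. Plain Python lists are used so the sketch runs without a deep learning framework; a training loss would apply the same formulas to tensors.

```python
import math

EPS = 1e-8  # assumed floor that keeps ratios and logarithms finite


def mse(clean, est):
    """Plain mean-square error between two magnitude spectra."""
    return sum((x - y) ** 2 for x, y in zip(clean, est)) / len(clean)


def weighted_euclidean(clean, est, p=-1.0):
    """Squared error weighted by clean**p (p is an assumed value).

    With p < 0, errors in low-energy bins (spectral valleys) count more
    than errors near strong peaks, where masking makes them less audible.
    """
    total = 0.0
    for x, y in zip(clean, est):
        total += (max(x, EPS) ** p) * (x - y) ** 2
    return total / len(clean)


def itakura_saito(clean, est):
    """Itakura-Saito measure: r - log(r) - 1 with r = clean / estimate."""
    total = 0.0
    for x, y in zip(clean, est):
        r = max(x, EPS) / max(y, EPS)
        total += r - math.log(r) - 1.0
    return total / len(clean)


def cosh_measure(clean, est):
    """COSH measure: a symmetrised form of Itakura-Saito."""
    total = 0.0
    for x, y in zip(clean, est):
        r = max(x, EPS) / max(y, EPS)
        total += 0.5 * (r + 1.0 / r) - 1.0
    return total / len(clean)


def weighted_likelihood_ratio(clean, est):
    """Weighted likelihood ratio: (log x - log y) * (x - y) per bin."""
    total = 0.0
    for x, y in zip(clean, est):
        xs, ys = max(x, EPS), max(y, EPS)
        total += (math.log(xs) - math.log(ys)) * (xs - ys)
    return total / len(clean)


if __name__ == "__main__":
    clean = [1.0, 4.0, 0.25]   # hypothetical clean magnitude spectrum
    est = [0.8, 3.0, 0.40]     # hypothetical enhanced spectrum
    for fn in (mse, weighted_euclidean, itakura_saito,
               cosh_measure, weighted_likelihood_ratio):
        print(fn.__name__, fn(clean, est))
```

Under these definitions the Itakura-Saito measure is asymmetric: underestimating a bin (estimate below the clean value) costs more than overestimating it by the same ratio, which is one reason it is described as favouring spectral peaks. The COSH measure treats both directions equally.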


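             Received 2021-05-26; accepted 2022-03-07
             ∗ Supported by the National Natural Science Foundation of China (61801468)
             About the author: CHENG Linjuan (1995– ), female, from Nanyang, Henan; PhD candidate; research interest: acoustics.
             † Corresponding author. E-mail: pengrenhua@mail.ioa.ac.cn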