Speaker recognition model based on a multi-granularity spatio-temporal attention mechanism

Abstract: Deep learning is widely used in speaker recognition, but current models suffer from low recognition rates and high parameter complexity, which makes lightweight speech recognition difficult. To address this, the paper proposes a speaker recognition model based on a multi-granularity spatio-temporal attention mechanism, named the Multi-granularity Hybrid Compression Network (MGHC-NET). The model consists of a multi-granularity mixing module (MGMM), a spatio-temporal attention mechanism module, and a channel compression module. The MGMM and the spatio-temporal attention mechanism module capture local temporal context features and spatial correlation features from a multi-scale modeling perspective, and couple the correlated features of different spatio-temporal information in a multi-granularity manner to strengthen global spatio-temporal modeling. The channel compression module aggregates the different speaker channels and context-dependent representations to reduce the total number of model parameters. Five-fold cross-validation experiments on multiple public datasets show that, compared with mainstream models, the proposed method improves speaker recognition accuracy, reduces the parameter count, and achieves the best performance among them, which makes it valuable for lightweight speaker recognition.
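The abstract reports five-fold cross-validation but does not say how the folds were built. As a minimal sketch of the general procedure only, assuming a simple shuffled split over sample indices (the function name `five_fold_splits`, the seed, and the sample count are hypothetical and not taken from the paper), the partitioning could look like this:

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: a plain shuffled split; the paper may instead have used
    stratified or speaker-aware folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread samples as evenly as possible: the first `extra` folds
    # receive one additional sample each.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical usage: 12 utterances split into 5 folds.
splits = list(five_fold_splits(12))
```

Each sample lands in exactly one test fold, so averaging a metric over the five folds uses every sample for evaluation once and for training four times.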

     
