Abstract:
Deep learning is widely applied in speaker recognition. However, current models suffer from low recognition rates and large, complex parameter sets, which makes lightweight speaker recognition difficult to achieve. To address this issue, a speaker recognition model named Multi-granularity Hybrid Compression Network (MGHC-NET) is proposed. It is based on multi-granularity spatio-temporal attention mechanisms and consists of a multi-granularity mixing module (MGMM), a spatio-temporal attention module, and a channel compression module. The MGMM and the spatio-temporal attention module capture local temporal context features and spatial correlation features from a multi-scale modeling perspective, and couple the correlation features of different spatio-temporal information in a multi-granularity manner to enhance global spatio-temporal modeling capability. Meanwhile, the channel compression module aggregates different speaker channels and context-dependent representations to reduce the overall number of model parameters. Five-fold cross-validation experiments are conducted on multiple public datasets. The results show that the proposed method improves speaker recognition accuracy while reducing the number of parameters, and achieves the best performance among the mainstream models compared. The model therefore has practical value for lightweight speaker recognition.
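
The five-fold cross-validation protocol mentioned above can be illustrated with a minimal sketch. The abstract does not state how the folds are formed (for example, whether they are stratified by speaker), so the plain shuffled split below is an assumption, and the function name `k_fold_indices` and its parameters are hypothetical rather than taken from the authors' code.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Assumption: a simple shuffled split without speaker stratification.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Example: 23 utterances split into 5 folds.
folds = list(k_fold_indices(n_samples=23, k=5))
```

Each utterance index appears in exactly one test fold, so every sample is used for evaluation once and for training in the remaining four rounds; the reported metric would then typically be averaged over the five rounds.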