Abstract:
Text-independent speaker verification places no constraint on the words spoken during verification. When combined with speech recognition, it can avoid common attacks such as recording fraud more effectively than text-dependent speaker verification. However, text-independent systems suffer severe performance degradation on short verification utterances. To address this, this paper proposes an improved end-to-end model. Speaker classification losses on both long and short utterances are used to strengthen the network's ability to identify speakers from speech segments of different durations. In addition, the embedding-space similarity between short and long utterances of the same speaker is increased and the similarity between short utterances of different speakers is reduced, which improves the network's feature extraction for short utterances. An attention mechanism-based verification word selection method is also proposed, in which the Chinese words with high attention weights are selected as the verification prompt text of the speaker verification system. Experimental results show that the improved end-to-end model combined with softmax pre-training yields a 29% relative reduction in equal error rate on short test utterances, and that the attention mechanism-based method effectively selects verification words that give better recognition results. Combining the two methods effectively improves the recognition performance of the speaker verification system on short Chinese utterances.
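
The training objective outlined above can be illustrated with a minimal sketch. It is not the formulation used in the paper: the function names, the use of cosine similarity, the hinge on the cross-speaker term, and the weights `pull_weight` and `push_weight` are all assumptions chosen for illustration. The sketch sums softmax classification losses on long and short utterances, adds a term that pulls each short-utterance embedding toward the long-utterance embedding of the same speaker, and adds a term that pushes apart short-utterance embeddings of different speakers.

```python
import math


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy; logits is a list of rows, labels a list of ints."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[label]
    return total / len(labels)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def combined_loss(long_emb, short_emb, long_logits, short_logits, labels,
                  pull_weight=1.0, push_weight=1.0):
    """Hypothetical combined objective (an assumption, not the paper's loss).

    Row i of long_emb and short_emb is assumed to come from the same speaker,
    for example a short crop taken from the long utterance.
    """
    # Speaker classification losses on both utterance lengths.
    classification = (cross_entropy(long_logits, labels)
                      + cross_entropy(short_logits, labels))

    # Pull term: small when short and long embeddings of a speaker align.
    pull = sum(1.0 - cosine(s, l)
               for s, l in zip(short_emb, long_emb)) / len(labels)

    # Push term: penalize positive similarity between short utterances
    # that belong to different speakers.
    pairs = [(i, j)
             for i in range(len(labels))
             for j in range(i + 1, len(labels))
             if labels[i] != labels[j]]
    push = 0.0
    if pairs:
        push = sum(max(0.0, cosine(short_emb[i], short_emb[j]))
                   for i, j in pairs) / len(pairs)

    return classification + pull_weight * pull + push_weight * push
```

In this sketch the pull and push terms both vanish when short embeddings match their long counterparts and are orthogonal across speakers, so only the classification terms remain. How the paper actually weights and defines these terms is not stated in the abstract.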