
Attention-mechanism-based TDNN-LSTM model and its application

Abstract: With large-scale data, speech recognition based on deep learning has become quite mature; under small-sample (low-resource) conditions, however, the limited correlation among feature information leaves the model's contextual modeling ability insufficient, leading to low recognition accuracy. To address this problem, this paper proposes a temporal acoustic model, TLSTM-Attention, in which an attention-mechanism layer is embedded in a time-delay neural network (TDNN) combined with a long short-term memory (LSTM) recurrent neural network. The model effectively fuses coarse- and fine-grained features carrying important information to improve contextual modeling. The training data are augmented with speed perturbation, speaker vocal-tract information features are incorporated together with the lattice-free maximum mutual information (LF-MMI) training criterion, and comparative experiments are conducted over different input features, model structures, and numbers of nodes. The results show that, compared with the baseline model, the proposed model reduces the word error rate by 3.37 percentage points.
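The core idea named in the abstract, a time-delay layer whose output frames are then weighted by an attention mechanism, can be sketched in plain Python. This is a minimal illustrative sketch, not the paper's implementation: the functions `tdnn_layer` and `attention_pool`, the context offsets, and the toy weights are all assumptions introduced here for illustration.

```python
import math

# Hypothetical sketch: one TDNN (time-delay) layer followed by scaled
# dot-product attention over its output frames, i.e. context frames are
# re-weighted by importance before pooling. Not the paper's actual code.

def tdnn_layer(frames, weights, context=(-1, 0, 1)):
    """Each output frame is a linear map over the input frames spliced
    at the given time offsets; indices are clamped at utterance edges."""
    out = []
    for t in range(len(frames)):
        spliced = []
        for c in context:
            idx = min(max(t + c, 0), len(frames) - 1)  # clamp at edges
            spliced.extend(frames[idx])
        out.append([sum(w * x for w, x in zip(row, spliced)) for row in weights])
    return out

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(frames, query):
    """Scaled dot-product attention: score each frame against a query
    vector, normalize with softmax, and return the weighted sum."""
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(d) for f in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(a * f[i] for a, f in zip(alphas, frames)) for i in range(dim)]
    return pooled, alphas

# Toy utterance: 4 frames of 2-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
# Weight matrix: 2 output dims x (2 feature dims * 3 context offsets).
weights = [[0.1] * 6, [0.2] * 6]
hidden = tdnn_layer(frames, weights)
pooled, alphas = attention_pool(hidden, query=[1.0, 1.0])
```

In a full model the attention weights would be learned and the pooled representation fed into LSTM layers; here the query vector is fixed only to keep the sketch self-contained.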

     

