Abstract:
To address the reliance on a single feature type and the resulting low classification accuracy in speech emotion recognition, this paper proposes an emotion recognition method that fuses features from a 3D network and a 1D network. The 3D network covers both spatial feature learning and the modeling of temporal dependencies: a bilinear convolutional neural network (BCNN) extracts spatial features, and a long short-term memory (LSTM) network with an attention mechanism extracts salient time-dependent features. To reduce the influence of speaker differences, the Log-Mel features of the speech signal are combined with their first-order and second-order differences to form a 3D Log-Mel feature set. The 1D network uses 1D convolution and an LSTM network. Finally, the 3D and 1D features are fused to obtain discriminative emotional features, and the emotions are classified with a softmax function. The method achieves average recognition rates of 61.22% on the IEMOCAP database and 85.69% on the EMO-DB database, and the multi-feature fusion algorithm outperforms the 3D and 1D algorithms that each extract a single feature type.
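
As an illustration of the 3D Log-Mel feature set described above, the following is a minimal sketch of stacking a Log-Mel spectrogram with its first-order and second-order differences. It assumes the Log-Mel spectrogram has already been computed, and it uses simple finite differences along the time axis as a stand-in for the differential features; the abstract does not specify the difference formula, the number of Mel bands, or the frame count, so the function name `stack_log_mel_3d` and the toy shapes are hypothetical.

```python
import numpy as np


def stack_log_mel_3d(log_mel: np.ndarray) -> np.ndarray:
    """Stack static, first-order and second-order difference features.

    log_mel: array of shape (n_frames, n_mels), assumed precomputed.
    Returns an array of shape (n_frames, n_mels, 3).
    """
    # Assumption: finite differences along the time axis (axis 0)
    # approximate the first-order differential features.
    delta = np.gradient(log_mel, axis=0)
    # The second-order features are the differences of the first-order ones.
    delta2 = np.gradient(delta, axis=0)
    # Channel order: static, first-order, second-order.
    return np.stack([log_mel, delta, delta2], axis=-1)


# Toy input standing in for a real Log-Mel spectrogram (hypothetical sizes).
rng = np.random.default_rng(0)
toy_log_mel = rng.standard_normal((300, 40))
features = stack_log_mel_3d(toy_log_mel)
print(features.shape)  # (300, 40, 3)
```

The three channels play a role similar to the color channels of an image, which is what allows a 2D convolutional front end such as the BCNN to consume them as a single input.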