ZHOU Xiaoyan, WANG Lili, SHAO Yongbin, et al. Speech emotion recognition based on dual channel feature fusion network[J]. Technical Acoustics, 2024, 43(6): 854-861. DOI: 10.16300/j.cnki.1000-3630.23070401

Speech emotion recognition based on dual channel feature fusion network

  • To address the difficulty of extracting discriminative emotional features in speech emotion recognition, a speech representation method based on dual-channel feature fusion is proposed by combining a convolutional neural network (CNN) with a vision transformer (ViT) network structure. A convolutional channel built on an inverted bottleneck structure, trained with a transformer-like strategy, extracts local spectral features. Global sequence features are extracted by an improved vision transformer, in which a CNN processes the whole speech spectrogram directly, rather than splitting it into patches, for better extraction of temporal information. The features from the two channels are fused to obtain strongly discriminative emotion features, which are finally fed into a Softmax classifier to produce the recognition result. Experiments on the EMO-DB and CASIA databases show that the model proposed in this paper achieves average accuracies of 94.24% and 93.05%, respectively. These results surpass those of other models, indicating the effectiveness of the method.
