ZHOU Xiaoyan, WANG Lili, SHAO Yongbin, et al. Speech emotion recognition based on dual channel feature fusion network[J]. Technical Acoustics, 2024, 43(6): 854-861. DOI: 10.16300/j.cnki.1000-3630.23070401

Speech emotion recognition based on dual channel feature fusion network

  • To address the difficulty of extracting discriminative emotional features in speech emotion recognition, a speech representation method based on dual-channel feature fusion is proposed by combining a convolutional neural network (CNN) with a vision transformer (ViT) network structure. A convolutional channel built on an inverted bottleneck structure, trained with a transformer-like strategy, extracts local spectral features. Global sequence features are extracted by an improved vision transformer, in which a CNN processes the whole speech spectrogram directly, rather than splitting it into patches, for better extraction of temporal information. The features from the two channels are fused to obtain strongly discriminative emotion features, which are finally fed into a Softmax classifier to produce the recognition result. Experiments on the EMO-DB and CASIA databases show that the model proposed in this paper achieves average accuracies of 94.24% and 93.05%, respectively. These results surpass those of other models, indicating the effectiveness of the method.
