Abstract:
Speech emotion recognition has long been hindered by the limited emotional information carried by single-modal speech and by the typically small size of emotion corpora. To address these issues, this article proposes a method that combines multimodal emotion recognition with transfer learning. By exploiting multimodal data comprising speech and electroglottography (EGG) signals, the proposed method captures the emotional information in speech more completely, and it mitigates the low-resource limitation of a single corpus by transferring information between multiple emotion corpora, making deep learning algorithms applicable. A Transformer model serves as the base network for transfer learning to improve speech emotion recognition performance. This study focuses on two modalities: speech and EGG signals. The CDESD corpus, which provides detailed annotation information, serves as the source domain for training the Transformer model and learning a comprehensive representation of emotional expression. The trained model is then transferred to the low-resource target-domain dataset STEM-E2VA. The proposed method is validated experimentally on a four-class speech emotion recognition task. The results show a significant improvement in cross-corpus emotion recognition, with an average recognition rate of 89.17%, compared with the no-transfer baseline.
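
To make the workflow concrete, the following is a minimal sketch of the pretrain-then-fine-tune transfer scheme described in the abstract, not the authors' exact implementation: a Transformer encoder classifier is trained on source-domain (CDESD) features and then fine-tuned on the low-resource target domain (STEM-E2VA). The model name, feature dimensions, fusion of speech and EGG features by simple frame-level concatenation, and all hyperparameters are assumptions for illustration.

```python
# Hedged sketch of the transfer-learning pipeline; names and settings are assumptions.
import torch
import torch.nn as nn

class EmotionTransformer(nn.Module):
    def __init__(self, feat_dim=120, d_model=128, n_heads=4, n_layers=4, n_classes=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)           # fused speech+EGG frame features -> model dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)     # four emotion classes

    def forward(self, x):                                   # x: (batch, frames, feat_dim)
        h = self.encoder(self.proj(x))
        return self.classifier(h.mean(dim=1))               # temporal average pooling

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for feats, labels in loader:                        # (batch, frames, feat_dim), (batch,)
            opt.zero_grad()
            loss = loss_fn(model(feats), labels)
            loss.backward()
            opt.step()

# Stage 1: pre-train on the source domain (CDESD); source_loader is a hypothetical
# DataLoader yielding (fused_features, emotion_label) batches.
#   model = EmotionTransformer()
#   train(model, source_loader, epochs=50, lr=1e-4)
# Stage 2: transfer to the low-resource target domain (STEM-E2VA) by fine-tuning
# the pre-trained weights with a smaller learning rate.
#   train(model, target_loader, epochs=10, lr=1e-5)
```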