
Research on Low-Resource Multimodal Speech Emotion Recognition Combining Transfer Learning and Transformer Models

  • Abstract: Speech emotion recognition has long faced two problems: single-modal speech conveys emotional information insufficiently, and emotional speech corpora are typically small. To address both problems, this paper proposes a method that combines multimodal emotion recognition with transfer learning. Multimodal speech-production data, namely speech and electroglottography (EGG) signals, are used to capture the emotional information in speech more fully, and information transfer across emotion corpora expands a single low-resource corpus into a setting that can support deep learning algorithms, with a Transformer model serving as the base network for transfer learning to improve speech emotion recognition performance. The CDESD corpus (Chinese Dual-mode Emotional Speech Database), which provides detailed annotations, serves as the source domain: a Transformer model is trained on it to acquire adequate emotional representation ability and is then transferred to the low-resource target-domain dataset STEM-E2VA (Suzhou & Taiyuan Emotional dataset on Mandarin: Electromagnetic Articulography (EMA), Electroglottography (EGG), Video, Audio). The proposed method is evaluated on a four-class speech emotion recognition task and achieves an average accuracy of 89.17% in cross-corpus emotion recognition, a significant improvement over the no-transfer case.
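As a concrete illustration of the pretrain-then-transfer pipeline described in the abstract, the following minimal PyTorch sketch shows a Transformer encoder over early-fused speech and EGG frame features, pretrained on a source corpus and then fine-tuned on a small target corpus. All class names, feature dimensions, and the early-fusion strategy here are illustrative assumptions, not the authors' published implementation.

```python
# Minimal sketch of the pretrain-then-transfer setup described in the
# abstract. Names, dimensions, and the early-fusion strategy are
# illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class MultimodalTransformerSER(nn.Module):
    """Transformer encoder over early-fused speech + EGG frame features."""

    def __init__(self, speech_dim=40, egg_dim=40, d_model=128,
                 n_heads=4, n_layers=4, n_classes=4):
        super().__init__()
        # Early fusion: project concatenated per-frame features to d_model.
        self.proj = nn.Linear(speech_dim + egg_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, speech, egg):
        # speech: (batch, frames, speech_dim); egg: (batch, frames, egg_dim)
        x = self.proj(torch.cat([speech, egg], dim=-1))
        x = self.encoder(x)
        # Mean-pool over time, then classify into the four emotion classes.
        return self.classifier(x.mean(dim=1))


model = MultimodalTransformerSER()

# Example forward pass with dummy batches of 50-frame features.
speech = torch.randn(8, 50, 40)
egg = torch.randn(8, 50, 40)
logits = model(speech, egg)  # shape: (8, 4) emotion logits

# Pretrain on the source corpus (CDESD), then transfer to the target
# corpus (STEM-E2VA): reuse the pretrained weights and fine-tune on the
# low-resource data, typically at a reduced learning rate.
# torch.save(model.state_dict(), "cdesd_pretrained.pt")
# model.load_state_dict(torch.load("cdesd_pretrained.pt"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```

When the target corpus is very small, a common variant of this setup is to freeze the lower encoder layers and fine-tune only the upper layers and the classifier head.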

     

