Yang Liu, Haoqin Sun, Wenbo Guan, Yuqi Xia, Zhen Zhao. Speech Emotion Recognition Using Cascaded Attention Network with Joint Loss for Discrimination of Confusions[J]. Machine Intelligence Research, 2023, 20(4): 595-604. DOI: 10.1007/s11633-022-1356-x

Speech Emotion Recognition Using Cascaded Attention Network with Joint Loss for Discrimination of Confusions

Due to the complexity of emotional expression, recognizing emotions from speech is a critical and challenging task. In most studies, certain emotions are easily misclassified. In this paper, we propose a new framework that integrates a cascaded attention mechanism and a joint loss for speech emotion recognition (SER), aiming to resolve the feature confusion among emotions that are difficult to classify correctly. First, we extract mel frequency cepstral coefficients (MFCCs) together with their deltas and delta-deltas to form 3-dimensional (3D) features, effectively reducing the interference of external factors. Second, we employ spatiotemporal attention to selectively locate target emotion regions in the input features, where self-attention with head fusion captures the long-range dependencies of temporal features. Finally, a joint loss function is employed to distinguish emotional embeddings with high similarity and thereby enhance overall performance. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) database show that the method achieves improvements of 2.49% in weighted accuracy (WA) and 1.13% in unweighted accuracy (UA) over state-of-the-art strategies.
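The 3D feature construction described above stacks the static MFCCs with their first- and second-order temporal derivatives. A minimal NumPy sketch of this step is shown below; it uses the standard HTK-style regression formula for deltas, and the `mfcc` matrix, `deltas` function, and window `width` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def deltas(feat: np.ndarray, width: int = 2) -> np.ndarray:
    """HTK-style regression deltas along the time axis.

    feat: (n_frames, n_coeffs) feature matrix, e.g. MFCCs.
    """
    n = np.arange(1, width + 1)
    denom = 2.0 * np.sum(n ** 2)
    # Repeat the edge frames so every frame has a full regression window.
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    out = np.zeros_like(feat, dtype=float)
    for t in range(feat.shape[0]):
        acc = np.zeros(feat.shape[1])
        for k in n:
            acc += k * (padded[t + width + k] - padded[t + width - k])
        out[t] = acc / denom
    return out

# Hypothetical MFCC matrix: 100 frames x 13 coefficients.
mfcc = np.random.randn(100, 13)
d1 = deltas(mfcc)        # deltas
d2 = deltas(d1)          # delta-deltas
# Stack static, delta, and delta-delta planes into a 3D input.
features_3d = np.stack([mfcc, d1, d2], axis=0)  # shape (3, 100, 13)
```

The resulting `(3, n_frames, n_coeffs)` tensor can then be fed to a convolutional front end as three input channels, analogous to the channels of an RGB image.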
