Citation: Yang Liu, Haoqin Sun, Wenbo Guan, Yuqi Xia, Zhen Zhao. Speech Emotion Recognition Using Cascaded Attention Network with Joint Loss for Discrimination of Confusions. Machine Intelligence Research, vol. 20, no. 4, pp. 595–604, 2023. https://doi.org/10.1007/s11633-022-1356-x

Speech Emotion Recognition Using Cascaded Attention Network with Joint Loss for Discrimination of Confusions

doi: 10.1007/s11633-022-1356-x
More Information
  • Author Bio:

    Yang Liu received the B. Eng. and M. Eng. degrees in computer science and technology from Tianjin University, China in 2010 and 2012, respectively, and the Ph. D. degree in information science from Japan Advanced Institute of Science and Technology, Japan in 2016. Currently, he is a lecturer with the Department of Information Science and Technology, Qingdao University of Science and Technology, China. His research interests include speech signal processing, life prediction of mechanical equipment and robotic theory. E-mail: yangliu_qust@foxmail.com ORCID iD: 0000-0002-9976-8671

    Haoqin Sun received the B. Eng. degree in international digital media from Qingdao University, China in 2020. Currently, he is a master student in software engineering at the Department of Software Engineering, Qingdao University of Science and Technology, China. His research interest is speech emotion recognition. E-mail: 12shq12@163.com ORCID iD: 0000-0002-8554-8969

    Wenbo Guan received the B. Eng. degree in computer science and technology from Jiangsu University of Science and Technology, China in 2019. Currently, he is a master student in electronic information at the Department of Electronic Information, Qingdao University of Science and Technology, China. His research interest is speech separation. E-mail: g1912913565@163.com

    Yuqi Xia received the B. Eng. degree in computer science and technology from Shenyang Normal University, China in 2018. Currently, he is a master student in electronic information at the Department of Electronic Information, Qingdao University of Science and Technology, China. His research interest is speech emotion recognition. E-mail: 2954200746@qq.com

    Zhen Zhao received the Ph. D. degree in systems engineering from Tongji University, China in 2011. Currently, he is an associate professor with the Department of Information Science and Technology, Qingdao University of Science and Technology, China. His research interests include speech emotion recognition, artificial intelligence and edge computing. E-mail: zzqust@126.com (Corresponding author) ORCID iD: 0000-0002-7898-8974

  • Received Date: 2022-04-26
  • Accepted Date: 2022-07-08
  • Publish Online: 2023-06-01
  • Publish Date: 2023-08-01
  • Due to the complexity of emotional expression, recognizing emotions from speech is a critical and challenging task. In many studies, certain emotions are easily misclassified. In this paper, we propose a new framework that integrates a cascaded attention mechanism and a joint loss for speech emotion recognition (SER), aiming to resolve feature confusion among emotions that are difficult to classify correctly. First, we extract mel-frequency cepstral coefficients (MFCCs) together with their deltas and delta-deltas to form 3-dimensional (3D) features, thus effectively reducing the interference of external factors (a minimal feature-extraction sketch is given at the end of this page). Second, we employ spatiotemporal attention to selectively discover target emotion regions in the input features, where self-attention with head fusion captures the long-range dependencies of temporal features. Finally, a joint loss function is employed to distinguish emotional embeddings with high similarity, thereby enhancing overall performance. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) database indicate that the method achieves improvements of 2.49% and 1.13% in weighted accuracy (WA) and unweighted accuracy (UA), respectively, compared with state-of-the-art strategies.

     

  • * These authors contributed equally to this work.
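The feature front end summarized in the abstract (static MFCCs stacked with their first- and second-order deltas as a 3-channel input) can be illustrated with a minimal sketch. The snippet below is an illustrative assumption, not the authors' released code: the librosa library, 16 kHz sampling rate, 40 coefficients, and the file name are example choices, not settings reported in the paper.

    # Illustrative sketch of building the 3D (MFCC + delta + delta-delta) input.
    # Library, sampling rate and coefficient count are assumptions, not paper settings.
    import librosa
    import numpy as np

    def extract_3d_mfcc(wav_path, sr=16000, n_mfcc=40):
        y, _ = librosa.load(wav_path, sr=sr)                    # load and resample the waveform
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # static MFCCs, shape (n_mfcc, frames)
        delta = librosa.feature.delta(mfcc, order=1)            # first-order dynamics
        delta2 = librosa.feature.delta(mfcc, order=2)           # second-order dynamics
        # Stack into a 3-channel "image" of shape (3, n_mfcc, frames),
        # which a convolutional/attention front end can consume.
        return np.stack([mfcc, delta, delta2], axis=0)

    features = extract_3d_mfcc("utterance.wav")                 # hypothetical input file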
  • [1]
    J. H. Tao, J. Huang, Y. Li, Z. Lian, M. Y. Niu. Correction to: Semi-supervised ladder networks for speech emotion recognition. International Journal of Automation and Computing, vol. 18, no. 4, Article number 680, 2021. DOI: 10.1007/s11633-019-1215-6.
    [2]
    E. M. Schmidt, Y. E. Kim. Learning emotion-based acoustic features with deep belief networks. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, USA, pp. 65–68, 2011. DOI: 10.1109/ASPAA.2011.6082328.
    [3]
    K. Han, D. Yu, I. Tashev. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the 15th Annual Conference of the International Speech Communication Association, Singapore, pp. 223–227, 2014.
    [4]
    Q. Mao, M. Dong, Z. W. Huang, Y. Z. Zhan. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2203–2213, 2014. DOI: 10.1109/TMM.2014.2360798.
    [5]
    M. Y. Chen, X. J. He, J. Yang, H. Zhang. 3D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1440–1444, 2018. DOI: 10.1109/LSP.2018.2860246.
    [6]
    Y. Liu, H. Q. Sun, W. B. Guan, Y. Q. Xia, Z. Zhao. Discriminative feature representation based on cascaded attention network with adversarial joint loss for speech emotion recognition. In Proceedings of Interspeech, pp. 4750–4754, 2022.
    [7]
    M. Seyedmahdad, E. Barsoum, C. Zhang. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, Los Angeles, USA, pp. 2227–2231, 2017.
    [8]
    Q. P. Chen, G. M. Huang. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition. Engineering Applications of Artificial Intelligence, vol. 102, Article number 104277, 2021.
    [9]
    Y. Liu, H. Q. Sun, W. B. Guan, Y. Q. Xia, Z. Zhao. Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework. Speech Communication, vol. 139, pp. 1–9, 2022. DOI: 10.1016/j.specom.2022.02.006.
    [10]
    M. K. Xu, F. Zhang, S. U. Khan. Improve accuracy of speech emotion recognition with attention head fusion. In Proceedings of the 10th Annual Computing and Communication Workshop and Conference, IEEE, Las Vegas, USA, pp. 1058–1064, 2020. DOI: 10.1109/CCWC47524.2020.9031207.
    [11]
    C. Busso, M. Bulut, C. C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359. DOI: 10.1007/s10579-008-9076-6.
    [12]
    S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, C. Y. Espy-Wilson. Adversarial auto-encoders for speech based emotion recognition. In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 1243–1247, 2017.
    [13]
    D. Y. Dai, Z. Y. Wu, R. N. Li, X. X. Wu, J. Jia, H. Meng. Learning discriminative features from spectrograms using center loss for speech emotion recognition. In Proceedings of ICASSP/IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Brighton, UK, pp. 7405–7409, 2019. DOI: 10.1109/ICASSP.2019.8683765.
    [14]
    Y. Gao, J. X. Liu, L. B. Wang, J. W. Dang. Metric learning based feature representation with gated fusion model for speech emotion recognition. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, pp. 4503–4507, 2021.
    [15]
    L. Tarantino, P. N. Garner, A. Lazaridis. Self-attention for speech emotion recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, pp. 2578–2582, 2019.
    [16]
    J. W. Liu, H. X. Wang. A speech emotion recognition framework for better discrimination of confusions. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, pp. 4483–4487, 2021.
    [17]
    A. Satt, S. Rozenberg, R. Hoory. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 1089–1093, 2017.
    [18]
    P. C. Li, Y. Song, I. V. McLoughlin, W. Guo, L. R. Dai. An attention pooling based representation learning method for speech emotion recognition. In Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, pp. 3087–3091, 2018.