A hybrid model based on CNN-LSTM for speech emotion recognition

Document Type : Original Article

Authors

1 Master's degree, Imam Hussein (AS) University, Tehran, Iran

2 Researcher, Imam Hossein Comprehensive University, Tehran, Iran

Abstract

Data published in cyberspace, including text, images, video, and speech, have become a reliable source for gauging the thoughts, opinions, and emotions of audiences toward objects such as governments, policies, personalities, and products. To counter the cognitive threats of cyberspace, it is very important to understand the cognitive structure of both friendly and enemy audiences. This research presents a computational model for speech emotion recognition based on the combination of two classifiers, CNN and LSTM. The article first introduces speech emotion recognition and its applications, then reviews papers published in authoritative journals and evaluates their accuracy. It then presents a practical method for recognizing eight basic emotions of the audience: happiness, sadness, fear, calm, anger, disgust, surprise, and neutral. To obtain a larger amount of data, the RAVDESS and TESS datasets were combined into a single dataset. In the feature extraction phase, three features (MFCC, MEL, and ZCR) were extracted and combined, and these were then used to train and test a model built from a combination of CNN and LSTM classifiers. In the evaluation, the model reached an accuracy of 92.57% on the test data, which is higher than that of the existing models.
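The abstract does not give implementation details for the feature extraction phase, so the following is only a minimal sketch of the general idea of extracting several features and combining them into one vector. It assumes that ZCR means the frame-wise zero-crossing rate, and it uses standard-library Python only. The function names, the frame and hop sizes, the toy sine-wave input, and the zero-filled placeholders standing in for MFCC and MEL values are all hypothetical and are not taken from the article.

```python
import math


def frame_signal(signal, frame_length, hop_length):
    """Split a list of samples into overlapping frames (no padding)."""
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frames.append(signal[start:start + frame_length])
    return frames


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def combine_features(*feature_vectors):
    """Concatenate several feature vectors into a single flat vector."""
    combined = []
    for vec in feature_vectors:
        combined.extend(vec)
    return combined


# Toy input (assumption): a 100 Hz sine sampled at 8 kHz for 0.1 s.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(800)]

# Frame-wise ZCR with hypothetical frame and hop sizes.
frames = frame_signal(signal, frame_length=400, hop_length=200)
zcr = [zero_crossing_rate(f) for f in frames]

# Placeholders (assumption): in practice these would come from an audio
# library; their sizes here (13 and 40) are illustrative only.
mfcc_placeholder = [0.0] * 13
mel_placeholder = [0.0] * 40

# One combined feature vector per utterance, ready for a classifier.
features = combine_features(zcr, mfcc_placeholder, mel_placeholder)
```

In a real pipeline the placeholders would be replaced by MFCC and MEL values computed by an audio-processing library, and the combined vector would be passed to the CNN-LSTM classifier for training and testing.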



Volume 12, Issue 4 - Serial Number 48
Winter
February 2025
Pages 21-25
  • Receive Date: 12 June 2024
  • Revise Date: 27 December 2024
  • Accept Date: 12 January 2025
  • Publish Date: 01 February 2025