A hybrid CNN-LSTM-based model for emotion recognition from the speech signal

Article type: Research paper

Authors

1 Master's degree, Imam Hossein Comprehensive University, Tehran, Iran

2 Researcher, Imam Hossein Comprehensive University, Tehran, Iran

Abstract

Data published in cyberspace, including text, images, video, and audio, have become a reliable source for gauging the thoughts, opinions, and emotions of audiences toward various objects such as governments, policies, public figures, and products. To counter the cognitive threats of cyberspace, identifying the cognitive makeup of insider and outsider audiences is of great importance. The present study presents a computational model for recognizing the audience's emotion from speech, based on the combination of two classifiers, CNN and LSTM. The article first introduces speech emotion recognition and its applications; the schemes published in reputable journals are then reviewed and their accuracy evaluated. A practical method is then presented for recognizing eight basic emotions of the audience: happiness, sadness, fear, calm, anger, disgust, surprise, and neutral. To obtain a large amount of data, the RAVDESS and TESS datasets were merged into a single overall dataset. In the feature-extraction stage, three features (MFCC, Mel spectrogram, and ZCR) were extracted and combined, and the designed model then used a combination of CNN and LSTM classifiers for training and testing. In the evaluations performed, the model's accuracy on the test data is 92.57%, which is higher than that of existing models.
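The feature-extraction step described above can be illustrated with a short Python sketch. This is a minimal sketch, assuming the librosa library; the parameter values (40 MFCCs, 128 Mel bands, default frame and hop sizes) and the time-averaging strategy are illustrative assumptions, since the abstract does not state the paper's exact settings.

```python
# Minimal sketch of extracting and combining MFCC, Mel-spectrogram and
# ZCR features, assuming the librosa library. Parameter values are
# illustrative assumptions, not the paper's reported configuration.
import numpy as np
import librosa

def extract_features(path: str, sr: int = 22050) -> np.ndarray:
    """Return a single combined feature vector for one utterance."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)            # (40, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # (128, T)
    zcr = librosa.feature.zero_crossing_rate(y)                   # (1, T)
    # Average each feature over time and stack the results; this is one
    # common way of combining frame-level features into one vector.
    return np.concatenate([
        np.mean(mfcc, axis=1),
        np.mean(librosa.power_to_db(mel), axis=1),
        np.mean(zcr, axis=1),
    ])  # shape: (40 + 128 + 1,) = (169,)
```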

Article title [English]

A hybrid model based on CNN-LSTM for speech emotion recognition

Authors [English]

  • Reza Ahmadian 1
  • Hossein Rayat Parvar 1
  • Abolfazl Sarkardehee 2
1 Master's degree, Imam Hossein Comprehensive University, Tehran, Iran
2 Researcher, Imam Hossein Comprehensive University, Tehran, Iran
Abstract [English]

The data published in cyberspace, including text, images, video, and speech, have become a reliable source for measuring the thoughts, opinions, and emotions of audiences toward various objects such as governments, policies, personalities, and products. To counter the cognitive threats of cyberspace, recognizing the cognitive makeup of insider and outsider audiences is highly important. The current research was conducted to present a computational model for speech emotion recognition based on the combination of two classifiers, CNN and LSTM. This article begins with an introduction to speech emotion recognition and its applications; the methods presented in authoritative journals are then reviewed and their accuracy evaluated. A practical method is then presented for recognizing eight basic emotions of the audience: happiness, sadness, fear, calm, anger, disgust, surprise, and neutral. To obtain a large amount of data, the RAVDESS and TESS datasets were combined into a single overall dataset. In the feature-extraction phase, three features (MFCC, Mel spectrogram, and ZCR) were extracted and combined; the designed model then used a combination of CNN and LSTM classifiers for training and testing. In the evaluations performed, the accuracy of the model on the test data is 92.57%, which is higher than that of existing models.
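The CNN-LSTM combination itself can be sketched as follows. This is a hedged sketch, assuming TensorFlow/Keras; the layer counts and sizes are illustrative assumptions, since the abstract states only that CNN and LSTM classifiers were combined for the eight emotion classes.

```python
# Hedged sketch of a hybrid CNN-LSTM classifier for eight emotion classes,
# assuming TensorFlow/Keras. Layer sizes are illustrative, not the paper's.
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 169  # e.g. 40 MFCC + 128 Mel + 1 ZCR, as in the sketch above
NUM_CLASSES = 8     # happiness, sadness, fear, calm, anger, disgust, surprise, neutral

model = tf.keras.Sequential([
    layers.Input(shape=(NUM_FEATURES, 1)),
    # Convolutional front end: learns local patterns in the feature vector.
    layers.Conv1D(64, 5, padding="same", activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, padding="same", activation="relu"),
    layers.MaxPooling1D(2),
    # Recurrent back end: models dependencies along the pooled sequence.
    layers.LSTM(128),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

With the feature vectors reshaped to (num_samples, 169, 1) and integer emotion labels, training would proceed with the usual model.fit call.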

Keywords [English]

  • speech emotion recognition
  • audience evaluation
  • convolutional neural network
  • recurrent neural network
  • mel frequency cepstral coefficients


Volume 12, Issue 4 - Serial Number 48
Winter (Bahman 1403)
Pages 21-25
  • Received: 23 Khordad 1403
  • Revised: 07 Dey 1403
  • Accepted: 23 Dey 1403
  • Published: 13 Bahman 1403