EVERGREEN

Joint Journal of Novel Carbon Resource Sciences and Green Asia Strategy

ISSN:2189-0420 (Print until Mar 2020)
ISSN:2432-5953 (Online)

Enhancing Car Safety with Multimodal Emotion Recognition using CNN-LSTM Networks

Gitanjalee S. Salunkhe1,*, Sarika N. Joglekar2, Jyoti A. Kengale2
1Computer Engineering Department, MKSSS’s Cummins College of Engineering for Women, Pune, India
2Computer Engineering Department, MKSSS’s Cummins College of Engineering for Women, India
*Author to whom correspondence should be addressed:
E-mail: gitanjalee.salunkhe@cumminscollege.in (GSS)
Received: December 26, 2024 | Revised: May 10, 2025 | Accepted: August 20, 2025 | Published: September 2025
Abstract
Aggressive driving behaviors caused by emotional impairments such as anger, stress, and fatigue contribute significantly to traffic accidents worldwide. Existing single-modal emotion recognition systems fail to capture the full complexity of human emotional states, particularly when different modalities convey conflicting signals, which limits their effectiveness in real-world driving scenarios. This study aims to enhance automotive safety by developing a robust real-time multimodal emotion recognition system that integrates visual and auditory cues to accurately detect driver emotional states and trigger appropriate safety interventions. We developed a hybrid CNN-LSTM model that processes facial expressions through Convolutional Neural Networks (CNNs) for spatial feature extraction and speech patterns through Long Short-Term Memory (LSTM) networks for temporal sequence analysis. The system employs decision-level fusion to integrate multimodal data from the RAVDESS dataset (7,356 files, 24 actors, balanced gender distribution, 8 emotions based on Ekman's model: anger, calm, neutral, surprise, disgust, sadness, fear, happiness). A 2-second time window with 60 frames per sequence was used for temporal modeling, and evaluation was conducted using a 70-30 train-test split and 5-fold cross-validation. The proposed model achieved 98.28% accuracy and 98.77% precision, with real-time processing at ~22.5 FPS on NVIDIA Jetson Xavier NX embedded systems, significantly outperforming traditional machine learning approaches (SVM: 37.33%) and remaining competitive with Transformer-based models. The system demonstrated robust performance under conditions including 10% facial occlusion and 20 dB background noise. The hybrid CNN-LSTM framework successfully addresses the limitations of single-modal systems by providing accurate, real-time emotion recognition suitable for integration with Advanced Driver Assistance Systems (ADAS). The system can trigger safety measures including speed limiters, contributing to enhanced road safety through proactive emotional state monitoring.
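The decision-level fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact fusion rule, so the weighted averaging of per-class probabilities below is an assumption, and the names `fuse_decisions`, `predict_emotion`, and `face_weight` are hypothetical. Each modality is assumed to have already produced a probability vector over the eight RAVDESS emotion classes.

```python
# Hypothetical sketch of decision-level (late) fusion; not the authors' code.
# Assumption: each modality outputs one probability per emotion class, and
# the two vectors are combined by a weighted average.

EMOTIONS = ["anger", "calm", "neutral", "surprise",
            "disgust", "sadness", "fear", "happiness"]


def fuse_decisions(face_probs, speech_probs, face_weight=0.5):
    """Return the renormalized weighted average of two probability vectors."""
    if len(face_probs) != len(EMOTIONS) or len(speech_probs) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion class")
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    speech_weight = 1.0 - face_weight
    fused = [face_weight * f + speech_weight * s
             for f, s in zip(face_probs, speech_probs)]
    total = sum(fused)
    if total <= 0:
        raise ValueError("fused scores must have a positive sum")
    # Renormalize so the fused scores sum to 1.
    return [p / total for p in fused]


def predict_emotion(face_probs, speech_probs, face_weight=0.5):
    """Return (label, fused probability) of the highest-scoring class."""
    fused = fuse_decisions(face_probs, speech_probs, face_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused[best]


# Conflicting modalities: the face leans "calm", the voice leans "anger".
face = [0.30, 0.40, 0.10, 0.05, 0.05, 0.04, 0.03, 0.03]
speech = [0.70, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03]
label, confidence = predict_emotion(face, speech)
print(label, round(confidence, 3))
```

With equal weights, the strong vocal evidence for anger outweighs the weaker facial evidence for calm, which is the kind of conflicting-signal case the abstract says single-modal systems handle poorly. In practice the weight would have to be tuned on validation data.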
Keywords
emotion recognition; driver safety; machine learning; CNN; LSTM