Volume 12 Issue 03 (September 2025) - Evergreen - Joint Journal of Novel Carbon Resource Sciences and Green Asia Strategy

ISSN:2189-0420 (Print until Mar 2020)
ISSN:2432-5953 (Online)

Metrics by Scopus (2024): CiteScore 4.3 (69th percentile), SJR 0.391, SNIP 1.192

Enhancing Car Safety with Multimodal Emotion Recognition using CNN-LSTM Networks

Gitanjalee S. Salunkhe1,*, Sarika N. Joglekar2, Jyoti A. Kengale2

1Computer Engineering Department, MKSSS’s Cummins College of Engineering for Women, Pune, India

2Computer Engineering Department, MKSSS’s Cummins College of Engineering for Women, India

*Author to whom correspondence should be addressed:
E-mail: gitanjalee.salunkhe@cumminscollege.in (GSS)

Evergreen, Vol. 12, Iss. 03, pp. 1545–1563, 2025
DOI: 10.5109/7388848

Received: December 26, 2024 | Revised: May 10, 2025 | Accepted: August 20, 2025 | Published: September 2025

Abstract

Aggressive driving behaviors caused by emotional impairments such as anger, stress, and fatigue contribute significantly to traffic accidents worldwide. Existing single-modal emotion recognition systems fail to capture the full complexity of human emotional states, particularly when different modalities convey conflicting signals, which limits their effectiveness in real-world driving scenarios. This study aims to enhance automotive safety by developing a robust real-time multimodal emotion recognition system that integrates visual and auditory cues to detect driver emotional states accurately and trigger appropriate safety interventions. We developed a hybrid CNN-LSTM model that processes facial expressions through Convolutional Neural Networks (CNNs) for spatial feature extraction and speech patterns through Long Short-Term Memory (LSTM) networks for temporal sequence analysis. The system employs decision-level fusion to integrate multimodal data from the RAVDESS dataset (7,356 files, 24 actors, balanced gender distribution, 8 emotions based on Ekman's model: anger, calm, neutral, surprise, disgust, sadness, fear, happiness). A 2-second time window with 60 frames per sequence was used for temporal modeling, and evaluation was conducted using a 70-30 train-test split and 5-fold cross-validation. The proposed model achieved 98.28% accuracy, 98.77% precision, and real-time processing at ~22.5 FPS on NVIDIA Jetson Xavier NX embedded systems, significantly outperforming traditional machine learning approaches (SVM: 37.33%) and performing competitively with Transformer-based models. The system remained robust under adverse conditions, including 10% facial occlusion and 20 dB background noise. The hybrid CNN-LSTM framework addresses the limitations of single-modal systems by providing accurate, real-time emotion recognition suitable for integration with Advanced Driver Assistance Systems (ADAS). The system can trigger safety measures, including speed limiters, contributing to enhanced road safety through proactive monitoring of the driver's emotional state.
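
The decision-level fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the weighted average of per-class probabilities used here is only one common choice and is an assumption, as are the function name fuse_decisions, the parameter face_weight, and the example probability values. The emotion labels are the eight listed in the abstract.

```python
# Emotion labels as listed in the abstract (8 classes).
EMOTIONS = ["anger", "calm", "neutral", "surprise",
            "disgust", "sadness", "fear", "happiness"]


def fuse_decisions(face_probs, speech_probs, face_weight=0.5):
    """Late (decision-level) fusion by weighted averaging of class probabilities.

    face_weight is a hypothetical tuning parameter, not a value from the paper.
    Returns the predicted label and the fused, renormalized probabilities.
    """
    if len(face_probs) != len(EMOTIONS) or len(speech_probs) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion class")
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be in [0, 1]")

    # Combine the two classifiers' outputs class by class.
    fused = [face_weight * f + (1.0 - face_weight) * s
             for f, s in zip(face_probs, speech_probs)]

    # Renormalize so the fused scores sum to 1.
    total = sum(fused)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    fused = [p / total for p in fused]

    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused


# Illustrative (made-up) outputs where the modalities disagree:
# the face classifier leans toward "neutral", the speech classifier toward "anger".
face = [0.30, 0.05, 0.40, 0.05, 0.05, 0.05, 0.05, 0.05]
speech = [0.70, 0.02, 0.10, 0.02, 0.04, 0.04, 0.04, 0.04]

label, fused = fuse_decisions(face, speech)
print(label)
```

With equal weights, the strong speech evidence for anger outweighs the weaker facial evidence for neutral, which is the kind of conflicting-signal case the abstract says single-modal systems handle poorly. Setting face_weight to 1.0 reduces the sketch to the face classifier alone.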

Keywords

emotion recognition; driver safety; machine learning; CNN; LSTM

Full Text