Enhancing Car Safety with Multimodal Emotion Recognition using CNN-LSTM Networks
1Computer Engineering Department, MKSSS's Cummins College of Engineering for Women, Pune, India
2Computer Engineering Department, MKSSS's Cummins College of Engineering for Women, Pune, India
*Author to whom correspondence should be addressed:
E-mail: gitanjalee.salunkhe@cumminscollege.in (GSS)
Received: December 26, 2024 | Revised: May 10, 2025 | Accepted: August 20, 2025 | Published: September 2025
Abstract
Aggressive driving behaviors caused by emotional impairments such as anger, stress, and fatigue contribute significantly to traffic accidents worldwide. Existing single-modal emotion recognition systems fail to capture the full complexity of human emotional states, particularly when different modalities convey conflicting signals, limiting their effectiveness in real-world driving scenarios.
This study aims to enhance automotive safety by developing a robust real-time multimodal emotion recognition system that integrates visual and auditory cues to accurately detect driver emotional states and trigger appropriate safety interventions.
We developed a hybrid CNN-LSTM model that processes facial expressions through Convolutional Neural Networks (CNNs) for spatial feature extraction and speech patterns through Long Short-Term Memory (LSTM) networks for temporal sequence analysis. The system employs decision-level fusion to integrate multimodal data from the RAVDESS dataset (7,356 files, 24 actors, balanced gender distribution, 8 emotions based on Ekman's model: anger, calm, neutral, surprise, disgust, sadness, fear, happiness). A 2-second time window with 60 frames per sequence was used for temporal modeling, and evaluation was conducted using a 70-30 train-test split and 5-fold cross-validation.
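The architecture described above can be sketched as two independent branches whose predictions are combined at the decision level. The following is a minimal, illustrative sketch in PyTorch; the layer sizes, input resolutions, MFCC feature dimension, and averaging fusion rule are assumptions for demonstration, not the authors' exact configuration.

```python
# Hedged sketch of a CNN-LSTM model with decision-level fusion.
# Layer widths, a 48x48 face crop, and 40 MFCC features per audio
# frame are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

NUM_EMOTIONS = 8  # RAVDESS: anger, calm, neutral, surprise, disgust, sadness, fear, happiness

class FaceCNN(nn.Module):
    """CNN branch: spatial features from a single grayscale face crop."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
        )
        self.classifier = nn.Linear(32 * 12 * 12, NUM_EMOTIONS)

    def forward(self, x):                   # x: (batch, 1, 48, 48)
        z = self.features(x).flatten(1)
        return self.classifier(z)           # unnormalized class scores

class SpeechLSTM(nn.Module):
    """LSTM branch: temporal features from a sequence of audio frames
    (e.g. 60 MFCC vectors covering the 2-second window)."""
    def __init__(self, n_mfcc=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, NUM_EMOTIONS)

    def forward(self, x):                   # x: (batch, 60, n_mfcc)
        _, (h_n, _) = self.lstm(x)
        return self.classifier(h_n[-1])     # last hidden state -> scores

def fuse_decisions(face_logits, speech_logits):
    """Decision-level fusion: average the per-branch softmax probabilities."""
    p_face = torch.softmax(face_logits, dim=1)
    p_speech = torch.softmax(speech_logits, dim=1)
    return (p_face + p_speech) / 2

# Smoke test with random inputs standing in for one 2-second window.
face = torch.randn(4, 1, 48, 48)            # 4 face crops
speech = torch.randn(4, 60, 40)             # 4 sequences of 60 MFCC frames
probs = fuse_decisions(FaceCNN()(face), SpeechLSTM()(speech))
print(probs.shape)                          # torch.Size([4, 8])
```

Decision-level (late) fusion keeps each modality's classifier independent, so the system degrades gracefully when one channel is unreliable (e.g. an occluded face or a noisy cabin); other fusion rules, such as weighted averaging or taking the more confident branch, fit the same interface.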
The proposed model achieved 98.28% accuracy and 98.77% precision, with real-time processing at approximately 22.5 FPS on the NVIDIA Jetson Xavier NX embedded platform, significantly outperforming traditional machine learning approaches (SVM: 37.33%) while remaining competitive with Transformer-based models. The system maintained robust performance under challenging conditions, including 10% facial occlusion and 20 dB background noise.
The hybrid CNN-LSTM framework successfully addresses the limitations of single-modal systems by providing accurate, real-time emotion recognition suitable for integration with Advanced Driver Assistance Systems (ADAS). The system can trigger safety measures including speed limiters, contributing to enhanced road safety through proactive emotional state monitoring.
Keywords
emotion recognition; driver safety; machine learning; CNN; LSTM