A systematic review of vision transformer and explainable AI advances in multimodal facial expression recognition

Kus, Ilya; Koçak, Cemal; Keles, Ayse

doi:10.1016/j.iswa.2025.200615

Intelligent Systems with Applications, vol. 29, 2026 (ESCI, Scopus)

Publication Type: Article / Review
Volume Number: 29
Publication Date: 2026
DOI Number: 10.1016/j.iswa.2025.200615
Journal Name: Intelligent Systems with Applications
Indexes Covering the Journal: Emerging Sources Citation Index (ESCI), Scopus
Keywords: Emotion recognition, Explainable artificial intelligence, Facial emotion recognition, Multimodal emotion recognition, Vision transformer
Gazi University Affiliated: Yes

Abstract

Facial expression is one of the most important indicators of human emotion. Facial expression recognition is the automatic detection and classification of these expressions by computer systems. Multimodal facial expression recognition aims to perform a more accurate and comprehensive emotion analysis by combining facial expressions with other modalities such as image, speech, electroencephalogram (EEG), or text. This study systematically reviews research conducted between 2021 and 2025 on Vision Transformer (ViT)-based approaches and Explainable Artificial Intelligence (XAI) techniques in multimodal facial expression recognition, as well as the datasets employed in these studies. The findings indicate that ViT-based models outperform conventional Convolutional Neural Networks (CNNs) by effectively capturing long-range dependencies between spatially distant facial regions, thereby improving emotion classification accuracy. However, significant challenges remain: data privacy risks arising from the collection of multimodal biometric information, data imbalance and inter-modality incompatibility, high computational costs that hinder real-time applications, and limited progress in model explainability. Overall, this study highlights that integrating advanced ViT architectures with robust XAI and privacy-preserving techniques can enhance the reliability, transparency, and ethical deployment of multimodal facial expression recognition systems.