A systematic review of vision transformer and explainable AI advances in multimodal facial expression recognition


Kus I., KOÇAK C., Keles A.

Intelligent Systems with Applications, vol. 29, 2026 (ESCI, Scopus)

  • Publication Type: Article / Review
  • Volume: 29
  • Publication Date: 2026
  • DOI: 10.1016/j.iswa.2025.200615
  • Journal Name: Intelligent Systems with Applications
  • Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus
  • Keywords: Emotion recognition, Explainable artificial intelligence, Facial emotion recognition, Multimodal emotion recognition, Vision transformer
  • Affiliated with Gazi University: Yes

Abstract

Facial expression is one of the most important indicators used to convey human emotions. Facial expression recognition is the automatic detection and classification of these expressions by computer systems. Multimodal facial expression recognition aims to perform a more accurate and comprehensive emotion analysis by combining facial expressions with different modalities such as image, speech, electroencephalogram (EEG), or text. This study systematically reviews research conducted between 2021 and 2025 on Vision Transformer (ViT)-based approaches and Explainable Artificial Intelligence (XAI) techniques in multimodal facial expression recognition, as well as the datasets employed in these studies. The findings indicate that ViT-based models outperform conventional Convolutional Neural Networks (CNNs) by effectively capturing long-range dependencies between spatially distant facial regions, thereby enhancing emotion classification accuracy. However, significant challenges remain, including data privacy risks arising from the collection of multimodal biometric information, data imbalance and inter-modality incompatibility, high computational costs that hinder real-time applications, and limited progress in model explainability. Overall, this study highlights that integrating advanced ViT architectures with robust XAI and privacy-preserving techniques can enhance the reliability, transparency, and ethical deployment of multimodal facial expression recognition systems.