Intelligent Systems with Applications, vol. 29, 2026 (ESCI, Scopus)
Facial expression is one of the most important indicators used to convey human emotions, and facial expression recognition is the process of automatically detecting and classifying these expressions with computer systems. Multimodal facial expression recognition aims at more accurate and comprehensive emotion analysis by combining facial expressions with other modalities such as images, speech, electroencephalogram (EEG) signals, or text. This study systematically reviews research conducted between 2021 and 2025 on Vision Transformer (ViT)-based approaches and Explainable Artificial Intelligence (XAI) techniques in multimodal facial expression recognition, as well as the datasets employed in these studies. The findings indicate that ViT-based models outperform conventional Convolutional Neural Networks (CNNs) by capturing long-range dependencies between spatially distant facial regions, thereby improving emotion classification accuracy. However, significant challenges remain, including data privacy risks arising from the collection of multimodal biometric information, data imbalance and incompatibility between modalities, high computational costs that hinder real-time applications, and limited progress in model explainability. Overall, this study highlights that integrating advanced ViT architectures with robust XAI and privacy-preserving techniques can enhance the reliability, transparency, and ethical deployment of multimodal facial expression recognition systems.
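To make the long-range-dependency claim concrete, the following is a minimal illustrative sketch, not code from any of the reviewed papers: a face image is split into non-overlapping patches, each patch is embedded as a token, and a single self-attention layer lets every patch attend to every other patch in one step. The image size, patch size, embedding dimension, and the "eye" and "mouth" patch indices are all assumptions chosen for illustration.

```python
# Hedged sketch of ViT-style self-attention over facial patches (PyTorch).
# All sizes and patch positions below are illustrative assumptions.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # one face image (batch, C, H, W)

patch, embed_dim = 16, 64
# Non-overlapping patch embedding: a conv with kernel size = stride = patch size
# turns the image into a 14x14 grid of patch tokens.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
tokens = to_patches(image).flatten(2).transpose(1, 2)   # (1, 196, 64)

attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens)  # weights: (1, 196, 196)

# weights[0, i, j] is how strongly patch i attends to patch j; i and j may lie
# far apart on the face (e.g., an eye region and a mouth corner), a dependency
# a CNN's local receptive field only captures after many stacked layers.
eye_idx, mouth_idx = 3 * 14 + 4, 11 * 14 + 7   # hypothetical patch positions
print(f"attention eye->mouth: {weights[0, eye_idx, mouth_idx]:.4f}")
```

In a full ViT the attention layer would be preceded by positional embeddings and stacked in Transformer blocks; the point here is only that distant facial regions interact directly in a single attention step, which is the property the reviewed studies credit for the accuracy gains over CNNs.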