Applied Sciences (Switzerland), cilt.16, sa.9, 2026 (SCI-Expanded, Scopus)
Short-form video content has gained importance as a popular form of digital media due to the rising popularity of social media platforms and the decreasing attention spans of consumers. However, a major obstacle to popularity detection in short-form content is the heterogeneous nature of the data, encompassing textual, visual, and metadata components. To tackle this challenge, we propose FUSEPOP, a robust multi-modal architecture. The proposed framework utilizes ResNet-50 for visual feature extraction and XLM-RoBERTa for encoding multilingual textual information. FUSEPOP employs a mutual information-based modality weighting mechanism with logarithmic smoothing and a 0.7 weight ceiling to balance contributions from each input stream. Furthermore, FUSEPOP implements a robust stacked generalization strategy trained via stratified 5-fold cross-validation. This approach utilizes a logistic regression meta-learner to dynamically synthesize predictions from random forest, XGBoost, and a neural network-based classifier. Experimental results show that this architecture significantly outperforms benchmark models, achieving an accuracy of 0.980 and an average F1-score of 0.964 on the feature configuration selected for this study, and remains competitive on a literature-aligned alternative configuration. These findings confirm that the proposed model successfully detects popularity on short-form social media content.