From Hand-Crafted Features to Large Language Models: A Comparative Evaluation of Android Malware Detection Paradigms

Taşkın, Egemen; DOĞRU, İBRAHİM

doi:10.3390/app16115600

Taşkın E., DOĞRU İ. A.

Applied Sciences (Switzerland), vol.16, no.11, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 16 Issue: 11
Publication Date: 2026
DOI Number: 10.3390/app16115600
Journal Name: Applied Sciences (Switzerland)
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Applied Science & Technology Source, Compendex, INSPEC, Directory of Open Access Journals
Keywords: Android malware detection, large language models, LLM-based feature extraction, static analysis, transformer models
Gazi University Affiliated: Yes

Abstract

The rapid evolution of Android malware and increasingly sophisticated obfuscation techniques challenge traditional detection systems. This study presents a rigorous, unified comparative evaluation of three methodological paradigms for static Android malware detection: classical machine learning, Transformer-based architectures, and generative Large Language Models (LLMs). We construct a balanced dataset of 12,000 APKs from the AndroZoo repository and implement a fold-independent experimental pipeline featuring constraint-aware sequence selection for Transformers and structured LLM-driven feature distillation with parameter-efficient fine-tuning (LoRA). All evaluations employ stratified 5-fold cross-validation with statistical significance testing and comprehensive resource profiling. Classical models (e.g., Random Forest) achieve strong baselines (~0.975 F1-score) but exhibit limited contextual resilience. Distilled Transformers (RoBERTa: ~0.970 F1-score) deliver an optimal accuracy-latency trade-off for real-time screening. While zero-shot LLMs show moderate performance (~0.74–0.84 F1-score), integrating LLM-extracted semantic features with LoRA fine-tuning yields the highest reported F1-score (Qwen3.5-27B: ~0.982), along with cross-dataset generalization and structured interpretability. Hallucination analysis reveals a manageable 7.7% rate, with ablation confirming minimal impact on downstream classification. We advocate a tiered deployment strategy: lightweight Transformers for high-throughput screening, complemented by fine-tuned LLMs for deep forensic analysis and explainable threat intelligence. This hybrid framework effectively balances computational efficiency, detection robustness, and operational interpretability for modern Android security pipelines.
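
The tiered deployment strategy recommended in the abstract can be illustrated with a minimal routing sketch. This is not the authors' implementation: the thresholds, the `Verdict` type, the `triage` function, and the two toy scorers are all hypothetical stand-ins, since the abstract specifies neither cut-offs nor an escalation rule. The sketch assumes only that a fast screening model returns a malware probability and that uncertain samples are passed on to a slower, deeper analyzer.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical thresholds; the abstract does not specify any cut-offs.
BENIGN_BELOW = 0.10
MALWARE_ABOVE = 0.90


@dataclass
class Verdict:
    label: str    # "benign" or "malware"
    tier: str     # which tier produced the decision: "screen" or "deep"
    score: float  # malware probability reported by that tier


def triage(
    features: dict,
    fast_screen: Callable[[dict], float],
    deep_analyze: Callable[[dict], float],
) -> Verdict:
    """Route one sample through a two-tier pipeline (illustrative only)."""
    score = fast_screen(features)
    # Confident screening results are returned without escalation.
    if score < BENIGN_BELOW:
        return Verdict("benign", "screen", score)
    if score > MALWARE_ABOVE:
        return Verdict("malware", "screen", score)
    # Uncertain samples are escalated to the slower, deeper tier.
    deep_score = deep_analyze(features)
    label = "malware" if deep_score >= 0.5 else "benign"
    return Verdict(label, "deep", deep_score)


# Stand-in scorers for illustration; real tiers would wrap trained models
# (for example a lightweight Transformer and a fine-tuned LLM).
def toy_screen(features: dict) -> float:
    return min(1.0, 0.3 * len(features.get("dangerous_permissions", [])))


def toy_deep(features: dict) -> float:
    return 0.8 if features.get("sends_sms_in_background") else 0.2
```

In this sketch, only samples whose screening score falls between the two thresholds incur the cost of the deep tier, which is the accuracy-latency trade-off the abstract argues for; where to place the thresholds would have to be tuned on validation data.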