AndroCom: A Real-World Android Applications’ Vulnerability Dataset to Assist with Automatically Detecting Vulnerabilities

Arikan, Kaya; Yılmaz, ERCAN

doi:10.3390/app15052665

AndroCom: A Real-World Android Applications’ Vulnerability Dataset to Assist with Automatically Detecting Vulnerabilities

Arikan K. E., Yılmaz E. N.

Applied Sciences (Switzerland), cilt.15, sa.5, 2025 (SCI-Expanded, Scopus)

Yayın Türü: Makale / Tam Makale
Cilt numarası: 15 Sayı: 5
Basım Tarihi: 2025
Doi Numarası: 10.3390/app15052665
Dergi Adı: Applied Sciences (Switzerland)
Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Aerospace Database, Agricultural & Environmental Science Database, Applied Science & Technology Source, Communication Abstracts, INSPEC, Metadex, Directory of Open Access Journals, Civil Engineering Abstracts
Anahtar Kelimeler: android application security, artificial intelligence, code vulnerability, labelled dataset, machine learning
Gazi Üniversitesi Adresli: Evet

Özet

In the realm of software development, quality reigns supreme, but the ever-present danger of vulnerabilities threatens to undermine this fundamental principle. Insufficient early vulnerability identification is a key factor in releasing numerous apps with compromised security measures. The most effective solution could be using machine learning models trained on labeled datasets; however, existing datasets still struggle to meet this need fully. Our research constructs a vulnerability dataset for Android application source code, primarily based on the Common Vulnerabilities and Exposures (CVE) system, using data derived from real-world developers’ vulnerability-fixing commits. This dataset was obtained by systematically searching such commits on GitHub using well-designed keywords. This study created the dataset using vulnerable code snippets from 366,231 out of 2.9 million analyzed repositories. All scripts used for data collection, processing, and refinement and the generated dataset are publicly available on GitHub. Experimental results demonstrate that fine-tuned Support Vector Machine and Logistic Regression models trained on this dataset achieve an accuracy of 98.71%, highlighting their effectiveness in vulnerability detection.