Applied Sciences (Switzerland), cilt.15, sa.5, 2025 (SCI-Expanded)
In the realm of software development, quality reigns supreme, but the ever-present danger of vulnerabilities threatens to undermine this fundamental principle. Insufficient early vulnerability identification is a key factor in releasing numerous apps with compromised security measures. The most effective solution could be using machine learning models trained on labeled datasets; however, existing datasets still struggle to meet this need fully. Our research constructs a vulnerability dataset for Android application source code, primarily based on the Common Vulnerabilities and Exposures (CVE) system, using data derived from real-world developers’ vulnerability-fixing commits. This dataset was obtained by systematically searching such commits on GitHub using well-designed keywords. This study created the dataset using vulnerable code snippets from 366,231 out of 2.9 million analyzed repositories. All scripts used for data collection, processing, and refinement and the generated dataset are publicly available on GitHub. Experimental results demonstrate that fine-tuned Support Vector Machine and Logistic Regression models trained on this dataset achieve an accuracy of 98.71%, highlighting their effectiveness in vulnerability detection.