Clustering-Aided Supervised Malware Detection with Specialized Classifiers and Early Consensus


Creative Commons License

DENER M., Gulburun S.

CMC-COMPUTERS MATERIALS & CONTINUA, vol.75, no.1, pp.1235-1251, 2023 (SCI-Expanded) identifier identifier

  • Publication Type: Article / Article
  • Volume: 75 Issue: 1
  • Publication Date: 2023
  • Doi Number: 10.32604/cmc.2023.036357
  • Journal Name: CMC-COMPUTERS MATERIALS & CONTINUA
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Aerospace Database, Communication Abstracts, Compendex, INSPEC, Metadex, zbMATH, Civil Engineering Abstracts
  • Page Numbers: pp.1235-1251
  • Gazi University Affiliated: Yes

Abstract

One of the most common types of threats to the digital world is malicious software. It is of great importance to detect and prevent existing and new malware before it damages information assets. Machine learning approaches are used effectively for this purpose. In this study, we present a model in which supervised and unsupervised learning algorithms are used together. Clustering is used to enhance the prediction performance of the supervised classifiers. The aim of the proposed model is to make predictions in the shortest possible time with high accuracy and f1 score. In the first stage of the model, the data are clustered with the k-means algorithm. In the second stage, the prediction is made with the combination of the classifier with the best prediction performance for the related cluster. While choosing the best classifiers for the given clusters, triple combinations of ten machine learning algorithms (kernel support vector machine, k-nearest neighbor, naive Bayes, decision tree, random forest, extra gradient boosting, categorical boosting, adaptive boosting, extra trees, and gradient boosting) are used. The selected triple classifier combination is positioned in two stages. The prediction time of the model is improved by positioning the classifier with the slowest pre-diction time in the second stage. The selected triple classifier combination is positioned in two tiers. The prediction time of the model is improved by positioning the classifier with the highest prediction time in the second tier. It is seen that clustering before classification improves prediction performance, which is presented using Blue Hexagon Open Dataset for Malware Analy-sis (BODMAS), Elastic Malware Benchmark for Empowering Researchers (EMBER) 2018 and Kaggle malware detection datasets. The model has 99.74% accuracy and 99.77% f1 score for the BODMAS dataset, 99.04% accuracy and 98.63% f1 score for the Kaggle malware detection dataset, and 96.77% accuracy and 96.77% f1 score for the EMBER 2018 dataset. In addition, the tiered positioning of classifiers shortened the average prediction time by 76.13% for the BODMAS dataset and 95.95% for the EMBER 2018 dataset. The proposed method's prediction performance is better than the rest of the studies in the literature in which BODMAS and EMBER 2018 datasets are used.