Thesis Type: Postgraduate
Institution Of The Thesis: Gazi Üniversitesi, Fen Bilimleri Enstitüsü, Turkey
Approval Date: 2019
Student: ANIL DÜZGÜN
Supervisor: FECİR DURAN
Open Archive Collection: AVESIS Open Access Collection
Abstract: In this study, spam detection was performed with machine learning methods on a data set composed of user-based attributes of Twitter accounts. Twitter is currently one of the social networks most preferred by social media users, and it therefore contains many spam accounts. Smart systems are needed to detect spam accounts, which update their content day by day. In the study, the most appropriate user-account-based features were selected from a Twitter data set that is open to academic use. Models were created in the Scikit-learn, Weka and Matlab tools by running 7 different supervised machine learning methods with default parameters on the feature set. The models were tested and the scores were compared across the 3 tools. For all classifiers, the highest accuracy, precision and F-measure values with default parameters were obtained with the Scikit-learn tool. It was seen that different results can be obtained when the same algorithms are applied with the default parameters of each tool. The classifiers were therefore re-run with the same common parameters, and the differences between the resulting scores were analyzed again. The tools and methods were also evaluated in terms of documentation, ease of development and popularity. In the final stage, the results obtained from all algorithms were compared in Scikit-learn. The best accuracy, sensitivity, F-measure, true positive and false positive scores were obtained by the AdaBoost, Random Forest and Bagging classifiers, which are ensemble learning methods that use decision trees as their sub-models. Among the traditional methods, the highest accuracy, precision, sensitivity, F-measure, true positive and false positive scores were obtained by the decision tree classifier. Although the scores were close to each other, higher scores were obtained from decision trees than from the ensemble methods. For the true negative rate, which corresponds to the rate of spam, the highest performance was obtained by the K-nearest neighbor algorithm.
The lowest false account detection rate was obtained with the random forest classifier. The logistic regression method achieved successful accuracy, F-measure and false positive rate scores.
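The scores named in the abstract (accuracy, precision, sensitivity, F-measure, false positive rate, true negative rate) all derive from the four counts of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the thesis's own code. The function names, the toy labels, and the choice of `1` as the positive class are assumptions made for illustration; the thesis may label spam as either class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_scores(y_true, y_pred, positive=1):
    """Compute the standard scores from the confusion-matrix counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # also called recall or true positive rate
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": safe_div(2 * precision * sensitivity, precision + sensitivity),
        "false_positive_rate": safe_div(fp, fp + tn),
        "true_negative_rate": safe_div(tn, fp + tn),
    }


# Hypothetical toy labels (not data from the thesis).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = classification_scores(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because every score comes from the same four counts, results from different tools are comparable only if each tool uses the same positive class and the same averaging convention.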