A Novel Graph-Based Ensemble Token Classification Model for Keyword Extraction


Kılıç H., ÇETİN A.

ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 1 (Scopus)

Abstract

Keyword extraction is a fundamental problem in natural language processing applications. Many graph-based models in the literature address it by constructing a graph of word co-occurrences from the input text. These models use graph-based features such as Betweenness Centrality, Closeness Centrality, Eigenvector Centrality, Degree, PageRank, Clustering Coefficient, Eccentricity, Structural Hole, and Coreness. In this paper, we propose a novel graph-based token classification model built on these commonly used graph-based features. We used Extra Trees, Lasso, genetic algorithm, and wrapper methods to select the most informative subset of features. The token classification module of the model uses the Random Forest ensemble classification algorithm. Performance was evaluated on the commonly used Inspec, Semeval-2017, and 500N-KPCrowd datasets, as well as on the newly collected TRDizinEn and DergiParkEn datasets. The highest F1-scores obtained on Semeval-2017, 500N-KPCrowd, DergiParkEn, and TRDizinEn were 0.641, 0.694, 0.707, and 0.766, respectively.
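To make the feature side of the abstract concrete, the following is a minimal sketch of how a word co-occurrence graph and three of the listed features (Degree, Clustering Coefficient, PageRank) might be computed per token. It is not the authors' implementation: the whitespace tokenization, the window size, the damping factor, the iteration count, and all function names are assumptions chosen for illustration. The feature selection step and the Random Forest classifier are not shown.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(tokens, window=2):
    """Link two distinct tokens if one occurs within `window` positions after the other.

    The window size is an assumption; the abstract does not state one.
    """
    graph = defaultdict(set)
    for i, tok in enumerate(tokens):
        graph[tok]  # ensure every token gets a node, even without neighbours
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != tok:
                graph[tok].add(other)
                graph[other].add(tok)
    return dict(graph)


def degree(graph):
    """Number of distinct neighbours of each node."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def clustering_coefficient(graph):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    coeffs = {}
    for node, nbrs in graph.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power iteration on an undirected graph.

    Isolated nodes are not given special treatment here, so scores sum
    to 1 only when every node has at least one neighbour.
    """
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {}
        for node in graph:
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


def token_features(tokens, window=2):
    """Map each token to a (degree, clustering coefficient, PageRank) tuple."""
    graph = build_cooccurrence_graph(tokens, window)
    deg = degree(graph)
    cc = clustering_coefficient(graph)
    pr = pagerank(graph)
    return {node: (deg[node], cc[node], pr[node]) for node in graph}


if __name__ == "__main__":
    text = "keyword extraction uses graph features and graph features rank keyword candidates"
    for token, feats in token_features(text.split()).items():
        print(token, feats)
```

In a pipeline like the one the abstract describes, each token's feature tuple would form one row of the input to a classifier that labels the token as keyword or non-keyword. The remaining features named in the abstract would be added as further columns.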