Keyword extraction as sequence labeling with classification algorithms


Unlu H. K., ÇETİN A.

NEURAL COMPUTING & APPLICATIONS, vol.35, no.4, pp.3413-3422, 2023 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 35 Issue: 4
  • Publication Date: 2023
  • DOI Number: 10.1007/s00521-022-07906-x
  • Journal Name: NEURAL COMPUTING & APPLICATIONS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, Applied Science & Technology Source, Biotechnology Research Abstracts, Compendex, Computer & Applied Sciences, Index Islamicus, INSPEC, zbMATH
  • Page Numbers: pp.3413-3422
  • Keywords: Keyword extraction, Sequence labeling, Hybrid, HybridKEM, KEYPHRASE EXTRACTION
  • Gazi University Affiliated: Yes

Abstract

Keyword extraction is one of the main problems in clustering and linking textual content. Several machine learning approaches have been proposed in the literature for keyword and keyphrase extraction. However, state-of-the-art performance still falls short of expectations. In this paper, we propose a novel hybrid keyword extraction model, HybridKEM. The proposed model treats keyword extraction as a sequence labeling task. Naive Bayes (NB), Polynomial Regression (PR), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Random Forest (RF) classification algorithms were trained separately in the Token Classification module of the model. Token classification was performed using text, graphic, embedding, and set features. The performance of the model was evaluated on the Inspec, SemEval-2017, and 500N-KPCrowd datasets, which are widely used in the literature, and on two newly collected datasets, TRDizinEn and DergiParkEn. The model achieved an average F1-score of 0.664 across all datasets. The highest F1-score (0.74) was obtained on the TRDizinEn dataset.
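
To make the sequence labeling framing concrete, the following is a minimal sketch, not the HybridKEM implementation. It assumes a binary label scheme (1 for a token inside a keyphrase, 0 otherwise) and a token-level F1-score; the abstract specifies neither the label set nor the evaluation granularity. The function names `label_tokens` and `f1_score` and the example sentence are hypothetical.

```python
from typing import List, Sequence


def label_tokens(tokens: Sequence[str], keyphrases: Sequence[str]) -> List[int]:
    """Mark each token 1 if it lies inside a keyphrase occurrence, else 0.

    Assumption: a binary label scheme; the paper may use a different one.
    """
    lowered = [t.lower() for t in tokens]
    labels = [0] * len(tokens)
    for phrase in keyphrases:
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        # Slide a window over the tokens and label every exact match.
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == words:
                for i in range(start, start + n):
                    labels[i] = 1
    return labels


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Token-level F1 for the positive (keyword) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentence and gold keyphrases.
tokens = "We propose a hybrid keyword extraction model for sequence labeling".split()
gold = label_tokens(tokens, ["keyword extraction", "sequence labeling"])

# A made-up prediction standing in for the output of any token classifier.
pred = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]
score = f1_score(gold, pred)
```

In this framing, any of the classifiers named in the abstract could supply `pred` by classifying a feature vector for each token; the feature extraction itself is outside the scope of this sketch.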