Keyword extraction as sequence labeling with classification algorithms

Unlu, Huma; ÇETİN, AYDIN

doi:10.1007/s00521-022-07906-x

Unlu H. K., ÇETİN A.

NEURAL COMPUTING & APPLICATIONS, vol.35, no.4, pp.3413-3422, 2023 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 35 Issue: 4
Publication Date: 2023
DOI Number: 10.1007/s00521-022-07906-x
Journal Name: NEURAL COMPUTING & APPLICATIONS
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, Applied Science & Technology Source, Biotechnology Research Abstracts, Compendex, Computer & Applied Sciences, Index Islamicus, INSPEC, zbMATH
Page Numbers: pp.3413-3422
Keywords: Keyword extraction, Sequence labeling, Hybrid, HybridKEM, KEYPHRASE EXTRACTION
Gazi University Affiliated: Yes

Abstract

Keyword extraction is one of the main problems in clustering and linking textual content. Several machine learning approaches have been proposed in the literature for keyword and keyphrase extraction. However, state-of-the-art performance still falls short of expectations. In this paper, we propose a novel hybrid keyword extraction model, HybridKEM. The proposed model treats keyword extraction as a sequence labeling task. Naive Bayes (NB), Polynomial Regression (PR), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Random Forest (RF) classification algorithms were trained separately in the Token Classification module of the model. Token Classification was performed using text, graphic, embedding, and set features. The model was evaluated on the Inspec, SemEval-2017, and 500N-KPCrowd datasets, which are widely used in the literature, and on two newly collected datasets, TRDizinEn and DergiParkEn. The model achieved an average F1-score of 0.664 across all datasets. The highest F1-score (0.74) was obtained on the TRDizinEn dataset.
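To make the sequence labeling framing concrete, the following is a minimal sketch and not the HybridKEM implementation. It assumes a binary tag per token (1 if the token lies inside a gold keyphrase, 0 otherwise) and a token-level F1-score; the abstract does not specify the tag scheme or whether evaluation is done at the token or keyphrase level. The helper names `label_tokens` and `f1_score`, the example sentence, and the prediction vector are all hypothetical.

```python
from typing import List, Sequence


def label_tokens(tokens: Sequence[str], keyphrases: Sequence[str]) -> List[int]:
    """Tag each token 1 if it falls inside an occurrence of a gold keyphrase, else 0.

    Assumption: a binary tag set; the paper's actual scheme may differ.
    """
    lowered = [t.lower() for t in tokens]
    labels = [0] * len(tokens)
    for phrase in keyphrases:
        parts = phrase.lower().split()
        n = len(parts)
        if n == 0:
            continue
        # Slide a window over the tokens and mark every exact match.
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == parts:
                for j in range(i, i + n):
                    labels[j] = 1
    return labels


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Token-level F1 for the positive (keyword) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentence and gold keyphrases.
tokens = "We propose a hybrid keyword extraction model for sequence labeling".split()
gold = label_tokens(tokens, ["keyword extraction", "sequence labeling"])

# Hypothetical output of some token classifier (one false positive, one miss).
pred = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]
score = f1_score(gold, pred)
```

In this framing, any per-token classifier (such as the NB, SVM, MLP, or RF models named in the abstract) could stand in for the source of `pred`, with each token represented by a feature vector.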