Candidate sentence selection for extractive text summarization


Mutlu B., Sezer E., Akcayol M. A.

INFORMATION PROCESSING & MANAGEMENT, cilt.57, 2020 (SCI-Expanded) identifier identifier

  • Yayın Türü: Makale / Tam Makale
  • Cilt numarası: 57
  • Basım Tarihi: 2020
  • Doi Numarası: 10.1016/j.ipm.2020.102359
  • Dergi Adı: INFORMATION PROCESSING & MANAGEMENT
  • Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, FRANCIS, ABI/INFORM, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Compendex, Computer & Applied Sciences, EBSCO Education Source, Education Abstracts, Information Science and Technology Abstracts, INSPEC, Library and Information Science Abstracts, Library Literature and Information Science, Linguistics & Language Behavior Abstracts, MLA - Modern Language Association Database, zbMATH, Library, Information Science & Technology Abstracts (LISTA)
  • Anahtar Kelimeler: Extractive text summarization, Text summarization features, Summarization dataset, Long short-term memory, SCORING TECHNIQUES, MAXIMUM COVERAGE
  • Gazi Üniversitesi Adresli: Evet

Özet

Text summarization is a process of generating a brief version of documents by preserving the fundamental information of documents as much as possible. Although most of the text summarization research has been focused on supervised learning solutions, there are a few datasets indeed generated for summarization tasks, and most of the existing summarization datasets do not have human-generated goal summaries which are vital for both summary generation and evaluation. Therefore, a new dataset was presented for abstractive and extractive summarization tasks in this study. This dataset contains academic publications, the abstracts written by the authors, and extracts in two sizes, which were generated by human readers in this research. Then, the resulting extracts were evaluated to ensure the validity of the human extract production process. Moreover, the extractive summarization problem was reinvestigated on the proposed summarization dataset. Here the main point taken into account was to analyze the feature vector to generate more informative summaries. To that end, a comprehensive syntactic feature space was generated for the proposed dataset, and the impact of these features on the informativeness of the resulting summary was investigated. Besides, the summarization capability of semantic features was experienced by using GloVe and word2vec embeddings. Finally, the use of ensembled feature space, which corresponds to the joint use of syntactic and semantic features, was proposed on a long short-term memory-based neural network model. ROUGE metrics evaluated the model summaries, and the results of these evaluations showed that the use of the proposed ensemble feature space remarkably improved the single-use of syntactic or semantic features. Additionally, the resulting summaries of the proposed approach on ensembled features prominently outperformed or provided comparable performance than summaries obtained by state-of-the-art models for extractive summarization.