Locality Sensitive Hashing Based Clustering for Large Scale Documents


Özdem K., Akcayol M. A.

6th International Conference on Mathematics and Artificial Intelligence, ICMAI 2021, Virtual, Online, China, 19 - 21 March 2021, pp.137-142

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1145/3460569.3460590
  • City of Publication: Virtual, Online
  • Country of Publication: China
  • Page Numbers: pp.137-142
  • Keywords: k-shingles, large-scale document clustering, locality sensitive hashing
  • Affiliated with Gazi University: Yes

Abstract

© 2021 ACM. The volume of data continues to grow rapidly. As a result, large-scale processing has become an important issue in document clustering, which organizes large numbers of documents into a few meaningful and consistent clusters. In this study, a dataset of 390 English textbooks with a total size of 7.61 GB was used for the clustering task. Locality sensitive hashing and k-shingles were used to obtain high-quality clusters. The clusters were evaluated using cluster validity indices. According to the experimental results, high-quality clusters were obtained, with Silhouette and Davies-Bouldin scores of 0.88 and 0.79, respectively.
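
The abstract does not describe the implementation, so the following is only a minimal sketch of how k-shingles and locality sensitive hashing are commonly combined, not the authors' method. It assumes character-level shingles, MinHash signatures, and the banding technique for finding candidate pairs. All function names and parameter values (`k=5`, `num_hashes=64`, `bands=16`) are hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-shingles of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle, salted with a seed (assumed scheme)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=16):
    """Split each signature into bands; documents sharing any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets, used to verify candidates."""
    return len(a & b) / len(a | b)


def candidate_pairs_for(docs, k=5, num_hashes=64, bands=16):
    """Convenience wrapper: raw texts in, candidate document pairs out."""
    sigs = {d: minhash_signature(shingles(t, k), num_hashes) for d, t in docs.items()}
    return lsh_candidate_pairs(sigs, bands)
```

In a sketch like this, only the candidate pairs returned by the banding step need an exact similarity check, which is what makes the approach attractive for large collections. The resulting groups could then be scored with cluster validity indices such as Silhouette and Davies-Bouldin.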