Locality Sensitive Hashing Based Clustering for Large Scale Documents

Özdem K., Akcayol M. A.

6th International Conference on Mathematics and Artificial Intelligence, ICMAI 2021, Virtual, Online, China, 19 - 21 March 2021, pp.137-142, (Full Text Paper)

Publication Type: Conference Paper / Full Text Paper
DOI Number: 10.1145/3460569.3460590
City of Publication: Virtual, Online
Country of Publication: China
Pages: pp.137-142
Keywords: k-shingles, large-scale document clustering, locality sensitive hashing
Affiliated with Gazi University: Yes

Abstract

© 2021 ACM. Data volumes continue to grow rapidly. As a result, large-scale processing has become an important issue in document clustering, which organizes large numbers of documents into a small number of meaningful and consistent clusters. In this study, a dataset of 390 English textbooks with a total size of 7.61 GB was used for the clustering task. Locality sensitive hashing and k-shingles were used to obtain high-quality clusters, which were then evaluated using cluster validity indices. According to the experimental results, the clusters obtained scored 0.88 on the Silhouette index and 0.79 on the Davies-Bouldin index.
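
The abstract names k-shingles and locality sensitive hashing but gives no implementation details. The following is a minimal sketch of how the two are commonly combined, assuming character-level shingles, MinHash signatures, and the banding scheme for candidate generation. The function names and the parameter values (k = 5, 64 hash functions, 16 bands) are illustrative assumptions and are not taken from the paper.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=5):
    """Return the set of character k-shingles of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle, salted with a seed (assumed scheme)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=16):
    """Split each signature into bands; documents that agree on any band collide."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify candidate pairs."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Toy documents (hypothetical); the paper's corpus is 390 textbooks.
    docs = {
        "d1": "Locality sensitive hashing groups similar documents together.",
        "d2": "Locality sensitive hashing groups similar documents together!",
        "d3": "Davies-Bouldin and Silhouette are cluster validity indices.",
    }
    sets = {d: shingles(t) for d, t in docs.items()}
    sigs = {d: minhash_signature(s) for d, s in sets.items()}
    for a, b in sorted(lsh_candidate_pairs(sigs)):
        print(a, b, round(jaccard(sets[a], sets[b]), 3))
```

In this sketch, only documents that share at least one band bucket are compared exactly, which is what lets the approach avoid all-pairs comparison on a large corpus. Using more bands with fewer rows each makes collisions more likely, so more candidates are produced at a lower similarity threshold.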