Performance evaluation of k-means clustering method on high dimensional low sample size datasets.


Kılıç G., Gündüz Tekin N.

VI. International Applied Statistics Congress (UYIK – 2025), Ankara, Türkiye, 14 May - 16 July 2025, pp. 10, (Abstract)

  • Publication Type: Conference Paper / Abstract
  • City of Publication: Ankara
  • Country of Publication: Türkiye
  • Page Numbers: pp. 10
  • Affiliated with Gazi Üniversitesi: Yes

Abstract

In recent years, increasing attention has been directed toward the analysis of high-dimensional low sample size (HDLSS) datasets, particularly in the context of genomic data. These datasets are characterized by a number of variables that far exceeds the number of observations (p ≫ n), which poses significant methodological and computational challenges. One major consequence of this structure is that distance-based measures become less reliable, which reduces the effectiveness of clustering algorithms and complicates the separation of distinct groups within the data. Beyond the inherent challenges of HDLSS data structures, this study considers scenarios involving both the presence of outliers and the contamination of the dataset with observations from different distributions, and comprehensively examines the performance of the k-means clustering algorithm, a multivariate method, under these conditions. The algorithm was first applied to real-world genomic cancer datasets with high dimensionality and low sample sizes. The clustering results were evaluated using a set of external and internal validation indices, including the Adjusted Rand Index (ARI), Dunn index, Silhouette index, and Calinski-Harabasz index, to assess the consistency and quality of the clustering structure. Moreover, a series of simulation studies was conducted to systematically examine the algorithm’s robustness under contamination scenarios. Specifically, synthetic HDLSS datasets were generated from the normal distribution, and controlled proportions of observations drawn from alternative distributions were introduced to simulate contamination. The k-means algorithm was then applied under each scenario, and its performance was assessed using the same set of validation metrics. All simulation experiments were implemented in the R programming language.
The results of this study provide empirical insights into the limitations of traditional clustering methods in high-dimensional contaminated environments and emphasize the importance of robust validation frameworks for reliable cluster detection in HDLSS contexts.
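
The contamination experiment described in the abstract can be sketched roughly as follows. This is an illustrative sketch in plain Python, not the authors' R implementation. The abstract does not specify the group sizes, the dimension, the mean shift, the contaminating distribution, or whether contaminated observations replace or are added to the clean ones. Every such choice here (two groups, a Cauchy contaminant that replaces rows, and the helper names `make_hdlss`, `kmeans`, and `adjusted_rand_index`) is an assumption made for illustration. Only the ARI is computed; the internal indices named in the abstract are omitted for brevity.

```python
import math
import random
from collections import Counter


def make_hdlss(n_per_group=10, p=200, shift=1.0, contam_frac=0.2, seed=0):
    """Two normal groups with p >> n; a fraction of rows is replaced by
    heavy-tailed (standard Cauchy) draws to mimic contamination (assumed design)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, mu in ((0, 0.0), (1, shift)):
        for _ in range(n_per_group):
            X.append([rng.gauss(mu, 1.0) for _ in range(p)])
            y.append(label)
    n_bad = int(round(contam_frac * len(X)))
    for i in rng.sample(range(len(X)), n_bad):
        # Inverse-CDF draw from a standard Cauchy; the original label is kept.
        X[i] = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(p)]
    return X, y


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(X, k=2, n_iter=50, seed=0):
    """Plain Lloyd iterations with random initial centers (single start)."""
    rng = random.Random(seed)
    centers = [list(X[i]) for i in rng.sample(range(len(X)), k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in X]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def comb2(m):
    return m * (m - 1) / 2


def adjusted_rand_index(a, b):
    """ARI computed from the contingency table of two labelings."""
    sum_ij = sum(comb2(c) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb2(c) for c in Counter(a).values())
    sum_b = sum(comb2(c) for c in Counter(b).values())
    expected = sum_a * sum_b / comb2(len(a))
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.2, 0.3):
        X, y = make_hdlss(contam_frac=frac)
        ari = adjusted_rand_index(y, kmeans(X))
        print(f"contamination={frac:.1f}  ARI={ari:.3f}")
```

A single random start is used to keep the sketch short. A more careful experiment would repeat each scenario over many seeds and restarts and report the average of each index.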