PREDICTIVE PERFORMANCES OF IMPLICITLY AND EXPLICITLY ROBUST CLASSIFIERS ON HIGH DIMENSIONAL DATA

Gündüz, Necla; Fokoue, Ernest

doi:10.1501/commua1_0000000797

COMMUNICATIONS FACULTY OF SCIENCES UNIVERSITY OF ANKARA-SERIES A1 MATHEMATICS AND STATISTICS, vol. 66, no. 2, pp. 14-36, 2017 (ESCI)

Publication Type: Article / Full Article
Volume: 66 Issue: 2
Publication Date: 2017
DOI: 10.1501/commua1_0000000797
Journal Name: COMMUNICATIONS FACULTY OF SCIENCES UNIVERSITY OF ANKARA-SERIES A1 MATHEMATICS AND STATISTICS
Journal Indexes: Emerging Sources Citation Index (ESCI), TR DİZİN (ULAKBİM)
Page Numbers: pp. 14-36
Affiliated with Gazi University: Yes

Abstract

The goal of this paper is to demonstrate that implicit robustness can substantially outperform explicit robustness in the pattern recognition of contaminated high-dimension, low-sample-size data. Through extensive computational simulations and applications to real-life data, we show that random subspace ensemble learning machines, although not structurally designed as robustness-inducing supervised learning paradigms, outperform structurally robustness-seeking classifiers on high-dimension, low-sample-size datasets. Random forest (RF), arguably the most commonly used random subspace ensemble learning method, is compared to various robust extensions and adaptations of the discriminant analysis classifier. Our work reveals that RF, although not inherently designed to be robust to outliers, substantially outperforms existing techniques specifically designed to achieve robustness. By exploring different scenarios of the sample size n and the input space dimensionality p, along with the corresponding capacity kappa = n/p with kappa < 1, we demonstrate that, regardless of the contamination rate epsilon, RF predictively outperforms the explicitly robustness-inducing classification techniques when the intrinsic dimensionality of the data is large.
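To make the quantities in the abstract concrete, the following minimal sketch generates a two-class dataset with n < p and flags each observation as an outlier with probability epsilon. The abstract does not describe the simulation design, so everything beyond n, p, kappa and epsilon is an assumption made for illustration: the Gaussian class model, the mean shift, the variance-inflation form of contamination, and the helper names `capacity` and `contaminated_sample`.

```python
import random


def capacity(n, p):
    """Capacity kappa = n / p; kappa < 1 means fewer samples than dimensions."""
    return n / p


def contaminated_sample(n, p, epsilon, shift=1.0, outlier_scale=10.0, seed=0):
    """Draw n labelled points in R^p; each is an outlier with probability epsilon.

    Clean points: class 0 ~ N(0, I), class 1 ~ N(shift * 1, I).
    Outliers: same class mean, standard deviation inflated by outlier_scale.
    Both choices are illustrative assumptions, not the paper's design.
    """
    rng = random.Random(seed)
    X, y, is_outlier = [], [], []
    for i in range(n):
        label = i % 2  # alternate labels to keep the classes balanced
        mean = shift * label
        outlier = rng.random() < epsilon
        sd = outlier_scale if outlier else 1.0
        X.append([rng.gauss(mean, sd) for _ in range(p)])
        y.append(label)
        is_outlier.append(outlier)
    return X, y, is_outlier


n, p, epsilon = 40, 200, 0.1
X, y, is_outlier = contaminated_sample(n, p, epsilon)
print(f"kappa = {capacity(n, p):.2f}")
print(f"observed contamination = {sum(is_outlier) / n:.2f}")
```

A dataset produced this way could then be passed to a random forest and to a robust discriminant analysis implementation to compare test error across values of kappa and epsilon, which is the kind of comparison the abstract reports.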