Improving classification performance of extreme gradient boosting on small-sized dataset to classify Turkish and Italian wines along with elemental profiling by inductively coupled plasma-mass spectrometry

Alp, Hande; Alp, Orkun

doi:10.1080/00387010.2021.2008977

Alp H., Alp O.

Spectroscopy Letters, vol. 55, no. 1, pp. 1-12, 2022 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 55 Issue: 1
Publication Date: 2022
DOI: 10.1080/00387010.2021.2008977
Journal Name: Spectroscopy Letters
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Aerospace Database, Analytical Abstracts, Applied Science & Technology Source, Biotechnology Research Abstracts, Chemical Abstracts Core, Chimica, Communication Abstracts, INSPEC, Metadex, Civil Engineering Abstracts
Pages: pp. 1-12
Keywords: Classification, extreme gradient boosting, ICP-MS, wine, LEAD-ISOTOPE RATIOS, ICP-MS, GEOGRAPHICAL ORIGIN, MULTIELEMENT ANALYSIS, TRACE-ELEMENTS, WHITE WINES, DIFFERENTIATION, FINGERPRINTS, ACCURACY, SOIL
Affiliated with Gazi University: Yes

Abstract

In this study, the classification performance of the extreme gradient boosting algorithm on a small dataset was improved by training it on a synthetic dataset generated with kernel density estimation, with the aim of classifying wine samples. The concentrations of 29 elements in wine samples produced in Turkey (domestic) and Italy (imported) were determined by inductively coupled plasma-mass spectrometry, and the results were used to build the dataset. Classification of the wine samples was first assessed with extreme gradient boosting, which is known to overfit on small datasets, and this resulted in poor classification performance. To improve performance, a synthetic dataset was created and the algorithm was trained on it instead of the original dataset. With the proposed method, the accuracy of the model improved from 76.7% to 81.7%. The precision for Turkish wines increased from 78.4% to 84.1%, and the precision for Italian wines increased from 70.9% to 79.4%. The variable importance determined by the extreme gradient boosting algorithm showed that beryllium and cesium were significantly more important than the other elements. They were followed by tin, phosphorus, cobalt, lead, calcium, copper, zinc, and aluminum, which together make up the top 10 elements for classifying Turkish and Italian wine samples.
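The synthetic-data step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the kernel, the bandwidth selection, or the number of synthetic samples, so everything below is an assumption: a Gaussian kernel fitted independently per element and per class, a normal-reference rule-of-thumb bandwidth, and hypothetical function names (`rule_of_thumb_bandwidth`, `sample_kde`, `synthesize_by_class`). The numbers in `toy_data` are made-up placeholders, not measurements from the study.

```python
import random
import statistics


def rule_of_thumb_bandwidth(values):
    """Normal-reference rule-of-thumb bandwidth for a 1-D Gaussian KDE.

    Assumption: the paper's actual bandwidth choice is not given in the abstract.
    """
    n = len(values)
    return 1.06 * statistics.stdev(values) * n ** (-1 / 5)


def sample_kde(rows, n_samples, rng):
    """Draw synthetic rows from a per-feature Gaussian KDE fitted to `rows`.

    Sampling from a Gaussian KDE amounts to picking an observed row at
    random and adding zero-mean Gaussian noise with the bandwidth as its
    standard deviation.
    """
    n_features = len(rows[0])
    bandwidths = [
        rule_of_thumb_bandwidth([row[j] for row in rows])
        for j in range(n_features)
    ]
    synthetic = []
    for _ in range(n_samples):
        centre = rng.choice(rows)
        synthetic.append(
            [rng.gauss(centre[j], bandwidths[j]) for j in range(n_features)]
        )
    return synthetic


def synthesize_by_class(data, n_per_class, seed=0):
    """Generate synthetic rows separately for each class label."""
    rng = random.Random(seed)
    return {
        label: sample_kde(rows, n_per_class, rng)
        for label, rows in data.items()
    }


# Placeholder values for three hypothetical features; NOT data from the study.
toy_data = {
    "Turkey": [[1.2, 30.0, 0.40], [1.5, 28.0, 0.55], [1.1, 33.0, 0.47], [1.4, 31.0, 0.50]],
    "Italy": [[0.8, 41.0, 0.70], [0.9, 38.0, 0.62], [0.7, 44.0, 0.75], [1.0, 40.0, 0.66]],
}

synthetic = synthesize_by_class(toy_data, n_per_class=50)
```

In a workflow like the one the abstract outlines, the rows in `synthetic` would be used to train the classifier, while the original measurements would be held back for evaluation. Because Gaussian noise is unbounded, a sketch like this can produce negative concentrations, so a real pipeline would need to handle that case.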