PyCaret for Predicting Type 2 Diabetes: A Phenotype- and Gender-Based Approach with the “Nurses’ Health Study” and the “Health Professionals’ Follow-Up Study” Datasets


Creative Commons License

Gul S., AYTURAN K., HARDALAÇ F.

Journal of Personalized Medicine, vol.14, no.8, 2024 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 14 Issue: 8
  • Publication Date: 2024
  • DOI: 10.3390/jpm14080804
  • Journal Name: Journal of Personalized Medicine
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Food Science & Technology Abstracts, Directory of Open Access Journals
  • Keywords: feature importance plot, machine learning, prediction, PyCaret, SHAP value, type 2 diabetes mellitus
  • Affiliated with Gazi University: Yes

Abstract

Predicting type 2 diabetes mellitus (T2DM) from phenotypic data with machine learning (ML) techniques has received significant attention in recent years. PyCaret, a low-code automated ML tool that enables the simultaneous application of 16 different algorithms, was used to predict T2DM from phenotypic variables in the “Nurses’ Health Study” and “Health Professionals’ Follow-up Study” datasets. Ridge Classifier, Linear Discriminant Analysis, and Logistic Regression (LR) were the best-performing models for the male-only data subset. For the female-only data subset, LR, Gradient Boosting Classifier, and CatBoost Classifier were the strongest models. The AUC, accuracy, and precision were approximately 0.77, 0.70, and 0.70 for males and 0.79, 0.70, and 0.71 for females, respectively. The feature importance plot showed that family history of diabetes (famdb), never having smoked, and high blood pressure (hbp) were the most influential features in females, while famdb, hbp, and being a current smoker were the most influential in males. In conclusion, PyCaret was used successfully to predict T2DM while simplifying complex ML tasks. Gender differences are important to consider in T2DM prediction. Even with this comprehensive ML tool, phenotypic variables alone may not be sufficient for early T2DM prediction; future studies could combine them with genotypic variables.
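
To make the three reported metrics concrete, the following is a minimal sketch of how AUC, accuracy, and precision can be computed for a binary classifier. It is not the study's code: the labels and scores are invented toy values, the function names are illustrative, and the 0.5 decision threshold is an assumption. AUC is computed here as the fraction of (positive, negative) pairs in which the positive case receives the higher score.

```python
# Toy illustration only: the data below are made up and are NOT from the
# Nurses' Health Study or the Health Professionals' Follow-up Study.

def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

def roc_auc(y_true, scores):
    """Pairwise AUC: probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true labels (1 = T2DM, 0 = no T2DM) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]

# Assumed decision threshold of 0.5 for turning probabilities into labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

auc_value = roc_auc(y_true, scores)
acc_value = accuracy(y_true, y_pred)
prec_value = precision(y_true, y_pred)

print(f"AUC={auc_value:.4f} accuracy={acc_value:.2f} precision={prec_value:.2f}")
```

On this toy data the sketch gives an AUC of 0.8125 and an accuracy and precision of 0.75 each. AUC is threshold-independent, whereas accuracy and precision depend on the chosen cut-off, which is one reason an AUC near 0.77 to 0.79 can coexist with accuracy and precision near 0.70.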