Can ChatGPT Generate Acceptable Case-Based Multiple-Choice Questions for Medical School Anatomy Exams? A Pilot Study on Item Difficulty and Discrimination


Kıyak Y. S., Soylu A., Coşkun Ö., Budakoğlu I. İ., Peker T. V.

CLINICAL ANATOMY, vol.38, no.4, pp.505-510, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 38 Issue: 4
  • Publication Date: 2025
  • DOI Number: 10.1002/ca.24271
  • Journal Name: CLINICAL ANATOMY
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Biotechnology Research Abstracts, EMBASE, MEDLINE
  • Page Numbers: pp.505-510
  • Keywords: anatomy, artificial intelligence, ChatGPT, medical education, multiple-choice questions, psychometrics, written assessment
  • Gazi University Affiliated: Yes

Abstract

Developing high-quality multiple-choice questions (MCQs) for medical school exams is effortful and time-consuming. In this study, we investigated the ability of ChatGPT to generate case-based anatomy MCQs with acceptable levels of item difficulty and discrimination for medical school exams. We used ChatGPT to generate case-based anatomy MCQs for an endocrine and urogenital system exam based on a framework for artificial intelligence (AI)-assisted item generation. The questions were evaluated by experts, approved by the department, and administered to 502 second-year medical students (372 in the Turkish-language program, 130 in the English-language program). The items were analyzed to determine their discrimination and difficulty indices. The item discrimination indices ranged from 0.29 to 0.54, indicating acceptable differentiation between high- and low-performing students. All items in Turkish (six out of six) and five out of six in English met the higher discrimination threshold (>= 0.30) required for large-scale standardized tests. The item difficulty indices ranged from 0.41 to 0.89, with most items falling within the moderate difficulty range (0.20-0.80). It was therefore concluded that ChatGPT can generate case-based anatomy MCQs with acceptable psychometric properties, offering a promising tool for medical educators. However, human expertise remains crucial for reviewing and refining AI-generated assessment items. Future research should explore AI-generated MCQs across various anatomy topics and investigate different AI models for question generation.
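For readers unfamiliar with the item statistics reported above, the following minimal sketch (not the authors' actual analysis code) shows one standard way to compute them from a scored response matrix: the difficulty index as the proportion of students answering an item correctly, and the discrimination index via the upper-lower group method. The 27% group split and the function names are illustrative assumptions; the paper does not specify which computational variant was used.

```python
import numpy as np

def item_analysis(scores, group_fraction=0.27):
    """Classical item analysis for a scored (0/1) response matrix.

    scores: array of shape (n_students, n_items); 1 = correct, 0 = incorrect.
    group_fraction: share of students in the upper/lower groups
    (0.27 is a common convention; assumed here, not taken from the paper).
    Returns (difficulty, discrimination), one value per item.
    """
    scores = np.asarray(scores, dtype=float)
    n_students, _ = scores.shape

    # Difficulty index p: proportion of students answering each item correctly.
    difficulty = scores.mean(axis=0)

    # Rank students by total score to form the upper and lower groups.
    totals = scores.sum(axis=1)
    order = np.argsort(totals)
    n_group = max(1, int(round(group_fraction * n_students)))
    lower = scores[order[:n_group]]
    upper = scores[order[-n_group:]]

    # Discrimination index D: proportion correct in the upper group
    # minus proportion correct in the lower group.
    discrimination = upper.mean(axis=0) - lower.mean(axis=0)

    return difficulty, discrimination

# Example with simulated data for three items and 500 students.
rng = np.random.default_rng(0)
simulated = (rng.random((500, 3)) < np.array([0.4, 0.6, 0.9])).astype(int)
p, d = item_analysis(simulated)
```

Under the thresholds cited in the abstract, items with D >= 0.30 would meet the stricter discrimination criterion for large-scale standardized tests, and p values between 0.20 and 0.80 would fall in the acceptable difficulty range.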