EUROPEAN JOURNAL OF CLINICAL PHARMACOLOGY, vol. 0, 2025 (SCI-Expanded, Scopus)
Purpose This study evaluated the performance of three generative AI models (ChatGPT-4o, Gemini 1.5 Advanced Pro,
and Claude 3.5 Sonnet) in producing case-based rational pharmacology questions compared to expert educators.
Methods Sixty questions (20 per model) on essential hypertension and type 2 diabetes were generated using one-shot
prompting. A multidisciplinary panel categorized the questions by usability (no revisions needed, minor or major
revisions required, or unusable). Subsequently, 24 AI-generated and 8 expert-created questions were administered to 103 medical
students in a real-world exam setting. Performance metrics, including the correct response rate, discrimination index, and
identification of nonfunctional distractors, were analyzed.
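(For reference, a sketch of the discrimination index in its common formulation; the exact grouping fraction is an assumption here, not stated in the study: D = (U - L) / n, where U and L are the numbers of correct responses to an item in the upper- and lower-scoring groups, commonly the top and bottom 27% of examinees, and n is the size of each group; values of D at or above 0.20 are conventionally regarded as acceptable.)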
Results No statistically significant differences were found between AI-generated and expert-created questions, with mean correct
response rates surpassing 50% and discrimination indices consistently equal to or above 0.20. Claude produced the highest
proportion of error-free items (12/20), whereas ChatGPT exhibited the fewest unusable items (5/20). Expert revisions required
approximately one minute per AI-generated question, representing a substantial efficiency gain over manual question preparation.
Nonetheless, 19 out of 60 AI-generated questions were deemed unusable, highlighting the necessity of expert oversight.
Conclusion Large language models can substantially accelerate the development of high-quality assessment questions in medical
education. However, expert review remains critical to safeguard reliability and validity. A hybrid model, integrating
AI-driven efficiency with rigorous expert validation, may offer an optimal approach for enhancing educational outcomes.