European Journal of Clinical Pharmacology, vol. 80, no. 5, pp. 729-735, 2024 (SCI-Expanded)
Artificial intelligence, specifically large language models such as ChatGPT, offers potential benefits in question (item) writing. This study aimed to determine the feasibility of generating case-based multiple-choice questions with ChatGPT, as judged by item difficulty and discrimination levels.
This study involved 99 fourth-year medical students who participated in a rational pharmacotherapy clerkship conducted according to the WHO 6-Step Model. In response to a prompt that we provided, ChatGPT generated ten case-based multiple-choice questions on hypertension. Following an expert panel review, two of these questions were incorporated into a medical school examination without any changes. After the test was administered, we evaluated their psychometric properties, including item difficulty, item discrimination (point-biserial correlation), and the functionality of the options.
Both questions exhibited acceptable point-biserial correlations (0.41 and 0.39), above the commonly used threshold of 0.30. However, one question had three non-functional options (options chosen by fewer than 5% of the examinees), while the other had none.
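These indices follow standard classical test theory definitions. As an illustration only, not the authors' actual analysis code, the following minimal Python sketch shows how such indices could be computed from binary item scores and recorded option choices; all function names and data values are hypothetical:

```python
import numpy as np

def item_difficulty(item_correct):
    """Item difficulty (p): proportion of examinees answering correctly."""
    return float(np.mean(item_correct))

def point_biserial(item_correct, total_scores):
    """Item discrimination: correlation of a 0/1 item with total scores.

    r_pb = (M1 - M0) / s * sqrt(p * q), where M1 and M0 are the mean
    total scores of examinees answering correctly and incorrectly,
    s is the (population) SD of total scores, p is the item difficulty,
    and q = 1 - p. Values above 0.30 are conventionally acceptable.
    """
    x = np.asarray(item_correct, dtype=bool)
    y = np.asarray(total_scores, dtype=float)
    p = x.mean()
    m1, m0 = y[x].mean(), y[~x].mean()
    return (m1 - m0) / y.std() * np.sqrt(p * (1.0 - p))

def non_functional_options(choices, options="ABCDE", threshold=0.05):
    """Options selected by fewer than 5% of examinees (non-functional)."""
    n = len(choices)
    return [opt for opt in options if choices.count(opt) / n < threshold]

# Hypothetical responses from 10 examinees on one item:
correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
totals = [78, 85, 52, 90, 47, 66, 81, 55, 72, 88]
print(item_difficulty(correct))           # 0.7
print(point_biserial(correct, totals))    # ~0.88 for this toy data
print(non_functional_options(list("AABACABAAB")))  # ['D', 'E']
```

For a binary item, this formula is equivalent to the Pearson correlation, so scipy.stats.pointbiserialr(correct, totals) would give the same coefficient.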
The findings showed that the questions can effectively differentiate between high- and low-performing students, which also points to the potential of ChatGPT as an artificial intelligence tool in test development. Future studies may use the prompt to generate items and gather data from diverse institutions and settings, thereby enhancing the external validity of these results.