Large language models for generating key-feature questions in medical education


KIYAK Y. S., Górski S., Tokarek T., Pers M., Kononowicz A. A.

Medical Education Online, vol. 30, no. 1, 2025 (SSCI, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 30 Issue: 1
  • Publication Date: 2025
  • DOI: 10.1080/10872981.2025.2574647
  • Journal Name: Medical Education Online
  • Indexed In: Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, ASSIA, MEDLINE, Directory of Open Access Journals
  • Keywords: cardiology, ChatGPT, key-feature problems, key-feature questions, large language models, medical education
  • Gazi University Affiliated: Yes

Abstract

We conducted a descriptive study to evaluate the quality of key-feature questions (KFQs) generated by OpenAI’s o3 model. We developed a reusable generic prompt for KFQ generation, designed in alignment with the Medical Council of Canada’s KFQ development guidelines, and created an evaluation metric to systematically assess KFQ quality against the same guidelines. Twenty unique cardiology-focused KFQs were created using recent European Society of Cardiology guidelines as reference material. Each KFQ was independently assessed by two cardiology experts using the quality checklist, with disagreements resolved by a third reviewer. Descriptive statistics were used to summarize checklist compliance and final acceptability ratings. Of the 20 KFQs, 3 (15%) were rated ‘Accept as is’ and 17 (85%) ‘Accept with minor revisions’; none required major revisions or were rejected. The overall compliance rate across checklist criteria was 93.7%, with perfect scores in domains such as key-feature definition, scenario plausibility, and alignment between questions and scenarios. Lower performance was observed for the inclusion of genuinely harmful ‘killer’ responses (50%), the plausibility of distractors (77.8%), and the use of active language in phrasing the questions (80%). The findings showed that an LLM, guided by a structured prompt, can generate KFQs that closely adhere to established quality standards, with most requiring only minor refinement. While expert review remains essential to ensure clinical accuracy and patient safety, AI-assisted workflows have strong potential to streamline KFQ development and enhance the scalability of clinical decision-making (CDM) assessment in medical education.
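
As a rough illustration of the AI-assisted workflow the abstract describes, the sketch below sends a reusable generic prompt to OpenAI's o3 model to draft one KFQ from a guideline excerpt. It assumes the official OpenAI Python SDK; the prompt wording, function names, and guideline excerpt are illustrative assumptions, not the authors' actual study materials, which followed the Medical Council of Canada's KFQ development guidelines.

```python
# Minimal sketch of an LLM-based KFQ generation step (hypothetical prompt text).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative stand-in for the study's reusable generic prompt.
GENERIC_KFQ_PROMPT = """\
You are an item writer following the Medical Council of Canada's guidelines
for key-feature questions (KFQs). Using the clinical guideline excerpt below,
write one KFQ: define the key feature(s), write a plausible case scenario,
phrase the questions in active language, provide plausible distractors, and
flag any genuinely harmful 'killer' response options.

Guideline excerpt:
{guideline_excerpt}
"""

def generate_kfq(guideline_excerpt: str) -> str:
    """Draft a single KFQ from a guideline excerpt; expert review still required."""
    response = client.chat.completions.create(
        model="o3",
        messages=[{
            "role": "user",
            "content": GENERIC_KFQ_PROMPT.format(guideline_excerpt=guideline_excerpt),
        }],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    excerpt = "..."  # e.g. a passage from a recent ESC cardiology guideline
    print(generate_kfq(excerpt))
```

In the study's actual workflow, each draft produced this way was then scored by two cardiology experts against the quality checklist before any item was accepted.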