Journal of Prosthetic Dentistry, 2025 (SCI-Expanded)
Statement of problem: Despite advances in artificial intelligence (AI), the quality, reliability, and understandability of health-related information provided by chatbots remain uncertain. Furthermore, studies on maxillofacial prosthesis (MP) information provided by AI chatbots are lacking.

Purpose: The purpose of this study was to assess and compare the reliability, quality, readability, and similarity of responses to MP-related questions generated by 4 different chatbots.

Material and methods: A total of 15 questions were prepared by a maxillofacial prosthodontist, and responses were obtained from 4 different chatbots (ChatGPT-3.5, Gemini 2.5 Flash, Copilot, and DeepSeek V3). Reliability scoring (adapted DISCERN), the Global Quality Scale (GQS), the Flesch Reading Ease Score (FRES), the Flesch-Kincaid Reading Grade Level (FKRGL), and the Similarity Index (iThenticate) were used to evaluate the performance of the chatbots. Data were compared using the Kruskal-Wallis test, and differences between chatbots were determined with the Conover multiple comparison test with Benjamini-Hochberg correction (α=.05).

Results: No significant differences were found among the chatbots' DISCERN scores, except for one question on which ChatGPT showed significantly higher reliability than Gemini or Copilot (P=.03). No statistically significant differences were found among the AI tools in GQS values (P=.096), FRES values (P=.166), or FKRGL values (P=.247). The similarity rate of Gemini was significantly higher than that of the other AI chatbots (P=.03).

Conclusions: ChatGPT-3.5, Gemini 2.5 Flash, Copilot, and DeepSeek V3 provided good-quality responses. All chatbots' responses were difficult for nonprofessionals to read and understand. Low similarity rates were found for all chatbots except Gemini, indicating the originality of their information.
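
For reference, the two readability indices named in Material and methods are conventionally computed as follows; the abstract does not state the exact implementation used, so the standard definitions are given here:

```latex
\[
\mathrm{FRES} = 206.835
  - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right)
  - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\]
\[
\mathrm{FKRGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right)
  + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right)
  - 15.59
\]
```

Higher FRES values indicate easier reading, whereas higher FKRGL values correspond to higher required school-grade levels.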
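
The statistical workflow described in Material and methods (a Kruskal-Wallis omnibus test followed by Conover post hoc comparisons with Benjamini-Hochberg correction) could be reproduced along the lines of the sketch below. The score vectors, library choices (scipy, pandas, scikit-posthocs), and variable names are illustrative assumptions, not the authors' actual analysis.

```python
# A minimal sketch of the comparison described above, assuming per-chatbot
# scores (e.g., GQS values for the 15 questions) are available as lists.
# All values below are hypothetical placeholders.
from scipy.stats import kruskal
import pandas as pd
import scikit_posthocs as sp

scores = {
    "ChatGPT-3.5":      [4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4],
    "Gemini 2.5 Flash": [4, 4, 4, 5, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 5],
    "Copilot":          [4, 4, 5, 4, 4, 4, 5, 4, 4, 4, 4, 5, 4, 4, 4],
    "DeepSeek V3":      [5, 4, 4, 4, 4, 5, 4, 4, 4, 5, 4, 4, 4, 4, 4],
}

# Omnibus Kruskal-Wallis test across the 4 chatbots.
h_stat, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, P = {p_value:.3f}")

# Conover post hoc comparisons with Benjamini-Hochberg correction,
# run only when the omnibus test is significant at alpha = .05.
if p_value < .05:
    long = pd.DataFrame(
        [(bot, s) for bot, vals in scores.items() for s in vals],
        columns=["chatbot", "score"],
    )
    pairwise = sp.posthoc_conover(
        long, val_col="score", group_col="chatbot", p_adjust="fdr_bh"
    )
    print(pairwise)
```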