Retrieval-augmented ChatGPT-4o improves accuracy but reduces readability in hip arthroscopy patient education


Gültekin O., Sezgin E. A., Cakır O., Şengül H. B., Kilinc B. E., Yılmaz B., et al.

Knee Surgery, Sports Traumatology, Arthroscopy, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2025
  • DOI: 10.1002/ksa.70207
  • Journal Name: Knee Surgery, Sports Traumatology, Arthroscopy
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE
  • Keywords: artificial intelligence, ChatGPT, hip arthroscopy, large language models, patient education, retrieval-augmented generation
  • Gazi University Affiliated: Yes

Abstract

Purpose: To compare the accuracy, readability and patient-centredness of responses generated by standard ChatGPT-4o and its retrieval-augmented 'Deep Research' mode for hip arthroscopy patient education, addressing current uncertainty about the reliability of large language models in orthopaedic patient information.

Methods: Thirty standardised patient questions were derived through structured searches of reputable orthopaedic health information websites. Both ChatGPT configurations independently generated responses to each question. Two fellowship-trained orthopaedic surgeons independently rated each response on 5-point Likert scales (1 = poor, 5 = excellent) for accuracy, clarity, comprehensiveness and readability. Intra- and interrater reliabilities were calculated, and readability was also assessed objectively with the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES).

Results: Deep Research outperformed the standard model in accuracy (4.7 ± 0.4 vs. 4.0 ± 0.5; p = 0.012) and comprehensiveness (4.8 ± 0.3 vs. 3.9 ± 0.6; p < 0.001), while the standard model performed better in clarity (4.6 ± 0.4 vs. 4.4 ± 0.5; p = 0.048). Readability Likert scores were comparable (p = 0.729), but FKGL and FRES favoured the standard model (both p < 0.001). Interrater intraclass correlation coefficients (ICCs) ranged from 0.57 to 0.83, and intrarater ICCs from 0.63 to 0.79.

Conclusion: Deep Research provides superior scientific rigour, whereas the standard model offers better readability. A hybrid approach combining the strengths of both models may maximise educational effectiveness, though clinical oversight remains essential to mitigate misinformation risks. The observed differences were modest in magnitude, aligning with previously reported accuracy–readability trade-offs in LLMs, and these results should be interpreted as exploratory and hypothesis-generating.

Level of Evidence: Level IV, cross-sectional, comparative simulation study.
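For context, the FKGL and FRES scores reported above come from fixed formulas over two text statistics: average words per sentence and average syllables per word. The sketch below is a minimal Python illustration of those standard formulas, not the tooling used in the study; in particular, the vowel-group syllable counter is a crude heuristic, whereas published readability analyses typically rely on validated calculators.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; a placeholder for a validated syllable counter."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # crude silent-'e' adjustment
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, FRES) using the standard Flesch formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)            # words per sentence
    spw = syllables / len(words)                 # syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59       # Flesch-Kincaid Grade Level
    fres = 206.835 - 1.015 * wps - 84.6 * spw    # Flesch Reading Ease Score
    return fkgl, fres

if __name__ == "__main__":
    sample = ("Hip arthroscopy is a minimally invasive procedure. "
              "A surgeon inspects and repairs the joint through small incisions.")
    fkgl, fres = readability(sample)
    print(f"FKGL: {fkgl:.1f}  FRES: {fres:.1f}")
```

A higher FKGL indicates a higher school-grade reading level, while a higher FRES indicates easier text, which is why the abstract's finding that FKGL and FRES both favoured the standard model corresponds to its responses being easier to read.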