Retrieval-augmented ChatGPT-4o improves accuracy but reduces readability in hip arthroscopy patient education


Gültekin O., Sezgin E. A., Cakır O., Şengül H. B., Kilinc B. E., Yılmaz B., et al.

Knee Surgery, Sports Traumatology, Arthroscopy, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2025
  • DOI: 10.1002/ksa.70207
  • Journal Name: Knee Surgery, Sports Traumatology, Arthroscopy
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE
  • Keywords: artificial intelligence, ChatGPT, hip arthroscopy, large language models, patient education, retrieval-augmented generation
  • Gazi University Affiliated: Yes

Abstract

Purpose: To compare the accuracy, readability and patient-centredness of responses generated by standard ChatGPT-4o and its retrieval-augmented 'Deep Research' mode for hip arthroscopy patient education, addressing current uncertainty about the reliability of large language models in orthopaedic patient information.

Methods: Thirty standardised patient questions were derived through structured searches of reputable orthopaedic health information websites. Both ChatGPT configurations independently generated responses to each question. Two fellowship-trained orthopaedic surgeons independently rated each response on 5-point Likert scales (1 = poor, 5 = excellent) for accuracy, clarity, comprehensiveness and readability. Intra- and interrater reliabilities were calculated, and readability was also assessed objectively with the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES).

Results: Deep Research outperformed the standard model in accuracy (4.7 ± 0.4 vs. 4.0 ± 0.5; p = 0.012) and comprehensiveness (4.8 ± 0.3 vs. 3.9 ± 0.6; p < 0.001), while the standard model performed better in clarity (4.6 ± 0.4 vs. 4.4 ± 0.5; p = 0.048). Readability Likert scores were comparable (p = 0.729), but FKGL and FRES favoured the standard model (both p < 0.001). Interrater intraclass correlation coefficients (ICCs) ranged from 0.57 to 0.83, and intrarater ICCs from 0.63 to 0.79.

Conclusion: Deep Research provides superior scientific rigour, whereas the standard model offers better readability. A hybrid approach combining the strengths of both models may maximise educational effectiveness, though clinical oversight remains essential to mitigate misinformation risks. The observed differences were modest in magnitude, aligning with previously reported accuracy–readability trade-offs in LLMs, and these results should be interpreted as exploratory and hypothesis-generating.

Level of Evidence: Level IV, cross-sectional, comparative simulation study.
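For context, the FKGL and FRES scores reported above come from fixed formulas over two text statistics: average words per sentence and average syllables per word. The sketch below is a minimal Python illustration of those standard formulas, not the tooling used in the study; in particular, the vowel-group syllable counter is a crude heuristic, whereas published readability analyses typically rely on validated calculators.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; a placeholder for a validated syllable counter."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # crude silent-'e' adjustment
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, FRES) using the standard Flesch formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)            # words per sentence
    spw = syllables / len(words)                 # syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59       # Flesch-Kincaid Grade Level
    fres = 206.835 - 1.015 * wps - 84.6 * spw    # Flesch Reading Ease Score
    return fkgl, fres

if __name__ == "__main__":
    sample = ("Hip arthroscopy is a minimally invasive procedure. "
              "A surgeon inspects and repairs the joint through small incisions.")
    fkgl, fres = readability(sample)
    print(f"FKGL: {fkgl:.1f}  FRES: {fres:.1f}")
```

A higher FKGL indicates a higher school-grade reading level, while a higher FRES indicates easier text, which is why the abstract's finding that FKGL and FRES both favoured the standard model corresponds to its responses being easier to read.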