Knee Surgery, Sports Traumatology, Arthroscopy, 2025 (SCI-Expanded, Scopus)
Purpose: To compare the accuracy, readability and patient-centredness of responses generated by standard ChatGPT-4o and its retrieval-augmented Deep Research mode for hip arthroscopy patient education, addressing current uncertainty about the reliability of large language models in orthopaedic patient information.

Methods: Thirty standardised patient questions were derived through structured searches of reputable orthopaedic health information websites. Both ChatGPT configurations generated responses to each question independently. Two fellowship-trained orthopaedic surgeons independently rated each response on 5-point Likert scales (1 = poor, 5 = excellent) for accuracy, clarity, comprehensiveness and readability. Intra- and interrater reliabilities were calculated, and objective readability was evaluated using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES).

Results: Deep Research outperformed the standard model in accuracy (4.7 ± 0.4 vs. 4.0 ± 0.5; p = 0.012) and comprehensiveness (4.8 ± 0.3 vs. 3.9 ± 0.6; p < 0.001), whereas the standard model scored higher in clarity (4.6 ± 0.4 vs. 4.4 ± 0.5; p = 0.048). Readability Likert scores were comparable (p = 0.729), but FKGL and FRES favoured the standard model (both p < 0.001). Interrater intraclass correlation coefficients (ICCs) ranged from 0.57 to 0.83, and intrarater ICCs from 0.63 to 0.79.

Conclusion: Deep Research provides superior scientific rigour, whereas the standard model offers better readability. A hybrid approach combining the strengths of both models may maximise educational effectiveness, though clinical oversight remains essential to mitigate misinformation risks. The observed differences were modest in magnitude, consistent with previously reported accuracy–readability trade-offs in large language models, and these results should be interpreted as exploratory and hypothesis-generating.

Level of Evidence: Level IV, cross-sectional, comparative simulation study.
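
For reference, the standard published Flesch formulas underlying these two readability metrics are reproduced below (the abstract does not restate them, so it is assumed the conventional definitions were applied). Lower FKGL and higher FRES values indicate more accessible text.

\[
\mathrm{FRES} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\]

\[
\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
\]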