Artificial Intelligence Versus Human Raters in Assessing English-Speaking Skills


Anadol H. Ö., Gürdil H.

Language Teaching and Teacher Education, Prof. Dr. Paşa Tevfik Cephe, Asst. Prof. Dr. Mustafa Akın Güngör (Eds.), Nobel Yayın Dağıtım, Ankara, pp. 53-68, 2025

  • Publication Type: Book Chapter / Research Book
  • Publication Date: 2025
  • Publisher: Nobel Yayın Dağıtım
  • City of Publication: Ankara
  • Pages: pp. 53-68
  • Editors: Prof. Dr. Paşa Tevfik Cephe, Asst. Prof. Dr. Mustafa Akın Güngör
  • Gazi University Affiliated: Yes

Abstract

This study explores how well artificial intelligence (AI) systems can replicate human judgment in evaluating English-speaking skills. It focuses on the scoring reliability of three AI models (ChatGPT 3.5, GPT-4, and GPT-4o) and a human rater in assessing the speaking performances of 104 university students. Evaluations were based on a holistic rubric aligned with the Common European Framework of Reference (CEFR). Using the Many-Facet Rasch Model (MFRM), the study compared how closely the AI scores matched those of the human rater. All three AI tools showed high levels of reliability, but GPT-4o stood out for its strong alignment with human scoring patterns, suggesting it delivered the most stable and predictable results. A Rasch-Cohen's kappa value of 0.853 reflected strong agreement between the AI and human raters. These findings highlight the potential of AI to support assessment practices by offering consistent, objective, and scalable evaluations. However, the study also emphasizes the need for ongoing refinement to ensure AI assessments remain aligned with human judgment and educational standards.
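For readers unfamiliar with the agreement statistic cited above, the sketch below shows how a classical Cohen's kappa between one AI rater and one human rater can be computed in Python. The score arrays are hypothetical placeholders, not data from the study, and the chapter's reported value of 0.853 is a Rasch-Cohen's kappa estimated within the MFRM framework, which differs from the classical calculation illustrated here.

```python
# Minimal sketch: classical Cohen's kappa between one AI rater and one
# human rater. The scores below are hypothetical placeholders; the
# chapter's reported 0.853 is a Rasch-Cohen's kappa derived within the
# Many-Facet Rasch Model, not this classical statistic.
from sklearn.metrics import cohen_kappa_score

# Hypothetical holistic scores on a CEFR-aligned ordinal scale
# (e.g., 1 = A1 ... 6 = C2) for the same ten speaking performances
human_scores = [3, 4, 2, 5, 4, 3, 6, 2, 4, 5]
ai_scores    = [3, 4, 3, 5, 4, 3, 5, 2, 4, 5]

# Unweighted kappa: raw agreement corrected for chance agreement
kappa = cohen_kappa_score(human_scores, ai_scores)

# Quadratic-weighted kappa: penalizes larger disagreements more heavily,
# often preferred for ordinal rating scales such as CEFR levels
weighted_kappa = cohen_kappa_score(human_scores, ai_scores,
                                   weights="quadratic")

print(f"Cohen's kappa: {kappa:.3f}")
print(f"Quadratic-weighted kappa: {weighted_kappa:.3f}")
```

In practice, the choice between unweighted and weighted kappa matters for ordinal rubrics: an AI score one band away from the human score is a smaller error than one three bands away, and weighted variants reflect that.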