Artificial Intelligence Versus Human Raters in Assessing English-Speaking Skills


Anadol H. Ö., Gürdil H.

Language Teaching and Teacher Education, Prof. Dr. Paşa Tevfik Cephe, Asst. Prof. Dr. Mustafa Akın Güngör (Eds.), Nobel Yayın Dağıtım, Ankara, pp. 53-68, 2025

  • Publication Type: Book Chapter / Research Book
  • Publication Date: 2025
  • Publisher: Nobel Yayın Dağıtım
  • City of Publication: Ankara
  • Pages: pp. 53-68
  • Editors: Prof. Dr. Paşa Tevfik Cephe, Asst. Prof. Dr. Mustafa Akın Güngör
  • Gazi University Affiliated: Yes

Abstract

This study explores how well artificial intelligence (AI) systems can replicate human judgment in evaluating English-speaking skills. It focuses on the scoring reliability of three AI models (ChatGPT 3.5, GPT-4, and GPT-4o) and a human rater in assessing the speaking performances of 104 university students. Evaluations were based on a holistic rubric aligned with the Common European Framework of Reference (CEFR). Using the Many-Facet Rasch Model (MFRM), the study compared how closely the AI scores matched those of the human rater. All three AI tools showed high levels of reliability, but GPT-4o stood out for its strong alignment with human scoring patterns, suggesting it delivered the most stable and predictable results. A Rasch-Cohen's kappa value of 0.853 reflected strong agreement between the AI and human raters. These findings highlight the potential of AI to support assessment practices by offering consistent, objective, and scalable evaluations. However, the study also emphasizes the need for ongoing refinement to ensure AI assessments remain aligned with human judgment and educational standards.
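For readers unfamiliar with the agreement statistic cited above, the sketch below shows how a classical Cohen's kappa between one AI rater and one human rater can be computed in Python. The score arrays are hypothetical placeholders, not data from the study, and the chapter's reported value of 0.853 is a Rasch-Cohen's kappa estimated within the MFRM framework, which differs from the classical calculation illustrated here.

```python
# Minimal sketch: classical Cohen's kappa between one AI rater and one
# human rater. The scores below are hypothetical placeholders; the
# chapter's reported 0.853 is a Rasch-Cohen's kappa derived within the
# Many-Facet Rasch Model, not this classical statistic.
from sklearn.metrics import cohen_kappa_score

# Hypothetical holistic scores on a CEFR-aligned ordinal scale
# (e.g., 1 = A1 ... 6 = C2) for the same ten speaking performances
human_scores = [3, 4, 2, 5, 4, 3, 6, 2, 4, 5]
ai_scores    = [3, 4, 3, 5, 4, 3, 5, 2, 4, 5]

# Unweighted kappa: raw agreement corrected for chance agreement
kappa = cohen_kappa_score(human_scores, ai_scores)

# Quadratic-weighted kappa: penalizes larger disagreements more heavily,
# often preferred for ordinal rating scales such as CEFR levels
weighted_kappa = cohen_kappa_score(human_scores, ai_scores,
                                   weights="quadratic")

print(f"Cohen's kappa: {kappa:.3f}")
print(f"Quadratic-weighted kappa: {weighted_kappa:.3f}")
```

In practice, the choice between unweighted and weighted kappa matters for ordinal rubrics: an AI score one band away from the human score is a smaller error than one three bands away, and weighted variants reflect that.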