Comparison of Artificial Intelligence and Human Assessors in Evaluating English Speaking Skills


Anadol H. Ö., Gürdil H.

ICOLALS, Ankara, Türkiye, 11 - 12 October 2024

  • Publication Type: Conference Paper / Unpublished
  • City of Publication: Ankara
  • Country of Publication: Türkiye
  • Affiliated with Gazi Üniversitesi: Yes

Abstract

This study investigates the use of Artificial Intelligence (AI) in assessing English speaking skills, comparing the performance of three AI models (GPT-3.5, GPT-4, GPT-4o) against human raters. Employing a many-facet Rasch model, the research analyzes the inter-rater reliability of both AI and human evaluations on speaking tasks. The AI models were guided by specific prompts and instructions designed to mirror the assessment process of human raters. The study uses the Rasch-Cohen's kappa statistic to measure inter-rater reliability and also examines the accuracy, precision, and recall of the AI models in evaluating speaking performances. The findings indicate that all three AI models demonstrated high inter-rater reliability, with GPT-4o outperforming GPT-4 and GPT-3.5 and exhibiting the most consistent scoring patterns. The study also found a high degree of consistency between the human raters and the AI tools, suggesting the potential of AI to provide fair and objective assessments of English speaking skills. The implications of these findings are discussed in relation to the future of educational assessment, particularly AI's potential to enhance assessment quality, objectivity, and efficiency.
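To make the agreement and classification metrics named above concrete, the following is a minimal sketch, assuming paired ordinal band scores from one human rater and one AI model and using scikit-learn; the scores shown are hypothetical illustration data, not results from the study. Note also that the Rasch-Cohen's kappa reported by many-facet Rasch software adjusts for rater severity, whereas the plain Cohen's kappa computed here does not.

```python
# Minimal sketch: rater agreement and classification metrics for
# speaking-band scores. All scores below are hypothetical.
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_score, recall_score)

human = [3, 4, 2, 5, 4, 3, 5, 2, 4, 3]  # human rater's band scores
ai    = [3, 4, 3, 5, 4, 3, 4, 2, 4, 3]  # AI model's band scores

# Cohen's kappa: chance-corrected agreement between the two raters.
kappa = cohen_kappa_score(human, ai)

# Treating the human scores as reference labels: how often the AI
# assigns the same band (accuracy), and per-band precision/recall
# macro-averaged to handle the multi-class score scale.
acc  = accuracy_score(human, ai)
prec = precision_score(human, ai, average="macro", zero_division=0)
rec  = recall_score(human, ai, average="macro", zero_division=0)

print(f"kappa={kappa:.2f} accuracy={acc:.2f} "
      f"precision={prec:.2f} recall={rec:.2f}")
```

Macro averaging weights each score band equally, a reasonable default when some bands occur rarely in a small sample.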