ICOLALS, Ankara, Türkiye, 11-12 October 2024
This study investigates the use of Artificial Intelligence (AI) in assessing English speaking skills, comparing the performance of three AI models (GPT-3.5, GPT-4, and GPT-4o) against human raters. Employing a many-facet Rasch model, the research analyzes the inter-rater reliability of both AI and human evaluations on speaking tasks. The AI models were prompted with specific instructions designed to mirror the assessment process of human raters. The study uses the Rasch-Cohen's kappa statistic to measure inter-rater reliability, and it also examines the accuracy, precision, and recall of the AI models in evaluating speaking performances. The findings indicate that while all three AI models demonstrated high inter-rater reliability, GPT-4o outperformed GPT-4 and GPT-3.5, exhibiting the most consistent scoring patterns.
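To make the agreement metrics named above concrete, the following is a minimal sketch in Python (scikit-learn) of how AI scores might be compared against human ratings. The band scores below are hypothetical, and this does not reproduce the study's Rasch-based analysis; it only illustrates Cohen's kappa, accuracy, precision, and recall on a shared rating scale.

```python
# Minimal sketch of rater-agreement metrics; data is hypothetical.
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    precision_score,
    recall_score,
)

# Hypothetical band scores (1-5) awarded to ten speaking performances.
human_scores = [3, 4, 2, 5, 3, 4, 1, 3, 4, 2]
ai_scores = [3, 4, 3, 5, 3, 4, 2, 3, 4, 2]  # e.g., one AI model's scores

# Quadratically weighted kappa is a common choice for ordinal rating scales.
kappa = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")

# Treating the human ratings as reference labels, compute the
# classification-style metrics (multi-class, hence weighted averaging).
accuracy = accuracy_score(human_scores, ai_scores)
precision = precision_score(
    human_scores, ai_scores, average="weighted", zero_division=0
)
recall = recall_score(human_scores, ai_scores, average="weighted", zero_division=0)

print(
    f"kappa={kappa:.2f} accuracy={accuracy:.2f} "
    f"precision={precision:.2f} recall={recall:.2f}"
)
```

Treating the human scores as the reference labels is an assumption of this sketch; the study itself compares AI and human raters as parallel facets within the Rasch framework rather than designating one as ground truth.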