ICOLALS, Ankara, Türkiye, 11-12 October 2024
This study investigates the use of Artificial Intelligence (AI) in assessing English speaking skills, comparing the performance of three AI models (GPT-3.5, GPT-4, and GPT-4o) against human raters. Employing a many-facet Rasch model, the research analyzes the inter-rater reliability of both AI and human evaluations on speaking tasks. The AI models were prompted with specific instructions designed to mirror the assessment process of human raters. The study uses the Rasch-Cohen's kappa statistic to measure inter-rater reliability, and it also examines the accuracy, precision, and recall of the AI models in evaluating speaking performances. The findings indicate that while all three AI models demonstrated high inter-rater reliability, GPT-4o outperformed GPT-4 and GPT-3.5, exhibiting the most consistent scoring patterns.
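To make the agreement metrics named above concrete, the following is a minimal sketch in Python (scikit-learn) of how AI scores might be compared against human ratings. The band scores below are hypothetical, and this does not reproduce the study's Rasch-based analysis; it only illustrates Cohen's kappa, accuracy, precision, and recall on a shared rating scale.

```python
# Minimal sketch of rater-agreement metrics; data is hypothetical.
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    precision_score,
    recall_score,
)

# Hypothetical band scores (1-5) awarded to ten speaking performances.
human_scores = [3, 4, 2, 5, 3, 4, 1, 3, 4, 2]
ai_scores = [3, 4, 3, 5, 3, 4, 2, 3, 4, 2]  # e.g., one AI model's scores

# Quadratically weighted kappa is a common choice for ordinal rating scales.
kappa = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")

# Treating the human ratings as reference labels, compute the
# classification-style metrics (multi-class, hence weighted averaging).
accuracy = accuracy_score(human_scores, ai_scores)
precision = precision_score(
    human_scores, ai_scores, average="weighted", zero_division=0
)
recall = recall_score(human_scores, ai_scores, average="weighted", zero_division=0)

print(
    f"kappa={kappa:.2f} accuracy={accuracy:.2f} "
    f"precision={precision:.2f} recall={recall:.2f}"
)
```

Treating the human scores as the reference labels is an assumption of this sketch; the study itself compares AI and human raters as parallel facets within the Rasch framework rather than designating one as ground truth.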