Examination of the Reliability of the Measurements Regarding the Written Expression Skills According to Different Test Theories

Yildirim Seheryeli, Merve; Tan, Şeref

doi:10.21031/epod.559470

JOURNAL OF MEASUREMENT AND EVALUATION IN EDUCATION AND PSYCHOLOGY-EPOD, vol.10, pp.327-347, 2019 (ESCI, Scopus, TRDizin)

Publication Type: Article / Full Article
Volume: 10
Publication Date: 2019
DOI Number: 10.21031/epod.559470
Journal Name: JOURNAL OF MEASUREMENT AND EVALUATION IN EDUCATION AND PSYCHOLOGY-EPOD
Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus, TR DİZİN (ULAKBİM)
Page Numbers: pp.327-347
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Gazi University: Yes

Abstract

The aim of the study is to examine the reliability estimates for an analytic rubric of written expression skills based on Classical Test Theory (CTT), Generalizability Theory (GT), and Item Response Theory (IRT), which differ in their fields of study. In this descriptive study, the stories of the 523 students in the study group were scored by seven raters. Under CTT, the eta coefficient indicated no difference between the raters' scores (eta = .926), and Cronbach's alpha coefficients were above .88. Under GT, the G and Phi coefficients were above .97. Students were differentiated as expected, the difficulty levels of the criteria did not change from one student to another, and the consistency among the raters' scores was excellent. Under IRT, parameters were estimated with Samejima's (1969) Graded Response Model, and item discrimination differed across raters. According to the b parameters, for all raters, an individual needs an ability level of at least -2.35, -0.80, and 0.41 to be scored above categories 0, 1, and 2, respectively, with a probability of .50. Marginal reliability coefficients were quite high (around .93). The Fisher Z' statistic was calculated to test the significance of the differences between all reliability estimates. GT provided more detailed information than CTT in explaining the sources of error variance and determining reliability, while IRT provided more detailed information than CTT in determining item-level error estimates and ability levels. There was a significant difference between the CTT and GT estimates of interrater reliability (p < .05), whereas there was no significant difference between the CTT and IRT estimates (p > .05).
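
The reading of the b parameters can be illustrated with a minimal sketch of the logistic form of the Graded Response Model, in which the probability of scoring above category k is 1 / (1 + exp(-a(theta - b_k))), so that an ability equal to b_k gives a probability of exactly .50. The three thresholds are the values reported in the abstract. The discrimination value a = 1.0 and the function names are assumptions made only for illustration, since the abstract states that discrimination differed across raters but does not report the values.

```python
import math

# Threshold (b) parameters reported in the abstract for categories 0, 1, 2.
B_THRESHOLDS = [-2.35, -0.80, 0.41]


def prob_above(theta, b, a=1.0):
    """P(score > category k | theta) under the graded response model.

    a is a hypothetical discrimination value, not taken from the study.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def category_probs(theta, thresholds=B_THRESHOLDS, a=1.0):
    """Probability of each score category 0..len(thresholds) at ability theta."""
    # Cumulative probabilities: P(score >= 0) = 1, then one per threshold, then 0.
    cum = [1.0] + [prob_above(theta, b, a) for b in thresholds] + [0.0]
    # Each category probability is the difference of adjacent cumulative values.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


if __name__ == "__main__":
    for k, b in enumerate(B_THRESHOLDS):
        print(f"theta = {b:+.2f}: P(score > {k}) = {prob_above(b, b):.2f}")
    print("Category probabilities at theta = 0:", category_probs(0.0))
```

Under these assumptions, a student whose ability equals a threshold has an even chance of being scored above the corresponding category, and the four category probabilities at any ability level sum to one.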