
Reliability Statistics for Items Rescored for Current-Year Rater Consistency
Items from the Previous Assessment That Are Rescored During the Current Assessment
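During the scoring of student responses, responses to constructed-response items are rescored by a second rater for one of two reasons: to check rater consistency within the current year, or to check that items from the previous assessment are scored consistently when they are rescored during the current assessment.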
The statistics calculated for both of these purposes are the percentage of exact agreement, the intraclass correlation, Cohen's Kappa (Cohen, 1968), and the Pearson product-moment correlation between the scores of the first and second raters. These measures are summarized in Zwick (1988), Kaplan and Johnson (1992), and Abedi (1996). Each measure has advantages and disadvantages in different situations.

Agreement percentages vary considerably across items. On a simple two-point mathematics item, agreement should approach 100 percent. By contrast, for a complex six-point writing constructed-response item, an agreement of 60 percent would be considered acceptable. The trend-year agreement percentage should approximate the interrater agreement from the prior NAEP administration: within 8 percent of the prior-year interrater agreement for two- and three-point items, and within 10 percent for four- to six-point items.
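As an illustration of the agreement measures described above, the following is a minimal Python sketch of the percentage of exact agreement, the Pearson product-moment correlation, and the trend-year tolerance check. The function names are hypothetical, and the sketch assumes that "within 8 percent" and "within 10 percent" mean a difference in percentage points; it is not NAEP's scoring software.

```python
from math import sqrt


def exact_agreement(first, second):
    """Percentage of responses on which both raters assigned the same score."""
    if len(first) != len(second) or not first:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)


def pearson(first, second):
    """Pearson product-moment correlation between two raters' scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when a rater's scores are constant")
    return cov / sqrt(var_a * var_b)


def trend_within_tolerance(current_pct, prior_pct, score_points):
    """Check the trend-year agreement against the prior-year agreement.

    Assumption: the tolerance is read as percentage points
    (8 for two- and three-point items, 10 for four- to six-point items).
    """
    if score_points in (2, 3):
        tolerance = 8.0
    elif 4 <= score_points <= 6:
        tolerance = 10.0
    else:
        raise ValueError("criteria are stated only for two- to six-point items")
    return abs(current_pct - prior_pct) <= tolerance
```

For example, `exact_agreement([0, 1, 1, 2], [0, 1, 2, 2])` counts three matching scores out of four, and `trend_within_tolerance(85, 90, 3)` compares a five-point difference against the eight-point tolerance for a three-point item.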
Cohen's Kappa is another measure of reliability between groups of scorers, one that accounts for agreement due to chance. Kappa statistics should be higher than 0.7 for two- and three-point items and higher than 0.6 for four- to six-point items. Items whose reliability statistics are considered too low prompt an investigation into the low rater agreement. These criteria were created with NAEP's unique assessment design in mind. The percentage of exact agreement for all constructed-response items, Cohen's Kappa for dichotomously scored constructed-response items, and the intraclass correlation for polytomously scored constructed-response items are provided for every NAEP assessment.
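The chance correction and the thresholds above can be sketched in Python as follows. This is a minimal illustration using the unweighted form of kappa, (observed agreement - expected agreement) / (1 - expected agreement); whether NAEP applies a weighted variant is not stated here, so the unweighted form is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa for two raters scoring the same responses."""
    if len(first) != len(second) or not first:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(first)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance, from each rater's marginal score frequencies.
    count_a = Counter(first)
    count_b = Counter(second)
    expected = sum(count_a[score] * count_b[score] for score in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single score")
    return (observed - expected) / (1.0 - expected)


def kappa_meets_threshold(kappa, score_points):
    """Apply the stated criteria: above 0.7 for 2-3 points, above 0.6 for 4-6 points."""
    if score_points in (2, 3):
        return kappa > 0.7
    if 4 <= score_points <= 6:
        return kappa > 0.6
    raise ValueError("criteria are stated only for two- to six-point items")
```

With scores `[0, 0, 1, 1]` and `[0, 0, 1, 0]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5, which falls below the 0.7 criterion for a two-point item even though three of four scores match.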
| Subject | Year | Within-year reliability, grade 4 | Within-year reliability, grade 8 | Within-year reliability, grade 12 | Cross-year reliability, grade 4 | Cross-year reliability, grade 8 | Cross-year reliability, grade 12 |
|---|---|---|---|---|---|---|---|
| Arts (Music) | 2008 | † | R3 | † | † | † | † |
| Arts (Visual Arts) | 2008 | † | R3 | † | † | † | † |
| Civics | 2006 | R3 | R3 | R3 | R3 | R3 | R3 |
| Economics | 2006 | † | † | R3 | † | † | † |
| Geography | 2001 | R3 | R3 | R3 | R3 | R3 | R3 |
| Mathematics | 2007 | R3 | R3 | † | R3 | R3 | † |
| Mathematics | 2005 | R3 | R3 | R3 | R3 | R3 | R3 |
| Mathematics | 2003 | R3 | R3 | † | R3 | R3 | † |
| Mathematics | 2000 | R3 | R3 | R3 | R3 | R3 | R3 |
| Reading | 2007 | R3 | R3 | † | R3 | R3 | † |
| Reading | 2005 | R3 | R3 | R3 | R3 | R3 | R3 |
| Reading | 2003 | R3 | R3 | † | R3 | R3 | † |
| Reading | 2002 | R3 | R3 | R3 | R3 | R3 | R3 |
| Reading | 2000 | R3 | † | † | R3 | † | † |
| Science | 2005 | R3 | R3 | R3 | R3 | R3 | R3 |
| Science | 2000 | R3 | R3 | R3 | R3 | R3 | R3 |
| U.S. history | 2006 | R3 | R3 | R3 | R3 | R3 | R3 |
| U.S. history | 2001 | R3 | R3 | R3 | R3 | R3 | R3 |
| Writing | 2007 | † | R3 | R3 | † | R3 | R3 |
| Writing | 2002 | R3 | R3 | R3 | R3 | R3 | R3 |

† Not applicable. The 2008 arts assessment was conducted at grade 8 only. Due to changes in scoring procedures that differ from previous arts assessments (music and visual arts 1997), there are no rescore results. The 2006 economics assessment was conducted only at grade 12. The 2003 mathematics and reading assessments were conducted only at grades 4 and 8. The 2000 reading assessment was conducted only at grade 4.

NOTE: The R3 sample is the accommodated reporting sample; it samples students who are classified as students with disabilities (SD) or English language learners (ELL) plus SD/ELL students from sessions in which accommodations were allowed. The R3 sample is the only sample type used after 2001.

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), Various Years, 2000-2008 Assessments.
| Subject | Year | Within-year reliability, age 9 | Within-year reliability, age 13 | Within-year reliability, age 17 | Cross-year reliability, age 9 | Cross-year reliability, age 13 | Cross-year reliability, age 17 |
|---|---|---|---|---|---|---|---|
| Mathematics long-term trend | 2008 | R3 | R3 | R3 | R3 | R3 | R3 |
| Mathematics long-term trend | 2004 | R3 | R3 | R3 | R3 | R3 | R3 |
| Mathematics long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | † |
| Reading long-term trend | 2008 | R3 | R3 | R3 | R3 | R3 | R3 |
| Reading long-term trend | 2004 | R3 | R3 | R3 | R3 | R3 | R3 |
| Reading long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | † |

† Not applicable.

NOTE: The R2 links are links to data for the R2 reporting population. The R3 links are links to the accommodated reporting sample; it samples students who are classified as students with disabilities (SD) or English language learners (ELL) plus SD/ELL students from sessions in which accommodations were allowed. The R3 sample is the only sample type used after 2001.

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2004 and 2008 Long-Term Trend Assessments.