
Constructed-Response Interrater Reliability


During the scoring of student responses, responses to constructed-response items are rescored by a second rater for one of two reasons:

  • First, responses from the current assessment are rescored to determine how reliably the current
    raters are scoring responses to specific items.

  • Second, responses to items administered in both the current and previous assessment years are rescored. This second type of rescoring indicates whether the current raters' scores differ from those assigned to the same responses by raters in previous years.

The statistics calculated for both of these purposes are the percentage of exact agreement, the intraclass correlation, Cohen's Kappa (Cohen, 1968), and the Pearson product-moment correlation between the scores of the first and second raters. These measures are summarized in Zwick (1988), Kaplan and Johnson (1992), and Abedi (1996); each has advantages and disadvantages in different situations. Agreement percentages vary considerably across items: on a simple two-point mathematics item, agreement should approach 100 percent, whereas on a complex six-point writing constructed-response item, agreement of 60 percent would be considered acceptable. The trend-year agreement percentage should approximate the interrater agreement from the prior NAEP administration: within 8 percentage points for two- and three-point items, and within 10 percentage points for four- to six-point items.
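For illustration only, two of the statistics described above can be sketched from paired first- and second-rater scores for a single item. This is not NAEP's production scoring code, and the rater scores below are hypothetical; the sketch computes the percentage of exact agreement and the Pearson product-moment correlation.

```python
# Hedged sketch (hypothetical data, not NAEP's actual system): compute
# interrater statistics for one item from paired first/second-rater scores.

def exact_agreement(first, second):
    """Percentage of responses on which both raters assigned the same score."""
    matches = sum(1 for a, b in zip(first, second) if a == b)
    return 100.0 * matches / len(first)

def pearson_r(first, second):
    """Pearson product-moment correlation between the two raters' scores."""
    n = len(first)
    mean_a, mean_b = sum(first) / n, sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical rescore sample for a three-point item (scores 0-2).
rater1 = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
rater2 = [0, 1, 2, 1, 1, 0, 1, 2, 0, 2]
print(exact_agreement(rater1, rater2))      # → 80.0
print(round(pearson_r(rater1, rater2), 3))  # → 0.833
```

Under the criteria above, an 80 percent exact agreement would be evaluated against the score range of the item and, for cross-year rescores, against the agreement observed in the prior administration.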

Cohen's Kappa provides another measure of reliability between groups of scorers, one that accounts for agreement due to chance. Kappa statistics should be higher than 0.7 for two- and three-point items and higher than 0.6 for four- to six-point items. Items with reliability statistics considered too low prompt an investigation into the low rater agreement. These criteria were created with NAEP's unique assessment design in mind. The percentage of exact agreement for all constructed-response items, Cohen's Kappa for dichotomously scored constructed-response items, and the intraclass correlation for polytomously scored constructed-response items are provided for every NAEP assessment.
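A minimal sketch of unweighted Cohen's Kappa, which adjusts the observed exact agreement for the agreement expected by chance given each rater's marginal score distribution (hypothetical data; not NAEP's implementation):

```python
# Hedged sketch (hypothetical data): unweighted Cohen's Kappa for two raters.
from collections import Counter

def cohens_kappa(first, second):
    """Unweighted Cohen's Kappa for two raters' scores on the same responses."""
    n = len(first)
    observed = sum(1 for a, b in zip(first, second) if a == b) / n
    # Chance agreement: sum over score levels of the product of each
    # rater's marginal proportion for that score.
    freq_a, freq_b = Counter(first), Counter(second)
    expected = sum(freq_a[s] * freq_b.get(s, 0) for s in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical rescore sample for a three-point item (scores 0-2).
rater1 = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
rater2 = [0, 1, 2, 1, 1, 0, 1, 2, 0, 1]
kappa = cohens_kappa(rater1, rater2)
print(round(kappa, 3))  # → 0.846
print(kappa > 0.7)      # → True: meets the 0.7 criterion for a three-point item
```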

Score range, percent exact agreement, and Cohen's Kappa or intraclass correlation for constructed-response items, main assessments: 2000–2008

Subject             Year   Within-year reliability       Cross-year reliability
                           Grade 4  Grade 8  Grade 12    Grade 4  Grade 8  Grade 12
Arts (Music)        2008   †        R3       †           †        †        †
Arts (Visual Arts)  2008   †        R3       †           †        †        †
Civics              2006   R3       R3       R3          R3       R3       R3
Economics           2006   †        †        R3          †        †        †
Geography           2001   R3       R3       R3          R3       R3       R3
Mathematics         2007   R3       R3       †           R3       R3       †
Mathematics         2005   R3       R3       R3          R3       R3       R3
Mathematics         2003   R3       R3       †           R3       R3       †
Mathematics         2000   R3       R3       R3          R3       R3       R3
Reading             2007   R3       R3       †           R3       R3       †
Reading             2005   R3       R3       R3          R3       R3       R3
Reading             2003   R3       R3       †           R3       R3       †
Reading             2002   R3       R3       R3          R3       R3       R3
Reading             2000   R3       †        †           R3       †        †
Science             2005   R3       R3       R3          R3       R3       R3
Science             2000   R3       R3       R3          R3       R3       R3
U.S. history        2006   R3       R3       R3          R3       R3       R3
U.S. history        2001   R3       R3       R3          R3       R3       R3
Writing             2007   †        R3       R3          †        R3       R3
Writing             2002   R3       R3       R3          R3       R3       R3
† Not applicable. The 2008 arts assessment was conducted at grade 8 only; because its scoring procedures differed from those of previous arts assessments (music and visual arts, 1997), there are no rescore results. The 2006 economics assessment was conducted only at grade 12. The 2003 mathematics and reading assessments were conducted only at grades 4 and 8. The 2000 reading assessment was conducted only at grade 4.
NOTE: The R3 sample is the accommodated reporting sample; it includes students classified as students with disabilities (SD) or English language learners (ELL), plus SD/ELL students from sessions in which accommodations were allowed. The R3 sample is the only sample type used after 2001.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), various years, 2000–2008 assessments.

Score range, percent exact agreement, and Cohen's Kappa or intraclass correlation for constructed-response items, long-term trend assessment: 2004 and 2008

Subject                             Year   Within-year reliability    Cross-year reliability
                                           Age 9   Age 13  Age 17     Age 9   Age 13  Age 17
Mathematics long-term trend         2008   R3      R3      R3         R3      R3      R3
Mathematics long-term trend         2004   R3      R3      R3         R3      R3      R3
Mathematics long-term trend bridge  2004   R2      R2      R2         †       †       †
Reading long-term trend             2008   R3      R3      R3         R3      R3      R3
Reading long-term trend             2004   R3      R3      R3         R3      R3      R3
Reading long-term trend bridge      2004   R2      R2      R2         †       †       †
† Not applicable.
NOTE: R2 denotes data for the R2 reporting population. R3 denotes the accommodated reporting sample, which includes students classified as students with disabilities (SD) or English language learners (ELL), plus SD/ELL students from sessions in which accommodations were allowed. The R3 sample is the only sample type used after 2001.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2004 and 2008 Long-Term Trend Assessments.

Last updated 15 December 2011 (GF)
National Center for Education Statistics - http://nces.ed.gov
U.S. Department of Education