
Constructed-Response Interrater Reliability

      


During the scoring of student responses, some responses to constructed-response items are rescored by a second rater for one of two reasons:

  • First, responses from the current assessment are rescored to determine how reliably the current raters are scoring responses to specific items.
  • Second, responses to items administered in both the current year and previous years are rescored. This second type of rescoring indicates whether the current raters differ in their scoring from the raters who scored the same responses in previous years.

The statistics calculated for both of these purposes are the percentage of exact agreement, the intraclass correlation, Cohen's Kappa (Cohen 1968), and the Pearson product-moment correlation between the scores of the first and second raters. These measures are summarized in Zwick (1988), Kaplan and Johnson (1992), and Abedi (1996). Each measure has advantages and disadvantages in different situations. Agreement percentages vary substantially across items: on a simple two-point mathematics item, agreement should approach 100 percent, whereas on a complex six-point writing constructed-response item, an agreement of 60 percent would be considered acceptable. The trend-year agreement percentage should approximate the interrater agreement from the prior NAEP administration: within 8 percentage points for two- and three-point items and within 10 percentage points for four- to six-point items. For more information about subject-specific targets for interrater agreement, refer to the 2012 TDW Scoring section.
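Two of the statistics above, percent exact agreement and the Pearson product-moment correlation, can be sketched in a few lines. This is an illustrative sketch only, not NAEP's scoring system; the function names and sample scores are invented for the example.

```python
# Illustrative sketch (not NAEP's production code): two interrater
# reliability statistics for a pair of raters scoring the same responses.

def exact_agreement(first, second):
    """Percentage of responses on which the two raters gave identical scores."""
    if not first or len(first) != len(second):
        raise ValueError("score lists must be non-empty and equal in length")
    matches = sum(1 for a, b in zip(first, second) if a == b)
    return 100.0 * matches / len(first)

def pearson_r(first, second):
    """Pearson product-moment correlation between the two raters' scores."""
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical scores for six responses to a three-point item.
rater1 = [0, 1, 2, 2, 1, 0]
rater2 = [0, 1, 2, 1, 1, 0]
print(exact_agreement(rater1, rater2))  # five of six scores match
print(pearson_r(rater1, rater2))
```

Note that the two statistics capture different failures: exact agreement drops whenever raters disagree at all, while the Pearson correlation stays high if one rater is systematically one point harsher, which is why NAEP reports several measures side by side.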

Cohen's Kappa quantifies reliability between groups of scorers while accounting for agreement due to chance. Kappa statistics should be higher than 0.7 for two- and three-point items and higher than 0.6 for four- to six-point items. Items whose reliability statistics fall below these criteria prompt an investigation into the low rater agreement. These criteria were created with NAEP's unique assessment design in mind. The percentage of exact agreement for all constructed-response items, Cohen's Kappa for dichotomously scored constructed-response items, and the intraclass correlation for polytomously scored constructed-response items are provided for every NAEP assessment.
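The chance correction in Cohen's Kappa works by comparing the observed agreement rate with the agreement expected if both raters assigned scores independently according to their own marginal distributions. A minimal sketch, again illustrative rather than NAEP's actual code, with the threshold check reflecting the criteria stated above:

```python
from collections import Counter

def cohens_kappa(first, second):
    """Cohen's Kappa: exact agreement corrected for chance agreement."""
    n = len(first)
    p_observed = sum(1 for a, b in zip(first, second) if a == b) / n
    # Chance agreement: probability both raters pick the same category
    # if each scored independently with their observed marginal rates.
    marg_a = Counter(first)
    marg_b = Counter(second)
    categories = set(first) | set(second)
    p_chance = sum((marg_a[c] / n) * (marg_b[c] / n) for c in categories)
    if p_chance == 1.0:
        return 1.0  # degenerate case: neither rater's scores vary
    return (p_observed - p_chance) / (1 - p_chance)

def meets_kappa_target(kappa, max_score_points):
    """Apply the criteria above: 0.7 for 2-3 point items, 0.6 for 4-6 point items."""
    threshold = 0.7 if max_score_points <= 3 else 0.6
    return kappa >= threshold
```

For example, two raters who agree on half the responses of a two-point item, with balanced marginals, have 50 percent chance agreement and thus a Kappa of 0, which is why Kappa is a stricter check than raw percent agreement.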

Links to score range, percent exact agreement, and Cohen's Kappa or intraclass correlation for constructed-response items, national assessments: Various years, 2000–2012
Subject             Year   Within-year reliability        Cross-year reliability
                           Grade 4   Grade 8   Grade 12   Grade 4   Grade 8   Grade 12
Arts - Music        2008   †         R3        †          †         †         †
Arts - Visual       2008   †         R3        †          †         †         †
Civics              2010   R3        R3        R3         R3        R3        R3
                    2006   R3        R3        R3         R3        R3        R3
Economics           2012   †         †         R3         †         †         R3
                    2006   †         †         R3         †         †         †
Geography           2010   R3        R3        R3         R3        R3        R3
                    2001   R3        R3        R3         R3        R3        R3
Mathematics         2011   R3        R3        †          R3        R3        †
                    2009   R3        R3        R3         R3        R3        R3
                    2007   R3        R3        †          R3        R3        †
                    2005   R3        R3        R3         R3        R3        R3
                    2003   R3        R3        †          R3        R3        †
                    2000   R3        R3        R3         R3        R3        R3
Reading             2011   R3        R3        †          R3        R3        †
                    2009   R3        R3        R3         R3        R3        R3
                    2007   R3        R3        †          R3        R3        †
                    2005   R3        R3        R3         R3        R3        R3
                    2003   R3        R3        †          R3        R3        †
                    2002   R3        R3        R3         R3        R3        R3
                    2000   R3        †         †          R3        †         †
Reading vocabulary  2011   †         †         †          †         †         †
Science             2011   †         R3        †          †         R3        †
                    2009   R3        R3        R3         †         †         †
                    2005   R3        R3        R3         R3        R3        R3
                    2000   R3        R3        R3         R3        R3        R3
U.S. history        2010   R3        R3        R3         R3        R3        R3
                    2006   R3        R3        R3         R3        R3        R3
                    2001   R3        R3        R3         R3        R3        R3
Writing             2011   †         R3        R3         †         †         †
                    2007   †         R3        R3         †         R3        R3
                    2002   R3        R3        R3         R3        R3        R3
† Not applicable. Economics is assessed only at grade 12. In 2011, there were no grade 12 score results for science, mathematics, reading, and reading vocabulary. Only grade 8 was assessed for science in 2011, and only grades 8 and 12 were assessed for writing in 2011. There are no rescore results for science 2009 or arts 2008 because their scoring procedures changed from previous assessment years. The 2008 arts assessment was conducted at grade 8 only. The 2003 mathematics and reading assessments were conducted only at grades 4 and 8. The 2000 reading assessment was conducted only at grade 4.
NOTE: R3 is the accommodated reporting sample. It samples students who are classified as students with disabilities (SD) or English language learners (ELL), plus SD/ELL students from sessions in which accommodations were allowed.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), various years, 2000-2012 Assessments.

Links to score range, percent exact agreement, and Cohen's Kappa or intraclass correlation for constructed-response items, long-term trend assessment: 2004, 2008, and 2012
Subject                             Year   Within-year reliability    Cross-year reliability
                                           Age 9   Age 13   Age 17    Age 9   Age 13   Age 17
Mathematics long-term trend         2012   R3      R3       R3        R3      R3       R3
                                    2008   R3      R3       R3        R3      R3       R3
                                    2004   R3      R3       R3        R3      R3       R3
Mathematics long-term trend bridge  2004   R2      R2       R2        †       †        †
Reading long-term trend             2012   R3      R3       R3        R3      R3       R3
                                    2008   R3      R3       R3        R3      R3       R3
                                    2004   R3      R3       R3        R3      R3       R3
Reading long-term trend bridge      2004   R2      R2       R2        †       †        †
† Not applicable.
NOTE: R2 is the non-accommodated reporting sample; R3 is the accommodated reporting sample, which samples students who are classified as students with disabilities (SD) or English language learners (ELL), plus SD/ELL students from sessions in which accommodations were allowed. The R3 sample is more inclusive and excludes a smaller proportion of sampled students. The R3 sample type was the only sample type used in NAEP after 2001.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2004, 2008, and 2012 Long-Term Trend Assessments.

Last updated 11 March 2016 (GF)