
NAEP Technical Documentation: Constructed-Response Interrater Reliability

Reliability Statistics for Items Rescored for Current-Year Rater Consistency

Items from the Previous Assessment That Are Rescored During the Current Assessment

During the scoring of student responses, some responses to constructed-response items are rescored by a second rater for one of two reasons:

  • to determine how reliably the current raters are scoring responses to specific items; or 
  • to determine whether the current raters differ in their rating from the raters who had scored the same responses in previous years.

The statistics calculated for both of these purposes are the percentage of exact agreement, the intraclass correlation, and Cohen's Kappa (Cohen 1968). These measures are summarized in Kaplan and Johnson (1992) and Abedi (1996). Each measure has advantages and disadvantages in different situations. Agreement percentages vary considerably across items: on a simple two-point mathematics item, agreement should approach 100 percent, whereas on a complex six-point writing constructed-response item, agreement of 60 percent would be considered acceptable. The trend-year agreement percentage should approximate the interrater agreement from the prior NAEP administration: within 8 percentage points of the prior-year interrater agreement for two- and three-point items, and within 10 percentage points for four- to six-point items. For more information about subject-specific targets for interrater agreement, please refer to the TDW Scoring section.
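The trend-year tolerance check described above can be sketched as a small function. This is a minimal illustration only; the function name and signature are hypothetical and are not part of NAEP's scoring systems:

```python
def within_trend_tolerance(current_pct, prior_pct, max_score_points):
    """Check whether a trend-year exact-agreement percentage is acceptably
    close to the prior administration's interrater agreement:
    within 8 percentage points for 2- and 3-point items,
    within 10 percentage points for 4- to 6-point items."""
    if max_score_points in (2, 3):
        tolerance = 8.0
    elif max_score_points in (4, 5, 6):
        tolerance = 10.0
    else:
        raise ValueError("expected a 2- to 6-point item")
    return abs(current_pct - prior_pct) <= tolerance

print(within_trend_tolerance(88.0, 94.0, 3))  # True: a 6-point gap is within 8
print(within_trend_tolerance(72.0, 85.0, 5))  # False: a 13-point gap exceeds 10
```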

Cohen's Kappa quantifies the reliability of agreement between groups of raters while correcting for the agreement expected by chance. Kappa statistics should exceed 0.7 for two- and three-point items and 0.6 for four- to six-point items. Items whose reliability statistics fall below these thresholds prompt an investigation into the source of the low rater agreement. These criteria were created with NAEP's unique assessment design in mind. The percentage of exact agreement is provided for all constructed-response items, Cohen's Kappa for dichotomously scored constructed-response items, and the intraclass correlation for polytomously scored constructed-response items, for every NAEP assessment.
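As a rough sketch of how two of these statistics are defined (a pure-Python illustration, not NAEP's production scoring code; the paired rater scores below are hypothetical):

```python
from collections import Counter

def exact_agreement(first, second):
    """Percentage of responses on which two raters assigned the same score."""
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)

def cohens_kappa(first, second):
    """Cohen's Kappa: observed agreement corrected for the agreement expected
    by chance from each rater's marginal score distribution."""
    n = len(first)
    p_obs = sum(a == b for a, b in zip(first, second)) / n
    m1, m2 = Counter(first), Counter(second)
    p_exp = sum(m1[s] * m2[s] for s in m1.keys() | m2.keys()) / (n * n)
    if p_exp == 1.0:  # degenerate case: both raters used a single identical score
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical first- and second-rater scores on a three-point (0-2) item:
rater1 = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
rater2 = [0, 1, 2, 1, 1, 0, 1, 2, 0, 2]
print(exact_agreement(rater1, rater2))         # 80.0
print(round(cohens_kappa(rater1, rater2), 3))  # 0.697
```

Note that Kappa is lower than the raw agreement percentage because some of the observed agreement would be expected from the raters' score distributions alone.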

Links to score range, percentage of exact agreement, Cohen's Kappa, and intraclass correlation for constructed-response items, national assessments, by subject, year, and grade: Various years, 2000–2019
Subject | Year | Within-year: Gr 4 | Gr 8 | Gr 12 | Cross-year: Gr 4 | Gr 8 | Gr 12
Arts - Music | 2016 | † | R3 | † | † | R3 | †
 | 2008 | † | R3 | † | † | — | †
Arts - Visual arts | 2016 | † | R3 | † | † | R3 | †
 | 2008 | † | R3 | † | † | — | †
Civics | 2018 | † | R3 | † | † | R3 | †
 | 2014 | † | R3 | † | † | R3 | †
 | 2010 | R3 | R3 | R3 | R3 | R3 | R3
 | 2006 | R3 | R3 | R3 | R3 | R3 | R3
Economics | 2012 | † | † | R3 | † | † | R3
 | 2006 | † | † | R3 | † | † |
Geography | 2018 | † | R3 | † | † | R3 | †
 | 2014 | † | R3 | † | † | R3 | †
 | 2010 | R3 | R3 | R3 | R3 | R3 | R3
 | 2001 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2019 | R3 | R3 | R3 | R3 | R3 | R3
 | 2017 | R3 | R3 | † |  |  | †
 | 2015 | R3 | R3 | R3 | R3 | R3 | R3
 | 2013 | R3 | R3 | R3 | R3 | R3 | R3
 | 2011 | R3 | R3 | † | R3 | R3 | †
 | 2009 | R3 | R3 | R3 | R3 | R3 | R3
 | 2007 | R3 | R3 | † | R3 | R3 | †
 | 2005 | R3 | R3 | R3 | R3 | R3 | R3
 | 2003 | R3 | R3 | † | R3 | R3 | †
 | 2000 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2019 | R3 | R3 | R3 | R3 | R3 | R3
 | 2017 | R3 | R3 | † |  |  | †
 | 2015 | R3 | R3 | R3 | R3 | R3 | R3
 | 2013 | R3 | R3 | R3 | R3 | R3 | R3
 | 2011 | R3 | R3 | † | R3 | R3 | †
 | 2009 | R3 | R3 | R3 | R3 | R3 | R3
 | 2007 | R3 | R3 | † | R3 | R3 | †
 | 2005 | R3 | R3 | R3 | R3 | R3 | R3
 | 2003 | R3 | R3 | † | R3 | R3 | †
 | 2002 | R3 | R3 | R3 | R3 | R3 | R3
 | 2000 | R3 | † | † | R3 | † | †
Science | 2019 | R3 | R3 | R3 | R3 | R3 | R3
 | 2015 | R3 | R3 | R3 | R3 | R3 | R3
 | 2011 | † | R3 | † | † | R3 | †
 | 2009 | R3 | R3 | R3 | — | — | —
 | 2005 | R3 | R3 | R3 | R3 | R3 | R3
 | 2000 | R3 | R3 | R3 | R3 | R3 | R3
Technology and engineering literacy (TEL) | 2018 | † | R3 | † | † | R3 | †
 | 2014 | † | R3 | † | † | — | †
U.S. history | 2018 | † | R3 | † | † | R3 | †
 | 2014 | † | R3 | † | † | R3 | †
 | 2010 | R3 | R3 | R3 | R3 | R3 | R3
 | 2006 | R3 | R3 | R3 | R3 | R3 | R3
 | 2001 | R3 | R3 | R3 | R3 | R3 | R3
Writing | 2011 | † | R3 | R3 | † | — | —
 | 2007 | † | R3 | R3 | † | R3 | R3
 | 2002 | R3 | R3 | R3 | R3 | R3 | R3
— Not available. There are no cross-year reliability results for arts in 2008 due to changes in scoring procedures from previous assessment years. There are no cross-year reliability results for science in 2009 because it was the first year of a new science trend line. There are no cross-year reliability results for writing in 2011 because it was administered on computer for the first time, breaking trend with past writing assessments. There are no cross-year reliability results for TEL in 2014 because this was the first year in which this assessment was administered.
† Not applicable. Assessment not given at all grades.
NOTE: Because preliminary analyses of students' writing performance in the 2017 NAEP writing assessments at grades 4 and 8 revealed potentially confounding factors in measuring performance, results will not be publicly reported. Some of the NAEP assessments included in this table reference previous assessments (prior to 2000) that are not included in the technical documentation on the web. R3 is the accommodated reporting sample. If sampled students are classified as students with disabilities (SD) or English learners (EL), and school officials, using NAEP guidelines, determine that they can meaningfully participate in the NAEP assessment with accommodation, those students are included in the NAEP assessment with accommodation along with other sampled students including SD/EL students who do not need accommodations. The R3 sample is more inclusive than the R2 sample and excludes a smaller proportion of sampled students. The R3 sample is the only reporting sample used in NAEP after 2001. The block naming conventions used in the 2018 civics, geography, and U.S. history assessments are described in the document 2018 Block Naming Conventions in Data Products and TDW. The block naming conventions used in the 2019 mathematics, reading, and science assessments are described in the document 2019 Block Naming Conventions in Data Products and TDW. 

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), various years, 2000–2019 Assessments.

 

Links to score range, percentage of exact agreement, Cohen's Kappa, and intraclass correlation for constructed-response items, long-term trend assessments, by subject, year, and age: 2004, 2008, and 2012
Subject | Year | Within-year: Age 9 | Age 13 | Age 17 | Cross-year: Age 9 | Age 13 | Age 17
Mathematics long-term trend | 2012 | R3 | R3 | R3 | R3 | R3 | R3
 | 2008 | R3 | R3 | R3 | R3 | R3 | R3
 | 2004 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | †
Reading long-term trend | 2012 | R3 | R3 | R3 | R3 | R3 | R3
 | 2008 | R3 | R3 | R3 | R3 | R3 | R3
 | 2004 | R3 | R3 | R3 | R3 | R3 | R3
Reading long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | †
† Not applicable.
NOTE: R2 is the non-accommodated reporting sample; R3 is the accommodated reporting sample. If sampled students are classified as students with disabilities (SD) or English learners (EL), and school officials, using NAEP guidelines, determine that they can meaningfully participate in the NAEP assessment with accommodation, those students are included in the NAEP assessment with accommodation along with other sampled students including SD/EL students who do not need accommodations. The R3 sample is more inclusive than the R2 sample and excludes a smaller proportion of sampled students. The R3 sample is the only reporting sample used in NAEP after 2001. The R2 sample was used as the bridge sample type in the 2004 bridge studies to examine the comparability of scoring based on an assessment sample similar to those used for long-term trend (LTT) assessments in 2001 and earlier.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2004, 2008, and 2012 Mathematics and Reading Long-Term Trend Assessments.





Last updated 16 November 2023 (ML)