
NAEP Technical Documentation: Constructed-Response Interrater Reliability

Reliability Statistics for Items Rescored for Current-Year Rater Consistency

Items from the Previous Assessment That Are Rescored During the Current Assessment

During the scoring of student responses, some responses to constructed-response items are rescored by a second rater for one of two reasons:

  • to determine how reliably the current raters are scoring responses to specific items; or 
  • to determine whether the current raters differ in their rating from the raters who had scored the same responses in previous years.

The statistics calculated for both of these purposes are the percentage of exact agreement, the intraclass correlation, and Cohen's Kappa (Cohen 1968). These measures are summarized in Kaplan and Johnson (1992) and Abedi (1996). Each measure has advantages and disadvantages in different situations. Agreement percentages vary considerably across items: on a simple two-point mathematics item, agreement should approach 100 percent, whereas on a complex six-point writing constructed-response item, agreement of 60 percent would be considered acceptable. The trend-year agreement percentage should approximate the interrater agreement from the prior NAEP administration: within 8 percentage points of the prior-year interrater agreement for two- and three-point items, and within 10 percentage points for four- to six-point items. For more information about subject-specific targets for interrater agreement, please refer to the TDW Scoring section.
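The trend-year tolerance check described above can be sketched as a small function. This is a minimal illustration only; the function name and signature are hypothetical and are not part of NAEP's scoring systems:

```python
def within_trend_tolerance(current_pct, prior_pct, max_score_points):
    """Check whether a trend-year exact-agreement percentage is acceptably
    close to the prior administration's interrater agreement:
    within 8 percentage points for 2- and 3-point items,
    within 10 percentage points for 4- to 6-point items."""
    if max_score_points in (2, 3):
        tolerance = 8.0
    elif max_score_points in (4, 5, 6):
        tolerance = 10.0
    else:
        raise ValueError("expected a 2- to 6-point item")
    return abs(current_pct - prior_pct) <= tolerance

print(within_trend_tolerance(88.0, 94.0, 3))  # True: a 6-point gap is within 8
print(within_trend_tolerance(72.0, 85.0, 5))  # False: a 13-point gap exceeds 10
```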

Cohen's Kappa quantifies the reliability of agreement between groups of raters while correcting for the agreement expected by chance. Kappa statistics should exceed 0.7 for two- and three-point items and 0.6 for four- to six-point items. Items whose reliability statistics fall below these thresholds prompt an investigation into the source of the low rater agreement. These criteria were created with NAEP's unique assessment design in mind. The percentage of exact agreement is provided for all constructed-response items, Cohen's Kappa for dichotomously scored constructed-response items, and the intraclass correlation for polytomously scored constructed-response items, for every NAEP assessment.
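As a rough sketch of how two of these statistics are defined (a pure-Python illustration, not NAEP's production scoring code; the paired rater scores below are hypothetical):

```python
from collections import Counter

def exact_agreement(first, second):
    """Percentage of responses on which two raters assigned the same score."""
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)

def cohens_kappa(first, second):
    """Cohen's Kappa: observed agreement corrected for the agreement expected
    by chance from each rater's marginal score distribution."""
    n = len(first)
    p_obs = sum(a == b for a, b in zip(first, second)) / n
    m1, m2 = Counter(first), Counter(second)
    p_exp = sum(m1[s] * m2[s] for s in m1.keys() | m2.keys()) / (n * n)
    if p_exp == 1.0:  # degenerate case: both raters used a single identical score
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical first- and second-rater scores on a three-point (0-2) item:
rater1 = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
rater2 = [0, 1, 2, 1, 1, 0, 1, 2, 0, 2]
print(exact_agreement(rater1, rater2))         # 80.0
print(round(cohens_kappa(rater1, rater2), 3))  # 0.697
```

Note that Kappa is lower than the raw agreement percentage because some of the observed agreement would be expected from the raters' score distributions alone.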

Links to score range, percentage of exact agreement, Cohen's Kappa, and intraclass correlation for constructed-response items, national assessments, by subject, year, and grade: Various years, 2000–2019
Subject | Year | Within-year: Gr 4 | Gr 8 | Gr 12 | Cross-year: Gr 4 | Gr 8 | Gr 12
Arts - Music | 2016 | † | R3 | † | † | R3 | †
 | 2008 | † | R3 | † | † | — | †
Arts - Visual arts | 2016 | † | R3 | † | † | R3 | †
 | 2008 | † | R3 | † | † | — | †
Civics | 2018 | † | R3 | † | † | R3 | †
 | 2014 | † | R3 | † | † | R3 | †
 | 2010 | R3 | R3 | R3 | R3 | R3 | R3
 | 2006 | R3 | R3 | R3 | R3 | R3 | R3
Economics | 2012 | † | † | R3 | † | † | R3
 | 2006 | † | † | R3 | † | † |
Geography | 2018 | † | R3 | † | † | R3 | †
 | 2014 | † | R3 | † | † | R3 | †
 | 2010 | R3 | R3 | R3 | R3 | R3 | R3
 | 2001 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2019 | R3 | R3 | R3 | R3 | R3 | R3
 | 2017 | R3 | R3 | † |  |  | †
 | 2015 | R3 | R3 | R3 | R3 | R3 | R3
 | 2013 | R3 | R3 | R3 | R3 | R3 | R3
 | 2011 | R3 | R3 | † | R3 | R3 | †
 | 2009 | R3 | R3 | R3 | R3 | R3 | R3
 | 2007 | R3 | R3 | † | R3 | R3 | †
 | 2005 | R3 | R3 | R3 | R3 | R3 | R3
 | 2003 | R3 | R3 | † | R3 | R3 | †
 | 2000 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2019 | R3 | R3 | R3 | R3 | R3 | R3
 | 2017 | R3 | R3 | † |  |  | †
 | 2015 | R3 | R3 | R3 | R3 | R3 | R3
 | 2013 | R3 | R3 | R3 | R3 | R3 | R3
 | 2011 | R3 | R3 | † | R3 | R3 | †
 | 2009 | R3 | R3 | R3 | R3 | R3 | R3
 | 2007 | R3 | R3 | † | R3 | R3 | †
 | 2005 | R3 | R3 | R3 | R3 | R3 | R3
 | 2003 | R3 | R3 | † | R3 | R3 | †
 | 2002 | R3 | R3 | R3 | R3 | R3 | R3
 | 2000 | R3 | † | † | R3 | † | †
Science | 2019 | R3 | R3 | R3 | R3 | R3 | R3
 | 2015 | R3 | R3 | R3 | R3 | R3 | R3
 | 2011 | † | R3 | † | † | R3 | †
 | 2009 | R3 | R3 | R3 | — | — | —
 | 2005 | R3 | R3 | R3 | R3 | R3 | R3
 | 2000 | R3 | R3 | R3 | R3 | R3 | R3
Technology and engineering literacy (TEL) | 2018 | † | R3 | † | † | R3 | †
 | 2014 | † | R3 | † | † | — | †
U.S. history | 2018 | † | R3 | † | † | R3 | †
 | 2014 | † | R3 | † | † | R3 | †
 | 2010 | R3 | R3 | R3 | R3 | R3 | R3
 | 2006 | R3 | R3 | R3 | R3 | R3 | R3
 | 2001 | R3 | R3 | R3 | R3 | R3 | R3
Writing | 2011 | † | R3 | R3 | † | — | —
 | 2007 | † | R3 | R3 | † | R3 | R3
 | 2002 | R3 | R3 | R3 | R3 | R3 | R3
— Not available. There are no cross-year reliability results for arts in 2008 due to changes in scoring procedures from previous assessment years. There are no cross-year reliability results for science in 2009 because it was the first year of a new science trend line. There are no cross-year reliability results for writing in 2011 because it was administered on computer for the first time, breaking trend with past writing assessments. There are no cross-year reliability results for TEL in 2014 because this was the first year in which this assessment was administered.
† Not applicable. Assessment not given at all grades.
NOTE: Because preliminary analyses of students' writing performance in the 2017 NAEP writing assessments at grades 4 and 8 revealed potentially confounding factors in measuring performance, results will not be publicly reported. Some of the NAEP assessments included in this table reference previous assessments (prior to 2000) that are not included in the technical documentation on the web. R3 is the accommodated reporting sample. If sampled students are classified as students with disabilities (SD) or English learners (EL), and school officials, using NAEP guidelines, determine that they can meaningfully participate in the NAEP assessment with accommodation, those students are included in the NAEP assessment with accommodation along with other sampled students including SD/EL students who do not need accommodations. The R3 sample is more inclusive than the R2 sample and excludes a smaller proportion of sampled students. The R3 sample is the only reporting sample used in NAEP after 2001. The block naming conventions used in the 2018 civics, geography, and U.S. history assessments are described in the document 2018 Block Naming Conventions in Data Products and TDW. The block naming conventions used in the 2019 mathematics, reading, and science assessments are described in the document 2019 Block Naming Conventions in Data Products and TDW. 

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), various years, 2000–2019 Assessments.

 

Links to score range, percentage of exact agreement, Cohen's Kappa, and intraclass correlation for constructed-response items, long-term trend assessments, by subject, year, and age: 2004, 2008, and 2012
Subject | Year | Within-year: Age 9 | Age 13 | Age 17 | Cross-year: Age 9 | Age 13 | Age 17
Mathematics long-term trend | 2012 | R3 | R3 | R3 | R3 | R3 | R3
 | 2008 | R3 | R3 | R3 | R3 | R3 | R3
 | 2004 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | †
Reading long-term trend | 2012 | R3 | R3 | R3 | R3 | R3 | R3
 | 2008 | R3 | R3 | R3 | R3 | R3 | R3
 | 2004 | R3 | R3 | R3 | R3 | R3 | R3
Reading long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | †
† Not applicable.
NOTE: R2 is the non-accommodated reporting sample; R3 is the accommodated reporting sample. If sampled students are classified as students with disabilities (SD) or English learners (EL), and school officials, using NAEP guidelines, determine that they can meaningfully participate in the NAEP assessment with accommodation, those students are included in the NAEP assessment with accommodation along with other sampled students including SD/EL students who do not need accommodations. The R3 sample is more inclusive than the R2 sample and excludes a smaller proportion of sampled students. The R3 sample is the only reporting sample used in NAEP after 2001. The R2 sample was used as the bridge sample type in the 2004 bridge studies to examine the comparability of scoring based on an assessment sample similar to those used for long-term trend (LTT) assessments in 2001 and earlier.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2004, 2008, and 2012 Mathematics and Reading Long-Term Trend Assessments.





Last updated 16 November 2023 (ML)