Within-Year Interrater Agreement


Within-year interrater agreement is monitored by routing some responses for a second scoring. For every item, the scoring system selects a subset of the current-year student responses to be scored a second time; scorers cannot tell which responses are being second-scored. The first and second scores for this subset are then compared to determine within-year agreement. The resulting agreement statistics are available to the scoring supervisor at any point during scoring, and within-year interrater agreement is monitored closely to ensure the quality of the scoring.
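NAEP's operational scoring system is not public; as a rough illustration only, exact agreement on a double-scored subset could be computed as follows (the function name and data are hypothetical):

```python
from typing import Sequence

def exact_agreement(first: Sequence[int], second: Sequence[int]) -> float:
    """Percent of double-scored responses whose first and second scores match exactly."""
    if not first or len(first) != len(second):
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)

# Example: 8 of 10 double-scored responses received identical scores.
print(exact_agreement([1, 2, 2, 3, 1, 2, 3, 3, 1, 2],
                      [1, 2, 2, 3, 1, 2, 3, 2, 2, 2]))  # 80.0
```

Because this statistic can be recomputed at any time, a supervisor can check it repeatedly as scoring proceeds rather than waiting for the end of a scoring session.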

Through 2009, NAEP used the following target standards for within-year agreement:

  • items scored on 2-point scales: 85 percent exact agreement,
  • items scored on 3-point scales: 80 percent exact agreement,
  • items scored on 4-point and 5-point scales: 75 percent exact agreement, and
  • items scored on 6-point scales: 60 percent exact agreement.
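A minimal sketch of a check against these pre-2010 targets, assuming a hypothetical helper that maps the number of scale points to the target listed above:

```python
def pre2010_target(scale_points: int) -> float:
    """Target percent exact agreement used through 2009, by scale length."""
    if scale_points == 2:
        return 85.0
    if scale_points == 3:
        return 80.0
    if scale_points in (4, 5):
        return 75.0
    if scale_points == 6:
        return 60.0
    raise ValueError(f"no target defined for {scale_points}-point scales")

def meets_target(agreement_pct: float, scale_points: int) -> bool:
    return agreement_pct >= pre2010_target(scale_points)

print(meets_target(82.0, 3))  # True: 82% meets the 80% target for 3-point items
print(meets_target(82.0, 2))  # False: below the 85% target for 2-point items
```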

Starting in 2010, NAEP uses a two-tier flagging system in which flags are determined separately for each subject. Items with slightly low interrater reliability (IRR) are flagged yellow to indicate mild concern; items of greater concern are flagged red. The flag thresholds were determined from historical scoring data for each subject. A red flag indicates an uncharacteristically low IRR given historical data: the red threshold is intended to be set at the fifth percentile of historical IRRs for a particular subject, grade, and score category, meaning that the IRR falls within the bottom 5 percent of a representative distribution of IRRs. The yellow threshold is intended to be set at the twentieth percentile of historical IRRs for a subject, grade, and score category. The word "intended" is used because sufficient historical data to set robust flags were not available for all subjects and score categories; in those cases, additional information was used to set the flags. Target standards are as follows:

Yellow flag
Subject        2-level   3-level   4-level   5 or more levels
Civics         80%       80%       †         †
Economics      90%       85%       80%       75%
Geography      95%       93%       85%       †
Mathematics    97%       94%       91%       91%
Reading        87%       82%       77%       77%
Science        92%       87%       86%       81%
U.S. History   87%       82%       80%       †
Writing        †         †         70%       61%
† Not applicable.

Red flag
Subject        2-level   3-level   4-level   5 or more levels
Civics         80%       75%       †         †
Economics      85%       80%       75%       70%
Geography      92%       85%       75%       †
Mathematics    94%       92%       90%       90%
Reading        85%       80%       75%       75%
Science        89%       84%       83%       78%
U.S. History   85%       80%       77%       †
Writing        †         †        70%       57%
† Not applicable.
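As an illustrative sketch (not NAEP's actual implementation), the two-tier classification can be expressed as a threshold lookup, and the percentile-based setting of thresholds can be sketched with the standard library; the threshold values below are the Reading and Mathematics entries from the tables above, while the historical IRR values are made-up placeholders:

```python
from statistics import quantiles

# Yellow/red thresholds (percent exact agreement) for two subjects, taken
# from the tables above; keys are (subject, number of score levels).
YELLOW = {("Reading", 2): 87, ("Reading", 3): 82, ("Reading", 4): 77,
          ("Mathematics", 2): 97, ("Mathematics", 3): 94, ("Mathematics", 4): 91}
RED = {("Reading", 2): 85, ("Reading", 3): 80, ("Reading", 4): 75,
       ("Mathematics", 2): 94, ("Mathematics", 3): 92, ("Mathematics", 4): 90}

def flag(subject: str, levels: int, agreement_pct: float) -> str:
    """Return 'red', 'yellow', or 'none' under the two-tier system."""
    key = (subject, levels)
    if agreement_pct < RED[key]:
        return "red"
    if agreement_pct < YELLOW[key]:
        return "yellow"
    return "none"

print(flag("Reading", 3, 81.0))    # yellow: below 82 but not below 80
print(flag("Reading", 3, 79.0))    # red: below 80
print(flag("Mathematics", 2, 98))  # none

# Setting thresholds from historical IRRs: red at the 5th percentile,
# yellow at the 20th (illustrative values, not real NAEP data).
history = [88, 90, 85, 92, 87, 91, 89, 86, 93, 90,
           84, 88, 91, 87, 89, 90, 92, 86, 88, 85]
red_cut = quantiles(history, n=20)[0]   # first of 19 cut points = 5th percentile
yellow_cut = quantiles(history, n=5)[0] # first of 4 cut points = 20th percentile
```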

Scoring staff must also be alert to downward changes in the within-year agreement for an item. For example, if first and second scores were in exact agreement 90 percent of the time in the morning (or on day 1 of scoring) but the rate of exact agreement declined to 82 percent in the afternoon (or on day 2), a problem may exist even if the overall within-year agreement remains above the minimum standard. Backreading and calibration are the tools used to monitor and correct declines in within-year agreement. If within-year agreement rates fall below the indicated standards for an item and the shortfall is believed to result primarily from inconsistent scoring, the item may be rescored. Decisions about rescoring items are made by test development staff and psychometricians in consultation with scoring staff and content coordinators.
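The morning-versus-afternoon comparison above can be sketched as a simple two-window check; the function, the 5-point drop threshold, and the score pairs are all illustrative assumptions, not part of NAEP's procedures:

```python
from typing import Sequence, Tuple

def agreement_decline(early: Sequence[Tuple[int, int]],
                      late: Sequence[Tuple[int, int]],
                      drop_threshold: float = 5.0):
    """Compare exact agreement in an earlier and a later scoring window.

    Returns (flagged, early_pct, late_pct), where flagged is True when the
    later window's agreement drops more than drop_threshold points.
    """
    def pct(pairs):
        return 100.0 * sum(a == b for a, b in pairs) / len(pairs)
    early_pct, late_pct = pct(early), pct(late)
    return early_pct - late_pct > drop_threshold, early_pct, late_pct

# Made-up (first, second) score pairs mirroring the example in the text:
morning = [(1, 1)] * 9 + [(1, 2)]         # 90% exact agreement
afternoon = [(1, 1)] * 41 + [(1, 2)] * 9  # 82% exact agreement
print(agreement_decline(morning, afternoon))  # (True, 90.0, 82.0)
```

Note that both windows clear an 80 percent standard on their own; only the comparison between windows reveals the decline.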

For more information on the estimation of reliability based on interrater agreement, see Analysis and Scaling.

Last updated 16 June 2014 (AN)