Interrater reliability statistics were obtained by having raters second-score the current year's constructed-response items within each subject area. Procedures for obtaining these statistics vary according to the samples and assessments specific to each assessment year. Prior to 2002, separate samples were drawn for the NAEP national main assessment and the state assessment. Raters second-scored at least 25 percent of the responses to each constructed-response item that appeared in the national main assessment and at least 6 percent of the responses to each constructed-response item that appeared in the state assessment. Since 2002, the sample drawn for the national main assessment has been a combined sample that integrates the state assessment sample, and 5 percent of the responses to each constructed-response item administered are second-scored. This integrated sample design is used in combined national and state assessments, which are administered in odd-numbered years (2003, 2005, 2007, etc.). In even-numbered years, the NAEP assessment is given to a national-only sample, and 25 percent of the responses to constructed-response items are second-scored. In addition to serving an evaluative purpose during analysis, the percentage of exact agreement has always been used by scoring team leaders to monitor the capabilities of all raters and to maintain uniformity of scoring across raters.
For each item scaled for each subject area and grade, the number of papers with responses that were scored a second time is listed in the tables that link from the page Constructed-response Interrater Reliability. The tables also list the percentage of exact agreement between raters and an index of reliability based on those responses. Cohen's kappa (Cohen 1968) is the reliability estimate used for dichotomized items, and the intraclass correlation is the reliability estimate used for polytomously scored items. NAEP item numbers and the block that contains each item are provided for reference. The codes from the NAEP database that denote the range of responses and, where appropriate, the correct responses are also provided.
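As a rough illustration of the two statistics reported for dichotomized items, the following Python sketch computes the percentage of exact agreement and an unweighted Cohen's kappa from paired first and second scores. The function names and the sample scores are hypothetical and are not drawn from NAEP data or software. With only two score categories, weighted kappa with symmetric weights reduces to this unweighted form. The intraclass correlation used for polytomously scored items is not shown, because its exact form is not specified here.

```python
from collections import Counter


def percent_exact_agreement(first, second):
    """Percentage of responses that received the same score from both raters."""
    if len(first) != len(second) or not first:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)


def cohens_kappa(first, second):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(first)
    if n != len(second) or n == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    # Observed proportion of exact agreement.
    p_o = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance, from each rater's marginal score counts.
    counts_first = Counter(first)
    counts_second = Counter(second)
    p_e = sum(counts_first[k] * counts_second[k] for k in counts_first) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical first and second scores for one dichotomized item (1 = correct).
first_scores = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
second_scores = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

print(percent_exact_agreement(first_scores, second_scores))  # 80.0
print(round(cohens_kappa(first_scores, second_scores), 3))   # 0.583
```

In this made-up example the raters agree on 8 of 10 responses, but because chance alone would produce agreement on about half of them, kappa is noticeably lower than the raw agreement rate.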
For the 2000 and 2001 assessments, this website provides tables for both samples administered in these years: the sample for which accommodations were permitted, and the sample for which accommodations were not permitted. The responses for each item were scored at the same time, regardless of the sample for which the student who wrote the response was selected. Therefore, only one set of reliability statistics for items rescored for current-year scorer consistency is available for the 2000 and 2001 assessments. Starting in 2002, accommodations were permitted for all assessments, with one exception in 2004: accommodations were allowed for the long-term trend assessment but not for the long-term trend bridge study.