A subsample of the writing responses for each constructed-response item is scored by a second individual to obtain statistics on interrater agreement. Items with smaller numbers of responses (typically national samples) are 25 percent second-scored; items with larger numbers of responses (typically state samples) are 5 percent second-scored. The scoring supervisor also uses this interrater agreement information to monitor the capabilities of individual scorers and maintain uniformity of scoring across the scoring pool.
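The subsampling rule above can be sketched as follows. This is an illustrative assumption, not NAEP's implementation: the size cutoff separating "smaller" from "larger" item pools and the function name are both hypothetical.

```python
import random

# Assumed threshold distinguishing smaller (national) from larger (state)
# response pools; NAEP does not publish an exact cutoff in this passage.
NATIONAL_SAMPLE_CUTOFF = 2000

def select_second_scoring(response_ids, seed=0):
    """Return the subset of response IDs routed to a second scorer:
    25% for smaller pools, 5% for larger pools."""
    rate = 0.25 if len(response_ids) <= NATIONAL_SAMPLE_CUTOFF else 0.05
    rng = random.Random(seed)
    k = round(len(response_ids) * rate)
    return rng.sample(response_ids, k)
```

For example, an item with 100 responses would have 25 responses second-scored, while an item with 10,000 responses would have 500.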
Agreement reports are generated on demand by the scoring supervisor, trainer, scoring director, or item development subject-area coordinator, and printed copies are reviewed daily by the lead scoring staff. In addition to the immediate feedback provided by the online agreement reports, each scoring supervisor can review the actual responses scored by a scorer with the backreading tool. In this way, the scoring supervisor can monitor each scorer closely and correct scoring difficulties quickly and efficiently.
During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress using an interrater agreement tool. This display tool functions in either of two modes: comparing the scores awarded during first readings with those awarded during second readings, or comparing an individual scorer's scores with those of all other scorers.
The information is displayed as a matrix, with scores awarded during first readings in rows and scores awarded during second readings in columns (for mode one), or with the individual scorer's scores in rows and all other scorers' scores in columns (for mode two). In this format, instances of exact agreement fall along the diagonal of the matrix. For completeness, each cell of the matrix contains the number and percentage of cases of agreement (or disagreement). The display also shows the total number of second readings and the overall percentage of agreement on the item. Because the interrater agreement reports are cumulative, a printed copy of each item's agreement report is made periodically and compared with previously generated reports. Scoring staff members save printed copies of all final agreement reports and archive them with the training sets.
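The matrix display described above (mode one) can be sketched as a simple cross-tabulation. This is a minimal illustration of the structure, not NAEP's software; the function name and the 1-4 rubric in the example are assumptions.

```python
from collections import Counter

def agreement_matrix(first_scores, second_scores, levels):
    """Cross-tabulate first-reading scores (rows) against second-reading
    scores (columns). Exact agreement falls on the diagonal; the return
    value also includes the total number of second readings and the
    overall proportion of exact agreement."""
    counts = Counter(zip(first_scores, second_scores))
    matrix = {(r, c): counts.get((r, c), 0) for r in levels for c in levels}
    total = len(first_scores)
    exact = sum(matrix[(s, s)] for s in levels)
    return matrix, total, exact / total

# Example: seven responses scored on an assumed 1-4 rubric.
first = [1, 2, 2, 3, 4, 3, 2]
second = [1, 2, 3, 3, 4, 2, 2]
m, n, pct = agreement_matrix(first, second, levels=[1, 2, 3, 4])
```

Each cell count can be divided by the total to give the per-cell percentages shown in the display; summing the diagonal gives the overall agreement rate.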
NAEP interrater agreement statistics for the 2007 writing assessment differ significantly from those for the 2002 writing assessment, primarily because NAEP scoring policies and procedures evolved over time. In 2002, pair and group scoring (two or more raters discussing each student response) were standard practices widely employed for challenging items. In 2007, all operational items were 100 percent individually scored. The scoring procedures used for the writing assessment in 2011 were the same as those used in 2007.
| Year and subject | Item-by-item rater agreement | Interrater agreement ranges | Number of constructed-response items |
|---|---|---|---|

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2002-2011 Writing Assessments.