NAEP Scoring


Three types of cognitive items are scored for NAEP. Multiple-choice item responses are captured by high-speed scanners during student booklet processing. Short constructed-response items (typically those with two or three valid score points) and extended constructed-response items (typically those with four or more valid score points) are scored by trained personnel using images of student responses also captured during processing.

Scoring a large number of short and extended constructed responses with a high level of accuracy and reliability within a limited time frame is essential to the success of NAEP. To ensure reliable, efficient scoring, NAEP takes the following steps:

  • develops focused, explicit scoring guides that match the criteria delineated in the assessment frameworks;

  • recruits qualified and experienced scorers, trains them, and verifies their ability to score particular questions through qualifying tests;

  • employs an image-processing and scoring system that routes images of student responses directly to the scorers so they can focus on scoring rather than paper routing;

  • monitors scorer consistency through ongoing reliability checks, including second scoring;

  • assesses the quality of scorer decision-making through frequent monitoring by NAEP assessment experts; and

  • documents all training, scoring, and quality control procedures in the technical reports.

The table below presents a general overview of recent NAEP scoring activities.

Processing and scoring totals, national assessments, by subject area and year: various years, 2000–2012
Year  Subject area  Grade  Number of booklets scored  Number of constructed responses scored  Number of individual cognitive items  Number of team leaders  Number of scorers
NOTE: Number of responses scored includes second scores. Data for 2011 mathematics, reading, and science represent national and state assessments; data for 2011 writing represent the national assessment only. The 2011 writing assessment was computer-based. Data for 2010 civics, geography, and U.S. history were combined into one social sciences assessment; trainers and teams scored a mixture of items from all three subject areas.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), Various Assessments, 2000–2012.
2012 Economics 12 10,950 75,229 34 3 19
2011 Mathematics 4,8 388,638 3,786,422 172 19 151
Reading 4,8 382,205 2,819,950 90 23 256
Science 8 122,409 1,544,669 96 17 178
Writing 8,12 52,452 104,958 44 17 183
2010 Civics 4,8,12 26,771 261,989 119 23 153
Geography 4,8,12 26,608 366,543 172 23 153
U.S. History 4,8,12 30,987 387,625 167 23 153
2009 Mathematics 4,8,12 380,042 4,293,561 298 16 175
Reading 4,8,12 392,196 3,709,299 311 30 336
Science 4,8,12 331,967 4,592,470 412 45 430
2008 Arts 8 7,865 181,854 92 6 57
2007 Mathematics 4,8 422,200 3,912,835 435 38 187
Reading 4,8 457,800 3,623,126 346 51 362
Writing 8,12 205,500 729,940 40 50 328
2006 U.S. History 4,8,12 38,400 458,172 132 21 65
Civics 4,8,12 33,200 282,977 84 20 65
Economics 12 17,600 128,735 32 8 30
2005 Mathematics 4,8,12 354,500 4,435,831 414 26 267
Reading 4,8,12 340,200 3,773,691 226 36 363
Science 4,8,12 349,100 4,424,511 539 39 393
2003 Mathematics 4,8 349,600 4,719,464 135 33 418
Reading 4,8 350,700 3,913,147 136 32 397
2002 Reading 4,8,12 308,500 4,023,861 150 33 330
Writing 4,8,12 285,900 608,269 60 29 270
2001 Geography 4,8,12 27,500 381,477 57 9 81
U.S. History 4,8,12 32,700 399,182 47 9 81
2000 Mathematics 4,8,12 253,900 3,856,211 199 16 177
Reading 4 8,500 123,100 46 14 702
Science 4,8,12 240,900 4,398,021 295 20 155

The table below presents a general overview of recent NAEP Long Term Trend scoring activities.

Processing and scoring totals, long-term trend assessments, by subject area and age: 2004, 2008 and 2012
Year  Subject area  Age  Number of booklets scored  Number of constructed responses scored  Number of cognitive items
2012 Mathematics long-term trend 9, 13, 17 26,210 422,192 181
Reading long-term trend 9, 13, 17 26,352 47,241 19
2008 Mathematics long-term trend 9, 13, 17 28,465 452,994 179
Reading long-term trend 9, 13, 17 26,621 51,743 19
2004 Mathematics long-term trend 9, 13, 17 40,300 1,082,923 219
Reading long-term trend 9, 13, 17 41,200 131,496 34
NOTE: Number of responses scored includes second scores.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2004, 2008 and 2012.

As new NAEP items are created, tested, and improved, test development staff create scoring guides that use a range of actual student responses, captured by the materials-processing staff, as specific examples. The scoring and test development staffs create training materials that match the assessment framework criteria. For future assessments, continuous documentation ensures that scoring staff will train on and score each item in the same way it was originally scored. This repeatability allows reporting on trends in student performance over time.

NAEP Scoring Staff

Scorers score student responses. Scoring supervisors provide logistical support to the trainers and help monitor team activities. Trainers are responsible for training both scorers and supervisors on specific content and for assuring that team scoring performance meets expectations. Content leads for each subject area (Reading, Science, etc.) oversee the trainers and provide support as needed.

Scorers must have a minimum of a baccalaureate degree from a four-year college or university. An advanced degree, scoring experience, and/or teaching experience are preferred. In some subjects, scorers must complete a placement test used to identify scorers with appropriate content knowledge. Scoring teams are trained so that each student response can be scored consistently. Following training, for all extended constructed-response items and for some short constructed-response items with particularly complex scoring guides, each scorer is given a pre-scored qualification set of student responses to score. Qualification standards for each item vary according to the number of score levels for the item. Individual scorer results are retained for all qualification sets.

Scoring supervisors and trainers are selected based upon many factors including their previous experience, educational and professional backgrounds, demonstration of a strong understanding of the scoring criteria, and strong interpersonal communication skills and organizational abilities.

NAEP scoring teams usually consist of 10–12 scorers led by a scoring supervisor and a trainer. Prior to the scoring effort, all personnel are intensively trained. The trainers who train the scorers, the supervisors who oversee a group of scorers, and the scorers themselves are all given both general scoring training and item-specific content training.

NAEP Scoring System

The NAEP electronic scoring system uses secure network communications to transmit images of student responses to the trained scorers and to receive the scores they assign. Student responses are scanned from the original test booklets; the actual booklets can be accessed and referenced if needed. The scorer sees each student response in isolation on a computer screen and assigns a score; the scorer cannot access any other responses from the same student. As each response is scored, another is presented, until all responses for an item have been scored.

During scoring, the NAEP electronic scoring system provides documentation of numerous scoring metrics. Reports on item and scoring performance can be retrieved as needed. In addition, custom reports of daily activities are sent out nightly to development, scoring, and analysis staff to monitor NAEP scoring quality and progress.

All assessments are scored item by item so that scorers train on one item and one scoring guide at a time. This method is efficient only with electronic presentation of student responses.

NAEP Scoring Procedures

During the scoring of a particular item, a percentage of scored responses is randomly recirculated by the system to be rescored by a second scorer in order to check the consistency of current-year scoring. Five percent of responses are second-scored for large state samples, and 25 percent of responses are second-scored for smaller national samples. This comparison of first and second scores yields the within-year interrater agreement.
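As a rough illustration only, the second-scoring check described above can be sketched in a few lines of Python. The function names and data layout here are hypothetical, not part of the NAEP scoring system; the sketch simply draws a random sample of responses for rescoring and computes the share of exact first/second score matches.

```python
import random

def second_score_sample(response_ids, fraction=0.05, seed=0):
    """Randomly select a fraction of scored responses for rescoring
    (e.g., 5% for large state samples, 25% for smaller national samples)."""
    rng = random.Random(seed)
    k = round(len(response_ids) * fraction)
    return rng.sample(response_ids, k)

def exact_agreement(first_scores, second_scores):
    """Within-year interrater agreement: the share of second-scored
    responses where both scorers assigned the same score."""
    pairs = list(zip(first_scores, second_scores))
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs)

# Hypothetical example: eight responses second-scored on a 3-point item.
first = [1, 2, 3, 2, 1, 3, 2, 2]
second = [1, 2, 3, 1, 1, 3, 2, 3]
print(exact_agreement(first, second))  # 0.75
```

In practice, agreement statistics of this kind are computed item by item, since qualification standards and scoring difficulty vary with the number of score levels.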

In addition, NAEP trend scoring is used to check the consistency of scoring over time (i.e., cross-year interrater agreement). During trend scoring, the NAEP electronic scoring system presents a pool of scored responses from a prior assessment to current scorers. Comparing the current scores with those assigned in the prior assessment yields reports that evaluate scoring consistency over time for a specific NAEP item.
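The cross-year comparison can be sketched the same way. The record layout below (item identifier, prior-year score, current rescoring) is illustrative and not the NAEP schema; the sketch computes per-item agreement between the two scoring efforts.

```python
from collections import defaultdict

def cross_year_agreement(trend_pool):
    """Compute per-item cross-year agreement from a trend pool.
    Each record is (item_id, prior_score, current_score); the field
    names and tuple layout are hypothetical."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for item_id, prior, current in trend_pool:
        totals[item_id] += 1
        matches[item_id] += (prior == current)
    return {item: matches[item] / totals[item] for item in totals}

# Hypothetical pool spanning two items.
pool = [
    ("R1", 2, 2), ("R1", 1, 1), ("R1", 3, 2), ("R1", 2, 2),
    ("R2", 0, 0), ("R2", 1, 1),
]
print(cross_year_agreement(pool))  # {'R1': 0.75, 'R2': 1.0}
```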

Backreading of current year responses ensures frequent monitoring of scorer decision-making by supervisory staff. Backreading allows the supervisor to review responses (with scores assigned) already scored by each scorer and to assure that each scorer is applying the scoring guide correctly. About 5 percent of each scorer's output is monitored through backreading.
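Backreading differs from second scoring in that the sample is drawn per scorer rather than from the pooled responses, so every scorer is monitored. A minimal sketch, again with a hypothetical data layout:

```python
import random

def backreading_queue(scored, rate=0.05, seed=0):
    """Select roughly 5% of each scorer's completed responses for
    supervisor review.  'scored' maps a scorer id to the list of
    response ids that scorer has completed; the layout is illustrative,
    not the NAEP system's."""
    rng = random.Random(seed)
    queue = {}
    for scorer, responses in scored.items():
        # Review at least one response per scorer, even for small outputs.
        k = max(1, round(len(responses) * rate))
        queue[scorer] = rng.sample(responses, k)
    return queue

# Hypothetical output of two scorers.
scored = {"scorer_a": list(range(40)), "scorer_b": list(range(60))}
queue = backreading_queue(scored)
print({s: len(r) for s, r in queue.items()})  # {'scorer_a': 2, 'scorer_b': 3}
```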

During training and scoring, any changes to existing documentation are captured by scoring staff, shared across scoring teams, and incorporated into the history of the NAEP item. This is reviewed prior to the next scoring effort.

Last updated 28 June 2014 (GF)