
NAEP Item Scoring Process

The National Assessment of Educational Progress (NAEP) uses a combination of multiple-choice and constructed-response items (questions) in its assessment instruments. For multiple-choice items, students are required to select an answer from a list of options; responses are electronically scanned and scored. For constructed-response items, students are required to provide their own answers; responses are scanned and then scored by qualified and trained scorers using a scoring guide and an electronic image-processing and scoring system.

Scoring all NAEP items in an objective, consistent, and valid fashion is a key program goal. As outlined in the summary of the NAEP item scoring process (40K PDF), the NAEP scoring process comprises a number of steps that occur during three general phases: scoring guide development and pilot, first operational scoring (or pre-calibration), and subsequent operational scoring. In all phases, quality control and validity checks are implemented in the scanning, processing, and scoring of multiple-choice items. The following sections describe key steps in the NAEP scoring process, focusing in particular on constructed-response items:

Initial Development of Scoring Guides

NAEP staff uses standardized scoring guides to govern the scoring of constructed-response items. The scoring guides are designed to ensure that scorers follow a single standard and that scores are assigned consistently and fairly. Except for the writing assessment, which may have up to six score levels, NAEP scoring guides have two, three, four, or five score levels, depending on the subject and specific item type. General score-level categories are defined in the assessment framework for each subject, and the specific criteria required at each score level are defined in the scoring guide for each constructed-response item. The test developers who write the items develop the initial scoring guides, which are then revised as the items are refined during the item review process.


Review of Initial Scoring Guides

Both the items and scoring guides are reviewed and refined in a thorough review process:

  1. The scoring guides are reviewed as part of all item reviews conducted by the test development contractors. 
  2. NCES reviews the draft items and scoring guides with the help of its content-area standing committee.
  3. Items and scoring guides are reviewed by state testing officials and curriculum specialists during state item reviews.
  4. The National Assessment Governing Board (NAGB) Assessment Sub-Committee and Planning Committee review the items and scoring guides before the pilot test.


Preparation of Scoring Guides and Training Materials for the Pilot Test

All NAEP items are pilot tested to evaluate performance before operational use. These pilot items are administered to a national sample of students, with approximately 500 student responses per item.

Prior to the scoring of the pilot test, the NAEP standing committee (the educators, subject matter specialists, and curriculum experts who work with NCES and contractors to oversee the development of the assessment) reviews the scoring guides in relation to initial sets of student responses from the pilot to ensure that the scoring guides make the correct distinctions among levels of performance, and that the scores can be assigned objectively, consistently, and accurately. The committee also oversees the selection of student responses that will be included as examples to illustrate the different score level categories and as practice papers for the scorer training packets. At this point, the scoring guides are finalized for the pilot test.


Preparation for Scoring Operational Assessments

Based on a review of the pilot test data, the final set of new items is selected for the operational assessments. In addition to statistical checks conducted on both multiple-choice and constructed-response items, documentation from debriefings held after the scoring of the pilot tests is also reviewed to determine how well the constructed-response items and scoring guides function. Scoring guides are included in all item reviews conducted after the pilot test. Items and/or scoring guides may be refined during this review process. NCES and NAGB sign off on the final assessment instruments as described in the item development process.


First Operational Scoring (and Field Test for Pre-Calibration)

The first time that items are included in an operational assessment, the scoring guides and training packets must be finalized. These materials will be used throughout the life of the items and the scoring of items must be consistent from assessment to assessment to ensure the accurate measurement and reporting of trends in student achievement.

To expedite the reporting of NAEP results for mathematics and reading within six months of an operational administration, an additional pre-test administration (field test) is conducted after the pilot test and one year before the first operational assessment to “pre-calibrate” the items for scaling. The pre-calibration administration for mathematics and reading provides data to evaluate the psychometric performance of items so that decisions can be made before the first operational assessment. For the pre-calibration administration, the sample size is approximately 2,500 responses per item, which is larger than that for pilot tests but smaller than in a full-scale operational assessment. In terms of the scoring process, the pre-calibration administration for mathematics and reading serves the same function as the first operational administration for other NAEP subjects: the scoring guides and training packets must be finalized at that point to ensure that the scores obtained in the operational assessment are consistent with those from the “pre-calibration.”

Prior to scoring the pre-calibration administration in mathematics and reading and the first operational administration in all other subjects, the NCES standing committee will again review the scoring guides in light of student responses and select appropriate training packets for the operational assessments, particularly where refinements to items and/or scoring guides were made after the pilot test.


Subsequent Operational Scoring

After the first operational administration (or pre-calibration for the mathematics and reading assessments that require results six months after each operational assessment), changes are rarely made to the scoring guides. Trend scoring procedures are implemented to ensure the consistency of scoring across assessment years. Sets of responses from previous years are used to qualify scorers during the training process and to monitor consistency between the current and adjacent year’s scores.

After each operational administration, documentation on the scoring procedures and decisions made during training and scoring is updated. This is important for the trend scoring process to ensure that scoring trainers in subsequent years know how specific types of scoring issues were addressed. In particular, if new types of responses are encountered in subsequent years, additional notes are added to the scoring and training materials to document how the scoring guide was applied for these types of responses.


Procedures for Assuring Consistent, Valid, and Objective Scoring

The NAEP scoring process involves a rigorous, multilayered system of training and checking to ensure that scoring of all NAEP items is accurate, consistent, and valid. Both multiple-choice and constructed-response items are scanned and processed electronically. In addition to implementing numerous quality control and validity checks for the scoring of multiple-choice items, the NAEP scoring contractor uses a proprietary electronic image processing and scoring system for constructed-response items. This system provides for efficient scoring as well as the capability to conduct a variety of quality control checks throughout the scoring process. The following quality assurance processes are implemented in the scoring of NAEP constructed-response items:

Identifying Qualified Scorers

  • Hiring and placement of scorers. All prospective scorers for NAEP are carefully screened by the NAEP scoring contractor and must meet specific criteria. These criteria include at least a bachelor’s degree, sometimes in a specific related subject area, particularly for the scoring of the twelfth-grade assessment items. In addition, potential scorers for many NAEP subject areas are administered a scorer placement test to identify scorers qualified to score at each grade level.
  • Training of scorers. Content and scoring experts train all scorers on each item by extensively reviewing the scoring guides, discussing the anchor papers, and having scorers score and discuss sets of practice papers. “Live” scoring begins only after scorers have demonstrated high levels of accuracy.
  • Qualification of scorers. Prospective scorers are also required to pass a “qualifying set” for extended constructed-response items with multiple score categories or other items that are identified by test developers and scoring directors as particularly challenging to score. Each scorer must score a set of papers that has been pre-scored by NAEP content and scoring experts. If the scorer does not have a high enough level of agreement with the pre-assigned scores (70 percent or more), he or she is not allowed to score that item.
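
The qualifying-set check described above amounts to a simple exact-agreement calculation against pre-scored papers. The sketch below is purely illustrative (the function name and score values are hypothetical, not part of NAEP's actual scoring system); it applies the 70 percent threshold mentioned above:

```python
# Illustrative sketch of the 70-percent qualification check.
# Names and data are hypothetical, not NAEP's actual system.
def qualifies(scorer_scores, expert_scores, threshold=0.70):
    """Return True if the scorer's exact agreement with the
    pre-assigned expert scores meets the qualification threshold."""
    if len(scorer_scores) != len(expert_scores):
        raise ValueError("score lists must be the same length")
    matches = sum(s == e for s, e in zip(scorer_scores, expert_scores))
    return matches / len(expert_scores) >= threshold

# Example: 8 of 10 papers match the expert scores (80% agreement).
print(qualifies([3, 2, 4, 1, 3, 2, 2, 4, 1, 3],
                [3, 2, 4, 1, 3, 2, 1, 4, 2, 3]))  # True
```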


Ensuring Ongoing Quality

  • Reliability scoring. A minimum of 25 percent of student responses for the pilot test, the pre-calibration administration, and the national operational administrations are double-scored to monitor the reliability between scorers (inter-rater reliability). For the state operational administrations, only 5 percent of student responses are double-scored due to the much larger combined national/state sample size. NAEP staff monitors the inter-rater reliability throughout scoring and tracks the percent exact agreement between multiple scorers. If reliability scoring does not meet NAEP’s high standards for consistency, the entire set of responses is re-scored. If an individual scorer is unable to score accurately and consistently, he or she is removed from scoring that particular item.
  • Back reading. Performance scoring specialists who have strong knowledge of NAEP assessments monitor scoring accuracy on an ongoing basis. This monitoring is accomplished through randomly checking responses already scored by the scorers. When problems with scorers are identified, those scorers are retrained or removed from scoring that item.
  • Statistical monitoring. Data are collected on the quality of scoring across all scorers on every item. Scorers who do not meet NAEP’s high standards are retrained or removed from NAEP scoring.
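
The percent-exact-agreement statistic tracked during reliability scoring can be illustrated with a short calculation over double-scored responses. This is a hypothetical sketch, not the scoring contractor's actual system, and the score values are invented for the example:

```python
# Illustrative sketch: percent exact agreement between two scorers
# on a set of double-scored responses (hypothetical data).
def exact_agreement(first_scores, second_scores):
    """Percent of double-scored responses given identical scores."""
    pairs = list(zip(first_scores, second_scores))
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

first = [2, 3, 1, 4, 2, 2, 3, 1]   # first scorer's scores
second = [2, 3, 1, 3, 2, 2, 3, 2]  # second scorer's scores
print(exact_agreement(first, second))  # 75.0
```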


Maintaining Consistency Across Time

  • Trend scoring. Student responses from earlier years are scored alongside those from the current year to ensure that papers are scored consistently from year to year. Sets of trend responses are scored throughout the scoring period and monitored to ensure that NAEP criteria for accurate and reliable trend scoring are met. The inter-rater reliability between current and previous year scoring must be comparable to the previous year’s exact agreement from within-year reliability scoring. The mean score on the item for the trend set is also monitored to ensure that there has not been any significant drift compared to previous assessment years. Scorers must meet trend criteria before beginning to score current year papers. Scorers must also regularly meet trend requirements throughout the scoring process in order to continue current-year scoring.
  • Re-calibration. Periodically scorers are given “calibration papers” to ensure that they continue to score papers in the same way throughout the scoring period. Calibration may use papers taken from training sets, trend sets, or current-year responses. If scorers do not score calibration papers in a manner that is consistent with NAEP’s high standards, retraining is conducted.
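
The drift check on a trend set's mean score can be sketched as a comparison of the current mean against the previous year's mean. The tolerance value and data below are illustrative assumptions, not actual NAEP criteria:

```python
# Illustrative sketch: monitoring mean-score drift on a trend set.
# The tolerance and scores are assumptions, not NAEP criteria.
def mean(scores):
    return sum(scores) / len(scores)

def drift_ok(current_scores, previous_mean, tolerance=0.1):
    """Flag whether the trend set's current mean score stays within
    an acceptable distance of the previous year's mean."""
    return abs(mean(current_scores) - previous_mean) <= tolerance

# Current trend-set mean is 2.5, versus 2.55 in the previous year.
print(drift_ok([2, 3, 2, 3, 2, 2, 3, 3], 2.55))  # True
```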


Delivery of Scores for Data Analysis

After scoring is completed, the NAEP scoring contractor creates data files containing the scores for both multiple-choice and constructed-response items and delivers these to the contractor responsible for data analysis.


Last updated 03 November 2005 (AA)