The National Assessment of Educational Progress (NAEP) uses a combination of multiple-choice and constructed-response items (questions) in its assessment instruments. For multiple-choice items, students are required to select an answer from a list of options; responses are electronically scanned and scored. For constructed-response items, students are required to provide their own answers; responses are scanned and then scored by qualified and trained scorers using a scoring guide and an electronic image-processing and scoring system.
Scoring all NAEP items in an objective, consistent, and valid fashion is a key program goal. As outlined in the summary of the NAEP item scoring process, the NAEP scoring process comprises a number of steps that occur during three general phases: scoring guide development and pilot, first operational scoring (or pre-calibration), and subsequent operational scoring. In all phases of scoring, quality control and validity checks are implemented in the scanning, processing, and scoring of multiple-choice items. The following sections further describe key steps in the NAEP scoring process, focusing in particular on constructed-response items:
NAEP staff uses standardized scoring guides to govern the scoring of constructed-response items. The scoring guides are designed to ensure that scorers follow a single standard and that scores are assigned consistently and fairly. Except for the writing assessment, which may have up to six score levels, NAEP scoring guides have two, three, four, or five score levels, depending on the subject and the specific item type. General score level categories are defined in the assessment framework for each subject, and the specific criteria required at each score level are defined in the scoring guide for each constructed-response item. The test developers who write the items develop the initial scoring guides, which are then revised as the items are refined during the item review process.
Both the items and scoring guides are reviewed and refined in a thorough review process:
All NAEP items are pilot tested to evaluate performance before operational use. These pilot items are administered to a national sample of students, with approximately 500 student responses per item.
Prior to the scoring of the pilot test, the NAEP standing committee (the educators, subject-matter specialists, and curriculum experts who work with NCES and contractors to oversee the development of the assessment) reviews the scoring guides against initial sets of student responses from the pilot. This review ensures that the scoring guides make the correct distinctions among levels of performance and that scores can be assigned objectively, consistently, and accurately. The committee also oversees the selection of student responses that will serve as examples illustrating the different score levels and as practice papers for the scorer training packets. At this point, the scoring guides are finalized for the pilot test.
Based on a review of the pilot test data, the final set of new items is selected for the operational assessments. In addition to statistical checks conducted on both multiple-choice and constructed-response items, documentation from debriefings held after the scoring of the pilot tests is also reviewed to determine how well the constructed-response items and scoring guides function. Scoring guides are included in all item reviews conducted after the pilot test. Items and/or scoring guides may be refined during this review process. NCES and the National Assessment Governing Board (NAGB) sign off on the final assessment instruments, as described in the item development process.
The first time that items are included in an operational assessment, the scoring guides and training packets must be finalized. These materials will be used throughout the life of the items and the scoring of items must be consistent from assessment to assessment to ensure the accurate measurement and reporting of trends in student achievement.
To expedite the reporting of NAEP results for mathematics and reading within six months of an operational administration, an additional pre-test administration (field test) is conducted after the pilot test, one year before the first operational assessment, to “pre-calibrate” the items for scaling. The pre-calibration administration for mathematics and reading provides data to evaluate the psychometric performance of items so that decisions can be made before the first operational assessment. For the pre-calibration administration, the sample size is approximately 2,500 responses per item, larger than for pilot tests but smaller than in a full-scale operational assessment. In terms of the scoring process, the pre-calibration administration for mathematics and reading serves the same function as the first operational administration for other NAEP subjects: the scoring guides and training packets must be finalized at that point to ensure that the scores obtained in the operational assessment are consistent with those from the “pre-calibration.”
Prior to scoring the pre-calibration administration in mathematics and reading and the first operational administration in all other subjects, the NAEP standing committee again reviews the scoring guides in light of student responses and selects appropriate training packets for the operational assessments, particularly where refinements to items and/or scoring guides were made after the pilot test.
After the first operational administration (or, for mathematics and reading, the pre-calibration administration that supports reporting results within six months of each operational assessment), changes are rarely made to the scoring guides. Trend scoring procedures are implemented to ensure the consistency of scoring across assessment years. Sets of responses from previous years are used to qualify scorers during the training process and to monitor consistency between the current year's scores and those from adjacent years.
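As a minimal sketch of how trend-scoring consistency might be monitored, the example below computes the exact-agreement rate between scores assigned to the same responses in a prior year and in the current year. The metric and the 85 percent flag threshold are illustrative assumptions, not NAEP's published procedures or values.

```python
# Hypothetical sketch of a trend-scoring consistency check. The
# exact-agreement metric and the 0.85 threshold are assumptions for
# illustration; NAEP's actual monitoring statistics are not specified here.

def exact_agreement(prior_scores, current_scores):
    """Fraction of trend responses receiving the same score in both years."""
    if len(prior_scores) != len(current_scores):
        raise ValueError("score lists must be the same length")
    matches = sum(p == c for p, c in zip(prior_scores, current_scores))
    return matches / len(prior_scores)

def flag_for_review(prior_scores, current_scores, threshold=0.85):
    """Flag an item when current-year agreement falls below the threshold."""
    return exact_agreement(prior_scores, current_scores) < threshold

# Example: ten trend responses rescored in the current year.
prior = [3, 2, 4, 1, 3, 2, 4, 4, 1, 2]
current = [3, 2, 4, 1, 3, 2, 3, 4, 1, 2]
print(exact_agreement(prior, current))  # 0.9
print(flag_for_review(prior, current))  # False
```

A flagged item would prompt retraining or a review of the scoring notes before current-year scoring continues.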
After each operational administration, documentation on the scoring procedures and on decisions made during training and scoring is updated. This documentation is important for the trend scoring process: it ensures that scoring trainers in subsequent years know how specific types of scoring issues were addressed. In particular, if new types of responses are encountered in subsequent years, additional notes are added to the scoring and training materials to document how the scoring guide was applied to those responses.
The NAEP scoring process involves a rigorous, multilayered system of training and checking to ensure that the scoring of all NAEP items is accurate, consistent, and valid. Both multiple-choice and constructed-response items are scanned and processed electronically. In addition to implementing numerous quality control and validity checks for the scoring of multiple-choice items, the NAEP scoring contractor uses a proprietary electronic image-processing and scoring system for constructed-response items. This system provides for efficient scoring as well as the capability to conduct a variety of quality control checks throughout the scoring process. The following quality assurance processes are implemented in the scoring of NAEP constructed-response items:
After scoring is completed, the NAEP scoring contractor creates data files containing the scores for both multiple-choice and constructed-response items and delivers these to the contractor responsible for data analysis.