Education Statistics Quarterly
Vol 2, Issue 3, Topic: Methodology
How Does NAEP Ensure Consistency in Scoring?
By: Sheida White, Connie Smith, and Alan Vanneman
 
This article was originally published as an issue of Focus on NAEP. The Focus on NAEP series briefly summarizes information about the ongoing development and implementation of the National Assessment of Educational Progress (NAEP).
 
 

Introduction

The National Center for Education Statistics (NCES) has been conducting the National Assessment of Educational Progress (NAEP) since 1969. In addition to regular assessments in reading, mathematics, science, and writing, NCES also conducts assessments in such subjects as geography, U.S. history, civics, and the arts.

All of these assessments include constructed-response questions in addition to multiple-choice items. Many include "short constructed-response" questions, which require students to provide a numerical response or write a few words or sentences, as well as "extended constructed-response" questions, which may require students to write a paragraph or more, perform a science experiment and write a description of what was done, or solve a word problem in mathematics, providing a written explanation of the answer. Writing assessments require students to produce two extensive writing samples, while the arts assessments require students to create and perform art.

Extended constructed-response questions for NAEP assessments such as reading, U.S. history, geography, and civics are scored according to four-level scoring guides. Answers are typically scored as "incorrect," "partial," "essential," or "fully correct," with "incorrect" answers receiving only one point and "fully correct" answers receiving the full four points. However, some assessments, such as the arts, mathematics, and writing assessments, have questions that recognize five or even six levels of performance.
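The relationship between score levels and point values can be sketched as a simple mapping (a minimal illustration in Python with hypothetical names, not a description of NAEP's actual scoring software):

```python
# Hypothetical four-level scoring guide: each level maps to a point value,
# from "incorrect" (1 point) through "fully correct" (4 points).
SCORE_POINTS = {"incorrect": 1, "partial": 2, "essential": 3, "fully correct": 4}

def points_for(level: str) -> int:
    """Return the point value for a score level; raise KeyError for an unknown level."""
    return SCORE_POINTS[level.lower()]
```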

Each national assessment generates thousands of student responses that must be scored individually, and combined state/national assessments can generate almost five million responses.1 NCES and its contractors have developed a large number of special techniques to ensure that these constructed-response questions can be scored consistently. This Focus on NAEP discusses the techniques used to score written assessments such as reading, mathematics, writing, and science. A separate Focus on NAEP will cover the special problems encountered in assessing the arts.



Selecting Scorers

In the year 2000, NCES will conduct two national/state assessments, in mathematics and science, at grades 4, 8, and 12 at the national level and at grades 4 and 8 at the state level. In addition, there will be a national reading assessment for grade 4 only. The three assessments will generate close to 10 million constructed responses. The scoring will be done, as it has been done for previous assessments, by National Computer Systems (NCS). Educational Testing Service (ETS) develops the scoring guides for the questions and provides training in their use.

Scoring will be done at two online Professional Scoring Centers, one in Iowa City, Iowa, and the other in Tucson, Arizona. The contractors will hire about 150 scorers for the mathematics assessment, about 175 for science, and about 50 for reading.

Scorers selected for the assessment will have the following qualifications:

  • a minimum of a bachelor's degree in the appropriate academic discipline (mathematics, science, or English), or in education;
  • scoring experience in NAEP or non-NAEP assessments preferred; and
  • teaching experience at the elementary or secondary level preferred.

The 2000 Mathematics Assessment will have bilingual (Spanish/English) booklets for the 4th and 8th grades. Scorers fluent in Spanish will be hired for the scoring of booklets answered in that language.



Training Scorers

Training scorers to score short and extended constructed-response questions consistently is one of the most important parts of the entire scoring procedure. There is separate training for each constructed-response question.2

Training involves the following:

  • presenting and discussing the question to be scored and the question's rationale;
  • explaining the scoring guide to the team and discussing the "Anchor Packet," which contains the scoring guide, the question, its scoring rationale, and the "Anchor Set" of student responses that represent the various score points in the guide;
  • discussing the rationale behind the guide, focusing on the criteria that differentiate the levels in the guide;
  • practicing scoring on a "Practice Set" of students' answers; and
  • continuing to practice until a consensus is reached on how to apply the scoring guide.

Preparing training materials for a question

Trainers and participating experts in the field begin by selecting from 150 to 300 student answers to an extended constructed-response question. They score them all, for training purposes, and use the answers to create three different training sets: the Anchor Set, the Practice Set, and the Qualification Set.

Answers in the Anchor Set have the scores written on them. An Anchor Set contains at least three answers for every score point in a question. The Anchor Set for a three-point question will usually have 10 answers, and the Anchor Set for a four-point question will have about 15. The trainers also score a Practice Set of about 10 to 20 answers, and a Qualification Set of similar size, but do not put the scores on the answers.
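As a rough illustration of how such sets might be assembled (hypothetical Python with invented names; NAEP's actual procedures are carried out by trainers, not software), pre-scored responses could be partitioned so that the Anchor Set keeps the trainers' scores attached while the Practice and Qualification Sets withhold them:

```python
import random
from collections import defaultdict

def build_training_sets(scored_responses, score_points, anchor_per_point=3,
                        practice_size=15, qualification_size=15, seed=0):
    """Partition pre-scored responses into Anchor, Practice, and Qualification sets.

    scored_responses: list of (response_id, score) pairs already scored by trainers.
    score_points: the possible scores for the question, e.g. [1, 2, 3, 4].
    """
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for response_id, score in scored_responses:
        by_score[score].append(response_id)

    # Anchor Set: at least `anchor_per_point` exemplars for every score point,
    # with the trainer-assigned score attached to each answer.
    anchor = []
    for point in score_points:
        pool = by_score[point]
        if len(pool) < anchor_per_point:
            raise ValueError(f"Not enough exemplars for score point {point}")
        chosen = rng.sample(pool, anchor_per_point)
        anchor.extend((rid, point) for rid in chosen)
        by_score[point] = [rid for rid in pool if rid not in chosen]

    # Practice and Qualification Sets: scored by the trainers, but the scores
    # are withheld from the sets given to scorers.
    remaining = [rid for pool in by_score.values() for rid in pool]
    rng.shuffle(remaining)
    practice = remaining[:practice_size]
    qualification = remaining[practice_size:practice_size + qualification_size]
    return anchor, practice, qualification
```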

Training to score a question

Scorers, divided into training teams, will first study the scoring guide developed for a given question. Next, they receive the Anchor Set of answers, which they review in conjunction with the scoring guide. They are then given the Practice Set. Scorers score each of the answers and are then given the "true" score, arrived at earlier by the trainers, for comparison and discussion.

Qualifying to score a question

Once the scorers are familiar with the scoring of a question, they are given a Qualification Set of answers to score. At least 80 percent of their scores must match the scores given by the trainers. Scorers who fail to get 80 percent discuss the scoring of the Qualification Set with their trainer and then are given a second Qualification Set. If they fail to get at least an 80 percent match on this set, they cannot score the question.
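The 80 percent rule amounts to an exact-match agreement rate checked against a threshold, as in this minimal sketch (illustrative Python with hypothetical names, not NAEP's actual software):

```python
def agreement_rate(scorer_scores, trainer_scores):
    """Fraction of answers on which the scorer's score exactly matches the trainers' score."""
    matches = sum(1 for s, t in zip(scorer_scores, trainer_scores) if s == t)
    return matches / len(trainer_scores)

def qualifies(scorer_scores, trainer_scores, threshold=0.80):
    """A scorer qualifies on a set if at least 80 percent of the scores match the trainers'."""
    return agreement_rate(scorer_scores, trainer_scores) >= threshold

# Example: a 15-answer Qualification Set on which the scorer matches 13 of 15 trainer scores.
trainer = [1, 2, 4, 3, 2, 1, 4, 3, 3, 2, 1, 4, 2, 3, 1]
scorer  = [1, 2, 4, 3, 2, 1, 4, 3, 2, 2, 1, 4, 2, 3, 2]
print(qualifies(scorer, trainer))  # True: 13/15 is about 0.87, which meets 0.80
```

A scorer who fails the first set would, after discussion with the trainer, be checked the same way against a second Qualification Set before being excluded from scoring the question.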



Image Scoring and Monitoring

Scoring of constructed-response questions is done by an "Image" process. While student answers are written in traditional answer booklets, for scoring purposes they are converted into computer images. This allows all the answers for a given question to be grouped together and scored at the same time. Scorers are trained to score the answers to a question, and then work exclusively on answers to that question until each one has been scored.

When scorers begin scoring answers to a question, they first take turns scoring the same answers and comparing their scores, or score in pairs, as a final quality check before scoring on their own. They receive retraining at the beginning of each day and after any break that exceeds 15 minutes.

Scorers will be monitored by supervisors (known as "table leaders") in a variety of ways. A certain percentage of answers for constructed-response questions will be scored twice.3 The second scorer will not know the score assigned by the first scorer. Because all scoring is done on a linked computer network, table leaders will have data on the scoring agreement rates for all scorers while the scoring is in progress. Figure 1 provides a "reliability summary" used to keep track of scoring consistency.

A minimum standard agreement rate will be set for each question, taking into account both the number of score points for the question and the subject being assessed. For example, a higher agreement rate is set for a three-point question than for a four-point question, and agreement rates will be higher for a subject such as mathematics, where the "correct" answer can usually be defined with greater precision, than for a subject such as reading. In 1998, the average standard agreement rate for questions on the reading assessment was 91 percent for grade 4, 90 percent for grade 8, and 89 percent for grade 12. For the 1996 mathematics assessment, it was 96 percent for all three grades.

Figure 1. Scorer reliability summary
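In spirit, the reliability summary boils down to per-question agreement rates computed from the double-scored answers and compared with each question's minimum standard, roughly as in this sketch (illustrative Python with hypothetical names, not the contractors' actual reporting system):

```python
from collections import defaultdict

def reliability_summary(double_scored, minimum_standard):
    """Per-question agreement rates from double-scored answers.

    double_scored: list of (question_id, first_score, second_score) records,
        where the second scorer did not see the first scorer's score.
    minimum_standard: dict mapping question_id to its minimum agreement rate.
    Returns a dict of question_id -> (agreement_rate, meets_standard).
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for question_id, first, second in double_scored:
        totals[question_id] += 1
        if first == second:
            matches[question_id] += 1

    summary = {}
    for question_id, n in totals.items():
        rate = matches[question_id] / n
        summary[question_id] = (rate, rate >= minimum_standard[question_id])
    return summary
```

Questions flagged as falling below their standard would then trigger the remedial actions described below.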

If the minimum agreement rate is not met for a question, a number of different remedial actions may be necessary. If all or most members of a scoring team appear to be below the average, retraining may be appropriate. If there seems to be a problem with one scorer, the scorer may be reassigned.

The answers that were scored with insufficient agreement rates need to be rescored. This may be done by a group of supervisors; alternatively, all the scores for a question may be erased so that the team can start over, or the question may be assigned to a different scoring team.

Occasionally, the scoring trainer may decide that the scoring guide needs to be refined, although this rarely happens during an assessment. Scoring guides are more likely to be refined during preliminary testing of assessment questions.

Table leaders will have methods to review an individual scorer's consistency as well as the consistency of a scoring team. A table leader will typically review 10 percent of the answers scored by a scorer and will discuss with the scorer any score that appears inappropriate. A table leader has the authority to rescore any answer, although this does not affect the inter-rater reliability data. To check on scoring consistency across individual scorers, a table leader can also review all the answers that were given a particular score by a scoring team or the committee that developed the assessment questions.

The NAEP assessments that NCES will be conducting in 2000 are periodically redesigned to keep them responsive to changes in curricula and also to reflect improvements in assessment techniques. However, because NCES uses the same assessment instrument several times before making changes, these assessments usually offer some trend data. For this reason, decisions by scorers working on the current assessments will be compared with decisions by past scorers when appropriate. A similar procedure is used for the NAEP long-term trend assessments, whose primary function is to track student performance over time.



Conclusion

Achieving consistency in the scoring of constructed-response questions begins with the selection of individuals who have a background in education and experience in scoring. These individuals are trained carefully in the scoring of each question, so that all the scorers, working independently, will almost always give the same number of points to any answer to a given question. Regular second scoring of answers to every question ensures that this consistency is maintained throughout the scoring process.



Footnotes

1 The NAEP 1997 national arts assessment (in music, theatre, and visual arts) covered the 8th grade only and involved a total of about 6,500 students. The arts assessment involved relatively few questions, because students devoted much of their time to a single creation or performance task. A national/state assessment in a subject such as science will involve about 7,500 students at each of three grades (4th, 8th, and 12th) at the national level, plus about 2,500 students per grade for participating states. In the past, more than 40 states and other jurisdictions have participated in each NAEP state assessment.

2 The training procedures described are for extended constructed-response questions. The procedures for short constructed-response questions are similar but less elaborate.

3 Six percent of the answers for the constructed-response questions of the mathematics and science assessments for grades 4 and 8 will be scored twice. This will include both the national and state assessments for these subjects and grades. In addition, 25 percent of the answers for the grade 12 assessments in science and mathematics will be scored twice, a procedure that will also be followed for the reading assessment (grade 4 only). A larger percentage will be scored twice for these assessments because they are national assessments only and thus will involve substantially fewer answers.


For technical information, see

Allen, N.L., Carlson, J.E., and Zelenak, C.A. (forthcoming). The NAEP 1996 Technical Report.

Allen, N.L., Swinton, S.S., Isham, S.P., and Zelenak, C.A. (1998). Technical Report of the NAEP 1996 State Assessment Program in Science (NCES 98-480).

Author affiliations: S. White, NCES; C. Smith, National Computer Systems; and A. Vanneman, Education Statistics Services Institute (ESSI).

For questions about content, contact Sheida White (sheida.white@ed.gov).

To obtain this Focus on NAEP (NCES 2000-490), call the toll-free ED Pubs number (877-433-7827) or visit the NCES Web Site (http://nces.ed.gov).
