NAEP Technical Documentation: NAEP Scoring

Two types of cognitive items are scored for NAEP: selected-response items and constructed-response items. Selected-response item responses are captured by high-speed scanners during student booklet processing for paper-based assessments and processed electronically for digitally based assessments. Because selected-response items have a finite number of possible responses, most of them are scored algorithmically: data capture capabilities enable each response to be mapped to the appropriate score level in the rubric. Short constructed-response items (typically those with two or three valid score points) and extended constructed-response items (typically those with four or more valid score points) are scored by trained scoring personnel. Unless otherwise noted, the term "scoring" in this section refers to constructed-response items.
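The mapping from a captured selected-response answer to a rubric score level can be illustrated with a minimal sketch. It is not the NAEP implementation: the rubric contents, the function names, and the treatment of unmappable responses are assumptions made for illustration. The second function encodes the typical split between short and extended constructed-response items described above.

```python
from typing import Dict, Optional

# Hypothetical rubric for one selected-response item: each capturable
# option maps to a score level (here 1 = correct, 0 = incorrect).
EXAMPLE_RUBRIC: Dict[str, int] = {"A": 0, "B": 1, "C": 0, "D": 0}


def score_selected_response(response: Optional[str],
                            rubric: Dict[str, int]) -> Optional[int]:
    """Return the rubric score level for a captured response.

    Returns None when the response is missing or is not one of the
    options in the rubric (an assumption; the source does not say how
    such cases are handled).
    """
    if response is None:
        return None
    return rubric.get(response.strip().upper())


def constructed_response_type(valid_score_points: int) -> str:
    """Classify a constructed-response item by its number of valid
    score points, using the typical ranges given in the text."""
    if valid_score_points < 2:
        raise ValueError("an item needs at least two valid score points")
    return "short" if valid_score_points <= 3 else "extended"
```

For example, `score_selected_response("b", EXAMPLE_RUBRIC)` yields score level 1, and `constructed_response_type(4)` classifies a four-point item as extended.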

Scoring a large number of short and extended constructed-response items with a high level of accuracy and reliability within a limited time frame is essential to the success of NAEP. To ensure reliable and efficient scoring of constructed-response items, NAEP takes the following steps:

develops focused and explicit scoring guides that match the criteria delineated in the assessment frameworks;
recruits qualified and experienced scorers, trains them, and verifies their ability to score particular questions through qualifying tests;
employs an image-processing and scoring system that routes images of student responses directly to the scorers so they can focus on scoring rather than paper routing;
monitors scorer consistency through ongoing reliability checks, including second scoring;
assesses the quality of scorer decision-making through frequent monitoring by NAEP assessment experts; and
documents all training, scoring, and quality control procedures in the technical reports.

The table below presents a general overview of recent NAEP scoring activities.

Processing and scoring totals, national and state assessments, by subject area: Various years, 2000–2019
Year	Subject area	Grade	Number of booklets scored	Number of constructed responses scored	Number of individual constructed-response items	Number of team leaders	Number of scorers
NOTE: Number of constructed-response items includes bilingual items. The 2014 TEL assessment and the 2011 writing assessment were computer-based. For TEL and 2011 writing, "Number of booklets scored" denotes number of digital test forms scored. The 2017 assessments were digitally based. Rows covering multiple grades include data from multiple grades. The term "team leaders" refers to the number of scoring supervisors. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), various years, 2000–2019 Assessments.
2019	Mathematics	4, 8, 12	385,380	3,386,230	333	20	156
	Reading	4, 8, 12	308,208	2,106,696	161	35	327
	Science	4, 8, 12	58,367	556,914	258	25	152
2018	Civics	8	6,339	64,497	51	5	37
	Geography	8	6,415	67,725	50	5	37
	U.S. history	8	8,283	88,012	65	5	38
	Technology and engineering literacy (TEL)	8	15,369	152,614	81	5	52
2017	Mathematics	4, 8	294,867	2,919,388	144	14	141
	Reading	4, 8	290,842	2,037,048	85	14	152
	Writing	4, 8	47,112	95,412	44	11	112
2016	Arts	8	8,767	201,249	92	9	79
2015	Mathematics	4, 8, 12	290,971	3,013,937	259	17	125
	Reading	4, 8, 12	311,564	2,174,460	137	26	235
	Science	4, 8, 12	237,571	2,543,244	251	26	211
2014	Civics	8	9,125	73,082	50	3	21
	Geography	8	9,006	85,606	58	4	21
	U.S. history	8	11,279	108,552	68	4	35
	Technology and engineering literacy (TEL)	8	21,579	269,867	98	6	44
2013	Mathematics	4, 8, 12	386,064	3,977,285	333	16	182
2013	Reading	4, 8, 12	384,272	2,782,991	136	27	304
2012	Economics	12	10,950	75,229	34	3	19
2011	Mathematics	4, 8	388,638	3,786,422	172	19	151
	Reading	4, 8	382,205	2,819,950	90	23	256
	Science	8	122,409	1,544,669	96	17	178
	Writing	8, 12	52,452	104,958	44	17	183
2010	Civics	4, 8, 12	26,771	261,989	119	23	153
	Geography	4, 8, 12	26,608	366,543	172	23	153
	U.S. history	4, 8, 12	30,987	387,625	167	23	153
2009	Mathematics	4, 8, 12	380,042	4,293,561	298	16	175
	Reading	4, 8, 12	392,196	3,709,299	311	30	336
	Science	4, 8, 12	331,967	4,592,470	412	45	430
2008	Arts	8	7,865	181,854	92	6	57
2007	Mathematics	4, 8	422,200	3,912,835	435	38	187
	Reading	4, 8	457,800	3,623,126	346	51	362
	Writing	8, 12	205,500	729,940	40	50	328
2006	U.S. history	4, 8, 12	38,400	458,172	132	21	65
	Civics	4, 8, 12	33,200	282,977	84	20	65
	Economics	12	17,600	128,735	32	8	30
2005	Mathematics	4, 8, 12	354,500	4,435,831	414	26	267
	Reading	4, 8, 12	340,200	3,773,691	226	36	363
	Science	4, 8, 12	349,100	4,424,511	539	39	393
2003	Mathematics	4, 8	349,600	4,719,464	135	33	418
2003	Reading	4, 8	350,700	3,913,147	136	32	397
2002	Reading	4, 8, 12	308,500	4,023,861	150	33	330
2002	Writing	4, 8, 12	285,900	608,269	60	29	270
2001	Geography	4, 8, 12	27,500	381,477	57	9	81
2001	U.S. history	4, 8, 12	32,700	399,182	47	9	81
2000	Mathematics	4, 8, 12	253,900	3,856,211	199	16	177
	Reading	4	8,500	123,100	46	14	702
	Science	4, 8, 12	240,900	4,398,021	295	20	155

The table below presents a general overview of recent NAEP long-term trend scoring activities.

Processing and scoring totals, long-term trend assessments, by subject area: 2004, 2008, and 2012
Year	Subject area	Age	Number of booklets scored	Number of constructed responses scored	Number of individual constructed-response items
2012	Mathematics long-term trend	9, 13, 17	26,210	422,192	181
2012	Reading long-term trend	9, 13, 17	26,352	47,241	19
2008	Mathematics long-term trend	9, 13, 17	28,465	452,994	179
2008	Reading long-term trend	9, 13, 17	26,621	51,743	19
2004	Mathematics long-term trend	9, 13, 17	40,300	1,082,923	219
2004	Reading long-term trend	9, 13, 17	41,200	131,496	34
NOTE: Number of constructed responses scored includes second scores assigned to the same responses by second scorers. Numbers of team leaders and scorers are not included because long-term trend scoring occurs at multiple sessions throughout the year. Rows covering multiple ages include data from multiple ages. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2004, 2008, and 2012 Mathematics and Reading Long-Term Trend Assessments.

As new NAEP items are created, tested, and improved, test development staff create scoring guides using a range of actual student responses captured by the materials-processing staff as specific examples. The scoring and test development staffs create training materials matching the assessment framework criteria. Continuous documentation ensures that, in future assessments, scoring staff train on and score each item in the same way as when it was originally implemented. This consistency of scoring procedures allows NAEP to report on trends in student performance over time.

NAEP Scoring Staff

Scorers score student responses. Scoring supervisors provide logistical support to the trainers and help monitor team activities. Trainers are responsible for training both scorers and supervisors on specific content and for ensuring that team scoring performance meets expectations. Content leads for each subject area (reading, science, etc.) oversee the trainers and provide support as needed.

Scorers must have a minimum of a baccalaureate degree from a four-year college or university. An advanced degree and scoring experience and/or teaching experience are preferred. In some subject areas, scorers must complete a placement test, used as a tool to identify scorers with appropriate content knowledge. During the training process, scoring teams are trained so that each student response can be scored consistently. Following training, for all extended constructed-response items and some short constructed-response items with particularly complex scoring guides, each scorer is given a pre-scored qualification set of student responses to score. Qualification standards for each item vary according to the number of score levels for the item. Individual scorer results are retained for all qualification sets.
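A qualification check of this kind can be sketched as a comparison between the scores a scorer assigns and the pre-assigned scores of the qualification set. The sketch below is illustrative only: the function names are hypothetical, exact agreement is assumed as the measure, and the required agreement level is left as a parameter because the actual standards, which vary with the number of score levels, are not stated here.

```python
from typing import Sequence


def exact_agreement(assigned: Sequence[int], prescored: Sequence[int]) -> float:
    """Fraction of qualification-set responses on which the scorer's
    score matches the pre-assigned score exactly."""
    if len(assigned) != len(prescored) or not prescored:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == p for a, p in zip(assigned, prescored))
    return matches / len(prescored)


def qualifies(assigned: Sequence[int], prescored: Sequence[int],
              required_agreement: float) -> bool:
    """True if the scorer meets the required agreement level.

    required_agreement is a placeholder: the real standard for an item
    depends on its number of score levels and is not given in the text.
    """
    return exact_agreement(assigned, prescored) >= required_agreement


# Illustrative data only: ten pre-scored responses, one disagreement.
prescored_set = [1, 2, 3, 2, 1, 3, 2, 1, 2, 3]
scorer_scores = [1, 2, 3, 2, 1, 3, 2, 2, 2, 3]
agreement = exact_agreement(scorer_scores, prescored_set)
```

Keeping the threshold as an argument lets the same check be reused for items with different numbers of score levels.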

Scoring supervisors and trainers are selected based upon many factors including their previous experience, educational and professional backgrounds, demonstration of a strong understanding of the scoring criteria, and strong interpersonal communication skills and organizational abilities.

NAEP scoring teams usually consist of 10–12 scorers who are led by a scoring supervisor and a trainer. Prior to the scoring effort, all personnel are intensively trained. The trainers who train the scorers, the supervisors who oversee a group of scorers, and the scorers themselves are all given both general scoring training and item-specific content training.

NAEP Scoring System

Using the latest technology and secure network communications, the NAEP electronic scoring system both transmits images of student responses to the trained scorers and receives back the scores assigned by them. Student responses from paper booklets are scanned from the original test booklets; the actual test booklets can be accessed and referenced if needed. Student responses from digitally based assessments are processed electronically and presented to scorers in the same system. The scorer sees each student response in isolation on a computer screen and assigns a score. The scorer cannot access any other responses from the student for a particular item or from other items the student answered. As each response is scored, another student response is shown for scoring, until all responses for an item have been scored.

During scoring, the NAEP electronic scoring system provides documentation of numerous scoring metrics. Reports on item and scoring performance can be retrieved as needed. In addition, custom reports of daily activities are sent out nightly to development, scoring, and analysis staff to monitor NAEP scoring quality and progress.

All assessments are scored item by item so that scorers train on one item and one scoring guide at a time. This method is efficient only with electronic presentation of student responses.

NAEP Scoring Procedures

During the scoring of a particular item, the system randomly recirculates a percentage of scored responses to be rescored by a second scorer in order to check the consistency of current-year scoring. For large samples (about 20,000 student responses), 5 percent of responses are second-scored; for smaller samples (about 2,000 student responses), 25 percent are second-scored. Comparing the first and second scores yields the within-year interrater agreement.
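The second-scoring rates work out to roughly 1,000 rescored responses for a 20,000-response item (5 percent) and 500 for a 2,000-response item (25 percent). A minimal sketch of the random selection and of the agreement calculation follows. The function names are hypothetical, and percent exact agreement is assumed as the agreement statistic; the statistic NAEP actually reports may differ.

```python
import random
from typing import Iterable, List, Optional, Sequence, Tuple


def select_for_second_scoring(response_ids: Sequence[str], percent: int,
                              seed: Optional[int] = None) -> List[str]:
    """Randomly pick `percent` percent of the scored responses to be
    recirculated to a second scorer (count rounded down)."""
    rng = random.Random(seed)
    k = len(response_ids) * percent // 100
    return rng.sample(list(response_ids), k)


def percent_exact_agreement(score_pairs: Iterable[Tuple[int, int]]) -> float:
    """Percentage of (first score, second score) pairs that match."""
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("no second-scored responses to compare")
    return 100.0 * sum(first == second for first, second in pairs) / len(pairs)


# Illustrative pools matching the sample sizes quoted in the text.
large_pool = [f"resp-{i}" for i in range(20_000)]
small_pool = [f"resp-{i}" for i in range(2_000)]
large_picks = select_for_second_scoring(large_pool, 5, seed=0)
small_picks = select_for_second_scoring(small_pool, 25, seed=0)
```

The higher rate for smaller samples keeps the number of double-scored responses large enough for the agreement estimate to be meaningful.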

In addition, NAEP trend scoring is used to compare the consistency of scoring over time (i.e., cross-year interrater agreement). During trend scoring, the NAEP electronic scoring system allows for the presentation of a pool of scored responses from a prior assessment to current scorers. Comparing current scores to the scores given in the prior assessment offers the ability to generate reports to evaluate scoring consistency over time for a specific NAEP item.
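One way to picture the cross-year comparison is as a cross-tabulation of prior-year and current-year scores for the same pool of responses. The sketch below assumes each response carries an identifier shared across years and uses exact agreement as the summary. The names and data are hypothetical; this is not the reporting logic of the NAEP electronic scoring system.

```python
from collections import Counter
from typing import Dict, Tuple


def trend_agreement(prior_scores: Dict[str, int],
                    current_scores: Dict[str, int]
                    ) -> Tuple[float, Counter]:
    """Compare scores from a prior assessment with scores assigned by
    current scorers to the same responses.

    Returns the exact-agreement rate and a Counter keyed by
    (prior score, current score), which shows where scoring drifted.
    """
    common = prior_scores.keys() & current_scores.keys()
    if not common:
        raise ValueError("no responses were scored in both years")
    crosstab = Counter((prior_scores[r], current_scores[r]) for r in common)
    matches = sum(n for (prior, current), n in crosstab.items()
                  if prior == current)
    return matches / len(common), crosstab


# Illustrative data only: four trend responses, one scored lower this year.
prior = {"r1": 2, "r2": 1, "r3": 3, "r4": 2}
current = {"r1": 2, "r2": 1, "r3": 2, "r4": 2}
rate, table = trend_agreement(prior, current)
```

Off-diagonal cells of the cross-tabulation, such as `(3, 2)` here, indicate whether current scorers are systematically more lenient or more severe than the prior-year scorers.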

Backreading of current-year responses ensures frequent monitoring of scorer decision-making by supervisory staff. Backreading allows the supervisor to review responses (with scores assigned) already scored by each scorer and to ensure that each scorer is applying the scoring guide correctly. About 5 percent of each scorer's output is monitored through backreading.
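Because backreading is applied per scorer, the selection can be pictured as sampling within each scorer's output rather than across the whole pool. The following sketch makes several assumptions: the record layout, the function name, random selection, and the rule of reviewing at least one response per scorer are illustrative choices, not documented NAEP behavior.

```python
import math
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

ScoredResponse = Tuple[str, int]  # (response id, assigned score)


def select_for_backreading(scored: Iterable[Tuple[str, str, int]],
                           percent: int = 5,
                           seed: Optional[int] = None
                           ) -> Dict[str, List[ScoredResponse]]:
    """Pick about `percent` percent of each scorer's scored responses
    for supervisor review.

    `scored` holds (scorer id, response id, assigned score) records.
    Each scorer gets at least one response reviewed (an assumption).
    """
    rng = random.Random(seed)
    by_scorer: Dict[str, List[ScoredResponse]] = defaultdict(list)
    for scorer_id, response_id, score in scored:
        by_scorer[scorer_id].append((response_id, score))
    picks = {}
    for scorer_id, items in by_scorer.items():
        k = max(1, math.ceil(len(items) * percent / 100))
        picks[scorer_id] = rng.sample(items, k)
    return picks


# Illustrative output: one scorer with 200 responses, another with 40.
records = ([("scorer-1", f"a{i}", 1) for i in range(200)]
           + [("scorer-2", f"b{i}", 2) for i in range(40)])
review_queue = select_for_backreading(records, percent=5, seed=1)
```

The assigned score travels with each selected response, so the supervisor sees the response together with the score under review.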

During training and scoring, any changes to existing documentation are captured by scoring staff, shared across scoring teams, and documented in a record of the scoring history of the NAEP item. This is reviewed prior to the next scoring effort.

Last updated 30 May 2023 (SK)
