
National Assessment of Educational Progress (NAEP)



4. SURVEY DESIGN

TARGET POPULATION

Students enrolled in public and nonpublic schools in the 50 states and the District of Columbia who are deemed assessable by their school and classified in defined grade/age groups—grades 4, 8, and 12 for the main national assessments and ages 9, 13, and 17 for the long-term trend assessments in mathematics and reading. Grades 4 and/or 8 are usually assessed in the state assessments and TUDA; the number of grades assessed has varied in the past, depending on the availability of funding (although testing for 4th- and 8th-graders in reading and mathematics every 2 years is now required for states that receive Title I funds). Only public schools were included in the state NAEP and in TUDA.

SAMPLE DESIGN

For the national assessments, probability samples of schools and students are selected to represent the diverse student population in the United States. The numbers of schools and students vary from cycle to cycle, depending on the number of subjects and items to be assessed. A national sample will have sufficient schools and students to yield data for public schools in each of the four Census regions of the country, as well as results by sex, race/ethnicity, degree of urbanization of school location, parent education (for grades 8 and 12), and participation in the National School Lunch Program. A national sample of private schools is also selected for grades 4, 8, and 12. This sample is designed to produce national and regional estimates of student performance.

In the state assessment, a sample of schools and students is selected to represent a participating state. In a state, on average 2,500 students in approximately 100 public schools are selected per grade, per subject assessed. The selection of schools is random within classes of schools with similar characteristics; however, some schools or groups of schools (districts) can be selected for each assessment cycle if they are unique in the state. For instance, a particular district may be selected more often if it is located in the state’s only major metropolitan area or has the majority of the state’s Black, Hispanic, or other race/ethnicity population. Additionally, even if a state decides not to participate at the state level, schools in that state identified for the national sample will be asked to participate.

Typically, within each school, approximately 30 students per subject are selected randomly. Some of the students who are randomly selected are classified as students with disabilities (SD) or English-language learners (ELL). NAEP’s goal is to assess all students in the sample, and this is done if at all possible.

NAEP’s multistage sampling process involves the following steps:

  • selection of schools (public and nonpublic) within strata; 
  • selection of students within the selected schools; and 
  • allocation of selected students to assessment subjects.


Selection of Schools. In this stage of sampling, public schools in each state (including Bureau of Indian Education [BIE] schools serving grade 4 or 8 students and Department of Defense Education Activity [DoDEA] schools) and private schools in each state (including Catholic schools) are listed according to the grades associated with the three age classes: age class 9 refers to age 9 or grade 4 in the long-term trend NAEP (or grade 4 in the main NAEP); age class 13 refers to age 13 or grade 8 in the long-term trend NAEP (or grade 8 in the main NAEP); age class 17 refers to age 17 or grade 11 in the long-term trend NAEP (or grade 12 in the main NAEP).

The school lists are obtained from two sources. Regular public, BIE, and DoDEA schools are obtained from the school list maintained by NCES’ Common Core of Data. Catholic and other nonpublic schools are obtained from the NCES Private School Universe Survey (PSS). To ensure that the state samples provide an accurate representation, public schools are stratified by urbanization, enrollment of Black, Hispanic, or other race/ethnicity students, state-based achievement scores, and median household income. Private schools are stratified by type (e.g., parochial, nonreligious), urban status, and enrollment per grade. Once the stratification is completed, the schools are assigned a probability of selection that is proportional to the number of students per grade in each school.
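Selection with probability proportional to size can be illustrated with a small sketch. It assumes a single stratum and uses systematic selection along the cumulated enrollments, which is one common way to implement this kind of draw; the source does not say which algorithm NAEP uses, and the function names and enrollment figures are hypothetical.

```python
import random


def inclusion_probabilities(sizes, n):
    """Probability that each school is selected when n schools are drawn
    with probability proportional to size (capped at 1 for very large schools)."""
    total = sum(sizes)
    return [min(1.0, n * size / total) for size in sizes]


def pps_systematic_sample(sizes, n, rng):
    """Draw up to n schools by systematic selection along cumulated sizes.

    A school larger than the sampling interval can be hit more than once;
    this sketch lists such a school only once.
    """
    total = sum(sizes)
    interval = total / n
    start = rng.random() * interval          # random start in [0, interval)
    targets = [start + k * interval for k in range(n)]
    selected = []
    cumulative = 0.0
    t = 0
    for index, size in enumerate(sizes):
        cumulative += size
        while t < n and targets[t] < cumulative:
            if not selected or selected[-1] != index:
                selected.append(index)
            t += 1
    return selected


# Hypothetical grade enrollments for six schools in one stratum.
enrollments = [50, 120, 30, 200, 80, 20]
chosen = pps_systematic_sample(enrollments, 2, random.Random(2015))
probabilities = inclusion_probabilities(enrollments, 2)
```

In this sketch a school with 200 eligible students is ten times as likely to be drawn as one with 20, which is what "proportional to the number of students per grade" implies.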

Prior to 2005, DoDEA overseas and domestic schools were reported separately. Starting with the 2005 assessments, all DoDEA schools, both domestic and overseas, were combined into one jurisdiction. In addition, the definition of the national sample changed in 2005; it now includes all of the overseas DoDEA schools.

The manner of sampling schools for the long-term trend assessments is very similar to that used for the main assessments. The primary difference is that, in the long-term trend assessments, nonpublic schools and schools with high enrollment of Black, Hispanic, or other race/ethnicity students are not oversampled. Schools are not selected for both main and long-term trend assessments at the same age/grade. The long-term trend assessments use a nationally representative sample and do not report results by state.


Selection of students. This stage of sampling involves random selection of national samples representing the entire population of U.S. students in grades 4, 8, and 12 for the main assessment and the entire population of students at ages 9, 13, and 17 for the long-term trend assessment. Some of the students who are randomly selected are classified as SD or ELL. A small number of students selected for participation are excluded because of limited English proficiency or severe disability.

To facilitate the sampling of students, a consolidated list is prepared for each school of all age-eligible students (long-term trend assessments) or all grade-eligible students (main assessments) for the age class for which the school is selected. A systematic selection of eligible students is made from this list—unless all students are to be assessed—to provide the target sample size.

For each age class (separately for long-term trend and main samples), measures of size are established as to the number of students who are to be selected for a given school. In those schools that, according to information in the sampling frame, have fewer eligible students than the final measures of size, each eligible student enrolled at the school is selected in the sample. In other schools, a sample of students is drawn. The measures of size are established in terms of the number of grade-eligible students for the main samples, and in terms of the number of students in each age class for the trend samples.
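The within-school step described above might be sketched as follows. The function name, the roster, and the target of 30 students are illustrative assumptions; the sketch takes every student when the school has no more eligible students than the target and otherwise makes a systematic selection with a random start.

```python
import random


def select_students(eligible_students, target_size, rng):
    """Return the sampled students for one school.

    If the school has no more eligible students than the target size,
    all of them are taken; otherwise a systematic sample is drawn from
    the consolidated list.
    """
    count = len(eligible_students)
    if count <= target_size:
        return list(eligible_students)
    interval = count / target_size           # greater than 1 here
    start = rng.random() * interval          # random start in [0, interval)
    return [eligible_students[int(start + k * interval)]
            for k in range(target_size)]


# Hypothetical rosters: one large school and one small school.
large_roster = [f"student_{i:03d}" for i in range(100)]
small_roster = [f"student_{i:03d}" for i in range(20)]
rng = random.Random(4)
large_sample = select_students(large_roster, 30, rng)
small_sample = select_students(small_roster, 30, rng)
```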

Excluded students. Some students are excluded from the student sample because they are deemed unable to participate meaningfully by school authorities. The exclusion criteria for the main samples differ somewhat from those used for the long-term trend samples. In order to identify students who should be excluded from the main assessments, school staff members are asked to identify those SD or ELL students who do not meet the NAEP inclusion criteria. School personnel are asked to complete an SD/ELL questionnaire for all SD and ELL students selected into the NAEP sample, whether they participate in the assessment or not. Prior to 2004, for the long-term trend assessments, excluded students were identified for each age class, and an Excluded Student Survey was completed for each excluded student. Beginning in 2004, both long-term trend and main NAEP assessments use identical procedures. In 2010, the Governing Board revised its policy on inclusion. The current policy defines specific inclusion goals for NAEP samples. At the national, state, and district levels, the goal is to include 95 percent of all students selected for the NAEP samples, and 85 percent of those in the NAEP sample who are identified as SD or ELL.

Main national NAEP sample sizes. In 2011, the main national and state NAEP assessed students in reading and mathematics at grades 4 and 8 and in science at grade 8. In addition, the writing assessment was administered to a national sample at grades 8 and 12. The main national mathematics assessment sampled 214,200 grade 4 students and 180,400 grade 8 students; the reading assessment sampled 222,200 grade 4 students and 174,700 grade 8 students. The science assessment sampled 124,200 grade 8 students. The main national writing assessment sampled 24,600 grade 8 students. For 2013, the main national mathematics assessment sampled 186,500 grade 4 students and 170,100 grade 8 students, while the main national reading assessment sampled 190,400 grade 4 students and 171,800 grade 8 students. In 2015, the main national mathematics assessment sampled 139,900 grade 4 students and 136,900 grade 8 students, while the main national reading assessment sampled 139,100 grade 4 students and 136,500 grade 8 students.

TUDA sample sizes. In 2011, 2013, and 2015, twenty-one urban districts (including the District of Columbia) participated in TUDA in mathematics and reading. The sample of students in the participating TUDA school districts is an extension of the sample of students who would usually be selected as part of the state and national samples. The sample design for TUDA districts provides for oversampling. These extended samples allow reliable reporting of student groups within these districts.

Results for students in the TUDAs are included with those for states and the nation with appropriate weighting. For example, the data for students tested in the Chicago sample are used to report results for Chicago, but also contribute to Illinois’ estimates (and, with appropriate weights, to national estimates). Chicago has approximately 20 percent of the students in Illinois; therefore Chicago will contribute 20 percent, and the rest of the state will contribute 80 percent, to Illinois’ results.
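The Chicago example amounts to a share-weighted average. The short sketch below makes the arithmetic explicit; the two average scores are invented for illustration and are not NAEP results.

```python
def combined_state_mean(district_mean, rest_of_state_mean, district_share):
    """Combine a district and the rest of the state in proportion to
    the district's share of the state's students."""
    return (district_share * district_mean
            + (1.0 - district_share) * rest_of_state_mean)


# Hypothetical averages: 230 for the district, 240 for the rest of the state.
illinois_mean = combined_state_mean(230.0, 240.0, district_share=0.20)
# 0.20 * 230 + 0.80 * 240 = 238
```

Oversampling the district therefore improves the precision of district-level results without letting the district count for more than its population share in the state estimate.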

Long-term trend NAEP sample sizes. The long-term trend assessment tested the same four subjects across years through 1999, using relatively small national samples. Samples of students were selected by age (9, 13, and 17) for mathematics, science, and reading, and by grade (4, 8, and 11) for writing. Students within schools were randomly assigned to either mathematics/science or reading/writing assessment sessions subsequent to their selection for participation in the assessments. In 2004, science and writing were removed from the trend assessments; the trend assessments are now scheduled to be administered in mathematics and reading every 4 years (but not in the same years as the main assessments). In 2004, approximately 24,100 students took the modified¹ reading assessment, while about 14,000 took the bridge² reading assessment. In 2004, approximately 22,400 students took the modified mathematics assessment, while about 14,700 took the bridge mathematics assessment. The latest long-term trend assessment was conducted during the 2011–12 school year (fall for age 13; winter for age 9; spring for age 17), but technical documentation was not available at the time of this publication. For the 2007–08 assessment, approximately 26,600 students were assessed in reading and 26,700 students were assessed in mathematics.

NIES sample sizes. The NIES survey questionnaire sample is designed to produce information representative of the target population of all fourth– and eighth–grade AI/AN students in the United States. In 2005, the survey questionnaire sample included about 5,600 eligible students at approximately 550 schools located throughout the United States. The sample consisted of approximately 84 percent public, 4 percent private, and 12 percent BIE schools (unweighted). In 2007, the NIES survey questionnaire sample included about 12,900 AI/AN students at approximately 1,900 schools at grade 4 and 14,600 AI/AN students at 2,000 schools at grade 8 located throughout the United States. The sample consisted of approximately 94 percent public, 1 percent private, and 5 to 6 percent BIE schools at grades 4 and 8 (as well as a small number of DoDEA schools). All BIE schools were part of the sample. In 2009, the NIES survey questionnaire sample consisted of about 12,300 grade 4 students in approximately 2,300 schools and approximately 10,400 students in grade 8 at about 1,900 schools. In 2011, the NIES survey questionnaire sample consisted of about 10,200 grade 4 students in approximately 1,900 schools and approximately 10,300 students in grade 8 at about 2,000 schools.

The samples of AI/AN students participating in the 2011 NAEP reading and mathematics assessments, upon which the student performance results are based (and which also comprises the assessment component of NIES), represent augmentations of the sample of AI/AN students who would usually be selected to participate in NAEP. This allows more detailed reporting of performance for this group.

In 2005, seven states had sufficient samples of AI/AN students to report state–level data: Alaska, Arizona, Montana, New Mexico, North Dakota, Oklahoma, and South Dakota. In 2007, a total of 11 states had sufficiently large samples, with Minnesota, North Carolina, Oregon, and Washington being added to the original seven selected states for 2005. In 2009, results were also reported for Utah, resulting in state–level reporting for a total of 12 states. In 2011, results are reported for the same 12 states. While 6 of the 12 states had sufficient AI/AN students without oversampling, schools in 6 states were oversampled in 2011: Arizona, Minnesota, North Carolina, Oregon, Utah, and Washington.   


Assessment Design

Since 1988, the Governing Board has selected the subjects for the main NAEP assessments. NAGB also oversees the creation of the frameworks that underlie the assessments and the specifications that guide the development of the assessment instruments.

Development of Framework and Questions. The Governing Board uses an organizing framework for each subject to specify the content that will be assessed. This framework is the blueprint that guides the development of the assessment instrument. The framework for each subject area is determined with input from teachers, curriculum specialists, subject–matter specialists, assessment experts, policy makers, and members of the general public.

Unlike earlier multiple–choice instruments, current instruments dedicate a significant amount of testing time to constructed–response questions that require students to compose written answers.

The questions and tasks in an assessment are based on the subject-specific frameworks. They are developed by teachers, subject-matter specialists, and testing experts under the direction of NCES and its contractors. For each subject-area assessment, a national committee of experts provides guidance and reviews the questions to ensure that they meet the framework specifications. Items are also reviewed by NAGB. For each state-level assessment, teachers and state curriculum and assessment specialists review the NAEP questions.

Matrix Sampling. Several hundred questions are typically needed to reliably test the many specifications of the complex frameworks that guide NAEP assessments. However, administering the entire collection of cognitive questions to each student would be far too time consuming to be practical. Matrix sampling allows the assessment of an entire subject area within a reasonable amount of testing time, in most cases 50 minutes for paper-pencil administered assessments and 60 minutes for computer administered assessments. By this method, different portions from the entire pool of cognitive questions are printed in separate booklets and administered to different samples of students.

In matrix sampling, NAEP uses a focused balanced incomplete block or partial balanced incomplete block (BIB or pBIB) design. The NAEP BIB design varies according to subject area. A BIB spiraling design ensures that students receive different interlocking sections of the assessment, enabling NAEP to check for any unusual interactions that may occur between different samples of students and different sets of assessment questions. This procedure assigns blocks of questions in a manner that “balances” the positioning of blocks across booklets and “balances” the pairing of blocks within booklets according to content. The booklets are “incomplete” because not all blocks are matched to all other blocks. The “spiraling” aspect of this procedure cycles the booklets for administration so that, typically, any group of students will receive approximately the target proportions of different types of booklets.
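The balance properties can be made concrete with a small textbook design. The sketch below uses the classic arrangement of 7 blocks into 7 booklets of 3 blocks each; it is not an actual NAEP booklet map, and the block numbering and the `spiral` helper are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, cycle, islice

NUM_BLOCKS = 7

# Booklet i holds blocks i, i+1, and i+3 (mod 7): a cyclic balanced
# incomplete block design with 7 booklets of 3 blocks each.
booklets = [(i % NUM_BLOCKS, (i + 1) % NUM_BLOCKS, (i + 3) % NUM_BLOCKS)
            for i in range(NUM_BLOCKS)]

# Pairing balance: how often each pair of blocks shares a booklet.
pair_counts = Counter(frozenset(pair)
                      for booklet in booklets
                      for pair in combinations(booklet, 2))

# Position balance: how often each block appears first, second, or third.
position_counts = Counter((position, block)
                          for booklet in booklets
                          for position, block in enumerate(booklet))


def spiral(num_booklets, num_students):
    """Hand out booklet numbers in rotation so that any group of students
    receives the booklet types in roughly equal proportions."""
    return list(islice(cycle(range(num_booklets)), num_students))


session_assignment = spiral(len(booklets), 30)
```

In this design every pair of blocks appears together in exactly one booklet and every block appears once in each position, yet no booklet contains all blocks, which is what makes the design both balanced and incomplete.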


Data Collection and Processing

Since 1983, NCES has conducted NAEP through a series of contracts, grants, and cooperative agreements with the Educational Testing Service (ETS) and other contractors. ETS is directly responsible for developing the assessment instruments, analyzing the data, and reporting the results. Westat selects the school and student samples, trains assessment administrators, and manages field operations (including assessment administration and data collection activities). NCS Pearson is responsible for printing and distributing the assessment materials and for scanning and scoring students’ responses. Contractors are subject to change in future contracts.

Reference Dates/Testing Window. Data for the main national NAEP and main state NAEP are collected from the last week in January through the first week in March. Data for the long–term trend NAEP are collected during the fall for age 13; during the winter of the same school year for age 9; and during the spring for age 17.

Data Collection. Before 2002, NCES had relied heavily on school personnel to administer NAEP assessments. Beginning with the 2002 assessments, however, NAEP field staff has administered NAEP assessment sessions. Obtaining the cooperation of the selected schools requires substantial time and energy, involving a series of mailings that includes letters to the chief state school officers and district superintendents to notify the sampled schools of their selection; additional mailings of informational materials; and introductory in–person meetings where procedures are explained.

The corresponding teacher and school questionnaires are available online ahead of the NAEP assessment (typically more than six weeks before the assessment window begins).

NCS Pearson produces the materials needed for NAEP assessments. NCS Pearson prints identifying barcodes and numbers for the booklets and questionnaires, pre–assigns the booklets to testing sessions, and prints the booklet numbers on the administration schedule. These activities improve the accuracy of data collection and assist with the BIB spiraled distribution process. With the introduction of technology–based assessments (TBA), all responses will be collected electronically.

Assessment exercises are administered either to individuals or to small groups of students by specially trained field personnel. For all three ages in the long-term trend NAEP, the mathematics questions were administered using a paced audiotape before 2004. Since 2004, the long-term trend assessments have been administered through test booklets read by the students.

For the long-term trend assessments, Westat hires and trains approximately 85 field staff to collect the data. For the 2009 main national and state assessments, Westat hired and trained about 7,000 field staff to conduct the assessments.

After each session, Westat staff interview the assessment administrators to receive their comments and recommendations. As a final quality control step, a debriefing meeting is held with the state supervisors to receive feedback that will help improve procedures, documentation, and training for future assessments.

For the NIES survey questionnaire, NCES data collection contractor staff visit the schools to administer survey questionnaires. Students complete the questionnaires in group settings proctored by study representatives. In order to decrease the possibility that survey responses might be adversely affected by students’ reading levels, the questions are read aloud to all grade 4 students and to grade 8 students who school staff think might need assistance. In addition, the study representatives are available to answer any questions that students have as they work on the questionnaires.

For both NIES and NAEP, teachers and school administrators are asked to complete the questionnaires on their own. While the vast majority of teachers and schools complete these questionnaires online, there is a paper questionnaire option for those that need it.


Data Processing. NCS Pearson handles all receipt control, data preparation and processing, scanning, and scoring activities for NAEP. Using an optical scanning machine, NCS Pearson staff scans the multiple–choice selections, the handwritten student responses, and other data provided by students, teachers, and administrators. An intelligent data entry system is used for resolution of the scanned data, the entry of documents rejected by the scanning machine, and the entry of information from the questionnaires. An image–based scoring system introduced in 1994 virtually eliminates paper handling during the scoring process. This system also permits online monitoring of scoring reliability and creation of recalibration sets.

ETS develops focused, explicit scoring guides with defined criteria that match the criteria emphasized in the assessment frameworks. The scoring guides are reviewed by subject-area and measurement specialists, the instrument development committees, NCES, and NAGB to ensure consistency with both question wording and assessment framework criteria. Training materials for scorers include examples of student responses from the actual assessment for each performance level specified in the guides. These exemplars help scorers interpret the scoring guides consistently, thereby ensuring the accurate and reliable scoring of diverse responses.

The image–based scoring system allows scorers to assess and score student responses online. This is accomplished by first scanning the student response booklets, digitizing the constructed responses, and storing the images for presentation on a large computer monitor. The range of possible scores for an item also appears on the display; scorers click on the appropriate button for quick and accurate scoring. The image–based scoring system facilitates the training and scoring process by electronically distributing responses to the appropriate scorers and by allowing ETS and NCS Pearson staff to monitor scorer activities consistently, identify problems as they occur, and implement solutions expeditiously. The system also allows the creation of calibration sets that can be used to prevent drift in the scores assigned to questions. This is especially useful when scoring large numbers of responses to a question (e.g., more than 30,000 responses per question in the state NAEP). In addition, the image–based scoring system allows all responses to a particular exercise to be scored continuously until the item is finished, thereby improving the validity and reliability of scorer judgments. The newer computer–based assessments do not require scanning.

The reliability of scoring is monitored during the coding process through (1) backreading, where scoring supervisors review a portion of each scorer’s work to confirm a consistent application of scoring criteria across a large number of responses and across time; (2) daily calibration exercises to reinforce the scoring criteria after breaks of more than 15 minutes; and (3) a second scoring of some of the items appearing only in the main national assessment, as well as some of the items appearing in both the main national and state assessments (and a comparison of the two scores to give a measure of interrater reliability). To monitor agreement across years, a random sample of responses from previous assessments (for identical items) is systematically interspersed among current responses for rescoring. If necessary, current assessment results are adjusted to account for any differences.

To test scoring reliability, constructed-response item score statistics are calculated for the portion of responses that are scored twice. Cohen’s Kappa is the reliability estimate used for dichotomized items and the intraclass correlation coefficient is used as the index of reliability for nondichotomized items. Scores are also constructed for items that are rescored in a later assessment. For example, some 2007 reading and mathematics items were rescored in 2009.
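For the dichotomized case, Cohen's Kappa compares the observed agreement between the first and second scoring with the agreement expected by chance from each scoring's marginal distribution. The sketch below shows the standard formula; the ten pairs of scores are invented for illustration.

```python
from collections import Counter


def cohens_kappa(first_scores, second_scores):
    """Cohen's Kappa for two scorings of the same responses."""
    if len(first_scores) != len(second_scores) or not first_scores:
        raise ValueError("score lists must be non-empty and the same length")
    n = len(first_scores)
    observed = sum(a == b for a, b in zip(first_scores, second_scores)) / n
    first_marginals = Counter(first_scores)
    second_marginals = Counter(second_scores)
    expected = sum(first_marginals[category] * second_marginals[category]
                   for category in first_marginals) / (n * n)
    if expected == 1.0:
        return 1.0  # both scorings used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical first and second scores for ten responses (1 = credit, 0 = none).
first = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
second = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(first, second)
# Observed agreement 0.8, chance agreement 0.5, so kappa = 0.3 / 0.5 = 0.6
```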


Editing. The first phase of data editing takes place during the keying or scanning of the survey instruments. Machine edits verify that each sheet of each document is present and that each field has an appropriate value. The edit program checks each booklet number against the session code for appropriate session type, the school code against the control system record, and other data fields on the booklet cover for valid ranges of values. It then checks each block of the document for validity, proceeding through the items within the block. Each piece of input data is checked to verify that it is of an acceptable type, that the value falls within a specified range of values, and that it is consistent with other data values. At the end of this process, a paper edit listing of data errors is generated for nonimage and key-entered documents. Image-scanned items requiring correction are displayed at an online editing terminal.
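The three kinds of machine check named above (acceptable type, valid range, and consistency with other values) might look like the following sketch. All field names, codes, and lookup tables here are hypothetical; the actual edit specifications are not given in this section.

```python
VALID_GRADES = {4, 8, 12}


def machine_edit(record, session_type_by_booklet, known_school_codes):
    """Return a list of edit errors for one booklet cover record."""
    errors = []
    booklet = record.get("booklet_number")
    if not isinstance(booklet, int):
        # Type check: the booklet number must be an integer.
        errors.append("booklet_number: not an integer")
    elif session_type_by_booklet.get(booklet) != record.get("session_type"):
        # Consistency check: booklet number against the session type.
        errors.append("session_type: does not match booklet")
    if record.get("school_code") not in known_school_codes:
        # Consistency check: school code against the control system record.
        errors.append("school_code: not in control system")
    if record.get("grade") not in VALID_GRADES:
        # Range check: only the assessed grades are valid.
        errors.append("grade: out of range")
    return errors


# Hypothetical control tables and two records.
sessions = {1001: "mathematics", 1002: "reading"}
schools = {"IL-0042", "IL-0077"}
clean_errors = machine_edit(
    {"booklet_number": 1001, "session_type": "mathematics",
     "school_code": "IL-0042", "grade": 8},
    sessions, schools)
bad_errors = machine_edit(
    {"booklet_number": 1002, "session_type": "mathematics",
     "school_code": "IL-9999", "grade": 7},
    sessions, schools)
```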

In the second phase of data editing, experienced editing staff review the errors detected in the first phase, compare the processed data with the original source document, and indicate whether the error is correctable or noncorrectable per the editing specifications. Suspect items found to be correct as stated, but outside the edit specifications, are passed through modified edit programs. For nonimage and key-entered documents, corrections are made later via key-entry. For image-processed documents, suspect items are edited online. The edit criteria for each item in question appear on the screen along with the item, and corrections are made immediately. Two different people view the same suspect item and operate on it separately; a “verifier” ensures that the two responses are the same before the system accepts that item as correct.

For assessment items that must be paper-scored rather than scored using the image system (as was the case for some mathematics items in the 1996 NAEP), the score sheets are scanned on a paper-based scanning system and then edited against tables to ensure that all responses were scored with only one valid score and that only raters qualified to score an item were allowed to score it. Any discrepancies are flagged and resolved before the data from that scoring sheet are accepted into the scoring system.

In addition, a count-verification phase systematically compares booklet IDs with those listed in the NAEP administration schedule to ensure that all booklets expected to be processed were actually processed. Once all corrections are entered and verified, the corrected records are pulled into a mainframe data set and then re-edited with all other records. The editing process is repeated until all data are correct.
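Count verification is essentially a two-way comparison of identifier sets. A minimal sketch, with hypothetical booklet IDs and a hypothetical function name:

```python
def verify_counts(scheduled_ids, processed_ids):
    """Compare booklet IDs on the administration schedule with those processed."""
    scheduled = set(scheduled_ids)
    processed = set(processed_ids)
    return {
        "missing": sorted(scheduled - processed),      # expected, never processed
        "unexpected": sorted(processed - scheduled),   # processed, not on schedule
    }


report = verify_counts([1001, 1002, 1003, 1004], [1001, 1002, 1004, 2001])
```

An empty report in both categories indicates that every expected booklet was processed and nothing extra entered the data set.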


Estimation Methods

Once NAEP data are scored and compiled, data from schools and students are weighted according to the sample design and population structure and then adjusted for nonresponse. This ensures that results of the assessments are fully representative of the target populations. The analyses of NAEP data for most subjects are conducted in two phases: scaling and estimation. During the scaling phase, item response theory (IRT) procedures are used to estimate the measurement characteristics of each assessment question. During the estimation phase, the results of the scaling are used to produce estimates of scale score distributions for groups of students in the various subject areas by applying marginal maximum likelihood (MML) methodology.

Weighting. The weighting for the national and state samples reflects the probability of selection for each student in the sample, adjusted for school and student nonresponse. The weight assigned to a school’s or student’s response is the inverse of the probability that the student would be selected for the sample. Prior to 2002, poststratification was used to ensure that the results were representative of certain subpopulations corresponding to figures from the U.S. Census and the Current Population Survey (CPS).

Student base weights. The base weight assigned to a student is the reciprocal of the probability that the student would be selected for a particular assessment. This probability is the product of the following two factors:

  • the conditional probability that the school would be selected, given the strata; and 
  • the conditional probability, given the school, that the student would be selected within the school.
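The base weight follows directly from the two factors listed above. In the sketch below the probabilities are hypothetical values chosen to make the arithmetic easy to follow.

```python
def student_base_weight(p_school, p_student_given_school):
    """Reciprocal of the overall probability of selecting the student."""
    for p in (p_school, p_student_given_school):
        if not 0.0 < p <= 1.0:
            raise ValueError("selection probabilities must be in (0, 1]")
    return 1.0 / (p_school * p_student_given_school)


# Hypothetical case: the school is selected with probability 0.05, and
# 30 of its 120 eligible students are sampled (probability 0.25).
weight = student_base_weight(0.05, 30 / 120)
# Overall probability 0.0125, so the student represents about 80 students.
```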

Nonresponse adjustments of base weights. The base weight for a selected student is adjusted by two nonresponse factors. The first factor adjusts for sessions that were not conducted. This factor is computed separately within classes formed by the first three digits of strata (formed by crossing the major stratum and the first socioeconomic characteristic used to define the final stratum). Occasionally, additional collapsing of classes is necessary to improve the stability of the adjustment factors, especially for the smaller assessment components. The second factor adjusts for students who failed to appear in the scheduled session or makeup session. This nonresponse adjustment is completed separately for each assessment. For assessed students in the trend samples, the adjustment is made separately for classes of students based on subuniverse and modal grade status. For assessed students in the main samples, the adjustment classes are based on subuniverse, modal grade status, and race class. In some cases, nonresponse classes are collapsed into one class to improve the stability of the adjustment factors.
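A generic weighting-class adjustment of this kind can be sketched as follows. Within each class, respondents' weights are inflated so that they also carry the weight of the nonrespondents. This is the standard textbook form, offered as an assumption; the exact NAEP factors, class definitions, and collapsing rules are not reproduced here, and the field names are hypothetical.

```python
from collections import defaultdict


def adjust_for_nonresponse(students):
    """Return respondents with weights inflated within adjustment classes.

    Each student is a dict with 'weight', 'adj_class', and 'responded'.
    The factor for a class is the weighted count of all sampled students
    divided by the weighted count of respondents in that class.
    """
    sampled_total = defaultdict(float)
    respondent_total = defaultdict(float)
    for student in students:
        sampled_total[student["adj_class"]] += student["weight"]
        if student["responded"]:
            respondent_total[student["adj_class"]] += student["weight"]
    adjusted = []
    for student in students:
        if student["responded"]:
            cls = student["adj_class"]
            factor = sampled_total[cls] / respondent_total[cls]
            adjusted.append({**student, "weight": student["weight"] * factor})
    return adjusted


# Hypothetical sample: class "A" loses one of three students to absence.
sample = [
    {"weight": 80.0, "adj_class": "A", "responded": True},
    {"weight": 80.0, "adj_class": "A", "responded": True},
    {"weight": 80.0, "adj_class": "A", "responded": False},
    {"weight": 50.0, "adj_class": "B", "responded": True},
]
adjusted_sample = adjust_for_nonresponse(sample)
```

A class with no respondents would cause a division by zero in this sketch, which is one reason small classes are collapsed before the factors are computed.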

NIES survey questionnaire weighting. For the survey questionnaire component of NIES, the school probability of selection is a function of three factors: NAEP selection, the probability of being retained for the survey questionnaire component of NIES, and the number of AI/AN students in the NAEP sample per school. Nonresponse adjustments at the school level attempt to mitigate the impact of differential response by school type (public, private, and BIE), region, and estimated percentage enrollment of AI/AN students. For student weights, nonresponse adjustments take into account differential response rates based on student age (above age for grade level or not) and English language learner status. In order to partially counteract the negative impact of low private school participation, a poststratification adjustment is applied to the NIES survey questionnaire weights. The relative weighted proportions of students from public, private, and BIE schools, respectively, are adjusted to match those from the data of the assessment component of NIES. This not only ensures greater consistency between the findings of the two NIES components, but since the proportions of students are more reliably estimated from the NIES assessment data (which involve a far larger school sample than the survey questionnaire), this weight adjustment also increases the accuracy and reliability of the NIES survey questionnaire results.
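The poststratification step can be sketched as a rescaling of weights within each school type so that the weighted shares match external targets. The weights and target shares below are invented for illustration, and the function is an assumption about the general technique rather than the NIES procedure itself.

```python
from collections import defaultdict


def poststratify(records, target_shares):
    """Rescale weights so weighted shares by school type match the targets.

    records: list of (school_type, weight) pairs.
    target_shares: dict of school_type -> share, summing to 1.
    The overall weighted total is preserved.
    """
    totals = defaultdict(float)
    for school_type, weight in records:
        totals[school_type] += weight
    grand_total = sum(totals.values())
    factors = {school_type: target_shares[school_type] * grand_total / total
               for school_type, total in totals.items()}
    return [(school_type, weight * factors[school_type])
            for school_type, weight in records]


# Hypothetical questionnaire weights in which private schools are underrepresented.
questionnaire_records = [("public", 80.0), ("public", 100.0),
                         ("private", 5.0), ("bie", 15.0)]
# Hypothetical shares taken from the larger assessment component.
assessment_shares = {"public": 0.88, "private": 0.06, "bie": 0.06}
adjusted_records = poststratify(questionnaire_records, assessment_shares)
```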

Scaling. To summarize item responses, NAEP uses a scaling technique that has its roots in IRT procedures and the theories of imputation of missing data.

The first step in scaling is to determine the percentage of students who give various responses to each cognitive, or subject-matter, question and each background question. For cognitive questions, a distinction is made between missing responses at the end of a block (i.e., missing responses after the last question the student answered) and missing responses before the last observed response. Missing responses before the last observed response are considered intentional omissions. Missing responses at the end of a block are generally considered “not reached” and treated as if the questions had not been presented to the student. In calculating response percentages for each question, only students classified as having been presented that question are used in the analysis. Each cognitive question is also examined for differential item functioning (DIF). DIF analyses identify questions on which the scores of different subgroups of students at the same ability level differ significantly.
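The distinction between omitted and not-reached questions depends only on where the last observed response falls within the block. A minimal sketch, using `None` for a missing response (the labels and function name are illustrative):

```python
def classify_block(responses):
    """Label each question in a block as answered, omitted, or not reached."""
    observed_positions = [i for i, r in enumerate(responses) if r is not None]
    last_observed = observed_positions[-1] if observed_positions else -1
    labels = []
    for position, response in enumerate(responses):
        if response is not None:
            labels.append("answered")
        elif position < last_observed:
            labels.append("omitted")        # skipped before the last answer
        else:
            labels.append("not_reached")    # trailing blanks at end of block
    return labels


labels = classify_block(["A", None, "C", None, None])
# -> ['answered', 'omitted', 'answered', 'not_reached', 'not_reached']
```

When response percentages are computed for a question, students labeled not reached on that question would be left out of the denominator, since they are treated as not having been presented the question.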

Development of scales. For the main assessments, the frameworks for the different subject areas dictate the number of subscales required. In the 2009 NAEP, five subscales were created for the main assessment in mathematics in grades 4 and 8 (one for each mathematics content strand), and three subscales were created for science (one for each field of science: Earth, physical, and life). Generally, a composite scale is also created as an overall measure of students’ performance in the subject area being assessed (e.g., mathematics). The composite scale is a weighted average of the separate subscales for the defined subfields or content strands. For the long-term trend assessments, a single scale is used for summarizing proficiencies at each age in mathematics and reading.
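As a small illustration of the composite scale as a weighted average of subscales, the sketch below uses the three science fields named above. The subscale scores and the weights are hypothetical; the actual weights are set by the assessment framework and are not given in this text.

```python
def composite_score(subscale_scores, weights):
    """Weighted average of subscale scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(subscale_scores[name] * weights[name] for name in weights) / total_weight

# Hypothetical subscale means and framework weights.
scores = {"physical": 148.0, "life": 152.0, "earth": 150.0}
weights = {"physical": 0.3, "life": 0.3, "earth": 0.4}

composite = composite_score(scores, weights)
```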

Within-grade vs. cross-grade scaling. The reading and mathematics main NAEP assessments were developed with a cross-grade framework, where the trait being measured was conceptualized as cumulative across the grades of the assessment. Accordingly, a single 0−500 scale was established for all three grades in each assessment. In 1993, however, the Governing Board determined that future NAEP assessments should be developed using within-grade frameworks and be scaled accordingly. This both removed the constraint that the trait being measured is cumulative and eliminated the need for overlap of questions across grades. Any questions that happen to be the same across grades are scaled separately for each grade, thus making it possible for common questions to function differently in the separate grades.

The 1994 history and geography assessments were developed and scaled within grade, according to NAGB’s new policy. The scales were aligned so that grade 8 had a higher mean than grade 4 and grade 12 had a higher mean than grade 8. The 1994 reading assessment, however, retained a cross-grade framework and scaling. All three main assessments in 1994 used scales ranging from 0 to 500.

The 2008 long-term trend assessments remained cross-age, using a 0−500 scale. The 2009 main science assessment was developed within-grade, but adopted new scales ranging from 0 to 300. The 2005 main assessment in mathematics continued to use a cross-grade framework with a 0−500 scale in grades 4 and 8, but used a 0–300 within-grade scale for 12th grade. In 1998, reading, writing and civics assessments were scaled within-grade.

Linking of scales. Before 2002, results for the main state assessments were linked to the scales for the main national assessments, enabling state and national trends to be studied. Equating the results of the state and national assessments depended on those parts of the main national and state samples that represented a common population: (1) the state comparison sample—students tested in the national assessment who come from the jurisdictions participating in the state NAEP; and (2) the state aggregate sample—the aggregate of all students tested in the state NAEP. Since 2002, the national sample has been a superset of the state samples (except in those states that do not participate).

Imputation. Until the 2002 NAEP assessment, no statistical imputations were generated for missing values in the teacher, school, or SD/ELL questionnaires, or for missing answers to cognitive questions. Most answers to cognitive questions are missing by design. For example, 8th-grade students being assessed in reading are presented with, on average, 21 of the 110 assessment items. Whether any given student gets any of the remaining 89 individual questions right or wrong is not something that NAEP imputes. However, since 1984, multiple imputation techniques have been used to create plausible values. Once created, subsequent users can analyze these plausible values with common software packages to obtain NAEP results that properly account for NAEP’s complex item sampling designs.

Trying to use partial scores based on the small proportion of the assessment to which any given student is exposed would lead to biased results for group scores due to an inherently large component of measurement error. NAEP developed a process of group score calculation in order to get around the unreliability and noncomparability of NAEP’s partial test forms for individuals. NAEP estimates group score distributions using MML estimation, a method that calculates group score distributions based directly on each student’s responses to cognitive questions, not on summary scores for each student. As a result, the unreliability of individual-level scores does not decrease NAEP’s accuracy in reporting group scores. The MML method does not employ imputations of answers to any questions or of scores for individuals.

Imputation is performed in three stages. The first stage requires estimating IRT parameters for each cognitive question. The second stage results in MML estimation of a set of regression coefficients that capture the relationship between group score distributions and nearly all the information from the variables in the teacher, school, or SD/ELL questionnaires, as well as geographical, sample frame, and school record information. The third stage involves the imputation that is designed to reproduce the group-level results that could be obtained during the second stage.

NAEP’s imputations follow Rubin’s (1987) proposal that the imputation process be carried out several times, so that the variability associated with group score distributions can be accurately represented. NAEP estimates five plausible values for each student. Each plausible value is a random selection from the joint distribution of potential scale scores that fit the observed set of responses for each student and the scores for each of the groups to which each student belongs. Estimates based on plausible values are more accurate than if a single (necessarily partial) score were to be estimated for each student and averaged to obtain estimates of subgroup performances. Using the plausible values eliminates the need for secondary analysts to have access to specialized MML software and ensures that the estimates of average performance of groups and estimates of variability in those averages are accurate.
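For a secondary analyst, combining results across the five plausible values typically follows the standard multiple-imputation combining rules. The sketch below shows those generic rules only: the group estimates and their sampling variances are hypothetical numbers, and the sampling variances are assumed to have been computed separately (for example, with a replication method). It is not a description of NAEP's own variance estimation software.

```python
from statistics import mean, variance

def combine_plausible_values(estimates, sampling_variances):
    """Combine one group statistic computed once per plausible value.

    estimates          -- the statistic computed with each plausible value
    sampling_variances -- the sampling variance of each of those estimates
    Returns (point_estimate, total_variance).
    """
    m = len(estimates)
    point_estimate = mean(estimates)       # average over plausible values
    within = mean(sampling_variances)      # average sampling variance
    between = variance(estimates)          # variance across plausible values (divisor m - 1)
    total_variance = within + (1 + 1 / m) * between
    return point_estimate, total_variance

# Hypothetical group means from five plausible values, with sampling variances.
estimates = [281.2, 280.7, 281.5, 280.9, 281.1]
sampling_variances = [0.64, 0.66, 0.62, 0.65, 0.63]

point, total_var = combine_plausible_values(estimates, sampling_variances)
standard_error = total_var ** 0.5
```

The between-imputation term is what carries the measurement uncertainty that a single score per student would hide.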

Recent Changes

Several important changes have been implemented since 1990.

  • Beginning with the 1990 mathematics assessment, NAGB established three levels for reporting NAEP results: basic, proficient, and advanced.
  • In 1990, state assessments were added to NAEP. The 1990 to 1994 assessments are referred to as trial state assessments.
  • In 1992, a generalized partial-credit model (GPCM) was introduced to develop scales for the more complex constructed-response questions. The GPCM model permits the scaling of questions scored according to multipoint rating schemes.
  • In 1993, NAGB determined that future NAEP assessments should have within-grade frameworks and scales. The 1994 main history and geography assessments followed this new policy, as did the 1996 main science assessment, and the 1998 writing assessment. Mathematics and reading in the main NAEP will continue to have cross-grade scales until further action by NAGB (and a parallel change in the trend assessment), except for mathematics at grade 12, which was removed from cross-grade scales and reported in a within-grade scale in 2005.
  • In 1994, the new image-based scoring system virtually eliminated paper handling during the scoring process. This system also permits scoring reliability to be monitored online and recalibration methods to be introduced.
  • The 1996 main NAEP included new samples for the purpose of studying greater inclusion of SD/LEP students and obtaining data on students eligible for advanced mathematics or science sessions.
  • In 1997, there was a probe of student performance in the arts.
  • New assessment techniques included: open-ended items in the 1990 mathematics assessment; primary trait, holistic, and writing mechanics scoring procedures in the 1992 writing assessment; the use of calculators in the 1990, 1992, 1996, and 2000 mathematics assessments; a special study on group problem solving in the 1994 history assessment; and a special study in theme blocks in the 1996 mathematics and science assessments.
  • Beginning in 1998, testing accommodations were provided in the NAEP reading assessments. In this transition to a more inclusive NAEP, administration procedures were introduced that allowed the use of accommodations (e.g., extra time, individual rather than group administration) for students who required them to participate. During this transition period, reading results in 1998 were reported for two separate samples: one in which accommodations were not permitted and one in which accommodations were permitted. Beginning in 2002, accommodations were permitted for all reading administrations.
  • In 1999, NAGB discontinued the long-term trend assessment in writing for technical reasons. More recently, NAGB decided that changes were needed to the design of the science assessment and, given recent advances in the field of science, to its content. As a result, the science long-term trend assessment was not administered in 2003-04 or in subsequent administrations.
  • With the expansion and redesign of NAEP under the No Child Left Behind Act, NAEP’s biennial state-level assessments are being administered by contractor staff (not local teachers). The newly redesigned NAEP has four important features. First, NAEP administers tests for different subjects (such as mathematics, science, and reading) in the same classroom, thereby simplifying and speeding up sampling, administration, and weighting. Second, NAEP conducts pilot tests of candidate items for the next assessment and field tests of items for precalibration in advance of data collection, thereby speeding up the scaling process. Third, NAEP conducts bridge studies, administering tests under both the new and the old conditions, thereby making it possible to link old and new findings. Finally, NAEP is adding test questions at the upper and lower ends of the difficulty spectrum, thereby increasing NAEP’s power to measure performance gaps.
  • Beginning in 2002, the NAEP national sample for the main national assessment was obtained by aggregating the samples from each state, rather than by obtaining an independently selected national sample. Prior to 2002, separate samples were drawn for the NAEP main national and state assessments.
  • In 2002, TUDA began assessing performance in five large urban districts with reading and writing assessments. TUDA continued in 2003 in nine large urban districts with reading and mathematics and in 2005 in 10 large urban districts with reading, mathematics, and science. As of 2013, 21 urban school districts were included in the TUDA program.
  • Beginning with the 2003 NAEP, each state must have participation from at least 85 percent—instead of 70 percent—of the schools in the original sample in order to have its results published.
  • In 2003 and 2005, Puerto Rico participated in the NAEP assessment of mathematics. However, Puerto Rico was excused from the NAEP assessment of reading in English because Spanish is the language of instruction in Puerto Rico. NCES also administered the 2007 mathematics assessment in Puerto Rico. In 2007, a representative sample of approximately 2,800 students in 100 schools was assessed at both grade 4 and grade 8. In 2011, public school students in Puerto Rico at grades 4 and 8 participated in a research study using a Spanish-language version of NAEP in mathematics. This was not a full assessment, so results were not reported until they could be verified with the 2013 assessment.
  • In 2004, several changes were implemented to the NAEP long-term trend assessments to reflect changes in NAEP policy, maintain the integrity of the assessments, and increase the validity of the results obtained. The changes to the assessment instruments include: removal of science items; inclusion of students with disabilities and English language learners; replacement of items that used outdated contexts; creation of a separate background questionnaire; elimination of “I don't know” as a response option for multiple-choice items; and use of assessment booklets that pertain to a single subject area (whereas in the past, a single assessment booklet may have contained both reading and mathematics items).
  • In 2005, NAGB introduced changes in the NAEP mathematics framework for grade 12 in both the assessment content and administration procedures. One of the major differences between the 2005 assessment and previous assessments at grade 12 is that the five content areas were collapsed into four areas, with geometry and measurement being combined. In addition, the assessment included more questions on algebra, data analysis, and probability to reflect changes in high school mathematics standards and coursework. The overall average mathematics score in 2005 was set at 150 on a 0–300 scale.
  • In 2006, economics was assessed at grade 12 for the first time. The NAEP economics assessment results present a broad view of how well our nation’s students at grade 12 understand economics and have knowledge of the workings of domestic and international economics. More than 11,000 grade 12 students in approximately 600 public and private schools across the nation were assessed. A within-grade scale was developed, with the overall average economics score in 2006 set at 150 on a 0–300 scale.
  • In 2009, the reading framework changed to include more emphasis on literary and informational texts, a redefinition of reading cognitive processes, a systemic assessment of vocabulary knowledge, and the addition of poetry to grade 4. Results from special analyses conducted in 2009 determined that, even with these changes to the assessment, results could continue to be compared to those from earlier assessments.
  • In 2009, TUDA was expanded to 18 large urban districts, assessing reading, mathematics and science. In addition, 11 states were assessed in reading and mathematics at grade 12 on a trial basis. In 2011, TUDA expanded to 21 large urban districts, assessing reading and mathematics.
  • In 2009, interactive computer tasks in science were administered online at grades 4, 8, and 12. These tasks consisted of simulations for the students to draw inferences and conclusions about a problem.
  • In 2011, NAEP administered its first computer-based assessment in writing at grades 8 and 12. A pilot test of students at grade 4 was also conducted in 2012. The empirical correlations observed between performance and the contextual and demographic factors largely supported the predictions specified in the conceptual model, including the key prediction that the differential effects of the computer on the writing performance of high- and non-high-performing fourth-graders would be related to their prior exposure to writing on the computer.
  • In 2015, NAEP began a phased approach to transition its paper-and-pencil assessments to digital-based assessments and delivery, starting with a pilot test for mathematics, reading, and science assessments using the latest technology tools. Results are not available at this time.
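The generalized partial-credit model introduced in 1992 (see the list above) assigns a probability to each score category of a multipoint item. The sketch below uses a generic textbook form of the model with one discrimination parameter and a list of step parameters; NAEP's operational parameterization may differ (for example, in scaling constants), and the parameter values shown are hypothetical.

```python
import math

def gpcm_probabilities(theta, a, steps):
    """Category probabilities under a generic generalized partial-credit model.

    theta -- student proficiency
    a     -- item discrimination (hypothetical value in the example)
    steps -- step parameters b_1..b_m for an item scored 0..m
    Returns [P(score = 0), ..., P(score = m)].
    """
    # Cumulative sums z_k = sum over v <= k of a * (theta - b_v), with z_0 = 0.
    z = [0.0]
    for b in steps:
        z.append(z[-1] + a * (theta - b))

    # Subtract the maximum before exponentiating for numerical stability.
    z_max = max(z)
    exp_z = [math.exp(value - z_max) for value in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]

# Hypothetical three-category item (scores 0, 1, 2).
low = gpcm_probabilities(theta=-1.0, a=1.0, steps=[-0.5, 0.5])
high = gpcm_probabilities(theta=1.0, a=1.0, steps=[-0.5, 0.5])
```

With these hypothetical parameters, a student with higher proficiency has a larger probability of the top score category than a student with lower proficiency, which is the behavior a multipoint rating scheme relies on.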

Future Plans

Main NAEP assessments are scheduled for annual administration. Reading and mathematics are assessed every 2 years in odd-numbered years; science and writing are scheduled to be assessed every 4 years (in the same years as reading and mathematics, but alternating with each other); and other subjects are assessed at the national level in even-numbered years. A new, computer-based assessment, Technology and Engineering Literacy, was piloted in 2013, and a full assessment was conducted at grade 8 in 2014. NAEP broadly defines technological and engineering literacy as the capacity to use, understand, and evaluate technology as well as to understand technological principles and strategies needed to develop solutions and achieve goals. For the full NAEP Assessment schedule, see http://nces.ed.gov/nationsreportcard/about/assessmentsched.asp.

The NAEP program is in the midst of transitioning all of its assessments to digitally based content and delivery. Beginning in 2017, the NAEP mathematics, reading, and writing assessments will be administered to students throughout the nation on NAEP-provided tablets. Some questions may include multimedia such as audio and video; other questions may allow the use of embedded technological features (such as an onscreen calculator) to form a response. Additional subjects will be administered on tablets in 2018 and 2019. NCES will also pilot science interactive computer tasks (ICTs) and hands-on tasks (HOTs).

To continue moving the NAEP program forward, a summit of diverse experts in assessment, measurement, cognition, and technology was convened in August 2011 and January 2012. These experts discussed and debated ideas for the future of NAEP. NCES convened its most recent workshop in January 2013. State and district assessment staff met to develop and prioritize recommendations for NAEP.

NIES is shifting from a two-year administration cycle to a four-year administration cycle. The most recent NIES administration was conducted in 2015.

1 The modified assessment included new items and features, representing the new design.
2 The bridge assessment replicates the assessment given in the previous assessment year.
