
NAEP Technical Documentation: NAEP Pre-Tests

Overview of NAEP Pre-Test Administration Types

Prior to their use in an operational NAEP assessment, cognitive assessment items are subject to pre-testing and associated analyses. In NAEP, two types of pre-tests are conducted: pilot tests and field tests. The differences between the two types of pre-tests are described below. The pre-tests are administered to samples of students during the same testing window as the operational NAEP assessments.

The table below provides an overview of pre-tests starting with the 2006 NAEP administration.

Overview of pre-tests by subject and year: 2006–2010

Pre-test year    Operational assessment year    Grades/Ages        Subject
2006             2007                           Grades 8, 12       Writing
2007             2009                           Grades 4, 8, 12    Reading
2007             2009                           Grades 4, 8        Mathematics
2007             2009                           Grades 4, 8        Mathematics in Puerto Rico
2008             2009                           Grades 4, 8, 12    Science
2008             2009                           Grade 12           Reading
2008             2012                           Ages 9, 13, 17     Long-term trend mathematics
2009             2011                           Grades 4, 8        Mathematics, reading
2009             2010                           Grades 4, 8, 12    Civics, geography, U.S. history
2010             2013                           Grades 4, 8, 12    Writing

Pilot Tests

Before a test question is used as part of a NAEP assessment, it is given to students as part of a pilot test. The purpose of the pilot test is to obtain information regarding clarity, difficulty levels, timing, feasibility, and special administrative situations. From this pilot test, items are selected for inclusion in the operational assessment. Since 2002, pilot tests have generally been administered to nationally representative samples of students, although in earlier years these samples were not required to be nationally representative. In most NAEP subjects, each pilot item is administered to approximately 500 students. Larger sample sizes of at least 1,500 students per item have been used for the reading and mathematics pilot tests at grades 4 and 8 since 2009. In this way, the pilot test provides an indication of the feasibility of newly created test questions intended to measure aspects of the subject-area framework.

Assessment items developed for a pilot test are created in the year prior to the intended pilot administration (e.g., items were created in 2006 for piloting in 2007). The steps involved in the item development process are described in the instruments section. After items are created and approved for pre-testing, blocks of items are constructed, and the newly developed blocks are then printed in pilot books. The pilot blocks are typically bundled with the operational assessments administered at the same time; however, in some circumstances separate pilot assessment sessions are created, depending on the complexity of the operational assessment sessions.

Field Tests (Pre-Calibration)

A field test is the second phase of pre-testing and is given one year prior to the operational NAEP assessment. The purpose of this phase is to facilitate the analysis of the assessment data in the operational year by pre-calibrating the items. The questions selected for inclusion in the field test are administered to a nationally representative sample of students. The sample size is approximately 2,000 students for each block of field test items, which is sufficient to perform item calibration analyses. All newly created assessments in a subject area include a field test administration. After an assessment has been administered operationally for the first time, there is no longer a need for pre-calibration; therefore, field tests are not conducted prior to every operational assessment. However, questions developed for use in the 2003 through 2009 assessments in mathematics and reading at grades 4 and 8 were field tested before every operational assessment, to ensure that the test scaling could be performed as efficiently as possible in the operational year.

Procedures Used to Determine Sample

The sample designs for pilot and field tests conducted during a given assessment year have two goals: (1) ensuring wide coverage of the student population for which the final assessment will be conducted, and (2) causing as little disruption as possible to the operational assessments being conducted at the same time. Thus, the pilot test samples are often highly integrated with the operational samples drawn in the same year. However, because some NAEP administrations involve a variety of assessment types, pilot and field tests are at times conducted in separate sessions. This results in four broad types of sample design for pilot and field tests:

  • Type A – The pilot or field test material is included in the spiral for a combined state-national operational assessment (for example, questions intended for the 2009 mathematics assessment were piloted in 2007).
  • Type B – The pilot or field test material is included in the spiral for a national operational assessment.
  • Type C – The pilot or field test material is administered in the same schools as a national operational assessment, but in separate administration sessions.
  • Type D – The pilot or field test is administered in a completely separate sample of schools, selected from a sample of Primary Sampling Units (PSUs). In these cases, the school and student sampling procedures used are generally the same as those that would be used for a national operational sample.

The pilot test samples are probability samples of schools and students. However, on occasion the sample coverage is not the full nation for the grade in question. When the operational assessments involve samples large enough that all or many of the students in some small jurisdictions are included in the operational samples, those jurisdictions may be omitted from the sampling frame for the pilot test samples.

Weighting for Pilot and Field Test Samples

The only weights produced for pilot and field test samples are preliminary weights. These weights reflect the school and student selection probabilities but do not include adjustments for school and student nonresponse. This is in contrast to the weights for operational assessments, which incorporate both the selection probabilities and the nonresponse adjustments. To learn more, see the weighting procedures section.
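
As a rough sketch of what a preliminary weight represents, the base weight for a sampled student can be computed as the inverse of the student's overall selection probability. The Python example below is a minimal illustration with hypothetical probabilities, not NAEP's weighting implementation.

    # Minimal sketch: a preliminary (base) weight is the inverse of the
    # overall selection probability. Operational NAEP weights add
    # nonresponse and other adjustments on top of this base weight.
    def preliminary_weight(p_school, p_student_within_school):
        """Base weight = 1 / (school prob. x within-school student prob.)."""
        return 1.0 / (p_school * p_student_within_school)

    # Hypothetical example: a school selected with probability 0.02 and a
    # student selected with probability 0.25 within that school
    w = preliminary_weight(0.02, 0.25)   # w == 200.0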

Processing Materials and Scoring for Pilot and Field Tests

As with NAEP operational assessments, pilot tests need to be printed, shipped, scanned, and scored. The procedures used for printing and shipping pilot tests are similar to those described for operational assessments in the processing assessment materials section. Likewise, the steps involved in scoring constructed-response questions in pilot tests mirror those taken as part of the operational assessment scoring activities.

Data Analysis

Pilot test data undergo a thorough classical item analysis, which includes the computation of a variety of statistics that help evaluate the measurement properties of each block of items. The percentage of students responding correctly (or, in the case of polytomously scored constructed-response items, the mean item score) is one of the key statistics. The response percentages for each response option of a multiple-choice item and each score category of a constructed-response item are also calculated. In addition, a total block score is computed for each student: the number of correct responses on the dichotomously scored items plus the points earned on each polytomously scored item.
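
As an illustration of these basic statistics, the Python sketch below computes mean item scores (expressed as a percentage of the maximum points) and total block scores for a small hypothetical item-score matrix. The data, maximum item scores, and variable names are invented for illustration and do not reflect NAEP's analysis software.

    import numpy as np

    # Hypothetical item-score matrix for one block: rows are students,
    # columns are items. Dichotomous items are scored 0/1; polytomously
    # scored constructed-response items carry the points earned (here 0-3).
    scores = np.array([
        [1, 0, 2, 1],
        [1, 1, 3, 0],
        [0, 1, 1, 1],
        [1, 1, 0, 1],
    ])
    max_points = np.array([1, 1, 3, 1])   # maximum possible score per item

    # Percent correct for dichotomous items; for polytomous items this is
    # the mean item score expressed as a percentage of the maximum.
    mean_item_score = scores.mean(axis=0)
    percent_of_maximum = 100 * mean_item_score / max_points

    # Total block score: number correct on dichotomous items plus the
    # points earned on each polytomously scored item.
    total_block_score = scores.sum(axis=1)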

For each item, the mean block score among students selecting each response option (or score category) is calculated. Biserial correlation coefficients between a student's score on an individual item and the student's total score on the block in which the item appears are also computed. The item analyses include the percentage of students failing to reach each item and the percentage of students who reach an item but do not respond to it (omit it). A measure of each block's reliability is also estimated.
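
The sketch below shows common textbook forms of two of these statistics: the biserial correlation, obtained from the point-biserial correlation via the normal-ordinate correction, and coefficient alpha as one possible estimate of block reliability. NAEP's operational formulas may differ in detail; the function names are illustrative.

    import numpy as np
    from scipy.stats import norm

    def biserial(item, total):
        """Biserial correlation between a dichotomous item (0/1 array) and
        the total block score, computed via the point-biserial correlation."""
        p = item.mean()                        # proportion answering correctly
        r_pb = np.corrcoef(item, total)[0, 1]  # point-biserial correlation
        y = norm.pdf(norm.ppf(p))              # normal ordinate at the p split
        return r_pb * np.sqrt(p * (1 - p)) / y

    def cronbach_alpha(scores):
        """Coefficient alpha for a students-by-items score matrix."""
        k = scores.shape[1]
        item_var = scores.var(axis=0, ddof=1).sum()
        total_var = scores.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_var / total_var)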

Field test data, when gathered, also undergo a thorough classical item analysis. However, the larger samples associated with field tests enable additional analyses to be conducted. One such analysis is differential item functioning (DIF), which identifies potential item fairness issues. A second analysis step is item calibration, which consists of fitting an item response theory (IRT) model to the item response data. The purpose of calibrating the field test data is to identify and resolve potential difficulties in fitting the model to the data before the operational assessment and to establish starting item parameter values for the operational calibration.
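
One widely used DIF statistic for dichotomous items is the Mantel-Haenszel common odds ratio, computed within strata defined by a matching score such as the total block score. The sketch below is a generic illustration of that statistic (with the ETS delta transformation), not NAEP's operational DIF procedure; the function and argument names are hypothetical.

    import numpy as np

    def mantel_haenszel_dif(correct, focal, matching_score):
        """Mantel-Haenszel common odds ratio for one dichotomous item.

        correct        : 0/1 array, item score for each student
        focal          : boolean array, True for the focal group
        matching_score : array used to stratify students (e.g., block score)
        """
        num = 0.0  # sum over strata of (ref right * focal wrong) / n
        den = 0.0  # sum over strata of (ref wrong * focal right) / n
        for s in np.unique(matching_score):
            in_stratum = matching_score == s
            ref_right = np.sum(in_stratum & ~focal & (correct == 1))
            ref_wrong = np.sum(in_stratum & ~focal & (correct == 0))
            foc_right = np.sum(in_stratum & focal & (correct == 1))
            foc_wrong = np.sum(in_stratum & focal & (correct == 0))
            n = ref_right + ref_wrong + foc_right + foc_wrong
            if n == 0:
                continue
            num += ref_right * foc_wrong / n
            den += ref_wrong * foc_right / n
        alpha_mh = num / den                 # common odds ratio
        delta_mh = -2.35 * np.log(alpha_mh)  # ETS delta scale
        return alpha_mh, delta_mh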

How Data Analyses Inform Test Development

The analyses performed at the time of pilot testing are used to inform item development in the following ways:

  • The percentage of students responding correctly to an item (or the mean item score) informs test developers as to whether the item is more or less difficult for the students than intended.
  • The percentage of students selecting each multiple-choice item response option informs test developers as to whether each potential distractor is sufficiently attractive to students.
  • The relative number of students in each of the score categories of constructed-response items helps test developers to evaluate the scoring rubrics and definitions of the categories.
  • The biserial correlation indicates whether there is a strong relationship between performance on an individual item and the entire block score. Low correlations lead test developers to reexamine each of the distractors and the correct response itself, as it is generally expected that more able students will answer correctly more frequently than less able students.
  • The percentage of students reaching the end of a block can indicate whether the block is too long or too short.

Differential item functioning (DIF) analyses of field test data can identify situations where one group of students is potentially disadvantaged in responding to an item, relative to a matched comparison group. Contingent upon the judgment of a special DIF panel, DIF results can lead to removal or revision of an item in future assessments.

Item calibration analyses conducted on field test data are intended to facilitate future (operational) psychometric analyses of the items rather than to inform test developers.
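
As a conceptual sketch of item calibration, the code below fits the two-parameter logistic (2PL) item response function for a single item by maximum likelihood, treating a set of provisional ability estimates as fixed. This is a simplification made purely for illustration: operational NAEP calibration uses more elaborate IRT models and marginal estimation procedures, and the function names here are hypothetical.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit

    def two_pl_prob(theta, a, b):
        """2PL item response function: P(correct | ability theta)."""
        return expit(a * (theta - b))

    def calibrate_item(theta, responses, start=(1.0, 0.0)):
        """Estimate discrimination (a) and difficulty (b) for one
        dichotomous item, treating the ability estimates as fixed."""
        def neg_log_lik(params):
            a, b = params
            p = np.clip(two_pl_prob(theta, a, b), 1e-9, 1 - 1e-9)
            return -np.sum(responses * np.log(p)
                           + (1 - responses) * np.log(1 - p))
        result = minimize(neg_log_lik, start, method="Nelder-Mead")
        return result.x  # estimated (a, b)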


Last updated 14 March 2011 (JL)