Prior to their use in an operational NAEP assessment, cognitive assessment items are subject to pre-testing and associated analyses. In NAEP, two types of pre-tests are conducted: pilot tests and field tests. The differences between the two types of pre-tests are described below. The pre-tests are administered to samples of students during the same testing window as the operational NAEP assessments.
The table below provides an overview of pre-tests starting with the 2006 NAEP administration.
| Year | Use in operational assessment year | Grades / Ages | Subject |
|---|---|---|---|
| 2006 | 2007 | Grades 8, 12 | Writing |
| 2007 | 2009 | Grades 4, 8, 12 | Reading |
| 2007 | 2009 | Grades 4, 8 | Mathematics |
| 2007 | 2009 | Grades 4, 8 | Mathematics in Puerto Rico |
| 2008 | 2009 | Grades 4, 8, 12 | Science |
| 2008 | 2009 | Grade 12 | Reading |
| 2008 | 2012 | Ages 9, 13, 17 | Long-term trend mathematics |
| 2009 | 2011 | Grades 4, 8 | Mathematics, reading |
| 2009 | 2010 | Grades 4, 8, 12 | Civics, geography, U.S. history |
| 2010 | 2013 | Grades 4, 8, 12 | Writing |
Before a test question is used as part of a NAEP assessment, it is given to students as part of a pilot test. The purpose of the pilot test is to obtain information regarding clarity, difficulty levels, timing, feasibility, and special administrative situations. From this pilot test, items are selected for inclusion in the operational assessment. Since 2002, pilot tests have generally been administered to nationally representative samples of students; before then, the samples were not required to be nationally representative. In most NAEP subjects, each pilot item is administered to approximately 500 students. Larger sample sizes of at least 1,500 students per item have been used for the reading and mathematics pilot tests at grades 4 and 8 since 2009. The pilot test thus provides an early indication of the feasibility of newly created test questions intended to measure aspects of the subject-area framework.
Assessment items developed for a pilot test are created in the year prior to the intended pilot administration (e.g., items intended for the 2007 pilot were created in 2006). The steps involved in the item development process are described in the instruments section. After items are created and approved for pre-testing, blocks of items are constructed, and the newly developed blocks are then printed in pilot books. Pilot blocks are typically bundled with operational assessments in the field at the same time; however, in some circumstances, separate pilot assessment sessions are created, depending on the complexity of the operational assessment sessions.
A field test is the second phase of pre-testing and is given one year prior to the operational NAEP assessment. The purpose of this phase is to facilitate the analysis of the assessment data in the operational year by pre-calibrating the items. The questions selected for inclusion in the field test are administered to a nationally representative sample of students. The sample size is approximately 2,000 students for each block of field test items, which is sufficient to perform item calibration analyses. All newly created assessments in a subject area include a field test administration. After the first year in which the assessment is administered operationally, there is no longer a need to perform pre-calibration; therefore, field tests are not conducted prior to every operational assessment. However, questions developed for use in the 2003 through 2009 assessments in mathematics and reading at grades 4 and 8 did undergo field testing before every operational assessment. This was done to ensure that the test scaling could be performed as efficiently as possible in the operational year.
The sample designs for pilot and field tests conducted during a given assessment have the dual goals of (1) ensuring wide coverage of the student population for which the final assessment will be conducted, and (2) causing as little disruption as possible to the operational assessments that are being conducted during the same time. Thus, the pilot test samples are often highly integrated with the operational samples drawn during the same year. However, because some NAEP administrations involve a variety of assessment types, at times pilot and field tests are conducted in separate sessions. This results in four broad types of sample designs for pilot and field tests.
The pilot test samples are probability samples of schools and students. However, on occasion the sample coverage is not the full nation for the grade in question. When the operational assessments involve large samples, such that all or many of the students in some small jurisdictions are already included in the operational samples, those jurisdictions may be omitted from the sampling frame for the pilot test samples.
The only weights that are produced for pilot and field test samples are preliminary weights. These weights reflect the school and student selection probabilities but do not make adjustments for school and student nonresponse. This is in contrast to the weights for operational assessments, which include nonresponse adjustments. To learn more, see the weighting procedures section.
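As an illustration of this distinction, the sketch below computes a preliminary weight as the inverse of the combined school and student selection probabilities; the function name and the simple two-stage structure are assumptions made for illustration, not the actual NAEP weighting procedures.

```python
def preliminary_weight(p_school: float, p_student_within_school: float) -> float:
    """Preliminary (base) weight: inverse of the overall selection probability.

    Unlike operational weights, no school or student nonresponse
    adjustments are applied at this stage.
    """
    overall_selection_probability = p_school * p_student_within_school
    return 1.0 / overall_selection_probability

# Example: a school selected with probability 0.05 and a student selected with
# probability 0.25 within that school receive a preliminary weight of 80,
# i.e., the student represents roughly 80 students in the population.
w = preliminary_weight(0.05, 0.25)
```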
As with NAEP operational assessments, pilot tests need to be printed, shipped, scanned, and scored. Similar procedures are used for printing and shipping pilot tests as those described for operational assessments in the processing assessment materials section. Likewise, the steps involved in scoring constructed-response questions in pilot tests mirror those taken as part of the operational assessment scoring activities.
Pilot test data undergo a thorough classical item analysis. This includes the computation of a variety of statistics that assist in the evaluation of the measurement properties of each block of items. The percentage of students responding correctly (or, in the case of polytomously scored constructed-response items, the mean item score) is one of the key statistics. The response percentages for each response option of a multiple-choice item and each score category of a constructed-response item are also calculated. In addition, a total block score is computed for each student. The total block score is the number of correct responses across the dichotomously scored items plus the number of points earned on each polytomously scored item.
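The sketch below illustrates these statistics on a small, hypothetical scored-response matrix; the data and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical scored responses for one block: rows are students, columns are items.
# Dichotomous items are scored 0/1; the polytomous item carries its earned points (0-3).
scores = np.array([
    [1, 0, 1, 2],
    [1, 1, 0, 3],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
])
is_dichotomous = np.array([True, True, True, False])

# Percentage correct for dichotomous items; mean item score for the polytomous item.
item_means = scores.mean(axis=0)
percent_correct = 100 * item_means[is_dichotomous]
mean_item_score = item_means[~is_dichotomous]

# Total block score: correct dichotomous responses plus points earned on the
# polytomous item -- equivalently, a row sum of the scored matrix.
total_block_score = scores.sum(axis=1)
```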
For each item, the mean block score among students selecting each response option (or score category) is calculated. Biserial correlations between students' scores on an individual item and their total scores on the block in which the item appears are also computed. The item analyses further include the percentage of students failing to reach each item and the percentage of students who reach an item but do not respond to it (omit it). A measure of each block's reliability is also estimated.
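A minimal sketch of two of these statistics follows, assuming dichotomously scored items; it uses a point-biserial correlation as a simpler stand-in for the biserial coefficient and coefficient alpha as one common choice of reliability estimate (an assumption here, not necessarily the estimate NAEP reports).

```python
import numpy as np

def item_total_correlation(item: np.ndarray, block_total: np.ndarray) -> float:
    """Point-biserial correlation between a 0/1 item score and the block total,
    with the item's own contribution removed from the total (a corrected total)."""
    corrected_total = block_total - item
    return float(np.corrcoef(item, corrected_total)[0, 1])

def coefficient_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha as an internal-consistency estimate of block reliability."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)
```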
Field test data, when gathered, also undergo a thorough classical item analysis. However, the larger samples associated with field tests enable additional analyses to be conducted. One such analysis is differential item functioning (DIF) analysis, which identifies potential item fairness issues. A second additional step is item calibration, which consists of fitting an item response theory (IRT) model to the item response data. The purpose of calibrating the field test data is to resolve potential difficulties associated with fitting the model to the data prior to the operational assessment and to establish starting item parameter values for the operational calibration.
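As a concrete illustration of a DIF screen, the sketch below computes a Mantel-Haenszel statistic for a dichotomous item, matching students on their total block score. Mantel-Haenszel is one common DIF procedure; the inputs and the -2.35 delta-scale conversion shown here are illustrative assumptions rather than a description of the NAEP analysis code.

```python
import numpy as np

def mantel_haenszel_dif(item: np.ndarray, focal: np.ndarray, total: np.ndarray) -> float:
    """Mantel-Haenszel DIF statistic for a 0/1-scored item.

    item  : 0/1 item scores
    focal : boolean array, True for the focal group, False for the reference group
    total : total block scores used to match students of comparable performance

    Returns an MH D-DIF value (-2.35 times the log of the common odds ratio);
    values near zero indicate little evidence of DIF.
    """
    numerator = denominator = 0.0
    for t in np.unique(total):                      # stratify on the matching score
        stratum = total == t
        ref, foc = stratum & ~focal, stratum & focal
        n = stratum.sum()
        a, b = item[ref].sum(), (1 - item[ref]).sum()   # reference group: right, wrong
        c, d = item[foc].sum(), (1 - item[foc]).sum()   # focal group: right, wrong
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        return float("nan")                         # not estimable from these data
    return -2.35 * float(np.log(numerator / denominator))
```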
The analyses performed at the time of pilot testing inform item development: the item and block statistics described above help identify questions that need revision and guide the selection of items for inclusion in the operational assessment. Field test analyses serve two further purposes, described below.
Differential item functioning (DIF) analyses of field test data can identify situations in which one group of students is disadvantaged in responding to an item relative to a matched comparison group. Contingent upon the judgment of a special DIF panel, DIF results can lead to the removal or revision of an item in future assessments.
Item calibration analyses conducted on field test data are intended to facilitate future (operational) psychometric analyses of the items rather than to inform test developers.