The purpose of the ELS:2002 assessment battery is to provide measures of student achievement in mathematics (and reading, tested in the base year only) that can be related to student background variables and educational processes, for individuals and for population subgroups. The reading and mathematics tests must provide accurate measurement of the status of individuals at a given point in time. In addition, the mathematics test must provide accurate measurement of the acquisition of mathematics skills over time.
Test Design and Format
Test specifications for the ELS:2002 base year and first follow-up were adapted from frameworks used for NELS:88. There were two levels to the framework: content areas and cognitive processes. Mathematics tests contained items in arithmetic,4 algebra, geometry, data/probability, and advanced topics. The tests also reflected cognitive process categories of skill/knowledge, understanding/comprehension, and problem solving. The test questions were selected from previous assessments: NELS:88, NAEP, and PISA. Most, but not all, base-year items were multiple choice (about 10 percent of the base-year mathematics items were open-ended). In the first follow-up, all items were multiple choice.
Both 10th-grade and 12th-grade items were field tested in 2001, and 12th-grade items were field tested again in 2003.5 Items were selected or modified based on field test results. Final forms were assembled based on psychometric characteristics and coverage of framework categories. On the NELS:88 mathematics framework, see Rock and Pollack 1991 (chapter 2); on its adaptation to ELS:2002, see Ingels et al. 2004 (section 2.2.2.1).
The ELS:2002 assessments were designed to maximize the accuracy of measurement that could be achieved in a limited amount of testing time, while minimizing floor and ceiling effects, by matching sets of test questions to initial estimates of students' achievement. In the base year, this was accomplished by means of a two-stage test. In 10th grade, all students received a short multiple-choice routing test, scored immediately by survey administrators, who then assigned each student to a low-, middle-, or high-difficulty second-stage form depending on the student's number of correct answers on the routing test. In the 12th-grade administration, students were assigned to an appropriate test form based on their performance in 10th grade. Cut points for the 12th-grade low, middle, and high forms were calculated by pooling information from the field tests for 10th and 12th grades in 2001, the 12th-grade field test in 2003, and the 10th-grade national sample. Item and ability parameters were estimated on a common scale. Growth trajectories for longitudinal participants in the 2001 and 2003 field tests were calculated, and the resulting regression parameters were applied to the 10th-grade national sample. Test forms were designed to match the projected achievement levels of the lowest and highest 25 percent, and the middle 50 percent, of the base-year sample 2 years later. Each of the test forms contained 32 multiple-choice items.
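To make the two-stage routing concrete, the sketch below shows in schematic form how a routing-test number-correct score might be mapped to a second-stage form. The cut points are hypothetical placeholders, not the values actually used in ELS:2002, which were derived from the field-test and national-sample analyses described above.

```python
# Minimal sketch of the two-stage routing logic described above.
# LOW_CUT and HIGH_CUT are hypothetical placeholders, not the actual ELS:2002 values.

LOW_CUT = 5    # hypothetical: at or below this number correct -> low-difficulty form
HIGH_CUT = 10  # hypothetical: above this number correct -> high-difficulty form

def assign_second_stage_form(num_correct_on_routing_test: int) -> str:
    """Map a student's routing-test number-correct score to a second-stage form."""
    if num_correct_on_routing_test <= LOW_CUT:
        return "low"
    elif num_correct_on_routing_test <= HIGH_CUT:
        return "middle"
    else:
        return "high"

print(assign_second_stage_form(7))  # -> "middle"
```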
Tables A-5 through A-8 provide content and process information6 about the 73 unique items that make up the base-year mathematics assessment and the 59 unique items that make up the first follow-up assessment. Additional tables presented later (A-9 and A-10) break down the assignment of items to forms by content and process, and thus show the extent of overlap (any given unique item may appear on one or more forms).7 Tables A-5 and A-6 show the numbers and percentages of unique test items devoted to each content area for the base-year and first follow-up test batteries. Tables A-7 and A-8 show the numbers and percentages of unique test items devoted to each cognitive process area.
Table A-9 shows the number of mathematics test items per form in the base year and first follow-up. Again, forms were assigned on the basis of performance on a routing test in the base year, but were assigned on the basis of the base-year ability estimate in the first follow-up. While all examinees received a 32-item form in 2004, the number of items ranged from 40 to 42 in the base year, except for a handful of students who received the single-stage 23-item version of the base year assessment (this abbreviated version of the test was used at two schools that had too limited testing time available to administer the full version).
While the tables above show the content and process areas for the unique items that comprise the overall base-year and first follow-up mathematics tests, students took different forms of each test, and a given item could be used on more than one form. To see the number or proportion of items in a given content or skill area that students at various levels of form assignment in fact took, an additional set of tables is required. Table A-10 below shows content by cognitive process distributions of items across all test forms. Contents of the routing tests are shown separately, although for purposes of computation of the base-year ability estimate, theta, the two stages of the test (i.e., the routing test and the ability-tailored second stage test) were combined.
Table A-11 shows (by test form) the numbers and percentages of items in each content area. The items in the base-year stage 1 test (routing test) have been combined with the items in the stage 2 test. Thus we see, for example, that in the first follow-up (2004), when most sample members were in their senior year, students assigned the low form had 44 percent arithmetic items and no advanced topics, while students assigned the high form had 3 percent arithmetic items and 16 percent advanced topics. Nonetheless, the different forms comprise a single test, and with IRT methods, proficiencies can be estimated for ELS:2002 items not administered to the examinee. In other words, all ELS:2002 IRT scores (whether number-right or proficiency probability scores) measure student performance on the entire item pool, regardless of which form a student took.
A.5.2.1 IRT Scoring Procedures
The scores used to describe students' performance on the direct cognitive assessment are broad-based measures that report performance on the tested domain as a whole. The scores are based on Item Response Theory (IRT), which uses patterns of correct, incorrect, and omitted answers to obtain ability estimates that are comparable across different test forms.8 In estimating a student's ability, IRT also accounts for each test question's difficulty, discriminating ability, and a guessing factor.
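A standard form of the three-parameter logistic (3PL) model that incorporates these three item characteristics (the model family used for scaling; see section A.5.2.3) gives the probability of a correct answer to item \(i\) for a student with ability \(\theta\) as

$$ P_i(\theta) \;=\; c_i + (1 - c_i)\,\frac{1}{1 + \exp\!\left[-1.7\,a_i\,(\theta - b_i)\right]}, $$

where \(a_i\) is the item's discrimination, \(b_i\) its difficulty, and \(c_i\) its guessing parameter. The 1.7 scaling constant is conventional, and the operational parameterization in PARSCALE may differ in such details.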
IRT has several advantages over raw number-right scoring. By using the overall pattern of right and wrong responses to estimate ability, IRT can compensate for the possibility of a low-ability student guessing several difficult items correctly. If answers on several easy items are wrong, a correct difficult item is assumed, in effect, to have been guessed. Omitted items are also less likely to cause distortion of scores, as long as enough items have been answered right and wrong to establish a consistent pattern. Unlike raw number-right scoring, which necessarily treats omitted items as if they had been answered incorrectly, IRT procedures use the pattern of responses to estimate the probability of correct responses for all test questions. Finally, IRT scoring makes it possible to compare scores obtained from test forms of different difficulty. The common items present in overlapping forms and in overlapping administrations (10th grade and 12th grade) allow test scores to be placed on the same scale.
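As a schematic illustration of pattern scoring, the sketch below estimates a student's theta by maximizing the 3PL likelihood of an observed response pattern over a grid of ability values. The item parameters are invented, omitted items are simply skipped, and the operational ELS:2002 scoring relied on posterior-based procedures in PARSCALE (see section A.5.2.3), so this maximum-likelihood grid search is illustrative only.

```python
import math

# Illustrative sketch (not the operational procedure): estimate theta by maximizing
# the 3PL likelihood of an observed response pattern over a grid of ability values.
# Item parameters (a, b, c) are invented for illustration only.

items = [(1.2, -1.0, 0.20), (0.9, 0.0, 0.25), (1.5, 0.5, 0.20), (1.1, 1.2, 0.15)]
responses = [1, 1, 0, None]  # 1 = correct, 0 = incorrect, None = omitted

def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def log_likelihood(theta):
    ll = 0.0
    for (a, b, c), r in zip(items, responses):
        if r is None:          # omitted items contribute no information in this sketch
            continue
        p = p_correct(theta, a, b, c)
        ll += math.log(p) if r == 1 else math.log(1.0 - p)
    return ll

grid = [g / 10.0 for g in range(-40, 41)]   # ability grid from -4.0 to 4.0
theta_hat = max(grid, key=log_likelihood)   # maximum-likelihood estimate on the grid
print(round(theta_hat, 1))
```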
In the ELS:2002 first follow-up survey, IRT procedures were used to estimate longitudinal gains in achievement over time by using common items present in both the 10th- and 12th-grade forms. Items were pooled from both the 10th- and 12th-grade administrations and anchored to the IRT scale of the NELS:88 survey of 1988–92. Item parameters were fixed at NELS:88 values for the items that had been taken from the NELS:88 test battery and to base-year values for non-NELS:88 items. In each case, the fit of the follow-up item response data to the fixed parameters was evaluated, and parameters for common items whose current performance did not fit previous patterns were re-estimated, along with non-NELS:88 items new to the follow-up tests.
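One simple way to evaluate whether current response data fit a set of fixed item parameters, in the spirit of the evaluation described above, is to compare observed proportions correct with model-predicted proportions within coarse ability groupings and flag items with large discrepancies for re-estimation. The sketch below is illustrative only (it is not the operational fit procedure), and all data in it are invented.

```python
import math

# Illustrative item-fit check: compare observed and 3PL-predicted proportions
# correct within coarse theta groups and flag items with large discrepancies.

def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def flag_misfit(thetas, responses, a, b, c, n_groups=4, tol=0.10):
    """Return True if observed and predicted proportions correct differ by more
    than `tol` in any theta group (a crude misfit signal)."""
    paired = sorted(zip(thetas, responses))
    size = max(1, len(paired) // n_groups)
    for start in range(0, len(paired), size):
        group = paired[start:start + size]
        observed = sum(r for _, r in group) / len(group)
        predicted = sum(p_correct(t, a, b, c) for t, _ in group) / len(group)
        if abs(observed - predicted) > tol:
            return True
    return False

# Invented example: 8 examinees scored against fixed parameters a=1.0, b=0.0, c=0.2.
thetas = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
responses = [0, 0, 1, 0, 1, 1, 1, 1]
print(flag_misfit(thetas, responses, a=1.0, b=0.0, c=0.2))
```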
A.5.2.2 Score Descriptions and Summary Statistics
Two different types of IRT scores are used in this report to describe students' performance on the mathematics assessment. NELS:88-equated IRT-estimated number-right scores measure students' performance on the whole item pool. NELS:88-equated proficiency probabilities estimate the probability that a given student would have demonstrated proficiency for each of the five mathematics levels defined for the NELS:88 survey in 1992.9
Note that while Level 5 is based on a measurement of advanced mathematical material, the ELS:2002 mathematics test contains no calculus items. To the extent that advanced mathematics content on the ELS:2002 assessment is limited, the present study may understate the relationship between mathematics course sequences and the acquisition of the most advanced skills and concepts. A high school student enrolled in calculus may see improved ELS:2002 test performance indirectly, in that the course may help keep mathematics understanding fresh and hone problem-solving skills, but there is no direct test benefit from learning calculus content because there are no calculus items on the mathematics assessment.
The proficiency levels are hierarchical in the sense that mastery of a higher level typically implies proficiency at lower levels. The NELS:88-equated proficiency probabilities in ELS:2002 were computed using IRT item parameters calibrated in NELS:88. Each proficiency probability represents the probability that a student would pass a given proficiency level defined as above in the NELS:88 sample.
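As a hedged illustration of how such a probability could be computed from a student's theta under the 3PL model, the sketch below takes the product of the model-implied probabilities of answering a level's marker items correctly. The item parameters are invented, and the operational NELS:88 definition of passing a level may differ from this simple all-items-correct product.

```python
import math

# Hedged sketch: one way a proficiency probability might be computed from a
# student's theta and 3PL parameters for the marker items defining a level.
# Parameters are invented; the operational NELS:88 passing criterion may differ.

def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

level_markers = [(1.3, 0.8, 0.18), (1.1, 1.0, 0.20), (1.4, 1.1, 0.15)]  # invented (a, b, c)

def proficiency_probability(theta, markers):
    """Probability of answering every marker item for the level correctly."""
    prob = 1.0
    for a, b, c in markers:
        prob *= p_correct(theta, a, b, c)
    return prob

print(round(proficiency_probability(0.5, level_markers), 3))
```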
Table A-12 shows variable names, descriptions, and summary statistics for the NELS:88-equated number-right and proficiency probability scores.
The IRT number-right and proficiency scores are derived from the IRT model and are based on all of the student's responses to the mathematics assessment. That is, the pattern of right and wrong answers, as well as the characteristics of the assessment items themselves, is used to estimate a point on an ability continuum, and this ability estimate, theta, then provides the basis for these two types of criterion-referenced scores.
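In general terms, an IRT-estimated number-right score is the expected number of correct answers over the reference item pool, evaluated at the student's estimated ability. For the NELS:88-equated scores, which reference the pool of 81 NELS:88 mathematics items, this takes the form

$$ \widehat{NR} \;=\; \sum_{i=1}^{81} P_i(\hat{\theta}), $$

where \(P_i\) is the item response function for item \(i\) (as in the 3PL model shown earlier) and \(\hat{\theta}\) is the student's ability estimate.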
NELS:88-equated IRT number-right and proficiency probability scores may be used in a number of ways. Because they are calibrated on the NELS:88 scale, they may be used for cross-sectional intercohort comparisons of students' mathematics achievement in 2004 with that of their counterparts in 1992. The NELS:88-equated number-right scores reflect performance on the whole pool of 81 NELS:88 mathematics items, whereas the proficiency probability scores are criterion-referenced scores that target a specific set of skills. The mean of a proficiency probability score aggregated over a subgroup of students is analogous to an estimate of the percentage of students in the subgroup who have displayed mastery of the particular skill.10 The proficiency probability scores are particularly useful as measures of gain, because they can be used to relate specific treatments (such as selected coursework) to changes that occur at different points along the score scale. For example, two groups may have similar gains in total scale score points, but for one group the gain may take place at an upper skill level, and for the other at a lower skill level. One would expect to see a relationship between gains in proficiency probability at a particular level and curriculum exposure, such as taking mathematics courses relevant to the skills being mastered.
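As a small illustration of the aggregation described above, the sketch below computes a survey-weighted subgroup mean of a proficiency probability score, which can be read as an estimated percentage of the subgroup demonstrating mastery of that level. The weights and probabilities shown are hypothetical.

```python
# Hypothetical illustration: a weighted subgroup mean of a proficiency probability
# score, interpreted as an estimated percent proficient at that level.

weights = [110.5, 95.2, 120.0, 87.3]      # hypothetical sampling weights
level3_probs = [0.81, 0.40, 0.66, 0.72]   # hypothetical Level 3 proficiency probabilities

weighted_mean = sum(w * p for w, p in zip(weights, level3_probs)) / sum(weights)
print(f"Estimated percent proficient at Level 3: {100 * weighted_mean:.1f}%")
```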
A.5.2.3 Psychometric Properties of the Tests
Information about the psychometric properties of the test items, the setting of difficulty levels, differential item functioning, and scoring procedures is provided in the two field test documents (Burns et al. 2003 [NCES 2003-03, chapter 5] and Ingels et al. 2005 [NCES 2006-344, appendix J]). IRT scaling and linking procedures follow the NELS:88 precedent, using a three-parameter IRT model in PARSCALE (Muraki and Bock 1991); the NELS:88 procedure is described in Rock and Pollack (1995b). The same IRT software and procedures were used in the scaling of the Early Childhood Longitudinal Study (ECLS-K), as detailed in Pollack et al. (2005).
Reliabilities were computed by comparing the variance of the posterior distribution of plausible values for each test-taker's theta (ability estimate), that is, the error variance, with the variance of the estimated thetas across the whole sample, that is, the total variance. The reliability estimate is the proportion of total variance that is "true" variance: 1 minus the ratio of error variance to total variance (see Samejima 1994 on this procedure).
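Stated as a formula, consistent with the description above: if \(\overline{\sigma^2_{e}}\) denotes the average posterior (error) variance of the ability estimates and \(\sigma^2_{\hat{\theta}}\) denotes the variance of the estimated thetas across the sample (the total variance), then

$$ \text{reliability} \;=\; 1 - \frac{\overline{\sigma^2_{e}}}{\sigma^2_{\hat{\theta}}} \;=\; \frac{\text{true variance}}{\text{total variance}}. $$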
For the combined base-year and first follow-up tests, the reliability was 0.92 (this reliability is a function of the variance of repeated estimates of the IRT ability parameter [within-variance], compared with the variability of the sample as a whole) (Ingels et al. 2005). This 0.92 reliability applies to all scores derived from the IRT estimation.11
The use of IRT-scale scores and the adaptive testing approach used in ELS:2002 limit the concern that gain scores may be unreliable due to floor and ceiling effects.
A.5.2.4 Indicators of Student Motivation at Both Testing Points
One major concern in measuring achievement is whether students are motivated to do their best on low-stakes tests, such as the mathematics assessment in ELS:2002. This concern may be particularly strongly felt with reference to spring-term seniors, who may be in the process of disengaging from high school in anticipation of the transition to postsecondary education or the work force, and who may have had their fill of assessments in the form of such high-stakes tests as exit exams and college entrance exams. Although the greatest concern may be felt about spring-term seniors, concerns about motivation rightly encompass high school sophomores as well.
While there is no single definitive measure of student motivation on the tests, there are several possible indicators of the comprehensiveness and quality of the test data collected. For example, in scoring the 2002 and 2004 tests, the assessment subcontractor examined "pattern marking"12 and missing responses. In the main, there was no evidence of pattern marking or of high levels of omitted items. For example, in the ELS:2002 first follow-up, with around 11,000 mathematics assessments completed, 17 assessments were discarded for these reasons: 11 test records were deleted because the tests were incomplete (fewer than 10 items answered) and 6 more because response patterns indicated a lack of motivation to answer questions to the best of the student's ability. In the base year, 10 test records were deleted because tests were incomplete (fewer than 10 items answered). Pattern marking was not observed (perhaps reflecting the fact that the test was administered in two stages, each relatively short).
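The sketch below illustrates the kinds of record-level checks described above. The fewer-than-10-items-answered rule comes from the text; the specific pattern-marking rule (a long run of identical responses) and its threshold are hypothetical stand-ins for criteria the report does not spell out.

```python
# Illustrative data-quality checks. The incomplete-record rule (fewer than 10 items
# answered) follows the text; the pattern-marking rule and threshold are hypothetical.

def flag_record(responses, min_answered=10, max_run=15):
    """Return a reason to discard a test record, or None if it looks usable.

    responses: list of chosen option letters, with None for omitted items.
    """
    answered = [r for r in responses if r is not None]
    if len(answered) < min_answered:
        return "incomplete"
    run, prev = 1, None
    for r in answered:
        run = run + 1 if r == prev else 1
        if run >= max_run:              # long run of identical choices
            return "possible pattern marking"
        prev = r
    return None

print(flag_record(["A"] * 32))  # -> "possible pattern marking"
```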
Given that participation in the survey was voluntary, and that a student could have opted not to participate, or to participate by completing the questionnaire only, the student response rate may also be an indirect indicator of student test-taking motivation. Generally, NAEP sees a drop in participation in grade 12 compared with grades 4 and 8. For ELS:2002's predecessor study, NELS:88, lower participation rates were registered in 12th grade as well.13
For the ELS:2002 base year, the weighted participation rate was 87 percent. Of the 15,362 participants, 95 percent (weighted) also completed the test. (Some who did not complete the test could not be validly tested for language or disability reasons.)
For the ELS:2002 first follow-up (2004), when most sample members were high school seniors, the overall participation rate increased slightly from the base year to a weighted 89 percent. Some 87 percent (weighted) of questionnaire completers also completed the test. Looking specifically at questionnaire completion for senior cohort members who remained in the same school at both points in time—the critical analysis sample for this report—a 97 percent survey participation rate was obtained, with very little difference by subgroup. Race/ethnicity groups, for example, were all at around 97 percent (Ingels et al. 2005 [NCES 2006-344]). If voluntary participation rates are to some degree indicative of student motivation, then there is some evidence that seniors may have taken the assessment seriously.14 The overall pattern—the lack of high numbers of omitted responses, the lack of "pattern marking," high test reliability,15 and high participation rates in both rounds of the study—argues for the credibility and quality of the test data. In short, while lack of motivation for some students surely affected test results in ways that could not be identified and edited out, most test takers answered all or almost all of the items, and internal-consistency reliabilities were high for all subgroups examined, both in the field tests and the full-scale studies. These are good indications that interpretation of test results in the aggregate should not be significantly compromised by low test-taking motivation.