
Dr. Lauress Wise
Human Resources Research Organization

September 27, 1999


Since its inception in the 1960s, the National Assessment of Educational Progress (NAEP) has been the primary indicator of trends in the achievement of American elementary and secondary students. Originally, NAEP results were reported only for the nation as a whole and for large demographic groups defined by region of the country, gender, or race/ethnicity. When Congress reauthorized NAEP in 1988, it included a provision that allowed reporting of state-level results. Beginning with the 1990 assessment of eighth grade mathematics, NAEP has released state-by-state comparisons for selected grades and subjects. The 1998 assessment included state-level results for reading at grades 4 and 8 and for writing at grade 8. State-level reading results had been reported previously with the 1992 and 1994 assessments, so the 1998 results provided an opportunity to observe state-level trends over a period of four to six years.

Following the release of the 1998 State NAEP reading results, it was noted in the press that rates of exclusion of students with disabilities (SD) or limited English proficiency (LEP) had increased in many states. Further, the degree of increase was correlated positively with reported state gains from 1992 and 1994 to 1998. Subsequent analyses were conducted by staff from the Educational Testing Service (ETS) to determine, among other things, the degree to which score gains could conceivably have been due to increased exclusion of low-scoring students (Mazzeo, Donoghue, and Hombo, 1999). Commissioner Forgione released the results of the ETS analyses and reported them to the National Assessment Governing Board on May 14, 1999.

ETS staff developed a "worst case" model which assumed that SD and LEP students who were tested in 1994, but who would have been excluded in 1998, were the lowest performing of the 1994 SD and LEP students. For each state, they attempted to remove students from the bottom of the 1994 distribution for SD and LEP students until the percentage of exclusions in 1994 matched the 1998 exclusion rate. They then recomputed the 1994 means with these additional exclusions and reassessed the significance of the 1994 to 1998 mean score gains (or losses). The results showed no changes in the significance or lack of significance of 1994 to 1998 score changes, except for Kentucky and Maryland. In those two states, originally significant gains were not still significant after the removal of the additional 1994 cases.
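The ETS "worst case" recomputation can be sketched in a few lines. The scores below are invented for illustration; the actual NAEP computation works with plausible values and sampling weights rather than simple unweighted scores.

```python
# Sketch of the ETS "worst case" model: assume the students who would have
# been excluded under 1998 rules were the lowest-scoring tested SD/LEP
# students in 1994, drop them, and recompute the 1994 mean.

def worst_case_mean(tested, n_drop):
    """tested: list of (score, is_sd_lep) pairs for students tested in 1994.
    Drops the n_drop lowest-scoring SD/LEP students and returns the new mean."""
    lowest_sd_lep = sorted(score for score, flag in tested if flag)[:n_drop]
    remaining = [score for score, _ in tested]
    for score in lowest_sd_lep:
        remaining.remove(score)  # remove one occurrence of each dropped score
    return sum(remaining) / len(remaining)

# Invented mini-sample: two SD/LEP students (flag True) and three others.
sample = [(200, True), (210, True), (220, False), (230, False), (240, False)]
print(worst_case_mean(sample, 0))  # 220.0 (no additional exclusions)
print(worst_case_mean(sample, 1))  # 225.0 (lowest SD/LEP student dropped)
```

Because only the lowest scorers are removed, this adjustment can only raise the 1994 mean and thus shrink any 1994-to-1998 gain.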

Kentucky was the only state in which the increase in exclusions was greater than the number of SD students tested in 1994 (there were almost no LEP students in Kentucky in either 1994 or 1998). The ETS "worst-case" model removed all of the SD students who were tested in 1994 (3.7% of the total student population) and then had to remove another 1.3% to match the 5% increase in SD exclusion rate. They did this by removing non-SD students at the very bottom of the score distribution for all students. This was a very harsh assumption. Not surprisingly, the result was an increase in the 1994 mean that reduced the gain from 1994 to 1998 to insignificance.

Following the presentation of the ETS findings to the National Assessment Governing Board, Kentucky's Commissioner of Education, Wilmer S. Cody, sent a memo to Dr. Forgione requesting further analyses of the Kentucky results. In response to this request, NCES commissioned the study reported here. In addition to having a site license for analysis of secure NAEP data, HumRRO had access to data from Kentucky's statewide assessment. This made it possible to conduct analyses using data not available to the ETS researchers.

Kentucky's statewide assessment (KIRIS) includes all students. SD students are offered accommodations, consistent with their individual educational programs (IEPs). A small number of students with severe disabilities who cannot participate in the main assessment complete an alternative assessment. The analyses reported here investigated alternative models that used information from KIRIS to estimate maximum possible impact of increased NAEP exclusions on 1998 NAEP reading gains for Kentucky.

ETS provided data from the NAEP 1998 state reading assessment for Grade 4 students in Kentucky. The sample of students for whom data were sent was divided in two ways that are important to the present analyses. First, schools had been assigned to one of two subsamples, labeled Subsample 2 and Subsample 3. The difference between these subsamples was that students with disabilities (SD) or limited English proficiency (LEP) were offered additional accommodations in Subsample 3, but not in Subsample 2.

The second important division within each of these subsamples was whether the student: was not identified as SD or LEP (stratum A), was identified as SD or LEP and was assessed (stratum B), or was identified as SD or LEP and was excluded from the assessment (stratum C). Because the additional accommodations might have affected assessment results, cases in strata B and C within subsample 3 were not given any weight in computing overall results. Table 1 shows the number of cases, weighted N, and the weighted and unweighted mean of the NAEP plausible values, where available, for each of the major cells.

Table 1
NAEP Assessment Data for Kentucky
By Subsample and Completion Stratum

[Table body not reproduced here: for each subsample (2 and 3) and completion stratum (A: not SD/LEP; B: included SD/LEP; C: excluded SD/LEP), the table gives the number of cases, total weight, and the weighted and unweighted mean NAEP plausible value, where available.]

In 1994, Kentucky had a NAEP Grade 4 Reading mean of 211.6 (NCES, 1995, page 403). Thus the original 1998 estimated mean of 217.5 implied a gain of 5.9 NAEP score points which, with a standard error of the difference of 2.1, was a statistically significant gain.

The 134 students in sampling cell 2C, who represented 8.9 percent (148.5/1669.7) of the total weight, were excluded from the computation of the 1998 mean. In 1994, excluded students represented only 3.9 percent of the total weight. The analyses reported here used data from the Kentucky Instructional Results Information System (KIRIS), administered in 1998 to all 4th grade students in Kentucky, to estimate how excluded students would have scored had more of them been included in the NAEP assessment. In particular, the goal was to estimate what the overall mean would have been if the 1998 exclusion rate had matched the 1994 exclusion rate.
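The 8.9 percent figure is simply the excluded weight as a share of the total sample weight; the arithmetic can be checked directly:

```python
# Weighted 1998 exclusion rate for cell 2C (figures from the text above).
excluded_weight = 148.5   # total weight of the 134 excluded SD/LEP students
total_weight = 1669.7     # total weight of the Grade 4 subsample

exclusion_rate = 100 * excluded_weight / total_weight
print(round(exclusion_rate, 1))  # 8.9 (versus 3.9 percent in 1994)
```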

There was an important difference between the way the problem was framed in the present study and the way it was framed in the ETS analyses. Here the goal was to reduce the 1998 exclusion rate to the 1994 level by using KIRIS scores to estimate NAEP scores for students who had been excluded in 1998. The approach used in the ETS analyses was to increase the 1994 exclusion rate to 1998 levels by dropping students who had been included in the 1994 analyses.

Each of the 2,741 NAEP cases in our analysis sample was matched to a student in the 1998 KIRIS assessment on the basis of demographic variables including: presence/absence of an Individualized Education Program (IEP), presence and type of disability, school attended, indication of limited English proficiency, gender, race, and age (in months). For 2,358 cases, there was an exact match on all of the above variables. It is likely that in many of these cases the specific student excluded from the NAEP assessment was identified, although this cannot be confirmed because the NAEP data contain no identifying information. In other cases, the match was a student who was similar with respect to the presence and nature of a disability, attended the same school, and had the same sex, race, and birth year and month. For the remaining 383 cases, a match was found that was identical to the NAEP case on disability status and on most, but not all, of the remaining variables. The lack of an exact match was due in part to coding differences between KIRIS and NAEP in disability type, to students changing schools or SD/LEP status between assessments, and to data errors in fields such as birth year and month.
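The matching step can be sketched as a two-pass search: first an exact match on every demographic variable, then a relaxed match that drops some of them. The dictionary keys below are hypothetical stand-ins for the variables listed above; the actual NAEP and KIRIS field layouts differ.

```python
# Two-pass record matching: exact on all keys first, then a relaxed pass.
# Key names are illustrative, not the real NAEP/KIRIS variable names.
EXACT_KEYS = ("iep", "disability", "school", "lep", "sex", "race", "birth_ym")
RELAXED_KEYS = ("iep", "disability", "lep", "sex", "race")

def match_student(naep_case, kiris_pool):
    """Return the first KIRIS record matching the NAEP case, preferring
    an exact match on all keys over a relaxed match on fewer keys."""
    for keys in (EXACT_KEYS, RELAXED_KEYS):
        for kiris_case in kiris_pool:
            if all(naep_case[k] == kiris_case[k] for k in keys):
                return kiris_case
    return None
```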

All of the matched students had KIRIS reading assessment scores. All had completed the main KIRIS reading assessment, although some required accommodations consistent with their IEPs. In particular, students had typically been excluded from the NAEP assessment because they could be assessed only with the accommodations provided for in their IEPs. Most of these students did require some accommodation on KIRIS, which offered a range of accommodations judged to be consistent with the goals of the assessment. We did not match any of the NAEP students to students who could not be assessed with the normal KIRIS procedures and who required the alternative portfolio assessment.

We examined KIRIS scores on an equated scale on which the overall state mean was .880 and the standard deviation was .841. The idea was to estimate the extent, in standard deviation units, to which the students matched to the 134 excluded NAEP cases in cell 2C had KIRIS scores below (or above) the overall KIRIS mean. This difference from the KIRIS mean was then used to estimate a NAEP score for each of these excluded students: the estimated NAEP score was set the same number of standard deviation units below (or above) the overall NAEP mean for Kentucky as the student's KIRIS score was below (or above) the overall KIRIS mean. In other words, the estimation assumed that KIRIS and NAEP scores were essentially equivalent except for a linear transformation of the score scales; standard deviation units simply placed the two sets of scores on a comparable scale.
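The linking just described is a linear (z-score) transformation. The KIRIS mean and standard deviation are the equated-scale values from the text; the NAEP standard deviation of 31 is an assumed value for illustration, since the report does not state it.

```python
# Linear linking of KIRIS scores onto the NAEP scale via SD units.
KIRIS_MEAN, KIRIS_SD = 0.880, 0.841   # equated-scale KIRIS statistics (from text)
NAEP_MEAN, NAEP_SD = 217.5, 31.0      # NAEP_SD = 31.0 is an assumed value

def kiris_to_naep(kiris_score):
    z = (kiris_score - KIRIS_MEAN) / KIRIS_SD  # deviation in KIRIS SD units
    return NAEP_MEAN + z * NAEP_SD             # same deviation on the NAEP scale

# A student at the KIRIS mean maps to the NAEP mean.
print(round(kiris_to_naep(0.880), 1))  # 217.5
```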

Once we had estimated NAEP scores for each of the previously excluded students, the next step was to estimate a new overall 1998 mean that included enough of these students to reduce the exclusion rate from 8.9 percent to 3.9 percent. After a revised 1998 mean was estimated, the gain from 1994 was recomputed and the statistical significance of the new gain was tested.

One further adjustment was to correct the standard error of the difference between the 1998 and 1994 means. The standard error of the 1998 mean would decrease with increased sample size, but might increase if the added cases increased the overall standard deviation of the 1998 score estimates. The standard error of the difference was adjusted to reflect these changes in the standard error of the 1998 mean. A revised t-statistic was computed, dividing the mean difference by this revised standard error.
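These steps amount to a weighted combination of the tested-student mean with the estimated mean of the added-back students, followed by a t-test of the revised gain. A minimal sketch using the Model 1 figures from Table 2; because the table entries are rounded, the recomputation agrees only approximately with the reported 216.7 and t = 2.35.

```python
# Recompute the 1998 mean, the gain over 1994, and the t-statistic after
# adding estimated scores for previously excluded students back in.

def revised_gain(w_tested, m_tested, w_added, m_added, mean_1994, se_diff):
    m_revised = (w_tested * m_tested + w_added * m_added) / (w_tested + w_added)
    gain = m_revised - mean_1994
    return m_revised, gain, gain / se_diff

# Model 1 ("worst case") inputs, as reported in Table 2 (rounded).
m, gain, t = revised_gain(1521.2, 217.5, 83.7, 200.1, 211.56, 2.17)
print(round(m, 1), round(gain, 1), round(t, 2))
```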

Two different models were used to identify the specific students to be added back into the NAEP sample. A "worst case" approach represented the students who would have been tested in 1994 but were excluded in 1998 by drawing randomly from the list of students excluded in 1998. This is a relatively harsh model. The NAEP data included a variable indicating the severity of each student's disability, coded as mild, moderate, severe, or profound. It seemed more likely that students with the most severe disabilities would have been excluded from both assessments. The "least severe" approach built on this assumption by adding back the students with the least severe disabilities and continuing to exclude those with the most severe disabilities.
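The two selection rules can be sketched as follows; the severity codes come from the NAEP variable described above, while the record layout is invented for illustration:

```python
import random

# Rank of each NAEP disability-severity code, mildest first.
SEVERITY_RANK = {"mild": 0, "moderate": 1, "severe": 2, "profound": 3}

def pick_random(excluded, n, seed=0):
    """'Worst case' model: a simple random draw from the excluded students."""
    return random.Random(seed).sample(excluded, n)

def pick_least_severe(excluded, n):
    """'Least severe' model: add back the mildest disabilities first."""
    return sorted(excluded, key=lambda s: SEVERITY_RANK[s["severity"]])[:n]
```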

Overall, the students excluded from NAEP had a matched KIRIS mean of .461, which was .43 standard deviations below the overall KIRIS mean. If the cases to be put back into the 1998 NAEP totals were randomly sampled from all students with disabilities who had been excluded from the NAEP assessment, the revised estimate of the Kentucky mean would be 216.7. This translates to a gain of 5.1 from 1994, which is still clearly significant at the .05 level (t = 2.35, p = .010).

The "least severe" model required "putting back" just over 56 percent of the 134 excluded students in cell 2C to reduce the overall exclusion rate from 8.9 percent to the 3.9 percent rate of the 1994 assessment. Severity of disability was coded for 97 of the 134 students in cell 2C. There were 55 coded as having "mild" disabilities with a KIRIS mean of .615. These 55 students represented 56.7 percent of the 97 cases for whom severity was coded. Weighting these 55 cases to account for missing severity data and adding them back to the 1998 sample led to a revised statewide NAEP estimate of 217.0 with a gain of 5.4 that was also statistically significant (t=2.52, p=.006). Table 2 summarizes the key results under both the "worst case" and the "least severe" models for replacing the missing NAEP students.
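The 56.7 percent figure, and the implied number of "mild" cases among all 134 excluded students, follow directly from the counts above under the assumption (made here for illustration) that severity is missing at random:

```python
# Counts from the text: severity was coded for 97 of the 134 excluded
# students in cell 2C, and 55 of those 97 were coded "mild".
n_excluded, n_coded, n_mild = 134, 97, 55

mild_share = 100 * n_mild / n_coded
print(round(mild_share, 1))                  # 56.7 percent of coded cases

# If severity is missing at random, the same share applies to all 134:
print(round(n_mild * n_excluded / n_coded))  # about 76 implied mild cases
```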

Kentucky students excluded from the 1998 State NAEP Reading assessment for Grade 4 were matched to data from Kentucky's own 1998 statewide assessment (KIRIS), in which all students were included. Results showed that, while the students excluded from NAEP did have lower matched KIRIS scores, the impact of including more of these students on statewide results was modest. When the exclusion rate was reduced by 5 percentage points to match the 1994 rate, the estimated score gain was reduced from 5.9 NAEP points to between 5.1 and 5.4 NAEP points. The remaining gain was still statistically significant, continuing to support the conclusion that reading scores in Kentucky improved between 1994 and 1998. These results differed from the results of the analyses performed by ETS researchers, who did not have access to the KIRIS data.

Cody, Wilmer S. (1999). Memo to Pascal D. Forgione, Commissioner of Education Statistics, dated May 14, 1999.

Kentucky Department of Education (1997). KIRIS Accountability Cycle 2 Technical Manual. Frankfort, KY: Kentucky Department of Education.

Mazzeo, J., Donoghue, J., and Hombo, C. (1999). A summary of initial analyses of 1998 State NAEP exclusion rates. (Memo to Pascal D. Forgione, dated May 27, 1999).

National Center for Education Statistics. (1995). Technical Report of the NAEP 1994 Trial State Assessment Program in Reading (Technical Report NCES 96-116). J. Mazzeo, N. Allen, and D. Kline (Eds.). Washington, D.C.: U.S. Department of Education.

Table 2
Results with Varying Assumptions

                                  Total     Original Results   Model 1:      Model 2:
Statistic                         Weight    (No Replacement)   Worst Case    Least Severe
NAEP Mean for Students Tested     1521.2    217.5              217.5         217.5
KIRIS Mean for Students Tested              .934               .934          .934
Excluded Students Added Back                None               Average       Least
                                                               Disability    Severe
KIRIS Mean for Those Put Back                                  .461          .615
NAEP Equivalent                   83.7                         200.1         206.5
Revised Total Mean                1604.9    217.5              216.7         217.0
Gain from 1994 (Mean = 211.56)              6.0                5.1           5.4
S.E. of Difference                          2.17               2.17          2.15
t-statistic                                 2.74               2.35          2.52
p                                           .003               .010          .006