Help for the 2004 Long-Term Trend Summary Data Tables

The 2004 long-term trend summary data tables present mathematics and reading trend results from the 2004 administration of the National Assessment of Educational Progress (NAEP). The NAEP long-term trend assessments are separate from a newer series of NAEP assessments (called "main" assessments) that involve more recently developed instruments. While the long-term trend assessments have used the same sets of questions and tasks for decades so that trends across time can be measured, the main assessments in each subject area have been changed more frequently to reflect current educational content and assessment methodology. The main assessments provide trend results for a short term (e.g., mathematics in 1990, 1992, 1996, 2000, and 2003; and reading in 1992, 1994, 1998, 2000, 2002, and 2003).

Measuring trends in student achievement, or change over time, requires the precise replication of past procedures. Therefore, the long-term trend instruments do not evolve based on changes in curricula or in educational practices; in this way, the long-term trend assessments differ from main NAEP. Further, the long-term trend assessments use different instruments from those used in the main NAEP assessments, and students are sampled by age for the long-term trend assessments, rather than by grade as in the main assessments. It is therefore not possible to compare results from national or state main NAEP with those of the long-term trend assessment.

The use of both long-term trend and main assessments allows NAEP to provide information about students' achievement over time and to evaluate their attainment of contemporary educational objectives. Because they are based on different sets of questions and tasks, scale score results and students' reports of educationally related experiences from the long-term trend assessments cannot be directly compared to those from the main assessments.
Help is available for the following topics:

The Mathematics Trend Assessment

One of the primary objectives of NAEP is to track trends in student performance over time. The most recent NAEP long-term trend assessment in mathematics was administered throughout the nation in the 2003–2004 school year. NAEP has assessed the mathematics achievement of 9-, 13-, and 17-year-olds ten times in the past 31 years: in the school years ending in 1973, 1978, 1982, 1986, 1990, 1992, 1994, 1996, 1999, and 2004. Because the long-term trend program uses substantially the same assessments decade after decade, it has been possible to chart educational progress since 1973 in mathematics.

For the 2004 administration of the long-term trend assessment in mathematics, several changes were made to the assessment design. When changes are made in a trend assessment, studies are required to ensure that the results can continue to be reported on the same trend line, that is, that they are validly comparable to earlier results. Analyses were needed to ensure that the 2004 results under the new design were comparable to the results from previous long-term trend assessments. Therefore, two assessments were conducted in 2004. One was a modified assessment that used the new design, and the other was a "bridge" assessment that replicated the former design. The bridge assessment links the results of the modified assessment to the existing trend line. Comparisons of the results of the bridge and modified assessments could detect any shifts in results that may be due to changes in test design.

The long-term trend mathematics assessments administered in 2004 and in previous years contained a range of constructed-response and multiple-choice questions designed to measure performance on sets of objectives developed by nationally representative panels of mathematics specialists, educators, and other interested parties.
The 1986, 1990, 1992, 1994, 1996, 1999, and 2004 assessments shared common objectives (NAEP 1986). The objectives for each assessment prior to 1990 were based on the framework used for the previous assessment, with some revisions that reflected changes in the content of mathematics education. Although changes were made from assessment to assessment before 1990, some questions were retained from one assessment to the next in order to measure trends in achievement across time. This continuity allows comparisons to be made across all of the available assessments, other than the 1973 assessment, using Item Response Theory (IRT). Results from the 1973 assessment were placed on the same scale using mean-proportion-correct extrapolation. See NAEP scales for more information.

The distribution of question types in the 1986–1999 mathematics long-term trend assessments, the 2004 bridge study, and the 2004 modified assessment is shown in the following tables:
The questions covered a range of mathematical content, including numbers and operations, measurement, geometry, and algebra. The process areas included knowledge, understanding, skills, applications, and problem solving.

The Reading Trend Assessment

One of the primary objectives of NAEP is to track trends in student performance over time. The NAEP long-term trend assessment in reading was administered throughout the nation in the 2003–2004 school year to a sample of students aged 9, 13, and 17. Because the long-term trend program uses substantially the same assessments decade after decade, it has been possible to chart educational progress since 1971 in reading. NAEP has assessed student reading achievement at age 9, age 13, and age 17 in 11 reading assessments, conducted during the school years ending in 1971, 1975, 1980, 1984, 1988, 1990, 1992, 1994, 1996, 1999, and 2004.

For the 2004 administration of the long-term trend assessment in reading, several changes were made to the assessment design. When changes are made in a trend assessment, studies are required to ensure that the results can continue to be reported on the same trend line, that is, that they are validly comparable to earlier results. Analyses were needed to ensure that the 2004 results under the new design were comparable to the results from previous long-term trend assessments. Therefore, two assessments were conducted in 2004. One was a modified assessment that used the new design, and the other was a "bridge" assessment that replicated the former design. The bridge assessment links the results of the modified assessment to the existing trend line, so that comparisons of the results of the bridge and modified assessments could detect any shifts in results that may be due to changes in test design.

The set of reading passages and questions included in the long-term trend assessments has been kept essentially the same since 1984, and most closely reflects the objectives developed for that assessment.
The selections include brief stories, passages from textbooks, and other age-appropriate reading material. Although some tasks required students to provide written responses, most questions were multiple choice. The assessment was designed to evaluate students' ability to locate specific information, to make inferences based on information in two or more parts of a passage, and to identify the main idea in a passage.

The distribution of question types in the 1986–1999 reading long-term trend assessments, the 2004 bridge study, and the 2004 modified assessment is shown in the following tables:
Types of Summary Data Tables

In 2004, NAEP examined long-term trends in the ability of nationally representative samples of students at ages 9, 13, and 17 in mathematics and reading. In each subject area, two assessments were administered: a bridge assessment, in which the same sets of questions and tasks used in previous long-term trend assessments were administered using the same procedures as in previous assessments, and a modified assessment, which contained newly developed assessment questions and used different administration procedures.

A key component of the long-term trend assessment was the contextual information collected from participating students, who were asked a series of questions about demographic characteristics, their home environment, and their experiences and instruction in the particular subject area being assessed. The long-term trend summary data tables are based on responses to these questions and on certain scale scores, scale score percentiles, and performance-level percentages. The results are shown for important demographic groups, such as those defined by the students' gender, race/ethnicity, and parental education level. Three types of summary data tables are provided; each type is described below.

Scale Score and Performance-Level Data Tables

The scale score and performance-level data tables present assessment results based on data from the background questions that were administered to each student and on data derived from school-level information contained in the sampling frame. For the overall sample and for groups of students, average scale scores and the percentage of students at each performance level are presented. The left-hand side of the tables shows the categories for each of the student background variables. In the tables titled "Percentage of Students," the columns contain, by assessment year, the estimated percentage of students corresponding to each category of the background variable.
In the tables titled "Average Scale Scores," the columns contain, for each assessment year, the estimated average scale scores that correspond to each category. The remainder of the tables show the percentages of students in each category who received scale scores at or above the various NAEP performance levels. Standard errors for each of these statistics are shown in parentheses. In all of the tables, a ( * ) next to a value indicates that the value was found by statistical test to be significantly different from the value for the 2004 bridge assessment, based on a test of statistical significance at about the 95 percent certainty level. Information about other notations used in the tables can be viewed below.

Percentile Data Tables

The percentile data tables provide estimates of the scale scores at the 10th, 25th, 50th (or median), 75th, and 90th percentiles of the scale score distribution. All estimates are followed in parentheses by their estimated standard errors.

Extrapolated Data Tables for Mathematics

The initial long-term trend scaling did not include the 1973 mathematics assessment because the 1973 assessment had too few questions in common with subsequent assessments to have results put directly on the IRT scale. To provide a link to the early assessment results for the nation and for subgroups defined by gender and race/ethnicity at each of three age levels, estimates of average scale scores were extrapolated from previous analyses. An additional set of summary data tables for mathematics (labeled "Extrapolated Data") shows the extrapolated results for 1973 juxtaposed with results from the 2004 bridge assessment.

Notations Used in the Summary Data Tables
NAEP Reporting Groups

The summary data tables provide results for the nation and for groups of students defined by shared characteristics. Based on statistically determined criteria, results are reported for a group only when sufficient numbers of students and adequate school representation are present. The minimum requirement is at least 62 students in a particular reporting group from at least five primary sampling units (PSUs). A PSU is a selected geographical region: a county, group of counties, or a metropolitan statistical area. However, the data for all students, regardless of whether their student group was reported separately, were included in computing the overall national results. Definitions of the reporting groups referred to in the summary data tables are presented below.

Gender

Results are presented separately for male and female students. Gender was reported by the student.

Race/ethnicity

Results are presented for students of different racial/ethnic groups according to the following mutually exclusive categories: White, Black, and Hispanic. Results for Asian/Pacific Islander and American Indian (including Alaska Native) students are not reported separately because there were too few students in the groups for statistical reliability. The data for all students, regardless of whether their racial/ethnic group was reported separately, were included in computing overall national results. In NAEP long-term trend assessments, data about student race/ethnicity have been collected in three ways: through observation, school records, and student self-reports.

Modal Grade

Results are presented for students who are below, at, and above the modal grade (the grade attended by most students at the assessed age).

Region

Results are reported for four regions of the nation: Northeast, Southeast, Central, and West.
Type of Location (2004 only)

Results are provided for students attending public schools in three mutually exclusive location types (central city, urban fringe/large town, and rural/small town) as defined below. The type of location variable is defined in such a way as to indicate the geographical location of a student's school. The intention is not to indicate or imply social or economic meanings for these location types. The type of location variable, on which the current NAEP sampling is based, does not support the reporting of regional results. Therefore, only national results are presented.
Parents' Education Level

Students were asked to indicate the extent of schooling for each of their parents by choosing from the following responses: did not finish high school, graduated from high school, had some education after high school, or graduated from college. The response indicating the higher level of education was selected for reporting. Note that a substantial number of nine-year-olds indicated that they did not know their parents' education level; therefore, results are not presented for this age group.

Type of School

Results are presented for public schools. Response rates for nonpublic schools selected for participation in the 2004 trend assessments failed to reach the necessary threshold for reporting; therefore, only results for the total sample and public schools are reported.

Percentiles

Results are presented for five percentile groups: 10th, 25th, 50th, 75th, and 90th.

NAEP Scales

For the 2004 mathematics and reading trend assessments, separate IRT scales were constructed within each grade. These scales were linked to the previously established scales within each subject area by a common population linking procedure. The reading trend scale was constructed based on the 1984 assessment and included all previous reading assessments. The mathematics trend scales were developed based on the 1986 science and mathematics assessments, and also included previous assessments. The initial trend scaling, however, did not include the 1973 mathematics assessment, because this assessment had too few questions in common with subsequent assessments. To provide a link to the early assessment results for the nation and for groups defined by gender and race/ethnicity at each of three age levels, estimates of average scale scores were extrapolated from previous analyses.
The extrapolated estimates were obtained by assuming a linear relationship within a given age level between the logit transformation of a group's average p value (i.e., average proportion correct) for common questions and its respective scale score average, and further assuming that the same line held for all assessment years and for all subgroups within the age level. Because of the extrapolation of the average scale scores for these early assessments, caution should be used in interpreting the patterns of trends across those assessment years.

Performance Levels

To facilitate interpretation of the NAEP results, the scales were divided into levels of performance and a "scale anchoring" process was used to define what it means to score at each of these levels. The scale anchoring followed an empirical procedure whereby the scaled assessment results were analyzed to delineate sets of questions that discriminate between adjacent performance levels. For the reading and mathematics long-term trend scales, these levels are 150, 200, 250, 300, and 350. For these five levels, questions were identified that were likely to be answered correctly by students performing at a particular level on the scale and much less likely to be answered correctly by students performing at the next lower level. The guidelines used to select such questions were as follows: students at a given level must have at least a specified probability of success (65 percent for mathematics, 80 percent for reading), while students at the next lower level have a much lower probability of success (that is, the difference in probabilities between adjacent levels must exceed 30 percent). For each curriculum area, subject-matter specialists examined these empirically selected question sets and used their professional judgment to characterize each level.
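The two question-selection guidelines for scale anchoring can be illustrated with a short Python sketch. This is only an illustration of the stated thresholds, not NAEP's anchoring software; the function name, the argument names, and the example probabilities are hypothetical.

```python
# Thresholds are taken from the guidelines in the text; all names here
# (MIN_SUCCESS, MIN_GAP, anchors_at_level) are hypothetical.
MIN_SUCCESS = {"mathematics": 0.65, "reading": 0.80}
MIN_GAP = 0.30


def anchors_at_level(subject, p_at_level, p_at_lower_level):
    """Return True if a question meets both anchoring guidelines.

    p_at_level:       probability of success for students at the given level
    p_at_lower_level: probability of success at the next lower level
    """
    meets_threshold = p_at_level >= MIN_SUCCESS[subject]
    discriminates = (p_at_level - p_at_lower_level) > MIN_GAP
    return meets_threshold and discriminates


# Illustrative (made-up) probabilities for a single question.
print(anchors_at_level("mathematics", 0.70, 0.35))  # both guidelines met
print(anchors_at_level("reading", 0.70, 0.30))      # below the reading threshold
```

Questions that pass a check of this kind would still be reviewed by subject-matter specialists, as described above.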
The long-term trend reading scale anchoring was conducted on the basis of the 1984 assessment, and the scale anchoring for mathematics was based on the 1986 assessment.

Minimum Sample Sizes for Reporting

Results for mathematics and reading performance and for background variables were tabulated and reported for groups defined by gender, race/ethnicity, region, type of location, parental education, and type of school. NAEP collects data for five racial/ethnic groups (White, Black, Hispanic, Asian/Pacific Islander, and American Indian/Alaska Native) and four levels of parents' education: graduated from college, some education after high school, graduated from high school, and did not finish high school, plus the category "I Don't Know." In some instances, the number of students in some of these groups was not sufficiently high to permit accurate estimation of performance and/or background-variable results. Therefore, data are not provided for the groups with students from very few schools or for the groups with very small sample sizes. For results to be reported for any group, at least five PSUs must be represented in the group. In addition, a minimum sample of 62 students per group is required. For statistical tests pertaining to more than one reporting group, the sample size for each group must meet the minimum sample size requirements. In the summary data tables, the notation (—) appears in place of a result whenever minimum sample size requirements are not met.

Drawing Inferences and Analyzing Student Group Differences

Because the percentages of students in the reporting groups and their average scale scores are based on samples, rather than on entire populations, the numbers reported are necessarily estimates. As such, they are subject to a measure of uncertainty, reflected in the standard error of the estimate.
When the percentages or average scale scores of certain groups are compared, it is essential to take the standard error into account, rather than to rely solely on observed similarities or differences. Therefore, the comparisons provided in these summary tables are based on statistical tests that consider both the magnitude of the difference between the averages or percentages and the standard errors of those statistics. One of the goals of the assessment program is to estimate scale score distributions and percentages of students in the standard reporting groups based on the particular samples of students assessed. The use of confidence intervals, based on the standard errors, provides a way to make inferences about the population averages and percentages in a manner that reflects the uncertainty associated with the sample estimates. An estimated sample scale score average plus or minus 2 standard errors represents about a 95 percent confidence interval for the corresponding population quantity. This means that with 95 percent certainty, the average performance of the entire population of interest is within about plus or minus 2 standard errors of the sample average. Similar confidence intervals can be constructed for percentages, provided that the percentages are not extremely large or extremely small. For percentages, confidence intervals constructed in the above manner work best when sample sizes are large and the percentages being tested have magnitude relatively close to 50 percent. Statements about group differences should be interpreted with caution if at least one of the groups being compared is small in size and/or if "extreme" percentages are being compared. Percentages, P, were treated as "extreme" if
P < Plim = 200 / (NEFF + 2),
where the effective sample size NEFF is equal to
NEFF = P(100 – P) / (SE)²,
and SE is the jackknife standard error of P. This "rule of thumb" cutoff leads to flagging a large proportion of confidence intervals that would otherwise include values < 0 or > 1. Similarly, at the other end of the 0–100 scale, a percentage is deemed extreme if 100 – P < Plim. In either extreme case, the confidence intervals described above are not appropriate, and procedures for obtaining accurate confidence intervals are quite complicated. For these cases in the summary data tables, the values are not reported and are marked with a (—) symbol.

To determine whether there is a real difference between the average scale score (or percentage of a certain attribute) for two groups in the population, one needs to obtain an estimate of the degree of uncertainty associated with the difference between the average scale scores or percentages of these groups for the sample. This estimate of the degree of uncertainty, called the standard error of the difference between the groups, is obtained by squaring each group's standard error, summing these squared standard errors, and then taking the square root of this sum. This procedure produces a conservative estimate of the standard error of the difference, since the estimates of the group averages or percentages will be positively correlated to an unknown extent due to the sampling plan. Direct estimation of the standard errors of all reported differences would involve a heavy computational burden.

Similar to the manner in which the standard error for an individual group average or percentage is used, the standard error of the difference can be used to help determine whether differences between assessment years are real. If zero is within the confidence interval for the difference, there is no statistically significant difference between the groups.
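The interval and difference calculations described in this section can be sketched in Python as follows. This is a minimal illustration under the approximations stated in the text (an interval of plus or minus 2 standard errors, and a standard error of the difference formed from the two groups' standard errors); it is not NAEP's analysis software, and the function names and example values are hypothetical.

```python
import math


def confidence_interval(estimate, se, multiplier=2.0):
    """Approximate 95 percent interval: estimate plus or minus 2 standard errors."""
    return (estimate - multiplier * se, estimate + multiplier * se)


def se_of_difference(se_a, se_b):
    """Square each standard error, sum the squares, and take the square root."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


def is_significant(est_a, se_a, est_b, se_b, multiplier=2.0):
    """Return True when zero lies outside the interval for the difference."""
    diff = est_a - est_b
    low, high = confidence_interval(diff, se_of_difference(se_a, se_b), multiplier)
    return not (low <= 0.0 <= high)


# Illustrative (made-up) average scale scores and standard errors.
print(confidence_interval(250.0, 1.5))
print(se_of_difference(3.0, 4.0))
print(is_significant(241.0, 0.9, 232.0, 0.8))
```

The `multiplier` argument is exposed only so that a different critical value can be substituted; the text itself uses about 2 standard errors.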
The descriptions of trend results are based on the results of statistical tests that consider both the estimates of average performance in each assessment year and the degree of uncertainty associated with these estimates. The purpose of basing descriptions on such tests is to restrict the discussion of observed trends and group differences to those that are statistically dependable. Hence, the patterns of results that are discussed are unlikely to be due to the chance factors associated with the inevitable sampling and measurement errors inherent in a large-scale survey effort like NAEP. All descriptions of trend patterns, differences between assessment years, and differences between groups of students that are cited are statistically significant at the .05 level.

Cautions in Interpretations

As previously stated, the NAEP reading and mathematics long-term trend scales make it possible to examine relationships between students’ performance and various background factors measured by NAEP. However, a relationship between achievement and another variable does not reveal its underlying cause, which may be influenced by a number of other variables. Similarly, the assessments do not reflect the influence of unmeasured variables. The results are most useful when they are considered in combination with other knowledge about the student population and the educational system, such as trends in instruction, changes in the school-age population, and societal demands and expectations.

A caution is also warranted for some small population group estimates. Smaller population groups may show increases or decreases across years in average scores; however, it is necessary to interpret such score changes with extreme caution. Another reason for caution is that the standard errors are often quite large around the score estimates for small groups, which in turn means the standard error around the gain is also large.

File Formats

Tables are presented in HTML format.
Tables can be copied, via the clipboard, and pasted into third-party software such as Microsoft Excel for printing.

For More Information About NAEP and the Summary Data Tables

For questions, contact: