NAEP Validity Studies (NVS) Papers

NVS publications up to the current year are listed below. Read their abstracts below and click to follow to the paper in PDF format, or see a listing of more NAEP publications.

Partitioning NAEP Trend Data (2007)
Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8
Federal Sample Sizes for Confirmation of State Tests in the No Child Left Behind Act (2004)
Evaluation of Bias Correction Methods for “Worst-case” Selective Non-participation in NAEP (2004)
The Effects of Finite Sampling on State Assessment Sample Requirements (2003)
Computer Use and Its Relation to Academic Achievement in Mathematics, Reading, and Writing (2003)
A Study of Equating in NAEP (2003)
An Investigation of Why Students Do Not Respond to Questions (2003)
Optimizing State NAEP: Issues and Possible Improvements (2003)
Improving the Information Value of Performance Items in Large Scale Assessments (2003)
Reporting the Results of the National Assessment of Educational Progress (2003)
Implications of Electronic Technology for the NAEP Assessment (2003)
An Agenda for NAEP Validity Research (2003)
The Validity of Oral Accommodation in Testing (2003)

Partitioning NAEP Trend Data (2007) (170K PDF) (Note: Link takes you off this website.)
Fundamental to statistical analyses is the comparison of means of one variable from two or more populations. Population samples may be constructed (e.g., experimental and control groups), or they may be natural groupings (e.g., students at a particular grade in different years). If the populations are similar, the mean comparisons are straightforward; if not, the question arises as to whether the mean differences are due to differences in the variable or differences in the populations. Partitioning analysis is a way of distinguishing between these differences.
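As a hedged illustration of how a difference between two overall means can be split into a part due to subgroup score changes and a part due to shifts in subgroup shares, the sketch below uses one common algebraic decomposition. The function name, the subgroup labels, and all shares and scores are made-up for illustration, and the formulas are an assumption rather than the paper's exact method.

```python
def partition_mean_difference(shares_1, means_1, shares_2, means_2):
    """Split (overall mean in year 2 - overall mean in year 1) into three
    additive parts. Assumed decomposition, not taken from the paper:
      proficiency = sum over groups of share_1 * (mean_2 - mean_1)
      population  = sum over groups of (share_2 - share_1) * mean_1
      joint       = sum over groups of (share_2 - share_1) * (mean_2 - mean_1)
    """
    groups = list(shares_1)
    overall_1 = sum(shares_1[g] * means_1[g] for g in groups)
    overall_2 = sum(shares_2[g] * means_2[g] for g in groups)
    proficiency = sum(shares_1[g] * (means_2[g] - means_1[g]) for g in groups)
    population = sum((shares_2[g] - shares_1[g]) * means_1[g] for g in groups)
    joint = sum(
        (shares_2[g] - shares_1[g]) * (means_2[g] - means_1[g]) for g in groups
    )
    return {
        "difference": overall_2 - overall_1,
        "proficiency": proficiency,
        "population": population,
        "joint": joint,
        # Year-2 subgroup means weighted by year-1 shares.
        "standardized_mean": overall_1 + proficiency,
    }


# Hypothetical data: both groups improve, but the lower-scoring group grows.
shares_year_1 = {"A": 0.8, "B": 0.2}
means_year_1 = {"A": 260.0, "B": 230.0}
shares_year_2 = {"A": 0.6, "B": 0.4}
means_year_2 = {"A": 265.0, "B": 240.0}

result = partition_mean_difference(
    shares_year_1, means_year_1, shares_year_2, means_year_2
)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

In this made-up example each subgroup gains 5 to 10 points, yet the overall mean rises by only about 1 point because the population shift offsets most of the gain.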
This paper is a demonstration of how partitioning analysis can be used to help separate changes in reading and mathematical proficiency from changes in school populations over assessment years, using NAEP reading and mathematics trend data from 13-year-old students in four assessment years. Partitioning analysis separates the difference between two means into three parts: the proficiency effect (the change in means attributable to changes in student ability), the population effect (the part attributable to population changes), and the joint effect (the part attributable to the way that the population and proficiency work together). Partitioning analysis makes it simple to compute a well-known statistic, the standardized mean, which estimates what the mean would have been if the percentages of the various subgroups had remained the same. The data were classified by racial/ethnic groupings. The results showed that each racial/ethnic group improved during the selected time spans, while the population shifts diminished the measure of increased performance. The results invite speculation and point to areas for future research.

Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8 (1.86M PDF)
In spring 2006, the NVS Panel was asked by the National Center for Education Statistics (NCES) to undertake a validity study of the current NAEP mathematics assessment. In particular, NCES asked the Panel to answer the following questions:
Because the framework for grade 12 mathematics was under revision at the time, the validity study was limited to grades 4 and 8. This report provides a great deal of detail about what could be improved in the NAEP mathematics assessment. The reader should not construe this proliferation of detail as a summative judgment against the NAEP system. The NAEP mathematics assessment has been, and remains, an important and useful tool for monitoring what U.S. children know and can do in mathematics. Importantly, the organizations that make up the NAEP system are joined in a serious learning community. This study is part of the NAEP system and part of the way it learns about itself and improves. See NCES comments on this study.

Assigning Adaptive NAEP Booklets Based on State Assessment Scores: A Simulation Study of the Impact on Standard Errors (2004) (148K PDF) (Note: Link takes you off this website.)
Adaptive testing is the process of testing students with materials whose difficulty is tailored to the students' expected level of performance. This report presents findings from a simulation that was designed to assess the improvements in estimates of NAEP standard errors that might result from using two-stage adaptive testing, in which students' performance on their state tests is used as a basis for assigning NAEP item block difficulty. Student samples were drawn from a file that linked NAEP and state assessment data for four states. All fourth-grade students with complete data were selected. The students were grouped in two different ways: in one procedure the quartile with the lowest state assessment scores was designated the “low” group and the quartile with the highest scores the “high” group, with the remainder assigned to a “middle” group. In the second procedure the bottom decile was the low group, the top decile the high group, and the rest of the students the middle group. Simulated item responses were then generated for each student.
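To make the idea of tailoring block difficulty more concrete, here is a minimal sketch of how shifting item difficulties by a constant changes the precision of an ability estimate. It assumes a simple Rasch (one-parameter logistic) model and uses the analytical test information function; the report's abstract does not name the item response model, and the ability value and difficulty parameters below are hypothetical. It only illustrates the mechanism and does not reproduce the study's simulation or its results.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model (assumed)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def standard_error(theta, difficulties):
    """Asymptotic standard error of the ability estimate at theta.

    Under the Rasch model each item contributes information p * (1 - p),
    and the standard error is 1 / sqrt(total information).
    """
    information = 0.0
    for b in difficulties:
        p = rasch_probability(theta, b)
        information += p * (1.0 - p)
    return 1.0 / math.sqrt(information)


def shift_block(difficulties, constant):
    """Make a block easier (negative constant) or harder (positive constant)."""
    return [b + constant for b in difficulties]


# Hypothetical item block and a hypothetical low-ability student.
base_block = [-1.0, -0.5, 0.0, 0.5, 1.0]
low_theta = -1.5

se_base = standard_error(low_theta, base_block)
se_easier = standard_error(low_theta, shift_block(base_block, -1.0))

print(f"standard error with original block: {se_base:.3f}")
print(f"standard error with easier block:   {se_easier:.3f}")
```

Under these assumptions the easier block is better matched to the low-ability student, so the standard error shrinks; real NAEP estimation involves additional steps that this sketch ignores.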
The varying difficulties were simulated for NAEP item blocks by adding or subtracting a constant to the difficulty parameters. The results indicated that the simulated easier block of items for lower-ability students appeared to have had the desired effect of reducing standard errors for these students. The authors recommend that, if this improvement is found to be sustained even after accounting for conditioning (a technique not within the scope of the study), NAEP should seriously consider the use of adaptive test items for low-performing students. However, simulating the use of more difficult item blocks for high-achieving students actually increased standard errors. The authors point out that the existing standard errors for this group are already as good as or better than those for students in the middle of the distribution, and that adaptive testing may therefore not be necessary. Should adaptive testing be pursued for high-achieving students, the results suggest that parameters for item blocks would need to be more carefully tailored to expected abilities, rather than merely shifting the parameters of existing item blocks as was done for this study.

Federal Sample Sizes for Confirmation of State Tests in the No Child Left Behind Act (2004) (572K PDF) (Note: Link takes you off this website.)
The concept of gaps in student performance appears in many places throughout the No Child Left Behind (NCLB) Act, especially with respect to gaps in achievement between groups of students considered advantaged and disadvantaged. Because the legislation does not provide a statistical definition of a gap, definition and implementation remain open questions. This report discusses the general concept of gaps, and analyzes the advantages and disadvantages of several different approaches to estimating gaps and reductions in gaps.
It asks which statistics and which types of performance measures are most appropriate for measuring gap improvement, and discusses what NAEP state sample sizes might be required for different targeted minority groups if NAEP were to be used to measure gaps. The authors offer two approaches for gap measurement: the difference between the performance of the advantaged and disadvantaged groups in the current year, and the difference between the disadvantaged group's performance in the current year and that of the advantaged group in the baseline year. While the former is presented as the natural choice, the report notes that this approach has a larger variance due to the contribution of the advantaged group, and therefore requires a larger sample. A drawback of the latter choice is that it can show a decreasing gap when the absolute inequality between the two groups may be constant or increasing. The report points out that both approaches require that gap improvement be defined to occur only if the performance of the advantaged group does not deteriorate. The report also discusses the concept of adequate yearly progress. The authors point out that this does not require comparisons across groups, making it an easier measure to confirm than a gap. The report goes on to discuss the implications of measuring gaps with three different measures of performance on NAEP: mean scale score, proportion of students at or above the NAEP basic achievement level, and proportion of students at or above the NAEP proficient achievement level. Mean scale scores require the smallest sample size of the three, and allow simplified computation through variance estimates that do not depend on the mean. This is presented as a potentially significant advantage due to an anticipated dramatic change in mean scores over the life of NCLB; sample sizes could be set once and left unchanged for the duration.
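The two gap definitions described above can be compared with a small numerical sketch. It assumes simple random samples, independent groups, and a baseline-year advantaged mean treated as a fixed benchmark; NAEP's actual sampling design is more complex, and every number and function name below is hypothetical.

```python
import math


def standard_error_of_mean(sd, n):
    """Standard error of a group mean under simple random sampling (assumed)."""
    return sd / math.sqrt(n)


def current_year_gap(adv_now, dis_now, se_adv_now, se_dis_now):
    """Gap between the two groups in the current year, with its standard error.

    Both current-year means are estimated, so both contribute variance.
    """
    gap = adv_now - dis_now
    se = math.sqrt(se_adv_now ** 2 + se_dis_now ** 2)
    return gap, se


def baseline_referenced_gap(adv_baseline, dis_now, se_dis_now):
    """Gap between the baseline advantaged mean and the current disadvantaged mean.

    The baseline value is treated as a fixed benchmark (an assumption here),
    so only the disadvantaged group's current-year sample contributes variance.
    """
    gap = adv_baseline - dis_now
    return gap, se_dis_now


# Hypothetical scale scores: both groups gain 10 points since the baseline year.
adv_baseline, dis_baseline = 280.0, 250.0
adv_now, dis_now = 290.0, 260.0

se_adv = standard_error_of_mean(sd=35.0, n=1600)
se_dis = standard_error_of_mean(sd=35.0, n=400)

gap_a, se_a = current_year_gap(adv_now, dis_now, se_adv, se_dis)
gap_b, se_b = baseline_referenced_gap(adv_baseline, dis_now, se_dis)

print(f"baseline gap:                {adv_baseline - dis_baseline:.1f}")
print(f"current-year gap:            {gap_a:.1f} (SE {se_a:.2f})")
print(f"baseline-referenced gap:     {gap_b:.1f} (SE {se_b:.2f})")
```

With these made-up numbers the current-year gap is unchanged from the baseline, while the baseline-referenced gap appears to shrink, and the current-year gap carries the larger standard error.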
Between the two achievement-level performance measures, the proportion of students at or above the basic level requires a smaller sample size. It also has the advantage over mean scale scores of being more compatible with the average yearly performance statistic, providing a consistent quantitative measure for both gaps and adequate yearly progress.

Evaluation of Bias Correction Methods for “Worst-case” Selective Non-participation in NAEP (2004) (195K PDF) (Note: Link takes you off this website.)
With the advent of the No Child Left Behind Act, the context for NAEP participation is changing, and with this shifting context comes the possibility of selective non-participation at the top or bottom of the ability distribution for both students and schools, resulting in a bias in statewide mean scores. This report estimates the potential bias caused by worst-case scenarios of selective non-participation, and examines the extent to which statistical methods can correct for that bias. To simulate bias at the school level, the report's authors truncated one tail of the distribution of a set of school-level mean NAEP scores (both actual scores and predicted scores in separate analyses), discarding between 5 and 25 percent of schools with either the highest or lowest mean scores. To simulate student-level bias, the authors discarded 1 to 10 students in each school with the highest or lowest NAEP or state test scores. At the school level, truncation of the lowest 10 percent of schools in the grade 4 sample (based on NAEP scores) yielded an average bias of 2.9 points, ±0.6 points, while at the student level, truncation of the 2 lowest-scoring students in each school yielded an average bias of 5.1 points, ±1 point. Similar results were observed at grade 8.
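The school-level truncation described above can be sketched with synthetic data. The example below assumes unweighted school means drawn from a normal distribution with made-up parameters; it does not use NAEP data or sampling weights, and its output is not meant to match the figures reported in the study.

```python
import random
import statistics


def truncation_bias(school_means, fraction, tail="low"):
    """Bias in the overall mean after discarding one tail of schools.

    fraction is the share of schools dropped; tail selects whether the
    lowest-scoring ("low") or highest-scoring ("high") schools are removed.
    """
    ordered = sorted(school_means)
    n_drop = int(len(ordered) * fraction)
    if tail == "low":
        kept = ordered[n_drop:]
    else:
        kept = ordered[: len(ordered) - n_drop]
    return statistics.mean(kept) - statistics.mean(ordered)


# Synthetic school means (hypothetical mean and spread, fixed seed).
rng = random.Random(0)
schools = [rng.gauss(220.0, 10.0) for _ in range(1000)]

for fraction in (0.05, 0.10, 0.25):
    low = truncation_bias(schools, fraction, tail="low")
    high = truncation_bias(schools, fraction, tail="high")
    print(f"drop {fraction:.0%}: bias {low:+.2f} (low tail), {high:+.2f} (high tail)")
```

Dropping low-scoring schools pushes the estimated mean upward and dropping high-scoring schools pushes it downward, with the size of the bias growing as more schools are discarded.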
To test the effectiveness of using statistical methods to correct for this bias, the authors used regression based on demographic predictors, state assessment scores, or both, to assign means to the removed, “non-participating” schools and to assign scores to non-participating students. In addition, three different imputation methods were evaluated: forward linear regression, linear equating (in which the imputation is based on standardized values of the predictors and regression estimate), and reverse regression (where the predictor is regressed on available NAEP scores, and the regression coefficients used to impute the missing NAEP scores). After each imputation, the state population mean was recalculated using the new imputed data, and this mean was compared with the statewide mean based on the full NAEP sample. The tested corrections for non-participation bias were partially effective, eliminating roughly half of the bias when the linear equating method was used. The report notes that the other regression models can improve the corrections, but their accuracy is dependent on knowledge about the mechanisms of non-participation, which is not likely to be available in practice.

The Effects of Finite Sampling on State Assessment Sample Requirements (2003) (341K PDF)
This study addresses statistical techniques that might ameliorate some of the sampling problems currently facing states with small populations participating in State NAEP. The author explores how the application of finite population correction factors to the between-school component of variance could be used to modify sample sizes required of states that currently qualify for the exemptions from State NAEP's minimum sample requirements. He also examines how to preserve the infinite population assumptions for hypothesis testing related to comparisons between domain means. Results lend support to alternate sample size specifications both in states with few schools and in states with many small schools.
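As a rough illustration of why a finite population correction on the between-school variance component can lower sample requirements, the sketch below uses a simplified textbook two-stage variance formula with equal numbers of students per school. The formula, the function names, and all variance values are assumptions for illustration and are not taken from the study.

```python
import math


def variance_of_mean(n_schools, students_per_school, between_var, within_var,
                     total_schools=None):
    """Approximate variance of a state mean under two-stage sampling.

    Simplified form (assumed): between_var / n + within_var / (n * m).
    If total_schools is given, the between-school term is multiplied by the
    finite population correction (1 - n / N).
    """
    fpc = 1.0
    if total_schools is not None:
        fpc = 1.0 - n_schools / total_schools
    between_term = fpc * between_var / n_schools
    within_term = within_var / (n_schools * students_per_school)
    return between_term + within_term


def schools_needed(target_se, students_per_school, between_var, within_var,
                   total_schools=None):
    """Smallest number of schools whose variance meets the target standard error.

    Solves the variance formula above for n in closed form.
    """
    target_var = target_se ** 2
    numerator = between_var + within_var / students_per_school
    denominator = target_var
    if total_schools is not None:
        denominator += between_var / total_schools
    return math.ceil(numerator / denominator)


# Hypothetical variance components and a small state with 80 schools.
n_infinite = schools_needed(1.5, 25, between_var=100.0, within_var=900.0)
n_finite = schools_needed(1.5, 25, between_var=100.0, within_var=900.0,
                          total_schools=80)

print(f"schools needed, no correction:      {n_infinite}")
print(f"schools needed, with correction:    {n_finite}")
```

Under these made-up values, accounting for the fact that the sample covers a large share of a small state's schools reduces the number of schools needed to reach the same target standard error.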
The author notes that permitting states to use design options other than the current State NAEP requirement could reduce costs related to test administration, scoring, and data processing.

Computer Use and Its Relation to Academic Achievement in Mathematics, Reading, and Writing (2003) (514K PDF)
In this study, the authors, using evidence obtained from the 1996 NAEP assessment in mathematics and the 1998 NAEP main assessments in reading and writing, examine patterns of computer use and achievement in each of these three academic domains. The authors conclude that the design of the NAEP data collection precludes using such data to make even tentative conclusions about the relationship of achievement and computer use. They recommend further study, including a multi-site experiment to determine how teachers and students are using computers and the impact of computers on achievement.

A Study of Equating in NAEP (2003) (955K PDF)
The authors detail a computer-simulation study they conducted to investigate the amount of uncertainty added to NAEP estimates by equating error under three different equating methods and while varying a number of factors that might affect accuracy of equating. Data from past NAEP administrations were used to guide the simulations, and error due to equating was estimated empirically. It is the authors' conclusion that the merits of less biased measurements may outweigh the problems caused by slight adjustments to previously reported scores. They recommend that long-term trend lines be periodically reanalyzed using methods such as multiple-group IRT that can minimize such biases.

An Investigation of Why Students Do Not Respond to Questions (2003) (569K PDF)
Developers of NAEP have substantially changed the mix of item types on assessments, decreasing the numbers of multiple-choice questions and increasing the numbers of short and extended constructed-response questions.
At the same time, researchers have noted unacceptably high student nonresponse rates. These rates seem to vary with student characteristics like gender and race, and they potentially confound NAEP reports, analyses, and subsequent conclusions. This small-scale, exploratory study, which is qualitative in nature, offers insights for designing future studies. Future studies might include quantitative analysis of existing NAEP data sets to determine whether observed patterns of association between omissions and student or item characteristics hold up over larger numbers of students and items than were included in this study.

Optimizing State NAEP: Issues and Possible Improvements (2003) (253K PDF)
The paper addresses three key topics related to making state NAEP more efficient: reducing the burden for the states, stabilizing the assessment schedule, and facilitating and promoting the use of state NAEP data. The author recommends promoting the use of state NAEP data for the continued success of the NAEP program. She suggests that this could involve devoting greater attention to how best to link state assessment and NAEP results, developing more timely and user-friendly reports, and working with states and other organizations to more effectively address the data needs of different NAEP audiences. She also proposes expending proportionately less of the state NAEP resources on data collection and more on disseminating information about the many uses of the program.

Improving the Information Value of Performance Items in Large Scale Assessments (2003) (290K PDF)
The authors first provide a summary overview of what is already known and what still needs to be learned about item types for future NAEP assessments.
Essentially, the question addressed here is: Do constructed-response items provide more information about what students are capable of doing than multiple-choice items alone, and if so, what types of skills are tapped by the constructed-response items that are not measured by multiple-choice items? A fresh examination of the relationships between multiple-choice and constructed-response items is needed, according to the authors. The authors propose a set of studies that would provide needed information about the value added by performance items in mixed-format assessments such as NAEP.

Reporting the Results of the National Assessment of Educational Progress (2003) (461K PDF)
This paper explores ways results of NAEP data collections might be communicated to a variety of audiences, each with differing needs for information, interests in its findings, and sophistication in interpreting its results. The author describes market-basket reporting as a feasible alternative to traditional NAEP reporting. These reports would include samples of items and exercises used in an assessment together with their scoring rubrics, which would give a clearer picture of the kinds of skills assessed by NAEP, as well as an indication of skills not assessed. In the second section of the paper, the author cautions that in order to uphold strict standards of data quality, NAEP reports must format and display results to make them more accessible while also discouraging readers from drawing overly broad interpretations of the data. A final section describes a detailed program of research on reporting and dissemination of NAEP findings based on these three dimensions: the research questions to be asked; the audiences to whom the questions should be addressed; and the strategies through which the questions should be pursued, as well as the intersection of these dimensions.
The author suggests that the highest priority be given to research on reporting through public media, followed by making NAEP reporting more understandable and useful to school curriculum and instruction personnel, reporting to the public, and further research with state education personnel.

Implications of Electronic Technology for the NAEP Assessment (2003) (381K PDF)
This report emphasizes the need for NAEP to integrate the use of technology into its assessment procedures, reviews major options, and suggests priorities to guide the integration. The author identifies three short-term goals for this development: a linear computer-administered assessment in a target subject area such as mathematics or science should be implemented; a computer-administered writing assessment should be developed and implemented; and the introduction and evaluation of technology-based test accommodations for handicapped students and English-language learners should be continued. The author suggests that NAEP consider redesign as an integrated electronic information system that would involve all aspects of the assessment process including assessment delivery, scoring and interpretation, development of assessment frameworks, specifications of population and samples, collection of data, and preparation and dissemination of results.

Feasibility Studies of Two-Stage Testing in Large-Scale Educational Assessment: Implications for NAEP (2003) (968K PDF)
This report discusses the rationale for enhancing the current NAEP design by adding a capacity for adaptive testing. In adaptive testing, items are tailored to the achievement level of the student. The authors conclude that implementation of adaptive testing procedures, two-stage testing in particular, has the potential to increase the usability and validity of NAEP results. Adaptive testing would permit adequately reliable scores to be reported to individual students and their parents, increasing their personal stake in performing well.
Improvement in data quality would also speed data processing and permit delivery of assessment results in a timely manner.

An Agenda for NAEP Validity Research (2003) (598K PDF)
This report resulted from the systematic analysis undertaken by the NAEP Validity Studies Panel to consider the domain of validity threats to NAEP and to identify the most urgent validity research priorities. A framework of six broad categories was devised: 1) the constructs measured within each of NAEP's subject domains; 2) the manner in which these constructs are measured; 3) the representation of the population; 4) the analysis of data; 5) the reporting and use of NAEP results; and 6) the assessment of trends. Panel subcommittees prepared papers laying out the critical validity issues in each area, and these papers are presented in chapters 2 through 7 of this report. The panel reviewed papers in each area and set priorities for each area of validity research by a consensus process. Sixteen suggested studies or areas of study were rated by the full panel. Four studies stood out as essential and two others were rated between highly important and essential. The panel indicated unanimously that studies are essential to evaluate the validity aspects of NAEP's new role under the No Child Left Behind legislation, however that role is operationalized.

The Validity of Oral Accommodation in Testing (2003) (456K PDF)
This study examines the impact of oral presentation of a mathematics test on the performance of disabled and non-disabled students. It is an example of empirical research providing evidence for evaluating the validity and fairness of accommodations use. Both learning disabled and non-disabled students improved their performance under the accommodated conditions, although learning disabled students had greater gains.
The presence of an effect for the regular classroom students suggests the possibility that irrelevant variance in the non-accommodated scores is overcome by the use of the accommodation for both groups of students.