NAEP Validity Studies (NVS) Papers

NVS publications up to the current year are listed below. Read their abstracts below and click to follow to the paper in PDF format, or see a listing of more NAEP publications.

Guiding Principles and Suggested Studies for Determining When the Introduction of a New Assessment Framework Necessitates a Break in Trend in NAEP (2009)

Sensitivity of NAEP to the Effects of Reform-Based Teaching and Learning in Middle School Mathematics (2009)

Utility and Validity of NAEP Linking Efforts (2009)

Partitioning NAEP Trend Data (2007)

Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8
Appendices: Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8 (2007)

Estimating Effects of Non-Participation on State NAEP Scores Using Empirical Methods (2007)

State Implementation of NCLB Policies and Interpretation of the NAEP Performance of English Language Learners (2006)

Using State Assessments to Assign Booklets to NAEP Students to Minimize Measurement Error: An Empirical Study in Four States (2005)

Using State Assessments to Impute Achievement of Students Absent from NAEP: An Empirical Study in Four States (2005)

Assigning Adaptive NAEP Booklets Based on State Assessment Scores: A Simulation Study of the Impact on Standard Errors (2004)

Federal Sample Sizes for Confirmation of State Tests in the No Child Left Behind Act (2004)

Evaluation of Bias Correction Methods for “Worst-case” Selective Non-participation in NAEP (2004)

The Effects of Finite Sampling on State Assessment Sample Requirements (2003)

Computer Use and Its Relation to Academic Achievement in Mathematics, Reading, and Writing (2003)

A Study of Equating in NAEP (2003)

An Investigation of Why Students Do Not Respond to Questions (2003)

Optimizing State NAEP: Issues and Possible Improvements (2003)

Improving the Information Value of Performance Items in Large Scale Assessments (2003)

Reporting the Results of the National Assessment of Educational Progress (2003)

Implications of Electronic Technology for the NAEP Assessment (2003)

Feasibility Studies of Two-Stage Testing in Large-Scale Educational Assessment: Implications for NAEP (2003)

An Agenda for NAEP Validity Research (2003)

The Validity of Oral Accommodation in Testing (2003)

Descriptions of the NVS publications

Guiding Principles and Suggested Studies for Determining When the Introduction of a New Assessment Framework Necessitates a Break in Trend in NAEP (2009) (167K PDF) Note: Link takes you off this website.

The National Assessment Governing Board (NAGB) periodically updates National Assessment of Educational Progress (NAEP) frameworks in each subject area in order to ensure that the assessments reflect current thinking about teaching and learning. Given the heightened interest in NAEP results occasioned by No Child Left Behind (NCLB), it has become increasingly important to establish sound policies and procedures for reporting results of NAEP assessments based on new or modified frameworks. The purpose of this paper is to recommend guiding principles, studies, and decision-making processes that can assist National Center for Education Statistics (NCES) in determining whether the results generated by an assessment based on a new NAEP framework can be validly reported on the same trend line as previous versions of the assessment.

Sensitivity of NAEP to the Effects of Reform-Based Teaching and Learning in Middle School Mathematics (2009) (461K PDF) Note: Link takes you off this website.

This study is a validity study intended to test the adequacy of the National Assessment of Educational Progress (NAEP) for detecting and monitoring the effects of mathematics education reform. Seventh- and eighth-grade students in reform mathematics classrooms were tested at the beginning and end of the school year. On both occasions they took both NAEP and a reform-oriented assessment called the Balanced Assessment in Mathematics (BAM). The two assessments were scaled jointly, and the effect sizes of the gains were compared across assessments. NAEP and BAM were equally successful in detecting gains in Connected Mathematics Project (CMP) classrooms, but NAEP was better able to detect effects in Algebra classrooms in the same schools.

Utility and Validity of NAEP Linking Efforts (2009) (413K PDF) Note: Link takes you off this website.

There are a number of practical situations in which it would be desirable to be able to use the results of the administration of one assessment to estimate what the results would have been if another assessment had been administered. Test linking refers to the idea that results obtained from the administration of one test might be used to infer what the results would have been if another test had been used. This paper reviews the strengths and limitations of commonly employed linking methodologies, reviews the history of linking efforts involving the National Assessment of Educational Progress (NAEP), and proposes a framework to consider linking utility and validity. In particular, the paper suggests that the utility of answers based on linking depends on the kinds of decisions that are to be made and the increase in the positive outcomes of those decisions that can be achieved through linking, when compared to decisions based on no information. An alternative paradigm, in which questions about the relations between the NAEP scale and the scales of other tests were phrased as validity research rather than linking, is also proposed.

Partitioning NAEP Trend Data (2007) (170K PDF) Note: Link takes you off this website.

Fundamental to statistical analyses is the comparison of means of one variable from two or more populations. Population samples may be constructed (e.g., experimental and control groups), or they may be natural groupings (e.g., students at a particular grade in different years). If the populations are similar, the mean comparisons are straightforward; if not, the question arises as to whether the mean differences are due to differences in the variable or differences in the populations. Partitioning analysis is a way of distinguishing between these differences.

This paper is a demonstration of how partitioning analysis can be used to help separate changes in reading and mathematical proficiency from changes in school populations over assessment years, using NAEP reading and mathematics trend data from 13-year-old students in four assessment years.

Partitioning analysis separates the difference between two means into three parts: the proficiency effect (the change in means attributable to changes in student ability), the population effect (the part attributable to population changes), and the joint effect (the part attributable to the way that the population and proficiency work together). Partitioning analysis makes it simple to compute a well-known statistic, the standardized mean, which estimates what the mean would have been if the percentages of the various subgroups had remained the same.
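The three-part split described above can be sketched with a standard additive decomposition of a weighted mean. This is a minimal illustration, not necessarily the exact formulation used in the paper: the function name, the two-subgroup layout, and all numbers are hypothetical, and it assumes the overall mean is the share-weighted sum of subgroup means.

```python
def partition_mean_change(p1, m1, p2, m2):
    """Split the change in an overall mean into three additive parts.

    p1, p2: subgroup population shares in year 1 and year 2 (each sums to 1).
    m1, m2: subgroup means in year 1 and year 2.
    All inputs here are hypothetical illustration values.
    """
    # Change in subgroup means, holding year-1 shares fixed.
    proficiency = sum(a * (d - c) for a, c, d in zip(p1, m1, m2))
    # Change in shares, holding year-1 subgroup means fixed.
    population = sum((b - a) * c for a, b, c in zip(p1, p2, m1))
    # Interaction of share changes with mean changes.
    joint = sum((b - a) * (d - c) for a, b, c, d in zip(p1, p2, m1, m2))
    # Year-2 mean if the subgroup shares had stayed at year-1 values.
    standardized_mean = sum(a * d for a, d in zip(p1, m2))
    return proficiency, population, joint, standardized_mean


# Hypothetical two-subgroup example (not NAEP data).
p1, m1 = [0.8, 0.2], [260.0, 230.0]
p2, m2 = [0.7, 0.3], [265.0, 240.0]

mean1 = sum(p * m for p, m in zip(p1, m1))
mean2 = sum(p * m for p, m in zip(p2, m2))
proficiency, population, joint, standardized_mean = partition_mean_change(p1, m1, p2, m2)
print(mean2 - mean1, proficiency, population, joint, standardized_mean)
```

In this made-up example both subgroups improve, yet the shift in shares toward the lower-scoring subgroup pulls the overall gain below the proficiency effect, which is the pattern the abstract reports.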

The data were classified by racial/ethnic groupings. The results showed that each racial/ethnic group improved during the selected time spans, while the population shifts diminished the measure of increased performance. The results invite speculation and suggest directions for future research.

Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8 (1.86M PDF)
Appendices: Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8 (2007) (1.33M PDF) Note: Links above take you off this website.

In spring 2006, the NVS Panel was asked by National Center for Education Statistics (NCES) to undertake a validity study of the current NAEP mathematics assessment. In particular, NCES asked the Panel to answer the following questions:

  • Does the NAEP framework offer reasonable content and skill-based coverage compared to the assessments of states and other nations?
  • Does the NAEP item pool and assessment design accurately reflect the NAEP framework?
  • Is NAEP mathematically accurate and not unduly oriented to a particular curriculum, philosophy, or pedagogy?
  • Does NAEP properly consider the spread of abilities in the assessable population?
  • Does NAEP provide information that is representative of all students, including students who are unable to demonstrate their achievements on the standard assessment?

Because the framework for grade 12 mathematics was under revision at the time, the validity study was limited to grades 4 and 8.

This report provides a great deal of detail about what could be improved in the NAEP mathematics assessment. The reader should not construe this proliferation of detail as a summative judgment against the NAEP system. The NAEP mathematics assessment has been, and remains, an important and useful tool for monitoring what U.S. children know and can do in mathematics. Importantly, the organizations that make up the NAEP system are joined in a serious learning community. This study is part of the NAEP system and part of the way it learns about itself and improves. See NCES comments on this study.

Estimating Effects of Non-Participation on State NAEP Scores Using Empirical Methods (2007) (341K PDF) Note: Link takes you off this website.

The primary objectives of NAEP tests are to accurately monitor the progress of defined groups of students over time and to measure valid differences in scores between student groups at a single point in time. In this context, valid scores reflect differences in scores that are linked to “real” differences in student knowledge as measured on achievement tests.

The NVS Panel previously sponsored analysis directed toward estimating the potential bias from changing exclusion rates in NAEP tests. This study takes up a second threat to the validity of scores arising from differential and changing participation rates of schools and students in NAEP testing. Non-participation can arise either from the absence or refusal to participate of a student chosen in the sample (student non-participation) or from a decision of a principal to refuse to allow the school to participate (school non-participation). School participation was voluntary until the 2003 test when participation of sampled schools became mandatory by federal statute. However, student participation continues to be voluntary.

This study has the following objectives:

  • To compile and examine student and school non-participation rates across states that might explain non-participation patterns across states and their potential for bias;
  • To treat the 2002–2003 4th and 8th grade state scores as a natural experiment to estimate the extent of possible bias;
  • To develop statistical models that account for the pattern of state NAEP scores for 696 state scores from 1990–2003, and to assess whether the pattern of nonparticipation is a significant explanatory factor in this pattern of state NAEP scores; and
  • To compare estimates of bias from these methods to the bias from worst case scenarios estimated by McLaughlin, 2004.

State Implementation of NCLB Policies and Interpretation of the NAEP Performance of English Language Learners (2006) (415K PDF) Note: Link takes you off this website.

The number of students classified as ELL in the 50 states and the District of Columbia has grown significantly over the past decade. This report outlines a number of critical issues that should be addressed in order to allow states to explore and understand relationships between the performance of ELL students on NAEP and on state assessments in this policy context. The results of this study can be useful to a variety of education stakeholders who are interested in improving the utility of NAEP for examining the performance of ELL students.

The measurement context is also complex. NAEP and state assessments in reading and mathematics are not developed for exactly the same purposes, and they do not have exactly the same measurement properties. As will be discussed in this report, attention to a formal statistical study of the relationships between ELL scores on NAEP and state assessments will help to inform the validity rationale underlying attempts to compare results across these assessments. In order to enhance the discussion and make concrete the issues under investigation, this report briefly reviews both provisions for ELL participation in NAEP and NCLB provisions for ELL assessment. In addition, the results of an exploration of NCLB policies and practices in four states (California, New York, Texas and Washington) are examined. Exploration of issues for the four target states helps illustrate key validity challenges faced by states as they consider investigation of relationships between NAEP scores and state assessment scores for ELLs under NCLB for their individual state.

Finally, consideration is given to next steps that the NAEP program might take to improve states’ use of NAEP scores as part of their analysis of progress in attaining NCLB goals.

Using State Assessments to Assign Booklets to NAEP Students to Minimize Measurement Error: An Empirical Study in Four States (2005) (110K PDF) Note: Link takes you off this website.

Because NAEP estimates of state-level achievement play an important role in the evaluation of strategies for improving the nation’s educational system, it is important that these estimates have as small a standard error as possible.

The standard error of state-level estimates can be reduced either by increasing the sample size, which is expensive, or by reducing the error in each student’s measurement. The error of measurement can be reduced by increasing testing time, but that would also entail additional cost, as well as additional burden on the students selected to participate in NAEP. The error of measurement varies from student to student, and that variation depends on the “fit” between the student and the test. For the highest achieving students, easy test items provide little information, and for the lowest achieving students, hard test items provide little information. If booklets could be targeted, then the error of measurement for the segment of the population they represent could be reduced at very little cost. A savings of 10 percent in the measurement error would produce benefits equivalent to increasing the length of the test or the number of students tested by nearly 20 percent, and a simulation study sponsored by the NAEP Validity Studies Panel estimated that such a reduction should be possible (Linn, McLaughlin, Jiang, and Gallagher, 2004).

For this study, the records of participants in the 2003 NAEP reading and mathematics assessments in four states were matched to state assessment records, and the standard errors of lowest and highest quartile students, based on the state assessments, were compared for all of the existing NAEP item blocks.

Five research questions guided this study:

  1. How different are the difficulties of blocks on existing NAEP reading and mathematics assessments?
  2. Can state assessments identify potentially low achievers on NAEP?
  3. Are standard errors affected by block difficulties; specifically, are the standard errors for predicted low-achieving students smaller when they are assigned a booklet with an easy block?
  4. What is the impact of easier blocks on the standard errors for NAEP’s demographic reporting groups?
  5. What other factors, such as completion rate or block position may also influence standard errors for low-achieving students?

The question of feasibility of acquiring and using state assessment scores is addressed in a companion report (McLaughlin, Scarloss, Stancavage, and Blankenship, 2005), and is answered affirmatively.

Using State Assessments to Impute Achievement of Students Absent from NAEP: An Empirical Study in Four States (2005) (199K PDF) Note: Link takes you off this website.

In preparing estimates for state-level statistics, NAEP has employed different standard procedures in dealing with missing data for absent and excluded students. To adjust estimates of achievement for absent students, a basis is needed for imputing plausible achievement scores for those students. Ideally, one would use scores on a parallel test; in lieu of available test scores, NAEP has used demographic proxies for achievement. Demographic proxies are clearly less accurate than actual achievement scores on a related test, but until now, achievement test scores have not been systematically available for students selected for NAEP.

The purpose of this study is to estimate the extent to which state assessment scores can be used to improve the adjustments of NAEP data to remove the biases due to absences. A simulation study was conducted to explore the potential of state assessment scores to improve adjustments for nonparticipation (McLaughlin, Gallagher, and Stancavage, 2004). That study found that state assessment scores could potentially be more effective than demographic information in removing the bias related to absences. The present study aims to extend that simulation by empirically assessing the potential for using state assessment scores to impute achievement for NAEP absent students in four states. In these four states, state assessment scores were acquired for students selected to participate in the NAEP reading and mathematics assessments in 2003.

Four research questions have guided the course of this study:

  1. How well do state assessment scores cover absent students?
  2. Do state assessment scores follow the patterns of NAEP scores?
  3. How do results of adjustments for absences based on state test data compare to current demographic adjustments for absences?
  4. Is the use of state assessment data for this purpose feasible?

Assigning Adaptive NAEP Booklets Based on State Assessment Scores: A Simulation Study of the Impact on Standard Errors (2004) (148K PDF) Note: Link takes you off this website.

Adaptive testing is the process of testing students with materials whose difficulty is tailored to the students' expected level of performance. This report presents findings from a simulation that was designed to assess the improvements in estimates of NAEP standard errors that might result from using two-stage adaptive testing, in which students' performance on their state tests is used as a basis for assigning NAEP item block difficulty.

Student samples were drawn from a file that linked NAEP and state assessment data for four states. All fourth-grade students with complete data were selected. The students were grouped in two different ways: in one procedure the quartile with the lowest state assessment scores was designated the “low” group and the quartile with the highest scores the “high” group, with the remainder assigned to a “middle” group. In the second procedure the bottom decile was the low group, the top decile the high group, and the rest of the students the middle group. Simulated item responses were then generated for each student.

Varying difficulties were simulated for NAEP item blocks by adding or subtracting a constant to the difficulty parameters. The results indicated that the simulated easier block of items for lower-ability students appeared to have had the desired effect of reducing standard errors for these students. The authors recommend that, if this improvement is found to be sustained even after accounting for conditioning (a technique not within the scope of the study), NAEP should seriously consider the use of adaptive test items for low-performing students. However, simulating the use of more difficult item blocks for high-achieving students actually increased standard errors. The authors point out that the existing standard errors for this group are already as good as or better than those for students in the middle of the distribution, and that adaptive testing may therefore not be necessary. Should adaptive testing be pursued for high-achieving students, the results suggest that parameters for item blocks would need to be more carefully tailored to expected abilities, rather than merely shifting the parameters of existing item blocks as was done for this study.

Federal Sample Sizes for Confirmation of State Tests in the No Child Left Behind Act (2004) (572K PDF) Note: Link takes you off this website.

The concept of gaps in student performance appears in many places throughout the No Child Left Behind (NCLB) Act, especially with respect to gaps in achievement between groups of students considered advantaged and disadvantaged. Because the legislation does not provide a statistical definition of a gap, definition and implementation remain an open question. This report discusses the general concept of gaps, and analyzes the advantages and disadvantages of several different approaches to estimating gaps and reductions in gaps. It asks which statistics and which types of performance measures are most appropriate for measuring gap improvement, and discusses what NAEP state sample sizes might be required for different targeted minority groups if NAEP were to be used to measure gaps.

The authors offer two approaches for gap measurement: the difference between the performance of the advantaged and disadvantaged groups in the current year, and the difference between the disadvantaged group's performance in the current year and that of the advantaged group in the baseline year. While the former is presented as the natural choice, the report notes that this approach has a larger variance due to the contribution of the advantaged group, and therefore requires a larger sample. A drawback of the latter choice is that it can show a decreasing gap when the absolute inequality between the two groups may be constant or increasing. The report points out that both approaches require that gap improvement be defined to occur only if the performance of the advantaged group does not deteriorate. The report also discusses the concept of adequate yearly progress. The authors point out that this does not require comparisons across groups, making it an easier measure to confirm than a gap.
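The difference between the two gap definitions, and the drawback noted for the second, can be seen in a small worked example. This is only a sketch: the function names and all scores are hypothetical, chosen so that both groups gain the same number of points.

```python
def same_year_gap(advantaged_now, disadvantaged_now):
    """Gap between the two groups measured in the same year."""
    return advantaged_now - disadvantaged_now


def baseline_referenced_gap(advantaged_baseline, disadvantaged_now):
    """Gap between the disadvantaged group now and the advantaged group's baseline."""
    return advantaged_baseline - disadvantaged_now


# Hypothetical mean scores (not NAEP data): both groups gain 10 points.
adv_base, dis_base = 250.0, 220.0
adv_now, dis_now = 260.0, 230.0

# First approach: the gap is unchanged because both groups rose equally.
same_year_change = same_year_gap(adv_now, dis_now) - same_year_gap(adv_base, dis_base)

# Second approach: the gap appears to shrink, although the
# absolute inequality between the groups is constant.
baseline_change = (baseline_referenced_gap(adv_base, dis_now)
                   - baseline_referenced_gap(adv_base, dis_base))

print(same_year_change, baseline_change)
```

Here the same-year gap stays at 30 points while the baseline-referenced gap falls from 30 to 20, which illustrates how the second measure can report a closing gap without any change in the distance between the groups.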

The report goes on to discuss the implication of measuring gap with three different measures of performance on NAEP: mean scale score, proportion of students at or above the NAEP basic achievement level, and proportion of students at or above the NAEP proficient achievement level. Mean scale scores require the smallest sample size of the three, and allow simplified computation through variance scores that do not depend on the mean. This is presented as a potentially significant advantage due to an anticipated dramatic change in mean scores over the life of NCLB; sample sizes could be set once and unchanged for the duration. Between the two achievement-level performance measures, the proportion of students at or above the basic level requires a smaller sample size. It also has the advantage over mean scale scores of being more compatible with the average yearly performance statistic, providing a consistent quantitative measure for both gaps and adequate yearly progress.

Evaluation of Bias Correction Methods for “Worst-case” Selective Non-participation in NAEP (2004) (195K PDF) Note: Link takes you off this website.

With the advent of the No Child Left Behind Act, the context for NAEP participation is changing, and with this shifting context comes the possibility of selective non-participation at the top or bottom of the ability distribution for both students and schools, resulting in a bias in statewide mean scores. This report estimates the potential bias caused by worst-case scenarios of selective non-participation, and examines the extent to which statistical methods can correct for that bias.

To simulate bias at the school level, the report's authors truncated one tail of the distribution of a set of school-level mean NAEP scores (both actual scores and predicted scores in separate analyses), discarding between 5 and 25 percent of schools with either the highest or lowest mean scores. To simulate student-level bias, the authors discarded 1 to 10 students in each school with the highest or lowest NAEP or state test scores.
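The school-level truncation step can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' procedure: the school means are synthetic draws from a normal distribution, sampling weights are ignored, and the function name and parameters are hypothetical.

```python
import random
import statistics


def truncation_bias(school_means, fraction, tail="low"):
    """Bias in the overall mean after discarding one tail of school means.

    fraction: share of schools to discard (for example 0.10).
    tail: "low" drops the lowest-scoring schools, "high" the highest.
    Returns (mean of remaining schools) - (mean of all schools).
    """
    ordered = sorted(school_means)
    k = int(len(ordered) * fraction)  # number of schools discarded
    if tail == "low":
        kept = ordered[k:]
    else:
        kept = ordered[:len(ordered) - k]
    return statistics.mean(kept) - statistics.mean(ordered)


# Synthetic school-level means (hypothetical, not NAEP data).
rng = random.Random(0)
school_means = [rng.gauss(220.0, 15.0) for _ in range(200)]

bias_low = truncation_bias(school_means, 0.10, tail="low")
bias_high = truncation_bias(school_means, 0.10, tail="high")
print(round(bias_low, 2), round(bias_high, 2))
```

Dropping the lowest-scoring schools pushes the estimated mean up and dropping the highest pushes it down; the size of the shift depends entirely on the assumed spread of school means, so the synthetic values here say nothing about the magnitudes reported in the study.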

At the school level, truncation of the lowest 10 percent of schools in the grade 4 sample (based on NAEP scores) yielded an average bias of 2.9 points, ± 0.6 points, while at the student level, truncation of the 2 lowest students in each school yielded an average bias of 5.1 points, ±1 point. Similar results were observed at grade 8.

To test the effectiveness of using statistical methods to correct for this bias, the authors used regression based on demographic predictors, state assessment scores, or both, to assign means to the removed, “non-participating” schools and to assign scores to non-participating students. In addition, three different imputation methods were evaluated: forward linear regression, linear equating (in which the imputation is based on standardized values of the predictors and regression estimate), and reverse regression (where the predictor is regressed on available NAEP scores, and the regression coefficients used to impute the missing NAEP scores). After each imputation, the state population mean was recalculated using the new imputed data, and this mean was compared with the statewide mean based on the full NAEP sample.

The tested corrections for non-participation bias were partially effective, eliminating roughly half of the bias when the linear equating method was used. The report notes that the other regression models can improve the corrections, but their accuracy is dependent on knowledge about the mechanisms of non-participation, which is not likely to be available in practice.

The Effects of Finite Sampling on State Assessment Sample Requirements (2003) (341K PDF)

This study addresses statistical techniques that might ameliorate some of the sampling problems currently facing states with small populations participating in State NAEP. The author explores how the application of finite population correction factors to the between-school component of variance could be used to modify sample sizes required of states that currently qualify for the exemptions from State NAEP's minimum sample requirements. He also examines how to preserve the infinite population assumptions for hypothesis testing related to comparisons between domain means. Results lend support to alternate sample size specifications both in states with few schools and in states with many small schools. The author notes that permitting states to use design options other than the current State NAEP requirement could reduce costs related to test administration, scoring, and data processing.
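The effect of a finite population correction on the between-school variance component can be sketched with the textbook factor (1 - n/N). This is an assumption for illustration only: the abstract does not give the author's formulas, the within-school component is left out, and the function names and numbers are hypothetical.

```python
import math


def between_school_variance(s2_between, n_sampled, n_total, use_fpc=True):
    """Between-school contribution to the variance of a state mean.

    s2_between: variance of school means (hypothetical input).
    n_sampled, n_total: schools sampled and schools in the state.
    """
    if not 0 < n_sampled <= n_total:
        raise ValueError("need 0 < n_sampled <= n_total")
    fpc = 1.0 - n_sampled / n_total if use_fpc else 1.0
    return fpc * s2_between / n_sampled


def schools_needed(s2_between, target_variance, n_total=None):
    """Smallest number of schools meeting a target between-school variance.

    With n_total given, solves (1 - n/N) * s2 / n <= target for n;
    without it, uses the infinite-population formula s2 / n <= target.
    """
    if n_total is None:
        return math.ceil(s2_between / target_variance)
    return math.ceil(s2_between / (target_variance + s2_between / n_total))


# Hypothetical small state: 80 schools, variance of school means 400.
var_infinite = between_school_variance(400.0, 50, 80, use_fpc=False)
var_finite = between_school_variance(400.0, 50, 80, use_fpc=True)
n_infinite = schools_needed(400.0, target_variance=8.0)
n_finite = schools_needed(400.0, target_variance=8.0, n_total=80)
print(var_infinite, var_finite, n_infinite, n_finite)
```

Under these made-up inputs, the correction lowers the between-school variance for a fixed sample and reduces the number of schools needed to reach the same target, which is the kind of relief for small states that the abstract describes.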

Computer Use and Its Relation to Academic Achievement in Mathematics, Reading, and Writing (2003) (514K PDF)

In this study, the authors, using evidence obtained from the 1996 NAEP assessment in mathematics and the 1998 NAEP main assessments in reading and writing, examine patterns of computer use and achievement in each of these three academic domains. The authors conclude that the design of the NAEP data collection precludes using such data to make even tentative conclusions about the relationship of achievement and computer use. They recommend further study, including a multi-site experiment to determine how teachers and students are using computers and the impact of computers on achievement.

A Study of Equating in NAEP (2003) (955K PDF)

The authors detail a computer-simulation study they conducted to investigate the amount of uncertainty added to NAEP estimates by equating error under three different equating methods and while varying a number of factors that might affect accuracy of equating. Data from past NAEP administrations were used to guide the simulations, and error due to equating was estimated empirically. It is the authors' conclusion that the merits of less biased measurements may outweigh the problems caused by slight adjustments to previously reported scores. They recommend that long-term trend lines be periodically reanalyzed using methods such as multiple-group IRT that can minimize such biases.

An Investigation of Why Students Do Not Respond to Questions (2003) (569K PDF)

Developers of NAEP have substantially changed the mix of item types on assessments, decreasing the numbers of multiple-choice questions and increasing the numbers of short and extended constructed-response questions. At the same time, researchers have noted unacceptably high student nonresponse rates. These rates seem to vary with student characteristics like gender and race, and they potentially confound NAEP reports, analyses, and subsequent conclusions. This small-scale, exploratory study, which is qualitative in nature, offers insights for designing future studies. Future studies might include quantitative analysis of existing NAEP data sets to determine whether observed patterns of association between omissions and student or item characteristics hold up over larger numbers of students and items than were included in this study.

Optimizing State NAEP: Issues and Possible Improvements (2003) (253K PDF)

The paper addresses 3 key topics related to making state NAEP more efficient: reducing the burden for the states, stabilizing the assessment schedule, and facilitating and promoting the use of state NAEP data. The author recommends promoting the use of state NAEP data for the continued success of the NAEP program. She suggests that this could involve devoting greater attention to how best to link state assessment and NAEP results, developing more timely and user-friendly reports and working with states and other organizations to more effectively address the data needs of different NAEP audiences. She also proposes expending proportionately less of the state NAEP resources on data collection and more on disseminating information about the many uses of the program.

Improving the Information Value of Performance Items in Large Scale Assessments (2003) (290K PDF)

The authors first provide a summary overview of what is already known and what is needed to learn about item types for future NAEP assessments. Essentially, the question addressed here is: Do constructed-response items provide more information about what students are capable of doing than what multiple-choice items alone provide, and if so, what types of skills are tapped by the constructed-response items that are not measured by multiple-choice items? A fresh examination of the relationships between multiple-choice and constructed-response items is needed, according to the authors. The authors propose a set of studies that would provide needed information about the value added of performance items in mixed-format assessments such as NAEP.

Reporting the Results of the National Assessment of Educational Progress (2003) (461K PDF)

This paper explores ways results of NAEP data collections might be communicated to a variety of audiences, each with differing needs for information, interests in its findings, and sophistication in interpreting its results. The author describes market-basket reporting as a feasible alternative to traditional NAEP reporting. These reports would include samples of items and exercises used in an assessment together with their scoring rubrics which would give a clearer picture of the kinds of skills assessed by NAEP, as well as an indication of skills not assessed. In the second section of the paper, the author cautions that in order to uphold strict standards of data quality, NAEP reports must format and display results to make them more accessible while also discouraging readers from drawing overly broad interpretations of the data. A final section describes a detailed program of research on reporting and dissemination of NAEP findings based on these three dimensions: the research questions to be asked; the audiences to whom the questions should be addressed; and the strategies through which the questions should be pursued as well as the intersection of these dimensions. The author suggests that the highest priority be given to research on reporting through public media; followed by making NAEP reporting more understandable and useful to school curriculum and instruction personnel, reporting to the public, and further research with state education personnel.


Implications of Electronic Technology for the NAEP Assessment (2003) (381K PDF)

This report emphasizes the need for NAEP to integrate the use of technology into its assessment procedures, reviews major options, and suggests priorities to guide the integration. The author identifies three short-term goals for this development: implementing a linear computer-administered assessment in a target subject area such as mathematics or science; developing and implementing a computer-administered writing assessment; and continuing the introduction and evaluation of technology-based test accommodations for handicapped students and English-language learners. The author suggests that NAEP consider a redesign as an integrated electronic information system that would involve all aspects of the assessment process, including assessment delivery, scoring and interpretation, development of assessment frameworks, specifications of population and samples, collection of data, and preparation and dissemination of results.


Feasibility Studies of Two-Stage Testing in Large-Scale Educational Assessment: Implications for NAEP (2003) (968K PDF)

This report discusses the rationale for enhancing the current NAEP design by adding a capacity for adaptive testing, in which items are tailored to the achievement level of the student. The authors conclude that implementation of adaptive testing procedures, two-stage testing in particular, has the potential to increase the usability and validity of NAEP results. Adaptive testing would permit adequately reliable scores to be reported to individual students and their parents, increasing their personal stake in performing well. Improvement in data quality would also speed data processing and permit delivery of assessment results in a timely manner.


An Agenda for NAEP Validity Research (2003) (598K PDF)

This report resulted from the systematic analysis undertaken by the NAEP Validity Studies Panel to consider the domain of validity threats to NAEP and to identify the most urgent validity research priorities. A framework of six broad categories was devised: 1) the constructs measured within each of NAEP's subject domains; 2) the manner in which these constructs are measured; 3) the representation of the population; 4) the analysis of data; 5) the reporting and use of NAEP results; and 6) the assessment of trends. Panel subcommittees prepared papers laying out the critical validity issues in each area, and these papers are presented in chapters 2 through 7 of this report. The panel reviewed the papers in each area and set priorities for each area of validity research by a consensus process. Sixteen suggested studies or areas of study were rated by the full panel. Four studies stood out as essential, and two others were rated between highly important and essential. The panel indicated unanimously that studies are essential to evaluate the validity aspects of NAEP's new role under the No Child Left Behind legislation, however that role is operationalized.


The Validity of Oral Accommodation in Testing (2003) (456K PDF)

This study examines the impact of oral presentation of a mathematics test on the performance of disabled and non-disabled students. It is an example of empirical research providing evidence for evaluating the validity and fairness of accommodation use. Both learning-disabled and non-disabled students improved their performance under the accommodated conditions, although learning-disabled students had greater gains. The presence of an effect for the regular classroom students suggests the possibility that irrelevant variance in the non-accommodated scores is overcome by the use of the accommodation for both groups of students.



Last updated 05 October 2009 (MH)
1990 K Street, NW
Washington, DC 20006, USA
Phone: (202) 502-7300