
Mark Schneider
Commissioner, National Center for Education Statistics

Response to the "Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8"
November 23, 2007


Nowhere is NCES's commitment to excellence more important than in NAEP, the "Nation's Report Card."

As part of the continued pursuit of excellence, Mark Schneider, the Commissioner of the National Center for Education Statistics (NCES), asked the NAEP Validity Studies (NVS) Panel to undertake a study to examine the quality of the NAEP Mathematics Assessments at grades 4 and 8.

Specifically, NCES asked the NVS Panel to address the following questions:

  1. Does the NAEP framework offer reasonable content and skill-based coverage compared to the assessments of states and other nations?
  2. Do the NAEP item pool and assessment design accurately reflect the NAEP framework?
  3. Is NAEP mathematically accurate and not unduly oriented to a particular curriculum, philosophy, or pedagogy?
  4. Does NAEP properly consider the spread of abilities in the assessable population?
  5. Does NAEP provide information that is representative of all students, including students who are unable to demonstrate their achievements on the standard assessment?

The Panel's central finding is that "the NAEP mathematics assessment is sufficiently robust to support the main conclusions that have been drawn about the U.S. and state progress in mathematics since 1990."1 This confirms that the program's most important goal, to accurately report trends in what students know and can do, is being met.

Further, the findings support actions NCES has planned and initiated to ensure NAEP's quality. For example, NCES had already begun to consider ways in which NAEP could be strengthened along several of the lines the Panel identifies.

In this response to the NVS report, we will concentrate on what NCES is doing to further improve NAEP. We end this report with a detailed response to what we found to be the most troubling of the NVS Panel's findings: the high percentage of items judged "flawed" or "marginal." This negative evaluation of items goes to the core of the NAEP assessment. In response, NCES and our contractors looked more deeply into the NVS Panel's analysis. We have learned much from this exercise and will incorporate these lessons into future plans. However, we disagree with the severity of the terms used in the NVS study.

With that caveat in mind, the panel's analysis suggests ways in which the NAEP program can and will be strengthened. This document details how NCES will respond to the Panel's recommendations.

Considering the Recommendations

The NVS recommendations fall into two categories. The first category, which includes recommendations 1 through 6, offers numerous ways to improve the overall quality of the assessment as it currently stands. The second category, which includes recommendations 7, 8, and 9, suggests not only the need but also a method for redesigning the psychometric properties of NAEP so that it better assesses the full range of student performance. Each recommendation will be discussed in turn.

Improving the Overall Quality of the Assessment

The majority of recommendations put forth by the NVS are aimed at improving the overall quality of the assessment. On the surface, many of the features of the current NAEP item development process are designed to accomplish the specific goals outlined in these recommendations. Given the concerns raised by the NVS study, NCES is reevaluating the extent to which processes in place are accomplishing the intended goals.

Recommendation 1. Sharpen the framework

As this is primarily a Governing Board function, NCES will not speak to this issue.

Recommendation 2. Provide detailed implementation plans

This recommendation recognizes that the framework and specifications documents, as prepared by the Board, are public documents. Further, it points out that greater specification should be provided to the NAEP contractors.

2. A. Translate the higher-level guidance provided by the framework into detailed implementation plans

NCES is proceeding both by developing implementation plans and by specifying aspects of the framework through clarification and interpretation documents that are developed in conjunction with members of standing committees and other contractor staff.

NCES implementation plans. Beginning with the 2009 frameworks in reading, mathematics and science, NCES began to formalize implementation plans as a first response to any new framework. Implementation plans begin with an evaluation of the new framework in terms of what can be done within the constraints of the current NAEP design, what needs further work to develop an appropriate way of introducing a desired innovation into NAEP, and a statement of what cannot be delivered at the current time.

The concept of an implementation plan has played out most strongly in reading and science. For example, both these frameworks included new item types. The reading framework called for the development of a meaning vocabulary scale and the science framework called for the systematic introduction of sets of items that measured learning progressions. Both pushed what the assessment should be measuring into territory that did not include well-established models of items. NCES convened working groups to better specify the goals of the item types and to develop exemplars.

Both examples provide evidence that not all things in a framework can be immediately implemented, leading NCES to conduct further development work. Sometimes, as in the case of meaning vocabulary, the innovation may be close enough to current practice that NCES can immediately take a role in the further definition of the innovation. But in other instances, like learning progressions that are not consistent with or a direct extension of current practice, it is more important that NCES conduct this work separate from the ongoing work of item development.

In addition, not all ideas that are part of the framework can ultimately be part of a large-scale assessment, so NCES needs to evaluate the impact of these innovations on the overall validity of the assessment. The implementation plans point to how NCES has begun to systematically move forward in finding ways to develop and incorporate new ideas and advances into the NAEP program without sacrificing the integrity of the existing instrument.

Clarification and Interpretation Documents. The NVS report points out the need to sharpen the language of objectives.

"Objectives are targets, not containers. Don't worry about what they include, worry about what they say about where the test should be aimed. Containers get vaguer and vaguer as they mean more and more. Targets get sharper and sharper as they define the most important aspect of a topic."2

The 2009 science framework illustrates this point. In that document, content statements (the science parallel to objectives) describe the topic to be included on the assessment in one place, while boundary statements in another section describe which part of that topic is to be tested. NCES was interested in reaching a clear, shared understanding with both our contractors and members of the standing committee of what should be tested and how. We therefore created internal working documents that placed related directions next to one another. These documents were used during item reviews so that proposed items could be vetted against the perceived intent of the framework, and at the same time they made it possible to specify more clearly what should and would be measured.

The clarification and interpretations documents have a continuing item development function. They serve as a more detailed statement of what item developers should be measuring and at the same time they serve as a record of the decisions that NCES and the standing committee have taken regarding item development.

This process was implemented with the 2009 12th grade mathematics framework and specifications. This work is ongoing.

2. B. Make priorities explicit

As a direct consequence of the NVS report, NCES sought ways to provide the item development contractor with the relative priorities of different assessment topics at each hierarchic level of the 2009 12th grade mathematics framework. NCES devoted part of a standing committee meeting related to the 12th grade 2009 assessment to setting priorities by reaching consensus about the relative importance of each level of the hierarchy and then allocating the number of items to each objective according to that scheme.

We chose to focus our initial efforts in this way on just the 2009 12th grade mathematics assessment because this represented the place where most new items were to be developed. Under prior agreements between NCES and the Governing Board, the item development contractor was only allowed to carry forward about 20% of the item pool from previous assessments, assuming that those items met the specifications of the new framework.

In September, we will extend this process to the 4th and 8th grades so that, given the constraints of maintaining trend, we will select items that meet the newly set priorities.

Recommendation 3. Define a larger role for exemplar items

NCES supports the goal of advancing the practice and technology of using exemplar items to communicate expectations. The use of exemplars would help clarify the framework's intent. Further, where the intent of the item differs from standard practice, an example would go a long way in making the desired change more explicit.

3. A. Provide ample examples of items that illustrate focus and reach at each hierarchic level of the framework

This goal can be achieved when the objective fits with well-established mathematics practice. However, there are times when the framework introduces a new conception of how assessment items should be developed. For example, at the time the 2005 mathematics framework was developed (between late 2000 and early 2002) the concept of levels of complexity was not yet fully specified. While a few leading researchers were clarifying this important notion, most test developers, curriculum developers, and math educators were still struggling with how to differentiate levels of complexity from item difficulty. There were very few exemplar items that could be used to explain this conceptualization.

To meet this challenge, NCES put together a working group that included members of the Governing Board's framework development team, NCES, NESSI, and item developers from ETS. Collaboratively they refined the definition of the levels of complexity and created a set of exemplar items at each grade level that captured the nuanced intent of the framework. This exercise took place during the six-month development period for pilot items that would ultimately be included in the 2005 assessment. This work was further refined so that newer items included in the 2007 assessment more clearly captured the notions of levels of complexity. And, as the framework was revised for the 2009 assessment, the collaboration continued with still further refinements to both the definition of levels of complexity and the exemplars that illustrated what was sought.

3. B. Encourage the establishment of a Web-based open bank of released items.

In 2000, NCES launched the NAEP question tool that addresses some of the intent of this recommendation. The NAEP question tool is an online item bank that is publicly accessible and includes all released NAEP items. The item, its classification, the associated scoring guide, achievement data and exemplar student responses are available. What is missing is the recommended associated discussion of what would or would not make each item an exemplar of the framework objective. NCES will look into the possibility of expanding this function.

Recommendation 4. Improve quality assurance for the overall item pool and for individual items

Recommendation 5. Attend particularly to the following aspects of item quality

Recommendations 4 and 5 both explicitly address quality. Extensive review procedures that, in principle, follow the intent of the NVS recommendation have been in place since the inception of NAEP. The report suggests, and NCES concurs, that NCES must be more actively engaged in this part of item development. A prime consideration in implementing quality control and assurance measures rests with who has final sign-off. Therefore, NCES will more fully assume the role of final sign-off. In this vein, the following actions have either already been taken or will be part of the next five-year development cycle.

4. A. Monitor and manage the focus, balance, and reach of the item pool across and within the subtopic level of the framework

There are three ways that NCES will manage the focus, balance, and reach of the item pool: setting priorities across objectives; expanding the setting of priorities to levels of complexity and types of numbers used across the item pool; and subjecting the overall operational item pool to a systematic review prior to administration.

As was previously described, NCES will conduct priority-setting exercises prior to the beginning of item development for a new framework. For the 12th grade 2009 mathematics assessment, this activity was conducted prior to selecting items for pilot testing in 2008. For the 4th and 8th grades, where the current framework and item pool date from 2005, we will conduct this activity between the pilot study and the selection of items for the operational assessment.

To date this exercise has only focused on the coverage of the objectives. In consultation with the Governing Board, we will establish procedures for setting priorities for levels of complexity and types of numbers.

As part of the new contracts the standing committee will exercise an "executive review" of each operational item pool immediately after review of pilot item statistics and will guide the selection of the final item pool.

4. B. Subject all items to expert review

NAEP standing committees are selected so that the differing perspectives listed in the NVS recommendation are represented. However, the process has not always led to the recommended outcome. Consequently, interim changes in the way these committees function have been implemented and additional changes are being considered.

During the last three years, NCES has taken a stronger role in managing and facilitating the standing committees. We set the agenda, provide framework training, and lead discussions on cross-cutting issues. Topics have ranged from broad issues, such as coverage of the framework objectives and rotation of items across these objectives, to micro-level questions, such as whether items should be considered correct when appropriate units are not included in the student's constructed response.

What is still missing from this process is a good way to manage the changes made to items as a result of standing committee reviews. To address this issue, NCES will apply a system we currently use in state item reviews. During state item reviews, NCES sets the agenda, chairs the meeting, provides training on the framework, and models what is expected of reviewers. The state item reviewers, who are state NAEP coordinators, state curriculum and testing directors, and teachers, then work in small groups led by ETS item developers to discuss each item. All comments are compiled and reviewed by NCES staff, who then provide ETS with direction as to which changes should be made.

While the existing standing committees focus primarily on the content of the items, NCES has also introduced reviews by language experts and accessibility specialists into the process. One of NESSI's subcontractors, Second Language Testing, Inc., specializes in these issues. All items are subjected to their review as part of the internal NESSI quality control process. NCES is considering expanding this function through an ad hoc committee of experts in this field who would review items across all NAEP assessments and provide guidelines and exemplars to be used in initial item writing.

5. A. Sustain attention to the mathematical quality of the items

The most prominent finding of the NVS validity study related to the mathematical quality of the items. (We discuss this finding on page 11 below.) Although the current standing committee has always included at least one mathematician, and there are many mathematicians available at ETS, we may not have achieved the needed representation of mathematicians during item development.

What set the NVS review apart was that it was carried out by a panel of five mathematicians who were specifically selected not only because of their known expertise as mathematicians but also because they espoused differing philosophical stances toward mathematics. During the review of the 2005 and 2007 mathematics items, they came to consensus on which items were in need of fixing and then as a group articulated how to make these items more mathematically precise.

As a result of the NVS experience, NCES has chosen to emulate this process on a regular basis. We convened the same panel to review the new 4th and 8th grade items that will be part of the 2009 operational pool. All items that raised concern were discussed. Within a very short period of time, ways to resolve problems in the items were addressed. It was the overwhelming consensus of this group that this newer set of items was significantly better than the pool of items they had reviewed as part of the validity study. None of the new items were considered flawed and those few that were problematic were easily fixed.

At present, NCES is considering when to best use the input of this type of specialized expert panel in the item development process. In addition, we are developing a method for categorizing their comments so that the lessons learned from individual items may be generalized to larger sets of items.

5. B. Improve the quality of the situated mathematics problems

5. C. Improve the measurement of mathematical complexity

These two specific areas of item quality will figure more prominently in our item reviews. As discussed earlier in this response, NCES has taken a leading role in more clearly defining levels of complexity within the NAEP framework and we will continue to do so by following the suggested approach of compiling items from other countries and international assessments. We will, on an ad hoc basis, invite experts from these countries to consult with NCES and our item developers in the near future.

5. D. Minimize non-construct-relevant sources of item difficulty

As the NVS points out,

"Item difficulty is a combination of many factors. In addition to mathematical skills demands, item difficulty is a function of demands on auxiliary skills necessary for demonstrating competency in the domain and demands that are merely contaminating. Contaminating skill demands should be avoided entirely, and auxiliary skill demands should be managed so that they do not outweigh the mathematical skill demands of the items."3

Addressing this recommendation requires a multi-pronged approach. At the most basic level, we must clearly differentiate among the mathematical, auxiliary, and contaminating skill demands that together contribute to item difficulty. In addition, there must be empirical evidence supporting how we address each issue.

Under the current item development contract, and to be continued in the new contract, ETS was asked to conduct studies to identify item attributes related to construct-relevant skills that drive item difficulty. Building on earlier research by Kirsch and Mosenthal for the Adult Literacy studies, ETS expanded the list of potential attributes and developed items that systematically manipulate these attributes in select items.

As will be described in the section on creating easy blocks, NCES is systematically seeking ways to minimize the demands of some of the necessary auxiliary skills by lessening the reading load and, where appropriate, simplifying language. Further efforts are still needed in this area.

NCES is working to minimize contaminating skills demands by creating a NAEP style manual that will specifically address these issues. The style guide would include tenets of universal design and provide guidance on how to implement appropriate universal design features into NAEP. NCES is seeking input from a variety of experts to carry out this activity.

One of the first areas open to style improvement is the physical layout and design of the assessment booklet itself. For example, removing such features as accession numbers or scanning cues used in digital scoring would free up space on the page and make the assessment more accessible to students with disabilities. Developing a consistent style for graphics presentation would also help. According to research in this area, standardizing typeface and font size would also help eliminate construct-irrelevant sources of difficulty. NCES is planning a thorough program of research to study the impact of these types of changes in terms of student-by-item interactions. Assuming the findings support making these changes, implementation would have to be worked incrementally into the NAEP assessment to minimize unintended logistical consequences.

Recommendation 6. Undertake a program of evidence-based research on item design

NCES has pursued empirically based studies related to item design. This work has included the study of item attributes described above, as well as work on the impact of line numbering in reading assessments and the creation of new item types. This work was designed to address immediate concerns as they arose. As resources allow, NCES plans to pursue a more systematic approach to this type of research. NCES will look to both our contractors and to the NVS Panel for appropriate study designs.

Redesigning the Psychometric Properties of the Assessment

Recommendations 7, 8, and 9 in combination relate to the psychometric properties of the assessment. Together they would move the NAEP psychometric design closer to the "Ideal NAEP assessment" that is described on pages 122 to 123 of the report. The NVS view of the ideal NAEP assessment would include the full range of objectives outlined in the framework, as well as additional items that represent the more basic skills that underpin the specified framework objectives.

Recommendation 7. Expand the range of item difficulty and curricular reach

Recommendation 8. Manage changes in the item pool

Recommendation 9. Move NAEP in the direction of adaptive testing

The current configuration of the assessment items provides the best measure of the majority of the population. And, if the purpose of NAEP were simply to track average performance over time, this would still be the ideal distribution of item difficulties. No Child Left Behind has, however, heightened the need to better measure and describe the performance of students at both the bottom and top ends of the distribution. As the NVS study points out, the most effective way to do this is in fact to create an adaptive test instrument. An adaptive test would match items to the appropriate student level of performance so that each participating student could fully demonstrate the level of performance he or she has attained.
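The matching principle behind adaptive testing can be sketched in a few lines. The Rasch (1PL) model, the ability values, and the item difficulties below are illustrative assumptions, not features of any actual NAEP design:

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) probability that a student of ability theta answers
    an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item: p * (1 - p), largest when
    item difficulty matches student ability."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def next_item(theta_estimate, pool):
    """Adaptive selection: pick the item whose difficulty is most
    informative at the current ability estimate."""
    return max(pool, key=lambda b: item_information(theta_estimate, b))

# Hypothetical item difficulties spanning the ability range.
pool = [-2.0, -1.0, 0.0, 1.0, 2.0]

# A low-performing student is routed to an easy item,
# a high-performing student to a hard one.
print(next_item(-1.4, pool))  # → -1.0
print(next_item(1.4, pool))   # → 1.0
```

The design choice this sketch illustrates is exactly the one the NVS report emphasizes: item difficulty is chosen relative to the student, not fixed in advance for the whole population.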

Creating an Adaptive Test. NCES has been considering this approach for a number of years. However, adaptive testing requires a major redesign of the delivery system, which NAEP is incrementally implementing. Starting in 2009, the science assessment will include interactive computer-based tasks, followed in 2011 by a shift to computer-based delivery of the writing assessment. Under consideration is how to create an infrastructure that can support such a design.

NCES is interested in pursuing adaptive testing, but it is a long-range option. In the interim, two options are being actively pursued to expand the range of student abilities that can be assessed: developing "easy blocks" and changing the distribution of item difficulty so that there are roughly equal numbers of items across the entire distribution.

Creating Easy Blocks: The "easy block" solution is being studied by the NVS. At present easy blocks have been prepared for 4th grade reading to be administered in a pilot study during 2008.

Changing the Distribution of Item Difficulty Across the Assessment: The NVS report discusses a conceptualization of the ideal NAEP assessment that "gives good estimates of what the lowest performing students can do, even though doing so may require item content less advanced than the framework. It also gives good achievement estimates for the highest performing students and includes the most advanced content in the framework."4

The 2009 mathematics framework marks a major revision of the 12th grade objectives. Included in this framework are a number of objectives (designated by an asterisk) that go beyond algebra and move into pre-calculus. This extends the range upward and, in turn, increases the number of objectives included in the framework. This change leads to a major challenge: how to cover the full range of objectives without making the assessment too difficult for most students.

In responding to this challenge NCES has chosen to pursue two approaches. For the 2009 assessment we have added two blocks with an additional 30 to 36 items to accommodate the greater range of objectives.

In addition, we will increase the number of items at the tails of the distribution, thereby lowering the associated measurement error at these levels. This should make it possible to better describe performance at each achievement level.
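The reasoning here can be illustrated with a simple measurement model: in item response theory, the standard error of an ability estimate falls as test information rises, and an item contributes the most information near its own difficulty. The Rasch model and the hypothetical item pools below are assumptions for illustration only:

```python
import math

def rasch_info(theta, b):
    """Fisher information contributed at ability theta by one Rasch
    item of difficulty b."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def standard_error(theta, difficulties):
    """Measurement error at theta: 1 / sqrt(total test information)."""
    info = sum(rasch_info(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(info)

# Hypothetical pool with all items clustered at middle difficulty.
middle_pool = [0.0] * 10

# The same pool with extra items added at the tails.
augmented_pool = middle_pool + [-2.0] * 5 + [2.0] * 5

theta_low = -2.0  # a low-performing student
print(round(standard_error(theta_low, middle_pool), 2))     # → 0.98
print(round(standard_error(theta_low, augmented_pool), 2))  # → 0.65
```

Adding items whose difficulty sits near the tails of the ability distribution lowers the error of measurement for students at those levels, which is the effect the passage above describes.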

Are Too Many NAEP Items Flawed or Marginal?

As noted in the opening section, the NVS panel found a high percentage of items either flawed or of marginal quality. Since an assessment cannot be better than its items, we took this critique to heart.

In response, NCES examined the panelists' comments, item by item, to determine the severity of the errors in content and/or construct irrelevancy. In many cases, minor edits were recommended and we found the reviewers' comments instructive. However, we differ with regard to the way the panelists' ratings were aggregated and believe the method of aggregation produced too many negative evaluations of items.

To reach an aggregate rating on each item, the report defines a "disagreement" among the raters as any instance in which their ratings varied by more than 1 point. Given that there were only 3 possible score points, this rule tends to overestimate agreement compared to the way other rating scales conceive of inter-rater reliability.
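The effect of this rule can be seen in a small sketch using hypothetical ratings on the 3-point scale (1 = flawed, 2 = marginal, 3 = adequate). Under the report's definition, only ratings spanning the full scale register as disagreement:

```python
from itertools import combinations

def report_agreement(ratings):
    """The report's rule: raters 'agree' unless some pair of ratings
    differs by more than 1 point."""
    return all(abs(a - b) <= 1 for a, b in combinations(ratings, 2))

def exact_agreement(ratings):
    """A stricter convention: raters agree only when all ratings match."""
    return len(set(ratings)) == 1

# A 2/3 split counts as agreement under the report's rule,
# but not under exact agreement.
split = [2, 3, 3, 2, 3]
print(report_agreement(split))   # → True
print(exact_agreement(split))    # → False

# Only ratings spanning the full 1-to-3 scale count as disagreement.
full_span = [1, 2, 3, 3, 3]
print(report_agreement(full_span))  # → False
```

On a 3-point scale, almost any mix of adjacent ratings passes the report's test, which is why the rule is more lenient than conventional inter-rater reliability measures.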

Taking another perspective, the agreement appears less robust in certain rating categories. Consider the ratings of the 4th grade items. Of 61 fourth grade items defined as "marginal," there were only six cases where all of the judges agreed with this rating. In 54 of those 61 cases, at least one of the judges rated the item as adequate. In contrast, all the reviewers were in agreement for 105 of the 143 items judged "adequate". In other words, there was substantial agreement about when an item was "adequate" but far less agreement about problems.5

NCES examined the average amount of statistical information provided by the items NVS placed into different categories. We found that the items rated as "marginal" performed in most subscales as well as or better than items rated as "adequate." In the few cases where there were enough items rated as "flawed" to create the information curves, these items appeared, on average, to perform better than the items identified as "adequate."6

By systematically studying the average biserial7 of the items rated in different categories by the NVS, we determined that there were only minimal differences among the means. While the differences are quite small, in three out of the four possible comparisons with the "adequate" category, the "marginal" and "flawed" categories have numerically higher discrimination indices.

Table 1: Mean Biserial of Items by NVS Rating Panel

  Grade   Rating Category   Number of Items   Mean r-biserial
  4       Adequate                143              0.58
          Marginal                 61              0.61
          Flawed                   11              0.59
  8       Adequate                163              0.60
          Marginal                 52              0.59
          Flawed                    9              0.63
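As a hypothetical illustration of the kind of statistic reported in Table 1, the point-biserial (a close relative of the biserial described in footnote 7) can be computed directly from item and total scores. The data below are invented for illustration:

```python
import statistics

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a dichotomous item (0/1)
    and total test score: (M1 - M0) / SD * sqrt(p * q)."""
    ones = [t for x, t in zip(item_scores, total_scores) if x == 1]
    zeros = [t for x, t in zip(item_scores, total_scores) if x == 0]
    p = len(ones) / len(item_scores)   # proportion answering correctly
    q = 1.0 - p
    sd = statistics.pstdev(total_scores)
    return (statistics.mean(ones) - statistics.mean(zeros)) / sd * (p * q) ** 0.5

# Hypothetical data: students with higher total scores tend to answer
# the item correctly, so the correlation is positive.
item = [0, 0, 1, 0, 1, 1, 1, 1]
total = [10, 12, 15, 14, 18, 20, 22, 25]
print(round(point_biserial(item, total), 2))  # → 0.8
```

An item with a correlation this high would be discriminating well: the students who answered it correctly are, on average, the stronger performers on the assessment as a whole.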

Although these statistical analyses do not prove that the items could not be improved in the ways the NVS panel recommends, they do suggest that the concerns stated in the report about the impact of "flawed" and "marginal" items on performance are not borne out by the data.

Concluding Remarks

NCES appreciates the insights into the item development process that the NAEP Validity Study has provided. It comes at a time when NCES is entering into new contracts for the next cycle of NAEP assessment and these recommendations will be reflected in new procedures designed to ensure the quality of the Nation's Report Card.

This document was written by Dr. Marilyn R. Brinkley of the Assessment division.


1 Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8. p. ii.
2 Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8. p. 130.
3 Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8. p. 134.
4 Validity Study of the NAEP Mathematics Assessment: Grades 4 and 8. p. 122.
5 The situation at grade 8 is similar. Of the 163 items rated "adequate," the judges were unanimous 122 times. The judges were unanimous for only 7 of the 52 items rated "marginal." In 42 of these 52 cases, at least one judge rated the item as adequate.
6 The information curves are available upon request.
7 A biserial reflects the relationship between performance on a given question and performance on the assessment as a whole. Assessment developers prefer items with higher biserial correlations because students who are better at the subject being measured are more likely to answer these items correctly than are students who perform less ably.