Since its first assessment, NAEP has grown and developed to meet changes in national needs and perspectives; many of these changes are chronicled in The NAEP Primer, Chapter 2: Technical History of NAEP, reproduced in part below. To meet these changes, NAEP has introduced many technical innovations in test design, statistical analysis, psychometrics, and modern computing to ensure the efficiency of its processes and the credibility of its results. Intended to give a NAEP data analyst some basic knowledge about the NAEP assessment, its methods, and its database, the Chapter 2 sections covered here are:
The size and complexity of the database show NAEP’s continuing concern with addressing new national needs while maintaining credible measures of student achievement and progress. As NAEP grew and developed, it received more attention from policymakers and the general public. Many new expansions and improvements were suggested and sometimes demanded. Many technical innovations were required for implementation of new NAEP components. Understanding its development over time helps the reader understand and appreciate the innovative nature of NAEP.
This chapter is not intended to be a full, comprehensive history of NAEP; rather, it focuses on the development of the NAEP technology. It gives some background about the forces that affected NAEP and the technical solutions required to respond to those forces. It briefly mentions the social and political environment in which NAEP existed and points to places where the reader can get more information. The full references for the articles cited below can be found in The NAEP Primer, Appendix B.
Although educational testing was widely used in the United States before 1960, national estimates of student accomplishments were not available. The American College Testing Program (ACT) and the College Entrance Examination Board (CEEB) began annual reporting of their test results, but since these tests were usually taken by a small, select group of college applicants, the average scores were not representative of the student population as a whole.
After the Soviet launch of Sputnik in 1957, during the Eisenhower administration, there was serious concern about whether the nation’s schools were producing a sufficient number of scientists for a world in which the United States’ scientific superiority was being challenged. This concern led to the National Defense Education Act of 1958.
Concerns about the scientific proficiency of U.S. students led to Project Talent, which administered tests to a large national sample of the nation’s secondary school students. The goal, among other things, was to assess the talents of U.S. students. The assessment required three days of student time. At that time, asking a student’s race was considered by many to be improper—and illegal in some states—so the question was not asked, and thus the study was unable to answer questions about minority performance that were essential to the issue of equal educational opportunity. Later follow-ups did manage to identify the race or ethnicity of many students who participated in the assessment.
In 1962, President Kennedy appointed Francis Keppel as head of the Office of Education, which was then part of the Department of Health, Education, and Welfare. As Keppel told the story, he went to Washington and wondered what the duties of the Commissioner of Education were, so he looked up the 1867 law that authorized the Office. He found that the Office was to report annually on the progress of students in the United States. He marveled at the fact that, in nearly a century, the Office had never done so.
In March of 1963, Keppel asked Ralph Tyler, then Director of the Center for Advanced Study in the Behavioral Sciences, for his suggestions for measuring school quality. Tyler replied shortly, and thus the germ of the idea that was to grow into the National Assessment of Educational Progress was sown. In its early days, Professor John Tukey of Princeton was the technical leader of the project. As the project grew, it came under the administrative leadership of the Education Commission of the States (ECS).
The Civil Rights Act of 1964, signed by President Johnson, required a report on the equality of educational opportunity to be delivered on July 1, 1966. This opened the door for another huge national testing effort, the Equality of Educational Opportunity Survey (EEOS). The EEOS aimed at testing a sample of a million students in grades 1, 3, 6, 9, and 12. To meet its mandated deadline, the implementation of EEOS was rushed. Off-the-shelf tests, sometimes modified, were used, so pretesting was not considered necessary. Up to 10 achievement and aptitude tests were administered, depending on the grade level. Questionnaires were developed. Testing, which took place in the fall of 1965, required a full day of student time. The final report became known as the Coleman Report (Equality of Educational Opportunity Study 1966), named after its principal author.
The EEOS served its purpose reasonably well. The sample allowed for regional and national reporting for White and Black students but not for reporting individual states. It supported only national reporting for Hispanic, American Indian, Asian, and Other students. Test scoring was simply the number of correct responses on multiple-choice tests. Principal, teacher, and student questionnaires allowed for investigation of correlates of student achievement.
However, the study left much room for improvement. The final sample size was only about 65 percent of the intended sample size due to the unwillingness or inability of some to participate. Such a response rate would not meet today’s standards for reporting. Measuring sampling error or measurement error was also problematic. Computing standard errors assuming random sampling of students was inappropriate since cluster samples were used, where schools were first sampled, and all students within the selected grades (1, 3, 6, 9, and 12) were tested. The technology available at the time did not take the sample characteristics into account when computing the variance of the estimates. The large sample sizes made even very small differences statistically significant. The jackknife method was considered briefly and judged to be not computationally feasible. In the end, the EEOS report was largely without estimates of appropriate standard errors.
Part of the reason for the EEOS’s large influence on education was the availability of a Public-Use Data Tape (PUDT). The basic data were made available to secondary analysts in a simple way. These tapes were widely distributed and used. Since the participants were promised anonymity, variables identifying specific states, cities, schools, or students were not made available, but an errant analyst could reasonably infer them. At the time there was at least one inappropriate publication of EEOS results.
The 1960s were a formative time for the development of NAEP. There were strong objections to any national testing program on many grounds, especially the potential encroachment of the federal government on states’ rights. Education was not mentioned in the U.S. Constitution and was long considered to be the responsibility of the states. It was believed that a federal testing program would narrow curricula and present unwise state comparisons. Some professional organizations initially refused to cooperate. Attaining acceptance required careful treading.
The 1960s also ushered in the beginning of an era of decline in the average score on the Scholastic Aptitude Test (SAT). After a peak in 1964, the average score declined steadily over the next decade. Although the decline was not widely noted in the late 1960s, it became a major issue in the 1970s. The decline suggested the need for a good indicator of the performance of all students, not just the college-bound. The need for such an indicator greatly influenced the design of NAEP.
The first NAEP data collection was the 1969 trial assessment of the citizenship, science, and writing performance of 17-year-old in-school students in the spring of that year. In the fall, 9- and 13-year-old students as well as out-of-school 17-year-olds were assessed. These assessments introduced important technical innovations in NAEP.
The sampling plan was developed at the Research Triangle Institute (RTI), the subcontractor for NAEP’s sampling and field administration (Chromy, Finkner, and Horvitz 2004). The sampling plan ensured that all students in United States’ schools had a knowable probability of being sampled. Both public and nonpublic schools were included. The sampling was done by national regions, avoiding any possibility of reporting by state.
The samples of students were selected by age, not by grade as in previous studies. Age cohorts were easily compared over different regions of the country and over assessment years, whereas grade cohorts would be affected by state policies on entering the first grade as well as promotion and retention.
The assessment sessions were administered by RTI personnel and limited to less than one hour. Up to 12 randomly selected students within a school would be assigned to an assessment session. Large schools might have more than one session. In their session, students would listen to a tape recording of test instructions and test items. In this way, the effects of their reading skills on other subject areas, such as mathematics, would be minimized. When reading skills were assessed, only the test instructions were read aloud—not the reading passages. Tape recordings also helped standardize other factors such as assessment timing.
Given a limitation of one hour per student, NAEP introduced matrix sampling so that a large pool of items could be administered to the sample without overburdening the students. Essentially, the item pool was divided up into a number of blocks, each of which was expected to be completed by the student in the allotted time. The use of tape recorders required that all students in an assessment session be assigned the same booklet. Although a single assessment session would produce data on only a small portion of the item pool, the entire pool would be tested in many sessions in each region.
To give its readers an idea of the accuracy of its estimates of student performance, NAEP reports standard errors along with its estimates. As with the EEOS, the usual methods of computing standard errors were not appropriate because both studies sampled random clusters (schools), not random students. To address the realities of cluster sampling, the jackknife method (Quenouille 1956; Tukey 1958) was introduced into large-scale assessments.
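The idea behind the jackknife can be sketched in a few lines: the statistic is recomputed with each sampled cluster (school) deleted in turn, and the spread of those replicate estimates yields the standard error. This is an illustrative simplification; NAEP's actual procedure uses paired replicates and sampling weights.

```python
def jackknife_se(clusters):
    """Delete-one-cluster jackknife standard error of a mean.

    clusters: list of lists, each inner list holding the scores of the
    students sampled from one school.
    """
    n = len(clusters)
    all_scores = [s for c in clusters for s in c]
    full_mean = sum(all_scores) / len(all_scores)

    # Recompute the mean with each cluster (school) deleted in turn.
    replicate_means = []
    for i in range(n):
        kept = [s for j, c in enumerate(clusters) if j != i for s in c]
        replicate_means.append(sum(kept) / len(kept))

    # Variance of the full-sample estimate from the replicate deviations.
    var = (n - 1) / n * sum((m - full_mean) ** 2 for m in replicate_means)
    return full_mean, var ** 0.5
```

Deleting whole schools rather than individual students is what makes the resulting standard error honest about the clustering in the sample.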
Implementation of NAEP was not without some glitches, as with any large-scale project. For example, the first year did not collect data to allow the separate reporting of results for Hispanics, but this was quickly corrected in later assessments. At first, NAEP did not include curriculum specialists and teachers in the test development process, but this was addressed by contracting item development to professional testing organizations. Administration of NAEP to out-of-school 17-year-olds was dropped after the 1979-80 assessment because of its unacceptable cost. Overall, the original NAEP design set a solid foundation for growth and adaptation to meet new demands for more information.
The main problem of the early NAEP assessments was the reports. The germinal idea was to present the results one item at a time, giving the percentage of correct answers by gender, race, parents’ education, and so on. The item-by-item results would be presented with comments while leaving the reader to generalize to an entire subject area such as science. Later, the average percentage of items answered correctly over the entire subject-area item pool was used, but this procedure did not adequately allow for introducing new items or for retiring used ones.
The Wirtz and Lapointe report (1982) reviewed the processes and possibilities of NAEP. While appreciating NAEP’s quality, it appealed for more inclusion of various stakeholders in the development and interpretation of NAEP test items and for better reporting of results. The report did not endorse state-by-state comparisons but did recommend providing more testing services to states and school districts.
During the time that NAEP was developing, there was plenty of action in the field of educational testing. The National Longitudinal Study of the Class of 1972 (NLS:72) tested a national sample of high school students and then followed them through their later careers. This longitudinal model was followed by the High School and Beyond (HS&B:80) study, the National Education Longitudinal Study (NELS:88), and continues today with the High School Longitudinal Study (HSLS:09).
The decline of average scores on the SAT examinations became a critical issue in the 1970s. The news media viewed this decline as an indicator of the deterioration of the U.S. educational system. The College Board appointed Willard Wirtz as chair of a blue-ribbon panel to investigate the phenomenon. Using data from Project Talent and NLS:72, the panel found that the decline was largely due to the steady increase in the number of students taking the SAT examination: the later cohorts included many more students, including those with lower verbal ability. The decline was therefore due to the increase in students attending college, not to a lack of highly able students (Wirtz 1977).
In 1983, the NAEP grant was put out for competitive bidding. ETS and its subcontractor, Westat, presented a bold plan for making NAEP more efficient and useful to educational policymakers and the general public. The plan was later published as A New Design for a New Era (Messick, Beaton, and Lord 1983).
The NAEP grant was awarded for five years and began with the assessment of reading and writing in the 1983–84 academic year. The assessment items and general sampling plan had already been determined by the Education Commission of the States (ECS), the previous contractor. Some of the items were used in past assessments so that progress in student performance could be measured. Most of the RTI sampling plan was kept, but some modifications were made. One major change was age/grade or “grage” sampling, that is, sampling both ages and grades so that results could be reported either way.
Previously, special needs students were not sampled and thus were excluded from the assessment, and no record was kept of their presence in the assessment or in the schools. The 1983 sampling plan included special needs students in the school roster and required that a short questionnaire be filled out to explain the reason for their exclusion.
To ensure that the item pool covered broad areas, the booklets were assembled using a variation of matrix sampling called Balanced Incomplete Block (BIB) spiraling.
Like matrix sampling, BIB spiraling presents each item to a substantial number of students but also ensures that each pairing of items is presented to some students. The result was that the correlation between any pair of items could be computed, albeit with a smaller number of students than responded to a single item.
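The pairing property can be made concrete with a small example. With seven item blocks assembled into booklets of three blocks each, a balanced incomplete block design can place every pair of blocks together in exactly one booklet. The block numbering and booklet map below are illustrative, not the actual NAEP booklet assignments.

```python
from itertools import combinations

# An illustrative BIB design: 7 item blocks, 7 booklets of 3 blocks each,
# arranged so that every pair of blocks shares exactly one booklet.
BOOKLETS = [
    (1, 2, 3), (1, 4, 5), (1, 6, 7),
    (2, 4, 6), (2, 5, 7), (3, 4, 7), (3, 5, 6),
]

def pair_counts(booklets):
    """Count how often each pair of item blocks appears in the same booklet."""
    counts = {pair: 0 for pair in combinations(range(1, 8), 2)}
    for booklet in booklets:
        for pair in combinations(sorted(booklet), 2):
            counts[pair] += 1
    return counts
```

Because every one of the 21 block pairs is co-administered to some students, a correlation can be computed between items in any two blocks, which plain matrix sampling does not allow.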
The major design feature in 1983 was scaling the assessment data using Item Response Theory (IRT). At that time, IRT was used mainly to estimate scores for individual students on tests with many items. IRT was fundamental to summarizing data in a meaningful way. Basically, IRT is an alternative to computing the percent of items answered correctly. Given its assumptions, IRT allowed the placing of results for students given different booklets on a common scale.
IRT also provided the basis for comparing different populations. First, it made possible the linking of data from past assessments with newly collected data to describe trends. Secondly, it allowed vertical scaling, that is, putting students at different grade or age levels on the same scale.
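As a concrete illustration, the three-parameter logistic (3PL) model commonly used for multiple-choice items expresses the probability of a correct response as a function of a student's proficiency and the item's discrimination, difficulty, and guessing parameters. The sketch below is generic; the parameter values in the test are invented.

```python
import math

def p_correct(theta, a, b, c):
    """3PL item response function.

    theta: student proficiency on the IRT scale
    a: item discrimination; b: item difficulty; c: lower asymptote
       (the chance of a correct guess by a very low-proficiency student)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))
```

Once item parameters are estimated, students who took different booklets can be placed on the same proficiency scale, since each response is modeled through the same item functions.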
The reasonableness of the IRT assumption of unidimensionality was a serious concern. The correlations among the items that BIB spiraling allowed were essential for examining this assumption. BIB spiraling made administration by tape recorder impossible, since students sitting next to each other would not be responding to the same items. However, BIB spiraling had another, more subtle advantage. Because the original matrix sampling used before the 1983 assessment administered the same items to all students in a session, fewer schools responded to any particular item. With the introduction of BIB spiraling, item blocks were administered to fewer students in any one school, but the items were administered in more schools. In this way, the sampling error was reduced.
The change from the original NAEP design to the new was also a matter of concern. The intention was to convert the existing data to the new design. To do so required “bridge” studies to establish the relationships between the original data and the new. Several bridge studies were included in the 1984 and 1986 assessments.
The 1983–84 reading assessment provided the first data from the new design to be analyzed. The implementation of the new design was not without problems that required technical innovation. In the 1983–84 assessment, the booklets contained three blocks of items. Reading and writing blocks were placed together in some booklets to make it possible to examine the relationship between reading and writing proficiency. The result was that some students received very few reading items and thus had poorly estimated reading proficiency scores. Worse still, the program at that time did not estimate scores for students who answered all items correctly or did less well than would be expected by chance. Using only the estimated scores would result in biased population estimates.
The technical innovation to address this problem was plausible values. Direct estimation of group parameters was in its infancy and required software that would be difficult for a secondary analyst to use. Furthermore, the software at that time did not accept sampling weights, which were essential with the NAEP sample. In addition, the new design promised that the PUDT would be available for users of SPSS, SAS, and other common statistical systems.
Mislevy had the insight that NAEP should take advantage of the fact that NAEP did not need, and indeed could not report, individual test scores. The available software, BILOG (Mislevy and Bock 1982), was able to estimate a likelihood distribution of possible values for each student. If the variance of the distribution was large, the student was poorly measured; if small, the student was well measured. Randomly selected values from these distributions could be used to estimate the parameters of various NAEP reporting groups. These randomly selected values were called “plausible values” (Mislevy and Sheehan 1987). Five plausible values were selected for each assessed student.
The plausible values are not test scores but, essentially, partial computations on the way to estimating group performances. The plausible values were statistically consistent in the sense that they approached the true population values as the sample size grew larger. The five plausible values made it possible to compute standard errors appropriately. Importantly, they could be analyzed using available statistical software. Thus, plausible values were adopted as the NAEP standard. For more information about plausible values, see Mislevy et al. (1992).
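Analysis with plausible values follows a multiple-imputation recipe: compute the statistic once with each plausible value, average the results, and add the between-value variance (reflecting measurement error) to the sampling variance. A minimal sketch follows; the numbers in the test are invented.

```python
def combine_pv_estimates(estimates, sampling_vars):
    """Combine a statistic computed once per plausible value.

    estimates: the statistic (e.g., a group mean) computed separately
    with each of the M plausible values.
    sampling_vars: the sampling variance (e.g., from the jackknife) of
    each of those estimates.
    """
    m = len(estimates)
    point = sum(estimates) / m
    within = sum(sampling_vars) / m                       # sampling variance
    between = sum((e - point) ** 2 for e in estimates) / (m - 1)
    total_var = within + (1.0 + 1.0 / m) * between        # add measurement error
    return point, total_var ** 0.5
```

The between-value term is what the five plausible values buy: an estimate of how much uncertainty the short, matrix-sampled booklets contribute beyond ordinary sampling error.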
The direct estimation of group population parameters has developed substantially over the years. Currently available computer software that can handle marginal estimation calculations includes:
The IRT scaling developed a single reading scale that covered all three age levels, thus allowing the comparison of the proficiencies of the different age and grade groups. To be different from other test metrics of the time, the reading scale was set to have a mean of 250.5 and a standard deviation of 50 over all ages and grades. Each student in the sample was assigned five plausible values to describe his or her reading proficiency.
The question was how to present such scale scores in NAEP reports. Presenting the NAEP scores as the estimated number of correct answers to a hypothetical test of 500 items was considered but not used. Instead, the average scale score was presented along with five proficiency levels set at particular anchor points, where: 150 = Rudimentary, 200 = Basic, 250 = Intermediate, 300 = Adept, and 350 = Advanced.
The NAEP staff was very concerned about presenting results to the general public and developed scale anchoring to meet this concern. The method selected anchor points along the scale and then described what most students at each anchor point knew and could do that students with lower scores were unlikely to have mastered. This way of presenting assessment results was first used in the 1983–84 assessment (Beaton 1987).
At the time of the new design, the development of IRT models for graded or partial credit responses to items was in its infancy. The 1984 writing assessment was scored on a one-to-four scale and so was not adaptable to existing IRT programs. For this reason, the Average Response Method (ARM) (Beaton and Johnson 1990) was developed for reporting the writing data. Essentially, the ARM method computed plausible values from a linear model.
Along with the “Report Cards” for presenting the results to policymakers and the general public, NAEP produced extensive technical reports (e.g., Beaton 1987; Allen, Donoghue, and Schoeps 2001) to document every step in the process so that others could duplicate or evaluate the results.
A PUDT was developed with the 1983–84 assessment data, albeit with plenty of complexities because of BIB spiraling and plausible value technology used in the assessment. The distribution of the tape was restricted because of concerns about confidentiality. The PUDT also contained the appropriate sampling weight for each student and replicate sampling weights to simplify the computation of jackknifed sampling errors.
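With replicate weights on the file, a secondary analyst need only recompute the statistic once per replicate weight vector and accumulate the squared deviations from the full-sample estimate. The sketch below is simplified: the weight vectors in the test are invented, and the exact variance multiplier depends on the replication scheme used.

```python
def weighted_mean(values, weights):
    """Weighted mean of a list of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def jackknife_se_from_replicates(values, full_weights, replicate_weights):
    """Standard error of a weighted mean from precomputed replicate weights.

    replicate_weights: one full-length weight vector per replicate; each
    is the full-sample weight vector with one sampled cluster's weight
    zeroed out and other weights adjusted to compensate.
    """
    full = weighted_mean(values, full_weights)
    var = sum((weighted_mean(values, rw) - full) ** 2
              for rw in replicate_weights)
    return var ** 0.5
```

The appeal of this arrangement is that the analyst never needs to know the cluster structure itself, only the replicate weight columns supplied on the file.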
In 1986, NAEP assessed several areas including mathematics and science, which involved the development of new assessments, and long-term trend assessments of reading, mathematics, and science, which were designed to continue the original NAEP trend lines. There were special data collections to establish the effect of changes in methodology between the long-term trend and the new assessment data. The effects were found to be complex and significant, so NAEP later split into two strands: main NAEP, which used the new design, and long-term trend NAEP, which continued the past methodology to maintain the long-term trend lines.
The 1986 assessment also expanded NAEP’s IRT methodology so that it could present several subscales for a subject area. For example, the report on the 1986 mathematics assessment included the following subscales: measurement, geometry, relations and functions, and numbers and operations. Plausible values for the subscales were combined to make a separate mathematics composite value.
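A composite is simply a weighted average of a student's subscale plausible values. The subscale names and weights below are invented for illustration; the actual weights reflect the subject-area framework's distribution of items.

```python
# Hypothetical subscale weights (illustrative only, not NAEP's actual values).
SUBSCALE_WEIGHTS = {"numbers": 0.4, "measurement": 0.2,
                    "geometry": 0.2, "functions": 0.2}

def composite(subscale_pvs):
    """Weighted composite from one plausible value per subscale.

    subscale_pvs: mapping of subscale name -> a plausible value on that
    subscale's scale. Repeating this for each of the five plausible
    values yields five composite values per student.
    """
    return sum(SUBSCALE_WEIGHTS[s] * v for s, v in subscale_pvs.items())
```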
As NAEP was demonstrating what national assessments could do, there were other events that affected it. In 1983, President Reagan’s Secretary of Education, Terrel Bell, received the report from the National Commission on Excellence in Education (NCEE) entitled A Nation at Risk: The Imperative for Educational Reform (NCEE 1983), which decried the state of education within the United States. A year later, he released his first “wall chart.” This chart presented a number of educational statistics for each state. Importantly, it contained each state’s average SAT and ACT scores. Bell recognized that the SAT and ACT averages represented only college-bound students, not the entire class of high school seniors. He challenged the educational community to come up with a better state indicator of student achievement.
In 1987, the report The Nation’s Report Card: Improving the Assessment of Student Achievement was published (Alexander, James, and Glaser 1987). This report suggested major changes in NAEP including its governance.
The Elementary and Secondary Education Act Amendments of 1988 brought about the authorization of an independent National Assessment Governing Board (the Governing Board), which signaled a change in the governance of NAEP. The Governing Board was given the power of setting NAEP policy and preparing the frameworks of the various NAEP instruments.
The Governing Board began with several major changes in NAEP. The first was a voluntary Trial State Assessment (TSA) in which states would receive summary statistics for their own and other participating states. A separate sample across all nonparticipating states was assessed to complete the results for the entire nation. At present, all states participate in the state assessments at the 4th and 8th grades, so supplementary sampling is unnecessary.
The Governing Board also introduced a new framework for the 1990 mathematics assessment. This presented a challenge to NAEP since the new assessment would not necessarily be compatible with the NAEP mathematics trend lines. The introduction of a new mathematics framework necessitated a new trend line which used the “Main NAEP Trend” (MT) sample. A separate “Long-Term Trend” (LTT) sample was taken using the same procedures and items as in previous LTT assessments.
This change required a restructuring of the NAEP sampling plan so that participating states would not have students assessed twice and that a separate sample of nonparticipating states would ensure a good national estimate of student performance.
The Governing Board also brought about the introduction of achievement levels. For each subject area, there were three levels: Basic, Proficient, and Advanced. The Governing Board determined what students should know and be able to do, whereas scale anchoring reported students’ knowledge at various scale anchor points on the NAEP assessments. The achievement levels were introduced in the report of the 1990 assessment, replacing the scale anchoring approach to reporting.
The main NAEP sample selected students by grade only, since the age data were seldom used. The long-term trend sample continued to sample by age to maintain the existing trends.
A new way of handling graded or partial credit responses was also introduced in 1992. Graded or partial credit items were those scored as worthy of 0, 1, 2, or more points. This method replaced the ARM analysis of writing responses and opened the door for more complicated open-ended items in all NAEP subject areas (Muraki 1992).
In 1990, Congress revised the Education for All Handicapped Children Act and renamed it The Individuals with Disabilities Education Act (P.L. 101-476). NAEP responded to this Act by including, as far as possible, all students with disabilities (SDs) who had an Individualized Education Plan (IEP) or were protected under Section 504 of the Rehabilitation Act of 1973. In addition, students identified as limited English proficient (LEP), now commonly referred to as English learners (ELs), were to be included in the NAEP assessment under the guidelines established. The circumstance of each SD or EL student was reviewed to determine whether or not that student could meaningfully participate in NAEP. The review was done under strict rules, and the student was assessed unless there was a sufficient reason for exclusion determined by school staff using NAEP guidelines. In 1996, testing accommodations were introduced for students who needed them. Further information can be found at: http://nces.ed.gov/nationsreportcard/about/inclusion.aspx.
The technical question was whether or not this change in the NAEP populations would affect or distort its population proficiency estimates. To address this issue, NAEP did “bridge” studies that collected data from split samples, one allowing accommodations and the other not allowing accommodations. The first such bridge study was done for the national mathematics sample in 1996 and then for the national and state samples in both mathematics and reading in 2000. Accommodations were fully implemented in 2000 and in all later assessments.
Much of the NAEP technology was adapted and adopted for use in The Third International Mathematics and Science Study (TIMSS) that assessed student accomplishments in over 40 countries. The data were collected in 1995.
This international assessment led to studies of the possibility of comparing students in various states in the U.S. with students around the world (Johnson and Owen 1998).
The No Child Left Behind Act (NCLB) (P.L. 107-110) was enacted in 2001. This Act gave NAEP new importance as a separate, national yardstick for student performance. Results from state testing programs were to be informally monitored by state NAEP results in the corresponding grades and content areas. State participation in NAEP was required in mathematics and reading in grades 4 and 8. Other subject areas were not required, so NAEP had to develop different sampling plans for different grades and subject areas to adjust to the new reporting requirements.
To address requirements of the NCLB, the Governing Board reexamined the assessment schedule for 2003 and beyond. According to the new law, NAEP must administer reading and mathematics assessments for grades 4 and 8 every other year in all states. In addition, NAEP must test these subjects on a nationally representative basis at grade 12 at least as often as it has done in the past, or every four years. Additional information on NCLB can be found at: http://nces.ed.gov/nationsreportcard/nclb.aspx.
In 2002, the success of the state comparisons brought about a demand for separate reporting for urban districts. The Governing Board approved the “Trial Urban District Assessment” (TUDA), which allowed separate reports for the urban districts that participated in each NAEP assessment. These assessments were the same as those administered to the national and state samples. Providing such reports required enlarging the NAEP sample and adjusting it so that the national estimates were retained and students were not tested more than once.
In summary, throughout its history, NAEP has received and responded to many challenges. The primary challenge has been to keep up to date in assessment technology while maintaining an accurate record of student progress. The result of all these adjustments has been a very complex database constructed with extraordinary care in accuracy and technology. The database is well worth the intellectual investment in learning how to use it.
Even though NAEP produces many significant reports, no assessment program can perform all the analyses and report all of the results that are of interest to educational researchers and the general public. For this reason, its database is a national treasure containing extensive data that are invaluable for secondary research. However, the NAEP database is very large and complex, and its use requires considerable intellectual investment. Furthermore, general concerns about the confidentiality of participants in NAEP and in many other federally sponsored databases have brought about a licensing system for access to the full NAEP database. Accessing and using the NAEP database may seem formidable and therefore discourage many potential users. For more information, see The Restricted-Use Data Procedures Manual at: https://nces.ed.gov/statprog/rudman.
Accessing the NAEP data is made simpler by using the web-based NAEP Data Explorer (see section 3.3.1 for further information), which allows users some flexibility to navigate the NAEP database and create aggregated statistical tables and graphics. However, confidentiality concerns require that the NAEP Data Explorer’s underlying database must be hidden from the user, and it is designed not to allow unlimited accessing, manipulation, and analysis of the NAEP student, school, and teacher micro-data. In-depth secondary analysis requires access to the licensed, full NAEP database.
This Primer is designed to simplify access to the NAEP database and make its technologies more user friendly. The NAEP Primer makes use of its publicly accessible NAEP mini-sample to describe and give examples of the use of the NAEP Data Tools at https://nces.ed.gov/nationsreportcard/data/ that make it possible to run many graphs and tables without having a user license. It will also describe and give examples of several techniques for estimating population parameters (e.g., the average NAEP scores or percentages of students exceeding NAEP achievement levels in various demographic groups) using the NAEP data. The process for obtaining a user license will also be presented for the benefit of those who would like to do more in-depth research.
The Primer is not intended to be a technical report. It is intended to introduce educational and other researchers to the use of the NAEP database. Its users are expected to have a working knowledge of educational measurement, basic statistical methods, and the Internet. The Primer will cover only a portion of the NAEP database since a full coverage would not be feasible. However, the Primer will point to places on the web or elsewhere where detailed information is available.
To order your copy of The NAEP Primer with the mathematics data included on a CD-ROM for planning your analyses, see the description in the NCES publications catalog.