Education Statistics Quarterly
Vol 3, Issue 4, Topic: Methodology
The NAEP 1998 Technical Report
By:  Nancy L. Allen, John R. Donoghue, and Terry L. Schoeps
 
This article was excerpted from the Introduction to the Technical Report of the same name. The report describes the design and data analysis procedures of the 1998 National Assessment of Educational Progress (NAEP).
 
 

The 1998 National Assessment of Educational Progress (NAEP) monitored the performance of students in U.S. schools in the subject areas of reading, writing, and civics. The purpose of this technical report is to provide details on the instrument development, sample design, data collection, and data analysis procedures for the 1998 NAEP national and state assessments. The report includes information necessary to show adherence to the testing standards jointly developed by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1999) as well as those developed by the Educational Testing Service (1987). Detailed substantive results are not presented here but can be found in a series of NAEP reports covering the status of and trends in student performance; several other reports provide additional information on how the assessments were designed and implemented.



In 1998, NAEP conducted national main assessments at grades 4, 8, and 12 in reading, writing, and civics, as well as state assessments at grades 4 and 8 in reading and at grade 8 in writing.1 Long-term trend assessments (which were conducted in 1996 and 1999) were not included in the 1998 NAEP. To provide a context for the 1998 assessments, table A shows the NAEP assessment schedule from 1990 to 2000.

The 1998 NAEP used a complex multistage sample design involving nearly 448,000 students attending public and nonpublic schools. The NAEP subject-area reports (or “report cards”) documenting student performance in 1998 were based on analysis of results from over 113,000 students who took the national main assessments and over 304,000 students who took the state assessments (table B).2



NAEP strives to maintain its links to the past while still implementing innovations in measurement technology. To that end, the NAEP design includes two types of nationally representative samples: long-term trend samples and main assessment samples. Long-term trend assessments have used the same methodology and population definitions for the past 30 years, while main assessments incorporate innovations associated with new NAEP technology and address current educational issues. The national main assessment sample data are used primarily for analyses involving the current student population, but also to estimate short-term trends for a small number of recent assessments. (Some of the assessment materials administered to the national main assessment samples are periodically administered to state samples as well.) By continuing this two-tiered approach, NAEP reaffirms its commitment to studying trends while implementing the latest advances in measurement technology and education.

Test booklets

Many of the innovations implemented for the first time in 1988 were continued and enhanced in succeeding assessments. For example, a focused balanced incomplete block (focused BIB) booklet design was used in 1988. Since that time, either focused BIB or focused partially balanced incomplete block (focused PBIB) designs have been used. Variants of the focused PBIB design were used in the 1998 national main and state assessments in reading and writing, and a focused BIB design was used in the 1998 national main civics assessment. Both the BIB and PBIB designs build booklets from interlocking blocks of items, so that no student receives too many items, yet every student receives groups of items that are also presented to other students. The booklet design is focused because each student receives blocks of cognitive items in the same subject area. The focused BIB or focused PBIB design allows for improved estimation within a particular subject area, and estimation continues to be optimized for groups rather than individuals.
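To make the balance property concrete, the following Python sketch lays out a small balanced incomplete block design. It is purely illustrative (seven hypothetical item blocks assigned to seven booklets of three blocks each), not the actual 1998 NAEP block-to-booklet assignment, but it shows what the design buys: no booklet contains every block, yet every pair of blocks appears together in exactly one booklet.

    from itertools import combinations

    # Hypothetical focused BIB layout: 7 item blocks (A-G) assigned to 7 booklets
    # of 3 blocks each (a classic (7, 3, 1) design). Illustration only, not the
    # block-to-booklet assignment used in the 1998 NAEP.
    blocks = "ABCDEFG"
    booklets = [
        ("A", "B", "D"), ("B", "C", "E"), ("C", "D", "F"), ("D", "E", "G"),
        ("E", "F", "A"), ("F", "G", "B"), ("G", "A", "C"),
    ]

    # Balance check: every pair of blocks shares exactly one booklet, so each
    # student sees only 3 blocks while all block pairings are still covered
    # across students.
    for pair in combinations(blocks, 2):
        shared = sum(1 for b in booklets if pair[0] in b and pair[1] in b)
        assert shared == 1, f"blocks {pair} appear together in {shared} booklets"

    print(f"{len(booklets)} booklets cover all "
          f"{len(list(combinations(blocks, 2)))} block pairs exactly once")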

Table A.—Schedule for NAEP assessments: 1990-2000

1Before 1984, the main assessments were administered in the fall of one year through the spring of the next. Beginning with 1984, the main assessments were administered after the new year, although the long-term trend assessments continued with their traditional administration in fall, winter, and spring. Because the main assessments constitute the largest component of NAEP, their administration year is listed, rather than the 2 years over which the long-term trend assessments continue to be administered. Note also that the state assessments are administered at essentially the same time as the main assessments.

2In the columns for the main and state assessments, numbers in parentheses indicate the grades at which individual assessments were administered. The main assessments with no numbers in parentheses were administered at grades 4, 8, and 12.

3State assessments began in 1990 and were referred to as Trial State Assessments (TSA) through 1994.

SOURCE: Taken from the “Schedule for the State and National Assessment of Educational Progress from 1969-2010” on the NAEP Web Site (available: http://nces.ed.gov/nationsreportcard/about/schedule1969-2010.asp).

Scale score estimates

Since 1984, NAEP has applied the plausible values approach to estimating means for demographic as well as curriculum-related subgroups. Scale score estimates are drawn from a posterior distribution that is based on an optimum weighting of two sets of information: students’ responses to cognitive questions and students’ demographic and associated educational process variables. This Bayesian procedure was developed by Mislevy (1991). Succeeding assessments continued to use an improvement that was first implemented in 1988 and refined for the 1994 assessments. This is a multivariate procedure that uses information from all scales within a given subject area in the estimation of the scale score distribution on any one scale in that subject area.
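In outline (a simplified sketch of the approach described by Mislevy (1991), not the exact operational NAEP specification), a student's plausible values are random draws of the proficiency θ from a posterior that combines the IRT likelihood of the student's item responses x with a normal population model conditioned on the student's background variables y:

\[
p(\theta \mid x, y) \;\propto\; P(x \mid \theta)\,\phi\bigl(\theta;\ \Gamma' y,\ \Sigma\bigr),
\]

where P(x | θ) is the item response likelihood, Γ holds the regression of proficiency on the conditioning variables, and Σ is the residual covariance. Subgroup means are then estimated from several such draws per student rather than from a single point estimate. In the multivariate refinement noted above, θ is the vector of all scales in a subject area, so Σ allows each scale to borrow information from the others.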

Data collection period

To shorten the timetable for reporting results, the period for national main assessment data collection was shortened beginning in 1992. In the 1990 and earlier assessments, a 5-month period was used (January through May). In 1992, 1994, 1996, and 1998, a 3-month period in the winter was used (January through March, corresponding to the period used for the winter half-sample of the 1990 national main assessment).

Table B.—Student samples for NAEP national main and state assessments: 1998

1The reporting sample size is the number of students in the sample who were administered the assessment and whose results were used in the NAEP subject-area reports. Those special-needs students who were excluded from the assessment are not included in the reporting sample. For more information, see the complete report.

2The state sample sizes include counts of students from distinct samples for each state or jurisdiction participating in the assessment.

NOTE: The 1998 assessments were administered January 5-March 27, 1998. Final makeup sessions were held March 30-April 3, 1998.

SOURCE: Based on table 1-1 on p. 9 of the complete report from which this article is excerpted.

IRT scaling

A major improvement introduced in the 1992 assessment, and continued in succeeding assessments, was the use of the generalized partial-credit model for item response theory (IRT) scaling. This allowed constructed-response questions that are scored on a multipoint rating scale to be incorporated into the NAEP scale in a way that utilizes the information available in each response category.
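In a common parameterization (given here for orientation; the complete report gives the exact form and scaling constants used operationally), the generalized partial-credit model specifies the probability that a student with proficiency θ responds in score category k of a polytomous item j as

\[
P(X_j = k \mid \theta) \;=\;
\frac{\exp\!\Bigl(\sum_{v=0}^{k} a_j(\theta - b_{jv})\Bigr)}
     {\sum_{c=0}^{m_j} \exp\!\Bigl(\sum_{v=0}^{c} a_j(\theta - b_{jv})\Bigr)},
\qquad k = 0, 1, \ldots, m_j,
\]

where a_j is the item's slope, the b_{jv} are step parameters marking the transitions between adjacent score categories, m_j is the highest category, and the v = 0 term of each sum is defined to be zero. With m_j = 1 the model reduces to the familiar two-parameter logistic model for dichotomously scored items.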



Part I of this report begins by summarizing the design of the 1998 national main and state assessments. Subsequent chapters then provide an overview of the objectives and frameworks for items used in the assessments, the sample selection procedures, the administration of the assessments in the field, the processing of data from the assessment instruments into computer-readable form, the professional scoring of constructed-response items, and the methods used to create a complete NAEP database.

The 1998 NAEP data analysis procedures are described in part II of the report. Following a summary of the analysis steps, individual chapters provide general discussions of the weighting and variance estimation procedures used in the national main and state assessments, an overview of NAEP scaling methodology, and information about the conventions used in significance testing and reporting NAEP results. Part II concludes with chapters that provide details of the data analysis for each subject area. These chapters describe assessment frameworks and instruments, student samples, items, booklets, scoring, differential item functioning (DIF) analysis, weights, and item analyses of the national main and state assessments.

Finally, the report’s appendices provide detailed information on a variety of procedural and statistical topics. Included are explanations of how achievement levels for the subject areas were set by the National Assessment Governing Board (NAGB) and lists of committee members who contributed to the development of objectives and items.



Footnotes

1In 1998, special studies of specific aspects of writing and civics also took place, but this report does not include information on the analyses conducted for these studies, and it includes only overview information on the study samples.

2Results from some students sampled by NAEP were not included in the NAEP report cards—specifically, students who participated in special studies (rather than in national main or state assessments) and certain special-needs students. See the complete report for details.



American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.

Educational Testing Service. (1987). ETS Standards for Quality and Fairness. Princeton, NJ: Author.

Mislevy, R.J. (1991). Randomization-Based Inference About Latent Variables From Complex Samples. Psychometrika 56: 177-196.


For technical information, see the complete report:

Allen, N.L., Donoghue, J.R., and Schoeps, T.L. (2001). The NAEP 1998 Technical Report (NCES 2001–509).

Author affiliations: N.L. Allen, J.R. Donoghue, and T.L. Schoeps, Educational Testing Service.

For questions about content, contact Arnold Goldstein (arnold.goldstein@ed.gov).

To obtain the complete report (NCES 2001-509), call the toll-free ED Pubs number (877-433-7827), visit the NCES Web Site (http://nces.ed.gov), or contact GPO (202-512-1800).


