In most subjects, data for national and state NAEP assessments are analyzed using the following process:
Check Item Data and Performance: The data and performance of each item are checked in several ways: scoring reliability checks, item analyses, and differential item functioning (DIF) analyses, to ensure fair and reliable measurement of performance in the subject assessed. DIF reflects a differential probability of doing well on an item depending on group membership, even after controlling for overall performance.
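As an illustration of the DIF screening idea, the sketch below computes a basic Mantel-Haenszel common odds ratio, a widely used DIF statistic that stratifies examinees by total score so that overall performance is held constant. The function, variable names, and toy data are hypothetical and simplified, not NAEP's operational procedure.

```python
import numpy as np

def mantel_haenszel_odds_ratio(correct, group, total_score):
    """Common odds ratio of a correct response (reference vs. focal group),
    stratified by total score so overall performance is controlled for.
    correct: bool array; group: 0 = reference, 1 = focal; total_score: ints.
    A value near 1.0 suggests no DIF. (A real implementation would also
    guard against empty score strata and zero cells.)"""
    num, den = 0.0, 0.0
    for s in np.unique(total_score):
        idx = total_score == s
        ref, foc = group[idx] == 0, group[idx] == 1
        n = idx.sum()
        a = np.sum(correct[idx] & ref)    # reference group, correct
        b = np.sum(~correct[idx] & ref)   # reference group, incorrect
        c = np.sum(correct[idx] & foc)    # focal group, correct
        d = np.sum(~correct[idx] & foc)   # focal group, incorrect
        num += a * d / n
        den += b * c / n
    return num / den

# Toy data with no DIF built in, so the ratio should land near 1.0.
rng = np.random.default_rng(1)
score = rng.integers(0, 11, size=2000)          # stratifier: total test score
grp = rng.integers(0, 2, size=2000)             # group membership
corr = rng.random(2000) < (0.3 + 0.05 * score)  # success depends only on score
print(mantel_haenszel_odds_ratio(corr, grp, score))
```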
Set the Scale for Assessment Data: Each subject assessed is divided into subskills, purposes, or content domains as specified by the subject framework. Separate scales are developed for the content of each subject area (e.g., mathematics, reading). A statistical procedure called Item Response Theory (IRT) scaling is used to estimate the measurement characteristics of each assessment question. The scaling involves analysis procedures that mathematically model the probability that a participant will respond correctly to a specific test question, given the participant's overall performance and the characteristics of the questions on the test.
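For a concrete sense of what IRT scaling models, the sketch below evaluates the three-parameter logistic (3PL) item response function, a common IRT model for multiple-choice items; the item parameters shown are invented for illustration.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL IRT model:
        P(theta) = c + (1 - c) / (1 + exp(-1.7 * a * (theta - b)))
    theta: examinee proficiency; a: discrimination; b: difficulty;
    c: pseudo-guessing lower asymptote; 1.7 is the usual scaling constant."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

# Hypothetical item: moderate discrimination (a=1.1), average difficulty
# (b=0.0), and a guessing floor of 0.25 for a four-option item.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct_3pl(theta, a=1.1, b=0.0, c=0.25), 3))
```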
Estimate Group Performance Results: Because NAEP must minimize the time burden on students and schools by keeping assessment administration brief, no individual student takes more than a small portion of the assessment for a given content domain. NAEP therefore uses scaling procedures to estimate the performance of groups of students (e.g., all fourth-grade students in the nation, or female eighth-grade students in a state).
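One standard large-scale-assessment technique for group estimation draws several "plausible values" per student from the scaling model and combines them. The minimal sketch below assumes such values are already available; the array shapes and numbers are invented, and a full variance estimate would also add a sampling-variance term (e.g., from a jackknife) to the between-imputation component shown here.

```python
import numpy as np

def group_mean_from_plausible_values(pv, weights):
    """Estimate a group mean from M plausible values per student.
    pv: (n_students, M) array of plausible values.
    weights: (n_students,) sampling weights.
    Returns the point estimate and the between-imputation variance."""
    w = weights / weights.sum()
    means = pv.T @ w              # one weighted mean per plausible value
    point = means.mean()          # average over the M estimates
    b_imp = means.var(ddof=1)     # variability due to imputation
    return point, b_imp

# Hypothetical data: 5 plausible values for 1,000 students.
rng = np.random.default_rng(42)
pv = rng.normal(loc=280.0, scale=35.0, size=(1000, 5))
wts = rng.uniform(0.5, 1.5, size=1000)
est, b = group_mean_from_plausible_values(pv, wts)
print(f"group mean = {est:.1f}, between-imputation variance = {b:.2f}")
```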
Transform Results to the Reporting Scale: Results for assessments conducted in different years are linked to common reporting scales, allowing year-to-year trend comparisons for common populations on related assessments.
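A minimal sketch of one common linking method follows: a linear (mean-sigma) transformation that matches the mean and standard deviation of new results to the established reporting scale. The constants below are hypothetical, not actual NAEP values.

```python
def link_to_reporting_scale(theta, mu_theta, sigma_theta, mu_report, sigma_report):
    """Linearly transform provisional scale scores so their mean and
    standard deviation match the established reporting scale:
        score = A * theta + B, with A = sigma_report / sigma_theta
        and B = mu_report - A * mu_theta."""
    a = sigma_report / sigma_theta
    b = mu_report - a * mu_theta
    return a * theta + b

# Hypothetical: a provisional theta scale with mean 0 and SD 1 mapped onto
# a reporting scale with mean 250 and SD 50.
print(link_to_reporting_scale(1.2, 0.0, 1.0, 250.0, 50.0))  # -> 310.0
```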
Create a Database: A database is created and used to compute and compare all reported results, such as percentiles, percentages at or above achievement levels, differences between groups, and differences between years for a group. All comparisons are tested for statistical significance, and standard errors are computed for all statistics.
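For example, a difference between two independent group estimates (or two assessment years) can be screened with a z statistic built from the estimates and their standard errors; the numbers below are invented for illustration.

```python
import math

def significant_difference(est1, se1, est2, se2, z_crit=1.96):
    """Test whether two independent estimates differ at roughly the 0.05
    level. The standard error of the difference combines the two SEs."""
    se_diff = math.sqrt(se1**2 + se2**2)
    z = (est1 - est2) / se_diff
    return z, abs(z) > z_crit

# Hypothetical: a 3-point gain between years, with SEs of 0.9 and 1.1.
z, sig = significant_difference(283.0, 0.9, 280.0, 1.1)
print(f"z = {z:.2f}, significant at 0.05: {sig}")
```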
To ensure the reliability of NAEP results, extensive quality control and plausibility checks are conducted as part of each analysis step. Quality control tasks verify that the analysis steps have not introduced errors or artifacts into the results. Plausibility checks encourage thinking about the results: whether they make sense and what story they tell.