In standard differential item functioning (DIF) analyses such as Mantel-Haenszel and SIBTEST, it is well established that a moderately long matching test is required for the procedures to be valid, that is, to identify DIF in items without confounding by irrelevant factors (e.g., Donoghue, Holland, and Thayer 1993). Some NAEP assessments (e.g., writing assessments) contain as few as two six-category items per booklet. This is too little information for the test statistics associated with the Mantel (1963) or SIBTEST (Shealy and Stout 1993) procedures to function effectively. Standard DIF approaches based on statistical tests of items are therefore likely to function poorly, and they are not used in the analysis of such NAEP assessments.
In these cases, the standardization method of Dorans and Kulick (1986) is used to produce descriptive statistics. The matching variable is the total score on the booklet. As in other NAEP DIF analyses, the statistics are computed based on pooled booklet matching; the results are accumulated over the booklets in which a given item appears (e.g., Allen and Donoghue 1996). The statistic of interest appears under the label "standardized mean difference." First, the difference in mean item score between the two comparison groups is calculated at each possible booklet score. The standardized mean difference for the item is then the weighted average of these differences, where the relative frequency of the focal group at each booklet score serves as the weighting function.
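The computation described above can be sketched for a single booklet as follows. This is a minimal illustration, not the NAEP production procedure: the function name, the input layout (pairs of booklet score and item score), the focal-minus-reference sign convention, and the handling of booklet scores observed in only one group are all assumptions made for the example. Pooling across booklets is not shown.

```python
from collections import defaultdict


def standardized_mean_difference(focal, reference):
    """Weighted average of focal-minus-reference mean item score differences.

    focal, reference: iterables of (booklet_score, item_score) pairs.
    Weights are the focal group's relative frequencies at each booklet score.
    """
    def means_and_counts(pairs):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for booklet_score, item_score in pairs:
            sums[booklet_score] += item_score
            counts[booklet_score] += 1
        means = {k: sums[k] / counts[k] for k in counts}
        return means, counts

    focal_means, focal_counts = means_and_counts(focal)
    ref_means, _ = means_and_counts(reference)

    # Assumption: only booklet scores observed in both groups contribute,
    # and the focal weights are renormalized over those scores.
    common = [k for k in focal_means if k in ref_means]
    n_focal = sum(focal_counts[k] for k in common)
    if n_focal == 0:
        raise ValueError("no booklet score is shared by the two groups")

    return sum(
        (focal_counts[k] / n_focal) * (focal_means[k] - ref_means[k])
        for k in common
    )


# Hypothetical data: (booklet score, item score) for each student.
focal_group = [(2, 1), (2, 1), (3, 3)]
reference_group = [(2, 2), (2, 2), (3, 2)]
smd = standardized_mean_difference(focal_group, reference_group)
```

With this sign convention, a negative value means that focal group members scored lower on the item than reference group members with the same booklet score.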
Significance testing is not performed because of the low reliability of the matching variable. Instead, the standardized mean difference values are used descriptively to identify the items that show the most evidence of DIF. A rough criterion for polytomous items has been to compute the ratio of the standardized mean difference to the item's standard deviation and to flag any item with a ratio of at least .25.
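The flagging rule can be illustrated with a short sketch. The text does not say which group's scores define the item's standard deviation or how negative ratios are treated, so this example assumes a population standard deviation over whatever item scores are supplied and compares the absolute value of the ratio with the threshold. The function name and inputs are hypothetical.

```python
from statistics import pstdev


def flag_item(smd, item_scores, threshold=0.25):
    """Return (ratio, flagged) for one polytomous item.

    smd: standardized mean difference for the item.
    item_scores: item scores used to estimate the item's standard deviation
        (assumption: the choice of group or pooled sample is left to the caller).
    """
    sd = pstdev(item_scores)
    if sd == 0:
        # All scores identical: the ratio is undefined, so do not flag.
        return float("nan"), False
    ratio = smd / sd
    # Assumption: the criterion is applied to the magnitude of the ratio.
    return ratio, abs(ratio) >= threshold


# Hypothetical item whose scores have a standard deviation of 1.
ratio, flagged = flag_item(-0.3, [0, 2])
```

Because no significance test is involved, the returned flag is only a descriptive screen for items that merit closer review.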