Enhanced procedures are employed for constructed-response items that are scored polytomously. Methods parallel to those used for dichotomously scored items yield values for each distinct response category of the item. Response categories for each item are defined in two ways: one uses the original response codes specified in the scoring rubrics used by the scorers, and the other uses the codes as defined in the Item Response Theory (IRT) scales. The response categories as defined in the IRT scales are based on scoring guides developed by subject area and measurement experts, who specify the treatment of each response category in scaling.
For some items, two or more of the response categories specified in the original scoring rubrics are collapsed to create the response categories defined in the IRT scales. A common example is a constructed-response item with seven response categories in the original scoring rubric (not reached, omitted, off-task, and the four valid response categories). The seven categories are mapped into four, with not reached, omitted, and off-task responses combined with the lowest of the four valid response categories in the IRT scale. The ordered categories can be mapped into a set of integers in the corresponding order (e.g., 0, 1, 2, and 3) or into a set of numbers ranging from 0 to 1 (e.g., 0.00, 0.33, 0.67, 1.00), so that each response category is assigned zero, partial, or full credit.
Classical item statistics are calculated for both sets of response categories (before and after scaling). However, the reported item statistics are those calculated using the response categories as defined in the IRT scales.
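The collapsing of seven rubric categories into four IRT categories can be illustrated with a minimal sketch. The code labels (`not_reached`, `score_1`, and so on) and the function names are hypothetical and chosen only for illustration; the actual codes are set by each item's scoring rubric and scoring guide.

```python
# Hypothetical rubric codes for one item; real codes come from the scoring rubric.
# Not reached, omitted, and off-task responses are combined with the lowest
# valid category, as in the seven-to-four example described above.
ORIGINAL_TO_IRT = {
    "not_reached": 0,
    "omitted": 0,
    "off_task": 0,
    "score_1": 0,  # lowest of the four valid response categories
    "score_2": 1,
    "score_3": 2,
    "score_4": 3,
}


def irt_category(original_code):
    """Map an original rubric code to its ordered IRT category (0..3)."""
    return ORIGINAL_TO_IRT[original_code]


def credit(category, max_category=3):
    """Rescale an ordered integer category to a credit between 0 and 1."""
    return category / max_category


responses = ["omitted", "score_1", "score_3", "score_4", "off_task"]
categories = [irt_category(r) for r in responses]
credits = [round(credit(c), 2) for c in categories]
print(categories)  # [0, 0, 2, 3, 0]
print(credits)     # [0.0, 0.0, 0.67, 1.0, 0.0]
```

Keeping the mapping in a single lookup table makes it straightforward to compute statistics both before scaling (on the original codes) and after scaling (on the collapsed categories).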
The following statistics, analogous to those for dichotomously scored items, are computed:
the percentage of examinees providing a response that was off task;
the ratio of the mean item score to the maximum-possible item score (in place of p+);
the inverse-normally transformed ratio of the mean item score to the maximum-possible item score scaled to mean 13 and standard deviation 4 (in place of delta);
the polyserial correlation coefficient between the item score and the total score for the block in which the item appears (in place of the biserial); and
the Pearson product-moment correlation coefficient between the item score and the total score for the block in which the item appears (in place of the point-biserial).
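Three of the statistics listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the operational procedure: the function names and sample data are made up, and the sign convention of the delta analogue (13 minus 4 times the normal deviate, so that harder items receive larger values) is an assumption that follows the usual delta convention rather than something stated in the text. The polyserial correlation is omitted because it requires estimating a latent-variable model.

```python
from math import sqrt
from statistics import NormalDist, mean


def mean_score_ratio(item_scores, max_score):
    """Ratio of the mean item score to the maximum-possible item score."""
    return mean(item_scores) / max_score


def delta_analogue(ratio):
    """Inverse-normal transform of the ratio on a mean-13, SD-4 scale.

    Assumed sign convention: 13 - 4*z, so harder items get larger values.
    inv_cdf is undefined at 0 and 1, so the ratio must lie strictly between.
    """
    return 13.0 - 4.0 * NormalDist().inv_cdf(ratio)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up data: item scores on a 0..3 scale and block totals for six examinees.
item_scores = [0, 1, 2, 3, 3, 1]
block_totals = [2, 5, 7, 10, 9, 4]

ratio = mean_score_ratio(item_scores, max_score=3)
delta = delta_analogue(ratio)
r = pearson_r(item_scores, block_totals)
```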
The total block score for each examinee is calculated by adding one point for each dichotomously scored item answered correctly and the credit assigned to the examinee's response category for each polytomously scored item. Missing responses for polytomously scored items are treated in the same way in NAEP classical item analyses as in NAEP IRT analyses.
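The block-score rule can be written as a short function. This is a sketch under stated assumptions: `block_total` and its arguments are hypothetical names, and the polytomous credits are assumed to have already been resolved, including whatever treatment is applied to missing responses.

```python
def block_total(dichotomous_correct, polytomous_credits):
    """Sum one point per correct dichotomous item plus polytomous credit.

    dichotomous_correct: iterable of booleans, True when answered correctly.
    polytomous_credits: iterable of credits already assigned to the examinee's
        response category for each polytomous item (integer or fractional).
    """
    points = sum(1 for correct in dichotomous_correct if correct)
    return points + sum(polytomous_credits)


# Two of three dichotomous items correct, plus integer category credits.
total_integer = block_total([True, False, True], [2, 0])
# The same examinee with credits expressed on the 0-to-1 scale.
total_fractional = block_total([True, False, True], [2 / 3, 1.0])
```

Whether integer or fractional credits are used changes the scale of the total and the relative weight of polytomous items within it, so the same convention should be applied to every examinee in the block.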