Abedi, J. (1996). The Interrater/Test Reliability System (ITRS). Multivariate Behavioral Research, 31(4): 409–417.
Allen, M.J., and Yen, W.M. (1979). Introduction to Measurement Theory. Monterey, CA: Brooks/Cole Publishing Company.
Allen, N.A., Hombo, C.H., and Stoeckel, J.J. (2003). NAEP 1999 Long-Term Trend Technical Analysis Report. Washington, DC: National Center for Education Statistics.
Allen, N.A., Jenkins, F., and Schoeps, T.S. (2003). NAEP 1997 Arts Technical Analysis Report. Washington, DC: National Center for Education Statistics.
Allen, N.L., and Donoghue, J.R. (1994, April). Differential Item Functioning Based on Complex Samples of Dichotomous and Polytomous Items. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Allen, N.L., and Donoghue, J.R. (1996). Applying the Mantel-Haenszel Procedure to Complex Samples of Items. Journal of Educational Measurement, 33(2): 231–251.
Allen, N.L., Carlson, J.E., and Zelenak, C.A. (1999). The NAEP 1996 Technical Report (NCES 99–452). Washington, DC: National Center for Education Statistics.
Allen, N.L., Jenkins, F., Kulick, E., and Zelenak, C.A. (1997). NAEP 1996 State Assessment Program in Mathematics (NCES 97–951). Washington, DC: National Center for Education Statistics.
Allen, N.L., Kline, D.L., and Zelenak, C.A. (1997). The NAEP 1994 Technical Report (NCES 97–897). Washington, DC: National Center for Education Statistics.
Allen, N.L., Mazzeo, J., Ip, E.H.S., Swinton, S.S., Isham, S.P., and Worthington, L. (1995). Data Analysis and Scaling for the 1994 Trial State Assessment in Reading. In J. Mazzeo, N.L. Allen, and D.L. Kline (Eds.), Technical Report of the NAEP 1994 Trial State Assessment in Reading (pp. 169–219). Washington, DC: National Center for Education Statistics.
Allen, N.L., Mazzeo, J., Isham, S.P., Fong, Y.F., and Bowker, D.W. (1994). Data Analysis and Scaling for the 1992 Trial State Assessment in Reading. In E.G. Johnson, J. Mazzeo, and D.L. Kline (Eds.), Technical Report of the 1992 Trial State Assessment Program in Reading (pp. 147–149). Washington, DC: National Center for Education Statistics.
Allen, N.L., McClellan, C.A., and Stoeckel, J.J. (2005). NAEP 1999 Long-Term Trend Technical Analysis Report: Three Decades of Student Performance (NCES 2005–484). U.S. Department of Education, National Center for Education Statistics. Washington, DC: U.S. Government Printing Office.
Allen, N.L., Swinton, S.S., Isham, S.P., and Zelenak, C.A. (1998). Technical Report: NAEP 1996 State Assessment. Washington, DC: National Center for Education Statistics.
American College Testing. (1993). Setting Achievement Levels on the 1992 National Assessment of Educational Progress in Mathematics, Reading, and Writing: A Technical Report on Reliability and Validity. Iowa City, IA: Author.
American College Testing. (1998). Developing Achievement Levels on the 1998 NAEP in Civics: Field Trial Report. Iowa City, IA: Author.
American College Testing. (1999a). Developing Achievement Levels on the 1998 NAEP in Civics: Pilot Study Report. Iowa City, IA: Author.
American College Testing. (1999b). Developing Achievement Levels on the 1998 NAEP in Civics: Final Report. Iowa City, IA: Author.
American College Testing. (1999c). Developing Achievement Levels on the 1998 NAEP in Writing: Field Trial Report. Iowa City, IA: Author.
American College Testing. (1999d). Developing Achievement Levels on the 1998 NAEP in Writing: Final Report. Iowa City, IA: Author.
American College Testing. (1999e). Developing Achievement Levels on the 1998 NAEP in Writing: Pilot Study Report. Iowa City, IA: Author.
Andersen, E.B. (1980). Comparing Latent Distributions. Psychometrika, 45: 121–134.
Angoff, W.H. (1971). Scales, Norms, and Equivalent Scores. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education.
Ballator, N., and Jerry, L. (1999a). NAEP 1998 Reading State Report for [state] (NCES 1999–460). Washington, DC: National Center for Education Statistics.
Ballator, N., and Jerry, L. (1999b). NAEP 1998 Writing State Report for [state] (NCES 1999–463). Washington, DC: National Center for Education Statistics.
Beall, G. (1971). Change-Over Experiments in Practice (ETS Research Bulletin 71–38). Princeton, NJ: Educational Testing Service.
Beaton, A.E., and Johnson, E.G. (1990). The Average Response Method of Scaling. Journal of Educational Statistics, 15(1): 9–38.
Beaton, A.E., and Johnson, E.G. (1992). Overview of the Scaling Methodology Used in the National Assessment. Journal of Educational Measurement, 29(2): 163–175.
Beaton, A.E., and Zwick, R. (1990). The Effect of Changes in the National Assessment: Disentangling the NAEP 1985–86 Reading Anomaly (No. 17-TR-21). Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B, 57(1): 289–300.
Birnbaum, A. (1968). Some Latent Trait Models and Their Use in Inferring an Examinee's Ability. In F.M. Lord and M.R. Novick (Eds.), Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Bock, R.D. (1972). Estimating Item Parameters and Latent Ability When Responses Are Scored in Two or More Nominal Categories. Psychometrika, 37(1): 29–51.
Bourque, M.L., and Garrison, H. (1991). The Levels of Mathematics Achievement: Initial Performance Standards for the 1990 NAEP Mathematics Assessment (Vols. I and II). Washington, DC: National Assessment Governing Board.
Bourque, M.L., Champagne, A.B., and Crissman, S. (1997). NAEP 1996 Science Performance Standards: Achievement Results for the Nation and the States. Washington, DC: National Assessment Governing Board.
Broughman, S.P., and Colaciello, L.A. (1998). Private School Universe Study, 1995–96 (NCES 98–229). Washington, DC: National Center for Education Statistics.
Calderone, J., King, L.M., and Horkay, N. (Eds.). (1997). The NAEP Guide: A Description of the Content and Methods of the 1997 and 1998 Assessments (NCES 97–990). Washington, DC: National Center for Education Statistics.
Carlson, J.E. (1993). Dimensionality of NAEP Instruments That Incorporate Polytomously Scored Items. Paper presented at the annual meeting of the American Educational Research Association, Atlanta, GA.
Carlson, J.E., and Jirele, T. (1992, April). Dimensionality of 1990 NAEP Mathematics Data. Paper presented at the meeting of the American Educational Research Association, San Francisco, CA.
Center for Civic Education. (1994). National Standards for Civics and Government. Calabasas, CA: Author.
Center for Research on Evaluation, Standards, and Student Testing (CRESST). (1996). Writing Framework and Specifications for the 1998 National Assessment of Educational Progress (Contract RS89174001). Washington, DC: National Assessment Governing Board.
Chang, H.H., Mazzeo, J., and Roussos, L. (1995). Detecting DIF for Polytomously Scored Items: An Adaptation of the SIBTEST Procedure (ETS Research Report RR-95-05). Princeton, NJ: Educational Testing Service.
Cheung, O., Clements, B., and Miu, Y.C. (1994). The Feasibility of Collecting Comparable National Statistics About Students With LEP. Washington, DC: National Center for Education Statistics.
Cizek, G. (1993). Reactions to National Academy of Education Report: Setting Performance Standards for Student Achievement. Washington, DC: National Assessment Governing Board.
Cochran, W.G. (1977). Sampling Techniques. New York, NY: John Wiley & Sons.
Cochran, W.G., and Cox, G.M. (1957). Experimental Designs. New York: John Wiley & Sons, Inc.
Cohen, J. (1968). Weighted Kappa: Nominal Scale Agreement With Provision for Scaled Disagreement or Partial Credit. Psychological Bulletin, 70(4): 213–220.
Council of Chief State School Officers. (1996). Civics Framework for the 1998 National Assessment of Educational Progress. Washington, DC: National Assessment Governing Board.
Cronbach, L.J. (1951). Coefficient Alpha and the Internal Structure of Tests. Psychometrika, 16: 297–334.
Curry, L. (1987, April). Group Decision Process in Setting Cut-Off Scores. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.
Deming, W.E., and Stephan, F.F. (1940). On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals Are Known. Annals of Mathematical Statistics, 11: 427–444.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum Likelihood From Incomplete Data via the EM Algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39: 1–38.
Donahue, P.L., Voelkl, K.E., Campbell, J.R., and Mazzeo, J. (1999). NAEP 1998 Reading Report Card for the Nation and the States (NCES 1999–500). Washington, DC: National Center for Education Statistics.
Donoghue, J.R., and Isham, S.P. (1998). A Comparison of Procedures to Detect Item Parameter Drift. Applied Psychological Measurement, 22: 33–51.
Donoghue, J.R. (1993). An Empirical Examination of the IRT Information in Polytomously Scored Reading Items (ETS Research Report RR-95-05). Princeton, NJ: Educational Testing Service.
Donoghue, J.R. (1995, April). Assessing Some of the Measurement Properties of Longer Reading Blocks: An Application of the Bootstrap to Structural Equation Models. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Donoghue, J.R. (1998a). Detecting DIF for Polytomously Scored Items: An Adaptation of the SIBTEST Procedure [Computer program]. Princeton, NJ: Educational Testing Service.
Donoghue, J.R. (1998b). NAEP SIBTEST [Computer program]. Princeton, NJ: Educational Testing Service.
Donoghue, J.R. (2000). On the Derivation of an Effect Size Measure for Polytomous DIF [draft manuscript]. Princeton, NJ: Educational Testing Service.
Donoghue, J.R., and Mazzeo, J. (1995). Assessing Some of the Properties of Longer Blocks in the 1992 NAEP Reading Assessment (ETS Research Report RR-95-28). Princeton, NJ: Educational Testing Service.
Donoghue, J.R., and Hombo, C.M. (1999, June). Some Asymptotic Results on the Distribution of an IRT Measure of Item Fit. Paper presented at the annual meeting of the Psychometric Society, Lawrence, KS.
Donoghue, J.R., Holland, P.W., and Thayer, D.T. (1993). A Monte Carlo Study of Factors That Affect the Mantel-Haenszel and Standardization Measures of Differential Item Functioning. In P.W. Holland and H. Wainer (Eds.), Differential Item Functioning (pp. 137–166). Hillsdale, NJ: Erlbaum.
Dorans, N., and Kulick, E. (1986). Demonstrating the Utility of the Standardization Approach to Assessing Unexpected Differential Item Performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23: 355–368.
Educational Testing Service. (1987). ETS Standards of Quality and Fairness. Princeton, NJ: Author.
Educational Testing Service. (1992). Innovations and Ingenuity: A Foundation for the Future. Application for Cooperative Agreement for NAEP (CFDA Number: 84.999E). Princeton, NJ: Author.
Educational Testing Service. (2000). ETS Standards of Quality and Fairness. Princeton, NJ: Author.
Engelen, R.J.H. (1987). Semiparametric Estimation in the Rasch Model (Research Report 87-1). Twente, the Netherlands: Department of Education, University of Twente.
Fitzpatrick, A.R. (1989). Social Influences in Standard-Setting: The Effects of Social Interaction on Group Judgments. Review of Educational Research, 59: 315–328.
Friedman, C.B., and Ho, K.T. (1990, April). Interjudge Consensus and Intrajudge Consistency: Is It Possible to Have Both on Standard Setting? Paper presented at the annual meeting of the National Council for Measurement in Education, Boston, MA.
Fuchs, L.S., Fuchs, D., Hosp, M.D., and Jenkins, J.R. (2001). Oral Reading Fluency as an Indicator of Reading Competence: A Theoretical, Empirical, and Historical Analysis. Scientific Studies of Reading, 5(3): 239–256.
Glaser, R., Linn, R., and Bohrnstedt, G. (1992). Assessing Student Achievement in the States. The First Report of the National Academy of Education Panel on the Evaluation of the NAEP Trial State Assessment: 1990 Trial State Assessment. National Academy of Education.
Glaser, R., Linn, R., and Bohrnstedt, G. (1993). The Trial State Assessment: Prospects and Realities. The Third Report of the National Academy of Education Panel on the Evaluation of the NAEP Trial State Assessment: 1992 Trial State Assessment. National Academy of Education.
Grant, E.L. (1964). Statistical Quality Control (pp. 211–212). Washington, DC: McGraw Hill.
Gray, L.M., Krenzke, T., and Wallace, L. (2000). Sampling Activities and Field Operations for 1998 NAEP. Rockville, MD: Westat.
Greenwald, E.A., Persky, H.R., Campbell, J.R., and Mazzeo, J. (1999). NAEP 1998 Writing Report Card for the Nation and the States (NCES 1999–462). Washington, DC: National Center for Education Statistics.
Hambleton, R.K., and Bourque, M.L. (1991). The Levels of Mathematics Achievement, Vol. II, Technical Report. Washington, DC: National Assessment Governing Board.
Hansen, M., and Tepping, B. (1985). Estimation of Variance in NAEP. Unpublished manuscript. Rockville, MD: Westat.
Hochberg, Y. (1988). A Sharper Bonferroni Procedure for Multiple Tests of Significance. Biometrika, 75: 800–802.
Hoijtink, H. (1991). Estimating the Parameters of Linear Models With a Latent Dependent Variable by Nonparametric Maximum Likelihood (Research Bulletin HB-91-1040-EX). Groningen, The Netherlands: Psychological Institute, University of Groningen.
Holland, P.W., and Thayer, D.T. (1988). Differential Item Performance and the Mantel-Haenszel Procedure. In H. Wainer and H.I. Braun (Eds.), Test Validity. Hillsdale, NJ: Erlbaum.
Hombo, C.M., Donoghue, J.R., and Thayer, D.T. (2000, July). Some Properties of the Distribution of an IRT Measure of Item Fit. Paper presented at the annual meeting of the Psychometric Society, Vancouver, British Columbia.
Houser, J. (1995, March). Assessing Students With Disabilities and Limited English Proficiency (Working Paper No. 95–13). Washington, DC: National Center for Education Statistics, Policy, and Review Branch, Data Development Division.
Huynh, H. (1994). Some Technical Aspects of Standard Setting. Paper presented at the Joint Conference on Standard Setting for Large-Scale Assessments, Washington, DC.
Huynh, H. (1998). On Score Locations of Binary and Partial Credit Items and Their Applications to Item Mapping and Criterion-Referenced Interpretation. Journal of Educational and Behavioral Statistics, 23(1): 38–58.
Improving America's Schools Act of 1994. Pub. L. 103–382, Title I, Part A, 108 Stat. 3519 (1994).
Jenkins, F., Kulick, E., Kaplan, B.A., Wang, S., Qian, J., and Wang, X. (1997). Data Analysis and Scaling for the 1996 State Assessment Program in Mathematics. In N.L. Allen, F. Jenkins, E. Kulick, and C.A. Zelenak (Eds.), Technical Report of the NAEP 1996 State Assessment Program in Mathematics (NCES 97–951). Washington, DC: National Center for Education Statistics.
Johnson, E.G. (1989). Considerations and Techniques for the Analysis of NAEP Data. Journal of Educational Statistics, 14(4): 303–334.
Johnson, E.G. (1992). The Design of the National Assessment of Educational Progress. Journal of Educational Measurement, 29(2): 95–110.
Johnson, E.G., and Carlson, J.E. (1994). The NAEP 1992 Technical Report. Washington, DC: National Center for Education Statistics.
Johnson, E.G., and King, B.F. (1987). Generalized Variance Functions for a Complex Sample Survey. Journal of Official Statistics, 3: 235–250.
Johnson, E.G., and Rust, K.F. (1992). Population Inferences and Variance Estimation for NAEP Data. Journal of Educational Statistics, 17: 175–190.
Johnson, E.G., and Rust, K.F. (1993). Effective Degrees of Freedom for Variance Estimates From a Complex Sample Survey. American Statistical Association 1993 Proceedings: Survey Research Methods Section, pp. 863–866.
Kane, M. (1993). Comments on the NAE Evaluation of the NAGB Achievement Levels. Washington, DC: National Assessment Governing Board.
Kaplan, B.A., and Johnson, E.G. (1992, April). Reliability of Professionally Scored Data: NAEP-Related Issues. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Kaplan, B.A., Beaton, A.E., Johnson, E.G., and Johnson, J.R. (1988). National Assessment of Educational Progress: 1986 Bridge Studies. Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Kaplan, D. (1995). The Impact of BIB Spiraling-Induced Missing Data Patterns on Goodness-of-Fit Tests in Factor Analysis. Journal of Educational and Behavioral Statistics, 20: 69–82.
Keyfitz, N. (1951). Sampling With Probability Proportional to Size; Adjustment for Changes in Probabilities. Journal of the American Statistical Association, 46: 105–109.
Kish, L. (1965). Survey Sampling. New York: John Wiley & Sons.
Kish, L. (1992). Weighting for Unequal Pi. Journal of Official Statistics, 8: 183–200.
Kish, L., and Frankel, M.R. (1974). Inference From Complex Samples. Journal of the Royal Statistical Society, Series B, 36: 1–22.
Kolen, M.J., and Brennan, R.L. (2004). Test Equating, Scaling, and Linking: Methods and Practices (2nd ed.). Springer.
Krewski, D., and Rao, J.N.K. (1981). Inference From Stratified Samples: Properties of Linearization, Jackknife, and Balanced Repeated Replication. Annals of Statistics, 9: 1010–1019.
Laird, N.M. (1978). Nonparametric Maximum Likelihood Estimation of a Mixing Distribution. Journal of the American Statistical Association, 73: 805–811.
Langer, J.A. (1990). The Process of Understanding: Reading for Literary and Informative Purposes. Research in the Teaching of English, 24(3): 229–260.
Langer, J.A. (1989). The Process of Understanding Literature. Report Series 2.1. Albany, NY: Center for the Learning and Teaching of Literature, State University of New York.
Lindsay, B., Clogg, C.C., and Grego, J. (1991). Semiparametric Estimation in the Rasch Model and Related Exponential Response Models, Including a Simple Latent Class Model for Item Analysis. Journal of the American Statistical Association, 86: 96–107.
Little, R.J.A., and Rubin, D.B. (1983). On Jointly Estimating Parameters and Missing Data. American Statistician, 37: 218–220.
Little, R.J.A., and Rubin, D.B. (1987). Statistical Analysis With Missing Data. New York, NY: John Wiley & Sons.
Little, R.J.A., and Rubin, D.B. (2002). Statistical Analysis With Missing Data (2nd ed.). New York, NY: John Wiley & Sons.
Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lutkus, A.D., Weiss, A.R., Campbell, J.R., Mazzeo, J., and Lazer, S. (1999). NAEP 1998 Civics Report Card for the Nation (NCES 2000–457). Washington, DC: National Center for Education Statistics.
Mantel, N., and Haenszel, W.M. (1959). Statistical Aspects of the Analysis of Data From Retrospective Studies of Disease. Journal of the National Cancer Institute, 22: 719–748.
Mantel, N. (1963). Chi-Square Tests With One Degree of Freedom: Extensions of the Mantel-Haenszel Procedure. Journal of the American Statistical Association, 58: 690–700.
Mazzeo, J. (1991). Data Analysis and Scaling. In S.L. Koffler (Ed.), The Technical Report of NAEP's 1990 Trial State Assessment Program (No. ST-21-01, pp. 138–182). Washington, DC: National Center for Education Statistics.
Mazzeo, J., Allen, N.L., and Kline, D.L. (1995). Technical Report of the NAEP 1994 Trial State Assessment Program in Reading. Washington, DC: National Center for Education Statistics.
Mazzeo, J., Carlson, J.E., Voelkl, K.E., and Lutkus, A.D. (1999). Increasing the Participation of Special Needs Students in NAEP: A Report on 1996 NAEP Research Activities (NCES 2000–473). Washington, DC: National Center for Education Statistics.
Mazzeo, J., Chang, H., Kulick, E., Fong, Y.F., and Grima, A. (1993). Data Analysis and Scaling for the 1992 Trial State Assessment. In E.G. Johnson, J. Mazzeo, and D.L. Kline (Eds.), Technical Report of the NAEP 1992 Trial State Assessment Program in Mathematics (NCES 23–ST05). Washington, DC: National Center for Education Statistics.
Mazzeo, J., Donoghue, J.R., Liu, B., and Xu, X. (2018). Estimating Standard Errors for NAEP that Incorporate Random-Groups Linking Error for the Transition from Paper-based to Digital-based Assessments. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Mazzeo, J., Johnson, E.G., Bowker, D., and Fong, Y.F. (1992). The Use of Collateral Information in Proficiency Estimation for the Trial State Assessment. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Messick, S.J., Beaton, A.E., and Lord, F.M. (1983). National Assessment of Educational Progress Reconsidered: A New Design for a New Era (NCES 15-TR-20). Washington, DC: National Center for Education Statistics.
Mislevy, R.J. (1985). Estimation of Latent Group Effects. Journal of the American Statistical Association, 80: 993–997.
Mislevy, R.J. (1990). Scaling Procedures. In E.G. Johnson and R. Zwick (Eds.), Focusing the New Design: The NAEP 1988 Technical Report (No. 19-TR-20, pp. 229–250). Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Mislevy, R.J. (1991). Randomization-Based Inference About Latent Variables From Complex Samples. Psychometrika, 56: 177–196.
Mislevy, R.J., and Bock, R.D. (1982). BILOG: Item Analysis and Test Scoring With Binary Logistic Models [Computer program]. Chicago: Scientific Software.
Mislevy, R.J., and Sheehan, K.M. (1987). Marginal Estimation Procedures. In A.E. Beaton (Ed.), Implementing the New Design: The NAEP 1983–84 Technical Report (No. 15-TR-20, pp. 293–360). Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Mislevy, R.J., and Stocking, M.L. (1989). A Consumer's Guide to LOGIST and BILOG. Applied Psychological Measurement, 13(1): 57–75.
Mislevy, R.J., and Wu, P.K. (1988). Inferring Examinee Ability When Some Item Responses Are Missing (ETS Research Report RR-88-48-ONR). Princeton, NJ: Educational Testing Service.
Mislevy, R.J., Beaton, A.E., Kaplan, B., and Sheehan, K.M. (1992). Estimating Population Characteristics From Sparse Matrix Samples of Item Responses. Journal of Educational Measurement, 29(2): 133–161.
Mislevy, R.J., Johnson, E.G., and Muraki, E. (1992). Scaling Procedures in NAEP. Journal of Educational Statistics, 17(2): 131–154.
Muraki, E. (1992). A Generalized Partial Credit Model: Application of an EM Algorithm. Applied Psychological Measurement, 16(2): 159–176.
Muraki, E. (1993). Information Functions of the Generalized Partial Credit Model. Applied Psychological Measurement, 17(4): 351–363.
Muraki, E., and Bock, R.D. (1991). PARSCALE: Parameter Scaling of Rating Data. Chicago, IL: Scientific Software, Inc.
Muraki, E., and Bock, R.D. (1997). PARSCALE: IRT Item Analysis and Test Scoring for Rating-Scale Data. Chicago, IL: Scientific Software International.
Muthén, B. (1991, November). Issues in Using NAEP Mathematics Items to Study Achievement Dimensionality, Within-Grade Differences, and Across-Grade Growth. Report presented to the Design and Analysis Committee of the National Assessment of Educational Progress, Washington, DC.
National Academy of Sciences. (1998). Grading the Nation's Report Card: Evaluating NAEP and Transforming the Assessment of Educational Progress. Washington, DC: Author.
National Assessment Governing Board. (1989). Setting Achievement Goals on the National Assessment of Educational Progress: A Draft Policy Statement. Washington, DC: Author.
National Assessment Governing Board. (1990). Reading Framework for the National Assessment of Educational Progress, 1992. Washington, DC: Author.
National Assessment Governing Board. (1996a). Civics Framework for the 1998 National Assessment of Educational Progress. Washington, DC: Author.
National Assessment Governing Board. (1996b). Writing Framework and Specifications for the 1998 National Assessment of Educational Progress, 1992–1998. Washington, DC: Author.
National Assessment Governing Board. (2003). Setting Achievement Goals on the National Assessment of Educational Progress: A Draft Policy Statement. Washington, DC: Author.
National Assessment Governing Board. (2012). NAEP Background Questions and the Use of Contextual Data in NAEP Reporting. Washington, DC: U.S. Department of Education, National Assessment Governing Board.
National Assessment Governing Board. (2022). National Assessment Governing Board Assessment Framework Development Policy Statement. Washington, DC: U.S. Department of Education, National Assessment Governing Board.
National Assessment of Educational Progress Improvement Act of 1988. Pub. L. 100–297, U.S.C.A. 1221 et seq. (1988).
National Assessment of Educational Progress. (1996). The NAEP Guide: A Description of the Content and Methods of the 1994 and 1996 Assessments. Washington, DC: National Center for Education Statistics.
National Center for Education Statistics. (2004). NCES Plan for NAEP Background Variable Development. Washington, DC.
National Computer Systems. (1998). 1998 NAEP Assessment: Report of Processing and Professional Scoring Activities. Iowa City, IA: Author.
Oh, H.L., and Scheuren, F.J. (1987). Modified Raking Ratio Estimation. Survey Methodology, 13: 209–219.
Oh, H.L., and Scheuren, F.J. (1983). Weighting Adjustment for Unit Nonresponse. In W.G. Madow, I. Olkin, and D.B. Rubin (Eds.), Incomplete Data in Sample Surveys, Volume 2: Theory and Bibliography (pp. 143–184). New York: Academic Press.
Olson, J.F., and Goldstein, A.A. (1997). The Inclusion of Students With Disabilities and Limited English Proficient Students in Large-Scale Assessments: A Summary of Recent Progress (NCES 97–482). Washington, DC: U.S. Department of Education.
Oranje, A. (2006). Jackknife Estimation of Sampling Variance of Ratio Estimators in Complex Samples: Bias and the Coefficient of Variation (ETS Research Report). Princeton, NJ.
Orlando, M., and Thissen, D. (2000). Likelihood-Based Item-Fit Indices for Dichotomous Item Response Theory Models. Applied Psychological Measurement, 24(1): 50–64.
Pellegrino, J.W., Jones, L.R., and Mitchell, K.J. (Eds.). (1998). Grading the Nation's Report Card: Evaluating NAEP and Transforming the Assessment of Educational Progress. Committee on the Evaluation of National Assessments of Educational Progress, Board on Testing and Assessment, Commission on Behavioral and Social Sciences and Education, National Research Council. Washington, DC: National Academy Press.
Petersen, N. (1988). DIF Procedures for Use in Statistical Analysis. Internal memorandum.
Potter, F. (1988). Survey of Procedures to Control Extreme Sampling Weights. American Statistical Association 1988 Proceedings: Survey Research Methods Section, pp. 225–230.
Rizzo, L., and Rust, K. (2011). Finite Population Correction for NAEP Variance Estimation. JSM Proceedings, Survey Research Methods Section, pp. 2501–2515. Alexandria, VA: American Statistical Association.
Rock, D.A. (1991, November). Subscale Dimensionality. Paper presented at the meeting of the Design and Analysis Committee of the National Assessment of Educational Progress, Washington, DC.
Rogers, A.M., and Stoeckel, J.J. NAEP 2006 National Civics, Economics, and U.S. History Restricted Use Data File Data Companion. Washington, DC: National Center for Education Statistics.
Rogers, A.M., Kokolis, G.A., Stoeckel, J.J., and Kline, D.L. (2000a). National Assessment of Educational Progress: 1998 Civics Assessment Secondary-Use Data Files Data Companion. Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Rogers, A.M., Kokolis, G.A., Stoeckel, J.J., and Kline, D.L. (2000b). National Assessment of Educational Progress: 1998 Writing Assessment Secondary-Use Data Files Data Companion. Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Rogers, A.M., Kokolis, G.A., Stoeckel, J.J., and Kline, D.L. (2000c). National Assessment of Educational Progress: 1998 Reading Assessment Secondary-Use Data Files Data Companion. Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Roussos, L., and Stout, W. (1996). A Multidimensionality-Based DIF Analysis Paradigm. Applied Psychological Measurement, 20(4): 355–371.
Roussos, L.A., Stout, W.F., and Marden, J.I. (1998). Using New Proximity Measures With Hierarchical Cluster Analysis to Detect Multidimensionality. Journal of Educational Measurement, 35: 1–30.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York, NY: John Wiley & Sons.
Rubin, D.B. (1991). EM and Beyond. Psychometrika, 56: 241–254.
Rust, K. (1985). Variance Estimation for Complex Estimators in Sample Surveys. Journal of Official Statistics, 1(4): 381–397.
Rust, K.F., and Johnson, E.G. (1992). Sampling and Weighting in the National Assessment. Journal of Educational Statistics, 17(2): 111–129.
Rust, K.F., Bethel, J., Burke, J., and Hansen, M.H. (1990). The 1988 National Assessment of Educational Progress: Sampling and Weighting Procedures, Final Report. Rockville, MD: Westat.
Rust, K.F., Burke, J., and Fahimi, M. (1992). 1990 National Assessment of Educational Progress Sampling and Weighting Procedures, Part 2: National Assessment, Final Report. Rockville, MD: Westat.
Samejima, F. (1969). Estimation of Latent Ability Using a Response Pattern of Graded Scores. Psychometrika, 34(4): 129–301.
Sarndal, C-E., Swensson, B., and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
Satterthwaite, F.E. (1941). Synthesis of Variance. Psychometrika, 6: 309–316.
Shaffer, J.P. (1994). Multiple Hypothesis Testing: A Review (Tech. Rep. No. 23). Research Triangle Park, NC: National Institute of Statistical Sciences.
Shao, J., and Tu, D. (1995). The Jackknife and the Bootstrap. New York, NY: Springer.
Shealy, R., and Stout, W. (1993). A Model-Based Standardization Approach That Separates True Bias/DIF From Group Ability Differences and Detects Test Bias/DIF as Well as Item Bias/DIF. Psychometrika, 58: 159–194.
Sheehan, K.M. (1985). MGROUP: Estimation of Group Effects in Multivariate Models [Computer program]. Princeton, NJ: Educational Testing Service.
Shepard, L.A., Glaser, R., Linn, R., and Bohrnstedt, G. (1993). Setting Performance Standards for Student Achievement. Report of the NAE Panel on the Evaluation of the NAEP Trial State Assessment: An Evaluation of the 1992 Achievement Levels. National Academy of Education.
Smith, R.L., and Smith, J.K. (1988). Differential Use of Item Information by Judges Using Angoff and Nedelsky Procedures. Journal of Educational Measurement, 25: 259–274.
Stone, C.A., Ankenmann, R.D., Lane, S., and Liu, M. (1993, April). Scaling QUASAR's Performance Assessments. Paper presented at the annual meeting of the American Educational Research Association, Atlanta, GA.
Stone, C.A., Mislevy, R.J., and Mazzeo, J. (1994, April). Misclassification Error and Goodness-of-Fit in IRT Models. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Tanner, M., and Wong, W. (1987). The Calculation of Posterior Distributions by Data Augmentation (with discussion). Journal of the American Statistical Association, 82: 528–550.
Thomas, N. (1993a). Asymptotic Corrections for Multivariate Posterior Moments With Factored Likelihood Functions. Journal of Computational and Graphical Statistics, 2: 309–322.
Thomas, N. (1993b). The E-Step of the MGROUP EM Algorithm (Program Statistics Research, ETS Research Report RR-95-05). Princeton, NJ: Educational Testing Service.
Thomas, N. (1994). CGROUP and BGROUP: Modifications of the MGROUP Program to Estimate Group Effects in Multivariate Models [Computer programs]. Princeton, NJ: Educational Testing Service.
Tukey, J.W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley Publishing Company.
von Davier, M., Sinharay, S., Oranje, A., and Beaton, A. (2007). The Statistical Procedures Used in National Assessment of Educational Progress: Recent Developments and Future Directions. In C.R. Rao and S. Sinharay (Eds.), Handbook of Statistics: Psychometrics (Vol. 26, pp. 1039–1055). Amsterdam, The Netherlands: Elsevier.
Wainer, H. (1974). The Suspended Rootogram and Other Visual Displays: An Empirical Validation. American Statistician, 28(4): 143–145.
Wallace, L., and Rust, K.F. (1999). Sample Design. In N.L. Allen, J.E. Carlson, and C.A. Zelenak (Eds.), The NAEP 1996 Technical Report (NCES 1999–452). Washington, DC: National Center for Education Statistics.
Weiss, A.R., Lutkus A.D., Grigg, W.S., and Niemi, R.G. (2000). The Next Generation of Citizens: NAEP Trends in Civics, 1988 to 1998 (NCES 2000–494). Washington, DC: National Center for Education Statistics.
Westat. (1998). Report on Data Collection Activities for All States. Rockville, MD: Author.
Westat. (2000). Report on Data Collection Activities for All States. Rockville, MD: Author.
Westat. (2000b). WesVar 4.0 User's Guide. Rockville, MD: Author.
Westat. (2001). Sample Design for 2001 NAEP History and Geography Assessment. Rockville, MD: Author.
Westat. (2002). Sample Design for 2002 NAEP Reading and Writing Assessment. Rockville, MD: Author.
Westat. (2005). Supplemental Tables From NAEP 2005 Sample Design. Rockville, MD.
Westat. (2007). Supplemental Tables From NAEP 2007 Sample Design. Rockville, MD.
Williams, V.S.L., Jones, L.V., and Tukey, J.W. (1999). Controlling Error in Multiple Comparisons, With Examples From State-to-State Differences in Educational Achievement. Journal of Educational and Behavioral Statistics, 24(1): 42–69.
Wingersky, M., Kaplan, B.A., and Beaton, A.E. (1987). Joint Estimation Procedures. In A.E. Beaton (Ed.), Implementing the New Design: The NAEP 1983–84 Technical Report (No. 15-TR-20, pp. 285–292). Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Wolter, K.M. (1985). Introduction to Variance Estimation. New York, NY: John Wiley & Sons.
Yamamoto, K. (1988). Science Data Analysis. In A.E. Beaton (Ed.), Expanding the New Design: The NAEP 1985–86 Technical Report (No. 17-TR-20, pp. 243–255). Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Yamamoto, K., and Mazzeo, J. (1992). Item Response Theory Scale Linking in NAEP. Journal of Educational Statistics, 17(2): 155–173.
Yen, W.M. (1993). Scaling Performance Assessments: Strategies for Managing Local Item Dependence. Journal of Educational Measurement, 30(3): 187–213.
Zhang, J., and Stout, W. (1999). The Theoretical DETECT Index of Dimensionality and Its Application to Approximate Simple Structure. Psychometrika, 64(2): 213–249.
Zieky, M. (1993). Practical Questions in the Use of DIF Statistics. In P.W. Holland and H. Wainer (Eds.), Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Zwick, R. (1986). Assessment of the Dimensionality of NAEP Year 15 Reading Data (ETS Research Report No. 86–4). Princeton, NJ: Educational Testing Service.
Zwick, R. (1987). Assessment of the Dimensionality of NAEP Year 15 Reading Data. Journal of Educational Measurement, 24(4): 293–308.
Zwick, R. (1988). Professionally Scored Items. Technical memorandum.
Zwick, R., and Grima, A. (1991). Policy for Differential Item Functioning (DIF) Analysis in NAEP. Technical memorandum.
Zwick, R., Donoghue, J.R., and Grima, A. (1993). Assessment of Differential Item Functioning for Performance Tasks. Journal of Educational Measurement, 30: 233–251.
Zwinderman, A.H. (1991). Logistic Regression Rasch Models. Psychometrika, 56: 589–600.