Statistical Standards
Statistical Standards Program
5-1 Statistical Analysis, Inference, and Comparisons



PURPOSE: To ensure that statistical analyses, comparisons, and inferences included in NCES products are based on appropriate statistical procedures.

KEY TERMS: effect size, estimation, hypothesis testing, Minimum Substantively Significant Effect (MSSE), power, rejection region, simple comparison, statistical inference, tail, Type I error, and Type II error.

STANDARD 5-1-1: Statistical analyses must be approached from an analysis plan that considers relevance to policy, prior findings in existing literature, and/or results of previous survey research. The analysis plan must specify the main research questions, and justify the choice of statistical methodology.

STANDARD 5-1-2: Analyses of sample survey data based on a stratified sample design must use appropriate case weights to correct for the unequal probabilities of selection. In the case of a stratified sample design with disproportionate sample allocation, the use of appropriate case weights will reduce the biases in means and totals, but will not necessarily correct biases in standard errors.
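The effect of case weights on a mean can be illustrated with a minimal sketch. It assumes a hypothetical two-stratum sample in which each case weight is the inverse of the case's selection probability; the function name, scores, and weights are illustrative and do not come from any NCES survey.

```python
def weighted_mean(values, weights):
    """Return the weighted mean, sum(w * y) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * y for y, w in zip(values, weights)) / total_weight


# Hypothetical data: stratum A sampled at 1 in 10, stratum B at 1 in 100,
# so the weights are the inverse selection probabilities (10 and 100).
scores = [50.0, 60.0, 80.0, 90.0]
weights = [10.0, 10.0, 100.0, 100.0]

unweighted = sum(scores) / len(scores)      # treats every case equally
weighted = weighted_mean(scores, weights)   # restores population proportions
print(unweighted, round(weighted, 2))
```

In this sketch the unweighted mean is 70.0 while the weighted mean is about 82.27, because the oversampled stratum no longer dominates the estimate. The weights correct the point estimate only; they do not by themselves produce correct standard errors.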

STANDARD 5-1-3: The criterion for judging statistical significance in all reported hypothesis tests will be α = 0.05 (0.95 for confidence intervals). Reports will indicate an observed difference as statistically significant when an appropriate hypothesis test rejects the null hypothesis at α = 0.05. When estimates are compared to one another based on exploratory research and presented in descriptive reports, observed deviations in either direction are of interest and the rejection region lies within both tails of the distribution of the test statistic. The conclusions stated in the text are to be supported by two-tailed tests of significance (such as t tests or z tests).
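A two-tailed test at α = 0.05 can be sketched as follows. This is an illustration only: it assumes the two estimates are independent and that a large-sample normal approximation is reasonable, and it takes the standard errors as given (for example, from a design-based variance estimator). The function name and the figures are hypothetical.

```python
from statistics import NormalDist


def two_sided_z_test(est1, se1, est2, se2, alpha=0.05):
    """Two-tailed z test for the difference between two independent estimates."""
    se_diff = (se1 ** 2 + se2 ** 2) ** 0.5           # SE of the difference
    z = (est1 - est2) / se_diff                       # test statistic
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # area in both tails
    return z, p_value, p_value < alpha


# Hypothetical estimates and standard errors
z, p_value, significant = two_sided_z_test(52.0, 1.5, 47.0, 2.0)
print(round(z, 2), round(p_value, 4), significant)
```

Here the difference of 5.0 has a standard error of 2.5, giving z = 2.0 and a two-tailed p-value of about 0.0455, so the difference would be reported as statistically significant.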

    GUIDELINE 5-1-3A: If the survey purpose or prior research indicates that only differences between estimates in a specific direction are of interest, or an established trend is to be updated with a new year of data, one-sided tests (such as t tests or z tests) may be used to optimize power. In this case, the region of rejection of the null hypothesis, H0, is contained in only one tail of the sampling distribution of the test statistic.
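The gain in power from a one-sided test can be seen in a small sketch. It assumes independent estimates, a normal approximation, and an alternative hypothesis that the new estimate is larger than the old one; the function name and the figures are hypothetical.

```python
from statistics import NormalDist


def one_sided_z_test(est_new, se_new, est_old, se_old, alpha=0.05):
    """Upper-tail z test of H0: new <= old against H1: new > old."""
    se_diff = (se_new ** 2 + se_old ** 2) ** 0.5
    z = (est_new - est_old) / se_diff
    p_value = 1.0 - NormalDist().cdf(z)   # area in the upper tail only
    return z, p_value, p_value < alpha


# Hypothetical estimates: difference 4.5, standard error 2.5, so z = 1.8
z, p_one_sided, reject = one_sided_z_test(51.5, 1.5, 47.0, 2.0)
p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_one_sided, 4), round(p_two_sided, 4), reject)
```

With z = 1.8, the one-sided p-value is about 0.036 (rejected at α = 0.05), while the two-sided p-value is about 0.072 (not rejected). The direction of the test must be fixed before looking at the data for this comparison to be legitimate.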

STANDARD 5-1-4: Reported analyses must focus on differences that are substantively important (i.e., it is not necessary, or desirable, to discuss every statistically significant difference in a report). Statistical analysis techniques must be used that are appropriate for the specific research question. The rationale for the analytic approach must be described. The efficacy of individual statistical approaches depends on the assumptions of the techniques having been met; therefore, the assumptions underlying the techniques must be discussed.

    GUIDELINE 5-1-4A: When conducting multiple comparisons, appropriate procedures should be considered to control the level of Type I error for simultaneous inferences. Multiple comparison procedures include, for example, Bonferroni, False Discovery Rate (FDR), Scheffé, and Tukey tests (see, for example, Hochberg, Y. and Tamhane, A.C. 1987, and Benjamini, Y. and Hochberg, Y. 1995).
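Two of the procedures named above, the Bonferroni adjustment and the Benjamini-Hochberg FDR procedure, can be sketched in a few lines. The sketch takes a list of p-values from already-computed tests; the function names and the p-values are hypothetical, and the Scheffé and Tukey procedures are not shown.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_rank = rank            # largest rank meeting its threshold
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True       # reject all tests up to that rank
    return rejected


# Hypothetical p-values from five pairwise comparisons
p_values = [0.001, 0.012, 0.025, 0.041, 0.200]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print(bonf)
print(bh)
```

For these values Bonferroni rejects only the first comparison, while Benjamini-Hochberg rejects the first three. The two procedures control different error rates (familywise error versus false discovery rate), so the choice between them depends on the purpose of the analysis.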

    GUIDELINE 5-1-4B: Alternative presentation of the results, such as confidence intervals or coefficients of variation, should also be considered as appropriate.

    GUIDELINE 5-1-4C: When testing for structure in the data over time, a trend test or other suitable procedure should be performed (e.g., regression, ANOVA, or non-parametric statistics). In conducting analyses over time, possible changes in population composition should be considered.
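One of the options mentioned above, a regression-based trend test, might look like the following sketch. It fits an ordinary least squares line to a short series and returns the slope, its standard error, the t statistic, and the degrees of freedom. It is deliberately simplified: it treats the yearly estimates as simple observations and ignores their sampling errors and the survey design. The function name and the data are hypothetical.

```python
def linear_trend(x, y):
    """OLS slope, its standard error, the t statistic, and degrees of freedom."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = (sse / (n - 2) / sxx) ** 0.5
    return slope, se_slope, slope / se_slope, n - 2


# Hypothetical yearly estimates (year index 0..4)
years = [0, 1, 2, 3, 4]
estimates = [10.0, 12.0, 13.0, 15.0, 18.0]
slope, se_slope, t_stat, df = linear_trend(years, estimates)
print(round(slope, 3), round(se_slope, 3), round(t_stat, 2), df)
```

The t statistic is compared with the two-tailed critical value of the t distribution with n - 2 degrees of freedom (3.182 for 3 degrees of freedom at α = 0.05). In this example the slope is 1.9 per year and the t statistic is well above that value.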

    GUIDELINE 5-1-4D: When it is appropriate, the use of multiple regression and multivariate analysis techniques should be considered to examine relationships between a dependent variable (e.g., test score) and a set of independent variables (e.g., race, sex, and family background). Such techniques can provide an integrated approach to testing many simultaneous relationships.

    GUIDELINE 5-1-4E: In general, standardized regression coefficients should be used. When the units of measurement are meaningful (e.g., number of years of schooling), unstandardized regression coefficients or mean differences should be provided.

    GUIDELINE 5-1-4F: When the results of an analysis are statistically significant, it is useful to consider the substantive interpretation of the size of the effect. For this purpose, the observed difference can be converted into an effect size to allow the interpretation of the size of the difference.

    For a t-test of the mean difference, for example, the estimated effect size is the observed difference between the two observed means relative to a measure of variability, such as the standard deviation.
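As a sketch of this calculation, the following computes a standardized mean difference using the pooled standard deviation of the two groups as the measure of variability. The pooled standard deviation is one common choice and is an assumption here, as are the function name and the unweighted toy data.

```python
from statistics import mean, variance


def standardized_mean_difference(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / pooled_var ** 0.5


# Hypothetical unweighted scores for two small groups
group_a = [2.0, 4.0, 6.0, 8.0]
group_b = [1.0, 3.0, 5.0, 7.0]
effect_size = standardized_mean_difference(group_a, group_b)
print(round(effect_size, 3))
```

In this example the means differ by 1.0 and the pooled standard deviation is about 2.58, giving an effect size of about 0.39.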

    In correlation analysis, r is the effect size. Consult Cohen (1988) for measures of effect size using additional statistical procedures.

    Cohen's (1988) convention for interpreting effect sizes may be used. Empirical evidence has shown that for t tests or z tests, an effect size of 0.2 is small, 0.5 is medium, and 0.8 is large. As for correlations, an r of 0.1 is small, 0.3 is medium, and 0.5 is large.
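The benchmarks above can be applied mechanically, as in the sketch below. The benchmark values come from the text; how to label values that fall between or below the benchmarks is an assumption made for illustration, as is the function name.

```python
def label_effect_size(value, kind="d"):
    """Label an effect size using the benchmarks quoted in the text.

    kind="d" for standardized mean differences (t or z tests),
    kind="r" for correlations. Values between benchmarks are assigned
    to the nearest benchmark below them (an illustrative assumption).
    """
    benchmarks = {"d": (0.2, 0.5, 0.8), "r": (0.1, 0.3, 0.5)}
    if kind not in benchmarks:
        raise ValueError("kind must be 'd' or 'r'")
    small, medium, large = benchmarks[kind]
    size = abs(value)  # the sign gives direction, not magnitude
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "below small"


print(label_effect_size(0.39))          # standardized mean difference
print(label_effect_size(0.35, "r"))     # correlation
```

Such labels are conventions rather than substitutes for judgment about what size of difference matters in a given substantive context.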

    GUIDELINE 5-1-4G: Another approach to considering the substantive importance of a significant difference is to compare the size of the difference to the minimum substantively significant effect (MSSE) size that is determined a priori.

    GUIDELINE 5-1-4H: When reporting on the significance of important findings, confirmatory and corroborative statistical methods and significance tests should be used. For example, if the original significant finding is based on a simple comparison t test, t tests adjusted for multiple comparisons could also be used if appropriate. Another example would be to confirm important findings obtained with one analytic approach with a second analysis conducted using an alternative approach.

STANDARD 5-1-5: Failure to reject the null hypothesis does not imply acceptance of the null hypothesis. When the null hypothesis is not rejected, the following options are available:

  1. Do not report on this test.
  2. Report that statistically significant differences or effects were not detected.
  3. If the significance level is between .05 and .10, and the observed differences are believed to be real, based on research or other evidence, but are not significant at the .05 level, possibly associated with small sample sizes and/or large standard errors, this may be noted.
  4. If the estimate is "unreliable," the reader may be informed that the standard error is so high that the observed large differences are not statistically significant.
  5. If a statistically significant difference for a total group under study is observed, but similar subgroup differences of the same magnitude are associated with smaller sample sizes and/or larger standard errors and are not statistically significant, this may be noted.
  6. If there are large apparent differences that are not significant, possibly associated with small sample sizes and/or larger standard errors, this may be noted.
  7. Use a 95 percent confidence interval to describe the magnitude of the possible difference or effect.
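Option 7 can be illustrated with a short sketch that builds a 95 percent confidence interval for the difference between two estimates. It assumes independent estimates and a normal approximation, with standard errors supplied by the analyst; the function name and figures are hypothetical.

```python
from statistics import NormalDist


def difference_ci(est1, se1, est2, se2, level=0.95):
    """Normal-approximation confidence interval for est1 - est2."""
    se_diff = (se1 ** 2 + se2 ** 2) ** 0.5
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95 percent
    diff = est1 - est2
    return diff - z_crit * se_diff, diff + z_crit * se_diff


# Hypothetical estimates: difference 3.0 with standard error 2.5
lower, upper = difference_ci(50.0, 1.5, 47.0, 2.0)
print(round(lower, 2), round(upper, 2))
```

The interval here runs from about -1.90 to 7.90. Because it includes zero, the difference is not statistically significant at the .05 level, but the interval shows the reader the range of differences that is consistent with the data.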


Agresti, A. (2002). Categorical Data Analysis, 2nd Edition. New York, NY: Wiley Interscience.

Benjamini, Y. and Hochberg, Y. (1995). "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing." Journal of the Royal Statistical Society, Series B, 57(1), pp. 289-300.

Binder, D.A., Gratton, M., Hidiroglou, M.A., Kumar, S., and Rao, J.N.K. (1984). "Analysis of Categorical Data from Surveys with Complex Designs: Some Canadian Experiences." Survey Methodology, Vol. 10, 141-156.

Cohen, B.H. (2001). Explaining Psychological Statistics. New York: Wiley.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press.

Cohen, J. and Cohen, P. (1983). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Hillsdale, NJ: L. Erlbaum Associates.

Draper, N. R. and Smith, H. (1998). Applied Regression Analysis, 3rd Edition. NY: Wiley Interscience.

Hays, W. L. (1994). Statistics. Fifth Edition. Fort Worth, TX: Harcourt College Publishers.

Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. New York: John Wiley & Sons.

Hoenig, J.M. and Heisey, D.M. (2001). "The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis." The American Statistician 55(1) pp. 19-24.

Holt, D., Smith, T.M.F., and Winter, P.D. (1980). "Regression Analysis from Complex Surveys." Journal of the Royal Statistical Society, Series A, Vol. 143, 474-481.

Jones, L.V., Lewis, C., and Tukey, J.W. (2001). "Hypothesis Tests, Multiplicity of." In N.J. Smelser and P.B. Baltes, Eds., International Encyclopedia of the Social and Behavioral Sciences. London: Elsevier Science, Ltd., pp. 7127-7133.

Kish, L. and Frankel, M.R. (1974). "Inferences from Complex Samples." Journal of the Royal Statistical Society, Series B, Vol. 36, 1-37.

Kleinbaum, D.G., Kupper, L.L., Muller, K.E., and Nizam, A. (1998). Applied Regression Analysis and Other Multivariate Methods. Pacific Grove: Duxbury Press.

Lehtonen, R. and Pahkinen, E.J. (1995). Practical Methods for Design and Analysis of Complex Surveys. New York, NY: Wiley Interscience.

Moore, D.S. (2000). The Basic Practice of Statistics, 2nd Edition. New York, NY: W.H. Freeman.

NCES Statistical Analysis Manual 2002 (forthcoming). Washington, DC: NCES.

Neter, J., Kutner, M., Nachtsheim, C., and Wasserman, W. (1996). Applied Linear Statistical Models, 4th Edition. New York, NY: McGraw-Hill/Irwin.

Skinner, C.J., Holt, D., and Smith, T.M.F. eds. (1989). Analysis of Complex Surveys. New York, NY: John Wiley & Sons.