Statistical Standards
Statistical Standards Program
 
PLANNING AND DESIGN OF SURVEYS

SUBJECT: EDUCATIONAL TESTING

NCES STANDARD: 2-6

PURPOSE: To ensure that educational tests used in NCES surveys for measuring and making inferences about education-related domains are valid, technically sound, and fair. To ensure that the administration and scoring of educational tests are standardized, that scales used over time are stable, and that the results are reported in a clear, unbiased manner.

KEY TERMS: accommodation, assessment, classical test theory, cut score, derived score, Differential Item Functioning (DIF), disability, domain, equating, fairness, field test, Individualized Education Plan (IEP), instrument, Item Response Theory (IRT), linkage, precision, reliability, scaling, scoring/rating, Section 504, and validity.


STANDARD 2-6-1: Instrument Development - All test instruments used in NCES surveys must be developed following an explicit set of specifications. The development of the instrument must be documented so that it can be replicated. The instrument documentation must include the following:

  1. Purpose(s) of the instrument;
     
  2. Domain or constructs that will be measured;
     
  3. Framework of the instrument in terms of items, tasks, questions, response formats, and modes of responding;
     
  4. Number of items and time required for administration;
     
  5. Context in which the instrument will be used;
     
  6. Characteristics of intended participants;
     
  7. Desired psychometric properties of the items, and the instrument as a whole;
     
  8. Conditions and procedures of administering the instrument;
     
  9. Procedures of scoring; and
     
  10. Reporting of the obtained scores.

    GUIDELINE 2-6-1A: Relevant experts should review the domain definitions and the instrument specifications. The qualifications of the experts, the process by which the review is conducted, and the results of the review should be documented.

    GUIDELINE 2-6-1B: All items should be reviewed before and after pilot and field tests. Pilot and field tests should be conducted on subjects with characteristics similar to intended participants. The sample design for pilot and field tests should be documented.

    GUIDELINE 2-6-1C: The field test sample should include an adequate number of cases with the characteristics necessary to determine the psychometric properties of items.

    GUIDELINE 2-6-1D: The empirical analyses and the model (e.g., classical test theory and/or Item Response Theory) used to evaluate the psychometric properties of the items during the item review process should be documented.
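
As an illustration of the kind of classical item analysis that Guideline 2-6-1D asks to be documented, the following minimal sketch computes two common statistics for dichotomously scored items: item difficulty (proportion correct) and a corrected item-total correlation as a discrimination index. The function name `item_statistics` and the 0/1 response layout are assumptions made for this example; the standard does not prescribe particular statistics or software.

```python
from statistics import mean, pstdev

def item_statistics(responses):
    """Classical item statistics for 0/1 scored items (illustrative only).

    responses: list of examinee rows, each a list of 0/1 item scores.
    Returns one (difficulty, discrimination) tuple per item, where
    difficulty is the proportion correct and discrimination is the
    correlation between the item and the total score on the other items.
    """
    n_items = len(responses[0])
    results = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Total score excluding item j, so the item is not correlated with itself.
        rest = [sum(row) - row[j] for row in responses]
        difficulty = mean(item)
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            discrimination = float("nan")  # undefined when there is no variation
        else:
            covariance = mean(i * t for i, t in zip(item, rest)) - difficulty * mean(rest)
            discrimination = covariance / (sd_item * sd_rest)
        results.append((difficulty, discrimination))
    return results

# Hypothetical field-test data: 4 examinees, 3 items.
field_test = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
for number, (p, r) in enumerate(item_statistics(field_test), start=1):
    print(f"item {number}: difficulty={p:.2f}, discrimination={r:.2f}")
```

A real field test would use far more cases than this toy data set, consistent with Guideline 2-6-1C.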

    GUIDELINE 2-6-1E: When a time limit is set for performance, the extent to which the scores include a speed component and the appropriateness of this component to the defined domain should be documented.

    GUIDELINE 2-6-1F: If the conditions of administration are allowed to vary across participants, the variations and rationale for them should be documented.

    GUIDELINE 2-6-1G: Directions for test administrations should be described with sufficient clarity for others to replicate.

    GUIDELINE 2-6-1H: When a shortened or altered form of an instrument is used, the differences from the original instrument and the implications of those differences for the interpretations of scores should be documented.


STANDARD 2-6-2: Validity - All test instruments used in NCES surveys must meet the purpose(s) stated in the instrument specifications. All intended interpretations and proposed uses of raw scores, scale scores, cut scores, equated scores, and derived scores, including composite scores, sub-scores, score differences, and profiles, must be supported by evidence and theory.

    GUIDELINE 2-6-2A: Evidence of validity should be based on analyses of the content, response processes (i.e., the thought processes used to produce an answer), internal structure of the instrument, and/or the relationship of scores to a criterion.

    GUIDELINE 2-6-2B: The rationale for each intended use of the test instruments and for the proposed interpretations of the scores obtained should be explicitly stated.

    GUIDELINE 2-6-2C: When judgments occur in the validation process, the selection process for the judges (experts/observers/raters) and the criteria for judgments should be described.


STANDARD 2-6-3: Reliability - The scores obtained by a test instrument must be free from the effects of random variations due to factors such as administration conditions and/or differences between scorers. The reliability of the scores must be adequate for the intended interpretations and uses of the scores.

The reliability must be reported, either as a standard error of measurement or as an appropriate reliability coefficient (e.g., alternate form coefficient, test-retest/stability coefficient, internal consistency coefficient, generalizability coefficient). Methods of quantifying the reliability of both raw and scale scores (including sample selection, sample sizes, and sample characteristics) must be fully described. Scorer reliability, rater-to-rater reliability, and rater-year reliability must be reported when the scoring process involves judgment.
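
To make the two reporting options concrete, the sketch below computes one internal consistency coefficient (Cronbach's alpha) and the corresponding standard error of measurement under classical test theory, SEM = SD × sqrt(1 − reliability). The function names and the toy data are assumptions for illustration; other coefficients named in the standard (alternate form, test-retest, generalizability) require different data and are not shown.

```python
from math import sqrt
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of examinee rows of item scores."""
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def standard_error_of_measurement(scores):
    """Classical SEM on the raw total-score scale: SD * sqrt(1 - alpha)."""
    total_sd = sqrt(variance([sum(row) for row in scores]))
    return total_sd * sqrt(1 - cronbach_alpha(scores))

# Hypothetical data: 4 examinees, 3 dichotomous items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(f"alpha = {cronbach_alpha(data):.2f}")
print(f"SEM   = {standard_error_of_measurement(data):.2f} raw-score points")
```

Whichever index is reported, the sample used to estimate it should be described as the standard requires.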

    GUIDELINE 2-6-3A: All relevant sources of measurement errors and summary statistics of the size of the errors from these sources should be reported.

    GUIDELINE 2-6-3B: When average scores for participating groups are used, the standard error of measurement of group averages should be reported. Standard error statistics should include components due to sampling examinees, as well as components due to measurement error of the test instrument.
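
The two components named in Guideline 2-6-3B can be separated in a simple case. The sketch below assumes a simple random sample of examinees and a classical test theory model in which observed-score variance is the sum of true-score variance (reliability × observed variance) and measurement error variance. The function name and inputs are hypothetical. Surveys with clustered or weighted samples would need design-based variance estimation instead, which this sketch does not attempt.

```python
from math import sqrt

def group_mean_standard_error(observed_sd, reliability, n):
    """Split the variance of a group mean into sampling and measurement parts.

    Assumes simple random sampling and classical test theory:
    observed variance = true-score variance + error variance.
    """
    observed_var = observed_sd ** 2
    # Variation in true scores across sampled examinees.
    sampling_var = reliability * observed_var / n
    # Variation due to measurement error of the instrument.
    measurement_var = (1 - reliability) * observed_var / n
    return {
        "sampling": sampling_var,
        "measurement": measurement_var,
        "standard_error": sqrt(sampling_var + measurement_var),
    }

# Hypothetical group: observed SD of 10 points, reliability 0.80, 100 examinees.
parts = group_mean_standard_error(observed_sd=10.0, reliability=0.80, n=100)
print(parts)
```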

    GUIDELINE 2-6-3C: Reliability information on scores for each group should be reported when an instrument is used to measure different groups (e.g., race/ethnicity, gender, age, or special populations).

    GUIDELINE 2-6-3D: Reliability information should be reported for each version of a test instrument when original and altered versions of an instrument are used.

    GUIDELINE 2-6-3E: Separate reliability analyses should be performed when major variations of the administration procedure are permitted to accommodate disabilities.


STANDARD 2-6-4: Fairness - Test instruments used in NCES surveys must be designed, developed, and administered in ways that treat participants equally and fairly, regardless of differences in personal characteristics such as race, ethnicity, gender, age, socioeconomic status, or disability that are not relevant to the intended uses of the instrument.

    GUIDELINE 2-6-4A: Language, symbols, words, phrases, and content that are generally regarded as offensive by members of particular groups should be eliminated, except when judged to be necessary for adequate representation of the domain.

    GUIDELINE 2-6-4B: Although differences in the subgroups' performance do not necessarily indicate that a measurement instrument is unfair, differences between groups should be investigated to make sure that they are not caused by construct-irrelevant factors.

    GUIDELINE 2-6-4C: When research shows that Differential Item Functioning (DIF) exists, studies should be conducted to detect and eliminate aspects of test design, content, and format that might bias test scores for a particular subgroup.
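
One widely used way to screen a dichotomous item for DIF is the Mantel-Haenszel procedure, which compares the odds of a correct response in a reference group and a focal group after matching examinees on total score. The sketch below is illustrative only and is not a method prescribed by this standard. The function name, the input layout, and the counts are assumptions. The conversion −2.35 × ln(α) expresses the common odds ratio on the ETS delta scale.

```python
from math import log

def mantel_haenszel_dif(strata):
    """Mantel-Haenszel common odds ratio and delta-scale DIF index.

    strata: one tuple per matched score level,
        (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    Returns (alpha_mh, mh_d_dif). alpha_mh > 1 (negative mh_d_dif) means
    the item favors the reference group among matched examinees.
    """
    numerator = denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # skip empty score levels
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    alpha_mh = numerator / denominator
    return alpha_mh, -2.35 * log(alpha_mh)

# Hypothetical counts for two matched score levels.
alpha_mh, mh_d_dif = mantel_haenszel_dif([(40, 10, 30, 20), (30, 20, 20, 30)])
print(f"alpha_MH = {alpha_mh:.3f}, MH D-DIF = {mh_d_dif:.2f}")
```

A flagged item is a prompt for the content and format review described in the guideline, not proof of bias by itself.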

    GUIDELINE 2-6-4D: In testing applications where the level of linguistic or reading ability is not a purpose of the assessment, the linguistic or reading demands of the test instrument should be kept to a minimum.

    GUIDELINE 2-6-4E: The testing or assessment process should be carried out so that test takers receive comparable and equitable treatment during all phases of the testing process.


STANDARD 2-6-5: Testing individuals with disabilities or limited English proficiency - Whenever possible, scores derived from test instruments used in NCES surveys must validly, reliably, and fairly reflect the performance of all participants, including individuals with disabilities and individuals of diverse linguistic backgrounds. Although the exact procedures will vary across surveys, appropriate and reasonable accommodations in accordance with applicable federal nondiscrimination laws for special populations must be incorporated. Differences in performance must reflect the construct measured rather than any construct-irrelevant factors such as disabilities and/or language differences.

    GUIDELINE 2-6-5A: Permitted accommodations and/or modifications for special populations and the rationale for each accommodation should be documented in the data file and survey methodology report.

    GUIDELINE 2-6-5B: The extent to which data gathered with accommodations meet measurement standards of validity and reliability should be documented.

    For individuals with disabilities:
    GUIDELINE 2-6-5C: Empirical procedures used to review items to ensure fairness, to evaluate whether DIF exists, and to determine accommodations for students/individuals with disabilities should be included in the documentation.

    GUIDELINE 2-6-5D: Decisions about accommodations for individuals with disabilities should be made by individuals who are knowledgeable of existing research on the effects of the specific disabilities on test performance.

    GUIDELINE 2-6-5E: The participant's Individualized Education Plan (IEP) or Section 504 plan must be consulted prior to making determinations of whether a participant with a disability will participate in the assessment, and what accommodations, if any, are appropriate.

    For individuals of diverse linguistic backgrounds:
    GUIDELINE 2-6-5F: The empirical procedures used to review items for appropriateness for participants with various backgrounds and characteristics (e.g., nativity, experience in U.S. schools), to evaluate whether DIF exists, and to evaluate whether the linguistic or reading demands are no greater than required should be documented.

    GUIDELINE 2-6-5G: If an instrument is translated into another language, the translation evaluation procedures and the comparability of the translated instrument to the original version should be documented.


STANDARD 2-6-6: Administration - Administration of all test instruments used in each NCES survey must be standardized. Test administration must follow procedures specified in the test administration manual. The administration manual must include descriptions of the following:

  1. Brief statement of the purpose of the survey and the population to be tested;
     
  2. Required qualifications of those administering the instrument;
     
  3. Required identifying information of the participant;
     
  4. Materials, aids, or tools that are required, optional, or prohibited;
     
  5. Allowable instructions to the participants and procedures for timing the testing;
     
  6. Assignment of participants to groups, or special seating arrangements, and preparation of participants as relevant;
     
  7. Allowable accommodations;
     
  8. Desired testing conditions/environment; and
     
  9. Procedures to maintain security of the materials as applicable, and actions to take when irregularities are observed.

    GUIDELINE 2-6-6A: Administration procedures should be field tested. The approved procedures should be described clearly so they can be easily followed.

    GUIDELINE 2-6-6B: Survey staff administering the instrument should be trained according to the procedures prescribed in the administration manual.

    GUIDELINE 2-6-6C: Modifications or disruptions to the approved procedures should be documented so the impact of such departures can be studied.

    GUIDELINE 2-6-6D: Instructions presented to participants should include sufficient detail to allow the participants to respond to the task in the manner intended by the instrument developer.

    GUIDELINE 2-6-6E: Samples of administration sites should be monitored to ensure that the instrument is administered as specified.


STANDARD 2-6-7: Scoring and Scaling - Test scoring must be standardized within each survey, and scales must be stable if used over time.

    GUIDELINE 2-6-7A: Machine-scoring procedures should be checked for accuracy. The procedure, as well as the nature and extent of scoring errors, should be documented.

    GUIDELINE 2-6-7B: Hand scoring procedures should be documented, including rules governing scoring decisions, training procedures used to teach the rules to the coding staff, quality monitoring system used, and quantitative measures of the reliability of the resulting ratings. Criteria for evaluating the quality of individual responses should not be changed during the course of the scoring process.

    GUIDELINE 2-6-7C: All systematic sources of errors during the scoring process should be corrected and documented.

    GUIDELINE 2-6-7D: Consistency among scorers and potential drift over time in scoring/rating should be evaluated and documented.
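
A minimal sketch of one way to quantify consistency between two scorers follows: the exact agreement rate and Cohen's kappa, which adjusts agreement for the level expected by chance. The function name and the ratings are hypothetical, and the standard does not mandate a particular index. Computing the same statistics separately for each scoring period (for example, on re-scored responses from an earlier batch) is one way to look for drift over time.

```python
from collections import Counter

def agreement_and_kappa(rater_a, rater_b):
    """Exact agreement rate and Cohen's kappa for two raters' scores."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return observed, float("nan")  # kappa is undefined with a single category
    return observed, (observed - expected) / (1 - expected)

# Hypothetical double-scored responses on a 1-2 rubric.
first_scorer = [1, 1, 2, 2]
second_scorer = [1, 1, 2, 1]
agreement, kappa = agreement_and_kappa(first_scorer, second_scorer)
print(f"exact agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```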

    GUIDELINE 2-6-7E: Meanings, interpretations, limitations, rationales, and processes of establishing the reported scores should be clearly described in the technical report.

    GUIDELINE 2-6-7F: If a scale is maintained over time, its stability should be monitored, and the scale should be corrected or revised when necessary.

    GUIDELINE 2-6-7G: Procedures for scoring (raw scores, scale scores) should be documented. The documentation should also include a description of the populations used for their development.

    GUIDELINE 2-6-7H: Procedures for deriving the weights should be described when weights are used to develop the scale scores.

    GUIDELINE 2-6-7I: Population norms to which the summary statistics refer should be clearly defined when group performance is summarized using norm scores.

    GUIDELINE 2-6-7J: Rationales and procedures for establishing cut scores should be documented when cut scores are established as part of the scale score reporting.

    GUIDELINE 2-6-7K: Cut scores should be valid; that is, participants above a cut point should demonstrate a qualitatively greater degree and/or different type of skills/knowledge than those below the cut point.

    GUIDELINE 2-6-7L: The method employed in a judgmental standard-setting process should be documented. The documentation should include the following:

    1. Selection and qualifications of judges;
       
    2. Nature of the request for their judgments;
       
    3. Training provided to the judges;
       
    4. Feedback of information to judges;
       
    5. Opportunities for judges to confer with one another concerning their judgments; and
       
    6. Methods used to aggregate the judgments and translate them into cut scores.

    GUIDELINE 2-6-7M: The judgmental methods used to establish cut scores should meet the following three criteria:

    1. The judgmental method should involve peer review and pre-testing.
       
    2. The judgments to be provided should not be so cognitively complex that the judges are unable to provide meaningful judgments.
       
    3. The process used to set cut scores should be described in sufficient detail so the process can be replicated.

    GUIDELINE 2-6-7N: An estimate of the amount of variability in cut scores must be provided regardless of whether the standard-setting procedure is replicated.
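
When the standard-setting procedure is not replicated, one simple estimate of cut-score variability is the standard error of the mean of the judges' recommended cut scores. The sketch below shows that calculation under the assumption that each judge supplies one independent recommendation. The function name and the values are hypothetical, and other estimates (for example, from split panels or generalizability analyses) may be more appropriate for a given design.

```python
from math import sqrt
from statistics import mean, stdev

def cut_score_summary(judge_cut_scores):
    """Panel cut score (mean of judges) and its standard error across judges."""
    panel_cut = mean(judge_cut_scores)
    standard_error = stdev(judge_cut_scores) / sqrt(len(judge_cut_scores))
    return panel_cut, standard_error

# Hypothetical recommendations from four judges on the reporting scale.
cut, se = cut_score_summary([60, 62, 64, 66])
print(f"cut score = {cut}, standard error = {se:.2f}")
```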

    GUIDELINE 2-6-7O: Equating/linking functions should be invariant across sub-populations when equating or linking is used to determine equivalent scores. Supporting evidence for the interchangeability of tests/test-forms should be provided.
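
The invariance property in Guideline 2-6-7O can be examined by estimating the equating function separately for a sub-population and comparing it with the function estimated for the full group. The sketch below uses linear equating, which matches the means and standard deviations of two forms, and assumes a random-groups design in which each form is taken by an equivalent sample. The function names and scores are hypothetical, and operational programs often use other methods (for example, equipercentile or IRT-based equating).

```python
from statistics import mean, pstdev

def linear_equating(form_x_scores, form_y_scores):
    """Return a function that places Form X scores on the Form Y scale."""
    mean_x, mean_y = mean(form_x_scores), mean(form_y_scores)
    slope = pstdev(form_y_scores) / pstdev(form_x_scores)

    def to_form_y_scale(x):
        return mean_y + slope * (x - mean_x)

    return to_form_y_scale

def max_subgroup_difference(overall, subgroup, score_points):
    """Largest gap between overall and subgroup equated scores."""
    return max(abs(overall(x) - subgroup(x)) for x in score_points)

# Hypothetical scores: full group, then one sub-population.
overall = linear_equating([10, 20, 30], [20, 40, 60])
subgroup = linear_equating([10, 20, 30], [21, 41, 61])
gap = max_subgroup_difference(overall, subgroup, range(10, 31))
print(f"Form X score 30 equates to {overall(30):.1f} on Form Y")
print(f"largest overall-vs-subgroup difference: {gap:.1f} points")
```

How large a difference is tolerable is a judgment to be made and documented for each survey; the sketch only reports the size of the gap.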

    GUIDELINE 2-6-7P: Detailed technical information (i.e., design of equating studies, standard errors of measurement, statistical methods used, size and relevant characteristics of samples used, and psychometric properties of anchor items) should be provided for the methods by which equating or linking is established.

    GUIDELINE 2-6-7Q: Users should be warned that scores are not directly comparable when converted scores from two versions of the test are not strictly equivalent.


STANDARD 2-6-8: Reporting - Results of the testing should be provided with sufficient detail and contextual information to understand the inferences that can and cannot be made from them.

    GUIDELINE 2-6-8A: The analysis of item responses or test scores should be described in detail, including procedures for scaling or equating.

    GUIDELINE 2-6-8B: Appropriate interpretations of all reported scores should be provided. The interpretations should describe what the test covers, what the scores mean, and the precision of the scores. The generalizability and limitations of reported scores should also be presented. Potential users should be cautioned against unsupported interpretations; that is, interpretations of scores that have not been investigated, or interpretations of scores inconsistent with available evidence.

    GUIDELINE 2-6-8C: Validity and reliability should be reported for the level of aggregation for which the scores are reported when matrix sampling is used. Scores should not be reported for individuals unless the validity, comparability, and reliability of such scores indicate that reporting individual scores is meaningful.


STANDARD 2-6-9: Manual - All evidence of compliance with the standards set forth above for each test instrument used in NCES surveys must be compiled in a manual.

    GUIDELINE 2-6-9A: Technical documentation should provide technical and psychometric information on a test as well as information on test administration, scoring, and interpretation.

