Statistical Standards Program
Table of Contents
1. Development of Concepts and Methods
2. Planning and Design of Surveys
2-1 Design of Surveys
2-2 Survey Response Rate Parameters
2-3 Developing RFPs for Surveys
2-4 Pretesting Survey Systems
2-5 Maintaining Data Series Over Time
2-6 Educational Testing
3. Collection of Data
4. Processing and Editing of Data
5. Analysis of Data / Production of Estimates or Projections
6. Establishment of Review Procedures
7. Dissemination of Data
PLANNING AND DESIGN OF SURVEYS
SUBJECT: EDUCATIONAL TESTING
NCES STANDARD: 2-6
PURPOSE: To ensure that educational tests used in NCES surveys for measuring and making inferences about education-related domains are valid, technically sound, and fair. To ensure that the administration and scoring of educational tests are standardized, that scales used over time are stable, and that results are reported in a clear, unbiased manner.
KEY TERMS: accommodation, assessment, classical test theory, cut score, derived score, Differential Item Functioning (DIF), disability, domain, equating, fairness, field test, Individualized Education Plan (IEP), instrument, Item Response Theory (IRT), linkage, precision, reliability, scaling, scoring/rating, Section 504, and validity.
GUIDELINE 2-6-1A: Relevant experts should review the domain definitions and the instrument specifications. The qualifications of the experts, the process by which the review is conducted, and the results of the review should be documented.
GUIDELINE 2-6-1B: All items should be reviewed before and after pilot and field tests. Pilot and field tests should be conducted on subjects with characteristics similar to intended participants. The sample design for pilot and field tests should be documented.
GUIDELINE 2-6-1C: The field test sample should include an adequate number of cases with the characteristics necessary to determine the psychometric properties of the items.
GUIDELINE 2-6-1D: Empirical analysis and the model (e.g., Classical and/or Item Response Theory) used to evaluate the psychometric properties of the items during the item review process should be documented.
GUIDELINE 2-6-1E: When a time limit is set for performance, the extent to which the scores include a speed component and the appropriateness of this component to the defined domain should be documented.
GUIDELINE 2-6-1F: If the conditions of administration are allowed to vary across participants, the variations and rationale for them should be documented.
GUIDELINE 2-6-1G: Directions for test administration should be described with sufficient clarity that others can replicate the administration conditions.
GUIDELINE 2-6-1H: When a shortened or altered form of an instrument is used, the differences from the original instrument and the implications of those differences for the interpretations of scores should be documented.
GUIDELINE 2-6-2A: Evidence of validity should be based on analyses of the content, response processes (i.e., the thought processes used to produce an answer), internal structure of the instrument, and/or the relationship of scores to a criterion.
GUIDELINE 2-6-2B: The rationale for each intended use of the test instruments and for each proposed interpretation of the scores obtained should be explicitly stated.
GUIDELINE 2-6-2C: When judgments occur in the validation process, the selection process for the judges (experts/observers/raters) and the criteria for judgments should be described.
The reliability must be reported, either as a standard error of measurement or as an appropriate reliability coefficient (e.g., alternate-form coefficient, test-retest/stability coefficient, internal consistency coefficient, generalizability coefficient). The methods used to quantify the reliability of both raw and scale scores (including the selection of the sample, sample sizes, and sample characteristics) must be fully described. When the scoring process involves judgment, scorer reliability, including rater-to-rater and rater-by-year reliability, must be reported.
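As a minimal illustration of two of the quantities named above, the sketch below computes Cronbach's alpha (an internal consistency coefficient) from item-level data and the classical standard error of measurement, SD × sqrt(1 − reliability). This is illustrative code with hypothetical function names, not an NCES-prescribed procedure.

```python
import math

def cronbach_alpha(item_scores):
    """Cronbach's alpha (internal consistency) from item-level data:
    rows are examinees, columns are items."""
    k = len(item_scores[0])  # number of items

    def pvar(xs):  # population variance, used consistently throughout
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [pvar([row[j] for row in item_scores]) for j in range(k)]
    total_var = pvar([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def sem(total_scores, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    m = sum(total_scores) / len(total_scores)
    sd = math.sqrt(sum((x - m) ** 2 for x in total_scores) / len(total_scores))
    return sd * math.sqrt(1 - reliability)
```

In practice the reliability coefficient chosen (alternate form, test-retest, generalizability) depends on the design; alpha is shown only because it can be computed from a single administration.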
GUIDELINE 2-6-3A: All relevant sources of measurement errors and summary statistics of the size of the errors from these sources should be reported.
GUIDELINE 2-6-3B: When average scores for participating groups are used, the standard error of measurement of group averages should be reported. Standard error statistics should include components due to sampling examinees, as well as components due to measurement error of the test instrument.
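One common way to combine the two components, used in large-scale assessments that report plausible values, is Rubin's combining rule: the sampling component is the average sampling variance of the group mean across the M plausible values, and the measurement component is (1 + 1/M) times the between-plausible-value variance of those means. The sketch below is a hedged illustration (hypothetical function name) and assumes the per-plausible-value sampling variances have already been estimated, e.g., by a jackknife.

```python
import math

def pv_group_mean_se(pv_means, pv_sampling_vars):
    """Total standard error of a group mean from M plausible values.
    pv_means: group mean computed from each plausible value.
    pv_sampling_vars: sampling variance of each of those means."""
    m = len(pv_means)
    grand = sum(pv_means) / m
    sampling = sum(pv_sampling_vars) / m                  # examinee sampling
    between = sum((x - grand) ** 2 for x in pv_means) / (m - 1)
    measurement = (1 + 1 / m) * between                   # measurement error
    return math.sqrt(sampling + measurement)
```

The two variance components can also be reported separately, which satisfies the guideline's requirement to show both sources.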
GUIDELINE 2-6-3C: Reliability information on scores for each group should be reported when an instrument is used to measure different groups (e.g., race/ethnicity, gender, age, or special populations).
GUIDELINE 2-6-3D: Reliability information should be reported for each version of a test instrument when original and altered versions of an instrument are used.
GUIDELINE 2-6-3E: Separate reliability analyses should be performed when major variations of the administration procedure are permitted to accommodate disabilities.
GUIDELINE 2-6-4A: Language, symbols, words, phrases, and content that are generally regarded as offensive by members of particular groups should be eliminated, except when judged to be necessary for adequate representation of the domain.
GUIDELINE 2-6-4B: Although differences in the subgroups' performance do not necessarily indicate that a measurement instrument is unfair, differences between groups should be investigated to make sure that they are not caused by construct-irrelevant factors.
GUIDELINE 2-6-4C: When research shows that Differential Item Functioning (DIF) exists, studies should be conducted to detect and eliminate aspects of test design, content, and format that might bias test scores for a particular subgroup.
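A widely used screening statistic for DIF is the Mantel-Haenszel common odds ratio, computed across strata of examinees matched on total score. The sketch below is illustrative only (hypothetical names and data layout); it also returns the ETS delta metric, -2.35 × ln(odds ratio), where values of the odds ratio near 1 (delta near 0) suggest little DIF.

```python
import math

def mantel_haenszel_dif(strata):
    """Mantel-Haenszel common odds ratio for one item.
    Each stratum holds counts for the reference and focal groups:
    'rr'/'rw' = reference right/wrong, 'fr'/'fw' = focal right/wrong."""
    num = den = 0.0
    for s in strata:
        n = s['rr'] + s['rw'] + s['fr'] + s['fw']
        num += s['rr'] * s['fw'] / n
        den += s['rw'] * s['fr'] / n
    odds = num / den
    return odds, -2.35 * math.log(odds)  # (odds ratio, ETS delta scale)
```

Items flagged this way would then be examined substantively, per the guideline, to decide whether the difference reflects construct-irrelevant content.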
GUIDELINE 2-6-4D: In testing applications where the level of linguistic or reading ability is not a purpose of the assessment, the linguistic or reading demands of the test instrument should be kept to a minimum.
GUIDELINE 2-6-4E: The testing or assessment process should be carried out so that test takers receive comparable and equitable treatment during all phases of the testing process.
GUIDELINE 2-6-5A: Permitted accommodations and/or modifications for special populations and the rationale for each accommodation should be documented in the data file and survey methodology report.
For individuals with disabilities:
GUIDELINE 2-6-5D: Decisions about accommodations for individuals with disabilities should be made by individuals who are knowledgeable about existing research on the effects of the specific disabilities on test performance.
GUIDELINE 2-6-5E: The participant's Individualized Education Plan (IEP) or Section 504 plan must be consulted prior to making determinations of whether a participant with a disability will participate in the assessment, and what accommodations, if any, are appropriate.
For individuals of diverse linguistic backgrounds:
GUIDELINE 2-6-5G: If an instrument is translated to another language, translation evaluation procedures, and the comparability of the translated instrument to the original version should be documented.
GUIDELINE 2-6-6A: Administration procedures should be field tested. The approved procedures should be described clearly so they can be easily followed.
GUIDELINE 2-6-6B: Survey staff administering the instrument should be trained according to the procedures prescribed in the administration manual.
GUIDELINE 2-6-6C: Modifications or disruptions to the approved procedures should be documented so the impact of such departures can be studied.
GUIDELINE 2-6-6D: Instructions presented to participants should include sufficient detail to allow the participants to respond to the task in the manner intended by the instrument developer.
GUIDELINE 2-6-6E: Samples of administration sites should be monitored to ensure that the instrument is administered as specified.
GUIDELINE 2-6-7A: Machine-scoring procedures should be checked for accuracy. The procedure, as well as the nature and extent of scoring errors, should be documented.
GUIDELINE 2-6-7B: Hand-scoring procedures should be documented, including the rules governing scoring decisions, the training procedures used to teach the rules to the coding staff, the quality-monitoring system used, and quantitative measures of the reliability of the resulting ratings. Criteria for evaluating the quality of individual responses should not be changed during the course of the scoring process.
GUIDELINE 2-6-7C: All systematic sources of errors during the scoring process should be corrected and documented.
GUIDELINE 2-6-7D: Consistency among scorers and potential drift over time in scoring/rating should be evaluated and documented.
GUIDELINE 2-6-7E: Meanings, interpretations, limitations, rationales, and processes of establishing the reported scores should be clearly described in the technical report.
GUIDELINE 2-6-7F: If a scale is maintained over time, its stability should be monitored, and the scale should be corrected or revised when necessary.
GUIDELINE 2-6-7G: Procedures for scoring (raw scores and scale scores) should be documented. The documentation should also include a description of the populations used for their development.
GUIDELINE 2-6-7H: Procedures for deriving the weights should be described when weights are used to develop the scale scores.
GUIDELINE 2-6-7I: When group performance is summarized using norm scores, the population norms to which the summary statistics refer should be clearly defined.
GUIDELINE 2-6-7J: Rationales and procedures for establishing cut scores should be documented when cut scores are established as part of the scale score reporting.
GUIDELINE 2-6-7K: Cut scores should be valid; that is, participants above a cut point should demonstrate a qualitatively greater degree and/or different type of skills/knowledge than those below the cut point.
GUIDELINE 2-6-7L: The method employed in a judgmental standard-setting process should be documented. The documentation should include the following:
GUIDELINE 2-6-7M: The judgmental methods used to establish cut scores should meet the following three criteria:
GUIDELINE 2-6-7N: An estimate of the amount of variability in cut scores must be provided regardless of whether the standard-setting procedure is replicated.
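When full replication of the standard-setting procedure is not feasible, one way to estimate that variability is to resample the judges' recommended cut points, assuming the cut score is defined as their mean (as in Angoff-style methods). This is a hedged sketch with hypothetical names, not a prescribed NCES procedure.

```python
import random
import statistics

def cut_score_se(judge_cuts, n_boot=2000, seed=0):
    """Bootstrap standard error of a cut score defined as the mean of
    the judges' recommended cut points; resamples judges with
    replacement and takes the SD of the resampled means."""
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        sample = [rng.choice(judge_cuts) for _ in judge_cuts]
        boot_means.append(sum(sample) / len(sample))
    return statistics.stdev(boot_means)
```

Variance components from a generalizability study (judges, items, occasions) are a more complete alternative when the design supports them.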
GUIDELINE 2-6-7O: Equating/linking functions should be invariant across sub-populations when equating or linking is used to determine equivalent scores. Supporting evidence for the interchangeability of tests/test-forms should be provided.
GUIDELINE 2-6-7P: Detailed technical information (i.e., design of equating studies, standard errors of measurement, statistical methods used, size and relevant characteristics of samples used, and psychometric properties of anchor items) should be provided for the methods by which equating or linking is established.
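As a minimal illustration of one such method (by no means the only one used in practice), mean-sigma linear equating for a random-groups design maps Form X scores onto the Form Y scale by matching the means and standard deviations of the two groups. Function names are hypothetical.

```python
import math

def linear_equate(form_x_scores, form_y_scores):
    """Return a function mapping a Form X raw score to the Form Y
    scale via mean-sigma linear equating (random-groups design)."""
    def mean_sd(xs):
        m = sum(xs) / len(xs)
        sd = math.sqrt(sum((v - m) ** 2 for v in xs) / len(xs))
        return m, sd

    mx, sx = mean_sd(form_x_scores)
    my, sy = mean_sd(form_y_scores)
    # Match first two moments: y = my + (sy/sx) * (x - mx)
    return lambda x: my + (sy / sx) * (x - mx)
```

Equipercentile and IRT-based methods are common alternatives; whichever is used, the guideline calls for documenting the design, samples, and standard errors.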
GUIDELINE 2-6-7Q: Users should be warned that scores are not directly comparable when converted scores from two versions of the test are not strictly equivalent.
GUIDELINE 2-6-8B: Appropriate interpretations of all reported scores should be provided. The interpretations should describe what the test covers, what the scores mean, and the precision of the scores. The generalizability and limitations of reported scores should also be presented. Potential users should be cautioned against unsupported interpretations; that is, interpretations of scores that have not been investigated, or interpretations of scores inconsistent with available evidence.
GUIDELINE 2-6-8C: Validity and reliability should be reported for the level of aggregation for which the scores are reported when matrix sampling is used. Scores should not be reported for individuals unless the validity, comparability, and reliability of such scores indicate that reporting individual scores is meaningful.
GUIDELINE 2-6-9A: Technical documentation should provide technical and psychometric information on a test as well as information on test administration, scoring, and interpretation.
REFERENCES
The Use of Tests When Making High-Stakes Decisions for Students: A Resource Guide for Educators and Policymakers. (2000, July 6). Washington, DC: U.S. Department of Education, Office for Civil Rights.
ETS Standards for Quality and Fairness. (2000). Princeton, NJ: Educational Testing Service.
Standards for Educational and Psychological Testing. (1999). Prepared by the Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). Washington, DC: AERA.
NCES Statistical Standards. (1992; reprinted May 1996). U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics. Washington, DC: U.S. Government Printing Office. NCES 92-021.
Code of Fair Testing Practices in Education. (1988). Prepared by the Joint Committee on Testing Practices, Washington, DC.