The International Computer and Information Literacy Study (ICILS) is a computer-based international assessment of 8th-grade students’ capacities “to use information communications technologies (ICT) productively for a range of different purposes, in ways that go beyond a basic use of ICT” (Fraillon et al. 2018). First conducted in 2013, ICILS assessed students’ computer and information literacy (CIL) with an emphasis on the use of computers as information seeking, management, and communication tools. Twenty-one education systems around the world participated in ICILS 2013. The second cycle of ICILS was administered in 2018 and continued to investigate CIL, with an added optional international component to assess students’ computational thinking (CT) abilities, as well as how these abilities relate to school and out-of-school contexts that support learning. The United States participated in ICILS for the first time in 2018, along with 13 other education systems; nine of these systems, including the United States, participated in the optional CT component. ICILS is sponsored by the International Association for the Evaluation of Educational Achievement (IEA) and is conducted in the United States by the National Center for Education Statistics (NCES).
These Methodology and Technical Notes provide an overview, with a particular focus on the U.S. implementation, of the following technical aspects of ICILS 2018:
More detailed information can be found in the ICILS 2018 Technical Report at https://www.iea.nl/publications/technical-reports/icils-2018-technical-report.
In order to ensure comparability of the data across countries, the International Association for the Evaluation of Educational Achievement (IEA) established a set of detailed international requirements for the various aspects of data collection. The requirements regarding the target populations, sampling design, sample size, exclusions, and defining participation rates are described below.
International target populations
In order to identify comparable populations of students to be sampled, the IEA defined the international desired target population as follows:
Although participating education systems were expected to include all students in the International Target Population, sometimes it was not feasible to include all of these students because of geographic or linguistic constraints specific to the country or territory. Thus, each participating education system had its own “national” desired target population (also referred to as the National Target Population), which was the International Target Population reduced by the exclusions of those sections of the population that were not possible to assess. Working from the National Target Population, each participating education system had to operationalize the definition of its population for sampling purposes: i.e., define their “national” defined target population (referred to as the National Defined Population). While each education system’s National Defined Population ideally coincides with its National Target Population, in reality, there may be additional exclusions (e.g., of regions or school types) due to constraints of operationalizing the assessment (see section on Exclusions, below). In the United States there were no exclusions of this type, and therefore the National Defined Population and the National Target Population are the same.
It is not feasible to assess every 8th-grade student in each education system. Thus, a representative sample of 8th-grade students was selected. The sample design employed by the ICILS assessments is generally referred to as a two-stage stratified cluster sample. The sampling units at each stage were defined as follows.
ICILS guidelines called for a minimum of 150 schools to be sampled, with 20 students and 15 teachers selected per school.
All schools and students excluded from the national defined target population are referred to as the excluded population. Exclusions could occur at the school level, with entire schools being excluded, or within schools, with specific students excluded. Some accommodations were made available for students with disabilities and for students who were unable to read or speak the language of the test. The IEA requirement with regard to exclusions is that they should not exceed 5 percent of the national desired target population.
School exclusions. Education systems could exclude schools that
Within-school exclusions. Education systems were instructed to adopt the following international within-school exclusion rules to define excluded students:
Defined participation rates
In order to minimize the potential for response biases, the IEA developed participation (response) rate standards that apply to all participating education systems and govern both whether a participating education system’s data are included in the ICILS international database and how national statistics are presented in the international reports. These standards were set using composites of response rates at the school, student, and teacher levels. Moreover, response rates were calculated with and without the inclusion of substitute schools (selected to replace original sample schools refusing to participate). The decision about how to report data if the sampling procedures were not followed was made on a case-by-case basis.
The response rate standards take the following two forms, distinguished primarily by whether or not the school participation rate of 85 percent was met.
Schools with less than 50 percent student participation are considered nonrespondents.
Participants satisfying the category 1 standard are included in the international tabular presentations without annotation. Those able to satisfy only the category 2 standard are included as well but are annotated to indicate their response rate status. Participants that do not meet category 1 or 2, but that provide documentation showing that they followed sampling procedures appear in a separate section of the report.
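As a sketch of how the weighted school participation rate described above could be computed, the following applies the rule that a school assessing fewer than 50 percent of its sampled students counts as a nonrespondent. The school weights and counts are hypothetical, and the actual ICILS computations (documented in the technical report) involve additional adjustments.

```python
def school_participation_rate(schools):
    """Weighted school participation rate: the sum of base weights of
    responding schools over the sum of base weights of all eligible
    schools. A school assessing less than 50 percent of its sampled
    students counts as a nonrespondent."""
    total = sum(s["weight"] for s in schools)
    responding = sum(
        s["weight"]
        for s in schools
        if s["students_assessed"] / s["students_sampled"] >= 0.5
    )
    return responding / total

# Hypothetical schools: base weight, students sampled, students assessed.
schools = [
    {"weight": 100.0, "students_sampled": 20, "students_assessed": 18},
    {"weight": 150.0, "students_sampled": 20, "students_assessed": 9},  # below 50 percent
    {"weight": 120.0, "students_sampled": 20, "students_assessed": 12},
    {"weight": 130.0, "students_sampled": 20, "students_assessed": 20},
]
rate = school_participation_rate(schools)  # 350 / 500 = 0.7
```

The second school falls below the 50 percent threshold, so its weight counts toward the eligible total but not toward the responding total.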
1 Some sampled schools may be considered ineligible, for example if they are closed, no longer have students at the target grade, or do not contain any eligible students (all students would be excluded due to the reasons provided above).
2 The ISCED was developed by the United Nations Educational, Scientific, and Cultural Organization (UNESCO) to facilitate the comparability of educational levels across countries. ISCED Level 1 begins with the first year of formal academic learning (UNESCO 2011). In the United States, ISCED Level 1 begins at grade 1.
The U.S. ICILS 2018 national sample design
In the United States and most other participating education systems, the target populations of students corresponded to the 8th grade.
The U.S. sampling frame was explicitly stratified by three categorical stratification variables:
The U.S. sampling frame was implicitly stratified (that is, sorted for sampling) by three stratification variables:
For the first stage of drawing the samples, a systematic probability-proportional-to-size (PPS) technique, where size was the estimated 8th-grade enrollment, was used to select schools for the original sample from a sampling frame based on the 2018 National Assessment of Educational Progress (NAEP) school sampling frame. Data for public schools in the sampling frame came from the Common Core of Data (CCD) [https://nces.ed.gov/ccd/], and data for private schools came from the Private School Universe Survey (PSS) [https://nces.ed.gov/surveys/pss/]. Note that overlap with the NAEP school samples was not minimized when the ICILS 8th-grade sample was drawn, because the ICILS sample was selected before the NAEP sample due to ICILS scheduling constraints; instead, the overlap between the samples was minimized when the 2018 NAEP samples were selected. The U.S. ICILS 2018 national school sample consisted of 352 schools. In addition to the originally selected schools, two schools adjacent to each original school in the sampling frame were designated as substitute schools: the first school following the original sample school was the first substitute, and the first school preceding it was the second substitute.
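The systematic PPS selection described above can be sketched as follows. This is an illustrative simplification with hypothetical data: in practice, certainty selections (schools larger than the sampling interval) and stratum boundaries are handled with additional rules described in the technical report.

```python
import random

def systematic_pps(frame, n):
    """Systematic PPS selection: lay the schools end to end by size,
    pick a random start in [0, interval), and step through the frame
    at a fixed interval. frame is a list of (school_id, size) pairs
    already sorted in stratified order."""
    total = sum(size for _, size in frame)
    interval = total / n
    start = random.uniform(0, interval)
    targets = [start + k * interval for k in range(n)]
    selected, cumulative, i = [], 0.0, 0
    for school_id, size in frame:
        cumulative += size
        # A school is selected each time a target point falls inside
        # its size interval, so larger schools are hit more often.
        while i < len(targets) and targets[i] < cumulative:
            selected.append(school_id)
            i += 1
    return selected
```

Because the frame is sorted before selection, the fixed interval also spreads the sample across the implicit strata, and the frame neighbors of each selected school are natural candidates for substitutes.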
Student sampling and teacher sampling were accomplished by selecting a minimum of 30 8th-grade students and a minimum of 20 8th-grade teachers per school unless fewer than the minimum were available, in which case all students or teachers were selected. Each school selected was asked to prepare a list of 8th-grade students and a list of 8th-grade teachers in the school. Schools submitted these student lists and teacher lists via secure e-Filing. Students were selected from the comprehensive list of all target grade students using a systematic random sample, and teachers were randomly selected from the teacher list. This resulted in a total sample of 7,897 students and 3,730 teachers.
Note that in large schools, a smaller proportion of the students is selected, but this lower rate of selecting students in large schools is offset by a larger probability of selection of large schools, as schools are selected with probability proportional to size. In this way, the overall sample design for the United States results in an approximately self-weighting sample of students, with each student having a roughly equal probability of selection.
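The self-weighting property described above follows from simple arithmetic: the PPS school probability is proportional to school size, and the within-school student probability is inversely proportional to it, so the product is constant. A small numerical check with hypothetical enrollments:

```python
# Hypothetical design: 2 schools sampled by PPS, 20 students per school.
n_schools = 2
m_students = 20
sizes = [100, 200, 300, 400]  # hypothetical 8th-grade enrollments
total = sum(sizes)

probs = []
for size in sizes:
    p_school = n_schools * size / total  # PPS school selection probability
    p_student = m_students / size        # within-school selection probability
    probs.append(p_school * p_student)
# Every product equals n_schools * m_students / total = 0.04, so each
# student has roughly the same overall probability of selection.
```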
3 The primary purpose of stratification is to improve the precision of the survey estimates. If explicit stratification of the population is used, the units of interest (schools, for example) are sorted into mutually exclusive subgroups—strata. Units in the same stratum are as homogeneous as possible, and units in different strata are as heterogeneous as possible, with respect to the characteristics of interest to the survey. Separate samples are then selected from each stratum. In the case of implicit stratification, the units of interest are simply sorted with respect to one or more variables known to have a high correlation with the variable of interest. In this way, implicit stratification guarantees that the sample of units selected will be spread across the categories of the stratification variables.
4 The sample frame did not contain a direct measure of poverty. No National School Lunch Program (NSLP) data were available for private schools; thus all private schools are treated as low-poverty schools. Public schools with missing NSLP data were also treated as low-poverty schools.
5 The Census definitions of region were used. The Northeast region consists of Connecticut, Maine, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, and Vermont. The Midwest region consists of Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, and Wisconsin. The South region consists of Alabama, Arkansas, Delaware, District of Columbia, Florida, Georgia, Kentucky, Louisiana, Maryland, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee, Texas, Virginia, and West Virginia. The West region consists of Alaska, Arizona, California, Colorado, Hawaii, Idaho, Montana, Nevada, New Mexico, Oregon, Utah, Washington, and Wyoming.
6 The NCES definitions of locale were used. The four urban-centric locale types are: (1) City, which consists of a large, midsize, or small territory inside an urbanized area and inside a principal city. (2) Suburb, which consists of a large, midsize, or small territory outside a principal city and inside an urbanized area. (3) Town, which consists of a fringe, distant, or remote territory inside an urban cluster. (4) Rural, which consists of fringe, distant, or remote census-defined rural territory.
ICILS is an international collaborative effort involving representatives from every country participating in the study. For ICILS 2018, the test development effort began with a review and revision of the ICILS 2013 framework that was used to guide the construction of the 2013 assessment. The framework was updated to reflect changes in the computer and information literacy (CIL) curriculum and instruction of participating countries and education systems. An additional international component was added in 2018 to assess students’ computational thinking (CT). United States and international experts in CIL and CT curricula, education, and measurement, as well as representatives from national educational centers around the world contributed to the final content of the 2018 framework. Maintaining the ability to measure change since the initial CIL assessment in 2013 was an important factor in revising the framework.
New items and tasks7 were field-tested in most of the participating countries. Results from the field test were used to evaluate item and task difficulty, how well items and tasks discriminated between high- and low-performing students, the effectiveness of distracters in multiple-choice items, scoring suitability and reliability for constructed-response items, and evidence of bias toward or against individual countries or in favor of boys or girls.
The 2018 CIL framework
CIL is defined as “an individual’s ability to use computers to investigate, create, and communicate in order to participate effectively at home, at school, in the workplace, and in the community.” The CIL construct comprises four strands that describe the skills and knowledge measured in the CIL assessment. Each strand is further defined in terms of two aspects that provide the set of knowledge, skills, and understandings held in common by the range of definitions of CIL.
The CT framework
CT is defined as an “individual’s ability to recognize aspects of real-world problems which are appropriate for computational formulation and to evaluate and develop algorithmic solutions to those problems so that the solutions could be operationalized with a computer.” The CT construct is described in two strands and several aspects.
Design of instruments
ICILS 2018 included cognitive assessments of CIL and CT and a student questionnaire, as well as questionnaires for school staff. Questionnaires for information and computer technology (ICT) staff, principals, and teachers were self-administered, primarily using an online survey system.
ICILS 2018 was a computer-based assessment administered on a customized assessment platform that used purpose-built applications that followed standard interface conventions. These applications were designed to be similar to those that students would experience in their everyday computer use. Students completed a variety of tasks that used software tools and web content.
Tasks were embedded in five 30-minute CIL modules and two 25-minute CT modules. Each U.S. student completed two of the CIL modules and both CT modules. The CIL and CT modules were administered to students in a balanced randomized design. Three of the CIL modules were administered in ICILS 2013, and two were newly developed for 2018.
The CIL modules were based on a real-world theme and consisted of a series of five to eight smaller tasks that built context for one larger task that took 15 to 20 minutes to complete. Each CT module had a sequence of tasks relating to a unifying theme in a real-world situation.
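The balanced randomized assignment of modules described above can be sketched as a rotation through the ordered pairs of distinct CIL modules, so that each module appears equally often in each position. The module labels below are placeholders, not the actual ICILS module names, and the real ICILS rotation scheme (documented in the technical report) is more elaborate.

```python
import itertools

# Placeholder labels for the five CIL and two CT modules.
cil_modules = ["C1", "C2", "C3", "C4", "C5"]
ct_modules = ["T1", "T2"]

# All 20 ordered pairs of distinct CIL modules: each module appears
# equally often in the first and in the second position.
pairs = list(itertools.permutations(cil_modules, 2))

def assign(student_index):
    """Rotate students through the pair combinations; alternate CT order."""
    first, second = pairs[student_index % len(pairs)]
    ct = ct_modules if student_index % 2 == 0 else ct_modules[::-1]
    return [first, second] + ct
```

Over any block of 20 students, every CIL module is administered four times in each position, which is the balancing property such designs aim for.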
Between the CIL and CT parts of the assessment, students completed a 30-minute questionnaire designed to provide information about their backgrounds, attitudes, and in-school and out-of-school experiences related to the use of computers and information and computer technology (ICT).
ICILS 2018 included questionnaires for principals, teachers, ICT staff, and students. Questionnaires were based on the contextual framework described in the ICILS 2018 Assessment Framework. Like the assessment items, all questionnaire items were field-tested and the results reviewed carefully. After the review, some of the questionnaire items were revised prior to their inclusion in the final questionnaires.
The questionnaires are designed to provide context for student achievement, focusing on such topics as students’ attitudes towards the use of computers and ICT, as well as students’ background characteristics and their experience using computers and ICT to complete a range of tasks inside and outside of school; teachers’ familiarity with ICT, their use of ICT in teaching, perceptions of ICT in schools, and learning to use ICT in teaching; and principals’ viewpoints regarding policies, procedures, and priorities for ICT in their school, as well as information about school characteristics. In addition, a questionnaire for ICT coordinators collected information on ICT resources, ICT use, ICT technical support and professional development opportunities in ICT at the school.
Online versions of the 30-minute teacher questionnaire and 15-minute principal and ICT coordinator questionnaires were offered to respondents as the primary mode of data collection.
Translation and adaptation
Source versions of all instruments (assessment modules and questionnaires as well as procedural manuals) were prepared in English by the ICILS International Study Center and translated into the primary language or languages of instruction in each education system. In addition, it was sometimes necessary to adapt the instrument for cultural context, even in countries and education systems that use English as the primary language of instruction. All U.S. translations and adaptations were prepared under comprehensive guidelines established by the IEA Secretariat. Well-trained and experienced verifiers reviewed and documented the quality and comparability of national instruments to the international versions. The goal of the translation and adaptation process is to ensure that neither the meaning nor the difficulty of items is changed.
7 ICILS included small discrete tasks (skill execution and information management) and large tasks that required the use of several applications to produce an information product that was scored by trained scorers according to specified scoring rubrics. Items are discrete questions within tasks.
The international versions of the ICILS 2018 student, teacher, school, and Information and Computer Technology (ICT) coordinator questionnaires are included in the ICILS IDB User Guide that is publicly available through the IEA data repository. Several questions on the student questionnaires were adapted to be appropriate in the U.S. educational and cultural context, and several U.S.-specific questions, such as race/ethnicity, were added to the international versions of the questionnaires. The U.S. versions of the student, teacher, school, and ICT coordinator questionnaires are below.
ICILS 2018 emphasized the use of standardized procedures for all participants. Each participating country and education system collected its own data, based on comprehensive manuals and training materials provided by the international project team. The manuals and materials explain the survey’s implementation, including precise instructions for the work of school coordinators and scripts for test administrators to use in testing sessions.
Recruitment of schools and students
The recruitment of schools required contacting schools in the sample to solicit their participation. In most cases, NAEP State Coordinators in each state education agency recruited public schools for ICILS 2018, working with the chief state school officer and the districts of sampled schools. NAEP State Coordinators followed up after the district contact by contacting school principals to solicit participation. If a school declined to participate, the district of the first substitute school was approached and the procedure was repeated. Each participating school was asked to nominate a school coordinator as the main point of contact for the study. The school coordinator worked with project staff to arrange logistics and liaise with staff, students, and parents as necessary.
Schools chose one of three approaches for obtaining parental permission for students to participate: a simple notification, a notification with a refusal form, or a notification with a consent form for parents to sign. In each approach, parents were informed that their child could opt out of participating.
Incentives to schools, school coordinators, and students
Schools, school coordinators, and students were provided with small gifts of appreciation for their willingness to participate. Schools were offered $200, school coordinators received $100, and students were given a string backpack with an image of the map of the world. Certificates of community service were provided to students, and certificates of appreciation were provided to schools.
Test administration in the United States was carried out by professional staff trained according to the international guidelines. School personnel were asked only to assist with listing students and teachers, identifying space for testing in the school, specifying and carrying out the parental notification or consent procedure, identifying students with special needs, and coordinating questionnaire completion by school staff.
Students with disabilities and/or English language learners were allowed some of the accommodations that they receive on their state assessments. Extended time could not be allowed due to the constraints of the ICILS student assessment system.
IEA Amsterdam and the ICILS International Study Center monitored compliance with the standardized procedures. National research coordinators were asked to nominate one or more persons unconnected with their national center, such as retired school teachers, to serve as quality control monitors (QCMs) for their country or education system. The ICILS International Study Center trained the QCMs on the required procedures for administering ICILS, the responsibilities of the national centers in conducting the study, and their own roles and responsibilities. Sixteen schools in the U.S. samples were visited by the U.S. monitor. These schools were scattered geographically across the nation.
This section describes the success of participating education systems in meeting the international technical standards on data collection. Information is provided for all participating education systems on their coverage of the target population, exclusion rates, and response rates.
Table 1 provides information on the average age at testing, target population coverage, and exclusions from education system target population. Table 2 provides information on weighted school participation rates before and after school replacement for each participating education system. See section on International Requirements for ICILS required participation rates.
These tables are provided in Excel.
The ICILS 2018 assessment consisted of questions and tasks that measured either computer and information literacy (CIL) or computational thinking (CT). The questions and tasks were embedded within CIL or CT modules based on a real-world theme. A scoring guide was created for every constructed response question or task included in the ICILS assessments. The scoring guides were carefully written and reviewed by national research coordinators of all participating education systems and other experts as part of the field test, and revised accordingly.
The ICILS assessment consisted of items that were automatically scored and items that were human-scored. Each participating education system was responsible for scoring the human-scored items following the scoring guides. The United States national research coordinator is Linda Hamilton, National Center for Education Statistics. The national research coordinator and scoring lead staff from each education system attended scoring training sessions held by the ICILS International Study Center (ISC) and the International Association for the Evaluation of Educational Achievement (IEA). The training sessions focused on the scoring guides employed in ICILS 2018 and provided participants with extensive practice in scoring example responses over several days.
For quality control purposes, information on within-country agreement and cross-country agreement among scorers was collected and documented by the ICILS ISC. Two scorers were assigned for each of the human-scored items to calculate the within-country agreement. The degree of agreement between the double-scored items provided a measure of the reliability for the scoring process. For cross-country agreement, the ICILS ISC conducted extensive data reviews to examine the percentages of scorer agreement and inter-coder reliability across participating countries.
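The within-country agreement measure described above is, in its simplest form, the share of double-scored responses on which the two scorers give the same score. A minimal sketch with hypothetical score vectors (the ICILS reliability analyses in the technical report are more detailed):

```python
def percent_agreement(scores_a, scores_b):
    """Share of double-scored responses on which two scorers agree.
    scores_a and scores_b are the two scorers' codes for the same
    set of responses, in the same order."""
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)

# Hypothetical double-scored responses to one constructed-response item.
scorer1 = [2, 1, 0, 2, 1, 1, 0, 2]
scorer2 = [2, 1, 1, 2, 1, 0, 0, 2]
agreement = percent_agreement(scorer1, scorer2)  # 6 of 8 agree = 0.75
```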
Information on scoring guides and scoring reliability for human-scored items in ICILS 2018 is provided in ICILS 2018 Technical Report at https://www.iea.nl/publications/technical-reports/icils-2018-technical-report.
Data entry and cleaning
Each participating education system was responsible for submitting its data. In the United States, Westat was contracted to collect data for ICILS 2018 and prepare databases using a common international format. The IEA-supplied data management software (IEA Data Management Expert [DME]) was used to create the country databases for questionnaire data. Each participating country was responsible for performing various data consistency and verification checks within the IEA DME software prior to data submission to the IEA. Student assessment data were collected and submitted to IEA Hamburg through the SoNET Assessment Master system. The test administration database was prepared and submitted using the IEA Windows Within-School Sampling Software (IEA WinW3S). The final country databases were then submitted to IEA Hamburg (formerly known as the IEA Data Processing Center) in Hamburg, Germany, for further review and cleaning. The main purpose of this cleaning was to ensure that all information in the databases conformed to the internationally defined data structure. It also ensured that the national adaptations to questionnaires were reflected appropriately in codebooks and documentation, and that all variables selected for international comparisons were comparable across education systems.
IEA Hamburg was responsible for checking the data files from each education system, applying an extensive set of inter-related data checking and data-cleaning procedures to verify the accuracy and consistency of the data, and documenting any deviations from the international file structure. Queries arising during this process were addressed to national research coordinators. In the United States, the national research coordinator, along with Westat, reviewed the data cleaning reports and data almanacs and provided IEA Hamburg with assistance on data cleaning. Information on data cleaning quality control in ICILS 2018 is provided in ICILS 2018 Technical Report at https://www.iea.nl/publications/technical-reports/icils-2018-technical-report.
For the student assessment data, the ICILS ISC and IEA Hamburg provided national item analysis reports to the participating countries, detailing the performance of each assessment item in a given country compared with the other participating countries. These national reports also provided national and international lists of problematic items, along with graphical summaries for each assessment and IRT item statistics. This sharing allowed participating countries to review and comment on the results and helped ensure data validity. Once any problems arising from this examination were resolved, sampling weights produced by the IEA Sampling Unit and IRT-scaled student proficiency scores in CIL and CT were added to the file.
Detailed information on the entire data entry and cleaning process can be found in ICILS 2018 Technical Report at https://www.iea.nl/publications/technical-reports/icils-2018-technical-report.
Before the data were analyzed, responses from the groups of students assessed were assigned sampling weights (as described in the next section) to ensure that their representation in the ICILS 2018 results matched their actual percentage of the 8th-grade school population. With these sampling weights in place, the analyses of ICILS 2018 data proceeded in two phases: scaling and estimation. During the scaling phase, item response theory (IRT) procedures were used to estimate the measurement characteristics of each assessment question. During the estimation phase, the results of the scaling were used to produce estimates of student achievement. Subsequent conditioning procedures used the background variables collected by ICILS 2018 to limit bias in the achievement results. Additional information about these processes is provided below.
Responses from the groups of students were assigned sampling weights to adjust for the complex sample design that resulted in students having an unequal, but known, probability of selection. Additionally, an adjustment for school and student nonresponse was built into the weighting. More detailed information can be found in the ICILS 2018 Technical Report at https://www.iea.nl/publications/technical-reports/icils-2018-technical-report. In analyses of the ICILS data, it is necessary to use sampling weights to obtain accurate population estimates.
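As a small illustration of why the weights matter, a population estimate such as a mean must weight each student's response by their sampling weight; the values below are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted population estimate: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical scores and sampling weights. An unweighted mean would
# over-represent students who had a higher probability of selection
# (and hence a smaller weight).
scores = [520.0, 480.0, 500.0]
weights = [100.0, 300.0, 100.0]
estimate = weighted_mean(scores, weights)  # 246000 / 500 = 492.0
```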
CIL scores and CT scores were estimated separately using an item response theory (IRT) model. Equating procedures were used to place CIL scores from the 2013 and 2018 assessments on the same scale. For equating purposes, all 2013 item parameters were re-estimated concurrently during the ICILS joint calibration process. More detailed information can be found in the IEA International Computer and Information Literacy Study 2018 Technical Report (2020) at https://www.iea.nl/publications/technical-reports/icils-2018-technical-report.
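As a generic illustration of an IRT model (the specific model and calibration details ICILS used are documented in the technical report), a one-parameter (Rasch) item response function expresses the probability of a correct response as a function of student ability and item difficulty on a common logit scale:

```python
import math

def rasch_p(theta, b):
    """Rasch (1PL) item response function: probability of a correct
    response for a student with ability theta on an item with
    difficulty b, both on the logit scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

When ability equals difficulty the probability is 0.5, and it rises toward 1 as ability exceeds difficulty. Placing 2013 and 2018 items on a common scale, as in the equating described above, amounts to estimating difficulties for all items jointly so that scores from both cycles refer to the same underlying scale.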
To keep student burden to a minimum, ICILS purposefully administered a limited number of assessment items to each student, too few to produce accurate individual content-related scale scores for each student. The number of assessment items administered to each student, however, is sufficient to produce accurate group content-related scale scores for subgroups of the population. These scores are transformed during the scaling process into plausible values to characterize students participating in the assessment, given their background characteristics. Plausible values are imputed values and not test scores for individuals in the usual sense. If used individually, they provide biased estimates of the proficiencies of individual students. However, when grouped as intended, plausible values provide unbiased estimates of population characteristics (e.g., means and variances for groups).
Plausible values represent what the performance of an individual on the entire assessment might have been, had it been observed. They are estimated as random draws (five for ICILS scores) from an empirically derived distribution of score values based on the student’s observed responses to assessment items and on background variables. Each random draw from the distribution is considered a representative value from the distribution of potential scale scores for all students in the sample who have similar background characteristics and similar patterns of item responses. Differences between plausible values drawn for a single individual quantify the degree of error (the width of the spread) in the underlying distribution of possible scale scores that could have caused the observed performances.
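Analyses using the five plausible values follow the standard multiple-imputation logic: compute the statistic separately with each plausible value, average the five results, and use the spread among them as the imputation component of the variance. A minimal sketch with hypothetical data (the full ICILS standard error also includes a sampling-variance component, e.g., from replicate weights, which is omitted here):

```python
def pv_group_mean(pv_rows, weights):
    """Group mean and imputation variance from plausible values.
    pv_rows[i][j] is the j-th plausible value for student i; weights
    are the students' sampling weights."""
    n_pv = len(pv_rows[0])
    total_w = sum(weights)
    # One weighted mean per plausible value.
    pv_means = [
        sum(w * row[j] for row, w in zip(pv_rows, weights)) / total_w
        for j in range(n_pv)
    ]
    estimate = sum(pv_means) / n_pv
    # Imputation variance: variance of the per-PV estimates,
    # scaled by (1 + 1/M) for M plausible values (Rubin's rules).
    imp_var = sum((m - estimate) ** 2 for m in pv_means) / (n_pv - 1)
    return estimate, (1 + 1 / n_pv) * imp_var

# Two hypothetical students, five plausible values each, equal weights.
rows = [[500.0, 510.0, 490.0, 500.0, 500.0],
        [480.0, 490.0, 470.0, 480.0, 480.0]]
est, imp_var = pv_group_mean(rows, [1.0, 1.0])
```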
More detailed information can be found in the IEA International Computer and Information Literacy Study 2018 Technical Report (2020) at https://www.iea.nl/publications/technical-reports/icils-2018-technical-report.
ICILS established a CIL achievement scale as a reference point for future international assessments in computer and information literacy. Proficiency levels of CIL were also established to set benchmarks for future ICILS assessments and as an informative way to compare student performance across countries and over time. Students whose results are located within a particular proficiency level are typically able to demonstrate understandings and skills that are associated with that level, as well as knowledge and skills at lower proficiency levels.
The CIL proficiency levels were established in 2013 after consideration of the content and difficulty of the test items. The item content and relative difficulty were analyzed to identify themes of content and processes that could be used to characterize the different ranges, or levels, on the CIL achievement scale. This process was performed iteratively until each level showed distinctive characteristics, and the progression from low to high achievement across the levels was clear. The four proficiency levels and their boundaries are Level 1 (407), Level 2 (492), Level 3 (576), and Level 4 (661) scale points out of 700 total. Student scores below 407 scale points indicate CIL proficiency below the lowest level targeted by the assessment instrument. The CIL proficiency levels did not change from 2013 to 2018.
Given the limited number of CT tasks and score points, it was not possible to establish proficiency levels in the same way as for CIL. Instead, in order to provide broad descriptions of achievement across the scale, CT items were ordered by difficulty, then divided into three groups with equal numbers of items in each group. The descriptions of each region are syntheses of the common elements of students’ CT knowledge, skills, and understanding described by the items within each region. There are three regions in the CT scale: lower (below 459 scale points), middle (between 459 and 589 scale points inclusive), and upper (above 589 scale points), all out of 700 total scale score points. The regions of the CT scale should not be directly compared to the levels of the CIL scale, as they were developed using a different method.
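The cut points above can be expressed as simple classification rules. In the sketch below (illustrative only), each CIL cut point is treated as the lowest score belonging to its level, mirroring the inclusive lower bound stated for the middle CT region; whether a score exactly at a CIL boundary falls in the lower or upper level is an assumption here.

```python
def cil_level(score):
    """Classify a CIL scale score by proficiency level (2013 cut points:
    407, 492, 576, and 661 scale points)."""
    if score < 407:
        return "Below Level 1"
    if score < 492:
        return "Level 1"
    if score < 576:
        return "Level 2"
    if score < 661:
        return "Level 3"
    return "Level 4"

def ct_region(score):
    """Classify a CT scale score into the lower (below 459), middle
    (459 to 589 inclusive), or upper (above 589) region."""
    if score < 459:
        return "lower"
    if score <= 589:
        return "middle"
    return "upper"
```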
More detailed information can be found in the IEA International Computer and Information Literacy Study 2018 Technical Report (2020) at https://www.iea.nl/publications/technical-reports/icils-2018-technical-report.
As with any study, there are limitations to ICILS data that researchers should take into consideration. Estimates produced using data from ICILS are subject to two types of error—nonsampling and sampling errors. Nonsampling errors can be due to errors made in collecting and processing data. Sampling errors can occur because the data were collected from a sample rather than a complete census of the population.
Nonsampling error is a term used to describe variations in the estimates that may be caused by population coverage limitations, nonresponse bias, and measurement error, as well as data collection, processing, and reporting procedures. The sources of nonsampling errors are typically problems like unit and item nonresponse, differences in respondents’ interpretations of the meaning of the survey questions, response differences related to the particular time the survey was conducted, and mistakes in data preparation.
Missing data for survey questionnaires, administrative data, and student assessment items were identified by missing data codes provided by the international data processing center during the data cleaning process for all participating countries. The codes differentiate not administered/missing by design from presented but not answered/invalid. The assessment items also include an additional missing code for not reached. An item was coded as presented but not answered/invalid if the respondent was expected to answer the item based on answers provided for other questions in the sequence but no response was given (e.g., no box was checked in the item that asked, “Are you a girl or a boy?”) or an uninterpretable response was given (e.g., multiple responses to a question calling for a single response). The not administered/missing by design code was used to identify items that were not administered to the student (e.g., items excluded from the student’s test booklet because of the booklet design, which rotates assessment blocks across booklets), items skipped because of skip patterns in the teacher or school principal questionnaires, or items for which it was not logical for the respondent to answer the question (e.g., when the opportunity to respond depends on a filter question). Finally, assessment items that were not reached were identified by a string of consecutive nonresponses continuing through to the end of the assessment.
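The not-reached rule described above amounts to flagging the trailing run of consecutive omissions. The sketch below is a simplified illustration of that rule only; it does not reproduce the full ICILS coding scheme (e.g., the not administered/missing by design and invalid-response codes are omitted).

```python
def classify_responses(responses):
    """Assign a status to each assessment item, in presentation order:
    'answered', 'omitted' (presented but not answered), or 'not reached'
    for the trailing run of consecutive nonresponses at the end of the
    assessment. None denotes that no response was given."""
    statuses = ["answered" if r is not None else "omitted" for r in responses]
    # Walk backward from the last item, reclassifying the trailing omissions.
    i = len(responses) - 1
    while i >= 0 and responses[i] is None:
        statuses[i] = "not reached"
        i -= 1
    return statuses
```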
The three key reporting variables identified in the ICILS data for the United States (student sex, student race/ethnicity, and the percentage of students in the school eligible for free or reduced-price lunch, or FRPL) all have low rates of missing responses. The response rates for these variables exceed the NCES standard of 85 percent and so can be reported without notation. Furthermore, missing FRPL responses for public schools were imputed by substituting values taken from the Common Core of Data (CCD) for the schools in question. The FRPL variable is available only for public schools.
Sampling errors arise when a sample of the population, rather than the whole population, is used to estimate some statistic. Different samples from the same population would likely produce somewhat different estimates of the statistic in question. This means that there is a degree of uncertainty associated with statistics estimated from a sample. This uncertainty is referred to as sampling variance and is usually expressed as the standard error of a statistic estimated from sample data. The approach used for estimating standard errors in ICILS was jackknife repeated replication (JRR). Standard errors can be used as a measure of the precision expected from a particular sample. Standard errors for all of the reported estimates are included in the downloadable Excel tables that accompany each online figure and table at https://nces.ed.gov/surveys/icils/icils2018/theme1.asp. Scroll to the bottom of each web page for the link to the downloadable Excel table.
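The core of a jackknife calculation can be sketched as follows. This is an illustrative JK2-type variant with no additional scaling factor, which is an assumption; the exact replication scheme and any scaling used in ICILS are documented in the technical report.

```python
import math

def jrr_standard_error(full_estimate, replicate_estimates):
    """Jackknife repeated replication sketch: each replicate estimate is the
    statistic recomputed with one replicate weight; the sampling variance is
    the sum of squared deviations of the replicate estimates from the
    full-sample estimate (JK2-type variant, no extra scaling assumed)."""
    variance = sum((rep - full_estimate) ** 2 for rep in replicate_estimates)
    return math.sqrt(variance)
```

For example, a full-sample mean of 500.0 with replicate estimates [501.0, 499.0, 502.0, 498.0] yields a sampling variance of 10.0 under this variant.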
Although not presented in this report, confidence intervals provide another way to make inferences about population statistics in a manner that reflects the sampling error associated with the statistic. The intervals are calculated with a set confidence level, which defines the frequency that the population statistic will fall within the interval. All ICILS significance tests presented in this report use a p value of 0.05, which is equivalent to a 95 percent confidence level. Using that confidence level and assuming a normal distribution, the population value of this statistic can be inferred to lie within the confidence interval in 95 out of 100 replications of the measurement on different samples drawn from the same population. The endpoints of a 95 percent confidence interval can be calculated from the sampled mean and standard error. The lowest endpoint of the interval equals the mean minus the product of 1.96 times the standard error, while the highest endpoint of the interval equals the mean plus the product of 1.96 times the standard error. See the Statistical Procedures section.
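The endpoint calculation described above is direct to express in code; the sketch below simply applies the mean ± 1.96 × standard error rule for a 95 percent confidence level.

```python
def confidence_interval_95(estimate, standard_error):
    """Endpoints of a 95 percent confidence interval under a normal
    approximation: estimate minus and plus 1.96 times the standard error."""
    half_width = 1.96 * standard_error
    return (estimate - half_width, estimate + half_width)
```

For instance, an average of 500.0 with a standard error of 3.0 gives an interval of (494.12, 505.88).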
All ICILS 2018 participants were assured that their data would be confidential. Participants’ privacy was protected throughout data collection. Data security and confidentiality were maintained throughout all phases of the study, including data collection, data creation, data dissemination, and data analysis and reporting.
Potential disclosure can occur when the released data are compared against publicly available data collections that contain similar demographic information. Statistical disclosure control (SDC) measures implemented on the ICILS national data included identifying and masking potential disclosure risks for ICILS schools and adding an additional measure of uncertainty to school, teacher, and student identification through random data swapping.8 All procedures were carefully conducted and reviewed by NCES to ensure the protection of respondent confidentiality while preserving the integrity of the data. In accordance with NCES standard 4-2, confidentiality analyses for the United States were implemented to provide reasonable assurance that public-use data files issued by the IEA and NCES would minimize the risk of disclosure of individual U.S. schools, teachers, or students.
8 The NCES standards 4-2-1 through 4-2-12 (Revised 2012) (https://nces.ed.gov/statprog/2012/) provide the guidelines and methodology required to ensure data confidentiality for data dissemination. Perturbation disclosure limitation techniques are conducted to protect individually identifiable data. For public-use data files, NCES requires analysis and subsequent perturbations to be performed that minimize the possibility of a user matching outliers or unique cases on the file with external (or auxiliary) data sources. Because public-use files allow direct access to individual records, perturbation (such as random data swapping) and coarsening disclosure limitation techniques may both be required (Standard 4-2-8).
Tests of significance
Comparisons made in the text of this report were tested for statistical significance. For example, in the commonly made comparison of education systems’ averages against the average of the United States, tests of statistical significance were used to establish whether or not the observed differences from the U.S. average were statistically significant. The estimation of the standard errors that is required in order to undertake the tests of significance is complicated by the complex sample and assessment designs, both of which generate error variance. Together they mandate a set of statistically complex procedures in order to estimate the correct standard errors. As a consequence, the estimated standard errors contain a sampling variance component estimated by the jackknife repeated replication (JRR) procedure, and, where the assessments are concerned, an additional imputation variance component arising from the assessment design. Details on the procedures used can be found in the WesVar 5.0 User’s Guide (Westat 2007).
In almost all instances, the tests for significance used were standard t-tests.9 These tests fell into two categories according to the nature of the comparison being made: comparisons of independent samples and comparisons of nonindependent samples. Before describing the t-tests used, some background on the two types of comparisons is provided below.
The variance of a difference is equal to the sum of the variances of the two initial variables minus two times the covariance between the two initial variables:

Var(est1 − est2) = Var(est1) + Var(est2) − 2 × Cov(est1, est2)

A sampling distribution has the same characteristics as any distribution, except that its units consist of sample estimates rather than observations. Therefore, within a particular education system, any subsamples will be considered independent only if the categorical variable used to define the subsamples was used as an explicit stratification variable.
Therefore, as for any computation of a standard error in ICILS 2018, replication methods using the supplied replicate weights are used to estimate the standard error on a difference. Use of the replicate weights implicitly incorporates the covariance between the two estimates into the estimate of the standard error on the difference.
The expected value of the covariance will be equal to zero if the two sampled groups are independent. If the two groups are not independent, as is the case with girls and boys attending the same schools within an education system, or comparing an education system’s mean with the international mean that includes that particular country, the expected value of the covariance might differ from zero.
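The mechanics by which replicate weights carry the covariance can be sketched: the difference is formed within each replicate before the jackknife formula is applied, so correlated movement of the two group estimates across replicates is reflected in the result. The numbers and the unscaled JK2-type formula below are illustrative assumptions, not the operational ICILS computation.

```python
import math

def jrr_se_of_difference(full_1, full_2, reps_1, reps_2):
    """SE of (group 1 − group 2) estimated from replicate weights: recompute
    the difference within each replicate, then apply the jackknife formula to
    those per-replicate differences. Because both group estimates come from
    the same replicate sample, their covariance is implicitly accounted for."""
    full_diff = full_1 - full_2
    rep_diffs = [a - b for a, b in zip(reps_1, reps_2)]
    return math.sqrt(sum((d - full_diff) ** 2 for d in rep_diffs))
```

When the two groups happen to be independent, this replicate-based estimate converges on the simple sum-of-variances result, since the covariance term is zero in expectation.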
In ICILS, participating education systems’ samples are independent. Therefore, for any comparison between two education systems, the expected value of the covariance will be equal to zero, and the standard error on the estimated difference is:

SE(θ1 − θ2) = √(SE(θ1)² + SE(θ2)²)

with θ being the tested statistic.
If one wants to determine whether girls’ performance differs from boys’ performance, for example, then, as for all statistical analyses, a null hypothesis has to be tested. In this particular example, it consists of computing the difference between the boys’ performance mean and the girls’ performance mean (or the inverse). The null hypothesis is:

H0: mean(boys) − mean(girls) = 0
To test this null hypothesis, the standard error on this difference is computed and then compared to the observed difference. The respective standard errors on the mean estimate for boys and girls can be easily computed.
Thus, in simple comparisons of independent averages, such as the U.S. average with other education systems’ averages, the following formula was used to compute the t statistic:

t = (est1 − est2) / √(se1² + se2²)
Both est1 and est2 are the estimates being compared (e.g., average of education system A and the U.S. average), and se1 and se2 are the corresponding standard errors of these averages.
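The calculation can be sketched directly from these quantities; the numbers below are invented for illustration.

```python
import math

def t_statistic(est1, se1, est2, se2):
    """t statistic for comparing two independent estimates (e.g., an education
    system's average versus the U.S. average): the difference divided by the
    standard error of the difference."""
    return (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)

# With the .05 significance level used in the report, a difference is reported
# as statistically significant when |t| exceeds 1.96.
t = t_statistic(520.0, 4.0, 505.0, 3.0)  # → 3.0, a significant difference
```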
For ICILS, there was a small number of participating education systems. When a country is compared to the international group, there is an overlap between the samples in the sense that the country is part of the international group. These are referred to as part-whole comparisons. Such comparisons require that the standard error of the mean differences be adjusted to account for the overlap. However, because the U.S. was not included in the international average, the comparison of average scores between the U.S. and the international average could be treated as a comparison of independent samples. All other comparisons are between each country and the U.S. and therefore are independent comparisons. Part-whole adjustments were made for the U.S. national variables, such as race/ethnicity and free or reduced-price lunch (FRPL) eligibility, when comparing subgroups to the U.S. average.
9 Adjustments for multiple comparisons were not applied in any of the t-tests undertaken.
Since the U.S. ICILS 2018 weighted school response rates are below 85 percent, NCES requires an investigation into the potential magnitude of nonresponse bias at the school level in the U.S. sample. The investigation into nonresponse bias at the school level for the U.S. ICILS 2018 effort shows statistically significant relationships between response status and some of the available school characteristics that were examined in the analyses.
For original sample schools (not including substitute schools), eight variables were found to be statistically significantly related to participation in the bivariate analysis: school control; census region; poverty level; school size; grade 8 enrollment; White, non-Hispanic; Hispanic; and free or reduced-price lunch. Additionally, the absolute values of the relative bias for schools in rural areas and for Asian are greater than 10 percent, which indicates potential bias even though no statistically significant relationship was detected. Although each of these findings indicates some potential for nonresponse bias, when all of these factors were considered simultaneously in a regression analysis (with six race/ethnicity variables), private schools, schools in central cities and suburbs, Northeast region, Midwest region, high poverty, medium-sized schools, total school and grade 8 enrollments, Black, non-Hispanic, and Hispanic were significant predictors of school participation. The second model (with summed race/ethnicity percentage) showed that private schools, schools in central cities and suburbs, Northeast region, Midwest region, high poverty, medium-sized schools, total school enrollment, and the summed race/ethnicity percentage were significant predictors of participation. The third model (with summed race/ethnicity percentage using public schools only) showed that schools in suburbs, Northeast region, Midwest region, high poverty, medium-sized schools, free or reduced-price lunch eligibility, the high poverty and free or reduced-price lunch eligibility interaction term, and total school enrollment were significant predictors of school participation among public schools.
For the final sample of schools (with substitute schools), seven variables were found to be statistically significantly related to participation in the bivariate analysis: school control; census region; poverty level; school size; White, non-Hispanic; Hispanic; and free or reduced-price lunch. Additionally, the absolute values of the relative bias for schools in towns and for American Indian or Alaska Native are greater than 10 percent. When all of these factors were considered simultaneously in a regression analysis (with six race/ethnicity variables), private schools, schools in central cities, Northeast region, Midwest region, high poverty, medium-sized schools, total school enrollment, Black, non-Hispanic, Hispanic, American Indian or Alaska Native, Hawaiian/Pacific Islander, and two or more races were significant predictors of participation. The second model (with summed race/ethnicity percentage) showed that private schools, schools in central cities, Northeast region, Midwest region, high poverty, medium-sized schools, total school enrollment, and the summed race/ethnicity percentage were significant predictors of participation. The third model (with summed race/ethnicity percentage using public schools only) showed that schools in central cities, Northeast region, Midwest region, medium-sized schools, free or reduced-price lunch eligibility, total school enrollment, and the summed race/ethnicity percentage were significant predictors of school participation among public schools.
For the final sample of schools (with substitute schools) with school nonresponse adjustments applied to the weights, no variables were found to be statistically significantly related to participation in the bivariate analysis, though the absolute values of the relative bias for schools in central cities and in towns are greater than 10 percent. The multivariate regression analysis cannot be conducted after the school nonresponse adjustments are applied, because the nonresponse-adjusted weights do not apply to the nonresponding schools.
In sum, the investigation into nonresponse bias at the school level in the U.S. ICILS 2018 data provides evidence that there is some potential for nonresponse bias in the ICILS participating original sample based on the characteristics studied. It also suggests that, while there is some evidence that the use of substitute schools reduced the potential for bias, it has not reduced it substantially. However, after the application of school nonresponse adjustments, there is little evidence of resulting potential bias in the available frame variables and correlated variables in the final sample.
More detailed information can be found in the ICILS 2018 Technical Report at https://www.iea.nl/publications/technical-reports/icils-2018-technical-report.