EDUCATION INDICATORS: An International Perspective

### Using data from sample surveys

Two important sources of data for this report provide estimates based on sample surveys. Figures from the International Assessment of Educational Progress (IAEP) are derived from samples of students and school administrators. Figures from the International Association for the Evaluation of Educational Achievement's (IEA) Reading Literacy Study are derived from samples of students and teachers. Because data on the entire population are not collected in sample surveys, the resulting estimates may differ somewhat from estimates that would have been obtained from the whole population using the same instruments, instructions, and procedures.

The samples used in surveys are selected from a large number of possible samples of the same size that could have been selected using the same sample design. Estimates derived from the different samples would differ from each other. The difference between a sample estimate and the average of all possible samples is called the sampling deviation. The standard or sampling error of a survey estimate is a measure of the variation among the estimates from all possible samples and, thus, is a measure of the precision with which an estimate from a particular sample approximates the average result of all possible samples.

The estimated standard errors for two sample statistics can be used to estimate the precision of the difference between the two statistics and to avoid concluding there is an actual difference when the difference in sample estimates may only be due to sampling error. The need to be aware of the precision of differences arises, for example, when comparing mean proficiency scores between countries in the IAEP. The standard error, se(A-B), of the difference between sample estimate A and sample estimate B (when A and B do not overlap) is the square root of the sum of the squared standard errors, se(A-B) = sqrt(se_A^2 + se_B^2), where se_A and se_B are the standard errors of sample estimates A and B, respectively. When the ratio (called a t-statistic) of the difference between the two sample statistics to the standard error of the difference as calculated above is less than 2, one cannot be sure the difference is not due only to sampling error, and caution should be taken in concluding there is a difference. In this report, for example, if the t-statistic were less than 1.96, we would not conclude there is a difference. Some analysts, however, use the less restrictive criterion of 1.64, which corresponds to a 10 percent significance level.

To illustrate this further, consider the data on reading proficiency of 14-year-olds in table 7a and the associated standard errors in table 7b. The estimated average overall reading proficiency score for the sample of 14-year-olds in the United States was 535. For the sample in France, the estimated average was 549. Is there enough evidence to safely conclude that this difference is not due only to sampling error and that the actual average reading proficiency of 14-year-olds in the United States is lower than that of their counterparts in France? The standard errors for these two estimates are 4.8 and 4.3, respectively. Using the above formula, the standard error of the difference is calculated as 6.4. The ratio of the estimated difference of 14 to the standard error of the difference of 6.4 is 2.19. Using the table below, it can be seen that there is less than a 5 percent chance that the 14-point difference is due only to sampling error, and one may safely conclude that the average proficiency score of 14-year-olds in the United States is lower than that of their counterparts in France.
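The arithmetic of this comparison can be checked directly. The short sketch below (not part of the report) reproduces the United States-France calculation using the estimates from tables 7a and 7b:

```python
import math

def se_difference(se_a, se_b):
    """Standard error of the difference between two non-overlapping
    sample estimates: sqrt(se_a^2 + se_b^2)."""
    return math.sqrt(se_a**2 + se_b**2)

# Average reading proficiency of 14-year-olds (table 7a) and
# standard errors (table 7b)
us_mean, fr_mean = 535, 549
us_se, fr_se = 4.8, 4.3

se_diff = se_difference(us_se, fr_se)   # about 6.4
t = abs(fr_mean - us_mean) / se_diff    # about 2.2

print(f"se of difference = {se_diff:.1f}, t = {t:.2f}")
print("significant at 5 percent" if t > 1.96 else "not significant")
```

Note that the report rounds the standard error of the difference to 6.4 before dividing, giving a t-statistic of 2.19; carrying full precision gives 2.17. Either way the ratio exceeds 1.96, so the conclusion is the same.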

Percent chance that a difference is due only to sampling error:

```
t-statistic        1.00    1.64    1.96    2.00    2.57
Percent chance*      32      10       5     4.5       1
```
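The percentages in this table follow from a two-tailed test under a standard normal approximation to the sampling distribution. Assuming that approximation, they can be recomputed as follows (a sketch, not from the report):

```python
from statistics import NormalDist

def percent_chance(t):
    """Two-tailed probability (in percent) that a difference at least
    t standard errors from zero is due only to sampling error,
    under a standard normal approximation."""
    return 100 * 2 * (1 - NormalDist().cdf(t))

for t in (1.00, 1.64, 1.96, 2.00, 2.57):
    print(f"t = {t:.2f}: {percent_chance(t):.1f} percent")
```

The values round to those in the table (for example, t = 1.00 gives 31.7 percent, shown as 32; t = 2.00 gives 4.55 percent, shown as 4.5).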

The above procedure applies if one is only comparing students in France and the United States. However, most readers draw conclusions after making multiple comparisons within a table. In these circumstances, the chance that at least one of the many differences examined is only a result of sampling error increases (accumulates) as the number of comparisons increases. The Bonferroni procedure can be used to ensure that the likelihood of any of the comparisons being only a result of sampling error stays below 5 percent; it does so by reducing the allowable risk for each individual comparison. If N comparisons are being made, divide 5 percent by N and require that the risk of a difference being due only to sampling error be less than 5/N percent for each comparison. The table below provides critical values for the t-statistic of each comparison when it is part of N comparisons.

```
Number of comparisons      1      2      3      4      5     10     20     40
Critical value*         1.96   2.24   2.39   2.50   2.58   2.81   3.02   3.23
```
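Under the same normal approximation, each Bonferroni critical value is the t-value whose two-tailed probability equals 5/N percent. A minimal sketch (not from the report) that reproduces the table:

```python
from statistics import NormalDist

def bonferroni_critical(n_comparisons, alpha=0.05):
    """Critical t-value for one of n comparisons such that the overall
    chance that any difference is due only to sampling error stays
    below alpha (two-tailed, normal approximation)."""
    per_comparison = alpha / n_comparisons
    return NormalDist().inv_cdf(1 - per_comparison / 2)

for n in (1, 2, 3, 4, 5, 10, 20, 40):
    print(f"{n:2d} comparisons: critical value {bonferroni_critical(n):.2f}")
```

For instance, with N = 3 the per-comparison risk is 5/3 percent, which corresponds to the critical value 2.39 used in the example that follows.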

For example, a reader might examine table 7a not for the purpose of comparing the United States to France but to compare the United States to, say, the other G-7 countries, three of which appear in the table. After making three comparisons, the reader may want to draw the conclusion: "Fourteen-year-olds in only one of the three countries, France, had higher average reading proficiency scores than 14-year-olds in the United States." However, because the reader is now making three comparisons and not just one, the critical value of t is 2.39 and not 1.96. Thus, since 2.19 (the t-statistic for the United States-France comparison) is not larger than 2.39, the conclusion is not safe to make.

It should be noted that most of the standard error estimates presented in subsequent sections and in the original documents are approximations. That is, to derive estimates of standard errors that would be applicable to a wide variety of items and could be prepared at a moderate cost, a number of approximations were required. As a result, the standard error estimates provide a general order of magnitude rather than the exact standard error for any specific item.

*Based on a 2-tailed test.

In addition to such sampling errors, all surveys, both universe and sample, are subject to design, reporting, and processing errors and errors due to nonresponse. To the extent possible, these nonsampling errors are kept to a minimum by methods built into the survey procedures. In general, however, the effects of nonsampling errors are more difficult to gauge than those produced by sampling variability.