State and County Literacy Estimates

Estimation Approach

The 2003 NAAL and 1992 NALS surveys were not designed to provide policymakers and educators with estimates, based on the survey data alone, of the percentages of adults at the lowest literacy level for all U.S. states and counties. To address the need for such estimates, a statistical model was developed to produce indirect (i.e., model-dependent) county estimates of the percentages of adults lacking Basic Prose Literacy Skills (BPLS) based on the 2003 NAAL data. The same modeling approach was then applied to the 1992 NALS data. A hierarchical Bayes model was adopted, fitted with a Markov Chain Monte Carlo (MCMC) method (Gelman et al. 2004) and implemented in the WinBUGS software (Lunn et al. 2000). The key components of the modeling approach were (1) a sampling model for the sampling variability present in the county-level direct survey estimates of the percentages of adults lacking BPLS (for counties with some respondents) and (2) a logit model (linear logistic regression model) to predict the survey estimates from a set of auxiliary variables that were available and measured consistently for all counties.

From direct county estimates to indirect county estimates

A large pool of items was used in the literacy assessment to enable the surveys to cover a broad range of literacy tasks. However, to keep the testing time at a reasonable level, only a subset of the items in the item pool was administered to each participant. Since respondents took different sets of items that could differ in level of difficulty, it would be inappropriate to base the literacy estimates simply on the number of correct answers obtained. Instead, a marginal maximum likelihood method was applied using the AM software (http://am.air.org/) to represent each individual's estimated proficiency as a probability distribution over all possible scores. The probability distributions for sampled individuals were then used in the estimation process to compute direct estimates of the percentage of adults lacking BPLS for individual counties included in the NAAL or NALS samples.
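
As a rough illustration of how per-respondent probability distributions could feed a direct county estimate, the sketch below takes, for each sampled adult, the probability mass falling below a cut score and averages those probabilities using the survey weights. The function name, the cut score, the score grid, and the toy data are all hypothetical; the actual computation was carried out with the AM software and may differ in detail.

```python
def direct_estimate(respondents, cut_score):
    """Weighted percentage of adults whose proficiency falls below cut_score.

    respondents: list of (weight, distribution) pairs, where distribution is a
    dict mapping each possible score to its probability for that respondent.
    """
    total_weight = sum(weight for weight, _ in respondents)
    weighted_below = 0.0
    for weight, dist in respondents:
        # Probability that this respondent's score lies below the cut score.
        p_below = sum(p for score, p in dist.items() if score < cut_score)
        weighted_below += weight * p_below
    return 100.0 * weighted_below / total_weight

# Hypothetical county sample: (survey weight, {score: probability}).
sample = [
    (120.0, {150: 0.6, 250: 0.4}),
    (80.0, {150: 0.1, 250: 0.9}),
]
estimate = direct_estimate(sample, cut_score=200)
```

Working with probabilities rather than a single score per person keeps the uncertainty in each respondent's proficiency in the county estimate.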

Variance estimates were then produced for the direct county estimates using a Taylor series approximation that took account of the survey weights and the clustered sample design within counties (see, for example, Wolter 1985). Given the relatively small sample sizes in most counties, the direct estimates were generally imprecise. Since the variance estimates were also subject to considerable sampling variability, they were smoothed using a generalized variance function approach. The direct estimates and smoothed variance estimates for the sampled counties were then used in the subsequent logit model analysis to compute model-dependent, indirect estimates for all counties in the United States.

Predictor variables

A key aspect of the small area estimation modeling for the 2003 NAAL and 1992 NALS was finding auxiliary variables that are measured consistently across all U.S. counties and that are effective predictors of the county percentages of adults lacking BPLS. The process of model development involved the compilation of a large number of auxiliary variables that were known or hypothesized to be correlated with literacy. The final set of predictors was selected based on its ability to best account for the between-county variation in the direct estimates of the percentage lacking BPLS for sampled counties.

The best set of predictors for the 2003 NAAL model comprised the following six variables:

Percentage of the county population who were foreign-born and who had stayed in the United States for 20 years or less;
Percentage of the county population age 25 and older with only a high school education or less;
Percentage of the county population who were Black or Hispanic;
Percentage of the county population in households with incomes below 150 percent of poverty level;
Indicator variable identifying the New England and North Central census divisions; and
Indicator variable identifying the SAAL states.

Apart from the SAAL state indicator, all the predictor variables were obtained from the 2000 Census of Population.

The predictors for the 1992 NALS model were:

Percentage of the county population for whom English was not a native language;
Percentage of the county population age 25 and older with only a high school education or less;
Percentage of the county population who were Black;
Percentage of the county population who were Hispanic;
Indicator variables identifying the New England and North Central census divisions; and
Indicator variable for counties in a SALS state.

All predictor variables were obtained from the 1990 Census of Population, with the exception of the SALS state indicator.

Small area model for state and county estimates

A single area-level statistical model that acknowledges the sampling variability present in the dependent variable was used to predict the county percentages of adults lacking BPLS. The logit of the direct county-level estimated percentages of adults lacking BPLS was used as the dependent variable in the model and the county-level variables described above were used as the predictor variables. The logit model also included random state and county effects.
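
The logit link used in this model can be sketched as follows. The coefficient values, predictor values, and random-effect values are made up for illustration, since the source does not report them, and the sketch omits the sampling-error component of the full model.

```python
import math

def logit(p):
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse of logit: maps a real number back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))

def predicted_proportion(intercept, coefs, predictors, state_effect, county_effect):
    """Linear predictor on the logit scale plus random effects, back-transformed."""
    linear = intercept + sum(b * x for b, x in zip(coefs, predictors))
    return inv_logit(linear + state_effect + county_effect)

# Hypothetical values, not taken from the NAAL or NALS models.
p = predicted_proportion(
    intercept=-2.0,
    coefs=[0.03, 0.02],
    predictors=[10.0, 25.0],  # e.g., two county-level percentages
    state_effect=0.1,
    county_effect=-0.05,
)
```

Modeling on the logit scale guarantees that back-transformed predictions stay strictly between 0 and 100 percent.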

Hierarchical Bayesian estimation techniques with the Markov Chain Monte Carlo (MCMC) approach were used to estimate the model parameters. The multiple estimates of model parameter values produced by the MCMC approach were used to produce posterior distributions of indirect estimates of the percentages of adults lacking BPLS for individual counties, whether or not they were included in the sample. Summary statistics for these posterior distributions, including their means (the indirect county estimates) and credible intervals, were also computed.

The indirect estimates for states were computed as weighted aggregates of the county indirect estimates, where the weights represent the county's proportion of the state's household population of adults aged 16 and over. Because county populations of household residents aged 16 years or older were not available for 2003 or 1992, the weight for each county was estimated based on available data from the U.S. Census Bureau for the year in question. The Census Bureau 2003 postcensal estimated residency counts include populations that are outside the scope of the NAAL small area estimation population, such as persons in group quarters and institutions. The 2003 estimated residency counts for ages 16 and older were therefore adjusted by the ratio of Census 2000 counts for persons within households to total population. These initial county population estimates for 2003 and 1992 were then calibrated to the sum of the final sampling weights for the SAAL and SALS states, respectively, and to the sum of the NAAL and NALS final sampling weights for counties in the remainder of each census region. This calibration served to improve the consistency between the indirect and direct estimates.
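
The aggregation step can be illustrated with a minimal sketch. It assumes point estimates and already-adjusted population counts; the function names and the numbers are hypothetical, and the calibration to the survey weights is not shown.

```python
def household_population(postcensal_16plus, census_household, census_total):
    """Scale a postcensal count by the Census 2000 household-to-total ratio."""
    return postcensal_16plus * census_household / census_total

def state_estimate(county_estimates, county_populations):
    """Population-weighted average of county indirect estimates (in percent)."""
    total = sum(county_populations)
    return sum(
        est * pop / total
        for est, pop in zip(county_estimates, county_populations)
    )

# Hypothetical three-county state.
estimates = [10.0, 20.0, 15.0]
populations = [50_000, 30_000, 20_000]
state = state_estimate(estimates, populations)

# Hypothetical adjustment: 97 percent of the Census 2000 population in households.
adjusted = household_population(40_000, 970, 1_000)
```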

Credible intervals

The primary measure of precision reported for each state or county indirect estimate is its credible interval. The 95 percent credible intervals for both the indirect county estimates and the indirect state estimates were computed by calculating the 2.5 percent (lower bound) and 97.5 percent (upper bound) quantiles of the simulated posterior distributions for the indirect estimates of the percentage of adults lacking BPLS obtained from the MCMC samples. Since these posterior distributions are skewed, the credible intervals are asymmetric around the estimates. For more information about credible intervals, refer to Uncertainty in Estimates.
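
The quantile calculation can be sketched as follows. The draws are simulated from a right-skewed distribution as a stand-in for MCMC output, and the interpolation rule in the helper is one common convention, assumed here rather than taken from the source.

```python
import random

def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    return quantile(ordered, tail), quantile(ordered, 1.0 - tail)

# Simulated right-skewed draws standing in for MCMC output (not real data).
rng = random.Random(0)
draws = [rng.lognormvariate(2.5, 0.3) for _ in range(4000)]
mean = sum(draws) / len(draws)
lower, upper = credible_interval(draws)
```

With skewed draws such as these, the upper limit lies farther from the posterior mean than the lower limit does, which is the asymmetry described above.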

Comparison of estimates

Credible intervals for the differences between the indirect estimates for two states, for two counties, and for the two time points for a given county or state are provided in order to assist data users in making comparisons between states, between counties, and across time. Two methods have been applied to determine whether the 95 percent credible interval for the difference between two indirect estimates contains 0.

For the first method, the difference in the 2003 indirect estimate between two states or counties (within a particular state) was computed for each MCMC sample, and the credible interval for the difference was derived from the resultant posterior distribution. In this case, the results reported are the estimated difference and its credible interval. In view of the very large number of possible pairwise comparisons between counties across the nation (about 5 million), this procedure has been applied only for differences in the 2003 estimate between any pair of states and between any pair of counties that are within the same state.
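
A minimal sketch of this first method is given below. It assumes the two lists hold draws from the same MCMC iterations, so that each difference is formed within a single sample; the draws here are simulated independently for illustration only, and the helper names are hypothetical.

```python
import random

def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def difference_summary(draws_a, draws_b):
    """Posterior mean and 95 percent credible interval of A - B, draw by draw."""
    diffs = sorted(a - b for a, b in zip(draws_a, draws_b))
    mean_diff = sum(diffs) / len(diffs)
    return mean_diff, quantile(diffs, 0.025), quantile(diffs, 0.975)

# Simulated draws for two counties (not real data).
rng = random.Random(1)
county_a = [rng.gauss(20.0, 1.0) for _ in range(2000)]
county_b = [rng.gauss(10.0, 1.0) for _ in range(2000)]
mean_diff, lower, upper = difference_summary(county_a, county_b)
contains_zero = lower <= 0.0 <= upper
```

Differencing within each draw lets the interval reflect any posterior correlation between the two estimates, which intervals computed separately for each area cannot capture.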

The other method simply determines whether the credible interval for the difference contains 0, without computing that interval. This method has been applied for differences between the indirect estimates for counties in different states, for differences between the 2003 and 1992 indirect estimates for single states or counties, and for all differences between states and counties for 1992. This determination was readily made in two situations:

If the credible interval for an indirect estimate for county (or state) i does not overlap with the credible interval for the indirect estimate for county (or state) j, then one can conclude that the credible interval for the difference does not contain 0. For example if the credible interval for one county is from 6 percent to 12 percent, and for another county it is from 13 to 21 percent, then the credible interval for the difference will not include 0.
If the credible interval for the indirect estimate for county (or state) i is fully nested within the credible interval for the indirect estimate for county (or state) j, then the credible interval for the difference will contain 0. For example, if one county has a credible interval of 6 to 18 percent, and another county has a credible interval of 7 to 17 percent, then the credible interval for the difference will include 0.
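
The two rules above can be expressed as a small decision function. The function name and the returned labels are hypothetical; the intervals in the usage lines are the examples given in the text.

```python
def quick_check(interval_i, interval_j):
    """Apply the non-overlap and nesting rules to two credible intervals.

    Returns "excludes 0" when the intervals do not overlap, "contains 0" when
    one interval is fully nested within the other, and "partial overlap" when
    neither rule applies.
    """
    lo_i, hi_i = interval_i
    lo_j, hi_j = interval_j
    if hi_i < lo_j or hi_j < lo_i:
        return "excludes 0"
    i_in_j = lo_j <= lo_i and hi_i <= hi_j
    j_in_i = lo_i <= lo_j and hi_j <= hi_i
    if i_in_j or j_in_i:
        return "contains 0"
    return "partial overlap"

no_overlap = quick_check((6, 12), (13, 21))
nested = quick_check((6, 18), (7, 17))
partial = quick_check((6, 18), (12, 24))
```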

The situation in which the credible intervals for two indirect county (or state) estimates partially overlap (e.g., the credible interval is from 6 to 18 percent for one county and from 12 to 24 percent for another county) is less straightforward. In this case the following approximate method was used. First, the standard deviations of the posterior distributions of the indirect estimates were estimated as one-fourth of the 95 percent credible interval widths for the estimates, under the assumption of approximate normality. Next, the standard deviation of the posterior distribution of the difference was estimated as the square root of the sum of the squares of the individual standard deviations (assuming that the covariance between the estimates was zero). Then the approximate credible interval was calculated as the indirect estimate of the difference plus or minus twice the approximated standard deviation of the posterior distribution for the difference. In view of the ad hoc nature of this method, the only result reported is whether or not this interval contains 0. No comparisons of estimates at different aggregation levels are allowed because the ad hoc formula does not account for comparisons between these estimates.
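
The approximate method can be written out as a short calculation. The intervals in the usage lines are those of the partial-overlap example above; the point estimates of 11 and 17 percent are hypothetical, since the text does not give them.

```python
import math

def approx_difference_interval(est_i, ci_i, est_j, ci_j):
    """Approximate 95 percent credible interval for est_i - est_j.

    Each standard deviation is taken as one-fourth of the interval width, and
    the two estimates are treated as uncorrelated.
    """
    sd_i = (ci_i[1] - ci_i[0]) / 4.0
    sd_j = (ci_j[1] - ci_j[0]) / 4.0
    sd_diff = math.sqrt(sd_i ** 2 + sd_j ** 2)
    diff = est_i - est_j
    return diff - 2.0 * sd_diff, diff + 2.0 * sd_diff

lower, upper = approx_difference_interval(11.0, (6.0, 18.0), 17.0, (12.0, 24.0))
contains_zero = lower <= 0.0 <= upper
```

For these inputs each standard deviation is 3, the standard deviation of the difference is the square root of 18, and the resulting interval contains 0.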

Results from this approximate procedure were compared with results from the first method described above for a subset of 8,887 pairs of 2003 indirect estimates for counties in different states. In 7,549 of these 8,887 pairwise differences (85 percent), the credible interval computed with the first method contained 0; in only one of those cases did the approximate interval not contain 0. In the remaining 1,338 cases (15 percent), the credible interval computed with the first method did not contain 0, and in 73 percent of them the approximate interval also did not contain 0. The approximate procedure is thus conservative in the sense that it sometimes indicates that the credible interval contains 0 when in fact it does not. In 81 percent of the 364 cases where the results differed, the credible interval from the first method had one limit that was less than a percentage point from 0. Attempts to develop an alternative approximation showed no improvement.
