- What is small area estimation and why is it necessary?
- What states were oversampled?
- How does the over sampling of certain states affect the indirect estimates?
- What is the target population for the indirect estimates and is it the same as the NAAL direct estimates?
- How many mental disability and language barrier cases are there?
- What do the four literacy levels mean?
- What does "lacking" *Basic* prose literacy measure?
- Why do you compute the indirect literacy estimate only for prose?
- What variables were considered as predictors for the small area model?
- What is the statistical model behind the small area estimation process?
- Can you give examples of other indirect estimates?
- How do you know whether the model works?
- How accurate are the indirect estimates?
- Why does the website limit my choice to one pairwise comparison at a time?
- What is the difference between a credible interval and a confidence interval?
- What are the national estimates of low literacy?

The NAAL and NALS sample sizes are large enough to provide reasonably precise standard survey "direct" estimates of literacy levels for the nation's adults and for major population groups of interest such as gender and age. In addition, reasonably precise direct estimates of literacy levels can be produced for those states that participated in the SAAL and SALS surveys and for their major subdomains. However, the sample sizes in other states and in jurisdictions within states, such as counties, are not large enough to produce direct estimates of adequate precision (some larger states may have sufficient sample sizes but the survey design does not support state-level estimation). Indeed, some states and most counties in the nation have no sample in the surveys. Nevertheless, policymakers, business leaders, and educators/researchers often need literacy information for states and counties.

In response to this need, NCES has used a statistical modeling approach to produce model-dependent estimates of the percentages of adults in the lowest literacy level on the prose scale for all states and counties in the nation. These estimates are called "indirect" estimates to distinguish them from standard survey or "direct" estimates that are derived directly from responses of individuals who live in an area included in the assessment. The indirect estimates are produced using small area estimation techniques that rely both on literacy estimates from other geographic areas included in the assessment and on other variables such as educational attainment that are available for all counties from "auxiliary" data produced by other sources (such as the decennial Census). This approach uses sample information from all counties to "borrow strength" in producing the indirect estimates. By creating a model that predicts literacy levels for counties in the sample from the auxiliary data, the model can then be used to make predictions for all counties and states. Rao (2003) and Jiang and Lahiri (2006) provide comprehensive overviews and comparisons of models and methods for small area estimation.

The states oversampled in 2003 (SAAL states) were Kentucky, Maryland, Massachusetts, Missouri, New York, and Oklahoma. The states oversampled in 1992 (SALS states) were California, Illinois, Indiana, Iowa, Louisiana, New Jersey, New York, Ohio, Pennsylvania, Texas, and Washington.

The main purpose of the SAAL and SALS samples was to enable states to produce reliable direct state estimates of literacy levels for all scales, at all levels, and for their major subgroups. The larger sample sizes in these states were also beneficial in producing generally more precise state and county indirect estimates of the percentages of adults lacking *Basic Prose Literacy Skills (BPLS)*.

The NAAL and the NALS household samples were designed to be nationally representative samples of the population of persons who were 16 years of age or older, excluding persons not living in households or college dormitories, at the time of the interview. This population is the starting point for both the direct and indirect estimates. Adults who could not be tested because of a mental disability that precluded conducting the interview do not contribute to either the direct or the indirect estimates. The direct estimates also exclude adults who were unable to take the assessment because of a language barrier. However, these adults are included in the indirect estimates and are classified as lacking *Basic Prose Literacy Skills (BPLS)* on the grounds that they can be considered to be at the lowest level of English literacy. As a result, the indirect estimates of the percentages of adults lacking *BPLS* are not comparable to the percentages of adults *Below Basic* in prose literacy in other NAAL or NALS published results.

In addition to the household samples, both NAAL and NALS included samples of adults from federal and state prisons. The inmate samples did not contribute to the indirect county and state estimates presented in this report.

Of the adults sampled for NAAL, 1 percent were classified as mental disability cases and 2 percent were classified as language barrier cases. The NALS had the same percentages of mental disability and language barrier cases as the NAAL.

The NAAL used a set of four categories, *Below Basic*, *Basic*, *Intermediate*, and *Proficient*, to describe the literacy levels of the adult population in prose, document, and quantitative literacy. For definitions of the four levels, see NAAL's webpage on Performance Levels. The indirect estimates were computed for prose only. See question 8 for further explanation.

Adults in the *Below Basic* group and those not able to take the assessment because of a language barrier are classified as lacking *Basic Prose Literacy Skills (BPLS)*. The percentage of those who lack *BPLS* reflects the magnitude of the adult household population at the lowest level of English literacy. The literacy of adults who lack *BPLS* ranges from being unable to read and understand any written information to being able only to locate easily identifiable information in short, commonplace prose text in English, but nothing more advanced. For the indirect estimates, adults who were not able to take the assessment because of a language barrier are included.
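To illustrate how this classification changes the percentage, the sketch below compares a *Below Basic* percentage (language barrier cases excluded) with a percentage lacking *BPLS* (language barrier cases included and counted as lacking *BPLS*). The counts are hypothetical and are not NAAL or NALS data; the function names are illustrative only.

```python
def percent_below_basic(below_basic: float, assessed: float) -> float:
    """Percentage Below Basic among adults who took the assessment."""
    return 100.0 * below_basic / assessed


def percent_lacking_bpls(below_basic: float, assessed: float, language_barrier: float) -> float:
    """Percentage lacking BPLS: language barrier cases are added to both
    the numerator and the denominator."""
    return 100.0 * (below_basic + language_barrier) / (assessed + language_barrier)


# Hypothetical weighted counts (illustration only).
assessed = 990.0          # adults who completed the assessment
below_basic = 135.0       # of those, adults scoring Below Basic
language_barrier = 10.0   # adults not assessed because of a language barrier

bb = percent_below_basic(below_basic, assessed)
bpls = percent_lacking_bpls(below_basic, assessed, language_barrier)
print(f"Below Basic: {bb:.1f} percent; lacking BPLS: {bpls:.1f} percent")
```

Because the two percentages use different numerators and denominators, they should not be compared directly.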

Three components of literacy were measured in the 2003 NAAL and the 1992 NALS: prose, document, and quantitative. Reviews of the NAAL literacy (direct) estimates showed that prose performed better in measuring literacy skills at the lower end of the literacy scales than did the other components.

More than 100 county-level variables across 20 major types of variables (e.g., poverty, income, education, occupation) were examined as potential predictors for the percentage of adults lacking *Basic Prose Literacy Skills* in the small area modeling used to produce the 2003 NAAL indirect estimates. The primary source was county-level data from the 2000 Census of Population. Summary File 3 (SF3) was used to extract county-level auxiliary variables. The SF3 contains the Census "short form" items (asked of all households) including information about age, gender, race, Hispanic or Latino origin, household relationship, and owner/renter status. The SF3 also contains the Census "long form" data coming from questions asked of about one-sixth of America's households. The questions ask about income, education, language spoken, housing structure, housing costs, commuting, and many other topics. In addition to the Census of Population, various other sources were used for obtaining county-level and state-level auxiliary variables, for example, the Bureau of Economic Analysis (BEA) per capita personal income estimates for local areas, the Census Bureau's Small Area Income and Poverty Estimates (SAIPE) program, and the U.S. Department of Agriculture (USDA) Economic Research Service Rural-Urban Continuum Codes program. For the 1992 NALS model, in general the variables used in the final Hierarchical Bayes (HB) model for 2003 were considered (using the 1990 Census variable definitions) along with language spoken from the 1990 Census long form. A list of the final predictor variables is given in the Estimation Approach.

The statistical model used to produce the indirect county estimates of the percentages of adults lacking *Basic Prose Literacy Skills (BPLS)* was developed using the 2003 NAAL data; the same modeling approach was then applied to the 1992 NALS data. A Hierarchical Bayes (HB) model was adopted using a Markov Chain Monte Carlo (MCMC) method, and was implemented using the WinBUGS software (Lunn et al. 2000). The key component of the approach was to develop a logit model (linear logistic regression model) to predict the direct county percentages of adults lacking *BPLS* for counties with sample respondents from a set of auxiliary variables that were available and measured consistently for all counties. Non-informative prior distributions were used for the model parameters.
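As a minimal sketch of the logit link only (not the fitted NCES model), the snippet below maps a linear combination of county-level predictors to a predicted percentage lacking *BPLS*. The predictor names, coefficient values, and intercept are hypothetical placeholders; the actual predictors are those listed in the Estimation Approach, and in an HB model the coefficients have posterior distributions rather than fixed values.

```python
import math


def inverse_logit(x: float) -> float:
    """Map a value on the logit scale back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_percent_lacking_bpls(predictors: dict, coefficients: dict, intercept: float) -> float:
    """Predicted percentage from a linear predictor on the logit scale."""
    linear = intercept + sum(coefficients[name] * value for name, value in predictors.items())
    return 100.0 * inverse_logit(linear)


# Hypothetical coefficients and county values (illustration only).
coefficients = {"pct_no_high_school": 0.04, "pct_below_poverty": 0.03}
county = {"pct_no_high_school": 20.0, "pct_below_poverty": 15.0}

estimate = predict_percent_lacking_bpls(county, coefficients, intercept=-3.0)
print(f"Predicted percentage lacking BPLS: {estimate:.1f}")
```

Working on the logit scale keeps every predicted percentage between 0 and 100, whatever the values of the predictors.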

The posterior distributions for the model parameters were used to produce the indirect estimates for all U.S. counties based on their values for the predictor variables and incorporating information about the direct estimates in counties with sample data. The state estimates were created by aggregating the county estimates, again using an HB approach. See Small Area Estimation Method for state and county estimate for more information.
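One way to picture the aggregation step is sketched below, assuming that each posterior draw of the county percentages is combined into a state percentage using county adult population counts as weights. This weighting scheme, the draws, and the population figures are assumptions made for illustration; the technical report describes the procedure actually used.

```python
from statistics import mean


def state_draws(county_draws, county_populations):
    """Population-weighted state percentage for each posterior draw.

    county_draws[c][i] is draw i of the percentage for county c.
    """
    total = sum(county_populations)
    n_draws = len(county_draws[0])
    return [
        sum(pop * draws[i] for pop, draws in zip(county_populations, county_draws)) / total
        for i in range(n_draws)
    ]


# Hypothetical posterior draws (percent lacking BPLS) for two counties.
county_draws = [
    [10.0, 12.0, 11.0],  # county A
    [20.0, 18.0, 22.0],  # county B
]
county_populations = [30000, 10000]  # hypothetical adult populations

draws = state_draws(county_draws, county_populations)
state_estimate = mean(draws)  # posterior mean of the state percentage
print(f"State estimate: {state_estimate:.2f} percent")
```

Aggregating draw by draw, rather than averaging county point estimates, carries the uncertainty in the county estimates through to the state estimate.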

See the NAAL Small Area Estimation Technical Report (U.S. Department of Education, National Center for Education Statistics, 2007) for further details of the model.

The Census Bureau's Small Area Income and Poverty Estimates (SAIPE) is another example of indirect estimates. SAIPE provides annual estimates of income and poverty for states, counties, and school districts. Indirect estimates are also produced for the National Survey on Drug Use and Health. Other examples can be found in the Federal Committee on Statistical Methodology, Statistical Policy Working Paper 21.

A number of methods were used to evaluate the fit of the Hierarchical Bayes (HB) models to the county direct estimates. None of the methods indicated appreciable problems with the final models. For the 2003 NAAL, alternative models were fit to the data to determine whether the model results were sensitive either to the prior distributions used for modeling or to the set of auxiliary variables used in the model. This analysis supported the choice of the final model and indicated that the indirect estimates were not sensitive to the variants of the model that were investigated. The final model also proved satisfactory with regard to several diagnostic tests of fit. In addition, comparisons of direct estimates for a variety of domains defined along different dimensions with aggregations of the indirect county estimates for those domains showed a close correspondence in each case. A comparison between the NAAL and NALS results showed that the models were generally comparable in their ability to fit the data at the county level.

Overall, the levels of precision of the 2003 and 1992 model estimates for sample counties are fairly comparable. The county estimates have median coefficients of variation (CVs) of 33.0 percent for the 2003 NAAL and 34.7 percent for the 1992 NALS. The state estimates are more precise, with median CVs of 14.0 percent and 15.3 percent for the 2003 NAAL and the 1992 NALS, respectively. Overall, the analysis of the 2003 and 1992 results indicated that gains in precision were achieved in the indirect estimates for SAAL and SALS states as a result of their increased sample size.
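For readers unfamiliar with the CV, the sketch below uses the standard definition: the standard error of an estimate (here, assumed to be the posterior standard deviation) divided by the estimate, expressed as a percentage. The estimates and standard errors shown are hypothetical, not values from the NAAL or NALS results.

```python
def coefficient_of_variation(estimate: float, standard_error: float) -> float:
    """CV in percent: the standard error relative to the size of the estimate."""
    if estimate <= 0:
        raise ValueError("estimate must be positive to compute a CV")
    return 100.0 * standard_error / estimate


# Hypothetical values (percent lacking BPLS and its standard error).
county_cv = coefficient_of_variation(estimate=12.0, standard_error=4.0)
state_cv = coefficient_of_variation(estimate=14.0, standard_error=2.0)
print(f"County CV: {county_cv:.1f} percent; state CV: {state_cv:.1f} percent")
```

A smaller CV means the estimate is more precise relative to its size, which is why the state estimates are described as more precise than the county estimates.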

Even when the credible interval for a difference does not include 0, there remains a statistical risk that there is in fact no true difference. As the number of comparisons increases, so does the risk of falsely concluding that one or more of the differences being compared is significant. To focus users on specific comparisons, the pairwise comparison tool is constructed to allow only one comparison at a time.
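A rough illustration of how this risk grows is sketched below. It assumes that each comparison independently carries a 5 percent risk of a false conclusion when there is no true difference. Both the independence and the 5 percent figure are simplifying assumptions, not properties of the comparison tool.

```python
def chance_of_any_false_difference(n_comparisons: int, risk_per_comparison: float = 0.05) -> float:
    """Probability of at least one false 'significant' result among
    independent comparisons when no true differences exist."""
    return 1.0 - (1.0 - risk_per_comparison) ** n_comparisons


for k in (1, 5, 10, 20):
    risk = chance_of_any_false_difference(k)
    print(f"{k:2d} comparisons: {100.0 * risk:.1f} percent chance of at least one false difference")
```

Under these assumptions the risk rises from 5 percent for a single comparison to roughly 40 percent for ten comparisons.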

A credible interval for the percentage of the adult population in a county or state lacking *Basic Prose Literacy Skills (BPLS)* is an interval that contains the true value of the percentage lacking *BPLS* with a specified probability (often chosen to be 95 percent), given initial assumptions about what the value of this percentage may be and the information provided by the data. A confidence interval uses the available data to describe the range of values of the true percentage of the population lacking *BPLS* that are consistent with the calculated estimate. In the context of hypothesis testing, a 95 percent confidence interval for an estimate of the percentage of the population lacking *BPLS* is the range of values for which the hypothesis that the true percentage equals that value would be accepted at the 95 percent confidence level. Using traditional hypothesis testing, we would reject the hypothesis that a value outside a given confidence interval could have produced the observed value of this percentage, at the level of confidence associated with the confidence interval.
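The two kinds of interval can be compared on a toy example. The sketch below assumes a simple random sample with a hypothetical 30 of 200 adults lacking *BPLS*. It computes a normal-approximation (Wald) confidence interval and a credible interval from a Beta posterior under a uniform prior. Neither calculation reflects the NAAL sample design or the HB model; they only show how the two intervals are built.

```python
import math
import random
from statistics import NormalDist


def wald_confidence_interval(successes: int, n: int, level: float = 0.95):
    """Frequentist interval: estimate plus or minus z standard errors."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


def beta_credible_interval(successes: int, n: int, level: float = 0.95,
                           n_draws: int = 20000, seed: int = 0):
    """Bayesian interval: equal-tailed quantiles of Monte Carlo draws from the
    Beta(1 + successes, 1 + failures) posterior implied by a uniform prior."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(1 + successes, 1 + n - successes) for _ in range(n_draws))
    tail = (1.0 - level) / 2.0
    lower = draws[int(tail * n_draws)]
    upper = draws[int((1.0 - tail) * n_draws) - 1]
    return lower, upper


# Hypothetical sample: 30 of 200 adults lack BPLS.
ci = wald_confidence_interval(30, 200)
cri = beta_credible_interval(30, 200)
print(f"95 percent confidence interval: {100 * ci[0]:.1f} to {100 * ci[1]:.1f} percent")
print(f"95 percent credible interval:   {100 * cri[0]:.1f} to {100 * cri[1]:.1f} percent")
```

The intervals are numerically similar here, but they answer different questions: the credible interval is a probability statement about the true percentage given the prior and the data, while the confidence interval describes the values that a hypothesis test would not reject.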

The national direct estimates of the percentages of adults lacking *BPLS* are 14.5 percent for the 2003 NAAL and 14.7 percent for the 1992 NALS. In comparison, the national direct estimates of the percentages *Below Basic* in prose literacy are 13.6 percent for the NAAL and 13.8 percent for the NALS.