IES Blog

Institute of Education Sciences

Making Meaning Out of Statistics

By Dr. Peggy G. Carr, NCES Commissioner

The United States does not have a centralized statistical system like Canada or Sweden, but the federal statistical system we do have now speaks largely with one voice thanks to the Office of Management and Budget’s U.S. Chief Statistician, the Evidence Act of 2018, and proposed regulations to clearly integrate extensive and detailed OMB statistical policy directives into applications of the Act. The Evidence Act guides the work of the federal statistical system to help ensure that official federal statistics, like those we report here at NCES, are collected, analyzed, and reported in a way that the public can trust. The statistics we put out, such as the number and types of schools in the United States, are the building blocks upon which policymakers make policy, educators plan the future of schooling, researchers develop hypotheses about how education works, and parents and the public track the progress of the education system. They all need to know they can trust these statistics—that they are accurate and unbiased, and uninfluenced by political interests or the whims of the statistical methodologist producing the numbers. Through the Evidence Act and our work with colleagues in the federal statistical system, we’ve established guidelines and standards for what we can say, what we won’t say, and what we can’t say. And they help ensure that we do not drift into territory that is beyond our mission.

Given how much thought NCES and the federal statistical system more broadly have put into the way we talk about our statistics, a recent IES blog post, “Statistically Significant Doesn't Mean Meaningful,” naturally piqued my interest. I thought back to a question on this very topic that I had on my Ph.D. qualifying statistical comprehensive essay exam. I still remember nailing the answer to that question all these years later. But it’s a tough one (the difference between “statistically significant” and “meaningful” findings), and it’s one that cuts to the heart of the role of statistical agencies in producing numbers that people can trust.

I want to talk about the blog post, both the important issue it raises and the potential solution it proposes, as a way to illustrate key differences in how we, as a federal agency producing statistics for the public, approach statistics and how researchers sometimes approach statistics. Both are properly seeking information but often for very different purposes requiring different techniques. And I want to say I was particularly empathetic with the issues raised in the blog post given my decades of background managing the National Assessment of Educational Progress (NAEP) and U.S. participation in major international assessments like the Program for International Student Assessment (PISA). In recent years, given NAEP’s large sample size, it is not unheard of for two estimates (e.g., average scores) to round to the same whole number and yet be statistically different. Or, in the case of U.S. PISA results, for scores to be 13 points apart and yet not be statistically different. So, the problem that the blog post raises is both long-standing and quite familiar to me.


The Problem   

Here’s the knotty problem the blog post raises: Sometimes, when NCES says there’s no statistically significant difference between two numbers, some people think we are saying there’s no difference between those two numbers at all. For example, on the 2022 NAEP, we estimated an average score of 212 for the Denver Public School District in grade 4 reading. That score for Denver in 2019 was 217. When we reported the 2022 results, we said that there was no statistically significant difference in Denver’s grade 4 reading scores between 2019 and 2022 even though the estimated scores in the two years were 5 points apart. This is because the Denver scores in 2019 and 2022 were estimates based on samples of students, and we could not rule out that, had we assessed every single Denver fourth-grader in both years, we would have found, say, that the scores were 212 in both years. NAEP assessments are like polls: there is uncertainty (a margin of error) around the results. Saying that there was no statistically significant difference between two estimates is not the same as saying that there definitely was no difference. We’re simply saying we don’t have enough evidence to say for sure (or nearly sure) there was a difference.
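To see how a 5-point gap can fail a significance test, here is a minimal Python sketch. It is not NAEP's actual procedure. It assumes a simple two-sided normal (z) test at the .05 level and, for the per-year figure, two independent samples with equal standard errors. It only asks how large the uncertainty would have to be for a 5-point gap to fall short. No actual Denver standard errors are used, because none are given here.

```python
from math import sqrt
from statistics import NormalDist

# Critical value for a two-sided test at the .05 level (about 1.96).
CRITICAL_Z = NormalDist().inv_cdf(0.975)

# Denver grade 4 reading estimates quoted in the text.
score_2019, score_2022 = 217, 212
observed_gap = abs(score_2022 - score_2019)

# The gap is "not statistically significant" whenever gap / SE < critical value,
# i.e., whenever the standard error of the difference exceeds this bound.
min_se_of_difference = observed_gap / CRITICAL_Z

# ASSUMPTION: two independent samples with equal standard errors, so that
# SE(difference) = sqrt(2) * SE(single year).
min_se_per_year = min_se_of_difference / sqrt(2)

print(f"SE of the difference must exceed {min_se_of_difference:.2f} points")
print(f"(about {min_se_per_year:.2f} points per year under the equal-SE assumption)")
```

Under these assumptions, a margin of error of only a couple of points on each year's estimate is enough to make a 5-point gap inconclusive.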

Making these kinds of uncertain results clear to the public can be very difficult, and I applaud IES for raising the issue and proposing a solution. Unfortunately, the proposed solution—a “Bayesian” approach that “borrows” data from one state to estimate scores for another and that relies more than we are comfortable with, as a government statistical agency, on the judgment of the statistician running the analysis—can hurt more than help.


Two Big Concerns With a Bayesian Approach for Releasing NAEP Results

Big Concern #1: It “borrows” information across jurisdictions, grades, and subjects.

Big Concern #2: The statistical agency decides the threshold for what’s “meaningful.”

Let me say more about the two big concerns I have about the Bayesian approach proposed in the IES blog post for releasing NAEP results. And, before going into these concerns, I want to emphasize that these are concerns specifically with using this approach to release NAEP results. The statistical theory on which Bayesian methods are based is central to our estimation procedures for NAEP. And you’ll see later that we believe there are times when the Bayesian approach is the right statistical approach for releasing results.


Big Concern #1: The Proposed Approach Borrows Information Across Jurisdictions, Grades, and Subjects

The Bayesian approach proposed in the IES blog post uses data on student achievement in one state to estimate performance in another, performance at grade 8 to estimate performance at grade 4, and performance in mathematics to estimate performance in reading. The approach uses the fact that changes in scores across states often correlate highly with each other. Certainly, when COVID disrupted schooling across the nation, we saw declines in student achievement across the states. In other words, we saw apparent correlations. The Bayesian approach starts from an assumption that states’ changes in achievement correlate with each other and uses that to predict the likelihood that the average score for an individual state or district has increased or decreased. It can do the same thing with correlations in changes in achievement across subjects and across grade levels—which also often correlate highly. This is a very clever approach for research purposes.

However, it is not an approach that official statistics, especially NAEP results, should be built upon. In a country where curricular decisions are made at the local level and reforms are targeted at specific grade levels and in specific subjects, letting grade 8 mathematics achievement in, say, Houston influence what we report for grade 4 reading in, say, Denver, would be very suspect. Also, if we used Houston results to estimate Denver results, or math results to estimate reading results, or grade 8 results to estimate grade 4 results, we might also miss out on chances of detecting interesting differences in results.


Big Concern #2: The Bayesian Approach Puts the Statistical Agency in the Position of Deciding What’s “Meaningful”

A second big concern is the extent to which the proposed Bayesian approach would require the statisticians at NCES to set a threshold for what would be considered a “meaningful” difference. In this method, the statistician sets that threshold and then the statistical model reports out the probability that a reported difference is bigger or smaller than that threshold. As an example, the blog post suggests 3 NAEP scale score points as a “meaningful” change and presents this value as grounded in hard data. But in reality, the definition of a “meaningful” difference is a judgment call. And making the judgment is messy. The IES blog post concedes that this is a major flaw, even as it endorses broad application of these methods: “Here's a challenge: We all know how the p<.05 threshold leads to ‘p-hacking’; how can we spot and avoid Bayesian bouts of ‘threshold hacking,’ where different stakeholders argue for different thresholds that suit their interests?”

That’s exactly the pitfall to avoid! We certainly do our best to tell our audiences, from lay people to fellow statisticians, what the results “mean.” But we do not tell our stakeholders whether changes or differences in scores are large enough to be deemed "meaningful," as this depends on the context and the particular usage of the results.
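To illustrate why the choice of threshold matters, here is a deliberately simplified Python sketch. It is not the model proposed in the IES blog post, which borrows information across jurisdictions, grades, and subjects. It assumes a flat prior and a normal approximation, so the uncertainty about the true change is centered on the estimate. The standard error of 2.5 points is a hypothetical placeholder, not a published NAEP value.

```python
from statistics import NormalDist

# Denver grade 4 reading, 2022 minus 2019 (estimates quoted earlier in this post).
estimated_change = 212 - 217

# HYPOTHETICAL standard error of the change, chosen only for illustration.
ASSUMED_SE = 2.5

# ASSUMPTION: flat prior + normal approximation, so the "posterior" for the
# true change is Normal(estimate, SE). A real Bayesian model would differ.
posterior = NormalDist(mu=estimated_change, sigma=ASSUMED_SE)

def prob_decline_exceeds(threshold_points):
    """Probability that the true decline is larger than the threshold."""
    return posterior.cdf(-threshold_points)

# The same data, summarized under three different "meaningful" thresholds.
probabilities = {t: prob_decline_exceeds(t) for t in (1, 3, 5)}
for threshold, prob in probabilities.items():
    print(f"P(decline > {threshold} points) = {prob:.2f}")
```

With identical data, the reported probability moves from near certainty to a coin flip as the threshold moves from 1 point to 5, which is why the party setting the threshold carries so much weight.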

This is not to say that we statisticians don’t use judgment in our work. In fact, the “p<.05” threshold for statistical significance, which is the main issue the IES blog post has with the reporting of NAEP results, is a judgment. But it’s a judgment that has been widely established across the statistics and research worlds for decades and is built into the statistical standards of NCES and many other federal statistical agencies. And it’s a judgment specific to statistics: It’s meant to help account for margins of error when investigating if there is a difference at all, not a judgment about whether the difference exceeds a threshold to count as “meaningful.” Because we use this widely established standard, readers don’t have to wonder, “is NAEP setting its own standards?” or, perhaps more important, “is NAEP telling us, the public, what is meaningful?” Should the “p<.05” standard be revisited? Maybe. As I note below, this is a question that is often asked in the statistical community. Should NCES and NAEP go on their own and tell our readers what is a meaningful result? No. That’s for our readers to decide.


What Does the Statistical Community Have to Say?

The largest community of statistical experts in the United States—the American Statistical Association (ASA)—has a lot to say on this topic. In recent years, they grappled with the p-value dilemma and put out a statement in 2016 that described misuses of tests of statistical significance. An editorial that later appeared in the American Statistician (an ASA journal) even recommended eliminating the use of statistical significance and the so-called “p-values” on which they are based. As you might imagine, there was considerable debate in the statistical and research community as a result. So in 2019, the president of the ASA convened a task force, which clarified that the editorial was not an official ASA policy. The task force concluded: “P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. . . . Much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty, variability, multiplicity, and replicability.”

In other words: Don't throw the baby out with the bathwater!


So, When Should NCES Use a Bayesian Approach?

Although I have been arguing against the use of a Bayesian approach for the release of official NAEP results, there’s much to say for Bayesian approaches when you need them. As the IES blog post notes, the Census Bureau uses a Bayesian method in estimating statistics for small geographic areas where they do not have enough data to make a more direct estimation. NCES has also used similar Bayesian methods for many years, where appropriate. For example, we have used Bayesian approaches to estimate adult literacy rates for small geographic areas for 20 years, dating back to the National Assessment of Adult Literacy (NAAL) of 2003. We use them today in our “small area estimates” of workplace skill levels in U.S. states and counties from the Program for the International Assessment of Adult Competencies (PIAAC). And when we do, we make it abundantly clear that these are indirect, heavily model-dependent estimates.

In other words, the Bayesian approach is a valuable tool in the toolbox of a statistical agency. However, is it the right tool for producing official statistics, where samples, by design, meet the reporting standards for producing direct estimates? The short answer is “no.”


Conclusion

Clearly and accurately reporting official statistics can be a challenge, and we are always looking for new approaches that can help our stakeholders better understand all the data we collect. I began this blog post noting the role of the federal statistical system and our adherence to high standards of objectivity and transparency, as well as our efforts to express our sometimes-complicated statistical findings as accurately and clearly as we can. IES has recently published another blog post describing some great use cases for Bayesian approaches, as well as methodological advances funded by our sister center, the National Center for Education Research. But the key point I took away from this blog post was that the Bayesian approach was great for research purposes, where we expect the researcher to make lots of assumptions (and other researchers to challenge them). That’s research, not official statistics, where we must stress clarity, accuracy, objectivity, and transparency.  

I will end with a modest proposal. Let NCES stick to reporting statistics, including NAEP results, and leave questions about what is meaningful to readers . . . to the readers!

Measuring the Homeschool Population

By Sarah Grady

How many children are educated at home instead of school? Although many of our data collections focus on what happens in public or private schools, the National Center for Education Statistics (NCES) tries to capture as many facets of education as possible, including the number of homeschooled youth and the characteristics of this population of learners. NCES was one of the first organizations to attempt to estimate the number of homeschoolers in the United States using a rigorous sample survey of households. The Current Population Survey included homeschooling questions in 1994, which helped NCES refine its approach toward measuring homeschooling.[i] As part of the National Household Education Surveys Program (NHES), NCES published homeschooling estimates starting in 1999. The homeschooling rate has grown from 1.7 percent of the school-aged student population in 1999 to 3.4 percent in 2012.[ii]

NCES recently released a Statistical Analysis Report called Homeschooling in the United States: 2012. Findings from the report, detailed in a recent blog, show that there is a diverse group of students who are homeschooled. Although NCES makes every attempt to report data on homeschooled students, this diversity can make it difficult to accurately measure all facets of the homeschool population.

One of the primary challenges in collecting relevant data on homeschool students is that no complete list of homeschoolers exists, so it can be difficult to locate these individuals. When lists of homeschoolers can be located, problems exist with the level of coverage that they provide. For example, lists of members of local and national homeschooling organizations do not include homeschooling families unaffiliated with the organizations. Customer lists from homeschool curriculum vendors exclude families who access curricula from other sources such as the Internet, public libraries, and general purpose bookstores. For these reasons, collecting data about homeschooling requires a nationally representative household survey, which begins by finding households in which at least one student is homeschooled.

Once located, families can vary in their interpretation of what homeschooling is. NCES asks households if anyone in the household is “currently in homeschool instead of attending a public or private school for some or all classes.” About 18 percent of homeschoolers are in a brick-and-mortar school part-time, and families may vary in the extent to which they consider children in school part-time to be homeschoolers. Additionally, with the growth of virtual education and cyber schools, some parents are choosing to have the child schooled at home but not to personally provide instruction. Whether or not parents of students in cyber schools define their child as homeschooled likely varies from family to family.

NHES data collection begins with a random sample of addresses distributed across the entire U.S. However, most addresses will not contain any homeschooled students. Because of the low incidence of homeschooling relative to the U.S. population, a large number of households must be screened to find homeschooling students.  This leaves us with a small number of completed surveys from homeschooling families relative to studies of students in brick-and-mortar schools. For example, in 2012, the NHES program contacted 159,994 addresses and ended with 397 completed homeschooling surveys.

Smaller analytic samples can often result in less precise estimates. Therefore, NCES can estimate only the size of the total homeschool population and some key characteristics of homeschoolers with confidence, but we are not able to accurately report data for very small subgroups. For example, NCES can report the distribution of homeschoolers by race and ethnicity,[iii] but more specific breakouts of the characteristics of homeschooled students within these racial/ethnic groups often cannot be reported due to the small sample sizes and large standard errors. For a more comprehensive explanation of this issue, please see our blog post on standard errors.  The reason why this matters is that local-level research on homeschooling families suggests that homeschooling communities across the country may be very diverse.[iv] For example, Black, urban homeschooling families in these studies are often very different from White, rural homeschooling families. Low incidence and high heterogeneity lead to estimates with lower precision.
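As a rough illustration of why small subgroups are hard to report, the Python sketch below uses the textbook standard error of a proportion under simple random sampling, sqrt(p(1 - p) / n). This is an assumption for illustration only: NHES uses a complex sample design with weights, so its actual standard errors are computed differently and are typically larger. The subgroup size of 40 is hypothetical.

```python
from math import sqrt

def srs_standard_error(proportion, sample_size):
    """Standard error of a proportion under simple random sampling (an
    illustrative simplification; NHES uses a complex design)."""
    return sqrt(proportion * (1 - proportion) / sample_size)

COMPLETED_SURVEYS = 397      # completed homeschooling surveys in 2012 (from the text)
HYPOTHETICAL_SUBGROUP = 40   # HYPOTHETICAL subgroup size, for illustration only

# Use p = 0.5, the proportion at which the standard error is largest.
se_full = srs_standard_error(0.5, COMPLETED_SURVEYS)
se_subgroup = srs_standard_error(0.5, HYPOTHETICAL_SUBGROUP)

print(f"Full homeschool sample: about {100 * se_full:.1f} percentage points")
print(f"Subgroup of 40:         about {100 * se_subgroup:.1f} percentage points")
```

In this simplified setting, cutting the sample to about a tenth of its size roughly triples the standard error, so breakouts within subgroups quickly become too imprecise to publish.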

Despite these constraints, the data from NHES continue to be the most comprehensive that we have on homeschoolers. NCES continues to collect data on this important population. The 2016 NHES recently completed collection on homeschooling students, and those data will be released in fall 2017.

[i] Henke, R., and Kaufman, P. (2000). Issues Related to Estimating the Home-school Population in the United States with National Household Survey Data (NCES 2000-311). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Washington, DC.

[ii] Redford, J., Battle, D., and Bielick, S. (2016). Homeschooling in the United States: 2012 (NCES 2016-096). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Washington, DC.

[iv] Hanna, L.G. (2012). Homeschooling Education: Longitudinal Study of Methods, Materials, and Curricula. Education and Urban Society 44(5): 609–631.

Statistical Concepts in Brief: Embracing the Errors

By Lauren Musu-Gillette

EDITOR’S NOTE: This is part of a series of blog posts about statistical concepts that NCES uses as a part of its work.

Many of the important findings in NCES reports are based on data gathered from samples of the U.S. population. These sample surveys provide an estimate of what data would look like if the full population had participated in the survey, but at a great savings in both time and costs. However, because the entire population is not included, there is always some degree of uncertainty associated with an estimate from a sample survey. For those using the data, knowing the size of this uncertainty is important both for evaluating the reliability of an estimate and for statistical testing to determine whether two estimates are significantly different from one another.

NCES reports standard errors for all data from sample surveys. In addition to providing these values to the public, NCES uses them for statistical testing purposes. Within annual reports such as the Condition of Education, Indicators of School Crime and Safety, and Trends in High School Drop Out and Completion Rates in the United States, NCES uses statistical testing to determine whether estimates for certain groups are statistically significantly different from one another. Specific language is tied to the results of these tests. For example, in comparing male and female employment rates in the Condition of Education, the indicator states that the overall employment rate for young males 20 to 24 years old was higher than the rate for young females 20 to 24 years old (72 vs. 66 percent) in 2014. Use of the term “higher” indicates that statistical testing was performed to compare these two groups and the results were statistically significant.

If differences between groups are not statistically significant, NCES uses the phrases “no measurable differences” or “no statistically significant differences at the .05 level.” This is because we do not know for certain that differences do not exist at the population level, just that our statistical tests of the available data were unable to detect differences. This could be because there is in fact no difference, but it could also be due to other reasons, such as a small sample size or large standard errors for a particular group. Heterogeneity, or large amounts of variability, within a sample can also contribute to larger standard errors.

Some of the populations of interest to education stakeholders are quite small, for example, Pacific Islander or American Indian/Alaska Native students. As a consequence, these groups are typically represented by relatively small samples, and their estimates are often less precise than those of larger groups. This lower precision is reflected in larger standard errors for these groups. For example, in the table above the standard error for White students who reported having been in 0 physical fights anywhere is 0.70, whereas the standard error is 4.95 for Pacific Islander students and 7.39 for American Indian/Alaska Native students. This means that the uncertainty around the estimates for Pacific Islander and American Indian/Alaska Native students is much larger than it is for White students. Because of these larger standard errors, differences between these groups that seem large may not be statistically significant. When this occurs, NCES analysts may state that large apparent differences are not statistically significant. NCES data users can use standard errors to help make valid comparisons using the data that we release to the public.
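One way to get a feel for these standard errors is to convert them into approximate 95 percent margins of error. The Python sketch below assumes a normal approximation (margin of about 1.96 times the standard error) and assumes the standard errors are in percentage points. NCES may use different critical values in practice.

```python
from statistics import NormalDist

# Multiplier for an approximate 95 percent margin of error (about 1.96).
Z_95 = NormalDist().inv_cdf(0.975)

# Standard errors quoted in the text (assumed to be in percentage points).
standard_errors = {
    "White": 0.70,
    "Pacific Islander": 4.95,
    "American Indian/Alaska Native": 7.39,
}

# ASSUMPTION: normal approximation, margin of error = z * standard error.
margins = {group: Z_95 * se for group, se in standard_errors.items()}

for group, margin in margins.items():
    print(f"{group}: plus or minus {margin:.1f}")
```

Under this approximation, the estimate for White students is pinned down to within about a point and a half, while the estimate for American Indian/Alaska Native students could plausibly be off by more than 14 points in either direction.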

Another example of how standard errors can affect whether or not sample differences are statistically significant can be seen when comparing NAEP score changes by state. Between 2013 and 2015, mathematics scores changed by 3 points for fourth-grade public school students in both Mississippi and Louisiana. However, this change was only significant for Mississippi. This is because the standard error for the change in scale scores for Mississippi was 1.2, whereas the standard error for Louisiana was 1.6. The larger standard error, and therefore larger degree of uncertainty around the estimate, factors into the statistical tests that determine whether a difference is statistically significant. This difference in standard errors could reflect the size of the samples in Mississippi and Louisiana, or other factors such as the degree to which the assessed students are representative of the population of their respective states.
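The Mississippi and Louisiana comparison can be approximated with a short Python sketch. It assumes a two-sided z-test on the rounded figures quoted above. NAEP's actual procedure (unrounded values, t-tests with design-based degrees of freedom, and any multiple-comparison adjustments) may differ, so the p-values are illustrative only.

```python
from statistics import NormalDist

def two_sided_p_value(change, standard_error):
    """Two-sided p-value for a change, using a normal (z) approximation."""
    z = change / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Rounded values quoted in the text: a 3-point change in both states.
mississippi_p = two_sided_p_value(3, 1.2)   # z = 2.5
louisiana_p = two_sided_p_value(3, 1.6)     # z = 1.875

for state, p in (("Mississippi", mississippi_p), ("Louisiana", louisiana_p)):
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{state}: p = {p:.3f} ({verdict} at the .05 level)")
```

The same 3-point change clears the .05 threshold with a standard error of 1.2 but not with a standard error of 1.6, which matches the pattern described above.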

Researchers may also be interested in using standard errors to compute confidence intervals for an estimate. Stay tuned for a future blog where we’ll outline why researchers may want to do this and how it can be accomplished.