Overview of the Assessment
Reporting the Assessment—Scale Scores and Achievement Levels
Description of Science Performance by Item Maps for Each Grade
Results are Estimates
NAEP Reporting Groups
Cautions in Interpretations
NAEP assesses science performance by administering assessments to samples of students who are representative of the nation's students. The content of the NAEP science assessment is determined by a framework incorporating expert perspectives about science knowledge and its measurement. Read more about what the assessment measures, how it was developed, who took the assessment, and how the assessment was administered. This page describes elements of the main NAEP science assessment, and does not apply to the long-term trend science assessment (which was discontinued in 1999). Read more about the difference between the main and long-term trend NAEP assessments.
Beginning in 2002, the NAEP national sample was obtained by aggregating the samples of public school students from each state and jurisdiction, and then supplementing the aggregate sample with a nationally representative sample of students from nonpublic schools, rather than by obtaining an independently selected national sample. As a consequence, the national sample size increased, and smaller differences between years or between groups of students were found to be statistically significant than would have been detected in previous assessments. In keeping with past practice, all statistically significant differences are indicated in the current web results pages.
The 2011 science assessment was administered at grade 8 only so that results from both the NAEP mathematics and science assessments in 2011 could be linked to results for the 2011 Trends in International Mathematics and Science Study (TIMSS). The results from the linking study will be presented in a separate report and will show how the performance of eighth-grade students in states and selected districts compares to international benchmarks.
The NAEP Science Framework (9.99 MB) describes the types of questions to be included in the 2011 assessment and how they should be designed and scored. The National Assessment Governing Board oversees the development of NAEP frameworks that describe the specific knowledge and skills to be assessed in each subject. The 2009 and 2011 assessments were developed using the same framework, allowing the results from the two assessment years to be compared.
In 2009, a new science framework was introduced. As a result, the earlier trend line cannot be maintained, and results from 2009 and 2011 cannot be compared with results from previous years in which the assessment was administered. Find out how the 2009 framework differs from the previous framework. The 2009 science assessment, like other NAEP assessments since 1996, permitted test accommodations for students with disabilities (SD) and for English language learners (ELL). Read more about NAEP's policy of inclusion.
Differences between groups of students are discussed only if they have been determined to be statistically significant.
The results of student performance on the NAEP science assessment are presented on this website in two ways: as average scores on the NAEP science scale and as the percentages of students attaining NAEP science achievement levels. The average scale scores represent how students performed on the assessment. The achievement levels represent how that performance measured up against defined expectations for achievement. Thus, the average scale scores represent what students know and can do, while the achievement-level results indicate the degree to which student performance meets expectations of what they should know and be able to do.
Average science scale score results are based on the NAEP science scale, which ranges from 0 to 300.
In 2009, the first year of the new science framework, an overall science scale was developed at each grade. The scale at each grade ranges from 0 to 300 with a mean of 150 and a standard deviation of 35. Although the score ranges are identical, the scales were derived independently at each grade; therefore, scales cannot be compared across grades. In 2009 and 2011, the overall science scale was derived from an analysis of all science items (representing the three fields of science: Physical Science, Life Science, and Earth and Space Sciences) together.
Average scores for each of the three science content areas specified in the framework are also available and are reported on the 0 to 300 scale. Because subscales are set separately for each content area, comparisons cannot be made from one area to another.
Achievement-level results are presented in terms of science achievement levels as adopted by the National Assessment Governing Board, and are intended to measure how well students' actual achievement matches the achievement desired of them. For each grade tested, the Governing Board has adopted three achievement levels: Basic, Proficient, and Advanced. For reporting purposes, the achievement-level cut scores are placed on the science scales, resulting in four ranges: below Basic, Basic, Proficient, and Advanced.
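The way three cut scores divide the scale into four reporting ranges can be sketched in a few lines of Python. This is only an illustration: the function name, the cut-score values, and the choice to count a score exactly at a cut in the higher range are assumptions, not values or rules taken from the assessment.

```python
from bisect import bisect_right

# The four reporting ranges, lowest to highest.
LEVELS = ["below Basic", "Basic", "Proficient", "Advanced"]


def achievement_level(score, cuts):
    """Return the reporting range for a scale score.

    cuts is a (basic, proficient, advanced) tuple in ascending order.
    A score equal to a cut score is counted in the higher range
    (an assumption made for this sketch).
    """
    basic, proficient, advanced = cuts
    if not (0 <= basic < proficient < advanced <= 300):
        raise ValueError("cut scores must ascend within the 0-300 scale")
    if not (0 <= score <= 300):
        raise ValueError("score must lie on the 0-300 scale")
    # bisect_right counts how many cut scores are <= score.
    return LEVELS[bisect_right(cuts, score)]


# Hypothetical cut scores, for illustration only.
EXAMPLE_CUTS = (140, 170, 215)

for s in (120, 140, 169, 170, 250):
    print(s, achievement_level(s, EXAMPLE_CUTS))
```

Percentages "at or above" a level can then be obtained by counting students in that range and every higher range.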
The Governing Board established its achievement levels in 1996 based upon the science content framework and a standard-setting process involving a cross section of educators and interested citizens from across the nation, who were asked to judge what students should know and be able to do relative to the content reflected in the NAEP science framework. The achievement levels and cut scores were revised in 2009 to reflect the new science framework. Explore the new achievement-level descriptions for science. As provided by law, NCES has determined that the achievement levels are to be considered developmental and should be interpreted and used with caution. However, both NCES and the Governing Board believe these performance standards are useful for understanding trends in student achievement.
The performance of fourth-, eighth-, and twelfth-graders can be illustrated by maps that position question descriptions along the NAEP science scale. The descriptions used on these maps focus on the science skill or knowledge needed to answer the question. For multiple-choice questions, the description indicates the skill or knowledge demonstrated by selection of the correct option; for constructed-response questions, the description takes into account the skill or knowledge specified by the different levels of scoring criteria for that question.
Approximately 25 to 30 science questions per grade are placed on the item map. Explore the item maps for grade 8 science.
The average scores and percentages presented on this website are estimates because they are based on representative samples of students rather than on the entire population of students. Moreover, the collection of subject-area questions used at each grade level is but a sample of the many questions that could have been asked. As such, NAEP results are subject to a measure of uncertainty, reflected in the standard error of the estimates. The standard errors for the estimated scale scores and percentages in the figures and tables presented on this website are available through the NAEP Data Explorer.
Results are provided for groups of students defined by shared characteristics—gender, race or ethnicity, eligibility for free/reduced-price school lunch, students with disabilities, and students identified as English language learners. Based on participation rate criteria, results are reported for subpopulations only when sufficient numbers of students and adequate school representation are present. The minimum requirement is at least 62 students in a particular group from at least five primary sampling units (PSUs). However, the data for all students, regardless of whether their group was reported separately, were included in computing overall results. Explanations of the reporting groups are presented below.
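The minimum reporting requirement described above (at least 62 students from at least five PSUs) can be expressed as a simple check. The sketch below assumes a hypothetical input format, one PSU identifier per assessed student in the group; it does not model the participation rate criteria, which are not detailed here.

```python
MIN_STUDENTS = 62  # minimum number of students in the group
MIN_PSUS = 5       # minimum number of primary sampling units represented


def is_reportable(student_psus):
    """Return True if a group meets the minimum reporting requirement.

    student_psus: iterable holding one PSU identifier per assessed
    student in the group (a hypothetical input format).
    """
    psus = list(student_psus)
    return len(psus) >= MIN_STUDENTS and len(set(psus)) >= MIN_PSUS


# 62 students spread evenly over PSUs 0..4 meet both conditions.
print(is_reportable(i % 5 for i in range(62)))
# 100 students drawn from only 4 PSUs do not.
print(is_reportable(i % 4 for i in range(100)))
```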
Results are reported separately for males and females.
Prior to 2011, student race/ethnicity was obtained from school records and reported for the six mutually exclusive categories shown below:
Students who identified with more than one of the other five categories were classified as “other” and were included as part of the "unclassified" category along with students who had a background other than the ones listed or whose race/ethnicity could not be determined.
In compliance with new standards from the U.S. Office of Management and Budget for collecting and reporting data on race/ethnicity, additional information was collected in 2011 so that results could be reported separately for Asian students, Native Hawaiian/Other Pacific Islander students, and students identifying with two or more races. Beginning in 2011, all of the students participating in NAEP were identified by school reports as one of the seven racial/ethnic categories listed below:
As in earlier years, students identified as Hispanic were classified as Hispanic in 2011 even if they were also identified with another racial/ethnic group. Students who identified with two or more of the other racial/ethnic groups (e.g., White and Black) would have been classified as “other” and reported as part of the "unclassified" category prior to 2011, and classified as “two or more races” in 2011.
When comparing the results for racial/ethnic groups from 2011 to earlier assessment years, the 2011 data for Asian and Native Hawaiian/Other Pacific Islander students were combined into a single Asian/Pacific Islander category. Information based on student self-reported race/ethnicity will continue to be reported in the NAEP Data Explorer.
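The classification rules in the preceding paragraphs can be summarized in a short sketch. The function names and the input format (a set of group labels reported for one student) are hypothetical, and treating a student with no reported group as "Unclassified" in 2011 is an assumption; the text only describes that case for earlier years.

```python
def classify(groups, year):
    """Assign one reporting category from the groups reported for a student.

    groups: iterable of group labels, e.g. {"White", "Black"}.
    year: assessment year; 2011 and later use the newer categories.
    """
    groups = set(groups)
    # Hispanic takes precedence over any other reported group.
    if "Hispanic" in groups:
        return "Hispanic"
    if len(groups) >= 2:
        return "Two or more races" if year >= 2011 else "Unclassified"
    if len(groups) == 1:
        return next(iter(groups))
    # No group could be determined (assumption for 2011; see lead-in).
    return "Unclassified"


def trend_category(category_2011):
    """Map a 2011 category to the one used when comparing with earlier years."""
    if category_2011 in {"Asian", "Native Hawaiian/Other Pacific Islander"}:
        return "Asian/Pacific Islander"
    return category_2011


print(classify({"White", "Black"}, 2009))     # Unclassified
print(classify({"White", "Black"}, 2011))     # Two or more races
print(classify({"Hispanic", "White"}, 2011))  # Hispanic
print(trend_category("Asian"))                # Asian/Pacific Islander
```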
As part of the Department of Agriculture's National School Lunch Program (NSLP), schools can receive cash subsidies and donated commodities in return for offering free or reduced-price lunches to eligible children. Based on available school records, students were classified as either currently eligible for the free/reduced-price school lunch or not eligible. Eligibility for free and reduced-price lunches is determined by students' family income in relation to the federally established poverty level. Students whose family income is at or below 130 percent of the poverty level qualify to receive free lunch, and students whose family income is between 130 percent and 185 percent of the poverty level qualify to receive reduced-price lunch. For the period July 1, 2010 through June 30, 2011, for a family of four, 130 percent of the poverty level was $28,665 and 185 percent was $40,793 in most states. The classification applies only to the school year when the assessment was administered (i.e., the 2010–11 school year) and is not based on eligibility in previous years. If school records were not available, the student was classified as "Information not available." If the school did not participate in the program, all students in that school were classified as "Information not available." Some schools provide free meals to all students irrespective of individual eligibility, using their own funds to cover the costs of noneligible students. Under special provisions of the National School Lunch Act intended to reduce the administrative burden of determining student eligibility every year, schools can be reimbursed based on eligibility data for a single base year. Participating schools might have high percentages of eligible students and report all students as eligible for free lunch.
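As a rough illustration of the income thresholds quoted above, the sketch below applies the 2010-11 figures for a family of four in most states. It is not how NAEP classifies students, which relies on school records; the function name is hypothetical, and treating an income of exactly $40,793 as reduced-price eligible is an assumption.

```python
# 2010-11 thresholds for a family of four in most states (from the text).
FREE_LIMIT = 28_665      # 130 percent of the poverty level
REDUCED_LIMIT = 40_793   # 185 percent of the poverty level


def lunch_eligibility(family_income):
    """Classify a family-of-four annual income against the NSLP thresholds.

    family_income: annual income in dollars, or None when no record exists.
    """
    if family_income is None:
        return "Information not available"
    if family_income <= FREE_LIMIT:
        return "Free lunch"
    # Upper bound treated as inclusive (assumption; see lead-in).
    if family_income <= REDUCED_LIMIT:
        return "Reduced-price lunch"
    return "Not eligible"


for income in (25_000, 28_665, 35_000, 45_000, None):
    print(income, lunch_eligibility(income))
```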
Results are reported for students who were identified by school records as having a disability. A student with a disability may need specially designed instruction to meet his or her learning goals. A student with a disability will usually have an Individualized Education Program (IEP), which guides his or her special education instruction. Students with disabilities are often referred to as special education students and may be classified by their school as learning disabled (LD) or emotionally disturbed (ED).
Results are reported for students who were identified by school records as being English language learners. (Note that English language learners were previously referred to as limited English proficient (LEP).)
The national results are based on a representative sample of students in both public schools and nonpublic schools. While national results reflect the performance of students in public, private, and other types of schools (i.e., Bureau of Indian Education schools and Department of Defense schools), state-level results reflect the performance of public school students only. Results are reported separately for Department of Defense schools in state tables and maps.
NAEP results are reported for four mutually exclusive categories of school location: city, suburb, town, and rural. The categories are based on standard definitions established by the Federal Office of Management and Budget using population and geographic information from the U.S. Census Bureau. Schools are assigned to these categories in the NCES Common Core of Data based on their physical address.
The classification system was revised for 2007 and 2009. The new locale codes are based on an address's proximity to an urbanized area (a densely settled core with densely settled surrounding areas). This is a change from the original system based on metropolitan statistical areas. To distinguish the two systems, the new system is referred to as "urban-centric locale codes." The urban-centric locale code system classifies territory into four major types: city, suburb, town, and rural. Each type has three subcategories. For city and suburb, these are gradations of size: large, midsize, and small. Towns and rural areas are further distinguished by their distance from an urbanized area. They can be characterized as fringe, distant, or remote.
Prior to 2003, NAEP results were reported for four NAEP-defined regions of the nation: Northeast, Southeast, Central, and West. As of 2003, to align NAEP with other federal data collections, NAEP analysis and reports have used the U.S. Census Bureau's definition of "region." The four regions defined by the U.S. Census Bureau are Northeast, South, Midwest, and West. The Central region used by NAEP before 2003 contained the same states as the Midwest region defined by the U.S. Census. The former Southeast region consisted of the states in the Census-defined South minus Delaware, the District of Columbia, Maryland, Oklahoma, Texas, and the section of Virginia in the District of Columbia metropolitan area. The former West region consisted of Oklahoma, Texas, and the states in the Census-defined West. The former Northeast region consisted of the states in the Census-defined Northeast plus Delaware, the District of Columbia, Maryland, and the section of Virginia in the District of Columbia metropolitan area. Therefore, trend data by region are not provided for the 2005 science assessment. The table below shows how states are subdivided into these Census regions. All 50 states and the District of Columbia are listed. Other jurisdictions, including the Department of Defense Education Activity schools, are not assigned to any region.
SOURCE: U.S. Department of Commerce Economics and Statistics Administration.
Parents' highest level of education is defined by the highest level reported by eighth-graders and twelfth-graders for either parent. Fourth-graders' replies to this question were not reported because their responses in previous studies were highly variable, and a large percentage of them chose the "I don't know" option.
All 50 states and Department of Defense Education Activity (DoDEA) schools participated in the 2011 science assessment. To ensure that the samples in each state are representative, NAEP has established policies and procedures to maximize the inclusion of all students in the assessment. Every effort is made to ensure that all selected students who are capable of participating meaningfully in the assessment are assessed. While some students with disabilities (SD) and/or English language learner (ELL) students can be assessed without any special procedures, others require accommodations to participate in NAEP. Still other SD and/or ELL students selected by NAEP may not be able to participate. Local school authorities determine whether SD/ELL students require accommodations or shall be excluded because they cannot be assessed. The percentage of SD and/or ELL students who are excluded from NAEP assessments varies from one jurisdiction to another and within a jurisdiction over time. Read more about the potential effects of exclusion rates on assessment results.
See additional information about the percentages of students with disabilities and English language learners identified, excluded, and assessed at the national and state level.
See the types of accommodations permitted for students with disabilities and/or English language learners at the national and state levels.
Exclusion rates for other subjects, as well as rates of use of specific accommodations, are available.
Statistical Significance
Differences between scale scores and between percentages that are discussed in the results on this website take into account the standard errors associated with the estimates. Comparisons are based on statistical tests that consider both the magnitude of the difference between the group average scores or percentages and the standard errors of those statistics. Throughout the results, differences between scores or between percentages are discussed only when they are significant from a statistical perspective. The term "significant" is not intended to imply a judgment about the absolute magnitude or the educational relevance of the differences. It is intended to identify statistically dependable population differences to help inform dialogue among policymakers, educators, and the public.
All differences reported are significant at the 0.05 level with appropriate adjustments for multiple comparisons. In NAEP, the Benjamini-Hochberg False Discovery Rate (FDR) procedure is used to control the expected proportion of falsely rejected hypotheses relative to the number of comparisons that are conducted. When comparing all jurisdictions to each other, the testing procedures are based on all pairwise combinations of the jurisdictions in a particular year or pair of years.
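The Benjamini-Hochberg step-up rule mentioned above can be sketched as follows. This is a generic textbook version of the procedure, not NAEP's production code; the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten pairwise comparisons.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))
```

With these ten values only the first two comparisons are flagged, even though five p-values fall below 0.05 individually, which shows how the adjustment tightens the criterion as the number of comparisons grows.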
Comparisons across states use a t-test (the method most commonly used to evaluate the differences in means between two groups) to detect whether a difference is statistically significant or not. There are four possible outcomes when comparing the average scores of jurisdictions A and B:
A given state or jurisdiction may have a higher average scale score than the nation or another state without the difference being statistically significant, while another state with the same average score may show a statistically significant difference compared to the nation or the other state. These situations arise because standard errors vary across states/jurisdictions and estimates.
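A small numerical sketch shows how two states with the same average score can differ in significance. It makes simplifying assumptions that are not taken from the text: the two estimates are treated as independent, the critical value 1.96 is a normal approximation at the 0.05 level, no multiple-comparison adjustment is applied, and all scores and standard errors are hypothetical.

```python
import math


def difference_test(mean_a, se_a, mean_b, se_b, critical=1.96):
    """Compare two average scores using their standard errors.

    Returns (is_significant, t), where t is the difference divided by
    the standard error of the difference for independent estimates.
    """
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    t = (mean_a - mean_b) / se_diff
    return abs(t) > critical, t


nation = (151.0, 0.3)   # hypothetical national average and standard error
state_x = (153.0, 0.8)  # hypothetical state with a small standard error
state_y = (153.0, 1.5)  # same average, larger standard error

for name, (mean, se) in (("State X", state_x), ("State Y", state_y)):
    significant, t = difference_test(mean, se, *nation)
    print(f"{name}: t = {t:.2f}, significant = {significant}")
```

Both hypothetical states score two points above the nation, but only the one with the smaller standard error clears the threshold.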
Users of this website are cautioned against interpreting NAEP results as implying causal relations. Inferences related to student group performance or to the effectiveness of public and nonpublic schools, for example, should take into consideration the many socioeconomic and educational factors that may also have an impact on performance.
The NAEP science scale makes it possible to examine relationships between students' performance and various background factors measured by NAEP. However, a relationship that exists between achievement and another variable does not reveal its underlying cause, which may be influenced by a number of other variables. Similarly, the assessments do not reflect the influence of unmeasured variables. The results are most useful when they are considered in combination with other knowledge about the student population and the educational system, such as trends in instruction, changes in the school-age population, and societal demands and expectations.
A caution is also warranted for some small population group estimates. At times in the results pages, smaller population groups show very large increases or decreases across years in average scores. However, such score changes often need to be interpreted with extreme caution. For one thing, the effects of exclusion-rate changes may be more marked for small groups than for the whole population. Also, the standard errors around the score estimates for small groups are often quite large, which in turn means the standard error around the gain is also large.