Education Statistics Quarterly
Vol 3, Issue 2, Topic: Methodology
Monetary Incentives for Low-Stakes Tests
By: Harold F. O’Neil, Jr., Jamal Abedi, Charlotte Lee, Judy Miyoshi, and Ann Mastergeorge
 
This article was originally published as the Executive Summary of the Research and Development Report of the same name. The sample survey data are from the NCES Monetary Incentives for Low-Stakes Tests studies.
 
 

Research and Development Reports are intended to
  • share studies and research that are developmental in nature;
  • share results of studies that are on the cutting edge of methodological developments;
  • participate in discussions of emerging issues of interest to researchers.

These reports present results or discussion that do not reach definitive conclusions at this point in time, either because the data are tentative, the methodology is new and developing, or the topic is one on which there are divergent views. Therefore, the techniques and inferences made from the data are tentative and are subject to revision.



Information from international assessments in the 1990s (e.g., the Third International Mathematics and Science Study, or TIMSS) indicates that 12th-grade students in the United States perform very poorly on such assessments compared with their peers in other countries (Takahira et al. 1998). Similarly, many 12th-grade students perform poorly on the National Assessment of Educational Progress (NAEP). On such assessments, in almost all cases, U.S. 12th-grade students perform relatively worse than U.S. 8th-grade students. For example, in TIMSS, 12th-grade students scored below the international average, whereas 8th-grade students scored at the international average.

These poor results are usually attributed to cognitive factors related to students’ opportunities to learn, teachers’ lack of professional preparation, etc. However, a partial explanation for these results may be motivational. Because the low-stakes (for students) tests were administered late in these 12th-graders’ final year in high school, the timing may have negatively affected motivation, and thus performance. This phenomenon has been labeled “senioritis.” For the high school senior going into the world of work or on to postsecondary education, tests like TIMSS are clearly low stakes. Thus, one of the major questions about these tests concerns the possible impact of motivational factors on the results. If students are not motivated to perform well on low-stakes tests, then the results may underestimate what students could do if they gave these assessments their best effort.



The basic approach of this research was to provide a sufficient monetary incentive to maximize student effort and therefore increase performance. Such an incentive was expected to stimulate a 0.5 standard deviation increase in performance. The results of this research will not generalize, without additional research, to either TIMSS or NAEP. Further, the results will not generalize to the impact of motivation variables (e.g., effort, self-efficacy)1 on the teaching and learning of math. However, it was expected that these results would constitute a proof of concept of the importance of manipulating motivation in low-stakes assessments for 12th-graders.
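To put the hypothesized effect size in perspective, the following sketch (an illustration, not part of the study) uses the statsmodels power module to estimate roughly how many students per group would be needed to detect a 0.5 standard deviation difference between an incentive group and a control group; the significance level and power values are assumptions chosen for illustration.

# Illustrative power calculation (not from the report): students needed per group
# to detect a 0.5 standard deviation incentive effect; alpha and power are assumptions.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,          # hypothesized incentive effect in standard deviation units
    alpha=0.05,               # assumed two-sided significance level
    power=0.80,               # assumed statistical power
    alternative="two-sided",
)
print(f"Approximately {n_per_group:.0f} students needed per group")  # about 64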

Prior research

Promising results were provided by prior NAEP motivation research sponsored by the National Assessment Governing Board (NAGB) and by the National Center for Education Statistics (NCES) of the Office of Educational Research and Improvement (OERI). In that research, it was hypothesized that the incentives would increase effort, which, along with prior knowledge, would improve performance. The effective incentive in this earlier study was money. In the study (O’Neil et al. 1992), various incentives (money, task, ego, standard NAEP instructions) were manipulated for 8th- and 12th-grade samples of students of various ethnicities (White, Black, Hispanic, and Asian/Pacific Islander).

In general, only the money incentive worked in the 8th grade. The results showed, in the best case, that the money incentive was effective for a subsample of the 8th-grade students (those who remembered their incentive/treatment group) on easy and medium-difficulty items. With respect to item difficulty, because the motivational manipulation occurred at test time, the increased effort was not expected to improve performance on hard items, for which students did not know the content. With respect to remembering one's treatment group, presumably a student who did not remember the incentive (money) would not increase his or her effort (and thus performance). However, no incentives were effective for 12th-grade students, even those who remembered their treatment group. The lack of effect for 12th-graders was hypothesized to occur because the amount of money ($1.00 per item) was not large enough for 12th-graders and, further, many 12th-graders did not believe they would get the money.

Current approach

The approach for the current investigation with 12th-graders consisted of manipulating the amount of money per item correct so as to increase the motivational effect and thus increase performance. The amount of money given per item correct was either $0 (as in a standard low-stakes administration, e.g., TIMSS) or $10 per item correct (which was expected to be effective). The incentive group was compared with a group receiving standard low-stakes TIMSS instructions. Consistent with the prior NAEP study, information on effort, self-efficacy, and worry was also collected. For the current assessment, the released TIMSS math literacy scale items were used. This set of items included both multiple-choice and free-response items.

It was hypothesized that students receiving $10 per item correct would perform significantly better in math than those receiving no monetary incentive (the control group). Such students were also expected to exhibit higher effort and self-efficacy but less worry than control group participants. In general, overall anxiety levels were expected to be low, given the low-stakes nature of the test.

Design and samples

This investigation included a focus group study, a pilot study, a main study, and a supplementary study with Advanced Placement (AP) students in mathematics (called the AP study). In the focus group study, various levels of incentives were explored. This research is documented in Mastergeorge (1999). Parents and students who participated in the focus groups suggested that $5 to $10 per item correct would provide enough motivation for students in grade 12 to work harder on math test items. Based on these findings, in the present investigation students were offered $10 per item correct to find out whether their performance on the selected math items could be increased under such a high-stakes testing condition. The performance of students receiving $10 per item correct was then compared with the performance of students who responded to the same set of items with no monetary incentive.

A total of 725 students participated in the pilot, main, and AP studies: 144 in the pilot study, 415 in the main study, and 166 in the AP study. Students were drawn from 23 different schools (5 in the pilot study, 9 in the main study, and 9 in the AP study) located in southern California school districts in different areas. These schools had different demographics and different levels of overall student performance. However, the high proportion of students from non-English language backgrounds in the sample limits the generalizability of the findings, which should be interpreted with this caution in mind.

Following the focus group study was the pilot study. The purpose of the pilot study was to test design issues, examine the accuracy and language of the instruments, and resolve logistical problems. The results of the pilot study helped in refining the instruments and modifying the design. The main study and AP study were then conducted.



For an approximately 1-hour testing session, the average student in the incentive condition in the main study received $100 ($80 for an average of 7.96 items answered correctly and $20 for the two “easy” test items). In the AP study, the average student received $200. Such incentives were assumed to be motivational for the 12th-graders in our samples. However, the results of the main and AP studies showed no significant difference between the performance of students in the incentive and control groups. Statistically, there was no “main effect” of the incentive treatment. However, in the main study there was a complex interaction between three variables: treatment group, sex, and test booklet.2 After students were divided into subgroups based on these variables, however, none of the differences in their mean scores by subgroup were statistically significant. Thus, this interaction was conservatively interpreted as not supporting the major hypothesis. Further, the results of the AP study also did not support the major hypothesis. Although the total number of students in the main study was 393 after excluding students with incomplete data, the numbers became smaller when students were divided into subgroups by the levels of independent variables such as sex, test booklet, and treatment group. For some of the analyses, the number of students was not sufficient to detect a significant difference, even when the difference appeared relatively large. However, the number of students in both the main study sample and the AP study was sufficient to have detected a main effect for the incentive treatment.
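For readers unfamiliar with the analysis, an interaction like the one described above is typically examined with a factorial analysis of variance. The following sketch is illustrative only (it is not the authors' analysis code), and the file and column names (score, treatment, sex, booklet) are hypothetical.

# Illustrative three-way ANOVA (treatment x sex x test booklet).
# File and column names are hypothetical; this is not the study's actual analysis code.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("main_study.csv")  # one row per student: score, treatment, sex, booklet
model = smf.ols("score ~ C(treatment) * C(sex) * C(booklet)", data=df).fit()
print(anova_lm(model, typ=2))  # inspect the treatment main effect and the three-way interaction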

There was a great deal of consistency in the data across the main and AP studies. For example, males performed significantly better than females in both studies. This was expected: although no significant sex differences were found in the national TIMSS sample (Takahira et al. 1998), local southern California samples have consistently shown differences between male and female performance on math tests. Students in both the main study and the AP study reported significantly more effort in the incentive condition than in the control condition. Finally, in both studies self-efficacy and effort were positively related. These latter results make theoretical sense, as Bandura (1986, 1993, 1997) would predict that higher levels of self-efficacy should lead to higher levels of effort.

It was also predicted, based on the prior NAEP research, that the incentive condition would result in higher effort. In both the main and AP studies, students in the incentive group did report significantly higher effort than students in the control group. In turn, this increased effort should have resulted in better math performance. So why was there no significant main effect of the treatment on math performance, given that there was a main effect of the treatment on effort? The major reason suggested was the lack of relationship between self-reported effort and math achievement. Unexpectedly, in both the main and AP studies, self-reported effort was not significantly related to math performance (e.g., r = .007 in the AP study). The research literature and the previous NAEP research using the same measures indicate that this relationship should be positive (i.e., higher effort should lead to better performance). Such findings are therefore puzzling.
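The effort-performance relationship at issue is a simple bivariate correlation. A minimal sketch of the check (with a hypothetical data file and column names) is:

# Illustrative check of the self-reported effort vs. math score relationship.
# File and column names (effort, score) are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("main_study.csv")
r, p_value = pearsonr(df["effort"], df["score"])
print(f"Effort-score correlation: r = {r:.3f}, p = {p_value:.3f}")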

Insufficient testing time was not an issue: the number of not-reached items was very low, and few items were omitted in either study, indicating that students had enough time to attempt nearly all items on the test. Thus, this set of items constituted a power test, not a speed test. In the main study, the mean score was 7.96 out of 24 possible points (20 items, with a few extended-response items worth a maximum of 2 points each). In the AP study, the mean was 17.95 out of 24 possible points on the same test. There was no ceiling on the number of correctly answered items for which students would be paid.
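The distinction between omitted and not-reached items drawn here follows the common large-scale assessment convention, under the assumption that trailing blanks at the end of a booklet are counted as not reached and interior blanks as omitted. A minimal sketch of that bookkeeping (with made-up responses) is:

# Illustrative classification of blank responses, assuming the usual convention:
# trailing blanks = not reached, blanks followed by a later answer = omitted.
def classify_blanks(responses):
    """responses: answers in item order; None marks a blank item."""
    answered = [i for i, r in enumerate(responses) if r is not None]
    last_answered = answered[-1] if answered else -1
    omitted = sum(1 for r in responses[: last_answered + 1] if r is None)
    not_reached = len(responses) - (last_answered + 1)
    return omitted, not_reached

# Made-up 6-item record: item 3 skipped, items 5 and 6 never reached.
print(classify_blanks(["A", "C", None, "B", None, None]))  # -> (1, 2)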

To better understand the puzzling results of this investigation, the obvious next step is to replicate the investigation with samples more representative of U.S. students generally or in groups with very different compositions. These studies should be supplemented by a series of focus groups and cognitive laboratory approaches.



In summary, effort was not related to performance. The conclusion for this set of studies is that a strong monetary incentive did not increase math performance on a set of TIMSS released math items with local southern California samples of convenience that included a high proportion of students with a non-English language background. Further, the inability to find motivational effects—despite a strong incentive, random assignment (with equivalence on background characteristics), tests of high- and low-performing students, and elimination of cases where students could not remember their incentive or treatment group—is quite compelling. It raises some fundamental questions about previous assumptions made about the motivational effect on test performance. It appears that factors in addition to motivation are coming into play. The authors believe that there is a senioritis effect, but that understanding its specific motivational effect on test performance and its amelioration awaits future research.



Footnotes

1Self-efficacy refers to a student’s belief in his or her ability to perform a task.

2Two versions of the test booklet were used to minimize cheating.



Bandura, A. (1986). Social Foundations of Thought and Action: A Social Cognitive Theory. Englewood Cliffs, NJ: Prentice Hall.

Bandura, A. (1993). Perceived Self-Efficacy in Cognitive Development and Functioning. Educational Psychologist, 28 (2): 117–148.

Bandura, A. (1997). Self-Efficacy: The Exercise of Control. New York: Freeman.

Mastergeorge, A. (1999, June). Focus Groups on Motivational Incentives for Low-Stakes Tests With Senior High School Students and Their Parents. Report to AIR/ESSI. Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.

O’Neil, H.F., Jr., Sugrue, B., Abedi, J., Baker, E.L., and Golan, S. (1992). Final Report of Experimental Studies on Motivation and NAEP Test Performance. Report to NCES, Grant #RS90159001. Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.

Takahira, S., Gonzales, P., Frase, M., and Salganik, L.H. (1998). Pursuing Excellence: A Study of U.S. Twelfth-Grade Mathematics and Science Achievement in International Context (NCES 98–049). U.S. Department of Education, National Center for Education Statistics. Washington, DC: U.S. Government Printing Office.


Data source: The NCES Monetary Incentives for Low-Stakes Tests studies, 2001.

For technical information, see the complete report:

O’Neil, H.F., Jr., Abedi, J., Lee, C., Miyoshi, J., and Mastergeorge, A. (2001). Monetary Incentives for Low-Stakes Tests (NCES 2001–024).

Author affiliations: H.F. O’Neil, Jr., University of Southern California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST); J. Abedi, C. Lee, J. Miyoshi, and A. Mastergeorge, University of California at Los Angeles/CRESST.

For questions about content, contact Val Plisko (vplisko@ed.gov).

To obtain the complete report (NCES 2001–024), call the toll-free ED Pubs number (877-433-7827) or visit the NCES Web Site (http://nces.ed.gov).

