In the 2018 NAEP ORF study, students’ passage reading was scored using the automatic speech analysis and scoring system, with the exception of passage reading expression, which was scored by trained human scorers. Word list reading and pseudoword list reading were first transcribed by trained human scorers, and then the rate, accuracy, and words correct per minute (WCPM) variables were calculated by the automatic speech analysis and scoring system.
The 2018 NAEP ORF study used a new automatic speech analysis and scoring system that transcribed students’ passage reading recordings and then aligned the resulting orthographic transcripts with the passage text in order to calculate rate, accuracy, and words correct per minute variables. The system recognizes accepted pronunciations of each word, taking into account dialect and second-language variations as long as the speaking pattern remains consistent throughout the reading.
The automatic speech analysis and scoring system began by identifying the “text span” in the passage (that is, the string of text that the student read aloud from the passage, starting with the first word that the student read and ending with the last word). Then the system transcribed and calculated the number of words within the text span that the student attempted to read (i.e., the “span length”). It also calculated the duration of oral reading time that the student spent reading the text span (i.e., the “span duration”). Lastly, the system counted the number of words that the student correctly read in the correct order within the span length (i.e., the number of correctly read words in the text span). These three pieces of information were used to calculate passage reading rate, accuracy, and WCPM variables.
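The way the three span quantities might combine into the passage variables can be sketched as follows. This is an illustration only: the source does not give the exact formulas for rate and accuracy, so the definitions below (words attempted per minute, and proportion of attempted words read correctly) are assumptions, as are the names `PassageSpan` and `passage_metrics` and the sample values.

```python
from dataclasses import dataclass


@dataclass
class PassageSpan:
    """The three quantities the scoring system derives for one recording."""
    span_length: int        # words the student attempted within the text span
    span_duration_s: float  # seconds spent reading the text span
    words_correct: int      # words read correctly, in order, within the span


def passage_metrics(span: PassageSpan) -> dict:
    """Derive rate, accuracy, and WCPM from the span quantities.

    Assumed definitions (not confirmed by the source):
      rate     = words attempted per minute
      accuracy = words correct / words attempted
      WCPM     = words correct per minute
    """
    if span.span_duration_s <= 0 or span.span_length <= 0:
        raise ValueError("span duration and span length must be positive")
    minutes = span.span_duration_s / 60.0
    return {
        "rate_wpm": span.span_length / minutes,
        "accuracy": span.words_correct / span.span_length,
        "wcpm": span.words_correct / minutes,
    }


# Made-up example: 150 words attempted in 90 seconds, 141 of them correct.
metrics = passage_metrics(
    PassageSpan(span_length=150, span_duration_s=90.0, words_correct=141)
)
```

Under these assumed definitions, the made-up student reads at 100 words per minute with 94 percent accuracy, which corresponds to 94 WCPM.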
For recordings of word reading and pseudoword reading, trained human scorers transcribed students’ oral responses. Human transcription was conducted instead of machine transcription because students did not always follow the same order (e.g., from top to bottom and then left to right) when they read the word and pseudoword lists. After the lists were transcribed, the automatic speech analysis and scoring system produced a time alignment of each transcript with the corresponding student recording, calculated the length of time that the student spent reading the word list or pseudoword list, and counted the number of correctly read words or pseudowords from the list of words or pseudowords presented to students. These counts and the corresponding reading durations were combined to calculate the word reading and pseudoword reading WCPM (words correctly read per minute) variables. For example, if a student read 20 words correctly from the word list in 40 seconds, the word reading score would be 30 WCPM (20 words ÷ 40 seconds × 60 seconds per minute).
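The list-scoring arithmetic can be sketched in a few lines. The order-independent matching rule below (each presented item credited at most once, compared case-insensitively) is an assumption made for illustration, and the function names and sample items are hypothetical; the source states only that correctly read items were counted and combined with reading duration.

```python
def count_correct_items(presented, transcript):
    """Count presented list items that appear in the transcript, in any order.

    Assumption: each presented item can be credited at most once, and
    matching ignores letter case.
    """
    remaining = {}
    for item in presented:
        key = item.lower()
        remaining[key] = remaining.get(key, 0) + 1

    correct = 0
    for token in transcript:
        key = token.lower()
        if remaining.get(key, 0) > 0:
            remaining[key] -= 1
            correct += 1
    return correct


def wcpm(words_correct, duration_s):
    """Words correct per minute from a count and a duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return words_correct / duration_s * 60.0


# Made-up list read out of order, with one misread item ("frug").
presented = ["cat", "ship", "lamp", "frog"]
transcript = ["ship", "cat", "frug", "lamp"]
n_correct = count_correct_items(presented, transcript)

# The worked example from the text: 20 words correct in 40 seconds.
example_score = wcpm(20, 40)
```

Here `n_correct` is 3 and `example_score` is 30.0, matching the worked example in the text.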
Passage reading expression is a rating of the student's ability to clearly express the meaning and structure of the text through appropriate intonation, rhythm, emphasis, and pausing that groups words into phrasal and larger units in ways that will enhance understanding and enjoyment in a listener. This variable was scored by trained human scorers using a 6-point scoring rubric developed for the study, as shown below. The scorers received intensive training on the use of the rubric and then successfully completed a qualification evaluation that demonstrated their understanding of and ability to accurately use the rubric to rate students’ oral reading.
Score | Level | Description |
---|---|---|
0 | Insufficient Sample | |
1 | Word by Word | |
2 | Local Grouping | |
3 | Phrase & Clause | |
4 | Sentence Prosody | |
5 | Passage Expression | |
8 | Silent Reader | |
9 | Anomaly | |
NOTE: Passage expression ratings of 8 and 9 were treated as missing as these students’ expression level could not be determined because of the quality/content of the audio file. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP) 2018 Oral Reading Fluency study. |
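The treatment of codes 8 and 9 as missing can be illustrated with a small recoding step. This is a minimal sketch, not the study's processing code; the names `EXPRESSION_LEVELS`, `MISSING_CODES`, and `recode_expression` are hypothetical, and the sample ratings are made up.

```python
from typing import Optional

# Score codes and level names as listed in the rubric table.
EXPRESSION_LEVELS = {
    0: "Insufficient Sample",
    1: "Word by Word",
    2: "Local Grouping",
    3: "Phrase & Clause",
    4: "Sentence Prosody",
    5: "Passage Expression",
    8: "Silent Reader",
    9: "Anomaly",
}

# Per the table note, ratings of 8 and 9 are treated as missing.
MISSING_CODES = {8, 9}


def recode_expression(score: int) -> Optional[int]:
    """Return the rating, or None when the code is treated as missing."""
    if score not in EXPRESSION_LEVELS:
        raise ValueError(f"unknown expression code: {score}")
    return None if score in MISSING_CODES else score


# Made-up ratings for six recordings.
raw_ratings = [3, 8, 5, 0, 9, 2]
usable_ratings = [
    r for r in (recode_expression(s) for s in raw_ratings) if r is not None
]
```

With these made-up ratings, four of the six recordings keep a usable expression score.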
Evaluating the reliability of scoring is an essential step in ensuring the validity and accuracy of the analyses performed for the study.
To evaluate the reliability of the automatic speech analysis and scoring system, a sample of the passage recordings (about 280 recordings for each passage; see column 3 of the table below) was transcribed by both the automatic speech analysis and scoring system and a trained human scorer. Each of the two transcriptions was aligned with the passage text such that the alignment minimized insertions, deletions, and substitutions. Then, within the span of passage text that the student attempted to read, the system counted the number of words that were correctly read in the correct order. The correlation between the counts of words correctly read using the machine transcriptions and the human transcriptions of the same recordings for each passage is shown in the table below. On average, the interrater reliability between the machine and human transcriptions of the same recordings was 0.96.
Passage | Maximum Number of Words | Number of Second-Scored Audio Recordings | Interrater Reliability |
---|---|---|---|
Passage 1 | 162 | 279 | .99 |
Passage 2 | 153 | 275 | .98 |
Passage 3 | 162 | 283 | .93 |
Passage 4 | 152 | 283 | .94 |
NOTE: Hyphenated forms, e.g., ice-covered, were counted as two words. Passage interrater reliability is the correlation between the counts of words correctly read using the machine transcriptions and those using the human transcriptions of the same audio recordings. The final passage interrater reliability (i.e., .96) is the average correlation across the four passages. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP) 2018 Oral Reading Fluency study. |
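The reliability check described above rests on two computations: a word-level alignment that minimizes insertions, deletions, and substitutions, and a correlation between the resulting counts. The sketch below assumes a standard unit-cost edit-distance alignment and a Pearson correlation; the source does not name the specific algorithm or correlation coefficient, so both are assumptions. Ties between equally cheap alignments are broken in favor of more matched words, which is also an assumption. The sample sentences are made up.

```python
from math import sqrt


def count_correct_aligned(reference, transcript):
    """Count matched words under a minimum-edit-distance alignment.

    Each cell holds (edits, -matches); taking the tuple minimum keeps the
    fewest edits and, among ties, the most matched words.
    """
    n, m = len(reference), len(transcript)
    best = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = (i, 0)  # delete every reference word so far
    for j in range(1, m + 1):
        best[0][j] = (j, 0)  # insert every transcript word so far
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = reference[i - 1] == transcript[j - 1]
            diag = best[i - 1][j - 1]
            best[i][j] = min(
                (diag[0] + (0 if same else 1), diag[1] - (1 if same else 0)),
                (best[i - 1][j][0] + 1, best[i - 1][j][1]),  # deletion
                (best[i][j - 1][0] + 1, best[i][j - 1][1]),  # insertion
            )
    return -best[n][m][1]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up reading with one omission, one insertion, and one misreading.
reference = "the ice covered lake was quiet".split()
transcript = "the ice lake um was quite".split()
n_matched = count_correct_aligned(reference, transcript)

# Averaging the four passage correlations reported in the table.
passage_r = [0.99, 0.98, 0.93, 0.94]
mean_r = sum(passage_r) / len(passage_r)
```

In the made-up reading, four words align as correct ("the", "ice", "lake", "was"). The mean of the four tabled correlations rounds to .96, consistent with the reported average.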
To examine the interrater reliability of word and pseudoword reading scoring, approximately 20 percent of students’ oral response recordings were transcribed independently by two different scorers. The correlations between the two human transcriptions were 0.99 and 0.97 for word reading and pseudoword reading, respectively.
To examine the reliability of the human scoring for passage reading expression, approximately 40 percent of students’ passage reading responses across the four passages were scored for expression by two scorers independently. Between the two human scorings, the exact agreement rate (i.e., the percentage of scores that were exactly the same) was 58 percent, and the adjacent agreement rate (i.e., the percentage of scores that differed by only one level) was an additional 39 percent.
According to the standards for NAEP Writing assessment scoring, which has six scoring categories, exact agreement lower than 61 percent is flagged to indicate mild concern; exact agreement lower than 57 percent is flagged to indicate greater concern. Thus, the exact agreement achieved by the human scoring of passage reading expression (58 percent) was above the 57 percent minimum standard, although it fell within the range flagged for mild concern. Learn more by reading about within-year interrater agreement.
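The agreement statistics and flagging thresholds described above can be expressed as a short calculation. This is a minimal sketch with made-up ratings; the function names `agreement_rates` and `concern_flag` are hypothetical, and the thresholds are the 61 and 57 percent values quoted from the NAEP Writing scoring standards.

```python
def agreement_rates(first, second):
    """Return (exact %, adjacent %) for two scorers' ratings of the same responses."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(first)
    exact = sum(a == b for a, b in zip(first, second))
    adjacent = sum(abs(a - b) == 1 for a, b in zip(first, second))
    return 100.0 * exact / n, 100.0 * adjacent / n


def concern_flag(exact_pct):
    """Apply the thresholds quoted for six-category NAEP Writing scoring."""
    if exact_pct < 57:
        return "greater concern"
    if exact_pct < 61:
        return "mild concern"
    return "no flag"


# Made-up ratings from two scorers for five responses.
scorer_a = [3, 4, 2, 5, 1]
scorer_b = [3, 3, 2, 5, 3]
exact_pct, adjacent_pct = agreement_rates(scorer_a, scorer_b)

# The study's reported exact agreement rate.
study_flag = concern_flag(58)
```

For the made-up ratings, exact agreement is 60 percent and adjacent agreement is 20 percent. Applying the quoted thresholds to the study's 58 percent places it between the two cutoffs.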