In this corpus-based study, the transcripts of sixty television programs, consisting of 31,323,019 running words, were analyzed using the Range program (Heatley et al., 2002). The study aims to identify, by genre, the vocabulary coverage of American television programs as a resource for learning English vocabulary. This chapter consists of four sections. The first section describes the corpus data collected from American television programs. The second section delineates the compilation of the two sets of baseword lists, the BNC lists and the BNC/COCA lists, employed in Range. The third section describes the software used in the present study, the Range program. The fourth section explains the analysis procedures.
Corpus Data
The transcripts of 7,279 episodes of sixty American television programs were downloaded from TVsubtitles.net (http://www.tvsubtitles.net/) and analyzed in the present study. The downloaded transcripts cannot be guaranteed to match the actual dialogue of the programs word for word. However, “the scripts should provide a reliable assessment of the vocabulary in television programs” (Rodgers & Webb, 2011).
The programs were chosen according to three factors: popularity, genre, and availability of transcripts (Webb & Rodgers, 2009b). For popularity, TV Guide, a well-regarded magazine in the USA, provided a list of the 97 most popular television programs on its website (http://www.tvguide.com/top-tv-shows). The list included both American and British television programs of various genres, such as comedy, drama, crime, reality television shows, and game shows. Reality television shows and game shows were excluded because their transcripts were not available. For genre, the programs were first categorized into three basic types: sitcoms, procedurals, and serial dramas. Serial dramas were then divided into subgenres, namely drama, medical drama, crime-thriller drama, and supernatural/sci-fi drama, as shown in Table 4. The titles of the episodes of each television program are listed in the Appendix.
Table 4 Genres of the Television Programs and Numbers of the Episodes

No.  Title                      Episodes    No.  Title                      Episodes

Sitcoms
1    The Big Bang Theory        121         2    Two and a Half Men         210
3    How I Met Your Mother      169         4    The Simpsons               420
5    30 Rock                    130         6    Community                  71
7    The Office                 170         8    Modern Family              81
9    New Girl                   34          10   Friends                    200

Procedurals
11   NCIS                       218         12   Person of Interest         30
13   Criminal Minds             169         14   Bones                      151
15   The Closer                 108         16   Castle                     90
17   Chase                      18          18   Law & Order: SVU           260
19   The Mentalist              103         20   CSI: Miami                 232

Serial Dramas: Drama
21   Revenge                    57          22   Brothers & Sisters         109
23   Ugly Betty                 85          24   Dawson’s Creek             127
25   The West Wing              154         26   Friday Night Lights        76
27   Glee                       96          28   Pretty Little Liars        93
29   Gossip Girl                119         30   Desperate Housewives       177

Serial Dramas: Medical Drama
31   Grey’s Anatomy             177         32   House M.D.                 176
33   ER                         300         34   Saving Hope                32
35   Scrubs                     181         36   Emily Owens, M.D.          13
37   Body of Proof              42          38   A Gifted Man               16
39   Private Practice           112         40   Nip/Tuck                   100

Serial Dramas: Supernatural / Sci-fi Drama
41   True Blood                 70          42   Supernatural               158
43   The Walking Dead           47          44   Fringe                     100
45   The Vampire Diaries        103         46   Lost                       121
47   Smallville                 218         48   Charmed                    178
49   Angel                      110         50   Buffy the Vampire Slayer   166

Serial Dramas: Crime-Thriller Drama
51   Breaking Bad               62          52   Dexter                     96
53   Sons of Anarchy            79          54   The Wire                   69
55   The Shield                 88          56   Justified                  60
57   Chuck                      91          58   Nikita                     73
59   Burn Notice                91          60   Leverage                   72

Total episodes 7,279
The Word Lists
The word lists used in the present study to determine the 1,000-word level at which each word occurs are the BNC lists and the BNC/COCA lists. The BNC is largely written, with only 10% spoken language (Nation, 2004). According to Nation (2006), the 14 BNC lists were based on the range, frequency, and dispersion of words in the BNC. The ordering of the word-family lists was validated in three ways. First, the number of tokens, types, and families should decrease from the first 1,000 word list to the fourteenth 1,000 word list when the lists are run over an independent corpus. Second, since low-frequency words have fewer family members than high-frequency ones, lists of low-frequency words should contain fewer word types than lists of high-frequency words with the same number of families. Third, the lists were run over several corpora to make sure that wide-range, high-frequency words were not missing from them.
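The first of these checks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the token counts are invented, and the sketch is not the procedure Nation (2006) actually ran.

```python
def decreases_across_lists(counts):
    """True if every list has a strictly smaller count than the one before it."""
    return all(earlier > later for earlier, later in zip(counts, counts[1:]))


# Invented token counts for four successive 1,000-word lists (assumption).
ordered = decreases_across_lists([8500, 600, 400, 300])
misordered = decreases_across_lists([8500, 400, 600, 300])
```

The same function could be applied to counts of types or families for each list.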
The word family is the unit of counting in the BNC lists (Nation, 2006), and the level of the word family was set at Level 6 of the word-family scale defined by Bauer and Nation (1993). Level 6 includes inflections (plural -s; third person singular present tense; past tense -ed; past participle; -ing; comparative -er; superlative -est) and over 80 derivational affixes, including -able, -ee, -ic, -ify, -ion, -less, -age, -ant, -ward, circum-, and -y. For example, the high-frequency word family nature has the following members at Level 6: natural, unnatural, unnaturally, naturally, natures, naturistic, naturistically, naturalness, naturalist, naturalists, and naturalism.
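To make the idea of counting by word family concrete, the following minimal sketch maps each member of the family nature to its headword and counts occurrences per family. The lookup table is hand-built from the members named in this section; the function name and the sample sentence are assumptions for illustration and are not part of the BNC lists or of Range.

```python
from collections import Counter

# Hypothetical miniature family table: member -> headword.
# The members are those named in the text for the family "nature".
NATURE_FAMILY = [
    "nature", "natural", "unnatural", "unnaturally", "naturally", "natures",
    "naturistic", "naturistically", "naturalness", "naturalist",
    "naturalists", "naturalism",
]
MEMBER_TO_HEADWORD = {member: "nature" for member in NATURE_FAMILY}


def count_families(tokens):
    """Count tokens per word family; unknown words count under themselves."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        counts[MEMBER_TO_HEADWORD.get(word, word)] += 1
    return counts


# Invented sample sentence (assumption).
sample = "Naturally the naturalist loves nature and natural things".split()
family_counts = count_families(sample)
```

Here four different word types in the sample are all counted under the single family nature, which is why the unit of counting affects coverage figures.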
The fifteenth word list in the BNC lists contains proper nouns and marginal words, such as interjections, exclamations, and hesitation procedures, which inevitably appear in television programs. Proper nouns and marginal words can be easily understood in television programs. Hence, coverage can be calculated both with and without the fifteenth list to see the difference.
Two points should be noted. First, the BNC lists are based primarily on written language, with only 10% spoken language, so Rodgers and Webb (2011) suggested that the estimated coverage might be somewhat conservative. Second, the BNC lists consist primarily of British text (Nation, 2006), so the coverage might also be conservative because the corpus data in the present study come from American programs. When Rodgers and Webb (2011) analyzed related versus unrelated television programs using the BNC lists, they were interested in whether the coverage of American programs would be higher with lists developed from an American corpus. This is where COCA comes into play in the present study.
Unlike the BNC, COCA is a corpus of about 400 million words that is evenly divided among spoken language, fiction, magazines, newspapers, and academic journals (Davies, 2010). Unlike many other corpora, COCA is also a ‘monitor corpus’, which tracks ongoing changes in the language as it is used in the real world.
The BNC/COCA lists were developed in almost the same way as the BNC lists. Twenty-five word family lists were developed based on frequency and range data. Four additional lists contain (1) proper nouns, (2) marginal words, such as swear words, exclamations, and letters of the alphabet, (3) transparent compounds, and (4) abbreviations. Unlike the BNC lists, the first two 1,000 word family lists were made using a specially designed 10 million token corpus (Nation, 2012). Because the earlier lists developed from the BNC were strongly influenced by a corpus consisting primarily of written language (Nation, 2004), very common spoken English words were added to the high-frequency lists, that is, the first two 1,000 lists. The remaining lists, from the third to the twenty-fifth 1,000, were developed from frequency and range data in the same way as the BNC lists.
Range Program
The transcripts were analyzed with the Range software (Nation & Heatley, 2002), a computer program that can compare the vocabulary of up to 32 different texts at the same time. For each word in the texts, it reports how many texts the word occurs in (its range), a headword frequency, a word family frequency, and a frequency figure for each text in which the word occurs (see Figure 1). In the present study, it is used to investigate the coverage of the sixty American television programs by the BNC lists and the BNC/COCA lists.
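The kind of range and frequency information described here can be approximated with a short Python sketch. This is not the Range program itself: the tokenizer, the function names, and the three-line mini corpus are all assumptions made for illustration, and word families are ignored for simplicity.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def range_and_frequency(texts):
    """Return {word: (range, total_frequency, per_text_frequencies)}."""
    per_text = [Counter(tokenize(text)) for text in texts]
    vocabulary = set().union(*per_text)
    table = {}
    for word in vocabulary:
        freqs = [counts[word] for counts in per_text]
        word_range = sum(1 for f in freqs if f > 0)  # texts containing the word
        table[word] = (word_range, sum(freqs), freqs)
    return table


# Hypothetical three-"episode" mini corpus (assumption).
episodes = [
    "Oh, the lab results are back.",
    "The results were negative, oh well.",
    "Check the lab again.",
]
table = range_and_frequency(episodes)
```

In this toy corpus the word the has a range of 3 because it occurs in all three texts, while oh has a range of 2.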
Fourteen 1,000-word-frequency lists were used with the Range program to show which words occur at which level. Nation (2006) created the fourteen 1,000-word-frequency lists from the British National Corpus (BNC). The word families in the lists were set at Level 6 according to Bauer and Nation (1993). Proper nouns and marginal words, such as ah, oh, and um, were listed in the fifteenth and sixteenth lists. Words in the texts that cannot be found in the lists are reported as “Not found in any lists”. More details on the lists can be found in Nation (2006).
Figure 1 Retrieving Vocabulary Coverage with Range Program
Another set of lists, the BNC/COCA word family lists, expands the number of frequency-based word family lists to twenty-five. The twenty-sixth to thirtieth lists contain one nonsense word each, since Nation and Davies intended to leave room for additional lists. The thirty-first to thirty-fourth lists contain (1) proper nouns, (2) marginal words, including letters of the alphabet, swear words, and exclamations, (3) transparent compounds such as airbag and leadwork, and (4) abbreviations such as CSI and FBI.
For more information about the word lists, see Nation and Webb (2011). The word lists and the Range program can be downloaded from Paul Nation’s website: http://www.victoria.ac.nz/lals/about/staff/paul-nation.
Although the Range program can count frequency and range, it cannot distinguish homographs. In addition, the lists integrate the BNC and COCA corpora but do not include phrases. These limitations should be noted in the present study.
Analytical Procedures
The transcripts of the sixty American television programs were run through the Range program to determine the cumulative coverage for each program, for programs of the same genre, and for all sixty programs combined, using the BNC lists and the BNC/COCA lists respectively. Considering that proper nouns, marginal words, transparent compounds, and abbreviations are easily learned (Nation, 2006; Webb & Rodgers, 2009a, 2009b), the four lists were included in the cumulative coverage.
To answer the research questions, the cumulative coverage of each program was calculated with each set of lists. In addition, the researcher focused on the word-frequency level at which each program reached 95% and 98% coverage. To determine which genres reached 95% and 98% coverage at the most and the least frequent word levels, the sets of programs within each genre were examined separately.
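The coverage calculation described in this section can be illustrated with a minimal Python sketch. The function names and the token counts are invented for illustration; in the actual study these figures come from the output of the Range program.

```python
def cumulative_coverage(tokens_per_level, total_tokens):
    """Cumulative percentage of all tokens covered after each word level."""
    coverage, running = [], 0
    for count in tokens_per_level:
        running += count
        coverage.append(100.0 * running / total_tokens)
    return coverage


def level_reaching(coverage, target):
    """1-based level at which coverage first reaches the target, else None."""
    for level, value in enumerate(coverage, start=1):
        if value >= target:
            return level
    return None


# Invented counts (assumption): tokens found at the first four 1,000-word
# levels of a program with 10,000 running words; 200 tokens are in no list.
token_counts = [8500, 600, 400, 300]
coverage = cumulative_coverage(token_counts, total_tokens=10000)
level_95 = level_reaching(coverage, 95)
level_98 = level_reaching(coverage, 98)
```

With these invented figures, 95% coverage is reached at the third 1,000-word level and 98% at the fourth. If proper nouns and marginal words are treated as known, their token counts could be added to the running total before the first level.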
Qualitative analysis was also carried out to examine the differences between the coverage of the television programs by the two sets of lists. Words that appear with high frequency in the television programs but in none of the lists are worth discussing. The researcher therefore explored the words not found in any list to examine the discrepancies between the corpus data and the two sets of lists.
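As a rough sketch of how such words could be surfaced, the following Python snippet ranks the tokens that occur in none of the lists by frequency. The listed vocabulary, the transcript, and the function name are invented for illustration and do not reproduce the contents of the BNC or BNC/COCA lists.

```python
from collections import Counter


def words_not_in_lists(tokens, listed_words, top_n=10):
    """Most frequent tokens that appear in none of the word lists."""
    missing = Counter(t for t in tokens if t not in listed_words)
    return missing.most_common(top_n)


# Hypothetical data (assumption): a tiny listed vocabulary and a
# tokenized transcript.
listed = {"the", "is", "a", "on", "case"}
transcript = "the unsub is on the case unsub unsub gonna gonna".split()
top_missing = words_not_in_lists(transcript, listed, top_n=2)
```

The resulting ranked list gives a starting point for inspecting by hand which frequent words of television dialogue fall outside the lists.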