Chapter 3 Methodology
3.2. Measuring vocabulary growth
In the present study, vocabulary growth was measured by computing the cumulative vocabulary, the frequency of each part of speech (POS), the proportion of each POS, the noun/verb ratio, the type/token ratio of each POS, and the D measure.
3.2.1. Vocabulary size
To obtain vocabulary size, the type frequency in each age group was computed; it represents the number of distinct words the children produced in that period.
In addition, the cumulative vocabulary was calculated to show the number of new words appearing in each stage. The cumulative vocabulary of each child was counted first, and the mean cumulative vocabulary in each age stage was then computed. The calculation was done with the freq command in the CLAN tools, using the code below to obtain the total number of different word types in a transcript.
The freq code:
freq +t*CHI +t%mor +s"ADV*" +s"ADJ*" +s"ASP*" +s"CL*" +s"CONJ*"
+s"DE*" +s"DT*" +s"IDM*" +s"INT*" +s"MOD*" +s"NEG*" +s"Nloc*"
+s"Nn*" +s"Nppn*" +s"Npro*" +s"Nt*" +s"ONM*" +s"PREP*" +s"QN*"
+s"SFP*" +s"Va*" +s"Vc*" +s"Vi*" +s"Vr*" +s"Vt*" +s"WH*" @
POS was specified in the code so that polysemous words would not be collapsed into a single type, as happens when POS is not specified. For example, the language sample of the child CHENG at 3;1 contained 195 types when POS was specified and 189 when it was not. To count the cumulative vocabulary, +u was added to the code so that successive transcript files were merged before the different word types were counted, which prevents the same word type from being counted repeatedly across files.
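The effect of specifying POS, and of merging files before counting, can be illustrated with a minimal Python sketch. It does not reproduce the freq command; it assumes that each transcript has already been reduced to a list of (POS, word) pairs, and the toy data and function names are hypothetical.

```python
def count_types(tokens, with_pos=True):
    """Count distinct word types.

    A type is the (POS, word) pair when with_pos is True,
    and the bare word form otherwise.
    """
    if with_pos:
        return len(set(tokens))
    return len({word for _pos, word in tokens})


def cumulative_vocabulary(files):
    """Return the running number of distinct (POS, word) types after each file."""
    seen = set()
    totals = []
    for tokens in files:
        seen.update(tokens)       # merging avoids recounting earlier types
        totals.append(len(seen))
    return totals


# Hypothetical toy data: two successive transcripts of one child.
# "hua" occurs both as a noun and as a verb in the first file.
file_1 = [("Nn", "hua"), ("Vt", "hua"), ("Nn", "gou"), ("Vt", "chi")]
file_2 = [("Nn", "gou"), ("Vt", "chi"), ("Nn", "mao")]

print(count_types(file_1, with_pos=True))       # 4
print(count_types(file_1, with_pos=False))      # 3
print(cumulative_vocabulary([file_1, file_2]))  # [4, 5]
```

In this toy sample the word form counted under two POS tags yields one extra type when POS is specified, and the second file adds only one new type to the cumulative total.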
3.2.2. POS frequency
Word frequency was calculated in terms of word types and word tokens respectively.
Counting types shows how many different words the children produced, while counting tokens shows the total number of words they produced and which words they used most frequently. In this study, both the mean frequency and the total frequency were calculated. The frequency calculation was done for each file, and the mean frequency and its standard deviation were then computed for each age group. The total word frequency in each age group was also calculated, to show the total number of words produced by all children in a stage. To calculate this total, +u was used in the code, which automatically merged the files and computed the total word frequency.
However, it is important to note that the mean word type frequency calculated by the above steps (calculation A) does not equal the total word type frequency divided by the number of files in an age group (calculation B). The discrepancy results from the repeated counting of the same type across files in calculation A. When the word frequency of each file is counted separately, a word type that is counted once in one file is counted again if it also occurs in another file. By contrast, when the total word frequency is computed, all files are merged first and the word frequency is counted afterwards, so each word type is counted only once. Consequently, calculation B approximates the average number of word types the children know, whereas calculation A is closer to the average number of word types they use.
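The difference between calculation A and calculation B can be seen in a small worked example. The sketch below uses three hypothetical files, each reduced to its set of word types; the data are invented for illustration only.

```python
from statistics import mean

# Hypothetical word types found in three files of one age group.
files = [
    {"gou", "mao", "chi"},
    {"gou", "chi", "yao"},
    {"gou", "mao", "qiu", "yao"},
]

# Calculation A: count the types in each file separately, then average.
calc_a = mean(len(types) for types in files)

# Calculation B: merge all files first, count each type once,
# then divide by the number of files.
merged = set().union(*files)
calc_b = len(merged) / len(files)

print(calc_a)  # (3 + 3 + 4) / 3
print(calc_b)  # 5 merged types / 3 files
```

Because "gou" occurs in all three files, it contributes three times to calculation A but only once to calculation B, so calculation A yields the larger value here.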
3.2.3. Type/token ratio
The mean token frequency reveals the total number of tokens children produce in a given time slice, and the mean type frequency shows the number of different words they use. However, a child may know many different words but use only some of them frequently, so the type/token ratio (TTR) also needs to be examined. The TTR of each POS in each file was computed, and then the average TTR in each age group was calculated:
\[ \mathrm{TTR} = \frac{\text{number of word types}}{\text{number of word tokens}} \]
Since the output of the freq command includes the number of word types, the number of tokens, and the TTR, as Figure 3.1 shows, the TTR was also obtained with the freq command in the CLAN program.
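As an illustration of the procedure, the following Python sketch computes the TTR of one POS in each file and then averages over the files of an age group. It is not the freq command itself; the token lists and the function name are hypothetical.

```python
from statistics import mean


def ttr(tokens):
    """Type/token ratio: number of distinct items divided by total items."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty sample")
    return len(set(tokens)) / len(tokens)


# Hypothetical noun tokens from two files of the same age group.
file_a = ["gou", "gou", "mao", "qiu"]         # 3 types, 4 tokens
file_b = ["gou", "gou", "gou", "mao", "mao"]  # 2 types, 5 tokens

per_file_ttr = [ttr(file_a), ttr(file_b)]
group_ttr = mean(per_file_ttr)  # average TTR of the age group

print(per_file_ttr, group_ttr)
```

The average is taken over per-file ratios, so each file carries equal weight regardless of its length.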
3.2.4. POS proportion
Because differences in sample size affect the word frequencies of each age group, frequencies need to be converted into percentages before the proportions of different parts of speech are compared. First, obtain the word frequency of a POS and the total number of words in each file. Second, obtain the proportion of that POS by dividing its word frequency in the file by the total number of words in that file. The formula is given below. Third, repeat these two steps for all POS in all files. Last, calculate the average proportion of each POS in an age group. The type proportion and the token proportion of each POS are calculated separately.
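\[ \text{POS Proportion} = \frac{\text{word frequency of the POS in a file}}{\text{total number of words in that file}} \times 100\% \]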
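The four steps can be sketched in Python as follows, taking the proportion of a POS as its frequency in a file divided by the total number of words in that file. The sketch assumes each file is available as a list of POS tags, one per token; the data and function name are hypothetical. For the type proportion, the same function would be applied to a list containing one tag per word type instead.

```python
from collections import Counter
from statistics import mean


def pos_proportions(pos_tags):
    """Map each POS to its share, in percent, of all words in one file."""
    total = len(pos_tags)
    if total == 0:
        raise ValueError("cannot compute proportions for an empty file")
    counts = Counter(pos_tags)
    return {pos: 100 * n / total for pos, n in counts.items()}


# Hypothetical POS tags of the tokens in two files of one age group.
file_a = ["Nn", "Nn", "Vt", "ADV"]
file_b = ["Nn", "Vt", "Vt", "Vt", "SFP"]

per_file = [pos_proportions(tags) for tags in (file_a, file_b)]

# A POS that is absent from a file contributes 0 percent for that file.
mean_nn = mean(p.get("Nn", 0.0) for p in per_file)
mean_adv = mean(p.get("ADV", 0.0) for p in per_file)

print(per_file)
print(mean_nn, mean_adv)
```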
3.2.5. Noun/verb ratio
To determine whether children in the TCCM showed a noun bias or a verb bias, the noun/verb ratio (NVR) in types was computed as follows. First, the numbers of noun types and verb types in each file were counted. Second, the noun/verb ratio of each file was obtained by dividing the number of noun types by the number of verb types, as the formula below shows. Third, the average N/V ratio of each age group was calculated.
\[ \mathrm{NVR} = \frac{\text{number of noun types}}{\text{number of verb types}} \]
When noun types outnumber verb types in a child’s vocabulary, the N/V ratio is greater than 1, which is regarded as “Noun Bias”. Conversely, when verb types outnumber noun types, the N/V ratio is smaller than 1, which is regarded as “Verb Bias”. There has been controversy over whether early Mandarin acquisition shows a noun bias or a verb bias, and one possible reason is that nouns and verbs have been defined differently across studies. Thus, both broad and strict definitions of nouns and verbs are taken into consideration when calculating the N/V ratio. Four kinds of N/V ratios are obtained under four conditions, as Table 3.6 below shows. Broad nouns include all noun categories: common nouns (Nn), localizers (Nloc), proper nouns (Nppn), pronominals (Npro), and time nouns (Nt), while strict nouns include only common nouns (Nn). Likewise, broad verbs include all verb categories: stative verbs (Va), Shi-words (Vc), intransitive verbs (Vi), transitive verbs (Vt), and reduplicated verbs (Vr), whereas strict verbs exclude stative verbs (Va), since stative verbs are treated as adjectives in some studies.
Table 3.6: Noun/Verb ratios of different inclusion of nouns and verbs.
Noun/Verb ratio                        Broad Noun (all noun categories)    Strict Noun (only common noun)
Broad Verb (including stative verb)    NVR 1                               NVR 3
Strict Verb (no stative verb)          NVR 2                               NVR 4
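The four ratios can be illustrated with a short Python sketch. It assumes that the number of word types per POS tag has already been obtained for a file; the counts and the function name are hypothetical, and the numbering of NVR 1 to NVR 4 follows Table 3.6 as it is read here (rows for verb definitions, columns for noun definitions).

```python
BROAD_NOUNS = {"Nn", "Nloc", "Nppn", "Npro", "Nt"}
STRICT_NOUNS = {"Nn"}
BROAD_VERBS = {"Va", "Vc", "Vi", "Vt", "Vr"}
STRICT_VERBS = BROAD_VERBS - {"Va"}  # stative verbs excluded


def nv_ratios(type_counts):
    """Return NVR 1 to NVR 4 from a mapping of POS tag to number of types."""
    def total(tags):
        return sum(type_counts.get(tag, 0) for tag in tags)

    broad_n, strict_n = total(BROAD_NOUNS), total(STRICT_NOUNS)
    broad_v, strict_v = total(BROAD_VERBS), total(STRICT_VERBS)
    # A file with no verb types would raise ZeroDivisionError here;
    # how such files should be handled is left open in this sketch.
    return {
        "NVR1": broad_n / broad_v,    # broad noun / broad verb
        "NVR2": broad_n / strict_v,   # broad noun / strict verb
        "NVR3": strict_n / broad_v,   # strict noun / broad verb
        "NVR4": strict_n / strict_v,  # strict noun / strict verb
    }


# Hypothetical type counts for one file.
counts = {"Nn": 30, "Npro": 6, "Nt": 4, "Va": 10, "Vt": 25, "Vi": 5}
ratios = nv_ratios(counts)
print(ratios)
```

With these invented counts the same file shows a ratio above 1 under one definition and below 1 under another, which is the reason all four ratios are reported.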
3.2.6. D-measure
In this study, D values are calculated to examine the lexical diversity of children in different age groups. The D measure is calculated with the vocd command provided by the CLAN program (MacWhinney, 2000; McKee et al., 2000). First, the transcript is read into the CLAN program. Second, the vocd command is used to compute the D value of that transcript. An example of the output of vocd is provided in Appendix 2. Third, the average D value of each age group is calculated. The mean D value serves as an indicator of the lexical diversity typical of each age group. As children grow older, their speech should become more complex and diverse; thus, D is expected to increase with age.
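The D values in this study were taken directly from the vocd output. The following Python sketch only illustrates the curve-fitting idea behind D as it is generally described for vocd: mean TTRs of random subsamples of 35 to 50 tokens are compared with the model TTR = (D/N)(sqrt(1 + 2N/D) - 1), and D is the value that fits best. The grid search, the function names, and the default parameters are assumptions made for illustration, and the actual vocd implementation differs in detail.

```python
import math
import random


def model_ttr(n, d):
    """TTR predicted by the D model for a sample of n tokens."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def empirical_curve(tokens, sizes=range(35, 51), trials=100, rng=None):
    """Mean TTR of random subsamples (without replacement) per sample size.

    tokens must be a list with at least max(sizes) items.
    """
    rng = rng or random.Random(0)
    curve = []
    for n in sizes:
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        curve.append((n, sum(ttrs) / trials))
    return curve


def fit_d(curve, lo=1.0, hi=200.0, step=0.1):
    """Grid-search the D that minimises squared error against the curve."""
    def error(d):
        return sum((model_ttr(n, d) - t) ** 2 for n, t in curve)

    candidates = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return min(candidates, key=error)


# Sanity check on a synthetic curve generated from the model itself:
# the fitted value should come out close to the D used to generate it.
synthetic = [(n, model_ttr(n, 60.0)) for n in range(35, 51)]
print(fit_d(synthetic))
```

For a real transcript, `empirical_curve` would be given the child's token list and its result passed to `fit_d`; the mean of the resulting D values over the files of an age group then corresponds to the third step described above.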