測量華語兒童早期詞彙成長:以語料庫為本之研究

(1)

國立臺灣大學文學院語言學研究所碩士論文

Graduate Institute of Linguistics College of Liberal Arts

National Taiwan University Master Thesis

測量華語兒童早期詞彙成長:以語料庫為本之研究 Measuring Early Vocabulary Growth in

Mandarin-Speaking Children: A Corpus-Based Study

楊靜琛 Jing-Chen Yang

指導教授：張顯達博士謝舒凱博士 Advisor: Hintat Cheung, Ph.D.

Shu-Kai Hsieh, Ph.D.

中華民國 104 年 7 月

(2)

(3)

誌謝

這是一篇碩論，但是對在人生的旅途中迷路的我而言，這一路走來相當漫長，

途中甚至因為迷惘、猶豫不決而無法堅持下去決定暫時放下這篇論文。雖然現在

一樣在迷惘中，但是為了讓自己不要在未來覺得遺憾，我再次打開論文的word 檔，

一步一步將其完成。

這過程中要感謝許多人。首先要感謝兩位指導老師，張顯達老師和謝舒凱老師，

不斷的給我鼓勵、以及論文的方向和各種建議。感謝他們能在百忙之中抽空指導我這篇難產的論文。也要感謝呂佳蓉老師在大家都很忙碌的暑假期間來擔任我的口試委員，讓我不至於找不到口試委員無法完成口試。感謝指導老師們與口試委員在口試時給予的各種建議，讓這篇論文能稍稍再美好一點。

另外還要感謝前所長宋麗梅老師與萬能的美玲助教，謝謝她們在我茫然的米蟲階段時，正好提供我工讀的機會，讓我不用為生活費而煩惱，因為這樣一個新的契機讓我有了重新接觸人群並調整生活步調的機會，甚至因此覺得，不如就把論文寫完吧！

還要感謝我所有的老朋友、新朋友、語言所認識的大家、以前大學的好朋友們。

因為太多人，怕有遺漏就不一一列舉，感謝你們在我茫然、焦慮、苦惱、厭煩時，

願意聽我碎碎念，願意跟我一起吐槽，願意給我一些建議。有時候或許只是一句話，但是每一句話累積起來的能量，支持著我一路完成這篇論文。甚至要感謝我的新室友林妹妹和她的貓，因為有個相當新鮮的新鮮人讓我意識到至少要完成一件事，給自己一個交代，而她的貓的存在本身就是一種正能量，讓我能有好心情。

也要感謝我的父母。或許他們看不慣我的決定，也許不支持，但是至少也不反對，任由我自行蹉跎人生，慢慢摸索人生方向。甚至在我經濟拮据時，仍願意支援我這個還不能賺錢補貼家用的女兒。

再次感謝我生活中遇到的許多人，因為有你們，我才能堅持下來！

(4)

ii

中文摘要

本文旨在測量臺灣地區華語兒童早期詞彙成長，以語料庫為語料來源，分析兒童詞彙成長與詞彙組成。詞彙成長的測量項目為總詞彙量、各個詞類之詞頻與

其所佔的比例、名詞動詞比率、相異詞比率、以及 D 數值。詞彙組成的測量項目

為各個語義範疇和概念層次中名詞的數量。計算結果顯示詞彙在上述測量項目中呈現隨著年齡增加而發展之趨勢。

詞彙量隨年齡增長，各個詞類之詞頻與所佔比例顯示名詞與動詞較其他詞類早習得。名詞動詞所佔比例隨著年齡增加而遞減，其他詞類所佔比例則是遞增。

相異詞比率表明所有年齡層的兒童皆習得較多名詞與動詞，同時也較常使用這兩個詞類。量詞和法向詞的相異詞比率遞減，表示這兩個詞類在較大的年齡層中的

使用率逐漸增加。名詞動詞比率顯示出在19-24 個月時有名詞偏向，之後的年齡層

中則有動詞偏向。D 數值的增長呈現兒童逐漸增加的詞彙多樣性。

名詞的語義範疇經分析後得知最具體、最親近兒童生活的名詞最早習得，且最常使用，如人物名稱、工具類、動物類、食物類和交通工具類。較抽象、離兒童生活較疏遠的名詞，比較晚習得且使用率較低，如數字、形狀、顏色、以及自然現象類。名詞的概念層次經分析後可知兒童最先習得基本層次詞彙，其後才是上位層次詞與下位層次詞。而習得上位層次詞與下位層次詞的時間點則因語義範疇而不同。基本層次詞的使用率較上位層次詞與下位層次詞的使用率高。

在這些測量兒童詞彙成長的計算方法中，其中幾個測量項目或許可以在後續深入研究後，成為測量詞彙成長之指標。

關鍵詞：詞彙習得、名詞偏向、基本層次範疇、華語習得、詞彙多樣性、語料庫

(5)

English Abstract

This study aims to measure early vocabulary growth in Mandarin-speaking children from Taiwan with a corpus-based method. Children’s vocabulary growth was examined in aspects of vocabulary growth and vocabulary organization. Vocabulary growth was measured by computing vocabulary size, frequency and proportion of parts-of-speech, noun-verb ratio, type-token ratio, and D measure. Vocabulary organization was measured by computing the number of nouns in semantic categories and conceptual levels. Results of the measures showed a developmental trend with increasing ages.

Vocabulary size increases with age. Frequency and proportion of lexical categories suggest that nouns and verbs were acquired before other word classes. The proportion of nouns and verbs decreases with increasing age, while that of other lexical categories increases. Type-token ratio implies that children acquired more nouns and verbs and also used them very often in all stages. The declining TTR of classifiers and modals indicate that they were used more frequently in later stages. Noun-verb ratio reveals that a weak noun bias was found in 19-24 months, and a verb bias was found in later stages.

D values suggest that children’s lexicon became increasingly diverse.

Analysis of semantic categories reveals that the most concrete nouns and nouns which were the closest to children’s life were acquired earlier, such as people, tools, animals, and food, and vehicles. Nouns in these categories were also used frequently.

Nouns which were abstract and far away from children’s life were acquired later and used with a lower frequency, such as numerals, shapes, colors, and nouns of natural phenomena. Analysis of conceptual levels has shown that basic-level words were acquired first, followed by subordinate words and superordinate words. The timing of acquiring superordinate level and subordinate level words varies in different semantic categories. Basic-level words were used with a higher frequency than the other two levels.

All of the computations about vocabulary growth and vocabulary organization in this study have revealed general developmental trends of children’s early Mandarin vocabulary. Some of the measured items may become an index of measuring of vocabulary growth after further studies.

Key words: vocabulary acquisition, noun bias, basic-level category, Mandarin

acquisition, lexical diversity, corpus

(6)

iv

List of Figures

Figure 2.1: Family of curves of increasing diversity with increasing D values (McKee et

al., 2000). ... 14

Figure 3.1: The output of freq command. ... 34

Figure 4.1: The distribution of total word token frequency. ... 48

Figure 4.2: The mean of children’s cumulative vocabulary. ... 50

Figure 4.3: The line graph of word types in POS. ... 51

Figure 4.4: The line graph of word tokens in POS. ... 53

Figure 4.5: The line graph of TTR of POS. ... 55

Figure 4.6: The development trend of Noun/Verb ratio. ... 62

Figure 4.7: The development trend of D value. ... 63

Figure 4.8: Noun type proportion changes of conceptual levels. ... 72

Figure 4.9: Noun token proportion changes of conceptual levels. ... 73

Figure 4.10: The proportion of conceptual levels in the category of people. ... 74

Figure 4.11: The proportion of conceptual levels in the category of body parts. ... 76

Figure 4.12: The proportion of conceptual levels in the category of clothing. ... 77

Figure 4.13: The proportion of conceptual levels in the category of vehicles. ... 78

Figure 4.14: The proportion of conceptual levels in the category of tools. ... 79

Figure 4.15: The proportion of conceptual levels in the category of food. ... 80

Figure 4.16: The proportion of conceptual levels in the category of animals. ... 81

Figure 4.17: The proportion of conceptual levels in the category of natural phenomena and materials. ... 82

Figure 4.18: The proportion of conceptual levels in the category of toys. ... 83

Figure 4.19: The proportion of conceptual levels in the category of locations and buildings. ... 85

Figure 4.20: The proportion of conceptual levels in the category of spatial words. ... 86

Figure 4.21: The proportion of conceptual levels in the category of abstract nouns... 87

(10)

viii

List of Tables

Table 3.1: Children’s information in the HTC01 database. ... 25

Table 3.2: The information of samples in each age group. ... 25

Table 3.3: The mean age and standard deviation in each age group. ... 26

Table 3.4: The 28 POS tags. ... 27

Table 3.5: POS tags and examples of nouns and verbs used in the TCCM corpus. ... 32

Table 3.6: Noun/Verb ratios of different inclusion of nouns and verbs. ... 37

Table 3.7: Semantic categories in different studies. ... 41

Table 3.8: Examples in conceptual levels. ... 44

Table 4.1: The total number of main POS. ... 46

Table 4.2: The total word frequency and TTR in each age stage. ... 47

Table 4.3: Children’s cumulative vocabulary. ... 49

Table 4.4: The mean type frequency of POS in each age stage. ... 50

Table 4.5: The mean token frequency of POS in each stage. ... 52

Table 4.6: The mean TTR of each POS in each age stage. ... 54

Table 4.7: The mean proportion of POS types in each age stage. ... 56

Table 4.8: The mean proportion of POS tokens in each age stage. ... 56

Table 4.9: Children’s NVR. ... 61

Table 4.10: Children’s D values. ... 63

Table 4.11: The total numbers of nouns in semantic categories. ... 66

Table 4.12: The total number of nouns in conceptual levels. ... 67

Table 4.13: Five semantic categories with the most noun types in each age stage. ... 68

Table 4.14: The patterns of type proportion changes of semantic categories. ... 69

Table 4.15: The five most frequently used semantic categories in each age stage. ... 70

Table 4.16: The patterns of token proportion changes of semantic categories. ... 70

Table 4.17: The numbers of noun types of conceptual levels in each age stage. ... 71

Table 4.18: The numbers of noun tokens of conceptual levels in each age stage. ... 73

Table 4.19: The word frequency among conceptual levels. ... 88

Table 4.20: Acquisition sequence of conceptual levels. ... 90

(11)

Chapter 1 Introduction

1.1. Background

Language acquisition is the process that people not only acquire the ability to perceive and comprehend language, but also learn to produce and use language to communicate. The ability of successfully using language to communicate needs knowledge of phonology, morphology, syntax, semantics, and a large amount of vocabulary. The process of acquiring language and knowing how to use language has received a lot of attention in linguistics research as well as in the field of children’s development. There have been studies about how children produce sounds, how they learn vocabulary, how they construct sentences, and how they understand adult’s language. Among all topics of language acquisition, vocabulary growth is a major milestone to be charted in children’s early language development. It can serve as a predictor of children’s performances in later language skills (Hao, Shu, Xing, & Li, 2008). Hence, many researchers have investigated the acquisition of vocabulary from various aspects including lexical categories, semantic categories, and basic-level effect in early vocabulary.

Studies about the development of lexical categories in children’s language have revealed that nouns and verbs are acquired before other word classes. Some studies further found a universal noun bias in early vocabulary acquisition (Gentner, 1982;

Haryu et al., 2005; Hsu, 1996). Noun bias means that nouns are acquired first and the predominant words in early vocabulary. On the contrary, some researchers argue that noun bias is not universal and for Mandarin early vocabulary growth, a verb bias is

(12)

Tardif, 1996). The issue on noun-or-verb bias in early vocabulary is still not settled.

Studies about semantic categories of nouns have shown a different development tendency. Children produced device nouns frequently, due to the frequent inputs in children’s daily life (Peng & Chong, 2010). Sheng et al. (Sheng, Deng, Zhang, Liang, &

Lu, 2012) investigated vocabulary using MCDI, and they found that children acquired more onomatopoeia words than words in other categories (i.e. nouns, verbs, adjectives/adverbs, conventional game words, numeral words, quantifiers, interrogatives, pronouns, directional words, and time words). Kuo, Tsay, and Peng (2005) and Jiang (2000) have found that objects which children contact most often were usually most frequently produced words in children’s language.

In addition, categorization is an important cognitive process in which people group different objects together and form a class. A hierarchy of levels is found during the categorization process, and there is a common level of categorization, which is called basic-level category (Rosch, 1978). The theory of basic-level category was applied to the categorization of lexicon. Studies about basic-level effect on children’s early vocabulary have concluded that basic-level words were acquired earlier and used more frequently in our speech than superordinate words and subordinate words (Jiang, 2000;

Zeng & Zou, 2012; Li, 2014).

Previous studies in vocabulary development in Chinese investigated a specific issue or a particular theoretical aspect of vocabulary acquisition by using language samples they collected. A comprehensive study of Chinese children’s vocabulary development is yet to be found. Moreover, there has been no corpus-based report on early vocabulary growth in Mandarin-speaking children from Taiwan. Therefore, this study aims to use corpus data to examine the findings in previous researches, and to provide a complete scope of early vocabulary development of Mandarin-speaking

(13)

children before 4 years old in Taiwan.

1.2. Purpose of the study

The purpose of this study is to examine the development of children’s early Mandarin lexicon with a corpus-based method, and then to find out reliable measurement items of young children’s early vocabulary. In this study, children’s vocabulary development will be examined in two aspects. One is vocabulary growth which is to examine the amount of vocabulary. The other one is vocabulary organization which is to examine the contents of children’s early vocabulary. This study tries to provide evidence to support or oppose the findings in past studies of children’s early vocabulary acquisition. Thus, the two research questions are:

(1) What are the major changes in early vocabulary development of Mandarin-speaking children?

(2) How can early vocabulary development of Mandarin-speaking children be indexed?

The following analyses and discussions are supposed to provide evidence for or against the findings in previous studies, as well as to find out one or more items which can be used to measure children’s early Mandarin vocabulary. The results would reveal an estimated performance of Mandarin-speaking children’s early vocabulary acquisition.

A complete scope of early vocabulary development of Mandarin-speaking children before 4 years old is expected. It may provide information for other studies of the same scale and offer a direction for future studies in children language acquisition.

(14)

1.3. Organization of this study

The thesis consists of five chapters. Chapter 1 provides the background of this study, and it introduces the purpose of the study and the research questions. Chapter 2 is literature review, introducing methods of observing children language development in past studies and previous findings of children’s early vocabulary acquisition. Chapter 3 presents the methodology used in this study, which is a corpus-based analysis. In Chapter 3, the corpus used in this study is introduced first, including the reason of choosing this corpus, information about language samples examined in this study, the definitions of lexical categories, semantic categories, and conceptual levels used in the study. Besides, the items of measuring vocabulary growth and vocabulary organization are introduced. Chapter 4 presents the results of the measurements mentioned in chapter 3 and builds up the profile of children’s early lexicon. A general observation of the whole corpus is provided first, and then detailed results of all measured items are provided. Chapter 5 provides a general discussion based on the results, implication of this study, suggestion for future studies and conclusion. Abbreviations used in this study are listed below.

CHILDES: Child Language Data Exchange System

MCDI: MacArthur-Bates Communicative Development Inventory (Liu & Tsao, 2010) NVR: Noun/Verb ratio; N/V ratio

POS: Part-of-speech

TCCM: Taiwan Corpus of Child Mandarin TTR: Type/Token ratio

(15)

Chapter 2 Literature Review

2.1. Methods of exploring children’s early vocabulary

Various methods have been adopted for the study of children’s vocabulary development. In some studies, a wordlist was first constructed and validated with parents’ input (Fenson, Dale, Reznick, & Bates, 1994; Dale & Fenson, 1996). Parents were asked to check the wordlist to record words and expressions that children have acquired. Researchers also designed experiments to examine children’s performance of specific language knowledge (Haryu et al., 2005; Mervis & Crisafi, 1982; Zeng & Zou, 2012). Some other researchers also recorded children’s speeches on audio files, transcribe the collected data, and then analyze it (Hsu, 1996; Huang, 2009; Lee, 2014;

Tardif, 1996; Yeh, 2009). In this section, three methods of investigating children’s language are reviewed: parental report, experiment, and corpus-based study using CHILDES.

Parental report

Vocabulary development is an essential milestone in children’s early language development. It can be used to predict children’s performances of later language skills (Hao et al., 2008). Due to the importance of children’s early vocabulary, researchers have built norms for vocabulary measurements, among which the parental report has been the most widely used measurements. Parental report is less time consuming and easy to conduct, so researchers can easily collect a large number of children’s early words in a short time. Parental report needs parents to judge whether their children can

(16)

and Dale and Fenson (Dale & Fenson, 1996) were the first to do this study and provided a norm for the parental report which was called MacArthur-Bates Communicative Development Inventory (MCDI). It was used to investigate American children’s comprehension and production vocabulary between the ages of 0;8 (8 months) and 2;4 (2 years 4 months). Based on the American English version of MCDI, researchers have established MCDI in other languages and linguistic variants, such as British English (Hamilton, Plunkett, & Schafer, 2000), Spanish (Jackson-Maldonado, Thal, Marchman, Bates, & Gutierrez-Clellen, 1993), and Mandarin Chinese (Liu & Tsao, 2010).

Experiments

Some researchers may choose to conduct experiments to examine children’s language performance. The advantage of an experiment approach is that researchers can test an assumption by controlling variables. In this way, researchers can know which assumption can be applied to children’s acquisition. However, the disadvantage is that children’s language samples are elicited, and the elicitation data might be different from children’s natural speech. Since there are advantages and disadvantages of different research methods, researchers would use different methods or different subjects to repeatedly test an assumption.

For instance, in answering the question of a possible noun bias in early vocabulary acquisition, Gentner (1982), Hsu (1996), and Tardif (1996) counted the numbers of words in children’s spontaneous speech. On the other hand, Haryu and colleagues (2005) used an experimental method to investigate the noun bias in early acquisition. They conducted an experiment to investigate whether there is a “Noun Bias” in Mandarin Chinese, Japanese, and English preschoolers of three- and five-year-old. They compared the ease of fast-mapping novel nouns to a novel object and novel verbs to a novel action.

(17)

Their findings supported the universal “Noun Bias” view.

In order to exploring the basic level effect in children’s vocabulary, Jiang (2000) and Lee (2014) collected a number of spontaneous speech samples from children, and counted the number of words. On the other hand, Mervis and Crisafi (1982) conducted an experiment, and found that basic-level terms were more advantageous to children’s lexicon learning than superordinate terms and subordinate terms. Furthermore, Zeng and Zou (2012) have worked on their study with three methods, including controlled experiments, to explore the early development of category levels of Mandarin-speaking children. Their results showed that basic-level words were dominant in comprehension and production of early vocabulary.

Corpus-based method

A corpus is a collection of a large number of language transcriptions. Corpus database provides children’s spontaneous speech, which is exactly recorded from their real daily speech. The most well-known corpus of child language is Child Language Data Exchange System (CHILDES) (MacWhinney, 2000). It provides the transcription format, CHAT format, and analyzing tools, CLAN tools (MacWhinney, 2000). There are some advantages of using CHILDES. Researchers can conduct their researches with available data when it is difficult to have native speakers of other languages to be their subjects. They can make cross-linguistic comparisons to find out a general tendency of language development. On the other hand, the disadvantage of a corpus-based approach is that researchers cannot test certain assumptions particularly. What corpus data contains is a sample of children’s everyday language use. The collected language samples may be restricted to some conversation topics. Researchers interested in other

(18)

themselves.

2.2. Performance of children’s early vocabulary

2.2.1. Vocabulary size

Past studies of English-speaking children have found that few children produce any words before age one. Most children produce their first recognizable words in 15 months or so. They have approximately 100 to 600 distinct words at 2 years old. They have about 14,000 words in comprehension and fewer in production by 6 years old.

These numbers imply that children acquire words between 2 to 6 years old at a rate of nine to ten words a day (Clark, 1993). Hsu (1996) has reported the vocabulary size of Mandarin-speaking children before 6 years old in his study. Children had 260 words before 2 years old, 634 words before 3 years old, 771 words before 4 years old, 808 words before 5 years old, and 895 words before 6 years old. Yeh’s study (2009) has shown a mean of nouns and verbs: 166 words before 2 years, 402 words before 3years, and 478 words before 4years. Tsay and Cheng (2011) reported the vocabulary size of 8 children speaking Taiwanese Southern Min from approximately one and a half years old to 4 years old. The total vocabulary size of these children is smaller or just a little bit larger than 2000. The vocabulary growth rate of two children was also reported and was compared with Clark’s results. Clark (1993) discussed two young children's vocabulary growth. One child Keren (reported in Dromi 1987) produced up to 337 new words by 1;5 while the other child Damon produced up to 337 new words by 1;9. As for Tsay and Cheng’s study (2011), they found that a child reached 337 new words between 1;5 and 1;6, and another child reached 337 new words between 1;6 and 1;7.

(19)

2.2.2. Noun bias or Verb bias

A universal “Noun Bias” in young children’s vocabulary development has been debated heatedly for years. Gentner (1982) proposed Natural Partitions Hypothesis, stating that there is a preexisting perceptual-conceptual distinction between concrete concepts (nouns) and predicative concepts (verbs), and the distinction between nouns and predicate terms is based on this perceptual-conceptual distinction. The Natural Partitions hypothesis also holds that nouns belonging to concrete concepts are conceptually simpler or more basic than verbs and other predicates. Examining cross-linguistic data further evidenced that young children acquire nouns easier and earlier than verbs, and this “Noun Bias” is a universal phenomenon.

However, some studies of Mandarin Chinese and Korean vocabulary development argued against “Noun Bias”, suggesting that nouns are not always acquired first (Choi

& Gopnik, 1995; Tardif, 1996; Sheng, Deng, Zhang, Liang, & Lu, 2012). Based on the data collected in Beijing, Tardif (1996) argued against the universality of “Noun Bias”.

Ten Mandarin-speaking children participated in her longitudinal study, but only the second or third recording was analyzed when children’s mean age was 21 months. In order to know whether the different definition of nouns and verbs lead to various conclusions, she included several strict and broad definitions of nouns and verbs, object labels and action words, and nominals and predicates. The results of her sliced data suggested that Mandarin-speaking children produce more verbs than nouns in their early lexicon. She further suggested that linguistic and sociocultural input factors accounted for a “Verb Bias”. Mandarin verbs occur frequently in adult inputs and verbs are highlighted by occurring in salient positions. Furthermore, morphological simplicity has effects on the bias in children’s performance. The morphology of English nouns is

(20)

nouns. Unlike English, Mandarin is morphologically transparent. “Noun Bias” is not reinforced by Mandarin morphology., so nouns are acquired later. The input frequency account was further evidenced in Dhillon (2010) who examined the observational data of English-, Spanish-, and Mandarin-speaking children in the CHILDES database.

Children’s age ranged from 1 year and 7 months to 2 years and 11 months. The number of noun types, verb types and other types, the number of noun types versus the number of verb types, and the proportion of nouns divided by the sum of the proportion of nouns and verbs were computed. The results showed that Mandarin-speaking children exhibited a “Noun Bias” in the early stage (1;7 to 2;0) but no “Noun Bias” in the later ages. English- and Spanish-speaking children displayed a “Noun Bias” across all ages.

These results have supported Dhillon’s (2010) arguments that argument-dropping in Mandarin would make children receive more verbs from the adults’ inputs than children of other languages. These factors lead to the prediction that children will learn verbs more easily.

Nevertheless, Hsu (1996) has reported that Mandarin-speaking children produce more nouns (60%) than verbs (25%), supporting a “Noun Bias”. Yeh (2009) conducted a study about young children’s acquisition of nouns and verb in Mandarin Chinese in Taiwan. The results indicated that nouns are the major words which children acquire in their early ages, as well as that younger children use relatively more nouns whereas older children use relatively more verbs. It also suggests that children’s cognitive abilities and social interactions have influences on the early acquisition of nouns and verbs. In addition to observational studies, Haryu and colleagues (2005) had done an experiment to investigate whether there was a “Noun Bias” in Mandarin Chinese, Japanese, and English preschoolers of three- and five-year-old. They compared the ease of fast-mapping novel nouns to a novel object and novel verbs to a novel action. One of

(21)

the conditions was “bare verb condition” in which novel verbs were presented with no argument. There are morphological affixes in English and Japanese but not in Mandarin, thus making the condition a “bare word condition” in Mandarin. The results showed that both the 3- and the 5-year-olds in three languages could fast-map a novel noun to a novel object, but they could not fast-map a novel verb to its meaning properly until five years old. Moreover, the results of bare word condition in Mandarin showed that children tended to map a novel word to a novel object. These findings supported the universal “Noun Bias” view. They concluded that the difficulty of Mandarin verb learning was resulted from the lack of verb morphology and the argument-dropping property of Mandarin. When children encounter a novel word, they need clues to decide whether to map the novel word to an object or an action. However, argument-dropping is allowed in Mandarin, and Mandarin verbs are not morphologically inflected. The linguistic properties of Mandarin imply that the linguistic clues to decide to map a word to an object or an action are not always available. As a consequence, children rely on the universal “Noun Bias” to map a novel word to a novel object.

In a cross-linguistic corpus-based study, Liu and her collaborators (2008) examined the noun versus verb (N/V) ratio in types in English, Cantonese and Mandarin, found different results. They conducted two studies examining 13- to 60-months-old children’s data in the CHILDES database (MacWhinney, 2000). The first study selected a total of 72 files, and the second study used all data available from the corpora. Their results revealed that when averaged over all ages, English-speaking children showed a stronger

“Noun Bias”, but Cantonese-speaking children had a relatively weak “Noun Bias” and Mandarin-speaking children even had no “Noun Bias”. The same pattern was observed in adults’ inputs, and Mandarin-speaking adults even displayed a “Verb Bias”.

(22)

The authors believed that adults’ inputs might have influence on children’s vocabulary learning. As what Tardif (2006) and Tardif, Gelman, and Xu (1999) have stated, Mandarin-speaking parents focused on verbs when talking to their children and the verb use was highly specific in Mandarin. On the contrary, English-speaking parents focused on nouns and used more general purpose verbs when talking to their children.

The controversy of the “Noun Bias” in Mandarin remains after continuous efforts.

Previous studies investigating this issue in Mandarin by using the CHILDES database found the “Verb Bias”, challenging the universal “Noun Bias” view. To clarify the dialectal variation in the so-called “Noun Bias” in Mandarin, this study examines the use of nouns and verbs in young Mandarin-speaking children in Taiwan. Moreover, Liu and her colleagues (2008) and Tardif (1996) concluded that there is no “Noun Bias” for Mandarin children using data from the CHILDES database. It is possible to reach a different conclusion using a different corpus. Besides, although Tardif (1996), Dhillon (2010), and Haryu and colleagues (2005) illustrated the biases in terms of morphology and argument structure of Mandarin, their explanations were different. Therefore, if there is a “Noun Bias” observed in this study, it supports Haryu and colleagues' explanation (2005); on the contrary, if there is no “Noun Bias” observed, it supports the explanation of Tardif (1996) and Dhillon (2010).

2.2.3. Lexical complexity in acquisition

Word frequency provides general information about the collected language samples. Word type frequency represents the number of different words children know, and word token frequency represents the total number of items children produced.

However, producing many types is not equal to using all the types frequently;

meanwhile, producing many tokens does not mean producing many various word types.

(23)

Thus, vocabulary diversity needs to be examined, and measurements of vocabulary diversity are frequently used in language research. One of the measurements based on the ratio of the number of different words (Types) to the total number of words (Tokens) is known as the type-token ratio (TTR). A high ratio of TTR may represent a rich vocabulary diversity which means that children produce relatively more different types in their speech. On the contrary, a low ratio of TTR may represent a weak vocabulary diversity which means that children produce fewer different types or repeatedly produce the same types in their speech.

Many measures of vocabulary diversity have been based on the type-token ratio.

Unfortunately, Heaps’s law (Heaps, 1978) predicts that the more words (tokens) a sample has, the less possible it is that new words (types) will show up. That is to say, the first few tokens in a sample are likely to be new types, but later words are likely to be types that have been used before. Thus, measures based on TTR are likely to be affected by the sample size. The TTR values are lower in samples with more tokens and vice versa (Tweedie & Baayen, 1998).

In order to fix the sample size problem of TTR, another measure of lexical diversity was invented. A program called vocd was developed to calculate D (McKee, Malvern, & Richards, 2000). This method depends on the analysis of the probability of a new word appearing in longer and longer samples. The analysis leads to a mathematical model of how TTR interacts with token size. By comparing the mathematical model with empirical data in a transcript, it provides a measure of lexical diversity which is called D. The formula of the model is the following equation.

TTR = 1 + 2 − 1

(24)

produ rando depic preve Besid

progr proce extra repla 100 s

N is the nu uce a theor om samples cted in Figu ents the flaw des, it uses

Figure 2.1 (McKee e

D-measure ram (MacW edure of ho acts many acement). T subsamples

umber of to retical curv s. This equa ure 2.1. Th w of TTR.

all the word

1: Family o et al., 2000)

e can be ca Whinney, 2 ow the pro subsamples The sample s are extract

okens. TTR ve which m ation yields e calculatio It avoids be ds produced

of curves of ).

alculated by 2000; McK ogram vocd

s of varyin size is from ted for each

is type-tok most closely s a family o on of D-me eing highly d by the par

f increasing

y using voc Kee et al.,

d calculates

ng sample s m 35 tokens h sample siz

ken ratio. D y fits the em of curves wi easure is sti y affected by

ticipants.

g diversity

cd comman

2000). Th

D values sizes with s to 50 toke ze. Second,

is a part of mpirical TT ith different

ll based on y the token

with increa

nd provided e following in a transc random sam ns (N = 35 the TTR of

f this formu TR curve i

t values of n the TTR, size of sam

asing D val

d by the C ng steps ar

cript. First, mpling (wi to N = 50) f each subsa

ula to in the D, as but it mples.

lues

CLAN e the

vocd

ithout ), and ample

(25)

is calculated, and the average TTR for each sample size is calculated to represent that point of the curve of TTR of that transcript. Third, the software finds the best fit of this empirical curve based on the TTR and the token size, and obtains the value of D-measure. The average D value of subsamples of varying sample sizes is calculated, and the best-fit D values is obtained with the least square difference method. Finally, repeat step 1 to step 3 for twice. The average best-fit D value is the best D value of that transcript (see McKee et al., 2000, for a complete flow chart). A high value of the D-measure reflects high lexical diversity, and a low value means low lexical diversity.

The calculation of D-measure takes different sample sizes into consideration, so it is proven to be a more valid and reliable measure of lexical diversity (MacWhinney, 2000;

McCarthy & Jarvis, 2007; McKee et al., 2000).

Liu and her colleagues (2008) used D-measure to quantify vocabulary diversity of speech samples in their cross-linguistic study about early lexical development. They found both language and age had significant main effects on lexical diversity.

Mandarin-speaking children had their mean D value of 44.21. They also found that differences between any two age groups are significant, except for the difference between 25-36 months old and 37-48 months old. The results reveal that children’s speech becomes more diverse with increasing ages.

2.2.4. Performance in semantic categories

Peng and Chong (2010) have conducted a study of early acquisition of nouns from two Mandarin-speaking children age from 1 to 3 years old. They have categorized nouns into kinship terms, organs, clothing, device terms, vehicles, food, animals, natural objects, and colors. Their study has indicated that children’s early noun acquisition is

(26)

category of device nouns, such as 燈 (dēng, lamp), 筆 (bǐ, pen) and 球 (qiú, ball), more frequently than nouns of other categories. On the other hand, the production of nouns in categories like kinship terms, organs, natural objects and colors is relatively few. The high frequency of producing device nouns is due to the frequent inputs in children’s daily life. Besides, children’s cognitive ability also has effects on their language acquisition, so children acquire color terms later until three years old.

Sheng et al. (2012) examined Chinese young children’s Mandarin development in Nanjing, using a revised Chinese version of MCDI (Liang et al, 2001). The study was done with 326 toddlers of two age stages, 14-16 months and 24-26 months. They have found that toddlers at stage 1 could express 42.09 words (SD = 64.43), and they acquired more onomatopoeia words than words in other categories (i.e. nouns, verbs, adjectives/adverbs, conventional game words, numeral words, quantifiers, interrogatives, pronouns, directional words, and time words). Besides, they could express more verbs than nouns. The results of stage 1 further showed that the age of 15 months was found to be the critical period of producing nouns, verbs, adjectives/adverbs, conventional game words and onomatopoeia.

Likewise, Lee (2014) has found in her study that animals, tool, and food contributes the most proportion on nouns in early Hakka vocabulary acquisition. She also concluded that the process of acquiring nouns, such as personal pronouns, body parts, clothing, tools, food, and animals, coincided with cognitive development.

Additionally, children acquire nouns which are familiar and close to them first, and acquire nouns which are unfamiliar and far away later. Besides, children even invented new words for colors, for example, the strawberry color.

Kuo, Tsay, and Peng (2005) also classified Taiwanese nouns into 22 categories which were adapted from Jiang’s (2000) classification. Kuo et al. (2005) compared their

(27)

results with Jiang’s (2000) results, and they have found that categories in which nouns occurred most frequently were overlapped in two studies, including animal, food, body parts, natural phenomenon or substances, fruits and vehicles. Thus, it seems that objects which children contact most often are usually most frequently produced in children’s language no matter in which language.

2.2.5. Basic level effects

Human beings can classify objects of similar features into one category in order to make the world less complex, and this process is called “categorization” (Bancroft, 1995). Based on personal experiences and social background, people categorize objects differently. Generally speaking, objects can be categorized at different hierarchical levels, and there exists a common basic-level at which human beings tend to represent objects in the world. The common level of categorization is called basic-level category (Rosch, 1976; Rosch, 1978). Besides, Rosch et al. (1976) proposed a three-level categorization: basic level, superordinate level, and subordinate level based on the prototype theory (Rosch, 1973; Rosch & Mervis, 1975).

(1) Superordinate category: objects which are the most general and abstract, with fewer common features and lower resemblance among the members.

(2) Basic-level category: objects which are concrete and share more common features among members. Members in one class have similar shapes and can be distinguished from those in other classes with other features.

(3) Subordinate category: objects which are the most specific and concrete.

Members have higher within category resemblance and many individual features to distinguish them from the contrasting subordinate category.

(28)

category are the most concrete objects, the shortest lexicon, the most useful level of classification, the most frequently used terms, and the first acquired terms for children.

Rosch and her colleagues (1976) also found that children as young as three years old have already mastered the basic-level category. Even two-year-old children have had basic-level categories, although the members of basic-level are different from adult’s concepts. The discrepancy between children and adults becomes smaller when children increase their lexicon and social experiences. In the experiment of Mervis and Crisafi’s study (1982), they found that basic-level terms are more advantageous to children’s lexicon learning than superordinate terms and subordinate terms.

Besides, Jiang (2000) extended the basic level effect to examine Mandarin nouns in adults’ and children’s lexicon. She uses several criteria to examine the basic-level effect in Mandarin, such as basic-level lexicon should be at the central level, concrete, shortest, highly frequently-used, high derivative, and early acquired. Based on these criteria, she analyzed word derivativity, word frequency and the learning order of words.

She found that children as young as 11 to 24 months old had 70% lexicon which was basic-level words, followed by subordinate words and superordinate words. Basic-level words were the most derivative and highly frequently used than the other two levels.

She further found that basic-level words which were related to children’s daily life had much higher derivativity, and much higher frequency of use, such as basic-level words about animals, food, toys and transportation vehicles. In terms of toys and fruits, superordinate words were even more frequently used than basic-level words.

However, Jiang (2000) took only eight semantic categories into consideration and only selected one word to represent each category of each level when discussing the derivativity and the frequency of children’s use. The eight categories are animals, plants, food, fruits, clothing, toys, furniture, and transportation vehicles. Her analysis was not

(29)

comprehensive enough, so more semantic categories are included in this study to explore the basic level effect on Mandarin vocabulary from young children.

There is a study of basic-level effect on Mandarin vocabulary with various research methods. Zeng and Zou (2012) have conducted their study with three methods:

controlled experiments, longitudinal case data, and Zhou’s corpus data to investigate the early development of category levels of Mandarin-speaking children. Their results showed that basic-level words were dominant in comprehension and production of early vocabulary. Basic-level words were acquired first, followed by subordinate words and superordinate words. They further found a correlation between the development of categorization and language development. The development of categorization ability promotes children’s early language development.

A study of basic-level effect on another language, Hakka, also reached the same conclusion. In Lee’s study (2014), her results have shown that a large number of Hakka nouns were basic-level words, subordinate words were fewer than basic-level words, and superordinate words were the least. In terms of the frequency of use, basic-level words were used with a higher frequency than superordinate words and subordinate words. Adopting Jiang’s theory of the ability of deriving new words (Jiang, 2000), basic-level nouns had the highest ability of deriving new words. Basic-level nouns of the category of tools had the best ability to derive new words. Besides, she also discovered that the timing of acquiring Hakka nouns of the superordinate level and subordinate level categories differs according to the types of nouns.

(30)

Chapter 3 Methodology

This chapter introduces the source of language samples, and how they were analyzed. The language samples were archived in Taiwan Corpus of Child Mandarin (Cheung, Chang, Ko, & Tsay, 2011). The analyses focused on examining vocabulary growth and vocabulary organization. Vocabulary growth is defined as how vocabulary grows in the aspect of quantity. It was measured by computing the number of words, type/token ratio, D measure, and noun/verb ratio. Vocabulary organization is defined as how vocabulary grows in the aspect of word meanings. It was measured by computing the number of noun types in different semantic categories and the three levels of categorization.

3.1. About the TCCM corpus

3.1.1. Taiwan Corpus of Child Mandarin

The language samples examined in this study are from the Taiwan Corpus of Child Mandarin (Cheung, Chang, Ko, & Tsay, 2011), hereafter TCCM¹. TCCM is a recently-built corpus which contains scripts transcribed from Mandarin-speaking children’s language samples from past studies conducted in Taiwan. TCCM is a language data exchange system which aims to provide an open access for general public and researchers, and to use a standardized format to transcribe and code the language samples collected in Taiwan. This corpus follows the standards of the Child Language

1

TCCM website: http://taiccm.org/

(31)

Data Exchange System (CHILDES) (MacWhinney, 2000), the most important child language system in the world, using its CHAT format and CLAN tools (MacWhinney, 2000). Besides, TCCM is a browsable database, so researchers can directly do analyses on the website.

TCCM has contained 362 files. Children’s age ranges from one year and five months (1;5) to eight years and five months (8;5). There are several different types of language samples in this corpus for different research interests. It includes language samples not only from typically developing children but also from children with specific language impairment. Besides, children’s language samples are not limited to only daily conversation. The data includes samples from parent-child book-reading, child story-reading, and children’s spontaneous speech. In TCCM, the HTC01 (Cheung, 1998) database provides data from children’s spontaneous speech; the HTC02 (Cheung, 2003) database contains samples from children with specific language impairment, but this database has not been accessible yet; the CJC01 (Huang, 2009) database provides data from parent-child book-reading conversation; the CJC02 (Chang, 2003) database provides data from adult-child conversation. In order to prevent possible effects from genre types and effects from the difference between typically developing and impaired children, the language data examined in the present study was limited to only one specific type. That is to say, only language samples from spontaneous conversation produced by typically-developing children, the HTC01 database, were examined.

The transcription has been processed. The utterances in the transcripts have been segmented and part-of-speech has been tagged onto words. The standards of word segmentation and part-of-speech tags are mainly based on Segmentation Principle for

Chinese Language Processing (Chinese Knowledge Information Processing Group,

(32)

Information Processing Group in Academia Sinica Institute of Information Science (CKIP Group, 1998a; CKIP Group, 1998b). It also refers to Elementary School

Children’s Common Words Report (National Languages Committee, 2000), The POS Guidelines for the Penn Chinese Treebank (Xia, 2000a), The Segmentation Guidelines for the Penn Chinese Treebank (Xia, 2000b). It concerns the properties of

Mandarin-speaking children’s language development as well. Thus, the processed language data in TCCM may provide researchers with much clearer data and a more convenient way for searching or follow-up analyses.

3.1.2. Use TCCM rather than CHILDES

In the field of corpus-based studies about child language, CHILDES has been the most important and well-known corpus, but it is not the best choice in the present study.

The three disadvantages of using CHILDES are regional difference of Mandarin, inconsistent transcription system, and insufficient parts-of-speech information.

The first disadvantage of using CHILDES is the regional difference of Mandarin.

Although Mandarin database is also provided in CHILDES, most of the data are recorded in Beijing and Nanjing, China, only a few data are from Hsinchu, Taiwan. As a result, many corpus-based studies about child language development of Mandarin have been based on data in the Beijing Mandarin database in CHILDES. For example, Tardif (Tardif, 1996) has done her study about the issue of the predominance of noun or verb in early language development with the Beijing Mandarin. Liu and her collaborators (Liu et al., 2008) conducted a cross-linguistic corpus-based study, using the Beijing Mandarin database as well. However, there are some dialectal variations among Mandarin from different regions. CHILDES may be a choice for those who are interested in Beijing Mandarin or Nanjing Mandarin, but it provides little help for

(33)

researchers investigating Mandarin spoken by children in Taiwan. On the other hand, TCCM collects language data from Mandarin-speaking children in Taiwan. Therefore, researchers interested in Mandarin in Taiwan now have a choice of using TCCM.

Secondly, using CHILDES has the problem of inconsistent transcription system.

The Mandarin transcripts are transcribed either in Chinese characters or in Pinyin system, the Standard Mandarin Romanization system (ISO 7098, 1982). Using Pinyin, sounds of Chinese characters are represented in Roman letters. However, a serious problem exists in the scripts transcribed in the Pinyin system. Pinyin system only records sounds; as a result, homophones are represented identically. Since there are more characters than sounds in Chinese, it is difficult to differentiate homophones when Chinese characters are transcribed in Pinyin system. For example, the sound zuò could represent 做 (do/make), 坐 (sit), and 座 (seat). A researcher could hardly determine which of several characters should be applied when a single character in the form of Pinyin exists without its context in a sentence. In addition to the difficulty of determining the appropriate semantic meaning of the homophone, a researcher has to deal with the problem of searching for a homophone of a specific meaning. Therefore, investigators using Mandarin database in CHILDES have to rewrite the transcripts into Chinese characters. On the contrary, the language samples in TCCM have been transcribed in Chinese characters rather than in Pinyin system. An investigator can examine the transcripts directly without the step of converting Pinyin system into character system. Likewise, they can easily recognize the characters and the semantic meanings of the homophones with the help of Chinese characters.

Lastly, researchers who have to take lexical category into consideration in their study would find that the Mandarin database in CHILDES does not provide sufficient

(34)

before analyses. Sentences have to be segmented, and POS information has to be added onto each word. Nevertheless, some of the transcripts in the Mandarin database in CHILDES do not provide parts-of-speech information. Investigators have to tag POS onto each word by themselves before their analyses. Rather, people using TCCM do not have to do any preprocess. All of the language samples in TCCM have already been segmented, and POS tags have been added onto all words of all scripts as well.

In brief, choosing CHILDES would have to deal with three disadvantages, while choosing TCCM would not. The most important of all, the purpose of the present study is to investigate language acquisition of Mandarin-speaking children in Taiwan, and thus TCCM is better than CHILDES.

3.1.3. Data used in this study

In order to explore language development from a longitudinal observation, language samples of children’s spontaneous speech in this study were retrieved from the HTC01 database (Cheung, 1998). The files were collected from 10 young children.

Children’s age range was from 1;5 (1 year and 5 months) to 4;3 (4 years and 3 months) (mean = 32.76 months, SD = 8.25 months). The following Table 3.1 is the information of each child in the HTC01 database, including their age range, recording sessions, number of child utterances, adult utterances and all utterances. These language samples were audio-recorded at the child’s home once a month. The data was transcribed with CHAT format using the CLAN analyzing tools (MacWhinney, 2000). Most children have more than 10 recording sessions for more than one year. According to the CHAT transcription format, utterances should end with an utterance terminator. The basic utterance terminators are the period, the exclamation mark, and the question mark. Thus, each line in a transcription contained one utterance which was also one intonation unit.

(35)

Table 3.1: Children’s information in the HTC01 database.

Child Age range Sessions Child utterance Adult utterance Total utterance

CHENG 3;1－3;11 11 3564 5246 8810

CHOU 2;1－3;4 16 5253 9131 14384

CHW 3;6－4;3 9 2685 3649 6334 JC 2;2－3;5 14 5244 8738 13982 Pan 1;7－3;9 19 3700 7028 10728

WANG 2;5－3;4 12 3192 4471 7663

WU 1;7－2;10 12 2854 8835 11689

WUYS 2;7－3;10 10 1879 5704 7583

XU 1;6－2;5 11 2726 3948 6674

YANG 1;5－2;9 13 2745 7131 9876

In the HTC01 database, there are 127 files of spontaneous speech from 10 typically-developing children. In order to observe the developmental trend of different age stages, children’s language samples were divided into 7 groups with six months as a scale. Children of 25 months to 42months old contribute the most files. Three file are from children younger than 19 months old, and another three files from a child older than 48 months old. These six samples are insufficient to form groups of 13-18 months and 49-54 months as other age groups with more than 10 samples, so these six files were not included in the following analyses. Therefore, there are a total of 121 files examined in the study. The following Table 3.2 provides the number of samples analyzed in each age group, the number of children in each age stage, and the number of utterances of children and adults.

Table 3.2: The information of samples in each age group.

Age Samples Child Child utterance Adult utterance Total utterance

19-24 months 19 4 4345 10957 15302

25-30 months 31 7 9537 19167 28704

31-36 months 28 7 6183 11682 17865

37-42 months 29 7 8360 14000 22360

43-48 months 14 4 3589 4812 8401

(36)

As seen in Table 3.2, the files are not equally distributed in age groups. Most files are in the group of 25-30 months, 31-36 months, and 37-42 months. The other two age groups consisted of 19 files in 19-24 months and 14 files in 43-48 months. There are a total of 32014 child utterances and 60618 adult utterances in these 121 files. Children’s mean age and standard deviation in each age group are provided in Table 3.3.

Table 3.3: The mean age and standard deviation in each age group.

Age range Mean age SD

19-24 months 21.4 1.83

25-30 months 27.6 1.78

31-36 months 33.2 1.79

37-42 months 39.1 1.62

43-48 months 45.3 1.49

3.1.4. Lexical categories

Lexical category is also called part-of-speech. The standard of tagging part-of-speech (POS) in the TCCM database is mainly adapted from Segmentation

Principle for Chinese Language Processing provided by Chinese Knowledge and

Information Processing Group of the Academia Sinica (CKIP Group, 1998a; CKIP Group, 1998b). There are 28 POS tags used in coding the language samples. The 28 POS tags and the word class they represent are listed in Table 3.4 below. In subsequent analyses, two lexical categories which are not Mandarin lexicon are excluded: foreign words (CS), most of which are Taiwanese Southern Min, and unknown words (CHM).

Word frequency of each POS is presented in Table 1 in Appendix 3. A total number of 4026 word types and 75739 word tokens are included in the analysis.

(37)

Table 3.4: The 28 POS tags.

Tag Part of Speech Tag Part of Speech

ADV Adverb Npro Pronominal ADJ Adjective Nt Time noun

ASP Aspect marker NEG Adverb of negation

CONJ Conjunction ONM Onomatopoeia

DE

Nominal marker, Adverbial marker, Complement marker

PREP Preposition DT Determiner QN Quantifier

INT Interjection SFP Sentence final particle CS Foreign words CHM Unknown word IDM Idiom Va Stative verb CL Classifier/measure word Vc Shi‐word MOD Modal Vi Intransitive verb Nn Common Noun Vt Transitive verb

Nloc Localizer Vr Reduplicated verb

Nppn Proper noun WH Interrogatives “WH‐”

In subsequent analyses, one of the measures of vocabulary growth is based on noun-verb ratio. However, as what Tardif (1996) stated, the inclusion of nouns and verbs is too broad and controversial. The classification of nouns and verbs may be different due to different standards in various studies. Thus, the definitions of these two word classes in TCCM are illustrated in detail. As for other parts-of-speech, the corresponding examples are provided in the Appendix 1. Nouns and verbs can be divided into some subcategories according to their meanings and functions. Noun category in this corpus includes common nouns, localizers, proper nouns, pronominal, and time nouns. Verb category is composed of stative verb, shi-word, intransitive verb, transitive verb and reduplicated verb.

Noun: Nouns are tagged as Nn. Nouns includes all countable and uncountable

concrete nouns, abstract nouns, collective nouns, and children’s unique forms. Example (3-1) lists the common nouns. Example (3-2) lists two children’s unique forms.

(3-1) 桌子 (zhuōzi, table), 水 (shuǐ, water), 香氣 (xiāngqì, aroma)

(38)

Localizer: Localizers are tagged Nloc, including location words which indicate the

relative position of objects. A noun with a postposition is viewed as localizer as well.

The following example (3-3) lists location words. Example (3-4) demonstrates that the postpositions 裡 (lǐ) and 上 (shàng) are attached to a noun to indicate the location. The whole phrase functions as a localizer.

(3-3) 中間 (zhōngjiān, middle/center), 西北 (xīběi, northwest) (3-4) 杯子裡 (bēizilǐ, in the cup), 地上 (dìshàng, on the ground)

Proper noun: Proper nouns are tagged as Nppn, including all proper names such

as the name of an individual person, place, etc., except for time-related names. Example (3-5) lists the name of a person and the title of a book. In addition, country names and city names are also proper nouns, as seen in example (3-6).

(3-5) 余光中 (yúguāngzhōng, Yu Kwang-Chung), 詩經 (shījīng, Book of Songs) (3-6) 西班牙 (xībānyá, Spain), 台北 (táiběi, Taipei)

Pronominal: Pronominals are tagged as Npro, including personal pronouns,

reflexive pronouns, demonstrative pronouns, and generic pronouns. The examples of pronominals are listed below.

(3-7) personal pronouns: 我 (wǒ, I/me), 你 (nǐ, you), 我們 (wǒmen, we/us) (3-8) reflexive pronouns: 自己 (zìjǐ, oneself)

(3-9) demonstrative pronouns: 這裡 (zhèlǐ, here), 這邊 (zhèbiān, here/this side) (3-10) generic pronouns: 之 (zhī, it/them), 其 (qí, that)

Time noun: Time-related nouns are tagged as Nt, including the name of a

historical time and a repeated time. The name of a historical time means the name of a dynasty and the reign title, as seen in example (3-11). The name of repeated time includes dates, seasons, and festivals, as seen in example (3-12).

(39)

(3-11) 清朝 (qīngcháo, the Qing Dynasty), 西元 (xīyuán, Anno Domini/ A.D.) (3-12) 春天 (chūntiān, spring)

Stative verb: Stative verbs are coded as Va which are defined as describing states

of affairs, rather than actions. Stative verbs usually have only one argument. The items in the example below are stative verbs. It shall be noted that sometimes stative verbs such as 好 (hǎo, good), 新 (xīn, new), and 漂亮 (piàoliàng, beautiful) should be classified into adjective when it precedes a noun.

(3-13) 大 (dà, big), 高 (gāo, high), 浪漫 (làngmàn, romantic),辛苦 (xīnkǔ, toilsome), 豐富 (fēngfù, abundant)

The coding of stative verbs in the TCCM database is different from the coding in the CHILDES database. In the TCCM database, some stative verbs, such as 漂亮 (piàoliàng, beautiful), are tagged as Va, while it is viewed as an adjective in the CHILDES database. This coding standard is applied to both Mandarin and Cantonese language samples in the CHILDES database. That is, what is coded as Va in the TCCM database is tagged as an adjective in the CHILDES database, and such a difference may lead to inconsistent results of some analysis such as computing the N/V ratios.

Shi-word: Shi-words are tagged as Vc, including the word 是 (shì, copula verb

be), 像 (xiàng, resemble), and 等於 (děngyú, equal). It is noteworthy that the word 像 (xiàng) is polysemy. 像 (xiàng) or 好像 (hǎoxiàng) also functions as a preposition or an adverb. Thus, 像 (xiàng) is only viewed as a shi-word, when expressing the meaning of equal or having the function of the be verb.

Intransitive verb: Intransitive verbs are tagged as Vi, including verbs which

cannot take a direct object, including intransitive dynamic verbs, motion verbs, verbs in

(40)

verb-object construction with locations as the object, and weather verbs. The following examples are intransitive verbs.

(3-14) 開會(kāihuì, to hold a meeting), 跑 (pǎo, run), 逛街 (guàngjiē, go shopping), 下雨 (xiàyǔ, rain)

In addition, some verbs need two arguments semantically, but the objects do not follow the verb directly. Instead, the objects need to follow prepositions or to be fronted.

This kind of verbs is also classified as intransitive verbs. For instance, 求婚 (qiúhūn, propose) and 心動 (xīndòng, touch one’s heart) are also viewed as intransitive verbs.

Transitive verbs: Transitive verbs are tagged as Vt, including verbs which usually

need two arguments semantically and can take a direct object. Examples of transitive verbs are listed below. Besides, the verb 要 (yào, want) and the existential verb 有 (yǒu, have) are also viewed as transitive verbs.

(3-15) 打 (dǎ, hit), 進 (jìn, enter), 吃 (chī, eat), 買 (mǎi, buy), 使用 (shǐyòng, use), 放 (fàng, put), 去 (qù, go), 猜想 (cāixiǎng, guess)

Reduplicated verb: Reduplicated verbs are tagged as Vr. The reduplicated verbs

occur in some patterns. The patterns and examples are listed below.

(3-16)

AA: 走走 (zǒuzǒu, walk), 看看 (kànkàn, look at) A 來 A 去: 跑來跑去 (pǎoláipǎoqù, run around)

A 上 A 下: 跳上跳下 (tiàoshàngtiàoxià, jump up and down) A 一 A: 敲一敲 (qiāoyìqiāo, knock again and again)

越來越A: 越來越胖 (yuèláiyuèpàng, fatter and fatter)

一面A 一面 B: 一面說一面笑 (yímiànshuō yímiànxiào, saying and smiling)

(41)

一邊A 一邊 B: 一邊走一邊吃 (yìbiānzǒu yìbiānchī, walking and eating) The coding of reduplicated verbs is the same as other POS, attaching the tag onto each word directly. For example, 走走 (zǒuzǒu, walk) is coded is “Vr|走走”; 跑來跑去 (pǎoláipǎoqù, run around) is coded as “Vr|跑來跑去”; 跳上跳下 (tiàoshàngtiàoxià, jump up and down), is coded as “Vr|跳上跳下”. However, it is noted that the “A 一 A”

pattern is coded as “V|A Vr|一 A” and the “越來越 A” pattern is coded as “Vr|越來越 V|A”. For example, 敲一敲(qiāoyìqiāo, knock and knock) is coded as “Vt|敲 Vr|一敲”;

越來越胖 (yuèláiyuèpàng, fatter and fatter) is coded as “Vr|越來越 Va|胖”. As for more complicated word phrase, the phrase is segmented and coded separately as seen in the following example.

(3-17)

From file "HTC01_CHW402_08.cha"

*WAN: 你一邊坐他有沒有一邊叫 ? nǐ yìbiā zuò tā yǒu méiyǒu yìbiānjiào Is he barking at you while you sit down?

%mor: Npro:2sg|你 Vr|一邊坐 Npro:3sg|他 Vt|有 NEG|沒有 Vr|一邊 叫 ?

The following Table 3.5 provides the POS tags and examples of nouns and verbs used in the TCCM corpus. Because different inclusion of nouns and verbs may lead to debates on noun bias or verb bias in early Mandarin acquisition, a broad and strict inclusion of nouns and verbs are taken into consideration when calculation N/V ratio.

Broad noun includes all subcategories of nouns; while strict noun includes only common nouns Nn. Broad verb includes all subcategories of verbs, while strict verb excludes stative verbs Va.