Chapter 2 Literature Review
2.3 Corpus Resources
2.3.1 Chinese Lexicon Profile
The Chinese Lexicon Profile (henceforth CLP) is a research project launched at the LOPE lab
at National Taiwan University.5 The project aims to build a large-scale, open lexical database platform for Chinese monosyllabic to trisyllabic words used in Taiwan. By incorporating behavioral and normative data in the long term, the CLP will allow researchers across disciplines to explore different statistical models in search of the variables that determine performance in lexical processing tasks, and to train and verify computational simulations.6
In the initial design of the CLP, each word is presented with values for variables at different linguistic levels: orthography, phonetics, morpho-syntax, semantics and word frequency. Most characteristics of the words in the CLP have been gathered from existing Chinese corpora and lexical resources. These resources are referred to by their abbreviations below; for their full names, see the List of Abbreviations at the beginning of this thesis.
The CLP currently contains 204,922 Chinese words.
Because of this large quantity, the data are temporarily separated into five files,
5 The project is inspired by the English Lexicon Project (ELP, Balota et al., 2007) and its French counterpart (FLP, Ferrand et al., 2010), and designed to be compatible with these resources for the purpose of cross-linguistic comparison. More details will be available at http://lope.linguistics.ntu.edu.tw/clp.
6 The author undertook most of the tasks at the first stage of development, which is introduced in this subsection.
whose columns are shown in Table 2.2. As displayed in the table, all the files share three identical columns: the identification number of each word, the word form in the Sinica Corpus, and the word form in Chinese Wordnet (Huang & Hsieh, 2010). It should be noted that in the Chinese Wordnet, Chinese words are represented in units of lemmas. If a word has distinct pronunciations or origins, it is represented by different lemmas. For instance, the word 查 has two pronunciations, cha2 and zha1, and thus two lemmas, 查 1 and 查 2. However, the other lexical resources subsumed in the CLP do not distinguish lemmas but provide information at the word-form level. To align the Chinese Wordnet data with the other resources in the CLP files, the lemmas of a word are merged into one word form. In other words, lexical information in the CLP is provided on the basis of word forms. Aside from the three identical columns, the remaining columns in each file store information on orthography, phonetics, morpho-syntax, semantics or word frequency; the details are explained below.
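The merging of Chinese Wordnet lemmas into word forms can be sketched as follows. This is a minimal illustration rather than the CLP's actual procedure: it assumes that a lemma is written as the word form followed by an index (as in 查 1 and 查 2), and the function names `lemma_to_word_form` and `merge_lemmas` are hypothetical.

```python
import re
from collections import defaultdict


def lemma_to_word_form(lemma):
    """Strip a trailing lemma index, e.g. '查 1' -> '查' (assumed notation)."""
    return re.sub(r"\s*\d+$", "", lemma)


def merge_lemmas(lemma_counts):
    """Sum per-lemma counts (e.g. sense counts) into per-word-form counts."""
    merged = defaultdict(int)
    for lemma, count in lemma_counts.items():
        merged[lemma_to_word_form(lemma)] += count
    return dict(merged)


# Illustrative values only; these are not real CWN counts.
example = {"查 1": 3, "查 2": 1, "教宗": 1}
merged = merge_lemmas(example)
```

Summing is only one possible way to combine lemma-level values; other columns might call for a different aggregation.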
(1) Orthography
Length the number of characters in a word form.
Stroke the number of strokes of each character in a word form. If a word was made up of more
than one character, the stroke counts of adjacent characters were separated by the at sign, @.
Table 2.2: Column names of the collected lexical characteristics in Chinese Lexicon Profile
[Table body not recovered. Surviving column groups: Orthography, Phonetics, Frequency; first column: IdNum.]
Radical the radical of each character in a word form. Radicals here refer to the semantic
radicals, which serve as indices in Chinese dictionaries. Radicals of adjacent characters were separated by @.
Stroke_PR the number of strokes of the phonetic radical of each character in a word form.
Nh_num_N1 the number of neighbors that shared the first character with a particular
word form. Neighbors of this type were called N1 neighbors.
Nh_rank_N1 the rank of a word among its N1 neighbors. A word and its N1 neighbors were compared with one another on their
word frequencies in Freq_B. The one with the highest word frequency was ranked first and assigned the value 1. This column therefore shows whether a word had neighbors that were used more frequently than it.
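The two neighborhood columns described above can be sketched as follows. The sketch rests on two assumptions that the text does not state: a word is not counted as its own neighbor, and words with tied frequencies share the same rank. The function name `n1_neighbor_stats` and the toy frequencies are hypothetical.

```python
from collections import defaultdict


def n1_neighbor_stats(freq_b):
    """Return {word: (Nh_num_N1, Nh_rank_N1)} from a {word: Freq_B} mapping.

    Assumptions: a word is not its own neighbor, and ties share a rank.
    """
    groups = defaultdict(list)
    for word in freq_b:
        groups[word[0]].append(word)  # group words by their first character

    stats = {}
    for words in groups.values():
        for word in words:
            num_neighbors = len(words) - 1
            # Rank 1 goes to the most frequent word within the group.
            rank = 1 + sum(1 for other in words if freq_b[other] > freq_b[word])
            stats[word] = (num_neighbors, rank)
    return stats


# Toy frequencies for illustration only; not real Freq_B values.
toy = {"教宗": 120, "教育": 900, "教室": 300, "宗教": 500}
stats = n1_neighbor_stats(toy)
```

A rank greater than 1 indicates that at least one N1 neighbor is more frequent than the word itself.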
(2) Phonetics
Pinyin the Pinyin phonetic representations of the characters in a word. For instance, ma3 is
the Pinyin of 馬 ‘horse’. If a word comprised more than one character, the at sign @ was again used to separate the Pinyin of adjacent characters.
Hp_num the number of homophones of a word. This column was based on the data on the
website of Sou Ci Xung Zi (搜詞尋字)7.
Hp_set a set encompassing a particular word and its homophonic words.
7 http://words.sinica.edu.tw/sou/sou.html
Hp_setId the CLP IdNum of each word in the Hp_set.
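The phonetic columns can be illustrated with a short sketch. It is not the CLP's method, since Hp_num was taken from the Sou Ci Xung Zi website rather than computed. The sketch assumes that homophones are words with identical Pinyin fields (tones included) and that Hp_num excludes the word itself; `split_pinyin` and `homophone_columns` are hypothetical names.

```python
from collections import defaultdict


def split_pinyin(pinyin_field):
    """Split an '@'-delimited Pinyin field into (syllable, tone) pairs."""
    pairs = []
    for syllable in pinyin_field.split("@"):
        if syllable and syllable[-1].isdigit():
            pairs.append((syllable[:-1], int(syllable[-1])))
        else:
            pairs.append((syllable, None))  # no tone digit found
    return pairs


def homophone_columns(entries):
    """Build Hp_num, Hp_set and Hp_setId from {IdNum: (word, pinyin)}.

    Assumption: homophones share an identical Pinyin field, tones included.
    """
    groups = defaultdict(list)
    for id_num, (word, pinyin) in entries.items():
        groups[pinyin].append((id_num, word))

    columns = {}
    for members in groups.values():
        ids = [id_num for id_num, _ in members]
        words = [word for _, word in members]
        for id_num in ids:
            columns[id_num] = {
                "Hp_num": len(members) - 1,  # the word itself is excluded
                "Hp_set": words,
                "Hp_setId": ids,
            }
    return columns


# Hypothetical IdNum values for illustration.
entries = {1: ("公布", "gong1@bu4"), 2: ("公佈", "gong1@bu4"), 3: ("馬", "ma3")}
columns = homophone_columns(entries)
```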
(3) Morpho-syntax
The morpho-syntactic lexical characteristics were collected from Chinese Wordnet (CWN).
The CWN8 (Huang & Hsieh, 2010) is a lexical network consisting of synsets and semantic relations. Word meanings in CWN are disambiguated at three hierarchical levels: lemma, sense, and facet. In other words, a word may possess several lemmas, senses or facets, all of which carry part-of-speech (POS) tags. Therefore, a word in CWN may have more than one POS tag. The counts of each POS type were recorded in the syntactic columns introduced below.
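Pos_A the number of times a word form is tagged as an adjective.
Pos_ADV the number of times a word form is tagged as an adverb.
Pos_ASP the number of times a word form is tagged as an aspectual marker.
Pos_C the number of times a word form is tagged as a conjunction.
Pos_DET the number of times a word form is tagged as a determiner.
Pos_M the number of times a word form is tagged as a classifier, such as zhung3 (種), zhi1 (隻), or ben3 (本).
Pos_N the number of times a word form is tagged as a noun or pronoun.
Pos_P the number of times a word form is tagged as a preposition.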
8 http://lope.linguistics.ntu.edu.tw/cwn/
Pos_POST the number of times a word form is tagged as a postposition, a post-conjunction (e.g. deng3-deng3 (等等)
‘and so on’), or a post-numeric determiner (e.g. duo1 (多) ‘more than…’).
Pos_T the number of times a word form is tagged as de, an interjection, or a particle.
Pos_Vi the number of times a word form is tagged as an intransitive verb.
Pos_Vt the number of times a word form is tagged as a transitive verb.
Pos_nom the number of times a word form is tagged as a noun nominalized from a verb.
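The POS columns amount to a tally of the tags attached to a word's lemmas, senses and facets. A minimal sketch follows, assuming the tags have already been mapped onto the thirteen column suffixes above. The actual CWN tag set and its mapping are not shown here, and `pos_counts` is a hypothetical name.

```python
from collections import Counter

# Column names as listed in this subsection.
POS_COLUMNS = [
    "Pos_A", "Pos_ADV", "Pos_ASP", "Pos_C", "Pos_DET", "Pos_M", "Pos_N",
    "Pos_P", "Pos_POST", "Pos_T", "Pos_Vi", "Pos_Vt", "Pos_nom",
]


def pos_counts(tags):
    """Count how often each POS tag occurs for one word form.

    `tags` holds one tag per lemma/sense/facet, already expressed as a
    column suffix such as "N" or "Vt" (an assumption of this sketch).
    """
    counts = Counter("Pos_" + tag for tag in tags)
    return {column: counts.get(column, 0) for column in POS_COLUMNS}


# A hypothetical word form with two nominal senses and one transitive-verb sense.
row = pos_counts(["N", "N", "Vt"])
```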
(4) Semantics
Reverse a tag indicating whether a word is still meaningful when the order of its characters is
reversed. One example pair is jiao4-zong1 (教宗) ‘Pope’ and zong1-jiao4 (宗教)
‘religion’. If the reversed form is also a word in the Chinese lexicon, the word was tagged with “Yes” in this column.
Sense_num the number of senses of a word form. The sense and facet
counts9 were extracted from CWN.
Facet_num the number of facets of a word form.
Holo_num the number of holonyms of a word form. Holonyms and meronyms,
9 As mentioned in the description of the morpho-syntactic information in the CLP, lexical meanings in Chinese Wordnet are disambiguated at the levels of lemma, sense, and facet. When a lemma has different but related meanings, it is further divided into senses. For instance, zou3 (走) means ‘to walk’ in 越「走」越遠 and ‘to visit’ in 「走」了一趟故宮. If there are still subtle differences within one sense, the sense is further separated into facets. The noun bao4-zhi3 (報紙) ‘newspaper’ has two facets since it can refer to either the object or the content of a newspaper. Unlike senses, facets can co-occur in the same context. For example, both facets of bao4-zhi3 (報紙) are available in the sentence 我喜歡今天的報紙 ‘I like today’s newspaper.’
which will be introduced next, are converse terms that denote the part-whole relationship between words. If X is a part of Y, X is a meronym of Y and Y is a holonym of X. For example, ‘tree’ is the holonym of ‘trunk’ or ‘branch’. The counts of a word’s semantic relations were all collected from CWN.
Mero_num the number of meronyms of a particular word form.
Hyper_num the number of hypernyms of a particular word form. If X contains and is
more general than Y, X is the hypernym of Y. For example, ‘color’ is the hypernym of ‘red’.
Hypo_num the number of hyponyms of a particular word form. Hyponyms are the converse of hypernyms.
Anto_num the number of antonyms of a particular word form. Antonyms are words that are opposite in meaning. There are three types of antonyms: complementary opposites (e.g. male/female), contrary opposites (e.g. cold/hot), and relational opposites (e.g. buy/sell).
Nearsyno_num the number of nearsynonyms of a word form. Words with overlapping but subtly distinct meanings are called nearsynonyms. One example is bao1-rung2 (包容) ‘to tolerate’ and rung2-ren3 (容忍) ‘to bear’.
Para_num the number of paranyms of a word form. Words that belong to the same semantic classification are called paranyms. For instance, xung1-di4-xiang4 (兄弟象) ‘Brother Elephant’ and tung3-yi1-shi1 (統一獅) ‘UniPresident Lion’ belong to a group of paranyms concerning baseball teams.
Variant_num the number of variants of a word form. Variants are different written forms
of the same lexical item, like gong1-bu4 (公布) ‘to post’ and gong1-bu4 (公佈) ‘to post’.
SUMO_chi the Chinese SUMO ontological concepts of a word. The data in this column
are from Sinica BOW. A word may be polysemous and thus be subsumed under more than one SUMO concept. For instance, tian1-cai2 (天才) ‘genius’ belongs to the concepts of neng2-li4 (能力) ‘ability’ and ren2-lei4 (人類) ‘human’. The at sign, @, was used to separate adjacent concepts in this column.
SUMO_chi_u the unique SUMO concepts of a word. Some words have repeated SUMO concepts in the SUMO_chi
column. Take si1-sui4 (撕碎) for example. It can mean ‘to tear into pieces’ or ‘to tear into shreds’. The two meanings are categorized into the same SUMO concept,
fen1-li2 (分離) ‘separating’. In this case, the SUMO_chi slot of si1-sui4 (撕碎) was
filled with “分離@分離”, whereas its SUMO_chi_u slot was filled with only “分離”.
SUMO_num the number of concepts in a word’s SUMO_chi.
SUMO_num_u the number of concepts in a word’s SUMO_chi_u.
Cilin_tag the tag of a word in the Chinese thesaurus Tongyici Cilin (同義詞詞林)
(Mei, Zhu, Gao, & Yin, 1983). The purpose of the thesaurus was to provide a repertoire of Chinese synonyms useful in writing and translating. Words in the book
were organized into categories at three levels. The highest level included twelve categories, such as human, object, or action. Categories at the other two levels were progressively more specific than those at the highest one. There were 94 mid-level categories and 1,429 low-level categories. The three levels were tagged with letters;
in addition, numerals were used to group synonyms in the same category.
Cilin_layer1 the high-level tag of a word in Cilin.
Cilin_layer2 the mid-level tag of a word in Cilin.
Cilin_layer3 the low-level tag of a word in Cilin.
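Two of the semantic columns above, Reverse and the SUMO columns, lend themselves to a short sketch. It is a simplified illustration under stated assumptions, not the CLP's actual code. The text mentions only the “Yes” tag, so the “No” value is an assumption, as is the restriction of the reversal check to multi-character words. The names `reverse_tag` and `sumo_columns` are hypothetical.

```python
def reverse_tag(word, lexicon):
    """Tag a word "Yes" if its character-reversed form is also in the lexicon.

    Assumptions: single-character words are tagged "No", and "No" is the
    value used for untagged words (the text only mentions "Yes").
    """
    if len(word) < 2:
        return "No"
    return "Yes" if word[::-1] in lexicon else "No"


def sumo_columns(sumo_chi):
    """Derive SUMO_chi_u, SUMO_num and SUMO_num_u from an '@'-delimited field."""
    concepts = sumo_chi.split("@") if sumo_chi else []
    unique = list(dict.fromkeys(concepts))  # drop repeats, keep first-seen order
    return {
        "SUMO_chi": "@".join(concepts),
        "SUMO_chi_u": "@".join(unique),
        "SUMO_num": len(concepts),
        "SUMO_num_u": len(unique),
    }


lexicon = {"教宗", "宗教", "報紙"}          # toy lexicon for illustration
tag = reverse_tag("教宗", lexicon)          # reversed form 宗教 is in the lexicon
row = sumo_columns("分離@分離")             # the 撕碎 example from the text
```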
(5) Frequency
Freq_B the frequency of a word in the Academia Sinica Balanced Corpus.10
Freq_C the frequency of a word in the data of Central News Agency from the Chinese
Gigaword corpus (Graff, Chen, Kong, & Maeda, 2005). The news agency was located in Taiwan.
Freq_X the frequency of a word in the data of Xinhua News Agency from the Chinese
Gigaword corpus. The news agency was situated in Mainland China.
Freq_Z the frequency of a word in the data of Zaobao Newspaper from the Chinese
Gigaword corpus. Like Xinhua News Agency, the newspaper provides data in
10 http://db1x.sinica.edu.tw/kiwi/mkiwi/
Mainland China.
FreqR_B the ratio of a word’s Freq_B to the sum of word frequencies in the Academia
Sinica Balanced Corpus. As the four previous columns show, the CLP includes word frequency records from four corpora. Because the total frequencies of the corpora differ, it is inappropriate to compare the raw frequency counts of a particular word across them directly. In the FreqR columns, the frequency ratio was computed by dividing a word’s frequency by the total word frequency of the corresponding corpus.
FreqR_C the ratio of a word’s Freq_C to the sum of word frequencies in the data of
Central News Agency.
FreqR_X the ratio of a word’s Freq_X to the sum of word frequencies in the data of
Xinhua News Agency.
FreqR_Z the ratio of a word’s Freq_Z to the sum of word frequencies in the data of
Zaobao Newspaper.
FreqR_100m_B the frequency ratio in FreqR_B multiplied by 100 million.
FreqR_100m_C the frequency ratio in FreqR_C multiplied by 100 million.
FreqR_100m_X the frequency ratio in FreqR_X multiplied by 100 million.
FreqR_100m_Z the frequency ratio in FreqR_Z multiplied by 100 million.
FreqR_100mL_B the base-10 logarithm of FreqR_100m_B. Since a frequency ratio is at most 1 and the FreqR_100m
columns are ratios multiplied by 100 million, the maximum value of FreqR_100mL is 8.
The logarithms were calculated to reflect that a difference of one or two counts matters more for low-frequency words than for high-frequency words.
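The three derived frequency columns can be sketched as below. The function name `frequency_columns` and the example counts are hypothetical. Returning None for a zero frequency is also an assumption, since the text does not say how words absent from a corpus were handled.

```python
import math


def frequency_columns(freq, corpus_total):
    """Return (FreqR, FreqR_100m, FreqR_100mL) for one word in one corpus."""
    ratio = freq / corpus_total              # FreqR: share of all tokens
    per_100m = ratio * 100_000_000           # FreqR_100m: ratio times 100 million
    # log10 is undefined for 0, so absent words get None (an assumption).
    log_per_100m = math.log10(per_100m) if per_100m > 0 else None
    return ratio, per_100m, log_per_100m


# Toy numbers: a word occurring 50 times in a corpus of 1,000 tokens.
ratio, per_100m, log_per_100m = frequency_columns(50, 1000)

# Upper bound: a ratio of 1 gives log10(100,000,000) = 8.
_, _, max_log = frequency_columns(1000, 1000)
```

On the logarithmic scale, the step from a count of 1 to a count of 2 is about 0.30, whereas the step from 1,000 to 1,001 is about 0.0004, which is why small differences weigh more among low-frequency words.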