Chapter 4 Experiments on the Individual Differences of Lexical Behaviors
4.1 Experiment 1: The Role of the Frequency Index of Personal Word Usage in Lexical
4.1.1 Method
The frequency index of each participant was calculated in four steps. First, a list was produced containing every word the participant used and the frequency of each word in his/her segmented Facebook data. Examples are shown in the first and second columns of Table 4.1. Second, the corresponding frequency of each word in the Sinica Corpus was retrieved from the CLP. The third column of Table 4.1 exemplifies the Sinica frequencies. As the column shows, a few words were assigned the missing value “NA” because they did not appear in the Sinica Corpus.
The words with missing values might be comparatively modern words (e.g. ju2-cha2 (桔茶)
‘orange tea’), new proper names (e.g. zhou4-yuan4 (咒怨) ‘Curse Grudge’), or a string that
was erroneously grouped as a word by the automatic segmentation program (e.g. zai4 wo3
nao3 (在我腦) ‘in my brain’). Note that these words, which had no Sinica frequencies,
were excluded from the calculation of participants’ frequency indices. This exclusion filtered out the noise introduced by automatic segmentation, thus diminishing the impact of segmentation errors on the results for individual lexical behaviors.14
Table 4.1: An example of a portion of one participant’s word list
14 As explained in Section 3.2, we did not manually check and correct the automatic segmentation results, mainly because we aimed to develop a methodology that was not labor-intensive and could be readily adopted by future investigations of word recognition.
Third, based on the information gathered in the preceding steps, the frequency index of personal word usage 𝑈𝑗 of participant j was computed by Formula 1, where 𝑃𝑖𝑗 was the participant’s personal frequency of the ith word, 𝑆𝑖 was the word’s frequency in the Sinica Corpus, and n was the number of word types in the person’s Facebook data. In this formula, 𝑈𝑗 can be interpreted as the mean Sinica frequency of the words used by participant j on Facebook. The lower the index, the rarer the words used by the participant, which presumably meant that the person had broader word knowledge.
Formula 1: The frequency index of personal word usage
Finally, the 𝑈𝑗 index of each participant was analyzed together with his/her response latencies and accuracies in the lexical decision task.
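The first three steps can be sketched in code. This is a minimal illustration, not the procedure actually used in the study: it assumes that the index is a token-weighted mean, i.e. the sum of 𝑃𝑖𝑗 × 𝑆𝑖 over the word types divided by the sum of 𝑃𝑖𝑗, which is one reading consistent with the description of 𝑈𝑗 as "the mean Sinica frequency of words used by the participant". The function name `frequency_index`, the use of `None` for "NA", and the toy data are all hypothetical.

```python
from collections import Counter
from typing import Mapping, Optional, Sequence


def frequency_index(
    tokens: Sequence[str],
    sinica_freq: Mapping[str, Optional[float]],
) -> float:
    """Token-weighted mean Sinica frequency of one participant's words.

    Assumed form: U_j = sum_i(P_ij * S_i) / sum_i(P_ij), taken over the
    word types that have a Sinica frequency.
    """
    personal = Counter(tokens)  # step 1: P_ij for every word type
    weighted_sum = 0.0
    total = 0
    for word, p in personal.items():
        s = sinica_freq.get(word)  # step 2: S_i; None stands in for "NA"
        if s is None:
            continue  # words without a Sinica frequency are excluded
        weighted_sum += p * s
        total += p
    if total == 0:
        raise ValueError("no word on the list has a Sinica frequency")
    return weighted_sum / total  # step 3: the index U_j


# Hypothetical toy data, not taken from the study.
tokens = ["今天", "今天", "電影", "桔茶"]
sinica = {"今天": 1000.0, "電影": 400.0}  # "桔茶" is absent, i.e. NA
index = frequency_index(tokens, sinica)
```

Under this assumed form, a participant who mostly uses words that are frequent in the Sinica Corpus obtains a high index, and a participant who often uses rare words obtains a low one.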
The steps described above were applied to the complete word list of each participant (hereafter “the Intact word list”). In addition, a second word list was constructed for each participant in order to calculate another index.
This word list (hereafter “the NV word list”) comprised only multi-character words tagged as common nouns, nominalized verbs, or verbs by the CKIP Segmentation System.
The NV types of words were selected for two reasons. First, the list was intended to contain only content words. Table 4.2 lists all part-of-speech categories used by the CKIP system. The grey-shaded rows are the types selected for the NV list. As shown in the table, almost all types of content words were selected, except for the categories “Nb” (Proper noun), “Nc” (Location noun), “Ncd” (Localizer), and “Nd” (Time noun).
These categories of words were ubiquitous in the Facebook data. Because the website is known as a place for posting one’s feelings, activities, or opinions at any moment, people often share an event together with information about when and where it happened and with whom. Some Facebook users tended to provide much background information, and others did not. This tendency, however, may not reflect a person’s habits of language usage; it may instead be induced by the characteristics of the Facebook service. To minimize the influence of Facebook’s properties on the frequency indices, words indicating background information were filtered out by excluding “Nb”, “Nc”, “Ncd”, and “Nd” from the NV list.
Second, single-character words were excluded to reduce the impact of segmentation errors on the computation results. Take the sentence “新(VH) 訓比(VC) 坐牢(VA) 還慘(b)” as an example. The character xin1 (新) ‘new’ should have formed a word with xun4 (訓), namely xin1-xun4 (新訓) ‘recruit training’. However, the segmentation system mistakenly treated it as a word on its own. Had single-character words not been ruled out, the segmentation error 新 xin1 ‘new’ could not have been eliminated from the computation, because it received the verb tag “VH”, which satisfied the requirement of the NV list. Moreover, many Chinese characters can stand alone as meaningful words (e.g. 新). Consequently, unlike multi-character segmentation errors (e.g. 訓比), single-character words can easily be matched to a word frequency in the Sinica Corpus even when they are segmentation errors. To prevent these monosyllabic errors from distorting the frequency indices, single-character words were barred from the NV word list.
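The two selection criteria of the NV word list can be illustrated with a short sketch. It is only an approximation under stated assumptions: common nouns are taken to be the CKIP tag "Na", nominalized verbs "Nv", and verbs every tag beginning with "V"; the exact tag set used in the study is the one marked in Table 4.2. The names `is_nv_tag` and `nv_word_list` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, CKIP part-of-speech tag)


def is_nv_tag(tag: str) -> bool:
    """Assumed NV tag set: common noun, nominalized verb, or any verb tag."""
    return tag in {"Na", "Nv"} or tag.startswith("V")


def nv_word_list(tagged: List[Token]) -> List[Token]:
    """Keep multi-character words whose tag belongs to the NV set."""
    return [(word, tag) for word, tag in tagged
            if len(word) >= 2 and is_nv_tag(tag)]


# The segmented example sentence discussed above.
tagged = [("新", "VH"), ("訓比", "VC"), ("坐牢", "VA"), ("還慘", "b")]
nv = nv_word_list(tagged)  # [("訓比", "VC"), ("坐牢", "VA")]
```

The single-character error 新 is removed by the length criterion even though its tag "VH" qualifies. The multi-character error 訓比 passes this filter, but it would later be dropped from the index computation as long as it has no frequency in the Sinica Corpus.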
Table 4.2: Part-of-Speech categories in the CKIP Chinese Word Segmentation System
Columns: CKIP tag; part-of-speech category in Chinese; English translation
A 非謂形容詞 Non-predicate adjective
Caa 對等連接詞,如:和、跟 Coordinating conjunction, e.g. han4, gen1
Cab 連接詞,如:等等 Listing conjunction, e.g. deng3-deng3
Cba 連接詞,如:的話 Sentence-final conjunction, e.g. de-hua4
Cbb 關聯連接詞 Correlative conjunction
Da 數量副詞 Quantitative adverb
Dfa 動詞前程度副詞 Pre-verbal adverb of degree
Dfb 動詞後程度副詞 Post-verbal adverb of degree
Di 時態標記 Aspectual adverb
Dk 句副詞 Sentential adverb
D 副詞 Adverb
Neu 數詞定詞 Numeral determinative
Nes 特指定詞 Specific determinative
Nep 指代定詞 Demonstrative determinative
Neqa 數量定詞 Quantitative determinative
Neqb 後置數量定詞 Post-quantitative determinative
Nf 量詞 Measure word
Ng 後置詞 Postposition
Nh 代名詞 Pronoun
Nv 名物化動詞 Nominalized verb
I 感嘆詞 Interjection
P 介詞 Preposition
T 語助詞 Particle
VA 動作不及物動詞 Active intransitive verb
VAC 動作使動動詞 Active causative verb
VB 動作類及物動詞 Active pseudo-transitive verb
VC 動作及物動詞 Active transitive verb
VCL 動作接地方賓語動詞 Active verb with a locative object
VD 雙賓動詞 Ditransitive verb
VE 動作句賓動詞 Active verb with a sentential object
VF 動作謂賓動詞 Active verb with a verbal object
VG 分類動詞 Classificatory Verb
VH 狀態不及物動詞 Stative intransitive verb
VHC 狀態使動動詞 Stative causative verb
VI 狀態類及物動詞 Stative pseudo-transitive verb
VJ 狀態及物動詞 Stative transitive verb
VK 狀態句賓動詞 Stative verb with a sentential object
VL 狀態謂賓動詞 Stative verb with a verbal object
V_2 有 YOU3
DE 的, 之, 得, 地 Particle DE and its functional equivalents
SHI 是 SHI4
FW 外文標記 Foreign word
Since this experiment employed two types of personal word lists (i.e. “the Intact list” and “the NV list”), each participant obtained two frequency indices of personal word usage. The correlation between the two indices is illustrated in Figure 4.1 (r = .67). The indices based on the Intact word list and on the NV word list were entered into separate mixed-effects models, so that we could examine whether they accounted for between-participant variance in response latencies and accuracies. It was hypothesized that participants with lower frequency indices, who were postulated to have broader vocabulary knowledge, would respond more rapidly and more accurately than those with higher frequency indices.
Figure 4.1: The correlation of frequency indices of personal word usage computed by the Intact list and the NV list (r = .67)
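For reference, a correlation such as the one reported in Figure 4.1 can be computed as a Pearson coefficient over the participants' paired indices. The sketch below assumes that r denotes Pearson's r; the helper `pearson_r` and the values are hypothetical and are not the study's data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical indices for four participants (one pair per participant).
intact_index = [1.0, 2.0, 3.0, 4.0]
nv_index = [2.0, 1.0, 4.0, 3.0]
r = pearson_r(intact_index, nv_index)
```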