
Chapter 4 Experiments on the Individual Differences of Lexical Behaviors

4.1 Experiment 1: The Role of the Frequency Index of Personal Word Usage in Lexical

4.1.1 Method

The frequency index of each participant was calculated in four steps. The first step was to produce a list containing all of the words the participant used and the frequency of occurrence of each word in his/her segmented Facebook data. Examples of the list are shown in the first and second columns of Table 4.1. Second, the corresponding frequency of each listed word in the Sinica Corpus was retrieved from the CLP. The third column of Table 4.1 exemplifies the Sinica frequencies. As shown in that column, a few words were assigned the missing value “NA” because they did not appear in the Sinica Corpus.

The words with missing values might be comparatively modern words (e.g. ju2-cha2 (桔茶) ‘orange tea’), new proper names (e.g. zhou4-yuan4 (咒怨) ‘Curse Grudge’), or strings that were erroneously grouped as a word by the automatic segmentation program (e.g. zai4 wo3 nao3 (在我腦) ‘in my brain’). Note that these words, which had no Sinica frequencies, were excluded from the calculation of participants’ frequency indices. This exclusion filtered out the noise introduced by automatic segmentation, thus diminishing the impact of segmentation errors on the results for individual lexical behaviors.¹⁴

Table 4.1: An example of a portion of one participant’s word list

Third, based on the information gathered in the foregoing steps, the frequency index of personal word usage 𝑈_𝑗 of participant j was computed by Formula 1, where 𝑃_𝑖𝑗 was the participant’s personal frequency of the ith word, 𝑆_𝑖 was the word’s frequency in the Sinica Corpus, and n was the number of word types in the person’s Facebook data. In this formula, 𝑈_𝑗 can be interpreted as the mean Sinica frequency of the words used by participant j on Facebook. The lower the index, the rarer the words the participant used, which presumably meant that the person had broader word knowledge.

14 As explained in Section 3.2, we did not manually check and correct the automatic segmentation results, mainly because we aimed to develop a methodology that was not labor-intensive and could be comparatively easily adopted by future investigations of word recognition.

Formula 1: The frequency index of personal word usage

Finally, the 𝑈_𝑗 index of each participant was analyzed together with his/her response latencies and accuracies in the lexical decision task.
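The first three steps can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the computation actually used: Formula 1 is not reproduced in this text, so the weighted-mean form below (each Sinica frequency weighted by the participant's personal frequency) is an assumption based on the description of 𝑈_𝑗 as a mean Sinica frequency. The function name `frequency_index`, the dictionary `sinica_freq` (a stand-in for the CLP lookup), and all numbers are hypothetical.

```python
from collections import Counter


def frequency_index(tokens, sinica_freq):
    """Return an assumed form of U_j for one participant.

    tokens: segmented words from the participant's posts.
    sinica_freq: hypothetical dict mapping a word to its Sinica Corpus
        frequency; a word missing from the dict plays the role of "NA".
    """
    # Step 1: word types and their personal frequencies (P_ij).
    personal = Counter(tokens)
    # Step 2: look up corpus frequencies (S_i) and drop the "NA" words.
    known = {w: p for w, p in personal.items() if w in sinica_freq}
    if not known:
        return None  # no usable words, so no index can be computed
    # Step 3 (ASSUMED form): mean S_i weighted by P_ij.
    total_uses = sum(known.values())
    return sum(p * sinica_freq[w] for w, p in known.items()) / total_uses


# Toy data with invented frequencies; "c" has no corpus entry and is skipped.
toy_tokens = ["a", "a", "b", "c"]
toy_sinica = {"a": 100, "b": 10}
print(frequency_index(toy_tokens, toy_sinica))  # (2*100 + 1*10) / 3 = 70.0
```

Under this reading, a participant who often uses low-frequency words obtains a lower value, which matches the interpretation of the index given above.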

The steps for computing the frequency index introduced above were applied to the complete word list of each participant (hereafter “the Intact word list”). In addition to this list, a second word list was constructed for each participant in order to calculate another index.

This word list (hereafter “the NV word list”) comprised only multi-character words tagged as common nouns, nominalized verbs, or verbs by the CKIP Segmentation System.

The NV types of words were selected for two reasons. First, the list was intended to contain only content words. Table 4.2 lists all part-of-speech categories used by the CKIP system; the grey-shaded rows are the types selected for the NV list. As shown in the table, almost all types of content words were selected, except for the categories “Nb” (Proper noun), “Nc” (Location noun), “Ncd” (Localizer), and “Nd” (Time noun).

These categories of words were ubiquitous in the Facebook data. Given that the website is known for posting feelings, activities, or opinions in the moment, people often share an event along with information about when and where it happened and with whom. Some Facebook users tended to provide much background information, and others did not. This tendency, however, may not result from a person’s habits of language use but may instead be induced by the characteristics of the Facebook service. To minimize the influence of Facebook’s properties on the frequency indices, the words indicating background information were filtered out by excluding “Nb”, “Nc”, “Ncd”, and “Nd” from the NV list.

The other reason concerned single-character words, which were excluded so as to reduce the impact of segmentation errors on the computation results. Take the sentence “新(VH) 訓比(VC) 坐牢(VA) 還慘(b)” as an example. The character xin1 (新) ‘new’ was supposed to form a word with xun4 (訓), namely xin1-xun4 (新訓) ‘recruit training’. However, the segmentation system mistakenly segmented it as a word on its own. Had the criterion of ruling out single-character words not been set, the segmentation error xin1 (新) ‘new’ could not have been eliminated from the computation, because it received the verb tag “VH”, which conformed to the requirement of the NV list. Moreover, many Chinese characters can stand alone as meaningful words (e.g. 新). Consequently, unlike multi-character segmentation errors (e.g. 訓比), single-character words can easily find a corresponding word frequency in the Sinica Corpus even when they are segmentation errors. To prevent such monosyllabic errors from distorting the frequency indices, single-character words were barred from the NV word list.
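The NV selection criteria can be illustrated with a short Python sketch applied to the example sentence above. It is a hedged illustration only: the exact set of grey-shaded tags in Table 4.2 is not reproduced here, so the sketch assumes that the NV list keeps “Na” (common noun), “Nv” (nominalized verb), and every tag beginning with “V”. The names `TOKEN`, `is_nv_word`, and `nv_word_list` are hypothetical, as is the `word(TAG)` input format taken from the example.

```python
import re

# Matches tokens written as word(TAG), e.g. 坐牢(VA).
TOKEN = re.compile(r"(\S+?)\((\w+)\)")


def is_nv_word(word, tag):
    """ASSUMED NV criteria: multi-character, and tagged Na, Nv, or V*."""
    return len(word) > 1 and (tag in {"Na", "Nv"} or tag.startswith("V"))


def nv_word_list(tagged_text):
    """Return the words in a tagged string that satisfy the NV criteria."""
    return [w for w, t in TOKEN.findall(tagged_text) if is_nv_word(w, t)]


sentence = "新(VH) 訓比(VC) 坐牢(VA) 還慘(b)"
print(nv_word_list(sentence))  # ['訓比', '坐牢']
```

The single-character error 新 is removed by the length criterion even though it carries a verb tag. The multi-character error 訓比 passes this filter, but, as described above, such strings have no Sinica frequency and therefore drop out when the index is computed.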

Table 4.2: Part-of-Speech categories in the CKIP Chinese Word Segmentation System