Measuring Individual Differences in Visual Word Recognition: The Role of Individual Lexical Behaviors

Master's Thesis
Graduate Institute of English, National Taiwan Normal University (國立臺灣師範大學英語學系)

Measuring Individual Differences in Visual Word Recognition: The Role of Individual Lexical Behaviors
(字詞辨識中個別差異之量度：個人詞彙行為之角色探究)

Advisors: Dr. Shu-Kai Hsieh (謝舒凱) and Dr. Shiao-Hui Chan (詹曉蕙)
Student: Hsin-Ni Lin (林欣霓)
July 2012

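Abstract (translated from the Chinese 摘要)

This thesis uses corpus-based and computational linguistic methods to measure individual differences in participants' performance in word recognition. Word recognition is a central topic in psycholinguistics. Previous studies (Katz et al., in press; Lewellen, Goldinger, Pisoni, & Greene, 1993; Sears, Siakaluk, Chow, & Buchanan, 2008; Unsworth & Pexman, 2003; Yap, Balota, Sibley, & Ratcliff, 2012) have mainly explored the sources of these individual differences through tests or questionnaires, such as vocabulary tests and word familiarity questionnaires. Such methods, however, are often confined to what a test can reach and are limited by the vocabulary, scores, and scales of a single test. To extend the research to actual language use, this thesis starts from individuals' everyday lexical behaviors and proposes ways to compute two variables, the "frequency index of personal word usage" and the "personal word frequency". It then examines whether they can explain the variance attributable to individual participants' performance in a word recognition experiment.

The study was completed in four steps. First, a Chinese lexical decision task was administered to collect experimental data on word recognition. Second, each participant's Facebook posts were automatically extracted and segmented into words. Third, the segmentation results were used to compute the values of the two lexical behavior variables. The frequency index of personal word usage is computed from the corresponding frequencies, in the Academia Sinica Balanced Corpus, of the words a person uses. Personal word frequency refers to how frequently the lexical decision stimuli occur in a person's Facebook posts. Fourth, the statistical analysis adopted mixed-effects models, which are well suited to estimating individual differences.

The results show that the effect of personal word frequency was significant: participants responded faster to words that they themselves used more frequently. Participants with a lower frequency index of personal word usage had, contrary to expectation, lower accuracy. In addition, as a pioneering study in measuring individual lexical behaviors, this thesis offers the following methodological suggestions. The counter-prediction result for the frequency index may stem from the fact that the balanced corpus referenced in the computation is composed of written data; future experiments of this kind should refer to word frequencies from a spoken corpus. Furthermore, our tests indicate that even though the output of automatic word segmentation contains many errors, it remains feasible to normalize a person's frequency counts by the total token count obtained from that output. Finally, when naturalistic data such as Facebook posts are used for measurement, it is advisable to study a person's lexical preferences or habits rather than every single word the person uses.

Keywords: individual differences, word recognition, lexical behaviors, naturalistic data, mixed-effects models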

Abstract

This thesis adopts a corpus-based computational linguistic approach to measure individual differences (IDs) in visual word recognition. Word recognition has been a cardinal issue in the field of psycholinguistics. Previous studies (Katz et al., in press; Lewellen, Goldinger, Pisoni, & Greene, 1993; Sears, Siakaluk, Chow, & Buchanan, 2008; Unsworth & Pexman, 2003; Yap, Balota, Sibley, & Ratcliff, 2012) examined IDs by means of test-based or questionnaire-based measures (e.g. vocabulary tests and word familiarity questionnaires). Those measures, however, confine the research to what they are able to evaluate, and they differentiate individuals only within the bounds of limited scores, scales, or vocabularies. To bring the research closer to IDs in real life, the present study approaches the issue through observations of participants' daily-life lexical behaviors. We propose methods to calculate the frequency index of personal word usage and the personal word frequency, and investigate whether each of them accounts for between-participant variance in word recognition.

The investigation was carried out in four steps. First, a lexical decision task containing 912 Chinese stimuli was conducted to collect data on visual word recognition. Second, each participant's Facebook posts were automatically extracted and segmented into words. Third, based on those words, the two variables of individual lexical behaviors were computed. The frequency index of each person was derived from the corresponding frequencies of his/her words in the Academia Sinica Balanced Corpus. The personal word frequency referred to the relative degree to which a given word-recognition stimulus occurred in a participant's Facebook posts. Fourth, the experimental data were analyzed with mixed-effects models, which can precisely estimate by-subject differences.

Results showed that the effect of personal word frequency reached significance: participants responded more rapidly to words that they themselves used more frequently. Participants with lower frequency indices of personal word usage had lower accuracy rates than others, which was contrary to our prediction. In addition, as a pioneer study of measuring lexical behaviors, this thesis provides the following methodological suggestions. The counter-prediction finding in the frequency index experiment is possibly attributable to the fact that the Sinica Corpus mainly consists of written data; it is therefore suggested that similar experiments in future research use the frequency counts of a spoken corpus. Additionally, according to our examination, a person's total token number is feasible for normalizing his/her frequency counts even though the tokens contain word segmentation errors. Finally, when naturalistic data such as Facebook posts are used for the measurement, it is recommended to base the computation on a person's preferences or patterns of lexical usage, rather than on every single word in his/her language usage data.

Key words: individual differences, word recognition, lexical behaviors, naturalistic data, mixed-effects models
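The personal word frequency described in the abstract can be illustrated with a short sketch. The code below is only a minimal illustration under stated assumptions, not the procedure actually used in the thesis: the function names `personal_word_frequency` and `z_scores` are hypothetical, the input is assumed to be one participant's posts already segmented into a list of word tokens, the ratio is assumed to be the raw count divided by that participant's total token number, and the z-score is assumed to standardize a list of values against their own mean and population standard deviation.

```python
from collections import Counter
from statistics import mean, pstdev


def personal_word_frequency(tokens, stimuli):
    """Count how often each stimulus occurs in one participant's tokens.

    tokens  -- list of segmented words from the participant's posts
    stimuli -- iterable of lexical-decision stimulus words
    Returns {stimulus: (raw_count, ratio)}, where ratio is the raw count
    divided by the participant's total token number (an assumed
    normalization; the ratio is 0.0 when there are no tokens).
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: (counts[word], counts[word] / total if total else 0.0)
        for word in stimuli
    }


def z_scores(values):
    """Standardize values against their own mean and population SD.

    Returns 0.0 for every value when the SD is zero, so that a
    participant with constant frequencies does not cause a division error.
    """
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]
```

For example, calling `personal_word_frequency(["我", "喝", "咖啡", "我"], ["我", "茶"])` pairs each stimulus with its count and its share of the four tokens. Dividing by the participant's own token total keeps heavy and light posters comparable, which is the role the abstract assigns to the total token number.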

Acknowledgements

Finally I have reached the stage of writing acknowledgements. It is a pleasure to thank the many generous people who made this thesis possible.

First of all, I would like to express my profound gratitude to my advisors, Prof. Shu-Kai Hsieh and Prof. Shiao-Hui Chan. With their full support and enlightening guidance, I had the precious opportunity to put into action my ideas for this interdisciplinary investigation. Moreover, their heart-warming encouragement helped me recover my confidence whenever I encountered arduous obstacles in conducting the experiments. I also benefited much from Prof. Hsieh's philosophical manner of discussing research methodology, which helped me think beyond the traditional frame. Prof. Chan always efficiently pinpointed general issues that I had not thought of and should discuss in the thesis. Her insightful advice broadened the discussion in the thesis.

Second, my heartfelt thanks also go to my committee members, Prof. Yung-Fong Hsu and Prof. Chien-Jer Charles Lin. Their constructive advice and invaluable comments greatly improved this thesis. Without Prof. Hsu's evaluation of and advice on the mixed-effects models, I would not have known whether I had employed the models correctly and appropriately. Prof. Lin is the teacher who led me into the field of experimental linguistics when I was a junior in college. I joined his LaCL lab at that time, working on an NSC undergraduate project. The experience showed me the charm of linguistic experiments.

I am also grateful to the other professors who taught me diverse facets of linguistics in my undergraduate and graduate years at NTNU: Prof. Chun-Yin Doris Chen, Prof. Jen Tin, Prof. Jen-I Li, Prof. Joy Wu, Prof. Hui-Shan Lin, Prof. Hsi-Yao Su, Prof. Hsiao-hung Iris Wu, Prof. Kwock-Ping Tse, Prof. Miao-Ling Hsieh, and Prof. Yun-O Biq (in alphabetical order). Their intriguing courses helped me build my understanding of linguistics and develop research abilities, both of which served as a sound foundation for writing this thesis.

I am indebted to the friends and relatives who consented to participate in our experiments. To keep them anonymous, I do not name each of them here.

Special thanks go to the members of LOPE (Lab of Ontologies, Language Processing, and e-Humanities), which is directed by Prof. Shu-Kai Hsieh: Chan-Chia Mike Hsu, Chih-Yao Lee, Chin-Ju Huang, Ding-Shiang Lin, Julia Wu, Katherine Chen, Matt Ku, Pierre Majistry, Qian-Rong Zhang, Simon Shih, Tsun-Jui Liu, and Yu-Yung Chang (in alphabetical order). They gave me a great deal of technical and moral support, as well as crucial suggestions. Furthermore, I would like to thank my NTNU classmates for their company and encouragement: Abbie Hsu, Ann Lee, Bebe Wu, Bonnie Wei, Clare Liu, Charlie Chen, Gina Yang, Helen Chien, Lina Chiu, Katherine Chen, Monica Hsu, and Sam Jheng. In particular, the Friday dates with Bebe and Helen in the last year of my graduate school life are my most unforgettable and treasured moments.

Last but most importantly, I give my deepest gratitude to my beloved family, aunts, and uncles, as well as my dear boyfriend, Wei-Chun Chang. Their unconditional love, care, and support gave me the energy to overcome setbacks and accomplish this work. This thesis is therefore dedicated to them.

Contents

摘要 (Chinese Abstract)
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
List of Abbreviations
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Research Questions
  1.3 Organization of the Thesis
Chapter 2 Literature Review
  2.1 Individual Differences in Visual Word Recognition
  2.2 The Correlation between Word Frequency and Word Difficulty
  2.3 Corpus Resources
    2.3.1 Chinese Lexicon Profile
    2.3.2 i-Corpus (Individualized Corpora Project)
  2.4 Mixed-effects Models
Chapter 3 Data Collection
  3.1 Lexical Decision Task
    3.1.1 Participants
    3.1.2 Materials
    3.1.3 Procedure
  3.2 Facebook Data
Chapter 4 Experiments on the Individual Differences of Lexical Behaviors
  4.1 Experiment 1: The Role of the Frequency Index of Personal Word Usage in Lexical Decision
    4.1.1 Method
    4.1.2 Results and Discussion
  4.2 Experiment 2: The Role of Personal Word Frequency in Lexical Decision
    4.2.1 Methods
    4.2.2 Results and Discussion
Chapter 5 Conclusion
  5.1 Summary of the Thesis
  5.2 Contributions
  5.3 Limitations and Future Work
References
Appendix A Background Sheet
Appendix B Word List
Appendix C Non-word List
Appendix D Instruction
Appendix E Examples of Low-frequency Words in the Sinica Corpus

List of Tables

Table 2.1: Measures of vocabulary knowledge in Lewellen et al., Katz et al., and Yap et al.
Table 2.2: Column names of the collected lexical characteristics in Chinese Lexicon Profile
Table 2.3: Comparison of by-subject differences in random regression and mixed-effects models (Baayen, 2008)
Table 3.1: Counts of high-, mid-, and low-frequency word stimuli with different character numbers in the visual lexical decision task
Table 3.2: Counts of high-, mid-, and low-frequency word stimuli with different sense numbers in the visual lexical decision task
Table 3.3: Counts of high-, mid-, and low-frequency word stimuli with different neighborhood sizes in the visual lexical decision task
Table 3.4: The numbers of participants' four categories of Facebook messages
Table 3.5: The token numbers in participants' Facebook posts
Table 4.1: An example of a portion of one participant's word list
Table 4.2: Part-of-Speech categories in the CKIP Chinese Word Segmentation System
Table 4.3: Statistical results of the mixed-effects models analyzing the frequency index of personal word usage and covariates (Response latency)
Table 4.4: Statistical results of the mixed-effects models analyzing the frequency index of personal word usage and covariates in the Intact List (Response accuracy)
Table 4.5: Statistical results of the mixed-effects models analyzing the frequency index of personal word usage and covariates in the NV List (Response accuracy)
Table 4.6: Counts of Experiment 2 stimuli with different frequency types
Table 4.7: Counts of Experiment 2 stimuli with different sense numbers
Table 4.8: Counts of Experiment 2 stimuli with different character numbers
Table 4.9: Counts of Experiment 2 stimuli with different neighborhood sizes
Table 4.10: Frequency counts of several lexical-decision stimuli in each participant's Facebook data
Table 4.11: Frequency ratios of several lexical-decision stimuli in each participant's Facebook data
Table 4.12: Frequency z-scores of several lexical-decision stimuli in each participant's Facebook data
Table 4.13: Statistical results of the mixed-effects models analyzing personal word frequency ratio and covariates (Response latencies)
Table 4.14: Statistical results of the mixed-effects models analyzing personal word frequency z-score and covariates (Response latencies)

List of Figures

Figure 2.1: A snapshot of Facebook Wall
Figure 2.2: Growth of Facebook users in Taiwan (Smith, 2009)
Figure 2.3: Comparison of the estimated intercepts for subjects in random regression and mixed-effects models (Baayen, 2008)
Figure 3.1: The procedure for random generation of two-character non-word stimuli in the visual lexical decision task
Figure 3.2: The procedure of a trial in the lexical decision task
Figure 3.3: The procedure of collecting participants' data of language usage on a Facebook Wall
Figure 4.1: The correlation of frequency indices of personal word usage computed by the Intact list and the NV list (r = .67)
Figure 4.2: Residual diagnostics for the models of the Intact list before (upper panels) and after (lower panels) removal of outliers
Figure 4.3: Partial effects of block number, trial number, frequency type, character number, and sense number in the analysis of Experiment 1 (Response latency)
Figure 4.4: Continuum of modes in Facebook posts
Figure 4.5: Residual diagnostics for the model of personal word frequency ratios
Figure 4.6: Partial effects of block number, trial number, frequency type, character number, sense number, and personal word frequency (ratio and z-score) in the analysis of Experiment 2
Figure 4.7: The correlation plot of participants' total token numbers and their token numbers summed from the 218 stimuli in Experiment 2 (r = .95)

List of Abbreviations

CLP: Chinese Lexicon Project. A corpus that stores characteristics of Chinese words, such as character numbers, strokes, word frequency, and Pinyin.
CWN: Chinese Wordnet. http://lope.linguistics.ntu.edu.tw/cwn
IDs: Individual Differences. The term refers to differences between participants, which are the main focus of this thesis.
LDT: Lexical Decision Task. A type of experiment designed for investigating word recognition. In the task, participants are asked to judge whether a presented word is an existent and meaningful word in the language in question.
Sinica Corpus: Academia Sinica Balanced Corpus of Modern Chinese. http://db1x.sinica.edu.tw/kiwi/mkiwi/
Sinica BOW: Academia Sinica Bilingual Ontological Wordnet. http://bow.sinica.edu.tw/
SUMO: Suggested Upper Merged Ontology. http://www.ontologyportal.org/

Chapter 1 Introduction

1.1 Background and Motivation

In the field of psycholinguistics, a cardinal research interest is to investigate how people recognize written words, that is, how they access the corresponding word representations stored in their mental lexicon. Psycholinguists usually start the investigation from isolated words, since fewer factors are involved than for words within sentences. Research on isolated word recognition is therefore fundamental for understanding how lexical access takes place. In general, the term 'visual word recognition' is used simply to refer to the recognition of isolated written words. Traditional studies of visual word recognition paid attention to how characteristics of the words themselves (e.g. word length, word frequency, or neighborhood size) affected the recognition process (Andrews, 1989; Forster & Chambers, 1973; Grainger, 1990; New, Ferrand, Pallier, & Brysbaert, 2006; Whaley, 1978). Recently, there has been a growing interest in the individual differences (IDs, henceforth) of experiment participants.

Schilling et al. (1998) and Yap et al. (2012) verified that statistically reliable IDs exist in word recognition experiments, including naming and lexical decision tasks. Other papers examined whether particular IDs are tied to response latencies or even accuracies in these tasks; topics included participants' print-exposure experience (Chateau & Jared, 2000; Sears et al., 2008), familiarity with words (Lewellen et al., 1993), reading skills (Unsworth & Pexman, 2003), and vocabulary knowledge (Katz et al., in press; Lewellen et al., 1993; Yap et al., 2012). The results of these studies showed that personal experience and knowledge of words account for systematic variance between participants in word recognition. The findings reveal a weakness of traditional research, which took only word variables into consideration and treated IDs as statistical error. Without control of the IDs, systematic variance caused by personal factors might be mistakenly attributed to word variables. From the standpoint of conducting scientific experiments, it is crucial to explore IDs in word recognition and to control them during analysis so as to increase the validity of experimental results.

The aforementioned studies of IDs, however, have concentrated on test-measured or self-rated ID variables. In such approaches, the observed IDs are confined within the bounds of a test or questionnaire design, and the uniqueness of each individual in real life is neglected. In an attempt to approximate real-life IDs, this thesis measures and analyzes IDs based on each participant's own lexical behaviors. Lexical behaviors here refer to a person's word usage and preferences in his/her daily life. Intuitively, language usage reveals one's vocabulary knowledge, such as the words the person knows and how he/she uses those words in context. Vocabulary knowledge has been shown to be related to word recognition (Katz et al., in press; Lewellen et al., 1993; Yap et al., 2012); hence, it is highly possible that IDs of lexical behaviors can explain the disparity in participants' performance in word recognition.

The present study extracts the lexical behaviors of participants from their data on the Facebook Wall. Facebook [1] is a social network website. Users can share their current situations with friends by spontaneously posting whatever comes to mind on their own Facebook Walls. Such a platform provides more casual and less embellished language records than formal publications, so that real-life, authentic lexical behaviors of individuals are more likely to be obtained.

Our attention to lexical behaviors is focused on the frequency index of personal word usage and the personal word frequency, both calculated from participants' language data. Whether the two variables are associated with participants' performance in a lexical decision task [2] is explored in two separate experiments. More importantly, since this is a pioneer study on lexical behaviors and word recognition, the other main objective of this thesis is to report noteworthy methodological issues revealed by the experimental results. For instance, before the 'lexical' behaviors can be quantified, participants' language data must first be automatically segmented from strings into words. Automatic segmentation will inevitably produce segmentation errors, and thus noisy data. How to alleviate the contamination of this noise in the experiments is therefore a concern.

In summary, this work adopts a computational linguistic approach for quantifying individual lexical behaviors, and further examines whether such behaviors can account for systematic variance between participants in a lexical decision task. The methodology of such research is also a major concern of this interdisciplinary effort. Our experimental results, hopefully, will shed some light on the tie between word recognition and lexical-level daily-life IDs.

[1] https://www.facebook.com/. More details of the Facebook website are introduced in Section 2.3.2.
[2] Lexical decision and naming tasks are extensively used experiments of visual word recognition. In a lexical decision task, participants are asked to look at each (non-)word stimulus and respond whether it is an existent word in the language in question. In a naming task, participants are asked to look at each stimulus, which may be a picture, and say aloud the name of the object in the picture. This thesis adopts the lexical decision task for investigation; the naming task can be looked into in future research.

1.2 Research Questions

This thesis addresses the two questions listed below.

1. Do lexical behaviors, in terms of the frequency index of personal word usage and the personal word frequency, account for variance between participants in a lexical decision task?
2. If yes, do the experimental designs in this thesis reveal any methodology that is recommended, or any methodological issue that should be of concern, in future research?
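To make the first variable in the research questions more concrete, the following sketch shows one way a per-person frequency index could be derived from a reference corpus. It is a hedged illustration only: the text so far states that the index is derived from the frequencies of a person's words in the Sinica Corpus but does not give the formula, so the averaging of log10 frequencies over word types, the function name `frequency_index`, and the dictionary `corpus_freq` are all assumptions made here. Tokens that are absent from the reference list, as many segmentation errors would be, are simply skipped in this sketch.

```python
import math


def frequency_index(tokens, corpus_freq):
    """Mean log10 reference-corpus frequency of a person's word types.

    tokens      -- list of segmented words from one person's posts
    corpus_freq -- dict mapping a word to its reference-corpus frequency
                   (hypothetical stand-in for a Sinica Corpus word list)
    Word types missing from corpus_freq, or listed with a non-positive
    frequency, are ignored. Returns None when no type can be looked up.
    This formula is an assumption, not the one defined in the thesis.
    """
    types = set(tokens)
    logs = [
        math.log10(corpus_freq[word])
        for word in types
        if corpus_freq.get(word, 0) > 0
    ]
    return sum(logs) / len(logs) if logs else None
```

Under this assumed definition, a lower index means that a person tends to use words that are rarer in the reference corpus. Averaging over types rather than tokens is a design choice of the sketch; it keeps a few very common function words from dominating the index.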

1.3 Organization of the Thesis

The rest of this thesis is organized as follows. In Chapter 2, we first review empirical studies examining IDs in visual word recognition. The review not only demonstrates the pitfall of neglecting IDs in traditional research but also contrasts the ID variables of earlier studies with lexical behaviors, the focus of the present study. Hypotheses about our two ID variables of lexical behaviors in word recognition are then discussed. What follows is an introduction to the two corpus resources we used, one to select the stimuli of a lexical decision task and the other to quantify participants' lexical behaviors. In addition, the experimental results in this thesis were analyzed with mixed-effects models rather than the analysis of variance (ANOVA) commonly used in psycholinguistic studies. The merits of adopting mixed-effects models are presented at the end of the chapter. Chapter 3 presents the procedure of our data collection, which includes conducting a lexical decision experiment and extracting the participants' language usage data from their Facebook Walls. In Chapter 4, the methods of computing the lexical behavior variables are reported, and the results concerning the relationships between participants' IDs of lexical behaviors and their lexical-decision responses are presented and discussed. Chapter 5 concludes the thesis with a summary of our findings, the contributions of the thesis, and a discussion of its limitations and of suggested future work.

(19) Chapter 2Literature Review Literature Review In this chapter, Section 2.1 provides a review and critique to the literature of the individual differences (IDs) in visual word recognition, followed by our preliminary hypothesis about the impact of personal word frequency on a word recognition experiment. Section 2.2 summarizes a study of the correlation between word frequency and word difficulty (Breland, 1996) and draws the prediction about the frequency index of personal word usage from his findings. In Sections 2.3-2.4, the corpus resources and statistical models utilized in this thesis are introduced.. 2.1 Individual Differences in Visual Word Recognition Studies of word recognition traditionally have concentrated on the role of word variables (e.g. word frequency or neighborhood size) in the recognition process, taking the discrepancies between participants’ responses as statistical deviation. One pitfall of the 6.

(20) traditional studies is that spurious or conflicting results might be obtained. An example can be seen in the paper of Usworth and Pexman (2003). Usworth and Pexman (2003) provided compelling evidence that a controversy over regularity effects in the literature were attributable to lacking control over participants’ IDs of reading skills. Regularity denotes that the extent to which the spelling-to-sound correspondence in words are invariant. The effects of regularity are that a response is made slower to less ‘regular’ words (e.g. pint) than to ‘regular’ words (e.g. name). Some researchers failed to find the effects in the visual lexical decision tasks (Coltheart, Davelaar, Jonasson, & Besner, 1979; Jared, McRae, & Seidenberg, 1990), whereas others reported significant regularity effects did exist (Parkin, 1982; Parkin & Ellingham, 1983; Parkin & Underwood, 1983; Stanovich & Bauer, 1978). Usworth and Pexman (2003) showed that although phonological processing was involved when people responded in visual lexical decision, the regularity effects were found in the responses of low-skilled readers, but not in high-skilled readers. This finding suggested the necessity of controlling IDs in word recognition studies. Especially, note that the participants in Usworth and Pexman (2003)’s study all came from the same university. Even if they were at the same education level, their IDs sufficiently resulted in distinct performance in word recognition. The findings in other papers also supported that the IDs among people with homogeneous education background would lead to significant differences in the process of 7.

(21) word recognition. Chateau and Jared (2000) conducted a homophone choice task, a lexical decision task, a pseudoword naming task, and a form priming tasks, where participants were undergraduates at the University of Western Ontario. The results demonstrated that during the word recognition, the orthographic and phonological representations of words were activated more efficiently for students with high exposure-to-print than those with low exposure-to-print. Similarly, Sears et al. (2008) reported that the word recognition of high-print-exposure people was less affected by the differences between high- and low-frequency words, or between large- and small-neighborhood-size words, compared with that of low-print-exposure people. The participants in Sears et al. (2008)’s research were all undergraduates at University of Calgary. Another example is Lewellen et al.’s (1993) study, which looked into IDs of lexical familiarity and vocabulary knowledge among undergraduates. It will be further introduced in the review of the IDs of vocabulary knowledge later. Among the studies pertinent to IDs in word recognition, those examining IDs of vocabulary knowledge are most related to this thesis. Studies on the IDs of vocabulary knowledge (Lewellen et al.,1993; Katz et al., in press; Yap et al., 2012) obtained consistent results— participants’ knowledge of words was positively associated with their performance of word recognition. Each of the studies is reviewed in the subsequent paragraphs. One study was conducted by Lewellen et al. (1993). They divided participants into two 8.

(22) groups depending on three criteria: high/low familiarity with words, vocabulary knowledge, and reading experiences. Their vocabulary test was taken from the Nelson-Denny reading test (Nelson & Denny, 1960), in which examinees should respond in 10 minutes to 100 multiple choices about which word among five options expressed one statement the best. Results showed that the high group processed words more efficiently than the low group no matter in tasks of naming, lexical decision, or semantic categorization 3. However, effects of word length and frequency did not significantly differ between the two groups of subjects. Lewellen et al. thus claimed that the group differences of word processing proficiency lied on their disparities of working memory capacity, rather than disparities in the automatic processing of word components. Another study done by Katz et al. (in press) also probed into the relationship between participants’ vocabulary size and response latencies in word recognition, in that they were interested in how much recognition experiments accounted for multiple aspects of reading. 4 Vocabulary size in the study was measured by the Woodcock-Johnson III Diagnostic Reading Battery (WJ) reading vocabulary test and oral vocabulary test (Wechsler, 1999), the Wechsler Abbreviated Scale of Intelligence (WASI) vocabulary test (Woodcock, Mather, & Schrank, 2004), and the Peabody Picture Vocabulary Test (PPVT) (Dunn & Dunn, 2007);. 3. In a semantic categorization task, participants are required to classify a presented stimuli into one of the semantic categories provided by experimenters, like ANIMALS, TOOLS, or PLANTS.. 4. Katz et al. (in press) regarded that people’s vocabulary size represented their reading experience because the size possibly increased with their exposure to print. 9.

they were either a test of word production or a test of word choice. Canonical correlations between each of the four vocabulary tests and response latencies in naming and lexical decision tasks were computed. The correlations were statistically significant, though modest. In addition, reaction time in lexical decision was more strongly related to vocabulary size than reaction time in naming. Katz et al. (in press) concluded that greater knowledge of oral or written vocabulary led to faster retrieval; that is, a rich vocabulary facilitated, rather than inhibited, word recognition.

In the third study, Yap et al. (2012) tested the reliability of each participant’s response latencies across sessions and trials, and reported that subtle IDs were reliably found among 1,280 participants’ trial-level data in both lexical decision and naming tasks. This suggested that each participant possessed a relatively stable processing profile, which was argued to be distinct from his/her average processing speed. Vocabulary knowledge in Yap et al.’s study was measured with the Shipley Institute of Living Scale (Shipley, 1940), which comprised 40 multiple-choice synonym questions. Results showed that higher vocabulary scores were related to faster and more accurate responses. As in Lewellen et al. (1993), in the lexical decision task the performance of high-vocabulary-knowledge participants did not show smaller effects of structural properties (e.g. the number of letters and phonological Levenshtein distance) or of word frequency/semantics. Nonetheless, in the naming task, these two types of effects and a neighborhood size effect were found to differ

between high- and low-vocabulary-knowledge participants. More specifically, the high group’s responses varied less with the three types of word variables than the low group’s responses. The participants’ vocabulary scores in the study were interpreted as the integrity of word representations in their minds. On the grounds of these results, Yap et al. argued that participants with more integral word representations recognized words by relying more on automatic processing than on controlled processing.

Lewellen et al. (1993), Katz et al. (in press), and Yap et al. (2012) all measured vocabulary knowledge by adopting tests of word form, meaning, or association (see Table 2.1). It is doubtful, however, how well the score of a vocabulary test represents a person’s vocabulary knowledge. In these tests, vocabulary knowledge was measured with a restricted set of words and within the range of the scale or score of a given test. Hence, this thesis attempts to investigate the knowledge from a distinct angle: lexical behaviors in participants’ language usage. One merit of doing so is that people’s lexical knowledge is evaluated not by a small set of vocabulary items in a given test, but by the words they use themselves. In this case, the value of a variable assigned to a given participant is personalized and not confined to the scale or the total score of a test. The other merit is that language usage data can provide deeper insight into a person’s lexical knowledge than a vocabulary test can. If a person is able to use or produce a given word naturally (and frequently), it suggests that the word’s representation has been firmly

established in his/her mental lexicon. A test of whether one knows a word, by contrast, is a relatively superficial measure.

Table 2.1: Measures of vocabulary knowledge in Lewellen et al., Katz et al., and Yap et al.

| Paper | Task of visual word recognition | Measure for testing vocabulary knowledge | Task in the measure |
|---|---|---|---|
| Lewellen et al. (1993) | Naming; Lexical decision; Semantic categorization | Nelson-Denny reading test | Choice of stated words |
| Katz et al. (in press) | Naming; Lexical decision | WJ reading vocabulary | Production of synonyms, antonyms, and analogies |
| | | WJ oral vocabulary | Production of synonyms, antonyms, and analogies |
| | | WASI vocabulary | Production to define pictures and words |
| | | PPVT | Choice of stated pictures |
| Yap et al. (2012) | Naming; Lexical decision | Shipley Institute of Living Scale | Choice of synonyms |

The present study examines the IDs of lexical behaviors by focusing on the frequency index of personal word usage as well as personal word frequency. The latter refers to the degree to which a given word-recognition stimulus occurred in one’s data of language

usage. It was hypothesized that a participant would respond faster to words that he/she used more frequently; the personal preference of word usage is taken here as his/her own relative familiarity with the words. Lewellen et al. (1993) also addressed the impact of IDs in lexical familiarity on word recognition, yet the experiment conducted in this thesis differs from theirs. In Lewellen et al.’s (1993) study, participants were split into high and low groups based on their answers to a lexical familiarity questionnaire; subsequently, whether the two groups behaved differently in word recognition tasks was examined. In our experiment, however, a person’s familiarity with words varies with his/her frequency of using those words. As for the frequency index of personal word usage, its hypothesis was drawn from the results of a study examining the correlation between word frequency and word difficulty (Breland, 1996). A review of that study is provided in the following section.

2.2 The Correlation between Word Frequency and Word Difficulty

An ID variable of lexical behaviors in this thesis (i.e. the frequency index of personal word usage) was proposed on the basis of Breland’s (1996) research. Breland (1996) examined the correlation between word frequency in corpora and word difficulty. Word frequency in the study meant the standard frequency index (SFI) of a word in four different collections of

text: a corpus aggregated by Thorndike and Lorge (1944), the Brown corpus (Kucera & Francis, 1967), the American Heritage corpus (Carroll, Davies, & Richman, 1971), and the College Board corpus (Breland, Jones, & Jenkins, 1994). SFI was computed in four steps. Breland’s description of the steps is listed below.

• F: A word type’s total frequency in the corpus.

• D: An index of how the occurrences of the word type are dispersed among the 27 text categories. The index ranges from 0.0 to 1.0. D is 0.0 if the word type occurs in only a single category; D is 1.0 when the occurrences of the word type are distributed in exactly the same proportion in every category. D was calculated as follows:

$$ D = \frac{\log\left(\sum_{i=1}^{n} p_i\right) - \dfrac{\sum_{i=1}^{n} p_i \log p_i}{\sum_{i=1}^{n} p_i}}{\log n} $$

where n = the number of text categories; i = category number, i = 1, 2, …, n; $p_i$ = the probability of a token in the ith category; and $p_i \log p_i = 0$ for $p_i = 0$.

• U: The estimated frequency per 1 million tokens. U is derived from F with an adjustment for D. When D is 1.0, U is computed simply as the frequency per 1 million tokens. If D is less than 1.0, the value of U is adjusted downward. When D equals 0.0, U takes a minimum value based on the average weighted probability of the word type over all text categories. U is computed as follows:

$$ U = \frac{1{,}000{,}000}{N}\left[ F D + (1 - D)\, f_{\min} \right] $$

where N = the total number of tokens in the corpus; and $f_{\min} = \frac{1}{N}\sum_{i=1}^{n} f_i s_i$, that is, 1/N times the sum of the products of $f_i$ and $s_i$, where $f_i$ is the frequency in the ith category and $s_i$ is the number of tokens in that category.

• SFI: The standard frequency index. SFI was computed as:

$$ \mathrm{SFI} = 10\left(\log_{10} U + 4\right) $$

With regard to word difficulty, Breland (1996) adopted the results of Dupuy (1974). Dupuy (1974) provided a difficulty ranking of 123 words randomly sampled from Webster’s Third New International Dictionary. The ranking was based on the answers of nine groups of participants at different educational levels to the Basic Word Vocabulary Test. The test consisted of multiple-choice questions at ten levels of difficulty. Each question had five options that had passed careful distractor analyses. After

collecting word difficulty and SFI information for the four corpora, Breland (1996) conducted correlation tests of the two variables. Results showed high negative correlations, ranging from -.72 to -.83.

Word frequencies across text categories, however, are unavailable in the lexical resource on which the present study is based (i.e. the Chinese Lexicon Profile, introduced in the next section). Instead, a frequency count for each Chinese word form in the Academia Sinica Balanced Corpus is supplied. For this reason, we did not follow Breland’s (1996) computation, but analogously treated a word’s corpus frequency as the likelihood that the word is generally acquired and used by native speakers. Usage of low-frequency words was considered to go along with relatively broad lexical knowledge. Accordingly, by referring to the frequencies in the Sinica Corpus, we proposed a method to compute the frequency index of personal word usage. It was preliminarily hypothesized that participants with lower indices had broader word knowledge and would thus respond more accurately and rapidly in word recognition tasks.

2.3 Corpus Resources

Experiments in this thesis were conducted with the aid of two corpus resources: (1) the Chinese Lexicon Profile, and (2) the i-Corpus. They are introduced in Section 2.3.1 and Section 2.3.2, respectively.
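As a companion to the SFI steps summarized in Section 2.2, the following is a minimal Python sketch of the F, D, U, and SFI computation. It assumes the standard Carroll-style formulation (D as a normalized entropy of the per-category probabilities, SFI = 10(log10 U + 4)); the function names `dispersion` and `sfi` and the example counts are hypothetical and are not taken from Breland (1996) or from this thesis.

```python
import math


def dispersion(p):
    """Dispersion index D: normalized entropy of category probabilities p_i."""
    n = len(p)
    total = sum(p)
    if n < 2 or total <= 0:
        return 0.0
    # By convention, p_i * log(p_i) = 0 when p_i = 0.
    plogp = sum(pi * math.log(pi) for pi in p if pi > 0)
    return (math.log(total) - plogp / total) / math.log(n)


def sfi(freqs, sizes):
    """Return (F, D, U, SFI) for one word type.

    freqs[i]: occurrences of the word in text category i (f_i)
    sizes[i]: number of tokens in text category i (s_i)
    """
    N = sum(sizes)                      # total tokens in the corpus
    F = sum(freqs)                      # total frequency of the word type
    if F <= 0:
        raise ValueError("the word must occur at least once")
    p = [f / s for f, s in zip(freqs, sizes)]
    D = dispersion(p)
    # f_min: 1/N times the sum of the products f_i * s_i.
    f_min = sum(f * s for f, s in zip(freqs, sizes)) / N
    # Frequency per million tokens, adjusted downward when D < 1.
    U = (1_000_000 / N) * (F * D + (1 - D) * f_min)
    return F, D, U, 10 * (math.log10(U) + 4)
```

With these made-up counts, `sfi([50, 50], [500_000, 500_000])` gives D = 1, U = 100, and SFI = 60, whereas `sfi([100, 0], [500_000, 500_000])` gives D = 0 and U = 50: the same total frequency yields a lower index when the word is confined to one category.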

2.3.1 Chinese Lexicon Profile

The Chinese Lexicon Profile (CLP, henceforth) is a research project launched at the LOPE lab at National Taiwan University [5]. The project aims to build a large-scale open lexical database platform for Chinese monosyllabic to trisyllabic words used in Taiwan. With the incorporation of behavioral and normative data in the long term, the CLP will allow researchers across disciplines to explore different statistical models in search of the variables that determine performance in lexical processing tasks, and to train and verify computational simulations [6].

In the initial design of the CLP, each word is presented with values for variables at different linguistic levels: orthography, phonetics, morpho-syntax, semantics, and word frequency. Most characteristics of words in the CLP have been gathered from existing Chinese corpora and lexical resources. The resources are referred to by abbreviations in the following; for their full names, please see the List of Abbreviations at the beginning of this thesis.

The number of Chinese words in the CLP has reached 204,922 so far. Because of this volume, the data are temporarily separated into five files,

[5] The project is inspired by the English Lexicon Project (ELP; Balota et al., 2007) and its French counterpart (FLP; Ferrand et al., 2010), and is designed to be compatible with these resources for the purpose of cross-linguistic comparison. More details will be available at http://lope.linguistics.ntu.edu.tw/clp.
[6] The author undertook most of the tasks at the first stage of development, which is introduced in this subsection.

whose columns are shown in Table 2.2. As displayed in the table, all the files share three identical columns: the identification number of each word, the word form in the Sinica Corpus, and the word form in Chinese Wordnet (Huang & Hsieh, 2010). It should be noted that in Chinese Wordnet, words are presented in units of ‘lemma.’ If a word has distinct sounds or origins, it is represented by different lemmas. For instance, the word 查 has two sounds, cha2 and zha1, and thus has two lemmas, 查 1 and 查 2. However, the other lexical resources subsumed in the CLP do not distinguish lemmas, but provide information at the word-form level. To align the Chinese Wordnet data with the other resources in the CLP files, lemmas are merged into a word form. In other words, lexical information in the CLP is provided on the basis of word forms. Aside from the three identical columns, the remaining columns in each file store information on orthography, phonetics, morpho-syntax, semantics, or word frequency; the details are explained below.

(1) Orthography

Length: the number of characters in a word form.

Stroke: the stroke count of each character in a word form. If a word is made up of more than one character, the stroke counts of adjacent characters are separated by the at sign, @.

Table 2.2: Column names of the collected lexical characteristics in Chinese Lexicon Profile

Orthography: 1. IdNum; 2. Bal_word; 3. Cwn_word; 4. Length; 5. Stroke; 6. Radical; 7. Stroke_PR; 8. Nh_num_N1; 9. Nh_rank_N1

Phonetics: 1. IdNum; 2. Bal_word; 3. Cwn_word; 4. Pinyin; 5. Hp_num; 6. Hp_set; 7. Hp_setId

Morpho-syntax: 1. IdNum; 2. Bal_word; 3. Cwn_word; 4. Pos_A; 5. Pos_ADV; 6. Pos_ASP; 7. Pos_C; 8. Pos_DET; 9. Pos_M; 10. Pos_N; 11. Pos_P; 12. Pos_POST; 13. Pos_T; 14. Pos_Vi; 15. Pos_Vt; 16. Pos_nom

Semantics: 1. IdNum; 2. Bal_word; 3. Cwn_word; 4. Reverse; 5. Sense_num; 6. Facet_num; 7. Holo_num; 8. Mero_num; 9. Hyper_num; 10. Hypo_num; 11. Anto_num; 12. Nearsyno_num; 13. Para_num; 14. Variant_num; 15. SUMO_chi; 16. SUMO_chi_u; 17. SUMO_num; 18. SUMO_num_u; 19. Cilin_tag; 20. Cilin_layer1; 21. Cilin_layer2; 22. Cilin_layer3

Frequency: 1. IdNum; 2. Bal_word; 3. Cwn_word; 4. Freq_B; 5. Freq_C; 6. Freq_X; 7. Freq_Z; 8. FreqR_B; 9. FreqR_C; 10. FreqR_X; 11. FreqR_Z; 12. FreqR_100m_B; 13. FreqR_100m_C; 14. FreqR_100m_X; 15. FreqR_100m_Z; 16. FreqR_100mL_B; 17. FreqR_100mL_C; 18. FreqR_100mL_X; 19. FreqR_100mL_Z

Radical: the radical of each character in a word form. The radicals here are the semantic radicals, which serve as indices in a Chinese dictionary. Radicals of adjacent characters are separated by @.

Stroke_PR: the stroke count of the phonetic radical of each character in a word form.

Nh_num_N1: the number of neighbors that share the first character with a particular word form. Neighbors of this type are called N1 neighbors.

Nh_rank_N1: the frequency rank of a word among its N1 neighbors. A word and its N1 neighbors are compared with one another on their word frequencies in Freq_B; the one with the highest frequency is ranked first and assigned the value 1. This column therefore shows whether a word has neighbors that are used more frequently than it is.

(2) Phonetics

Pinyin: the Pinyin phonetic representations of the characters in a word. For instance, ma3 is the Pinyin of 馬 ‘horse’. If a word comprises more than one character, the at sign @ is again used to separate the Pinyin of adjacent characters.

Hp_num: the number of homophones of a word. This column is based on the data on the website of Sou Ci Xung Zi (搜詞尋字) [7].

Hp_set: a set comprising a particular word and its homophonic words.

[7] http://words.sinica.edu.tw/sou/sou.html

Hp_setId: the CLP IdNum of each word in the Hp_set.

(3) Morpho-syntax

The morpho-syntactic lexical characteristics were collected from Chinese Wordnet (CWN). The CWN [8] (Huang & Hsieh, 2010) is a lexical network consisting of synsets and semantic relations. The meanings of a word there are disambiguated within a framework of three hierarchical levels: lemma, sense, and facet. In other words, a word may possess several lemmas, senses, or facets, all of which carry part-of-speech (POS) tags. Therefore, a word in CWN may have more than one POS tag. The counts of all POS types are recorded in the morpho-syntactic columns introduced below.

Pos_A: the number of times the word is tagged as an adjective.

Pos_ADV: the number of times the word is tagged as an adverb.

Pos_ASP: the number of times the word is tagged as an aspectual marker.

Pos_C: the number of times the word is tagged as a conjunction.

Pos_DET: the number of times the word is tagged as a determiner.

Pos_M: the number of times the word is tagged as a classifier, like zhung3 (種), zhi1 (隻), or ben3 (本).

Pos_N: the number of times the word is tagged as a noun or pronoun.

Pos_P: the number of times the word is tagged as a preposition.

[8] http://lope.linguistics.ntu.edu.tw/cwn/

Pos_POST: the number of times the word is tagged as a postposition, a post-conjunction (e.g. deng3-deng3 (等等) ‘and so on’), or a post-numeric determiner (e.g. duo1 (多) ‘more than…’).

Pos_T: the number of times the word is tagged as de, an interjection, or a particle.

Pos_Vi: the number of times the word is tagged as an intransitive verb.

Pos_Vt: the number of times the word is tagged as a transitive verb.

Pos_nom: the number of times the word is tagged as a noun nominalized from a verb.

(4) Semantics

Reverse: a tag indicating whether a word is still meaningful when the order of its characters is reversed. One example pair is jiao4-zong1 (教宗) ‘Pope’ and zong1-jiao4 (宗教) ‘religion’. If the reversed form is a word in Chinese, the word is tagged “Yes” in this column.

Sense_num: the number of senses of a word form. The sense and facet numbers [9] were extracted from CWN.

Facet_num: the number of facets of a word form.

Holo_num: the number of holonyms of a word form. Holonyms and meronyms,

[9] As mentioned in the description of morpho-syntactic information in the CLP, lexical meanings in Chinese Wordnet are disambiguated at the levels of lemma, sense, and facet. When a lemma has different but related meanings, it is further divided into senses. For instance, zou3 (走) means ‘to walk’ in 越「走」越遠 (‘the farther one walks, the farther away one gets’) and ‘to visit’ in 「走」了一趟故宮 (‘paid a visit to the Palace Museum’). If there are subtler differences within one sense, it is further separated into facets. The noun bao4-zhi3 (報紙) ‘newspaper’ has two facets since it can refer to either the physical object or the content of a newspaper. Unlike senses, facets can co-occur in the same context. For example, both facets of bao4-zhi3 (報紙) are available in the sentence 我喜歡今天的報紙 ‘I like today’s newspaper.’

which will be introduced next, are converse terms that denote the part-whole relationship between words. If X is a part of Y, X is a meronym of Y and Y is a holonym of X. For example, ‘tree’ is the holonym of ‘trunk’ or ‘branch’. The numbers of a word’s semantic relations were all collected from CWN.

Mero_num: the number of meronyms of a particular word form.

Hyper_num: the number of hypernyms of a particular word form. If X contains and is more general than Y, X is the hypernym of Y. For example, ‘color’ is the hypernym of ‘red’.

Hypo_num: the number of hyponyms of a particular word form. Hyponyms are the converse of hypernyms.

Anto_num: the number of antonyms of a particular word form. Antonyms are words that are opposite in meaning. There are three types of antonyms: complementary opposites (e.g. male/female), contrary opposites (e.g. cold/hot), and relational opposites (e.g. buy/sell).

Nearsyno_num: the number of near-synonyms of a word form. Words with overlapping but subtly distinct meanings are called near-synonyms. One example is bao1-rung2 (包容) ‘to tolerate’ and rung2-ren3 (容忍) ‘to bear’.

Para_num: the number of paranyms of a word form. Words that belong to the same semantic classification are designated paranyms. For instance, xung1-di4-xiang4 (兄弟象) ‘Brother Elephant’ and tung3-yi1-shi1 (統一獅) ‘UniPresident Lion’ belong to a group of paranyms concerning baseball teams.

Variant_num: the number of variants of a word form. Variants are different written forms of the same lexical item, like gong1-bu4 (公布) and gong1-bu4 (公佈), both ‘to post’.

SUMO_chi: the Chinese SUMO ontological concepts of a word. The data in this column are from Sinica BOW. A word may be polysemous and thus subsumed under more than one SUMO concept. For instance, tian1-cai2 (天才) ‘genius’ belongs to the concepts neng2-li4 (能力) ‘ability’ and reng2-lei4 (人類) ‘human’. The at sign, @, is used to separate adjacent concepts in this column.

SUMO_chi_u: the unique SUMO concepts of a word. Some words have repeated SUMO concepts in the SUMO_chi column. Take si1-sua4 (撕碎), for example. It can mean ‘to tear into pieces’ or ‘to tear into shreds’, and both meanings fall under the same SUMO concept, feng1-li2 (分離) ‘separating’. In this case, the SUMO_chi slot of si1-sua4 (撕碎) is filled with “分離@分離”, whereas its SUMO_chi_u slot contains only “分離”.

SUMO_num: the number of a word’s SUMO_chi concepts.

SUMO_num_u: the number of a word’s SUMO_chi_u concepts.

Cilin_tag: the tag of a word in the Chinese thesaurus Tongyici Cilin (同義詞詞林) (Mei, Zhu, Gao, & Yin, 1983). The purpose of the thesaurus is to provide a repertoire of Chinese synonyms useful in writing and translating. Words in the book

were organized into categories at three levels. The highest level includes twelve categories, such as human, object, or action. Categories at the other two levels are progressively more specific. There are 94 mid-level categories and 1,429 low-level categories. The three levels are tagged with letters; in addition, numerals are used to group synonyms within the same category.

Cilin_layer1: the high-level tag of a word in Cilin.

Cilin_layer2: the mid-level tag of a word in Cilin.

Cilin_layer3: the low-level tag of a word in Cilin.

(5) Frequency

Freq_B: the frequency of a word in the Academia Sinica Balanced Corpus [10].

Freq_C: the frequency of a word in the Central News Agency data from the Chinese Gigaword corpus (Graff, Chen, Kong, & Maeda, 2005). The news agency is located in Taiwan.

Freq_X: the frequency of a word in the Xinhua News Agency data from the Chinese Gigaword corpus. The news agency is situated in Mainland China.

Freq_Z: the frequency of a word in the Zaobao Newspaper data from the Chinese Gigaword corpus. Like Xinhua News Agency, the newspaper provides data in

[10] http://db1x.sinica.edu.tw/kiwi/mkiwi/

Mainland China.

FreqR_B: the ratio of a word’s Freq_B to the sum of word frequencies in the Academia Sinica Balanced Corpus. As shown in the four previous columns, the CLP includes word frequency records from four corpora. Because the total frequencies of the corpora differ, it is inappropriate to compare their raw frequency counts of a particular word directly. In the FreqR columns, the frequency ratio is computed by dividing a word’s frequency by the total word frequency of the corpus.

FreqR_C: the ratio of a word’s Freq_C to the sum of word frequencies in the Central News Agency data.

FreqR_X: the ratio of a word’s Freq_X to the sum of word frequencies in the Xinhua News Agency data.

FreqR_Z: the ratio of a word’s Freq_Z to the sum of word frequencies in the Zaobao Newspaper data.

FreqR_100m_B: the frequency ratio in FreqR_B multiplied by 100 million.

FreqR_100m_C: the frequency ratio in FreqR_C multiplied by 100 million.

FreqR_100m_X: the frequency ratio in FreqR_X multiplied by 100 million.

FreqR_100m_Z: the frequency ratio in FreqR_Z multiplied by 100 million.

FreqR_100mL_B: the base-10 logarithm of FreqR_100m_B. Since the FreqR_100m columns are multiplied by 100 million, the maximum value of FreqR_100mL is 8.

The logarithms were calculated to reflect that a difference of one or two counts matters more for low-frequency words than for high-frequency words.

FreqR_100mL_C: the base-10 logarithm of FreqR_100m_C.

FreqR_100mL_X: the base-10 logarithm of FreqR_100m_X.

FreqR_100mL_Z: the base-10 logarithm of FreqR_100m_Z.

2.3.2 i-Corpus (Individualized Corpora Project)

As indicated previously, this thesis aims to probe individuals’ lexical knowledge in terms of the lexical behaviors naturally represented in their daily language usage. It is worth noting that the stance we take in measuring ‘individuality’ is naturalistic rather than natural: the lexical behaviors we describe are assumed to be anchored in naturalistic situated interactions, rather than in natural ones (e.g. those recorded with a camera). A pitfall of the latter is that when observers and/or cameras are present, the interactions are not quite what they would be in their absence. Based on this consideration, this thesis begins with a preliminary survey of Facebook data, employing one of the modules of an ongoing NSC-funded research project conducted at the LOPE lab, National Taiwan University. This project envisions constructing i-corpora so as to obtain and analyze a wide spectrum of individual linguistic and extra-linguistic data. Considering that the collected material is restricted by some copyright

issues, a set of iCorpus toolkits is proposed. Run as an integrated software package, the toolkits perform autonomous corpus data collection and exploitation: they extract and analyze large volumes of individual language usage data and automatically provide an idiolect sketch with quantitative information, for the benefit of linguistic and, above all, sociolinguistic studies [11].

Facebook is a social networking service launched in 2004 in the U.S. The service was founded by Mark Zuckerberg, who originally intended to build an interpersonal network for students at Harvard University. It has since become an international service in which users have their own Web space. In that space, a user can post his/her latest messages, photos, videos, etc. All of the updates are aggregated and displayed on the user’s Facebook Wall, a snapshot of which is presented in Figure 2.1. The Facebook module in the i-Corpus project is able to extract a user’s Facebook Wall in JSON format. Facebook released the Chinese version of its website in mid-2008, and the number of Taiwanese users grew explosively in mid-2009 (see Figure 2.2). Therefore, up-to-date data on contemporary language usage can be collected from the service.

[11] More information is available at http://lope.linguistics.ntu.edu.tw/iCorpus.
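To illustrate what working with a Wall exported as JSON might look like, here is a minimal Python sketch. The JSON layout it assumes (a top-level `data` list whose entries carry `from.id` and `message` fields) and the function name `extract_messages` are hypothetical; the actual output format of the i-Corpus Facebook module is not specified in this section.

```python
import json


def extract_messages(wall_json, author_id):
    """Return the text of posts written by `author_id` in a Wall dump.

    Assumed (hypothetical) layout:
    {"data": [{"from": {"id": "..."}, "message": "..."}, ...]}
    """
    wall = json.loads(wall_json)
    messages = []
    for post in wall.get("data", []):
        text = post.get("message")
        # Skip entries without text (e.g. photo-only updates) and
        # entries written on the Wall by other users.
        if text and post.get("from", {}).get("id") == author_id:
            messages.append(text.strip())
    return messages
```

Keeping only the owner's own posts matters for the purpose described above: posts written by friends on the same Wall would not reflect the participant's own lexical behavior.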

Figure 2.1: A snapshot of Facebook Wall.

Figure 2.2: Growth of Facebook users in Taiwan (Smith, 2009).

2.4 Mixed-effects Models

The present study adopted mixed-effects models to carry out statistical analyses. Mixed-effects models, which combine fixed effects and random effects, are a

statistical approach developed in recent years to address the language-as-fixed-effect fallacy. The fallacy was brought forcefully to researchers’ attention by Clark (1973). Clark argued that the then-traditional analysis, in which only a by-subject analysis was carried out, assumed that the item variable (i.e. the experimental stimuli) was a fixed factor. Consequently, experimental results were inappropriately over-generalized to the larger population of items in the language, from which the stimuli had been sampled. After the publication of Clark’s influential paper, psycholinguists turned to alternative statistics: the quasi-F ratio (Clark, 1973) or separate by-subject and by-item analyses of variance (ANOVA) (Forster & Dickinson, 1976). The latter has even become a gold standard in psycholinguistic studies (Baayen, 2008; Baayen, Davidson, & Bates, 2008; Baayen & Milin, 2010; Raaijmakers, Schrijnemakers, & Gremmen, 1999).

Mixed-effects models, however, have three robust advantages over both the quasi-F test and the by-subject and by-item ANOVAs. First, there is no need to aggregate data by collapsing over items to obtain subject-level averages, or vice versa. As a result, IDs in the response to each stimulus are observable and analyzable. Second, mixed-effects models tolerate missing values and fit unbalanced designs, which are common in real-life experiments. Although the quasi-F test outperforms mixed-effects models as far as nominal Type I error rates are concerned, it cannot be computed when values are missing. The third advantage is that covariates can be

accommodated in the models. This enables researchers to inspect the structure of the collected data fully, or even to disentangle the independent influences of all types of variables on the experimental data. More intriguingly, because data need not be aggregated in mixed models, trial number or context-dimensional principal components can be treated as covariates.

Aside from the above advantages, mixed models also outperform regression models by providing improved estimates of by-subject differences, which is the most crucial advantage for this thesis. For instance, Baayen (2008) compared the results of a random regression and a mixed-effects model, as presented in Table 2.3.

Table 2.3: Comparison of by-subject differences in random regression and mixed-effects models (Baayen, 2008)

| Subject | Random regression: (Intercept) | Random regression: Frequency | Mixed-effects models: (Intercept) | Mixed-effects models: Frequency |
|---|---|---|---|---|
| S1 | 365.2841 | 1.2281146 | 385.4278 | 1.08664 |
| S2 | 319.4525 | 1.7300404 | 390.1994 | 1.08664 |
| S3 | 445.8967 | 0.6943159 | 399.5851 | 1.08664 |
| S4 | 542.5428 | -0.2364537 | 397.7705 | 1.08664 |
| S5 | 325.6736 | 1.6250778 | 387.1721 | 1.08664 |
| S6 | 478.6631 | 0.2033189 | 387.6356 | 1.08664 |
| S7 | 471.4654 | 0.6686009 | 413.3528 | 1.08664 |
| S8 | 367.1283 | 1.5067342 | 404.5415 | 1.08664 |
| S9 | 236.8524 | 2.3100814 | 377.8304 | 1.08664 |
| S10 | 377.3522 | 1.1365690 | 386.7957 | 1.08664 |

Both statistical methods modeled the relationship between word frequency and participants’ response latencies. We can see that in the mixed model the coefficient for word frequency did not vary across subjects, and the range of by-subject intercepts was narrower than in the random regression. For a clearer look at the by-subject differences, the intercepts are visualized in Figure 2.3. The left panel was derived from the random regression, whereas the right panel was derived from the mixed-effects model. The small circles in each panel are the estimated intercepts for the ten subjects, and the bold horizontal line marks the mean of the intercepts (i.e. coef = 400). From the figure, it is evident that, compared with random regression, the mixed-effects model’s estimates are more successful owing to its assumption of a common slope.

Figure 2.3: Comparison of the estimated intercepts for subjects in random regression and mixed-effects models (Baayen, 2008).
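The narrowing of by-subject intercepts discussed above can be illustrated with a small Python sketch of partial pooling. This is neither Baayen’s (2008) analysis nor a full mixed-effects fit: it assumes a balanced, intercept-only design and estimates the variance components by the method of moments; the function name `shrunken_intercepts` and the latencies in the usage note are made up for illustration.

```python
from statistics import mean, variance


def shrunken_intercepts(rts_by_subject):
    """Compare raw subject means with partially pooled estimates.

    rts_by_subject: dict mapping subject id -> list of latencies (ms),
    with the same number of trials (at least 2) for every subject.
    Returns (raw_means, shrunken_means) as two dicts.
    """
    subjects = list(rts_by_subject)
    n = len(rts_by_subject[subjects[0]])          # trials per subject
    raw = {s: mean(rts_by_subject[s]) for s in subjects}
    grand = mean(list(raw.values()))
    # Residual (within-subject) variance, pooled over subjects.
    sigma2 = mean([variance(rts_by_subject[s]) for s in subjects])
    # Between-subject variance by the method of moments, floored at 0.
    tau2 = max(variance(list(raw.values())) - sigma2 / n, 0.0)
    # Shrinkage weight: 1 keeps a subject's own mean, 0 pools completely.
    weight = tau2 / (tau2 + sigma2 / n) if tau2 > 0 else 0.0
    shrunk = {s: grand + weight * (raw[s] - grand) for s in subjects}
    return raw, shrunk
```

For example, with `{"A": [400, 420, 380, 400], "B": [500, 520, 480, 500], "C": [600, 620, 580, 600]}`, each shrunken estimate lies between the subject's own mean and the grand mean of 500, so the spread of the by-subject estimates is narrower than that of the raw means. A dedicated mixed-model package would estimate slopes and variance components jointly.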

Chapter 3 Data Collection

This chapter consists of two sections. Section 3.1 presents how we collected data from a visual lexical decision task, an extensively used type of word recognition experiment. Section 3.2 describes the procedure for extracting and pre-processing participants’ Facebook data by means of an i-Corpus module. The Facebook data were prepared for the quantification of personal lexical behaviors introduced in Chapter 4.

3.1 Lexical Decision Task

3.1.1 Participants

Sixteen native speakers of Chinese (10 females and 6 males, aged 21 to 29) consented to participate in the task and were paid for their participation. To increase the likelihood of finding individual differences (IDs) in personal lexical

behaviors, the participants were recruited from diverse backgrounds. Eight of them were from various occupations, including soldiers, accountants, counselors, research assistants, and technology engineers. In addition, there were five graduate students and three undergraduates, most of whom came from different departments or graduate institutes at National Taiwan Normal University, National Tsing Hua University, or National Central University. Participants were required to be right-handed and to report their familial handedness (i.e. whether they had left-handed blood relatives). A previous study (Bever, Carrithers, Cowart, & Townsend, 1989) found that people with familial left-handedness processed words by relying more on vocabulary knowledge than on syntactic knowledge, in contrast to people with familial right-handedness. Therefore, in addition to a self-report handedness inventory (Oldfield, 1971), a questionnaire on familial handedness was included in the personal background information sheet (see Appendix A).

3.1.2 Materials

The experimental materials included 456 Chinese words and 456 non-words. The word stimuli were selected from the Chinese Lexicon Profile (CLP) in three steps. First, a list of nouns was retrieved from the CLP. To ensure that the predominant part of speech of each stimulus was noun, we required that a stimulus be assigned a “noun” tag in the CLP and have no other part-of-speech tags, such as verb or adverb. Words with the “noun” tag

in the CLP, however, encompassed both nouns and pronouns. The pronouns were therefore manually identified and removed, so that only nouns remained in the list of stimulus candidates. Second, the nouns in the list were ordered by their word frequency in the Academia Sinica Balanced Corpus. The top, middle, and bottom twenty percent of the ordered words (n = 152 each) were then taken as the high-, mid-, and low-frequency noun stimuli, so that the three frequency bands were equally represented. Finally, three Chinese-speaking reviewers inspected whether the 456 words were meaningful nouns in Mandarin Chinese. Although the stimuli selected from the CLP were all existing, meaningful Chinese words, some of them are rarely seen and would most likely be treated as non-words by our participants. Hence, if a word was judged to be a non-word by at least one reviewer, it was eliminated and replaced by another candidate noun, so that the final list still contained 456 word stimuli (Appendix B).

In addition to word frequency, the number of characters, the number of senses, and the neighborhood size (footnote 12) of each word were collected from the CLP. These variables are treated as covariates in the statistical analysis because we intended to disentangle their influences on the lexical-decision responses. Tables 3.1-3.3 show the counts of high-, mid-, and low-frequency stimuli grouped by each of the three variables.

12. The neighborhood size refers to the number of neighbors a word has. Neighbors are words that differ from a given word in the character at a certain position. For example, wei4-sheng1 (衛生) ‘hygiene’ and wei4-xing1 (衛星) ‘satellite’ are neighbors of wei4-mian3 (衛冕) ‘to defend the championship’ because they both share the first character with wei4-mian3. Previous studies (H. W. Huang, 2003; Tsai, Lee, Lin, Tzeng, & Hung, 2006) found a neighborhood-size effect when the neighborhood size of the first character of a word was counted, rather than the summed number of neighbors at all positions. This thesis therefore follows their approach.

Table 3.1: Counts of high-, mid-, and low-frequency word stimuli with different character numbers in the visual lexical decision task

Type of frequency       Number of characters   Count
High-frequency words    2                      141
High-frequency words    3                       11
Mid-frequency words     2                      129
Mid-frequency words     3                       23
Low-frequency words     2                      116
Low-frequency words     3                       36

Table 3.2: Counts of high-, mid-, and low-frequency word stimuli with different sense numbers in the visual lexical decision task

                     Type of frequency
Number of senses     High    Mid    Low
1                     50      86    102
2                     57      44     31
3                     22      12     13
4                     14       9      2
5                      5       1      4
6                      2       0      0
8                      2       0      0
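As an illustration of the frequency-based selection described in Section 3.1.2 and of the neighborhood measure defined in footnote 12, a minimal Python sketch is given here. It is not the procedure actually used to build the stimulus list. The function names, the toy lexicon, and the frequency values are hypothetical, and the sketch makes two assumptions that the text does not spell out: the middle band is centred on the median rank, and only words of the same length as the target count as neighbors.

```python
def frequency_bands(freq_by_word, share=0.2):
    """Split words into high/mid/low bands, each holding `share` of the ranked list."""
    # Rank the words from most to least frequent.
    ranked = sorted(freq_by_word, key=freq_by_word.get, reverse=True)
    n = round(len(ranked) * share)
    # Assumption: the middle band is centred on the median rank.
    mid_start = (len(ranked) - n) // 2
    return {
        "high": ranked[:n],
        "mid": ranked[mid_start:mid_start + n],
        "low": ranked[len(ranked) - n:],
    }


def first_char_neighborhood_size(word, lexicon):
    """Count words in `lexicon` that share the first character with `word`."""
    # Assumption: a neighbor has the same length as `word` and is not `word` itself.
    return sum(
        1
        for other in lexicon
        if other != word and len(other) == len(word) and other[0] == word[0]
    )


# Hypothetical candidate list of 760 nouns with made-up, strictly decreasing frequencies.
candidates = {f"noun{i:03d}": 1000 - i for i in range(760)}
bands = frequency_bands(candidates)
print({name: len(words) for name, words in bands.items()})

# Toy lexicon built from the example in footnote 12, plus one unrelated word.
toy_lexicon = ["衛生", "衛星", "衛冕", "生活"]
print(first_char_neighborhood_size("衛冕", toy_lexicon))
```

The candidate list size of 760 is chosen only because twenty percent of 760 equals the reported band size of 152; the thesis does not state how many nouns the candidate list contained.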

Table 3.3: Counts of high-, mid-, and low-frequency word stimuli with different neighborhood sizes in the visual lexical decision task

                     Type of frequency
Neighborhood size    High    Mid    Low
1-20                  18      17     37
21-40                 25      38     22
41-60                 29      17     25
61-80                 15      25     15
81-100                22      12     19
101-120                8       9      4
121-140               15      15      7
141-160                6       2      6
161-180                5       5      2
181-200                0       5      6
201-900                9       7      9

To balance the numbers of ‘yes’ and ‘no’ stimuli, 456 non-words were also included. There were 387 two-character and 69 three-character non-words (Appendix C). These non-words were randomly generated from the characters of existing Chinese nouns. Taking two-character non-words as an example, the procedure of random generation is illustrated in Figure 3.1. The first and second characters of existing nouns were stored separately in two vectors. Next, the first and second characters of a non-word were randomly selected from the respective vectors and then combined. If an automatically

generated non-word sounded like an existing word, it was excluded from the non-word list. The three-character non-words were generated from existing three-character words in the same way.

The task had a within-subjects design; in other words, every participant saw all 912 stimuli. The non-words and the high-, mid-, and low-frequency words were evenly divided into four blocks. The order of the four blocks was counterbalanced across the 16 participants. Within a block, the stimuli were presented in a random order.

Figure 3.1: The procedure for random generation of two-character non-word stimuli in the visual lexical decision task

3.1.3 Procedure

Each participant was tested individually in a quiet room. Participants first completed a background information sheet and were then seated in front of a laptop. The experiment

was run and presented on the laptop with E-Prime 2.0 Professional. At the beginning of the experiment, instructions (see Appendix D) were displayed, asking participants to judge whether a visually presented stimulus was a meaningful word in Mandarin Chinese. Their judgments were recorded as soon as they pressed the ‘yes’ or ‘no’ response button. The ‘yes’ and ‘no’ buttons were the k and f keys of the laptop keyboard, respectively; both keys are at a two-key distance from the central key. A green label bearing 是 (shi4) ‘yes’ was pasted on the k key, and a red label bearing 否 (fou3) ‘no’ was pasted on the f key. Participants were required to respond as quickly as possible without sacrificing accuracy. The whole experiment comprised four blocks; between blocks, participants could take a break to reduce the influence of fatigue on response speed and accuracy. Before each block started, a reminder on the laptop monitor asked participants to place their forefingers on the ‘yes’ and ‘no’ buttons.

The procedure of a trial is displayed in Figure 3.2. A trial began with a fixation sign (+) appearing in the center of the monitor for 1000 ms. Next, a stimulus was presented. The presentation was terminated immediately when the participant responded. If no response was detected within 4000 ms, the stimulus was removed from the monitor. After the stimulus presentation ended, feedback was shown on the monitor for 750 ms, along with the participant’s accumulated accuracy rate

in the block. The entire experiment lasted approximately one hour. Prior to the experiment, a practice session was given to familiarize participants with the experimental procedure. The session contained 4 words and 4 non-words, none of which appeared in the formal experiment.

[Trial sequence: fixation sign + (1000 ms), stimulus (0-4000 ms), feedback (750 ms)]

Figure 3.2: The procedure of a trial in the lexical decision task

3.2 Facebook Data

The Facebook module in i-Corpus was employed to gather participants’ data on language usage and preferences. Figure 3.3 illustrates the procedure of accessing and saving one Facebook user’s data. Because the module was at a rudimentary stage of development, it was still semi-automatic; more specifically, the initial steps in the procedure had to be

manually accomplished. Details of all steps are given below.

[Flowchart steps, in order:]
Log in to an app to get a user's access token to Facebook.
Paste the access token into the i-Corpus program.
Type in a participant's Facebook ID.
Save the data on the participant's Facebook Wall (JSON format).
Extract each message in the categories of post, photo, comment, and other users' walls (each message was saved as a text).
Pre-process the 'post' messages with the CKIP Word Segmentation System.

Figure 3.3: The procedure of collecting participants’ data of language usage on a Facebook Wall

First, we logged in to an application to obtain the author’s access token. Second, the access token was pasted into the i-Corpus program in order to access the Web-page scripts of the author’s Facebook friend. Third and fourth, the name of a Facebook user (i.e. a participant in this thesis) was entered into the program so as to retrieve his/her data in the scripts of the Facebook Wall. Fifth, from the Wall data, we extracted only the messages