Latent Conceptual Analysis: Using Web Data in Chinese to Represent Conceptual Knowledge about Word Relations in a Vector Space Model


Master's Thesis
Graduate Institute of English, National Taiwan Normal University

Latent Conceptual Analysis: Using Web Data in Chinese to Represent Conceptual Knowledge about Word Relations in a Vector Space Model

Advisors: Dr. Shu-Kai Hsieh and Dr. Miao-Hsia Chang
Student: Qian-Rong Chang
July 2012

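Abstract (translated from the Chinese)

In the field of natural language processing, lexical patterns are used in many experiments that compute similarity among semantic relations. Despite their growing importance, few scholars have examined which aspect of information these patterns reflect about the semantic relations they are claimed to represent. This thesis argues that lexical patterns and the semantic relations they represent share the same conceptual properties in language use. It also proposes a computational model, Latent Conceptual Analysis (LCA), that captures these conceptual properties and uses them to compute similarity. LCA is an automatic algorithm that relies mainly on singular value decomposition (SVD) to handle the high dimensionality produced by a large-scale corpus. In this thesis, 35 lexical patterns are first generated semi-automatically as input to LCA. For each pattern, LCA then produces a list of the other 34 patterns ordered from nearest to farthest by similarity distance. To examine how well LCA works, its output is compared with a manually annotated result whose clustering criteria follow the standards used for classification in the lexical resource FrameNet. The comparison shows that the similarity distances computed by LCA resemble the manual clustering. The approach is close to those of Turney (2006) and Bollegala et al. (2009), but differs in that it does not rely on frequency distribution alone: language users' conceptual knowledge about lexical patterns is also taken into account. Because LCA draws its corpus from Web content, the instability and volatility of that content sometimes affect its performance. Future studies may reduce this problem by collecting data over a long period.

Keywords: relation similarity, lexical relations, vector space model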

ABSTRACT

In the field of Natural Language Processing, lexical patterns are applied in many experiments that involve measuring similarity among word relations. Despite their growing importance, however, these patterns are rarely examined in terms of what aspect they inherit from the word relations they are claimed to represent. This thesis proposes that lexical patterns exhibit the same conceptual nature as word relations do: both display conceptual qualities when they are applied in language use. It also proposes that the conceptual nature of lexical patterns can be captured and implemented in a computational model, latent conceptual analysis (LCA), to calculate similarity among the patterns. LCA is an automatic algorithm that relies on singular value decomposition (SVD) to reduce the high dimensionality resulting from a large-scale corpus. In the thesis, 35 lexical patterns are generated semi-automatically and each is sent to LCA as input, after which its distance from the other 34 patterns is determined. To validate the performance of LCA, the result is compared with that of a manual clustering whose standards are based on principles applied in FrameNet. The comparison shows that LCA achieves a result similar to that of manual clustering. The approach adopted in the thesis is similar to that of Turney (2006) and Bollegala et al. (2009). However, instead of relying solely on frequency distribution, LCA also takes into consideration language users' conceptual knowledge about lexical patterns. Because LCA uses Web content as its corpus, the dynamic and constantly changing nature of data collected from the Web can sometimes affect its performance. It is therefore suggested that future studies applying LCA collect data over a long period to alleviate this problem.

Keywords: relation similarity, lexical relations, Vector Space Model

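Acknowledgements

I used not to believe that there were benefactors in a person's life. I thought that each person's circumstances took whatever shape that person's own will had made. Yet over the long years of writing this thesis, after the many ups and downs that were waiting for me, and now that the thesis is complete, I have paused to look back carefully at every face that has passed through my life. Only now do I believe that benefactors exist, not because they helped me achieve something, but because they took part in my life. What they said and did influenced who I was and shaped who I will be. I sincerely thank the benefactors who appeared beside me while I was writing this thesis. Thank you for accompanying me through such an important part of my life; your existence made mine possible.

I thank my advisor, Dr. Shu-Kai Hsieh. The academic inspiration he has given me is beyond measure. Within his innovative thinking, which reaches toward the frontier of technology, lies a passion for contributing to the world. It was my advisor who made me reflect on what I can bring to the society I live in as we try again and again in pursuit of new knowledge, and which solid bricks I can lay for the vast house of scholarship. His expectations made me take my responsibilities more seriously, and his reprimands taught me to respect the promises I make. While supervising this thesis he was at times disappointed in me and at times encouraged me, and he made me see that I must work harder still to become a more responsible person, not merely someone who hands in a thesis.

I thank my advisor, Dr. Miao-Hsia Chang. Her care for students cannot be conveyed in a few superficial words. I still remember my first year in the graduate program in English at National Taiwan Normal University, when I took her writing class. She asked the class who was unclear about the format of English academic writing, and I was one of the few who raised a hand. After that, whenever she drew anonymous assignments for the class to critique and discuss, mine often appeared. Over a semester of such training, the shame I first felt in class gradually gave way to the confidence that came with progress. That course helped my writing a great deal, and her guidance is of course an important reason I was able to complete this thesis.

I thank my committee members, Professor 高照明 and Professor 鍾曉芳. Whenever I read each word and sentence that Professor 鍾曉芳 wrote in my thesis draft, I am sincerely grateful for her dedication and persistence toward her students. I am also very grateful that Professor 高照明 was willing to make time on short notice, despite a busy schedule, to serve on my committee; his encouragement and suggestions benefited me greatly. These teachers are all benefactors in my life. They took part in the birth of this thesis, and each of their suggestions and decisions makes me hope that I can likewise demand more of myself and persevere more.

I thank every friend around me and my dear family: all of you who expressed care and support, and all of you who thoughtfully refrained from asking about my progress. I know how important you are to me, and so I dedicate this work to you. Writing a thesis is a midway checkpoint academically and an important checkpoint in my life. I know clearly what I am dissatisfied with in myself, and I know what I have put in. I will not dismiss my own efforts because of my shortcomings, nor should I go easy on myself because I have finished a thesis. I thank everyone who encouraged me, everyone who scolded me, and myself: thank you for not giving up halfway.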

Table of Contents

Chapter 1 Introduction
  1.1 Connecting the Word, Connecting the World
  1.2 Research Questions
  1.3 A Caveat about Data Collection
  1.4 Organization of the Thesis
Chapter 2 Theoretical Background
  2.1 Word Relations and Lexical Patterns
    2.1.1 What Kind of Word Relations?
    2.1.2 Lexical Patterns
    2.1.3 Algorithms Measuring Relational Similarity
  2.2 Vector Space Models
    2.2.1 What Is a Vector Space Model?
    2.2.2 VSM in NLP
    2.2.3 Implementing VSM
  2.3 Rethinking the Lexical Patterns
    2.3.1 Capturing Conceptual Nature in Lexical Patterns
  2.4 A Linguistic Analysis Framework
    2.4.1 Frame Semantics
    2.4.2 FrameNet
  2.5 Summary
Chapter 3 Methodology
  3.1 Selecting a Pair
  3.2 Selecting a Corpus
  3.3 Extracting Lexical Patterns
    3.3.1 The Algorithm for Lexical Pattern Generation
    3.3.2 Generating Lexical Patterns
  3.4 Latent Conceptual Analysis
    3.4.1 The Algorithm
  3.5 Summary
Chapter 4 Results and Discussion
  4.1 The Manual Clustering
    4.1.1 The Leadership_(political) Frame
    4.1.2 The Change_of_Leadership Frame
    4.1.3 The Attending Frame
    4.1.4 The Meet_with Frame
    4.1.5 The Statement Frame
    4.1.6 The Request Frame
    4.1.7 The Defending Frame
    4.1.8 The Convey_Importance Frame
    4.1.9 The Experiencer_Focus Frame
    4.1.10 The Quitting_a_Place Frame
    4.1.11 The Judgment_Communication Frame
    4.1.12 The Destroying Frame
    4.1.13 The Subjective_Influence Frame
    4.1.14 Summary
  4.2 Similarity Measures Done by LCA and Analysis
    4.2.1 The Data and Definition of Correspondence
    4.2.2 Similarity Measure Result of Multiple Member Clusters
  4.3 Summary
Chapter 5 Concluding Remarks
  5.1 Thesis Summary
  5.2 Limitations and Suggestion for Future Studies
Bibliography
Appendix A

List of Tables

Table 1. A sample SAT question
Table 2. Alternate pairs of the original pair quart:volume
Table 3. Short strings that contain quart:volume
Table 4. Sample of the implicit relation dataset
Table 5. Top 10 sample clusters with most frequent lexical patterns
Table 6. Annotation of the example sentence
Table 7. Original pair and alternate pairs
Table 8. 35 lexical patterns with frequency
Table 9. The manual clustering result of the 35 lexical patterns
Table 10. Correspondence of LCA result to the Leadership_(political) frame result
Table 11. The commonly shared words in Y 的總統 X and X 連任 Y 總統
Table 12. Correspondence of LCA result to the Change_of_Leadership frame result
Table 13. Correspondence of LCA result to the Attending frame result
Table 14. The commonly shared words in X 參加 Y and X 愛 Y
Table 15. Correspondence of LCA result to the Meet_with frame result
Table 16. The commonly shared words between X 重申 Y and X 召見 Y
Table 17. Correspondence of LCA result to the Statement frame result
Table 18. The commonly shared words between X 呼籲 Y and X 說 Y
Table 19. Correspondence of LCA result to the Experiencer_Focus frame result
Table 20. The commonly shared words between X 說 Y and X 關心 Y
Table 21. Correspondence of LCA result to the Quitting_a_Place frame result

Chapter 1 Introduction

1.1 Connecting the Word, Connecting the World

The ability to make connections is perhaps one of the most precious and indispensable parts of human intellect. As human beings, we perceive, interact, and behave largely by drawing on previously stored conceptual knowledge, whether by adding new constituents to modify existing concepts or by exploiting our source domain knowledge (the conceptual knowledge from which we draw metaphors) to analogize unfamiliar events. Although making connections is not exclusive to humans, we are the only species that can put those connections into words. We conceptualize the surrounding environment and transform it into verbal or textual stimuli that are employed in establishing associative paths. This process makes it possible for us to store, convey, and accumulate community experience, thereby forming cultures.

Despite its importance in human intellect, connection in words is not understood in any single, agreed way among scholars in philosophy, linguistics, psychology, and computer science. Confusion abounds over incompatible terminology even within similar theoretical disciplines, and identical names in different approaches can lead to inconsistent interpretations. This inconsistency is due in part to the wide applicability of word relations across research areas concerned with meaning.

This thesis focuses on the perspective of Natural Language Processing (NLP), searching for empirical evidence of word relations in Web content. In NLP, some models and applications rely on word relations with a view to constructing a structured representation of human worldviews. These algorithms perform tasks such as recognizing word analogy (Turney et al. 2003; Veale 2004; Turney & Littman 2005), classifying semantic relations (Rosario, Hearst, & Fillmore 2002; Turney 2006), word sense disambiguation (Turney 2006), automatic thesaurus generation (Hearst 1992; Fellbaum 1998; Turney & Littman 2005), and information extraction (Zelenko, Aone, & Richardella 2003; Bollegala et al. 2009).

Studies of word relations that extract information automatically from texts are often inspired by Hearst's (1992) lexico-syntactic patterns (or lexical patterns). In her paper, Hearst identifies terms in the corpus that are connected through a predefined semantic relation (e.g., hyponymy) and derives from the result possible patterns that represent the desired word relation (e.g., such X as Y and X, especially Y). Once verified by experts, these lexico-syntactic patterns are applied domain-independently to define and extract new term pairs that may also exhibit the predefined relation (i.e., a hyponymic relation). It is worth noting that more than one lexico-syntactic pattern can represent a predefined word relation. Calculating similarity between these patterns, and thereby categorizing them into groups that represent different word relations, is therefore very important in this approach.

Other researchers following Hearst's work have expanded their scope beyond the frequently studied canonical semantic relations (e.g., synonymy or antonymy) to include other lexical or implicit relations (e.g., attribute, part-of, people-affiliation, etc.), whose meanings are largely determined by factual information. In order to collect enough information to identify these implicit relations in the constantly growing and changing electronic documents on the Internet, the lexico-syntactic patterns are often generated automatically. A common assumption in work involving automatic generation of lexical patterns is that these patterns exemplify human understanding of word relations in actual language use. However, in spite of its ubiquitous presence in various simulation models, this assumption is rarely examined in detail. In particular, few researchers have stated clearly in what sense word relations are encoded in the patterns in the computer-based approach.

1.2 Research Questions

To fill the void mentioned above, this thesis investigates what aspect of word relations is captured in automatically generated lexico-syntactic patterns. In other words, I will try to answer in what sense lexico-syntactic patterns can be said to represent word relations. A subsequent question is whether this aspect can be represented in a computational model that calculates similarity between lexico-syntactic patterns.

In this thesis, I hypothesize that what lexico-syntactic patterns share with the word relations they represent is their conceptual nature. More specifically, speakers' concepts of word relations such as synonymy or antonymy can be captured in their use of lexico-syntactic patterns such as X is identical to Y or X is the opposite of Y. These lexical patterns can evoke conceptual knowledge similar to that which people have in their understanding of word relations. To validate the hypothesis, a Web content-based computational model will be built to semi-automatically calculate similarities between lexico-syntactic patterns with regard to the concepts they inherit from word relations. To determine the performance of the proposed model, I will conduct a linguistic analysis of the results. It is hoped that the analysis will help shed some light on how the theory and application of lexico-syntactic patterns can reflect our experience and knowledge about word relations.

1.3 A Caveat about Data Collection

It is inevitable to make a few assumptions when dealing with usage data in a computational simulation model. Some of them are made because of technical limitations. For example, although computer technology has made remarkable progress in recent years, it must still be admitted that computers cannot perform as well as humans in every linguistic aspect. The foremost obstacle is that written text is so far the only practical input format. Granted, some devices can transform vision, sound, and other linguistically relevant signals into a medium that machines can process, but they are still not sophisticated enough for comprehensive application. Therefore, in the remainder of this thesis, whenever language is mentioned, it is written language that I mean, unless otherwise stated.

1.4 Organization of the Thesis

This thesis is organized as follows. Chapter 2 reviews lexical relations and determines the aspect to look at when discussing them. It also introduces lexical patterns and related research, reviews the basic components of a vector space model together with two state-of-the-art applications, and presents the linguistic analytical framework. Chapter 3 introduces the corpus used in this thesis, including the data extraction method and research procedures. Chapter 4 focuses on the analysis of the results and the discussion. Chapter 5 offers the conclusion and some suggestions for further study.

Chapter 2 Theoretical Background

One of the central goals of this thesis is to provide a theoretical explanation for the applicability of lexical patterns in representing word relations, indicating what aspects are conveyed through the automatically generated patterns. Since most research applying lexical patterns requires calculating similarity via computational models (these works are reviewed in Sec 2.1.3), the findings of this study are also designed to be implemented in similarity measure algorithms. To begin with, a set of lexical patterns will be generated semi-automatically, and a computational model that reflects the hypothesized aspect will be constructed to calculate similarity between the lexical patterns. To investigate how much the computational model reveals about human understanding of these lexical patterns, the result will be discussed in a linguistic analytical framework.

As is clear from the procedures mentioned above, this thesis covers topics in both computer science and semantics. A brief review of related work in both fields is therefore given before the discussion continues. This chapter is divided into four parts. The first part gives an introduction to word relations and lexical patterns. To explain how lexical patterns are applied in computational research on relational similarity measures, the second part focuses on computational models and their applications. In the third part, with the necessary theoretical background presented, a hypothesis about the applicability of lexical patterns is developed. Lastly, a linguistic analytical framework is presented for the discussion of research results in this thesis.

2.1 Word Relations and Lexical Patterns

In light of the progress made in modeling human language ability, increasing attention is being paid to the mental lexicon. Among the interested parties, there is a growing consensus that the paradigmatic semantic relations among words (e.g., antonymy, synonymy, hyponymy and the like) are somehow relevant to the structure of lexical or conceptual information (Murphy 2003). The importance of these relations manifests itself in an extensive range of research terminologies in fields including philosophy, cognitive psychology, linguistics, computer science and cognitive neuroscience, that is, in just about any area concerned with word meaning or the mind.

On the one hand, everyone has basic knowledge about how these relations connect word senses. To test this universal relational competence in speakers of different languages, Raybeck and Herrmann (1990) presented pairs of words with predefined relations to speakers of 7 different languages, who were then asked to classify the pairs according to the similarity among the relations. The subjects consistently sorted similar pairs of words (e.g., male/female and remember/forget) into one group while assigning pairs like car/tire to another, which suggests that knowledge about relations is shared across speakers of different languages. On the other hand, no one seems to be able to pin down what exactly these relations are and where they belong in our mental representation. It is therefore extremely difficult, if not downright impossible, to develop a unifying definition of word relations that fits well in the various theoretical structures.

Since different approaches follow different traditions, it is important to specify the perspective taken on word relations in this thesis before going further. I will therefore start this section by making clear what treatment of word relations is assumed: are these relations among words or among the things they represent? A brief history of research on lexical patterns and their relationship with word relations will then be given.

2.1.1 What Kind of Word Relations?

Most lexical semanticists claim that semantic relations do not connect words per se; rather, it is the senses words carry that are brought together in the relations. In some of the literature these relations are studied under the name sense relations (Lyons 1977) or meaning relations (Allan 1986) instead of lexical relations. For instance, for the sense contrast high temperature/low temperature, several word pairs can be said to encode this relational information: hot/cold, boiling/freezing, and steamy/lukewarm. However, this view of relations cannot fully account for the canonicity issue, as illustrated in the work of several researchers (Gross, Fischer & Miller 1989; Charles, Reed & Derryberry 1994). They point out that in antonymic relations, some pairs of words (e.g., good/bad and hot/cold) are considered more canonical than others (e.g., hot/cool and happy/miserable). Additionally, as will become more evident in the following discussion, senses are not the only factor that comes into play when a relation is formed between two words. Pragmatic factors also need to be considered. Accordingly, I prefer the term lexical relations to semantic relations when discussing word relations, since senses are not the only concern in my work.

A caveat is in order, however, since lexical relation is itself an ambiguous name for word relations if no further explanation is given. According to Murphy (2003), there are at least two treatments of lexical relations: the intralexical treatment and the metalexical treatment. In structuralist and generativist theories, a lexicon is a collection of linguistic information that cannot be derived from other information. It stands to reason, then, that information contained in a lexicon is arbitrary or idiosyncratic. In Murphy's definition, therefore, an intralexical treatment of relations asserts that knowledge about word relations is (a) self-contained within the lexicon and (b) specifically linguistic. In other words, this treatment holds that word relations cannot be derived, and that non-linguistic knowledge such as social or cultural context plays no part in one's knowledge about word relations. As will become clear in the later discussion, neither claim can fully account for actual language use in some situations.

According to Murphy (2003), the metalexical treatment contrasts with the intralexical one in holding that word relations are not contained in a lexicon and are therefore composed of human conceptual knowledge about the words, rather than of the words themselves. The conceptual nature of word relations can be understood in three aspects: (a) they are productive, (b) they are context-dependent, and (c) they display the prototypicality effect.

If word relations were arbitrary or idiosyncratic, they could not be accounted for by rules. However, there is in fact an array of productive mechanisms by which new instances of synonyms or antonyms are generated. For example, following the morphological rule that generates oppositional pairs of words, the new lexical item defriend is now widely used among young people who are active in social networking applications like Facebook and MySpace. The new item stands in contrast to add-someone-to-the-friend-list. This example shows that word relations are not idiosyncratic linguistic information contained in one's lexicon.

Another piece of evidence in support of a metalexical approach is the fact that word relations are understood, to a large extent, based on context. For instance, luggage and baggage are synonyms only when they describe containers filled with personal items during trips. Empty luggage will not be treated as a synonym of baggage, as illustrated in the anomalous sentence: *I bought a new set of baggage for my trip. Another example of relations' dependence on context is the color pair blue/green. Although the pair is certainly not perceived as opposing lexical items when one is describing the physical features of objects, it is antonymous to most Taiwanese, because the two colors represent parties that stand at the two extremes of the political spectrum in Taiwan. The contrast is so stark that supporters of the two parties may actually be affected when asked to express a preference for neutral objects colored blue or green. This is because the color pair has been conceptually incorporated into their concepts of politics in this culture. One may argue that these senses of the colors can be represented within the lexicon as different meanings, with the relations derived accordingly. However, as context varies with non-linguistic factors (e.g., social or cultural conventions), the number of senses is potentially limitless, which shows that word relations cannot be represented intralexically.

The last piece of evidence to justify the metalexical treatment is the prototypicality effect. People can naturally judge which examples of a certain association are better, determining some candidate word pairs to be more canonical than others (e.g., big/little over gigantic/tiny). These decisions are based more on the subjects' memory and experience than on their linguistic ability. As in the case of context-dependency, people evidently produce and perceive lexical relations according to their knowledge about the world they live in, the social conventions they follow, and the activities they participate in.

To sum up, we understand lexical relations with the aid of our metalinguistic knowledge, and this knowledge differs from our linguistic knowledge in its concept-like nature. It therefore suffices to say that lexical relations can be seen as realizations of our concepts about this world. This metalexical treatment of word relations and their conceptual nature is the stance taken in this thesis, and it parallels Murphy's claim that “lexical should only indicate any properties ‘involving words’ rather than ‘contained in the mental lexicon’” (Murphy 2003, p.9).

2.1.2 Lexical Patterns

In the last section I made clear that the metalexical treatment is the perspective taken when examining lexical relations in this thesis. This approach takes a conceptual interpretation of how we produce and perceive lexical relations based on our commonly shared experience and social background. This may seem like a trivial statement, but in fact few studies in NLP have stated specifically what relations they are looking at before implementing them. Having pointed out what aspect of lexical relations to look at, I now turn to their linguistic realizations in computational models that help achieve tasks in natural language processing, that is, the lexical patterns.

Across the tasks that involve lexical relations in NLP, a ubiquitous assumption is that lexical patterns, or lexico-syntactic patterns, represent the lexical relations between word pairs. In most tasks that involve lexical patterns, they are defined as a set of items that co-occur so frequently with a word pair that they may be used to indicate the relation between the pair. The items in the set can be lemmas, punctuation marks, or words with specific part-of-speech tags. For example, the lexical pattern NP0 is the opposite of NP1 may indicate an antonymic relation between the word pair NP0 and NP1.

To the best of my knowledge, Hearst's (1992) work on hyponyms is the earliest to adopt the pattern-based method. Using an encyclopedic corpus of 8.6 million words, Hearst extracted hyponymous information between named entities.

The task was done with the help of a set of 5 pre-defined lexical patterns that are supposed to capture hyponymous relations. One example is such X as Y, which extracts facts about Shakespeare from phrases like such authors as Shakespeare. Among the 226 words that constitute the 153 candidate hyponymous pairs Hearst managed to identify, 106 words are included in WordNet, which suggests that Hearst's approach can be seen as a useful complement to the existing thesaurus.

Following Hearst's work, Berland and Charniak (1999) took a slightly different approach to the study of meronymous words. Working with natural texts in a larger corpus (100 million words), they started by creating seed pairs (i.e., a set of hand-coded meronyms), which were then used to extract substrings from sentences that include these seed pairs. After lexical patterns were manually identified from the collection of substrings, new pairs of candidate meronyms were picked out if they co-occurred with the lexical patterns in sentences. The result was evaluated by the majority vote of 5 human judges. For the top 50 meronyms derived from their algorithm, they reported an accuracy of 55%.

Targeting a range of relations including both meronyms and hyponyms, Pantel and Pennacchiotti (2006) identified a set of generic lexical patterns automatically. Generic patterns are those with broad coverage and low precision. They also began with a set of seed pairs, obtaining generic patterns to increase the recall rate. All generic patterns were then rated by a reliability measure, which calculates how reliable a pattern is based on pointwise mutual information. After discarding the less reliable patterns, the algorithm went on to collect new pairs from the Web. Using a corpus of 6 million words, they obtained precision scores between 73% and 85% when a random sample of 50 extracted pairs was judged by 2 human judges.
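The seed-pair procedure described above can be illustrated with a minimal sketch. It is not the implementation of any of the cited studies: the toy corpus, the seed pairs, and the function names extract_patterns and apply_pattern are assumptions made for illustration, patterns are reduced to the literal string found between the two words of a pair, and no reliability measure is applied.

```python
import re
from collections import Counter

def extract_patterns(sentences, seed_pairs):
    """Count the strings that occur between the two words of each seed pair."""
    counts = Counter()
    for sentence in sentences:
        for x, y in seed_pairs:
            match = re.search(re.escape(x) + r"\s+(.+?)\s+" + re.escape(y), sentence)
            if match:
                counts["X " + match.group(1) + " Y"] += 1
    return counts

def apply_pattern(pattern, sentences):
    """Return (X, Y) candidates: single words on either side of the pattern's middle."""
    middle = pattern[len("X "):-len(" Y")]
    regex = re.compile(r"(\w+)\s+" + re.escape(middle) + r"\s+(\w+)")
    pairs = []
    for sentence in sentences:
        pairs.extend(regex.findall(sentence))
    return pairs

# Hypothetical toy corpus and hand-coded seed meronym pairs.
corpus = [
    "the wheel is part of the car",
    "the engine is part of the car",
    "the roof is part of the house",
    "the page is part of the book",
]
seeds = [("wheel", "car"), ("engine", "car")]

patterns = extract_patterns(corpus, seeds)
best_pattern, _ = patterns.most_common(1)[0]
# Keep only pairs that were not already given as seeds.
new_pairs = [pair for pair in apply_pattern(best_pattern, corpus) if pair not in seeds]
```

On this toy corpus the most frequent pattern is X is part of the Y, and applying it back to the corpus yields the new candidate pairs roof/house and page/book. A realistic system would additionally filter patterns, for example with a reliability score of the kind Pantel and Pennacchiotti describe.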

So far, I have introduced lexical patterns and presented some instances of them. It should now be clear that these patterns are widely applied in studying lexical relations. As I mentioned in Chapter 1, such studies often involve similarity measures of lexical relations. Therefore, in order to offer a more detailed description of what lexical patterns can help to achieve, in the following section I will go through some important tasks that involve measuring relational similarity.

2.1.3 Algorithms Measuring Relational Similarity

For a better understanding of what lexical relations can help achieve in NLP, I will now introduce some major tasks that involve measuring similarity between lexical relations.

2.1.3.1 Recognizing Verbal Analogy

A verbal analogy has the form A:B::C:D, meaning “A is to B as C is to D.” An example given by Turney and Littman (2005) is “mason is to stone as carpenter is to wood.” The task of recognizing verbal analogy is, given a stem word pair and a set of choice word pairs, to select the choice that is most analogous to the stem. Turney et al. (2003) attempted this problem by combining several independent modules to answer SAT (Scholastic Aptitude Test) questions. An example of an SAT question is shown in Table 1. As shown in their discussion, of all the modules, the one based on the VSM (Vector Space Model) performed best, achieving a score of 47% on a set of 374 SAT questions.

Table 1. A sample SAT question

Stem:      mason:stone
Choices:   (a) teacher:chalk
           (b) carpenter:wood
           (c) soldier:gun
           (d) photograph:camera
           (e) book:word
Solution:  (b) carpenter:wood

Veale (2004) dealt with the same set of 374 SAT questions using a lexicon-based approach. He applied WordNet, measuring the path from node A to node B for the word pair A:B. Each candidate choice pair was then evaluated by comparing its path with that of the stem pair. Veale’s approach attained a score of 43%. In contrast to Veale’s lexicon-based approach, Turney (2005) applied a corpus-based one to the SAT questions and attained a score of 56%. In this work, Turney introduced an enhanced version of the VSM approach called Latent Relational Analysis (LRA). LRA calculates the similarity between pairs of words according to their corresponding lexical patterns. A more detailed review of LRA is presented in Section 2.2.3.1.

As argued by Turney (2006), in addition to answering SAT analogy questions, our daily use of metaphorical language can also be expressed as SAT-style verbal analogies. For example, the sentence you need to budget your time can be expressed in the format money:budget::time:schedule. This treatment of metaphor is supported by Gentner et al. (2001), who assert that novel metaphors are understood using analogy while the conventional ones are simply

recalled from memory. In this case, even someone who encounters the example sentence you need to budget your time for the first time will be able to use analogical knowledge to interpret the novel metaphor.

2.1.3.2 Classifying Noun-Modifier Relations

The task of classifying noun-modifier relations is to identify the possible semantic relations between a noun and its modifier. Noun-modifier pairs have attracted much scholarly attention because of their high frequency in English (Turney 2006). Lauer (1995) approached the noun-modifier problem with a corpus-based method. He used the British National Corpus (BNC) as his database, interpreting the pairs by inserting prepositions such as of, for, in, at, on, and with between the noun and its modifier. For instance, the pair reptile haven was paraphrased as haven for reptiles. In the medical domain specifically, Rosario and Hearst (2001) classified noun-modifier relations using Medical Subject Headings and the Unified Medical Language System as their lexical resources. They distinguished 12 classes of semantic relations with a supervised neural network.

2.1.3.3 Information Extraction

In general, Information Extraction (IE) refers to automatic methods for creating a structured representation of selected information drawn from documents. A frequently applied method is to identify relations expressed in natural language. More specifically, the task involves looking for pairs of entities that satisfy a given relation in a given document. In this way, drug names or adverse interactions between medical treatments can be automatically identified from multiple

unstructured medical documents. The relation extraction task was first introduced as part of the Template Element Task in the Message Understanding Conferences (MUC) in 1988. Many different approaches have been proposed since then. Zelenko, Aone, and Richardella (2003) introduced a kernel method over pairs of parse trees for extracting relations such as person-affiliation and organization-location, and achieved success on two simple relation extraction tasks. Recently the Web mining-based approach to IE has gained much attention (Pantel & Pennacchiotti 2006; Banko et al. 2007; Bollegala et al. 2007; Davidov & Rappoport 2008). These approaches start by searching the Web as a corpus to gather co-occurrences of word pairs together with the strings of words between them, and then generate textual patterns according to their frequencies. This is similar to the task of classifying semantic relations, except that the focus here is on the relations between a specific pair of entities.

2.1.3.4 Automatic Thesaurus Generation

Algorithms for automatic thesaurus generation were at first developed for specific semantic relations such as meronymy and hyponymy. Hearst (1992) introduced a corpus-based algorithm for extracting hyponyms (type of), and Berland and Charniak (1999) approached meronyms (part of) using a corpus. With a view to covering a wider variety of relations in a more comprehensive thesaurus, WordNet (Fellbaum 1998) was constructed to include more than a dozen semantic relations between words.

2.2 Vector Space Models

In the previous sections we specified that it is the conceptual nature of lexical relations that we are focusing on. We also showed that lexical patterns can be seen as the linguistic realization of lexical relations and can thus be applied to facilitate research on relations. What remains unanswered is how we can represent this conceptual nature of lexical relations in a format that a machine can process. In NLP, the answer to this question lies in a readily deployable application: the Vector Space Model (VSM). In the following sections, I will first introduce VSMs in terms of their components and the theoretical assumptions behind their construction. After offering this basic knowledge about VSMs, I will review two implementations for a better understanding of how they work.

2.2.1 What Is a Vector Space Model?

First developed by Salton and his colleagues (Salton, Wong, & Yang 1975) for the SMART information retrieval system, the VSM has received much scholarly attention in Information Retrieval (IR). Many of the concepts behind modern search engines derive from the formalization of the VSM. Its success in IR has inspired researchers in NLP to apply it to processing natural texts automatically. The basic idea of any space model is that similarity can be represented as proximity in an n-dimensional space, where n is a positive integer. According to Sahlgren (2006), space models use a geometric metaphor as their representational basis. This geometric metaphor is composed of two bodily metaphors: similarity-is-proximity and entities-are-locations. The similarity-is-proximity metaphor indicates that when two things (or meanings) are similar, they are conceptualized as being near each other, while dissimilar things (meanings) are conceptualized as being far apart.

The entities-are-locations metaphor, on the other hand, suggests that in order for two things (or meanings) to be conceptualized as being close to or far from each other, they need to possess spatiality. These two metaphors constitute the geometric metaphor: things (meanings) are locations in a vector space, and similarity is proximity between the locations.

Note that not every space model that uses vectors and matrices counts as a VSM. In Turney’s definition (2010:143), “the values of the elements in a VSM must be derived from event frequencies, such as the number of times that a given word appears in a given context.” Tasks involving the VSM include document retrieval (Salton et al. 1975), essay grading (Foltz, Laham, and Landauer 1999), question answering (Dang, Lin, and Kelly 2006), word similarity (Deerwester et al. 1990; Landauer and Dumais 1997), automatic thesaurus generation (Fellbaum 1998), word sense disambiguation (Agirre and Edmonds 2006; Pedersen 2006), and information extraction (Bollegala et al. 2007).

There are several motivations for using the VSM for semantics. First, automating tasks that would otherwise require laborious hand-coding significantly reduces the time spent on research and increases the scope of experiments. Second, because the VSM was originally devised to retrieve documents based on similarity calculation, it performs particularly well on tasks that measure closeness among linguistic units, including paragraphs, sentences, and words.

2.2.2 VSM in NLP

As pointed out by Turney (2010), the theme that unifies the various VSM applications in natural language processing can be broadly summarized in the statistical semantics hypothesis: “Statistical patterns of human word usage can be used to figure out what people mean (Weaver 1955; Furnas et al. 1983). – If units of text have similar vectors

in a text frequency matrix, then they tend to have similar meanings” (Turney 2010:153). In the following sections I will discuss some of the basic procedures in constructing a VSM and elaborate on what it actually means to say that vectors are similar.

2.2.2.1 Linguistic Processing

Normally the input of a VSM is a large corpus of natural language text. Before setting up a matrix, it can be useful to apply several linguistic pre-processing methods. Roughly speaking, these methods fall into three groups: tokenization, normalization, and annotation. Note that these pre-processing methods can vary in order and content according to different experimental considerations.

Tokenization involves steps that break a raw text into meaningful subparts. In English, this is a relatively simple task since words are separated by spaces. According to Manning et al. (2008), an accurate English tokenizer is supposed to deal with punctuation (e.g., don’t and Jane’s), hyphenation (e.g., ice-cream-flavored candy), and multi-word terms (e.g., computer science and Ann Lee). By comparison, tokenizing raw text in Mandarin Chinese can be a challenging task, for there is no clear-cut boundary between words. For example, bianlishangdian ‘convenience store’ can be understood by some native speakers as one single word, while it is still logically possible for others to conceive of it as more than one word (e.g., bianli ‘convenient’ and shangdian ‘store’). To date, although there are still some controversies over how to segment Chinese words, most researchers choose the system developed by the

Chinese Knowledge Information Processing Group (CKIP)¹ for segmentation, because it offers unified criteria and stable performance.

Another type of linguistic pre-processing is normalization, which aims to remove superficial variations of a single word. Broadly speaking, the two most common types of normalization are case folding (converting uppercase into lowercase) and stemming (stripping words of their affixes) (Turney 2010). In information retrieval, the performance of a system is customarily evaluated by precision and recall. Precision is the fraction of retrieved items that are truly relevant, while recall is the fraction of relevant items in the corpus that are retrieved. In most cases, normalization reduces precision while it increases recall.

Conversely, annotation often reduces recall and increases precision. This is partly because annotation labels identical strings of words with distinguishing markings, which makes it possible to differentiate between otherwise identical words. Common types of annotation include part-of-speech tagging (e.g., marking nouns and verbs), word sense tagging (marking polysemous words with their intended meanings), and parsing (marking words with their structural functions, that is, their grammatical roles) (Manning & Schütze 1999).

¹ http://ckipsvr.iis.sinica.edu.tw
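The trade-off between precision and recall under normalization can be illustrated with a small sketch. The four-document collection, the relevance labels, and the helper name `precision_recall` below are invented for this illustration; they only show how case folding may raise recall while lowering precision.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Toy collection (invented): document id -> tokens.
# The information need is documents about the country "US".
docs = {
    "d1": ["US", "economy", "grows"],
    "d2": ["join", "us", "today"],      # pronoun, not the country
    "d3": ["US", "policy", "shift"],
    "d4": ["us", "exports", "rise"],    # the country, typed in lowercase
}
relevant = {"d1", "d3", "d4"}           # hand-labelled for this illustration

# Without normalization: exact match on "US".
exact = {d for d, toks in docs.items() if "US" in toks}
# With case folding: every token is lowercased before matching.
folded = {d for d, toks in docs.items() if "us" in [t.lower() for t in toks]}

p_exact, r_exact = precision_recall(exact, relevant)
p_folded, r_folded = precision_recall(folded, relevant)
```

In this toy setting the exact match misses the lowercase mention (lower recall), whereas case folding recovers it at the cost of also retrieving the pronoun (lower precision).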

2.2.2.2 Mathematical Processing

The mathematical processing involved in constructing a VSM requires four steps: (1) building matrices, (2) applying weighting schemes, (3) smoothing the matrix, and (4) measuring the vectors. This section introduces the four procedures in detail.

Building Matrices

As mentioned above, one of the defining properties that sets VSMs apart from other vector-based models is that the elements in a VSM correspond to event frequencies. The essence of event frequency is illustrated in Turney’s description: “a certain item (term, word, word pair) occurred in a certain situation (document, context, pattern) a certain number of times (frequency)” (Turney 2010). Building a matrix therefore involves counting event frequencies in the corpus according to certain criteria and recording the results in the cells of the corresponding matrix, usually denoted by F (for frequency). Matrices built from a large corpus normally contain a considerable number of zero frequencies, which results in a sparse representation.

Weighting Schemes

In information retrieval, it is important to assign more weight to unexpected occurrences of events while reducing the influence of those that are expected. This notion has been formalized in an array of different weighting schemes. One of the most popular is the tf-idf (term frequency × inverse document frequency) family of weighting functions (Spärck Jones 1972). To illustrate, let us assume F is a words-by-documents matrix and $f_{ij}$ is the weight assigned to the frequency of word i in document j, which is defined as:

$$f_{ij} = \mathrm{TF}_{ij} \times \mathrm{IDF} \tag{1}$$
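As a rough illustration of equation (1), the sketch below weights a toy words-by-documents matrix. The text does not commit to a particular IDF formula, so the variant used here, log(N / df) with N the number of documents and df the number of documents containing the word, is an assumption; the function name `tf_idf` and the counts are likewise invented.

```python
import math


def tf_idf(counts):
    """Weight a words-by-documents matrix of raw frequencies.

    Assumption: IDF = log(N / df), one common member of the tf-idf family.
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)            # documents containing the word
        idf = math.log(n_docs / df) if df else 0.0   # 0 for words in every document
        weighted.append([c * idf for c in row])
    return weighted


# Toy frequency matrix F (rows: words, columns: documents); invented counts.
F = [
    [3, 2, 4],  # a word that occurs in every document
    [5, 0, 0],  # a word concentrated in a single document
]
W = tf_idf(F)
```

Under this variant the word that appears in every document receives zero weight everywhere, while the word concentrated in one document keeps a high weight there.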

The point of the first component $\mathrm{TF}_{ij}$ is to indicate how important word i is in document j. The idea is that the more frequently a word occurs in a certain document, the more important it tends to be for distinguishing that document from a set of documents. The hypothesis that frequency can be taken as a reliable indicator of a word’s discriminative ability can be traced to the work of Luhn (1958). The second component, IDF, indicates how discriminative word i is among the documents in the corpus. To put it another way, a term is deemed less discriminative if it occurs with high frequency across several documents. As a consequence, it should be subjected to functions that decrease its influence in the matrix F. Other weighting approaches include length normalization (Singhal, Salton, Mitra, and Buckley 1996), which normalizes document length in order to reduce the bias in favor of longer documents during similarity calculation, and Pointwise Mutual Information (PMI) (Church & Hanks 1989), which measures the mutual dependence of two words.

Smoothing the Matrix

One of the benefits of automating tasks in natural language processing is that it expands the scope of research, so the corpora used in automatic systems tend to be larger than those used in hand-built ones. This increase in corpus size is reflected in the added dimensions of a matrix. Unfortunately, high dimensionality can pose a problem for the efficiency of any algorithm. The dilemma faced by most researchers is that we need to collect as much data as possible in order to faithfully reflect how people use their language, yet we do not want our data to grow so large that it undermines or even prohibits our systems. The basic idea of smoothing a

matrix is to alleviate such problems. Roughly speaking, it reduces dimensionality and removes noise from the matrix. One of the most widely used methods for this purpose is Singular Value Decomposition (SVD). As applied by Deerwester et al. (1990), SVD improves similarity measurements through a mathematical operation on the term-document matrix. The mathematical details of SVD need not concern us here (a more detailed introduction can be found in Berry et al. 1995); for the purpose of this thesis it is sufficient to note that SVD, based on linear algebra, decomposes a matrix into independent components to which factor analysis can be applied. The concept of factor analysis is that it represents the intercorrelations among a group of variables with a new group of abstract variables, each of which is unrelated to the other members of the new group, while in combination they can regenerate the original data (Landauer & Dumais 1997). Accordingly, a high-dimensional matrix can be decomposed via SVD into smaller matrices that retain the linearly independent factors of the original matrix. Multiplying these smaller matrices back together after discarding the less important factors yields an approximation of the original matrix with a lower dimensionality.

SVD has been applied to measure word similarity by Landauer and Dumais (1997) under the name of Latent Semantic Analysis (LSA). It is called latent in the sense that hidden relations among words and documents, rather than only the superficial ones, can be retrieved. Two documents can be considered close to each other without sharing the same term. For instance, a document containing boat can be associated with another document containing ship.
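The boat and ship example can be reproduced on a toy scale. The sketch below uses NumPy’s `numpy.linalg.svd` to build a rank-2 approximation of a small term-document matrix; the matrix, the choice of k = 2, and the helper names `truncated_svd` and `cosine` are assumptions made for illustration, not the setup of Deerwester et al. or Landauer and Dumais. Similarity between document columns is measured with the cosine.

```python
import numpy as np


def truncated_svd(F, k):
    """Rank-k approximation of F, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


def cosine(x, y):
    """Cosine of the angle between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Toy term-document matrix (invented). Rows: boat, ship, sea, tax.
# Columns: d0 = {boat}, d1 = {ship}, d2 = {boat, sea},
#          d3 = {ship, sea}, d4 = {tax, tax}.
F = np.array([
    [1, 0, 1, 0, 0],  # boat
    [0, 1, 0, 1, 0],  # ship
    [0, 0, 1, 1, 0],  # sea
    [0, 0, 0, 0, 2],  # tax
], dtype=float)

F2 = truncated_svd(F, k=2)

before = cosine(F[:, 0], F[:, 1])    # d0 and d1 share no term
after = cosine(F2[:, 0], F2[:, 1])   # same documents after smoothing
```

In the raw matrix d0 and d1 are orthogonal because they share no term. After smoothing they point in nearly the same direction, because boat and ship both co-occur with sea in the other documents.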

Measuring Vectors

From the geometric metaphor we have learned that two similar words can be conceptually understood as being close to each other in a space. Calculating the similarity between two words can therefore be understood as measuring the distance between the two vectors that stand for the words in the space model. But how do we measure the distance between two vectors? To date, the most popular way is to calculate the cosine of the angle between them (Turney 2010). To illustrate, let us assume we want to calculate the cosine between two vectors $\mathbf{x}$ and $\mathbf{y}$, given by:

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^{n} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2} \cdot \sum_{i=1}^{n} y_i^{2}}} = \frac{\mathbf{x}}{\lVert \mathbf{x} \rVert} \cdot \frac{\mathbf{y}}{\lVert \mathbf{y} \rVert} \tag{2}$$

Note that the two vectors are normalized to unit length during the calculation of the cosine, so the formula is not sensitive to the length of the vectors. This is very important, since we do not want a more frequent word (e.g., car), whose vector is relatively longer, to be treated as dissimilar to its less frequent counterpart (e.g., automobile) merely because of the difference in frequency. The cosine ranges from −1, which means the two vectors point in opposite directions, to +1, which means they point in the same direction.

2.2.3 Implementing VSM

Having introduced what VSMs are and how to construct one, I will now briefly summarize two approaches in NLP that apply the VSM. It is important to mention these two approaches because they are so far the state of the art for the tasks they are designed to solve. Both algorithms aim at computing the similarity between two given word pairs, assuming the closer two word pairs are determined to be, the more likely

it is for them to share the same lexical relations, that is, they are considered analogies. The following two sections will give a more detailed description of both approaches.

2.2.3.1 Latent Relational Analysis

With a view to simulating human performance in making analogies between word pairs, Turney and Littman (2005) used the AltaVista search engine to extract event frequencies for building the vectors of their VSM. They evaluated the system’s performance on a set of 374 college-level SAT-style verbal analogy questions, achieving a score of 47%. The LRA introduced by Turney (2005) is an enhanced version of this original VSM approach. It extends the previous VSM in three respects: (1) instead of the 128 hand-coded lexical patterns used in the VSM approach, LRA uses patterns that are automatically derived from the corpus, (2) SVD is applied to smooth the frequency matrix (hence the name latent), and (3) synonyms are automatically generated to explore reformulations of the original word pairs. LRA achieved a score of 56% on the 374 SAT analogy questions, which is statistically equivalent to the average college students’ score of 57%.

The following is a short description of the core algorithm of LRA. First, for each word pair in a given set, synonyms of each of the two words are looked up in Lin’s (1998) automatically generated thesaurus, and the original words in the pair are replaced with these synonyms to form alternate pairs. For example, for the pair quart:volume, Turney (2005) obtained the alternate pairs shown in Table 2.

Table 2. Alternate pairs of the original pair quart:volume

pint:volume          quart:turnover
gallon:volume        quart:output
liter:volume         quart:export
squirt:volume        quart:value
pail:volume          quart:import
vial:volume          quart:revenue
pumping:volume       quart:sale
ounce:volume         quart:investment
spoonful:volume      quart:earnings
tablespoon:volume    quart:profit

Alternate pairs that rarely co-occur in the WMTS (Clarke, Cormack, & Palmer 1998), a corpus of about 50 billion English words, are dropped. The idea is that the previous procedure may generate synonyms based on a different sense of the original word, and such wrongly captured synonyms are less likely to co-occur with their counterpart in the pair. After this filtering procedure, the original pairs and the remaining alternate pairs are used to elicit short strings of words that begin with one of the words in a pair and end with the other. Examples of short strings are shown in Table 3.

Table 3. Short strings that contain quart:volume

quarts liquid volume      volume in quarts
quarts of volume          volume capacity quarts
quarts in volume          volume being about two quarts
quarts total volume       volume of milk in quarts
quarts of spray volume    volume include measures like quarts
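In LRA, short strings such as those in Table 3 are turned into candidate lexical patterns by optionally replacing each intervening word with a wildcard. The sketch below enumerates those candidates for one string. The function name `candidate_patterns` is hypothetical, and the sketch assumes the pair members are the first and last tokens of the string; it does not reproduce Turney’s frequency-based selection of the top patterns.

```python
from itertools import product


def candidate_patterns(short_string):
    """Enumerate patterns by replacing any, all, or none of the
    intervening words with the wildcard '*'.

    Assumption: the first and last tokens are the two members of the
    word pair, and everything between them is an intervening word.
    """
    words = short_string.split()
    inner = words[1:-1]
    patterns = []
    # Each intervening word is either kept (False) or wildcarded (True).
    for mask in product([False, True], repeat=len(inner)):
        pattern = " ".join("*" if wild else w for w, wild in zip(inner, mask))
        patterns.append(pattern)
    return patterns


patterns = candidate_patterns("quarts of spray volume")
```

A string with n intervening words yields 2^n candidates, which is why the number of patterns grows so quickly on a large corpus and why pruning by frequency becomes necessary.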

The short strings are then used to generate candidate lexical patterns by replacing any, all, or none of the intervening words with a wildcard. For instance, the string quarts of spray volume yields the candidate patterns ‘of spray’, ‘* spray’, ‘of *’, and ‘* *’. This procedure is so productive that it can generate millions of patterns. To reduce the burden on the system and increase precision, Turney (2005) kept only the 4000 most frequent lexical patterns.

The next step is to build a frequency matrix by mapping word pairs to rows and patterns to columns. Each cell in the matrix represents the number of times that the corresponding pair (row) appears in the corpus with the corresponding pattern (column). Since many of the cells are zero, resulting in a sparse matrix, SVD is applied to the matrix in order to reduce noise and condense it. After the matrix is constructed, the relational similarity between any two pairs is calculated as the cosine of the angle between the two vectors representing the corresponding word pairs. Therefore, to answer an SAT analogy question, LRA calculates the similarity between the stem pair and each of the choice pairs, selecting the pair nearest to the stem as the final solution.

Despite LRA’s impressive performance on SAT analogy questions, some issues are left unsolved in the principles it adopts. Generally speaking, LRA determines that two pairs of words are similar to each other based on their similar distribution over several lexical patterns. However, since more than one lexical relation can connect any two words, the lexical patterns collected for one pair of words can also represent several different lexical relations.
For instance, the lexical patterns which appear between ostrich and bird can be X is a large Y, or X doesn’t look like a Y, two patterns that apparently represent different

relations. Therefore, without a clear principle for the lexical patterns collected, it is problematic to say that two pairs of words are analogous in LRA.

2.2.3.2 Implicit Relations among Entities

In information extraction, implicit relations are widely employed to identify relationships between things. More specifically, researchers in this field are interested in finding the facts that bind two entities together. For example, facts about cat and mouse can be captured through implicit relations in expressions such as ‘catch’, ‘run after’, and ‘hate’. Aiming to measure the similarity between pairs by clustering implicit relations, Bollegala et al. (2009) proposed an algorithm that specifically adopts Web contents as its corpus. Given the two words in a pair, the algorithm finds contexts of both words on the Web, from which lexical patterns are extracted and clustered according to their frequencies. To calculate the similarity between pairs, Bollegala et al. introduced the Mahalanobis distance (Mahalanobis 1936) in feature vector space to approximate the inter-cluster correlations. According to Turney (2010), event frequency is the defining property of a VSM, and since the feature vector space built by Bollegala et al. is based on the frequency of lexical patterns on the Web, their approach can be regarded as another example of a VSM.

The algorithm, termed CORR by Bollegala et al., starts by collecting snippets (the short descriptions of Web page content shown by most commercial search engines) via the search engine Google as the contexts in which both words of a pair co-occur. A series of queries containing one to three wildcards (e.g., ‘A * B’, ‘A * * B’, ‘A * * * B’, ‘B * A’, ‘B * * A’, and ‘B * * * A’) are sent to the search engine in order to circumvent the 1000 query limit imposed by Google. From the extracted

context, lexical patterns with a length of no more than 3 words are generated and the less frequent ones are dropped. Since one lexical relation (or implicit relation in this case) can be expressed by several lexical patterns (e.g., ‘is a kind of’, ‘is a’, and ‘can be counted as a’ all express the same relation), the next step is to cluster the generated patterns according to the relations they supposedly correspond to. To calculate their similarity, each lexical pattern p is treated as a word-pair frequency vector p, whose elements are the frequencies with which the pattern p co-occurs with each word pair. The similarity between two such vectors is the cosine of the angle between them. A sequential clustering algorithm is then applied to group lexical patterns efficiently based on their similarity and a cluster similarity threshold θ. The basic idea is that if the similarity between a word-pair frequency vector p and a certain cluster (which can be taken as a combination of several word-pair frequency vectors) exceeds the threshold θ, the pattern is added to that cluster; otherwise it is put into a new cluster. After the clustering procedure, the lexical patterns are grouped into sets, each of which can be taken as one lexical relation.

In the last procedure, each word pair is represented as a feature vector whose elements are the frequencies with which the pair appears in each cluster. It is important to remember that one cluster contains a collection of several lexical patterns, so the frequency of a pair in a certain cluster is the sum of the frequencies of this pair co-occurring with each lexical pattern in the cluster. The relational similarity between two pairs a/b and c/d, denoted as relsim((a, b), (c, d)), is then calculated as follows:

$$\mathrm{relsim}((a, b), (c, d)) = \mathbf{f}_{ab}^{\top} \Lambda \mathbf{f}_{cd} \tag{3}$$

Here, $\mathbf{f}_{ab}$ and $\mathbf{f}_{cd}$ respectively represent the feature vectors of the word pairs a/b and c/d, and Λ is a correlation matrix.
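A minimal sketch of the last procedure and of Formula 3 follows. The clusters, the pattern counts, and both choices of Λ are invented for illustration; in particular, the correlated Λ is a hand-written placeholder rather than the matrix Bollegala et al. derive from their data, and vector normalization is omitted for brevity.

```python
def feature_vector(pattern_counts, clusters):
    """One element per cluster: the summed frequency of the word pair
    over all lexical patterns belonging to that cluster."""
    return [sum(pattern_counts.get(p, 0) for p in cluster) for cluster in clusters]


def relsim(f_ab, f_cd, Lam):
    """Bilinear form f_ab^T Lam f_cd from Formula 3."""
    return sum(
        f_ab[i] * Lam[i][j] * f_cd[j]
        for i in range(len(f_ab))
        for j in range(len(f_cd))
    )


# Two hypothetical pattern clusters (pattern strings taken from the text).
clusters = [
    ["X acquires Y", "X has acquired Y"],
    ["X headquarters in Y", "X offices in Y"],
]

# Invented co-occurrence counts for two word pairs.
counts_ab = {"X acquires Y": 3, "X has acquired Y": 2}
counts_cd = {"X has acquired Y": 4, "X offices in Y": 1}

f_ab = feature_vector(counts_ab, clusters)
f_cd = feature_vector(counts_cd, clusters)

# With the identity matrix, Formula 3 reduces to a plain inner product.
identity = [[1.0, 0.0], [0.0, 1.0]]
score_ip = relsim(f_ab, f_cd, identity)

# A placeholder matrix with off-diagonal terms lets correlated clusters
# contribute to the score as well.
correlated = [[1.0, 0.5], [0.5, 1.0]]
score_corr = relsim(f_ab, f_cd, correlated)
```

The contrast between the two scores shows what Λ contributes: off-diagonal entries allow evidence from different but related clusters to count toward the similarity of two pairs.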
The definition of relational similarity as given in

Formula 3, according to Bollegala et al., can be seen as a general framework into which all of the existing systems can be integrated. Seen from the perspective of LRA’s algorithm, the matrix Λ can be obtained via SVD. However, as proposed by Bollegala et al., since lexical relations are not always mutually independent, the correlation between the relations represented by the clusters should be accounted for. Therefore, instead of using SVD to produce the matrix, their approach applies the Mahalanobis distance, a distance measure in variable space that takes into account the correlation in the data.

To evaluate the proposed algorithm, Bollegala et al. created a data set of 100 named-entity pairs covering five relation types, 20 for each: ACQUIRER-ACQUIREE, PERSON-BIRTHPLACE, CEO-COMPANY, COMPANY-HEADQUARTERS, and PERSON-FIELD. Table 4 shows some instances of each type and the number of contexts extracted from the Web.

Table 4. Sample of the implicit relation dataset

Relation type           Total contexts   Examples
ACQUIRER-ACQUIREE       91349            (Google, YouTube), (Adobe Systems, Macromedia)
PERSON-BIRTHPLACE       72836            (Franz Kafka, Prague), (Charlie Chaplin, London)
CEO-COMPANY             82682            (Terry Semel, Yahoo), (Steve Jobs, Apple)
COMPANY-HEADQUARTERS    100887           (Microsoft, Redmond), (Yahoo, Sunnyvale)
PERSON-FIELD            99660            (Albert Einstein, Physics), (Roger Federer, Tennis)

After extracting lexical patterns from the contexts, they clustered the patterns according to their similarity. Table 5 lists the 10 clusters with the most patterns. The total number of lexical patterns in each cluster is shown in brackets after the cluster name. For every cluster, the three most frequent patterns are given as examples.

Table 5. Top 10 sample clusters with their most frequent lexical patterns

Cluster (size)     Pattern 1               Pattern 2                 Pattern 3
cluster1 (2868)    X acquires Y            X has acquired Y          X’s Y acquisition
cluster2 (2711)    Y legend X was          brazilian Y legend X      Y legend X was held
cluster3 (2615)    Y champion X            world Y champion X        X teaches Y
cluster4 (2008)    X to buy Y              X and Y confirmed         X buy Y is
cluster5 (2002)    Y founder X             Y founder and ceo X       X, founder of Y
cluster6 (1364)    X revolutionized Y      X professor of Y          In Y since X
cluster7 (845)     X and modern Y          genius: X and modern Y    Y in DDDD, X was
cluster8 (280)     X headquarters in Y.    X offices in Y            past X offices in Y
cluster9 (144)     X’s childhood in Y      X’s birth in Y            Y born X
cluster10 (49)     X headquarters in Y     X’s Y headquarters        Y-based X

The overall score of the CORR system in categorizing word pairs according to their similarity is 73.18%. For comparison, Bollegala et al. also performed the same task with a cluster inner-product baseline (IP), VSM (Turney et al. 2003), and LRA (Turney 2006), and CORR statistically outperformed those approaches.

So far, I have gone through two important algorithms for measuring relational similarity. The performance of both approaches is impressive, and both can be readily applied to solve different tasks. However, it is also important to note that relying on frequency calculation alone can be problematic if we look deeper into the results. Frequency-based computational models are powerful in the sense that they enable researchers to gather and process enormous amounts of data that keep growing and changing as more advanced data-extraction techniques are devised. However, without a theoretical background for the material gathered, researchers will not be able to locate or fix some of the errors in the sheer number of sentences they collect.
As I mentioned in Chapter 1, most algorithms that use lexical patterns to compute similarity between relations do not first discuss in what sense these patterns can represent the corresponding lexical relations. As a result, indiscriminately collecting every lexical pattern generated in these models often produces anomalies in the results of similarity measures. In Bollegala et al.’s model, for instance, it is not difficult to identify anomalies in the collected data that undermine the system. More specifically, although Bollegala et al. noticed that more than one relation can exist between any two words, and therefore clustered lexical patterns before calculating word-pair similarity, it is still easy to spot lexical patterns that were mistakenly placed in a cluster that is supposed to represent a single lexical relation. For instance, in one cluster that is claimed to represent the ACQUIRER-ACQUIREE relation, besides lexical patterns such as “X to buy Y” and “X buy Y is”, there are also anomalous patterns with high frequencies such as “X and Y confirmed” and “Y purchase to boost X”. It is noteworthy that the second anomalous pattern actually evokes the stimulating effect that a purchase has on a company rather than an acquisition situation. Clearly, a theory of lexical patterns is needed to improve future studies on relational similarity measures.

2.3 Rethinking the Lexical Patterns

In the last section I explained how to build a VSM, including its essential components and the supporting theories. I also discussed how we can represent our linguistic knowledge through the two implementations. We can now say that the VSM is a powerful and promising tool for researchers who simulate human language ability in NLP. However, despite the impressive performance demonstrated in those works, it remains unanswered which aspect of a lexical relation we are looking at when we use lexical patterns as a medium for grasping what relations may be like in our mind. The assumption that lexical patterns can be treated as the embodiment of lexical relations is rarely examined in works that are preoccupied with solving specific tasks. It is therefore time to look deeper into the lexical patterns.

2.3.1 Capturing Conceptual Nature in Lexical Patterns

In Section 2.1.1, I pointed out the conceptual nature of lexical relations based on the metalexical treatment. It is evident that our knowledge about lexical relations is conceptual, which means that it combines not just lexical entries or meanings, but concepts about words that are built upon our understanding of what we have experienced, conceived, and learned. Three aspects of this conceptual knowledge were discussed: (a) lexical relations are productive, (b) they are context-dependent, and (c) they display the prototypicality effect. Given that NLP assumes lexical patterns to be the linguistic representations of lexical relations, it follows naturally that this conceptual nature can also be observed in and extracted from the lexical patterns. This also suggests that the three aspects should be able to account for the behavior of lexical patterns.

To begin with, lexical patterns can be productive. More specifically, the words connected by patterns are not arbitrary or idiosyncratic, since we can produce potentially limitless new pairs without having heard or read them before. For example, the expression “your word is the opposite of nutrient” makes sense to people who hear it for the first time. The pattern “is the opposite of” can be used in such a way that the hearer fully grasps the intended meaning by applying his/her concept of nutrients, that is, any substance that nourishes an organism. Murphy (2003:49) also argues that the metalexical account of lexical relations can explain our ubiquitous metaphorical use of words.

Secondly, lexical patterns display the prototypicality effect. People can naturally make metalinguistic judgments about which substrings serve as better indications of word relations. This is also supported by the findings of Jones (2002) that there exist some typical lexico-syntactic frames in which antonyms co-occur, such as “instead of”.

Lexical patterns are also context-based, which means that the meaning of the words connected by these patterns can vary according to the context or background. Cultural differences, geographical differences, and economic gaps can all affect how speakers and listeners perceive the meaning of words connected by the patterns. For instance, the same expression “this place is the opposite of Africa” can be interpreted differently depending on whether the speaker is at an international economic conference or on a freezing street. In the former case the divide in economic development is probably the main subject of discussion, while in the latter it is more likely that the difference in climate is the topic of conversation. This suggests that Africa in the example is actually treated as one concept that covers all the information we have about the continent. The lexical pattern therefore serves as a bridge that connects the speaker’s and the hearer’s worlds and their knowledge about Africa.

It is now evident that lexical patterns, just like lexical relations, connect concepts of what we know about the words. However, to the best of my knowledge, there is currently no theory that accounts for this conceptual nature of lexical patterns. Considering the growing importance of lexical patterns in various applications that simulate human behavior, such a theory is clearly needed. When we characterize lexical patterns as part of our conceptual knowledge, we mean that these patterns can be understood and produced through reference to a structured background of experience, culture, and worldview. Just as I have argued in Section