Applications of Concept Representation

Chapter 3. Concept Representation

3.4 Applications of Concept Representation

Using extracted knowledge to represent concepts is common in the literature (Chklovski, 2003; Etzioni, Banko, Soderland, & Weld, 2008; Singh et al., 2002; Yu & Chen, 2010).

Therefore, in our study, instead of extracting knowledge to build concepts, we focus on other related perspectives about concept representation and knowledge extraction.

To demonstrate the application of our concept representation scheme, we adopt two problems. First, we study commonsense knowledge classification and interpret the feature engineering process within the concept representation scheme. This shows that our concept representation is more general than the feature engineering process and is more feasible from the point of view of natural language processing. We describe these issues in chapter 4. Second, we demonstrate how eliciting information from different perspectives can be used to learn knowledge for word sense disambiguation (WSD). This process can be interpreted in the same scheme, but it gives a novel viewpoint on the WSD problem, which is a well-known problem in natural language processing. We describe these issues in chapter 5. These two demonstrations illustrate the application of our concept representation scheme.

7 A similar algorithm appears in our paper (Yu & Chen, 2010).

To study perspectives that are important and usually ignored by researchers, we examine two assumptions about the content and size of knowledge. We describe these issues in chapter 6, where we also describe some important preprocessing steps for extracting knowledge from the web.

Chapter 4. Commonsense Knowledge Classification

In this chapter, we apply our concept representation scheme to commonsense knowledge (CSK) classification, the task of deciding whether a specific relation holds between two noun phrases.

When representing the static part of a concept in this chapter, we represent commonsense concepts (phrases)⁸ in predefined slots, and a learning algorithm is adopted to learn classifiers for commonsense knowledge. We formulate CSK classification as a binary classification problem. The classifiers detect whether a relation is valid between a pair of noun phrases. For example, the relation CausesDesire holds between the phrases “the need for money” and “apply for a job”.

We organize materials of this chapter⁹ in the order below.

(1) We first introduce the OMCS database, whose CSK we use in our experiments.

(2) We investigate related work about CSK mining.

(3) We propose our concept representation scheme for complex concepts, which are noun phrases in this case.

(4) We describe data processing, experiment settings, and conducted experiments.

(5) We report our experimental results.

(6) We compare feature engineering and our concept representation scheme.

8 These words (concept, word, and phrase) are identical in this paper, but word and phrase are different when they are in the context of language.

9 The materials in this chapter are from our paper (Yu & Chen, 2010).

4.1 OMCS Database

We conduct our experiments using a dataset from the OMCS project¹⁰, a publicly available database from MIT. This database contains CSK contributed by volunteers. The web volunteers enter sentences into a web system in a predefined format. In this way, each sentence has two aligned concepts corresponding to the two arguments of a specific predicate (relation).

A concept can be a word or a phrase. There are many predicate types in this database. Table 3 lists some examples.

Predicate         Concept 1                         Concept 2
CausesDesire      the need for money                apply for a job
HasProperty       Stones                            hard
Causes            making friends                    a good feeling
HasPrerequisite   having a party                    inviting people
CapableOf         a cat                             catch a mouse
HasSubevent       having fun                        laughing
UsedFor           a clothing store changing room    trying on clothes
IsA               a swiss army knife                a practical tool
AtLocation        a refrigerator freezer            the kitchen

Table 3. Examples from OMCS database

In addition, each sentence has a confidence score determined collaboratively by users. When a user asserts that a sentence is valid CSK, the confidence score of this sentence is increased by one. Conversely, if a user negates the sentence as valid CSK, its score is decreased by one. In this way, the confidence score of a CSK assertion can be considered an indicator of its quality.

4.2 Related Work

Asserting relations between two concepts is a common task, and this task is useful for many purposes. The most important purpose is to enlarge the knowledge base in order to alleviate the knowledge acquisition bottleneck in many AI fields. Many different types of knowledge can be extracted from texts. Approaches that acquire commonsense knowledge from different sources (Chklovski, 2003; Schubert & Tong, 2003; Singh et al., 2002) have been proposed.

10 http://conceptnet.media.mit.edu/dist/ (Last access: 2013/06/18)

Chklovski and Gil (2005) roughly classified CSK acquisition approaches into three categories according to their knowledge sources. Approaches in the first category collect CSK from experts. WordNet (Fellbaum, 1998) and CYC (Lenat & Guha, 1989) are typical examples, in which a large amount of knowledge is collected by linguists or by knowledge engineers. These approaches result in well-organized, high-quality knowledge bases, but their high cost limits scalability. Approaches in the second category collect CSK from untrained volunteers. Open Mind Common Sense (OMCS) (Singh et al., 2002) and LEARNER (Chklovski, 2003) are of this type. These approaches employ the large number of volunteers on the Web to contribute CSK and to correct the CSK that has been entered. The resulting CSK assertions can number in the millions, but their quality is not as high as that of WordNet. Approaches in the last category collect CSK from texts or the Web using algorithms. TextRunner (Etzioni et al., 2008), KNEXT (Schubert, 2009), and other systems (Cankaya & Moldovan, 2009; Clark & Harrison, 2009; Girju, Badulescu, & Moldovan, 2006) are of this type. These approaches usually process texts or web pages first, and then use lexical or syntactic patterns to extract facts or general knowledge. Because the amount of mined CSK is large, it is not feasible to examine all CSK assertions manually. The performance of a knowledge acquisition algorithm is evaluated either directly, by having assessors judge a small sample, or indirectly, by applying the extracted knowledge to tasks such as word sense disambiguation and examining the performance of those applications. These approaches offer a controllable knowledge domain and a scalable knowledge base, but the quality of the resulting CSK is hard to control.

4.3 Concept Representation Scheme for Phrase

For CSK classification, we propose a representation scheme to denote an assertion. In the OMCS database, an assertion is already preprocessed into a tuple, i.e., (PredicateType, Concept₁, Concept₂). Because a concept is usually a phrase, we represent a concept by using slots. The number of slots depends on the approach, as shown below. A slot in turn contains words, and each word is represented by a co-occurrence vector.

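We use a co-occurrence vector $V(w_i) = (f(w_i, d_1), f(w_i, d_2), \ldots, f(w_i, d_{|D|}))$ to represent a word $w_i$ in a slot, where $D$ is a dictionary with size $|D|$, $d_j$ is the j-th entry in $D$, and $f(w_i, d_j)$ is the co-occurrence frequency of $w_i$ and entry $d_j$ in a corpus.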
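The construction of such a vector can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the experiments: the function name `cooccurrence_vector`, the toy dictionary, and the toy n-gram counts are all hypothetical, and counting two words as co-occurring whenever they appear in the same n-gram entry is an assumption of the sketch.

```python
from collections import Counter


def cooccurrence_vector(word, dictionary, ngrams):
    """Return one co-occurrence frequency per dictionary entry.

    `ngrams` is an iterable of (tokens, count) pairs, e.g. 5-gram entries
    with their corpus frequencies. Assumption: two words co-occur when
    they appear in the same n-gram entry.
    """
    freq = Counter()
    for tokens, count in ngrams:
        if word in tokens:
            for other in set(tokens):
                if other != word:
                    freq[other] += count
    # One component per dictionary entry, in dictionary order.
    return [freq[entry] for entry in dictionary]


# Toy data for illustration only (not taken from a real corpus).
D = ["money", "river", "save", "walk"]
NGRAMS = [
    (("save", "money", "in", "the", "bank"), 30),
    (("walk", "along", "the", "river", "bank"), 12),
    (("the", "bank", "lends", "money", "daily"), 8),
]

vec = cooccurrence_vector("bank", D, NGRAMS)
print(vec)  # [38, 12, 30, 12]
```

The vector has exactly |D| components regardless of the word, so vectors of different words can be compared or combined directly.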

We propose three approaches to determine the number of slots and how to place words in a concept into slots. The three approaches are discussed as follows:

(1) Bag-of-Words (BoW) Approach: All words are placed in one slot. BoW is considered as a baseline.

(2) N-V-OW Approach: All words are categorized into three slots, named HeadNoun, FirstVerb, and OtherWords. HeadNoun and FirstVerb are the head nominal of a phrase and the first verb of a phrase, respectively. Those words that are not in the two slots are put into OtherWords slot.

(3) N-V-ON-OW Approach: All words are categorized into four slots, named HeadNoun, FirstVerb, OtherNouns, and OtherWords. HeadNoun and FirstVerb are interpreted the same as those in the second approach. We further divide the remaining words by their parts of speech (i.e., noun vs. non-noun) into OtherNouns and OtherWords.
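The three slot schemes can be sketched as follows. This is an illustrative sketch under several assumptions: the input is already POS-tagged with Penn Treebank style tags, the head noun is taken to be the last noun and the first verb the first verb-tagged token (the heuristic the text gives for step 2 of Algorithm 2), and the function name `assign_slots` and the BoW slot name `Words` are hypothetical.

```python
def assign_slots(tagged, scheme="N-V-OW"):
    """Place the words of one concept into slots.

    `tagged` is a list of (word, tag) pairs; tags starting with "NN" are
    treated as nouns and tags starting with "VB" as verbs (assumption).
    """
    if scheme == "BoW":
        return {"Words": [word for word, _ in tagged]}

    noun_idx = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("NN")]
    verb_idx = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("VB")]
    head = noun_idx[-1] if noun_idx else None        # last noun in the phrase
    first_verb = verb_idx[0] if verb_idx else None   # first verb in the phrase

    slots = {"HeadNoun": [], "FirstVerb": [], "OtherWords": []}
    if scheme == "N-V-ON-OW":
        slots["OtherNouns"] = []

    for i, (word, tag) in enumerate(tagged):
        if i == head:
            slots["HeadNoun"].append(word)
        elif i == first_verb:
            slots["FirstVerb"].append(word)
        elif scheme == "N-V-ON-OW" and tag.startswith("NN"):
            slots["OtherNouns"].append(word)
        else:
            slots["OtherWords"].append(word)
    return slots


# Hand-tagged examples for illustration.
need = [("the", "DT"), ("need", "NN"), ("for", "IN"), ("money", "NN")]
job = [("apply", "VB"), ("for", "IN"), ("a", "DT"), ("job", "NN")]

print(assign_slots(need, "N-V-OW"))
print(assign_slots(need, "N-V-ON-OW"))
print(assign_slots(job, "N-V-OW"))
```

Note that the last-noun heuristic places "money" in HeadNoun for "the need for money"; under N-V-ON-OW the word "need" then lands in OtherNouns rather than OtherWords, which is the only difference between the two schemes.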

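With the approaches defined above, we denote the vector of slot $k$ in concept $j$ by $S_{j,k}$; it is built from the co-occurrence vectors of the words placed in that slot. Under the N-V-OW approach, the vectors of Concept₁ and Concept₂ are $C_1 = (S_{1,\mathrm{HeadNoun}}, S_{1,\mathrm{FirstVerb}}, S_{1,\mathrm{OtherWords}})$ and $C_2 = (S_{2,\mathrm{HeadNoun}}, S_{2,\mathrm{FirstVerb}}, S_{2,\mathrm{OtherWords}})$. The concept vectors for the BoW and N-V-ON-OW approaches are defined in a similar way.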

An assertion is a tuple as described, but we ignore the PredicateType information here because it is usually the same within a predicate. For example, in the IsA predicate, the keywords are “is” or “are”, which do not help much for binary classification in our setting. Hence, an assertion is represented as a vector $(C_1, C_2)$, in which $C_1$ and $C_2$ are the concept vectors that come from Concept₁ and Concept₂, respectively.

4.4 CSK Classification Algorithm

CSK classification algorithm is described in Algorithm 2.

Algorithm 2. CSK Classification Algorithm

Preprocessing
1 Use POS Tagger to tag an assertion
2 Place words of concepts into slots
3 Derive vector of word from a corpus
4 Represent an assertion

Feature Selection
5 Normalize the two concept vectors to 1, respectively
6 Calculate Pearson correlation coefficient of each feature in slot
7 Select the first 10% features in each slot

Classification
8 Use support vector machine to classify assertions

In step 1, we use the Stanford POS tagger (Toutanova, Klein, Manning, & Singer, 2003) to obtain tags. In step 2, we identify the head noun and the first verb of each concept by heuristic rules. For example, the first verb to appear in a tagged sequence is regarded as the first verb, and the last noun in a phrase is considered the head noun. We distinguish nouns from non-nouns by their parts of speech. In step 3, we use Google Web 1T 5-Gram as our reference corpus (Brants & Franz, 2006), and employ only the 5-gram entries in this corpus.

The dictionary D in step 3 is a combination of WordNet 3.0 and the Webster online dictionary¹¹ (nouns, verbs, adjectives, and adverbs). The resulting lexicon contains 236,775 entries. In step 5, each of the two concept vectors is normalized to 1 so that the two concepts are emphasized equally. In step 7, only the top 10% of the features are selected.
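The feature selection stage (steps 5 to 7) might look roughly like the sketch below. Several choices here are assumptions rather than details given in the text: normalization is taken to mean unit L2 norm, features are ranked by the absolute value of their Pearson correlation with the class label, and the helper names are hypothetical. The function is meant to be applied to the feature columns of one slot at a time; the SVM step is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (assumed meaning of 'normalize to 1')."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_top_features(rows, labels, fraction=0.10):
    """Return sorted indices of the top `fraction` of features of one slot,
    ranked by |Pearson r| between the feature column and the labels."""
    n_features = len(rows[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scored.append((abs(pearson(column, labels)), j))
    keep = max(1, int(fraction * n_features))
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return sorted(j for _, j in ranked[:keep])


# Toy slot with 10 features; only feature 3 varies with the label.
labels = [1, 1, 0, 0]
rows = [[0.2] * 10 for _ in labels]
for row, y in zip(rows, labels):
    row[3] = 0.9 if y else 0.1

selected = select_top_features(rows, labels)
print(selected)  # [3]
```

Selecting within each slot, rather than over the whole assertion vector, keeps every slot represented in the reduced feature set.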

4.5 Experiment Settings

We select positive assertions from OMCS and automatically generate negative assertions to produce a balanced dataset for each predicate type. The positive assertions must meet the following four criteria: (1) the confidence score of an assertion must be at least 4; (2) the polarity (note that there are positive and negative polarities in OMCS¹²) must be positive; (3) a concept contains no relative clause, conjunct, or disjunct; and (4) the length of a concept is less than 8 words for simplicity. The negative assertions are generated by randomly selecting and merging concepts from the OMCS database. We ignore datasets of size smaller than 200.
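The dataset construction can be illustrated with the following sketch. The record layout (`score`, `polarity`, `concepts`), the function names, and the toy records are hypothetical. Criterion (3) requires syntactic analysis and is omitted here, and skipping random pairs that coincide with a known positive assertion is an assumption of the sketch, not something the text specifies.

```python
import random


def is_positive(assertion, min_score=4, max_len=8):
    """Check criteria (1), (2), and (4); criterion (3) is not implemented."""
    return (
        assertion["score"] >= min_score
        and assertion["polarity"] > 0
        and all(len(c.split()) < max_len for c in assertion["concepts"])
    )


def make_negatives(positives, concept_pool, n, seed=0, max_tries=10000):
    """Randomly pair concepts from the pool to form n negative assertions."""
    rng = random.Random(seed)
    known = {tuple(p["concepts"]) for p in positives}
    negatives = set()
    for _ in range(max_tries):
        if len(negatives) == n:
            break
        pair = (rng.choice(concept_pool), rng.choice(concept_pool))
        # Assumption: skip self-pairs and pairs already known to be positive.
        if pair[0] != pair[1] and pair not in known:
            negatives.add(pair)
    return sorted(negatives)


# Toy records; the scores and polarities are made up for illustration.
ASSERTIONS = [
    {"concepts": ("the need for money", "apply for a job"), "score": 5, "polarity": 1},
    {"concepts": ("a cat", "catch a mouse"), "score": 4, "polarity": 1},
    {"concepts": ("stones", "hard"), "score": 2, "polarity": 1},
    {"concepts": ("a cat", "a practical tool"), "score": 6, "polarity": -1},
]

positives = [a for a in ASSERTIONS if is_positive(a)]
pool = sorted({c for a in ASSERTIONS for c in a["concepts"]})
negatives = make_negatives(positives, pool, n=len(positives))
print(len(positives), len(negatives))
```

Generating as many negatives as positives yields the balanced dataset described above.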

Table 4 lists the resulting nine datasets among 18 predicate types in OMCS.

Predicate          Size
CausesDesire       204
HasProperty        254
Causes             510
HasPrerequisite    912
CapableOf          916
HasSubevent        1026
UsedFor            1442
IsA                1818
AtLocation         2580

Table 4. Datasets for CSK classification.

For each dataset, we randomly split 90% for training and 10% for testing. Next, we use LibSVM (Chang & Lin, 2011) for classification. In SVM training, we adopt the radial basis function as the kernel and run a grid search over the parameters (c, g) (80 pairs). After the best parameters are obtained, we train a model on the training set using these parameters and apply the trained model to the test set. The same procedure is repeated ten times to obtain statistically significant results.

11 http://www.mso.anu.edu.au/~ralph/OPTED/index.html (Last access: 2013/08/15)

12 Only 1.8% of assertions with CS ≥ 4 have negative polarity.

Note the performance variation of a classifier in classifying commonsense knowledge. We can view a training set as a knowledge base that one person owns, and this person may be good at some aspects but bad at others. Such a training set may overfit on some aspects and misclassify other valid CSK. Because we aim to obtain a general-purpose CSK classifier with broad coverage, the performance variation is an important indicator for evaluating a CSK classification algorithm.
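One way to measure both accuracy and its variation is sketched below: repeat a random 90%/10% split ten times and report the mean and standard deviation of test accuracy. This is an assumed reading of the protocol (in particular, that each repetition draws a new random split). The function names are hypothetical, and `majority_baseline` is only a stand-in for the SVM training and grid search, which are not reproduced here.

```python
import random
import statistics


def repeated_holdout(data, train_and_score, repeats=10, test_fraction=0.10, seed=0):
    """Run `repeats` random train/test splits; return (mean, stdev) of accuracy."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        accuracies.append(train_and_score(train, test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def majority_baseline(train, test):
    """Stand-in classifier: always predict the most frequent training label."""
    train_labels = [label for _, label in train]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(1 for _, label in test if label == guess) / len(test)


# Toy balanced dataset of (features, label) pairs.
data = [([i], i % 2) for i in range(40)]
mean_acc, std_acc = repeated_holdout(data, majority_baseline)
print(f"accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")
```

A small standard deviation across the repetitions indicates that the classifier's quality does not depend strongly on which assertions happen to fall into the training set.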

4.6 Experiment Results

The test performance of classifiers is shown in Figure 6. The standard deviation and database size are also shown in this figure.

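[Figure 6. Accuracy (%) of the BoW, N-V-OW, and N-V-ON-OW classifiers for each predicate type (CausesDesire, HasProperty, Causes, HasPrerequisite, CapableOf, HasSubevent, UsedFor, IsA, AtLocation), with standard deviation and dataset size.]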

In Figure 6, the N-V-OW and N-V-ON-OW approaches are better than the BoW approach except in the CausesDesire predicate type. The N-V-ON-OW approach tends to have smaller performance variation than the BoW and N-V-OW approaches. The N-V-OW approach has the best accuracy (82.6%) and the best variation in the HasProperty predicate. The best result for IsA is 74.4%, which is comparable to similar problems in SemEval-2007 Task 4, Classification of Semantic Relations between Nominals (Girju et al., 2006).

This task shows that it is possible to design classifiers to detect CSK in texts. In the next section, we adopt the concept representation scheme to interpret the feature engineering process.

4.7 Interpretation of Feature Engineering Process

In Algorithm 2, we describe a general feature engineering process from the machine learning viewpoint. If we are concerned with the concepts a machine has learned and want to integrate many tasks under a single viewpoint in order to understand the content of the machine's concepts, the feature engineering viewpoint is ill-suited to this purpose, because different tasks may each have their own feature engineering procedure. Our concept representation scheme can be used for this purpose.

When integrating different tasks using the concept representation scheme, all learned concepts are stored in a uniform continuation, and the explicitization process simply connects datasets, learned models, the continuation, and its environment. From this viewpoint, we can treat different tasks and different learning algorithms in a uniform way. If we want to analyze the whole system from a formal mathematical perspective, this interpretation gives us an advantage because all continuations (concepts) and learning algorithms can be treated in the same way.

When we want to build a human-like machine, this interpretation is very important.

Chapter 5. Word Sense Disambiguation

In this chapter, we use word sense disambiguation (WSD) to demonstrate the application of our concept representation framework. We organize materials of this chapter¹³ in the order below.

(1) We first introduce WSD and mention the relation between concept and context, which resembles a continuation in a computing environment.

(2) We investigate related work about WSD and mention context appropriateness and concept fitness.

(3) We explore context appropriateness and concept fitness formally.

(4) We describe problem formulations in WSD using the proposed concepts.

(5) We report data processing, experiment settings, and conducted experiments.

5.1 Introduction

Word Sense Disambiguation (WSD) is an important task that has long attracted the attention of many researchers. Because humans routinely reuse the same word to denote different meanings, it is natural to ask a computer system to automatically recover the exact meaning in a given context. For example, the word bank is used to denote the concept depository financial institution in the context “he saved his money in the biggest bank”, and to denote the concept sloping land of water in the context “he takes a walk on the river bank”. In these cases, a word sense denotes a specific meaning of a word, and the mission of a WSD system is to discover the denoted meaning of a word in a given context. If a WSD system can precisely recover the exact meaning of a word in a given context, this will benefit many NLP applications, such as Machine Translation and Information Extraction. For instance, because the biggest bank in the example denotes a company, we may co-refer a company name to this bank in an Information Extraction task.

13 We will use materials in this chapter in a paper.

WSD has been discussed extensively in the literature. Agirre and Edmonds (2006) edited a thorough book on WSD-related issues, including the history of WSD, word sense inventory approaches, evaluation methods and datasets, WSD algorithms, resources, and applications. Navigli (2009) gives a newer and shorter survey on WSD. It lists many applications of WSD, including Information Retrieval (IR), Information Extraction (IE), Machine Translation (MT), Content Analysis, Word Processing, Lexicography, and the Semantic Web.

Many studies (Carpuat & Wu, 2005; Sanderson, 1994; Stokoe, Oakes, & Tait, 2003; Zhong & Ng, 2012) focus on discussing the usefulness of WSD in IR systems. Zhong and Ng (2012) conduct their experiments on standard TREC collections and conclude that a supervised WSD system can significantly improve the performance of a state-of-the-art IR system. Other studies (Carpuat & Wu, 2005, 2007; Chan & Ng, 2007) show that WSD can improve the performance of MT systems. Carpuat and Wu (2007) demonstrate very promising results on a Chinese-English machine translation task. They find that WSD systems can improve phrase-based statistical MT models on many metrics such as BLEU. Chan and Ng (2007) also show that a WSD system significantly improves the performance of MT systems.

In Information Extraction, WSD appears in richer linguistic phenomena. Traditional WSD concerns homonymy, in which the same word form has different meanings in different contexts. IE researchers also consider synonymy (different word forms with the same meaning, or denoting the same entity) and metonymy. Markert and Nissim (2007) organized a metonymy resolution task in SemEval-2007. This task tries to recognize that BMW in the context “my BMW runs fast” refers to a vehicle and does not denote the famous automobile company. In the biomedical domain, WSD plays an important role in automatic biomedical literature analysis (Schuemie, Kors, & Mons, 2005). WSD improves the accuracy of literature understanding and improves the identification of ambiguous entities. Stevenson and Guo (2010) study three types of ambiguity in biomedical documents: ambiguous terms, ambiguous abbreviations, and ambiguous gene names. Their systems reach very high performance, ranging from 87.9% to 99.0%. Dai, Tsai, and Hsu (2011) study the Entity Linking (EL) task, which links different mentions (synonyms) in biomedical literature to database entries to help document analysis.

The standard performance evaluation of WSD algorithms comes from Senseval, which is a series of competitions related to NLP tasks. After Senseval-1 in 1998, Senseval-2 (Edmonds & Cotton, 2001) formulated two WSD tasks: lexical sample WSD and all-words WSD. Lexical sample WSD decides the sense of a single word in a given fragment of text, which usually contains many sentences. All-words WSD decides the senses of multiple words at the same time. These datasets contain many human-annotated examples in many languages and in different settings. Many WSD systems adopt the datasets of Senseval-2 and Senseval-3 (R. Mihalcea, Chklovski, & Kilgarriff, 2004) for evaluation. With standard evaluation datasets, researchers can give more meaningful performance comparisons between different WSD systems.

In the evaluation, researchers usually adopt WordNet (Fellbaum, 1998) as the sense inventory, which defines a closed set of senses for each word. In this situation, a word sense refers to a sense key in WordNet, and the WSD problem is considered a classification problem because we want to decide the exact sense in the closed set. Kilgarriff (2006) gives a good survey on word senses. Some researchers do not use WordNet as the sense inventory. They either use other ontologies or create their own sense clusters as the sense inventory (Pantel & Lin, 2002). In these cases, comparisons of system performance are not easy, but the systems can have a domain-specific sense inventory for their study.

Some researchers (Erk, McCarthy, & Gaylord, 2009; Erk & McCarthy, 2009; McCarthy, Koeling, Weeds, & Carroll, 2004) study different perspectives of WSD. Because the most common sense is very useful in WSD systems, McCarthy et al. (2004) use unsupervised methods to find predominant word senses in text. Erk et al. (2009) investigate word usages and word senses, and build datasets with graded senses in given contexts. This setting differs from the traditional WSD task, which usually selects the best-fit sense in a given context. Erk and McCarthy (2009) propose many metrics to evaluate sense grading systems and implement systems for this task.

In this study, we explore a more general problem which concerns the relation between concepts and contexts. We consider two aspects of this relation: context appropriateness and concept fitness. Context appropriateness is a function that models the appropriateness of contexts for a concept. For example, if we consider the concept depository financial institution for the word bank, the context “he saved his money in the biggest bank” is appropriate but the context “he takes a walk on the river bank” is inappropriate. On the other hand, concept fitness is a function that models the fitness of concepts in a context. For example, if we consider the context “he saved his money in the biggest bank”, the concept depository financial institution is
