
Chapter 5. Word Sense Disambiguation

5.5 Feature Extraction and Experiment Settings

For each sample in SensEval's lexical sample tasks, we extract nine types of features. The first four types are commonly used by WSD researchers (Lee & Ng, 2002); the remaining five types are new. We describe them below.

1. Part-of-Speech of neighboring words (POS):

We encode 7 POS features around the disambiguating word. A feature Pi (i = -3, -2, ..., 3) is the POS tag of the word at position i, and P0 refers to the disambiguating word itself. If there is no word at a position, we use NIL as the tag. Each feature is treated as a bag-of-word feature with 0/1 encoding: a value is assigned weight 1 if it appears and 0 otherwise (see the first sketch after this list).

2. Words in context (Context):

We lemmatize the word unigrams of the surrounding context to WordNet 3.0 lemmas and exclude stop words. The surrounding context includes all words in the neighboring sentences. This is a single feature and we use bag-of-word 0/1 encoding.

3. Collocation (Colloc):

We implement the 11 collocations described in Lee and Ng (2002). There are 11 features and we use bag-of-word 0/1 encoding for each feature.

4. Syntactic relations of target disambiguating word (SyntaxRel):

We implement the syntactic relations described in Lee and Ng (2002). There are different types of features for nouns, verbs, and adjectives. We use bag-of-word 0/1 encoding for each feature.

5. Words in dependency relation (drWord):

We parse the sentence containing the target disambiguating word to obtain its dependency relations (de Marneffe, MacCartney, & Manning, 2006; Toutanova et al., 2003). The parse yields a list of dependency-relation tuples (grammatical relation, governor, dependent); let R be the sub-list of tuples in which the target disambiguating word participates. We add every word appearing in R (as governor or dependent) to this feature, except the target disambiguating word itself. This is a single feature and we use bag-of-word 0/1 encoding.

6. Grammatical relation in dependency relation (drRelt):

We add every grammatical relation appearing in the sub-list R in which the target disambiguating word participates. This is a single feature and we use bag-of-word 0/1 encoding.

7. Words in dependency relation with role information (drRole):

We add every word in the sub-list R together with its role information, except the disambiguating word itself. For example, for a word w, we add the string w_gov to indicate that w is a governor in the dependency relation; similarly, we add the string w_dep to indicate that its role is dependent. This is a single feature and we use bag-of-word 0/1 encoding (the three dependency-based features are sketched together after this list).

8. Extension of words in dependency relation using WordNet definition (drDefi):

For each word w in feature drWord, we add all word unigrams of w's definition. If w has multiple senses, we add the definitions of all its senses in WordNet 3.0. Stop words are excluded. This is a single feature and we use bag-of-word 0/1 encoding.

9. Extension of words in dependency relation using WordNet synset (drCnpt):

For each word w in feature drWord, we add all of w's synset ids. If w has multiple senses, we add the synset ids of all its senses in WordNet 3.0. This is a single feature and we use bag-of-word 0/1 encoding (see the last sketch after this list).
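To make the first two feature types concrete, the sketch below illustrates the POS-window and lemmatized-context features (items 1 and 2). The use of NLTK, the function names, and the pre-tokenized, pre-tagged input format are our own assumptions for illustration, not the thesis implementation.

```python
# Sketch of the POS-window (item 1) and lemmatized-context (item 2) features.
# Assumes the sample is already tokenized and POS-tagged; NIL pads missing positions.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def pos_window_features(tags, target_index, window=3):
    """Return the P_-3 ... P_3 features as {feature_name: 1} (0/1 encoding)."""
    features = {}
    for offset in range(-window, window + 1):
        pos = target_index + offset
        tag = tags[pos] if 0 <= pos < len(tags) else "NIL"
        features["P%+d=%s" % (offset, tag)] = 1
    return features

def context_features(tokens):
    """Lemmatize surrounding-context unigrams and drop stop words (single 0/1 feature)."""
    features = {}
    for token in tokens:
        token = token.lower()
        if token.isalpha() and token not in STOP_WORDS:
            features["ctx=" + LEMMATIZER.lemmatize(token)] = 1
    return features
```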
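The three dependency-based features (items 5 to 7) can all be read off the same tuple list R. The sketch below assumes the parser output is already available as (relation, governor, dependent) string triples; the function name and input format are hypothetical.

```python
def dependency_features(triples, target):
    """Build drWord, drRelt and drRole from (relation, governor, dependent) triples.

    R is the sub-list of triples in which the target word participates; every
    feature uses bag-of-word 0/1 encoding, so we return sets of active features.
    """
    R = [t for t in triples if target in (t[1], t[2])]
    dr_word = {w for (_, gov, dep) in R for w in (gov, dep) if w != target}
    dr_relt = {rel for (rel, _, _) in R}
    dr_role = set()
    for _, gov, dep in R:
        if gov != target:
            dr_role.add(gov + "_gov")   # gov acts as governor in this relation
        if dep != target:
            dr_role.add(dep + "_dep")   # dep acts as dependent in this relation
    return dr_word, dr_relt, dr_role
```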
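For the two WordNet extensions (items 8 and 9), a minimal sketch using NLTK's WordNet 3.0 interface follows; using synset.name() as the synset id is our own choice of identifier, and the function name is hypothetical.

```python
from nltk.corpus import stopwords, wordnet as wn

STOP_WORDS = set(stopwords.words("english"))

def wordnet_extension_features(dr_word):
    """Expand each drWord entry with its WordNet definitions (drDefi) and synset ids (drCnpt)."""
    dr_defi, dr_cnpt = set(), set()
    for word in dr_word:
        for synset in wn.synsets(word):          # all senses of the word
            dr_cnpt.add(synset.name())           # e.g. 'turn.v.01' as the synset id
            for unigram in synset.definition().lower().split():
                if unigram.isalpha() and unigram not in STOP_WORDS:
                    dr_defi.add(unigram)
    return dr_defi, dr_cnpt
```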

In meaning composition, a concept's representation is built by summing the representations of all its samples. Context, POS, and Colloc are representations of context appropriateness. The other types of features are representations of concept fitness, because these features are closely related to the concept's meaning and are usually adopted in meaning composition for knowledge extraction. We combine the addition and multiplication operations for meaning composition (Mitchell & Lapata, 2008). For example, for feature vectors v and u, the meaning composition function is f(v, u) = v + u + v ⊙ u, where ⊙ is point-wise multiplication.
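As a minimal numeric illustration of the combined additive and multiplicative composition in the reconstructed form above (f(v, u) = v + u + v ⊙ u), consider:

```python
import numpy as np

def compose(v, u):
    """Combined additive/multiplicative composition: f(v, u) = v + u + v * u,
    where * is point-wise multiplication."""
    v, u = np.asarray(v, dtype=float), np.asarray(u, dtype=float)
    return v + u + v * u

# Composing two small feature vectors
print(compose([1.0, 0.0, 1.0], [0.0, 1.0, 1.0]))   # -> [1. 1. 3.]
```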

We conduct experiments on the lexical sample tasks of SensEval-2 and SensEval-3. There are 73 words and 57 words in the SensEval-2 and SensEval-3 lexical sample tasks, respectively. These words fall into three categories: noun, verb, and adjective. There are in total 8611 training samples and 4328 testing samples in SensEval-2, and 7860 and 3944 in SensEval-3. Some words have many senses but few samples. For example, there are 43 senses for the verb turn, along with 131 training and 67 testing samples.

We use LibSVM (Chang & Lin, 2011) classifiers for multi-class and binary classification, Liblinear (Fan, Chang, Hsieh, Wang, & Lin, 2008) for the regression function of MCwoMC-Reg, and Ranking SVM (Joachims, 2002) for our ranking algorithms.

We adopt the RBF kernel and perform a grid search over (c, g) when using LibSVM. We try 9 values of the parameter C when using Liblinear. We use 3-fold cross validation for model selection in both LibSVM and Liblinear.
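The sketch below shows the shape of such a model-selection step, using scikit-learn's LibSVM-based SVC as a stand-in; the (c, g) search ranges and the toy data are illustrative assumptions, since the actual grids are not listed above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative (c, g) grid search with an RBF kernel and 3-fold cross validation.
# The powers of two below are only a common default choice, not the thesis settings.
param_grid = {"C": [2.0 ** k for k in range(-5, 16, 2)],
              "gamma": [2.0 ** k for k in range(-15, 4, 2)]}

X = np.random.randint(0, 2, size=(60, 20))   # stand-in 0/1 feature matrix
y = np.random.randint(0, 3, size=60)         # stand-in sense labels

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```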


We find that a higher cost c for Ranking SVM usually results in better performance but takes more training time, so we fix the cost to 10. We also set the parameter -# 20000 to truncate overly long training runs.

5.6 Experiment Results

First, we want to know the contribution of the different features, and we show the fine-grained summary results for SensEval-2 and SensEval-3 in the table below.

Features                                              SensEval-2       SensEval-3
                                                      Train    Test    Train    Test
baseline+drDefi+drCnpt                                58.78    57.82   65.67    64.89
baseline+drWord+drRelt+drRole                         56.61    58.01   64.21    64.12
All features                                          59.69    58.48   63.72    64.00
(Lee & Ng, 2002): micro-averaged recall on all words  n/a      65.4    n/a      n/a
(Ando, 2006)                                          n/a      65.3    n/a      74.1

Table 5. WSD results of MCwoMC using different features.

In Table 5, we can see that our performance is far behind the state of the art. We notice that our baseline is 57.54, which is still lower than that of Lee and Ng (2002)'s system, even though we use the same types of features. One possible reason is that Lee and Ng (2002) train each binary classifier with its own parameters, while we use the same parameters for all binary classifiers of a word.

Next, we want to know the performance of different problem formulations. We show results in the table below.

Problem Formulation   SensEval-2       SensEval-3
                      Train    Test    Train    Test
MCwoMC                59.69    58.48   66.48    64.94
BCwMC-Reg             44.01    35.57   39.57    32.45
BCwMC-SVM             85.98    47.06   92.75    59.80
MCwMC-Reg             45.53    36.81   45.50    47.29
MCwMC-SVM             53.03    48.07   56.99    52.10
R2wMC-Ses             97.78    59.06   95.79    64.97
R2wMC-Wrd             97.90    58.31   96.06    64.64
R3wMC-Ses             97.87    59.06   95.56    64.10
R3wMC-Wrd             97.99    58.22   97.45    64.59

Table 6. WSD results in different problem formulations.

In Table 6, R2wMC-Ses has the best performance, but it is still lower than the state of the art. We can see that our models achieve better performance under the same experimental process; in particular, the training results are very high. Because the models produced by ranking are linear, we think this property would be very useful if we integrated unsupervised algorithms with our methods.

There are many possible ways to improve our performance. One possible direction is using different approaches to derive concept representations. In our experiments, we sum the representations of all samples of a sense to construct its representation; it may be better to construct the representation with an unsupervised method. Another possible direction is using dimensionality reduction to narrow the gap between training and testing performance. Approaches like principal component analysis (PCA) and feature selection may work in this case. We leave these issues for future work.
