Chapter 5. Word Sense Disambiguation
Equation 5 can be used in the same way for contexts.
If we further apply Equation 2 and Equation 3 to Equation 4 and Equation 5, we obtain Equation 6 below, which involves a context that is inappropriate for the concept.
Equation 6
Equation 6 simply says that a learned function must return a larger score for a composition with one error than for a composition with two errors. We can adopt these equations in the WSD problem with a simple score assignment. For example, we can assign the value 2 to the left-hand side of Equation 4, the value 1 to the right-hand sides of Equation 4 and Equation 5, and the value 0 to the right-hand side of Equation 6. In this way, we obtain more training data and more ways to utilize the information we have. The next section illustrates approaches to constructing training datasets from these equations.
5.4 Problem Formulations in WSD
In this section, we illustrate different approaches to utilizing context appropriateness and concept fitness together with the meaning composition feature processing approach. We introduce a baseline approach without meaning composition, followed by multi-class classification, binary classification, and ranking approaches with meaning composition.
5.4.1 Multi-class Classification (Baseline)
In WSD, it is straightforward to formulate sense disambiguation of a word as a multi-class classification problem. Suppose we have a set of n word senses $\{s_1, \dots, s_n\}$ for word w, and a set of m contexts $\{c_1, \dots, c_m\}$ in which word w occurs, where $1 \le j \le m$ and $c_j$ denotes the j-th context, whose sense is $y_j \in \{s_1, \dots, s_n\}$. Multi-class classifiers learn a function $f(w, c_j)$ that takes a context as input and predicts the correct sense $y_j$. In most WSD settings, w is ignored because word w is the same for all contexts. In this case, classifiers learn the function $f(c_j)$ instead of $f(w, c_j)$. If classifiers learn $f(c_j)$ for a word with dataset $\{(c_j, y_j)\}_{j=1}^{m}$, we call this formulation Multi-Class without Meaning Composition (MCwoMC).
In this formulation, the decision function depends on the classifiers we adopt. Researchers usually adopt the one-vs-the-rest multi-class strategy: they train one binary classifier per sense and combine the binary classifiers to produce the final prediction. For example, Lee et al. (2004) and Lee & Ng (2002) train a binary SVM classifier for each word sense and output the sense with the highest prediction score. We show the sense decision function in Equation 7, where the function $f_i$ is a binary classifier for sense $s_i$ trained with the one-vs-the-rest multi-class strategy, $c$ is the test context, and i = 1...n.
$$\hat{s} = \arg\max_{s_i,\; i = 1 \dots n} f_i(c)$$
Equation 7
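The one-vs-the-rest decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sense labels, the `linear_scorer` helper, and the hand-picked weights are hypothetical stand-ins for trained binary SVM classifiers, not the classifiers used in the study.

```python
from typing import Callable, Dict, Sequence

Context = Sequence[float]
Scorer = Callable[[Context], float]


def linear_scorer(weights: Sequence[float], bias: float) -> Scorer:
    """Build a toy linear scorer standing in for one trained binary classifier."""
    def score(context: Context) -> float:
        return sum(w * x for w, x in zip(weights, context)) + bias
    return score


def predict_sense(scorers: Dict[str, Scorer], context: Context) -> str:
    """One-vs-the-rest decision: return the sense with the highest score."""
    return max(scorers, key=lambda sense: scorers[sense](context))


# Hypothetical sense labels and hand-picked weights, for illustration only.
scorers = {
    "sense_1": linear_scorer([1.0, -0.5], 0.0),
    "sense_2": linear_scorer([-0.5, 1.0], 0.0),
}
predicted = predict_sense(scorers, [0.2, 0.9])
print(predicted)
```

Note that each scorer sees only the context vector; no sense information enters the features, which is what distinguishes this baseline from the formulations with meaning composition.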
5.4.2 Multi-class Classification with Meaning Composition
If meaning composition is used, $f(w, c_j)$ cannot be simplified into $f(c_j)$ because different word senses result in different meanings. With meaning composition, the j-th context can be composed with different word senses, and classifiers must give different predictions for the different compositions.
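For example, although the disambiguating word is the same for all contexts, we have $mc(s_i, c_j) \neq mc(s_k, c_j)$ for $i \neq k$, in which $mc(s_i, c_j)$ is a meaning composition function for concept $s_i$ and its context $c_j$. If classifiers learn the function $f(mc(s_i, c_j))$ for a word with dataset $\{(mc(y_j, c_j), y_j)\}_{j=1}^{m}$, where $y_j$ is the sense of context $c_j$, we call this formulation Multi-Class with Meaning Composition (MCwMC).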
In this formulation, the decision function must be handled carefully because the class information is encoded in the training features through meaning composition. In our study, we find that if we use a decision function like that of MCwoMC, the accuracy with different types of features can reach 92%, which is wrong (see footnote 15). To obtain a correct decision function, we must test every possible meaning composition of a word sense with the context and choose the sense with the highest score as the predicted sense. We show the sense decision function in Equation 8, where f is the decision function returned by the classifier and the sense score decision function g assigns a score to the value $f(mc(s_i, c))$.
$$\hat{s} = \arg\max_{s_i,\; i = 1 \dots n} g\big(f(mc(s_i, c))\big)$$
Equation 8
For example, suppose we use Support Vector Regression (SVR) as the classifier and assign a numeric value $v_i$ to each class $s_i$. The regression function returns a regression value $f(mc(s_i, c))$, and the sense score decision function can adopt $g = -\lvert v_i - f(mc(s_i, c)) \rvert$. This means we measure the distance between the class's value and the regression value and choose the closest one as the sense assignment. We denote this approach MCwMC-Reg.
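The regression-based decision can be sketched as follows. Several pieces are assumptions made only for illustration: `compose` uses plain concatenation as a placeholder for the meaning composition function, class values are taken to be 1, 2, 3, ..., and `toy_regress` (the built-in `sum`) stands in for a trained SVR model.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def compose(sense_vec: Sequence[float], context_vec: Sequence[float]) -> Vector:
    """Hypothetical meaning composition: plain concatenation (an assumption)."""
    return list(sense_vec) + list(context_vec)


def predict_sense_reg(
    regress: Callable[[Vector], float],
    sense_vecs: Sequence[Sequence[float]],
    context_vec: Sequence[float],
) -> int:
    """Try every sense composition; return the 1-based index of the sense
    whose class value is closest to the regression output."""
    best_index, best_distance = 0, float("inf")
    for class_value, sense_vec in enumerate(sense_vecs, start=1):
        regression_value = regress(compose(sense_vec, context_vec))
        distance = abs(class_value - regression_value)
        if distance < best_distance:
            best_index, best_distance = class_value, distance
    return best_index


# Toy regressor standing in for a trained SVR: it just sums the features.
toy_regress = sum
sense_vecs = [[0.5], [1.0], [4.0]]
predicted = predict_sense_reg(toy_regress, sense_vecs, [1.2])
print(predicted)
```

The key point is the loop over all candidate senses: the context is composed with each sense in turn, and the composition whose regression value lies closest to its own class value wins.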
15 In this case, the classifier can clearly distinguish between concepts, but not between concepts in a specific context. The reason is that a concept's features are derived by summing all contexts of a sense.
If we use many binary SVM classifiers as in MCwoMC, the sense score decision function g is the predicted value of the class $s_i$ used in the composition. For example, if we have 3 senses $s_1$, $s_2$, $s_3$, we must test three vectors $mc(s_1, c)$, $mc(s_2, c)$, and $mc(s_3, c)$, which represent three different meaning compositions using different senses. A multi-class classifier may output three probability scores (score of $s_1$, score of $s_2$, score of $s_3$) for a test case. If test case $mc(s_1, c)$ has scores (0.95, 0.01, 0.11), we let $g_1$ = 0.95. If test case $mc(s_2, c)$ has scores (0.05, 0.89, 0.21), we let $g_2$ = 0.89. If test case $mc(s_3, c)$ has scores (0.16, 0.05, 0.76), we let $g_3$ = 0.76. In this example, sense $s_1$ is the predicted sense because it has the highest value (probability). We denote this approach MCwMC-SVM.
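The worked example above can be reproduced directly. The sketch below uses only the three score triples given in the text; the variable names are illustrative.

```python
# Rows: one test vector per candidate sense (composition with sense 1, 2, 3).
# Columns: the classifier's score for class 1, class 2, class 3.
score_rows = [
    (0.95, 0.01, 0.11),
    (0.05, 0.89, 0.21),
    (0.16, 0.05, 0.76),
]

# g for the i-th composition is the score of class i (the diagonal entry).
sense_scores = [row[i] for i, row in enumerate(score_rows)]

# The predicted sense (1-based) is the composition with the highest g.
predicted = max(range(len(sense_scores)), key=lambda i: sense_scores[i]) + 1
print(sense_scores, predicted)
```

Only the diagonal of the score table is used: for the composition built with a given sense, the relevant score is the one the classifier assigns to that same sense.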
5.4.3 Binary Classification with Meaning Composition
We can simplify the multi-class classification problem to a single binary classification problem by using meaning composition. The idea is simple: if a composition is valid, we assign 1 to it, and we assign -1 otherwise. The resulting WSD dataset is $\{(mc(s_i, c_j), l_{ij})\}$ for $i = 1 \dots n$ and $j = 1 \dots m$, where $l_{ij} = 1$ if $s_i$ is the sense of context $c_j$ and $l_{ij} = -1$ otherwise. We call this formulation Binary Classification with Meaning Composition (BCwMC).
This formulation has two advantages. First, it is simple and does not require training many classifiers for multiple classes. Second, it yields more training samples than the original multi-class problem. For example, if there are 100 contexts and 10 senses for a word, we will have 1,000 feature vectors (100 × 10) in the dataset.
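A minimal sketch of this dataset construction is shown here. The `compose` function (plain concatenation), the one-dimensional toy vectors, and the cyclic assignment of gold senses are all assumptions made for illustration; they are not the features or data used in the study.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def compose(sense_vec: Sequence[float], context_vec: Sequence[float]) -> Vector:
    """Hypothetical meaning composition: plain concatenation (an assumption)."""
    return list(sense_vec) + list(context_vec)


def build_bcwmc_dataset(
    contexts: Sequence[Sequence[float]],
    gold_senses: Sequence[str],
    sense_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[Vector, int]]:
    """Pair every context with every sense; label +1 if valid, -1 otherwise."""
    dataset = []
    for context_vec, gold in zip(contexts, gold_senses):
        for sense, sense_vec in sense_vecs.items():
            label = 1 if sense == gold else -1
            dataset.append((compose(sense_vec, context_vec), label))
    return dataset


# Toy data: 10 senses and 100 contexts, with gold senses assigned cyclically.
sense_vecs = {f"s{i}": [float(i)] for i in range(1, 11)}
contexts = [[float(j)] for j in range(100)]
gold_senses = [f"s{j % 10 + 1}" for j in range(100)]

dataset = build_bcwmc_dataset(contexts, gold_senses, sense_vecs)
positives = sum(1 for _, label in dataset if label == 1)
print(len(dataset), positives)
```

With 100 contexts and 10 senses the sketch produces 1,000 labeled vectors, of which 100 are positive, one per context.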
This formulation comes with two disadvantages. First, it may result in an unbalanced dataset if a word has many senses. For example, in the SensEval-2 lexical sample task, the word art.n (art with noun POS) has 19 senses. This means positive samples make up only 5.3% of the whole dataset. It is not easy to train a good classifier in this case. Second, the resulting feature matrix may be dense, which is a drawback if we want to handle large problems. In our study, the proportion of non-zero entries in the feature matrix using meaning composition may reach 2% to 30%.
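The 5.3% figure for art.n follows from a short calculation, assuming each of the m contexts has exactly one correct sense and is composed with all n senses:

```latex
\[
\frac{\#\,\text{positive samples}}{\#\,\text{all samples}}
  = \frac{m}{m \cdot n}
  = \frac{1}{n},
\qquad
n = 19 \;\Rightarrow\; \frac{1}{19} \approx 0.053 = 5.3\%.
\]
```

Under this assumption the positive ratio depends only on the number of senses, not on the number of contexts.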
As in MCwMC, for a context $c$, we have multiple vectors, each composing the context with a different sense. If we use a binary classifier that outputs probabilities, we can use a sense decision function like Equation 8. We denote this approach BCwMC-SVM.
If we use regression, the sense score decision function is like that of MCwMC-Reg, except that the classes are +1 and -1 in BCwMC. We denote this approach BCwMC-Reg.
5.4.4 Ranking 2-Level with Meaning Composition
Unlike BCwMC, which uses a single binary classification for a word’s disambiguation, we now formulate WSD as a ranking problem. In the BCwMC formulation, we assign +1 to a composition if it is valid and -1 otherwise. If we instead want to learn a linear function that ranks all valid compositions before invalid compositions, we can formulate WSD using ranking (Cohen, Schapire, & Singer, 1999; Joachims, 2002).
Ranking is a common technique used to order things. One of the most important applications of ranking is ordering the results of search engines (T.-Y. Liu, 2009). In WSD with meaning composition, if we can reliably rank valid compositions above invalid ones, we have a good model for WSD. The ranking formulation of WSD is shown in Equation 9, which is similar to the Ranking SVM formulation (Joachims, 2002).
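$$\min_{\vec{\omega},\, \xi} \;\; \frac{1}{2} \lVert \vec{\omega} \rVert^2 + C \sum_{i,j,k} \xi_{i,j,k}$$
subject to:
$$\vec{\omega} \cdot mc(s_i, c_j) \ge \vec{\omega} \cdot mc(s_i, c_k) + 1 - \xi_{i,j,k},$$
$$\xi_{i,j,k} \ge 0,$$
where $mc(s_i, c_j)$ is a valid composition and $mc(s_i, c_k)$ is an invalid composition.
Equation 9
In Equation 9, $\vec{\omega}$ is the weight vector of the learned linear function, $\xi_{i,j,k}$ is a slack variable, $C$ is the trade-off constant of Ranking SVM, $s_i$ is a sense, and $c_j$ is a context. We can assign a relevance score of 2 (relevant) to valid compositions and 1 (non-relevant) to invalid compositions. In this setting, a sense acts like a query, and we retrieve valid meaning compositions as the relevant documents of that sense.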
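The ranking formulation is similar to the binary classification formulation, and its sense decision function is shown in Equation 10, where $\vec{\omega}$ is the weight vector of the learned linear function and $mc(s_i, c)$ is the meaning composition of sense $s_i$ with the test context $c$. We denote this approach Ranking 2-Levels with Meaning Composition (R2wMC-Ses), in which “Ses” means one query per sense.
$$\hat{s} = \arg\max_{s_i,\; i = 1 \dots n} \vec{\omega} \cdot mc(s_i, c)$$
Equation 10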
In the R2wMC-Ses formulation, the constraints hold only within a sense. We can instead enforce all constraints across a whole word. In this case, we have only one query per word. We use the same formulation, except that every sense belongs to one query. We denote this approach R2wMC-Wrd, in which “Wrd” means one query per word.
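The difference between the per-sense and per-word query settings can be illustrated by counting the pairwise preferences each one generates. The sketch below is an illustration only: the item tuples, the toy gold labels, and the `preference_pairs` helper are hypothetical, and a real experiment would pass feature vectors with query identifiers to a ranking learner instead.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, List, Tuple

# (sense, context id, relevance): relevance 2 = valid, 1 = invalid composition
Item = Tuple[str, str, int]


def preference_pairs(
    items: List[Item], query_of: Callable[[Item], Hashable]
) -> List[Tuple[Item, Item]]:
    """Return (preferred, other) pairs for items that share a query."""
    by_query: Dict[Hashable, List[Item]] = {}
    for item in items:
        by_query.setdefault(query_of(item), []).append(item)
    pairs = []
    for group in by_query.values():
        for a, b in combinations(group, 2):
            if a[2] > b[2]:
                pairs.append((a, b))
            elif b[2] > a[2]:
                pairs.append((b, a))
    return pairs


# Toy data: contexts c1 and c2 carry sense s1, context c3 carries sense s2.
gold = {"c1": "s1", "c2": "s1", "c3": "s2"}
items = [
    (sense, ctx, 2 if gold[ctx] == sense else 1)
    for sense in ("s1", "s2")
    for ctx in ("c1", "c2", "c3")
]

pairs_ses = preference_pairs(items, query_of=lambda item: item[0])  # per sense
pairs_wrd = preference_pairs(items, query_of=lambda item: "word")   # per word
print(len(pairs_ses), len(pairs_wrd))
```

On this toy data the per-sense setting yields 4 preference pairs and the per-word setting yields 9, because a single query also compares compositions built from different senses.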
5.4.5 Ranking 3-Level with Meaning Composition
Now, we utilize context appropriateness and concept fitness through Equation 4 to Equation 6. These equations can be used to generate more training data when we assign different relevance scores to different situations. This is equivalent to directly adding Equation 4 to Equation 6 as constraints of Equation 9. We show the formulation in Equation 11 below.
subject to:
,
where
, .
Equation 11
Equation 11 is simple. If context and concept match each other, we give the composition the highest ranking score. If context appropriateness and concept fitness are both missed, we give it the lowest score. If only one of context appropriateness and concept fitness is missed, we give it the middle score. In the relevance score assignment for WSD, we set matching compositions to 3 (most relevant), compositions with one error to 2 (relevant), and compositions with two errors to 1 (non-relevant). In this way, we generate many training samples for the ranking algorithm even though the labeled dataset is small. We denote this approach Ranking 3-Levels with Meaning Composition (R3wMC-Ses), in which “Ses” means one query per sense.
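The three-level score assignment can be written as a small function of two flags. This is a sketch under an assumption: how the flags `concept_fits` and `context_appropriate` are derived from Equation 4 to Equation 6 is not shown here, and both names are hypothetical.

```python
def relevance(concept_fits: bool, context_appropriate: bool) -> int:
    """Three-level relevance: 3 = no error, 2 = one error, 1 = two errors."""
    errors = (not concept_fits) + (not context_appropriate)
    return 3 - errors


# Enumerate all four situations and their relevance levels.
levels = {
    (fits, appropriate): relevance(fits, appropriate)
    for fits in (True, False)
    for appropriate in (True, False)
}
print(levels)
```

Both one-error situations share the middle level, so the ranker is only asked to separate compositions by the number of errors, not by the kind of error.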
In the R3wMC-Ses formulation, the constraints hold only within a sense. We can instead enforce all constraints across a whole word. In this case, we have only one query per word. We use the same formulation, except that every sense belongs to one query. We denote this approach R3wMC-Wrd, in which “Wrd” means one query per word.
The sense decision functions of R3wMC-Ses and R3wMC-Wrd are the same. Both use Equation 10 as the sense decision function.