
Chapter 4   Relationship Detection

4.2  Similarity between Documents

In the last section, we defined a "similarity" operator to determine whether two documents are similar to each other. In this section, we describe this operation in detail.

For news documents D1 and D2, their similarity based on role and topic information is defined as follows:

$$
\begin{aligned}
Sim(D_1, D_2) = {} & \omega_P \cdot sim_P(D_1, D_2) \\
{}+{} & \omega_L \cdot sim_L(D_1, D_2) \\
{}+{} & \omega_O \cdot sim_O(D_1, D_2) \\
{}+{} & \omega_N \cdot sim_N(D_1, D_2) \\
{}+{} & \omega_V \cdot sim_V(D_1, D_2)
\end{aligned}
$$

Formula 1: Role and Topic Cosine Similarity

The formula shows that the similarity between news documents consists of five parts:

1. Person role part. (1st row)

2. Location role part. (2nd row)

3. Organization role part. (3rd row)

4. Noun part. (4th row)

5. Verb part. (5th row)

The 1st~3rd parts carry role information, and the 4th~5th parts carry topic information.

Each part consists of two elements: a weight ω and a similarity degree. ω is the weight of one part; the meaning of each weight is given below:

ωP    The weight of the person role part
ωL    The weight of the location role part
ωO    The weight of the organization role part
ωN    The weight of the noun part
ωV    The weight of the verb part

Table 1. Weights of the role and topic parts

The second element of each part, the similarity degree, measures how similar the two news documents are with respect to that part. We discuss this element in two parts: role similarity and topic similarity.

1. Role similarity between news documents

The role similarity contains three kinds of similarity degrees: person, location, and organization. We discuss role similarity using the person role part as an example.

The person role similarity degree is calculated by the formula below:

$$sim_P(D_1, D_2) = \frac{\sum_{e \in E} \cos(\vec{v}_e^{\,D_1}, \vec{v}_e^{\,D_2})}{|E|}, \qquad E = E_P(D_1) \cap E_P(D_2)$$

where $E_P(D)$ is the set of person entity names in document $D$ and $\vec{v}_e^{\,D}$ is the feature vector of entity $e$'s role in $D$. For a news document pair D1 and D2, we calculate the cosine similarity score between all person role pairs that share the same entity name, using their feature vectors, and average these scores over the count of matched names. If the role similarity score between two news articles is high, the entities in them show similar behavior or a similar situation.
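To make the computation concrete, here is a minimal sketch of the person role similarity under the reading above (averaging over matched entity names); representing the feature vectors as term-frequency `Counter` objects is our assumption for illustration, not a detail fixed by the thesis:

```python
import math
from collections import Counter

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def person_role_similarity(roles1: dict, roles2: dict) -> float:
    """Average cosine similarity over person roles sharing an entity name.

    roles1/roles2 map entity name -> feature vector (Counter) of that
    entity's role in one document; pairs are matched by entity name.
    """
    shared = roles1.keys() & roles2.keys()
    if not shared:
        return 0.0
    return sum(cosine(roles1[e], roles2[e]) for e in shared) / len(shared)
```

The location and organization role similarities follow the same pattern over their own entity sets.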

2. Topic similarity between news documents

The topic similarity combines the similarity scores of the topic parts (noun and verb). If two news articles' topic similarity score is high, the articles likely describe the same kind of event. For example, news articles about an earthquake often share topic words such as "quake", "amplitude", "epicenter", "rescue", etc.

Both the role similarity and the topic similarity play important roles in detecting relationships between news articles. In the next section, we introduce how we use this information to calculate the dependence between articles and how we combine these similarity scores.

What happens if two news articles have high role similarity but low topic similarity? This combination of high and low scores is interesting to discuss; we return to this question after the experiments.

4.4 Discussion

The advantage of our method is that it detects relationships between news articles by considering not only the characteristics of news events but also the concepts of the named entities in the articles. The drawback of our work is its dependence on NER and part-of-speech tagging, whose errors may propagate and affect our results.


Chapter 5 Experiments

5.1 Overview

In this chapter, we run experiments on relationship detection between news articles using role and topic information. We first introduce our experiment dataset and then describe the experimental setup in sections 5.1~5.3.

Next, we present all experiment methods and results, and discuss each of them.

The basic experiment process graph is shown below:


Figure 5. Experiment Process Graph

5.2 Dataset


The dataset of our experiment consists of 162 news articles retrieved from Yahoo News Taiwan with the query "洗錢案" (a money laundering case); the articles are dated from 2008/9/24 to 2008/12/24.

Because organizing the true, human-sensible relationship answers for these news stories is difficult and ambiguous across different annotators, we use a moderate number of news articles, which makes it easier to examine the relationship answers in detail.

We split the news articles into groups by date: two news articles are assigned to the same group if and only if they are published on the same day. After this step, we obtain 46 time groups spanning all 162 news articles, as shown in Figure 6.

Figure 6. News Articles After Splitting.
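The date grouping can be sketched as below; the `date` attribute and its zero-padded 'YYYY/MM/DD' format are illustrative assumptions:

```python
from collections import defaultdict

def group_by_date(articles):
    """One time group per publication day, ordered chronologically.

    Assumes each article carries a `date` string such as '2008/09/24',
    so lexicographic order equals chronological order.
    """
    groups = defaultdict(list)
    for article in articles:
        groups[article.date].append(article)
    return [groups[d] for d in sorted(groups)]
```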

5.3 Evaluation

To our knowledge, there is no official benchmark or standard answer set for the study of event evolution mining, so we construct the correct answers ourselves.

We choose the recall and precision to evaluate the performance of our experiments.


The parameters and formulas are given below:

|R_t| : the number of true relationships labeled manually.

|R_e| : the number of relationships detected by our method.

$$\text{Recall} = \frac{|R_t \cap R_e|}{|R_t|} \qquad \text{Precision} = \frac{|R_t \cap R_e|}{|R_e|}$$
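Treating relationships as sets of article-ID pairs, the two measures can be computed as in this small sketch:

```python
def recall_precision(true_rels: set, detected_rels: set) -> tuple:
    """Recall and precision over relationship sets (pairs of article IDs)."""
    hits = len(true_rels & detected_rels)
    recall = hits / len(true_rels) if true_rels else 0.0
    precision = hits / len(detected_rels) if detected_rels else 0.0
    return recall, precision
```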

5.4 Baseline

We believe that if two news articles are dependent on each other, they must also be similar to each other. We therefore choose document similarity as the baseline strategy for detecting relationships between news articles, and we need to define the vectors used in the calculation.

We build each news article's vector from the keywords matched by Wiki titles, using their term frequency (tf) as weights, and compute the cosine similarity. If a word appears both in a news article and in the Wiki title list, it becomes a feature term of that article.
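A minimal sketch of this baseline feature extraction, reusing the `cosine` function sketched in section 4.2; tokenization is assumed to have been done already:

```python
from collections import Counter

def article_vector(tokens, wiki_titles: set) -> Counter:
    """Baseline vector: term frequencies of tokens that match a Wiki title."""
    return Counter(t for t in tokens if t in wiki_titles)
```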

We calculate the cosine similarity of each news article against all other news within the nearest 10 time intervals, to avoid distant and unbounded calculation.

For any pair of news articles in different time intervals, if their cosine similarity score is higher than a threshold, we mark the relationship between these two news articles as true.
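Putting the pieces together, the baseline detection loop might look as follows; the threshold value and the `tokens`/`id` attributes are illustrative assumptions, and `cosine` and `article_vector` come from the sketches above:

```python
def detect_relationships(groups, wiki_titles, threshold=0.3, horizon=10):
    """Baseline: mark pairs whose cosine similarity exceeds `threshold`.

    Only pairs at most `horizon` time groups apart are compared, to avoid
    distant and unbounded computation.
    """
    vectors = [[article_vector(a.tokens, wiki_titles) for a in g] for g in groups]
    detected = set()
    for i in range(len(groups)):
        for j in range(i + 1, min(i + 1 + horizon, len(groups))):
            for a, va in zip(groups[i], vectors[i]):
                for b, vb in zip(groups[j], vectors[j]):
                    if cosine(va, vb) > threshold:
                        detected.add((a.id, b.id))
    return detected
```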

The experiment result is recall 0.66 and precision 0.36. The recall is acceptable, but the precision is poor. Examining the results, we find that our dataset contains many news articles that describe similar topics while the participants in them differ. Using only the keywords of news articles as features to calculate similarity wrongly judges such pairs as related.

5.5 Experiment 1: Different Weights

In this section, we introduce the first experiment, which uses the operations introduced in section 4.1 to detect relationships between news articles.

This experiment observes and discusses the effect of the weight combination strategy for the five parts' similarity scores.

5.5.1. Static Weight

The similarity operation formula is shown below:

$$Sim(D_1, D_2) = \omega_P \cdot sim_P + \omega_L \cdot sim_L + \omega_O \cdot sim_O + \omega_N \cdot sim_N + \omega_V \cdot sim_V$$

This score is a linear combination of five similarity scores: the person role, location role, organization role, noun, and verb parts. In this experiment we try several weight combinations to observe which ones perform well in relationship detection.

We set each of the weight parameters (ωP, ωL, ωO, ωN, ωV) between 0 and 1, with their sum equal to 1.
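The weighted combination itself is straightforward; here is a sketch (keying the dictionaries by part name is our choice for illustration):

```python
def combined_similarity(part_sims: dict, weights: dict) -> float:
    """Formula 1: linear combination of the five part similarities.

    Both dicts are keyed by 'P', 'L', 'O', 'N', 'V'; the weights are
    assumed to lie in [0, 1] and sum to 1.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * part_sims[k] for k in ('P', 'L', 'O', 'N', 'V'))

# Example: the weights of method M10 discussed below.
m10 = {'P': 0.6, 'L': 0.1, 'O': 0.1, 'N': 0.1, 'V': 0.1}
```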

Table 2 shows part of the results of this experiment. The values in the first row (besides "Base") denote combination methods (M) with IDs 1~5, each corresponding to a weight combination. "Base" refers to the baseline result of section 5.4; we include it here for easy comparison with the results of Experiment 1.

            Base   M1     M2     M3     M4     M5
ωP          --     1.0    0.0    0.0    0.0    0.0
ωL          --     0.0    1.0    0.0    0.0    0.0
ωO          --     0.0    0.0    1.0    0.0    0.0
ωN          --     0.0    0.0    0.0    1.0    0.0
ωV          --     0.0    0.0    0.0    0.0    1.0
Recall      0.68   0.57   0.17   0.25   0.68   0.69
Precision   0.36   0.66   0.54   0.66   0.17   0.16

Table 2. Static Weight Result (Single)

Methods M1~M5 in Table 2 each use a single one of the five parts as the whole weight in the similarity score; we also tried a method that gives every part the same weight.


Let us now discuss the results of each combination strategy.

M1:

M1 sets ωP to 1 and the other weights to 0, so that we can clearly see the effect of the person role on relationship detection. This combination yields recall 0.57 and precision 0.66. Compared with the baseline, M1's recall is worse but its precision is much better. The recall result means that, on this dataset, using only role information to detect relationships between news events would miss about half of the correct event dependency relations. The precision result shows that using the person role similarity as the whole weight avoids many erroneous judgments that would mark independent events as dependent.

M2, M3:

The weight combinations of M2 and M3 yield very poor recall and better-than-baseline precision. The poor recall of M2 and M3 is reasonable: locations and organizations play much less important roles than persons in the experiment dataset. Most of the news articles revolve around persons; only a few use locations or organizations as the most important characters.


M4, M5:

The weight distributions of M4 and M5 use only the topic parts for relationship detection. The results show that using only the topic parts yields recall similar to the baseline but very poor precision.

Table 3 shows another weight combination experiment; this strategy sets the person role part to 0.9 and one other part to 0.1.

            M6     M7     M8     M9
ωP          0.9    0.9    0.9    0.9
ωL          0.1    0.0    0.0    0.0
ωO          0.0    0.1    0.0    0.0
ωN          0.0    0.0    0.1    0.0
ωV          0.0    0.0    0.0    0.1
Recall      0.54   0.54   0.66   0.66
Precision   0.72   0.75   0.64   0.62

Table 3. Static Weight Result (Combine)

M6, M7:

M6 and M7 combine the person role similarity with one other role part, using the weight combinations shown in Table 3. Compared with method M1, which uses only the person role part, these methods achieve better precision scores; adding a second role part helps raise the precision.

M8, M9:

M8 and M9 use the person role as the main part and the noun or verb part as an auxiliary element with weight 0.1. The results show that both recall and precision exceed 60% under this strategy, which means that the combination of role similarity and topic similarity achieves a good balance between recall and precision.

Table 4 shows a further weight combination experiment; this strategy gives the person role similarity the most weight (0.6) while the other parts evenly share the remaining weight (0.1 each).

            M10
ωP          0.6
ωL          0.1
ωO          0.1
ωN          0.1
ωV          0.1
Recall      0.72
Precision   0.79

Table 4. Static Weight Result (All)

This combination strategy gives the best result in our experiments so far: both recall and precision are above 70%. This tells us that using all parts of the role and topic information may provide all the information we need to detect relationships between news events.

5.5.2. Dynamic Weight

The experiment results in section 5.5.1 tell us that the person role vectors play the most important role in the case of "洗錢案". We get this result because most of the 162 news articles revolve mainly around persons; locations and organizations are the most important roles in only a few news articles of the experiment dataset.

Our goal in this thesis is to detect relationships between events in all kinds of domains, so no single static weight combination fits all news article sets. Even within one news set describing a specific event, articles may focus on different kinds of roles.

For example, in our experiment dataset of 162 news articles, most articles focus on person behavior, but some focus on organizations and some on locations. If we can automatically choose the weight strategy for each news article pair, the relationship detection result may improve. We therefore design a variant weight strategy that dynamically decides the weight of each role vector part.

In this experiment, we use the following combination strategy. For each news article pair:

1. Following the combination strategy of method M10, we select a main part, set its weight to 0.6, and set the other parts to 0.1.

2. The main part is chosen from the three role similarity parts (person, location, and organization): we select the part that has the largest number of entity names in the news document pair. A minimal sketch of this selection follows.
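This is one plausible reading of the selection rule, assuming the entity names of each role type have already been counted for the pair:

```python
def dynamic_weights(entity_counts: dict) -> dict:
    """Variant weight strategy: the richest role part becomes the main part.

    `entity_counts` maps 'P', 'L', 'O' to the number of entity names of
    that type in the news document pair. The main part gets weight 0.6;
    every other part (including the topic parts 'N' and 'V') gets 0.1.
    """
    main = max(('P', 'L', 'O'), key=lambda k: entity_counts[k])
    weights = {k: 0.1 for k in ('P', 'L', 'O', 'N', 'V')}
    weights[main] = 0.6
    return weights
```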

Table 5 shows the result of this combination strategy.


            Variant Weight
Recall      0.80
Precision   0.79

Table 5. Dynamic Weight Result

The dynamic weight method improves relationship detection for the news articles that discuss organizations.

5.5.3. Comparison

In this section, we compare the experiment results through their evolution graphs.

Figures 7 to 9 show the evolution graphs of the baseline, static weight, and dynamic weight methods.

Figure 7. Evolution Graph of Baseline

Figure 8. Evolution Graph of Static Weight


Figure 9. Evolution Graph of Dynamic Weight

The blue lines in the graphs are the correct relationships between events recognized by our experiment processes, the yellow lines are the correct relationships missed by our processes, and the red lines are relationships that are not in the correct answer set but were detected as correct by our processes.

Figure 7 shows that the baseline detects a lot of noise, visible as the high density of red lines. Compared with Figure 7, Figures 8 and 9 have far fewer red lines, which shows that our method effectively reduces the noise when building an evolution graph.

Comparing Figures 8 and 9, the biggest difference between the two results lies in the news events that discuss organizations. The static weight strategy fixes the person part as the dominant weight, so it cannot detect well the relationships between news events centered on locations or organizations.


5.6 Experiment 2: Window Size

The window size of a role vector determines the amount of information the vector carries. If the window size is too big, the vector may collect too many keywords, including noisy ones. On the other hand, if a role vector is built with too small a window size, it may miss information that helps determine correctly the role an entity plays in news articles.

How to select a suitable window size is therefore an important question for detecting relationships between news articles.

5.6.1 Static Window Size

In this section, we set the window size of all role vectors in all news articles to the same value, from 5 to 50, and then observe the effect of this static window size.
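For concreteness, this is one way to build a role vector with a given window size; the symmetric token window and exact-match mention detection are simplifying assumptions:

```python
from collections import Counter

def role_vector(tokens, entity, window, wiki_titles):
    """Collect Wiki-title keywords within `window` tokens of each mention."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == entity:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            vec.update(t for t in tokens[lo:hi]
                       if t in wiki_titles and t != entity)
    return vec
```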

The experiment result is shown below.

Figure 10. Result of Different Window Sizes (recall and precision plotted against window sizes 5~50)


Window Size   5      10     15     20     25     30     35     40     45
Recall        0.10   0.36   0.54   0.65   0.73   0.80   0.78   0.81   0.78
Precision     0.57   0.68   0.72   0.73   0.76   0.75   0.71   0.68   0.64

Table 6. Static Window Size

The recall is very poor at small window sizes: collecting too few feature words for a role causes many correct relationships between news articles to be missed. Recall rises as the window size grows; beyond window size 30 it is stable around 0.8 and no longer changes with the window size.

The precision rises with the window size up to 30 but declines after 30: too long a window collects too much noise for a role and causes many incorrect relationships to be detected as correct.

5.6.2 Dynamic Window Size

Section 5.6.1 showed the effect of one window size shared by all roles in the news articles. In this section, we discuss how giving each role its own window size affects the experiment result.

First, we run an experiment in which all person roles share the same window size, varied from 5 to 50, and we observe the result for each role. Some results are below:

Window size   20      25      30       35       40
蔡美利        30/53   33/53   32/55    30/58    30/59
吳景茂        46/63   48/60   46/66    46/67    47/68
林德訓        13/17   15/21   20/27    22/27    20/30
蔡銘哲        69/90   78/101  87/113   84/120   85/127
葉盛茂        9/9     8/9     10/11    11/12    10/12
辜濂松        9/14    10/12   8/10     9/11     13/15

Table 7. Some Persons’ Precision in Different Window Sizes

The results show that different roles contribute best to relationship detection at different window sizes: some at 25, some at 30, some at 35, and some even at 40.

5.7 Experiment 3: Different Feature Words

In our work, we choose the Wiki titles as our feature words in roles’ feature vectors.

Are Wiki titles really suitable as feature words? We run an experiment to answer this question.

In this experiment, instead of Wiki keywords, we use the bigrams within each role's window as feature words; the weight of a feature word is still its term frequency. Table 8 shows the experiment result.
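Here is a sketch of the bigram variant; the window tokens are assumed to be already extracted as in the role-vector sketch of section 5.6.1:

```python
from collections import Counter

def bigram_role_vector(window_tokens) -> Counter:
    """Experiment 3 variant: every adjacent token pair in the role's
    window becomes a feature, weighted by its frequency."""
    return Counter(zip(window_tokens, window_tokens[1:]))
```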

            B0        B1     B2     B3     B6     B7
ωP          variant   1.0    0.0    0.0    0.2    0.6
ωL          variant   0.0    1.0    0.0    0.2    0.1
ωO          variant   0.0    0.0    1.0    0.2    0.1
ωN          0.1       0.0    0.0    0.0    0.2    0.1
ωV          0.1       0.0    0.0    0.0    0.2    0.1
Recall      0.78      0.66   0.53   0.37   0.75   0.66
Precision   0.48      0.36   0.37   0.51   0.37   0.37

(In B0, the role weights follow the variant weight strategy of section 5.5.2, with ωN and ωV fixed at 0.1.)

Table 8. The Result of Bigram Feature Words

The table shows that using bigrams as feature words for relationship detection yields low precision, because using all bigrams collects much noise into a role's vector. This confirms that Wiki titles are suitable feature words for a role.

5.8 Experiment 4: Different Feature Size

This experiment asks whether we should use all the feature words in the feature vectors when calculating relationships between news articles. Some of those words may be unimportant for describing the role played by a named entity; if we can discard the less important feature words, the calculation speeds up.

In this experiment, we use term frequency to decide whether a feature word is kept in a feature vector or discarded: we keep the feature words with the highest term frequencies, from the top 90% down to the top 10%, and observe the result.
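The pruning step can be sketched as follows; keeping at least one feature is a safeguard we added for illustration:

```python
def prune_by_tf(vec: dict, keep_rate: float) -> dict:
    """Keep only the top `keep_rate` fraction of feature words by tf.

    keep_rate=0.5 keeps the half of the features with the highest counts.
    """
    ranked = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_rate))]
    return dict(kept)
```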

The experiment results are below:

Tf-rate    Role          Topic         Role & Topic
90%        0.80 ; 0.79   0.80 ; 0.79   0.80 ; 0.79
80%        0.80 ; 0.79   0.80 ; 0.79   0.80 ; 0.79
70%        0.80 ; 0.79   0.80 ; 0.79   0.80 ; 0.79
60%        0.80 ; 0.79   0.80 ; 0.79   0.80 ; 0.79
50%        0.80 ; 0.75   0.80 ; 0.77   0.80 ; 0.76
40%        0.80 ; 0.71   0.80 ; 0.72   0.80 ; 0.71
30%        0.80 ; 0.69   0.80 ; 0.70   0.80 ; 0.69
20%        0.78 ; 0.67   0.80 ; 0.68   0.77 ; 0.65
10%        0.76 ; 0.64   0.76 ; 0.65   0.75 ; 0.63

(Each cell lists recall ; precision.)

Table 9. The Result of Different Feature Sizes

The result tells us that even if we use only the top 50% of feature words by tf, we get the same result as using all feature words. We can therefore use only a part of the feature words to calculate relationships and speed up evolution detection.


Chapter 6 Conclusion and Future Work

6.1 Conclusion

This thesis presents a method to detect the dependency between news events using role and topic information. The role information considers not only the named entities in news events but also the concepts of those entities. Besides the roles, we also use topic information to detect event evolution; the topic information tells us what kind of event a news event is.

Our experiments show that our method is usable for event evolution detection. In the first experiment we observed the effect of weight combinations and found that using all parts of the role and topic information helps most in detecting relationships between news articles. We also found that the dynamic weight strategy beats the static combination strategy, because different news article pairs do not share the same focus.

The second experiment showed that a window size that is too big or too small is unsuitable for collecting a role's feature words; we must choose a suitable window size to maintain the information quantity while controlling the noise of a role's feature vector. We also observed that entity names perform differently at different window sizes: the best window size for the whole dataset is not the best for every entity name. We have not yet found a method to dynamically determine the best window size for each entity's roles.

The third experiment proved that Wiki titles are suitable as the feature words of roles, containing less noise than using all bigrams as feature words. The last experiment proved that we do not have to use all feature words to calculate the relationships: keeping only the high-tf feature words gives the same result while speeding up the detection.
