
Chapter 3   Model the News Article

3.3  The Role

Section 3.2 showed that a document contains a role set, which consists of roles. Before introducing the structure of a role and how to build it, we explain why this thesis uses role information.

3.3.1 The Role

The term “role” means a person’s or a thing’s function in a particular situation.

From our observation of news articles, we find that news events revolve around people, locations, and organizations, so we can use this information to help calculate the relationships between news articles.

Why does this information help detect the relations between news articles? A person may appear in many news articles but play a different role in each of them. For example, the person “陳水扁” plays the role of a president attending a speech in one news article but is a defendant in another. Distinguishing the different roles that the same person plays in each news article helps us avoid some mistakes in relation detection.

Suppose, for example, that two people simply swap roles between two news articles, and that the keywords describing the roles' actions, together with the person names, are extracted as the keywords used to calculate the similarity of the two articles. If we calculate the similarity from these keywords alone, without distinguishing the roles the two people play in each article, the two articles may receive a high similarity score purely because they share the person names and action keywords, even if they are unrelated.
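The swapped-role problem can be illustrated with a small sketch. The two articles, the placeholder persons `A` and `B`, the keywords, and the use of cosine similarity are all hypothetical choices made for illustration; the thesis does not specify them at this point.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical articles in which persons A and B swap roles.
# Flat view: every extracted keyword goes into one bag per article.
article1_keywords = Counter(["A", "B", "sues", "plaintiff", "defendant"])
article2_keywords = Counter(["A", "B", "sues", "plaintiff", "defendant"])

# Role-aware view: each keyword is attached to the name it describes.
article1_roles = {"A": Counter(["plaintiff", "sues"]), "B": Counter(["defendant"])}
article2_roles = {"A": Counter(["defendant"]), "B": Counter(["plaintiff", "sues"])}

flat_similarity = cosine(article1_keywords, article2_keywords)
role_similarity = {
    name: cosine(article1_roles[name], article2_roles[name])
    for name in article1_roles
}

print(flat_similarity)   # approximately 1.0: the bags of keywords are identical
print(role_similarity)   # 0.0 for both A and B: their roles do not overlap
```

With the flat bag of keywords the two articles look almost identical, while comparing the keywords per name shows that neither person plays the same role in both articles.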

In this thesis, we propose a method for calculating the relationship between news articles using information that includes the roles that person names, location names, and organization names play in those articles.

As mentioned, each role corresponds to an entity name (a person name, location name, or organization name) in a news article, together with other information that describes the behavior of this name entity. We introduce the structure of a role here.

R: a role (vector) in a document, represented as a 2-tuple.

R = < NE, FV >

A role consists of a name entity and a feature vector. They are related as follows: each name entity plays a particular role in a document, and the feature vector describes the role that this name entity plays in the document.

NE (Name Entity)

The name entity is the most important part of the role; it acts as the role's ID. A name entity is an object such as a person, location, or organization in a news document.

FV (Feature Vector)

The feature vector is a set of features. A feature is defined as follows:

$F_i$ (the $i$-th feature of a feature vector) $= \langle fw, \omega_f \rangle$

$fw$: feature word of a role

$\omega_f$: weight of a feature word

And the feature vector is defined below:

$FV$ (feature vector) $= \langle fw_1[\omega_{f1}], fw_2[\omega_{f2}], fw_3[\omega_{f3}], \ldots \rangle$

The feature vector of a role provides information about the role that the name entity plays in this document. It consists of features, and each feature consists of two elements: a feature word and the weight of that feature word.

Feature words are selected from the document and contribute meaning to the role played by the name entity. They make the name entity meaningful and give it its own role in the news article.

The weight of a feature word, $\omega_f$, indicates how important the corresponding feature word is to the role vector. The larger the $\omega_f$ of a feature word, the more meaning that feature word contributes to the role played by the name entity in the document.

Below is an example of role vectors that includes all the elements we have introduced.

Role

Name Entity | Feature Vector
<王建民> | 紐約洋基[0.2], 投手[0.3], 勝投王[0.1], 台灣[0.2], 連勝[0.2]
<郭台銘> | 鴻海[0.2], 台灣首富[0.2], 劉嘉玲[0.4], 林志玲[0.2]

Next, we introduce how we build the role vector, including how to choose the feature words and how to calculate each feature word's weight $\omega_f$.
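As a minimal sketch, the role structure R = < NE, FV > defined in this section could be represented in Python as follows. The class name `Role`, its field names, and the helper `top_feature` are illustrative assumptions and not part of the thesis; the data are the two example roles given above.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Role:
    """A role R = <NE, FV>: a name entity plus its weighted feature words."""
    name_entity: str                                                 # NE, acts as the role's ID
    feature_vector: Dict[str, float] = field(default_factory=dict)   # fw -> weight

# The two example roles from this section.
roles = [
    Role("王建民", {"紐約洋基": 0.2, "投手": 0.3, "勝投王": 0.1, "台灣": 0.2, "連勝": 0.2}),
    Role("郭台銘", {"鴻海": 0.2, "台灣首富": 0.2, "劉嘉玲": 0.4, "林志玲": 0.2}),
]

def top_feature(role):
    """Return the feature word with the largest weight (hypothetical helper)."""
    return max(role.feature_vector, key=role.feature_vector.get)

for role in roles:
    print(role.name_entity, top_feature(role))
```

A dictionary keyed by feature word is used here because each feature word appears once per role with a single weight; other representations would work equally well.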

3.3.2 Building Roles

In this section, we describe how we determine the roles played by names in news articles.

1. Finding name entities:

First, we have to find all the entity names in a news article. We find all the person names, location names, and organization names in the news articles we want to process using 中研院自然語言處理-斷詞系統 (the Academia Sinica natural language processing word segmentation system) and an NER system built on the concepts of Erik Peterson's study of NER.

These tools locate all the entity names in the news articles; we then extract the feature words for each feature vector.

2. Building feature word set

Feature words are the words, other than the role's own name, that make up a role vector and contribute meaning to it. These words make the role names meaningful and give them their own role in the news article.

Because all the news articles were crawled with the same query, most of them share the same important terms, so extracting keywords with tf-idf would miss many important keywords that have a high document frequency.

Instead of using tf-idf, we use Wikipedia titles as the source for extracting keywords from the news articles.

The Wikipedia title list consists of all the titles that appear in Wikipedia, so it contains a large number of meaningful terms.
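One way to use a title list for keyword extraction is a greedy longest-match scan over the text, sketched below. The thesis does not say how titles are matched against the article, so the matching strategy, the function name `extract_title_keywords`, and the tiny `titles` set (a stand-in for the real title list) are all assumptions.

```python
def extract_title_keywords(text, titles, max_len=None):
    """Return (offset, title) pairs found by greedy longest-match scanning."""
    if max_len is None:
        max_len = max((len(t) for t in titles), default=0)
    found = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so longer titles win over shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in titles:
                match = candidate
                break
        if match:
            found.append((i, match))
            i += len(match)  # continue after the matched title
        else:
            i += 1
    return found

# Hypothetical stand-in for the title list.
titles = {"國務機要費", "偽造文書", "偽證罪", "洗錢", "被告"}
sentence = "林德訓在國務機要費案中,已經分別以偽造文書和偽證罪兩項罪名來起訴,如今又再洗錢一案轉列被告"
keywords = extract_title_keywords(sentence, titles)
print([word for _, word in keywords])
```

Because the titles are looked up in a set, a term is kept whenever it is a known title, regardless of how many documents it appears in, which is the property tf-idf lacks in this setting.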

3. Deciding the feature words in the feature vector

We assume that the words near a name in a news article are meaningful for that name. We therefore collect the feature words near each occurrence of an entity name in the article, which gives every name its own meaning in different news articles.

To do this, we define the term “window”. In this thesis, the window of a role is the range of positions before and after an entity name. Each name has its own window, governed by a parameter called the “window size”: the window covers every position before or after the name whose distance from the name is less than the window size. All feature words that appear in an entity name's window are regarded as belonging to that role's feature vector.
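The window definition can be sketched as follows. The thesis does not state the unit of distance or the exact boundary handling, so this sketch assumes character offsets and a window that reaches `window_size` characters to each side of the name; the function names are hypothetical.

```python
def find_occurrences(text, name):
    """Return the start offset of every non-overlapping occurrence of name."""
    starts = []
    pos = text.find(name)
    while pos != -1:
        starts.append(pos)
        pos = text.find(name, pos + len(name))
    return starts

def window_for(start, name_len, window_size, text_len):
    """Half-open span [lo, hi) around one occurrence, clipped to the text."""
    lo = max(0, start - window_size)
    hi = min(text_len, start + name_len + window_size)
    return lo, hi

sentence = "林德訓在國務機要費案中,已經分別以偽造文書和偽證罪兩項罪名來起訴,如今又再洗錢一案轉列被告"
name = "林德訓"
windows = [
    window_for(start, len(name), 100, len(sentence))
    for start in find_occurrences(sentence, name)
]
print(windows)  # one window that spans the whole sentence, since it is shorter than 100
```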

If an entity name appears more than once in a news article, it has several windows, and some of them may overlap because the distance between two occurrences of the name may be less than the window size. In this case, we merge the windows so that the overlapping part is read only once. For example, if a name has two windows covering positions 20~80 and 40~100, we merge them into a single window covering positions 20~100. In this way, we avoid counting some keywords repeatedly.
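Merging overlapping windows is a standard interval merge. The sketch below is one possible implementation; the function name `merge_windows` is hypothetical, and treating windows that merely touch as mergeable is an assumption.

```python
def merge_windows(windows):
    """Merge overlapping (lo, hi) windows into a sorted list of disjoint windows."""
    merged = []
    for lo, hi in sorted(windows):
        if merged and lo <= merged[-1][1]:
            # Overlaps (or touches) the previous window: extend it.
            prev_lo, prev_hi = merged[-1]
            merged[-1] = (prev_lo, max(prev_hi, hi))
        else:
            merged.append((lo, hi))
    return merged

print(merge_windows([(20, 80), (40, 100)]))  # [(20, 100)], as in the example above
```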

Below is an example of the role of a person name. Consider the following sentence first.

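“林德訓在國務機要費案中,已經分別以偽造文書和偽證罪兩項罪名來起訴,如今又再洗錢一案轉列被告” (In the State Affairs Fund case, 林德訓 has already been indicted on two charges, forgery of documents and perjury, and has now been named a defendant in the money laundering case.)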

If we set the window size to 100, the highlighted words (國務機要費, 偽造文書, 偽證罪, 洗錢, 被告) are the feature words that lie near the name “林德訓” and inside its window. We define them as the role words of “林德訓” in this sentence.

Every person, location, and organization name in the 162 news articles is represented as a vector of the keywords near it, as follows.


Entity Name | Feature Words [Number of occurrences]
<林德訓> | 國務機要費[1], 偽造文書[1], 偽證罪[1], 洗錢[1], 被告[2]

The vector of a name in a news article is regarded as the role that the name plays in the article in which it appears.

The second part of each role word entry, such as the number “1” after the term “國務機要費”, is the number of times “國務機要費” appears in the windows of the name “林德訓” in this news article.
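Counting the feature words inside a name's windows can be sketched as below. The function name, the `(offset, word)` representation of keyword hits, and the half-open window spans are assumptions made for illustration. The hits are the offsets of the five feature words in the single example sentence, so each is counted once here; the table above counts occurrences over the whole news article.

```python
from collections import Counter

def count_feature_words(keyword_hits, windows, entity_name):
    """Count keyword occurrences that fall entirely inside any window.

    keyword_hits: list of (start_offset, word) pairs.
    windows: list of half-open (lo, hi) offset spans.
    """
    counts = Counter()
    for start, word in keyword_hits:
        if word == entity_name:
            continue  # the role's own name is not a feature word
        end = start + len(word)
        # any() counts a hit once even if several windows contain it.
        if any(lo <= start and end <= hi for lo, hi in windows):
            counts[word] += 1
    return counts

hits = [(4, "國務機要費"), (17, "偽造文書"), (22, "偽證罪"), (37, "洗錢"), (43, "被告")]
counts = count_feature_words(hits, [(0, 45)], "林德訓")
narrow = count_feature_words(hits, [(0, 25)], "林德訓")  # a smaller window keeps fewer words
print(dict(counts))
print(dict(narrow))
```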

4. Feature Weight

We now know which feature words appear near the entity names and how many times they appear in each name's windows. The next step is to convert the number of occurrences into a term frequency (tf). This is done by dividing each role word's number of occurrences by the total number of occurrences of all role words in the name's windows.

After this step, we obtain the new name vector as follows.

Role

Entity Name | Feature Vector
<林德訓> | 國務機要費[0.16], 偽造文書[0.16], 偽證罪[0.16], 洗錢[0.16], 被告[0.33]

This is the final form of the role vector that we prepare in order to detect the relationships between news articles.
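The weighting in step 4 can be sketched as a simple normalization of the occurrence counts. The function name `to_term_frequency` is hypothetical; the counts are those given for “林德訓” above.

```python
def to_term_frequency(counts):
    """Divide each count by the total count of all feature words in the windows."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}

counts = {"國務機要費": 1, "偽造文書": 1, "偽證罪": 1, "洗錢": 1, "被告": 2}
weights = to_term_frequency(counts)
for word, weight in weights.items():
    print(word, weight)
```

The total count is 6, so the weights are 1/6 for the four words that occur once and 2/6 for 被告. These are the values listed as 0.16 and 0.33 in the role vector above, where 1/6 is shown cut to two decimal places.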
