Chapter 1   Introduction

1.1  Motivations

Whenever we read a news article that interests us, we may also want to know how the story began and how it subsequently developed. For example, when we return home from a trip, we have no idea what happened while we were away. By sending keywords about the news to a search engine, we can retrieve many related articles from the Web (for example, from Yahoo! News or Google News). We can further look for the related articles that many online news websites provide; these articles cover topics relevant to the article we are reading. Figure 1 shows an example of this scenario.

Figure 1. News articles displayed as a list.

Search engines may return hundreds of news articles from the Web. However, it is very difficult to understand the beginning and subsequent development of the events from such a list [3]. The list returned by a search engine is sorted by the relevance score between the query and each document, not by the temporal relationships among the news articles. We have to analyze these relationships and determine the order of the articles ourselves. In addition, browsing such a large number of news articles is laborious and time consuming.

Is there a way to display news articles that is easier for users to understand? The answer is yes. If search results are displayed as an “Evolution Graph” like the one in Figure 2, users can more easily see how an event began and how it developed.

beginning and subsequent development of this news article.

How to detect the evolution of events has been studied for years under the name “event threading”. Event threading not only detects the events within a topic but also captures the dependencies between events. Some previous studies of event threading used content similarity as the feature for detecting the dependency between news events. Others used person names and locations as features, and most of them also used time information.

We introduce topic detection and tracking (TDT) and event threading in detail in Chapter 2.

1.3 Problem Definition

As mentioned, our goal in this work is to detect the evolution of news events within a topic we are interested in. To detect the evolution, we have to detect the relationships between news events.

To achieve this goal, we first have to select news articles as input. We use either the news articles found on online news websites by searching with a query we are interested in, or the related news articles grouped by a TDT technique. The input data also contains time information, for example, the date and time at which each news article was released. In this work, we assume that each news article is a news event.

Besides the input, we have to decide what output we want. In this work, the output of our experiments is the set of relationships between news articles.

We hope that our system can work in any domain and on any set of news articles retrieved by a query. It should not be limited to a specific topic, and it should run automatically without strong assumptions about what kind of news articles it is given.

1.4 Basic Idea

To detect the evolution of events, we start from observations about events. The following statements describe what an event or a news event is:

"Every newspaper reporter should answer the questions, What? Who? Where? When? Why?" [14]

An event is a specific thing that happens at a specific time. [15]

These statements tell us that most events consist of five elements: person, thing, time, location, and object. We can use this information to detect the relationships between events.

How do we use this information to detect the evolution of events? We split the elements other than time into two classes: role and topic.

1. Role:

We observe that most news revolves around persons, locations, and organizations, and that every named entity plays a particular role in the news article it belongs to. A role is defined as a person’s or thing’s function in a particular situation. For example, Chien-Ming Wang (王建民) plays the role of a pitcher in one news article but the role of a father in another.

We believe that distinguishing the different roles that names play in each news article helps to detect the relationships between news events.

2. Topic:

The topic roughly describes what a news article is about, so the topic words give us an initial idea of what the article describes. For example, in an earthquake news report, the topic might contain terms such as amplitude, epicenter, and rescue.

In this thesis, we discuss how and why role and topic help to detect the relationships between news articles.

1.5 Challenges

Challenge 1. How to model the news articles

As mentioned in the previous section, our basic idea is to use role and topic information to detect the relationships between news events. We first discuss the challenges concerning the role part.

It is easy for a human being to recognize the roles played by named entities in different news events, but it is not easy for a computer to identify this kind of information. How to automatically identify the concept behind a named entity in a news article is a major challenge for our work.

The topic part also poses challenges. What kind of information should be extracted from news articles as the topic information of news events? And how do we extract this information?

Challenge 2. How to calculate the relationship between news articles

After modeling the news articles, the next problem is how to use the model to calculate the relationships between news articles.

We will develop operations on the model that help to calculate the relationships between news events. What information should we consider? How do we calculate the role similarity and the topic similarity? How do we determine the weight of a named entity’s role in a news event?

1.6 Thesis Organization

We will introduce the related work in Chapter 2. Chapter 3 will describe how we model a news article. Chapter 4 illustrates the operations to calculate the relationship between news articles. Chapter 5 introduces our experiments and discusses each experiment’s motivation and results. Chapter 6 is the conclusion of this thesis.

Chapter 2 Related Work

Two lines of research are closely related to evolution detection. The first is topic detection and tracking (TDT), which detects topics in a stream of news stories and groups the stories on the same topic together. The second is event threading, which captures the evolution of events within a specific topic. We introduce both in this chapter.

2.1 Topic Detection and Tracking (TDT)

News events happen every second, and it is difficult for a human being to organize such a large amount of news information. Topic detection and tracking (TDT) is a line of research that addresses this problem.

TDT defines several tasks for organizing a stream of news articles. In TDT, the term “topic” is defined as a set of news articles that are strongly related to some seminal real-world event; in other words, a topic is a seminal event or activity, along with all directly related events and activities. For example, an earthquake somewhere, together with any discussion of it, such as finding a survivor in a building or an aftershock near the location of the quake, is all part of the same topic.

A TDT system detects topics and tracks all their related documents [7][8][9]. Topic detection finds clusters of stories that discuss the same topic, and topic tracking finds stories that discuss a previously known topic. Some methods in this area are based on semantic word networks [12], and others are based on the vector space model [13].

TDT focuses on the following five tasks [5]:

1. Article Segmentation: Segment a continuous stream of text (including transcribed speech) into its constituent stories.

2. First Article Detection (or New Article Detection): The goal of this task (FTD) is to detect a topic that has not been discussed before, at the moment the first news article about it appears.

3. Cluster Detection: The goal of cluster detection is to group news articles on the same topic into bins. When a first article arrives, a new bin is created to hold the related news articles that may arrive in the future.

4. Tracking: As the news stream is monitored, each arriving news article is matched to the topic it is related to. In other words, a news article that does not start a new topic is assigned to one of the topics already detected by first article detection.

5. Article Link Detection: Given two randomly chosen news articles, decide whether they are related to the same topic.
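Tasks 2 to 4 can be pictured together as a single pass over the news stream. The sketch below is a generic illustration and not the method of any TDT system cited here: the bag-of-words representation, the cosine measure, the `threshold` value, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_to_topics(articles, threshold=0.3):
    """Single-pass clustering: return one topic index per article.

    `threshold` is a hypothetical tuning parameter, not a value from the text.
    """
    bins = []     # one summed term-count vector per topic bin
    labels = []
    for tokens in articles:
        vec = Counter(tokens)
        scores = [cosine(vec, b) for b in bins]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            bins.append(Counter(vec))      # first article of a new topic
            labels.append(len(bins) - 1)
        else:
            bins[best].update(vec)         # tracking: join the existing bin
            labels.append(best)
    return labels


# Made-up, pre-tokenized articles in arrival order.
stream = [
    ["quake", "epicenter", "rescue"],
    ["election", "vote", "candidate"],
    ["quake", "aftershock", "rescue", "survivor"],
]
labels = assign_to_topics(stream)
```

With this toy stream, the first two articles each open a new bin, and the third joins the bin of the first quake article.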

TDT research discusses at length how to organize news articles into topics. However, if we want to know the beginning and subsequent development of a news event, for example among the news events within one topic, we need to construct the evolution of news events [3]. Next, we introduce event threading, which constructs the evolution of events.

2.2 Event Threading

Event threading detects the dependencies among events; that is, it uncovers the relationship between a pair of events. Detecting these dependencies helps to construct the evolution graph.

To our knowledge, several studies of event threading detect the dependency between events. Uramoto and Takeda presented the evolution graph and used document content similarity to detect the relationships between news articles [4]. Chien Chin Chen and Meng Chang Chen used temporal similarity (TS) to associate events, where TS is the product of a temporal weight and the text similarity between events [3]. Nallapati clustered related stories into events using person, location, content, and time information, and then detected the dependency between events using content similarity and the nearest parent node [1]. Mei and Zhai built a unigram language model for each event and used KL divergence to measure the distance between events [2].
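To make the last of these measures concrete, the sketch below computes the KL divergence between smoothed unigram models of three made-up events. It only illustrates the general technique; the add-one smoothing, the shared vocabulary, and the function names are assumptions and are not taken from [2].

```python
import math
from collections import Counter


def unigram_model(tokens, vocabulary):
    """Add-one smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {w: (counts[w] + 1) / total for w in vocabulary}


def kl_divergence(p, q):
    """D(p || q) = sum over w of p(w) * log(p(w) / q(w)); asymmetric."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


# Made-up events, each given as a list of tokens.
event_a = ["quake", "epicenter", "quake", "rescue"]
event_b = ["quake", "rescue", "survivor", "rescue"]
event_c = ["election", "vote", "candidate", "vote"]
vocab = set(event_a) | set(event_b) | set(event_c)

p_a, p_b, p_c = (unigram_model(e, vocab) for e in (event_a, event_b, event_c))
d_ab = kl_divergence(p_a, p_b)   # two quake events
d_ac = kl_divergence(p_a, p_c)   # quake event vs. election event
```

In this toy data the two quake events are closer to each other (`d_ab`) than the quake event is to the election event (`d_ac`). Smoothing is needed so that no word has zero probability in the second model.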

Besides evolution detection on text documents, some works detect the evolution of events in video news reports [6][10][11].

Chapter 3 Modeling the News Article

3.1 Analysis of News Articles

A news article always carries certain information: When did the event happen, or when was it published? Where did it happen? Who appears in it? What happened to the person, place, or organization?

We observe that most news revolves around persons, locations, and organizations, and that every entity name plays a particular role in the news article it belongs to. We believe that distinguishing the different roles that the same name plays in different news articles helps to detect the relationships between news articles. We call the names in news articles “Role words”.

Besides role words, some other words are also important in a news article because they roughly describe what the article is about (like the word “quake”). Beyond information such as where and who, observing these words helps us understand what happened in the article and gives an initial idea of what it describes. We call this kind of word “Topic words”.

3.2 Modeling the News Article

Based on these observations, we assume that role information and topic information help to detect the relationships between news events. In this section, we describe how we map a news article into a model that contains role and topic information to support relationship detection.

First, we define the document that represents a news article.

Definition:

D (Document): A document (vector) that models a news article. It is a 4-tuple:

D = < E, RS, TS, Z >

E (Event): The event that the news article describes. In this thesis, we assume that every news article is an event, so each news document has its own event information.

RS (Role Set): The role set of a document. Each named entity in the document corresponds to a role, and each role contains the behavior information of that named entity. The RS consists of three kinds of roles: person roles, location roles, and organization roles. We describe the construction of the role part in detail in the next section.

TS (Topic Set): The topic set of a document, which contains the topic information carried by the topic words. We describe the construction of the topic part in detail later in this chapter.

Z (Time Stamp): The time information of the document. It may be the date on which the news article was published, or the time interval to which the news article is assigned.
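As a rough picture of this 4-tuple, the sketch below stores D as a small record and orders documents by their time stamps. The field names, the field types, and the sample documents are assumptions made for illustration; the thesis does not prescribe a concrete data layout here.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, List, Optional


@dataclass
class Document:
    """D = <E, RS, TS, Z>. Field types are assumptions made for this sketch."""
    event: str                                           # E: label of the event the article describes
    role_set: List[Any] = field(default_factory=list)    # RS: one role per named entity
    topic_set: List[Any] = field(default_factory=list)   # TS: topic information of the topic words
    time_stamp: Optional[date] = None                    # Z: publication date of the article


# Made-up documents, used only to show the shape of the tuple.
docs = [
    Document("aftershock report", ["Taipei"], ["aftershock"], date(2008, 1, 3)),
    Document("quake report", ["Taipei"], ["quake", "epicenter"], date(2008, 1, 1)),
]

# Z lets us arrange documents in temporal order rather than relevance order.
timeline = sorted(docs, key=lambda d: d.time_stamp)
```

Keeping Z as a separate field makes the temporal ordering independent of how the role set and topic set are later compared.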

3.3 The Role

Section 3.2 stated that a document contains a role set, which consists of roles. Before describing the structure of a role and how to build it, we explain why this thesis uses role information.

3.3.1 The Role

The term “role” means a person’s or a thing’s function in a particular situation. From our observation of news articles, news revolves around people, locations, and organizations, so we can use this information to help calculate the relationships between news articles.

Why does this information help to detect the relationships between news articles? A person may appear in many news articles but play different roles in them. For example, Chen Shui-bian (陳水扁) plays the role of a president attending a speech in one news article but the role of a defendant in another. Distinguishing the different roles that the same person plays in each news article helps us avoid mistakes in relationship detection.

For example, suppose two people swap roles across two news articles, and the keywords extracted for the similarity calculation consist of the person names and the words that describe the roles’ actions. If we compute the similarity from these keywords without distinguishing which role each person plays in each article, the two articles may receive a high similarity score even though they are unrelated, simply because they share the same person names and action keywords.
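The role-swap problem can be reproduced with a toy comparison. The sketch below is only an illustration of the argument: the placeholder names, the action words, and the use of Jaccard similarity are assumptions, and the per-entity score shown here is not the similarity measure proposed in this thesis.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical articles: each maps a person name to the action words near it.
article_1 = {"PersonA": {"sues", "accuses"}, "PersonB": {"denies", "defends"}}
article_2 = {"PersonA": {"denies", "defends"}, "PersonB": {"sues", "accuses"}}


def flat_keywords(article):
    """Pool person names and action words into one undifferentiated set."""
    words = set(article)
    for features in article.values():
        words |= features
    return words


# Keyword-only view: the two articles look identical.
flat_score = jaccard(flat_keywords(article_1), flat_keywords(article_2))

# Role-aware view: compare the words attached to each shared name.
shared = article_1.keys() & article_2.keys()
role_score = sum(jaccard(article_1[n], article_2[n]) for n in shared) / len(shared)
```

Here `flat_score` is 1.0 while `role_score` is 0.0, which is the gap between the two views that the paragraph above describes.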

In this thesis, we propose a method for calculating the relationships between news articles that uses, among other information, the roles that person names, location names, and organization names play in these articles.

As mentioned, each role corresponds to an entity name (a person name, location name, or organization name) in a news article, together with other information that describes the behavior of this named entity. We introduce the structure of a role here.

R (Role): A role (vector) in a document. It is a 2-tuple:

R = < NE, FV >

A role consists of a named entity and a feature vector. They are related as follows: each named entity plays a particular role in a document, and the feature vector describes which role the named entity plays in that document.

NE (Named Entity): The named entity is the most important part of the role; it acts as the role’s ID. A named entity is an object such as a person, location, or organization in a news document.

FV (Feature Vector): The feature vector is a set of features. Each feature is defined as follows:

F_i (the i-th feature of a feature vector) = < fw, ω_f >

fw: feature word of a role
ω_f: weight of a feature word

The feature vector is defined as follows:

FV (Feature Vector) = < fw_1[ω_f1], fw_2[ω_f2], fw_3[ω_f3], ... >

The feature vector of a role describes which role the named entity plays in this document. It consists of features, and each feature consists of two elements: a feature word and the weight of that feature word.

A feature word is selected from the document and contributes meaning to the role played by the named entity. These feature words make the named entity meaningful and give it its own role in the news article.

The weight ω_f of a feature word expresses how important that feature word is within the role vector. The larger the ω_f of a feature word, the more the feature word contributes to the meaning of the role that the named entity plays in the document.

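Below is an example of role vectors that includes all the elements introduced above.

Role

Named Entity | Feature Vector
<王建民> (Chien-Ming Wang) | 紐約洋基 (New York Yankees) [0.2], 投手 (pitcher) [0.3], 勝投王 (wins leader) [0.1], 台灣 (Taiwan) [0.2], 連勝 (winning streak) [0.2]
<郭台銘> (Terry Gou) | 鴻海 (Hon Hai) [0.2], 台灣首富 (richest person in Taiwan) [0.2], 劉嘉玲 (Carina Lau) [0.4], 林志玲 (Lin Chi-ling) [0.2]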
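The two example roles can be written down directly as data. In the sketch below, representing a role as a mapping from the named entity to a dictionary of feature words and weights is an assumption made for illustration, and `top_feature` is a hypothetical helper.

```python
# R = <NE, FV>: each named entity maps to its feature vector,
# stored as feature word -> weight. Values are copied from the example above.
roles = {
    "王建民": {"紐約洋基": 0.2, "投手": 0.3, "勝投王": 0.1, "台灣": 0.2, "連勝": 0.2},
    "郭台銘": {"鴻海": 0.2, "台灣首富": 0.2, "劉嘉玲": 0.4, "林志玲": 0.2},
}


def top_feature(feature_vector):
    """Return the feature word with the largest weight."""
    return max(feature_vector, key=feature_vector.get)


# Sum of the weights in each feature vector.
weight_sums = {name: sum(fv.values()) for name, fv in roles.items()}
```

In these two examples the weights of each role add up to 1. Whether weights are always normalized this way is not stated at this point in the text, so the sketch does not rely on it.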

Next, we describe how we build the role vector, including how we choose the feature words and how we calculate each feature word’s weight ω_f.

3.3.2 Building Roles

In this section, we describe how we determine the roles played by names in news articles.

1. Finding named entities

First, we have to find all the entity names in a news article. We find all the person names, location names, and organization names in the news articles with the Academia Sinica natural language processing word segmentation system (中研院自然語言處理-斷詞系統) and a NER system built on the ideas of Erik Peterson’s NER study.

These tools locate all the entity names in the news articles. We then extract the feature words for each feature vector.

2. Building the feature word set

Feature words are the words, other than the role’s own name, that make up the role vector and give it meaning. These words make the role words meaningful and give each name its own role in the news article.

Because all the news articles are crawled with the same query, most of them share the same important terms. Extracting keywords with tf-idf would therefore miss many important keywords that have a high document frequency.

Instead of tf-idf, we use Wikipedia titles as the source for extracting keywords from news articles. The Wikipedia title list consists of all the titles that appear in Wikipedia, so it contains a large number of meaningful terms.
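A minimal sketch of title-based keyword extraction is shown below. It assumes the article has already been segmented into tokens and that a token counts as a keyword when it matches a title exactly; the tiny `wiki_titles` set is a hypothetical stand-in for the full title list, and the function name is made up.

```python
def extract_keywords(tokens, wiki_titles):
    """Keep tokens found in the title list, in order, without repeats."""
    seen = set()
    keywords = []
    for tok in tokens:
        if tok in wiki_titles and tok not in seen:
            seen.add(tok)
            keywords.append(tok)
    return keywords


# Hypothetical stand-in for the full Wikipedia title list.
wiki_titles = {"earthquake", "epicenter", "rescue"}

# A made-up, already segmented sentence.
tokens = ["the", "earthquake", "epicenter", "was", "near", "the", "coast", "earthquake"]
keywords = extract_keywords(tokens, wiki_titles)
```

Unlike tf-idf, this filter does not penalize a term for appearing in most of the crawled articles, which matches the motivation given above.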

3. Deciding the feature words in the feature vector

We assume that the words near a name in a news article are meaningful for that name. We therefore collect the feature words that appear near each entity name in the article, which gives every name its own meaning in different news articles.
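The proximity assumption can be sketched as follows. The code assumes a tokenized article, a set of candidate keywords, and a `window_size` parameter such that only positions whose distance from the name is less than `window_size` are considered; the function name and the sample sentence are made up for illustration.

```python
def window_features(tokens, entity, keywords, window_size):
    """Collect keywords closer than `window_size` positions to any mention of `entity`."""
    features = []
    for i, tok in enumerate(tokens):
        if tok != entity:
            continue
        start = max(0, i - window_size + 1)
        stop = min(len(tokens), i + window_size)
        for j in range(start, stop):
            word = tokens[j]
            if j != i and word in keywords and word not in features:
                features.append(word)
    return features


# Made-up tokenized sentence; "Wang" is the entity name at index 2.
tokens = ["Yankees", "pitcher", "Wang", "won", "again", "in", "Taiwan"]
keywords = {"Yankees", "pitcher", "won", "Taiwan"}
features = window_features(tokens, "Wang", keywords, window_size=3)
```

With `window_size=3`, the words at distance 1 or 2 from "Wang" are kept, while "Taiwan" at distance 4 falls outside the window.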

Accordingly, we define the term “window”. In this thesis, the “window” of a role is the span of positions before and after an entity name. Each name has its own window, controlled by a parameter named “window size”: the window covers every position before or after the name whose distance from the name is less than the window size. All feature words
