
Machine Discovery

Event Network Analysis on News Document

Graduate Institute of Computer Science and Information Engineering, R95922081, 許守傑


Outline:

1. Problem definition

2. Input/Output

3. Data

4. Methodology

 Preprocessing

 Network construction

 Basic properties

 Substructure – connected components

 Substructure – cliques

 Clustering

5. Conclusion & future work

6. References and resources


※ Because the figures in this report are a little small, clear versions of all figures are available at http://www.csie.ntu.edu.tw/~r95081/MD/

Problem definition:

A news corpus usually contains a huge number of news articles, and among them there should be some interesting, explosive events that most users want to know about. Do these celebrities and interesting events have special characteristics? Can we find a method to discover them automatically? The main purpose of my project is to apply machine learning and social network techniques to this problem.

Input/Output:

Input: a news corpus.
Output: a pruned event network.

The network construction part will be discussed in the Methodology section.

Data:

Yahoo news, 2007/01~2007/11: 185,227 news articles in total and about 100 thousand proper nouns.

In the following experiments, I use the news of one quarter (2007/01~2007/03), 52,605 news documents in total.

Methodology:

In this section, I will explain my methodology for each block in Figure01, which shows the pipeline: news documents → Proper Noun extraction → Build Event Network → Network analysis → Substructure extraction.

Figure 01


 Preprocessing:

In the original corpus, each news document is just a plain text:

[20070331-caqp.txt]百變張韶涵 小巨蛋營造魔幻音樂夢境

百變歌后張韶涵雖然去年在中國已有三場大型演唱會的經驗,但是台北對她而言意義重大。近鄉情怯的壓 力,……

To extract the proper nouns, I need a Chinese segmentation and POS tagger. The Academia Sinica website provides a useful tool for this task; after tagging, the result is:

百變(VH) 張韶涵(Nb) 小(VH) 巨蛋(Nc) 營造(VC) 魔幻(VH) 音樂(Na) 夢境(Na)

百變(VH) 歌(Na) 后(Ng) 張韶涵(Nb) 雖然(Cbb) 去年(Nd) 在(P) 中國(Nc) 已(D) 有(V_2) 三(Neu) 場(Nf) 大型(A) 演唱會(Na) 的(DE) 經驗(Na) ,(COMMACATEGORY) 但是(Cbb) 台北(Nc) 對(P) 她(Nh) 而(Cbb) 言(VE) 意義(Na) 重大(VH) 。(PERIODCATEGORY) 近鄉情怯(VH) 的(DE) 壓力(Na) ,……

Each word is followed by a POS tag, and the segmentation quality is also quite good. The Sinica tagger has a detailed tag-set definition; here I list the noun-related part:

POS tag | Description
Na | common noun (普通名詞)
Nb | proper noun (專有名稱)
Nc | place word (地方詞)
Ncd | localizer (位置詞)
Nd | time word (時間詞)
Neu | numeral determinative (數詞定詞)
Nes | specific determinative (特指定詞)
Nep | anaphoric determinative (指代定詞)
Neqa | quantitative determinative (數量定詞)

The Nb tag usually marks proper nouns such as person names and organization names. I consider the Nb tags useful for event network construction, so I extract them from the corpus.
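The extraction step above can be sketched as follows. This is a minimal sketch assuming the tagger output is a plain string of word(TAG) tokens as shown earlier; the helper name `extract_proper_nouns` is mine, not part of the Sinica toolkit.

```python
import re

def extract_proper_nouns(tagged_text):
    """Collect the words tagged Nb (proper noun) from Sinica-style output."""
    # Each token looks like 張韶涵(Nb) or 音樂(Na): a word followed by (TAG)
    pairs = re.findall(r"(\S+?)\((\w+)\)", tagged_text)
    return [word for word, tag in pairs if tag == "Nb"]

tagged = "百變(VH) 張韶涵(Nb) 小(VH) 巨蛋(Nc) 台北(Nc) 張韶涵(Nb)"
print(extract_proper_nouns(tagged))  # ['張韶涵', '張韶涵']
```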

 Network construction:

After extracting proper nouns from each document, we must define the components of the graph, vertex and edge:

Vertex: a unique proper noun

Edge: the mutual information between different proper nouns. The mutual information of two nodes u, v is defined as:

mi(u, v) = log( c(u, v) / ( c(u) * c(v) ) )

c(u,v) is the co-occurrence count of nodes u and v in the corpus (two nodes coexisting in one document contribute one co-occurrence count); c(u) and c(v) are the total occurrence counts of vertices u and v in the corpus. Given this information, there is still a small problem: some vertices are not very representative. For example, the sentence below is a common situation:

謝(Nb)姓(Na)嫌犯(Na)對(P)警方(Na)坦承(VE),(COMMACATEGORY)…


Obviously, 謝(Nb) is not informative for our purpose; in my opinion, this redundant information disturbs the original structure. Thus, adequate pruning and down-sampling is a necessary step.

In order to down-sample the graph, I observe the sorted vertex degrees and the mutual information of each vertex (Figure02, Figure03):

Figure02 Figure03

According to the degree and mutual information, I apply thresholds to reduce the graph's complexity.

The following (Figure04, Figure05, and Figure06) are three plots of the graph under different thresholds:

Figure04 (degree>=300, mi>=-8) Figure05 (degree>=500, mi>=-8) Figure06 (degree>=800, mi>=-8)

Thus, the following analysis and experiments will work on the graph with [degree >= 800, mi >= -8] (Figure06); the equivalent graph with Chinese vertex labels is Figure07.


Figure07
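The construction and pruning steps can be sketched in code. This is a minimal illustration, not the exact implementation used in the report: documents are lists of extracted proper nouns, occurrence counts are document-level, and the `deg_min`/`mi_min` parameters stand in for the report's thresholds (e.g. degree >= 800, mi >= -8).

```python
import math
from collections import Counter
from itertools import combinations

def build_event_network(docs, deg_min=2, mi_min=-8.0):
    """docs: one list of proper nouns per news document.
    Returns a pruned edge map {(u, v): mi(u, v)}."""
    occ = Counter()   # c(u): number of documents containing u
    cooc = Counter()  # c(u, v): number of documents containing both
    for nouns in docs:
        uniq = sorted(set(nouns))
        occ.update(uniq)
        cooc.update(combinations(uniq, 2))
    # mi(u, v) = log( c(u, v) / (c(u) * c(v)) )
    edges = {(u, v): math.log(c / (occ[u] * occ[v]))
             for (u, v), c in cooc.items()}
    # prune edges below the mutual-information threshold
    edges = {e: mi for e, mi in edges.items() if mi >= mi_min}
    # prune vertices below the degree threshold
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {(u, v): mi for (u, v), mi in edges.items()
            if deg[u] >= deg_min and deg[v] >= deg_min}

docs = [["A", "B"], ["A", "B"], ["A", "C"], ["A", "B", "C"]]
net = build_event_network(docs)
print(sorted(net))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```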

 Basic properties:

1. Zipf’s law of Nb

Zipf's law states that, given a corpus of natural-language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Here we examine Zipf's law on proper nouns (Nb), and Figure08 faithfully exhibits this characteristic: the product of rank and frequency is almost constant.

Figure08
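A quick way to check this property on any token stream is to compare rank times frequency across ranks. A small sketch with idealized Zipfian counts (the corpus-loading step is omitted; `rank_frequency` is my own helper):

```python
from collections import Counter

def rank_frequency(tokens):
    """(rank, count) pairs sorted by descending count."""
    return [(r, c) for r, (_, c) in
            enumerate(Counter(tokens).most_common(), start=1)]

# Ideal Zipfian counts: frequency proportional to 1/rank,
# so rank * count is the same constant (60) at every rank.
tokens = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15
print([r * c for r, c in rank_frequency(tokens)])  # [60, 60, 60, 60]
```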


2. Power law

Here I want to examine the long-tail property of the network; Figure09 and Figure10 show the sorted vertex-degree plots, and the log-scale version is closer to a power-law distribution.

Figure09 Figure10

3. connectedness

Krackhardt's connectedness for a graph G is equal to the fraction of all dyads i, j such that there exists an undirected path from i to j in G. (This, in turn, is just the density of the weak reachability graph of G.) The connectedness score ranges from 0 (for the null graph) to 1 (for weakly connected graphs). Connectedness is one of four measures (connectedness, efficiency, hierarchy, and lubness) suggested by Krackhardt for summarizing hierarchical structures. Each corresponds to one of four axioms which are necessary and sufficient for the structure in question to be an outtree; thus, the measures equal 1 for a given graph iff that graph is an outtree. Deviations from unity can be interpreted in terms of failure to satisfy one or more of the outtree conditions, information which may be useful in classifying its structural properties. Our network's connectedness is 0.959627.

4. efficiency

Let G be a graph with weak components G1, G2, ..., Gn. For convenience, we denote the cardinalities of these components' vertex sets by |V(G)| = N and |V(Gi)| = Ni. Then the Krackhardt efficiency of G is given by

efficiency(G) = 1 - ( |E(G)| - Σi (Ni - 1) ) / ( Σi Ni(Ni - 1)/2 - Σi (Ni - 1) )

which can be interpreted as 1 minus the proportion of possible "extra" edges (above those needed to weakly connect the existing components) actually present in the graph. A graph with an efficiency of 1 has precisely as many edges as are needed to connect its components; as additional edges are added, efficiency gradually falls towards 0. In our case, the efficiency is 0.9502366.

5. graph diameter

The diameter of a graph is the length of its longest geodesic; in my graph the diameter is 11. This number roughly describes the scale of a network; in our case, a diameter of 11 means our graph is a pretty small one.


6. graph density

Density of a graph is the ratio of the number of edges to the number of possible edges:

density = 2|E| / ( |V| (|V| - 1) )

In our graph, density = 0.05038265, which means the network is quite sparse.
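The report computes these statistics with the R sna/igraph libraries; as a self-contained sketch, density, Krackhardt connectedness, and diameter of an undirected graph can be computed as follows (`graph_properties` is my own helper, not the library code):

```python
def graph_properties(vertices, edges):
    """Density, Krackhardt connectedness, and diameter of an
    undirected graph; edges is a list of (u, v) pairs."""
    n = len(vertices)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # density = 2|E| / (|V| (|V| - 1))
    density = 2 * len(edges) / (n * (n - 1))

    def bfs(src):
        # breadth-first shortest-path lengths from src
        dist, frontier = {src: 0}, [src]
        while frontier:
            nxt = []
            for u in frontier:
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        nxt.append(w)
            frontier = nxt
        return dist

    reachable, diameter = 0, 0
    for v in vertices:
        dist = bfs(v)
        reachable += len(dist) - 1  # dyads reachable from v
        diameter = max(diameter, max(dist.values()))
    # connectedness = fraction of ordered dyads joined by a path
    connectedness = reachable / (n * (n - 1))
    return density, connectedness, diameter

# Path A-B-C-D plus an isolated vertex E
print(graph_properties("ABCDE", [("A", "B"), ("B", "C"), ("C", "D")]))
```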

 Substructure – connected component

The connected component is a basic and simple substructure of a graph, so I first extract these sub-graphs from the network. Figure11~Figure18 (cc01~cc08) are the isolated components:

Figure11 – cc01 Figure12 – cc02 Figure13 – cc03 Figure14 – cc04

Figure15 – cc05 Figure16 – cc06 Figure17 – cc07 Figure18 – cc08

Figure19


Figure19 shows the corresponding graph of each connected component. In my observation, even small connected components already carry certain semantics: cc02 and cc03 describe the political relationships within the DPP and the KMT respectively; cc04 represents the important IC companies and the popular entertainment events around a chairman; cc05 represents the car accident of a female super idol. But the largest connected component (cc01) is more complex and non-trivial.

 Substructure – cliques

To decompose cc01, I decided to find the cliques of this sub-graph. In graph theory, the clique problem is NP-complete; the largest size I could find in cc01 on the CSIE Linux server is a 7-clique, and finding 8-cliques leads to an out-of-memory error. Thus, I search for all 6-cliques and 7-cliques in the sub-graph:

cc01 | # of cliques | Average mi
6-clique | 15291 | -6.50768827525996
7-clique | 18334 | -6.50725919466564
Total | 33625 | -6.50745431946466

Figure20 shows the sorted average mutual information of all 6/7-cliques:

Figure20

I randomly select 10 cliques (Figure21~Figure30):

Figure21 Figure22 Figure23 Figure24


Figure25 Figure26 Figure27 Figure28

Figure29 Figure30

Each clique strongly represents a certain semantic topic. In my view, a clique with high average mutual information can serve as an event unit, because cliques are originally formed from news documents. Together with the keywords in those documents, clique search can be a good event detector in the event-network discovery task. Besides, PageRank is also a popular ranking metric: we could calculate the average PageRank score for clique selection.
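This selection idea can be sketched with a brute-force enumerator over the MI-weighted edge map built earlier; the real clique search on cc01 used more capable tooling, and the function names and toy clique size here are mine:

```python
from itertools import combinations

def ranked_cliques(mi, k):
    """Find all k-cliques in the MI-weighted graph and rank them
    by average edge mutual information (highest first).
    mi: {frozenset({u, v}): mutual information}."""
    nodes = sorted({v for e in mi for v in e})
    cliques = [c for c in combinations(nodes, k)
               if all(frozenset(p) in mi for p in combinations(c, 2))]

    def avg_mi(c):
        pairs = list(combinations(c, 2))
        return sum(mi[frozenset(p)] for p in pairs) / len(pairs)

    return sorted(((avg_mi(c), c) for c in cliques), reverse=True)

mi = {frozenset("AB"): -1.0, frozenset("AC"): -2.0,
      frozenset("BC"): -3.0, frozenset("AD"): -9.0}
print(ranked_cliques(mi, 3))  # [(-2.0, ('A', 'B', 'C'))]
```

Brute force over vertex combinations is exponential in k, which matches the report's observation that 8-cliques were already out of reach on the full sub-graph.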

 Clustering

After observing the original network, I manually labeled several prominent clusters in the graph (Figure31):

Figure31


The main purpose of clustering is to decompose the entire graph into reasonable clusters. But before clustering, we need to define the distance measure between nodes. Usually, we use the adjacency matrix as the graph's representation:

      V4        V43       V166      V93        V87       V64        V30
V4    0.000000  0.000000  0.000000  -7.910818  0.000000  -7.773538  -7.682799
V43   0.000000  0.000000  0.000000  -6.382662  0.000000  0.000000   -6.629133

With this matrix we can calculate the Hamming distance between two vertices and obtain a distance matrix of positions based on structural equivalence. The R environment provides classical multidimensional scaling of a data matrix (also known as principal coordinates analysis), and I plot the metric MDS of the vertex positions in two dimensions (Figure32):

Figure32

With this metric, the original data is transformed into a better space and we can apply many clustering methods to aggregate similar vertices; here I use the model-based clustering mentioned in the machine discovery class. Model-based clustering is a sophisticated method: it selects the optimal clustering model according to the Bayesian Information Criterion (BIC), using EM initialized by hierarchical clustering for parameterized Gaussian mixture models (Figure33).

Figure33 – different models of GMM
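The distance step of this pipeline can be sketched as follows; the MDS and model-based clustering themselves were done in R (cmdscale and the mclust-style models), so only the structural-equivalence distance is shown, and the helper name is mine:

```python
def hamming_distances(adj_rows):
    """Pairwise Hamming distances between adjacency-matrix rows:
    distance 0 means two vertices have identical neighborhoods
    (structural equivalence)."""
    names = sorted(adj_rows)
    return {(u, v): sum(a != b for a, b in zip(adj_rows[u], adj_rows[v]))
            for u in names for v in names}

rows = {"V4": [0, 1, 1, 0], "V43": [0, 1, 0, 0], "V166": [0, 1, 1, 0]}
d = hamming_distances(rows)
print(d[("V4", "V166")], d[("V4", "V43")])  # 0 1
```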


I set the maximal number of clusters to 20 and run the model-based clustering. According to the BIC, the best model is VVV with 9 clusters (Figure34, Figure35):

Figure34 Figure35

After the clustering is done, the result is Figure36 (different colors stand for different clusters):

Figure36

From the result, we can easily see that tightly coupled clusters are correctly aggregated, but loosely coupled clusters are not. I think the most probable reason is the distance measure: Hamming distance can precisely describe vertex similarity, but vertex difference measured by it is vague. Furthermore, the structural position of vertices with few connections is not clear. These two issues mislead the model into clustering such vertices together (the green and black clusters).

I extracted the sub-graph of every cluster and made plots; Figure37 is the original sub-graph and Figure38 is the corresponding Chinese version:

Figure37


Figure38

Conclusion & Future work:

The main purpose of the event network is to provide a platform for different applications, such as information retrieval. But there is a serious problem: evaluation. In my experiments, it is hard to define a suitable evaluation metric, and only human evaluation is available. To reach a more complete research result, I need to survey more papers and reorganize the work into a better structure.

In my project, there are still a lot of interesting topics to be explored. The task is a little similar to Google Sets. Not only Nb words can build the network; every noun can be taken into consideration and analyzed. Besides, the original scheme can be modified as in Figure39:

Figure39 shows the modified pipeline: news documents are first decomposed into topics (Topic 1, Topic 2, ..., Topic n), and then Build Event Network, Network analysis, and Substructure extraction are performed per topic.

Figure39


In the NLP area, some state-of-the-art topic detection techniques are very useful, such as Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA). I have tried to train an 8-topic LDA mixture model, and the following are the topic words corresponding to each topic:

Topic | Topic words
000 | 多雲 度 地區 比賽 陣雨 球員 選手 王建民 冠軍 投手 短暫雨 先發 大聯盟 球迷 美國 教練 天氣
001 | 醫院 醫師 民眾 指數 股市 治療 健康 台北 台股 研究 上漲 患者 業者 市場 消費者 醫療 病患 下跌 手術 感染 疾病 台灣 病毒 藥物
002 | 台灣 政府 學生 中國 大學 國家 中央社 記者 國際 美國 教育 計畫 發展 學校 社會 教育部 世界 文化 研究 委員會 經濟 大陸 行政院 日本
003 | 警方 高鐵 民眾 人員 男子 嫌犯 員警 調查 警察 歹徒 報導 現場 意外 記者 女子 上午 集團 交通 學生 逮捕 家屬 台北 捷運
004 | 台灣 新聞 報導 民視 知道 希望 媽媽 開始
005 | 總統 媒體 立委 美國 馬英九 報導 台灣 國民黨 新聞 民進黨 藝人 節目 法院 許瑋倫 律師 胡瓜
006 | 陣風 台灣 活動 局部 日本 東北風 春節 公尺 舉辦 海面 藝術 飯店 文化 業者 觀光 時間 傳統 表演 高雄 作品 音樂 浪高 期間 體驗 現場
007 | 銀行 美元 市場 投資 全球 服務 集團 中國 企業 力霸 美國 網路 中華 金融 台灣 基金 經濟 資訊 電子

The point is that if we decompose the documents into a more specific space, the granularity of the event networks will be finer and include more details.
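As an illustration of the LDA idea (not the exact model used for the table above; alpha, beta, and the iteration count are conventional choices of mine), a tiny collapsed Gibbs sampler can be written in a few lines:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: lists of tokens. Returns per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # token-topic assignments
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove the token, then resample its topic
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + vocab_size * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return nkw

docs = [["ball", "game", "ball", "game"],
        ["bank", "money", "bank", "money"]] * 4
topics = lda_gibbs(docs, n_topics=2)
```

On the real corpus, the per-topic word counts would be sorted to produce topic-word lists like the table above.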

Furthermore, after extracting the sub-graphs, how do we describe a sub-graph with text?

Automatic document summarization and title generation techniques can be applied to our network to supply more information.

Besides model-based clustering, I also tried hierarchical clustering but have not analyzed it yet; maybe hierarchical clustering can fix the problem that occurs in model-based clustering.


Finally, I really think the lectures and the tasks were very interesting; every time I ran the program expecting a plot, the result always surprised me. Some graphs are special works of art rather than simple experiment results! I think if I used all the data in the experiments, the results would reveal even more exciting facts.

Reference & resource:

1. Mohsen Jamali and Hassan Abolhassani, "Different Aspects of Social Network Analysis," Web Intelligence Research Laboratory, Computer Engineering Department, Sharif University of Technology, Tehran, Iran, WI 2006.

2. Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee, "Measurement and Analysis of Online Social Networks," ACM IMC 2007.

3. Masoud Makrehchi and Mohamed S. Kamel, "Learning Social Networks from Web Documents Using Support Vector Machines," Pattern Analysis and Machine Intelligence Lab, Department of Electrical and Computer Engineering, University of Waterloo, WI 2006.

4. R http://www.r-project.org/

5. SNA library http://erzuli.ss.uci.edu/R.stuff/

6. iGraph library http://cneurocvs.rmki.kfki.hu/igraph/

7. Graphviz http://www.graphviz.org/
