
Machine Discovery

Event Network Analysis on News Document

Graduate Institute of Computer Science and Information Engineering, R95922081, 許守傑


Outline:

1. Problem definition

2. Input/Output

3. Data

4. Methodology

 Preprocessing

 Network construction

 Basic properties

 Substructure – connected components

 Substructure – cliques

 Clustering

5. Conclusion & future work

6. References and resources


※ Because the figures in this report are a little small, clear versions of all figures are available at http://www.csie.ntu.edu.tw/~r95081/MD/

Problem definition:

A news corpus usually contains a huge number of news articles, and among them there should be some interesting, explosive events that most users want to know about. Do these celebrities and interesting events have special characteristics? Can we find a method to discover them automatically? The main purpose of my project is to apply machine learning and social network techniques to this problem.

Input/Output:

Input: a news corpus.
Output: a pruned event network.

The network construction part will be discussed in the Methodology section.

Data:

Yahoo news, 2007/01~2007/11: 185,227 news articles in total and about 100 thousand proper nouns.

In the following experiments, I use the news of one quarter (2007/01~2007/03), 52,605 news documents in total.

Methodology:

In this section, I will explain my methodology for each block in Figure01, which shows the pipeline: news documents → Proper Noun extraction → Build Event Network → Network analysis → Substructure extraction.

Figure 01


 Preprocessing:

In the original corpus, each news document is just a plain text:

[20070331-caqp.txt]百變張韶涵 小巨蛋營造魔幻音樂夢境

百變歌后張韶涵雖然去年在中國已有三場大型演唱會的經驗,但是台北對她而言意義重大。近鄉情怯的壓 力,……

To extract the proper nouns, I need a Chinese segmentation and POS tagger. The Academia Sinica website provides a useful tool for this task; after tagging, the result is:

百變(VH) 張韶涵(Nb) 小(VH) 巨蛋(Nc) 營造(VC) 魔幻(VH) 音樂(Na) 夢境(Na)

百變(VH) 歌(Na) 后(Ng) 張韶涵(Nb) 雖然(Cbb) 去年(Nd) 在(P) 中國(Nc) 已(D) 有(V_2) 三(Neu) 場(Nf) 大型(A) 演唱會(Na) 的(DE) 經驗(Na) ,(COMMACATEGORY) 但是(Cbb) 台北(Nc) 對(P) 她(Nh) 而(Cbb) 言(VE) 意義(Na) 重大(VH) 。(PERIODCATEGORY) 近鄉情怯(VH) 的(DE) 壓力(Na) ,……

Each word is followed by a POS tag, and the segmentation quality is also quite good. The Sinica tagger has a detailed tag-set definition; here I list the noun-related part:

POS tag | Description
Na | common noun (普通名詞)
Nb | proper noun (專有名稱)
Nc | place word (地方詞)
Ncd | localizer (位置詞)
Nd | time word (時間詞)
Neu | numeral determinative (數詞定詞)
Nes | specific determinative (特指定詞)
Nep | anaphoric determinative (指代定詞)
Neqa | quantitative determinative (數量定詞)

The Nb tag usually marks proper nouns such as person names and organization names. I consider the Nb tags useful for event network construction, so I extract them from the corpus.
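The extraction step above can be sketched as follows. This is a minimal sketch assuming the tagger output is a plain string of word(TAG) tokens as shown earlier; the helper name `extract_proper_nouns` is mine, not part of the Sinica toolkit.

```python
import re

def extract_proper_nouns(tagged_text):
    """Collect the words tagged Nb (proper noun) from Sinica-style output."""
    # Each token looks like 張韶涵(Nb) or 音樂(Na): a word followed by (TAG)
    pairs = re.findall(r"(\S+?)\((\w+)\)", tagged_text)
    return [word for word, tag in pairs if tag == "Nb"]

tagged = "百變(VH) 張韶涵(Nb) 小(VH) 巨蛋(Nc) 台北(Nc) 張韶涵(Nb)"
print(extract_proper_nouns(tagged))  # ['張韶涵', '張韶涵']
```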

 Network construction:

After extracting proper nouns from each document, we must define the components of the graph, vertex and edge:

Vertex: a unique proper noun

Edge: the mutual information between different proper nouns. The mutual information of two nodes u, v is defined as:

mi(u, v) = log( c(u, v) / ( c(u) * c(v) ) )

c(u,v) is the co-occurrence count of nodes u and v in the corpus (two nodes coexisting in one document contribute one co-occurrence count); c(u) and c(v) are the total occurrence counts of vertices u and v in the corpus. Given this information, there is still a small problem: some vertices are not very representative. For example, the sentence below is a common situation:

謝(Nb)姓(Na)嫌犯(Na)對(P)警方(Na)坦承(VE),(COMMACATEGORY)…


Obviously, 謝(Nb) is not informative for our purpose; in my opinion, this redundant information disturbs the original structure. Thus, adequate pruning and down-sampling is a necessary step.

In order to down-sample the graph, I observe the sorted vertex degrees and the mutual information of each vertex (Figure02, Figure03):

Figure02 Figure03

According to the degree and mutual information, I apply thresholds to reduce the graph's complexity.

The following (Figure04, Figure05, and Figure06) are three plots of the graph under different thresholds:

Figure04 (degree>=300, mi>=-8) Figure05 (degree>=500, mi>=-8) Figure06 (degree>=800, mi>=-8)

Thus, the following analysis and experiments will work on the graph with [degree >= 800, mi >= -8] (Figure06); the equivalent graph with Chinese vertex labels is Figure07.


Figure07
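The construction and pruning steps can be sketched in code. This is a minimal illustration, not the exact implementation used in the report: documents are lists of extracted proper nouns, occurrence counts are document-level, and the `deg_min`/`mi_min` parameters stand in for the report's thresholds (e.g. degree >= 800, mi >= -8).

```python
import math
from collections import Counter
from itertools import combinations

def build_event_network(docs, deg_min=2, mi_min=-8.0):
    """docs: one list of proper nouns per news document.
    Returns a pruned edge map {(u, v): mi(u, v)}."""
    occ = Counter()   # c(u): number of documents containing u
    cooc = Counter()  # c(u, v): number of documents containing both
    for nouns in docs:
        uniq = sorted(set(nouns))
        occ.update(uniq)
        cooc.update(combinations(uniq, 2))
    # mi(u, v) = log( c(u, v) / (c(u) * c(v)) )
    edges = {(u, v): math.log(c / (occ[u] * occ[v]))
             for (u, v), c in cooc.items()}
    # prune edges below the mutual-information threshold
    edges = {e: mi for e, mi in edges.items() if mi >= mi_min}
    # prune vertices below the degree threshold
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {(u, v): mi for (u, v), mi in edges.items()
            if deg[u] >= deg_min and deg[v] >= deg_min}

docs = [["A", "B"], ["A", "B"], ["A", "C"], ["A", "B", "C"]]
net = build_event_network(docs)
print(sorted(net))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```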

 Basic properties:

1. Zipf’s law of Nb

Zipf's law states that, given a corpus of natural-language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Here we examine Zipf's law on proper nouns (Nb), and Figure08 faithfully exhibits this characteristic: the product of rank and frequency is almost constant.

Figure08
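A quick way to check this property on any token stream is to compare rank times frequency across ranks. A small sketch with idealized Zipfian counts (the corpus-loading step is omitted; `rank_frequency` is my own helper):

```python
from collections import Counter

def rank_frequency(tokens):
    """(rank, count) pairs sorted by descending count."""
    return [(r, c) for r, (_, c) in
            enumerate(Counter(tokens).most_common(), start=1)]

# Ideal Zipfian counts: frequency proportional to 1/rank,
# so rank * count is the same constant (60) at every rank.
tokens = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15
print([r * c for r, c in rank_frequency(tokens)])  # [60, 60, 60, 60]
```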


2. Power law

Here I want to examine the long-tail property of the network; Figure09 and Figure10 show the sorted vertex-degree plots, and the log-scale version is closer to a power-law distribution.

Figure09 Figure10

3. connectedness

Krackhardt's connectedness for a graph G is equal to the fraction of all dyads i, j such that there exists an undirected path from i to j in G. (This, in turn, is just the density of the weak reachability graph of G.) The connectedness score ranges from 0 (for the null graph) to 1 (for weakly connected graphs). Connectedness is one of four measures (connectedness, efficiency, hierarchy, and lubness) suggested by Krackhardt for summarizing hierarchical structures. Each corresponds to one of four axioms which are necessary and sufficient for the structure in question to be an outtree; thus, the measures equal 1 for a given graph iff that graph is an outtree. Deviations from unity can be interpreted in terms of failure to satisfy one or more of the outtree conditions, information which may be useful in classifying its structural properties. Our network's connectedness is 0.959627.

4. efficiency

Let G be a graph with weak components G1, G2, ..., Gn. For convenience, we denote the cardinalities of these components' vertex sets by |V(G)| = N and |V(Gi)| = Ni. Then the Krackhardt efficiency of G is given by

efficiency(G) = 1 - ( |E(G)| - Σi (Ni - 1) ) / ( Σi Ni(Ni - 1)/2 - Σi (Ni - 1) )

which can be interpreted as 1 minus the proportion of possible "extra" edges (above those needed to weakly connect the existing components) actually present in the graph. A graph with an efficiency of 1 has precisely as many edges as are needed to connect its components; as additional edges are added, efficiency gradually falls towards 0. In our case, the efficiency is 0.9502366.

5. graph diameter

The diameter of a graph is the length of its longest geodesic; in my graph the diameter is 11. This number roughly describes the scale of a network; in our case, a diameter of 11 means our graph is a pretty small one.


6. graph density

Density of a graph is the ratio of the number of edges to the number of possible edges:

density = 2|E| / ( |V| (|V| - 1) )

In our graph, density = 0.05038265, which means the network is quite sparse.
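The report computes these statistics with the R sna/igraph libraries; as a self-contained sketch, density, Krackhardt connectedness, and diameter of an undirected graph can be computed as follows (`graph_properties` is my own helper, not the library code):

```python
def graph_properties(vertices, edges):
    """Density, Krackhardt connectedness, and diameter of an
    undirected graph; edges is a list of (u, v) pairs."""
    n = len(vertices)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # density = 2|E| / (|V| (|V| - 1))
    density = 2 * len(edges) / (n * (n - 1))

    def bfs(src):
        # breadth-first shortest-path lengths from src
        dist, frontier = {src: 0}, [src]
        while frontier:
            nxt = []
            for u in frontier:
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        nxt.append(w)
            frontier = nxt
        return dist

    reachable, diameter = 0, 0
    for v in vertices:
        dist = bfs(v)
        reachable += len(dist) - 1  # dyads reachable from v
        diameter = max(diameter, max(dist.values()))
    # connectedness = fraction of ordered dyads joined by a path
    connectedness = reachable / (n * (n - 1))
    return density, connectedness, diameter

# Path A-B-C-D plus an isolated vertex E
print(graph_properties("ABCDE", [("A", "B"), ("B", "C"), ("C", "D")]))
```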

 Substructure – connected component

The connected component is a basic and simple substructure of a graph, so I first extract these sub-graphs from the network. Figure11~Figure18 (cc01~cc08) are the isolated components:

Figure11 – cc01 Figure12 – cc02 Figure13 – cc03 Figure14 – cc04

Figure15 – cc05 Figure16 – cc06 Figure17 – cc07 Figure18 – cc08

Figure19


Figure19 shows the corresponding graph of each connected component. In my observation, even small connected components already carry certain semantics: cc02 and cc03 describe the political relationships within the DPP and the KMT respectively; cc04 represents the important IC companies and the popular entertainment events around a chairman; cc05 represents the car accident of a female super idol. But the largest connected component (cc01) is more complex and non-trivial.

 Substructure – cliques

To decompose cc01, I decided to find the cliques of this sub-graph. In graph theory, the clique problem is NP-complete; the largest size I could find in cc01 on the CSIE Linux server is a 7-clique, and finding 8-cliques leads to an out-of-memory error. Thus, I search for all 6-cliques and 7-cliques in the sub-graph:

cc01 | # of cliques | Average mi
6-clique | 15291 | -6.50768827525996
7-clique | 18334 | -6.50725919466564
Total | 33625 | -6.50745431946466

Figure20 shows the sorted average mutual information of all 6/7-cliques:

Figure20

I randomly select 10 cliques (Figure21~Figure30):

Figure21 Figure22 Figure23 Figure24


Figure25 Figure26 Figure27 Figure28

Figure29 Figure30

Each clique strongly represents a certain semantic topic. In my view, a clique with high average mutual information can serve as an event unit, because cliques are originally formed from news documents. Together with the keywords in those documents, clique search can be a good event detector in the event-network discovery task. Besides, PageRank is also a popular ranking metric: we could calculate the average PageRank score for clique selection.
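This selection idea can be sketched with a brute-force enumerator over the MI-weighted edge map built earlier; the real clique search on cc01 used more capable tooling, and the function names and toy clique size here are mine:

```python
from itertools import combinations

def ranked_cliques(mi, k):
    """Find all k-cliques in the MI-weighted graph and rank them
    by average edge mutual information (highest first).
    mi: {frozenset({u, v}): mutual information}."""
    nodes = sorted({v for e in mi for v in e})
    cliques = [c for c in combinations(nodes, k)
               if all(frozenset(p) in mi for p in combinations(c, 2))]

    def avg_mi(c):
        pairs = list(combinations(c, 2))
        return sum(mi[frozenset(p)] for p in pairs) / len(pairs)

    return sorted(((avg_mi(c), c) for c in cliques), reverse=True)

mi = {frozenset("AB"): -1.0, frozenset("AC"): -2.0,
      frozenset("BC"): -3.0, frozenset("AD"): -9.0}
print(ranked_cliques(mi, 3))  # [(-2.0, ('A', 'B', 'C'))]
```

Brute force over vertex combinations is exponential in k, which matches the report's observation that 8-cliques were already out of reach on the full sub-graph.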

 Clustering

After observing the original network, I manually labeled several prominent clusters in the graph (Figure31):

Figure31


The main purpose of clustering is to decompose the entire graph into reasonable clusters. But before clustering, we need to define the distance measure between nodes. Usually, we use the adjacency matrix as the graph's representation:

      V4        V43       V166      V93        V87       V64        V30
V4    0.000000  0.000000  0.000000  -7.910818  0.000000  -7.773538  -7.682799
V43   0.000000  0.000000  0.000000  -6.382662  0.000000  0.000000   -6.629133

With this matrix we can calculate the Hamming distance between two vertices and obtain a distance matrix of positions based on structural equivalence. The R environment provides classical multidimensional scaling of a data matrix (also known as principal coordinates analysis), and I plot the metric MDS of the vertex positions in two dimensions (Figure32):

Figure32

With this metric, the original data is transformed into a better space and we can apply many clustering methods to aggregate similar vertices; here I use the model-based clustering mentioned in the machine discovery class. Model-based clustering is a sophisticated method: it selects the optimal clustering model according to the Bayesian Information Criterion (BIC), using EM initialized by hierarchical clustering for parameterized Gaussian mixture models (Figure33).

Figure33 – different models of GMM
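The distance step of this pipeline can be sketched as follows; the MDS and model-based clustering themselves were done in R (cmdscale and the mclust-style models), so only the structural-equivalence distance is shown, and the helper name is mine:

```python
def hamming_distances(adj_rows):
    """Pairwise Hamming distances between adjacency-matrix rows:
    distance 0 means two vertices have identical neighborhoods
    (structural equivalence)."""
    names = sorted(adj_rows)
    return {(u, v): sum(a != b for a, b in zip(adj_rows[u], adj_rows[v]))
            for u in names for v in names}

rows = {"V4": [0, 1, 1, 0], "V43": [0, 1, 0, 0], "V166": [0, 1, 1, 0]}
d = hamming_distances(rows)
print(d[("V4", "V166")], d[("V4", "V43")])  # 0 1
```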


I set the maximal number of clusters to 20 and run the model-based clustering. According to the BIC, the best model is VVV with 9 clusters (Figure34, Figure35):

Figure34 Figure35

After the clustering is done, the result is Figure36 (different colors stand for different clusters):

Figure36

From the result, we can easily see that tightly coupled clusters are correctly aggregated, but loosely coupled clusters are not. I think the most probable reason is the distance measure: Hamming distance can precisely describe vertex similarity, but vertex difference measured by it is vague. Furthermore, the structural position of vertices with few connections is not clear. These two issues mislead the model into clustering such vertices together (the green and black clusters).

I extracted the sub-graph of every cluster and made plots; Figure37 is the original sub-graph and Figure38 is the corresponding Chinese version:

Figure37


Figure38

Conclusion & Future work:

The main purpose of the event network is to provide a platform for different applications, such as information retrieval. But there is a serious problem: evaluation. In my experiments, it is hard to define a suitable evaluation metric, and only human evaluation is available. To reach a more complete research result, I need to survey more papers and reorganize the work into a better structure.

In my project, there are still a lot of interesting topics to be explored. The task is a little similar to Google Sets. Not only Nb words can build the network; every noun can be taken into consideration and analyzed. Besides, the original scheme can be modified as in Figure39:

Figure39 shows the modified pipeline: news documents are first decomposed into topics (Topic 1, Topic 2, ..., Topic n), and then Build Event Network, Network analysis, and Substructure extraction are performed per topic.

Figure39


In the NLP area, some state-of-the-art topic detection techniques are very useful, such as Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA). I have tried to train an 8-topic LDA mixture model, and the following are the topic words corresponding to each topic:

Topic | Topic words
000 | 多雲 度 地區 比賽 陣雨 球員 選手 王建民 冠軍 投手 短暫雨 先發 大聯盟 球迷 美國 教練 天氣
001 | 醫院 醫師 民眾 指數 股市 治療 健康 台北 台股 研究 上漲 患者 業者 市場 消費者 醫療 病患 下跌 手術 感染 疾病 台灣 病毒 藥物
002 | 台灣 政府 學生 中國 大學 國家 中央社 記者 國際 美國 教育 計畫 發展 學校 社會 教育部 世界 文化 研究 委員會 經濟 大陸 行政院 日本
003 | 警方 高鐵 民眾 人員 男子 嫌犯 員警 調查 警察 歹徒 報導 現場 意外 記者 女子 上午 集團 交通 學生 逮捕 家屬 台北 捷運
004 | 台灣 新聞 報導 民視 知道 希望 媽媽 開始
005 | 總統 媒體 立委 美國 馬英九 報導 台灣 國民黨 新聞 民進黨 藝人 節目 法院 許瑋倫 律師 胡瓜
006 | 陣風 台灣 活動 局部 日本 東北風 春節 公尺 舉辦 海面 藝術 飯店 文化 業者 觀光 時間 傳統 表演 高雄 作品 音樂 浪高 期間 體驗 現場
007 | 銀行 美元 市場 投資 全球 服務 集團 中國 企業 力霸 美國 網路 中華 金融 台灣 基金 經濟 資訊 電子

The point is that if we decompose the documents into a more specific space, the granularity of the event networks will be finer and include more details.
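As an illustration of the LDA idea (not the exact model used for the table above; alpha, beta, and the iteration count are conventional choices of mine), a tiny collapsed Gibbs sampler can be written in a few lines:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: lists of tokens. Returns per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # token-topic assignments
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove the token, then resample its topic
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + vocab_size * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return nkw

docs = [["ball", "game", "ball", "game"],
        ["bank", "money", "bank", "money"]] * 4
topics = lda_gibbs(docs, n_topics=2)
```

On the real corpus, the per-topic word counts would be sorted to produce topic-word lists like the table above.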

Furthermore, after extracting the sub-graphs, how do we describe a sub-graph with text?

Automatic document summarization and title generation techniques can be applied to our network to supply more information.

Besides model-based clustering, I also tried hierarchical clustering but have not analyzed it yet; maybe hierarchical clustering can fix the problem that occurs in model-based clustering.


Finally, I really think the lectures and the tasks were very interesting; every time I ran the program expecting a plot, the result always surprised me. Some graphs are special works of art rather than simple experiment results! I think if I used all the data in the experiments, the results would reveal even more exciting facts.

Reference & resource:

1. Mohsen Jamali and Hassan Abolhassani, "Different Aspects of Social Network Analysis," Web Intelligence Research Laboratory, Computer Engineering Department, Sharif University of Technology, Tehran, Iran, WI 2006.

2. Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee, "Measurement and Analysis of Online Social Networks," ACM IMC 2007.

3. Masoud Makrehchi and Mohamed S. Kamel, "Learning Social Networks from Web Documents Using Support Vector Machines," Pattern Analysis and Machine Intelligence Lab, Department of Electrical and Computer Engineering, University of Waterloo, WI 2006.

4. R http://www.r-project.org/

5. SNA library http://erzuli.ss.uci.edu/R.stuff/

6. iGraph library http://cneurocvs.rmki.kfki.hu/igraph/

7. Graphviz http://www.graphviz.org/
