N-Gram Model with a Background Distribution

Zehua Yan and Fang Li

Department of Computer Science and Engineering, Shanghai Jiao Tong University

{yanzehua,fli}@sjtu.edu.cn http://lt-lab.sjtu.edu.cn

Abstract. Automatic thread extraction for news events can help people know different aspects of a news event. In this paper, we present an extraction method that uses a topical N-gram model with a background distribution (TNB). Unlike most topic models, such as Latent Dirichlet Allocation (LDA), which rely on the bag-of-words assumption, our model treats words in their textual order. Each news report is represented as a combination of a background distribution over the corpus and a mixture distribution over hidden news threads. Thus our model can treat “presidential election” of different years as a background phrase and “Obama wins” as a thread for the event “2008 USA presidential election”. We apply our method to two different corpora. Evaluation based on human judgment shows that the model can generate meaningful and interpretable threads from a news corpus.

Keywords: news thread, LDA, N-gram, background distribution.

1 Introduction

News events happen every day in the real world, and news reports describe different aspects of these events. For example, when an earthquake occurs, news reports will cover the damage caused, the actions taken by the government, the aid from the international community, and other things related to the earthquake.

News threads represent these different aspects of an event.

Topic models such as Latent Dirichlet Allocation (LDA) [1] can extract latent topics from a large corpus based on the bag-of-words assumption. News reports, however, are sets of semantic units represented by words or phrases, and N-gram phrases are meaningful representations of these semantic units. For example, “Bush Government” and “Security Council” in Table 1 are two news threads for the “Iran nuclear program” event. They capture two aspects of the meaning of the event reports. Our task is to automatically extract news threads from news reports.

Reports of a news event or a topic discuss the same event or the same topic and share some common words. Based on the analysis of LDA results, we find that such common words represent the background of the event. We then assume each news report is represented by a combination of (a) a background distribution over the corpus and (b) a mixture distribution over hidden news threads.

In this paper, we use a topical n-gram model with a background distribution (TNB) to extract news threads from a news event corpus. It is an extension of the LDA model with word order and a background distribution. In the following sections, we introduce our model, then describe the experiments and give the results.

Table 1. Threads and news titles for the news event “Iran nuclear program”

Event corpus: Iran Nuclear Program

Thread: the Security Council
  - Options for the Security Council
  - Iran ends cooperation with IAEA
  - Iran likely to face Security Council

Thread: the Bush government
  - Rice: Iran can have nuclear energy, not arms
  - Bush plans strike on Iran’s nuclear sites
  - Iran Details Nuclear Ambitions

2 Related Work

In [2]’s work, news event threading is defined as the process of recognizing events and their dependencies. They proposed an event model to capture the rich structure of events and their dependencies in a news topic. Features such as temporal locality of stories and time-ordering are used to capture events.

[3] proposed a probabilistic model that accounts for both general and specific aspects of documents. The model extends LDA by introducing a specific aspect distribution and a background distribution. In their model, each document is represented as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to the document. The model has been applied to information retrieval and was shown to match documents both at a general level and at a specific word level. Similarly, [4] proposed an entity-aspect model with a background distribution; the model can automatically generate summary templates from given collections of summary articles.

Word order and phrases are often critical to capture the latent meaning of text. Much work has been done on probabilistic generation models that incorporate word-order influence. [5] developed a bigram topic model on the basis of a hierarchical Dirichlet language model [6], by incorporating the concept of topic into bigrams.

In this model, word choice is always affected by the previous word.

[7] proposed an LDA collocation model (LDACOL). Words can be generated from the original topic distribution or from a distribution conditioned on the previous word. A new bigram status variable is used to indicate whether to generate a bigram or a unigram. This is more realistic than the bigram topic model, which always generates bigrams. However, in the LDA collocation model, bigrams do not have topics, because the second term of a bigram is generated from a distribution conditioned on its previous word only.


Further, [8] extended LDACOL by changing the distribution of previous words into a compound distribution of previous word and topic. In this model, a word has the option to inherit a topic assignment from its previous word if they form a bigram phrase. Whether to form a bigram for two consecutive word tokens depends on their co-occurrence frequency and nearby context.

3 Our Methods

3.1 Motivation

We analyze different news reports, and find that there are three kinds of words in a news report: background words (B), thread words (T) and stop words (S).

Background words describe the background of the event. They are shared by reports in the same corpus. Thread words illustrate different aspects of an event.

Stop words are meaningless and appear frequently across different corpora.

For example, Table 2 shows two sentences from a news report on the “US presidential election”. The first sentence talks about “immigration policy” and the second discusses “healthcare”. Stop words are labeled with “S”, such as “as”

and “the”. Background words are “presidential” and “election” which appear in both sentences and are labeled with “B”. Other words are thread words that are specifically associated with different aspects of the event, such as “immigration”

and “healthcare”.

Table 2. Two sentences from “US presidential election”

As/S we/S approach the/S 2008 Presidential/B election/B,/S both/S John/B McCain/B and/S Barack/B Obama/B are/S sharpening/T their/S perspectives/B on/S immigration/T policy/B./S

After/S the/S economy/T ,/S US/B healthcare/T is/S the/S biggest/T domestic/T issue/T influencing/B voters/B in/S the/S US/B presidential/B election/B ./S

Also, we note that adjacent words can form a meaningful phrase and provide a clearer meaning, for example, “presidential election” and “domestic issue”.

Based on the analysis, there are four possible combinations as follows:

1. B+B: Presidential/B election/B
2. B+T: US/B healthcare/T
3. T+B: immigration/T policy/B
4. T+T: domestic/T issue/T

There is no doubt that “B+B” is a background phrase and “T+T” is a thread phrase. Both “B+T” and “T+B” are also regarded as thread phrases, because each contains a thread word. For example, “immigration” is a thread word and “policy” is a background word; the phrase “immigration policy” identifies a type of “policy” and should therefore be viewed as a thread phrase, as the sketch below illustrates.
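As a minimal sketch (not part of the original paper), the rule above can be written as a small function, assuming per-word B/T/S labels are already available; the function name and label strings are illustrative.

```python
# Minimal sketch (not from the paper): classify a candidate bigram from the
# labels of its two words. "B" = background, "T" = thread, "S" = stop word.

def phrase_type(label_prev: str, label_curr: str) -> str:
    if "S" in (label_prev, label_curr):
        return "not a phrase"        # stop words are removed before modeling
    if label_prev == "B" and label_curr == "B":
        return "background phrase"   # B+B, e.g. "presidential election"
    return "thread phrase"           # B+T, T+B, T+T all contain a thread word

print(phrase_type("B", "B"))  # background phrase
print(phrase_type("T", "B"))  # thread phrase, e.g. "immigration policy"
```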


3.2 Topical N-Gram Model with Background Distribution

We now propose our topical n-gram model with a background distribution (TNB) for news reports. The notation used in this paper is listed in Table 3. Stop words are identified and removed using a stop word list.

In our model, each news report is represented as a combination of two kinds of multinomial word distributions:

(a) There is a background word distribution Ω with Dirichlet prior parameter β1, which generates common words across different threads. (b) There are T thread word distributions φt (1 ≤ t ≤ T) with Dirichlet prior parameter β0. A hidden status variable xi is used to indicate whether a word is generated from the background word distribution or from a thread word distribution.

A hidden bigram variable yi is introduced to indicate whether word wi can form a phrase with its previous word wi−1 or not. Unlike [8], we assume phrase generation is affected only by the previous word.

Fig. 1. Graphical models for (a) LDA and (b) TNB

Figure 1 shows the graphical models of LDA and TNB. For each word wi, LDA first draws a topic zi from the document-topic distribution p(z | θd) and then draws the word from the topic-word distribution p(wi | zi). TNB has a similar general structure to the LDA model, but with additional machinery to identify word wi’s category (background or thread word) and whether it can form a phrase with the previous word wi−1.

For each word wi, we first sample the variable yi. If yi = 0, wi is not influenced by wi−1. If yi = 1, wi−1 and wi can form a phrase. As analyzed before, phrases have four possible combinations. There are two situations when yi = 1:

1. if wi−1 ∈ zt, wi is drawn either from the thread zt or from the background distribution;

2. if wi−1 is a background word, wi is drawn from any thread or from the background distribution.


Table 3. Notation used in this paper

α — Dirichlet prior of θ
β0 — Dirichlet prior of φ
β1 — Dirichlet prior of Ω
γ1 — Dirichlet prior of λ
γ2 — Dirichlet prior of σ
T — number of threads
D — number of documents
W — number of unique words
wi(d) — the ith word in document d
zi(d) — the thread associated with the ith word in document d
yi(d) — the bigram status between the (i−1)th word and the ith word in document d
xi(d) — the status indicating whether the ith word is a background word or a topic word
θ(d) — the multinomial distribution of topics w.r.t. document d
φz — the multinomial distribution of words w.r.t. topic z
Ω — the multinomial distribution of words w.r.t. the background
ψi — the Bernoulli distribution of status variable yi(d)
λi — the Bernoulli distribution of status variable xi(d)

Second, we sample the variable xi. If xi = 1, wi is a background word and is generated from Multi(Ω); otherwise it is generated in the same way as in LDA.
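To make the generative story concrete, the following sketch gives one possible reading of it in code. The array-based parameters (theta_d, phi, omega) and the fixed switch probabilities standing in for the Bernoulli variables yi and xi are our own simplifying assumptions, not the authors' implementation.

```python
# Illustrative sketch of the TNB generative story (our reading, not the authors'
# code). theta_d: thread mixture of the document; phi[t]: word distribution of
# thread t; omega: background word distribution; p_bigram, p_background stand in
# for the Bernoulli switches y_i and x_i.

import numpy as np

rng = np.random.default_rng(0)

def generate_document(n_words, theta_d, phi, omega, vocab, p_bigram, p_background):
    words, prev_thread = [], None
    for i in range(n_words):
        y = i > 0 and rng.random() < p_bigram   # does w_i form a phrase with w_{i-1}?
        x = rng.random() < p_background         # background word or thread word?
        if x:
            dist = omega                        # draw from Multi(Omega)
            prev_thread = None                  # a background word carries no thread
        else:
            if y and prev_thread is not None:
                t = prev_thread                 # situation 1: inherit the previous thread
            else:
                t = rng.choice(len(theta_d), p=theta_d)  # situation 2 or y = 0: sample a thread
            dist = phi[t]
            prev_thread = t
        words.append(vocab[rng.choice(len(vocab), p=dist)])
    return words
```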

3.3 Inference

For this model, exact inference over the hidden variables is intractable due to the large number of variables and parameters. Several approximate inference techniques can be used to solve this problem, such as variational methods [9], Gibbs sampling [10] and expectation propagation [11]. Since [12] showed that phrase assignments can be sampled efficiently by Gibbs sampling, Gibbs sampling is adopted for approximate inference in our work.

The conditional probability of wi given a document dj can be written as:

p(w_i \mid d_j) = \Big( p(x_i = 0 \mid d_j) \sum_{t=1}^{T} p(w_i \mid z_i = t, d_j) + p(x_i = 1 \mid d_j)\, p(w_i) \Big) \times p(w_i \mid y_i, w_{i-1})    (1)

where p(w_i \mid z_i = t, d_j) is the thread word distribution, p(w_i) is the background word distribution, and p(w_i \mid y_i, w_{i-1}) describes the influence of w_{i-1} on w_i.

In Figure 1(b), if yi = 0, wi is not influenced by wi−1 and is generated from the background distribution or a thread distribution. The Gibbs sampling equations are derived as follows:

p(x_i = 0, y_i = 0, z_i = t \mid \mathbf{w}, \mathbf{x}_{-i}, \mathbf{z}_{-i}, \alpha, \beta_0, \gamma_1, \gamma_2) \propto
\frac{N_{d0,-i} + \gamma_1}{N_{d,-i} + 2\gamma_1} \times \frac{C^{TD}_{td,-i} + \alpha}{\sum_{t} C^{TD}_{td,-i} + T\alpha} \times \frac{C^{WT}_{wt,-i} + \beta_0}{\sum_{w} C^{WT}_{wt,-i} + W\beta_0} \times \frac{N^{0}_{w_{i-1}} + \gamma_2}{N_{w_{i-1}} + 2\gamma_2}    (2)

p(x_i = 1, y_i = 0 \mid \mathbf{w}, \mathbf{x}_{-i}, \mathbf{z}_{-i}, \beta_1, \gamma_1, \gamma_2) \propto
\frac{N_{d1,-i} + \gamma_1}{N_{d,-i} + 2\gamma_1} \times \frac{C^{W}_{w,-i} + \beta_1}{\sum_{w} C^{W}_{w,-i} + W\beta_1} \times \frac{N^{0}_{w_{i-1}} + \gamma_2}{N_{w_{i-1}} + 2\gamma_2}    (3)


If yi = 1, wi can form a phrase with wi−1.

p(x_i = 0, y_i = 1, z_i = t \mid w_{i-1}, z_{i-1} = t, \alpha, \beta_0, \gamma_1, \gamma_2) \propto
\frac{N_{d0,-i} + \gamma_1}{N_{d,-i} + 2\gamma_1} \times \frac{C^{WT}_{wt,-i} + \beta_0}{\sum_{w} C^{WT}_{wt,-i} + W\beta_0} \times \frac{N^{1}_{w_{i-1}} + \gamma_2}{N_{w_{i-1}} + 2\gamma_2}    (4)

p(x_i = 1, y_i = 1 \mid w_{i-1}, z_{i-1} = t, \alpha, \beta_1, \gamma_1, \gamma_2) \propto
\frac{N_{d1,-i} + \gamma_1}{N_{d,-i} + 2\gamma_1} \times \frac{C^{W}_{w,-i} + \beta_1}{\sum_{w} C^{W}_{w,-i} + W\beta_1} \times \frac{N^{1}_{w_{i-1}} + \gamma_2}{N_{w_{i-1}} + 2\gamma_2}    (5)

where the subscript −i denotes the count with word i removed. N_d is the number of words in document d; N_{d0} is the number of thread words in document d, and N_{d1} is the number of background words in document d. N_{w_{i-1}} is the number of occurrences of word w_{i-1}; N^{0}_{w_{i-1}} and N^{1}_{w_{i-1}} are the numbers of times w_{i-1} has been drawn as a unigram or as part of a phrase, respectively. C^{WT}_{wt} and C^{W}_{w} are the numbers of times a word is assigned to thread t or to the background distribution, respectively.
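As an illustration of how Eqs. (2)–(5) could be evaluated during sampling, the sketch below computes the four unnormalized weights from count arrays. The array names and their layout are our own assumptions that mirror the notation above; this is not the authors' implementation.

```python
# Sketch of the unnormalized Gibbs weights in Eqs. (2)-(5) (our reading; the
# count arrays and their layout are assumptions, not the authors' code).
#   N_d0, N_d1         thread / background word counts in document d (word i removed)
#   C_TD[t, d]         words of document d assigned to thread t
#   C_WT[w, t]         times word w is assigned to thread t
#   C_W[w]             times word w is assigned to the background
#   N0_prev, N1_prev   times w_{i-1} was drawn as a unigram / as part of a phrase

import numpy as np

def gibbs_weights(w, d, N0_prev, N1_prev, N_d0, N_d1,
                  C_TD, C_WT, C_W, alpha, beta0, beta1, gamma1, gamma2):
    T, W = C_TD.shape[0], C_WT.shape[0]
    N_d, N_prev = N_d0 + N_d1, N0_prev + N1_prev

    doc_thread  = (C_TD[:, d] + alpha) / (C_TD[:, d].sum() + T * alpha)   # thread mixture of d
    thread_word = (C_WT[w, :] + beta0) / (C_WT.sum(axis=0) + W * beta0)   # p(w | thread t)
    bg_word     = (C_W[w] + beta1) / (C_W.sum() + W * beta1)              # p(w | background)
    thread_side = (N_d0 + gamma1) / (N_d + 2 * gamma1)                    # x_i = 0
    bg_side     = (N_d1 + gamma1) / (N_d + 2 * gamma1)                    # x_i = 1
    unigram     = (N0_prev + gamma2) / (N_prev + 2 * gamma2)              # y_i = 0
    bigram      = (N1_prev + gamma2) / (N_prev + 2 * gamma2)              # y_i = 1

    w_eq2 = thread_side * doc_thread * thread_word * unigram   # Eq. (2), one weight per thread
    w_eq3 = bg_side * bg_word * unigram                        # Eq. (3), scalar
    w_eq4 = thread_side * thread_word * bigram                 # Eq. (4), used at t = z_{i-1}
    w_eq5 = bg_side * bg_word * bigram                         # Eq. (5), scalar
    return w_eq2, w_eq3, w_eq4, w_eq5
```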

4 Experiments

4.1 Experimental Settings

Two corpora are used in the experiments. The Chinese news corpus is an event-based corpus, which contains 68 event sub-corpora, such as “2007 Nobel prize”. The number of news reports in a sub-corpus varies from 100 to 420. The other corpus is the Reuters-21578 financial news corpus. We select five sub-corpora from it: “crude”, “grain”, “interest”, “money-fx” and “trade”. Each of them contains more than 300 reports, which describe many events.

Experiments are run on both corpora with different numbers of threads, with 500 iterations for each case. We set α = 50/T, where T is the number of threads, and β0 = 0.1, β1 = 0.1, γ1 = 0.5, γ2 = 0.5 by experience.

The LDA result is used as our baseline. The top three words of LDA are compared with the top three phrases generated by TNB on different corpora at different numbers of threads.
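For convenience, the settings above can be collected in a small helper; the function name and dictionary layout are ours, introduced purely for illustration.

```python
# Hyperparameter settings from Section 4.1 gathered in one place
# (the helper name and dict layout are illustrative, not the authors').
def tnb_hyperparameters(num_threads: int) -> dict:
    return {
        "T": num_threads,             # 5/8/10/12 for the Chinese corpus, 20/25/30 for Reuters
        "alpha": 50.0 / num_threads,  # document-thread Dirichlet prior
        "beta0": 0.1,                 # thread-word Dirichlet prior
        "beta1": 0.1,                 # background-word Dirichlet prior
        "gamma1": 0.5,                # prior for the background/thread switch x_i
        "gamma2": 0.5,                # prior for the bigram switch y_i
        "iterations": 500,            # Gibbs sampling iterations
    }
```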

4.2 Evaluation Metrics

There is no gold standard for news thread extraction; only humans can identify and understand news threads for different news events. The top three phrases of TNB and the top three words of LDA are evaluated by volunteer judges on a scale of 0 to 1, with report titles provided as the basis for judging. A score of 1 means the phrase or word represents the meaning of the title well, a score of 0 means it does not capture the meaning of the title, and a score of 0.5 is in between. The precision of news threads is calculated with the following three formulas:

\mathrm{top\text{-}1} = \frac{\sum_{t=1}^{T} score_{t1}}{T}    (6)

\mathrm{top\text{-}2} = \frac{\sum_{t=1}^{T} \max(score_{t1}, score_{t2})}{T}    (7)

\mathrm{top\text{-}3} = \frac{\sum_{t=1}^{T} \max(score_{t1}, score_{t2}, score_{t3})}{T}    (8)

where score_{ti} is the score of the ith word in thread t.
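A minimal sketch of these metrics, assuming the human judgments for each thread's top three items are stored in a list of lists (a layout chosen here for illustration):

```python
# Top-k precision from Eqs. (6)-(8): scores[t] holds the judgments (0, 0.5 or 1)
# for the top three items of thread t.
def topk_precision(scores, k):
    return sum(max(s[:k]) for s in scores) / len(scores)

scores = [[1, 0.5, 0], [0, 1, 0.5]]   # two threads, three judged items each
print(topk_precision(scores, 1))      # top-1: (1 + 0) / 2 = 0.5
print(topk_precision(scores, 2))      # top-2: (1 + 1) / 2 = 1.0
print(topk_precision(scores, 3))      # top-3: (1 + 1) / 2 = 1.0
```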

4.3 Results and Analysis

Tables 4 and 5 show the precision of news thread extraction from the Chinese and Reuters corpora with different numbers of threads. As the number of threads increases, the precision decreases. We analyzed both corpora: the Chinese corpus is event-based, and a thread number of 5 or 8 matches the semantic meanings hidden in each event sub-corpus, while twenty threads are adequate for the semantic meanings of the Reuters sub-corpora. The hidden semantics of the corpus dominate the precision and the final results.

The precision of TNB is much better than LDA. We give two explanations.

Table 7 shows both results extracted from the “2007 Nobel Prize” reports. First, the top LDA words do not consider the background influence, so common words such as “Nobel” appear among the top three words. Such words cannot be regarded as thread words that represent different aspects of an event. In TNB, thread-specific words (such as “Peace”) can be extracted and form an n-gram phrase with a background word to represent the thread more clearly. The second explanation is that a phrase delivers clearer information than a unigram word, for example, “peace” vs. “Nobel Peace Prize”. The top three TNB results for threads related to the Nobel Peace Prize convey two meanings, “Nobel Peace Prize” and “Climate change problem”, while readers need prior knowledge to understand the top three words of LDA.

Table 4. Precision on the Chinese corpus

Number of threads      5       8       10      12
TNB top-1              72.3%   65.4%   61.5%   60.9%
TNB top-2              85.2%   82.4%   77.7%   75.1%
TNB top-3              90.6%   88.3%   82.9%   81.4%
LDA top-1              43.4%   38.3%   31.9%   30.3%
LDA top-2              51.3%   45.5%   37.5%   36.9%
LDA top-3              58.4%   55.1%   46.9%   43.3%

Table 5. Precision on the Reuters corpus

Number of threads      20      25      30
TNB top-1              55.2%   44.3%   38.3%
TNB top-2              73.2%   61.1%   57.7%
TNB top-3              81.3%   69.4%   66.3%
LDA top-1              32%     29.5%   28.3%
LDA top-2              41.5%   37%     38.4%
LDA top-3              52%     41.5%   40%

Table 6 lists the background words of the five sub-corpora of Reuters news. Although these sub-corpora are not event-based, the background words still capture many features of each category. For example, words like “wheat”, “grain” and “agriculture” are easily identified as background words for the grain category. The word “say” appears as the top background word for all these sub-corpora. The reason is that reports in the Reuters corpus frequently quote different people’s opinions, so the word’s frequency is very high; therefore “say” is regarded as a background word.


Table 6. Background words for the Reuters corpus

trade       crude       grain       interest    money-fx
say         say         say         say         say
trade       oil         wheat       rate        dollar
japan       company     price       bank        rate
japanese    dlrs        grain       market      blah
official    mln         corn        blah        trade

Table 7. LDA and TNB results for threads of “2007 Nobel prize”

LDA result
  Nobel Peace Prize thread:     Peace 0.032, Nobel 0.025, Climate 0.024, Gore 0.023, change 0.019, president 0.016, committee 0.013, global 0.013
  Nobel Economics Prize thread: Nobel 0.041, Sweden 0.035, economics 0.029, announce 0.027, prize 0.021, date 0.015, winner 0.014, economist 0.013

TNB background words
  America 0.015, university 0.013, gene 0.011, research 0.013, nobel 0.012, Prize 0.011

TNB result
  Nobel Peace Prize thread:     Nobel Peace Prize 0.033, Climate change problem 0.032, Climate change 0.018
  Nobel Economics Prize thread: The Royal Swedish Academy 0.056, announce Nobel economics prize 0.052, Swedish kronor 0.038

5 Conclusion

In this paper, we present a topical n-gram model with a background distribution (TNB) to extract news threads. The TNB model adds background analysis and the word-order feature to standard LDA. Experiments indicate that our model can extract more interpretable threads than LDA from a news corpus. We also find that the number of threads and the event type can influence the precision of news thread extraction. Experiments show that TNB works well not only on an event-based corpus but also on a topic-based corpus. In the future, we plan to develop a dynamic mechanism to decide on a suitable number of threads for different news event types, to improve the precision of news thread extraction.

Acknowledgements. This research is supported by the Chinese Natural Science Foundation under Grant Number 60873134. The authors thank Mr. Sandy Harris for improving the English, and other students for the human evaluations in the experiments.


References

1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)

2. Nallapati, R., Feng, A., Peng, F., Allan, J.: Event threading within news topics. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 446–453. ACM (2004)

3. Chemudugunta, C., Smyth, P., Steyvers, M.: Modeling general and specific aspects of documents with a probabilistic topic model. In: Advances in Neural Information Processing Systems, pp. 241–248 (2006)

4. Li, P., Jiang, J., Wang, Y.: Generating templates of entity summaries with an entity-aspect model and pattern mining. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 640–649. Association for Computational Linguistics (2010)

5. Wallach, H.M.: Topic modeling: beyond bag-of-words. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 977–984. ACM (2006)

6. MacKay, D.J.C., Peto, L.C.B.: A hierarchical Dirichlet language model. Natural Language Engineering 1(3), 289–308 (1995)

7. Griffiths, T.L., Steyvers, M., Tenenbaum, J.B.: Topics in semantic representation. Psychological Review 114(2), 211–244 (2007)

8. Wang, X., McCallum, A., Wei, X.: Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In: Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 697–702. IEEE (2007)

9. Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. Machine Learning 37(2), 183–233 (1999)

10. Andrieu, C., de Freitas, N., Doucet, A., Jordan, M.I.: An introduction to MCMC for machine learning. Machine Learning 50(1), 5–43 (2003)

11. Minka, T., Lafferty, J.: Expectation-propagation for the generative aspect model. In: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 352–359 (2002)

12. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101(suppl. 1), 5228–5235 (2004)
