Entity Term Extraction - Event-related Sentence Selection

Chapter 3 Event-related Sentence Selection

3.1 Entity Term Extraction

Given a query term of an event, Lucene, which is a text search engine library, is used to get the event-related posts from a social media platform. Then two steps are performed to extract event-related entity terms: (1) preprocessing, and (2) term filtering.

Three principles are assumed for selecting representative entity terms.

(1) An entity term should be a noun, because the topical semantics of a sentence are carried by its nouns.

(2) The importance of an entity term in posts should be evaluated according to its frequency and its co-occurrence with other entity terms.

(3) A representative entity term should have a strong association with the query event term.

3.1.1. Preprocessing

First, each post is split into sentences at commas and periods. A Chinese natural language processing tool [23] is used to segment each sentence into Chinese semantic terms and to tag their parts of speech (POS tagging). The POS tagger marks nouns as “N” and verbs as “V”. Figure 3 shows the POS tagging result of a sentence, where each term is followed by its corresponding POS tag. Then the following four steps are performed to extract the candidate entity terms:

1) The terms labeled as nouns are picked out.

2) The terms in the stop-word list are removed.

3) Single-character words and temporal or location terms are removed.

4) Compound noun phrases formed from consecutive nouns (NN+) are added to the result as candidate entity terms.

Figure 3.2: Result of preprocessing

Take Figure 3.2 as an example. Among the terms labeled as “N”, the single words “陸(Ground)” and “上(Upon)” and the temporal terms “晚上(night)” and “七點 (seven o’clock)” are removed. Then the noun terms “中央(Center)” and “氣象局(Bureau of Meteorology)” are combined into the compound noun phrase “中央氣象局(Central Meteorological Bureau)”. Similarly, “颱風 (Typhoon)” and “警報(Alarming)” are combined into a compound term. Accordingly, six candidate entity terms are extracted from the example sentence.
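The four preprocessing steps can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the function name `extract_candidates`, the `(term, tag)` input format, and the `stopwords`, `temporal`, and `location` sets are assumptions, and the token sequence in the usage note is made up for illustration rather than taken from Figure 3.2. The sketch also assumes that compounds are built only from adjacent nouns that survive steps 2 and 3.

```python
def extract_candidates(tagged, stopwords=frozenset(),
                       temporal=frozenset(), location=frozenset()):
    """Return candidate entity terms from a list of (term, POS tag) pairs."""

    def keep(term, tag):
        # Steps 1-3: nouns only, no stop words, no single-character words,
        # no temporal or location terms.
        return (tag == "N" and term not in stopwords and len(term) > 1
                and term not in temporal and term not in location)

    candidates = []
    run = []  # current run of adjacent surviving nouns

    def flush():
        # Step 4: a run of two or more adjacent nouns forms a compound term.
        if len(run) >= 2:
            compound = "".join(run)
            if compound not in candidates:
                candidates.append(compound)
        run.clear()

    for term, tag in tagged:
        if keep(term, tag):
            if term not in candidates:
                candidates.append(term)
            run.append(term)
        else:
            flush()
    flush()
    return candidates
```

For a hypothetical tagged sequence such as `[("中央", "N"), ("氣象局", "N"), ("晚上", "N"), ("七點", "N"), ("發布", "V"), ("陸", "N"), ("上", "N"), ("颱風", "N"), ("警報", "N")]` with `temporal={"晚上", "七點"}`, the function returns the four single nouns plus the two compounds “中央氣象局” and “颱風警報”.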

3.1.2. Term Filtering

Let the set of event-related posts be denoted by PD = {pd1, pd2, …, pdn}, where pdi indicates a post (i = 1, 2, …, n). After preprocessing, m distinct terms are extracted. Let the set of terms be denoted as 𝑇 = {𝑡₁, 𝑡₂, … , 𝑡_𝑚}, where each tj denotes a term (j = 1, 2, …, m). To filter out event-irrelevant terms, we apply the centrality score computation proposed by Jiaul H. Paik et al. [10] to evaluate the importance of each term in the posts.

The centrality score computation proposed in [10] rests on the following assumption.

The importance of a term is strongly associated with the other terms appearing in the same post. In general, the more frequently a term appears in the corpus, the more important it is. Additionally, if a term 𝑡_𝑖 is more frequent relative to another term 𝑡_𝑗, then 𝑡_𝑖 should have higher importance than 𝑡_𝑗. Importance is propagated between terms, weighted by the relative frequency of each pair of terms. Based on this assumption, the centrality score of a term 𝑡_𝑖 is defined as follows:

$$\mathit{Cent}(t_i) = \sum_{t_j \in T,\; j \neq i} \mathit{CumRF}(t_i \mid t_j) \cdot \mathit{Cent}(t_j) \tag{1}$$

where 𝐶𝑒𝑛𝑡(𝑡_𝑗) denotes the centrality score of the term 𝑡_𝑗, whose initial value is 1. 𝐶𝑢𝑚𝑅𝐹(𝑡_𝑖|𝑡_𝑗) denotes the cumulative frequency of 𝑡_𝑖 relative to the frequency of 𝑡_𝑗 over the set of posts. The importance score of 𝑡_𝑖 is the sum of 𝐶𝑢𝑚𝑅𝐹(𝑡_𝑖|𝑡_𝑗) multiplied by 𝐶𝑒𝑛𝑡(𝑡_𝑗) over the other terms 𝑡_𝑗. 𝐶𝑢𝑚𝑅𝐹(𝑡_𝑖|𝑡_𝑗) is defined as follows:

$$\mathit{CumRF}(t_i \mid t_j) = \sum_{pd_k \in PD} \mathit{RF}(t_i \mid t_j, pd_k) \tag{2}$$

where 𝑅𝐹(𝑡_𝑖|𝑡_𝑗, 𝑝𝑑_𝑘) is defined as:

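$$\mathit{RF}(t_i \mid t_j, pd_k) = \begin{cases} \dfrac{\log_2\left(1 + P_{pd_k}(t_i)\right)}{\log_2\left(1 + P_{pd_k}(t_j)\right)}, & \text{if } P_{pd_k}(t_j) > 0 \\[2ex] \log_2\left(1 + P_{pd_k}(t_i)\right), & \text{otherwise} \end{cases} \tag{3}$$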

where 𝑅𝐹(𝑡_𝑖|𝑡_𝑗, 𝑝𝑑_𝑘) indicates the frequency of 𝑡_𝑖 relative to another term 𝑡_𝑗 in a post 𝑝𝑑_𝑘, and 𝑃_𝑝𝑑_𝑘(𝑡_𝑗) is the term frequency of 𝑡_𝑗 in the post 𝑝𝑑_𝑘. The logarithmic damping function limits the influence of high-frequency terms.
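Equations 2 and 3 can be sketched in Python as below. This is an illustrative sketch under two assumptions that the text does not spell out: each post is represented as a list of extracted terms, and the term frequency is the raw count of the term in the post. The function names `rf` and `cum_rf` are chosen here for illustration.

```python
import math
from collections import Counter


def rf(ti, tj, counts):
    """RF(t_i | t_j, pd_k) as in Equation 3.

    counts maps each term to its raw frequency in one post (an assumption).
    """
    numerator = math.log2(1 + counts.get(ti, 0))
    freq_j = counts.get(tj, 0)
    if freq_j > 0:
        return numerator / math.log2(1 + freq_j)
    return numerator


def cum_rf(ti, tj, posts):
    """CumRF(t_i | t_j) as in Equation 2: the sum of RF over all posts."""
    return sum(rf(ti, tj, Counter(post)) for post in posts)
```

For example, with the two toy posts `[["a", "a", "a", "b"], ["a"]]`, the first post contributes log2(4) / log2(2) = 2 to `cum_rf("a", "b", posts)` and the second, where "b" is absent, contributes log2(2) = 1.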

Because the centrality score of each term 𝑡_𝑖 in T is computed iteratively by Equation 1, the computation for all terms can be written compactly in matrix notation, as shown in Equation 4.

𝐶𝑒𝑛𝑡^𝑇 = 𝐶𝑢𝑚𝑅𝐹 × 𝐶𝑒𝑛𝑡^𝑇 (4)

where 𝐶𝑢𝑚𝑅𝐹 represents the matrix of the cumulative frequency between each pair of terms in T, computed by Equation 2. 𝐶𝑒𝑛𝑡^𝑇 is a vector holding the centrality scores of all terms in T, with every component initially set to 1. The centrality score vector is updated in each iteration until the ranking of the terms by centrality score no longer changes. Then the terms with the top-n centrality scores are selected as the candidate term set, denoted as 𝐶𝑇 = {𝑒𝑡₁, 𝑒𝑡₂, … , 𝑒𝑡_𝑛}.

Finally, the centrality score of each term is normalized by dividing it by the maximum centrality score among all terms. Only a certain number of terms are retained; this number depends on the query event. The experimental results are shown in Chapter 5.
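The iteration of Equation 4 can be sketched in plain Python as follows. This is a sketch under stated assumptions rather than the implementation used in the thesis: the function name `centrality` and the `max_iter` safeguard are introduced here, and the scores are rescaled by their maximum in every iteration (instead of only once at the end) to keep the numbers bounded. Rescaling does not change the ranking, and the final scores equal those obtained by normalizing once at the end.

```python
def centrality(cum_rf_matrix, max_iter=100):
    """Iterate Cent = CumRF x Cent until the term ranking stops changing.

    cum_rf_matrix[i][j] holds CumRF(t_i | t_j); the diagonal is ignored
    because Equation 1 sums over j != i.
    """
    n = len(cum_rf_matrix)
    cent = [1.0] * n          # all scores start at 1
    previous_order = None
    for _ in range(max_iter):  # max_iter is a safeguard assumed here
        updated = [
            sum(cum_rf_matrix[i][j] * cent[j] for j in range(n) if j != i)
            for i in range(n)
        ]
        top = max(updated) if updated else 0.0
        if top > 0:
            updated = [value / top for value in updated]  # normalize by max
        order = sorted(range(n), key=lambda i: -updated[i])
        cent = updated
        if order == previous_order:
            break
        previous_order = order
    return cent
```

The returned list is aligned with the rows of the matrix, so the top-n candidate terms can be taken by sorting the term indices by their score.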

<Example 3-1> Example of Computation of Centrality Score

Suppose an event-related post set PD = {pd1, pd2, pd3, pd4} is given as shown in Table 3.1. The terms extracted after preprocessing are shown in the right-most column.

First, the relative frequency of each term with respect to the others, i.e. the 𝑅𝐹 function in Equation 3, is computed. Table 3.2 shows the relative frequency weights of “颱風假(Typhoon Holiday)” to “服務業(Service Industry)”, “員工 (Employee)” and “工資(Salary)” in every post. The cumulative relative frequency is then computed as follows: 𝐶𝑢𝑚𝑅𝐹(“服務業(Service Industry)”|“颱風假(Typhoon Holiday)”) = 1.844 + 0.222 + 0.263 + 1 = 3.329.

Accordingly, the matrix of the cumulative relative frequency between each pair of terms is shown in Table 3.3.

Table 3.1: Example of posting data

… typhoon holiday tomorrow, how to count the salary? If the employer should give another day off?

服務業(Service …

The policy of the half-day off typhoon holiday changed after the protest by the citizens, because the typhoon was leaving off Taipei.

The no rainy typhoon holiday benefits the boss of theater and the businessman but disadvantaged the employee.

颱風假(Typhoon …

… pays for this salary and who is going to save this money.

服務業(Service …

Table 3.2: The relative frequency of “颱風假(Typhoon Holiday)” to the other terms

Table 3.3: The cumulative relative frequency 𝐶𝑢𝑚𝑅𝐹(𝑡_𝑖|𝑡_𝑗) of each pair of terms

After performing the matrix multiplication of Equation 4, each term’s centrality score is updated iteratively until the ranking of the terms by centrality score is unchanged. The results of the iterative computations are displayed in Table 3.4.

Table 3.4: The result of centrality score by iteration

颱風假

After term filtering, some candidate terms remain that are irrelevant to the query event even though they are mentioned often in the posts. For example, “板主 (Administrator of discussion board)” commonly appears in PTT posts, and “粉絲團(Fans group)” is also frequently mentioned. Based on the centrality score, these terms are considered representative, yet they contribute little to the semantics of the query event. Therefore, a clustering method is proposed to solve this problem. By clustering the terms according to their semantics, clusters that convey similar semantic concepts are obtained. Then the score of each cluster with respect to the query event is evaluated. The cluster of terms with the lowest event relevant score is labeled as entities with poor semantic quality. The others are labeled as entities with good semantic quality. The result is consulted when performing sentence selection.

Word2Vec [9], proposed by Google, is used to construct a feature vector for each term. In this thesis, about eighty-two thousand posts collected from the PTT Bulletin Board System are used to train the word embedding model. The similarity between two entity terms is then computed as the cosine similarity of their corresponding vectors. The K-means algorithm is applied to cluster the entity terms into k clusters, with a strategy designed to decide a proper value of k dynamically. The detailed method is described in the following section, and the corresponding pseudo code for topic clustering of entity terms is shown in Algorithm 1.

First, the K-means algorithm with an initial value of k is performed on the candidate entities (line 5) to obtain a set of clusters C = {c0, c1, …, ck-1}. The event association score 𝐸𝐴𝑆(𝑐𝑡_𝑖, 𝑞𝑡) measures how relevant a term is to the query event term qt by applying Kullback–Leibler divergence (KL divergence). From line 6 to 13, the event relevant score of each cluster c in C is computed as the average event association score 𝐸𝐴𝑆(𝑐𝑡_𝑖, 𝑞𝑡) over the terms 𝑐𝑡_𝑖 in cluster c. The event association score of a term 𝑐𝑡_𝑖 is formulated as follows:

$$\mathit{EAS}(ct_i, qt) = P_d(ct_i \mid qt) \cdot \log_2\left(\frac{P_d(ct_i \mid qt)}{P_d(ct_i \mid \overline{qt})}\right) \tag{5}$$

Let PD denote the set of documents in the PTT posts corpus. The conditional probability that 𝑐𝑡_𝑖 appears in the documents containing qt, i.e. 𝑃_𝑑(𝑐𝑡_𝑖|𝑞𝑡), is computed as follows:

$$P_d(ct_i \mid qt) = \frac{\left|\{\, pd_k \mid pd_k \in PD \wedge ct_i \in pd_k \wedge qt \in pd_k \,\}\right|}{\left|\{\, pd_k \mid pd_k \in PD \wedge qt \in pd_k \,\}\right|} \tag{6}$$

Likewise, $P_d(ct_i \mid \overline{qt})$ denotes the conditional probability that 𝑐𝑡_𝑖 appears in the documents that do not contain 𝑞𝑡.
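A small Python sketch of Equations 5 and 6 is given below. Each document is assumed to be a set of terms, and the function name `eas` is chosen here for illustration. The text does not say how zero probabilities are handled, so the sketch makes two assumptions of its own: it returns 0 when the term never co-occurs with the query term, and it floors the denominator probability at a small `eps` when the term never appears without the query term.

```python
import math


def eas(term, query, docs, eps=1e-9):
    """Event association score of `term` with respect to `query`.

    docs is a list of sets of terms (an assumed representation).
    """
    with_query = [d for d in docs if query in d]
    without_query = [d for d in docs if query not in d]
    if not with_query:
        return 0.0
    # Equation 6: share of query documents that also contain the term.
    p_given_q = sum(term in d for d in with_query) / len(with_query)
    if p_given_q == 0:
        return 0.0  # assumption: no co-occurrence means no association
    # Same ratio over the documents that do not contain the query term.
    if without_query:
        p_given_not_q = sum(term in d for d in without_query) / len(without_query)
    else:
        p_given_not_q = 0.0
    # Equation 5, with the denominator floored at eps (assumption).
    return p_given_q * math.log2(p_given_q / max(p_given_not_q, eps))
```

With four documents containing the query term, two of which contain the term, and four documents without the query term, one of which contains the term, the score is 0.5 · log2(0.5 / 0.25) = 0.5.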

Initially, the number of clusters, k, is set to 3, as shown in line 3. In line 15, the clusters are sorted by their event relevant scores in descending order. To decide whether it is appropriate to discard the cluster with the lowest relevant score, the average difference of the relevant scores between each pair of neighboring clusters in the ranked list is computed (line 19) to get the average gap.

From line 20 to 25, when the gap between the cluster with the lowest ES score and the cluster ranked just above it is larger than the average gap, the cluster with the lowest ES score is assigned to 𝐶_𝑙. The terms belonging to this cluster are labeled as entities with poor semantic quality. Otherwise, the value of k is increased by 1 and the clustering process repeats. Finally, in line 27, the terms in the other clusters are assigned to 𝐶_𝑒, which contains the entities with good semantic quality.
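The gap test and the loop over k can be sketched in Python as below. This is not Algorithm 1 itself but a simplified illustration: `score_clusters` is a hypothetical callable that runs the clustering for a given k and returns the event relevant score of each cluster, and `max_k` and the tolerance `tol` are safeguards assumed here (the tolerance avoids a spurious "larger than" caused by floating-point rounding).

```python
def poor_cluster_index(scores, tol=1e-9):
    """Return the index of the lowest-scoring cluster if its gap to the
    cluster ranked just above it exceeds the average gap, else None."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [scores[i] for i in order]
    gaps = [high - low for high, low in zip(ranked, ranked[1:])]
    if not gaps:
        return None
    average_gap = sum(gaps) / len(gaps)
    if gaps[-1] > average_gap + tol:
        return order[-1]
    return None


def choose_k(score_clusters, k=3, max_k=10):
    """Increase k until the lowest cluster is clearly separated.

    score_clusters(k) is a hypothetical callable returning one event
    relevant score per cluster; max_k is a stopping bound assumed here.
    """
    while k <= max_k:
        index = poor_cluster_index(score_clusters(k))
        if index is not None:
            return k, index
        k += 1
    return None
```

With cluster scores 0.45, 0.46 and 0.44, the last gap equals the average gap, so `poor_cluster_index` returns `None` and the loop moves on to k + 1.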

<Example 3-2> Example of Term Topic Clustering

Suppose there are twelve candidate terms in CT and the result of clustering when k = 3 is shown in Table 3.5. According to the average event association score of the terms in 𝐶₁, the evaluation score of 𝐶₁ is 0.45. The evaluation scores of 𝐶₂ and 𝐶₃ are 0.46 and 0.44, respectively. Accordingly, the ranked list of the clusters is 𝐶₂, 𝐶₁ and 𝐶₃, and the average gap is (0.01 + 0.01) / 2 = 0.01. Because the gap of the evaluation scores between 𝐶₁ and 𝐶₃ is 0.01, which is not larger than the average gap, k is set to 4 and the clustering is performed again. The clustering result for k = 4 is shown in Table 3.6.

Table 3.5: Example of Clustering on terms when k = 3

Table 3.6: Example of Clustering on terms when k = 4

Cluster id | Cluster Term | EAS(𝑐𝑡_𝑖, 𝑞𝑡) | 𝐸𝑆_𝑐(𝐶_𝑛)
𝐶₃ | 員工(Employee) | 0.44 | 0.43
𝐶₃ | 服務業(Service industry) | 0.36 |
𝐶₃ | 台鐵(Taiwan Railway) | 0.5 |
𝐶₄ | 粉絲團(Fan Page) | 0.35 | 0.28
𝐶₄ | 民族(Ethnic) | 0.26 |
𝐶₄ | 朋友(Friend) | 0.25 |

As Table 3.6 shows, the ranked list of the clusters is 𝐶₁, 𝐶₂, 𝐶₃ and 𝐶₄, where the average gap between their evaluation scores is 0.087. In this situation, the gap between 𝐶₃ and 𝐶₄, i.e. 0.15, is larger than the average gap. Therefore, 𝐶₄ is selected to be 𝐶_𝑙. The terms “粉絲團(Fan Page)”, “民族(Ethnic)”, and “朋友(Friend)” are labeled as entities with poor semantic quality.

3.3 Sentence Selection

In this thesis, the generated timeline summary consists of semantically representative sentences. Accordingly, two stages of processing are performed to select event-related sentences: (1) sentence filtering and (2) representative score computation for sentences.

3.3.1 Sentence Filtering

This stage filters out sentences of poor quality with the help of the term clusters. A sentence is removed if it matches any of the following three conditions.

(1) The sentence is incomplete. Such sentences were observed to be too short to convey enough information.

(2) The sentence is not readable. An unusually long sentence has poor readability; such sentences were observed to lack punctuation or to be full of repeated words or meaningless symbols.

(3) The sentence is not semantically event-related. Such sentences were observed to contain more entities with poor semantic quality than entities with good quality.

Let 𝑆 = {𝑠₁, 𝑠₂, … , 𝑠_𝑘} denote the set of all sentences extracted from the posts in PD. Let 𝑇_𝐵 and 𝑇_𝐺 denote the sets of entity terms with poor quality and with good quality, respectively, taken from Cl and Ce in Section 3.2.

According to the filtering principles 1 and 2, the corresponding processing is performed as follows. Let 𝑎𝑣𝑔𝑙𝑒𝑛(𝑆) denote the average length of the sentences in S. For each sentence s in S, if |𝑙𝑒𝑛(𝑠) − 𝑎𝑣𝑔𝑙𝑒𝑛(𝑆)| > 10, s is removed from S.

Furthermore, the average length of passages in the sentences separated by commas and periods is calculated. If the length of a passage in a sentence is longer than the average length of passages, the sentence will be discarded. Therefore, the remaining sentences are readable.
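The length-based rule for principles 1 and 2 can be sketched as follows. The sketch assumes that sentence length is counted in characters, which agrees with the lengths listed in Table 3.8; the function name `filter_by_length` and the parameter `max_deviation` are introduced here, with the default of 10 taken from the rule above. The passage-length rule is not included.

```python
def filter_by_length(sentences, max_deviation=10):
    """Keep sentences whose length is within max_deviation characters
    of the average sentence length."""
    if not sentences:
        return []
    average = sum(len(s) for s in sentences) / len(sentences)
    return [s for s in sentences if abs(len(s) - average) <= max_deviation]
```

For instance, with four sentences of 20 characters and one of 70 characters, the average length is 30, so the four short sentences are kept and the long one is removed.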

According to the third filtering principle, for each sentence 𝑠_𝑖 in S, the ratio of entity terms with poor semantic quality to the ones with good quality in the sentence is evaluated according to the following function:

$$R(s_i) = \frac{\left|\{\, t_j \mid t_j \in s_i \wedge t_j \in C_l \,\}\right|}{\left|\{\, t_k \mid t_k \in s_i \wedge t_k \in C_e \,\}\right| + \varepsilon} \tag{7}$$

where 𝜀 is a constant that handles the special case in which 𝑠_𝑖 does not contain any entity with good quality. If 𝑅(𝑠_𝑖) is larger than a given threshold value, which is set to 0.4 according to the experimental results described in Chapter 5, the sentence 𝑠_𝑖 is pruned from the candidate sentence set S.
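Equation 7 and the pruning decision can be sketched in Python as below. The function names are chosen for illustration, a sentence is assumed to be given as the collection of entity terms it contains, and the default 𝜀 = 0.01 is an assumption matching the value used in Example 3-3; the threshold 0.4 is the one stated above.

```python
def poor_ratio(terms, poor_terms, good_terms, eps=0.01):
    """R(s_i): number of poor-quality entity terms in the sentence divided
    by the number of good-quality ones plus eps (Equation 7)."""
    distinct = set(terms)
    n_poor = len(distinct & set(poor_terms))
    n_good = len(distinct & set(good_terms))
    return n_poor / (n_good + eps)


def keep_sentence(terms, poor_terms, good_terms, threshold=0.4, eps=0.01):
    """A sentence is kept unless its ratio exceeds the threshold."""
    return poor_ratio(terms, poor_terms, good_terms, eps) <= threshold
```

A sentence with one poor-quality term and no good-quality term gets a ratio of about 1 / 0.01 = 100 and is pruned, while a sentence with no entity terms at all gets a ratio of 0 and is kept.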

3.3.2 Representative Score Computation for Sentences

After sentence filtering according to the above three principles, the representative score 𝑆𝑅𝑆(𝑠_𝑖) of every remaining sentence 𝑠_𝑖 is evaluated. The assumption is that the more entity terms highly related to the query event a sentence contains, the more representative the sentence is for the query event. Therefore, 𝑆𝑅𝑆(𝑠_𝑖) is defined as the weighted sum of the representative score 𝑇𝑅𝑆(𝑒_𝑖) of each entity term 𝑒_𝑖 in 𝑠_𝑖, where the weight is the centrality score of 𝑒_𝑖, i.e. 𝐶𝑒𝑛𝑡(𝑒_𝑖). In addition, a sentence containing any term in 𝑇_𝐵 is penalized: the weighted scores of those terms are subtracted from 𝑆𝑅𝑆(𝑠_𝑖). Accordingly, 𝑆𝑅𝑆(𝑠_𝑖) is defined as:

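$$\mathit{SRS}(s_i) = \sum_{e_i \in s_i \wedge e_i \in T_G} \mathit{TRS}(e_i) \cdot \mathit{Cent}(e_i) \;-\; \sum_{e_i \in s_i \wedge e_i \in T_B} \mathit{TRS}(e_i) \cdot \mathit{Cent}(e_i) \tag{8}$$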

The representative score of a term, 𝑇𝑅𝑆, measures the semantic importance of an entity term. It considers both how related the term is to the event and how concentrated the term is on a topic cluster. Therefore, for an entity term 𝑐𝑡_𝑖, 𝑇𝑅𝑆(𝑐𝑡_𝑖) is defined as the product of two measures: its event association score, 𝐸𝐴𝑆(𝑐𝑡_𝑖, 𝑞𝑡), and its concentration score, 𝐶𝑆(𝑐𝑡_𝑖).

𝑇𝑅𝑆(𝑐𝑡_𝑖) = 𝐶𝑆(𝑐𝑡_𝑖) ∗ EAS(𝑐𝑡_𝑖, 𝑞𝑡) (9)

Generally, the closer a term is to the centroid of its cluster, the more concentrated the term is on a semantic topic. Let 𝑐_𝑚 denote the centroid of the cluster to which the term 𝑐𝑡_𝑖 belongs. The concentration score of an entity term 𝑐𝑡_𝑖, denoted as 𝐶𝑆(𝑐𝑡_𝑖), is measured by the cosine similarity between 𝑐𝑡_𝑖 and 𝑐_𝑚 as follows:

$$\mathit{CS}(ct_i) = \frac{\vec{ct_i} \cdot \vec{c_m}}{\lVert \vec{ct_i} \rVert \, \lVert \vec{c_m} \rVert} \tag{10}$$

The centroid vector $\vec{c_m}$ is computed by averaging the Word2vec[22] vectors of the terms 𝑐𝑡_𝑖 in the cluster. The higher the concentration score is, the better the entity term represents its topic.
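Equations 8 to 10 can be sketched together in plain Python. The sketch assumes that term vectors are given as lists of floats and that the per-term scores are stored in dictionaries; the function names and the zero-vector fallback in `cosine` are assumptions made here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (Equation 10)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector has no similarity
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise average of the term vectors in one cluster."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def trs(term_vector, cluster_vectors, eas_score):
    """TRS = CS * EAS (Equation 9)."""
    return cosine(term_vector, centroid(cluster_vectors)) * eas_score


def srs(terms, trs_scores, cent_scores, good_terms, poor_terms):
    """SRS (Equation 8): weighted TRS of good terms minus that of poor terms."""
    gain = sum(trs_scores[t] * cent_scores[t] for t in terms if t in good_terms)
    penalty = sum(trs_scores[t] * cent_scores[t] for t in terms if t in poor_terms)
    return gain - penalty
```

For example, a sentence whose only entity term is a good-quality term with 𝑇𝑅𝑆 = 0.385 and a centrality score of 1.0 (a hypothetical value) receives an 𝑆𝑅𝑆 of 0.385.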

<Example 3-3> Example of computation of representative score

According to the result of Example 3-2, the set of entity terms with good quality is 𝑇_𝐺 = {“柯文哲(Ko Wen-je)”, “首長(Mayor)”, “政府(Government)”, “政策(Policy)”, “水庫(Reservoir)”, “大潮(Tide)”, “颱風假(Typhoon holiday)”, “員工(Employee)”, “服務業(Service industry)”, “台鐵(Taiwan Railway)”}. The set of entity terms with poor quality is 𝑇_𝐵 = {“粉絲團(Fan Page)”, “民族(Ethnic)”, “朋友(Friend)”}. The representative score of each entity term is shown in Table 3.7.

Table 3.7: Representative score of entity term

Entity Term 𝐶𝑆(𝑐𝑡_𝑖) EAS(𝑐𝑡_𝑖, 𝑞𝑡) 𝑇𝑅𝑆(𝑐𝑡_𝑖)

柯文哲(Ko Wen-je) 0.8 0.64 0.512

首長(Mayor) 0.7 0.55 0.385

政策(Policy) 0.8 0.5 0.4

政府(Government) 0.5 0.45 0.225

大潮(Tide) 0.6 0.55 0.33

水庫(Reservoir) 0.3 0.45 0.135

颱風假(Typhoon holiday) 0.75 0.6 0.45

員工(Employee) 0.6 0.44 0.264

服務業(Service industry) 0.4 0.36 0.144

台鐵(Taiwan Railway) 0.7 0.5 0.35

Some example sentences for computing the representative score of sentences are shown in Table 3.8. For the sentence 𝑠₄, 𝑅(𝑠₄) = 1 / (0 + 0.01) = 100, which is larger than the threshold value. Therefore, 𝑠₄ is pruned from S. Similarly, 𝑠₇ is also removed. The sentence 𝑠₃ contains no entity term in either 𝑇_𝐵 or 𝑇_𝐺. Therefore, 𝑅(𝑠₃) = 0 / 0.01 = 0 and 𝑠₃ is retained in S.

Table 3.8: Result of sentence pruning

Id | Sentence Example | Sentence Length | Pruned
𝑠₁ | 柯文哲原本宣布只放半天颱風假引發不少網友不滿 / People are not satisfied with the policy of the typhoon holiday announced by the mayor. | 22 | No
𝑠₂ | 自己過去 10 多年擔任地方首長都是下半天停班停課 / It is usually half-day typhoon holiday when I am the mayor for past ten years. | 22 | No
𝑠₃ | 只要雨下的很大的時候就會從門縫打進來 / If the rain is heavy, the water always comes in to the door easily. | 18 | No
𝑠₄ | 加藤軍臺灣粉絲團 2.0 看來颱風夜大家都很無聊 / It seems everyone is bored in night of Typhoon in the fan page of Gato. | 20 | Yes
𝑠₇ | 由於多放了一天假不知道到幹嘛結果陪朋友颱風假去打網咖 / I played the online game with friends because the extra typhoon holiday. | 26 | Yes
𝑠₈ | 北北基不放颱風假昨晚兩小時內政策大轉彎 / The policy of typhoon holiday for Taipei changed in two hours. | |

Therefore, 𝑆𝑅𝑆(𝑠_𝑖) for each sentence in S is shown in Table 3.9.

Table 3.9: Representative score of a sentence

id | Sentence Example | 𝑆𝑅𝑆(𝑠_𝑖)
… | … It is usually half-day typhoon holiday when I am the mayor for past ten years. | 0.385
𝑠₃ | 只要雨下的很大的時候就會從門縫打進來 / If the rain is heavy, the water always comes in to the door easily. | 0.0
𝑠₈ | 北北基不放颱風假昨晚兩小時內政策大轉彎 / The policy of typhoon holiday for Taipei changed in two hours. … typhoon holiday, so do the stock salesperson. | 0.714

Chapter 4 Sub-event Matching and Summarization

This chapter describes how to summarize user opinions on each sub-event related to the query event. The assumption is that several sub-events occur during the discussion period of the query event, and that most comments in the posts discuss a certain sub-event. Therefore, the important sub-event sentences are first extracted from S to represent sub-events. These are short sentences without commas. The remaining sentences in S, delimited by periods, are considered candidate comments on the sub-events. After that, each comment is matched with the discovered sub-events to find its corresponding sub-event. In general, these comments discuss different aspects. Therefore, the comments are clustered into several groups with different discussion aspects. Finally, the time at which each sub-event and its discussions appear is considered to organize the summary. Each processing step is described individually in the following sections.

The tasks described above are divided into four main processing steps: (1) sub-event sentence selection, (2) sub-event matching, (3) aspect discovering, and (4) timeline summarization of sub-events and comments. The flow chart is shown in Figure 4.1.

Figure 4.1: Flow chart of sub-event matching and summarization

4.1. Sub-event sentence selection

A rule-based method is proposed to extract the sub-event sentences from S. The passages that contain at least one entity term, one verb term, one temporal term, and one location term are selected as sub-event sentences. Among the set of extracted sub-event sentences, denoted as CSE, the 𝑆𝑅𝑆(𝑠_𝑖) score is used as the representative score of each sentence 𝑠_𝑖 with respect to the query event. Moreover, to avoid selecting several sentences that represent the same sub-event, the diversity of sentences is also considered when selecting the representative sub-event sentences from CSE.
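The selection rule can be sketched in Python as a simple containment check. The category labels (`"entity"`, `"verb"`, `"temporal"`, `"location"`) and the `(term, category)` input format are hypothetical; how each term is assigned to a category is outside the scope of this sketch.

```python
REQUIRED_CATEGORIES = frozenset({"entity", "verb", "temporal", "location"})


def is_sub_event_sentence(tagged_terms):
    """Return True if the passage has at least one term of every
    required category. tagged_terms is a list of (term, category) pairs."""
    present = {category for _, category in tagged_terms}
    return REQUIRED_CATEGORIES <= present


def select_sub_event_sentences(passages):
    """Filter a list of tagged passages down to the candidate set CSE."""
    return [p for p in passages if is_sub_event_sentence(p)]
```

A passage missing any one of the four categories, for example one with no location term, is not selected.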

The algorithm for selecting the representative sub-event sentences from CSE is as follows. Let SE denote the set of selected representative sub-event sentences, where SE is an empty set initially. At first, the sentence in CSE with the highest 𝑆𝑅S(𝑠_j) is selected into SE. Then in each iteration, 𝑆𝑒𝑙𝑒𝑐𝑡𝑆𝑐𝑜𝑟𝑒(s_j) is computed for each
