
Chapter 3 Event-related Sentence Selection

3.1 Entity Term Extraction

Given a query term of an event, Lucene, which is a text search engine library, is used to get the event-related posts from a social media platform. Then two steps are performed to extract event-related entity terms: (1) preprocessing, and (2) term filtering.

Before these steps are described, three principles for selecting representative entity terms are assumed.

(1) An entity term should be a noun, because the topical semantics of a sentence are carried by its nouns.

(2) The importance of an entity term in posts should be evaluated according to its frequency and the co-occurrence with other entity terms.

(3) A representative entity term should have strong association with the query event term.

3.1.1. Preprocessing

First, each post is split into sentences at commas and periods. A Chinese natural language processing tool [23] then segments each sentence into Chinese semantic terms and tags each term with its part of speech (POS). The POS tagger marks nouns as “N” and verbs as “V”. Figure 3 shows the POS tagging result of a sentence, where each term is followed by its corresponding POS tag. Then the following four steps are performed to extract the candidate entity terms:

1) The terms labeled as Nouns are picked out.

2) The terms in the stop-word list are removed.

3) Single-character terms and temporal or location terms are removed.

4) Compound noun phrases formed from consecutive nouns (NN+) are added to the result as candidate entity terms.

Figure 3.2: Result of preprocessing

Take Figure 3.2 as an example. Among the terms labeled as “N”, the single-character terms “陸(Ground)” and “上(Upon)” and the temporal terms “晚上(night)” and “七點(seven o’clock)” are removed. Then the noun terms “中央(Center)” and “氣象局(Bureau of Meteorology)” are combined into the compound noun phrase “中央氣象局(Central Meteorological Bureau)”. Similarly, “颱風(Typhoon)” and “警報(Alarming)” are combined into a compound term. Accordingly, six candidate entity terms are extracted from the example sentence.
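The four preprocessing steps can be sketched as follows. This is only a minimal illustration, assuming the tagger output is available as a list of (term, tag) pairs. The function name `extract_candidates`, the sample token list, and the stop-word, temporal and location sets are hypothetical and are not taken from the thesis. The sketch also assumes that a removed term or a non-noun ends a run of consecutive nouns, which the text does not state explicitly.

```python
def extract_candidates(tagged, stop_words, temporal, location):
    """Return candidate entity terms from one POS-tagged sentence.

    tagged: list of (term, tag) pairs, where nouns are tagged "N".
    """
    candidates = []
    run = []  # current run of consecutive nouns that survived filtering

    def flush():
        # Step 4: a run of two or more nouns also yields a compound term.
        if len(run) >= 2:
            compound = "".join(run)
            if compound not in candidates:
                candidates.append(compound)
        run.clear()

    for term, tag in tagged:
        keep = (
            tag == "N"                    # step 1: nouns only
            and term not in stop_words    # step 2: drop stop words
            and len(term) > 1             # step 3: drop single-character terms
            and term not in temporal      # step 3: drop temporal terms
            and term not in location      # step 3: drop location terms
        )
        if keep:
            run.append(term)
            if term not in candidates:
                candidates.append(term)
        else:
            flush()
    flush()
    return candidates


# Hypothetical tagging, loosely modelled on the example sentence.
tagged = [("中央", "N"), ("氣象局", "N"), ("晚上", "N"), ("七點", "N"),
          ("發布", "V"), ("陸", "N"), ("上", "N"), ("颱風", "N"), ("警報", "N")]
result = extract_candidates(tagged, stop_words=set(),
                            temporal={"晚上", "七點"}, location=set())
print(result)
```

Whether the component nouns are kept alongside the compound is a design choice of this sketch; it follows the wording that compounds are "added" to the result.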

3.1.2. Term Filtering

Let the set of event-related posts be denoted by PD = {pd1, pd2, …, pdn}, where pdi indicates a post (i = 1, 2, …, n). After preprocessing, m distinct terms are extracted. Let the set of terms be denoted as 𝑇 = {𝑡1, 𝑡2, … , 𝑡𝑚}, where each tj denotes a term (j = 1, 2, …, m). To filter out event-irrelevant terms, the centrality score computing method proposed by Paik et al. [10] is applied to evaluate the importance of each term in the posts.

The centrality score computation proposed in [10] rests on the following assumption. The importance of a term is strongly associated with the other terms appearing in the same post. In general, the more frequently a term appears in the corpus, the more important it is. Additionally, if a term 𝑡𝑖 is more frequent than another term 𝑡𝑗, 𝑡𝑖 should have higher importance than 𝑡𝑗. Importance is passed between terms and weighted according to the relative frequency of each pair of terms. Based on this assumption, the centrality score of a term 𝑡𝑖 is defined as follows:

$$\mathit{Cent}(t_i) = \sum_{t_j \in T,\ j \neq i} \mathit{CumRF}(t_i \mid t_j) \cdot \mathit{Cent}(t_j) \tag{1}$$

where 𝐶𝑒𝑛𝑡(𝑡𝑗) denotes the centrality score of the term 𝑡𝑗, whose initial value is 1. 𝐶𝑢𝑚𝑅𝐹(𝑡𝑖|𝑡𝑗) denotes the cumulative frequency of 𝑡𝑖 relative to the frequency of 𝑡𝑗 over the set of posts. The importance score of 𝑡𝑖 is the sum, over the other terms 𝑡𝑗, of 𝐶𝑢𝑚𝑅𝐹(𝑡𝑖|𝑡𝑗) multiplied by 𝐶𝑒𝑛𝑡(𝑡𝑗). 𝐶𝑢𝑚𝑅𝐹(𝑡𝑖|𝑡𝑗) is defined as follows:

$$\mathit{CumRF}(t_i \mid t_j) = \sum_{pd_k \in PD} \mathit{RF}(t_i \mid t_j, pd_k) \tag{2}$$

where 𝑅𝐹(𝑡𝑖|𝑡𝑗, 𝑝𝑑𝑘) is defined as:

$$\mathit{RF}(t_i \mid t_j, pd_k) = \begin{cases} \dfrac{\log_2(1 + P_{pd_k}(t_i))}{\log_2(1 + P_{pd_k}(t_j))}, & \text{if } P_{pd_k}(t_j) > 0 \\[2ex] \log_2(1 + P_{pd_k}(t_i)), & \text{otherwise} \end{cases} \tag{3}$$

where 𝑅𝐹(𝑡𝑖|𝑡𝑗, 𝑝𝑑𝑘) indicates the frequency of 𝑡𝑖 relative to 𝑡𝑗 in a post 𝑝𝑑𝑘, and 𝑃𝑝𝑑𝑘(𝑡𝑗) is the term frequency of 𝑡𝑗 in the post 𝑝𝑑𝑘. The logarithmic damping function limits the influence of high-frequency terms.
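Equations 2 and 3 can be illustrated with a short sketch. It assumes each post has already been reduced to its list of extracted terms, so that the term frequency is a simple count; the function names `rf` and `cum_rf` and the toy posts are hypothetical.

```python
import math
from collections import Counter


def rf(ti, tj, post_tf):
    """Equation 3: damped frequency of ti relative to tj in one post."""
    damped_i = math.log2(1 + post_tf[ti])
    if post_tf[tj] > 0:
        return damped_i / math.log2(1 + post_tf[tj])
    return damped_i


def cum_rf(ti, tj, posts_tf):
    """Equation 2: sum of the per-post relative frequencies."""
    return sum(rf(ti, tj, tf) for tf in posts_tf)


# Hypothetical toy posts (not the posts of Table 3.1).
posts = [["颱風假", "颱風假", "颱風假", "員工"], ["颱風假"]]
posts_tf = [Counter(p) for p in posts]  # Counter returns 0 for absent terms

a_given_b = cum_rf("颱風假", "員工", posts_tf)  # log2(4)/log2(2) + log2(2)
b_given_a = cum_rf("員工", "颱風假", posts_tf)  # log2(2)/log2(4) + log2(1)/log2(2)
print(a_given_b, b_given_a)
```

Note that the measure is asymmetric: the frequent term receives a large value relative to the rare one, but not the other way round.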

Because the centrality score of every term 𝑡𝑖 in T is computed iteratively by Equation 1, the computation for all terms can be written compactly in matrix notation, as shown in Equation 4.

$$\mathit{Cent}_T = \mathit{CumRF} \times \mathit{Cent}_T \tag{4}$$

where 𝐶𝑢𝑚𝑅𝐹 is the matrix of cumulative relative frequencies between each pair of terms in T, computed by Equation 2, and 𝐶𝑒𝑛𝑡𝑇 is the vector of centrality scores of all terms in T, whose entries are all set to 1 initially. The centrality score vector is updated in each iteration until the ranking of terms by centrality score no longer changes. Then the terms with the top-n centrality scores are selected as the candidate term set, denoted as 𝐶𝑇 = {𝑒𝑡1, 𝑒𝑡2, … , 𝑒𝑡𝑛}.

Finally, the centrality score of each term is normalized by dividing it by the maximum centrality score among all terms. Only a certain number of terms are retained; this number depends on the query event. The experimental results are shown in Chapter 5.
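The iteration of Equation 4 can be sketched in plain Python as below. The matrix values, the function name `centrality` and the `max_iter` bound are hypothetical. One deliberate deviation is labeled in the code: the scores are rescaled by their maximum after every iteration rather than only at the end. Since the update is linear, this does not change the ranking or the final normalized scores, but it keeps the numbers from growing without bound.

```python
def centrality(cum_rf_matrix, max_iter=100):
    """Iterate Cent = CumRF x Cent until the ranking of terms is stable.

    cum_rf_matrix[i][j] holds CumRF(t_i | t_j); the diagonal is ignored.
    """
    n = len(cum_rf_matrix)
    if n == 0:
        return []
    cent = [1.0] * n  # every score starts at 1
    prev_order = None
    for _ in range(max_iter):
        # Equation 1: sum over all other terms j != i.
        new = [
            sum(cum_rf_matrix[i][j] * cent[j] for j in range(n) if j != i)
            for i in range(n)
        ]
        top = max(new)
        if top > 0:
            # Assumption of this sketch: rescale in every iteration.
            new = [v / top for v in new]
        order = sorted(range(n), key=lambda i: -new[i])
        cent = new
        if order == prev_order:  # ranking unchanged: stop
            break
        prev_order = order
    return cent


# Hypothetical 3-term matrix (not the values of Table 3.3).
matrix = [[0.0, 3.0, 2.0],
          [0.5, 0.0, 1.0],
          [0.4, 0.8, 0.0]]
scores = centrality(matrix)
print(scores)
```

The returned vector is already normalized so that the highest-ranked term has score 1.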

<Example 3-1> Example of Computation of Centrality Score

Suppose an event-related post set PD = {pd1, pd2, pd3, pd4} is given as shown in Table 3.1. The terms extracted after preprocessing are shown in the right-most column.

First, the relative frequency of each term to the others, i.e. the 𝑅𝐹 function shown in Equation 3, is computed. Table 3.2 shows the relative frequency of “颱風假(Typhoon Holiday)” to “服務業(Service Industry)”, “員工(Employee)” and “工資(Salary)” in every post. The cumulative relative frequency is then computed as follows: 𝐶𝑢𝑚𝑅𝐹(“服務業(Service Industry)”|“颱風假(Typhoon Holiday)”) = 1.844 + 0.222 + 0.263 + 1 = 3.329.

Accordingly, the matrix of the cumulative relative frequency between each pair of terms is shown in Table 3.3.

Table 3.1: Example of posting data

typhoon holiday tomorrow, how to count the salary? If the employer should give another day off?

服務業(Service

The policy of the half-day off typhoon holiday changed after the protest by the citizens, because the typhoon was leaving off Taipei.

The no rainy typhoon holiday benefits the boss of theater and the businessman but disadvantaged the employee.

颱風假(Typhoon

pays for this salary and who is going to save this money.

服務業(Service

Table 3.2: The relative frequency of “颱風假(Typhoon Holiday)” to the other terms

Table 3.3: The cumulative relative frequency 𝐶𝑢𝑚𝑅𝐹(𝑡𝑖|𝑡𝑗) of each pair of terms

After performing the matrix multiplication of Equation 4, each term’s centrality score is updated iteratively until the ranking of terms by centrality score is unchanged. The results of the iterative computations are displayed in Table 3.4.

Table 3.4: The result of centrality score by iteration

颱風假

After term filtering, some candidate terms are still irrelevant to the query event even though they are often mentioned in the posts. For example, “板主(Administrator of discussion board)” commonly appears in the posts of PTT, and “粉絲團(Fans group)” is also a frequently mentioned term. Based on the centrality score, these terms are considered representative terms, yet they contribute little semantics related to the query event. Therefore, a clustering method is proposed to solve this problem. By clustering the terms according to their semantics, clusters that convey similar semantic concepts are obtained. Then the relevance score of each cluster to the query event is evaluated. The terms in the cluster with the lowest event relevance score are labeled as entities with poor semantic quality; the others are labeled as entities with good semantic quality. This labeling is consulted when performing sentence selection.

Word2Vec [9], proposed by Google, is used to construct a feature vector for each term. In this thesis, about eighty-two thousand posts collected from the PTT Bulletin Board System are used to train the word embedding model. The similarity between two entity terms is then computed as the cosine similarity of their corresponding vectors. The K-means algorithm is applied to cluster the entity terms into k clusters, with a strategy designed to decide a proper setting of k dynamically. The detailed method is described in the following section, and the corresponding pseudo code for topic clustering of entity terms is shown in Algorithm 1.

First, the K-means algorithm with an initial value of k is performed on the candidate entities (line 5) to get a set of clusters C = {c0, c1, …, ck-1}. The event association score 𝐸𝐴𝑆(𝑐𝑡𝑖, 𝑞𝑡) measures the degree of relevance of a term to the query event term qt by applying Kullback–Leibler divergence (KL divergence). From line 6 to 13, the event relevance score of each cluster c in C is computed as the average event association score 𝐸𝐴𝑆(𝑐𝑡𝑖, 𝑞𝑡) over the terms 𝑐𝑡𝑖 in cluster c. The event association score of a term 𝑐𝑡𝑖 is formulated as follows:

$$\mathit{EAS}(ct_i, qt) = P_d(ct_i \mid qt) \cdot \log_2\!\left(\frac{P_d(ct_i \mid qt)}{P_d(ct_i \mid \overline{qt})}\right) \tag{5}$$

Let PD denote the set of documents in the PTT posts corpus. The conditional probability that 𝑐𝑡𝑖 appears in the documents containing qt, i.e. 𝑃𝑑(𝑐𝑡𝑖|𝑞𝑡), is computed as follows:

$$P_d(ct_i \mid qt) = \frac{\left|\{pd_k \mid pd_k \in PD \wedge ct_i \in pd_k \wedge qt \in pd_k\}\right|}{\left|\{pd_k \mid pd_k \in PD \wedge qt \in pd_k\}\right|} \tag{6}$$

Likewise, 𝑃𝑑(𝑐𝑡𝑖|𝑞𝑡̅) denotes the conditional probability that 𝑐𝑡𝑖 appears in the documents that do not contain 𝑞𝑡.
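Equations 5 and 6 can be sketched as follows, with each post represented as a set of terms. The function name `event_association_score`, the toy corpus and the `floor` parameter are hypothetical. The thesis does not say how a zero denominator is handled, so the floor on the probability without the query term, and the value 0 returned when either group of documents is empty, are assumptions of this sketch.

```python
import math


def event_association_score(term, query, docs, floor=1e-6):
    """EAS(term, query) over docs, a list of sets of terms (one per post)."""
    with_q = [d for d in docs if query in d]
    without_q = [d for d in docs if query not in d]
    if not with_q or not without_q:
        return 0.0  # assumption: undefined case treated as 0
    p_with = sum(term in d for d in with_q) / len(with_q)            # Eq. 6
    p_without = sum(term in d for d in without_q) / len(without_q)
    if p_with == 0:
        return 0.0  # limit of p * log2(p / q) as p approaches 0
    # Assumption: floor the denominator to avoid division by zero.
    return p_with * math.log2(p_with / max(p_without, floor))        # Eq. 5


# Hypothetical corpus: "q" is the query term, "a" a candidate term.
docs = [{"q", "a"}, {"q", "a"}, {"q"}, {"q", "a"},
        {"a"}, {"b"}, {"b"}, {"b"}]
eas_a = event_association_score("a", "q", docs)
print(eas_a)
```

Here "a" appears in 3 of the 4 posts containing the query term and in 1 of the 4 posts without it, so the score is 0.75 · log2(3).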

Initially, the number of clusters, k, is set to 3, as shown in line 3. In line 15, the clusters are sorted by their event relevance scores in descending order. To evaluate whether it is proper to discard the cluster with the lowest relevance score, the average difference of the relevance scores between each pair of neighboring clusters in the ranked list is computed (line 19) to get the average gap.

From line 20 to 25, when the gap between the cluster with the lowest event relevance (ES) score and the cluster ranked immediately above it is larger than the average gap, the cluster with the lowest ES score is assigned to 𝐶𝑙, and the terms belonging to this cluster are labeled as entities with poor semantic quality. Otherwise, the value of k is increased by 1 and the clustering process repeats. Finally, in line 27, the terms in the other clusters are assigned to 𝐶𝑒, which contains the entities with good semantic quality.
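The stopping rule for k can be sketched independently of the clustering itself. In the sketch below, `score_clusters` is a hypothetical callable that runs the clustering for a given k and returns the event relevance score of each cluster; the upper bound `max_k` and the tolerance `tol` (which guards against floating-point noise when two gaps are equal) are assumptions, not part of the thesis.

```python
def lowest_cluster_if_separated(scores, tol=1e-9):
    """Return the id of the lowest-scoring cluster if its gap to the
    cluster ranked just above it exceeds the average gap, else None.

    scores: dict mapping cluster id -> event relevance score (>= 2 entries).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    gaps = [scores[a] - scores[b] for a, b in zip(ranked, ranked[1:])]
    avg_gap = sum(gaps) / len(gaps)
    return ranked[-1] if gaps[-1] - avg_gap > tol else None


def find_poor_cluster(score_clusters, k=3, max_k=10):
    """Increase k until the lowest cluster is clearly separated."""
    while k <= max_k:
        poor = lowest_cluster_if_separated(score_clusters(k))
        if poor is not None:
            return k, poor
        k += 1
    return None  # no clear separation found up to max_k


# Hypothetical scores per k; only the k = 3 values and the C3/C4
# values for k = 4 follow the worked example, the rest are made up.
example = {
    3: {"C1": 0.45, "C2": 0.46, "C3": 0.44},
    4: {"C1": 0.54, "C2": 0.53, "C3": 0.43, "C4": 0.28},
}
decision = find_poor_cluster(example.get)
print(decision)
```

Keeping the decision rule separate from the clustering call makes it easy to test the rule on fixed scores.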

<Example 3-2> Example of Term Topic Clustering

Suppose there are twelve candidate terms in CT and the clustering result for k = 3 is shown in Table 3.5. The evaluation score of 𝐶1, i.e. the average event association score of the terms in 𝐶1, is 0.45. The evaluation scores of 𝐶2 and 𝐶3 are 0.46 and 0.44, respectively. Accordingly, the ranked list of the clusters is 𝐶2, 𝐶1 and 𝐶3, and the average gap is (0.01 + 0.01) / 2 = 0.01. Because the gap of evaluation scores between 𝐶1 and 𝐶3 is 0.01, which is not larger than the average gap, k is set to 4 and clustering is performed again. The clustering result for k = 4 is shown in Table 3.6.

Table 3.5: Example of Clustering on terms when k = 3

Table 3.6: Example of Clustering on terms when k = 4

Cluster id | Cluster Term | EAS(𝑐𝑡𝑖, 𝑞𝑡) | 𝐸𝑆𝑐(𝐶𝑛)
𝐶3 | 員工(Employee) | 0.44 | 0.43
𝐶3 | 服務業(Service industry) | 0.36 | 0.43
𝐶3 | 台鐵(Taiwan Railway) | 0.5 | 0.43
𝐶4 | 粉絲團(Fan Page) | 0.35 | 0.28
𝐶4 | 民族(Ethnic) | 0.26 | 0.28
𝐶4 | 朋友(Friend) | 0.25 | 0.28

As Table 3.6 shows, the ranked list of the clusters is 𝐶1, 𝐶2, 𝐶3 and 𝐶4, where the average gap between their evaluation scores is 0.087. In this situation, the gap between 𝐶3 and 𝐶4, i.e. 0.15, is larger than the average gap. Therefore, 𝐶4 is selected to be 𝐶𝑙. The terms “粉絲團(Fan Page)”, “民族(Ethnic)”, and “朋友(Friend)” are labeled as entities with poor semantic quality.

3.3 Sentence Selection

In this thesis, the generated timeline summary consists of semantically representative sentences. Accordingly, two stages of processing are performed to select event-related sentences: (1) sentence filtering and (2) representative score computation for sentences.

3.3.1 Sentence Filtering

This stage filters out sentences of poor quality, using the entity term clusters obtained above. A sentence is filtered out if it matches any of the following three conditions.

(1) The sentence is incomplete: it is too short to convey enough information.

(2) The sentence is not readable: an unusually long sentence has poor readability. It was observed that such sentences have no punctuation or are full of repeated words or meaningless symbols.

(3) The sentence is not semantically related to the event: it contains more entities with poor semantic quality than entities with good quality.

Let 𝑆 = {𝑠1, 𝑠2, … , 𝑠𝑘} denote the set of all sentences extracted from the posts in PD. Let 𝑇𝐵 and 𝑇𝐺 denote the sets of entity terms with poor quality and with good quality, respectively, taken from Cl and Ce in Section 3.2.

According to the filtering principles 1 and 2, the corresponding processing is performed as follows. Let 𝑎𝑣𝑔𝑙𝑒𝑛(𝑆) denote the average length of the sentences in S. For each sentence s in S, if |𝑙𝑒𝑛(𝑠) − 𝑎𝑣𝑔𝑙𝑒𝑛(𝑆)| > 10, s is removed from S.

Furthermore, the average length of passages in the sentences separated by commas and periods is calculated. If the length of a passage in a sentence is longer than the average length of passages, the sentence will be discarded. Therefore, the remaining sentences are readable.

According to the third filtering principle, for each sentence 𝑠𝑖 in S, the ratio of entity terms with poor semantic quality to the ones with good quality in the sentence is evaluated according to the following function:

$$R(s_i) = \frac{\left|\{t_j \mid t_j \in s_i \wedge t_j \in C_l\}\right|}{\left|\{t_k \mid t_k \in s_i \wedge t_k \in C_e\}\right| + \varepsilon} \tag{7}$$

where 𝜀 is a small constant that avoids division by zero when 𝑠𝑖 does not contain any entity with good quality. If 𝑅(𝑠𝑖) is larger than a given threshold value, which is set to 0.4 according to the experimental results described in Chapter 5, the sentence 𝑠𝑖 is pruned from the candidate sentence set S.
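The length filter and the ratio test of Equation 7 can be sketched together as below. The sketch assumes each sentence comes with the list of entity terms it contains, measures length in characters, and uses 𝜀 = 0.01, the value that appears in the worked example; the function names are hypothetical, and the passage-length rule is left out for brevity.

```python
def poor_good_ratio(terms, poor_terms, good_terms, eps=0.01):
    """Equation 7: distinct poor-quality terms over distinct good ones."""
    distinct = set(terms)
    return len(distinct & poor_terms) / (len(distinct & good_terms) + eps)


def filter_sentences(sentences, poor_terms, good_terms,
                     threshold=0.4, max_len_diff=10, eps=0.01):
    """sentences: list of (text, entity_terms) pairs; returns the kept ones."""
    if not sentences:
        return []
    avg_len = sum(len(text) for text, _ in sentences) / len(sentences)
    kept = []
    for text, terms in sentences:
        if abs(len(text) - avg_len) > max_len_diff:
            continue  # principles 1 and 2: far shorter or longer than average
        if poor_good_ratio(terms, poor_terms, good_terms, eps) > threshold:
            continue  # principle 3: dominated by poor-quality entities
        kept.append((text, terms))
    return kept


poor = {"粉絲團", "民族", "朋友"}
good = {"柯文哲", "颱風假", "首長", "政策"}
sentences = [
    ("柯文哲原本宣布只放半天颱風假引發不少網友不滿", ["柯文哲", "颱風假"]),
    ("只要雨下的很大的時候就會從門縫打進來", []),
    ("加藤軍臺灣粉絲團 2.0 看來颱風夜大家都很無聊", ["粉絲團"]),
]
kept = filter_sentences(sentences, poor, good)
print([text for text, _ in kept])
```

A sentence with no entity terms at all has a ratio of 0 and therefore passes the third test, while a sentence with one poor-quality term and no good-quality term has a ratio of 1 / 0.01 and is pruned.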

3.3.2 Representative Score Computation for Sentences

After sentence filtering according to the above three principles, the representative score 𝑆𝑅𝑆(𝑠𝑖) of every remaining sentence 𝑠𝑖 is evaluated. The assumption is that the more entity terms highly related to the query event a sentence contains, the more representative the sentence is of the query event. Therefore, 𝑆𝑅𝑆(𝑠𝑖) is defined as the weighted sum of the representative scores 𝑇𝑅𝑆(𝑒𝑖) of the entity terms 𝑒𝑖 in 𝑠𝑖, where the weight is the centrality score of 𝑒𝑖, i.e. 𝐶𝑒𝑛𝑡(𝑒𝑖). In addition, a sentence containing terms in 𝑇𝐵 is penalized: the weighted scores of those terms are subtracted from 𝑆𝑅𝑆(𝑠𝑖). Accordingly, 𝑆𝑅𝑆(𝑠𝑖) is defined as:

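$$\mathit{SRS}(s_i) = \sum_{e_i \in s_i \wedge e_i \in T_G} \mathit{TRS}(e_i) \cdot \mathit{Cent}(e_i) \; - \sum_{e_i \in s_i \wedge e_i \in T_B} \mathit{TRS}(e_i) \cdot \mathit{Cent}(e_i) \tag{8}$$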

The representative score of a term 𝑒𝑖, 𝑇𝑅𝑆(𝑒𝑖), measures the semantic importance of an entity term. It considers both how related the term is to the event and how concentrated the term is on a topic cluster. Therefore, writing the entity term as 𝑐𝑡𝑖, 𝑇𝑅𝑆(𝑐𝑡𝑖) is defined as the product of two measures: the event association score 𝐸𝐴𝑆(𝑐𝑡𝑖, 𝑞𝑡) and the concentration score 𝐶𝑆(𝑐𝑡𝑖).

𝑇𝑅𝑆(𝑐𝑡𝑖) = 𝐶𝑆(𝑐𝑡𝑖) ∗ EAS(𝑐𝑡𝑖, 𝑞𝑡) (9)

Generally, the closer a term is to the centroid of its cluster, the more concentrated the term is on a semantic topic. Let 𝑐𝑚 denote the centroid of the cluster that the term 𝑐𝑡𝑖 belongs to. The concentration score of an entity term 𝑐𝑡𝑖, denoted as 𝐶𝑆(𝑐𝑡𝑖), is measured by the cosine similarity between 𝑐𝑡𝑖 and 𝑐𝑚 as follows:

$$\mathit{CS}(ct_i) = \frac{\vec{ct_i} \cdot \vec{c_m}}{\lVert \vec{ct_i} \rVert \, \lVert \vec{c_m} \rVert} \tag{10}$$

The vector of the centroid 𝑐𝑚 is computed by averaging the Word2vec[22] vectors of the terms 𝑐𝑡𝑖 in the cluster. The higher the concentration score, the better the entity term represents its topic.
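Equations 8 to 10 can be combined in a short sketch. Term vectors are plain lists of floats here; in the thesis they come from the trained word embedding model. All function names, the two-dimensional vectors, and the scores assigned to “朋友(Friend)” are hypothetical and serve only to exercise the formulas.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise average of the vectors of one cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def trs(vector, cluster_vectors, eas):
    """Equation 9: concentration score (Equation 10) times EAS."""
    return cosine(vector, centroid(cluster_vectors)) * eas


def srs(terms, good, bad, trs_score, cent):
    """Equation 8: reward good-quality entities, penalize poor-quality ones."""
    distinct = set(terms)
    gain = sum(trs_score[t] * cent[t] for t in distinct if t in good)
    penalty = sum(trs_score[t] * cent[t] for t in distinct if t in bad)
    return gain - penalty


# Hypothetical 2-d cluster with two orthogonal term vectors.
cluster = [[1.0, 0.0], [0.0, 1.0]]
score = trs([1.0, 0.0], cluster, eas=0.5)

# Hypothetical TRS and centrality values for one sentence.
trs_score = {"首長": 0.385, "朋友": 0.1}
cent = {"首長": 1.0, "朋友": 0.5}
sentence_score = srs(["首長", "朋友"], {"首長"}, {"朋友"}, trs_score, cent)
print(score, sentence_score)
```

With these values the sentence score is 0.385 · 1.0 − 0.1 · 0.5 = 0.335.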

<Example 3-3> Example of computation of representative score

According to the result of Example 3-2, the set of entity terms with good quality is 𝑇𝐺 = {“柯文哲(Ko Wen-je)”, “首長(Mayor)”, “政府(Government)”, “政策(Policy)”, “水庫(Reservoir)”, “大潮(Tide)”, “颱風假(Typhoon holiday)”, “員工(Employee)”, “服務業(Service industry)”, “台鐵(Taiwan Railway)”}. The set of entity terms with poor quality is 𝑇𝐵 = {“粉絲團(Fan Page)”, “民族(Ethnic)”, “朋友(Friend)”}. The representative score of each entity term is shown in Table 3.7.

Table 3.7: Representative score of entity term

Entity Term 𝐶𝑆(𝑐𝑡𝑖) EAS(𝑐𝑡𝑖, 𝑞𝑡) 𝑇𝑅𝑆(𝑐𝑡𝑖)

柯文哲(Ko Wen-je) 0.8 0.64 0.512

首長(Mayor) 0.7 0.55 0.385

政策(Policy) 0.8 0.5 0.4

政府(Government) 0.5 0.45 0.225

大潮(Tide) 0.6 0.55 0.33

水庫(Reservoir) 0.3 0.45 0.135

颱風假(Typhoon holiday) 0.75 0.6 0.45

員工(Employee) 0.6 0.44 0.264

服務業(Service industry) 0.4 0.36 0.144

台鐵(Taiwan Railway) 0.7 0.5 0.35

Some example sentences for computing the representative score of sentences are shown in Table 3.8. For the sentence 𝑠4, 𝑅(𝑠4) = 1 / (0 + 0.01) = 100, which is larger than the threshold value. Therefore, 𝑠4 is pruned from S. Similarly, 𝑠7 is also removed. The sentence 𝑠3 contains no entity term in either 𝑇𝐵 or 𝑇𝐺. Therefore, 𝑅(𝑠3) = 0 / 0.01 = 0 and 𝑠3 is retained in S.

Table 3.8: Result of sentence pruning

Id | Sentence Example | Sentence Length | Pruned
𝑠1 | 柯文哲原本宣布只放半天颱風假引發不少網友不滿 (People are not satisfied with the policy of the typhoon holiday announced by the mayor.) | 22 | No
𝑠2 | 自己過去 10 多年擔任地方首長都是下半天停班停課 (It is usually half-day typhoon holiday when I am the mayor for past ten years.) | 22 | No
𝑠3 | 只要雨下的很大的時候就會從門縫打進來 (If the rain is heavy, the water always comes in to the door easily.) | 18 | No
𝑠4 | 加藤軍臺灣粉絲團 2.0 看來颱風夜大家都很無聊 (It seems everyone is bored in night of Typhoon in the fan page of Gato.) | 20 | Yes
𝑠7 | 由於多放了一天假不知道到幹嘛結果陪朋友颱風假去打網咖 (I played the online game with friends because the extra typhoon holiday.) | 26 | Yes
𝑠8 | 北北基不放颱風假昨晚兩小時內政策大轉彎 (The policy of typhoon holiday for Taipei changed in two hours.) | |

Therefore, 𝑆𝑅𝑆(𝑠𝑖) for each sentence in S is shown in Table 3.9.

Table 3.9: Representative score of a sentence

id | Sentence Example | 𝑆𝑅𝑆(𝑠𝑖)

It is usually half-day typhoon holiday when I am the mayor for past ten years.

0.385

𝑠3

只要雨下的很大的時候就會從門縫打進來

If the rain is heavy, the water always comes in to the door easily.

0.0

𝑠8

北北基不放颱風假昨晚兩小時內政策大轉彎

The policy of typhoon holiday for Taipei changed in two hours.

typhoon holiday, so do the stock salesperson.

0.714

Chapter 4 Sub-event Matching and Summarization

This chapter introduces how to summarize user opinions on each sub-event related to the query event. The assumption is that several sub-events occur during the discussion period of the query event, and that most comments in the posts discuss a certain sub-event. Therefore, the important sub-event sentences are first extracted from S to represent sub-events; these are short sentences without commas. The remaining sentences in S, which are delimited by periods, are considered candidate comments on the sub-events. After that, each comment is matched with the discovered sub-events to find its corresponding sub-event. In general, these comments discuss different aspects. Therefore, the comments are clustered into several groups according to the aspects they discuss. Finally, the time at which each sub-event and its discussions appear is used to organize the summary. Each processing step is described individually in the following sections.

The tasks described above are divided into four main processing steps: (1) sub-event sentence selection, (2) sub-event matching, (3) aspect discovering, and (4) timeline summarization of sub-events and comments. The flow chart is shown in Figure 4.1.

Figure 4.1: Flow chart of sub-event matching and summarization

4.1. Sub-event sentence selection

A rule-based method is proposed to extract the sub-event sentences from S. The passages that contain at least one entity term, one verb term, one temporal term, and one location term are selected as sub-event sentences. Within the set of extracted sub-event sentences, denoted as CSE, the 𝑆𝑅𝑆(𝑠𝑖) score is used as the representative score of each sentence 𝑠𝑖 with respect to the query event. Moreover, to avoid selecting several sentences that represent the same sub-event, the diversity of sentences is also considered when selecting the representative sub-event sentences from CSE.
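The rule for recognizing a sub-event sentence can be sketched as a simple predicate. It assumes a passage is available as a list of (term, tag) pairs with verbs tagged "V", and that the entity, temporal and location terms are given as sets; the function name and the sample passage are hypothetical.

```python
def is_sub_event_sentence(tagged, entity_terms, temporal_terms, location_terms):
    """True if the passage has an entity, a verb, a temporal and a location term.

    tagged: list of (term, tag) pairs for one passage.
    """
    terms = {term for term, _ in tagged}
    has_entity = bool(terms & entity_terms)
    has_verb = any(tag == "V" for _, tag in tagged)
    has_time = bool(terms & temporal_terms)
    has_place = bool(terms & location_terms)
    return has_entity and has_verb and has_time and has_place


# Hypothetical tagged passage.
passage = [("台北", "N"), ("昨晚", "N"), ("宣布", "V"), ("颱風假", "N")]
selected = is_sub_event_sentence(passage, {"颱風假"}, {"昨晚"}, {"台北"})
rejected = is_sub_event_sentence(passage, {"颱風假"}, {"昨晚"}, set())
print(selected, rejected)
```

The passages that pass this predicate would form the set CSE, from which the representative sub-event sentences are then chosen.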

The algorithm for selecting the representative sub-event sentences from CSE is as follows. Let SE denote the set of selected representative sub-event sentences, where SE is an empty set initially. At first, the sentence in CSE with the highest 𝑆𝑅S(𝑠j) is selected into SE. Then in each iteration, 𝑆𝑒𝑙𝑒𝑐𝑡𝑆𝑐𝑜𝑟𝑒(sj) is computed for each
