

2.3 Related Research Projects

This section offers a brief introduction to representative research projects in the field.

2.3.1 PERSIVAL

PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video, And Language)9 [94] is designed to provide personalized access to a distributed patient care digital library. The system consists of: (1) a query component that uses clinical context to help formulate user queries, (2) a search component that uses machine learning to find relevant sources and patient information, and (3) a personalized presentation component that uses patient information and domain knowledge to summarize related multimedia resources. A multidocument summarization system, CENTRIFUSER [61] (see also [37], [62], [63]), is integrated into PERSIVAL to support personalized summarization. CENTRIFUSER models all the input documents as a composite topic tree, in which each node stands for one topic (e.g., disease, symptom) extracted from the documents. Using the topic tree, CENTRIFUSER determines which parts of the tree are relevant to the query and the patient information, and then extracts those parts to create a summary.

9 http://persival.cs.columbia.edu/

2.3.2 NewsBlaster

NewsBlaster10 [92], [93] is an online news summarization system that supports topic detection, tracking, and summarization for daily browsing of news. The core summarization module, Columbia Summarizer, is composed of: (1) a router, (2) MultiGen [96], and (3) DEMS [120]. The router determines the type of an input event cluster and forwards the cluster to a suitable summarization module; the type can be single-event, multi-event, biography, or other. MultiGen generates a concise summary based on the detection of similarities and differences across documents.

Machine learning and statistical techniques are exploited to identify groups of similar passages (i.e., themes), followed by information fusion [9] to synthesize common information into an abstractive summary using natural language generation. While MultiGen is designed to cope with topically-related documents, DEMS is more general for loosely-related documents. DEMS combines features for new information detection and uses heuristics to extract important sentences into a summary.

2.3.3 MEAD

MEAD11 [111] is a publicly available, essentially statistical summarizer, developed to produce extractive summaries for either single- or multi-document summarization by sentence extraction. MEAD consists of: (1) a feature extractor, (2) a sentence scorer, and (3) a sentence re-ranker. The feature extractor extracts summarization-related features from sentences, such as position, centroid, cosine similarity with the query, and length. The sentence scorer combines the various features to measure the salience of a sentence.

Finally, the sentence re-ranker iteratively selects candidate summary sentences, avoiding redundancy by checking each candidate's similarity against previously selected ones.

10 http://newsblaster.cs.columbia.edu/

11 http://www.summarization.com/mead/

NewsInEssence12 [108] and WebInEssence [109] are two practical applications of MEAD. Given the user’s interest, NewsInEssence retrieves related news articles from different online newswires and produces an extractive summary according to the user-specified parameters. WebInEssence, instead, is integrated into a general-purpose Web search engine to summarize the returned search results.

2.3.4 GLEANS

GLEANS [30] is a multidocument summarization system. The system classifies each document cluster into one of four categories, according to whether its content concerns a single person, a single event, multiple events, or a natural disaster. For each category, GLEANS maintains a set of predefined templates. Text entities and their logical relations are first identified and mapped into canonical, database-like representations. Then, sentences that conform to predefined coherence constraints are extracted to form the final summary.

2.3.5 NeATS

NeATS [77], [78] is an extractive summarizer for multidocument summarization. The system is composed of: (1) content selection, (2) content filtering, and (3) content presentation. The content selection module recognizes key concepts by calculating likelihood ratios of unigrams, bigrams, and trigrams of terms. The content filtering module extracts sentences based on term frequency, sentence position, stigma words, and maximum marginal relevance [20]. Finally, the content presentation module exploits term clustering and explicit time annotation to organize important sentences into a coherent summary.

iNeATS [71] is a derivative of NeATS. The system allows users to dynamically

12 http://www.newsinessence.com/

control the summarization process. Furthermore, it supports linking from the summary to the original documents, as well as visualization of the spatial information indicated in the summary on a geographical map.

2.3.6 GISTexter

GISTexter [51] is designed to produce both extracts and abstracts for single- and multi-document summarization. The core of GISTexter is an information extraction (IE) system, CICERO [50], which identifies entities and fills relevant information, such as text snippets and co-reference information, into predefined IE-style templates using pattern rules. To generate summaries, GISTexter chooses representative templates and extracts source sentences for the template snippets.

Chapter 3

Multidocument Summarization

Multidocument summarization refers to the process of producing a single summary of a set of topically-related documents (i.e., a set of documents on the same or related, but unspecified topic). In this chapter, we deal with multidocument summarization using an extraction-based approach to create an extractive generic summary of multiple documents.

The proposed approach follows the most common technique for summarization, namely sentence extraction: important sentences are identified and extracted verbatim from documents and are composed into a summary. In the proposed approach, the multidocument summarization task is divided into three sub-tasks:

(1) Ranking sentences according to their likelihood of being part of the summary;

(2) Eliminating redundancy while extracting the most important sentences;

(3) Organizing extracted sentences into a summary.

The focus of the proposed approach is a novel sentence ranking method to perform the first sub-task. The idea of modeling a single document into a text relationship map [119] is extended to model a set of topically-related documents into a sentence similarity network (i.e., a network of sentences, with a node referring to a sentence and an edge indicating that the corresponding sentences are related to each other), based on which a graph-based sentence ranking algorithm, iSpreadRank, is proposed.

iSpreadRank hypothesizes that the importance of a sentence in the network is

related to the following factors: (1) the number of sentences to which it connects, (2) the importance of its connected sentences, and (3) the strength of the relationships between it and its connected sentences. In other words, iSpreadRank supposes that a sentence which connects to many other important sentences is itself likely to be important. To realize this hypothesis, iSpreadRank applies spreading activation [106] to iteratively re-weight the importance of sentences by spreading their sentence-specific feature scores13 throughout the network to adjust the importance of other sentences. Consequently, a ranking indicating the relative importance of sentences is inferred.

Given a ranking of sentences, in the second sub-task, a strategy of redundancy filtering, based on cross-sentence informational subsumption [111], is utilized to iteratively extract one sentence at a time into the summary, if it is not too similar to any sentences already included in the summary. In practice, this strategy only extracts high-scoring sentences with less redundant information than others. Finally, in the third sub-task, a sentence ordering policy, which considers together topical relatedness and chronological order between sentences, is employed to organize extracted sentences into a coherent summary.

This chapter is structured as follows: Section 3.1 introduces the design of the proposed approach to multidocument summarization. Section 3.2 describes technical details of the proposed graph-based sentence ranking algorithm, iSpreadRank, as well as the proposed summarization approach. The experimental results are reported in Section 3.3 and finally Section 3.4 provides discussions about the proposed summarization approach in different aspects.

13 The sentence-specific feature scores work as the local information of every sentence, and are considered together with relationships between sentences to help derive the global information of sentences (i.e., the relative importance of sentences).

3.1 Design

Fig. 3.1 illustrates the design of the proposed approach to multidocument summarization. The input is a group of topically-related documents. The output is a concise summary which provides the condensed essentials of the input documents.

The summarizer takes all the documents as a single document and produces an extractive summary by selecting characteristic sentences from the document group.

All sentences in the document group are first ranked according to their degree of importance. Based on the ranking of sentences, the summarizer then iteratively extracts one sentence at a time, which not only is important but also has less redundancy than other sentences extracted prior to it. The extraction finishes once the required length of the summary is met. The extracted sentences are finally composed into the output summary.

Fig. 3.1. The proposed multidocument summarization approach

The whole summarization process can be decomposed into three phases: (1) the preprocessing phase preprocesses the input documents, (2) the sentence ranking phase ranks sentences according to their likelihood of being part of the summary, and (3) the summary production phase creates the output summary. The entire process, as shown in Fig. 3.1, can be further divided into several stages, namely preprocessing, feature extraction, sentence similarity network modeling, sentence ranking, sentence extraction, and sentence ordering. They are outlined as follows, in order of execution:

(1) Preprocessing:

Several linguistic analysis steps are carried out in this stage. A tokenizer segments text into words, numbers, symbols, and punctuation marks. A sentence splitter identifies the boundaries of sentences. A passage indexer constructs a vector representation of every sentence using the well-known TF-IDF term weighting scheme [118]; for details of the term weighting scheme, please refer to Section 3.2.1.
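A minimal sketch of the tokenization and sentence-splitting steps, using simple regular expressions as stand-ins for the actual linguistic tools used by the system:

```python
import re

# Sketch of the preprocessing stage: a regex-based tokenizer and sentence
# splitter. Real systems use more careful linguistic tools; the patterns
# here are simplifying assumptions.

def split_sentences(text):
    """Split text on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Segment a sentence into words, numbers, and punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

text = "The market fell 2% today. Analysts were not surprised!"
sentences = split_sentences(text)
print(sentences)
print(tokenize(sentences[0]))
```

The TF-IDF vector construction performed by the passage indexer is detailed in Section 3.2.1.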

(2) Sentence similarity network modeling (see Section 3.2.1):

The input documents are transformed into a sentence similarity network, with a node referring to a sentence, and an edge indicating that the corresponding sentences are related to each other. The relationship between a pair of sentences is measured as the level of their lexical overlap.

(3) Feature extraction (see Section 3.2.2):

A feature profile is created to capture the values of various sentence-specific features of all sentences. Three surface-level features are employed: (1) centroid, (2) position, and (3) first-sentence overlap. The feature scores, acting as the local information of every sentence, are integrated into the sentence ranking algorithm

to help derive the global information of sentences (i.e., the relative importance of sentences).

(4) Sentence ranking (see Section 3.2.3):

A graph-based sentence ranking algorithm, iSpreadRank, takes a sentence similarity network and a feature profile as inputs, and applies spreading activation [106] to iteratively re-weight the importance of sentences by spreading their sentence-specific feature scores (computed in the feature extraction stage) throughout the network to adjust the importance of other sentences. A ranking of sentences is finally inferred in order of their relative importance.

(5) Sentence extraction (see Section 3.2.4):

A sentence extraction module, based on cross-sentence informational subsumption [111] for redundancy filtering, iteratively examines sentences in the rank order, and adds one sentence at a time into the summary, if it is not too similar to any sentences already in the summary. Here, the degree of redundancy between two sentences is determined by a threshold imposed on the sentence similarity. In this way, only high-scoring sentences with less redundant information than others are extracted into the summary.
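The redundancy-filtering loop can be sketched as follows; the Jaccard word-overlap similarity, the threshold value, and the toy sentences are illustrative stand-ins for the cosine similarity and threshold actually used by the system.

```python
import re

# Sketch of the redundancy-filtering extraction step: walk the ranked list
# best-first and keep a sentence only if its similarity to every sentence
# already in the summary stays below a threshold.

def jaccard(a, b):
    """Word-set overlap as a crude stand-in for cosine similarity."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def extract(ranked, threshold=0.5, max_sentences=2):
    summary = []
    for sentence in ranked:                           # best-first order
        if all(jaccard(sentence, s) < threshold for s in summary):
            summary.append(sentence)
        if len(summary) == max_sentences:
            break
    return summary

ranked = [
    "A strong earthquake hit the coast on Monday.",
    "An earthquake hit the coast on Monday.",   # near-duplicate, skipped
    "Rescue teams arrived within hours.",
]
print(extract(ranked))
```

The near-duplicate second sentence exceeds the similarity threshold against the first and is skipped, so the third, lower-ranked but novel sentence enters the summary instead.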

(6) Sentence ordering (see Section 3.2.5):

The final summary is structured in the following steps: Semi-similar sentences in the extracted sentence set are first grouped together, based on another similarity threshold smaller than that used in sentence extraction. Each group is then ordered chronologically into a macro-ordering according to the

earliest timestamp of the sentences within it. Finally, micro-ordering is applied to sort all sentences in each group in chronological order. This policy, considering together topical relatedness and chronological order between sentences, is similar to the augmented sentence ordering algorithm proposed in [8], in which the topical relatedness between sentences is determined by text cohesion in their original documents.
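The two-level ordering policy above can be sketched as follows; sentences are represented as (timestamp, text) pairs, and the grouping predicate is an illustrative stand-in for the similarity-threshold grouping described in the text.

```python
# Sketch of the ordering step: group semi-similar extracted sentences,
# order groups by the earliest timestamp of their members (macro-ordering),
# then sort sentences within each group chronologically (micro-ordering).

def order_summary(sentences, same_group):
    groups = []
    for ts, text in sentences:
        for group in groups:
            if any(same_group(text, other) for _, other in group):
                group.append((ts, text))
                break
        else:
            groups.append([(ts, text)])
    groups.sort(key=lambda g: min(ts for ts, _ in g))      # macro-ordering
    ordered = []
    for group in groups:
        ordered.extend(text for _, text in sorted(group))  # micro-ordering
    return ordered

def share_word(a, b):
    """Toy grouping predicate: sentences sharing a non-stopword word."""
    return bool(set(a.lower().split()) & set(b.lower().split()) - {"the", "a", "on"})

sentences = [
    (3, "Rescue crews reached the town."),
    (1, "An earthquake struck the town."),
    (2, "Aid agencies sent supplies."),
]
print(order_summary(sentences, share_word))
```

The two "town" sentences form one group with earliest timestamp 1, so that group precedes the aid sentence, and within the group the earthquake sentence comes first.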

3.2 Algorithm

Section 3.2.1 describes the modeling of a group of documents into a sentence-based network. Section 3.2.2 presents the extraction of sentence-specific features. Section 3.2.3 introduces the graph-based sentence ranking algorithm, iSpreadRank. Section 3.2.4 and Section 3.2.5 provide the methods of sentence extraction and sentence ordering, respectively.

3.2.1 Text as a graph: sentence similarity network

[119] used the techniques for inter-document link generation to produce intra-document links between passages of a document, and obtained a text relationship map (or a content similarity network) to characterize the structure of the text based on its linkage patterns. In this section, the same idea is extended to model a group of documents into a network of sentences that are related to each other, resulting in a sentence similarity network.

Fig. 3.2 gives an example of the network. A sentence similarity network is defined as a graph with nodes and edges linking nodes. Each node in the network stands for a sentence. Two sentences are connected if and only if they are similar to each other. Hence, an edge between two nodes indicates that the corresponding two

sentences are considered to be “semantically related” [119].

Fig. 3.2. A sentence similarity network

In order to construct such a network, each sentence is represented as a vector of weighted terms, based on which the similarity between two sentences is obtained to determine if there exists an edge between them. Let $W = \{t_1, \dots, t_n\}$ ($|W| = n$) denote the set of index terms in the document group. The vector representation of a sentence $s_j$ is specified by Eq. (3.1), where $w_{i,j}$ is the TF-IDF weight of term $t_i$ in $s_j$, given in Eq. (3.2). In Eq. (3.2), $tf_{i,j}$ denotes the frequency of $t_i$ in $s_j$, $N$ denotes the total number of sentences in the document group, and $n_i$ denotes the number of sentences in which $t_i$ appears.

$$\vec{s}_j = (w_{1,j}, w_{2,j}, \dots, w_{n,j}) \qquad (3.1)$$

$$w_{i,j} = tf_{i,j} \times \log\frac{N}{n_i} \qquad (3.2)$$
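The sentence-level TF-IDF weighting can be sketched as follows on a small toy document group; the whitespace tokenization is a simplifying assumption.

```python
import math
import re

# Sketch of sentence-level TF-IDF: each sentence becomes a vector over the
# index terms of the document group, with weight
#   w[i][j] = tf(i, j) * log(N / n_i),
# where N is the number of sentences and n_i the number of sentences
# containing term t_i.

def tfidf_vectors(sentences):
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    terms = sorted(set(t for toks in tokenized for t in toks))
    n = len(sentences)
    df = {t: sum(1 for toks in tokenized if t in toks) for t in terms}
    vectors = [[toks.count(t) * math.log(n / df[t]) for t in terms]
               for toks in tokenized]
    return terms, vectors

terms, vectors = tfidf_vectors([
    "the cat sat",
    "the dog sat",
    "the dog barked",
])
# A term appearing in every sentence (here "the") gets weight log(3/3) = 0.
print(terms)
```

Note that terms occurring in all sentences receive zero weight, so they cannot contribute to any sentence-pair similarity.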

The degree of similarity between a pair of sentences $s_i$ and $s_j$ is computed, by Eq. (3.3), as the cosine of the angle between the vectors $\vec{s}_i$ and $\vec{s}_j$.

$$sim(s_i, s_j) = \frac{\vec{s}_i \cdot \vec{s}_j}{|\vec{s}_i| \times |\vec{s}_j|} \qquad (3.3)$$

An edge between si and sj exists if sim(si, sj) is greater than a similarity threshold, α . In the current implementation, α is empirically set to 0.1.
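Constructing the network from sentence vectors can be sketched as follows; edges are added per Eq. (3.3) and the threshold α, and the toy vectors are illustrative.

```python
import math

# Sketch of sentence similarity network construction: compute the cosine
# similarity of every sentence pair and add an edge whenever it exceeds
# the threshold alpha. Vectors are assumed to be TF-IDF weighted as in
# Section 3.2.1.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_network(vectors, alpha=0.1):
    edges = set()
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > alpha:
                edges.add((i, j))
    return edges

vectors = [
    [1.0, 1.0, 0.0],   # s0
    [1.0, 0.0, 1.0],   # s1: shares one term with s0
    [0.0, 0.0, 2.0],   # s2: shares one term with s1, none with s0
]
print(build_network(vectors))   # s0-s1 and s1-s2 are connected
```

Since s0 and s2 share no terms, their cosine is 0 and no edge is created between them.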

3.2.2 Feature extraction

In the literature, a variety of surface-level features have been profitably employed to determine the likelihood of sentences of being part of the summary (e.g., [66], [76], [103], [137]). Inspired by the success of previous works, we also attempt to integrate the feature scores of sentences into the proposed graph-based sentence ranking algorithm.

We take into account three surface-level features: (1) centroid, (2) position, and (3) first-sentence overlap. All of these features (see Table 3.1) have been evaluated and found to be effective predictors of sentence salience for multidocument summarization in [111].

(1) f1: Centroid

This feature measures the relatedness between a sentence and the centroid of the input set of documents. A sentence with more centroid words is considered to be more central to the topic.

(2) f2: Position

Important sentences tend to appear in particular positions (e.g., the beginning or the end) in the document. This feature is computed as inversely proportional to the position of a sentence from the beginning.

(3) f3: First-sentence overlap

The first sentence usually provides an overview of a document. This feature is determined as the inner-product similarity of a sentence and the first sentence in the same document.

Table 3.1. The sentence-specific feature set (excerpted from [111])

f1 (Centroid): $Score_{f_1}(s) = \sum_{w \in s} f(w, s) \times C(w)$, where $f(w, s)$ is the number of occurrences of term $w$ in sentence $s$, and $C(w)$ is the centroid value of $w$.

f2 (Position): $Score_{f_2}(s) = (n_d - i + 1) / n_d$, where $i$ is the position of sentence $s$ in document $d$, and $n_d$ is the total number of sentences in $d$.

f3 (First-sentence overlap): $Score_{f_3}(s) = \vec{s} \cdot \vec{s}_1$, where $s_1$ is the first sentence of the document containing sentence $s$.

Combination: the overall feature score of a sentence is obtained as a weighted combination of the three feature scores above.

A feature profile is generated to capture the scores of features of all sentences, and is input to the sentence ranking algorithm. Each feature score in the feature profile is further normalized into the range between 0 and 1. The feature scores, acting as the local information of every sentence, are integrated into the sentence ranking algorithm to help derive the global information of sentences (i.e., the relative importance of sentences).
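The three feature scores can be sketched as follows; the toy centroid values and the position formula (n_d - i + 1) / n_d follow common formulations of these features and should be read as assumptions where the text is not explicit.

```python
# Sketch of the three surface-level feature scores of Table 3.1, computed
# for toy inputs. Centroid values C(w) and sentence vectors are assumed to
# be precomputed.

def centroid_score(tokens, centroid):
    """Sum of centroid values over the word occurrences in the sentence."""
    return sum(centroid.get(t, 0.0) for t in tokens)

def position_score(i, n_d):
    """Inversely proportional to the 1-based sentence position i."""
    return (n_d - i + 1) / n_d

def first_sentence_overlap(vec, first_vec):
    """Inner product of the sentence vector with the first sentence's vector."""
    return sum(a * b for a, b in zip(vec, first_vec))

centroid = {"earthquake": 2.0, "coast": 1.5}
tokens = ["an", "earthquake", "hit", "the", "coast"]
print(centroid_score(tokens, centroid))              # words outside the centroid score 0
print(position_score(1, 4))                          # first of four sentences
print(first_sentence_overlap([1, 0, 2], [1, 1, 1]))  # toy vectors
```

Each score would then be normalized into [0, 1] before being combined into the feature profile, as described above.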

3.2.3 Ranking the importance of sentences

The proposed sentence ranking algorithm, iSpreadRank, is designed to rank the importance of sentences for extraction-based summarization. iSpreadRank practically applies spreading activation [106] to realize the hypothesis that the importance of a sentence in the network is related to the following factors: (1) the number of sentences to which it connects; (2) the importance of its connected sentences, and (3) the strength of relationships between it and its connected sentences.

Spreading activation was originally developed in psychology to explain the cognitive process of human comprehension through semantic memory (see [4], [23], [106]). The theory claims that human long-term memory is structured as an associative network in which similar memory units have strong connections and dissimilar units have weak connections or none. Accordingly, memory retrieval is viewed as a search across the network: a set of source nodes is activated with stimuli (or energy), and the energy is then iteratively propagated in parallel along links throughout the network to other connected nodes, to discover further related nodes carrying hidden information.

Spreading activation has recently been applied in many other research fields, such as information retrieval (e.g., [13]), hypertext structure analysis (e.g., [105]), Web trust management (e.g., [143]), collaborative recommendation (e.g., [58]), and so forth. This section takes spreading activation one step further, and discusses the combination of sentence-specific feature scores and the sentence similarity network model together, under the framework of spreading activation, to reason the relative importance of sentences.

3.2.3.1 iSpreadRank

The inputs to iSpreadRank comprise a sentence similarity network (see Section 3.2.1) and a feature profile (see Section 3.2.2). The output is a ranking of sentences indicating the importance of all sentences, in the order from the highest to the lowest.

iSpreadRank adopts a particular model of spreading activation, namely the Leaky Capacitor Model [4], and operates in three steps: (1) initialization, (2) inference, and (3) prediction.

The initialization step transforms the input sentence similarity network into a matrix representation for later computation. The inference step applies spreading activation to reason about the relative importance of sentences: the sentence-specific local importance of each sentence, initialized from the input feature profile, is iteratively spread throughout the whole network to adjust the importance of neighboring sentences. In this step, the algorithm iterates until an equilibrium state of the network is reached. Finally, the prediction step outputs a ranking of sentences according to the results of the inference step. In summary, the goal of iSpreadRank is to assign similar sentences similar degrees of importance, so that they occupy close positions in the resulting ranking.

(1) Initialization:

Let $G = (V, E)$ represent the sentence similarity network with the set of nodes $V = \{s_1, \dots, s_m\}$ and the set of edges $E$, where $s_i$ denotes a sentence, and $E$ is a subset of $V \times V$. For simplicity, every node with no edges connecting it to other nodes is eliminated from $G$. Such a weighted graph representation of the input document group can be transformed into an adjacency matrix, $A$, with rows and columns labeled by sentence nodes, and each entry $a_{ij}$ initialized by Eq. (3.4). Notably, $A$ is a symmetric matrix since $G$ is an undirected graph.

$$a_{ij} = a_{ji} = \begin{cases} sim(s_i, s_j) & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases} \qquad (3.4)$$

In Eq. (3.4), $sim(s_i, s_j)$, as defined in Eq. (3.3), indicates the similarity between a pair of sentences $s_i$ and $s_j$, with $sim(s_i, s_j) \geq \alpha$. Note that $\alpha$ is the similarity threshold mentioned in Section 3.2.1.

(2) Inference:

Each node in the network has an activation level14. The algorithm iteratively updates the activations of all nodes (i.e., sentences) over discrete time steps until it is stopped by the user or a termination condition is triggered. In each iteration, every node obtains a new activation level by collecting the activations of its connected nodes, and then propagates the new activation along its links to its neighbors as a function of its current activation and the strength of its relationships with those nodes.

The iteration itself can be mathematically defined in a simple linear algebra formula. Let X represent an m-dimensional vector to capture the activations of nodes {s1, …, sm} in the network. A particular vector, X(0), is the activation vector at the initial step where the activation of each sentence node is initialized as its sentence-specific feature score computed by feature extraction. At iteration t, the algorithm maintains the activation vector X(t) using Eq. (3.5).

$$X(t) = X(0) + M X(t-1), \quad M = \sigma A^{T} \qquad (3.5)$$

14 The term “activation” in this chapter is interchangeable with the term “importance.” It is used here in order to follow the terminology of spreading activation.

In Eq. (3.5), $\sigma$ ($0 \leq \sigma < 1$) is a spreading factor that determines the propagation efficiency with which a node converts the activations received from its neighbors into its own activation.
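The inference iteration of Eq. (3.5) can be sketched as follows; the convergence test, the spreading factor value, and the toy network are illustrative assumptions, not the system's actual configuration.

```python
# Sketch of the iSpreadRank inference step: starting from the feature-score
# vector X(0), repeatedly apply X(t) = X(0) + sigma * A x X(t-1) until the
# activations stop changing. A is the symmetric adjacency matrix of the
# sentence similarity network (so A equals its transpose).

def spread_rank(A, x0, sigma=0.5, tol=1e-9, max_iter=1000):
    m = len(x0)
    x = list(x0)
    for _ in range(max_iter):
        new_x = [x0[i] + sigma * sum(A[i][j] * x[j] for j in range(m))
                 for i in range(m)]
        if max(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x

# Three sentences: s0 and s1 are strongly similar, s2 is loosely attached to s1.
A = [
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.2, 0.0],
]
x0 = [0.5, 0.5, 0.5]   # equal initial feature scores
scores = spread_rank(A, x0)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Even with equal initial feature scores, the well-connected node s1 accumulates the most activation and is ranked first, followed by s0, illustrating how connectivity and edge strength shape the final ranking. Note that the iteration converges only when the spectral radius of sigma * A is below 1, which the small sigma ensures here.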
