

Literature Survey

2.1 Multidocument Summarization

[97] pioneered work on multidocument summarization. They established relationships between news stories by aggregating similar extracted templates using logical relationships, such as agreement and contradiction. The summary was constructed by a sentence generator based on the facts and their relationships in the templates. These template-based methods have remained of interest (see [51], [135]). However, manual effort is required to define domain-specific templates, and poorly defined templates may lead to incomplete extraction of facts.

Most recent studies have adopted clustering to identify themes⁸ (i.e., clusters of common information) (e.g., [9], [14], [29], [44], [53], [96], [101]). These approaches are founded on the observation that multiple documents concerning a particular topic tend to contain redundant information, as well as information unique to each [29].

⁸ A theme, also viewed as a sub-topic, is defined as a group of passages (such as sentences and paragraphs) which all convey approximately the same information [96].

Once themes have been recognized, a representative passage in each theme is selected and included in the summary; alternatively, repeated phrases in clusters are exploited to generate an abstract-like summary by information fusion [110].

Typical research on theme clustering is briefly reviewed as follows. [9] and [96] discovered common themes using graph-based clustering based on features such as word co-occurrence, noun phrase matching, synonymy matching, and verb semantic matching. Similar phrases in the identified themes were synthesized into a summary by information fusion using natural language generation. [44] grouped paragraphs into clusters and, from each group, selected for the summary a significant passage with large coverage and low redundancy, as measured by maximal marginal relevance (MMR) [20]. This strategy aims at high relevance to the query while keeping redundancy in the summary low. [29] evaluated several policies for choosing indicative sentences from sentence clusters and concluded that the best policy is to extract, for each cluster, the sentence with the highest sum of relevance scores. [101] clustered sentences into topical regions. Seed paragraphs, each having the maximum total similarity with the others in the same topical region, were taken as the representative passages.
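For concreteness, the MMR criterion of [20] used above can be sketched as a greedy selection loop. The sketch below is in Python; sim_to_query and sim are placeholder similarity functions (e.g., cosine similarity over TF-IDF vectors), and the trade-off weight lam is illustrative rather than a value taken from any cited system.

```python
def mmr_select(sentences, sim_to_query, sim, k=3, lam=0.7):
    """Greedy maximal marginal relevance (MMR) selection.

    sim_to_query(s): relevance of sentence s to the query.
    sim(s, t): similarity between two sentences.
    lam trades off query relevance against redundancy.
    """
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        def mmr(s):
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * sim_to_query(s) - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Each iteration picks the candidate that is most relevant to the query after a penalty for similarity to sentences already chosen, which is exactly how MMR keeps the extract both relevant and non-redundant.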

Other studies have applied information retrieval and statistical methods to find salient concepts, as well as informative words and phrases, in multiple documents (e.g., [43], [49], [65], [77], [111]). For instance, [111] detected a set of statistically important words as the topic centroid of a document cluster; the centroid is treated as a feature and combined with other heuristics to extract sentences. [77] recognized key concepts by calculating log-likelihood ratios of unigrams, bigrams, and trigrams of terms, and then clustered these concept-bearing terms to detect sub-topics. Each sentence in the document set was ranked using the key concepts in order to produce an extractive summary. [49] discussed different strategies for creating signatures of topic themes and evaluated methods for using them in summarization.
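As an illustration of the log-likelihood ratio test underlying such concept detection, the following minimal sketch computes −2 log λ for a single term, assuming only that its counts in the topic documents (k1 of n1 tokens) and in a background corpus (k2 of n2 tokens) are available; it covers the unigram case only.

```python
import math

def _log_l(p, k, n):
    # Binomial log-likelihood; the p in {0, 1} edge cases contribute 0
    # when they are consistent with the counts.
    if p <= 0.0 or p >= 1.0:
        return 0.0 if k in (0, n) else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def llr(k1, n1, k2, n2):
    """-2 log(lambda) for a term seen k1 times in n1 topic tokens and
    k2 times in n2 background tokens; large values mark key terms."""
    p = (k1 + k2) / (n1 + n2)    # null hypothesis: one shared rate
    p1, p2 = k1 / n1, k2 / n2    # alternative: corpus-specific rates
    return 2.0 * (_log_l(p1, k1, n1) + _log_l(p2, k2, n2)
                  - _log_l(p, k1, n1) - _log_l(p, k2, n2))
```

Terms whose statistic exceeds a chi-square threshold are kept as concept-bearing terms; extending the test to bigrams and trigrams only changes what is counted, not the formula.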

Surface-level features extended from the well-developed single-document summarization methods have also been exploited (e.g., [54], [84], [91], [111]). Heuristics-based approaches selectively combine features into a scoring function for discriminating salient text units. Commonly used features include sentence position, the sum of TF-IDF weights in a sentence, similarity with the headline, and similarity with the sentence cluster. Alternatively, other approaches apply machine learning to combine surface-level features automatically from a corpus of documents and their summaries. For instance, [54] used support vector machines (SVM) [132] to learn a sentence ranking model.
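A minimal sketch of such a heuristic scoring function is given below. The three features follow the list above, but the linear weights and normalizations are illustrative assumptions, not values from any cited system.

```python
from collections import Counter

def sentence_score(sent_tokens, position, n_sents, headline_tokens, idf,
                   w_pos=0.3, w_tfidf=0.4, w_head=0.3):
    """Linear combination of three common surface-level features.

    sent_tokens / headline_tokens: lists of tokens; idf: dict mapping
    a term to its inverse document frequency. Weights are illustrative.
    """
    # Position: earlier sentences tend to be more important in news.
    pos_score = 1.0 - position / max(n_sents - 1, 1)
    # Length-normalized sum of TF-IDF weights of the sentence's terms.
    tf = Counter(sent_tokens)
    tfidf_score = (sum(c * idf.get(t, 0.0) for t, c in tf.items())
                   / max(len(sent_tokens), 1))
    # Word overlap with the headline.
    overlap = len(set(sent_tokens) & set(headline_tokens))
    head_score = overlap / max(len(headline_tokens), 1)
    return w_pos * pos_score + w_tfidf * tfidf_score + w_head * head_score
```

A learning-based system such as [54] replaces the hand-set weights with ones fitted on a corpus of documents paired with reference summaries.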

Techniques depending on a thorough analysis of the discourse structure of the text have also been explored (e.g., [11], [15], [22], [62], [139]). [139] developed a cross-document structure theory (CST) to define the rhetorical relationships between sentences across documents; the cohesion of extractive summaries was found to improve when these CST relationships were exploited. [15] and [22] built lexical chains to identify topics in the input texts and ranked sentences according to the number of chain words they contain. [11] constructed noun phrase co-reference chains across documents based on a set of predefined word-level fuzzy relations; the most important noun phrases in important chains were then used to score sentences.
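To illustrate the chain-based ranking step, the sketch below scores tokenized sentences by how many of their words occur in the given lexical chains; constructing the chains themselves (typically with WordNet relations such as synonymy) is assumed to have been done elsewhere.

```python
def rank_by_chains(sentences, chains):
    """Rank tokenized sentences by their overlap with lexical chains.

    sentences: list of token lists.
    chains: list of sets of related words (chain construction,
    e.g. via WordNet, is assumed done beforehand).
    """
    chain_words = set().union(*chains) if chains else set()
    return sorted(sentences,
                  key=lambda s: sum(w in chain_words for w in s),
                  reverse=True)
```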

Researchers have also investigated graph-based approaches. [86] modeled term occurrences as a graph using cohesion relationships (e.g., synonymy and co-reference) among text units. The similarities and differences among documents are pinpointed by applying spreading activation [106] and graph matching, and sentences are extracted with a scoring function that measures term weights in the activated graph. [124] constructed a graph from the similarity relations between sentences; the summary is generated by traversing a minimum-cost path from the first sentence to the last. [138] presented a bipartite graph of texts on which spectral graph clustering is applied to partition sentences into topical groups.
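The path-traversal idea of [124] can be illustrated with a standard shortest-path search. In the sketch below, Dijkstra's algorithm finds a minimum-cost sequence of sentences over a dense cost matrix; deriving the edge costs from sentence dissimilarity (e.g., 1 − cosine similarity) is an assumption for illustration, not necessarily the cost function used in [124].

```python
import heapq

def min_cost_path(costs, source, target):
    """Dijkstra's algorithm over a dense, non-negative cost matrix.

    costs[i][j] is the cost of stepping from sentence i to sentence j
    (e.g., 1 - similarity; an illustrative assumption).
    Returns the list of sentence indices from source to target.
    """
    n = len(costs)
    dist = [float("inf")] * n
    prev = [None] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue            # stale heap entry
        if u == target:
            break
        for v in range(n):
            if v != u and d + costs[u][v] < dist[v]:
                dist[v] = d + costs[u][v]
                prev[v] = u
                heapq.heappush(heap, (dist[v], v))
    path, node = [], target
    while node is not None:     # walk predecessors back to the source
        path.append(node)
        node = prev[node]
    return path[::-1]
```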

Some graph-based methods employ the concept of centrality from social network analysis. [119] first attempted such an approach for single-document summarization, proposing a text relationship map to represent the structure of a document and using degree-based centrality to measure the importance of sentences. Later works following the idea of graph-based document models employed distinct ranking algorithms to determine the centralities of sentences. [39] recognized the most significant sentences with a sentence ranking algorithm, LexRank, which performs PageRank [18] on a sentence-based network under the hypothesis that sentences similar to many other sentences are central to the topic. [38] and [133] examined biased PageRank to extract the topic-sensitive structure beyond the text graph for question-focused summarization. [98] examined several graph ranking methods originally proposed to analyze webpage prestige, including PageRank and HITS [64], for single-document summarization. [100] extended the algorithm of [98] to multiple documents, producing a meta-summary from a set of single-document summaries in an iterative manner. [140] proposed a cue-based hub-authority approach that brings surface-level features into a hub/authority framework; HITS is used in their work to rank sentences.
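To make the centrality computation concrete, the following sketch runs a damped power iteration over a row-normalized sentence similarity matrix, in the spirit of the continuous variant of LexRank [39]; the damping factor 0.85 is the conventional PageRank default rather than a value prescribed by [39].

```python
def lexrank_scores(sim, damping=0.85, iters=50, tol=1e-6):
    """PageRank-style power iteration for sentence centrality.

    sim[i][j]: symmetric sentence similarity (e.g., cosine over
    TF-IDF vectors). Returns one centrality score per sentence.
    """
    n = len(sim)
    # Row-normalize similarities into a stochastic transition matrix.
    trans = []
    for row in sim:
        total = sum(row)
        trans.append([x / total if total else 1.0 / n for x in row])
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n
               + damping * sum(scores[j] * trans[j][i] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, scores)) < tol:
            break               # converged
        scores = new
    return scores  # higher = more central; extract top-ranked sentences
```

Sentences that are similar to many other central sentences receive high scores, which operationalizes the hypothesis stated above.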

Last but not least, other graph-based works build a dependency graph with words as nodes and syntactic relations as links. A good example is [130] for event-focused news summarization, which employed PageRank to identify word entities participating in important events or relationships across all documents.
