

5.1 Multidocument Summarization Framework

This thesis has proposed an extraction-based summarization framework, as shown in Fig. 5.1, for the creation of generic and query-focused summaries of multiple documents. Note that Fig. 5.1 is the union of Fig. 3.1 and Fig. 4.1. In the figure, the “*” symbol indicates that the input/output or the module is specific to multidocument summarization, while the “†” symbol denotes that the input/output or the module is designed for query-focused multidocument summarization. The whole summarization process can be decomposed into three phases: (1) the preprocessing phase preprocesses the input documents and the query statement, if given; (2) the sentence scoring/ranking phase scores sentences and ranks them according to their likelihood of being part of the summary; and (3) the summary production phase extracts important sentences to create a summary. The details are presented and discussed in Chapter 3 and Chapter 4.
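A minimal sketch of this three-phase flow is given below. The function names, the toy frequency-based scoring, and the word-based length limit are illustrative assumptions only, not the actual implementation described in Chapters 3 and 4.

```python
import re
from collections import Counter

def preprocess(documents):
    """Phase 1 (toy): split documents into sentences and tokenize them."""
    sentences = []
    for doc_id, text in enumerate(documents):
        for sent in re.split(r'(?<=[.!?])\s+', text.strip()):
            if sent:
                sentences.append({"doc": doc_id, "text": sent,
                                  "tokens": re.findall(r"\w+", sent.lower())})
    return sentences

def score_and_rank(sentences, query=None):
    """Phase 2 (toy): score sentences by term-frequency centrality plus query overlap."""
    collection = Counter(t for s in sentences for t in s["tokens"])
    query_tokens = set(re.findall(r"\w+", query.lower())) if query else set()
    for s in sentences:
        centrality = sum(collection[t] for t in set(s["tokens"]))
        relevance = len(set(s["tokens"]) & query_tokens)
        s["score"] = centrality + 10 * relevance
    return sorted(sentences, key=lambda s: s["score"], reverse=True)

def produce_summary(ranked, max_words=100):
    """Phase 3 (toy): greedily take top-ranked sentences until the length limit."""
    summary, used = [], 0
    for s in ranked:
        if used + len(s["tokens"]) <= max_words:
            summary.append(s["text"])
            used += len(s["tokens"])
    return " ".join(summary)

def summarize(documents, query=None, max_words=100):
    return produce_summary(score_and_rank(preprocess(documents), query), max_words)
```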

Fig. 5.1. Proposed framework for extraction-based multidocument summarization

The proposed summarization framework has several benefits. First, it is unsupervised, and therefore no training data is required. Second, it is domain- and language-independent, since it relies on neither domain-specific knowledge nor deep linguistic analysis particular to a language. Hence, it is relatively easy to use the summarization approach as a base prototype in any domain and for documents in any language. Third, it is flexible and extensible owing to the underlying modular design. For instance, other surface-level features can be added to help measure the importance of sentences. Finally, the core sentence scoring/ranking module makes it adaptable to producing either short or long summaries of different sizes, based on a ranking over all sentences.

(Fig. 5.1 labels: *†Topically-related Documents; User Profile; Query Statement; *†Preprocessing; *†Feature Extraction; *Sentence Similarity Network Modeling; Query Relevance Analysis; *†Sentence Scoring; *Sentence Ranking: iSpreadRank; *†Sentence Extraction; *†Sentence Ordering; *†Extractive Summary; phases: Preprocessing, Sentence Scoring/Ranking, Summary Production.)

Even though the proposed summarization framework has proven successful to a degree in the evaluations on the DUC 2004 and DUC 2005 data sets, it has some limitations. It is essentially a surface-level approach that relies on features to recognize important sentences (see Section 1.1.3 for the categorization of summarization techniques). Hence, no deep natural language processing is performed, no discourse structure is considered, and no domain-specific knowledge is involved in summarization, which limits the understanding of the input texts. In addition, while the strategy of sentence extraction can place good content in the summaries, it does not guarantee good summary quality in terms of coherence, cohesion, and overall organization.

5.2 Contributions

The principal contributions of this thesis to the field include: (1) an overall introduction to text summarization, (2) a relatively complete survey of the current state of the art in multidocument summarization, (3) a general-purpose extraction-based summarization framework for producing generic and query-focused summaries of multiple documents, (4) a discussion of the characteristics, benefits, and limitations of the proposed summarization framework, and (5) case studies of the proposed summarization framework on the DUC 2004 and DUC 2005 data sets.

In the following, we outline the contributions with respect to the proposed summarization methods.

(1) Multidocument summarization:

Chapter 3 proposes a novel graph-based sentence ranking method, iSpreadRank, to rank sentences according to their likelihood of being part of the summary. The input set of documents is modeled as a sentence similarity network. A feature profile is created to capture the values of surface-level features of all the sentences, and the feature scores serve as the initial importance of nodes in the network. To reason about the relative importance of sentences, iSpreadRank applies spreading activation to iteratively re-weight the importance of sentences: each sentence collects the importance propagated from its connected nodes, as a function of the importance of those nodes and the strength of the relationships between them. iSpreadRank, in fact, operates like a semi-supervised learning process in which the initial labeling of every sentence is determined by its feature score, and the final labeling of sentences is based on both the feature scores of sentences and the relationships between them.
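A condensed sketch of this spreading-activation re-weighting is shown below. The damping constant, the row-normalization of the similarity matrix, and the convergence test are simplifying assumptions for illustration, not the exact iSpreadRank formulation.

```python
import numpy as np

def spreading_activation(similarity, feature_scores, decay=0.85,
                         max_iter=100, tol=1e-6):
    """Re-weight sentence importance over a sentence similarity network.

    similarity:     (n, n) matrix of pairwise sentence similarities.
    feature_scores: length-n vector of initial surface-level feature scores.
    `decay` and the stopping rule are illustrative choices.
    """
    similarity = np.asarray(similarity, dtype=float)
    base = np.asarray(feature_scores, dtype=float)
    # Row-normalize so each node spreads a bounded amount of activation.
    row_sums = similarity.sum(axis=1, keepdims=True)
    transition = np.divide(similarity, row_sums,
                           out=np.zeros_like(similarity), where=row_sums > 0)
    scores = base.copy()
    for _ in range(max_iter):
        # Each sentence collects importance propagated from its neighbours,
        # weighted by the strength of the connecting edges.
        propagated = transition.T @ scores
        updated = (1 - decay) * base + decay * propagated
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores
```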

For summarization, a sentence extraction method, based on cross-sentence informational subsumption (CSIS) for redundancy filtering, iteratively extracts into the summary one sentence at a time that not only has high importance but also has low redundancy with respect to the sentences extracted before it. Finally, a sentence ordering policy, which considers both topical relatedness and chronological order between sentences, is employed to organize the extracted sentences into a coherent summary.
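The sketch below illustrates this extract-then-order step. The plain similarity-threshold test stands in for the CSIS-based redundancy filtering, and the ordering falls back to document/position order only; both are simplifications of the actual method.

```python
def extract_sentences(sentences, scores, similarity, max_sents=10,
                      redundancy_threshold=0.7):
    """Greedy extraction: take the highest-scored sentence whose similarity to
    every previously selected sentence stays below the threshold.
    The threshold test is a stand-in for CSIS-based redundancy filtering."""
    selected = []
    for idx in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        if len(selected) >= max_sents:
            break
        if all(similarity[idx][j] < redundancy_threshold for j in selected):
            selected.append(idx)
    return selected

def order_sentences(selected, sentences):
    """Order extracted sentences by (document, position); a stand-in for the
    combined topical/chronological ordering policy."""
    return sorted(selected, key=lambda i: (sentences[i]["doc"], sentences[i].get("pos", 0)))
```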

The proposed summarization method is evaluated in a case study with the DUC 2004 data set and found to perform well on various ROUGE measures. Experimental results show that the proposed method performs competitively with the top systems at DUC 2004.
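For reference, ROUGE scores of this kind can be reproduced today with the rouge-score Python package, as in the snippet below; the official DUC evaluations used the original ROUGE toolkit, so this is only an illustration of the measure, not the exact DUC protocol, and the example texts are placeholders.

```python
from rouge_score import rouge_scorer

system_summary = "The storm caused severe flooding across the region."
reference_summary = "Severe flooding hit the region after the storm."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
result = scorer.score(reference_summary, system_summary)
for name, score in result.items():
    print(f"{name}: recall={score.recall:.4f}, f1={score.fmeasure:.4f}")
```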

(2) Query-focused multidocument summarization:

Chapter 4 proposes a new scoring method, which combines (1) the degree of relevance of a sentence to the query, and (2) the informativeness of a sentence, to measure the likelihood of sentences being part of the summary. The degree of query relevance of a sentence is assessed as the similarity between the sentence and the query computed in a latent semantic space, and the informativeness of a sentence is estimated using surface-level features. While most research has mainly focused on the identification of query-biased sentences, our idea of taking into account both a query-dependent feature (i.e., the degree of relevance of a sentence to the query) and a query-independent feature (i.e., the informativeness of a sentence) is relatively new. Furthermore, the proposed use of latent semantic analysis (LSA) can relate a sentence and the query semantically and hence obtain a better estimate of their similarity, even when the number of matched keywords between them is small.
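A condensed sketch of this combined scoring is given below, using truncated SVD over a TF-IDF term-sentence matrix as the latent semantic space. The use of scikit-learn, the equal-weight interpolation parameter alpha, and the number of latent dimensions are assumptions for illustration, not the thesis's exact formulation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def combined_scores(sentence_texts, query, informativeness, alpha=0.5, dims=100):
    """Score = alpha * query relevance (cosine in an LSA space)
             + (1 - alpha) * informativeness (surface-level feature score)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    term_matrix = vectorizer.fit_transform(sentence_texts + [query])
    # Project the sentences and the query into a latent semantic space.
    n_components = max(1, min(dims, term_matrix.shape[1] - 1, len(sentence_texts)))
    latent = TruncatedSVD(n_components=n_components).fit_transform(term_matrix)
    sent_vecs, query_vec = latent[:-1], latent[-1:]
    relevance = cosine_similarity(sent_vecs, query_vec).ravel()
    return alpha * relevance + (1 - alpha) * np.asarray(informativeness, dtype=float)
```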

For summarization, a novel sentence extraction method, inspired by maximal marginal relevance (MMR), is also developed to iteratively extract one sentence at a time into the summary, provided it is not too similar to any sentence already extracted. In each iteration, all remaining unselected sentences are re-scored and ranked using a modified MMR function, and the sentence with the highest score is extracted. Finally, the extracted sentences are concatenated in chronological order to form the output summary.
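A minimal MMR-style selection loop consistent with this description is sketched below. The trade-off parameter and the generic MMR objective are assumptions; the thesis uses its own modified MMR re-scoring function.

```python
def mmr_extract(scores, similarity, max_sents=10, lam=0.7):
    """Iteratively pick the sentence maximizing
       lam * score - (1 - lam) * (max similarity to already-selected sentences).
    This is generic MMR, shown only to illustrate the selection loop."""
    candidates = set(range(len(scores)))
    selected = []
    while candidates and len(selected) < max_sents:
        def mmr(i):
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```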

The proposed summarization method is evaluated in a case study with the DUC 2005 data set and found to perform well on various ROUGE measures. Experimental results show that the proposed method performs competitively with the top systems at DUC 2005.
