
1.2 Tasks and Challenges

This thesis addresses two research tasks: (1) multidocument summarization, and (2) query-focused multidocument summarization. The first aims to produce a generic summary of a set of topically-related documents, while the second aims, given a user query, to generate a query-focused summary of a set of topically-related documents that reflects the particular points relevant to the user's topic(s) of interest. Both tasks are addressed in this thesis using the most common technique for summarization, namely sentence extraction: important sentences are identified, extracted verbatim from the documents, and composed into an extractive summary. The first step towards sentence extraction is to score and rank sentences in order of importance, which is the major focus of this thesis.

1.2.1 Multidocument summarization

Early work on text summarization dealt with single-document summarization. Since the late 1990s, the rapid growth in the availability of online texts has made multidocument summarization a problem worth solving. Given a collection of documents on the same (or related) topic (e.g., news articles on the same event from several newswires), summaries that deliver the majority of the information content shared among the documents while emphasizing their differences would be significantly helpful to a reader.

However, multidocument summarization is much harder than single-document summarization, since several unique issues, such as anti-redundancy and content ordering, need to be addressed. In general, the major challenge of multidocument summarization is to discover similarities across documents, as well as to identify the distinct significant aspects of each one.

According to the definition given in [110], multidocument summarization is the process of producing a single summary of a set of related documents, where three major issues need to be addressed: (1) identifying important similarities and differences among documents, (2) recognizing and coping with redundancy, and (3) ensuring summary coherence. Previous work has investigated various methods for addressing these issues.

Examples include sentence clustering to identify similarities (e.g., [29], [44], [53], [96]), information extraction to facilitate the identification of similarities and differences (e.g., [97]), maximal marginal relevance (MMR) [20] and cross-sentence informational subsumption (CSIS) [111] to remove redundancy, and information fusion (e.g., [9]) and sentence ordering (e.g., [8]) to generate coherent summaries.

For a general overview of the current state of the art, please refer to Chapter 2.

While many approaches to single-document summarization have been extended to deal with multidocument summarization (e.g., [22], [49], [77], [84]), there are still a number of new issues, outlined below, that need to be addressed; see also [44].

(1) Lower compression rate:

Traditionally, a compression rate ranging from 1% to 30% is suitable for single-document summarization [87]. For multidocument summarization, however, the compression rate is typically much lower. For example, [44] found that compression to the 1% or 0.1% level is required for summarizing 200 documents: if, say, 200 documents average 500 words each, a 100-word summary corresponds to a compression rate of only 0.1%.

(2) Anti-redundancy:

The degree of redundancy in a group of related documents is usually high, because each document in the group tends to describe the main points as well as the necessary shared background [44]. It is therefore necessary to minimize redundancy in the summary of multiple documents, i.e., to avoid including similar or repeated information in the summary.

(3) Information fusion:

One problem with selecting a subset of similar passages in extraction-based approaches is that the resulting summary may be biased towards certain sources. Information fusion, which uses reformulation rules to synthesize the information common to a set of related passages, such as repeated phrases, into the summary, can alleviate this problem.

(4) Content ordering:

Content ordering is the organization of information from different sources to ensure the coherence of the summary. In single-document summarization, content ordering can be decided based on the order of appearance in the original document. In multidocument summarization, by contrast, no single document can provide a global ordering of the information in the summary.

In this study, we focus on extraction-based multidocument summarization to produce an extractive generic summary for a set of related news articles on the same event. In the approach that we propose in Chapter 3, the multidocument summarization task is divided into three sub-tasks: (1) ranking sentences according to their likelihood of being part of the summary, (2) eliminating redundancy while extracting the most important sentences, and (3) organizing the extracted sentences into a summary.

The focus of the proposed approach is a novel sentence ranking method for the first sub-task. The idea of modeling a single document as a text relationship map [119] is extended to model a set of topically-related documents as a sentence similarity network (i.e., a network of sentences, in which a node refers to a sentence and an edge indicates that the corresponding sentences are related to each other), based on which a graph-based sentence ranking algorithm, iSpreadRank, is proposed.
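To make the network construction concrete, the following is a minimal sketch, not the exact implementation of Chapter 3: it represents each sentence as a TF-IDF vector and links two sentences with a weighted edge when their cosine similarity exceeds a threshold. The use of scikit-learn and the threshold value of 0.1 are illustrative assumptions.

```python
# Illustrative sketch of building a sentence similarity network.
# Assumptions (not from the thesis): TF-IDF sentence vectors, cosine
# similarity, and an edge threshold of 0.1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_similarity_network(sentences, threshold=0.1):
    """Return an adjacency dict: node i -> {node j: edge weight}."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    network = {i: {} for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > threshold:  # related sentences get an edge
                network[i][j] = sim[i, j]
                network[j][i] = sim[i, j]
    return network
```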

iSpreadRank hypothesizes that the importance of a sentence in the network is related to the following factors: (1) the number of sentences to which it connects, (2) the importance of its connected sentences, and (3) the strength of its relationships with its connected sentences. In other words, iSpreadRank supposes that a sentence which connects to many other important sentences is itself likely to be important. To realize this hypothesis, iSpreadRank applies spreading activation [106] to iteratively re-weight the importance of sentences: each sentence spreads its sentence-specific feature score throughout the network to adjust the importance of the other sentences. Consequently, a ranking indicating the relative importance of the sentences is derived.
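The sketch below illustrates the spreading-activation idea in its simplest form; the actual iSpreadRank activation function in Chapter 3 may differ. Here each sentence starts from its feature-based score and, at every iteration, receives a decayed fraction of its neighbours' activation in proportion to the edge weights, until the scores stabilize. The decay factor and convergence test are assumptions.

```python
# Illustrative spreading-activation ranking over the sentence network.
# Assumptions (not from the thesis): a simple decay factor, score
# normalization, and a convergence tolerance.
def spread_rank(network, feature_scores, decay=0.85, tol=1e-6, max_iter=100):
    """network: {i: {j: weight}}; feature_scores: {i: initial score}."""
    scores = dict(feature_scores)
    for _ in range(max_iter):
        new_scores = {}
        for i in network:
            # A sentence's activation combines its own feature score with
            # the weighted activation spread from its connected sentences.
            spread = sum(scores[j] * w for j, w in network[i].items())
            new_scores[i] = feature_scores[i] + decay * spread
        # Normalize so that activation does not grow without bound.
        total = sum(new_scores.values()) or 1.0
        new_scores = {i: s / total for i, s in new_scores.items()}
        if max(abs(new_scores[i] - scores[i]) for i in scores) < tol:
            scores = new_scores
            break
        scores = new_scores
    return sorted(scores, key=scores.get, reverse=True)  # ranked sentence ids
```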

Given a ranking of sentences, in the second sub-task, a redundancy filtering strategy based on cross-sentence informational subsumption [111] is used to iteratively extract one sentence at a time into the summary, provided it is not too similar to any sentence already included in the summary. In practice, this strategy extracts only high-scoring sentences that carry less redundant information than the others. Finally, in the third sub-task, a sentence ordering policy, which jointly considers the topical relatedness and chronological order of sentences, is employed to organize the extracted sentences into a coherent summary.
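The extraction loop of the second sub-task can be sketched as follows: walk down the ranked list and add a sentence only if its similarity to every already-selected sentence stays below a redundancy threshold. This is a simple stand-in for the CSIS-based test of [111], and the threshold value is an assumption.

```python
# Illustrative greedy extraction with redundancy filtering.
# Assumptions (not from the thesis): pairwise similarity as the redundancy
# test and a threshold of 0.5; CSIS [111] is a richer subsumption check.
def extract_summary(ranked_ids, sim, summary_size, redundancy_threshold=0.5):
    """ranked_ids: sentence ids in descending importance;
    sim: precomputed pairwise similarity matrix."""
    selected = []
    for i in ranked_ids:
        if all(sim[i][j] < redundancy_threshold for j in selected):
            selected.append(i)
        if len(selected) == summary_size:
            break
    return selected  # to be reordered by the sentence ordering policy
```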

The proposed summarization method is evaluated on the DUC 2004 data set and found to perform well. Experimental results show that the proposed method obtains a ROUGE-1 score of 0.38068, which is competitive with that of the 1st-ranked system at DUC 2004.
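For reference, ROUGE-1 measures unigram overlap between a system summary and a reference summary; a simplified recall-oriented version is sketched below. The official DUC scoring additionally applies options such as stemming and averages over multiple references, which this sketch omits.

```python
# Simplified ROUGE-1 recall: overlapping unigrams (clipped by reference
# counts) divided by the total number of unigrams in the reference.
from collections import Counter

def rouge_1_recall(system_summary, reference_summary):
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)
```

ROUGE-2 and ROUGE-SU4, used in Section 1.2.2, apply the same idea to bigrams and skip-bigrams, respectively.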

1.2.2 Query-focused multidocument summarization

Query-focused multidocument summarization is a specialization of multidocument summarization. Given a cluster of documents relevant to a specific topic, a query statement consisting of a set of related questions, and a user profile, the task is to create a brief, well-organized, fluent summary that either answers the information need expressed in the query statement or explains the query, at the level of granularity specified in the user profile. Table 1.3 and Table 1.4 give examples of query statements. The level of granularity can be either specific or general: while a general summary prefers a high-level generalized description biased towards the query, a specific summary should describe and name specific instances of events, people, places, etc.

As stated in [3], this task can be seen as topic-oriented, informative multidocument summarization, where the goal is to produce a single text as a compressed version of a set of documents with a minimum loss of relevant information. This suggests that a good query-focused summary should not only satisfy the information need expressed in the query statement but also cover as much of the important information across the documents as possible [136].

Table 1.3. Query statement for set d357i with granularity specified as “specific”

<topic>
<num> d357i </num>
<title> Boundary disputes involving oil </title>
<narr>
What countries are or have been involved in land or water boundary disputes with each other over oil resources or exploration? How have disputes been resolved, or towards what kind of resolution are the countries moving? What other factors affect the disputes?
</narr>
<granularity> specific </granularity>
</topic>

Table 1.4. Query statement for set d376e with granularity specified as “general”

<topic>
<num> d376e </num>
<title> World Court </title>
<narr>
What is the World Court? What types of cases does the World Court hear?
</narr>
<granularity> general </granularity>
</topic>

In general, the challenges of query-focused multidocument summarization are twofold. The first is to identify important similarities and differences among documents, a common issue in multidocument summarization. The second is to take query-biased characteristics into account during the summarization process.

In this study, we focus on extraction-based query-focused multidocument summarization to produce an extractive query-focused summary, which reflects the particular points relevant to the user's interests, for a set of related news articles on the same event. In the approach that we propose in Chapter 4, the query-focused multidocument summarization task is divided into four sub-tasks: (1) examining the degree of relevance between each sentence and the query statement, (2) ranking sentences according to their degree of relevance to the query and their likelihood of being part of the summary, (3) eliminating redundancy while extracting the most important sentences, and (4) organizing the extracted sentences into a summary.

The first sub-task is addressed as a query-biased sentence retrieval task. For each sentence s and a given query q, the degree of relevance between s and q is measured as the degree of similarity between them, i.e., sim(s, q). Three similarity measures are proposed to assess sim(s, q). The first is computed as the dot product of the vectors of s and q in the vector space model. The second exploits latent semantic analysis (LSA) [32] to fold s and q into a reduced semantic space and computes their similarity based on the transformed vectors of s and q in that space. Finally, following the idea of model averaging, the third linearly combines the similarities obtained from the first two.
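The three measures can be sketched as follows. This is an illustrative implementation, not the one from Chapter 4: the TF-IDF term weighting, the number of LSA dimensions, the use of cosine similarity in the reduced space, and the equal-weight linear combination are all assumptions.

```python
# Illustrative sketch of the three sentence-query similarity measures.
# Assumptions (not from the thesis): TF-IDF weighting, 100 LSA
# dimensions, cosine in the LSA space, and lambda = 0.5.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def similarity_measures(sentences, query, n_dims=100, lam=0.5):
    vectorizer = TfidfVectorizer()
    S = vectorizer.fit_transform(sentences)      # sentence vectors
    q = vectorizer.transform([query])            # query vector

    # Measure 1: dot product in the vector space model.
    sim_vsm = (S @ q.T).toarray().ravel()

    # Measure 2: fold sentences and query into a reduced LSA semantic
    # space, then compare the transformed vectors.
    svd = TruncatedSVD(n_components=min(n_dims, S.shape[1] - 1))
    S_lsa = svd.fit_transform(S)
    q_lsa = svd.transform(q)
    norms = np.linalg.norm(S_lsa, axis=1) * np.linalg.norm(q_lsa)
    sim_lsa = (S_lsa @ q_lsa.ravel()) / np.where(norms == 0, 1, norms)

    # Measure 3: model averaging via a linear combination of the two.
    return lam * sim_vsm + (1 - lam) * sim_lsa
```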

In the second sub-task, several surface-level features are extracted to measure how representative a sentence is with respect to the whole document cluster. The feature scores, which reflect the representative power (i.e., the informativeness) of each sentence, are combined with the sentence's degree of relevance to the query to score all sentences. As for the third sub-task, a novel sentence extraction method, inspired by maximal marginal relevance (MMR) [20] for redundancy filtering, is used to iteratively extract one sentence at a time into the summary, provided it is not too similar to any sentence already included in the summary. In each iteration, all the remaining unselected sentences are re-scored and ranked using a modified MMR function, and the sentence with the highest score is extracted. Finally, in the fourth sub-task, all extracted sentences are simply ordered chronologically to form a coherent summary.
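The extraction loop can be sketched as below. The modified MMR function of Chapter 4 is not reproduced here; the classic MMR trade-off between relevance and redundancy [20], with an assumed lambda of 0.7, stands in for it, and the final chronological ordering is shown as a simple sort by publication time.

```python
# Illustrative MMR-style sentence extraction followed by chronological
# ordering. Assumptions (not from the thesis): the classic MMR formula
# with lambda = 0.7 in place of the modified MMR function of Chapter 4.
def mmr_extract(scores, sim, timestamps, summary_size, lam=0.7):
    """scores: {i: combined relevance/informativeness score};
    sim: pairwise similarity matrix; timestamps: {i: publication time}."""
    selected = []
    candidates = set(scores)
    while candidates and len(selected) < summary_size:
        def mmr(i):
            # Reward relevance, penalize similarity to selected sentences.
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)  # re-score remaining each iteration
        selected.append(best)
        candidates.remove(best)
    # Order the extracted sentences chronologically to form the summary.
    return sorted(selected, key=lambda i: timestamps[i])
```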

The proposed summarization method is evaluated on the DUC 2005 data set and found to perform well. Experimental results show that the proposed method obtains a ROUGE-2 score of 0.07265 and a ROUGE-SU4 score of 0.12568, which are competitive with those of the 1st-ranked and 2nd-ranked systems at DUC 2005.
