4 Query-Focused Summarization

This section presents the details of the modules introduced in the previous section.

4.1 Relevance Analysis

Given a query q, for each sentence s, the degree of relevance between s and q is evaluated based on their similarity, sim(s, q). Three approaches are introduced in this section to compute sim(s, q). The first is the similarity measured in the traditional vector space model (VSM); the second employs latent semantic analysis (LSA) to derive semantic-level similarity; and the third integrates similarities from VSM and LSA in a linear manner.

4.1.1 Similarity Based on VSM: sim1(s, q)

In the vector space model, since s and q are both represented as weighted vectors using Eq. (26) and Eq. (27) respectively, the similarity between s and q is computed as the inner product of the two corresponding vectors. More specifically, the relevance of s given q is defined as Eq. (28). This model has been proven successful for query-biased sentence retrieval [1] and is used in this work as a competitive baseline.

sim1(s, q) = Σ_{t ∈ s ∩ q} w(t, s) · w(t, q)    Eq. (28)
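The following Python sketch illustrates this inner-product relevance under stated assumptions: w(t, s) is taken to be a plain tf-idf weight, and the identifiers (tfidf_vector, sim1, idf) are ours rather than the report's.

```python
from collections import Counter

def tfidf_vector(tokens, idf):
    """Weighted term vector in the spirit of Eq. (26): term frequency times idf."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def sim1(sentence_tokens, query_tokens, idf):
    """Eq. (28) as described in the text: inner product of the two weighted vectors."""
    s_vec = tfidf_vector(sentence_tokens, idf)
    q_vec = tfidf_vector(query_tokens, idf)
    return sum(w * q_vec[t] for t, w in s_vec.items() if t in q_vec)
```

A higher score simply reflects stronger weighted lexical overlap between s and q, which is why a semantic-level complement such as LSA is introduced next.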

4.1.2 Similarity Based on LSA: sim2(s, q)

In recent years, LSA [9] has been profitably employed in information retrieval to derive the inherent semantic structure of a corpus. We employ LSA to measure the semantic relevance of a sentence s to the query q.

First, a word-by-sentence matrix, A, is built from all sentences, as presented in Eq. (29). In this matrix, columns represent sentences and rows denote the unique words found in the collection. (Note: without loss of generality, m is greater than or equal to n.) The cell at row i and column j (i.e., a_ij), computed by Eq. (26), signifies the weight of word i in sentence j.

Singular Value Decomposition (SVD) is then performed on A. The SVD of A is the product of U, Z, and V: A = U Z V^T, where U is an m×n matrix of left singular vectors, Z is an n×n matrix with the diagonal (σ1, …, σn)19 and zeros elsewhere, and V is an n×n matrix of right singular vectors. In theory, U and V are both orthogonal matrices, satisfying U^T U = V^T V = I, where I is the identity matrix. Z can be interpreted as a semantic space (or the topic structure) derived from the corpus, while U and V can be viewed as the semantic representations of words and sentences in Z, respectively.

Finally, dimension reduction is applied to Z by keeping only the k largest singular values (k ≤ r) to obtain an approximation Z_k. A new matrix, A~, which denotes the semantic representation of A in Z_k, can be obtained by folding A into the reduced space Z_k using Eq. (30).

A~ = A^T U_k Z_k^{-1}    Eq. (30)

19 If rank(A) = r, Z satisfies σ1 ≥ σ2 ≥ … ≥ σr > σr+1 = … = σn = 0.
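As a concrete illustration of Eqs. (29)-(30), the numpy sketch below performs the SVD, truncates to the k largest singular values, and folds all sentences into the reduced space at once. It assumes A is already built with the Eq. (26) weights; the function and variable names are illustrative.

```python
import numpy as np

def reduce_and_fold(A, k):
    """A is the m x n word-by-sentence matrix of Eq. (29), with m >= n.

    Returns U_k, Z_k^{-1}, and A~ (Eq. (30)), whose rows are the sentences
    represented in the k-dimensional semantic space Z_k."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # A = U Z V^T
    U_k = U[:, :k]                       # m x k left singular vectors
    Z_k_inv = np.diag(1.0 / sigma[:k])   # inverse of diagonal Z_k (k <= rank(A), so no zero division)
    A_tilde = A.T @ U_k @ Z_k_inv        # fold every sentence into Z_k (Eq. (30))
    return U_k, Z_k_inv, A_tilde
```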

Similarly, a query q = <t_q,1, …, t_q,m> can be mapped into the same semantic space Z_k with Eq. (31).

q~ = q U_k Z_k^{-1}    Eq. (31)

Thus, the semantic similarity between s and q, i.e., sim2(s, q), can be obtained by Eq. (32).

Conceptually, LSA is the process of discovering relationships among word co-occurrences. Regarding Z_k, the SVD analysis provides information about how words and sentences are related to the k latent semantics. Furthermore, in Eq. (32), multiplying q by U_k (Z_k^{-1})^2 U_k^T can be viewed as query expansion, where synonymy is essentially defined by the similarity of word co-occurrences derived in the semantic space.
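Continuing the sketch above, the query is folded with Eq. (31) and compared against a folded sentence. The report does not spell out Eq. (32), so we assume a cosine measure in the reduced space; note that the inner product q~ · s~ expands to exactly the q U_k (Z_k^{-1})^2 U_k^T s^T form discussed above.

```python
import numpy as np

def sim2(q_vec, A_tilde, sent_index, U_k, Z_k_inv):
    """q_vec is the 1 x m weighted query vector <t_q,1, ..., t_q,m>."""
    q_tilde = q_vec @ U_k @ Z_k_inv       # Eq. (31)
    s_tilde = A_tilde[sent_index]         # a row of A~ from Eq. (30)
    denom = np.linalg.norm(q_tilde) * np.linalg.norm(s_tilde)
    return float(q_tilde @ s_tilde) / denom if denom else 0.0  # assumed Eq. (32)
```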

4.1.3 Hybrid Relevance Analysis: sim3(s, q)

The hybrid relevance analysis is proposed as a linear combination of sim1(s, q) and sim2(s, q) to take advantage of the effectiveness of both approaches. In this combination, the noise introduced by each model can be reduced through model averaging, which is expected to yield more robust sentence relevance. As a result, the proposed hybrid similarity metric is defined as Eq. (33).

sim3(s, q) = λ · sim1(s, q) + (1 − λ) · sim2(s, q)    Eq. (33)
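A minimal sketch of this combination, assuming the convex weight λ used in the reconstruction above (the report states only that the combination is linear):

```python
def sim3(score_vsm, score_lsa, lam=0.5):
    """Eq. (33) sketch: linear combination of sim1 and sim2.
    lam is an assumed interpolation weight in [0, 1]."""
    return lam * score_vsm + (1.0 - lam) * score_lsa
```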

4.2 Surface Feature Extraction

In the literature on text summarization, surface features, such as position and tf-idf weighting, have been studied and proven useful for extracting significant sentences (e.g., [16], [25]). In this work, we re-examine these features in order to understand whether they can serve as salient evidence for sentence scoring in the query-focused summarization task. For a sentence s, we consider five surface features to obtain its feature score: 1) position; 2) average tf-idf weight of significant words; 3) similarity with the title; 4) similarity with the document centroid; and 5) similarity with the topic centroid.

f1 – position: it is believed that important sentences are usually located at particular positions in a document. In a news article, for instance, the first sentence typically introduces the main topic or summarizes the whole story. The position score is defined in Eq. (34), which was proposed by Hirao et al. [12]. In Eq. (34), |D| is the number of words in the document D that contains s, and NC(s) is the number of words appearing before s in D. Under this mechanism, the first sentence obtains the highest score and the last one the lowest.

f1(s) = (|D| − NC(s)) / |D|    Eq. (34)
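A small sketch of the position score as reconstructed in Eq. (34); nc_s and doc_len are illustrative parameter names.

```python
def f1_position(nc_s, doc_len):
    """nc_s = NC(s), the number of words appearing before s in D; doc_len = |D|.
    The first sentence (nc_s = 0) scores 1.0; later sentences decay linearly."""
    return (doc_len - nc_s) / doc_len
```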

f2 – avg. tf-idf weight of significant words: generally, terms with higher term frequency and tf-idf values are more important, implying that a sentence with a higher sum of tf-idf values over its constituent words tends to be a substantial one. In previous studies, all words in a sentence are taken into account and the total score is averaged over the length of the sentence. In our work, in order to obtain a more precise weight for a sentence s, we consider only the significant words in s. The average tf-idf score is computed as shown in Eq. (35), where w(t, s) is the weight defined in Eq. (26).

A significant word is defined as a keyword t that satisfies the criterion shown in Eq. (36), where u is the mean and σ is the standard deviation of all w(t, C), and w(t, C) is the sum of all tf-idf values for t over all sentences in the document collection C.
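The sketch below assumes the Eq. (36) criterion keeps terms whose collection weight w(t, C) exceeds the mean u by one standard deviation σ; the exact threshold is not recoverable from the text, so treat it, and all identifiers, as assumptions.

```python
import statistics

def significant_words(w_collection):
    """w_collection maps each term t to w(t, C), its summed tf-idf over C."""
    u = statistics.mean(w_collection.values())
    sigma = statistics.pstdev(w_collection.values())
    return {t for t, w in w_collection.items() if w > u + sigma}  # assumed Eq. (36)

def f2_avg_tfidf(sentence_weights, sig_words):
    """Eq. (35) sketch: average w(t, s) over the significant words in s."""
    hits = [w for t, w in sentence_weights.items() if t in sig_words]
    return sum(hits) / len(hits) if hits else 0.0
```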

f3 – similarity with title: the title generally sums up the main theme of a document. In other words, the more similar a sentence is to the title, the more important it is. This similarity is measured by Eq. (37), where s_title is the title sentence.

f3(s) = (s · s_title) / (|s| · |s_title|)    Eq. (37)

f4 – similarity with document centroid: this measures the centrality of a sentence with respect to the whole document, i.e., the similarity between s and the centroid of the document. Generally speaking, if a sentence shares more concepts with the other sentences in the same document, it tends to be more significant. This feature score is obtained with Eq. (38), in which D_centroid is the average vector representation of all sentences in the document D.

f4(s) = (s · D_centroid) / (|s| · |D_centroid|)    Eq. (38)

f5 – similarity with topic centroid: similar to f4, this feature estimates the similarity of a sentence to the centroid of the topic cluster. The score is computed as Eq. (39), where T_centroid is the average vector representation of all sentences in the document collection.
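Since f3, f4, and f5 all compare a sentence vector against a reference vector, one sketch covers all three. Cosine similarity and the helper names are assumptions consistent with the norm bars in Eqs. (37)-(38).

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def centroid(vectors):
    """Average vector representation of a list of sentence vectors."""
    return np.mean(vectors, axis=0)

def surface_similarities(s_vec, title_vec, doc_vecs, topic_vecs):
    f3 = cosine(s_vec, title_vec)             # Eq. (37): similarity with title
    f4 = cosine(s_vec, centroid(doc_vecs))    # Eq. (38): document centroid
    f5 = cosine(s_vec, centroid(topic_vecs))  # Eq. (39): topic centroid
    return f3, f4, f5
```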

The sentence score denotes the importance of a sentence and is exploited as the judgment for determining whether a sentence should be included in the output summary. We define the score of a sentence s by taking into account: 1) its relevance to the query q; and 2) the salience of its low-level features. Eq. (40) delineates the scoring function, in which each feature f_i is assigned a weight for the linear combination.
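A hedged sketch of the Eq. (40) scoring function: the text states only that the query relevance and the five feature scores are combined linearly with per-feature weights, so alpha and weights below are assumed names.

```python
def sentence_score(relevance, features, weights, alpha=0.5):
    """relevance = sim(s, q); features = [f1(s), ..., f5(s)].
    alpha balances query relevance against the surface features, and
    weights[i] is the weight attached to f_i (Eq. (40) sketch)."""
    assert len(features) == len(weights)
    feature_part = sum(w * f for w, f in zip(weights, features))
    return alpha * relevance + (1.0 - alpha) * feature_part
```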
