
Chapter 2 Literature Review

2.4 Information Retrieval Approach


This study's investigation of leading trends among academic papers considers issues similar to those addressed by the TDT method, but solves them with a different approach. Information retrieval is applied to simplify the set of related problems, and a "batch-oriented" scheme is employed to solve the research question. The information retrieval techniques applied herein are discussed below.

1. Constructing term vectors in document space

Document representation and indexing for statistical purposes can be achieved by representing each textual document as a set of terms. The set of terms defines a space in which each distinct term indicates one dimension. Because each document is represented as a set of terms in this space, the space can be regarded as a "document space" (Salton, 1983; Salton, 1989).

2. Representing Text

The representation of a problem has a strong impact on the accuracy of the generalization of a learning system. A document, which is a string of characters, needs to be transformed into a representation appropriate for the learning algorithm.

Information retrieval suggests that word stems work well as representation units, and that their ordering in a document is insignificant for many tasks. The word stem is derived from the occurrence form of a word by removing case and inflection information (Porter, 1980). For example, "computes", "computing", and "computer" are all mapped to the same stem "comput". "Word" and "word stem" are used synonymously in the following (Joachims, 1998).

Information retrieval produces an attribute-value representation of text. Each distinct word w_i corresponds to a term, and TF(w_i, d) represents the number of times that word w_i occurs in document d. To avoid unnecessarily large term vectors, a word is treated as a term only if it appears in the training data at least three times and is not a "stop-word" (e.g. "and", "or") (Joachims, 1998).
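The representation described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and an illustrative stop-word list; both are hypothetical details not specified in the text:

```python
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller list.
STOP_WORDS = {"and", "or", "the", "a", "of"}

def build_term_vectors(documents, min_count=3):
    """Build raw term-frequency vectors over a shared document space.

    A word becomes a term only if it occurs at least `min_count` times
    in the training data and is not a stop-word (Joachims, 1998).
    """
    # Count total occurrences of each word over all documents.
    totals = Counter(w for doc in documents for w in doc.lower().split())
    terms = sorted(w for w, c in totals.items()
                   if c >= min_count and w not in STOP_WORDS)
    index = {w: i for i, w in enumerate(terms)}

    vectors = []
    for doc in documents:
        vec = [0] * len(terms)
        for w in doc.lower().split():
            if w in index:
                vec[index[w]] += 1  # TF(w_i, d)
        vectors.append(vec)
    return terms, vectors
```

Each document thus becomes one vector in the document space, with one dimension per selected term.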

The basic representation can be improved by scaling the dimensions of the term vector with their inverse document frequency IDF(w_i) (Salton & Buckley, 1988). The value of IDF(w_i) can be derived from the document frequency DF(w_i), the number of training documents in which word w_i occurs:

IDF(w_i) = log(n / DF(w_i))

Here, n denotes the total number of training documents. Intuitively, the inverse document frequency of a word is low if it occurs in many documents, and is highest if the word occurs in only one. To abstract from different document lengths, each document term vector d_i is normalized to unit length (Joachims, 1998).
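The IDF scaling and length normalization above can be sketched as follows; a minimal example operating on raw TF vectors, not a definitive implementation of the study's pipeline:

```python
import math

def tfidf_normalized(vectors):
    """Scale TF vectors by IDF(w_i) = log(n / DF(w_i)), then normalize
    each resulting vector to unit Euclidean length (Joachims, 1998)."""
    n = len(vectors)
    dims = len(vectors[0])
    # DF(w_i): number of documents in which term i occurs at least once.
    df = [sum(1 for v in vectors if v[i] > 0) for i in range(dims)]

    weighted = []
    for v in vectors:
        w = [tf * math.log(n / df[i]) if df[i] else 0.0
             for i, tf in enumerate(v)]
        norm = math.sqrt(sum(x * x for x in w))
        # Unit-length normalization abstracts from document length.
        weighted.append([x / norm for x in w] if norm else w)
    return weighted
```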

3. Term Selection

Text processing typically involves term spaces containing 10,000 dimensions or more, often exceeding the number of available training examples. Term selection is often necessary to enable the use of conventional learning methods, to improve generalization accuracy, and to avoid overfitting (Yang & Pedersen, 1997; Moulinier et al., 1996).

The most popular approach to term selection is to select a subset of the available terms using methods such as DF-thresholding (Yang & Pedersen, 1997), the χ²-test (Schutze et al., 1995), or the term strength criterion (Yang & Wilbur, 1996).

Information gain is the most commonly adopted, and often the most effective, criterion for selecting terms (Yang & Pedersen, 1997; Joachims, 1998). However, because this study adopts the titles of papers as descriptors, each candidate term, obtained by filtering the word stems produced by TextAnalyst, must appear at least three times (Joachims, 1998). A total of 253 single-word terms were thus selected.
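DF-thresholding, the simplest of the selection methods named above, can be sketched as follows. This is an illustrative example of the general technique (Yang & Pedersen, 1997), not the study's exact filtering procedure:

```python
def df_threshold_select(vectors, min_df=3):
    """Keep only the terms whose document frequency reaches min_df
    (DF-thresholding; Yang & Pedersen, 1997)."""
    dims = len(vectors[0])
    # Document frequency of each term across the collection.
    df = [sum(1 for v in vectors if v[i] > 0) for i in range(dims)]
    keep = [i for i in range(dims) if df[i] >= min_df]
    # Project every vector onto the surviving dimensions.
    reduced = [[v[i] for i in keep] for v in vectors]
    return keep, reduced
```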

4. The measurement of the similarity between documents

Benchmarks are required to measure the similarity between documents.

Some commonly-used benchmarks are discussed below.

• Euclidean distance

The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the following familiar Formula (2-2):

d(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)² )          (2-2)

where n is the number of dimensions, and x_k and y_k are the kth attributes (components) of x and y, respectively.
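Formula (2-2) translates directly into code; a minimal sketch assuming two equal-length numeric vectors:

```python
import math

def euclidean_distance(x, y):
    """Formula (2-2): d(x, y) = sqrt(sum_k (x_k - y_k)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For example, the distance between (0, 0) and (3, 4) is 5.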

• Jaccard coefficient

Similarity measures between objects that contain only binary attributes are called similarity coefficients, and always have values between 0 and 1. A value of 1 means that the two objects are completely similar, while a value of 0 means that they are not at all similar.

Let x and y denote two objects consisting of n binary terms. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

f00 = the number of terms where x is 0 and y is 0
f01 = the number of terms where x is 0 and y is 1
f10 = the number of terms where x is 1 and y is 0
f11 = the number of terms where x is 1 and y is 1

The Jaccard coefficient, symbolized by J, is given by the following equation:

J = f11 / (f01 + f10 + f11)
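The four frequencies and the coefficient itself can be computed as follows; a minimal sketch assuming two equal-length binary vectors:

```python
def jaccard(x, y):
    """Jaccard coefficient J = f11 / (f01 + f10 + f11).

    Note that f00 (shared 0 values) is deliberately ignored.
    """
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    denom = f01 + f10 + f11
    return f11 / denom if denom else 0.0
```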

• Extended Jaccard coefficient

The extended Jaccard coefficient can be employed for document data, and reduces to the Jaccard coefficient in the case of binary attributes. The extended Jaccard coefficient is also known as the Tanimoto coefficient. This coefficient, represented by EJ, is defined by the following equation:

EJ(x, y) = (x · y) / (||x||² + ||y||² − x · y)
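A minimal sketch of the extended Jaccard (Tanimoto) coefficient, assuming two equal-length numeric vectors; for binary vectors it yields the same value as the plain Jaccard coefficient:

```python
def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto):
    EJ = x·y / (||x||^2 + ||y||^2 - x·y)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sum(a * a for a in x)  # ||x||^2
    ny = sum(b * b for b in y)  # ||y||^2
    denom = nx + ny - dot
    return dot / denom if denom else 0.0
```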

• Cosine similarity

Documents are often represented as vectors, where each component represents the frequency with which a particular term (word) occurs in the document. However, this is a simplification, since certain common words are ignored, and various processing techniques are utilized to take account of different forms of the same word, different document lengths and different word frequencies.

Document vectors are sparse, so similarity should not depend on the number of shared 0 values, because any two documents are likely to not contain many of the same words. If 0-0 matches were counted, most documents would be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignore 0-0 matches like the Jaccard measure, but also must be able to handle non-binary vectors (Tan et al., 2006).

The cosine similarity is one of the most common measures of document similarity. If x and y are two document vectors, then

cos(x, y) = (x · y) / (||x|| ||y||)

where · denotes the vector dot product, and ||x|| = sqrt(x · x) is the length of vector x.
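The cosine measure can be sketched as follows; a minimal example assuming two equal-length numeric vectors:

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = x·y / (||x|| ||y||); ignores 0-0 matches by
    construction, since zero components contribute nothing."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0
```

Parallel vectors score 1, orthogonal vectors score 0, so the measure is independent of document length.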

This study's investigation of leading trends among academic papers extends the principles of citation analysis and hyperlink analysis. If two papers using the same topic terms have high similarity, they are considered strongly connected. The research suggests that, besides citations between papers and hyperlinks between the web pages of universities, academic intelligence can also be discovered from the topics of papers. The most common approach to content analysis in the literature is TDT. However, this study applies information retrieval techniques rather than TDT, because although TDT is the newest technique for new-issue detection, it is not as straightforward as information retrieval. Moreover, conference papers and journal papers are analyzed here by batch processing, which is not suitable for TDT, since TDT typically processes an event at a single time point.

The development of the emerging-topic detection indices, and the proposal of emerging topics based on the publications and authors of the study, extend this research result and address the second task of TDT, topic emergence detection, described in Section 2.2. We aim to detect the emergence of a new topic