indexing and abstracting. automatic indexing


Kuang-hua Chen

Department of Library and Information Science

National Taiwan University

Indexing and Abstracting

Lecture 09 -- Automatic Indexing


Language & Information Processing System, LIS, NTU

Outline

What is subject indexing?

Types of subject indexing

The taxonomy for subject indexing

Index Structures

The models for automatic indexing

3-tier model for automatic indexing

Natural language processing techniques

Future trends

Subject Indexing

The function of indexing is to describe the "aboutness" of documents

Indexing terms are used to represent the content of a document

Challenge

how to select a set of indexing terms that faithfully represents a document containing thousands of words

Process

subject analysis

subject translation

Process of Subject Indexing

Subject analysis

Analyze the content of texts and distill the subject concepts

Basis of subject indexing

Subject translation

Translate the subject concepts to index terms

Two approaches

natural language indexing (free-text indexing)

controlled vocabulary indexing


Tasks of Indexing

Analysis of the subject content of the document

Review of indexing policies and authorities to aid in the correct assignment of terms

Presentation of the index terms in the appropriate order of the indexing system

Weighting of index terms

Quality control of the index terms

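[Figure: the indexing process and the retrieval process, each consisting of subject analysis followed by subject translation. Diagram labels: document content, subject concepts, subject indexing (retrieval) language, searcher, information need, subject retrieval system, retrieval results, subject analysis, translation, index terms, search terms, storage, input, output.]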

Indexing Consistency

The degree of agreement in the representation of the essential information content of the document by certain sets of indexing terms selected individually and independently by each of the indexers in the group.

Indexing Consistency Rating

All studies indicated that consistency was very low

A figure of 30% was often used

Indexing consistency can vary with several factors

familiarity with the indexing policies

experience with the specific subject

the document most recently indexed


How to Measure Consistency

inter-indexer consistency

the overlap in index term assignment by two or more indexers

intra-indexer consistency

the same indexer indexes the same document at two different times
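The slide does not give a formula for the overlap. One common choice, assumed here, is the number of terms both indexers agree on divided by the number of terms either of them assigned (often called Hooper's measure, equivalent to the Jaccard coefficient). A minimal sketch; the function name and the sample terms are illustrative only:

```python
def indexing_consistency(terms_a, terms_b):
    """Overlap between two sets of index terms, in the range 0.0 to 1.0."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # assumption: two empty assignments count as full agreement
    # terms in common divided by all distinct terms assigned by either indexer
    return len(a & b) / len(a | b)

# Hypothetical term assignments for the same document
indexer_1 = ["automatic indexing", "thesauri", "term weighting"]
indexer_2 = ["automatic indexing", "term weighting", "information retrieval", "stop lists"]

score = indexing_consistency(indexer_1, indexer_2)
print(f"{score:.2f}")
```

The same function covers intra-indexer consistency if the two term lists come from the same indexer at two different times.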


Increase Indexing Consistency

Manual Aids

Vocabulary control

thesauri

scope notes

Indexers may choose not to use manual aids

it takes additional time

the relevance of the aid to the problem is not apparent

the indexer believes there is no problem at all

Machine-Readable Indexing Aids

The indexer’s tools included authority files, policy manuals, handbooks, textbooks, etc.

Machine-readable indexing resources are available.

Pre- versus Post-coordination

Pre-coordinated indexing term

complex/compound concepts are represented in a single term

Post-coordinated indexing term


Controlled versus Uncontrolled

Controlled Indexing

may be selected from a hierarchical thesaurus

may be selected from a list of classification level subject headings

Uncontrolled Indexing

natural language terms (free terms) from texts with or without standardization


Automatic versus Manual

Automatic indexing

Computers carry out the indexing task

Manual indexing

Human indexers carry out the indexing task

Indexing Scheme

Use a 3-tuple to represent each possible indexing scheme

The first element denotes pre-coordinated (+) or post-coordinated (-)

The second element denotes controlled (+) or uncontrolled (-)

The third element denotes automatic (+) or manual (-)

IS(-, +, +) represents post-coordinated, controlled, and automatic indexing
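The 3-tuple notation above can be sketched in code as follows. The class and method names are assumptions made for illustration; only the IS(±, ±, ±) notation comes from the slide:

```python
from itertools import product
from typing import NamedTuple

class IndexingScheme(NamedTuple):
    pre_coordinated: bool  # True is (+), False is (-), i.e. post-coordinated
    controlled: bool       # True is (+), False is (-), i.e. uncontrolled
    automatic: bool        # True is (+), False is (-), i.e. manual

    def notation(self) -> str:
        signs = ["+" if flag else "-" for flag in self]
        return "IS(" + ", ".join(signs) + ")"

    def describe(self) -> str:
        parts = [
            "pre-coordinated" if self.pre_coordinated else "post-coordinated",
            "controlled" if self.controlled else "uncontrolled",
            "automatic" if self.automatic else "manual",
        ]
        return ", ".join(parts)

scheme = IndexingScheme(pre_coordinated=False, controlled=True, automatic=True)
print(scheme.notation(), "=", scheme.describe())

# Three binary elements give eight possible schemes in total
all_schemes = [IndexingScheme(*flags) for flags in product([True, False], repeat=3)]
```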

Automatic Indexing

Most work is devoted to automatic free-text indexing

Few works concern the automatic


Indexing Aims

The effectiveness of any content analysis or indexing system is controlled by two parameters

indexing exhaustivity

the degree to which all aspects of the subject matter of a text item are actually recognized

term specificity

the degree of breadth or narrowness of the terms


Term Specificity

Broad terms cannot distinguish relevant from irrelevant items

Narrow terms retrieve relatively fewer items, but most of the retrieved materials are likely to be helpful to users

Approaches for Automatic Indexing

Semantic Approach

based on understanding texts

domain-dependent

Syntactic Approach

based on syntactic analysis of texts

language-dependent

Statistical Approach

based on the statistics of terms

portable

Term Frequency

Function words

for example, "and", "or", "of", "but", ...

the frequencies of these words are high in all texts

Content words

words that actually relate to document content tend to occur with greatly varying frequencies in the different texts of a collection

the frequency of content word may be used to


A Frequency-Based Indexing Method

Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words

Compute the term frequency $tf_{ij}$ for all remaining terms $T_j$ in each document $D_i$, specifying the number of occurrences of $T_j$ in $D_i$

Choose a threshold frequency $T$, and assign to each document $D_i$ all terms $T_j$ for which $tf_{ij} > T$
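The three steps of the frequency-based method can be sketched as follows. The stop list, the whitespace tokenizer, and the sample text are assumptions made for illustration; a real stop list would be far longer:

```python
from collections import Counter

# Assumption: a tiny illustrative stop list of function words
STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "is", "to"}

def frequency_index(text, threshold):
    """Return the terms T_j whose frequency tf_ij in this document exceeds the threshold."""
    # Step 1: eliminate function words found in the stop list
    tokens = [word for word in text.lower().split() if word not in STOP_LIST]
    # Step 2: compute tf_ij for every remaining term
    tf = Counter(tokens)
    # Step 3: keep only the terms with tf_ij > threshold
    return {term: count for term, count in tf.items() if count > threshold}

doc = "indexing of documents and indexing of terms is the task of automatic indexing"
index_terms = frequency_index(doc, threshold=1)
print(index_terms)
```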


Document Frequency (DF)

The number of documents in a given collection that contain the designated word

$df_j = df(T_j) = \mathrm{NumberOfDocumentContain}(T_j)$
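A minimal sketch of the document frequency count, assuming documents are plain strings split on whitespace (the function name and the sample collection are illustrative):

```python
def document_frequency(term, collection):
    """df_j: number of documents in the collection that contain the term at least once."""
    return sum(1 for doc in collection if term in doc.lower().split())

collection = [
    "automatic indexing of documents",
    "manual indexing by human indexers",
    "retrieval of documents",
]
df_indexing = document_frequency("indexing", collection)
df_retrieval = document_frequency("retrieval", collection)
print(df_indexing, df_retrieval)
```

Note that a term occurring several times in one document still adds only 1 to its document frequency.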

Compose a Single Frequency-Based Indexing Model

Best indexing terms are those that occur frequently in individual documents but rarely in the remainder of the collection

A simple combined term importance indicator is

$w_{ij} = tf_{ij} \times \log \frac{N}{df_j}$
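As a worked example with made-up numbers, and assuming a base-10 logarithm (the slide does not fix the base): a term that occurs 3 times in document $D_i$ and appears in 10 documents of a collection of $N = 1000$ documents gets

```latex
w_{ij} = tf_{ij} \times \log_{10}\frac{N}{df_j}
       = 3 \times \log_{10}\frac{1000}{10}
       = 3 \times 2
       = 6
```

A term with the same term frequency that appears in all 1000 documents gets $3 \times \log_{10} 1 = 0$, so it is useless as an index term however often it occurs.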

An Improved Indexing Policy for Free-Term Indexing

Eliminating common function words

Computing the value of $w_{ij}$ for each term $T_j$ in each document $D_i$

Assigning to the documents of a collection all terms with sufficiently high ($tf \times idf$) factors
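The policy above can be sketched end to end as follows. The stop list, the natural logarithm, the whitespace tokenizer, and the cut-off value are all assumptions; the slide only asks for "sufficiently high" factors:

```python
import math
from collections import Counter

# Assumption: a tiny illustrative stop list of function words
STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "is", "to", "by"}

def tfidf_index(collection, min_weight):
    """Assign to each document the terms whose tf x idf weight reaches min_weight."""
    # Step 1: eliminate common function words
    tokenized = [[w for w in doc.lower().split() if w not in STOP_LIST] for doc in collection]
    n_docs = len(tokenized)
    # df_j: each document counts at most once per term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    assigned = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Step 2: w_ij = tf_ij * log(N / df_j)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Step 3: keep the terms with sufficiently high weight
        assigned.append({t for t, w in weights.items() if w >= min_weight})
    return assigned

collection = [
    "automatic indexing of documents",
    "manual indexing of documents by human indexers",
    "thesauri and thesauri construction",
]
result = tfidf_index(collection, min_weight=1.0)
print(result)
```

In this toy collection "indexing" and "documents" occur in two of the three documents, so their weights fall below the cut-off and they are not assigned.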


Problems of Traditional Model for Controlled-Vocabulary Indexing

Term statistics

Term frequency (TF)

Document frequency (DF)

Inverse document frequency (IDF = log(N/DF))

Traditional model: TF×IDF

High DF words

A = Common words

B = Domain-specific words

C = Subject-specific words

A+B+C are the words with high DF and low IDF


3-Tier Model for Automatic Indexing

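TF of the same document

DF of the same subject

IDF of the same domain

[Figure: the training process and the recognition process. Diagram labels: positive examples, negative examples, features, relevance function, training, new document, features, heading recognition, controlled vocabulary, relevance function database, sample document database.]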

Basic Idea

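\[
\begin{array}{ll}
\text{Subject Headings} & \text{Learning Result} \\
H_1 & R_1 = \{ w_{11}, w_{12}, w_{13}, w_{14}, \ldots, w_{1l} \} \\
H_2 & R_2 = \{ w_{21}, w_{22}, w_{23}, w_{24}, \ldots, w_{2l} \} \\
H_3 & R_3 = \{ w_{31}, w_{32}, w_{33}, w_{34}, \ldots, w_{3l} \} \\
\vdots & \vdots \\
H_j & R_j = \{ w_{j1}, w_{j2}, w_{j3}, w_{j4}, \ldots, w_{jl} \} \\
\vdots & \vdots \\
H_m & R_m = \{ w_{m1}, w_{m2}, w_{m3}, w_{m4}, \ldots, w_{ml} \}
\end{array}
\]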


The Scheme for Term Weight

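[Figure: documents D_11, D_12, D_13, ..., D_1n, D_21, ..., D_m1, ..., D_mn are merged under subject headings H_1, H_2, H_3, ..., H_m; the figure is labeled with DF and IDF.]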

DF versus IDF

                          DF (original set)    IDF (combined set)
Common Words              High                 Low
Domain-specific Words     High                 Low
Subject-specific Words    High                 High

Term Weighting

$Weight = TF \times DF_{original\_set} \times IDF_{combined\_set}$

$Weight_{ik} = TF_{ik} \times OSDF_{nk} \times CSIDF_{mk}$

$Weight_{ik}$ = weight of term $k$ in document $i$

$TF_{ik}$ = frequency of term $k$ in document $i$

$OSDF_{nk}$ = document frequency of term $k$ in original document collection $n$

$CSIDF_{mk}$ = inverse document frequency of term $k$ in combined document collection $m$
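A sketch of the three-factor weight under stated assumptions: documents are sets of tokens grouped by subject heading, OSDF is taken as the share of a heading's documents that contain the term, and CSIDF is a natural-log IDF computed after merging each heading's documents into one unit. The slides do not specify these normalizations, and all function names and data are hypothetical:

```python
import math

def osdf(term, subject_docs):
    """Original-set DF: share of the documents under one heading that contain the term."""
    return sum(term in doc for doc in subject_docs) / len(subject_docs)

def csidf(term, docs_by_heading):
    """Combined-set IDF: each heading's documents are merged into a single unit first."""
    merged = [set().union(*docs) for docs in docs_by_heading.values()]
    df = sum(term in unit for unit in merged)
    return math.log(len(merged) / df) if df else 0.0

def three_tier_weight(tf, term, heading, docs_by_heading):
    """Weight_ik = TF_ik x OSDF_nk x CSIDF_mk."""
    return tf * osdf(term, docs_by_heading[heading]) * csidf(term, docs_by_heading)

# Hypothetical training documents, already tokenized
docs_by_heading = {
    "H1": [{"neural", "network", "method"}, {"neural", "learning", "method"}],
    "H2": [{"library", "catalog", "method"}, {"library", "thesaurus"}],
}
w_neural = three_tier_weight(2, "neural", "H1", docs_by_heading)
w_method = three_tier_weight(2, "method", "H1", docs_by_heading)
print(w_neural, w_method)
```

Here "method" has a high DF under H1 but occurs under every heading, so its combined-set IDF is 0 and it gets no weight, while "neural" is high on both factors. This matches the High/Low pattern in the DF versus IDF table.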

Training Stage

Select experimental texts and controlled vocabulary

Select testing subjects

Train parameters for the proposed model

            Training Set    Testing Set    (Total)
Positive    40,000          20,000         60,000
Negative    400


Testing Stage

Compute the indexing score for testing texts

$IS_j = \text{Indexing Score} = \sum TF \times OSDF \times CSIDF = \sum_{i=1}^{n} (\text{weight of word } i)$

where $n$ is the number of words in the document, and the weight of a word is 0 when the word isn't included in $R_j$

Thresholding

IF $IS_j > M$ THEN $H_j$ is assigned to the document
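The scoring and thresholding rule can be sketched as follows, assuming each learning result R_j is stored as a dictionary from word to learned weight. Any normalization of the score (for example by document length) is not recoverable from the slide and is left out; all names and numbers are hypothetical:

```python
def indexing_score(doc_tokens, heading_weights):
    """IS_j: sum of the learned weights of the document's words.

    Every occurrence of a word adds its weight, so repeated words are
    effectively multiplied by their TF. Words outside R_j contribute 0.
    """
    return sum(heading_weights.get(token, 0.0) for token in doc_tokens)

def assign_headings(doc_tokens, learned, threshold):
    """Return every heading H_j whose indexing score IS_j exceeds the threshold M."""
    return [
        heading
        for heading, weights in learned.items()
        if indexing_score(doc_tokens, weights) > threshold
    ]

# Hypothetical learning results R_j for two subject headings
learned = {
    "H1": {"neural": 0.4, "network": 0.3},
    "H2": {"library": 0.5, "catalog": 0.2},
}
doc = ["neural", "network", "for", "library", "data"]
headings = assign_headings(doc, learned, threshold=0.6)
print(headings)
```

Raising the threshold M assigns fewer headings, which trades recall for precision.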

Evaluation Criteria

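Indexing precision

$\text{Indexing Precision} = \frac{\text{number of documents correctly indexed}}{\text{number of documents indexed by the model}}$

Indexing recall

$\text{Indexing Recall} = \frac{\text{number of documents correctly indexed by the model}}{\text{number of documents in the collection that should be indexed}}$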
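Both criteria can be computed from two sets of document identifiers for one subject heading. A minimal sketch; the function name, the handling of empty sets, and the sample identifiers are assumptions:

```python
def indexing_precision_recall(assigned, should_be_indexed):
    """assigned: documents the model indexed under a heading.
    should_be_indexed: documents that truly belong under that heading."""
    assigned, should_be_indexed = set(assigned), set(should_be_indexed)
    correct = assigned & should_be_indexed
    # Assumption: an empty denominator yields 0.0 rather than an error
    precision = len(correct) / len(assigned) if assigned else 0.0
    recall = len(correct) / len(should_be_indexed) if should_be_indexed else 0.0
    return precision, recall

precision, recall = indexing_precision_recall(
    assigned={"d1", "d2", "d3", "d4"},
    should_be_indexed={"d1", "d2", "d5"},
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```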

Abstract Part

             Training Set                   Testing Set
Threshold    Precision(%)    Recall(%)      Precision(%)    Recall(%)
0.1          77.31           99.62          76.78           96.63
0.2          87.17           98.83          86.47           92.90
0.3          91.68           97.36          91.01           89.49
0.4          93.92           95.32          93.32           86.19
0.5          95.49           92.88          95.00           83.39
0.6          96.33           90.24          95.91           80.62
0.7          96.92           87.31          96.56           77.87
0.8          97.32           84.51          97.01           75.51
0.9          97.62           81.69          97.35           73.15
1.0          97.83           79.09          97.59           71.09

Title Part

             Training Set                   Testing Set
Threshold    Precision(%)    Recall(%)      Precision(%)    Recall(%)
0.1          80.52           76.48          80.87           78.24
0.2          86.85           74.86          86.78           74.38
0.3          89.94           72.86          89.67           70.76
0.4          91.92           70.73          91.54           67.34
0.5          92.98           68.50          92.57           64.48
0.6          93.60           66.09          93.18           61.76
0.7          94.10           63.78          93.71           59.58
0.8          94.51           61.47          94.14           57.67
0.9          94.88           59.30          94.56           55.67
1.0          95.19           57.12          94.92           53.92


[Figure: Abstract Part in Training. Precision and recall (axis range 50% to 100%) plotted against threshold (0.1 to 1.0).]

[Figure: Abstract Part in Testing. Precision and recall (axis range 50% to 100%) plotted against threshold (0.1 to 1.0).]

[Figure: Title Part in Training. Axis range 50% to 100%, threshold 0.1 to 1.0.]

[Figure: Title Part in Testing. Axis range 50% to 100%, threshold 0.1 to 1.0.]


Comparison to Traditional Model

             Training Set                   Testing Set
Threshold    Precision(%)    Recall(%)      Precision(%)    Recall(%)
0.1          86.67           97.80          84.86           84.30
0.2          93.33           89.01          90.51           60.65
0.3          95.08           73.26          91.18           39.20
0.4          96.10           54.45          91.54           23.92
0.5          97.09           38.04          92.62           14.30
0.6          98.53           24.84          95.55           7.94
0.7          99.81           15.46          99.34           4.52
0.8          99.89           9.47           99.60           2.46
0.9          100.00          5.70           100.00          1.36
1.0          100.00          3.45           100.00          0.71


Related Research

               Training Set                    Testing Set
Threshold      Positive(%)    Negative(%)      Positive(%)    Negative(%)
0.3            97.85          7.67             90.54          7.67
0.4            96.11          4.98             87.39          4.98
0.5            94.00          4.25             84.64          4.25
Leung & Kan    89.70          4.88             87.72          6.01

Comparisons for Abstract

Training part

Precision > 90%, when threshold between 0.27 and 0.61

Both precision and recall > 94%, when threshold = 0.43

Testing part

Both precision and recall > 90%, when threshold = 0.27

Training part and testing part

Recall > 96% and keep precision > 77%

Precision > 97% and keep recall > 71%

Comparisons for Title

Training Part

Precision > 90% and recall > 70%, when threshold = 0.4

Both precision and recall > 76%, when threshold = 0.1

Testing Part

Precision > 90% and recall > 70%, when threshold = 0.3

Both precision and recall > 78%, when threshold = 0.1

Training part and testing part

When precision = 90%, we can keep the recall above 70%

The appropriate threshold for this application is 0.27
