
indexing and abstracting. automatic indexing


Academic year: 2021


Kuang-hua Chen

Department of Library and Information Science

National Taiwan University

Indexing and Abstracting

Lecture 09 -- Automatic Indexing


Language & Information Processing System, LIS, NTU

Outline

- What is subject indexing?
- Types of subject indexing
- The taxonomy for subject indexing
- Index structures
- The models for automatic indexing
- 3-tier model for automatic indexing
- Natural language processing techniques
- Future trends

Subject Indexing

- The function of indexing is to describe the "aboutness" of documents
- Indexing terms are used to represent the content of a document
- Challenge
  - How to select a set of indexing terms that faithfully represents a document containing thousands of words
- Process
  - Subject analysis
  - Subject translation

Process of Subject Indexing

- Subject analysis
  - Analyze the content of texts and distill the subject concepts
  - Basis of subject indexing
- Subject translation
  - Translate the subject concepts to index terms
  - Two approaches
    - Natural language indexing (free-text indexing)
    - Controlled vocabulary indexing


Tasks of Indexing

- Analysis of the subject content of the document
- Review of indexing policies and authorities to aid in the correct assignment of terms
- Presentation of the index terms in the appropriate order of the indexing system
- Weighting of index terms
- Quality control of the index terms

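[Figure: the indexing process and the retrieval process. Indexing process: document content, subject analysis, subject concepts, subject translation, index terms, storage. Retrieval process: searcher's information need, subject analysis, subject concepts, subject translation, search terms, input. Both processes use the indexing (retrieval) language and meet in the retrieval system, whose output is the search results.]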

Indexing Consistency

- The degree of agreement in the representation of the essential information content of the document by certain sets of indexing terms selected individually and independently by each of the indexers in the group

Indexing Consistency Rating

- All studies indicated that consistency was very low
  - A figure of 30% was often used
- Indexing consistency can vary with several factors
  - Familiarity with the indexing policies
  - Experience with the specific subject
  - The document most recently indexed


How to Measure Consistency

- Inter-indexer consistency
  - The overlap in index term assignment by two or more indexers for the same document
- Intra-indexer consistency
  - The same indexer indexes the same document at two different times
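The slide does not name a formula for "overlap". As a minimal sketch, the Python below assumes Hooper's measure, one commonly used choice: the number of terms both indexers assigned divided by the number of distinct terms either of them assigned. The function name and the sample term lists are illustrative only.

```python
def hooper_consistency(terms_a, terms_b):
    """Shared terms divided by all distinct terms used by either indexer."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # assumption: two empty term sets count as full agreement
    return len(a & b) / len(a | b)


# Hypothetical term assignments by two indexers for the same document.
indexer_1 = ["automatic indexing", "term weighting", "thesauri"]
indexer_2 = ["automatic indexing", "thesauri", "information retrieval", "stop lists"]

# 2 shared terms out of 5 distinct terms.
print(hooper_consistency(indexer_1, indexer_2))
```

The same function can be applied to intra-indexer consistency by passing the term sets one indexer produced for the same document at two different times.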


Increase Indexing Consistency

- Manual aids
  - Vocabulary control
  - Thesauri
  - Scope notes
- The indexer may choose not to use manual aids
  - It takes additional time
  - The relevance of the aid to the problem is not apparent
  - The indexer believes there is no problem at all

Machine-Readable Indexing Aids

- The indexer's tools included authority files, policy manuals, handbooks, textbooks, etc.
- Machine-readable indexing resources are available.

Pre- versus Post-coordination

- Pre-coordinated indexing term
  - Complex/compound concepts are represented in a single term
- Post-coordinated indexing term


Controlled versus Uncontrolled

- Controlled indexing
  - Terms may be selected from a hierarchical thesaurus
  - Terms may be selected from a list of classification-level subject headings
- Uncontrolled indexing
  - Natural language terms (free terms) from texts, with or without standardization


Automatic versus Manual

- Automatic indexing
  - Computers carry out the indexing task
- Manual indexing
  - Human indexers carry out the indexing task

Indexing Scheme

- Use a 3-tuple to represent each possible indexing scheme
  - The first element denotes pre-coordinated (+) or post-coordinated (-)
  - The second element denotes controlled (+) or uncontrolled (-)
  - The third element denotes automatic (+) or manual (-)
- IS(-, +, +) represents post-coordinated, controlled, and automatic indexing
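The 3-tuple notation maps naturally onto a small record type. The sketch below is only an illustration; the class name `IndexingScheme` and its field and method names are assumptions, not part of the lecture.

```python
from typing import NamedTuple


class IndexingScheme(NamedTuple):
    """Hypothetical record for the IS(x, y, z) notation; True stands for (+)."""
    pre_coordinated: bool  # False means post-coordinated (-)
    controlled: bool       # False means uncontrolled (-)
    automatic: bool        # False means manual (-)

    def notation(self) -> str:
        signs = ["+" if flag else "-" for flag in self]
        return "IS(" + ", ".join(signs) + ")"

    def describe(self) -> str:
        return ", ".join([
            "pre-coordinated" if self.pre_coordinated else "post-coordinated",
            "controlled" if self.controlled else "uncontrolled",
            "automatic" if self.automatic else "manual",
        ])


scheme = IndexingScheme(pre_coordinated=False, controlled=True, automatic=True)
print(scheme.notation(), "=", scheme.describe())
```

Three binary elements give eight possible schemes in total.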

Automatic Indexing

- Most works are devoted to automatic free-text indexing
- Few works concern the automatic


Indexing Aims

- The effectiveness of any content analysis or indexing system is controlled by two parameters
  - Indexing exhaustivity
    - The degree to which all aspects of the subject matter of a text item are actually recognized
  - Term specificity
    - The degree of breadth or narrowness of the terms


Term Specificity

- Broad terms cannot distinguish relevant from irrelevant items
- Narrow terms retrieve relatively fewer items, but most of the retrieved materials are likely to be helpful to users

Approaches for Automatic Indexing

- Semantic approach
  - Based on understanding texts
  - Domain-dependent
- Syntactic approach
  - Based on syntactic analysis of texts
  - Language-dependent
- Statistical approach
  - Based on the statistics of terms
  - Portable

Term Frequency

- Function words
  - For example, "and", "or", "of", "but", ...
  - The frequencies of these words are high in all texts
- Content words
  - Words that actually relate to document content tend to occur with greatly varying frequencies in the different texts of a collection
  - The frequency of a content word may be used to


A Frequency-Based Indexing Method

- Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words
- Compute the term frequency $tf_{ij}$ for all remaining terms $T_j$ in each document $D_i$, specifying the number of occurrences of $T_j$ in $D_i$
- Choose a threshold frequency $T$, and assign to each document $D_i$ all terms $T_j$ for which $tf_{ij} > T$
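The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: whitespace tokenization, lowercasing, and a tiny stop list are stand-ins chosen here, and the function name `frequency_index` is hypothetical.

```python
from collections import Counter

# Assumption: a tiny illustrative stop list; real stop lists are much longer.
STOP_LIST = {"a", "an", "and", "but", "in", "is", "of", "or", "the", "to"}


def frequency_index(documents, threshold):
    """For each document D_i, return the terms T_j with tf_ij > threshold."""
    index = []
    for text in documents:
        # Step 1: drop function words found in the stop list.
        tokens = [t for t in text.lower().split() if t not in STOP_LIST]
        # Step 2: count occurrences of each remaining term (tf_ij).
        tf = Counter(tokens)
        # Step 3: keep the terms whose frequency exceeds the threshold T.
        index.append(sorted(term for term, count in tf.items() if count > threshold))
    return index


docs = [
    "indexing of documents and indexing of terms",
    "the retrieval of documents is the goal of retrieval",
]
print(frequency_index(docs, threshold=1))
```

With a threshold of 1, only terms that occur at least twice in a document are kept, so "documents" is dropped from both sample texts even though it appears in each.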


Document Frequency (DF)

- The number of documents in a given collection that contain the designated word
- $df_j = df(T_j) = \mathrm{NumberOfDocumentContain}(T_j)$

Compose a Single Frequency-Based Indexing Model

- Best indexing terms are those that occur frequently in individual documents but rarely in the remainder of the collection
- A simple combined term importance indicator is

$$w_{ij} = tf_{ij} \times \log\frac{N}{df_j}$$

  where $N$ is the number of documents in the collection
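As a minimal sketch, the combined indicator tf × log(N/df) can be computed as follows. The slide does not state the base of the logarithm, so the natural logarithm is an assumption here; the function name and the pre-tokenized sample documents are illustrative.

```python
import math
from collections import Counter


def tf_idf_weights(tokenized_docs):
    """Return one {term: w_ij} dict per document, with w_ij = tf_ij * log(N / df_j)."""
    n_docs = len(tokenized_docs)
    # df_j: number of documents that contain term T_j at least once.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)  # tf_ij: occurrences of T_j in D_i
        weights.append({
            term: count * math.log(n_docs / df[term])  # natural log (assumption)
            for term, count in tf.items()
        })
    return weights


docs = [
    ["indexing", "indexing", "term"],
    ["retrieval", "term"],
    ["term", "thesaurus"],
]
for doc_weights in tf_idf_weights(docs):
    print(doc_weights)
```

In this sample, "term" occurs in all three documents, so log(N/df) is log(1) and its weight is zero everywhere, while "indexing" is frequent in one document and absent elsewhere, so it receives the highest weight.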

An Improved Indexing Policy for Free-Term Indexing

- Eliminate common function words
- Compute the value of $w_{ij}$ for each term $T_j$ in each document $D_i$
- Assign to the documents of a collection all terms with sufficiently high ($tf \times idf$) factors


Problems of Traditional Model for Controlled-Vocabulary Indexing

- Term statistics
  - Term frequency (TF)
  - Document frequency (DF)
  - Inverse document frequency (IDF = log(N/DF))
- Traditional model: TF×IDF
- High-DF words
  - Common words
  - Domain-specific words
  - Subject-specific words

[Figure: three regions labeled A, B, and C]

- A + B + C are the words with high DF and low IDF
  - A = Common words
  - B = Domain-specific words
  - C = Subject-specific words


3-Tier Model for Automatic Indexing

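- TF of the same document
- DF of the same subject
- IDF of the same domain

[Figure: flow diagram with two parts. Training process: document database, sampling, positive examples, negative examples, features, relevance-function training, database of controlled-vocabulary relevance functions. Identification process: new document, features, heading identification.]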

Basic Idea

$$
\begin{array}{ll}
\text{Subject Headings} & \text{Learning Result} \\
H_1 & R_1 = \{w_{11}, w_{12}, w_{13}, w_{14}, \ldots, w_{1l}\} \\
H_2 & R_2 = \{w_{21}, w_{22}, w_{23}, w_{24}, \ldots, w_{2l}\} \\
H_3 & R_3 = \{w_{31}, w_{32}, w_{33}, w_{34}, \ldots, w_{3l}\} \\
\vdots & \vdots \\
H_j & R_j = \{w_{j1}, w_{j2}, w_{j3}, w_{j4}, \ldots, w_{jl}\} \\
\vdots & \vdots \\
H_m & R_m = \{w_{m1}, w_{m2}, w_{m3}, w_{m4}, \ldots, w_{ml}\}
\end{array}
$$


The Scheme for Term Weight

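[Figure: the documents of each subject heading (D11, D12, D13, ..., D1n for H1; D21, ... for H2; ...; Dm1, ..., Dmn for Hm) are merged into one combined set per heading H1, H2, H3, ..., Hm. The diagram is labeled with DF and IDF.]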

DF versus IDF

                          DF (original set)   IDF (combined set)
Common words              High                Low
Domain-specific words     High                Low
Subject-specific words    High                High

Term Weighting

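$$
\begin{aligned}
\mathit{Weight} &= \mathit{TF} \times \mathit{DF}\ (\text{original set}) \times \mathit{IDF}\ (\text{combined set}) \\
\mathit{Weight}_{ik} &= \mathit{TF}_{ik} \times \mathit{OSDF}_{nk} \times \mathit{CSIDF}_{mk}
\end{aligned}
$$

where

- $\mathit{Weight}_{ik}$ = weight of term $k$ in document $i$
- $\mathit{TF}_{ik}$ = frequency of term $k$ in document $i$
- $\mathit{OSDF}_{nk}$ = document frequency of term $k$ in original document collection $n$
- $\mathit{CSIDF}_{mk}$ = inverse document frequency of term $k$ in combined document collection $m$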
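The slides give the weight as the product TF × OSDF × CSIDF but do not spell out how each factor is normalized. The sketch below therefore rests on loudly labeled assumptions: OSDF is taken as the fraction of original documents under a heading that contain the term, and CSIDF as log(number of headings / number of headings whose merged documents contain the term). All names and the sample data are hypothetical.

```python
import math
from collections import Counter


def three_tier_weights(doc_tokens, heading, docs_by_heading):
    """Weight = TF * OSDF * CSIDF for every term of one document under one heading."""
    originals = docs_by_heading[heading]
    n_headings = len(docs_by_heading)
    weights = {}
    for term, tf in Counter(doc_tokens).items():
        # Assumption: OSDF = share of this heading's original documents containing the term.
        osdf = sum(term in doc for doc in originals) / len(originals)
        # Assumption: CSIDF = log(headings / headings whose merged documents contain the term).
        containing = sum(
            any(term in doc for doc in docs) for docs in docs_by_heading.values()
        )
        csidf = math.log(n_headings / containing) if containing else 0.0
        weights[term] = tf * osdf * csidf
    return weights


docs_by_heading = {
    "indexing": [["indexing", "term", "the"], ["indexing", "thesaurus", "the"]],
    "retrieval": [["retrieval", "query", "the"], ["retrieval", "term", "the"]],
}
doc = ["indexing", "indexing", "term", "the"]
print(three_tier_weights(doc, "indexing", docs_by_heading))
```

Under these assumptions a word that occurs under every heading ("the", and here also "term") gets a CSIDF of zero, while a word that is frequent within one heading and absent from the others ("indexing") keeps a high weight, which is the behavior the DF versus IDF comparison describes for subject-specific words.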

Training Stage

- Select experimental texts and controlled vocabulary
- Select testing subjects
- Train parameters for the proposed model

            Training Set   Testing Set   (Total)
Positive    40,000         20,000        60,000
Negative    400            400


Testing Stage

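- Compute the indexing score for testing texts

$$\mathit{IS}_j = \text{Indexing Score} = \frac{\sum (\text{weight of a word; } 0 \text{ when the word isn't included in } R_j)}{\text{number of words in the document}}, \qquad \text{weight} = \mathit{TF} \times \mathit{OSDF} \times \mathit{CSIDF}$$

- Thresholding

  IF $\mathit{IS}_j > M$ THEN $H_j$ is assigned to the document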
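The testing stage can be illustrated with a short Python sketch. It assumes that each learning result R_j is stored as a mapping from word to learned weight, that the score is the sum of those weights over the words of the document (0 for words not in R_j) divided by the number of words in the document, and that a heading is assigned when the score exceeds the threshold. The function names and the sample weights are hypothetical.

```python
def indexing_score(doc_tokens, learned_weights):
    """Average learned weight per word; words outside R_j contribute 0."""
    if not doc_tokens:
        return 0.0
    # Summing per token counts a word once per occurrence, which plays the role of TF.
    total = sum(learned_weights.get(token, 0.0) for token in doc_tokens)
    return total / len(doc_tokens)


def assign_headings(doc_tokens, learned, threshold):
    """Return every heading H_j whose indexing score exceeds the threshold."""
    return [
        heading
        for heading, weights in learned.items()
        if indexing_score(doc_tokens, weights) > threshold
    ]


# Hypothetical learning results R_j for two subject headings.
learned = {
    "indexing": {"indexing": 1.2, "thesaurus": 0.8},
    "retrieval": {"retrieval": 1.0, "query": 0.5},
}
doc = ["indexing", "thesaurus", "query", "methods"]
print(assign_headings(doc, learned, threshold=0.27))
```

Raising the threshold assigns fewer headings, which is the precision and recall trade-off reported in the result tables.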


Evaluation Criteria

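- Indexing precision

$$\text{Indexing Precision} = \frac{\text{number of documents correctly indexed}}{\text{number of documents indexed by the model}}$$

- Indexing recall

$$\text{Indexing Recall} = \frac{\text{number of documents correctly indexed by the model}}{\text{number of documents in the collection that should be indexed}}$$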
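A minimal sketch of the two ratios for a single subject heading, assuming the model's output and the ground truth are both available as sets of document identifiers. The function name and the sample identifiers are illustrative.

```python
def indexing_precision_recall(indexed_by_model, should_be_indexed):
    """Return (precision, recall) for one heading from two sets of document ids."""
    indexed_by_model = set(indexed_by_model)
    should_be_indexed = set(should_be_indexed)
    correct = len(indexed_by_model & should_be_indexed)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = correct / len(indexed_by_model) if indexed_by_model else 0.0
    recall = correct / len(should_be_indexed) if should_be_indexed else 0.0
    return precision, recall


# Hypothetical result: the model assigns the heading to four documents,
# two of which are among the three that should carry it.
precision, recall = indexing_precision_recall(
    {"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"}
)
print(f"precision={precision:.2%} recall={recall:.2%}")
```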

Abstract Part

            Training Set                   Testing Set
Threshold   Precision (%)   Recall (%)     Precision (%)   Recall (%)
0.1         77.31           99.62          76.78           96.63
0.2         87.17           98.83          86.47           92.90
0.3         91.68           97.36          91.01           89.49
0.4         93.92           95.32          93.32           86.19
0.5         95.49           92.88          95.00           83.39
0.6         96.33           90.24          95.91           80.62
0.7         96.92           87.31          96.56           77.87
0.8         97.32           84.51          97.01           75.51
0.9         97.62           81.69          97.35           73.15
1.0         97.83           79.09          97.59           71.09

Title Part

            Training Set                   Testing Set
Threshold   Precision (%)   Recall (%)     Precision (%)   Recall (%)
0.1         80.52           76.48          80.87           78.24
0.2         86.85           74.86          86.78           74.38
0.3         89.94           72.86          89.67           70.76
0.4         91.92           70.73          91.54           67.34
0.5         92.98           68.50          92.57           64.48
0.6         93.60           66.09          93.18           61.76
0.7         94.10           63.78          93.71           59.58
0.8         94.51           61.47          94.14           57.67
0.9         94.88           59.30          94.56           55.67
1.0         95.19           57.12          94.92           53.92


Abstract Part in Training

[Figure: precision and recall (vertical axis, 50% to 100%) plotted against threshold (horizontal axis, 0.1 to 1.0)]

Abstract Part in Testing

[Figure: precision and recall (vertical axis, 50% to 100%) plotted against threshold (horizontal axis, 0.1 to 1.0)]

Title Part in Training

[Figure: precision and recall (vertical axis, 50% to 100%) plotted against threshold (horizontal axis, 0.1 to 1.0)]

Title Part in Testing

[Figure: precision and recall (vertical axis, 50% to 100%) plotted against threshold (horizontal axis, 0.1 to 1.0)]


Comparison to Traditional Model

            Training Set                   Testing Set
Threshold   Precision (%)   Recall (%)     Precision (%)   Recall (%)
0.1         86.67           97.80          84.86           84.30
0.2         93.33           89.01          90.51           60.65
0.3         95.08           73.26          91.18           39.20
0.4         96.10           54.45          91.54           23.92
0.5         97.09           38.04          92.62           14.30
0.6         98.53           24.84          95.55           7.94
0.7         99.81           15.46          99.34           4.52
0.8         99.89           9.47           99.60           2.46
0.9         100.00          5.70           100.00          1.36
1.0         100.00          3.45           100.00          0.71


Related Research

            Training Set                   Testing Set
Threshold   Positive (%)   Negative (%)    Positive (%)   Negative (%)
0.3         97.85          7.67            90.54          7.67
0.4         96.11          4.98            87.39          4.98
0.5         94.00          4.25            84.64          4.25
Leung&Kan   89.70          4.88            87.72          6.01

Comparisons for Abstract

- Training part
  - Precision > 90% when the threshold is between 0.27 and 0.61
  - Both precision and recall > 94% when threshold = 0.43
- Testing part
  - Both precision and recall > 90% when threshold = 0.27
- Training part and testing part
  - Recall > 96% while keeping precision > 77%
  - Precision > 97% while keeping recall > 71%

Comparisons for Title

- Training part
  - Precision > 90% and recall > 70% when threshold = 0.4
  - Both precision and recall > 76% when threshold = 0.1
- Testing part
  - Precision > 90% and recall > 70% when threshold = 0.3
  - Both precision and recall > 78% when threshold = 0.1
- Training part and testing part
  - At precision = 90%, recall can be kept above 70%
- The appropriate threshold for this application is 0.27
