Kuang-hua Chen
Department of Library and Information Science
National Taiwan University
Indexing and Abstracting
Lecture 09 -- Automatic Indexing
Language & Information Processing System, LIS, NTU
Outline
What is subject indexing?
Types of subject indexing
The taxonomy for subject indexing
Index Structures
The models for automatic indexing
3-tier model for automatic indexing
Natural language processing techniques
Future trends
Subject Indexing
The function of indexing is to describe the
"aboutness" of documents
Indexing terms are used to represent the content of a
document
Challenge
how to select a set of indexing terms that faithfully
represents a document containing thousands of words
Process
subject analysis
subject translation
Process of Subject Indexing
Subject analysis
Analyze content of texts and distill the subject
concepts
Basis of subject indexing
Subject translation
Translate the subject concepts to index terms
Two approaches
natural language indexing (free-text indexing)
controlled vocabulary indexing
Tasks of Indexing
Analysis of the subject content of the
document
Review of indexing policies and authorities
to aid in the correct assignment of terms
Presentation of the index terms in the
appropriate order of the indexing system
Weighting of index terms
Quality control of the index terms
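[Figure: the indexing process and the retrieval process. Indexing process labels: document content, subject analysis, subject concepts, subject translation, index terms, storage. Retrieval process labels: searcher's information need, subject analysis, subject concepts, subject translation, search terms, input, retrieval system, output, retrieval results. Both processes share the indexing (retrieval) language.]
Indexing Consistency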
The degree of agreement in the
representation of the essential information
content of the document by certain sets of
indexing terms selected individually and
independently by each of the indexers in
the group.
Indexing Consistency Rating
All studies indicated that consistency was
very low
A figure of 30% was often used
Indexing consistency can vary depending on several
factors
familiarity with the indexing policies
experience with the specific subject
the document most recently indexed
How to Measure Consistency
inter-indexer consistency
the overlap in index term assignment by two or
more indexers for the same document
intra-indexer consistency
the same indexer indexes the same document
at two different times
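Both kinds of consistency come down to comparing two sets of index terms. A minimal sketch follows, assuming a Jaccard-style ratio (shared terms divided by all distinct terms used); the lecture does not fix a formula, and the function name `consistency` and the sample terms are illustrative only.

```python
def consistency(terms_a, terms_b):
    """Overlap between two sets of index terms: |A & B| / |A | B|.

    Assumption: a Jaccard-style ratio; the slides do not name a formula.
    """
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty assignments agree trivially
    return len(a & b) / len(a | b)


# Inter-indexer: two indexers index the same document.
indexer_1 = {"automatic indexing", "term weighting", "thesauri"}
indexer_2 = {"automatic indexing", "information retrieval", "thesauri"}
inter = consistency(indexer_1, indexer_2)    # 2 shared terms out of 4 distinct

# Intra-indexer: one indexer indexes the same document at two different times.
first_pass = {"automatic indexing", "term weighting"}
second_pass = {"automatic indexing", "term weighting", "thesauri"}
intra = consistency(first_pass, second_pass)  # 2 shared terms out of 3 distinct
```

The same function serves both cases; only the origin of the two term sets differs.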
Increase Indexing Consistency
Manual Aids
Vocabulary control
thesauri
scope notes
indexer may choose not to use manual aids
takes additional time
relevance of the aid to the problem is not
apparent
indexer believes there is no problem at all
Machine-Readable Indexing Aids
The indexer’s tools included authority files,
policy manuals, handbooks, textbooks, etc.
Machine Readable Indexing resources are
available.
Pre- versus Post-coordination
Pre-coordinated indexing term
complex/compound concepts are represented
in a single term
Post-coordinated indexing term
Controlled versus Uncontrolled
Controlled Indexing
may be selected from a hierarchical thesaurus
may be selected from a list of classification
level subject headings
Uncontrolled Indexing
natural language terms (free terms) from texts
with or without standardization
Automatic versus Manual
Automatic indexing
Computers carry out the indexing task
Manual indexing
Human indexers carry out the indexing task
Indexing Scheme
Use a 3-tuple to represent each possible indexing
scheme
The first element denotes pre-coordinated (+)
or post-coordinated (-)
The second element denotes controlled (+) or
uncontrolled (-)
The third element denotes automatic (+) or
manual (-)
IS(-, +, +) represents post-coordinated,
controlled, and automatic indexing
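The 3-tuple notation can be mirrored directly in code. This is only a sketch: the class name `IndexingScheme` and its field and method names are assumptions chosen for illustration, not notation from the lecture.

```python
from typing import NamedTuple


class IndexingScheme(NamedTuple):
    """An indexing scheme IS(x, y, z); True stands for '+', False for '-'."""
    pre_coordinated: bool  # '+' pre-coordinated, '-' post-coordinated
    controlled: bool       # '+' controlled, '-' uncontrolled
    automatic: bool        # '+' automatic, '-' manual

    def __str__(self):
        signs = ", ".join("+" if flag else "-" for flag in self)
        return f"IS({signs})"

    def describe(self):
        return ", ".join([
            "pre-coordinated" if self.pre_coordinated else "post-coordinated",
            "controlled" if self.controlled else "uncontrolled",
            "automatic" if self.automatic else "manual",
        ])


scheme = IndexingScheme(pre_coordinated=False, controlled=True, automatic=True)
label = str(scheme)              # "IS(-, +, +)"
description = scheme.describe()  # "post-coordinated, controlled, automatic"
```

Three binary elements give eight possible schemes in total.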
Automatic Indexing
Most works are devoted to automatic
free-text indexing
Few works concern the automatic
Indexing Aims
The effectiveness of any content analysis or
indexing system is controlled by two
parameters
indexing exhaustivity
the degree to which all aspects of the subject
matter of a text item are actually recognized
term specificity
the degree of breadth or narrowness of the terms
Term Specificity
Broad terms cannot distinguish relevant
from irrelevant items
Narrow terms retrieve relatively fewer
items, but most of the retrieved materials
are likely to be helpful to users
Approaches for Automatic Indexing
Semantic Approach
based on understanding texts
domain-dependent
Syntactic Approach
based on syntactic analysis of texts
language-dependent
Statistical Approach
based on the statistics of terms
portable
Term Frequency
Function words
for example, "and", "or", "of", "but", ...
the frequencies of these words are high in all texts
Content words
words that actually relate to document content tend
to occur with greatly varying frequencies in the
different texts of a collection
the frequency of content word may be used to
A Frequency-Based Indexing Method
Eliminate common function words from the
document texts by consulting a special dictionary,
or stop list, containing a list of high-frequency
function words
Compute the term frequency $tf_{ij}$ for all remaining
terms $T_j$ in each document $D_i$, specifying the
number of occurrences of $T_j$ in $D_i$
Choose a threshold frequency $T$, and assign to
each document $D_i$ all terms $T_j$ for which $tf_{ij} > T$
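The three steps of the frequency-based method can be sketched as follows. The stop list here is a tiny illustrative one, tokenization is naive whitespace splitting, and the function name `frequency_index` is an assumption for this example.

```python
from collections import Counter

# Assumption: a tiny illustrative stop list; a real one would be far larger.
STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "is", "to"}


def frequency_index(text, threshold):
    """Return the terms whose frequency in `text` exceeds `threshold`."""
    tokens = text.lower().split()                          # naive tokenization
    content = [t for t in tokens if t not in STOP_LIST]   # drop function words
    tf = Counter(content)                                  # term frequencies
    return {term for term, count in tf.items() if count > threshold}


doc = "indexing of text and indexing of images is the task of indexing systems"
terms = frequency_index(doc, threshold=1)  # only "indexing" occurs more than once
```

Note that "of" also occurs three times in the sample text; it is excluded only because the stop list removes it before counting.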
Document Frequency (DF)
The number of documents which contain
the designated word for a certain collection
$df_j = df(T_j) = \text{NumberOfDocumentsContaining}(T_j)$
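Document frequency counts documents, not occurrences, so a term repeated within one document still contributes only once. A short sketch, with the function name `document_frequency` and the toy collection assumed for illustration:

```python
from collections import Counter


def document_frequency(documents):
    """Map each term to the number of documents that contain it."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # count each term at most once per document
    return df


collection = [
    ["automatic", "indexing", "model"],
    ["manual", "indexing"],
    ["indexing", "model", "model"],
]
df = document_frequency(collection)
```

Here "model" appears three times in the collection but in only two documents, so its document frequency is 2.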
Compose a Single
Frequency-Based Indexing Model
Best indexing terms are those that occur
frequently in individual documents but
rarely in the remainder of the collection
A simple combined term importance
indicator is
$w_{ij} = tf_{ij} \times \log \frac{N}{df_j}$
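As a worked illustration with made-up numbers, and assuming a base-10 logarithm (the slide does not state the base): in a collection of N = 1000 documents, a term that occurs 3 times in a document and appears in 10 documents overall receives

```latex
w_{ij} = tf_{ij} \times \log_{10}\frac{N}{df_j}
       = 3 \times \log_{10}\frac{1000}{10}
       = 3 \times 2
       = 6
```

A term with the same term frequency that appears in all 1000 documents receives 3 × log10(1) = 0, so it is never selected however often it occurs.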
An Improved Indexing Policy
for Free-Term Indexing
Eliminating common function words
Computing the value of $w_{ij}$ for each term $T_j$
in each document $D_i$
Assigning to the documents a collection of
all terms with sufficiently high ($tf \times idf$)
factors
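Put together, the policy can be sketched as below. Several details are assumptions made for the example: the small stop list, the natural logarithm, the cutoff parameter `min_weight`, and the function name `tfidf_index`.

```python
import math
from collections import Counter

# Assumption: a tiny illustrative stop list.
STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "is", "to"}


def tfidf_index(documents, min_weight):
    """Assign to each document the terms whose tf x idf weight reaches `min_weight`.

    `documents` is a list of token lists; the result is a list of term sets.
    """
    # Eliminate common function words.
    cleaned = [[t for t in doc if t not in STOP_LIST] for doc in documents]
    n_docs = len(cleaned)
    df = Counter()
    for doc in cleaned:
        df.update(set(doc))
    assigned = []
    for doc in cleaned:
        tf = Counter(doc)
        # w_ij = tf_ij * log(N / df_j); natural log is an assumption.
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Keep only the terms with sufficiently high weight.
        assigned.append({t for t, w in weights.items() if w >= min_weight})
    return assigned


docs = [
    "the thesaurus and the thesaurus of terms".split(),
    "the index of terms".split(),
    "the retrieval of terms and documents".split(),
]
result = tfidf_index(docs, min_weight=1.0)
```

The word "terms" occurs in every document, so its weight is zero and it is never assigned, which is the behaviour the combined indicator is meant to produce.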
Problems of Traditional Model
for Controlled-Vocabulary Indexing
Term statistics
Term frequency (TF)
Document frequency (DF)
Inverse document frequency
(IDF = log(N/DF))
Traditional model: TF×IDF
High DF words
A = Common words
B = Domain-specific words
C = Subject-specific words
A+B+C are the words
with high DF and low IDF
3-Tier Model for Automatic Indexing
TF of the same document
DF of the same subject
IDF of the same domain
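[Figure: training process and identification process. Training process labels: document database, sampling, positive examples, negative examples, features, relevance function training, controlled-vocabulary relevance function database. Identification process labels: new document, features, heading identification.]
Basic Idea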
$$
\begin{array}{ll}
\text{Subject Headings} & \text{Learning Result} \\
H_1 & R_1 = \{w_{11}, w_{12}, w_{13}, w_{14}, \ldots, w_{1l}\} \\
H_2 & R_2 = \{w_{21}, w_{22}, w_{23}, w_{24}, \ldots, w_{2l}\} \\
H_3 & R_3 = \{w_{31}, w_{32}, w_{33}, w_{34}, \ldots, w_{3l}\} \\
\vdots & \vdots \\
H_j & R_j = \{w_{j1}, w_{j2}, w_{j3}, w_{j4}, \ldots, w_{jl}\} \\
\vdots & \vdots \\
H_m & R_m = \{w_{m1}, w_{m2}, w_{m3}, w_{m4}, \ldots, w_{ml}\}
\end{array}
$$
The Scheme for Term Weight
[Figure: documents D11, D12, D13, ..., D1n, D21, ..., Dm1, ..., Dmn are merged into headings H1, H2, H3, ..., Hm; DF is marked at the document level and IDF at the merged heading level.]
DF versus IDF
                         DF (original set)   IDF (combined set)
Common words             High                Low
Domain-specific words    High                Low
Subject-specific words   High                High
Term Weighting
$Weight = TF \times DF \times IDF$ (DF from the original set, IDF from the combined set)
$Weight_{ik} = TF_{ik} \times OSDF_{nk} \times CSIDF_{mk}$
$Weight_{ik}$ = weight of term $k$ in document $i$
$TF_{ik}$ = frequency of term $k$ in document $i$
$OSDF_{nk}$ = document frequency of term $k$ in original document collection $n$
$CSIDF_{mk}$ = inverse document frequency of term $k$ in combined document collection $m$
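A rough sketch of the three-factor weight follows. It rests on assumptions the slides do not spell out: the original set is taken to be the documents filed under one subject heading, the combined set is taken to treat each heading's merged documents as a single unit, OSDF is normalized to a fraction, and the logarithm is natural. The function names and the toy corpus are illustrative.

```python
import math
from collections import Counter


def osdf(term, heading_docs):
    """Fraction of the documents under one heading (original set) containing `term`."""
    return sum(term in doc for doc in heading_docs) / len(heading_docs)


def csidf(term, corpus):
    """IDF over the combined set: one merged unit per heading (an assumption)."""
    containing = sum(any(term in doc for doc in docs) for docs in corpus.values())
    return math.log(len(corpus) / containing) if containing else 0.0


def weight(term, doc, heading, corpus):
    """TF x OSDF x CSIDF for `term` in `doc`, where `doc` is filed under `heading`."""
    tf = Counter(doc)[term]
    return tf * osdf(term, corpus[heading]) * csidf(term, corpus)


# Toy corpus: heading -> list of tokenized documents.
corpus = {
    "H1": [["the", "library", "catalog"], ["the", "library", "loan"]],
    "H2": [["the", "protein", "cell"], ["the", "protein", "gene"]],
}
doc = corpus["H1"][0]
w_common = weight("the", doc, "H1", corpus)       # in every heading: CSIDF = 0
w_subject = weight("library", doc, "H1", corpus)  # in all H1 documents, only in H1
w_rare = weight("catalog", doc, "H1", corpus)     # in half of the H1 documents
```

Under these assumptions a common word is cancelled by its zero CSIDF, while a word spread across the documents of a single heading scores highest, matching the High/Low pattern in the DF versus IDF comparison.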
Training Stage
Select experimental texts and controlled
vocabulary
Select testing subjects
Train parameters for the proposed model
            Training Set   Testing Set   (Total)
Positive    40,000         20,000        60,000
Negative    400            400
Testing Stage
Compute the indexing score for testing texts
$\text{Indexing Score}_j = \frac{\sum (TF \times OSDF \times CSIDF)}{\text{number of words in the document}}$
(the weight of a word is 0 when the word isn't included in $R_j$)
Thresholding
IF $IS_j > M$ THEN $H_j$ is assigned to the document
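The scoring and thresholding steps can be sketched as follows. In this sketch each learning result is assumed to be a dictionary from word to its OSDF × CSIDF value, and the TF factor arises from summing over every token occurrence in the document; the function names, the sample weights, and the `threshold` parameter (written M on the slide) are illustrative assumptions.

```python
def indexing_score(doc_tokens, learned_weights):
    """Sum of word weights over the document, divided by its number of words.

    `learned_weights` plays the role of one learning result: words that are
    not in it contribute a weight of 0.
    """
    if not doc_tokens:
        return 0.0
    total = sum(learned_weights.get(token, 0.0) for token in doc_tokens)
    return total / len(doc_tokens)


def assign_headings(doc_tokens, learning_results, threshold):
    """Return every heading whose indexing score exceeds `threshold`."""
    return [
        heading
        for heading, weights in learning_results.items()
        if indexing_score(doc_tokens, weights) > threshold
    ]


# Hypothetical learning results for two headings.
learning_results = {
    "H1": {"library": 0.6, "catalog": 0.3},
    "H2": {"protein": 0.7},
}
doc = ["library", "catalog", "library", "loan"]
score_h1 = indexing_score(doc, learning_results["H1"])  # (0.6 + 0.3 + 0.6 + 0) / 4
assigned = assign_headings(doc, learning_results, threshold=0.3)
```

Dividing by the document length keeps long documents from exceeding the threshold merely because they contain more words.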
Evaluation Criteria
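Indexing precision
$\text{Indexing Precision} = \frac{\text{number of documents correctly indexed}}{\text{number of documents indexed by the model}}$
Indexing recall
$\text{Indexing Recall} = \frac{\text{number of documents correctly indexed by the model}}{\text{number of documents in the collection that should be indexed}}$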
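For a single subject heading, both criteria reduce to set arithmetic on document identifiers. A minimal sketch, with the function name and the sample identifiers assumed for illustration:

```python
def indexing_precision_recall(model_indexed, should_be_indexed):
    """Both arguments are sets of document ids for one subject heading."""
    correct = model_indexed & should_be_indexed
    precision = len(correct) / len(model_indexed) if model_indexed else 0.0
    recall = len(correct) / len(should_be_indexed) if should_be_indexed else 0.0
    return precision, recall


model_indexed = {"d1", "d2", "d3", "d4"}   # documents the model assigned the heading to
should_be_indexed = {"d1", "d2", "d5"}     # documents that should carry the heading
precision, recall = indexing_precision_recall(model_indexed, should_be_indexed)
# precision = 2 / 4, recall = 2 / 3
```

Raising the score threshold shrinks `model_indexed`, which tends to raise precision and lower recall.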
Abstract Part
            Training Set                 Testing Set
Threshold   Precision(%)   Recall(%)     Precision(%)   Recall(%)
0.1         77.31          99.62         76.78          96.63
0.2         87.17          98.83         86.47          92.90
0.3         91.68          97.36         91.01          89.49
0.4         93.92          95.32         93.32          86.19
0.5         95.49          92.88         95.00          83.39
0.6         96.33          90.24         95.91          80.62
0.7         96.92          87.31         96.56          77.87
0.8         97.32          84.51         97.01          75.51
0.9         97.62          81.69         97.35          73.15
1.0         97.83          79.09         97.59          71.09
Title Part
            Training Set                 Testing Set
Threshold   Precision(%)   Recall(%)     Precision(%)   Recall(%)
0.1         80.52          76.48         80.87          78.24
0.2         86.85          74.86         86.78          74.38
0.3         89.94          72.86         89.67          70.76
0.4         91.92          70.73         91.54          67.34
0.5         92.98          68.50         92.57          64.48
0.6         93.60          66.09         93.18          61.76
0.7         94.10          63.78         93.71          59.58
0.8         94.51          61.47         94.14          57.67
0.9         94.88          59.30         94.56          55.67
1.0         95.19          57.12         94.92          53.92
[Figure: Abstract Part in Training. Precision and recall (50% to 100%) versus threshold (0.1 to 1.0).]
[Figure: Abstract Part in Testing. Precision and recall (50% to 100%) versus threshold (0.1 to 1.0).]
[Figure: Title Part in Training. Precision and recall (50% to 100%) versus threshold (0.1 to 1.0).]
[Figure: Title Part in Testing. Precision and recall (50% to 100%) versus threshold (0.1 to 1.0).]
Comparison to Traditional Model
            Training Set                 Testing Set
Threshold   Precision(%)   Recall(%)     Precision(%)   Recall(%)
0.1         86.67          97.80         84.86          84.30
0.2         93.33          89.01         90.51          60.65
0.3         95.08          73.26         91.18          39.20
0.4         96.10          54.45         91.54          23.92
0.5         97.09          38.04         92.62          14.30
0.6         98.53          24.84         95.55          7.94
0.7         99.81          15.46         99.34          4.52
0.8         99.89          9.47          99.60          2.46
0.9         100.00         5.70          100.00         1.36
1.0         100.00         3.45          100.00         0.71
Related Research
              Training Set                 Testing Set
Threshold     Positive(%)   Negative(%)    Positive(%)   Negative(%)
0.3           97.85         7.67           90.54         7.67
0.4           96.11         4.98           87.39         4.98
0.5           94.00         4.25           84.64         4.25
Leung & Kan   89.70         4.88           87.72         6.01
Comparisons for Abstract
Training part
Precision > 90%, when threshold between 0.27 and 0.61
Both precision and recall > 94%, when threshold = 0.43
Testing part
Both precision and recall > 90%, when threshold = 0.27
Training part and testing part
Recall > 96% and keep precision > 77%
Precision > 97% and keep recall > 71%
Comparisons for Title
Training part
Precision > 90% and recall > 70%, when threshold = 0.4
Both precision and recall > 76%, when threshold = 0.1
Testing Part
Precision > 90% and recall > 70%, when threshold = 0.3
Both precision and recall > 78%, when threshold= 0.1
Training part and testing part