Incremental Maintenance of
Ontology-Exploiting Association Rules
Ming-Cheng Tseng (1), Wen-Yang Lin (2) and Rong Jeng (3)
(1, 3) Institute of Information Engineering, I-Shou University, Taiwan
(2) Dept. of Comp. Sci. & Info. Eng., National University of Kaohsiung, Taiwan
Outline
Introduction
Problem description
The proposed algorithm
Performance evaluation
Conclusions
Introduction
Motivation
In general, many semantic relationships (domain knowledge) exist among items
It is natural to incorporate domain ontology into the
process of data mining to explore more innovative rules
The source databases are changing over time
E.g., insertion, deletion, modification
The discovered knowledge (rules) has to be updated to
Introduction (cont.)
Association rules
Given:
A database of customer transactions
Each transaction is a set of items
Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
Example:
Introduction (cont.)
Strong association rules
Given:
User’s specified constraints
Minimum support (min_sup)
Minimum confidence (min_conf)
Find rules X => Y with support and confidence larger than the user-specified minimum values
Example:
min_sup = 25%, min_conf = 50%
Introduction (cont.)
Frequent itemsets (patterns) mining
The association mining problem can be reduced to the problem of mining frequent itemsets, i.e., itemsets with support larger than min_sup
Example
min_sup = 25%, min_conf = 50%
Sony VAIO => HP LaserJet 1300 (Sup. 30%, Conf. 60%)
sup({Sony VAIO, HP LaserJet 1300}) = 30%
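The support and confidence definitions above can be checked with a few lines of Python. This is a minimal sketch: the toy transactions and the helper names `support` and `confidence` are illustrative assumptions and do not come from the slides, and the thresholds are treated as inclusive, which is also an assumption.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return Fraction(hits, len(transactions))

def confidence(transactions, x, y):
    """sup(X u Y) / sup(X) for the rule X => Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical toy data, chosen only to exercise the two functions.
toy = [
    {"Sony VAIO", "HP LaserJet 1300"},
    {"Sony VAIO"},
    {"HP LaserJet 1300"},
    {"Sony VAIO", "HP LaserJet 1300"},
]

rule_sup = support(toy, {"Sony VAIO", "HP LaserJet 1300"})
rule_conf = confidence(toy, {"Sony VAIO"}, {"HP LaserJet 1300"})

# Thresholds from the slide: min_sup = 25%, min_conf = 50% (inclusive here, by assumption).
is_strong = rule_sup >= Fraction(25, 100) and rule_conf >= Fraction(50, 100)
```

Exact fractions are used instead of floats so that comparisons against the thresholds do not depend on rounding.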
Introduction (cont.)
Ontology
W3C Web Ontology Working Group
“An ontology formally defines a common set of terms
that are used to describe and represent a domain
knowledge.”
e.g., taxonomy: a kind of ontology presenting classification relationships among objects
[Figure: example taxonomy with nodes Vegetable, Non-root Vegetable, Fruit, Tomato, Carrot, Kale, Pickle, Apple, Papaya]
Introduction (cont.)
Ontology-exploiting association rules
[Figure: item ontology with classification and composition relationships; node labels: PC, Desktop PC, Notebook, Memory, Hard Disk, RAM 256MB, RAM 512MB, S 60GB, IBM 60GB, Sony VAIO, Gateway GE, IBM TP, Printer, HP DeskJet, Epson EPL, Ink Cartridge, Photo Conductor, Toner Cartridge]
IBM 60GB HD => HP DeskJet
Problem Description
Incremental maintenance of ontology-exploiting association rules
Given:
A database of customer transactions DB
An incremental database db
An item ontology T
Discovered frequent itemsets in DB, L
Minimum support, ms, and minimum confidence, mc
Find all frequent itemsets in UD = DB + db w.r.t. ms
Construct all strong rules from the frequent itemsets w.r.t. m
Problem Description (cont.)
Example

Customer transactions DB
TID   Purchased Items
1     IBM TP, Epson EPL, Toner Cartridge
2     Sony VAIO, IBM TP, Epson EPL
3     IBM TP, HP DeskJet, Ink Cartridge
4     HP DeskJet
5     IBM TP, HP DeskJet, Ink Cartridge
6     Sony VAIO, Ink Cartridge

Item ontology G
[Figure: composition and classification relationships among PC, IBM TP, Sony VAIO, Printer, HP DeskJet, Epson EPL, RAM 256MB, IBM 60GB, S 60GB, Ink Cartridge, Toner Cartridge and Photo Conductor]

Discovered frequent itemsets L
L1                                      Count
{Printer}                               5
{PC}                                    5
{IBM TP}                                4
{RAM 256MB*}                            5
{IBM 60GB*}                             4
L2 & L3                                 Count
{Printer, PC}                           4
{Printer, IBM TP}                       4
{Printer, RAM 256MB*}                   4
{Printer, IBM 60GB*}                    4
{RAM 256MB*, IBM 60GB*}                 4
{Printer, RAM 256MB*, IBM 60GB*}        4
Problem Description (cont.)
Example
Customer transactions DB (TIDs 1 to 6) and item ontology G: as in the preceding example

Incremental transactions db
TID   Items Purchased
7     Toner Cartridge
8     IBM TP, HP DeskJet, IBM 60GB, Toner Cartridge
9     IBM 60GB, Toner Cartridge

minsup = 70%
Updated frequent itemsets L’
Basic scheme
An Apriori-based maintenance algorithm
Employing a bottom-up, level-wise searching strategy
Starting from the frequent 1-itemsets, L1, then L2, …, Lk, etc.
[Figure: itemset lattice over items A, B, C, D, with levels A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]
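The bottom-up, level-wise strategy can be illustrated with a plain Apriori pass over a small database. This is a generic textbook-style sketch, not the IMARO maintenance algorithm itself; the function name `apriori`, the `min_count` parameter and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search: L1, then L2, ..., until no frequent k-itemset remains."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate k-itemset with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: unions of two frequent k-itemsets that have k + 1 items.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent

# Hypothetical transactions over the items A, B, C, D of the lattice.
toy = [set("ABC"), set("ABD"), set("ABCD"), set("AB"), set("CD")]
result = apriori(toy, min_count=3)
```

With `min_count=3`, only the four single items and the pair AB survive, so the search stops after the second level without ever counting a 3-itemset.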
Notation   Definition
DB         Original database
db         Incremental database
UD         Updated database, UD = DB + db
T          Item ontology
ED         Extension of DB with extended items in T
ed         Extension of db with extended items in T
UE         Updated extended database, UE = ED + ed
The Proposed Algorithm – IMARO (cont.)
Example
Note on database extension
A component item may also exist as a primitive item itself
To clarify the meaning of associations involving such an item, we have to differentiate the roles this item plays
e.g., IBM TP => Ink Cartridge can mean either
buy an IBM TP notebook, also buy an Ink Cartridge
buy an IBM TP notebook, also buy a product composed of Ink Cartridge
The Proposed Algorithm – IMARO (cont.)
TID   Purchased Items
5     IBM TP, HP DeskJet, Ink Cartridge

TID   Primitive Items                        Extended Items
5     IBM TP, HP DeskJet, Ink Cartridge*     PC, RAM 256MB, IBM 60GB, Printer, Ink
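The extension of transaction 5 can be reproduced with a small sketch. The dictionaries `GENERALIZES` and `COMPONENTS` are a hypothetical fragment of the ontology, inferred only from this one example row, and the rule used for the `*` marker (a primitive item that also occurs as a component somewhere in the ontology) is an assumption about how the slide marks such items.

```python
# Hypothetical ontology fragment, inferred from the example row only.
GENERALIZES = {"IBM TP": "PC", "Sony VAIO": "PC",
               "HP DeskJet": "Printer", "Epson EPL": "Printer"}
COMPONENTS = {"IBM TP": ["RAM 256MB", "IBM 60GB"],
              "HP DeskJet": ["Ink Cartridge"]}

def extend(transaction):
    """Return (primitive items, extended items) for one transaction."""
    extended = set()
    for item in transaction:
        # Add every generalized item on the path towards the root.
        parent = GENERALIZES.get(item)
        while parent is not None:
            extended.add(parent)
            parent = GENERALIZES.get(parent)
        # Add the components of a composite item.
        extended.update(COMPONENTS.get(item, []))
    # Assumption: a purchased item that is also a component gets a '*'
    # so that its two roles can be told apart.
    all_components = {c for parts in COMPONENTS.values() for c in parts}
    primitive = {i + "*" if i in all_components else i for i in transaction}
    return primitive, extended

primitive, extended = extend(["IBM TP", "HP DeskJet", "Ink Cartridge"])
```

Keeping the marked primitive item and the derived component as two distinct items lets a rule such as IBM TP => Ink Cartridge be read unambiguously.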
The Proposed Algorithm – IMARO (cont.)
Process flow for updating frequent k-itemsets
Frequent/infrequent itemsets inference
The Proposed Algorithm – IMARO (cont.)
Conditions on L_ED and L_ed; result in UE
Case   Result in UE    Action
1      frequent        none
2      undetermined    compare sup_UD(A) with ms
3      undetermined    scan DB
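The case analysis can be sketched as a small decision function. The conditions assumed here follow the usual incremental-mining reasoning (frequent in both parts, frequent only in the original part, frequent only in the increment, infrequent in both); the slide text does not spell them out, so they are an assumption, as are the fourth case and the names `classify` and `updated_support`. The example counts for {Printer} take 5 of 6 original transactions from the frequent-itemset table and assume that only TID 8 contributes a printer among the 3 incremental ones.

```python
from fractions import Fraction

def classify(frequent_in_original, frequent_in_increment):
    """Return (case, result in the updated database, action)."""
    if frequent_in_original and frequent_in_increment:
        return (1, "frequent", "none")
    if frequent_in_original:
        # The old count is known, so only the increment has to be counted.
        return (2, "undetermined", "compare updated support with ms")
    if frequent_in_increment:
        # The old count is unknown, so the original data must be rescanned.
        return (3, "undetermined", "scan original database")
    # Assumed fourth case: infrequent in both parts implies infrequent overall.
    return (4, "infrequent", "none")

def updated_support(count_original, count_increment, n_original, n_increment):
    """Support over the union of the original and incremental parts."""
    return Fraction(count_original + count_increment, n_original + n_increment)

ms = Fraction(70, 100)
in_original = Fraction(5, 6) >= ms      # {Printer} in the original part
in_increment = Fraction(1, 3) >= ms     # {Printer} in the increment (assumed count)
case = classify(in_original, in_increment)
sup_after = updated_support(5, 1, 6, 3)
still_frequent = sup_after >= ms
```

Under these assumed counts {Printer} falls into case 2, and its updated support of 2/3 no longer reaches the 70% threshold.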
The Proposed Algorithm – IMARO (cont.)
Optimization 1: Candidate pruning
Any candidate itemset that contains both an item and any one of its extensions (generalized item or component) is pruned.
[Figure: item ontology G]
e.g., {Epson EPL, Printer}
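The pruning rule can be sketched as a filter over candidate itemsets. The ontology fragment (`GENERALIZES`, `COMPONENTS`) and the helper names are hypothetical; only the Epson EPL / Printer relationship is taken directly from the example above.

```python
# Hypothetical ontology fragment used only for this illustration.
GENERALIZES = {"Epson EPL": "Printer", "HP DeskJet": "Printer",
               "IBM TP": "PC", "Sony VAIO": "PC"}
COMPONENTS = {"IBM TP": {"RAM 256MB", "IBM 60GB"},
              "HP DeskJet": {"Ink Cartridge"}}

def extensions(item):
    """Generalized items and components of `item`."""
    found = set(COMPONENTS.get(item, ()))
    parent = GENERALIZES.get(item)
    while parent is not None:
        found.add(parent)
        parent = GENERALIZES.get(parent)
    return found

def prune(candidates):
    """Drop candidates that contain an item together with one of its extensions."""
    kept = []
    for cand in candidates:
        redundant = any(b in extensions(a) for a in cand for b in cand if a != b)
        if not redundant:
            kept.append(cand)
    return kept

cands = [
    frozenset({"Epson EPL", "Printer"}),   # item plus its generalized item
    frozenset({"Epson EPL", "PC"}),        # unrelated items
    frozenset({"IBM TP", "RAM 256MB"}),    # item plus its component
]
kept = prune(cands)
```

Such candidates carry no information, since any transaction containing the item also contains its extensions once the database has been extended.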
The Proposed Algorithm – IMARO (cont.)
The extension of an item can be added only if that item does appear in at least one candidate itemset currently being counted
[Figure: item ontology G]
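One reading of this optimization is that, while a pass counts the current candidates, a transaction is extended only with those extended items that occur in some candidate. The sketch below follows that reading; the function name `extend_for_pass`, the `extensions_of` mapping and the sample candidates are assumptions made for illustration.

```python
def extend_for_pass(transaction, candidates, extensions_of):
    """Add only the extended items that some current candidate mentions."""
    # Items that occur in at least one candidate being counted in this pass.
    needed = set().union(*candidates)
    extended = set()
    for item in transaction:
        extended.update(e for e in extensions_of.get(item, ()) if e in needed)
    return extended

# Hypothetical extensions (generalized items and components) per item.
extensions_of = {
    "IBM TP": ["PC", "RAM 256MB", "IBM 60GB"],
    "HP DeskJet": ["Printer", "Ink Cartridge"],
}
candidates = [frozenset({"Printer", "PC"}), frozenset({"Printer", "IBM TP"})]

added = extend_for_pass(["IBM TP", "HP DeskJet", "Ink Cartridge"],
                        candidates, extensions_of)
```

Here RAM 256MB, IBM 60GB and Ink Cartridge are skipped because no current candidate contains them, which keeps the extended transactions short during counting.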
Performance Evaluation
IMARO is compared with applying our proposed algorithms, AROC and AROS, to the whole updated database DB + db with T
Test data
A synthetic dataset generated by the IBM data generator with an artificially built ontology
Parameter   Description                        Default value
|DB|        Number of original transactions    200,000
|t|         Average size of transactions       20
N           Number of items                    362
R           Number of groups                   30
L           Number of levels                   4
Performance Evaluation (cont.)
Varying minimum supports
[Figure: run time (sec., log scale) versus ms % from 1 to 3.5 for AROC, AROS and IMARO]
Performance Evaluation (cont.)
Varying incremental transaction size
[Figure: run time (sec.) versus number of incremental transactions (x 10,000), from 2 to 20]
Conclusions
We have investigated the problem of updating ontology-exploiting association rules when new transactions are inserted into the database
An Apriori-based algorithm is proposed
Other issues
More complicated semantic relationships and knowledge
Non-uniform minimum support
Generalized item or composite item occurs more frequently
Towards a total solution for evolving environments
Ontology evolution, database update