Chapter 2 Background Knowledge
2.4 Multidimensional Association Mining
The previous section described basic association mining, which is usually applied in market basket analysis to explore the associations between products bought by a customer.
Is it possible to explore associations beyond those between products? The answer is yes, because multidimensional association mining allows the discovery of associations among multiple attributes. In this section we introduce multidimensional association mining in detail.
In association mining, a rule whose items involve only one attribute is called a single-dimensional association rule, while a rule whose items involve two or more attributes is called a multidimensional association rule. An attribute is also called a dimension from the perspective of the multidimensional data model.
Definition 1. (Multi-dimensional association rule) Consider a transaction table composed of k attributes (dimensions). Let xim and yjn be the values of attributes Xi and Yj, respectively. The form of a multi-dimensional association rule is:
X1 = “x1m”, X2 = “x2m”, …, Xi = “xim” ⇒ Y1 = “y1n”, Y2 = “y2n”, …, Yj = “yjn”
Following the work in [22, 61], multi-dimensional association rules can be categorized into the following three types.
(1) Intra-dimensional association rule
This type of rule shows associations within a single attribute: all items in an intra-dimensional association rule involve the same attribute. For example, the following intra-dimensional association rule,
ProdName = “LG DVD Burner” ⇒ ProdName = “DVD-R 8X Disk”,
shows that people who purchase “LG DVD Burner” are also likely to purchase “DVD-R 8X Disk”; it involves only the attribute ProdName.
(2) Inter-dimensional association rule
This type of rule shows associations among multiple attributes. The items in an inter-dimensional association rule involve more than one attribute, and each attribute appears only once in the rule. For example, the inter-dimensional association rule,
Gender = “Female”, ProdName = “JVC UX-C305 Hi-Fi System” ⇒ City = “Taipei”,
involves three attributes with no repetition of any attribute in the rule. This rule indicates that females who buy “JVC UX-C305 Hi-Fi System” are likely to live in the city “Taipei”.
(3) Hybrid association rule
This type of rule is a combination of inter-dimensional and intra-dimensional associations.
The items in such a rule also involve multiple attributes. It differs from an inter-dimensional association rule in that it allows an attribute to appear more than once. For example, the following hybrid association rule,
Education = “College”, ProdName = “Acer PC” ⇒ ProdName = “HP printer”,
shows that people with college education who purchase “Acer PC” tend to also purchase
“HP printer”. The attribute ProdName appears twice in the rule.
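The three categories can be told apart simply by counting how often each attribute occurs in a rule. The following is a minimal sketch, assuming a rule is given as two lists of (attribute, value) pairs; the function name classify_rule and this representation are illustrative assumptions, not part of the cited work.

```python
from collections import Counter


def classify_rule(antecedent, consequent):
    """Classify a rule given as two lists of (attribute, value) items."""
    # Count how many items of the whole rule refer to each attribute.
    counts = Counter(attr for attr, _ in antecedent + consequent)
    if len(counts) == 1:
        return "intra-dimensional"   # a single attribute only
    if all(n == 1 for n in counts.values()):
        return "inter-dimensional"   # several attributes, none repeated
    return "hybrid"                  # several attributes, some repeated


# The three example rules from the text.
intra = classify_rule([("ProdName", "LG DVD Burner")],
                      [("ProdName", "DVD-R 8X Disk")])
inter = classify_rule([("Gender", "Female"),
                       ("ProdName", "JVC UX-C305 Hi-Fi System")],
                      [("City", "Taipei")])
hybrid = classify_rule([("Education", "College"), ("ProdName", "Acer PC")],
                       [("ProdName", "HP printer")])
print(intra, inter, hybrid)
```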
Definition 2. (Mining Model or Query) Suppose a star schema S contains a fact table F and m dimension tables {D1, D2, …, Dm}. Let T be a joined table from S composed of attributes a1, a2, …, ar, such that ∀ ai, aj ∈ Attr(Dk), 1 ≤ i ≤ r, 1 ≤ j ≤ r, i ≠ j and 1 ≤ k ≤ m. Here Attr(Dk) denotes the attribute set of dimension table Dk. With tG, tM ⊆ {a1, a2, …, ar} and tG ∩ tM = ∅, a mining model or query of multidimensional association rules from T is defined as
MQ: <tG, tM, [wc], [hc], ms, mc>,
where tG, tM, wc, hc, ms and mc are the elements described as follows,
tG: grouping ID (transaction ID),
tM: interested mining attributes,
wc: optional filtering condition before the grouping operation,
hc: optional filtering condition after the grouping operation,
ms: minimum support, and
mc: minimum confidence.
The format of “wc” is
wc: (wca wco wcv) [, (wca wco wcv)],
where wca, wco and wcv are attribute, relational operator and value within a “where”
condition. For example,
wc: Country = “Japan”,
where “Country” is the wca, “=” is the wco and “Japan” is the wcv.
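To make the structure of MQ concrete, the sketch below represents a query as a small Python data class. It is only an illustration under several assumptions that the definition does not state: the set of relational operators in OPS, the treatment of ms and mc as fractions between 0 and 1, and the combination of multiple wc conditions with a logical AND. The class and field names other than tG, tM, wc, hc, ms, mc, wca, wco and wcv are hypothetical, as are the values 0.3 and 0.6.

```python
import operator
from dataclasses import dataclass, field

# Assumed operator set; the text only shows "=".
OPS = {"=": operator.eq, "<>": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}


@dataclass
class Condition:
    wca: str      # attribute
    wco: str      # relational operator
    wcv: object   # value

    def holds(self, row):
        return OPS[self.wco](row[self.wca], self.wcv)


@dataclass
class MiningQuery:
    tG: frozenset                            # grouping (transaction) ID attributes
    tM: frozenset                            # interested mining attributes
    ms: float                                # minimum support (assumed fraction)
    mc: float                                # minimum confidence (assumed fraction)
    wc: list = field(default_factory=list)   # optional filter before grouping
    hc: list = field(default_factory=list)   # optional filter after grouping

    def __post_init__(self):
        self.tG, self.tM = frozenset(self.tG), frozenset(self.tM)
        if self.tG & self.tM:
            raise ValueError("tG and tM must be disjoint")
        if not (0 <= self.ms <= 1 and 0 <= self.mc <= 1):
            raise ValueError("ms and mc are expected in [0, 1]")


mq = MiningQuery(tG={"CustID", "Date"}, tM={"Education", "ProdName"},
                 ms=0.3, mc=0.6, wc=[Condition("Country", "=", "Japan")])

# Apply wc to two made-up rows before any grouping takes place.
rows = [{"Country": "Japan", "CustID": "C001"},
        {"Country": "Italy", "CustID": "C002"}]
kept = [r for r in rows if all(c.holds(r) for c in mq.wc)]
print(kept)
```

Checking tG ∩ tM = ∅ in the constructor means that a malformed query is rejected before any target data is prepared.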
The elements tG, tM, wc and hc are involved in acquiring the target data, as shown in Figure 2. A data warehouse stores complete and primitive data. In multidimensional association rule mining, the data in the warehouse can be joined and grouped into transactions at different levels of granularity according to a user’s need. Table 3 is an example based on the data warehouse in Figure 3.
Table 3. An example of target data grouping by CustID and Date
tid   CustID (tG)   Date (tG)     Education (tM)   ProdName* (tM)
1     C001          2008-02-01    College          B, C, E
2     C003          2008-02-03    High School      A, B, E
3     C003          2008-02-10    High School      A, D
4     C004          2008-02-05    Middle High      C, E
5     C005          2008-02-09    College          B, C, D, E
6     C005          2008-02-15    College          A, B, E
*A: IBM60GB   B: IBM TP   C: RAM 512MB   D: Ink Cartridge   E: Hard Disk
Suppose that, from all the customers’ daily transactions, a user wants to know whether any associations exist between customers’ education and the products sold. In such a case, each transaction in the target data is identified by the combination of the customer ID and the transaction date, while the interested mining attributes are education and product.
Therefore the query has tG = {CustID, Date} and tM = {Education, ProdName}. Its target data is shown in Table 3, with six transactions. If the customer ID becomes the only transaction ID, i.e. tG = {CustID}, the transaction count drops to four, as shown in Table 4.
Table 4. An example of target data grouping by CustID
tid   CustID (tG)   Education (tM)   ProdName* (tM)
1     C001          College          B, C, E
2     C003          High School      A, B, D, E
3     C004          Middle High      C, E
4     C005          College          A, B, C, D, E
*A: IBM60GB   B: IBM TP   C: RAM 512MB   D: Ink Cartridge   E: Hard Disk
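The effect of tG on the granularity of the target data can be reproduced with a few lines of Python. The sketch below expands the rows of Table 3 into one record per purchased product and then groups them by the attributes in tG; the function name build_transactions and the record layout are assumptions made for illustration only.

```python
from collections import defaultdict

# Rows of Table 3: (CustID, Date, Education, product codes).
TABLE3 = [
    ("C001", "2008-02-01", "College", "BCE"),
    ("C003", "2008-02-03", "High School", "ABE"),
    ("C003", "2008-02-10", "High School", "AD"),
    ("C004", "2008-02-05", "Middle High", "CE"),
    ("C005", "2008-02-09", "College", "BCDE"),
    ("C005", "2008-02-15", "College", "ABE"),
]

# One detail record per purchased product, as a joined table would hold.
records = [{"CustID": c, "Date": d, "Education": e, "ProdName": p}
           for c, d, e, products in TABLE3 for p in products]


def build_transactions(records, tG, tM):
    """Group records by the tG attributes; items are (attribute, value) pairs of tM."""
    groups = defaultdict(set)
    for rec in records:
        key = tuple(rec[a] for a in tG)
        for a in tM:
            groups[key].add((a, rec[a]))
    return dict(groups)


by_cust_date = build_transactions(records, ["CustID", "Date"],
                                  ["Education", "ProdName"])
by_cust = build_transactions(records, ["CustID"], ["Education", "ProdName"])
print(len(by_cust_date), len(by_cust))
```

Grouping by {CustID, Date} yields the six transactions of Table 3, while grouping by {CustID} merges the two purchases of C003 and of C005 and yields the four transactions of Table 4.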
CHAPTER 3
RELATED WORK
In this chapter we review the work related to this research. We discuss it in three parts: (1) use of ontology in data mining, (2) data warehouse mining and (3) active data mining.
3.1 Use of Ontology in Data Mining
If a concept hierarchy or taxonomy is viewed as an ontology, the use of ontology in data mining can be traced back to 1991, when Nunez used classification hierarchy information and attribute processing cost to improve the efficiency of the classification process [44]. Later, Han and Fu [21] and Srikant and Agrawal [53] proposed incorporating classification hierarchies to mine multilevel association rules and generalized association rules, respectively.
Their work was later extended by Chien et al. [9], who applied not only classification but also composition hierarchical knowledge to mining fuzzy association rules. These studies, however, concentrated on algorithm design; they did not discuss the design of the ontology structure or its benefit to data mining. More recently, the application of ontology to data mining has been explored by several studies, such as ontology-based induction of classification rules [55], ontology-based explanation of association rules [13, 54], ontology-supported selection of classification algorithms [6, 38] and ontology-guided generation of new attributes from databases [50]. Cespivova et al. [7] conducted a systematic study discussing the roles of medical domain ontology in each aspect of the KDD process. A similar study was presented in [18]. In [46], Pan et al. proposed ontology-supported data mining from databases. They maintained previous mining results in an ontology that can further be applied to incremental association rule mining.
3.2 Data Warehouse Mining
Currently, research on data warehouse mining concentrates mostly on data mining from data cubes or multidimensional databases. J. Han’s research group pioneered this subject [22]. In [52], Psaila and Lanzi studied multi-level association mining from a primitive data warehouse and proposed a mining algorithm. To the best of our knowledge, the research by Priebe and Pernul [51] is the only work to date that explores the issues of incorporating ontology into knowledge discovery from data warehouses. In particular, they proposed an intelligent web portal that integrates OLAP and information retrieval via ontology; however, it focuses on information retrieval issues rather than on data mining.
3.3 Active Data Mining
As surveyed in [47], a considerable number of proposals and applications have been presented for active database systems, yet little research has addressed active data mining. Active data mining was studied by Agrawal and Psaila [2]. They presented active data mining from accumulated association rules, triggered when a certain trend of rules is found. They focused on defining shapes of trends and query languages for retrieving those shapes. Their work also stimulated the closely related theme of change detection in dynamically evolving data. Some studies have focused on dealing with changes in classification mining [17, 40, 57]. Recently, research on active data mining has broadened extensively [42], referring to any effort to automate the processes within the generic KDD framework [15], such as data collection, model selection [5], pattern discovery [58], rule evaluation [35, 36], etc. Our study of active mining in this context involves target data selection, pattern discovery and deployment, which to the best of our knowledge has not been investigated in the literature.
In summary, our work focuses on activating data re-mining with the queries in the user preference ontology, which represents the queries a user has issued in the past. Many concepts or techniques used in our framework in some way resemble those adopted in the recommender system community [1], for example, incorporating user preferences [3, 16], learning user preferences without intrusively monitoring the user’s activities [31, 41], ontological representation of user preferences [41], and query log analysis [14, 59].
Nevertheless, the main purpose of our work is quite different. Rather than recommending unseen items or subjects in which a user is interested, we aim at capturing which data mining queries interest a user the most and building an activating mechanism to re-execute these queries, keeping users aware of the newest updates in the evolving environment. We nevertheless believe that the experience gained in the recommender system community can shed light on promising research issues in active data mining.
CHAPTER 4
THE PROPOSED SYSTEM FRAMEWORK OF INTELLIGENT DATA WAREHOUSE MINING
In this chapter we introduce our proposed data warehouse mining system, which incorporates various ontologies to provide intelligent assistance in the mining process. The feasibility of such a system in remedying the aforementioned deficiencies of modern data warehouse mining systems will be clarified through examples.
Figure 11. An ontology-integrated data warehouse mining system for multidimensional association rule
We assume that data has been extracted from different data sources, such as transaction databases, flat files and internet pages, and then transformed and loaded into a data warehouse. A data warehouse keeps integrated data and gives users convenient access to consistent data without their having to worry about heterogeneous data sources or data types. It is therefore well suited to knowledge discovery operations such as data mining and OLAP. Our proposed data warehouse mining system utilizes several ontologies. Figure 11 shows the proposed system framework, which includes a regular and an active mining path. A regular mining process begins with a user setting a query. According to the specified query, the target data is prepared and the mining engine is then launched. The user tunes the query repeatedly until a satisfactory result is found. Tuning the minimum support and minimum confidence is especially challenging for a user who is not experienced in mining multidimensional association rules.
During the tuning process the system helps the user converge toward the desired query with the assistance of the schema ontology, schema constraint ontology, domain ontology and user preference ontology. The query elements of a successful mining run are saved in the mining log, which provides data for further analysis and reference. The mining log can be aggregated, and closely related queries can be concisely combined. The analyzed results of the mining log are then used to construct the user preference ontology. The user preference ontology in turn enables the active mining mechanism and the dispatching functions. The dotted line in Figure 11 is the path of active mining. The active mining and dispatch module initiates an action that triggers data mining with the queries in the user preference ontology. The results are maintained in the rule base, and specific rules are dispatched to specific users according to the information in the user profile database.
The knowledge that supports the system’s intelligent services is incorporated in this framework. It includes the characteristics of the data warehouse schema, constraint relationships between attributes, domain-specific knowledge and user preference information accumulated from the mining history. This knowledge is beyond the representation capability of current data warehouse systems, and in this dissertation we structure it into four types of ontology: (1) schema ontology, (2) schema constraint ontology, (3) domain ontology, and (4) user preference ontology. These ontologies are maintained by experts and, as will be clarified later, are mainly applied to assist users in query formulation and to support the active mining mechanism. The domain ontology also provides the mining engine with further information to find rules with extra concepts.
In the following sections, we will describe the details of each ontology.
4.1 Schema Ontology
A multidimensional data model shows the inter-relationships of the fact table and dimension tables through the key structures of the relational tables. There are other relationships between dimensions or attributes that are not shown in the model yet can be beneficial to the data mining process, including concept hierarchical relationships and the different additive characteristics of fact measures in the data warehouse. We use a schema ontology to represent such relationships. Figure 12 is an example schema ontology corresponding to the multidimensional data model in Figure 3.
There are three additive types of fact measures for OLAP operations. (1) Additive measures can be summed along all dimensions; typical examples are sale quantity and sale amount. (2) Semi-additive measures can be summed only along certain dimensions. For example, cost has a sensible aggregation only along the product dimension. (3) Non-additive measures have no sensible summation along any dimension. For example, profit is calculated by subtracting cost from sale amount and is a non-additive measure. Another obvious example, which is not part of the schema, is the balance amount in a bank statement. It cannot generate any sensible summation for analysis purposes because it is an accumulated amount. The information about the additive types of measures is valuable in that it helps validate a user’s specification of aggregation. For example, the target data for the query with tG = {CustID} and tM = {Cost} will generate a summation of Cost along CustID, which is illegal. With the help of the additive-type information in the schema ontology, the system detects such ineffective settings for mining.
Figure 12. An example of schema ontology (legend: additive fact, semi-additive fact, non-additive fact)
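The validation of aggregation settings against additive types can be sketched as a simple lookup. The example below is only an illustration: the measure names, the mapping from attributes to dimensions in DIMENSION_OF, and the function can_sum are assumptions, while the additive types themselves follow the examples given in the text.

```python
# Additive type of each measure and, for semi-additive measures, the
# dimensions along which summation is sensible (per the text's examples).
MEASURES = {
    "SaleQuantity": ("additive", None),
    "SaleAmount": ("additive", None),
    "Cost": ("semi-additive", {"Product"}),
    "Profit": ("non-additive", set()),
}

# Assumed mapping from grouping attributes to their dimensions.
DIMENSION_OF = {
    "CustID": "Customer", "Education": "Customer",
    "ProdName": "Product", "Category": "Product",
    "Date": "Time",
}


def can_sum(measure, tG):
    """Return True if summing `measure` per group of tG is sensible."""
    kind, allowed = MEASURES[measure]
    if kind == "additive":
        return True
    if kind == "non-additive":
        return False
    # Semi-additive: every grouping attribute must lie in an allowed dimension.
    return all(DIMENSION_OF[a] in allowed for a in tG)


print(can_sum("Cost", {"CustID"}))             # the illegal setting from the text
print(can_sum("Cost", {"ProdName"}))
print(can_sum("SaleAmount", {"CustID", "Date"}))
```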
The hierarchical relationships are important in that they give the user a better data selection mechanism and hence minimize infeasible query settings. For example, according to the schema ontology, the products “ASUS EeePC 900” and “ASUS EeePC 1000” have the following concept hierarchy:
EeePC (Type)
ASUS (Brand)
PC (Category)
If a user knows that a hierarchical relationship exists between the type of a product and its category, they will not try to mine the associations between them. If they do, trivial patterns will be generated because a product type decides its category. For example,
Type = “EeePC” ⇒ Category = “PC”
is already known, therefore the mining is redundant.
4.2 Schema Constraint Ontology
In [48, 49], the authors explored all possible mining spaces of mining attribute combinations under varied transaction IDs and defined the allowed range via data constraints, which can reduce the search space efficiently.
In multidimensional association rule mining, similar data constraints also exist on tG and tM. In addition, we have observed some predicates that are specific to multidimensional association rule mining. Some constraints are also domain dependent; for example, tG = {Size} may be valid in the plastic bag business but not in the 3C business. In this dissertation we present some example constraints for illustration purposes. All these constraint relationships are beyond the representation capability of a data warehouse. The schema constraint ontology is used to describe such semantic constraint relationships in general. Figure 13 is an example of a schema constraint ontology derived from the schema in Figure 3. Some attributes have multiple constraints. These constraints provide the information the system needs to verify the user-specified tG and tM of a query and hence avoid invalid query formulation. The following are some examples of constraint relationships related to multidimensional association rule mining.
Figure 13. An example of schema constraint ontology (figure columns: Dimension, Attributes, Predicate, Attributes, Dimension)
(1) Decisive
This is the relationship of functional dependency. A functional dependency exists between two attributes A1 and A2 if, for any two transactions t1 and t2, t1.A1 = t2.A1 implies t1.A2 = t2.A2. Whenever such a relationship exists, redundant mining space or known patterns will be generated and consequently waste mining effort. For example, a query with tG1 = {CustID, Education} is a redundant form of tG2 = {CustID} because Education is decided by CustID. The mining space created for tG1 and tG2 will be exactly the same. Another example is that a query with tM = {ProdName, Brand} will generate redundant rules because ProdName decides Brand, therefore associations between ProdName and Brand are already known. The rule
ProdName = “IBM TP” ⇒ Brand = “IBM”
is an example of a redundant form.
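Whether a Decisive relationship is plausible for a pair of attributes can be probed directly on the data by testing the functional dependency condition above. The sketch below is a minimal illustration; the function name determines and the sample rows (including the brand values other than “IBM”) are made up, and a passing check only shows that the data is consistent with the dependency, not that the dependency holds in general.

```python
def determines(rows, a1, a2):
    """True if no two rows agree on a1 but differ on a2 (a1 decides a2)."""
    seen = {}
    for row in rows:
        # Remember the first a2 value seen for each a1 value and compare.
        if seen.setdefault(row[a1], row[a2]) != row[a2]:
            return False
    return True


# Illustrative rows; BrandX is a placeholder brand.
rows = [
    {"CustID": "C001", "Education": "College", "ProdName": "IBM TP", "Brand": "IBM"},
    {"CustID": "C001", "Education": "College", "ProdName": "Hard Disk", "Brand": "BrandX"},
    {"CustID": "C005", "Education": "College", "ProdName": "IBM TP", "Brand": "IBM"},
    {"CustID": "C005", "Education": "College", "ProdName": "IBM60GB", "Brand": "IBM"},
    {"CustID": "C003", "Education": "High School", "ProdName": "IBM60GB", "Brand": "IBM"},
]

print(determines(rows, "CustID", "Education"))   # CustID decides Education
print(determines(rows, "Education", "CustID"))   # but not the other way round
print(determines(rows, "ProdName", "Brand"))     # ProdName decides Brand
print(determines(rows, "Brand", "ProdName"))
```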
(2) GroupOnly
An attribute with the GroupOnly constraint can be used as a transaction ID only; CustID is an example. If CustID is used as a mining attribute, unnecessary rules such as the following will be generated:
CustID = “C003” ⇒ Category = “Printer”.
(3) ItemOnly
This constraint indicates that an attribute can be used only as an interested mining attribute and not as a transaction ID. For example, Size in the 3C domain is not suitable for deciding the granularity of data.
(4) Together
Some attributes must be used together to distinguish transactions correctly, whereas using them separately is meaningless. For example, in a medical application, a drug dosage consists of a quantity and its unit. Suppose that they are stored in separate attributes; either the quantity or the unit used alone is then meaningless. For example, 50 mg and 0.5 g are two dosages, and the values 50 and 0.5 cannot be analyzed without their units mg and g.
(5) Follow
Some attributes should be used only in company with another attribute that they follow. For example, a city is located in a region, but different regions may have cities with the same name. In the United States, for instance, both Tennessee and Ohio have a city called Springfield. If the attribute City is used alone as a transaction ID, the results will be misleading.
Therefore, City may be used only when it follows Region.
(6) NoSingleGroup
An attribute with the NoSingleGroup constraint cannot be used alone as a transaction ID. For example, gender has only two distinct values, namely male and female. Grouping by gender alone results in too few transactions to generate any patterns.
(7) NoIntraMining
An attribute with the NoIntraMining constraint cannot be used alone as an interested mining attribute. In other words, it cannot be used to mine intra-dimensional association rules. Settings such as tM = {Gender} or tM = {Education} will generate unnecessary rules like
Gender = “Female” ⇒ Gender = “Male”, or
Education = “College” ⇒ Education = “High School”.
(8) Exclusive
This constraint prohibits the simultaneous appearance of certain attributes. For example, assume that calendar and fiscal date attributes are both in the time dimension and that a fiscal year starts on Sep. 1, so that it spans two calendar years. Using FiscalYear and CalendarMonth together as the transaction ID is incorrect. The Exclusive constraint is used to avoid invalid combinations of such attributes.
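The eight predicates above can be checked mechanically against a user-specified tG and tM. The following is a minimal sketch under stated assumptions: the ontology is flattened into plain Python dictionaries and lists, the attribute names DoseQuantity and DoseUnit are hypothetical, Follow and Exclusive are checked on tG only (as in the examples given), and the function check_query is illustrative rather than the system's actual interface.

```python
# Unary predicates per attribute, taken from the examples in this section.
UNARY = {
    "CustID": {"GroupOnly"},
    "Size": {"ItemOnly"},                        # for the 3C domain
    "Gender": {"NoSingleGroup", "NoIntraMining"},
    "Education": {"NoIntraMining"},
}
# Binary predicates as (first, second) attribute pairs.
DECISIVE = [("CustID", "Education"), ("ProdName", "Brand")]  # first decides second
TOGETHER = [("DoseQuantity", "DoseUnit")]        # hypothetical attribute names
FOLLOW = [("City", "Region")]                    # City must follow Region
EXCLUSIVE = [("FiscalYear", "CalendarMonth")]


def check_query(tG, tM):
    """Return a list of (predicate, attribute or pair) violations."""
    tG, tM = set(tG), set(tM)
    found = []
    for a in sorted(tM):
        if "GroupOnly" in UNARY.get(a, ()):
            found.append(("GroupOnly", a))
    for a in sorted(tG):
        if "ItemOnly" in UNARY.get(a, ()):
            found.append(("ItemOnly", a))
    if len(tG) == 1 and "NoSingleGroup" in UNARY.get(next(iter(tG)), ()):
        found.append(("NoSingleGroup", next(iter(tG))))
    if len(tM) == 1 and "NoIntraMining" in UNARY.get(next(iter(tM)), ()):
        found.append(("NoIntraMining", next(iter(tM))))
    for det, dep in DECISIVE:
        if {det, dep} <= tG or {det, dep} <= tM:
            found.append(("Decisive", (det, dep)))
    for a, b in TOGETHER:
        for attrs in (tG, tM):
            if (a in attrs) != (b in attrs):     # one is used without the other
                found.append(("Together", (a, b)))
    for a, b in FOLLOW:
        if a in tG and b not in tG:
            found.append(("Follow", (a, b)))
    for a, b in EXCLUSIVE:
        if {a, b} <= tG:
            found.append(("Exclusive", (a, b)))
    return found


print(check_query({"CustID", "Date"}, {"Education", "ProdName"}))
print(check_query({"Gender"}, {"ProdName"}))
print(check_query({"City"}, {"ProdName"}))
```

Returning the violated predicate together with the offending attributes, rather than a bare true or false, leaves room for an interactive interface to explain why a setting is rejected.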
The schema constraint ontology is particularly helpful for checking query settings. A user’s specification of tG and tM can be verified with the knowledge represented in this ontology. Through the intelligent user interface, the user can be alerted to an incorrect or inefficient tG or tM and adjust it interactively.
4.3 Domain Ontology
Domain ontologies are used to represent the domain knowledge related to the mining subject. Their contents are specific to certain dimensions in a data warehouse; for example, the product dimension and the customer dimension have domain ontologies of their own to express their semantic relationships. Following the W3C recommendation [45], a domain ontology can be constructed in the triple format “Subject-Relationship-Object”. For example, the following triple indicates that a hard disk is a component of a PC:
PC Composition Hard disk,
where the subject PC has a Composition relationship with the object Hard disk.
Figure 14 is an example domain ontology of 3C products, which describes classification (is-a) and composition (has-a) relationships between products. Other relationships, such as product features or compatibilities of software to OS or memory to motherboards etc., can also be represented.
Figure 14. An example of 3C domain ontology
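The triple format lends itself to a very small in-memory representation. The sketch below stores triples as Python tuples and walks the Classification (is-a) relationship transitively; only the triple (PC, Composition, Hard disk) comes from the text, while the other triples and the function names related and generalizations are hypothetical and serve illustration only.

```python
# (Subject, Relationship, Object) triples. Only the first one is from the text;
# the remaining triples are made-up examples.
TRIPLES = [
    ("PC", "Composition", "Hard disk"),
    ("PC", "Composition", "RAM"),
    ("Notebook", "Classification", "PC"),
    ("PC", "Classification", "Computer product"),
]


def related(subject, relationship):
    """Objects linked to `subject` by `relationship`, in insertion order."""
    return [o for s, r, o in TRIPLES if s == subject and r == relationship]


def generalizations(concept):
    """All concepts reachable from `concept` via Classification (is-a) links."""
    found, stack = [], [concept]
    while stack:
        for parent in related(stack.pop(), "Classification"):
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


print(related("PC", "Composition"))
print(generalizations("Notebook"))
```

A mining engine could use a lookup such as generalizations to replace an item by a more general concept, which is the kind of concept extension discussed next.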
One of the advantages of introducing a domain ontology into the mining system is the ability to find concept-extended rules from the data warehouse. For data with hierarchical relationships, some previous work has discussed the extension of association rules [13, 21, 23, 39, 53] from primitive into generalized or leveled format. Current data warehouse systems cannot represent possible extended relationships between data values. For example, the hierarchical relationship in the product dimension of Figure 4 describes only the category and brand of products and their types, yet some possible concept extensions between products are not reflected. We know that “Printer” and “PC” are two categories of products, but we also notice that there exist more conceptual relationships between product types and categories that are not disclosed in the schema structure. The domain ontology describes such conceptual relationships with respect to each product. For example, different products may have different levels of conceptual hierarchies as shown in Figure 14 that Epson LQ2090, Epson EPL 6000