
Chapter 3 Related Work

3.3 Active Data Mining

As surveyed in [47], a considerable number of proposals and applications have been developed for active database systems, yet little research has addressed active data mining. Data

Psaila [2]. They presented an approach to active data mining that works on accumulated association rules and reacts when a certain trend in the rules is found. Their focus is on defining the shapes of trends and a query language for retrieving those shapes. Their work also stimulated the closely related theme of change detection in dynamically evolving data. Some studies have focused on handling changes in classification mining [17,40,57]. Recently, the scope of active data mining research has broadened considerably [42] to cover any effort to automate the processes within the generic KDD framework [15], such as data collection, model selection [5], pattern discovery [58], and rule evaluation [35, 36]. Our study of active mining in this context involves target data selection, pattern discovery and deployment, which to the best of our knowledge has not been investigated in the literature.

In summary, our work focuses on activating data re-mining with the queries in the user preference ontology, which embodies the representative queries used in the past. Many concepts and techniques used in our framework resemble those adopted in the recommender system community [1], for example, incorporating user preferences [3, 16], learning user preferences without intrusively monitoring the user’s activities [31, 41], ontological representation of user preferences [41], and query log analysis [14, 59].

Nevertheless, the main purpose of our work is quite different. Rather than recommending unseen items or subjects in which a user is interested, we aim at capturing which data mining queries interest a user the most and building an activating mechanism to re-execute these queries, keeping the users aware of the newest updates of the evolving environment. Yet we believe that the experience gained in the recommender system community can shed light on promising research issues in active data mining.

CHAPTER 4

THE PROPOSED SYSTEM FRAMEWORK OF INTELLIGENT DATA WAREHOUSE MINING

In this chapter we introduce our proposed data warehouse mining system, which incorporates various ontologies to provide intelligent assistance in the mining process. Examples are used to show how such a system can remedy the aforementioned deficiencies of modern data warehouse mining systems.

Figure 11. An ontology-integrated data warehouse mining system for multidimensional association rule

We assume that data has been extracted from different data sources, such as transaction databases, flat files and internet pages, and then transformed and loaded into a data warehouse. A data warehouse keeps integrated data and gives users convenient access to consistent data without their having to worry about heterogeneous data sources or data types. It is therefore well suited to knowledge discovery operations such as data mining and OLAP. In our proposed data warehouse mining system, several ontologies are utilized. Figure 11 shows the proposed system framework, which includes a regular mining path and an active mining path. A regular mining process begins with a user setting a query. According to the specified query, the target data is prepared and the mining engine is then launched. The user tunes the query repeatedly until a satisfactory result is found. Tuning the minimum support and minimum confidence is especially challenging for a user who is not experienced in mining multidimensional association rules.

During the tuning process the system intelligently helps the user converge toward the desired query with the assistance of the schema ontology, schema constraint ontology, domain ontology and user preference ontology. The query elements of a successful mining run are saved in the mining log, which provides data for further analysis and reference. The mining log can be aggregated so that closely related queries are concisely combined, and the analyzed results of the mining log are then used to construct the user preference ontology. The user preference ontology in turn enables the active mining mechanism and the dispatching functions. The dotted line in Figure 11 is the path of active mining. The active mining and dispatch module initiates an action that triggers data mining with the queries in the user preference ontology. The results are maintained in the rule base, and specific rules are dispatched to specific users according to the information in the user profile database.

The knowledge that supports the system’s intelligent services is incorporated in this framework. It includes the characteristics of the data warehouse schema, constraint relationships between attributes, domain-specific knowledge, and user preference information accumulated from past mining. This knowledge is beyond the representation capability of current data warehouse systems, and in this dissertation we structure it into four types of ontology: (1) schema ontology, (2) schema constraint ontology, (3) domain ontology, and (4) user preference ontology. These ontologies are maintained by experts and, as will be clarified later, are mainly applied to assist users in query formulation and to support the active mining mechanism. The domain ontology also provides the mining engine with further information for finding rules with extra concepts.

In the following sections, we will describe the details of each ontology.

4.1 Schema Ontology

A multidimensional data model shows the inter-relationships of the fact table and the dimension tables through the key structures of relational tables. There are other relationships between dimensions or attributes that are not shown in the model yet can be beneficial to the data mining process, including concept hierarchical relationships and the different additive characteristics of fact measures in the data warehouse. We use a schema ontology to represent such relationships. Figure 12 is an example schema ontology corresponding to the multidimensional data model in Figure 3.

There are three additive types of fact measures for OLAP operations. (1) Additive measures can be summed up along all the dimensions; typical examples are sale quantity and sale amount. (2) Semi-additive measures can be summed up only along certain dimensions; for example, the cost has a sensible aggregation only along the product dimension. (3) Non-additive measures have no sensible summation along any dimension; for example, profit is calculated by subtracting cost from sale amount and is a non-additive measure. Another obvious example, which is not part of the schema, is the balance amount in a bank statement: it is an accumulated amount, so no summation of it is sensible for analysis purposes. The information on the additive types of measures is valuable in that it helps validate a user’s specification of aggregation. For example, the target data for the query with tG = {CustID} and tM = {Cost} will generate a summation of Cost along CustID, which is illegal. With the help of the additive-type information in the schema ontology, the system detects such ineffective settings for mining.
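The additive-type check described in this paragraph can be illustrated with a minimal Python sketch. It is not the system's implementation: the measure names `Quantity`, `Amount` and `Profit`, the attribute-to-dimension mapping, and the function `invalid_aggregations` are all hypothetical, and only the rule that Cost aggregates sensibly along the product dimension is taken from the text.

```python
# Hypothetical additive-type table: each measure maps to the dimensions
# along which it may be summed. Only the Cost entry mirrors the text.
ADDITIVE_DIMENSIONS = {
    "Quantity": {"Product", "Customer", "Time"},  # additive (assumed name)
    "Amount": {"Product", "Customer", "Time"},    # additive (assumed name)
    "Cost": {"Product"},                          # semi-additive
    "Profit": set(),                              # non-additive (assumed name)
}

# Hypothetical mapping from a grouping attribute to its dimension.
ATTRIBUTE_DIMENSION = {
    "CustID": "Customer",
    "ProdName": "Product",
    "Category": "Product",
    "Date": "Time",
}


def invalid_aggregations(t_g, t_m):
    """Return (measure, attribute) pairs whose summation is not sensible."""
    problems = []
    for measure in t_m:
        if measure not in ADDITIVE_DIMENSIONS:
            continue  # not a fact measure, so there is nothing to check
        allowed = ADDITIVE_DIMENSIONS[measure]
        for attribute in t_g:
            if ATTRIBUTE_DIMENSION.get(attribute) not in allowed:
                problems.append((measure, attribute))
    return problems
```

With these assumed tables, `invalid_aggregations(["CustID"], ["Cost"])` flags the pair `("Cost", "CustID")`, which corresponds to the illegal query in the example, while grouping Cost by `ProdName` passes.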

Figure 12. An example of schema ontology (legend: additive fact, semi-additive fact, non-additive fact)

The hierarchical relationships are important in that they give a user a better data selection mechanism and hence minimize infeasible query settings. For example, according to the schema ontology, the products “ASUS EeePC 900” and “ASUS EeePC 1000” have the following concept hierarchy:

EeePC (Type)

ASUS (Brand)

PC (Category)

If a user knows that hierarchical relationships exist between the type of a product and its category, the user will not try to mine the associations between them. Otherwise, trivial patterns will be generated because a product type decides its category. For example,

Type = “EeePC” ⇒ Category = “PC”

is already known, and therefore mining it is redundant.

4.2 Schema Constraint Ontology

In [48, 49], the authors explored all possible mining spaces formed by combinations of mining attributes under varied transaction IDs and defined the allowed range via data constraints, which can reduce the search space efficiently.

In multidimensional association rule mining, similar data constraints also exist on tG and tM. In addition, we have observed some predicates that are specific to multidimensional association rule mining. Some constraints are also domain dependent; for example, tG = {Size} may be valid in the plastic bag business but not in the 3C business. In this dissertation we present some example constraints for illustration purposes. All these constraint relationships are beyond the representation capability of a data warehouse. The schema constraint ontology is used to describe such semantic constraint relationships in general. Figure 13 is an example of a schema constraint ontology derived from the schema in Figure 3. Some attributes have multiple constraints. These constraints provide the information the system needs to verify the user-specified tG and tM of a query and hence avoid invalid query formulation. The following are some examples of constraint relationships related to multidimensional association rule mining.

Figure 13. An example of schema constraint ontology

(1) Decisive

This is the relationship of functional dependency. A functional dependency exists between two attributes A1 and A2 if, for any two transactions t1 and t2, t1.A1 = t2.A1 implies t1.A2 = t2.A2. Whenever such a relationship exists, a redundant mining space or already known patterns will be generated, which wastes mining effort. For example, a query with tG1 = {CustID, Education} is a redundant form of tG2 = {CustID} because Education is decided by CustID; the mining spaces created for tG1 and tG2 will be exactly the same. Another example is that a query with tM = {ProdName, Brand} will generate redundant rules because ProdName decides Brand. The rule

ProdName = “IBM TP” ⇒ Brand = “IBM”

is an example of a redundant form.
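The functional dependency condition behind the Decisive constraint can be checked against sample data with a few lines of Python. This is only a sketch of how a declared constraint might be sanity-checked; in the dissertation the ontology is maintained by experts, and the function name and the sample rows below are hypothetical.

```python
def holds_functional_dependency(rows, determinant, dependent):
    """Return True if equal `determinant` values always imply equal `dependent` values."""
    seen = {}
    for row in rows:
        key = row[determinant]
        value = row[dependent]
        # setdefault stores the first value seen for this key and returns it,
        # so a later, different value reveals a violation of the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample rows of a product dimension.
sample_rows = [
    {"ProdName": "IBM TP", "Brand": "IBM"},
    {"ProdName": "HP DeskJet", "Brand": "HP"},
    {"ProdName": "IBM 60GB", "Brand": "IBM"},
    {"ProdName": "IBM TP", "Brand": "IBM"},
]
```

On these rows `ProdName` decides `Brand`, but not the other way round, because the brand “IBM” occurs with two different product names.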

(2) GroupOnly

An attribute with the GroupOnly constraint can be used as a transaction ID only; CustID is an example. If CustID is used as a mining attribute, unnecessary rules such as the following will be generated:

CustID = “C003” ⇒ Category = “Printer”.

(3) ItemOnly

This constraint indicates that an attribute can be used only as a mining attribute of interest. For example, Size in the 3C domain is not suitable for deciding the granularity of data.

(4) Together

Some attributes need to be used together so that transactions are distinguished correctly, whereas using them separately is meaningless. For example, in a medical application, a drug dosage consists of a quantity and its unit. Suppose that they are stored in separate attributes; using either the quantity or the unit alone is then meaningless. For example, 50 mg and 0.5 g are two dosages, and the values 50 and 0.5 cannot be analyzed without their units mg and g.

(5) Follow

Some attributes should be used only following another attribute. For example, a city is located in a region, but different regions can have cities with the same name. In the United States, the states of Tennessee and Ohio both have a city called Springfield. If the attribute City is used alone as a transaction ID, the results will be misleading. Therefore, City may be used only following Region.

(6) NoSingleGroup

An attribute with the NoSingleGroup constraint cannot be used alone as a transaction ID. For example, gender has only two distinct values, namely male and female. Grouping by gender alone will result in too few transactions to generate any patterns.

(7) NoIntraMining

An attribute with the NoIntraMining constraint cannot be used alone as a mining item of interest. In other words, it cannot be used to mine intra-dimensional association rules. Queries such as tM = {Gender} or tM = {Education} will generate unnecessary rules like

Gender = “Female” ⇒ Gender = “Male”, or

Education = “College” ⇒ Education = “High School”.

(8) Exclusive

This constraint prohibits the simultaneous appearance of certain attributes. For example, assume that calendar dates and fiscal dates are both attributes of the time dimension and that a fiscal year starts on Sep. 1, so that it spans two calendar years. Using FiscalYear and CalendarMonth together as the transaction ID is incorrect. The Exclusive constraint is used to avoid invalid combinations of such attributes.

The schema constraint ontology is particularly helpful for checking query settings. A user’s specification of tG and tM can be verified against the knowledge represented in this ontology. Through the intelligent user interface, the user can be alerted to an incorrect or inefficient tG or tM and adjust it interactively.
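The verification of tG and tM against the constraint types listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual checker: the dictionary layout, the attribute names `DoseQuantity` and `DoseUnit`, and the function `check_query` are hypothetical, and the choice of which constraints apply to tG, to tM, or to both is read off the examples in the text.

```python
# Hypothetical constraint declarations; attribute names follow the text's examples.
CONSTRAINTS = {
    "decisive": [("CustID", "Education"), ("ProdName", "Brand")],  # (A1, A2): A1 decides A2
    "group_only": {"CustID"},
    "item_only": {"Size"},
    "together": [{"DoseQuantity", "DoseUnit"}],   # assumed attribute names
    "follow": [("City", "Region")],               # City may be used only with Region
    "no_single_group": {"Gender"},
    "no_intra_mining": {"Gender", "Education"},
    "exclusive": [{"FiscalYear", "CalendarMonth"}],
}


def check_query(t_g, t_m, c=CONSTRAINTS):
    """Return the sorted names of the constraints violated by tG and tM."""
    t_g, t_m = set(t_g), set(t_m)
    violated = set()
    for attrs in (t_g, t_m):
        # Decisive: a deciding attribute and its dependent appear together.
        if any(a in attrs and b in attrs for a, b in c["decisive"]):
            violated.add("Decisive")
        # Together: some, but not all, members of a coupled group are used.
        if any((attrs & group) and not (group <= attrs) for group in c["together"]):
            violated.add("Together")
    if t_m & c["group_only"]:
        violated.add("GroupOnly")
    if t_g & c["item_only"]:
        violated.add("ItemOnly")
    if any(a in t_g and b not in t_g for a, b in c["follow"]):
        violated.add("Follow")
    if len(t_g) == 1 and t_g <= c["no_single_group"]:
        violated.add("NoSingleGroup")
    if len(t_m) == 1 and t_m <= c["no_intra_mining"]:
        violated.add("NoIntraMining")
    if any(group <= t_g for group in c["exclusive"]):
        violated.add("Exclusive")
    return sorted(violated)
```

For instance, `check_query({"CustID", "Education"}, {"ProdName"})` reports the Decisive constraint, and `check_query({"Gender"}, {"Category"})` reports NoSingleGroup. An interactive interface could present the returned names as reminders to the user.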

4.3 Domain Ontology

Domain ontologies are used to represent the domain knowledge related to the mining subject. Their contents are specific to certain dimensions in a data warehouse; for example, the product dimension and the customer dimension each have a domain ontology of their own to express their semantic relationships. Following the W3C recommendation [45], a domain ontology can be constructed in the triple format “Subject-Relationship-Object”. For example, the following triple indicates that a hard disk is a component of a PC:

PC Composition Hard disk,

where the subject PC has a Composition relationship with the object Hard disk.

Figure 14 is an example domain ontology of 3C products, which explores classification (is-a) and composition (has-a) relationships between products. Other relationships, such as product features or compatibilities of software to OS or memory to motherboards etc., can

Figure 14. An example of 3C domain ontology

One of the advantages of introducing a domain ontology into the mining system is the ability to find concept-extended rules from the data warehouse. For data with hierarchical relationships, some previous work has extended association rules [13,21,23,39,53] from the primitive format into generalized or leveled formats. Current data warehouse systems cannot represent possible extended relationships between data values. For example, the hierarchical relationship in the product dimension of Figure 4 describes only the category and brand of products and their types, yet some possible concept extensions between products are not reflected. We know that “Printer” and “PC” are two categories of products, but there also exist more conceptual relationships between product types and categories that are not disclosed in the schema structure. A domain ontology describes such conceptual relationships for each product. For example, different products may have different levels of conceptual hierarchies: as shown in Figure 14, Epson LQ2090, Epson EPL 6000 and Epson NX105 have one, two and three classification levels respectively. Therefore, it is possible to find extended rules by incorporating the domain ontology into data warehouse mining. In Figure 14, “PC” and “Notebook” are the generalized classifications of “IBM TP”, while “RAM 256MB” and “IBM 60GB” are its compositions. The following rule,

ProdName = “HP DeskJet” ⇒ ProdName = “IBM TP”,

reveals that most customers who buy “HP DeskJet” tend to also buy “IBM TP”. According to the composition relationships in the 3C domain ontology, we know that IBM 60GB is a component of IBM TP. Therefore, the following extended rule can be derived:

ProdName = “HP DeskJet” ⇒ ProdName = “IBM TP” with ProdName = “IBM 60GB”,

which shows that customers who buy “HP DeskJet” tend to buy “IBM TP” with “IBM 60GB”.
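The derivation of an extended rule from composition triples can be illustrated with a short Python sketch. The text does not specify how the mining engine computes extended rules, so the function `extend_rule` and the triple list are hypothetical; the sketch only enumerates candidate extensions and does not evaluate their support or confidence.

```python
# Hypothetical triples (subject, relationship, object) describing "IBM TP"
# as in the Figure 14 discussion.
TRIPLES = [
    ("IBM TP", "composition", "IBM 60GB"),
    ("IBM TP", "composition", "RAM 256MB"),
    ("IBM TP", "classification", "Notebook"),
    ("IBM TP", "classification", "PC"),
]


def extend_rule(antecedent, consequent, triples=TRIPLES):
    """Return (antecedent, consequent, component) for each component of the consequent."""
    return [
        (antecedent, consequent, obj)
        for subj, rel, obj in triples
        if subj == consequent and rel == "composition"
    ]
```

Calling `extend_rule("HP DeskJet", "IBM TP")` yields one candidate per component of “IBM TP”, including the “IBM 60GB” extension given in the example.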

Another motivation for incorporating a domain ontology is that it helps a user clarify the scope of the mining target. For example, if a user is interested only in analyzing PC components, the user can specify a filtering condition that excludes all products that are not PC components by utilizing the composition relationship expressed in the domain ontology. Precisely, the filtering condition wc can be defined to restrict ProdName through the 3C_Domain_Ontology as follows:

wc = ProdName in 3C_Domain_Ontology (‘PC’, composition, var_All).

The predicate 3C_Domain_Ontology takes the subject-relationship-object triple as parameters, where var_All is a variable representing all the products that are components of PC. Thus, with the help of the domain ontology, the system allows a user to specify more precise target data for mining.
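The filtering condition wc can be mimicked in Python by treating the ontology predicate as a lookup over triples. This is a sketch under assumptions: `ontology_query`, the sample triples other than (PC, composition, Hard disk), and the sample rows are hypothetical, and only direct (non-transitive) composition is considered.

```python
def ontology_query(triples, subject, relationship):
    """Return every object o such that (subject, relationship, o) is in `triples`."""
    return {o for s, r, o in triples if s == subject and r == relationship}


# Hypothetical fragment of a 3C domain ontology.
domain_triples = [
    ("PC", "composition", "Hard disk"),
    ("PC", "composition", "RAM"),
    ("Printer", "composition", "Ink cartridge"),
]

# Hypothetical rows of target data.
rows = [
    {"CustID": "C001", "ProdName": "Hard disk"},
    {"CustID": "C002", "ProdName": "Ink cartridge"},
    {"CustID": "C003", "ProdName": "RAM"},
]

# Plays the role of var_All: all products that are components of PC.
pc_components = ontology_query(domain_triples, "PC", "composition")

# wc: keep only the rows whose ProdName is a PC component.
filtered_rows = [row for row in rows if row["ProdName"] in pc_components]
```

Here `filtered_rows` keeps the rows for customers C001 and C003 and drops the printer component.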

4.4 User Preference Ontology

The user preference ontology contains the representative queries, together with user information, derived from the mining log. It provides two kinds of support. First, by letting users browse its contents, it helps and inspires them to express their mining intentions; second, it provides the mining queries for active re-mining. The user preference ontology can be regarded as a concise version of the mining log that contains surrogate queries, that is, the frequently used queries with representative power.

The intelligent support of our proposed system provides recommendations of query elements, which include, for example, tG, tM, ms and mc. The knowledge used for recommendation comes mainly from the user preference ontology.

Figure 15. An example of user preference ontology

Figure 15 shows an example of a user preference ontology. The surrogate queries are structured in hierarchies of (tG) (tM) (ms, mc). The attribute index provides fast access to tG and tM. The user profile index connects to the users in the user preference ontology and also to the user profile database in the enterprise. Each user connects to a surrogate query with an associated user preference (User_Pref), and each surrogate query is also associated with a representative power (Rep_Power). Both Rep_Power and User_Pref can be used as query conditions for accessing the desired surrogate queries. For example, suppose user A wants to

medium preference for them. The conditional setting will be,

Mining queries: Rep_Power ≥ Strong, GUsermining = {A}, User_Pref = High.
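Selecting surrogate queries by Rep_Power and User_Pref can be sketched in Python as below. The class `SurrogateQuery`, the ordering of the Rep_Power levels, and the sample queries are assumptions made for illustration; the dissertation does not prescribe this data structure.

```python
from dataclasses import dataclass, field

# Assumed ordering of representative-power levels.
REP_POWER_ORDER = {"weak": 0, "medium": 1, "strong": 2}


@dataclass
class SurrogateQuery:
    t_g: frozenset
    t_m: frozenset
    ms: float
    mc: float
    rep_power: str
    user_pref: dict = field(default_factory=dict)  # user -> preference level


def select_queries(queries, user, min_rep_power, user_pref):
    """Return queries with Rep_Power >= min_rep_power and the given User_Pref for `user`."""
    threshold = REP_POWER_ORDER[min_rep_power]
    return [
        q for q in queries
        if REP_POWER_ORDER[q.rep_power] >= threshold
        and q.user_pref.get(user) == user_pref
    ]


# Hypothetical contents of a user preference ontology.
queries = [
    SurrogateQuery(frozenset({"CustID"}), frozenset({"Category"}), 0.25, 0.30,
                   "strong", {"A": "high", "B": "medium"}),
    SurrogateQuery(frozenset({"CustID", "Date"}), frozenset({"ProdName"}), 0.35, 0.25,
                   "weak", {"A": "high"}),
    SurrogateQuery(frozenset({"Date"}), frozenset({"Brand"}), 0.45, 0.20,
                   "strong", {"A": "medium"}),
]

selected = select_queries(queries, "A", "strong", "high")
```

With the conditions Rep_Power ≥ Strong and User_Pref = High for user A, only the first hypothetical query is selected.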

In the next chapter, we describe in detail how a user preference ontology is constructed.

CHAPTER 5

THE CONSTRUCTION OF USER PREFERENCE ONTOLOGY

5.1 Mining Log

A query reflects a user’s mining intention. For each successful mining process, the query and its user information are maintained in a log, which includes the query elements tG, tM, wc, ms and mc; the user profile, such as user name, e-mail and departmental information; and the mining results, in principle the rule count. A mining log grows over time and eventually becomes too large to be helpful in mining. Therefore, it is necessary to condense the mining log to obtain the frequently used queries.

The mining log is structured in a star schema as shown in Figure 16. A star schema is a multidimensional data model that is convenient for online analytical processing (OLAP). The mining log is structured this way because OLAP operations are needed while condensing the log into the user preference ontology. OLAP provides operations such as roll-up, drill-down, slice and dice. It also offers capabilities for calculating ratios, variance, etc., and generates summarizations and aggregations at a specific granularity level.

Figure 16. Mining log star (fact table: QueryFact)

5.2 The Construction of User Preference Ontology

The construction of a user preference ontology is a process of condensing the mining log. Because it is a pre-processing procedure, the construction time is not a major concern.

The construction process starts by aggregating the mining log; surrogate queries are then generated according to the aggregation result. The representative power and the user preference of the surrogate queries are derived, and the user profiles are integrated. Figure 17 shows the construction steps.

Figure 17. Construction steps of user preference ontology

Mining Log

1. Aggregate mining log by tG, tM and UserID

2. Generate surrogate queries

3. Decide the representative power and user preference of the surrogate queries

4. Prune the surrogate queries with sparse representative power

5. Integrate with the user profile

User Preference ...

5.2.1 Aggregate Mining Log by tG, tM and UserID

In data mining, a query can be used by many users many times, and all successful queries are recorded in the mining log. The frequencies of queries, summarized by tG, tM and User_ID, are used to generate appropriate surrogate queries. The average, minimum and maximum of ms and mc are calculated in order to derive the ms and mc of the surrogate queries. To obtain the minimum set of ms and mc, only the minimum value of ms is selected, and its corresponding mc becomes the mc of the minimum set; the mc of the minimum set is not obtained by selecting the minimum mc from the history log. The maximum set of ms and mc is obtained in the same way. For example, suppose (25%, 30%), (35%, 25%) and (45%, 20%) are the (ms, mc) pairs of a query that has been used many times in the history. Then min(ms, mc) is (25%, 30%) instead of (25%, 20%), and max(ms, mc) is (45%, 20%) instead of (45%, 30%). A SQL-like expression of this process is:

SELECT count(*), avg(ms), avg(mc), min({ms,mc}), max({ms,mc})
FROM MiningLogStar
GROUP BY tG, tM, User_ID
INTO TABLE GM-Pattern Table.

Table 5 is an example of the aggregation table, called a GM-Pattern Table.
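Because min({ms,mc}) and max({ms,mc}) are not standard SQL aggregates, the aggregation can be made concrete with a small Python sketch. The function `build_gm_pattern_table` and the log entry layout are hypothetical; the sketch assumes that each log entry carries tG, tM, a user identifier, ms and mc, and that ties on ms are resolved by taking the first entry encountered.

```python
from collections import defaultdict


def build_gm_pattern_table(log):
    """Aggregate mining log entries by (tG, tM, user)."""
    groups = defaultdict(list)
    for entry in log:
        key = (frozenset(entry["tG"]), frozenset(entry["tM"]), entry["user"])
        groups[key].append((entry["ms"], entry["mc"]))

    table = {}
    for key, pairs in groups.items():
        table[key] = {
            "count": len(pairs),
            "avg_ms": sum(ms for ms, _ in pairs) / len(pairs),
            "avg_mc": sum(mc for _, mc in pairs) / len(pairs),
            # The minimum (maximum) set is the pair with the smallest (largest)
            # ms, together with its own mc, not the smallest (largest) mc.
            "min_set": min(pairs, key=lambda p: p[0]),
            "max_set": max(pairs, key=lambda p: p[0]),
        }
    return table


# The (ms, mc) pairs of the example in the text, in percent, for one
# hypothetical query used three times by the same user.
log = [
    {"tG": {"CustID"}, "tM": {"Category"}, "user": "A", "ms": 25, "mc": 30},
    {"tG": {"CustID"}, "tM": {"Category"}, "user": "A", "ms": 35, "mc": 25},
    {"tG": {"CustID"}, "tM": {"Category"}, "user": "A", "ms": 45, "mc": 20},
]
table = build_gm_pattern_table(log)
```

For this log the single group has a count of 3, a minimum set of (25, 30) and a maximum set of (45, 20), matching the worked example.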

Definition 3. A GM-Pattern <tG ◊ tM> is any couple of tG and tM in the GM-Pattern table.

For example, in Table 5 GM-Pattern of query no. 10 is <{CustID, Date, Category} ◊

Definition 4. Surrogate GM-Pattern vs. Redundant GM-Pattern: suppose <tGs ◊ tMs> and <tGi ◊ tMi> are two GM-Patterns. <tGs ◊ tMs> is said to be the Surrogate GM-Pattern of <tGi ◊ tMi>, and <tGi ◊ tMi> is the Redundant GM-Pattern, if tGs = tGi, tMs tMi and <tGs ◊ tMs> and <tGi ◊ tMi> overlap with maximum range of ms.

The reason a surrogate GM-Pattern can represent a redundant GM-Pattern is because

