Chapter 4 Incremental Mining Algorithms for Document Classifiers
4.7 Conclusions
This study proposes a domain-space weighting scheme that represents documents in domain-space and incrementally constructs a classifier to resolve the document representation and category adaptation problems. The scheme consists of three major phases: the Training Phase, the Discrimination Phase, and the Tuning Phase. The training algorithm incrementally extracts and weights features from each individual category, and then integrates the results into a feature-domain weighting table. The discrimination algorithm reduces the weights of features with lower discriminating power.
When these algorithms finish constructing the classifier, the tuning algorithm strengthens it using feedback information from tuning documents to reduce the number of false positives. Experiments with the Reuters-21578 benchmark show that with sufficient training documents, the classifier is rather effective and efficient.
Chapter 5
From Incremental Mining to Multidimensional Online Mining for Knowledge Discovery
5.1 Introduction
Incremental mining algorithms are efficient and useful for static mining models, such as mining all the data accumulated thus far or mining only a recently collected portion of data in uncomplicated applications. However, they usually provide little support for user focus (e.g., limiting the computation to what interests the user) and user interaction (e.g., dynamically changing the parameters or constraints), and may therefore produce thousands of rules that are irrelevant and uninteresting to users. Decision-makers, on the other hand, usually consider problems from several aspects: they may need to analyze market demands, customer preferences, localities, and short-term and long-term trends, and they may want to understand how discovered patterns or rules change along different dimensions. Incremental mining algorithms can thus neither flexibly obtain rules or patterns from the portions of data that interest users, nor consider problems from different aspects to provide online decision support.
The following scenarios illustrate why a decision-maker often requires online mining support for association rules.
Scenario 1: A decision-maker already knows which product combinations sold last August were popular, and wants to know which product combinations sold last September were also popular.
Scenario 2: A decision-maker already knows from a transaction database that people often buy beer and diapers together, and wants to know under which contexts (e.g., place, month, or branch) this pattern is significant and under which contexts it becomes insignificant.
Scenario 3: A decision-maker wants to know how the patterns mined this year differ from those mined last year, such as which new patterns appear and which old patterns disappear.
Scenario 4: A marketing analyst wants to analyze the data collected from the branches in Los Angeles and San Francisco in the first quarter of each of the last five years.
Scenario 5: A marketing analyst wants to know which patterns remain significant in the most recent month when the minimum support increases from 5% to 10%.
The examples above all require more context information to describe the problem domain. A mining algorithm that can handle relevant context information in mining requests will thus help decision-makers consider various aspects of problems in diverse ways.
Constraint-based and multidimensional mining techniques [11][15][35][37][48][51][52][65][72], which allow users to specify constraints as guidance, have thus been developed to identify and extract interesting and focused knowledge from a data warehouse or a database. Users can continually express their focus and change not only the parameters but also the constraints during the mining process. For example, Kamber et al. [48] proposed a well-known approach that allows users to specify the predicates that appear in the antecedent and consequent parts of association rules. However, putting all the data gathered in different contexts (such as different branches, time intervals, and regions) together for centralized mining is time-consuming and infeasible for online mining support because of the size of the data; users may have to wait a long time for the mining results.
Unlike constraint-based and multidimensional mining techniques, we extend the concept of effectively utilizing previously discovered patterns in incremental mining to support multidimensional online mining. We first systematically mine rules or patterns from data gathered in different contexts according to pre-defined parameter settings, and forward the rules or patterns, together with the corresponding context information, to a structured repository called the knowledge warehouse for centralized post-mining and refining. We can then efficiently acquire user-interesting or user-focused association rules or patterns by integrating related mining information from the knowledge warehouse, which greatly reduces the cost of mining the underlying data each time.
Consequently, a systematic, automatic, integrated, and on-demand architecture, called the Online Knowledge Discovery System (OKDS), can be developed to help managers and decision-makers consider problems from different aspects and to provide online mining support. The OKDS consists of five major components: knowledge client, knowledge warehouse, knowledge organizer, mining agent, and underlying storage facility. The mining agents systematically and continuously mine potentially useful patterns from each underlying storage facility, the knowledge organizer stores these mined patterns in the knowledge warehouse in a structured way, and users can then apply aggregation and generalization functions in the knowledge client to generate patterns online.
5.2 Related Work
A data warehouse is an integrated, subject-oriented, and nonvolatile data repository containing historical and aggregated data from operational and legacy systems for supporting decision-making processes [17][45][97]. Compared with the routine work of On-Line Transaction Processing (OLTP) in operational databases, the purpose of a data warehouse is to help analysts perform On-Line Analytical Processing (OLAP). Facilitating complex analyses and achieving high query throughput are therefore the most important considerations in a data warehouse. Table 5-1 lists the major differences between the operational database and the data warehouse [17][45]. Because of these differences, data warehouses are usually maintained separately from the organization's operational databases.
Table 5-1: Differences between the operational database and the data warehouse
Aspects | Operational database | Data warehouse
User | Data entry clerk; System designer; System administrator | Decision maker; Knowledge worker; Executives
Function | Daily operations; OLTP | Decision support; OLAP
DB Design | Application oriented | Subject oriented
Data | Current; Up-to-date atomic; Relational (normalized); Isolated | Historical; Summarized; Multidimensional; Integrated
Usage | Repetitive routine | Ad hoc
Access | Read/write; Simple transaction | Read mostly; Complex query
System Requirements | Transaction throughput; Data consistency | Query throughput; Data accuracy
Developing a data warehouse usually involves extracting user-interesting information from each source (operational database) in advance, merging the relevant information, and then loading it into a structured, centralized repository for later analysis.
The data warehouse often adopts a multidimensional data model to prepare the data for analytical processing under multidimensional consideration. The star schema, consisting of a fact table and a set of dimension tables, is the most commonly used form of the multidimensional data model. The fact table contains the measure attributes that interest users, which are the objects of analysis, and key attributes (identifying attributes) that link to each of the related dimension tables. Each dimension table contains additional attributes that further describe the corresponding key attribute in the fact table. The multidimensional data model gives users a clear, multidimensional view of the data, which can be easily accessed by manipulating the dimensions.
5.3 Knowledge Warehouse
To provide efficient online mining, the knowledge warehouse builds on the concept of effectively utilizing previously discovered patterns in incremental mining: to avoid wasting previously mined patterns and to improve rule maintenance performance, incremental mining algorithms keep the mined patterns in storage for later use. To support multidimensional consideration, the knowledge warehouse also borrows the multidimensional data model of the data warehouse, which supports ad hoc queries and decision making through aggregation functions and OLAP operations.
Because the data under decision-support consideration does not evolve arbitrarily (e.g., data in a data warehouse may be inserted or deleted in blocks at monthly intervals [32]), the knowledge warehouse is designed to store, structurally and systematically, the context information and mining information for each inserted block of data. The context information represents the contexts of each individual block of data, which is gathered from a specific business viewpoint such as region, time, and branch. The mining information records what a batch mining algorithm has mined from each individual block of data, such as the number of data records, the number of mined patterns, and the set of previously mined patterns with related information. Conceptually, the knowledge warehouse is similar to the data warehouse for OLAP. Both systematically preprocess the underlying data in advance, integrate related information, and store the results in a centralized structured repository for later use and analysis. However, the knowledge warehouse mainly stores mined patterns at the knowledge level rather than data at the information level. Table 5-2 lists the major differences between the knowledge warehouse and the data warehouse.
Table 5-2: Differences between the knowledge warehouse and the data warehouse
Aspects | Data Warehouse | Knowledge Warehouse
Function | OLAP | Online mining
Data | Historical; Summarized; Multidimensional; Integrated | Mined; Multidimensional
Access | Read mostly; Complex query | Read only; Mining query
System Requirements | Query throughput; Data accuracy | Mining throughput; Knowledge usability
The star schema remains a concise and organized structure for modeling the knowledge warehouse: the context information and the mining information can be represented by dimensions and measures, respectively. Example 5-1 shows a star schema of a knowledge warehouse used to provide online generation of association rules for product sales at a bicycle manufacturer. However, unlike the summarized information in the measure attributes of a data warehouse, the mining information in the knowledge warehouse, such as the mined patterns, cannot always be aggregated directly to satisfy users' mining requests. The major challenge of the knowledge warehouse is therefore how to efficiently aggregate, generalize, and manipulate the mining information. In the next chapter, we design corresponding aggregation and generalization approaches to provide online mining support for association rules.
Example 5-1: Figure 5-1 shows a star schema of the knowledge warehouse for a bicycle manufacturer. It consists of three dimensions (Time, Branch, and Minsup) and three measures (No_Trans, No_Patterns, and Pattern_Set). Time and Branch are nonnumeric dimensions similar to those in a typical data warehouse, whereas Minsup is a numeric dimension indicating the minimum support used for the measures No_Patterns and Pattern_Set. No_Trans is a numeric measure that can be calculated as in a typical data warehouse. No_Patterns is also a numeric measure, but its value is determined by Pattern_Set. Pattern_Set is a set measure: a collection of frequent itemsets, with their supports, mined for the corresponding time and branch and satisfying the minimum support given in the Minsup dimension.
Figure 5-1: An example of the star schema of a knowledge warehouse. The sales fact table (time_key, branch_key, minsup_key, No_Trans, No_Patterns, Pattern_Set) references the Time dimension table (time_key, day, month, year), the Branch dimension table (branch_key, branch_name, region, country), and the Minsup dimension table (minsup_key, minsup_value).
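The star schema of Example 5-1 can be made concrete with a small sketch. The table and column names below follow Figure 5-1, but everything else is an assumption made for illustration: the use of an in-memory SQLite database, the serialization of the Pattern_Set measure as JSON text, and the sample row (branch, date, items, and supports are invented).

```python
import json
import sqlite3

# In-memory database. Column names follow Figure 5-1; the storage choices
# (SQLite, JSON text for the set measure) are assumptions of this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim   (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT, region TEXT, country TEXT);
CREATE TABLE minsup_dim (minsup_key INTEGER PRIMARY KEY, minsup_value REAL);
CREATE TABLE sales_fact (
    time_key    INTEGER REFERENCES time_dim(time_key),
    branch_key  INTEGER REFERENCES branch_dim(branch_key),
    minsup_key  INTEGER REFERENCES minsup_dim(minsup_key),
    no_trans    INTEGER,
    no_patterns INTEGER,
    pattern_set TEXT
);
""")

# Invented sample block: frequent itemsets with their supports.
patterns = [[["frame"], 0.30], [["wheel"], 0.25], [["frame", "wheel"], 0.12]]
conn.execute("INSERT INTO time_dim VALUES (1, 31, 8, 2005)")
conn.execute("INSERT INTO branch_dim VALUES (1, 'Los Angeles', 'West', 'USA')")
conn.execute("INSERT INTO minsup_dim VALUES (1, 0.05)")
conn.execute(
    "INSERT INTO sales_fact VALUES (1, 1, 1, ?, ?, ?)",
    (1000, len(patterns), json.dumps(patterns)),  # no_patterns is derived from the set
)

# A mining query selects fact rows through the dimension tables.
row = conn.execute(
    """
    SELECT f.no_trans, f.no_patterns, f.pattern_set
    FROM sales_fact f
    JOIN time_dim t   ON f.time_key = t.time_key
    JOIN branch_dim b ON f.branch_key = b.branch_key
    JOIN minsup_dim m ON f.minsup_key = m.minsup_key
    WHERE b.branch_name = ? AND t.year = ? AND m.minsup_value <= ?
    """,
    ("Los Angeles", 2005, 0.05),
).fetchone()

no_trans, no_patterns, pattern_json = row
pattern_set = {frozenset(items): support for items, support in json.loads(pattern_json)}
```

In this sketch No_Trans could be summed across rows with an ordinary SQL aggregate, whereas Pattern_Set has to be decoded and combined by application code, which reflects the aggregation challenge noted above.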
5.4 Online Knowledge Discovery System (OKDS)
Based on the proposed knowledge warehouse, a systematic, automatic, integrated, and on-demand architecture, called the Online Knowledge Discovery System (OKDS), can be developed to provide managers and decision-makers with multidimensional online mining support. The OKDS, shown in Figure 5-2, consists of five major components: knowledge client, knowledge warehouse, knowledge organizer, mining agent, and underlying storage facility.
Figure 5-2: The OKDS architecture (components shown: Underlying Storage Facilities 1 to n, Mining Agents 1 to n, the Knowledge Organizer, the Knowledge Warehouse, and Knowledge Clients 1 to m)
Whenever a new block of data is inserted into an underlying storage facility, the corresponding mining agent mines potentially useful patterns from that block as the mining information. The knowledge organizer then stores the mining information, together with the related context information, in the knowledge warehouse, so that users can apply aggregation and generalization functions in the knowledge client to generate patterns online. Conversely, when an old block of data is deleted from an underlying storage facility, the knowledge organizer removes the corresponding context and mining information from the knowledge warehouse.
• Underlying storage facility: An underlying storage facility serves as the material supplier in OKDS, providing underlying, purpose-oriented, and preprocessed data. It can therefore be a data warehouse, a preprocessed database, or a cleaned file.
• Mining agent: Agents often play autonomous, adaptive, and intelligent roles in a distributed system. For example, in an intelligent travel service system, a traveling agent follows the user settings or user profile to collect interesting travel routes and hotel coupons; a scheduling agent follows the user's program, weather predictions, and news to suggest suitable periods for traveling; and a coordinator agent coordinates the traveling and scheduling agents so that users obtain proper travel packages and suitable schedules. Similarly, a mining agent follows the user settings to periodically detect data changes in an underlying storage facility, automatically mine potential patterns from it, and report the results to the knowledge organizer. The knowledge organizer serves as the coordinator agent of the mining agents.
• Knowledge organizer: The knowledge organizer is responsible for periodically maintaining the patterns in the knowledge warehouse. Its work includes collecting the discovered patterns from each mining agent, merging or summarizing them as the mining information, and then storing them, together with the corresponding context information, in the knowledge warehouse. To improve the performance of fulfilling user requests, the knowledge organizer is also responsible for constructing and maintaining the materialized views of the knowledge warehouse.
• Knowledge client: A knowledge client is an interface that receives users' mining requests, translates each mining request into an operating procedure on the knowledge warehouse, and reports the mining results to users. To make the mining results easier to understand and evaluate, it is also responsible for providing visualization services.
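The interaction among the components described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the class and method names (`MiningAgent.on_block_inserted`, `KnowledgeOrganizer.store`, and so on) are hypothetical, the batch miner is a brute-force counter limited to itemsets of size 1 and 2, and the change detection, materialized views, and visualization services mentioned in the text are omitted.

```python
from collections import Counter
from itertools import combinations


def mine_block(transactions, minsup):
    """Brute-force batch miner (assumption): frequent itemsets of size 1 and 2."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in (1, 2):
            counts.update(combinations(items, k))
    n = len(transactions)
    return {iset: c / n for iset, c in counts.items() if c / n >= minsup}


class KnowledgeWarehouse:
    def __init__(self):
        self.entries = {}  # block_id -> (context information, mining information)


class KnowledgeOrganizer:
    def __init__(self, warehouse):
        self.warehouse = warehouse

    def store(self, block_id, context, mining_info):
        self.warehouse.entries[block_id] = (context, mining_info)

    def remove(self, block_id):
        # Called when the corresponding block is deleted from its storage facility.
        self.warehouse.entries.pop(block_id, None)


class MiningAgent:
    def __init__(self, organizer, minsup):
        self.organizer = organizer
        self.minsup = minsup

    def on_block_inserted(self, block_id, context, transactions):
        info = {
            "no_trans": len(transactions),
            "pattern_set": mine_block(transactions, self.minsup),
        }
        self.organizer.store(block_id, context, info)


class KnowledgeClient:
    def __init__(self, warehouse):
        self.warehouse = warehouse

    def blocks_matching(self, **constraints):
        """Return the IDs of blocks whose context satisfies every constraint."""
        return sorted(
            bid
            for bid, (ctx, _) in self.warehouse.entries.items()
            if all(ctx.get(k) == v for k, v in constraints.items())
        )


warehouse = KnowledgeWarehouse()
organizer = KnowledgeOrganizer(warehouse)
agent = MiningAgent(organizer, minsup=0.5)
agent.on_block_inserted(
    "LA-08", {"branch": "LA", "month": 8},
    [["beer", "diapers"], ["beer"], ["beer", "diapers"], ["milk"]],
)
agent.on_block_inserted("SF-08", {"branch": "SF", "month": 8}, [["milk"], ["milk", "beer"]])

client = KnowledgeClient(warehouse)
la_blocks = client.blocks_matching(branch="LA")
organizer.remove("SF-08")  # the SF block was deleted from its storage facility
```

Keeping the warehouse as a passive store that only the organizer writes to mirrors the division of responsibilities in the list above; the client only reads.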
Chapter 6
Multidimensional Online Mining Algorithms for Generation of Association Rules
6.1 Introduction
Previous work on mining association rules can be classified into batch mining [5][16][40][53][61][67][78][95] and incremental mining [8][20][21][27][43][77][85] approaches according to their processing procedures. Most of these approaches focus on finding association rules in specified parts of databases that satisfy the user-specified minimum support and minimum confidence [11][15][37][52][65]. Contexts (circumstances) such as region, time, and branch are usually ignored in mining requests. As a result, these approaches usually cannot flexibly obtain association rules from portions of data, consider problems from diverse aspects, or provide online decision support for users.
To provide ad-hoc, query-driven and online mining support for generation of association rules, we first propose a relation called the Multidimensional Pattern Relation (MPR) as a form of knowledge warehouse to structurally and systematically store context and mining information for later analysis [90][92]. We then develop an online mining approach called Three-phase Online Association Rule Mining (TOARM) based on this proposed MPR to support online generation of association rules under multidimensional considerations. The TOARM approach consists of three phases, candidate itemset generation, candidate itemset reduction, and association rule generation, during which final sets of patterns satisfying various mining requests
are found. The candidate itemset generation phase selects tuples that satisfy the context constraints in mining requests and generates candidate itemsets from the matching tuples. The candidate itemset reduction phase then calculates upper-bound supports for the candidate itemsets and uses two pruning strategies to reduce the number of candidates. Finally, the association rule generation phase finds final frequent itemsets and derives association rules from them.
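The first two phases can be illustrated with a small sketch. It is not the TOARM algorithm itself: the upper-bound rule used here (an itemset missing from a block's pattern set is assumed to have a count of at most that block's minimum support times its number of transactions) and the single pruning rule are assumptions chosen for illustration, the two pruning strategies mentioned above are not reproduced, and rule derivation is omitted. All function names, field names, and data are hypothetical.

```python
from fractions import Fraction

# Hypothetical stored tuples: context attributes plus mining information.
# Pattern sets hold itemset -> support count; Fractions keep arithmetic exact.
A, B, AB = frozenset("A"), frozenset("B"), frozenset("AB")
mpr = [
    {"id": 1, "branch": "LA", "month": 8, "no_trans": 100,
     "minsup": Fraction(5, 100), "ps": {A: 40, B: 30, AB: 10}},
    {"id": 2, "branch": "LA", "month": 9, "no_trans": 100,
     "minsup": Fraction(5, 100), "ps": {A: 50, B: 6}},
    {"id": 3, "branch": "SF", "month": 8, "no_trans": 200,
     "minsup": Fraction(5, 100), "ps": {A: 20}},
]


def select_tuples(relation, **context):
    """Phase 1a: keep the tuples that satisfy the context constraints."""
    return [t for t in relation if all(t[k] == v for k, v in context.items())]


def generate_candidates(tuples):
    """Phase 1b: every itemset stored in a matching tuple is a candidate."""
    candidates = set()
    for t in tuples:
        candidates.update(t["ps"])
    return candidates


def reduce_candidates(tuples, candidates, minsup):
    """Phase 2 (assumed rule): prune candidates whose upper-bound support is below minsup.

    Returns (exact, undetermined): itemsets whose support is known exactly,
    and itemsets that survive pruning but still need verification.
    """
    total = sum(t["no_trans"] for t in tuples)
    exact, undetermined = {}, {}
    for c in candidates:
        upper = Fraction(0)
        known_everywhere = True
        for t in tuples:
            if c in t["ps"]:
                upper += t["ps"][c]
            else:
                # Not frequent in this block, so its count is below minsup * no_trans.
                upper += t["minsup"] * t["no_trans"]
                known_everywhere = False
        upper_support = upper / total
        if upper_support < minsup:
            continue  # cannot be frequent under this request
        (exact if known_everywhere else undetermined)[c] = upper_support
    return exact, undetermined


matched = select_tuples(mpr, branch="LA")
candidates = generate_candidates(matched)
exact, undetermined = reduce_candidates(matched, candidates, Fraction(6, 100))
exact_strict, undetermined_strict = reduce_candidates(matched, candidates, Fraction(8, 100))
```

With a requested minimum support of 6%, the itemset {A, B} survives with an upper-bound support of 15/200 and remains undetermined; at 8% it is pruned without touching the underlying data.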
6.2 Related Work
Recently, researchers have developed online mining algorithms to obtain required sets of association rules without re-processing the entire database whenever user-specified thresholds are changed. Examples are the OLAP-style algorithm proposed by Aggarwal and Yu [1] and the Carma algorithm proposed by Hidber [41].
The OLAP-style algorithm is quite similar to a typical incremental mining algorithm that utilizes previously mined patterns to save on I/O and computation. It first stores primary itemsets based on a low minimum support criterion in a latticed data structure, and then responds to users’ queries with higher minimum support criteria by processing the lattice. It thus preprocesses the data just once, but can efficiently handle multiple user queries. The Carma algorithm attempts to provide intermediate results as feedback to users while databases or minimum support thresholds are being changed. Users are thus able to dynamically adjust thresholds according to intermediate results. The Carma algorithm uses two runs. During the first run, it constructs a lattice composed of all potential frequent itemsets from the transactions.
Each itemset in the lattice uses a lower bound and an upper bound to record its possible support range. When a mining request is input, itemsets in the lattice whose support ranges cover or are larger than the new minimum support threshold are output
to the second run. During the second run, the Carma algorithm finds the precise support for each itemset from the first run to determine whether it is truly large.
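The idea behind the OLAP-style approach, preprocessing once at a low minimum support and then answering queries with higher thresholds from the stored itemsets, can be sketched as follows. This is a simplification under stated assumptions: a list sorted by support stands in for the lattice structure described above, and the class name, method names, and supports are invented for illustration.

```python
class PreminedItemsets:
    """Stores itemsets mined once at a low 'primary' minimum support."""

    def __init__(self, supports, primary_minsup):
        self.primary_minsup = primary_minsup
        # Keep only itemsets meeting the primary threshold, highest support first.
        self._sorted = sorted(
            ((s, tuple(sorted(items))) for items, s in supports.items()
             if s >= primary_minsup),
            reverse=True,
        )

    def query(self, minsup):
        """Answer a query without rescanning the data, if minsup is high enough."""
        if minsup < self.primary_minsup:
            raise ValueError("threshold below the primary minimum support; "
                             "the stored itemsets cannot answer this query")
        result = []
        for support, items in self._sorted:
            if support < minsup:
                break  # everything after this point has lower support
            result.append((items, support))
        return result


store = PreminedItemsets(
    {("beer",): 0.6, ("diapers",): 0.4, ("beer", "diapers"): 0.3, ("milk",): 0.1},
    primary_minsup=0.1,
)
answer = store.query(0.35)
```

A query below the primary threshold is rejected rather than answered, because itemsets that were never stored may be frequent at that lower threshold.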
Many large organizations also have multiple databases distributed across different branches. Traditional data mining algorithms may put all the data from these databases into a common repository for centralized analysis, which causes two problems: the collected data may be too large to handle, and some useful rules or patterns specific to local databases may be lost. As a result, multi-database mining has recently been recognized as an important research topic, and several studies [50][98][105] on mining association rules over multiple databases have been proposed. These approaches mine rules or patterns at the individual databases and then gather the mined results.
These online mining and multi-database mining approaches do not, however, maintain a repository to systematically and structurally store the mining information and related context information for later flexible analysis.
6.3 Multidimensional Pattern Relation (MPR)
In this section, we formally define the Multidimensional Pattern Relation (MPR) for storing context information and mining information for later analysis. First, a relation schema R, denoted by R(A1, A2, …, An), is made up of a relation name R and a list of attributes A1, A2, …, An. Each attribute Ai is associated with a set of attribute values, called the domain of Ai and denoted by dom(Ai). A relation r of the relation schema R(A1, A2, …, An) is a set of tuples {t1, t2, …, tm}. Each tuple ti is an ordered list of n values <vi1, vi2, …, vin>, where each value vij is an element of dom(Aj).
A Multidimensional Pattern Relation Schema (MPRS) is a special relation schema for storing mining information. An MPRS consists of three types of attributes:
identification (ID), context, and content. There is only one identification attribute for an MPRS. It is used to uniquely label tuples. Context attributes describe the contexts (circumstances) of an individual data block, gathered together from a specific business viewpoint. Examples of context attributes are region, time, and branch.
Content attributes describe available mining information discovered from each individual data block by a batch mining algorithm. Examples of content attributes are number of transactions, number of mined patterns, and the set of previously mined frequent itemsets with supports.
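The three attribute types can be pictured with a minimal sketch. The specific context attributes (region, branch, time) and content attributes follow the examples in the text; the class names, field types, and the two integrity checks are assumptions made for illustration rather than part of the formal definition.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

CONTEXT_ATTRS = ("region", "branch", "time")
CONTENT_ATTRS = ("no_trans", "no_patterns", "pattern_set")


@dataclass(frozen=True)
class MPRTuple:
    id: int                                    # identification attribute
    region: str                                # context attributes
    branch: str
    time: str
    no_trans: int                              # content attributes
    no_patterns: int
    pattern_set: Dict[FrozenSet[str], float]   # frequent itemset -> support


class MPR:
    """A relation over the schema above; the ID uniquely labels each tuple."""

    def __init__(self):
        self._tuples: Dict[int, MPRTuple] = {}

    def insert(self, t: MPRTuple) -> None:
        if t.id in self._tuples:
            raise ValueError(f"duplicate ID {t.id}")
        if t.no_patterns != len(t.pattern_set):
            raise ValueError("no_patterns must equal the size of pattern_set")
        self._tuples[t.id] = t

    def context_of(self, tuple_id: int) -> Dict[str, str]:
        t = self._tuples[tuple_id]
        return {a: getattr(t, a) for a in CONTEXT_ATTRS}


mpr = MPR()
mpr.insert(MPRTuple(
    id=1, region="CA", branch="Los Angeles", time="2005-08",
    no_trans=1000, no_patterns=2,
    pattern_set={frozenset({"A"}): 0.30, frozenset({"A", "B"}): 0.12},
))
context = mpr.context_of(1)
```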
The set of all patterns, with supports, previously mined from an individual data block is called a pattern set (ps) in this study. Assume the minimum support is s and l frequent itemsets are discovered in a data block. A pattern set can be represented as ps