Data Mining

(1)

Han-Ming Wu (吳漢銘), Department of Statistics, National Chengchi University

Introduction to Data Mining

Data Mining

C00-2

(2)

Why Use R as a Data Mining Tool?

 Why R?

 R is a high-quality, cross-platform, flexible, widely used, free and open source language for statistics, graphics, mathematics, and data science.

 R offers more than 5,000 algorithms and has millions of users with domain knowledge worldwide.

 R has three shortcomings:

 It is memory bound: the entire dataset must be held in memory (RAM) to achieve high performance, an approach also called in-memory analytics.

 Packages contributed by the R community can be bug-prone and need more testing to ensure code quality.

 R can be slower than some commercial languages.

 Fortunately, there are packages available to overcome these problems.
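The memory-bound point can be checked directly in an R session. The following is a minimal sketch using only base R; the helper `make_data()` and the row counts are made up for illustration and are not part of the slides.

```r
# Hypothetical helper: build a data frame with n rows (illustration only)
make_data <- function(n) {
  data.frame(id = seq_len(n), x = numeric(n), y = numeric(n))
}

small <- make_data(1e5)
large <- make_data(2e5)

# object.size() estimates how much RAM an object occupies
size_small <- as.numeric(object.size(small))
size_large <- as.numeric(object.size(large))

print(format(object.size(small), units = "MB"))
print(format(object.size(large), units = "MB"))

# Memory use grows roughly in proportion to the number of rows,
# so a dataset that outgrows RAM cannot be analyzed this way
print(size_large / size_small)
```

Because the whole object lives in RAM, doubling the rows roughly doubles the footprint.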


Source: http://www.kdnuggets.com/polls/2015/analytics-data-mining-data-science-software-used.html

Six of the Best Open Source Data Mining Tools

http://thenewstack.io/six-of-the-best-open-source-data-mining-tools/

40 Top Free Data Mining Software

http://www.predictiveanalyticstoday.com/top-free-data-mining-software/

10 Best Data Mining Software For Better Analysis

http://digital.guide/10-best-data-mining-software-better-analysis/14362/?v=1

(3)

References / Learning Websites

http://www.rdatamining.com/

R Package

"DMwR"

(4)

Learning Resources

 27 Free Data Mining Books

http://www.dataonfocus.com/21-free-data-mining-books/

 Introduction to Data Mining

http://www-users.cs.umn.edu/~kumar/dmbook/index.php

 Data Mining, Analytics, Big Data, and Data Science

http://www.kdnuggets.com/


(5)

Some free online data sources

Some free online data sources particularly helpful to learn about data mining

 Frequent Itemset Mining Dataset Repository:

http://fimi.ua.ac.be/data/

 UCI Machine Learning Repository:

http://archive.ics.uci.edu/ml/

 The Data and Story Library at statlib:

http://lib.stat.cmu.edu/DASL/

 WordNet:

http://wordnet.princeton.edu


(6)

Rattle: A Graphical User Interface for Data Mining using R

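> install.packages("rattle")        # install Rattle from CRAN

> library(rattle)                   # load the package

> rattleInfo()                      # report version information for Rattle and its dependencies

> install.packages(rattleInfo())    # install the packages that rattleInfo() returns

> rattle()                          # launch the Rattle GUI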

http://rattle.togaware.com/

(7)

SAS Enterprise Miner

http://www.sas.com/zh_tw/software/analytics/enterprise-miner.html

(8)

History

 The term "Data Mining" appeared around 1990 in the database community.

 Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" for the first workshop on the topic (KDD-1989), and this term became more popular in the AI and machine learning communities.

 However, the term "data mining" became more popular in the business and press communities.

 Currently, "data mining" and "knowledge discovery" are used interchangeably.

 Since about 2007 the term "Predictive Analytics", and since 2011 the term "Data Science", have also been used to describe this field.


Source: History of data mining

http://rayli.net/blog/data/history-of-data-mining/

Frans Coenen, "Data Mining: Past, Present and Future", The Knowledge Engineering Review, Vol. 00:0, 1–24, 2004, Cambridge University Press

(9)

What is Data Mining? (1/4)

 "Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules." (M. J. A. Berry and G. S. Linoff)

 "Data mining is finding interesting structure (patterns, statistical models, relationships) in databases." (U. Fayyad, S. Chaudhuri and P. Bradley)

 "Data mining is the application of statistics in the form of exploratory data analysis and predictive models to reveal patterns and trends in very large data sets." ("Insightful Miner 3.0 User Guide")

 Data mining is also known as Knowledge Discovery in Databases (KDD).


(10)

 Non-trivial extraction of implicit, previously unknown and potentially useful information from data.

 Generally, data mining is the process of analyzing data from different perspectives and summarizing it into useful information.

 Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

 Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis.


(11)

 Data mining is a process used by companies to turn raw data into useful information.

 Data mining depends on effective data collection and warehousing as well as computer processing.

 Data mining is the computational process of discovering patterns in large data sets ("big data") involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

 The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.


(12)

 Data mining is sifting through very large amounts of data for useful information.

 Data mining uses artificial intelligence techniques, neural networks, and advanced statistical tools (such as cluster analysis) to reveal trends, patterns, and relationships, which might otherwise have remained undetected.

 Data mining requires a class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior.


(13)

Data Mining

 Data mining derives its name from the similarities between searching for valuable information in a large database and mining rock for a vein of valuable ore.

 Mining for gold in rock is called "gold mining", not "rock mining".

 By the same logic, data mining should have been called "knowledge mining".

 Other similar terms referring to data mining are: data dredging, knowledge extraction and pattern discovery.

 Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.


(14)

Data Mining Diagrams

Source: Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Source: http://blogs.sas.com/content/subconsciousmusings/files/2014/08/data-mining-Venn-diagram.png

(15)

Data Mining Diagrams

Source: "Language Technologies for Geomatics: From Intelligence to Agility", published Nov 26, 2014, in Technology

http://www.slideshare.net/VisionGEOMATIQUE2014/gagnon-20141112vision

Source: http://www.simplilearn.com/data-mining-vs-statistics-article

(16)

What are Data Mining and Knowledge Discovery?

Source: Osmar R. Zaïane, 1999, CMPUT690 Principles of Knowledge Discovery in Databases

Examples of data mining:

https://en.wikipedia.org/wiki/Examples_of_data_mining

(17)

The Data Mining Process: CRISP-DM

 Business understanding: determining business objectives, establishing data mining goals, and developing a plan.

 Data understanding: initial data collection, data description, data exploration, and the verification of data quality.

 Data preparation: the data needs to be selected, cleaned, and then built into the desired form and format.

 Modeling: techniques such as visualization, cluster analysis, and association rules are applied to discover knowledge, for example in the form of rules.

 Evaluation: the results should be evaluated in the context specified by the business objectives in the first step. This leads to the identification of new needs and in turn reverts to the prior phases in most cases.

 Deployment: data mining can be used to both verify previously held hypotheses or for


Cross-Industry Standard Process for Data Mining (CRISP-DM)
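The data understanding and data preparation phases can be illustrated with a few lines of R. This is a minimal sketch, assuming the built-in `airquality` data set as a stand-in for project data (it is not mentioned in the slides). The choice of columns and the decision to drop incomplete rows are illustrative only.

```r
# Data understanding: describe the data and verify its quality
str(airquality)
missing_per_column <- colSums(is.na(airquality))
print(missing_per_column)

# Data preparation: select, clean, and format the data
prepared <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")]
prepared <- na.omit(prepared)              # drop rows with missing values
prepared$Month <- factor(prepared$Month)   # treat month as a categorical variable

print(dim(prepared))
```

In a real project the cleaning rule (dropping rows, imputing values, and so on) would follow from the business objectives set in the first phase.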

(18)

The Data Mining Process: SEMMA

 Sample: a portion of a large dataset is extracted.

 Explore: unanticipated trends and anomalies are searched for, to gain a better understanding of the dataset.

 Modify: variables are created, selected, and transformed to focus the model construction process.

 Model: various combinations of models are searched to predict a desired outcome.

 Assess: the findings from the data mining process are evaluated for their usefulness and reliability.


Source: Bater Makhabel, 2014, Learning Data Mining with R, Packt Publishing, (December 22, 2014).

SEMMA (Sample, Explore, Modify, Model, Assess) was developed by the SAS Institute, USA.
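The five SEMMA steps can be walked through on a small example. The following is a sketch in base R, assuming the built-in `iris` data, a 70/30 split, and a linear regression as the model. All of these choices are illustrative and not prescribed by SEMMA or by SAS.

```r
set.seed(1)  # make the random sample reproducible

# Sample: extract a portion of the data (70% for modeling, the rest held out)
idx   <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Explore: summary statistics to spot unexpected values
print(summary(train))

# Modify: create a new variable for the model
train$Petal.Area <- train$Petal.Length * train$Petal.Width
test$Petal.Area  <- test$Petal.Length  * test$Petal.Width

# Model: linear regression predicting sepal length
fit <- lm(Sepal.Length ~ Petal.Area + Sepal.Width, data = train)

# Assess: root mean squared error on the held-out rows
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$Sepal.Length - pred)^2))
print(rmse)
```

Assessing on rows that were not used for fitting gives a more reliable picture of how useful the model is than the fit on the training rows alone.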

(19)

Data Mining Tasks

 There are two types of data mining tasks:

 descriptive data mining tasks, which describe the general properties of the existing data, and

 predictive data mining tasks, which attempt to make predictions by inference from the available data.

 The six common DM tasks: description, estimation, prediction, classification, clustering, and association.
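The difference between a descriptive and a predictive task can be seen side by side. This is a minimal sketch using the built-in `mtcars` data; the data set, the variables, and the hypothetical car `new_car` are assumptions made for illustration.

```r
# Descriptive task: describe general properties of the existing data
avg_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(avg_mpg_by_cyl)

# Predictive task: infer a relationship, then predict an unseen case
fit <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3.0)   # hypothetical car; wt is in units of 1000 lbs
predicted_mpg <- predict(fit, newdata = new_car)
print(predicted_mpg)
```

The first result only summarizes cars already in the data, while the second makes a statement about a car that is not in it.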


(20)

Data Mining Tasks

 Data processing: cleaning, integration, reduction, transformation, feature extraction (dimension reduction), discretization, missing values, outlier detection.

 Data exploration: summary, visualization, correlation analysis.

 Regression: linear, polynomial, lasso, logistic regression, nonlinear regression, regression tree.

 Classification: nearest neighbor, linear discriminant analysis, decision tree, naive Bayes classifier, CART, random forest, SVM, artificial neural networks, ...

 Cluster analysis: k-means, hierarchical clustering, PAM, model-based, ...

 Link analysis: association rule discovery.

 Model evaluation: variable selection, cross-validation.

 Better Modelling: ensembles (bagging, boosting), kernel methods, ...

 Applications and Beyond:

 Data types: text and web data, stream, time series and sequence data, network data.

 Big data
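Of the tasks listed above, model evaluation by cross-validation is easy to sketch in base R. The example below assumes 5 folds, the built-in `mtcars` data, and two candidate regressions; the helper `cv_rmse()` is hypothetical and written only for this illustration.

```r
set.seed(42)  # reproducible fold assignment

k <- 5
# Assign each row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Hypothetical helper: k-fold cross-validated RMSE for a model of mpg
cv_rmse <- function(formula) {
  errs <- sapply(seq_len(k), function(i) {
    train <- mtcars[folds != i, ]
    test  <- mtcars[folds == i, ]
    fit   <- lm(formula, data = train)       # fit on k - 1 folds
    pred  <- predict(fit, newdata = test)    # predict the held-out fold
    sqrt(mean((test$mpg - pred)^2))
  })
  mean(errs)
}

rmse_linear <- cv_rmse(mpg ~ wt)            # linear regression
rmse_poly   <- cv_rmse(mpg ~ poly(wt, 2))   # polynomial regression

print(c(linear = rmse_linear, polynomial = rmse_poly))
```

The model with the lower cross-validated error is the better candidate, because every prediction is made on rows the model did not see during fitting.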


(21)

Six Common Classes of Tasks

 Summarization: providing a more compact representation of the data set, including visualization and report generation.

 Anomaly detection (outlier/change/deviation detection): identifying unusual data records that might be interesting or might be data errors.

 Regression: finding a function that models the data with the least error.

 Classification: generalizing known structure to apply to new data.

 Clustering: discovering groups and structures in the data that are in some way "similar", without using known structures in the data.

 Association rule learning (dependency modelling): searching for relationships between variables.
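To make the clustering task concrete, here is a minimal sketch that assumes base R's `kmeans()` and the built-in `iris` data, neither of which is mentioned in the slides. The species labels are not used to form the clusters; they are used only afterwards to inspect the groups that were found.

```r
set.seed(123)  # k-means starts from random centers

# Use only the four numeric measurements, standardized to equal scale
features <- scale(iris[, 1:4])

# Clustering: find 3 groups without using the known species labels
km <- kmeans(features, centers = 3, nstart = 20)

# Compare the discovered groups with the known structure after the fact
print(table(cluster = km$cluster, species = iris$Species))
```

A classifier, by contrast, would be trained on the `Species` column and then applied to new flowers.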


(22)

Association Rule Discovery

 Given a set of records, each of which contains some number of items from a given collection,

 produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
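The strength of a candidate rule is commonly measured by its support and confidence. The sketch below computes both in base R for a toy set of market-basket records. The transactions and the helper functions `support()` and `confidence()` are made up for illustration and are not taken from the slide.

```r
# Toy records: each is a set of items from a given collection (illustrative)
transactions <- list(
  c("bread", "milk"),
  c("bread", "diaper", "beer", "eggs"),
  c("milk", "diaper", "beer", "coke"),
  c("bread", "milk", "diaper", "beer"),
  c("bread", "milk", "diaper", "coke")
)

# Support: fraction of records that contain all the given items
support <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

# Confidence of the rule lhs -> rhs: how often rhs occurs when lhs occurs
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

s    <- support(c("diaper", "beer"))   # 3 of 5 records: 0.6
conf <- confidence("diaper", "beer")   # 0.6 / 0.8 = 0.75

print(c(support = s, confidence = conf))
```

Read as a dependency rule, {diaper} -> {beer} says that 75% of the records containing diapers also contain beer.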

Source: Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

(23)

What is Statistics?

 Statistics is a component of data mining that provides the tools and analytic techniques for dealing with large amounts of data.

 It is the science of learning from data and includes everything from collecting and organizing to analyzing and presenting data.

 It is concerned with probabilistic models, specifically inference, using data.

 While the aims of statistics and data mining are similar, it is estimated that there are too few statisticians to meet the demand for data analysis.

 Statisticians were the first to use the term data mining.


(24)

Statistics and Data Mining (1/4)

 Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. It serves as the foundation of data mining.

 Originally, "data mining" was a derogatory term referring to attempts to extract information that was not supported by the data.

 To some extent, data mining constructs statistical models, that is, models of an underlying distribution, which are then used to visualize the data.

 Data mining has an inherent relationship with statistics: statistics is one of its mathematical foundations, and many statistical models are used in data mining.

 Statistical methods can be used to summarize a collection of data and can also be used to verify data mining results.
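Both uses of statistical methods, summarizing data and verifying a result, can be shown in a few lines. This is a minimal sketch assuming the built-in `mtcars` data and a Welch two-sample t-test as the verification step; the pattern being checked (a difference in fuel economy between transmission types) is chosen only for illustration.

```r
# Summarize: descriptive statistics of fuel economy by transmission type
# (am = 0 is automatic, am = 1 is manual)
print(tapply(mtcars$mpg, mtcars$am, summary))

# Verify: is the apparent difference between the two groups larger
# than random variation alone would explain?
test_result <- t.test(mpg ~ am, data = mtcars)
print(test_result$p.value)
```

A small p-value suggests that the pattern is unlikely to be random noise, which is the kind of check that can be applied to patterns found by a data mining algorithm.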


(25)

 There is a great deal of overlap between data mining and statistics.

 Most of the techniques used in data mining can be placed in a statistical framework. However, data mining techniques are not the same as traditional statistical techniques.

 Traditional statistical methods, in general, require a great deal of user interaction in order to validate the correctness of a model. As a result, statistical methods can be difficult to automate.

 Moreover, statistical methods typically do not scale well to very large data sets.

 Statistical methods rely on testing hypotheses or finding correlations based on smaller, representative samples of a larger population.

 Data mining methods are suitable for large data sets and can be more readily automated.

 In fact, data mining algorithms often require large data sets for the creation of quality models.


(26)

 Both data mining and statistics are concerned with learning from data. Both are about discovering and identifying structure in data, with the aim of turning data into information.

 Although their aims overlap, the two take different approaches.

 Statistics is only about quantifying data. While it uses tools to find relevant properties of data, it is a lot like math. It provides the tools necessary for data mining.

 Data mining, on the other hand, builds models to detect patterns and relationships in data, particularly in large databases.


(27)

 Gregory Piatetsky-Shapiro: Statistics is at the core of data mining, helping to distinguish between random noise and significant findings, and providing a theory for estimating probabilities of predictions, etc.

 However, data mining is more than statistics.

 DM covers the entire process of data analysis, including data cleaning and preparation and visualization of the results, and how to produce predictions in real-time, etc.


Source: http://www.kdnuggets.com/faq/difference-data-mining-statistics.html

[Wiki] Gregory I. Piatetsky-Shapiro (born 7 April 1958) is a data scientist, co-founder of the KDD conferences and the ACM SIGKDD association for Knowledge Discovery and Data Mining, and President of KDnuggets, a leading site on Business Analytics, Data Mining, and Data Science. For simplicity, he usually abbreviates his name as Gregory Piatetsky.

Further reading: Friedman, J. H. "Data Mining and Statistics: What's the Connection?" (Nov. 1997b).

(28)

Statistics, Data Mining and Big Data


Source: http://www.theusrus.de/blog/some-truth-about-big-data/

(29)

Data Mining and Statistics: What is the Connection?

 The field of data mining, like statistics, concerns itself with "learning from data" or "turning data into information".

 It is important to note that data mining can learn from statistics: to a large extent, statistics is fundamental to what data mining is really trying to achieve.

However,

 most data miners tend to be ignorant of statistics and client's domain;

 statisticians tend to be ignorant of data mining and client's domain; and

 clients tend to be ignorant of data mining and statistics.

 computer scientists focus upon database manipulations and processing algorithms;

 statisticians focus upon identifying and handling uncertainties; and

 clients focus upon integrating knowledge into the knowledge domain.

 Moreover, most data miners and statisticians continue to sarcastically criticise each other.

 Data mining and statistics will inevitably grow toward each other in the near future, because data mining will not become knowledge discovery without statistical thinking, and statistics will not be able to succeed on massive and complex datasets without data mining approaches.


(30)

Challenges/Trends of Data Mining

Challenges

 Scalability, Dimensionality, Complex and Heterogeneous Data

 Data Quality

 Data Ownership and Distribution (e.g., skewed distribution)

 Privacy Preservation

 New application-based requirements: Streaming Data

Trends

 Application Exploration

 Scalable and interactive data mining methods

 Visual data mining

 New methods of mining complex types of data

 Biological data mining

 Data mining and software engineering

 Web mining

 Distributed data mining

 Real-time data mining

 Multi database data mining

 Privacy protection and information security in data mining.


Source: http://www.simplilearn.com/data-mining-vs-statistics-article