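CRM & Data Mining

Data, Customers, and Needs

Understand customer needs and interact with customers
• The hairdresser and the auntie who braids hair
• The liquor retailer and the neighborhood corner store
• Large companies: one-to-one marketing
• Personalization vs. professionalism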

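Data warehousing and corporate memory
• OLAP & DW
• Films: "Enemy of the State", "Catch Me If You Can"
• Example: the complete process of a personal purchase plus the shipping process (UPS)

Data mining and business intelligence
• The original use of the data vs. business intelligence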

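Data: a valuable corporate asset

[Figure: data → mining → information → mining → knowledge; labels: data warehouse, corporate memory, corporate intelligence]

Data explosion problem! Starving for knowledge!

Business Intelligence (BI)

[Figure: BI architecture. Data sources → data transformation tools (Extract, Transform, Load; run by the data administrator / MIS) → data warehouse / data mart, with metadata ("data that describes data") and templates → data analysis (data mining, OLAP) → users (decision making, CRM, marketing campaigns)]

• Collect data: operational data, market survey data, fixed-panel tracking
• Manage data: ETL and data warehousing
• Extract intelligence from data: data mining, OLAP, statistics
• Apply intelligence: marketing strategy, executive decisions, interactive CRM mechanisms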

Data mining and business intelligence

[Figure: BI pyramid. The potential to support business decisions increases from bottom to top; the responsible role is given in parentheses.]
• Making decisions (end user)
• Data presentation: visualization techniques (business analyst)
• Data mining: information discovery (data analyst)
• Data exploration: OLAP, MDA, statistical analysis, querying and reporting (data analyst)
• Data warehouses / data marts (DBA)
• Data sources: paper, files, information providers, database systems, OLTP (DBA)

From BI to CRM

[Figure: CRM architecture. Internal and external data pass through a staging repository for loading and indexing into the data warehouse and data marts. Decision support (OLAP) and data mining serve campaign analysis, customer segmentation, channel analysis, cross-sell analysis, churn analysis, target marketing, and budgeting/profitability projections. Campaign management / sales automation covers list selection, campaign planning, fulfillment, tracking, and response/interactive. Interactive channels: call center, Internet, direct mail, agent, e-mail, VRU.]

GENUINE CRM

[Figure: inputs (iRate (Internet Rating), ISS (I-Survey System), marketing campaigns, membership data, product data, order records, website log files, new members, new products, search) feed data warehousing, data marts, and data mining for one-to-one delivery over fax, call center, DM, e-mail, AD system, EC system, content, PDA (XML), mobile (PMML), and set-top box.]

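Data Mining: Introduction

Background
• [Data explosion] Managing information overload, insufficient structure, and information confusion and misuse
• [Necessity] Management problems are highly complex, and real-time decision analysis is increasingly important
⇒ [Solution] On-line analytical processing (OLAP), data warehousing, and data mining

Purpose
• Make effective use of collected information on markets, customers, suppliers, competitors, and future trends (note: domain dependent)
• Enable companies to extract useful knowledge from historical data (vs. streaming data) through effective methods and techniques
⇒ The goal of data mining is to help a company understand its customers better, so as to improve its performance in marketing, sales, and customer service operations (DM in CRM)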

Evolution of database technology

1960s: data collection
• Data collection, database creation, information management systems, and network DBMS

1970s: databases
• Relational data model, relational DBMS implementation

1980s: advanced databases
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.), and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s-2000s: data mining
• Data mining and data warehousing, multimedia databases, and Web databases

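What is data mining?

Data mining (KDD)
• Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
• A step in the KDD process: running mining algorithms to produce the desired patterns
⇒ Data mining: using statistics, machine learning, and database techniques, with automatic or semi-automatic tools, to explore and analyze large volumes of data in order to discover meaningful patterns and rules

Principles
• Main approaches: databases, data visualization, statistics, machine learning, etc.
• Related techniques: neural networks, fuzzy logic, genetic algorithms, genetic programming, case-based reasoning, rule-based reasoning, statistical regression, etc.
• Knowledge representation: decision trees, rules, quantitative mathematical formulas, black-box formulas, etc.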

Data mining: the KDD process

KDD: Knowledge Discovery in Databases
• Data mining is the core of the knowledge discovery process.

[Figure: databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation]

KDD process (interactive and iterative)
• Learn the application domain (relevant prior knowledge and the goals of the application)

Steps
• Data selection: create a target data set
• Data cleaning and preprocessing: may take 60% of the effort!
• Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
• Data mining: (1) choose the function (summarization, classification, clustering, regression, association), (2) choose the algorithms, (3) search for patterns of interest
• Pattern evaluation and knowledge presentation: visualization, transformation

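Applications

Database analysis and decision support
• Market analysis and management
  - Target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
• Risk analysis and management
  - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
• Fraud detection and management
  - Preventing credit card fraud (fraudulent credit card charges in Taiwan reach NT$3 billion per year)
    - A neural network model can identify the behavior patterns of fraudulent use
    - (Before the fraudulent charges there is a transaction in which the card pays for a public phone call lasting less than one minute)

Other applications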

Market analysis and management (I)

Data sources for analysis
• Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing
• Find clusters of "model" customers who share the same characteristics (interests, income level, spending habits)

Determine customer purchasing patterns over time
• Conversion of a single to a joint bank account (marriage)

Cross-market analysis
• Associations/correlations between product sales
• Prediction based on the association information

Market analysis and management (II)

Customer profiling (clustering/classification)
• Data mining can tell you what types of customers buy what products

Identifying customer requirements
• Identifying the best products for different customers
• Use prediction to find what factors will attract new customers

Providing summary information
• Various multidimensional summary reports
• Statistical summary information (central tendency and variation of the data)

Risk analysis and management

Finance planning and asset evaluation
• Cash flow analysis and prediction
• Contingent claim analysis to evaluate assets
• Cross-sectional and time-series analysis (financial ratios, trend analysis, etc.)

Resource planning
• Summarize and compare resources and spending

Competition
• Monitor competitors and market directions
• Group customers into classes and apply a class-based pricing procedure
• Set pricing strategy in a highly competitive market

Fraud detection and unusual patterns

Auto insurance: detect groups of people who stage accidents to collect on insurance
Money laundering: detect suspicious money transactions
Medical insurance: detect professional patients and rings of doctors and rings of references
Inappropriate medical treatment: the Australian Health Insurance Commission found that in many cases blanket screening tests were requested (saving A$1 million per year)
Detecting telephone fraud
• Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm (British Telecom).
Retail: analysts estimate that 38% of retail shrink is due to dishonest employees
Anti-terrorism
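The telephone-fraud item above describes flagging calls that deviate from an expected norm. A minimal sketch of that idea follows, assuming the norm is simply the mean and standard deviation of a customer's past call durations; the function name, the threshold of 3 standard deviations, and the sample numbers are all hypothetical, and a real call model would also use destination and time of day.

```python
from statistics import mean, stdev


def flag_unusual_calls(history_minutes, new_calls_minutes, threshold=3.0):
    """Return the new call durations that deviate strongly from the customer's norm.

    The "norm" here is an assumption: mean and sample standard deviation
    of the customer's historical call durations (in minutes).
    """
    mu = mean(history_minutes)
    sigma = stdev(history_minutes)
    if sigma == 0:
        # No variation in the history: anything different is unusual.
        return [c for c in new_calls_minutes if c != mu]
    return [c for c in new_calls_minutes if abs(c - mu) / sigma > threshold]


# Hypothetical data: a customer who normally makes short calls.
history = [3.0, 4.5, 2.5, 5.0, 3.5, 4.0, 3.0, 4.5]
new_calls = [4.0, 95.0, 3.5]
suspicious = flag_unusual_calls(history, new_calls)
print(suspicious)
```

Only the 95-minute call lies far outside the customer's usual range, so it is the one flagged for review.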

Other applications

Sports
• IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat

Astronomy
• JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

Internet Web Surf-Aid
• IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

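Applications of data mining in CRM

Function | Data mining method | Meaning
Attract customers: acquire new customers | Direct marketing | Segment the market
Retain existing customers: reduce customer churn | Predictive analysis | Use historical churn patterns to predict future churn
Increase share of customer: cross-selling, selling new products | Association analysis | Classify the products most likely to be purchased
Increase sales: increase market share, market development trends | Trend analysis, time-series analysis | Understand market trends and bring products to market in time
Increase the number of products purchased | Segmentation analysis | Find the customers most likely to buy; increase purchase quantity according to customer preferences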

Data mining application: market segmentation analysis

Group people by the similarity of their basic customer data in order to differentiate between them.
True one-to-one marketing is hard to achieve in the real world; cluster analysis makes many-to-many marketing possible.

Cluster profiles ("-" marks a cell that is blank in the source; blank cells are placed so that each column sums to 100%. For sex, clusters 4 and 5 are each single-sex, but the source does not show which is which, so that assignment is tentative.)

Attribute | Value | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
Sex | Male | 77.40% | 51.90% | 42.00% | 100.00% | -
Sex | Female | 22.60% | 48.10% | 58.00% | - | 100.00%
Age | 19 and under | - | 88.40% | 0.70% | 4.00% | 6.40%
Age | 20-24 | - | 8.20% | 24.10% | 42.30% | 43.00%
Age | 25-29 | 5.50% | 1.70% | 29.20% | 33.00% | 31.90%
Age | 30-34 | 30.10% | 0.40% | 26.30% | 15.00% | 8.80%
Age | 35-39 | 28.40% | - | 10.90% | 3.10% | 5.20%
Age | 40-44 | 16.90% | 0.40% | 5.10% | 1.30% | 3.60%
Age | 45-49 | 14.00% | 0.40% | 2.20% | 0.40% | 0.80%
Age | 50 and over | 5.10% | 0.40% | 1.50% | 0.90% | 0.40%
Highest education | Junior high or below | 1.30% | 17.70% | - | 0.40% | 0.40%
Highest education | Senior high / five-year junior college | 19.10% | 79.70% | 97.80% | 1.30% | 0.80%
Highest education | College / university | 62.30% | 2.60% | - | 82.10% | 89.20%
Highest education | Graduate school or above | 17.40% | - | 2.20% | 16.10% | 9.60%

Average insured amount by insurance type (unit: NT$10,000)

Insurance type | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
Death insurance | 138 | 56 | 85 | 83 | 86
Endowment insurance | 65 | 36 | 72 | 42 | 60
Accident insurance | 216 | 102 | 153 | 224 | 136
Health insurance | 0.12 | 0.05 | 0.12 | 0.10 | 0.13
Annuity insurance | 6 | 0 | 5 | 0 | 0

Share of customers insured by insurance type

Insurance type | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
Death insurance | 93% | 10% | 62% | 67% | 75%
Endowment insurance | 52% | 32% | 63% | 36% | 18%
Accident insurance | 72% | 30% | 70% | 80% | 92%
Health insurance | 92% | 80% | 85% | 86% | 88%
Annuity insurance | 2% | 0% | 1% | 0% | 0%

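Data mining application: customer churn analysis

If a customer ...
• has only one mobile phone on the account,
• and signed up for a new-number, new-handset promotional plan,
• and belongs to the group with a high handset replacement rate,
then the customer is very likely to churn after three months (60.1%).

Some existing customers simply use the number promotion plans to buy cheaper handsets.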

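Data mining application: direct mail response

• The average purchase rate from direct mail in Taiwan has fallen to 0.05%-0.3%
• Taiwan sees a total of 5.7 billion spam e-mails a year
• 20.6% of Internet users have developed the habit of deleting advertising mail directly

Example: integrating customer databases to carry out cross-selling (using a U.S. bank (美國銀行) as the example)

[Figure: savings, mortgage, loans, checking, credit cards, and time deposits feed a data warehouse; data mining on the warehouse supports target marketing, credit risk management, portfolio analysis, and retail banking.]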

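DW, Data Mining & OLAP: Data Mining Applications

Data mining in practice
• Department store customer information
• Safeway: sales promotion information (e.g., coupons)
• Music/movie preference surveys
• Fidelity Investments: customer service (cross-selling)
• First USA Bank: credit card data (auto and home loans)
• Capital One: reducing the loss rate from loan risk
• First Union: predicting customers likely to churn
• Predicting the effects of corrosive substances on skin, to reduce the cost and time of developing products (drugs or toxic substances) and the need for animal testing
• Analyzing retail stores' historical sales records and location profiles to choose the best locations
• Analyzing the best locations for ATMs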

Data mining: on what kinds of data?

Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
• Object-oriented and object-relational databases
• Spatial databases
• Time-series data and temporal data
• Text databases and multimedia databases
• Heterogeneous and legacy databases
• WWW

Data mining process - NCR

[Figure: process cycle: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment]

Data mining process - Business Understanding
• To make the best use of data mining, you must make a clear statement of your objectives
  - e.g., increase the response rate and the value of a response
• An effective statement of the problem will also include a way of measuring the result of your knowledge discovery project

Data Understanding - Visualization
• Use of scatter plots and other visual media to analyze data
  - Graphs, distributions, histograms
  - Scatter plots
  - Association webs
    - Show the strength of connection between symbolic values
    - The weight of a line indicates a strong, medium, or weak connection
  - GIS (Geographical Information Systems)
• Limitation: low dimensionality (1-3)

Data mining process - Data Preparation

Overview
• Data preparation and model building are iterated repeatedly, as you learn something from the model that suggests you modify the data
• Takes anywhere from 50% to 85% of the time and effort of the KD process

Steps
• Collection
  - Identify the source of the data you will be mining
  - A data-gathering phase may be necessary
• Assessment
  - GIGO (garbage in, garbage out)
  - Missing values or violations of integrity constraints
• Consolidation and cleaning
  - Consolidate the data and repair it, insofar as possible

Data Preparation (cont.)

Steps
• Data selection
  - Compute time is determined by both the number of cases (rows) and the number of variables (columns)
  - Knowledge of the problem domain can let you make many of these selections correctly
  - Data visualization can help identify important independent variables and reveal collinear variables
  - Filter the outliers
  - Sample the data when the database is large
• Transformation
  - Computation (e.g., ratio)
  - Grouping continuous values
  - Scaling
  - Normalizing
  - Symbolic-to-numeric transformation
  - Coding discrete values
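The transformation items listed for data preparation (ratio computation, grouping continuous values, scaling, normalizing, symbolic-to-numeric coding) can be sketched in a few lines. This is only an illustration under assumed names and toy values; the slides do not prescribe particular formulas, so min-max scaling, z-score normalization, fixed bin edges, and one-hot coding are choices made here.

```python
from statistics import mean, pstdev


def ratio(numerator, denominator):
    """Computation: derive a new variable such as a debt-to-income ratio."""
    return numerator / denominator


def group_age(age, edges=(20, 30, 40, 50)):
    """Grouping continuous values: map an age to a bin index (edges are assumed)."""
    for i, edge in enumerate(edges):
        if age < edge:
            return i
    return len(edges)


def min_max_scale(values):
    """Scaling: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_normalize(values):
    """Normalizing: subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(value, categories):
    """Symbolic-to-numeric: code a discrete value as a 0/1 vector."""
    return [1 if value == c else 0 for c in categories]


incomes = [20.0, 40.0, 60.0, 100.0]          # hypothetical incomes
scaled = min_max_scale(incomes)
normalized = z_normalize(incomes)
age_bin = group_age(25)
outlook_code = one_hot("Rain", ["Sunny", "Overcast", "Rain"])
debt_ratio = ratio(5.0, 20.0)
print(scaled, age_bin, outlook_code, debt_ratio)
```

Which of these to apply depends on the mining algorithm: distance-based methods usually need scaling, while tree-based methods often work on grouped or symbolic values directly.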

"Discovered" Interesting Patterns

A data mining system or query may generate thousands of patterns, and not all of them are interesting.
• Suggested approach: human-centered, query-based, focused mining

Interestingness measures
• A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns (e.g., support, confidence)
  - Subjective: based on the user's beliefs about the data (e.g., unexpectedness, novelty, actionability)

Suggestions
• Completeness: find all the interesting patterns
  - Association, classification, and clustering
• Optimization: search for only the interesting patterns
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns (mining query optimization)

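Architecture of a typical data mining system

[Figure: layered architecture, top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; databases and data warehouse (reached through data cleaning, data integration, and filtering). A knowledge base supports the upper layers.]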

Data mining functionalities (I)

Concept description: characterization and discrimination
• Generalize and summarize
• Contrast data characteristics

Association (correlation and causality)
• Diaper -> Beer [0.5%, 75%] (support 0.5%, confidence 75%)

Classification and prediction
• Build models (functions) that describe and distinguish classes or concepts, for use in future prediction
  - e.g., classify countries based on climate, or classify cars based on gas mileage
• Predict unknown or missing numerical values
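The two numbers attached to the Diaper -> Beer rule above are its support and confidence. A small sketch of how they are computed follows, using a made-up list of eight baskets (the data and function names are assumptions for illustration; the slide's 0.5% and 75% come from a real, much larger data set).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"bread", "milk"},
    {"beer", "bread"},
    {"milk"},
    {"bread"},
]

rule_support = support(baskets, {"diaper", "beer"})          # 3 of 8 baskets
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})  # 3 of the 4 diaper baskets
print(f"Diaper -> Beer [{rule_support:.1%}, {rule_confidence:.0%}]")
```

Support measures how often the rule applies at all, and confidence measures how reliable it is when it does, which is why a rule is usually reported with both.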

Data mining functionalities (II)

Cluster analysis
• Class labels are unknown: group the data into clusters by similarity
  - e.g., cluster houses to find distribution patterns
• Maximize intra-class similarity
• Minimize inter-class similarity
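Cluster analysis as described above groups unlabeled data so that points within a cluster are similar and points in different clusters are not. One common way to do this is k-means, sketched below; the slides do not name a specific algorithm, so k-means, the squared Euclidean distance, the "first k points" initialization, and the sample points are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means: alternate between assigning points and moving centroids."""
    centroids = list(points[:k])          # simplistic deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                   # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return labels, centroids


# Hypothetical customers described by two numeric attributes.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

With these well-separated points, the first three end up in one cluster and the last three in the other, even though both initial centroids start in the same group.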

Outlier analysis
• Outlier: a data object that does not comply with the general behavior (pattern) of the data
• Noise? Exception? No! Outliers are useful in fraud detection and rare-event analysis

Trend and evolution analysis
• Trend and deviation: regression analysis
• Sequential pattern mining
• Periodicity analysis
• Similarity-based analysis

Estimation, visualization

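Data mining: functions and techniques

Function | Techniques | Application area
Association | Case-based reasoning / set theory / statistics | Market basket analysis
Sequence (time series) | Neural networks / statistics | Interest rate forecasting
Classification | Genetic algorithms / neural networks / statistics / fuzzy logic / case-based reasoning / decision trees | Customer rating and classification
Modeling (formulas) | Genetic programming / genetic algorithms / regression | Sales forecasting
Clustering | Neural networks / fuzzy logic / genetic algorithms / statistics | Market segmentation

Interesting patterns!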

Data mining: classification schemes

General functionality
• Descriptive data mining
• Predictive data mining

Different views, different classifications
• Kinds of data to be mined
• Kinds of knowledge to be discovered
• Kinds of techniques utilized
• Kinds of applications adapted

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines]

Association

Applications of KDD (I)

Financial investment management
• Fidelity Stock Selector
  - Uses a neural network to select investments
• LBS Capital Management
  - Uses ESs, neural nets, and GAs to manage portfolios worth $600 million
• Carlberg & Associates
  - Uses a neural network model to predict the Standard & Poor's 500 Index

Fraud detection
• FALCON
  - Uses a neural network shell
  - Detects suspicious credit card transactions
• FAIS
  - Detects money-laundering activity from financial transactions
• Telecommunication
  - AT&T's system for detecting international calling fraud
  - GTE and NYNEX: detecting cellular cloning fraud

Applications of KDD (II)

Manufacturing and production
• Prospective KDD application: control and scheduling of technical production processes
• Main advantage: high cost savings
• Key challenge: representation and exploitation of time and location, as well as model levels such as quality, process, and control
• Examples
  - A European chemical company: assisting the production process for polymeric plastics
  - CASSIOPEE: diagnosing and predicting problems in the Boeing 737

Network management
• Filter redundant alarms, locate problems in the network, predict severe faults
• Example: Telecommunication Alarm Sequence Analyzer (TASA), by the University of Helsinki
  - Locates frequently occurring alarm episodes
  - Presents them as rules
  - Integrates into alarm-handling software

Challenges of KDD (I)

Larger databases
• Cannot fit in main memory at one time
• Solutions: sampling, approximation methods, parallel processing

High dimensionality
• Increases the size of the search space for model induction in a combinatorially explosive manner
• Increases the chance that the learner will find spurious patterns that are not valid in general
• Solutions: use prior knowledge to identify irrelevant variables

Changing data and knowledge
• Changes may make previously discovered patterns invalid
• Solutions: incremental methods for updating the patterns

Missing and noisy data
• Solutions: statistical strategies to identify hidden variables and dependencies
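Sampling is listed above as one answer to databases that cannot fit in main memory at one time. A sketch of one such technique, reservoir sampling (Algorithm R), is shown below; the slides only say "sampling", so the choice of algorithm, the function name, and the seed parameter are assumptions. It draws a uniform random sample of k rows in a single pass while holding only k rows in memory.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)      # fixed seed only to make the sketch repeatable
    sample = []
    for i, row in enumerate(stream):
        if i < k:
            sample.append(row)     # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)  # row i survives with probability k / (i + 1)
            if j < k:
                sample[j] = row
    return sample


# A generator stands in for a table that is too large to load at once.
rows = (row_id for row_id in range(100_000))
sample = reservoir_sample(rows, k=5)
print(sample)
```

The mining algorithm can then be run on the sample, at the cost of some statistical error that shrinks as k grows.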

Challenges of KDD (II)

Overfitting
• Good performance on training data, but poor performance on real data
• Solutions: cross-validation, regularization, other statistical strategies

Complex relationships between fields
• Most algorithms were developed for simple attribute-value records
• Algorithms are needed that can deal with hierarchically structured attributes or values and with relations between attributes

Understanding of patterns
• Make the discoveries more understandable to humans
• Solutions: graphical representations, natural language generation, information visualization

User interaction and prior knowledge
• Encoding domain knowledge into learning systems

Integration with other systems
• Integration with spreadsheets, DBMS, visualization tools
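Cross-validation, named above as a remedy for overfitting, estimates how a model performs on data it was not trained on. The sketch below shows k-fold cross-validation in plain Python; the helper names, the interleaved fold assignment, and the trivial majority-class "model" are assumptions made so the example runs on its own.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs; every row is tested exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_val_accuracy(rows, labels, fit, predict, k=5):
    """Average accuracy over k folds, always scoring on held-out rows."""
    scores = []
    for train, test in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, rows[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Placeholder model: always predict the most common training label.
def fit_majority(train_rows, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


rows = list(range(10))                      # hypothetical feature rows
labels = ["yes"] * 8 + ["no"] * 2
accuracy = cross_val_accuracy(rows, labels, fit_majority, predict_majority, k=5)
print(accuracy)
```

A large gap between training accuracy and the cross-validated accuracy is the usual warning sign that a model has overfitted.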

Common data mining methods

• Decision trees and rules (propositional-logic power)
• Nonlinear regression methods (e.g., neural networks)
• Example-based methods (e.g., nearest-neighbor classification, regression algorithms, case-based reasoning (CBR))
• Genetic algorithms
• Inductive logic programming (first-order-logic power)
• Probabilistic graphical dependency models (e.g., Bayesian networks)
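Of the methods listed above, nearest-neighbor classification is the simplest to show in code: a new case receives the majority label of the most similar stored examples. The sketch below assumes Euclidean distance, k = 3, and invented customer data (age, savings rate); none of these come from the slides.

```python
import math
from collections import Counter


def knn_classify(examples, query, k=3):
    """Label `query` by majority vote among its k nearest stored examples."""
    neighbors = sorted(examples, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical stored cases: (age, savings rate) -> bought the product?
examples = [
    ((25, 5), "no"), ((30, 8), "no"), ((28, 6), "no"),
    ((55, 25), "yes"), ((60, 27), "yes"), ((52, 26), "yes"),
]

older_saver = knn_classify(examples, (58, 26))
young_spender = knn_classify(examples, (27, 7))
print(older_saver, young_spender)
```

In practice the attributes would be scaled first, since otherwise the attribute with the largest numeric range dominates the distance.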

Decision tree - Example (I)

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No

Decision tree learning

Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

Inductive rules
1. IF Outlook = Sunny AND Humidity = Normal, THEN PlayTennis
2. IF Outlook = Overcast, THEN PlayTennis
3. IF Outlook = Rain AND Wind = Weak, THEN PlayTennis
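The PlayTennis example can be checked in code. The sketch below assumes an ID3-style information-gain criterion for choosing the root attribute (the slides do not say which criterion built the tree) and then verifies that the three inductive rules reproduce the PlayTennis column for all 14 days. Function and variable names are illustrative.

```python
from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for days D1..D14.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy reduction obtained by splitting `rows` on one attribute."""
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)


def play_tennis(outlook, humidity, wind):
    """The three inductive rules; any case they do not cover means 'No'."""
    if outlook == "Sunny":
        return humidity == "Normal"
    if outlook == "Overcast":
        return True
    return wind == "Weak"          # outlook == "Rain"


rules_match = all(play_tennis(o, h, w) == (label == "Yes")
                  for o, _, h, w, label in DATA)
print(best_attribute, rules_match)
```

Under this criterion Outlook has the largest gain, which is consistent with it sitting at the root of the tree, and Temperature never needs to be tested.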

Decision tree - Example (II)

IF Time_band >= 2.5 years AND Time_employed >= 1.5 years, THEN accept is 96.8% likely, while reject is only 3.2% likely.
A total of 63 cases fit this profile: 61 accepts and 2 rejects.

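Neural networks - supervised learning

[Figure: supervised learning architecture. Input → network → predicted value; the predicted value is compared with the actual value, and the weights are adjusted according to the error.]

[Figure: unsupervised learning architecture. Input → competition → winning unit → output; the weights from the winning unit to the input layer are adjusted.]

Neural networks - Example

[Figure: training phase: applicant data (input) → network with weights W_i → approved or not (current output), compared with approved or not (desired output); the learning algorithm adjusts W_i. Working phase: applicant data (input) → network with weights W_i → approved or not? (output).]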
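The training phase above (compare the current output with the desired output, then adjust the weights W_i by the error) can be illustrated with the simplest supervised network, a single perceptron. This is a sketch under assumptions: the two binary applicant features, the approval rule, the learning rate, and the function names are invented for illustration and are not the network used in the slides.

```python
def predict(weights, bias, x):
    """Working phase: weighted sum of the inputs followed by a step function."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Training phase: adjust each weight in proportion to the output error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, desired in samples:
            error = desired - predict(weights, bias, x)   # desired minus current output
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical applicants: (has stable income, has good credit history) -> approved?
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
decisions = [predict(weights, bias, x) for x, _ in samples]
print(weights, bias, decisions)
```

A single perceptron can only learn linearly separable decisions like this one; multi-layer networks trained with the same error-driven idea handle more complex approval patterns.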

Case study: customer rating and classification (NeuroFuzzy)

Features of the customer purchase cases

Feature | Data type | Content
Sex | Character | F: Female; M: Male
Marital status | Character | Y: Married; N: Single; U: Unknown
Number of children | Integer | Range: [1..8]
Age | Integer | Range: [1..70]
Occupation | List | Range: [1..10]
Postal code | Integer | Three-digit zip code
Savings rate | Integer | Range: [1..27]
Purchase potential (predicted outcome) | Character | Y: Yes; N: No

Example cases with the highest and the lowest purchase potential

Customers with the highest purchase potential
Class | Sex | Marital status | Children | Age | Occupation | Postal code | Savings rate
A | Female | Y | 1 | 40 | 3 | 540 | 27
B | Male | N | 4 | 64 | 7 | 540 | 27
C | Male | Y | 4 | 52 | 2 | 570 | 26

Customers with the lowest purchase potential
Class | Sex | Marital status | Children | Age | Occupation | Postal code | Savings rate
D | Female | Y | 3 | 58 | 2 | 120 | 19
E | Male | N | 4 | 60 | 2 | 120 | 19
F | Female | N | 4 | 55 | 6 | 650 | 23

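Other application cases

Safeway case
• 8 million transaction records per week; 500 stores and 6 million customers
• Market competition is fierce, and traditional tactics (low prices, many outlets, many product categories) are losing effectiveness
• New key competitive focus: understanding customer needs (which kinds of customers buy which products, and how often)

Medical insurance: FAMS
• Functions
  - Detection: uses fuzzy modeling and statistical techniques to analyze group behavior, and scores each medical service provider to reflect how far it deviates from the behavioral norm
  - Investigation: analyzes providers' scores and detailed claims data
  - Settlement: reports and charts with detailed analysis of group behavior and claims; the reports can be used to negotiate, resolve problems, and report illegal activity
  - Prevention: supports monitoring of providers and offers new tools to evaluate and educate them, improving provider behavior to prevent medical fraud and abuse and to reduce insurers' losses
• Characteristics
  - Uses "retrospective analysis" of case data, analyzing accounts and providers' medical practice to identify suspect providers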

Case: customer selection for DM

• Input variable values are given a composite score: earnings, home location, total real estate value, age, number of children, marital status
• IT: genetic algorithms, decision trees, neural networks

Data mining modeling

OLAP mining: on-line multidimensional analytical mining

Integration of data mining and data warehousing
• Coupling of data mining systems, DBMS, and data warehouse systems
  - No coupling
  - Loose coupling
  - Semi-tight coupling
  - Tight coupling
• On-line analytical mining of data
  - Integration of mining and OLAP technologies
• Interactive mining of multi-level knowledge
  - Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
• Integration of multiple mining functions
  - Characterized classification, first clustering and then association

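Multidimensional analysis - OLAP

[Figure: data cubes whose dimensions include customer, sales region, product, market, time, and financial indicators; the measure is sales.]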
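Multidimensional analysis views the same sales facts along different dimensions. The sketch below shows two basic OLAP operations, roll-up (aggregate over the dimensions left out) and slice (fix one dimension to a value), on a tiny in-memory fact table; the fact rows, dimension names, and function names are hypothetical, and a real OLAP engine would work on a precomputed cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact table: one row per (region, product, quarter) with a sales amount.
SALES = [
    {"region": "North", "product": "A", "quarter": "Q1", "amount": 100},
    {"region": "North", "product": "B", "quarter": "Q1", "amount": 50},
    {"region": "South", "product": "A", "quarter": "Q1", "amount": 70},
    {"region": "North", "product": "A", "quarter": "Q2", "amount": 120},
    {"region": "South", "product": "B", "quarter": "Q2", "amount": 30},
]


def roll_up(facts, dimensions):
    """Total the amount for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact["amount"]
    return dict(totals)


def slice_cube(facts, **fixed):
    """Keep only the facts whose dimensions equal the given values."""
    return [f for f in facts if all(f[d] == v for d, v in fixed.items())]


by_region = roll_up(SALES, ["region"])
by_product = roll_up(SALES, ["product"])
q1_by_region = roll_up(slice_cube(SALES, quarter="Q1"), ["region"])
print(by_region, by_product, q1_by_region)
```

Drilling down is the reverse of roll-up: calling `roll_up` with more dimensions, for example `["region", "product"]`, returns the finer-grained totals.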

An OLAM Architecture

[Figure: four layers.
Layer 4, user interface: user GUI API (mining query in, mining result out).
Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, above the data cube API.
Layer 2, MDDB: multidimensional database with metadata, above the database API (filtering and integration).
Layer 1, data repository: databases and data warehouse (data cleaning, data integration, filtering).]

Major issues in data mining (I)

Mining methodology and user interaction
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Expression and visualization of data mining results
• Handling noise and incomplete data
• Pattern evaluation: the interestingness problem

Performance and scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, and incremental mining methods

Major issues in data mining (II)

Diversity of data types
• Handling relational and complex types of data
• Mining information from heterogeneous DBs and global information systems (WWW)

Applications and social impact
• Application of discovered knowledge
  - Domain-specific data mining tools, intelligent query answering, process control and decision making
• Integration of discovered knowledge with existing knowledge: a knowledge fusion problem
• Protection of data security, integrity, and privacy

Research topics

Applications
• E-commerce / M-commerce
• Customer relationship management
• Traffic engineering / management

Web mining and text mining (newsgroups, e-mail, documents)
Biomedical / DNA data mining
Cube exploration / trends
Mining frequent and sequential patterns
Anomaly mining
On-line, real-time, stream data mining

Applications of KDD - marketing

Predicting the size of TV audiences
• Using neural networks and rule induction
• Examine factors relating to audience size

Analyzing supermarket sales data
• CoverStory and Spotlight: produce reports, using natural language and graphics, on the most significant changes in a particular product's volume and share, broken down by region, product type, etc.
• Opportunity Explorer
• Management Discovery Tool
  - Summarization, trend analysis, change analysis, and measure and segment comparison

Market basket analysis
• Association rules
  - Associations between different products bought by the customer
• Lucent Technologies' NicheWorks
  - Lets clustered purchases be visualized intuitively

Other Applications of KDD

Health care - KEFIR
• Determine the most interesting deviations and explain key deviations
• Generate recommendations

Data quality
• Verify financial trading data and detect errors

NBA basketball games - IBM Advanced Scout
• Help coaches discover valuable patterns for improving their strategy

Discovery agents
• These systems ask the user to specify a profile of interest and search for related information among a wide variety of public-domain and proprietary sources
• Examples
