CRM & Data Mining
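Data, Customers, Needs
• Understand customer needs and interact with customers
  - The hair stylist and the neighborhood auntie who braids hair
  - The liquor retailer and the corner grocery store
  - Large companies: one-to-one marketing
  - Personalization vs. professionalism
• Data warehousing and corporate memory
  - OLAP & DW
• Films: "Enemy of the State", "Catch Me If You Can"
  - Example: the full process of an individual purchase plus the shipping process (UPS)
• Data mining and corporate intelligence
  - Original uses of data versus data as corporate intelligence: a valuable corporate asset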
[Figure: data, warehousing, information, mining, knowledge; labels: Corporate Memory, Corporate Intelligence]
Data explosion problem! Starving for knowledge!
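Business Intelligence (BI)
[Figure: BI architecture. Data sources → data conversion tools (Extract, Transform, Load; data administrator / MIS) → data warehouse / data marts, with metadata (data that describes data) and templates → data analysis (data mining, OLAP) → users (decision making, CRM, marketing campaigns)]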
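• Collect data: operational data, market research data, fixed-panel tracking
• Manage data: ETL and data warehousing
• Extract intelligence from data: data mining, OLAP, statistics
• Apply intelligence: marketing strategy, executive decisions, interactive CRM mechanisms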
Data Mining and Corporate Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top]
  Making decisions
  Data presentation: visualization techniques
  Data mining: information discovery
  Data exploration: OLAP, MDA, statistical analysis, querying and reporting
  Data warehouses / data marts
  Data sources: paper, files, information providers, database systems, OLTP
Roles, from top to bottom: end user, business analyst, data analyst, DBA
From BI to CRM
[Figure: CRM architecture. Internal and external data feed a staging repository for loading and indexing, then data warehousing and data marts. Decision support (OLAP) and data mining cover campaign analysis, customer segmentation, channel analysis, cross-sell analysis, target marketing, churn analysis, projections, and budgeting/profitability. Campaign management / sales automation covers list selection, campaign planning, fulfillment, tracking, and response/interactive. Interactive channels: call center, Internet, direct mail, agent, e-mail, VRU.]
[Figure: GENUINE CRM. Inputs: iRate (Internet Rating), ISS (I-Survey System), marketing campaigns, membership data, product data, order records, website log files, new members, new products, search. Core: data warehousing, data marts, data mining, one-to-one. Output channels: fax, call center, DM, e-mail, ad system, EC system, content, PDA (XML), mobile (PMML), set-top box.]
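Data Mining: Introduction
• Background
  - [Data explosion] Information overload, insufficient structure, and information that is disorganized and misused
  - [Necessity] Management problems are highly complex, and real-time decision analysis matters more and more
    ⇒ [Solution] On-Line Analytical Processing, data warehousing, and data mining
• Goals
  - Make effective use of collected information about markets, customers, suppliers, competitors, and future trends (note: domain dependent)
  - Enable a company to extract useful knowledge from historical data (vs. streaming data) through effective methods and techniques
    ⇒ The aim of data mining is to help a company understand its customers better, so that it performs better in marketing, sales, and customer service operations (DM in CRM)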
Check it out
Evolution of Database Technology
• 1960s: Data collection
  - Data collection, database creation, information management systems, and network DBMS
• 1970s: Databases
  - Relational data model, relational DBMS implementation
• 1980s: Advanced databases
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.), and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s-2000s: Data mining
  - Data mining and data warehousing, multimedia databases, and Web databases
What Is Data Mining?
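• Data mining (KDD)
  - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
  - A step in the KDD process: running mining algorithms to produce the desired patterns
    ⇒ Data mining applies statistics, machine learning, and database techniques, using automatic or semi-automatic tools, to explore and analyze large volumes of data and discover meaningful patterns and rules
• Principles
  - Main approaches: databases, data visualization, statistics, machine learning, etc.
  - Related techniques: neural networks, fuzzy logic, genetic algorithms, genetic programming, case-based reasoning, rule-based reasoning, statistical regression, etc.
  - Knowledge representation: decision trees, rules, quantitative mathematical formulas, black-box formulas, etc.
Data Mining: The KDD Process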
• Data mining: the core of the knowledge discovery process
[Figure: KDD (Knowledge Discovery in Databases) pipeline. Databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining (the core step) → pattern evaluation]
KDD Process (interactive and iterative)
• Learning the application domain (relevant prior knowledge and goals of the application)
• Steps
  - Data selection: creating a target data set
  - Data cleaning and preprocessing: may take 60% of the effort!
  - Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
  - Data mining: (1) choose the function (summarization, classification, clustering, regression, association); (2) choose the algorithms; (3) search for interesting patterns
  - Pattern evaluation and knowledge presentation: visualization, transformation
Applications
Database analysis and decision support
• Market analysis and management
  - Target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
• Risk analysis and management
  - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
• Fraud detection and management
  - Preventing credit card fraud (fraudulent card charges in Taiwan reach NT$3 billion a year)
    • A neural network model can identify patterns of fraudulent use
    • (Before the fraudulent charges, there is a transaction in which the card is used to make a payphone call lasting under one minute)
Other applications
Market Analysis and Management (I)
• Data sources for analysis
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  - Find clusters of "model" customers who share the same characteristics (interest, income level, spending habits)
• Determine customer purchasing patterns over time
  - Conversion of a single bank account to a joint account (marriage)
• Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information
Market Analysis and Management (II)
• Customer profiling (clustering/classification)
  - Data mining can tell you what types of customers buy what products
• Identifying customer requirements
  - Identifying the best products for different customers
  - Use prediction to find what factors will attract new customers
• Providing summary information
  - Various multidimensional summary reports
  - Statistical summary information (central tendency and variation of the data)
Risk Analysis and Management
• Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time-series analysis (financial ratios, trend analysis, etc.)
• Resource planning
  - Summarize and compare resources and spending
• Competition
  - Monitor competitors and market directions
  - Group customers into classes and set up a class-based pricing procedure
  - Set pricing strategy in a highly competitive market
Fraud Detection and Unusual Patterns
• Auto insurance: detect groups of people who stage accidents to collect on insurance
• Money laundering: detect suspicious money transactions
• Medical insurance: detect professional patients, rings of doctors, and rings of references
• Inappropriate medical treatment: the Australian Health Insurance Commission found that in many cases blanket screening tests were requested (saving A$1 million per year)
• Telephone fraud detection
  - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm (British Telecom)
• Retail: analysts estimate that 38% of retail shrink is due to dishonest employees
• Anti-terrorism
Other Applications
• Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat
• Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Data Mining Applications in CRM
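Mining method: meaning; function
• Segmentation analysis: segment the market for direct marketing; acquire new customers, attract customers
• Segmentation analysis: find the customers most likely to buy and strengthen purchase volume according to customer preferences; more products purchased, higher market share, higher sales
• Time-series analysis (trend analysis): understand market trends and bring products to market promptly; market development trends
• Association analysis: classify the products most likely to be bought; selling new products, increasing share of customer, cross-selling
• Predictive analysis: use historical customer churn patterns to predict the future; lower churn rate, retaining existing customers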
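Data Mining Application: Market Segmentation Analysis
• Group people by the similarity of their basic customer data, then differentiate between the groups
• True one-to-one marketing is hard to achieve in the real world; cluster analysis makes many-to-many marketing possible

Customer profile by cluster ("-" marks a cell that is empty in the source)
                                Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
Gender
  Male                            77.40%     51.90%     42.00%     (see note)
  Female                          22.60%     48.10%     58.00%     (see note)
Age
  19 and under                      -        88.40%      0.70%      4.00%      6.40%
  20-24                             -         8.20%     24.10%     42.30%     43.00%
  25-29                            5.50%      1.70%     29.20%     33.00%     31.90%
  30-34                           30.10%      0.40%     26.30%     15.00%      8.80%
  35-39                           28.40%       -        10.90%      3.10%      5.20%
  40-44                           16.90%      0.40%      5.10%      1.30%      3.60%
  45-49                           14.00%      0.40%      2.20%      0.40%      0.80%
  50 and over                      5.10%      0.40%      1.50%      0.90%      0.40%
Highest education
  Junior high or below             1.30%     17.70%       -         0.40%      0.40%
  High school / 5-year college    19.10%     79.70%     97.80%      1.30%      0.80%
  College / university            62.30%      2.60%       -        82.10%     89.20%
  Graduate school or above        17.40%       -         2.20%     16.10%      9.60%
Note: one of clusters 4 and 5 is 100.00% male and the other is 100.00% female; the flattened source does not show which. Empty cells in the age and education rows are placed so that each cluster's column sums to about 100%.

Average insured amount by insurance type (unit: 10,000 NT dollars)
                      Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
Death insurance          138        56         85         83         86
Endowment insurance       65        36         72         42         60
Accident insurance       216       102        153        224        136
Health insurance        0.12      0.05       0.12       0.10       0.13
Annuity insurance          6         0          5          0          0

Share of customers insured, by insurance type
                      Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
Death insurance          93%       10%        62%        67%        75%
Endowment insurance      52%       32%        63%        36%        18%
Accident insurance       72%       30%        70%        80%        92%
Health insurance         92%       80%        85%        86%        88%
Annuity insurance         2%        0%         1%         0%         0%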
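
The slides do not say which clustering algorithm produced the five clusters. As a minimal sketch of the idea of grouping customers by the similarity of their basic data, the following uses plain k-means on a handful of hypothetical customers described by (age, years of education). The data, the choice of k = 2, and the starting centroids are all assumptions made for illustration.

```python
from math import dist
from statistics import mean

# Hypothetical customers: (age, years of education). Not taken from the slides.
customers = [(17, 10), (18, 11), (19, 11), (42, 16), (45, 16), (47, 18)]

def k_means(points, centroids, max_iter=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(mean(axis) for axis in zip(*members))
    return labels, centroids

# Starting centroids are picked by hand here; real runs usually try several random starts.
labels, centroids = k_means(customers, centroids=[(17, 10), (47, 18)])
print(labels)
print(centroids)
```

With real customer data the attributes would normally be scaled first, so that no single attribute dominates the distance.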
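Data Mining Application: Customer Churn Analysis
If a customer ...
• has only one handset on the account,
• and joined the new-number-with-new-handset promotion,
• and has a high handset replacement rate,
then the customer is very likely to churn after three months (60.1%).
Some existing customers use the new-number promotion only to buy a cheaper handset.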
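
A rule like the one above can be applied directly to customer records to list who is at risk. The sketch below assumes a hypothetical `Customer` record; the field names are illustrative and not from the slides. Only the three conditions and the 60.1% figure come from the rule itself.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    # Hypothetical field names, chosen for illustration only.
    name: str
    handsets_on_account: int
    joined_new_number_promo: bool
    high_handset_replacement: bool

# Churn likelihood stated on the slide for customers who match all three conditions.
CHURN_LIKELIHOOD_IF_MATCHED = 0.601

def matches_churn_rule(c: Customer) -> bool:
    """True when all three conditions of the mined rule hold."""
    return (
        c.handsets_on_account == 1
        and c.joined_new_number_promo
        and c.high_handset_replacement
    )

customers = [
    Customer("A", 1, True, True),    # matches the rule
    Customer("B", 2, True, True),    # more than one handset on the account
    Customer("C", 1, False, True),   # did not join the promotion
]
at_risk = [c.name for c in customers if matches_churn_rule(c)]
print(at_risk, CHURN_LIKELIHOOD_IF_MATCHED)
```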
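Data Mining Application: Direct Mail Response
• The average purchase rate from direct mail in Taiwan has fallen to between 0.05% and 0.3%
• Taiwan receives 5.7 billion spam e-mails a year; 20.6% of Internet users have made a habit of deleting advertising e-mail outright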
Example: Integrating the Customer Database for Cross-Marketing
[Figure: retail banking (example from a U.S. bank). Savings, mortgage, loans, checking, credit cards, and time deposits feed a data warehouse; data mining supports target marketing, credit risk management, and portfolio analysis]
DW, Data Mining & OLAP: Data Mining Applications
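Data Mining in Practice
• Department stores: customer information
• Safeway: sales promotion information (e.g., coupons)
• Questionnaire collection of music/film preferences
• Fidelity Investments: customer service (cross-selling)
• First USA Bank: credit card data (auto and home loans)
• Capital One: reducing the loss rate from loan risk
• First Union: predicting customers likely to churn
• Predicting the effect of corrosive substances on skin, which lowers the cost and time of developing products (drugs or toxic substances) and reduces the need for animal testing
• Analyzing retail stores' historical sales records and location profiles to choose the best location
• Analyzing the best locations for ATMs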
Data Mining: What Kinds of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced databases and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW
Data Mining Process - NCR
[Figure: process phases] Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
Data Mining Process
• Business Understanding
  - To make the best use of data mining, you must make a clear statement of your objectives
    • e.g., increase the response rate and the value of a response
  - An effective statement of the problem also includes a way of measuring the result of your knowledge discovery project
• Data Understanding - Visualization
  - Use scatter plots and other visual media to analyze data
    • Graphs, distributions, histograms
    • Scatter plots
    • Association webs: show the strength of connection between symbolic values; the weight of a line indicates a strong/medium/weak connection
    • GIS (Geographical Information Systems)
  - Limitation: low dimensionality (1-3)
Data Mining Process - Data Preparation
• Overview
  - Data preparation and model building steps are repeated iteratively, because what you learn from a model suggests changes to the data
  - Takes anywhere from 50% to 85% of the time and effort of the KD process
• Steps
  - Collection
    • Identify the sources of the data you will be mining
    • A data-gathering phase may be necessary
  - Assessment
    • GIGO (garbage in, garbage out)
    • Check for missing values and violations of integrity constraints
  - Consolidation and cleaning
    • Consolidate the data and repair it, insofar as possible
Data Preparation (cont.)
• Steps
  - Data selection
    • Compute time is determined by both the number of cases (rows) and the number of variables (columns)
    • Knowledge of the problem domain lets you make many of these selections correctly
    • Data visualization can help identify important independent variables and reveal collinear variables
    • Filter the outliers
    • Sample the data when the database is large
  - Transformation
    • Computation (e.g., ratios)
    • Grouping continuous values
    • Scaling
    • Normalizing
    • Symbolic-to-numeric transformation
    • Coding discrete values
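
The transformation steps listed above (scaling, normalizing, grouping continuous values, and coding discrete values) can each be written in a few lines. This is a minimal sketch using only the standard library; the function names, the age bands, and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Scaling: map values linearly onto [0, 1]. Assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_normalize(values):
    """Normalizing: subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def group_age(age):
    """Grouping continuous values: the band edges here are arbitrary examples."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50+"

def code_discrete(values):
    """Coding discrete values: give each distinct symbol an integer code."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

incomes = [10, 20, 30]
print(min_max_scale(incomes))
print(z_normalize(incomes))
print([group_age(a) for a in (22, 35, 61)])
print(code_discrete(["S", "M", "S"]))
```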
"Discovered" Interesting Patterns
• A data mining system or query may generate thousands of patterns, and not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
• Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
  - Objective vs. subjective interestingness measures
    • Objective: based on statistics and structures of patterns (e.g., support, confidence)
    • Subjective: based on the user's beliefs about the data (e.g., unexpectedness, novelty, actionability)
• Suggestions
  - Completeness: find all the interesting patterns
    • Association, classification, and clustering
  - Optimization: search for only the interesting patterns
    • Approach 1: first generate all the patterns and then filter out the uninteresting ones
    • Approach 2: generate only the interesting patterns (mining query optimization)
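
Support and confidence, the two objective measures named above, are simple counts over transactions. The sketch below uses five made-up market baskets; the baskets and the function names are assumptions for illustration. For a rule A -> B, support is the fraction of transactions containing both A and B, and confidence is the fraction of transactions containing A that also contain B.

```python
# Made-up market baskets, for illustration only.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(antecedent | consequent, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

rule_support = support({"diaper"}, {"beer"}, transactions)        # 3 of 5 baskets
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)  # 3 of the 4 diaper baskets
print(f"diaper -> beer [{rule_support:.0%}, {rule_confidence:.0%}]")
```

A mining system would typically keep only rules whose support and confidence both exceed user-chosen thresholds.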
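Architecture of a Typical Data Mining System
[Figure: layers from bottom to top. Databases and data warehouse (data cleaning, data integration, filtering) → database or data warehouse server → data mining engine → pattern evaluation → graphical user interface, with a knowledge base alongside the mining engine and pattern evaluation]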
Data Mining Functions (I)
• Concept description: characterization and discrimination
  - Generalize and summarize
  - Contrast data characteristics
• Association (correlation and causality)
  - Diaper -> Beer [0.5%, 75%]
• Classification and prediction
  - Build models (functions) that describe and distinguish classes or concepts, for use in future prediction
    • e.g., classify countries based on climate, or classify cars based on gas mileage
  - Predict unknown or missing numerical values
Data Mining Functions (II)
• Cluster analysis
  - Class labels are unknown: group the data by similarity
    • e.g., cluster houses to find distribution patterns
  - Maximize intra-class similarity
  - Minimize inter-class similarity
• Outlier analysis
  - Outlier: a data object that does not conform to the general behavior (pattern) of the data
  - Noise? An exception? No! Outliers are useful in fraud detection and rare-event analysis
• Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining
  - Periodicity analysis
  - Similarity-based analysis
• Estimation, visualization
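
One simple way to flag an outlier, an object that does not fit the general behavior of the data, is to measure how many standard deviations it lies from the mean. The slides do not prescribe a method; the z-score test, the threshold of 2.0, and the call durations below are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# Hypothetical call durations in minutes; the last call is unusually long.
call_minutes = [5, 6, 5, 7, 6, 5, 6, 60]

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

outliers = z_score_outliers(call_minutes)
print(outliers)
```

In a fraud detection setting, the flagged records are candidates for investigation rather than noise to be discarded.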
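Data Mining - Functions and Techniques
Function                  Techniques                                                                                              Application area
Association               Case-based reasoning / set theory / statistics                                                          Market basket analysis
Sequence (time series)    Neural networks / statistics                                                                            Interest rate forecasting
Classification            Genetic algorithms / neural networks / statistics / fuzzy logic / case-based reasoning / decision trees  Customer rating and classification
Modeling (formulas)       Genetic programming / genetic algorithms / regression                                                   Sales forecasting
Clustering                Neural networks / fuzzy logic / genetic algorithms / statistics                                         Market segmentation
Interesting patterns!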
Data Mining: Classification Schemes
• General functionality
  - Descriptive data mining
  - Predictive data mining
• Different views, different classifications
  - Kinds of data to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines]
Association
Applications of KDD (I)
• Financial investment management
  - Fidelity Stock Selector
    • Uses a neural network to select investments
  - LBS Capital Management
    • Uses expert systems (ESs), neural nets, and genetic algorithms (GAs) to manage portfolios worth $600 million
  - Carlberg & Associates
    • Uses a neural network model to predict the Standard & Poor's 500 Index
• Fraud detection
  - FALCON
    • Uses a neural network shell
    • Detects suspicious credit card transactions
  - FAIS
    • Detects money-laundering activity from financial transactions
  - Telecommunications
    • AT&T's system detects international calling fraud
    • GTE and NYNEX: detecting cellular cloning fraud
Applications of KDD (II)
• Manufacturing and production
  - Prospective KDD applications: control and scheduling of technical production processes
  - Main advantage: high cost savings
  - Key challenge: representing and exploiting time and location, as well as model levels such as quality, process, and control
  - Examples
    • A European chemical company: assists in the production process for polymeric plastics
    • CASSIOPEE: diagnoses and predicts problems in the Boeing 737
• Network management
  - Filter redundant alarms, locate problems in the network, predict severe faults
  - Example: Telecommunication Alarm Sequence Analyzer (TASA), by the University of Helsinki
    • Locates frequently occurring alarm episodes
    • Presents them as rules
    • Integrates them into alarm-handling software
Challenges of KDD (I)
• Larger databases
  - Cannot fit in main memory at one time
  - Solutions: sampling, approximation methods, parallel processing
• High dimensionality
  - Increases the size of the search space for model induction in a combinatorially explosive manner
  - Increases the chance that the learner will find spurious patterns that are not valid in general
  - Solutions: use prior knowledge to identify irrelevant variables
• Changing data and knowledge
  - Changes may make previously discovered patterns invalid
  - Solutions: incremental methods for updating the patterns
• Missing and noisy data
  - Solutions: statistical strategies to identify hidden variables and dependencies
Challenges of KDD (II)
• Overfitting
  - Good performance on training data, but poor performance on real data
  - Solutions: cross-validation, regularization, other statistical strategies
• Complex relationships between fields
  - Most algorithms were developed for simple attribute-value records
  - Algorithms are needed that handle hierarchically structured attributes or values and relations between attributes
• Understandability of patterns
  - Make the discoveries more understandable to humans
  - Solutions: graphical representations, natural language generation, information visualization
• User interaction and prior knowledge
  - Encoding domain knowledge into learning systems
• Integration with other systems
  - Integration with spreadsheets, DBMS, visualization tools
Common Data Mining Methods
• Decision trees and rules (propositional logic power)
• Nonlinear regression methods (e.g., neural networks)
• Example-based methods (e.g., nearest-neighbor classification, regression algorithms, case-based reasoning (CBR))
• Genetic algorithms
• Inductive logic programming (first-order logic power)
• Probabilistic graphical dependency models (e.g., Bayesian networks)
Decision Tree - Example (I)
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Resulting tree:
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
Decision Tree Learning
Inductive rules
1. IF Outlook = Sunny AND Humidity = Normal, THEN PlayTennis
2. IF Outlook = Overcast, THEN PlayTennis
3. IF Outlook = Rain AND Wind = Weak, THEN PlayTennis
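
The three rules can be checked against the 14 days of the PlayTennis table from Example (I). In this sketch the rules become one function, and any day not covered by a rule is treated as "No", which is an assumption consistent with the leaves of the tree. The function and variable names are illustrative.

```python
# The 14 training days from the PlayTennis table: (day, outlook, temperature, humidity, wind, label).
DATA = [
    ("D1", "Sunny", "Hot", "High", "Weak", "No"),
    ("D2", "Sunny", "Hot", "High", "Strong", "No"),
    ("D3", "Overcast", "Hot", "High", "Weak", "Yes"),
    ("D4", "Rain", "Mild", "High", "Weak", "Yes"),
    ("D5", "Rain", "Cool", "Normal", "Weak", "Yes"),
    ("D6", "Rain", "Cool", "Normal", "Strong", "No"),
    ("D7", "Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("D8", "Sunny", "Mild", "High", "Weak", "No"),
    ("D9", "Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("D10", "Rain", "Mild", "Normal", "Weak", "Yes"),
    ("D11", "Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("D12", "Overcast", "Mild", "High", "Strong", "Yes"),
    ("D13", "Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("D14", "Rain", "Mild", "High", "Strong", "No"),
]

def play_tennis(outlook, humidity, wind):
    """Apply the three inductive rules; anything they do not cover defaults to "No"."""
    if outlook == "Sunny" and humidity == "Normal":
        return "Yes"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain" and wind == "Weak":
        return "Yes"
    return "No"

# Temperature is never tested by the rules, so it is ignored here.
misclassified = [
    day
    for day, outlook, _temperature, humidity, wind, label in DATA
    if play_tennis(outlook, humidity, wind) != label
]
print(f"{len(DATA) - len(misclassified)} of {len(DATA)} days classified correctly")
```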
Decision Tree - Example (II)
IF Time_band >= 2.5 years AND Time_employed >= 1.5 years, THEN accept is likely, while reject is only 3.2% likely.
A total of 63 cases fit this profile: 61 accepts and 2 rejects.
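Neural Networks - Supervised Learning
[Figure: supervised learning architecture. The input produces a predicted value, which is compared with the true value; the weights are adjusted according to the error]
[Figure: unsupervised learning architecture. Units compete on the input; the winning unit produces the output, and the weights from the winning unit to the input layer are adjusted]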
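Neural Networks - Example
[Figure: training phase and working phase. Training phase: applicant data (input) passes through the network to produce an approval decision (current output); the learning algorithm compares it with the desired approval decision (desired output) and adjusts the weights Wi. Working phase: applicant data (input) passes through the trained weights Wi to produce the approval decision (output)]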
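
The training phase (adjust the weights Wi from the gap between current and desired output) and the working phase (apply the trained weights to new input) can be sketched with a single-unit perceptron. The slides do not name the learning algorithm, so the perceptron update rule is an assumption, as are the two binary applicant features and the toy approval policy (approve only when both are 1).

```python
# Hypothetical applicants: (stable_income, good_credit_history) -> desired approval (1 = approve).
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(weights, bias, x):
    """Working phase: weighted sum of the inputs followed by a hard threshold."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

def train(samples, epochs=20, lr=1):
    """Training phase: nudge each weight in proportion to the output error (perceptron rule)."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for x, desired in samples:
            error = desired - predict(weights, bias, x)  # desired output minus current output
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

weights, bias = train(samples)
decisions = [predict(weights, bias, x) for x, _ in samples]
print(weights, bias, decisions)
```

A single unit like this can only learn linearly separable decisions; the multi-layer networks used in practice follow the same train-then-apply split.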
Application Example: Customer Rating and Classification (NeuroFuzzy)
Features of customer purchase cases
Feature (I)                             Data type   Content
Gender                                  Character   F: Female; M: Male
Marital status                          Character   Y: Married; N: Single; U: Unknown
Number of children                      Integer     Range: [1..8]
Age                                     Integer     Range: [1..70]
Occupation                              List        Range: [1..10]
Postal code                             Integer     Three-digit zip code
Savings rate                            Integer     Range: [1..27]
Purchase potential (predicted outcome)  Character   Y: Yes; N: No

Case profiles of customers with the highest (lowest) purchase potential
Customers with the highest purchase potential
Class  Gender  Married  Children  Age  Occupation  Postal code  Savings rate
A      Female  Y        1         40   3           540          27
B      Male    N        4         64   7           540          27
C      Male    Y        4         52   2           570          26
Customers with the lowest purchase potential
Class  Gender  Married  Children  Age  Occupation  Postal code  Savings rate
D      Female  Y        3         58   2           120          19
E      Male    N        4         60   2           120          19
F      Female  N        4         55   6           650          23
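Other Application Examples
• Safeway case
  - 8 million transaction records per week, 500 stores, and 6 million customers
  - Fierce market competition; traditional tactics (low prices, many locations, many product categories) are losing their effect
  - New competitive focus: understanding what customers need (which types of customers buy which products, and how often)
• Medical insurance: FAMS
  - Functions
    • Detection: uses fuzzy modeling and statistical techniques to analyze group behavior, and scores each medical service provider to reflect how far it deviates from the behavioral norm
    • Investigation: analyzes provider scores and detailed claims data
    • Settlement: produces reports and charts with detailed analysis of group behavior and claims; the reports can be used to negotiate, resolve problems, and report illegal activity
    • Prevention: supports provider monitoring and offers new tools to assess and educate providers, improving their behavior to prevent medical fraud and abuse and reduce insurers' losses
  - Characteristics
    • Uses "retrospective analysis" of case data, analyzing accounts and providers' medical work to find suspect providers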
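Case: Customer Selection for DM
• Input variables are combined into an overall score: earnings, home location, total real-estate value, age, number of children, marital status
• IT: genetic algorithms, decision trees, neural networks
Data mining modeling
OLAP Mining: online multidimensional analytical mining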
Integration of Data Mining and Data Warehousing
• Coupling of data mining systems, DBMS, and data warehouse systems
  - No coupling
  - Loose coupling
  - Semi-tight coupling
  - Tight coupling
• On-line analytical mining of data
  - Integration of mining and OLAP technologies
• Interactive mining of multi-level knowledge
  - Knowledge and patterns need to be mined at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
• Integration of multiple mining functions
  - Characterized classification; first clustering and then association
Multidimensional Analysis - OLAP
[Figure: sales viewed along multiple dimensions: customer, sales region, product market, time, product, financial indicators]
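
Rolling up and slicing a sales cube amount to grouping and filtering fact rows by dimension. The following is a minimal sketch over a tiny, made-up fact table; the dimension names, the figures, and the helper functions are assumptions for illustration, not an OLAP engine's API.

```python
from collections import defaultdict

# Made-up sales facts with three dimensions (region, product, quarter) and one measure (sales).
facts = [
    {"region": "North", "product": "A", "quarter": "Q1", "sales": 100},
    {"region": "North", "product": "B", "quarter": "Q1", "sales": 50},
    {"region": "South", "product": "A", "quarter": "Q1", "sales": 70},
    {"region": "North", "product": "A", "quarter": "Q2", "sales": 120},
    {"region": "South", "product": "B", "quarter": "Q2", "sales": 30},
]

def roll_up(rows, dims):
    """Aggregate sales over every dimension not listed in dims."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)

def slice_cube(rows, dim, value):
    """Fix one dimension to a single value and keep the matching rows."""
    return [row for row in rows if row[dim] == value]

by_region = roll_up(facts, ["region"])
q1_by_product = roll_up(slice_cube(facts, "quarter", "Q1"), ["product"])
print(by_region)
print(q1_by_product)
```

Drilling down is the reverse of rolling up: calling `roll_up` with more dimensions, such as `["region", "product"]`, returns the finer-grained totals.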
An OLAM Architecture
[Figure: four layers, top to bottom]
  Layer 4, user interface: user GUI API; mining queries go in, mining results come out
  Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, over the data cube API
  Layer 2, MDDB: multidimensional database with metadata, reached through the database API (filtering and integration)
  Layer 1, data repository: databases and data warehouse (data cleaning, data integration, filtering)
Major Issues in Data Mining (I)
• Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad-hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation: the interestingness problem
• Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, and incremental mining methods
Major Issues in Data Mining (II)
• Diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)
• Applications and social impact
  - Application of discovered knowledge
    • Domain-specific data mining tools, intelligent query answering, process control and decision making
  - Integration of discovered knowledge with existing knowledge: a knowledge fusion problem
  - Protection of data security, integrity, and privacy
Research Topics
• Applications
  - E-commerce / m-commerce
  - Customer relationship management
• Web mining and text mining (newsgroups, e-mail, documents)
• Biomedical/DNA data mining
• Cube exploration (data warehouse exploration) / trends
• Mining frequent and sequential patterns
• Anomaly mining
• On-line, real-time, stream data mining
  - Traffic engineering/management
Applications of KDD - Marketing
• Predicting the size of TV audiences
  - Using neural networks and rule induction
  - Examine factors related to audience size
• Analyzing supermarket sales data
  - Coverstory and Spotlight: produce reports, using natural language and graphics, on the most significant changes in a particular product's volume and share, broken down by region, product type, etc.
  - Opportunity Explorer
  - Management Discovery Tool
    • Summarization, trend analysis, change analysis, and measure and segment comparison
• Market basket analysis
  - Association rules
    • Associations between different products bought by the customer