Deng Cai (蔡登)
College of Computer Science Zhejiang University
[email protected]
Introduction to Data Mining
Introduction to Machine Learning
© Deng Cai, College of Computer Science, Zhejiang University
Short Bio
Dr. Deng Cai (蔡登)
[email protected], [email protected]
Professor, College of Computer Science (State Key Lab of CAD&CG).
Office: Room 508, Mengminwei Building, Zijingang Campus
Research interests:
Machine learning
Data mining
Computer vision
…
http://dengcai.zjulearning.org:8081/
Course Information
Web: http://dengcai.zjulearning.org:8081/Courses/DM/
Homework: http://assignment.zjulearning.org:8081/
Default username and password: your student ID. Change the password after logging in.
Time:
Monday, 14:05 – 15:35
Thursday, 14:05 – 15:35
Place: Room 504, 7th teaching building, Yuquan Campus
QQ group: 397340601 (DM_ZJU) (Apply with name and student ID)
TA: 张永辉, 胡津铭
Course information (Cont’d)
Prerequisite:
Linear algebra, analysis, probability theory
Basic programming skills
Course textbook: No textbook is required. (Papers and other materials are available at the class web page)
Objective:
A basic understanding of some important machine learning methods.
A basic ability to use machine learning techniques to solve real-world problems.
Reference Books
R. Duda, P. Hart & D. Stork, Pattern Classification (2nd ed.), Wiley, 2000
C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
T. Hastie, R. Tibshirani & J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.), Springer, 2009
K. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012
Reference Books
You can download all the books from the QQ group
Evaluation
Quizzes (15%)
Four assignments (10% each)
Each assignment must be completed individually
Final exam (45%)
Programming language:
Matlab
Tutorials:
– http://www.math.ufl.edu/help/matlab-tutorial/
– http://www.math.mtu.edu/~msgocken/intro/node1.html
Python
Course Policies
Class
No laptop, no cellphone.
Cheating
No.
Homework:
You have to write your own solution/program.
Late Policy:
Up to 24 hours late: 90% of the score
24 to 48 hours late: 50% of the score
More than 48 hours late: 25% of the score
Questions?
Why Take This Course?
It is NOT
Easy course with high scores
Recommendation letter for US school application
Rank 1st
You should
Work hard
Be honest
What is machine learning?
Machine learning is the study of computer systems that improve their performance through experience.
Learn existing and known structures and rules.
Discover new findings and structures.
Face recognition
News summarization
In machine learning, we study two types of problems.
The first kind of problems
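[Figure: identifying a person by name. Candidate labels: 刘德华 (Andy Lau), 章子怡 (Zhang Ziyi), 王俊凯 (Wang Junkai), …; answer: 章子怡 (Zhang Ziyi)]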
The first kind of problems
[Figure: pairs of images labeled "different people" or "same person"]
The first kind of problems
[Figure: images labeled with ages: 57, 30, 28, 18, 14, 33 years old, …]
The second kind of problems
Two kinds of problems
What are the differences?
Supervised learning vs. Unsupervised learning
Supervised learning
Goal: learn a mapping from inputs 𝒙 to outputs 𝑦
Training data: a labeled set of input‐output pairs
Classification (categorization, decision making, …)
𝑦 is a categorical variable
Regression
𝑦 is real‐valued
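The difference between the two supervised settings can be sketched in a few lines of Python. This is only an illustration: the data are invented, and the 1-nearest-neighbour rule is a stand-in learner chosen for brevity, not a method prescribed by the course.

```python
# Same learner, two kinds of output y. All data below are invented.

def nearest_neighbor_predict(train_pairs, x):
    """Return the output y of the training input closest to x."""
    closest_x, closest_y = min(train_pairs, key=lambda pair: abs(pair[0] - x))
    return closest_y

# Classification: y is a categorical variable.
classification_data = [(1.0, "cat"), (1.5, "cat"), (4.0, "dog"), (4.5, "dog")]
predicted_label = nearest_neighbor_predict(classification_data, 4.2)

# Regression: y is real-valued.
regression_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]
predicted_value = nearest_neighbor_predict(regression_data, 2.9)

print(predicted_label)  # a category
print(predicted_value)  # a real number
```

In both cases the training data are labeled input-output pairs; only the type of 𝑦 changes.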
Unsupervised learning
We are only given inputs
Goal: find “interesting patterns”
Much less well‐defined problem
Discovering clusters: clustering
Discovering latent factors: dimensionality reduction, matrix factorization, topic modeling
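As one concrete instance of discovering clusters, here is a minimal k-means sketch on one-dimensional points. The data, the initial centers, and the function name kmeans_1d are assumptions made for this illustration; k-means is only one of many clustering methods.

```python
# k-means on 1-D points (invented data). No labels are used anywhere.

def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning each point to its nearest center and
    moving each center to the mean of the points assigned to it."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster received no points.
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])
print(centers)
print(clusters)
```

The algorithm is given only inputs; the two groups it returns are the "interesting pattern" it found.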
Reinforcement learning
It is sometimes described as learning with a critic, a weaker setting than fully supervised learning.
No desired category signal is given.
The only teaching feedback is whether the tentative category is right or wrong.
This is useful for learning how to act or behave when
given occasional reward or punishment signals.
Focus of This Course
What are the typical machine learning problems?
Supervised Learning
Classification (decision making)
Regression
Unsupervised Learning
Cluster analysis
Latent factor analysis
What are the basic machine learning tools (methods, algorithms)?
Matlab/Python programming
Basic Concepts of Supervised Learning
Sample, example, pattern
Features, predictors, independent variables
𝒙₁, 𝒙₂, ⋯, 𝒙ₙ
State of nature, labels, pattern class, class, responses, dependent variables
𝜔₁, 𝜔₂, ⋯, 𝜔ₙ or 𝑦₁, 𝑦₂, ⋯, 𝑦ₙ or 𝑧₁, 𝑧₂, ⋯, 𝑧ₙ
Training data
(𝒙₁, 𝜔₁), (𝒙₂, 𝜔₂), ⋯, (𝒙ₙ, 𝜔ₙ)
Model, statistical model, pattern class model, classifier
𝑓
Test data
Training error & test error
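The vocabulary above can be tied together with a small sketch. The data are invented, and the nearest-class-mean classifier is just one simple choice for the model 𝑓; the function names are hypothetical.

```python
# Samples x, labels, a model f, and its training and test error.

def fit_class_means(training_data):
    """Model f: remember the mean feature value of each class."""
    sums, counts = {}, {}
    for x, label in training_data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(class_means, x):
    """Assign x to the class whose mean is closest."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))

def error_rate(class_means, data):
    """Fraction of samples whose predicted label differs from the true one."""
    wrong = sum(1 for x, label in data if predict(class_means, x) != label)
    return wrong / len(data)

# Invented (feature, label) pairs.
training_data = [(1.0, "A"), (2.0, "A"), (3.0, "A"),
                 (6.0, "B"), (7.0, "B"), (8.0, "B")]
test_data = [(2.5, "A"), (5.0, "A"), (4.0, "B"), (7.5, "B")]

f = fit_class_means(training_data)
training_error = error_rate(f, training_data)
test_error = error_rate(f, test_data)
print(training_error, test_error)
```

On this toy data the model makes no mistakes on the training set but misclassifies half of the test set, which is why the two errors are reported separately.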
Supervised Learning
Learn from experience (training data) and build a model to predict the future.
[Figure: supervised learning pipeline. Training phase: collect training samples; Step 1: define features (representation); Step 2: design & train model (learning). Test phase: make prediction.]
Supervised Learning
[Figure: Step 1: define features; Step 2: design & train model]
Which step is more important in building a successful system?
Which one is the focus of this course?
Why is general classification hard?
Intra‐class variability
The letter “T” in different typefaces
Same face under different expression, pose, and illumination
[Figure annotation: Step 1 (define features) is not good enough]
Why is general classification hard?
Inter‐class similarity
[Figure annotation: Step 1 (define features) is not good enough]
Semantic Gap
Looks similar but semantically different
Looks different but semantically the same
Representation: Features
Extract features to represent the samples
Feature vector
Good representation:
Low intra‐class variability
Low inter‐class similarity
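One way to put a number on these two criteria is a Fisher-style ratio for a single scalar feature: the squared distance between class means divided by the summed within-class variances. The sketch below is an assumption-laden illustration; the feature values are invented and the function name separability is hypothetical.

```python
from statistics import mean, pvariance

def separability(class_a, class_b):
    """Squared distance between class means (large when inter-class
    similarity is low) divided by the summed within-class variances
    (small when intra-class variability is low)."""
    between = (mean(class_a) - mean(class_b)) ** 2
    within = pvariance(class_a) + pvariance(class_b)
    return between / within

# Invented values of two candidate features for the same two classes.
feature_1 = {"class_a": [4.0, 5.0, 6.0], "class_b": [5.0, 6.0, 7.0]}  # heavy overlap
feature_2 = {"class_a": [1.0, 1.5, 2.0], "class_b": [8.0, 8.5, 9.0]}  # well separated

score_1 = separability(feature_1["class_a"], feature_1["class_b"])
score_2 = separability(feature_2["class_a"], feature_2["class_b"])
print(score_1, score_2)
```

A higher score suggests a better representation in the sense of the two criteria above; here the second feature scores far higher than the first.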
Fish Classification: Salmon vs. Sea Bass
Preprocessing involves image enhancement and segmentation:
(i) separate touching or occluding fish, and
(ii) extract the fish contour
Representation: Fish Length As Feature
How to design a classifier?
Representation: Fish Length As Feature
Training (design or learning) Samples
Probability Densities
Fish Lightness As Feature
The overlap of these histograms is small compared with that of the length feature
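A single-feature classifier of this kind can be sketched as a threshold rule: pick the threshold that makes the fewest mistakes on the training samples. The measurements below are invented for illustration and are not the data behind the histograms; the function name best_threshold is hypothetical.

```python
def best_threshold(values_a, values_b):
    """Try each observed value as a threshold t for the rule
    'predict class a if value < t, otherwise class b' and return the
    threshold with the fewest training mistakes."""
    best_t, best_errors = None, None
    for t in sorted(values_a + values_b):
        errors = sum(v >= t for v in values_a) + sum(v < t for v in values_b)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

# Invented measurements in arbitrary units.
salmon_length = [5.0, 6.0, 7.0, 8.0, 9.0]
sea_bass_length = [6.5, 7.5, 8.5, 9.5, 10.5]
salmon_lightness = [1.0, 1.5, 2.0, 2.5, 4.2]
sea_bass_lightness = [4.0, 5.0, 5.5, 6.0, 6.5]

length_t, length_errors = best_threshold(salmon_length, sea_bass_length)
light_t, light_errors = best_threshold(salmon_lightness, sea_bass_lightness)
print(length_t, length_errors)
print(light_t, light_errors)
```

With these made-up numbers, the best length threshold still misclassifies three fish while the best lightness threshold misclassifies one, mirroring the smaller histogram overlap.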
Two‐dimensional Feature Space
Two features together are better than individual features
Linear (simple) decision boundary
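A linear decision boundary in a two-dimensional feature space can be learned, for example, with the perceptron rule. This is a sketch under stated assumptions: the (length, lightness) values are invented so that each feature alone overlaps between the classes while the pair is linearly separable, and the perceptron is only one of several ways to fit a linear boundary.

```python
# Invented (length, lightness) pairs; -1 = salmon, +1 = sea bass.
salmon = [(2.0, 1.0), (3.0, 2.0), (4.0, 3.0), (5.0, 4.0)]
sea_bass = [(2.0, 3.0), (3.0, 4.0), (4.0, 5.0), (5.0, 6.0)]
samples = salmon + sea_bass
labels = [-1] * len(salmon) + [1] * len(sea_bass)

def train_perceptron(samples, labels, max_epochs=1000):
    """Perceptron rule: on each mistake, move the weights toward the
    misclassified sample. Stops once a full pass makes no mistakes."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(samples, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def classify(w, b, x):
    """Which side of the line w[0]*x1 + w[1]*x2 + b = 0 does x fall on?"""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

w, b = train_perceptron(samples, labels)
predictions = [classify(w, b, x) for x in samples]
training_errors = sum(p != y for p, y in zip(predictions, labels))
print(w, b, training_errors)
```

Both lengths range over the same values here, so length alone cannot separate the classes, yet a straight line in the joint feature space does.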
Complex Decision Boundary
Generalization
A generalization of a concept is an extension of the concept to less‐specific criteria.
Generalization of the classifier (model)
The performance of the classifier on test data.
Training error:
Simple model: large training error
Complex model: smaller training error
Test error:
Simple model: ?
Complex model: ?
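The question marks can be explored on a contrived example. The sketch below compares a simple model (nearest class mean) with a more complex one (1-nearest neighbour, which memorizes every training sample) on invented data containing one atypical training point. It shows one possible outcome only; which model has the lower test error depends on the data and the models.

```python
def predict_class_mean(training_data, x):
    """Simple model: assign x to the class with the nearest mean."""
    sums, counts = {}, {}
    for xi, label in training_data:
        sums[label] = sums.get(label, 0.0) + xi
        counts[label] = counts.get(label, 0) + 1
    return min(sums, key=lambda label: abs(x - sums[label] / counts[label]))

def predict_1nn(training_data, x):
    """Complex model: copy the label of the nearest training sample."""
    nearest_x, nearest_label = min(training_data, key=lambda pair: abs(pair[0] - x))
    return nearest_label

def error_rate(predict, training_data, data):
    wrong = sum(1 for x, label in data if predict(training_data, x) != label)
    return wrong / len(data)

# Invented data; (6.5, "A") is an atypical sample inside the B region.
training_data = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (6.5, "A"),
                 (5.0, "B"), (6.0, "B"), (7.0, "B"), (8.0, "B")]
test_data = [(1.5, "A"), (2.5, "A"), (3.5, "A"),
             (6.4, "B"), (6.6, "B"), (7.5, "B")]

simple_train = error_rate(predict_class_mean, training_data, training_data)
simple_test = error_rate(predict_class_mean, training_data, test_data)
complex_train = error_rate(predict_1nn, training_data, training_data)
complex_test = error_rate(predict_1nn, training_data, test_data)
print(simple_train, simple_test)
print(complex_train, complex_test)
```

Here the complex model reaches zero training error by memorizing the atypical point, and then repeats that point's label on nearby test samples; the simple model makes one training mistake but none on this test set.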
Prerequisite Knowledge
Probability:
Bayes theorem
Analysis:
Gradient descent
Linear Algebra
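As a quick refresher on two of the prerequisites listed above, the sketch below applies Bayes theorem to invented class probabilities and runs plain gradient descent on the one-dimensional function (w − 3)². All numbers, the learning rate, and the function names are assumptions chosen for illustration.

```python
def posterior(prior, likelihood):
    """Bayes theorem: P(class | x) = P(x | class) P(class) / P(x),
    where P(x) is the sum of P(x | class) P(class) over all classes."""
    joint = {c: likelihood[c] * prior[c] for c in prior}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the function value."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

# Invented numbers: class priors and the probability of observing
# some feature value x under each class.
prior = {"salmon": 0.4, "sea bass": 0.6}
likelihood = {"salmon": 0.7, "sea bass": 0.2}
post = posterior(prior, likelihood)
print(post)

# Minimize (w - 3)^2, whose gradient is 2 (w - 3); the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), start=0.0)
print(w_min)
```

Although sea bass is the more likely class a priori, the observation shifts the posterior toward salmon; the gradient descent iterate approaches the minimizer w = 3.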