
Teaching Machine Learning: Foundations, Techniques and Project


(1)

Teaching Machine Learning:

Foundations, Techniques and Project

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Appier/National Taiwan University

September 7, 2018

some parts based on Lin, Magdon-Ismail, and Abu-Mostafa. Teaching machine learning to a diverse audience: the foundation-based approach.

Teaching Machine Learning Workshop @ ICML ’12.

(2)

About Me

Hsuan-Tien Lin

• Chief Data Scientist, Appier

• Professor, Dept. of CSIE, National Taiwan University

• Co-author of textbook “Learning from Data: A Short Course”

• Instructor of the NTU-Coursera Mandarin-teaching ML Massive Open Online Courses

• “Machine Learning Foundations”:

www.coursera.org/course/ntumlone

• “Machine Learning Techniques”:

www.coursera.org/course/ntumltwo

(3)

Diversity in ML classes

NTU ML 2011 Fall (77 students)

• background diversity
• “maturity” diversity

• junior: 8

• senior: 20

• master: 44

• phd: 5

• similarly diverse in RPI and in Caltech (online course)¹

challenge: serving CS students while accommodating the needs of a diverse non-CS audience

mindset of the audience?

¹ http://work.caltech.edu/telecourse

(4)

Observed Mindsets of the Diverse Audience

• highly motivated to learn—not satisfied with only shallow comic-book stories

• often with minimum but non-empty math/programming background—capable of downloading and trying the latest packages

words of a student from industry (Caltech online course 2012)

(5)

Our Proposed Teaching Approach

• foundation-based, and foundation-first
• then, complement the foundation with a couple of useful algorithms/techniques

comparison to techniques-based

• techniques-based: hops through the forest of many latest and greatest techniques
• foundation-based: illustrates the map (core) first to prevent getting lost in the forest
• foundation-based: prepares students for easy learning of untaught/future techniques

(6)

Our Proposed Teaching Approach [Cont.]

• foundation-based, and foundation-first
• then, complement the foundation with a couple of useful algorithms/techniques

comparison to foundation-later

• foundation-later:
• first, techniques to raise interest
• then, foundations to consolidate understanding

foundation-first: builds the basis (core) first to perceive the techniques from the right angle
foundation-first: lets students know when and how to use the powerful tools before getting ...

(7)

Our Proposed Foundation: Three Concepts

understand learnability, approximation and generalization
• when can we learn and what are the tradeoffs?
• conducting machine learning properly

use simple models first
• the linear model coupled with some nonlinear transforms is typically enough for most applications
• conducting machine learning safely

deal with noise and overfitting carefully
• how to tackle the “dark side” of learning?
• conducting machine learning professionally

our experience: worth starting with those foundations, even for a diverse audience

(8)

learnability, approximation & generalization

—conducting machine learning properly

good learning (test performance) = good approximation (training performance) + good generalization (complexity penalty)

a must-teach key message
• can be illustrated in different forms (e.g. VC bound, bias-variance, even human-learning philosophy)
• make learning non-trivial and fascinating to students
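One common way to make the message concrete is the VC bound, roughly as in the Learning from Data text (the constants below are one of several equivalent forms): with probability at least $1-\delta$,

$$
E_{\text{out}}(g) \;\le\; \underbrace{E_{\text{in}}(g)}_{\text{training performance}} \;+\; \underbrace{\sqrt{\frac{8}{N}\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}}}_{\text{complexity penalty}}
$$

where $E_{\text{in}}$ and $E_{\text{out}}$ are the training and test errors, $N$ is the data size, and $m_{\mathcal{H}}$ is the growth function of the hypothesis set.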

(9)

learnability, approximation & generalization

—conducting machine learning properly [Cont.]

wrong use of learning (beginner’s mistakes)
ensure good approximation, pray for good generalization
—praying for something out-of-control

right use of learning
ensure good generalization, try best for good approximation
—trying something possibly in-control

We cannot guarantee learning. We can “guarantee” no disasters. That is, after we learn we will either declare success or failure, and in both cases we will be right.

(10)

linear models

—conducting machine learning safely

linear models = good generalization with established optimization tools for good approximation

• after knowing approximation/generalization: a good stage for learning safe techniques
• sufficiently useful for many practical problems (Yuan et al., 2012)
• building block in sophisticated techniques through feature transforms
• make learning concrete to students
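A minimal sketch of the building-block idea, with notation assumed here rather than taken from the slides: a fixed nonlinear transform $\Phi$ feeds a plain linear model, e.g.

$$
h(\mathbf{x}) = \operatorname{sign}\!\big(\mathbf{w}^{\top}\Phi(\mathbf{x})\big), \qquad \Phi(x_1, x_2) = \big(1,\; x_1,\; x_2,\; x_1 x_2,\; x_1^{2},\; x_2^{2}\big),
$$

so the optimization stays linear in $\mathbf{w}$ while the decision boundary becomes quadratic in $\mathbf{x}$.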

(11)

linear models

—conducting machine learning safely [Cont.]

wrong use of learning (beginner’s mistakes)
start with the “greatest” techniques first—a point of no return

right use of learning
start with the simplest techniques first—and yes, it can work well

a rich and representative family of linear techniques
• classification: approx. combinatorial optimization (perceptron-like)
• regression: analytic optimization (pseudo-inverse)
• logistic regression: iterative optimization (SGD)

Students coming from diverse backgrounds not only get the big picture, but also the finer details in a concrete setting.
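A minimal NumPy sketch of the three linear approaches listed above; the toy data, iteration counts, and step size are assumptions for illustration, not the course's actual assignments:

import numpy as np

# toy binary-classification data, assumed for illustration: x in R^2, y in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
Z = np.hstack([np.ones((200, 1)), X])            # add the constant coordinate x0 = 1

# 1) classification: perceptron-like updates (approx. combinatorial optimization)
w_pla = np.zeros(3)
for _ in range(1000):
    mistakes = np.flatnonzero(np.sign(Z @ w_pla) != y)
    if mistakes.size == 0:
        break
    i = rng.choice(mistakes)
    w_pla += y[i] * Z[i]                         # correct one mistake at a time

# 2) regression: analytic optimization via the pseudo-inverse, w = Z^+ y
w_lin = np.linalg.pinv(Z) @ y

# 3) logistic regression: iterative optimization with stochastic gradient descent
w_log, eta = np.zeros(3), 0.1
for _ in range(2000):
    i = rng.integers(len(y))
    grad = -y[i] * Z[i] / (1.0 + np.exp(y[i] * (Z[i] @ w_log)))   # gradient of ln(1 + e^{-y w.x})
    w_log -= eta * grad

for name, w in [("perceptron", w_pla), ("pseudo-inverse", w_lin), ("logistic SGD", w_log)]:
    print(f"{name:>15s}: training accuracy = {np.mean(np.sign(Z @ w) == y):.3f}")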

(12)

deal with noise and overfitting

—conducting machine learning professionally

• overfit = difficult to ensure good generalization/learning with stochastic or deterministic noise on finite data
• regularization = tools for further guaranteeing good generalization
• validation = tools for certifying good learning
• overfit(data size, noise level): less data and more noise ⇒ more overfitting
• turn amateur students into professionals
• make learning artistic to students
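One standard way to write the regularization tool mentioned above (a ridge-style weight-decay penalty; the specific form is an assumption for illustration):

$$
\min_{\mathbf{w}} \;\; \frac{1}{N}\sum_{n=1}^{N}\big(\mathbf{w}^{\top}\mathbf{x}_n - y_n\big)^{2} \;+\; \frac{\lambda}{N}\,\mathbf{w}^{\top}\mathbf{w},
$$

where a larger $\lambda$ trades some approximation for better generalization; validation (next slide) is then used to pick $\lambda$.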

(13)

deal with noise and overfitting

—conducting machine learning professionally [Cont.]

wrong use of learning (beginner’s mistakes)
apply all possible techniques and choose by best approximation result
—high risk of overfitting

right use of learning
apply a reasonable number of well-regularized techniques and choose by best validation result
—relatively immune to noise and overfitting

Complex situations call for simpler models.
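A minimal sketch of "choose by best validation result" in NumPy: pick the ridge penalty lambda on a held-out split rather than by training error. The data, split sizes, and lambda grid are assumptions for illustration.

import numpy as np

# hypothetical regression data and a held-out validation split
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.3 * rng.normal(size=100)
X_tr, y_tr, X_val, y_val = X[:70], y[:70], X[70:], y[70:]

def ridge_fit(X, y, lam):
    # regularized linear regression: w = (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

best_lam, best_err = None, np.inf
for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    w = ridge_fit(X_tr, y_tr, lam)
    val_err = np.mean((X_val @ w - y_val) ** 2)   # judged on validation, not training, error
    if val_err < best_err:
        best_lam, best_err = lam, val_err

print(f"chosen lambda = {best_lam}, validation MSE = {best_err:.4f}")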

(14)

Teaching/Learning Life After the Foundations:

Techniques

Support Vector Machine
• generalization: large-margin bound
• approximation: quadratic programming
• linear model: basic formulation
• feature transform: through kernel
• regularization: large-margin
• validation: #-SV bound

Neural Network
• generalization: #-neuron bound
• approximation: gradient descent et al.
• linear model: neurons
• feature transform: through cascading
• regularization: weight-decay or early-stopping
• validation: for choices in regularization

[libsvm-2.9]$ ./svm-train -t 2 -g 0.05 -c 100 heart_scale
optimization finished, #iter = 1966
Total nSV = 113

good approximation (by choosing kernel and optimization)
good generalization (by regularization)
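A rough Python equivalent of the libsvm call above, as an assumption rather than the course's actual setup: scikit-learn's SVC stands in for ./svm-train, and generated toy data stands in for heart_scale.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# -t 2 -> RBF kernel, -g 0.05 -> gamma, -c 100 -> soft-margin C
X, y = make_classification(n_samples=270, n_features=13, random_state=0)
clf = SVC(kernel="rbf", gamma=0.05, C=100.0)
clf.fit(X, y)
print("total nSV:", int(clf.n_support_.sum()))    # rough analogue of libsvm's "Total nSV" line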

(15)

Teaching/Learning Life After the Foundations [Cont.]

• Caltech 2012: (mixed) 7 weeks of foundations, 0.5 week of NNet, 0.5 week of RBF Net, 1 week of SVM
• NTU ML (with MOOCs): (sequential) 8 weeks of foundations, 3 weeks of SVM, 3 weeks of aggregation, 2 weeks of deep learning
—with an in-class data mining competition where students exploited taught/not-taught techniques with ease

often incremental efforts to teach/learn a new technique after solid foundations

(16)

Mini Summary

foundation-based, foundation-first
—works well in our experience

• learnability: philosophical understanding, make learning non-trivial, conduct learning properly
• linear models: algorithmic modeling, make learning concrete, conduct learning safely
• overfitting: practical tuning, make learning artistic, conduct learning professionally

(17)

Excitement of Competition

“How Stanford Teaches Innovation” (史丹佛這樣教創新)
http://www.cw.com.tw/article/article.action?id=5059685

“Sixth, encourage student competitions. Nothing makes people forget to eat and sleep, and work around the clock without tiring, quite like ‘competition’ does. We encourage students to take part in all kinds of international competitions: our students have built a solar house, made electric cars and robots, entered the DARPA (Defense Advanced Research Projects Agency) Challenge, and competed in business-plan competitions.”

(18)

Machine Learning Competition: Mini-KDD Cup

Background

• an annual competition on KDD (knowledge discovery and data mining)

• organized by ACM SIGKDD, starting from 1997; now the most prestigious data mining competition

• usually lasts 3-4 months

• participants include famous research labs (IBM, AT&T) and top universities (Stanford, Berkeley)

(19)

My Design: Time Line

key dates:
• report due (i.e. overall competition end): as late as possible—often 4 days before I need to submit the scores to NTU
• award ceremony (i.e. early competition end): usually the last class
• announcement: best timing is right after midterm—but may highly depend on the TAs’ schedule
• start designing: two or more weeks before the announcement

(20)

My Design: Story/Topic

an interesting story makes the competition exciting!

• ML2014:
In this final project, you are going to be part of an exciting machine learning competition. Consider a startup company that features a coming product on the mobile phone. The core of the product is a robust character recognition system... To win the prize, you need to fight for the leading positions on the score board. Then, you need to submit a comprehensive report that describes not only the recommended approaches, but also the reasoning behind your recommendations. Well, let’s get started!

• more interesting ones:
• ML2014, ML2013: optical character recognition
• ML2012: ad click prediction (derived from KDD Cup 2012)
—often okay to reuse with modifications

(21)

My Design: Team Size

• most ideal team size IMHO is 3: collaborative, dispute resolution, fewer free riders, etc.
• but can also allow 4 if the class size is too big for the TAs to grade
• usually allow ≤ 3:
• so students do not have the burden to find exactly 3
• students can flexibly break teams if needed
• but evaluate with workloads of 3 for fairness
• still sometimes hard for some students to find team members:
• motto: provide a matching mechanism, but do not force anyone into any team
• prevent free riders: need workload distribution in report

(22)

My Design: Scoreboard

• core place that makes the game exciting
—thanks to my TAs in all those years for creating and maintaining the service
• basically, a simple submit-judge-scoreboard system
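A hypothetical sketch of such a submit-judge-scoreboard loop; the data, metric, and names below are assumptions for illustration, not the TAs' actual system.

import numpy as np

rng = np.random.default_rng(0)
hidden_labels = rng.integers(0, 2, size=1000)      # judge-side labels, hidden from teams
leaderboard = {}                                    # team name -> best score so far

def judge(team, predictions):
    # score a submission against the hidden labels and keep each team's best result
    score = float(np.mean(predictions == hidden_labels))
    leaderboard[team] = max(score, leaderboard.get(team, 0.0))
    return score

def show_scoreboard():
    # print teams best-first, as on the public board
    for rank, (team, score) in enumerate(sorted(leaderboard.items(), key=lambda kv: -kv[1]), start=1):
        print(f"{rank:2d}. {team:<12s} {score:.4f}")

judge("team_a", rng.integers(0, 2, size=1000))      # two example submissions
judge("team_b", rng.integers(0, 2, size=1000))
show_scoreboard()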

(23)

My Design: Award Ceremony

• purpose: to add more fun
• light presents (postcards, paper notebooks, etc.)
• some students list their good-performing awards in their resumes
• may serve some educational purposes
• in addition to good-performing awards, can also give interesting awards

(24)

ML2012: How Much Overfitting Can We Get?

9472 submissions from 52 teams within 1.5 months...

(25)

Award 4: Happy 2013 Award

team            scoreboard   hidden   algorithm   time
Minimaximizer   0.7632       0.7407   rwa         2013/01/01 00:00:08

(26)

Award 7-8: Hard Working Awards

team       submission count
A          1097
anything   1149

(27)

My Design: Grade

• generally based on report, not competition, but correlated
• too much emphasis on competition ⇒ utilitarianism
• too little emphasis on competition ⇒ less interesting game
• ask TAs to act as “bosses”: the grading TAs grade qualitatively with letters: A++[210], A+[196], A[186], B+[176], B[166], C+[156], C[146], D+[136], D[126], F+[116], F[76], F-[36], Z[0]
• list basic requirements corresponding to B
• to get B, students only need to work ≈ the usual homework load
• to get more, need more to convince the TAs
• generally “loose” about basic requirements
—most students perform way beyond the basic requirements anyway
• generally team grade, but adjust individual grade if workload unbalanced

(28)

My Design: Loading

• ideal: a bit harder than homework
• estimate: 60 to 90 man-hours to finish basic requirements (30 man-hours per member)
• sometimes need to adjust loading of other homeworks
—not an easy task, though

(29)

My Design: TAs

• good TAs’ help is essential—I cannot thank them enough!
• design, system setup, discussions with students

(30)

My Design: TAs

always note: TAs are busy!!

(31)

My Design: Instructor

my main job: heat up the competition

(32)

My Design: Instructor

my two other jobs:
• participate seriously in the design
• maintain fairness of the competition

(33)

Some Summary Thoughts

Positive Side
• fun for most students, TAs and instructor
• students, TAs and instructor learn a lot

Negative Side
• exhausting for most students, TAs and instructor
• can be disappointing for some students

Questions and Discussions?
