Teaching Machine Learning:
Foundations, Techniques and Project
Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Appier/National Taiwan University
September 7, 2018
some parts based on Lin, Magdon-Ismail, and Abu-Mostafa. Teaching machine learning to a diverse audience: the foundation-based approach.
Teaching Machine Learning Workshop @ ICML ’12.
About Me
Hsuan-Tien Lin
• Chief Data Scientist, Appier
• Professor, Dept. of CSIE, National Taiwan University
• Co-author of textbook “Learning from Data: A Short Course”
• Instructor of the NTU-Coursera Mandarin-teaching ML Massive Open Online Courses
• “Machine Learning Foundations”:
www.coursera.org/course/ntumlone
• “Machine Learning Techniques”:
www.coursera.org/course/ntumltwo
Diversity in ML classes
NTU ML 2011 Fall (77 students)
• background diversity
• “maturity” diversity:
  • junior: 8
  • senior: 20
  • master: 44
  • phd: 5
• similarly diverse in RPI and in Caltech (online course)1
• challenge:
serving CS students while accommodating the needs of
diverse non-CS audience
mindset of the audience?
1http://work.caltech.edu/telecourse
Observed Mindsets of the Diverse Audience
• highly motivated to learn—not satisfied with only shallow comic-book stories
• often with minimal but non-empty math/programming background—capable of downloading and trying the latest packages
—words of a student from industry (Caltech online course 2012)
Our Proposed Teaching Approach
• foundation-based, and foundation-first
• then, complement the foundation with a couple of useful algorithms/techniques

comparison to techniques-based
• techniques-based: hops through the forest of many latest-and-greatest techniques
• foundation-based: illustrates the map (core) first to prevent getting lost in the forest

foundation-based: prepares students for easy learning of untaught/future techniques
Our Proposed Teaching Approach [Cont.]
• foundation-based, and foundation-first
• then, complement the foundation with a couple of useful algorithms/techniques

comparison to foundation-later
• foundation-later:
  • first, techniques to raise interest
  • then, foundations to consolidate understanding
• foundation-first: builds the basis (core) first so students perceive the techniques from the right angle

foundation-first: lets students know when and how to use the powerful tools before getting them

Our Proposed Foundation: Three Concepts
understand learnability, approximation and generalization
• when can we learn, and what are the tradeoffs?
—conducting machine learning properly

use simple models first
• the linear model coupled with some nonlinear transforms is typically enough for most applications
—conducting machine learning safely

deal with noise and overfitting carefully
• how to tackle the “dark side” of learning?
—conducting machine learning professionally

our experience: worth starting with those foundations, even for a diverse audience
learnability, approximation & generalization
—conducting machine learning properly
good learning (test performance)
= good approximation (training performance)
+ good generalization (complexity penalty)
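The slogan above can be made precise; one common form (using the VC-bound notation of Learning from Data, with growth function m_H, as a sketch of how the decomposition is usually stated) is:

```latex
% with probability at least 1 - \delta over the choice of N training examples
E_{\text{out}}(g) \;\le\;
  \underbrace{E_{\text{in}}(g)}_{\text{approximation (training)}}
  \;+\;
  \underbrace{\sqrt{\frac{8}{N}\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}}}_{\text{generalization (complexity penalty)}}
```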
• a must-teach key message
• can be illustrated in different forms (e.g. VC bound, bias-variance, even human-learning philosophy)
• makes learning non-trivial and fascinating to students

learnability, approximation & generalization
—conducting machine learning properly [Cont.]
wrong use of learning (beginner’s mistakes)
ensure good approximation, pray for good generalization
—praying for something out-of-control
right use of learning
ensure good generalization, try best for good approximation
—trying something possibly in-control
We cannot guarantee learning. We can “guarantee” no disasters. That is, after we learn, we will either declare success or failure, and in both cases we will be right.

linear models
—conducting machine learning safely
linear models
= good generalization with established optimization tools for good approximation

• after knowing approximation/generalization: a good stage for learning safe techniques
• sufficiently useful for many practical problems (Yuan et al., 2012)
• building block in sophisticated techniques through feature transforms
• makes learning concrete to students

linear models
—conducting machine learning safely [Cont.]
wrong use of learning (beginner’s mistakes)
start with the “greatest” techniques first
—a point of no return
right use of learning
start with the simplest techniques first
—and yes, it can work well
a rich and representative family of linear techniques
• classification: approx. combinatorial optimization (perceptron-like)
• regression: analytic optimization (pseudo-inverse)
• logistic regression: iterative optimization (SGD)
Students coming from diverse backgrounds not only get the big picture, but also the finer details in a concrete setting.
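The three linear techniques listed above can be sketched in a few lines each. This is an illustrative NumPy-only sketch (the toy data and all names are my own, not from the slides):

```python
# Illustrative sketch of the three linear techniques: perceptron-like
# classification, pseudo-inverse regression, and logistic regression by SGD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))           # toy 2-D inputs
w_true = np.array([1.5, -2.0])
X = X[np.abs(X @ w_true) > 0.5]         # keep a margin so the PLA converges
y = np.sign(X @ w_true)                 # labels in {-1, +1}

def perceptron(X, y, max_updates=5000):
    """Classification: correct one misclassified point at a time (PLA)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:
            break
        i = wrong[0]
        w += y[i] * X[i]                # PLA correction step
    return w

def linear_regression(X, y):
    """Regression: analytic solution via the pseudo-inverse."""
    return np.linalg.pinv(X) @ y

def logistic_regression_sgd(X, y, eta=0.1, epochs=50):
    """Logistic regression: iterative optimization with SGD."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            # gradient of ln(1 + exp(-y_i w.x_i)) with respect to w
            w -= eta * (-y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w))))
    return w

for fit in (perceptron, linear_regression, logistic_regression_sgd):
    w = fit(X, y)
    acc = np.mean(np.sign(X @ w) == y)
    print(f"{fit.__name__}: training accuracy = {acc:.2f}")
```

All three share the same linear hypothesis and differ only in the optimization tool, which is exactly the “rich and representative family” point above.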
deal with noise and overfitting
—conducting machine learning professionally
• overfit = difficult to ensure good generalization/learning with stochastic or deterministic noise on finite data
• regularization = tools for further guaranteeing good generalization
• validation = tools for certifying good learning
• overfitting: a function of data size and noise level
• turns amateur students into professionals
• makes learning artistic to students

deal with noise and overfitting
—conducting machine learning professionally [Cont.]
wrong use of learning (beginner’s mistakes)
apply all possible techniques and choose by best approximation result
—high risk of overfitting
right use of learning
apply a reasonable number of well-regularized techniques and choose by best validation result
—relatively immune to noise and overfitting
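The “choose by best validation result” recipe can be sketched as follows (my own toy setup, not from the slides): a few ridge-regularized polynomial models are fit on training data and compared on a held-out validation set rather than by training error.

```python
# Sketch of validation-based model selection over well-regularized models.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=60)   # noisy target

def poly_features(x, degree):
    # columns 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, y, lam):
    # regularized least squares: (Phi'Phi + lam I)^-1 Phi' y
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

x_tr, y_tr = x[:40], y[:40]        # training split
x_val, y_val = x[40:], y[40:]      # validation split

best = None
for degree in (1, 3, 10):
    for lam in (0.0, 0.01, 1.0):
        w = ridge_fit(poly_features(x_tr, degree), y_tr, lam)
        e_val = np.mean((poly_features(x_val, degree) @ w - y_val) ** 2)
        if best is None or e_val < best[0]:
            best = (e_val, degree, lam)

print("chosen by validation: degree=%d, lambda=%g" % (best[1], best[2]))
```

Choosing by validation error instead of training error is what makes the procedure relatively immune to noise and overfitting.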
Complex situations call for simpler models.

Teaching/Learning Life After the Foundations: Techniques
Support Vector Machine
• generalization: large-margin bound
• approximation: quadratic programming
• linear model: basic formulation
• feature transform: through kernel
• regularization: large-margin
• validation: #-SV bound

Neural Network
• generalization: #-neuron bound
• approximation: gradient descent et al.
• linear model: neurons
• feature transform: through cascading
• regularization: weight-decay or early-stopping
• validation: for choices in regularization
[libsvm-2.9]$ ./svm-train -t 2 -g 0.05 -c 100 heart_scale
optimization finished, #iter = 1966
Total nSV = 113
• good approximation (by choosing kernel and optimization)
• good generalization (by regularization)
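The libsvm command above can also be mirrored with scikit-learn's SVC (which wraps libsvm); this is a hedged sketch with synthetic data standing in for the heart_scale file:

```python
# Sketch of an RBF-kernel SVM with the same parameters as the libsvm call.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # nonlinear boundary

# -t 2 -> RBF kernel, -g 0.05 -> gamma, -c 100 -> soft-margin parameter C
clf = SVC(kernel="rbf", gamma=0.05, C=100).fit(X, y)
print("total nSV:", int(clf.n_support_.sum()))
print("training accuracy:", clf.score(X, y))
```

The kernel choice drives approximation while the soft-margin C controls the regularization trade-off, matching the two bullets above.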
Teaching/Learning Life After the Foundations [Cont.]
• Caltech 2012 (mixed): 7 weeks of foundations, 0.5 week of NNet, 0.5 week of RBF Net, 1 week of SVM
• NTU ML (with MOOCs, sequential): 8 weeks of foundations, 3 weeks of SVM, 3 weeks of aggregation, 2 weeks of deep learning
—with an in-class data mining competition where students exploited taught/untaught techniques with ease

often incremental efforts to teach/learn a new technique after solid foundations

Mini Summary
foundation-based, foundation-first
—works well in our experience
• learnability: philosophical understanding; makes learning non-trivial; conducts learning properly
• linear models: algorithmic modeling; makes learning concrete; conducts learning safely
• overfitting: practical tuning; makes learning artistic; conducts learning professionally
Excitement of Competition
“How Stanford Teaches Innovation” (original in Chinese)
http://www.cw.com.tw/article/article.action?id=5059685
“Sixth, encourage students to compete. Nothing drives people to work around the clock, forgetting to eat and sleep, like competition does. We encourage students to enter all kinds of international competitions: our students have built a solar-powered house, made electric cars and robots, entered the DARPA (Defense Advanced Research Projects Agency) challenge, and competed in business-plan contests.”
Machine Learning Competition: Mini-KDD Cup
Background
• an annual competition on KDD (knowledge discovery and data mining)
• organized by ACM SIGKDD, starting from 1997, now
the most prestigious data mining competition
• usually lasts 3-4 months
• participants include famous research labs (IBM, AT&T) and top universities (Stanford, Berkeley)
My Design: Time Line
key dates:
• report due (i.e. overall competition end): as late as possible
—often 4 days before I need to submit the scores to NTU
• award ceremony (i.e. early competition end): usually the last class
• announcement: best timing is right after the midterm
—but may highly depend on the TAs’ schedule
• start designing: two or more weeks before the announcement

My Design: Story/Topic
an interesting story makes the competition exciting!
• ML2014:
In this final project, you are going to be part of an exciting machine learning competition. Consider a startup company that features a coming product on the mobile phone. The core of the product is a robust character recognition system... To win the prize, you need to fight for the leading positions on the score board. Then, you need to submit a comprehensive report that describes not only the
recommended approaches, but also the reasoning behind your recommendations. Well, let’s get started!
• more interesting ones:
  • ML2014, ML2013: optical character recognition
  • ML2012: ad click prediction (derived from KDD Cup 2012)
—often okay to reuse with modifications
My Design: Team Size
• most ideal team size IMHO is 3:
  • collaborative, dispute resolution, fewer free riders, etc.
• but can also allow 4 if the class size is too big for the TAs to grade
• usually allow ≤ 3:
  • so students do not have the burden to find exactly 3
  • students can flexibly break teams if needed
  • but evaluate with workloads of 3 for fairness
• still sometimes hard for some students to find team members:
  • motto: provide a matching mechanism, but do not force anyone onto any team
• prevent free riders: need workload distribution in report

My Design: Scoreboard
• core place that makes the game exciting
• thanks to my TAs in all those years for creating and maintaining the service
• basically, a simple submit-judge-scoreboard system

My Design: Award Ceremony
• purpose: to add more fun
• light presents (postcards, paper notebooks, etc.)
• some students list their good-performing awards in resumes
• may serve some educational purposes
• in addition to good-performing awards, can also give interesting awards
ML2012: How Much Overfitting Can We Get?
9472 submissions from 52 teams within 1.5 months...
Award 4: Happy 2013 Award
team | scoreboard | hidden | algorithm | time
Minimaximizer | 0.7632 | 0.7407 | rwa | 2013/01/01 00:00:08
Award 7-8: Hard Working Awards
team | submission count
A | 1097
anything | 1149
My Design: Grade
• generally based on report, not competition, but correlated
• too much emphasis on competition ⇒ utilitarianism
• too little emphasis on competition ⇒ less interesting game
• ask TAs to act as “bosses”: The grading TAs would grade qualitatively with letters: A++[210], A+[196], A[186], B+[176], B[166], C+[156], C[146], D+[136], D[126], F+[116], F[76], F-[36], Z[0]
• list basic requirements corresponding to B
  • to get B, students only need to work ≈ usual homework
  • to get more, need more to convince the TAs
• generally “loose” about basic requirements
—most students perform way beyond the basic requirements anyway
• generally team grade, but adjust individual grades if workload is unbalanced
My Design: Loading
• ideal: a bit harder than homework
• estimate: 60 to 90 man-hours to finish basic requirements (30 man-hours per member)
• sometimes need to adjust the loading of other homeworks
—not an easy task, though
My Design: TAs
• good TAs’ help is essential—I cannot thank them enough!
• design, system setup, discussions with students
My Design: TAs [Cont.]
always note: TAs are busy!!
My Design: Instructor
my main job:
heat up the competition
My Design: Instructor [Cont.]
my two other jobs:
• participate seriously in the design
• maintain fairness of the competition

Some Summary Thoughts
Positive Side
• fun for most students, TAs, and the instructor
• students, TAs, and the instructor learn a lot

Negative Side
• exhausting for most students, TAs, and the instructor
• can be disappointing for some students

Questions and Discussions?