

Machine Learning Foundations (NTU, Fall 2017) instructor: Hsuan-Tien Lin

Homework #1

RELEASE DATE: 10/04/2017 DUE DATE: 11/14/2017, BEFORE 14:00

QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FACEBOOK FORUM.

Unless granted by the instructor in advance, you must turn in a printed/written copy of your solutions (without the source code) for all problems.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for those kinds of misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source code to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other language.

This homework set comes with 200 points and 20 bonus points. In general, every homework set would come with a full credit of 200 points, with some possible bonus points.

1. (60 points) Go register for the Coursera version of the first part of the class ( https://www.coursera.org/teach/ntumlone-mathematicalfoundations/ ) and solve its homework 1. The registration should be totally free. Then, record the highest score that you get within up to 3 trials. Please print out a snapshot of your score as evidence. (Hint: The problems below are simple extensions of the Coursera problems.)

2. (20 points + 10 bonus points) Describe an application of active learning within 10 English or Chinese sentences. You can use the application in related Coursera problems if you find it appropriate, but the grading TAs are allowed to give bonus points based on the “creativity” exhibited in your answer.

Problems 3-5 are about Off-Training-Set error.

Let $\mathcal{X} = \{x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{N+L}\}$ and $\mathcal{Y} = \{-1, +1\}$ (binary classification). Here the set of training examples is $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $y_n \in \mathcal{Y}$, and the set of test inputs is $\{x_{N+\ell}\}_{\ell=1}^{L}$. The Off-Training-Set error (OTS) with respect to an underlying target $f$ and a hypothesis $g$ is

$$E_{OTS}(g, f) = \frac{1}{L} \sum_{\ell=1}^{L} \llbracket g(x_{N+\ell}) \neq f(x_{N+\ell}) \rrbracket.$$
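The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ots_error`, the integer inputs, and the toy `f` and `g` are hypothetical and do not come from any of the problems below.

```python
from typing import Callable, Sequence


def ots_error(g: Callable[[int], int],
              f: Callable[[int], int],
              test_inputs: Sequence[int]) -> float:
    """Fraction of the L test inputs on which hypothesis g disagrees with target f."""
    if not test_inputs:
        raise ValueError("need at least one test input (L >= 1)")
    # Each mismatch contributes 1, mirroring the indicator in the definition.
    mismatches = sum(1 for x in test_inputs if g(x) != f(x))
    return mismatches / len(test_inputs)


# Hypothetical toy setting: inputs are plain integers and L = 4.
toy_test_inputs = [3, 4, 5, 6]


def toy_f(x: int) -> int:
    """Assumed target: +1 for x >= 5, -1 otherwise."""
    return +1 if x >= 5 else -1


def toy_g(x: int) -> int:
    """Assumed hypothesis: always predicts +1."""
    return +1


print(ots_error(toy_g, toy_f, toy_test_inputs))  # g and f disagree on 3 and 4 only
```

Only the test inputs enter the average, so the training examples in $\mathcal{D}$ play no role in the computation.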

3. (20 points) Consider $f(x) = +1$ for all $x$ and

$$g(x) = \begin{cases} +1, & \text{for } x = x_k \text{ and } k \text{ is odd and } 1 \le k \le N + L \\ -1, & \text{otherwise.} \end{cases}$$

$E_{OTS}(g, f) = ?$ Please provide proof of your answer.

4. (20 points) We say that a target function $f$ can “generate” $\mathcal{D}$ in a noiseless setting if $f(x_n) = y_n$ for all $(x_n, y_n) \in \mathcal{D}$. For all possible $f \colon \mathcal{X} \to \mathcal{Y}$, how many of them can generate $\mathcal{D}$ in a noiseless setting? Note that we call two functions $f_1$ and $f_2$ the same if $f_1(x) = f_2(x)$ for all $x \in \mathcal{X}$. Please provide proof of your answer.


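5. (20 points) A deterministic algorithm $\mathcal{A}$ is defined as a procedure that takes $\mathcal{D}$ as an input, and outputs a hypothesis $g$. For any two deterministic algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$, if all those $f$ that can “generate” $\mathcal{D}$ in a noiseless setting are equally likely in probability, please prove or disprove that

$$\mathbb{E}_f \left\{ E_{OTS}\big(\mathcal{A}_1(\mathcal{D}), f\big) \right\} = \mathbb{E}_f \left\{ E_{OTS}\big(\mathcal{A}_2(\mathcal{D}), f\big) \right\}.$$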

Problems 6-7 illustrate what happens with multiple bins. Please note that the dice are not meant to be thrown for random experiments in these problems. They are just used to bind the six faces together. The probability below refers only to drawing from the bag.

Consider four kinds of dice in a bag, with the same (super large) quantity for each kind.

• A: all even numbers are colored orange, all odd numbers are colored green

• B: all even numbers are colored green, all odd numbers are colored orange

• C: all small numbers (1-3) are colored orange, all large numbers (4-6) are colored green

• D: all small numbers (1-3) are colored green, all large numbers (4-6) are colored orange
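The four coloring rules above can be encoded as a small lookup, sketched below. The names `face_color` and `purely_green_numbers` are hypothetical helpers introduced for illustration; the sketch only encodes the colorings and checks one hand-picked draw, and it does not compute any of the probabilities asked for in Problems 6-7.

```python
GREEN, ORANGE = "green", "orange"


def face_color(kind: str, number: int) -> str:
    """Color of face `number` (1-6) on a die of kind 'A', 'B', 'C', or 'D'."""
    if number not in range(1, 7):
        raise ValueError("a die face must be between 1 and 6")
    if kind == "A":  # even -> orange, odd -> green
        return ORANGE if number % 2 == 0 else GREEN
    if kind == "B":  # even -> green, odd -> orange
        return GREEN if number % 2 == 0 else ORANGE
    if kind == "C":  # small (1-3) -> orange, large (4-6) -> green
        return ORANGE if number <= 3 else GREEN
    if kind == "D":  # small (1-3) -> green, large (4-6) -> orange
        return GREEN if number <= 3 else ORANGE
    raise ValueError(f"unknown die kind: {kind!r}")


def purely_green_numbers(drawn_kinds: list) -> list:
    """Numbers whose face is green on every die in the draw."""
    return [n for n in range(1, 7)
            if all(face_color(kind, n) == GREEN for kind in drawn_kinds)]


# One hand-picked draw of 5 dice, chosen purely as an example.
print(purely_green_numbers(["A", "D", "A", "D", "A"]))
```

In this example draw, kind A makes the odd numbers green and kind D makes 1-3 green, so only the numbers in both sets stay purely green.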

6. (20 points) If we pick 5 dice from the bag, what is the probability that we get five green 1’s? Please provide the calculation steps of your answer.

7. (20 points) If we pick 5 dice from the bag, what is the probability that we get “some number” that is purely green? Please provide the calculation steps of your answer. Compare your answer to the previous problem and describe your findings.

For Problem 8, you will play with the PLA algorithm.

First, we use an artificial data set to study PLA. The data set is in

http://www.csie.ntu.edu.tw/~htlin/course/mlfound17fall/hw1/hw1_8_train.dat

Note that the file is exactly the same as the one for Coursera Homework 1, Problem 15:

https://www.csie.ntu.edu.tw/~htlin/mooc/datasets/mlfound_math/hw1_15_train.dat

Each line of the data set contains one $(x_n, y_n)$ with $x_n \in \mathbb{R}^4$. The first 4 numbers of the line contain the components of $x_n$ in order; the last number is $y_n$. Please initialize your algorithm with $w = 0$ and take sign(0) as −1. As a friendly reminder, remember to add $x_0 = 1$ as always!
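The line format and the sign convention described above might be handled as in the following sketch. The helper names `parse_example` and `sign` are hypothetical, and the sample line is made up for illustration rather than taken from the data file; whitespace-separated fields are assumed.

```python
def parse_example(line: str):
    """Parse one line of 5 whitespace-separated numbers into (x, y).

    The first 4 numbers are the components of x_n and the last one is y_n.
    The constant x_0 = 1 is prepended, so x has 5 entries.
    """
    values = [float(token) for token in line.split()]
    if len(values) != 5:
        raise ValueError(f"expected 5 numbers per line, got {len(values)}")
    x = [1.0] + values[:4]   # add x_0 = 1
    y = int(values[4])       # label, expected to be -1 or +1
    return x, y


def sign(value: float) -> int:
    """Sign function with the convention sign(0) = -1 used in this homework."""
    return 1 if value > 0 else -1


# Made-up sample line, not an actual row of the data set.
x, y = parse_example("0.5 -1.25 2.0 0.0 -1")
print(x, y, sign(0.0))
```

Treating zero as negative matters on the very first step: with $w = 0$ every score is 0, so every example is initially predicted as −1.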

8. (*, 20 points) Implement a version of PLA by visiting examples in fixed, pre-determined random cycles throughout the algorithm. Run the algorithm on the data set. Please repeat your experiment 2000 times, each with a different random seed. What is the average number of updates before the algorithm halts? Plot a histogram ( https://en.wikipedia.org/wiki/Histogram ) to show the number of updates versus the frequency of the number.
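One possible reading of “fixed, pre-determined random cycles” is sketched below: shuffle the visiting order once per run using the seed, then keep cycling through that same order until a full pass makes no mistake. This reading, the function name `pla_fixed_cycle`, and the four-point toy data set are assumptions made for illustration; the sketch does not use the homework data and omits the plotting step. It halts only on linearly separable data.

```python
import random
from collections import Counter


def sign(value: float) -> int:
    """Sign with the homework convention sign(0) = -1."""
    return 1 if value > 0 else -1


def pla_fixed_cycle(examples, seed: int):
    """Run PLA over one seeded, fixed visiting order; return (w, number of updates)."""
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)                 # the order is drawn once and then reused
    w = [0.0] * len(examples[0][0])    # initialize w = 0
    updates = 0
    clean_streak = 0                   # consecutive correctly classified examples
    position = 0
    while clean_streak < len(order):   # stop after a full mistake-free cycle
        x, y = examples[order[position]]
        score = sum(wi * xi for wi, xi in zip(w, x))
        if sign(score) != y:
            w = [wi + y * xi for wi, xi in zip(w, x)]  # PLA update: w <- w + y x
            updates += 1
            clean_streak = 0
        else:
            clean_streak += 1
        position = (position + 1) % len(order)
    return w, updates


# Hypothetical separable toy data; each x already includes x_0 = 1.
toy_examples = [
    ([1.0, 2.0, 2.0], +1),
    ([1.0, 3.0, 1.0], +1),
    ([1.0, -1.0, -2.0], -1),
    ([1.0, -2.0, -1.0], -1),
]

update_counts = [pla_fixed_cycle(toy_examples, seed)[1] for seed in range(20)]
print("average updates:", sum(update_counts) / len(update_counts))
print("frequency of each update count:", Counter(update_counts))
```

The `Counter` holds the same information a histogram would display, so it could be passed to whatever plotting tool is preferred.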

Bonus: More about Proof of Perceptron Learning Algorithm

9. (10 points) Page 16 in lecture 2 suggests that the radius $R$ affects the convergence of PLA. So Dr. Learn plans to conduct the following procedure: scale down all $x_n$ linearly by a factor of 20, with the hope that the PLA algorithm would run 20 times faster. Will his plan work? Why or why not?
