
Homework #6

RELEASE DATE: 12/25/2020
RED BUG FIX: 01/01/2021 16:30
DUE DATE: 01/15/2021, BEFORE 13:00 on Gradescope
RANGE: MOOC LECTURES 207-210, 212, 215 (SELECTED PARTS, WITH BACKGROUND FROM ML FOUNDATIONS)
QUESTIONS ARE WELCOMED ON THE NTU COOL FORUM.

We will instruct you on how to use Gradescope to upload your choices and your scanned/printed solutions.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to Gradescope as well. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 400 points. For each problem, there is one correct choice.

For most of the problems, if you choose the correct answer, you get 20 points; if you choose an incorrect answer, you get −10 points. That is, the expected value of random guessing is −4 per problem, and if you can eliminate two of the choices accurately, the expected value of random guessing on the remaining three choices would be 0 per problem. For other problems, the TAs will check your solution in terms of the written explanations and/or code. The solution will be given points between [−20, 20] based on how logical your solution is.

Neural Networks

1.

(Lecture 212) A fully connected Neural Network has $L = 3$; $d^{(0)} = 4$, $d^{(1)} = 5$, $d^{(2)} = 6$, $d^{(3)} = 1$. If only products of the form $w^{(\ell+1)}_{jk} \delta^{(\ell+1)}_k$ count as operations, without counting anything else, which of the following is the total number of operations required in a single iteration of computing all $\delta^{(\ell)}_j$ for $\ell \in \{1, 2\}$ and $j \in \{1, 2, \ldots, d^{(\ell)}\}$ on one data point in the backward pass, after all $x^{(\ell)}_i$ and $s^{(\ell)}_i$ are computed and stored in the forward pass, and $\delta^{(L)}_1$ has been computed? Choose the correct answer; explain your answer.

[a] 16 [b] 36 [c] 50 [d] 56 [e] 68


2.

(Lecture 212) Consider a Neural Network with $d^{(0)} + 1 = 20$ input units, 3 output units, and 50 hidden units (each $x^{(\ell)}_0$ is also counted as a unit). The hidden units can be arranged in any number of layers $\ell = 1, \ldots, L - 1$. That is,

$$\sum_{\ell=1}^{L-1} \left( d^{(\ell)} + 1 \right) = 50.$$

Each layer is fully connected to the layer above it. What is the maximum possible number of weights that such a network can have? Choose the correct answer; explain your answer.

[a] 875 [b] 1123 [c] 1130 [d] 1219 [e] 1327

3.

(Lecture 212) A multiclass Neural Network of $K$ classes is typically built by having $K$ output neurons in the last layer. For some given example $(\mathbf{x}, y)$, let $s^{(L)}_k$ be the summed input score to the $k$-th output neuron. The joint "softmax" output vector is defined as

$$\mathbf{x}^{(L)} = \left[\frac{\exp(s^{(L)}_1)}{\sum_{k=1}^{K}\exp(s^{(L)}_k)}, \frac{\exp(s^{(L)}_2)}{\sum_{k=1}^{K}\exp(s^{(L)}_k)}, \ldots, \frac{\exp(s^{(L)}_K)}{\sum_{k=1}^{K}\exp(s^{(L)}_k)}\right].$$

It is easy to see that each $x^{(L)}_k$ is between 0 and 1 and the components of the whole vector sum to 1. That is, $\mathbf{x}^{(L)}$, renamed as $\mathbf{q} \equiv \mathbf{x}^{(L)}$ for short, can be viewed as a vector whose $k$-th component estimates the probability for $\mathbf{x}$ to be in class $k$.

Define a one-hot-encoded vector of $y$ to be

$$\mathbf{v} = \left[\,\llbracket y = 1 \rrbracket, \llbracket y = 2 \rrbracket, \ldots, \llbracket y = K \rrbracket\,\right].$$

The cross-entropy loss function for the multiclass Neural Network, much like an extension of the cross-entropy loss function used in logistic regression, is defined as

$$\mathrm{err}(\mathbf{x}, y) = -\sum_{k=1}^{K} v_k \ln q_k.$$

What is $\frac{\partial\,\mathrm{err}}{\partial s^{(L)}_k}$, the $\delta^{(L)}_k$ that you'd need for backpropagation? Choose the correct answer; explain your answer.

[a] $q_k$
[b] $v_k - q_k$
[c] $(v_k - q_k)\,q_k$
[d] $q_k - v_k$
[e] $(q_k - v_k)\,q_k$

(Hint: The problem can be viewed as the Neural Network extension of Problem 10 of Homework 3 in Machine Learning Foundations)
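For concreteness, the following is a minimal Python sketch of the softmax output and cross-entropy loss defined above. It illustrates the definitions only (not the derivative the question asks for), and the function and variable names are illustrative assumptions.

```python
import numpy as np

def softmax_output(s):
    """q = softmax of the output-layer scores s^(L), as defined above."""
    e = np.exp(s - np.max(s))          # subtracting the max keeps exp() numerically stable
    return e / e.sum()

def cross_entropy(q, y, K):
    """err(x, y) = -sum_k v_k ln q_k, with v the one-hot encoding of y in {1, ..., K}."""
    v = np.zeros(K)
    v[y - 1] = 1.0                     # classes are labeled 1, ..., K
    return -np.sum(v * np.log(q))

# tiny usage example with K = 3 hypothetical scores
q = softmax_output(np.array([1.0, 2.0, 0.5]))
print(q.sum())                          # the components sum to 1
print(cross_entropy(q, y=2, K=3))
```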

4.

(Lecture 212) Consider a 4-5-1 Neural Network with all hidden layers having a bias input $x^{(\ell)}_0 = +1$, and with $\tanh(s)$ as the transformation function on all neurons (including the output neuron).

Consider a single example $\mathbf{x}_n = (1, 0, 0, 0)$ with $y_n = +1$. Use SGD and backpropagation on this single example to update the weights. Set $\eta = 1$ and initialize all the weights in each $\mathbf{w}^{(\ell)}$ to 0.

What is the weight $w^{(1)}_{01}$ after 3 updates? Choose the correct answer; explain your answer.

[a] 0 [b] −2 [c] −4 [d] −6 [e] −8


Matrix Factorization

5.

(Lecture 215) Consider a matrix factorization model of $\tilde{d} = 1$ solved with alternating least squares.

Assume that the $\tilde{d} \times N$ user factor matrix $\mathrm{V}$ is initialized to a constant matrix of 2. After step 2.1 of alternating least squares (Page 10 of Lecture 215), what is $\mathbf{w}_m$, the $\tilde{d} \times 1$ movie "vector" for the $m$-th movie? Choose the correct answer; explain your answer.

[a] the sum of the ratings on the m-th movie
[b] twice the sum of the ratings on the m-th movie
[c] the average rating of the m-th movie
[d] twice the average rating of the m-th movie
[e] half the average rating of the m-th movie

6.

(Lecture 215) The Matrix Factorization Model tries to find the best $\mathbf{w}_m$ and $\mathbf{v}_n$ such that $r_{nm} \approx \mathbf{w}_m^T \mathbf{v}_n$. Sometimes, we can make the model more expressive by introducing bias terms. That is, we try to approximate $r_{nm}$ by $\mathbf{w}_m^T \mathbf{v}_n + a_m + b_n$. Then, the per-example error function on Page 14 of Lecture 215 becomes

$$\mathrm{err}(\text{user } n, \text{movie } m, \text{rating } r_{nm}) = \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n - a_m - b_n\right)^2.$$

Which of the following corresponds to how $a_m$ should be updated when running SGD for this new matrix factorization model with a learning rate $\frac{\eta}{2}$? Choose the correct answer; explain your answer.

[a] $a_m \leftarrow (1 - \eta)\,a_m - \eta \cdot (r_{nm} - \mathbf{w}_m^T \mathbf{v}_n - b_n)$
[b] $a_m \leftarrow (1 - \eta)\,a_m + \eta \cdot (r_{nm} - \mathbf{w}_m^T \mathbf{v}_n - b_n)$
[c] $a_m \leftarrow (1 + \eta)\,a_m - \eta \cdot (r_{nm} - \mathbf{w}_m^T \mathbf{v}_n - b_n)$
[d] $a_m \leftarrow (1 + \eta)\,a_m + \eta \cdot (r_{nm} - \mathbf{w}_m^T \mathbf{v}_n - b_n)$
[e] $a_m \leftarrow a_m + \eta \cdot (r_{nm} - \mathbf{w}_m^T \mathbf{v}_n - b_n)$
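As a concreteness check (not the SGD update the question asks for), here is a minimal sketch of the biased prediction and the per-example error defined above; the function names are illustrative assumptions.

```python
import numpy as np

def predict(w_m, v_n, a_m, b_n):
    """Biased matrix factorization prediction: w_m^T v_n + a_m + b_n."""
    return np.dot(w_m, v_n) + a_m + b_n

def per_example_error(r_nm, w_m, v_n, a_m, b_n):
    """err(user n, movie m, rating r_nm) = (r_nm - w_m^T v_n - a_m - b_n)^2."""
    return (r_nm - predict(w_m, v_n, a_m, b_n)) ** 2
```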

Aggregation

7.

(Lecture 207) For a binary classification task, assume that there are 3 binary classifiers g1, g2, g3. If uniform blending is used to blend the three classifiers to get G like Page 7 of Lecture 207, and Eout(G) = 0.20, which of the following is a possible combination of [Eout(g1), Eout(g2), Eout(g3)]?

Here Eout is measured by the 0/1 error. Choose the correct answer; explain your answer.

[a] [0.04, 0.16, 0.16]

[b] [0.04, 0.08, 0.24]

[c] [0.06, 0.04, 0.16]

[d] [0.16, 0.08, 0.24]

[e] [0.04, 0.06, 0.24]

8.

(Lecture 207) For a binary classification task, assume that there are 5 binary classifiers g1, g2, . . ., g5, and for some P(x, y), the errors made by the 5 classifiers are independent. That is, the five random variables $\llbracket y \neq g_1(\mathbf{x})\rrbracket, \llbracket y \neq g_2(\mathbf{x})\rrbracket, \ldots, \llbracket y \neq g_5(\mathbf{x})\rrbracket$ are independent. Assume that Eout(gt) = 0.4 for t = 1, 2, . . . , 5. If uniform blending is used to blend the five classifiers to get G like Page 7 of Lecture 207, what is Eout(G)? Choose the closest answer; explain your answer.

[a] 0.68 [b] 0.40 [c] 0.32 [d] 0.08 [e] 0.01


9.

(Lectures 207/210) If bootstrapping is used to sample exactly 0.5N examples out of N, what is the probability that an example is not sampled when N is very large? Choose the closest answer; explain your answer.

[a] 77.9%

[b] 60.7%

[c] 36.8%

[d] 13.5%

[e] 1.8%

10.

(Lecture 207) When talking about non-uniform voting in aggregation, we mentioned that α can be viewed as a weight vector learned from any linear algorithm coupled with the following transform:

$$\boldsymbol{\phi}(\mathbf{x}) = \big(g_1(\mathbf{x}), g_2(\mathbf{x}), \cdots, g_T(\mathbf{x})\big).$$

When studying kernel methods, we mentioned that the kernel is simply a computational short-cut for the inner product $(\boldsymbol{\phi}(\mathbf{x}))^T(\boldsymbol{\phi}(\mathbf{x}'))$. In this problem, we mix the two topics together using the decision stumps as our $g_t(\mathbf{x})$.

Assume that the input vectors contain only even integers between (and including) $2L$ and $2R$, where $L < R$. Consider the decision stumps $g_{s,i,\theta}(\mathbf{x}) = s \cdot \mathrm{sign}(x_i - \theta)$, where $i \in \{1, 2, \cdots, d\}$, $d$ is the finite dimensionality of the input space, $s \in \{-1, +1\}$, and $\theta$ is an odd integer in $(2L, 2R)$.

Define
$$\boldsymbol{\phi}_{ds}(\mathbf{x}) = \Big(g_{+1,1,2L+1}(\mathbf{x}), g_{+1,1,2L+3}(\mathbf{x}), \ldots, g_{+1,1,2R-1}(\mathbf{x}), \ldots, g_{-1,d,2R-1}(\mathbf{x})\Big).$$
What is $K_{ds}(\mathbf{x}, \mathbf{x}') = (\boldsymbol{\phi}_{ds}(\mathbf{x}))^T(\boldsymbol{\phi}_{ds}(\mathbf{x}'))$? Choose the correct answer; explain your answer.

[a] $2d(R - L) - \|\mathbf{x} - \mathbf{x}'\|_1$
[b] $2d(R - L)^2 - \|\mathbf{x} - \mathbf{x}'\|_1^2$
[c] $2d(R - L) - \|\mathbf{x} - \mathbf{x}'\|_2$
[d] $2d(R - L)^2 - \|\mathbf{x} - \mathbf{x}'\|_2^2$
[e] none of the other choices
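To make the transform concrete, here is a small Python sketch that enumerates $\boldsymbol{\phi}_{ds}$ and its inner product by brute force, for tiny illustrative values of d, L, and R. It is an intuition aid under those assumptions, not a derivation of the answer.

```python
import numpy as np

def stump(s, i, theta, x):
    """Decision stump g_{s,i,theta}(x) = s * sign(x_i - theta)."""
    return s * np.sign(x[i] - theta)

def phi_ds(x, d, L, R):
    """Enumerate phi_ds(x) over all stumps with odd theta in (2L, 2R)."""
    feats = []
    for s in (+1, -1):                                # +1 stumps first, then -1 stumps
        for i in range(d):
            for theta in range(2 * L + 1, 2 * R, 2):  # odd thetas between 2L and 2R
                feats.append(stump(s, i, theta, x))
    return np.array(feats)

# tiny illustrative example: d = 2, even-integer inputs between 2L = 0 and 2R = 6
d, L, R = 2, 0, 3
x, xp = np.array([2, 4]), np.array([4, 6])
print(phi_ds(x, d, L, R) @ phi_ds(xp, d, L, R))       # K_ds(x, x') by direct enumeration
```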

Adaptive Boosting

11.

(Lecture 208) Consider applying the AdaBoost algorithm on Page 17 of Lecture 208 to a binary classification data set where 95% of the examples are negative. Because there are so many negative examples, the base algorithm within AdaBoost returns a constant classifier $g_1 = -1$ in the first iteration. Let $u^{(2)}_+$ be the individual example weight of each positive example in the second iteration, and $u^{(2)}_-$ be the example weight of each negative example in the second iteration. What is $\frac{u^{(2)}_+}{u^{(2)}_-}$? Choose the correct answer; explain your answer.

[a] 19 [b] 1/19 [c] 1 [d] 20 [e] 1/20


12.

(Lectures 208/211) For the AdaBoost algorithm on Page 17 of Lecture 208, let $U_t = \sum_{n=1}^{N} u^{(t)}_n$. In Lecture 211, it is shown that for any integer $t > 0$,

$$U_{t+1} = \frac{1}{N}\sum_{n=1}^{N} \exp\!\left(-y_n \sum_{\tau=1}^{t} \alpha_\tau g_\tau(\mathbf{x}_n)\right),$$

and that $E_{\mathrm{in}}(G_T) \le U_{T+1}$. Assume that $0 < \epsilon_t \le \epsilon < \frac{1}{2}$ for each hypothesis $g_t$. Which of the following is correct? Choose the correct answer; explain your answer.

[a] $E_{\mathrm{in}}(G_T) \le \exp\left(-2T^2\left(\frac{1}{2} - \epsilon\right)^2\right)$
[b] $E_{\mathrm{in}}(G_T) \le \exp\left(-2T\sqrt{T}\left(\frac{1}{2} - \epsilon\right)^2\right)$
[c] $E_{\mathrm{in}}(G_T) \le \exp\left(-4T\left(\frac{1}{2} - \epsilon\right)^2\right)$
[d] $E_{\mathrm{in}}(G_T) \le \exp\left(-2T\left(\frac{1}{2} - \epsilon\right)^2\right)$
[e] none of the other choices

(Hint: It might be helpful to consider checking $\frac{U_{t+1}}{U_t}$, and use the fact that $\sqrt{\epsilon(1-\epsilon)} \le \frac{1}{2}\exp\left(-2\left(\frac{1}{2}-\epsilon\right)^2\right)$ for all $0 < \epsilon < \frac{1}{2}$.)

Decision Tree

13.

(Lecture 209) Impurity functions play an important role in decision tree branching. For binary classification problems, let $\mu_+$ be the fraction of positive examples in a data subset, and $\mu_- = 1 - \mu_+$ be the fraction of negative examples in the data subset. We can normalize each impurity function by dividing it by its maximum value among all $\mu_+ \in [0, 1]$. For instance, the classification error is simply $\min(\mu_+, \mu_-)$ and its maximum value is 0.5, so the normalized classification error is $2\min(\mu_+, \mu_-)$. After normalization, which of the following impurity functions is equivalent to the classification error $\min(\mu_+, \mu_-)$? Choose the correct answer; explain your answer.

[a] the Gini index, $1 - \mu_+^2 - \mu_-^2$

[b] the squared error (used for branching in classification data sets), which is by definition $\mu_+\big(1 - (\mu_+ - \mu_-)\big)^2 + \mu_-\big(-1 - (\mu_+ - \mu_-)\big)^2$

[c] the entropy, which is $-\mu_+\ln \mu_+ - \mu_-\ln \mu_-$, with $0 \ln 0 \equiv 0$

[d] the closeness, which is $1 - |\mu_+ - \mu_-|$

[e] none of the other choices


Experiments with Decision Tree and Random Forest

In the following questions, you are asked to implement a preliminary random forest algorithm. You need to implement everything by yourself without using any well-implemented packages.

14.

(Lecture 209, *) First, let's implement a simple C&RT algorithm without pruning, using the Gini index as the impurity measure, as introduced in class. For the decision stump used in branching, if you are branching with feature $i$, please sort all the $x_{n,i}$ values to form (at most) $N + 1$ segments of equivalent $\theta$, and then pick $\theta$ within the median of each segment. If multiple $(i, \theta)$ produce the best split, pick the one with the smallest $i$ (and if there is a tie again, pick the one with the smallest $\theta$).

Please run the algorithm on the following data set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw6/hw6_train.dat

and the following file as our test data set for evaluating Eout:

http://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw6/hw6_test.dat

What is Eout(g), where g is the unpruned decision tree returned from your C&RT algorithm and Eout is evaluated using the 0/1 error? Choose the closest answer; provide your code.

[a] 0.08 [b] 0.13 [c] 0.18 [d] 0.23 [e] 0.28
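As a starting point (not a full C&RT implementation), here is a minimal sketch of the Gini impurity and of the segment-median thresholds described above, under the assumption that labels are in {−1, +1}; the helper names are illustrative.

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - mu_+^2 - mu_-^2 for labels y in {-1, +1}."""
    y = np.asarray(y)
    if len(y) == 0:
        return 0.0
    mu_pos = np.mean(y == 1)
    return 1.0 - mu_pos ** 2 - (1.0 - mu_pos) ** 2

def candidate_thresholds(values):
    """Candidate thetas: medians of the segments between sorted x_{n,i} values.

    The two unbounded outer segments only give trivial (all-on-one-side) splits,
    so this sketch returns the interior segment medians.
    """
    v = np.sort(np.unique(values))
    return (v[:-1] + v[1:]) / 2.0
```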

15.

(Lectures 207/210, *) Next, we implement the random forest algorithm by coupling bagging (by sampling with replacement) with $N' = 0.5N$ with your unpruned decision tree in the previous problem. Produce $T = 2000$ trees with bagging. Let $g_1, g_2, \ldots, g_{2000}$ denote the 2000 trees generated. What is $\frac{1}{T}\sum_{t=1}^{T} E_{\mathrm{out}}(g_t)$, where Eout is also evaluated using the 0/1 error? Choose the closest answer; provide your code.

[a] 0.08 [b] 0.13 [c] 0.18 [d] 0.23 [e] 0.28
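Here is a minimal sketch of the bagging loop described above; train_decision_tree is a hypothetical stand-in for your Problem 14 tree trainer, and the bootstrap index sets are kept because Problem 18 needs them.

```python
import numpy as np

def bagging_indices(N, n_prime, rng):
    """Sample N' indices with replacement from {0, ..., N-1} (one bootstrap round)."""
    return rng.integers(0, N, size=n_prime)

def bagged_trees(X, y, T=2000, rng=None):
    """Train T unpruned trees, each on a bootstrap sample of size N' = 0.5N."""
    rng = rng or np.random.default_rng(0)
    N = len(y)
    n_prime = N // 2
    trees, index_sets = [], []
    for _ in range(T):
        idx = bagging_indices(N, n_prime, rng)
        trees.append(train_decision_tree(X[idx], y[idx]))  # hypothetical trainer from Problem 14
        index_sets.append(set(idx.tolist()))               # remembered for the OOB error in Problem 18
    return trees, index_sets
```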

16.

Let $G(\mathbf{x}) = \mathrm{sign}\!\left(\sum_{t=1}^{T} g_t(\mathbf{x})\right)$ be the random forest formed by the trees above. What is Ein(G), where Ein is evaluated using the 0/1 error? Choose the closest answer; provide your code.

[a] 0.01 [b] 0.06 [c] 0.11 [d] 0.16 [e] 0.21
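Here is a small sketch of the uniform vote $G(\mathbf{x}) = \mathrm{sign}(\sum_t g_t(\mathbf{x}))$ and its 0/1 error; predict is a hypothetical per-tree classifier returning ±1, and mapping a tied vote to +1 is an arbitrary convention of this sketch.

```python
import numpy as np

def forest_predict(trees, x):
    """G(x) = sign(sum_t g_t(x)); a tied vote (sum == 0) is mapped to +1 here."""
    total = sum(predict(tree, x) for tree in trees)   # predict is the hypothetical per-tree classifier
    return 1 if total >= 0 else -1

def zero_one_error(trees, X, y):
    """Average 0/1 error of the voted forest on a data set."""
    preds = np.array([forest_predict(trees, x) for x in X])
    return np.mean(preds != np.asarray(y))
```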

17.

Following the previous problem, what is Eout(G), where Eout is evaluated using the 0/1 error?

Choose the closest answer; provide your code.

[a] 0.01 [b] 0.06 [c] 0.11 [d] 0.16 [e] 0.21


18.

Following the previous problem, we can calculate Eoob(G) as

$$E_{\mathrm{oob}}(G) = \frac{1}{N}\sum_{n=1}^{N} \mathrm{err}\big(y_n, G_n(\mathbf{x}_n)\big),$$

where $G_n$ is a random forest that contains all the trees that were not trained with $\mathbf{x}_n$. If all trees are trained with $\mathbf{x}_n$, take $G_n$ as a constant classifier that always returns $-1$. Let err be the 0/1 error. What is Eoob(G)? Choose the closest answer; provide your code.

[a] 0.02 [b] 0.07 [c] 0.12 [d] 0.17 [e] 0.22
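Here is a minimal sketch of the out-of-bag estimate defined above, reusing the hypothetical trees, index_sets, and predict from the earlier sketches.

```python
import numpy as np

def oob_error(trees, index_sets, X, y):
    """E_oob: for each example n, vote only over trees whose bootstrap sample missed n."""
    errors = []
    for n, (x_n, y_n) in enumerate(zip(X, y)):
        votes = [predict(tree, x_n)                       # hypothetical per-tree classifier
                 for tree, idx in zip(trees, index_sets)
                 if n not in idx]
        if votes:
            g = 1 if sum(votes) >= 0 else -1              # uniform vote of the OOB trees
        else:
            g = -1                                        # constant -1 classifier if no tree is OOB for n
        errors.append(g != y_n)
    return np.mean(errors)
```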

Learning Comes from Feedback

19.

Which topic of this class do you like the most? Choose one topic; explain your choice.

[a] support vector machine
[b] matrix factorization
[c] aggregation models: non-boosting ones
[d] aggregation models: AdaBoost and Gradient Boosting
[e] neural networks and deep learning

20.

Which topic of this class do you like the least? Choose one topic; explain your choice.

[a] support vector machine
[b] matrix factorization
[c] aggregation models: non-boosting ones
[d] aggregation models: AdaBoost and Gradient Boosting
[e] neural networks and deep learning
