
Machine Learning (NTU, Spring 2019) instructor: Hsuan-Tien Lin

Homework #3

RELEASE DATE: 04/30/2019

DUE DATE: 05/21/2019, BEFORE 14:00 ON GRADESCOPE

QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FACEBOOK FORUM.

Please upload your solutions (without the source code) to Gradescope as instructed.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to CEIBA. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 160 points and 20 bonus points. In general, every homework set would come with a full credit of 160 points, with some possible bonus points.

Decision Tree

Impurity functions play an important role in decision tree branching. For multi-class classification problems, let $\mu_1, \mu_2, \ldots, \mu_K$ be the fraction of each class of examples in a data subset, where each $\mu_k \ge 0$ and $\sum_{k=1}^{K} \mu_k = 1$.
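As a side illustration, a minimal sketch of computing such an impurity from a vector of class fractions, assuming NumPy is available (the function name is a placeholder, not part of the problem statement):

```python
import numpy as np

def gini_impurity(mu):
    """Gini impurity 1 - sum_k mu_k^2 for a vector of class fractions mu_k."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0), "mu must be a distribution"
    return 1.0 - np.sum(mu ** 2)

# Example: a perfectly pure subset versus a mixed one (K = 3)
print(gini_impurity([1.0, 0.0, 0.0]))   # 0.0
print(gini_impurity([0.5, 0.3, 0.2]))   # 0.62
```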

1.

The Gini impurity is $1 - \sum_{k=1}^{K} \mu_k^2$. What is the maximum value of the Gini impurity among all possible $[\mu_1, \mu_2, \ldots, \mu_K]$ that satisfy $\mu_k \ge 0$ and $\sum_{k=1}^{K} \mu_k = 1$? Prove your answer.

For binary classification problems, let $\mu_+$ be the fraction of positive examples in a data subset, and $\mu_- = 1 - \mu_+$ be the fraction of negative examples in the data subset.

2.

Prove or disprove that the squared regression error when using binary classification, which is by definition $\mu_+\,\bigl(1 - (\mu_+ - \mu_-)\bigr)^2 + \mu_-\,\bigl(-1 - (\mu_+ - \mu_-)\bigr)^2$, is simply a scaled version of the Gini impurity $1 - \mu_+^2 - \mu_-^2$.

Random Forest

3.

If bootstrapping is used to sample $N' = pN$ examples out of $N$ examples and $N$ is very large, argue that approximately $e^{-p} \cdot N$ of the examples will not be sampled at all.
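A quick numerical check of this claim, not a substitute for the argument asked for (the values of $N$ and $p$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100000, 0.8                              # arbitrary illustrative values
drawn = rng.integers(0, N, size=int(p * N))     # bootstrap: N' = pN draws with replacement
never_sampled = N - np.unique(drawn).size
print(never_sampled / N, np.exp(-p))            # both close to e^{-p} ≈ 0.449
```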

4.

Consider a Random Forest $G$ that consists of $K$ binary classification trees $\{g_k\}_{k=1}^{K}$, where $K$ is an odd integer. Each $g_k$ is of test 0/1 error $E_{\mathrm{out}}(g_k) = e_k$. Prove or disprove that $\frac{2}{K+1} \sum_{k=1}^{K} e_k$ upper bounds $E_{\mathrm{out}}(G)$.


Gradient Boosting

5.

For the gradient boosted decision tree (with squared error), if a tree with only one constant node is returned as $g_1$, and if $g_1(\mathbf{x}) = 11.26$, then after the first iteration, all $s_n$ are updated from 0 to a new constant $\alpha_1 g_1(\mathbf{x}_n) = 11.26\,\alpha_1$. What is $\alpha_1$ in terms of all the $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$? Prove your answer.

6.

For the gradient boosted decision tree (with squared error), after updating all $s_n$ in iteration $t$ using the steepest $\eta$ as $\alpha_t$, what is the value of $\sum_{n=1}^{N} s_n g_t(\mathbf{x}_n)$? Prove your answer.

7.

Suppose gradient boosting (with squared error) is coupled with squared-error polynomial regression (without regularization) instead of decision trees. Prove or disprove that the optimal $\alpha_1 = 1$.
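To make the update mechanics behind Questions 5–7 concrete, a minimal sketch with made-up data follows; the constant base hypothesis mirrors Question 5's setup, and the use of `scipy.optimize.minimize_scalar` for the one-dimensional search over $\alpha$ is an illustrative choice, not a prescribed implementation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gbdt_step(y, s, g_of_x):
    """One gradient-boosting step with squared error: given the base hypothesis
    values g_t(x_n), pick the steepest alpha_t by a 1-D search and update all s_n."""
    # alpha_t minimizes sum_n (y_n - s_n - alpha * g_t(x_n))^2
    alpha = minimize_scalar(lambda a: np.sum((y - s - a * g_of_x) ** 2)).x
    return alpha, s + alpha * g_of_x

# Question 5's setup: a tree with a single constant node g_1(x) = 11.26,
# starting from s_n = 0 (the data below are made up purely for illustration).
y = np.array([1.0, 2.0, 1.5, 3.0, 2.5, 4.0])
s = np.zeros_like(y)
g1_of_x = np.full_like(y, 11.26)
alpha1, s = gbdt_step(y, s, g1_of_x)
print(alpha1, s)      # all s_n become the same constant 11.26 * alpha_1
```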

Neural Network

8.

Consider a Neural Network with $\mathrm{sign}(s)$ instead of $\tanh(s)$ as the transformation functions. That is, consider Multi-Layer Perceptrons. In addition, we will take $+1$ to mean logic TRUE, and $-1$ to mean logic FALSE. Assume that all $x_i$ below are either $+1$ or $-1$. Write down the weights $w_i$ for the following perceptron

$$g_A(\mathbf{x}) = \mathrm{sign}\!\left(\sum_{i=0}^{d} w_i x_i\right)$$

to implement $\mathrm{OR}(x_1, x_2, \ldots, x_d)$. Explain your answer.

9.

For a Neural Network with at least one hidden layer and $\tanh(s)$ as the transformation functions on all neurons (including the output neuron), when all the initial weights $w_{ij}^{(\ell)}$ are set to 0, what gradient components are also 0? Justify your answer.

10.

Multiclass Neural Network of $K$ classes is typically done by having $K$ output neurons in the last layer. For some given example $(\mathbf{x}, y)$, let $s_k^{(L)}$ be the summed input score to the $k$-th neuron; the joint “softmax” output vector is defined as

$$\mathbf{x}^{(L)} = \left[\; \frac{\exp\!\bigl(s_1^{(L)}\bigr)}{\sum_{k=1}^{K} \exp\!\bigl(s_k^{(L)}\bigr)},\; \frac{\exp\!\bigl(s_2^{(L)}\bigr)}{\sum_{k=1}^{K} \exp\!\bigl(s_k^{(L)}\bigr)},\; \ldots,\; \frac{\exp\!\bigl(s_K^{(L)}\bigr)}{\sum_{k=1}^{K} \exp\!\bigl(s_k^{(L)}\bigr)} \;\right].$$

It is easy to see that each $x_k^{(L)}$ is between 0 and 1 and that the components of the whole vector sum to 1. That is, $\mathbf{x}^{(L)}$ defines a probability distribution. Let's rename $\mathbf{x}^{(L)} = \mathbf{q}$ for short.

Define a one-hot-encoded vector of $y$ to be

$$\mathbf{v} = \bigl[\, [\![y = 1]\!],\; [\![y = 2]\!],\; \ldots,\; [\![y = K]\!] \,\bigr].$$

The cross-entropy loss function for the Multiclass Neural Network, much like an extension of the cross-entropy loss function used in logistic regression, is defined as

$$e = -\sum_{k=1}^{K} v_k \ln q_k.$$

Prove that $\dfrac{\partial e}{\partial s_k^{(L)}} = q_k - v_k$, which is actually the $\delta_k^{(L)}$ that you would need for backpropagation.
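A hedged numerical sanity check of the claimed gradient, assuming NumPy (the scores and label are arbitrary, and a finite-difference comparison is only an illustration, not the requested proof):

```python
import numpy as np

def softmax_xent(s, y, K):
    """Softmax output q and cross-entropy e = -sum_k v_k ln q_k for label y in {1..K}."""
    q = np.exp(s - s.max())               # shift for numerical stability
    q = q / q.sum()
    v = np.eye(K)[y - 1]                  # one-hot encoding of y
    return q, v, -np.sum(v * np.log(q))

K, y = 4, 3
s = np.array([0.2, -1.0, 0.5, 1.3])       # arbitrary summed input scores s_k^(L)
q, v, e = softmax_xent(s, y, K)

# finite-difference check of de/ds_k against q_k - v_k
eps = 1e-6
grad_fd = np.array([(softmax_xent(s + eps * np.eye(K)[k], y, K)[2] - e) / eps
                    for k in range(K)])
print(np.allclose(grad_fd, q - v, atol=1e-4))   # expected: True
```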

Experiments with Decision Trees

Implement the simple C&RT algorithm without pruning, using the Gini impurity as the impurity measure, as introduced in class. You need to implement the algorithm by yourself without using sophisticated packages. For the decision stump used in branching, if you are branching with feature $i$ and direction $s$,


please sort all the $x_{n,i}$ values to form (at most) $N + 1$ segments of equivalent $\theta$, and then pick $\theta$ as the median of each segment.
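One possible reading of this branching rule, as a sketch (the function name is a placeholder, and the representatives chosen for the two unbounded outer segments are an assumption; this only enumerates candidate thresholds, not the full C&RT implementation):

```python
import numpy as np

def candidate_thresholds(x_i):
    """Candidate theta values for branching on feature i: sorting the x_{n,i} values
    splits the line into (at most) N + 1 segments of equivalent theta; take one
    representative theta (the segment's midpoint/median) from each segment."""
    v = np.sort(np.unique(x_i))
    inner = (v[:-1] + v[1:]) / 2.0                   # midpoints of the interior segments
    return np.concatenate(([v[0] - 1.0], inner, [v[-1] + 1.0]))   # plus the two outer segments

print(candidate_thresholds(np.array([0.3, 0.1, 0.7, 0.3])))
# -> [-0.9  0.2  0.5  1.7]
```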

Run the algorithm on the following set for training:

hw3_train.dat and the following set for testing:

hw3_test.dat

11.

(*) Draw the resulting tree (by program or by hand, in any way easily understandable by the TAs).

12.

(*) Continuing from the previous problem, what are $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ (evaluated with the 0/1 error) of the tree?

13.

(*) Assume that the tree in the previous question is of height $H$. Try a simple pruning technique of restricting the maximum tree height to $H-1, H-2, \ldots, 1$ by terminating (returning a leaf) whenever a node is at the maximum tree height. Call $g_h$ the pruned decision tree with maximum tree height $h$. Plot curves of $h$ versus $E_{\mathrm{in}}(g_h)$ and $h$ versus $E_{\mathrm{out}}(g_h)$ using the 0/1 error in the same figure. Describe your findings.

Now implement the Bagging algorithm with $N' = 0.8N$ and couple it with your fully grown (without pruning) decision tree above to make a preliminary random forest $G_{\mathrm{RF}}$. Produce $T = 30000$ trees with bagging. Compute $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ using the 0/1 error.
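A minimal sketch of the bagging loop and the uniform vote, under the assumption that a learner like the fully grown C&RT tree from the previous part is available (all helper names here, including `build_cart_tree`, are placeholders; the trivial stand-in learner only keeps the sketch runnable on its own):

```python
import numpy as np

def bagging_forest(X, y, train_tree, T=30000, frac=0.8, seed=0):
    """Bagging: in each of T rounds, draw N' = frac * N examples with replacement
    and train one tree on the bootstrap sample; return the list of trees."""
    rng = np.random.default_rng(seed)
    n_prime = int(frac * len(y))
    trees = []
    for _ in range(T):
        idx = rng.integers(0, len(y), size=n_prime)
        trees.append(train_tree(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    """Uniform (majority) vote of the +/-1 tree predictions."""
    return np.sign(np.sum([g(X) for g in trees], axis=0))

# Trivial stand-in for the fully grown C&RT learner: predict the bootstrap
# sample's majority class (illustration only).
def build_cart_tree(X, y):
    c = 1.0 if np.sum(y > 0) >= len(y) / 2 else -1.0
    return lambda X_: np.full(len(X_), c)

X = np.random.default_rng(1).normal(size=(20, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
trees = bagging_forest(X, y, build_cart_tree, T=51)   # T = 30000 in the actual experiment
print(np.mean(forest_predict(trees, X) != y))         # E_in of the forest with the 0/1 error
```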

14.

(*) Plot a histogram of $E_{\mathrm{in}}(g_t)$ over the 30000 trees.

15.

(*) Let $G_t$ = “the random forest with the first $t$ trees”. Plot a curve of $t$ versus $E_{\mathrm{in}}(G_t)$.

16.

(*) Continuing from Question 15, plot a curve of $t$ versus $E_{\mathrm{out}}(G_t)$. Briefly compare it with the curve in Question 15 and state your findings.

Bonus: Crazy XOR

17.

(10%) Construct a $d$-$d$-$1$ feed-forward neural network with $\mathrm{sign}(s)$ as the transformation function (such a neural network is also called a Linear Threshold Circuit) to implement $\mathrm{XOR}\bigl((\mathbf{x})_1, (\mathbf{x})_2, \ldots, (\mathbf{x})_d\bigr)$.

18.

(10%) Prove that it is impossible to implement $\mathrm{XOR}\bigl((\mathbf{x})_1, (\mathbf{x})_2, \ldots, (\mathbf{x})_d\bigr)$ with any $d$-$(d-1)$-$1$ feed-forward neural network with $\mathrm{sign}(s)$ as the transformation function.
