
Homework #5

RELEASE DATE: 12/04/2020 RED BUG FIX: 12/11/2020 17:00 BLUE BUG FIX: 12/16/2020 15:30 GREEN BUG FIX: 12/25/2020 02:40

DUE DATE: 12/25/2020 (MERRY XMAS!!), BEFORE 13:00 on Gradescope

RANGE: MOOC LECTURES 201-204 (WITH BACKGROUND FROM ML FOUNDATIONS) QUESTIONS ARE WELCOMED ON THE NTU COOL FORUM.

We will instruct you on how to use Gradescope to upload your choices and your scanned/printed solutions.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to Gradescope as well. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 400 points. For each problem, there is one correct choice.

For most of the problems, if you choose the correct answer, you get 20 points; if you choose an incorrect answer, you get −10 points. That is, the expected value of random guessing is −4 per problem, and if you can eliminate two of the choices accurately, the expected value of random guessing on the remaining three choices would be 0 per problem. For other problems, the TAs will check your solution in terms of the written explanations and/or code. The solution will be given points between [−20, 20] based on how logical your solution is.

Hard-Margin SVM and Large Margin

1.

(Lecture 201) Consider a three-example data set in 1D: $\{(x_n, y_n)\}_{n=1}^{3} = \{(-2, -1), (0, +1), (2, -1)\}$, and a polynomial transform $\phi(x) = [1, x, x^2]^T$. Apply the hard-margin SVM on the transformed examples $\{(\phi(x_n), y_n)\}_{n=1}^{3}$ to get the optimal $(b, \mathbf{w})$ in the transformed space. What is the optimal $w_1$ that corresponds to the “constant” feature in the transform? Choose the correct answer; provide steps of your “human optimization” like page 17 of Lecture 201 slides.

[a] $w_1 = 4$ [b] $w_1 = 2$ [c] $w_1 = 1$ [d] $w_1 = 0$ [e] $w_1 = -1$


(Hint: If you must, you can use the fact that all three examples are support vector candidates (i.e. on the fat boundary) for this problem and the next one. But you can also challenge yourself by solving it without using this fact first.)
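If you want to double-check your hand derivation numerically, here is a minimal sketch that assumes scikit-learn is available; the hard-margin SVM is approximated by a soft-margin SVM with a very large $C$, and the constant feature is kept explicitly in the transform while the package keeps its own unregularized bias $b$, matching the $(b, \mathbf{w})$ split in the problem. The graded part remains the "human optimization" steps.

```python
# Minimal numerical sanity check (assumes scikit-learn; hard margin approximated by large C).
import numpy as np
from sklearn.svm import SVC

x = np.array([-2.0, 0.0, 2.0])
y = np.array([-1, +1, -1])
Z = np.stack([np.ones_like(x), x, x ** 2], axis=1)  # phi(x) = [1, x, x^2]

clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard margin
clf.fit(Z, y)
print("w =", clf.coef_.ravel(), " b =", clf.intercept_[0])
```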


2.

(Lecture 201) Following Problem 1, what is the margin achieved by the optimal solution? Choose the correct answer; provide steps of your “human optimization” like page 17 of Lecture 201 slides.

[a] 1 [b] 2 [c] 4 [d] 8 [e] 16

(Hint: You can use the same hint as the previous problem, and write your solution steps for both problems together if needed. Page 14 of Lecture 201 slides should remind you of the relationship between $(b, \mathbf{w})$ and the margin.)

3.

(Lecture 201) Consider $N$ “linearly separable” 1D examples $\{(x_n, y_n)\}_{n=1}^{N}$. That is, $x_n \in \mathbb{R}$. Without loss of generality, assume that $x_1 \le x_2 \le \ldots \le x_M < x_{M+1} \le x_{M+2} \le \ldots \le x_N$, $y_n = -1$ for $n = 1, 2, \ldots, M$, and $y_n = +1$ for $n = M+1, M+2, \ldots, N$. Apply the hard-margin SVM without transform on this data set. What is the largest margin achieved? Choose the correct answer; explain your answer.

[a] $\frac{1}{2}(x_N - x_M)$

[b] $\frac{1}{2}(x_{M+1} - x_1)$

[c] $\frac{1}{2}\left(\frac{1}{N-M}\sum_{n=M+1}^{N} x_n - \frac{1}{M}\sum_{n=1}^{M} x_n\right)$

[d] $\frac{1}{2}(x_N - x_1)$

[e] $\frac{1}{2}(x_{M+1} - x_M)$

(Hint: Have we mentioned that a decision stump is just a 1D perceptron, and the hard-margin SVM is an extension of the perceptron model? :-))

4.

(Lecture 201) Two points $x_1$ and $x_2$ are sampled from a uniform distribution in $[0, 1]$. Consider a large-margin perceptron algorithm that either returns a 1D perceptron with margin at least $\rho$, or returns a default constant hypothesis of $h(x) = -1$. For $\rho \in [0, 0.5]$, what is the expected number of dichotomies that this algorithm can produce, where the expectation is taken over the process that generated $(x_1, x_2)$? Choose the correct answer; explain your answer.

[a] $2 + 2\cdot(1-2\rho)^2$ [b] $2 + 2\cdot(2\rho)^2$ [c] $4\cdot(1-2\rho)^2$ [d] $2 - 2\cdot(1-2\rho)^2$ [e] $2 - 2\cdot(2\rho)^2$

(Hint: We are mimicking page 24 of Lecture 201 here, and you are encouraged to think about the distance between two points.)


Dual Problem of Quadratic Programming

In the hard-margin SVM that we introduced in class, we hope to get a hyperplane such that the margin to the positive examples is the same as the margin to the negative examples. Sometimes we need to have different margins for different classes. The need can be written as the following uneven-margin SVM (in its linear form) with parameters $\rho_+ > 0$ and $\rho_- > 0$:

$$\begin{aligned}
\min_{b, \mathbf{w}} \quad & \tfrac{1}{2}\mathbf{w}^T\mathbf{w} \\
\text{subject to} \quad & y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge \rho_+ \quad \text{for $n$ such that } y_n = +1, \\
& y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge \rho_- \quad \text{for $n$ such that } y_n = -1.
\end{aligned}$$

Our original hard-margin SVM is just a special case with $\rho_+ = \rho_- = 1$.
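For concreteness, the primal above can be handed directly to a generic QP solver. Below is a minimal sketch assuming cvxopt; X (an N × d array), y (labels in {+1, −1}), rho_plus, and rho_minus are hypothetical inputs, and the decision variable stacks b in front of w. This is only an illustration of the primal, not the dual asked about in the next problem.

```python
# Sketch: uneven-margin linear SVM primal as a generic QP (assumes cvxopt).
import numpy as np
from cvxopt import matrix, solvers

def uneven_margin_svm(X, y, rho_plus, rho_minus):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, d = X.shape
    # decision variable z = [b, w_1, ..., w_d]; objective (1/2) w^T w (b is not penalized)
    P = np.zeros((d + 1, d + 1))
    P[1:, 1:] = np.eye(d)
    q = np.zeros(d + 1)
    # y_n (w^T x_n + b) >= rho_{y_n}  rewritten as  G z <= h
    G = -y[:, None] * np.hstack([np.ones((N, 1)), X])
    h = -np.where(y > 0, float(rho_plus), float(rho_minus))
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[0], z[1:]  # optimal b, optimal w
```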

5.

(Lecture 202) The dual problem of the uneven-margin SVM can be written as

$$\begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \alpha_n\alpha_m y_n y_m \mathbf{x}_n^T\mathbf{x}_m + \square \\
\text{subject to} \quad & \sum_{n=1}^{N} y_n\alpha_n = 0, \\
& \alpha_n \ge 0 \quad \text{for } n = 1, 2, \ldots, N.
\end{aligned}$$

What is $\square$? Choose the correct answer; explain your answer.

[a] $-\sum_{n=1}^{N} \rho_+^{-1} [\![ y_n = +1 ]\!]\,\alpha_n - \sum_{n=1}^{N} \rho_-^{-1} [\![ y_n = -1 ]\!]\,\alpha_n$

[b] $-\sum_{n=1}^{N} \rho_+^{0} [\![ y_n = +1 ]\!]\,\alpha_n - \sum_{n=1}^{N} \rho_-^{0} [\![ y_n = -1 ]\!]\,\alpha_n$

[c] $-\sum_{n=1}^{N} \rho_+ [\![ y_n = +1 ]\!]\,\alpha_n - \sum_{n=1}^{N} \rho_- [\![ y_n = -1 ]\!]\,\alpha_n$

[d] $-\sum_{n=1}^{N} \rho_+^{2} [\![ y_n = +1 ]\!]\,\alpha_n - \sum_{n=1}^{N} \rho_-^{2} [\![ y_n = -1 ]\!]\,\alpha_n$

[e] none of the other choices

6.

(Lecture 202) Let $\boldsymbol{\alpha}$ be an optimal solution of the original hard-margin SVM (i.e., even margin). Which of the following is an optimal solution of the uneven-margin SVM for a given pair of non-negative $(\rho_-, \rho_+)$? Choose the correct answer; explain your answer.

[a] $\boldsymbol{\alpha}$ [b] $\sqrt{\rho_+\cdot\rho_-}\,\boldsymbol{\alpha}$ [c] $\rho_+^2\,\boldsymbol{\alpha}$ [d] $\frac{\rho_+^2+\rho_-^2}{2}\,\boldsymbol{\alpha}$ [e] $\frac{\rho_++\rho_-}{2}\,\boldsymbol{\alpha}$

Properties of Kernels


8.

(Lecture 203) For any feature transform $\phi$ from $\mathcal{X}$ to $\mathcal{Z}$, the squared distance between two examples $\mathbf{x}$ and $\mathbf{x}'$ is $\|\phi(\mathbf{x}) - \phi(\mathbf{x}')\|^2$ in the $\mathcal{Z}$-space. For the Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$, compute the squared distance with the kernel trick. Then, for any two examples $\mathbf{x}$ and $\mathbf{x}'$, what is the tightest upper bound for their squared distance in the $\mathcal{Z}$-space? Choose the correct answer; explain your answer.

[a] 0 [b] 1 [c] 2 [d] 3 [e] 4

9.

(Lecture 203) For a set of examples $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ and a kernel function $K$, consider a hypothesis set that contains

$$h_{\boldsymbol{\alpha},b}(\mathbf{x}) = \operatorname{sign}\left(\sum_{n=1}^{N} y_n \alpha_n K(\mathbf{x}_n, \mathbf{x}) + b\right).$$

The classifier returned by the SVM can be viewed as one such $h_{\boldsymbol{\alpha},b}$, where the value of $\boldsymbol{\alpha}$ is determined by the dual QP solver and $b$ is calculated from the KKT conditions.

In this problem, we study a simpler form of $h_{\boldsymbol{\alpha},b}$ where $\boldsymbol{\alpha} = \mathbf{1}$ (the vector of all $1$'s) and $b = 0$. Let us name $h_{\mathbf{1},0}$ as $\hat{h}$ for simplicity. We will show that when using the Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$, if $\gamma$ is large enough, $E_{\text{in}}(\hat{h}) = 0$. That is, when using the Gaussian kernel, we can “easily” separate the given data set if $\gamma$ is large enough.

Assume that the distance between any pair of different $(\mathbf{x}_n, \mathbf{x}_m)$ in the $\mathcal{X}$-space is no less than $\epsilon$. That is,

$$\|\mathbf{x}_n - \mathbf{x}_m\| \ge \epsilon \quad \forall\, n \ne m.$$

What is the tightest lower bound of $\gamma$ that ensures $E_{\text{in}}(\hat{h}) = 0$? Choose the correct answer; explain your answer.

[a] $\frac{\ln^2(N+1)}{\epsilon^2}$ [b] $\frac{\ln(N+1)}{\epsilon^2}$ [c] $\frac{\ln(N)}{\epsilon^2}$ [d] $\frac{\ln(N-1)}{\epsilon^2}$ [e] $\frac{\ln^2(N-1)}{\epsilon^2}$
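To get a feeling for the claim, $\hat{h}$ is cheap to evaluate numerically. Here is a minimal sketch (assuming numpy, with hypothetical inputs X of shape N × d and y in {+1, −1}) that measures $E_{\text{in}}(\hat{h})$ for a chosen $\gamma$; it does not derive the bound asked for above.

```python
# Sketch: E_in of h_hat(x) = sign( sum_n y_n * exp(-gamma * ||x_n - x||^2) ), assuming numpy.
import numpy as np

def ein_hhat(X, y, gamma):
    # pairwise squared distances between all training examples
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # score of h_hat on each training example (alpha = all ones, b = 0)
    scores = (np.exp(-gamma * sq_dist) * y[None, :]).sum(axis=1)
    return np.mean(np.sign(scores) != y)  # 0/1 in-sample error
```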


Kernel Perceptron Learning Algorithm

10.

(Lecture 203) In this problem, we are going to apply the kernel trick to the perceptron learning algorithm introduced in Machine Learning Foundations. If we run the perceptron learning algorithm on the transformed examples $\{(\phi(\mathbf{x}_n), y_n)\}_{n=1}^{N}$, the algorithm updates $\mathbf{w}_t$ to $\mathbf{w}_{t+1}$ when the current $\mathbf{w}_t$ makes a mistake on $(\phi(\mathbf{x}_{n(t)}), y_{n(t)})$:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)}\,\phi(\mathbf{x}_{n(t)}).$$

Because every update is based on one (transformed) example, if we take $\mathbf{w}_0 = \mathbf{0}$, we can represent every $\mathbf{w}_t$ as a linear combination of $\{\phi(\mathbf{x}_n)\}_{n=1}^{N}$. We can then maintain the linear combination coefficients instead of the whole $\mathbf{w}$. Assume that we maintain an $N$-dimensional vector $\boldsymbol{\alpha}_t$ in the $t$-th iteration such that

$$\mathbf{w}_t = \sum_{n=1}^{N} \alpha_{t,n}\,\phi(\mathbf{x}_n)$$

for $t = 0, 1, 2, \ldots$. Set $\boldsymbol{\alpha}_0 = \mathbf{0}$ ($N$ zeros) to match $\mathbf{w}_0 = \mathbf{0}$ ($\tilde{d} + 1$ zeros). How should $\boldsymbol{\alpha}_t$ be updated to $\boldsymbol{\alpha}_{t+1}$ when the current $\mathbf{w}_t$ (represented by $\boldsymbol{\alpha}_t$) makes a mistake on $(\phi(\mathbf{x}_{n(t)}), y_{n(t)})$? Choose the correct answer; explain your answer.

[a] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t$ except $\alpha_{t+1,n(t)} \leftarrow \alpha_{t,n(t)} + 1$

[b] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t$ except $\alpha_{t+1,n(t)} \leftarrow \alpha_{t,n(t)} - 1$

[c] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t$ except $\alpha_{t+1,n(t)} \leftarrow \alpha_{t,n(t)} + y_{n(t)}$

[d] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t$ except $\alpha_{t+1,n(t)} \leftarrow \alpha_{t,n(t)} - y_{n(t)}$

[e] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t + \mathbf{y}$

(Hint: Although we did not teach Lecture 205, if you have watched it by yourself from YouTube, you will find its page 15 loosely related. You should be able to solve this problem without watching Lecture 205, though.)

11.

(Lecture 203) Following Problem 10, the update rule takes care of the training iterations. In addition, we need to evaluate $\mathbf{w}_t^T\phi(\mathbf{x})$ not only for predicting a new $\mathbf{x}$ but also for checking whether $\mathbf{w}_t$ makes any mistake on some example $\mathbf{x}$ during training. Which of the following equations computes $\mathbf{w}_t^T\phi(\mathbf{x})$ with the kernel trick $K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$? Choose the correct answer; explain your answer.

[a] $\sum_{n=1}^{N} \alpha_{t,n}\,K(\mathbf{x}_n, \mathbf{x})$

[b] $-\sum_{n=1}^{N} \alpha_{t,n}\,K(\mathbf{x}_n, \mathbf{x})$

[c] $\sum_{n=1}^{N} y_n\,\alpha_{t,n}\,K(\mathbf{x}_n, \mathbf{x})$

[d] $\sum_{n=1}^{N} \alpha_{t,n}^2\,(K(\mathbf{x}_n, \mathbf{x}))^2$

[e] $\sum_{n=1}^{N} \alpha_{t,n}^2\,K(\mathbf{x}_n, \mathbf{x})$


Soft-Margin SVM

12.

(Lecture 204) Consider the soft-margin SVM taught in our class. Assume that after solving the dual problem, every example is a bounded support vector. That is, the optimal solution $\boldsymbol{\alpha}^\star$ satisfies $\alpha^\star_n = C$ for every example. In this case, there may be multiple solutions for the optimal $b$ of the primal SVM problem. What is the largest such $b$? Choose the correct answer; explain your answer.

[a] $\min\limits_{n=1,2,\ldots,N}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[b] $\min\limits_{n\,:\,y_n>0}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[c] $\min\limits_{n\,:\,y_n<0}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[d] $\operatorname*{average}\limits_{n\,:\,y_n>0}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[e] $\operatorname*{average}\limits_{n\,:\,y_n<0}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

13.

(Lecture 204) In class, we taught the non-linear soft-margin SVM as follows.

$$(P_1)\qquad \begin{aligned}
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n \\
\text{subject to} \quad & y_n\left(\mathbf{w}^T\phi(\mathbf{x}_n) + b\right) \ge 1 - \xi_n, \quad \text{for } n = 1, 2, \ldots, N, \\
& \xi_n \ge 0, \quad \text{for } n = 1, 2, \ldots, N.
\end{aligned}$$

The SVM penalizes the margin violation linearly. Another popular formulation penalizes the margin violation quadratically. In this problem, we derive the dual of such a formulation. The formulation is as follows:

$$(P_2)\qquad \begin{aligned}
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n^2 \\
\text{subject to} \quad & y_n\left(\mathbf{w}^T\phi(\mathbf{x}_n) + b\right) \ge 1 - \xi_n, \quad \text{for } n = 1, 2, \ldots, N.
\end{aligned}$$

We do not have the $\xi_n \ge 0$ constraints, as any negative $\xi_n$ would never be part of an optimal solution of $(P_2)$ (you are encouraged to think about why). Anyway, the dual problem of $(P_2)$ will look like this:

$$(D_2)\qquad \begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \alpha_n\alpha_m y_n y_m\cdot\Diamond \;-\; \sum_{n=1}^{N}\alpha_n \\
\text{subject to} \quad & \sum_{n=1}^{N} y_n\alpha_n = 0, \\
& \alpha_n \ge 0, \quad \text{for } n = 1, 2, \ldots, N.
\end{aligned}$$

Let the kernel function $K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$. What is $\Diamond$? Choose the correct answer; explain your answer.

[a] $\left(2C\cdot K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[b] $\left(K(\mathbf{x}_n, \mathbf{x}_m) + 2C\,[\![ n = m ]\!]\right)$

[c] $\left(K(\mathbf{x}_n, \mathbf{x}_m) + C\,[\![ n = m ]\!]\right)$

[d] $\left(K(\mathbf{x}_n, \mathbf{x}_m) + \frac{1}{C}\,[\![ n = m ]\!]\right)$

[e] $\left(K(\mathbf{x}_n, \mathbf{x}_m) + \frac{1}{2C}\,[\![ n = m ]\!]\right)$


14.

(Lectures 202/204) After getting the optimal $\boldsymbol{\alpha}^\star$ for $(D_2)$, how can we calculate the optimal $\boldsymbol{\xi}^\star$ for $(P_2)$? Choose the correct answer; explain your answer.

[a] $\boldsymbol{\xi}^\star = \boldsymbol{\alpha}^\star$ [b] $\boldsymbol{\xi}^\star = 2\boldsymbol{\alpha}^\star$ [c] $\boldsymbol{\xi}^\star = C\boldsymbol{\alpha}^\star$ [d] $\boldsymbol{\xi}^\star = \frac{1}{C}\boldsymbol{\alpha}^\star$ [e] $\boldsymbol{\xi}^\star = \frac{1}{2C}\boldsymbol{\alpha}^\star$

Experiments with Soft-Margin SVM

For Problems 15 to 20, we are going to experiment with a real-world data set. Download the processed satimage data sets from LIBSVM Tools.

Training: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale

Testing: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.t

We will consider binary classification problems of the form “one of the classes” (as the positive class) versus “the other classes” (as the negative class).

The data set contains thousands of examples, and some quadratic programming packages cannot handle this size. We recommend that you consider the LIBSVM package

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Regardless of the package that you choose to use, please read the manual of the package carefully to make sure that you are indeed solving the soft-margin support vector machine taught in class like the dual formulation below:

$$\begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \alpha_n\alpha_m y_n y_m K(\mathbf{x}_n, \mathbf{x}_m) - \sum_{n=1}^{N}\alpha_n \\
\text{subject to} \quad & \sum_{n=1}^{N} y_n\alpha_n = 0, \\
& 0 \le \alpha_n \le C, \quad n = 1, \ldots, N.
\end{aligned}$$

In the following problems, please use the 0/1 error for evaluating $E_{\text{in}}$, $E_{\text{val}}$, and $E_{\text{out}}$ (through the test set). Some practical remarks include

(i) Please tell your chosen package to not automatically scale the data for you, lest you should change the effective kernel and get different results.

(ii) It is your responsibility to check whether your chosen package solves the designated formulation with enough numerical precision. Please read the manual of your chosen package for software parameters whose values affect the outcome—any ML practitioner needs to deal with this kind of added uncertainty.
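For instance, a minimal sketch of calling such a package from Python is given below. It assumes scikit-learn (whose SVC wraps LIBSVM) and that the two downloaded files are named satimage.scale and satimage.scale.t in the working directory, and it uses the setting of Problem 15 as an illustration; the numbers you report must come from your own run with your own chosen package.

```python
# Sketch: loading the satimage data and fitting one soft-margin SVM (assumes scikit-learn).
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

X_train, y_train = load_svmlight_file("satimage.scale")
X_test, y_test = load_svmlight_file("satimage.scale.t", n_features=X_train.shape[1])
X_train, X_test = X_train.toarray(), X_test.toarray()  # small enough to densify

# "3" versus "not 3": relabel to +1 / -1
y_train_bin = np.where(y_train == 3, 1, -1)
y_test_bin = np.where(y_test == 3, 1, -1)

# linear soft-margin SVM with C = 10; scikit-learn does not rescale the inputs by itself
clf = SVC(kernel="linear", C=10)
clf.fit(X_train, y_train_bin)

w = np.asarray(clf.coef_).ravel()  # w is only exposed for the linear kernel
print("||w|| =", np.linalg.norm(w))
print("E_in  =", np.mean(clf.predict(X_train) != y_train_bin))
print("E_out =", np.mean(clf.predict(X_test) != y_test_bin))
```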


15.

(Lectures 201/204, *) Consider the linear soft-margin SVM. That is, either solve the primal formulation of the soft-margin SVM with the given $\mathbf{x}_n$, or take the linear kernel $K(\mathbf{x}_n, \mathbf{x}_m) = \mathbf{x}_n^T\mathbf{x}_m$ in the dual formulation. With $C = 10$ and the binary classification problem of “3” versus “not 3”, which of the following numbers is closest to $\|\mathbf{w}\|$ after solving the linear soft-margin SVM? Choose the closest answer; provide your command/code.

[a] 7.0 [b] 7.5 [c] 8.0 [d] 8.5 [e] 9.0

16.

(Lectures 203/204, *) Consider the polynomial kernel $K(\mathbf{x}_n, \mathbf{x}_m) = (1 + \mathbf{x}_n^T\mathbf{x}_m)^Q$, where $Q$ is the degree of the polynomial. With $C = 10$ and $Q = 2$, which of the following soft-margin SVM classifiers reaches the lowest $E_{\text{in}}$? Choose the correct answer; provide your command/code.

[a] “1” versus “not 1”

[b] “2” versus “not 2”

[c] “3” versus “not 3”

[d] “4” versus “not 4”

[e] “5” versus “not 5”

17.

(Lectures 203/204, *) Following Problem 16, which of the following numbers is closest to the maximum number of support vectors within those five soft-margin SVM classifiers? Choose the closest answer; provide your command/code.

[a] 500 [b] 600 [c] 700 [d] 800 [e] 900

18.

(Lectures 203/204, *) Consider the Gaussian kernel $K(\mathbf{x}_n, \mathbf{x}_m) = \exp(-\gamma\|\mathbf{x}_n - \mathbf{x}_m\|^2)$. For the binary classification problem of “6” versus “not 6”, when fixing $\gamma = 10$, which of the following values of $C$ results in the lowest $E_{\text{out}}$? Choose the correct answer; provide your command/code.

[a] 0.01 [b] 0.1 [c] 1 [d] 10 [e] 100

19.

(Lectures 203/204, *) Following Problem 18, when fixing $C = 0.1$, which of the following values of $\gamma$ results in the lowest $E_{\text{out}}$? Choose the correct answer; provide your command/code.

[a] 0.1 [b] 1 [c] 10 [d] 100 [e] 1000


20.

(Lectures 203/204, *) Following Problem 18, consider a validation procedure that randomly samples 200 examples from the training set for validation and leaves the other examples for training $g^-_{\mathrm{SVM}}$. Fix $C = 0.1$ and use the validation procedure to choose the best $\gamma$ among $\{0.1, 1, 10, 100, 1000\}$ according to $E_{\text{val}}$. If there is a tie of $E_{\text{val}}$, choose the smallest $\gamma$. Repeat the procedure 1000 times. Which of the following values of $\gamma$ is selected the largest number of times? Choose the correct answer; provide your command/code.

[a] 0.1 [b] 1 [c] 10 [d] 100 [e] 1000
