
Homework #5

RELEASE DATE: 12/04/2020 RED BUG FIX: 12/11/2020 17:00 BLUE BUG FIX: 12/16/2020 15:30 GREEN BUG FIX: 12/25/2020 02:40

DUE DATE: 12/25/2020 (MERRY XMAS!!), BEFORE 13:00 on Gradescope

RANGE: MOOC LECTURES 201-204 (WITH BACKGROUND FROM ML FOUNDATIONS) QUESTIONS ARE WELCOMED ON THE NTU COOL FORUM.

We will instruct you on how to use Gradescope to upload your choices and your scanned/printed solutions.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to Gradescope as well. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 400 points. For each problem, there is one correct choice.

For most of the problems, if you choose the correct answer, you get 20 points; if you choose an incorrect answer, you get −10 points. That is, the expected value of random guessing is −4 per problem, and if you can eliminate two of the choices accurately, the expected value of random guessing on the remaining three choices would be 0 per problem. For other problems, the TAs will check your solution in terms of the written explanations and/or code. The solution will be given points between [−20, 20] based on how logical your solution is.

Hard-Margin SVM and Large Margin

1.

(Lecture 201) Consider a three-example data set in 1D: $\{(x_n, y_n)\}_{n=1}^{3} = \{(-2, -1), (0, +1), (2, -1)\}$, and a polynomial transform $\phi(x) = [1, x, x^2]^T$. Apply the hard-margin SVM on the transformed examples $\{(\phi(x_n), y_n)\}_{n=1}^{3}$ to get the optimal $(b, \mathbf{w})$ in the transformed space. What is the optimal $w_1$ that corresponds to the “constant” feature in the transform? Choose the correct answer; provide steps of your “human optimization” like page 17 of Lecture 201 slides.

[a] $w_1 = 4$ [b] $w_1 = 2$ [c] $w_1 = 1$ [d] $w_1 = 0$ [e] $w_1 = -1$


(Hint: If you must, you can use the fact that all three examples are support vector candidates (i.e. on the fat boundary) for this problem and the next one. But you can also challenge yourself by solving it without using this fact first.)
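If you want to double-check your hand derivation numerically, here is a minimal sketch that assumes scikit-learn is available; the hard-margin SVM is approximated by a soft-margin SVM with a very large $C$, and the constant feature is kept explicitly in the transform while the package keeps its own unregularized bias $b$, matching the $(b, \mathbf{w})$ split in the problem. The graded part remains the "human optimization" steps.

```python
# Minimal numerical sanity check (assumes scikit-learn; hard margin approximated by large C).
import numpy as np
from sklearn.svm import SVC

x = np.array([-2.0, 0.0, 2.0])
y = np.array([-1, +1, -1])
Z = np.stack([np.ones_like(x), x, x ** 2], axis=1)  # phi(x) = [1, x, x^2]

clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard margin
clf.fit(Z, y)
print("w =", clf.coef_.ravel(), " b =", clf.intercept_[0])
```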


2.

(Lecture 201) Following Problem 1, what is the margin achieved by the optimal solution? Choose the correct answer; provide steps of your “human optimization” like page 17 of Lecture 201 slides.

[a] 1 [b] 2 [c] 4 [d] 8 [e] 16

(Hint: You can use the same hint as the previous problem, and write your solution steps for both problems together if needed. Page 14 of Lecture 201 slides should remind you of the relationship between $(b, \mathbf{w})$ and the margin.)

3.

(Lecture 201) Consider $N$ “linearly separable” 1D examples $\{(x_n, y_n)\}_{n=1}^{N}$. That is, $x_n \in \mathbb{R}$. Without loss of generality, assume that $x_1 \le x_2 \le \ldots \le x_M < x_{M+1} \le x_{M+2} \le \ldots \le x_N$, $y_n = -1$ for $n = 1, 2, \ldots, M$, and $y_n = +1$ for $n = M+1, M+2, \ldots, N$. Apply the hard-margin SVM without transform on this data set. What is the largest margin achieved? Choose the correct answer; explain your answer.

[a] $\frac{1}{2}(x_N - x_M)$

[b] $\frac{1}{2}(x_{M+1} - x_1)$

[c] $\frac{1}{2}\left(\frac{1}{N-M}\sum_{n=M+1}^{N} x_n - \frac{1}{M}\sum_{n=1}^{M} x_n\right)$

[d] $\frac{1}{2}(x_N - x_1)$

[e] $\frac{1}{2}(x_{M+1} - x_M)$

(Hint: Have we mentioned that a decision stump is just a 1D perceptron, and the hard-margin SVM is an extension of the perceptron model? :-))

4.

(Lecture 201) Two points $x_1$ and $x_2$ are sampled from a uniform distribution in $[0, 1]$. Consider a large-margin perceptron algorithm that either returns a 1D perceptron with margin at least $\rho$, or returns a default constant hypothesis of $h(x) = -1$. For $\rho \in [0, 0.5]$, what is the expected number of dichotomies that this algorithm can produce, where the expectation is taken over the process that generated $(x_1, x_2)$? Choose the correct answer; explain your answer.

[a] $2 + 2\cdot(1-2\rho)^2$ [b] $2 + 2\cdot(2\rho)^2$ [c] $4\cdot(1-2\rho)^2$ [d] $2 - 2\cdot(1-2\rho)^2$ [e] $2 - 2\cdot(2\rho)^2$

(Hint: We are mimicking page 24 of Lecture 201 here, and you are encouraged to think about the distance between two points.)


Dual Problem of Quadratic Programming

In the hard-margin SVM that we introduced in class, we hope to get a hyperplane such that the margin to the positive examples is the same as the margin to the negative examples. Sometimes we need to have different margins for different classes. The need can be written as the following uneven-margin SVM (in its linear form) with parameters $\rho_+ > 0$ and $\rho_- > 0$:

$$\begin{aligned}
\min_{b, \mathbf{w}} \quad & \tfrac{1}{2}\mathbf{w}^T\mathbf{w} \\
\text{subject to} \quad & y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge \rho_+ \quad \text{for $n$ such that } y_n = +1, \\
& y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge \rho_- \quad \text{for $n$ such that } y_n = -1.
\end{aligned}$$

Our original hard-margin SVM is just a special case with $\rho_+ = \rho_- = 1$.
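For concreteness, the primal above can be handed directly to a generic QP solver. Below is a minimal sketch assuming cvxopt; X (an N × d array), y (labels in {+1, −1}), rho_plus, and rho_minus are hypothetical inputs, and the decision variable stacks b in front of w. This is only an illustration of the primal, not the dual asked about in the next problem.

```python
# Sketch: uneven-margin linear SVM primal as a generic QP (assumes cvxopt).
import numpy as np
from cvxopt import matrix, solvers

def uneven_margin_svm(X, y, rho_plus, rho_minus):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, d = X.shape
    # decision variable z = [b, w_1, ..., w_d]; objective (1/2) w^T w (b is not penalized)
    P = np.zeros((d + 1, d + 1))
    P[1:, 1:] = np.eye(d)
    q = np.zeros(d + 1)
    # y_n (w^T x_n + b) >= rho_{y_n}  rewritten as  G z <= h
    G = -y[:, None] * np.hstack([np.ones((N, 1)), X])
    h = -np.where(y > 0, float(rho_plus), float(rho_minus))
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[0], z[1:]  # optimal b, optimal w
```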

5.

(Lecture 202) The dual problem of the uneven-margin SVM can be written as

$$\begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \alpha_n\alpha_m y_n y_m \mathbf{x}_n^T\mathbf{x}_m + \square \\
\text{subject to} \quad & \sum_{n=1}^{N} y_n\alpha_n = 0, \\
& \alpha_n \ge 0 \quad \text{for } n = 1, 2, \ldots, N.
\end{aligned}$$

What is $\square$? Choose the correct answer; explain your answer.

[a] $-\sum_{n=1}^{N} \rho_+^{-1} [\![ y_n = +1 ]\!]\,\alpha_n - \sum_{n=1}^{N} \rho_-^{-1} [\![ y_n = -1 ]\!]\,\alpha_n$

[b] $-\sum_{n=1}^{N} \rho_+^{0} [\![ y_n = +1 ]\!]\,\alpha_n - \sum_{n=1}^{N} \rho_-^{0} [\![ y_n = -1 ]\!]\,\alpha_n$

[c] $-\sum_{n=1}^{N} \rho_+ [\![ y_n = +1 ]\!]\,\alpha_n - \sum_{n=1}^{N} \rho_- [\![ y_n = -1 ]\!]\,\alpha_n$

[d] $-\sum_{n=1}^{N} \rho_+^{2} [\![ y_n = +1 ]\!]\,\alpha_n - \sum_{n=1}^{N} \rho_-^{2} [\![ y_n = -1 ]\!]\,\alpha_n$

[e] none of the other choices

6.

(Lecture 202) Let $\boldsymbol{\alpha}$ be an optimal solution of the original hard-margin SVM (i.e., even margin). Which of the following is an optimal solution of the uneven-margin SVM for a given pair of non-negative $(\rho_-, \rho_+)$? Choose the correct answer; explain your answer.

[a] $\boldsymbol{\alpha}$ [b] $\sqrt{\rho_+\cdot\rho_-}\,\boldsymbol{\alpha}$ [c] $\rho_+^2\,\boldsymbol{\alpha}$ [d] $\frac{\rho_+^2+\rho_-^2}{2}\,\boldsymbol{\alpha}$ [e] $\frac{\rho_++\rho_-}{2}\,\boldsymbol{\alpha}$

Properties of Kernels


8.

(Lecture 203) For any feature transform $\phi$ from $\mathcal{X}$ to $\mathcal{Z}$, the squared distance between two examples $\mathbf{x}$ and $\mathbf{x}'$ is $\|\phi(\mathbf{x}) - \phi(\mathbf{x}')\|^2$ in the $\mathcal{Z}$-space. For the Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$, compute the squared distance with the kernel trick. Then, for any two examples $\mathbf{x}$ and $\mathbf{x}'$, what is the tightest upper bound for their squared distance in the $\mathcal{Z}$-space? Choose the correct answer; explain your answer.

[a] 0 [b] 1 [c] 2 [d] 3 [e] 4

9.

(Lecture 203) For a set of examples $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ and a kernel function $K$, consider a hypothesis set that contains

$$h_{\boldsymbol{\alpha},b}(\mathbf{x}) = \operatorname{sign}\left(\sum_{n=1}^{N} y_n \alpha_n K(\mathbf{x}_n, \mathbf{x}) + b\right).$$

The classifier returned by the SVM can be viewed as one such $h_{\boldsymbol{\alpha},b}$, where the value of $\boldsymbol{\alpha}$ is determined by the dual QP solver and $b$ is calculated from the KKT conditions.

In this problem, we study a simpler form of $h_{\boldsymbol{\alpha},b}$ where $\boldsymbol{\alpha} = \mathbf{1}$ (the vector of all $1$'s) and $b = 0$. Let us name $h_{\mathbf{1},0}$ as $\hat{h}$ for simplicity. We will show that when using the Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$, if $\gamma$ is large enough, $E_{\text{in}}(\hat{h}) = 0$. That is, when using the Gaussian kernel, we can “easily” separate the given data set if $\gamma$ is large enough.

Assume that the distance between any pair of different $(\mathbf{x}_n, \mathbf{x}_m)$ in the $\mathcal{X}$-space is no less than $\epsilon$. That is,

$$\|\mathbf{x}_n - \mathbf{x}_m\| \ge \epsilon \quad \forall\, n \ne m.$$

What is the tightest lower bound of $\gamma$ that ensures $E_{\text{in}}(\hat{h}) = 0$? Choose the correct answer; explain your answer.

[a] $\frac{\ln^2(N+1)}{\epsilon^2}$ [b] $\frac{\ln(N+1)}{\epsilon^2}$ [c] $\frac{\ln(N)}{\epsilon^2}$ [d] $\frac{\ln(N-1)}{\epsilon^2}$ [e] $\frac{\ln^2(N-1)}{\epsilon^2}$
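To get a feeling for the claim, $\hat{h}$ is cheap to evaluate numerically. Here is a minimal sketch (assuming numpy, with hypothetical inputs X of shape N × d and y in {+1, −1}) that measures $E_{\text{in}}(\hat{h})$ for a chosen $\gamma$; it does not derive the bound asked for above.

```python
# Sketch: E_in of h_hat(x) = sign( sum_n y_n * exp(-gamma * ||x_n - x||^2) ), assuming numpy.
import numpy as np

def ein_hhat(X, y, gamma):
    # pairwise squared distances between all training examples
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # score of h_hat on each training example (alpha = all ones, b = 0)
    scores = (np.exp(-gamma * sq_dist) * y[None, :]).sum(axis=1)
    return np.mean(np.sign(scores) != y)  # 0/1 in-sample error
```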


Kernel Perceptron Learning Algorithm

10.

(Lecture 203) In this problem, we are going to apply the kernel trick to the perceptron learning algorithm introduced in Machine Learning Foundations. If we run the perceptron learning algorithm on the transformed examples $\{(\phi(\mathbf{x}_n), y_n)\}_{n=1}^{N}$, the algorithm updates $\mathbf{w}_t$ to $\mathbf{w}_{t+1}$ when the current $\mathbf{w}_t$ makes a mistake on $(\phi(\mathbf{x}_{n(t)}), y_{n(t)})$:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)}\,\phi(\mathbf{x}_{n(t)}).$$

Because every update is based on one (transformed) example, if we take $\mathbf{w}_0 = \mathbf{0}$, we can represent every $\mathbf{w}_t$ as a linear combination of $\{\phi(\mathbf{x}_n)\}_{n=1}^{N}$. We can then maintain the linear combination coefficients instead of the whole $\mathbf{w}$. Assume that we maintain an $N$-dimensional vector $\boldsymbol{\alpha}_t$ in the $t$-th iteration such that

$$\mathbf{w}_t = \sum_{n=1}^{N} \alpha_{t,n}\,\phi(\mathbf{x}_n)$$

for $t = 0, 1, 2, \ldots$. Set $\boldsymbol{\alpha}_0 = \mathbf{0}$ ($N$ zeros) to match $\mathbf{w}_0 = \mathbf{0}$ ($\tilde{d} + 1$ zeros). How should $\boldsymbol{\alpha}_t$ be updated to $\boldsymbol{\alpha}_{t+1}$ when the current $\mathbf{w}_t$ (represented by $\boldsymbol{\alpha}_t$) makes a mistake on $(\phi(\mathbf{x}_{n(t)}), y_{n(t)})$? Choose the correct answer; explain your answer.

[a] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t$ except $\alpha_{t+1,n(t)} \leftarrow \alpha_{t,n(t)} + 1$

[b] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t$ except $\alpha_{t+1,n(t)} \leftarrow \alpha_{t,n(t)} - 1$

[c] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t$ except $\alpha_{t+1,n(t)} \leftarrow \alpha_{t,n(t)} + y_{n(t)}$

[d] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t$ except $\alpha_{t+1,n(t)} \leftarrow \alpha_{t,n(t)} - y_{n(t)}$

[e] $\boldsymbol{\alpha}_{t+1} \leftarrow \boldsymbol{\alpha}_t + \mathbf{y}$

(Hint: Although we did not teach Lecture 205, if you have watched it by yourself from YouTube, you will find its page 15 loosely related. You should be able to solve this problem without watching Lecture 205, though.)

11.

(Lecture 203) Following Problem 10, the update rule takes care of the training iterations. In addition, we need to evaluate $\mathbf{w}_t^T\phi(\mathbf{x})$ not only for predicting a new $\mathbf{x}$ but also for checking whether $\mathbf{w}_t$ makes any mistake on some example $\mathbf{x}$ during training. Which of the following equations computes $\mathbf{w}_t^T\phi(\mathbf{x})$ with the kernel trick $K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$? Choose the correct answer; explain your answer.

[a] $\sum_{n=1}^{N} \alpha_{t,n}\,K(\mathbf{x}_n, \mathbf{x})$

[b] $-\sum_{n=1}^{N} \alpha_{t,n}\,K(\mathbf{x}_n, \mathbf{x})$

[c] $\sum_{n=1}^{N} y_n\,\alpha_{t,n}\,K(\mathbf{x}_n, \mathbf{x})$

[d] $\sum_{n=1}^{N} \alpha_{t,n}^2\,(K(\mathbf{x}_n, \mathbf{x}))^2$

[e] $\sum_{n=1}^{N} \alpha_{t,n}^2\,K(\mathbf{x}_n, \mathbf{x})$


Soft-Margin SVM

12.

(Lecture 204) Consider the soft-margin SVM taught in our class. Assume that after solving the dual problem, every example is a bounded support vector. That is, the optimal solution $\boldsymbol{\alpha}^\star$ satisfies $\alpha^\star_n = C$ for every example. In this case, there may be multiple solutions for the optimal $b$ of the primal SVM problem. What is the largest such $b$? Choose the correct answer; explain your answer.

[a] $\min\limits_{n=1,2,\ldots,N}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[b] $\min\limits_{n\,:\,y_n>0}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[c] $\min\limits_{n\,:\,y_n<0}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[d] $\operatorname*{average}\limits_{n\,:\,y_n>0}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[e] $\operatorname*{average}\limits_{n\,:\,y_n<0}\;\left(1 - \sum_{m=1}^{N} y_m\alpha^\star_m K(\mathbf{x}_n, \mathbf{x}_m)\right)$

13.

(Lecture 204) In class, we taught the non-linear soft-margin SVM as follows.

$$(P_1)\qquad \begin{aligned}
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n \\
\text{subject to} \quad & y_n\left(\mathbf{w}^T\phi(\mathbf{x}_n) + b\right) \ge 1 - \xi_n, \quad \text{for } n = 1, 2, \ldots, N, \\
& \xi_n \ge 0, \quad \text{for } n = 1, 2, \ldots, N.
\end{aligned}$$

The SVM penalizes the margin violation linearly. Another popular formulation penalizes the margin violation quadratically. In this problem, we derive the dual of such a formulation. The formulation is as follows:

$$(P_2)\qquad \begin{aligned}
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n^2 \\
\text{subject to} \quad & y_n\left(\mathbf{w}^T\phi(\mathbf{x}_n) + b\right) \ge 1 - \xi_n, \quad \text{for } n = 1, 2, \ldots, N.
\end{aligned}$$

We do not have the $\xi_n \ge 0$ constraints, as any negative $\xi_n$ would never be part of an optimal solution of $(P_2)$ (you are encouraged to think about why). Anyway, the dual problem of $(P_2)$ will look like this:

$$(D_2)\qquad \begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \alpha_n\alpha_m y_n y_m\cdot\Diamond \;-\; \sum_{n=1}^{N}\alpha_n \\
\text{subject to} \quad & \sum_{n=1}^{N} y_n\alpha_n = 0, \\
& \alpha_n \ge 0, \quad \text{for } n = 1, 2, \ldots, N.
\end{aligned}$$

Let the kernel function $K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$. What is $\Diamond$? Choose the correct answer; explain your answer.

[a] $\left(2C\cdot K(\mathbf{x}_n, \mathbf{x}_m)\right)$

[b] $\left(K(\mathbf{x}_n, \mathbf{x}_m) + 2C\,[\![ n = m ]\!]\right)$

[c] $\left(K(\mathbf{x}_n, \mathbf{x}_m) + C\,[\![ n = m ]\!]\right)$

[d] $\left(K(\mathbf{x}_n, \mathbf{x}_m) + \frac{1}{C}\,[\![ n = m ]\!]\right)$

[e] $\left(K(\mathbf{x}_n, \mathbf{x}_m) + \frac{1}{2C}\,[\![ n = m ]\!]\right)$


14.

(Lectures 202/204) After getting the optimal $\boldsymbol{\alpha}^\star$ for $(D_2)$, how can we calculate the optimal $\boldsymbol{\xi}^\star$ for $(P_2)$? Choose the correct answer; explain your answer.

[a] $\boldsymbol{\xi}^\star = \boldsymbol{\alpha}^\star$ [b] $\boldsymbol{\xi}^\star = 2\boldsymbol{\alpha}^\star$ [c] $\boldsymbol{\xi}^\star = C\boldsymbol{\alpha}^\star$ [d] $\boldsymbol{\xi}^\star = \frac{1}{C}\boldsymbol{\alpha}^\star$ [e] $\boldsymbol{\xi}^\star = \frac{1}{2C}\boldsymbol{\alpha}^\star$

Experiments with Soft-Margin SVM

For Problems 15 to 20, we are going to experiment with a real-world data set. Download the processed satimage data sets from LIBSVM Tools.

Training: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale

Testing: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.t

We will consider binary classification problems of the form “one of the classes” (as the positive class) versus “the other classes” (as the negative class).

The data set contains thousands of examples, and some quadratic programming packages cannot handle this size. We recommend that you consider the LIBSVM package

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Regardless of the package that you choose to use, please read the manual of the package carefully to make sure that you are indeed solving the soft-margin support vector machine taught in class like the dual formulation below:

$$\begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \alpha_n\alpha_m y_n y_m K(\mathbf{x}_n, \mathbf{x}_m) - \sum_{n=1}^{N}\alpha_n \\
\text{subject to} \quad & \sum_{n=1}^{N} y_n\alpha_n = 0, \\
& 0 \le \alpha_n \le C, \quad n = 1, \ldots, N.
\end{aligned}$$

In the following problems, please use the 0/1 error for evaluating $E_{\text{in}}$, $E_{\text{val}}$, and $E_{\text{out}}$ (through the test set). Some practical remarks include

(i) Please tell your chosen package to not automatically scale the data for you, lest you should change the effective kernel and get different results.

(ii) It is your responsibility to check whether your chosen package solves the designated formulation with enough numerical precision. Please read the manual of your chosen package for software parameters whose values affect the outcome—any ML practitioner needs to deal with this kind of added uncertainty.
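For instance, a minimal sketch of calling such a package from Python is given below. It assumes scikit-learn (whose SVC wraps LIBSVM) and that the two downloaded files are named satimage.scale and satimage.scale.t in the working directory, and it uses the setting of Problem 15 as an illustration; the numbers you report must come from your own run with your own chosen package.

```python
# Sketch: loading the satimage data and fitting one soft-margin SVM (assumes scikit-learn).
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

X_train, y_train = load_svmlight_file("satimage.scale")
X_test, y_test = load_svmlight_file("satimage.scale.t", n_features=X_train.shape[1])
X_train, X_test = X_train.toarray(), X_test.toarray()  # small enough to densify

# "3" versus "not 3": relabel to +1 / -1
y_train_bin = np.where(y_train == 3, 1, -1)
y_test_bin = np.where(y_test == 3, 1, -1)

# linear soft-margin SVM with C = 10; scikit-learn does not rescale the inputs by itself
clf = SVC(kernel="linear", C=10)
clf.fit(X_train, y_train_bin)

w = np.asarray(clf.coef_).ravel()  # w is only exposed for the linear kernel
print("||w|| =", np.linalg.norm(w))
print("E_in  =", np.mean(clf.predict(X_train) != y_train_bin))
print("E_out =", np.mean(clf.predict(X_test) != y_test_bin))
```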


15.

(Lectures 201/204, *) Consider the linear soft-margin SVM. That is, either solve the primal formulation of the soft-margin SVM with the given $\mathbf{x}_n$, or take the linear kernel $K(\mathbf{x}_n, \mathbf{x}_m) = \mathbf{x}_n^T\mathbf{x}_m$ in the dual formulation. With $C = 10$ and the binary classification problem of “3” versus “not 3”, which of the following numbers is closest to $\|\mathbf{w}\|$ after solving the linear soft-margin SVM? Choose the closest answer; provide your command/code.

[a] 7.0 [b] 7.5 [c] 8.0 [d] 8.5 [e] 9.0

16.

(Lectures 203/204, *) Consider the polynomial kernel $K(\mathbf{x}_n, \mathbf{x}_m) = (1 + \mathbf{x}_n^T\mathbf{x}_m)^Q$, where $Q$ is the degree of the polynomial. With $C = 10$ and $Q = 2$, which of the following soft-margin SVM classifiers reaches the lowest $E_{\text{in}}$? Choose the correct answer; provide your command/code.

[a] “1” versus “not 1”

[b] “2” versus “not 2”

[c] “3” versus “not 3”

[d] “4” versus “not 4”

[e] “5” versus “not 5”

17.

(Lectures 203/204, *) Following Problem 16, which of the following numbers is closest to the maximum number of support vectors within those five soft-margin SVM classifiers? Choose the closest answer; provide your command/code.

[a] 500 [b] 600 [c] 700 [d] 800 [e] 900

18.

(Lectures 203/204, *) Consider the Gaussian kernel $K(\mathbf{x}_n, \mathbf{x}_m) = \exp(-\gamma\|\mathbf{x}_n - \mathbf{x}_m\|^2)$. For the binary classification problem of “6” versus “not 6”, when fixing $\gamma = 10$, which of the following values of $C$ results in the lowest $E_{\text{out}}$? Choose the correct answer; provide your command/code.

[a] 0.01 [b] 0.1 [c] 1 [d] 10 [e] 100

19.

(Lectures 203/204, *) Following Problem 18, when fixing $C = 0.1$, which of the following values of $\gamma$ results in the lowest $E_{\text{out}}$? Choose the correct answer; provide your command/code.

[a] 0.1 [b] 1 [c] 10 [d] 100 [e] 1000


20.

(Lectures 203/204, *) Following Problem 18, consider a validation procedure that randomly samples 200 examples from the training set for validation and leaves the other examples for training $g^-_{\mathrm{SVM}}$. Fix $C = 0.1$ and use the validation procedure to choose the best $\gamma$ among $\{0.1, 1, 10, 100, 1000\}$ according to $E_{\text{val}}$. If there is a tie of $E_{\text{val}}$, choose the smallest $\gamma$. Repeat the procedure 1000 times. Which of the following values of $\gamma$ is selected the largest number of times? Choose the correct answer; provide your command/code.

[a] 0.1 [b] 1 [c] 10 [d] 100 [e] 1000
