A Brief Introduction to SVM-HMM - 利用結構性支撐向量機的具音樂表現能力之半自動電腦演奏系統

In this thesis, we use the structural support vector machine to learn performance knowledge from expressive performance samples. Unlike the traditional SVM algorithm, which only produces univariate predictions, the structural SVM can produce structured predictions such as trees, graphs, or sequences. The structural SVM with hidden Markov model output (SVM-HMM) has been successfully applied to the part-of-speech tagging problem [56].

There are some similarities between the part-of-speech tagging problem and the expressive performance problem. In part-of-speech tagging, one tries to identify the role that each word plays in the sentence, while in expressive performance, one tries to determine how a note should be played, usually based on its role in the musical phrase.

Thus, we believe that SVM-HMM is also a good candidate for expressive performance.

The following introduction and formulas are summaries of [56--58].

The traditional SVM prediction problem can be described as finding a function

$$h : X \to Y$$

with the lowest prediction error, where X is the input feature space and Y is the prediction space.

In a traditional SVM, elements in Y are labels (classification) or real values (regression).

A structural SVM extends this framework to generate structured outputs such as trees, graphs, or sequences. To support structured outputs, the problem is reformulated as finding a discriminant function

$$F : X \times Y \to \mathbb{R}$$

that maps each input/output pair to a real-valued score. To predict an output y for an input x, one maximizes F over all y ∈ Y:

$$f(x) = \arg\max_{y \in Y} F(x, y; w)$$

Let F be a linear function of the following form:

$$F(x, y; w) = w^\top \Psi(x, y)$$

where w is the parameter vector and Ψ(x, y) is the joint feature map relating input x to output y. Ψ can be defined to accommodate various kinds of structure.
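As a minimal sketch of this prediction rule, the following Python snippet enumerates every candidate output for a short note sequence and returns the one with the highest linear score. The label set, the toy feature map `joint_feature`, and the weight values are hypothetical and chosen only for illustration. Exhaustive enumeration is feasible only for very small output spaces.

```python
from itertools import product

LABELS = ("soft", "loud")  # hypothetical per-note label set


def joint_feature(x, y):
    """Toy joint feature map Psi(x, y): per-label sums of the inputs
    plus a count of neighbouring notes that share a label."""
    sums = {label: 0.0 for label in LABELS}
    for value, label in zip(x, y):
        sums[label] += value
    repeats = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [sums["soft"], sums["loud"], repeats]


def score(w, x, y):
    """Linear discriminant F(x, y; w) = w^T Psi(x, y)."""
    return sum(wi * fi for wi, fi in zip(w, joint_feature(x, y)))


def predict(w, x):
    """f(x): the label sequence with the highest score, found by enumeration."""
    return max(product(LABELS, repeat=len(x)), key=lambda y: score(w, x, y))


w = [-1.0, 1.0, 0.5]   # assumed weights, not learned
x = [-0.3, 0.9, 0.8]   # one scalar feature per note
print(predict(w, x))
```

With these assumed weights, labelling the first note "soft" and the other two "loud" scores 2.5, slightly above the all-"loud" sequence at 2.4.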

For each structure we want to predict, a loss function that measures the accuracy of a prediction is required. A loss function ∆ : Y × Y → R needs to satisfy the following properties:

$$\Delta(y, y') \ge 0 \quad \text{for } y \ne y'$$

$$\Delta(y, y) = 0$$

The loss function is assumed to be bounded. Assuming that the input-output pair (x, y) is drawn from a joint distribution P(x, y), the prediction problem is to minimize the total loss:

$$R^{\Delta}_{P}(f) = \int_{X \times Y} \Delta(y, f(x)) \, dP(x, y)$$

Since we cannot directly find the distribution P, we need to replace this total loss with an empirical loss, which can be calculated from the observed training set of (x_i, y_i) pairs:

$$R^{\Delta}_{S}(f) = \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, f(x_i))$$
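The empirical loss can be illustrated with a short Python sketch. The Hamming distance is used here as an assumed loss function ∆ (it is non-negative, bounded by the sequence length, and zero for identical sequences), and `threshold_predictor` is a hypothetical stand-in for a learned f.

```python
def hamming_loss(y_true, y_pred):
    """Delta(y, y'): number of positions where two label sequences differ."""
    if len(y_true) != len(y_pred):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(y_true, y_pred) if a != b)


def empirical_risk(samples, predictor, loss=hamming_loss):
    """Average loss of `predictor` over the training pairs (x_i, y_i)."""
    return sum(loss(y, predictor(x)) for x, y in samples) / len(samples)


def threshold_predictor(x):
    """Hypothetical predictor: a note is 'loud' when its feature exceeds 0.5."""
    return tuple("loud" if value > 0.5 else "soft" for value in x)


samples = [
    ((0.2, 0.9, 0.8), ("soft", "loud", "loud")),  # predicted correctly
    ((0.7, 0.4), ("soft", "soft")),               # first note mispredicted
]
print(empirical_risk(samples, threshold_predictor))  # 0.5
```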

Now we are ready to extend SVM to structured output, starting with the linearly separable case, which we will then extend to a soft-margin formulation.

The linearly separable case can be expressed by a set of linear constraints

$$\forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in Y \setminus \{y_i\} : \quad w^\top [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge 0$$

The constraints imply that the ground truth y_i for x_i has an F value at least as large as that of any other ŷ_i ≠ y_i.

The key concept of SVM is the large-margin principle. We want not only to find a solution that satisfies the constraints, but also to maximize the margin between the ground truth and the second-best ŷ_i:

$$\max_{\gamma,\, w : \|w\| = 1} \gamma$$

$$\text{s.t.}\ \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in Y \setminus \{y_i\} : \quad w^\top [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge \gamma$$

which is equivalent to the convex quadratic programming problem

$$\min_{w} \frac{1}{2} \|w\|^2$$

$$\text{s.t.}\ \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in Y \setminus \{y_i\} : \quad w^\top [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge 1$$

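To extend the linearly separable case to the non-separable case, slack variables ξ_i are introduced to penalize prediction errors, which results in a soft-margin formulation (written here in the form given in [57]):

$$\min_{w,\, \xi_i \ge 0} \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i$$

$$\text{s.t.}\ \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in Y \setminus \{y_i\} : \quad w^\top [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge 1 - \xi_i$$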

C is the weighting parameter controlling the trade-off between low training error and large margin. The optimal C varies between different problems, so experiments should be conducted to find the optimal C for our problem.

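Intuitively, a constraint violation with a larger loss should be penalized more than one with a smaller loss. Tsochantaridis et al. [57] therefore proposed two possible ways to take the loss function into account. The first way is to re-scale the slack variable by the inverse of the loss, so a high loss leads to a smaller re-scaled slack variable:

$$\min_{w,\, \xi_i \ge 0} \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i$$

$$\text{s.t.}\ \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in Y \setminus \{y_i\} : \quad w^\top [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge 1 - \frac{\xi_i}{\Delta(y_i, \hat{y}_i)}$$

The second way is to re-scale the margin, which yields

$$\min_{w,\, \xi_i \ge 0} \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i$$

$$\text{s.t.}\ \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in Y \setminus \{y_i\} : \quad w^\top [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge \Delta(y_i, \hat{y}_i) - \xi_i$$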
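The difference between the two re-scaling schemes can be seen by computing the smallest slack that satisfies a single constraint. The sketch below assumes the constraint forms of [57]: score difference ≥ 1 − ξ/∆ for slack re-scaling and score difference ≥ ∆ − ξ for margin re-scaling, where the score difference is w^T[Ψ(x_i, y_i) − Ψ(x_i, ŷ_i)]. The function names and numeric values are illustrative only.

```python
def slack_rescaling(score_diff, loss):
    """Smallest xi >= 0 with score_diff >= 1 - xi / loss (requires loss > 0)."""
    return max(0.0, loss * (1.0 - score_diff))


def margin_rescaling(score_diff, loss):
    """Smallest xi >= 0 with score_diff >= loss - xi."""
    return max(0.0, loss - score_diff)


# A competing output with loss 3 that the ground truth beats by 0.5:
print(slack_rescaling(0.5, 3.0), margin_rescaling(0.5, 3.0))  # 1.5 2.5

# Same loss, but the ground truth now wins by 2.0:
print(slack_rescaling(2.0, 3.0), margin_rescaling(2.0, 3.0))  # 0.0 1.0
```

In the second case slack re-scaling is already satisfied because the score difference exceeds 1, while margin re-scaling still asks for a score difference at least as large as the loss.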

However, the above quadratic programming problem has a very large number (O(n|Y|)) of constraints, which takes considerable time to solve. Tsochantaridis et al. [57] proposed a greedy algorithm that speeds up the process by selecting only the constraints that contribute the most to finding the solution. Initially, the solver starts with an empty working set containing no constraints. The solver then iteratively scans the training set to find the most violated constraint for each sample under the current solution. If a constraint is violated by more than a desired threshold, it is added to the working set. The solver then re-calculates the solution under the new working set. The algorithm terminates once no more constraints can be added under the desired precision.
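The scan step of this working-set procedure can be sketched as follows. This is only an illustration under several assumptions: margin re-scaling is used, the loss is the Hamming distance, the most violated constraint is found by brute-force enumeration, and the toy feature map `psi` is hypothetical. The re-solving of the quadratic program after each scan is omitted.

```python
from itertools import product

LABELS = (0, 1)  # hypothetical label set


def psi(x, y):
    """Toy feature map: per-label input sums plus a count of repeated labels."""
    emission = [0.0] * len(LABELS)
    for value, label in zip(x, y):
        emission[label] += value
    repeats = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return emission + [repeats]


def dot(w, features):
    return sum(a * b for a, b in zip(w, features))


def hamming(y, y_hat):
    return sum(1 for a, b in zip(y, y_hat) if a != b)


def most_violated(w, x, y):
    """Return the y_hat != y maximizing Delta(y, y_hat) - w^T[psi(x, y) - psi(x, y_hat)]."""
    true_score = dot(w, psi(x, y))
    best_y, best_violation = None, float("-inf")
    for y_hat in product(LABELS, repeat=len(x)):
        if y_hat == tuple(y):
            continue
        violation = hamming(y, y_hat) - (true_score - dot(w, psi(x, y_hat)))
        if violation > best_violation:
            best_y, best_violation = y_hat, violation
    return best_y, best_violation


def scan(w, samples, slacks, eps, working_set):
    """One pass over the training set; returns how many constraints were added."""
    added = 0
    for i, (x, y) in enumerate(samples):
        y_hat, violation = most_violated(w, x, y)
        if violation > slacks[i] + eps and (i, y_hat) not in working_set:
            working_set.add((i, y_hat))
            added += 1
    return added


samples = [((0.2, 0.9, 0.8), (0, 1, 1))]
working_set = set()
added = scan([0.0, 0.0, 0.0], samples, slacks=[0.0], eps=0.1, working_set=working_set)
print(added, working_set)
```

Starting from w = 0, every competing output has a score difference of zero, so the most violated constraint is simply the output with the largest loss.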

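In a later work, Joachims et al. [56] created a new formulation and an algorithm that further speed up training. Instead of using one slack variable for each training sample, which results in a total of n slack variables, they use a single slack variable shared by all n training samples. The following formula is the 1-slack version of the slack-rescaling structural SVM:

$$\min_{w,\, \xi \ge 0} \frac{1}{2} \|w\|^2 + C \xi$$

$$\text{s.t.}\ \forall (\hat{y}_1, \dots, \hat{y}_n) \in Y^n : \quad \frac{1}{n} w^\top \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \xi$$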

A detailed proof that the new formulation is as general as the old one is given in [56].

With the framework described above, the only problem left is how to define the general loss function and Ψ. Drawing on the concepts of inter-state and time dependencies from the hidden Markov model, Altun et al. [58] proposed two types of features for an equal-length observation/label sequence pair (x, y). The first is the interaction of an observed feature x^s with a label y^t; the other is the interaction between neighboring labels y^s and y^t.

To illustrate the method, we use an example from music. Let Ψ_r(x^s) be the r-th observed feature of the note at the s-th position of phrase x, and let [[y^t = τ]] indicate that the t-th note is played at a velocity of τ. The interaction of the observed feature and the label can then be written as:

$$\psi^{st}_{r\tau}(x, y) = [\![ y^t = \tau ]\!] \, \Psi_r(x^s), \quad 1 \le r \le d,\ \tau \in \Sigma$$

Figure 3.2: Learning phase flow chart

The interaction between labels can be written as:

$$\hat{\psi}^{st}_{\sigma\tau}(x, y) = [\![ y^s = \sigma \wedge y^t = \tau ]\!], \quad \sigma, \tau \in \Sigma$$

By selecting an order of dependency for the HMM, we can further restrict the values of s and t. For example, for a first-order HMM, s = t for the first feature and s = t − 1 for the second feature. The two features at the same time t are then stacked into a vector Ψ(x, y; t). The feature map for the whole sequence is simply the sum of all the feature vectors:

$$\Psi(x, y) = \sum_{t=1}^{T} \Psi(x, y; t)$$
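A minimal Python sketch of this construction for a first-order model is given below. It assumes that each note x^t is a tuple of d numeric features and that labels come from a small hypothetical set. The function names and the layout of the vector (an emission block followed by a transition block) are illustrative choices rather than the layout of any particular implementation.

```python
def local_features(x, y, t, labels, d):
    """Psi(x, y; t): emission block [[y^t = tau]] * x^t stacked on
    transition block [[y^(t-1) = sigma and y^t = tau]]."""
    k = len(labels)
    emission = [0.0] * (k * d)
    transition = [0.0] * (k * k)
    tau = labels.index(y[t])
    for r in range(d):
        emission[tau * d + r] = x[t][r]
    if t > 0:  # the first note has no predecessor
        sigma = labels.index(y[t - 1])
        transition[sigma * k + tau] = 1.0
    return emission + transition


def feature_map(x, y, labels):
    """Psi(x, y): sum of the per-position vectors Psi(x, y; t)."""
    d = len(x[0])
    total = [0.0] * (len(labels) * d + len(labels) ** 2)
    for t in range(len(x)):
        for j, value in enumerate(local_features(x, y, t, labels, d)):
            total[j] += value
    return total


labels = ("p", "f")                        # hypothetical dynamics labels
x = [(1.0, 0.0), (0.5, 1.0), (0.0, 2.0)]   # three notes, d = 2 features each
y = ("p", "f", "f")
print(feature_map(x, y, labels))
# [1.0, 0.0, 0.5, 3.0, 0.0, 1.0, 0.0, 1.0]
```

The first four entries accumulate the note features under each label, and the last four count the label transitions p→p, p→f, f→p, and f→f.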

The distance, i.e. the general loss function, between two feature maps depends on the number of common label segments and the inner product between the input features sequence with common labels.

$$\Delta(\Psi(x, y), \Psi(\hat{x}, \hat{y})) = \sum_{s,t} [\![ y^{s-1} = \hat{y}^{t-1} \wedge y^{s} = \hat{y}^{t} ]\!] + \sum_{s,t} [\![ y^{s} = \hat{y}^{t} ]\!] \, k(x^{s}, \hat{x}^{t})$$

Finally, during prediction, a Viterbi-like decoding algorithm is used to efficiently find a y that maximizes F.
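The decoding step can be sketched with a standard Viterbi recursion for a first-order model. The sketch assumes that the weight vector has been split into per-label emission weights `w_emit` and a label-to-label transition table `w_trans`; both names and all numeric values are hypothetical.

```python
def viterbi(x, w_emit, w_trans):
    """Return the label-index sequence that maximizes the sum of
    emission scores w_emit[label] . x^t and transition scores w_trans[prev][label]."""
    k = len(w_emit)
    emit = [[sum(wr * xr for wr, xr in zip(w_emit[lab], obs)) for lab in range(k)]
            for obs in x]
    best = [emit[0][:]]   # best[t][lab]: best score of a path ending in lab at t
    back = []             # back[t-1][lab]: best predecessor of lab at time t
    for t in range(1, len(x)):
        scores, pointers = [], []
        for tau in range(k):
            prev = max(range(k), key=lambda s: best[-1][s] + w_trans[s][tau])
            pointers.append(prev)
            scores.append(best[-1][prev] + w_trans[prev][tau] + emit[t][tau])
        best.append(scores)
        back.append(pointers)
    last = max(range(k), key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):   # follow back-pointers to the first note
        path.append(pointers[path[-1]])
    return path[::-1]


w_emit = [[1.0, 0.0], [0.0, 1.0]]    # label 0 favours feature 0, label 1 feature 1
w_trans = [[0.5, 0.0], [0.0, 0.5]]   # small bonus for repeating a label
x = [(1.0, 0.0), (0.4, 0.6), (0.0, 1.0)]
print(viterbi(x, w_emit, w_trans))   # [0, 1, 1]
```

The recursion visits each position once and compares every pair of labels, so its cost grows linearly with the sequence length rather than exponentially as in exhaustive enumeration.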
