Condensed Filter Tree for Cost-Sensitive Multi-Label Classification
A. Proof of Theorem 1
Theorem 1. Under the proper ordering and $K$-classifier tricks, for each $x$ and the multi-label classifier $h$ formed by chaining $K$ binary classifiers $(h_1, \ldots, h_K)$ as in the prediction procedure of Filter Tree, the regret $rg(h, P)$ satisfies
\[
rg(h, P) \le \sum_{t \in \langle r, y^* \rangle} [\![\, h_k(x, t) \neq y[k] \,]\!] \; rg\bigl(h_k(x, t), \mathrm{FT}_t(P, h_{k+1}, \ldots, h_K)\bigr),
\]
where $k$ denotes the layer that $t$ is on, and $\mathrm{FT}_t(P, h_{k+1}, \ldots, h_K)$ represents the procedure that generates weighted examples $(x, b, w)$ to train the node at index $t$, based on sampling $y$ from $P|x$ and considering the predictions of the classifiers in the lower layers.
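For intuition, the chained prediction procedure referred to in the theorem can be sketched as follows. This is a minimal sketch under our own conventions: the function name, the root index 1, and the `2 * node + bit` child indexing are illustrative assumptions, not the paper's implementation.

def predict(x, classifiers):
    """Follow the chained binary classifiers (h_1, ..., h_K) down the tree.

    Each classifiers[k-1](x, node) returns 0 or 1; the k-th decision becomes
    the k-th bit of the predicted label vector and also selects which child
    node the procedure moves to next.  Illustrative sketch only.
    """
    node = 1                      # root index r (our convention)
    y_hat = []
    for h_k in classifiers:
        bit = h_k(x, node)        # weighted binary decision at this layer
        y_hat.append(bit)
        node = 2 * node + bit     # descend to child t_0 or t_1
    return y_hat                  # the leaf reached, read as a label vector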
Proof. The proof is similar to the one in (Beygelzimer et al., 2008), which is based on defining the overall regret of any subtree. The key change in our proof is to define the path-regret of any subtree as the total regret of the nodes on the ideal path of the subtree. The induction step follows similarly from the proof in (Beygelzimer et al., 2008) by considering two cases: one where the ideal prediction lies in the left subtree and one where it lies in the right. An induction from layer $K$ to the root then proves the theorem.
For each node $t$ on layer $k$, $h_k$ makes a weighted binary classification decision of $0$ or $1$, which directs the prediction procedure to move to either the node $t_0$ or $t_1$. Without loss of generality, assume $h_k(x, t) = 1$. We denote by $\hat{t}$ the prediction (leaf) on $x$ when starting at node $t$. For each leaf node $\tilde{y}$, let $\bar{C}(\tilde{y}) \equiv E_{y \sim P|x}\, C(y, \tilde{y})$. Then the node regret $rg(t)$ is simply $\bar{C}(\hat{t}_1) - \min_{i \in \{0, 1\}} \bar{C}(\hat{t}_i)$. Obviously, $rg(t) \ge \bar{C}(\hat{t}_1) - \bar{C}(\hat{t}_0)$ for every node $t$.
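To make the node-regret definition concrete, here is a minimal sketch; the names `C_bar`, `leaf_if_0`, `leaf_if_1`, and `decision` are our own illustrative choices, not the paper's.

def node_regret(C_bar, leaf_if_0, leaf_if_1, decision):
    """Node regret rg(t) under the definition above (illustrative sketch).

    C_bar maps a leaf (label vector) to its expected cost; leaf_if_0 and
    leaf_if_1 are the leaves reached when the node predicts 0 or 1, and
    decision is the node's actual prediction h_k(x, t) in {0, 1}.
    """
    costs = (C_bar[leaf_if_0], C_bar[leaf_if_1])
    return costs[decision] - min(costs)

In particular, when `decision` is 1 the returned value is at least `C_bar[leaf_if_1] - C_bar[leaf_if_0]`, matching the inequality stated above.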
In addition to the regret of nodes, we also define the regret of the subtree $T_t$ rooted at node $t$. The regret of the subtree $T_t$ is defined as the regret of the predicted path (vector) $\hat{t}$ within the subtree $T_t$, that is, $rg(T_t) = \bar{C}(\hat{t}) - \bar{C}(t^*)$, where $t^*$ denotes the optimal prediction (leaf node) in the subtree $T_t$. By this definition, $rg(h, P)$ can be treated as $rg(T_r)$.
We now prove by induction from layer $K$ to the root. The induction hypothesis is that
\[
rg(T_t) \le \sum_{t' \in \langle t, t^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t'),
\]
where $k$ is the corresponding layer of each node $t'$. The hypothesis states that the regret of the subtree is bounded by the sum of the regrets of the wrongly predicted nodes on the path from $t$ to the ideal prediction $t^*$. The base case is a reduction tree with a single internal node $t$ and two leaf nodes, which is simply a cost-sensitive binary classification problem with $rg(T_t) = rg(t)$ trivially. If the classifier at $t$ predicts correctly, then $rg(T_t) = rg(t) = 0$ and the sum is empty; otherwise the wrongly predicted node $t$ contributes exactly $rg(t)$ to the sum. In either case the induction hypothesis is satisfied.
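As a concrete illustration of the base case (the costs below are made-up numbers for the example only): suppose the two leaves under $t$ have expected costs
\[
\bar{C}(\hat{t}_0) = 0.2, \qquad \bar{C}(\hat{t}_1) = 0.5 .
\]
If the classifier at $t$ predicts $1$, then $rg(T_t) = rg(t) = 0.5 - 0.2 = 0.3$ and the single wrongly predicted node $t$ accounts for the whole right-hand side; if it predicts $0$, then $rg(T_t) = rg(t) = 0$ and the bound holds with an empty sum.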
For the inductive step, consider a node $t$ on layer $k$ and assume
\[
R_0 \equiv rg(T_{t_0}) \le \sum_{t' \in \langle t_0, t_0^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t')
\]
and
\[
R_1 \equiv rg(T_{t_1}) \le \sum_{t' \in \langle t_1, t_1^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t'),
\]
where $t_0^*$ and $t_1^*$ denote the optimal predictions (leaves) within $T_{t_0}$ and $T_{t_1}$, respectively.
The optimal prediction $t^*$ is either in the right subtree $T_{t_1}$ or in the left subtree $T_{t_0}$. In the first case, $t^* = t_1^*$ and $y[k] = h_k(x, t) = 1$, so
\[
\begin{aligned}
rg(T_t) &= \bar{C}(\hat{t}_1) - \bar{C}(t^*) = \bar{C}(\hat{t}_1) - \bar{C}(t_1^*) = R_1 \\
&\le \sum_{t' \in \langle t_1, t_1^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t') \\
&= \sum_{t' \in \langle t, t^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t'),
\end{aligned}
\]
where the last equality holds because the path $\langle t, t^* \rangle$ only adds the node $t$ itself, which contributes nothing to the sum since $h_k(x, t) = y[k]$.
In the second case, $t^* = t_0^*$ and $y[k] \neq h_k(x, t) = 1$, so
\[
\begin{aligned}
rg(T_t) &= \bar{C}(\hat{t}_1) - \bar{C}(t^*) = \bar{C}(\hat{t}_1) - \bar{C}(t_0^*) \\
&= \bar{C}(\hat{t}_1) - \bar{C}(\hat{t}_0) + \bar{C}(\hat{t}_0) - \bar{C}(t_0^*) \\
&\le rg(t) + R_0 \\
&\le rg(t) + \sum_{t' \in \langle t_0, t_0^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t') \\
&= \sum_{t' \in \langle t, t^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t'),
\end{aligned}
\]
where the last equality holds because the path $\langle t, t^* \rangle$ only adds the node $t$ itself, which is wrongly predicted ($h_k(x, t) \neq y[k]$) and therefore contributes exactly $rg(t)$ to the sum.
This completes the induction; applying the hypothesis at the root $r$, where $rg(h, P) = rg(T_r)$ and $t^* = y^*$, proves the theorem.
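To see the bound end to end, the following self-contained sketch numerically compares both sides of Theorem 1 on a tiny depth-$K$ tree. All names (`verify_bound`, `C_bar`, `decisions`) and the example numbers are our own assumptions for illustration, not part of the paper.

def verify_bound(C_bar, decisions):
    """Numerically compare both sides of the regret bound (illustrative only).

    C_bar maps each leaf, written as a tuple of K bits (a label vector), to
    its expected cost; decisions[prefix] is the bit predicted by the internal
    node whose root-to-node path is the tuple `prefix`.
    """
    K = len(next(iter(C_bar)))

    def leaf_from(prefix):
        # Follow the chained classifiers from the node `prefix` down to a leaf.
        path = list(prefix)
        while len(path) < K:
            path.append(decisions[tuple(path)])
        return tuple(path)

    y_star = min(C_bar, key=C_bar.get)     # ideal prediction y*
    y_hat = leaf_from(())                  # actual prediction from the root
    lhs = C_bar[y_hat] - C_bar[y_star]     # rg(h, P)

    rhs = 0.0
    for k in range(K):                     # nodes on the ideal path <r, y*>
        t = y_star[:k]
        if decisions[t] != y_star[k]:      # only wrongly predicted nodes count
            leaves = [C_bar[leaf_from(t + (b,))] for b in (0, 1)]
            rhs += leaves[decisions[t]] - min(leaves)   # node regret rg(t)
    return lhs, rhs                        # Theorem 1 says lhs <= rhs

# Example with K = 2 (made-up costs): the optimal leaf is (0, 1) but the root
# sends the prediction to (1, 1), so both sides equal 0.25 and the bound is tight.
C_bar = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.75, (1, 1): 0.5}
decisions = {(): 1, (0,): 1, (1,): 1}
print(verify_bound(C_bar, decisions))      # (0.25, 0.25)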
B. Datasets
Table 1 summarizes the basic statistics of the datasets used.
Table 1. The properties of each dataset.

Dataset     # Instances    # Labels (K)
CAL500              502             174
emotions            593               6
enron              1702              53
imdb              86290              28
medical             662              45
scene              2407               6
slash              3279              22
tmc               28596              22
yeast              2389             144
References
Beygelzimer, A., Langford, J., and Ravikumar, P. Error-correcting tournaments, 2008.