Condensed Filter Tree for Cost-Sensitive Multi-Label Classification
A. Proof of Theorem 1
Theorem 1. Under the proper ordering and $K$-classifier tricks, for each $x$ and the multi-label classifier $h$ formed by chaining $K$ binary classifiers $(h_1, \ldots, h_K)$ as in the prediction procedure of Filter Tree, the regret $rg(h, P)$ satisfies
\[
rg(h, P) \le \sum_{t \in \langle r, y^* \rangle} [\![\, h_k(x, t) \neq y[k] \,]\!] \; rg\bigl(h_k(x, t), \mathrm{FT}_t(P, h_{k+1}, \ldots, h_K)\bigr),
\]
where $k$ denotes the layer that $t$ is on, and $\mathrm{FT}_t(P, h_{k+1}, \ldots, h_K)$ represents the procedure that generates weighted examples $(x, b, w)$ to train the node at index $t$, based on sampling $y$ from $P|x$ and considering the predictions of the classifiers in the lower layers.
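For intuition, the chained prediction procedure referred to in the theorem can be sketched as follows. This is a minimal sketch under our own conventions: the function name, the root index 1, and the `2 * node + bit` child indexing are illustrative assumptions, not the paper's implementation.

def predict(x, classifiers):
    """Follow the chained binary classifiers (h_1, ..., h_K) down the tree.

    Each classifiers[k-1](x, node) returns 0 or 1; the k-th decision becomes
    the k-th bit of the predicted label vector and also selects which child
    node the procedure moves to next.  Illustrative sketch only.
    """
    node = 1                      # root index r (our convention)
    y_hat = []
    for h_k in classifiers:
        bit = h_k(x, node)        # weighted binary decision at this layer
        y_hat.append(bit)
        node = 2 * node + bit     # descend to child t_0 or t_1
    return y_hat                  # the leaf reached, read as a label vector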
Proof. The proof is similar to the one in (Beygelzimer et al., 2008), which is based on defining the overall regret of any subtree. The key change in our proof is to define the path-regret of any subtree as the total regret of the nodes on the ideal path of the subtree. The induction step follows similarly from the proof in (Beygelzimer et al., 2008) by considering two cases: one where the ideal prediction lies in the left subtree and one where it lies in the right. An induction from layer $K$ to the root then proves the theorem.
For each node $t$ on layer $k$, $h_k$ makes a weighted binary classification decision of $0$ or $1$, which directs the prediction procedure to move to either the node $t_0$ or $t_1$. Without loss of generality, assume $h_k(x, t) = 1$. We denote by $\hat{t}$ the prediction (leaf) on $x$ when starting at node $t$. For each leaf node $\tilde{y}$, let $\bar{C}(\tilde{y}) \equiv E_{y \sim P|x}\, C(y, \tilde{y})$. Then the node regret $rg(t)$ is simply $\bar{C}(\hat{t}_1) - \min_{i \in \{0, 1\}} \bar{C}(\hat{t}_i)$. Obviously, $rg(t) \ge \bar{C}(\hat{t}_1) - \bar{C}(\hat{t}_0)$ for every node $t$.
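To make the node-regret definition concrete, here is a minimal sketch; the names `C_bar`, `leaf_if_0`, `leaf_if_1`, and `decision` are our own illustrative choices, not the paper's.

def node_regret(C_bar, leaf_if_0, leaf_if_1, decision):
    """Node regret rg(t) under the definition above (illustrative sketch).

    C_bar maps a leaf (label vector) to its expected cost; leaf_if_0 and
    leaf_if_1 are the leaves reached when the node predicts 0 or 1, and
    decision is the node's actual prediction h_k(x, t) in {0, 1}.
    """
    costs = (C_bar[leaf_if_0], C_bar[leaf_if_1])
    return costs[decision] - min(costs)

In particular, when `decision` is 1 the returned value is at least `C_bar[leaf_if_1] - C_bar[leaf_if_0]`, matching the inequality stated above.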
In addition to the regret of nodes, we also define the regret of the subtree $T_t$ rooted at node $t$. The regret of the subtree $T_t$ is defined as the regret of the predicted path (vector) $\hat{t}$ within the subtree $T_t$, that is, $rg(T_t) = \bar{C}(\hat{t}) - \bar{C}(t^*)$, where $t^*$ denotes the optimal prediction (leaf node) in the subtree $T_t$. By this definition, $rg(h, P)$ can be treated as $rg(T_r)$.
We now prove by induction from layer $K$ to the root. The induction hypothesis is that
\[
rg(T_t) \le \sum_{t' \in \langle t, t^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t'),
\]
where $k$ is the corresponding layer of each node $t'$. The hypothesis states that the regret of the subtree is bounded by the sum of the regrets of the wrongly predicted nodes on the path from $t$ to the ideal prediction $t^*$. The base case is a reduction tree with a single internal node $t$ and two leaf nodes, which is simply a cost-sensitive binary classification problem with $rg(T_t) = rg(t)$ trivially. If the classifier at $t$ predicts correctly, then $rg(T_t) = rg(t) = 0$ and the sum is empty; otherwise the wrongly predicted node $t$ contributes exactly $rg(t)$ to the sum. In either case the induction hypothesis is satisfied.
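As a concrete illustration of the base case (the costs below are made-up numbers for the example only): suppose the two leaves under $t$ have expected costs
\[
\bar{C}(\hat{t}_0) = 0.2, \qquad \bar{C}(\hat{t}_1) = 0.5 .
\]
If the classifier at $t$ predicts $1$, then $rg(T_t) = rg(t) = 0.5 - 0.2 = 0.3$ and the single wrongly predicted node $t$ accounts for the whole right-hand side; if it predicts $0$, then $rg(T_t) = rg(t) = 0$ and the bound holds with an empty sum.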
For the inductive step, consider a node $t$ on layer $k$ and assume
\[
R_0 \equiv rg(T_{t_0}) \le \sum_{t' \in \langle t_0, t_0^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t')
\]
and
\[
R_1 \equiv rg(T_{t_1}) \le \sum_{t' \in \langle t_1, t_1^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t'),
\]
where $t_0^*$ and $t_1^*$ denote the optimal predictions (leaves) within $T_{t_0}$ and $T_{t_1}$, respectively.
The optimal prediction $t^*$ is either in the right subtree $T_{t_1}$ or in the left subtree $T_{t_0}$. In the first case, $t^* = t_1^*$ and $y[k] = h_k(x, t) = 1$, so
\[
\begin{aligned}
rg(T_t) &= \bar{C}(\hat{t}_1) - \bar{C}(t^*) = \bar{C}(\hat{t}_1) - \bar{C}(t_1^*) = R_1 \\
&\le \sum_{t' \in \langle t_1, t_1^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t') \\
&= \sum_{t' \in \langle t, t^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t'),
\end{aligned}
\]
where the last equality holds because the path $\langle t, t^* \rangle$ only adds the node $t$ itself, which contributes nothing to the sum since $h_k(x, t) = y[k]$.
In the second case, $t^* = t_0^*$ and $y[k] \neq h_k(x, t) = 1$, so
\[
\begin{aligned}
rg(T_t) &= \bar{C}(\hat{t}_1) - \bar{C}(t^*) = \bar{C}(\hat{t}_1) - \bar{C}(t_0^*) \\
&= \bar{C}(\hat{t}_1) - \bar{C}(\hat{t}_0) + \bar{C}(\hat{t}_0) - \bar{C}(t_0^*) \\
&\le rg(t) + R_0 \\
&\le rg(t) + \sum_{t' \in \langle t_0, t_0^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t') \\
&= \sum_{t' \in \langle t, t^* \rangle} [\![\, h_k(x, t') \neq y[k] \,]\!] \; rg(t'),
\end{aligned}
\]
where the last equality holds because the path $\langle t, t^* \rangle$ only adds the node $t$ itself, which is wrongly predicted ($h_k(x, t) \neq y[k]$) and therefore contributes exactly $rg(t)$ to the sum.
This completes the induction; applying the hypothesis at the root $r$, where $rg(h, P) = rg(T_r)$ and $t^* = y^*$, proves the theorem.
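To see the bound end to end, the following self-contained sketch numerically compares both sides of Theorem 1 on a tiny depth-$K$ tree. All names (`verify_bound`, `C_bar`, `decisions`) and the example numbers are our own assumptions for illustration, not part of the paper.

def verify_bound(C_bar, decisions):
    """Numerically compare both sides of the regret bound (illustrative only).

    C_bar maps each leaf, written as a tuple of K bits (a label vector), to
    its expected cost; decisions[prefix] is the bit predicted by the internal
    node whose root-to-node path is the tuple `prefix`.
    """
    K = len(next(iter(C_bar)))

    def leaf_from(prefix):
        # Follow the chained classifiers from the node `prefix` down to a leaf.
        path = list(prefix)
        while len(path) < K:
            path.append(decisions[tuple(path)])
        return tuple(path)

    y_star = min(C_bar, key=C_bar.get)     # ideal prediction y*
    y_hat = leaf_from(())                  # actual prediction from the root
    lhs = C_bar[y_hat] - C_bar[y_star]     # rg(h, P)

    rhs = 0.0
    for k in range(K):                     # nodes on the ideal path <r, y*>
        t = y_star[:k]
        if decisions[t] != y_star[k]:      # only wrongly predicted nodes count
            leaves = [C_bar[leaf_from(t + (b,))] for b in (0, 1)]
            rhs += leaves[decisions[t]] - min(leaves)   # node regret rg(t)
    return lhs, rhs                        # Theorem 1 says lhs <= rhs

# Example with K = 2 (made-up costs): the optimal leaf is (0, 1) but the root
# sends the prediction to (1, 1), so both sides equal 0.25 and the bound is tight.
C_bar = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.75, (1, 1): 0.5}
decisions = {(): 1, (0,): 1, (1,): 1}
print(verify_bound(C_bar, decisions))      # (0.25, 0.25)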
B. Datasets
Table 1 summarizes the basic statistics of the datasets used.
Table 1. The properties of each dataset.

Dataset     # Instances    # Labels (K)
CAL500              502             174
emotions            593               6
enron              1702              53
imdb              86290              28
medical             662              45
scene              2407               6
slash              3279              22
tmc               28596              22
yeast              2389             144
References
Beygelzimer, A., Langford, J., and Ravikumar, P. Error-correcting tournaments, 2008.