Department of Computer Science and Information Engineering
College of Electrical Engineering and Computer Science
National Taiwan University
Master Thesis

Dynamic Principal Projection for Cost-sensitive Online Multi-label Classification

Hong-Min Chu (朱鴻敏)
Advisor: Hsuan-Tien Lin, Ph.D. (林軒田)
June, 2017
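誌謝 (Acknowledgements)
I thank Professor Hsuan-Tien Lin for his guidance over the past three-plus years. From the time I joined CLLab knowing nothing until now, as I complete this master's thesis, he has given me a great deal of help along the way. Whether through research discussions or through teaching the character of a researcher by word and example, I have learned a great deal from him.
I thank the oral defense committee members, Professor 陳藴儂 and Professor 王鈺強, for taking the time to attend my defense and for giving me very useful suggestions on thesis writing.
I thank the members of CLLab. Discussions with them have benefited my research greatly, both in terms of new ideas and in terms of new research trends. Special thanks go to 鄒侑霖, who took the oral defense together with me; with his help, the defense was completed smoothly.
I thank my family for supporting me unconditionally during my research, which allowed me to complete this thesis smoothly.
Finally, I thank Allah for blessing me in passing through an important stage of my life and obtaining my master's degree.
Hong-Min Chu
2017.06.27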
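摘要 (Chinese Abstract)
This thesis studies three important and practical issues in multi-label classification: online updating, label space dimension reduction, and cost-sensitivity. Current multi-label classification algorithms have not been designed to handle these three issues simultaneously. In this thesis, we propose a novel algorithm, cost-sensitive dynamic principal projection, to resolve the three issues simultaneously.
The method is based on a framework that extends a leading label space dimension reduction algorithm to online updating by means of online principal component analysis. Specifically, the method uses matrix stochastic gradient descent as the solver for the online principal component analysis problem, and establishes its theoretical backbone when coupled with a carefully designed online regression learner. In addition, the method embeds the cost information into label weights to achieve cost-sensitivity with theoretical guarantees. We also study practical enhancements of the method to improve its efficiency. Experimental results show that the method achieves better practical performance than existing multi-label classification algorithms across different evaluation criteria, and demonstrate the importance of resolving the three issues simultaneously.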
Abstract
We study multi-label classification (MLC) with three important real-world issues: online updating, label space dimension reduction (LSDR), and cost-sensitivity. Current MLC algorithms have not been designed to address these three issues simultaneously. In this paper, we propose a novel algorithm, cost-sensitive dynamic principal projection (CS-DPP), that resolves all three issues.
The foundation of CS-DPP is a framework that extends a leading LSDR algorithm to online updating with online principal component analysis (PCA). In particular, CS-DPP investigates the use of matrix stochastic gradient as the online PCA solver, and establishes its theoretical backbone when coupled with a carefully-designed online regression learner. In addition, CS-DPP embeds the cost information into label weights to achieve cost-sensitivity along with theoretical guarantees. Practical enhancements of CS-DPP are also studied to improve its efficiency. Experimental results verify that CS-DPP achieves better practical performance than current MLC algorithms across different evaluation criteria, and demonstrate the importance of resolving the three issues simultaneously.
Contents
誌謝 (Acknowledgements)
摘要 (Chinese Abstract)
Abstract
1 Introduction
2 Preliminaries and Related Work
3 Dynamic Principal Projection
  3.1 Principal Label Space Transformation
  3.2 Online PCA
  3.3 Proposed Approach
  3.4 Practical Variant and Implementation
4 Cost-Sensitive Extension
5 Experiments
  5.1 Experiments Setup
  5.2 Necessity of LSDR
  5.3 Experiments on Basis Drifting
  5.4 Experiments on Cost-Sensitivity
6 Conclusion
Appendix A
  A.1 Proof of Theorem 2
  A.2 Proof of Lemma 3
  A.3 Proof of Lemma 4
  A.4 Proof of Theorem 5
  A.5 Details of Experiments
    A.5.1 Datasets and Parameters
    A.5.2 Necessity of LSDR
    A.5.3 Experiments on Basis Drifting
    A.5.4 Experiments on Cost-sensitivity
Bibliography
List of Figures
5.1 DPP vs. O-BR on noisy labels
5.2 PBC vs. PBT vs. None
5.3 CS-DPP vs. Others
List of Tables
5.1 DPP vs. O-BR on Large Dataset
A.1 Statistics of datasets
A.2 DPP vs. O-BR on Noisy Data, Hamming loss
A.3 DPP vs. O-BR on Noisy Data, F1 loss
A.4 DPP vs. O-BR on Noisy Data, Accuracy loss
A.5 DPP vs. O-BR on Noisy Data, Normalized rank loss
A.6 CS-DPP with PBC vs. PBT vs. None, Hamming loss
A.7 CS-DPP with PBC vs. PBT vs. None, F1 loss
A.8 CS-DPP with PBC vs. PBT vs. None, Accuracy loss
A.9 CS-DPP with PBC vs. PBT vs. None, Normalized rank loss
A.10 CS-DPP vs. others, Hamming loss
A.11 CS-DPP vs. others, F1 loss
A.12 CS-DPP vs. others, Accuracy loss
A.13 CS-DPP vs. others, Normalized rank loss
Chapter 1 Introduction
The multi-label classification (MLC) problem allows each instance to be associated with a set of labels. The MLC problem reflects the nature of different real-world applications [8, 4, 12]. Traditional MLC algorithms mainly consider the batch MLC problem, where the input data are presented in a batch [22, 25]. Nevertheless, in many MLC applications such as e-mail categorization [20], multi-label examples arrive as a stream, which requires online analysis, as algorithms for batch MLC may not be suitable because of the potentially infinite amount of data. The need of such applications can be formalized as the online MLC (OMLC) problem.
The OMLC problem is generally more challenging than the batch one, and many mature algorithms for the batch problem have not yet been carefully extended to OMLC.
Label space dimension reduction (LSDR) is a family of mature algorithms for the batch MLC problem [7, 13, 17, 24, 14, 23, 31, 6, 2, 5]. By viewing the label set of each instance as a high-dimensional label vector in a label space, LSDR encodes each label vector as a code vector in a lower-dimensional code space, and learns a predictor within the code space. An unseen instance is predicted by coupling the predictor with a decoder from the code space to the label space. For example, compressed sensing (CS) [13] encodes using random projections, and decodes with sparse vector reconstruction; principal label space transformation (PLST) [24] encodes by projecting to the key eigenvectors of the known label vectors obtained from principal component analysis (PCA), and decodes by reconstruction with the same eigenvectors. This low-dimensional encoding allows LSDR algorithms to exploit the key joint information between labels, which makes them more robust to noise and more effective in learning [24]. Nevertheless, to the best of our knowledge, all the LSDR algorithms mentioned above are designed for the batch MLC problem rather than the OMLC one.
Another family of MLC algorithms that have not been carefully extended for OMLC contains the cost-sensitive MLC algorithms. In particular, different MLC applications usually come with different evaluation criteria (costs) that reflect their realistic needs. It is important to design MLC algorithms that are cost-sensitive to systematically cope with different costs, because an MLC algorithm that targets one specific cost may not always perform well under other costs [15]. Two representative cost-sensitive MLC algorithms are probabilistic classifier chain (PCC) [10] and condensed filter tree (CFT) [15]. PCC estimates the conditional probability with the classifier chain (CC) method [22] and makes Bayes-optimal predictions with respect to the given cost based on the estimations; CFT decomposes the cost into instance weights when training the classifiers in CC. Both algorithms, again, are designed for the batch MLC problem rather than the OMLC one.
From the discussions above, there is currently no algorithm that considers the three realistic needs of online updating, label space dimension reduction, and cost-sensitivity at the same time. The goal of this work is to study such algorithms. We first formalize the OMLC and cost-sensitive OMLC (CSOMLC) problems in Section 2 and discuss related work. We then extend LSDR for the OMLC problem and propose a novel online LSDR algorithm, dynamic principal projection (DPP), by connecting PLST with online PCA.
In particular, we derive the DPP algorithm in Section 3 along with its theoretical guarantees, and resolve the issue of possible basis drifting caused by online PCA. Practical enhancements of DPP are also studied to improve its efficiency.
In Section 4, we further extend DPP to cost-sensitive DPP (CS-DPP) to fully match the needs of CSOMLC with a label-weighting scheme inspired by CFT. Extensive empirical studies demonstrate the strength of CS-DPP in addressing the three realistic needs in Section 5. In particular, we justify the necessity of considering LSDR, basis drifting and cost-sensitivity under the CSOMLC setting. The results show that CS-DPP significantly outperforms other OMLC competitors across different CSOMLC problems, which validates the robustness and effectiveness of CS-DPP, as concluded in Section 6.
Chapter 2
Preliminaries and Related Work
For the MLC problem, we denote the feature vector of an instance as $\mathbf{x} \in \mathbb{R}^d$ and its corresponding label vector as $\mathbf{y} \in \mathcal{Y} \equiv \{+1, -1\}^K$, where $\mathbf{y}[k] = +1$ iff the instance is associated with the $k$-th label out of a total of $K$ possible labels. We let $\mathbf{y}[k] \in \{+1, -1\}$ to conform with the common setting of online binary classification [9, 27], which is equivalent to another scheme, $\mathbf{y}[k] \in \{1, 0\}$, used in other MLC works [15, 22].
Traditional MLC methods consider the batch setting, where a training dataset $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$ is given at once, and the objective is to learn a classifier $g\colon \mathbb{R}^d \to \{+1, -1\}^K$ from $\mathcal{D}$ with the hope that $\hat{\mathbf{y}} = g(\mathbf{x})$ accurately predicts the ground truth $\mathbf{y}$ with respect to an unseen $\mathbf{x}$. In this work, we focus on the OMLC setting, which assumes that instances $(\mathbf{x}_t, \mathbf{y}_t)$ arrive in sequence from a data stream. Whenever an $\mathbf{x}_t$ arrives at iteration $t$, the OMLC algorithm is required to make a prediction $\hat{\mathbf{y}}_t = g_t(\mathbf{x}_t)$ based on the current classifier $g_t$ and feature vector $\mathbf{x}_t$. The ground truth $\mathbf{y}_t$ with respect to $\mathbf{x}_t$ is then revealed, and the penalty of $\hat{\mathbf{y}}_t$ is evaluated against $\mathbf{y}_t$.

Many evaluation criteria for comparing $\mathbf{y}$ and $\hat{\mathbf{y}}$ have been considered in the literature to satisfy different application needs. A simple criterion [25] is the Hamming loss $c_{\mathrm{ham}}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{K} \sum_{k=1}^{K} [\![\mathbf{y}[k] \neq \hat{\mathbf{y}}[k]]\!]$. The Hamming loss separately considers each label as equally important. In addition to the Hamming loss, there are other criteria that jointly evaluate all labels in $\hat{\mathbf{y}}$, such as

F1 loss:
$$c_{\mathrm{f}}(\mathbf{y}, \hat{\mathbf{y}}) = 1 - 2 \Big( \sum_{k=1}^{K} [\![\mathbf{y}[k] = +1 \text{ and } \hat{\mathbf{y}}[k] = +1]\!] \Big) \Big/ \Big( \sum_{k=1}^{K} \big( [\![\mathbf{y}[k] = +1]\!] + [\![\hat{\mathbf{y}}[k] = +1]\!] \big) \Big)$$

Accuracy loss:
$$c_{\mathrm{acc}}(\mathbf{y}, \hat{\mathbf{y}}) = 1 - \Big( \sum_{k=1}^{K} [\![\mathbf{y}[k] = +1 \text{ and } \hat{\mathbf{y}}[k] = +1]\!] \Big) \Big/ \Big( \sum_{k=1}^{K} [\![\mathbf{y}[k] = +1 \text{ or } \hat{\mathbf{y}}[k] = +1]\!] \Big)$$

Normalized rank loss:
$$c_{\mathrm{nr}}(\mathbf{y}, \hat{\mathbf{y}}) = \operatorname*{average}_{\mathbf{y}[i] > \mathbf{y}[j]} \Big( [\![\hat{\mathbf{y}}[i] < \hat{\mathbf{y}}[j]]\!] + \tfrac{1}{2} [\![\hat{\mathbf{y}}[i] = \hat{\mathbf{y}}[j]]\!] \Big)$$

In this work, we follow existing cost-sensitive MLC approaches [15] to extend OMLC to the cost-sensitive OMLC (CSOMLC) setting, which further takes the evaluation criterion as an additional input to the learning algorithm. We call the criterion a cost function and overload $c\colon \{+1, -1\}^K \times \{+1, -1\}^K \to \mathbb{R}$ as its notation. The cost function evaluates the penalty of $\hat{\mathbf{y}}$ against $\mathbf{y}$ by $c(\mathbf{y}, \hat{\mathbf{y}})$, and includes the four loss functions discussed above. We naturally assume that $c(\cdot, \cdot)$ satisfies $c(\mathbf{y}, \mathbf{y}) = 0$ and $\max_{\hat{\mathbf{y}}} c(\mathbf{y}, \hat{\mathbf{y}}) \le 1$. Given this additional input, the CSOMLC algorithm shall behave differently when fed with different cost functions. In particular, the objective of a CSOMLC algorithm is to adaptively learn a classifier $g_t\colon \mathbb{R}^d \to \{+1, -1\}^K$ based on not only the data stream but also the given cost function $c$ such that the cumulative cost $\sum_{t=1}^{T} c(\mathbf{y}_t, \hat{\mathbf{y}}_t)$ with $\hat{\mathbf{y}}_t = g_t(\mathbf{x}_t)$ over $T$ iterations of $(\mathbf{x}_t, \mathbf{y}_t)$ can be minimized.

Several OMLC algorithms have been studied in the literature, including online binary relevance [21], Bayesian OMLC framework [32], and the multi-window approach using k nearest neighbors [30]. However, none of them are cost-sensitive. That is, they cannot take the cost function into account to improve learning performance.
Cost-sensitive MLC algorithms have also been investigated in the literature. Cost-sensitive RAkEL [18] and progressive RAkEL [29] are two algorithms that generalize a famous batch MLC algorithm called RAkEL [26] to cost-sensitive learning. The former achieves cost-sensitivity for any weighted Hamming loss, and the latter achieves this for any cost function. Probabilistic classifier chain (PCC) [10] and condensed filter tree (CFT) [15] are two other algorithms that generalize another famous batch MLC algorithm called classifier chain (CC) [22] to cost-sensitive learning. PCC estimates the conditional probability of the label vector via CC, and makes a Bayes-optimal prediction with respect to the estimation and cost function. While PCC in principle achieves cost-sensitivity for any cost function, the prediction step can be time-consuming unless an efficient Bayes inference rule can be specifically designed for the cost function (e.g. F1 loss [11]). CFT embeds the cost information into CC by an $O(K^2)$-time step that re-weights the training instances for each classifier. All four algorithms above are designed for the batch cost-sensitive MLC problem, and it is not clear how they can be modified for the CSOMLC problem.
Label space dimension reduction (LSDR) is another family of MLC algorithms. LSDR encodes each label vector as a code vector in the lower-dimensional code space, and learns a predictor from the feature vectors to the corresponding code vectors. The prediction of LSDR consists of the predictor followed by a decoder from the code space to the label space. For example, compressed sensing (CS) [13] uses random projection for encoding, takes a regressor as the predictor, and decodes by sparse vector reconstruction. Instead of random projection, principal label space transformation (PLST) [24] encodes the label vectors $\{\mathbf{y}_n\}_{n=1}^{N}$ to their top principal components for the batch MLC problem. Other LSDR algorithms consider the feature and label vectors jointly, including conditional principal label space transformation [7], feature-aware implicit label space encoding [17], canonical-correlation-analysis method [23], and low-rank empirical risk minimization for multi-label learning [31]. The code vectors produced by those LSDR algorithms capture the joint information between the labels to allow more robust and more effective learning. Nevertheless, the algorithms are all designed for the batch MLC problem rather than the OMLC one, and they are not cost-sensitive.
Motivated by the possible applications of online updating, the realistic needs of cost-sensitivity, and the potential effectiveness of label space dimension reduction, we take an initiative to study LSDR algorithms for the CSOMLC setting. In particular, we first adapt PLST to the OMLC setting in Section 3, and further extend it to the CSOMLC setting in Section 4.
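The four evaluation criteria defined earlier in this chapter can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in the experiments: the function names are hypothetical, labels are assumed to be arrays with entries in {+1, -1}, and returning 0 when a denominator is empty (no positive label, or no ordered pair) is an assumed convention that the text does not specify.

```python
import numpy as np

def hamming_loss(y, y_hat):
    """Fraction of labels on which y and y_hat disagree."""
    return float(np.mean(y != y_hat))

def f1_loss(y, y_hat):
    """1 - F1; assumed to be 0 when neither vector has a positive label."""
    tp = np.sum((y == 1) & (y_hat == 1))
    denom = np.sum(y == 1) + np.sum(y_hat == 1)
    return 0.0 if denom == 0 else float(1.0 - 2.0 * tp / denom)

def accuracy_loss(y, y_hat):
    """1 - |intersection| / |union| of the positive label sets."""
    inter = np.sum((y == 1) & (y_hat == 1))
    union = np.sum((y == 1) | (y_hat == 1))
    return 0.0 if union == 0 else float(1.0 - inter / union)

def normalized_rank_loss(y, y_hat):
    """Average over pairs with y[i] > y[j] of [y_hat[i] < y_hat[j]] + 0.5 [tie]."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    if len(pos) == 0 or len(neg) == 0:
        return 0.0  # no ordered pair to average over (assumed convention)
    a = y_hat[pos][:, None]   # predictions on relevant labels
    b = y_hat[neg][None, :]   # predictions on irrelevant labels
    return float(np.mean((a < b) + 0.5 * (a == b)))
```

All four functions return 0 for a perfect prediction and never exceed 1, which matches the two assumptions placed on a cost function above.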
Chapter 3
Dynamic Principal Projection
In this section, we first propose an online LSDR algorithm, dynamic principal projection (DPP), that optimizes the Hamming loss. DPP is motivated by the connection between PLST, which encodes the label vectors to their principal components, and the rich literature of online PCA algorithms [1, 19, 16]. Before discussing our design to combine PLST with online PCA, we introduce their respective details first.
3.1 Principal Label Space Transformation
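Given the dimension $M \le K$ of the code space and a batch training dataset $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$, PLST encodes each $\mathbf{y}_n \in \{+1, -1\}^K$ into a code vector $\mathbf{z}_n = \mathbf{P}^*(\mathbf{y}_n - \bar{\mathbf{y}})$, where $\bar{\mathbf{y}}$ is the empirical mean of $\{\mathbf{y}_n\}_{n=1}^{N}$, and the rows of $\mathbf{P}^*$ contain the projection directions to the top $M$ principal components of $\{\mathbf{y}_n - \bar{\mathbf{y}}\}_{n=1}^{N}$. That is, $\mathbf{P}^*$ contains the top $M$ eigenvectors of $\sum_{n=1}^{N} (\mathbf{y}_n - \bar{\mathbf{y}})(\mathbf{y}_n - \bar{\mathbf{y}})^\top$. A multi-target regressor $r$ is then learned on $\{(\mathbf{x}_n, \mathbf{z}_n)\}_{n=1}^{N}$, and the prediction of an unseen instance $\mathbf{x}$ is made by

$$\hat{\mathbf{y}} = \mathrm{round}\big( (\mathbf{P}^*)^\top r(\mathbf{x}) + \bar{\mathbf{y}} \big) \tag{3.1}$$

where $\mathrm{round}(\mathbf{v}) = \big( \mathrm{sign}(\mathbf{v}[1]), \ldots, \mathrm{sign}(\mathbf{v}[K]) \big)^\top$.

By projecting to the top principal components, PLST preserves the maximum amount of information within the observed label vectors. In addition, PLST is backed by the following theoretical guarantee: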
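Theorem 1 ([24]). When making a prediction $\hat{\mathbf{y}}$ from $\mathbf{x}$ by $\hat{\mathbf{y}} = \mathrm{round}\big( \mathbf{P}^\top r(\mathbf{x}) + \mathbf{o} \big)$ with any given reference vector $\mathbf{o}$ and any left orthogonal matrix $\mathbf{P}$, the Hamming loss

$$c_{\mathrm{ham}}(\mathbf{y}, \hat{\mathbf{y}}) \le \frac{1}{K} \Big( \underbrace{\| r(\mathbf{x}) - \mathbf{z} \|_2^2}_{\text{pred. error}} + \underbrace{\| (\mathbf{I} - \mathbf{P}^\top \mathbf{P})(\mathbf{y} - \mathbf{o}) \|_2^2}_{\text{reconstruction error}} \Big)$$

where $\mathbf{z} \equiv \mathbf{P}(\mathbf{y} - \mathbf{o})$.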
Theorem 1 bounds the Hamming loss by the prediction and reconstruction errors.
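The batch PLST pipeline and the bound of Theorem 1 can be sketched as follows. This is an illustrative sketch, not the reference implementation of [24]: the function names are hypothetical, the regressor $r$ is assumed to be a multi-target ridge regressor with weight matrix `W` (so that $r(\mathbf{x}) = \mathbf{W}^\top \mathbf{x}$), and ties at zero are rounded to +1.

```python
import numpy as np

def plst_fit(Y, M):
    """Return (P, y_bar); the rows of P are the top-M principal directions."""
    y_bar = Y.mean(axis=0)
    # The right singular vectors of the centered label matrix are the
    # eigenvectors of sum_n (y_n - y_bar)(y_n - y_bar)^T, sorted by eigenvalue.
    _, _, Vt = np.linalg.svd(Y - y_bar, full_matrices=False)
    return Vt[:M], y_bar

def ridge_fit(X, Z, lam=1.0):
    """Multi-target ridge regression weights W (d x M), so r(x) = W^T x."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ Z)

def plst_predict(X, W, P, y_bar):
    """Decode as in (3.1): round(P^T r(x) + y_bar), applied row by row."""
    scores = (X @ W) @ P + y_bar
    return np.where(scores >= 0, 1, -1)

def hamming_bound(x, y, W, P, o):
    """Right-hand side of Theorem 1 for one instance."""
    z = P @ (y - o)
    pred_err = np.sum((W.T @ x - z) ** 2)
    recon_err = np.sum(((y - o) - P.T @ z) ** 2)
    return (pred_err + recon_err) / len(y)
```

On any instance, the Hamming loss of `plst_predict` should not exceed `hamming_bound`, because a wrongly signed label contributes at least 1 to the squared distance between $\mathbf{y}$ and the real-valued reconstruction.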
Based on the standard results of PCA, the pair $(\mathbf{P}^*, \bar{\mathbf{y}})$ in PLST is the optimal solution for minimizing the reconstruction error of the observed label vectors. Then, by minimizing the prediction error with regressor $r$, PLST is able to minimize the Hamming loss approximately.

3.2 Online PCA
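We start from the common setting considered in online PCA algorithms [1, 16, 19]. An online PCA algorithm is assumed to receive $\mathbf{y}_t \in \mathbb{R}^K$ at each iteration $t$ with $\|\mathbf{y}_t\|_2 \le 1$. Given the dimension $M \le K$ of the lower-dimensional code space, the algorithm picks $\mathbf{P}_t \in \mathbb{R}^{M \times K}$ with orthogonal rows and suffers reconstruction error $\|(\mathbf{I} - \mathbf{P}_t^\top \mathbf{P}_t)\mathbf{y}_t\|_2^2$. The goal of the algorithm is to iteratively pick $\mathbf{P}_t$ such that $\sum_{t=1}^{T} \|(\mathbf{I} - \mathbf{P}_t^\top \mathbf{P}_t)\mathbf{y}_t\|_2^2$ can be close to the reconstruction error induced by the best offline matrix $\mathbf{P}^*$ over $T$ iterations.

In this work, we consider a simple but promising algorithm, matrix stochastic gradient (MSG) [1, 19], as the foundation of DPP. MSG maintains an up-to-date projection matrix $\mathbf{U}_t \in \mathbb{R}^{K \times K}$ constrained by $\mathrm{tr}(\mathbf{U}_t) = M$ with eigenvalues in $[0, 1]$, which is the convex hull of the projection matrices with $\mathrm{rank}(\mathbf{U}_t) = M$. Upon receiving a new $\mathbf{y}_t$, MSG updates $\mathbf{U}_t$ to $\mathbf{U}_{t+1}$ as

$$\begin{aligned} \text{Descent:} \quad & \mathbf{U}'_{t+1} = \mathbf{U}_t + \eta\, \mathbf{y}_t \mathbf{y}_t^\top \\ \text{Projection:} \quad & \mathbf{U}_{t+1} = \operatorname*{arg\,min}_{\mathrm{tr}(\mathbf{U}) = M} \| \mathbf{U} - \mathbf{U}'_{t+1} \|_F^2 \end{aligned} \tag{3.2}$$

where $\eta$ is the learning rate.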
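A minimal sketch of one MSG iteration (3.2) is given below. It is a naive version that recomputes a full eigendecomposition at every step, not the efficient implementation of [1]; the function names are hypothetical, and the feasible set is assumed to be the symmetric matrices with trace $M$ and eigenvalues in $[0, 1]$, for which the Frobenius projection amounts to shifting all eigenvalues by a common constant and clipping them.

```python
import numpy as np

def project_trace(U_prime, M, iters=60):
    """Frobenius projection onto {U : eigenvalues in [0, 1], tr(U) = M}."""
    sigma, V = np.linalg.eigh(U_prime)
    # Find the shift s with sum(clip(sigma + s, 0, 1)) = M by bisection;
    # the clipped sum is continuous and non-decreasing in s.
    lo, hi = -sigma.max(), 1.0 - sigma.min()   # clipped sums 0 and K
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.clip(sigma + mid, 0.0, 1.0).sum() < M:
            lo = mid
        else:
            hi = mid
    clipped = np.clip(sigma + hi, 0.0, 1.0)
    return (V * clipped) @ V.T                 # V diag(clipped) V^T

def msg_step(U, y, M, eta):
    """One MSG update: a descent step followed by the projection step."""
    return project_trace(U + eta * np.outer(y, y), M)
```

Starting from, for example, `U = (M / K) * np.eye(K)` keeps every iterate inside the feasible set.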
To conform with the setting of online PCA algorithms, $\mathbf{P}_t$ needs to be produced (from $\mathbf{U}_t$) at each iteration. As shown in [28], any $\mathbf{U}_t$ is a convex combination of at most $K$ projection matrices. By treating these matrices as events and the corresponding combination coefficients as probabilities, we can easily sample a projection matrix that yields the same reconstruction error in expectation. The eigen decomposition of the sampled projection matrix then gives $\mathbf{P}_t$. A greedy algorithm to find such a convex combination with time complexity $O(K^2)$ is also given in [28].

3.3 Proposed Approach
Next, we proceed to the detail of our proposed online LSDR algorithm, dynamic principal projection (DPP), which focuses on the Hamming loss.
As neither $\mathbf{P}^*$ nor $\bar{\mathbf{y}}$ is known a priori, naïvely extending PLST to an OMLC algorithm by replacing $r$ with an online regressor $r_t$ cannot be carried out. The key idea of DPP is thus to additionally replace $\mathbf{P}^*$ with an adaptively updated $\mathbf{P}_t$ by incorporating MSG. Nevertheless, the problem of drifting of the projection basis $\mathbf{P}_t$ arises, which can negatively affect the performance of $r_t$ because $r_t$ is learned on the low-dimensional components of $\mathbf{y}_1, \ldots, \mathbf{y}_{t-1}$ composed of different sets of projection bases. We first establish the framework of DPP using $\mathbf{P}_t$ from MSG instead of $\mathbf{P}^*$ and discuss our solutions to handle basis drifting.

General Framework. Theorem 1 bounds the Hamming loss by the prediction and reconstruction errors. Therefore, it is natural to take these two errors as the loss function for OMLC. Using the online linear predictor $r_t(\mathbf{x}) = \mathbf{W}_t^\top \mathbf{x}$ and $\mathbf{P}_t$ from an online PCA algorithm, the framework of DPP is established as follows.

For $t = 1, \ldots, T$
    Receive $\mathbf{x}_t$ and predict $\hat{\mathbf{y}}_t = \mathrm{round}(\mathbf{P}_t^\top \mathbf{W}_t^\top \mathbf{x}_t)$
    Receive $\mathbf{y}_t$ and suffer loss $\ell^{(t)}(\mathbf{W}_t, \mathbf{P}_t)$
    Update $\mathbf{P}_t$ and $\mathbf{W}_t$

where

$$\ell^{(t)}(\mathbf{W}, \mathbf{P}) = \| \mathbf{W}^\top \mathbf{x}_t - \mathbf{P}\mathbf{y}_t \|_2^2 + \| (\mathbf{I} - \mathbf{P}^\top \mathbf{P})\mathbf{y}_t \|_2^2$$
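The framework is established with $\mathbf{o} = \mathbf{0}$, which accommodates the setting of online PCA algorithms because it is assumed that uncentered $\mathbf{y}_t$ comes in stream.

Our goal is to optimize the cumulative loss $\sum_{t=1}^{T} \ell^{(t)}(\mathbf{W}_t, \mathbf{P}_t)$. To do so, we follow PLST and use MSG to optimize the cumulative reconstruction error $\sum_{t=1}^{T} \|(\mathbf{I} - \mathbf{P}_t^\top \mathbf{P}_t)\mathbf{y}_t\|_2^2$, and leave the optimization of the prediction error to online ridge regression. To be more specific, we first derive a naïve updating procedure for $\mathbf{P}_t$ and $\mathbf{W}_t$ as follows:

$$\begin{aligned} \text{Update } \mathbf{U}: \quad & \mathbf{U}_{t+1} = \mathcal{P}_{\mathrm{trace}}(\mathbf{U}_t + \eta\, \mathbf{y}_t \mathbf{y}_t^\top) \\ \text{Sample } \mathbf{P}: \quad & \mathbf{P}_{t+1} \sim \Gamma_{t+1} \ (\text{from } \mathbf{U}_{t+1}) \\ \text{Update } \mathbf{W}: \quad & \mathbf{W}_{t+1} = \operatorname*{arg\,min}_{\mathbf{W}} \frac{\lambda}{2} \|\mathbf{W}\|_F^2 + \sum_{i=1}^{t} \| \mathbf{W}^\top \mathbf{x}_i - \mathbf{P}_i \mathbf{y}_i \|_2^2 \end{aligned} \tag{3.3}$$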
$\mathcal{P}_{\mathrm{trace}}(\cdot)$ abbreviates the projection step in (3.2), and $\lambda$ is the regularization parameter for online ridge regression. Additionally, in order to fully accommodate the constraint of $\|\mathbf{y}_t\|_2 \le 1$ for online PCA, we apply a result-invariant trick (subject to a proper scaling of $\lambda$ and $\eta$) that scales $\mathbf{y}_t$ by $\frac{1}{\sqrt{K}}$.

Drifting of Projection Basis. At first glance, (3.3) suffices to extend PLST to an OMLC algorithm. Nevertheless, a closer look at the update of $\mathbf{W}_t$ reveals a vulnerability with respect to the drift of the projection basis $\mathbf{P}_t$ as $t$ advances. In particular, PLST, as a batch MLC algorithm, uses the same $\mathbf{P}^*$ to encode each label vector. In contrast, $\mathbf{W}_t$ is updated with code vectors $\{\mathbf{z}_i\}_{i=1}^{t-1}$ where $\mathbf{z}_i = \mathbf{P}_i \mathbf{y}_i$, and tries to predict $\mathbf{z}_t = \mathbf{P}_t \mathbf{y}_t$ from $\mathbf{x}_t$. However, each $\mathbf{z}_i$ is essentially a set of combination coefficients with respect to a different basis formed by a different $\mathbf{P}_i$. $\mathbf{W}_t$ may therefore fail to effectively predict $\mathbf{z}_t$, i.e. the coefficients with respect to a potentially new and different $\mathbf{P}_t$.

To remedy the issue of basis drifting, we propose two different techniques, Principal Basis Correction (PBC) and Principal Basis Transform (PBT). Each of them enjoys different advantages.
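Principal Basis Correction. The ideal solution for the problem of basis drifting is to "align" each $\mathbf{z}$ with the $\mathbf{P}$ that is used for prediction. More specifically, we want $\mathbf{W}$ to be learned from $\{(\mathbf{x}_i, \mathbf{P}_t \mathbf{y}_i)\}_{i=1}^{t-1}$ instead of $\{(\mathbf{x}_i, \mathbf{P}_i \mathbf{y}_i)\}_{i=1}^{t-1}$. This can be achieved if $\mathbf{W}_t^{\mathrm{PBC}}$ is the ridge regression solution of $\{(\mathbf{x}_i, \mathbf{P}_t \mathbf{y}_i)\}_{i=1}^{t-1}$. It is straightforward to see that

$$\mathbf{W}_t^{\mathrm{PBC}} = \underbrace{\Big( \lambda \mathbf{I} + \sum_{i=1}^{t-1} \mathbf{x}_i \mathbf{x}_i^\top \Big)^{-1}}_{\mathbf{A}_t^{-1}} \underbrace{\Big( \sum_{i=1}^{t-1} \mathbf{x}_i \mathbf{y}_i^\top \Big)}_{\mathbf{B}_t} \mathbf{P}_t^\top$$

By maintaining up-to-date $\mathbf{A}_t^{-1}$ and $\mathbf{B}_t$, which takes $O(d^2)$ and $O(Kd)$ space, respectively, $\mathbf{W}_t^{\mathrm{PBC}}$ can be easily obtained by $\mathbf{A}_t^{-1} \mathbf{B}_t \mathbf{P}_t^\top$ for any $\mathbf{P}_t$.

Next, we analyze the performance of PBC with respect to its batch predecessor, PLST.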
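For this comparison, it is natural to set up the offline competitor as $(\mathbf{W}^{\#}, \mathbf{P}^*)$, where $\mathbf{P}^*$ minimizes $\sum_{t=1}^{T} \|(\mathbf{I} - \mathbf{P}^\top \mathbf{P})\mathbf{y}_t\|_2^2$ and $\mathbf{W}^{\#}$ minimizes $\sum_{t=1}^{T} \|\mathbf{W}^\top \mathbf{x}_t - \mathbf{P}^* \mathbf{y}_t\|_2^2$. We show that, under the condition that the sequence $\{\mathbf{U}_t\}_{t=1}^{T}$ converges to $(\mathbf{P}^*)^\top \mathbf{P}^*$ as $T \to \infty$, the expected average regret

$$R_T = \frac{1}{T}\, \mathbb{E}_{\mathbf{P}_t \sim \Gamma_t} \Big[ \sum_{t=1}^{T} \big( \ell^{(t)}(\mathbf{W}_t^{\mathrm{PBC}}, \mathbf{P}_t) - \ell^{(t)}(\mathbf{W}^{\#}, \mathbf{P}^*) \big) \Big]$$

has an upper bound that converges to 0 as $T \to \infty$, as formalized in the following theorem.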
Theorem 2. Assume that the sequence $\{\|\mathbf{U}_t - (\mathbf{P}^*)^\top \mathbf{P}^*\|_2\}_{t=1}^{T}$ converges to 0 as $T \to \infty$. Then there exists $F(T) \ge R_T$ such that $\lim_{T \to \infty} F(T) = 0$.

Principal Basis Transform. While PBC always gives the $\mathbf{W}_t^{\mathrm{PBC}}$ learned on the correct code vectors with respect to the basis formed by $\mathbf{P}_t$, PBC has a dependency on $\Omega(Kd)$ because of the need to maintain $\mathbf{B}_t$. The $\Omega(Kd)$ dependency of the time complexity can make PBC computationally inefficient when both $K$ and $d$ are large.

To address the issue, we propose another solution, Principal Basis Transform (PBT), that does not require maintaining $\mathbf{B}_t$. Suppose we have a $\mathbf{W}'_{t-1}$ that predicts the combination coefficients of the basis formed by $\mathbf{P}_{t-1}$, and we aim for prediction with respect to the basis formed by $\mathbf{P}_t$. The key idea of PBT is to first reconstruct the prediction in label space by $\mathbf{P}_{t-1}^\top (\mathbf{W}'_{t-1})^\top$, and then project the prediction into the low-dimensional space spanned by the rows of $\mathbf{P}_t$ with minimal projection loss. Formally, PBT seeks $\mathbf{W}_t^{\mathrm{PBT}}$ such that

$$\mathbf{W}_t^{\mathrm{PBT}} = \operatorname*{arg\,min}_{\mathbf{W}} \| \mathbf{W}\mathbf{P}_t - \mathbf{W}'_{t-1} \mathbf{P}_{t-1} \|_F^2 \tag{3.4}$$

Solving (3.4) analytically gives

$$\mathbf{W}_t^{\mathrm{PBT}} = \mathbf{W}'_{t-1} \mathbf{P}_{t-1} \mathbf{P}_t^\top \tag{3.5}$$

Finally, we update $\mathbf{W}_t^{\mathrm{PBT}}$ with $(\mathbf{x}_t, \mathbf{P}_t \mathbf{y}_t)$ to obtain $\mathbf{W}'_t$ for the prediction of the next iteration. Note that $\mathbf{P}_t \mathbf{y}_t$ uses exactly the same basis as $\mathbf{W}_t^{\mathrm{PBT}}$, and a direct update is therefore feasible.

One can see that PBT can be better than PBC because only a dependency on $\Omega(M^2 d)$ rather than $\Omega(Kd)$ is required when $\mathbf{P}_{t-1} \mathbf{P}_t^\top$ is calculated first in (3.5). In contrast, PBT can be worse than PBC because of the accumulated information loss every time (3.5) is applied. Therefore, we suggest PBT as a practical solution to remedy basis drifting when the $\Omega(Kd)$ dependency of PBC is not acceptable. We shall also empirically demonstrate in Section 5 that PBT is generally competitive with PBC, while enjoying significant speedup for data with large $K$ and $d$.
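The two remedies for basis drifting in this section reduce to short matrix products. The sketch below is illustrative only: the function names are hypothetical, `A_inv` and `B` stand for $\mathbf{A}_t^{-1}$ and $\mathbf{B}_t$ of PBC, weight matrices are stored as $d \times M$ arrays, and each projection matrix is assumed to have orthonormal rows.

```python
import numpy as np

def pbc_weights(A_inv, B, P_t):
    """PBC: ridge solution for targets P_t y_i, i.e. A^{-1} B P_t^T (d x M)."""
    # B has shape d x K, which is the source of the Omega(Kd) dependency.
    return A_inv @ (B @ P_t.T)

def pbt_weights(W_prev, P_prev, P_t):
    """PBT (3.5): re-express W'_{t-1} in the basis of P_t."""
    # The small M x M product P_{t-1} P_t^T is formed first, so the cost is
    # O(M^2 K + M^2 d) and no d x K matrix is ever touched.
    return W_prev @ (P_prev @ P_t.T)
```

When the basis does not change (`P_t` equal to `P_prev`), `pbt_weights` returns `W_prev` unchanged, so PBT only loses information when the basis actually drifts.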
3.4 Practical Variant and Implementation
In this subsection, we first discuss the practical variant for updating $\mathbf{U}_t$ and the corresponding efficient sampling of $\mathbf{P}_t$. Then, we discuss an efficient implementation for updating $\mathbf{W}_t$.

Efficient implementations of MSG have been studied in [1], which improved the time complexity from $O(K^2)$ of the naïve implementation to $O(K \times \mathrm{rank}^2(\mathbf{U}_t))$ at iteration $t$. Specifically, the descent step can be implemented by maintaining an up-to-date eigen decomposition of $\mathbf{U}_t = \mathbf{P}' \mathrm{diag}(\boldsymbol{\sigma}') (\mathbf{P}')^\top$, while the projection step is performed by clipping each value of $\boldsymbol{\sigma}'$ into $[0, 1]$ after a constant shift.

Nevertheless, the run-time of MSG, and also that of DPP, still critically depends on $\mathrm{rank}(\mathbf{U}_t)$. Capped MSG, which is proposed in [1], is a practical variant of MSG that imposes a hard constraint of $\mathrm{rank}(\mathbf{U}_t) \le M'$ with $M \le M'$ during the projection step. Capped MSG has been shown to enjoy significant speedup while still maintaining the quality of $\mathbf{U}_t$. We use $M' = M + 1$, as recommended in [1], for DPP.

As capped MSG guarantees the time complexity of updating $\mathbf{U}$ to be $O(M^2 K)$, the computational cost of sampling $\mathbf{P}_t$, which is $O(K^2)$, becomes the main obstacle. We overcome this obstacle by presenting the following lemma.

Lemma 3. Suppose $\mathbf{U}$ is obtained after an update of capped MSG with $\mathrm{rank}(\mathbf{U}) = M + 1$, and let $\mathbf{P}' \mathrm{diag}(\boldsymbol{\sigma}') (\mathbf{P}')^\top$ be the eigen decomposition of $\mathbf{U}$. Define $\mathbf{P}_i \in \mathbb{R}^{M \times K}$ to be $\mathbf{P}'$ with the $i$-th row excluded and $\Gamma$ to be a discrete probability distribution over $\{\mathbf{P}_i\}_{i=1}^{M+1}$ with the probability of $\mathbf{P}_i$ being $1 - \sigma'_i$. Then we have, for any $\mathbf{y}$,

$$\mathbb{E}_{\mathbf{P} \sim \Gamma} [\mathbf{y}^\top \mathbf{P}^\top \mathbf{P} \mathbf{y}] = \mathbf{y}^\top \mathbf{U} \mathbf{y} \tag{3.6}$$

We refer our readers to the appendix for the proof. Because the up-to-date eigen decomposition of $\mathbf{U}_t$ is already maintained by (capped) MSG in each iteration, Lemma 3 directly gives an $O(M)$ sampling procedure, which is significantly improved over the original $O(K^2)$.

We now discuss the efficient implementation for updating $\mathbf{W}_t$. The optimal solution of $\mathbf{W}_{t+1}$ (without PBC or PBT) is known to be $\mathbf{W}_{t+1} = \mathbf{A}_{t+1}^{-1} \big( \sum_{i=1}^{t} \mathbf{x}_i \mathbf{z}_i^\top \big)$. Naïve calculation of $\mathbf{W}_{t+1}$ takes $O(Md^2)$, even with the matrix inversion lemma, due to the need for matrix multiplication. We eliminate the multiplication step by realizing that the updating of $\mathbf{W}_t$ has the following form:

$$\mathbf{W}_{t+1} = \mathbf{W}_t - \frac{\mathbf{A}_t^{-1} \mathbf{x}_{t+1} (\tilde{\mathbf{z}}_{t+1} - \mathbf{z}_{t+1})^\top}{1 + \mathbf{x}_{t+1}^\top \mathbf{A}_t^{-1} \mathbf{x}_{t+1}} \tag{3.7}$$

where $\tilde{\mathbf{z}}_{t+1} = \mathbf{W}_t^\top \mathbf{x}_{t+1}$. (3.7) takes $O(d^2 + Md)$ by calculating $\mathbf{A}_t^{-1} \mathbf{x}_{t+1}$ first before the outer product.

(3.7) can be directly applied to obtain $\mathbf{W}'_{t+1}$ with PBT applied efficiently simply by replacing $\mathbf{W}_t$ with $\mathbf{W}_{t+1}^{\mathrm{PBT}} = \mathbf{W}'_t \mathbf{P}_t \mathbf{P}_{t+1}^\top$. To efficiently implement PBC, one can instead maintain an alternative $\mathbf{H}_t$ by (3.7) with $\tilde{\mathbf{z}}_t, \mathbf{z}_t$ replaced by $\tilde{\mathbf{y}}_t = \mathbf{H}_t^\top \mathbf{x}_t, \mathbf{y}_t$, respectively, and calculate $\mathbf{W}_t^{\mathrm{PBC}} = \mathbf{H}_t \mathbf{P}_t^\top$ afterward. We summarize the time complexity of updating $\mathbf{W}_t$ with PBC and PBT in the following table.

Time compl.    W-Update         P-Change
PBC            O(d^2 + Kd)      O(MKd)
PBT            O(d^2 + Md)      O(M^2 d + M^2 K)
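The two shortcuts of Section 3.4, the sampling rule of Lemma 3 and the rank-one ridge update (3.7), can be sketched as follows. This is an illustrative sketch with hypothetical function names. It assumes that the rows of `P_prime` are the $M + 1$ orthonormal eigenvectors of $\mathbf{U}$ and that `sigma` holds the matching eigenvalues, which sum to $M$. Note that `np.delete` copies the matrix, so this version does not reach the $O(M)$ sampling cost mentioned above, which only requires drawing the index.

```python
import numpy as np

def sample_projection(P_prime, sigma, rng):
    """Lemma 3: drop the i-th eigenvector with probability 1 - sigma[i]."""
    probs = np.clip(1.0 - sigma, 0.0, None)
    probs = probs / probs.sum()          # guards against rounding error
    i = rng.choice(len(sigma), p=probs)
    return np.delete(P_prime, i, axis=0)

def ridge_step(W, A_inv, x, z):
    """Rank-one update (3.7) of the ridge weights and of the inverse matrix."""
    Ax = A_inv @ x                       # O(d^2); A_inv is symmetric
    c = 1.0 + x @ Ax
    z_tilde = W.T @ x                    # current prediction, O(Md)
    W_new = W - np.outer(Ax, z_tilde - z) / c
    A_inv_new = A_inv - np.outer(Ax, Ax) / c   # Sherman-Morrison formula
    return W_new, A_inv_new
```

Starting from `W = 0` and `A_inv = I / lam`, repeated calls to `ridge_step` should reproduce the closed-form ridge solution on the examples seen so far, without any $d \times d$ matrix-matrix product.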
Chapter 4
Cost-Sensitive Extension
In this section, we extend DPP to cost-sensitive DPP (CS-DPP), which meets the requirement of CSOMLC. The key idea is based on a carefully designed label-weighting scheme that transforms the cost $c(\mathbf{y}, \hat{\mathbf{y}})$ into the corresponding weighted Hamming loss. The optimization objective is then derived similarly to Theorem 1, which allows us to reuse the framework of DPP.
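We start from the detail of our label-weighting scheme based on the label-wise decomposition of $c(\mathbf{y}, \hat{\mathbf{y}})$. The weight of each label arguably reflects its importance. However, many $c(\cdot, \cdot)$ (e.g. the F1 loss) do not evaluate each label independently. To allow the label weights to fully represent the cost, we propose a label-weighting scheme based on a label-wise and order-dependent decomposition of $c(\cdot, \cdot)$, which is motivated by a similar concept in [15]. The label-weighting scheme works as follows. Defining $\hat{\mathbf{y}}^{(k)}_{\mathrm{real}}$ and $\hat{\mathbf{y}}^{(k)}_{\mathrm{pred}}$ as

$$\hat{\mathbf{y}}^{(k)}_{\mathrm{real}}[i] = \begin{cases} \mathbf{y}[i] & \text{if } i \le k \\ \hat{\mathbf{y}}[i] & \text{if } i > k \end{cases} \quad \text{and} \quad \hat{\mathbf{y}}^{(k)}_{\mathrm{pred}}[i] = \begin{cases} \mathbf{y}[i] & \text{if } i < k \\ \hat{\mathbf{y}}[i] & \text{if } i \ge k \end{cases}$$

we decompose $c(\mathbf{y}, \hat{\mathbf{y}})$ into $\delta^{(1)}, \ldots, \delta^{(K)}$ such that

$$\delta^{(k)} = \big| c(\mathbf{y}, \hat{\mathbf{y}}^{(k)}_{\mathrm{pred}}) - c(\mathbf{y}, \hat{\mathbf{y}}^{(k)}_{\mathrm{real}}) \big| \tag{4.1}$$

Our label-weighting scheme directly follows by simply setting the weight of the $k$-th label as $\delta^{(k)}$.

Algorithm 1 Cost-Sensitive Dynamic Principal Projection with Principal Basis Transform
Parameters: $\lambda$, $M$, $\eta$
$\mathbf{P}_0 \leftarrow \mathbf{O}_{M \times K}$, $\mathbf{U}_0 \leftarrow \mathbf{O}_{K \times K}$, $\mathbf{A}_0^{-1} \leftarrow \frac{1}{\lambda} \mathbf{I}_{d \times d}$, $\mathbf{W}'_0 \leftarrow \mathbf{O}_{d \times M}$ ($\mathbf{O}$ is the zero matrix)
while Receive $(\mathbf{x}_t, \mathbf{y}_t)$ do
    $\hat{\mathbf{y}}_t \leftarrow \mathrm{round}(\mathbf{P}_{t-1}^\top \mathbf{W}'^\top_{t-1} \mathbf{x}_t)$
    Obtain $\mathbf{C}_t$ by (4.2)
    Update $\mathbf{U}_{t-1}$ to $\mathbf{U}_t$ by Capped MSG (using $\mathbf{C}_t \mathbf{y}_t$) and sample $\mathbf{P}_t$ with Lemma 3
    $\mathbf{W}_t^{\mathrm{PBT}} \leftarrow \mathbf{W}'_{t-1} \mathbf{P}_{t-1} \mathbf{P}_t^\top$ (PBT)
    Update $\mathbf{W}_t^{\mathrm{PBT}}$, $\mathbf{A}_{t-1}^{-1}$ to $\mathbf{W}'_t$, $\mathbf{A}_t^{-1}$ by (3.7) (using $\mathbf{C}_t \mathbf{y}_t$ instead)
end while

The proposed label-weighting scheme with (4.1) enjoys a nice theoretical guarantee under a mild condition of $c(\cdot, \cdot)$ as shown in the following lemma.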
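Lemma 4. If $c(\mathbf{y}, \hat{\mathbf{y}}^{(k)}_{\mathrm{pred}}) - c(\mathbf{y}, \hat{\mathbf{y}}^{(k)}_{\mathrm{real}}) \ge 0$ holds for any $k$, $\mathbf{y}$ and $\hat{\mathbf{y}}$, then for any given $\mathbf{y}$ and $\hat{\mathbf{y}}$, we have

$$c(\mathbf{y}, \hat{\mathbf{y}}) = \sum_{k=1}^{K} \delta^{(k)} [\![\mathbf{y}[k] \neq \hat{\mathbf{y}}[k]]\!]$$

The proof of the above lemma can be found in the appendix. Lemma 4 transforms $c(\mathbf{y}, \hat{\mathbf{y}})$ into the corresponding weighted Hamming loss, and thus enables the optimization over general cost functions.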
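The decomposition (4.1) and the identity of Lemma 4 can be checked numerically with a short sketch. The function names below are hypothetical, labels are assumed to be NumPy arrays with entries in {+1, -1}, and the F1 loss is assumed to be 0 when neither vector contains a positive label.

```python
import numpy as np

def hamming_loss(y, y_hat):
    return float(np.mean(y != y_hat))

def f1_loss(y, y_hat):
    tp = np.sum((y == 1) & (y_hat == 1))
    denom = np.sum(y == 1) + np.sum(y_hat == 1)
    return 0.0 if denom == 0 else float(1.0 - 2.0 * tp / denom)

def label_weights(cost, y, y_hat):
    """delta^(k) of (4.1) for k = 1, ..., K (stored at index k - 1)."""
    K = len(y)
    delta = np.zeros(K)
    for k in range(K):
        # y_pred: labels before position k are corrected, the rest predicted.
        y_pred = np.concatenate([y[:k], y_hat[k:]])
        # y_real: position k is corrected as well.
        y_real = np.concatenate([y[:k + 1], y_hat[k + 1:]])
        delta[k] = abs(cost(y, y_pred) - cost(y, y_real))
    return delta

def cost_matrix(delta):
    """C = diag(sqrt(delta^(1)), ..., sqrt(delta^(K)))."""
    return np.diag(np.sqrt(delta))
```

For $\mathbf{y} = (+1, +1, -1, -1)$ and $\hat{\mathbf{y}} = (+1, -1, +1, -1)$ under the F1 loss, the weights are $(0, 0.3, 0.2, 0)$, and the two wrongly predicted labels account for the whole cost $0.3 + 0.2 = 0.5$.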
Next, we propose CS-DPP, which extends DPP based on our proposed label-weighting scheme. Define $\mathbf{C}$ as

$$\mathbf{C} = \mathrm{diag}\big( \sqrt{\delta^{(1)}}, \ldots, \sqrt{\delta^{(K)}} \big) \tag{4.2}$$

With $\mathbf{C}$, which carries the cost information, we establish a theorem similar to Theorem 1 to upper-bound $c(\mathbf{y}, \hat{\mathbf{y}})$.
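Theorem 5. When making a prediction $\hat{\mathbf{y}}$ from $\mathbf{x}$ by $\hat{\mathbf{y}} = \mathrm{round}\big( \mathbf{P}^\top r(\mathbf{x}) + \mathbf{o} \big)$ with any given reference vector $\mathbf{o}$ and any left orthogonal matrix $\mathbf{P}$, if $c(\cdot, \cdot)$ satisfies the condition of Lemma 4, the prediction cost

$$c(\mathbf{y}, \hat{\mathbf{y}}) \le \| r(\mathbf{x}) - \mathbf{z}_{\mathbf{C}} \|_2^2 + \| (\mathbf{I} - \mathbf{P}^\top \mathbf{P})(\mathbf{C}\mathbf{y} - \mathbf{o}) \|_2^2$$

where $\mathbf{z}_{\mathbf{C}} = \mathbf{P}(\mathbf{C}\mathbf{y} - \mathbf{o})$.

The proof can be found in the appendix. The condition of Lemma 4 implies that correcting a wrongly-predicted label leads to no higher cost, and is considered mild as general cost functions satisfy the condition, including those mentioned in Section 2.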
Theorem 5 generalizes Theorem 1 to upper-bound the general cost $c(\mathbf{y}, \hat{\mathbf{y}})$ instead of $c_{\mathrm{ham}}(\mathbf{y}, \hat{\mathbf{y}})$. With Theorem 5, extending DPP to CS-DPP is a straightforward task by reusing the updating framework of DPP with $\mathbf{y}_t$ replaced by $\mathbf{C}_t \mathbf{y}_t$. The full details of CS-DPP using PBT are given in Algorithm 1.

Chapter 5
Experiments
To empirically evaluate the performance, and also to study the effectiveness and necessity of design components of CS-DPP, we conduct three sets of experiments: (1) necessity justification of LSDR, (2) experiments on basis drifting, and (3) experiments on cost-sensitivity.
5.1 Experiments Setup
We conduct our experiments on nine real-world datasets¹ downloaded from Mulan². Data streams are generated by permuting datasets into different random orders. All LSDR algorithms are coupled with online ridge regression and three different code space dimensions, $M$ = 10%, 25%, and 50% of $K$, are considered.

We consider four different cost functions: Hamming loss, Normalized rank loss, F1 loss, and Accuracy loss, as defined in Section 2 to justify the cost-sensitivity. The performances of different algorithms are compared using the average cumulative cost $\frac{1}{t} \sum_{i=1}^{t} c(\mathbf{y}_i, \hat{\mathbf{y}}_i)$ at each iteration $t$. We report the average results of each experiment after 15 repetitions.

¹ CAL500, emotions, scene, yeast, enron, Corel5k, mediamill, nuswide and medical
² http://mulan.sourceforge.net/datasets-mlc.html
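The stream generation and the evaluation measure described in this setup can be sketched as below. The helper names are hypothetical, and the sketch assumes that a dataset is held as two NumPy arrays with one row per instance and that the per-iteration costs $c(\mathbf{y}_i, \hat{\mathbf{y}}_i)$ have already been collected in a list.

```python
import numpy as np

def random_stream(X, Y, rng):
    """Permute a dataset into a random order to simulate a data stream."""
    order = rng.permutation(len(X))
    return X[order], Y[order]

def average_cumulative_cost(costs):
    """Entry t - 1 holds (1 / t) * sum of the first t per-iteration costs."""
    costs = np.asarray(costs, dtype=float)
    return np.cumsum(costs) / np.arange(1, len(costs) + 1)
```

Averaging the curves returned by `average_cumulative_cost` over independently permuted streams would correspond to the repetitions reported in the experiments.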
Figure 5.1: DPP vs. O-BR on noisy labels. Panels: (a) emotions, p = 0.3, c_ham; (b) emotions, p = 0.3, c_f1; (c) enron, p = 0.3, c_ham; (d) enron, p = 0.3, c_f1; (e) emotions, p = 0.7, c_ham; (f) emotions, p = 0.7, c_f1; (g) enron, p = 0.7, c_ham; (h) enron, p = 0.7, c_f1.
Dataset            delicious                        eurlex-eurovec
Algorithms         PBT       PBC       O-BR         PBT       PBC         O-BR
c_ham              0.1136    0.1153    0.1245       0.4917    0.5011      0.4993
c_NR               0.5636    0.5641    0.5756       0.7435    0.7467      0.7433
c_F1               0.9143    0.9138    0.9076       0.9972    0.9928      0.9921
c_Acc              0.9512    0.9517    0.9494       0.9980    0.9964      0.9958
Avg. time (sec)    21.49     140.77    105.18       60.81     10522.25    4841.35

Table 5.1: DPP vs. O-BR on Large Dataset
5.2 Necessity of LSDR
In this experiment, we aim to justify the necessity of addressing LSDR for OMLC problems. We demonstrate that the ability of LSDR to preserve the key joint correlations between labels can be helpful when facing (1) data with noisy labels or (2) data with a large possible set of labels, both of which are often encountered in real-world OMLC problems. We compare DPP with online Binary Relevance (O-BR), which is a naïve extension of binary relevance [25] with an online ridge regressor. The only difference between DPP and O-BR is whether the algorithm incorporates LSDR or not.
We first compare DPP and O-BR on data with noisy labels. We generate a noisy data stream by randomly flipping each positive label y[i] = 1 to negative with probability p ∈ {0.3, 0.5, 0.7}, which simulates the real-world scenario in which human annotators fail to tag existing labels. We plot the results of O-BR and DPP with M = 10%, 25%, and 50% of K on datasets emotions and enron with respect to Hamming loss and F1 loss in Figure 5.1. We report the complete results in the appendix.
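The noise model described above can be sketched as follows. This is an illustrative snippet under stated assumptions: the helper name `flip_positive_labels` is hypothetical, negatives are assumed to be encoded as 0 (pass `negative=-1` for a ±1 encoding), and the label matrix is random toy data.

```python
import numpy as np

def flip_positive_labels(Y, p, rng, negative=0):
    """Flip each positive entry of Y to `negative` independently with probability p."""
    Y = np.asarray(Y)
    flip = (Y == 1) & (rng.random(Y.shape) < p)
    return np.where(flip, negative, Y)

rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(1000, 6))           # toy label matrix, K = 6
Y_noisy = flip_positive_labels(Y, p=0.3, rng=rng)
```

Only positive labels are ever changed, so the noise is one-sided: a label that was absent in the clean data never appears in the noisy stream.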
The results from the first two columns of Figure 5.1 show that DPP with M = 10% of K performs competitively with, and even better than, O-BR as p increases. The results from the last two columns of Figure 5.1 show that DPP always performs better on enron. We can also observe from Figure 5.1 that DPP with smaller M tends to perform better as p increases. These results demonstrate that DPP, by incorporating LSDR, better resists the effect of noisy labels as the noise level p increases, while O-BR suffers more from the noise because it makes an independent prediction on each label. The observation that DPP with smaller M tends to perform better demonstrates that DPP is more robust to noise because LSDR preserves the key joint correlations between labels.
Next, we demonstrate that LSDR is also helpful for handling data with a large label set. We compare O-BR with DPP that is coupled with either PBC or PBT on datasets delicious and eurlex-eurovec³. DPP uses M = 10 for delicious and M = 25 for eurlex-eurovec. We summarize the results and average run-time in Table 5.1. The results in Table 5.1 indicate that DPP coupled with either PBT or PBC performs competitively with O-BR, while DPP with PBT enjoys a significantly cheaper computational cost. The results demonstrate that DPP enjoys more effective and efficient learning than O-BR for data with a large label set, and also justify the advantage of PBT over PBC in terms of efficiency when K and d are large while M is relatively small.
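To make the run-time gap concrete, the ratios can be computed directly from the average times reported in Table 5.1. The snippet is only arithmetic on those reported numbers; the helper name `speedup_of_pbt` is hypothetical.

```python
# Average run-times in seconds, copied from Table 5.1.
avg_time = {
    "delicious":      {"PBT": 21.49, "PBC": 140.77,   "O-BR": 105.18},
    "eurlex-eurovec": {"PBT": 60.81, "PBC": 10522.25, "O-BR": 4841.35},
}

def speedup_of_pbt(times):
    """How many times longer each other method runs compared with PBT."""
    return {name: t / times["PBT"] for name, t in times.items() if name != "PBT"}

speedups = {dataset: speedup_of_pbt(times) for dataset, times in avg_time.items()}
for dataset, ratios in speedups.items():
    print(dataset, {name: round(r, 1) for name, r in ratios.items()})
```

On delicious the ratios are roughly 6.6 (PBC) and 4.9 (O-BR); on eurlex-eurovec, where K and d are much larger, they grow to roughly 173 and 80.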
5.3 Experiments on Basis Drifting
To empirically justify the necessity of handling basis drifting, we compare variants of DPP that (1) perform PBC, (2) perform PBT, and (3) neglect basis drifting. We plot the results for Hamming loss with M = 10% of K in Figure 5.2 on datasets CAL500, emotions, enron, mediamill, medical, and nuswide, and report the complete results in the appendix. The results on all datasets in Figure 5.2 show that DPP with either PBC or PBT significantly improves the performance over the variant that neglects basis drifting, which clearly demonstrates the necessity of handling the drifting of the projection basis.
Further comparison of PBC and PBT based on Figure 5.2 reveals that PBT performs competitively with PBC. Nevertheless, as discussed in Section 5.2, PBT enjoys a substantial computational speedup when K and d are large and M is relatively small, making PBT more suitable for handling data with a large label set.
Figure 5.2: PBC vs. PBT vs. None. Panels (all with respect to c_ham): (a) CAL500; (b) emotions; (c) enron; (d) mediamill; (e) medical; (f) nuswide.
³ delicious: d = 500, K = 983; eurlex-eurovec: d = 5000, K = 3993.
5.4 Experiments on Cost-Sensitivity
To empirically justify the necessity of cost-sensitivity, we compare CS-DPP using PBT with DPP using PBT and other online LSDR algorithms. To the best of our knowledge, no online LSDR algorithm has yet been proposed in the literature. We therefore design two simple online LSDR algorithms, online Compressed Sensing (O-CS) and online Pseudo-inverse Decoding (O-RAND), to compare with CS-DPP. O-CS is a straightforward extension of CS [13] with an online ridge regressor. O-RAND encodes using a random matrix $P_R$ and simply decodes with the corresponding pseudo-inverse $P_R^{\dagger}$. We plot the results with respect to all evaluation criteria except the Hamming loss with M = 10% of K in Figure 5.3 on datasets Corel5k, medical, and yeast. We report the complete results in the appendix.
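The encode/decode step of O-RAND can be illustrated with NumPy. This is a sketch under assumptions, not the thesis implementation: the Gaussian entries of the random matrix, the {0, 1} label encoding, the 0.5 rounding threshold, and the toy sizes are all choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 20, 5                          # label and code dimensions (toy sizes)
P_R = rng.standard_normal((M, K))     # random encoder; Gaussian entries are an assumption
P_R_pinv = np.linalg.pinv(P_R)        # pseudo-inverse decoder, shape (K, M)

y = rng.integers(0, 2, size=K).astype(float)
z = P_R @ y                           # encode: the code vector a regressor would target
y_back = P_R_pinv @ z                 # decode the code vector back to the label space
y_hat = (y_back >= 0.5).astype(int)   # round to a label vector (threshold is an assumption)
```

Decoding generally does not recover `y` exactly when M < K, since `P_R_pinv @ P_R` only projects onto the M-dimensional row space of the random matrix. Re-encoding `y_back` does return `z`, so the information that is lost is exactly the part of `y` outside that random subspace.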
Figure 5.3: CS-DPP vs. Others. Panels: (a) Corel5k, c_F1; (b) Corel5k, c_Acc; (c) Corel5k, c_NR; (d) medical, c_F1; (e) medical, c_Acc; (f) medical, c_NR; (g) yeast, c_F1; (h) yeast, c_Acc; (i) yeast, c_NR.
CS-DPP versus DPP. The results of Figure 5.3 clearly indicate that CS-DPP performs significantly better than DPP on all evaluation criteria other than the Hamming loss, while CS-DPP reduces to DPP when $c_{\mathrm{Ham}}(\cdot, \cdot)$ is used as the cost function. These observations demonstrate that CS-DPP, by optimizing the given cost function instead of the Hamming loss, indeed achieves cost-sensitivity and is superior to its cost-insensitive predecessor, DPP.
CS-DPP versus Other Online LSDR Algorithms. As shown in Figure 5.3, while DPP generally performs better than O-CS and O-RAND because it preserves key correlations between labels rather than random ones, it can nevertheless be inferior on some datasets with respect to specific cost functions because of its cost-insensitivity. For example, DPP loses to O-RAND on dataset Corel5k with respect to the Normalized rank loss, as shown in Figure 5.3(c). CS-DPP overcomes this weakness of DPP with its cost-sensitivity, and significantly outperforms O-CS and O-RAND on all three datasets with respect to all three evaluation criteria, as demonstrated in Figure 5.3. The superiority of CS-DPP justifies the necessity of taking cost-sensitivity into account.
Chapter 6 Conclusion
We proposed a novel cost-sensitive online LSDR algorithm called cost-sensitive dynamic principal projection (CS-DPP). We established the foundation of CS-DPP via the connection between PLST and online PCA, and derived CS-DPP along with its theoretical guarantees on top of MSG. We successfully overcame the challenge of basis drifting using our carefully designed PBC and PBT. CS-DPP further achieves cost-sensitivity through our label-weighting scheme, which comes with a theoretical guarantee. Practical enhancements of CS-DPP were also studied to improve its efficiency. The empirical results demonstrate that CS-DPP significantly outperforms other OMLC algorithms on all evaluation criteria, which validates the robustness and superiority of CS-DPP. The necessity for CS-DPP to address LSDR, basis drifting, and cost-sensitivity was also empirically justified.
Appendix A
A.1 Proof of Theorem 2
Theorem 2. Assume that the sequence $\{\|U_t - (P^*)^\top P^*\|_2\}_{t=1}^{T}$ converges to 0 as $T \to \infty$. Then there exists $F(T) \ge \frac{R}{T}$ such that $\lim_{T\to\infty} F(T) = 0$.
Recall that
\[
W_t^{\mathrm{PBC}} = \underbrace{\Big(\lambda I + \sum_{i=1}^{t-1} x_i x_i^\top\Big)^{-1}}_{A_t^{-1}} \underbrace{\Big(\sum_{i=1}^{t-1} x_i y_i^\top\Big)}_{B_t} P_t^\top .
\]
For simplicity, we will overload $W_t^{\mathrm{PBC}}$ with $W_t$, and denote $A_t^{-1} B_t$ as $H_t$. Similarly, we have $W^* = H^* (P^*)^\top$, where
\[
H^* = \arg\min_{H} \sum_{t=1}^{T} \|H^\top x_t - y_t\|_2^2 .
\]
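The closed form of $W_t^{\mathrm{PBC}}$ recalled above can be written out in NumPy as a sanity check on the shapes involved. This is a batch sketch for illustration only: the function name `pbc_weights` is hypothetical, the data are random, and an actual online learner would presumably update $A_t$ and $B_t$ incrementally rather than recompute them from scratch.

```python
import numpy as np

def pbc_weights(X, Y, P_t, lam=1.0):
    """Closed form W_t = A_t^{-1} B_t P_t^T built from the first t-1 examples.

    X: (t-1, d) features, Y: (t-1, K) labels, P_t: (M, K) projection matrix.
    """
    d = X.shape[1]
    A_t = lam * np.eye(d) + X.T @ X      # lam * I + sum_i x_i x_i^T
    B_t = X.T @ Y                        # sum_i x_i y_i^T
    H_t = np.linalg.solve(A_t, B_t)      # H_t = A_t^{-1} B_t, shape (d, K)
    return H_t @ P_t.T                   # W_t, shape (d, M)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))                      # 50 past examples, d = 4
Y = rng.integers(0, 2, size=(50, 6)).astype(float)    # K = 6 labels
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))
P_t = Q.T                                             # (M, K) = (3, 6), orthonormal rows
W_t = pbc_weights(X, Y, P_t)
```

Since $B_t P_t^\top = \sum_i x_i (P_t y_i)^\top$, the result coincides with the ridge regression solution fitted on the projected targets $P_t y_i$, which is why $H_t$ can be stored once and combined with whichever projection matrix is current.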
Before going into the details of the proof, we list several required (and minor) assumptions. We assume $\|H_t^\top x_t - y_t\|_2^2 \le p^*$ for $t = 1, \dots, T$ and $\|H^*\|_F^2 \le h^*$, which is similar to the assumptions made in [3]. We also assume that $\|x_t\|_2 \le 1$.
Proof. It is straightforward to see that
\[
R = \underbrace{\mathbb{E}_{P_t \sim \Gamma_t}\Big[\sum_{t=1}^{T} \Big(\|W_t^\top x_t - P_t y_t\|_2^2 - \|(W^*)^\top x_t - P^* y_t\|_2^2\Big)\Big]}_{R_{\mathrm{Ridge}}}
+ \underbrace{\mathbb{E}_{P_t \sim \Gamma_t}\Big[\sum_{t=1}^{T} \Big(\ell^{(t)}_{\mathrm{MSG}}(P_t) - \ell^{(t)}_{\mathrm{MSG}}(P^*)\Big)\Big]}_{R_{\mathrm{MSG}}}
\tag{A.1}
\]
where
\[
\ell^{(t)}_{\mathrm{MSG}}(P) = \|(I - P^\top P) y_t\|_2^2 .
\]
Because the sampling of $P_t$ from $\Gamma_t$ is independent between iterations, $R_{\mathrm{Ridge}}$ and $R_{\mathrm{MSG}}$ can be bounded separately.
We start bounding $R_{\mathrm{MSG}}$ by first observing that
\[
R_{\mathrm{MSG}} = \sum_{t=1}^{T} \mathbb{E}_{P_t \sim \Gamma_t}\big[\ell^{(t)}_{\mathrm{MSG}}(P_t) - \ell^{(t)}_{\mathrm{MSG}}(P^*)\big]
\le \sum_{t=1}^{T} \Big(\mathbb{E}_{P_t \sim \Gamma_t}\big[\ell^{(t)}_{\mathrm{MSG}}(P_t)\big] - \ell^{(t)}_{\mathrm{MSG}}(U^*)\Big)
= \sum_{t=1}^{T} \Big(\ell^{(t)}_{\mathrm{MSG}}(U_t) - \ell^{(t)}_{\mathrm{MSG}}(U^*)\Big)
\tag{A.2}
\]
where $U^*$ is the optimal projection matrix with respect to $\sum_{t=1}^{T} \ell^{(t)}_{\mathrm{MSG}}(U)$ with $\mathrm{tr}(U^*) = M$. The first equality follows from the observation that the sampling of $P_t$ does not affect the update of $U_t$, and is therefore independent between iterations. The inequality follows from the fact that $\sum_{t=1}^{T} \ell^{(t)}_{\mathrm{MSG}}(U^*) \le \sum_{t=1}^{T} \ell^{(t)}_{\mathrm{MSG}}(P^*)$, as $\mathrm{tr}((P^*)^\top P^*) = M$. The last equality follows by realizing that $I - P_t^\top P_t$ is a projection matrix, so that $\ell^{(t)}_{\mathrm{MSG}}(P_t) = y_t^\top (I - P_t^\top P_t) y_t$, together with the fact that $\mathbb{E}[P_t^\top P_t] = U_t$. Analysis of Eq. (4) follows the standard analysis of online gradient descent and can be found in the Appendix of [19], which gives the upper bound
\[
\frac{M(K - M)}{2 \eta K} + \frac{\eta}{2} \sum_{t=1}^{T} \|y_t\|_2^2 .
\]
Using $\|y_t\|_2^2 \le 1$ and minimizing over $\eta$ (the minimizer is $\eta = \sqrt{M(K-M)/(KT)}$) yields
\[
R_{\mathrm{MSG}} \le \sqrt{\frac{M(K - M)}{K}\, T} = F_1(T) .
\]
We next analyze $R_{\mathrm{Ridge}}$. Using $W_t = H_t P_t^\top$ and $W^* = H^* (P^*)^\top$, we first rewrite $R_{\mathrm{Ridge}}$ as
\[
\sum_{t=1}^{T} \mathbb{E}\Big[\|P_t (H_t^\top x_t - y_t)\|_2^2 - \|P^* (H_t^\top x_t - y_t)\|_2^2 + \|P^* (H_t^\top x_t - y_t)\|_2^2 - \|P^* ((H^*)^\top x_t - y_t)\|_2^2\Big] .
\]
Note that we omit the subscript of the expectation here.
We first bound
\[
\sum_{t=1}^{T} \mathbb{E}\Big[\|P_t (H_t^\top x_t - y_t)\|_2^2 - \|P^* (H_t^\top x_t - y_t)\|_2^2\Big] .
\]
Let $\bar{y}_t = H_t^\top x_t - y_t$ and $U^* = (P^*)^\top P^*$ (note that this differs from the $U^*$ used in analyzing $R_{\mathrm{MSG}}$). Then we have
\[
\mathbb{E} \sum_{t=1}^{T} \big[\bar{y}_t^\top P_t^\top P_t \bar{y}_t - \bar{y}_t^\top U^* \bar{y}_t\big]
= \sum_{t=1}^{T} \bar{y}_t^\top (U_t - U^*) \bar{y}_t
\le \sum_{t=1}^{T} \|\bar{y}_t\|_2^2 \, \frac{\|(U_t - U^*) \bar{y}_t\|_2}{\|\bar{y}_t\|_2}
\le p^* \sum_{t=1}^{T} \|U_t - U^*\|_2 = F_2(T)
\]
where the last inequality follows from $\|\bar{y}_t\|_2^2 \le p^*$ and the definition of the matrix 2-norm.
Next we bound
\[
\sum_{t=1}^{T} \mathbb{E}\Big[\|P^* (H_t^\top x_t - y_t)\|_2^2 - \|P^* ((H^*)^\top x_t - y_t)\|_2^2\Big] .
\]
We first define another game as
\[
R_z = \sum_{t=1}^{T} \Big(\|\bar{W}_t^\top x_t - z_t\|_2^2 - \|(\bar{W}^*)^\top x_t - z_t\|_2^2\Big)
\tag{A.3}
\]
where $z_t = P^* y_t$, $\bar{W}_t = \arg\min_{W} \frac{\lambda}{2}\|W\|_F^2 + \frac{1}{2}\sum_{i=1}^{t-1} \|W^\top x_i - z_i\|_2^2$, and $\bar{W}^*$ is the best offline comparator of the game. It is not hard to see that this game is exactly the target we wish to bound, by realizing that $\bar{W}_t = H_t (P^*)^\top$ and $\bar{W}^* = H^* (P^*)^\top$, and that the expectation can simply be removed. Furthermore, any $(x_t, z_t)$ has at least one corresponding $(x_t, y_t)$, as $P^*$ is a linear operator. Therefore, bounding (A.3) suffices to bound our target.
Now let
\[
\ell^{(t)}_{\mathrm{Ridge}}(\bar{W}) = \|\bar{W}^\top x_t - z_t\|_2^2 .
\]
We next rewrite (A.3) as
\[
\sum_{m=1}^{M} \sum_{t=1}^{T} \Big(\ell^{(t,m)}_{\mathrm{Ridge}}(w_{t,m}) - \ell^{(t,m)}_{\mathrm{Ridge}}(w^*_m)\Big)
\tag{A.4}
\]
where
\[
\ell^{(t,m)}_{\mathrm{Ridge}}(w) = \|w^\top x_t - z_{t,m}\|_2^2 ,
\]
$z_{t,m}$ is the m-th element of $z_t$, $w_{t,m}$ is the m-th column of $\bar{W}_t$, and $w^*_m$ is the m-th column of $\bar{W}^*$. Next, we have
\[
\|w^*_m\|_2^2 \le \|\bar{W}^*\|_F^2 \le \|U^*\|_F^2 \, \|H^*\|_F^2 \le M h^*
\]
where the last inequality follows from $\mathrm{tr}(U^*) = M$, and
\[
\ell^{(t,m)}_{\mathrm{Ridge}}(w_{t,m}) \le \|\bar{W}_t^\top x_t - z_t\|_2^2 = \bar{y}_t^\top U^* \bar{y}_t \le p^*
\]
where the last inequality follows from the fact that $U^*$ is a projection matrix. With the above, by plugging in $\lambda = 1$ and following the analysis in [3], we have
\[
\sum_{t=1}^{T} \Big(\ell^{(t,m)}_{\mathrm{Ridge}}(w_{t,m}) - \ell^{(t,m)}_{\mathrm{Ridge}}(w^*_m)\Big) \le \frac{M}{2} h^* + 2 p^* d \log\Big(1 + \frac{T}{d} \max_t \big((z_{t,m})^2\big)\Big)
\tag{A.5}
\]
where d is the dimension of $x_t$. Then, using $\max_t ((z_{t,m})^2) \le 1$, which comes from
\[
\max_t \big((z_{t,m})^2\big) \le \max_t \big(\|z_t\|_2^2\big) = \max_t \big(\|P^* y_t\|_2^2\big) \le \max_t \|y_t\|_2^2 \le 1 ,
\]
and summing (A.5) over all $m = 1, \dots, M$, we obtain the upper bound of (A.4) as
\[
F_3(T) = \frac{M^2}{2} h^* + 2 M p^* d \log\Big(1 + \frac{T}{d}\Big) .
\]
Now it is easy to see that
\[
\frac{R}{T} \le F(T) = \frac{F_1(T) + F_2(T) + F_3(T)}{T} .
\]
It is straightforward to see that $\lim_{T\to\infty} \frac{F_1(T)}{T} = 0$ and $\lim_{T\to\infty} \frac{F_3(T)}{T} = 0$. To see that $\lim_{T\to\infty} \frac{F_2(T)}{T} = 0$, note that by assumption $\{\|U_t - U^*\|_2\}_{t=1}^{T}$ converges to 0 as $T \to \infty$, and that the convergence of a sequence implies the convergence of its arithmetic mean. Combining the above, we have $\lim_{T\to\infty} F(T) = 0$, which completes the proof.
A.2 Proof of Lemma 3
Lemma 3. Suppose U is obtained after an update of Capped MSG with rank(U) = M + 1, and let $(P')^\top \mathrm{diag}(\sigma') P'$ be the eigendecomposition of U. Define $P_i \in \mathbb{R}^{M \times K}$ to be $P'$ with the i-th row excluded, and $\Gamma$ to be a discrete probability distribution over $\{P_i\}_{i=1}^{M+1}$ with the probability of $P_i$ being $1 - \sigma'_i$. Then for any y,
\[
\mathbb{E}_{P \sim \Gamma}\big[y^\top P^\top P y\big] = y^\top U y .
\tag{A.6}
\]
Proof. We first show that $\Gamma$ is a well-defined probability distribution. By the definition of the projection step of MSG, we have $0 \le \sigma'_i \le 1$ for each $\sigma'_i$, and $\sum_{i=1}^{M+1} (1 - \sigma'_i) = M + 1 - \sum_{i=1}^{M+1} \sigma'_i = 1$ since $\mathrm{tr}(U) = M$. $\Gamma$ is therefore a well-defined probability distribution.
It then suffices to show that $\mathbb{E}_{P \sim \Gamma}[P^\top P] = U$. To see this, first notice that, by the orthogonality of the rows of $P'$, we have $U = \sum_{j=1}^{M+1} \sigma'_j e_j e_j^\top$, where $e_j$ is the j-th row of $P'$ (as a column vector). We then have
\[
\mathbb{E}_{P \sim \Gamma}\big[P^\top P\big]
= \sum_{i=1}^{M+1} (1 - \sigma'_i) \sum_{j=1}^{M+1} \llbracket i \ne j \rrbracket \, e_j e_j^\top
= \sum_{j=1}^{M+1} \Big(e_j e_j^\top \sum_{i=1}^{M+1} \llbracket i \ne j \rrbracket (1 - \sigma'_i)\Big)
\overset{(a)}{=} \sum_{j=1}^{M+1} \sigma'_j e_j e_j^\top
= U
\]
where (a) is by $\sum_{i=1}^{M+1} \sigma'_i = M$.
A.3 Proof of Lemma 4
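Lemma 4. If $c(y, \hat{y}^{(k)}_{\mathrm{pred}}) - c(y, \hat{y}^{(k)}_{\mathrm{real}}) \ge 0$ holds for any k, y and $\hat{y}$, then for any given y and $\hat{y}$ we have
\[
c(y, \hat{y}) = \sum_{k=1}^{K} \delta^{(k)} \llbracket y[k] \ne \hat{y}[k] \rrbracket .
\tag{A.7}
\]
Proof. Recall the definitions of $\hat{y}^{(k)}_{\mathrm{real}}$ and $\hat{y}^{(k)}_{\mathrm{pred}}$:
\[
\hat{y}^{(k)}_{\mathrm{real}}[i] =
\begin{cases}
y[i] & \text{if } i \le k \\
\hat{y}[i] & \text{if } i > k
\end{cases}
\qquad \text{and} \qquad
\hat{y}^{(k)}_{\mathrm{pred}}[i] =
\begin{cases}
y[i] & \text{if } i < k \\
\hat{y}[i] & \text{if } i \ge k
\end{cases}
\]
and the definition of $\delta^{(k)}$:
\[
\delta^{(k)} = \big| c(y, \hat{y}^{(k)}_{\mathrm{pred}}) - c(y, \hat{y}^{(k)}_{\mathrm{real}}) \big| .
\]
Now define $k_i$, $i = 1, \dots, L$, to be the sequence of indices such that $y[k_i] \ne \hat{y}[k_i]$ for every $k_i$ and $k_i < k_{i+1}$. If no such $k_i$ exists, then (A.7) holds trivially by $c(y, y) = 0$. Otherwise, by the condition on c we have
\begin{align*}
\sum_{k=1}^{K} \delta^{(k)} \llbracket y[k] \ne \hat{y}[k] \rrbracket
&\overset{(a)}{=} \sum_{k=1}^{K} \Big(c(y, \hat{y}^{(k)}_{\mathrm{pred}}) - c(y, \hat{y}^{(k)}_{\mathrm{real}})\Big) \llbracket y[k] \ne \hat{y}[k] \rrbracket \\
&= \sum_{i=1}^{L} \Big(c(y, \hat{y}^{(k_i)}_{\mathrm{pred}}) - c(y, \hat{y}^{(k_i)}_{\mathrm{real}})\Big) \\
&\overset{(b)}{=} c(y, \hat{y}^{(k_1)}_{\mathrm{pred}}) - c(y, \hat{y}^{(k_L)}_{\mathrm{real}}) \\
&\overset{(c)}{=} c(y, \hat{y})
\end{align*}
where (a) uses the condition on $c(\cdot, \cdot)$ to remove the absolute value; (b) considers the two possibilities of L: if L = 1 the equation holds trivially, and if L > 1 we use the observation that $\hat{y}^{(k_i)}_{\mathrm{real}} = \hat{y}^{(k_{i+1})}_{\mathrm{pred}}$, which holds because $y[j] = \hat{y}[j]$ for any $k_i < j < k_{i+1}$; and (c) follows from the observation that $\hat{y}^{(k_1)}_{\mathrm{pred}} = \hat{y}$, $\hat{y}^{(k_L)}_{\mathrm{real}} = y$, and $c(y, y) = 0$.
A.4 Proof of Theorem 5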
Theorem 5. When making a prediction $\hat{y}$ from x by $\hat{y} = \mathrm{round}\big(P^\top r(x) + o\big)$ with any given reference vector o and any left orthogonal matrix P, if $c(\cdot, \cdot)$ satisfies the condition of Lemma 4, then
\[
c(y, \hat{y}) \le \|r(x) - z_C\|_2^2 + \|(I - P^\top P)(Cy - o)\|_2^2
\]
where $z_C = P(Cy - o)$.
Recall that C is defined in the main text as
\[
C = \mathrm{diag}\Big(\sqrt{\delta^{(1)}}, \dots, \sqrt{\delta^{(K)}}\Big) .
\tag{A.8}
\]
Next we state and prove the following lemma before we proceed to the complete proof.