Chun-Liang Li (chunlial@cs.cmu.edu)
Carnegie Mellon University

Hsuan-Tien Lin (htlin@csie.ntu.edu.tw)
National Taiwan University

Chi-Jen Lu (cjlu@iis.sinica.edu.tw)
Academia Sinica

### Abstract

We study the problem of recovering the subspace spanned by the first k principal components of d-dimensional data under the streaming setting, with a memory bound of O(kd). Two families of algorithms are known for this problem. The first family is based on the framework of stochastic gradient descent. Nevertheless, the convergence rate of this family can be seriously affected by the learning rate of the descent steps and deserves more serious study. The second family is based on the power method over blocks of data, but setting the block size for its existing algorithms is not an easy task. In this paper, we analyze the convergence rate of a representative algorithm with decayed learning rate (Oja and Karhunen, 1985) in the first family for the general k > 1 case. Moreover, we propose a novel algorithm for the second family that sets the block sizes automatically and dynamically with a faster convergence rate. We then conduct empirical studies that fairly compare the two families on real-world data. The studies reveal the advantages and disadvantages of these two families.

### 1 Introduction

For data points in R^d, the goal of principal component analysis (PCA) is to find the first k ≪ d eigenvectors (principal components) that correspond to the top-k eigenvalues of the d × d covariance matrix. For a batch of stored data points with a moderate d, efficient algorithms like the power method (Golub and Van Loan, 1996) can be run on the empirical covariance matrix to compute the solution.

In addition to the batch algorithms, the streaming setting (streaming PCA) has attracted much research attention in

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.

recent years (Arora et al., 2012; Mitliagkas et al., 2013; Hardt and Price, 2014). Streaming PCA assumes that each data point x ∈ R^d arrives sequentially and it is not feasible to store all data points. If d is moderate, the empirical covariance matrix can again be computed and fed to an eigenproblem solver to compute the streaming PCA solution. When d is huge, however, it is not feasible to store the O(d²) empirical covariance matrix. This situation arises in many modern applications of PCA, and those applications call for memory-restricted streaming PCA, which is the main focus of this paper. We shall restrict the memory usage to O(kd), which is of the same order as the minimum amount needed to store the PCA solution itself. In addition, we aim to develop streaming PCA algorithms that keep improving the goodness of the solution as more data points arrive. Such algorithms are free from a pre-specified goal of goodness and better match practical needs.

There are two measurements for the goodness of the solution. One is the reconstruction error, which measures the expected squared error when projecting a data point onto the solution; it is based on the fact that the actual principal components result in the lowest reconstruction error. The other is the spectral error, which measures the difference between the subspace spanned by the solution and the subspace spanned by the actual principal components, and will be formally defined in Section 2. The spectral error enjoys a wide range of practical applications (Sa et al., 2015). In addition, note that when the k-th and (k+1)-th eigenvalues are close, a solution that wrongly includes the (k+1)-th eigenvector instead of the k-th one may still reach a small reconstruction error, but its spectral error can be large. That is, the spectral error is somewhat harder to knock down, and it will be the main focus of this paper.

There are several existing streaming PCA algorithms, but not all of them focus on the spectral error and meet the memory restriction. For instance, Karnin and Liberty (2015) proposed an algorithm that considers the spectral error, but its space complexity is at least Ω(kd log n), where n is the number of data points received. Warmuth and Kuzmin (2008); Nie et al. (2013); Boutsidis et al. (2015) proposed online algorithms with regret guarantees on the reconstruction error. Also, the space complexity of Nie et al. (2013) grows in the order of d². Arora et al. (2013) extended Arora et al. (2012) to derive a convergence analysis for minimizing the reconstruction error along with a memory-efficient implementation, but the space complexity is not precisely guaranteed to meet O(kd). Shamir (2015) also focuses on the reconstruction-based error. That is, those works do not match the focus of this paper.

There are two families of algorithms that tackle the spectral error while respecting the memory restriction: the family of stochastic gradient descent (SGD) algorithms for PCA, and the family of block power methods. The SGD family solves a non-convex optimization problem that minimizes the reconstruction error, and applies SGD (Oja and Karhunen, 1985) under the memory restrictions to design streaming PCA algorithms. Interestingly, although the non-convex problem does not match standard convergence assumptions of SGD (Rakhlin et al., 2012), minimizing the reconstruction error for the special case of k = 1 allows Balsubramani et al. (2013) to derive spectral-error guarantees on the classic stochastic-gradient-descent PCA (SPCA) algorithm (Oja and Karhunen, 1985). Recently, Sa et al. (2015) derived a spectral-error minimization algorithm for the general k > 1 case based on SGD, along with strong theoretical guarantees. Nevertheless, different from Balsubramani et al. (2013), Sa et al. (2015) require a pre-specified error goal, which is used to determine a fixed learning rate for the descent steps. The pre-specified goal makes the algorithm inflexible in taking more data points to further decrease the error. Furthermore, the fixed learning rate is inevitably conservative to keep the algorithm stable, and this conservative nature results in slow convergence in practice, as will be revealed by the experimental results in Section 5.

The other family, namely the block power methods (Mitliagkas et al., 2013), extends the batch power method (Golub and Van Loan, 1996) to memory-restricted streaming PCA by defining blocks (periods) on the time line. The key of the block power methods is to efficiently compute the product of the estimated covariance matrices in different blocks. The product serves as an approximation to the power of the empirical covariance matrix, which is a core element of the batch power method.

This family can also be viewed as mini-batch SGD algorithms, but with a different update rule from the SGD family. The original block-power-method PCA (BPCA; Mitliagkas et al., 2013) is proved to converge under some restricted distributions, a result later generalized by Hardt and Price (2014) to a broader class of distributions. The convergence proofs of BPCA in both works, however, depend on determining the block size from the total number of data points or from a pre-specified error goal, which again makes the algorithms inflexible for further decreasing the error with more data points.

From the theoretical perspective, SPCA lacks a convergence proof for the general k > 1 case that depends on neither a pre-specified error goal nor a fixed learning rate, and it is non-trivial to directly extend the fixed-learning-rate result of Sa et al. (2015) to realize the proof; from the algorithmic perspective, BPCA needs more algorithmic study on deciding the block size without depending on a pre-specified error goal; from the practical perspective, it is not clear which family should be preferred in real-world applications. This paper makes contributions on all three perspectives. We first prove the convergence of SPCA for k > 1 with a decaying learning-rate scheme in Section 3. The convergence result turns out to be asymptotically similar to the result of Sa et al. (2015) while not relying on a fixed learning rate. For convenience, we also call the proposed algorithm with the dynamic learning-rate decay SPCA. Then in Section 4, we propose a dynamic block power method (DBPCA) that automatically decides the block sizes to not only allow easier algorithmic use but also guarantee a better convergence rate. Finally, we conduct experiments on real-world datasets and provide concrete recommendations in Section 5.

### 2 Preliminaries

Let us first introduce some notation that will be used later. First, let x ≤ O(y) and x ≥ Ω(y) denote that for some universal constant c, independent of all our parameters, x ≤ cy and x ≥ cy, respectively, for a large enough y. Next, let ⌈x⌉ denote the smallest integer that is at least x. Finally, for a vector x, we let ‖x‖ denote its ℓ²-norm, and for a matrix M, we let ‖M‖ = max_x ‖Mx‖/‖x‖ denote its spectral norm.

In this paper, we study the streaming PCA problem, in which each data point x_n ∈ R^d is received at step n within a stream. Following previous works (Mitliagkas et al., 2013; Balsubramani et al., 2013), we make the following assumption on the data distribution.

Assumption 1. Assume that each x_n is sampled independently from some distribution X with mean zero and covariance matrix A, which has eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_d, with λ_k > λ_{k+1}. Moreover, assume that ‖x‖ ≤ 1 for any x in the support of X,¹ which implies that ‖A‖ ≤ 1 and ‖x_n x_n^⊤‖ ≤ 1 for each n.

Our goal is to find a d × k matrix Q_n at each step n, with its column space quickly approaching the space spanned by the first k eigenvectors of A. For convenience, we let λ = λ_k and λ̂ = λ_{k+1}, and moreover, let U denote the d × k matrix with the first k eigenvectors of A as its columns. One common way to measure the distance between two such spaces is

    Φ_n = max_{v ∈ R^k} ( 1 − ‖U^⊤ Q_n v‖² / ‖v‖² ),    (1)

¹We can relax this condition to that of having a small ‖x‖ with a high probability, as Hardt and Price (2014) do, but we choose this stronger condition to simplify our presentation.

Algorithm 1 SPCA
1: S_0 ∼ N(0, 1)^{d×k}
2: S_0 = Q_0 R_0 (QR decomposition)
3: n ← 1
4: while receiving data do
5:   S_n ← Q_{n−1} + γ_n x_n x_n^⊤ Q_{n−1}
6:   S_n = Q_n R_n (QR decomposition)
7:   n ← n + 1
8: end while

which can be used as an error measure for Q_n. It is known that Φ_n = sin θ_k(U, Q_n)², where θ_k(U, Q_n) is the k-th principal angle between the two spaces. For simplicity, we will denote sin θ_k(U, Q_n) by sin(U, Q_n). Moreover, let cos(U, Q_n) = √(1 − sin(U, Q_n)²) and tan(U, Q_n) = sin(U, Q_n)/cos(U, Q_n). It is also known that cos(U, Q_n) equals the smallest singular value of the matrix U^⊤ Q_n. More can be found in, e.g., Golub and Van Loan (1996).
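These quantities are easy to compute with a reduced SVD. The following is a minimal numpy sketch (the function name and the toy subspaces are ours, not from the paper), using the fact above that cos(U, Q_n) is the smallest singular value of U^⊤Q_n:

```python
import numpy as np

def spectral_error(U, Q):
    """Phi_n in (1): the squared sine of the k-th principal angle
    between span(U) and span(Q), for U, Q with orthonormal columns.
    cos(U, Q) equals the smallest singular value of U^T Q."""
    cos = np.linalg.svd(U.T @ Q, compute_uv=False).min()
    return 1.0 - cos ** 2

# Identical subspaces give error 0; orthogonal subspaces give error 1.
U = np.eye(4)[:, :2]            # span{e1, e2}
V = np.eye(4)[:, 2:]            # span{e3, e4}
print(spectral_error(U, U))     # -> 0.0
print(spectral_error(U, V))     # -> 1.0
```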

Our algorithms will generate an initial matrix S_0 ∈ R^{d×k} by sampling each of its entries independently from the normal distribution N(0, 1). Let S_0 ∼ N(0, 1)^{d×k} denote this process; we will rely on the following guarantee.

Lemma 1. (Mitliagkas et al., 2013) Suppose we sample S_0 ∼ N(0, 1)^{d×k} and let S_0 = Q_0 R_0 be its QR decomposition. Then for a large enough constant c̄, there is a small enough constant δ_0 such that

    Pr[ cos(U, Q_0) ≤ √(c̄/(kd)) ] ≤ δ_0.

### 3 Stochastic Gradient Descent

In this section, we study the classic PCA algorithm framework of Oja and Karhunen (1985) for the general rank-k case, which can be seen as performing stochastic gradient descent. Our algorithm, called SPCA, is given in Algorithm 1. The key component is the learning rate, which is tied to the error analysis. We choose the step size at step n as

    γ_n = c/n, with c = c_0/(λ − λ̂) for a constant c_0 ≥ 12.

The algorithm has a space complexity of O(kd), by noting that the computation of x_n x_n^⊤ Q_{n−1} can be done by first computing x_n^⊤ Q_{n−1} and then multiplying the result by x_n. The sample complexity of our algorithm is guaranteed by the following theorem, which we prove in Subsection 3.1. Our analysis is inspired by and follows closely that of Balsubramani et al. (2013) for the rank-one case, but there are several new hurdles that we need to overcome in the general rank-k case.
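For concreteness, the loop of Algorithm 1 can be sketched in numpy as below. This is our own illustrative implementation rather than the authors' code, and it treats c as a plain tuning parameter, since the eigengap λ − λ̂ is unknown in practice (cf. Section 5):

```python
import numpy as np

def spca(stream, d, k, c, seed=0):
    """Sketch of Algorithm 1 (SPCA) with step size gamma_n = c / n.

    `stream` yields d-dimensional points x_n with ||x_n|| <= 1.
    Memory stays O(kd): x_n x_n^T Q_{n-1} is formed as
    outer(x_n, x_n @ Q_{n-1}) without materializing x_n x_n^T.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # Q_0 from S_0 ~ N(0,1)^{d x k}
    for n, x in enumerate(stream, start=1):
        gamma = c / n
        S = Q + gamma * np.outer(x, x @ Q)            # S_n
        Q, _ = np.linalg.qr(S)                        # re-orthonormalize
    return Q
```

On a toy stream with a large eigengap, the columns of the returned Q align with the span of the top-k eigenvectors of the covariance matrix; each update costs O(kd) time and memory, matching the space bound above.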

Theorem 1. For any ρ ∈ (0, 1), there is some

    N ≤ (2^{1/(λ−λ̂)} kd)^{O(1)} + O( k log(1/ρ) / (ρ(λ − λ̂)³) ),

such that our algorithm with high probability can achieve Φ_n ≤ ρ for any n ≥ N.

Let us remark that we did not attempt to optimize the first term in the bound above, as it is dominated by the second term for a small enough ρ. Note that Sa et al. (2015) provided a better bound, which has only quadratic dependence on the eigengap λ − λ̂, for a similar algorithm called Alecton. Alecton, however, is restricted to a fixed learning rate that comes from a pre-specified error goal on a fixed amount of to-be-received data points. The restriction makes Alecton less practical in the streaming setting, because one may not always know the amount of to-be-received data points in advance. If one receives fewer points than needed, Alecton cannot achieve the error goal; if one receives more than needed, Alecton cannot fully exploit the additional points for a smaller error. The decaying learning rate used by our proposed SPCA algorithm, on the other hand, does not suffer from such a restriction.

3.1 Proof of Theorem 1

The analysis of Balsubramani et al. (2013) works for the rank-one case by using a potential function Ψ_n = 1 − (U^⊤ Q_n)², where U and Q_n are both vectors instead of matrices. To work in the general rank-k case, we choose the function Φ_n defined in (1) as a generalization of their Ψ_n, and our goal is to bound Φ_n.

Following Balsubramani et al. (2013), we divide the steps into epochs, with epoch i ranging from step n_{i−1} to step n_i − 1, where we choose n_0 = ĉ^c k³ d² log d for a large enough constant ĉ, and n_i = ⌈e^{5/c_0}⌉(n_{i−1} + 1) − 1 for i ≥ 1.

Remark 1. This gives us (n_i + 1) ≥ e^{5/c_0}(n_{i−1} + 1) and n_i ≤ c_1 n_{i−1} for some constant c_1.
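As a small sanity check of this schedule (our own sketch, with illustrative values for c_0 and n_0, which the paper leaves as implicit constants):

```python
import math

def epoch_ends(c0, n0, num_epochs):
    """Epoch boundaries n_i = ceil(e^{5/c0}) * (n_{i-1} + 1) - 1."""
    r = math.ceil(math.exp(5.0 / c0))
    ns = [n0]
    for _ in range(num_epochs):
        ns.append(r * (ns[-1] + 1) - 1)
    return ns

print(epoch_ends(c0=12, n0=100, num_epochs=3))  # -> [100, 201, 403, 807]
```

With c_0 = 12 we get ⌈e^{5/12}⌉ = 2, so each epoch is roughly twice as long as the previous one, consistent with Remark 1.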

As in Balsubramani et al. (2013), we also use the convention of starting from step n_0. For each epoch i, we would like to establish an upper bound ρ_i on Φ_n for each step n in that epoch. To start with, we know the following from Lemma 1, using the fact that Φ_0 = sin(U, Q_0)² = 1 − cos(U, Q_0)².

Lemma 2. Let Γ_0 denote the event that Φ_0 ≤ ρ_0, where ρ_0 = 1 − c̄/(kd) for the constant c̄ in Lemma 1. Then we have Pr[¬Γ_0] ≤ δ_0.

Next, for each epoch i ≥ 1, we consider the event Γ_i:

    sup_{n_{i−1} ≤ n < n_i} Φ_n ≤ ρ_i,

for some ρ_i to be specified later. Then our goal is to show that Pr[¬Γ_{i+1} | Γ_i] is small, for i ≥ 0. This can be done for the rank-one case, but it relies crucially on the property that the potential function Ψ_n of Balsubramani et al. (2013) satisfies a nice recurrence relation. Unfortunately, this does not appear to hold for our function Φ_n, mainly because it takes an additional maximization over v ∈ R^k. To overcome this problem, we take the following approach.

Consider an epoch i and a step n in the epoch. Let us define a new matrix Y_n = (I + γ_n x_n x_n^⊤) Y_{n−1} = Q_n R_n R_{n−1} ··· R_{n_{i−1}+1}, with Y_{n_{i−1}} = Q_{n_{i−1}}. Let S = {v ∈ R^k : ‖v‖ = 1}. Then for any v ∈ S, define

    Φ^{(v)}_n = 1 − ‖U^⊤ Y_n v‖² / ‖Y_n v‖²,

and note that Φ_n = max_{v∈S} Φ^{(v)}_n. Now for each such new function Φ^{(v)}_n, with a fixed v, we can establish a similar recurrence relation as follows, but for our purpose later we show a better upper bound on |Z_n| than that in Balsubramani et al. (2013). We give the proof in Appendix A.

Lemma 3. For any n > n_0 and any v ∈ S, we have Φ^{(v)}_n ≤ Φ^{(v)}_{n−1} + β_n − Z_n, where

1. β_n = 5γ_n² + 2γ_n³;
2. |Z_n| ≤ 2γ_n √(Φ^{(v)}_{n−1});
3. E[Z_n | F_{n−1}] ≥ 2γ_n (λ − λ̂) Φ^{(v)}_{n−1} (1 − Φ^{(v)}_{n−1}) ≥ 0.²
With this lemma, the analysis of Balsubramani et al. (2013) can be used to show that E[Φ^{(v)}_n] decreases as n grows, but only for each individual v separately. This alone is not sufficient to guarantee the event Γ_{i+1}, which requires small Φ^{(v)}_n for all v's simultaneously. To deal with this, a natural approach is to show that each Φ^{(v)}_n is large only with a small probability, and then apply a union bound, but an apparent difficulty is that there are infinitely many v's. We overcome this difficulty by showing how to apply a union bound over a finite "ε-net" for these infinitely many v's. Still, for this approach to work, we need the probability of having a large Φ^{(v)}_n to be small enough compared to the size of the ε-net. However, the beginning steps of the first epoch seem to have us in trouble already, as the probability of their Φ^{(v)} values exceeding Φ^{(v)}_{n_0} is not small. This seems to prevent us from having an error bound ρ_1 < ρ_0, and without this to start, it is not clear if we could have smaller and smaller error bounds for later epochs. To handle this, we sacrifice the first epoch by using an error bound ρ_1 slightly larger than ρ_0, but still small enough. The hope is that once Γ_1 is established, we then have a period of small errors, and later epochs can start to have decreasing ρ_i's. More precisely, we have the following for the first epoch, which we prove in Appendix B.

Lemma 4. Let ρ_1 = 1 − c̄/(c_1^{6c} kd), for the constant c_1 given in Remark 1. Then Pr[¬Γ_1 | Γ_0] = 0.

It remains to set the error bounds for later epochs appropriately so that we can actually have small Pr[¬Γ_{i+1} | Γ_i] for i ≥ 1. We let the error bounds decrease in three phases as follows. In the first phase, we let ρ_i = 1 − 2(1 − ρ_{i−1}), so that η_i = 1 − ρ_i doubles each time. It ends at the first epoch i, denoted by π_1, such that ρ_i < 3/4. Note that π_1 ≤ O(log d), and at this point ρ_{π_1} is still much larger than 1/n_{π_1}. Then in the second phase, we let ρ_i = ρ_{i−1}/⌈e^{5/c_0}⌉², which decreases at a faster rate than n_i increases. It ends at the first epoch i, denoted by π_2, such that ρ_i ≤ c_2 (c³ k log n_{i−1})/(n_{i−1} + 1), for some constant c_2.³ Note that π_2 ≤ O(log d), and at this point ρ_{π_2} reaches about the order of 1/n_{π_2}. Finally, in phase three, we let ρ_i = c_2 (c³ k log n_{i−1})/(n_{i−1} + 1), which decreases at about the same rate as n_i increases.

²As in Balsubramani et al. (2013), F_{n−1} here denotes the σ-field of all outcomes up to and including step n − 1.

With these choices, the events Γ_i are now defined, and our key lemma is the following, which we prove in Appendix C. The proof handles the difficulties above by showing how a union bound can be applied on a small "ε-net" of S, along with proper choices of ρ_i to guarantee that each Φ^{(v)}_n is large only with a small enough probability.

Lemma 5. For any i ≥ 1, Pr[¬Γ_{i+1} | Γ_i] ≤ δ_0/(2(i+1)²).
From these lemmas, we can bound the failure probability of our algorithm as

    Pr[∃i ≥ 0 : ¬Γ_i] ≤ Pr[¬Γ_0] + Σ_{i≥0} Pr[¬Γ_{i+1} | Γ_i] ≤ δ_0 + Σ_{i≥0} δ_0/(2(i+1)²),

which is at most 2δ_0, using the fact that Σ_{i≥1} 1/i² ≤ 2.

To complete the proof, it remains to determine the number of samples needed by our algorithm to achieve an error bound ρ. This amounts to determining the number n_i of an epoch i with ρ_i ≤ ρ. With n_{π_2} ≥ n_0, it is not hard to check that ρ_{π_2} ≤ 1/(2^c kd)^{O(1)} and n_{π_2} ≤ (2^c kd)^{O(1)}. Then if ρ ≥ ρ_{π_2}, we can certainly use n_{π_2} as an upper bound. If ρ ≤ ρ_{π_2}, it is not hard to check that with n_i ≤ O(c³ k (1/ρ) log(1/ρ)), we can have ρ_i ≤ ρ. As c = c_0/(λ − λ̂), this proves Theorem 1.

### 4 Block-Wise Power Method

In this section, we turn to study a different approach based on block-wise power methods. Our algorithm is modified from that of Mitliagkas et al. (2013) (referred to as BPCA), which updates the estimate Q_n with a more accurate estimate of A using a block of samples, instead of a single sample as in our first algorithm. Our algorithm differs from BPCA by allowing different block sizes, instead of a fixed size. More precisely, we divide the steps into blocks, with block i consisting of the steps from some interval I_i, and we use this block of |I_i| samples to update our estimate from Q_{i−1} to Q_i. We will specify |I_i| later in (3); it basically grows exponentially after some initial blocks. We call our algorithm DBPCA, as described in Algorithm 2.

³Determined later in the proof of Lemma 5 in Appendix C for the bound (8) there to hold.
Algorithm 2 DBPCA
1: S_0 ∼ N(0, 1)^{d×k}
2: S_0 = Q_0 R_0 (QR decomposition)
3: i ← 1
4: while receiving data do
5:   S_i ← 0
6:   for n ∈ I_i do
7:     S_i ← S_i + (1/|I_i|) x_n x_n^⊤ Q_{i−1}
8:   end for
9:   S_i = Q_i R_i (QR decomposition)
10:  i ← i + 1
11: end while

This algorithm, like our first algorithm SPCA, also has a space complexity of O(kd). The sample complexity is guaranteed by the following theorem, which we prove in Subsection 4.1. To have an easier comparison with the results of Mitliagkas et al. (2013) and Hardt and Price (2014), we use √Φ_n = sin(U, Q_n) as the error measure in this section.
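Before the analysis, here is a minimal numpy sketch of Algorithm 2. It is our own illustration, not the authors' code: instead of the exact schedule (3), which involves unknown eigenvalues, it simply grows the block size by a factor 1/ratio per block, with ratio standing in for the γ² of Subsection 4.1. Memory stays O(kd) because each block only accumulates x_n x_n^⊤ Q_{i−1}:

```python
import itertools
import numpy as np

def dbpca(stream, d, k, block0=8, ratio=0.8, seed=0):
    """Sketch of Algorithm 2 (DBPCA) with geometrically growing blocks."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # Q_0
    it, size = iter(stream), float(block0)
    while True:
        m, got = int(size), 0
        S = np.zeros((d, k))
        for x in itertools.islice(it, m):
            S += np.outer(x, x @ Q)       # accumulate x_n x_n^T Q_{i-1}
            got += 1
        if got < m:
            return Q                      # stream exhausted; drop partial block
        Q, _ = np.linalg.qr(S / m)        # S_i = F_i Q_{i-1}, then QR
        size /= ratio                     # next block is ~1/ratio times larger
```

Each full block performs one noisy power iteration with the block average F_i in place of A, and later blocks average more samples, matching the intuition that later iterations need more accurate estimates.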

Theorem 2. Given any ε ≤ 1/√(kd), our algorithm can achieve an error ε with high probability after L iterations with a total of N samples, for some

    L ≤ O( (λ/(λ−λ̂)) log(d/ε) )  and  N ≤ O( λ log(dL) / (ε²(λ−λ̂)³) ).

Let us make some remarks about the theorem. First, the error ρ in Theorem 1 corresponds to the error ε² here, and one can see that the bound in Theorem 2 is better than those in Theorem 1 and Mitliagkas et al. (2013); Hardt and Price (2014) in general. We summarize the sample complexities in terms of the error ε in Table 1. Next, the condition ε ≤ 1/√(kd) in the theorem is only used to simplify the error bound. One can check that our analysis also works for any ε ≤ 1, but the resulting bound for N has the factor ε² replaced by min(1/(kd), ε²). Finally, from Theorem 2, one can also express the error in terms of the number of samples n as

    ε(n) ≤ O( √( (λ/(n(λ−λ̂)³)) log( λ d log n / (λ−λ̂) ) ) ).

4.1 Proof of Theorem 2

Recall that after the i-th block, we have the estimate Q_i, and we would like it to be close to U, with a small error sin(U, Q_i). To bound this error, we follow Hardt and Price (2014) and bound a surrogate error tan(U, Q_i), which suffices since sin(U, Q_i) ≤ tan(U, Q_i).

To start with, we know from Lemma 1 that for ε_0 = √(kd/c̄) with the constant c̄ there, Pr[tan(U, Q_0) > ε_0] ≤ δ_0, using the fact that tan(U, Q_0)² = 1/cos(U, Q_0)² − 1.

Next, we would like to bound each tan(U, Q_i) in terms of the previous tan(U, Q_{i−1}). For this, recall that with F_i = Σ_{n∈I_i} x_n x_n^⊤ / |I_i|, we have Q_i R_i = F_i Q_{i−1}, which can be rewritten as AQ_{i−1} + (F_i − A)Q_{i−1}.

| Algorithm | Complexity | Restriction |
|---|---|---|
| *SGD family* | | |
| Balsubramani et al. (2013) | O(log(1/ε)/ε²) | only for k = 1 |
| Sa et al. (2015) | O(log(1/ε)/ε²) | pre-specified ε |
| our proposed SPCA | O(log(1/ε)/ε²) | none |
| *block power method family* | | |
| Hardt and Price (2014) | O(log(1/ε)/ε²) | pre-specified ε |
| our proposed DBPCA | O(log(log(1/ε))/ε²) | none |

Table 1: sample complexity and restriction
Using the notation G_i = (F_i − A)Q_{i−1}, we have Q_i R_i = AQ_{i−1} + G_i, where G_i can be seen as the noise arising from estimating A by F_i using the i-th block of samples. Then, we rely on the following lemma from Hardt and Price (2014), with the parameters

    λ̄ = max(λ̂, λ/4),  γ = (λ̄/λ)^{1/4},  and  Δ = (λ − λ̄)/4.    (2)

Lemma 6. (Hardt and Price, 2014) Suppose ‖G‖ ≤ Δ · min(cos(U, Q), β) for some β > 0. Then

    tan(U, AQ + G) ≤ max(β, max(β, γ) tan(U, Q)).

From this, we can have the following lemma, proved in Appendix D, which provides an exponentially-decreasing upper bound on tan(U, Q_i), for the parameters

    ε_i = ε_0 γ^i  and  β_i = min( γ/√(1 + ε_{i−1}²), γ ε_{i−1} ),

where ε_0 = √(kd/c̄) with the constant c̄ in Lemma 1.

Lemma 7. Suppose tan(U, Q_{i−1}) ≤ ε_{i−1} and ‖G_i‖ ≤ Δβ_i. Then tan(U, Q_i) ≤ ε_i.

The key that sets our approach apart from that of Mitliagkas et al. (2013); Hardt and Price (2014) is the following observation. According to Lemma 7, for earlier iterations, one can in fact tolerate a larger ‖G_i‖ and thus a larger empirical error for estimating A. This allows us to use smaller blocks at the beginning to save on the number of samples, while still keeping the failure probability low.

More precisely, we have the following lemma, proved in Appendix E, with the parameters

    δ_i = δ_0/(2i²)  and  |I_i| = (c/(Δβ_i)²) log(d/δ_i),    (3)

where δ_0 is the error probability given in Lemma 1 and c is a large enough constant.

Lemma 8. For any i ≥ 1, given |I_i| samples in iteration i, we have Pr[‖G_i‖ > Δβ_i] ≤ δ_i.
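To see how the schedule (3) behaves numerically, the following sketch (ours; the constants c, δ_0, c̄ are unspecified in the text, so placeholder values are used, and lam, lam_hat stand for λ_k, λ_{k+1}) computes |I_i| for a toy setting:

```python
import math

def block_sizes(d, k, lam, lam_hat, num_blocks, c=1.0, delta0=0.1, cbar=1.0):
    """Block sizes |I_i| = (c/(Delta*beta_i)^2) * log(d/delta_i) from (3)."""
    lam_bar = max(lam_hat, lam / 4.0)      # lambda_bar of (2)
    gamma = (lam_bar / lam) ** 0.25        # gamma of (2)
    Delta = (lam - lam_bar) / 4.0          # Delta of (2)
    eps = math.sqrt(d * k / cbar)          # eps_0: initial tan(U, Q_0) bound
    sizes = []
    for i in range(1, num_blocks + 1):
        beta = min(gamma / math.sqrt(1 + eps ** 2), gamma * eps)   # beta_i
        delta_i = delta0 / (2 * i ** 2)                            # delta_i
        sizes.append(math.ceil(c / (Delta * beta) ** 2 * math.log(d / delta_i)))
        eps *= gamma                       # eps_i = eps_0 * gamma^i
    return sizes

sizes = block_sizes(d=1000, k=5, lam=0.5, lam_hat=0.1, num_blocks=25)
```

In this toy run the blocks first shrink while ε_{i−1} is large (the first term of β_i is active) and then grow geometrically once β_i = γε_{i−1} takes over, matching the earlier remark that |I_i| basically grows exponentially after some initial blocks.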

With this, we can bound the failure probability of our algorithm as

    Pr[∃i ≥ 0 : tan(U, Q_i) > ε_i] ≤ Pr[tan(U, Q_0) > ε_0] + Σ_{i≥1} Pr[‖G_i‖ > Δβ_i],

which by Lemma 1 and Lemma 8 is at most δ_0 + Σ_{i≥1} δ_i = δ_0 + Σ_{i≥1} δ_0/(2i²) ≤ 2δ_0.

To complete the proof of Theorem 2, it remains to bound the number of samples needed for achieving error ε. For this, we rely on the following lemma, which we prove in Appendix F.

Lemma 9. For some L ≤ O( (λ/(λ−λ̄)) log(d/ε) ), we have ε_L ≤ ε and Σ_{i=1}^{L} |I_i| ≤ O( λ log(dL) / (ε²(λ−λ̄)³) ).

Finally, as λ̄ = max(λ̂, λ/4), we have λ − λ̄ ≥ Ω(λ − λ̂), and putting this into the bound above yields the sample complexity bound stated in the theorem.

### 5 Experiment

We conduct experiments on two large real-world datasets, NYTimes and PubMed (Bache and Lichman, 2013), as used by Mitliagkas et al. (2013). The dimensions d of the data points in the two datasets are 102 and 141 thousand, respectively, which match our memory-restricted setting. The features of both datasets are normalized into [0, 1].

Parameter tuning is generally difficult for streaming algorithms. Instead of tuning the parameters extensively and reporting only the most optimistic (but perhaps unrealistic) parameter choice for each algorithm, we consider a thorough range of parameters but report the results of four parameter choices per algorithm, covering the best parameter choice, to understand each algorithm more deeply.

We compare the proposed SPCA and DBPCA with Alecton (fixed learning rate; Sa et al., 2015) and BPCA (fixed block size; Hardt and Price, 2014). For Alecton, we report the results of the learning rate γ ∈ {10⁻¹, ..., 10³}, with reasons to be explained in Subsection 5.1. For SPCA, we follow its existing work (Balsubramani et al., 2013) to fix n_0 = 0 while considering c ∈ {10¹, ..., 10⁸}. Then we report the results of c ∈ {10³, 10⁴, 10⁵, 10⁶}. For BPCA, we follow its existing works (Mitliagkas et al., 2013; Hardt and Price, 2014) and let the block size be ⌊N/T⌋, where N is the size of the dataset and T is the number of blocks.

Theoretical results of BPCA (Hardt and Price, 2014) suggest T = O( (λ/(λ−λ̂)) log d ). Because λ and λ̂ are unknown in practice, Mitliagkas et al. (2013); Hardt and Price (2014) set T = ⌊L log d⌋ with L = 1. Instead, we extend the range to L ∈ {5⁻¹, ..., 5⁶} and report the results of L ∈ {5⁰, 5¹, 5², 5³}.

[Figure omitted: four panels, each plotting the error sin(U, Q_n)² (y-axis, 0 to 1) against the number of data points (x-axis, 0 to 2×10⁵), for (a) Alecton with γ ∈ {0.1, 1, 10, 100}, (b) SPCA with c ∈ {10³, 10⁴, 10⁵, 10⁶}, (c) BPCA with L ∈ {1, 5, 25, 125}, and (d) DBPCA with γ² ∈ {0.6, 0.7, 0.8, 0.9}.]

Figure 1: Performance of different algorithms on NYTimes when k = 4

| Algorithm | T = 10⁵ | T = 2×10⁵ |
|---|---|---|
| Alecton | 0.232 ± 0.028 | 0.148 ± 0.008 |
| SPCA | 0.159 ± 0.008 | 0.079 ± 0.006 |
| BPCA | 0.234 ± 0.021 | 0.177 ± 0.012 |
| DBPCA | 0.138 ± 0.008 | 0.064 ± 0.005 |

Table 2: Performance of different algorithms with the best parameter on NYTimes when k = 4

For the proposed DBPCA, we set the initial block size to 2k to avoid being rank-insufficient in the first block. Then, we consider the ratio γ² ∈ {0.6, 0.7, 0.8, 0.9} for enlarging the block size.⁴

We run each algorithm 60 times by randomly generating data streams from the dataset. We take sin(U, Q_n)², the error function used in the convergence analysis, as the performance evaluation criterion. The average performance on the two datasets for k = 4 and k = 10 is shown in Figures 1, 2, 3 and 4, respectively. We observed similar results for other k values but do not include them here because of the space limit. Also, we report the mean and the standard error of each algorithm with the best parameters in Tables 2, 3, 4 and 5. To visualize the results clearly, we crop the figures at n = 200,000, which is sufficient for checking the convergence of most of the parameter choices of the algorithms.

5.1 Comparison between SPCA and Alecton

The main difference between SPCA and Alecton is the rule for determining the learning rate. The learning rate of SPCA decays with the number of iterations,

⁴Note that (2) suggests γ² ≥ 0.5.

[Figure omitted: four panels, each plotting the error sin(U, Q_n)² (y-axis, 0 to 1) against the number of data points (x-axis, 0 to 2×10⁵), for (a) Alecton with γ ∈ {0.1, 1, 10, 100}, (b) SPCA with c ∈ {10³, 10⁴, 10⁵, 10⁶}, (c) BPCA with L ∈ {1, 5, 25, 125}, and (d) DBPCA with γ² ∈ {0.6, 0.7, 0.8, 0.9}.]

Figure 2: Performance of different algorithms on NYTimes when k = 10

| Algorithm | T = 10⁵ | T = 2×10⁵ |
|---|---|---|
| Alecton | 0.385 ± 0.013 | 0.386 ± 0.012 |
| SPCA | 0.170 ± 0.023 | 0.102 ± 0.018 |
| BPCA | 0.487 ± 0.042 | 0.317 ± 0.034 |
| DBPCA | 0.207 ± 0.028 | 0.151 ± 0.022 |

Table 3: Performance of different algorithms with the best parameter on NYTimes when k = 10

which means it can achieve an arbitrarily small error as more data arrive. On the other hand, Alecton needs a pre-specified desired error to determine its fixed learning rate. To achieve the same error, from Table 1, SPCA and Alecton have the same asymptotic convergence rate theoretically. Next, we study their empirical performance.

Sa et al. (2015) use a conservative rule to determine the learning rate. The upper bound of the learning rate γ suggested in Sa et al. (2015) is smaller than 10⁻⁵ for both datasets. However, this conservative and fixed learning-rate scheme takes millions of iterations to converge to performance competitive with SPCA. Similar results can also be found in Sa et al. (2015).

Although the suggested learning rate should be small, we still study the performance of Alecton with larger learning rates, ranging from 10⁻⁴ to 10⁴. We report the results of γ ∈ {10⁻¹, 10⁰, 10¹, 10²}, which contain the optimal choices for the datasets used. SPCA is generally better than Alecton, as in Figure 1. From Tables 2, 3, 4 and 5, SPCA outperforms Alecton with the best parameters, which demonstrates the advantage of the decayed learning rate used by SPCA. From all figures, although Alecton with a larger learning rate (γ = 10) has a

[Figure omitted: four panels, each plotting the error sin(U, Q_n)² (y-axis, 0 to 1) against the number of data points (x-axis, 0 to 2×10⁵), for (a) Alecton with γ ∈ {0.1, 1, 10, 100}, (b) SPCA with c ∈ {10³, 10⁴, 10⁵, 10⁶}, (c) BPCA with L ∈ {1, 5, 25, 125}, and (d) DBPCA with γ² ∈ {0.6, 0.7, 0.8, 0.9}.]

Figure 3: Performance of different algorithms on PubMed when k = 4

| Algorithm | T = 10⁵ | T = 2×10⁵ |
|---|---|---|
| Alecton | 0.051 ± 0.007 | 0.042 ± 0.000 |
| SPCA | 0.033 ± 0.000 | 0.022 ± 0.000 |
| BPCA | 0.045 ± 0.001 | 0.044 ± 0.000 |
| DBPCA | 0.026 ± 0.000 | 0.013 ± 0.000 |

Table 4: Performance of different algorithms with the best parameter on PubMed when k = 4

faster convergence behaviour at the beginning, it is stuck at a suboptimal point and can not utilize the new incoming data. The smaller learning rate could usually results in bet- ter performance in the end, but it takes more iterations than the number SPCA needs.

5.2 Comparison between DBPCA and BPCA

From Figure 1 and Figure 2, DBPCA outperforms BPCA under most parameter choices when k = 4, and is competitive with BPCA when k = 10. The edge of DBPCA over BPCA is even more remarkable in Figure 3 and Figure 4.

With the best parameters, DBPCA is significantly better than BPCA by a t-test at 95% confidence.

BPCA has a drawback similar to Alecton's. As can be observed from Figures 1, 2, 3 and 4, if L is too small (larger blocks), BPCA only sees one or two blocks of data within n = 200,000 and cannot reduce the error much. BPCA typically needs L > 1 (smaller blocks) to achieve a lower error in the end; L = 125 gives the best performance of BPCA in Figures 3 and 4. However, although a large L (small blocks) sometimes allows BPCA to reduce the error quickly in the beginning, the error then cannot converge to a competitive level in the long run. For instance, in Figure 2(c), L = 125 converges fast but cannot improve much after n = 50,000, while L = 25 converges more slowly but keeps moving toward the lowest error at n = 200,000. Also, using smaller blocks cannot ensure that the error decreases after each update, and hence BPCA with larger L results in less stable curves even after averaging over 60 runs. These results show the difficulty of setting the parameters of BPCA by the strategies proposed in Mitliagkas et al. (2013) and Hardt and Price (2014).

Figure 4: Performance of different algorithms on PubMed when k = 10. Each panel plots the error sin^2(U, Q_n) against the number of data points (up to 2 × 10^5): (a) Alecton with γ ∈ {0.1, 1, 10, 100}; (b) SPCA with c ∈ {10^3, 10^4, 10^5, 10^6}; (c) BPCA with L ∈ {1, 5, 25, 125}; (d) DBPCA with γ^2 ∈ {0.6, 0.7, 0.8, 0.9}.

Table 5: Performance of different algorithms with the best parameter on PubMed when k = 10

|         | T = 10^5      | T = 2 × 10^5  |
|---------|---------------|---------------|
| Alecton | 0.291 ± 0.007 | 0.292 ± 0.009 |
| SPCA    | 0.274 ± 0.007 | 0.190 ± 0.040 |
| BPCA    | 0.415 ± 0.037 | 0.203 ± 0.030 |
| DBPCA   | 0.212 ± 0.024 | 0.141 ± 0.031 |
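The fixed-block scheme discussed above can be sketched as follows, with the stream split into L equal blocks; the function and variable names are ours, and the sketch omits the algorithmic details of Mitliagkas et al. (2013):

```python
import numpy as np

def bpca(X, k, L, seed=0):
    """Simplified block power method: split the n x d stream X into L
    equal blocks and run one power iteration per block's covariance."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for Xb in np.array_split(X, L):   # small L => larger blocks
        S = Xb.T @ Xb / len(Xb)       # empirical covariance of the block
        Q, _ = np.linalg.qr(S @ Q)    # noisy power iteration step
    return Q
```

The trade-off in the text is visible here: small L gives accurate per-block covariances but few power iterations, while large L gives many iterations with noisy covariances.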

On the other hand, DBPCA achieves better results by using smaller blocks in the beginning to make quick improvements and larger blocks later to further reduce the error. On both datasets and under all parameter choices, DBPCA stably reduces the error after each update, which matches our theoretical analysis that guarantees error reduction with high probability. In addition, DBPCA is quite stable with respect to the choice of γ^2 across the two datasets, making it easier to tune in practice. These properties make DBPCA favorable over BPCA in the family of block power methods.
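The growing-block idea can be sketched as below; b0 and growth are illustrative stand-ins for the schedule the paper derives from γ^2, not the paper's actual parameters:

```python
import numpy as np

def dbpca(X, k, b0=200, growth=1.5, seed=0):
    """Sketch of a dynamic-block power method: block sizes grow
    geometrically, so early blocks are small (fast initial progress)
    and later blocks are large (further error reduction)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    start, size = 0, float(b0)
    while start < n:
        Xb = X[start:start + int(size)]   # current block of the stream
        S = Xb.T @ Xb / len(Xb)           # block empirical covariance
        Q, _ = np.linalg.qr(S @ Q)        # one power iteration step
        start += len(Xb)
        size *= growth                    # larger blocks later on
    return Q
```

Unlike bpca above, the block size here is not fixed in advance, which is the property the experiments credit for DBPCA's stability.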

5.3 Comparison between DBPCA and SPCA

As observed, DBPCA is less sensitive to the parameter γ, which corresponds to the theoretical suggestion of max(λ̂/λ, 1/4)^{1/4}. SPCA, however, is rather sensitive to the parameter c, which corresponds to the theoretical suggestion of c_0/(λ − λ̂). For instance, setting c = 10^3 results in strong performance when k = 4 in Figure 1(b), but the worst performance when k = 10 in Figure 2(b). Similar results can be observed in Figure 3(b) and Figure 4(b) when c = 10^3. Furthermore, the parameter c in SPCA directly affects the step size of each gradient descent update. Thus, compared with the best parameter choice, a larger c leads to a less stable performance curve, while a smaller c sometimes results in significantly slower convergence. The results suggest that SPCA needs more careful tuning and/or deeper studies on proper parameter ranges.

From Tables 2, 3, 4 and 5, DBPCA significantly outperforms SPCA in 6 out of 8 cases by a t-test at 95% confidence. The result supports the theoretical finding that DBPCA has a better convergence rate guarantee than SPCA.

However, the benefit of SPCA is its immediate use of each new data point. DBPCA, as a representative of the block-power-method family, cannot update its solution until the end of the growing block, so the later points in the larger blocks may be effectively unused for a long period of time. For instance, in Figure 2, DBPCA uses larger blocks than necessary: after n = 150,000, the block size is near 20,000, which is less efficient.
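The error reported throughout the experiments is the squared sine between the true subspace U and the estimate Q_n. A hedged sketch of one common definition, via the largest principal angle, is shown below; the paper's exact convention may differ (e.g. it might sum over all principal angles):

```python
import numpy as np

def sin2_error(U, Q):
    """Squared sine of the largest principal angle between the column
    spaces of two d x k orthonormal bases U and Q.

    The singular values of U^T Q are the cosines of the principal
    angles, so 1 - min(s)^2 is the squared sine of the largest angle.
    """
    s = np.linalg.svd(U.T @ Q, compute_uv=False)
    return 1.0 - np.min(s) ** 2
```

The error is 0 when the two bases span the same subspace and 1 when some direction of U is orthogonal to the span of Q.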

### 6 Conclusion

We strengthen two families of streaming PCA algorithms, and compare the two strengthened families fairly from both theoretical and empirical sides. For the SGD family, we analyze the convergence rate of the famous SPCA algorithm for the multiple-principal-component case without specifying the error in advance; for the family of block power methods, we propose a dynamic-block algorithm, DBPCA, that enjoys a faster convergence rate than the original BPCA.

The empirical studies then demonstrate that DBPCA not only often outperforms BPCA by dynamically enlarging the block sizes, but also converges to competitive results more stably than SPCA in many cases. Both the theoretical and empirical studies thus justify DBPCA as the best among the two families, with the caveat that it stalls the use of data points in larger blocks.

Our work opens some new research directions. The empirical results suggest that SPCA is competitive with, or only slightly worse than, DBPCA. It is worth studying whether this results from the substantial difference between log(1/ε) and log log(1/ε) or is caused by the hidden constants in the bounds.

One conjecture is that the bound in Theorem 1 can be further improved. On the other hand, although (2) suggests γ^2 ≥ 0.5, the empirical results show that a larger γ^2 generally results in better performance. Hence, it is also worth studying whether the lower bound can be further improved.

### Acknowledgements

We thank the anonymous reviewers for valuable suggestions. This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under award number FA2386-15-1-4012, and by the Ministry of Science and Technology of Taiwan under numbers MOST 103-2221-E-001-015-MY3 and 103-2221-E-002-148-MY3. The work was partially done when the first author was a joint research assistant at Academia Sinica and National Taiwan University.

### References

Arora, R., Cotter, A., Livescu, K., and Srebro, N. (2012). Stochastic optimization for PCA and PLS. In Annual Allerton Conference on Communication, Control, and Computing.

Arora, R., Cotter, A., and Srebro, N. (2013). Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems.

Bache, K. and Lichman, M. (2013). UCI machine learning repository.

Balsubramani, A., Dasgupta, S., and Freund, Y. (2013). The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems.

Boutsidis, C., Garber, D., Karnin, Z. S., and Liberty, E. (2015). Online principal components analysis. In Proceedings of the Symposium on Discrete Algorithms.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Hardt, M. and Price, E. (2014). The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems.

Karnin, Z. and Liberty, E. (2015). Online PCA with spectral bounds. In Conference on Learning Theory.

Mitliagkas, I., Caramanis, C., and Jain, P. (2013). Memory limited, streaming PCA. In Advances in Neural Information Processing Systems.

Nie, J., Kotlowski, W., and Warmuth, M. K. (2013). Online PCA with optimal regrets. In International Conference on Algorithmic Learning Theory.

Oja, E. and Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications.

Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning.

Sa, C. D., Re, C., and Olukotun, K. (2015). Global convergence of stochastic gradient descent for some non-convex matrix problems. In International Conference on Machine Learning.

Shamir, O. (2015). Convergence of stochastic gradient descent for PCA. CoRR.

Warmuth, M. K. and Kuzmin, D. (2008). Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research.