
Large Linear Classification When Data Cannot Fit In Memory

Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang and Chih-Jen Lin, National Taiwan University

Recent advances in linear classification have shown that for applications such as document classification, the training can be extremely efficient. However, most of the existing training methods are designed by assuming that data can be stored in the computer memory. These methods cannot be easily applied to data larger than the memory capacity due to the random access to the disk. We propose and analyze a block minimization framework for data larger than the memory size. At each step a block of data is loaded from the disk and handled by certain learning methods. We investigate two implementations of the proposed framework for primal and dual SVMs, respectively. As data cannot fit in memory, many design considerations are very different from those for traditional algorithms. Experiments using data sets 20 times larger than the memory demonstrate the effectiveness of the proposed method.

Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design Methodology—Classifier design and evaluation

General Terms: Algorithms, Performance, Experimentation

Additional Key Words and Phrases: Pattern Recognition, Design Methodology, Classifier design and evaluation, Algorithms, Performance, Experimentation

ACM Reference Format:

Yu, H.-F., Hsieh, C.-J., Chang, K.-W., and Lin, C.-J. 2011. Large Linear Classification When Data Cannot Fit In Memory. ACM Trans. Knowl. Discov. Data. 9, 4, Article 39 (March 2010), 21 pages.

DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

1. INTRODUCTION

Linear classification¹ is useful in many applications, but training large-scale data remains an important research issue. For example, a category of the PASCAL Large Scale Learning Challenge² at ICML 2008 compares linear SVM implementations. The competition evaluates the time after data have been loaded into memory, but many participants found that the loading time costs even more, so some have concerns about the evaluation.³ This result indicates a landscape shift in large-scale linear classification: the time spent on reading/writing between memory and disk becomes the bottleneck. Existing training algorithms often need to iteratively access data, so without enough memory, the training time will be huge. To see how serious the situation is, Figure 1 presents the running time of applying an efficient linear classification package

¹ By linear classification we mean that data remain in the input space and kernel methods are not used.

² http://largescale.first.fraunhofer.de/workshop

³ http://hunch.net/?p=330

This work was supported in part by the National Science Council of Taiwan via the grant 98-2221-E-002- 136-MY3.

Author’s addresses: Yu, H.-F., Hsieh, C.-J., Chang, K.-W. and Lin, C.-J.. National Taiwan University.

Email: {b93107, b92085, b92084, cjlin}@csie.ntu.edu.tw

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

© 2010 ACM 1556-4681/2010/03-ART39 $10.00

DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000


Fig. 1: Data size versus training time by directly applying LIBLINEAR on a machine with 1GB memory (the actual available memory is about 895MB). When data size is larger than the memory capacity, the running time grows rapidly.

LIBLINEAR [Fan et al. 2008] to train data with different scales on a computer with 1 GB memory. Clearly, the time grows sharply when the data size is beyond the memory capacity.

We model the training time to contain two parts:

training time = time to run data in memory + time to access data from disk.    (1)

Traditional training algorithms, assuming that the second part is negligible, focus on the first part by minimizing the number of CPU operations. Linear classification, especially when applied to document classification, is in a situation where the second part may be more significant. Recent advances in linear classification (e.g., [Joachims 2006; Shalev-Shwartz et al. 2007; Bottou 2007; Hsieh et al. 2008]) have shown that training one million instances takes only a few seconds (without counting the loading time). Therefore, some have said that linear classification is essentially a solved problem if the memory is enough. However, handling data beyond the memory capacity remains a challenging research issue.

According to Langford et al. [2009], existing approaches to handle large data can be roughly categorized into two types. The first approach solves problems on distributed systems by parallelizing batch training algorithms (e.g., [Chang et al. 2008; Zhu et al. 2009]). However, not only is writing programs for a distributed system difficult, but the data communication/synchronization may also cause significant overhead. The second approach considers online learning algorithms. Since data may be used only once, this type of approach can effectively handle the memory issue. However, even with an online setting, an implementation over a distributed environment is still complicated; see the discussion in Section 2.1 of Langford et al. [2009]. In practice, Tong⁴ argues that keeping algorithms simple and robust is crucial. Therefore, an easy-to-use system is more favorable. Moreover, existing implementations (including those in large Internet companies) may lack important functions such as evaluation by different criteria, parameter selection, or feature selection.

This paper aims to construct large linear classifiers for ordinary users. We consider one assumption and one requirement:

— Assumption: Data cannot be stored in memory, but can be stored on the disk of one computer. Moreover, sub-sampling data to fit in memory causes lower accuracy.

⁴ See http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html.


— Requirement: The method must be simple so that multi-class classification, parameter selection, and other functions can be easily supported.

If sub-sampling does not downgrade the accuracy, some (e.g., [Yu et al. 2003]) have proposed approaches to select important instances by reading data from disk only once.

In this work, we discuss a simple and effective block minimization framework for applications satisfying the above assumption. We focus on batch learning, though extensions to online or incremental/decremental learning are straightforward. While many existing online learning studies claim to handle data beyond the memory capacity, most of them conduct simulations with enough memory and check the number of passes needed to access the data (e.g., [Shalev-Shwartz et al. 2007; Bottou 2007]). In contrast, we conduct experiments in a real environment without enough memory. An earlier linear-SVM study [Ferris and Munson 2003] has specifically addressed the situation where data are stored on disk, but it assumes that the number of features is much smaller than the number of data points. Our approach allows a large number of features, a situation that often occurs for document data sets.

This paper is organized as follows. In Section 2, we consider SVM as our linear classifier and propose a block minimization framework. Two implementations of the proposed framework for primal and dual SVM problems are in Sections 3 and 4, respectively. Techniques to minimize the training time modeled in (1) are in Section 5. Section 6 discusses the implementation of cross validation, multi-class classification, and incremental/decremental settings. Section 7 discusses related approaches for training linear classifiers when data are larger than the memory capacity. We show experiments in Section 8 and give conclusions in Section 9.

A short version of this work appears in a conference paper [Yu et al. 2010].

2. BLOCK MINIMIZATION FOR LINEAR SVMS

We consider linear SVM in this work because it is one of the most widely used linear classifiers. Given a data set {(x_i, y_i)}_{i=1}^{l}, x_i ∈ R^n, y_i ∈ {−1, +1}, SVM solves the following unconstrained optimization problem:⁵

\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max(1 - y_i w^T x_i,\ 0),    (2)

where C > 0 is a penalty parameter. This formulation considers L1 loss, though our approach can be easily extended to L2 loss. Problem (2) is often referred to as the primal form of SVM. One may instead solve its dual problem:

\min_{\alpha} \ f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \dots, l,    (3)

where e = [1, . . . , 1]^T and Q_{ij} = y_i y_j x_i^T x_j.

As data cannot fit in memory, the training method must avoid random accesses of data. In Figure 1, LIBLINEAR randomly accesses one instance at a time, so frequent moves of the disk head result in lengthy running time. A viable method must satisfy the following conditions:

1. Each optimization step reads a continuous chunk of training data.

2. The optimization procedure converges toward the optimum even though each step uses only a subset of training data.

⁵ The standard SVM comes with a bias term b. Here we do not consider this term for simplicity.


ALGORITHM 1: A block minimization framework for linear SVM
1. Split {1, . . . , l} into B_1, . . . , B_m and store data into m files accordingly.
2. Set initial α or w.
3. For k = 1, 2, . . . (outer iteration)
   — For j = 1, . . . , m (inner iteration)
      3.1. Read x_r, ∀r ∈ B_j, from disk
      3.2. Conduct operations on {x_r | r ∈ B_j}
      3.3. Update α or w

3. The number of optimization steps (iterations) should not be too large. Otherwise, the same data point may be accessed from the disk too many times.

Obtaining a method having all these properties is not easy. We will propose methods to achieve them to a certain degree.

In unconstrained optimization, block minimization is a classical method (e.g., [Bertsekas 1999, Chapter 2.7]). Each step of this method updates a block of variables, but here we need a connection to data. Let {B_1, . . . , B_m} be a partition of all data indices {1, . . . , l}. According to the memory capacity, we can decide the block size so that the instances associated with B_j and the variables used in the sub-problem can fit in memory. These m blocks, stored as m files, are loaded when needed. Then at each step, we conduct some operations using one block of data, and update w or α according to whether the primal or the dual problem is considered. We assume that w or α can be stored in memory. The block minimization framework is summarized in Algorithm 1. We refer to the step of working on a single block as an inner iteration, and to the m steps of going over all blocks as an outer iteration. Algorithm 1 can be applied to both the primal form (2) and the dual form (3). We show two implementations in Sections 3 and 4, respectively.
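To make the framework concrete, the following is a minimal Python sketch of Algorithm 1's outer/inner loop. The helpers load_block and solve_subproblem and the state dictionary are hypothetical placeholders for whatever block format and solver an implementation chooses; the actual software works on LIBLINEAR's internal sparse structures rather than this generic interface.

```python
def block_minimization(block_files, load_block, solve_subproblem, state, outer_iters=10):
    """Skeleton of Algorithm 1 (all helper names are assumptions, not the real API).

    block_files      -- paths of the m files holding B_1, ..., B_m
    load_block       -- reads one block file back into memory
    solve_subproblem -- updates `state` (w and/or alpha) using only the loaded block
    state            -- dict of the variables kept in memory, e.g. {"w": w, "alpha": alpha}
    """
    for k in range(outer_iters):            # outer iteration
        for path in block_files:            # inner iterations over B_1, ..., B_m
            block = load_block(path)        # step 3.1: read x_r, r in B_j, from disk
            solve_subproblem(block, state)  # steps 3.2-3.3: operate on the block, update alpha or w
    return state
```

The point of the skeleton is that only one block and the model variables ever reside in memory at the same time.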

We discuss some implementation considerations for Algorithm 1. For convenience, assume B_1, . . . , B_m have a similar size |B| = l/m. The total cost of Algorithm 1 is

\left(T_m(|B|) + T_d(|B|)\right) \times \frac{l}{|B|} \times \#\text{outer-iters},    (4)

where

— Tm(|B|) is the cost of operations at each inner iteration, and

— Td(|B|) is the cost to read a block of data from disk.

These two terms respectively correspond to the two parts in (1) for modeling the training time.

Many studies have applied block minimization to train SVM or other machine learning problems, but we might be the first to consider it at the disk level. Indeed, the major approach to train nonlinear SVM (i.e., SVM with nonlinear kernels) has been block minimization, which is often called a decomposition method in the SVM community. We discuss the difference between ours and existing studies in two aspects:

— variable selection for each block, and

— block size.

Existing SVM packages assume data in memory, so they can use flexible ways to select each B_j. They do not restrict B_1, . . . , B_m to be a split of {1, . . . , l}. Moreover, to decide the indices of one single B_j, they may access the whole set, an impossible situation for us. We are more confined here, as data associated with each B_j must be pre-stored in a file before running Algorithm 1.


Regarding the block size, we now go back to analyze (4). If data are all in memory, T_d(|B|) = 0. For T_m(|B|), people observe that if |B| linearly increases, then

|B| ր,  T_m(|B|) ր,  and  #outer-iters ց.    (5)

T_m(|B|) is generally more than linear in |B|, so T_m(|B|) × l/|B| increases along with |B|. In contrast, the #outer-iters may not decrease as quickly. Therefore, nearly all existing SVM packages use a small |B|. For example, |B| = 2 in LIBSVM [Chang and Lin 2001] and 10 in SVMlight [Joachims 1998]. With T_d(|B|) > 0, the situation is now very different. At each outer iteration, the cost is

T_m(|B|) \times \frac{l}{|B|} + T_d(|B|) \times \frac{l}{|B|}.    (6)

The second term is for reading l instances. As reading each block of data takes some initial time, a smaller number of blocks reduces the cost. Hence, the second term in (6) is a decreasing function of |B|. The first term, in contrast, increases with |B| following the earlier discussion; however, because reading data from the disk is slow, the second term is likely to dominate.

Therefore, contrary to existing SVM software, in our case the block size should not be too small. We will investigate this issue by experiments in Section 8.
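To see why the balance in (6) favors large blocks, the toy calculation below evaluates the per-outer-iteration cost under assumed cost models: T_m(|B|) grows slightly faster than linearly in |B|, while T_d(|B|) has a fixed per-block start-up cost plus a transfer cost. The constants and the exponent are illustrative assumptions, not measurements.

```python
def outer_iteration_cost(block_size, l, c_m=1e-6, c_0=0.5, c_d=1e-7):
    """Evaluate (6) under assumed models (illustrative constants only):
    T_m(|B|) = c_m * |B|**1.1   -- slightly super-linear in-memory cost
    T_d(|B|) = c_0 + c_d * |B|  -- per-block start-up (seek) cost plus transfer
    """
    n_blocks = l // block_size
    t_mem = c_m * block_size ** 1.1 * n_blocks    # first term of (6)
    t_disk = (c_0 + c_d * block_size) * n_blocks  # second term of (6)
    return t_mem + t_disk

l = 10 ** 7
for m in (10, 100, 1000, 10000):
    print(m, round(outer_iteration_cost(l // m, l), 1))
```

With these assumed constants the disk term dominates, so the per-iteration cost keeps shrinking as m decreases (i.e., as |B| grows), mirroring the argument above.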

The remaining issue is to decide operations at each inner iteration. The second and the third conditions mentioned earlier in this section should be considered. We discuss two implementations in the next two sections.

3. SOLVING DUAL SVM BY LIBLINEAR FOR EACH BLOCK

A nice property of the SVM dual problem (3) is that each variable corresponds to a training instance. Thus we can easily devise an implementation of Algorithm 1 by updating a block of variables at a time. Assume B̄_j = {1, . . . , l} \ B_j. At each inner iteration we solve the following sub-problem:

\min_{d_{B_j}} \ f(\alpha + d) \quad \text{subject to} \quad d_{\bar{B}_j} = 0 \ \text{and} \ 0 \le \alpha_i + d_i \le C, \ \forall i \in B_j.    (7)

That is, we update α_{B_j} using the solution of (7), while fixing α_{B̄_j}. Then, Algorithm 1 reduces to the standard block minimization procedure, so the convergence to the optimal function value of (3) holds [Bertsekas 1999, Proposition 2.7.1].

We must ensure that at each inner iteration, only one block of data is needed. With the constraint d_{B̄_j} = 0 in (7),

f(\alpha + d) = \frac{1}{2} d_{B_j}^T Q_{B_j B_j} d_{B_j} + (Q_{B_j,:}\alpha - e_{B_j})^T d_{B_j} + f(\alpha),    (8)

where Q_{B_j,:} is a sub-matrix of Q including elements Q_{ri}, r ∈ B_j, i = 1, . . . , l. Clearly, Q_{B_j,:} in (8) involves all training data, a situation violating the requirement of Algorithm 1. Fortunately, by maintaining

w \equiv \sum_{i=1}^{l} \alpha_i y_i x_i,    (9)

we have

Q_{r,:}\alpha - 1 = y_r w^T x_r - 1, \ \forall r \in B_j.


ALGORITHM 2: An implementation of Algorithm 1 for solving dual SVM. We only show details of steps 3.2 and 3.3:
3.2. Exactly or approximately solve the sub-problem (7) to obtain d_{B_j}
3.3. α_{B_j} ← α_{B_j} + d_{B_j}
     Update w by (10)

Therefore, if w is available in memory, only instances associated with the block B_j are needed. To maintain w, if d_{B_j} is an optimal solution of (7), we consider (9) and use

w \leftarrow w + \sum_{r \in B_j} d_r y_r x_r.    (10)

This operation again needs only the block B_j. The procedure is summarized in Algorithm 2.
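A dense-NumPy sketch of one inner iteration of Algorithm 2 is given below, assuming a plain coordinate descent pass over the block for the L1-loss dual; the real implementation calls LIBLINEAR on sparse data and uses its own stopping conditions, and the variable names here are ours.

```python
import numpy as np

def inner_iteration(X_block, y_block, idx, alpha, w, C):
    """One coordinate descent pass over block B_j for the dual problem (3),
    updating alpha_{B_j} and maintaining w via (10).  Dense-data sketch only.

    X_block -- |B_j| x n matrix of the loaded instances, y_block -- labels in {-1, +1}
    idx     -- global indices of the block (to address alpha), w -- current vector (9)
    """
    Q_ii = (X_block ** 2).sum(axis=1)             # diagonal entries Q_ii = x_i^T x_i
    for r in np.random.permutation(len(idx)):     # random order within the block
        i = idx[r]
        g = y_block[r] * w.dot(X_block[r]) - 1.0  # gradient of (3) in coordinate i
        if Q_ii[r] > 0:
            d = min(max(alpha[i] - g / Q_ii[r], 0.0), C) - alpha[i]  # projected Newton step
            if d != 0.0:
                alpha[i] += d
                w += d * y_block[r] * X_block[r]  # update (10): w <- w + d_r y_r x_r
    return alpha, w
```

Note that only the loaded block and the in-memory vectors alpha and w are touched, which is exactly the property required by Algorithm 1.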

For solving the sub-problem (7), as all the information is available in memory, any bound-constrained optimization method can be applied. We consider LIBLINEAR [Fan et al. 2008], which implements a coordinate descent method (i.e., block minimization with a single element in each block). Then, Algorithm 2 becomes a two-level block minimization method. The two-level setting has been used before for SVM or other applications (e.g., [Memisevic 2006; Pérez-Cruz et al. 2004; Rüping 2000]), but ours might be the first to associate the inner level with memory and the outer level with disk.

Algorithm 2 converges if each sub-problem (7) is exactly solved. Practically we often obtain an approximate solution by imposing a stopping criterion. We then address two issues:

1. The stopping criterion for solving the sub-problem must be satisfied after a finite number of operations, so we can move on to the next sub-problem.

2. We need to prove the convergence.

Next we show that these two issues can be resolved if using LIBLINEAR for solving the sub-problem. Let {α^k} be the sequence generated by Algorithm 2, where k is the index of outer iterations. Because each outer iteration contains m inner iterations, we can further consider a sequence

\{\alpha^{k,j}\}_{k=1,\ j=1}^{\infty,\ m+1} \quad \text{with} \quad \alpha^{k,1} = \alpha^{k} \ \text{and} \ \alpha^{k,m+1} = \alpha^{k+1}.

From α^{k,j} to α^{k,j+1}, LIBLINEAR coordinate-wisely updates variables in B_j to approximately solve the sub-problem (7), and we let t_{k,j} be the number of updates.

If the coordinate descent updates satisfy certain conditions, we can prove the convergence of {α^{k,j}}:

THEOREM 3.1. If a coordinate descent method is applied to solve (7) with the following properties:

1. each α_i, i ∈ B_j, is updated at least once, and
2. {t_{k,j}} is uniformly bounded,

then {α^{k,j}} generated by Algorithm 2 globally converges to an optimal solution α*. The convergence rate is at least linear: there are 0 < µ < 1 and an iteration k_0 such that

f(\alpha^{k+1}) - f(\alpha^*) \le \mu \left( f(\alpha^{k}) - f(\alpha^*) \right), \ \forall k \ge k_0.

The proof is in Appendix A. With Theorem 3.1, condition 2 mentioned in the beginning of Section 2 holds. For condition 3 on the convergence speed, block minimization does not have fast convergence rates. However, for problems like document classification, some (e.g., [Hsieh et al. 2008]) have shown that we do not need many iterations to get a reasonable model. Though Hsieh et al. [2008] differ from us by restricting |B| = 1, we hope to enjoy the same property of not needing many iterations. Experiments in Section 8 confirm that for some document data this property holds.

Next we discuss various ways to fulfill the two properties in Theorem 3.1.

3.1. Loosely Solving the Sub-problem

A simple setting to satisfy Theorem 3.1's two properties is to go through all variables in B_j a fixed number of times. Then not only is t_{k,j} uniformly bounded, but also the finite termination for solving each sub-problem holds. A small number of passes through B_j means that we loosely solve the sub-problem (7). While the cost per block is cheaper, the number of outer iterations may be large. Through experiments in Section 8, we discuss how the number of passes affects the running time. A special case is to go through all α_i, i ∈ B_j, exactly once. Then Algorithm 2 becomes a standard (one-level) coordinate descent method, though data are loaded in a block-wise setting.

For each pass through the data in one block, we can sequentially update the variables in B_j. However, as mentioned in [Hsieh et al. 2008], using a random permutation of B_j's elements as the update order usually leads to faster convergence in practice.

3.2. Accurately Solving the Sub-problem

Alternatively, we can accurately solve the sub-problem. The cost per inner iteration is higher, but the number of outer iterations may be reduced. As an upper bound on the number of iterations does not reveal how accurate the solution is, most optimization software considers the gradient information for the stopping condition. We check the setting in LIBLINEAR. Its gradient-based stopping condition (details shown in Appendix B) guarantees the finite termination in solving each sub-problem (7). Thus the procedure can move on to the next sub-problem without getting into an infinite loop.

Regarding the convergence, to use Theorem 3.1, we must show that {t_{k,j}} is uniformly bounded:

THEOREM 3.2. If coordinate descent steps with LIBLINEAR's stopping condition are used to solve (7), then Algorithm 2 either terminates in a finite number of outer iterations or

t_{k,j} \le 2|B_j|, \ \forall j, \ \text{after } k \text{ is large enough.}

Therefore, if LIBLINEAR is used to solve (7), then Theorem 3.1 implies the convergence.

4. SOLVING PRIMAL SVM BY PEGASOS FOR EACH BLOCK

Instead of solving the dual problem, in this section we check whether the framework in Algorithm 1 can be used to solve the primal SVM. Because the primal variable w does not correspond to data instances, we cannot use a standard block minimization setting to obtain a sub-problem like (7). In contrast, existing stochastic gradient descent methods possess a nice property that at each step only certain data points are used. In this section, we study how Pegasos [Shalev-Shwartz et al. 2007] can be used to implement Algorithm 1.

Pegasos considers a scaled form of the primal SVM problem:

\min_{w} \ \frac{1}{2lC} w^T w + \frac{1}{l} \sum_{i=1}^{l} \max(1 - y_i w^T x_i,\ 0),


ALGORITHM 3: An implementation of Algorithm 1 for solving primal SVM. Each inner iteration is by Pegasos
1. Split {1, . . . , l} into B_1, . . . , B_m and store data into m files accordingly.
2. t = 0 and initial w = 0.
3. For k = 1, 2, . . .
   — For j = 1, . . . , m
      3.1. Find a partition of B_j: B_j^1, . . . , B_j^{r̄}.
      3.2. For r = 1, . . . , r̄
         — Use B_j^r as B to conduct the update (11)-(13).
         — t ← t + 1

At the tth update, Pegasos chooses a block of data B and updates the primal variable w by a stochastic gradient descent step:

\bar{w} = w - \eta_t \nabla_t,    (11)

where η_t = lC/t is the learning rate and ∇_t is the sub-gradient

\nabla_t = \frac{1}{lC} w - \frac{1}{|B|} \sum_{i \in B^+} y_i x_i,    (12)

and B^+ ≡ {i ∈ B | y_i w^T x_i < 1}. Then Pegasos obtains w by scaling w̄:

w \leftarrow \min\left(1,\ \frac{\sqrt{lC}}{\|\bar{w}\|}\right) \bar{w}.    (13)
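For reference, a dense-NumPy sketch of one update (11)-(13) on a loaded block is shown below; the function and variable names are ours, and the real Pegasos implementation differs in details such as sparse data handling.

```python
import numpy as np

def pegasos_block_update(X_block, y_block, w, t, l, C):
    """One Pegasos step (11)-(13) on a block B of loaded data (dense sketch).
    t is the global update counter and l the total number of instances."""
    eta = l * C / t                                  # learning rate eta_t = lC/t
    margins = y_block * X_block.dot(w)
    B_plus = margins < 1                             # B+ = {i in B : y_i w^T x_i < 1}
    grad = w / (l * C)                               # first term of the sub-gradient (12)
    if B_plus.any():                                 # second term: -(1/|B|) sum_{i in B+} y_i x_i
        grad -= (y_block[B_plus][:, None] * X_block[B_plus]).sum(axis=0) / len(y_block)
    w_bar = w - eta * grad                           # step (11)
    norm = np.linalg.norm(w_bar)
    scale = min(1.0, np.sqrt(l * C) / norm) if norm > 0 else 1.0
    return scale * w_bar                             # projection step (13)
```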

Clearly, we can directly consider B_j in Algorithm 1 as the set B in the above update. Alternatively, we can conduct several Pegasos updates on a partition of B_j. Algorithm 3 gives the details of the procedure. Here we consider two settings for an inner iteration:

1. Using one Pegasos update on the whole block B_j.

2. Splitting B_j into |B_j| sets, where each one contains a single element of B_j, and then conducting |B_j| Pegasos updates.

Different from dual SVM, we should not solve the sub-problem of primal SVM accurately. Otherwise, the model will converge to a solution that only learns on the set B.

For the convergence, Pegasos is proved to converge if all instances {x_1, . . . , x_l} are used to update the model at each step; see [Shalev-Shwartz et al. 2007, Corollary 1]. It is also shown that the same convergence results hold in expectation if each update is conducted on a subset chosen i.i.d. from the entire data set; see [Shalev-Shwartz et al. 2007, Theorem 3]. Although Algorithm 3 is a special case of Pegasos, it splits the data into blocks B_j, ∀j, and updates on a subset of B_j at a time, so the subsets are not drawn i.i.d. from the entire data set. Therefore, we cannot directly apply the convergence proof. However, empirically we observe that Algorithm 3 converges without problems.

5. TECHNIQUES TO REDUCE THE TRAINING TIME

Many techniques have been proposed to make block minimization faster. However, these techniques may not be suitable here as they are designed by assuming that all data are in memory. Based on the complexity analysis in (6), in this section we propose three techniques to speed up Algorithm 1. One technique effectively shortens Td(|B|), while the other two aim at reducing the number of iterations.


ALGORITHM 4: Splitting data into blocks
1. Decide m and create m empty files.
2. For i = 1, . . . , l
   2.1. Convert x_i to a binary format x̄_i.
   2.2. Randomly choose a number j ∈ {1, . . . , m}.
   2.3. Append x̄_i to the end of the jth file.

5.1. Data Compression

The loading time Td(|B|) is a bottleneck of Algorithm 1 due to the slow disk access.

Except for some initial cost, T_d(|B|) is proportional to the length of the data. Hence, we can consider a compression strategy to reduce the loading time of each block. However, this strategy introduces two additional costs: the compression time at the beginning of Algorithm 1 and the decompression time when a block is loaded. The former is minor as we only do it once. For the latter, we must ensure that the loading time saved is more than the decompression time. The balance between compression speed and ratio has been well studied in the area of backup and networking tools [Morse 2005]. We choose a widely used compression library, zlib, for our implementation.⁶ Experiments in Section 8 show that if the reading speed of the disk is slow, the compression strategy effectively reduces the training time.

Because of using compression techniques, all blocks are stored in a binary format instead of a plain text form.
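As a rough illustration of this strategy, the sketch below uses Python's standard zlib module; the pickle-based serialization and the low compression level are assumptions for the example, not the binary format used in our implementation.

```python
import pickle
import zlib

def save_block(path, instances, level=1):
    """Serialize and compress one block before writing it to disk.  A low
    compression level trades ratio for speed, since decompression is paid
    every time the block is reloaded."""
    with open(path, "wb") as f:
        f.write(zlib.compress(pickle.dumps(instances), level))

def load_block(path):
    """Read a compressed block back into memory; for the strategy to pay off,
    decompression must cost less than the disk time it saves."""
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))
```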

5.2. Random Permutation of Sub-problems

In Algorithm 1, we sequentially work on blocks B1, B2, . . ., Bm. We can consider other ways such as a permutation of blocks to decide the order of sub-problems. In LIBLINEAR’s coordinate descent implementation, the authors randomly permute all variables at each iteration and report faster convergence. We adopt a permutation strategy here as the loading time is similar regardless of the order of sub-problems.

5.3. Split of Data

An important step of Algorithm 1 is to split the training data into m files. We need a careful design as data cannot be loaded into memory. To begin, we find the size of the data and decide the value m based on the memory size. This step does not have to go through the whole data set because the operating system provides information such as file sizes.

Then we can sequentially read data instances and save them to m files. However, data in the same class are often stored together in the training set, so we may get a block of data with the same label. This situation clearly causes slow convergence. Thus for each instance being read, we randomly decide which file it should be saved to. Algorithm 4 summarizes our procedure. It goes through data only once.
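A minimal sketch of Algorithm 4 in Python, assuming the training set is a text file in LIBSVM format; the per-record pickling and the block file names are illustrative choices only.

```python
import pickle
import random

def split_into_blocks(train_path, m, prefix="block"):
    """Algorithm 4: one sequential pass over the training file, appending each
    instance to a randomly chosen block file so that classes are well mixed."""
    files = [open(f"{prefix}_{j}.bin", "ab") for j in range(m)]
    try:
        with open(train_path) as f:
            for line in f:                      # read instances sequentially
                if not line.strip():
                    continue
                label, _, feats = line.partition(" ")
                x = [(int(k), float(v)) for k, v in
                     (tok.split(":") for tok in feats.split())]
                j = random.randrange(m)         # step 2.2: random block index
                pickle.dump((float(label), x), files[j])  # step 2.3: append binary record
    finally:
        for fh in files:
            fh.close()
```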

6. OTHER FUNCTIONALITY

A learning system only able to solve an optimization problem (3) is not practically useful. Other functions such as multi-class classification or cross validation (for parameter selection) are very important. We discuss how to implement these functions based on the design in Section 2.

6.1. Multi-class Classification

Existing multi-class approaches either train several two-class problems (e.g., one-against-one and one-against-the-rest) or solve one single optimization problem (e.g.,

⁶ http://www.zlib.net


ALGORITHM 5: A block minimization framework for the one-against-the-rest multi-class approach. We assume the K class labels are 1, . . . , K.
1. Split {1, . . . , l} into B_1, . . . , B_m, and store data into m files accordingly.
2. Set initial α^1, . . . , α^K and w^1, . . . , w^K, where K is the number of classes.
3. For k = 1, 2, . . . (outer iteration)
   — For j = 1, . . . , m (inner iteration)
      3.1. Read x_r, ∀r ∈ B_j, from disk
      3.2. For t = 1, . . . , K
         — Use B_j^t ≡ {x_r | r ∈ B_j and y_r = t} as positive data and B_j \ B_j^t as negative data to conduct the update on α^t and w^t.

[Crammer and Singer 2000]). Take the one-against-the-rest approach for a K-class problem as an example. We train K classifiers, where each one separates a class from the rest. If we sequentially train K models, the disk access time is K times more. To save the disk access time, a more complicated implementation is to train the K models together. We split each block B_j into B_j^1, . . . , B_j^K according to the class information. Then we solve K sub-problems simultaneously; that is, we use B_j^t as positive data and B_j \ B_j^t as negative data to update the vectors w^t and α^t. The details are in Algorithm 5. The one-against-one approach is less suitable as it needs K(K − 1)/2 vectors for w, which may consume too much memory. One-against-the-rest and the approach in [Crammer and Singer 2000] both need only K vectors.
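The inner step of Algorithm 5 can be sketched as follows, assuming a two-class update routine such as the one in Section 3 is available as a callback (all names here are hypothetical): every loaded block updates all K models before being evicted, so the data are read only once per outer iteration.

```python
def multiclass_inner_iteration(block, idx, labels, alphas, ws, C, update_two_class):
    """For one loaded block B_j, run the two-class update for every class t,
    treating class-t instances as positive and the rest of the block as negative.
    alphas[t], ws[t] hold the variables of the t-th one-against-the-rest model."""
    for t in range(len(ws)):
        y_t = [1.0 if y == t + 1 else -1.0 for y in labels]  # class labels are 1..K
        alphas[t], ws[t] = update_two_class(block, y_t, idx, alphas[t], ws[t], C)
    return alphas, ws
```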

6.2. Cross Validation

Assume we conduct v-fold cross validation. Due to the use of m blocks, a straightforward implementation is to split the m blocks into v groups. Each time one group of blocks is used for validation, while all remaining groups are for training. Similar to the situation in multi-class classification, the loading time is v times more than training a single model. To save the disk access time, a more complicated implementation is to train the v models together. For example, if v = 3, we split each block B_j into three parts B_j^1, B_j^2, and B_j^3. Then ∪_{j=1}^{m} (B_j^1 ∪ B_j^2) is the training set to validate ∪_{j=1}^{m} B_j^3. We maintain three vectors w^1, w^2, and w^3. Each time B_j is loaded, we solve three sub-problems to update the w vectors. This implementation effectively saves the data loading time, but the memory must be enough to store the v vectors w^1, . . . , w^v. The algorithm for cross validation is similar to Algorithm 5 for multi-class classification.

6.3. Incremental/Decremental Setting

Many practical applications retrain a model after collecting enough new data. Our approach can be extended to this scenario. We make a reasonable assumption that each time several blocks are added or removed. Using LIBLINEAR to solve the dual form as an example, to possibly save the number of iterations, we can reuse the vector w obtained earlier. Algorithm 2 maintains w = \sum_{i=1}^{l} y_i \alpha_i x_i, so the new initial w can be

w \leftarrow w + \sum_{i:\, x_i \text{ being added}} y_i \alpha_i x_i \;-\; \sum_{i:\, x_i \text{ being removed}} y_i \alpha_i x_i.    (14)

For data being added, α_i is simply set to zero, but for data being removed, their corresponding α_i are not available. To use (14), we must store α. That is, before and after solving each sub-problem, Algorithm 2 reads and saves α from/to disk.

If the primal problem is solved by Pegasos for each block, Algorithm 3 can be directly applied to incremental or decremental settings.
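A small sketch of the adjustment (14) for removed instances, assuming dense NumPy arrays and that the corresponding α values were saved to disk as described above (the helper name is ours).

```python
import numpy as np

def warm_start_w(w, X_removed, y_removed, alpha_removed):
    """Adjust w as in (14) when several instances are removed.  Newly added
    instances get alpha_i = 0 and therefore contribute nothing here.  Requires
    the saved alpha values of the removed instances (dense-data sketch)."""
    if len(alpha_removed):
        w = w - X_removed.T.dot(alpha_removed * y_removed)  # subtract sum_i y_i alpha_i x_i
    return w
```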


Table I: Data statistics: We assume a sparse storage. Each non-zero feature value needs 12 bytes (4 bytes for the feature index, and 8 bytes for the value). However, this 12-byte structure consumes 16 bytes on a 64-bit machine due to data structure alignment.

Data set      l           n           #nonzeros        Memory (Bytes)
yahoo-korea   460,554     3,052,939   156,436,656      2,502,986,496
kddcup10      29,890,095  19,264,093  566,345,790      9,061,532,640
webspam       350,000     16,609,143  1,304,697,446    20,875,159,136
epsilon       500,000     2,000       1,000,000,000    16,000,000,000

7. RELATED APPROACHES FOR LARGE-SCALE DATA

In this section, we discuss related approaches for training linear classifiers when data cannot fit in the memory. The comparisons between these approaches and our block minimization framework are in Section 8.3.

7.1. Data Sub-sampling

In many cases, sub-sampling the training data does not downgrade the prediction accuracy much. Therefore, by using only a portion of the training data that can fit into memory, we can employ standard training techniques. This approach usually works when the data quality is good. However, in some situations, dealing with a huge data set may still be necessary. For example, Internet companies collect a huge amount of web logs and train data on distributed learning systems. In Section 8.3, we demonstrate the relationship between testing performance and sub-sampling size.

7.2. Averaging Models Trained on Subsets of Data

Bagging [Breiman 1996] is a classical classification method. In the training phase, a bagging method randomly draws m subsets of samples from the entire data set.

Then, it trains m models w_1, . . . , w_m on each subset, separately. In the testing phase, the prediction of a testing instance is based on the decisions from the m models. If each subset can be stored in memory, then training is efficient. Similar to the block generation in our framework, this method needs to get the subsets in the beginning.

Although sometimes a bagging method can achieve an accurate model [Zinkevich et al. 2010], its solution is not the same as the model from solving (2). In Section 8.3, we compare the proposed block optimization framework with a bagging method, which averages m models trained on Bj, ∀j.

7.3. Online Learning Approaches

Online methods can easily deal with large-scale data. An online learning algorithm loads a single data point at a time and avoids storing the whole data in the memory.

In the following, we discuss an online learning package, Vowpal Wabbit [Langford et al. 2007].

Vowpal Wabbit provides several choices of loss functions. Here we take the hinge loss

ℓ(w; y, x) = max(0, 1 − y w^T x)

as an example, where y is the true label, w is the model, and x is an instance. When Vowpal Wabbit faces a new instance x, it updates w along the sub-gradient descent direction. Vowpal Wabbit supports the setting to pass over the data several times. During the first pass, Vowpal Wabbit saves the data points into a cache file. This is similar to our data compression strategy discussed in Section 5.1. In Section 8.3, we compare Vowpal Wabbit with the proposed block optimization framework.
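For illustration, a generic sub-gradient pass on the hinge loss over a data stream can be sketched as below; this is not Vowpal Wabbit's actual update rule (which additionally uses feature hashing and its own learning-rate schedule), and the step-size parameters are assumptions.

```python
import numpy as np

def online_hinge_pass(stream, w, eta0=0.1, power_t=0.5):
    """One pass of sub-gradient descent on the hinge loss over a stream of
    (x, y) pairs with y in {-1, +1}.  The learning-rate schedule is assumed."""
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / t ** power_t        # decaying step size
        if y * w.dot(x) < 1:             # hinge loss is active for this instance
            w = w + eta * y * x          # move along the negative sub-gradient
    return w
```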


Table II: Number of blocks and initial time to split and compress data into blocks. Time is in seconds.

Data set #Blocks Initial time

yahoo-korea 5 228

kddcup10 40 842

webspam 40 1,594

epsilon 30 1,237

8. EXPERIMENTS

In this section, we first conduct experiments to analyze the performance of the pro- posed approach. Then we investigate several implementation issues discussed in Sec- tion 5. Finally, we compare the proposed block minimization framework with other approaches for solving linear classification problems when data size is beyond the memory capacity.

We consider two document data sets, yahoo-korea⁷ and webspam, an artificial data set, epsilon,⁸ and an education data set, kddcup10, from a data mining challenge.⁹ Table I summarizes the data statistics.

Except for kddcup10, we randomly split each data set so that 4/5 is used for training and 1/5 for testing, and all feature vectors are instance-wisely scaled to unit length (i.e., ‖x_i‖ = 1, ∀i). For epsilon, each feature of the training set is normalized to have mean zero and variance one, and the testing set is modified according to the same scaling factors. This feature-wise scaling is conducted before the instance-wise scaling. For the kddcup10 data set, we use the same training and testing split as in Yu et al. [2010], and we do not further scale the feature vectors.

We conduct experiments on a 64-bit machine with 1GB RAM. Due to the space consumed by the operating system, the real memory capacity we can use is 895MB. The reading speed of the disk is 102.36 MB/sec.¹⁰

8.1. Comparison of Sub-problem Solvers

In this section, we compare the various settings introduced in Sections 3–4 for operations on a block of data. The value C in (2) is set to one.

— BLOCK-L-N: Algorithm 2 with LIBLINEAR to solve each sub-problem. LIBLINEAR goes through the block of data N rounds, where we consider N = 1, 10, and 20.

— BLOCK-L-D: Algorithm 2 with LIBLINEAR to solve each sub-problem. LIBLINEAR's default stopping condition is adopted.

— BLOCK-P-B: Algorithm 3 with r̄ = 1. That is, we apply one Pegasos update on the whole block.

— BLOCK-P-I: Algorithm 3 with r̄ = |B_j|. That is, we apply |B_j| Pegasos updates, each of which uses an individual data instance.

When data are larger than the memory capacity, training a model by LIBLINEAR on the entire data suffers from severe disk swapping [Yu et al. 2010]. Therefore, we omit the results of LIBLINEAR and only show the comparison between methods under the block minimization framework.

⁷ This data set is not publicly available.

⁸ webspam and epsilon can be downloaded at http://largescale.first.fraunhofer.de/instructions/

⁹ We use the second data set, bridge to algebra 2008 2009, in KDD Cup 2010. The preprocessed version can be downloaded at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

¹⁰ The reading speed of the disk is given by the program hdparm under a Linux environment.


Fig. 2: Relative function value difference to the minimum for (a) yahoo-korea, (b) webspam, (c) epsilon, and (d) kddcup10. Time (in seconds) is log-scaled. The blue dotted vertical line indicates the time spent by Algorithm 1-based methods for the initial split of data into blocks.

We make sure that no other jobs are running on the same machine and report wall clock time in all experiments. We include all data loading time and the initial time to split and compress data into blocks. Table II lists the number of blocks and the initial time to split and compress data into blocks.

We are interested in both how fast the methods reduce the objective function value (2) and how quickly the methods obtain a reasonable model. Figures 2 and 3 present two results:

1. Training time versus the relative difference to the optimum,

   \frac{f_P(w) - f_P(w^*)}{f_P(w^*)},

   where f_P is the primal objective function in (2) and w^* is the optimal solution. Since w^* is not really available, we spend enough training time to get a reference solution.

2. Training time versus the difference to the best testing accuracy,

   (acc^* - acc(w)) \times 100\%,

Fig. 3: Accuracy difference to the best testing accuracy for (a) yahoo-korea, (b) webspam, (c) epsilon, and (d) kddcup10. Time (in seconds) is log-scaled. The blue dotted vertical line indicates the time spent by Algorithm 1-based methods for the initial split of data into blocks. Note that in Figure 3(c), the curve of BLOCK-L-D is not connected, where the missing point corresponds to the best accuracy.

where acc(w) is the testing accuracy using the model w and acc^* is the best testing accuracy among all methods.

From Figure 3, BLOCK-L-∗ methods (using LIBLINEAR) are faster than BLOCK-P-∗ methods (using Pegasos) in most cases. One of the possible reasons is that for BLOCK-P-∗, the information of each block is underutilized. In particular, BLOCK-P-B suffers from very slow convergence, as for each block it conducts only one very simple update. However, it may not always be necessary to use the block of data in an exhaustive way. For example, in Figures 2(a) and 2(d), BLOCK-L-1 (for each block, LIBLINEAR goes through all data only once) is slightly faster than BLOCK-L-D (for each block, LIBLINEAR is run with the default stopping condition). Nevertheless, as reading each block from the disk is expensive, in general we should make proper efforts to use it.

The numbers of instances and features in kddcup10 are very large. In such a situation, all the methods converge slowly; see Figure 2(d).

Fig. 4: Effectiveness of two implementation techniques: raw: no random assignment in the initial data splitting; perm: a random order of blocks at each outer iteration. BLOCK-L-D is used.

Fig. 5: Convergence speed of using different m (number of blocks). BLOCK-L-D is used.

To store both w and α, BLOCK-L-∗ methods require 400MB of memory. However, BLOCK-P-∗ methods only need 160MB to store the model w. Because of using less memory, BLOCK-P-I is more likely to store the model w in a higher level of the memory hierarchy such as the L2 cache. Therefore, it performs as well as BLOCK-L-∗ in this case. The results also indicate that when determining the number of blocks in Algorithm 1, both l (the size of α) and n (the size of w) need to be taken into consideration.

Note that the objective values of BLOCK-P-∗ may not be decreasing, as Pegasos does not have this property. All BLOCK-∗ methods except BLOCK-P-B need around four outer iterations to achieve reasonable accuracy values. This number of iterations is small, so we do not need to read the training set many times.

8.2. Investigating Some Implementation Issues

In Section 5, we discussed several implementation issues for the block minimization framework. In the following, we demonstrate that a careful implementation can further speed up the block minimization methods.

8.2.1. Initial Block Split and Random Permutation of Sub-problems. Section 5.3 proposes randomly assigning data to blocks in the beginning of Algorithm 1. It also suggests that a random order of B_1, . . . , B_m at each iteration is useful. We are interested in their effectiveness. Figure 4 presents the result of running BLOCK-L-D on webspam. We assume the worst situation, in which data of the same class are grouped together in the input file. If data are not randomly split into blocks, clearly the convergence is very slow. Further, the random permutation of blocks at each iteration slightly improves the training time.

8.2.2. Block Size. In Figure 5, we present the training speed of BLOCK-L-D using various block sizes (equivalently, numbers of blocks). The data set webspam is considered.

The training time of using m = 40 blocks is smaller than that of m = 400 or 1000. This result is consistent with the discussion in Section 2. When the number of blocks is smaller (i.e., larger block size), from (5), the cost of operations on each block increases.

However, as we read fewer files, the total time is shorter. Furthermore, the initial split time is longer as m increases. Therefore, contrary to traditional SVM software, which uses small block sizes, now for each inner iteration we should consider a large block. In Figure 5, we do not check m = 20 because the memory is not enough to load a block of data.


8.2.3. Data Compression. We check whether compressing each block saves time. By running 10 outer iterations of BLOCK-L-D on the training set of webspam with m = 40, the implementation with compression takes 3,230 seconds, but without compression it needs 4,660 seconds. Thus, the compression technique is very useful in this case.

In practice, the data loading time depends heavily on the disk reading speed. If the disk reading speed is high, the data compression technique may even slow down the training process.

8.3. Comparisons with Other Methods for Large-scale Data

In Section 7, we discussed existing approaches for training large-scale data. In this section, we first show that the sub-sampling strategy may downgrade the performance on the data set we used. Then, we compare the proposed block minimization methods with other approaches for large-scale data.

To compare with random sub-sampling methods, we uniformly shuffle each data set and train (2) by LIBLINEAR on subsets of different sizes. Figure 6 presents the performance of the models trained on the subsets. Results show that for our four data sets, using only a portion of data that can fit in memory may not be enough to obtain a model as accurate as one trained on the entire data. Therefore, in this situation, a method that considers the whole data set is still useful.

Next, we compare the following approaches which are able to train the entire data set:

— BLOCK-L-10: This is one of the methods used in Section 8.1.

— AvgBlock: An approach introduced in Section 7.2. We average the models trained by LIBLINEAR with the default stopping condition on each block of data B_j, j = 1, . . . , m. Although AvgBlock can be trained on a distributed system with multiple machines, here we consider running it on a single computer.

— Vowpal Wabbit: An online method mentioned in Section 7.3. The package (version 4.1) is available online at http://hunch.net/~vw/. We set the parameters as -l 1000 -b 24 --initial_t 80000 --power_t 0.5 --loss_function hinge and ensure that this setting performs better than the default parameters.

We only report the results of BLOCK-L-10, as it is the most stable one among all the settings used in Section 8.1 for the block minimization method. Note that the above methods do not minimize the same objective function. Therefore, their final models may give slightly different testing accuracy values. For BLOCK-L-10 and AvgBlock, we select the parameter C in (2) by five-fold cross validation on the training set, and report their performances on the testing set.

We are interested in both whether BLOCK-L-10 and Vowpal Wabbit can obtain a reasonable model quickly and how fast their final convergence is. Therefore, we show in Table III the results after running the first iteration and the tenth iteration.

To show more details, we plot the testing performance along the training time in Figure 7. We omit the results of AvgBlock in Figure 7, as it cannot be conducted in an iterative manner.

The results indicate that BLOCK-L-10 efficiently obtains an accurate model. With only one outer iteration, BLOCK-L-10 obtains a reasonable model. After running 10 iterations, BLOCK-L-10 achieves an accuracy value that is almost the same as that of the final model. Vowpal Wabbit takes less training time; however, as it solves an unregularized problem instead of problem (2), it sometimes converges to a model with lower testing accuracy. On some data sets such as kddcup10, AvgBlock achieves accuracy values similar to BLOCK-L-10. However, on other data sets, BLOCK-L-10 is better because it solves (2) using all the training data.


Table III: Training time and testing accuracy after the first and the tenth outer iterations. Time is in seconds. For BLOCK-L-10 and AvgBlock, the time for the initial split of data is included.

                        yahoo-korea (C = 4)   kddcup10 (C = 0.1)   webspam (C = 64)   epsilon (C = 1)
method          #iter    acc.      time        acc.      time       acc.     time      acc.     time
BLOCK-L-10        1      85.97      259        88.49      862       99.32    1,944     89.12    1,802
                 10      87.29      456        89.89    3,153       99.51    4,475     89.78    3,773
Vowpal Wabbit     1      68.34      111        82.65      816       98.14      984     89.40    1,633
                 10      68.39      242        78.83    1,339       98.84    1,857     89.75    2,147
AvgBlock          1      86.08      628        89.64    6,809       98.40    4,722     88.83    1,999

Fig. 6: Data size versus difference to the best testing accuracy. The marker indicates the size of the subset that can fit in memory. Results show that training only sub-sampled data may not be enough to achieve the best testing performance.

9. DISCUSSION AND CONCLUSIONS

The discussion in Section 6 shows that implementing cross validation or multi-class classification may require extra memory space and some modifications of Algorithm 1.

Thus, constructing a complete learning tool is certainly more complicated than implementing Algorithm 1. There are many new and challenging future research issues.

In summary, we propose and analyze a block minimization method for large linear classification when data cannot fit in memory. Experiments show that the proposed method can effectively handle data 20 times larger than the memory size.

Our code is available at

http://www.csie.ntu.edu.tw/~cjlin/liblinear/exp.html

REFERENCES

BERTSEKAS, D. P. 1999. Nonlinear Programming Second Ed. Athena Scientific, Belmont, MA 02178-9998.

BOTTOU, L. 2007. Stochastic gradient descent examples. http://leon.bottou.org/projects/sgd.

Fig. 7: Testing accuracy versus training time for (a) yahoo-korea, (b) webspam, (c) epsilon, and (d) kddcup10. Time (in seconds) is log-scaled. The blue dotted vertical line indicates the time spent by Algorithm 1-based methods for the initial split of data into blocks.

BREIMAN, L. 1996. Bagging predictors. Machine Learning 24, 2, 123–140.

CHANG, C.-C. AND LIN, C.-J. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

CHANG, E., ZHU, K., WANG, H., BAI, H., LI, J., QIU, Z., AND CUI, H. 2008. Parallelizing support vector machines on distributed computers. In Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. MIT Press, Cambridge, MA, 257–264.

CRAMMER, K. AND SINGER, Y. 2000. On the learnability and design of output codes for multiclass problems. In Computational Learning Theory. 35–46.

FAN, R.-E., CHANG, K.-W., HSIEH, C.-J., WANG, X.-R., AND LIN, C.-J. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874.

FERRIS, M. AND MUNSON, T. 2003. Interior point methods for massive support vector machines. SIAM Journal on Optimization 13, 3, 783–804.

HSIEH, C.-J., CHANG, K.-W., LIN, C.-J., KEERTHI, S. S., AND SUNDARARAJAN, S. 2008. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML).


JOACHIMS, T. 1998. Making large-scale SVM learning practical. In Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. MIT Press, Cambridge, MA, 169–184.

JOACHIMS, T. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

LANGFORD, J., LI, L., AND STREHL, A. 2007. Vowpal Wabbit. http://hunch.net/~vw.

LANGFORD, J., LI, L., AND ZHANG, T. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research 10, 771–801.

LANGFORD, J., SMOLA, A. J., AND ZINKEVICH, M. 2009. Slow learners are fast. In Advances in Neural Information Processing Systems.

LUO, Z.-Q. AND TSENG, P. 1992. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications 72, 1, 7–35.

MEMISEVIC, R. 2006. Dual optimization of conditional probability models. Tech. rep., Department of Computer Science, University of Toronto.

MORSE, JR., K. G. 2005. Compression tools compared. Linux Journal.

PÉREZ-CRUZ, F., FIGUEIRAS-VIDAL, A. R., AND ARTÉS-RODRÍGUEZ, A. 2004. Double chunking for solving SVMs for very large datasets. In Proceedings of Learning 2004, Spain.

RÜPING, S. 2000. mySVM – another one of those support vector machines. Software available at http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/.

SHALEV-SHWARTZ, S., SINGER, Y., AND SREBRO, N. 2007. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML).

YU, H., YANG, J., AND HAN, J. 2003. Classifying large data sets using SVMs with hierarchical clusters. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, USA, 306–315.

YU, H.-F., HSIEH, C.-J., CHANG, K.-W., AND LIN, C.-J. 2010. Large linear classification when data cannot fit in memory. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

YU, H.-F., LO, H.-Y., HSIEH, H.-P., LOU, J.-K., MCKENZIE, T. G., CHOU, J.-W., CHUNG, P.-H., HO, C.-H., CHANG, C.-F., WEI, Y.-H., WENG, J.-Y., YAN, E.-S., CHANG, C.-W., KUO, T.-T., LO, Y.-C., CHANG, P. T., PO, C., WANG, C.-Y., HUANG, Y.-H., HUNG, C.-W., RUAN, Y.-X., LIN, Y.-S., LIN, S.-D., LIN, H.-T., AND LIN, C.-J. 2010. Feature engineering and classifier ensemble for KDD Cup 2010. Tech. rep., Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

ZHU, Z. A., CHEN, W., WANG, G., ZHU, C., AND CHEN, Z. 2009. P-packSVM: Parallel primal gradient descent kernel SVM. In Proceedings of the IEEE International Conference on Data Mining.

ZINKEVICH, M., WEIMER, M., SMOLA, A., AND LI, L. 2010. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds. 2595–2603.

Received ??; revised ??; accepted ??

A. PROOF OF THEOREM 3.1

If each sub-problem involves a finite number of coordinate descent updates, then Algorithm 1 can be regarded as a coordinate descent method. We apply Theorem 2.1 of Luo and Tseng [1992] to obtain the convergence results. The theorem requires that (3) satisfies certain conditions and that in the coordinate descent method there is an integer t such that every α_i is iterated at least once every t successive updates (called the almost cyclic rule in Luo and Tseng [1992]). Following the same analysis as in the proof of Hsieh et al. [2008, Theorem 1], (3) satisfies the required conditions. Moreover, the two properties on t_{k,j} imply the almost cyclic rule. Hence, both global and linear convergence results are obtained.


B. PROOF OF THEOREM 3.2

To begin, we discuss the stopping condition of LIBLINEAR. Each run of LIBLINEAR to solve a sub-problem generates {α^{k,j,v} | v = 1, . . . , t_{k,j} + 1} with

\alpha^{k,j} = \alpha^{k,j,1} \quad \text{and} \quad \alpha^{k,j+1} = \alpha^{k,j,t_{k,j}+1}.

We further let i_{j,v} denote the index of the variable being updated by α^{k,j,v+1} = α^{k,j,v} + d^* e_{i_{j,v}}, where d^* is the optimal solution of

\min_{d} \ f(\alpha^{k,j,v} + d\, e_{i_{j,v}}) \quad \text{subject to} \quad 0 \le \alpha^{k,j,v}_{i_{j,v}} + d \le C,    (15)

and e_{i_{j,v}} is an indicator vector for the (i_{j,v})th element. All t_{k,j} updates can be further separated into several rounds, where each one goes through all elements in B_j. LIBLINEAR checks the following stopping condition at the end of each round:

\max_{v \in \text{a round}} \nabla^P_{i_{j,v}} f(\alpha^{k,j,v}) - \min_{v \in \text{a round}} \nabla^P_{i_{j,v}} f(\alpha^{k,j,v}) \le \epsilon,    (16)

where ε is a tolerance and ∇^P f(α) is the projected gradient:

\nabla^P_i f(\alpha) = \begin{cases} \nabla_i f(\alpha) & \text{if } 0 < \alpha_i < C, \\ \max(0, \nabla_i f(\alpha)) & \text{if } \alpha_i = C, \\ \min(0, \nabla_i f(\alpha)) & \text{if } \alpha_i = 0. \end{cases}    (17)

The reason that LIBLINEAR considers (16) is that, from the optimality condition, α is optimal if and only if ∇^P f(α) = 0.
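A small sketch of the projected gradient (17) and the round-based check (16), assuming the per-round projected-gradient values have been collected into a list (the function names are ours).

```python
def projected_gradient(grad_i, alpha_i, C):
    """Projected gradient (17) for the box constraint 0 <= alpha_i <= C."""
    if alpha_i <= 0.0:
        return min(0.0, grad_i)
    if alpha_i >= C:
        return max(0.0, grad_i)
    return grad_i

def round_converged(pg_values, eps):
    """Stopping condition (16): the spread of the projected gradients seen in
    one round over the block must fall below the tolerance eps."""
    return max(pg_values) - min(pg_values) <= eps
```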

Next we prove the theorem by showing that for all j = 1, . . . , m there exists k_j such that

\forall k \ge k_j, \quad t_{k,j} \le 2|B_j|.    (18)

Suppose that (18) does not hold. We can find a j and a sub-sequence R ⊂ {1, 2, . . .} such that

t_{k,j} > 2|B_j|, \ \forall k \in R.    (19)

Since {α^{k,j} | k ∈ R} are in a compact set, we further consider a sub-sequence M ⊂ R such that {α^{k,j} | k ∈ M} converges to a limit point ᾱ.

Let σ ≡ min_i Q_{ii}. Following the explanation in Hsieh et al. [2008, Theorem 1], we only need to analyze indices with Q_{ii} > 0. Therefore, σ > 0. Lemma 2 of Hsieh et al. [2008] shows that

f(\alpha^{k,j,v}) - f(\alpha^{k,j,v+1}) \ge \frac{\sigma}{2} \|\alpha^{k,j,v} - \alpha^{k,j,v+1}\|^2, \ \forall v = 1, \dots, 2|B_j|.    (20)

The sequence {f(α^k) | k = 1, . . .} is decreasing and bounded below as the feasible region is compact. Hence

\lim_{k \to \infty} f(\alpha^{k,j,v}) - f(\alpha^{k,j,v+1}) = 0, \ \forall v = 1, \dots, 2|B_j|.    (21)

Using (21) and taking the limit on both sides of (20), we have

\lim_{k \in M,\, k \to \infty} \alpha^{k,j,2|B_j|+1} = \lim_{k \in M,\, k \to \infty} \alpha^{k,j,2|B_j|} = \cdots = \lim_{k \in M,\, k \to \infty} \alpha^{k,j,1} = \bar{\alpha}.    (22)

From the continuity of ∇f(α) and (22), we have

\lim_{k \in M,\, k \to \infty} \nabla f(\alpha^{k,j,v}) = \nabla f(\bar{\alpha}), \ \forall v = 1, \dots, 2|B_j|.


Hence there are ε and k̄ such that for all k ∈ M with k ≥ k̄,

|\nabla_i f(\alpha^{k,j,v})| \le \frac{\epsilon}{4} \quad \text{if } \nabla_i f(\bar{\alpha}) = 0,    (23)

\nabla_i f(\alpha^{k,j,v}) \ge \frac{3\epsilon}{4} \quad \text{if } \nabla_i f(\bar{\alpha}) > 0,    (24)

\nabla_i f(\alpha^{k,j,v}) \le -\frac{3\epsilon}{4} \quad \text{if } \nabla_i f(\bar{\alpha}) < 0,    (25)

for any i ∈ B_j, v ≤ 2|B_j|.

When we update α^{k,j,v} to α^{k,j,v+1} by changing the ith element (i.e., i = i_{j,v}) in the first round, the optimality condition for (15) implies that one of the following three situations occurs:

\nabla_i f(\alpha^{k,j,v+1}) = 0,    (26)

\nabla_i f(\alpha^{k,j,v+1}) > 0 \ \text{and} \ \alpha^{k,j,v+1}_i = 0,    (27)

\nabla_i f(\alpha^{k,j,v+1}) < 0 \ \text{and} \ \alpha^{k,j,v+1}_i = C.    (28)

From (23)-(25), we have that i satisfies

\begin{cases} (26) & \text{if } \nabla_i f(\bar{\alpha}) = 0, \\ (27) & \text{if } \nabla_i f(\bar{\alpha}) \ge 0, \\ (28) & \text{if } \nabla_i f(\bar{\alpha}) \le 0. \end{cases}    (29)

In the second round, assume α_i is changed at the vth update. From (29) and (23)-(25), we have

|\nabla_i f(\alpha^{k,j,v})| \le \frac{\epsilon}{4},    (30)

or

\nabla_i f(\alpha^{k,j,v}) \ge -\frac{\epsilon}{4} \ \text{and} \ \alpha^{k,j,v}_i = 0,    (31)

or

\nabla_i f(\alpha^{k,j,v}) \le \frac{\epsilon}{4} \ \text{and} \ \alpha^{k,j,v}_i = C.    (32)

Using (30)-(32), the projected gradient defined in (17) satisfies

|\nabla^P_i f(\alpha^{k,j,v})| \le \frac{\epsilon}{4}.

This result holds for all i ∈ B_j. Therefore,

\max_{v \in \text{2nd round}} \nabla^P_{i_{j,v}} f(\alpha^{k,j,v}) - \min_{v \in \text{2nd round}} \nabla^P_{i_{j,v}} f(\alpha^{k,j,v}) \le \frac{\epsilon}{4} - \left(-\frac{\epsilon}{4}\right) = \frac{\epsilon}{2} < \epsilon.

Thus, (16) is valid in the second round. Then t_{k,j} = 2|B_j|, which violates (19). Hence (18) holds and the theorem is obtained.
