**Large Linear Classification When Data Cannot Fit In Memory**

Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang and Chih-Jen Lin, National Taiwan University

Recent advances in linear classification have shown that for applications such as document classification, the training can be extremely efficient. However, most of the existing training methods are designed by assuming that data can be stored in the computer memory. These methods cannot be easily applied to data larger than the memory capacity due to the random access to the disk. We propose and analyze a block minimization framework for data larger than the memory size. At each step a block of data is loaded from the disk and handled by certain learning methods. We investigate two implementations of the proposed framework for primal and dual SVMs, respectively. As data cannot fit in memory, many design considerations are very different from those for traditional algorithms. Experiments using data sets 20 times larger than the memory demonstrate the effectiveness of the proposed method.

**Categories and Subject Descriptors:** I.5.2 [**Pattern Recognition**]: Design Methodology—*Classifier design and evaluation*

General Terms: Algorithms, Performance, Experimentation

Additional Key Words and Phrases: Pattern Recognition, Design Methodology, Classifier design and evaluation, Algorithms, Performance, Experimentation

**ACM Reference Format:**

Yu, H.-F., Hsieh, C.-J., Chang, K.-W., and Lin, C.-J. 2011. Large Linear Classification When Data Cannot Fit In Memory. ACM Trans. Knowl. Discov. Data. 9, 4, Article 39 (March 2010), 21 pages.

DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

**1. INTRODUCTION**

Linear classification^{1} is useful in many applications, but training large-scale data remains an important research issue. For example, a category of the PASCAL Large Scale Learning Challenge^{2} at ICML 2008 compares linear SVM implementations. The competition evaluates the time after data have been loaded into memory, but many participants found that the loading time costs more. Thus some have concerns about the evaluation.^{3} This result indicates a landscape shift in large-scale linear classification, because the time spent on reading/writing between memory and disk becomes the bottleneck. Existing training algorithms often need to access data iteratively, so without enough memory, the training time will be huge. To see how serious the situation is, Figure 1 presents the running time of applying an efficient linear classification package

1By linear classification we mean that data remain in the input space and kernel methods are not used.

2http://largescale.first.fraunhofer.de/workshop

3http://hunch.net/?p=330

This work was supported in part by the National Science Council of Taiwan via the grant 98-2221-E-002-136-MY3.

Authors' addresses: Yu, H.-F., Hsieh, C.-J., Chang, K.-W., and Lin, C.-J., National Taiwan University.

Email: {b93107, b92085, b92084, cjlin}@cise.ntu.edu.tw

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

© 2010 ACM 1556-4681/2010/03-ART39 $10.00

DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

Fig. 1: Data size versus training time by directly applying LIBLINEAR on a machine with 1GB memory (the actual available memory is about 895MB). When data size is larger than the memory capacity, the running time grows rapidly.

LIBLINEAR [Fan et al. 2008] to train data with different scales on a computer with 1 GB memory. Clearly, the time grows sharply when the data size is beyond the memory capacity.

We model the training time as containing two parts:

training time = time to run data in memory + time to access data from disk.  (1)

Traditional training algorithms, assuming that the second part is negligible, focus on the first part by minimizing the number of CPU operations. Linear classification, especially when applied to document classification, is in a situation where the second part may be more significant. Recent advances in linear classification (e.g., [Joachims 2006; Shalev-Shwartz et al. 2007; Bottou 2007; Hsieh et al. 2008]) have shown that training one million instances takes only a few seconds (without counting the loading time). Therefore, some have said that linear classification is essentially a solved problem if the memory is enough. However, handling data beyond the memory capacity remains a challenging research issue.

According to Langford et al. [2009], existing approaches to handling large data can be roughly categorized into two types. The first approach solves problems on distributed systems by parallelizing batch training algorithms (e.g., [Chang et al. 2008; Zhu et al. 2009]). However, not only is writing programs on a distributed system difficult, but the data communication/synchronization may also cause significant overheads. The second approach considers online learning algorithms. Since data may be used only once, this type of approach can effectively handle the memory issue. However, even in an online setting, an implementation over a distributed environment is still complicated; see the discussion in Section 2.1 of Langford et al. [2009]. In practice, Tong^{4} argues that keeping algorithms simple and robust is crucial. Therefore, an easy-to-use system is more favorable. Moreover, existing implementations (including those in large Internet companies) may lack important functions such as evaluation by different criteria, parameter selection, or feature selection.

This paper aims to construct large linear classifiers for ordinary users. We consider one assumption and one requirement:

— Assumption: Data cannot be stored in memory, but can be stored in the disk of one computer. Moreover, sub-sampling data to fit in memory causes lower accuracy.

4See http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html.

— Requirement: The method must be simple so that support for multi-class classification, parameter selection, and other functions can be easily done.

For the case where sub-sampling does not downgrade the accuracy, some studies (e.g., [Yu et al. 2003]) have proposed approaches to select important instances by reading data from disk only once.

In this work, we discuss a simple and effective block minimization framework for applications satisfying the above assumption. We focus on batch learning, though extensions to online or incremental/decremental learning are straightforward. While many existing online learning studies claim to handle data beyond the memory capacity, most of them conduct simulations with enough memory and check the number of passes over the data (e.g., [Shalev-Shwartz et al. 2007; Bottou 2007]). In contrast, we conduct experiments in a real environment without enough memory. An earlier linear-SVM study [Ferris and Munson 2003] has specifically addressed the situation where data are stored on disk, but it assumes that the number of features is much smaller than the number of data points. Our approach allows a large number of features, a situation often occurring for document data sets.

This paper is organized as follows. In Section 2, we consider SVM as our linear classifier and propose a block minimization framework. Two implementations of the proposed framework for primal and dual SVM problems are respectively in Sections 3 and 4. Techniques to minimize the training time modeled in (1) are in Section 5. Section 6 discusses the implementation of cross validation, multi-class classification, and incremental/decremental settings. Section 7 discusses related approaches for training linear classifiers when data are larger than memory capacity. We show experiments in Section 8 and give conclusions in Section 9.

A short version of this work appears in a conference paper [Yu et al. 2010].

**2. BLOCK MINIMIZATION FOR LINEAR SVMS**

We consider linear SVM in this work because it is one of the most widely used linear classifiers. Given a data set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, $\mathbf{x}_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$, SVM solves the following unconstrained optimization problem:^{5}

$$\min_{\mathbf{w}} \quad \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{l} \max(1 - y_i\mathbf{w}^T\mathbf{x}_i,\, 0), \qquad (2)$$

where $C > 0$ is a penalty parameter. This formulation considers L1 loss, though our approach can be easily extended to L2 loss. Problem (2) is often referred to as the primal form of SVM. One may instead solve its dual problem:

$$\begin{aligned} \min_{\boldsymbol{\alpha}} \quad & f(\boldsymbol{\alpha}) = \frac{1}{2}\boldsymbol{\alpha}^T Q \boldsymbol{\alpha} - \mathbf{e}^T\boldsymbol{\alpha} \\ \text{subject to} \quad & 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \end{aligned} \qquad (3)$$

where $\mathbf{e} = [1, \ldots, 1]^T$ and $Q_{ij} = y_i y_j \mathbf{x}_i^T \mathbf{x}_j$.

As data cannot fit in memory, the training method must avoid random access of data. In Figure 1, LIBLINEAR randomly accesses one instance at a time, so frequent moves of the disk head result in lengthy running time. A viable method must satisfy the following conditions:

1. Each optimization step reads a contiguous chunk of training data.

2. The optimization procedure converges toward the optimum even though each step uses only a subset of training data.

5The standard SVM formulation includes a bias term b. Here we omit this term for simplicity.

**ALGORITHM 1:** A block minimization framework for linear SVM

1. Split {1, . . . , l} into B_1, . . . , B_m and store the data into m files accordingly.

2. Set initial α or w.

3. For k = 1, 2, . . . (outer iteration)

— For j = 1, . . . , m (inner iteration)
  3.1. Read x_r, ∀r ∈ B_j from disk.
  3.2. Conduct operations on {x_r | r ∈ B_j}.
  3.3. Update α or w.

3. The number of optimization steps (iterations) should not be too large. Otherwise, the same data point may be accessed from the disk too many times.

Obtaining a method having all these properties is not easy. We will propose methods to achieve them to a certain degree.
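The outer/inner loop structure of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative skeleton, not the authors' implementation: `load_block` (reading one block file sequentially from disk) and `update_model` (the in-memory learner of step 3.2) are hypothetical placeholders.

```python
def block_minimization(block_files, model, update_model, load_block,
                       n_outer=10):
    """Skeleton of Algorithm 1 (a sketch). `block_files` are the m
    pre-split files; `load_block` reads one block from disk and
    `update_model` performs steps 3.2-3.3 on the loaded block."""
    for k in range(n_outer):              # outer iteration
        for path in block_files:          # inner iteration over m blocks
            X, y = load_block(path)       # step 3.1: sequential disk read
            model = update_model(model, X, y)  # steps 3.2-3.3
    return model
```

Sections 3 and 4 plug concrete dual and primal solvers into the `update_model` slot.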

In unconstrained optimization, block minimization is a classical method (e.g., [Bertsekas 1999, Chapter 2.7]). Each step of this method updates a block of variables, but here we need a connection to data. Let $\{B_1, \ldots, B_m\}$ be a partition of all data indices $\{1, \ldots, l\}$. According to the memory capacity, we can decide the block size so that the instances associated with $B_j$ and the variables used in the sub-problem can fit in memory. These m blocks, stored as m files, are loaded when needed. Then at each step, we conduct some operations using one block of data, and update w or α according to whether the primal or the dual problem is considered. We assume that w or α can be stored in memory. The block minimization framework is summarized in Algorithm 1. We refer to the step of working on a single block as an inner iteration, while the m steps of going over all blocks form an outer iteration. Algorithm 1 can be applied to both the primal form (2) and the dual form (3). We show two implementations in Sections 3 and 4, respectively.

We discuss some implementation considerations for Algorithm 1. For convenience, assume $B_1, \ldots, B_m$ have a similar size $|B| = l/m$. The total cost of Algorithm 1 is

$$\big(T_m(|B|) + T_d(|B|)\big) \times \frac{l}{|B|} \times \#\text{outer-iters}, \qquad (4)$$

where

— $T_m(|B|)$ is the cost of operations at each inner iteration, and

— $T_d(|B|)$ is the cost to read a block of data from disk.

These two terms respectively correspond to the two parts in (1) for modeling the training time.

Many studies have applied block minimization to train SVM or other machine learning problems, but we might be the first to consider it at the disk level. Indeed the major approach to train nonlinear SVM (i.e., SVM with nonlinear kernels) has been block minimization, often called decomposition methods in the SVM community. We discuss the difference between ours and existing studies in two aspects:

— variable selection for each block, and

— block size.

Existing SVM packages assume data are in memory, so they can use flexible ways to select each $B_j$. They do not restrict $B_1, \ldots, B_m$ to be a split of $\{1, \ldots, l\}$. Moreover, to decide the indices of one single $B_j$, they may access the whole set, an impossible situation for us. We are more confined here, as the data associated with each $B_j$ must be pre-stored in a file before running Algorithm 1.

Regarding the block size, we now go back to analyze (4). If data are all in memory, $T_d(|B|) = 0$. For $T_m(|B|)$, one observes that as $|B|$ grows,

$$|B| \nearrow,\quad T_m(|B|) \nearrow,\quad \text{and}\quad \#\text{outer-iters} \searrow. \qquad (5)$$

$T_m(|B|)$ is generally more than linear in $|B|$, so $T_m(|B|) \times l/|B|$ increases with $|B|$. In contrast, the number of outer iterations may not decrease as quickly. Therefore, nearly all existing SVM packages use a small $|B|$. For example, $|B| = 2$ in LIBSVM [Chang and Lin 2001] and 10 in SVM^{light} [Joachims 1998]. With $T_d(|B|) > 0$, the situation is now very different. At each outer iteration, the cost is

$$T_m(|B|) \times \frac{l}{|B|} + T_d(|B|) \times \frac{l}{|B|}. \qquad (6)$$

The second term is for reading l instances. As reading each block of data takes some initial time, a smaller number of blocks reduces the cost. Hence, the second term in (6) is a decreasing function of $|B|$. While the first term is increasing following the earlier discussion, because reading data from the disk is slow, the second term is likely to dominate. Therefore, contrary to existing SVM software, in our case the block size should not be too small. We will investigate this issue by experiments in Section 8.
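The trade-off in (6) can be made concrete with a toy cost model. The sketch below is purely illustrative: the constants (per-block seek time, per-instance read time, a superlinear in-memory cost) are assumptions, not measurements from the paper.

```python
def outer_iteration_cost(B, l, t_mem, t_seek, t_read):
    """Evaluate cost (6) for block size B under toy cost models:
    t_mem(B) is the (superlinear) in-memory work per block, and
    t_seek + t_read*B is the disk cost per block. Illustrative only."""
    m = l / B                        # number of blocks l/|B|
    Tm = t_mem(B)                    # first term of (6), per block
    Td = t_seek + t_read * B         # second term of (6), per block
    return Tm * m + Td * m

# With a fixed per-block seek cost, fewer (larger) blocks cut disk time.
l = 1_000_000
cost_small = outer_iteration_cost(10, l, lambda B: 1e-6 * B**1.2, 0.01, 1e-7)
cost_large = outer_iteration_cost(100_000, l, lambda B: 1e-6 * B**1.2, 0.01, 1e-7)
```

Under these assumed constants, `cost_large` is far below `cost_small`, matching the argument that the block size should not be too small when disk access dominates.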

The remaining issue is to decide operations at each inner iteration. The second and the third conditions mentioned earlier in this section should be considered. We discuss two implementations in the next two sections.

**3. SOLVING DUAL SVM BY LIBLINEAR FOR EACH BLOCK**

A nice property of the SVM dual problem (3) is that each variable corresponds to a training instance. Thus we can easily devise an implementation of Algorithm 1 by updating a block of variables at a time. Let $\bar{B}_j = \{1, \ldots, l\} \setminus B_j$. At each inner iteration we solve the following sub-problem:

$$\begin{aligned} \min_{\mathbf{d}} \quad & f(\boldsymbol{\alpha} + \mathbf{d}) \\ \text{subject to} \quad & \mathbf{d}_{\bar{B}_j} = \mathbf{0} \ \text{ and } \ 0 \le \alpha_i + d_i \le C, \ \forall i \in B_j. \end{aligned} \qquad (7)$$

That is, we update $\boldsymbol{\alpha}_{B_j}$ using the solution of (7), while fixing $\boldsymbol{\alpha}_{\bar{B}_j}$. Then, Algorithm 1 reduces to the standard block minimization procedure, so the convergence to the optimal function value of (3) holds [Bertsekas 1999, Proposition 2.7.1].

We must ensure that at each inner iteration, only one block of data is needed. With the constraint $\mathbf{d}_{\bar{B}_j} = \mathbf{0}$ in (7),

$$f(\boldsymbol{\alpha} + \mathbf{d}) = \frac{1}{2}\mathbf{d}_{B_j}^T Q_{B_j B_j} \mathbf{d}_{B_j} + (Q_{B_j,:}\boldsymbol{\alpha} - \mathbf{e}_{B_j})^T \mathbf{d}_{B_j} + f(\boldsymbol{\alpha}), \qquad (8)$$

where $Q_{B_j,:}$ is a sub-matrix of Q including the elements $Q_{ri}$, $r \in B_j$, $i = 1, \ldots, l$. Clearly, $Q_{B_j,:}$ in (8) involves all training data, a situation violating the requirement of Algorithm 1. Fortunately, by maintaining

$$\mathbf{w} \equiv \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i, \qquad (9)$$

we have

$$Q_{r,:}\boldsymbol{\alpha} - 1 = y_r \mathbf{w}^T \mathbf{x}_r - 1, \ \forall r \in B_j.$$

**ALGORITHM 2:** An implementation of Algorithm 1 for solving dual SVM

We only show details of steps 3.2 and 3.3:

3.2. Exactly or approximately solve the sub-problem (7) to obtain $\mathbf{d}^*_{B_j}$.

3.3. $\boldsymbol{\alpha}_{B_j} \leftarrow \boldsymbol{\alpha}_{B_j} + \mathbf{d}^*_{B_j}$.
     Update w by (10).

Therefore, if w is available in memory, only the instances associated with the block $B_j$ are needed. To maintain w, if $\mathbf{d}^*_{B_j}$ is an optimal solution of (7), we consider (9) and use

$$\mathbf{w} \leftarrow \mathbf{w} + \sum_{r \in B_j} d^*_r y_r \mathbf{x}_r. \qquad (10)$$

This operation again needs only the block $B_j$. The procedure is summarized in Algorithm 2.
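One inner iteration of Algorithm 2 can be sketched as follows. The sketch assumes L1 loss and dense NumPy arrays for readability; it follows the standard dual coordinate descent update of Hsieh et al. [2008] and is not the authors' actual LIBLINEAR-based implementation.

```python
import numpy as np

def block_dual_cd(X_block, y_block, alpha_block, w, C, n_passes=10):
    """Sketch of steps 3.2-3.3 of Algorithm 2: coordinate descent on
    sub-problem (7) for one loaded block, maintaining w as in (9)-(10)."""
    l_b = X_block.shape[0]
    Qii = (X_block * X_block).sum(axis=1)     # diagonal entries Q_ii
    for _ in range(n_passes):
        # random order within the block (see Section 3.1)
        for i in np.random.permutation(l_b):
            if Qii[i] == 0:
                continue
            G = y_block[i] * w.dot(X_block[i]) - 1.0   # gradient Q_{i,:}α - 1 via (9)
            d = min(max(alpha_block[i] - G / Qii[i], 0.0), C) - alpha_block[i]
            if d != 0.0:
                alpha_block[i] += d
                w += d * y_block[i] * X_block[i]       # incremental update (10)
    return alpha_block, w
```

Because w is kept in memory and updated incrementally, each call touches only the instances of the current block, as required by Algorithm 1.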

For solving the sub-problem (7), as all the information is available in memory, any bound-constrained optimization method can be applied. We consider LIBLINEAR [Fan et al. 2008], which implements a coordinate descent method (i.e., block minimization with a single element in each block). Then, Algorithm 2 becomes a two-level block minimization method. The two-level setting has been used before for SVM or other applications (e.g., [Memisevic 2006; Pérez-Cruz et al. 2004; Rüping 2000]), but ours might be the first to associate the inner level with memory and the outer level with disk.

Algorithm 2 converges if each sub-problem (7) is solved exactly. In practice, we often obtain an approximate solution by imposing a stopping criterion. We then address two issues:

1. The stopping criterion for solving the sub-problem must be satisfied after a finite number of operations, so we can move on to the next sub-problem.

2. We need to prove the convergence.

Next we show that these two issues can be resolved if LIBLINEAR is used for solving the sub-problem. Let $\{\boldsymbol{\alpha}^k\}$ be the sequence generated by Algorithm 2, where k is the index of outer iterations. Because each outer iteration contains m inner iterations, we can further consider a sequence

$$\{\boldsymbol{\alpha}^{k,j}\}_{k=1,\,j=1}^{\infty,\,m+1} \quad\text{with}\quad \boldsymbol{\alpha}^{k,1} = \boldsymbol{\alpha}^k \ \text{ and } \ \boldsymbol{\alpha}^{k,m+1} = \boldsymbol{\alpha}^{k+1}.$$

From $\boldsymbol{\alpha}^{k,j}$ to $\boldsymbol{\alpha}^{k,j+1}$, LIBLINEAR coordinate-wise updates the variables in $B_j$ to approximately solve the sub-problem (7), and we let $t_{k,j}$ be the number of updates.

If the coordinate descent updates satisfy certain conditions, we can prove the convergence of $\{\boldsymbol{\alpha}^{k,j}\}$:

THEOREM 3.1. *If a coordinate descent method is applied to solve (7) with the following properties:*

*1. each $\alpha_i$, $i \in B_j$, is updated at least once, and*
*2. $\{t_{k,j}\}$ is uniformly bounded,*

*then $\{\boldsymbol{\alpha}^{k,j}\}$ generated by Algorithm 2 globally converges to an optimal solution $\boldsymbol{\alpha}^*$. The convergence rate is at least linear: there are $0 < \mu < 1$ and an iteration $k_0$ such that*

$$f(\boldsymbol{\alpha}^{k+1}) - f(\boldsymbol{\alpha}^*) \le \mu \big( f(\boldsymbol{\alpha}^k) - f(\boldsymbol{\alpha}^*) \big), \ \forall k \ge k_0.$$

The proof is in Appendix A. With Theorem 3.1, condition 2 mentioned at the beginning of Section 2 holds. For condition 3 on the convergence speed, block minimization does not have fast convergence rates. However, for problems like document classification, some studies (e.g., [Hsieh et al. 2008]) have shown that we do not need many iterations to obtain a reasonable model. Though Hsieh et al. [2008] differ from us by restricting $|B| = 1$, we hope to enjoy the same property of not needing many iterations. Experiments in Section 8 confirm that this property holds for some document data.

Next we discuss various ways to fulfill the two properties in Theorem 3.1.

**3.1. Loosely Solving the Sub-problem**

A simple setting to satisfy Theorem 3.1's two properties is to go through all variables in $B_j$ a fixed number of times. Then not only is $t_{k,j}$ uniformly bounded, but the finite termination for solving each sub-problem also holds. A small number of passes through $B_j$ means that we loosely solve the sub-problem (7). While the cost per block is cheaper, the number of outer iterations may be large. Through experiments in Section 8, we discuss how the number of passes affects the running time. A special case is to go through all $\alpha_i$, $i \in B_j$, exactly once. Then Algorithm 2 becomes a standard (one-level) coordinate descent method, though data are loaded in a block-wise manner.

For each pass through the data in one block, we can sequentially update the variables in $B_j$. However, as mentioned in [Hsieh et al. 2008], using a random permutation of $B_j$'s elements as the update order usually leads to faster convergence in practice.

**3.2. Accurately Solving the Sub-problem**

Alternatively, we can accurately solve the sub-problem. The cost per inner iteration is higher, but the number of outer iterations may be reduced. As an upper bound on the number of iterations does not reveal how accurate the solution is, most optimization software considers gradient information in the stopping condition. We check the setting in LIBLINEAR. Its gradient-based stopping condition (details shown in Appendix B) guarantees finite termination in solving each sub-problem (7). Thus the procedure can move on to the next sub-problem without getting into an infinite loop.

Regarding the convergence, to use Theorem 3.1, we must show that $\{t_{k,j}\}$ is uniformly bounded:

THEOREM 3.2. *If coordinate descent steps with LIBLINEAR's stopping condition are used to solve (7), then Algorithm 2 either terminates in a finite number of outer iterations or*

$$t_{k,j} \le 2|B_j| \ \ \forall j \ \text{ after } k \text{ is large enough.}$$

Therefore, if LIBLINEAR is used to solve (7), Theorem 3.1 implies the convergence.

**4. SOLVING PRIMAL SVM BY PEGASOS FOR EACH BLOCK**

Instead of solving the dual problem, in this section we check whether the framework in Algorithm 1 can be used to solve the primal SVM. Because the primal variable w does not correspond to data instances, we cannot use a standard block minimization setting to obtain a sub-problem like (7). In contrast, existing stochastic gradient descent methods possess the nice property that at each step only certain data points are used. In this section, we study how Pegasos [Shalev-Shwartz et al. 2007] can be used to implement Algorithm 1.

Pegasos considers a scaled form of the primal SVM problem:

$$\min_{\mathbf{w}} \quad \frac{1}{2lC}\mathbf{w}^T\mathbf{w} + \frac{1}{l}\sum_{i=1}^{l} \max(1 - y_i\mathbf{w}^T\mathbf{x}_i,\, 0).$$

**ALGORITHM 3:** An implementation of Algorithm 1 for solving primal SVM. Each inner iteration is by Pegasos

1. Split {1, . . . , l} into B_1, . . . , B_m and store the data into m files accordingly.

2. t = 0 and initial w = 0.

3. For k = 1, 2, . . .

— For j = 1, . . . , m
  3.1. Find a partition of B_j: $B_j^1, \ldots, B_j^{\bar{r}}$.
  3.2. For r = 1, . . . , $\bar{r}$
    — Use $B_j^r$ as B to conduct the update (11)-(13).
    — t ← t + 1

At the tth update, Pegasos chooses a block of data B and updates the primal variable w by a stochastic gradient descent step:

$$\bar{\mathbf{w}} = \mathbf{w} - \eta_t \nabla_t, \qquad (11)$$

where $\eta_t = lC/t$ is the learning rate and $\nabla_t$ is the sub-gradient

$$\nabla_t = \frac{1}{lC}\mathbf{w} - \frac{1}{|B|}\sum_{i \in B^+} y_i \mathbf{x}_i, \qquad (12)$$

with $B^+ \equiv \{i \in B \mid y_i \mathbf{w}^T \mathbf{x}_i < 1\}$. Then, Pegasos obtains w by scaling $\bar{\mathbf{w}}$:

$$\mathbf{w} \leftarrow \min\left(1, \frac{\sqrt{lC}}{\|\bar{\mathbf{w}}\|}\right)\bar{\mathbf{w}}. \qquad (13)$$

Clearly, we can directly take $B_j$ in Algorithm 1 as the set B in the above update. Alternatively, we can conduct several Pegasos updates on a partition of $B_j$. Algorithm 3 gives the details of the procedure. Here we consider two settings for an inner iteration:

1. Using one Pegasos update on the whole block $B_j$.

2. Splitting $B_j$ into $|B_j|$ sets, where each one contains a single element of $B_j$, and then conducting $|B_j|$ Pegasos updates.

Different from the dual SVM, we should not solve the sub-problem of primal SVM accurately. Otherwise, the model will converge to a solution that only learns from the set B.

For the convergence, Pegasos is proved to converge if all instances $\{\mathbf{x}_1, \ldots, \mathbf{x}_l\}$ are used to update the model at each step; see [Shalev-Shwartz et al. 2007, Corollary 1]. It is also shown that the same convergence results hold in expectation if each update is conducted on a subset chosen i.i.d. from the entire data set; see [Shalev-Shwartz et al. 2007, Theorem 3]. Although Algorithm 3 is a special case of Pegasos, it splits the data into blocks $B_j$, ∀j, and updates on a subset of $B_j$ at a time. Therefore, we cannot directly apply the convergence proof. However, empirically we observe that Algorithm 3 converges without problems.
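The single Pegasos step (11)-(13) used in Algorithm 3 can be sketched as follows, with dense NumPy arrays; this is a sketch of the published update rule, not the authors' code.

```python
import numpy as np

def pegasos_block_update(w, X_block, y_block, t, l, C):
    """One Pegasos step (11)-(13) on a block B (sketch). t is the global
    update counter starting from 1; l is the total number of instances."""
    eta = l * C / t                              # learning rate in (11)
    margins = y_block * (X_block @ w)
    B_plus = margins < 1                         # violating set B+ in (12)
    grad = w / (l * C)                           # first term of (12)
    if B_plus.any():
        grad -= (y_block[B_plus, None] * X_block[B_plus]).sum(axis=0) / len(y_block)
    w_bar = w - eta * grad                       # gradient step (11)
    norm = np.linalg.norm(w_bar)
    if norm > 0:
        w_bar *= min(1.0, np.sqrt(l * C) / norm) # projection (13)
    return w_bar
```

Calling this with each loaded sub-block of $B_j$ (and incrementing t) yields the inner iteration of Algorithm 3.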

**5. TECHNIQUES TO REDUCE THE TRAINING TIME**

Many techniques have been proposed to make block minimization faster. However, these techniques may not be suitable here as they are designed by assuming that all data are in memory. Based on the complexity analysis in (6), in this section we propose three techniques to speed up Algorithm 1. One technique effectively shortens Td(|B|), while the other two aim at reducing the number of iterations.

**ALGORITHM 4:** Splitting data into blocks

1. Decide m and create m empty files.

2. For i = 1, . . . , l

  2.1. Convert $\mathbf{x}_i$ to a binary format $\bar{\mathbf{x}}_i$.

  2.2. Randomly choose a number j ∈ {1, . . . , m}.

  2.3. Append $\bar{\mathbf{x}}_i$ to the end of the jth file.

**5.1. Data Compression**

The loading time $T_d(|B|)$ is a bottleneck of Algorithm 1 due to the slow disk access. Except for some initial cost, $T_d(|B|)$ is proportional to the length of the data. Hence, we can consider a compression strategy to reduce the loading time of each block. However, this strategy introduces two additional costs: the compression time at the beginning of Algorithm 1 and the decompression time when a block is loaded. The former is minor, as we only do it once. For the latter, we must ensure that the loading time saved is more than the decompression time. The balance between compression speed and ratio has been well studied in the area of backup and networking tools [Morse 2005]. We choose the widely used compression library zlib for our implementation.^{6} Experiments in Section 8 show that if the reading speed of the disk is slow, the compression strategy effectively reduces the training time.

Because compression is used, all blocks are stored in a binary format instead of plain text.

**5.2. Random Permutation of Sub-problems**

In Algorithm 1, we sequentially work on blocks B1, B2, . . ., Bm. We can consider other ways such as a permutation of blocks to decide the order of sub-problems. In LIBLINEAR’s coordinate descent implementation, the authors randomly permute all variables at each iteration and report faster convergence. We adopt a permutation strategy here as the loading time is similar regardless of the order of sub-problems.

**5.3. Split of Data**

An important step of Algorithm 1 is to split the training data into m files. We need a careful design, as the data cannot be loaded into memory. To begin, we find the size of the data and decide the value m based on the memory size. This step does not have to go through the whole data set, because the operating system provides information such as file sizes. Then we can sequentially read data instances and save them to the m files. However, data of the same class are often stored together in the training set, so we may get a block of data with the same label. This situation clearly causes slow convergence. Thus for each instance being read, we randomly decide which file it should be saved to. Algorithm 4 summarizes our procedure. It goes through the data only once.

**6. OTHER FUNCTIONALITY**

A learning system only able to solve an optimization problem (3) is not practically use- ful. Other functions such as multi-class classification or cross validation (for parameter selection) are very important. We discuss how to implement these functions based on the design in Section 2.

**6.1. Multi-class Classification**

Existing multi-class approaches either train several two-class problems (e.g., one-against-one and one-against-the-rest) or solve one single optimization problem (e.g.,

6http://www.zlib.net

**ALGORITHM 5:** A block minimization framework for the one-against-the-rest multi-class approach. We assume the K class labels are 1, . . . , K.

1. Split {1, . . . , l} into B_1, . . . , B_m, and store the data into m files accordingly.

2. Set initial α^1, . . . , α^K and w^1, . . . , w^K, where K is the number of classes.

3. For k = 1, 2, . . . (outer iteration)

— For j = 1, . . . , m (inner iteration)
  3.1. Read x_r, ∀r ∈ B_j from disk.
  3.2. For t = 1, . . . , K
    — Use $B_j^t \equiv \{x_r \mid r \in B_j \text{ and } y_r = t\}$ as positive data and $B_j \setminus B_j^t$ as negative data to conduct the update on α^t and w^t.

[Crammer and Singer 2000]). Take the one-against-the-rest approach for a K-class problem as an example. We train K classifiers, where each one separates a class from the rest. If we sequentially train the K models, the disk access time is K times more. To save the disk access time, a more complicated implementation is to train the K models together. We split each block $B_j$ into $B_j^1, \ldots, B_j^K$ according to the class information. Then we solve K sub-problems simultaneously; that is, we use $B_j^t$ as positive data and $B_j \setminus B_j^t$ as negative data to update the vectors $\mathbf{w}^t$ and $\boldsymbol{\alpha}^t$. The details are in Algorithm 5. The one-against-one approach is less suitable, as it needs K(K − 1)/2 vectors for w, which may consume too much memory. For one-against-the-rest and the approach in [Crammer and Singer 2000], both need only K vectors.
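The key point of Algorithm 5, updating all K models while one block sits in memory, can be sketched as below. The binary `update` callback is a hypothetical stand-in for whichever in-memory learner is used (e.g., the dual solver of Section 3).

```python
import numpy as np

def one_vs_rest_block_pass(X_block, y_block, W, update, classes):
    """Sketch of one inner iteration of Algorithm 5: with block B_j in
    memory, update every one-vs-rest model before loading the next
    block, so each block is read from disk once instead of K times."""
    for t, cls in enumerate(classes):
        # relabel: B_j^t is positive, B_j \ B_j^t is negative
        y_binary = np.where(y_block == cls, 1, -1)
        W[t] = update(W[t], X_block, y_binary)
    return W
```

The memory cost beyond one block is the K weight vectors, which is why one-against-one (needing K(K − 1)/2 vectors) fits this scheme less well.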

**6.2. Cross Validation**

Assume we conduct v-fold cross validation. Due to the use of m blocks, a straightforward implementation is to split the m blocks into v groups. Each time, one group of blocks is used for validation, while all remaining groups are for training. Similar to the situation in multi-class classification, the loading time is then v times more than training a single model. To save the disk access time, a more complicated implementation is to train the v models together. For example, if v = 3, we split each block $B_j$ into three parts $B_j^1$, $B_j^2$, and $B_j^3$. Then $\cup_{j=1}^{m}(B_j^1 \cup B_j^2)$ is the training set used to validate $\cup_{j=1}^{m} B_j^3$. We maintain three vectors $\mathbf{w}^1$, $\mathbf{w}^2$, and $\mathbf{w}^3$. Each time $B_j$ is loaded, we solve three sub-problems to update the w vectors. This implementation effectively saves the data loading time, but the memory must be enough to store the v vectors $\mathbf{w}^1, \ldots, \mathbf{w}^v$. The algorithm for cross validation is similar to Algorithm 5 for multi-class classification.
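The shared-loading idea for cross validation can be sketched in the same style as the multi-class case. Here `fold_ids` assigns each instance of the loaded block to one of the v folds, and `update` is again a hypothetical in-memory learner.

```python
import numpy as np

def cv_block_pass(X_block, y_block, fold_ids, W, update, v):
    """Sketch of Section 6.2: with block B_j in memory, update all v
    cross-validation models. Model w^s trains on every fold except s,
    so one disk read of B_j serves all v models."""
    for s in range(v):
        mask = fold_ids != s                    # training part for model s
        W[s] = update(W[s], X_block[mask], y_block[mask])
    return W
```

Validation of fold s then only needs one more sequential pass over the blocks, scoring the instances with `fold_ids == s` against $\mathbf{w}^s$.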

**6.3. Incremental/ Decremental Setting**

Many practical applications retrain a model after collecting enough new data. Our approach can be extended to this scenario. We make the reasonable assumption that each time several blocks are added or removed. Using LIBLINEAR to solve the dual form as an example, to possibly reduce the number of iterations, we can reuse the vector w obtained earlier. Algorithm 2 maintains $\mathbf{w} = \sum_{i=1}^{l} y_i \alpha_i \mathbf{x}_i$, so the new initial w can be

$$\mathbf{w} \leftarrow \mathbf{w} + \sum_{i:\ \mathbf{x}_i \text{ being added}} y_i \alpha_i \mathbf{x}_i \ - \sum_{i:\ \mathbf{x}_i \text{ being removed}} y_i \alpha_i \mathbf{x}_i. \qquad (14)$$

For data being added, $\alpha_i$ is simply set to zero, but for data being removed, their corresponding $\alpha_i$ are not otherwise available. To use (14), we must store α. That is, before and after solving each sub-problem, Algorithm 2 reads and saves α from/to disk.

If solving the primal problem by Pegasos for each block, Algorithm 3 can be directly applied to incremental or decremental settings.
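The warm-start adjustment (14) amounts to a few vector operations. In the sketch below, `added` and `removed` are hypothetical lists of (α_i, y_i, x_i) triples; added data typically enter with α_i = 0, so in practice only the removed instances change w.

```python
import numpy as np

def warm_start_w(w, added, removed):
    """Sketch of eq. (14): adjust the maintained w of Algorithm 2 when
    blocks are added or removed, so retraining can start from a warm w.
    Each entry is a triple (alpha_i, y_i, x_i) with x_i a numpy array."""
    for a, y, x in added:       # new data: usually alpha_i = 0
        w = w + a * y * x
    for a, y, x in removed:     # removed data: needs the stored alpha_i
        w = w - a * y * x
    return w
```

This is why Algorithm 2 must persist α alongside the block files: without the stored α_i of removed instances, (14) cannot be evaluated.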

Table I: Data statistics: We assume a sparse storage. Each non-zero feature value needs 12 bytes (4 bytes for the feature index, and 8 bytes for the value). However, this 12-byte structure consumes 16 bytes on a 64-bit machine due to data structure alignment.

| Data set | l | n | #nonzeros | Memory (Bytes) |
|---|---:|---:|---:|---:|
| yahoo-korea | 460,554 | 3,052,939 | 156,436,656 | 2,502,986,496 |
| kddcup10 | 29,890,095 | 19,264,093 | 566,345,790 | 9,061,532,640 |
| webspam | 350,000 | 16,609,143 | 1,304,697,446 | 20,875,159,136 |
| epsilon | 500,000 | 2,000 | 1,000,000,000 | 16,000,000,000 |

**7. RELATED APPROACHES FOR LARGE-SCALE DATA**

In this section, we discuss related approaches for training linear classifiers when data cannot fit in the memory. The comparisons between these approaches and our block minimization framework are in Section 8.3.

**7.1. Data Sub-sampling**

In many cases, sub-sampling training data does not downgrade the prediction accuracy much. Therefore, by using only a portion of the training data that can fit into the mem- ory, we can employ standard training techniques. This approach usually works when the data quality is good. However, in some situations, dealing with a huge data set may still be necessary. For example, Internet companies collect a huge amount of web log and train data on a distributed learning system. In Section 8.3, we demonstrate the relationship between testing performance and sub-sampling size.

**7.2. Averaging Models Trained on Subsets of Data**

Bagging [Breiman 1996] is a classical classification method. In the training phase, a bagging method randomly draws m subsets of samples from the entire data set. Then, it trains m models w_1, . . . , w_m, one per subset. In the testing phase, the prediction for a testing instance is based on the decisions of the m models. If each subset can be stored in memory, then training is efficient. Similar to the block generation in our framework, this method needs to form the subsets in the beginning.

Although a bagging method can sometimes achieve an accurate model [Zinkevich et al. 2010], its solution is not the same as the model obtained by solving (2). In Section 8.3, we compare the proposed block minimization framework with a bagging method that averages the m models trained on B_j, ∀j.
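The averaging step can be sketched as follows. This is an illustrative implementation, not the experimental code, and it assumes each model is a plain list of weights:

```python
def average_models(models):
    """Average m weight vectors w_1, ..., w_m trained on separate blocks."""
    m, n = len(models), len(models[0])
    return [sum(w[i] for w in models) / m for i in range(n)]

def predict(w, x):
    """Linear prediction: sign of the inner product w^T x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

Each of the m models can be trained independently (even on different machines), which is why this approach parallelizes easily, at the cost of not solving (2) exactly.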

**7.3. Online Learning Approaches**

Online methods can easily deal with large-scale data. An online learning algorithm loads a single data point at a time and avoids storing the whole data set in memory. In the following, we discuss an online learning package, Vowpal Wabbit [Langford et al. 2007].

Vowpal Wabbit provides several choices of loss functions. Here we take the hinge loss

l(w; y, x) = max(0, 1 − y w^{T}x)

as an example, where y is the true label, w is the model and x is an instance. When Vowpal Wabbit faces a new instance x, it updates w along the sub-gradient descent direction. Vowpal Wabbit supports passing over the data several times. During the first pass, Vowpal Wabbit saves the data points into a cache file. This is similar to our data compression strategy discussed in Section 5.1. In Section 8.3, we compare Vowpal Wabbit with the proposed block minimization framework.
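A minimal sketch of one such sub-gradient step on the hinge loss is given below. The fixed step size `eta` and the optional L2 shrinkage `lam` are simplifying assumptions for illustration; Vowpal Wabbit's actual learning-rate schedule differs:

```python
def hinge_subgradient_update(w, x, y, eta, lam=0.0):
    """One online sub-gradient step for l(w; y, x) = max(0, 1 - y w^T x).

    If the margin y * w^T x is below 1, a sub-gradient of the hinge
    loss is -y * x; otherwise the loss term contributes zero.
    lam adds optional L2 shrinkage (set to 0 for the pure loss update).
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    g_loss = [-y * xi for xi in x] if margin < 1 else [0.0] * len(x)
    return [(1 - eta * lam) * wi - eta * gi for wi, gi in zip(w, g_loss)]
```

Since each update touches only one instance, memory usage is independent of the number of training examples.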

Table II: Number of blocks and the initial time to split and compress data into blocks. Time is in seconds.

| Data set | #Blocks | Initial time |
|---|---|---|
| yahoo-korea | 5 | 228 |
| kddcup10 | 40 | 842 |
| webspam | 40 | 1,594 |
| epsilon | 30 | 1,237 |

**8. EXPERIMENTS**

In this section, we first conduct experiments to analyze the performance of the proposed approach. Then we investigate several implementation issues discussed in Section 5. Finally, we compare the proposed block minimization framework with other approaches for solving linear classification problems when the data size is beyond the memory capacity.

We consider two document data sets, yahoo-korea^{7} and webspam, an artificial data set, epsilon,^{8} and an education data set, kddcup10, from a data mining challenge.^{9} Table I summarizes the data statistics.

Except for kddcup10, we randomly split each data set so that 4/5 is used for training and 1/5 for testing, and all feature vectors are instance-wisely scaled to unit length (i.e., ||x^{i}|| = 1, ∀i). For epsilon, each feature of the training set is normalized to have mean zero and variance one, and the testing set is modified according to the same scaling factors. This feature-wise scaling is conducted before the instance-wise scaling. For the kddcup10 data set, we use the same training/testing split as Yu et al. [2010], and we do not further scale the feature vectors.
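The two preprocessing steps above can be sketched as follows. This is an illustrative pure-Python version of the described scaling, not the actual preprocessing scripts:

```python
import math

def standardize_columns(X):
    """Feature-wise scaling: zero mean, unit variance per column
    (as done for epsilon). Returns the scaled data plus the factors
    so the testing set can be transformed identically."""
    n = len(X[0])
    means = [sum(row[j] for row in X) / len(X) for j in range(n)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / len(X)) or 1.0
            for j in range(n)]  # guard against constant columns
    scaled = [[(row[j] - means[j]) / stds[j] for j in range(n)] for row in X]
    return scaled, means, stds

def unit_length_rows(X):
    """Instance-wise scaling: divide each x_i by its Euclidean norm."""
    out = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out
```

Note the order matters: the feature-wise standardization is applied first, then each instance is scaled to unit length.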

We conduct experiments on a 64-bit machine with 1GB RAM. Due to the space consumed by the operating system, the real memory capacity we can use is 895MB. The reading speed of the disk is 102.36 MB/sec.^{10}

**8.1. Comparison of Sub-problem Solvers**

In this section, we compare the various settings introduced in Sections 3–4 for operations on a block of data. The value C in (2) is set to one.

— BLOCK-L-N: Algorithm 2 with LIBLINEAR to solve each sub-problem. LIBLINEAR goes through the block of data N rounds, where we consider N = 1, 10, and 20.

— BLOCK-L-D: Algorithm 2 with LIBLINEAR to solve each sub-problem. LIBLINEAR's default stopping condition is adopted.

— BLOCK-P-B: Algorithm 3 with r̄ = 1. That is, we apply one Pegasos update on the whole block.

— BLOCK-P-I: Algorithm 3 with r̄ = |B^{j}|. That is, we apply |B^{j}| Pegasos updates, each of which uses an individual data instance.
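As a rough illustration of the difference between BLOCK-P-B and BLOCK-P-I, the following sketch applies one Pegasos-style step to a whole block; BLOCK-P-I would instead call it once per single-instance "block", advancing t each time. The projection step of Pegasos is omitted, so this is a simplification of Algorithm 3, not its exact implementation:

```python
def pegasos_block_update(w, block, lam, t):
    """One Pegasos-style step on a block of (x, y) pairs.

    eta = 1 / (lam * t) is the Pegasos step size; the sub-gradient is
    taken over the average hinge loss of the block. The projection
    onto the ball of radius 1/sqrt(lam) is omitted in this sketch.
    """
    eta = 1.0 / (lam * t)
    g = [0.0] * len(w)
    for x, y in block:
        # only margin-violating instances contribute to the sub-gradient
        if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:
            for i, xi in enumerate(x):
                g[i] -= y * xi
    return [(1 - eta * lam) * wi - eta * gi / len(block)
            for wi, gi in zip(w, g)]
```

BLOCK-P-B makes a single such coarse update per loaded block, which explains its slow convergence; BLOCK-P-I extracts |B^{j}| finer updates from the same disk read.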

When data are larger than the memory capacity, training a model by LIBLINEAR on the entire data set suffers from severe disk swapping [Yu et al. 2010]. Therefore, we omit the results of LIBLINEAR and only show the comparison between methods under the block minimization framework. We make sure that no other jobs are running on the same machine and report wall clock time in all experiments. We include all data loading time and the initial time to split and compress data into blocks. Table II lists the number of blocks and the initial time.

^{7} This data set is not publicly available.

^{8} webspam and epsilon can be downloaded at http://largescale.first.fraunhofer.de/instructions/.

^{9} We use the second data set, bridge to algebra 2008 2009, in KDD Cup 2010. The preprocessed version can be downloaded at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

^{10} The reading speed of the disk is given by the program hdparm under the Linux environment.

Fig. 2: (a) yahoo-korea, (b) webspam, (c) epsilon, (d) kddcup10. This figure shows the relative function value difference to the minimum. Time (in seconds) is log-scaled. The blue dotted vertical line indicates the time spent by Algorithm 1-based methods for the initial split of data into blocks.

We are interested in both how fast the methods reduce the objective function value of (2) and how quickly they obtain a reasonable model. Figures 2 and 3 present two results:

1. Training time versus the relative difference to the optimum,

(f^{P}(w) − f^{P}(w^{∗})) / f^{P}(w^{∗}),

where f^{P} is the primal objective function in (2) and w^{∗} is the optimal solution. Since w^{∗} is not really available, we spend enough training time to get a reference solution.

2. Training time versus the difference to the best testing accuracy,

(acc^{∗} − acc(w)) × 100%,

where acc(w) is the testing accuracy using the model w and acc^{∗} is the best testing accuracy among all methods.

Fig. 3: (a) yahoo-korea, (b) webspam, (c) epsilon, (d) kddcup10. This figure shows the accuracy difference to the best testing accuracy. Time (in seconds) is log-scaled. The blue dotted vertical line indicates the time spent by Algorithm 1-based methods for the initial split of data into blocks. Note that in Figure 3(c), the curve of BLOCK-L-D is not connected; the missing point corresponds to the best accuracy.

From Figure 3, BLOCK-L-∗ methods (using LIBLINEAR) are faster than BLOCK-P-∗ methods (using Pegasos) in most cases. One possible reason is that BLOCK-P-∗ underutilizes the information in each block. In particular, BLOCK-P-B suffers from very slow convergence, as for each block it conducts only one very simple update.

However, it is not always necessary to use each block of data exhaustively. For example, in Figures 2(a) and 2(d), BLOCK-L-1 (for each block, LIBLINEAR goes through all data only once) is slightly faster than BLOCK-L-D (for each block, LIBLINEAR runs with the default stopping condition). Nevertheless, as reading each block from the disk is expensive, in general we should make proper efforts to utilize it.

The numbers of instances and features in kddcup10 are very large. In such a situation, all the methods converge slowly; see Figure 2(d). To store both w and α, BLOCK-L-∗ methods require 400MB of memory, while BLOCK-P-∗ methods need only 160MB to store the model w. Because it uses less memory, BLOCK-P-I is more likely to keep the model w in a higher level of the memory hierarchy such as the L2 cache. Therefore, it performs as well as BLOCK-L-∗ in this case. The results also indicate that when determining the number of blocks in Algorithm 1, both l (the size of α) and n (the size of w) need to be taken into consideration.

Fig. 4: Effectiveness of two implementation techniques: raw: no random assignment in the initial data splitting; perm: a random order of blocks at each outer iteration. BLOCK-L-D is used.

Fig. 5: Convergence speed of using different m (number of blocks). BLOCK-L-D is used.

Note that the objective values of BLOCK-P-∗ may not be decreasing, as Pegasos does not have this property. All BLOCK-∗ methods except BLOCK-P-B need around four outer iterations to achieve reasonable accuracy values. This number of iterations is small, so we do not need to read the training set many times.

**8.2. Investigating Some Implementation Issues**

In Section 5, we discussed several implementation issues for the block minimization framework. In the following, we demonstrate that a careful implementation can further speed up the block minimization methods.

*8.2.1. Initial Block Split and Random Permutation of Sub-problems.* Section 5.3 proposes randomly assigning data to blocks at the beginning of Algorithm 1. It also suggests that a random order of B_1, . . . , B_m at each iteration is useful. We are interested in their effectiveness. Figure 4 presents the result of running BLOCK-L-D on webspam. We assume the worst situation, in which data of the same class are grouped together in the input file. If data are not randomly split into blocks, the convergence is clearly very slow. Further, the random permutation of blocks at each iteration slightly improves the training time.

*8.2.2. Block Size.* In Figure 5, we present the training speed of BLOCK-L-D using various block sizes (equivalently, numbers of blocks). The data set webspam is considered. The training time of using m = 40 blocks is smaller than that of m = 400 or 1,000. This result is consistent with the discussion in Section 2. When the number of blocks is smaller (i.e., the block size is larger), from (5), the cost of operations on each block increases. However, as we read fewer files, the total time is shorter. Furthermore, the initial split time grows as m increases. Therefore, contrary to traditional SVM software, which uses small block sizes, for each inner iteration we should now consider a large block. In Figure 5, we do not check m = 20 because the memory is not enough to load a block of data.

*8.2.3. Data Compression.* We check whether compressing each block saves time. Running 10 outer iterations of BLOCK-L-D on the training set of webspam with m = 40, the implementation with compression takes 3,230 seconds, while the one without compression needs 4,660 seconds. Thus, the compression technique is very useful in this case.

In practice, the data loading time depends heavily on the disk reading speed. If the disk reading speed is high, the data compression technique may even slow down the training process.
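The compression strategy can be sketched with zlib; the library and compression level used in the actual implementation may differ, and a low level is chosen here on the assumption that decompression speed matters more than the ratio:

```python
import zlib

def write_block(path, lines, level=1):
    """Compress a block of data lines before writing it to disk.

    A low compression level trades ratio for speed, which fits the
    goal of reducing disk I/O without a large CPU overhead.
    """
    raw = "\n".join(lines).encode("utf-8")
    with open(path, "wb") as f:
        f.write(zlib.compress(raw, level))

def read_block(path):
    """Read a compressed block back and split it into lines."""
    with open(path, "rb") as f:
        return zlib.decompress(f.read()).decode("utf-8").split("\n")
```

Whether this pays off depends on the ratio between disk throughput and decompression speed, as noted above.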

**8.3. Comparisons with Other Methods for Large-scale Data**

In Section 7, we discussed existing approaches for training large-scale data. In this section, we first show that the sub-sampling strategy may degrade the performance on the data sets we used. Then, we compare the proposed block minimization methods with other approaches for large-scale data.

To compare with random sub-sampling methods, we uniformly shuffle each data set and train (2) by LIBLINEAR on subsets of different sizes. Figure 6 presents the performance of the models trained on the subsets. The results show that for our four data sets, using only a portion of the data that can fit in memory may not be enough to obtain as accurate a model as using the entire data set. Therefore, in this situation, a method that considers the whole data set is still useful.

Next, we compare the following approaches which are able to train the entire data set:

— BLOCK-L-10: This is one of the methods used in Section 8.1.

— AvgBlock: An approach introduced in Section 7.2. We average the models trained by LIBLINEAR with the default stopping condition on each block of data B_j, j = 1, . . . , m. Although AvgBlock can be trained on a distributed system with multiple machines, here we consider running it on a single computer.

— Vowpal Wabbit: An online method mentioned in Section 7.3. The package (version 4.1) is available online at http://hunch.net/~vw/. We set the parameters as -l 1000 -b 24 --initial_t 80000 --power_t 0.5 --loss_function hinge and ensure this setting performs better than the default parameters.

We only report the results of BLOCK-L-10, as it is the most stable among all the settings used in Section 8.1 for the block minimization method. Note that the above methods do not minimize the same objective function. Therefore, their final models may give slightly different testing accuracy values. For BLOCK-L-10 and AvgBlock, we select the parameter C in (2) by five-fold cross validation on the training set, and report their performances on the testing set.

We are interested in both whether BLOCK-L-10 and Vowpal Wabbit can obtain a reasonable model quickly and how fast their final convergence is. Therefore, we show in Table III the results after running the first iteration and the tenth iteration.

To show more details, we plot the testing performance along the training time in Figure 7. We omit the results of AvgBlock in Figure 7, as it cannot be conducted in an iterative manner.

The results indicate that BLOCK-L-10 efficiently obtains an accurate model. With only one outer iteration, BLOCK-L-10 obtains a reasonable model. After running 10 iterations, BLOCK-L-10 achieves an accuracy value that is almost the same as that of the final model. Vowpal Wabbit takes less training time; however, as it solves an unregularized problem instead of problem (2), it sometimes converges to a model with lower testing accuracy. On some data sets, such as kddcup10, AvgBlock achieves accuracy values similar to BLOCK-L-10. However, on other data sets, BLOCK-L-10 is better because it solves (2) using all the training data.

Table III: Training time and testing accuracy after the first and the tenth outer iterations. Time is in seconds. For BLOCK-L-10 and AvgBlock, the time for the initial split of data is included.

| method | #iter | yahoo-korea (C = 4) acc. / time | kddcup10 (C = 0.1) acc. / time | webspam (C = 64) acc. / time | epsilon (C = 1) acc. / time |
|---|---|---|---|---|---|
| BLOCK-L-10 | 1 | 85.97 / 259 | 88.49 / 862 | 99.32 / 1,944 | 89.12 / 1,802 |
| BLOCK-L-10 | 10 | 87.29 / 456 | 89.89 / 3,153 | 99.51 / 4,475 | 89.78 / 3,773 |
| Vowpal Wabbit | 1 | 68.34 / 111 | 82.65 / 816 | 98.14 / 984 | 89.40 / 1,633 |
| Vowpal Wabbit | 10 | 68.39 / 242 | 78.83 / 1,339 | 98.84 / 1,857 | 89.75 / 2,147 |
| AvgBlock | 1 | 86.08 / 628 | 89.64 / 6,809 | 98.40 / 4,722 | 88.83 / 1,999 |

Fig. 6: Data size versus the difference to the best testing accuracy. The marker indicates the size of the largest subset that can fit in memory. The results show that training only sub-sampled data may not be enough to achieve the best testing performance.

**9. DISCUSSION AND CONCLUSIONS**

The discussion in Section 6 shows that implementing cross validation or multi-class classification may require extra memory space and some modifications of Algorithm 1. Thus, constructing a complete learning tool is certainly more complicated than implementing Algorithm 1. There remain many new and challenging research issues.

In summary, we propose and analyze a block minimization method for large linear classification when data cannot fit in memory. Experiments show that the proposed method can effectively handle data 20 times larger than the memory size.

Our code is available at

http://www.csie.ntu.edu.tw/~cjlin/liblinear/exp.html
**REFERENCES**

BERTSEKAS, D. P. 1999. Nonlinear Programming, 2nd Ed. Athena Scientific, Belmont, MA.

BOTTOU, L. 2007. Stochastic gradient descent examples. http://leon.bottou.org/projects/sgd.

Fig. 7: (a) yahoo-korea, (b) webspam, (c) epsilon, (d) kddcup10. This figure shows testing accuracy versus training time. Time (in seconds) is log-scaled. The blue dotted vertical line indicates the time spent by Algorithm 1-based methods for the initial split of data into blocks.

BREIMAN, L. 1996. Bagging predictors. Machine Learning 24, 2, 123–140.

CHANG, C.-C. AND LIN, C.-J. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

CHANG, E., ZHU, K., WANG, H., BAI, H., LI, J., QIU, Z., AND CUI, H. 2008. Parallelizing support vector machines on distributed computers. In Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. MIT Press, Cambridge, MA, 257–264.

CRAMMER, K. AND SINGER, Y. 2000. On the learnability and design of output codes for multiclass problems. In Computational Learning Theory. 35–46.

FAN, R.-E., CHANG, K.-W., HSIEH, C.-J., WANG, X.-R., AND LIN, C.-J. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874.

FERRIS, M. AND MUNSON, T. 2003. Interior point methods for massive support vector machines. SIAM Journal on Optimization 13, 3, 783–804.

HSIEH, C.-J., CHANG, K.-W., LIN, C.-J., KEERTHI, S. S., AND SUNDARARAJAN, S. 2008. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML).

JOACHIMS, T. 1998. Making large-scale SVM learning practical. In Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. MIT Press, Cambridge, MA, 169–184.

JOACHIMS, T. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

LANGFORD, J., LI, L., AND STREHL, A. 2007. Vowpal Wabbit. http://hunch.net/~vw.

LANGFORD, J., LI, L., AND ZHANG, T. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research 10, 771–801.

LANGFORD, J., SMOLA, A. J., AND ZINKEVICH, M. 2009. Slow learners are fast. In Advances in Neural Information Processing Systems.

LUO, Z.-Q. AND TSENG, P. 1992. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications 72, 1, 7–35.

MEMISEVIC, R. 2006. Dual optimization of conditional probability models. Tech. rep., Department of Computer Science, University of Toronto.

MORSE, JR., K. G. 2005. Compression tools compared. Linux Journal.

PÉREZ-CRUZ, F., FIGUEIRAS-VIDAL, A. R., AND ARTÉS-RODRÍGUEZ, A. 2004. Double chunking for solving SVMs for very large datasets. In Proceedings of Learning 2004, Spain.

RÜPING, S. 2000. mySVM – another one of those support vector machines. Software available at http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/.

SHALEV-SHWARTZ, S., SINGER, Y., AND SREBRO, N. 2007. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML).

YU, H., YANG, J., AND HAN, J. 2003. Classifying large data sets using SVMs with hierarchical clusters. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, USA, 306–315.

YU, H.-F., HSIEH, C.-J., CHANG, K.-W., AND LIN, C.-J. 2010. Large linear classification when data cannot fit in memory. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

YU, H.-F., LO, H.-Y., HSIEH, H.-P., LOU, J.-K., MCKENZIE, T. G., CHOU, J.-W., CHUNG, P.-H., HO, C.-H., CHANG, C.-F., WEI, Y.-H., WENG, J.-Y., YAN, E.-S., CHANG, C.-W., KUO, T.-T., LO, Y.-C., CHANG, P. T., PO, C., WANG, C.-Y., HUANG, Y.-H., HUNG, C.-W., RUAN, Y.-X., LIN, Y.-S., LIN, S.-D., LIN, H.-T., AND LIN, C.-J. 2010. Feature engineering and classifier ensemble for KDD Cup 2010. Tech. rep., Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

ZHU, Z. A., CHEN, W., WANG, G., ZHU, C., AND CHEN, Z. 2009. P-packSVM: Parallel primal gradient descent kernel SVM. In Proceedings of the IEEE International Conference on Data Mining.

ZINKEVICH, M., WEIMER, M., SMOLA, A., AND LI, L. 2010. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds. 2595–2603.

Received ??; revised ??; accepted ??

**A. PROOF OF THEOREM 3.1**

If each sub-problem involves a finite number of coordinate descent updates, then Algorithm 1 can be regarded as a coordinate descent method. We apply Theorem 2.1 of Luo and Tseng [1992] to obtain the convergence results. The theorem requires that (3) satisfies certain conditions and that, in the coordinate descent method, there is an integer t such that every α_i is iterated at least once in every t successive updates (called the almost cyclic rule in Luo and Tseng [1992]). Following the same analysis as in the proof of Hsieh et al. [2008, Theorem 1], (3) satisfies the required conditions. Moreover, the two properties on t_{j,k} imply the almost cyclic rule. Hence, both global and linear convergence results are obtained.

**B. PROOF OF THEOREM 3.2**

To begin, we discuss the stopping condition of LIBLINEAR. Each run of LIBLINEAR to solve a sub-problem generates {α^{k,j,v} | v = 1, . . . , t_{k,j} + 1} with

α^{k,j} = α^{k,j,1} and α^{k,j+1} = α^{k,j,t_{k,j}+1}.

We further let i_{j,v} denote the index of the variable being updated by α^{k,j,v+1} = α^{k,j,v} + d^{∗} e_{i_{j,v}}, where d^{∗} is the optimal solution of

min_{d} f(α^{k,j,v} + d e_{i_{j,v}}) subject to 0 ≤ α^{k,j,v}_{i_{j,v}} + d ≤ C, (15)

and e_{i_{j,v}} is an indicator vector for the (i_{j,v})th element. All t_{k,j} updates can be further separated into several rounds, where each round goes through all elements in B_j. LIBLINEAR checks the following stopping condition at the end of each round:

max_{v ∈ a round} ∇^{P}_{i_{j,v}} f(α^{k,j,v}) − min_{v ∈ a round} ∇^{P}_{i_{j,v}} f(α^{k,j,v}) ≤ ǫ, (16)

where ǫ is a tolerance and ∇^{P} f(α) is the projected gradient:

∇^{P}_{i} f(α) = ∇_{i} f(α) if 0 < α_{i} < C,
               max(0, ∇_{i} f(α)) if α_{i} = C,
               min(0, ∇_{i} f(α)) if α_{i} = 0. (17)

The reason that LIBLINEAR considers (16) is that, from the optimality condition, α^{∗} is optimal if and only if ∇^{P} f(α^{∗}) = 0.

Next, we prove the theorem by showing that for all j = 1, . . . , m there exists k_j such that

∀k ≥ k_j, t_{k,j} ≤ 2|B_j|. (18)

Suppose that (18) does not hold. Then we can find a j and a sub-sequence R ⊂ {1, 2, . . .} such that

t_{k,j} > 2|B_j|, ∀k ∈ R. (19)

Since {α^{k,j} | k ∈ R} are in a compact set, we further consider a sub-sequence M ⊂ R such that {α^{k,j} | k ∈ M} converges to a limit point ᾱ.

Let σ ≡ min_{i} Q_{ii}. Following the explanation in Hsieh et al. [2008, Theorem 1], we only need to analyze indices with Q_{ii} > 0. Therefore, σ > 0. Lemma 2 of Hsieh et al. [2008] shows that

f(α^{k,j,v}) − f(α^{k,j,v+1}) ≥ (σ/2) ‖α^{k,j,v} − α^{k,j,v+1}‖^{2}, ∀v = 1, . . . , 2|B_j|. (20)

The sequence {f(α^{k}) | k = 1, . . .} is decreasing and bounded below, as the feasible region is compact. Hence,

lim_{k→∞} f(α^{k,j,v}) − f(α^{k,j,v+1}) = 0, ∀v = 1, . . . , 2|B_j|. (21)

Using (21) and taking the limit on both sides of (20), we have

lim_{k∈M, k→∞} α^{k,j,2|B_j|+1} = lim_{k∈M, k→∞} α^{k,j,2|B_j|} = · · · = lim_{k∈M, k→∞} α^{k,j,1} = ᾱ. (22)

From the continuity of ∇f(α) and (22), we have

lim_{k∈M, k→∞} ∇f(α^{k,j,v}) = ∇f(ᾱ), ∀v = 1, . . . , 2|B_j|.

Hence, there are ǫ and k̄ such that for all k ∈ M with k ≥ k̄,

|∇_{i} f(α^{k,j,v})| ≤ ǫ/4 if ∇_{i} f(ᾱ) = 0, (23)
∇_{i} f(α^{k,j,v}) ≥ 3ǫ/4 if ∇_{i} f(ᾱ) > 0, (24)
∇_{i} f(α^{k,j,v}) ≤ −3ǫ/4 if ∇_{i} f(ᾱ) < 0, (25)

for any i ∈ B_j, v ≤ 2|B_j|.

When we update α^{k,j,v} to α^{k,j,v+1} by changing the ith element (i.e., i = i_{j,v}) in the first round, the optimality condition for (15) implies that one of the following three situations occurs:

∇_{i} f(α^{k,j,v+1}) = 0, (26)
∇_{i} f(α^{k,j,v+1}) > 0 and α^{k,j,v+1}_{i} = 0, (27)
∇_{i} f(α^{k,j,v+1}) < 0 and α^{k,j,v+1}_{i} = C. (28)

From (23)–(25), we have that

i satisfies (26) ⇒ ∇_{i} f(ᾱ) = 0,
i satisfies (27) ⇒ ∇_{i} f(ᾱ) ≥ 0,
i satisfies (28) ⇒ ∇_{i} f(ᾱ) ≤ 0. (29)

In the second round, assume α_{i} is changed at the v′th update. From (29) and (23)–(25), we have

|∇_{i} f(α^{k,j,v′})| ≤ ǫ/4, (30)

or

∇_{i} f(α^{k,j,v′}) ≥ −ǫ/4 and α^{k,j,v′}_{i} = 0, (31)

or

∇_{i} f(α^{k,j,v′}) ≤ ǫ/4 and α^{k,j,v′}_{i} = C. (32)

Using (30)–(32), the projected gradient defined in (17) satisfies

|∇^{P}_{i} f(α^{k,j,v′})| ≤ ǫ/4.

This result holds for all i ∈ B_j. Therefore,

max_{v ∈ 2nd round} ∇^{P}_{i_{j,v}} f(α^{k,j,v}) − min_{v ∈ 2nd round} ∇^{P}_{i_{j,v}} f(α^{k,j,v}) ≤ ǫ/4 − (−ǫ/4) = ǫ/2 < ǫ.

Thus, (16) is satisfied in the second round. Then t_{k,j} = 2|B_j| violates (19). Hence, (18) holds and the theorem is obtained.