## Active Sampling of Pairs and Points for Large-scale Linear Bipartite Ranking

Wei-Yuan Shen, and Hsuan-Tien Lin, Member, IEEE

**Abstract—Bipartite ranking is a fundamental ranking problem that learns to order relevant instances ahead of irrelevant ones. One**
major approach for bipartite ranking, called the pair-wise approach, tackles an equivalent binary classification problem of whether one
instance out of a pair of instances should be ranked higher than the other. Nevertheless, the number of instance pairs constructed
from the input data could be quadratic to the size of the input data, which makes pair-wise ranking generally infeasible on large-
scale data sets. Another major approach for bipartite ranking, called the point-wise approach, directly solves a binary classification
problem between relevant and irrelevant instance points. This approach is feasible for large-scale data sets, but the resulting ranking
performance can be inferior. That is, it is difficult to conduct bipartite ranking accurately and efficiently at the same time. In this paper, we
develop a novel scheme within the pair-wise approach to conduct bipartite ranking efficiently. The scheme, called Active Sampling, is
inspired by the rich field of active learning and can reach a competitive ranking performance while focusing only on a small subset of
the many pairs during training. Moreover, we propose a general Combined Ranking and Classification (CRC) framework to accurately
conduct bipartite ranking. The framework unifies point-wise and pair-wise approaches and is simply based on the idea of treating
each instance point as a pseudo-pair. Experiments on 14 real-world large-scale data sets demonstrate that the proposed algorithm of
Active Sampling within CRC, when coupled with a linear Support Vector Machine, usually outperforms state-of-the-art point-wise and
pair-wise ranking approaches in terms of both accuracy and efficiency.

**Index Terms—bipartite ranking, binary classification, large-scale, active learning, AUC.**


**1** **INTRODUCTION**

The bipartite ranking problem aims at learning a ranking function that orders positive instances ahead of negative ones. For example, in information retrieval, bipartite ranking can be used to order the preferred documents in front of the less-preferred ones within a list of search-engine results; in bioinformatics, bipartite ranking can be used to identify genes related to a certain disease by ranking the relevant genes higher than irrelevant ones. Bipartite ranking algorithms take some positive and negative instances as the input data, and produce a ranking function that maps an instance to a real-valued score. Given a pair of a positive instance and a negative one, we say that the pair is mis-ordered if the ranking function gives a higher score to the negative instance. The performance of the ranking function is measured by the probability of mis-ordering an unseen pair of randomly chosen positive and negative instances, which is equal to one minus the Area Under the ROC Curve (AUC) [18], a popular criterion for evaluating the sensitivity and the specificity of binary classifiers in many real-world tasks [6], [11], [21] and large-scale data mining competitions [9], [41].

*W.-Y. Shen and H.-T. Lin are with the Department of Computer Science and Information Engineering, National Taiwan University, Taiwan, e-mail: {r00922024, htlin}@csie.ntu.edu.tw.*

*Manuscript received August ??, 2014; revised ??.*

Given the many potential applications in information retrieval, bioinformatics, and recommendation systems, bipartite ranking has received much research attention in the past two decades [1], [6], [12], [16], [19], [25], [27], [30]. Many existing bipartite ranking algorithms explicitly or implicitly reduce the problem to binary classification to inherit the benefits from the well-developed methods in binary classification [6], [16], [19], [23], [27].

The majority of those reduction-based algorithms can be categorized into two approaches: the pair-wise approach and the point-wise one. The pair-wise approach transforms the input data of positive and negative instances to pairs of instances, and learns a binary classifier for predicting whether the first instance in a pair should be scored higher than the second one. Note that for an input data set that contains $N^{+}$ positive instances and $N^{-}$ negative ones, the pair-wise approach trains a binary classifier by optimizing an objective function that consists of $N^{+}N^{-}$ terms, one for each pair of instances.

The pair-wise approach comes with strong theoretical guarantees. For example, [4] shows that a low-regret ranking function can indeed be formed by a low-regret binary classifier. The strong theoretical guarantees lead to promising experimental results in many state-of-the-art bipartite ranking algorithms, such as RankSVM [23], RankBoost [19] and RankNet [7]. Nevertheless, the number of pairs in the input data can easily be of size $\Theta(N^{2})$, where $N$ is the size of the input data, if the data is not extremely unbalanced. The quadratic number of pairs with respect to $N$ makes the pair-wise approach computationally infeasible for large-scale data sets in general, except in a few special algorithms like RankBoost [19] or the efficient linear RankSVM [25]. RankBoost enjoys an efficient implementation by reducing the quadratic number of pair-wise terms in the objective function to a linear number of equivalent terms; the efficient linear RankSVM transforms the pair-wise optimization formulation to an equivalent formulation that can be solved in subquadratic time complexity [27].

On the other hand, the point-wise approach directly runs binary classification on the positive and negative instance points of the input data, and takes the scoring function behind the resulting binary classifier as the ranking function. In some special cases [19], [31], such as AdaBoost [20] and its pair-wise sibling RankBoost [19], the point-wise approach is shown to be equivalent to the corresponding pair-wise one [16], [33]. In other cases, the point-wise approach often operates with an approximate objective function that involves only $N$ terms [21], [27]. For example, [27] shows that minimizing the exponential or the logistic loss function on the instance points decreases an upper bound on the number of mis-ordered pairs within the input data. Because of the approximate nature of the point-wise approach, its ranking performance can sometimes be inferior to the pair-wise approach.

From the discussion above, we see that the pair-wise approach leads to more satisfactory performance while the point-wise approach comes with efficiency, and there is a trade-off between the two. In this paper, we are interested in designing bipartite ranking algorithms that enjoy both satisfactory performance and efficiency for large-scale bipartite ranking. We focus on using the linear Support Vector Machine (SVM) [40] given its recent advances for efficient large-scale learning [43].

We first show that the loss function behind the usual point-wise SVM [40] minimizes an upper bound on the loss function behind RankSVM, which suggests that the point-wise SVM could be an approximate bipartite ranking algorithm that enjoys efficiency. Then, we design a better ranking algorithm with two major contributions.

Firstly, we study an active sampling scheme to select important pairs for the pair-wise approach and name the scheme Active Sampling for RankSVM (ASRankSVM). The scheme makes the pair-wise SVM computationally feasible by focusing only on a small number of valuable pairs out of the quadratic number of pairs. The active sampling scheme is inspired by active learning, another popular machine learning setup that aims to save the effort of labeling [35]. More specifically, we discuss the similarity and differences between active sampling (selecting a small number of valuable pairs within a pool of potential pairs) and pool-based active learning (labeling a small number of valuable instances within a pool of unlabeled instances), and propose some active sampling strategies based on the similarity.

Secondly, we propose a general framework that unifies the point-wise SVM and the pair-wise SVM (RankSVM) as special cases. The framework, called Combined Ranking and Classification (CRC), is simply based on the idea of treating each instance point as a pseudo-pair. The CRC framework coupled with active sampling (ASCRC) improves the performance of the point-wise SVM by considering not only points but also pairs in its objective function.

Performing active sampling within the CRC framework leads to a promising algorithm for large-scale linear bipartite ranking. We conduct experiments on 14 real-world large-scale data sets and compare the proposed algorithms (ASRankSVM and ASCRC) with several state-of-the-art bipartite ranking algorithms, including the point-wise linear SVM [17], the efficient linear RankSVM [25], and the Combined Ranking and Regression (CRR) algorithm [34], which is closely related to the CRC framework. In addition, we demonstrate the robustness and the efficiency of the active sampling strategies and discuss some advantages and disadvantages of different strategies. The results show that ASRankSVM is able to efficiently sample only 8,000 out of millions of possible pairs and still achieve better performance than other state-of-the-art algorithms that use all the pairs, while ASCRC, which also considers the pseudo-pairs, can sometimes be helpful. Those results validate that the proposed algorithm can indeed enjoy both satisfactory performance and efficiency for large-scale bipartite ranking.

The paper is organized as follows. Section 2 describes the problem setup and several related works in the literature. Then, we illustrate the active sampling scheme and the proposed CRC framework in Section 3. We conduct a thorough experimental study to compare the proposed algorithm to several state-of-the-art ones in Section 4, and conclude in Section 5.

A preliminary version of this paper appeared in the 5th Asian Conference on Machine Learning [37]. This paper enriches the preliminary version by

1) extending the design of the proposed CRC framework to allow a threshold term for the classification part in Section 3.5;

2) examining the necessity of each part of the proposed CRC framework in Section 4.2;

3) studying the effect of the budget parameter of active sampling in Section 4.4.

**2** **SETUP AND RELATED WORKS**

In a bipartite ranking problem, we are given a training set $D = \{(x_k, y_k)\}_{k=1}^{N}$, where each $(x_k, y_k)$ is a training instance with the feature vector $x_k$ in an $n$-dimensional space $\mathcal{X} \subseteq \mathbb{R}^{n}$ and the binary label $y_k \in \{+1, -1\}$. Such a training set is of the same format as the training set in usual binary classification problems. We assume that the instances $(x_k, y_k)$ are drawn i.i.d. from an unknown distribution $P$ on $\mathcal{X} \times \{+1, -1\}$. Bipartite ranking algorithms take $D$ as the input and learn a ranking function $r \colon \mathcal{X} \to \mathbb{R}$ that maps a feature vector $x$ to a real-valued score $r(x)$.

For any pair of two instances, we call the pair *mis-ordered* by $r$ iff the pair contains a positive instance $(x_{+}, +1)$ and a negative one $(x_{-}, -1)$ while $r(x_{+}) \le r(x_{-})$. For a distribution $P$ that generates instances $(x, y)$, we can define its pair distribution $P_2$ that generates $(x, y, x', y')$ to be the conditional probability of sampling two instances $(x, y)$ and $(x', y')$ from $P$, conditioned on $y \ne y'$. Then, let the expected bipartite ranking loss $L_P(r)$ for any ranking function $r$ be the expected number of mis-ordered pairs over $P_2$:

$$L_P(r) = \mathbb{E}_{(x, y, x', y') \sim P_2} \Big[ I\big( (y - y')(r(x) - r(x')) \le 0 \big) \Big],$$

where $I(\cdot)$ is an indicator function that returns 1 iff the condition $(\cdot)$ is true, and returns 0 otherwise. The goal of bipartite ranking is to use the training set $D$ to learn a ranking function $r$ that minimizes the expected bipartite ranking loss $L_P(r)$.

Because $P$ is unknown, $L_P(r)$ cannot be computed directly. Thus, bipartite ranking algorithms usually resort to the empirical bipartite ranking loss $L_D(r)$, which takes the expectation over the pairs in $D$ instead of over the pair distribution $P_2$, with the hope that $L_D(r)$ would be sufficiently close to $L_P(r)$ when the model complexity of the candidate ranking functions $r$ is properly controlled. Denote $D^{+}$ as the set of the positive instances in $D$, and $D^{-}$ as the set of negative instances in $D$. The formal definition of $L_D(r)$ is

$$L_D(r) = \frac{1}{N^{+}N^{-}} \sum_{x_i \in D^{+}} \sum_{x_j \in D^{-}} I\big( r(x_i) \le r(x_j) \big).$$

The bipartite ranking loss $L_P(r)$ is closely related to the area under the ROC curve (AUC), which is commonly used to evaluate the sensitivity and the specificity of binary classifiers [6], [9], [11], [21], [41]. More specifically, AUC calculates the expected number of correctly-ordered pairs. Hence, $\mathrm{AUC}_{\bullet}(r) = 1 - L_{\bullet}(r)$ for $\bullet = P$ or $D$, and higher AUC indicates better ranking performance.
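As a small illustration of the definitions above, the following sketch computes the empirical bipartite ranking loss and the corresponding AUC from two lists of scores. The function names and the toy scores are hypothetical and only serve the example. Note that, following the definition of mis-ordering used here, ties ($r(x_i) = r(x_j)$) count as fully mis-ordered, whereas many AUC implementations count a tie as half an error.

```python
def empirical_ranking_loss(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs with r(x_i) <= r(x_j)."""
    mis_ordered = sum(
        1 for s_pos in scores_pos for s_neg in scores_neg if s_pos <= s_neg
    )
    return mis_ordered / (len(scores_pos) * len(scores_neg))


def empirical_auc(scores_pos, scores_neg):
    """AUC_D(r) = 1 - L_D(r) under the tie convention of the definition above."""
    return 1.0 - empirical_ranking_loss(scores_pos, scores_neg)


# Toy scores r(x) for three positive and two negative instances.
scores_pos = [0.9, 0.4, 0.7]
scores_neg = [0.5, 0.1]

loss = empirical_ranking_loss(scores_pos, scores_neg)  # only (0.4, 0.5) is mis-ordered
auc = empirical_auc(scores_pos, scores_neg)
```

The double loop touches all $N^{+}N^{-}$ pairs, which is acceptable for evaluation on a toy set but hints at why training on every pair becomes costly.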

Bipartite ranking is a special case of the general ranking problem in which the labels $y$ can be any real value, not necessarily $\{+1, -1\}$. For example, recommendation systems may allow users to enter their preferences (labels) on the item $x$ with real-valued or ordinal-scaled scores. There are lots of recent studies on improving the accuracy [8], [15], [23] and efficiency [3], [19] of general ranking problems. In this paper, instead of considering the general ranking problem, we focus on using the specialty of bipartite ranking, namely its connection to binary classification, to improve the accuracy and the efficiency.

Motivated by the recent advances of linear models for efficient large-scale learning [43], we consider linear models for efficient large-scale bipartite ranking. That is, the ranking functions would be of the form $r(x) = w^{T}x$, which is linear in the components of the feature vector $x$. In particular, we study the linear Support Vector Machine (SVM) [40] for bipartite ranking. There are two possible approaches for adopting the linear SVM on bipartite ranking problems: the pair-wise SVM approach and the point-wise SVM approach.

The pair-wise approach corresponds to the famous RankSVM algorithm [23], which is originally designed for ranking with ordinal-scaled scores, but can be easily extended to general ranking with real-valued labels or restricted to bipartite ranking with binary labels. For each positive instance $(x_i, y_i = +1)$ and negative instance $(x_j, y_j = -1)$, the pair-wise approach transforms the two instances to two symmetric pairs of instances $((x_i, x_j), +1)$ and $((x_j, x_i), -1)$, the former for indicating that $x_i$ should be scored higher than $x_j$ and the latter for indicating that $x_j$ should be scored lower than $x_i$. The pairs transformed from $D$ are then fed to an SVM for learning a ranking function of the form $r(x) = w^{T}\phi(x)$, where $\phi$ indicates some feature transform.

When using a linear SVM, $\phi$ is simply the identity function. Then, for the pair $((x_i, x_j), +1)$, we see that $I\big(r(x_i) \le r(x_j)\big) = 0$ iff $w^{T}(x_i - x_j) > 0$. Defining the transformed feature vector $x_{ij} = x_i - x_j$ and the transformed label $y_{ij} = \mathrm{sign}(y_i - y_j)$, we can equivalently view the pair-wise linear SVM as simply running a linear SVM on the pair-wise training set $D_{\text{pair}} = \{(x_{ij}, y_{ij}) \mid y_i \ne y_j\}$.

The pair-wise linear SVM minimizes the hinge loss as a surrogate to the 0/1 loss on $D_{\text{pair}}$ [38], and the 0/1 loss on $D_{\text{pair}}$ is equivalent to $L_D(r)$, the empirical bipartite ranking loss of interest. That is, if the linear SVM learns an accurate binary classifier using $D_{\text{pair}}$, the resulting ranker $r(x) = w^{T}x$ would also be accurate in terms of the bipartite ranking loss.

Denoting the hinge function $\max(\cdot, 0)$ by $[\cdot]_+$, RankSVM solves the following optimization problem

$$\min_{w} \; \frac{1}{2} w^{T}w + \sum_{x_{ij} \in D_{\text{pair}}} C_{ij} \, [1 - w^{T} y_{ij} x_{ij}]_+ , \tag{1}$$

where $C_{ij}$ denotes the weight of the pair $x_{ij}$. Because of the symmetry of $x_{ij}$ and $x_{ji}$, we naturally assume that $C_{ij} = C_{ji}$. In the original RankSVM formulation, $C_{ij}$ is set to a constant for all the pairs. Here we list a more flexible formulation (1) to facilitate some discussions later.

RankSVM has reached promising bipartite ranking performance in the literature [6]. Because of the symmetry of positive and negative pairs, we can equivalently solve (1) on those positive pairs with $y_{ij} = 1$. The number of such positive pairs is $N^{+}N^{-}$ if there are $N^{+}$ positive instances and $N^{-}$ negative ones. The huge number of pairs makes it difficult to solve (1) with a naïve quadratic programming algorithm.
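To make the size of the pair-wise problem concrete, the sketch below materializes the positive pairs $x_{ij} = x_i - x_j$ and evaluates the objective of (1) restricted to them. It assumes a single constant weight `C` per positive pair (an illustrative simplification of $C_{ij}$), and the helper names and toy data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def positive_pairs(X_pos, X_neg):
    """All transformed vectors x_ij = x_i - x_j with y_ij = +1 (N+ * N- of them)."""
    return [
        [a - b for a, b in zip(x_i, x_j)]
        for x_i in X_pos
        for x_j in X_neg
    ]


def rank_svm_objective(w, pairs, C):
    """0.5 * w'w + C * sum of hinge losses [1 - w'x_ij]_+ over the positive pairs."""
    hinge_total = sum(max(0.0, 1.0 - dot(w, x_ij)) for x_ij in pairs)
    return 0.5 * dot(w, w) + C * hinge_total


X_pos = [[2.0, 0.0], [1.0, 1.0]]
X_neg = [[0.0, 0.0], [1.0, 2.0], [0.5, 0.5]]

pairs = positive_pairs(X_pos, X_neg)           # 2 * 3 = 6 pairs
objective = rank_svm_objective([1.0, 0.0], pairs, C=1.0)
```

With two positive and three negative instances there are already six terms in the sum; with balanced data of size $N$ the list grows as $\Theta(N^{2})$, so materializing it in this way is only sensible for small examples.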

In contrast with the naïve RankSVM, the efficient linear RankSVM [25] changes (1) to a more sophisticated and equivalent one with an exponential number of constraints, each corresponding to a particular linear combination of the pairs. Then, it reaches $O(N \log N)$ time complexity by using a cutting-plane solver to identify the most-violated constraints iteratively, while the constant hidden in the big-O notation depends on the parameter $C_{ij}$ as well as the desired precision of optimization. The subquadratic time complexity of the efficient RankSVM can make it much slower than the point-wise approach (to be discussed below), and hence may not always be fast enough for large-scale bipartite ranking.

The point-wise SVM approach, on the other hand, directly runs an SVM on the original training set $D$ instead of $D_{\text{pair}}$. That is, in the linear case, the point-wise approach solves the following optimization problem

$$\min_{w} \; \frac{1}{2} w^{T}w + C_{+} \sum_{x_i \in D^{+}} [1 - w^{T}x_i]_+ + C_{-} \sum_{x_j \in D^{-}} [1 + w^{T}x_j]_+ . \tag{2}$$

Such an approach comes with some theoretical justification [27]. In particular, the 0/1 loss on $D$ has been proved to be an upper bound of the empirical bipartite ranking loss. In fact, the bound can be tightened by adjusting $C_{+}$ and $C_{-}$ to balance the distribution of the positive and negative instances in $D$. When $C_{+} = C_{-}$, [6] shows that the point-wise approach (2) is inferior to the pair-wise approach (1) in performance. The inferior performance can be attributed to the fact that the point-wise approach only operates with an approximation (upper bound) of the bipartite ranking loss of interest.

Next, inspired by the theoretical result of upper-bounding the bipartite ranking loss with a balanced 0/1 loss, we study the connection between (1) and (2) by balancing the hinge loss in (2). In particular, as shown in Theorem 1, a balanced form of (2) can be viewed as minimizing an upper bound of the objective function within (1). In other words, the weighted point-wise SVM can be viewed as a reasonable baseline algorithm for large-scale bipartite ranking problems.

**Theorem 1.** *Let $C_{ij} = \frac{C}{2}$ be a constant in (1); $C_{+} = 2N^{-} \cdot C$ and $C_{-} = 2N^{+} \cdot C$ in (2). Then, for every $w$, the objective function of (1) is upper-bounded by $\frac{1}{4}$ times the objective function of (2).*

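*Proof.* Because $1 - w^{T}x_{ij} = \frac{1}{2}(1 - 2w^{T}x_i) + \frac{1}{2}(1 + 2w^{T}x_j)$ and the hinge function is convex,

$$[1 - w^{T}x_{ij}]_+ \le \frac{1}{2}\Big( [1 - 2w^{T}x_i]_+ + [1 + 2w^{T}x_j]_+ \Big).$$

Starting from the objective function of (1), we get

$$
\begin{aligned}
& \frac{1}{2} w^{T}w + \sum_{x_{ij} \in D_{\text{pair}}} C_{ij} \, [1 - w^{T} y_{ij} x_{ij}]_+ \\
={}& \frac{1}{2} w^{T}w + \sum_{x_{ij} \in D_{\text{pair}},\, y_{ij} = +1} C \, [1 - w^{T}x_{ij}]_+ \\
\le{}& \frac{1}{2} w^{T}w + \frac{C}{2} \sum_{x_i \in D^{+}} \sum_{x_j \in D^{-}} \Big( [1 - 2w^{T}x_i]_+ + [1 + 2w^{T}x_j]_+ \Big) \\
={}& \frac{1}{2} w^{T}w + \frac{C}{2} \cdot N^{-} \sum_{x_i \in D^{+}} [1 - 2w^{T}x_i]_+ + \frac{C}{2} \cdot N^{+} \sum_{x_j \in D^{-}} [1 + 2w^{T}x_j]_+ .
\end{aligned}
$$

The theorem can then be easily proved by substituting $2w$ with a new variable $u$: the last expression equals $\frac{1}{4}$ times the objective function of (2) evaluated at $u$.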
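The bound can also be checked numerically. The sketch below is only a sanity check on random toy data, not part of the proof: it evaluates the objective of (1) at $w$ with weight $C$ per positive pair, and $\frac{1}{4}$ times the objective of (2) at $u = 2w$ with $C_{+} = 2N^{-}C$ and $C_{-} = 2N^{+}C$, following the substitution used in the proof. All function names, data sizes, and the seed are hypothetical choices.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge(z):
    return max(0.0, z)


def objective_pairwise(w, X_pos, X_neg, C):
    """Objective of (1) with C_ij = C/2 on both symmetric pairs, i.e. C per positive pair."""
    loss = sum(
        hinge(1.0 - (dot(w, x_i) - dot(w, x_j))) for x_i in X_pos for x_j in X_neg
    )
    return 0.5 * dot(w, w) + C * loss


def objective_pointwise(u, X_pos, X_neg, C):
    """Objective of (2) with C+ = 2 * N^- * C and C- = 2 * N^+ * C."""
    c_pos = 2.0 * len(X_neg) * C
    c_neg = 2.0 * len(X_pos) * C
    loss_pos = sum(hinge(1.0 - dot(u, x)) for x in X_pos)
    loss_neg = sum(hinge(1.0 + dot(u, x)) for x in X_neg)
    return 0.5 * dot(u, u) + c_pos * loss_pos + c_neg * loss_neg


rng = random.Random(0)
dim, C = 3, 0.7
X_pos = [[rng.gauss(0.5, 1.0) for _ in range(dim)] for _ in range(5)]
X_neg = [[rng.gauss(-0.5, 1.0) for _ in range(dim)] for _ in range(8)]

bound_holds = True
for _ in range(200):
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    u = [2.0 * v for v in w]  # substitution u = 2w from the proof
    lhs = objective_pairwise(w, X_pos, X_neg, C)
    rhs = 0.25 * objective_pointwise(u, X_pos, X_neg, C)
    bound_holds = bound_holds and lhs <= rhs + 1e-9  # small slack for rounding
```

The point-wise side only needs $N^{+} + N^{-}$ hinge evaluations per $w$, while the pair-wise side needs $N^{+}N^{-}$, which is the efficiency gap the theorem trades against tightness.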

**3** **BIPARTITE RANKING WITH ACTIVE SAMPLING**

As discussed in the previous section, the pair-wise approach (1) is infeasible on large-scale data sets due to the huge number of pairs. Then, either some random sub-sampling of the pairs is needed [34], or the less-accurate point-wise approach (2) is taken as the approximate alternative [27]. Nevertheless, the better ranking performance of the pair-wise approach over the point-wise one suggests that some of the key pairs carry more valuable information than the instance points. Next, we design an algorithm that samples a few key pairs actively during learning. The resulting algorithm achieves better performance than the point-wise approaches because of the key pairs, and enjoys better efficiency than the pair-wise approach because of the sampling. We first show that some proposed active sampling schemes, which are inspired by the many existing methods in active learning [28], [32], [35], can help identify those key pairs better than random sub-sampling. Then, we discuss how we can unify point-wise and pair-wise ranking approaches under the same framework.

**3.1** **Pool-based Active Learning**

The pair-wise SVM approach (1) is challenging to solve because of the huge number of pairs involved in $D_{\text{pair}}$. To make the computation feasible, we can only afford to work on a small subset of $D_{\text{pair}}$ during training. Existing algorithms conquer the computational difficulty of the huge number of pairs in different ways. The Combined Ranking and Regression approach [34] performs stochastic gradient descent on its objective function, which essentially selects within the huge number of pairs in a random manner; the efficient RankSVM [25] identifies the most-violated constraints during optimization, which corresponds to selecting the most valuable pairs from an optimization perspective.

We take an alternative route and hope to select the most valuable pairs from a learning perspective. That is, our task is to iteratively select a small number of valuable pairs for training while reaching similar performance to the pair-wise approach that trains with all the pairs. One machine learning setup that works for a similar task is active learning [35], which iteratively selects a small number of valuable instances for labeling (and training) while reaching similar performance to the approach that trains with all the instances fully labeled. The work in [2] avoids the quadratic number of pairs in the general ranking problem from an active learning perspective, and proves that selecting a subquadratic number of pairs is sufficient to obtain a ranking function that is close to the optimal ranking function trained by using all the pairs. The algorithm is theoretical in nature, while many other promising active learning tools [28], [32], [35] have not been explored for selecting valuable pairs in large-scale bipartite ranking.

Next, we start exploring those tools by providing a brief review about active learning. We focus on the setup of pool-based active learning [35] because of its strong connection to our needs. In a pool-based active learning problem, the training instances are separated into two parts, the labeled pool ($L$) and the unlabeled pool ($U$). As the name suggests, the labeled pool consists of labeled instances that contain both the feature vector $x_k$ and its corresponding label $y_k$, and the unlabeled pool contains unlabeled instances $x_{\ell}$ only. Pool-based active learning assumes that a (huge) pool of unlabeled instances is relatively easy to gather, while labeling those instances can be expensive. Therefore, we hope to achieve promising learning performance with as few labeled instances as possible.

A pool-based active learning algorithm is generally iterative. In each iteration, there are two steps: the training step and the querying step. In the training step, the algorithm trains a decision function from the labeled pool; in the querying step, the algorithm selects one (or a few) unlabeled instances, queries an oracle to label those instances, and moves those instances from the unlabeled pool to the labeled one. The pool-based active learning framework repeats the training and querying steps iteratively until a given budget $B$ on the number of queries is met, with the hope that the decision functions returned throughout the learning steps are as accurate as possible for prediction.

Because labeling is expensive, active learning algorithms aim to select the most valuable instance(s) from the unlabeled pool at each querying step. Various selection criteria have been proposed to describe the value of an unlabeled instance [35], such as uncertainty sampling [28], expected error reduction [32], and expected model change [36].

Moreover, there are several works that solve bipartite ranking under the active learning scenario [13], [14], [42]. For example, [13] selects points that reduce the ranking loss functions most from the unlabeled pool while [14] selects points that maximize the AUC in expectation. Nevertheless, these active learning algorithms require either sorting or enumerating over the huge unlabeled pool in each querying step. The sorting or enumerating process can be time consuming, but has not been considered seriously because labeling is assumed to be even more expensive. We will discuss later that those algorithms that require sorting or enumerating may not fit our goal.

**3.2** **Active Sampling**

Following the philosophy of active learning, we propose the active sampling scheme for choosing a smaller set of key pairs on the huge training set $D_{\text{pair}}$. We call the scheme Active Sampling in order to highlight some differences from active learning. One particular difference is that RankSVM (1) only requires optimizing with positive pairs. Then, the label $y_{ij}$ of a pair is a constant 1 and thus easy to get during active sampling, while the label in active learning remains unknown before the possibly expensive querying step. Thus, while active sampling and active learning both focus on using as few labeled data as possible, the costly part of the active sampling scheme is on training rather than querying.

For active sampling, we denote $B$ as the budget on the number of pairs that can be used in training, which plays a similar role to the budget on querying in active learning. In brief, active sampling chooses $B$ informative pairs iteratively for solving the optimization problem (1).

We separate the pair-wise training set $D_{\text{pair}}$ into two parts, the chosen pool ($L^{*}$) and the unchosen pool ($U^{*}$). The chosen pool is the subset of pairs to be used for training, and the unchosen pool contains the unused pairs. The chosen pool is similar to the labeled pool in pool-based active learning; the unchosen pool acts like the unlabeled pool. The fact that it is almost costless to “label” the instances in the unchosen pool allows us to design simpler sampling strategies than those commonly used for active learning, because no effort is needed to estimate the unknown labels.

**Algorithm 1** Active Sampling

**Input:** the initial chosen pool, $L^{*}$; the initial unchosen pool, $U^{*}$; the regularization parameters, $\{C_{ij}\}$; the number of pairs sampled per iteration, $b$; the budget on the total number of pairs sampled, $B$; the sampling strategy, Sample $: (U^{*}, w) \to x_{ij}$, that chooses a pair from $U^{*}$.

**Output:** the ranking function represented by the weights $w$.

$w$ = linearSVM($L^{*}$, $\{C_{ij}\}$, $0$);

**repeat**

**for** $i = 1 \to b$ **do**

$x_{ij}$ = Sample($U^{*}$, $w$);

$L^{*} = L^{*} \cup \{(x_{ij}, y_{ij})\}$ and $U^{*} = U^{*} \setminus \{x_{ij}\}$;

**end for**

$w$ = linearSVM($L^{*}$, $\{C_{ij}\}$, $w$);

**until** ($|L^{*}| \ge B$)

**return** $w$;

The proposed scheme of active sampling is illustrated in Algorithm 1. The algorithm takes an initial chosen pool $L^{*}$ and an initial unchosen pool $U^{*}$, where we simply mimic the usual setup in pool-based active learning by letting $L^{*}$ be a randomly chosen subset of $D_{\text{pair}}$ and $U^{*}$ be the set of unchosen pairs in $D_{\text{pair}}$. In each iteration of the algorithm, we use Sample to actively choose $b$ pairs to be moved from $U^{*}$ to $L^{*}$. The strategy Sample takes the current ranking function $w$ into account. After sampling, a linearSVM is called to learn from $L^{*}$ along with the weights in $\{C_{ij}\}$. We feed the current $w$ to the linearSVM solver to allow a warm-start in optimization. The warm-start step enhances the efficiency and the performance. The iterative procedure continues until the budget $B$ of chosen pairs is fully consumed.
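The loop of Algorithm 1 can be sketched in a few lines. The sketch below makes several simplifying assumptions that are not part of the algorithm itself: `linear_svm` is a plain subgradient-descent stand-in for a real linear SVM solver (with hypothetical `epochs` and `lr` parameters), a single constant `C` replaces $\{C_{ij}\}$, the pools are small in-memory lists of positive pairs, and the loop stops early if the unchosen pool runs out. The passive `random_sample` strategy is used only as a placeholder for Sample.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_svm(pairs, C, w_init, epochs=50, lr=0.01):
    """Stand-in solver: subgradient descent on 0.5*w'w + C*sum([1 - w'x_ij]_+),
    warm-started from w_init (all pair labels are +1)."""
    w = list(w_init)
    for _ in range(epochs):
        grad = list(w)  # gradient of the regularizer 0.5 * w'w
        for x in pairs:
            if 1.0 - dot(w, x) > 0.0:  # pair violates the margin
                grad = [g - C * xi for g, xi in zip(grad, x)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def random_sample(unchosen, w, rng):
    """Passive strategy: index of a uniformly random unchosen pair (ignores w)."""
    return rng.randrange(len(unchosen))


def active_sampling(chosen, unchosen, C, b, B, sample, rng):
    """Move b pairs per round from the unchosen to the chosen pool, then retrain."""
    chosen, unchosen = list(chosen), list(unchosen)
    w = linear_svm(chosen, C, [0.0] * len(chosen[0]))
    while len(chosen) < B and unchosen:
        for _ in range(min(b, len(unchosen))):
            k = sample(unchosen, w, rng)
            chosen.append(unchosen.pop(k))
        w = linear_svm(chosen, C, w)  # warm start from the current weights
    return w, chosen


rng = random.Random(0)
X_pos = [[rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)] for _ in range(20)]
X_neg = [[rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(20)]
pairs = [[a - c for a, c in zip(x_i, x_j)] for x_i in X_pos for x_j in X_neg]
rng.shuffle(pairs)

w, used = active_sampling(pairs[:10], pairs[10:], C=0.1, b=10, B=50,
                          sample=random_sample, rng=rng)
```

Here only 50 of the 400 available pairs ever reach the solver. Any strategy with the same `(unchosen, w, rng)` signature can be passed as `sample` in place of `random_sample`.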

Another main difference between the active sampling scheme and typical pool-based active learning is that we sample $b$ instances before the training step, while pool-based active learning often considers executing the training step right after querying the label of one instance. The difference is due to the fact that the pair-wise labels $y_{ij}$ can be obtained very easily and thus sampling and labeling can be relatively cheaper than querying in pool-based active learning. Furthermore, updating the weights right after knowing one instance may not lead to much improvement and can be too time consuming for the large-scale bipartite ranking problem that we want to solve.

**3.3** **Sampling Strategies**

Next, we discuss some possible sampling strategies that can be used in Algorithm 1. One naïve strategy is to passively choose a random sample within $U^{*}$. For active sampling strategies, we define two measures that estimate the (learning) value of an unchosen pair. The two measures correspond to well-known criteria in pool-based active learning. Let $x_{ij}$ be the unchosen pair in $U^{*}$ with $y_{ij} = 1$. The two measures with respect to the current ranking function $w$ are

$$\mathrm{closeness}(x_{ij}, w) = |w^{T}x_{ij}| \tag{3}$$

$$\mathrm{correctness}(x_{ij}, w) = -[1 - w^{T}x_{ij}]_+ \tag{4}$$

The closeness measure corresponds to one of the most popular criteria in pool-based active learning called uncertainty sampling [28]. It captures the uncertainty of the ranking function $w$ on the unchosen pair. Intuitively, a low value of closeness means that the ranking function finds it hard to distinguish the two instances in the pair, which implies that the ranking function is less confident on the pair. Therefore, sampling the unchosen pairs that come with the lowest closeness values may improve the ranking performance by resolving the uncertainty.

On the other hand, the correctness measure is related to another common criterion in pool-based active learning called expected error reduction [32]. It captures the performance of the ranking function $w$ on the unchosen pair. Note that this exact correctness measure is only available within our active sampling scheme because we know the pair-label $y_{ij}$ to always be 1 without loss of generality, while usual active learning algorithms do not know the exact measure before querying and hence have to estimate it [13], [14]. A low value of correctness indicates that the ranking function does not perform well on the pair. Then, sampling the unchosen pairs that come with the lowest correctness values may improve the ranking performance by correcting the possible mistakes. Moreover, sampling the pair with the lowest correctness value shall change $w$ the most in general, which echoes another criterion in pool-based active learning called expected model change [36].

Similar to other active learning algorithms [13], [14], computing the pairs that come with the lowest closeness or correctness values can be time consuming, as it requires at least evaluating the values of $w^{T}x_k$ for each instance $(x_k, y_k) \in D$, and then computing the measures on the pairs along with some selection or sorting steps that may be of super-linear time complexity [25]. Thus, such a *hard* version of active sampling is not computationally feasible for large-scale bipartite ranking. Next, we discuss the *soft* version of active sampling that randomly chooses pairs that come with lower closeness or correctness values by rejection sampling.
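For reference, the hard version can be written directly from measures (3) and (4). The sketch below assumes the unchosen pool is a small in-memory list of positive pairs $x_{ij}$. The helper names and the toy pool are hypothetical, and the full scan in `hard_sample` is exactly the cost that makes this variant impractical at scale.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def closeness(x_ij, w):
    """Measure (3): |w'x_ij|, low when the ranker is unsure about the pair."""
    return abs(dot(w, x_ij))


def correctness(x_ij, w):
    """Measure (4): -[1 - w'x_ij]_+, low when the pair has a large hinge loss."""
    return -max(0.0, 1.0 - dot(w, x_ij))


def hard_sample(unchosen, w, measure):
    """Index of the unchosen pair with the lowest measure value (scans the whole pool)."""
    return min(range(len(unchosen)), key=lambda k: measure(unchosen[k], w))


w = [1.0, 0.0]
unchosen = [[2.0, 0.0], [0.1, 3.0], [-1.5, 1.0], [0.6, -2.0]]  # w'x_ij = 2, 0.1, -1.5, 0.6

most_uncertain = hard_sample(unchosen, w, closeness)    # pair with w'x_ij closest to 0
most_violated = hard_sample(unchosen, w, correctness)   # pair with the largest hinge loss
```

On this toy pool the two measures pick different pairs: closeness selects the pair scored nearest to zero, while correctness selects the mis-ordered pair with the largest margin violation.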

**Algorithm 2** Soft Version of Active Sampling

**Input:** the current ranking function represented by the weights $w$; the unchosen pool, $U^{*}$.

**Output:** $x_{ij}$, the sampled pair.

**repeat**

Sample a pair $x_{ij}$ uniformly from $U^{*}$;

Calculate a probability value $p_{ij}$ from $w$;

**until** (random() $< p_{ij}$)

**return** $x_{ij}$;

Algorithm 2 illustrates the soft version of active sampling: we consider a rejection sampling step that samples a pair $x_{ij}$ with probability $p_{ij}$ based on a method random() that generates random numbers in $[0, 1]$. A pair that comes with a lower closeness or correctness value would enjoy a higher probability $p_{ij}$ of being accepted.

Next, we define the probability value functions that correspond to the hard versions of closeness and correctness. Both value functions are in the shape of the sigmoid function, which is widely used to represent probabilities in logistic regression and neural networks [5]. For soft closeness sampling, we define the probability value as:

$$p_{ij} \equiv \frac{2}{1 + e^{|w^T x_{ij}|}}.$$

For soft correctness sampling, we define $p_{ij}$ as:

$$p_{ij} \equiv 1 - \frac{2}{1 + e^{[1 - w^T x_{ij}]_+}}.$$

We take different forms of soft versions because closeness is of range $[0, \infty)$ while correctness is of range $(-\infty, 0]$.
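The two probability functions and the rejection loop of Algorithm 2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' code: the pool is a small in-memory list of pair vectors $x_{ij}$, and the function names, the `max_tries` guard, and the clamp that avoids floating-point overflow are all additions made for the example.

```python
import math
import random


def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def p_soft_close(margin):
    """p_ij = 2 / (1 + exp(|w^T x_ij|)); equals 1 when the margin is 0."""
    # Clamp the exponent so math.exp cannot overflow (illustrative guard).
    return 2.0 / (1.0 + math.exp(min(abs(margin), 700.0)))


def p_soft_correct(margin):
    """p_ij = 1 - 2 / (1 + exp([1 - w^T x_ij]_+)); equals 0 at zero hinge loss."""
    hinge = min(max(0.0, 1.0 - margin), 700.0)
    return 1.0 - 2.0 / (1.0 + math.exp(hinge))


def soft_sample(w, pool, prob_fn, rng, max_tries=100000):
    """Rejection sampling over a pool of pair vectors, following Algorithm 2.

    max_tries is a hypothetical safety limit, not part of Algorithm 2.
    """
    for _ in range(max_tries):
        x_ij = rng.choice(pool)           # sample uniformly from the pool
        p_ij = prob_fn(dot(w, x_ij))      # acceptance probability from w
        if rng.random() < p_ij:           # accept with probability p_ij
            return x_ij, p_ij
    raise RuntimeError("no pair accepted within max_tries proposals")
```

Returning $p_{ij}$ together with the pair is a design choice that keeps the acceptance probability available to whatever code sets the weight of the sampled pair.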

Note that the sampling strategies above, albeit focusing on the most valuable pairs, are inherently biased. The chosen pool may not be representative enough of the whole training set because of the biased sampling strategies. There is a simple way that allows us to correct the sampling bias for learning a ranking function that performs well on the original bipartite ranking loss of interest. We take the idea of [24] to weight the sampled pair by the inverse of its probability of being sampled. That is, we could multiply the weight $C_{ij}$ for a chosen pair $x_{ij}$ by $\frac{1}{p_{ij}}$ when it gets returned by the rejection sampling.
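A small numeric sketch may help to see why the inverse-probability weight removes the bias. It assumes that pairs are proposed uniformly and accepted with probability $p_{ij}$, so a pair is returned with probability proportional to $p_{ij}$; the loss and probability values are made up for illustration, and the function names are hypothetical.

```python
def accepted_distribution(probs):
    """Probability that rejection sampling returns each pair (proportional to p_ij)."""
    total = sum(probs)
    return [p / total for p in probs]


def expected_weighted_loss(losses, probs, correct_bias):
    """Expected loss contribution of one returned pair, with or without 1/p_ij weighting."""
    q = accepted_distribution(probs)
    if correct_bias:
        return sum(qi * loss / p for qi, loss, p in zip(q, losses, probs))
    return sum(qi * loss for qi, loss in zip(q, losses))


# Illustrative values only: four pairs with their losses and acceptance probabilities.
losses = [0.0, 0.5, 2.0, 3.0]
probs = [0.1, 0.2, 0.6, 0.9]

uniform_mean = sum(losses) / len(losses)   # the quantity of interest
scale = len(probs) / sum(probs)            # constant shared by all pairs

corrected = expected_weighted_loss(losses, probs, correct_bias=True)
uncorrected = expected_weighted_loss(losses, probs, correct_bias=False)

# With the 1/p_ij weights the expectation is the uniform mean up to a constant;
# without them, high-probability (high-loss) pairs are over-represented.
print(corrected / scale, uniform_mean, uncorrected)
```

The remaining constant factor does not depend on the pair, so it can be absorbed into the overall scale of the weights.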

**3.4** **Combined Ranking and Classification**

Theorem 1 suggests that the points can also carry some information for ranking. Next, we study how we can take those points into account during active sampling.

We start by taking a closer look at the similarities and differences between the point-wise SVM (2) and the pair-wise SVM (1). The pair-wise SVM considers the weighted hinge loss on the pairs $x_{ij} = x_i - x_j$, while the point-wise SVM considers the weighted hinge loss on the points $x_k$. Consider one positive point $(x_i, +1)$. Its hinge loss is $[1 - w^T x_i]_+$, which is the same as $[1 - w^T (x_i - 0)]_+$. In other words, the positive point $(x_i, +1)$ can also be viewed as a *pseudo-pair* that consists of $(x_i, +1)$ and $(0, -1)$. Similarly, a negative point $(x_j, -1)$ can be viewed as a pseudo-pair that consists of $(x_j, -1)$ and $(0, +1)$. Let the set of all pseudo-pairs within $D$ be

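$$
\begin{aligned}
D_{pseu} = {} & \{(x_{i0} = x_i - 0, +1) \mid x_i \in D^+\} \\
& \cup \{(x_{0j} = 0 - x_j, +1) \mid x_j \in D^-\} \\
& \cup \{(x_{0i} = 0 - x_i, -1) \mid x_i \in D^+\} \\
& \cup \{(x_{j0} = x_j - 0, -1) \mid x_j \in D^-\}.
\end{aligned}
$$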

Then, the point-wise SVM (2) is just a variant of the pair-wise one (1) using the pseudo-pairs and some particular weights. Thus, we can easily unify the point-wise and the pair-wise SVMs together by minimizing some weighted hinge loss on the joint set $D^* = D_{pair} \cup D_{pseu}$ of pairs and pseudo-pairs. By introducing a parameter $\gamma \in [0, 1]$ to control the relative importance between the real pairs and the pseudo-pairs, we propose the following novel formulation.

$$
\min_{w} \quad \frac{1}{2} w^T w + \gamma \sum_{x_{ij} \in D^+_{pair}} C_{crc}^{(ij)} [1 - w^T x_{ij}]_+ + (1 - \gamma) \sum_{x_{k\ell} \in D^+_{pseu}} C_{crc}^{(k\ell)} \cdot [1 - w^T x_{k\ell}]_+ , \tag{5}
$$

where $D^+_{pair}$ and $D^+_{pseu}$ denote the set of positive pairs and positive pseudo-pairs, respectively. The new formulation (5) combines the point-wise SVM and the pair-wise SVM in its objective function, and hence is named the Combined Ranking and Classification (CRC) framework. When $\gamma = 1$, CRC takes the pair-wise SVM (1) as a special case with $C_{ij} = 2C_{crc}^{(ij)}$; when $\gamma = 0$, CRC takes the point-wise SVM (2) as a special case with $C_+ = C_{crc}^{(i0)}$ and $C_- = C_{crc}^{(0j)}$. The CRC framework (5) remains as challenging to solve as the pair-wise SVM approach (1) because of the huge number of pairs. However, the general framework can be easily extended to the active sampling scheme, and hence be solved efficiently. We only need to change the training set from $D_{pair}$ to the joint set $D^*$, and multiply the probability value $p_{ij}$ in the soft version sampling by $\gamma$ or $(1 - \gamma)$ for actual pairs or pseudo-pairs.
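To make the pseudo-pair view concrete, the sketch below evaluates the CRC objective (5) on a toy data set by brute force. It is only an illustration under simplifying assumptions: the weights are constant per group (`c_pair`, `c_pos`, `c_neg` are hypothetical names), and all pairs are enumerated, which is exactly what active sampling avoids on large data.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def hinge(margin):
    """Hinge loss [1 - margin]_+."""
    return max(0.0, 1.0 - margin)


def crc_objective(w, positives, negatives, gamma, c_pair=1.0, c_pos=1.0, c_neg=1.0):
    """Brute-force value of the CRC objective (5) with constant weights per group."""
    reg = 0.5 * dot(w, w)
    # Real positive pairs x_ij = x_i - x_j, so w^T x_ij = w^T x_i - w^T x_j.
    pair_loss = sum(
        c_pair * hinge(dot(w, xi) - dot(w, xj))
        for xi in positives
        for xj in negatives
    )
    # Positive pseudo-pairs: x_i0 = x_i - 0 and x_0j = 0 - x_j.
    pseudo_loss = sum(c_pos * hinge(dot(w, xi)) for xi in positives)
    pseudo_loss += sum(c_neg * hinge(-dot(w, xj)) for xj in negatives)
    return reg + gamma * pair_loss + (1.0 - gamma) * pseudo_loss


w = [1.0, -1.0]
positives = [[2.0, 0.0], [0.0, 1.0]]
negatives = [[0.0, 2.0], [1.0, 0.0]]

pairwise_value = crc_objective(w, positives, negatives, gamma=1.0)   # pair-wise terms only
pointwise_value = crc_objective(w, positives, negatives, gamma=0.0)  # point-wise terms only
mixed_value = crc_objective(w, positives, negatives, gamma=0.5)
```

With $\gamma = 0$ the pseudo-pair terms coincide with the hinge losses $[1 - w^T x_i]_+$ and $[1 + w^T x_j]_+$ on the points, which is the sense in which the point-wise SVM is recovered.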

The CRC framework is closely related to the algorithm of Combined Ranking and Regression (CRR) [34] for general ranking. The CRR algorithm similarly considers a combined objective function of the point-wise terms and the pair-wise terms for improving the ranking performance. The main difference between CRR and CRC is that the CRR approach takes the squared loss on the points, while CRC takes the nature of bipartite ranking into account and considers the hinge loss on the points.

On the other hand, the idea of combining pair-wise and point-wise approaches has been used in another machine learning setup, the multi-label classification problem [39]. The algorithm of Calibrated Ranking by Pairwise Comparison [22] assumes a calibration label between relevant and irrelevant labels, and hence unifies the pair-wise and point-wise label learning for multi-label classification. The calibration label plays a similar role to the zero-vector in the pseudo-pairs for combining pair-wise and point-wise approaches.

To the best of our knowledge, while the CRR approach has reached promising performance in practice [34], the CRC formulation has not been seriously studied. The hinge loss used in CRC allows unifying the point-wise SVM and the pair-wise SVM under the same framework, and the unification is essential for applying *one* active sampling strategy on both the real pairs and the pseudo-pairs.

In summary, we propose the active sampling scheme for RankSVM (ASRankSVM) and the more general CRC framework (ASCRC), and derive two sampling strategies that correspond to popular strategies in pool-based active learning. The soft version of the sampling strategies helps reduce the computational cost, and allows correcting the sampling bias by adjusting the weights with the inverse probability of being sampled.

**3.5** **Combined Ranking and Classification with Threshold**

In Theorem 1, we connect the point-wise SVM *without threshold term* (2) to the pair-wise SVM (1). The standard SVM for binary classification, however, often comes with a threshold term $\theta$ to allow the classification hyperplane to be away from the origin. That is, the standard SVM solves

$$
\min_{\theta, w} \quad \frac{1}{2} w^T w + C_+ \sum_{x_i \in D^+} [1 - w^T x_i + \theta]_+ + C_- \sum_{x_j \in D^-} [1 + w^T x_j - \theta]_+ . \tag{6}
$$
Note that for any given $(\theta, w)$,

$$
[1 - w^T x_{ij}]_+ \le \frac{1}{2} \Big( [1 - 2w^T x_i + 2\theta]_+ + [1 + 2w^T x_j - 2\theta]_+ \Big).
$$

If we revisit the proof of Theorem 1 with the inequality above, we get a similar theorem that connects the standard SVM to the pair-wise SVM.

**Theorem 2.** *Let $C_{ij} = \frac{C}{2}$ be a constant in (1); $C_+ = 2N^- \cdot C$ and $C_- = 2N^+ \cdot C$ in (6). Then, for every $(\theta, w)$, the objective function of (1) is upper-bounded by $\frac{1}{4}$ times the objective function of (6).*
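The pair-wise inequality stated just before Theorem 2 can be checked in two lines. The sketch below uses only $x_{ij} = x_i - x_j$, the sub-additivity of the hinge function, $[a + b]_+ \le [a]_+ + [b]_+$, and its positive homogeneity, $[\tfrac{1}{2}a]_+ = \tfrac{1}{2}[a]_+$; it is a reading aid and not a reproduction of the authors' proof.

```latex
\begin{aligned}
[1 - w^T x_{ij}]_+
  &= \Big[ \tfrac{1}{2}\big(1 - 2w^T x_i + 2\theta\big)
         + \tfrac{1}{2}\big(1 + 2w^T x_j - 2\theta\big) \Big]_+ \\
  &\le \tfrac{1}{2}\,[1 - 2w^T x_i + 2\theta]_+
     + \tfrac{1}{2}\,[1 + 2w^T x_j - 2\theta]_+ .
\end{aligned}
```

The threshold $\theta$ cancels inside the first bracket, which is why the bound holds for any value of $\theta$.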

Given the connection between (6) and (1) in Theorem 2, one may wonder whether the trick of pseudo-pairs works for connecting the two formulations. Consider one positive point $(x_i, +1)$. Its hinge loss within (6) is $[1 - w^T x_i + \theta]_+$, which is the same as

$$
\left[ 1 - \begin{pmatrix} \theta \\ w \end{pmatrix}^T \left( \begin{pmatrix} -1 \\ x_i \end{pmatrix} - 0_{n+1} \right) \right]_+ ,
$$

where $0_{n+1}$ is a zero vector of length $n + 1$. Thus, the positive point $(x_i, +1)$ can also be viewed as an *extended pseudo-pair* that consists of $\left( \begin{pmatrix} -1 \\ x_i \end{pmatrix}, +1 \right)$ and $(0_{n+1}, -1)$, ranked by the extended vector $\begin{pmatrix} \theta \\ w \end{pmatrix}$. We will denote the extended vector $\begin{pmatrix} -1 \\ x_i \end{pmatrix}$ as $\tilde{x}_i$. Similarly, a negative point $(x_j, -1)$ can be viewed as an extended pseudo-pair that consists of $(\tilde{x}_j, -1)$ and $(0_{n+1}, +1)$.

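Note that if we consider all the extended vectors $\tilde{x}_i$, ranking pair-wise extended vectors by $\begin{pmatrix} \theta \\ w \end{pmatrix}$ means calculating

$$
\begin{pmatrix} \theta \\ w \end{pmatrix}^T (\tilde{x}_i - \tilde{x}_j) = \begin{pmatrix} \theta \\ w \end{pmatrix}^T \left( \begin{pmatrix} -1 \\ x_i \end{pmatrix} - \begin{pmatrix} -1 \\ x_j \end{pmatrix} \right) = w^T (x_i - x_j).
$$

That is, the hinge loss on extended pairs is exactly the same as the hinge loss on the original pairs.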
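The cancellation of the threshold on real pairs, and its survival on pseudo-pairs, can be verified numerically. The following sketch uses made-up values for $\theta$, $w$, $x_i$ and $x_j$; the helper names `extend` and `extended_weights` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def extend(x):
    """Extended vector: prepend the constant -1 to x."""
    return [-1.0] + list(x)


def extended_weights(theta, w):
    """Extended weight vector: prepend the threshold theta to w."""
    return [theta] + list(w)


# Illustrative values only.
theta, w = 0.7, [1.0, -2.0]
x_i, x_j = [3.0, 1.0], [0.5, 2.0]
v = extended_weights(theta, w)

# Real pair: the -1 entries cancel, so theta has no effect on the score.
ext_pair = [a - b for a, b in zip(extend(x_i), extend(x_j))]
pair_score_ext = dot(v, ext_pair)
pair_score = dot(w, [a - b for a, b in zip(x_i, x_j)])

# Pseudo-pair against the zero vector: theta stays and acts as the threshold.
point_score_ext = dot(v, extend(x_i))
point_score = dot(w, x_i) - theta
```

Both comparisons agree up to floating-point rounding, so the threshold only changes the pseudo-pair terms of the objective.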

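Based on the discussions above, if we define extended pairs $\tilde{x}_{ij}$ and extended pseudo-pairs $\tilde{x}_{k\ell}$ based on the extended vectors $\tilde{x}_i$ and $0_{n+1}$, we can combine the pair-wise SVM and the standard SVM with threshold term to design a variant of the CRC formulation:

$$
\min_{\theta, w} \quad \frac{1}{2} w^T w + \gamma \sum_{x_{ij} \in D^+_{pair}} C_{crc}^{(ij)} \left[ 1 - \begin{pmatrix} \theta \\ w \end{pmatrix}^T \tilde{x}_{ij} \right]_+ + (1 - \gamma) \sum_{x_{k\ell} \in D^+_{pseu}} C_{crc}^{(k\ell)} \cdot \left[ 1 - \begin{pmatrix} \theta \\ w \end{pmatrix}^T \tilde{x}_{k\ell} \right]_+ . \tag{7}
$$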

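Note, however, that $\theta$ in (7) is not included in the regularization term $\frac{1}{2} w^T w$. Several existing works, such as LIBLINEAR [17], include $\theta$ in the regularization term to allow simpler design of optimization routines. We adopt the same idea and consider

$$
\min_{\theta, w} \quad \frac{1}{2} (\theta^T \theta + w^T w) + \gamma \sum_{x_{ij} \in D^+_{pair}} C_{crc}^{(ij)} \left[ 1 - \begin{pmatrix} \theta \\ w \end{pmatrix}^T \tilde{x}_{ij} \right]_+ + (1 - \gamma) \sum_{x_{k\ell} \in D^+_{pseu}} C_{crc}^{(k\ell)} \cdot \left[ 1 - \begin{pmatrix} \theta \\ w \end{pmatrix}^T \tilde{x}_{k\ell} \right]_+ \tag{8}
$$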

in our study. We call this formulation (8) CRC-threshold, which is simply equivalent to the original CRC formulation (5) applied to the extended vectors. The equivalence allows us to easily test whether the flexibility of $\theta$ (through using the extended vectors $\tilde{x}_i$) can improve the original CRC formulation.

**4** **EXPERIMENTS**

In this section, we study the performance and efficiency of our proposed ASCRC algorithm on real-world large-scale data sets. We compare ASCRC with random-CRC, which does random sampling under the CRC framework. In addition, we compare ASCRC with three other state-of-the-art algorithms for large-scale bipartite ranking: the point-wise weighted linear SVM (2) (WSVM), an efficient implementation [25] of the pair-wise linear RankSVM (1) (ERankSVM), and the combined ranking and regression (CRR) [34] algorithm for general ranking.

We use 14 data sets from the LIBSVM Tools [10] and the UCI Repository [26] in the experiments. Table 1 shows the statistics of the data sets, each of which contains more than ten thousand instances and more than ten million pairs. The data sets are definitely too large for a naïve implementation of RankSVM (1). Note that the data sets marked with (∗) are originally multi-class data sets, and we take the sub-problem of ranking the first class ahead of the other classes as a bipartite ranking task. For data sets that come with a moderate-sized test set, we report the test AUC. Otherwise we perform a 5-fold cross validation and report the cross-validation AUC.

**4.1** **Experiment Settings**

Given a budget $B$ on the number of pairs to be used in each algorithm and a global regularization parameter $C$, we set the instance weights for each algorithm to fairly maintain the numerical scale between the regularization term and the loss terms. The global regularization parameter $C$ is fixed to 0.1 in all the experiments. In particular, the setting below ensures that the total $C^{(ij)}$, summed over all the pairs (or pseudo-pairs), would be $C \cdot B$ for all the algorithms.

• WSVM: As discussed in Section 2, $C_+$ and $C_-$ shall be inverse-proportional to $N^+$ and $N^-$ to make the weighted point-wise SVM a reasonable baseline for bipartite ranking. Thus, we set $C_+ = \frac{B}{2N^+} \cdot C$ and $C_- = \frac{B}{2N^-} \cdot C$ in (2). We solve the weighted SVM by the LIBLINEAR [17] package with its extension on instance weights.

• ERankSVM: We use the $SVM^{perf}$ [25] package to efficiently solve the linear RankSVM (1) with the AUC optimization option. We set the regularization parameter $C_{perf} = \frac{B}{100} \cdot C$, where the 100 comes from a suggested value of the $SVM^{perf}$ package.

• CRR: We use the package sofia-ml [34] with the *sgd-svm* learner type, *combined-ranking* loop type and the default number of iterations that SGD takes to solve the problem. We set its regularization parameter $\lambda = \frac{1}{B \cdot C}$.

• ASCRC (ASRankSVM): We initialize $|L^*|$ to $b$, and assign $C_{crc}^{(ij)} = \frac{\Gamma |L^*|}{p_{ij} Z} \cdot C$ in each iteration, where $\Gamma$ equals either $\gamma$ or $(1 - \gamma)$ for real or pseudo-pairs, respectively, and $Z$ is a normalization constant $\sum_{x_{ij} \in L^*} \frac{1}{p_{ij}}$ that prevents $C_{crc}^{(ij)}$ from being too large. We solve the linear SVM within ASCRC by the LIBLINEAR [17] package with its extension on instance weights.

• random-CRC: random-CRC simply corresponds to ASCRC with $p_{ij} = 1$ for all the pairs. That is, random-CRC samples uniformly within the unlabeled pool.

To evaluate the average performance of ASCRC and random-CRC algorithms, we average their results over 10 different initial pools.

TABLE 1: Data Sets Statistics

| Data | Positive | Negative | Total Points | Total Pairs | Dimension | AUC |
|---|---|---|---|---|---|---|
| letter* | 789 | 19211 | 20000 | 30314958 | 16 | CV |
| protein* | 8198 | 9568 | 17766 | 156876928 | 357 | test |
| news20 | 9999 | 9997 | 19996 | 199920006 | 1355191 | CV |
| rcv1 | 10491 | 9751 | 20242 | 204595482 | 47236 | CV |
| a9a | 7841 | 24720 | 32561 | 387659040 | 123 | test |
| bank | 5289 | 39922 | 45211 | 422294916 | 51 | CV |
| ijcnn1 | 4853 | 45137 | 49990 | 438099722 | 22 | CV |
| shuttle* | 34108 | 9392 | 43500 | 640684672 | 9 | test |
| mnist* | 5923 | 54077 | 60000 | 640596142 | 780 | test |
| connect* | 44473 | 23084 | 67557 | 2053229464 | 126 | CV |
| acoustic* | 18261 | 60562 | 78823 | 2211845364 | 50 | test |
| real-sim | 22238 | 50071 | 72309 | 2226957796 | 20958 | CV |
| covtype | 297711 | 283301 | 581012 | 168683648022 | 54 | CV |
| url | 792145 | 1603985 | 2396130 | 2541177395650 | 3231961 | CV |
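The weight settings of Section 4.1 can be sanity-checked with a short sketch. It assumes the reading $C_+ = \frac{B}{2N^+} C$, $C_- = \frac{B}{2N^-} C$ for WSVM and $C_{crc}^{(ij)} = \frac{\Gamma |L^*|}{p_{ij} Z} C$ for ASCRC; the function names are hypothetical, and the check of the total weight is done for the $\gamma = 1$ case, where $\Gamma = 1$ for every chosen pair.

```python
def wsvm_weights(budget, c, n_pos, n_neg):
    """Per-class weights C+ = B / (2 N+) * C and C- = B / (2 N-) * C."""
    return budget / (2.0 * n_pos) * c, budget / (2.0 * n_neg) * c


def ascrc_weights(probs, gammas, c):
    """Weights Gamma * |L*| / (p_ij * Z) * C with Z = sum of 1 / p_ij over the chosen pool."""
    size = len(probs)
    z = sum(1.0 / p for p in probs)
    return [g * size / (p * z) * c for p, g in zip(probs, gammas)]


# WSVM on the 'letter' statistics of Table 1 with B = 8000 and C = 0.1.
c_pos, c_neg = wsvm_weights(8000, 0.1, 789, 19211)
wsvm_total = 789 * c_pos + 19211 * c_neg   # total weight over all points

# ASCRC with gamma = 1 on a toy chosen pool of three pairs (made-up probabilities).
weights = ascrc_weights([0.2, 0.5, 1.0], [1.0, 1.0, 1.0], 0.1)
ascrc_total = sum(weights)                  # total weight over the chosen pool
```

In both cases the total weight equals $C$ times the number of pairs the setting accounts for ($B$ for WSVM, $|L^*|$ for the toy pool), while pairs with a smaller $p_{ij}$ receive a larger share.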

**4.2** **Performance Comparison and Robustness**

Next, we examine the necessity of three key designs within the active sampling framework: soft-version versus hard-version, sampling bias correction within soft-version of active sampling, and the choice of soft-version value functions. We first set $\gamma = 1$ in ASCRC and random-CRC, which makes ASCRC equivalent to ASRankSVM. We let $b = 100$ and $B = 8000$, which is a relatively small budget out of the millions of pairs. We will study the effect of a larger budget in Section 4.4 and the effect of using different $\gamma$ in the more general ASCRC in Section 4.5.

4.2.1 Soft-Version versus Hard-Version

We will discuss the time difference between the soft- and hard-versions of sampling in Table 6 of Section 4.3. The soft-versions are both coupled with bias correction. Intuitively, the soft version is much faster than the hard version. Here we examine the performance difference between the two versions first. In Table 2, we compare the soft- and hard-versions of closeness and correctness sampling under a *t*-test at the 95% confidence level. For closeness sampling, the soft version performs better than the hard version on 9 data sets and ties on 3; for correctness sampling, the soft version performs better than the hard version on 12 data sets and ties on 1. The results justify that the soft version is a better choice than the hard version in terms of AUC performance.

Fig. 1 further shows how the AUC changes as $|L^*|$ grows for different versions of sampling, along with the baseline ERankSVM algorithm. We see that hard-correctness-sampling always leads to unsatisfactory performance. One possible reason is that hard-correctness-sampling can easily suffer from sampling the noisy pairs, which come with larger hinge loss. On the other hand, hard-closeness-sampling is competitive with the soft-versions (albeit slower), but appears to be saturating to a less satisfactory model in Fig. 1(b). The saturation corresponds to a known problem of uncertainty sampling in active learning because of the restricted view of the non-perfect model used for sampling [29]. The soft-version, on the other hand, has some probability of escaping from the restricted view, and hence enjoys better performance.

4.2.2 Bias Correction for Soft Version Sampling

Next, we show the AUC difference between doing bias correction (see Section 3.3) and not doing so for soft-version sampling in Table 3. A positive difference indicates that doing bias correction leads to better performance. First of all, we see that the difference made by the bias correction is relatively small. For soft-close sampling, performing bias correction is slightly worse on 12 data sets; for soft-correct sampling, performing bias correction is slightly better on 9 data sets. Note that correctness sampling is inherently more biased towards the noisy pairs, as discussed for hard-version sampling. Thus, performing bias correction can be necessary and helpful in ensuring the stability, as justified by the better performance on those 9 data sets.

4.2.3 Value Functions for Soft Version Sampling

We show how the AUC changes as $|L^*|$ grows throughout the active sampling steps of ASRankSVM in Fig. 5. For WSVM, ERankSVM and CRR, we plot a horizontal line on the AUC achieved when using the whole training set. We also list the final AUC with the standard deviation of all the algorithms in Table 4.

From Fig. 5 and Table 4, we see that soft-correct sampling is generally the best. Further, we conduct the right-tail *t*-test for soft-correct against the others to show whether the improvement of soft-correct sampling is significant. In Table 5, we list the *p*-values of the *t*-test. The results are summarized under a 95% confidence level, which means we say soft-correct performs better when the corresponding *p*-value is less than 0.05. Actually, we can see that most of the *p*-values are much smaller than 0.05, which suggests that the improvement is usually significant.

First, we compare soft-correct with random sampling and discover that soft-correct performs better on 10 data sets and ties on 4, which shows that active sampling is working reasonably well. When comparing soft-close with soft-correct in Table 4 and Table 5, we find that soft-correct outperforms soft-close on 7 data sets and ties on 5. Moreover, Fig. 5 shows that the strong performance of soft-correct comes from the early steps of active sampling. Finally, when comparing soft-correct with other algorithms, we discover that soft-correct performs the best on 8 data sets: it outperforms ERankSVM on 8 data sets, WSVM on 9 data sets, and CRR on 11 data sets. The results demonstrate that even when using a pretty small sampling budget of 8,000 pairs, ASRankSVM with soft-correct sampling can achieve significant improvement over those state-of-the-art ranking algorithms that use the whole training data set. Also, the tiny standard deviation shown in Table 4 and the significant results from the *t*-test suggest the robustness of ASRankSVM with soft-correct in general.

Nevertheless, we observe a potential problem of soft-correct sampling from Fig. 5. In data sets letter and mnist, the performance of soft-correct increases faster than soft-close in the beginning, but starts dropping in the middle. The possible reason, similar to the hard-version sampling, is the existence of noisy pairs that should not be put into the chosen pool. When sampling more pairs, the probability that some noisy pairs (which come with larger hinge loss) are sampled by soft-correct sampling is higher, which can in turn lead to degraded performance. The results suggest a possible future work in combining the benefits of soft-close and soft-correct sampling to be more noise-tolerant.

**4.3** **Efficiency Comparison**

First, we study the efficiency of soft active sampling by checking the average number of rejected samples before passing the probability threshold during rejection sampling. The number is plotted against the size of $L^*$ in Fig. 6. The soft-close strategy usually needs fewer than 10 rejected samples, while the soft-correct strategy generally needs an increasing number of rejected samples. The reason is that when the ranking performance becomes better throughout the iterations, the probability threshold behind soft-correct could be pretty small. The results suggest that the soft-close strategy is generally efficient, while the soft-correct strategy may be less efficient as $|L^*|$ grows.
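The growth in rejections for soft-correct can be illustrated with a short calculation. With uniform proposals, each proposal is accepted with probability equal to the average $p_{ij}$ over the pool, so the number of rejections before an acceptance is geometric with mean $(1 - a)/a$ for acceptance rate $a$. The margins below are made up to mimic an early and a late model; they are not taken from the experiments, and the function names are hypothetical.

```python
import math


def p_soft_correct(margin):
    """p_ij = 1 - 2 / (1 + exp([1 - w^T x_ij]_+))."""
    hinge = max(0.0, 1.0 - margin)
    return 1.0 - 2.0 / (1.0 + math.exp(hinge))


def expected_rejections(probs):
    """Mean number of rejected proposals before one acceptance (uniform proposals)."""
    accept_rate = sum(probs) / len(probs)
    if accept_rate == 0.0:
        return float("inf")  # no pair can ever be accepted
    return (1.0 - accept_rate) / accept_rate


# Hypothetical margins w^T x_ij over a tiny pool.
early_margins = [-1.0, 0.0, 0.5, 2.0]   # many pairs still violate the margin
late_margins = [0.9, 1.5, 2.0, 3.0]     # almost all pairs satisfy the margin

early = expected_rejections([p_soft_correct(m) for m in early_margins])
late = expected_rejections([p_soft_correct(m) for m in late_margins])
print(early, late)  # the late model needs far more proposals per accepted pair
```

As more pairs reach zero hinge loss, their soft-correct probability drops to zero and the acceptance rate shrinks accordingly.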

Next, we list the CPU time consumed by all algorithms under a budget of 8,000 pairs in Table 6, where the data sets are ordered by size in ascending order. We can see that WSVM and CRR run fast but give inferior performance; ERankSVM performs better, but its training time grows fast as the data size increases. The result is consistent with the discussion in Section 1 that conducting bipartite ranking efficiently and accurately at the same time is challenging.

For ASRankSVM, random runs the fastest, then soft-close, and soft-correct is the slowest. The results reflect the average number of rejected samples discussed above. In addition, not surprisingly, the soft-version samplings are usually much faster than the corresponding hard versions, which validates that the time-consuming enumerating or sorting steps do not fit our goal in terms of efficiency.

More importantly, when comparing soft-correct with ERankSVM, soft-correct runs faster on 7 data sets, which suggests ASRankSVM is as efficient as the state-of-the-art ERankSVM on large-scale data sets in general. Moreover, the CPU time of soft-correct grows much more slowly than that of ERankSVM as the data size increases, because the time complexity of ASRankSVM mainly depends on the budget $B$ and the step size $b$, not the size of the data.

**4.4** **The Usefulness of Larger Budget**

From the previous experiments, we have shown that ASRankSVM with a budget of 8,000 pairs can perform better than other competitors on large-scale data sets. Now, we check the performance of ASRankSVM with different budget sizes. In Fig. 2, we show the AUC curves with much larger budgets on two data sets. We find that the performance of ASRankSVM can be improved or maintained as the budget size increases. For example, in data set protein, we can match the performance of WSVM with around 40,000 pairs and surpass it slightly with around 80,000 pairs. Nevertheless, in most data sets, we find that the AUC curves flatten around 10,000 pairs and eventually converge as the budget increases. That is, increasing the budget in ASRankSVM leads to consistent but marginal improvements.

Note that the potential problem of sampling noisy pairs within the soft-correct sampling can be more serious when the budget size increases. Fig. 3 illustrates the problem, where the performance of soft-correct degrades and many more pairs are rejected in the later sampling iterations. On the other hand, soft-close maintains the robustness and the efficiency as the budget increases, and improves the performance consistently throughout the iterations. Thus, if a larger budget is used, soft-close can be a better choice than soft-correct.

**4.5** **The Usefulness of the CRC Framework**

Next, we study the necessity of the CRC framework by comparing the performance of soft-closeness and soft-correctness under different choices of $\gamma$. We report the best $\gamma$ under a 95% confidence level within $\{\text{uniform}, 0.1, 0.2, \ldots, 1.0\}$, where uniform means balancing the influence of actual pairs and pseudo-pairs by $\gamma = \frac{|D_{pair}|}{|D^*|}$. Moreover, we check whether CRC-threshold can be useful. Table 7 shows the best $\gamma$ and formulation for each sampling strategy. The entries with “-thre” indicate CRC-threshold. The bold entries indicate that the setting outperforms ERankSVM significantly. Several important properties can be observed from the table. First, we see that the choice of sampling strategy does not affect the optimal $\gamma$ much: most data sets have similar optimal $\gamma$ for both soft-closeness and soft-correctness sampling. Second, we find that adding a threshold term for CRC can sometimes reach better performance. Last, we see that using $\gamma = 1$ (real pairs only) performs well in most data sets, while a smaller $\gamma$ or uniform can sometimes reach better performance. The results justify that the real pairs are more important than the pseudo-pairs, while the latter can sometimes be helpful. When pseudo-pairs help, as shown in Fig. 4 for the mnist data set, the flexibility of the CRC framework can be useful.

**4.6** **Experiment Result Summary**

In summary, a special case of the proposed ASCRC algorithm that only samples actual pairs (ASRankSVM) works reasonably well for a budget of 8,000 when coupled with soft-correct sampling. The setting significantly outperforms WSVM, ERankSVM, CRR and soft-close on most of the data sets, and the execution time shows that the efficiency of soft-correct sampling is comparable with that of ERankSVM. The drawbacks of soft-correct sampling are that it becomes increasingly difficult to pass rejection sampling and that it is more sensitive to noisy instances than soft-close sampling. While $\gamma = 1$ leads to promising performance on most of the data sets, further tuning with a smaller $\gamma$ or adding a threshold term helps on some data sets. Moreover, using a budget size around or larger than the training size with soft-close sampling may also help on some data sets such as protein. The results validate the usefulness of active sampling (with soft-correct) as well as CRC (with a flexible $\gamma$).

**5** **CONCLUSION**

We propose the algorithm of Active Sampling (AS) under Combined Ranking and Classification (CRC) based on the linear SVM. There are two major components of the proposed algorithm. The AS scheme selects valuable pairs for training and resolves the computational burden in large-scale bipartite ranking. The CRC framework unifies the concept of point-wise ranking and pair-wise ranking under the same framework, and can perform better than pure point-wise ranking or pair-wise ranking. The unified view of pairs and points (pseudo-pairs) in CRC allows using one AS scheme to select from both types of pairs.

Experiments on 14 real-world large-scale data sets demonstrate the promising performance and efficiency of the ASRankSVM and ASCRC algorithms. The algorithms usually outperform state-of-the-art bipartite ranking algorithms, including the point-wise SVM, the pair-wise SVM, and the combined ranking and regression approach. The results not only justify the validity of ASCRC, but also show that the valuable pairs or pseudo-pairs can be helpful for large-scale bipartite ranking.

As future work, we will consider using other common loss functions, like exponential loss or logistic loss, instead of the hinge loss that we discussed. Another interesting direction is adopting more sophisticated active learning algorithms rather than the simple uncertainty or error reduction strategies, but maintaining the efficiency is still challenging.

**REFERENCES**

[1] S. Agarwal and D. Roth. *A study of the bipartite ranking problem in machine learning*. University of Illinois at Urbana-Champaign, 2005.

[2] N. Ailon. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. *The Journal of Machine Learning Research*, 13:137–164, 2012.

[3] N. Ailon and M. Mohri. An efficient reduction of ranking to classification. *arXiv preprint arXiv:0710.2889*, 2007.

[4] M.-F. Balcan, N. Bansal, A. Beygelzimer, D. Coppersmith, J. Langford, and G. B. Sorkin. Robust reductions from ranking to classification. *Machine learning*, 72(1-2):139–153, 2008.

[5] E. B. Baum and F. Wilczek. Supervised learning of probability distributions by neural networks. In *Neural Information Processing Systems*, pages 52–61, 1988.

[6] U. Brefeld and T. Scheffer. AUC maximizing support vector learning. In *International Conference on Machine learning Workshop on ROC Analysis in Machine Learning*, 2005.

[7] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In *International Conference on Machine learning*, pages 89–96, 2005.

[8] C. J. Burges, Q. V. Le, and R. Ragno. Learning to rank with nonsmooth cost functions. *Neural Information Processing Systems*, 19:193–200, 2007.

[9] R. Caruana, T. Joachims, and L. Backstrom. KDD-Cup 2004: results and analysis. *ACM SIGKDD Explorations Newsletter*, 6(2):95–108, 2004.

[10] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. *ACM Transactions on Intelligent Systems and Technology*, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[11] S. Clemençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. *The Annals of Statistics*, 36(2):844–874, 2008.

[12] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. *Neural Information Processing Systems*, 16(16):313–320, 2004.

[13] P. Donmez and J. G. Carbonell. Optimizing estimated loss reduction for active sampling in rank learning. In *International Conference on Machine learning*, pages 248–255, 2008.

[14] P. Donmez and J. G. Carbonell. Active sampling for rank learning via optimizing the area under the ROC curve. *Advances in Information Retrieval*, pages 78–89, 2009.