## Cost-aware Pre-training for Multiclass Cost-sensitive Deep Learning

### Yu-An Chung Department of CSIE National Taiwan University

### b01902040@ntu.edu.tw

### Hsuan-Tien Lin Department of CSIE National Taiwan University

### htlin@csie.ntu.edu.tw

### Shao-Wen Yang Intel Labs Intel Corporation shao-wen.yang@intel.com

### Abstract

Deep learning has been one of the most prominent machine learning techniques nowadays, being the state-of-the-art on a broad range of applications where automatic feature extraction is needed. Many such applications also demand varying costs for different types of mis-classification errors, but it is not clear whether or how such cost information can be incorporated into deep learning to improve performance. In this work, we first design a novel loss function that embeds the cost information for the training stage of cost-sensitive deep learning. We then show that the loss function can also be integrated into the pre-training stage to conduct cost-aware feature extraction more effectively. Extensive experimental results justify the validity of the novel loss function for making existing deep learning models cost-sensitive, and demonstrate that our proposed model with cost-aware pre-training and training outperforms non-deep models and other deep models that digest the cost information in other stages.

### 1 Introduction

In many real-world machine learning applications [Tan, 1993; Chan and Stolfo, 1998; Fan et al., 2000; Zhang and Zhou, 2010; Jan et al., 2011], classification errors may come with different costs; namely, some types of mis-classification errors may be (much) worse than others. For instance, when classifying bacteria [Jan et al., 2011], the cost of classifying a Gram-positive species as a Gram-negative one should be higher than the cost of classifying the species as another Gram-positive one because of the consequence on treatment effectiveness. Different costs are also useful for building a realistic face recognition system, where a government staff member being mis-recognized as an impostor causes only a little inconvenience, but an impostor mis-recognized as a staff member can result in serious damage [Zhang and Zhou, 2010]. It is thus important to take into account the de facto cost of every type of error rather than only measuring the error rate and penalizing all types of errors equally.

The classification problem that mandates the learning algorithm to consider the cost information is called cost-sensitive classification. Amongst cost-sensitive classification algorithms, the binary classification ones [Elkan, 2001; Zadrozny et al., 2003] are somewhat mature, with re-weighting the training examples [Zadrozny et al., 2003] being one major approach, while the multiclass classification ones are continuing to attract research attention [Domingos, 1999; Margineantu, 2001; Abe et al., 2004; Tu and Lin, 2010].

This work focuses on multiclass cost-sensitive classification, whose algorithms can be grouped into three categories [Abe et al., 2004]. The first category makes the prediction procedure cost-sensitive [Kukar and Kononenko, 1998; Domingos, 1999; Zadrozny and Elkan, 2001], generally done by equipping probabilistic classifiers with Bayes decision theory. The major drawback is that probability estimates can often be inaccurate, which in turn makes cost-sensitive performance unsatisfactory. The second category makes the training procedure cost-sensitive, which is often done by transforming the training examples according to the cost information [Chan and Stolfo, 1998; Domingos, 1999; Zadrozny et al., 2003; Beygelzimer et al., 2005; Langford and Beygelzimer, 2005]. However, the transformation step cannot take the particularities of the underlying classification model into account and thus sometimes has room for improvement. The third category specifically extends one particular classification model to be cost-sensitive, such as support vector machine [Tu and Lin, 2010] or neural network [Kukar and Kononenko, 1998; Zhou and Liu, 2006]. Given that deep learning stands as an important class of models with its special properties to be discussed below, we aim to design cost-sensitive deep learning algorithms within the third category while borrowing ideas from other categories.

Deep learning models, or neural networks with deep architectures, are gaining increasing research attention in recent years. Training a deep neural network efficiently and effectively, however, comes with many challenges, and different models deal with the challenges differently. For instance, conventional fully-connected deep neural networks (DNN) generally initialize the network with an unsupervised pre-training stage before the actual training stage to avoid being trapped in a bad local minimum, and the unsupervised pre-training stage has been successfully carried out by stacked auto-encoders [Vincent et al., 2010; Krizhevsky and Hinton, 2011; Baldi, 2012]. Deep belief networks [Hinton et al., 2006; Le Roux and Bengio, 2008] shape the network as a generative model and commonly take restricted Boltzmann machines [Le Roux and Bengio, 2008] for pre-training. Convolutional neural networks (CNN) [LeCun et al., 1998] mimic the visual perception process of humans based on special network structures that result in less need for pre-training, and are considered the most effective deep learning models in tasks like image or speech recognition [Ciresan et al., 2011; Krizhevsky et al., 2012; Abdel-Hamid et al., 2014].

While some existing works have studied cost-sensitive neural networks [Kukar and Kononenko, 1998; Zhou and Liu, 2006], none of them have focused on cost-sensitive deep learning to the best of our knowledge. That is, we are the first to present cost-sensitive deep learning algorithms, with the hope of making deep learning more realistic for applications like bacteria classification and face recognition. In Section 2, we first formalize the cost-sensitive classification problem and review related deep learning works. Then, in Section 3, we start with a baseline algorithm that makes the prediction procedure cost-sensitive (first category). The features extracted from the training procedure of such an algorithm, however, are cost-blind. We then initiate a pioneering study on how the cost information can be digested in the training procedure (second category) of DNN and CNN. We design a novel loss function that matches the needs of neural network training while embedding the cost information. Furthermore, we argue that for DNN pre-trained with stacked auto-encoders, the cost information should be used not only for the training stage, but also for the pre-training stage. We then propose a novel pre-training approach for DNN (third category) that mixes unsupervised pre-training with a cost-aware loss function. Experimental results on deep learning benchmarks and standard cost-sensitive classification settings in Section 4 verify that the proposed algorithm based on cost-sensitive training and cost-aware pre-training indeed yields the best performance, outperforming non-deep models as well as a broad spectrum of deep models that are either cost-insensitive or cost-sensitive in other stages. Finally, we conclude in Section 5.

### 2 Background

We will formalize the multiclass cost-sensitive classification problem before introducing deep learning and related works.

#### 2.1 Multiclass Cost-sensitive Classification

We first introduce the multiclass classification problem and then extend it to the cost-sensitive setting. The $K$-class classification problem comes with a size-$N$ training set $S = \{(x_n, y_n)\}_{n=1}^{N}$, where each input vector $x_n$ is within an input space $\mathcal{X}$, and each label $y_n$ is within a label space $\mathcal{Y} = \{1, 2, ..., K\}$. The goal of multiclass classification is to train a classifier $g \colon \mathcal{X} \to \mathcal{Y}$ such that the expected error $[\![ y \neq g(x) ]\!]$ on test examples $(x, y)$ is small.^{1}

^{1} $[\![ \cdot ]\!]$ is 1 when the inner condition is true, and 0 otherwise.

Multiclass cost-sensitive classification extends multiclass classification by penalizing each type of mis-classification error differently based on some given costs. Specifically, consider a $K$ by $K$ cost matrix $C$, where each entry $C(y, k) \in [0, \infty)$ denotes the cost for predicting a class-$y$ example as class $k$ and naturally $C(y, y) = 0$. The goal of cost-sensitive classification is to train a classifier $g$ such that the expected cost $C(y, g(x))$ on test examples is small.

The cost-matrix setting is also called cost-sensitive classification with class-dependent costs. Another popular setting is to consider example-dependent costs, which means coupling an additional cost vector $c \in [0, \infty)^{K}$ with each example $(x, y)$, where the $k$-th component $c[k]$ denotes the cost for classifying $x$ as class $k$. During training, each $c_n$ that accompanies $(x_n, y_n)$ is also fed to the learning algorithm to train a classifier $g$ such that the expected cost $c[g(x)]$ is small with respect to the distribution that generates $(x, y, c)$ tuples. The cost-matrix setting can be cast as a special case of the cost-vector setting by defining the cost vector in $(x, y, c)$ as row $y$ of the cost matrix $C$. In this work, we will eventually propose a cost-sensitive deep learning algorithm that works under the more general cost-vector setting.
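The reduction from the cost-matrix setting to the cost-vector setting can be illustrated with a few lines of NumPy. This is only a sketch: the function names, the 0-based labels, and the toy cost matrix are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def cost_vectors_from_matrix(C, y):
    """Give each example (x_n, y_n) the cost vector c_n = row y_n of C."""
    return np.asarray(C, dtype=float)[np.asarray(y, dtype=int)]

def average_cost(cost_vectors, predictions):
    """Empirical counterpart of the expected cost: mean of c_n[g(x_n)]."""
    predictions = np.asarray(predictions, dtype=int)
    return float(cost_vectors[np.arange(len(predictions)), predictions].mean())

# Hypothetical 3-class cost matrix with a zero diagonal; labels are 0-based here.
C = np.array([[0.0, 1.0, 5.0],
              [2.0, 0.0, 1.0],
              [4.0, 3.0, 0.0]])
y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 2, 2, 0])   # stand-in for the predictions g(x_n)

cost_vectors = cost_vectors_from_matrix(C, y_true)   # shape (N, K)
avg = average_cost(cost_vectors, y_pred)             # (0 + 1 + 0 + 4) / 4
```

Once every example carries its own cost vector, an algorithm written for the cost-vector setting never needs to look at $C$ again.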

#### 2.2 Neural Network and Deep Learning

There are many deep learning models that are successful for different applications nowadays [Lee et al., 2009; Krizhevsky and Hinton, 2011; Ciresan et al., 2011; Krizhevsky et al., 2012; Simonyan and Zisserman, 2014]. In this work, we first study the fully-connected deep neural network (DNN) for multiclass classification as a starting point of making deep learning cost-sensitive. The DNN consists of $H$ hidden layers and parameterizes each layer $i \in \{1, 2, ..., H\}$ by $\theta_i = \{W_i, b_i\}$, where $W_i$ is a fully-connected weight matrix and $b_i$ is a bias vector that enter the neurons. That is, the weight matrix and bias vector applied on the input are stored within $\theta_1 = \{W_1, b_1\}$. For an input feature vector $x$, the $H$ hidden layers of the DNN describe a complex feature transform function by computing $\phi(x) = s(W_H \cdot s(\cdots s(W_2 \cdot s(W_1 \cdot x + b_1) + b_2) \cdots) + b_H)$, where $s(z) = \frac{1}{1 + \exp(-z)}$ is the component-wise logistic function. Then, to perform multiclass classification, an extra softmax layer, parameterized by $\theta_{sm} = \{W_{sm}, b_{sm}\}$, is placed after the $H$-th hidden layer. There are $K$ neurons in the softmax layer, where the $j$-th neuron comes with weights $W_{sm}^{(j)}$ and bias $b_{sm}^{(j)}$ and is responsible for estimating the probability of class $j$ given $x$:

$$P(y = j \mid x) = \frac{\exp(\phi(x)^{T} W_{sm}^{(j)} + b_{sm}^{(j)})}{\sum_{k=1}^{K} \exp(\phi(x)^{T} W_{sm}^{(k)} + b_{sm}^{(k)})}. \tag{1}$$

Based on the probability estimates, the classifier trained from the DNN is naturally $g(x) = \operatorname{argmax}_{1 \le k \le K} P(y = k \mid x)$.
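The forward pass described above, the feature transform $\phi(x)$ followed by the softmax layer of (1), can be sketched in NumPy as follows. The layer sizes, the random weights, and the choice to store $W_{sm}^{(j)}$ as row $j$ of a matrix are assumptions for illustration only.

```python
import numpy as np

def logistic(z):
    """Component-wise logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def feature_transform(x, hidden_params):
    """phi(x): apply s(W_i h + b_i) for each hidden layer i = 1, ..., H."""
    h = x
    for W, b in hidden_params:
        h = logistic(W @ h + b)
    return h

def softmax_probabilities(phi, W_sm, b_sm):
    """Equation (1); row j of W_sm plays the role of W_sm^{(j)} (an assumption)."""
    scores = W_sm @ phi + b_sm
    scores = scores - scores.max()   # shift for numerical stability; ratios unchanged
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
d, K = 4, 3                                    # toy input dimension and class count
hidden_params = [(rng.normal(size=(5, d)), np.zeros(5)),
                 (rng.normal(size=(5, 5)), np.zeros(5))]   # H = 2 hidden layers
W_sm, b_sm = rng.normal(size=(K, 5)), np.zeros(K)

x = rng.uniform(size=d)
probs = softmax_probabilities(feature_transform(x, hidden_params), W_sm, b_sm)
g_x = int(np.argmax(probs))                    # cost-blind prediction
```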

Traditionally, the parameters $\{\{\theta_i\}_{i=1}^{H}, \theta_{sm}\}$ of the DNN are optimized by the back-propagation algorithm, which is essentially gradient descent, with respect to the negative log-likelihood loss function over the training set $S$:

$$L_{NLL}(S) = \sum_{n=1}^{N} -\ln(P(y = y_n \mid x_n)). \tag{2}$$

The strength of the DNN, through multiple layers of non-linear transforms, is to extract sophisticated features automatically and implement complex functions. However, the training of the DNN is non-trivial because of non-convex optimization and gradient diffusion problems, which degrade the test performance of the DNN when adding too many layers. [Hinton et al., 2006] first proposed a greedy layer-wise pre-training approach to solve the problem. The layer-wise pre-training approach performs a series of feature extraction steps from the bottom (input layer) to the top (last hidden layer) to capture higher level representations of original features along the network propagation.

In this work, we shall improve a classical yet effective unsupervised pre-training strategy, stacked denoising auto-encoders [Vincent et al., 2010], for the DNN. The denoising auto-encoder (DAE) is an extension of the regular auto-encoder. An auto-encoder is essentially a (shallow) neural network with one hidden layer, and consists of two parameter sets: $\{W, b\}$ for mapping the (normalized) input vector $x \in [0, 1]^{d}$ to the $d'$-dimensional latent representation $h$ by $h = s(W \cdot x + b) \in [0, 1]^{d'}$; $\{W', b'\}$ for reconstructing an input vector $\tilde{x}$ from $h$ by $\tilde{x} = s(W' \cdot h + b')$. The auto-encoder is trained by minimizing the total cross-entropy loss $L_{CE}(S)$ over $S$, defined as

$$L_{CE}(S) = -\sum_{n=1}^{N} \sum_{j=1}^{d} \Big( x_n[j] \ln \tilde{x}_n[j] + (1 - x_n[j]) \ln(1 - \tilde{x}_n[j]) \Big), \tag{3}$$

where $x_n[j]$ denotes the $j$-th component of $x_n$ and $\tilde{x}_n[j]$ is the corresponding reconstructed value.

The DAE extends the regular auto-encoder by randomly adding noise to inputs $x_n$ before mapping to the latent representation, such as randomly setting some components of $x_n$ to 0. Several DAEs can then be stacked to form a deep network, where each layer receives its input from the latent representation of the previous layer. For the DNN, initializing with stacked DAEs is known to perform better than initializing with stacked regular auto-encoders [Vincent et al., 2010] or initializing randomly. Below we will refer to the DNN initialized with stacked DAEs and trained (fine-tuned) by back-propagation with (2) as the SDAE, while restricting the DNN to mean the model that is initialized randomly and trained with (2).
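A single DAE evaluation for one example can be sketched as below: corrupt the input with masking noise, encode, decode, and score the reconstruction against the clean input with the cross-entropy of (3). The noise level, the layer sizes, the random weights, and the clipping constant `eps` are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def corrupt(x, drop_prob, rng):
    """Masking noise: set each component of x to 0 with probability drop_prob."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

def dae_reconstruct(x_noisy, W, b, W_prime, b_prime):
    """h = s(W x + b), then x_tilde = s(W' h + b')."""
    h = logistic(W @ x_noisy + b)
    x_tilde = logistic(W_prime @ h + b_prime)
    return h, x_tilde

def cross_entropy(x, x_tilde, eps=1e-12):
    """One example's term of (3); clipping avoids log(0) and is an added safeguard."""
    x_tilde = np.clip(x_tilde, eps, 1.0 - eps)
    return float(-np.sum(x * np.log(x_tilde) + (1.0 - x) * np.log(1.0 - x_tilde)))

rng = np.random.default_rng(0)
d, d_hidden = 6, 3
W, b = rng.normal(scale=0.1, size=(d_hidden, d)), np.zeros(d_hidden)
W_prime, b_prime = rng.normal(scale=0.1, size=(d, d_hidden)), np.zeros(d)

x = rng.uniform(size=d)                         # normalized input in [0, 1]^d
x_noisy = corrupt(x, drop_prob=0.3, rng=rng)
h, x_tilde = dae_reconstruct(x_noisy, W, b, W_prime, b_prime)
loss = cross_entropy(x, x_tilde)                # compared against the clean x
```

Stacking then amounts to feeding `h` of one trained DAE as the input of the next one.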

In this work, we will also extend another popular deep learning model, the convolutional neural network (CNN), for cost-sensitive classification. The CNN is based on a locally-connected network structure that mimics the visual perception process [LeCun et al., 1998]. We will consider a standard CNN structure specified in Caffe^{2} [Jia et al., 2014], which generally does not rely on a pre-training stage. Similar to the DNN, we consider the CNN with a softmax layer for multiclass classification.

#### 2.3 Cost-sensitive Neural Network

Few existing works have studied cost-sensitive classification using neural networks [Kukar and Kononenko, 1998; Zhou and Liu, 2006]. [Zhou and Liu, 2006] focused on studying the effect of sampling and threshold-moving to tackle the class imbalance problem using a neural network as a core classifier rather than proposing general cost-sensitive neural network algorithms. [Kukar and Kononenko, 1998] proposed four approaches of modifying neural networks for cost-sensitivity. The first two approaches train a usual multiclass classification neural network, and then make the prediction stage of the trained network cost-sensitive by including the costs in the prediction formula; the third approach modifies the learning rate of the training algorithm based on the costs; the fourth approach, called MIN (minimization of the mis-classification costs), modifies the loss function of neural network training directly. Among the four proposed algorithms, MIN consistently achieves the lowest test cost [Kukar and Kononenko, 1998] and will be taken as one of our competitors. Nevertheless, none of the existing works, to the best of our knowledge, have conducted careful study on cost-sensitive algorithms for deep neural networks.

^{2} https://github.com/BVLC/caffe/tree/master/examples/cifar10

### 3 Cost-sensitive Deep Learning

Before we start describing our proposed algorithm, we highlight a naïve algorithm. For the DNN/SDAE/CNN that estimate the probability with (1), when given the full picture of the cost matrix, a cost-sensitive prediction can be obtained using Bayes optimal decision, which computes the expected cost of classifying an input vector $x$ to each class and predicts the label that reaches the lowest expected cost:

$$g(x) = \operatorname*{argmin}_{1 \le k \le K} \sum_{y=1}^{K} P(y \mid x)\, C(y, k). \tag{4}$$

We will denote these algorithms as DNN_{Bayes}, SDAE_{Bayes} and CNN_{Bayes}, respectively. These algorithms include the costs in neither the pre-training nor the training stage. Also, these algorithms require knowing the full cost matrix, and cannot work under the cost-vector setting.
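The decision rule (4) is a single matrix-vector product followed by an argmin. The sketch below uses a made-up cost matrix and made-up probability estimates (both assumptions) to show how the cost-sensitive prediction can differ from the most probable class.

```python
import numpy as np

def bayes_optimal_prediction(probs, C):
    """Equation (4): expected_cost[k] = sum_y P(y|x) * C(y, k), then argmin over k."""
    expected_cost = probs @ C
    return int(np.argmin(expected_cost)), expected_cost

# Hypothetical costs: predicting class 0 for a class-1 or class-2 example is expensive.
C = np.array([[0.0, 1.0, 1.0],
              [10.0, 0.0, 1.0],
              [10.0, 1.0, 0.0]])
probs = np.array([0.6, 0.3, 0.1])     # stand-in for the softmax outputs of (1)

prediction, expected_cost = bayes_optimal_prediction(probs, C)
most_probable = int(np.argmax(probs))
```

Here the expected costs are 4.0, 0.7 and 0.9, so the rule predicts class 1 even though class 0 is the most probable one.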

#### 3.1 Cost-sensitive Training

The DNN essentially decomposes the multiclass classification problem to per-class probability estimation problems via the well-known one-versus-all (OVA) decomposition. [Tu and Lin, 2010] proposed the one-sided regression algorithm that extends OVA for support vector machine (SVM) to a cost-sensitive SVM by considering per-class regression problems. In particular, if regressors $r_k(x) \approx c[k]$ can be learned properly, a reasonable prediction can be made by

$$g_r(x) \equiv \operatorname*{argmin}_{1 \le k \le K} r_k(x). \tag{5}$$

[Tu and Lin, 2010] further argued that the loss function of the regressor $r_k$ with respect to $c[k]$ should be one-sided. That is, $r_k(x)$ is allowed to underestimate the smallest cost $c[y]$ and to overestimate other costs. Define $z_{n,k} = 2[\![ c_n[k] = c_n[y_n] ]\!] - 1$ for indicating whether $c_n[k]$ is the smallest within $c_n$. The cost-sensitive SVM [Tu and Lin, 2010] minimizes a regularized version of the total one-sided loss $\xi_{n,k} = \max(z_{n,k} \cdot (r_k(x_n) - c_n[k]), 0)$, where $r_k$ are formed by (kernelized) linear models. With such a design, the cost-sensitive SVM enjoys the following property [Tu and Lin, 2010]:

$$c_n[g_r(x_n)] \le \sum_{k=1}^{K} \xi_{n,k}. \tag{6}$$

That is, an upper bound $\sum_{k=1}^{K} \xi_{n,k}$ of the total cost paid by $g_r$ on $x_n$ is minimized within the cost-sensitive SVM.
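The one-sided loss and the bound (6) can be checked numerically on a toy example. The sketch assumes 0-based labels and a cost vector whose smallest entry is $c[y] = 0$, as in the cost-matrix setting where $C(y, y) = 0$; the regression outputs `r` are made up.

```python
import numpy as np

def one_sided_loss(r, c, y):
    """xi_k = max(z_k * (r_k - c_k), 0) with z_k = +1 if c_k equals c[y], else -1."""
    z = np.where(c == c[y], 1.0, -1.0)
    return np.maximum(z * (r - c), 0.0)

c = np.array([0.0, 2.0, 5.0])   # cost vector of an example with label y = 0
r = np.array([1.5, 1.0, 6.0])   # hypothetical regressor outputs r_k(x)

xi = one_sided_loss(r, c, y=0)
prediction = int(np.argmin(r))  # equation (5)
paid_cost = c[prediction]
upper_bound = xi.sum()
```

Overestimating the smallest cost (by 1.5) and underestimating $c[1]$ (by 1.0) are both penalized, while overestimating $c[2]$ is free. The prediction is class 1 with cost 2.0, which stays below the bound of 2.5.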

If we replace the softmax layer of the DNN or the CNN with regression outputs (using the identity function instead of the logistic one for outputting), we can follow [Tu and Lin, 2010] to make DNN and CNN cost-sensitive by letting each output neuron estimate $c[k]$ as $r_k$ and predicting with (5). The training of the cost-sensitive DNN and CNN can also be done by minimizing the total one-sided loss. Nevertheless, the one-sided loss is not differentiable at some points, and back-propagation (gradient descent) cannot be directly applied. We thus derive a smooth approximation of $\xi_{n,k}$ instead. Note that the new loss function should not only approximate $\xi_{n,k}$ but also be an upper bound of $\xi_{n,k}$ to keep enjoying the bounding property of (6). [Lee and Mangasarian, 2001] has shown a smooth approximation $u + \frac{1}{\alpha} \cdot \ln(1 + \exp(-\alpha u)) \approx \max(u, 0)$ when deriving the smooth SVM. Taking $\alpha = 1$ leads to $\mathrm{LHS} = \ln(1 + \exp(u))$, which is trivially an upper bound of $\max(u, 0)$ because $\ln(1 + \exp(u)) > u$ and $\ln(1 + \exp(u)) > \ln(1) = 0$. Based on the approximation, we define

$$\delta_{n,k} \equiv \ln(1 + \exp(z_{n,k} \cdot (r_k(x_n) - c_n[k]))). \tag{7}$$

$\delta_{n,k}$ is not only a smooth approximation of $\xi_{n,k}$ that enjoys the differentiable property, but also an upper bound of $\xi_{n,k}$ that keeps the bounding property of (6). That is, we can still ensure a small total cost by minimizing the newly defined smooth one-sided regression (SOSR) loss over all examples:

$$L_{SOSR}(S) = \sum_{n=1}^{N} \sum_{k=1}^{K} \delta_{n,k}. \tag{8}$$

We will refer to these algorithms, which replace the softmax layer of the DNN/SDAE/CNN with a regression layer parameterized by $\theta_{SOSR} = \{W_{SOSR}, b_{SOSR}\}$ and minimize (8) with back-propagation, as DNN_{SOSR}, SDAE_{SOSR} and CNN_{SOSR}. These algorithms work with the cost-vector setting. They include costs in the training stage, but not the pre-training stage.
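A possible NumPy rendering of the SOSR loss (8), together with its gradient with respect to the regression outputs, is sketched below. The batch layout (one row per example), the use of `np.logaddexp` for a numerically stable $\ln(1 + \exp(u))$, and the toy numbers are assumptions for illustration; the paper's own implementation is not shown in the text.

```python
import numpy as np

def sosr_loss_and_grad(R, cost_vectors, y):
    """R: (N, K) outputs r_k(x_n); cost_vectors: (N, K); y: (N,) 0-based labels."""
    n = np.arange(len(y))
    # z_{n,k} = +1 where c_n[k] equals c_n[y_n], and -1 elsewhere.
    Z = np.where(cost_vectors == cost_vectors[n, y][:, None], 1.0, -1.0)
    U = Z * (R - cost_vectors)
    delta = np.logaddexp(0.0, U)                   # equation (7), computed stably
    # d delta / d r = z * sigmoid(u); tanh form of the sigmoid avoids overflow.
    grad_R = Z * 0.5 * (1.0 + np.tanh(0.5 * U))
    return float(delta.sum()), grad_R, delta

R = np.array([[1.5, 1.0, 6.0]])
cost_vectors = np.array([[0.0, 2.0, 5.0]])
y = np.array([0])

loss, grad_R, delta = sosr_loss_and_grad(R, cost_vectors, y)
xi = np.maximum(np.where(cost_vectors == 0.0, 1.0, -1.0) * (R - cost_vectors), 0.0)
```

Every entry of `delta` is at least the corresponding one-sided loss in `xi`, which is the upper-bound property used in the text, and `grad_R` is what back-propagation would pass to the regression layer.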

#### 3.2 Cost-aware Pre-training

For multiclass classification, the pre-training stage, either in a totally unsupervised or partially supervised manner [Bengio et al., 2007], has been shown to improve the performance of the DNN and several other deep models [Bengio et al., 2007; Hinton et al., 2006; Erhan et al., 2010]. The reason is that pre-training usually helps initialize a neural network with better weights that prevent the network from getting stuck in poor local minima. In this section, we propose a cost-aware pre-training approach that leads to a novel cost-sensitive deep neural network (CSDNN) algorithm.

CSDNN is designed as an extension of SDAE_{SOSR}. Instead of pre-training with stacked DAEs, CSDNN takes stacked cost-sensitive auto-encoders (CAE) for pre-training. For a given cost-sensitive example $(x, y, c)$, CAE tries not only to denoise and reconstruct the original input $x$ like DAE, but also to digest the cost information by reconstructing the cost vector $c$. That is, in addition to $\{W, b\}$ and $\{W', b'\}$ for DAE, CAE introduces an extra parameter set $\{W'', b''\}$ fed to regression neurons from the hidden representation. Then, we can mix the two loss functions $L_{CE}$ and $L_{SOSR}$ with a balancing coefficient $\beta \in [0, 1]$, yielding the following loss function for CAE over $S$:

$$L_{CAE}(S) = (1 - \beta) \cdot L_{CE}(S) + \beta \cdot L_{SOSR}(S). \tag{9}$$

The mixture step is a widely-used technique for multi-criteria optimization [Hillermeier, 2001], where $\beta$ controls the balance between reconstructing the original input $x$ and the cost vector $c$. A positive $\beta$ makes CAE cost-aware during its feature extraction, while a zero $\beta$ makes CAE degenerate to DAE. Similar to DAEs, CAEs can then be stacked to initialize a deep neural network before the weights are fine-tuned by back-propagation with (8). The resulting algorithm is named CSDNN, which is cost-sensitive in both the pre-training stage (by CAE) and the training stage (by (8)), and can work under the general cost-vector setting. The full algorithm is listed in Algorithm 1.
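For one example, the mixed CAE objective (9) can be sketched as follows: the hidden representation feeds both a reconstruction head $\{W', b'\}$ and a cost-regression head $\{W'', b''\}$, and the two losses are blended by $\beta$. Parameter names such as `W_rec` and `W_cost`, the sizes, the noise level, and the random weights are assumptions chosen for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_loss(x, x_noisy, c, y, params, beta, eps=1e-12):
    """Return ((1 - beta) * CE + beta * SOSR, CE, SOSR) for a single example."""
    W, b, W_rec, b_rec, W_cost, b_cost = params
    h = logistic(W @ x_noisy + b)                          # shared hidden representation
    x_tilde = np.clip(logistic(W_rec @ h + b_rec), eps, 1.0 - eps)
    r = W_cost @ h + b_cost                                # identity-output regression neurons
    ce = -np.sum(x * np.log(x_tilde) + (1.0 - x) * np.log(1.0 - x_tilde))
    z = np.where(c == c[y], 1.0, -1.0)
    sosr = np.sum(np.logaddexp(0.0, z * (r - c)))
    return float((1.0 - beta) * ce + beta * sosr), float(ce), float(sosr)

rng = np.random.default_rng(0)
d, d_hidden, K = 6, 4, 3
params = (rng.normal(scale=0.1, size=(d_hidden, d)), np.zeros(d_hidden),   # {W, b}
          rng.normal(scale=0.1, size=(d, d_hidden)), np.zeros(d),          # {W', b'}
          rng.normal(scale=0.1, size=(K, d_hidden)), np.zeros(K))          # {W'', b''}

x = rng.uniform(size=d)
x_noisy = x * (rng.random(d) >= 0.3)        # masking noise, as in the DAE
c, y = np.array([0.0, 2.0, 5.0]), 0         # hypothetical cost vector and 0-based label

mixed, ce, sosr = cae_loss(x, x_noisy, c, y, params, beta=0.25)
dae_only, _, _ = cae_loss(x, x_noisy, c, y, params, beta=0.0)
cost_only, _, _ = cae_loss(x, x_noisy, c, y, params, beta=1.0)
```

Setting `beta=0.0` recovers the plain DAE objective, and `beta=1.0` fits the cost vectors only.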

Algorithm 1 CSDNN

Input: Cost-sensitive training set $S = \{(x_n, y_n, c_n)\}_{n=1}^{N}$

1: for each hidden layer $\theta_i = \{W_i, b_i\}$ do

2: Learn a CAE by minimizing (9).

3: Take $\{W_i, b_i\}$ of CAE as $\theta_i$.

4: end for

5: Fine-tune the network parameters $\{\{\theta_i\}_{i=1}^{H}, \theta_{SOSR}\}$ by minimizing (8) using back-propagation.

Output: The fine-tuned deep neural network with (5) as $g_r$.

CSDNN is essentially SDAE_{SOSR} with DAEs replaced by CAEs with the hope of more effective cost-aware feature extraction. We can also consider SCAE_{Bayes}, which does the same for SDAE_{Bayes}. The CNN, due to its special network structure, generally does not rely on stacked DAEs for pre-training, and hence cannot be extended by stacked CAEs.
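The data flow of steps 1 to 4 of Algorithm 1 can be sketched as a greedy loop in which each CAE is fitted on the latent representation produced by the previous layer. The sketch deliberately leaves the CAE fitting itself to a hypothetical callable `train_cae`; the `placeholder_train_cae` below only returns random weights so that the loop runs, and is not a training procedure. Step 5 (fine-tuning with (8)) is not shown.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layers(X, costs, labels, layer_sizes, train_cae):
    """Steps 1-4 of Algorithm 1: one CAE per hidden layer, stacked bottom-up.

    train_cae is a hypothetical callable (inputs, costs, labels, n_hidden) -> (W, b)
    that is assumed to minimize (9) and return the encoder parameters of the CAE.
    """
    thetas, inputs = [], X
    for n_hidden in layer_sizes:
        W, b = train_cae(inputs, costs, labels, n_hidden)
        thetas.append((W, b))                    # step 3: keep {W_i, b_i} as theta_i
        inputs = logistic(inputs @ W.T + b)      # latent representation feeds the next CAE
    return thetas, inputs

def placeholder_train_cae(inputs, costs, labels, n_hidden):
    """Stand-in only: random encoder weights instead of actually minimizing (9)."""
    rng = np.random.default_rng(n_hidden)
    return rng.normal(scale=0.1, size=(n_hidden, inputs.shape[1])), np.zeros(n_hidden)

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 8))                    # toy inputs, one row per example
labels = rng.integers(0, 3, size=20)
costs = rng.uniform(0.0, 5.0, size=(20, 3))
costs[np.arange(20), labels] = 0.0               # zero cost for the correct class

thetas, top_features = pretrain_layers(X, costs, labels, [5, 4], placeholder_train_cae)
```

The same cost vectors are passed to every layer, since each CAE reconstructs the costs from its own hidden representation.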

As discussed, DAE is a degenerate case of CAE. Another possible degeneration is to consider CAE with less complete cost information. For instance, a naïve cost vector defined by $\hat{c}_n[k] = [\![ y_n \neq k ]\!]$ encodes the label information (whether the prediction is erroneous with respect to the demanded label) but not the complete cost information. To study whether it is necessary to take the complete cost information into account in CAE, we design two variant algorithms that replace the cost vectors in CAEs with $\hat{c}_n$, which effectively makes those CAEs error-aware. Then, SCAE_{Bayes} becomes SEAE_{Bayes} (with E standing for error); CSDNN becomes SEAE_{SOSR}.
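The error-aware cost vector $\hat{c}_n$ is simply the complement of a one-hot label encoding, as the short sketch below shows. The function name and the 0-based labels are assumptions for illustration.

```python
import numpy as np

def error_cost_vectors(y, K):
    """c_hat_n[k] = 1 if k differs from y_n, and 0 if k equals y_n."""
    y = np.asarray(y, dtype=int)
    return (np.arange(K)[None, :] != y[:, None]).astype(float)

y = np.array([2, 0, 1])
c_hat = error_cost_vectors(y, K=3)   # rows: [1, 1, 0], [0, 1, 1], [1, 0, 1]
```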

### 4 Experiments

In the previous section, we have derived many cost-sensitive deep learning algorithms, each with its own specialty. They can be grouped into two series: those minimizing (2) and predicting with (4) are Bayes-series algorithms (DNN_{Bayes}, SDAE_{Bayes}, SEAE_{Bayes}, SCAE_{Bayes}, and CNN_{Bayes}); those minimizing (8) and predicting with (5) are SOSR-series algorithms (DNN_{SOSR}, SDAE_{SOSR}, SEAE_{SOSR}, CSDNN ≡ SCAE_{SOSR}, and CNN_{SOSR}). Note that the Bayes-series can only be applied to the cost-matrix setting while the SOSR-series can deal with the cost-vector setting. The two series help understand whether it is beneficial to consider the cost information in the training stage.

Within each series, CNN is based on a locally-connected structure, while DNN, SDAE, SEAE and SCAE are fully-connected and differ by how pre-training is conducted, ranging from none, unsupervised, error-aware, to cost-aware. The wide range helps understand the effectiveness of digesting the cost information in the pre-training stage.

Next, the two series will be compared with the blind-series algorithms (DNN_{blind}, SDAE_{blind}, and CNN_{blind}), which are the existing algorithms that do not incorporate the cost information at all, to understand the importance of taking the cost information into account. The two series will also be compared against two baseline algorithms: CSOSR [Tu and Lin, 2010], a non-deep algorithm that our proposed SOSR-series originates from; MIN [Kukar and Kononenko, 1998], a neural-network algorithm that is cost-sensitive in the training stage like the SOSR-series but with a different loss function. The algorithms along with highlights on where the cost information is digested are summarized in Table 1.

#### 4.1 Setup

We conducted experiments on MNIST, bg-img-rot (the hardest variant of MNIST provided in [Larochelle et al., 2007]), SVHN [Netzer et al., 2011], and CIFAR-10 [Krizhevsky and Hinton, 2009]. The first three datasets are digit recognition tasks that aim to classify each image into a digit from 0 to 9 correctly; CIFAR-10 is a well-known image recognition dataset which contains 10 classes such as car, ship and animal. For all four datasets, the training, validation, and testing split follows the source websites; the input vectors in the training set are linearly scaled to [0, 1], and the input vectors in the validation and testing sets are scaled accordingly.

The four datasets are originally collected for multiclass classification and contain no cost information. We adopt the most frequently-used benchmark in cost-sensitive learning, the randomized proportional setup [Abe et al., 2004], to generate the costs. The setup is for the cost-matrix setting. It first generates a $K \times K$ matrix $C$, and sets the diagonal entries $C(y, y)$ to 0 while sampling the non-diagonal entries $C(y, k)$ uniformly from $\left[0, 10 \cdot \frac{|\{n : y_n = k\}|}{|\{n : y_n = y\}|}\right]$. The randomized proportional setup generates the cost information that takes the class distribution of the dataset into account, charging a higher cost (in expectation) for mis-classifying a minority class, and can thus be used to deal with imbalanced classification problems. Note that we take this benchmark cost-matrix setting to give prediction-stage cost-sensitive algorithms like the Bayes-series a fair chance of comparison. We find that the range of the costs can affect the numerical stability of the algorithms, and hence scale all the costs by the maximum value within $C$ during training in our implementation. The reported test results are based on the unscaled $C$.
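The randomized proportional setup can be sketched as below. The function name, the toy label distribution, and the random seed are assumptions; the sketch also assumes 0-based labels and that every class occurs at least once, since the class counts appear in a denominator.

```python
import numpy as np

def randomized_proportional_costs(y, K, rng, scale=10.0):
    """Sample C(y, k) uniformly from [0, scale * |{n: y_n = k}| / |{n: y_n = y}|]."""
    counts = np.bincount(np.asarray(y, dtype=int), minlength=K).astype(float)
    upper = scale * counts[None, :] / counts[:, None]   # upper[y, k]
    C = rng.uniform(0.0, 1.0, size=(K, K)) * upper
    np.fill_diagonal(C, 0.0)                            # correct predictions cost nothing
    return C

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 10 + [2] * 40)            # class 1 is the minority class
C = randomized_proportional_costs(y, K=3, rng=rng)
C_train = C / C.max()                                   # scaled copy used during training
```

Row 1 of `C` (the minority class) has the largest sampling ranges, so mis-classifying its examples is the most expensive in expectation.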

Arguably one of the most important uses of cost-sensitive classification is to deal with imbalanced datasets. Nevertheless, the four datasets above are somewhat balanced, and the randomized proportional setup may generate similar costs for each type of mis-classification error. To better meet the real-world usage scenario, we further conducted experiments to evaluate the algorithms with imbalanced datasets. In particular, for each dataset, we construct a variant dataset by randomly picking four classes and removing 70% of the examples that belong to those four classes. We will name these imbalanced variants as MNIST_{imb}, bg-img-rot_{imb}, SVHN_{imb}, and CIFAR-10_{imb}, respectively.
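The construction of an imbalanced variant can be sketched as follows. The text does not say which split the removal applies to or how the removed examples are drawn, so uniform sampling without replacement within each chosen class is an assumption here, as are the function name and the toy label array.

```python
import numpy as np

def make_imbalanced(y, rng, n_classes_to_shrink=4, removal_fraction=0.7):
    """Return indices of kept examples and the classes that were shrunk."""
    y = np.asarray(y)
    chosen = rng.choice(np.unique(y), size=n_classes_to_shrink, replace=False)
    keep = np.ones(len(y), dtype=bool)
    for k in chosen:
        idx = np.flatnonzero(y == k)
        n_remove = int(round(removal_fraction * len(idx)))
        keep[rng.choice(idx, size=n_remove, replace=False)] = False
    return np.flatnonzero(keep), chosen

rng = np.random.default_rng(0)
y = np.repeat(np.arange(10), 100)           # toy balanced labels: 100 examples per class
kept, chosen = make_imbalanced(y, rng)
kept_counts = np.bincount(y[kept], minlength=10)
```

With 100 examples per class, each of the four chosen classes keeps 30 examples and the other six keep all 100.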

All experiments were conducted using Theano. For algorithms related to DNN and SDAE, we selected the hyperparameters by following [Vincent et al., 2010]. The $\beta$ in (9), needed by SEAE and SCAE algorithms, was selected among {0, 0.05, 0.1, 0.25, 0.4, 0.75, 1}. As mentioned, for CNN, we considered a standard structure in Caffe [Jia et al., 2014].

#### 4.2 Experimental Results

The average test cost of each algorithm along with the standard error is shown in Table 2. The best result^{3} per dataset among all algorithms is highlighted in bold.

Is it necessary to consider costs? DNN_{blind} and SDAE_{blind} performed the worst on almost all the datasets. While CNN_{blind} was slightly better than those two, it never reached the best performance for any dataset. The results indicate the necessity of taking the cost information into account.

Is it necessary to go deep? The two existing cost-sensitive baselines, CSOSR and MIN, often outperformed the cost-blind algorithms, but were usually not competitive with the cost-sensitive deep learning algorithms. The results validate the importance of studying cost-sensitive deep learning.

Is it necessary to incorporate costs during training? SOSR-series models, especially under the imbalanced scenario, generally outperformed their Bayes counterparts. The results demonstrate the usefulness of the proposed (7) and (8) and the importance of incorporating the cost information during the training stage.

Is it necessary to incorporate costs during pre-training? CSDNN outperformed both SEAE_{SOSR} and SDAE_{SOSR}, and SDAE_{SOSR} further outperformed DNN_{SOSR}. The results show that for the fully-connected structure where pre-training is needed, our newly proposed cost-aware pre-training with CAE is indeed helpful in making deep learning cost-sensitive.

Which is better, CNN_{SOSR} or CSDNN? The last two columns in Table 2 show that CSDNN is competitive with CNN_{SOSR}, with both algorithms usually achieving the best performance. CSDNN is slightly better on two datasets. Note that CNNs are known to be powerful for image recognition tasks, which match the datasets that we have used. Hence, it is not surprising that CNN can reach promising performance with our proposed SOSR loss (8). Our efforts not only make CNN cost-sensitive, but also result in the CSDNN algorithm that makes the fully-connected deep neural network cost-sensitive with the help of cost-aware pre-training via CAE.

Is the mixture loss necessary? To have more insights on CAE, we also conducted experiments to evaluate the perfor-

3The selected CSDNN that achieved the test cost listed in Ta- ble 2 is composed of 3 hidden layers, and each hidden layer consists of 3000 neurons.

Table 1: cost-awareness of algorithms (O: cost-aware; E: error-aware; X: cost-blind)

aaa aa

aaa Stage

Algorithm

DNNblind SDAEblind CNNblind CSOSR MIN DNNBayes SDAEBayes SEAEBayes SCAEBayes CNNBayes DNNSOSR SDAESOSR SEAESOSR CSDNN CNNSOSR

pre-training none X none none none none X E O none none X E O none

training X X X O O X X X X X O O O O O

prediction X X X X X O O O O O X X X X X

Table 2: Average test cost

aa aa

aa aa Dataset

Algorithm

DNNblind SDAEblind CNNblind CSOSR MIN DNNBayes SDAEBayes SEAEBayes SCAEBayes CNNBayes DNNSOSR SDAESOSR SEAESOSR CSDNN CNNSOSR

MNIST 0.11 ± 0.00 0.10 ± 0.00 0.10 ± 0.00 0.10 ± 0.00 0.10 ± 0.003 0.10 ± 0.00 0.09 ± 0.00 0.09 ± 0.00 0.09 ± 0.00 0.09 ± 0.00 0.10 ± 0.00 0.09 ± 0.00 0.09 ± 0.00 0.09 ± 0.00 0.08 ± 0.00 bg-img-rot 3.33 ± 0.06 3.28 ± 0.07 3.05 ± 0.07 3.25 ± 0.06 3.02 ± 0.06 2.95 ± 0.07 2.66 ± 0.07 2.85 ± 0.07 2.54 ± 0.07 2.40 ± 0.07 3.21 ± 0.07 2.99 ± 0.07 3.00 ± 0.07 2.34 ± 0.07 2.29 ± 0.07 SVHN 1.58 ± 0.03 1.40 ± 0.03 0.91 ± 0.03 1.17 ± 0.03 1.19 ± 0.03 1.07 ± 0.03 0.93 ± 0.03 0.94 ± 0.03 0.88 ± 0.03 0.85 ± 0.03 1.02 ± 0.03 0.92 ± 0.03 0.99 ± 0.03 0.83 ± 0.03 0.82 ± 0.03 CIFAR-10 3.46 ± 0.04 3.26 ± 0.05 2.51 ± 0.04 3.30 ± 0.04 3.19 ± 0.05 2.80 ± 0.05 2.52 ± 0.05 2.68 ± 0.05 2.38 ± 0.04 2.34 ± 0.05 2.74 ± 0.05 2.48 ± 0.04 2.52 ± 0.05 2.24 ± 0.05 2.25 ± 0.04 MNISTimb 0.32 ± 0.01 0.31 ± 0.01 0.19 ± 0.01 0.26 ± 0.01 0.27 ± 0.01 0.23 ± 0.01 0.20 ± 0.01 0.20 ± 0.01 0.18 ± 0.01 0.18 ± 0.01 0.22 ± 0.01 0.20 ± 0.01 0.19 ± 0.01 0.17 ± 0.01 0.17 ± 0.01 bg-img-rotimb 15.9 ± 0.70 13.8 ± 0.70 5.04 ± 0.67 8.55 ± 0.70 8.40 ± 0.69 7.19 ± 0.69 5.10 ± 0.70 4.95 ± 0.70 4.73 ± 0.70 4.49 ± 0.68 6.89 ± 0.70 4.99 ± 0.69 4.86 ± 0.69 4.16 ± 0.68 4.39 ± 0.69 SVHNimb 1.79 ± 0.01 1.60 ± 0.01 0.31 ± 0.01 1.05 ± 0.01 0.99 ± 0.01 0.53 ± 0.01 0.33 ± 0.01 0.34 ± 0.01 0.29 ± 0.01 0.28 ± 0.01 0.51 ± 0.01 0.31 ± 0.01 0.31 ± 0.01 0.26 ± 0.01 0.28 ± 0.01 CIFAR-10imb 19.1 ± 0.09 17.7 ± 0.09 7.29 ± 0.08 10.1 ± 0.09 11.2 ± 0.09 8.16 ± 0.09 7.48 ± 0.09 7.25 ± 0.08 6.97 ± 0.09 6.81 ± 0.09 7.86 ± 0.08 7.44 ± 0.09 7.14 ± 0.09 6.48 ± 0.09 6.63 ± 0.08

[Figure 1: eight panels (MNIST, bg-img-rot, SVHN, CIFAR-10, MNISTimb, bg-img-rotimb, SVHNimb, CIFAR-10imb), each plotting test cost against β ∈ [0, 1].]

Figure 1: Relation between β and test cost (note that SDAESOSR is the data point with β = 0).

mance of CSDNN for β ∈ [0, 1]. When β = 0, CSDNN degenerates to SDAESOSR; when β = 1, each CAE of CSDNN performs fully cost-aware pre-training to fit the cost vectors. The results are displayed in Figure 1 and show a roughly U-shaped curve, which implies that some intermediate β ∈ [0, 1] that balances the tradeoff between denoising and cost-awareness works best. The results validate the usefulness of the proposed mixture loss (9) for pre-training.
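The role of β can be illustrated with a minimal sketch. It assumes that the mixture loss is a convex combination of a reconstruction term and a cost-aware term, which is consistent with the two extreme cases described above but is not a transcription of (9). The squared-error reconstruction term, the particular smooth one-sided form of the cost term, and all function names and numbers are assumptions made for illustration.

```python
import math


def squared_error(x, x_hat):
    """Reconstruction term (assumed here to be a plain squared error)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def smooth_one_sided_loss(cost_vector, cost_estimates, true_class):
    """Assumed smooth one-sided regression term.

    For classes whose cost equals that of the true class, overestimates are
    penalized (z = +1); for all other classes, underestimates are penalized
    (z = -1).
    """
    total = 0.0
    for c, r in zip(cost_vector, cost_estimates):
        z = 1.0 if c == cost_vector[true_class] else -1.0
        total += math.log1p(math.exp(z * (r - c)))
    return total


def mixture_loss(beta, x, x_hat, cost_vector, cost_estimates, true_class):
    """(1 - beta) * reconstruction term + beta * cost-aware term."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    recon = squared_error(x, x_hat)
    cost = smooth_one_sided_loss(cost_vector, cost_estimates, true_class)
    return (1.0 - beta) * recon + beta * cost


# Hypothetical example: a 2-dimensional input and a 3-class cost vector.
x, x_hat = [0.2, 0.8], [0.1, 0.9]
c, r, y = [0.0, 3.0, 5.0], [0.5, 2.0, 6.0], 0

recon_only = mixture_loss(0.0, x, x_hat, c, r, y)  # reconstruction term only
cost_only = mixture_loss(1.0, x, x_hat, c, r, y)   # cost-aware term only
balanced = mixture_loss(0.5, x, x_hat, c, r, y)    # equal weighting
```

Under this assumed form, sweeping β from 0 to 1 moves the pre-training objective linearly from pure reconstruction to pure cost fitting, which is the tradeoff that Figure 1 explores.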

### 5 Conclusion

We proposed CSDNN, a novel deep learning algorithm for multiclass cost-sensitive classification. CSDNN was extensively compared with existing baselines and other alternatives within the blind-series, the Bayes-series and the SOSR-series to validate the importance of each of its components. The experimental results demonstrate that incorporating the cost information into both the pre-training and the training stages leads to promising performance of CSDNN, which outperforms those baselines and alternatives. One key component of CSDNN, namely the SOSR loss for cost-sensitivity in the training stage, is also shown to improve the performance of CNN. The results justify the importance of the proposed SOSR loss for training and the CAE approach for pre-training.

### 6 Acknowledgement

We thank the anonymous reviewers for valuable suggestions. This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under award number FA2386-15-1-4012, and by the Ministry of Science and Technology of Taiwan under number MOST 103-2221-E-002-149-MY3.

### References

[Abdel-Hamid et al., 2014] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 22(10):1533–1545, 2014.

[Abe et al., 2004] Naoki Abe, Bianca Zadrozny, and John Langford. An iterative method for multi-class cost-sensitive learning. In KDD, 2004.

[Baldi, 2012] Pierre Baldi. Autoencoders, unsupervised learning, and deep architectures. Unsupervised and Transfer Learning Challenges in Machine Learning, 2012.

[Bengio et al., 2007] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2007.

[Beygelzimer et al., 2005] Alina Beygelzimer, Varsha Dani, Tom Hayes, John Langford, and Bianca Zadrozny. Error limiting reductions between classification tasks. In ICML, 2005.

[Chan and Stolfo, 1998] Philip K. Chan and Salvatore J. Stolfo. Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In KDD, 1998.

[Ciresan et al., 2011] Dan C Ciresan, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In IJCAI, 2011.

[Domingos, 1999] Pedro Domingos. Metacost: A general method for making classifiers cost-sensitive. In KDD, 1999.

[Elkan, 2001] Charles Elkan. The foundations of cost-sensitive learning. In IJCAI, 2001.

[Erhan et al., 2010] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? JMLR, 11:625–660, 2010.

[Fan et al., 2000] Wei Fan, Wenke Lee, Salvatore J. Stolfo, and Matthew Miller. A multiple model cost-sensitive approach for intrusion detection. In ECML, 2000.

[Hillermeier, 2001] Claus Hillermeier. Nonlinear multiobjective optimization. Birkhauser, 2001.

[Hinton et al., 2006] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

[Jan et al., 2011] Te-Kang Jan, Hsuan-Tien Lin, Hsin-Pai Chen, Tsung-Chen Chern, Chung-Yueh Huang, Bing-Cheng Wen, Chia-Wen Chung, Yung-Jui Li, Ya-Ching Chuang, Li-Li Li, Yu-Jiun Chan, Juen-Kai Wang, Yuh-Lin Wang, Chi-Hung Lin, and Da-Wei Wang. Cost-sensitive classification on pathogen species of bacterial meningitis by Surface Enhanced Raman Scattering. In BIBM, 2011.

[Jia et al., 2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[Krizhevsky and Hinton, 2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.

[Krizhevsky and Hinton, 2011] Alex Krizhevsky and Geoffrey E. Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[Kukar and Kononenko, 1998] Matjaz Kukar and Igor Kononenko. Cost-sensitive learning with neural networks. In ECAI, 1998.

[Langford and Beygelzimer, 2005] John Langford and Alina Beygelzimer. Sensitive error correcting output codes. In COLT, 2005.

[Larochelle et al., 2007] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.

[Le Roux and Bengio, 2008] Nicolas Le Roux and Yoshua Bengio. Representational power of restricted boltzmann machines and deep belief networks. Neural Computation, 20:1631–1649, 2008.

[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.

[Lee and Mangasarian, 2001] Yuh-Jye Lee and O. L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5–22, 2001.

[Lee et al., 2009] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

[Margineantu, 2001] Dragos D. Margineantu. Methods for cost-sensitive learning. In IJCAI, 2001.

[Netzer et al., 2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, 2011.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Tan, 1993] Ming Tan. Cost-sensitive learning of classification knowledge and its applications in robotics. Machine Learning, 13:7–33, 1993.

[Tu and Lin, 2010] Han-Hsing Tu and Hsuan-Tien Lin. One-sided support vector regression for multiclass cost-sensitive classification. In ICML, 2010.

[Vincent et al., 2010] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371–3408, 2010.

[Zadrozny and Elkan, 2001] Bianca Zadrozny and Charles Elkan. Learning and making decisions when costs and probabilities are both unknown. In KDD, 2001.

[Zadrozny et al., 2003] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In ICDM, 2003.

[Zhang and Zhou, 2010] Yin Zhang and Zhi-Hua Zhou. Cost-sensitive face recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32:1758–1769, 2010.

[Zhou and Liu, 2006] Zhi-Hua Zhou and Xu-Ying Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. Knowledge and Data Engineering, IEEE Transactions on, 18:63–77, 2006.