IEEE Proof
DNNOff: Offloading DNN-Based Intelligent IoT Applications in Mobile Edge Computing
Xing Chen, Member, IEEE, Ming Li, Hao Zhong, Member, IEEE, Yun Ma, Member, IEEE, and Ching-Hsien Hsu, Senior Member, IEEE
Abstract: Deep neural networks (DNNs) have become increasingly popular in industrial Internet of Things scenarios. Due to their high demands on computational capability, it is hard for DNN-based applications to run directly on intelligent end devices with limited resources. Computation offloading technology offers a feasible solution by offloading some computation-intensive tasks to the cloud or edges. Supporting such capability is not easy for two reasons: 1) adaptability: offloading should occur dynamically among computation nodes; and 2) effectiveness: it needs to be determined which parts are worth offloading. This article proposes a novel approach, called DNNOff. For a given DNN-based application, DNNOff first rewrites the source code to implement a special program structure supporting on-demand offloading and, at runtime, automatically determines the offloading scheme. We evaluated DNNOff on a real-world intelligent application with three DNN models. Our results show that, compared with other approaches, DNNOff saves response time by 12.4–66.6% on average.
Index Terms: Computation offloading, deep neural networks (DNNs), intelligent Internet of Things (IoT) application, mobile edge computing (MEC), software adaption.
Manuscript received February 22, 2021; revised March 17, 2021 and March 29, 2021; accepted April 18, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 62072108 and in part by the Natural Science Foundation of Fujian Province for Distinguished Young Scholars under Grant 2020J06014.
Paper no. TII-21-0800. (Corresponding author: Hao Zhong.)
Xing Chen and Ming Li are with the College of Mathematics and Computer Science and the Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350118, China (e-mail: [email protected];
Hao Zhong is with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]).
Yun Ma is with the Institute for Artificial Intelligence, Peking University, Beijing 100871, China, and also with the School of Software, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).
Ching-Hsien Hsu is with the Department of Computer Science and Information Engineering, Asia University, Taichung 41354, Taiwan, and also with the Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621301, Taiwan (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TII.2021.3075464.
Digital Object Identifier 10.1109/TII.2021.3075464

I. INTRODUCTION
Recent years have witnessed remarkable improvements in deep neural networks (DNNs). As the core machine learning technique [1], the DNN has been applied in industrial Internet of Things (IoT) scenarios such as computer vision [2] and self-driving cars [3]. Meanwhile, more and more trained deep learning models have been deployed on intelligent end devices, such as wearable devices [4], vehicles [5], and unmanned aerial vehicles [6]. In this article, we refer to such trained models as DNN-based intelligent IoT applications.
Due to their limited computation and storage resources, intelligent end devices cannot directly run complex DNN-based applications. One feasible solution is to offload all or part of the computational tasks to the cloud, which has sufficient resources [7], [8]. More specifically, DNNs are divided at the granularity of neural network layers [9]. Thus, some computation-intensive layers can be offloaded to the cloud for execution, while other, simpler layers are processed locally.
However, network communication between end devices and the cloud is likely to cause significant execution delay, which seriously affects the user experience. To address this delay problem, mobile edge computing (MEC) has been introduced [10]. Mobile edges provide computing capabilities in close proximity to end devices and enable the execution of highly demanding applications on end devices while offering significantly lower latencies. Although MEC provides new opportunities to offload DNN-based applications among end devices, the cloud, and nearby edges, prior approaches do not consider how to offload them in this new environment. On the one hand, as the environment is constantly changing, the offloading scheme of a DNN model must be flexible enough to adapt. On the other hand, when an offloading scheme determines which layers to offload and where to offload them, it must trade off the reduced execution time against the network delay, based on changes in the environment.
To fully release the potential of offloading, an offloading mechanism shall support on-demand changes for DNN-based applications and shall enable the execution of parts of the DNN model on different computing nodes (including end devices, the cloud, and edge servers). In addition, an efficient estimation model is needed to determine which layers shall be offloaded. In summary, our main research questions are: 1) How to design a mechanism to support the automatic offloading of DNN-based applications in the MEC environment? 2) How to build an estimation model to determine the optimal offloading schemes? Once these questions are carefully handled, the problem of offloading can be reduced to a traditional optimization problem [11].

1551-3203 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
To address the aforementioned questions, we present a novel approach called DNNOff, which supports offloading DNN-based applications in the MEC environment. This article makes the following major contributions:
1) an offloading mechanism that enables DNN-based applications to be offloaded automatically and dynamically in the MEC environment. To achieve this, DNNOff translates a DNN-based application to a target program that is easier to offload;
2) an effective model to predict the latency of offloading schemes. DNNOff first extracts the structure and parameters of the DNN model and then uses a random forest regression model to predict the execution cost of each layer. Based on the prediction model, DNNOff can determine which parts shall be moved to MEC servers; and
3) an evaluation on a real-world DNN-based application with AlexNet, VGG, and ResNet models. Our results show that DNNOff reduces the response time by 12.4–66.6% for complex DNN-based applications.
The rest of this article is organized as follows. Section II reviews the related work. Section III presents our approach, and Section IV evaluates it on a real-world application. Section V discusses some issues about applicability. Section VI introduces industrial applications. Finally, Section VII concludes this article.
II. RELATED WORK
Mobile devices are generally limited in storage space, battery life, and computing power [12]. To improve the performance of mobile applications, computation offloading has become the most widely used technology. Mobile cloud computing (MCC) improves the performance of applications by sending computing-intensive components from end devices to the cloud. These applications are partitioned at different granularities, such as method, thread, and class. For example, MAUI [13] supports offloading at the granularity of methods. It allows annotating which parts of a program can be offloaded to the cloud and makes offloading decisions at runtime. CloneCloud [14] is a thread-based computation offloading framework, and it modifies virtual machines to support seamless offloading of threads to the cloud. DPartner [15] can offload classes, and it uses a proxy mechanism to access class instances. Furthermore, it calculates the coupling of classes and divides them into two sets, which are deployed on the end device and the cloud server, respectively. However, MCC has an inherent limitation, namely, the long latency between end devices and remote clouds. Hence, MEC has been proposed, in which cloud services are increasingly moving toward nearby edges [16]. AndroidOff [17] supports mobile applications with the offloading capability at the granularity of objects for MEC. It provides a mechanism to offload an object-oriented application and to determine which parts shall be offloaded. However, the works above cannot be applied to DNN-based applications.
Computation offloading for DNN-based applications has advanced further in recent years. Neurosurgeon [9] showed that large amounts of data produced by DNN models have to be uploaded to the cloud via the wireless network, leading to high latency and energy consumption. For the sake of better performance and energy efficiency of modern DNN-based applications, Neurosurgeon designed a lightweight scheduler to partition DNN-based applications automatically between end devices and the cloud at the granularity of neural network layers. Edgent [18] is a framework that automatically and intelligently selects the best partition point of a DNN model to satisfy the requirement on the execution latency. Compared with Neurosurgeon, Edgent can offload computation-intensive DNN layers to the remote server with a low transmission overhead, namely, the nearest computation node. Liu et al. [19] proposed an image recognition framework based on the DNN in the MEC environment and realized a food image recognition system by employing an edge-computing-based service infrastructure. It allows the system to overcome some inherent limitations of the traditional MCC paradigm, such as high latency and energy consumption. Zhou et al. [20] proposed a robust mobile crowd sensing framework in the MEC environment. It can reduce the service delay with edge-computing-based local processing. The above approaches assume that end devices use a single remote server for computation offloading, and thus they cannot make efficient use of the dispersed and changing computing resources in the MEC environment.

Fig. 1. Overview of DNNOff.
III. APPROACH
Fig. 1 presents the overview of DNNOff. For the nodes, we use rectangles to denote its components and circles to denote its internal data. For the edges, red ones denote data flows, and blue ones denote requests. DNNOff has three main components, namely, extraction, offloading mechanism, and estimation model. First, the extraction component extracts the structure and parameters of a DNN model (see Section III-A). Second, the offloading mechanism translates a DNN model to a target program that enables offloading (see Section III-B) and deploys it on end devices and remote servers where offloading may occur. Finally, the estimation model component deployed on the end device synthesizes an optimized offloading scheme to execute different parts of the target program on proper locations, based on the DNN network structure information and the surrounding MEC environment (see Section III-C). Moreover, the estimation model will update the offloading decision when the surrounding MEC environment changes. In Fig. 1, a DNN-based application and its MEC environment are presented on the right.

Fig. 2. Example of the DNN model.
A. Extracting Structure for the DNN Model
Fig. 2 shows an example of the DNN model. A DNN model consists of layers. In Fig. 2, layers are represented as squares in different colors. In particular, the yellow one represents a convolution layer, which translates an image to a feature map with learned filters. The blue one represents an activation layer, which is a nonlinear function. The function accepts a feature map and generates an output with the same dimension. The purple one represents a pooling layer, which can be defined as a general pooling, an average pooling, or a max pooling. The green one represents a fully connected layer, which calculates the weighted sum of the inputs by learned weights. The top of each square shows the name of the layer, such as “conv1” and “relu1,” and the bottom shows the parameters of the layer. For example, “channel:3” denotes that the value of the parameter “channel” is “3.” The black arrows represent the data flow. DNNOff first extracts the structure of a DNN model, which includes the parameters of each layer and the data flow between layers. Its definitions are as follows.
Definition 1 (DNN model structure): A DNN model structure is a directed graph $G_D = (L, R)$ representing data transmissions between layers of a DNN $D$, where $L = \{l_1, l_2, \ldots, l_n\}$ is the set of layers of $D$ and $R$ is the set of data flow edges. Each edge $r_{ij} \in R$ represents a data flow from $l_i$ to $l_j$.
Definition 2 (DNN layer information): A layer consists of type and parameters as $l_i = \langle type, feature \rangle$, where $type$ is the type of the layer and $feature$ denotes the set of features of the layer.
In general, the DNN-based application stores its trained model in the configuration file, such as prototxt of Caffe2.¹ Our approach takes this file as the input and gets the DNN model graph $G_D = (L, R)$ through code analysis.
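As a rough illustration of Definitions 1 and 2, the following Python sketch shows one way the extracted structure $G_D = (L, R)$ could be held in memory. The class and method names (`Layer`, `DNNGraph`, `add_flow`, `parents`) are hypothetical and are not taken from DNNOff, and the parsing of the prototxt file itself is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """One layer l_i = <type, feature> (Definition 2)."""
    name: str
    type: str
    feature: dict = field(default_factory=dict)


@dataclass
class DNNGraph:
    """Directed graph G_D = (L, R) (Definition 1)."""
    layers: dict = field(default_factory=dict)  # name -> Layer, the set L
    edges: list = field(default_factory=list)   # (src, dst) pairs, the set R

    def add_layer(self, layer):
        self.layers[layer.name] = layer

    def add_flow(self, src, dst):
        # An edge r_ij is only meaningful between two known layers.
        if src not in self.layers or dst not in self.layers:
            raise KeyError("both endpoints must be known layers")
        self.edges.append((src, dst))

    def parents(self, name):
        """Layers whose output flows into the layer called `name`."""
        return [s for (s, d) in self.edges if d == name]


# The two layer names and the "channel: 3" parameter follow the Fig. 2 description.
g = DNNGraph()
g.add_layer(Layer("conv1", "convolution", {"channel": 3}))
g.add_layer(Layer("relu1", "activation"))
g.add_flow("conv1", "relu1")
```

Keeping the parent lookup on the graph is a convenience for the later response-time calculation, which walks from a layer back to its parents.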
B. Offloading Mechanism for the DNN Model
First, we translate an original application to a target program, and the translated target program follows the pipe-and-filter style. In this style, DNN layers are modeled as filters that receive and send data, and data flows between two layers are modeled as pipes that transmit the intermediate results. Second, we propose a “Pipe” engine to determine which neural network layer shall be offloaded.
¹[Online]. Available: https://caffe2.ai/
Fig. 3. Translation of a DNN program.
1) Target Program: We abstract a DNN program using the pipe-and-filter architecture style, based on which we propose a design pattern to support adaptive offloading in MEC.
A DNN program is essentially a data flow software architecture [21]. Each layer can be regarded as a filter, and the data transmission between layers can be regarded as a pipe. In a typical DNN program, each filter performs the calculation of a layer, whereas the pipe uses the result of the preceding layer as the input data of the succeeding layer.
In order to support adaptive offloading of DNN applications, the pipe should decide whether to transfer the data to the local filter or to the remote filter for the successive computation tasks. The filter should decide whether to perform the current computation task (calculation of the current layer) or directly return the results.
The left-hand side of Fig. 3 shows the source program of a DNN-based application. It starts from the first layer and receives the initial data (i.e., an image). The result of each layer, namely the intermediate result, is hidden, and the output of the last layer is the return value. The statement $l_i = l_i.type(l_{i-1}, l_i.feature)$ indicates that the layer $l_i$ takes the result of $l_{i-1}$ as its input. The right-hand side of Fig. 3 presents the translated target program. It uses “Pipe” functions to connect each layer, such that the DNN model can be offloaded at the granularity of layers.
2) Code Translation: Our translation has three steps.
Step 1 (Adding the parameters such as “InitL” and “EndL” after “InitData”): For a given program, DNNOff automatically adds “InitL” and “EndL” into the list of parameters. The two parameters represent the labels of the initial and the last layers, respectively. In addition, DNNOff adds “CurrentL,” which denotes the label of the layer to be executed. Meanwhile, “InitData” is assigned to the result of $l_{InitL-1}$ and used as an input to $l_{InitL}$. Here, when the “DP” program runs, the layers between $l_{InitL}$ and $l_{EndL}$ shall be executed.
Algorithm 1: Pipe.
Input: $m$: the label of the “Pipe” function
Output: $CurrentL$: the label of the layer to be executed
Declare:
    $config$: the offloading scheme that records execution locations of each layer;
    $EndL$: the label of the last layer that is executed at $Local$;
    $l_i$: the result of the $i$th layer
1: if $m < CurrentL$ then
2:     return $CurrentL$
3: end if
4: if $m == CurrentL$ and $config[m] == Local$ then
5:     return $CurrentL$
6: end if
7: if $m == CurrentL$ and $config[m] \neq Local$ then
8:     $k \leftarrow$ calculate the label of the next layer that
9:         shall be executed at $Local$
10:     if $k == Null$ then
11:         $k \leftarrow EndL + 1$
12:     end if
13:     $l_{k-1} \leftarrow remote(l_{CurrentL-1}, CurrentL, k-1)$
14:     $CurrentL \leftarrow k$
15:     return $CurrentL$
16: end if
Step 2 (Adding a “Pipe(i)” function before each layer $l_i$): This function determines whether the layer $l_i$ shall be offloaded (see Section III-B3 for details).
Step 3 (Adding two if statements to check each layer $l_i$): The first statement is “if $CurrentL == i$,” where $l_i$ represents the $i$th layer. It checks whether the layer $l_i$ is to be executed currently. The second statement is “if $CurrentL - 1 == EndL$.” If the layer $l_{EndL}$ has been executed, its result shall be returned, and the layers after $l_{EndL}$ are skipped.
3) Computation Offloading at Runtime: At runtime, the “Pipe” functions connect the layers, each of which can be executed locally or remotely according to the offloading scheme. Algorithm 1 shows how the “Pipe” function works. $Pipe(m)$ denotes the pipe between the layers $l_{m-1}$ and $l_m$, $config$ is the offloading scheme that records the execution location of each layer, and $CurrentL$ denotes the label of the layer to be executed.
When $m < CurrentL$, the layer $l_m$ has already been executed and does not need to be repeated (lines 1–3). Therefore, the layer $l_m$ will be skipped.
When $m == CurrentL$ and $config[m] == Local$ ($Local$ is a keyword representing the local node), the layer $l_m$ is to be executed locally (lines 4–6). Therefore, $l_m$ is executed and the value of $CurrentL$ is incremented by 1.
When $m == CurrentL$ and $config[m] \neq Local$, the layer $l_m$ is to be executed remotely (lines 7–16). We first calculate the label $k$ of the next layer that shall be executed locally (lines 8–9); if no such $k$ exists, we assign $EndL + 1$ to it (lines 10–12). Finally, we run the program $DP(l_{CurrentL-1}, CurrentL, k-1)$ on the remote node given by $config[m]$ and assign its result to $l_{k-1}$ (line 13).
Fig. 4. Proposed design pattern of DNN programs.
Fig. 4 shows an example of adaptive offloading of a five-layer DNN, which is executed on three computation nodes. Layers $l_1$ and $l_5$ are to be executed on the end device, layers $l_2$ and $l_3$ are to be executed on Node A, and layer $l_4$ is to be executed on Node B. First, the DNN program $DP(InputData, 1, 5)$ runs on the end device, with $CurrentL = 1$, $EndL = 5$, and $l_0$ set to $InputData$; $Pipe(1)$ and $l_1$ are executed, as $config[1]$ is the end device; $Pipe(2)$ is executed and the remote service is invoked, as $config[2]$ is Node A. Second, the DNN program $DP(l_1, 2, 4)$ runs on Node A; $Pipe(1)$ and $l_1$ are skipped, as $CurrentL$ is 2; $Pipe(2)$, $l_2$, $Pipe(3)$, and $l_3$ are executed in sequence, as $config[2]$ and $config[3]$ are both Node A; $Pipe(4)$ is executed and the remote service is invoked, as $config[4]$ is Node B. Third, the DNN program $DP(l_3, 4, 4)$ runs on Node B; $Pipe(1)$, $l_1$, $Pipe(2)$, $l_2$, $Pipe(3)$, and $l_3$ are all skipped, as $CurrentL$ is 4; $Pipe(4)$ and $l_4$ are executed, as $config[4]$ is Node B; then, $CurrentL$ is 5, and thus the calculation result is returned to the DNN program on the end device. Finally, $l_5$ is executed on the end device and the output is produced.
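The walk-through above can be replayed with a small Python sketch of the translated program and Algorithm 1. This is a simulation under stated assumptions, not DNNOff's implementation: the "remote" invocation is modeled as a plain recursive call inside one process, the function names `dp` and `next_local_layer` are hypothetical, and each layer is a toy function that appends its label to a tuple.

```python
def next_local_layer(config, current, end_l, local):
    """Label of the next layer after `current` (up to end_l) placed on `local`, or None."""
    for k in range(current + 1, end_l + 1):
        if config[k] == local:
            return k
    return None


def dp(init_data, init_l, end_l, local, config, layers, trace):
    """Translated program DP(InitData, InitL, EndL), running on node `local`."""
    results = {init_l - 1: init_data}  # l_{InitL-1} is set to InitData
    current = init_l                   # CurrentL

    for i in sorted(layers):
        # Pipe(i), following Algorithm 1: only the remote branch changes state.
        if i == current and config[i] != local:
            k = next_local_layer(config, current, end_l, local)
            if k is None:
                k = end_l + 1
            # Assumption: a recursive call stands in for the remote service.
            results[k - 1] = dp(results[current - 1], current, k - 1,
                                config[i], config, layers, trace)
            current = k

        # The two checks added in Step 3 of the code translation.
        if current == i:
            results[i] = layers[i](results[i - 1])
            trace.append((local, i))
            current += 1
        if current - 1 == end_l:
            return results[end_l]
    raise RuntimeError("EndL was never reached")


# Offloading scheme of the five-layer example (ED = end device).
config = {1: "ED", 2: "A", 3: "A", 4: "B", 5: "ED"}
layers = {i: (lambda x, i=i: x + (i,)) for i in config}
trace = []
out = dp(("img",), 1, 5, "ED", config, layers, trace)
print(out)    # ('img', 1, 2, 3, 4, 5)
print(trace)  # [('ED', 1), ('A', 2), ('A', 3), ('B', 4), ('ED', 5)]
```

The trace reproduces the order described for Fig. 4: the end device runs $l_1$, Node A runs $l_2$ and $l_3$, Node B runs $l_4$, and the end device finishes with $l_5$.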
C. Estimation Model for the Offloading Scheme
1) Predicting Cost With Random Forest Regression: The execution time of each layer is an essential factor in the estimation model. If the layer $l_i$ is executed on the node $n_k$, we define its execution cost as follows.
Definition 3 (Execution cost): $Cost^{n_k}_{l_i} = \langle time, datasize \rangle$: $time$ denotes the execution time from setting input data to generating output data, which depends on the performance of the execution node, while $datasize$ denotes the amount of data transmission, which is a fixed value obtained by the extraction component.
Given the large number of layers and the diversity of computing nodes, it is difficult to measure the execution time of each layer on each computing node. Thus, we used random forest (RF) regression to build prediction models for different layer types and computing nodes, which predict $Cost^{n_k}_{l_i}.time$. The RF regression model was proposed by Breiman [22] and has been shown to capture nonlinear relations between variables. It is a nonlinear model-building tool, which is widely used in classification [23] and prediction [24].
TABLE I
FACTORS THAT CAN INFLUENCE THE OFFLOADING DECISION
Definition of the prediction model:
$$Y = predict(X) \tag{1}$$
$$\begin{aligned}
X_{conv} &= (channel, ksize, knumber, stride, padding) \\
X_{pooling} &= (channel, ksize, stride) \\
X_{relu} &= (innumber, outnumber) \\
X_{fc} &= (innumber, outnumber).
\end{aligned} \tag{2}$$
We use a dataset of historical data to train the prediction model, which is collected from DNN applications, including AlexNet [25], VGG16, VGG19 [26], ResNet-50, and ResNet-152 [27]. The RF regression prediction model is represented as Equation (1). The input $X$ depends on the type of layer as in Equation (2), and the layer types include convolution layer, pooling layer, activation layer, and fully connected layer.
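To make Equations (1) and (2) concrete, the following Python sketch builds the input vector $X$ for each layer type and dispatches to one predictor per (node, layer type) pair. The helper names (`FEATURES`, `feature_vector`, `predict_time`) and the dictionary layout are assumptions made for illustration; the predictor is passed in as a plain callable, so a trained regressor (for example, a random forest from any library) would have to be wrapped accordingly.

```python
# Feature names per layer type, copied from Equation (2).
FEATURES = {
    "conv": ("channel", "ksize", "knumber", "stride", "padding"),
    "pooling": ("channel", "ksize", "stride"),
    "relu": ("innumber", "outnumber"),
    "fc": ("innumber", "outnumber"),
}


def feature_vector(layer_type, params):
    """Build X for one layer; raise ValueError on unknown types or missing features."""
    if layer_type not in FEATURES:
        raise ValueError(f"unsupported layer type: {layer_type}")
    names = FEATURES[layer_type]
    missing = [n for n in names if n not in params]
    if missing:
        raise ValueError(f"missing features for {layer_type}: {missing}")
    return [float(params[n]) for n in names]


def predict_time(models, node, layer_type, params):
    """Y = predict(X), using the model trained for this node and layer type.

    `models` maps (node, layer_type) to a callable taking the feature list.
    This layout is hypothetical; the paper only states that separate models
    are built for different layer types and computing nodes.
    """
    x = feature_vector(layer_type, params)
    return models[(node, layer_type)](x)


# Usage with a stand-in predictor (not a trained model).
stub_models = {("E1", "pooling"): lambda x: 0.001 * sum(x)}
y = predict_time(stub_models, "E1", "pooling",
                 {"channel": 64, "ksize": 2, "stride": 2})
```

Validating the feature set up front keeps a malformed layer description from silently producing a prediction on the wrong inputs.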
2) Contributory Factor: In this subsection, we introduce a context model that describes the environment (e.g., computation nodes) and the factors that affect the offloading decision.
The context architecture consists of an end device (ED), several nearby edges (NE), and a remote cloud (RC). We use a graph $G_C = (N, E)$ to present this network, where $N$ denotes a set of compute nodes, including the end device and remote servers, and $E$ represents a set of communication links among nodes $n_i \in N$. Each edge $(n_i, n_j) \in E$ is associated with a data transmission rate $v_{n_i n_j}$ and a round-trip time $rtt_{n_i n_j}$ between $n_i$ and $n_j$. A typical offloading scenario is as follows: The data are generated on the end device (the only $n_{ED}$), and the layers can be offloaded to nearby edges (some nodes of $n_{NE}$) or the remote cloud ($n_{RC}$).
Table I shows our factors for estimating an offloading scheme. Among them, $n_k \in N$, $v_{n_i n_j}$, and $rtt_{n_i n_j}$ are defined above. We next introduce $DEP = (dep(l_1), dep(l_2), \ldots, dep(l_n))$, where $DEP$ denotes the offloading scheme. Each $l_i \in L$ is executed on a computation node $dep(l_i) \in N$. Let $T_e(l_i)$ denote the execution time of $l_i$ and let $T_d(l_k, l_m)$ denote the data transmission time between layer $l_k$ and layer $l_m$. The response time of the application is represented by $T_{response}$, which is equal to the moment after the execution of the last layer ($t_n$). In addition, an objective function is constructed to calculate $T_{response}$ and estimate the offloading scheme.
3) Objective Function: Our objective function makes predictions based on contributory factors. In particular, based on the factors in Section III-C2, we construct the objective function as shown in Equation (3). Here, we consider a scheme optimal if its objective value is the smallest. As Table I shows that $t_i$ is the moment after the execution of layer $l_i$, the total response time is obtained when the last layer $l_n$ is executed:
$$T_{response} = t_n \tag{3}$$
$$t_i = \max\bigl(t_{p^{l_i}_j} + T_d(p^{l_i}_j, l_i)\bigr) + T_e(l_i), \quad \forall p^{l_i}_j \in P_{l_i} \tag{4}$$
$$T_e(l_i) = Cost^{dep(l_i)}_{l_i}.time \tag{5}$$
$$T_d(p^{l_i}_j, l_i) = \frac{Cost_{p^{l_i}_j}.datasize}{v_{dep(p^{l_i}_j)\,dep(l_i)}} + rtt_{dep(p^{l_i}_j)\,dep(l_i)}. \tag{6}$$
Equation (4) is explained as follows. First, the moment before the execution of the current layer $l_i$ is calculated as the moment after the execution of a previous layer ($t_{p^{l_i}_j}$) plus the transmission time between the two layers ($T_d(p^{l_i}_j, l_i)$). Second, according to the characteristic of the DNN, the current layer can only be executed when all branches from previous layers have already been executed. Hence, $t_i$ includes the execution time of layer $l_i$ and the maximum, over all parent layers, of the moment after the execution of the parent layer plus the transmission time between the two layers. Among them, the execution time of layer $l_i$ is represented as Equation (5) ($Cost^{dep(l_i)}_{l_i}.time$ is mentioned in Definition 3), and the transmission time from a previous layer is represented as Equation (6).

Algorithm 2: Calculation of Response Time.
Input: $P_{l_i}$: the set of parent nodes of the layer $l_i$
Output: $t_n$: the response time of an offloading scheme
Declare:
    $l_i$: the $i$th layer;
    $t_i$: the moment after the execution of the layer $l_i$;
    $t_{max}$: the maximum sum of the time at the moment after the execution of each parent layer with the addition of the transmission time between two layers;
    $T_d(l_k, l_m)$: the data transmission time between the layer $l_k$ and the layer $l_m$;
    $T_e(l_i)$: the execution time of the layer $l_i$
1: function currentTime($P_{l_i}$, $l_i$)
2:     for each $p^{l_i}_j \in P_{l_i}$ do
3:         if $t_{p^{l_i}_j}$ not calculated then
4:             $t_{p^{l_i}_j} \leftarrow currentTime(P_{p^{l_i}_j}, p^{l_i}_j)$
5:         end if
6:         $t_{max} \leftarrow \max\{t_{max}, t_{p^{l_i}_j} + T_d(p^{l_i}_j, l_i)\}$
7:     end for
8:     $t_i \leftarrow t_{max} + T_e(l_i)$
9:     return $t_i$
10: end function
11: $t_0 \leftarrow 0$
12: $t_n \leftarrow currentTime(P_{l_n}, l_n)$
13: return $t_n$
TABLE II
DEVICE CONTEXTS IN DIFFERENT LOCATIONS
As a result, given an offloading scheme, the calculation of response time is shown in Algorithm 2. According to line 11 of Algorithm 2, we first initialize the value of $t_0$. Then, we use the “currentTime” function to calculate $t_n$ recursively according to line 12. The calculation principle of the “currentTime” function corresponds to Equation (4).
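A minimal Python sketch of this calculation is given below. It follows Equations (3)–(6) with memoized recursion; the function name `response_time` and the dictionary-based inputs are hypothetical. Two points are assumptions not stated in the text: the transmission time between two layers placed on the same node is taken to be zero, and layers without parents start at time 0.

```python
def response_time(last, parents, dep, exec_time, datasize, rate, rtt):
    """Return t_n for one offloading scheme.

    last      : label of the last layer l_n
    parents   : layer -> list of parent layers (P_{l_i})
    dep       : layer -> node executing it (the scheme DEP)
    exec_time : (layer, node) -> Cost.time
    datasize  : layer -> Cost.datasize of that layer's output
    rate, rtt : (src_node, dst_node) -> v and rtt of the link
    """
    memo = {}

    def transfer(p, i):
        src, dst = dep[p], dep[i]
        if src == dst:
            return 0.0  # assumption: no network cost on the same node
        return datasize[p] / rate[(src, dst)] + rtt[(src, dst)]  # Equation (6)

    def current_time(i):
        if i not in memo:
            t_max = 0.0  # assumption: a layer without parents starts at time 0
            for p in parents.get(i, ()):
                t_max = max(t_max, current_time(p) + transfer(p, i))
            memo[i] = t_max + exec_time[(i, dep[i])]  # Equations (4) and (5)
        return memo[i]

    return current_time(last)  # Equation (3)


# Illustrative three-layer chain with made-up costs (not measurements).
t = response_time(
    last=3,
    parents={2: [1], 3: [2]},
    dep={1: "ED", 2: "E1", 3: "ED"},
    exec_time={(1, "ED"): 1.0, (2, "E1"): 0.5, (3, "ED"): 2.0},
    datasize={1: 10.0, 2: 4.0},
    rate={("ED", "E1"): 10.0, ("E1", "ED"): 8.0},
    rtt={("ED", "E1"): 0.25, ("E1", "ED"): 0.25},
)
print(t)  # 5.5
```

In the example, $t_1 = 1.0$, $t_2 = 1.0 + (10/10 + 0.25) + 0.5 = 2.75$, and $t_3 = 2.75 + (4/8 + 0.25) + 2.0 = 5.5$. Memoization ensures that a layer shared by several branches is evaluated once.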
IV. EVALUATION
We implemented DNNOff and conducted evaluations to explore the following research questions.
• (RQ1) To what degree does DNNOff improve performance of DNN-based applications (see Section IV-A)?
• (RQ2) How does DNNOff perform in cost prediction of each neural network layer (see Section IV-B)?
• (RQ3) How much extra overhead does DNNOff introduce (see Section IV-C)?
For RQ1, our results show that DNNOff saved 12.4–66.6% response time compared with other approaches. For RQ2, DNNOff achieved high accuracy for predicting execution time in different layer types and computing nodes. For RQ3, the overhead of our offloading mechanism is acceptable.
A. RQ1 Improvement Over the State of the Art
1) Experimental Settings:
a) Network environment: The network context consists of four computation nodes: one end device and three remote servers. We simulate four locations, which are named community, traffic road, parking lot, and store. Table II lists the connections among our computation nodes. The column and the row of a cell denote the round-trip time and the data transmission rate between computation nodes. We use the network simulation tool Dummynet² to control the available bandwidth. A smaller $rtt$ and a higher $v$ denote a better signal strength.
b) Devices: We use three desktop computers to emulate the Elastic Compute Service (ECS) and edge servers E1 and E2. The ECS is equipped with a 3.6-GHz 16-core CPU and 16-GB RAM, server E1 is equipped with a 2.5-GHz eight-core CPU and 8-GB RAM, and server E2 is equipped with a 3.0-GHz eight-core CPU and 8-GB RAM. We further use a smartphone as the end device, which is equipped with a 2.2-GHz CPU and 4-GB RAM.
c) Application: We use a real-world DNN-based image recognition application in the evaluation. It is written in Python and powered by the Caffe2 deep learning framework.
²[Online]. Available: http://info.iet.unipi.it/luigi/dummynet/
We focus on three models, which are the core of the DNN-based application: AlexNet, VGG16, and ResNet-50. The most complex model is ResNet-50, while AlexNet is the simplest one. Both the inference latency and the recognition accuracy increase with model complexity.
d) Compared approaches: In our evaluation, we compared DNNOff with four other approaches.
The original application is executed on the end device, without any offloading.
Neurosurgeon [9] selects the best DNN partition point and sends the remaining DNN layers from the end device to the cloud.
Edgent [18] is similar to Neurosurgeon [9], but offloads computation-intensive DNN layers to the remote server with a low transmission overhead, namely, the nearest computation node.
The ideal plan obtains the execution time of each layer on each computing node and chooses the fastest scheme after actually executing all the schemes. The ideal plan is infeasible in practice, since it needs to get the execution time at different levels in advance and try all the possibilities. We introduce the ideal plan to illustrate how close DNNOff is to the ideal one.
e) Measurement: To show the effectiveness of DNNOff, we define the following metrics.
1) Total response time: We use the total response time as the metric for performance. To make a fair comparison, we pick ten different images from the video in each location and average their response times. Here, the start time is recorded when the image is input, and the end time is recorded when the recognition result is output. It includes local inference, data transmission, and remote inference. A lower response time indicates a better result.
2) Local inference: This is the time spent on inference on the end device. Inference on remote servers is usually more efficient than on the end device.
3) Remote inference: This is the time spent on inference on the remote servers.
4) Data transmission: This is the time to transmit the feature vectors produced by the partitioned layers of the DNN model, and it is often long under a poor network connection.
2) Results: The total response time consists of the inference time and the transmission time. Fig. 5 shows the time of compared approaches in the four locations. For each approach, the blue bar denotes the local inference time, the orange one denotes the data transmission time, while the gray one denotes the remote inference time.
Compared with the original application, DNNOff reduces the total response time by 30.4–66.6%. The result also shows that the more complex the model, the better the optimization of DNNOff. In general, the optimization in the community is better than that in the store, because the community is closer to the better performing edge server, which can significantly reduce the inference time. In the traffic road, the response time of ResNet-50 is reduced by 66.6% with DNNOff, since the data transfer volume between the layers in ResNet-50 is small and the location is connected to all remote servers, so that offloading can alleviate the bottleneck of local inference time and, meanwhile, guarantee a lower data transmission time. It should be noted that the parking lot is only connected to the cloud server, so the performance improvement is not as obvious as that in other locations, but DNNOff can still reduce the time by 30.4–47.1%. Hence, DNNOff is still effective even if there are no edge servers.

Fig. 5. Process of a DNN-based image recognition application. (a) Image recognition with the AlexNet model. (b) Image recognition with the VGG16 model. (c) Image recognition with the ResNet-50 model.
Compared with Neurosurgeon, DNNOff reduces the total response time by 26.5–53.2%. The results show that DNNOff significantly outperforms Neurosurgeon on the traffic road with the VGG model, because the traffic road has the best network connection with remote servers, which gives DNNOff more choices for offloading. In the parking lot, DNNOff keeps the same performance as Neurosurgeon: because of the poor network connection, multiple partitions would increase the data transmission time instead. In this case, DNNOff makes the same offloading scheme as Neurosurgeon does.
TABLE III SAMPLE ITEMS
Compared with Edgent, DNNOff reduces the total response time by 12.4–39.3%. Although Edgent considers the use of the nearest computation node, DNNOff can cut the DNN at multiple points and execute different parts over the end device, edges, and the cloud.
Compared with the ideal plan, DNNOff achieves comparable performance in different cases, and the performance gap between them is about 5%.
In summary, DNNOff saves 12.4–66.6% of the total response time compared with other approaches. Meanwhile, the results show that DNNOff achieves optimal or near-optimal offloading performance.
B. RQ2 Accuracy for Cost Prediction of DNN Layers
1) Experimental Settings:
a) Model training: We train the random forest regression prediction model on a dataset of historical data collected from DNN-based applications running on different computing nodes. In total, we collected 425, 320, 582, and 96 items of layer information for convolution layers, pooling layers, activation layers, and fully connected layers, respectively. Table III shows some convolution layer items collected on the end device as an example. Column “Channel” lists the number of channels of the convolution kernel. Columns “ksize” and “knumber” list the size and the number of the filters, respectively. Columns “Stride” and “Padding” list the stride and the padding with which the filters are applied. The inputs (X) include the channel, ksize, knumber, stride, and padding. The output (time) is the predicted layer latency. We randomly split the data items into two sets: 70% for training the prediction model and 30% for testing its quality.
b) Measurement: We regard root-mean-square error (RMSE) and R-squared (R2) as the evaluation measures of the prediction model:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(\mathrm{observed}_t - \mathrm{predicted}_t\right)^2} \tag{7}$$
$$R^2 = 1 - \frac{\sum_{t=1}^{N}\left(\mathrm{observed}_t - \mathrm{predicted}_t\right)^2}{\sum_{t=1}^{N}\left(\mathrm{observed}_t - \mathrm{mean}_t\right)^2}. \tag{8}$$
RMSE is the sample standard deviation of the differences between predicted and observed values. R2 is commonly used to evaluate the quality of regression models. They are calculated according to Equations (7) and (8).
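As a minimal sketch, the two measures can be computed as follows. The function names are chosen for this illustration, and the "mean" term of the R2 formula is read here as the mean of the observed values, which is the usual definition and an assumption about the intended notation.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted latencies."""
    n = len(observed)
    squared_errors = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared_errors / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum / total sum of squares."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

A perfect predictor gives an RMSE of 0 and an R2 of 1, while a predictor that always returns the mean of the observations gives an R2 of 0.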
2) Results: Table IV shows the accuracy of the random forest regression prediction model. It reports the RMSE and R2 results for prediction across different layer types and computation nodes. A smaller RMSE indicates a better fit [28], [29]. An R2 greater than 0.5 is acceptable [30], and the closer it is to 1, the better the model. Table IV shows that the RMSE of the model is small and R2 is greater than 0.5, illustrating that the prediction model is acceptable. The high accuracy of the prediction model lays the foundation for scheme estimation.
TABLE IV RMSE AND R-SQUARED OF THE PREDICTION MODEL ON THE TEST SET
Fig. 6. Optimization of random forest parameters using RMSE.
In addition, random forest has two parameters: Ntree, the number of regression trees grown based on a bootstrap sample of the collected layers, and Mtry, the number of different predictors tested at each node. Both parameters are optimized based on the RMSE of calibration. Take the training of convolution layers on the end device as an example. Ntree values from 500 to 4000 in steps of 50 were tested, and Mtry was tested from 1 to 5. The results are shown in Fig. 6, which clearly indicates that the random forest parameters affect the prediction error. The optimization was done using the calibration dataset (n = 297) and RMSE. Ntree = 2000 and Mtry = 3 yielded the lowest RMSE (0.289 ms), so we chose them as the best parameters.
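The parameter search described above amounts to an exhaustive grid search, which might be sketched as follows. The callable `evaluate_rmse` is a hypothetical stand-in for "train a forest with these parameters and return its calibration RMSE". It is not part of DNNOff. In scikit-learn, for instance, it could wrap a `RandomForestRegressor` with `n_estimators` and `max_features` playing the roles of Ntree and Mtry.

```python
from itertools import product

# Candidate values as stated in the text: Ntree in 500..4000 (step 50), Mtry in 1..5.
NTREE_VALUES = range(500, 4001, 50)
MTRY_VALUES = range(1, 6)


def grid_search(evaluate_rmse, ntree_values=NTREE_VALUES, mtry_values=MTRY_VALUES):
    """Return (best_rmse, ntree, mtry) over all parameter combinations.

    evaluate_rmse(ntree, mtry) is a hypothetical callable supplied by the
    caller; it should train a model and return its RMSE.
    """
    best = None
    for ntree, mtry in product(ntree_values, mtry_values):
        score = evaluate_rmse(ntree, mtry)
        if best is None or score < best[0]:
            best = (score, ntree, mtry)
    return best
```

The grid contains 71 × 5 = 355 combinations, so each call to `evaluate_rmse` should be cheap enough to repeat that many times.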
C. RQ3 Extra Overhead
1) Experimental Settings:
a) Setting: We use a simple AlexNet [25] application with 24 layers, which is a state-of-the-art DNN for image classification, and simulate five typical offloading schemes, which represent device-cloud, device-edge, and device-edge-cloud offloading, as shown in Table V.
TABLE V FIVE OFFLOADING SCHEMES FOR ALEXNET
Fig. 7. Overhead of DNNOff and manual-modified one.
b) Compared approaches: We evaluate the overhead of DNNOff by comparing the performance of the adaptive offloaded application with that of the manually modified offloaded application for the five typical offloading schemes. The adaptive offloaded application is dynamically offloaded according to the offloading scheme, which is supported by our framework. The manually modified one is implemented by separating the code according to the offloading scheme case by case.
2) Results: We run the application under the five typical offloading schemes and record the average response time of each, as shown in Fig. 7. The response time of DNNOff is similar to that of the manually modified one, but with an overhead of 120–150 ms. This slight increase in response time (under 10%) is due to the condition statements of pipes that must be evaluated for each layer's execution in our framework. For instance, the overheads in cases 1–3 are all about 120 ms because the three offloading schemes have the same cutoff points. The overheads in cases 4 and 5 are both 150 ms because each of these offloading schemes has two cutoff points, for which more condition statements need to be executed. Overall, the overhead is acceptable.
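To make the notion of a cutoff point concrete, the following sketch counts them in a per-layer placement. It assumes an offloading scheme can be written as a list giving the node that executes each layer. The node labels and split positions in the usage note are illustrative only and are not the schemes of Table V.

```python
def count_cutoff_points(placement):
    """Count positions where consecutive layers run on different nodes.

    placement is a list with one node label per layer, in execution order
    (a hypothetical encoding of an offloading scheme).
    """
    return sum(1 for a, b in zip(placement, placement[1:]) if a != b)
```

For a 24-layer model, `["device"] * 8 + ["cloud"] * 16` has one cutoff point, while `["device"] * 8 + ["edge"] * 8 + ["cloud"] * 8` has two, matching the device-edge-cloud shape of a scheme with two cutoff points.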
V. DISCUSSION
Some issues about applicability need to be further discussed.
A. Online Decision
For online decision, DNNOff uses the estimation model to calculate the response time given an offloading scheme, based on which the problem of online decision can be reduced to a traditional optimization problem. Some algorithms can be used to reduce overhead. For instance, it takes about minutes to determine the offloading decision for the genetic algorithm,