Optimizing Network Performance of Computing Pipelines in Distributed Environments
Qishi Wu (1), Yi Gu (1), Mengxia Zhu (2), Nageswara S.V. Rao (3)
(1) Dept of Computer Science, University of Memphis
(2) Dept of Computer Science, Southern Illinois University
(3) Computer Science & Math Div, Oak Ridge National Laboratory
2008 IPDPS
Outline
1 Introduction
2 Cost Models and Problem Formulation
3 Algorithm Design: ELPC Algorithm, Streamline Algorithm, Greedy Algorithm
4 Implementation and Experimental Results
5 Conclusions and Future Work
Introduction
The demands of large-scale collaborative applications in various domains are beyond the capabilities of the traditional solutions based on standalone workstations.
Supporting high-performance computing pipelines over wide-area networks (WANs) is critical to enabling large-scale distributed scientific applications.
A number of large-scale computational applications require efficient executions of computing tasks that consist of a sequence of linearly arranged modules, also referred to as subtasks or stages.
These modules form a so-called computing pipeline between a data source and an end user.
We consider two types of large-scale computing applications, each comprising a number of modules or subtasks to be executed sequentially in a distributed network environment:
1 Interactive applications, where a single dataset is sequentially processed along a computing pipeline.
Goal: minimize the end-to-end delay of the pipeline to provide fast response.
2 Streaming applications, where a series of datasets continuously flows through a computing pipeline.
Goal: maximize the frame rate of the pipeline to achieve smooth data flow.
We construct analytical cost models for computing modules, network nodes, and communication links to estimate the computing times on nodes and the data transport times over connections.
Based on these time estimates, we present the Efficient Linear Pipeline Configuration (ELPC) method, a dynamic programming approach that partitions the pipeline's modules into groups and maps them onto a set of selected computing nodes in a network.
For comparison purposes, we also implement and test the Streamline algorithm (by B. Agarwalla et al.) and a Greedy algorithm with the same simulation datasets on the same computing platform.
Cost Models and Problem Formulation
Cost Models of Pipeline and Network Components
$M_i$: a computing module
$c_i$: the computational complexity of $M_i$
$m_{i-1}$: the incoming data size
$c_i$ and $m_{i-1}$ determine the number of CPU cycles needed to complete the subtask.
$v_i$: a network node
$p_i$: the overall computing power of $v_i$
$L_{i,j}$: the communication link between $v_i$ and $v_j$
$b_{i,j}$: the bandwidth of $L_{i,j}$
$d_{i,j}$: the minimum link delay of $L_{i,j}$
The computing time of $M_i$ running on $v_j$:
$$T_{computing}(M_i, v_j) = \frac{c_i \cdot m_{i-1}}{p_j}$$
The transfer time of a message of size $m$ over $L_{i,j}$:
$$T_{transport}(m, L_{i,j}) = \frac{m}{b_{i,j}} + d_{i,j}$$
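The two cost models can be written as small helper functions. This is a minimal sketch; the function names, parameter names, and the sample numbers are illustrative assumptions and do not come from the slides.

```python
def computing_time(c_i: float, m_prev: float, p_j: float) -> float:
    """Computing time of module M_i on node v_j: c_i * m_{i-1} / p_j."""
    return c_i * m_prev / p_j


def transport_time(m: float, b_ij: float, d_ij: float = 0.0) -> float:
    """Transfer time of a message of size m over link L_{i,j}: m / b_{i,j} + d_{i,j}."""
    return m / b_ij + d_ij


# Illustrative values only (units are arbitrary but must be consistent).
t_comp = computing_time(c_i=2.0, m_prev=100.0, p_j=50.0)
t_link = transport_time(m=100.0, b_ij=20.0, d_ij=0.5)
print(t_comp, t_link)
```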
We consider a transport network consisting of $k$ geographically distributed computing nodes $v_1, v_2, \dots, v_k$.
The general pipeline consists of $n$ sequential modules $M_1, M_2, \dots, M_n$, where $M_1$ is a data source and $M_n$ is an end user.
Problem Formulation
The objective of a general mapping scheme is to decompose the pipeline into $q$ groups of modules, denoted by $g_1, g_2, \dots, g_q$, and map them onto a selected path $P$ of $q$ nodes from a source node $v_s$ to a destination node $v_d$, where $q \in [1, \min(k, n)]$.
Path $P$ consists of a sequence of not necessarily distinct nodes $v_{P[1]} = v_s, v_{P[2]}, \dots, v_{P[q]} = v_d$.
For each mapping, we consider two cases:
1 With node reuse, two or more modules are allowed to run on the same node.
2 Without node reuse, a node on the selected path P executes exactly one module.
Minimal total delay for interactive application
We achieve the fastest system response by minimizing the total computing and transport delay of the pipeline from the source node to the destination node.
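Total delay
$$T_{total}(\text{Path } P \text{ of } q \text{ nodes}) = T_{computing} + T_{transport} = \sum_{i=1}^{q} T_{g_i} + \sum_{i=1}^{q-1} T_{L_{P[i],P[i+1]}}$$
$$= \sum_{i=1}^{q} \left( \frac{1}{p_{P[i]}} \sum_{M_j \in g_i,\ j \geq 2} (c_j m_{j-1}) \right) + \sum_{i=1}^{q-1} \left( \frac{m(g_i)}{b_{P[i],P[i+1]}} \right)$$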
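To make the total-delay formula concrete, the sketch below evaluates it for one fixed mapping. The data layout is an assumption made for illustration: `groups` lists 1-based module indices per group, `c[j]` and `m[j]` are the complexity and output size of module j, `p[v]` is node power, `b[(u, v)]` is link bandwidth, and m(g_i) is taken to be the output size of the last module in group g_i. Link delay is omitted, as in the formula above.

```python
def total_delay(groups, path, c, m, p, b):
    """End-to-end delay of a mapping: group i of modules runs on path[i]."""
    compute = 0.0
    for group, node in zip(groups, path):
        # Module 1 is the data source and contributes no computation (j >= 2).
        work = sum(c[j] * m[j - 1] for j in group if j >= 2)
        compute += work / p[node]
    transport = 0.0
    for i in range(len(groups) - 1):
        out_size = m[groups[i][-1]]  # m(g_i): output of the group's last module
        transport += out_size / b[(path[i], path[i + 1])]
    return compute + transport


# Hypothetical 4-module pipeline mapped onto a 3-node path s -> a -> d.
c = {2: 2.0, 3: 1.0, 4: 0.5}
m = {1: 100.0, 2: 50.0, 3: 20.0}
p = {"s": 10.0, "a": 50.0, "d": 20.0}
b = {("s", "a"): 20.0, ("a", "d"): 10.0}
delay = total_delay([[1], [2, 3], [4]], ["s", "a", "d"], c, m, p, b)
print(delay)  # 12.5
```

Here the computing part is 5.0 + 0.5 and the transport part is 5.0 + 2.0.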
Maximal frame rate for streaming applications
To produce the smoothest data flow for streaming applications, we maximize the frame rate, which is achieved by identifying and minimizing the time incurred on a bottleneck link or node.
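Time on bottleneck
$$T_{bottleneck}(\text{Path } P \text{ of } q \text{ nodes}) = \max_{\substack{\text{Path } P \text{ of } q \text{ nodes} \\ i = 1, 2, \dots, q-1}} \left( T_{computing}(g_i),\ T_{transport}(L_{P[i],P[i+1]}),\ T_{computing}(g_q) \right)$$
$$= \max_{\substack{\text{Path } P \text{ of } q \text{ nodes} \\ i = 1, 2, \dots, q-1}} \left( \frac{1}{p_{P[i]}} \sum_{M_j \in g_i,\ j \geq 2} (c_j m_{j-1}),\ \frac{m(g_i)}{b_{P[i],P[i+1]}},\ \frac{1}{p_{P[q]}} \sum_{M_j \in g_q,\ j \geq 2} (c_j m_{j-1}) \right)$$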
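The bottleneck time of a fixed mapping can be computed in the same way, and the achievable frame rate is its reciprocal. As before, the data layout (`groups`, `c`, `m`, `p`, `b`) and the sample values are assumptions made for illustration only.

```python
def bottleneck_time(groups, path, c, m, p, b):
    """Largest single computing or transport time along the mapped path."""
    times = []
    for i, (group, node) in enumerate(zip(groups, path)):
        # Computing time of group g_i on its node (module 1 is the source).
        times.append(sum(c[j] * m[j - 1] for j in group if j >= 2) / p[node])
        if i < len(groups) - 1:
            # Transport time of the group's output to the next node.
            times.append(m[group[-1]] / b[(path[i], path[i + 1])])
    return max(times)


# Hypothetical 4-module pipeline mapped onto a 3-node path s -> a -> d.
c = {2: 2.0, 3: 1.0, 4: 0.5}
m = {1: 100.0, 2: 50.0, 3: 20.0}
p = {"s": 10.0, "a": 50.0, "d": 20.0}
b = {("s", "a"): 20.0, ("a", "d"): 10.0}
t_max = bottleneck_time([[1], [2, 3], [4]], ["s", "a", "d"], c, m, p, b)
frame_rate = 1.0 / t_max
print(t_max, frame_rate)  # 5.0 0.2
```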
Algorithm Design
ELPC Algorithm
Minimum End-to-end Delay with Node Reuse
For interactive applications, our goal is to minimize the end-to-end delay incurred on the nodes and links from the source to the destination to achieve the fastest response.
A single dataset is processed and there is only one module being executed at any particular time.
Nodes can be reused but are not shared simultaneously among different modules.
[Figure: Illustration of the ELPC mapping scheme for minimum end-to-end delay]
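Let $T^{j}(v_i)$ denote the minimal total delay with the first $j$ modules mapped to a path from the source node $v_s$ to node $v_i$. We have the following recursion leading to the final solution $T^{n}(v_d)$.
Minimal total delay, for $j = 2, \dots, n$ and $v_i \in V$:
$$T^{j}(v_i) = \min \left( T^{j-1}(v_i) + \frac{c_j m_{j-1}}{p_{v_i}},\ \min_{u \in adj(v_i)} \left( T^{j-1}(u) + \frac{c_j m_{j-1}}{p_{v_i}} + \frac{m_{j-1}}{b_{u,v_i}} \right) \right)$$
Base condition, for $v_i \in V$ and $v_i \neq v_s$:
$$T^{2}(v_i) = \begin{cases} \dfrac{c_2 m_1}{p_{v_i}} + \dfrac{m_1}{b_{v_s,v_i}}, & \text{if } e_{v_s,v_i} \in E \\ \infty, & \text{otherwise} \end{cases}$$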
The complexity of this algorithm is $O(n \times |E|)$, where $n$ is the number of modules and $|E|$ is the number of edges.
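The minimum-delay recursion with node reuse can be sketched as a table-filling dynamic program. This is not the authors' implementation. It assumes that only the source hosts the first module with zero delay (which reproduces the stated base condition for nodes other than the source), that `b` lists every link in both directions, and that link delay is ignored as in the recursion. All names and numbers are hypothetical.

```python
import math


def elpc_min_delay(n, c, m, p, b, source, dest):
    """Return (minimal delay, node hosting each module M_1..M_n)."""
    incoming = {v: [] for v in p}
    for (u, v), bw in b.items():
        incoming[v].append((u, bw))
    # Assumed start: M_1 sits on the source with zero accumulated delay.
    prev = {v: (0.0 if v == source else math.inf) for v in p}
    back_pointers = []
    for j in range(2, n + 1):
        cur, back = {}, {}
        for v in p:
            comp = c[j] * m[j - 1] / p[v]
            best, best_u = prev[v] + comp, v  # reuse: M_j stays on M_{j-1}'s node
            for u, bw in incoming[v]:
                cand = prev[u] + comp + m[j - 1] / bw  # move over link u -> v
                if cand < best:
                    best, best_u = cand, u
            cur[v], back[v] = best, best_u
        back_pointers.append(back)
        prev = cur
    if math.isinf(prev[dest]):
        return math.inf, None
    mapping = [dest]
    for back in reversed(back_pointers):
        mapping.append(back[mapping[-1]])
    mapping.reverse()
    return prev[dest], mapping


# Hypothetical 3-node network (links usable in both directions), 4 modules.
links = {("s", "a"): 20.0, ("a", "d"): 10.0, ("s", "d"): 5.0}
b = {**links, **{(v, u): bw for (u, v), bw in links.items()}}
p = {"s": 10.0, "a": 50.0, "d": 20.0}
c = {2: 2.0, 3: 1.0, 4: 0.5}
m = {1: 100.0, 2: 50.0, 3: 20.0}
delay, mapping = elpc_min_delay(4, c, m, p, b, "s", "d")
print(delay, mapping)  # 12.5 ['s', 'a', 'a', 'd']
```

Each of the n - 1 stages scans every node once and every directed link once, which matches the stated $O(n \times |E|)$ bound for a connected network.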
Maximum Frame Rate without Node Reuse
For streaming applications, our goal is to maximize frame rate.
The maximum frame rate a computing pipeline can achieve is limited by the bottleneck unit which is the slowest transport link or computing node.
Node reuse in streaming applications causes resource sharing, and hence affects the optimality of the solutions to previous mapping subproblems.
We consider a restricted version of the mapping problem for maximum frame rate by limiting the use of each node to a single module.
[Figure: Illustration of the ELPC mapping scheme for maximum frame rate]
We attempt to find the widest network path with exactly $n$ nodes to map the $n$ modules in the pipeline on a one-to-one basis.
This problem is NP-complete.
We develop an approximate solution by adapting the method for minimum end-to-end delay with some necessary modifications.
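Let $T^{j}(v_i)$ denote the minimal bottleneck time (the reciprocal of the maximal frame rate) with the first $j$ modules mapped to a path from the source node $v_s$ to node $v_i$.
We again have a recursion leading to the final solution $T^{n}(v_d)$.
Time on bottleneck, for $j = 2, \dots, n$ and $v_i \in V$:
$$T^{j}(v_i) = \min_{u \in adj(v_i)} \left( \max \left( T^{j-1}(u),\ \frac{c_j m_{j-1}}{p_{v_i}},\ \frac{m_{j-1}}{b_{u,v_i}} \right) \right)$$
Base condition, for $v_i \in V$ and $v_i \neq v_s$:
$$T^{2}(v_i) = \begin{cases} \max \left( \dfrac{c_2 m_1}{p_{v_i}},\ \dfrac{m_1}{b_{v_s,v_i}} \right), & \text{if } e_{v_s,v_i} \in E \\ \infty, & \text{otherwise} \end{cases}$$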
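A heuristic version of this recursion can be sketched as follows. The slides do not say how node reuse is prevented, so two rules here are assumptions: each table entry remembers the path that produced it and rejects a node already on that path, and the destination is reserved for the last module. Because only one path is kept per entry, the result is approximate, in line with the NP-completeness noted above. All names and numbers are hypothetical.

```python
import math


def elpc_max_frame_rate(n, c, m, p, b, source, dest):
    """Return (bottleneck time, path) for a one-module-per-node mapping."""
    incoming = {v: [] for v in p}
    for (u, v), bw in b.items():
        incoming[v].append((u, bw))
    # Each entry stores (bottleneck time so far, path that achieved it).
    prev = {source: (0.0, [source])}
    for j in range(2, n + 1):
        cur = {}
        for v in p:
            # Assumption: dest hosts only the last module, and the last
            # module is placed only on dest.
            if (v == dest) != (j == n):
                continue
            comp = c[j] * m[j - 1] / p[v]
            for u, bw in incoming[v]:
                if u not in prev:
                    continue
                t_u, path_u = prev[u]
                if v in path_u:  # no node reuse along the stored path
                    continue
                t = max(t_u, comp, m[j - 1] / bw)
                if v not in cur or t < cur[v][0]:
                    cur[v] = (t, path_u + [v])
        prev = cur
    return prev.get(dest, (math.inf, None))


# Hypothetical 4-node network (links usable in both directions), 3 modules.
links = {("s", "a"): 20.0, ("s", "x"): 50.0, ("a", "d"): 10.0, ("x", "d"): 25.0}
b = {**links, **{(v, u): bw for (u, v), bw in links.items()}}
p = {"s": 10.0, "a": 50.0, "x": 25.0, "d": 20.0}
c = {2: 2.0, 3: 1.0}
m = {1: 100.0, 2: 50.0}
t_max, path = elpc_max_frame_rate(3, c, m, p, b, "s", "d")
print(t_max, path, 1.0 / t_max)  # 5.0 ['s', 'a', 'd'] 0.2
```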
Streamline Algorithm
Agarwalla et al. proposed a grid scheduling algorithm for graph dataflow scheduling in a network with n resources and n × n communication links.
This algorithm considers application requirements in terms of
1 Per-stage computation and communication needs
2 Application constraints on co-location of stages
3 Availability of computation and communication resources
This scheduling heuristic aims to maximize the throughput of an application by assigning the best resources to the neediest stages at each step.
The complexity of this algorithm is $O(m \times n^2)$, where $m$ is the number of stages or modules and $n$ is the number of nodes.
Greedy Algorithm
A greedy algorithm iteratively obtains the greatest immediate gain based on certain local optimality criteria at each step.
At each step, we calculate the end-to-end delay or frame rate for mapping the next module onto the current node (when node reuse is allowed) or onto one of its neighbor nodes, and choose the best option.
This algorithm makes a mapping decision at each step only based on the current information without considering the effect of this local decision on the mapping performance in the later steps.
The complexity of this algorithm is $O(m \times n)$, where $m$ is the number of modules in the linear pipeline and $n$ is the number of nodes in the network.
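The greedy strategy for the minimum-delay case might look like the sketch below. It is an illustration under assumptions, not the implementation used in the experiments: the last module is forced onto the destination, ties are broken by node name, and the search gives up if the destination cannot be reached at the last step. All names and numbers are hypothetical.

```python
import math


def greedy_min_delay(n, c, m, p, b, source, dest):
    """Place modules one by one, taking the cheapest local step each time."""
    neighbors = {v: [] for v in p}
    for (u, v), bw in b.items():
        neighbors[u].append((v, bw))
    current, total, mapping = source, 0.0, [source]
    for j in range(2, n + 1):
        # Candidates: stay on the current node (node reuse) or move to a neighbor.
        candidates = [(current, 0.0)]
        candidates += [(v, m[j - 1] / bw) for v, bw in neighbors[current]]
        if j == n:
            # Assumption: the last module must run on the destination.
            candidates = [(v, t) for v, t in candidates if v == dest]
        if not candidates:
            return math.inf, None  # greedy choice led to a dead end
        step, node = min((c[j] * m[j - 1] / p[v] + t, v) for v, t in candidates)
        total += step
        current = node
        mapping.append(node)
    return total, mapping


# Hypothetical 3-node network (links usable in both directions), 4 modules.
links = {("s", "a"): 20.0, ("a", "d"): 10.0, ("s", "d"): 5.0}
b = {**links, **{(v, u): bw for (u, v), bw in links.items()}}
p = {"s": 10.0, "a": 50.0, "d": 20.0}
c = {2: 2.0, 3: 1.0, 4: 0.5}
m = {1: 100.0, 2: 50.0, 3: 20.0}
delay, mapping = greedy_min_delay(4, c, m, p, b, "s", "d")
print(delay, mapping)  # 12.5 ['s', 'a', 'a', 'd']
```

Because each step looks only at the current node and its neighbors, an early choice can leave the destination unreachable or expensive at the end, which is the weakness of local decisions described above.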
Implementation and Experimental Results
Implementation
We conduct an extensive set of mapping experiments using a wide variety of simulated application pipelines and computing networks.
We generate these simulation datasets by randomly varying the pipeline and network attributes within a suitably selected range of values.
For each mapping problem, we designate a source node and a destination node to run the first module and the last module of the pipeline.
Performance comparison of the three algorithms
[Figure: Performance comparison of minimum end-to-end delay for the three algorithms]
The x-axis represents the case number; there are 20 cases.
[Figure: Performance comparison of maximum frame rate for the three algorithms]
Conclusions and Future Work
Conclusions
We designed an ELPC scheme based on dynamic programming that strategically maps modules of computing pipelines to shared or dedicated network environments to achieve the minimum end-to-end delay and maximum frame rate.
The experimental results show that ELPC exhibits superior mapping performance over the Streamline and Greedy algorithms.
Future Work
We will study the pipeline mapping problem for maximum frame rate in the case of node reuse.
We will also extend linear pipelines to graph workflows, study the complexity of graph workflow mapping problems in distributed environments, and develop efficient solutions to them.