### Optimizing Network Performance of Computing Pipelines in Distributed Environments

Qishi Wu^{1}, Yi Gu^{1}, Mengxia Zhu^{2}, Nageswara S.V. Rao^{3}

^{1}Dept. of Computer Science, University of Memphis
^{2}Dept. of Computer Science, Southern Illinois University
^{3}Computer Science & Math Division, Oak Ridge National Laboratory

2008 IPDPS

### Outline

**1** Introduction

**2** Cost Models and Problem Formulation

**3** Algorithm Design
ELPC Algorithm
Streamline Algorithm
Greedy Algorithm

**4** Implementation and Experimental Results

**5** Conclusions and Future Work


**Introduction**

### Introduction

The demands of large-scale collaborative applications in various domains exceed the capabilities of traditional solutions based on standalone workstations.

Supporting high-performance computing pipelines over wide-area networks is critical to enabling large-scale distributed scientific applications.

A number of large-scale computational applications require efficient executions of computing tasks that consist of a sequence of linearly arranged modules, also referred to as subtasks or stages.

These modules form a so-called computing pipeline between a data source and an end user.

### Introduction

We consider two types of large-scale computing applications comprising a number of modules or subtasks to be executed sequentially in a distributed network environment:

**1** *Interactive applications*, where a single dataset is sequentially processed along a computing pipeline.

Goal: Minimize the end-to-end delay of the pipeline to provide fast response.

**2** *Streaming applications*, where a series of datasets continuously flows through a computing pipeline.

Goal: Maximize the frame rate of the pipeline to achieve smooth data flow.


### Introduction

We construct analytical cost models for computing modules, network nodes, and communication links to estimate the computing times on nodes and the data transport times over connections.

Based on these time estimates, we present the Efficient Linear Pipeline Configuration (ELPC) method, based on dynamic programming, which partitions the pipeline's modules into groups and maps them onto a set of selected computing nodes in a network.

For comparison purposes, we also implement and test the *Streamline algorithm* (by B. Agarwalla et al.) and a *Greedy algorithm* with the same simulation datasets on the same computing platform.


**Cost Models and Problem Formulation**

### Cost Models of Pipeline and Network Components

- *M*_{i}: a computing module
- *c*_{i}: the computational complexity of *M*_{i}
- *m*_{i−1}: the incoming data size of *M*_{i}
- *c*_{i} and *m*_{i−1} determine the number of CPU cycles needed to complete the subtask
- *v*_{i}: a network node
- *p*_{i}: the overall computing power of *v*_{i}
- *L*_{i,j}: the communication link between *v*_{i} and *v*_{j}
- *b*_{i,j}: the bandwidth of *L*_{i,j}
- *d*_{i,j}: the minimum link delay of *L*_{i,j}

### Cost Models of Pipeline and Network Components

The computing time of *M*_{i} running on *v*_{j}:

T_computing(M_i, v_j) = (c_i · m_{i−1}) / p_j

The transfer time of a message of size m over *L*_{i,j}:

T_transport(m, L_{i,j}) = m / b_{i,j} + d_{i,j}
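As a minimal sketch, the two cost models translate directly into code; the function names and units below (cycles per byte for complexity, cycles per second for power, bytes per second for bandwidth) are illustrative assumptions, not taken from the paper.

```python
# Sketch of the two cost models; names and units are illustrative assumptions.

def t_computing(c_i, m_prev, p_j):
    """Computing time of module M_i on node v_j: (c_i * m_{i-1}) / p_j."""
    return (c_i * m_prev) / p_j

def t_transport(m, b_ij, d_ij):
    """Transfer time of message size m over link L_{i,j}: m / b_{i,j} + d_{i,j}."""
    return m / b_ij + d_ij

# Example: a module of complexity 2 cycles/byte on 1e9 bytes of input,
# run on a 1e9 cycles/sec node, then shipped over a 1e8 bytes/sec link
# with 0.01 s minimum delay.
print(t_computing(2, 1e9, 1e9))     # prints 2.0 (seconds)
print(t_transport(1e9, 1e8, 0.01))  # about 10.01 seconds
```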


### Cost Models of Pipeline and Network Components

We consider a transport network consisting of k geographically distributed computing nodes v_1, v_2, …, v_k.

The general pipeline consists of n sequential modules M_1, M_2, …, M_n, where M_1 is a data source and M_n is an end user.

### Problem Formulation

The objective of a general mapping scheme is to decompose the pipeline into q groups of modules denoted by g_1, g_2, …, g_q, and map them onto a selected path P of q nodes from a source node v_s to a destination node v_d, where q ∈ [1, min(k, n)].

Path P consists of a sequence of not necessarily distinct nodes v_{P[1]} = v_s, v_{P[2]}, …, v_{P[q]} = v_d.

For each mapping, we consider two cases:

**1** *With node reuse*, two or more modules are allowed to run on the same node.

**2** *Without node reuse*, a node on the selected path P executes exactly one module.


### Minimal total delay for interactive application

We achieve the fastest system response by minimizing the total computing and transport delay of the pipeline from the source node to the destination node.

Total delay

T_total(Path P of q nodes) = T_computing + T_transport

= Σ_{i=1}^{q} T_{g_i} + Σ_{i=1}^{q−1} T_{L_{P[i],P[i+1]}}

= Σ_{i=1}^{q} ( (1/p_{P[i]}) Σ_{M_j ∈ g_i, j≥2} c_j · m_{j−1} ) + Σ_{i=1}^{q−1} ( m(g_i) / b_{P[i],P[i+1]} )
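A hedged sketch of evaluating this total-delay formula for a fixed decomposition and path: the data structures are illustrative, and m(g_i) is assumed here to be the output size of the last module in group g_i.

```python
# Sketch of the total-delay formula; all names are illustrative assumptions.

def total_delay(groups, path, c, m, p, b):
    """groups[i] = list of module indices in group g_i (1-based modules);
    path[i] = node hosting g_i; c[j], m[j] = complexity / output size of M_j;
    p[v] = power of node v; b[(u, v)] = bandwidth of link L_{u,v}."""
    computing = sum(
        sum(c[j] * m[j - 1] for j in g if j >= 2) / p[path[i]]
        for i, g in enumerate(groups)
    )
    transport = sum(
        m[groups[i][-1]] / b[(path[i], path[i + 1])]  # m(g_i): last module's output
        for i in range(len(groups) - 1)
    )
    return computing + transport
```

For example, three modules split into groups [M_1, M_2] and [M_3] over a two-node path sum each group's computing term plus the single transfer term.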

### Maximal frame rate for streaming applications

To produce the smoothest data flow for streaming applications, we maximize the frame rate, which is achieved by identifying and minimizing the time incurred on a bottleneck link or node.

Time on bottleneck

T_bottleneck(Path P of q nodes)

= max_{i=1,…,q−1} ( T_computing(g_i), T_transport(L_{P[i],P[i+1]}), T_computing(g_q) )

= max_{i=1,…,q−1} ( (1/p_{P[i]}) Σ_{M_j ∈ g_i, j≥2} c_j · m_{j−1},  m(g_i) / b_{P[i],P[i+1]},  (1/p_{P[q]}) Σ_{M_j ∈ g_q, j≥2} c_j · m_{j−1} )
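The bottleneck time for a fixed decomposition and path can be sketched the same way; the names and the interpretation of m(g_i) as the last module's output size are illustrative assumptions. The achievable frame rate is then 1/T_bottleneck.

```python
# Sketch of the bottleneck time; all names are illustrative assumptions.

def bottleneck_time(groups, path, c, m, p, b):
    """Max over per-group computing times and inter-group transfer times."""
    group_cost = [
        sum(c[j] * m[j - 1] for j in g if j >= 2) / p[path[i]]
        for i, g in enumerate(groups)
    ]
    link_cost = [
        m[groups[i][-1]] / b[(path[i], path[i + 1])]  # m(g_i) over the link
        for i in range(len(groups) - 1)
    ]
    return max(group_cost + link_cost)
```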

**Algorithm Design**


**Algorithm Design** **ELPC Algorithm**

### Minimum End-to-end Delay with Node Reuse

For interactive applications, our goal is to minimize the end-to-end delay incurred on the nodes and links from the source to the destination to achieve the fastest response.

A single dataset is processed and there is only one module being executed at any particular time.

Nodes can be reused but are not shared simultaneously among different modules.

### Illustration of ELPC Mapping Scheme for Minimum End-to-end Delay


### Minimum End-to-end Delay with Node Reuse

Let T^j(v_i) denote the minimal total delay with the first j modules mapped to a path from the source node v_s to node v_i. We have the following recursion leading to the final solution T^n(v_d).

Minimal total delay, for j = 2 to n and v_i ∈ V:

T^j(v_i) = min ( T^{j−1}(v_i) + c_j · m_{j−1} / p_{v_i},  min_{u ∈ adj(v_i)} ( T^{j−1}(u) + c_j · m_{j−1} / p_{v_i} + m_{j−1} / b_{u,v_i} ) )

Base condition, for v_i ∈ V and v_i ≠ v_s:

T^2(v_i) = c_2 · m_1 / p_{v_i} + m_1 / b_{v_s,v_i}, if e_{v_s,v_i} ∈ E; ∞ otherwise

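The recursion above can be sketched as a dynamic program over layers j = 2..n. The graph representation and helper names below are illustrative, not from the paper's implementation.

```python
# Sketch of the ELPC dynamic program for minimum end-to-end delay with
# node reuse; data structures and names are illustrative assumptions.
import math

def elpc_min_delay(nodes, adj, b, p, c, m, vs, vd, n):
    """adj[v] = neighbors of v; b[(u, v)] = link bandwidth; p[v] = power;
    c[j], m[j] for modules 1..n; vs/vd = source/destination. Returns T^n(vd)."""
    # Base condition: map M_2 onto a neighbor of the source.
    T = {v: (c[2] * m[1] / p[v] + m[1] / b[(vs, v)])
         if (vs, v) in b else math.inf
         for v in nodes if v != vs}
    T[vs] = math.inf  # the source itself only runs M_1
    for j in range(3, n + 1):
        Tn = {}
        for v in nodes:
            # Option 1: keep M_j on the same node (node reuse, no transfer).
            stay = T.get(v, math.inf) + c[j] * m[j - 1] / p[v]
            # Option 2: move to v from a neighbor u, paying the transfer cost.
            move = min((T.get(u, math.inf) + c[j] * m[j - 1] / p[v]
                        + m[j - 1] / b[(u, v)]
                        for u in adj[v] if (u, v) in b),
                       default=math.inf)
            Tn[v] = min(stay, move)
        T = Tn
    return T[vd]
```

Each layer relaxes every node against itself and its neighbors, which matches the stated O(n × |E|) complexity.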


### Minimum End-to-end Delay with Node Reuse

*The complexity of this algorithm is O(n × |E|)*
- *n denotes the number of modules*

- |E| is the number of edges


### Maximum Frame Rate without Node Reuse

For streaming applications, our goal is to maximize frame rate.

The maximum frame rate a computing pipeline can achieve is limited by the bottleneck unit, i.e., the slowest transport link or computing node.

Node reuse in streaming applications causes resource sharing, and hence affects the optimality of the solutions to previous mapping subproblems.

We consider a restricted version of the mapping problem for maximum frame rate by limiting the use of each node to a single module.

### Illustration of ELPC Mapping Scheme for Maximum Frame Rate


### Maximum Frame Rate without Node Reuse

We attempt to find the widest network path with exactly n nodes to map the n modules in the pipeline on a one-to-one basis.

This problem is NP-complete.

We develop an approximate solution by adapting the method for minimum end-to-end delay with some necessary modifications.

### Maximum Frame Rate without Node Reuse

Let T^j(v_i) denote the minimal time on the bottleneck with the first j modules mapped to a path from the source node v_s to node v_i; the maximal frame rate is its reciprocal. Again we have the following recursion leading to the final solution T^n(v_d).

Time on bottleneck, for j = 2 to n and v_i ∈ V:

T^j(v_i) = min_{u ∈ adj(v_i)} ( max ( T^{j−1}(u),  c_j · m_{j−1} / p_{v_i},  m_{j−1} / b_{u,v_i} ) )

Base condition, for v_i ∈ V and v_i ≠ v_s:

T^2(v_i) = max ( c_2 · m_1 / p_{v_i}, m_1 / b_{v_s,v_i} ), if e_{v_s,v_i} ∈ E; ∞ otherwise


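This min-max recursion admits a sketch analogous to the delay version; as the slides note, it is an approximation, and in this simple form it does not enforce node distinctness globally. Names and data structures are assumptions.

```python
# Sketch of the approximate recursion for maximum frame rate (no node
# reuse); T[j][v] is the bottleneck time of the best partial mapping,
# and the frame rate is 1 / T^n(vd). Names are illustrative assumptions.
import math

def elpc_max_rate(nodes, adj, b, p, c, m, vs, vd, n):
    # Base condition: bottleneck of mapping M_2 next to the source.
    T = {v: max(c[2] * m[1] / p[v], m[1] / b[(vs, v)])
         if (vs, v) in b else math.inf
         for v in nodes if v != vs}
    T[vs] = math.inf  # the source itself only runs M_1
    for j in range(3, n + 1):
        Tn = {}
        for v in nodes:
            # M_j must arrive from some neighbor u; the bottleneck is the
            # worst of the previous bottleneck, computing, and transfer.
            Tn[v] = min((max(T.get(u, math.inf),
                             c[j] * m[j - 1] / p[v],
                             m[j - 1] / b[(u, v)])
                         for u in adj[v] if (u, v) in b),
                        default=math.inf)
        T = Tn
    return 1.0 / T[vd]  # frame rate achieved at the destination
```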


**Algorithm Design** **Streamline Algorithm**

### Streamline Algorithm

Agarwalla et al. proposed a grid scheduling algorithm for dataflow graph scheduling in a network with n resources and n × n communication links.

This algorithm considers application requirements in terms of:

**1** Per-stage computation and communication needs

**2** Application constraints on co-location of stages

**3** Availability of computation and communication resources

This scheduling heuristic aims to maximize the throughput of an application by assigning the best resources to the most needy stages at each step.

*The complexity of this algorithm is O(m × n*^{2})
- *m is the number of stages or modules*
- *n is the number of nodes*
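As a loose illustration only, the "best resources to the most needy stages" idea can be sketched as a ranked assignment; the need and goodness measures below are illustrative stand-ins, not the actual metrics of Agarwalla et al.

```python
# Rough sketch of a Streamline-style ranked assignment; the demand and
# capacity measures here are illustrative assumptions.

def streamline_assign(stage_need, node_power):
    """stage_need[i] = demand of stage i; node_power maps node -> capacity.
    Returns {stage index: node}, serving the neediest stage first."""
    free = dict(node_power)
    mapping = {}
    # Visit stages in decreasing order of need.
    for i in sorted(range(len(stage_need)), key=lambda i: -stage_need[i]):
        best = max(free, key=free.get)  # best remaining resource
        mapping[i] = best
        del free[best]
    return mapping

# The neediest stage receives the most powerful node, and so on down.
print(streamline_assign([3, 9, 5], {'a': 1, 'b': 4, 'c': 2}))
```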


**Algorithm Design** **Greedy Algorithm**

### Greedy Algorithm

A greedy algorithm iteratively obtains the greatest immediate gain based on certain local optimality criteria at each step.

We calculate the end-to-end delay or maximum frame rate for the mapping of a new module onto the current node (when node reuse is allowed) or onto one of its neighbor nodes, and choose the optimal one.

This algorithm makes a mapping decision at each step based only on the current information, without considering the effect of this local decision on the mapping performance in later steps.

*The complexity of this algorithm is O(m × n)*

- *m denotes the number of modules in the linear pipeline*
- *n is the number of nodes in the network*
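A hedged sketch of such a greedy mapping for minimum end-to-end delay with node reuse; it also illustrates the stated weakness, since each choice ignores later steps (and this simple version does not even force the walk to end at the destination). All names are illustrative assumptions.

```python
# Sketch of the greedy mapping for minimum end-to-end delay with node
# reuse; b must contain an entry for every adjacent pair used.

def greedy_min_delay(adj, b, p, c, m, vs, n):
    """Place M_2..M_n one at a time on the locally cheapest node."""
    cur, delay = vs, 0.0
    for j in range(2, n + 1):
        # Option 1: reuse the current node (no transfer cost).
        best_v, best_cost = cur, c[j] * m[j - 1] / p[cur]
        # Option 2: move to a neighbor, paying the transfer cost.
        for u in adj[cur]:
            cost = c[j] * m[j - 1] / p[u] + m[j - 1] / b[(cur, u)]
            if cost < best_cost:
                best_v, best_cost = u, cost
        cur, delay = best_v, delay + best_cost
    return delay
```

Each of the n modules examines at most the node's neighbors, matching the stated O(m × n) complexity.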


**Implementation and Experimental Results**

### Implementation

We conduct an extensive set of mapping experiments using a wide variety of simulated application pipelines and computing networks.

We generate these simulation datasets by randomly varying the pipeline and network attributes within a suitably selected range of values.

For each mapping problem, we designate a source node and a destination node to run the first and last modules of the pipeline, respectively.

### Performance comparison of the three algorithms


### Performance Comparison of Minimum End-to-end Delay for Three Algorithms

*The x-axis represents the case number and there are 20 cases.*

### Performance Comparison of Maximum Frame Rate for Three Algorithms

**Conclusions and Future Work**


### Conclusions

We designed an ELPC scheme based on dynamic programming that strategically maps modules of computing pipelines to shared or dedicated network environments to achieve the minimum end-to-end delay and maximum frame rate.

The experimental results show that the ELPC exhibits superior mapping performance over the other methods.


### Future Work

We will study the pipeline mapping problem for maximum frame rate in the case of node reuse.

We will also extend linear pipelines to graph workflows, and study the complexity of, and develop efficient solutions to, graph-workflow mapping problems in distributed environments.