Optimizing Network Performance of Computing Pipelines in Distributed Environments

(1)


Qishi Wu¹, Yi Gu¹, Mengxia Zhu², Nageswara S.V. Rao³

¹Dept of Computer Science, University of Memphis

²Dept of Computer Science, Southern Illinois University

³Computer Science & Math Div, Oak Ridge National Laboratory

2008 IPDPS

(2)

Outline

1 Introduction

2 Cost Models and Problem Formulation

3 Algorithm Design: ELPC Algorithm, Streamline Algorithm, Greedy Algorithm

4 Implementation and Experimental Results

5 Conclusions and Future Work

(3)


(4)

Introduction


The demands of large-scale collaborative applications in various domains are beyond the capabilities of the traditional solutions based on standalone workstations.

Supporting high-performance computing pipelines over wide-area networks (WANs) is critical to enabling large-scale distributed scientific applications.

A number of large-scale computational applications require efficient executions of computing tasks that consist of a sequence of linearly arranged modules, also referred to as subtasks or stages.

These modules form a so-called computing pipeline between a data source and an end user.

(5)


We consider two types of large-scale computing applications comprising a number of modules or subtasks to be executed sequentially in a distributed network environment:

1 Interactive applications, where a single dataset is sequentially processed along a computing pipeline.

Goal: Minimize the end-to-end delay of a pipeline to provide fast response.

2 Streaming applications, where a series of datasets continuously flow through a computing pipeline.

Goal: Maximize the frame rate of a pipeline to achieve smooth data flow.

(6)


We construct analytical cost models for computing modules, network nodes, and communication links to estimate the computing times on nodes and the data transport times over connections.

Based on these time estimates, we present the Efficient Linear Pipeline Configuration (ELPC) method, a dynamic programming approach that partitions the pipeline modules into groups and maps them onto a set of selected computing nodes in a network.

For comparison purposes, we also implement and test the Streamline algorithm (by B. Agarwalla et al.) and a Greedy algorithm with the same simulation datasets on the same computing platform.

(7)


(8)

Cost Models and Problem Formulation

Cost Models of Pipeline and Network Components

$M_i$: a computing module

$c_i$: the computational complexity of $M_i$

$m_{i-1}$: the incoming data size

$c_i$ and $m_{i-1}$ determine the number of CPU cycles needed to complete the subtask.

$v_i$: a network node

$p_i$: the overall computing power of $v_i$

$L_{i,j}$: the communication link between $v_i$ and $v_j$

$b_{i,j}$: the bandwidth of $L_{i,j}$

$d_{i,j}$: the minimum link delay of $L_{i,j}$

(9)


The computing time of $M_i$ running on $v_j$:

$$T_{computing}(M_i, v_j) = \frac{c_i \cdot m_{i-1}}{p_j}$$

The transfer time of a message of size $m$ over $L_{i,j}$:

$$T_{transport}(m, L_{i,j}) = \frac{m}{b_{i,j}} + d_{i,j}$$
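The two cost models can be written as small helper functions. This is a minimal sketch: the function and parameter names are hypothetical, and the numbers are illustrative only, not taken from the experiments.

```python
def computing_time(c_i: float, m_prev: float, p_j: float) -> float:
    """T_computing(M_i, v_j) = c_i * m_{i-1} / p_j."""
    return c_i * m_prev / p_j


def transport_time(m: float, b_ij: float, d_ij: float = 0.0) -> float:
    """T_transport(m, L_{i,j}) = m / b_{i,j} + d_{i,j}."""
    return m / b_ij + d_ij


# Illustrative values (assumed): complexity 4, input size 100, power 50.
print(computing_time(c_i=4.0, m_prev=100.0, p_j=50.0))  # 8.0
# Illustrative values (assumed): message size 100, bandwidth 20, delay 0.5.
print(transport_time(m=100.0, b_ij=20.0, d_ij=0.5))     # 5.5
```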

(10)


We consider a transport network consisting of $k$ geographically distributed computing nodes $v_1, v_2, \ldots, v_k$.

The general pipeline consists of $n$ sequential modules $M_1, M_2, \ldots, M_n$, where $M_1$ is a data source and $M_n$ is an end user.

(11)

Problem Formulation

The objective of a general mapping scheme is to decompose the pipeline into $q$ groups of modules denoted by $g_1, g_2, \ldots, g_q$, and map them onto a selected path $P$ of $q$ nodes from a source node $v_s$ to a destination node $v_d$, where $q \in [1, \min(k, n)]$.

Path $P$ consists of a sequence of not necessarily distinct nodes $v_{P[1]} = v_s, v_{P[2]}, \ldots, v_{P[q]} = v_d$.

For each mapping, we consider two cases:

1 With node reuse, two or more modules are allowed to run on the same node.

2 Without node reuse, a node on the selected path P executes exactly one module.

(12)

Minimal total delay for interactive application

We achieve the fastest system response by minimizing the total computing and transport delay of the pipeline from the source node to the destination node.

Total delay

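$$
\begin{aligned}
T_{total}(\text{Path } P \text{ of } q \text{ nodes}) &= T_{computing} + T_{transport} \\
&= \sum_{i=1}^{q} T_{g_i} + \sum_{i=1}^{q-1} T_{L_{P[i],P[i+1]}} \\
&= \sum_{i=1}^{q} \left( \frac{1}{p_{P[i]}} \sum_{M_j \in g_i,\ j \geq 2} (c_j m_{j-1}) \right) + \sum_{i=1}^{q-1} \left( \frac{m(g_i)}{b_{P[i],P[i+1]}} \right)
\end{aligned}
$$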
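The total delay of one given mapping can be evaluated directly from this sum. The sketch below assumes a hypothetical data layout: 1-based lists `c` and `m` (index 0 unused), dictionaries `p` and `b` for node power and link bandwidth, and $m(g_i)$ taken as the output size of the last module in group $g_i$. Like the expanded formula, it uses only the $m/b$ term for transport.

```python
def total_delay(groups, path, c, m, p, b):
    """End-to-end delay of a mapping, following the T_total formula.

    groups[i] lists the 1-based module indices of the i-th group and
    path[i] is the node that runs it. c[j] and m[j] are the complexity
    and output size of module M_j.
    """
    compute = sum(
        sum(c[j] * m[j - 1] for j in g if j >= 2) / p[node]
        for g, node in zip(groups, path)
    )
    transport = sum(
        m[groups[i][-1]] / b[(path[i], path[i + 1])]
        for i in range(len(path) - 1)
    )
    return compute + transport


# Illustrative 4-module pipeline on a 3-node path (all values assumed).
c = [0.0, 0.0, 2.0, 3.0, 1.0]
m = [0.0, 10.0, 6.0, 4.0, 0.0]
p = {"s": 10.0, "a": 6.0, "d": 2.0}
b = {("s", "a"): 3.0, ("a", "d"): 2.0}
delay = total_delay([[1, 2], [3], [4]], ["s", "a", "d"], c, m, p, b)
print(delay)  # 11.0 = computing (2 + 3 + 2) + transport (2 + 2)
```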

(13)

Maximal frame rate for streaming applications

To produce the smoothest data flow for streaming applications, we maximize the frame rate, which is achieved by identifying and minimizing the time incurred on a bottleneck link or node.

Time on bottleneck

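$$
\begin{aligned}
T_{bottleneck}(\text{Path } P \text{ of } q \text{ nodes})
&= \max_{\substack{\text{Path } P \text{ of } q \text{ nodes} \\ i = 1, 2, \ldots, q-1}} \left( T_{computing}(g_i),\ T_{transport}(L_{P[i],P[i+1]}),\ T_{computing}(g_q) \right) \\
&= \max_{\substack{\text{Path } P \text{ of } q \text{ nodes} \\ i = 1, 2, \ldots, q-1}} \left( \frac{1}{p_{P[i]}} \sum_{M_j \in g_i,\ j \geq 2} (c_j m_{j-1}),\ \frac{m(g_i)}{b_{P[i],P[i+1]}},\ \frac{1}{p_{P[q]}} \sum_{M_j \in g_q,\ j \geq 2} (c_j m_{j-1}) \right)
\end{aligned}
$$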
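For a fixed mapping, the bottleneck time is the largest single computing or transport term, and the frame rate is its reciprocal. The following sketch uses a hypothetical data layout (1-based lists `c` and `m`, dictionaries `p` and `b`) and assumed values.

```python
def bottleneck_time(groups, path, c, m, p, b):
    """Largest per-node computing time or per-link transport time."""
    compute = [
        sum(c[j] * m[j - 1] for j in g if j >= 2) / p[node]
        for g, node in zip(groups, path)
    ]
    transport = [
        m[groups[i][-1]] / b[(path[i], path[i + 1])]
        for i in range(len(path) - 1)
    ]
    return max(compute + transport)


# Illustrative 4-module pipeline on a 3-node path (all values assumed).
c = [0.0, 0.0, 2.0, 3.0, 1.0]
m = [0.0, 10.0, 6.0, 4.0, 0.0]
p = {"s": 10.0, "a": 6.0, "d": 2.0}
b = {("s", "a"): 3.0, ("a", "d"): 2.0}
t = bottleneck_time([[1, 2], [3], [4]], ["s", "a", "d"], c, m, p, b)
print(t, 1.0 / t)  # bottleneck is node "a" with time 3.0
```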







(14)

Algorithm Design


(16)

Algorithm Design: ELPC Algorithm

Minimum End-to-end Delay with Node Reuse

For interactive applications, our goal is to minimize the end-to-end delay incurred on the nodes and links from the source to the destination to achieve the fastest response.

A single dataset is processed and there is only one module being executed at any particular time.

Nodes can be reused but are not shared simultaneously among different modules.

(17)

[Figure: Illustration of ELPC Mapping Scheme for Minimum End-to-end Delay]

(18)

Minimum End-to-end Delay with Node Reuse

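Let $T^j(v_i)$ denote the minimal total delay with the first $j$ modules mapped to a path from the source node $v_s$ to node $v_i$. We have the following recursion leading to the final solution $T^n(v_d)$.

Minimal total delay

$$
T^j(v_i)\Big|_{j = 2 \text{ to } n,\ v_i \in V} = \min \left( T^{j-1}(v_i) + \frac{c_j m_{j-1}}{p_{v_i}},\ \min_{u \in adj(v_i)} \left( T^{j-1}(u) + \frac{c_j m_{j-1}}{p_{v_i}} + \frac{m_{j-1}}{b_{u,v_i}} \right) \right)
$$

Base condition

$$
T^2(v_i)\Big|_{v_i \in V,\ v_i \neq v_s} =
\begin{cases}
\dfrac{c_2 m_1}{p_{v_i}} + \dfrac{m_1}{b_{v_s,v_i}}, & \forall e_{v_s,v_i} \in E \\
\infty, & \text{otherwise}
\end{cases}
$$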
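The recursion above can be sketched as a table-filling loop over modules and nodes. This is an illustrative implementation under stated assumptions, not the authors' code: the function name, the dictionary-based graph layout, and the example network are hypothetical, and the starting row ($M_1$ on $v_s$ at zero cost, all other nodes at infinity) is an assumption that reproduces the base condition for $v_i \neq v_s$.

```python
import math


def elpc_min_delay(n, nodes, adj, c, m, p, b, src, dst):
    """Return T^n(dst) for the minimum end-to-end delay recursion.

    adj[v] lists the neighbours u of v and b[(u, v)] is the bandwidth
    of link (u, v). Modules are 1-based; M_1 is pinned to src.
    """
    # Assumed starting row: only the source holds M_1, at zero cost.
    T = {v: (0.0 if v == src else math.inf) for v in nodes}
    for j in range(2, n + 1):
        nxt = {}
        for v in nodes:
            run = c[j] * m[j - 1] / p[v]
            stay = T[v] + run  # M_j reuses the node that ran M_{j-1}
            move = min(
                (T[u] + run + m[j - 1] / b[(u, v)] for u in adj[v]),
                default=math.inf,
            )  # M_j receives its input over link (u, v)
            nxt[v] = min(stay, move)
        T = nxt
    return T[dst]


# Illustrative 3-node network and 4-module pipeline (all values assumed).
links = {("s", "a"): 5.0, ("a", "d"): 4.0, ("s", "d"): 1.0}
b = {**links, **{(v, u): w for (u, v), w in links.items()}}
adj = {"s": ["a", "d"], "a": ["s", "d"], "d": ["s", "a"]}
p = {"s": 1.0, "a": 10.0, "d": 2.0}
c = [0.0, 0.0, 4.0, 2.0, 1.0]
m = [0.0, 10.0, 5.0, 2.0, 0.0]
best = elpc_min_delay(4, ["s", "a", "d"], adj, c, m, p, b, "s", "d")
print(best)  # 8.5: M_1 on s, M_2 and M_3 on a, M_4 on d
```

Each of the $n - 1$ rows inspects every node together with its neighbours, which matches the $O(n \times |E|)$ bound.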

(19)


(20)


Minimum End-to-end Delay with Node Reuse

The complexity of this algorithm is $O(n \times |E|)$

- $n$ denotes the number of modules

- $|E|$ is the number of edges

(22)

Maximum Frame Rate without Node Reuse

For streaming applications, our goal is to maximize frame rate.

The maximum frame rate a computing pipeline can achieve is limited by the bottleneck unit, which is the slowest transport link or computing node.

Node reuse in streaming applications causes resource sharing, and hence affects the optimality of the solutions to previous mapping subproblems.

We consider a restricted version of the mapping problem for maximum frame rate by limiting the use of each node to a single module.

(23)

[Figure: Illustration of ELPC Mapping Scheme for Maximum Frame Rate]

(24)

Maximum Frame Rate without Node Reuse

We attempt to find the widest network path with exactly $n$ nodes to map the $n$ modules in the pipeline on a one-to-one basis.

This problem is NP-complete.

We develop an approximate solution by adapting the method for minimum end-to-end delay with some necessary modifications.

(25)

Maximum Frame Rate without Node Reuse

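Let $\frac{1}{T^j(v_i)}$ denote the maximal frame rate with the first $j$ modules mapped to a path from the source node $v_s$ to node $v_i$.

We again have a recursion leading to the final solution $T^n(v_d)$.

Time on bottleneck

$$
T^j(v_i)\Big|_{j = 2 \text{ to } n,\ v_i \in V} = \min_{u \in adj(v_i)} \left( \max \left( T^{j-1}(u),\ \frac{c_j m_{j-1}}{p_{v_i}},\ \frac{m_{j-1}}{b_{u,v_i}} \right) \right)
$$

Base condition

$$
T^2(v_i)\Big|_{v_i \in V,\ v_i \neq v_s} =
\begin{cases}
\max \left( \dfrac{c_2 m_1}{p_{v_i}},\ \dfrac{m_1}{b_{v_s,v_i}} \right), & \forall e_{v_s,v_i} \in E \\
\infty, & \text{otherwise}
\end{cases}
$$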
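A heuristic along the lines of this recursion might look like the sketch below. Several details are assumptions that the slides do not spell out: each table entry also stores the path that produced it so that a node already on that path is skipped (no node reuse), the destination is reserved for the last module, and the function name and example network are hypothetical. Because only one path is kept per entry, the result is approximate.

```python
import math


def elpc_max_frame_rate(n, nodes, adj, c, m, p, b, src, dst):
    """Return (bottleneck time, path) for n modules on n distinct nodes."""
    # state[v] = (bottleneck time, path) for the first j modules ending at v.
    state = {v: (math.inf, ()) for v in nodes}
    state[src] = (0.0, (src,))
    for j in range(2, n + 1):
        nxt = {v: (math.inf, ()) for v in nodes}
        for v in nodes:
            if v == dst and j < n:
                continue  # assumption: dst is reserved for the last module
            for u in adj[v]:
                t_u, path_u = state[u]
                if t_u == math.inf or v in path_u:
                    continue  # unreachable, or v already runs a module
                t = max(t_u, c[j] * m[j - 1] / p[v], m[j - 1] / b[(u, v)])
                if t < nxt[v][0]:
                    nxt[v] = (t, path_u + (v,))
        state = nxt
    return state[dst]


# Illustrative 4-node network and 4-module pipeline (all values assumed).
links = {("s", "a"): 5.0, ("a", "x"): 4.0, ("x", "d"): 2.0,
         ("s", "x"): 10.0, ("a", "d"): 1.0}
b = {**links, **{(v, u): w for (u, v), w in links.items()}}
adj = {"s": ["a", "x"], "a": ["s", "x", "d"],
       "x": ["s", "a", "d"], "d": ["a", "x"]}
p = {"s": 1.0, "a": 10.0, "x": 5.0, "d": 2.0}
c = [0.0, 0.0, 4.0, 2.0, 1.0]
m = [0.0, 10.0, 5.0, 2.0, 0.0]
t, path = elpc_max_frame_rate(4, ["s", "a", "x", "d"], adj, c, m, p, b, "s", "d")
print(t, path, 1.0 / t)  # bottleneck 4.0 on path s, a, x, d; frame rate 0.25
```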

(26)


(27)


(28)


Streamline Algorithm

Agarwalla et al. proposed a grid scheduling algorithm for graph dataflow scheduling in a network with n resources and n × n communication links.

This algorithm considers application requirements in terms of

1 Per-stage computation and communication needs

2 Application constraints on co-location of stages

3 Availability of computation and communication resources

This scheduling heuristic expects to maximize the throughput of an application by assigning the best resources to the most needy stages at each step.

The complexity of this algorithm is $O(m \times n^2)$

- $m$ is the number of stages or modules

- $n$ is the number of nodes

(29)


(30)


Greedy Algorithm

A greedy algorithm iteratively obtains the greatest immediate gain based on certain local optimality criteria at each step.

For each new module, we calculate the end-to-end delay or frame rate that results from mapping it onto the current node (when node reuse is allowed) or onto one of its neighbor nodes, and choose the best option.

This algorithm makes a mapping decision at each step based only on the current information, without considering the effect of this local decision on the mapping performance in later steps.

The complexity of this algorithm is $O(m \times n)$

- $m$ denotes the number of modules in the linear pipeline

- $n$ is the number of nodes in the network
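A greedy mapping for the end-to-end delay case with node reuse could be sketched as follows. The slides do not say how the greedy method reaches the destination, so forcing the last module onto the destination (and giving up when it is not reachable in one step) is an assumption, as are the function name and the example network.

```python
import math


def greedy_min_delay(n, adj, c, m, p, b, src, dst):
    """Map modules one at a time, always taking the cheapest next step."""
    cur, total, mapping = src, 0.0, [src]
    for j in range(2, n + 1):
        # Candidates: stay on the current node or hop to a neighbour.
        candidates = [cur] + list(adj[cur])
        if j == n:
            # Assumption: the last module must land on the destination.
            candidates = [v for v in candidates if v == dst]
            if not candidates:
                return math.inf, mapping
        costs = {
            v: c[j] * m[j - 1] / p[v]
            + (0.0 if v == cur else m[j - 1] / b[(cur, v)])
            for v in candidates
        }
        best = min(costs, key=costs.get)
        total += costs[best]
        cur = best
        mapping.append(best)
    return total, mapping


# Illustrative 3-node network and 4-module pipeline (all values assumed).
links = {("s", "a"): 5.0, ("a", "d"): 4.0, ("s", "d"): 1.0}
b = {**links, **{(v, u): w for (u, v), w in links.items()}}
adj = {"s": ["a", "d"], "a": ["s", "d"], "d": ["s", "a"]}
p = {"s": 1.0, "a": 10.0, "d": 2.0}
c = [0.0, 0.0, 4.0, 2.0, 1.0]
m = [0.0, 10.0, 5.0, 2.0, 0.0]
total, mapping = greedy_min_delay(4, adj, c, m, p, b, "s", "d")
print(total, mapping)  # 8.5 ['s', 'a', 'a', 'd']
```

Each step looks only at the current node and its neighbours, so a locally cheap choice can lead to a poor or infeasible mapping later on.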

(31)


(32)

Implementation and Experimental Results

Implementation

We conduct an extensive set of mapping experiments using a wide variety of simulated application pipelines and computing networks.

We generate these simulation datasets by randomly varying the pipeline and network attributes within a suitably selected range of values.

For each mapping problem, we designate a source node and a destination node to run the first module and the last module of the pipeline.

(33)

Performance comparison of the three algorithms

(34)

Implementation and Experimental Results

Performance Comparison of Minimum End-to-end Delay for the Three Algorithms

The x-axis represents the case number and there are 20 cases.

(35)

Performance Comparison of Maximum Frame Rate for the Three Algorithms

(36)

Conclusions and Future Work


(37)

Conclusions

We designed an ELPC scheme based on dynamic programming that strategically maps modules of computing pipelines to shared or dedicated network environments to achieve the minimum end-to-end delay and maximum frame rate.

The experimental results show that ELPC exhibits superior mapping performance over the Streamline and Greedy algorithms on the simulated test cases.

(38)

Conclusions and Future Work

Future Work

We will study the pipeline mapping problem for maximum frame rate in the case of node reuse.

We will also extend linear pipelines to graph workflows, study the complexity of graph workflow mapping problems in distributed environments, and develop efficient solutions to them.