Branch-and-bound task allocation with task-clustering-based pruning
Yung-Cheng Ma∗, Tien-Fu Chen, Chung-Ping Chung
Department of Computer Science and Information Engineering, National Chiao-Tung University, 1001 Ta Hsueh Road, Hsinchu 30050, Taiwan
Received 8 June 2000; received in revised form 5 July 2004
Abstract
We propose a task allocation algorithm that aims at finding an optimal task assignment for any parallel program on a given machine configuration. The approach traverses a state–space tree that enumerates all possible task assignments. The efficiency of the algorithm comes from applying a pruning rule at each traversed state to decide, by exploiting a dominance relation and task-clustering heuristics, whether traversal of a given sub-tree is required. The pruning rules eliminate partial assignments that violate the clustering of tasks while still keeping some optimal assignment in the future search space. In contrast to previous state–space searching methods for task allocation, the proposed pruning rules significantly reduce the time and space required to obtain an optimal assignment and lead the traversal to a near-optimal assignment within a small number of states. Experimental evaluation shows that the pruning rules make the state–space searching approach feasible for practical use.
© 2004 Published by Elsevier Inc.
Keywords: Task allocation; Branch-and-bound; Pruning rule; Dominance relation; State–space searching
1. Introduction
Advances in hardware and software technologies have led to the use of parallel and distributed computing systems. To execute a parallel program efficiently, the mapping of program tasks to processors should consider both load balancing and reducing communication overhead. This paper studies such a task allocation problem.
Several research works address the task allocation problem. Although the problem has been shown to be NP-complete [3], a number of heuristics have been proposed [4,8,9,11,14,15,19,23]. A drawback of these heuristics is the poor quality of the assignments they find [5]. On the other hand, [1,2,7,12,13,16–18,20] proposed state–space searching methods that differ in problem formulation for various applications and machine configurations. The state–space searching approach finds an optimal assignment at the cost of intractable time and space complexity. Ahmad and Kwok [1] proposed pruning rules and a parallelization method to reduce the time to find an optimal solution for assigning precedence-constrained graphs. In this paper, we follow the task graph model of [18], which models a set of parallel processes without precedence constraints, and propose pruning rules to improve the efficiency of the state–space searching method.

∗ Corresponding author. Fax: +886-3-5724176. E-mail address: ycma@csie.nctu.edu.tw (Y.-C. Ma).
0743-7315/$ - see front matter © 2004 Published by Elsevier Inc. doi:10.1016/j.jpdc.2004.08.002
The key idea of the proposed pruning rule is to detect task clustering in the task graph. We observe that tasks can be grouped such that each group is a set of heavily communicating tasks while inter-group communication weights are relatively small. While traversing the state–space, the proposed algorithm detects task clustering from the traversal history and tries to prune partial assignments that violate the detected clustering. We prove that the proposed pruning rule preserves some optimal assignment in the future search space, which guarantees the optimality of the solution found. Moreover, our experiments show that the proposed algorithm traverses only a low-order-polynomial number of states to reach a near-optimal assignment. Hence, when time and space are limited, a near-optimal assignment can still be obtained. This makes the proposed algorithm feasible for practical use.
This paper is organized as follows. Section 2 models the task allocation problem as a state–space searching problem. Section 3 describes the basic idea of the proposed pruning rule. Section 4 describes the dominance relation, which is the basis for deriving our pruning rule. Section 5 describes the proposed pruning rule. Section 6 describes the proposed task allocation algorithm and the space management policy. Section 7 presents experiments showing the effectiveness of the proposed pruning rules. Finally, a conclusion is given in Section 8.
2. Modeling task allocation problem
In this section, we present how the task allocation problem is formulated and transformed into a state–space searching problem. This section defines the terminology used in this paper and gives the framework of our proposed task allocation algorithm.
2.1. Formulating task allocation problem
We follow [4,9,18] to formulate the task allocation problem. This formulation assumes that there are few or no precedence relationships and synchronization requirements, so that processor idleness is negligible. Contentions on communication links are also ignored.
The optimization problem is formulated as follows. The input to a task allocation algorithm is a task graph G and a machine configuration M. The output, called a complete assignment, is a mapping from the set of tasks T to the set of processors P. An optimal assignment is a complete assignment with minimum cost. The cost of an assignment is the turn-around time of the last processor to finish its execution. To find an optimal assignment, the branch-and-bound algorithm goes through several partial assignments, in which only a subset of the tasks has been assigned. We now define the terminology needed to formulate the task allocation problem.
A parallel program is represented as a task graph G(T, E, e, c). The vertex set of the task graph is the set of tasks T = {t0, t1, . . . , tn−1}. Each task ti ∈ T represents a program module. The edge set E of the task graph represents communication between tasks: two tasks ti and tj are connected by an edge if ti communicates with tj. Each task ti ∈ T is associated with a weight e(ti) representing the execution time of the task. Each edge (ti, tj) ∈ E is given a weight c(ti, tj) representing the amount of data transferred between tasks ti and tj.

An example task graph is depicted in Fig. 1. Each vertex is a task; the number on each task is the execution weight e(ti) of task ti, and the number on edge (ti, tj) is the communication weight c(ti, tj). Throughout this article, we will use this task graph to demonstrate the idea behind our algorithm.
Fig. 1. Example of a task graph (tasks t0–t12 with execution weights on vertices and communication weights on edges).
The machine configuration is represented as M(P, d). P = {p0, p1, . . . , pm−1} is the set of all processors. Each pair of processors pk, pl ∈ P, k ≠ l, is associated with a distance d(pk, pl) representing the latency of transferring one unit of data between pk and pl. If two tasks ti and tj are assigned to different processors pk and pl, respectively, the time required for task ti to communicate with tj is estimated to be c(ti, tj) · d(pk, pl). The communication time between two tasks within the same processor is assumed to be zero.
A machine configuration example is depicted in Fig. 2. We take a hierarchical architecture as an example. The machine consists of two subnets: it takes 5 units of time to transfer a unit of data between two processors in the same subnet and 20 units between two processors in different subnets. Throughout this paper, we will use this hierarchical architecture to demonstrate the idea of our task allocation algorithm. However, the proposed algorithm can also be applied to other machine configurations with non-uniform distances between processors.
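The hierarchical configuration above can be sketched as follows; the helper `make_distance` and the subnet labels are our own naming, not the paper's.

```python
# Sketch (ours) of the machine configuration M(P, d) of Fig. 2:
# four processors in two subnets, distance 5 within a subnet, 20 across.
def make_distance(subnet_of, intra=5, inter=20):
    d = {}
    for pk in subnet_of:
        for pl in subnet_of:
            if pk == pl:
                d[(pk, pl)] = 0          # intra-processor communication is free
            elif subnet_of[pk] == subnet_of[pl]:
                d[(pk, pl)] = intra      # same subnet
            else:
                d[(pk, pl)] = inter      # different subnets
    return d

subnet_of = {"p0": 0, "p1": 0, "p2": 1, "p3": 1}
d = make_distance(subnet_of)
```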
A complete assignment Ac is a mapping that maps the set of tasks T to the set of processors P. To find a complete assignment, our task allocation algorithm examines several partial assignments. A partial assignment A is a mapping that maps Q, a proper subset of T, to the set of processors P.
The turn-around time of processor pk under a partial/complete assignment A, denoted TAk(A), is defined as the time to execute all tasks assigned to pk plus the time these tasks spend communicating with tasks not assigned to pk. That is,

$$\mathrm{TA}_k(A) = \sum_{t_i:\,A(t_i)=p_k} e(t_i) + \sum_{t_i:\,A(t_i)=p_k}\ \sum_{t_j:\,A(t_j)\neq p_k} c(t_i,t_j)\, d(p_k, A(t_j)). \qquad (1)$$

The cost of a partial/complete assignment is the turn-around time of the last processor to finish its execution:

$$\mathrm{cost}(A) = \max_{p_k \in P} \mathrm{TA}_k(A). \qquad (2)$$
Fig. 2. Example of a machine configuration: (a) the clustered architecture and (b) the distance matrix d(pk, pl):

        p0   p1   p2   p3
  p0     0    5   20   20
  p1     5    0   20   20
  p2    20   20    0    5
  p3    20   20    5    0
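Eqs. (1) and (2) translate directly into code. The sketch below is ours, not the paper's: an assignment is a dict from task to processor, and e, c, d are plain dicts of execution weights, communication weights, and distances. The edge weights c(t0, t2) = 150 and c(t1, t2) = 400 are illustrative values chosen so that the turn-around times TA0(A2) = 3750 and TA1(A2) = 3050 quoted in the Section 4.2 example are reproduced.

```python
# Sketch (ours) of Eq. (1) and Eq. (2) for a partial/complete assignment A.
def TA(pk, A, e, c, d):
    # Execution time of tasks assigned to pk ...
    execution = sum(e[ti] for ti in A if A[ti] == pk)
    # ... plus communication with assigned tasks on other processors.
    communication = sum(c.get((ti, tj), 0) * d[(pk, A[tj])]
                        for ti in A if A[ti] == pk
                        for tj in A if A[tj] != pk)
    return execution + communication

def cost(A, e, c, d, procs):
    # Eq. (2): turn-around time of the last processor to finish.
    return max(TA(pk, A, e, c, d) for pk in procs)

e = {"t0": 600, "t1": 400, "t2": 300}
c = {("t0", "t1"): 500, ("t1", "t0"): 500,
     ("t1", "t2"): 400, ("t2", "t1"): 400,
     ("t0", "t2"): 150, ("t2", "t0"): 150}
d = {(pk, pl): 0 if pk == pl else 5          # p0, p1 in the same subnet
     for pk in ("p0", "p1") for pl in ("p0", "p1")}

A2 = {"t0": "p0", "t1": "p0", "t2": "p1"}    # the assignment A2 of Fig. 6
```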
Fig. 3. State–space tree: the root is the empty assignment, level i branches on the processor of task ti (ti→p0, ti→p1, . . .), internal nodes are partial assignments, and leaves are complete assignments (goal nodes).
An optimal assignment Aopt is a complete assignment with minimum cost:

$$\mathrm{cost}(A_{\mathrm{opt}}) = \min\{\mathrm{cost}(A_c) \mid A_c \text{ is a complete assignment}\}. \qquad (3)$$
2.2. Transforming to the state–space searching problem—A∗-algorithm
We solve the task allocation problem by state–space searching with pruning rules. Shen and Tsai [18] proposed a state–space search algorithm without pruning to solve the task allocation problem. This state–space search method is known as the A∗-algorithm [6], which has been proven to guarantee the optimality of the solution obtained. Based on the A∗-algorithm, we add a pruning rule to reduce the search space to be traversed. In our experiments, this A∗-algorithm serves as the baseline for comparison with our branch-and-bound algorithm.
As illustrated in Fig. 3, the state–space tree represents all possible task assignments. We use an (n + 1)-level m-ary tree to enumerate all possibilities of assigning n tasks to m processors. In the branch-and-bound literature, a node in the state–space tree is called a branching state. In this study, a branching state represents either a partial or a complete assignment, depending on whether it is an internal node or a leaf of the state–space tree. In the remainder of this article, we use the terms branching state and partial/complete assignment interchangeably.

The traversal proceeds as follows. During the traversal, an active set [10] (also called the open set in some literature [6]), denoted ActiveSet, keeps track of all partial/complete assignments that have been explored but not visited. In each iteration of the traversal, the following operations are performed:
Step 1: Remove a partial/complete assignment Av from ActiveSet and visit Av.
Step 2: If Av is a complete assignment, terminate the traversal and return Av as the output.
Step 3: Check whether the sub-trees derived from Av need further traversal by using the pruning rule.
Step 4: If the sub-tree of Av needs further traversal, put each child node of Av in the state–space tree into ActiveSet.
For simplicity, we use ActiveSet(k) to denote the contents of ActiveSet at the beginning of the kth iteration, and Av(k) to denote the partial/complete assignment visited in the kth iteration.
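The four steps above amount to a best-first search in which ActiveSet is a priority queue keyed by the lower bound L(·). A minimal sketch (ours; the callbacks `lower_bound`, `children`, `is_complete`, and `prune` are placeholders for the quantities defined in the text):

```python
# Best-first sketch (ours) of the four-step traversal loop.
import heapq

def traverse(root, lower_bound, children, is_complete, prune):
    active = [(lower_bound(root), 0, root)]   # the ActiveSet, keyed by L(.)
    counter = 1                               # tie-breaker for equal bounds
    while active:
        _, _, Av = heapq.heappop(active)      # Step 1: visit best state
        if is_complete(Av):                   # Step 2: first goal reached
            return Av                         # is returned as the output
        if prune(Av):                         # Step 3: apply the pruning rule
            continue
        for child in children(Av):            # Step 4: expand the sub-tree
            heapq.heappush(active, (lower_bound(child), counter, child))
            counter += 1
    return None
```

A toy run on two binary-choice tasks with `sum` as the bound returns the all-zero assignment first, illustrating that the first goal popped has minimum bound.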
We follow the approach of Shen and Tsai [18] to determine the traversal order. For each partial/complete assignment A, a lower bound, denoted L(A), on all complete assignments extended from A (or on A itself, in case A is a complete assignment) is estimated. In each iteration of the traversal, the partial/complete assignment Av with minimum L(·) is removed from ActiveSet and visited. L(A) is computed according to the additional cost of assigning the tasks not yet assigned in A.
Given a partial assignment A in which Q ⊆ T has been assigned, we define ACk(tj → pl, A) to reflect the additional cost on processor pk if task tj is assigned to processor pl:

$$\mathrm{AC}_k(t_j \to p_l, A) = e(t_j) + \sum_{t_i:\,A(t_i)\neq p_k} c(t_i,t_j)\, d(p_k, A(t_i)) \quad \text{if } p_l = p_k, \qquad (4)$$

$$\mathrm{AC}_k(t_j \to p_l, A) = \sum_{t_i:\,A(t_i)=p_k} c(t_i,t_j)\, d(p_k, p_l) \quad \text{if } p_l \neq p_k. \qquad (5)$$
For a partial assignment A, the cost lower bound L(A) for all complete assignments extended from A is estimated to be

$$L(A) \equiv \max_{p_k \in P}\Bigl[\mathrm{TA}_k(A) + \sum_{t_i \text{ not assigned in } A}\ \min_{p_l \in P} \mathrm{AC}_k(t_i \to p_l, A)\Bigr]. \qquad (6)$$
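A sketch (ours) of the additional-cost terms of Eqs. (4) and (5) and the lower bound of Eq. (6); TA restates Eq. (1) so the snippet is self-contained, and the two-task instance at the end is a toy example.

```python
def TA(pk, A, e, c, d):
    # Eq. (1): turn-around time of pk under (partial) assignment A.
    return (sum(e[ti] for ti in A if A[ti] == pk) +
            sum(c.get((ti, tj), 0) * d[(pk, A[tj])]
                for ti in A if A[ti] == pk
                for tj in A if A[tj] != pk))

def AC(pk, tj, pl, A, e, c, d):
    # Eqs. (4)/(5): additional cost on pk if unassigned tj goes to pl.
    if pl == pk:
        return e[tj] + sum(c.get((ti, tj), 0) * d[(pk, A[ti])]
                           for ti in A if A[ti] != pk)
    return sum(c.get((ti, tj), 0) * d[(pk, pl)]
               for ti in A if A[ti] == pk)

def L(A, tasks, procs, e, c, d):
    # Eq. (6): lower bound on the cost of every completion of A.
    return max(TA(pk, A, e, c, d) +
               sum(min(AC(pk, tj, pl, A, e, c, d) for pl in procs)
                   for tj in tasks if tj not in A)
               for pk in procs)

# Toy instance (illustrative weights).
e = {"t0": 10, "t1": 20}
c = {("t0", "t1"): 2, ("t1", "t0"): 2}
d = {(a, b): 0 if a == b else 5 for a in ("p0", "p1") for b in ("p0", "p1")}
A = {"t0": "p0"}   # partial assignment: t1 still unassigned
```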
Without pruning rules, the method presented so far is the A∗-algorithm [6], which was first applied to task allocation by Shen and Tsai [18]. The A∗-algorithm traverses all partial assignments with L(·) less than the optimal cost. We propose a pruning rule to reduce the state–space to be traversed.
3. Basic idea of the proposed pruning rule
The development of the pruning rule is based on the clustering of tasks. As shown in Fig. 4, tasks are grouped such that each group contains heavily communicating tasks. The key observation is that a group may contain a set of tasks suitable to be placed on the same processor, or a set of tasks suitable to be placed in the same subnet of the hierarchical architecture. While traversing the state–space tree, our branch-and-bound algorithm detects the clustering of tasks and tries to prune those partial assignments that violate the clustering heuristic. The effectiveness of the pruning rule thus depends on whether the tasks can be clearly clustered into groups.

Fig. 4. Sample clustering of tasks according to communication weights: groups of tasks suitable to be placed in the same processor, and larger groups suitable to be placed in the same subnet.
The development of the pruning rule consists of two phases. In Section 4, we first develop a dominance relation. This dominance relation is effective only when a small cut is met. In Section 5, we further integrate the detection of task clustering with the dominance relation to form an enhanced pruning rule.
4. Pruning search space by dominance relation
We first develop a dominance relation to serve as the basis for the pruning rule. Consider two partial assignments A1 and A2 in which the same set of tasks has been assigned. Suppose cost(A1) ≤ cost(A2). We call A1 the winner and A2 the loser. Let A1-best and A2-best be the complete assignments with minimum cost in the sub-trees below A1 and A2, respectively. We want to be able to check whether the winner–loser relationship can be reversed, that is, whether cost(A1-best) > cost(A2-best) is possible. Our proposed dominance relation claims that what may reverse the winner–loser relationship is the weights of the edges between assigned and unassigned tasks in the task graph. The dominance relation is effective in pruning the search space when these weights are small.
4.1. Formalization of dominance relation
Definition 1 (Dominance relation). Let A1 and A2 be two partial assignments in which the same set of tasks has been assigned. A1 dominates A2 if we can guarantee that cost(A1-best) ≤ cost(A2-best), where A1-best and A2-best are the complete assignments with minimum cost extended from A1 and A2, respectively.

Fig. 5. Idea behind deriving the dominance relation: (a) selection of partial/complete assignments (A′1 and A′2 with A′1(ti) = A′2(ti) for ti ∈ S) and (b) classification of tasks into those on pk and those not on pk (Qk, Q̄k, Sk, S̄k).
The inference rule we use to derive a dominance relation is as follows. We omit the proof since it is a direct consequence of Definition 1.

Corollary 1 (Inference rule for deriving the dominance relation). Let A1 and A2 be two partial assignments. A1 dominates A2 if, for any complete assignment A′2 extended from A2, there exists a complete assignment A′1 extended from A1 such that TAk(A′2) − TAk(A′1) ≥ 0 for each processor pk.
The idea behind deriving the dominance relation is depicted in Fig. 5. The assignments A1, A2, A′1, and A′2 concerned in Corollary 1 are shown in Fig. 5(a), where S = T − Q. A′1 and A′2 are chosen to have the same future extension. We rewrite the turn-around time equation according to the task classification shown in Fig. 5(b). In addition to TAk(A2) − TAk(A1), the communication time between assigned and to-be-assigned tasks in A1 (A2) also contributes to TAk(A′2) − TAk(A′1). This gives a lower-bound estimate on TAk(A′2) − TAk(A′1). The proposed dominance relation checks whether A2 can be pruned according to this estimated lower bound on the turn-around time difference.
We introduce the following notation:
• Execution(R) = Σ_{ti∈R} e(ti), where R is a set of tasks.
• Communication(R1, R2) = Σ_{ti∈R1} Σ_{tj∈R2} c(ti, tj) · d(A′a(ti), A′a(tj)), where R1 and R2 are sets of tasks.
Following the classification of tasks shown in Fig. 5(b), we rewrite the turn-around time equation in the following lemma. The proof is omitted since it is a direct computation from the turn-around time formula.

Lemma 1 (Reformulating the turn-around time). Let Aa be a partial assignment and A′a a complete assignment extended from Aa. Let Q be the set of tasks assigned in Aa and S the set of tasks not assigned in Aa. Then

$$\mathrm{TA}_k(A'_a) = \mathrm{TA}_k(A_a) + \mathrm{Execution}(S_k(A'_a)) + \mathrm{Communication}(Q_k(A_a), \overline{S}_k(A'_a)) + \mathrm{Communication}(\overline{Q}_k(A_a), S_k(A'_a)) + \mathrm{Communication}(S_k(A'_a), \overline{S}_k(A'_a)), \qquad (7)$$

where
• Qk(Aa) = {ti ∈ Q | Aa(ti) = pk} and Q̄k(Aa) = Q − Qk(Aa),
• Sk(A′a) = {ti ∈ S | A′a(ti) = pk} and S̄k(A′a) = S − Sk(A′a).
Before stating the dominance relation, we define the turn-around time difference lower bound TADLk(A1, A2). Let A1 and A2 be two partial assignments with the same set of tasks Q assigned, and let S = T − Q. TADLk(A1, A2) is a lower bound on TAk(A′2) − TAk(A′1), where A′1 and A′2 are arbitrary complete assignments extended from A1 and A2, respectively, such that A′1(ti) = A′2(ti) for each task ti ∈ S. TADLk(A1, A2) is estimated as

$$\mathrm{TADL}_k(A_1, A_2) \equiv \mathrm{TA}_k(A_2) - \mathrm{TA}_k(A_1) + \sum_{t_i \in S}\ \min_{p_l \in P}\bigl(\mathrm{AC}_k(t_i \to p_l, A_2) - \mathrm{AC}_k(t_i \to p_l, A_1)\bigr). \qquad (8)$$
We then check whether A2 can be pruned by computing TADLk(A1, A2) for each processor pk. If TADLk(A1, A2) is greater than or equal to zero for each processor pk, then TAk(A′2) − TAk(A′1) ≥ 0 for each processor pk, and hence we can prune A2. This is stated in the following theorem.
Theorem 1 (Dominance relation for space pruning). Let A1 and A2 be two partial assignments in which the same set of tasks has been assigned. If TADLk(A1, A2) ≥ 0 for each processor pk, then A1 dominates A2.
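The test of Theorem 1 can be sketched as follows (our code, with TA and AC restating Eqs. (1), (4), and (5) so the snippet is self-contained; the three-task instance at the end is a toy example in which keeping the heavily communicating pair {t0, t1} together dominates splitting it).

```python
def TA(pk, A, e, c, d):
    # Eq. (1): turn-around time of pk under (partial) assignment A.
    return (sum(e[ti] for ti in A if A[ti] == pk) +
            sum(c.get((ti, tj), 0) * d[(pk, A[tj])]
                for ti in A if A[ti] == pk
                for tj in A if A[tj] != pk))

def AC(pk, tj, pl, A, e, c, d):
    # Eqs. (4)/(5): additional cost on pk if unassigned tj goes to pl.
    if pl == pk:
        return e[tj] + sum(c.get((ti, tj), 0) * d[(pk, A[ti])]
                           for ti in A if A[ti] != pk)
    return sum(c.get((ti, tj), 0) * d[(pk, pl)]
               for ti in A if A[ti] == pk)

def TADL(pk, A1, A2, S, procs, e, c, d):
    # Eq. (8): lower bound on TA_k(A2') - TA_k(A1') over all pairs of
    # extensions that assign the unassigned tasks S identically.
    return (TA(pk, A2, e, c, d) - TA(pk, A1, e, c, d) +
            sum(min(AC(pk, ti, pl, A2, e, c, d) -
                    AC(pk, ti, pl, A1, e, c, d) for pl in procs)
                for ti in S))

def dominates(A1, A2, S, procs, e, c, d):
    # Theorem 1: A1 dominates A2 if TADL_k(A1, A2) >= 0 for every pk.
    return all(TADL(pk, A1, A2, S, procs, e, c, d) >= 0 for pk in procs)

# Toy instance (illustrative weights): t0 and t1 communicate heavily.
e = {"t0": 3, "t1": 3, "t2": 3}
c = {("t0", "t1"): 10, ("t1", "t0"): 10,
     ("t1", "t2"): 1, ("t2", "t1"): 1}
d = {(a, b): 0 if a == b else 5 for a in ("p0", "p1") for b in ("p0", "p1")}
A1 = {"t0": "p0", "t1": "p0"}   # obeys the clustering of {t0, t1}
A2 = {"t0": "p0", "t1": "p1"}   # violates it
```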
Fig. 6. Example to illustrate the dominance relation: (a) the partial assignments in consideration, A1: t0→p0, t1→p0, t2→p0 (TA0(A1) = 1300) and A2: t0→p0, t1→p0, t2→p1 (TA0(A2) = 3750, TA1(A2) = 3050); (b) the task graph, with the cut edges that may affect the winner–loser relationship in bold; and (c) the effects on p0 for all possible extensions, giving TA0(A′2) − TA0(A′1) ≥ (3750 − 1300) + (−600) + (−200) ≥ 0, where −600 is due to t4 and −200 to t10.
Proof. To derive the dominance relation via Corollary 1, we pick the complete assignment A′1 extended from A1 such that A′1(ti) = A′2(ti) for each ti ∈ S. The pattern is depicted in Fig. 5(a). We want to show that TAk(A′2) − TAk(A′1) ≥ 0 for each pk.

We decompose both TAk(A′2) and TAk(A′1) as stated in Lemma 1. Since A′1(ti) = A′2(ti) for each ti ∈ S, we have
• Execution(Sk(A′2)) − Execution(Sk(A′1)) = 0, and
• Communication(Sk(A′2), S̄k(A′2)) − Communication(Sk(A′1), S̄k(A′1)) = 0.

Hence,

$$\begin{aligned} \mathrm{TA}_k(A'_2) - \mathrm{TA}_k(A'_1) &= \mathrm{TA}_k(A_2) - \mathrm{TA}_k(A_1)\\ &\quad + \bigl(\mathrm{Communication}(S_k(A'_2), \overline{Q}_k(A_2)) - \mathrm{Communication}(S_k(A'_1), \overline{Q}_k(A_1))\bigr)\\ &\quad + \bigl(\mathrm{Communication}(\overline{S}_k(A'_2), Q_k(A_2)) - \mathrm{Communication}(\overline{S}_k(A'_1), Q_k(A_1))\bigr)\\ &= \mathrm{TA}_k(A_2) - \mathrm{TA}_k(A_1) + \sum_{t_i \in S}\bigl(\mathrm{AC}_k(t_i \to A'_2(t_i), A_2) - \mathrm{AC}_k(t_i \to A'_2(t_i), A_1)\bigr). \end{aligned} \qquad (9)$$

Taking a lower bound on the turn-around time difference, we have

$$\mathrm{TA}_k(A'_2) - \mathrm{TA}_k(A'_1) \geq \mathrm{TA}_k(A_2) - \mathrm{TA}_k(A_1) + \sum_{t_i \in S}\ \min_{p_l \in P}\bigl(\mathrm{AC}_k(t_i \to p_l, A_2) - \mathrm{AC}_k(t_i \to p_l, A_1)\bigr).$$

The right-hand side of this inequality is TADLk(A1, A2) as defined previously. Hence, if TADLk(A1, A2) ≥ 0 for each pk, then A1 dominates A2. □

4.2. Example of the dominance relation
We use the task graph in Fig. 1 and the machine configuration in Fig. 2 to illustrate the idea of the dominance relation given in Theorem 1. The partial assignments concerned are A1 and A2 shown in Fig. 6(a). A1 is the winner and A2 is the loser in this comparison. We apply Theorem 1 to guarantee that the winner–loser relationship will not be reversed.
We use the example in Fig. 6 to explain the key idea of exploiting task clustering. In the task graph in Fig. 6(b), {t0, t1, t2} is a group of heavily communicating tasks that should be assigned to the same processor. In Fig. 6(a), A1 is a partial assignment obeying the task clustering and A2 is a partial assignment that violates it. The dominance relation examines the "cut", i.e., the edges between the assigned tasks {t0, t1, t2} and the remaining tasks (bolded edges in Fig. 6(b)), to test whether A2 can be pruned. The examination finds that the edges from assigned tasks to t4 and t10 are the only possible causes for A2 to win back what it has lost (cf. Fig. 6(c)). The edge weights in the cut are relatively small, and hence positive TADLk(A1, A2) values are obtained. This results in A2 being pruned. Enumerating heavily communicating tasks in consecutive order ensures that a cut with light-weight edges can be met, which improves the pruning efficiency of the dominance relation.
5. Pruning search space by task clustering
The dominance relation proposed in Section 4 is effective only when a small cut can be found. To relax this constraint, we develop a further pruning rule that combines the detection of task clustering with the dominance relation.
How well the pruning rule works depends on the task enumeration order. We assume that tasks are enumerated in an order such that heavily communicating tasks are enumerated first. We will see how such an enumeration order is obtained in Section 6. With this assumption, a task assignment has the following properties:
• A complete assignment obtained by a greedy search policy reflects the clustering of tasks.
• The first visited partial assignment that assigns a sub-graph reflects the clustering of tasks in that sub-graph.

With these properties, we obtain (1) a partial assignment Ak, called the killer, reflecting the clustering of tasks, and (2) a complete assignment Au serving as an upper bound on the optimal cost, used to test whether a candidate partial assignment A can be pruned. These are the inputs to our pruning rule.
We use the task graph in Fig. 1 and the machine configuration in Fig. 2 to illustrate how the pruning rule works, as depicted in Fig. 7. The killer Ak is a partial assignment with more tasks assigned than the candidate A has. In the example of Fig. 7, Ak reflects the clustering of tasks by showing that {t0, t1, t2} should be placed on the same processor and {t0, t1, t2, t3, t4} should be placed in the same subnet. We are thus given guidelines for extending A: (i) t2 should be assigned to p0, and (ii) t3 and t4 should be assigned to either p0 or p1.
Complete assignments extended from A can be classified into two categories: extensions following the guidelines and extensions violating them. For extensions violating the guidelines, we estimate the cost lower bound and exclude those extensions whose costs are guaranteed to be greater than or equal to cost(Au). For extensions following the guidelines, we find a dominator Ad from the killer Ak that dominates these extensions. These observations lead us to propose the pruning rule, whose criteria for pruning the search space are stated as follows.
Pruning criteria: Let Ad and A be two partial assignments in which the same set of tasks has been determined, and let Au be a complete assignment. We prune A if, for any complete assignment A′ extended from A, either (i) cost(A′) ≥ cost(Au) or (ii) there exists a complete assignment A′d extended from Ad such that cost(A′d) ≤ cost(A′).
5.1. Predicting clustering of tasks
Fig. 8 presents the procedure Compute_PA(A, Ak) for predicting the clustering of tasks. The result of this prediction is a set of possible assignments, denoted PAi, for each task ti not assigned in A. Each PAi is a set of processors to which we may assign task ti. The PAi's are determined according to a killer Ak; that is, the killer should reflect the clustering of tasks in the task graph. How such a killer is obtained will be explained in Section 5.4.
To generate a guideline for extending A, we sketch a distance hierarchy on processors centered at the "central processor" pc and map the tasks onto the distance hierarchy. Let ta be the last task assigned in A. We take pc to be the processor ta is assigned to in Ak (cf. Step 1 in Fig. 8). For each task ti assigned in Ak but not in A, we let PAi be the set of all processors with distance to pc less than or equal to d(pc, Ak(ti)) (cf. Step 2 in Fig. 8). If ti is not assigned in Ak, no prediction is made and PAi is set to the set of all processors.
5.2. Examining partial assignment using pruning rule
Fig. 9 presents the procedure PruneTest, which tests whether a partial assignment can be pruned. PruneTest calls Compute_PA to predict the guidelines for extending the candidate A. From there, the remaining work is to examine whether the sub-tree of A needs further traversal using the pruning rule.
We first test the correctness of the prediction outcome PAi. The test estimates a turn-around time lower bound for extensions violating the guidelines, denoted TALk(A, violate PAi), as follows:

$$\mathrm{TAL}_k(A, \text{violate } \mathrm{PA}_i) \equiv \mathrm{TA}_k(A) + \sum_{\substack{t_j \text{ not assigned in } A\\ t_j \neq t_i}}\ \min_{p_l \in P} \mathrm{AC}_k(t_j \to p_l, A) + \min_{p_l \notin \mathrm{PA}_i} \mathrm{AC}_k(t_i \to p_l, A). \qquad (10)$$
Fig. 7. Pruning based on task clustering. The killer Ak (t0→p0, t1→p0, t2→p0, t3→p1, t4→p1) predicts the restrictions on extending the candidate A (t0→p0, t1→p1): t2→{p0} and t3, t4→{p0, p1}. Extensions that obey the guidelines (e.g., t2→p0, t3→p1, t4→p0) are dominated by the dominator Ad; extensions that violate the guidelines (e.g., t2→p1, t3→p2, t4→p3) satisfy cost(A′) ≥ cost(Au).
Algorithm Compute_PA(A, Ak)
• input:
  – A, Ak: partial assignments; the number of tasks assigned in Ak ≥ the number of tasks assigned in A
• output:
  – PAi ⊆ P for each task ti not assigned in A (P is the set of all processors)
• method:
  1) pc ← Ak(ta), where ta is the last task assigned in A
  2) for each task ti not assigned in A do
       if ti is assigned in Ak then PAi ← { processor pk | d(pk, pc) ≤ d(Ak(ti), pc) }
       else PAi ← P

Fig. 8. Algorithm to predict the clustering of tasks.
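A transcription of the Fig. 8 procedure into Python (our sketch; partial assignments are dicts from task to processor, and the extra `task_order` argument, an assumption of this sketch, supplies the enumeration order needed to recover the last task assigned in A). On the Fig. 7 scenario it reproduces the restrictions t2→{p0} and t3, t4→{p0, p1}.

```python
def compute_PA(A, Ak, task_order, procs, d):
    # Step 1: the "central processor" pc is where the killer Ak places
    # the last task assigned in the candidate A.
    ta = [t for t in task_order if t in A][-1]
    pc = Ak[ta]
    # Step 2: a task assigned in Ak is restricted to processors no
    # farther from pc than its placement in Ak; others are unrestricted.
    PA = {}
    for ti in task_order:
        if ti in A:
            continue
        if ti in Ak:
            PA[ti] = {pk for pk in procs if d[(pk, pc)] <= d[(Ak[ti], pc)]}
        else:
            PA[ti] = set(procs)
    return PA

# The Fig. 7 scenario, with the hierarchical distances of Fig. 2.
subnet = {"p0": 0, "p1": 0, "p2": 1, "p3": 1}
d = {(a, b): 0 if a == b else (5 if subnet[a] == subnet[b] else 20)
     for a in subnet for b in subnet}
A = {"t0": "p0", "t1": "p1"}                 # candidate
Ak = {"t0": "p0", "t1": "p0", "t2": "p0",    # killer
      "t3": "p1", "t4": "p1"}
PA = compute_PA(A, Ak, ["t0", "t1", "t2", "t3", "t4"],
                ["p0", "p1", "p2", "p3"], d)
```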
Algorithm PruneTest(A, Ak, Au)
• input:
  – A, Ak: partial assignments, depth(Ak) ≥ depth(A)
  – Au: a complete assignment
• output:
  – prune = True if A can be pruned, otherwise prune = False
• method:
  1) perform Compute_PA(A, Ak) to determine PAi for each task ti not assigned in A
  2) /* exclude extensions violating PA */
     2.1) success ← False
     2.2) for each processor pk do
            if TALk(A, violate PA) ≥ cost(Au) then
              success ← True
              break
     2.3) if success = False then PAi ← P
  3) Ad ← the ancestor of Ak at the same level as A
  4) prune ← True
  5) /* dominate extensions obeying PA */
     for each processor pk do
       if TADLk(Ad, A, PA) < 0 then
         prune ← False
         break
  6) return prune

Fig. 9. Algorithm to test whether a partial assignment can be pruned.
Lemma 2. Let A be a partial assignment and A′ a complete assignment extended from A. If there exists a task ti not assigned in A such that A′(ti) ∉ PAi, then TAk(A′) ≥ TALk(A, violate PAi) for each processor pk.
Proof. The proof is similar to the estimation of the cost lower bound L(·) in [18]. The only difference is that, when taking the minimum over the additional costs to obtain a lower bound on TAk(A′), the possibilities of assigning ti to processors in PAi are excluded. □
After excluding extensions that violate the guidelines, we check the dominance imposed on the remaining extensions. The dominator Ad is the ancestor of Ak in the state–space tree at the same level as A. Similar to the procedure in Section 4, we estimate a turn-around time difference lower bound between Ad and A, denoted TADLk(Ad, A, PA), assuming that Ad and A have the same future extensions and that these extensions follow the guidelines for each task ti not assigned in A (Ad). We estimate TADLk(Ad, A, PA) as follows:

$$\mathrm{TADL}_k(A_d, A, \mathrm{PA}) = \mathrm{TA}_k(A) - \mathrm{TA}_k(A_d) + \sum_{t_i \text{ not assigned}}\ \min_{p_l \in \mathrm{PA}_i}\bigl(\mathrm{AC}_k(t_i \to p_l, A) - \mathrm{AC}_k(t_i \to p_l, A_d)\bigr). \qquad (11)$$
Compared to the TADLk(Ad, A) defined in Section 4, the two quantities are estimated in similar ways. The difference is that the future extensions of Ad and A are restricted to the PAi's when estimating TADLk(Ad, A, PA); TADLk(Ad, A) = TADLk(Ad, A, PA) if each PAi contains all of the processors.
Theorem 2 (Pruning rule). Let Ad and A be two partial assignments in which the same set of tasks has been determined, and let Au be a complete assignment. The PAi's are guidelines for extending A, one for each task ti not assigned in A. If

(i) for each task ti not assigned in A, there exists a processor pk such that TALk(A, violate PAi) ≥ cost(Au), and
(ii) TADLk(Ad, A, PA) ≥ 0 for each processor pk,

then the pruning criteria are satisfied and A can be pruned.
Proof. By Lemma 2, hypothesis (i) implies that complete assignments extended from A that violate the guidelines PAi have a cost greater than or equal to cost(Au). The remainder of the proof is to estimate a lower bound on TAk(A) − TAk(Ad). This is similar to Theorem 1, except that the possibilities of extending A to an assignment that violates the guidelines PAi are ignored. The lower bound on TAk(A) − TAk(Ad) is thus estimated to be TADLk(Ad, A, PA) as defined above. This proves the theorem.
The procedure PruneTest uses Theorem 2 to test whether A can be pruned. Hypothesis (i) of Theorem 2 is guaranteed by Step 2, and Step 5 of the procedure checks whether hypothesis (ii) holds. The procedure then returns a result indicating whether A can be pruned.
The advantage of using the pruning rule in Theorem 2 instead of the dominance relation in Theorem 1 is that the space can be pruned earlier during the traversal. For the example given in Fig. 7, this advantage is shown in Fig. 10. If we used the dominance relation of Theorem 1 as the pruning rule, the bolded partial assignments would also be traversed. The reduction in search space is an exponential function of the depth of the clustering of tasks that we can detect.
5.3. Obtaining an upper bound on the optimal cost
To check whether a partial assignment A can be pruned, the procedure PruneTest uses two additional inputs: (1) a complete assignment Au serving as an upper bound on the optimal cost and (2) a killer Ak reflecting the clustering of tasks. Another use of such an Au is to serve as an "imperfect solution" when a "perfect solution" cannot be found. The task allocation problem is well known to be NP-complete [2]. If the optimal assignment cannot be found subject to the time and space constraints, an "imperfect solution", that is, a complete assignment that may not be optimal, is returned as the output. In this section, we describe how such an Au can be obtained.
We use a greedy search approach to obtain a complete assignment Au. A pointer p is used to indicate the status of the greedy search. At the beginning, p points at the starting node (the partial assignment currently visited) in the state–space tree. In each step, we move p down to the child with the minimum cost. The procedure terminates when (1) p points at a partial assignment with a cost greater than that of the present Au, or (2) p points at a complete assignment. Au is then updated if a better complete assignment is found. We use greedy search not only for its simplicity but also because a low-cost complete assignment can be obtained if a careful task enumeration order is applied. Assume the tasks are enumerated in an order such that heavily communicating tasks are enumerated consecutively. The complete assignment obtained will then reflect the clustering of tasks and is likely to have a low cost.
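The greedy descent can be sketched as follows; children, is_complete, and cost are problem-specific callables with illustrative names:

```python
def greedy_complete(A, children, is_complete, cost, Au):
    """Greedy descent from partial assignment A: repeatedly move to
    the cheapest child; stop on a complete assignment or when the
    cost already exceeds cost(Au)."""
    p = A
    while not is_complete(p):
        p = min(children(p), key=cost)  # child with minimum cost
        if cost(p) > cost(Au):
            return Au                   # cannot improve the upper bound
    return p if cost(p) < cost(Au) else Au
```

The caller keeps whichever of the returned assignment and the old Au is cheaper, so the upper bound never worsens.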
To illustrate the idea, we take the task graph in Fig. 1 and the machine configuration in Fig. 2 as an example. Consider a greedy search starting from the partial assignment {t0 → p0, t1 → p0}. Part of the greedy search path is shown in Fig. 11. The greedy search will assign t2 to p0 next, since it is the child of {t0 → p0, t1 → p0} with the lowest cost. This selection indicates that t0, t1, and t2 may need to be placed on the same processor. Similarly, t3 will be assigned to p1 following
(Fig. 10 shows the state–space tree for the example: extensions t2 → p1, p2, or p3 are excluded because their cost is at least cost(Au), the sub-tree below A is dominated by Ad, and the killer Ak predicts the guideline for extending A; saved partial assignments are marked.)
Fig. 10. Space savedby the pruning criteria.
(Fig. 11 shows the children of each node on the greedy search path: from {t0 → p0, t1 → p0}, the assignments t2 → p0, t3 → p1, and t4 → p1 are selected in turn.)
Fig. 11. Greedy search on the state–space tree.
the parent partial assignment {t0 → p0, t1 → p0, t2 → p0}, also reflecting the clustering of tasks. Following the same procedure, we obtain a complete assignment that obeys the task clustering guideline.
5.4. Obtaining killers reflecting clustering of tasks
In addition to the complete assignment Au, a partial assignment Ak reflecting the clustering of tasks is also helpful to enhance the pruning rule. To increase the possibility of pruning a partial assignment, we may find multiple killers to form a KillerSet, instead of only one killer. The procedure PruneTest is then performed for each killer in the KillerSet to test whether a partial assignment can be pruned.
Partial assignments reflecting the clustering of tasks can be obtained from the proposed task enumeration order and the state–space tree traversal order. A partial assignment covers a sub-graph of the task graph. With the assumption that heavily communicating tasks are enumerated consecutively, we can capture part of the clustering of tasks in the sub-graph.
Since we traverse the state–space tree in minimum-L(•)-first order, the first visited partial assignment containing a given sub-graph is the one with minimum L(•) among all partial assignments containing that sub-graph. The first visited partial assignment containing a sub-graph therefore indicates the clustering of tasks; otherwise it would have a large L(•).
We follow the principle that the first visited partial assignment indicates the clustering of tasks to obtain killers. We expect that a candidate partial assignment A will be pruned if it violates the clustering of tasks somewhere on the path from the root to its branching state in the state–space tree. Partial assignments that have taken advantage of the clustering of the tasks assigned by A are those that (1) have a common ancestor with A in the state–space tree, (2) are visited earlier than A, and (3) are deeper than A in the state–space tree, so that the sub-graph contained in A is also contained in them. This leads to the design of our heuristic scheme for obtaining killers.
To realize the scheme, a link to the deepest descendant node is associated with each visited partial assignment. For each partial assignment Aa, we associate a pointer deep(Aa) pointing at the deepest partial assignment visited in the sub-tree of Aa. If two or more partial assignments at the same level of the state–space tree are visited, deep(Aa) points at the first one visited, which has the smallest cost lower-bound estimate L(•) over all its extensions. The KillerSet is the set of deep(Aa) over all ancestors Aa of A, together with the complete assignment Au:
KillerSet(A) = {deep(Aa) | Aa is an ancestor of A} ∪ {Au}.
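The deepest-link bookkeeping and the KillerSet construction can be sketched as follows, assuming the traversal maintains parent and depth maps for visited states (dictionary-based, illustrative names):

```python
def update_deep_links(Av, parent, deep, depth):
    """On visiting Av, make deep(Aa) point at the deepest visited
    node in Aa's sub-tree, for every ancestor Aa of Av; ties at the
    same depth keep the first-visited node (strict comparison)."""
    deep[Av] = Av
    Aa = parent.get(Av)
    while Aa is not None:
        if depth[Av] > depth[deep[Aa]]:
            deep[Aa] = Av
        Aa = parent.get(Aa)

def killer_set(A, parent, deep, Au):
    """KillerSet(A) = {deep(Aa) | Aa ancestor of A} union {Au}."""
    killers = {Au}
    Aa = parent.get(A)
    while Aa is not None:
        killers.add(deep[Aa])
        Aa = parent.get(Aa)
    return killers
```

Because ties keep the first-visited node, and the traversal is minimum-L(•)-first, each deep link points at the best branching state seen in that sub-tree so far.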
The determination of the KillerSet is depicted in Fig. 12. The number in each node is the L(•) of the partial assignment represented by the node. For each visited node Aa, the dashed link represents the deepest link deep(Aa). When a partial assignment A is visited, we follow the deepest links along all ancestors of A to obtain the KillerSet. In this example, the KillerSet used for pruning A is {A6, A4} plus Au. That is, for each sub-tree (of the state–space tree) containing A, we pick the best branching state visited in that sub-tree to try to prune A.
(Fig. 12 shows part of the state–space tree; traversed nodes carry their L(•) values, and dashed arrows are deepest links: deep(A0) = A4, deep(A1) = A4, deep(A2) = A6, so KillerSet(A) = {A6, A4}.)
Fig. 12. Deepest link to determine the KillerSet.
6. Branch-and-bound task allocation with preprocessing

We now present the task allocation algorithm using the pruning rules. We show how a good enumeration order is obtained in Section 6.1. In Section 6.2, the branch-and-bound algorithm is presented along with its correctness proof.
6.1. Preprocessing to determine the task enumeration order
We have seen the importance of the task enumeration order in previous sections. For the following reasons, tasks should be enumerated in such an order that tasks with high communication are enumerated first:
• To arrive at a small cut to exploit the dominance relation before space overflow occurs.
• To obtain killers that take advantage of the clustering of tasks.
• To obtain a low-cost complete assignment serving as an upper bound on the optimal cost.
The task enumeration order is determined by applying the max-flow min-cut algorithm recursively to partition the task graph. Each time the max-flow min-cut procedure is applied, the set of tasks is decomposed into two partitions connected by a minimum cut. We repeat the partitioning recursively until each partition contains only one task. The partitioning process can be represented by a tree in which each leaf represents a group containing only one task. The enumeration order is then the order of the leaf nodes in a depth-first traversal. For instance, the partitioning process for the task graph in Fig. 1 is depicted in Fig. 13. Following this result, we obtain the enumeration order that has been used for illustration in the previous discussion.
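The recursive bipartitioning can be sketched as follows. For brevity, a brute-force minimum cut replaces the max-flow min-cut procedure the paper uses; this is adequate only for small groups and is purely illustrative:

```python
from itertools import combinations

def min_cut_bipartition(tasks, comm):
    """Brute-force minimum-cut bipartition over a communication map
    {(task_a, task_b): weight}; a stand-in for max-flow min-cut."""
    tasks = sorted(tasks)
    best = None
    for r in range(1, len(tasks) // 2 + 1):
        for left in combinations(tasks, r):
            left = set(left)
            # total weight of edges crossing the cut
            cut = sum(w for (a, b), w in comm.items()
                      if (a in left) != (b in left))
            if best is None or cut < best[0]:
                best = (cut, left, set(tasks) - left)
    return best[1], best[2]

def enumeration_order(tasks, comm):
    """Depth-first recursive bipartitioning; the leaf order is the
    task enumeration order."""
    tasks = sorted(tasks)
    if len(tasks) == 1:
        return tasks
    left, right = min_cut_bipartition(tasks, comm)
    return enumeration_order(left, comm) + enumeration_order(right, comm)
```

Heavily communicating tasks end up in the same partition at every level, so they appear consecutively in the resulting order.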
6.2. The optimal branch-and-bound algorithm
The branch-and-bound algorithm is shown in Fig. 14. It is based on the A∗ traversal scheme, with the addition of the pruning rules and the related implementation code presented in Section 5. We now show that an optimal assignment is obtained by the proposed algorithm if neither time-out nor overflow of the ActiveSet occurs.
For convenience, we introduce some terminology and notation. A complete assignment Ac is said to be in the future search space of ActiveSet(k) if either Ac ∈ ActiveSet(k) or there exists a partial assignment Aa ∈ ActiveSet(k) such that Ac can be derived from Aa. On the other hand, we say Ac is lost from ActiveSet(k) if Ac is not in the future search space of ActiveSet(k). The depth of a partial/complete assignment A, denoted depth(A), is the length of the path from the root to the branching state representing A in the state–space tree.
The difficulty in showing the correctness of the algorithm is that the pruning rules may remove some partial assignments that can lead to optimal assignments. Fortunately, it can be guaranteed that other optimal assignments remain in the future search space after pruning. When an optimal assignment is pruned, we can always find another optimal assignment that survives in the future search space, as shown in Fig. 15. Provided that some optimal assignments survive in the future search space, we show that the termination condition implies the optimality of the solution obtained.

Lemma 3. Assume that no overflow of the ActiveSet occurs. Then, during the traversal, some optimal assignments always survive in the future search space.
Proof. We prove this by induction on the number of iterations i. The induction hypothesis is that
• for any optimal assignment Aopt-0 not in the future search space, there exists another optimal assignment Aopt-k surviving in the future search space such that depth(Ak) ≥ depth(A0), where A0 and Ak are the last visited ancestors of Aopt-0 and Aopt-k, respectively.
The hypothesis holds at the beginning, since no optimal assignment is lost at initialization. Assume the induction hypothesis holds at the beginning of a certain iteration, and suppose a partial assignment A0 is pruned in this iteration while A0 can be extended to some optimal assignment Aopt-0. The proof is to find the Aopt-k and Ak described in the induction hypothesis.

In this case, A0 must have been pruned by some dominator A1, which can also be extended to an optimal assignment Aopt-1 (otherwise the pruning criterion would be violated). Let A′1 be the last visited ancestor of Aopt-1. By the pruning rule, part of the sub-tree below A1 must be traversed, and hence depth(A′1) ≥ depth(A1) = depth(A0). If Aopt-1 is not
(Fig. 13 shows the recursive bipartitioning: {t0, t1, ..., t12} is split into {t0, ..., t7} and {t8, ..., t12}, and each group is split recursively until every leaf contains a single task.)
Fig. 13. Determining the task enumeration order.
Algorithm BB-Alloc(G, M)
• /* initialization phase */
  – L(root of the state–space tree) ← 0
  – ActiveSet ← {root of the state–space tree}
  – Obtain Au by performing greedy search starting at the root of the state–space tree
• while not time-out do /* traversal phase */
  1) remove a partial/complete assignment Av with minimum L(•) from ActiveSet and perform the following to visit Av
     1.1) /* update deepest link for all ancestors of Av */
          deep(Av) ← Av
          for each Aa: ancestor of Av in the state–space tree do
            if depth(Av) > depth(deep(Aa)) then deep(Aa) ← Av
     1.2) /* try to improve Au */
          perform greedy search starting from Av to obtain a complete assignment Ac
          if cost(Ac) < cost(Au) then Au ← Ac
  2) if Av is a complete assignment then Au ← Av and terminate the traversal by returning Au
  3) /* check whether the sub-tree of Av needs further traversal */
     KillerSet ← {deep(Aa) | Aa is an ancestor of Av in the state–space tree} ∪ {Au}
     prune ← False
     for each Ak ∈ KillerSet do
       prune ← PruneTest(Ak, Au, Av)
       if prune = True then break
  4) /* expand children of Av if the sub-tree of Av needs further traversal */
     if prune = False then
       for each child A′v of Av in the state–space tree do
         compute L(A′v) and insert A′v into ActiveSet

Fig. 14. The branch-and-bound algorithm for task allocation.
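The traversal phase can be condensed into the following Python skeleton. The deep-link and KillerSet bookkeeping of Steps 1.1 and 3 is folded into the prune_test callable for brevity, and all parameters are problem-specific stand-ins, not the paper's implementation:

```python
import heapq

def bb_alloc(root, L, children, is_complete, cost,
             greedy, prune_test, max_iters=10**6):
    """Skeleton of the BB-Alloc traversal (cf. Fig. 14)."""
    Au = greedy(root)                    # initial upper bound
    active = [(L(root), root)]           # min-heap ordered by L(.)
    for _ in range(max_iters):           # stands in for "while not time-out"
        if not active:
            break
        _, Av = heapq.heappop(active)    # minimum-L(.) state
        if is_complete(Av):
            return Av                    # optimal if no overflow occurred
        Ac = greedy(Av)                  # try to improve Au
        if cost(Ac) < cost(Au):
            Au = Ac
        if not prune_test(Av, Au):       # does the sub-tree need traversal?
            for child in children(Av):
                heapq.heappush(active, (L(child), child))
    return Au                            # imperfect solution on time-out
```

Because L(•) is a lower bound on all extensions, the first complete assignment popped from the heap is optimal, matching Theorem 3 below.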
lost, then Aopt-1 itself survives in the future search space, and hence the induction hypothesis holds for the next iteration (cf. Fig. 15(a)). In case Aopt-1 is lost, the induction hypothesis states that there exists a surviving optimal assignment Aopt-k with last visited ancestor Ak such that depth(Ak) ≥ depth(A′1) ≥ depth(A1) = depth(A0) (cf. Fig. 15(b)). Hence we obtain the required Aopt-k and Ak for Aopt-0 and A0. This proves the lemma.
Theorem 3 (Correctness of the proposed algorithm). The proposed branch-and-bound algorithm ends up with an optimal assignment if neither space overflow of the ActiveSet nor time-out occurs.
Proof. If not timed out, some complete assignment Ac is removed from the ActiveSet in the last iteration of the traversal, and this Ac is the complete assignment returned. We want to show that Ac is optimal.

We prove this by contradiction. Suppose Ac is not optimal. Consider the contents of ActiveSet(j) for the last iteration j. Lemma 3 states the existence of an optimal assignment Aopt in the future search space of ActiveSet(j). Thus, we have cost(Ac) > cost(Aopt) since Aopt is optimal. Let Aa be the ancestor of Aopt (or Aopt itself) in ActiveSet(j). By the definition of L(•), L(Aa) ≤ cost(Aopt), and hence L(Aa) ≤ cost(Aopt) < cost(Ac) = L(Ac). However, Ac is
(Fig. 15 illustrates the argument: in (a), A0 is pruned by its dominator A1, whose extension Aopt-1 survives; in (b), a chain of dominators A1, A2, ..., Ak leads to a surviving optimal assignment Aopt-k.)
Fig. 15. Finding an optimal assignment surviving in the future search space.
(Fig. 16 shows the task graph of Fig. 1 mapped onto processors p0 and p1 in two different ways, with the node and edge weights of Fig. 1.)
Fig. 16. Unfair comparison when assigning different sets of tasks: (a) partial assignment A1 and (b) partial assignment A2.
removed from the ActiveSet before Aa under the minimum-L(•)-first order, which requires L(Ac) ≤ L(Aa). This produces a contradiction and hence proves the theorem.
6.3. Space-efficient ActiveSet organization
The remaining problem in designing the task allocation algorithm is the design of the ActiveSet such that (1) the partial/complete assignment with minimum L(•) can be removed easily, and (2) a near-optimal assignment can be obtained once overflow occurs. A simple solution is to implement the ActiveSet as a heap and drop the partial/complete assignment with maximum L(•) when overflow occurs, because such an assignment is unlikely to be extended to an optimal assignment. However, this scheme has certain drawbacks. We identify two situations that reduce the effectiveness of this victim selection scheme:
• Unfair comparisons between partial assignments containing different sets of tasks.
• Unfair comparisons between partial assignments using different numbers of processors.
Fig. 16 depicts an example of an unfair comparison between partial assignments assigning different sets of tasks. Consider mapping the task graph in Fig. 1 to the machine configuration in Fig. 2. Fig. 16 depicts two partial assignments A1 and A2 containing different sub-graphs with L(A1) < L(A2). However, A2 can be extended to an optimal assignment while A1 cannot. A partial assignment containing a smaller number of tasks usually has a lower cost and L(•), but this does not mean it has a better future extension. Our solution is to keep partial assignments assigning different numbers of tasks in different heaps.
Fig. 17 depicts an example of an unfair comparison between partial assignments using different numbers of processors. We have two partial assignments A1 and A2 with L(A1) < L(A2). A1 is the best assignment of the sub-graph containing tasks {t0, t1, t2, t3, t4}. However, A2 can be extended to an optimal assignment while A1 cannot. A partial assignment lacks knowledge of the future load to be assigned, and hence A1 uses too many processors for tasks