Research Model - National Science Council, Executive Yuan: Research Project Final Report


3. Research Model

3.1 Identical Cluster Grid

To define the problem explicitly, let C be the number of clusters, n_i (1 ≤ i ≤ C) the number of computing nodes in cluster i, and K the number of sub-blocks into which each data block is divided (the degree of refinement). The notation <G(C):{n₁, n₂, n₃, …, n_C}> denotes a cluster grid of C clusters in which cluster i provides n_i computing nodes. The symbols are defined in Table 1.

Table 1. The definition of symbols.

Symbol                     Definition
C                          The number of clusters.
K                          The degree of refinement.
n_i                        The number of computing nodes in each cluster.
G(C):{n₁, n₂, …, n_C}      The cluster grid model.

We consider two models of cluster grid when performing data reallocation: the identical cluster grid and the non-identical cluster grid. Figure 1 illustrates the localization technique with an example. The degree of data refinement is set to three (K = 3). The example assumes an identical cluster grid that consists of three clusters, each of which provides three nodes to the computation. In the algorithm phase, to accomplish the fine-grained data distribution, each processor partitions its own block into K sub-blocks and distributes them to the corresponding destination processors in ascending order of processor id, as specified in most data-parallel programming languages.

For example, processor P₀ divides its data block A into a₁, a₂, and a₃; it then distributes these three sub-blocks to processors P₀, P₁ and P₂, respectively. Because P₀, P₁ and P₂ all belong to the same cluster as P₀, these three communications are interior. The same step on processor P₁, however, generates three external communications: P₁ divides its local data block B into b₁, b₂, and b₃ and distributes them to processors P₃, P₄ and P₅, respectively. As P₁ belongs to Cluster-1 while P₃, P₄ and P₅ belong to Cluster-2, all three communications are external. Figure 1(a) summarizes all messaging patterns of this example in a communication table. Messages {a1, a2, a3}, {e1, e2, e3} and {i1, i2, i3} are interior communications (|I| = 9); all the others are external communications (|E| = 18).
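The counts above can be reproduced with a short sketch. It assumes, as the example suggests but the text does not state as a formula, that sub-block j of the block held by processor P_s is sent to processor (s * K + j) mod (C * n); the function name count_communications is hypothetical and used only for illustration.

```python
def count_communications(C, n, K):
    """Count interior and external messages on an identical cluster grid.

    Assumption (inferred from the worked example, not stated in the text):
    sub-block j of the block on processor s goes to processor (s*K + j) mod (C*n).
    """
    total = C * n                          # total number of processors
    interior = external = 0
    for src in range(total):               # processor P_src holds one data block
        for j in range(K):                 # j-th sub-block of that block
            dst = (src * K + j) % total    # assumed cyclic destination order
            if src // n == dst // n:       # source and destination share a cluster
                interior += 1
            else:
                external += 1
    return interior, external


print(count_communications(3, 3, 3))   # (9, 18), the case of Figure 1(a)
print(count_communications(4, 3, 4))   # (12, 36), the case of Figure 2(a)
```

Under this assumption the sketch agrees with the |I| and |E| values reported for the unmapped distributions.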

Figure 1. Communication tables of data reallocation over the cluster grid. (a) Without data mapping. (b) With data mapping.

The idea of changing the logical processor mapping [15, 16] has been employed in previous research to minimize the data transmission time of runtime array redistribution. In the cluster grid, we can derive a mapping function that produces a realigned sequence of logical processor ids so that communications are grouped into the local cluster. Given an identical cluster grid with C clusters, the new logical id that replaces processor P_i is determined by New(P_i) = (i mod C) * K + ⌊i / C⌋, where K is the degree of data refinement and ⌊i / C⌋ is the integer quotient. Figure 1(b) shows the communication table of the same example after applying this reordering scheme. The source data is distributed according to the reordered sequence of processor ids computed by the mapping function, i.e., <P₀, P₃, P₆, P₁, P₄, P₇, P₂, P₅, P₈>. As a result, we have |I| = 27 and |E| = 0.
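As an illustration, the mapping function can be evaluated directly for the example with C = 3, n = 3 and K = 3. The sketch below is not the original implementation; the helper names and the destination rule (sub-block j of the b-th block goes to processor (b * K + j) mod (C * n)) are assumptions inferred from the example.

```python
def new_id(i, C, K):
    """Mapping function from the text: New(P_i) = (i mod C) * K + floor(i / C)."""
    return (i % C) * K + (i // C)


def count_with_sequence(seq, n, K):
    """Count interior/external messages when the b-th block is held by seq[b].

    Assumption: the destination of sub-block j of block b is (b*K + j) mod len(seq).
    """
    total = len(seq)
    interior = 0
    for block, src in enumerate(seq):
        for j in range(K):
            dst = (block * K + j) % total
            if src // n == dst // n:       # same cluster on an identical grid
                interior += 1
    return interior, total * K - interior


C, n, K = 3, 3, 3
sequence = [new_id(i, C, K) for i in range(C * n)]
print(sequence)                              # [0, 3, 6, 1, 4, 7, 2, 5, 8]
print(count_with_sequence(sequence, n, K))   # (27, 0)
```

The generated sequence corresponds to <P₀, P₃, P₆, P₁, P₄, P₇, P₂, P₅, P₈>, and every message becomes interior.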

When K (the degree of refinement) is not equal to n (the number of grid nodes in each cluster), the mapping function becomes impracticable. For this case, previous work proposed a grid node replacement algorithm that optimizes the distribution localities of data reallocation. According to the relative position of the first of the consecutive sub-blocks produced by each processor, we can determine the best target cluster as the candidate for node replacement. Combined with a load balance policy among clusters, this algorithm can effectively improve data locality. Figure 2 gives an example of data reallocation on a cluster grid that has four clusters, each of which provides three processors. The degree of data refinement is set to four (K = 4). Figure 2(a) shows the original reallocation communication patterns, for which |I| = 12 and |E| = 36.

Figure 2. Communication tables of data reallocation on the identical cluster grid (C = 4, n = 3, K = 4). (a) Without data mapping. (b) With data mapping.

If, in the source distribution, we move block B to a processor that resides in cluster-2 (P3, P4 or P5) or cluster-3 (P6, P7 or P8), the communications of some of its sub-blocks become local to that cluster. Because cluster-2 and cluster-3 are allocated the same number of sub-blocks of block B in the target distribution, processors in these two clusters have the same priority for node replacement. In this way, P3 is first assigned to replace P1. For block C, most sub-blocks will be reallocated to processors in cluster-4, so the first available node, P9, is assigned to replace P2. A similar determination for block D results in P1 replacing P3. For block E, cluster-2 and cluster-3 again receive the same number of sub-blocks, so processors in both clusters are candidates for node replacement. However, according to the load balance policy among clusters, cluster-2 has two processors still available for node replacement while cluster-3 has three, so our algorithm selects P6 to replace P4. Figure 2(b) gives the communication table obtained with this mapping of data to logical grid nodes. We obtain |I| = 28 and |E| = 20.
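The per-cluster destination counts that drive these decisions can be tabulated with a small sketch. The helper names are hypothetical, clusters are numbered from 1 as in the text, and the destination rule (sub-block j of the b-th block goes to processor (b * K + j) mod N, with N the total number of processors) is an assumption inferred from the examples.

```python
from collections import Counter


def cluster_of(p, sizes):
    """Return the 1-based cluster number of processor p for the given cluster sizes."""
    for c, size in enumerate(sizes, start=1):
        if p < size:
            return c
        p -= size
    raise ValueError("processor id out of range")


def destination_histogram(block, sizes, K):
    """Map each destination cluster to the number of sub-blocks it receives."""
    total = sum(sizes)
    dests = [(block * K + j) % total for j in range(K)]   # assumed cyclic order
    return dict(Counter(cluster_of(d, sizes) for d in dests))


sizes = [3, 3, 3, 3]            # four clusters of three processors (Figure 2)
for name, block in zip("BCDE", range(1, 5)):
    print(name, destination_histogram(block, sizes, 4))
```

For this grid, blocks B and E split evenly between cluster-2 and cluster-3, block C sends three of its four sub-blocks to cluster-4, and block D sends three of its four sub-blocks to cluster-1, which matches the choices described above.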

3.2 Non-identical Cluster Grid

Let us consider a more complex example on a non-identical cluster grid, in which the number of nodes differs from cluster to cluster. In this case the algorithm needs global information about the cluster grid to estimate the best target cluster as the candidate for node replacement. Figure 3 shows a non-identical cluster grid composed of four clusters that provide 2, 3, 4 and 5 processors, respectively. The degree of refinement is set to K = 5. Figure 3(a) presents the table of original communication patterns, which consists of 19 interior communications and 51 external communications. Applying our node replacement algorithm, the derived sequence of logical grid nodes is <P₂, P₅, P₉, P₃, P₆, P₁₀, P₄, P₁₁, P₀, P₇, P₁₂, P₁, P₈, P₁₃>. Figure 3(b) gives the communication table obtained with this mapping of data to logical grid nodes, which produces 46 interior communications and 24 external communications. This result reflects the effectiveness of the node replacement algorithm in terms of minimizing inter-cluster communication overheads.
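The node replacement idea can be approximated by a greedy sketch. This is not the authors' algorithm as published: the selection rule (prefer the cluster that receives the most sub-blocks of the block, break ties by the larger number of still-available nodes, then by the lower cluster index), the destination rule (b * K + j) mod N, and all function names are assumptions inferred from the worked examples.

```python
def cluster_owner(sizes):
    """Return a list mapping each processor id to its 0-based cluster index."""
    owner = []
    for c, size in enumerate(sizes):
        owner.extend([c] * size)
    return owner


def greedy_node_replacement(sizes, K):
    """Assign each block to a node of the cluster receiving most of its sub-blocks."""
    owner = cluster_owner(sizes)
    total = len(owner)
    # free[c]: unassigned processor ids of cluster c, in ascending order
    free = [[p for p in range(total) if owner[p] == c] for c in range(len(sizes))]
    seq = []
    for block in range(total):
        counts = [0] * len(sizes)
        for j in range(K):
            counts[owner[(block * K + j) % total]] += 1   # assumed destination rule
        candidates = [c for c in range(len(sizes)) if free[c]]
        # most sub-blocks first, then most available nodes, then lowest index
        best = min(candidates, key=lambda c: (-counts[c], -len(free[c]), c))
        seq.append(free[best].pop(0))
    return seq


def count_communications(seq, sizes, K):
    """Return (interior, external) when the b-th block is held by seq[b]."""
    owner = cluster_owner(sizes)
    total = len(owner)
    interior = sum(
        owner[src] == owner[(block * K + j) % total]
        for block, src in enumerate(seq)
        for j in range(K)
    )
    return interior, total * K - interior


sizes, K = [2, 3, 4, 5], 5
seq = greedy_node_replacement(sizes, K)
print(seq)   # [2, 5, 9, 3, 6, 10, 4, 11, 0, 7, 12, 1, 8, 13]
print(count_communications(list(range(14)), sizes, K))   # (19, 51)
print(count_communications(seq, sizes, K))               # (46, 24)
```

Under these assumptions the sketch yields the sequence and the interior/external counts reported for Figure 3, and for the identical grid of Figure 2 (four clusters of three nodes, K = 4) it gives 28 interior and 20 external communications.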

Figure 3. Communication tables of data reallocation on non-identical cluster grid. (a) Without data mapping. (b) With data mapping.

3.3 Communication Cost of Multi-Clusters with Heterogeneous Network

The examples in the previous sections do not consider the actual communication conditions of multi-clusters connected by a heterogeneous network. Figure 4(a) shows an example of four clusters with various inter-cluster communication costs. Transmitting one unit of block data from cluster-1 to cluster-2 takes 20 time units, whereas transmitting one unit of block data from cluster-1 to cluster-3 takes 30 time units. Figure 4(b) shows the table of inter-cluster communication costs. Using this communication matrix, we can calculate the inter-cluster communication cost of data distribution for each processor. The resulting communication costs are 1865 and 885 for the distribution schemes in Figure 3(a) and 3(b), respectively. The processor mapping methods proposed in the next section instead produce the logical grid node sequences <P₄, P₅, P₁₁, P₂, P₉, P₀, P₆, P₁₀, P₁, P₇, P₁₂, P₃, P₈, P₁₃> and <P₃, P₅, P₉, P₂, P₁₀, P₁, P₆, P₁₁, P₀, P₇, P₁₂, P₄, P₈, P₁₃>, each of which costs 740 units. This result shows that these sequences incur a lower communication cost. The next section explains the research model and the calculation of the communication cost.

Figure 4. Communication model of Multi-Clusters with Heterogeneous Network. (a) Example of four clusters with various inter-cluster communication costs. (b) The communication matrix table.

3.4 Communication Model of Data Distribution in Multi-Clusters

Let V(i,j) denote the inter-cluster communication cost of sending one unit of data from cluster C_i to cluster C_j; for example, the cost of distributing data from C1 to C3 is V(1,3). For a data block β (for example, β = A) held by a node P in cluster C_i, the total cost is denoted W(β)_i and is given by W(β)_i = β_1*V(i,1) + β_2*V(i,2) + … + β_C*V(i,C), 1 ≤ i ≤ C, where β_j (1 ≤ j ≤ C) is the number of sub-blocks of β that must be sent from C_i to C_j. Figure 5 shows the communication cost of data distribution from each node, computed with the communication matrix in Figure 4(b). Consider data block A on logical node P0 in the grid model C = 4, K = 5, <G(4):{2, 3, 4, 5}>. Sub-blocks a1 and a2 of block A must be delivered to C1, sub-blocks a3, a4 and a5 must be delivered to C2, and no data is delivered to C3 or C4, so (β_1, β_2, β_3, β_4) = (2, 3, 0, 0). The cost of redistributing block A from P0 (in C1) is W(A)1 = (2*0 + 3*20 + 0*30 + 0*30) = 60, and the cost of redistributing it from P2 (in C2) is W(A)2 = (2*20 + 3*0 + 0*50 + 0*20) = 40. Likewise, W(A)3 = 125 and W(A)4 = 105.
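The arithmetic for W(A)1 and W(A)2 can be checked with a minimal sketch. Only the two rows of V that appear in the worked calculation are used; the remaining rows of the communication matrix are given in Figure 4(b) and are not reproduced here. The function name block_cost is hypothetical.

```python
def block_cost(beta, cost_row):
    """W(beta)_i = sum over j of beta_j * V(i, j) for one source cluster i."""
    if len(beta) != len(cost_row):
        raise ValueError("beta and cost_row must have one entry per cluster")
    return sum(b * v for b, v in zip(beta, cost_row))


# Rows V(1, .) and V(2, .) as they appear in the worked calculation of the text.
V_ROW_1 = [0, 20, 30, 30]
V_ROW_2 = [20, 0, 50, 20]

# Block A: two sub-blocks go to C1, three to C2, none to C3 or C4.
beta_A = [2, 3, 0, 0]

print(block_cost(beta_A, V_ROW_1))   # 60
print(block_cost(beta_A, V_ROW_2))   # 40
```

Comparing W(A)_i across clusters in this way indicates which cluster would be the cheapest host for block A; here a node in C2 costs less than the original node in C1.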

Figure 5. The total communication cost of grid model (C = 4, K = 5, <G(4):{2, 3, 4, 5}>).
