Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs


WEN-LI SHIH, National Tsing Hua University
YI-PING YOU, National Chiao Tung University

CHUNG-WEN HUANG and JENQ KUEN LEE, National Tsing Hua University

Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs, with a focus on single-thread programs. In this environment the leakage power can be controlled by inserting power-on and power-off instructions based on component usage information generated by dataflow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues.

This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that, relative to a system without a power-gating mechanism, the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs when the leakage contribution is set to 30%, and by an average of 4.27% when the leakage contribution is set to 10%. The results demonstrate that our mechanisms are effective in reducing the leakage energy of BSP multithread programs.

Categories and Subject Descriptors: D.3.2 [Programming Languages]: Language Classifications—Concurrent, distributed, and parallel languages; D.3.4 [Programming Languages]: Processors—Compilers; optimization

General Terms: Design, Languages

Additional Key Words and Phrases: Compilers for low power, leakage power reduction, power-gating mechanisms, multithreading

ACM Reference Format:

Wen-Li Shih, Yi-Ping You, Chung-Wen Huang, and Jenq Kuen Lee. 2014. Compiler optimization for reducing leakage power in multithread BSP programs. ACM Trans. Des. Autom. Electron. Syst. 20, 1, Article 9 (November 2014), 34 pages.

DOI: http://dx.doi.org/10.1145/2668119

This work is supported in part by the Ministry of Science and Technology (under grant no. 103-2220-E-007-019) and the Ministry of Economic Affairs (under grant no. 103-EC-17-A-02-S1-202) in Taiwan.

Authors’ addresses: W.-L. Shih, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan; Y.-P. You, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan; C.-W. Huang and J. K. Lee (corresponding author), Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan; email: jklee@cs.nthu.edu.tw.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

© 2014 ACM 1084-4309/2014/11-ART9 $15.00


VLIW (very long instruction word) instructions to reduce the power consumption on the instruction bus [Lee et al. 2003], reducing instruction encoding to reduce code size and power consumption [Lee et al. 2013], and gating the clock to reduce workloads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]. Compilers can reduce leakage power by employing power gating [Kao and Chandrakasan 2000; Butts and Sohi 2000; Hu et al. 2004]. Various studies have attempted to reduce the leakage power using integrated architecture- and compiler-based power-gating mechanisms [Dropsho et al. 2002; Yang et al. 2002; You et al. 2002, 2007; Rele et al. 2002; Zhang et al. 2003; Li and Xue 2004]. In these approaches, compilers insert instructions into programs to shut down and wake up components as appropriate, based on a dataflow analysis or a profiling analysis. The power analysis and instruction insertion have further been integrated into trace-based binary translation [Li and Xue 2004]. The Sink-N-Hoist framework [You et al. 2005, 2007] has been used to reduce the number of power-gating instructions generated by compilers. However, these power-gating control frameworks are only applicable to single-thread programs, and care is needed for multithread programs, since some of the threads might share the same hardware resources. Turning resources on and off requires careful consideration of cases where multiple threads are present. Herein, we extend previous work to deal with multithread systems in a bulk-synchronous parallel (BSP) model.

The BSP model, proposed by Valiant [1990], is designed to bridge the gap between the theory and practice of parallel computation. The BSP model structures a parallel machine as multiple processors with local memory and a global barrier synchronization mechanism. The threads processed by the processors are separated by synchronization points into supersteps, which form the basic unit of the BSP model. A superstep consists of a computation phase and a communication phase: processors compute on data in local memory until they encounter a global synchronization point in the computation phase, and synchronize local data with each other in the communication phase. The complexity of parallel algorithms can then be analyzed in the BSP model by considering both locality and parallelism issues. The BSP model works well for a family of parallel applications in which the tasks are balanced. However, global barrier synchronization was found to be inflexible in practice [McColl 1996], which prompted proposals for several enhanced BSP models with hierarchical groupings. NestStep [Keßler 2000] is a programming language for the BSP model that adopts nested parallelism with support for virtual shared memory. The H-BSP model [Cha and Lee 2001] splits processors into groups and dynamically runs BSP programs within each group in a bulk-synchronous fashion, while the multicore BSP [Valiant 2008, 2011] provides hierarchical multicore environments with independent communication costs. In the present study we adopted the concept of hierarchical BSP models [Keßler 2000; Cha and Lee 2001; Torre and Kruskal 1996] as the basis for a power reduction framework for use in parallel programming.
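The structure of a superstep can be made concrete with a small example. The following Python sketch, with an illustrative thread count and data (none of which come from the paper), runs one superstep in a shared-memory setting: a computation phase on local data, a communication phase that publishes results, and a barrier that closes the superstep.

import threading

NUM_THREADS = 4
barrier = threading.Barrier(NUM_THREADS)          # global synchronization point
local = [[i] * 4 for i in range(NUM_THREADS)]     # per-thread "local memory"
inbox = [None] * NUM_THREADS                      # communication buffers

def superstep(tid):
    # Computation phase: operate only on local memory.
    local[tid] = [x * 2 for x in local[tid]]
    # Communication phase: publish a result for a neighbor thread.
    inbox[(tid + 1) % NUM_THREADS] = sum(local[tid])
    # Barrier: no thread enters the next superstep until all arrive.
    barrier.wait()

threads = [threading.Thread(target=superstep, args=(t,)) for t in range(NUM_THREADS)]
for t in threads: t.start()
for t in threads: t.join()
print(inbox)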

Several methods have been proposed for analyzing the concurrency of multithread programs. May-happen-in-parallel (MHP) analysis computes which statements may be executed concurrently in a multithread program [Callahan and Subhlok 1989; Duesterwald and Soffa 1991; Masticola and Ryder 1993; Naumovich and Avrunin 1998; Naumovich et al. 1999; Li and Verbrugge 2004; Barik 2005]. The problem of precisely computing all pairs of statements that may execute in parallel is undecidable [Ramalingam 2000]; however, the problem was proved to be NP-complete under the assumption that all control paths are executable [Taylor 1983]. The general approach involves using a dataflow framework to compute a conservative estimate of MHP information.

This article presents a multithread power-gating (MTPG) framework, composed of MTPG Analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. SMT is a widely adopted processor technique that allows multithread programs to utilize functional units more efficiently by fetching and executing instructions from multiple threads at the same time. Our multithread programming model is based on hierarchical BSP models. We propose using thread fragment concurrency analysis (TFCA) to analyze MHP information among threads and MTPGA to report the component usages shared by multiple threads in hierarchical BSP models. TFCA reports the concurrency of threads, which allows power-gating candidates to be classified into those used by multiple threads and those used by a single thread. A conventional power-gating optimization framework [You et al. 2005, 2007] can be employed for candidates used by a single thread, with the compiler inserting instructions into the program to shut down and wake up components as appropriate. For candidates used concurrently by different threads, PPG instructions are adopted to turn components on and off as appropriate. Based on the TFCA, our MTPGA framework estimates the energy usage of multithread programs with our proposed cost model and inserts a pair of predicated power-on and predicated power-off operations at those positions where a power-gating candidate is first activated and last deactivated within a thread.

To the best of our knowledge, this is the first work to devise an analysis scheme for reducing leakage power in multithread programs. We performed experiments by incorporating TFCA and MTPGA into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. Our preliminary experimental results on a system with the leakage contribution set to 30% show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs converted from OpenCL kernels and by up to 10.49% for D-BSP programs, relative to the system without a power-gating mechanism; on a system with the leakage contribution set to 10%, the total energy consumption is reduced by an average of 4.27% for BSP programs and by up to 6.68% for D-BSP programs. These results demonstrate that our mechanisms are effective in reducing the leakage power in hierarchical BSP multithread environments.

The remainder of the article is organized as follows. Section 2 gives a motivating example for the problem addressed by our study. Section 3 presents the technical rationale of our work, first presenting the PPG instruction and architectures, and then summarizing our compilation flow. Section 4 presents the method of TFCA for hierarchical BSP programs while Section 5 presents our MTPGA compiler framework for power optimizations. Section 6 presents the experimental results, discussion is given in Section 7, and conclusions are drawn in Section 8.

2. MOTIVATION

A system might be equipped with a power-gating mechanism that activates and deactivates components in order to reduce the leakage current [Goodacre 2011]. In such systems, programmers or compilers should analyze the behavior of programs, investigate component utilization based on execution sequences, and insert power-gating instructions into programs [You et al. 2002, 2006] to ensure that the leakage current is gated appropriately. Traditional compiler analysis algorithms for low power focus on single-thread programs, and these methods cannot be directly applied to multithread programs. We use the example in Figure 1 to illustrate the scenario motivating the need for new compiler schemes that reduce the power consumption in multithread environments.

Fig. 1. The traditional power-gating mechanism adopted in a single-thread or SMT environment. Both environments are equipped with two categories of components, C0 and C1, where C0 is capable of controlling the power-gating status of C1: (a) two code segments of threads T1 and T2, where power-gating instructions are inserted according to individual power-gating analysis results for threads T1 and T2 (note that op1 of T1 and op2 and op5 of T2 demonstrate cases where instructions might need more than one component, in this case C0 and C1, to complete an operation); (b), (c) how the code segments in (a) are executed in a single-thread and an SMT environment, respectively. All component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross.

Assume we have hardware equipped with two categories of functional units, named C0 and C1, where C0 is capable of controlling the power-gating status of C1, and the hardware is configurable as a single-thread or SMT environment. We first present two pseudocode segments for threads T1 and T2 in Figure 1(a), which are analyzed and processed by traditional low-power optimization analysis. Note that op1 of T1 and op2 and op5 of T2 demonstrate cases where instructions might need more than one component (in this case, C0 and C1) to complete an operation. Traditional sequential compiler analysis will yield the component utilization of every instruction. As shown in Figure 1(a), the compiler inserts two power-gating instructions "pg-off C1" at the end of both code segments because C1 is no longer used in the subsequent code of those segments. These code segments work smoothly when executed individually in single-thread environments, as shown in Figure 1(b). In the figure, all component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross. For thread T1, after instructions op1 and op2 are executed, the power-off instruction is executed at t4; hence the system can save leakage energy from the idle component C1. For thread T2, after five instructions are executed, the power-off instruction is executed at t6, which turns off component C1 to stop the leakage current.

However, when the multithread program is executed on an SMT system, the system may concurrently execute threads T1 and T2 with shared components C0 and C1, as illustrated in Figure 1(c). At time t4, thread T1 powers off C1 because the traditional compiler analysis reports that C1 will no longer be used in T1 and a power-off instruction


is inserted. However, T2 actually still uses C1 at times t4 and t5, which means that powering off C1 at t4 will make the system fail if the powered-off components rely fully on power-gating instructions, or the system will pay the penalty associated with executing T2 at t4 if the system can internally turn the components back on according to the status of the instruction queues.

The prior example indicates that the traditional single-thread analyzer cannot be naively applied to the MTPG case, as it will likely break the logic that a unit must be in the active state (i.e., powered on) before being used for processing, since the unit might be powered off by a thread while other concurrent threads are still using or about to use it. Moreover, a unit might be powered on multiple times by a set of concurrent threads. The preceding problems must be appropriately addressed when constructing power-gating controls for multithread programs. This article presents our solution for addressing this issue.

3. TECHNICAL RATIONALE

3.1. PPG Operations

Predicated execution support provides an effective means of eliminating branches from an instruction stream. Predicated or guarded execution refers to the conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate [Hsu and Davidson 1986]. Predicated instructions are fetched regardless of their predicate value. Instructions whose predicate is true are executed normally, while those whose predicate is false are nullified and thus prevented from modifying the processor state.

We include the concept of predicated execution in power-gating devices for controlling the power gating of a set of concurrent threads. We combine predicated execution into three special power-gating operations: predicated power-on, predicated power-off, and initialization operations. The main ideas are: (1) to turn on a component only when it is actually in the off state; (2) to keep track of the number of threads using the component; and (3) to turn off the component only at the last exit of all threads using the component. Note that these operations must be atomic with respect to each other in order to prevent multiple threads from accessing the controls at the same time.

—Initialization operation. An initialization operation is designed to clear all predicated bits (i.e., pgp1, pgp2, . . . , pgpN) and zero all reference counters (i.e., rc1, rc2, . . . , rcN) when the processor starts up.

—Predicated power-on operation. The predicated power-on operation takes an explicit operand and two implicit operands to record component usage and conditionally turn on a power-gating candidate. The explicit operand is power-gating candidate Ci, and the implicit operands include predicated bit pgpi of Ci and a reference counter rci of Ci. The operation consists of the following steps:

(1) power on Ci if pgpi (i.e., the predicated bit of Ci) is set;

(2) increase rci (i.e., the reference counter of Ci) by 1. The reference counter keeps track of the number of threads that reference the power-gating candidate at this time; and

(3) unset predicated bit pgpi.

—Predicated power-off operation. The predicated power-off operation also takes an explicit operand Ci and two implicit operands pgpi and rci. Predicated power-off instructions update the component usage count rci and conditionally turn off power-gating candidate Ci according to predicated bit pgpi. The operation consists of the following steps:

(1) decrease the reference counter rci by 1;

(2) set predicated bit pgpi if reference counter rci is 0; and

(3) power off Ci if predicated bit pgpi is set.
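To make these semantics concrete, the following Python sketch simulates the three operations. One lock per candidate models the required atomicity; the convention that a set predicated bit means the candidate is gated off follows from the steps above; and power_on/power_off are hypothetical stand-ins for the hardware action, which the paper implements in circuits.

import threading

N = 4                                   # number of power-gating candidates
pgp = [False] * N                       # predicated bits; True means Ci is gated off
rc = [0] * N                            # reference counters
locks = [threading.Lock() for _ in range(N)]

def power_on(i):  print(f"power on C{i}")    # stand-in for the hardware action
def power_off(i): print(f"power off C{i}")

def ppg_init():
    """Initialization: clear all predicated bits and zero all counters."""
    for i in range(N):
        pgp[i], rc[i] = False, 0

def ppg_on(i):
    """Predicated power-on for candidate Ci."""
    with locks[i]:                      # the steps must be atomic across threads
        if pgp[i]:                      # (1) power on Ci only if it is off
            power_on(i)
        rc[i] += 1                      # (2) one more thread now references Ci
        pgp[i] = False                  # (3) unset the predicated bit

def ppg_off(i):
    """Predicated power-off for candidate Ci."""
    with locks[i]:
        rc[i] -= 1                      # (1) one fewer thread references Ci
        if rc[i] == 0:                  # (2) set the bit at the last exit
            pgp[i] = True
        if pgp[i]:                      # (3) gate Ci when the bit is set
            power_off(i)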


Figure 2 illustrates the specification of these PPG instructions in pseudocode segments. Consider a power-gating candidate C1 in an SMT system with PPG support. Figure 2(a) shows the initialization operation for all PPG operations, and Figures 2(b) and 2(c) show pseudocode segments for the predicated power-on and power-off operations for C1, respectively. To support atomicity, a lock lc1 is used before and after the code segments to guarantee that these operations are executed exclusively. For efficiency reasons, hardware circuits should be used to implement this behavior in practice.

3.2. Multithread Power-Gating Framework

Algorithm 1 summarizes our proposed compiler flow of the MTPG framework for BSP models. To generate code with power-gating control in a multithread BSP program, the compiler must compute concurrency information and analyze component usage with respect to concurrent threads. Step 1 of the algorithm applies TFCA to component usages shared by multiple threads in hierarchical BSP models (the details of this algorithm are presented in Section 4); this is the hierarchical BSP version of MHP analysis. In step 2, detailed component usages are calculated via dataflow equations by referencing component-activity dataflow analysis (CADFA) [You et al. 2002, 2006]. Steps 3 and 4 insert PPG instructions according to the information gathered in the previous steps while considering the cost model (Section 5 presents our MTPGA compiler framework for power optimizations). In step 3, MTPGA arranges power-gating control among threads. In step 4, CADFA calculates the detailed component usage with regard to the arrangement of step 3. Steps 5 and 6 further merge the generated power-gating controls into a single compound instruction based on the Sink-N-Hoist framework [You et al. 2005, 2007], a compiler solution that merges power-gating instructions into a single compound instruction and reduces the number of power-gating instructions issued. Step 5 decides if and where power-gating instructions should be inserted, while step 6 attempts to merge the power-gating instructions with the Sink-N-Hoist framework. Finally, step 7 produces the power-control assembly code.

ALGORITHM 1: Multithread Power-Gating Framework
Input: A source program
Output: A program with power-gating control
begin
1  Perform thread fragment concurrency analysis for BSP programs
2  Perform component-activity dataflow analysis to get detailed component usage
3  Perform multithread power-gating analysis to arrange power-gating control among threads
4  Perform component-activity dataflow analysis with advice from MTPGA
5  Perform power-gating-instruction scheduling
6  Perform Sink-N-Hoist analysis to merge generated power-gating controls
7  Produce predicated power-gating instructions and power-gating instructions
end


Fig. 3. Illustration of one superstep of a BSP program, where eight threads (T1 to T8) are divided into six subgroups (G1 to G6). Each subgroup contains a subsuperstep. In a hierarchical BSP program, programmers are allowed to divide threads into groups, and the synchronization of threads is limited to within the groups, which forms subsupersteps inside the groups. A barrier is a synchronization point of a group in the hierarchical BSP model; therefore, all barriers in a program must belong to a specific group, as shown in the figure.

4. TFCA FOR BSP PROGRAMS

This section presents the concurrency analysis method for BSP programs. We consider hierarchical BSP models with a fixed number of threads. The operations of BSP programs are assumed to be well structured and correctly maintained by programmers, and the scheduling of threads is assumed to be explicit and correctly maintained by programmers or a static scheduler. Figure 3 presents an example of a superstep of the hierarchical BSP model, in which vertical black lines indicate threads and horizontal gray bars indicate barriers. Eight individual threads and two barriers form the superstep, where the eight threads join and are divided into six groups. In a hierarchical BSP program, programmers are allowed to divide threads into groups, and the synchronization of threads is limited to within the groups, which forms subsupersteps inside the groups. A barrier is a synchronization point of a group in a hierarchical BSP model; therefore, all barriers in a program must belong to a specific group. The threads in each group are synchronized by barriers belonging to the group, forming subsupersteps inside the groups.

The threads do not have a constant relationship. Computing the concurrency between threads actually involves considering the relations between threads that are present during a specific period, which are represented by sets of neighboring nodes in the control-flow graph (CFG), each called a thread fragment. We calculate the thread concurrency in a superstep of a group rather than over the entire BSP program. Since supersteps are executed sequentially, solving the thread concurrency of all supersteps solves the thread concurrency of the BSP program. This analysis is performed by first constructing a thread fragment graph (TFG) that represents the relationships in a superstep, and then computing the lineal thread fragments and the may-happen-in-parallel regions (MHP regions), which represent thread fragments that have a lineal relationship and thread fragments that may happen in parallel, respectively.

4.1. Thread Fragment Graph

The relationships between thread fragments in a superstep are abstracted into a directed graph named the TFG, in which a node represents a thread fragment and an edge represents the control flow. A TFG might be constructed from a single CFG or from multiple CFGs, depending on the adopted programming model.


Fig. 4. A hierarchical BSP program presented as a CFG, where four threads (T1 to T4) are divided into four supersteps by barriers. In the second superstep, the four threads are further grouped into two groups (g1 and g2). Each subgroup has its own supersteps.

For a single-program multiple-data programming model, a multithread program is a single executable file that is executed heterogeneously via conditional branches on unique thread identifiers; a TFG for a single-program multiple-data program is thus constructed from a CFG in which certain control paths are recognized as thread operations. For a multiple-program multiple-data programming model, a multithread program is composed of multiple individual executable files that are executed on different processors; in such a case, a TFG is constructed from several CFGs of the individual programs. This article adopts a single-program multiple-data programming model to construct our TFG; however, the method could be applied to multiple-program multiple-data programming models with minor revisions. Figure 4 presents a hierarchical BSP program as a CFG, where four threads (T1 to T4) are divided into four supersteps. In the first superstep, the four threads are further grouped into two groups (g1 and g2). Each subgroup has its own supersteps.

The notation used is presented in Table I. Given a CFG G = (V, E), comprising a set of nodes V and a set of edges E, we denote the set of immediate successors of a node v by Succ(v) and the set of immediate predecessors of v by Pred(v) (e.g., if there exists an edge e(v1, v2) ∈ E, then v2 ∈ Succ(v1) and v1 ∈ Pred(v2)). For convenience, we denote the sets of immediate successors and immediate predecessors of a set V0 by Succ(V0) and Pred(V0), respectively:

$$\mathit{Succ}(V_0) = \bigcup_{v \in V_0} \mathit{Succ}(v), \qquad \mathit{Pred}(V_0) = \bigcup_{v \in V_0} \mathit{Pred}(v)$$

Table I. Notation

G           a CFG G = (V, E), comprising a set of nodes V and a set of edges E
e(v, u)     a directed edge from v to u
w(v, u)     a walk from v to u, which is a sequence of connected nodes
W(v, u)     the set of all walks from v to u
Succ(v)     the set of immediate successors of node v
Pred(v)     the set of immediate predecessors of node v
Succ(V)     the set of immediate successors of a set of nodes V
Pred(V)     the set of immediate predecessors of a set of nodes V
O           the set of groups in a BSP program
g           a group in a BSP program
VS(g)       the set of begin nodes of group g
VE(g)       the set of end nodes of group g
OSUB(g)     the set of immediate subgroups of group g
VB          a set of barrier nodes
VTF         a set of nodes that belong to a thread fragment
G′          a TFG G′ = (V′, E′), comprising a set of nodes V′ and a set of edges E′
VT(t)       the set of nodes with thread t
T(v)        the thread that node v belongs to
TFG(VTF)    a mapping function that maps a thread fragment VTF ⊂ V to a node v′ ∈ G′
CFG(v′)     a mapping function that maps a node v′ ∈ V′ to a thread fragment VTF ⊂ V


A walk in the graph is a sequence of connected nodes that are not necessarily distinct. We denote a walk from v0 to vn in G by w(v0, vn):

$$w(v_0, v_n) = \langle v_0, v_1, \ldots, v_{n-1}, v_n \rangle, \quad \forall (1 \le k \le n) \wedge (k \in \mathbb{N}) : v_k \in \mathit{Succ}(v_{k-1}),$$

where $\mathbb{N}$ is the set of natural numbers. Let W(u, v) be the set of all walks from u to v in G; note that W(u, v) will be an infinite set if there is a loop between u and v.

A hierarchical BSP program might have several groups in a superstep. A group is a set of threads that are present during a certain period of time; all threads in the set are executed simultaneously and synchronized by barriers belonging to the group. A hierarchical BSP program has an implicit group that contains all threads; Figure 4 contains an implicit group g0 that contains all threads in the graph. Let O be the set of all groups in a BSP program. A subgroup g′ of a group g is a set of threads such that g′ ⊆ g. We say that a group g′ is an immediate subgroup of a group g if and only if g′ is a subgroup of g and there does not exist a group g″ such that g′ is a subgroup of g″ and g″ is a subgroup of g. The set of immediate subgroups of a group g is denoted by OSUB(g).

Barriers belonging to a group can synchronize the threads in the group. Given a group g ∈ O, the set of BSP barrier nodes is denoted by VB(g), which includes all BSP synchronization nodes in group g. The sets of beginning nodes and ending nodes of a BSP thread are denoted by VS(g) and VE(g), respectively, which contain all beginning nodes and all ending nodes of a BSP program in group g.

Barrier nodes block threads, dividing them into thread fragments and forming supersteps. A thread fragment might belong to multiple supersteps, depending on the number of prior barriers along a control flow. We say that a barrier that has n prior barriers along a control flow belongs to generation n. A barrier node might belong to more than one generation if multiple control flows with different numbers of barriers reach the barrier. For a given group g, let the number of BSP barrier nodes in a walk w be NumBarr(w, g).


Fig. 5. Supersteps for Figure 4. Nodes inside a dashed area form a superstep, named VSS. Four supersteps are shown in the figure: VSS(g0, 0), VSS(g0, 1), VSS(g0, 2), and VSS(g0, 3).

We denote the set of BSP barriers that have n prior barriers by VB(g, n), and the sets of the first and last BSP barriers of group g by VB(g, entry) and VB(g, exit), respectively:

$$\forall v_s \in V_S(g) : V_B(g, n) = \{v \in V_B \mid \exists w \in W(v_s, v) \wedge \mathit{NumBarr}(w, g) = n\}$$

$$\forall v_s \in V_S(g) : V_B(g, \mathit{entry}) = \{v \in V_B \mid \exists w \in W(v_s, v) \wedge \mathit{NumBarr}(w, g) = 0\}$$

$$\forall v_e \in V_E(g) : V_B(g, \mathit{exit}) = \{v \in V_B \mid \exists w \in W(v, v_e) \wedge \mathit{NumBarr}(w, g) = 0\}$$
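As an illustration of how barrier generations can be computed, the following sketch propagates barrier counts along walks from the start nodes. It assumes an acyclic CFG for the group and hypothetical node names, since the paper does not prescribe an algorithm for this step.

from collections import defaultdict, deque

def barrier_generations(succ, starts, barriers):
    """succ: dict node -> successor list; starts: VS(g); barriers: barrier nodes of g.
    Returns dict n -> set of barriers that have n prior barriers, i.e., VB(g, n)."""
    gens = defaultdict(set)
    seen = set()
    work = deque((s, 0) for s in starts)        # (node, barriers crossed so far)
    while work:
        v, n = work.popleft()
        if (v, n) in seen:                      # visit each (node, generation) once
            continue
        seen.add((v, n))
        k = n
        if v in barriers:
            gens[n].add(v)                      # v has n prior barriers: v in VB(g, n)
            k = n + 1                           # crossing v adds one prior barrier
        for u in succ.get(v, ()):
            work.append((u, k))
    return gens

# Hypothetical group: v1 -> b1 -> v2 -> b2, with barrier nodes b1 and b2.
succ = {"v1": ["b1"], "b1": ["v2"], "v2": ["b2"], "b2": []}
print(dict(barrier_generations(succ, ["v1"], {"b1", "b2"})))
# {0: {'b1'}, 1: {'b2'}}  -- b1 is an entry barrier (generation 0)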

A superstep in a BSP model is formed by the nodes between two adjacent barriers of the same group. Figure 5 shows the supersteps for Figure 4, where four supersteps are divided by VB(g0). For a given group g and two adjacent barrier generations n and n+1, the set of nodes of superstep VSS(g, n) in G can be derived by traversing nodes from the successors of VB(g, n) to VB(g, n+1):

$$V_{SS}(g, n) = \mathit{Succ}(V_B(g, n)) \cup \{v \in V \mid v \in \mathit{Succ}(V_{SS}(g, n)) \wedge v \notin V_B(g, n+1)\}$$

A walk in a superstep is defined as a sequence of nodes that contains no barriers of the superstep:

$$W_{SS}(u, v, g) = \{w \in W(u, v) \mid \forall o \in O_{SUB}(g) : \mathit{NumBarr}(w, o) = 0 \wedge \mathit{NumBarr}(w, g) = 0\}$$

A thread fragment, denoted by VTF, is a set of neighboring nodes of a superstep that contains no BSP barriers of a group g or its immediate subgroups, which means that a thread fragment of group g in generation n can be characterized as one of the following three cases:


Fig. 6. Superstep VSS(g0, 1) of Figure 5. Nodes inside a superstep are divided into thread fragments that are used to build a TFG. Nodes v11 and v19 are overlapped because there is a barrier node v15 inside the loop structure. With a barrier node inside a loop, nodes inside the loop may be executed multiple times in different thread fragments; therefore nodes v11 and v19 appear in different thread fragments, resulting in overlapping thread fragments.

(a) starts from a successor of barriers VB(g, n) and ends in a predecessor of barriers of the next generation VB(g, n+1) or of the entry barriers of an immediate subgroup o of g, VB(o, entry);

(b) starts from a successor of the exit barriers of an immediate subgroup o of g, VB(o, exit), and ends in a predecessor of barriers of the next generation VB(g, n+1) or of the entry barriers of another immediate subgroup o′ of g, VB(o′, entry); or

(c) starts from the entry barrier of an immediate subgroup o of g, VB(o, entry), and ends in the exit barrier of o, VB(o, exit).

For case (a), we have a thread fragment VTF(v, g) for each node v ∈ Succ(VB(g, n)):

$$V_{TF}(v, g) = \{u \mid \exists o \in O_{SUB}(g),\ \exists v' \in V_B(g, n+1) \cup V_B(o, \mathit{entry}) : W_{SS}(v, u) \ne \emptyset \wedge W_{SS}(u, v') \ne \emptyset\} \quad (1)$$

For case (b), we have a thread fragment VTF(v, o, g) for each node v ∈ Succ(VB(o, exit)):

$$V_{TF}(v, o, g) = \{u \mid \exists o' \in O_{SUB}(g),\ o' \ne o,\ \exists v' \in V_B(o', \mathit{entry}) \cup V_B(g, n+1) : W_{SS}(v, u) \ne \emptyset \wedge W_{SS}(u, v') \ne \emptyset\} \quad (2)$$

For case (c), we have a thread fragment VTF(o), where o is a subgroup and o ∈ OSUB(g):

$$V_{TF}(o) = \{v \mid \exists v' \in V_B(o, \mathit{entry}),\ \exists v'' \in V_B(o, \mathit{exit}) : W(v', v) \ne \emptyset \wedge W(v, v'') \ne \emptyset\} \quad (3)$$

VTF(o)= {v | v∈ VB(o, entry), v∈ VB(o, exit), W(v, v) = ∅ ∧ W(v, v) = ∅} (3) Figure 6 shows the thread fragments in superstep VSS(g0, 1) of Figure 5. Nodes v11 andv19 are overlapped because there is a barrier node v15 inside the loop structure. With a barrier node inside a loop, nodes inside the loop may be executed multiple times in different thread fragments; therefore nodesv11and v19 appear in different thread fragments, resulting in the thread fragments overlapping.

We can now construct a TFG for a superstep n of group g from the CFG G. A TFG, denoted by G′ = (V′, E′), is a directed graph in which each node is a thread fragment or a grouped thread fragment and each edge is a control flow between nodes. For each VTF, we add a node v′ to V′ to represent VTF. The relation from VTF to the relevant v′ is denoted by v′ = TFG(VTF). Conversely, we denote the relation from v′ ∈ V′ to VTF by VTF = CFG(v′). For two given nodes v′i ∈ V′ and v′j ∈ V′, an edge e(v′i, v′j) is added to E′ if and only if there exists an edge e ∈ E from a node of CFG(v′i) to a node of CFG(v′j).


Fig. 7. A TFG for Figure 6.

Nodes of a graph having no predecessor are called entry nodes, while those having no successor are called exit nodes. There are multiple entry nodes and exit nodes in a TFG, denoted by V′(entry) and V′(exit), respectively. A thread might have several entry nodes and exit nodes in a TFG. To unify these nodes for each thread, an initial node and a final node for each thread are introduced into the TFG as an immediate predecessor of all entry nodes of the same thread and an immediate successor of all exit nodes of the same thread, respectively. Given a thread with thread identification t, we denote its initial node and final node by V′(t, initial) and V′(t, final). The connections between the nodes are

$$\mathit{Succ}(V'(t, \mathit{initial})) = \{v \mid v \in V'(\mathit{entry}) \wedge T(v) = t\}$$

and

$$\mathit{Pred}(V'(t, \mathit{final})) = \{v \mid v \in V'(\mathit{exit}) \wedge T(v) = t\},$$

where T() is a function that returns the thread identification of a thread fragment. Figure 7 presents a TFG that abstracts the relationships of the thread fragments in Figure 6. Nodes u9 and u10 are nodes for thread fragments derived from immediate subgroups with Eq. (3), while the other nodes for thread fragments are derived with Eqs. (1) and (2). Node u5 is the node for thread fragment {v1, v5, v9}, derived with Eq. (1). Node u11 is the node for thread fragment {v24, v28, v32}, derived with Eq. (2). Node u9 is the node for thread fragment {v13, v14, v17, v18, v21, v22}, derived with Eq. (3). The entry nodes V′(entry) are {u5, u6, u7, u8}, the exit nodes V′(exit) are {u11, u12, u13, u14}, the initial nodes are {u1, u2, u3, u4}, and the final nodes are {u15, u16, u17, u18}.

We say that two TFGs G′0 and G′1 are identical if and only if every node in G′0 has a node in G′1 that is related to the same set of fragments, and vice versa:

$$G'_0 \equiv G'_1 \iff (\forall v \in G'_0)(\exists u \in G'_1)(u = \mathit{TFG}(\mathit{CFG}(v))) \wedge (\forall v \in G'_1)(\exists u \in G'_0)(u = \mathit{TFG}(\mathit{CFG}(v)))$$

4.2. Constructing TFGs

We designed a TFG construction algorithm that builds the TFG for each BSP superstep from a CFG and performs lineal thread fragment analysis on each TFG. The idea involves recursively computing the concurrency information inside a group. The algorithm terminates when any thread encounters the end of a thread or when a newly built TFG is identical to a previously built one.

Algorithm 2 is the kernel algorithm: it collects the thread fragments of a designated group, constructs the TFG of the group, and computes the concurrency information. Algorithm 2 collects thread fragments of case (c) as described in Section 4.1. The output of the algorithm is the set of nodes between the entry barrier of an immediate subgroup o of g, VB(o, entry), and the exit barrier of o, VB(o, exit), which means that Algorithm 2 is an implementation of Eq. (3).


ALGORITHM 2: TraverseGroup(CFG G, group g, start nodes Varg, blocked nodes Vblk)
Input: G: the CFG to be analyzed
Input: g: the group to analyze
Input: Varg: a set of starting nodes
Output: Vblk: a set of nodes blocked by barriers
Output: Vgtf: a set of CFG nodes that contains all nodes in the group
Used Data: Vitr: a set of CFG nodes, where nodes are iterators
Used Data: V′arg: a set of starting nodes for a subgroup
Used Data: V′blk: a set of blocked nodes for a subgroup
Used Data: VTF: a set of CFG nodes representing a thread fragment
Used Data: va, vb: CFG nodes
Used Data: G′: a TFG for a superstep
Used Data: vi, vj: TFG nodes
begin
    Initialize Vitr with Vitr ← Varg
    repeat
        repeat                                    /* Traverse a superstep of BSP */
            Let G′ = (V′, E′) be a TFG for the superstep
            foreach va ∈ Vitr do                  /* Traverse thread fragments */
                VTF ← TraverseThreadFragment(va, VTF, Vblk)
                Add a node vi into G′ and let CFG(vi) = VTF
                Collect traversed nodes by Vgtf ← Vgtf ∪ VTF
            end
            Vitr ← ∅
            foreach g′ ∈ OSUB(g) do               /* Traverse subgroups */
                Check Vblk to determine whether every thread belonging to subgroup g′ is ready
                if g′ is ready then
                    Let V′arg be the set of nodes encountering barriers of group g′
                    Update blocked nodes by Vblk ← Vblk − V′arg
                    Cross barriers before traversing subgroups: V′arg ← CrossBarrier(V′arg)
                    VTF ← TraverseGroup(G, g′, V′arg, V′blk)
                    Add a node vi into G′ and let CFG(vi) = VTF
                    Collect traversed nodes by Vgtf ← Vgtf ∪ VTF
                    Update iterators by Vitr ← CrossBarrier(V′blk)
                end
            end
        until Vitr is empty
        foreach vi, vj ∈ V′ do                    /* Build up the edges of G′ */
            if ∃e(va, vb) ∈ E, where va ∈ CFG(vi) and vb ∈ CFG(vj) then
                Add e(vi, vj) into E′
            end
        end
        Add initial nodes and final nodes, and construct edges for initial and final nodes
        ComputeMHPRegion(G′)                      /* All iterators encountered BSP sync nodes */
        Vitr ← CrossBarrier(Vblk)
    until Vblk ⊆ VB(g, exit) or two TFGs are identical
    return Vgtf
end

Algorithm 3 collects the thread fragments of a designated node until barriers are reached, namely cases (a) and (b) as described in Section 4.1; that is, Algorithm 3 is an implementation of Eqs. (1) and (2).


ALGORITHM 3: TraverseThreadFragment(node v, thread fragment VTF, blocked nodes Vblk)
begin
    if v ∈ VTF then return VTF          /* The iterator is in a circular path */
    if v is a barrier node then         /* A thread fragment has been found */
        Add v into Vblk and return VTF
    end
    Add v into VTF
    foreach v′ ∈ Succ(v) do             /* Recursively traverse successors */
        VTF ← VTF ∪ TraverseThreadFragment(v′, VTF, Vblk)
    end
    return VTF
end

In Algorithm 2, Vgtf collects all nodes in the group. Certain temporary variables are introduced to aid the collection of nodes. Varg is a set of starting nodes in a group, Vitr is a set of iterator nodes recording the locations of iterators, and Vblk is a set of blocked nodes, that is, barrier nodes identifying where iterators are blocked. Algorithm 2 begins by initializing Vitr as Varg; the nodes in Vitr are then traversed recursively by Algorithm 3. Each node visited by Algorithm 3 satisfies one of the following conditions:

—it is in VTF: the function returns because the iterator is in a circular path;

—it is blocked by a barrier: the node is added to Vblk and control returns to Algorithm 2 because a thread fragment has been found; or

—otherwise: traversal continues to its successors.

Each collected VTF set has a corresponding TFG node, and each VTF is added into Vgtf, which collects all nodes in the group and will eventually form a thread fragment of the outer group according to Eq. (3).

Once all of the nodes in Vitr have been traversed, we check the blocked set Vblk because some nodes might be blocked by barriers of subgroups. In such a case we recursively perform TraverseGroup() to traverse a subgroup with a designated V′arg and obtain a thread fragment according to Eq. (3). Two operations are required before invoking TraverseGroup() for subgroups:

—remove the nodes of V′arg from Vblk because they no longer belong to Vblk; and

—since V′arg now contains nodes that are blocked, allow V′arg to cross barriers; CrossBarrier() is a function that helps the input nodes cross a barrier and outputs the set of nodes after such barriers.

After performing TraverseGroup() for each subgroup, the output blocked set V′blk needs to cross barriers; the processed V′blk set is then added to Vitr so that thread fragments are collected in subsequent iterations.

Thread fragments of a superstep are collected when Vitr is empty. The next step involves adding edges to complete the TFG. Lineal thread fragments and MHP thread fragments (MTFs) are then analyzed, as explained in detail in Section 4.3.

After processing a superstep, we update Vitr with CrossBarrier(Vblk) and repeat the procedure to process the next superstep. The algorithm iterates over supersteps until one of two cases occurs: (1) all blocked nodes Vblk are a subset of VB(g, exit), which means that there are no further supersteps in this group; or (2) two TFGs in a group are identical, which means that there is a loop in the BSP program and we have already explored all possible combinations.


$$GEN(v) \leftarrow \{v\} \qquad (4)$$

$$KILL(v) \leftarrow \emptyset \qquad (5)$$

$$IN(v) \leftarrow \bigcup_{v' \in Pred(v)} OUT(v') \qquad (6)$$

$$OUT(v) \leftarrow IN(v) \cup GEN(v) - KILL(v) \qquad (7)$$

$$LTF(v) \leftarrow OUT(v) \cup \{v' \mid OUT(v') \ni v\} \qquad (8)$$

Fig. 8. Dataflow equations for lineal thread fragment information.


4.3. Lineal Thread Fragments Analysis and MTF

Once the TFG has been constructed, we can compute the concurrent thread fragments of a hierarchical BSP program. Instead of gathering MHP information directly, we gather the nodes that cannot happen in parallel, that is, those that have a lineal relation. We collect all nodes along the TFG in our dataflow analysis and maintain the set of lineal thread fragments by adding nodes symmetrically so as to keep the set symmetric.

The GEN set obtained by lineal thread fragment analysis is the node itself; the KILL set is always empty; the IN set is the set of ancestor nodes, which is the union of all the predecessors' OUT sets; the OUT set is the set of reached ancestor nodes, which is the union of the IN and GEN sets; and the LTF set is the set of lineal thread fragments derived from the OUT set as follows:

$$LTF(v) \leftarrow OUT(v) \cup \{v' \mid OUT(v') \ni v\}$$

The LTF set thus contains both the ancestor nodes and the descendant nodes of a node, obtained by a symmetric step [Barik 2005]. Figure 8 presents the dataflow equations used to gather lineal thread fragment nodes in a TFG.
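The equations of Figure 8 amount to a standard forward dataflow fixpoint followed by the symmetric step. The following sketch computes the LTF sets from predecessor lists; the diamond-shaped TFG used here is illustrative, not an example from the paper.

def lineal_fragments(nodes, pred):
    """Iterate IN/OUT to a fixed point, then apply the symmetric LTF step."""
    out = {v: {v} for v in nodes}              # OUT(v) starts as GEN(v) = {v}
    changed = True
    while changed:                             # KILL is always empty, so OUT only grows
        changed = False
        for v in nodes:
            inset = set().union(*(out[p] for p in pred[v])) if pred[v] else set()
            new_out = inset | {v}              # OUT(v) = IN(v) ∪ GEN(v)
            if new_out != out[v]:
                out[v], changed = new_out, True
    # LTF(v) = OUT(v) ∪ {v' | v ∈ OUT(v')}: ancestors plus descendants
    return {v: out[v] | {u for u in nodes if v in out[u]} for v in nodes}

# Tiny diamond-shaped TFG: a -> b, a -> c, b -> d, c -> d.
pred = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
ltf = lineal_fragments(list(pred), pred)
print(ltf["b"])   # {'a', 'b', 'd'}: b is lineal with a and d, but not with c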

After the LTF set of every node has been computed, the MTF for each thread fragment is computed as

$$MTF(v) \leftarrow V' - \bigcup_{v' \in V_T(T(v))} LTF(v'),$$

where VT(t) denotes the set of nodes of thread t, and T(v) denotes the thread to which node v belongs.

An MHP region is a subgraph of the TFG in which thread fragments may be executed concurrently. Nodes belonging to different MHP regions must not be executed in parallel. The MHP regions are determined by first constructing an MHP graph G″ = (V′, E″), an undirected graph whose nodes are thread fragments and whose edges connect nodes that may happen in parallel, that is, nodes related by the MTF sets:

$$e(v, u) \in E'' \iff \exists u \in MTF(v) : v \in V'$$

The connected components of an MHP graph, denoted by {R1, . . . , RN}, are sets of nodes having a walk relationship. We then construct the MHP regions {S1, . . . , SN} from the connected components. An MHP region, as a subgraph of G′, is denoted by Sn = (Rn, En), where Rn is a connected component of G″ and En is a set of edges:

$$e(v, u) \in E_n \iff \exists e(v, u) \in E' : v \in R_n \wedge u \in R_n \quad (9)$$
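Continuing the sketch above, MHP regions can be obtained by symmetrizing the MTF relation and taking connected components; the traversal below is one straightforward way to do this and is not taken from the paper.

def mhp_regions(nodes, mtf):
    """Build the undirected MHP graph from the MTF sets and return its
    connected components {R1, ..., RN}; each induces one MHP region."""
    adj = {v: set(mtf[v]) for v in nodes}      # edge (v, u) iff u ∈ MTF(v)
    for v in nodes:
        for u in adj[v]:
            adj[u].add(v)                      # keep the graph symmetric
    regions, seen = [], set()
    for v in nodes:                            # depth-first search per component
        if v in seen or not adj[v]:
            continue
        stack, comp = [v], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        regions.append(comp)
    return regions

# In the diamond example above, b and c may happen in parallel.
print(mhp_regions(["a", "b", "c", "d"],
                  {"a": set(), "b": {"c"}, "c": {"b"}, "d": set()}))
# [{'b', 'c'}]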


We compute the lineal thread fragments of nodes and the MHP regions for each TFG by running Algorithm 4 with the dataflow equations. Table II lists the resulting GEN, OUT, and LTF sets for the example in Figure 7.

Table II. The GEN, OUT, and LTF Sets for the Example in Figure 7

Node  GEN    OUT                         LTF
u8    {u8}   {u4, u8}                    {u4, u8, u10, u13, u14, u17, u18}
u9    {u9}   {u1, u2, u5, u6}            {u1, u2, u5, u6, u9, u11, u12, u15, u16}
u10   {u10}  {u3, u4, u7, u8}            {u3, u4, u7, u8, u10, u13, u14, u17, u18}
u11   {u11}  {u1, u2, u5, u6, u11}       {u1, u2, u5, u6, u9, u11, u15}
u12   {u12}  {u1, u2, u5, u6, u12}       {u1, u2, u5, u6, u9, u12, u16}
u13   {u13}  {u3, u4, u7, u8, u13}       {u3, u4, u7, u8, u10, u13, u17}
u14   {u14}  {u3, u4, u7, u8, u14}       {u3, u4, u7, u8, u10, u14, u18}
u15   {u15}  {u1, u2, u5, u6, u11, u15}  {u1, u2, u5, u6, u9, u11, u15}
u16   {u16}  {u1, u2, u5, u6, u12, u16}  {u1, u2, u5, u6, u9, u12, u16}
u17   {u17}  {u3, u4, u7, u8, u13, u17}  {u3, u4, u7, u8, u10, u13, u17}
u18   {u18}  {u3, u4, u7, u8, u14, u18}  {u3, u4, u7, u8, u10, u14, u18}

Table III. The MTF Set for the Example in Figure 7

Node  MTF
u1    {u2, u3, u4, u6, u7, u8, u10, u13, u14, u17, u18}
u2    {u1, u3, u4, u5, u7, u8, u10, u13, u14, u17, u18}
u3    {u1, u2, u4, u5, u6, u8, u9, u11, u12, u15, u16}
u4    {u1, u2, u3, u5, u7, u8, u9, u11, u12, u15, u16}
u5    {u2, u3, u4, u6, u7, u8, u10, u13, u14, u17, u18}
u6    {u1, u3, u4, u5, u7, u8, u10, u13, u14, u17, u18}
u7    {u1, u2, u4, u5, u6, u8, u9, u11, u12, u15, u16}
u8    {u1, u2, u3, u5, u7, u8, u9, u11, u12, u15, u16}
u9    {u3, u4, u7, u8, u10, u13, u14, u17, u18}
u10   {u1, u2, u5, u6, u9, u11, u12, u15, u16}
u11   {u3, u4, u7, u8, u10, u12, u13, u14, u16, u17, u18}
u12   {u3, u4, u7, u8, u10, u11, u13, u14, u15, u17, u18}
u13   {u1, u2, u5, u6, u9, u11, u12, u14, u15, u16, u18}
u14   {u1, u2, u5, u6, u9, u11, u12, u13, u15, u16, u17}
u15   {u3, u4, u7, u8, u10, u12, u13, u14, u16, u17, u18}
u16   {u3, u4, u7, u8, u10, u11, u13, u14, u15, u17, u18}
u17   {u1, u2, u5, u6, u9, u11, u12, u14, u15, u16, u18}
u18   {u1, u2, u5, u6, u9, u11, u12, u13, u15, u16, u17}

As listed in Table II, the GEN set contains the node itself; the OUT set is the collection of GEN sets along the TFG; and the LTF set is derived from the OUT sets by the symmetric step. The MTF sets of the example are listed in Table III; they are derived from the LTF sets by subtraction.

5. MULTITHREAD POWER-GATING ANALYSIS

Based on the TFCA results, the component usage of a power-gating candidate Ci across all concurrent thread fragments can be categorized into the following three cases.


ALGORITHM 4: ComputeMHPRegion(TFG G′): Computing the lineal thread fragments of nodes and the MHP regions for TFG G′
Input: G′: a TFG G′ = (V′, E′)
Output: The MHP regions
Used Data: v, v′: TFG nodes
begin
    foreach v ∈ V′ do                        /* Initialize the GEN() sets */
        GEN(v) ← {v}
    end
    repeat                                   /* Perform dataflow analysis */
        foreach v ∈ V′ do
            IN(v) ← ⋃_{v′∈Pred(v)} OUT(v′)
            OUT(v) ← IN(v) ∪ GEN(v)
            foreach v′ ∈ OUT(v) do           /* Compute lineal thread fragments */
                Perform the symmetric step: LTF(v′) ← LTF(v′) ∪ OUT(v′) ∪ {v}
            end
        end
    until IN(v) and OUT(v) converge for all v ∈ V′
    foreach v ∈ V′ do                        /* Compute the MHP thread fragments */
        MTF(v) ← V′ − ⋃_{v′∈VT(T(v))} LTF(v′)
    end
    foreach v ∈ V′ do                        /* Construct the MHP graph */
        foreach u ∈ MTF(v) do
            Add edge e(v, u) to E″
        end
    end
    Find all connected components {R1, . . . , RN} of the MHP graph G″
    Construct the MHP regions {M1, . . . , MN} of G′ from {R1, . . . , RN}
end

—Ci is used by only one thread fragment. In this case the Sink-N-Hoist framework should be applied to the thread fragment to insert traditional power-gating instructions, because the uncertainty of component usage in a multithread program is not present.

—Ci is not used. In this case we do not need to handle Ci, because a power-gating candidate is defined to be turned off at the beginning of a superstep. This is the optimal case, since there is no extra cost and the power savings are maximal.

—Ci is used by multiple thread fragments. When the analysis results indicate that multiple thread fragments might use Ci in an MHP region, we have two strategies for placing PPG instructions. The evaluation is described in detail next.

In this section we present an MTPGA scheme based on the PPG mechanism and the TFCA results to estimate energy consumption and to insert power-gating instructions. MTPGA generally inserts a pair of predicated power-on and predicated power-off operations at the positions where a power-gating candidate is first activated and last deactivated for each thread within an MHP region, according to the proposed cost model. Figure 9 illustrates a simple scenario in which thread fragments TF1 and TF2 may happen in parallel, and thus TF1 and TF2 form an MHP region; CADFA exposes the utilization status of three power-gating candidates, labeled C1, C2, and C3, and in-use units are depicted with light gray boxes in the figure.

Figure 10 demonstrates two possible placements of PPG operations based on the MTPGA results. In most cases, MTPGA will place a pair of PPG operations for each power-gating candidate as appropriate.


Fig. 9. Two thread fragments TF1 and TF2 in an MHP region and their utilization status for the power-gating candidates C1, C2, and C3; in-use units are depicted with light gray boxes.

Fig. 10. Two kinds of instruction placement for power gating: (a) two concurrent thread fragments TF1 and TF2 that use a power-gating candidate C1, and their component usage; (b), (c) two strategies for placing power-gating instructions in TF1 and TF2: in (b), the leakage energy of the thread fragments is worth gating (per the calculation of Eq. (10)), so PPG instructions are inserted into all thread fragments; in (c), the leakage energy of the thread fragments is not worth gating, so conventional power-gating instructions are inserted before and after the MHP region.

With the support of conditional execution of power-gating instructions, power gating occurs only when a unit is first activated and last deactivated within an MHP region. Recall that the proposed PPG mechanism incorporates a set of reference counters (N reference counters for N power-gating candidates) for tracking the number of threads that have referenced each candidate, and that predicated bits are set only when the corresponding reference counters are 0. Therefore, within an MHP region only one pair of power-gating operations, namely the first power-on and the last power-off operations belonging to a pair of PPG operations, is actually executed, whereas the power gating of the other PPG operations is disabled. This ensures that a unit is alive whenever it is required for processing.

PPG is not cost free, and we take this into consideration when building a model for determining which PPG placement strategy should be employed. The model is based on a comparison of the energy costs of normal power gating and PPG in an MHP region. Suppose that there are N power-gating candidate units, C1, C2, . . . , CN, and K thread fragments, TF1, TF2, . . . , TFK, in an MHP region. We define two functions, named pro and epi, that take a thread fragment and a power-gating candidate as their parameters and compute the inactive period of the power-gating candidate before the candidate operates for the first time and after it operates for the last time within the thread fragment:

$$\mathit{pro}(TF_i, C_j) = \mathit{start}(TF_i, C_j) - \mathit{start}(TF_i)$$

and

$$\mathit{epi}(TF_i, C_j) = \mathit{end}(TF_i) - \mathit{end}(TF_i, C_j),$$

where start(TFi, Cj) returns the time at which Cj is first used in TFi and start(TFi) returns the start time of TFi, while end(TFi, Cj) returns the time at which Cj is last used in TFi and end(TFi) returns the end time of TFi. Figure 10 portrays the implications of these functions with TF1 and C1 as parameters. Furthermore, we define promin(Cj) and epimin(Cj) to return the minima of pro(TFi, Cj) and epi(TFi, Cj) over all TFi:

$$\mathit{pro}_{\min}(C_j) = \mathrm{MIN}_{\forall i \in K}\ \mathit{pro}(TF_i, C_j)$$

and

$$\mathit{epi}_{\min}(C_j) = \mathrm{MIN}_{\forall i \in K}\ \mathit{epi}(TF_i, C_j),$$

where minpro(Cj) represents the earliest time that Cj might be used after the MHP region starts,epimin(Cj) represents the latest time that Cj might be used prior to the MHP region ending, and MIN is a function that returns the minimum value of its parameters. Accordingly, the energy consumptionEpredof the PPG control within the MHP region is

Epred(TF, Cj)= Eon(Cj)+ Eo f f(Cj)+ Kj× (Ep−on+ Ep−of f) + (pro

min(Cj)+ epimin(Cj))× Prleak(Cj),

(10) where functionsEon(Cj) andEo f f(Cj) return the energy consumption of issuing power-on and power-off instructipower-ons for comppower-onent Cj, respectively; Kj represents the num-ber of threads in the MHP region that requires Cj to operate; Ep−on andEp−of f are the energy consumptions associated with operating a set of predicated power-on and predicated power-off manipulation operations described in Section 3.1, excluding the power-on and power-off operations, respectively; andPrleak(Cj) represents the leakage-power consumption of Cj in a cycle when the power supply is gated. In contrast, when we employ normal power-gating control at the beginning and end of the MHP region rather than applying the PPG management, the energy consumptionEnormal of such operations and the potential leakage dissipation is

Enormal(TF, Cj)= Eon(Cj)+ Eo f f(Cj)+ (minpro(Cj)+ epimin(Cj))× Pleak(Cj), (11) wherePleak(Cj) represents the leakage power consumption of unit Cj during a cycle.

Accordingly, we can derive the following inequality for ensuring the worthiness of PPG:

$$E_{pred}(TF, C_j) < E_{normal}(TF, C_j),$$

and substituting Eqs. (10) and (11) into this inequality yields

$$E_{on}(C_j) + E_{off}(C_j) + K_j \times (E_{p\text{-}on} + E_{p\text{-}off}) + (\mathit{pro}_{\min}(C_j) + \mathit{epi}_{\min}(C_j)) \times P'_{leak}(C_j) < E_{on}(C_j) + E_{off}(C_j) + (\mathit{pro}_{\min}(C_j) + \mathit{epi}_{\min}(C_j)) \times P_{leak}(C_j).$$

Thus we have

$$\mathit{pro}_{\min}(C_j) + \mathit{epi}_{\min}(C_j) > \frac{K_j \times (E_{p\text{-}on} + E_{p\text{-}off})}{P_{leak}(C_j) - P'_{leak}(C_j)}$$

as the criterion for determining whether PPG should be applied. Algorithm 5 is an implementation of the proposed MTPGA. The algorithm receives an MHP region as its input and decides which power-gating policy to adopt; basically, it determines whether PPG should be employed for each candidate in the MHP region.

ALGORITHM 5: MTPGA(MHP region): Deciding the power-gating policy for each candidate in an MHP region
begin
    foreach power-gating candidate C used by multiple thread fragments in the region do
        if promin(C) + epimin(C) ≤ THRESHOLD then
            Place a power-on and a power-off instruction for C at the beginning and the end of the MHP region, respectively.
        else
            foreach thread do
                Place a predicated power-on and a predicated power-off operation before/after the candidate operates for the first/last time within the thread.
            end
        end
    end
end

where

$$\mathit{THRESHOLD} = \frac{K_j \times (E_{p\text{-}on} + E_{p\text{-}off})}{P_{leak}(C) - P'_{leak}(C)}.$$
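The placement decision thus reduces to a single comparison against the threshold. The following sketch evaluates it with made-up energy parameters; the names mirror the paper's symbols, but the values are illustrative.

def use_ppg(pro_min, epi_min, k, e_p_on, e_p_off, p_leak, p_leak_gated):
    """Return True when predicated power gating beats one normal power-gating
    pair at the MHP-region boundaries, per the derived threshold."""
    threshold = k * (e_p_on + e_p_off) / (p_leak - p_leak_gated)
    return pro_min + epi_min > threshold

# A candidate used by 3 threads, idle 40 cycles before its first use and 25
# cycles after its last use; PPG bookkeeping costs 6 energy units per operation.
print(use_ppg(pro_min=40, epi_min=25, k=3,
              e_p_on=6, e_p_off=6, p_leak=1.0, p_leak_gated=0.1))
# True: 65 cycles of gated idle time outweigh the PPG bookkeeping overhead.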

6. EXPERIMENT AND DISCUSSION

6.1. Platform

We used a DEC-Alpha-compatible architecture with the PPG controls and two-way to eight-way simultaneous multithreading as the target architecture for our experiments. The SMT machine replicated certain resources for each thread, such as the program counter and registers, while functional units were shared among threads. The proposed MTPG framework was evaluated with a post-estimated SMT simulator based on Wattch toolkits with a 0.10 μm process parameter and a 1.9 V supply voltage. The SMT simulator schedules multiple Wattch simulators to execute programs separately, and then reschedules the component usage on a cycle-by-cycle basis (according to execution traces gathered from the Wattch simulators) to estimate the execution time and power consumption.

Table IV summarizes the baseline configuration of the Wattch simulators in our experiments. By default, the simulator performs out-of-order execution. We used the "-issue:inorder" option in the configuration so that instructions would be executed in order, which ensures the correctness of execution. Nevertheless, our approach could also be applied to machines with out-of-order execution when additional hardware support is employed, such as the hardware proposed in You et al. [2006].

Figure 11 illustrates the phases of the compilation. Two phases were added in order to analyze the component usage of a BSP program: the concurrent thread fragment analysis phase (see Section 4) and the low-power optimization phase (see Section 5). The TFCA phase is performed in low SUIF; it constructs the TFG and analyzes the concurrency among threads. We incorporated the low-power optimization phase just before code generation, that is, after all traditional performance optimizations were performed. Hence, the additional phase hardly influences performance; it only inserts power-gating or PPG instructions and thus barely affects the execution behavior. The implementation was based on SUIF2 and the CFG and machine libraries from Machine-SUIF. Programs were first transformed from high SUIF into low SUIF format with SUIF, processed by concurrent thread fragment analysis, and then translated to the machine- or instruction-level CFG form with Machine-SUIF. Four components of the low-power optimization phase for multithread programs (implemented as a Machine-SUIF pass) were then performed, and finally the compiler generated DEC Alpha assembly code with extended power-gating controls.


Table IV. Baseline Processor Configuration

Parameter             Configuration
Clock                 600 MHz
Processor parameters  0.10 μm, 1.9 V
Issue                 In-order
Decode width          8 instructions/cycle
Issue width           8 instructions/cycle
Commit width          8 instructions/cycle
RUU size              128
LSQ size              64
Function units        4 integer ALUs, 1 integer mul/div unit,
                      4 floating-point ALUs, 1 floating-point mul/div unit
Register file         32 64-bit integer registers,
                      32 64-bit floating-point registers,
                      1 power-gating control register

Fig. 11. Power management in the compilation phases of multithread programs.


The power-gating mechanism is absent in the original DEC Alpha processor, hence there are no power-gating instructions in its instruction set. Moreover, programs could be roughly categorized into compiled user source codes and libraries, and there are no directives in executables for distinguishing one from another; however, the absence of source codes prevents power-gating analysis. We therefore defined a set of special instructions as power-gating instructions so that they could be recognized by the DEC Alpha assembler and linker: “stl $24, negative offset($31)”, where the negative offset is a negative integer used for indicating the function units to be powered on or off and the boundary for kernel extraction. The $31 register in the DEC Alpha processor is a constant zero register, and so the instruction stores a value at a negative address that is invalid and should not be generated by a standard compiler. We made a small modification in Wattch to prevent the processor from accessing such invalid memory addresses: when the instruction decoder deciphered such instructions, it extracted the user directive information and converted it to an NOP (no-operation) instruction. Furthermore, since Wattch does not model leakage at the component level per se, we assumed that leakage power contributes 10% or 30% of the total power consumption [Butts and Sohi 2000; Rusu et al. 2007]. We also assumed that wakeup operations of


Furthermore, since Wattch does not model leakage at the component level per se, we assumed that leakage power contributes 10% or 30% of the total power consumption [Butts and Sohi 2000; Rusu et al. 2007]. We also assumed that wakeup operations of power-gating controls have an eight-cycle latency and that powering a component off and on costs 14× the leakage energy per cycle [Hu et al. 2004]. It was further assumed that the energy consumed by fetching and decoding a power-gating instruction is twice the leakage power. The overhead energy of the additional predicated power-gating controller (PPGC) was also considered: according to the synthesis result for the PPGC obtained with Synopsys Design Compiler, we assumed that the PPGC consumes 4×10⁻⁴ times the power of an integer ALU.
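These cost parameters imply a break-even idle time for gating a unit. The following back-of-the-envelope sketch is our reading, not a formula from the paper: it takes "twice the leakage power" to mean twice the unit's per-cycle leakage energy per gating instruction, and it ignores the PPGC overhead and any stall caused by the eight-cycle wakeup latency:

$$
E_{\mathrm{cost}} \;\approx\; \underbrace{14\,E_{\mathrm{leak}}}_{\text{off/on switching}} \;+\; \underbrace{2 \times 2\,E_{\mathrm{leak}}}_{\text{two gating instructions}} \;=\; 18\,E_{\mathrm{leak}},
\qquad
E_{\mathrm{saved}}(t) \;=\; t \cdot E_{\mathrm{leak}},
$$

where $E_{\mathrm{leak}}$ denotes the unit's leakage energy per cycle; under these assumptions, gating pays off only when the unit stays idle for more than roughly 18 cycles.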

We report the power usage of the analyzed code regions (i.e., the user's source code), excluding power that is not associated with the user program (e.g., libraries and the C runtime system). The baseline data were obtained from the power estimates of Wattch's cc3 model, whose clock-gating mechanism gates the clocks of unused resources in multiport hardware to reduce dynamic power; leakage power, however, remains.

6.2. Simulation Results

To verify our proposed MTPGA algorithm and PPG mechanism, we focused on investigating component utilization within supersteps. We report two sets of simulation results: one for random TFGs and the other for BSP programs converted from OpenCL kernels. Each set of results compares three configurations: (1) no power-gating mechanism (baseline), (2) CADFA with a conventional power-gating mechanism from previous work [You et al. 2002, 2006], and (3) MTPG with the PPG mechanism.

We first generated random TFGs and applied small programs as thread fragments. The random TFGs were generated using GGen, a random graph generator for scheduling simulations [Cordeiro et al. 2010]. The generation method was a slightly modified version of the fan-in/fan-out method: we added a layer-size parameter to control the shape of the generated graphs, where a layer is a set of nodes with no edges among them, and we generated random edges between adjacent layers only, which forces the generated graphs to fit the D-BSP communication rule. A label-swapping phase was also added immediately before emitting each graph to increase the randomness of thread fragments. Each node in the generated TFGs was mapped to a floating-point DSPstone [Zivojnovic et al. 1994] program. The random TFGs were all DAGs and were generated with the following parameters (a minimal generation sketch is given after the list):

—number of nodes: the number of thread fragments in the graph;

—out-degree: the out-degree of each node controls the number of successors of a thread fragment;

—in-degree: the in-degree of each node controls the number of predecessors of a thread fragment;

—number of layers: the number of layers in the graph;

—size of layer: the number of nodes in a layer, which controls the number of hardware threads used.
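As a concrete illustration of the layered generation just described, the following self-contained C sketch builds such a DAG. It is not GGen itself; the parameter names and values are illustrative, duplicate edges are not filtered, and the label-swapping phase is omitted.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative parameters (assumed, not taken from the paper's settings). */
#define NUM_LAYERS  4   /* number of layers */
#define LAYER_SIZE  3   /* nodes per layer; bounds the hardware threads */
#define MAX_OUT_DEG 2   /* bound on the successors of a thread fragment */

/* Node id of the k-th node in layer l. */
#define NODE(l, k) ((l) * LAYER_SIZE + (k))

int main(void)
{
    srand(42);  /* fixed seed for reproducibility */

    /* Edges are generated between adjacent layers only, so every edge goes
     * from layer l to layer l+1; the result is trivially a DAG and obeys
     * the D-BSP communication rule described above. */
    for (int l = 0; l + 1 < NUM_LAYERS; l++) {
        for (int k = 0; k < LAYER_SIZE; k++) {
            int deg = 1 + rand() % MAX_OUT_DEG;  /* 1..MAX_OUT_DEG successors */
            for (int e = 0; e < deg; e++) {
                int succ = rand() % LAYER_SIZE;  /* successor in next layer */
                printf("edge: %d -> %d\n", NODE(l, k), NODE(l + 1, succ));
            }
        }
    }
    return 0;
}
```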


Table VI. Normalized Total Energy Consumption of Randomly Generated TFGs for Setting A with the Leakage Contribution Set to 10% and 30% (see Table V), Categorized by the Number of MHP Regions for Cases with Two Hardware Threads

Leakage contribution set to 10%

#MHP regions  method    dynamic  leakage^a  leakage^b  overhead  total
1             baseline  52.55%   12.35%     35.10%     0.00%     100.00%
              CADFA     52.66%    2.74%     36.91%     9.14%     101.46%
              MTPG      52.57%    9.96%     35.26%     0.45%      98.24%
2             baseline  50.60%   12.86%     36.54%     0.00%     100.00%
              CADFA     50.70%    2.73%     38.31%     7.63%      99.36%
              MTPG      50.63%    7.72%     36.90%     0.89%      96.14%
3             baseline  49.80%   13.07%     37.13%     0.00%     100.00%
              CADFA     49.89%    2.74%     38.90%     7.20%      98.72%
              MTPG      49.83%    6.03%     37.76%     1.33%      94.96%
4             baseline  48.70%   13.35%     37.94%     0.00%     100.00%
              CADFA     48.80%    2.73%     39.85%     6.40%      97.77%
              MTPG      48.75%    4.82%     38.81%     1.71%      94.09%
5             baseline  47.64%   13.63%     38.73%     0.00%     100.00%
              CADFA     47.73%    2.70%     40.71%     5.68%      96.83%
              MTPG      47.70%    4.08%     40.01%     2.11%      93.90%

Leakage contribution set to 30%

#MHP regions  method    dynamic  leakage^a  leakage^b  overhead  total
1             baseline  22.52%   20.15%     57.33%     0.00%     100.00%
              CADFA     22.56%    4.50%     60.29%    15.28%     102.63%
              MTPG      22.53%   16.26%     57.60%     0.73%      97.12%
2             baseline  21.11%   20.52%     58.37%     0.00%     100.00%
              CADFA     21.15%    4.36%     61.18%    12.39%      99.08%
              MTPG      21.12%   12.35%     58.95%     1.43%      93.84%
3             baseline  20.56%   20.66%     58.78%     0.00%     100.00%
              CADFA     20.60%    4.33%     61.56%    11.55%      98.05%
              MTPG      20.58%    9.55%     59.77%     2.09%      91.99%
4             baseline  19.85%   20.84%     59.30%     0.00%     100.00%
              CADFA     19.89%    4.26%     62.26%    10.13%      96.55%
              MTPG      19.87%    7.56%     60.64%     2.67%      90.74%
5             baseline  19.22%   21.01%     59.77%     0.00%     100.00%
              CADFA     19.26%    4.18%     62.78%     8.89%      95.11%
              MTPG      19.25%    6.34%     61.68%     3.24%      90.52%

^a Leakage energy consumed by power-gateable units.
^b Leakage energy consumed by other units.

The energy consumption results for the parameter settings in Table V are listed in Tables VI, VIII, IX, and X; we used 10,756 graph instances to evaluate all settings. All results are normalized to the situation without a power-gating mechanism. The total energy consumption is divided into four categories: (1) the dynamic energy dissipated by the processor, (2) the leakage energy dissipated by power-gateable units, (3) the leakage energy dissipated by the entire processor except for power-gateable units, and (4) the overhead due to extra power-gating instructions. The overhead includes the energy consumed by power-gating instructions, the energy consumed due to the latency caused by powering on components that have been incorrectly powered off, and the energy consumed by the predicated power-gating controller. Settings A and B are for machines equipped with two hardware threads, while Setting C is for those equipped with four hardware threads. With MTPG, the total power consumption for each setting was reduced to 93.90%, 93.32%, and 95.12%, respectively, relative to the baseline (i.e., no power-gating mechanism).
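As a sanity check on how these columns compose, each row's total is the sum of its four components; for example, taking the MTPG row with five MHP regions and 10% leakage contribution from Table VI:

$$
E_{\mathrm{total}} = E_{\mathrm{dyn}} + E_{\mathrm{leak}}^{(a)} + E_{\mathrm{leak}}^{(b)} + E_{\mathrm{ovh}}
= 47.70\% + 4.08\% + 40.01\% + 2.11\% = 93.90\%.
$$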

