Organization - A Flexible Bus Encoding Technique for Reducing On-Chip Bus Power Consumption under Circuit Delay Constraints

Chapter 1 Introduction

1.3 Organization

The rest of this thesis is organized as follows. Chapter 2 describes the basic assumptions about the bus structure used in our work and details our encoding flow and the algorithm used to minimize the LC coupling effects. Simulation results are shown in Chapter 3. Finally, Chapter 4 concludes the thesis.

Chapter 2 Bus Encoding for Reducing Delay and Power

2.1 Preliminary

2.1.1 Assumption and Problem Input

Figure 1: The coplanar bus structure. (Figure labels: Signal Wire, Power/Ground.)

In this thesis, we build our encoding scheme on the coplanar bus structure shown in Figure 1. We assume that every driver (receiver) has a uniform size and that each driver is symmetric, so its effective output resistance is the same for rising and falling signal transitions. Each signal wire has a uniform width, pitch, length, and height. Given the wire parameters (length, width, height, and pitch), the delay constraint, the working frequency, and the number of data bits n, we generate a valid code set that has the minimal total transition power to map the data patterns. The valid code set is obtained at the cost of m − n extra bus wires. Any valid code set must satisfy the property that every transition between codes within the set is guaranteed to meet the delay constraint. The overall bus structure is shown in Figure 2. The valid code set of the global bus contains only 2ⁿ out of 2^m possible codes. Mapping the data patterns to the generated code set is straightforward and is not discussed in this thesis.

Figure 2: The overall bus structure including the encoder and decoder. (Figure labels: Data, Encoder, Decoder; n-bit data, m-bit bus, m > n; global bus with only 2ⁿ valid codes that are allowed to be transmitted; m − n: wire overhead.)

For n data bits, the specific 2ⁿ valid codes of the global bus in Figure 2 are selected to minimize the coupling effects between any two of them. In addition, the transition delay between any two codes in this set meets the delay constraint given by the user.

Since transistors mainly operate in the linear region during transitions, we assume that the output resistances of all drivers are linear throughout the simulations. The drivers are therefore modeled as simple linear resistances. In addition, the receivers are replaced by equivalent gate capacitances in the circuit model, and the wires are replaced by equivalent RLC circuit models. With these models, the circuit built for the coplanar bus structure consists only of linear elements (linear R, L, and C). In other words, the circuit is an LTI (linear time-invariant) system.

In this thesis, a ramp signal is applied to the input of each driver, and the transition time τr is calculated from the working frequency f.

In our simulation, we assume that synchronous registers are located at the transmitter side. Thus all the signals switch at the same time on the bus.

2.2 Bus Encoding Flow

2.2.1 Overall Encoding Flow

Figure 3 illustrates our overall encoding flow. First, the user provides the bus parameters (bus width n, wire dimensions, wire pitch, Power/Ground grid dimensions, Power/Ground-to-signal pitch), the working frequency, and the delay constraint. With the given parameters, we extract the resistances, capacitances, and inductances of the bus wires. After extraction, the equivalent RLC circuit is built. Next, the circuit is simulated in HSPICE with the basis vectors, which are defined later. By applying the superposition theorem [18] of linear circuits, we can establish the transition graph efficiently. From the transition graph, we apply a modified local search algorithm [19] to find a valid code set in which every transition between a code pair meets the delay constraint and the total transition power of all code pairs is minimized. Next, we check whether the code set covers all data patterns. If not, we add one more bit line to the bus structure and redo the flow from Step (1). Otherwise, the code set is output, together with the corresponding bus structure, to be mapped to the data patterns. The details of each step are described in the following subsections.

Figure 3: The overall encoding flow. (Flowchart labels: bus parameters as input; decision "Does the code set cover all data patterns?"; Yes: output the code set and the bus structure; No: add one wire into the bus.)
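The outer loop of this flow can be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis implementation: `find_code_set` is a hypothetical callable standing in for Steps (1) to (4) (extraction, simulation, graph construction, and clique search), `max_wires` is an assumed safety bound, and `toy_code_set` uses an arbitrary toy rule in place of the real delay constraint.

```python
from itertools import product
from typing import Callable, List, Tuple


def encode_bus(n: int,
               find_code_set: Callable[[int, int], List[str]],
               max_wires: int = 16) -> Tuple[int, List[str]]:
    """Add one wire at a time until the code set covers all 2**n data patterns."""
    target = 2 ** n
    for m in range(n, max_wires + 1):
        codes = find_code_set(m, target)   # hypothetical: Steps (1)-(4) for an m-wire bus
        if len(codes) >= target:           # the code set covers all data patterns
            return m, codes[:target]
    raise RuntimeError("no valid code set found within max_wires")


def toy_code_set(m: int, target: int) -> List[str]:
    """Toy stand-in (assumption): accept m-bit words with no two adjacent 1s."""
    words = ("".join(bits) for bits in product("01", repeat=m))
    return [w for w in words if "11" not in w][:target]


if __name__ == "__main__":
    wires, codes = encode_bus(2, toy_code_set)
    print(wires, codes)
```

With the toy rule, a 2-bit data word needs one extra wire, because only three of the four 2-bit words pass the rule.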

2.2.2 Extract RLC from Bus

In Step (1), with the given feasible parameters, FastCap [20] and FastHenry [21] are used to extract the RLC parameters of the bus and construct the SPICE model. The detailed flow of Step (1) is shown in Figure 4. FastCap extracts the self and coupling capacitances of the wires, while FastHenry extracts the resistances, self inductances, and coupling inductances. With these extracted RLC parameters, the equivalent RLC circuit models are constructed.

The circuit models are constructed as π-segments using series resistances and inductances and shunt capacitances. The resulting circuit model is written out as a SPICE file.

Figure 4: Extracting the RLC parameters and generating the corresponding SPICE file.
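As an illustration of the π-segment construction, the sketch below emits SPICE element lines for a single wire split into equal segments. It is a simplified example under assumptions: the function name, node naming scheme, and numeric values are hypothetical, and the coupling capacitances and mutual inductances between neighbouring wires that the real model contains are omitted.

```python
from typing import List


def pi_segment_netlist(wire: str, r_total: float, l_total: float,
                       c_total: float, segments: int) -> List[str]:
    """Return SPICE R/L/C lines for one wire modelled as equal pi-segments."""
    r = r_total / segments
    l = l_total / segments
    c_half = c_total / (2 * segments)  # half of a segment's capacitance at each end
    lines = []
    for i in range(segments):
        a, mid, b = f"{wire}_{i}", f"{wire}_{i}x", f"{wire}_{i + 1}"
        lines.append(f"C{wire}_{i}a {a} 0 {c_half:.6e}")   # shunt C at segment input
        lines.append(f"R{wire}_{i} {a} {mid} {r:.6e}")     # series R
        lines.append(f"L{wire}_{i} {mid} {b} {l:.6e}")     # series L
        lines.append(f"C{wire}_{i}b {b} 0 {c_half:.6e}")   # shunt C at segment output
    return lines


if __name__ == "__main__":
    # Hypothetical values: 100 ohm, 1 nH, 0.2 pF, four segments.
    for line in pi_segment_netlist("w0", 100.0, 1e-9, 2e-13, 4):
        print(line)
```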

2.2.3 Simulate Basis Vectors by HSPICE

After the RLC circuit model is built, the transition delay and power for a specific input pattern pair can be obtained by running an HSPICE simulation. However, an n-bit bus has 2ⁿ input patterns and 4ⁿ possible transition patterns in total, so simulating every transition pattern in HSPICE becomes extremely time-consuming as n grows. The total cost is 4ⁿ × (HSPICE simulation time for one transition pattern). Hence, we develop a method based on the superposition theorem [18] to significantly reduce the simulation time. We first simulate the basis vectors, which act as independent sources to the RLC circuit. The real delay of each transition pattern can then be obtained by superposing the simulation results of the basis vectors.

What are the basis vectors of a bus? We define them as all independent transitions of the bus inputs. Throughout this thesis, “–” represents a stable input (stable at low or high), “↑” represents an input changing from low to high, and “↓” represents an input changing from high to low. Therefore, for an n-bit bus with input signals z0, z1, z2, …, zn-1, the basis vectors are the transition patterns in which a single input zk switches (↑ or ↓) while every other input is stable (–).

The following are some properties of the basis vectors.

Property 1: In a linear RLC circuit of a bus, given a basis vector such as (– – ↑), the transition delay and power due to the switching input are the same regardless of whether the other, stable inputs are at ‘0’ or ‘1’ [15] (e.g., 000→001 and 110→111 have the same transition delay).

Property 2: In a linear RLC circuit of a bus, given an integer k with 0 ≤ k ≤ n−1, the basis vectors {(z0 z1 z2 … zn-1) | zk ∈ {↑} and all other inputs ∈ {–}} and {(z0 z1 z2 … zn-1) | zk ∈ {↓} and all other inputs ∈ {–}} are called a dual basis vector pair. The voltage and current waveforms resulting from a dual basis vector pair are equal in magnitude but opposite in direction [15].

We also define a minimum basis vector set as a set of basis vectors from which every transition pattern can be obtained by superposition:

Minimum basis vector sets of an n-bit bus =

{{(z0 z1 z2 … zn-1) | z0 ∈ {↑, ↓} and z1, z2, …, zn-1 ∈ {–}},

{(z0 z1 z2 … zn-1) | z1 ∈ {↑, ↓} and z0, z2, …, zn-1 ∈ {–}},

⋮

{(z0 z1 z2 … zn-1) | zn-1 ∈ {↑, ↓} and z0, z1, …, zn-2 ∈ {–}}} (3)

Therefore, a minimum basis vector set of an n-bit bus has n elements, and each element can be either member of the corresponding dual basis vector pair, as shown in Equation (3).
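The definitions above can be illustrated with a short sketch that builds one minimum basis vector set and decomposes an arbitrary transition into single-wire basis vectors. The function names and the string encoding (codes as '0'/'1' strings with z0 leftmost, an ASCII "-" for the stable symbol) are assumptions made for this example only.

```python
from typing import List, Tuple

STABLE, RISE, FALL = "-", "↑", "↓"


def minimum_basis(n: int) -> List[Tuple[str, ...]]:
    """One rising basis vector per wire; its falling dual follows from Property 2."""
    return [tuple(RISE if i == k else STABLE for i in range(n)) for k in range(n)]


def decompose(src: str, dst: str) -> List[Tuple[str, ...]]:
    """Split the transition src -> dst into basis vectors, one per switching wire."""
    n = len(src)
    parts = []
    for k in range(n):
        if src[k] == dst[k]:
            continue  # a stable wire contributes no basis vector
        symbol = RISE if dst[k] == "1" else FALL
        parts.append(tuple(symbol if i == k else STABLE for i in range(n)))
    return parts


if __name__ == "__main__":
    print(minimum_basis(3))
    print(decompose("010", "101"))
```

Only n simulations are needed for the set returned by `minimum_basis`, compared with the 4ⁿ transition patterns of the full bus.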

Hence, we only need to simulate the basis vectors of the chosen minimum basis vector set. We can then use the simulation results to obtain the delay and power of all transition patterns by applying the superposition theorem. The details of Step (2) are shown in Figure 5. First, we apply one basis vector at a time as the input transition pattern on the bus. Second, we run a SPICE simulation and record the voltage and current waveforms of all signal wires. We repeat this procedure until every basis vector of the chosen minimum basis vector set has been simulated.

Figure 5: HSPICE simulations with the basis vectors of the minimum basis vector set. (Flowchart labels: apply one minimum basis vector at a time; HSPICE simulation; record the waveform of each wire; repeat.)

2.2.4 Build Transition Graph

In Step (3), we apply the superposition theorem to calculate the real transition delay of each transition pattern from the simulation results of the basis vectors obtained in Step (2). Figure 6 illustrates the procedure. First, we decompose the transition pattern into basis vectors. Then, by looking up the simulation results of those basis vectors and superposing them, we obtain the overall voltage waveform of each wire, from which the real delay of the transition pattern is calculated. Our simulation results show that the waveforms obtained from superposition match those obtained from a real HSPICE simulation exactly.
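A minimal sketch of this superposition step is given below. It rests on several assumptions that are not taken from the thesis: waveforms are stored as equally spaced samples, `rising[k][j]` holds the voltage deviation on wire j when only wire k rises, the falling response is the negated rising response (Property 2), and the delay is measured at a 50% threshold of the supply voltage. The sample values in the usage part are made up.

```python
from typing import List

Waveform = List[float]


def superpose(src: str, dst: str, rising: List[List[Waveform]],
              vdd: float = 1.0) -> List[Waveform]:
    """Build each wire's waveform for src -> dst from single-wire responses."""
    n = len(src)
    samples = len(rising[0][0])
    waves = []
    for j in range(n):
        start = vdd if src[j] == "1" else 0.0       # initial level of wire j
        wave = [start] * samples
        for k in range(n):
            if src[k] == dst[k]:
                continue                             # stable input: no contribution
            sign = 1.0 if dst[k] == "1" else -1.0    # falling = negated rising response
            for t in range(samples):
                wave[t] += sign * rising[k][j][t]
        waves.append(wave)
    return waves


def transition_delay(src: str, dst: str, waves: List[Waveform],
                     dt: float, vdd: float = 1.0) -> float:
    """Worst time at which a switching wire is still on the wrong side of vdd/2."""
    worst = 0.0
    for j, wave in enumerate(waves):
        if src[j] == dst[j]:
            continue
        going_up = dst[j] == "1"
        last_bad = -1
        for t, v in enumerate(wave):
            crossed = v >= vdd / 2 if going_up else v <= vdd / 2
            if not crossed:
                last_bad = t
        worst = max(worst, (last_bad + 1) * dt)
    return worst


if __name__ == "__main__":
    # Made-up 2-wire responses: own-wire ramp plus a small coupled bump.
    ramp, bump = [0.0, 0.4, 0.8, 1.0], [0.0, 0.2, 0.1, 0.0]
    rising = [[ramp, bump], [bump, ramp]]
    for src, dst in [("00", "11"), ("01", "10")]:
        print(src, dst, transition_delay(src, dst, superpose(src, dst, rising), dt=1.0))
```

In this toy data, wires switching in the same direction help each other, while wires switching in opposite directions slow each other down, which is the coupling behaviour the encoding tries to avoid.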

By using superposition, the overall current waveform of each wire can be obtained as well. Then the transition power between any two codes can be obtained from the following equation:

(4)

Hence, by using superposition, we can calculate the transition delay and power between any two codes very quickly, without performing a real HSPICE simulation run. We then build a transition graph that indicates whether the transition delay between any two codes meets the delay constraint. In the transition graph, a node represents a code, and an undirected weighted edge indicates that the transition delay between the two corresponding codes meets the delay constraint, with the transition power recorded as the edge weight.

Figure 6: An example of superposition.
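The construction of the transition graph can be sketched as follows. The `delay` and `power` callables are hypothetical stand-ins for the values obtained by superposition, the toy models below are invented for the example, and checking both transition directions before adding an undirected edge is an assumption of this sketch.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable

Graph = Dict[str, Dict[str, float]]


def build_transition_graph(codes: Iterable[str],
                           delay: Callable[[str, str], float],
                           power: Callable[[str, str], float],
                           limit: float) -> Graph:
    """Add an undirected weighted edge when both directions meet the delay limit."""
    graph: Graph = {c: {} for c in codes}
    for a, b in combinations(graph, 2):
        if max(delay(a, b), delay(b, a)) <= limit:
            weight = power(a, b)          # transition power as the edge weight
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def toy_delay(a: str, b: str) -> float:
    """Toy model: adjacent wires switching in opposite directions add delay."""
    opposite = sum(1 for i in range(len(a) - 1)
                   if a[i] != b[i] and a[i + 1] != b[i + 1] and a[i] != a[i + 1])
    return 1.0 + 2.0 * opposite


def toy_power(a: str, b: str) -> float:
    """Toy model: power proportional to the number of switching wires."""
    return float(sum(x != y for x, y in zip(a, b)))


if __name__ == "__main__":
    g = build_transition_graph(["00", "01", "10", "11"], toy_delay, toy_power, 2.0)
    for code, edges in g.items():
        print(code, edges)
```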

2.2.5 Find Minimum Total Edge Weight b Clique

2.2.5.1 Our Proposed Flow for Finding Valid Code Set

After building the transition graph, we want to find a minimum total edge weight b-clique (b = 2ⁿ) in the graph. The reasons for finding such a clique are: (a) we want a valid code set that can map to all data patterns, i.e., the size of the valid code set must equal the number of data patterns, 2ⁿ; (b) within the valid code set, any transition between two codes is guaranteed to meet the delay constraint; (c) the total transition power of the code set is minimized, and thus the average power consumption of the encoded bus is also minimized.

Inspired by the k-opt Local Search [19], we propose the Modified Local Search (MLS) to solve this problem. Figure 7 shows our proposed flow for finding the valid code set. Since finding the maximum clique is an NP-complete problem, the k-opt Local Search is a heuristic algorithm; it can process graphs of up to 4000 nodes. The proposed Modified Local Search is therefore also a heuristic algorithm. Applying the Modified Local Search improves the quality of the valid code set found, in terms of both the size and the total edge weight of the clique.

As shown in Figure 7, given a weighted transition graph, our goal is to find a valid code set that can map to all the data patterns with the minimum total edge weight (i.e., the minimum total transition power). Our proposed flow has three steps. The first step finds the seeds, which are nodes with degree larger than b. The second step generates an initial clique for each seed. The third step processes each clique with MLS.

Figure 7: Proposed flow for finding the valid code set.

2.2.5.2 Notations

The notations used in this thesis are given below. An example is given in Figure 8 to demonstrate the notations.

C: the current clique.

S_N: the neighbor node set, i.e., the nodes that can be added to enlarge C because they are connected to all vertices of C. For example, S_N = {8} in Figure 8.

S_OM: the node set of one edge missing, i.e., the vertices that are connected to |C| − 1 vertices of C. For example, S_OM = {5} in Figure 8.

C^i, S_N^i, S_OM^i: the current clique, the neighbor node set, and the node set of one edge missing in iteration i, respectively.

Total-Power(C): the sum of all edge weights of C.

Figure 8: An example of S_OM, S_N and C.
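These notations translate directly into code. The sketch below computes S_N, S_OM, and Total-Power(C) for a graph stored as a dictionary of weighted adjacency maps; the data layout, the function names, and the small example graph are assumptions for illustration and do not reproduce Figure 8.

```python
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[str, Dict[str, float]]


def make_graph(edges: Iterable[Tuple[str, str, float]]) -> Graph:
    """Build an undirected weighted graph from (u, v, weight) triples."""
    graph: Graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def neighbor_sets(graph: Graph, clique: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (S_N, S_OM) for the current clique C."""
    s_n, s_om = set(), set()
    for v in graph:
        if v in clique:
            continue
        links = sum(1 for u in clique if u in graph[v])
        if links == len(clique):          # connected to every vertex of C
            s_n.add(v)
        elif links == len(clique) - 1:    # exactly one edge to C is missing
            s_om.add(v)
    return s_n, s_om


def total_power(graph: Graph, clique: Set[str]) -> float:
    """Sum of all edge weights inside the clique."""
    nodes = sorted(clique)
    return sum(graph[a][b] for i, a in enumerate(nodes) for b in nodes[i + 1:])


if __name__ == "__main__":
    g = make_graph([("a", "b", 1.0), ("a", "c", 2.0), ("b", "c", 3.0),
                    ("d", "a", 1.0), ("d", "b", 1.0), ("d", "c", 1.0),
                    ("e", "a", 4.0), ("e", "b", 4.0)])
    print(neighbor_sets(g, {"a", "b", "c"}), total_power(g, {"a", "b", "c"}))
```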

2.2.5.3 Finding Seeds and Generating Initial Cliques

After the weighted transition graph is built, we use the 1-opt Local Search [19] to develop initial cliques from seeds. Because the size of our target clique is b, the edge degree of a seed should be larger than b.

Since the initial cliques are the inputs of MLS, the performance of MLS depends on the given initial cliques. Hence, MLS should be tried with as many different initial cliques as possible in order to find the global optimum. To generate the initial cliques, we apply the 1-opt Local Search to develop them from seeds. Given a graph, nodes with edge degree larger than b are chosen as seeds. Once a seed is chosen, the corresponding initial S_N and S_OM are generated. To enlarge a clique, we choose one candidate from S_N at a time. The candidate is the node with the minimum average weight. After the candidate is added to the clique, S_N and S_OM are updated. The procedure continues until either of the following conditions is satisfied:

(1) |C| = b.

(2) SN = φ.

Figure 9 depicts the flow of the 1-opt Local Search.

Figure 9: Flow of the 1-opt Local Search. (Flowchart labels: Seed; add candidate with minimum average weight; decision "NE = Φ or |Clique| = b" with Yes/No branches; report the valid code set.)
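The greedy growth from a seed can be sketched as below. This is a simplified reading of the procedure, with two assumptions labeled in the code: the "average weight" of a node is taken as the mean weight of all edges incident to it in the graph, and ties are broken by node label. The example graph is made up.

```python
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[int, Dict[int, float]]


def make_graph(edges: Iterable[Tuple[int, int, float]]) -> Graph:
    """Build an undirected weighted graph from (u, v, weight) triples."""
    graph: Graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def average_weight(graph: Graph, v: int) -> float:
    """Assumption: mean weight over all edges incident to v."""
    return sum(graph[v].values()) / len(graph[v])


def one_opt(graph: Graph, seed: int, b: int) -> Set[int]:
    """Grow a clique from the seed until |C| = b or S_N is empty."""
    clique = {seed}
    s_n = set(graph[seed])                       # S_N of the single-node clique
    while len(clique) < b and s_n:
        # Candidate with the minimum average weight (ties broken by label).
        v = min(s_n, key=lambda u: (average_weight(graph, u), u))
        clique.add(v)
        s_n = {u for u in s_n if u in graph[v]}  # keep only common neighbors
    return clique


if __name__ == "__main__":
    g = make_graph([(1, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0),
                    (1, 4, 5.0), (2, 4, 5.0), (3, 4, 5.0),
                    (1, 5, 2.0), (2, 5, 2.0)])
    print(one_opt(g, seed=1, b=3))
```

In the example, asking for b = 4 from the same seed stops at three nodes because S_N becomes empty, which is the situation the Modified Local Search is meant to escape.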

2.2.5.4 Modified Local Search

Inspired by the k-opt Local Search [19], we propose the Modified Local Search to solve the problem in this thesis.

The algorithm is built on two basic ideas, described below.

Given a current clique C for a graph G:

(i). If |C| < b, by dropping a node set A contained in C, we can add a different node set B contained in the current S_N of the resulting clique C − A (|B| > |A|, and B is assumed to be a clique). A larger clique of size |C| − |A| + |B| is then obtained.

(ii). If |C| = b, by dropping a node set A contained in C, we can add a different node set B contained in the current S_N of the resulting clique C − A (|B| = |A|, B is assumed to be a clique, and Total-Power(B) < Total-Power(A)). The new clique (C − A) ∪ B then has the same size as C but a smaller total edge weight.
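Idea (ii) can be illustrated in its simplest form, where a single node is dropped and a single node is added (|A| = |B| = 1). The sketch below is a restricted illustration and not the full MLS: it compares the total edge weight of the whole clique before and after the swap, and the function names and example graph are assumptions.

```python
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[int, Dict[int, float]]


def make_graph(edges: Iterable[Tuple[int, int, float]]) -> Graph:
    """Build an undirected weighted graph from (u, v, weight) triples."""
    graph: Graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def total_power(graph: Graph, clique: Set[int]) -> float:
    """Sum of all edge weights inside the clique."""
    nodes = sorted(clique)
    return sum(graph[a][b] for i, a in enumerate(nodes) for b in nodes[i + 1:])


def improve_by_swap(graph: Graph, clique: Set[int]) -> Set[int]:
    """Try every single-node swap and return the lowest-power clique found."""
    best, best_power = set(clique), total_power(graph, clique)
    for a in clique:                                  # A = {a} is dropped
        rest = clique - {a}
        for v in graph:
            if v in clique:
                continue
            if all(u in graph[v] for u in rest):      # v is in S_N of C - A
                candidate = rest | {v}                # (C - A) U B with B = {v}
                p = total_power(graph, candidate)
                if p < best_power:
                    best, best_power = candidate, p
    return best


if __name__ == "__main__":
    g = make_graph([(1, 2, 1.0), (1, 3, 4.0), (2, 3, 4.0),
                    (1, 4, 1.0), (2, 4, 1.0)])
    print(improve_by_swap(g, {1, 2, 3}))
```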

According to the two ideas, we give a flow as shown in Figure 10 to detail MLS.

Figure 10: Flow of MLS. (Flowchart labels: transition graph and initial clique; Initialization; Add Phase; decision "S_N = Φ or |C| = b"; Drop Phase; decision "Is every node in initial clique tried?"; decision "Is better clique found?"; report the valid code set.)

The procedure of the Modified Local Search is as follows. Given an initial clique C^0 (|C^0| < b) with S_N^0 ≠ φ, the first solution C^1 is obtained as C^0 ∪ {v}, where v is the node with the minimum average edge weight in S_N^0. Then S_N^0 and S_OM^0 are updated to S_N^1 and S_OM^1, respectively. The adding procedure is repeated until S_N = φ or |C| = b.

After the adding procedure has been performed some number of times (say α times), either the condition “S_N^α = φ” or the condition “|C^α| = b” is encountered. If “S_N^α = φ” is encountered, the algorithm tries to find a larger clique after dropping one or several vertices from C^α. Otherwise, if “|C^α| = b” is encountered, the Modified Local Search tries to find a better b-clique with a smaller total edge weight after dropping one or several vertices from C^α.

The dropping procedure begins when either condition is encountered. It continues until no node can be dropped from the current clique, or until one or more vertices can be added in the adding phase (i.e., S_N^(α+β) ≠ φ after β drops). During the dropping procedure, the node v ∈ C^α that yields the largest S_N^(α+1) is chosen, so that many candidates are available in S_N^(α+1) to be added to C^(α+1) in the next adding procedure. This node can be found by checking all vertices of C^α and choosing the one to which the most nodes in S_OM^α lack an edge.
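The drop-node selection described above can be sketched as follows: every node in S_OM is missing exactly one edge to the clique, so the clique vertex that is the missing endpoint for the most S_OM nodes frees the most candidates when dropped. The function names, the tie-breaking rule (smallest label), and the example graph are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[int, Dict[int, float]]


def make_graph(edges: Iterable[Tuple[int, int, float]]) -> Graph:
    """Build an undirected weighted graph from (u, v, weight) triples."""
    graph: Graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def choose_drop_node(graph: Graph, clique: Set[int]) -> int:
    """Pick the clique vertex that the most S_OM nodes lack an edge to."""
    missing = {v: 0 for v in clique}
    for w in graph:
        if w in clique:
            continue
        absent = [u for u in clique if u not in graph[w]]
        if len(absent) == 1:              # w belongs to S_OM
            missing[absent[0]] += 1
    # Iterating in sorted order makes max() return the smallest label on ties.
    return max(sorted(missing), key=lambda v: missing[v])


if __name__ == "__main__":
    g = make_graph([(1, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0),
                    (4, 1, 1.0), (4, 2, 1.0),
                    (5, 1, 1.0), (5, 2, 1.0),
                    (6, 2, 1.0), (6, 3, 1.0)])
    print(choose_drop_node(g, {1, 2, 3}))
```

In the example, nodes 4 and 5 both lack only the edge to node 3, so dropping node 3 lets both of them enter S_N.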

The adding and dropping procedures are repeated until a termination condition is satisfied. S_N and S_OM are updated whenever a node is added or dropped.

When the termination condition is encountered after p iterations, we choose the best clique C^x from C^1 to C^p (1 ≤ x ≤ p). The “best” C^x is the largest clique among C^1 to C^p when all of the cliques are smaller than b, or the clique with the minimum total edge weight when some of the cliques have size b. The best solution then becomes the new initial clique C^0 := C^x for the next search. The Modified Local Search continues until no better solution is found.

Figure 11: Pseudo code of the Modified Local Search for finding minimum total edge weight b clique.

Figure 11 shows the pseudo code of the Modified Local Search for the minimum total edge weight b clique problem. In our pseudo code, two variables g and gpower (lines 2 and 3) are used to denote the clique size gain and the minimum total edge weight in each iteration. A candidate set PO (line 2) is used to prevent the cycling among the solutions. Only one node in PO can be added or dropped in each iteration.

The Modified Local Search has an inner loop and an outer loop. In the inner loop (lines 4 to 25), given an initial solution, several adding and dropping procedures are performed under the restriction of PO, and the best solution is chosen. In the outer loop (lines 1 to 29), the chosen best solution is checked against the previous one (Cpre) by comparing gmax and gpower (lines 26 and 27). The maximum gain gmax is updated in the adding procedure (line 11); it is the difference between the current best solution and the Cpre set at line 2. The minimum total edge weight gpower is also updated in the adding procedure (line 13); it records the minimum total edge weight found in the inner loop.

The termination condition of the inner loop (line 25) is “DR = φ”. Initially, DR is set to the initial clique Cpre at line 2, before the inner loop is entered. When a node v is dropped in the dropping procedure, it is deleted from DR at line 23 if v is contained in Cpre.

2.2.5.5 Example

To demonstrate how MLS works, we walk through the algorithm with the example shown in Figures 12-19. Given the 10-node graph in Figure 12, we want to find a clique of size 4 (b = 4). First, the 1-opt Local Search is conducted to find an initial clique (Figures 13-15). Next, the initial clique is input to MLS to find a “better” clique (Figures 16-19). In this example, we first try to find a larger clique and then try to minimize the total edge weight of the clique.

In Figures 12-19, the shaded number beside each node represents the average edge weight of the node.

Figure 12: A 10-node transition graph.

(I). At Step 1 in Figure 7: As illustrated in Figure 13, we choose node “3” as the seed; its degree is 4 and its average edge weight is 3.2. S_N^0, with 5 candidates, is then generated. (The current clique = {3}.)

(II). At Step 2 in Figure 7: As shown in Figure 14, we choose node “4”, which has the minimum average weight, and add it to the current clique. Therefore, the clique is {3, 4}, and S_N^1 is {2, 5, 6}.

Figure 14: A candidate node “4” with minimum average edge weight is
