Thesis Organization - Delay-Optimal Compressor Tree Synthesis Algorithm for LUT-Based Field-Programmable Gate Arrays

Chapter 1 Introduction

1.4 Thesis Organization

The rest of this thesis is organized as follows. Terminology, definitions, fundamental theorems, and problem formulation are introduced in Chapter 2.

Chapter 3 details the proposed delay optimal compressor tree synthesis algorithm with ILP formulation. The experimental results are then presented in Chapter 4.

Finally, Chapter 5 concludes this thesis.

Chapter 2 Preliminaries

2.1 Compressor Trees

A compressor tree is a circuit that performs multi-operand addition. Before the 1960s, multi-operand addition was usually accumulated by carry-propagate adders (CPAs). To minimize the delay of the carry chain produced by several CPAs, Wallace and Dadda proposed an efficient implementation in the 1960s that reduces all partial products into two partial products by full adders and half adders, and then adds the final two partial products by a CPA. Three reduction rules are used for constructing compressor trees: (i) any three dots with the same rank can be mapped onto a full adder, (ii) the remaining two dots with the same rank can be mapped onto a half adder or passed to the next stratum, and (iii) any dot left over is passed directly to the next stratum. The full adder acts as a 3:2 counter that adds as many dots as possible with the same rank. Figure 2.1 shows an example of a compressor tree on ASICs, which reduces three partial products into two partial products.

Figure 2.1 An example of a compressor tree on ASICs. (Legend: half adder, full adder; the 0th and 1st strata are shown.)
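As a rough illustration of the three reduction rules above, the following Python sketch applies them once to a dot diagram. The list layout (index = rank), the function name `reduce_stratum`, and the `use_half_adder` switch are assumptions made for this sketch and do not come from the thesis.

```python
def reduce_stratum(cols, use_half_adder=True):
    """Apply the three reduction rules once; cols[i] is the number of dots of rank i."""
    nxt = [0] * (len(cols) + 1)
    for i, dots in enumerate(cols):
        full, rest = divmod(dots, 3)
        nxt[i] += full            # rule (i): sum bit of each full adder
        nxt[i + 1] += full        # rule (i): carry bit of each full adder
        if rest == 2 and use_half_adder:
            nxt[i] += 1           # rule (ii): half adder sum bit
            nxt[i + 1] += 1       # rule (ii): half adder carry bit
        else:
            nxt[i] += rest        # rules (ii)/(iii): pass leftover dots through
    while len(nxt) > 1 and nxt[-1] == 0:
        nxt.pop()                 # drop an unused top column
    return nxt


# Three 4-bit partial products: three dots in each of the four ranks.
print(reduce_stratum([3, 3, 3, 3]))
```

For three 4-bit partial products, the result has at most two dots per rank, which is the three-to-two reduction the text attributes to Figure 2.1.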

2.2 Definitions

Firstly, this subsection describes a formal expression to characterize the topology of the compressor tree. A compressor tree consists of a series of strata.

Each stratum is represented by a dot plane. A dot plane with respect to the $s$-th stratum is denoted as an $n$-tuple $d_s = \langle t_{n-1}, t_{n-2}, \ldots, t_0 \rangle \in \mathbb{N}^n$ with $n \in \mathbb{Z}^+$, where $\mathbb{N}$ is the set of non-negative integers, $\mathbb{Z}^+$ is the set of positive integers, and $t_i$ indicates the number of dots in the $i$-th column of the dot plane on the $s$-th stratum of the compressor tree. The set of dot planes is defined as $D$, and then the function

Figure 2.2 (a) A dot plane $d_0 = \langle 3, 4, 2 \rangle$ on the 0-th stratum. (b) A pattern $\langle 2, 1 \rangle_p \in PS(3)$. (c) The pattern $\langle 2, 1 \rangle_p$ is mapped onto three 6-input LUTs.

The dot plane $d_0 = \langle 3, 4, 2 \rangle$ in Figure 2.2(a) is the input of a compressor tree consisting of three columns: two dots in the 0th column, four dots in the 1st column, and three dots in the 2nd column. Therefore, the height and the width of the dot plane $d_0$ are $h(d_0) = \max\{3, 4, 2\} = 4$ and $w(d_0) = 3$, respectively.
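The height and width of a dot plane can be sketched in a few lines of Python. The helper names (`from_tuple`, `height`, `width`) and the choice to store a dot plane as a list indexed by column are assumptions of this sketch; the thesis writes tuples with the highest column first.

```python
def from_tuple(t):
    """Convert <t_{n-1}, ..., t_0> (highest column first) into a list indexed by column."""
    return list(reversed(t))


def height(d):
    """h(d): the largest number of dots in any column."""
    return max(d)


def width(d):
    """w(d): the number of columns."""
    return len(d)


d0 = from_tuple((3, 4, 2))   # column 0 holds 2 dots, column 1 holds 4, column 2 holds 3
print(d0, height(d0), width(d0))
```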

The following subsection describes a formal expression to characterize the pattern. A pattern is denoted as an $m$-tuple $p = \langle t_{m-1}, t_{m-2}, \ldots, t_0 \rangle_p \in \mathbb{N}^m$ with $m \in \mathbb{Z}^+$, where $t_j$ indicates the number of dots in the $j$-th column of the pattern $p$.

The set of patterns is denoted as $P$, and then the function $v: P \times \mathbb{N} \to \mathbb{N}$ can

output bits. Thus the function $ow: P \to \mathbb{Z}^+$ calculates the minimal number of required output bits by $ow(p) = \lceil \log_2 ( \sum_{i=0}^{iw(p)-1} v(p, i) \cdot 2^i + 1 ) \rceil$. A pattern is similar to a counter in functionality, but it can sum inputs with value 1 in different ranks. For example, a 3:2 counter like $p_3$ as shown in Figure 2.3(c) sums three rank-0 inputs, while the pattern $\langle 2, 1 \rangle_p$ as shown in Figure 2.2(b) sums two rank-1 inputs and one rank-0 input; such a pattern can be mapped onto 6-input LUTs, as shown in Figure 2.2(c). The delay is equal to one LUT delay, as all patterns belong to $UPS(K)$. The delay optimal compressor tree can be constructed with $UPS(K)$, but $UPS(K)$ is an infinite set. In other words, we cannot determine the optimal solution with $UPS(K)$ unless there is a finite set that can construct the compressor tree without loss of delay optimality.
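A minimal sketch of the output-width function follows. It assumes that `ow` counts the bits needed to represent the largest sum a pattern can produce, which agrees with the worked values used later in the text (two output bits for a 3:2 counter, one for a single dot). The names `max_sum` and `ow`, and the list layout indexed by column, are assumptions of this sketch.

```python
def max_sum(p):
    """Largest value a pattern can produce: sum of v(p, i) * 2**i over its columns."""
    return sum(v << i for i, v in enumerate(p))


def ow(p):
    """Number of output bits: bits needed to represent max_sum(p)."""
    return max_sum(p).bit_length()


p3 = [3]            # <3>_p, the 3:2 counter
p_21 = [1, 2]       # <2, 1>_p: one rank-0 dot and two rank-1 dots
print(ow(p3), ow(p_21))
```

Using `int.bit_length` avoids floating-point logarithms and also returns 0 for an empty column.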

This thesis shows that a finite subset of the infinite set $UPS(K)$ exists that suffices to construct the compressor tree without loss of delay optimality. We denote the finite set as $PP$ and call the patterns in it prime patterns. Therefore, $p \in PP$ iff $p = \langle 1 \rangle_p \in PS(1)$, or $p$ has the property $\sum_{i=0}^{j-1} v(p, i) \cdot 2^i \ge 2^j$ for $0 < j \le iw(p)$.

the carry propagation in each of its columns. For example, the prime pattern $p_4 = \langle 1, 2 \rangle_p$ as shown in Figure 2.3(c) possibly produces two valid carries in the 0th and 1st columns due to $v(p_4, 0) \cdot 2^0 \ge 2^1$ and $\sum_{i=0}^{1} v(p_4, i) \cdot 2^i \ge 2^2$, respectively. On the other hand, the pattern $\langle 2, 1 \rangle_p$ is not a prime pattern because it will never produce a valid carry in the 0th column due to $v(\langle 2, 1 \rangle_p, 0) \cdot 2^0 < 2^1$.
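The prime-pattern condition can be checked mechanically. The sketch below assumes a pattern is stored as a sequence indexed by column and that a valid pattern has a non-empty top column; the function name `is_prime_pattern` is hypothetical.

```python
def is_prime_pattern(p):
    """True if p is <1>_p or every prefix of columns can carry into the next column."""
    p = tuple(p)
    if not p or p[-1] == 0:       # assumption: the top column must hold at least one dot
        return False
    if p == (1,):                 # the special case <1>_p
        return True
    prefix = 0
    for j in range(1, len(p) + 1):
        prefix += p[j - 1] << (j - 1)   # sum of v(p, i) * 2**i for i < j
        if prefix < (1 << j):           # columns below j cannot produce a carry
            return False
    return True


print(is_prime_pattern((2, 1)))   # p4 = <1, 2>_p
print(is_prime_pattern((1, 2)))   # <2, 1>_p
```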

Before formulating the compressor tree problem, this thesis introduces the relationship between the dot plane and the pattern. In this thesis, a match is a subgraph indicating the relationship between a pattern and a collection of dots, and the mapping is defined as $match: P \times \mathbb{N} \to M$, where $M$ is the set of matches.

Figure 2.4 demonstrates three feasible matches under the dot plane $d_0 = \langle 3, 4, 2 \rangle$:

On modern FPGAs, a ternary adder can sum three operands simultaneously. For flexibility, we denote the maximum number of operands of a CPA on the targeted FPGA as $H$. Therefore, $H$ is equal to 3 on modern FPGAs, and it may increase to 4, 5, 6, and so on in the future. We can construct the compressor tree by executing a sequence of covers until the height of the dot plane is less than or equal to $H$. After the compressor tree is constructed completely by the sequence of covers, at most $H$ numbers are summed by a CPA. Since the delay of a pattern in $UPS(K)$ is equal to one LUT delay, the depth of the compressor tree is equal to the number of covers executed. Clearly, the fewer covers are executed, the smaller the depth of the compressor tree. This thesis describes general pseudo code to construct compressor trees, as shown in Figure 2.6. Before the loop body is executed, the height of the dot plane is checked against $H$. If the height is larger than $H$, the dot plane needs a cover to reduce its height. After the dot plane $d_s$ is covered by the cover $c_s$, the resultant dot plane $d_{s+1}$ is produced by $map(d_s, c_s)$. Therefore, the depth of the compressor tree is equal to the number of loop iterations. The unit delay model is used such that the delay is determined by the depth of the compressor tree. Thus, we can declare that a compressor tree is delay optimal if its depth is the minimum.

Figure 2.5 Illustrations for covers under $d_0 = \langle 3, 4, 2 \rangle$.

2.4 Properties of Prime Patterns

In order to synthesize the delay optimal compressor tree, the set of building blocks should contain the patterns whose number of inputs (i.e., $\sum_{i=0}^{iw(p)-1} v(p, i)$) is equal to or less than $K$. In other words, the set of building blocks is exactly $UPS(K)$. Since $UPS(K)$ is an infinite pattern set, considering all combinations of the compressor tree with $UPS(K)$ is impossible. This thesis shows that the delay optimal compressor tree can be constructed by the finite set $UPPS(K)$, rather than $UPS(K)$, without loss of delay optimality.

Procedure: Compressor Tree Synthesis
Input: d_0, H
Output: s, {c_{s-1}, c_{s-2}, ..., c_0}
s = 0;
While (h(d_s) > H)
    Find a feasible cover c_s for the dot plane d_s;
    d_{s+1} = map(d_s, c_s);
    s = s + 1;
Return the depth s and the cover set {c_{s-1}, c_{s-2}, ..., c_0};

Figure 2.6 General compressor tree synthesis pseudo code.
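The loop of Figure 2.6 can be made runnable once some rule for finding a feasible cover is fixed. The sketch below plugs in a simple column-wise rule (as many $\langle K \rangle_p$ counters as fit in each column, plus one pattern for the leftover dots). That rule, the names `columnwise_cover`, `apply_cover`, and `synthesize`, and the guard on `K` and `H` are assumptions of this sketch; it is not the ILP-based method of the thesis and does not guarantee minimum depth.

```python
def ow(p):
    """Output bits of a pattern stored as a tuple indexed by column."""
    return sum(v << k for k, v in enumerate(p)).bit_length()


def apply_cover(cover):
    """map(d, c): every match (p, i) emits one dot in columns i .. i + ow(p) - 1."""
    nxt = [0] * max(i + ow(p) for p, i in cover)
    for p, i in cover:
        for b in range(ow(p)):
            nxt[i + b] += 1
    return nxt


def columnwise_cover(d, K):
    """Hypothetical cover rule: per column, K-input counters plus one leftover pattern."""
    cover = []
    for i, dots in enumerate(d):
        full, rest = divmod(dots, K)
        cover += [((K,), i)] * full
        if rest:
            cover.append(((rest,), i))
    return cover


def synthesize(d0, H, K):
    """Figure 2.6 loop: cover the dot plane until its height is at most H."""
    if K < 3 or H < 2:
        raise ValueError("this sketch needs K >= 3 and H >= 2 to terminate")
    d, covers = list(d0), []
    while max(d) > H:
        c = columnwise_cover(d, K)
        covers.append(c)
        d = apply_cover(c)
    return len(covers), covers, d


depth, covers, final_plane = synthesize([2, 4, 3], H=2, K=3)
print(depth, final_plane)
```

With the dot plane $\langle 3, 4, 2 \rangle$, $K = 3$, and $H = 2$, this sketch stops after two strata, the same depth the text reports for its own example cover.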

Before describing the fundamental theorem, we first define the subpattern.

The function $sub: \mathbb{N} \times \mathbb{N} \times P \to P$ defines the subpattern $sub(j, i, p) = \langle v(p, j), v(p, j-1), \ldots, v(p, i) \rangle_p \in P$ with the constraint $0 \le i \le j < iw(p)$. Figure 2.7 shows all subpatterns of the pattern $p_4 = \langle 1, 2 \rangle_p$: $sub(1, 1, p_4) = \langle 1 \rangle_p$, $sub(0, 0, p_4) = \langle 2 \rangle_p$, and $sub(1, 0, p_4) = \langle 1, 2 \rangle_p$.

In the following, this thesis defines pattern decomposition. The function $decompose: P \to \mathcal{P}(M)$ defines a list of feasible matches $\{(sub(j, i, p), k)\}$ such that the following conditions are satisfied: (i) $\forall (x, i) \in decompose(p): x \in PP$, and (ii) $\forall ((x, i), (y, j)) \in decompose(p)^2,\ i > j: ow(y) + j - 1 < i$. Figure 2.8 shows that the pattern $\langle 3, 1, 0, 1 \rangle_p$ can be partitioned into $\{p_1, p_1, p_3\}$ because of the pattern decomposition $decompose(\langle 3, 1, 0, 1 \rangle_p) = \{(p_1, 0), (p_1, 2), (p_3, 3)\}$. Then, this thesis shows that all patterns can be partitioned into a set of prime patterns.

Figure 2.8 Illustration for $decompose(\langle 3, 1, 0, 1 \rangle_p)$.
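One way to obtain such a decomposition is a greedy scan from the lowest column upward, cutting the pattern wherever the columns collected so far cannot carry into the next column. This scan is an assumption of the sketch below (the construction used in the thesis is not spelled out in the surviving text), as are the function name and the list layout indexed by column.

```python
def decompose(p):
    """Greedy split of a pattern into (subpattern, start column) pieces."""
    pieces, j, n = [], 0, len(p)
    while j < n:
        if p[j] == 0:             # empty column: nothing to match here
            j += 1
            continue
        end, prefix = j + 1, p[j]
        # A piece starting with a single dot is <1>_p; otherwise keep extending
        # while the collected columns can still carry into the next one.
        while p[j] >= 2 and end < n:
            extended = prefix + (p[end] << (end - j))
            if extended < (1 << (end - j + 1)):
                break
            prefix, end = extended, end + 1
        piece = list(p[j:end])
        while piece[-1] == 0:     # drop empty top columns of the piece
            piece.pop()
        pieces.append((tuple(piece), j))
        j = end
    return pieces


# <3, 1, 0, 1>_p stored by column: column 0 has 1 dot, column 3 has 3 dots.
print(decompose([1, 0, 1, 3]))
```

For $\langle 3, 1, 0, 1 \rangle_p$ the scan yields $\langle 1 \rangle_p$ at column 0, $\langle 1 \rangle_p$ at column 2, and $\langle 3 \rangle_p$ at column 3, the same pieces as the decomposition quoted in the text, while a prime pattern such as $\langle 1, 2 \rangle_p$ is returned whole.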

Lemma 1: For each pattern $p \in UPS(k)$, $p$ can be partitioned into a set of prime patterns $E_p = \{\hat{p} \mid (\hat{p}, i) \in decompose(p)\}$.

Proof: In the beginning, we identify whether $p$ belongs to $UPPS(k)$. If $p \in UPPS(k)$, $E_p = \{p\}$. Otherwise, we check the carry propagation possibility from the 0th column to the $(iw(p)-1)$-th column in the pattern $p$. Because $p$ belongs to $UNPPS(K)$, there exists a set of non-negative integers

$p$ can be partitioned into a set of prime patterns $E_p = \{\hat{p} \mid (\hat{p}, i) \in decompose(p)\}$.

According to Lemma 1, a non-prime pattern can be replaced by a set of prime patterns. Therefore, Lemma 2 can be deduced from Lemma 1. Due to pattern decomposition, the output of a non-prime pattern $p$ may differ from that of $E_p$. For example, the pattern $\langle 3, 1, 0, 1 \rangle_p$ has the output $\langle 1, 1, 1, 1, 1 \rangle$, but

compressor tree constructed with prime patterns, and then the latter has the same or less depth.

Lemma 2: If there exists a compressor tree $T$ constructed with $UPS(k)$, there exists another compressor tree $T'$ constructed with $UPPS(k)$ such that the depth of $T'$ is equal to or less than that of $T$.

denoted as $\hat{c}_0$ and $map(d_0, \hat{c}_0)$ is denoted as $\hat{d}_1$. Since the decomposition may

Lemma 2. Theorem 1 illustrates that the compressor tree with the minimum depth

Theorem 1: The minimum depth of the compressor tree constructed with $UPS(k)$ is the same as that of the compressor tree constructed with $UPPS(k)$.

Proof: Let $T$ be the compressor tree constructed with $UPS(k)$, and $T'$ be the compressor tree constructed with $UPPS(k)$; meanwhile, $T$ and $T'$ have the minimum depths $z$ and $z'$. Since $UPS(k)$ contains $UPPS(k)$, i.e., the solution space with $UPPS(k)$ is a subset of that with $UPS(k)$, we can derive that $z \le z'$. According to Lemma 2, a compressor tree constructed with $UPPS(k)$ has the depth $z'' \le z$. Since $T'$ is the compressor tree constructed with $UPPS(k)$ that has the minimum depth, we can derive that $z' \le z'' \le z$. Due to $z' \le z$ and $z \le z'$, $z$ is equal to $z'$.

According to the theorem, the delay optimal compressor tree can be constructed with $UPPS(K)$, rather than $UPS(K)$. In other words, $UPPS(K)$ is a compact set of basic building blocks for compressor trees. Throughout the rest of this thesis, we address the compressor tree problem using only $UPPS(K)$.

Chapter 3 Proposed Algorithm

In this chapter, this thesis describes a delay optimal compressor tree synthesis algorithm, DOCT, to synthesize compressor trees. The detailed process is shown in Figure 3.1. Step 1 generates all prime patterns in $UPPS(K)$. Step 2 determines the upper bound of the minimum depth, denoted as UB, under the given input of the

generate all the corresponding constraints and the objective function. Step 4 obtains the delay optimal compressor tree via the ILP solver. Furthermore, Step 5 uses a post-processing procedure to minimize area overhead.
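Step 1 can be illustrated by brute-force enumeration. The sketch below assumes that $UPPS(K)$ is the set of prime patterns with at most $K$ input dots and a non-empty top column, and that such patterns are at most $K$ columns wide; both assumptions, and the function names, belong to this sketch rather than to the thesis.

```python
from itertools import product


def is_prime_pattern(p):
    """<1>_p, or every prefix of columns can carry into the next column."""
    if p == (1,):
        return True
    prefix = 0
    for j in range(1, len(p) + 1):
        prefix += p[j - 1] << (j - 1)
        if prefix < (1 << j):
            return False
    return True


def prime_patterns(K):
    """Enumerate prime patterns (tuples indexed by column) with at most K inputs."""
    found = []
    for width in range(1, K + 1):
        for p in product(range(K + 1), repeat=width):
            if p[-1] > 0 and sum(p) <= K and is_prime_pattern(p):
                found.append(p)
    return found


print(prime_patterns(3))
```

For $K = 3$ the enumeration returns four patterns, which correspond to $p_1 = \langle 1 \rangle_p$, $p_2 = \langle 2 \rangle_p$, $p_3 = \langle 3 \rangle_p$, and $p_4 = \langle 1, 2 \rangle_p$ as they are named in the examples of this thesis.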

3.1 Upper Bound Determination

Since this thesis unrolls the loop as shown in Figure 2.6 to get the delay optimal compressor tree, the upper bound of the minimum depth needs to be determined in advance. Therefore, this subsection describes how to determine UB.

Given a dot plane $d_0$ which is the input of the compressor tree, we can construct another dot plane $d_0'$ such that $w(d_0) = w(d_0')$ and $r(d_0', i) = h(d_0)$ for $0 \le i < w(d_0)$. We call this process extend. For example, we can extend $d_0 = \langle 3, 4, 2 \rangle$ to $d_0' = \langle 4, 4, 4 \rangle$, as shown in Figure 3.2(a). The prime pattern $x = \langle K \rangle_p$ matches as many dots in the dot plane $d_0'$ as possible if $K$ is equal to or less than $h(d_0')$. Then, the prime pattern $y = \langle h(d_0') \bmod K \rangle_p$ matches the remaining dots, where $K$ is the given input constraint of a LUT. Since $d_0'$ is regular, we can determine $h(d_1')$ precisely by equation (1), where $d_1' = map(d_0', c)$ is a dot plane derived from the cover $c$ built from the matches $match(x, i)$ and $match(y, j)$ for $0 \le i, j \le w(d_0') - 1$. Therefore, the upper bound of the estimation, $\min\{h(d_1) \mid \exists c \in C: d_1 = map(d_0', c)\}$, can be determined by $h(d_1')$.

$$h(d_{s+1}') = \lfloor h(d_s') / K \rfloor \cdot ow(\langle K \rangle_p) + ow(\langle h(d_s') \bmod K \rangle_p) \qquad (1)$$

Figure 3.2 (a) Extending $d_0 = \langle 3, 4, 2 \rangle$ to $d_0' = \langle 4, 4, 4 \rangle$. (b) The resultant dot plane $d_1' = map(d_0', c)$ with $d_0'$ covered by a collection of $p_1$ and $p_3$ such that $h(d_1') = 3$.

In all cases of this thesis we assume $H \ge 2$. To determine UB, we execute equation (1) recursively until $h(d_{s+1}')$ is equal to or less than $H$. As the number of times equation (1) is executed is $z'$, $z'$ and UB should be equal. For example, based on $K = 3$ and $d_0' = \langle 4, 4, 4 \rangle$, nine dots in $d_0'$ are first matched to $p_3 = \langle 3 \rangle_p$, and then the others are matched to $p_1 = \langle 4 \bmod 3 \rangle_p$, as shown in Figure 3.2(b). Therefore, we can derive the upper bound of the minimum height of the dot plane $d_1'$ as $\lfloor 4 / 3 \rfloor \cdot ow(\langle 3 \rangle_p) + ow(\langle 4 \bmod 3 \rangle_p) = 3$. Moreover, we can determine the upper bound of the minimum height of the dot plane $d_2'$ as 2 by equation (1). Since $h(d_2')$ is equal to $H$ (here $H = 2$), the determination process for UB is terminated. In this example, equation (1) is executed twice. Hence, UB is equal to 2.
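The recursion on equation (1) is easy to check numerically. The sketch below assumes that a single-column pattern with $n$ dots needs as many output bits as it takes to represent $n$; the function name `upper_bound` and the guard on `K` and `H` (without which the recursion would not shrink the height) are assumptions of this sketch.

```python
def ow_single(n):
    """Output bits of the single-column pattern <n>_p."""
    return n.bit_length()


def upper_bound(h0, K, H):
    """Count how often equation (1) must be applied until the height is at most H."""
    if K < 3 or H < 2:
        raise ValueError("the recursion only shrinks the height for K >= 3 and H >= 2")
    h, steps = h0, 0
    while h > H:
        h = (h // K) * ow_single(K) + ow_single(h % K)   # equation (1)
        steps += 1
    return steps


print(upper_bound(4, K=3, H=2))   # heights go 4 -> 3 -> 2
```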

3.2 Variables

This subsection introduces the variables used in ILP formulation.

$x_{s,i,j}$: the count of the match $match(p_j, i)$ occurring on $d_s$

Figure 3.3 (a) The compressor tree before area minimization. (b) The compressor tree after Phase I of the post-processing procedure.

Figure 3.3(a) shows that the dot plane $d_0 = \langle 3, 4, 2 \rangle$ has a cover $c_0 = cover(\{m_1, m_2, m_3\})$ which is constructed from $m_1 = match(p_4 = \langle 1, 2 \rangle_p, 0)$, $m_2 = match(p_3 = \langle 3 \rangle_p, 1)$, and $m_3 = match(p_3, 2)$; therefore, $x_{0,0,4} = x_{0,1,3} = x_{0,2,3} = 1$ and the other variables $x_{0,i,j}$ on the 0th stratum are equal to zero. Furthermore, the dot plane $d_1 = \langle 1, 3, 2, 1 \rangle$ has a cover $c_1 = cover(\{m_3, m_4, m_5, m_6\})$ which is constructed from $m_4 = match(p_1 = \langle 1 \rangle_p, 0)$, $m_5 = match(p_2 = \langle 2 \rangle_p, 1)$, and $m_6 = match(p_1, 3)$; therefore, $x_{1,0,1} = x_{1,1,2} = x_{1,2,3} = x_{1,3,1} = 1$, and the other variables $x_{1,i,j}$ on the 1st stratum are equal to zero.

3.3 Covering and Succeeding Constraints

Constraint (2) is called the covering constraint and is used to enforce a feasible cover under the dot plane $d_s$. We use the covering constraint to ensure that every dot is matched exactly once. The inner summation of constraint (2) sums the number of dots in the $i$-th column of $d_s$ that are matched to the prime pattern $p_j$ by the match $match(p_j, i-k)$ for $0 \le k < \min\{iw(p_j), i+1\}$. Further, the outer summation of constraint (2) sums all the results of the inner summation for all prime patterns. Figure 3.3(a) shows that the covering constraint enforces a feasible cover $cover(\{m_1, m_2, m_3\})$ under the dot plane $d_0 = \langle 3, 4, 2 \rangle$ such that $2x_{0,0,4} = 2 = h_{0,0}$, $3x_{0,1,3} + x_{0,0,4} = 4 = h_{0,1}$, and $3x_{0,2,3} = 3 = h_{0,2}$.

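$$\sum_{j=1}^{|UPPS(K)|} \ \sum_{k=0}^{\min\{iw(p_j),\, i+1\}-1} v(p_j, k) \cdot x_{s,i-k,j} = h_{s,i} \qquad (2)$$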

Since the dot plane $d_{s+1}$ depends on how the preceding dot plane $d_s$ is covered, we need constraint (3) to construct the dot plane $d_{s+1}$ such that $h_{s+1,i} = r(d_{s+1}, i)$. Further, the outer summation of constraint (3) sums all the results of the inner summation for all prime patterns to get $h_{s+1,i}$. Figure 3.3(a) shows $h_{1,0} = x_{0,0,4} = 1$,

The union of constraints (4) and (5) is used to compute the correct $c_{s,i}$. This thesis calls the union of the two constraints the column constraint. If the $i$-th element of the dot plane $d_s$ is more than $H$, the column constraint enforces $c_{s,i}$ to be 1. On the other hand, $c_{s,i}$ is enforced to be 0 when $h_{s,i}$ is equal to or less than $H$. In Figure 3.3(a), due to $h_{1,1} = 2$ and $h_{1,2} = 3$, $c_{1,1}$ is set to 0 and $c_{1,2}$ is set to 1 via the column constraint. Besides, the term Inf used in constraints (5) and (7) can be set as $\sum_{i=0}^{w(d_0)-1} r(d_0, i)$.

constraints is called the stratum constraint. Figure 3.3(a) shows an illustrative example. The variable $q_1$ is set to 1 via the stratum constraint due to $h(d_1) = 3 > H$.
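The quantities that the covering, succeeding, column, and stratum constraints tie together can be checked on the running example without an ILP solver. The sketch below evaluates them directly for a given cover; it is a checker written for illustration, not the ILP formulation of the thesis, and all function names and the list layout indexed by column are assumptions.

```python
def ow(p):
    """Output bits of a pattern stored as a tuple indexed by column."""
    return sum(v << k for k, v in enumerate(p)).bit_length()


def is_exact_cover(d, cover):
    """Covering check: every dot of d is matched exactly once."""
    used = {}
    for p, i in cover:
        for k, v in enumerate(p):
            used[i + k] = used.get(i + k, 0) + v
    cols = set(range(len(d))) | set(used)
    return all(used.get(i, 0) == (d[i] if i < len(d) else 0) for i in cols)


def next_plane(cover):
    """Succeeding plane: match (p, i) emits one dot in columns i .. i + ow(p) - 1."""
    nxt = [0] * max(i + ow(p) for p, i in cover)
    for p, i in cover:
        for b in range(ow(p)):
            nxt[i + b] += 1
    return nxt


def indicators(d, H):
    """Column flags c_{s,i} (height above H) and stratum flag q_s (any flag set)."""
    c = [int(h > H) for h in d]
    return c, int(any(c))


P = {1: (1,), 2: (2,), 3: (3,), 4: (2, 1)}     # p1..p4, stored by column
d0 = [2, 4, 3]                                 # <3, 4, 2>
c0 = [(P[4], 0), (P[3], 1), (P[3], 2)]
d1 = next_plane(c0)                            # <1, 3, 2, 1> stored by column
c1 = [(P[1], 0), (P[2], 1), (P[3], 2), (P[1], 3)]
d2 = next_plane(c1)
print(is_exact_cover(d0, c0), d1, indicators(d1, H=2), d2)
```

With $H = 2$, the flags computed for $d_1$ agree with the values quoted from Figure 3.3(a): $c_{1,1} = 0$, $c_{1,2} = 1$, and $q_1 = 1$.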

3.5 Objective Function

As stated above, the summation of q_s will be the depth of the compressor tree and it is equal to the equation (8). When we minimize the equation (8), the depth of the compressor tree is minimized.

$$\text{Minimize: } \sum_{s=0}^{UB} q_s \qquad (8)$$

For example, the depth of the compressor tree is 2 when the dot plane $d_0 = \langle 3, 4, 2 \rangle$ has the specific cover $c_0 = cover(\{(p_4, 0), (p_3, 1), (p_3, 2)\})$, and $d_1 = \langle 1, 3, 2, 1 \rangle$ has the cover $c_1 = cover(\{(p_1, 0), (p_2, 1), (p_3, 2), (p_1, 3)\})$, as shown in Figure 3.3(a).

3.6 Complexity Analysis

Here, this thesis analyzes the complexity of the number of variables and constraints in our ILP formulation. First, the number of variables is proportional to the number of patterns (i.e., $|UPPS(K)|$), the number of columns in every dot plane, and the upper bound of the minimum depth UB. This thesis denotes the number of patterns as $|P|$. Besides, the minimum depth upper bound UB is proportional to $\log(h(d_0))$. Therefore, the complexity of the number of variables is $O(\log(h(d_0)) \cdot w(d_0) \cdot |P|)$. Second, this thesis analyzes the number of constraints. Similar to the number of variables, the number of constraints is proportional to the number of columns in every dot plane and the minimum depth upper bound UB. Therefore, the complexity of the number of constraints is $O(\log(h(d_0)) \cdot w(d_0))$.

3.7 Post-processing for Area Minimization

This thesis describes a post-processing procedure to reduce the area overhead without losing delay optimality. The procedure consists of two phases. Before detailing these two phases, we first define redundant matches.

A match $m$ on the dot plane $d_s$ is redundant iff $h(d_{s+1})$ does not increase when all dots matched by $m$ are matched by $p_1$ instead. In Phase I, we delete all redundant matches under the dot plane $d_{z-1}$ on the penultimate stratum, where $z$ is the minimum depth of the delay optimal compressor tree. Figure 3.3(a) shows that a redundant match $match(p_2, 1)$ exists on the dot plane $d_1 = \langle 1, 3, 2, 1 \rangle$, based on the specific cover $cover(\{(p_1, 0), (p_2, 1), (p_3, 2), (p_1, 3)\})$. Figure 3.3(b) shows that the depth of the compressor tree does not increase after deleting the redundant match $match(p_2, 1)$ on $d_1$. Based on this observation, this thesis presents Phase I of the post-processing procedure for area minimization. First, we check the existence of redundant matches under the dot plane $d_{z-1}$. If there is a redundant match $m$, it is deleted from the compressor tree; e.g., the two dots of the match $match(p_2, 1)$ on the dot plane $d_1$ can be matched by $p_1$ instead such that they are passed through from the dot plane $d_1$ to the dot plane $d_2$, as shown in Figure 3.3(b). Otherwise, Phase I is finished. As shown in Figure 3.4, the process stated above is one iteration of Phase I; the post-processing procedure repeats the process until there is no redundant match on the penultimate stratum.
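Phase I can be sketched as a loop that tries to replace each match by pass-through $p_1$ matches and keeps the replacement whenever the height of the next dot plane does not grow. The function names, the scan order, and the representation of a cover as a list of (pattern, column) pairs are assumptions of this sketch.

```python
def ow(p):
    """Output bits of a pattern stored as a tuple indexed by column."""
    return sum(v << k for k, v in enumerate(p)).bit_length()


def next_plane(cover):
    """Dot plane produced by a cover: match (p, i) emits dots in columns i .. i + ow(p) - 1."""
    nxt = [0] * max(i + ow(p) for p, i in cover)
    for p, i in cover:
        for b in range(ow(p)):
            nxt[i + b] += 1
    return nxt


def remove_redundant(cover):
    """Replace matches by pass-through <1>_p matches while the next height does not grow."""
    limit = max(next_plane(cover))
    cover = list(cover)
    changed = True
    while changed:                        # one pass of this loop is one Phase I iteration
        changed = False
        for idx, (p, i) in enumerate(cover):
            if p == (1,):
                continue                  # already a pass-through
            passthrough = [((1,), i + k) for k, v in enumerate(p) for _ in range(v)]
            trial = cover[:idx] + passthrough + cover[idx + 1:]
            if max(next_plane(trial)) <= limit:
                cover, changed = trial, True
                break
    return cover


c1 = [((1,), 0), ((2,), 1), ((3,), 2), ((1,), 3)]   # cover of d1 from Figure 3.3(a)
print(remove_redundant(c1))
```

On the cover of $d_1$ from Figure 3.3(a), the sketch removes $match(p_2, 1)$ and keeps $match(p_3, 2)$, since passing the three dots of column 2 through would raise the next height to 3.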

In practice, basic logic cells on modern FPGAs are flexible: modern FPGAs employ two single-output LUTs with shared inputs, as shown in Figure 3.5. In general, this kind of circuit is called a two-output LUT. We observe that a two-output LUT can map two single-output Boolean functions simultaneously if the two functions satisfy two conditions: (i) the total number of distinct variables of the two functions is less than or equal to the physical-input constraint, denoted as PIC (PIC = 8 on Altera Stratix IV and PIC = 5 on Xilinx Virtex-5; PIC is equal to 6 in the example shown in Figure 3.5), and (ii) the sum of the LUT sizes of the two functions is less than or equal to the physical-capacity constraint, denoted as PCC (PCC = 64 on Altera Stratix IV and PCC = 64 on Xilinx Virtex-5 [1, 2]). Actually, PCC is equal to $2^K$, where $K$ is the input constraint of a LUT. Besides, a two-output LUT can map a single-output function if the number of variables of the function is $K$ ($K = 6$), as shown in Figure 3.5.

Figure 3.4 Phase I of the post-processing procedure.

Moreover, we can merge two distinct LUTs among all strata if the two functions mapped by these two cells satisfy PIC and PCC. Suppose we want to map two instances of the prime pattern $p_4 = \langle 1, 2 \rangle_p$ as shown in Figure 3.6(a) (i.e., PIC = 6 and PCC = 64); we can then map the two patterns onto four two-output LUTs as shown in Figure 3.6(a). The total number of inputs of LUT 2 and LUT 4 is equal to 6, which does not exceed PIC, and the sum of the LUT sizes is equal to $2^3 + 2^3$, which is less than PCC. Hence, we can merge LUT 2 and LUT 4 into a single LUT as shown in Figure 3.6(b). In Phase II, we merge distinct LUTs that map different patterns if the two functions mapped by them satisfy PIC and PCC.

Figure 3.5 A two-output LUT with shared inputs.

Figure 3.6 (a) The mapping before Phase II. (b) The mapping after Phase II.

Chapter 4 Experimental Results

4.1 Experimental Information

We implement DOCT and the GPC heuristic [14] in C/C++ on a workstation with an Intel Xeon 2-GHz processor and 16 GB of main memory under the CentOS 5.2 operating system. In addition, an open source package, lp_solve 5.5.13, is used to solve the linear formulations. A set of benchmark circuits is evaluated, including three Radix-4 unsigned Booth-encoded multipliers (8 by 8 and 16 by 16), multiplier accumulators (MAC), discrete cosine transformation (DCT) [20], finite impulse response filters (FIR), and motion estimations (ME). The input of each compressor tree is extracted from the simulation result produced by the MATLAB Simulink toolbox. All compressor trees in our experiments are synthesized directly without pipelining.

Table I gives the detailed information of the benchmark circuits. In Table I, Column 1 shows the variety of our benchmark circuits. Column 2 and Column 3 show the width and the height of the input dot plane, respectively. Column 4 shows the number of dots in the input dot plane.

4.2 Parameters Setup

We implement two compressor tree synthesis algorithms DOCT and the GPC heuristic. The following is the setting of parameters in our experiment.

DOCT: Compressor tree synthesis using DOCT described in preceding sections
