Chapter 2 Preliminaries
2.4 Properties of Prime Patterns
In order to synthesize the delay-optimal compressor tree, the set of building blocks should contain every pattern whose number of inputs (i.e., $\sum_{i=0}^{iw(p)-1} v(p,i)$) is equal to or less than $K$. In other words, the set of building blocks is exactly $UPS(K)$. Since $UPS(K)$ is an infinite pattern set, considering all combinations of the compressor tree with $UPS(K)$ is impossible. This thesis establishes that the delay-optimal compressor tree can be constructed from the finite set $UPPS(K)$, rather than $UPS(K)$, without loss of delay optimality.
Procedure: Compressor Tree Synthesis
Input: d_0, H
Output: s, {c_{s-1}, c_{s-2}, …, c_0}
1. s = 0;
2. While (h(d_s) > H)
3.     Find a feasible cover c_s for the dot plane d_s;
4.     d_{s+1} = map(d_s, c_s);
5.     s = s + 1;
6. Return the depth s and the cover set {c_{s-1}, c_{s-2}, …, c_0};
Figure 2.6 General compressor tree synthesis pseudo code.
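The loop of Figure 2.6 can be sketched in Python as follows. This is a minimal illustration, not the thesis's algorithm: dot planes and patterns are stored as lists of per-column counts with the least significant column first, and `greedy_cover` is a hypothetical stand-in for step 3 that uses only single-column patterns. It assumes $K \ge 3$ and $H \ge 2$ so that every iteration removes at least one dot.

```python
from typing import List, Tuple

Pattern = Tuple[int, ...]    # input counts per column, least significant column first
Match = Tuple[Pattern, int]  # (pattern, anchor column)

def ow(p: Pattern) -> int:
    """Output width: bits needed for the largest sum the pattern can produce."""
    return sum(v << i for i, v in enumerate(p)).bit_length()

def h(d: List[int]) -> int:
    """Height of a dot plane: the largest column count."""
    return max(d, default=0)

def greedy_cover(d: List[int], K: int) -> List[Match]:
    """Hypothetical cover: fill each column with <K> patterns plus one remainder pattern."""
    cover: List[Match] = []
    for col, dots in enumerate(d):
        cover += [((K,), col)] * (dots // K)
        if dots % K:
            cover.append(((dots % K,), col))
    return cover

def map_plane(d: List[int], cover: List[Match]) -> List[int]:
    """Each match emits one dot in every column from its anchor to anchor + ow - 1."""
    width = max([len(d)] + [col + ow(p) for p, col in cover])
    nxt = [0] * width
    for p, col in cover:
        for k in range(ow(p)):
            nxt[col + k] += 1
    return nxt

def synthesize(d0: List[int], H: int, K: int):
    """Steps 1 to 6 of Figure 2.6; returns the depth and the covers c_0 .. c_{s-1}."""
    if K < 3 or H < 2:
        raise ValueError("this sketch assumes K >= 3 and H >= 2")
    d, covers = list(d0), []
    while h(d) > H:
        c = greedy_cover(d, K)
        covers.append(c)
        d = map_plane(d, c)
    return len(covers), covers

# Dot plane <3,4,2> written least significant column first.
depth, covers = synthesize([2, 4, 3], H=2, K=3)
print(depth)
```

Because the cover is chosen greedily, the depth returned here is only one achievable depth; finding the minimum is the subject of Chapter 3.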
Before describing the fundamental theorem, we first define the subpattern.
The function $sub: N \times N \times P \to P$ defines the subpattern $sub(j,i,p) = \langle v(p,j), v(p,j-1), \ldots, v(p,i) \rangle_p \in P$ with the constraint $0 \le i \le j < iw(p)$. Figure 2.7 shows all subpatterns of the pattern $p_4 = \langle 1,2 \rangle_p$: $sub(1,1,p_4) = \langle 1 \rangle_p$, $sub(0,0,p_4) = \langle 2 \rangle_p$, and $sub(1,0,p_4) = \langle 1,2 \rangle_p$.
Next, this thesis defines pattern decomposition. The function $decompose$ maps a pattern $p \in P$ to a list of feasible matches $\{(sub(j,i,p), k)\}$ such that the following conditions are satisfied: (i) $\forall (x,i) \in decompose(p): x \in PP$, and (ii) $\forall ((x,i),(y,j)) \in decompose(p)^2,\ i > j: ow(y) + j - 1 < i$. Figure 2.8 shows that the pattern $\langle 3,1,0,1 \rangle_p$ can be partitioned into $\{p_1, p_1, p_3\}$ because of the pattern decomposition $decompose(\langle 3,1,0,1 \rangle_p) = \{(p_1,0), (p_1,2), (p_3,3)\}$. Then, this thesis shows that all patterns can be partitioned into a set of prime patterns.
Figure 2.8 Illustration for $decompose(\langle 3,1,0,1 \rangle_p)$.
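The subpattern and decomposition examples above can be checked with a short Python sketch. Patterns are stored in the thesis's written order (most significant column first). The helper names `inputs_per_column` and `outputs_disjoint` are hypothetical, and the disjointness test reflects one reading of condition (ii), namely that the outputs of one part must end below the anchor column of the next part.

```python
from typing import List, Tuple

Pattern = Tuple[int, ...]  # thesis order: most significant column first

def v(p: Pattern, i: int) -> int:
    """Number of inputs of p in column i (column 0 is the rightmost entry)."""
    return p[len(p) - 1 - i]

def sub(j: int, i: int, p: Pattern) -> Pattern:
    """Subpattern <v(p,j), ..., v(p,i)> with 0 <= i <= j < iw(p)."""
    if not 0 <= i <= j < len(p):
        raise ValueError("require 0 <= i <= j < iw(p)")
    return tuple(v(p, c) for c in range(j, i - 1, -1))

def ow(p: Pattern) -> int:
    """Output width: bits needed for the largest sum the pattern can produce."""
    return sum(v(p, i) << i for i in range(len(p))).bit_length()

def inputs_per_column(matches: List[Tuple[Pattern, int]], width: int) -> List[int]:
    """Inputs consumed in each column by a list of (pattern, anchor column) pairs."""
    cols = [0] * width
    for q, anchor in matches:
        for k in range(len(q)):
            cols[anchor + k] += v(q, k)
    return cols

def outputs_disjoint(matches: List[Tuple[Pattern, int]]) -> bool:
    """True when no part's output columns reach the anchor of the next part."""
    spans = sorted((anchor, anchor + ow(q) - 1) for q, anchor in matches)
    return all(hi < lo for (_, hi), (lo, _) in zip(spans, spans[1:]))

p4 = (1, 2)
print(sub(1, 1, p4), sub(0, 0, p4), sub(1, 0, p4))

p = (3, 1, 0, 1)
parts = [(sub(0, 0, p), 0), (sub(2, 2, p), 2), (sub(3, 3, p), 3)]
print(inputs_per_column(parts, len(p)), outputs_disjoint(parts), ow(p))
```

The parts consume exactly the inputs of the original pattern, while the original pattern spans five output columns and the parts together occupy only four.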
Lemma 1: For each pattern $p \in UPS(k)$, $p$ can be partitioned into a set of prime patterns $E_p = \{\hat{p} \mid (\hat{p}, i) \in decompose(p)\}$.
Proof: In the beginning, we identify whether $p$ belongs to $UPPS(k)$. If $p \in UPPS(k)$, $E_p = \{p\}$. Otherwise, we check the carry propagation possibility from the 0th column to the $(iw(p)-1)$-th column in the pattern $p$. Because $p$ belongs to $UNPPS(K)$, there exists a set of non-negative integers ... $p$ can be partitioned into a set of prime patterns $E_p = \{\hat{p} \mid (\hat{p}, i) \in decompose(p)\}$.
According to Lemma 1, a non-prime pattern can be replaced by a set of prime patterns. Therefore, Lemma 2 can be deduced from Lemma 1. Due to pattern decomposition, the output of a non-prime pattern $p$ may differ from that of $E_p$. For example, the pattern $\langle 3,1,0,1 \rangle_p$ has the output $\langle 1,1,1,1,1 \rangle$, but ... compressor tree constructed with prime patterns, and then the latter has the same or less depth.
Lemma 2: If there exists a compressor tree $T$ constructed with $UPS(k)$, there exists another compressor tree $T'$ constructed with $UPPS(k)$ such that the depth of $T'$ is equal to or less than that of $T$.
... denoted as $\hat{c}_0$, and $map(d_0, \hat{c}_0)$ is denoted as $\hat{d}_1$. Since the decomposition may ... Lemma 2. Theorem 1 illustrates that the compressor tree with the minimum depth ...
Theorem 1: The minimum depth of the compressor tree constructed with $UPS(k)$ is the same as that of the compressor tree constructed with $UPPS(k)$.
Proof: Let $T$ be the compressor tree constructed with $UPS(k)$, and $T'$ be the compressor tree constructed with $UPPS(k)$; meanwhile, $T$ and $T'$ have the minimum depths $z$ and $z'$, respectively. Since $UPS(k)$ contains $UPPS(k)$, i.e., the solution space with $UPPS(k)$ is a subset of that with $UPS(k)$, we can derive that $z' \ge z$. According to Lemma 2, a compressor tree constructed with $UPPS(k)$ has the depth $z'' \le z$. Since $T'$ is the compressor tree constructed with $UPPS(k)$ that has the minimum depth, we can derive that $z' \le z'' \le z$. Due to $z' \le z$ and $z' \ge z$, $z$ is equal to $z'$.
According to the theorem, the delay-optimal compressor tree can be constructed with $UPPS(K)$, rather than $UPS(K)$. In other words, $UPPS(K)$ is a compact set of basic building blocks for compressor trees. Throughout the rest of this thesis, we address the compressor tree problem using only $UPPS(K)$.
Chapter 3
Proposed Algorithm
In this chapter, this thesis describes a delay-optimal compressor tree synthesis algorithm, DOCT, to synthesize compressor trees. The detailed process is shown in Figure 3.1. Step 1 generates all prime patterns in $UPPS(K)$. Step 2 determines the upper bound of the minimum depth, denoted as UB, under the given input of the compressor tree. Step 3 generates all the corresponding constraints and the objective function. Step 4 obtains the delay-optimal compressor tree via the ILP solver. Furthermore, Step 5 uses a post-processing procedure to minimize area overhead.
3.1 Upper Bound Determination
Since this thesis unrolls the loop as shown in Figure 2.6 to get the delay optimal compressor tree, the upper bound of the minimum depth needs to be determined in advance. Therefore, this subsection describes how to determine UB.
Given a dot plane $d_0$ which is the input of the compressor tree, we can construct another dot plane $d_0'$ such that $w(d_0) = w(d_0')$ and $r(d_0', i) = h(d_0)$ for $0 \le i < w(d_0)$. We call this process extend. For example, we can extend $d_0 = \langle 3,4,2 \rangle$ to $d_0' = \langle 4,4,4 \rangle$, as shown in Figure 3.2(a). The prime pattern $x = \langle K \rangle_p$ matches as many dots in the dot plane $d_0'$ as possible if $K$ is equal to or less than $h(d_0')$. Then, the prime pattern $y = \langle h(d_0') \bmod K \rangle_p$ matches the remaining dots, where $K$ is the given input constraint of a LUT. Since $d_0'$ is regular, we can determine $h(d_1')$ precisely by the equation (1), where $d_1' = map(d_0', c)$ is a dot plane derived from the cover $c = \bigcup_{i=0}^{w(d_0')-1} match(x,i) \cup \bigcup_{j=0}^{w(d_0')-1} match(y,j)$. Therefore, the upper bound of the estimation, $\min\{h(d_1) \mid c \in C : d_1 = map(d_0', c)\}$, can be determined by $h(d_1')$.
$$h(d_{s+1}') = \lfloor h(d_s') / K \rfloor \cdot ow(\langle K \rangle_p) + ow(\langle h(d_s') \bmod K \rangle_p) \qquad (1)$$
Figure 3.2 (a) Extending $d_0 = \langle 3,4,2 \rangle$ to $d_0' = \langle 4,4,4 \rangle$. (b) The resultant dot plane $d_1' = map(d_0', c)$ with $d_0'$ covered by a collection of $p_1$ and $p_3$ such that $h(d_1') = 3$.
In all cases of this thesis we assume $H \ge 2$. To determine UB, we execute the equation (1) recursively until $h(d_{s+1}')$ is equal to or less than $H$. Since the number of times the equation (1) is executed is $z'$, $z'$ and UB are equal. For example, based on $K = 3$ and $d_0' = \langle 4,4,4 \rangle$, nine dots in $d_0'$ are first matched to $p_3 = \langle 3 \rangle_p$, and then the others are matched to $p_1 = \langle 4 \bmod 3 \rangle_p$, as shown in Figure 3.2(b). Therefore, we can derive the upper bound of the minimum height of the dot plane $d_1'$ as $\lfloor 4/3 \rfloor \cdot ow(\langle 3 \rangle_p) + ow(\langle 4 \bmod 3 \rangle_p) = 3$. Moreover, we can determine the upper bound of the minimum height of the dot plane $d_2'$ as 2 by the equation (1). Since $h(d_2')$ is equal to $H$, the determination process for UB terminates. In this example, the equation (1) is executed twice. Hence, UB is equal to 2.
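The iteration of equation (1) can be sketched as a few lines of Python. The function names are hypothetical; the sketch assumes that $ow(\langle n \rangle_p)$ for a single-column pattern with $n$ inputs is the number of bits needed to write $n$ (so an empty pattern contributes 0), and it requires $K \ge 3$ and $H \ge 2$ so that the height strictly decreases.

```python
def next_height(height: int, K: int) -> int:
    """Equation (1): floor(h / K) * ow(<K>) + ow(<h mod K>) for a regular dot plane."""
    return (height // K) * K.bit_length() + (height % K).bit_length()

def upper_bound(h0: int, K: int, H: int) -> int:
    """Number of times equation (1) is applied until the height is at most H."""
    if K < 3 or H < 2:
        raise ValueError("this sketch assumes K >= 3 and H >= 2")
    steps, height = 0, h0
    while height > H:
        height = next_height(height, K)
        steps += 1
    return steps

# Worked example from the text: K = 3, h(d0') = 4, H = 2 gives heights 4 -> 3 -> 2.
print(next_height(4, 3), next_height(3, 3), upper_bound(4, K=3, H=2))
```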
3.2 Variables
This subsection introduces the variables used in the ILP formulation.
$x_{s,i,j}$: the count of the match $match(p_j, i)$ occurring on $d_s$
Figure 3.3 (a) The compressor tree before area minimization. (b) The compressor tree after Phase I of the post-processing procedure.
Figure 3.3(a) shows that the dot plane $d_0 = \langle 3,4,2 \rangle$ has a cover $c_0 = cover(\{m_1, m_2, m_3\})$ which is constructed from $m_1 = match(p_4 = \langle 1,2 \rangle_p, 0)$, $m_2 = match(p_3 = \langle 3 \rangle_p, 1)$, and $m_3 = match(p_3, 2)$; therefore, $x_{0,0,4} = x_{0,1,3} = x_{0,2,3} = 1$ and all other variables $x_{0,i,j}$ on the 0th stratum are equal to zero. Furthermore, the dot plane $d_1 = \langle 1,3,2,1 \rangle$ has a cover $c_1 = cover(\{m_3, m_4, m_5, m_6\})$ which is constructed from $m_4 = match(p_1 = \langle 1 \rangle_p, 0)$, $m_5 = match(p_2 = \langle 2 \rangle_p, 1)$, and $m_6 = match(p_1, 3)$; therefore, $x_{1,0,1} = x_{1,1,2} = x_{1,2,3} = x_{1,3,1} = 1$, and all other variables $x_{1,i,j}$ on the 1st stratum are equal to zero.
3.3 Covering and Succeeding Constraints
The constraint (2) is called the covering constraint and is used to enforce a feasible cover of the dot plane $d_s$. We use the covering constraint to ensure that every dot is matched exactly once. The inner summation of the constraint (2) sums the number of dots in the $i$-th column of $d_s$ that are matched to the prime pattern $p_j$ by the match $match(p_j, i-k)$ for $0 \le k < \min\{iw(p_j), i+1\}$. Further, the outer summation of the constraint (2) sums the results of the inner summation over all prime patterns. Figure 3.3(a) shows that the covering constraint enforces a feasible cover $cover(\{m_1, m_2, m_3\})$ of the dot plane $d_0 = \langle 3,4,2 \rangle$ such that $2x_{0,0,4} = 2 = h_{0,0}$, $3x_{0,1,3} + x_{0,0,4} = 4 = h_{0,1}$, and $3x_{0,2,3} = 3 = h_{0,2}$.
$$\sum_{j=1}^{|UPPS(K)|} \ \sum_{k=0}^{\min\{iw(p_j),\, i+1\}-1} v(p_j,k) \cdot x_{s,i-k,j} = h_{s,i} \qquad (2)$$
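The covering constraint can be evaluated numerically for the cover of Figure 3.3(a). The sketch below is an illustration under stated assumptions: patterns are stored least significant column first and indexed by the subscripts used in the text ($p_1$ to $p_4$), the stratum index $s = 0$ is dropped from the keys of `x0`, and `covering_lhs` is a hypothetical helper that computes the left-hand side described above for one column.

```python
from typing import Dict, Tuple

# Pattern index -> per-column input counts, least significant column first.
# p4 = <1,2>_p has 2 inputs in column 0 and 1 input in column 1.
patterns: Dict[int, Tuple[int, ...]] = {1: (1,), 2: (2,), 3: (3,), 4: (2, 1)}

# x0[(i, j)] is x_{0,i,j}: how often pattern p_j is anchored at column i on d_0.
x0: Dict[Tuple[int, int], int] = {(0, 4): 1, (1, 3): 1, (2, 3): 1}

def covering_lhs(x: Dict[Tuple[int, int], int], i: int) -> int:
    """Dots of column i that are consumed by the matches recorded in x."""
    total = 0
    for j, p in patterns.items():
        # A pattern anchored at column i - k consumes v(p_j, k) dots of column i.
        for k in range(min(len(p), i + 1)):
            total += p[k] * x.get((i - k, j), 0)
    return total

d0 = [2, 4, 3]  # <3,4,2> written least significant column first
print([covering_lhs(x0, i) for i in range(len(d0))])
```

A cover is feasible for $d_0$ exactly when the computed list equals the column counts of $d_0$.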
Since the dot plane $d_{s+1}$ depends on how the preceding dot plane $d_s$ is covered, we need the constraint (3) to construct the dot plane $d_{s+1}$ such that $h_{s+1,i} = r(d_{s+1}, i)$. Further, the outer summation of the constraint (3) sums the results of the inner summation over all prime patterns to obtain $h_{s+1,i}$. Figure 3.3(a) shows $h_{1,0} = x_{0,0,4} = 1$, ...
The union of the constraints (4) and (5) is used to compute the correct $c_{s,i}$. This thesis calls the union of the two constraints the column constraint. If the $i$-th element of the dot plane $d_s$ is more than $H$, the column constraint enforces $c_{s,i}$ to be 1. On the other hand, $c_{s,i}$ is enforced to be 0 when $h_{s,i}$ is equal to or less than $H$. In Figure 3.3(a), due to $h_{1,1} = 2$ and $h_{1,2} = 3$, $c_{1,1}$ is set to 0 and $c_{1,2}$ is set to 1 via the column constraint. Besides, the term $Inf$ used in the constraints (5) and (7) can be set to $\sum_{i=0}^{w(d_0)-1} r(d_0, i)$. ... constraints is called the stratum constraint. Figure 3.3(a) shows an illustrative example. The variable $q_1$ is set to 1 via the stratum constraint due to $h(d_1) = 3 > H$.
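The values that the column and stratum constraints are meant to enforce can be computed directly for a given dot plane. The sketch below does not reproduce the ILP constraints themselves; it only evaluates the indicators, assuming that $q_s$ is 1 exactly when some column of $d_s$ exceeds $H$, which is consistent with the example in the text. The function names are hypothetical.

```python
from typing import List

def column_indicators(d: List[int], H: int) -> List[int]:
    """c_{s,i} = 1 when column i of the dot plane holds more than H dots."""
    return [1 if dots > H else 0 for dots in d]

def stratum_indicator(d: List[int], H: int) -> int:
    """q_s = 1 when any column exceeds H, i.e. when h(d_s) > H (assumed reading)."""
    return 1 if any(column_indicators(d, H)) else 0

# d_1 = <1,3,2,1> written least significant column first, with H = 2.
d1 = [1, 2, 3, 1]
print(column_indicators(d1, 2), stratum_indicator(d1, 2))
```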
3.5 Objective Function
As stated above, the summation of $q_s$ is the depth of the compressor tree, and it is given by the equation (8). Minimizing the equation (8) therefore minimizes the depth of the compressor tree.
Minimize: $\sum_{s=0}^{UB} q_s$ (8)
For example, the depth of the compressor tree is 2 when the dot plane $d_0 = \langle 3,4,2 \rangle$ has the specific cover $c_0 = cover(\{(p_4, 0), (p_3, 1), (p_3, 2)\})$, and $d_1 = \langle 1,3,2,1 \rangle$ has the cover $c_1 = cover(\{(p_1, 0), (p_2, 1), (p_3, 2), (p_1, 3)\})$ as shown in Figure 3.3(a).
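The depth of the example can be reproduced by applying the two covers in turn and summing the stratum indicators. This is a sketch under assumptions: `map_plane` is a hypothetical helper in which every match emits one output dot per output column, patterns are stored least significant column first, and $H = 2$ as in the running example.

```python
from typing import List, Tuple

Pattern = Tuple[int, ...]  # least significant column first
P1, P2, P3, P4 = (1,), (2,), (3,), (2, 1)

def ow(p: Pattern) -> int:
    """Output width: bits needed for the largest sum the pattern can produce."""
    return sum(v << k for k, v in enumerate(p)).bit_length()

def map_plane(cover: List[Tuple[Pattern, int]]) -> List[int]:
    """Next dot plane: one dot per output column of every match in the cover."""
    nxt = [0] * max(col + ow(p) for p, col in cover)
    for p, col in cover:
        for k in range(ow(p)):
            nxt[col + k] += 1
    return nxt

H = 2
d0 = [2, 4, 3]                               # <3,4,2>
c0 = [(P4, 0), (P3, 1), (P3, 2)]
c1 = [(P1, 0), (P2, 1), (P3, 2), (P1, 3)]

planes = [d0, map_plane(c0), map_plane(c1)]  # d_0, d_1, d_2
q = [1 if max(d) > H else 0 for d in planes]
print(planes, q, sum(q))
```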
3.6 Complexity Analysis
Here, this thesis analyzes the number of variables and constraints in our ILP formulation. First, the number of variables is proportional to the number of patterns (i.e., $|UPPS(K)|$), the number of columns in every dot plane, and the upper bound of the minimum depth UB. This thesis denotes the number of patterns as $|P|$. Besides, the minimum depth upper bound UB is proportional to $\log(h(d_0))$. Therefore, the number of variables is $O(\log(h(d_0)) \cdot w(d_0) \cdot |P|)$. Second, this thesis analyzes the number of constraints. As with the number of variables, the number of constraints is proportional to the number of columns in every dot plane and the minimum depth upper bound UB. Therefore, the number of constraints is $O(\log(h(d_0)) \cdot w(d_0))$.
3.7 Post-processing for Area Minimization
This thesis describes a post-processing procedure to reduce the area overhead without losing delay optimality. The procedure consists of two phases. Before detailing these two phases, we first define redundant matches.
A match $m$ on the dot plane $d_s$ is redundant iff $h(d_{s+1})$ does not increase when all dots matched by $m$ are matched by $p_1$ instead. In Phase I, we delete all redundant matches on the dot plane $d_{z-1}$ on the penultimate stratum, where $z$ is the minimum depth of the delay-optimal compressor tree. Figure 3.3(a) shows that a redundant match $match(p_2, 1)$ exists on the dot plane $d_1 = \langle 1,3,2,1 \rangle$, based on the specific cover $cover(\{(p_1, 0), (p_2, 1), (p_3, 2), (p_1, 3)\})$. Figure 3.3(b) shows that the depth of the compressor tree does not increase after deleting the redundant match $match(p_2, 1)$ on $d_1$. Based on this observation, this thesis presents Phase I of the post-processing procedure for area minimization. First, we check for redundant matches on the dot plane $d_{z-1}$. If there is a redundant match $m$, it is deleted from the compressor tree; e.g., the two dots of the match $match(p_2, 1)$ on the dot plane $d_1$ can be matched by $p_1$ instead so that they pass through from the dot plane $d_1$ to the dot plane $d_2$, as shown in Figure 3.3(b). Otherwise, Phase I is finished. As shown in Figure 3.4, the process stated above is one iteration of Phase I; therefore, the post-processing procedure repeats the process until there is no redundant match on the penultimate stratum.
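The redundancy test used in Phase I can be sketched as follows. This is an illustration rather than the thesis's implementation: `is_redundant` is a hypothetical helper that replaces one match by pass-through $p_1$ matches and compares the height of the next dot plane before and after, and matches that are already $p_1$ are treated as not redundant because deleting them changes nothing.

```python
from typing import List, Tuple

Pattern = Tuple[int, ...]  # least significant column first
Match = Tuple[Pattern, int]
P1, P2, P3 = (1,), (2,), (3,)

def ow(p: Pattern) -> int:
    """Output width: bits needed for the largest sum the pattern can produce."""
    return sum(v << k for k, v in enumerate(p)).bit_length()

def map_plane(cover: List[Match]) -> List[int]:
    """Next dot plane: one dot per output column of every match in the cover."""
    nxt = [0] * max(col + ow(p) for p, col in cover)
    for p, col in cover:
        for k in range(ow(p)):
            nxt[col + k] += 1
    return nxt

def is_redundant(cover: List[Match], match: Match) -> bool:
    """True if passing the match's dots through p1 does not raise the next height."""
    p, col = match
    if p == P1:
        return False  # already a pass-through
    rest = list(cover)
    rest.remove(match)
    passthrough = [(P1, col + k) for k, cnt in enumerate(p) for _ in range(cnt)]
    return max(map_plane(rest + passthrough)) <= max(map_plane(cover))

c1 = [(P1, 0), (P2, 1), (P3, 2), (P1, 3)]
print(is_redundant(c1, (P2, 1)), is_redundant(c1, (P3, 2)))
```

In the example, replacing $match(p_2, 1)$ keeps the next height at 2, whereas replacing $match(p_3, 2)$ would raise it to 4.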
Figure 3.4 Phase I of the post-processing procedure.
In practice, basic logic cells on modern FPGAs are flexible. In other words, modern FPGAs employ two single-output LUTs with shared inputs as shown in Figure 3.5. In general, this kind of circuit is called a two-output LUT. We observe that a two-output LUT can map two single-output Boolean functions simultaneously if the two functions satisfy two conditions: (i) the number of distinct variables of the two functions is fewer than or equal to the physical-input constraint denoted as PIC ($PIC = 8$ on Altera Stratix IV, and $PIC = 5$ on Xilinx Virtex-5; e.g., PIC is equal to 6 in the example shown in Figure 3.5), and (ii) the sum of the LUT sizes of the two functions is fewer than or equal to the physical-capacity constraint denoted as PCC ($PCC = 64$ on Altera Stratix IV, and $PCC = 64$ on Xilinx Virtex-5 [1, 2]). In fact, PCC is equal to $2^K$, where $K$ is the input constraint of a LUT. Besides, a two-output LUT can map a single-output function if the number of variables of the function is $K$ ($K = 6$) as shown in Figure 3.5.
Moreover, we can merge two distinct LUTs among all strata if the two functions mapped by these two cells satisfy PIC and PCC. Suppose we want to map the two prime patterns $p_4 = \langle 1,2 \rangle_p$ as shown in Figure 3.6(a) (i.e., $PIC = 6$, $PCC = 64$); we can then map the two patterns onto four two-output LUTs as shown in Figure 3.6(a). The sum of the numbers of inputs of LUT 2 and LUT 4 is equal to 6, which does not exceed PIC, and the sum of the LUT sizes is equal to $2^3 + 2^3 = 16$, which is fewer than PCC. Hence, we can merge LUT 2 and LUT 4 into a single LUT as shown in Figure 3.6(b). In Phase II, we merge distinct LUTs that map different patterns if the two functions mapped by them satisfy PIC and PCC.
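The merge test of Phase II can be sketched as a small predicate. This is an illustration under assumptions: each function is described only by the set of its input variable names (the names below are hypothetical), the number of distinct variables is taken as the size of the union of the two sets, and the size of a LUT with $n$ inputs is taken as $2^n$.

```python
from typing import Set

def can_merge(vars_a: Set[str], vars_b: Set[str], PIC: int, PCC: int) -> bool:
    """True if two single-output functions fit into one two-output LUT."""
    distinct_inputs = len(vars_a | vars_b)             # condition (i)
    total_size = 2 ** len(vars_a) + 2 ** len(vars_b)   # condition (ii)
    return distinct_inputs <= PIC and total_size <= PCC

# LUT 2 and LUT 4 of Figure 3.6(a): two 3-input functions without shared inputs.
lut2 = {"a0", "a1", "a2"}
lut4 = {"b0", "b1", "b2"}
print(can_merge(lut2, lut4, PIC=6, PCC=64))
```

With disjoint inputs the check reduces to the sums used in the text (6 inputs and a total size of 16).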
Figure 3.5 A two-output LUT with shared inputs.
Figure 3.6 (a) The mapping before Phase II. (b) The mapping after Phase II.
Chapter 4
Experimental Results
4.1 Experimental Information
We implement DOCT and the GPC heuristic [14] in C/C++ on a workstation with a 2-GHz Intel Xeon processor and 16 GB of main memory running the CentOS 5.2 operating system. An open-source package, lp_solve 5.5.13, is used to solve the linear formulations. A set of benchmark circuits is evaluated, including three Radix-4 unsigned Booth-encoded multipliers (8 by 8 and 16 by 16), multiplier accumulators (MAC), discrete cosine transformation (DCT) [20], finite impulse response filters (FIR), and motion estimations (ME). The input of each compressor tree is extracted from the simulation result produced by the MATLAB Simulink toolbox. All compressor trees in our experiments are synthesized directly, without pipelining.
Table I gives detailed information on the benchmark circuits. In Table I, Column 1 shows the variety of our benchmark circuits. Column 2 and Column 3 show the width and the height of the input dot plane, respectively. Column 4 shows the number of dots in the input dot plane.
4.2 Parameters Setup
We implement two compressor tree synthesis algorithms, DOCT and the GPC heuristic. The parameter settings in our experiments are as follows.
DOCT: Compressor tree synthesis using DOCT as described in the preceding sections under $K = 6$ and $H = 3$. This thesis supposes that DOCT is evaluated on Altera Stratix IV. Thus, the physical input constraint (PIC) is set to 8 and the physical capacity constraint (PCC) is set to 64. The compressor tree produces three outputs summed by a ternary adder.
GPC: Compressor tree synthesis using the generalized parallel counter (GPC) heuristic. In the GPC heuristic, there are three parameters: (i) $M$ is the input constraint of GPC patterns (the input constraint of LUTs in the targeted FPGA, e.g., 6 for Altera Stratix IV and Xilinx Virtex-5), (ii) $N$ is the output constraint of GPC patterns, and (iii) $k$ is the number of inputs of the final CPA (i.e., $k$ is equal to $H$). In our experiments, $M$ is set to 6, $N$ is set to 4, and $k$ is set to 3. The setting is the same as in [14].
4.3 Experimental Results
In our experiments, we compare both the depth and the area produced by DOCT with those produced by the GPC heuristic. Tables II, III, and IV show the experimental results under different input constraints of LUTs: $K = 5$, $K = 6$, and $K = 7$, respectively. In Tables II, III, and IV, Column 2 gives the upper bounds of all benchmark circuits. Column 3 and Column 4 give the depth of the compressor trees produced by DOCT and the GPC heuristic. Meanwhile, Column 5 and Column 6 give the area in terms of LUTs on Altera Stratix IV produced by DOCT and the GPC heuristic. Compared to the GPC heuristic, DOCT has 27% less depth with 17% fewer LUTs under $K = 5$; 32% less depth with 21% fewer LUTs under $K = 6$; and 20% less depth with 2% fewer LUTs under $K = 7$. For all benchmark circuits, the GPC heuristic finished in a few seconds, whereas DOCT finished in 500 seconds.
It is evident that DOCT always has a better or equal result in depth compared to the GPC heuristic. The reason is that DOCT considers all combinations of all prime patterns for constructing a compressor tree. Although DOCT does not outperform the GPC heuristic in area for every case, it provides smaller area on
Chapter 5
Conclusions and Future Work
A delay-optimal compressor tree synthesis algorithm, DOCT, has been presented in this thesis. Since the infinite set of patterns can be superseded by the finite set of prime patterns without loss of delay optimality, DOCT adopts an ILP-based methodology to map prime patterns onto the compressor tree with the minimum depth and utilizes a post-processing procedure to minimize area overhead.
Therefore, DOCT can achieve compressor trees with minimum depth using all prime patterns under the input constraint of a LUT. On average, compressor trees produced by DOCT have 32% less depth and 21% fewer LUTs than those produced by the GPC heuristic on modern technologies.
Although DOCT makes progress in reducing area overhead compared to the GPC heuristic, we believe that there is still room for improvement. Initially, we put the area cost in the cost function of the ILP formulation.
Unfortunately, the run time of DOCT then became unacceptably long. According to the results on some smaller cases, however, DOCT with the area cost in the cost function could achieve around 50% fewer LUTs than DOCT without it. We therefore believe that optimal area reduction is worth investigating in future work.
References
[1] Altera Corporation, Stratix IV device handbook. [Online]. Available: http://www.altera.com/
[2] Xilinx Corporation, Virtex-5 FPGA user guide. [Online]. Available: http://www.xilinx.com/
[3] Altera Corporation, Stratix III device handbook. [Online]. Available: http://www.altera.com/
[4] Xilinx Corporation, Virtex-4 FPGA user guide. [Online]. Available: http://www.xilinx.com/
[5] M. C. Herbordt, T. VanCourt, Y. Gu, B. Sukhwani, A. Conti, J. Model, and D. DiSabello, “Achieving high performance with FPGA-based computing,” Computer, vol. 40, no. 3, pp. 50–57, March 2007.
[6] Altera Corporation, FPGAs provide reconfigurable DSP solutions. [Online]. Available: http://www.altera.com/literature/wp/wp_dsp_fpga.pdf
[7] S. Mirzaei, A. Hosangadi, and R. Kastner, “FPGA implementation of high speed FIR filters using add and shift method,” International Conference on Computer Design, October 2006, pp. 308–313.
[8] O. Kwon, K. Nowka, and E. E. Swartzlander, Jr., “A 16-bit by 16-bit MAC design using fast 5:3 compressor cells,” Journal of VLSI Signal Processing Systems, vol. 31, no. 2, pp. 77–89, June 2002.
[9] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang, and L.-G. Chen, “Analysis and architecture design of variable block-size motion estimation for H.264/AVC,” IEEE Transactions on Circuits and Systems, vol. 53, no. 2, pp. 578–583, February 2006.
[10] L. Dadda, “Some schemes for parallel multipliers,” Alta Frequenza, vol. 34, pp.
[11] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Transactions on Electronic Computers, vol. 13, no. 1, pp. 14–17, February 1964.
[12] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach,” IEEE Transactions on Computers, vol. 45, no. 3, pp. 294–306, March 1996.
[13] P. F. Stelling, C. U. Martel, V. G. Oklobdzija, and R. Ravi, “Optimal circuits for parallel multipliers,” IEEE Transactions on Computers, vol. 47, no. 3, pp. 273–285, March 1998.
[14] H. Parandeh-Afshar, P. Brisk, and P. Ienne, “Efficient synthesis of compressor trees on FPGAs,” Asia South Pacific Design Automation Conference, March 2008, pp. 138–143.
[15] H. Parandeh-Afshar, P. Brisk, and P. Ienne, “Improving synthesis of compressor trees on FPGAs via integer linear programming,” Design Automation and Test in Europe, March 2008, pp. 1256–1261.
[16] P. S. Zuchowski, C. B. Reynolds, R. J. Grupp, S. G. Davis, B. Cremen, and B. Troxel, “A hybrid ASIC and FPGA architecture,” International Conference of Computer-Aided Design, November 2002, pp. 187–194.
[17] A. Cevrero, P. Athanasopoulos, H. Parandeh-Afshar, A. K. Verma, P. Brisk, F. K. Gurkaynak, Y. Leblebici, and P. Ienne, “Architectural improvements for field programmable counter arrays: enabling efficient synthesis of fast compressor trees on FPGAs,” International Symposium on Field Programmable Gate Arrays, February 2008, pp. 181–190.
[18] P. Brisk, A. K. Verma, P. Ienne, and H. Parandeh-Afshar, “Enhancing FPGA performance for arithmetic circuits,” Design Automation Conference, June 2007, pp. 334–337.
[19] J. Cong and Y. Ding, “FlowMap: an optimal technology mapping algorithm for delay optimization in Lookup-Table based FPGA designs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 1, pp. 1–12, January 1994.
[20] W. Pan, A. Shams, and M. A. Bayoumi, “NEDA: a new distributed arithmetic architecture and its application to one dimensional discrete cosine transform,” IEEE Signal Processing Systems, October 1999, pp. 159–168.