Figure 4.23: The diagram of the folded bit-plane FIR architecture.

bit-plane FIR architecture enables a changeable folding factor to be implemented on a fixed-size systolic array. The systolic array reads the input word x[n] bit-serially and generates the partial products with the coefficients bit by bit. The partial products are then accumulated to calculate the result y[n]. In the folded bit-plane architecture, the folding factor f is equal to the programmable bit-width of the coefficients, so the throughput of the folded architecture can be increased by reducing the bit-width of the coefficients. Fig. 4.23 shows the diagram of the folded bit-plane FIR presented in [84].

The architecture of the serial-in folded FIR filter

The transfer function of the K-tap FIR filter can be reformulated as Eq. 4.18 by recursively calculating f-tap FIR filtering:

\[
\sum_{k=0}^{K-1} h_k z^{-k} = \sum_{i=0}^{r-1} z^{-fi} \sum_{j=0}^{f-1} h_{fi+j} z^{-j} \tag{4.18}
\]

where f = K/r. From Eq. 4.18, the K-tap FIR filter is composed of r short-length FIRs, $\sum_{j=0}^{f-1} h_{fi+j} z^{-j}$, and can be realized by the SFG shown in Fig. 4.24(a).

Figure 4.24: (a) The SFG of the reformulated K-tap FIR filter. (b) The serial-in folded architecture.

Given the folding factor f, the SFG can then be mapped onto r PEs by the (1,0) projection. The folded architecture in Fig. 4.24(b) is called the serial-in folded FIR.

In this architecture, each PE serially executes its short-length FIR on the input samples, and the coefficients h_{fi+j} circulate in the cyclic shift registers of the PEs. As shown in Fig. 4.24(b), the clock rate is f times the sampling rate, and the signal “RST” is asserted every f clock cycles to clear the accumulator before a new sample arrives. Finally, the f-stage shift registers fD buffer the results of the short-length FIRs and realize the function of z^{−fi}.
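The decomposition in Eq. 4.18 can be checked numerically. The following Python sketch is a behavioral model only, not a cycle-accurate description of Fig. 4.24(b): it assumes K = f·r exactly and zero initial conditions, and the function names `fir_direct` and `fir_serial_in_folded` are hypothetical names chosen for illustration.

```python
def fir_direct(h, x):
    """Reference K-tap FIR: y[n] = sum_k h[k] * x[n-k], with x[n] = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_serial_in_folded(h, x, f):
    """Evaluate the same filter as r short f-tap FIRs, following Eq. 4.18.

    Assumption of this sketch: K is an exact multiple of f, so r = K / f.
    """
    K = len(h)
    assert K % f == 0, "this sketch assumes K = f * r exactly"
    r = K // f
    y = []
    for n in range(len(x)):
        total = 0
        for i in range(r):            # one PE per short-length FIR
            acc = 0                   # accumulator starts cleared, as after RST
            for j in range(f):        # f clock cycles per input sample
                idx = n - f * i - j   # combined delay z^{-fi} * z^{-j}
                if idx >= 0:
                    acc += h[f * i + j] * x[idx]
            total += acc              # sum of the r short-length FIR outputs
        y.append(total)
    return y
```

For any folding factor that divides K, the folded evaluation should reproduce the direct-form output sample by sample, which is what Eq. 4.18 states.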

The architecture of the parallel-in folded FIR filter

For an input sequence x(n) and filter coefficients h(n), the output sequence y(n) is given by Eq. 4.1.

According to Eq. 4.1, the data path of FIR filtering can be represented by the SFG shown in Fig. 4.6, where D is the delay element. Because only r MAs are available for the folding factor f, the FIR filter can execute at most r MA nodes in parallel within a single cycle and requires ⌈K/r⌉ (that is, f) cycles to finish an iteration. Hence, given r MAs, the iteration period is bounded by f cycles, and f is the folding factor. To fold the execution of the K-tap FIR by f, we restructured the K-tap FIR into the r-split SFG shown in Fig. 4.25. First, the delay elements of the original SFG are scaled by the folding factor f, so that each iteration of the K-tap FIR takes f cycles. Then, we performed retiming transforms on the edges in the cut-sets shown in Fig. 4.25. The cut-sets segment the retimed graph into f subgraphs (the r-split graph), where f = ⌈K/r⌉. Each subgraph has at most r MA operations. These f subgraphs are then executed in turn on the same r MAs. Here r is the number of MA hardware resources and f is the number of equally structured operations assigned to the same hardware. To execute the subgraphs in order, the FIR coefficients are scheduled as illustrated in Fig. 4.26. Because the input samples are read into the MA array in parallel, this folding technique is called the parallel-in folded FIR.

Fig. 4.27 illustrates the parallel-in folded FIR architecture, where w is the bit-width of the data bus. Given r MAs, the folding factor is f = ⌈K/r⌉ and the K-tap FIR requires f cycles per iteration. The coefficient register bank holds K b-bit registers organized into r groups. Each group contains f filter coefficients and provides them one by one to the corresponding MA unit in the scheduled order shown in Fig. 4.26. The register file buffers the outputs of the MAs so that the subgraphs of the r-split FIR can be executed recursively. The counter schedules the inputs and outputs of the MA array so that the FIR works correctly. Fig. 4.9 and Fig. 4.28 demonstrate the scheduling for an example 5-tap FIR with two MAs. Given r = 2, the folding factor f is equal to three, and we obtain the 2-split SFG of Fig. 4.9. To map the SFG onto the parallel-in architecture, we replaced the delay elements with registers and generated the schedule shown in Fig. 4.28. Note that a delay element need not be realized as a register if the data storage is properly scheduled. Following the schedule, the MA units perform the function of Eq. 4.7 in each cycle and produce the output y(n) in R₁ every three cycles.

Figure 4.25: The r-split FIR filtering.

Figure 4.26: The scheduling of FIR coefficients for the parallel-in folded technique.
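As a rough illustration of the folding arithmetic, the Python sketch below computes f = ⌈K/r⌉ and evaluates a K-tap FIR in f passes of at most r multiply-accumulate operations each. It is a behavioral model under stated assumptions: the assignment of taps to cycles (consecutive blocks of r taps) is hypothetical and is not claimed to match the schedule of Fig. 4.26 or Fig. 4.28, and the variable `partial` merely stands in for the register file.

```python
import math


def fir_direct(h, x):
    """Reference K-tap FIR with zero initial conditions."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_parallel_in_folded(h, x, r):
    """Evaluate a K-tap FIR on r MA units over f = ceil(K / r) cycles.

    The tap-to-cycle mapping below is an assumption made for illustration.
    """
    K = len(h)
    f = math.ceil(K / r)                        # folding factor
    y = []
    for n in range(len(x)):
        partial = 0                             # stand-in for the register file
        for cycle in range(f):                  # f cycles per output sample
            first = cycle * r
            taps = range(first, min(first + r, K))   # at most r MA operations
            partial += sum(h[k] * x[n - k] for k in taps if n - k >= 0)
        y.append(partial)
    return y, f
```

With a 5-tap filter and r = 2, the function reports f = 3, in agreement with the example in the text, and the folded result equals the direct-form output.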

Figure 4.27: (a) The architecture of the parallel-in folded FIR filter, and (b) the timing diagram.

4.2.4 Comparison Results

Given the folding factor f for a K-tap FIR, all the folding techniques require the same number of MAs and have the same throughput rate. Hence, the efficiency of a folding technique is determined by the size and power consumption of its memory. At the high-level synthesis stage, we take the number of D-type flip-flops (DFFs) as the size of the memory and the number of DFF accesses per iteration as the power dissipation of the memory. For the comparison, we set the bit-width of the data bus w to m + b + ⌈log₂K⌉ for full-precision FIR calculation.

Eq. 4.19 formulates the number of DFFs required by [87]. In [87], there are r MAs located on the accumulation loop. In Eq. 4.19, the term α₀ expresses the number of DFFs for coefficient storage. Because each MA has (f + 1) full-precision registers to hold the accumulating results, the term α₁ is the total number of DFFs for the r MAs. The term α₂ represents the number of DFFs for multiplexing the input samples to the MAs. Because the coefficients are read cyclically by the MAs, the number of registers for multiplexing coefficients in each PE varies with the MA, increasing from one MA to the next. Hence the total number of registers for multiplexing coefficients can be calculated by q, and the term α₃ represents the number of DFFs for multiplexing coefficients. Finally, the controller can be implemented as a counter, and the term α₄ counts the number of DFFs required in the controller. Eq. 4.20 shows the number of DFFs required by [92]. The unfolding factor F results in an F-parallel filter topology. The memory requirement is approximately proportional to the unfolding factor F.

Figure 4.28: The scheduling of FIR filtering.

\[
\#\mathrm{DFF}_{[87]} = \underbrace{Kb}_{\alpha_0} + \cdots \tag{4.19}
\]

Eq. 4.21 formulates the number of DFFs required by the folded bit-plane FIR architecture [84]. The folded bit-plane FIR architecture requires K × (m + b + ⌈log₂K⌉) PEs. Each PE performs a 1-bit addition and needs two DFFs for the carry-out and sum signals. In addition, the length of the input shift register is (m + b + ⌈log₂K⌉). Thus, the total number of DFFs for the PEs can be expressed by the term α₅. The term α₆ is the total number of DFFs for the coefficient registers. Finally, the term α₇ represents the number of DFFs in the controller.

\[
\#\mathrm{DFF}_{[84]} = \underbrace{(m + b + \lceil \log_2 K \rceil)(2K + 1)}_{\alpha_5} + \cdots \tag{4.21}
\]

The formulations for our proposed folded FIR techniques are given in Eq. 4.22 and Eq. 4.23. The term α₈ represents the number of DFFs in the coefficient registers. The term α₉ counts the DFFs in the accumulators and latches of the serial-in folded FIR. The term α₁₀ is the number of DFFs for the out-loop delays, fD, shown in Fig. 4.24. Because a counter generates the signal “RST”, the term α₁₁ expresses the number of DFFs in that counter. For the parallel-in folded FIR, the term α₁₂ gives the number of DFFs in the register file, α₁₃ represents the number of DFFs in the coefficient registers, and α₁₄ expresses the number of DFFs for the MOD-f counter.

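\[
\#\mathrm{DFF}_{\text{serial-in}} = \underbrace{bfr}_{\alpha_8} + \underbrace{2r\,(m + b + \lceil \log_2 f \rceil)}_{\alpha_9} + \underbrace{f\,(r - 1)(m + b + \lceil \log_2 K \rceil)}_{\alpha_{10}} + \underbrace{\lceil \log_2 f \rceil}_{\alpha_{11}} \tag{4.22}
\]

\[
\#\mathrm{DFF}_{\text{parallel-in}} = \underbrace{K\,(m + b + \lceil \log_2 K \rceil)}_{\alpha_{12}} + \underbrace{Kb}_{\alpha_{13}} + \underbrace{\lceil \log_2 f \rceil}_{\alpha_{14}} \tag{4.23}
\]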
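Eq. 4.22 and Eq. 4.23 are simple enough to evaluate directly. The Python sketch below does so term by term; it reads m as the input word length and b as the coefficient word length, which is an interpretation of how the surrounding text uses these symbols, and the function names are hypothetical. The sample parameters (K = 33, m = 8, b = 16, f = 11, hence r = 3) are taken from Table 4.1 and the fabricated design, but the resulting counts are outputs of the formulas only, not figures reported in this chapter.

```python
def clog2(v):
    """Ceiling of log2(v) for a positive integer v, using integer arithmetic."""
    return (v - 1).bit_length()


def dff_serial_in(K, m, b, f, r):
    """Number of DFFs of the serial-in folded FIR, Eq. 4.22."""
    a8 = b * f * r                              # coefficient registers
    a9 = 2 * r * (m + b + clog2(f))             # accumulators and latches
    a10 = f * (r - 1) * (m + b + clog2(K))      # out-loop delays fD
    a11 = clog2(f)                              # counter generating RST
    return a8 + a9 + a10 + a11


def dff_parallel_in(K, m, b, f):
    """Number of DFFs of the parallel-in folded FIR, Eq. 4.23."""
    a12 = K * (m + b + clog2(K))                # register file
    a13 = K * b                                 # coefficient registers
    a14 = clog2(f)                              # MOD-f counter
    return a12 + a13 + a14


serial_count = dff_serial_in(K=33, m=8, b=16, f=11, r=3)
parallel_count = dff_parallel_in(K=33, m=8, b=16, f=11)
```

For these parameters the formulas give 1360 DFFs for the serial-in design and 1522 for the parallel-in design.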

We estimated the power consumption of the memory by counting the number of register accesses in each computation iteration [80]. The following equations list the estimates for the five candidates.

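\[
\#\mathrm{reg\ access}_{[87]} = f\,\bigl(r\,(m + b + \lceil \log_2 K \rceil)(f + 1) + (K - f)\,m\bigr) \tag{4.24}
\]

\[
\#\mathrm{reg\ access}_{[92]} = F\,\Bigl(f\,\bigl(r\,(m + b + \lceil \log_2 K \rceil)(f + 1) + (K - f)\,m\bigr)\Bigr) \tag{4.25}
\]

\[
\#\mathrm{reg\ access}_{[84]} = m\,\bigl((m + b + \lceil \log_2 K \rceil)(2K + 1) + K\bigr) \tag{4.26}
\]

\[
\#\mathrm{reg\ access}_{\text{serial-in}} = f\,\bigl(m + (m + b + \lceil \log_2 f \rceil)(r\,(f + 1) - f)\bigr) \tag{4.27}
\]

\[
\#\mathrm{reg\ access}_{\text{parallel-in}} = 2rf\,(m + b + \lceil \log_2 K \rceil) + bfr \tag{4.28}
\]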
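The five access-count estimates can be compared side by side with a short script. This Python sketch evaluates Eq. 4.24 to Eq. 4.28 as written; the function name `reg_access` and the dictionary layout are hypothetical, and the sample parameters (K = 33, m = 8, b = 16, f = 11, r = 3, F = 2) are chosen to resemble the IS-95 example rather than taken from any reported result.

```python
def clog2(v):
    """Ceiling of log2(v) for a positive integer v."""
    return (v - 1).bit_length()


def reg_access(K, m, b, f, r, F=1):
    """Register accesses per iteration for the five candidates."""
    w = m + b + clog2(K)                               # full-precision bus width
    ref87 = f * (r * w * (f + 1) + (K - f) * m)        # Eq. 4.24
    return {
        "[87]": ref87,
        "[92]": F * ref87,                             # Eq. 4.25
        "[84]": m * (w * (2 * K + 1) + K),             # Eq. 4.26
        "serial-in": f * (m + (m + b + clog2(f)) * (r * (f + 1) - f)),  # Eq. 4.27
        "parallel-in": 2 * r * f * w + b * f * r,      # Eq. 4.28
    }


counts = reg_access(K=33, m=8, b=16, f=11, r=3, F=2)
fewest = min(counts, key=counts.get)
```

For these parameters the parallel-in formula yields the smallest count (2508), followed by the serial-in formula (7788).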

Additionally, we formulated the multiplexer occupancy of the five folded architectures as the number of 1-bit 2-to-1 multiplexers:

\[
\#\mathrm{MUX}_{[84]} = 2Kw + Kb \tag{4.29}
\]

\[
\#\mathrm{MUX}_{[87]} = r\,\bigl(w\,(f - 2) + b\,(f - 1)\bigr) + w \tag{4.30}
\]

\[
\#\mathrm{MUX}_{[92]} = F\,\Bigl(r\,\bigl(w\,(f - 2) + b\,(f - 1)\bigr) + w\Bigr) \tag{4.31}
\]

\[
\#\mathrm{MUX}_{\text{serial-in}} = rw \tag{4.32}
\]

\[
\#\mathrm{MUX}_{\text{parallel-in}} = 2rwf \tag{4.33}
\]
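A similar evaluation applies to the multiplexer counts. The Python sketch below implements Eq. 4.29 to Eq. 4.33 with w set to m + b + ⌈log₂K⌉, as in the comparison setup; the function name `mux_count` and the sample parameters are assumptions made for illustration.

```python
def clog2(v):
    """Ceiling of log2(v) for a positive integer v."""
    return (v - 1).bit_length()


def mux_count(K, m, b, f, r, F=1):
    """Number of 1-bit 2-to-1 multiplexers for the five candidates."""
    w = m + b + clog2(K)                               # data bus bit-width
    ref87 = r * (w * (f - 2) + b * (f - 1)) + w        # Eq. 4.30
    return {
        "[84]": 2 * K * w + K * b,                     # Eq. 4.29
        "[87]": ref87,
        "[92]": F * ref87,                             # Eq. 4.31
        "serial-in": r * w,                            # Eq. 4.32
        "parallel-in": 2 * r * w * f,                  # Eq. 4.33
    }


muxes = mux_count(K=33, m=8, b=16, f=11, r=3, F=2)
```

With K = 33, m = 8, b = 16, f = 11, r = 3, and F = 2, the serial-in formula gives by far the smallest count (90), since it grows only with r and w.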

Figure 4.29: Number of DFFs of folded architectures (in log scale). Horizontal axis: folding factor f; vertical axis: number of flip-flops; K=256, m=8. Curves: parallel-in, serial-in, [7], [10], [15], [84], [87], [92].

To compare the candidates graphically, we plotted the results for K=256 and m=8 (in log scale) in Fig. 4.29, Fig. 4.30, and Fig. 4.31. As shown in Fig. 4.29, the serial-in folded FIR has the lowest memory requirement for large f, while the parallel-in folded FIR has the edge for small f. With regard to power consumption, the access counts of the folded FIR of [87] and of the serial-in folded FIR grow rapidly as the folding factor increases.

According to Fig. 4.30, the parallel-in folded FIR consumes the least power of all the candidates. Taking the IS-95 WCDMA pulse shaping FIR filter, whose specification is tabulated in Table 4.1, as an example, we implemented the five architectures and estimated their area requirements and power consumption with Synopsys Design Analyzer and PrimePower. VLSI design involves a trade-off between operating speed and silicon area. The main aim of this chapter is to save silicon area and power consumption at the same speed. Basically, the critical-path delay of each architecture is the delay through one MA and one multiplexer. We specified the clock constraint of each architecture at the synthesis stage so that the comparisons are meaningful. The resulting reports for each architecture are summarized in Table 4.2. As can be seen, folding saves area, but maximal resource sharing can lead to an increase in power consumption. However, our proposed folded architectures consume less power than the other folded ones. The parallel-in folded FIR filter with the folding factor f = 11 was fabricated in a 0.18 µm CMOS technology, packaged in a 68-pin LCC, and successfully passed functional testing. The features of the implementation and the chip micrograph are given in Table 4.3 and Fig. 4.32, respectively.

Figure 4.30: Access number of DFFs per iteration (in log scale). Horizontal axis: folding factor f; vertical axis: access number of flip-flops per iteration; K=256, m=8. Curves: parallel-in, serial-in, [7], [10], [15], [84], [87], [92].

4.2.5 Summary

Two novel, systematic, hardware-efficient folding techniques for high-order FIR filtering have been presented. The parallel-in folded design methodology was applied to the design of an IS-95 WCDMA pulse shaping FIR filter, which features a sample rate of 168.96 MSPS at a power dissipation of 16.66 mW in a 0.18 µm CMOS technology. Under the same throughput rate, the proposed techniques yield folded FIR filter architectures with a minimal storage requirement and lower power dissipation than those of previous works in the literature.

Figure 4.31: Number of 1-bit 2-to-1 multiplexers of folded architectures (in log scale). Horizontal axis: folding factor f; vertical axis: number of 1-bit 2-to-1 multiplexers; K=256, m=8. Curves: parallel-in, serial-in, [7], [10], [15], [84], [87], [92].

Parameter                    Value
filter length                33 taps
throughput                   15.36 MSPS
passband edge                0.1πω
stopband edge                0.28πω
passband ripple              1.5 dB
stopband ripple              40 dB
input sample word-length     8 bits
coefficient word-length      16 bits

Table 4.1: IS-95 WCDMA pulse shaping FIR filter specification.

FIR architecture      unfolded   [84]      [87]     [92] (F = 2)   serial-in   parallel-in
Area (µm²)            2311048    2231788   865477   1627096        578116      758781
Power (mW)            6.79       49.9      34.8     70.2           25.45       16.66
Critical path (ns)    65.1       4.07      5.92     5.92           5.92        5.92

Table 4.2: Area and power consumption comparisons (an IS-95 WCDMA pulse shaping 33-tap FIR).
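The relative savings implied by Table 4.2 can be read off with a few lines of Python. The sketch below only re-expresses the tabulated area and power figures as ratios and rankings; the dictionary name `table_4_2` and the helper `area_ratio` are hypothetical, and no values beyond those in the table are introduced.

```python
# Values copied from Table 4.2 (area in square micrometres, power in mW).
table_4_2 = {
    "unfolded":     {"area": 2311048, "power": 6.79},
    "[84]":         {"area": 2231788, "power": 49.9},
    "[87]":         {"area": 865477,  "power": 34.8},
    "[92] (F = 2)": {"area": 1627096, "power": 70.2},
    "serial-in":    {"area": 578116,  "power": 25.45},
    "parallel-in":  {"area": 758781,  "power": 16.66},
}


def area_ratio(name):
    """Area of one architecture relative to the unfolded design."""
    return table_4_2[name]["area"] / table_4_2["unfolded"]["area"]


# Rank only the folded architectures, leaving out the unfolded reference.
folded = [name for name in table_4_2 if name != "unfolded"]
smallest_area = min(folded, key=lambda name: table_4_2[name]["area"])
lowest_power = min(folded, key=lambda name: table_4_2[name]["power"])
```

On these figures the serial-in design occupies about a quarter of the unfolded area and the parallel-in design about a third, while the parallel-in design has the lowest power among the folded architectures.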

Feature                       Value
throughput                    168.96 MSPS
power dissipation             16.66 mW
chip size                     1.411 × 1.411 mm²
supply voltage (core/ring)    1.8 V / 3.3 V

Table 4.3: Features of the IS-95 WCDMA pulse shaping FIR filter chip.

Figure 4.32: Photomicrograph of IS-95 WCDMA pulse shaping FIR filter chip.

Chapter 5
