
R28MDF-based 64-Point FFT/IFFT Processor for 2×2 MIMO-OFDM


4.2 The Proposed MIMO-FFT Architecture

4.2.1 R28MDF-based 64-Point FFT/IFFT Processor for 2×2 MIMO-OFDM

Fig. 14: Block diagram of the proposed R28MDF-based 64-point FFT/IFFT architecture for 2×2 MIMO-OFDM system. [Labels recoverable from the figure: 2 Input Units (128 Words) with block 0 to block 3; shift-register contents X00 to X77; Shift Data In, Shift Data Out, Clock, Simple Gated Control; Retrenched 8-Points FFT Unit; Multiplier Unit; Delay Feedback Memory (64 Words); intermediate data Y00 to Y77; Output Control Unit; outputs Z[0:7], Z[16:23]; Z[8:15], Z[24:31]; Z[32:39], Z[48:55]; Z[40:47], Z[56:63].]

The R28MDF design comprises two input units (IU), one retrenched 8-point FFT (R8-FFT) unit, one multiplier unit (MU), one delay feedback memory (DFM) and one control unit (CU) as shown in Fig. 14. The detailed operations of each building unit are described as follows.

IU: The IU contains one register bank, which can store 64 complex data words of 16-bit word length. These 64 complex registers are split into 8 parallel shift-register lines, as illustrated in Fig. 14. Each shift-register line can be controlled independently by a simple clock-gated controller. In Figs. 16 and 18, the subscripts of each element are written in radix-8 notation. Figure 14 shows that the proposed R28MDF-based serial blockwise architecture contains two input units, one per channel, to store the two-channel input data for the 2×2 MIMO-OFDM system and so realize the input buffer discussed in Chapter 1. To prevent input data overflow, the R28MDF architecture groups the 16 shift-register lines of the two IUs into four blocks, namely block0, block1, block2 and block3, each of which contains four parallel shift-register lines. In consecutive timing frames, the proposed input unit applies two different combinations of these four blocks to store the two-channel input data, as depicted in Fig. 15. Each timing frame contains 64 clock cycles. In Fig. 15, X(0:63), X(64:127) and X(128:191) denote the input data in the first, second and third timing frames, respectively.

In the first timing frame, block0 and block1 store the input data X(0:31) and X(32:63) of one channel, while block2 and block3 store X(0:31) and X(32:63) of the other channel, as shown in Fig. 15. The proposed R28MDF architecture requires 32 cycles to complete one 64-point FFT/IFFT computation, as described in detail in the following subsection.

In the first 32 cycles of the second timing frame, the data X(0:63) in block0 and block1 are pushed into the R8-FFT unit in parallel. Simultaneously, the incoming data X(64:95) of both channels seamlessly replace the contents of block0 and block1. During cycles 96–127, the proposed design completes the 64-point FFT/IFFT computation of the data X(0:63) in block2 and block3, and the incoming data X(96:127) of both channels concurrently replace the contents of block2 and block3. With this block-based input unit architecture and appropriate multiplexing control, the two-channel input data can be pushed to the R8-FFT unit using only 128 words of shift registers, as depicted in Fig. 14.

Fig. 15: The timing sequence of the proposed block-based input unit.

Fig. 16: Block diagram of the proposed R8-FFT/IFFT unit. [Labels recoverable from the figure: inputs X0 to X7; selectors Sel G1, G2 and Sel H1, H2; mode; Matrix Computation 1; Shift-and-Add; Matrix Computation 2; output pairs Y0, Y2; Y1, Y3; Y4, Y6; Y5, Y7.]

R8-FFT unit: By sharing one constant multiplier of the radix-8 butterfly over two clock cycles, equation (72) yields a low-cost, high-efficiency 8-point FFT/IFFT butterfly kernel, called the R8-FFT unit, as illustrated in Fig. 16. The constant multiplier in the R8-FFT unit is implemented entirely with shift-and-add circuits, while the proposed parallel-type multiplier unit (MU) is implemented with eight constant multipliers. The IFFT is obtained by controlling the mode signal in Fig. 16; because its operations are similar to those of the FFT, the detailed description is omitted here.

MU: The MU, illustrated in Fig. 17, comprises eight constant multipliers to realize the different multiplications of the WMTsl in (70). To complete the 64-point FFT/IFFT computation in (70), the operation sequence is separated into two operational stages, namely the multiplication stage (MS) and the output stage (OS), as illustrated in Figs. 18(a) and 18(b), where the number inside brackets denotes which constant of the MU is used. Notably, the MU is used only in the MS. Thus, the input ports of the MU should be gated during the OS to further reduce the power consumption. The MU contains five independent multiplication pair-ports in parallel, one more than the modified R8MDC design [41].

The modified R8MDC design has to be halted for five clock cycles during the FFT computation because of resource conflicts. The proposed architecture adopts the additional port to remove this performance degradation. The conflicts occur in four different clock cycles, namely cycles 6, 10, 11 and 14, as revealed in Fig. 18(a). In clock cycles 8, 11, 12, 13 and 16, the fifth pair-port, P(4), can re-serve the multiplication to refill the data R62(4), R64(8), R45(4), R74(4) and R66(4) to the DFM; this scheme is called MAW. Using the MAW method, the proposed architecture completes the computation of each operational stage in 16 clock cycles. The R28MDF architecture therefore needs only 32 clock cycles to complete the two operational stages, and can complete two 64-point FFT/IFFT computations in 64 clock cycles. Hence, the proposed R28MDF architecture achieves a throughput rate of 2R, which is twice that of the R22SDF architecture, as illustrated in Fig. 18(c).
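The split into a multiplication stage followed by an output stage mirrors the standard two-stage 8×8 factorization of a 64-point DFT. The following is a minimal Python sketch of that textbook factorization, not of the hardware datapath. Equation (70) is not reproduced in this section, so the index mapping (n = 8·n1 + n2, k = k1 + 8·k2) and all function names are assumptions made for illustration.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT; used both as the 8-point kernel and as a reference."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
        for k in range(n)
    ]


def fft64_two_stage(x):
    """64-point DFT as 8-point DFTs, twiddle multiplications, 8-point DFTs.

    Assumed index mapping (hypothetical, chosen for illustration):
    input n = 8*n1 + n2, output k = k1 + 8*k2.
    """
    assert len(x) == 64
    # Multiplication stage: 8-point DFT over n1 for each n2, then multiply
    # each result by the twiddle factor W_64^(n2*k1).
    stage1 = [[0j] * 8 for _ in range(8)]  # indexed as stage1[n2][k1]
    for n2 in range(8):
        col = dft([x[8 * n1 + n2] for n1 in range(8)])
        for k1 in range(8):
            twiddle = cmath.exp(-2j * cmath.pi * n2 * k1 / 64)
            stage1[n2][k1] = col[k1] * twiddle
    # Output stage: 8-point DFT over n2 for each k1, no further twiddles.
    out = [0j] * 64
    for k1 in range(8):
        row = dft([stage1[n2][k1] for n2 in range(8)])
        for k2 in range(8):
            out[k1 + 8 * k2] = row[k2]
    return out


def nontrivial_twiddles():
    """Count twiddles W_64^(n2*k1) that are not one of 1, -1, j, -j."""
    return sum(
        1 for n2 in range(8) for k1 in range(8) if (n2 * k1) % 16 != 0
    )
```

Under this mapping, the number of twiddle factors that are not ±1 or ±j comes to 48, which is consistent with the 48 complex multiplications listed for the radix-8 based designs in Tables 5 and 6.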

Fig. 17: Block diagram of the proposed MAW-based multiplier unit. [Labels recoverable from the figure: Multiplier Unit; DFM or DCM; inputs Y0, Y2; Y1, Y3; Y4, Y6; Y5, Y7; Constant 0 to Constant 8; pair-ports P(0) to P(4); R.]

Fig. 18: The timing sequence of the proposed R28MDF and R28MDC architectures. (a) The first stage: multiplication stage. (b) The second stage: output stage. (c) The timing sequence of the R28MDF design. (d) The pipeline timing sequence of the R28MDC design. [Panels (c) and (d) show the stages MS1, OS1, MS2, OS2, together with MS3, OS3, MS4, OS4 in the pipelined case, each lasting 16 cycles.]

DFM: The DFM contains one register bank, which can store 64 complex data words of 16-bit word length. The DFM stores the intermediate data from the R8-FFT unit, as illustrated in Fig. 14. To save power, the DFM is built as a matrix-based buffer with proper gated control.

CU: The CU contains a 6-bit master counter to manage the entire procedure, and gates the unused parts during redundant periods to minimize power consumption. Although the CU incurs a very small area overhead to realize the MAW, the proposed design still raises the throughput rate to 2R with only one constant multiplier.

4.2.2 R28MDC-based 64-Point Pipeline FFT/IFFT Processor for 4×4 MIMO-OFDM System

Fig. 19: Block diagram of the proposed R28MDC-based 64-point FFT/IFFT architecture for 4×4 MIMO-OFDM system. [Labels recoverable from the figure: 4 IU (4×4 MIMO); SW; R8-FFT; MU; DCM; R8-FFT; outputs Z[0:7], Z[16:23]; Z[8:15], Z[24:31]; Z[32:39], Z[48:55]; Z[40:47], Z[56:63].]

For the 4×4 MIMO-OFDM WLAN application, another design, R28MDC, based on a pipeline architecture is presented to further raise the throughput rate to 4R, which is double that of the R28MDF architecture. The R28MDC design comprises four input units (IU), two retrenched 8-point FFT (R8-FFT) units, one multiplier unit (MU), one delay commutator memory (DCM) and one control unit (CU), as shown in Fig. 19.

Additionally, the R28MDC architecture has four IUs, allowing it to store four-channel data for the 4×4 MIMO-OFDM system. Based on a block-based input buffer architecture similar to that of the R28MDF design, the R28MDC architecture groups 32 shift-register lines into 16 blocks, each containing two parallel shift-register lines. By applying two different combinations of these 16 blocks with a simple clock-gated controller in the CU, the four IUs prevent input data overflow in consecutive timing frames for the 4×4 MIMO-OFDM system.

For the 2×2 MIMO-OFDM WLAN system, the R28MDF design applies the feedback path to reduce the number of R8-FFT units to only one, with a 100% butterfly utilization rate, as illustrated in Fig. 14. In contrast, the R28MDC architecture adopts a feedforward path rather than the feedback path, as illustrated in Fig. 19. Thus, the feedback-type memory architecture of the DFM is replaced by the feedforward-type delay commutator memory (DCM) architecture, and a second R8-FFT unit is inserted after the DCM. Otherwise, the structures of the R8-FFT unit and the MU are the same as those in the R28MDF architecture, except that two intermediate multiplexers have been eliminated.

The MAW scheme can finish the computation period of each stage in the R28MDC architecture within 16 clock cycles without any performance degradation. In the R28MDF architecture, the same R8-FFT unit executes the computations of both the MS and the OS through the feedback path, so its throughput rate is 2R. In the R28MDC architecture, however, the first R8-FFT unit performs only the MS computation and the second R8-FFT unit performs only the OS computation. Thus, the R28MDC architecture provides double the throughput rate of the R28MDF architecture. Hence, the four-channel computation can be completed in 64 clock cycles for the 4×4 MIMO-OFDM system, as illustrated in Fig. 18(d). The proposed R28MDC architecture attains a high throughput rate of 4R, which is the same as that of the R4MDC design.
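The throughput figures quoted for the two architectures follow from the stage timing alone. The short Python sketch below makes that arithmetic explicit. It assumes that R denotes one input sample per clock cycle and that the R28MDC pipeline is in steady state; the function and constant names are hypothetical.

```python
FFT_POINTS = 64
STAGE_CYCLES = 16  # cycles per operational stage (MS or OS) with the MAW scheme


def throughput_in_units_of_r(fft_points, cycles_per_transform):
    """Samples processed per clock cycle, assuming R = one sample per cycle."""
    return fft_points / cycles_per_transform


# R28MDF: one R8-FFT unit runs the MS and then the OS, so 32 cycles per transform.
r28mdf_cycles = 2 * STAGE_CYCLES
# R28MDC: MS and OS run on separate R8-FFT units, so in steady state a new
# transform completes every 16 cycles.
r28mdc_cycles = STAGE_CYCLES

r28mdf_rate = throughput_in_units_of_r(FFT_POINTS, r28mdf_cycles)
r28mdc_rate = throughput_in_units_of_r(FFT_POINTS, r28mdc_cycles)

# Number of 64-point transforms that fit into one 64-cycle timing frame.
r28mdf_transforms_per_frame = 64 // r28mdf_cycles
r28mdc_transforms_per_frame = 64 // r28mdc_cycles
```

With these assumptions, the rates evaluate to 2R and 4R, and one 64-cycle timing frame accommodates two and four transforms, respectively, matching the 2×2 and 4×4 channel counts.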

4.3 Circuit Implementation

This work presents the R28MDF and R28MDC implementations for 2×2 and 4×4 MIMO-OFDM WLAN applications [23], respectively. The processing time of the 64-point FFT/IFFT for the IEEE 802.11n standard has to be within 3.2 µs, excluding the guard interval [23]. The proposed 64-point FFT/IFFT design maintains an appropriate throughput at the data sampling frequency of 20 MHz for the MIMO-OFDM system. The R28MDF and R28MDC designs thus achieve throughput rates of 2R and 4R, respectively, to meet the IEEE 802.11n standard. Following functional verification in MATLAB, the proposed design was modeled in Verilog and verified using the NC-Verilog simulator. The proposed design, with an internal word length of 16 bits, was synthesized using Design Compiler based on TSMC 0.13 µm 1P8M CMOS technology.

The floorplan and the post-layout were performed with Astro. Following back-annotation from the Star-RC extractor, the post-layout simulation was performed with the NC-Verilog simulator to verify the functionality. The static timing check was signed off with PrimeTime. Finally, the power analysis was performed with Astro Rail. Figures 20(a) and 20(b) show the core layouts of the R28MDF and R28MDC designs, respectively. For the post-layout, the core area of R28MDF was 0.75 mm², which includes power rings and power straps, as depicted in Fig. 20(a). The average power dissipation of the proposed R28MDF design was 19.42 mW at 20 MHz with a 1.2 V supply voltage. The core area of R28MDC was 0.98 mm², as depicted in Fig. 20(b), and its power dissipation was 23.57 mW at 20 MHz with a 1.2 V supply voltage. Table 4 lists the gate count usage of each building unit. In Table 4, the small gate count of the CU shows that the area expense for supporting the MAW is negligible. Significantly, the matrix-based DFM architecture in the R28MDF design reduces routing complexity compared with the serial architecture in [57], so the routing area of the physical design is reduced relative to our previous scheme. Based on the implementation results, the R28MDC implementation, which uses the feedforward path architecture, has an even smaller routing area than the R28MDF implementation. Following back-annotation, the static timing analyses indicate that the critical paths of the R28MDF and R28MDC designs are 48.3 ns and 47.8 ns, respectively.

The implementation results demonstrate that the proposed 64-point FFT/IFFT design satisfies the 3.2µs timing specification of IEEE 802.11n standard for the 2×2 and 4×4 MIMO-OFDM wireless applications.
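The timing claim can be checked with a few lines of arithmetic. The Python sketch below assumes that the core is clocked at the 20 MHz sampling frequency and that one 64-cycle timing frame corresponds to the 3.2 µs FFT period; the constant and function names are hypothetical.

```python
SAMPLE_RATE_HZ = 20e6   # data sampling frequency stated in the text
FRAME_CYCLES = 64       # clock cycles per timing frame
SPEC_US = 3.2           # FFT/IFFT processing-time limit quoted from [23]
CRITICAL_PATH_NS = {"R28MDF": 48.3, "R28MDC": 47.8}  # post-layout STA results


def clock_period_ns(freq_hz):
    """Clock period in nanoseconds for a given clock frequency."""
    return 1e9 / freq_hz


def frame_time_us(cycles, freq_hz):
    """Duration of one timing frame in microseconds."""
    return cycles / freq_hz * 1e6


period_ns = clock_period_ns(SAMPLE_RATE_HZ)
frame_us = frame_time_us(FRAME_CYCLES, SAMPLE_RATE_HZ)

# Positive slack means the critical path fits inside one clock period.
slack_ns = {name: period_ns - path for name, path in CRITICAL_PATH_NS.items()}
meets_spec = frame_us <= SPEC_US + 1e-9 and all(s > 0 for s in slack_ns.values())
```

Under these assumptions the clock period is 50 ns, a 64-cycle frame lasts 3.2 µs, and both reported critical paths leave a small positive slack.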

Fig. 20: Layout view of the proposed 64-point FFT/IFFT processors. (a) The R28MDF implementation. (b) The R28MDC implementation. [Regions labeled in the layouts: IU, CU, R8-FFT, MU, DCM.]

Table 4: Area usage of each building block in the proposed R28MDF and R28MDC designs.

| Implementation | IU | R8-FFT | MU | DFM/DCM | CU |
|---|---|---|---|---|---|
| R28MDF | 37.6% | 6.6% | 27% | 27.9% | 0.9% |
| R28MDC | 53.2% | 9.2% | 17.2% | 20% | 0.4% |

4.4 The Comparison Discussion of MIMO-FFT Architecture

Considering the most efficient pipeline FFT processor for a single-input single-output OFDM (SISO-OFDM) WLAN application, He et al. presented several reliable architectures and a detailed comparison of their hardware costs [31]. The comparison indicates that the radix-2² single-path delay feedback (R22SDF) architecture has the highest butterfly utilization (50%) and the lowest hardware resource consumption among them [31, 34]. However, the radix-2² based algorithm has a higher complex multiplicative complexity than high-radix and other mixed-radix FFT algorithms, as revealed in Table 1. Furthermore, the SDF-based architecture has the lowest throughput rate, R, which cannot meet the requirements of MIMO-OFDM applications.

Considering the most efficient pipeline FFT processor for MIMO-OFDM WLAN applications, the comparison results of [26] indicate that the R4MDC architecture is the most efficient 64-point FFT/IFFT processor for the 4×4 MIMO-OFDM WLAN system. Although several R4MDC-based 64-point FFT chips have been discussed [40, 61, 72], only the design of Swartzlander et al. [40] can operate at the data sampling frequency in 4×4 MIMO-OFDM systems. Notably, Hui et al. [56] proposed a digit-serial architecture based on radix-4 decomposition, with higher hardware utilization (100%) than the R4MDC-based design [40] in the SISO-OFDM system. Hui et al. made good tradeoffs between the digit size and the throughput rate in the SISO system. However, the radix-4 based design also has a higher complex multiplicative complexity than high-radix and other mixed-radix FFT algorithms. This work focuses on high-throughput design with low multiplicative complexity to fit the requirements of 2×2 and 4×4 MIMO-OFDM systems.

This section presents detailed comparisons between the two proposed architectures, R28MDF and R28MDC, and several well-known FFT architectures in the 2×2 and 4×4 MIMO-OFDM systems. An effective design is dictated by considerations of area, timing, power consumption and ease of reuse. In this investigation, the systems were compared using five indices (MIMO-FFT architecture, complex multiplicative complexity, throughput rate, utilization and cost) to assess the effectiveness of the FFT/IFFT processors. To estimate the area index of the different architectures, the conventional comparative methodology [26], with the equivalent adder as the unit, was adopted. Based on the implementation results of our process, one complex multiplier is equivalent to 50 complex adders if it uses 16-bit precision and the scheme of three real multiplications and five real additions. One 16-bit complex memory word was converted to 1.3 complex adders. The area report of the logic synthesis tool shows that one proposed MU is equivalent to 3.2 complex multipliers. Furthermore, the area of the proposed constant multiplier is one-eighth that of the proposed MU; that is, one constant multiplier is approximately equivalent to 0.4 complex multipliers.
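The conversion rules above can be written as a small calculation. The following Python sketch applies them to two rows of Table 6 that contain only complex multipliers, complex adders and memory words (R4MDC [40] and R2MDC [67]). The function and constant names are hypothetical, and the weight of the "T" terms that appear in other rows of Table 6 is not stated in the text, so those rows are left out.

```python
ADDERS_PER_COMPLEX_MULTIPLIER = 50   # 16-bit, 3 real multiplications + 5 real additions
ADDERS_PER_MEMORY_WORD = 1.3         # one 16-bit complex memory word
MU_IN_COMPLEX_MULTIPLIERS = 3.2      # area of one proposed multiplier unit


def area_index(complex_multipliers, complex_adders, memory_words):
    """Return (area without memory, area with memory) in equivalent complex adders."""
    logic = complex_multipliers * ADDERS_PER_COMPLEX_MULTIPLIER + complex_adders
    total = logic + memory_words * ADDERS_PER_MEMORY_WORD
    return logic, total


# One constant multiplier is one-eighth of the MU.
constant_multiplier = MU_IN_COMPLEX_MULTIPLIERS / 8

# Rows taken from Table 6: (complex multipliers, complex adders, memory words).
r4mdc_logic, r4mdc_total = area_index(6, 24, 348)
r2mdc_logic, r2mdc_total = area_index(8, 24, 316)
```

These two rows reproduce the tabulated values of 324 (776.4) for R4MDC and 424 (834.8) for R2MDC, and the constant multiplier evaluates to 0.4 complex multipliers.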

4.4.1 2×2 MIMO-OFDM WLAN Application

Table 5: Comparison results of the 64-point FFT/IFFT chip designs in 2×2 MIMO-OFDM system.

| Architecture | MIMO-FFT architecture (frequency, MHz) | Complex multiplications (#) | Throughput rate | Butterfly utilization | Cost: ROM (#) | Cost: complex multipliers (#) | Cost: constant multipliers (#) | Cost: area without memory (area with memory) |
|---|---|---|---|---|---|---|---|---|
| Modified R22SDF [34] | Parallel Multi-Path (20) | 76 | R | 50% | 2 | 4 | 0 | 224 (390.4) |
| R28SDF [32] | Parallel Multi-Path (20) | 48 | R | 25% | 4 | 4 | 8 | 304 (470.4) |
| R2MDC [67] | Serial Blockwise (20) | 98 | 2R | 100% | 4 | 4 | 0 | 424 (834.8) |
| R4MDC [40] | Serial Blockwise (20) | 76 | 4R | 50% | 6 | 6 | 0 | 324 (776.4) |
| Modified R4MDC [61] | Serial Multi-Stream (80) | 76 | 4R | 50% | 4 | 4 | 0 | 340 (1120) |
| Modified R8MDC [41] | Serial Blockwise (20) | 48 | 5.33R | 25% | 0 | 3.2 | 4 | 228 (709) |
| Proposed R28MDF | Serial Blockwise (20) | 48 | 2R | 100% | 0 | 3.2 | 1 | 197 (446.6) |

Table 5 presents the comprehensive comparison results of seven existing 64-point FFT/IFFT processors and the proposed R28MDF design in terms of MIMO-FFT architecture, complex multiplicative complexity, throughput rate, butterfly utilization, the number of ROMs, complex multipliers and constant multipliers, and the area index. Table 5 shows that the proposed R28MDF design and the R8MDC design achieve the lowest complex multiplicative complexity among the tested designs. In terms of butterfly utilization, the proposed R28MDF design achieves the highest value (100%) among those tested. The R28SDF [32] and R22SDF [34] designs clearly have the lowest throughput rate, R, of all the designs.

Significantly, the R8MDC-based FFT/IFFT architecture in [41] has two butterfly stages, which need only 12 and 11 clock cycles, respectively. Based on the serial blockwise architecture, the parallel input data for each butterfly stage in [41] can be provided simultaneously to achieve the higher throughput rate. Table 5 shows that the modified R8MDC [41] and R4MDC [40] designs attain higher throughput rates of 5.33R and 4R, respectively, but both have lower butterfly utilization and higher chip cost than the proposed R28MDF design. Sansaloni et al. [26] indicated that MIMO-FFT processors with throughput rates of 2R and 4R and the least amount of hardware are more appropriate than other architectures for 2×2 and 4×4 MIMO-OFDM applications, respectively.

Based on the serial blockwise architecture, the proposed R28MDF design incurs a small cost penalty for two IUs and one DFM memory in the 2×2 MIMO-OFDM system. When the memory area is considered, the area index of the R28MDF design is 14.4% higher than that of the R22SDF [34] design. However, the R22SDF design increases the multiplicative complexity by 58.3% and reduces the butterfly utilization to 50% of that of the R28MDF design. Furthermore, the proposed design and that of Maharatna et al. [41], which adopt only one parallel-type multiplier unit, do not require any coefficient ROM.
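The percentages quoted above follow directly from the Table 5 entries for the proposed R28MDF and the modified R22SDF [34] designs. A minimal Python check, with hypothetical variable and function names, is:

```python
def percent_increase(value, baseline):
    """Relative increase of value over baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


# Table 5 entries: area with memory, and number of complex multiplications.
R28MDF_AREA_WITH_MEMORY = 446.6
R22SDF_AREA_WITH_MEMORY = 390.4
R28MDF_COMPLEX_MULTIPLICATIONS = 48
R22SDF_COMPLEX_MULTIPLICATIONS = 76

area_penalty = percent_increase(R28MDF_AREA_WITH_MEMORY, R22SDF_AREA_WITH_MEMORY)
multiplication_penalty = percent_increase(
    R22SDF_COMPLEX_MULTIPLICATIONS, R28MDF_COMPLEX_MULTIPLICATIONS
)

# Butterfly utilization of R22SDF (50%) relative to R28MDF (100%).
utilization_ratio = 50 / 100
```

Rounded to one decimal place, the two increases are 14.4% and 58.3%, and the utilization ratio is 0.5, in agreement with the figures in the text.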

Following a comprehensive comparison between the different architectures, this investigation demonstrates that the proposed R28MDF implementation mitigates the chip cost problem associated with the R8MDC and R4MDC architectures, the low throughput rate problem of the R22SDF and R28SDF architectures, and the high multiplicative complexity problem of the R22SDF and R2MDC architectures. Thus, the proposed R28MDF design makes an effective tradeoff between complex multiplicative complexity, throughput rate, butterfly utilization and cost for the 2×2 MIMO-OFDM application.

4.4.2 4×4 MIMO-OFDM WLAN Application

For a 4×4 MIMO-OFDM system, Table 6 presents the comprehensive comparison results of several pipeline FFT/IFFT architectures in terms of the MIMO-FFT architecture, throughput rate, complex multiplicative complexity, the utilization of all components, the number of complex multipliers and complex adders, the memory size, and the area index of the entire system. Table 6 shows that the proposed R28MDC design achieves the lowest complex multiplicative complexity among the tested designs. Furthermore, the proposed R28MDC and the R4MDC [40] designs achieve the highest utilization (100%) for all components; thus, these two designs are the best among all pipeline architectures tested for the 4×4 MIMO-OFDM application. Although the R4MDC architecture [40] achieves 100% utilization for all components, its chip area is 25.6% larger than that of the R28MDC architecture when the memory cost is considered. Regardless of whether the memory cost is considered, the proposed R28MDC architecture has the smallest chip area among all pipeline architectures tested in the 4×4 MIMO-OFDM system. The R28MDC architecture does not require any coefficient ROM, which is another improvement over the R4MDC architecture. In summary, the R28MDC architecture achieves the lowest complex multiplicative complexity, an appropriate throughput rate of 4R, the highest utilization for all components and the lowest chip cost, making it very suitable for the 4×4 MIMO-OFDM WLAN application.

Table 6: Comparison results of the 64-point pipelined FFT/IFFT architecture in 4×4 MIMO-OFDM system.

| Pipeline architecture | MIMO-FFT architecture | Complex multiplications (#) | Throughput rate | Complex multipliers # (utilization) | Complex adders # (butterfly utilization) | Memory size (utilization) | Area without memory (area with memory) |
|---|---|---|---|---|---|---|---|
| R2SDF [42] | Parallel Multi-Path | 98 | R | 20 (50%) | 48 (50%) | 252 (100%) | 1048 (1375.6) |
| R22SDF [34] | Parallel Multi-Path | 76 | R | 8 (75%) | 48 (50%) | 252 (100%) | 448 (775.6) |
| R23SDF [31] | Parallel Multi-Path | 48 | R | 8 (87.5%) | 48+16T (50%) | 252 (100%) | 528 (855.6) |
| R24SDF [68] | Parallel Multi-Path | 76 | R | 8 (75%) | 48 (50%) | 252 (100%) | 448 (775.6) |
| R4SDF [69] | Parallel Multi-Path | 76 | R | 8 (75%) | 96 (25%) | 252 (100%) | 496 (823.6) |
| R4SDC [70] | Parallel Multi-Path | 76 | R | 8 (75%) | 36 (25%) | 504 (100%) | 436 (1091.2) |
| R28SDF [32] | Parallel Multi-Path | 48 | R | 8 (12.5%) | 64+8T (25%) | 252 (100%) | 504 (831.6) |
| R2MDC [67] | Parallel Multi-Path | 98 | 2R | 8 (100%) | 24 (100%) | 316 (100%) | 424 (834.8) |
| R23MDC [36] | Parallel Multi-Path | 48 | 2R | 8 (87%) | 24+8T (100%) | 316 (100%) | 464 (874.8) |
| R24MDC [71] | Parallel Multi-Path | 76 | 2R | 16 (75%) | 56 (71.2%) | 380 (100%) | 856 (1350) |
| R4MDC [40] | Serial Blockwise | 76 | 4R | 6 (100%) | 24 (100%) | 348 (100%) | 324 (776.4) |
| Modified R4MDC [61] | Serial Multi-Stream | 76 | 4R | 4 (100%) | 80+12T (100%) | 600 (100%) | 340 (1120) |
| Modified R8MDC [41] | Serial Blockwise | 48 | 5.33R | 3.2 (75%) | 48+4T (75%) | 370 (75%) | 228 (709) |
| Proposed R28MDC | Serial Blockwise | 48 | 4R | 3.2 (100%) | 32+2T (100%) | 320 (100%) | 202 (618) |

4.5 Summary

This work proposes a hardware-oriented, high-efficiency approach that minimizes the complex multiplicative complexity and area cost while achieving 100% butterfly utilization with an appropriate throughput rate. By adopting the proposed R8-FFT unit combined with the MAW method, two efficient serial blockwise 64-point FFT/IFFT processors are constructed for the 2×2 and 4×4 MIMO-OFDM WLAN systems. For the 2×2 MIMO-OFDM system, the proposed R28MDF design has the best performance in terms of the lowest complex multiplicative complexity, an appropriate throughput rate of 2R, the highest butterfly utilization and the fewest complex multipliers when compared with other existing 64-point FFT/IFFT processor architectures. For the 4×4 MIMO-OFDM system, the proposed R28MDC design outperforms existing FFT/IFFT pipeline processor architectures and has the lowest complex multiplicative complexity, an appropriate throughput rate of 4R, the highest utilization rate (100%) of all components and the lowest hardware cost. According to the IEEE 802.11n standard [23], execution time for the 128-point and 64-point FFT/IFFT processor with 1–4 simultaneous
