An Efficient QR Decomposition Design for MIMO Systems


¹Jing-Shiun Lin, ²Yin-Tsung Hwang, ¹Po-Han Chu, ¹Ming-Der Shieh, and ¹Shih-Hao Fang

¹Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan

²Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan

Email: [email protected]

Abstract—Multiple-input multiple-output (MIMO) techniques are widely used in wireless communication systems. QR factorization is a fundamental yet computationally intensive module used in many MIMO detection schemes. In this paper, a complex-valued QR factorization (CQRF) scheme realized via a sequence of real-valued Givens rotations is first presented. An efficient CQRF design using coordinate rotation digital computer (CORDIC) modules is then developed. The design features a highly parallel architecture to support high-throughput operation: one CQRF can be obtained every 8 clock cycles. To reduce the circuit complexity, a pipelined CORDIC structure is also applied. Implementation results in the TSMC 0.18-μm CMOS process indicate that the proposed design achieves a throughput of 25 M CQRFs per second with a circuit complexity of only 103.7k gates. Performance evaluation based on a composite index of area and throughput also shows the advantages of the proposed design over similar works.

I. INTRODUCTION

The multiple-input multiple-output (MIMO) technique provides high spatial freedom to increase reliability and throughput. It has attracted considerable attention [1] and has been widely adopted in wireless communication standards such as IEEE 802.11n, IEEE 802.16e-2005 (WiMAX), and IEEE 802.15c. MIMO systems provide two data transmission modes: spatial diversity and spatial multiplexing. In spatial diversity, each antenna transmits the same information to combat fading. In contrast, spatial multiplexing transmits different information on each antenna and can provide a high data rate in multiple channel environments. Furthermore, the MIMO technique is often combined with orthogonal frequency-division multiplexing (OFDM), a transmission technique that uses multiple narrow-band digital signals. For spatial multiplexing, MIMO detection schemes such as the sphere decoder [2], [3], zero-forcing [4], QR-BLAST [5], and sorted and MMSE (minimum mean square error) regularized QR (QR-MMSE) [8] have been proposed to recover signals from multiple antennas. These MIMO detection schemes require QR factorization to convert a conventional MIMO channel into multiple layered sub-channels. With the increasingly high transmission rates of recent communication systems, a high-throughput QR factorization architecture is essential.

The QR factorization decomposes a channel matrix into a unitary matrix Q and a triangular matrix R. There are three categories of QR factorization methods: the Gram-Schmidt (GS) process, Householder transforms (HT), and Givens rotations (GR). The GS process achieves orthogonalization by using projections onto the existing basis vectors to construct a new basis vector. In finite-precision arithmetic, rounding errors may cause a loss of orthogonality among the resulting basis vectors. To solve this problem, modified GS (MGS) algorithms are proposed in [6].

The Householder transformation is a linear transformation that reflects a column vector of a matrix onto a multiple of a standard basis vector, so that the remaining non-zero elements of the column vector become zero while the norm is preserved. On the other hand, a two-dimensional rotation method was proposed by W. J. Givens in 1954. Since the rotation matrix is orthogonal and each rotation affects only the two corresponding row vectors, the GR scheme offers higher computation parallelism than the HT scheme. From the hardware implementation point of view, the coordinate rotation digital computer (CORDIC) algorithm can be applied to perform the rotations needed in GR schemes. Compared to GR schemes based on the CORDIC algorithm, GS and HT schemes require complicated operations such as square root and division. As a result, the GR scheme based on the CORDIC algorithm has been widely used in recent works [9], [10]. In this paper, we modify the earlier work in [9] to realize an efficient complex-valued QR factorization (CQRF) design. It features a highly parallel architecture to support high-throughput operation: one CQRF can be obtained every 8 clock cycles. To reduce the circuit complexity, a pipelined CORDIC structure is also applied. Implementation results in the TSMC 0.18-μm CMOS process indicate that the proposed design achieves a throughput of 25 M CQRFs per second with a circuit complexity of only 103.7k gates. Evaluations based on a composite index of area and throughput also show the performance advantages of the proposed design over other works.

The rest of this paper is organized as follows. In Section II, a modified CQRF scheme based on real-valued GR factorization is given. The proposed CQRF design is described in Section III and the implementation results and design comparisons are provided in Section IV.

II. MODIFIED COMPLEX QR FACTORIZATION ALGORITHM

A MIMO-OFDM system model with Nt transmit antennas and Nr receive antennas at the k-th subcarrier is expressed as:

$$
\mathbf{y}^{(k)}_{N_r \times 1} = \mathbf{H}^{(k)}_{N_r \times N_t}\,\mathbf{x}^{(k)}_{N_t \times 1} + \mathbf{n}^{(k)}_{N_r \times 1}, \tag{1}
$$

where y, H, x, and n denote the received signal vector, the channel matrix, the transmitted signal vector, and the noise in the frequency domain, respectively, and the subscripts denote the dimensions of each vector or matrix. The optimum solution is obtained by the maximum likelihood (ML) scheme, which minimizes the Euclidean distance between y and Hx. By using QR factorization, the ML solution is obtained by:

$$
\hat{\mathbf{x}}^{(k)}_{ML} = \arg\min_{\mathbf{x}^{(k)} \in \Omega} \left\| \hat{\mathbf{y}}^{(k)} - \mathbf{R}^{(k)} \mathbf{x}^{(k)} \right\|, \tag{2}
$$

where Ω is the set of possible signals, which depends on the modulation type, and $\hat{\mathbf{y}}^{(k)} = \mathbf{Q}^H \mathbf{y}^{(k)}$. Besides decomposing the channel matrix into the product of a unitary matrix Q and an upper triangular matrix R, the QR factorization scheme also calculates the product of Q^H and the received signal vector y.
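The equivalence behind Eq. (2) can be checked numerically. The following is a minimal sketch, assuming a square 4×4 channel, QPSK symbols, and NumPy's built-in `numpy.linalg.qr` in place of the hardware scheme; the channel, noise level, and the helper names `metric_direct` and `metric_qr` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4  # assumed antenna count (Nr = Nt = 4)

# Hypothetical channel, QPSK symbols and noise for one subcarrier.
H = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
x = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)
n = 0.05 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = H @ x + n

Q, R = np.linalg.qr(H)   # H = Q R with Q unitary and R upper triangular
y_hat = Q.conj().T @ y   # y_hat = Q^H y

def metric_direct(cand):
    """Euclidean distance ||y - H x|| for a candidate vector."""
    return np.linalg.norm(y - H @ cand)

def metric_qr(cand):
    """Distance ||y_hat - R x|| evaluated after QR factorization."""
    return np.linalg.norm(y_hat - R @ cand)

# Any candidate from the signal set gives the same metric in both forms.
cand = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)
print(metric_direct(cand), metric_qr(cand))
```

Because Q is unitary for a square channel, multiplying by Q^H leaves the distance unchanged, while the triangular structure of R lets a detector evaluate the metric layer by layer.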

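In the past, many MIMO detection schemes have applied real-valued decomposition (RVD) to transform the system model into a real-valued signal detection problem, as shown below:

$$
\begin{aligned}
\mathbf{y}^{(k)}_{2N_r \times 1} &= \mathbf{H}^{(k)}_{2N_r \times 2N_t}\,\mathbf{x}^{(k)}_{2N_t \times 1} + \mathbf{n}^{(k)}_{2N_r \times 1} \\
\begin{bmatrix} \mathrm{Re}\{\mathbf{y}^{(k)}\} \\ \mathrm{Im}\{\mathbf{y}^{(k)}\} \end{bmatrix}
&=
\begin{bmatrix}
\mathrm{Re}\{\mathbf{H}^{(k)}\} & -\mathrm{Im}\{\mathbf{H}^{(k)}\} \\
\mathrm{Im}\{\mathbf{H}^{(k)}\} & \mathrm{Re}\{\mathbf{H}^{(k)}\}
\end{bmatrix}
\begin{bmatrix} \mathrm{Re}\{\mathbf{x}^{(k)}\} \\ \mathrm{Im}\{\mathbf{x}^{(k)}\} \end{bmatrix}
+
\begin{bmatrix} \mathrm{Re}\{\mathbf{n}^{(k)}\} \\ \mathrm{Im}\{\mathbf{n}^{(k)}\} \end{bmatrix},
\end{aligned}
\tag{4}
$$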

where Re(·) and Im(·) denote the real and imaginary parts, respectively. According to the analysis of computational complexity among various QR factorization methods conducted in our previous work [9], the HT schemes have the highest complexity. The MGS scheme is not preferred either, because it requires complicated arithmetic modules. In contrast, the GR schemes employing CORDIC modules require only additions/subtractions and constant multiplications. We therefore choose the CQRF scheme presented in [9] as a starting point. The main difference is that we obtain a real-valued triangular matrix of size 2Nr×2Nt instead of a complex-valued triangular matrix of size Nr×Nt. The former format is more suitable for popular MIMO detection schemes such as K-best. An additional post-processing stage consisting of row and column permutations is required to derive the desired format. Besides the extension of the computing algorithm, a more efficient architecture mapping and circuit design are also developed.

Without loss of generality, we assume Nr = Nt = N. Table I shows the modified real-valued GR scheme for CQRF, in which H denotes the stacked matrix [Re(H^(k)); Im(H^(k))] and Re(H) and Im(H) denote its real and imaginary sub-matrices. The outer loop index c1 selects the column whose sub-diagonal elements are nullified in each sub-matrix of H. At column index c1, the inner loop i1 performs a vector rotation on the i1-th row and the (i1+N)-th row via the Givens rotation matrix ℜ(i1, i1+N, θ), where θ = tan⁻¹(h_{i1+N,c1} / h_{i1,c1}) and ℜ(i, j, θ), as shown in Eq. (5), denotes a vector rotation on the i-th and the j-th rows by the angle θ. This procedure is applied to the two corresponding rows in the real and imaginary sub-matrices to nullify the sub-diagonal elements (diagonal included) of column c1 in Im(H). If the column index c1 is smaller than N, the inner loop i2 nullifies the sub-diagonal elements (diagonal excluded) of column c1 in Re(H). To maintain the symmetrical structure, the rotations applied to Re(H) are mirrored in Im(H) as well. The above operations are equivalent to multiplying H by Q^H. In addition, an augmented matrix (H | y) is used in lieu of the matrix H, so that Q^H y is obtained at the same time. After the completion of the outer loop c1, the matrix R formed from the sub-matrices Re(H) and Im(H) is obtained. Note that this real-valued matrix R is not triangular but consists of four triangular sub-matrices arranged symmetrically. Hence, the interleaving operations in the outer loop c2 are applied to rearrange R into one large triangular matrix and to generate the corresponding Ŷ.

$$
\Re(i, j, \theta) =
\begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & & \vdots \\
0 & \cdots & \cos\theta & \cdots & \sin\theta & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & -\sin\theta & \cdots & \cos\theta & \cdots & 0 \\
\vdots & & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{bmatrix}
\tag{5}
$$

Here the cos θ and ±sin θ entries occupy the i-th and j-th rows and columns.

TABLE I. MODIFIED ALGORITHM OF REAL-VALUED GR SCHEME

1. H = [Re(H^(k)); Im(H^(k))]
2. For c1 = 1 : N
3.     For i1 = c1 : N
4.         θ = tan⁻¹(h_{i1+N,c1} / h_{i1,c1})
5.         (H | y) = ℜ(i1, i1+N, θ) · (H | y)
6.     End
7.     If (c1 < N)
8.         For i2 = N−1 : −1 : c1
9.             θ = tan⁻¹(h_{i2+1,c1} / h_{i2,c1})
10.            (H | y) = ℜ(i2, i2+1, θ) · (H | y)
11.            (H | y) = ℜ(i2+N, i2+N+1, θ) · (H | y)
12.        End
13.    End
14. End
15. R = [Re(H), −Im(H); Im(H), Re(H)]
16. For c2 = 1 : N
17.     R′(:, 1+2(c2−1) : 2+2(c2−1)) = [R(:, c2), R(:, c2+N)]
18.     R(1+2(c2−1) : 2+2(c2−1), :) = [R′(c2, :); R′(c2+N, :)]
19.     Ŷ(1+2(c2−1) : 2+2(c2−1), :) = [y(c2, :); y(c2+N, :)]
20. End
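The steps of Table I can be traced in software. The sketch below is an illustrative floating-point model, not the hardware design: it assumes Nr = Nt = N, uses zero-based indices, and replaces tan⁻¹ of a ratio with `numpy.arctan2` so that zero or negative entries are handled. The function names `rot` and `cqrf_real_gr` are hypothetical.

```python
import numpy as np

def rot(M, i, j, theta):
    """Apply the rotation of Eq. (5) to rows i and j of M, in place."""
    c, s = np.cos(theta), np.sin(theta)
    ri, rj = M[i].copy(), M[j].copy()
    M[i] = c * ri + s * rj
    M[j] = -s * ri + c * rj

def cqrf_real_gr(Hc, yc):
    """Floating-point model of Table I for a complex N x N channel Hc and vector yc."""
    N = Hc.shape[0]
    # Line 1: stack real and imaginary parts of the augmented matrix (H | y).
    M = np.vstack([np.column_stack([Hc.real, yc.real]),
                   np.column_stack([Hc.imag, yc.imag])]).astype(float)
    for c1 in range(N):
        # Lines 3-6: make column c1 real on and below the diagonal.
        for i1 in range(c1, N):
            rot(M, i1, i1 + N, np.arctan2(M[i1 + N, c1], M[i1, c1]))
        # Lines 7-13: nullify the sub-diagonal of column c1, bottom-up.
        if c1 < N - 1:
            for i2 in range(N - 2, c1 - 1, -1):
                theta = np.arctan2(M[i2 + 1, c1], M[i2, c1])
                rot(M, i2, i2 + 1, theta)          # rotation in Re(H)
                rot(M, i2 + N, i2 + N + 1, theta)  # mirrored rotation in Im(H)
    A, B = M[:N, :N], M[N:, :N]
    R = np.block([[A, -B], [B, A]])                # line 15
    # Lines 16-20: interleave real and imaginary rows/columns.
    perm = np.ravel(np.column_stack([np.arange(N), np.arange(N) + N]))
    return R[np.ix_(perm, perm)], M[perm, N]

rng = np.random.default_rng(0)
N = 4
Hc = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
yc = Hc @ x                                        # noise-free for checking
R, y_hat = cqrf_real_gr(Hc, yc)
print(np.round(R, 3))
```

In this noise-free example, the returned 2N×2N matrix should be upper triangular, and `R` applied to the interleaved vector [Re x1, Im x1, Re x2, ...] should reproduce `y_hat`, which is the form a real-valued detector would consume.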

III. HARDWARE ARCHITECTURE FOR CQRF DESIGN

Based on the scheme described in Section II, the algorithm mapping and architecture design are developed. Each vector rotation is implemented by the CORDIC algorithm and takes k clock cycles, where k is the number of CORDIC iterations. To shorten the computational latency, vector rotations with no data dependency are performed in parallel. The design development starts with the finite-precision analysis of the CORDIC module. Basic processing elements are derived next, followed by the construction of the computation schedule. Finally, the entire CQRF design for 4×4 MIMO-OFDM systems is constructed.

A. CORDIC Iteration Number and Word Length Analysis

The CORDIC algorithm performs a vector rotation by decomposing the desired rotation angle into a weighted sum of predefined elementary rotation angles. The elementary angles are chosen so that each rotation step needs only shift-and-add operations. As the iteration number increases, the output precision of the CORDIC algorithm improves, but the computing time also increases. Hence, a trade-off between output precision and computation time must be made. From the BER simulation results, the number of CORDIC iterations should exceed 8 to achieve acceptable BER performance. The next step is to determine the word length used in the CORDIC implementation. Our simulation results lead to a fixed-point design with 5 bits for the integer part and 11 bits for the fractional part.
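The trade-off between iteration count and precision can be explored with a small integer model. This is only a sketch under stated assumptions: the 16-bit word is treated as signed with the sign bit counted in the 5 integer bits, saturation is used on quantization, the input vector (3, 2) is hypothetical, and the names `to_fixed`, `cordic_gain`, and `vectoring_fixed` are introduced here for illustration.

```python
import math

INT_BITS, FRAC_BITS = 5, 11      # word length reported in the text
WORD = INT_BITS + FRAC_BITS      # assumed: signed 16-bit word, sign inside the integer part

def to_fixed(v):
    """Quantize v to a signed WORD-bit integer with FRAC_BITS fractional bits (saturating)."""
    lo, hi = -(1 << (WORD - 1)), (1 << (WORD - 1)) - 1
    return max(lo, min(hi, round(v * (1 << FRAC_BITS))))

def cordic_gain(iters):
    """Accumulated CORDIC scale factor K after `iters` iterations."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -l) for l in range(iters))

def vectoring_fixed(x, y, iters):
    """Shift-and-add vectoring on integers; assumes x >= 0. Returns the final (x, y)."""
    for l in range(iters):
        d = -1 if y >= 0 else 1                    # direction that drives y toward zero
        x, y = x - d * (y >> l), y + d * (x >> l)  # arithmetic shifts replace multipliers
    return x, y

x0, y0 = to_fixed(3.0), to_fixed(2.0)              # hypothetical input vector
for iters in (4, 8, 12):
    xf, yf = vectoring_fixed(x0, y0, iters)
    mag = xf / (1 << FRAC_BITS) / cordic_gain(iters)
    # Columns: iterations, magnitude error, residual y component.
    print(iters, abs(mag - math.hypot(3.0, 2.0)), abs(yf) / (1 << FRAC_BITS))
```

Each extra iteration roughly halves the residual angle until the 11 fractional bits become the limiting factor, which is the kind of behavior a BER-driven choice of iteration count and word length has to balance.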

B. Basic Processing Element Designs

According to the CQRF scheme developed in Section II, three basic operations are needed: angle generation, vector rotation, and interleaving arrangement. From the implementation point of view, the interleaving arrangement can be realized by wire connections, while angle generation and vector rotation can be implemented by the vectoring mode and the rotation mode of the CORDIC algorithm, respectively. Two folded processing elements, denoted as GPE and RPE, are derived to implement angle generation and vector rotation, respectively. The designs of GPE and RPE are shown in Fig. 1, where [X⁰ Y⁰], [X^l Y^l], l, and Rd denote the input vector, the output vector at the l-th iteration, the maximum iteration number, and the rotation direction, respectively. All iterations except for the final normalization are folded and realized by one processing element. The scaling multiplier needed for the normalization is shared among different folded CORDIC modules.

The GPE is in charge of calculating the rotation angle, which is represented as a sequence of ±1 values indicating the rotation direction (Rd) of each iteration. Besides angle generation, the GPE also performs one vector rotation. The trailing RPEs take the rotation direction sequence as input to perform their vector rotations. Although vector rotations corresponding to the same inner loop in Table I can be performed in parallel, their execution schedules are deliberately skewed from one another by one clock cycle so that a scaling multiplier can be shared. In addition to the folded GPE and RPE modules, a pipelined CORDIC processor is developed to perform all the angle generation and vector rotations within the same inner loop. As shown in Fig. 2, each pipeline stage corresponds to one CORDIC iteration, and the depth of the pipeline equals the iteration number, which is 8 in this illustration. The processing element of each pipeline stage is a fusion of the GPE and RPE designs and can be controlled to perform either function. From the experimental results, one such pipelined CORDIC architecture has a smaller area cost than the total area of 5 folded GPEs or RPEs when the number of CORDIC iterations is 8.
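The division of labor between GPE and RPE can be illustrated with a behavioral model. The sketch below is an assumption-laden illustration in floating point rather than a description of the circuits in Fig. 1: `gpe`, `rpe`, and `scale` are hypothetical names, the input values are made up, and the GPE input is assumed to have a non-negative x component.

```python
import math

ITERS = 8  # iteration count used in the illustration of Fig. 2

def gpe(x, y, iters=ITERS):
    """Vectoring mode: rotates (x, y) toward the x-axis and records the direction sequence Rd."""
    rd = []
    for l in range(iters):
        d = -1 if y >= 0 else 1
        x, y = x - d * y * 2.0 ** -l, y + d * x * 2.0 ** -l
        rd.append(d)
    return x, y, rd

def rpe(x, y, rd):
    """Rotation mode: replays a direction sequence Rd produced by a GPE."""
    for l, d in enumerate(rd):
        x, y = x - d * y * 2.0 ** -l, y + d * x * 2.0 ** -l
    return x, y

def scale(v, iters=ITERS):
    """Final normalization by 1/K, modeling the shared scaling multiplier."""
    k = math.prod(math.sqrt(1.0 + 4.0 ** -l) for l in range(iters))
    return v / k

# Two rows of a hypothetical real matrix: the GPE handles the pivot column,
# the RPE applies the same rotation to another column of the same row pair.
gx, gy, rd = gpe(3.0, 2.0)
rx, ry = rpe(1.0, -1.0, rd)
print(rd)
print(scale(gx), scale(gy))
print(scale(rx), scale(ry))
```

Passing only the ±1 sequence between elements means no explicit angle has to be computed or stored, and deferring the 1/K scaling to the end is what allows a single multiplier to be time-shared across skewed modules.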

C. Computation Schedule

Based on the proposed real-valued QR factorization algorithm, a computation schedule is developed to maximize the degree of computing parallelism. Fig. 3 shows the developed computation schedule. Notations G(i,j,k) and R(i,j,k) represent the angle generation and the vector rotation based on matrix elements (i,k) and (j,k), respectively. Each bar, with a length of 8 clock cycles, indicates the time span needed for a CORDIC operation. Two notations within the same bar indicate that two CORDIC operations are performed concurrently. However, a CORDIC module needs a barrel shifter to implement the different shift amounts of successive iterations. In fact, the gate count of the barrel shifter is close to that of the adders in a CORDIC design. To achieve efficient hardware, the hardware allocation strategy is as follows. As shown in Fig. 3, 8 consecutive time-skewed CORDIC operations can be mapped to one pipelined CORDIC processor. As opposed to the folded GPE/RPE modules, no barrel shifter is needed in a pipelined CORDIC processor, because the shift amount of each pipeline stage is fixed. However, if fewer than 8 time-skewed CORDIC operations are mapped to one pipelined CORDIC processor, pipeline bubbles occur, which degrades the processor utilization. In our design, if the number of consecutive time-skewed CORDIC operations is less than 7, each of the CORDIC operations is mapped to a separate GPE or RPE module instead of introducing a pipelined CORDIC processor. In Fig. 3, the gray bars in the same skewed column correspond to one pipelined CORDIC processor, and each white bar corresponds to one GPE/RPE module. The entire CQRF design needs 8 pipelined CORDIC processors, 14 GPE modules, and 10 RPE modules.

The total span of the computation schedule is 91 clock cycles; however, a new CQRF computation can be initiated every 8 clock cycles. Not shown in the schedule are the row/column permutations needed to convert the complex-valued triangular matrix into a double-sized real-valued triangular matrix.

Figure 1. The processing element design for the angle generation and the vector rotation

Figure 2. Pipelined CORDIC architecture
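The allocation rule described in this subsection can be written down as a small decision procedure. The following sketch assumes the schedule has already been grouped into skewed columns; the group sizes in the example are hypothetical and are not read from Fig. 3, and the names `Allocation` and `allocate` are introduced here for illustration.

```python
from dataclasses import dataclass

DEPTH = 8       # pipeline depth, equal to the number of CORDIC iterations
THRESHOLD = 7   # groups with fewer operations than this use folded modules

@dataclass
class Allocation:
    pipelined: int = 0   # pipelined CORDIC processors
    folded: int = 0      # folded GPE/RPE modules
    bubbles: int = 0     # unused pipeline slots

def allocate(group_sizes):
    """Map each group of consecutive time-skewed CORDIC operations to hardware."""
    alloc = Allocation()
    for n in group_sizes:
        if not 1 <= n <= DEPTH:
            raise ValueError("each group must hold between 1 and DEPTH operations")
        if n >= THRESHOLD:
            alloc.pipelined += 1          # one pipelined processor serves the group
            alloc.bubbles += DEPTH - n    # a 7-operation group leaves one bubble
        else:
            alloc.folded += n             # one folded module per operation
    return alloc

# Hypothetical schedule with five skewed columns.
print(allocate([8, 8, 7, 3, 2]))
```

With this rule a pipelined processor is never less than 7/8 utilized, while short groups avoid paying for a mostly idle pipeline.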

IV. IMPLEMENTATION RESULT AND COMPARISONS

The proposed QR decomposition design for 4×4 MIMO systems is implemented in TSMC 0.18-μm 1P6M CMOS technology and operates at 200 MHz. The total gate count is 103.7K. Since only 8 clock cycles are needed to start a new decomposition, 25 M 4×4 CQRFs can be processed per second. Table II compares the proposed design with related works for 4×4 MIMO systems. The initiation interval indicates the minimum time separation, in clock cycles, between consecutive CQRFs. N.T. is the throughput normalized by taking the technology factor into consideration:

$$
\text{N.T.} = \frac{\text{Frequency}}{\text{\# of processing cycles}} \times \frac{\text{Technology}}{0.18\,\mu\text{m}}. \tag{6}
$$

Designs under comparison include four prior designs [7]-[10]. The design in [9] employs the real-valued GR algorithm and uses a fully pipelined architecture to achieve a high throughput rate. However, it also requires the largest gate count. In [10], complex-valued QR factorization is performed and the matrix is extended by the RVD process. The triangular matrix is obtained by nullification via the real-valued GR scheme. The design performs one QR factorization and updates multiple signals from different OFDM symbols. For a fair comparison, the circuit for the signal update part is excluded. This design also uses a pipelined architecture and features the highest throughput rate. A compound performance index, denoted “A/N.T.” and defined as the gate count divided by the normalized throughput, is used for design comparison. A smaller value implies a better design. The results show that our work has the smallest value and is the most efficient design among those compared.
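Eq. (6) and the A/N.T. index are simple enough to evaluate directly. The sketch below assumes the frequency is given in MHz, the processing cycle count equals the initiation interval, and the gate count is given in thousands; the function names are hypothetical, and the example inputs are copied from the corresponding columns of Table II.

```python
def normalized_throughput(freq_mhz, cycles, tech_um, ref_um=0.18):
    """Eq. (6): throughput in M CQRFs/s, scaled to the 0.18-um reference technology."""
    return freq_mhz / cycles * tech_um / ref_um

def area_index(gates_k, nt):
    """A/N.T.: gate count (in thousands) divided by the normalized throughput."""
    return gates_k / nt

# Inputs taken from Table II: (frequency in MHz, initiation interval, technology in um).
nt_ref7 = normalized_throughput(269, 139, 0.13)
nt_ref10 = normalized_throughput(100, 4, 0.18)
nt_proposed = normalized_throughput(200, 8, 0.18)

print(round(nt_ref7, 3), round(nt_ref10, 3), round(nt_proposed, 3))
print(round(area_index(23.2, nt_ref7), 3), round(area_index(111, nt_ref10), 3))
```

The technology ratio scales a design in a finer process down to what it would plausibly reach at 0.18 μm, so that designs are compared on architecture rather than on process speed.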

V. CONCLUSION

A modified real-valued Givens rotation algorithm and its complex QR factorization design are presented in this paper. The proposed design employs both pipelined and folded CORDIC structures to reduce the hardware complexity. Through careful scheduling, the design can admit one CQRF every 8 clock cycles. Implementation results show that the proposed design, with a gate count of 103.7k, can deliver 25 M CQRFs per second in 4×4 MIMO-OFDM systems. Compared to other QR factorization works, the proposed design excels in terms of the product of area and time.

TABLE II. PERFORMANCE COMPARISON OF CHIP DESIGNS

| Design | [7] | [8] | [9] | [10] | Proposed |
|---|---|---|---|---|---|
| Algorithm | GS | GR | GR | GR | GR |
| Technology (μm) | 0.13 | 0.18 | 0.18 | 0.18 | 0.18 |
| Initiation interval (cycles) | 139 | 67 | 8 | 4 | 8 |
| Logical gates | 23.2K | 54K | 134.6K | 111K | 103.7K |
| Frequency (MHz) | 269 | 125 | 120 | 100 | 200 |
| N.T. | 1.398 | 2.591 | 15 | 25 | 25 |
| A/N.T. | 16.595 | 20.841 | 8.974 | 4.44 | 3.27 |

REFERENCES

[1] A. J. Paulraj, D. A. Gore, R. U. Nabar, and H. Bölcskei, “An overview of MIMO communications—A key to gigabit wireless,” Proc. IEEE, vol. 92, no. 2, pp. 198-218, Feb. 2004.

[2] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491-503, Mar. 2006.

[3] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, “VLSI implementation of MIMO detection using the sphere decoding algorithm,” IEEE J. Solid-State Circuits, vol. 40, no. 7, pp. 1566-1577, July 2005.

[4] A. Van Zelst, “Space division multiplexing algorithms,” in Proc. IEEE MELeCon, Nicosia, Cyprus, May 2000, pp. 1218-1221.

[5] X. Li and X. Cao, “Low complexity signal detection algorithm for MIMO-OFDM systems,” IEE Electronics Letters, vol. 41, no. 2, pp. 83-85, Jan. 2005.

[6] C. K. Singh, S. H. Prasad, and P. T. Balsara, “VLSI architecture for matrix inversion using modified Gram–Schmidt based QR decomposition,” in Proc. Int. Conf. VLSI Des., Jan. 2007, pp. 836-841.

[7] P. Salmela, A. Burian, H. Sorokin, and J. Takala, “Complex-valued QR decomposition implementation for MIMO receivers,” in Proc. IEEE Acoustics, Speech, and Signal Processing, Las Vegas, Nevada, USA, Apr. 2008, pp. 1433-1436.

[8] P. Luethi, A. Burg, S. Haene, D. Perels, N. Felber, and W. Fichtner, “VLSI Implementation of a High-Speed Iterative Sorted MMSE QR Decomposition,” in Proc. IEEE Int. Symp. Circuits Syst., New Orleans, USA, May 2007, pp. 1421-1424.

[9] Y. T. Hwang and W. D. Chen, “Design and implementation of a high-throughput fully parallel complex-valued QR factorisation chips,” IET Circuits, Devices & Systems, vol. 5, no. 5, pp. 424-432, Sept. 2011.

[10] Z. Y. Huang and P. Y. Tsai, “Efficient implementation of QR decomposition for gigabit MIMO-OFDM systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 10, pp. 2531-2542, Oct. 2011.

Figure 3. The computation schedule for the CQRF design