Chapter 3 High-Speed Communication Systems with LDPC Codes
3.2 Error-Correcting Performance of LDPC Codes in UWB System
3.2.3 Performance Comparison with Convolutional Codes
In Fig. 3.8, the performance of the LDPC codes is compared with that of the 64-state convolutional coded system proposed in [23], where two rates obtained by puncturing the R = 1/3 convolutional code are selected as references. Both LDPC codes outperform the punctured convolutional codes with only 8 iterations. The short block length and the small number of decoding iterations facilitate high-speed implementation.
Fig. 3.8 Performance comparison for different codes: (a) BER and (b) PER versus SNR [dB], for the R = 3/4 convolutional code, the R = 5/8 convolutional code, the (600,450) LDPC code with 8 iterations and the (1200,720) LDPC code with 8 iterations; the PER = 8% level is marked in (b)
Chapter 4
Architectures of Proposed LDPC Code Decoders
The architectures of the proposed LDPC code decoders for two different LDPC codes, Code I and Code II, will be introduced in this chapter. Basic functional units, data flow rescheduling and memory arrangement methods will be discussed in detail. The measurement results of the proposed LDPC code decoder chips and a comparison with the state-of-the-art designs will also be listed. The specifications of Code I and Code II are summarized in Table 4.1.
Table 4.1 Summary of the two LDPC codes
                    Code I       Code II
Block length        600          1200
Information bits    450          720
Code rate           3/4          3/5
Code structure      Irregular    Irregular
Column weight       3            3
Row weight          11~14        7~9
4.1 Introduction to the Conventional Design
Based on the decoding algorithm, the block diagram of the conventional LDPC code decoder is shown in Fig. 4.1. The bit node unit (BNU) is dedicated to the vertical step, while the check node unit (CNU) is used for the horizontal step. The BNU (or CNU) reads and processes the messages stored in the memory bank, and writes them back into the memory bank after updating. A large number of combinational feedback paths exist between the CNU (or BNU) and the memory unit, leading to complex signal routing as well as degradation of the decoding speed in the VLSI implementation.
Fig. 4.1 Block diagram of conventional LDPC code decoder (BNUs and CNUs exchange messages through a memory bank; the channel values are the decoder input)
The conventional architecture of the CNU, which is based on the LLR-SPA in (2.33), is shown in Fig. 4.2(a). Look-up tables (LUTs) are used to implement the hyperbolic tangent (tanh) and inverse hyperbolic tangent (tanh^-1) functions.
To reduce the hardware cost, the CNU can instead be implemented based on the min-sum algorithm, as shown in Fig. 4.2(b). As described in (2.38), the operations in the CNU can be divided into two parts: the sign evaluation and the minimum absolute value search. The minimum absolute values are searched by k comparators, each with k-1 inputs (CMP-(k-1)), where k is the row weight of the parity check matrix.
Fig. 4.2 Architecture of conventional CNU based on: (a) LLR-SPA, built from LUT-1 and LUT-2 blocks around a summation, and (b) min-sum algorithm, built from CMP-(k-1) comparators that produce the minima and a sign bit evaluation block
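The conventional min-sum check node update of Fig. 4.2(b) can be sketched in software as follows. This is a minimal behavioral model, not the circuit itself: the function name and the use of floating-point messages are assumptions made for illustration, and each loop pass plays the role of one CMP-(k-1) search plus the sign evaluation.

```python
def cnu_min_sum_conventional(inputs):
    """Min-sum check node update with one (k-1)-input search per output.

    `inputs` holds the k incoming bit-to-check messages of one check node.
    Output j uses every input except input j (hypothetical helper).
    """
    k = len(inputs)
    outputs = []
    for j in range(k):
        others = inputs[:j] + inputs[j + 1:]
        # Sign evaluation: product of the signs of the other k-1 inputs.
        sign = 1
        for v in others:
            if v < 0:
                sign = -sign
        # CMP-(k-1): minimum magnitude among the other k-1 inputs.
        magnitude = min(abs(v) for v in others)
        outputs.append(sign * magnitude)
    return outputs


print(cnu_min_sum_conventional([2.0, -0.5, 1.5, -3.0]))
```

The model performs k separate searches over k-1 values each, which is the cost that the k comparators of Fig. 4.2(b) represent in hardware.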
The conventional BNU architecture with k inputs is shown in Fig. 4.3, where SUM-(k-1) is used to sum up k-1 values. Note that the LLR-SPA and the min-sum algorithm share the same BNU design.
Fig. 4.3 Architecture of conventional BNU (SUM-(k-1) adders fed by the incoming messages and the channel value)
4.2 Proposed LDPC Code I Decoder Design
LDPC code decoders have inherent parallelism because the check node updates, like the bit node updates, are independent of one another; the throughput can therefore be improved by a linear increase in hardware cost. However, the fully parallel implementation [9] is not area-efficient for a system chip design. The partial-parallel architecture is therefore employed in the proposed decoders to reduce circuit complexity according to the system requirements. In time-division multiplexing mode, the partial-parallel LDPC code decoders map a certain number of check nodes or bit nodes onto a single processing unit. This introduces extra decoding latency compared with the fully parallel implementations, so a trade-off is made between the decoding speed and the hardware complexity. In addition, to reduce the hardware cost, the min-sum algorithm is chosen to implement the proposed design while maintaining the system performance.
Fig. 4.4 presents the architecture of the proposed LDPC Code I decoder, which contains the distributor, memory unit, switch groups, CNUs and BNUs. Since the irregular parity check matrix H has a fixed column weight (= 3), the total number of nonzero entries in the parity check matrix is 600 × 3 = 1800. To implement the decoder in a partial-parallel mode, the check nodes in the corresponding bipartite graph are partitioned into three parts and the bit nodes into four parts, as shown in Fig. 4.5, so that every three check nodes share a single CNU and every four bit nodes share a single BNU. Therefore 150/3 = 50 CNUs and 600/4 = 150 BNUs are required in the proposed design. The switch groups in Fig. 4.4 select which part of the check nodes or bit nodes is under operation.
Fig. 4.4 The architecture of LDPC Code I decoder
Fig. 4.5 The partition for parity check matrix H of Code I (CNU sets c1 to c3 along the rows, BNU sets b1 to b4 along the columns)
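The unit counts that follow from this partition can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name is hypothetical, and it assumes that the number of check nodes equals the block length minus the number of information bits (150 for Code I, as used in the text).

```python
def partition_code_i(block_length=600, info_bits=450, column_weight=3,
                     cnu_sets=3, bnu_sets=4):
    """Derive processing-unit counts for the partial-parallel Code I decoder."""
    # Assumption: one parity check per redundant bit (H has 150 rows).
    check_nodes = block_length - info_bits
    # Every column holds `column_weight` nonzero entries.
    edges = block_length * column_weight
    return {
        "check_nodes": check_nodes,
        "edges": edges,
        "cnus": check_nodes // cnu_sets,   # three check nodes share one CNU
        "bnus": block_length // bnu_sets,  # four bit nodes share one BNU
        "avg_row_weight": edges / check_nodes,
    }


print(partition_code_i())
```

The resulting average row weight of 12 lies inside the 11~14 range listed for Code I in Table 4.1.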
Due to the random-like connections in the bipartite graph, signal routing causes serious difficulties in the decoder implementation. As shown in Fig. 4.1, the combinational feedback paths degrade the decoding speed and add routing area overhead in the VLSI implementation. In the proposed design, pipeline registers are inserted in the CNUs and BNUs to cut off those feedback paths, as illustrated in Fig. 4.6. A shorter critical path delay and reduced routing congestion are thus achieved with only a small increase in hardware cost.
Fig. 4.6 Data path of proposed partial-parallel decoder (flip-flops split the CNU into CNU-PATH 1 and CNU-PATH 2 and the BNU into BNU-PATH 1 and BNU-PATH 2 around the memory bank)
4.2.1 Channel Value Interconnection
For the conventional design in Fig. 4.1, both the CNUs and the BNUs have to be connected to the channel values, which leads to a large number of signal connections. The data rescheduling shown in Fig. 4.7 is proposed to solve this problem.
Fig. 4.7 Proposed LDPC decoding flow
As shown in Fig. 4.7, one extra vertical step is employed to replace the initialization through the CNUs. Recall equation (2.34),
$$L_{B\to C}(e_{ij}) = L(x_i) + \sum_{j' \neq j} L_{C\to B}(e_{ij'}), \tag{2.34}$$
in which only summations of the channel value $L(x_i)$ and the messages $L_{C\to B}(e_{ij})$ are performed in the BNUs. If the messages $L_{C\to B}(e_{ij})$ are set to zero during initialization, the channel values are loaded into the memory through the BNUs and fed to the CNUs for the first horizontal step. In this scheme, only the BNUs have to be connected to the channel values, as illustrated in Fig. 4.4, which lowers the signal routing cost at the expense of some additional decoding latency.
Fig. 4.8 gives the timing diagram of the proposed LDPC Code I decoder, where bi and ci correspond to the active BNU and CNU sets in Fig. 4.5. The design takes nine cycles to complete a decoding iteration: 4 cycles for the horizontal step with the CNUs and 5 cycles for the vertical step with the BNUs. Five additional cycles are used to complete the channel value loading described above. Thus a total of 9 × 8 + 5 = 77 cycles is required to decode a codeword with 8 iterations.
Fig. 4.8 Timing diagram of the proposed LDPC Code I decoder (time axis: channel value loading, then iteration #1)
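The cycle budget stated above can be written as a small helper, which makes it easy to see how the latency scales with the number of iterations. The function and parameter names are hypothetical; the default values are the ones given in the text.

```python
def decoding_cycles(iterations, cnu_cycles=4, bnu_cycles=5, load_cycles=5):
    """Total clock cycles to decode one codeword in the Code I decoder."""
    # One iteration = horizontal step (CNUs) + vertical step (BNUs).
    cycles_per_iteration = cnu_cycles + bnu_cycles
    # Channel value loading happens once, before the first iteration.
    return load_cycles + iterations * cycles_per_iteration


print(decoding_cycles(8))
```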
4.2.2 Check Node Unit
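As shown in Fig. 4.2(b), k comparators, each searching for the minimal value among k-1 inputs, are needed to implement the CNU based on the min-sum algorithm. As mentioned in [18], equation (2.38) can be modified as
$$L_{C\to B}(e_{ij}) = \Big(\prod_{i' \neq i} \operatorname{sgn}\big(L_{B\to C}(e_{i'j})\big)\Big) \times \begin{cases} \text{2nd min}, & \text{if } |L_{B\to C}(e_{ij})| = \min \\ \min, & \text{otherwise} \end{cases} \tag{4.1}$$
where “min” is the smallest magnitude among all the inputs of the check node and “2nd min” denotes the value which is smaller than all the other candidates except the minimal one. According to (4.1), the absolute value search has to be performed only once to find the minimum and the second minimum. Fig. 4.9 shows the block diagram of the compare-select unit (CS14), which searches for the minimal and the second minimal values among 14 inputs.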
Fig. 4.9 Block diagram of CS14 (built around the CMP-14 comparator)
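The idea behind the compare-select unit can be illustrated with a short behavioral sketch: the minimum and the second minimum are found once, and every outgoing message reuses them. The function name and the floating-point messages are assumptions for illustration; the hardware works on 5-bit magnitudes.

```python
def cnu_min_sum_two_min(inputs):
    """Min-sum check node update using a single min / 2nd-min search."""
    mags = [abs(v) for v in inputs]
    # One search over all k magnitudes (the role of the CS unit).
    min_idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min_idx]
    min2 = min(m for i, m in enumerate(mags) if i != min_idx)
    # Product of all k signs; each output then removes its own sign.
    total_sign = 1
    for v in inputs:
        if v < 0:
            total_sign = -total_sign
    outputs = []
    for j, v in enumerate(inputs):
        own_sign = -1 if v < 0 else 1
        # The edge that supplied the minimum receives the 2nd minimum.
        magnitude = min2 if j == min_idx else min1
        outputs.append(total_sign * own_sign * magnitude)
    return outputs


print(cnu_min_sum_two_min([2.0, -0.5, 1.5, -3.0]))
```

For any input list this produces the same messages as k independent (k-1)-input searches, which is why the single search is sufficient.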
Because the row weight of Code I ranges from 11 to 14, CNUs handling different numbers of inputs have to be designed. In this section, only the 14-input CNU is introduced; the others are designed in an analogous way. The detailed architecture of CMP-14 in Fig. 4.9 is illustrated in Fig. 4.10. It consists of pipeline registers and two kinds of comparators: CMP-2 and CMP-4. CMP-4 finds the minimal and the second minimal values among four inputs a, b, c and d, whereas CMP-2 is a two-input comparator that is much simpler than CMP-4.
Fig. 4.10(a) Block diagram of proposed CMP-4 (six subtractors (SUB) compare the 5-bit input pairs a-b, a-c, a-d, b-c, b-d and c-d; their sign bits MSB0 to MSB5 drive a decoder that selects the 5-bit min and 2nd min)
Fig. 4.10(b) Block diagram of proposed CMP-14
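One way to picture the CMP-4 operation is the following sketch, in which six pairwise subtractions produce six sign bits and a simple decoding rule picks the two smallest inputs. The decoding rule used here (counting how many comparisons each input wins) is an assumption chosen for clarity; the actual decoder logic of Fig. 4.10(a) is not specified in the text and may differ.

```python
from itertools import combinations


def cmp4(a, b, c, d):
    """Return (min, 2nd min) of four magnitudes from six pairwise sign bits."""
    x = (a, b, c, d)
    wins = [0, 0, 0, 0]
    # Pairs are visited in the order (a,b), (a,c), (a,d), (b,c), (b,d), (c,d).
    for i, j in combinations(range(4), 2):
        msb = 1 if x[i] - x[j] < 0 else 0  # sign bit of the subtractor output
        if msb:
            wins[i] += 1  # x[i] is strictly smaller
        else:
            wins[j] += 1  # x[j] is smaller or equal
    # Assumed decoder: the input winning 3 comparisons is the minimum,
    # the one winning 2 comparisons is the second minimum.
    return x[wins.index(3)], x[wins.index(2)]


print(cmp4(5, 9, 2, 7))
```

Because ties are always resolved in favor of the later input, the win counts are always a permutation of 0, 1, 2 and 3, so the two lookups never fail.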
The proposed architecture of the 14-input CNU is shown in Fig. 4.11, where SM14 denotes the sign-multiplication unit. To facilitate the operations on the sign and the absolute value, all the 6-bit values are represented in sign-magnitude notation with 2 integer bits and 4 fractional bits. The pipeline registers split the combinational path in the CNUs into CNU-PATH1 and CNU-PATH2, leading to a shorter critical path delay and reduced routing congestion.
Fig. 4.11 The proposed 14-input CNU architecture
Table 4.2 compares three different CNU architectures. The LUT-1 and LUT-2 blocks in Fig. 4.2(a) are implemented in 6-bit precision, with 2 integer bits and 4 fractional bits. The proposed CNU is the smallest, at only about 22% to 23% of the gate count of the others, whereas its maximum achievable operating speed is only slightly lower than that of the conventional MS design. The fixed-point implementation causes some performance loss. Overall, the proposed CNU architecture allows the decoder to be implemented efficiently.
Table 4.2 Comparison of different CNU architectures
                    LUT            Conv. MS       Proposed
                    Fig. 4.2(a)    Fig. 4.2(b)    Fig. 4.11
Max. speed          162 MHz        261 MHz        250 MHz
Gate count          7.16 K         6.86 K         1.6 K
Total gate count    358 K          343 K          80 K
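The totals in Table 4.2 are consistent with the per-unit gate counts multiplied by the 50 CNUs of the Code I decoder. The short check below assumes exactly that relationship (it is inferred from the numbers, not stated in the table) and also reproduces the area ratio quoted in the text.

```python
NUM_CNUS = 50  # number of CNUs in the Code I decoder (Sec. 4.2)

# Per-CNU gate counts from Table 4.2, in thousands of gates.
gate_count_k = {"LUT": 7.16, "Conv. MS": 6.86, "Proposed": 1.6}

# Assumed relationship: total = per-unit gate count x number of CNUs.
total_k = {name: round(g * NUM_CNUS, 1) for name, g in gate_count_k.items()}

# Size of the proposed CNU relative to the two conventional designs.
ratio_vs_lut = gate_count_k["Proposed"] / gate_count_k["LUT"]
ratio_vs_ms = gate_count_k["Proposed"] / gate_count_k["Conv. MS"]

print(total_k, round(ratio_vs_lut, 3), round(ratio_vs_ms, 3))
```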
4.2.3 Bit Node Unit
Fig. 4.12 shows the block diagram of the BNU. According to equations (2.34) and (2.35), the BNUs receive the channel value and the message values linked to the same bit node. All inputs in sign-magnitude (SM) notation are converted to 2’s complement (TC) representation and summed to perform the update. Pipeline registers are inserted to break the critical path into BNU-PATH1 and BNU-PATH2, as in the CNUs. Finally, all the values are converted back to SM notation and clipped to avoid overflow, and the most significant bit (MSB) of the sum of the three input messages and the channel value is used to decide the estimated codeword bit.
All the 6-bit values are quantized with 2 integer bits and 4 fractional bits, while the intermediate sums are represented with 4 integer bits and 4 fractional bits.
Fig. 4.12 The proposed BNU architecture
Note that if C1, C2 and C3 are set to be zero during initialization, the channel value will be directly bypassed to the outputs of BNU. This produces a path to load the channel values into the memory as mentioned above.
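The BNU behavior, including the bypass of the channel value when C1, C2 and C3 are zero, can be sketched as follows. This is a behavioral model under stated assumptions: messages are (sign, magnitude) pairs, Python integers stand in for the 2's complement adders, the magnitude is assumed to be 5 bits wide (clipping at 31), and the hard-decision bit is taken to be the sign bit of the total sum.

```python
MAX_MAG = 31  # assumption: 5-bit magnitude inside a 6-bit sign-magnitude word


def sm_to_int(sign, mag):
    """Sign-magnitude to signed integer (stands in for the SM-to-TC step)."""
    return -mag if sign else mag


def int_to_sm_clipped(value):
    """Signed integer back to sign-magnitude, clipped to avoid overflow."""
    return (1 if value < 0 else 0, min(abs(value), MAX_MAG))


def bnu(channel, c1, c2, c3):
    """Bit node update for column weight 3; returns (outputs, hard_bit)."""
    ch = sm_to_int(*channel)
    msgs = [sm_to_int(*m) for m in (c1, c2, c3)]
    total = ch + sum(msgs)
    # Each outgoing message excludes the message that arrived on its own edge.
    outputs = [int_to_sm_clipped(total - m) for m in msgs]
    hard_bit = 1 if total < 0 else 0  # MSB (sign) of the total sum
    return outputs, hard_bit


zero = (0, 0)
print(bnu((1, 10), zero, zero, zero))
```

With all three check-to-bit messages set to zero, every output equals the channel value, which is the loading path used during initialization.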
4.2.4 Chip Implementation
The proposed LDPC Code I decoder was implemented within an LDPC-COFDM UWB baseband transceiver chip [25] in a 0.18 µm 1P6M standard CMOS process. The chip micrograph of the entire UWB transceiver, including the OFDM modem and the LDPC codec, is given in Fig. 4.13. The encoder die size is 2.25 mm^2, while the decoder die size is 16.5 mm^2. The total gate count of the LDPC codec is 542 K, of which 70 K is for the encoder and 472 K is for the decoder.
The chip has been tested to verify its functional correctness. The measured maximum data rate of the decoder is 480 Mb/s when working at 82.1 MHz, with a power consumption of 232 mW.
The detailed chip features are summarized in Table 4.3.
Fig. 4.13 Die micrograph of the LDPC-COFDM UWB transceiver chip (labeled blocks: OFDM modem, LDPC encoder, LDPC decoder)
Table 4.3 Summary of the LDPC Code I Chip
Technology           Standard 0.18-µm CMOS 1P6M
Package              CQFP-208
Supply voltage       1.8 V core, 3.3 V I/O
Chip size            Encoder: 1.5 mm × 1.5 mm
                     Decoder: 5.0 mm × 3.5 mm
Gate count           Encoder: 70 K
                     Decoder: 472 K
Power dissipation    232 mW @ 82.1 MHz
Maximum data rate    480 Mb/s
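The reported data rate can be related to the cycle count of Sec. 4.2.1 with a short calculation. The sketch assumes that one codeword is decoded every 77 cycles without overlap and that the data rate counts information bits only; both are assumptions, although the result is close to the 480 Mb/s listed in Table 4.3.

```python
def info_throughput_mbps(info_bits=450, clock_mhz=82.1, iterations=8,
                         cycles_per_iteration=9, load_cycles=5):
    """Information throughput in Mb/s for one codeword per decoding pass."""
    cycles = iterations * cycles_per_iteration + load_cycles
    # bits per cycle x cycles per microsecond = Mb/s
    return info_bits * clock_mhz / cycles


print(round(info_throughput_mbps(), 1))
```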
4.3 Proposed LDPC Code II Decoder Design
In Sec. 4.2, the proposed LDPC Code I decoder design was introduced and silicon-proven to achieve a maximum data rate of 480 Mb/s. The performance of the LDPC Code I decoder is acceptable for the MB-OFDM UWB system [23], but may not be sufficient for the other high-speed communication systems mentioned in Chap. 3. The LDPC Code II decoder is therefore proposed to obtain better error-correcting ability and higher decoding throughput.
To limit the circuit complexity, the 480 × 1200 parity check matrix H of LDPC Code II is divided into four 240 × 600 sub-matrices to fit the partial-parallel architecture, as shown in Fig. 4.14. Since the matrix H of Code II has a fixed column weight (= 3), the total number of nonzero entries is 1200 × 3 = 3600. Based on this partition, the functional units in the decoder process 1800 messages every cycle.
Fig. 4.14 The partition of parity check matrix H of Code II (H = [h00 h01; h10 h11]; CNU Set 1 and CNU Set 2 correspond to the two row blocks, BNU Set 1 and BNU Set 2 to the two column blocks)
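The figures quoted for this partition follow from simple arithmetic, sketched below with hypothetical names. For the BNU sets the count of 1800 messages per cycle is exact, since each column block holds 600 columns of weight 3; for the CNU sets it additionally assumes that the nonzero entries are split evenly between the two row blocks.

```python
def partition_code_ii(rows=480, cols=1200, column_weight=3,
                      row_blocks=2, col_blocks=2):
    """Unit counts and per-cycle message load for the Code II decoder."""
    edges = cols * column_weight
    return {
        "sub_matrix": (rows // row_blocks, cols // col_blocks),
        "cnus": rows // row_blocks,   # one CNU per row of a row block
        "bnus": cols // col_blocks,   # one BNU per column of a column block
        "edges": edges,
        # Messages handled per cycle when one BNU set is active.
        "messages_per_cycle": (cols // col_blocks) * column_weight,
    }


print(partition_code_ii())
```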
The proposed LDPC Code II decoder architecture illustrated in Fig. 4.15 contains the input buffer, 240 CNUs, 600 BNUs and two dedicated message memory units (MMUs). The sets of data processed by the CNUs are {h00, h01} and {h10, h11}, whereas the sets of data fed into the BNUs are {h00, h10} and {h01, h11}. Note that two MMUs are employed to process two different codewords concurrently without stalls. The LDPC decoder is therefore not only area-efficient, but also achieves a decoding speed comparable to that of the fully parallel architecture. The design of the MMUs is detailed in the following sections.
The input buffer is a storage component that receives and keeps the channel values for iterative decoding. Note that it connects only to the BNUs in order to reduce routing congestion, as discussed in Sec. 4.2.1.
Fig. 4.15 The proposed LDPC code II decoder architecture
4.3.1 Input Buffer
The input buffer provides the channel values to the BNUs for iterative decoding. Because two different codewords are processed concurrently, a total of 1200 × 2 = 2400 symbols have to be stored in the input buffer. According to the partition in Fig. 4.14, the buffer is divided into four sub-blocks, each containing 600 channel values. The conventional design is illustrated in Fig. 4.16. The four sub-blocks, buf-0 ~ buf-3, are all connected to the channel value inputs, and multiplexers are employed to switch the appropriate values into the BNUs. The signal routings are therefore all “global”, meaning that all the connections are related to the inputs and outputs (I/O) of the buffer. The global connections and the multiplexers lead to serious routing congestion.
Fig. 4.16 The conventional architecture of input buffer (buf-0 to buf-3 all connected to the channel value inputs and multiplexed to the BNUs)
Fig. 4.17 shows the buffer structure based on the register exchange (RE) approach and its operational timing diagram. Here buf-0 is designed as a shift register that serially receives the channel values from the inputs, and the other three sub-blocks exchange data with buf-0 sequentially. The notations E1, E2 and E3 denote the data exchange from buf-0 to buf-1, buf-2 and buf-3, respectively. During initialization, buf-0 serially receives the channel values and, each time it is full, passes them to the other sub-blocks by executing the operations E1, E2 and E3.
Fig. 4.17(a) The architecture of RE based input buffer (buf-0 receives the channel value inputs and exchanges data with buf-1, buf-2 and buf-3 through E1, E2 and E3)
Fig. 4.17(b) The timing diagram of RE based input buffer (channel value sub-blocks C00, C01, C10 and C11 of codeword 0 and codeword 1)
For this RE based buffer architecture, the global interconnections exist only in buf-0, and all the others are “local” among sub-blocks. However, the drawback is that a large number of multiplexers are required around buf-0 to perform E1 ~ E3. Thus buf-0 becomes a routing-critical block due to the multiplexers and the global interconnections.
To overcome this problem, an architecture based on register shifting (RS) is proposed, as shown in Fig. 4.18(a), where the four sub-blocks are arranged in a ring. Here buf-0 is a shift register that serially receives the channel values, and buf-3 delivers the associated channel values to the BNUs. The timing diagram of the RS-based input buffer is presented in Fig. 4.18(b). Channel values of two different codewords are serially fed into buf-0 and shifted within the buffer ring whenever buf-0 is full. The data flow is thus further simplified and the multiplexers are eliminated, leading to simple signal transfer and routing interconnections.
Fig. 4.18 The architecture and timing diagram of RS-based input buffer
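The register-shifting idea can be pictured with a toy software model in which the four sub-blocks form a ring and data only ever moves to the neighboring sub-block. The model below is a sketch, not the circuit of Fig. 4.18: the shift direction, the moment at which the ring shifts, and the function name are all assumptions made for illustration.

```python
from collections import deque


def simulate_ring_buffer(values, block_size, num_blocks=4):
    """Load values through buf-0 and list what buf-3 presents to the BNUs."""
    assert len(values) == block_size * num_blocks
    ring = deque([None] * num_blocks)  # ring[0] is buf-0, ring[-1] is buf-3
    for n in range(num_blocks):
        # buf-0 is filled serially with the next `block_size` channel values.
        ring[0] = list(values[n * block_size:(n + 1) * block_size])
        if n < num_blocks - 1:
            ring.rotate(1)  # every sub-block passes its data to its neighbor
    presented = []
    for _ in range(num_blocks):
        presented.append(ring[-1])  # buf-3 is the only output to the BNUs
        ring.rotate(1)              # next sub-block moves into buf-3
    return presented


print(simulate_ring_buffer(list(range(8)), block_size=2))
```

In this model the only operations are a serial fill of one sub-block and a neighbor-to-neighbor shift, which reflects why no multiplexers are needed around buf-0.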
Fig. 4.19 compares the three input buffer architectures. The RS-based input buffer saves about 20% of the gate count and 30% of the interconnection wires compared with the conventional design.
                               Conventional    RE       RS
Gate count                     83825           81830    67855
Number of interconnections     30000           24000    21000
Fig. 4.19 The comparison of three input buffer designs
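The savings quoted for the RS-based buffer can be recomputed from the values in Fig. 4.19. The helper below is a small illustrative calculation; the dictionary layout and function name are hypothetical.

```python
# (gate count, number of interconnections) as read from Fig. 4.19.
designs = {
    "Conventional": (83825, 30000),
    "RE": (81830, 24000),
    "RS": (67855, 21000),
}


def saving(design, baseline="Conventional"):
    """Fractional saving in (gate count, interconnections) versus baseline."""
    base_gates, base_wires = designs[baseline]
    gates, wires = designs[design]
    return ((base_gates - gates) / base_gates,
            (base_wires - wires) / base_wires)


for name in ("RE", "RS"):
    gate_saving, wire_saving = saving(name)
    print(name, round(gate_saving * 100, 1), round(wire_saving * 100, 1))
```

The RS design comes out at roughly 19% fewer gates and 30% fewer interconnections than the conventional buffer, in line with the rounded figures in the text.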
4.3.2 Check Node Unit and Bit Node Unit
Fig. 4.20 shows the CNU architecture for the proposed LDPC Code II decoder. The CNU can be divided into two parts: a 1-bit sign-multiplication (SM) unit and a 5-bit compare-and-select (CS) unit that searches for the minimal and the second minimal values among the inputs. The new message for each bit node combines the sign bit according to (4.1) with the new magnitude, which is either the “min” or the “2nd min” output of the CS unit. The detailed architecture of CMP-9 in Fig. 4.20 follows the design shown in Figs. 4.9 and 4.10.
The BNU architecture is illustrated in Fig. 4.21. According to (2.34) and (2.35), the BNU receives the channel value and the messages linked to the same bit node. All inputs in sign-magnitude (SM) notation are first converted to 2’s complement (TC) representation and then summed to perform the update. The summed values are also clipped to avoid overflow. Finally, the MSB of the sum of all the inputs is used to decide the estimated codeword bit.
Fig. 4.20 CNU architecture of proposed LDPC Code II decoder
Fig. 4.21 BNU architecture of proposed LDPC Code II decoder
4.3.3 Message Memory Unit
Message memory units (MMUs) are used to store the message values generated by the CNUs or BNUs. The size of each MMU is 3600 × 6 bits, determined by the number of nonzero entries of the parity check matrix. To increase the decoding throughput, two MMUs are employed to process two different codewords concurrently in the decoder. The memory management strategies described below are based on either multiplexers (MUX) or register exchange (RE), and result in different levels of routing complexity. The MUX-based MMU architecture and its timing diagram are illustrated in Fig. 4.22.
Fig. 4.22 The architecture and timing diagram of MUX-based MMU (sub-blocks A to D; the timing diagram shows the h00, h01, h10 and h11 message blocks of codeword-0 and codeword-1 in MMU-0 and MMU-1 over iterations #i and #(i+1))
According to the partition of the matrix H in Fig. 4.14, the MMU is divided into four sub-blocks: A, B, C and D. Many multiplexers are required at the inputs and outputs because of the partially parallel implementation and the concurrent processing of two different codewords. Moreover, all the signal interconnections are related to the I/O, leading to global routing. As a result, serious routing congestion occurs in the conventional MMU design.
To relieve the routing congestion, an architecture based on register exchange among four sub-blocks (RE-4B) is proposed, as shown in Fig. 4.23. In this design, only sub-blocks B, C and D capture data from the data paths, and only sub-blocks A and C are connected to the outputs. Most of the global routings are thus transformed into local interconnections between sub-blocks, leading to a simple data flow. Moreover, the RE-4B based architecture also reduces the number of multiplexers.
Fig. 4.23(a) The architecture of RE-4B based MMU (sub-blocks A, B, C and D connected to the data paths)
Fig. 4.23(b) The timing diagram of RE-4B based MMU (codeword-0 and codeword-1 in MMU-0 over iterations #i and #(i+1))
To further improve the MMU design, a register exchange scheme based on five sub-blocks (RE-5B) is proposed, as shown in Fig. 4.24(a). One extra sub-block E is used as temporary memory to reduce the interconnections between the other sub-blocks. In MMU-1,