Proposed Techniques for LDPC Convolutional Codes
4.2 Node-Level Optimization
4.2.1 Folding Architecture
In the original pipeline decoder architecture, a number of processors are concatenated to decode different regions of the Tanner graph simultaneously, so the decoding is parallel in the iteration dimension. Assume the decoder operates at a clock frequency of fclk MHz. Since the decoder can decode only one bit per cycle, the information throughput is limited to fclk Mb/s. For high-speed applications, parallelization of the LDPC convolutional code encoder and decoder is therefore desirable. In the literature, the concept of node-level parallelization is proposed in [21]. However, that decoder architecture requires shuffle networks to overcome the problem of memory misalignment. Furthermore, a parallelism of over a hundred is necessary to achieve a decoding throughput of 1 Gbps.
In order to provide a solution with lower complexity, we propose the folding technique for node-level parallelization to design high-throughput LDPC convolutional code encoders and decoders. The idea of the folding technique is to look ahead at the bits that are going to participate in the encoding or decoding operations. In a parallel encoder using the folding technique, ρ information bits can be encoded at the same time instant, where ρ is defined as the folding factor or parallelization factor. The lengths of the delay lines in the encoder are folded by a factor of ρ, and the XOR gates for the encoding operation have to be duplicated into ρ units. Although a similar parallel encoder architecture has been proposed in [22], that architecture requires large multiplexers for phase selection. In our folding technique, the parallelism is chosen as a multiple of the time period. This transforms the original time-varying encoding connections into fixed, time-invariant connections. Thus, multiplexers are no longer required in our encoder architecture, which is the major difference from [22].
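The role of the period in removing the multiplexers can be illustrated with a small sketch. It assumes a simple indexing model that is not spelled out in the text: lane i of block n handles time instant t = n·ρ + i, and the phase of a periodically time-varying code at time t is t mod T. The function names are hypothetical and chosen only for illustration.

```python
def lane_phases(rho, period, num_blocks=8):
    """Return, for each of the rho parallel lanes, the set of phases it sees.

    Assumed model: lane i of block n handles time instant t = n * rho + i,
    and the code phase at time t is t mod period.
    """
    phases = [set() for _ in range(rho)]
    for n in range(num_blocks):
        for i in range(rho):
            t = n * rho + i
            phases[i].add(t % period)
    return phases


def needs_phase_mux(rho, period, num_blocks=8):
    """True if some lane sees more than one phase, so its connections vary in time."""
    return any(len(p) > 1 for p in lane_phases(rho, period, num_blocks))


if __name__ == "__main__":
    # Period 3, as for the time-varying code discussed in this section.
    for rho in (2, 3, 4, 6, 12):
        print(rho, needs_phase_mux(rho, period=3))
```

Under this model, every lane sees a single fixed phase exactly when ρ is a multiple of the period, which is consistent with the claim that the connections become time-invariant.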
Figure 4.14 shows the parallel decoder architecture using the folding technique. It can be seen that each FIFO delay line in the conventional processor is folded into ρ FIFO delay lines. In other words, each FIFO delay line is segmented by a factor of ρ to support the required bandwidth. With this modified FIFO structure, sufficient input data could be provided
shift register of length 16 is replaced by 3 shift registers of length 6. Namely, the decoding delay is reduced from 16 clock cycles to 6 clock cycles for a unit processor. Also, both the check node units and the variable node units are duplicated into ρ units. Using this approach, the decoder throughput becomes (ρ × fclk) Mb/s. However, the maximum value of the folding factor is restricted by the code structure. Generally, LDPC convolutional codes with larger constraint lengths and careful code construction allow higher folding factors. Note that the folding technique primarily duplicates the combinational logic, while the sequential circuits increase only slightly. The folding technique thus not only increases the decoder throughput but also reduces the decoding delay by a factor of approximately ρ.
Figure 4.14: Folding technique (only the information part is shown).
Based on the conventional decoder architecture, Table 4.1 compares the storage requirements and decoding latency of a unit processor for different folding factors. The folding technique not only multiplies the throughput by ρ but also significantly reduces the decoding latency. Moreover, it only slightly increases the hardware cost. In particular, the overhead of duplicating the check node units and variable node units is minor compared to the overall cost of a processor. We apply the folding technique to the time-varying (491, 3, 6) LDPC convolutional code with a period of 3. Since the shortest distance between adjacent check node access positions is 70 − 56 = 14, the maximum folding factor of this code, chosen as a multiple of the period, is 12. Therefore, the decoding throughput increases by a factor of 12, while the decoding delay of a single processor is reduced from 493 clock cycles to 43 clock cycles.
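The choice of 12 can be reproduced with a short calculation. The sketch below assumes, as one plausible reading of the text, that the folding factor must not exceed the shortest access distance and must be a multiple of the period; the function name is hypothetical.

```python
def max_folding_factor(min_access_gap, period):
    """Largest multiple of `period` that does not exceed `min_access_gap`.

    Assumption: the folding factor is bounded by the shortest distance between
    adjacent check node access positions and is restricted to multiples of the
    period. A result of 0 means no folding factor satisfies both constraints.
    """
    return (min_access_gap // period) * period


if __name__ == "__main__":
    gap = 70 - 56  # shortest distance quoted in the text
    print(max_folding_factor(gap, period=3))
```

For the (491, 3, 6) code with period 3, this gives a folding factor of 12, matching the value used in Table 4.1.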
Table 4.1: Comparison of hardware cost and decoding latency with different folding factors based on the conventional decoder architecture.
Folding factor ρ                                    1        3           12
Required bits for storage                           23664    23904       24768
Number of CNUs                                      1        3           12
Number of VNUs                                      1        3           12
Throughput (Mb/s)                                   fclk     3 × fclk    12 × fclk
Decoding latency for a unit processor (cycles)      493      166         43
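The entries of Table 4.1 follow a regular pattern that can be checked numerically. The sketch below is not taken from the source: the constants 48 (bits per delay stage per lane) and 2 (extra cycles beyond the folded memory) are assumptions inferred purely from the table values, and the function name is hypothetical.

```python
MEMORY = 491          # syndrome former memory of the (491, 3, 6) code
EXTRA_CYCLES = 2      # assumption: inferred from 493 - 491 in Table 4.1
BITS_PER_STAGE = 48   # assumption: inferred from 23664 / 493 in Table 4.1


def unit_processor_cost(rho, memory=MEMORY):
    """Estimate the Table 4.1 entries for one processor with folding factor rho."""
    # Each delay line is folded into rho lanes of ceil(memory / rho) stages.
    latency = -(-memory // rho) + EXTRA_CYCLES
    # rho lanes, each `latency` stages deep, BITS_PER_STAGE bits per stage.
    storage_bits = latency * rho * BITS_PER_STAGE
    return {
        "rho": rho,
        "cnus": rho,                  # one check node unit per lane
        "vnus": rho,                  # one variable node unit per lane
        "throughput_x_fclk": rho,     # throughput in multiples of fclk
        "latency_cycles": latency,
        "storage_bits": storage_bits,
    }


if __name__ == "__main__":
    for rho in (1, 3, 12):
        print(unit_processor_cost(rho))
```

Under these assumptions, storage grows only through the rounding of the folded delay-line length, which is why the table shows a storage increase of under 5% at ρ = 12 while the latency drops by more than a factor of 11.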
In addition, the concept of parallelization can be described mathematically. Figure 4.15 shows the parity-check matrix of the (14, 3, 6) LDPC convolutional code given in (3.13). With folding factor ρ = 3, every 3 rows of the parity-check matrix can be grouped to form a rate R = 3/6 LDPC convolutional code with syndrome former memory ms = 5.
Figure 4.15: Parity-check matrix of (14, 3, 6) LDPC convolutional code.
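The parameters of the folded code can be estimated with a short calculation. The sketch below assumes that the folded syndrome former memory is ceil(ms / ρ) and that a rate-1/2 code folds to rate ρ/(2ρ); both relations are inferred from the examples quoted in this section, namely (14, 3, 6) with ρ = 3, and (491, 3, 6) with ρ = 3 and ρ = 12, rather than stated as general formulas in the text. The function name is hypothetical.

```python
def folded_code(memory, rho):
    """Estimate the parameters of a rate-1/2 code after folding by rho.

    Assumptions (inferred from the examples in the text, not stated there):
      * folded syndrome former memory = ceil(memory / rho)
      * rho information bits and 2 * rho code bits per folded time instant
    """
    folded_memory = -(-memory // rho)  # integer ceiling division
    return {"memory": folded_memory, "rate": (rho, 2 * rho)}


if __name__ == "__main__":
    print(folded_code(14, 3))    # (14, 3, 6) code with rho = 3
    print(folded_code(491, 3))   # (491, 3, 6) code with rho = 3
    print(folded_code(491, 12))  # (491, 3, 6) code with rho = 12
```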
From the graph illustration in Figure 4.15, we can see that the polynomial parity-check matrix becomes
The same result can be derived from the parity-check polynomials of the LDPC convolutional code. Using the parity-check polynomial representation, the parity-check matrix can be described as
\begin{align}
(D^{2} + D^{7} + D^{13})\,u_0(D) + (D^{2} + D^{9} + D^{16})\,v_0(D) &= 0 \tag{4.2a}\\
(D + D^{6} + D^{12})\,u_1(D) + (D + D^{8} + D^{15})\,v_1(D) &= 0 \tag{4.2b}\\
(1 + D^{5} + D^{11})\,u_2(D) + (1 + D^{7} + D^{14})\,v_2(D) &= 0. \tag{4.2c}
\end{align}
Letting $X = D^{3}$, we can rewrite these equations as
\begin{align}
(D^{2} + X^{2} D + X^{4} D)\,u_0(D) + (D^{2} + X^{3} + X^{5} D)\,v_0(D) &= 0 \tag{4.3a}\\
(D + X^{2} + X^{4})\,u_1(D) + (D + X^{2} D^{2} + X^{5})\,v_1(D) &= 0 \tag{4.3b}\\
(1 + X D^{2} + X^{3} D^{2})\,u_2(D) + (1 + X^{2} D + X^{4} D^{2})\,v_2(D) &= 0. \tag{4.3c}
\end{align}
The resulting polynomial parity-check matrix, given in (4.4), is the same as (4.1) if X is replaced by D.
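The substitution X = D^3 amounts to writing each exponent e as e = 3q + r with 0 ≤ r < 3, so that D^e = X^q · D^r. The following sketch checks this decomposition for the exponents appearing in (4.2) and (4.3); the function names are hypothetical, and the exponents of the third equation are recovered from (4.3c) by the same substitution.

```python
def fold_exponent(e, rho=3):
    """Split D^e into X^q * D^r with X = D^rho, i.e. e = q * rho + r, 0 <= r < rho."""
    return divmod(e, rho)


def fold_polynomial(exponents, rho=3):
    """Apply fold_exponent to every term of a polynomial given by its exponents."""
    return [fold_exponent(e, rho) for e in exponents]


# Exponents of the u and v polynomials in each parity-check equation.
EQUATIONS = {
    "a": ([2, 7, 13], [2, 9, 16]),
    "b": ([1, 6, 12], [1, 8, 15]),
    "c": ([0, 5, 11], [0, 7, 14]),  # recovered from (4.3c) with X = D^3
}

if __name__ == "__main__":
    for label, (u_exp, v_exp) in EQUATIONS.items():
        # Each pair (q, r) stands for the term X^q * D^r.
        print(label, fold_polynomial(u_exp), fold_polynomial(v_exp))
```

Each pair (q, r) corresponds to a term X^q · D^r in (4.3). The largest power of X that appears is 5, in agreement with the syndrome former memory ms = 5 of the folded code.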
We apply this procedure to the time-varying (491, 3, 6) LDPC convolutional code with a period of 3. With folding factor ρ = 3, the corresponding parity-check polynomials are listed in (4.5). We thus obtain a rate R = 3/6 time-invariant (164, 3, 6) LDPC convolutional code, whose polynomial parity-check matrix is shown in (4.7). The columns of the parity-check matrix are rearranged to ensure systematic encoding. If the folding factor is chosen as a multiple of the time period, the folding technique transforms the original time-varying code into a time-invariant code. Thus, the multiplexers for configuring the time-varying connections are saved. Moreover, if the time-invariant code has quasi-cyclic symmetries, the encoder complexity of tail-biting LDPC convolutional codes may be reduced.
We simulate the performance of a family of LDPC convolutional codes derived from the (491, 3, 6) LDPC convolutional code with folding factors 3, 6, 9 and 12. The BER performance of these codes is shown in Figure 4.16. These codes perform very similarly even though their syndrome former memories vary greatly. The parity-check matrix of the R = 12/24 (41, 3, 6) LDPC convolutional code is also given in (4.8).
H(D) =
Figure 4.16: Performance of the (491, 3, 6) LDPC convolutional code and the associated LDPC convolutional codes with different folding factors.