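National Science Council, Executive Yuan: Subsidized Research Project ■ Final Report
□ Interim Progress Report
(Project title) A Low-Power Baseband Processor for Wireless Communication System
Project type: ■ Individual project □ Integrated project  Project number: NSC97-2220-E-009-166-MY3
Project period: August 1, 2008 to July 31, 2011  Host institution and department: Department of Electronics Engineering, National Chiao Tung University
Principal investigator: Prof. 李鎮宜
Project participants: 陳志龍, 林義閔, 李欣儒, 林佳龍, 許智翔
Report type (as required by the approved budget list): □ Condensed report ■ Full report  In addition to this final report, the following overseas trip report must also be submitted:
□ Report on an overseas business trip or study visit
□ Report on a business trip or study visit to mainland China
■ Report on attending an international academic conference
□ Overseas research report for an international cooperative research project
Handling: except for controlled projects and the cases below, the report may be made publicly searchable immediately
□ Involves patents or other intellectual property rights; publicly searchable after □ one year □ two years
October 31, 2011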
Table of Contents
I. Abstract
Chinese Abstract
English Abstract
II. Project Background and Objectives
A. LDPC Decoder
1. A 11.5-Gbps LDPC Decoder Based on CP-PEG Code Construction
2. A 2.37-Gb/s 284.8mW Rate-Compatible (491,3,6) LDPC-CC Decoder
B. BCH and RS Decoder
1. Soft BCH Decoder Chip for DVB-S2 System
2. An Improved Soft BCH Decoder with One Extra Error Compensation
3. Soft RS Decoder Chip for Optical Communication System
C. Viterbi Decoder
1. A Low-Power Viterbi Decoder Based on Scarce State Transition and Variable Truncation Length
2. A Low Power Differential Cascode Voltage Switch with Pass Gate Pulsed Latch for Viterbi Decoder
III. Research Methods and Results
A. LDPC Decoder
1. A 11.5-Gbps LDPC Decoder Based on CP-PEG Code Construction
2. A 2.37Gb/s 284.8mW Rate-Compatible (491,3,6) LDPC-CC Decoder
B. BCH and RS Decoder
1. Soft BCH Decoder Chip for DVB-S2 System
2. An Improved Soft BCH Decoder with One Extra Error Compensation
3. Soft RS Decoder Chip for Optical Communication System
C. Viterbi Decoder
1. A Low-Power Viterbi Decoder Based on Scarce State Transition and Variable Truncation Length
2. A Low Power Differential Cascode Voltage Switch with Pass Gate Pulsed Latch for Viterbi Decoder
IV. Conclusions and Discussion
V. References
VI. Self-Evaluation of Project Results
VII. Appendix: Research Results Related to This Project, 2007-2011
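A Low-Power Baseband Processor for Wireless Communication System
Project number: NSC97-2220-E-009-166
Project period: August 1, 2008 to July 31, 2011  Principal investigator: 李鎮宜, Professor, Department of Electronics Engineering, National Chiao Tung University
Participating students: 陳志龍, 林義閔, 賴義澤, 李欣儒, 林佳龍, 許智翔
I. Abstract
Chinese Abstract
Baseband signal processing plays a key role in wireless communication systems: it not only improves transmission performance effectively but also enables multi-mode and multi-standard system implementations. Achieving low-cost and low-power designs, however, requires both a deep understanding of the algorithms of individual modules and the incorporation of system-level behavior; only then can a technically competitive solution be delivered. In this three-year project, we therefore studied the key modules required by mainstream OFDM wireless communication systems, namely the Viterbi decoder, LDPC decoder, RS decoder, and BCH decoder. We explored design approaches for a low-power Viterbi decoder, high-speed and low-power LDPC decoders, and low-complexity soft BCH and RS decoders, and investigated multi-mode and multi-standard operation under different design specifications. These key module designs will subsequently be integrated and combined with the synchronization circuits for baseband signals to complete a low-power baseband processor that supports multiple modes and multiple standards.
Keywords
Baseband processor; multi-mode; multi-standard; low-cost; low-power; Viterbi Decoder; LDPC Decoder; BCH Decoder; RS Decoder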
English Abstract
Signal processing in baseband processor design plays a key role in wireless communication systems: it not only improves overall transmission performance but also provides the multi-mode and multi-standard capability needed for cost-effective system realization.
To achieve better performance in terms of cost and power, it is necessary to investigate system design methodologies, covering both in-depth exploration of the algorithms of key modules and exploitation of the unique features and behaviors of a complete system. As a result, a more competitive solution can be delivered. Over three years (2008/8 to 2011/7), we concentrated on the key modules (Viterbi decoder, LDPC decoder, BCH decoder, and RS decoder) of mainstream OFDM wireless communication systems. The first issue is a low-power solution for the Viterbi decoder. The second is a high-speed solution for the LDPC decoder. The last is a low-cost solution for soft BCH and RS decoders. In the end, these design techniques and key modules will be integrated on a design platform, together with synchronization modules, to produce a multi-mode, multi-standard, low-power baseband processor.
Keywords
Baseband Processor, Multi-mode, Multi-Standard, Low-Cost, Low-Power, Viterbi Decoder, LDPC Decoder, BCH Decoder, RS Decoder
II. Project Background and Objectives
A. LDPC Decoder
1. A 11.5-Gbps LDPC Decoder Based on CP-PEG Code Construction
Low-density parity-check (LDPC) codes are well-known error control codes with near-Shannon-limit performance [A-1] and can be described by a parity-check matrix H. The order in which messages are exchanged between nodes is called scheduling, and it influences the convergence speed of the decoding algorithm. In the standard belief propagation (BP) algorithm, all check node messages or all variable node messages are updated simultaneously, which is known as flooding scheduling. Alternatively, the layered BP algorithm [A-3], which performs message updates along different check node groups, is a form of check-node-centric sequential scheduling (CSS). Researchers have shown that CSS can reduce the maximum iteration number to approximately half that of standard BP with similar performance.
Recently, LDPC codes adopted in high-throughput systems have used high code rates to increase channel efficiency. However, the resulting large check node degree dc makes implementation difficult. Even though CSS can reduce the iteration number, the throughput is still degraded by the long critical path of the check node unit (CNU).
In this project, the proposed decoder aims at providing a high-throughput and hardware-efficient solution to the high code-rate LDPC with large check node degrees. In order to reduce the iteration number, the decoding scheduling is based on the variable-node-centric sequential scheduling (VSS; also known as shuffled decoding [A-6]), where the messages are updated along different variable node groups. Since the inputs of CNU operation are also divided into several subgroups, the complexity and critical path delay of CNU are reduced. Furthermore, single pipelined approach and modified CNU are proposed to minimize the message storage memory. Using a (2048, 1920) LDPC code constructed by circulant permutation progressive edge-growth (CP-PEG) algorithm [A-7] as a design example, the overall decoder chip
implemented in 90 nm technology will show its advantages in terms of throughput, energy efficiency, and hardware efficiency.
2. A 2.37-Gb/s 284.8mW Rate-Compatible (491,3,6) LDPC-CC Decoder
LDPC convolutional codes (LDPC-CCs) were proposed in 1999 [A-10], around the time LDPC block codes (LDPC-BCs) were rediscovered. LDPC-CCs have characteristics of convolutional codes that are not found in LDPC-BCs. Continuous encoding supports input data streams of any length, which is especially suitable for streaming video and packet-switched networks. The puncturing scheme applied in LDPC-CCs provides flexible code rates by discarding encoded bits at certain positions according to the puncture table.
The simple encoder circuitry, composed of registers, multiplexers, and a few XOR gates, has low hardware cost and power consumption and can be used in distributed sensor networks.
Furthermore, the correlation between codeword symbols of LDPC-CCs is limited to a specific interval (the constraint length ms+1, where ms is the memory size of the encoder). This locality property lowers the overall routing complexity of the decoder. Despite these advantages, LDPC-CCs have been adopted by few standards. The main reasons are long decoding latency, high power consumption, and low-to-moderate decoding throughput.
The throughput of LDPC-CC decoders reported in the literature was only several hundred Mbps, which makes it difficult to compete with LDPC-BC decoders offering several tens of Gbps. The lower throughput can be explained by the decoder structure. An LDPC-CC decoder consists of serially concatenated processors, and each processor, which decodes a sliding window on the trellis diagram, can be regarded as one iteration in LDPC-BC decoding. Increasing the number of processors enhances error-correcting capability but does not increase throughput. Therefore, many works have focused on realizing parallel message passing: analysis of parallelization concepts in [A-11], a single-instruction-multiple-data (SIMD) architecture in [A-12], and joint code-decoder design in [A-13]. Recently, a high-throughput LDPC-CC decoder design was proposed that adds regularity during code construction [A-14]. However, achieving high throughput is still challenging for time-varying LDPC-CCs without such regularity.
B. BCH and RS Decoder
1. Soft BCH Decoder Chip for DVB-S2 System
Bose-Chaudhuri-Hocquenghem (BCH) codes [B-1] have recently become popular in storage and communication systems. In the DMB-T [B-2] and DVB-S2 [B-3] applications shown in Fig. 1, BCH codes with long block lengths are specified to suppress the error floor caused by iterative LDPC decoding. Since BCH codes serve as outer codes in these communication systems, the soft information from the inner decoder can be employed to further improve the error-correcting performance.
Fig. 1. Block diagram of DMB-T and DVB-S2 systems
Soft-decision decoding of BCH codes has attracted much research interest. Forney developed generalized-minimum-distance (GMD) decoding [B-4], which uses algebraic algorithms to generate a list of candidate codewords and chooses the most likely codeword from the list. Other algorithms based on the same candidate-list concept, such as Chase [B-5] and SEW [B-6], are also widely used in many applications. This report illustrates a soft BCH decoding method that uses error magnitudes [B-7] to deal with the least reliable bits. For example, Fig. 2 shows the results of a concatenated code with a 16-state BCJR decoder [B-8] and BCH (255,239) over GF(2^8). Based on the soft information from the previous decoder, the performance gain of the BCH decoder is about 0.73 dB at BER = 10^-6 when 2t+1 candidate bits within a codeword are chosen to correct errors.
Fig. 2. Simulation results for BCH (255,239) concatenating with a 16-state BCJR under BPSK modulation and AWGN channel
2. An Improved Soft BCH Decoder with One Extra Error Compensation
Bose-Chaudhuri-Hocquenghem (BCH) codes [B-1] are popular in storage and communication systems, such as flash memory devices and the DMB-T [B-2] and DVB-S2 [B-3] broadcasting systems. Recently, soft decoding of BCH codes has attracted much research interest. Forney developed generalized-minimum-distance (GMD) decoding [B-4], which generates a list of candidate codewords and chooses the most likely codeword from the list. Other algorithms based on a similar concept, such as Chase [B-5] and SEW [B-6], are also widely used in many applications.
Moreover, Therattil and Thangaraj provided a sub-optimum MAP BCH decoding method with Hamming SISO decoder in 2005 [B-12].
In general, the complexity of a soft BCH decoder is much higher than that of a hard BCH decoder when decoding an entire codeword. Nevertheless, soft BCH decoders with lower complexity can be obtained by focusing on the least reliable bits instead of the whole codeword. A soft BCH decoding algorithm that uses error magnitudes to deal with the least reliable bits was developed in 1997 [B-7]. However, Fig. 3 shows that there is about 0.25 dB performance loss at BER = 10^-5 in the AWGN channel compared with the hard-decision BCH decoder for the BCH (255,239) code. Among the existing soft-decision algorithms, the soft BCH decoder provides either better error-correcting
performance or lower hardware complexity than a traditional hard BCH decoder. In this project, we present a soft BCH decoder that follows a concept similar to [B-7] and enhances the correcting performance by compensating for one extra error while maintaining low hardware complexity.
Fig. 3. Simulation Result for BCH (255,239)
3. Soft RS Decoder Chip for Optical Communication System
Reed-Solomon (RS) codes are widely used in various communications and digital data storage systems due to the advantage of overcoming the burst errors. According to International Telecommunication Union (ITU-T) recommendation, RS (255,239) is standardized in high speed optical fiber systems and Gigabit Passive Optical Network (GPON) applications, which demand 2.5 Gb/s throughput for achieving 10-40 Gb/s with 16 RS decoders or satisfying the maximum up and down link requirement. To resist the increasing noise induced by higher transmission rate required in optical communication systems, soft RS decoding algorithms are exploited for achieving considerable performance gain. These algorithms modify the received sequences to form a list of candidate codewords and choose the most probable one. Nevertheless, because of high computational complexity, these soft decoding algorithms are still unsuitable for practical implementation. In this project, a decision-confined soft decoding algorithm is proposed to enhance performance while maintaining area efficiency. Instead of decoding all the candidate codewords like other soft decoding algorithms, our design only decodes the candidate codeword with the degree of
error-locator polynomial Λ(x) less than the error correction capability t. A Gray-code-based bit-flipping method is also exploited, so that only one set of hardware is required.
C. Viterbi Decoder
1. A Low-Power Viterbi Decoder Based on Scarce State Transition and Variable Truncation Length
The Viterbi decoder implementing the Viterbi algorithm [C-1] for decoding convolutional codes is composed of three main blocks: the branch metric (BM) unit, the ACS unit, and the survivor memory. The BM unit generates branch metrics from the input data. The ACS unit recursively accumulates branch metrics as path metrics (PM) and makes decisions to select the most likely state transition sequences, or the survivors. Survivor memory stores the survivors for retrieving the data sequence.
There are two well-known survivor memory management approaches: register-exchange (RE) and trace-back (TB) [C-2]. Register-exchange is conceptually the simplest technique and eliminates repeated memory access operations. Therefore, this approach has shorter latency and is suitable for high-speed decoder implementations. However, because of the data movement among registers, the approach is considered power inefficient.
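The register-exchange idea can be illustrated with a small software model. The sketch below is a hard-decision Viterbi decoder for an assumed toy code (constraint length 3, rate 1/2, generators 7 and 5 in octal), not the code or the radix-4 architecture used in this project; the names `branch_output`, `next_state`, and `viterbi_register_exchange` are hypothetical. Each state owns one survivor register, and every step copies the selected predecessor's register plus the new decision bit, which is the data movement that makes the approach power hungry.

```python
from itertools import product

# Assumed toy code for illustration only: K = 3, rate 1/2, generators (7, 5) octal.
G_TAPS = (0b111, 0b101)
N_STATES = 4


def parity(x: int) -> int:
    return bin(x).count("1") & 1


def branch_output(state: int, bit: int) -> tuple:
    # Shift register contents: current input, previous input, input before that.
    reg = (bit << 2) | state
    return tuple(parity(reg & g) for g in G_TAPS)


def next_state(state: int, bit: int) -> int:
    return ((bit << 2) | state) >> 1


def encode(bits):
    state, out = 0, []
    for b in bits:
        out.append(branch_output(state, b))
        state = next_state(state, b)
    return out


def viterbi_register_exchange(received):
    inf = float("inf")
    pm = [0] + [inf] * (N_STATES - 1)        # path metrics, encoder starts in state 0
    regs = [[] for _ in range(N_STATES)]     # one survivor register per state
    for r in received:
        new_pm = [inf] * N_STATES
        new_regs = [None] * N_STATES
        for state, bit in product(range(N_STATES), (0, 1)):
            if pm[state] == inf:
                continue
            out = branch_output(state, bit)
            bm = (out[0] != r[0]) + (out[1] != r[1])   # BM unit: Hamming distance
            cand = pm[state] + bm                       # ACS: add
            ns = next_state(state, bit)
            if cand < new_pm[ns]:                       # ACS: compare and select
                new_pm[ns] = cand
                # Register exchange: copy the whole survivor and append the decision.
                new_regs[ns] = regs[state] + [bit]
        pm, regs = new_pm, new_regs
    best = min(range(N_STATES), key=lambda s: pm[s])
    return regs[best]
```

In this model the survivor registers grow without bound; a hardware implementation would keep only a fixed truncation length per state.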
Fig. 4 shows the conventional 2^υ-state Viterbi decoder with the register-exchange architecture [C-3]. The decisions from the ACS units are shifted within the survivor memory from left to right. Applying the scarce state transition (SST) technique and the variable truncation length, we illustrate the proposed low-power Viterbi decoder for the MB-OFDM UWB system [C-4] in Fig. 5. The SST unit is integrated to reduce state transition activities, leading to less dynamic power consumption [C-8]. Furthermore, the path merging detector monitors the merged point for all survivors and adjusts the truncation length to avoid unnecessary data movement in registers. Many redundant operations in the survivor memory can thus be eliminated to save power. Additionally, considering the high throughput
requirement, the radix-4 ACS structure is exploited because of a better compromise between cost and throughput [C-5].
Fig. 4. The conventional register-exchange architecture
Fig. 5. The proposed Viterbi Decoder Architecture
For the Viterbi decoder, the scarce state transition (SST) algorithm is a low power technique to reduce the state transition activity significantly under high SNR conditions [C-6]-[C-8]. In the conventional Viterbi decoding model (see Fig. 6 (A)), u(D) denotes the information sequence, C(D)=u(D)G(D) is the codeword sequence deriving from the generator polynomial G(D). From the received sequence r(D), the Viterbi decoder estimates the decoded
information o(D).
The SST Viterbi decoding architecture in Fig. 6 (B) includes two additional blocks: a pre-decoder and a re-encoder. Assume

$$r(D) = u(D)G(D) + e(D) = C(D) + e(D) \qquad (1)$$

where e(D) is the error sequence from a noisy channel. The pre-decoder directly decodes the information sequence from r(D):

$$i(D) = r(D)G^{-1}(D) = \hat{u}(D) \qquad (2)$$

The re-encoder then encodes i(D) to a new codeword z(D):

$$z(D) = i(D)G(D) = \hat{C}(D) \qquad (3)$$

The Viterbi decoder performs maximum likelihood decoding on y(D), which is defined as follows:

$$y(D) = r(D) + z(D) = C(D) + e(D) + \hat{C}(D) \qquad (4)$$

In high SNR conditions, e(D) is nearly zero, and the decoded information sequence becomes

$$o(D) = i(D) + n(D) = \hat{u}(D) + n(D) \qquad (5)$$

If the channel condition is good enough, the decoder estimates an approximately zero sequence; as a result, the dynamic power is reduced as the channel becomes better.
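The pre-decoder and re-encoder relations can be checked numerically with GF(2) polynomial arithmetic. The following is a minimal sketch under stated assumptions: the generator G(D) = [1+D+D^2, 1+D^2] and its right inverse [D, 1+D] are chosen only for illustration and are not taken from this report, polynomials are stored as Python integers (bit k is the coefficient of D^k), and the Viterbi decoder itself is not modeled. The helper names are hypothetical.

```python
def gf2_mul(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials stored as integers."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


# Assumed example generator, G(D) = [1 + D + D^2, 1 + D^2].
G = (0b111, 0b101)
# A right inverse: (1 + D + D^2) * D + (1 + D^2) * (1 + D) = 1 over GF(2).
G_INV = (0b010, 0b011)


def conv_encode(u: int) -> tuple:
    return tuple(gf2_mul(u, g) for g in G)


def pre_decode(r: tuple) -> int:
    # i(D) = r(D) G^{-1}(D)
    return gf2_mul(r[0], G_INV[0]) ^ gf2_mul(r[1], G_INV[1])


def sst_front_end(r: tuple) -> tuple:
    i = pre_decode(r)                      # pre-decoder
    z = conv_encode(i)                     # re-encoder, z(D) = i(D) G(D)
    y = (r[0] ^ z[0], r[1] ^ z[1])         # y(D) = r(D) + z(D), input to the Viterbi decoder
    return i, y
```

With an error-free input, `sst_front_end` returns the original information sequence and an all-zero y(D), which is why the state transition activity in the Viterbi decoder drops at high SNR. With a channel error e(D), y(D) equals e(D) plus the codeword of e(D)G^{-1}(D), so y(D) stays sparse when errors are rare.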
Fig. 6. (A) Conventional model (B) SST decoding model
2. A Low Power Differential Cascode Voltage Switch with Pass Gate Pulsed Latch for Viterbi Decoder
In mobile communication systems, and especially in wireless local area networks (WLAN), information must be transmitted at high data rates. Furthermore, an efficient error-control code is commonly adopted to enhance system performance. Accordingly, convolutional codes have been exploited extensively in communication systems, as they provide superior error correction capability while maintaining reasonable coding complexity. The Viterbi algorithm is one of the best algorithms for decoding convolutional codes with modest computing resources. However, as data rates increase, the power dissipation and system complexity also increase. Moreover, as the required transmission rates of wireless systems increase, the error-control mechanism has come to dominate power dissipation.
Fig. 7 presents the power distribution within the Viterbi decoder we proposed in [C-10], implemented in UMC 90 nm CMOS technology. The survivor memory unit (SMU) is constructed from a register array and dissipates the most power in the Viterbi decoder. Therefore, high performance, low power consumption, and robustness are the basic requirements for the design of clocked storage elements in a Viterbi decoder. This project presents a low-power differential cascode voltage switch with pass-gate (DCVSPG) pulsed latch for the Viterbi decoder.
Fig. 7. Power distributions of a Viterbi Decoder
III. Research Methods and Results
A. LDPC Decoder
1. A 11.5-Gbps LDPC Decoder Based on CP-PEG Code Construction
a) CODE STRUCTURE AND DECODING ALGORITHM
(1) CP-PEG LDPC Code Construction
The (2048, 1920) irregular LDPC code with rate 15/16 used in this project was constructed by the CP-PEG algorithm and is shown in Fig. 8(a). The constructed parity-check matrix H consists of p × p circulant permutation (CP) matrices and all-zero matrices. A CP matrix is a cyclic square matrix with constant row and column weight of one. The number in each CP matrix position indicates the cyclic shift amount, and -1 denotes an all-zero matrix. With p = 32, there are 4p check nodes and 64p variable nodes in the bipartite graph, where each check node has uniform degree 46, and 16p, 24p, and 24p variable nodes have degrees of 4, 3, and 2, respectively. This code was shown to perform better than other PEG-based LDPC codes [A-7]; nevertheless, the high check node degree requires a suitable decoder architecture to overcome implementation difficulties.
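The stated node counts and degrees can be cross-checked with a few lines of arithmetic. The sketch below only counts edges of the bipartite graph from the variable node side and divides by the number of check nodes; it assumes H has full row rank when computing the code rate, and the variable names are illustrative.

```python
from fractions import Fraction

p = 32
n_check = 4 * p                      # number of check nodes (rows of H)
n_var = 64 * p                       # number of variable nodes (columns of H)

# Variable node degree -> number of variable nodes with that degree.
var_degree_counts = {4: 16 * p, 3: 24 * p, 2: 24 * p}

# Every edge is counted once from the variable node side.
edges = sum(degree * count for degree, count in var_degree_counts.items())

# With a uniform check node degree, the same edges are shared evenly by the check nodes.
check_degree = Fraction(edges, n_check)

# Code rate, assuming H has full row rank.
rate = Fraction(n_var - n_check, n_var)
```

Here `n_var` is 2048, `n_check` is 128, the edge count is 5888, `check_degree` is 46, and `rate` is 15/16, in agreement with the (2048, 1920) code described above.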
(2) Variable-node centric Sequential Scheduling
In the VSS approach, the initialization, stopping criterion test, and output steps remain the same as in the standard BP algorithm. The only difference between the two algorithms lies in the updating procedure. The normalized min-sum (NMS) algorithm, which compensates for the approximation error in the check node update, is described below.
In this work, the codeword is divided into G=4 groups, therefore the parity-check matrix H is divided into 4 sub-matrices (H1 to H4). As shown in Fig. 8(b), each sub-matrix consists of equal
number of variable nodes with the same degree to reduce the hardware cost of the variable node unit (VNU). Moreover, the sub-matrices with the same shift amounts (shaded blue CP matrices) are arranged in the same position, so the routing and control can be further simplified.
Fig. 8. Parity-Check Matrix of (2048,1920) LDPC Code
b) PROPOSED DECODER ARCHITECTURE
In this section, the complete decoder architecture will be presented, including data path, scheduling, and VLSI structure of CNU and modified CNU.
(1) Single Pipelined Architecture
The entire decoder depicted in Fig. 9(a) is composed of fully parallel CNUs and partially parallel VNUs, where VNU2, VNU3, and VNU4 handle variable node operations with degree 2, 3, and 4, respectively. For one specific check node, the messages sent from the variable nodes in the g-th group at the i-th iteration are sorted, and the magnitude part of the check-node-to-variable-node message in (1) can then be computed from these sorted group messages.
Fig. 9(b) shows the timing diagram of the proposed decoder. G initialization cycles are required to calculate the sorted group messages for 0 ≤ g ≤ G − 1. Since only one subgroup of the messages z_mn^(i) is updated in the g-th cycle of one iteration, the main operation of the CNU can be simplified to calculating the sorted messages of the g-th group (local sorting) in each cycle and then performing global sorting as in equation (5).
With the proposed single pipelined architecture, only the sorted group messages and one set of per-edge messages indexed by (m, n) are stored. In the NMS algorithm, the sorted results can be represented by the min value, the second min value, and the index of the min value. Therefore, the proposed decoder latches only two values, one index, and the sign part of the messages in each subgroup, while the variable-node-to-check-node message z_mn^(i) is calculated on the fly. The single pipelined architecture is feasible because the CNU can be updated immediately after the VNU operations in the VSS approach.
Fig. 9. Proposed Architecture and Scheduling
(2) Modified CNU
The check-node-to-variable-node update can be divided into a magnitude part and a sign part. Fig. 10(a) illustrates the magnitude part of the CNU, which is an accumulative sorter composed of a local sorter and a global sorter. The local sorter finds the local min and second min values in each subgroup, and the global sorter finds the global min and second min values of a check node. Similarly, the sign operation can be computed in an accumulative way like the accumulative sorter.
For the proposed CP-PEG LDPC code with dc = 46, the VSS approach with G = 4 divides the 46 inputs of the sorter into groups of at most 12 inputs. A larger subgroup number G results in fewer sorter inputs but increases the storage for the min, second min, and index values of each subgroup.
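The local and global sorting steps can be modeled in software as follows. This is a behavioral sketch only, not the circuit of Fig. 10: the contiguous, equal-sized grouping and the function names are assumptions made for illustration, sign handling is omitted, and the scaling factor 0.75 is the NMS factor quoted in the simulation settings of this report.

```python
def split_groups(mags, groups):
    """Split the check node inputs into contiguous subgroups (assumed grouping)."""
    size = -(-len(mags) // groups)           # ceiling division
    return [(s, mags[s:s + size]) for s in range(0, len(mags), size)]


def local_sort(offset, mags):
    """Local sorter: min, second min, and global index of the min in one subgroup."""
    min1 = min2 = float("inf")
    idx = -1
    for k, m in enumerate(mags):
        if m < min1:
            min1, min2, idx = m, min1, offset + k
        elif m < min2:
            min2 = m
    return min1, min2, idx


def global_sort(local_results):
    """Global sorter: merge the per-subgroup results into check node level values."""
    min1 = min2 = float("inf")
    idx = -1
    for l1, l2, li in local_results:
        if l1 < min1:
            min2 = min(min1, l2)
            min1, idx = l1, li
        else:
            min2 = min(min2, l1)
    return min1, min2, idx


def check_node_magnitudes(mags, groups=4, alpha=0.75):
    """Normalized min-sum magnitudes: each edge receives the min of all other inputs."""
    local_results = [local_sort(off, grp) for off, grp in split_groups(mags, groups)]
    min1, min2, idx = global_sort(local_results)
    return [alpha * (min2 if k == idx else min1) for k in range(len(mags))]
```

With 46 inputs and 4 groups, the largest subgroup in this model has 12 inputs, so each local sorter handles at most 12 values while the global sorter merges only four (min, second min, index) triples.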
In order to further reduce the storage overhead of each subgroup, we propose a reduced storage accumulative sorter as shown in Fig. 10(b). The basic idea is to simplify the local min and local
second min from G − 1 subgroups into one group. Some extra control circuits are needed to open or close the feedback loop in Fig. 10(b). This sorter architecture is beneficial since the complexity reduction of storage registers and global sorters is higher than the overhead of control circuits. Section IV will show the performance of this modified CNU is similar to original CNU.
(3) Summary
In a traditional two-stage pipelined architecture, both the z_mn^(i) messages and a second set of per-edge messages indexed by (m, n) are kept in registers or memory. Assume the bit-width of messages is w (= 6) and the variable node degree is dv; the required memory size (or number of registers) then depends on w and dv.
For the proposed single pipelined decoder and the modified CNU in Fig. 10(b), the memory size is reduced.
Therefore the overall register reduction of the proposed architecture is 73%, leading to the following advantages: fewer registers, higher utilization of functional units, and reduced complexity. Since high-rate LDPC codes usually have more VNUs than CNUs (in our case: 512 VNUs and 128 CNUs), the elimination of registers from VNU to CNU not only reduces hardware cost but also lowers the power consumption of the clock tree.
Fig. 10. CNU Architecture
c) PERFORMANCE AND IMPLEMENTATION RESULTS
Under an AWGN channel with BPSK modulation, the performance curves are simulated to determine the required bit-width and maximum iteration number. The simulation parameters of the proposed algorithm are 6-bit input quantization (5-bit integer and 1-bit fraction), a scaling factor of 0.75 for the NMS algorithm, and 4 or 5 iterations. In Fig. 11, the bit-error rate (BER) curves indicate that 4 iterations of the proposed algorithm are sufficient to achieve performance similar to that of the standard BP algorithm with 7 iterations. Furthermore, at almost the same code rate, our CP-PEG LDPC code outperforms the (255, 239) RS code by 1.2 dB at BER = 10^-5, which reveals the potential of CP-PEG LDPC codes for storage applications and fiber-optic communication systems. The overall SNR loss between this work and the Shannon limit is only 1.6 dB. The proposed LDPC decoder is implemented with a standard-cell design flow and fabricated in 90-nm 1P9M CMOS technology. The core occupies 3.84 mm^2 with 68% utilization. The die photo is shown in Fig. 12, where the distribution of CNUs and VNUs is determined automatically by the APR tool. Since decoding one LDPC codeword requires 4 initialization cycles plus 4 iterations (4 cycles each), the throughput is (1920 bits / 20 cycles) × frequency. Fig. 13 shows the measured maximum throughput and power dissipation under different SNR conditions and supply voltages. The measurement result
indicates that the test chip with FF corner can achieve 11.5 Gbps throughput under 1.4V supply voltage. The throughput could be scaled down to 5.77Gbps with 0.8V supply voltage to meet the throughput requirement of IEEE 802.15.3c standard and the energy efficiency will be 0.033 nJ/bit.
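The throughput expression above can be turned into a small calculator. In the sketch below, the cycle count assumes G = 4 cycles per iteration, and the clock frequencies are derived from the reported throughputs rather than quoted from measurements; the function names are illustrative.

```python
def decoding_cycles(init_cycles: int, iterations: int, cycles_per_iteration: int) -> int:
    """Total cycles spent on one codeword."""
    return init_cycles + iterations * cycles_per_iteration


def throughput_bps(info_bits: int, cycles: int, f_clk_hz: float) -> float:
    """Information throughput = (information bits / cycles per codeword) x clock frequency."""
    return info_bits / cycles * f_clk_hz


def clock_for_throughput_hz(info_bits: int, cycles: int, target_bps: float) -> float:
    """Clock frequency implied by a given information throughput."""
    return target_bps * cycles / info_bits


cycles = decoding_cycles(init_cycles=4, iterations=4, cycles_per_iteration=4)
f_at_11_5_gbps = clock_for_throughput_hz(1920, cycles, 11.5e9)
f_at_5_77_gbps = clock_for_throughput_hz(1920, cycles, 5.77e9)
```

Under these assumptions the decoder delivers 96 information bits per clock cycle, so 11.5 Gbps corresponds to a clock of roughly 120 MHz and 5.77 Gbps to roughly 60 MHz.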
Compared with the state-of-the-art designs in Table 1, the proposed LDPC decoder outperforms the others in throughput, hardware efficiency, and power efficiency. Since the LDPC code specifications of these designs differ, the SNR loss of each work relative to its Shannon limit is also listed for reference.
Fig. 11. Performance
d) CONCLUSION
A high-throughput and power-efficient LDPC decoder is presented. By exploiting variable-node-centric sequential scheduling, the proposed decoding algorithm reduces the maximum iteration number without performance loss. In addition, the single pipelined architecture and modified CNU save 73% of the message storage memory and decrease the sorter size, resulting in a low-complexity design. Implemented in 90 nm technology, the test chip occupies 3.84 mm^2 and supports a maximum data rate of 11.5 Gbps under a 1.4 V supply voltage.
Fig. 12. Chip Micrograph
Fig. 13. Measured Maximum Throughput and Power Consumption
Table 1. Comparison with the State-of-the-Art
2. A 2.37Gb/s 284.8mW Rate-Compatible (491,3,6) LDPC-CC Decoder
a) PROPOSED ALGORITHM AND ARCHITECTURE
Fig. 15 demonstrates the algorithm-level optimization that accelerates decoding convergence by using the on-demand variable node activation (OVA) scheduling technique [A-15]. The main idea is to move the variable node activation from the output of the processor to the position right before each check node input. OVA scheduling is similar to layered decoding in LDPC-BCs in that check nodes can access the most recent messages. The original VNU can be disassembled into several sub-VNUs (SVNUs) and distributed within a processor.
Since the equation of VN-to-CN messages (e.g. n1 and n2 in Fig. 15) has two common terms, we may calculate n2 from n1 by deducting m2 (done by pre-SVNU) and adding m1 (done by post-SVNU). Therefore, the channel values (i.e. u and v) are concealed in VN-to-CN messages and the storage space of channel values can be removed from processors to save 17% memory.
When the channel values are concealed within the summation values, the bit-width of each message should be adjusted to avoid truncation error. For a w-bit channel value, the summation values need (w + 2) bits. Since the operations of the pre-SVNU and post-SVNU are independent, they can be retimed such that the messages between them need only (w + 1) bits.
The original critical path from CNU to post-SVNU is also diminished by one adder delay.
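The message update with concealed channel values can be illustrated with integer arithmetic. The sketch below assumes a degree-3 variable node whose outgoing message on edge k is the channel value plus all incoming check node messages except the one on edge k; the function names, the example values, and the assumption that check node messages are also w-bit two's-complement numbers are illustrative and not taken from the chip.

```python
def vn_to_cn_direct(channel: int, m: list, k: int) -> int:
    """Message on edge k computed from scratch: channel + all incoming messages except m[k]."""
    return channel + sum(m) - m[k]


def vn_to_cn_incremental(n_prev: int, m_remove: int, m_add: int) -> int:
    """Next message derived from the previous one: subtract one term (pre-SVNU role),
    add another (post-SVNU role). The channel value never has to be read again."""
    return n_prev - m_remove + m_add


# Example with dv = 3: n1 excludes m1, n2 excludes m2.
channel = -5
m1, m2, m3 = 7, -3, 11
n1 = vn_to_cn_direct(channel, [m1, m2, m3], 0)
n2 = vn_to_cn_incremental(n1, m_remove=m2, m_add=m1)

# Bit-width check under the stated assumption of w-bit two's-complement operands:
# the sum of four w-bit values always fits in (w + 2) bits.
w = 6
lo, hi = -(1 << (w - 1)), (1 << (w - 1)) - 1
sum_lo, sum_hi = 4 * lo, 4 * hi
fits_in_w_plus_2 = -(1 << (w + 1)) <= sum_lo and sum_hi <= (1 << (w + 1)) - 1
```

Because n2 is obtained from n1 alone, the channel value is carried implicitly inside the running sum, which is what allows its dedicated storage to be dropped.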
Fig. 16 shows the bit-error-rate (BER) performance of the rate-compatible (491, 3, 6) time-varying LDPC-CC proposed in [A-16] under an AWGN channel. Compared with the log-BP algorithm with 10 processors, the proposed algorithm with 5 processors achieves similar or even better performance at all code rates. Therefore, only half as many processors are required for the same performance, which also halves the decoding latency.
In the original structure of Fig. 14, the LDPC-CC decoder can only decode one bit in one cycle, so the information throughput will be fclk Mb/s at fclk MHz clock frequency. To increase throughput, the node level optimization duplicates both CNUs and VNUs to ρ (folding factor)
units and the decoder throughput becomes (ρ × fclk) Mb/s. The proposed folding technique primarily duplicates the combinational logic while the sequential circuits are only slightly increased. In the meantime, the FIFO structure is also modified accordingly to provide sufficient input data for operation units. Each FIFO in the conventional processor is segmented by ρ factor to support required bandwidth.
For irregular time-varying LDPC-CCs with a large folding factor, neither a register-based FIFO, with its high power consumption, nor a memory-based FIFO, with its serious memory conflicts, is suitable.
To trade off bandwidth against power, a hybrid-partitioned FIFO structure is presented. The first step is to calculate the length of the longest continuous sectors of every folded row. Sectors are then merged into one memory bank, where the depth of the memory bank is the minimum of the sector lengths. If an original sector is longer than the memory depth, the excess part is still stored in registers. This procedure continues to merge sectors until the memory depth is less than a pre-defined parameter. In this work, 50% of the messages in each processor are partitioned into three two-port memories (10.5 Kbits), and the clock buffers are reduced by 54%.
Fig. 14. The (14, 3, 6) LDPC-CC decoder and conventional processor architecture. Note that the constraint length ms=14, VN degree dv=3, and CN degree dc=6 in this example.
Fig. 15. Algorithm level optimization (OVA scheduling with concealing channel values).
Fig. 16. BER performance of Log-BP algorithm (floating-point) and our proposed scheduling in Normalized Min-Sum algorithm with scaling factor 0.875 (fixed-point (6,2)) under AWGN channel.
b) IMPLEMENTATION RESULTS
Fabricated in 90nm 1P9M CMOS process, our test chip integrates the OVA scheduling with concealed channel values, folding architecture, re-timed SVNU, and hybrid-partitioned FIFOs.
Key features and performance comparison are given in Table 2. The decoder chip occupies
2.24 mm^2 of area with 479K gates and 52.5 Kb of SRAM. Measurement results show that the decoder draws 284 mW under a 1.2 V supply voltage while running at 198 MHz. Since the folding factor equals 12, the information throughput of the LDPC-CC decoder reaches 2.37 Gb/s. When the supply voltage is scaled down to 0.8 V, as shown in Fig. 17, the power is reduced to 90.2 mW with an energy efficiency of 0.0114 nJ/bit/proc. Compared with other LDPC-CC decoders [A-14], [A-17], this work provides higher throughput, less area, and better energy efficiency. Compared with the Turbo decoder [A-18], this work achieves much higher throughput with lower power and less die area. In conclusion, our proposed LDPC-CC decoder outperforms state-of-the-art designs and has the potential to be a candidate for next-generation communication systems. The chip micrograph is shown in Fig. 18.
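The relation between folding factor, clock frequency, and information throughput can be checked directly. The sketch below simply applies the (ρ × fclk) Mb/s expression used in this report to the measured operating point; the function name is illustrative.

```python
def folded_throughput_mbps(folding_factor: int, f_clk_mhz: float) -> float:
    """Information throughput in Mb/s for a decoder that outputs folding_factor bits per cycle."""
    return folding_factor * f_clk_mhz


unfolded = folded_throughput_mbps(1, 198.0)    # one bit per cycle
folded = folded_throughput_mbps(12, 198.0)     # folding factor of 12
```

At 198 MHz the unfolded structure would deliver 198 Mb/s, while a folding factor of 12 gives 2376 Mb/s, which matches the 2.37 Gb/s figure.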
Table 2. Chip summary and comparison with state-of-the-art
Fig. 17. Measurement Results and the comparison with previous works.
Fig. 18. Chip micrograph.
B. BCH and RS Decoder
1. Soft BCH Decoder Chip for DVB-S2 System
a) Soft Decision BCH Decoding
Conventional BCH decoding consists of syndrome calculation, a key equation solver, and Chien search [B-1]. In general, the complexity of a soft BCH decoder is much higher than that of a hard BCH decoder when decoding an entire codeword. Nevertheless, soft BCH decoders with lower complexity can be obtained by focusing on the least reliable bits instead of the whole codeword.
Fig. 19. Soft Decision BCH Decoding Block Diagram
As shown in Fig. 19, the soft BCH decoding using error magnitudes [B-1] includes three major steps: syndromes calculation, error locators evaluator, and error magnitudes solver. From the received polynomial R(x), the syndromes polynomial S(x) = S1 + S2x1 + · · · + S2tx2t−1 are expressed as
For j = 1, 2, · · ·, 2t, where α is the primitive element over GF(2m). Notice that li is the i-th actual error location and βli = αli indicates the corresponding error locator. With soft inputs, error locators evaluator can choose 2t least reliable inputs and evaluates their corresponding error locator values to form a β-vector, [βc1 , βc2 , . . . , βc2t ]. Also, the error location set, {Lc1 ,Lc2 , . . . ,Lc2t}, can be calculated with β- vector because the β value of the Lci -th location is βci = αLci . The relation between βci and the syndromes can be formulated as
24
where γci is the error magnitude corresponding to βci for i = 1, 2, . . . , 2t. The left 2t × 2t matrix in (2) is defined as β - matrix. From (1) and (2), it is evident that if all the errors are in the location set, the exact γci value can be determinated; otherwise, this decoding approach fails to correct errors. The error magnitudes solver shown in Fig. 19 is used to solve (2) to get γci. For those γci
equal to 1, the corresponding Lci are the exact error locations. The codeword polynomial C(x) can be obtained by inversing the Lci -th values in the received polynomial R(x).
b) Proposed Algorithm and Architecture
(1) Error Locators Evaluator
As shown in Fig. 20, the error locators evaluator architecture includes the reliability part, the error locator part, and the error location part. The upper part is the reliability part, which stores the reliabilities of the $2t$ least reliable candidates $R_{c_1}, R_{c_2}, \ldots, R_{c_{2t}}$. The middle part is the error locator part, which constructs the β-vector. Because the β value of the $i$-th location is $\alpha^i$, the β value of the $(i+1)$-th location is $\alpha$ times the β value of the $i$-th location. If the input arrives serially, starting from the highest-degree coefficient of $R(x)$, the β value of each input can therefore be computed by multiplying register REG by $\alpha^{-1}$. Thus, the error locator part can use a constant multiplier to calculate the error locator of each input. Notice that register REG initially contains the β value of the first input. The bottom part is the error location part. The decoding method focuses on the least reliable bits instead of the whole codeword, so the error location part uses a counter to compute the error location $L_{c_i}$ corresponding to each $R_{c_i}$ for serial input. Hence, the Chien search procedure is no longer required and a large amount of redundant decoding latency is eliminated.
Fig. 20. Error Locators Evaluator Architecture for Serial Input
The error locators evaluator sorts the soft inputs to choose the $2t$ least reliable inputs as the candidate reliabilities $R_{c_1}, R_{c_2}, \ldots, R_{c_{2t}}$. Their corresponding error locators $\beta_{c_i}$ and error locations $L_{c_i}$ are also calculated and stored in registers. The evaluator compares each soft input with $R_{c_i}$ and then generates the select signals $SEL_i$ that control the multiplexers. In the $i$-th stage, if the input is smaller than $R_{c_{i-1}}$, the $i$-th stage is updated with the $(i-1)$-th stage value. If the input is greater than $R_{c_{i-1}}$ and smaller than $R_{c_i}$, the $i$-th stage is updated with the input value. Otherwise, the $i$-th stage holds its current value.
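The behaviour of the evaluator can be sketched in software as follows. This is a minimal model, not the hardware of Fig. 20: it assumes a toy field $GF(2^4)$ with primitive polynomial $x^4 + x + 1$, a codeword length of 15, and four candidates, none of which come from the report. The sorted register chain is modelled by a list insertion, REG is updated with the constant multiplier $\alpha^{-1}$, and the location counter counts down from the highest-degree coefficient.

```python
M = 4                  # toy field GF(2^4); an assumption for illustration
POLY = 0b10011         # x^4 + x + 1, assumed primitive polynomial
ALPHA = 2              # primitive element alpha


def gf_mul(a, b):
    """Multiply two GF(2^M) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def gf_pow(a, e):
    """Raise a to the non-negative integer power e."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


ALPHA_INV = gf_pow(ALPHA, 2**M - 2)   # alpha^-1, the constant multiplier


def error_locators_evaluator(reliabilities, num_candidates):
    """Return the least reliable inputs as (reliability, locator, location).

    reliabilities[0] belongs to the highest-degree coefficient of R(x).
    """
    n = len(reliabilities)
    reg = gf_pow(ALPHA, n - 1)        # REG starts at the locator of the first input
    stages = []                       # kept sorted, least reliable first
    for step, rel in enumerate(reliabilities):
        location = n - 1 - step       # counter of the error location part
        pos = 0
        while pos < len(stages) and stages[pos][0] <= rel:
            pos += 1
        if pos < num_candidates:      # input displaces a stored candidate
            stages.insert(pos, (rel, reg, location))
            del stages[num_candidates:]
        reg = gf_mul(reg, ALPHA_INV)  # locator of the next serial input
    return stages


soft_inputs = [9, 2, 7, 8, 1, 6, 14, 3, 13, 12, 4, 11, 5, 15, 10]
candidates = error_locators_evaluator(soft_inputs, num_candidates=4)
for rel, locator, location in candidates:
    print(rel, locator, location)
```

Each stored locator equals $\alpha$ raised to the stored location, so no Chien search is needed to recover the positions of the candidates.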
(2) Error Magnitudes Solver (EMS)
To obtain the valid $\gamma_{c_i}$ values in (2), Gaussian elimination is the most intuitive method, but its complexity is $O(n^3)$. Two alternative algorithms that improve decoding efficiency under different error correcting capabilities $t$ are proposed. One exploits the fact that a valid error magnitude in BCH codes is either 0 or 1, and the other employs a fast Vandermonde matrix solution.
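(a) Heuristic Error Magnitudes Solver (H-EMS):
In BCH codes, a valid error magnitude in (2) is either 0 or 1, so the problem can be reformulated as checking all combinations of $\gamma_{c_i}$ over $GF(2)$ instead of calculating real error magnitudes. A $2t$-bit counter is used to perform a heuristic search over all binary combinations. Since $S_1^2 = S_2$, $S_2^2 = S_4$, ..., $S_t^2 = S_{2t}$ in BCH codes, the check on the even-indexed syndromes can be eliminated, which simplifies (2) to
$$\begin{bmatrix} \beta_{c_1} & \beta_{c_2} & \cdots & \beta_{c_{2t}} \\ \beta_{c_1}^3 & \beta_{c_2}^3 & \cdots & \beta_{c_{2t}}^3 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{c_1}^{2t-1} & \beta_{c_2}^{2t-1} & \cdots & \beta_{c_{2t}}^{2t-1} \end{bmatrix} \begin{bmatrix} \gamma_{c_1} \\ \gamma_{c_2} \\ \vdots \\ \gamma_{c_{2t}} \end{bmatrix} = \begin{bmatrix} S_1 \\ S_3 \\ \vdots \\ S_{2t-1} \end{bmatrix} \tag{3}$$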
Table 3
The Proposed Heuristic EMS Algorithm
Table 3 gives the details of the proposed H-EMS algorithm. The $i$-th bit of CNT, $CNT_i$, acts as the $i$-th error magnitude $\gamma_{c_i}$. Thus, by stepping the counter through all of its values, a heuristic search over all binary combinations is completed. At each iteration, the solver verifies whether equation (3) holds. As shown in Fig. 21, H-EMS uses $2t(t-1)$ multipliers to construct the β-matrix. Each $\beta_{c_i}^{\,j}$ value is combined with $CNT_i$, and the solver checks whether the results equal the syndromes.
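The counter-based search can be illustrated with the short sketch below. It is only a software model under stated assumptions: a toy $GF(2^4)$ field with primitive polynomial $x^4 + x + 1$, $t = 2$, and hypothetical error and candidate locations chosen for the example. For every counter value it checks the odd-indexed syndrome equations only, as described above.

```python
M = 4                  # toy field GF(2^4); an assumption for illustration
POLY = 0b10011         # x^4 + x + 1, assumed primitive polynomial
ALPHA = 2


def gf_mul(a, b):
    """Multiply two GF(2^M) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


def syndromes(error_locations, t):
    """S_j for j = 1..2t; the codeword itself contributes nothing."""
    out = []
    for j in range(1, 2 * t + 1):
        s = 0
        for loc in error_locations:
            s ^= gf_pow(ALPHA, loc * j)
        out.append(s)
    return out


def h_ems(betas, synd, t):
    """Try every binary error magnitude vector; test odd syndromes only."""
    for cnt in range(1 << len(betas)):
        gammas = [(cnt >> i) & 1 for i in range(len(betas))]
        ok = True
        for j in range(1, 2 * t, 2):            # j = 1, 3, ..., 2t-1
            acc = 0
            for g, b in zip(gammas, betas):
                if g:
                    acc ^= gf_pow(b, j)
            if acc != synd[j - 1]:
                ok = False
                break
        if ok:
            return gammas
    return None                                  # errors outside the candidate set


t = 2
true_errors = [3, 10]                            # hypothetical error locations
candidate_locations = [3, 7, 10, 12]             # hypothetical least reliable bits
betas = [gf_pow(ALPHA, loc) for loc in candidate_locations]
gammas = h_ems(betas, syndromes(true_errors, t), t)
print(gammas)
```

The search cost grows as $2^{2t}$, which is why this solver suits small $t$ and a Vandermonde solver is preferable for large $t$.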
Fig. 21. Heuristic Error Magnitudes Solver Architecture
(b) Björck-Pereyra Error Magnitudes Solver (BP-EMS)
Table 4
Björck-Pereyra Algorithm
Since the $\beta_{c_i}$ matrix is a Vandermonde matrix, the Björck-Pereyra algorithm [B-9][B-10] shown in Table 4 can calculate the error magnitudes efficiently for a large matrix. In the Björck-Pereyra algorithm, the variable $S_i$, which initially contains the $i$-th syndrome value, is updated iteratively. Instead of using the β-matrix to compute (2), the algorithm uses the β-vector to reduce the implementation complexity. After all computations, $S_i$ holds the $i$-th error magnitude. From Table 4, the Björck-Pereyra algorithm requires division, multiplication, and addition operations. Notice that the multiplier can be shared if the divider is decomposed into an inversion and a multiplication.
Thus, as shown in Fig. 22, the BP-EMS contains only 1 multiplier, 1 inverter, 3 adders, and control logic. The control logic determines the computation order of the syndromes and $\beta_{c_i}$, and the computation results are used to update each $S_i$ value. The inversion in the proposed architecture is carried out in a composite field because finite field inversion over $GF(2^m)$ is costly and a table-lookup implementation is infeasible for large $m$.
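A software sketch of a Vandermonde solve of this kind is given below. It follows the textbook form of the Björck-Pereyra recursion for a system $\sum_i x_i \beta_i^{\,j} = b_j$ and may differ in ordering and indexing from the version in Table 4, which is not reproduced here. The toy field $GF(2^4)$, the polynomial $x^4 + x + 1$, and the example locations are assumptions. Because the syndromes use powers $1$ to $2t$ rather than $0$ to $2t-1$, the sketch solves for $x_i = \gamma_i \beta_i$ and divides by $\beta_i$ at the end.

```python
M = 4                  # toy field GF(2^4); an assumption for illustration
POLY = 0b10011         # x^4 + x + 1, assumed primitive polynomial
ALPHA = 2


def gf_mul(a, b):
    """Multiply two GF(2^M) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


def gf_inv(a):
    """Inverse of a non-zero element: a^(2^M - 2)."""
    return gf_pow(a, 2**M - 2)


def bp_solve(betas, synd):
    """Solve sum_i gamma_i * beta_i^j = S_j, j = 1..n, with n = len(betas)."""
    n = len(betas)
    b = list(synd)                         # b[j] holds S_{j+1}
    # Stage 1: eliminate the nodes one by one (subtraction is XOR here).
    for k in range(n - 1):
        for i in range(n - 1, k, -1):
            b[i] ^= gf_mul(betas[k], b[i - 1])
    # Stage 2: divided-difference style back substitution.
    for k in range(n - 2, -1, -1):
        for i in range(k + 1, n):
            b[i] = gf_mul(b[i], gf_inv(betas[i] ^ betas[i - k - 1]))
        for i in range(k, n - 1):
            b[i] ^= b[i + 1]
    # b[i] now equals gamma_i * beta_i; remove the extra factor beta_i.
    return [gf_mul(v, gf_inv(beta)) for v, beta in zip(b, betas)]


true_errors = [3, 10]                      # hypothetical error locations
candidate_locations = [3, 7, 10, 12]       # hypothetical least reliable bits
betas = [gf_pow(ALPHA, loc) for loc in candidate_locations]
synd = []
for j in range(1, len(betas) + 1):
    s = 0
    for loc in true_errors:
        s ^= gf_pow(ALPHA, loc * j)
    synd.append(s)

magnitudes = bp_solve(betas, synd)
print(magnitudes)
```

Only the β-vector is stored, and each division is an inversion followed by a multiplication, which matches the sharing of one multiplier and one inverter described above.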
Fig. 22. Björck-Pereyra Error Magnitudes Solver Architecture
A composite field [B-11] is viewed as an extension field of $GF(2^k)$ when $m = kr$. The finite field $GF(2^m)$ can then be constructed with coefficients from the subfield $GF(2^k)$. Operating in the subfield leads to lower implementation complexity and better computation efficiency. For example, every element in $GF(2^{16})$ can be represented by $bx + c$, where $b$ and $c$ are over $GF(2^8)$, and with the polynomial $x^2 + x + \psi$ [B-11] the inverse of $bx + c$ can be derived as
$$(bx + c)^{-1} = b\,(b^2\psi + bc + c^2)^{-1}\,x + (b + c)\,(b^2\psi + bc + c^2)^{-1} \tag{4}$$
The composite field inversion over $GF(2^{16})$ requires only 2.1K gates in 90 nm CMOS technology, whereas inversion with the look-up table method requires about 186K gates.
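The composite field inversion can be checked numerically with the sketch below. To keep an exhaustive check cheap, it uses the smaller tower $GF((2^4)^2)$ instead of $GF((2^8)^2)$; the subfield polynomial $x^4 + x + 1$ and the value of $\psi$ found by search are assumptions of this example, not parameters taken from the report. An element $bx + c$ is stored as the pair `(b, c)`, and only one subfield inversion is needed per composite inversion.

```python
M = 4                  # subfield GF(2^4); an assumption to keep the check small
POLY = 0b10011         # x^4 + x + 1, assumed primitive polynomial


def gf_mul(a, b):
    """Multiply two subfield elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def gf_inv(a):
    """Subfield inverse of a non-zero element: a^(2^M - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = gf_mul(r, a)
    return r


def find_psi():
    """Smallest psi for which x^2 + x + psi has no root in the subfield."""
    for psi in range(1, 2**M):
        if all(gf_mul(y, y) ^ y ^ psi for y in range(2**M)):
            return psi
    raise ValueError("no irreducible polynomial of the form x^2 + x + psi")


PSI = find_psi()


def ext_mul(u, v):
    """(b1*x + c1) * (b2*x + c2) reduced with x^2 = x + PSI."""
    (b1, c1), (b2, c2) = u, v
    bb = gf_mul(b1, b2)
    return (bb ^ gf_mul(b1, c2) ^ gf_mul(b2, c1),
            gf_mul(c1, c2) ^ gf_mul(bb, PSI))


def ext_inv(u):
    """Inverse of b*x + c using a single subfield inversion."""
    b, c = u
    d = gf_mul(gf_mul(b, b), PSI) ^ gf_mul(b, c) ^ gf_mul(c, c)
    d_inv = gf_inv(d)
    return (gf_mul(b, d_inv), gf_mul(b ^ c, d_inv))


all_ok = all(
    ext_mul((b, c), ext_inv((b, c))) == (0, 1)
    for b in range(2**M)
    for c in range(2**M)
    if (b, c) != (0, 0)
)
print("every non-zero element times its inverse equals 1:", all_ok)
```

The denominator $b^2\psi + bc + c^2$ is never zero for a non-zero element because $x^2 + x + \psi$ has no root in the subfield.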
c) Simulation and Implementation Result in DMB-T and DVB-S2 Systems
In the DMB-T and DVB-S2 systems, BCH (762,752) over $GF(2^{10})$ and BCH (32400,32208) over $GF(2^{16})$, respectively, are concatenated with LDPC codes. Fig. 23 shows the simulation results for the DMB-T system with LDPC (7493,3048) at 20 iterations. Similarly, the simulation results for the DVB-S2 system with LDPC (64800,32400) at 50 iterations are shown in Fig. 24. The proposed soft BCH decoders provide a 0.5 dB gain in DMB-T and similar performance in DVB-S2 at BER = $10^{-5}$. The implementations of these two BCH codes are summarized in Table 5. Each BCH code is implemented with both hard-decision and soft-decision methods.
Fig. 23. Simulation results for BCH (762, 752, 1) in DMB-T system under BPSK modulation and AWGN channel
Fig. 24. Simulation results for BCH (32400, 32208, 12) in DVB-S2 system under QPSK modulation and AWGN channel
Table 5
Summary of Implementation Results
For BCH (762,752), the key equation procedure is not needed because $t = 1$. To eliminate Chien search, the hard BCH decoder uses a look-up table to find the error location, and the soft BCH decoder uses the H-EMS architecture. Because all the combination values are calculated in one cycle, the gate count of the soft BCH decoder is only 38.8% of that of the hard BCH decoder at the same latency and operating frequency. For BCH (32400,32208) with $t = 12$, the hard BCH decoder uses the iBM algorithm to solve the key equation and needs Chien search to find the error locations. By inserting registers in the composite field inversion, the operating frequency of the soft BCH decoder with BP-EMS is raised from 166 MHz to 333 MHz with only a 2.5% latency increase in the overall decoding procedure. Because it computes the error locations without Chien search, the soft BCH decoder has almost half the latency of the hard BCH decoder and therefore much higher throughput. The measurement results show that the soft BCH decoder saves 50.0% of the gate count and 47.4% of the clock cycle latency compared with the hard BCH decoder. Fig. 25 is the chip microphoto of the soft BCH (32400,32208) decoder.
Fig. 25. Microphoto of Soft BCH(32400,32208) Chip
2. An Improved Soft BCH Decoder with One Extra Error Compensation
a) PROPOSED COMPENSATION SOFT BCH DECODING
The proposed soft BCH decoder shown in Fig. 26 includes three major steps: the syndrome calculator, the error locator evaluator, and the compensation error magnitude solver. From the received polynomial $R(x)$, the coefficients of the syndrome polynomial $S(x) = S_1 + S_2 x + \cdots + S_{2t} x^{2t-1}$ are expressed as
$$S_j = R(\alpha^j) = \sum_i (\alpha^{e_i})^j = \sum_i \beta_{e_i}^{\,j} \tag{8}$$
for $j = 1, 2, \ldots, 2t$, where $\alpha$ is the primitive element of $GF(2^m)$. Notice that $e_i$ is the $i$-th actual error location and $\beta_{e_i} = \alpha^{e_i}$ is the corresponding error locator.
Fig. 26. Soft Decision BCH Decoding Block Diagram
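With soft inputs, the error locator evaluator chooses the $2t$ least reliable inputs and evaluates their corresponding error locators to form the error locator set $B = [\beta_{l_1}, \beta_{l_2}, \ldots, \beta_{l_{2t}}]^T$. The error location set $L = [l_1, l_2, \ldots, l_{2t}]^T$ can also be calculated with $B$ because the error locator of the $l_i$-th location is $\beta_{l_i} = \alpha^{l_i}$. The relation between $B$ and the syndrome vector $S = [S_1, S_2, \ldots, S_{2t}]^T$ can be formulated as
$$\begin{bmatrix} \beta_{l_1} & \beta_{l_2} & \cdots & \beta_{l_{2t}} \\ \beta_{l_1}^2 & \beta_{l_2}^2 & \cdots & \beta_{l_{2t}}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{l_1}^{2t} & \beta_{l_2}^{2t} & \cdots & \beta_{l_{2t}}^{2t} \end{bmatrix} \begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \vdots \\ \gamma_{2t} \end{bmatrix} = \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_{2t} \end{bmatrix} \tag{9}$$
where $\Gamma = [\gamma_1, \gamma_2, \ldots, \gamma_{2t}]^T$ is the error magnitude set corresponding to $B$, and the $2t \times 2t$ matrix in (9) is defined as the β-matrix $\mathbf{B}$. Let $\Delta = [\delta_1, \delta_2, \ldots, \delta_{2t}]^T$ be defined as
$$\Delta = S - \mathbf{B}\,\Gamma \tag{10}$$
From (8) and (9), it is evident that if all the errors are in the error location set, the exact $\gamma_i$ values can be determined and $\Delta$ is all zero; otherwise, this decoding approach fails to correct the errors. At most $2t$ error locations can be determined. However, it is quite likely that only one error lies outside $L$, in which case the decoder cannot correct any error. To improve the error correcting ability, we additionally check whether $\Delta$ is a geometric sequence in order to compensate for one error location outside $L$. A geometric sequence $\Delta = [\beta_{l_{loss}}, \beta_{l_{loss}}^2, \ldots, \beta_{l_{loss}}^{2t}]$ means that the missing error location $l_{loss}$ can be found, where $\beta_{l_{loss}} = \alpha^{l_{loss}}$. For example, if there are four errors at the 1st, 3rd, 5th, and 9th locations for a BCH (255,239) decoder, which can correct 2 errors, $S$ is expressed as
$$S = \begin{bmatrix} \alpha^{1} + \alpha^{3} + \alpha^{5} + \alpha^{9} \\ \alpha^{2} + \alpha^{6} + \alpha^{10} + \alpha^{18} \\ \alpha^{3} + \alpha^{9} + \alpha^{15} + \alpha^{27} \\ \alpha^{4} + \alpha^{12} + \alpha^{20} + \alpha^{36} \end{bmatrix} \tag{11}$$
In the case that the decoder collects $B = [\beta_1, \beta_3, \beta_6, \beta_9]$ and $\Gamma = [1, 1, 0, 1]$, $\Delta$ becomes
$$\Delta = \begin{bmatrix} \alpha^{5} \\ \alpha^{10} \\ \alpha^{15} \\ \alpha^{20} \end{bmatrix} = \begin{bmatrix} \beta_5 \\ \beta_5^2 \\ \beta_5^3 \\ \beta_5^4 \end{bmatrix} \tag{12}$$
Then not only the errors at the 1st, 3rd, and 9th locations but also the error at the 5th location can be corrected. Therefore, the proposed compensation soft BCH decoder can correct at most $2t+1$ errors.
The compensation error magnitude solver (CEMS) shown in Fig. 26 is used to solve (9) and (10) to obtain $\Gamma$ and $\Delta$. For those $\gamma_i$ equal to 1, the corresponding $l_i$, together with $l_{loss}$, are the exact error locations. The codeword polynomial $C(x)$ is obtained by inverting the values at the error locations in the received polynomial $R(x)$.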
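The BCH (255,239) example above can be reproduced numerically with the sketch below. The field polynomial is not given in the text, so the commonly used primitive polynomial $x^8 + x^4 + x^3 + x^2 + 1$ (0x11D) is assumed; the conclusion does not depend on that choice. The helper names are illustrative.

```python
POLY = 0x11D                       # assumed primitive polynomial for GF(2^8)

# Exponent and logarithm tables for alpha = 2.
EXP = [1] * 255
for i in range(1, 255):
    v = EXP[i - 1] << 1
    if v & 0x100:
        v ^= POLY
    EXP[i] = v
LOG = {v: i for i, v in enumerate(EXP)}


def alpha_pow(e):
    return EXP[e % 255]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 255]


def xor_sum(values):
    acc = 0
    for v in values:
        acc ^= v
    return acc


t = 2
error_locations = [1, 3, 5, 9]         # actual errors in the example
collected_locations = [1, 3, 6, 9]     # locations behind B = [b1, b3, b6, b9]
gamma = [1, 1, 0, 1]                   # trial error magnitudes

# Syndromes S_1..S_2t of the error pattern.
S = [xor_sum(alpha_pow(l * j) for l in error_locations)
     for j in range(1, 2 * t + 1)]

# Delta = S - B*Gamma; subtraction is XOR in GF(2^m).
delta = [S[j - 1] ^ xor_sum(alpha_pow(l * j)
                            for l, g in zip(collected_locations, gamma) if g)
         for j in range(1, 2 * t + 1)]

# Geometric sequence check: delta_{j+1} = delta_j * delta_1.
is_geometric = delta[0] != 0 and all(
    gf_mul(delta[j], delta[0]) == delta[j + 1] for j in range(len(delta) - 1))

l_loss = LOG[delta[0]] if is_geometric else None
print("geometric:", is_geometric, "missing location:", l_loss)
```

The first entry of $\Delta$ directly identifies the missing locator, so a single logarithm look-up recovers the extra error location.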
To obtain the $\gamma_i$ values in (9), Gaussian elimination is the most intuitive method, but its complexity is $O(n^3)$. In BCH codes, a valid error magnitude in (9) is either 0 or 1, so the problem can be reformulated as checking all combinations of $\gamma_i$ over $GF(2)$ instead of calculating real error magnitudes. A $2t$-bit counter is used to perform a heuristic search over all binary combinations. Since $S_1^2 = S_2$, $S_2^2 = S_4$, ..., $S_t^2 = S_{2t}$ in BCH codes, the check on the even-indexed syndromes can be eliminated to simplify (9) as:
(13)
The complexity is significantly reduced because only the half-size matrix $\mathbf{B}_{odd}$ and vector $S_{odd}$ are used in (13).
The following steps illustrate the details of the efficient implementation of CEMS.
By stepping through the values of $\Gamma$, a heuristic search over all binary combinations is completed. At each iteration, the solver verifies whether the geometric sequence check holds.
b) VLSI ARCHITECTURE FOR THE COMPENSATION SOFT BCH DECODER
(1) Error Locator Evaluator
As shown in Fig. 27, the error locator evaluator architecture includes the reliability part, the error locator part, and the error location part. The upper part is the reliability part, which stores the reliabilities of the $2t$ least reliable candidates $R_{l_1}, R_{l_2}, \ldots, R_{l_{2t}}$. The middle part is the error locator part, which constructs the error locator set $B$. Because the error locator of the $i$-th location is $\alpha^i$, the error locator of the $(i+1)$-th location is $\alpha$ times the error locator of the $i$-th location. If the input arrives serially, starting from the highest-degree coefficient of $R(x)$, the error locator of each input can therefore be computed by multiplying register REG by $\alpha^{-1}$. Thus, the error locator part can use a constant multiplier to calculate the error locator of each input. Notice that register REG initially contains the error locator of the first input. The bottom part is the error location part. The decoding method focuses on the least reliable bits instead of the whole codeword, so the error location part uses a counter to compute the error location $l_i$ corresponding to each $R_{l_i}$ for serial input. Hence, the Chien search procedure is no longer required and a large amount of redundant decoding latency is eliminated.
Fig. 27. Error Locator Evaluator Architecture for Serial Input
The error locator evaluator sorts the soft inputs to choose the $2t$ least reliable inputs as the candidate reliabilities $R_{l_1}, R_{l_2}, \ldots, R_{l_{2t}}$. Their corresponding error locators $\beta_{l_i}$ and error locations $l_i$ are also calculated and stored in registers. The evaluator compares each soft input with $R_{l_i}$ and then generates the select signals $SEL_i$ that control the multiplexers. In the $i$-th stage, if the input is smaller than $R_{l_{i-1}}$, the $i$-th stage value is updated with the $(i-1)$-th stage value. If the input is greater than $R_{l_{i-1}}$ and smaller than $R_{l_i}$, the $i$-th stage value is updated with the input value. Otherwise, the $i$-th stage holds its current value.
(2) Compensation Error Magnitude Solver (CEMS)
The compensation error magnitude solver (CEMS) in Fig. 28 is employed to evaluate (13), given $S_{odd}$ and $B$. In total, $2t^2$ registers are used to store the entries of the $\mathbf{B}_{odd}$ matrix. The initial value of the registers in each row is set to $B$, so that the output of the SQUARE units is always $\beta_{l_i}^2$ for the first $t-1$ cycles. Iteratively multiplied by $\beta_{l_i}^2$, the bottom registers generate $\beta_{l_i}^{2j+1}$ for $i = 1, \ldots, 2t$ and $j = 0, \ldots, t-1$. Thus, only $2t$ multipliers in total are used for the $\mathbf{B}_{odd}$ calculation. After $t-1$ cycles, $\mathbf{B}_{odd}$ is constructed and the registers stop updating. The matrix multiplication is evaluated in the following $2^{2t}$ cycles. By stepping through the values of $\Gamma$, a heuristic search over all binary combinations is completed. At each iteration, each $\beta_{l_i}^{\,j}$ value is combined with $\gamma_i$, and the solver verifies whether the geometric sequence check holds. If $\Delta_{odd}$ is a geometric sequence, then $\delta_i \times \delta_1^2 = \delta_{i+2}$. CEMS uses $t$ multipliers to check this relation and uses a look-up table (LUT) to find $l_{loss}$ from $\delta_1$.
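The register-based construction of the odd-power matrix and the odd-indexed geometric check can be sketched as follows. This is a behavioural model only: the field polynomial 0x11D, the example locations, and the trial magnitude vector are assumptions chosen to match the earlier BCH (255,239) example, and `LOG` stands in for the LUT.

```python
POLY = 0x11D                       # assumed primitive polynomial for GF(2^8)

EXP = [1] * 255
for i in range(1, 255):
    v = EXP[i - 1] << 1
    if v & 0x100:
        v ^= POLY
    EXP[i] = v
LOG = {v: i for i, v in enumerate(EXP)}


def alpha_pow(e):
    return EXP[e % 255]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 255]


def xor_sum(values):
    acc = 0
    for v in values:
        acc ^= v
    return acc


t = 2
locations = [1, 3, 6, 9]                       # candidate locations l_i
B = [alpha_pow(l) for l in locations]          # error locator set
squares = [gf_mul(b, b) for b in B]            # outputs of the SQUARE units

# Row j holds beta_i^(2j+1); each cycle multiplies the previous row by beta_i^2.
row = list(B)
B_odd = [row]
for _ in range(t - 1):
    row = [gf_mul(r, s) for r, s in zip(row, squares)]
    B_odd.append(row)

error_locations = [1, 3, 5, 9]                 # hypothetical actual errors
S_odd = [xor_sum(alpha_pow(l * (2 * j + 1)) for l in error_locations)
         for j in range(t)]                    # S_1, S_3, ...

gamma = [1, 1, 0, 1]                           # one counter value under test
delta_odd = [S_odd[j] ^ xor_sum(b for b, g in zip(B_odd[j], gamma) if g)
             for j in range(t)]

# Odd-indexed geometric check: delta_i * delta_1^2 == delta_{i+2}.
d1_sq = gf_mul(delta_odd[0], delta_odd[0])
odd_geometric = delta_odd[0] != 0 and all(
    gf_mul(delta_odd[j], d1_sq) == delta_odd[j + 1] for j in range(t - 1))

l_loss = LOG[delta_odd[0]] if odd_geometric else None
print("odd geometric:", odd_geometric, "l_loss:", l_loss)
```

Only one multiplication per locator is needed in each construction cycle, which is consistent with the $2t$ multipliers quoted for the $\mathbf{B}_{odd}$ calculation.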
Fig. 28. Compensation Error Magnitude Solver Architecture
(3) Architecture Comparison
The architectures of a hard BCH decoder and the proposed soft BCH decoder are compared in Table 6. In finite field arithmetic, a multiplier is much more complex than a register. Because it uses fewer multipliers, the proposed soft BCH decoder, with more registers and an additional LUT, has hardware complexity similar to that of the hard BCH decoder with the inversionless Berlekamp-Massey (iBM) algorithm [B-15]. Moreover, finding the error locations during the error locator evaluator procedure saves a large amount of latency. Therefore, the proposed soft BCH decoder can provide higher throughput with almost the same hardware complexity as the traditional hard BCH decoder. For example, for the BCH (255,239) code, the proposed soft BCH decoder has 20 registers, 1 LUT, and 5 multipliers, while the hard BCH decoder has 12 registers and 9 multipliers. Furthermore, the proposed decoder requires only 53% of the latency of the traditional hard BCH decoder.
Table 6
Comparison Table for A (n, k, t) BCH Code
c) SIMULATION AND IMPLEMENTATION RESULTS
Simulation and implementation results for our proposed soft BCH decoder are presented in this section. Fig. 29 shows the performance comparison for the 2-error-correcting (255,239) BCH code under BPSK modulation in an AWGN channel. The achieved coding gain is about 0.75 dB over the hard BCH decoder at BER = $10^{-5}$. Compared with GMD [B-4] and sub-optimum MAP [B-7], our proposed decoder provides 0.35 dB and 0.2 dB more coding gain, respectively.
Fig. 29. Simulation results for BCH (255,239) code
The BCH (255,239) decoder is implemented with both hard-decision and soft-decision methods, and the results are summarized in Table 7. The hard BCH decoder uses the iBM algorithm to solve the key equation and needs Chien search to find the error locations. Because it computes the error locations without Chien search, the soft BCH decoder has almost half the latency of the hard BCH decoder and therefore much higher throughput. According to the post-layout simulations, the soft BCH decoder saves 47.1% of the clock cycle latency with similar gate count and operating frequency compared with the hard BCH decoder in standard 90 nm CMOS technology.
Table 7
Summary of Implementation Results
3. Soft RS Decoder Chip for Optical Communication System
a) Proposed Soft RS Decoding Algorithm
First, based on the received soft information, the $\eta$ least reliable positions (LRPs), $[l_0, l_1, \ldots, l_{\eta-1}]$, are determined and $S(x)$ is calculated simultaneously. The candidate sequences are generated with a Gray-code-based bit flipping method, so that only one of these LRPs is flipped between successive candidates. As a result, $S^{(i+1)}(x)$ for the $(i+1)$-th candidate can be updated with the method in step 2 of Algorithm 1. $S_j^{(i+1)}$ is the $j$-th coefficient of $S^{(i+1)}(x)$ and $e'_k \times \alpha^{l_k \times j}$ is the compensation value, which can be viewed as the error pattern induced by the bit-flipping operation on the $k$-th LRP. After updating the syndrome $S^{(i+1)}(x)$ and calculating the corresponding $\Lambda^{(i+1)}(x)$, we impose the condition that only a $\Lambda^{(i+1)}(x)$ with degree less than $t$ is sent to Chien search to find the error locations, because such a $\Lambda^{(i+1)}(x)$ is very likely to be within the error correction capability. If the condition is met, the candidate sequence is decoded as the output message and the decoding procedure terminates. Otherwise, the next candidate is generated and the above steps are repeated. If none of the $2^{\eta}-1$ candidates meets the condition, the received signal is decoded without the condition, so the error correction capability of hard RS decoders is guaranteed.
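The incremental syndrome update under Gray-code flipping can be sketched as below. Algorithm 1 is not reproduced in this text, so the sketch relies only on the compensation term $e'_k \times \alpha^{l_k \times j}$ described above. The field polynomial 0x11D, the syndrome indexing $j = 1, \ldots, 2t$, the received word, and the LRP flip patterns are all assumptions made for illustration. The loop checks each incremental update against a full recomputation.

```python
POLY = 0x11D                       # assumed primitive polynomial for GF(2^8)

EXP = [1] * 255
for i in range(1, 255):
    v = EXP[i - 1] << 1
    if v & 0x100:
        v ^= POLY
    EXP[i] = v
LOG = {v: i for i, v in enumerate(EXP)}


def alpha_pow(e):
    return EXP[e % 255]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 255]


def syndromes(word, count):
    """S_j = word(alpha^j) for j = 1..count; word[l] is the coefficient of x^l."""
    out = []
    for j in range(1, count + 1):
        s = 0
        for loc, sym in enumerate(word):
            if sym:
                s ^= gf_mul(sym, alpha_pow(loc * j))
        out.append(s)
    return out


NUM_SYNDROMES = 16                 # 2t for RS(255,239)

# Hypothetical LRPs: (location l_k, symbol flip pattern e'_k).
lrps = [(17, 0x04), (101, 0x80), (200, 0x01)]

# Hypothetical received word; only a few non-zero symbols for brevity.
received = [0] * 255
received[5], received[17], received[140] = 0x1F, 0x33, 0xA0

candidate = list(received)
synd = syndromes(received, NUM_SYNDROMES)
mismatches = 0
for i in range(1, 2 ** len(lrps)):
    # Successive Gray codes differ in the position of the lowest set bit of i.
    k = (i & -i).bit_length() - 1
    loc, flip = lrps[k]
    candidate[loc] ^= flip
    # Incremental update: add the compensation value e'_k * alpha^(l_k * j).
    synd = [s ^ gf_mul(flip, alpha_pow(loc * j))
            for j, s in enumerate(synd, start=1)]
    if synd != syndromes(candidate, NUM_SYNDROMES):
        mismatches += 1

print("mismatches against full recomputation:", mismatches)
```

Each new candidate costs one compensation value per syndrome coefficient instead of a full syndrome calculation over the whole received word.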
b) VLSI Architecture for Soft RS Decoder
For the 2.5 Gb/s requirement of the optical communication systems, a soft RS (255,239) decoder with three pipelined stages based on our decision-confined decoding algorithm is presented and the decoding scheme is shown in Fig. 30. The following subsections will show the unique parts of our proposal in contrast to conventional hard decoders.
(1) Syndrome Updater
According to the method in step 2 of Algorithm 1, the candidate syndrome $S^{(i+1)}(x)$ can be updated from $S^{(i)}(x)$ by using a look-up table (LUT) instead of recalculating it with the syndrome calculator, which improves cost efficiency. Note that there are at most $2^5$ candidates for each received message and 259 computational cycles for each pipelined stage. Thus 8 computational cycles are available for each $S^{(i+1)}(x)$ and $\Lambda^{(i+1)}(x)$. As a result, the finite field multipliers (FFMs) and the squarers can be shared to compute the 16 compensation values, which further reduces hardware. In our design, the calculation of all compensation values costs only 4 FFMs and 2 squarers, as shown in Fig. 31.
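The cycle budget quoted above follows from a short calculation, sketched here with illustrative variable names. The number of compensation values per cycle is derived under the assumption that the 16 values are spread evenly over the available cycles; the text does not state this schedule explicitly.

```python
cycles_per_stage = 259                   # computational cycles per pipelined stage
eta = 5                                  # number of least reliable positions
max_candidates = 2 ** eta                # at most 2^5 candidate sequences

# Cycles available to update the syndrome and solve for one candidate.
cycles_per_candidate = cycles_per_stage // max_candidates

# 2t = 16 compensation values must be produced within that budget
# (assumption: the values are spread evenly over the cycles).
num_compensation_values = 16
values_per_cycle = -(-num_compensation_values // cycles_per_candidate)  # ceiling

print(cycles_per_candidate, values_per_cycle)
```

The same 8-cycle budget is what the key equation solver has to meet for each candidate.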
Fig. 30. Decoding scheme of the proposed soft RS decoder
Fig. 31. Syndrome updater
(2) Half-iteration RiBM
The conventional KES needs $2t$ iterations to solve the key equation $\Omega(x) = S(x) \times \Lambda(x) \bmod x^{2t}$. For RS (255,239), it takes 16 cycles to calculate $\Lambda(x)$. Instead of using two KES units to meet the 8-cycle timing constraint, which results in high complexity and difficult signal control, we propose a half-iteration RiBM algorithm on the basis of [B-17] and [B-20] to shorten the latency of the KES. Combining the advantages of a homogeneous architecture and half the computation latency, half-iteration RiBM fully meets our requirements for the KES. The structure of the processing element of half-iteration RiBM (H-PE) is depicted in Fig. 32, and the KES can be implemented with $2t + 1$ H-PEs as illustrated in Fig. 33.
Fig. 32. The processing element of Half-iteration RiBM (H-PE)
Fig. 33. The homogeneous architecture of Half-iteration RiBM
(3) BP-based Error Value Evaluator
Conventionally, after Chien search evaluates the error locators $X_i$, the corresponding error values $e_i$ are calculated from $\Lambda(x)$ and $\Omega(x)$ with Forney's algorithm. Alternatively, the BP-based method [B-18] computes the error values by solving the Vandermonde relation between the syndromes $S_i$ and the error locators $X_i$ in the following form.
Since Forney's algorithm and the BP-based method have nearly the same hardware cost, our half-iteration RiBM method removes the calculation of $\Omega(x)$ for further area efficiency. Based on the BP method, the error value evaluator can be implemented with the architecture shown in Fig. 34.
Fig. 34. BP-based error value evaluator
c) Simulation and Implementation Result
Fig. 35. Performance of the proposed soft decoding algorithm
Fig. 35 shows the RS (255,239) simulation results for our proposed algorithm with different $\eta$ under BPSK modulation and an AWGN channel. The performance gain over hard decoding at a CER of $10^{-4}$ is 0.4 dB with $\eta = 5$. Compared with the Chase algorithm with $\eta = 3$, our proposed method achieves a competitive coding gain with $\eta = 5$. Although it needs more LRPs, the average computational complexity of our proposal is much lower than that of the Chase algorithm.
For instance, at $E_b/N_0 = 7$ dB with $\eta = 5$, the average numbers of computations of the syndrome updater, KES, Chien search, and error value evaluator in our approach are 1.07, 1.07, 1, and 1, respectively. In contrast, the Chase algorithm with $\eta = 3$ requires $2^3$ computations for every decoding block.
Table 8
Table 9
Table 8 shows the comparison with the LCC-based soft RS decoder. Our proposal achieves more than 40% area reduction, even though the comparison does not include the cost of the decision making unit required in [B-19]. In addition, our design operates with only half the latency for each pipeline stage and with fewer pipeline stages. Fig. 36 shows our decoder chip, which is, to the best of our knowledge, the first soft RS decoder chip. Hence, Table 9 compares the implementation results of our soft RS decoder with those of hard RS decoders. Implemented in a 90 nm CMOS process, our chip with 45.3K gates is comparable to a conventional hard decoder. Moreover, it is well suited to 10-40 Gb/s optical fiber systems using 16 RS decoders and to 2.5 Gb/s GPON applications, with 0.4 dB coding gain over hard decoders at a CER of $10^{-4}$.
Fig. 36. Microphoto of soft RS (255,239) chip
C. Viterbi Decoder
1. A Low-Power Viterbi Decoder Based on Scarce State Transition and Variable Truncation Length
a) Power Reduction with Variable Truncation Length
As indicated in the Viterbi algorithm, the decoder output is the codeword that maximizes the conditional probability of the received sequence. Therefore, the entire received sequence should be stored and analyzed before any decoding output is produced. Nevertheless, the received sequence may be long, and the survivor path should be truncated to reduce the storage requirement and decoding latency. If the truncation length $T$ is large enough, about five times the constraint length, the performance approaches that of maximum-likelihood decoding.
Fig. 37 illustrates the survivor paths stored in a survivor memory. For a $2^{\upsilon}$-state Viterbi decoder, all $2^{\upsilon}$ survivor paths merge with high probability. Consequently, it is more efficient to store the merged path rather than the $2^{\upsilon}$ paths after the merged stage. The required truncation length depends strongly on the channel conditions, as listed in Table 10. Based on the variable truncation length property [C-9], we design a path merging detection unit to reduce the power consumption in the survivor memory.
Fig. 37. Survivor paths stored in the survivor memory
Table 10
Average required truncation length for path merging in different channel condition
Eb/N0 (dB)          1.0    2.0    3.0    4.0    5.0
Truncation length   33.78  26.86  23.34  21.54  20.54
(1) VARIABLE TRUNCATION LENGTH
The Viterbi decoder for the MB-OFDM UWB system has 64 states. Fig. 38 illustrates the survivor memory on the radix-4 trellis. D0 to D63 are the decisions provided by the ACS units for selecting survivor paths. Based on the path merging property, the contents of the 64 states tend to become equivalent from the left stages toward the right stages, which are more reliable.
The path merging detection unit finds the merge point, that is, the stage in the trellis at which the paths have merged. Obviously, if the contents of all 64 survivors are equivalent at the same stage, the 64 survivor paths have merged. However, checking all 64 states concurrently is complex. To reduce the hardware complexity, our proposal detects path merging by dividing the 64 states into 16 groups that are verified separately. The simulation results show that this scheme has no performance loss. We assume that the 64 survivor paths have merged and that the value in state 0 is already reliable if every group (the circles in Fig. 38) contains equivalent values at the same stage. After detecting the merge point, we apply clock gating to the registers in the shaded region and directly shift out the value. The state 0 path is considered the correct one, and the others are dropped.
Fig. 39 illustrates the survivor memory architecture with variable truncation length. The registers of each stage are connected to the path merging detection unit, which decides the merge point and generates the clock gating signals for each stage. With this scheme, the truncation length is adjusted dynamically, depending on the channel. In high-SNR environments, a shorter truncation length is required and clock gating can be applied to more registers, resulting in a power-efficient survivor memory.
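The group-wise merge detection can be modelled in software as in the sketch below. It is an illustration under assumptions that are not stated in the report: groups are formed from four consecutive state indices, a higher stage index denotes an older and more reliable stage, and the survivor contents are synthetic. The function returns the earliest stage from which every group holds equal values, which is the point beyond which the registers could be clock gated.

```python
NUM_STATES = 64
GROUP_SIZE = 4        # 16 groups of 4 states; grouping by index is an assumption


def stage_is_merged(column):
    """True if every group of states holds a single value at this stage."""
    return all(
        len(set(column[g:g + GROUP_SIZE])) == 1
        for g in range(0, NUM_STATES, GROUP_SIZE)
    )


def find_merge_point(survivors):
    """Earliest stage index from which all older stages pass the group check.

    survivors[state][stage]; a higher stage index is an older stage (assumed).
    Returns the number of stages if no stage has merged.
    """
    num_stages = len(survivors[0])
    merge_point = num_stages
    for stage in range(num_stages - 1, -1, -1):
        column = [path[stage] for path in survivors]
        if not stage_is_merged(column):
            break
        merge_point = stage
    return merge_point


# Synthetic survivor memory: stages 5..7 have merged, stages 0..4 have not.
NUM_STAGES = 8
MERGED_FROM = 5
survivors = [
    [(state % 4) if stage < MERGED_FROM else (stage % 4)
     for stage in range(NUM_STAGES)]
    for state in range(NUM_STATES)
]

merge_point = find_merge_point(survivors)
gated_stages = NUM_STAGES - merge_point
print("merge point:", merge_point, "stages eligible for clock gating:", gated_stages)
```

Checking 16 small groups instead of one 64-way comparison is what keeps the detection logic cheap, at the cost of treating group-wise agreement as evidence of a full merge.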
Fig. 38. Path merging detection scheme
Fig. 39. Survivor memory architecture with variable truncation length
(2) DESIGN PARAMETERS AND PERFORMANCE SIMULATION
The Viterbi decoder, based on the register-exchange approach, combines the SST and the path merging detection schemes to reduce power dissipation. The design parameters of the proposed Viterbi decoder are listed in Table 11.
Fig. 40 shows the performance simulation results under BPSK modulation. Notice that the performance of the conventional scheme, the SST scheme, and the proposed scheme is approximately the same. Compared with the floating-point case, the performance degradation of 8-level soft decision is less than 0.5 dB. The proposed variable truncation length scheme therefore preserves the error performance.
Table 11
Design parameters of proposed Viterbi decoder
Technology              1.2V 0.13-μm 1P8M CMOS
State Number            64
Code Rate               1/3
Soft-Decision           8-levels
BM Width                6 bits
PM Width                9 bits
Max. Truncation length  64
ACS structure           radix-4
Fig. 40. Simulation results in AWGN channel, BPSK, 8-level soft decision and code rate=1/3
b) Power Simulation
We analyze the power dissipation of three implementations: the conventional register-exchange approach, the SST scheme without the variable truncation length scheme, and the SST scheme with it. Table 12 lists the gate counts of these implementations. We compare the power consumption of the three structures in different channel environments. Fig. 41(a) and Fig. 41(b) show the post-layout power estimation of the whole Viterbi decoder and of the survivor memory, respectively. The operating frequency is 250 MHz and the corresponding data rate is 500 Mbps due to the radix-4 ACS structure.
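The relation between the clock frequency and the data rate can be checked with the lines below. The sketch assumes that a radix-4 ACS unit processes two trellis stages, and hence produces two decoded bits, per clock cycle; the variable names are illustrative.

```python
clock_hz = 250e6          # operating frequency quoted in the text

# Assumption: radix-4 processing covers two trellis stages per clock cycle,
# each stage corresponding to one decoded bit.
bits_per_cycle = 2

data_rate_bps = clock_hz * bits_per_cycle
print(f"data rate = {data_rate_bps / 1e6:.0f} Mbps")
```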
For the conventional design, the channel conditions have no effect on the power dissipation. In the SST-only implementation, the decoder power dissipation is reduced in high-SNR environments; however, the power reduction is not pronounced because of the complex signal wire routing. In the proposed design, which combines the SST and the variable truncation length, the decoder power is significantly reduced, as shown in the figures. Fig. 41(b) shows the survivor memory power alone to highlight the effect of the dynamic truncation length. When the channel condition is good enough, the variable truncation length scheme lowers the survivor memory power by more than 60%.
Fig. 42 shows the power profiling of the conventional register-exchange structure and the proposed decoder when $E_b/N_0$ is 5.0 dB. The corresponding bit error rate in this channel condition is 2.56e-6. From Fig. 42(a), the survivor memory is a power-intensive block in conventional decoder designs. With the SST and variable truncation length schemes, the share of the survivor memory power is reduced significantly (see Fig. 42(b)). Furthermore, the SST unit and the path merging detection unit consume less than 2% of the decoder power.
Table 12
The gate counts of different implementations
Implementation            Gate counts
Conventional RE approach  108.5k
SST scheme                109.0k
Proposed                  116.9k
Fig. 41. Comparison of (A) Decoder power (B) Survivor memory power at 500Mbps
Fig. 42. The power profiling of (A) Conventional structure and (B) Proposed structure as Eb/N0 is 5.0dB