Correlation Circuits

Chapter 5 Architecture Design and Implementation

5.2 Mode/GI & Symbol Boundary Detection

5.2.2 Correlation Circuits

As shown in Fig. 3.16, the correlation part includes a complex multiplier and a multiplier. Eqn. (5.1) shows that the number of multipliers in the complex multiplication can be reduced at the cost of two additional adders and a subtractor [29]. The original four multipliers and two adders are therefore replaced by three multipliers and five adders, as shown in Fig. 5.5.

As a result, the complex multiplier design saves more than 12% of the area and reduces power consumption by 4%.
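The reduction from four multipliers to three can be illustrated with a short sketch. It uses one common three-multiplier factorization (two pre-adders, one pre-subtractor, two post-adders); whether this is exactly the form of Eqn. (5.1) is an assumption, and the function names are hypothetical.

```python
def complex_mul_4(a, b, c, d):
    """Direct form of (a + jb)(c + jd): four multiplications, two add/sub."""
    return a * c - b * d, a * d + b * c


def complex_mul_3(a, b, c, d):
    """Three-multiplier form: three multiplications, five add/sub in total."""
    k1 = c * (a + b)  # pre-adder, then multiplier 1
    k2 = a * (d - c)  # pre-subtractor, then multiplier 2
    k3 = b * (c + d)  # pre-adder, then multiplier 3
    real = k1 - k3    # equals a*c - b*d
    imag = k1 + k2    # equals a*d + b*c
    return real, imag
```

Both functions return the same (real, imaginary) pair for any integer inputs; the second trades one multiplier for three extra adders, which is the trade-off described above.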

Fig. 5.5 Complex multiplier reduction (signal labels: A, B, C, D, AC, DB)

5.2.3 Moving Sum Circuits

As shown in Fig. 3.16, the two outputs of the Gate block at the output of the moving-sum delay line are forced to zero during the “Dummy State” so that the initial integration can be performed.

After this, the delay-line output passes through the Gate block and is subtracted from the value in “D” to realize the moving sum. The square operation is realized by a complex multiplier and an integer multiplier.
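The gated moving sum can be sketched in software as follows. This is a behavioral model only, not the circuit of Fig. 3.16: the function name, the `stale` argument (leftover delay-line contents) and the use of a Python deque as the delay line are assumptions made for illustration.

```python
from collections import deque


def gated_moving_sum(samples, window, stale=None):
    """Running sum of the last `window` samples with a gated delay line."""
    stale = list(stale) if stale is not None else [0] * window
    if len(stale) != window:
        raise ValueError("stale contents must fill the delay line")
    delay_line = deque(stale, maxlen=window)
    acc = 0  # plays the role of register "D"
    sums = []
    for n, x in enumerate(samples):
        oldest = delay_line[0]
        # Gate: during the first `window` samples (the "Dummy State")
        # the delay-line output is forced to zero, so stale data is ignored.
        gated = 0 if n < window else oldest
        acc = acc + x - gated
        delay_line.append(x)  # a full deque drops its oldest entry
        sums.append(acc)
    return sums
```

For example, `gated_moving_sum([1, 2, 3, 4, 5, 6], 3, stale=[99, -7, 42])` yields `[1, 3, 6, 9, 12, 15]`: the stale values never reach the accumulator, and from the fourth sample on each output is the sum of the latest three samples.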

5.2.4 Finite Word-Length Simulation

Finite word-length simulation is necessary to keep the word-length from growing at every stage. To reduce the multiplier size of the square operation, the input to register “D” is the target of truncation. The moving-sum integration result does not need the full 23 bits (12 bits with a 2K integration length); maximum-value simulation shows that the upper MSBs can be cut, so the word-length is reduced from 23 to 19 bits. Because the LSBs are too small to influence the result after the square operation, the square input takes only bits [18:8] (11 bits) of “D”. Thus, the multiplier size is successfully reduced from 23×23 to 11×11. A segment of the MC² result around the peak before and after finite word-length truncation is shown in Fig. 5.6. The detailed architecture of the proposed blind mode/GI and boundary detection is shown in Fig. 5.7.

Fig. 5.6 MC² results (a) before (b) after finite word-length simulation
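The bit selection [18:8] can be illustrated with a small sketch. The helper names are hypothetical, and the value is treated as a non-negative integer for simplicity, which is an assumption; the actual register may hold signed data.

```python
def take_bits(value, msb, lsb):
    """Return bits [msb:lsb] of a non-negative integer, both ends inclusive."""
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


def approx_square(d):
    """Square only bits [18:8] of d: an 11x11 multiply.

    The result approximates d*d scaled down by 2**16, because the
    eight dropped LSBs are squared away as 2**8 * 2**8.
    """
    top = take_bits(d, 18, 8)
    return top * top
```

With `d = 300000`, `take_bits(d, 18, 8)` is 1171, so the product fed to the next stage is `1171 * 1171` instead of the full-width square.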

Fig. 5.7 Improved architecture of blind mode/GI and boundary detection (block labels: 1K6B*2 SRAM*8 (Delay N), 1K12B SRAM*4 (Delay Ng), Gate)

5.3 Scattered Pilot Synchronization

The block diagram of the scattered pilot mode detection is shown in Fig. 4.2. The complex multiplication is realized by two multipliers and an adder. Because this structure is the same as the square operation in Fig. 5.7, this part is shared with the blind mode/GI and boundary detection once the boundary has been detected.

The MUXs in Fig. 4.2 are replaced by gated clocks to reduce the power consumption and area cost.

Since the two inputs of each multiplier are identical, the product is never negative and the sign bit can be ignored.

After the finite word-length simulation, the complex multiplication output can be represented by its 7 MSBs and the accumulated value by 11 bits. Therefore, a register group is reduced from 34 to 11 bits. Overall, the improved architecture of the scattered pilot mode detection is shown in Fig. 5.8.

Fig. 5.8 Improved architecture of scattered pilot mode detection

5.4 Frequency Domain Channel Estimation

Since the channel estimation needs to store the scattered pilots of seven symbols, the storage is shared with the fourteen 1K SRAM modules of the mode/GI and boundary detection after the boundary is detected. The fourteen 1K SRAM modules are divided into seven groups, and each group consists of two 1K SRAM modules that store the real and imaginary parts of the scattered pilots. Each group stores the scattered pilots of only one symbol.

Since the scattered pilots are stored in seven 1K×2 SRAM groups, a read/write conflict may occur. For example, in Fig. 5.9 the gray area marks the pilots that share the same address index in an SRAM module. When the lower red scattered pilot is stored to a 1K SRAM module, the previous upper red scattered pilot is overwritten.

Two cycles later, the channel estimation needs the upper red scattered pilot, but it has already been overwritten. Thus, the R/W conflict causes a wrong scattered pilot to be read. To prevent this, the scattered pilot at the same address index is pre-read before the new scattered pilot is stored. Registers are therefore required to hold the pre-read scattered pilots from the outputs of the SRAM modules.

Fig. 5.9 Scattered pilot R/W conflict
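The pre-read idea can be modeled in a few lines. This is a behavioral sketch under simplifying assumptions: one SRAM group is a Python list, a single holding register is used, and the class and attribute names are hypothetical rather than taken from the design.

```python
class PilotBank:
    """One SRAM group with a holding register for the pre-read pilot."""

    def __init__(self, depth):
        self.mem = [0j] * depth  # stored scattered pilots (complex values)
        self.hold_addr = None    # address of the last pre-read
        self.hold_value = None   # pilot saved just before it was overwritten

    def store(self, addr, pilot):
        # Pre-read: latch the old pilot before the write destroys it.
        self.hold_addr = addr
        self.hold_value = self.mem[addr]
        self.mem[addr] = pilot
```

After `bank.store(2, 1 + 1j)` followed by `bank.store(2, 3 - 2j)`, the memory holds the new pilot while `bank.hold_value` still provides the old one to the channel estimator.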

The four time-domain interpolation equations in Eqn. (4.5) can be rewritten as Eqn. (5.2) by decomposing the scale factors. The multiplications are thereby converted into shifts and additions or subtractions, so four multipliers are saved.

Furthermore, the additions and subtractions are implemented using a CSA (Carry Save Adder) and a CPA (Carry Propagation Adder) to reduce the power consumption and area cost.
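Replacing a constant multiplication by shifts and additions can be sketched as below. The sketch assumes interpolation weights that are multiples of 1/4, which is an illustrative assumption; the actual coefficients are those of Eqn. (4.5), and the function names are hypothetical.

```python
def scale_quarters(x, k):
    """Return k*x/4 for k in 0..4 using only shifts and add/subtract."""
    if k == 0:
        return 0
    if k == 1:
        return x >> 2          # x/4
    if k == 2:
        return x >> 1          # x/2
    if k == 3:
        return x - (x >> 2)    # 3/4 = 1 - 1/4
    if k == 4:
        return x
    raise ValueError("k must be in 0..4")


def interpolate(p0, p1, k):
    """Linear interpolation between pilots p0 and p1 at step k of 4."""
    return scale_quarters(p0, 4 - k) + scale_quarters(p1, k)
```

For instance, `interpolate(100, 200, 1)` gives 125 and `interpolate(100, 200, 3)` gives 175 without any multiplier; for inputs that are not multiples of 4 the shifts truncate, so results can differ slightly from exact arithmetic.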

The frequency domain interpolation is shown in Fig. 5.10. The non-black CRs are interpolated from the black CRs before and after them. Therefore, two sample delay registers are used to hold the black CR that precedes the non-black CRs.

Fig. 5.10 Frequency domain interpolation

Because the channel estimation contains no multiplication and finite word-length truncation brings little improvement, the word-length of the channel estimation is not truncated. Overall, the architecture of the channel estimation is illustrated in Fig. 5.11.

Fig. 5.11 Architecture of channel estimation

5.4 Channel Compensation and Hard Demapper

5.4.1 Hardware Design

The structure of Eqn. (4.7) is the same as that shown in Fig. 5.5. F1 has the same structure as the correlation part, and F2 the same as the power term. Therefore, the first-stage architecture is shared with the mode/GI and boundary detection after the boundary is detected.

The value of B×NF in Section 4.3 is pre-calculated from the normalization factor, the decision boundary and α defined in [1]. According to the simulation results, B×NF can be represented in five bits with only 1% performance loss, whereas with four bits the performance loss increases sharply. The possible B×NF values are listed in Table 5-2.

Table 5-2 Possible B×NF values of stage 2

QAM      α   B×NF        R-Bits
16-QAM   1   2√10/10     101 00
16-QAM   2   3√20/20     101 01
16-QAM   4   5√52/52     101 10
64-QAM   1   4√42/42     101 00
64-QAM   2   5√60/60     101 01
64-QAM   4   7√108/108   101 10

Here, B×NF is represented in five bits, listed as R-Bits. The possible B×NF values of stage 3 are listed in Table 5-3. With the CSD (canonical signed digit) technique, the maximum number of non-zero digits is reduced from four to three, which also reduces the number of adders and the power consumption.

Table 5-3 Possible B×NF values of stage 3

QAM      α   B×NF        R-Bits   R-Bits (CSD)
64-QAM   1   2√42/42     01010    0  0  1  0  1  0
64-QAM   2   3√60/60     01100    0  1  0 -1  0  0
64-QAM   4   5√108/108   01111    0  1  0  0  0 -1
64-QAM   1   6√42/42     11110    1  0  0  0 -1  0
64-QAM   2   7√60/60     11101    1  0  0 -1  0  1
64-QAM   4   9√108/108   11100    1  0  0 -1  0  0

Therefore, the scaling operation is replaced by additions and subtractions. To reduce the power consumption and hardware cost, the additions and subtractions are realized by CSAs and a CPA. Thanks to the CSD encoding, the maximum number of CSAs is reduced from three to two.
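The CSD recoding of the stage-3 coefficients can be reproduced with the standard non-adjacent-form algorithm. The sketch below is an illustration, not the thesis design flow; the function name and the choice of six output digits (one more than the five R-Bits) are assumptions.

```python
def to_csd(value, width):
    """Canonical signed digits of a non-negative integer, MSB first.

    Each digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    digits = []  # collected LSB first
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    digits += [0] * (width - len(digits))  # pad up to the requested width
    return digits[::-1]
```

For example, `to_csd(0b11110, 6)` returns `[1, 0, 0, 0, -1, 0]`: four non-zero bits become two non-zero digits, so the scaling needs one subtraction instead of three additions.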

5.4.2 Finite Word-Length Simulation

To reduce the hardware cost of the multipliers in stage 1, the finite word-length is simulated. As shown in Fig. 5.12, the bit error rate (BER) increases dramatically when the word-length is smaller than twelve bits. The architecture of stage 1 with the reduced word-length is illustrated in Fig. 5.13. The architecture of stage 2 and of the real part of stage 3 is shown in Fig. 5.14.

Fig. 5.12 Finite word-length of BER after demapping (x-axis: Word-Length (Bit), 10 to 20; y-axis: Bit Error Rate, 0.1038 to 0.1042)

Fig. 5.13 Architecture of stage-1

Fig. 5.14 Architecture of (a) stage-2 and (b) stage-3

5.5 Design and Implementation Results

The blind mode/GI and boundary detection, scattered pilot mode detection, channel estimation and demapping blocks in this thesis are synthesized in a TSMC 0.18 um process. The synthesis tool is Synopsys Design Compiler, with the operating frequency set to 10 MHz under the slow condition. The synthesis results are listed in Table 5-4, and the statistics of all blocks are listed in Table 5-5.

Table 5-4 Design results (synthesis results)

Process                  TSMC 0.18 um (1.62 V)
System Required Speed    9.14 MHz
Power                    3.93 mW @ 10 MHz
Maximum Delay/Freq.      23.70 ns / 42.19 MHz
Gate Counts (w/ MEM)     209,676 (2,054,824 um²)
Gate Counts (w/o MEM)    18,385 (180,174 um²)
Gate Counts (MEM)        191,291 (1,874,650 um²)

Table 5-5 Statistics of all blocks

                    Before Improved         After Improved
Block               Area          Ratio     Area          Ratio     Improve Ratio
CSS Ctrl            20,567 um²    0.95%     17,264 um²    0.84%     16.06%
CSS Combination     32,379 um²    1.50%     17,637 um²    0.86%     45.53%
SPS & CE Ctrl       26,721 um²    1.24%     15,624 um²    0.76%     41.53%
CE Combination      40,033 um²    1.86%     38,783 um²    1.89%     3.12%
Hard Demapper       23,118 um²    1.07%     17,291 um²    0.84%     25.21%
Shared Modules      123,290 um²   5.72%     62,693 um²    3.05%     49.15%
Others              14,414 um²    0.67%     10,882 um²    0.53%     24.50%
Total (w/o Mem)     280,522 um²   13.02%    180,174 um²   8.77%     35.77%

“CSS Ctrl” and “CSS Combination” are the controller and the combinational part of the blind mode/GI and boundary detection process. “SPS & CE Ctrl” and “CE Combination” comprise the circuits of the scattered pilot mode detection and channel estimation. “Hard Demapper” consists of stage 2 and stage 3 of the demapping process.

“Shared Modules” comprise the correlation circuits shown in Fig. 5.13 and a complex multiplier, which are shared among the blind mode/GI and boundary detection process, the scattered pilot mode detection and the demapping process. The “Memory Bank” is the fourteen 1K SRAM modules, which take up a large part of the area (91%).

With memory sharing, the memory area is 1.04 mm². Without memory sharing, two additional 4K SRAMs and two additional 1K SRAMs would be required, and the memory area would increase to 2.56 mm². Since the memory occupies most of the area, memory sharing reduces the area cost substantially (59.38%).
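The quoted saving follows directly from the two memory areas given above:

```latex
\[
\frac{2.56\,\mathrm{mm^2} - 1.04\,\mathrm{mm^2}}{2.56\,\mathrm{mm^2}}
  = \frac{1.52}{2.56} \approx 59.38\%
\]
```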

The shared-module scheme proposed in this thesis also saves a large area. Finally, the area without memory is greatly reduced by the finite word-length truncation.

To verify that the design works, the RTL and gate-level simulation results are shown in Fig. 5.15, where the skipped threshold is the protection mechanism that prevents a wrong guard interval length from being computed.

Fig. 5.15 MC² results of (a) RTL (b) gate-level simulation (x-axis: Sample Index, 0 to 8000; y-axis: Amplitude; plot annotations: 2K Filling, Skipped Threshold, GI Detection, Boundary)

Chapter 6 Conclusion and Future Work

In this thesis, a modified Normalized-Maximum-Correlation (NMC) architecture is proposed. It retains the advantage of normalized mode/GI detection without a division operation and uses the Maximum-Correlation (MC) part to determine the symbol boundary. This thesis also proposes an efficient mode/GI and symbol boundary detection scheme. By detecting one transmission mode at a time, the hardware is reduced to a single set. In some mode detection methods, the delay line is refilled or replenished whenever the tested mode changes. To overcome this delay penalty, a twister memory access method is proposed that eliminates the refilling/replenishing penalty; without this penalty, the mode/GI and boundary detection time is reduced by 41.56% (12,288 samples). Furthermore, the twister memory access method not only removes the delay-line refilling/replenishing penalty but also eases the power distribution problem by accessing all memory blocks evenly.

To reduce unexpected scattered pilot synchronization errors, a two-stage scattered pilot synchronization scheme is proposed. It can be built from the power-based and the correlation-based algorithms, or from other scattered pilot synchronization algorithms, and improves the reliability of the synchronization result. With the scattered pilot pre-filling scheme for channel estimation, the scattered pilot synchronization time and the scattered pilot filling time of the channel estimation overlap. The scattered pilot synchronization time is reduced from 68 symbols (TPS decoding time) to zero symbols in the ideal case and to two symbols if an error occurs, that is, a reduction of 100% at best and 97.06% otherwise. A three-stage division-free demapping scheme is also proposed, which saves a frequency domain equalizer (FEQ) and realizes an entirely division-free architecture. Finally, sharing the memory and multipliers reduces the area cost by 43.51% (3,637,517 um² → 2,054,824 um²).

In the future, the channel estimation algorithm and architecture will be improved to cope well with both static and dynamic channels at a small hardware cost. The inner receiver architecture, including the mode/GI and boundary detection, Fast Fourier Transform (FFT), scattered pilot synchronization (SPS), frequency domain channel estimation (CE), demapping, carrier frequency offset (CFO) recovery loop and sampling clock offset (SCO) recovery loop, will also be integrated and taped out.

References

[1] “Digital Video Broadcasting (DVB); Framing structure, channel coding and modulation for digital terrestrial television,” European Telecommunication Standard EN 300 744 V1.5.1, Nov. 2004.

[2] “Transmission System for Handheld Terminals (DVB-H),” European Telecommunication Standard EN 302 304 V1.1.1 Nov. 2004.

[3] “Digital Video Broadcasting (DVB); Framing structure, channel coding and modulation for 11/12 GHz satellite services,” European Telecommunication Standard EN 300 421 ed.1, Dec. 1994.

[4] “Digital Video Broadcasting (DVB); Framing structure, channel coding and modulation for cable systems,” European Telecommunication Standard EN 300 429 V1.2.1, Apr. 1998.

[5] “ATSC Digital Television Standard,” ATSC Standard A/53E, Dec 2005.

[6] “Terrestrial integrated services digital broadcasting (ISDB-T),” ARIB Standard STD-B31 V1.5, Jun. 2004.

[7] “Digital Multimedia Broadcasting,” Telecommunications Technology Association in Korea, 2003SG05.02-046, 2003.

[8] http://www.dtvc.org.tw/.

[9] http://info.gio.gov.tw/.

[10] M. Hosemann, G. Cichon, P. Robelly, H. Seidel, T. Dräger, T. Richter, M. Bronzel, and G. Fettweis, “Implementing a receiver for terrestrial digital video broadcasting in software on an application-specific DSP,” IEEE SIPS 2004, Oct. 2004, pp. 53-58.

[11] C. D. Toso, P. Combelles, J. Galbrun, L. Lauer, P. Penard, P. Robertson, F. Scalise, P. Senn, and L. Soyer, “0.5 um CMOS Circuits for Demodulation and Decoding of an OFDM-Based Digital TV signal Conforming to the European DVB-T Standard,” IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1781-1792, Nov. 1998.

[12] Y-J. Chen; Y-C. Lei; T-D. Chiueh, “Baseband transceiver design for the DVB-terrestrial standard,” IEEE APCCAS, Vol. 1, Dec. 2004, pp. 389-392.

[13] M. Speth, S. Fechtel, G. Fock and H. Meyr, “Optimum Receiver Design for OFDM-Based Broadband Transmission Part II: A Case Study,” IEEE Trans. Commun., vol. 49, no. 4, pp. 571-578, Apr. 2001.

[14] T-Z. Wei, “Design of Carrier Recovery for DVB-T Baseband Receiver,” Master Thesis, Department of Electronics Engineering, National Central University, Jhongli, Taiwan, Jun. 2005.

[15] R.W. Chang, “Synthesis of Band-Limited Orthogonal Signals for Multichannel Data Transmission,” Bell Syst. Tech. J., vol. 45, pp. 1775-1796, Dec. 1966.

[16] S. Chen, W. He, H. Chen and Y. Lee, “Mode Detection, Synchronization, and Channel Estimation for DVB-T OFDM Receiver,” IEEE GLOBECOM, vol. 5, Dec. 2003, pp. 2416-2420

[17] C-W. Kuang, “Timing Synchronization for DVB-T System,” Master Thesis, Institute of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, Sep. 2004.

[18] K-H. Lin, “Design of a Baseband Receiver for DVB-T Standard,” Master Thesis, Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan, Jul. 2005.

[19] A. Palin, J. Rinne, “Symbol synchronization in OFDM system for time selective channel conditions,” IEEE ICECS, vol. 3, Sep. 1999, pp. 1581-1584

[20] A. Hazmi, J. Rinne, T. Kuusisto, M. Renfors, “Performance evaluation of symbol synchronization in OFDM systems over impulsive noisy channels,” IEEE VETECS, vol. 3, May 2004, pp. 1782-1786.

[21] J.J. van de Beek, M. Sandell, P.O. Borjesson, “ML estimation of time and frequency offset in OFDM systems,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 45, no. 7, pp. 1800-1805, Jul. 1997

[22] L. Schwoerer “Fast Pilot Synchronization Schemes for DVB-H,” IASTED, July 2004, pp. 420-424

[23] L. Schwoerer, J. Vesma, “Fast Scattered Pilot Synchronization for DVB-T and DVB-H,” Proc. 8th International OFDM-Workshop, Hamburg, Germany, Sept. 2003.

[24] F. Eory, “Comparison of adaptive equalization methods for the ATSC and DVB-T digital television broadcast systems,” IEEE ICCDCS, Cancun, Mar. 2000, pp. T107/1-T107/7

[25] P. Combelles, C. D. Toso, D. Hepper, D. Le Goff, J.J. Ma, P. Robertson, F. Scalise, D. Soyer, M. Zamboni, “A receiver architecture conforming to the OFDM based digital video broadcasting standard for terrestrial transmission (DVB-T),” IEEE ICC, vol. 2, Atlanta, GA, Jun. 1998, pp. 780-785.

[26] T-A. Lin, C-Y. Lee, “Predictive equalizer design for DVB-T system,” IEEE ISCAS, vol.2, May 2005, pp. 940-943.

[27] L. Horvath, I. B. Dhaou, H. Tenhunen and J. Isoaho, “A Novel, High-Speed, Reconfigurable Demapper-Symbol Deinterleaver Architecture for DVB-T,” IEEE ISCAS, vol. 4, June 1999, pp. 382-385.

[28] S. A. Rechtel, A. Blaickner, “Efficient FFT and Equalizer Implement for OFDM Receivers,” IEEE. Trans. Consumer Electronics, pp. 1104-1107, Sept. 1999.

[29] K-W. Shin, B-S. Song, “A complex multiplier architecture based on redundant binary arithmetic,” IEEE ISCAS, Vol. 3, June 1997, pp.1944-1947.
