Chapter 3 Data-path in a Baseband Receiver
3.8 Fast Fourier Transform
In an OFDM receiver, a Fast Fourier Transform (FFT) unit is required to transform data from the time domain to the frequency domain. The FFT output can appear in one of two orders, as shown in Fig. 3-11. The normal order is the usual FFT output order. However, in the standards [1] [3], the positions of pilots are recorded in the reversed order. It is therefore important to check which order a standard uses.
Fig. 3-11 Output order of FFT (a) normal order (b) reversed order
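The relation between the two output orders can be illustrated with a short sketch. It assumes that the reversed order of Fig. 3-11 is the bit-reversed order of a radix-2 FFT; a mixed-radix design would use a digit-reversed order instead. The function names are hypothetical and are not taken from [52].

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """List the subcarrier index found at each output position (n = 2**bits)."""
    bits = n.bit_length() - 1
    if n <= 0 or (1 << bits) != n:
        raise ValueError("n must be a power of two")
    return [bit_reverse(position, bits) for position in range(n)]


def to_normal_order(reversed_output):
    """Reorder bit-reversed FFT output into normal order (what a reorder buffer does)."""
    n = len(reversed_output)
    normal = [None] * n
    for position, index in enumerate(bit_reversed_order(n)):
        normal[index] = reversed_output[position]
    return normal
```

For an 8-point transform, the output positions carry subcarriers 0, 4, 2, 6, 1, 5, 3, 7, so a pilot table written in reversed order must be indexed accordingly.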
Pipeline-based [38] [39] [54] and memory-based [40] [41] architectures are widely used to implement the FFT. The two architectures are compared in TABLE 3-7. The memory-based architecture can be regarded as a folded form of the pipeline-based architecture. Hence, the memory-based one has lower complexity, but it requires more clock cycles to complete the transform. The pipeline-based one can usually operate at a higher frequency because of pipelining. In an OFDM system, input data arrive at the FFT unit continuously, which is an important issue. The memory-based architecture needs an extra input buffer to store the incoming data temporarily, whereas the pipeline-based one does not. On the other hand, the output of the pipeline-based architecture is not in regular order; hence, it requires a reorder buffer. In contrast, the memory-based one can reuse its internal memory for reordering.
TABLE 3-7 General comparison of FFT architectures
Item | Pipeline-based (SDF) [38] [39] [54] | Memory-based [40] [41]
Complexity | × | ○
Speed | ○ | ×
Successive data in | ○ | × (requires an input buffer)
Reorder | × (requires a reorder buffer) | ○ (reuses the internal memory)
Fig. 3-12 2K/4K/8K FFT [52] with single path delay feedback (SDF) [38] [39] [54] and Radix-2/4/8 [53]
Fig. 3-12 is a 2K/4K/8K FFT (a work of Syu-Siang Long [52]) adopted in the DVB-T/H receiver. This FFT uses single path delay feedback (SDF) [38] [39] [54], which is a pipeline-based architecture, and combines Radix-2 and Radix-2/4/8 [53]. A reorder buffer is used to transform the output order. It can operate at a 40MHz clock and process 2K, 4K, and 8K FFT operations.
3.9 Summary
Several implementation methods of a delay line are discussed in this chapter. The length of the delay roughly decides how a delay line should be implemented. When the length is short, the complexity of an implementation with dual-port memories is comparable with that of single-port memories. In contrast, when the length is long, the implementation with single-port memories has lower complexity. Algorithms of a baseband receiver require trigonometric modules such as cosine, sine, and arctangent. The CORDIC algorithm can calculate those functions and is well suited to reuse. Two architectures of CORDIC are compared, and the choice between them depends on the receiver architecture. Besides, the truncation operation is usually used to reduce the hardware cost, but it generates a DC error in a 2's complement system. This error has a large influence on the DC subcarrier of an OFDM system. A DC error removal technique [36] is introduced; the method can eliminate the DC error with a small overhead. Two architectures of FFT are compared. The memory-based architecture has lower complexity but requires more clock cycles. The pipeline-based architecture can operate at a high frequency, but an extra reorder buffer is required.
Chapter 4 OFDM Baseband Receiver for DVB-T/H
This chapter presents the proposed DVB-T/H receiver. First, the DVB-T/H standard is introduced and the proposed architecture of a DVB-T/H receiver is shown. Then, several schemes are proposed to reduce the hardware complexity and the power consumption. Finally, the implementation results are shown.
4.1 Introduction of DVB-T/H
Digital Video Broadcasting Terrestrial and Handheld (DVB-T/H) [1] [2] are standards proposed by the European Telecommunications Standards Institute (ETSI) for transmitting digital TV signals. DVB-T/H defines three bandwidths, 6, 7, and 8MHz, for different areas and countries. In Taiwan, the digital TV standard adopts 6MHz DVB-T. Fig. 4-1 shows the DVB-T/H transmitter block diagram [1] [2]. DVB-T/H adopts two levels of channel coding, a Reed-Solomon code and a convolutional code. The Reed-Solomon code is better at correcting burst errors, and the convolutional code is more suitable for random errors. Hence, these two codes complement each other well.
The DVB-T/H standard adopts orthogonal frequency division multiplexing (OFDM). In DVB-T/H, there are three symbol lengths, 2048 (2K mode), 4096 (4K mode), and 8192 (8K mode), and four guard interval (GI) lengths. Continual pilots, scattered pilots, and transmission parameter signaling (TPS) pilots are inserted in the frequency domain. The continual pilots have fixed positions, the scattered pilots change their positions every OFDM symbol, and the TPS pilots are used to transmit system parameters. The data subcarriers can use several constellation schemes: quadrature phase-shift keying (QPSK), 16 quadrature amplitude modulation (16QAM), and 64QAM. TABLE 4-1 summarizes the specification of DVB-T/H.
Fig. 4-1 The DVB-T/H transmitter block diagram [1] [2]
TABLE 4-1 Specification of DVB-T/H [1] [2]
Bandwidth (MHz) | 6 | 7 | 8
Sampling period (µs) | 7/48 | 1/8 | 7/64
FFT length | 2K, 4K, 8K
Used subcarriers | 1705, 3409, 6817
Guard interval | 1/4, 1/8, 1/16, 1/32
Modulation | QPSK, 16QAM, 64QAM
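The timing figures that follow from TABLE 4-1 can be worked out with a few lines of exact arithmetic. The sketch below is only an illustration; the dictionary and function names are hypothetical, and the relations used (useful symbol duration = FFT length × sampling period, subcarrier spacing = 1 / useful symbol duration) are the usual OFDM definitions.

```python
from fractions import Fraction

# Sampling period in microseconds for each channel bandwidth (TABLE 4-1).
SAMPLING_PERIOD_US = {6: Fraction(7, 48), 7: Fraction(1, 8), 8: Fraction(7, 64)}
FFT_LENGTH = {"2K": 2048, "4K": 4096, "8K": 8192}


def useful_symbol_duration_us(bandwidth_mhz, mode):
    """Duration of the useful part of one OFDM symbol, in microseconds."""
    return FFT_LENGTH[mode] * SAMPLING_PERIOD_US[bandwidth_mhz]


def subcarrier_spacing_hz(bandwidth_mhz, mode):
    """Subcarrier spacing in Hz (reciprocal of the useful symbol duration)."""
    return Fraction(10**6) / useful_symbol_duration_us(bandwidth_mhz, mode)


def symbol_duration_us(bandwidth_mhz, mode, guard_interval):
    """Total OFDM symbol duration including the guard interval."""
    return useful_symbol_duration_us(bandwidth_mhz, mode) * (1 + guard_interval)
```

For the 8MHz channel, the useful symbol lasts 224 µs in 2K mode and 896 µs in 8K mode, which corresponds to subcarrier spacings of roughly 4.46kHz and 1.12kHz.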
4.2 Baseband Receiver Architecture
Fig. 4-2 shows the block diagram of the DVB-T/H baseband receiver. In the receiver, the Mode/GI/Symbol detection, the carrier synchronization, the sampling clock synchronization, and the channel estimation (inner receiver) are designed and implemented at RTL level. The soft demapper, the interleaver, and the soft Viterbi decoder (outer receiver) are behavioral models, which are used to measure the receiver performance. The hardware implementation contains two clock rate domains: a 4X clock rate and a 1X clock rate. The derotator, the interpolator, and the FFT operate at the 4X clock rate. The Mode/GI/Symbol detection, the channel estimation, the integer CFO (ICFO) estimation, and the SCO and residual CFO (RCFO) estimation [5][6][7] work at the 1X clock rate.
The demodulation flow has two stages: the acquisition stage and the tracking stage. In the acquisition stage, the receiver detects the transmission mode and the GI length, finds the OFDM symbol boundary, compensates the fractional CFO (FCFO), and estimates the ICFO. Then, the demodulation flow enters the tracking stage, in which the receiver tracks the SCO and RCFO. After reaching the steady state, the receiver detects the scattered pilot mode, performs channel estimation and equalization, and demaps the constellation into a bit stream.
The goal is to design a low power and low complexity baseband receiver. The following is the summary of the adopted schemes to reduce the power consumption or the hardware complexity:
The Phase prediction scheme reduces the operations of phase accumulators during GI period. (Low power)
ability. (Low power)
The differential encoding scheme for continual pilot positions reduces the storage cost. (Low complexity)
The Mode/GI/Symbol detection and the channel estimation share the same memory bank. (Low complexity)
The integer CFO (ICFO) estimation and the residual CFO (RCFO) and SCO estimation share the same memory module. (Low complexity)
Fig. 4-2 The DVB-T/H receiver architecture
4.3 GI Detection [10] [11] [12]
The Mode/GI detection algorithm [42] adopts the cyclic prefix (CP) based correlation algorithm to identify the symbol mode. Eqn.(4-1) is the maximum correlation (MC) [6]:
$$x_{MC}(n) = \left| \sum_{i=0}^{N_{sc}/32-1} r(n-i)\, r^{*}(n-i-N_{sc}) \right| \tag{4-1}$$
where r(n) is the received signal, Nsc is the number of sub-carriers and Nsc/32 is the shortest guard interval length. The correlation result xMC(n) forms a peak or plateau if the tested mode equals the transmitted symbol mode. However, defining the threshold and detecting the plateau are difficult because of glitches. Eqn.(4-2) is a modified form of Eqn.(4-1) called the normalized maximum correlation (NMC) [43] [44]:
The denominator denotes the power of the received signal r(n) and normalizes the correlation to “1”. Compared with the MC method, the NMC method gives a flatter plateau, which makes the GI length easier to detect; however, it requires a division operation.
To remove the division operation of NMC, the plateau threshold is defined as 'Th', as given by Eqn.(4-3). Then, the GI detection equation can be modified as shown in Eqn.(4-4). The division is removed by moving the denominator to the right-hand side and adopting a pre-determined threshold, Th. Moreover, squaring both sides removes the square-root operation needed to take the absolute value of a complex number.
Choosing an appropriate threshold is important for reducing the detection error. With a low threshold, the non-plateau region is regarded as part of the plateau, which causes incorrect GI length detection. With a high threshold, glitches on the plateau shorten the estimated plateau length and also cause incorrect GI detection. Fig. 4-3 and Fig. 4-4 show the simulated GI detection error rates for 8K and 2K mode, respectively. In the simulation results, the detection error rate in 8K mode is lower than that in 2K mode, because 8K mode has the longest symbol length. Except for the 1/32 GI length, the detection error rate behaves similarly at low and high thresholds. For the 1/32 GI length, at a low threshold, the miscalculated plateau decreases the GI detection error rate. At a high threshold, because of the decision boundary of the 1/32 GI case, the shortened plateau does not cause detection errors in these simulations. To keep the detection error rate below 0.01, the threshold is chosen to be 0.5. In the 1/32 GI, 2K mode case, the error rate is very close to 0.01. The multiplication in Eqn.(4-4) can be replaced with wire shifting in the hardware implementation.
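The division-free and square-root-free comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions: the NMC is taken to be the magnitude of the delay correlation divided by the power of the delayed samples, samples are integer (I, Q) pairs, and the threshold is a power of two (Th = 0.5 corresponds to `th_shift = 1`). The function names and the exact normalization are assumptions, not the form used in [43] [44].

```python
def correlation_and_power(samples, n, nsc, window):
    """Accumulate sum r(n-i) * conj(r(n-i-nsc)) and sum |r(n-i-nsc)|^2 over the window.

    `samples` is a list of integer (I, Q) pairs.
    """
    corr_re = corr_im = power = 0
    for i in range(window):
        a_re, a_im = samples[n - i]
        b_re, b_im = samples[n - i - nsc]
        corr_re += a_re * b_re + a_im * b_im   # real part of a * conj(b)
        corr_im += a_im * b_re - a_re * b_im   # imaginary part of a * conj(b)
        power += b_re * b_re + b_im * b_im
    return corr_re, corr_im, power


def on_plateau(corr_re, corr_im, power, th_shift=1):
    """Test |corr| / power >= 2**-th_shift without division or square root.

    Squaring both sides gives |corr|^2 >= Th^2 * power^2; with Th = 2**-th_shift
    the multiplication by Th^2 becomes a shift by 2 * th_shift bits.
    """
    magnitude_squared = corr_re * corr_re + corr_im * corr_im
    return (magnitude_squared << (2 * th_shift)) >= power * power
```

When the window lies inside the cyclic prefix, the correlated samples are copies of each other, the correlation equals the power, and the test passes for any threshold up to 1.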
Fig. 4-3 GI detection error rate vs. threshold under 8K transmission mode, AWGN level = 5dB and Rayleigh channel [1] [2]
Fig. 4-4 GI detection error rate vs. threshold under 2K transmission mode, AWGN level = 5dB and Rayleigh channel [1] [2]
4.4 CFO and SCO synchronization
OFDM systems are sensitive to mismatches of the carrier and sampling frequencies between the transmitter and the receiver. These mismatches cause two effects: phase rotation and intercarrier interference (ICI). CFO rotates the constellation of an OFDM symbol by a common phase, whereas the phase rotation caused by SCO is proportional to the subcarrier index [5][6]. In addition, the frequency offset breaks the orthogonality of the OFDM system; as a result, the data transmitted on a subcarrier suffers interference from other subcarriers, which degrades performance.
To avoid ICI and to keep the phase of the constellation fixed, the receiver needs to compensate the frequency offset. In an OFDM system, the CFO is composed of the fractional CFO (FCFO) and the integer CFO (ICFO). A three-step method for carrier frequency synchronization (one step pre-FFT and the others post-FFT) is reported in [5] [6] [7]. First, at the symbol boundary detection, the result of the delay correlation is also used for estimating the FCFO [43]. In the second step, the ICFO is estimated in the frequency domain by using pilots. However, the FCFO cannot be estimated perfectly, so a residual CFO (RCFO) remains. Hence, in the final step, an RCFO and SCO estimation [5] [6] [7] in the frequency domain keeps tracking the RCFO and SCO at every OFDM symbol.
This work adopts the carrier frequency and sampling clock synchronization of [5] [6] [7]:
$$\varphi_{1,l} = \arg\Big(\sum_{k \in C_1} Z_{l,k}\Big), \qquad \varphi_{2,l} = \arg\Big(\sum_{k \in C_2} Z_{l,k}\Big)$$
$$f_{\Delta} = \frac{1}{2\pi(1+N_g/N)} \cdot \frac{1}{2}\,(\varphi_{1,l} + \varphi_{2,l}), \qquad t_{\Delta} = \frac{1}{2\pi(1+N_g/N)} \cdot \frac{1}{K/2}\,(\varphi_{1,l} - \varphi_{2,l}) \tag{4-5}$$
where fΔ is the estimated CFO, tΔ is the estimated SCO, k is the subcarrier index, K is the number of subcarriers, N is the length of the OFDM symbol, Ng is the length of the guard interval, C1 is the positive continual pilot set, C2 is the negative continual pilot set, and Z is the product of subcarriers of successive OFDM symbols. The architecture of the RCFO and SCO estimation is shown in Fig. 4-5. The 'tan-1' module calculates the angle of a complex number and adopts the CORDIC algorithm [31]. To smooth the RCFO and SCO estimates, loop filters [30] are added to the synchronization loops.
The coefficients of the loop filters are designed as powers of two; therefore, the multipliers can be replaced with wire shifting.
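The replacement of loop filter multipliers by shifts can be illustrated with a small sketch. It assumes a proportional-plus-integral loop filter with integer signals; the actual filter structure of [30] may differ, and the class name and shift values are hypothetical.

```python
class ShiftLoopFilter:
    """Proportional-plus-integral loop filter with power-of-two gains.

    The gains are 2**-kp_shift and 2**-ki_shift, so each multiplication is an
    arithmetic right shift (a fixed rewiring in hardware).
    """

    def __init__(self, kp_shift, ki_shift):
        self.kp_shift = kp_shift
        self.ki_shift = ki_shift
        self.integrator = 0

    def update(self, error):
        """Consume one integer error estimate and return the filtered value."""
        self.integrator += error >> self.ki_shift
        return (error >> self.kp_shift) + self.integrator
```

A larger shift gives a smaller gain, which smooths the estimate more strongly at the cost of slower tracking.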
Fig. 4-5 Memory sharing architecture for ICFO, residual CFO (RCFO) and SCO estimation
In the ICFO estimation, a memory reduction architecture [45] uses a serial-in-parallel-out (SIPO) register to temporarily store only the sign bits of the samples from the FFT; hence, it is not necessary to store the full word of each sample that comes from the FFT. As a result, the memory usage is reduced. In addition, because the multiplicand is equal to 1 or -1, the complex multiplier can be replaced with adders, inverters, and MUXs. A differential encoding method [13] [26] [27] is used to record the continual pilot positions, and the distribution of the differentially encoded positions is periodic. Therefore, the storage required for recording the continual pilot positions is reduced by 77% in the implementation. The design overhead is an accumulator and a control unit that accumulate the difference values. Besides, the ICFO estimation and the RCFO/SCO estimation share the same memory to reduce the hardware cost. The carrier synchronization can compensate an offset of ±50 subcarrier spacings (equivalent to ±220kHz in 2K mode and ±55kHz in 8K mode), and the clock synchronization can compensate a 200ppm sampling clock offset.
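The idea of differential encoding of pilot positions can be sketched as follows. The sketch stores the gap between neighbouring positions instead of the absolute position and rebuilds the positions with an accumulator. The position list is a short illustrative sample and should be checked against the standard before use; the function names are hypothetical, and the periodicity exploited in this work is not modelled.

```python
def encode_differences(positions):
    """Store the first position followed by the gaps between neighbours."""
    return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]


def decode_differences(differences):
    """Rebuild absolute positions by accumulating the stored gaps."""
    positions = []
    accumulator = 0
    for gap in differences:
        accumulator += gap
        positions.append(accumulator)
    return positions


def bits_needed(value):
    """Number of bits needed to store a non-negative integer."""
    return max(1, value.bit_length())


# Illustrative continual pilot positions (sample only).
SAMPLE_POSITIONS = [0, 48, 54, 87, 141, 156, 192, 201, 255]
```

In this sample, the largest absolute position needs 8 bits, whereas the largest gap (54) needs only 6 bits, so each stored entry becomes narrower.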
TABLE 4-2 Comparison of memory usage of ICFO, RCFO and SCO estimation
Function | Direct Implementation | Memory Sharing Architecture
ICFO Estimation [45] | 512×24 + 2×64×38 | 512×24 (Sharing) + 2×64×38
RCFO/SCO Estimation [5] [6] [7] | 177×24 | -
Recording Pilot Position | 177×13 (ROM) | 64×8 (Differential Encoding)
Total Required Bits | 23701 (100%) | 17664 (75%)
In short, the memory sharing architecture shown in Fig. 4-5 is designed to reduce the memory usage. The comparison between the direct implementation and this architecture is shown in TABLE 4-2. The memory usage of this architecture is reduced by about 25%.
The CFO compensation is composed of a derotator and a sinusoidal value generator. This work uses a derotator [46] based on the coordinate rotation digital computer (CORDIC) [31]. A conventional derotator needs a complex multiplier, and the sinusoidal value generator (NCO) requires additional hardware. A CORDIC-based derotator combines the derotator and the NCO to reduce hardware complexity. CORDIC can implement trigonometric functions such as sine, cosine, and arctangent; hence, it can be reused for different computations. The derotator of this work adopts a ten-stage unfolded CORDIC structure [34] [35], as shown in Fig. 4-6.
Fig. 4-6 Architecture of ten stages unfolding CORDIC [34] [35]
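The way a CORDIC rotator replaces both the sinusoidal value generator and the complex multiplier can be sketched as follows. This is a floating-point illustration of a ten-iteration rotation-mode CORDIC, not the fixed-point structure of [34] [35]; the function name is hypothetical, and in hardware each multiplication by 2^-i is a shift.

```python
import math


def cordic_rotate(x, y, angle, stages=10):
    """Rotate the vector (x, y) by `angle` radians with `stages` micro-rotations.

    Converges for |angle| up to about 1.74 rad. Each stage rotates by
    +/- atan(2**-i), chosen to drive the residual angle toward zero.
    """
    gain = 1.0
    for i in range(stages):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    for i in range(stages):
        direction = 1.0 if angle >= 0.0 else -1.0
        shift = 2.0 ** -i
        x, y = x - direction * y * shift, y + direction * x * shift
        angle -= direction * math.atan(shift)
    # Remove the constant CORDIC gain accumulated by the micro-rotations.
    return x / gain, y / gain
```

Derotating a sample by an estimated phase is then `cordic_rotate(i_sample, q_sample, -phase)`; with ten stages the residual angle error is bounded by atan(2^-9), about 0.002 rad.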
A detailed mathematical description of the digital timing (sampling clock) synchronization loop is reported in [8][9]. The synchronization is composed of
delay filter” or “interpolator”. The proposed receiver adopts the cubic Lagrange interpolator [9][47] for compensating the SCO, together with 4× oversampling.
The operation of the interpolation controller is shown in Fig. 4-7. Because of the cubic Lagrange interpolation and 4× oversampling, the normal operation is as shown in Fig. 4-7(a). The cubic Lagrange interpolation requires four points to construct a new sample at a fractional interval from the basepoint. In addition, the system requires 4× decimation to recover the symbol rate; as a result, the interpolator generates a new sample every four samples. However, there are two exceptional situations. The Lagrange interpolation is valid only within the interval of the interpolation set. When the fractional interval exceeds this range, the interpolation controller has to change the basepoint of the interpolation set. This work limits the valid fractional interval to ±0.5 sample. The valid fractional interval, [-0.5, 0.5), has a very simple hardware implementation. In two's complement, the first two bits of a number that is less than -0.5 must be 10, and those of a number that is greater than or equal to 0.5 must be 01. Hence, a comparator can examine only the first two bits of a number, instead of comparing all bits, to determine whether it is within the valid fractional interval, which reduces the power consumption.
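The two-bit range check can be verified with a short sketch. It assumes that the fractional interval is held as a two's complement word whose full range is [-1, 1), i.e. one sign bit followed by fraction bits; the word width and function name are hypothetical.

```python
def in_valid_interval(word, width):
    """Return True if the two's complement fraction lies in [-0.5, 0.5).

    `word` is the raw unsigned bit pattern of a `width`-bit number with range
    [-1, 1). Only the two most significant bits are inspected:
    00 -> [0, 0.5), 11 -> [-0.5, 0), 01 -> [0.5, 1), 10 -> [-1, -0.5).
    """
    top_two = (word >> (width - 2)) & 0b11
    return top_two in (0b00, 0b11)
```

The check is equivalent to testing whether the two most significant bits are equal, so a single XOR gate is sufficient.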
Fig. 4-7(b) and Fig. 4-7(c) show the exceptional situations. In Fig. 4-7(b), the sampling frequency of the receiver is higher than that of the transmitter, so the fractional interval keeps increasing. When it exceeds 0.5 sampling period, the interpolation controller moves the basepoint forward. The modified fractional interval is then the complement of the original one. Therefore, a sample is skipped in this situation. Fig. 4-7(c) shows the other situation. When the sampling rate of the receiver is lower than that of the transmitter, the fractional interval keeps decreasing. As a result, a sample is duplicated.
Fig. 4-7 Operation of the interpolation controller (a) normal operation (b) skipped operation and (c) duplicated operation
Fig. 4-8 shows the modified Farrow structure [47] for the cubic Lagrange interpolator [9] [47]. This work modifies the Farrow structure by adding data holding registers that form a serial-in-parallel-out (SIPO) register bank controlled by the interpolator controller. The SIPO stores the received samples into the data holding registers every four clock cycles in normal operation, every five clock cycles in skipped operation, and every three clock cycles in duplicated operation. Hence, the interpolator does not need to process the data dropped by the 4× decimation, which saves power.
Fig. 4-8 Modified Farrow structure for cubic Lagrange interpolator [9][47]
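The arithmetic performed by a cubic Lagrange interpolator can be sketched as follows. The sketch evaluates the Lagrange basis directly rather than in the Farrow form, and it assumes that the four samples sit at offsets -1, 0, 1, and 2 around the basepoint; this node placement and the function names are assumptions. A Farrow structure computes the same polynomial, arranged as a polynomial in the fractional interval.

```python
def lagrange_coefficients(mu, nodes=(-1.0, 0.0, 1.0, 2.0)):
    """Return the four Lagrange basis weights for fractional interval `mu`."""
    coefficients = []
    for j, node_j in enumerate(nodes):
        weight = 1.0
        for m, node_m in enumerate(nodes):
            if m != j:
                weight *= (mu - node_m) / (node_j - node_m)
        coefficients.append(weight)
    return coefficients


def interpolate(samples, mu):
    """Interpolate four consecutive samples at offset `mu` from the basepoint."""
    weights = lagrange_coefficients(mu)
    return sum(w * s for w, s in zip(weights, samples))
```

Because the interpolating polynomial has degree three, any cubic signal is reproduced exactly, which gives a simple check of the weights.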
After locating the symbol boundary, the samples within the GI period can be dropped. Hence, the interpolator and derotator can stop working to save power.
However, the phase accumulators (ACC) of the NCO and the interpolator controller must keep their phase continuity. A phase prediction scheme [13] [26] [27] is proposed in this work. First, the estimator updates the frequency offset once per symbol, so the estimated frequency offset is constant within an OFDM symbol. Second, the estimated frequency offset multiplied by the GI length is the total phase offset of the GI.
As a result, the total phase offset during GI is also a constant. Thus, the proposed scheme disables the NCO and interpolator controller within the GI and it predicts and compensates the phase of GI at the beginning of the next OFDM symbol. Moreover, because the GI length of DVB-T/H is a power-of-two, the multiplication of phase prediction can be replaced with the shifting of the connections for complexity saving.
Fig. 4-9 shows the simulation waveforms of the proposed phase prediction scheme. The scheme keeps the phase continuity and reduces the operations of the phase accumulators by 3% to 20%, depending on the GI length.
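The equivalence between stepping the accumulator through the GI and predicting the phase in one operation can be sketched as follows. The accumulator width and the function names are hypothetical; the sketch assumes a wrapping integer phase accumulator and a GI length of 2^gi_log2 samples.

```python
from fractions import Fraction

PHASE_BITS = 16                      # hypothetical accumulator width
PHASE_MASK = (1 << PHASE_BITS) - 1


def accumulate_through_gi(phase, step, gi_length):
    """Reference behaviour: add the phase step once per GI sample."""
    for _ in range(gi_length):
        phase = (phase + step) & PHASE_MASK
    return phase


def predict_after_gi(phase, step, gi_log2):
    """Phase prediction: one addition of step * 2**gi_log2, done as a shift."""
    return (phase + (step << gi_log2)) & PHASE_MASK


def skipped_fraction(gi_ratio):
    """Share of accumulator operations saved when the GI is skipped."""
    return gi_ratio / (1 + gi_ratio)
```

A GI of 1/32 removes 1/33 (about 3%) of the accumulator operations and a GI of 1/4 removes 1/5 (20%), which matches the range stated above.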
Fig. 4-9 Phase prediction of phase accumulator
Fig. 4-10 is an RTL simulation result of RCFO/SCO tracking; the FCFO estimation is turned off to test the tracking ability of RCFO. The simulation shows that both the RCFO and SCO loops can track the offsets in 8K mode with SNR = 20dB, 64QAM, a Rayleigh channel, 200ppm SCO, and an RCFO of 0.05 sub-carrier spacing.
Fig. 4-10 RCFO/SCO RTL tracking curve @ 8K mode, SNR = 20dB, 64 QAM, Rayleigh channel, 200ppm SCO and 0.05 sub-carrier spacing RCFO
Fig. 4-11 shows the simulated RTL output SNR for different fractional SCOs (in ppm). In this simulation, an RCFO equal to 0.05 subcarrier spacing is added. The simulation results show that the receiver can keep tracking under different frequency offsets in 8K/2K mode with 64QAM over AWGN/Rayleigh channels.
Fig. 4-11 Output SNR of different SCOs @ RCFO = 0.05 subcarrier spacing, 8K/2K mode, 64QAM and AWGN/Rayleigh channel
4.5 Scattered Pilots Synchronization [11] [13] [14]
The position of the scattered pilots is recorded in the TPS pilots. To decrease the detection latency, two fast scattered pilot synchronization (SPS) algorithms are reported in [48] [49]. One is the power-based (PB) algorithm shown in Eqn.(4-6) [48] [49], and the other is the correlation-based (CB) algorithm shown in Eqn.(4-7) [48] [49]:
where SC(n,m) is the m-th sub-carrier of the n-th symbol, k is the candidate scattered pilot mode, and SP is the estimated scattered pilot mode. Both algorithms exploit the boosted power [1] [2] of the transmitted scattered pilots. The summed correlation of the scattered pilots is usually larger than that of the data subcarriers. Therefore, the PB and CB algorithms can distinguish the scattered pilots from the data subcarriers.
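The power-based detection can be sketched as follows. This is a minimal sketch under stated assumptions: the four candidate scattered pilot patterns are taken to be subcarriers 3k + 12p for k = 0 to 3, the decision is the candidate with the largest summed power, and the function name is hypothetical. It is not claimed to be the exact form used in [48] [49].

```python
def detect_scattered_pilot_mode(symbol):
    """Return the candidate mode k (0..3) with the largest summed pilot power.

    `symbol` is a list of complex subcarrier values of one OFDM symbol. Each
    power term uses two real multiplications and one addition (re^2 + im^2).
    """
    powers = []
    for k in range(4):
        total = 0.0
        for m in range(3 * k, len(symbol), 12):
            value = symbol[m]
            total += value.real * value.real + value.imag * value.imag
        powers.append(total)
    return max(range(4), key=lambda k: powers[k])
```

With unit-power data and pilots boosted to an amplitude of 4/3, the candidate that lines up with the pilots accumulates 16/9 of the power of the other candidates.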
The PB algorithm requires two real multipliers and one real adder to correlate, one adder to do summation and four register groups to store the correlation results of the possible scattered pilot location. On the other hand, due to the complex number operations of the CB algorithm, it requires a complex multiplier (three real multipliers and five real adders [28] [29]). Moreover, double register groups are required for algorithm are shown in TABLE 4-3.
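A complex multiplication with three real multipliers and five real adders can be arranged as in the following sketch. This is one common arrangement and is shown only as an illustration; the arrangement used in [28] [29] may differ in detail, and the function name is hypothetical.

```python
def complex_multiply_3m5a(a, b, c, d):
    """Compute (a + jb) * (c + jd) with three multiplications and five additions."""
    k1 = c * (a + b)   # addition 1, multiplication 1
    k2 = a * (d - c)   # addition 2, multiplication 2
    k3 = b * (c + d)   # addition 3, multiplication 3
    real = k1 - k3     # addition 4: ac - bd
    imag = k1 + k2     # addition 5: ad + bc
    return real, imag
```

Compared with the direct form (four multiplications and two additions), one multiplier is traded for three extra adders.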
TABLE 4-3 Hardware complexity of PB and CB
Algorithm | Real multiplier | Real adder | Register group | Memory | Latency (symbols)
PB | 2 | 2 | 4 | 0 | 1
CB | 5 | 8 | 8 | 227 2Words (8K mode) | 5
The proposed baseband receiver adopts a two-stage SPS scheme [11] [13] [14] to improve reliability. This scheme runs SPS twice. The first SPS detects the scattered pilot mode of the current symbol, and the second one verifies the prediction derived from the first. If the scattered pilot mode detected by the second SPS differs from the mode predicted from the first one, the system assumes that an error has occurred and restarts the two-stage SPS scheme. Each of the first and second SPS can use either the PB algorithm or the CB algorithm. Because the CB algorithm requires the previous symbol, the detection latency is five OFDM symbols.