
5 Implementation and Acceleration of AMR Speech Coding on TI DSP Platform 70

5.2.3 Performance Analysis

5.2.3.2 AMR Decoder Performance Analysis

We measure the AMR decoding time by the same method as the encoder.

1. The Original AMR Decoder (Provided by 3GPP)

Source Rate     Decoding Time (ms/frame)
(kbits/sec)     TS0      TS1      TS2
4.75            6.25     6.29     6.26
5.15            6.33     6.22     6.26
5.9             6.33     6.22     6.26
6.7             6.40     6.39     6.26
7.4             6.18     6.12     6.04
7.95            6.32     6.36     6.35
10.2            6.15     6.29     5.99
12.2            6.57     6.39     6.30
Average         6.32     6.29     6.22

Table 5.12: Execution Time of the DSP Implementation under Different Source Rates for Each Test Sequence

2. Improved AMR Decoder with the Intrinsics

Source Rate     TS0                  TS1                  TS2
(kbits/sec)     ms/frame   %         ms/frame   %         ms/frame   %
4.75            5.90       5.60      5.92       5.88      5.81       7.19
5.15            5.90       6.79      5.86       5.79      5.90       5.75
5.9             5.94       6.16      5.92       4.82      5.86       6.39
6.7             5.97       6.72      5.89       7.82      5.90       5.75
7.4             5.84       5.50      5.82       4.90      5.72       5.30
7.95            5.94       6.01      5.96       6.29      5.99       5.67
10.2            5.66       7.97      5.89       6.36      5.68       5.18
12.2            6.05       7.91      5.95       6.89      5.94       5.71
Average         5.90       6.65      5.90       6.20      5.85       5.95

Table 5.13: Execution Time of the DSP Implementation under Different Source Rates for Each Test Sequence (ms/frame: the Processing Time for One Frame, %: Improvement Percentage).

3. File-Level Optimization

Source Rate     TS0                  TS1                  TS2
(kbits/sec)     ms/frame   %         ms/frame   %         ms/frame   %
4.75            2.46       58.30     2.46       58.45     2.37       59.21
5.15            2.43       58.81     2.46       58.02     2.37       59.83
5.9             2.39       59.76     2.43       58.95     2.42       58.70
6.7             2.46       58.79     2.46       58.23     2.42       58.98
7.4             2.43       58.39     2.36       59.45     2.42       57.69
7.95            2.46       58.59     2.46       58.72     2.42       59.60
10.2            2.39       57.77     2.43       58.74     2.37       58.27
12.2            2.46       59.34     2.43       59.16     2.42       59.26
Average         2.44       58.64     2.44       58.64     2.40       58.97

Table 5.14: Execution Time of the DSP Implementation under Different Source Rates for Each Test Sequence (Columns as in Table 5.13).

After our acceleration, the final AMR decoder implemented on the DSP baseboard takes about 2.43 ms/frame, which is on average 61.31% less than the original. This meets the real-time requirement. The data transfer time alone is again about 0.28 ms/frame. Hence, the pure AMR decoding time is about 2.15 ms/frame.

Chapter 6

Implementation and Acceleration of 802.16a Reed-Solomon Decoder on TI DSP Platform

Having covered the AMR speech coding part, we turn in this chapter to the second major topic: the implementation and acceleration of the specified Reed-Solomon (RS) coding scheme on the same DSP platform. The AMR codec and the RS coding scheme are both specified in the IEEE 802.16a wireless communication standard.

The AMR codec belongs to the source coding part, while the RS code belongs to the channel coding part. The RS coding scheme connects directly to the AMR speech coding block and protects it against channel errors. The acceleration work on the RS code focuses mainly on the decoder because it is more complicated than the encoder.

First, following the general flow of acceleration, we introduce the structure and profile of the original RS decoder. Then we describe the algorithms adopted to obtain further improvement. An alternative procedure for RS decoding, the remainder decoding algorithm [30] [31] [35], is also implemented for comparison with the former system. Finally, we report the overall acceleration achieved and the DSP implementation of our system.

6.1 Acceleration on Reed-Solomon Decoder

We first generate a computational profile with the CCS built-in profiler to obtain the execution cycles. Based on the profile data, we identify which parts of our program consume the most execution time and concentrate on these parts to speed up the whole program. In the following subsections, the processing flow of our RS decoder program on the TI DSP platform is divided into several procedures, and the improvement work on each is described.

6.1.1 Profiling the Original RS Decoder

The starting point of our RS decoder is a version that already applies several acceleration techniques to the well-known RS decoding flow. It was written by Y.-T. Lee in 2004 for his MS thesis [20]; we call it the Lee decoder. The well-known RS decoding flow, described in Chapter 3, consists of four procedure units:

Syndrome computation

Berlekamp-Massey algorithm (BM algorithm)

Chien search

Forney algorithm

The Lee decoder program we intend to accelerate uses a look-up table to realize the Galois field multiplier and replaces the BM algorithm and the Chien search with faster variants.

The inversion of the discrepancy required in the original BM algorithm is complex and time-consuming because it requires a chain of multiplications. Hence the inverse-free BM algorithm is used, which reduces the number of inversion operations to one.

Two features are used to improve the Chien search. The first is early termination: instead of substituting all the field elements, we substitute elements only until the number of roots found matches the order of the errata locator polynomial.

The second is skipping unused positions in the Chien search. The inputs of the different block sizes defined in the IEEE 802.16a standard are padded with zeros for the (255, 239, 8) RS encoding, so the same zeros have to be padded to the input of the RS decoder. The zero-padded positions are therefore never in error and cannot be roots of the errata locator polynomial, so they can be skipped when checking roots.
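The two Chien search features described above can be illustrated with a minimal C sketch. It is not the Lee decoder code. The helper names (`gf_init`, `gf_mul`, `poly_eval`, `chien_search`), the field polynomial 0x11D, the convention that position p corresponds to the root α^(-p), and the assumption that the zero padding occupies the positions at and above `n_used` are all choices made here for illustration. For brevity, the sketch evaluates the polynomial directly with Horner's rule at each candidate instead of using the incremental term updates of a full Chien search.

```c
#include <stdint.h>

static uint8_t gf_exp[512];   /* gf_exp[i] = alpha^i, doubled to avoid a modulo */
static uint8_t gf_log[256];   /* gf_log[alpha^i] = i */

/* Build GF(2^8) tables; 0x11D is an assumed field polynomial. */
void gf_init(void) {
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11D;
    }
    for (int i = 255; i < 512; i++) gf_exp[i] = gf_exp[i - 255];
}

uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* Evaluate c[0] + c[1]*x + ... + c[deg]*x^deg by Horner's rule. */
uint8_t poly_eval(const uint8_t *c, int deg, uint8_t x) {
    uint8_t y = c[deg];
    for (int k = deg - 1; k >= 0; k--)
        y = (uint8_t)(gf_mul(y, x) ^ c[k]);
    return y;
}

/* Root search over the used positions 0..n_used-1 only (n_used <= 255).
 * Zero-padded positions are never visited, and the loop stops as soon as
 * 'deg' roots are found. Returns the number of roots written to pos_out. */
int chien_search(const uint8_t *lambda, int deg, int n_used, int *pos_out) {
    int found = 0;
    for (int p = 0; p < n_used && found < deg; p++) {
        uint8_t x_inv = gf_exp[(255 - p) % 255];   /* alpha^(-p) */
        if (poly_eval(lambda, deg, x_inv) == 0)
            pos_out[found++] = p;
    }
    return found;
}
```

For a locator polynomial with roots at positions 3 and 10, the loop stops after position 10 instead of scanning every used position.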

The improvements described above are already present in the version of the RS decoder we start with, which we call the Lee RS decoder for convenience. Table 6.1 shows the profile of the Lee RS decoder without compiler-level optimization.

Function Name                Code Size    Cycle      Percentage (%)
Syndrome Computation         480          249,294    80.98
BM Algorithm                 1,920        23,962     7.78
Chien Search (Worst Case)    804          25,375     8.24
Chien Search (Best Case)     804          902        N/A
Forney Algorithm             1,064        9,211      2.99

Table 6.1: Profile of the Lee RS Decoder

The “Percentage” column in Table 6.1 gives the execution cycles of each function as a percentage of the whole RS decoder. The Chien search is listed for two cases because, in the Lee RS decoder, it may terminate early once the number of roots reaches the order of the errata locator polynomial. In the worst case, one of the errors lies in the last position, so all the elements have to be substituted to find the last root. Conversely, in the best case no error occurs. The probability of the best case is very low. To ensure real-time operation, we focus mainly on the worst case; the details are discussed in the following sections.

Table 6.1 shows that syndrome computation and the Chien search take the most execution time. Our acceleration work on them is described in the next section.

6.1.2 Modifications of RS Decoder

6.1.2.1 Syndrome Computation Improvement

The syndrome can be formally defined as follows:

$$S_i = R \bmod G, \quad i = 0, 1, 2, \dots, 15, \quad \text{for } GF(2^8)$$

The received codeword may be expressed in polynomial form as follows:

$$R(x) = r_0 x^{N-1} + r_1 x^{N-2} + \dots + r_{N-1}$$

where N is the length of the received codeword. In our case of the (255, 239, 8) RS code, N equals 255. Let the first 2T powers of beta be specified as beta = $\{\beta_0, \beta_1, \dots, \beta_{15}\}$. The 16 syndromes are then expanded as follows:

$$S_0 = r_0\beta_0^{N-1} + r_1\beta_0^{N-2} + \dots + r_{N-2}\beta_0 + r_{N-1}$$

$$S_1 = r_0\beta_1^{N-1} + r_1\beta_1^{N-2} + \dots + r_{N-2}\beta_1 + r_{N-1}$$

$$\vdots$$

$$S_{15} = r_0\beta_{15}^{N-1} + r_1\beta_{15}^{N-2} + \dots + r_{N-2}\beta_{15} + r_{N-1}$$

Computing the syndromes therefore amounts to evaluating the polynomial at the roots defined by beta. In the Lee RS decoder, this is done recursively using Horner's rule. For example, the recursive computation of $S_0$ is:

$$S_0 = (\dots((r_0\beta_0 + r_1)\beta_0 + r_2)\beta_0 + \dots + r_{N-2})\beta_0 + r_{N-1}$$

As the computation procedure in Figure 6.1 shows, the C code implementation involves two loops: an outer loop that iterates once for every syndrome and an inner loop that iterates over all the received symbols. To obtain better performance from the architecture, we unroll the inner loop.

    for (j = 1; j <= 16; j++) {        /* one pass per syndrome */
        for (i = 0; i < 255; i++) {    /* one Horner step per received symbol */
            product = gf_mul_tab(Alpha_to[B0-1+j], s[j]);
            s[j] = product ^ data[i];
        }
    }

Figure 6.1: The C Code of the Syndrome Computation in the Lee Decoder

We should choose an efficient way to unroll the loop. Here we use an approach similar to that of a radix-4 FFT [28]. The received codeword is read starting at locations 0, N/4, N/2, and 3N/4. Horner's rule is then applied recursively to all four parts of the syndrome polynomial, using the input data read at all four locations, (N/4 – 1) times. The syndrome polynomial is thus segmented as shown below:

$$s_0 = r_0\beta_0^{63} + r_1\beta_0^{62} + \dots + r_{62}\beta_0 + r_{63}$$

$$s_1 = r_{64}\beta_0^{63} + r_{65}\beta_0^{62} + \dots + r_{126}\beta_0 + r_{127}$$

$$s_2 = r_{128}\beta_0^{63} + r_{129}\beta_0^{62} + \dots + r_{190}\beta_0 + r_{191}$$

$$s_3 = r_{192}\beta_0^{63} + r_{193}\beta_0^{62} + \dots + r_{254}\beta_0 + r_{255}$$

The four segments use the same powers of beta, so only one beta value has to be read in each iteration to compute the terms of all four polynomials.

These four segments then have to be weighted and accumulated as follows to obtain the syndrome we want:

$$S_0 = s_0\beta_0^{192} + s_1\beta_0^{128} + s_2\beta_0^{64} + s_3$$

Note that our received codeword has length 255, so a zero has to be assigned to $r_0$ (with the received symbols occupying $r_1, \dots, r_{255}$) to obtain the 256 symbols this method requires. The method reduces the number of memory accesses for the beta values and also reduces the number of inner-loop iterations.
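The segmentation can be checked with a short C sketch that computes one syndrome both by plain Horner evaluation and by the four-way segmented form and compares the results. This is a sketch under stated assumptions, not the thesis code: the names `syndrome_horner`, `syndrome_seg4`, and `gf_pow`, the field polynomial 0x11D, and the table-based multiplier are hypothetical choices made for illustration.

```c
#include <stdint.h>

static uint8_t gf_exp[512];   /* gf_exp[i] = alpha^i, doubled to avoid a modulo */
static uint8_t gf_log[256];   /* gf_log[alpha^i] = i */

/* Build GF(2^8) tables; 0x11D is an assumed field polynomial. */
void gf_init(void) {
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11D;
    }
    for (int i = 255; i < 512; i++) gf_exp[i] = gf_exp[i - 255];
}

uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

uint8_t gf_pow(uint8_t a, int e) {
    if (a == 0) return 0;
    return gf_exp[(gf_log[a] * e) % 255];
}

/* Plain Horner evaluation over n symbols, r[0] being the highest-order term. */
uint8_t syndrome_horner(const uint8_t *r, int n, uint8_t beta) {
    uint8_t s = 0;
    for (int i = 0; i < n; i++)
        s = (uint8_t)(gf_mul(s, beta) ^ r[i]);
    return s;
}

/* Four-way segmented evaluation. p holds 256 symbols: a leading zero
 * followed by the 255 received symbols. Each iteration reads one beta
 * value and advances all four partial sums. */
uint8_t syndrome_seg4(const uint8_t *p, uint8_t beta) {
    uint8_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < 64; i++) {
        s0 = (uint8_t)(gf_mul(s0, beta) ^ p[i]);
        s1 = (uint8_t)(gf_mul(s1, beta) ^ p[64 + i]);
        s2 = (uint8_t)(gf_mul(s2, beta) ^ p[128 + i]);
        s3 = (uint8_t)(gf_mul(s3, beta) ^ p[192 + i]);
    }
    /* S = s0*beta^192 + s1*beta^128 + s2*beta^64 + s3, nested on beta^64 */
    uint8_t b64 = gf_pow(beta, 64);
    uint8_t s = s0;
    s = (uint8_t)(gf_mul(s, b64) ^ s1);
    s = (uint8_t)(gf_mul(s, b64) ^ s2);
    s = (uint8_t)(gf_mul(s, b64) ^ s3);
    return s;
}
```

The inner loop runs 64 times instead of 255, and the three weighting multiplications are paid once per syndrome.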

The profile data of the modified syndrome computation is compared in Table 6.2.

Version                                   Code Size    Cycle      Improvement Percentage (%)
Lee Decoder RS Syndrome Computation       480          249,294    N/A
Modified RS Syndrome Computation          748          172,607    30.76
Using the Intrinsic _gmpy4                680          47,486     72.49
Improved with More Intrinsics             816          34,058     28.28
Compiler File-Level Opt.                  564          5,503      83.84
Compiler File-Level Opt. (Lee Decoder)    296          104,378    58.13

Table 6.2: Improvement of Syndrome Computation

The “Modified RS Syndrome Computation” row in Table 6.2 is the version using the method proposed here; it reduces the cycle count of the original by 30.76% without compiler-level optimization. The versions using the intrinsics are also listed, where “_gmpy4” is the intrinsic for the Galois field multiplier [23], and “more intrinsics” means that we further pack four symbols into a 32-bit integer using other intrinsics and perform four Galois field multiplications simultaneously. Each improvement percentage in the first five rows is measured against the row above it. Finally, we turn on the file-level optimization and obtain an overall improvement of 97.79% compared to the Lee decoder syndrome computation. With the file-level optimization alone, the Lee decoder syndrome computation improves by only 58.13%, which is less than the syndrome computation with our modification.
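The idea of packing four symbols into a 32-bit word can be illustrated with a portable C emulation. The sketch below does not call the DSP intrinsic: `gf_mul4` is a hypothetical software stand-in that multiplies four byte lanes independently, and `syndromes4` shows one possible arrangement (four beta values packed in one word, each received symbol replicated across the lanes). The arrangement actually used in our program may differ, and the field polynomial 0x11D is an assumption.

```c
#include <stdint.h>

static uint8_t gf_exp[512];   /* gf_exp[i] = alpha^i, doubled to avoid a modulo */
static uint8_t gf_log[256];   /* gf_log[alpha^i] = i */

/* Build GF(2^8) tables; 0x11D is an assumed field polynomial. */
void gf_init(void) {
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11D;
    }
    for (int i = 255; i < 512; i++) gf_exp[i] = gf_exp[i - 255];
}

uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* Software emulation of a packed multiply: each of the four byte lanes
 * of a is multiplied in GF(2^8) by the matching lane of b. */
uint32_t gf_mul4(uint32_t a, uint32_t b) {
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        out |= (uint32_t)gf_mul(x, y) << (8 * lane);
    }
    return out;
}

/* Four syndromes at once: lane k of beta4 holds beta_k, and each
 * received symbol is replicated into all four lanes before the XOR. */
uint32_t syndromes4(const uint8_t *r, int n, uint32_t beta4) {
    uint32_t s4 = 0;
    for (int i = 0; i < n; i++)
        s4 = gf_mul4(s4, beta4) ^ (r[i] * 0x01010101u);
    return s4;
}
```

On the DSP, the loop inside `gf_mul4` would collapse into a single packed instruction, which is where the cycle reduction reported in Table 6.2 comes from.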

6.1.2.2 Chien Search Improvement

The Chien search is used to find the roots of an errata locator polynomial. It requires a multiplication for each term when evaluating the errata locator polynomial. Hence, we choose the Berlekamp-Rumsey-Solomon (BRS) algorithm together with the Chien search, as proposed in [29], for our RS decoder. This fast algorithm makes root finding practical and efficient because it eliminates many multiplications and has a regular structure that lets the compiler software-pipeline the code more easily.

We first describe the BRS algorithm, which finds the roots of a special class of polynomials as proposed in [29]. Two definitions and a theorem are needed for this algorithm:

Definition 1: a polynomial $L(y)$ over $GF(2^m)$ is called a p-polynomial for p = 2 iff

$$L(y) = \sum_i c_i y^{2^i}$$

where the coefficients $c_i$ are restricted to $GF(2^m)$ and the exponents are restricted to be powers of two.

Definition 2: a polynomial $A(y)$ over $GF(2^m)$ is called an affine polynomial iff

$$A(y) = L(y) + \beta$$

where $L(y)$ is a p-polynomial as defined previously and $\beta \in GF(2^m)$.

Theorem 1: let $y \in GF(2^m)$ and let $\alpha^0, \alpha^1, \alpha^2, \dots, \alpha^{m-1}$ be a standard basis. If $y$ is represented in the standard basis, i.e., if

$$y = \sum_{k=0}^{m-1} y_k \alpha^k, \quad y_k \in GF(2),$$

then

$$L(y) = \sum_{k=0}^{m-1} y_k L(\alpha^k).$$

Using Theorem 1, a simplified algorithm is obtained for finding the roots of an affine polynomial. It needs to compute only the eight values $L(\alpha^0), L(\alpha^1), \dots, L(\alpha^7)$ instead of evaluating all 255 elements as in the Chien search. For every other element $y$, we only have to decide, for each $k$, whether the term $L(\alpha^k)$ is accumulated, according to $y_k$. This is done by checking the k-th bit of the element $y$.
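The bit-test evaluation can be sketched in C as follows. This is an illustrative sketch, not the decoder code: the function names `lin_eval_direct`, `lin_basis`, and `lin_eval_basis` are hypothetical, the field polynomial 0x11D is an assumption, and the standard basis element $\alpha^k$ is taken to be the byte with only bit k set.

```c
#include <stdint.h>

static uint8_t gf_exp[512];   /* gf_exp[i] = alpha^i, doubled to avoid a modulo */
static uint8_t gf_log[256];   /* gf_log[alpha^i] = i */

/* Build GF(2^8) tables; 0x11D is an assumed field polynomial. */
void gf_init(void) {
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11D;
    }
    for (int i = 255; i < 512; i++) gf_exp[i] = gf_exp[i - 255];
}

uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* Direct evaluation of L(y) = sum_i c[i] * y^(2^i), i = 0..ncoef-1. */
uint8_t lin_eval_direct(const uint8_t *c, int ncoef, uint8_t y) {
    uint8_t acc = 0;
    uint8_t p = y;                      /* p holds y^(2^i) */
    for (int i = 0; i < ncoef; i++) {
        acc ^= gf_mul(c[i], p);
        p = gf_mul(p, p);               /* squaring doubles the exponent */
    }
    return acc;
}

/* Precompute L(alpha^k) for the eight standard basis elements. */
void lin_basis(const uint8_t *c, int ncoef, uint8_t basis[8]) {
    for (int k = 0; k < 8; k++)
        basis[k] = lin_eval_direct(c, ncoef, (uint8_t)(1u << k));
}

/* Theorem 1: L(y) is the XOR of L(alpha^k) over the set bits of y,
 * so no Galois field multiplication is needed per element. */
uint8_t lin_eval_basis(const uint8_t basis[8], uint8_t y) {
    uint8_t acc = 0;
    for (int k = 0; k < 8; k++)
        if (y & (1u << k)) acc ^= basis[k];
    return acc;
}
```

After the eight calls inside `lin_basis`, every further evaluation costs at most eight bit tests and XORs.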

It can be observed that most of the Galois field multiplications are eliminated; only the eight terms associated with the standard basis have to be computed. The BRS algorithm applies only to affine polynomials. Hence, in our method, we first split the errata locator polynomial into an affine polynomial and a remainder. The value of the affine polynomial is then obtained by the BRS algorithm, and the value of the remainder by the Chien search. If the two values are equal for a Galois field element, that element is a root. Note that this method is beneficial only when the order of the errata locator polynomial is not more than eleven [29].

                                             Cycle
Function Version              Code Size    Worst Case    Best Case
Lee Decoder Chien Search      804          25,375        902
Modified Chien Search         1,268        14,013        4,248

Table 6.3: Profile of Chien Search without the Intrinsics and Compiler Optimization

                                             Cycle
Function Version              Code Size    Worst Case    Best Case
Lee Decoder Chien Search      856          4,186         345
Modified Chien Search         960          1,100         183

Table 6.4: Profile of Chien Search with _gmpy4 and File-Level Optimization

Table 6.3 and Table 6.4 compare the Lee decoder Chien search with the version modified by the method described above. Table 6.3 is the case without the intrinsics or any compiler-level optimization: the modified version is more efficient than the original in the worst case but slower in the best case, because rearranging the errata locator polynomial adds code overhead. However, the best case has very low probability. After applying the intrinsic “_gmpy4” and the file-level optimization to both functions, as shown in Table 6.4, the modified Chien search is always more efficient than the Lee decoder Chien search. The improvement is 73.72% in the worst case and 46.96% in the best case, because most of the Galois field multiplications are replaced in the modified Chien search, which makes software pipelining easier.

6.1.3 Performance Analysis

In this section, we present the simulation profile generated by the CCS built-in profiler for our RS decoder specified in IEEE 802.16a. The results of all the improvements described earlier are also shown in the simulation profile; the version that combines all of them is listed as the modified RS decoder.

Decoder Version                              Code Size    Cycle      Improvement Percentage (%)
Lee RS Decoder                               5284         447,109    N/A
Using the Intrinsics                         4936         238,050    46.76
Modified RS Decoder                          5584         121,466    48.97
Compiler File-Level Opt.                     5048         11,650     90.41
Compiler File-Level Opt. (Lee RS Decoder)    4732         121,169    72.90

Table 6.5: Simulation Profile for RS Decoder

Referring to Table 6.5, the cycles of the RS decoder are measured under the worst-case condition, i.e., all elements are searched in the Chien search and all the symbols are decoded correctly. Without the file-level optimization, the RS decoder with our improvements requires 48.97% fewer cycles than the version with the intrinsics alone and 72.83% fewer cycles than the Lee RS decoder. The file-level optimization yields a further 90.41% reduction. The final speed corresponds to 1.85 Mbytes/sec. The improvement of the Lee decoder with the file-level optimization only is also listed.

We also measure the speed and the ratio of correct decoding over an AWGN channel at different SNRs. We generate random data as the input to the RS encoder and pass the coded data through the convolutional coder and then the AWGN channel. At the receiver end, the soft-decision Viterbi decoder recovers the received data into RS coded blocks. We then decode those RS blocks and count their decoding time. The above process is repeated ten times to make the results more accurate. The convolutional coder and Viterbi decoder used here are the ones specified in the IEEE 802.16a standard and described in Chapter 3. We focus on the RS decoding cycles under different channel conditions; the results are shown in Table 6.6. The relationship is plotted in Fig. 6.2 for the decoding cycle versus SNR and in Fig. 6.3 for the correct decoding ratio versus SNR.

ES/N0 (dB)    Correct Decoding Ratio (%)    Decoding Cycle
7             100                           11073
6.5           100                           11574
6             96.43                         12646
5.5           85.71                         13181
5             67.86                         14221
4.5           35.71                         15030
4             7.14                          15435
3.5           0                             15269
3             0                             15264

Table 6.6: The Decoding Ratio and Cycle under the Channel with Different SNR

Figure 6.2: The Plot of the Decoding Cycle versus SNR (vertical axis: decoding cycle; horizontal axis: SNR, ES/N0)

Figure 6.3: The Plot of the Correct Decoding Ratio versus SNR (vertical axis: correct decoding ratio; horizontal axis: SNR, ES/N0)

It is clear that the decoding cycles decrease and the correct decoding ratio increases as the SNR goes up. The decoding cycles decrease because, at higher SNR, fewer error locations have to be searched and fewer error values have to be corrected. When the number of errors exceeds the decoding capability of our RS decoder, the Chien search still goes through all the elements looking for the error locations, but the Forney algorithm is not executed. This is why the decoding cycles at zero correct decoding ratio are slightly lower than in some cases with nonzero correct decoding ratios in Table 6.6.

6.2 Remainder Decoding Algorithm for RS Decoder

Decoding algorithms for RS codes have been investigated for a long time. The well-known Berlekamp-Massey and Euclidean algorithms both solve the key equation for RS codes. In general, the key equation is generated from the syndrome sequence, which is derived from the received codeword, so the syndromes have to be calculated. However, the syndrome calculation takes a large share of the execution time, as the profile data in the earlier sections show. In 1983, L. Welch and E. R. Berlekamp proposed the remainder decoding algorithm [37], which decodes RS codes without computing the syndromes. It has since become a popular alternative and is worth our attention and study.

They presented a new key equation and an algorithm for solving it. Note that this key equation is quite different from the conventional key equation proposed by E. R. Berlekamp [38]. In the following subsections, we introduce the decoding flow of the remainder decoding algorithm and implement it in C. The performance of the system is analyzed and compared in the final subsection.

6.2.1 Remainder Decoding Algorithm

The remainder decoding algorithm is a decoding algorithm that does not compute the syndromes. It has two main points. The first is a new key equation, which relates the coefficients of the remainder polynomial to the errors occurring in a received codeword and is quite different from the conventional key equation. The second is an efficient algorithm proposed by Welch and Berlekamp, the Welch-Berlekamp (WB) algorithm, for solving the new key equation. The solution technique we adopt, proposed in [32], is an improved version of the original WB algorithm; we call it the modified WB algorithm for convenience. We now briefly describe the decoding algorithm. The proof is not presented here; it can be found in [30], [31], and [35].

First, we re-encode the received codeword $R(x)$ to obtain the remainder polynomial

$$r(x) = R(x) \bmod g(x),$$

where $g(x)$ is the same generator polynomial as the one used in the encoder. A few polynomials are derived for the remainder decoding as follows:

[equation not recoverable from the source]

where rj is the j-th coefficient of the polynomial r(x), W(x) is the error-locator polynomial, and N(x) is a unique polynomial whose degree is less than that of W(x).

The formal derivative applied here is defined in [30] as a sum of products over the set $E$, where $E = \{\, i \mid e_i \neq 0 \,\}$ is the set of indices for which $e_i$, the error pattern in position $i$, is nonzero.

The RS decoding can then be formulated as a problem of solving the set of the key equations

Our goal is to find the unique pair of polynomials (W, N). The error locations correspond to the roots of $W(x)$, which we denote by $Z_j$. If $Z_j$ is a message location, then the error value is given by the following equation:

[equation not recoverable from the source]

The values of $g'(\alpha^j)$ and $\beta(Z_j)$ can be calculated in advance once the specification of the RS code system is fixed.
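The re-encoding step that opens this subsection, $r(x) = R(x) \bmod g(x)$, can be sketched in C as a polynomial division carried out by a shift register with feedback. This is a sketch under stated assumptions rather than our program: the names `rs_build_generator` and `rs_remainder` are hypothetical, the field polynomial 0x11D is assumed, and the generator polynomial is assumed to be the product of $(x + \alpha^i)$ for $i = 0, \dots, 15$.

```c
#include <stdint.h>

#define RS_2T 16              /* number of parity symbols */

static uint8_t gf_exp[512];   /* gf_exp[i] = alpha^i, doubled to avoid a modulo */
static uint8_t gf_log[256];   /* gf_log[alpha^i] = i */

/* Build GF(2^8) tables; 0x11D is an assumed field polynomial. */
void gf_init(void) {
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11D;
    }
    for (int i = 255; i < 512; i++) gf_exp[i] = gf_exp[i - 255];
}

uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* g[0..16] in ascending order; g(x) = prod_{i=0}^{15} (x + alpha^i) (assumed roots). */
void rs_build_generator(uint8_t g[RS_2T + 1]) {
    g[0] = 1;
    for (int k = 1; k <= RS_2T; k++) g[k] = 0;
    for (int i = 0; i < RS_2T; i++) {
        /* multiply the current polynomial by (x + alpha^i) */
        for (int k = i + 1; k > 0; k--)
            g[k] = (uint8_t)(g[k - 1] ^ gf_mul(g[k], gf_exp[i]));
        g[0] = gf_mul(g[0], gf_exp[i]);
    }
}

/* rem[0..15] (ascending) = R(x) mod g(x); r[0] is the highest-order symbol.
 * Each step computes rem(x) <- (rem(x) * x + r[i]) mod g(x). */
void rs_remainder(const uint8_t *r, int n, const uint8_t g[RS_2T + 1],
                  uint8_t rem[RS_2T]) {
    for (int k = 0; k < RS_2T; k++) rem[k] = 0;
    for (int i = 0; i < n; i++) {
        uint8_t fb = rem[RS_2T - 1];    /* coefficient shifted out of the register */
        for (int k = RS_2T - 1; k > 0; k--)
            rem[k] = (uint8_t)(rem[k - 1] ^ gf_mul(fb, g[k]));
        rem[0] = (uint8_t)(r[i] ^ gf_mul(fb, g[0]));
    }
}
```

For an error-free codeword the register ends at all zeros, and any single corrupted symbol leaves a nonzero remainder for the key equation solver to work on.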

6.2.2 Program Flow and Performance Analysis

In our program, we first re-encode the received codeword with the LFSR structure. Then the algorithm proposed in [32] is used to solve the key equations and obtain the pair (W, N). Next, the roots of the error-locator polynomial have to be found; the Chien search can be applied to this problem. Finally, the error values can be derived by using the equation described in the previous subsection or the Forney algorithm. Here we choose the Chien search and Forney algorithm to complete the last half of our
