

Optimization of Implementation on PACDSP

5.1 Implementation Strategies on PACDSP

5.1.2 Efficient Variable Length Decoding (VLD)

In this subsection, we discuss an efficient VLD method that exploits the architecture of the PACDSP. In addition, we compare the performance of different VLD methods on the PACDSP. The methods are proposed in [12] and [13]. The comparison uses the simple VLC table in Table 5.3, which has thirteen entries.

One Table Mapping with Magnitude-Offset

In this technique, we build a table containing all possible codewords. Each entry in the table has two elements, which are the corresponding VLC symbol and its code length.

Table 5.2: Execution Time of Getting One Reference BAB on PACDSP

                    Original (Cycles)   Optimized (Cycles)   Speed Up (%)
Whole BAB in VOP    18,184              4,325                76.22

Table 5.3: Variable Length Codes for dct_dc_size_luminance [5]

Variable length code    dct_dc_size_luminance
011                     0
11                      1
10                      2
010                     3
001                     4
0001                    5
0000 1                  6
0000 01                 7
0000 001                8
0000 0001               9
0000 0000 1             10
0000 0000 01            11
0000 0000 001           12

Thus, because the maximum code length is 11 bits in this example, the table has 2^11 = 2,048 entries. We fetch the next 11 bits of the bitstream, and their magnitude (their value as an unsigned integer) gives the index of the corresponding entry in the table. A codeword shorter than 11 bits therefore occupies every entry whose index begins with that codeword. Note that we only have to access the bitstream once per symbol. An example assembly program of one-table mapping with magnitude-offset on the PACDSP is shown in Fig. 5.8.
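The lookup described above can be sketched in Python. This is only an illustration under stated assumptions, not the PACDSP assembly of Fig. 5.8: the bitstream is modeled as a string of '0'/'1' characters, and the names `VLC`, `build_lut`, and `decode` are hypothetical.

```python
# Table 5.3 as bit strings: codeword -> dct_dc_size_luminance
VLC = {
    "011": 0, "11": 1, "10": 2, "010": 3, "001": 4, "0001": 5,
    "00001": 6, "000001": 7, "0000001": 8, "00000001": 9,
    "000000001": 10, "0000000001": 11, "00000000001": 12,
}
MAX_LEN = max(len(code) for code in VLC)  # 11 bits for this table


def build_lut(vlc, max_len):
    """Expand a prefix code into a direct-lookup table of 2**max_len entries."""
    lut = [None] * (1 << max_len)
    for code, symbol in vlc.items():
        pad = max_len - len(code)
        base = int(code, 2) << pad
        # Every index that starts with this codeword maps to the same entry.
        for low in range(1 << pad):
            lut[base + low] = (symbol, len(code))
    return lut


def decode(bits, lut, max_len):
    """Decode a string of '0'/'1' characters into a list of symbols."""
    symbols, pos = [], 0
    while pos < len(bits):
        # One fetch per symbol; zero-pad when fewer than max_len bits remain.
        window = bits[pos:pos + max_len].ljust(max_len, "0")
        entry = lut[int(window, 2)]
        if entry is None or pos + entry[1] > len(bits):
            raise ValueError(f"invalid or truncated codeword at bit {pos}")
        symbol, length = entry
        symbols.append(symbol)
        pos += length
    return symbols


LUT = build_lut(VLC, MAX_LEN)
```

Each entry stores the symbol together with its code length, so the decoder knows how many bits to consume after the single lookup. The price is the 2,048-entry table for only thirteen codewords.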

Bit by Bit Matching

If the VLC table is not very large, we can simply read the bitstream bit by bit and, after each bit, check whether the bits read so far match any codeword in the table. The advantage of this method is its simplicity, but it needs many memory accesses to acquire the bits and many comparison instructions. Therefore, the average execution time to decode a symbol is long. An example assembly program of bit-by-bit matching on the PACDSP is shown in Fig. 5.9.

Figure 5.8: Example of one table mapping with magnitude-offset on PACDSP.

Figure 5.9: Example of bit-by-bit matching on PACDSP.
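A minimal Python sketch of bit-by-bit matching follows, assuming the bitstream is a string of '0'/'1' characters. The function names are hypothetical, and the returned read count only illustrates how the number of bitstream accesses grows with the code length; it does not model PACDSP cycles.

```python
# Table 5.3 as bit strings: codeword -> dct_dc_size_luminance
VLC = {
    "011": 0, "11": 1, "10": 2, "010": 3, "001": 4, "0001": 5,
    "00001": 6, "000001": 7, "0000001": 8, "00000001": 9,
    "000000001": 10, "0000000001": 11, "00000000001": 12,
}
MAX_LEN = max(len(code) for code in VLC)


def decode_one_bit_by_bit(bits, pos):
    """Return (symbol, next_pos, bit_reads) for the codeword starting at pos."""
    code = ""
    while len(code) < MAX_LEN and pos + len(code) < len(bits):
        code += bits[pos + len(code)]  # one bitstream access per bit
        # Table check after every bit (a dict lookup here; on a DSP this
        # would be a chain of comparison instructions).
        if code in VLC:
            return VLC[code], pos + len(code), len(code)
    raise ValueError(f"no codeword found at bit {pos}")


def decode_bit_by_bit(bits):
    """Decode a whole string of '0'/'1' characters into a list of symbols."""
    symbols, pos = [], 0
    while pos < len(bits):
        symbol, pos, _ = decode_one_bit_by_bit(bits, pos)
        symbols.append(symbol)
    return symbols
```

The shortest codewords need two reads, while the longest needs eleven, which is the behaviour the text attributes to this method.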

Multiple-Pass Matching

To reduce the number of bitstream accesses, we may divide the VLC table into several subtables. Since symbols with shorter codes appear more frequently, we search the subtable with the shorter codes first. For example, we may divide the example table into two subtables: symbols 0–6 form the first subtable and symbols 7–12 form the second. In decoding, we read the first five bits of the bitstream and check whether any code in the first subtable matches them. If not, we read the next six bits and check the second subtable. The procedure is similar when there are more subtables. An example assembly program of multiple-pass matching on the PACDSP is shown in Fig. 5.10.

Figure 5.10: Example of multiple-pass matching on PACDSP.
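The two-subtable procedure can be sketched in Python as follows. The sketch assumes a bitstream given as a string of '0'/'1' characters; `FIRST`, `SECOND`, `match_prefix`, and `decode_one_multipass` are hypothetical names, and the access counter is for illustration only.

```python
# Table 5.3 as bit strings: codeword -> dct_dc_size_luminance
VLC = {
    "011": 0, "11": 1, "10": 2, "010": 3, "001": 4, "0001": 5,
    "00001": 6, "000001": 7, "0000001": 8, "00000001": 9,
    "000000001": 10, "0000000001": 11, "00000000001": 12,
}
FIRST = {c: s for c, s in VLC.items() if len(c) <= 5}   # symbols 0-6
SECOND = {c: s for c, s in VLC.items() if len(c) > 5}   # symbols 7-12


def match_prefix(window, subtable):
    """Return (symbol, length) of the codeword that prefixes window, or None."""
    for code, symbol in subtable.items():
        if window.startswith(code):
            return symbol, len(code)
    return None


def decode_one_multipass(bits, pos):
    """Return (symbol, next_pos, accesses) for the codeword starting at pos."""
    window = bits[pos:pos + 5]            # first access: five bits
    accesses = 1
    hit = match_prefix(window, FIRST)
    if hit is None:
        window += bits[pos + 5:pos + 11]  # second access: six more bits
        accesses = 2
        hit = match_prefix(window, SECOND)
    if hit is None:
        raise ValueError(f"no codeword found at bit {pos}")
    symbol, length = hit
    return symbol, pos + length, accesses


def decode_multipass(bits):
    """Decode a whole string of '0'/'1' characters into a list of symbols."""
    symbols, pos = [], 0
    while pos < len(bits):
        symbol, pos, _ = decode_one_multipass(bits, pos)
        symbols.append(symbol)
    return symbols
```

Symbols 0–6 are resolved after one access and symbols 7–12 after two, so the access count is bounded by the number of subtables rather than by the code length.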

Optimized Multiple-Pass Matching

In our implementation, we use an idea similar to multiple-pass matching to realize the VLD on the PACDSP. As before, we divide the VLC table into two subtables in this example. However, instead of accessing the bitstream once for each subtable, we access it only once and fetch as many bits as the longest code in the VLC table (11 bits here). We can then obtain each candidate code by shifting the fetched bits while searching the table. In addition, because the predicate registers (p0–p15) are shared by the two clusters in the PACDSP, we can transmit the code to the other cluster and execute the comparison instructions at the same time, and then perform conditional execution according to the contents of the predicate registers. An example assembly program of optimized multiple-pass matching on the PACDSP is shown in Fig. 5.11.

Figure 5.11: Example of optimized multiple-pass matching on PACDSP.
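The single-fetch search by shifts can be sketched in Python as below. This covers only the sequential part of the idea: the parallel comparison on two clusters and the predicated execution are hardware features that the sketch does not model. The rule used to pick a subtable (testing whether the top five bits are zero) is an assumption that happens to hold for Table 5.3, and all names are hypothetical.

```python
# Table 5.3 as bit strings: codeword -> dct_dc_size_luminance
VLC = {
    "011": 0, "11": 1, "10": 2, "010": 3, "001": 4, "0001": 5,
    "00001": 6, "000001": 7, "0000001": 8, "00000001": 9,
    "000000001": 10, "0000000001": 11, "00000000001": 12,
}
MAX_LEN = 11  # longest code length in the table
SPLIT = 5     # longest code length in the first subtable

# Subtables as (length, value, symbol) triples.
FIRST = [(len(c), int(c, 2), s) for c, s in VLC.items() if len(c) <= SPLIT]
SECOND = [(len(c), int(c, 2), s) for c, s in VLC.items() if len(c) > SPLIT]


def decode_one_single_fetch(bits, pos):
    """Return (symbol, next_pos) using one MAX_LEN-bit fetch and shifts."""
    # Single access: fetch MAX_LEN bits as an integer, zero-padded at the end.
    window = int(bits[pos:pos + MAX_LEN].ljust(MAX_LEN, "0"), 2)
    # Assumed selection rule: every code of symbols 7-12 starts with five zeros.
    subtable = FIRST if window >> (MAX_LEN - SPLIT) else SECOND
    for length, value, symbol in subtable:
        # A right shift isolates the leading `length` bits of the window.
        if window >> (MAX_LEN - length) == value and pos + length <= len(bits):
            return symbol, pos + length
    raise ValueError(f"no codeword found at bit {pos}")


def decode_single_fetch(bits):
    """Decode a whole string of '0'/'1' characters into a list of symbols."""
    symbols, pos = [], 0
    while pos < len(bits):
        symbol, pos = decode_one_single_fetch(bits, pos)
        symbols.append(symbol)
    return symbols
```

Unlike the direct-lookup approach, this needs only the thirteen table entries, at the cost of a short search per symbol.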

Comparison of Different VLD Methods

We decode a bitstream consisting of all possible symbols on the PACDSP using each of the four methods introduced above. The results are shown in Fig. 5.12 and Table 5.4.

With “one table mapping with magnitude-offset,” we access the bitstream only once and obtain the output by a table lookup. Therefore, the execution time is the same for every symbol, only 35 cycles. The primary drawback of this method is the memory required by the lookup table, whose size grows exponentially with the maximum code length.

The second method, “bit-by-bit matching,” has the best performance for the shortest codeword. However, its performance degrades significantly as the codeword gets longer. Because entropy coding uses shorter codes to represent more frequently appearing symbols, “bit-by-bit matching” can still be used when most symbols are encoded with short codewords.

The third method, “multiple-pass matching,” has a similar characteristic: its performance also degrades with longer codewords. However, because we access the bitstream only twice even for the longest codeword, the worst case needs 89 cycles rather than 256 cycles.

Finally, our implementation exploits the PACDSP architecture to optimize multiple-pass matching and fetches from the bitstream only once. Its performance (37 to 38 cycles) is very close to that of “one table mapping with magnitude-offset” (35 cycles). Moreover, our implementation needs no memory for building a lookup table. Therefore, this method provides a good tradeoff between memory requirement and execution time.

Table 5.4: Execution Time of Different VLD Methods on PACDSP (Cycles)

Code Pattern     One Table Mapping        Bit-by-Bit   Multiple-Pass   Optimized Multiple-Pass
                 with Magnitude-Offset    Matching     Matching        Matching
10               35                       27           34              38
11               35                       31           41              38
001              35                       54           48              38
010              35                       58           55              38
011              35                       62           62              38
0001             35                       85           69              38
0000 1           35                       108          75              38
0000 01          35                       131          54              37
0000 001         35                       154          61              37
0000 0001        35                       177          68              37
0000 0000 1      35                       210          75              37
0000 0000 01     35                       233          82              37
0000 0000 001    35                       256          89              37
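To make the comparison concrete, the per-codeword cycle counts of Table 5.4 can be aggregated with a few lines of Python. The totals assume a stream in which each of the thirteen codewords appears exactly once; the dictionary keys and the `summarize` helper are hypothetical names.

```python
# Cycle counts from Table 5.4, one value per code pattern, top to bottom.
CYCLES = {
    "one-table mapping": [35] * 13,
    "bit-by-bit": [27, 31, 54, 58, 62, 85, 108, 131, 154, 177, 210, 233, 256],
    "multiple-pass": [34, 41, 48, 55, 62, 69, 75, 54, 61, 68, 75, 82, 89],
    "optimized multiple-pass": [38] * 7 + [37] * 6,
}


def summarize(cycles):
    """Return total, best-case, and worst-case cycles for each method."""
    return {
        name: {"total": sum(v), "best": min(v), "worst": max(v)}
        for name, v in cycles.items()
    }


SUMMARY = summarize(CYCLES)
for name, stats in SUMMARY.items():
    print(f"{name:24s} total={stats['total']:5d} "
          f"best={stats['best']:3d} worst={stats['worst']:3d}")
```

Under that assumption, the optimized method stays within a few cycles per symbol of the direct lookup while avoiding its 2,048-entry table.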
