Chapter 5 H.264 Encoder Implementation and Optimization on DSP Platform
5.4 DSP Code Acceleration Methods
5.5.2 Encoding Speed on DSP board
On the DSP board, we can use an internal timer to count the executed clock cycles on the C6416T emulator. The 32-bit general-purpose timers are included in the DSP core processor.
The timer clock is closer to the real system than that of the C6416 simulator. We therefore measure the encoding time on the DSP board with this internal timer instead of the C6416 simulator.
is 40.4, the fps of “akiyo” is 76, the fps of “mobile” is 24.2 and the fps of “stefan” is 26. We thus achieve nearly real-time encoding, which is 30 fps. As discussed in the earlier sections, the PSNR performance of the accelerated version is almost identical to that of the original x264 version.
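The conversion from measured clock cycles to seconds per frame and fps can be sketched as follows. This is a minimal illustration, assuming a 1 GHz CPU clock (an assumption inferred from the cycle and seconds columns of the result tables, not stated in the text); the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed CPU clock of the C6416T board: 1 GHz (inferred from the
   cycle and seconds columns of the result tables). */
#define DSP_CLOCK_HZ 1000000000.0

/* Average time in seconds needed to encode one frame. */
double seconds_per_frame(uint64_t total_cycles, unsigned frames)
{
    return ((double)total_cycles / (double)frames) / DSP_CLOCK_HZ;
}

/* Frames per second derived from the measured cycle count. */
double frames_per_second(uint64_t total_cycles, unsigned frames)
{
    return 1.0 / seconds_per_frame(total_cycles, frames);
}
```

Calling `frames_per_second(1331768824ULL, 50)` uses the cycle count reported for “foreman” at QP 16 and should land close to the fps value listed for that row.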
Table 5-13 Results of the “foreman” sequence on the C6416 emulator
Sequence: foreman_qcif
QP   Clock cycles (50 frames)   Average clock cycles per frame   Time per frame (sec)   fps
16   1,331,768,824              26,635,376                       0.0266                 37.5
20   1,269,304,288              25,386,086                       0.0254                 39.4
24   1,236,622,048              24,732,441                       0.0247                 40.4
28   1,200,254,168              24,005,083                       0.0240                 41.7
32   1,198,344,096              23,966,882                       0.0240                 41.7
36   1,194,814,936              23,896,299                       0.0239                 41.8
Table 5-14 Results of the “akiyo” sequence on the C6416 emulator
Sequence: akiyo_qcif
QP   Clock cycles (50 frames)   Average clock cycles per frame   Time per frame (sec)   fps
16   666,665,336                13,333,307                       0.0133                 75.0
20   665,718,728                13,314,375                       0.0133                 75.1
24   659,000,184                13,180,004                       0.0132                 75.9
Table 5-15 Results of the “mobile” sequence on the C6416 emulator
Sequence: mobile_qcif
QP   Clock cycles (50 frames)   Average clock cycles per frame   Time per frame (sec)   fps
16   2,171,652,688              43,433,054                       0.0434                 23.0
20   2,144,657,712              42,893,154                       0.0429                 23.3
24   2,143,347,584              42,866,952                       0.0429                 23.3
28   2,011,821,640              40,236,433                       0.0402                 24.9
32   1,986,374,176              39,727,484                       0.0397                 25.2
36   1,943,237,440              38,864,749                       0.0389                 25.7
Table 5-16 Results of the “stefan” sequence on the C6416 emulator
Sequence: stefan_qcif
QP   Clock cycles (50 frames)   Average clock cycles per frame   Time per frame (sec)   fps
16   1,944,197,344              38,883,947                       0.0389                 25.7
20   1,920,824,192              38,416,484                       0.0384                 26.0
24   1,924,583,792              38,491,676                       0.0385                 26.0
28   1,915,755,304              38,315,106                       0.0383                 26.1
32   1,918,608,224              38,372,164                       0.0384                 26.1
36   1,907,222,528              38,144,451                       0.0381                 26.2
Chapter 6 H.264/AVC SVC Decoder Implementation and Optimization on DSP Platform
Chapter 3 gives an overview of the scalable extension of H.264. We now discuss our implementation of the SVC decoder on DSP. We describe the system architecture and the procedure of our implementation work. Then we analyze the JSVM decoder, which is the reference software of H.264/AVC SVC, and identify the most complex elements in the decoder. We also present a few techniques that accelerate code execution and acceleration methods that take advantage of the features of the C64x. Finally, we show experimental results on the speed and the coding performance of our system.
6.1 System Architecture
Figure 6-1 shows the overall architecture of the decoder for the scalable extension of H.264. When the bit-stream enters the JSVM decoder, it is split into the base-layer part and the enhancement-layer part. The base layer of the JSVM decoder is almost the same as the H.264 decoder. It includes entropy decoding, inverse quantization, inverse transform, motion compensation and the deblocking filter. The enhancement layer is similar to the H.264 decoder but with a few modifications. The decoder structure supports three types of scalability. In
texture prediction. The inter-layer prediction information needs to be up-sampled for the enhancement layer. To improve the coding efficiency, the enhancement layer can take its prediction from either the reference frame or the inter-layer prediction from the base layer.
These signals are controlled by SW3, SW4 and SW5, which are shown in Figure 6-1. For SNR scalability, both CGS and FGS are supported. CGS uses the same inter-layer prediction mechanism as the spatial scalable coding, but without the up-sampling operation. The other case is FGS.
FGS coding is based on so-called progressive refinement slices. The H.264/AVC CABAC is extended to support FGS. The procedure of temporal scalable coding has already been discussed in section 3.2.2. The JSVM decoder uses the hierarchical-B prediction structure (shown in Figure 3-5) for temporal scalable coding. A decoded picture buffer (DPB) method is used to implement the temporal scalability.
6.2 Procedure of the Implementation Work
As discussed in section 6.1, we port the decoder of the scalable extension of H.264 to the DSP board.
Because JSVM is the only reference software that is available for SVC, it is our starting point for porting. JSVM includes an H.264/AVC SVC encoder, a decoder and some other useful tools, and these programs share some common functions. It is developed on the Visual Studio platform. Hence, the first step is to extract the decoder from JSVM and make it a stand-alone program. In our implementation, we extract the H.264/AVC SVC decoder from the reference software JSVM 5.0 [27].
After making the decoder an independent program, the next step is to port the code from Visual Studio (on PC) to Code Composer Studio (CCS, the integrated development environment for TI’s DSP). Because CCS does not support all of the Visual Studio C++ programming functionality, this step runs into the problems described below.
(1) CCS itself does not provide the Standard Template Library (STL), but JSVM uses the STL heavily. Therefore, we use an STL implementation called STLport [28], which can be ported to many platforms. After a proper configuration set-up, STLport works correctly on CCS.
(2) Exception handling is not supported by CCS. That is, the C++ keywords “try”, “throw” and “catch” cannot be used in CCS, but those keywords are found in JSVM. So all of these keywords must be removed and the affected code modified properly to ensure the correctness of the whole program.
(3) CCS does not implement some useful C++ headers such as iostream.h, io.h and so on. Therefore, we replace the code that depends on them with equivalent, supported functions.
index pointer of the decoder on CCS. After the porting is finished, we can confirm that the decoding result is correct.
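One common way to remove “try”, “throw” and “catch” is to propagate explicit status codes instead. The sketch below illustrates that general pattern only; it is written in C for consistency with the other listings, and every name in it (dec_status, DEC_CHECK, check_qp, decode_slice) is hypothetical rather than taken from JSVM.

```c
#include <stddef.h>

/* Hypothetical status type standing in for C++ exceptions. */
typedef enum { DEC_OK = 0, DEC_ERR_NULL = 1, DEC_ERR_RANGE = 2 } dec_status;

/* Return early with the failing status; plays the role of "throw". */
#define DEC_CHECK(expr)                      \
    do {                                     \
        dec_status s_ = (expr);              \
        if (s_ != DEC_OK) return s_;         \
    } while (0)

/* Hypothetical helper: validate a quantization parameter (0..51 in H.264). */
static dec_status check_qp(int qp)
{
    return (qp >= 0 && qp <= 51) ? DEC_OK : DEC_ERR_RANGE;
}

/* Hypothetical entry point: errors are returned instead of thrown. */
dec_status decode_slice(const unsigned char *buf, size_t len, int qp)
{
    if (buf == NULL || len == 0)
        return DEC_ERR_NULL;        /* was: throw */
    DEC_CHECK(check_qp(qp));        /* error propagates to the caller */
    /* ... actual decoding would go here ... */
    return DEC_OK;
}
```

The caller then tests the returned status where the original code would have had a “catch” block.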
6.3 JSVM 5.0 Decoder Complexity Analysis
We profile the JSVM 5.0 decoder to find which parts take the most computation time on DSP. To profile the decoder, we use the profiler of the stand-alone C6416T DSP simulator. We concentrate on the most critical areas and try to accelerate these modules.
As described in chapter 3, the scalable extension of H.264 has three types of scalability. In this section, we profile each type of scalability separately. Finally, we profile the combined scalability, which contains the spatial, temporal and SNR scalability. The profiling results for the different types of scalability are shown in Figure 6-2 (Temporal), Figure 6-3 (Spatial), Figure 6-4 (SNR) and Figure 6-5 (Combined). The test video sequence is “city.264”, and the simulation conditions are shown in Table 6-1. The compiler optimization level of the C6416 simulator is the “File” level (-o3), and we already use the L2 cache, which has been described in section 5.3.2. Table 6-2 shows the cycles of the different types of scalability. We can see that the spatial and SNR scalability need more cycles than the temporal scalability. In the following sections, we focus on these two types of scalability.
Table 6-1 Simulation parameters
Test sequence: city.264, IPPPPP, 9 frames
Scalability   GOP size   Frame size   FGS layers   QP
Temporal      4          CIF          0            30
Spatial       1          QCIF, CIF    0            30
Table 6-2 Cycles on different scalability
(Chart title: Temporal Scalability of JSVM 5.0 decoder profile on C6416 simulator; legible label: Tran. & Quan.)
Figure 6-2 Complexity profiling of the Temporal Scalability of JSVM 5.0 decoder
(Chart title: Spatial Scalability of JSVM 5.0 decoder profile on C6416 simulator)
Figure 6-3 Complexity profiling of the Spatial Scalability of JSVM 5.0 decoder
(Chart title: SNR Scalability of JSVM 5.0 decoder profile on C6416 simulator)
(Chart title: Combined Scalability of JSVM 5.0 decoder profile on C6416 simulator)
Figure 6-5 Complexity profiling of the Combined Scalability of JSVM 5.0 decoder
In Figure 6-2, the profile of the temporal scalability is almost the same as that of the H.264 decoder. The most complex parts are motion compensation, the loop filter, entropy coding (CABAC), and transform and quantization. In Figure 6-3, the profile of the spatial scalability, the most computation-intensive part is inter-layer prediction for residual and intra texture data, which accounts for almost 33% of the computation. In Figure 6-4, the profile of the SNR scalability, the most complex part is FGS. Finally, in Figure 6-5, the major computation parts of the combined scalability are inter-prediction, which takes about 20%, and FGS, which takes 53%. In the following sections, we develop several techniques to reduce the complexity of these major computation parts.
6.4 DSP Code Acceleration Methods
6.4.1 Packed Data Processing
It is often desirable to use a single load or store instruction to access multiple data values located consecutively in memory. This is called the Single Instruction Multiple Data (SIMD) method. For example, when operating on a bit-stream, we can use word (32-bit) accesses to read two 16-bit (short) or four 8-bit (char) data values at a time. This method can improve the code efficiency substantially. Figure 6-6 shows an example of using the SIMD method. Some intrinsic functions enhance the efficiency in a similar way.
Figure 6-6 SIMD example of using the word instructions for adding short data
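The idea of processing two short values per word access can be illustrated in portable C. The sketch below is not the TI implementation: packed_add16 and add_shorts_packed are hypothetical names, and the packed add is emulated with ordinary integer arithmetic so that it runs on any host. It assumes an even number of elements.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for a packed 16-bit add: the two halves are added
   independently, so no carry crosses from the low half to the high half. */
static uint32_t packed_add16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}

/* dst[i] += src[i] for n shorts, two elements per 32-bit access.
   n is assumed to be even. */
void add_shorts_packed(int16_t *dst, const int16_t *src, unsigned n)
{
    unsigned i;
    for (i = 0; i + 1 < n; i += 2) {
        uint32_t d, s;
        memcpy(&d, &dst[i], sizeof d);   /* one word load = two shorts */
        memcpy(&s, &src[i], sizeof s);
        d = packed_add16(d, s);
        memcpy(&dst[i], &d, sizeof d);   /* one word store = two shorts */
    }
}
```

On the DSP, the word load, the packed add and the word store each map to a single instruction, which is where the gain over the element-by-element loop comes from.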
6.4.2 Intrinsic
The TI C6000 compiler provides many special functions that map directly to inlined C64x instructions, which speeds up the C code. These special functions are called intrinsics. If an operation has an equivalent intrinsic function, we can replace it with that intrinsic, and the execution time decreases. Intrinsics are specified with a leading underscore (_) and are accessed by calling them as ordinary functions. There are quite a few intrinsic functions defined for the C6000 series DSP. More details about the TI
additions in one instruction. The performance gained by adopting intrinsics is shown in Table 6-3.
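/* Original code; pDes and pSrc point to data of type short (2 bytes) */
for( x = 0; x < uiWidth; x++ ) {
    pDes[x] += pSrc[x];
}
/* Revised code: _add2 adds two short values using a single instruction.
   Assumes uiWidth is even and both buffers are 4-byte aligned. */
for( x = 0; x < uiWidth; x += 2 ) {
    _amem4(&pDes[x]) = _add2( _amem4_const(&pDes[x]), _amem4_const(&pSrc[x]) );
}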
Figure 6-7 Use of intrinsic function in the SVC decoder
Table 6-3 Performance using intrinsic
Function Original cycles Revised cycles Reduction Ratio (%)
add 243,320,438 130,990,721 46.1%
copy 488,524,940 159,347,857 67.3%
subtract 58,412,795 14,885,097 74.5%
up-sampleResidual 1,818,090,978 1,711,585,722 5.9%
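The reduction ratio reported in the tables of this chapter appears to be the fraction of cycles saved relative to the original count. A minimal sketch of that arithmetic, with a hypothetical function name:

```c
#include <stdint.h>

/* Percentage of cycles saved: 100 * (original - revised) / original. */
double reduction_ratio_percent(uint64_t original, uint64_t revised)
{
    return 100.0 * ((double)original - (double)revised) / (double)original;
}
```

Applying it to the “copy” and “subtract” rows of Table 6-3 gives values that agree with the listed ratios to within rounding.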
6.4.3 Memory Allocation Optimization
As described in section 4.2.2, the internal program memory and the internal data memory of the C6416T are both 16 K-bytes in size. The code segment should be placed in the internal program memory. However, our code may require more memory than the internal memory provides. For instance, when dealing with a large image, we cannot load the whole image into
data in different memory sections for acceleration consideration. It also provides the “CODE_SECTION” and “DATA_SECTION” pragmas, which can allocate parts of the C code or data in the internal memory. To reduce the execution cycles of the JSVM decoder on DSP, we put some frequently used functions into the internal memory. This method decreases the memory access time.
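As a rough illustration of how these pragmas are used in C source, the sketch below places one function and one buffer in named sections. The section names “.fast_text” and “.fast_data”, the function add_residual and the buffer line_buffer are hypothetical; on the target, the named sections would still have to be mapped to internal memory in the linker command file. The pragmas are guarded so that the file also compiles on a host compiler.

```c
/* Hypothetical section names; on the target they must be mapped to
   internal memory in the linker command file. */
#ifdef __TI_COMPILER_VERSION__
#pragma CODE_SECTION(add_residual, ".fast_text")
#pragma DATA_SECTION(line_buffer, ".fast_data")
#endif

#define LINE_WIDTH 352   /* CIF luma width */

/* Scratch buffer that we would like to keep in internal data memory. */
short line_buffer[LINE_WIDTH];

/* Frequently called kernel that we would like to keep in internal
   program memory. */
void add_residual(short *dst, const short *src, unsigned width)
{
    unsigned x;
    for (x = 0; x < width; x++)
        dst[x] += src[x];
}
```

The pragmas appear before the definitions of the symbols they name, which is what the TI compiler requires.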
6.4.4 DSP library
The TI C64x DSP library is an optimized DSP function library for C programmers using C64x devices. It includes many C-callable, assembly-optimized, general-purpose signal-processing routines. By using these routines, we can achieve execution speeds considerably faster than equivalent code written in standard C. We can use the DSP library routines (convolution, FFT, IDCT, etc.) to replace the original functions in the decoder.
6.5 Fast Algorithms for the SVC Decoder
In this section, we describe the implementation of the inter-layer prediction and the FGS operation in the JSVM decoder on DSP. We also modify some methods wherever possible to reduce the computations.
6.5.1 FGS
FGS (Fine Granularity Scalability) is one tool used by the SNR scalability. The details of FGS have been described in section 3.4.2. Figure 6-8 shows the block diagram of FGS in the JSVM 5.0 decoder. In the JSVM decoder, FGS can be divided into three parts: luma, chroma DC and chroma AC. Each part includes the significant path and the refinement path. The refinement path is turned on only when the significant path is completed. When entering the
occurs in chroma DC and chroma AC. When all the significant and refinement paths in a macroblock are completed, the enhancement coefficients are properly scaled and then used to update the macroblock coefficients. The flow chart of FGS is shown in Figure 6-8.
Figure 6-5 tells us that FGS takes a large percentage of the computing time in the combined scalability case. First, we profile the FGS operation, as shown in Figure 6-9. The complex parts are decoding the luma significant coefficients, the luma refinement coefficients, the chroma AC significant coefficients and the chroma DC significant coefficients, and scaling and updating the macroblock coefficients. In the following sections, we adopt some methods to
Figure 6-9 Complexity profiling of FGS on C6416 simulator
(A) Early termination for luma significant coefficients
In Figure 6-9, decoding the luma significant coefficients is the part that takes the most computing time. So, to reduce the cycles of FGS, the decoding of the luma significant coefficients must be modified. In the luma significant coefficient decoding passes, each loop checks whether a 4x4 block has significant coefficients. To speed up this process, we set an early termination point. If all of the significant coefficients in this block are done, in the
(B) Check skipped blocks for null coefficient blocks
In the significant and the refinement paths, some coefficients are neither significant nor refinement coefficients. These coefficients are zero. If all coefficients in a block are zero, the block is called a null block.
In a null block, scaling and transforming the coefficients is redundant, so the scaling of the enhancement-layer coefficients can be skipped for that block. We detect whether the block is null or not. This method saves the time otherwise wasted on the redundant scaling operation.
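A minimal sketch of the null-block check is given below. It is an illustration under simplifying assumptions, not the JSVM code: the function names are hypothetical, a block is taken to be sixteen 16-bit coefficients, and the scaling is reduced to a single multiplication factor.

```c
#include <stdint.h>

/* Returns 1 when all 16 coefficients of a 4x4 block are zero. */
int is_null_block(const int16_t coef[16])
{
    int i;
    for (i = 0; i < 16; i++)
        if (coef[i] != 0)
            return 0;
    return 1;
}

/* Scale the enhancement-layer coefficients of one 4x4 block and add
   them to the base coefficients. Returns 1 if the block was skipped.
   The single scale factor is a simplification for illustration. */
int scale_and_update_block(int16_t base[16], const int16_t enh[16], int scale)
{
    int i;
    if (is_null_block(enh))
        return 1;                       /* nothing to scale or add */
    for (i = 0; i < 16; i++)
        base[i] = (int16_t)(base[i] + enh[i] * scale);
    return 0;
}
```

The check costs at most sixteen comparisons, which is cheap compared with scaling and transforming a block whose contribution is zero anyway.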
(C) Code refinement
In the FGS block, some functions are shared with other components of the JSVM 5.0 decoder. For example, the function “initMB” is a tool which initializes all parameters in a macroblock. But some parameters are not needed in the FGS process. For example, motion vectors are needed only in motion estimation or motion compensation, and FGS does not use them. So we can rewrite “initMB” as a version that applies only to FGS.
Table 6-4 shows the reduction ratio of each function in the FGS block, and Table 6-5 shows the results of accelerating the FGS block on the overall system under the same conditions as Table 6-1. We notice that the reduction ratio of the operational cycles is about 61%.
Table 6-4 Reduction Ratio of FGS block
Function                       Original cycles   Revised cycles   Reduction Ratio (%)
Luma significant coefficients  923,838,292       315,713,695      65.8%
Luma refinement coefficients   70,901,624        30,425,812       57%
Table 6-5 Performance of the modified FGS
Test sequence: city_qcif.264
Type Original cycles Proposed cycles Reduction Ratio (%)
SNR 2,017,898,772 870,697,861 56.8%
Combined 28,156,792,014 16,860,453,637 40.1%
6.5.2 Inter-layer Prediction
For spatial scalability, the most important component is inter-layer prediction. However, Figure 6-3 shows that inter-layer prediction decoding takes a large amount of computation. This is because, in decoding the spatial enhancement layer, the motion vectors, residuals, and intra texture data of the base layer must be up-sampled for use at the enhancement layer. The up-sampling process is complex and computationally expensive.
We design algorithms to reduce the inter-layer prediction computation.
(A) Intra texture prediction
Inter-layer prediction is computationally intensive. However, not all of the up-sampled data are needed in the enhancement layer. For example, the inter-layer intra prediction up-samples the reconstructed intra signal of the base layer. In up-sampling the luma component, one-dimensional 6-tap FIR filters are applied horizontally and vertically. The chroma components are up-sampled using a simple bi-linear filter. In the JSVM decoder, when the inter-layer prediction is in use, all the reconstructed signals of the base layer are up-sampled.
Figure 6-10 shows that the inter-layer prediction is performed before the macroblock decoding. But typically only a few blocks require the intra prediction operation. In the JSVM 5.0 decoder, only the “Intra_BL” mode needs to use the information from the base layer. Table 6-6 shows that in the spatial scalability decoding, only 2% of the blocks are using the
the intra texture data would be up-sampled. Otherwise, it would not be up-sampled.
Table 6-6 Distribution of mode in spatial scalability
Base layer Enhancement layer
Mode Number Percentage (%) Mode Number Percentage (%)
Intra_4x4 80 0.54% Intra_4x4 56 0.31%
Intra_8x8 26 0.18% Intra_8x8 25 0.14%
Intra_16x16 1 0.01% Intra_16x16 9 0.05%
Intra_BL 0 0% Intra_BL 419 2.29%
Inter_8x8 2026 13.64% Inter_8x8 8048 44.00%
Inter_8x16 2041 13.74% Inter_8x16 1708 9.34%
Inter_16x8 906 6.10% Inter_16x8 1104 6.04%
Inter_16x16 3701 24.92% Inter_16x16 5837 31.91%
Skip 6069 40.87% Skip 1084 5.93%
Total 14850 100% Total 18290 100%
(Flowchart labels: Get the information from base layer; Motion vectors; Residuals; Intra texture; Up-sampling (YES/NO); Macroblock decoding; Inter decoding; I_BL / Intra_Base decoding; IntraNxN decoding; Enhancement layer)
Figure 6-11 Block algorithm of the modified inter-layer intra prediction
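The decision described above, up-sampling the intra texture only for macroblocks coded in the “Intra_BL” mode, can be sketched as follows. The enum values and the function mark_intra_upsampling are hypothetical names introduced for illustration; the actual up-sampling filter is omitted.

```c
#include <stddef.h>

/* Hypothetical macroblock mode labels, loosely following Table 6-6. */
typedef enum {
    MB_INTRA_4X4, MB_INTRA_8X8, MB_INTRA_16X16,
    MB_INTRA_BL, MB_INTER, MB_SKIP
} mb_mode;

/* Only Intra_BL macroblocks need the up-sampled base-layer texture. */
static int needs_intra_upsampling(mb_mode mode)
{
    return mode == MB_INTRA_BL;
}

/* Marks which macroblocks require intra texture up-sampling and
   returns how many were marked. upsample_flag must hold n entries. */
size_t mark_intra_upsampling(const mb_mode *modes, size_t n,
                             unsigned char *upsample_flag)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++) {
        upsample_flag[i] = (unsigned char)needs_intra_upsampling(modes[i]);
        count += upsample_flag[i];
    }
    return count;
}
```

With the mode distribution of Table 6-6, only the Intra_BL share of the enhancement-layer macroblocks would be marked, so most of the intra texture up-sampling work is avoided.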
(B) Residual prediction
The inter-layer residual prediction can be employed for all inter-coded macroblocks. The residual signal is up-sampled using a bi-linear filter and used as the prediction value for the residual signal of the enhancement layer. Thus, only the associated difference signal is coded in the enhancement layer. In the JSVM decoder, the inter-layer residual prediction requires a lot of memory transfers. On the DSP platform, memory transfers are costly in time. So we modify the code to reduce the number of memory transfers. For the inter-layer residual prediction coding, the residuals are split into Y, U and V residual signals. But in the JSVM
Figure 6-12 Difference in the inter-layer residual prediction procedures
Table 6-7 shows the reduction ratio of the inter-layer prediction. We rewrite the program in order to accelerate the inter-layer part. Table 6-8 shows that the operational cycles of different scalability are reduced.
Table 6-7 Reduction ratio of Inter-layer prediction
Table 6-8 Performance of the modified inter-layer prediction
Test sequence: city_qcif.264
Type Original cycles Proposed cycles Reduction Ratio (%)
Spatial 5,966,597,605 5,198,506,424 12.8%
Combined 16,860,453,637 15,472,723,853 8.2%
6.6 Final Simulation and Acceleration Results
After accelerating the code and modifying the algorithms, we have substantially reduced the computation of the JSVM 5.0 decoder on DSP. Table 6-9 compares the cycle counts with and without the L2 cache. We can clearly see that the reduction ratio exceeds 50%. The improvement is not as large as the reduction ratio in the x264 case in Table 5-11, because the code size and data size of the JSVM 5.0 decoder are larger than those of x264, so the two-level cache cannot reduce the data cache misses as much. As shown in Table 6-10, the data cache miss rate decreases from 56.63% to 27.12%, which is still large. Table 6-11 shows the improvement of the optimized code compared with the original version. The simulation conditions are shown in Table 6-1. After optimizing the code, the improvement reaches about 20% in temporal scalability, 30% in spatial scalability, 55% in SNR scalability and 49% in combined scalability. For the combined case, the execution time of the overall system is reduced by about half.
Table 6-12 shows the real execution time on the C6416 emulator. The testing conditions are the same as those in Table 6-1. The resulting system can decode approximately 15 frames per second for the base layer and the temporal scalability, 1.32 frames per second for the spatial scalability, 2.84 frames per second for the SNR scalability and 0.4 frames per second for the combined scalability.
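The data cache miss rates quoted from Table 6-10 follow from the miss and reference counts reported there. A minimal sketch of that calculation, with a hypothetical function name:

```c
#include <stdint.h>

/* Data cache miss rate in percent: 100 * misses / references.
   Returns 0 when there are no references. */
double miss_rate_percent(uint64_t misses, uint64_t references)
{
    if (references == 0)
        return 0.0;
    return 100.0 * (double)misses / (double)references;
}
```

Feeding in the data cache miss and reference counts of Table 6-10 gives values that match the listed percentages for both the case without and the case with the L2 cache.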
Table 6-9 Comparison of using C6416 simulator with and without the L2 cache
Sequence Without L2 cache With L2 cache Reduction Ratio (%)
City 64,576,729,391 30,067,421,887 53.44%
Akiyo 53,685,160,996 24,403,283,420 54.54%
Foreman 65,599,674,888 30,902,096,589 52.89%
Table 6-10 Effect of using L2 cache memory
C6416 simulator               Without L2 cache               With L2 cache
Event                         Count            Percentage    Count            Percentage
Total cycles                  64,576,760,635   -             30,067,433,857   -
Core cycles (excl. stalls)    755,799,273      7.82%         755,799,785      16.8%
NOP cycles                    2,354,743,973    46.5%         2,354,744,076    46.5%
Stall cycles                  59,526,048,163   92.18%        25,016,840,078   83.2%
Instruction cache hits        953,423,952      97.36%        953,507,967      97.37%
Instruction cache misses      25,882,890       2.64%         25,799,022       2.63%
Data cache references         915,440,375      -             915,440,489      -
Data cache reads              621,323,760      67.87%        621,323,824      67.87%
Data cache writes             294,116,624      32.13%        294,116,672      32.13%
Data cache hits               397,007,357      43.37%        667,191,138      72.88%
Data cache read hits          221,802,890      35.7%         446,661,075      71.89%
Data cache write hits         175,204,467      59.57%        220,530,063      74.98%
Data cache misses             518,433,018      56.63%        248,249,351      27.12%
Table 6-11 Comparison using C6416 simulator on the original and the modified codes
Sequence   Type        Original cycles    Optimized cycles   Reduction Ratio (%)
City       Baselayer   521,049,251        415,868,509        20.19%
City       Temporal    540,286,942        428,750,233        20.64%
City       Spatial     7,427,104,703      5,198,506,424      30.01%
City       SNR         2,027,805,801      857,551,343        57.71%
City       Combined    30,067,421,887     15,472,723,853     48.54%
Akiyo      Baselayer   427,770,004        337,728,563        21.05%
Akiyo      Temporal    442,938,814        348,379,134        21.35%
Akiyo      Spatial     6,353,717,397      4,077,516,481      35.82%
Akiyo      SNR         1,743,465,294      790,636,440        54.65%
Akiyo      Combined    24,403,283,420     12,474,314,013     48.88%
Foreman    Baselayer   491,066,392        383,309,737        21.94%
Foreman    Temporal    518,112,745        405,372,731        21.76%
Foreman    Spatial     7,327,335,062      5,069,933,810      30.81%
Foreman    SNR         1,968,622,159      853,436,646        56.65%
Foreman    Combined    30,902,096,589     15,781,271,800     48.93%
Table 6-12 Performance on the C6416 emulator
Sequence Type Execution Time 9 frames (sec.)