PPE Profiling
In our profiling, we split motion compensation into luma MC and chroma MC for finer-grained profiling. The modified process network of the H.264 decoder is shown in Figure 4-1.
Figure 4-1 Modified Process Network of H.264 Decoder
In the multimedia decoding market, high-definition (HD) resolution is a basic requirement, so we adopted two 1080P full-HD test sequences for profiling. Sunflower and RushHour 1080P (shown in Figure 4-2), with 500 frames, were analyzed.
Figure 4-2 Sunflower and RushHour 1080P Test Sequence
The profiling result of the PPE is shown in Figure 4-3.
Profiling Result
Figure 4-3 Profiling Result after granularity adjustment of H.264 Decoder
We can see that luminance motion compensation, chrominance motion compensation, and the de-blocking filter are the most workload-intensive parts, so local optimization should be applied to them first. Motion compensation is the most computation-intensive kernel in H.264 decoding. Therefore, we take motion compensation as an example to describe how we offload a kernel to an SPE.
Data Alignment
Data in the CBE processor must be aligned to a 128-bit boundary for DMA transfers and SIMD operations. In our work, we allocate memory for pixels and vectors with a frame as the unit, aligned to a 128-bit boundary, as shown in Figure 4-4. The addresses of pixels and vectors are contiguous in the x-direction.
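The alignment rule can be illustrated with a minimal portable C sketch. It is not the decoder's actual allocator: the `frame_t` type and the `frame_alloc`, `frame_row`, and `round_up_16` helpers are hypothetical names introduced here, and 2-byte pixels are assumed. The sketch pads every row to a multiple of 16 bytes so that both the frame base and the start of each row fall on a 128-bit boundary.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define QUADWORD_BYTES 16u /* 128 bits */

/* Round n up to the next multiple of 16 bytes. */
static size_t round_up_16(size_t n) {
    return (n + QUADWORD_BYTES - 1u) & ~(size_t)(QUADWORD_BYTES - 1u);
}

/* Hypothetical frame buffer: 16-bit pixels, rows contiguous in x. */
typedef struct {
    uint16_t *pixels; /* base address, 16-byte aligned */
    size_t stride;    /* bytes per row, a multiple of 16 */
    int width;
    int height;
} frame_t;

/* Allocate one frame as a single aligned unit. Returns 0 on success. */
static int frame_alloc(frame_t *f, int width, int height) {
    assert(width > 0 && height > 0);
    f->width = width;
    f->height = height;
    f->stride = round_up_16((size_t)width * sizeof(uint16_t));
    /* C11 aligned_alloc needs a size that is a multiple of the alignment;
       stride * height satisfies this because stride is a multiple of 16. */
    f->pixels = aligned_alloc(QUADWORD_BYTES, f->stride * (size_t)height);
    return f->pixels ? 0 : -1;
}

/* Start of row y; always 16-byte aligned because of the padded stride. */
static uint16_t *frame_row(const frame_t *f, int y) {
    assert(y >= 0 && y < f->height);
    return (uint16_t *)((uint8_t *)f->pixels + f->stride * (size_t)y);
}
```

For a 1920-pixel-wide frame the row is already 3840 bytes, a multiple of 16, so no padding is added; narrower or odd widths would be padded.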
Figure 4-4 Data Layout of pixels
Motion Compensation
Motion compensation is the most computation-intensive part of the H.264 decoder. Each 4x4 submacroblock has a separate motion vector, and a 6-tap filter is used for half-pixel interpolation, so each 4x4 submacroblock needs 9x9 reference pixels. Offloading this kernel to an SPE therefore requires high DMA bandwidth. The addresses of the 9x9 pixels are not contiguous, and a single DMA command can only transfer contiguous data in main memory, yet the MFC SPU command queue has only 16 entries.
This data arrangement minimizes the overhead of macroblock-based access to pixels and vectors, but unaligned accesses still cost extra effort. For example, accessing an arbitrary 9x9 pixel block for luminance compensation requires transferring at least 18x9 pixels (each pixel is 2 bytes) because of misalignment, as shown in Figure 4-5.
Figure 4-5 Un-aligned access for arbitrary 9x9 pixels
To overcome the limited number of MFC SPU command queue entries, we first write a DMA list into the SPU's LS, which lets us issue a large number of DMA transfers using only one entry in the MFC SPU command queue. A DMA list moves data between a contiguous area in an SPE's LS and a possibly noncontiguous area in the effective address space. It can specify up to 2048 DMA transfers, each up to 16 KB in length.
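The idea of a DMA list can be sketched in portable C as one list element per row of a reference block. This is only a model under stated assumptions: `dma_elem_t` and `build_row_list` are hypothetical names, the element layout is simplified and is not the SDK's own list element type, and the effective address is assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LIST_ELEMS 2048u  /* transfers per DMA list, as stated above */
#define MAX_ELEM_BYTES 16384u /* 16 KB per transfer, as stated above */

/* Hypothetical, simplified stand-in for one DMA list element. */
typedef struct {
    uint32_t size; /* bytes to transfer for this element */
    uint32_t eal;  /* low 32 bits of the effective address */
} dma_elem_t;

/* Fill one element per row of a block whose rows are `stride` bytes apart
   in main memory. Returns the number of elements written, or -1 if the
   request exceeds the list limits. */
static int build_row_list(dma_elem_t *list, uint32_t base_ea,
                          uint32_t stride, uint32_t row_bytes,
                          uint32_t rows) {
    assert(list != 0);
    if (rows > MAX_LIST_ELEMS || row_bytes > MAX_ELEM_BYTES) {
        return -1;
    }
    for (uint32_t r = 0; r < rows; r++) {
        list[r].size = row_bytes;
        list[r].eal = base_ea + r * stride; /* rows are not contiguous */
    }
    return (int)rows;
}
```

On the SPU, a list built this way would be handed to the MFC with a single list command (for example the SDK's `mfc_getl`), so the nine row transfers of a 9x9 block occupy one queue entry instead of nine.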
In the CBE processor, a 128-bit-wide SIMD register holds 8 halfword integers. We can compute 8 results of the 6-tap FIR at once with six 128-bit-wide registers, as shown in Figure 4-6.
Computing eight 6-tap FIR results takes 9 instructions with the form (A+F)-((B+E)-((C+D)<<2))x5. Four extra instructions are needed if the form A+F-5(B+E)+20(C+D) is adopted, because 32-bit multiplication is not supported by the CBE processor's SIMD instructions.
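The equivalence of the two forms can be checked with a short scalar C sketch. It models the eight halfword lanes with a plain loop rather than SPU intrinsics, the function names are hypothetical, and the rounding, normalization, and clipping steps of the full interpolation are left out.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8 /* halfword lanes in one 128-bit register */

/* Direct form: A - 5B + 20C + 20D - 5E + F. */
static int fir6_direct(int a, int b, int c, int d, int e, int f) {
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f;
}

/* Factored form: (A+F) - ((B+E) - ((C+D) << 2)) * 5.
   Only one multiplication by a small constant remains. */
static int fir6_factored(int a, int b, int c, int d, int e, int f) {
    assert(c >= 0 && d >= 0); /* pixel values are non-negative */
    return (a + f) - ((b + e) - ((c + d) << 2)) * 5;
}

/* Lane-wise model of the SIMD computation: each array stands for one
   packed register of 8 halfwords. For 8-bit pixel values the result
   lies in [-2550, 10710] and fits in a halfword. */
static void fir6_lanes(const int16_t *A, const int16_t *B, const int16_t *C,
                       const int16_t *D, const int16_t *E, const int16_t *F,
                       int16_t *out) {
    for (int i = 0; i < LANES; i++) {
        out[i] = (int16_t)fir6_factored(A[i], B[i], C[i], D[i], E[i], F[i]);
    }
}
```

Because every intermediate value stays within 16 bits for 8-bit pixels, the factored form suits halfword lanes.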
The bottleneck of SIMD optimization is the pack/unpack procedure. Eight 6-tap FIRs (A-5B+20C+20D-5E+F) can be performed with six packed registers, as shown in Figure 4-6.
However, the pixels needed for the FIRs are not contiguous in the memory layout. In the worst case, without any optimization, packing six registers with 8 pixels each takes 48 instructions, and unpacking the 8 result pixels takes 8 instructions. A pack/unpack procedure therefore has to be designed to run before and after the SIMD operations.
In the pack/unpack procedure, the most useful instruction is the byte-shuffle operation. For each of the 16 bytes in an output quadword, it selects any 1 of the 32 bytes in two input quadwords, according to the parameters in a third input quadword. That means we can construct one register from any bytes of the two input registers. Figure 4-7 shows the byte-shuffle operation.
Figure 4-7 Byte-Shuffle Operation (byte indices shown in the figure: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23)
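The selection rule of the byte-shuffle operation can be modeled in portable C as follows. This is a simplified emulation, not the SPU instruction itself: `quad_t` and `byte_shuffle` are hypothetical names, only the low 5 bits of each pattern byte are interpreted, and the special constant-generating pattern values of the real instruction are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit quadword viewed as 16 bytes. */
typedef struct {
    uint8_t b[16];
} quad_t;

/* For each output byte, the pattern byte picks one of the 32 input bytes:
   indices 0..15 select from `a`, indices 16..31 select from `b`. */
static quad_t byte_shuffle(quad_t a, quad_t b, quad_t pattern) {
    quad_t out;
    for (int i = 0; i < 16; i++) {
        uint8_t sel = pattern.b[i] & 0x1F; /* keep 5 bits: 0..31 */
        assert(sel < 32);
        out.b[i] = (sel < 16) ? a.b[sel] : b.b[sel - 16];
    }
    return out;
}
```

With the pattern 0 to 7 followed by 16 to 23, the output holds the first 8 bytes of the first input followed by the first 8 bytes of the second input, which is how four halfword pixels from each of two rows can be packed into one register.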
For each 4x4 luminance submacroblock, 9x9 pixels are needed to perform inter prediction. Our 9x9 pixel input is arranged as shown in Figure 4-8. Pixel addresses are contiguous in the x-direction, so more pack instructions are needed when FIRs must be performed in the y-direction.
Figure 4-8 9x9 Pixels Arrangement for 4x4 Luminance Submacroblock Interpolation
Figure 4-9 Inter Prediction of Luminance Sub-Pixel Cases (sub-pixel positions labeled a to v and 1 to 6)
The packing procedure varies with the sub-pixel case being interpolated. The sub-pixel interpolation cases are shown in Figure 4-9. The 9x9 black squares represent the 9x9 pixels needed to interpolate a 4x4 luminance submacroblock.
When computing sub-pixels a, b, c, d, e, f, g, h in Figure 4-9, no pack procedure is needed. We perform the 6-tap FIRs directly on the registers, as shown in Figure 4-10.
Figure 4-10 Perform SIMD with Register A-F directly
When computing sub-pixels i, j, k, l, m, n, o, p in Figure 4-9, the pack procedure for the SIMD operation takes 7 instructions. The pack procedure is shown in Figure 4-11.
Figure 4-11 Pack Procedure for Computing i, j, k, l, m, n, o, p
When computing sub-pixels c, q, s, u, d, r, t, v in Figure 4-9, the pack procedure for the SIMD operation takes 10 instructions in total. The pack procedure is shown in Figure 4-12.
Figure 4-12 Pack Procedure for Computing c, q, s, u, d, r, t, v
The most complex case is computing 1, 2, i, m, 3, 4, 5, 6 in Figure 4-9, where the pack procedure takes 20 instructions in total. The pack procedure is shown in Figure 4-13.
Figure 4-13 Pack Procedure for Computing 1, 2, i, m, 3, 4, 5, 6
There are 16 cases in 4x4 luminance submacroblock interpolation, as shown in Figure 4-14. We categorize the cases by the number of instructions the pack procedure needs before the SIMD operation.
The total pack instructions needed by each case of a 4x4 submacroblock interpolation are summarized in Table 4-1.
Figure 4-14 16 Cases of Luminance Interpolation
Table 4-1 Instructions Needed for Packing Procedure in 16 Cases of Luminance Interpolation
Cases          Pack Instructions Needed
G              0
a, b, c        14
d, h, n        20
e, g, p, r     26
f, j, q        36
i, k           14
After the SIMD operations, the 16 result pixels of a 4x4 luminance submacroblock are held in two 128-bit-wide registers. Depending on the case, the 16 pixels have one of two possible layouts, as shown in Figure 4-15. If the layout is the one on the left, two instructions are needed to convert it into the one on the right.
Figure 4-15 16 Pixels in 2 Registers of a 4x4 Submacroblock
After a 16x16 macroblock is done, the layout of its 16x16 pixels is as shown on the left of Figure 4-16. 32 instructions are needed to unpack the whole 16x16 macroblock.
Figure 4-16 Unpack Procedure of a 16x16 Macroblock
The result of computation optimization for each kernel is shown in Figure 4-17. We apply loop unrolling in all kernels and SIMD optimization in residue coding, luminance MC, chrominance MC, and the deblocking filter. In fact, the official decoder still leaves a lot of room for optimization; we optimize computation only to obtain a more precise computation/communication ratio. We also show the result of offloading kernels to SPEs. The computation time needed on an SPE is always shorter than on the PPE, because the SPE is designed for high-speed computation.
However, some kernels, such as motion compensation, need a lot of communication time on an SPE. Motion compensation has to fetch reference pixels, most of which are unaligned and not contiguous. Therefore, motion compensation on an SPE has high communication overhead.
Figure 4-17 Computation Optimization Results of Each Kernel