MPEG-4 Video Encoder Optimization by Intel MMX Technology
4.2 Code Acceleration
4.2.1 Motion Estimation Optimization
Figure 4.2: Code segment of hotspots of blkmatch16.
Figure 4.1 shows that most of the computation is spent in functions related to motion estimation. Hence our first target is to reduce the complexity of these functions. The major motion estimation functions are summarized in Table 4.3, together with the percentage of complexity that each function contributes to the whole encoder and to motion estimation.
Optimization of blkmatch16
The blkmatch16 function finds the best-matching MB in the previous reconstructed VOP and is applied to MBs that lie entirely inside the VOP. The search method in the original reference software is spiral full search. To reduce the complexity, we first need to find the hotspots.
The hotspots of blkmatch16 are shown in Figure 4.2.
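To make the term spiral full search concrete, the following is a minimal sketch of one way to enumerate every displacement of a square search window starting from the centre and moving outward ring by ring. It is an illustration only: the exact visiting order of the reference software is not shown in this section, and the names Disp, put and spiral_order are hypothetical.

```c
#include <stddef.h>

/* Hypothetical displacement type, introduced only for this sketch. */
typedef struct { int dx, dy; } Disp;

/* Store one displacement and return the new element count. */
static int put(Disp *out, int n, int dx, int dy)
{
    out[n].dx = dx;
    out[n].dy = dy;
    return n + 1;
}

/* Fill out[] with all displacements in [-range, range] x [-range, range],
 * centre first, then ring by ring with growing distance from the centre.
 * out must have room for (2 * range + 1) * (2 * range + 1) entries.
 * Returns the number of entries written. */
static int spiral_order(int range, Disp *out)
{
    int n = put(out, 0, 0, 0);
    for (int r = 1; r <= range; r++) {
        /* top and bottom edges of ring r, corners included */
        for (int dx = -r; dx <= r; dx++) {
            n = put(out, n, dx, -r);
            n = put(out, n, dx, r);
        }
        /* left and right edges of ring r, corners excluded */
        for (int dy = -r + 1; dy <= r - 1; dy++) {
            n = put(out, n, -r, dy);
            n = put(out, n, r, dy);
        }
    }
    return n;
}
```

Visiting small displacements first tends to find a low SAD early, which makes a breakout test against the current minimum SAD more effective for the positions visited later.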
As we can see, most of the complexity lies in calculating the SAD (sum of absolute differences) at integer pixel displacements. To reduce it, we rewrite the original loop with MMX instructions. As a first step we do not change the search algorithm (full search, as in the original code); we only modify the SAD kernel. The modified code is shown in Figure 4.3.
The major instruction we use is “psadbw.” The psadbw instruction computes the absolute differences of 8 unsigned byte integers using 64-bit operands. These 8 differences are then summed to produce an unsigned word integer result that is stored in the destination operand. Figure 4.4 shows the operation of the psadbw instruction.
The original C++ code contains a premature breakout mechanism that saves loop iterations by comparing the SAD value accumulated after each row with the current minimum SAD value. According to [13], this premature breakout mechanism decreases efficiency. In our experiments, however, removing the mechanism gave slightly lower efficiency than keeping it and unrolling the loop four times, so that each loop iteration calculates the SAD for 4 rows of the macroblock.
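The behaviour described above can be sketched in portable C. The sketch below is a scalar model, not the MMX code of Figure 4.3: psadbw_model mimics what psadbw computes on one pair of 64-bit operands, and sad16_early_exit applies the breakout test once per group of 4 rows. Both function names and the stride parameters are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of psadbw on 64-bit operands:
 * sum of |a[i] - b[i]| over 8 unsigned bytes. */
static unsigned psadbw_model(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* 16x16 SAD with the breakout test applied after every group of 4 rows.
 * Returns the full SAD, or the partial SAD (>= min_sad) as soon as the
 * candidate position can no longer beat the current minimum. */
static unsigned sad16_early_exit(const uint8_t *cur, int cur_stride,
                                 const uint8_t *ref, int ref_stride,
                                 unsigned min_sad)
{
    unsigned sad = 0;
    for (int group = 0; group < 4; group++) {
        for (int row = 0; row < 4; row++) {
            sad += psadbw_model(cur, ref);         /* first 8 pixels of the row */
            sad += psadbw_model(cur + 8, ref + 8); /* next 8 pixels of the row */
            cur += cur_stride;
            ref += ref_stride;
        }
        if (sad >= min_sad) /* same test as in Figure 4.3 */
            return sad;
    }
    return sad;
}
```

Testing once per 4 rows instead of once per row trades a few wasted row computations for fewer branches, which is one plausible reason the unrolled variant with breakout performed best in our experiments.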
We also modify the SAD kernel of the half sample search with MMX instructions. The original C++ code is shown in Figure 4.5.
As we can see, the SAD kernel of the half sample motion search differs slightly from the SAD kernel of the integer pixel search: the half sample positions used for the SAD are not contiguous in the reference VOP, since only every second byte of the zoomed reference is read. To handle this efficiently we adapt the MMX SAD kernel of the integer pixel search. The modified code is shown in Figure 4.6. The major difference from the integer pixel kernel is that we shift each word in the MMX registers left and then right by 8 bits so that only the pixel values we need remain, and then pack the pixel values of two MMX registers into one for the psadbw instruction.
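The shift-and-pack step can be checked with a scalar model. The sketch below emulates psllw 8, psrlw 8 and packuswb on 64-bit values and shows that the sequence gathers bytes 0, 2, 4, ..., 14 of a 16-byte span, which are the positions ppxlcRefZoomMB[2 * ix] read by the original loop. All function names are hypothetical and the code models the instructions rather than using them.

```c
#include <stdint.h>

/* Assemble 8 bytes into a 64-bit value, byte 0 in the lowest bits (x86 order). */
static uint64_t load64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* psllw 8 followed by psrlw 8: clears the high byte of every 16-bit lane. */
static uint64_t keep_low_bytes(uint64_t v)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t w = (uint16_t)(v >> (16 * lane));
        w = (uint16_t)(w << 8); /* psllw 8 */
        w = (uint16_t)(w >> 8); /* psrlw 8 */
        out |= (uint64_t)w << (16 * lane);
    }
    return out;
}

/* packuswb: the 4 words of lo, then the 4 words of hi, each treated as a
 * signed word and saturated to the unsigned byte range 0..255. */
static uint64_t packuswb_model(uint64_t lo, uint64_t hi)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint64_t src = lane < 4 ? lo : hi;
        uint16_t w = (uint16_t)(src >> (16 * (lane & 3)));
        uint8_t b = (w & 0x8000u) ? 0 : (w > 255 ? 255 : (uint8_t)w);
        out |= (uint64_t)b << (8 * lane);
    }
    return out;
}

/* Gather zoom[0], zoom[2], ..., zoom[14] into out[0..7]. */
static void gather_even_bytes(const uint8_t *zoom, uint8_t out[8])
{
    uint64_t a = keep_low_bytes(load64(zoom));
    uint64_t b = keep_low_bytes(load64(zoom + 8));
    uint64_t packed = packuswb_model(a, b);
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(packed >> (8 * i));
}
```

Because every word holds a value of at most 255 after the two shifts, the saturation in packuswb never changes a pixel value here.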
After modifying the original code by the above methods, the number of clocktick events of blkmatch16 is reduced to around 91M per VOP. Relative to the original code this is a speedup of 1936%, where speedup denotes the ratio of original to modified clockticks (the modified function is about 19.4 times faster). The comparison is shown in Table 4.4.
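The speedup figures in the tables of this section follow from the clocktick counts by simple arithmetic. The sketch below states the two formulas we assume are being used; the function names are hypothetical.

```c
/* Ratio of original to modified clockticks, expressed in percent. */
static double speedup_percent(double original, double modified)
{
    return 100.0 * original / modified;
}

/* Share of the original clockticks that was removed, in percent. */
static double reduction_percent(double original, double modified)
{
    return 100.0 * (original - modified) / original;
}
```

With the blkmatch16 counts, speedup_percent(1763013601.0, 91043523.0) reproduces the 1936.45% reported in Table 4.4.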
for (iy = 0; iy < 4; iy++){
    movq mm1,[edx]; // read 1st 8 pixels of reference block
    movq mm2,[edx+8]; // read next 8 pixels of reference block
    psadbw mm1, [ebx]; // calculate SAD of pairs of 1st 8 pixels
    psadbw mm2, [ebx+8]; // calculate SAD of pairs of next 8 pixels
    paddw mm6, mm1; // add to buffer for final SAD
    paddw mm7, mm2; // add to buffer for final SAD
    mov eax, dword ptr [iFrameWidthY] // Calculate SAD of next row
    movq mm1, [edx][eax];
    mov eax, dword ptr [iFrameWidthYx2]; // Calculate SAD of 3rd row
    movq mm1, [edx][eax];
    mov eax, dword ptr [iFrameWidthYx3]; // Calculate SAD of 4th row
    movq mm1, [edx][eax];
    if (mbDiff >= iMinSAD)
        goto NEXT_POSITION; // skip the current position
    ppxlcRefMB += iFrameWidthYx4;
    ppxlcTmpC += iMB_SIZEx4;
}
Figure 4.3: Revised code segment of SAD kernel of integer pixel motion search.
Figure 4.4: PSADBW instruction operation using 64-bit operands (from [10]).
for (iy = 0; iy < MB_SIZE; iy++){
for (ix = 0; ix < MB_SIZE; ix++)
mbDiff += abs (ppxlcTmpC [ix] - ppxlcRefZoomMB [2 * ix]);
if (mbDiff > iMinSAD)
goto NEXT_HALF_POSITION;
ppxlcRefZoomMB += m_iFrameWidthZoomY * 2;
ppxlcTmpC += MB_SIZE;
}
Figure 4.5: Original code segment of the SAD kernel of half pixel motion search.
Table 4.4: Execution Result of Optimized blkmatch16 Function Using MMX
Function      Clockticks/VOP (original code)   Clockticks/VOP (modified code)   Speedup
blkmatch16    1,763,013,601                    91,043,523                       1936.45%
for (iy = 0; iy < MB_SIZE; iy++) {
    __asm {
        mov eax, ppxlcTmpC;
        mov ecx, ppxlcRefZoomMB;
        movq mm1, [eax];
        movq mm2, [eax + 8];
        movq mm3, [ecx]; // read 1st 8 pixels of reference zoom block
        movq mm4, [ecx + 8]; // read next 8 pixels of reference zoom block
        movq mm5, [ecx + 16];
        movq mm6, [ecx + 24];
        psllw mm3, 8; // shift left to reserve pixels we need
        psllw mm4, 8;
        psllw mm5, 8;
        psllw mm6, 8;
        psrlw mm3, 8; // then shift right for packuswb instruction
        psrlw mm4, 8;
        psrlw mm5, 8;
        psrlw mm6, 8;
        packuswb mm3, mm4; // pack the two 4 pixels to one MMX register
        packuswb mm5, mm6;
        psadbw mm1, mm3;
        psadbw mm2, mm5;
        paddd mm1, mm2;
        movd temp, mm1;
    }
    mbDiff += temp;
    if (mbDiff > iMinSAD)
        goto NEXT_HALF_POSITION;
    ppxlcRefZoomMB += m_iFrameWidthZoomY * 2;
    ppxlcTmpC += MB_SIZE;
}
Figure 4.6: Revised code segment of the SAD kernel of half pixel motion search.
for (iy = 0; iy < MB_SIZE; iy++) {
for (ix = 0; ix < MB_SIZE; ix++) {
if (ppxlcTmpCBY [ix] != transpValue)
mbDiff += abs (ppxlcTmpC [ix] - ppxlcRefMB [ix]);
}
if (mbDiff > iMinSAD)
goto NEXT_HALF_POSITION; // skip the current position
ppxlcRefMB += m_iFrameWidthY;
ppxlcTmpC += MB_SIZE;
ppxlcTmpCBY += MB_SIZE;
}
Figure 4.7: Code segment of hotspots of blkmatch16WithShape function.
Table 4.5: Execution Result of Optimized blkmatch16WithShape Function Using MMX
Function               Clockticks/VOP (original code)   Clockticks/VOP (modified code)   Speedup
blkmatch16WithShape    651,183,211                      95,240,399                       683.73%
Optimization of blkmatch16WithShape
The blkmatch16WithShape function is the same as blkmatch16 but is applied to MBs that lie on the boundary of the VOP. Its hotspot is the same as that of blkmatch16, namely the integer pixel SAD computation for the MB. The original code is shown in Figure 4.7.
The SAD kernel adds the conditional statement “if (ppxlcTmpCBY [ix] != transpValue)” because pixels outside the VOP are excluded from the SAD computation. We modify our optimized integer pixel SAD kernel of blkmatch16 to include this condition. The modified code is shown in Figure 4.8. The major differences from the MMX integer SAD kernel of blkmatch16 are that we use the 128-bit SIMD integer instructions of SSE2 to handle 16 pixels in a single instruction, and that we use the pand instruction in place of the added conditional statement. The execution result after modifying the original code is shown in Table 4.5.
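The reason pand can stand in for the conditional is sketched below in scalar C. The sketch assumes that the binary alpha values act as byte masks, that is, 0xFF for pixels inside the VOP and 0x00 (the transparent value) for pixels outside; with any other encoding the two functions would not agree. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Behaviour of the original loop: skip pixels marked transparent. */
static unsigned sad_row_conditional(const uint8_t *cur, const uint8_t *ref,
                                    const uint8_t *shape, int n, uint8_t transp)
{
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        if (shape[i] != transp)
            sum += (unsigned)abs((int)cur[i] - (int)ref[i]);
    return sum;
}

/* Branch-free variant: AND both operands with the shape mask, as pand does
 * byte by byte. A transparent pixel becomes 0 in both operands and therefore
 * contributes |0 - 0| = 0. Assumes shape[i] is either 0x00 or 0xFF. */
static unsigned sad_row_masked(const uint8_t *cur, const uint8_t *ref,
                               const uint8_t *shape, int n)
{
    unsigned sum = 0;
    for (int i = 0; i < n; i++) {
        int a = cur[i] & shape[i];
        int b = ref[i] & shape[i];
        sum += (unsigned)abs(a - b);
    }
    return sum;
}
```

Removing the per-pixel branch is what allows all 16 pixels of a row to be processed by a single psadbw on 128-bit operands.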
for (iy = 0; iy < MB_SIZE; iy++){
    __asm {
        mov eax, ppxlcTmpCBY; // move the memory address of current BAB MB to eax
        mov ecx, ppxlcTmpC; // move the memory address of current MB to ecx
        mov edx, ppxlcRefMB; // move the memory address of reference MB to edx
        movdqu xmm0, [eax]; // read 16 pixels of current BAB MB
        movdqu xmm1, [ecx]; // read 16 pixels of current MB
        movdqu xmm2, [edx]; // read 16 pixels of reference MB
        pand xmm1, xmm0; // Using pand to substitute the original "if statement"
        pand xmm2, xmm0;
        psadbw xmm1, xmm2;
        movdqa xmm3, xmm1;
        psrldq xmm3,8;
        paddd xmm1, xmm3;
        movd temp, xmm1;
        emms;
    }
    mbDiff += temp;
    if (mbDiff > iMinSAD)
        goto NEXT_POSITION; // skip the current position
    ppxlcRefMB += m_iFrameWidthY;
    ppxlcTmpC += MB_SIZE;
    ppxlcTmpCBY += MB_SIZE;
}
Figure 4.8: Revised code segment of integer pixel SAD kernel of blkmatch16WithShape function.
Table 4.6: Execution Result of Optimization of Motion Estimation Using MMX
Encoder Block        Clockticks/VOP (original code)   Clockticks/VOP (modified code)   Speedup
Motion Estimation    2,553,489,414                    203,983,469                      1251.81%
Optimization of Other Functions
The hotspots of the other motion estimation functions are mostly the integer pixel SAD computation of a 16 × 16 or 8 × 8 block, and the optimization method is almost the same: we only use MMX or SSE2 instructions to modify the SAD kernel and keep the full search. We therefore do not go through the detailed optimization of each function. Instead, we introduce each function and summarize the experimental results after all major motion estimation functions have been optimized.
The blkmatchForShape function finds the best-matching binary alpha plane MB. After the search for 16 × 16 motion vectors, an additional search is made for 8 × 8 vectors. Again, the search is made at integer pixel displacements and on the Y component. Like the 16 × 16 motion estimation, the 8 × 8 integer motion estimation employs block matching. Using the 16 × 16 motion vector as the search center, the search range of the 8 × 8 motion estimation is ±2 pixels. The blockmatch8 and blockmatch8WithShape functions implement the 8 × 8 integer motion estimation for MBs entirely inside the VOP and for MBs on the VOP boundary, respectively. Table 4.6 summarizes the result of optimizing motion estimation using MMX. The clockticks per VOP are reduced from 2,553,489,414 to 203,983,469, which is a 92.01% reduction.
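The 8 × 8 search described above, a ±2 pixel window centred on the 16 × 16 motion vector, can be sketched as follows. This is a plain scalar illustration, not the code of blockmatch8: the MotionVector type, the function names, the raster visiting order and the convention that both block pointers share one stride are assumptions, and the caller is assumed to guarantee that the whole window lies inside the reference plane.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical motion vector type, introduced only for this sketch. */
typedef struct { int x, y; } MotionVector;

/* SAD of one 8x8 block; cur and ref point at the top-left pixels. */
static unsigned sad8x8(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sad += (unsigned)abs((int)cur[y * stride + x] - (int)ref[y * stride + x]);
    return sad;
}

/* Test the 5x5 integer displacements within +/-2 pixels of mv16.
 * cur and ref point at the co-located 8x8 block in the current and
 * reference planes. Returns the best vector and stores its SAD. */
static MotionVector search8x8(const uint8_t *cur, const uint8_t *ref, int stride,
                              MotionVector mv16, unsigned *best_sad)
{
    MotionVector best = mv16;
    unsigned min_sad = UINT_MAX;
    for (int dy = -2; dy <= 2; dy++) {
        for (int dx = -2; dx <= 2; dx++) {
            int x = mv16.x + dx;
            int y = mv16.y + dy;
            unsigned sad = sad8x8(cur, ref + y * stride + x, stride);
            if (sad < min_sad) {
                min_sad = sad;
                best.x = x;
                best.y = y;
            }
        }
    }
    *best_sad = min_sad;
    return best;
}
```

Only 25 candidate positions are tested per 8 × 8 block, so the cost of this stage is dominated by the SAD kernel itself, which is why replacing that kernel with MMX or SSE2 code is sufficient here.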