Chapter 5 Bandwidth Reduction Techniques in Computation Cores
5.2. CFMMC
5.2.2. Combined Frame Memory Motion Compensation
A. Statistics of Perfect-Matched MB
The percentage of perfect-matched MBs within a frame determines how much the bandwidth and energy consumption of the frame memory can be reduced in motion compensation. A perfect-matched MB is one that has a zero-valued MV and no residual. Reconstructing such an MB does not require summing the motion-compensated (predicted) MB and the residual MB. For instance, a NOT-CODED MB in MPEG-4 [52] is an MB with a zero-valued MV and no residual; hence a NOT-CODED MB is a perfect-matched MB. For a perfect-matched MB, the MB data read from the reference frame memory is the same as the MB data written to the reconstructed current frame memory in PPFM. Since the perfect-matched MB would be read and written with the same content at the same location, there is an opportunity to eliminate the redundant memory access for a perfect-matched MB.
To eliminate the repeated accesses for a perfect-matched MB, its content must already be in the reconstructed frame memory before motion compensation is performed. The only way to meet this requirement without extra memory access is to merge the reconstructed frame memory with the reference frame. Therefore, the merged-frame approach is necessary for eliminating the memory accesses of a perfect-matched MB. Since the memory access reduction depends on the percentage of perfect-matched MBs within a frame, the reduction of bandwidth requirement and energy consumption is also highly dependent on this percentage.
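The definition of a perfect-matched MB can be expressed as a simple predicate. The following is a minimal sketch rather than the implementation used in this work: the `Macroblock` record and its field names are hypothetical, and the residual is modeled as a flag that says whether any residual data is coded for the MB.

```python
from dataclasses import dataclass


@dataclass
class Macroblock:
    """Hypothetical decoded-MB record; the field names are illustrative only."""
    mv_x: int           # horizontal motion vector component
    mv_y: int           # vertical motion vector component
    has_residual: bool  # True if any residual data is coded for this MB


def is_perfect_matched(mb: Macroblock) -> bool:
    """A perfect-matched MB has a zero-valued MV and no residual."""
    return mb.mv_x == 0 and mb.mv_y == 0 and not mb.has_residual


def perfect_matched_percentage(mbs) -> float:
    """Percentage of perfect-matched MBs in one frame (0.0 for an empty frame)."""
    mbs = list(mbs)
    if not mbs:
        return 0.0
    return 100.0 * sum(is_perfect_matched(mb) for mb in mbs) / len(mbs)
```

Averaging `perfect_matched_percentage` over the frames of a sequence would give a per-sequence figure of the kind reported for the test sequences.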
Table 7 lists the average percentage of perfect-matched MBs within one frame for both QCIF and CIF sized sequences. The statistics were gathered by running MPEG-4 VM18 [53] with the quantization parameter (QP) set to 16. The letter in parentheses next to each sequence is the class it belongs to, as classified in [53]. Classes "A" to "C" represent different levels of spatial detail and amount of movement, where class "A" is the lowest class and class "C" is the highest class. The statistics show that lower class test sequences, such as akiyo, container, mother_daughter, news, and hall, which exhibit a large portion of static background, have more than 70% perfect-matched MBs on average. Other test sequences with more motion, such as foreman, stefan, coastguard, and mobile, have less than 30% perfect-matched MBs.
Table 7 Percentage of perfect-matched MBs when QP=16
Test sequences QCIF (%) CIF (%)
container (A) 91.74 88.91
mother_daughter (A) 81.42 77.65
hall (A) 86.21 83.86
akiyo (A) 91.32 89.09
coastguard (B) 10.35 2.69
foreman (B) 24.49 23.38
news (B) 82.53 83.01
stefan (C) 15.71 20.90
mobile 10.93 3.39
The QP used in Table 7 was 16, which is at the low end of the typical QP range of 16~24 adopted in practical MPEG-4 applications. Fig. 20 illustrates the impact of different QP values on the percentage of perfect-matched MBs. For most sequences with a high percentage, the highest percentage of NOT-CODED MBs appeared when QP=16. For most sequences with a low percentage, however, the percentage of NOT-CODED MBs increased significantly until QP=24, and the increase became insignificant beyond QP=24. It is suspected that when the QP is larger than 24, the quality of the reconstructed frame becomes so poor that the residual grows increasingly large, resulting in a decrease in the percentage of NOT-CODED MBs. Nevertheless, for the sequences with a low percentage, since the percentage of NOT-CODED MBs increases when QP>16, practical video applications should yield a higher percentage of NOT-CODED MBs than those listed in Table 7.
[Figure: Not-coded MB% (y-axis, 0% to 100%) versus QPs (x-axis: 1, 4, 8, 16, 24, 28, 31) for Container, Mother_daughter, Hall, Akiyo, Coastguard, Foreman, News, Stefan, and Mobile]
Fig. 20. Impact of different QP values on percentage of NOT-CODED MB in MPEG-4
B. Combined Frame Memory
The CFM architecture adopts the merged-frame approach with an additional look-up table. The reconstructed current frame data and the reference frame data are mapped to one single frame memory with the size of one single frame. Unlike the merged-frame approach in [48][49], we introduced the additional look-up table to indicate whether the predicted pixel data are in the frame memory or in the local buffers. There are three major parts in the proposed CFM architecture: the main frame memory (MFM), the vector range strip buffer (VRSB), and the dirty table (DT), as illustrated in Fig. 21 for QCIF size with the vector range of [-16:+15]. The function of each component is explained as follows.
• Main frame memory (MFM): The MFM stores the reference frame data and the reconstructed frame data together. The reconstructed current frame data are stored at the upper part of the MFM whereas the reference frame data are stored at the lower part of the MFM. The size of MFM is as large as one single frame, i.e. 176x144x1.5 bytes for QCIF.
• Vector range strip buffer (VRSB): The VRSB is a rectangular strip of memory which works as an exchange buffer for the reference frame data. If a reference MB in the MFM is about to be overwritten by a reconstructed current MB, this reference MB is copied into the VRSB as a backup in case a subsequent MB needs it. This prevents the reference frame data from being destroyed by the reconstructed current frame data. The size of the VRSB is determined by the height of the vector range and the width of a frame, i.e. 16x(176+16)x1.5 bytes for QCIF with the vector range of [-16:+15].
• Dirty table (DT): The DT is the look-up table that records which pixels in the MFM have been updated. If an MB in the MFM is updated by the reconstructed current frame data, the corresponding dirty bits of this MB are set. This indicates that the reference pixels of that MB are stored in the VRSB for backup, as mentioned earlier. If a subsequent MB requires the reference pixels of this MB, they are read from the VRSB instead of the MFM. The size of the DT varies according to the size of the VRSB, i.e. 16x(176+16) bits for QCIF with the vector range of [-16:+15].
[Figure: block diagram of the Main Frame Memory (MFM) holding the reconstructed current frame and the reference frame, the Vector Range Strip Buffer (VRSB), and the Dirty Table (DT) with its dirty index; size labels: 176x144x1.5 bytes, 256x12x1.5 bytes, 12 bits]
Fig. 21. Memory components for QCIF with vector range of [-16:+15] in the CFM
[Flow chart: start of processing current MB; decision "Perfect-matched MB?"; if yes, update the index in DT; if no, check DT for the corresponding DT status, read predicted MB from MFM and/or VRSB, reconstruct the MB, back up current reference MB, write reconstructed MB into MFM, and update the corresponding bits and the index in DT; end of processing current MB]
Fig. 22. Flow chart of motion compensation process in the CFMMC
Fig. 22 illustrates the flow chart of the CFMMC. The process is very simple for a perfect-matched MB, but is more complex for a non-perfect-matched MB. When processing a perfect-matched MB, no memory access is performed, since the reference MB and the reconstructed MB are the same and reside at the same location within the MFM. The only operation carried out is the updating of the index in the DT. For a non-perfect-matched MB, the DT is checked first to determine where the predicted MB pixels are stored; each pixel in the predicted MB is then read from either the MFM or the VRSB according to the corresponding dirty bit. After the predicted MB is read out, it is summed with the residual to produce the reconstructed MB. The current reference MB in the MFM must then be copied into the VRSB before the reconstructed MB is written back to the same location. Finally, the reconstructed MB is written back to the MFM, and the DT and its index are updated. Fig. 23 illustrates the motion compensation process for two consecutive non-perfect-matched MBs.
[Figure: step-by-step diagram of the Main Frame Memory (MFM) with the reference frame, the Dirty Table (DT), and the Vector Range Strip Buffer (VRSB); labels: "STEP 1", "Which dirty bits to access is computed from MV", "Add with residual", "Reconstructed block"]
Fig. 23. The processing of non-perfect-matched MBs
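The flow of Fig. 22 can be sketched as a small functional model. This is an illustration under simplifying assumptions, not the proposed hardware: the DT is reduced to one entry per overwritten MB, the VRSB is an unbounded dictionary with no strip addressing or eviction, the dirty index is not modeled, only one pixel plane is handled, and reference coordinates are clamped to the frame edge. The function name and argument layout are hypothetical.

```python
MB = 16  # macroblock width and height in pixels


def cfm_motion_compensate(mfm, mbs):
    """Reconstruct one P-frame in place inside `mfm` (a list of pixel rows).

    `mbs` is a raster-scan list of ((mb_row, mb_col), (mv_y, mv_x), residual),
    where residual is a 16x16 list of rows, or None when there is no residual.
    Returns the number of perfect-matched MBs, which need no data access.
    """
    height, width = len(mfm), len(mfm[0])
    dirty = set()  # DT, simplified to one entry per overwritten MB
    vrsb = {}      # backed-up reference MBs keyed by MB position (no eviction)
    skipped = 0
    for (r, c), (mv_y, mv_x), residual in mbs:
        if mv_y == 0 and mv_x == 0 and residual is None:
            skipped += 1  # perfect-matched: the data is already in place
            continue
        # Read the predicted MB pixel by pixel from the MFM or the VRSB.
        pred = []
        for y in range(r * MB, (r + 1) * MB):
            row = []
            for x in range(c * MB, (c + 1) * MB):
                ry = min(max(y + mv_y, 0), height - 1)  # assumed edge clamping
                rx = min(max(x + mv_x, 0), width - 1)
                src = (ry // MB, rx // MB)
                if src in dirty:
                    row.append(vrsb[src][ry % MB][rx % MB])
                else:
                    row.append(mfm[ry][rx])
            pred.append(row)
        # Sum the prediction with the residual, if there is one.
        if residual is None:
            rec = pred
        else:
            rec = [[p + d for p, d in zip(prow, drow)]
                   for prow, drow in zip(pred, residual)]
        # Back up the collocated reference MB, then overwrite it in the MFM.
        vrsb[(r, c)] = [mfm[y][c * MB:(c + 1) * MB]
                        for y in range(r * MB, (r + 1) * MB)]
        dirty.add((r, c))
        for i in range(MB):
            mfm[r * MB + i][c * MB:(c + 1) * MB] = rec[i]
    return skipped
```

In this model a later MB whose motion vector points into an already overwritten MB obtains the original reference pixels from the backup, so the result equals what a two-buffer scheme would produce from the untouched reference frame.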
C. Analytic Estimation of Memory Size, Energy, and Latency
The memory requirement of the CFMMC can be determined through the life time analysis of the collocated MB in the reconstructed current frame and the reference frame, as illustrated in Fig. 24. For each MB, the life times of the reconstructed current frame data and the reference frame data overlap for part of the time spent processing one single frame. This overlapped life time of a collocated MB is referred to as the MB overlapped life time (MBOLT) hereafter. The length of the MBOLT is determined by the height and width of the vector range. For instance, in the worst case the reconstruction of the current MB requires reference pixels from the upper-left corner of the vector range; thus the reference MB holding those pixels has to remain in the VRSB until the reconstruction of the current MB is complete. The MBOLT is proportional to the raster-scan MB distance between the reference MB and the current MB. The larger the vector range is, the longer the MBOLT is.
[Figure: timeline over MB rows 0 to 8 showing the life times of reference MB n and reconstructed current MB n, their overlap (MBOLT n), and the processing of MBs n to n+11 during that overlap; annotation: VRSB must store 12 MBs]
Fig. 24. Life time analysis of MBs
The maximum number of MBs having overlapped MBOLT determines the size of the VRSB and the DT. In other words, the VRSB must be large enough to store all the reference MBs that are currently alive. By the term alive we mean that a reference MB may still be needed by the motion compensation of subsequent MBs. For example, for QCIF with the vector range of [-16:+15], the maximum number of MBs having overlapped MBOLT is 12. This means there are at most 12 reference MBs alive simultaneously, hence the VRSB size is 12 MBs and the DT size is 12 bits. Compared with the other merged-frame approaches [48][49], the VRSB is one MB smaller than their LIFO buffer. We generalized the formulation of the memory size requirement for the MFM, VRSB, and DT and listed it in Table 8. Note that this formulation can be applied to any given frame size and vector range. The overall memory size was also compared with that of the most commonly used PPFM. For the aforementioned QCIF case, the memory size of the CFMMC architecture is 56.6% of that of the PPFM architecture.
MFM    height_frame x width_frame x 1.5    38,016
VRSB   floor(height_VR/height_MB) x height_MB x (width_frame + (floor(width_VR/width_MB) x width_MB)) x 1.5    4,608
DT     floor(height_VR/height_MB) x height_MB x (width_frame + (floor(width_VR/width_MB) x ...
...    (height_frame x width_frame x 1.5) x 2    76,032
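The size formulas can be checked numerically for the QCIF example. The sketch below is an illustration with several assumptions: the function and key names are hypothetical, the vector range arguments are chosen as 16 so that floor(height_VR/height_MB) equals 1 as implied by the 16x(176+16) figure given for the VRSB, the DT is taken as one bit per strip pixel following the 16x(176+16) bits stated for the DT, and the DT is converted to bytes before being added to the total.

```python
def cfm_memory_sizes(frame_w, frame_h, vr_w, vr_h, mb_w=16, mb_h=16):
    """Memory sizes for the CFM components and for a two-frame PPFM.

    The factor 3/2 accounts for the 1.5 bytes per pixel used in the text.
    """
    strip_h = (vr_h // mb_h) * mb_h            # floor(height_VR/height_MB) x height_MB
    strip_w = frame_w + (vr_w // mb_w) * mb_w  # width_frame + floor(width_VR/width_MB) x width_MB
    mfm_bytes = frame_w * frame_h * 3 // 2
    vrsb_bytes = strip_h * strip_w * 3 // 2
    dt_bits = strip_h * strip_w                # assumption: one dirty bit per strip pixel
    mb_bytes = mb_w * mb_h * 3 // 2
    return {
        "mfm_bytes": mfm_bytes,
        "vrsb_bytes": vrsb_bytes,
        "vrsb_mbs": vrsb_bytes // mb_bytes,
        "dt_bits": dt_bits,
        "cfm_total_bytes": mfm_bytes + vrsb_bytes + dt_bits // 8,
        "ppfm_bytes": 2 * mfm_bytes,
    }


sizes = cfm_memory_sizes(frame_w=176, frame_h=144, vr_w=16, vr_h=16)
ratio = 100.0 * sizes["cfm_total_bytes"] / sizes["ppfm_bytes"]
print(sizes, f"{ratio:.1f}% of PPFM")
```

Under these assumptions the MFM, VRSB, and PPFM sizes come out as 38,016, 4,608, and 76,032 bytes, the VRSB holds 12 MBs, and the total is about 56.6% of the PPFM size.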
Table 9 lists the analytic model of average bandwidth requirement, energy consumption, and latency due to memory accesses. The model is evaluated for processing one P-frame. In Table 9, DMB represents the amount of data read or written for one macroblock. The total bandwidth requirement accounts only for accesses to the MFM, since it is common to implement the MFM using external memories. We model the energy consumption of accessing one MB in the MFM and the VRSB as EMFM and EVRSB, respectively, assuming that a memory read and a memory write consume the same energy. Based on this assumption, the average memory energy consumption of processing a frame is listed in Table 9, where M represents the number of MBs in a frame, P0 represents the percentage of perfect-matched MBs, and k represents the ratio of EMFM to EVRSB. The energy consumed in the MFM includes the energy of reading predicted MBs from the MFM, reading the reference MBs for backup, and writing the reconstructed MBs into the MFM. Since the accesses to the MFM only occurs
Table 9 Memory access energy consumption and access latency of processing one frame
Memory           Access Bandwidth Requirement   Access Energy Consumption              Access Latency
MFM              M x 3 x (1-P0) x DMB           M x 3 x (1-P0) x EMFM                  M x (1-P0) x 3 x CMB
VRSB             M x (1-P0) x DMB               M x (1-P0) x k^-1 x EMFM               M x (1-P0) x CMB
DT               M                              M x k^-1 x EMFM x 0.125 (neglected)    M x 0.125 x CMB (neglected)
Combined Total   M x 3 x (1-P0) x DMB           M x (3 + k^-1) x (1-P0) x EMFM         M x (1-P0) x 4 x CMB
Ping-pong Total  M x 2 x DMB                    M x 2 x EMFM                           M x 2 x CMB
Table 10 Average memory access energy consumptions and latencies for various QCIF test sequences with k=4
Test sequences (k=4)   Average Bandwidth Requirement   Average Energy Consumptions   Average Access Latency
and write it once. The energy consumption of the VRSB is mainly due to the backup of reference MBs, which writes the reference MBs of non-perfect-matched MBs into the VRSB. Although some predicted pixels may be stored in the VRSB, the worst case for energy consumption occurs when all the predicted pixels are read from the MFM. For this reason, the energy of reading predicted pixels is attributed to the energy consumption of the MFM.
The memory access latencies of processing one frame are also listed in Table 9. The latencies are modeled on the assumption that reads and writes to either the MFM or the VRSB all take the same time; hence the memory access latency of accessing one MB is denoted as CMB. A typical scenario in which this assumption holds is when SRAM is adopted for both the MFM and the VRSB. In the CFMMC, extra memory access latency is introduced for a non-perfect-matched MB, whereas the memory access latency for a perfect-matched MB is eliminated. For each non-perfect-matched MB, the predicted MB is first read from the MFM or the VRSB; the content of the current MB location in the MFM is then read and written into the VRSB for reference MB backup; finally, the reconstructed current MB is written back to the MFM. As a result, a memory access latency of four MBs is needed for each non-perfect-matched MB. However, overlapping the latency of reading the reference MB from the MFM with that of writing it into the VRSB may reduce the total latency to Mx(1-P0)x3xCMB. According to the formulas in Table 9, the memory access latency in the CFMMC can be less than that of the PPFMMC when P0 is larger than 50%.
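The break-even points implied by the totals in Table 9 can be worked out by requiring each combined total to be smaller than the corresponding ping-pong total. The derivation below is a sketch based only on those formulas; it treats P0 as a fraction between 0 and 1, and M, DMB, EMFM, and CMB cancel on both sides.

```latex
\begin{align*}
\text{Bandwidth:}\quad & 3M(1-P_0)\,D_{MB} < 2M\,D_{MB}
  &&\Longleftrightarrow\quad P_0 > \tfrac{1}{3} \\
\text{Energy:}\quad & M(3+k^{-1})(1-P_0)\,E_{MFM} < 2M\,E_{MFM}
  &&\Longleftrightarrow\quad P_0 > \frac{k+1}{3k+1} \\
\text{Latency:}\quad & 4M(1-P_0)\,C_{MB} < 2M\,C_{MB}
  &&\Longleftrightarrow\quad P_0 > \tfrac{1}{2} \\
\text{Latency (overlapped):}\quad & 3M(1-P_0)\,C_{MB} < 2M\,C_{MB}
  &&\Longleftrightarrow\quad P_0 > \tfrac{1}{3}
\end{align*}
```

For k = 4 the energy threshold is 5/13, or about 38.5%, which lies between the bandwidth threshold and the non-overlapped latency threshold.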
The reduction of energy consumption in the CFMMC depends on the adopted memory type and the contents of video sequences. For instance, if on-chip SRAM [54] is used for both the MFM and the VRSB, k would be about 4. Note that this SRAM case may represent the worst-case reduction of energy consumption. If external memory is adopted, such as Mobile SRAM [55], k might be even larger, and the energy reduction should also be larger.
The reductions of bandwidth requirement, memory energy consumption, and access latency for different test sequences when k = 4 are listed in Table 10. The CFMMC may reduce the memory access bandwidth by 72.1% ~ 87.0% compared to that of the PPFMMC for the QCIF test sequences container, akiyo, news, hall, and mother_daughter. However, for test sequences with small P0, such as foreman, stefan, coastguard, and mobile, the estimated bandwidth requirement may increase by 13.6% ~ 34.5% compared to that of the PPFMMC. This analytic evaluation disregards the impact of memory banking because the memory organization is beyond the scope of this work.
Table 10 also lists the estimated memory access energy and latency for different test sequences. For test sequences with larger P0 (>70%), the memory access energy consumption and latency in a QCIF frame can be reduced by 71.6% ~ 86.6% and 62.8% ~ 83.5%, respectively, compared with those of the PPFMMC. For the other test sequences with smaller P0, such as foreman, stefan, coastguard, and mobile, the access energy consumption and latency are increased by 22.7% ~ 45.7% and 41.0% ~ 79.3%, respectively. This extra latency can, however, be hidden by overlapping it with the computation time of motion compensation.
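The analytic model can be evaluated for the QCIF percentages of Table 7. The following sketch uses only the total formulas of Table 9 with k = 4; the function and dictionary names are hypothetical, and the printed ratios are a model evaluation that is not guaranteed to match the figures quoted from Table 10 digit for digit.

```python
def cfmmc_vs_pingpong(p0, k=4.0, overlapped=False):
    """Ratios of the CFMMC totals to the ping-pong totals (Table 9 formulas).

    p0 is the fraction (0..1) of perfect-matched MBs; a ratio below 1.0
    means the CFMMC needs less than the ping-pong scheme.
    """
    non_perfect = 1.0 - p0
    latency_units = 3.0 if overlapped else 4.0  # MB accesses per non-perfect-matched MB
    return {
        "bandwidth": 3.0 * non_perfect / 2.0,
        "energy": (3.0 + 1.0 / k) * non_perfect / 2.0,
        "latency": latency_units * non_perfect / 2.0,
    }


# QCIF percentages of perfect-matched MBs, copied from Table 7 (QP=16).
QCIF_P0 = {
    "container": 91.74, "mother_daughter": 81.42, "hall": 86.21,
    "akiyo": 91.32, "coastguard": 10.35, "foreman": 24.49,
    "news": 82.53, "stefan": 15.71, "mobile": 10.93,
}

for name, percent in QCIF_P0.items():
    ratios = cfmmc_vs_pingpong(percent / 100.0)
    # Positive values are increases relative to ping-pong, negative are reductions.
    changes = {key: f"{100.0 * (value - 1.0):+.1f}%" for key, value in ratios.items()}
    print(f"{name:16s} {changes}")
```

Passing `overlapped=True` models the case where the backup read from the MFM and the backup write into the VRSB are overlapped, which lowers the latency ratio for every sequence.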