Implementation Result - Bandwidth Reduction Techniques in Computation Cores

Chapter 5 Bandwidth Reduction Techniques in Computation Cores

5.2. CFMMC

5.2.4. Implementation Result

A. Architecture Cost Comparison

The prototypes of the CFMMC and the PPFMMC architectures are both synthesized from Verilog RTL designs using UMC 0.18 μm 1P6M CMOS technology [56]. Both designs are synthesized with the clock constrained to 50 MHz, which is more than enough for real-time decoding. The logic of the CFMMC prototype (excluding memory) has a gate count of 40,065. The MFM and the VRSB have an equivalent gate count of 280,749, which occupies 87.5% of the total cell area. For the PPFMMC prototype, the logic part has a gate count of 28,039 and the ping-pong frame memory has an equivalent gate count of 486,643. The extra logic in the CFMMC therefore increases the logic gate count by roughly 43% relative to the logic part of the PPFMMC. However, the total equivalent gate count, which takes the memory area into account, shows that the cell area of the CFMMC is actually 37.7% smaller than that of the PPFMMC. In other words, the total cell area of the CFMMC is only 62.3% of the PPFMMC's total cell area. The distribution of the gate counts across the modules, excluding the memories, is compared in Fig. 28. The dirty table, the pblk offset generator, the inblk offset generator, and the extra logic in the memory accessor are responsible for the area increase. Although the relative increase in the CFMMC's logic area appears large, the total cell area is significantly smaller than that of the PPFMMC.
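The area ratios in this comparison follow directly from the quoted gate counts. The short Python sketch below recomputes them as a sanity check; the variable names are illustrative and do not come from the original design.

```python
# Equivalent gate counts quoted in the text.
CFMMC_LOGIC = 40_065
CFMMC_MEMORY = 280_749   # MFM + VRSB
PPFMMC_LOGIC = 28_039
PPFMMC_MEMORY = 486_643  # ping-pong frame memory


def percent(numerator, denominator):
    """Return numerator / denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


cfmmc_total = CFMMC_LOGIC + CFMMC_MEMORY
ppfmmc_total = PPFMMC_LOGIC + PPFMMC_MEMORY

# Share of the CFMMC total taken by its memories.
memory_share = percent(CFMMC_MEMORY, cfmmc_total)
# Growth of the logic part relative to the PPFMMC logic.
logic_increase = percent(CFMMC_LOGIC - PPFMMC_LOGIC, PPFMMC_LOGIC)
# CFMMC total as a fraction of the PPFMMC total, and the resulting saving.
total_ratio = percent(cfmmc_total, ppfmmc_total)
total_reduction = 100.0 - total_ratio

print(f"CFMMC memory share of total: {memory_share:.1f}%")
print(f"Logic gate count increase:   {logic_increase:.1f}%")
print(f"CFMMC total vs. PPFMMC:      {total_ratio:.1f}%")
print(f"Total cell area reduction:   {total_reduction:.1f}%")
```

With the quoted figures, the CFMMC total comes out at about 62% of the PPFMMC total, which matches the 37.7% reduction stated above.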

Fig. 28. Gate count distribution and comparison of the logic part in the PPFMMC and the CFMMC. [Bar chart; gate-count axis: 0 to 40,000; bars: CFMMC, PPFMMC; modules: MC_CONTROL, MV_PROCESSOR, INBLKOFFSET_GEN, PBLK_OFFSET_GEN, DIRTY_TABLE, MEM_ACCESSOR, FILTER_REC]

B. Architecture Energy Consumption Comparison

The gate-level power is reported by Power Compiler [57]. The signal switching activities are gathered by running both the CFMMC and the PPFMMC at 50 MHz. Such a high clock rate is used to raise the magnitude of the reported power, which corresponds to the energy of processing 109 CIF frames in one second. This makes the energy consumption of the CFMMC and the PPFMMC easier to compare.
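Because the reported power corresponds to one second of operation covering 109 CIF frames, it can be converted into a per-frame energy and a relative reduction. The sketch below illustrates that conversion. The function names and the power values in the usage lines are hypothetical placeholders, not measurements from this work.

```python
FRAMES_PER_SECOND = 109  # CIF frames processed in one second at 50 MHz (from the text)


def energy_per_frame_mj(power_mw, frames_per_second=FRAMES_PER_SECOND):
    """Energy per frame in mJ: power (mW) over one second, split across the frames."""
    return power_mw * 1.0 / frames_per_second


def energy_reduction_percent(ppfmmc_power_mw, cfmmc_power_mw):
    """Reduction of the CFMMC relative to the PPFMMC; negative means an increase."""
    return 100.0 * (ppfmmc_power_mw - cfmmc_power_mw) / ppfmmc_power_mw


# Hypothetical power figures, for illustration only.
example_ppfmmc_mw = 20.0
example_cfmmc_mw = 17.0
print(energy_per_frame_mj(example_cfmmc_mw))
print(energy_reduction_percent(example_ppfmmc_mw, example_cfmmc_mw))
```

Since both architectures are measured over the same one-second window, the ratio of the reported powers equals the ratio of the energies, which is why the comparison can be made on power directly.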

Fig. 29 plots the energy consumption comparison between the CFMMC and the PPFMMC running the manually created test patterns. Two lines relate to the CFMMC architecture: one for the memories alone and one for the overall CFMMC architecture. The overall CFMMC line has a smaller slope than the CFMMC memories line because the energy consumed by the logic offsets part of the energy reduction. The CFMMC memories line in turn has a slightly smaller slope than the theoretical lines with k=4 and k=2. A possible explanation is that the logic and the memories consume energy even when there is no memory access, which compromises an energy model derived from memory access energy alone.

Fig. 29. Plot of the energy reduction percentages of the CFMMC at different P0. [Horizontal axis: P0, 10% to 100%; vertical axis: percentage of energy reduction compared to the ping-pong architecture, -80% to 100%; lines: k=4, k=2, CFMMC, CFMMC mem]

The detailed distributions of the power consumption in the CFMMC when P0=20% and P0=80% are illustrated in Fig. 30; the power distribution of the PPFMMC is also illustrated for comparison. The memory accessor accounts for the second-largest share of power consumption among the logic in the CFMMC. This part of the energy consumption was not included in the energy consumption model used to evaluate the percentage of reduction.

The energy reduction percentages of the real test patterns are listed in Table 12, together with the percentage of NOT-CODED MBs found in the first 15 frames of each test sequence. For test sequences in which more than 70% of the MBs are NOT-CODED MBs, the energy consumption is reduced by 11% to 18%. For sequences with far fewer NOT-CODED MBs, the energy consumption is increased by 18% to 32%. Fig. 31 illustrates the detailed distribution of the power consumption evaluated for the test sequences mobile and akiyo, which are the sequences with the least and the most energy consumption reduction, respectively. The results gathered from the real test patterns verify the conclusion made in the previous section, which states that the CFMMC is more suitable for applications with a more static background and less motion.

Fig. 30. Power consumption distribution and comparison of the PPFMMC and the CFMMC. [Axis range: 0 to 25]

Table 12 Energy reduction percentage of the real test patterns

Test sequence          P0 of the first 15 frames (%)    Energy reduction (%)
mother_daughter (A)    92.83                            15.80
hall (A)               95.96                            17.19
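The P0 figure used for the real test patterns is the share of NOT-CODED MBs over the first 15 frames. A minimal sketch of that measurement follows, assuming each frame is available as a list of booleans (True for a NOT-CODED MB). This data layout, the function names, and the use of 70% as a decision threshold are illustrative assumptions based on the ranges reported above, not part of the original implementation.

```python
def p0_percent(frames, num_frames=15):
    """Percentage of NOT-CODED MBs over the first num_frames frames.

    frames: list of frames, each a list of booleans (True = NOT-CODED MB).
    """
    selected = frames[:num_frames]
    total_mbs = sum(len(frame) for frame in selected)
    if total_mbs == 0:
        raise ValueError("no macroblocks to count")
    not_coded = sum(1 for frame in selected for mb in frame if mb)
    return 100.0 * not_coded / total_mbs


def likely_saves_energy(p0, threshold=70.0):
    """Heuristic only: the text reports energy savings for sequences above 70%."""
    return p0 > threshold


# Toy input: 20 frames of 4 MBs each, 3 of which are NOT-CODED.
toy_frames = [[True, True, True, False] for _ in range(20)]
toy_p0 = p0_percent(toy_frames)
print(toy_p0, likely_saves_energy(toy_p0))
```

The threshold check is deliberately coarse. The source reports savings above 70% and increases for sequences with far fewer NOT-CODED MBs, but it does not state an exact crossover point for the real test patterns.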

The hardware implementation of the CFMMC proves the feasibility of the architecture. The comparison with the most commonly used PPFMMC shows both advantages and limitations. From the cost perspective, the CFMMC has been shown to reduce the silicon area compared with the PPFMMC. From the latency perspective, the CFMMC prototype can achieve a comparable throughput when P0 is more than 33.4%. The limitation lies in the worst-case scenario, in which the CFMMC would need a clock that is 1.5 times faster than that of the PPFMMC. Nevertheless, this issue can be alleviated with dynamic frequency scaling. In terms of energy consumption, the CFMMC prototype can reduce energy consumption when P0 is high enough. As with latency, the energy consumption increases if P0 is not high enough. This issue originates from the data-dependent nature of the architecture and cannot be solved. Given this limitation, the suitable applications of the CFMMC hardware are limited to video surveillance, video telephony, and video conferencing.
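The two latency figures above (a 1.5 times faster clock in the worst case and comparable throughput above roughly one third perfect-matched MBs) are consistent with a simple linear model. The sketch below assumes, as a simplification not stated in the source, that a perfect-matched MB costs negligible time and every other MB costs 1.5 times as long as in the PPFMMC. The frequency scaling policy is likewise a hypothetical illustration.

```python
WORST_CASE_CLOCK_RATIO = 1.5  # from the text: worst case needs a 1.5x faster clock


def required_clock_ratio(p0, worst_case_ratio=WORST_CASE_CLOCK_RATIO):
    """Clock ratio (CFMMC / PPFMMC) needed to match the PPFMMC throughput.

    Assumption: perfect-matched MBs take negligible time, so only the
    remaining (1 - p0) fraction of MBs contributes, each at worst_case_ratio.
    p0 is a fraction between 0 and 1.
    """
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must be a fraction between 0 and 1")
    return worst_case_ratio * (1.0 - p0)


def break_even_p0(worst_case_ratio=WORST_CASE_CLOCK_RATIO):
    """P0 at which the CFMMC matches the PPFMMC at the same clock."""
    return 1.0 - 1.0 / worst_case_ratio


def scaled_clock_mhz(p0, base_clock_mhz=50.0):
    """Hypothetical scaling policy: raise the clock only when it is needed."""
    return base_clock_mhz * max(1.0, required_clock_ratio(p0))


print(break_even_p0())        # close to one third under this model
print(scaled_clock_mhz(0.0))  # worst case
print(scaled_clock_mhz(0.8))  # mostly static content
```

Under this assumption the break-even point is P0 = 1/3, close to the 33.4% reported for the prototype. The small difference would be accounted for by the nonzero cost of handling perfect-matched MBs.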

Fig. 31. Power consumption distribution and comparison of the PPFMMC and the CFMMC for mobile and akiyo. [Axis range: 0 to 25]

5.2.5. Summary

We proposed a combined frame memory motion compensation (CFMMC) architecture that not only reduces the frame memory size but also has the potential to reduce the bandwidth requirement, access latency, and energy consumption. The statistics of perfect-matched MBs were investigated for well-known video sequences. Based on the statistical results, we derived the latency and energy consumption models for evaluation. During the exploration, we found that when the percentage of perfect-matched MBs (P0) was higher than 50%, the CFMMC could reduce both the latency and the energy consumption due to memory accesses.

To investigate the cost of the extra computation and control logic needed to achieve these benefits, hardware architectures of the CFMMC and the most commonly used PPFMMC were both implemented. The hardware implementation of the CFMMC required only 75% of the silicon area used to implement the PPFMMC. The CFMMC architecture was also capable of reducing the bandwidth requirement and the energy consumption by up to 72% and 16%, respectively, when P0>70%. However, when P0 was not high enough, the CFMMC suffered from increases in bandwidth requirement, energy consumption, and latency.

Consequently, these limitations restrict the application of the CFMMC to video surveillance, video telephony, and video conferencing. For these applications, the CFMMC can be expected to deliver its reductions in bandwidth requirement, energy consumption, and latency.

From the document 系統資料頻寬之研究 (A Study of System Data Bandwidth), pp. 95-101