Chapter 5 Hardware Architectural Requirements
5.3 Timing model
To find the overhead of each granularity, we split the time needed for one deblocking operation into three parts: the time to read, the time to filter, and the time to write.
With the hardware described above, data can be read from and written to three sources: memory, other PEs, and the PE itself. Assuming one address holds 16 pixels, let the latencies of the three sources be memory : other PEs : self PE = x : y : z. Because we could not find published figures for this ratio, we assume y = x/1000 and take the latency of the self PE to be 0. Using CACTI[9], we estimate the memory latency to be 1.63 ns; the resulting read and write latencies are modeled in Table 5-1.
Table 5-1 Time for read data and write data.
Source    Memory    Other PEs    Self PE
Latency   1.63 ns   0.013 ns     0 ns
According to Ref[8], the time to filter is 10 ns. We assume that one stage consists of reading data, filtering, and writing data. The time of one stage is therefore 13.26 ns, the sum of the maximum read time (1.63 ns), the maximum filter time (10 ns), and the maximum write time (1.63 ns). With this timing model, the time to process one 1920×1080 frame is as shown in Table 5-2, where Pt is the maximum parallelism at time t.
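The stage time can be reproduced with a short calculation. The following Python sketch is illustrative only: the names `LATENCY_NS`, `FILTER_TIME_NS`, `stage_time_ns`, and `frame_time_ns` are hypothetical, and the frame time assumes that every stage is fully synchronized and costs the fixed worst-case stage time.

```python
# Latencies of the three data sources in ns (Table 5-1).
# The same values are used for reads and for writes.
LATENCY_NS = {"memory": 1.63, "other_pe": 0.013, "self_pe": 0.0}
FILTER_TIME_NS = 10.0  # filter time taken from Ref[8]


def stage_time_ns() -> float:
    """Worst-case time of one stage: max read + filter + max write."""
    worst_access = max(LATENCY_NS.values())
    return worst_access + FILTER_TIME_NS + worst_access


def frame_time_ns(num_stages: int) -> float:
    """Frame time, assuming each stage costs the fixed stage time."""
    return num_stages * stage_time_ns()


print(round(stage_time_ns(), 2))  # 13.26
```

Passing a stage count to `frame_time_ns` then gives the frame time under this assumption; the stage count depends on the granularity and on the frame size.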
Table 5-2 Time for deblocking a 1920×1080 frame.
Granularity          Time (ns)
2D-wavefront (MB)    ∑_t (… × 13.26)
Boundary16           ∑_t (… × 13.26)
Boundary4            ∑_t (… × 13.26)
Under this timing model, the relative speedup among the different granularities is the same as in Table 4-1. The speedup over the original sequential deblocking, however, is different. Processing one QCIF (176×144) frame, which contains 99 MBs, with Boundary4 takes 108 stages. According to Ref[10], sequentially deblocking one MB takes 530 cycles, so processing one 4 pixel long boundary takes 17 cycles on average. Ref[10] also reports that sequentially processing one QCIF frame takes 51930 cycles, so the ideal speedup and the actual speedup differ, as shown in Table 5-3.
Table 5-3 Idealized and actual speedup.
           Idealized                   Actual
Speedup    (99 × 32) / 108 = 29.33     51930 / (108 × 17) = 28.28
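The two speedup figures can be checked with a few lines of arithmetic. This Python sketch is a hedged illustration: the constant and function names are hypothetical, and the value of 32 four-pixel-long boundaries per MB is an assumption (it is consistent with 530 / 32 ≈ 17 cycles per boundary). The remaining numbers are the ones quoted from Ref[10] and from the stage count for Boundary4.

```python
MBS_PER_QCIF = (176 // 16) * (144 // 16)  # 11 x 9 = 99 macroblocks
BOUNDARIES_PER_MB = 32           # assumption: 4 pixel long boundaries per MB
CYCLES_PER_BOUNDARY = 17         # average derived from 530 cycles per MB, Ref[10]
STAGES_BOUNDARY4 = 108           # stages for one QCIF frame with Boundary4
SEQUENTIAL_CYCLES_QCIF = 51930   # sequential cycles per QCIF frame, Ref[10]

# Every stage lasts one fixed boundary-filter slot.
PARALLEL_CYCLES = STAGES_BOUNDARY4 * CYCLES_PER_BOUNDARY


def idealized_speedup() -> float:
    """Sequential cost if every boundary took the average 17 cycles."""
    sequential = MBS_PER_QCIF * BOUNDARIES_PER_MB * CYCLES_PER_BOUNDARY
    return sequential / PARALLEL_CYCLES


def actual_speedup() -> float:
    """Sequential cost taken from the cycle count reported in Ref[10]."""
    return SEQUENTIAL_CYCLES_QCIF / PARALLEL_CYCLES


print(round(idealized_speedup(), 2))  # 29.33
print(round(actual_speedup(), 2))     # 28.28
```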
The difference comes from the boundary strength (BS) value, which ranges from 0 to 4. Every boundary has a BS value: a boundary with BS equal to 0 needs no deblocking, while a boundary with BS from 1 to 4 does. In sequential processing, when the current boundary's BS value is 0, the filter moves on to the next boundary immediately. In our design the stage time is fixed, so a PE must still wait for the whole stage to finish even when the current boundary's BS value is 0.
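The effect of BS on the two speedup figures can be illustrated with a toy model. The sketch below is not the design itself: the function names and the BS values are made up for illustration, and it assumes that a skipped boundary costs zero cycles in sequential processing, which is a simplification.

```python
CYCLES_PER_BOUNDARY = 17  # average cost of filtering one boundary, Ref[10]


def sequential_cycles(stages, skip_bs0=True):
    """Sequential cost; a boundary with BS == 0 is skipped when skip_bs0 is set."""
    total = 0
    for stage in stages:
        for bs in stage:
            if skip_bs0 and bs == 0:
                continue  # nothing to filter, move on to the next boundary
            total += CYCLES_PER_BOUNDARY
    return total


def fixed_stage_cycles(stages):
    """Parallel cost: every stage takes a full slot regardless of BS."""
    return len(stages) * CYCLES_PER_BOUNDARY


# Hypothetical BS values: three stages with two boundaries each.
stages = [[0, 0], [3, 0], [1, 2]]

idealized = sequential_cycles(stages, skip_bs0=False) / fixed_stage_cycles(stages)
actual = sequential_cycles(stages, skip_bs0=True) / fixed_stage_cycles(stages)
print(idealized, actual)  # 2.0 1.0
```

In this toy input half of the boundaries have BS equal to 0, so the sequential baseline becomes twice as fast while the fixed-stage design does not, and the actual speedup falls below the idealized one.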
Chapter 6 Conclusion
As our proposed order shows, examining the deblocking algorithm at a finer granularity exposes additional opportunities for parallelism and thus shortens the execution time of the deblocking filter. Compared with the 2D wave-front order, the 4 pixel long boundary method achieves speedups of 1.92 and 2.44 when deblocking 1920×1080 and 1080×1920 frames, respectively, given an unlimited number of PEs. For environments with limited hardware resources, we also provide an algorithm that fully utilizes the available resources for the deblocking filter.
Considering the trend of digital video codecs, larger frame sizes and smaller coded video sizes are both essential. The deblocking filter plays an important role in achieving this goal, because the time to process a frame is proportional to the frame size. The proposed design bounds the growth in deblocking time by the maximum of the frame width and height, which is often proportional to the square root of the frame size. It thus opens the way to practical real-time deblocking of larger videos in the future.
The approach proposed in this paper is only a first step toward parallelizing H.264 video decoding at a finer granularity. To exploit the overall parallelism, the other decoding stages, including intra decoding and motion compensation, must also take the parallel order of their operations into account. We can further analyze the algorithms of these stages to see whether an approach similar to the one in this paper applies to them.
References
[1] List, P., Joch, A., Lainema, J., Bjontegaard, G., Karczewicz, M., "Adaptive deblocking filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 614-619, July 2003.
[2] E. Van der Tol, E. Jasper, R.H. Gelderblom, "Mapping of H.264 Decoding on a Multiprocessor Architecture," Proceedings of SPIE Conference on Image and Video Communications, 2003, pp. 707-709.
[3] Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A., "Parallel Scalability of H.264," in Proc. First Workshop on Programmability Issues for Multi-Core Computers, January 2008.
[4] Zhuo Zhao, Ping Liang, "Data partition for wavefront parallelization of H.264 video encoder," Proceedings of the 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), pp. 4 pp.-2672, 2006.
[5] Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC), Mar. 2003.
[6] Ke Xu, Chiu-Sing Choy, "A Five-Stage Pipeline, 204 Cycles/MB, Single-Port SRAM-Based Deblocking Filter for H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 3, pp. 363-374, March 2008.
[7] Yun-Shuo Chang, "Improvements of H.264 De-blocking filter and DST Implementation of H.264 Decoder," Master's thesis, Institute of Electrical Engineering, National Yunlin University of Science & Technology, July 2007.
[8] T.M. Liu, W.P. Lee, T.A. Lin, and C.Y. Lee, "A memory-efficient deblocking filter for H.264/AVC video coding," in Proc. IEEE Int. Symp. Circuits Syst., May 2005, vol. 3, pp. 2140-2143.
[9] S. Thoziyoor, N. Muralimanohar, J.H. Ahn, and N.P. Jouppi, "CACTI 5.3," Technical Report HPL-2008-20, 2008.
[10] Eric Gerard Ernst, "Architecture Design of a Scalable Adaptive Deblocking Filter for H.264/AVC," Master's thesis (Computer Engineering), July 2007.
3. Hardware requirement: why are connections between PEs needed when there is already a shared bus?
[Figure: series 4 pixel, 16 pixel, and 2D; axis from 0 to 2500; panels (a) 1920×1080 and (b) 1080×1920]
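A large amount of data must be transferred at the same time, and the PEs are adjacent to one another, so the cost of connections between PEs is low. Passing data over direct connections between PEs is therefore appropriate.
4. Since each PE has buffers, shouldn't an assumed buffer size be added as well?
A4:
16 pixel: if the shared bus can carry the required amount of data, the input buffer holds 1 MB, the output buffer holds 1 MB, and the internal buffer holds 1/4 MB.
4 pixel: the input buffer holds two 4×4 blocks, the output buffer holds two 4×4 blocks, and the internal buffer holds one 4×4 block.
5. How can it be shown that my order is the best?
A5:
The main goal of this thesis is to raise the computational parallelism in order to reduce execution time. When the number of PEs is sufficient for the maximum parallelism, the proposed order already achieves the shortest critical path. For the other boundaries, which are not on the critical path, there is no such thing as a best order unless hardware constraints are imposed.
Deblocking has no data dependency at all between frames, and [3] has already completed a design for parallel processing across frames, so we only need to focus, within a single frame, on …
The time to filter one boundary varies with the boundary strength (BS) value: if the BS value is 0 the boundary needs no filtering, and if the BS value is 1 to 4 it needs filtering.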
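Under our proposed order, the probability that all boundaries that can be processed together in the same stage have a BS value of 0 is very low. We therefore take the longest time needed to process one MB as the time unit and process the stages in a fully synchronized manner.
8. Why can the 4 pixel long boundary order be mapped directly to the 1 pixel long boundary order, while the same does not happen between 16 and 4?
A8:
Because 16 pixel long boundaries cross one another, analyzing the data dependency at a finer granularity allows some boundaries that could be processed earlier to actually be processed earlier. 4 pixel long boundaries do not cross one another, so analyzing the data dependency at an even finer granularity makes no difference, and the order can be mapped directly to 1 pixel long boundaries.
… is high, can be fully utilized on multi-core, and the overhead is small.
… compared with the execution method, the idealized speedup and the actual speedup differ. (The details have been added to Section 5.3.)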