Experimental Results and Discussion - Accelerating H.264 Video Decoding with Parallel Processing on a Multi-core Embedded Platform

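Table 8 shows the performance of the baseline single-core decoder on the target platform. All figures are the average of three runs. The data show that on this platform the single-core decoder cannot decode CIF video in real time (30 Hz): 300 frames take between 57.15 and 136.55 seconds, which is roughly 2 to 5 frames per second. Our focus, however, is the speedup of the various parallel H.264 decoders implemented on this platform.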

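Tables 9 to 12 report the four three-core pipeline decoders. Tables 9 to 11 use the proposed IPC datapath, and Table 12 is a DRAM-based variant built for comparison. Although the platform has four MicroBlaze cores, the first core extracts NAL units from the bitstream in SDRAM, passes them to the pipeline, and prints information on the console through the RS232 device, so only three MicroBlaze cores perform actual decoding in the pipeline decoder. Table 11 shows that the pipeline architecture achieves close to a threefold speedup over the single-core decoder. To find out how much of this gain comes from the proposed datapath, we also built a DRAM-based pipeline decoder, whose results are given in Table 12. The DRAM-based design follows the Static Pipeline Partition Decoder, except that the shared FIFO buffers are placed in DRAM rather than in local scratchpad memory. Its speedup over the single-core decoder is only 1.83x (Crew at 512 kbps), clearly behind the 2.54x of static partitioning and the 2.82x of dynamic partitioning. This means that, compared with a conventional SMP architecture that provides no IPC mechanism, the proposed IPC datapath lets a pipeline-based video decoder make much better use of the cores. Table 13 gives the results of the Wavefront Three-Core Decoder: it outperforms the DRAM-based pipeline decoder but falls short of the proposed pipeline decoders.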
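To make the role of the FIFO buffers concrete, the following is a minimal Python sketch of a three-stage pipeline in which each stage runs on its own thread and hands its output to the next stage through a bounded FIFO. It is only an illustration under stated assumptions: the stage names (`entropy_decode`, `motion_compensate`, `deblock`), the FIFO depth, and the use of Python threads in place of MicroBlaze cores are hypothetical, and the sketch does not reproduce the decoder's actual stage partitioning or the IPC datapath hardware.

```python
import queue
import threading

FIFO_DEPTH = 4    # hypothetical buffer depth, not taken from the thesis
END = object()    # sentinel that marks the end of the stream


def entropy_decode(nal):        # placeholder stage bodies (assumed names)
    return f"ed({nal})"


def motion_compensate(data):
    return f"mc({data})"


def deblock(data):
    return f"dbk({data})"


def stage(work, inbox, outbox):
    """One pipeline stage: pop an item, process it, push it downstream."""
    while True:
        item = inbox.get()
        if item is END:
            outbox.put(END)     # forward the sentinel so later stages stop too
            return
        outbox.put(work(item))


def run_pipeline(nal_units, stages):
    """Feed NAL units through the stages, each stage on its own thread."""
    fifos = [queue.Queue(maxsize=FIFO_DEPTH) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(work, fifos[i], fifos[i + 1]))
        for i, work in enumerate(stages)
    ]

    def feeder():               # plays the role of the NAL-extraction core
        for nal in nal_units:
            fifos[0].put(nal)
        fifos[0].put(END)

    threads.append(threading.Thread(target=feeder))
    for t in threads:
        t.start()

    decoded = []
    while True:
        item = fifos[-1].get()
        if item is END:
            break
        decoded.append(item)
    for t in threads:
        t.join()
    return decoded


frames = run_pipeline(["nal0", "nal1", "nal2"],
                      [entropy_decode, motion_compensate, deblock])
print(frames)
```

In this model, placing the FIFOs in DRAM instead of scratchpad would roughly correspond to making every `put` and `get` slower, which is the kind of cost the DRAM-based decoder is meant to expose.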

Table 8. Baseline Single-Core Video Decoder Performance

Video Sequence    512 kbps    1.5 Mbps
Crew              97.57       136.55
Foreman           86.33       118.97
Mobile            85.61       116.87
News              57.15       81.27
Stefan            86.39       119.3

All sequences are in CIF resolution and have 300 frames. The numbers are decoding times in seconds. The test platform has 16KB of local scratchpad (not used here) and 64KB of data cache per processor core. Only one core is used to run the single-core video decoder.

Table 9. Proposed Static Pipeline Partition Decoder Performance

Video Sequence    512 kbps    1.5 Mbps
Crew              38.46       51.84
Foreman           33.73       43.35
Mobile            34.7        42.93
News              24.76       29.65
Stefan            37.66       46.94

All numbers are decoding time in seconds. The decoder uses the proposed pipeline datapath for decoding. Static pipeline-stage partitioning is adopted. The test platform has 64KB of local scratchpad and 16KB of data cache per processor core.

Table 10. Proposed Dynamic Pipeline Partition Decoder (use MB mode) Performance

Video Sequence    512 kbps    1.5 Mbps
Crew              35.63       50.31
Foreman           30.91       41.17
Mobile            31.83       41.58
News              24.58       30.05
Stefan            33.5        45.35

All numbers are decoding time in seconds. The decoder uses the proposed pipeline datapath for decoding. Dynamic pipeline-stage partitioning is adopted (use MB mode). The test platform has 64KB of local scratchpad and 16KB of data cache per processor core.

Table 11. Proposed Dynamic Pipeline Partition Decoder (monitoring buffer) Performance

Video Sequence    512 kbps    1.5 Mbps
Crew              34.60       43.96
Foreman           30.51       38.32
Mobile            29.56       38.43
News              24.53       29.91
Stefan            31.31       42.05

All numbers are decoding time in seconds. The decoder uses the proposed pipeline datapath for decoding. Dynamic pipeline-stage partitioning is adopted (monitoring buffer). The test platform has 64KB of local scratchpad and 16KB of data cache per processor core.

Table 12. DRAM-Based Three-Core Pipeline Decoder Performance

Video Sequence    512 kbps    1.5 Mbps
Crew              53.19       70.92
Foreman           47.81       61.47
Mobile            49.29       61.67
News              40.65       51.50
Stefan            50.68       64.63

All numbers are decoding time in seconds. The decoder uses the shared main memory (DDR2-DRAM) to store the pipeline FIFO buffers. The test platform has 16KB of local scratchpad (not used here) and 64KB of data cache per processor core.

Table 13. Wavefront Three-Core Decoder Performance

Video Sequence    512 kbps    1.5 Mbps
Crew              42.86       63.00
Foreman           38.86       56.24
Mobile            38.82       56.14
News              29.59       44.80
Stefan            38.81       56.87

All numbers are decoding time in seconds. The test platform has 16KB of local scratchpad (not used here) and 64KB of data cache per processor core.

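Tables 14 and 15 give the speedup ratios of the proposed dynamic pipeline partition decoder with the monitoring-buffer mechanism, the DRAM-based pipeline decoder, and the wavefront decoder. The speedup ratio is the decoding time of the single-core decoder divided by the decoding time of each parallel decoder. In theory, when a parallel decoding algorithm works well, its speedup approaches three. In practice, system bus and memory bandwidth contention and IPC overheads usually keep the speedup below three. Notably, the tables show that when the bitrate rises from 512 kbps to 1.5 Mbps, the speedup of the proposed dynamic pipeline partition decoder improves clearly. A higher bitrate generally improves load balance: motion compensation is the heaviest module in the decoding process, and as the bitrate rises the load of the entropy-decode module grows in proportion, so the stages come closer to a balanced load.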
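The load-balance argument can be illustrated with a simple steady-state model, which is a standard simplification and not something measured in this work: one core needs the sum of all stage times per frame, while a full pipeline emits one frame per slowest-stage time. The per-stage costs below are hypothetical numbers chosen only to show the trend, and the model ignores the bus contention and IPC overheads mentioned above.

```python
def pipeline_speedup(stage_times):
    """Ideal steady-state speedup of a pipeline over running all stages on one core.

    One core needs sum(stage_times) per frame; a full pipeline emits one frame
    every max(stage_times), so the ratio is the best-case speedup.
    """
    return sum(stage_times) / max(stage_times)


# Hypothetical per-frame stage costs in milliseconds (entropy decode,
# motion compensation, remaining steps). These are NOT measured values.
low_bitrate = [20, 50, 30]    # motion compensation dominates
high_bitrate = [30, 40, 30]   # entropy decoding takes a larger share

print(pipeline_speedup(low_bitrate))   # 2.0
print(pipeline_speedup(high_bitrate))  # 2.5
```

The closer the stage costs are to one another, the closer the ratio gets to the number of stages, which is why a shift of work toward entropy decoding can raise the speedup of a three-stage pipeline.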

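The wavefront decoder, on the other hand, loses performance as the bitrate rises. The reason is that entropy decoding in H.264/AVC is inherently sequential, which means a whole frame must be entropy-decoded before its MBs can be distributed to the worker threads. When the bitrate goes up, the entropy decoder's load rises and overall performance drops. This is why we proposed interleaved entropy decoding, which reduces the impact of the sequential entropy-decode step. Distributing the entropy-decode task and all the MB-decode tasks across the cores is not easy, however, so we implemented two versions of the interleaving wavefront video decoder. The first uses a three-core scheduling policy: the first core performs entropy decoding row by row, and once a row of MBs has been entropy-decoded it is assigned to the second or third core for the remaining decoding steps. If both of those cores are busy with MB-decode tasks, the first core performs the MB decoding itself. The second uses a four-core scheduling policy: the first core performs only entropy decoding, and each entropy-decoded row of MBs is assigned to one of the other three cores. If the first core finishes entropy-decoding all MBs before the other cores finish all MB-decode tasks, it busy-waits until all tasks are done.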
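The two dispatch policies can be compared in a toy timing model. The sketch below is an assumption-laden simplification rather than the scheduler used in the decoder: it treats each MB row as one indivisible task, ignores the dependencies between neighbouring rows and any bus contention, and uses hypothetical per-row costs. The function name `simulate` and its parameters are introduced here for illustration only.

```python
def simulate(entropy_cost, decode_cost, n_workers, core0_helps):
    """Return the finish time of one frame under a simplified dispatch model.

    entropy_cost[i] / decode_cost[i]: time to entropy-decode / MB-decode row i.
    core0_helps=True  -> three-core policy (core 0 decodes a row if no worker is idle)
    core0_helps=False -> four-core policy (core 0 only entropy-decodes)
    """
    worker_free = [0] * n_workers   # time at which each worker core becomes idle
    t0 = 0                          # clock of core 0
    for e, d in zip(entropy_cost, decode_cost):
        t0 += e                                          # core 0 entropy-decodes the row
        w = min(range(n_workers), key=worker_free.__getitem__)
        if worker_free[w] <= t0:                         # an idle worker takes the row
            worker_free[w] = t0 + d
        elif core0_helps:                                # all busy: core 0 decodes it
            t0 += d
        else:                                            # all busy: row waits for worker w
            worker_free[w] += d
    return max([t0] + worker_free)


# Hypothetical costs: 4 rows, 1 time unit of entropy decoding and 3 of MB decoding each.
entropy = [1, 1, 1, 1]
decode = [3, 3, 3, 3]
three_core = simulate(entropy, decode, n_workers=2, core0_helps=True)
four_core = simulate(entropy, decode, n_workers=3, core0_helps=False)
print(three_core, four_core)   # 10 7
```

In this toy case the three-core policy stalls entropy decoding whenever core 0 takes over a row, while the four-core policy keeps entropy decoding running ahead of the workers. Real frames have far more variable per-row costs than the constants used here.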

Table 16 shows that, at both 512 kbps and 1.5 Mbps, the three-core interleaving wavefront decoder performs no better than the non-interleaving version. A likely reason is that video decoding has highly variable complexity, so interleaving the entropy-decode task with the MB-decode tasks does not work well. The 4-core interleaving wavefront decoder, however, clearly outperforms the other parallel decoders (except the monitoring-buffer version), and the speedup of the 4-core policy can be regarded as an upper bound on what the 3-core policy would reach with good load balance.

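From our implementation we know that a wavefront-based decoder may perform better as the video resolution increases. Table 17 shows results for the 720x480 version of Crew at three bitrates: 1.5, 3.0, and 6.0 Mbps. We tested the proposed three-core pipeline decoder and the three-core wavefront decoder. The wavefront decoder does improve (2.28 at 1.5 Mbps, versus 2.17 for CIF Crew at the same bitrate in Table 15), but the proposed pipeline decoder outperforms it at every bitrate. Moreover, as discussed in Section 4.5, the wavefront decoder needs additional off-chip memory to store data as the resolution grows, whereas the data volume of the pipeline-based decoder depends little on resolution.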

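These observations show that the scalability of a wavefront parallel video decoder depends heavily on the memory subsystem. We placed the stack in fast on-chip scratchpad memory to reduce the effect of the overly simple cache on the FPGA platform. To verify the influence of the cache subsystem on the wavefront decoder, we also ran the wavefront decoder on a Nexus 7 (2013) tablet. As Table 18 shows, the speedups measured on the tablet are about the same as those on the FPGA. The comparison of wavefront decoders in Table 19 indicates that a wavefront decoder on three cores reaches a speedup of roughly 2.7.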

Table 14. Speedup Ratio of Different Decoders at 512 kbps

Sequence    Proposed Pipeline    DRAM Pipeline    Wavefront
Crew        2.82                 1.83             2.28
Foreman     2.83                 1.81             2.22
Mobile      2.90                 1.74             2.21
News        2.33                 1.41             1.93
Stefan      2.76                 1.70             2.23

The single-core decoder is used as the reference decoder for the calculation of the speedup ratios of all the three-core decoders. Thus, the upper bound of the speedup factor is around 3.

Table 15. Speedup Ratio of Different Decoders at 1.5 Mbps

Sequence    Proposed Pipeline    DRAM Pipeline    Wavefront
Crew        3.11                 1.93             2.17
Foreman     3.10                 1.94             2.12
Mobile      3.04                 1.90             2.08
News        2.72                 1.58             1.81
Stefan      2.84                 1.85             2.10

The single-core decoder is used as the reference decoder for the calculation of the speedup ratios of all the three-core decoders. Thus, the upper bound of the speedup factor is around 3.
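The "Proposed Pipeline" columns of Tables 14 and 15 can be reproduced from the decoding times in Tables 8 and 11. The short Python check below copies those times and applies the definition of the speedup ratio (single-core time divided by parallel time). The dictionary and function names are introduced here only for illustration.

```python
# Decoding times in seconds as (512 kbps, 1.5 Mbps), copied from Tables 8 and 11.
single_core = {
    "Crew": (97.57, 136.55),
    "Foreman": (86.33, 118.97),
    "Mobile": (85.61, 116.87),
    "News": (57.15, 81.27),
    "Stefan": (86.39, 119.3),
}
dynamic_pipeline = {
    "Crew": (34.60, 43.96),
    "Foreman": (30.51, 38.32),
    "Mobile": (29.56, 38.43),
    "News": (24.53, 29.91),
    "Stefan": (31.31, 42.05),
}


def speedups(baseline, parallel):
    """Speedup per sequence and bitrate: baseline time / parallel time."""
    return {
        name: tuple(round(b / p, 2) for b, p in zip(baseline[name], parallel[name]))
        for name in baseline
    }


for name, (low, high) in speedups(single_core, dynamic_pipeline).items():
    print(f"{name:8s} {low:.2f} {high:.2f}")
```

The same function can be applied to the times in Tables 12 and 13 to obtain the DRAM Pipeline and Wavefront columns.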

Table 16. Speedup Ratio of the Wavefront Decoder with Interleaved Entropy Decoding

            512 kbps            1.5 Mbps
Sequence    3-core    4-core    3-core    4-core
Crew        2.16      2.72      2.11      2.95
Foreman     2.11      2.61      2.05      2.83
Mobile      2.09      2.63      2.01      2.82
News        1.80      2.36      1.75      2.59
Stefan      2.07      2.62      1.99      2.80

Table 17. Speedup Ratio for the Crew 720x480 Video Sequence

Bitrate     Proposed Pipeline Decoder    Wavefront Decoder
1.5 Mbps    2.57                         2.28
3.0 Mbps    2.68                         2.20
6.0 Mbps    2.71                         2.04

The single-core decoder is used as the reference decoder for the calculation of the speedup ratios of both three-core decoders. The decoding times of the single-core decoder are 314.7, 382.1, and 468.8 seconds, respectively. The video is the 300-frame Crew sequence.

Table 18. Speedup Ratio of Wavefront Decoder on Nexus 7 (2013)

            512 kbps                       1.5 Mbps
Sequence    3-core (NIE)    4-core (IE)    3-core (NIE)    4-core (IE)
Crew        2.32            2.80           1.88            2.97
Foreman     2.20            2.65           1.89            2.81
Mobile      2.11            2.60           1.78            2.72
News        1.87            2.55           1.52            2.30
Stefan      2.16            2.62           1.85            2.75

The average decoding times of the single-core decoder for different videos are 6.44, 5.68, 5.66, 3.29, and 5.62 seconds, respectively. The 3-core decoder does not interleave the entropy decoding (NIE) task with the MB decoding tasks while the 4-core decoder interleaves the entropy decoding (IE) task with the MB decoding tasks row-by-row. For the 4-core decoder, the first core is solely responsible for the entropy decoding task.

Table 19. Comparison of Parallelization in Various H.264/AVC Wavefront Decoders

                  Mesa [5]     Schöffmann [10]     Baik [8]    Jo [1]
Parallelism       DLP          DLP                 DLP         DLP
Codec             FFmpeg       Self-implemented    N/A         JM 13.2
Architecture      SGI Altix    x86 Xeon            CBE         x86 MP
Video Resol.      1080p        1080p               1080p       CIF, 720p
Num. of cores     3            4                   5           4
Speedup           1.98x        2.34x               3.5x        2.9x

6. Conclusion and Future Work

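Previous work on parallelizing H.264/AVC on multi-core platforms has mostly adopted data-level parallelism (DLP) video decoders, because parallel computing today pursues maximum scalability and wants as many cores as possible working simultaneously. DLP has the advantage both in scalability and in programming effort: unlike a TLP video decoder, it does not require carefully partitioning the decoding steps, a task that takes a solid understanding of decoding behavior before a well-balanced partition can be found. As the preceding chapters showed, however, the reality in an embedded environment is different. DLP differs markedly from TLP in its off-chip memory usage, and since memory usage is a key concern in embedded environments, this is a clear drawback of DLP. Furthermore, DLP's scalability advantage brings little benefit on today's smartphones and tablets, which mostly have four, or at most eight, cores because of thermal constraints and the demand for thin, light devices. With the proposed IPC datapath, a pipeline-based video decoder can be used more effectively: the data-transfer overhead drops and the system bus is used less often. On the programming side, the thin OS kernel we provide makes pipeline-based programs easy to develop, and the proposed dynamic pipeline partition video decoder uses the monitoring-buffer technique to improve pipeline-based decoding performance, reaching a speedup close to three on three cores.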

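Although our implementation appears fairly complete, the proposed platform architecture and both the pipeline-based and wavefront-based video decoders still have room for improvement. First, at the system level, the current Xilinx Virtex-5 FPGA development board does not support cache coherence, so programmers must call an invalidation function themselves to deal with race conditions. Fixing this requires changes at the circuit level. Our lab plans to disable the cache provided by MicroBlaze and replace it with a coherent cache of our own design, which may resolve the problem that the DLP video decoder cannot reach its full performance on this platform (further data analysis and verification are needed).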

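On the video decoder side, the wavefront decoder currently uses static scheduling. We hope to move to dynamic scheduling, in which a scheduler (the main thread) maintains a task queue and the worker threads fetch tasks from the queue. This would address the load imbalance that static scheduling is prone to. It also requires coherent caches, because dynamic scheduling has to use off-chip shared memory, and a well-designed cache subsystem is needed to keep the frequent off-chip memory accesses of a more complex design from degrading performance. The pipeline decoder currently uses dynamic pipeline partitioning with the monitoring-buffer technique. Here we would like task assignment across cores to be more flexible: at present, modules such as motion vector reconstruction and SaveNeighborPixelForIntra must run on a designated core or in a restricted order. The reason is that our IPC controller is currently unidirectional: core p can only pass data to core (p+1), so data cannot be sent back, which limits the flexibility of dynamic partitioning. Our lab also plans to support a bidirectional IPC controller.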
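The dynamic scheduling idea described above (a scheduler that owns a task queue, and worker threads that pull tasks from it) can be sketched as follows. This is an illustration of the general pattern, not the planned implementation: Python threads stand in for cores, `decode_mb` is a placeholder callback, and the dependency rule (each MB waits for its left neighbour and its upper-right neighbour) is the commonly used wavefront rule, assumed here rather than taken from this chapter.

```python
import queue
import threading


def dependencies(r, c, cols):
    """MBs that must be decoded before MB (r, c) under the assumed wavefront rule."""
    deps = []
    if c > 0:
        deps.append((r, c - 1))                      # left neighbour
    if r > 0:
        deps.append((r - 1, min(c + 1, cols - 1)))   # upper-right (upper at the row end)
    return deps


def decode_frame(rows, cols, n_workers, decode_mb):
    """Decode all MBs with a scheduler-owned task queue; return the completion order."""
    ready = queue.Queue()   # task queue maintained by the scheduler
    done = queue.Queue()    # completion reports sent back by the workers

    def worker():
        while True:
            mb = ready.get()
            if mb is None:          # stop signal
                return
            decode_mb(mb)
            done.put(mb)

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()

    all_mbs = [(r, c) for r in range(rows) for c in range(cols)]
    pending = {mb: len(dependencies(mb[0], mb[1], cols)) for mb in all_mbs}
    waiting_on = {mb: [] for mb in all_mbs}
    for mb in all_mbs:
        for dep in dependencies(mb[0], mb[1], cols):
            waiting_on[dep].append(mb)

    for mb in all_mbs:              # initially only MB (0, 0) has no dependency
        if pending[mb] == 0:
            ready.put(mb)

    order = []
    while len(order) < len(all_mbs):
        mb = done.get()             # the scheduler (main thread) blocks here
        order.append(mb)
        for nxt in waiting_on[mb]:
            pending[nxt] -= 1
            if pending[nxt] == 0:   # all neighbours done: the task becomes runnable
                ready.put(nxt)

    for _ in workers:
        ready.put(None)
    for w in workers:
        w.join()
    return order


order = decode_frame(rows=3, cols=4, n_workers=2, decode_mb=lambda mb: None)
print(order[0])   # (0, 0)
```

Because a worker takes whichever task is ready next, a slow MB delays only the MBs that depend on it, which is the load-balancing benefit that static scheduling lacks. On the target platform the two queues would live in shared memory, which is why the text ties this design to coherent caches.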

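Finally, we hope that the proposed system architecture can be applied to today's multi-core platforms, and that our parallel video decoding techniques can prove useful in resource-constrained embedded environments.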

References

[1] JO, Song Hyun; JO, Seongmin; SONG, Yong Ho. Efficient coordination of parallel threads of H.264/AVC decoder for performance improvement. Consumer Electronics, IEEE Transactions on, 2010, 56.3: 1963-1971.

[2] VAN DER TOL, Erik B.; JASPERS, Egbert G.; GELDERBLOM, Rob H. Mapping of H.264 decoding on a multiprocessor architecture. In: Electronic Imaging 2003. International Society for Optics and Photonics, 2003. p. 707-718.

[3] SEITNER, Florian H., et al. Evaluation of data-parallel splitting approaches for H.264 decoding. In: Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia. ACM, 2008. p. 40-49.

[4] MEENDERINCK, Cor, et al. Parallel scalability of video decoders. Journal of Signal Processing Systems, 2009, 57.2: 173-194.

[5] ÁLVAREZ MESA, Mauricio, et al. Scalability of macroblock-level parallelism for H.264 decoding. In: Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on. IEEE, 2009. p. 236-243.

[6] AZEVEDO, Arnaldo, et al. Parallel H.264 decoding on an embedded multicore processor. In: High Performance Embedded Architectures and Compilers. Springer
