TLP Implementation and Techniques - Accelerating H.264 Video Decoding with Parallel Processing on a Multi-core Embedded Platform

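The performance of a pipeline-based video decoder depends heavily on how well its pipeline stages are balanced. Because decoding complexity varies markedly from one macroblock (MB) to the next, a fixed stage partition can hardly achieve near-perfect load balancing. This is why we place a FIFO circular buffer between every two pipeline stages: it absorbs the load variation caused by MBs of differing decoding complexity. Buffering alone still falls short of ideal balance, so we additionally propose two dynamic pipeline partitioning methods. The following subsections describe three implementations: (1) static pipeline partitioning, (2) dynamic pipeline partitioning using macroblock mode, and (3) dynamic pipeline partitioning using a monitoring buffer.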
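The FIFO circular buffer placed between two stages can be illustrated with a minimal sketch. The names `MbNode`, `MbFifo`, `FIFO_DEPTH` and the payload fields are hypothetical, not taken from the decoder, and the sketch is single-threaded: a real multi-core version would need synchronization suited to the platform, which is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 16  /* hypothetical depth; in practice limited by scratchpad size */

/* Hypothetical per-MB payload handed from one pipeline stage to the next. */
typedef struct {
    int mb_index;
    int mb_type;
} MbNode;

typedef struct {
    MbNode nodes[FIFO_DEPTH];
    size_t head;   /* next slot to read */
    size_t tail;   /* next slot to write */
    size_t count;  /* nodes currently queued */
} MbFifo;

static void fifo_init(MbFifo *f) { f->head = f->tail = f->count = 0; }

/* Producer side: returns false when the buffer is full (the producer must wait). */
static bool fifo_push(MbFifo *f, MbNode n) {
    if (f->count == FIFO_DEPTH) return false;
    f->nodes[f->tail] = n;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}

/* Consumer side: returns false when the buffer is empty (the consumer must wait). */
static bool fifo_pop(MbFifo *f, MbNode *out) {
    if (f->count == 0) return false;
    *out = f->nodes[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

A failed `fifo_push` corresponds to the upstream core idling on a full buffer, and a failed `fifo_pop` to the downstream core idling on an empty one.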

1) Static pipeline partitioning

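As shown in Figure 34, each core executes a fixed set of tasks. Core 1 executes ED and either MVR or IMR²⁵. This assignment is chosen because MVR and ED share data, in particular the MV and refIdx of the current MB and its neighboring MBs. MV and refIdx are stored in DDR2, so if the two tasks ran on different cores, both cores would read the same data from DDR2; placing them on the same core lets the data be read from cache and reduces system bus usage. Core 2 executes IQIT, MP or IP, and VR; it is the most heavily loaded of the three cores (a consequence of the limited number of cores). Core 3 executes ILF only.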
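Why an unevenly loaded core limits the whole pipeline can be seen from a simple bound: with fixed per-stage times and no stalls, a pipeline's steady-state speedup over a single core is the total work per MB divided by the time of the slowest stage. The sketch below computes this bound; the function name and any stage times passed to it are illustrative assumptions, not measurements from the decoder, and the bound ignores communication and memory effects.

```c
#include <stddef.h>

/* Ideal steady-state speedup of a pipeline over a single core:
 * total work per MB divided by the slowest stage's time.
 * Returns 0.0 if no stage has a positive time. */
static double ideal_pipeline_speedup(const double *stage_time, size_t n_stages) {
    double total = 0.0;
    double slowest = 0.0;
    for (size_t i = 0; i < n_stages; i++) {
        total += stage_time[i];
        if (stage_time[i] > slowest) slowest = stage_time[i];
    }
    return slowest > 0.0 ? total / slowest : 0.0;
}
```

For three perfectly balanced stages the bound is 3; for hypothetical stage times of 3, 4 and 3 units it drops to 2.5, because the middle stage sets the pace.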

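MB decoding time depends strongly on the coding mode. For example, when the unit being decoded is a skipped macroblock, ED takes very little time; in contrast, when it is an intra macroblock, ED takes longer, because intra macroblocks usually carry more residual bits. Table 3 shows the distribution of macroblock types and the speedup ratio for five MPEG test sequences. We can readily notice that in the News sequence ...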

Figure 34. Static pipeline partitioning for the P-MBs and I-MBs of H.264/AVC

25 When the MB type is P-MB, MVR is executed; otherwise IMR is executed.

Table 3. Macroblock mode distributions and performance (using static pipeline partitioning)

Sequence   Bitrate    Intra MB (%)   Inter MB (%)   Skip MB (%)   Speedup ratio
Crew       512 kbps   14.4           62.8           22.8          2.54
Crew       1.5 Mbps   15.4           76.5            8.1          2.63
Foreman    512 kbps    4.1           66.2           29.7          2.56
Foreman    1.5 Mbps    4.2           80.4           15.4          2.74
Mobile     512 kbps    0.5           73.1           26.4          2.47
Mobile     1.5 Mbps    0.5           86.4           13.1          2.72
News       512 kbps    1.4           29.2           69.4          2.31
News       1.5 Mbps    1.8           46.4           51.8          2.74
Stefan     512 kbps    3.0           59.4           37.6          2.29
Stefan     1.5 Mbps    3.8           72.9           23.3          2.54

The percentages of the macroblock modes are computed for the whole sequence of 300 frames.

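The reason skip MBs affect performance is simple. A run of consecutive skip MBs makes ED take very little time, and when FIFO circular buffer #1 is not deep enough, the first core soon goes idle because the buffer is full. Ideally the FIFO circular buffer would be as large as possible, but in practice its size is limited by the capacity of the local scratchpad memory. To solve this problem, we propose dynamic pipeline partitioning.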
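The stall described above can be reproduced with a small timing model of one producer stage, one consumer stage and a bounded buffer between them. This is a simplified sketch under stated assumptions, not the decoder's scheduler: costs are abstract time units, the consumer removes a node from the buffer when it starts working on it, and the function name and parameters are hypothetical.

```c
#include <stdlib.h>

/* Total time the producer stage spends blocked on a full buffer.
 * prod_cost[i] / cons_cost[i]: time each stage needs for MB i.
 * depth: number of nodes the buffer can hold.
 * Returns -1 on invalid depth or allocation failure. */
static long producer_stall_time(const int *prod_cost, const int *cons_cost,
                                size_t n, size_t depth) {
    if (depth == 0) return -1;
    long *cons_start = malloc(n * sizeof *cons_start);
    if (n > 0 && cons_start == NULL) return -1;

    long push_prev = 0;       /* when the previous MB entered the buffer */
    long cons_done_prev = 0;  /* when the consumer finished the previous MB */
    long stall = 0;

    for (size_t i = 0; i < n; i++) {
        long ready = push_prev + prod_cost[i];  /* producer finished MB i */
        long push = ready;
        /* A slot is free only once MB (i - depth) has left the buffer. */
        if (i >= depth && cons_start[i - depth] > push)
            push = cons_start[i - depth];
        stall += push - ready;

        cons_start[i] = push > cons_done_prev ? push : cons_done_prev;
        cons_done_prev = cons_start[i] + cons_cost[i];
        push_prev = push;
    }
    free(cons_start);
    return stall;
}
```

Feeding the model a run of cheap producer costs (standing in for skip MBs) against a slower consumer yields a growing stall time for a shallow buffer, while a sufficiently deep buffer or balanced costs yields none, which matches the trade-off against scratchpad capacity noted above.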

2) Dynamic pipeline partitioning using Macroblock mode

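Dynamic partitioning decides the partition at execution time, so it can adapt to the current decoding conditions. This avoids the problem of static partitioning, where a run of skip MBs leaves the first and second cores unevenly loaded. The criterion for the dynamic split is the macroblock type: when the macroblock type is skip MB, the partition in Figure 35(a) is used. In the figure, SNIP denotes SaveNeighborForIntraPred, which stores some pixels of the current MB as reference data for intra prediction. SNIP must always be processed on the same core; otherwise its data would have to be stored in DDR2, which would hurt decoding performance.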
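The selection rule can be sketched as a small function. The text only states that skip MBs use the partition of Figure 35(a); that all other MB types use the partition of Figure 35(b) is an assumption made here, and the enum and function names are hypothetical.

```c
/* Hypothetical MB type labels for illustration. */
typedef enum { MB_INTRA, MB_INTER, MB_SKIP } MbType;

/* The two partitions of Figure 35. */
typedef enum { PARTITION_FIG35A, PARTITION_FIG35B } PartitionMode;

/* Skip MBs select partition (a); mapping every other type to (b)
 * is an assumption, not stated in the text. */
static PartitionMode select_partition_by_mb_type(MbType type) {
    return type == MB_SKIP ? PARTITION_FIG35A : PARTITION_FIG35B;
}
```

Because the decision depends only on the MB type, it can be taken as soon as the type is known and costs a single comparison per MB.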

Figure 35. (a) ~ (b) Two kinds of different partition for dynamic pipeline partitioning using MB mode. Panels (a) and (b) "Second mode" each show (I) P-MB decoding steps and (II) I-MB decoding steps.

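To make the effect of MB types on parallelization performance easier to see, Figures 36 to 39 show the speedup ratios of the two software pipeline decoders (static and dynamic pipeline partitioning schemes) for the test sequences Crew at 1.5 Mbps and Stefan at 512 kbps. Frames where the bit count drops line up clearly with changes in the speedup ratio, and bits per sample is closely related to the number of skip MBs. A closer look at Figure 39 shows that peaks in the number of skip MBs correspond to lower speedup ratios. For Crew at 1.5 Mbps, peaks in the speedup ratio coincide with peaks in bits per sample and with peaks in the number of intra 4x4 MBs.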

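As Figure 37 shows, Stefan at 512 kbps has a larger share of skip MBs, or more precisely a larger share of consecutive skip MBs, so adopting dynamic partitioning improves performance significantly. For Crew at 1.5 Mbps, where the share of skip MBs is lower, the effect is limited.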

Figure 36. Top: the speedup ratio of the dynamic (use MB type)²⁶ and static pipeline partition decoder for Crew@1.5mbps. Bottom: bits per frame of the video sequence


Figure 37. Top: the speedup ratios of the dynamic (use MB type) and static pipeline partition decoder for Stefan@512kbps. Bottom: bits per frame of the video sequence

26 In Figure 36 the speedup ratio exceeds 3 because the single-core version always places its data in DDR2.

Figure 38. Top: the speedup ratio of the dynamic and static pipeline partition decoder for Crew@1.5mbps. Bottom: the number of the I4x4 MBs

Figure 39. Top: the speedup ratio of the dynamic and static pipeline partition decoder for Stefan@512kbps. Bottom: the number of the Skip MBs

Our static pipeline partition decoder has three stages in total. The first handles the entropy decoder and either intra mode or motion vector reconstruction; the second handles IQ, IT, MB prediction and reconstruction; the third executes the in-loop filter. The decoder contains two FIFO circular buffers, as shown in Figure 34, and every buffer node has the same data structure. The speedup ratio is directly affected by whether the buffers underflow or overflow: if neither occurred during decoding, the average speedup ratio would approach three. In practice each MB has variable complexity, so underflow or overflow can occur and performance suffers.

Figure 40 shows how buffer depth grows at run time for frame #1 of Stefan@512kbps, the frame with the lowest speedup ratio (1.99) of this sample in the static partition pipeline decoder. The first FIFO is in the overflow state almost all the time, while the second is in underflow. With dynamic partitioning, overflow of the first buffer eases considerably and the speedup rises to 2.65. Figure 41 shows frame #152 of Crew@1.5mbps, the frame with the lowest speedup ratio of that sample, which suffers from the same problem. Dynamic partitioning (skip mode) helps, but buffer overflow remains evident. We therefore approach the problem from a different angle, starting from the FIFO buffers themselves, and propose a second dynamic partitioning mechanism.

Figure 40. The runtime buffer growth at frame #1 of the Stefan@512kbps for the static and dynamic (skip) partition pipeline decoder. Panels: static partition, speedup 1.99; dynamic partition, speedup 2.65. Curves: 1st FIFO, 2nd FIFO.

Figure 41. The runtime buffer growth at frame #152 of the Crew@1.5mbps for the static and dynamic (skip) partition pipeline decoder. Panels: static partition, speedup 2.37; dynamic partition, speedup 2.69. Curves: 1st FIFO, 2nd FIFO.
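The buffer behaviour plotted in Figures 40 and 41 can be summarized numerically from a recorded depth trace. The sketch below counts how many samples of a trace show a full buffer and how many show an empty one; treating "overflow" as full and "underflow" as empty is an interpretation assumed here, and the type and function names are hypothetical.

```c
#include <stddef.h>

typedef struct {
    size_t full_samples;   /* samples where the buffer was at capacity */
    size_t empty_samples;  /* samples where the buffer held nothing */
} FifoPressure;

/* depth_trace[i]: buffer depth observed at sample i (for example, once per MB). */
static FifoPressure count_fifo_pressure(const size_t *depth_trace, size_t n,
                                        size_t capacity) {
    FifoPressure p = {0, 0};
    for (size_t i = 0; i < n; i++) {
        if (depth_trace[i] >= capacity) p.full_samples++;
        else if (depth_trace[i] == 0) p.empty_samples++;
    }
    return p;
}
```

A high count of full samples on the first FIFO together with a high count of empty samples on the second would correspond to the pattern described for frame #1 of Stefan@512kbps under static partitioning.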

3) Dynamic pipeline partitioning using monitoring circular buffer

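As noted in the previous subsection, MB type has a direct bearing on performance, which is why we used skip MBs, the type that most readily affects performance, as the basis for dynamic partitioning. Undeniably, other cases such as Intra 4x4, Inter 8x8, or MBs with many residual bits also affect performance, directly or indirectly. Taking all of these factors into account, however, would make the dynamic partitioning rules overly complex, would require extensive testing, and would give results that vary with the input video. We therefore use the state of the two FIFO circular buffers directly as the mechanism for changing the partition. The rule and the partition modes are shown in Figure 42 and Figure 43.

When the depth of buffer #1 exceeds threshold α, mode 1 is used: the P-MB work of the second core is moved to the first processor. When the depth of buffer #2 exceeds threshold β and buffer #1 is not overloaded (to avoid overloading the second processor itself), mode 2 is used: the in-loop filter is moved to the second processor to relieve buffer #2. If neither condition holds, the default mode 0 is used. α and β were chosen from the module ratios and then fine-tuned; they are 0.8 and 0.9 times the total buffer depth, respectively.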

if (buffer_1 > α)                          // α is TotalBufferDepth * 0.8
    dynamicMode = 1;
else if (buffer_2 > β && buffer_1 < α)     // β is TotalBufferDepth * 0.9
    dynamicMode = 2;
else
    dynamicMode = 0;

Figure 42. The rule for dynamic partition using monitoring buffer

Figure 43. The decoding steps for dynamic partition pipeline decoder using monitoring buffer. Panels (a) Mode #0, (b) Mode #1 and (c) Mode #2 each show (I) P-MB decoding steps and (II) I-MB decoding steps.
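The rule of Figure 42 can be written as a compilable C function. This is a sketch under assumptions: both buffers are taken to have the same total depth, the thresholds 0.8 and 0.9 are evaluated in integer arithmetic to avoid floating point, and the function and constant names are hypothetical.

```c
#include <stddef.h>

enum {
    MODE_DEFAULT = 0,          /* mode 0: default partition */
    MODE_RELIEVE_BUFFER1 = 1,  /* mode 1: move P-MB work from core 2 to core 1 */
    MODE_RELIEVE_BUFFER2 = 2   /* mode 2: move the in-loop filter to core 2 */
};

/* depth1, depth2: current depths of buffer #1 and buffer #2.
 * total_depth: capacity of each buffer (assumed equal for both).
 * alpha = 0.8 * total_depth, beta = 0.9 * total_depth;
 * "depth > 0.8 * total" is tested as "10 * depth > 8 * total". */
static int select_dynamic_mode(size_t depth1, size_t depth2, size_t total_depth) {
    if (10 * depth1 > 8 * total_depth)
        return MODE_RELIEVE_BUFFER1;
    if (10 * depth2 > 9 * total_depth && 10 * depth1 < 8 * total_depth)
        return MODE_RELIEVE_BUFFER2;
    return MODE_DEFAULT;
}
```

As in the rule of Figure 42, buffer #1 takes priority: mode 2 is chosen only when buffer #1 is strictly below α, so a depth exactly equal to α falls through to mode 0.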

Figure 44. The runtime buffer growth at frame #152 of the Crew@1.5mbps for the dynamic#1 (use skip) and dynamic#2 (monitor) partition pipeline decoder. Panels: dynamic partition (monitoring buffer), speedup 2.77; dynamic partition (use skip MB), speedup 2.69. Axes: MB number (horizontal), buffer depth (vertical). Curves: 1st FIFO, 2nd FIFO.

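As Figure 44 shows, controlling the partition with the monitoring buffer raises the speedup from 2.69 to 2.77. The figure also shows that overflow of the 1st FIFO becomes noticeably less frequent and that utilization of the 2nd FIFO increases, which indicates that the monitoring buffer reduces core idle time.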
