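Integrated Multimedia Communication Platform Based on MPEG Standards and Its Applications. Subproject VI: Research in Multipoint Videoconferencing Technologies (I)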


National Science Council, Executive Yuan: Report on a Subsidized Research Project

Research in Multipoint Videoconferencing Technologies (I)

Project number: NSC 92-2219-E-009-009
Project period: August 1, 2003 to July 31, 2004
Principal investigator: David W. Lin (林大衛), Professor, Department of Electronics Engineering, National Chiao Tung University
Project participants: 詹益鎬, 林岳賢, 蔡鎮宇, 簡志凱, and 劉明瑋, graduate students, Department of Electronics Engineering, National Chiao Tung University

Abstract

Desktop videoconferencing has become more practical and accessible in recent years, but typical systems still lack the look and feel of a face-to-face meeting. The purpose of this project is to study technologies for distributed desktop multipoint videoconferencing, with particular emphasis on the processing of video signals. On the computer screen of each conference terminal, we intend to show a virtual conference room that contains the images of the conferees at all other terminals. To do so, each terminal first segments the local input video to extract the image of the local conferee, encodes it, and transmits it to the other terminals. Each terminal also decodes all the received videos and composes them into a scene. The research is based on the MPEG-4 specifications and uses personal computers as the implementation platform. The research topics fall into four major groups: conferencing system, network transport, transmitter video processing, and receiver video processing. The intended period of research is three years.

This report covers the research done in the first year, which emphasized video segmentation techniques, efficient MPEG-4 video coding, and the understanding of the MPEG-4 specifications. In video segmentation, we proposed several methods and implemented a real-time video input and segmentation system on a personal computer. In MPEG-4 video coding, we implemented a faster MPEG-4 video encoder on a personal computer by improving a public-domain software package. To understand the other MPEG-4-related specifications, we studied the literature on scene composition and the signal transmission interface, and we tried to acquire some relevant software packages and examine their functions.

Keywords: Video Segmentation, Video Composition, MPEG-4 Video, Real-time Software Video Coding, Multipoint Videoconferencing.

Table of Contents

I. Background and Objectives
II. Results and Discussion
   A. Video Segmentation: More Accurate Delineation of Object Boundaries and Use of the Segmentation Results in Scene Composition
   B. Video Segmentation: Segmentation by Background Construction and Implementation of a Real-Time Video Segmentation System
   C. Efficient MPEG-4 Video Coding
   D. Preliminary Study of MPEG-4 Scene Composition and the Signal Transmission Interface
III. References
IV. Self-Evaluation of Project Results
V. Publications
Attachment: Report on Attendance at an International Conference (with paper)

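I. Background and Objectives

In the past, videoconferencing usually took place in dedicated videoconference rooms with special-purpose equipment. In recent years, with advances in video compression and digital integrated circuits, making video phone calls or holding multipoint videoconferences from the office or home over personal computers and the Internet is no longer a dream of the future. Systems such as NetMeeting (Microsoft), IVS (INRIA), NV (Xerox PARC), vic (U.C. Berkeley and Lawrence Berkeley National Laboratory), CU-SeeMe (Cornell University), and iVisit all provide real-time interactive video communication. This form of interactive video communication is sometimes called desktop video communication.

Typical desktop multipoint videoconferencing systems today still do not provide an audiovisual experience close to that of a face-to-face meeting. In fact, even traditional videoconferencing with dedicated rooms and equipment falls short of it. Some recent studies address issues such as how the conference is presented on screen, so as to bring videoconferencing closer to the look and feel of a face-to-face meeting. These studies consider composing the video received from several conference terminals into a scene that resembles a conference room. Before composition, the images of the conferees at those terminals may first need to be segmented. Examples of such studies can be found in [1]-[4]. Researchers in Taiwan have also done related work [5]-[7], but many of the technical issues remain to be explored further.

The purpose of this project is to study technologies for desktop multipoint videoconferencing and to build an experimental system. The target of the research is illustrated in Fig. 1. The personal computer (PC) and display on the left of the figure are the equipment used at each conference site. The display shows a virtual conference room whose video is a composition of the conferee videos sent from all other sites. The conference room scene and the position of each person are arranged by the chairperson. The screen has two further windows: a control panel and a local preview of the local video.

We find that several parts of the recently established MPEG-4 standard fit the needs of this research well. For example, its video coding part allows video to be segmented before coding; its data structure and composition part defines a fairly efficient BIFS (Binary Format for Scenes); and its network transport part defines DMIF (Delivery Multimedia Integration Framework) [8]. The research in this project is therefore based on the MPEG-4 specifications.

Fig. 1: Schematic architecture of the multipoint videoconferencing system.

The research topics of the project fall into four groups: conferencing system, network transport, transmitter video processing, and receiver video processing. The research is planned to take three years. This report covers the first year, which focused on in-depth study of the third group and preliminary study of the second and fourth groups. Specifically, in the third group (transmitter video processing), we studied video segmentation techniques and efficient MPEG-4 video coding. For video segmentation, we proposed several methods suited to conference-type images and implemented a simple real-time video input and segmentation system on a personal computer; we are continuing to improve its speed. For MPEG-4 video coding, we improved a public-domain software package and obtained a faster MPEG-4 video encoder on a personal computer. In the second and fourth groups (network transport and receiver video processing), we studied the literature on the MPEG-4 specifications for scene composition and the signal transmission interface, and we tried to obtain related software and examine its functions. In the second year, we will place more emphasis on the second and fourth groups and also plan to begin work on the first group, the conferencing system. The experience gained in improving the video encoder can serve as a reference for implementing the video decoder in the fourth group.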

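II. Results and Discussion

In this section we discuss three topics: video segmentation, efficient MPEG-4 video coding, and a preliminary study of scene composition and the signal transmission interface. Our work on video segmentation followed two tracks. The first considers how to delineate the boundaries of moving objects more accurately and makes a preliminary study of the methods and effects of scene composition with the segmented objects. The second uses the concept of background construction to develop a video segmentation method and builds, on a personal computer, a real-time video segmentation system comprising video input, real-time segmentation, and display of the segmented video. The boundary delineation method proposed in the first track can inform future improvement of the second. The scene composition results of the first track can also inform later work on the fourth group of topics described in the previous section. The four subsections below summarize these three topics, with video segmentation split into two subsections according to the two tracks.

A. Video Segmentation: More Accurate Delineation of Object Boundaries and Use of the Segmentation Results in Scene Composition

Two major issues in video segmentation need further study: accurate delineation of object boundaries, and computational complexity. For boundary delineation, common approaches are watershed analysis [9], [10], contour evolution [11], [12], and edge linking [13], [14]. We take the third approach and propose a computationally efficient method.

The structure of the proposed video segmentation algorithm is shown in Fig. 2. "Edge Detection" uses the Canny edge detector [15]. "Change Detection" extracts the pixels that differ strongly between two frames, which are called changed pixels. "Forward Tracking" and "Backward Validation" use hierarchical motion estimation with an effective search range of ±14 pixels; the complexity is about the same as that of full search. (We have also developed a faster motion estimation method, which is under submission.) "Video Object Extraction" is the most novel part. It uses morphological processing to obtain an object mask that approximates the actual object boundary fairly closely, and then uses the Dijkstra shortest-path algorithm [16] to link the breaks in the outer boundary of the mask and obtain the finally extracted object. Details are given in [17]. Because the complexity of the Dijkstra algorithm grows with the square of the number of pixels it must search, the complexity is reduced when the mask is already close to the actual object boundary.

Fig. 2: One of the proposed video object extraction and tracking methods. (Block diagram with the blocks Video Input, Edge Detection, Change Detection, Forward Tracking, Backward Validation, Video Object Extraction, and Frame Memory, and the signals Frame n, Frame n−1, and Frame n−p.)

Fig. 3 is an arbitrary example that illustrates the effect of the "Video Object Extraction" block. Fig. 3(a) shows the input to the block. The small squares denote pixels. Gray and black squares together denote the object mask after backward validation, and black squares denote the pixels of the mask that are edges (the output of the Edge Detection block). Fig. 3(b) shows the mask that closely approximates the actual object boundary; its outer rim hugs the outermost edges (black squares). The search range of the Dijkstra algorithm between A and B and between C and D is shown in Fig. 3(c), where $D_w$ is the search depth, an adjustable parameter. After the Dijkstra algorithm has linked all the breaks in the outer boundary, the result is as shown in Fig. 3(d).

Fig. 3: Arbitrary example illustrating the effect of the "Video Object Extraction" block. (Panels (a) to (d); A, B, C, and D mark the ends of the boundary gaps, and $D_w$ marks the search depth.)

The video object extraction method is especially effective when the object shape is highly nonconvex. Fig. 4 shows some segmentation results, where the final Dijkstra step uses search depth $D_w = 5$. Combined with a faster motion estimation method, the whole segmentation method can run in real time, at several tens of frames per second, on CIF (352×288) video on a current ordinary personal computer.

Fig. 4: Some segmentation results for the Mother-and-Daughter sequence. Top row: original frames; bottom row: segmented moving objects. The numbers at the bottom are frame numbers (80 and 150).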
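As a rough illustration of the edge-linking step, the sketch below runs Dijkstra's algorithm on a small pixel grid to bridge one gap between two boundary points. The step costs follow the attached paper ($d_0 = 1$ for entering an edge pixel, $d_1 = 10$ for a non-edge pixel). The 8-connected neighborhood, the function name `link_gap`, and the way the search window is passed in as a set of pixels are assumptions made for this example, not details taken from [17].

```python
import heapq

def link_gap(is_edge, window, start, goal, d0=1, d1=10):
    """Cheapest 8-connected pixel path from start to goal inside window.

    is_edge: set of (row, col) edge pixels; window: set of pixels that may be
    visited (the search window of depth Dw). Entering an edge pixel costs d0,
    entering any other pixel costs d1, so the path prefers existing edges.
    """
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        cost, pix = heapq.heappop(heap)
        if pix == goal:
            break
        if cost > dist[pix]:
            continue  # stale heap entry
        r, c = pix
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nxt = (r + dr, c + dc)
                if nxt == pix or nxt not in window:
                    continue
                new_cost = cost + (d0 if nxt in is_edge else d1)
                if new_cost < dist.get(nxt, float("inf")):
                    dist[nxt] = new_cost
                    prev[nxt] = pix
                    heapq.heappush(heap, (new_cost, nxt))
    if goal not in dist:
        return None  # the gap cannot be bridged inside this window
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy example: a 3-row window; edge fragments lie on the middle row
# except for a break at columns 3 and 4.
window = {(r, c) for r in range(3) for c in range(8)}
edges = {(1, c) for c in range(8) if c not in (3, 4)}
path = link_gap(edges, window, start=(1, 0), goal=(1, 7))
```

The returned path follows the existing edge pixels and crosses only the two-pixel break. A binary heap is used here for brevity; whichever variant of the algorithm is used, a smaller window (smaller $D_w$) means fewer pixels to search.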

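Next we consider scene composition using the segmented objects. This will often require enlarging or shrinking the objects. For this purpose, we used the factor-of-two resizing filters specified in MPEG-2 to design a simple method for enlargement or shrinkage by a succession of factors. Such successive shrinkage or enlargement may be quite useful in spatially scalable coding of objects. The MPEG-2 factor-of-two enlargement and shrinkage filters have coefficients [-12, 140, 140, -12]/256 and [-29, 0, 88, 138, 88, 0, -29]/256, respectively. If the enlargement factor m is an integer power of 2, it suffices to apply the MPEG-2 enlargement filter an integer number of times. If m is an integer but not an integer power of 2, we first enlarge the object by the largest power of 2 smaller than m and then linearly interpolate between adjacent points to obtain the pixel values at enlargement factor m. Shrinkage by integer factors is handled similarly. Rational but non-integer factors can also be handled in a similar way; see [17]. Fig. 5 shows several scenes composed from segmented objects and other images. The scenes look quite natural.

Fig. 5: Scenes composed from segmented objects (Akiyo, Claire, Salesman), shrunk by a factor of two, and other images.

B. Video Segmentation: Segmentation by Background Construction and Implementation of a Real-Time Video Segmentation System

Here we design a method that collects the background portions of successive frames to construct a background image that is as complete as possible. The current frame is then compared with the background image; the parts that differ greatly are extracted and refined, which yields the moving foreground objects. The structure of the algorithm is shown in Fig. 6.

First, we analyze the amount of camera noise in the video (Camera Noise Estimation). We designed a two-stage method to estimate the noise variance so as to reduce the adverse effect of moving objects on the estimation accuracy. Details are given in [18].

Second, we construct a temporary foreground mask. This is done by three functional blocks in Fig. 6: Frame Difference, Fill-In, and Canny Operator. Frame Difference finds the pixels that change substantially between frames. Fill-In fills the gaps between these pixels so that the result covers the region of the moving object. The Canny Operator and the operations that follow it shrink the region inward so that it is closer to the actual object shape (though it may still differ considerably). Fig. 7(a) shows an example result.

Third, we construct a short-term background (Short-term Background Estimation) by analyzing six consecutive frames. If the value of a pixel changes little over these frames, the pixel is tentatively counted as a background pixel. Fig. 7(b) shows an example result.

Fourth, we use the above results to construct a stationary background image (Stationary Background Buffer). A moving object whose interior is fairly smooth in luminance and color may be mistaken for background in this simple analysis, as Fig. 7(b) suggests.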
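The short-term test of the third step, and the misjudgment noted in the fourth, can be illustrated with a minimal sketch. The tolerance `tol`, the max-minus-min variation measure, and the synthetic one-row frames are assumptions made for illustration; the report does not state the actual criterion.

```python
def short_term_background(frames, tol):
    """Flag pixels whose value varies by at most tol over the given frames.

    frames: list of equally sized 2-D lists of luminance values.
    Returns a 2-D list of booleans (True = tentative background pixel).
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    flags = []
    for r in range(rows):
        row_flags = []
        for c in range(cols):
            values = [f[r][c] for f in frames]
            row_flags.append(max(values) - min(values) <= tol)
        flags.append(row_flags)
    return flags

def make_frame(t, width=16, bar=8):
    """One-row frame: a flat bright bar (200) at columns t..t+bar-1 on a dark background (50)."""
    return [[200 if t <= c < t + bar else 50 for c in range(width)]]

frames = [make_frame(t) for t in range(6)]   # six consecutive frames
flags = short_term_background(frames, tol=5)[0]
background_columns = [c for c, is_bg in enumerate(flags) if is_bg]
```

In this toy sequence, columns 13 to 15 are true background, but columns 5 to 7 lie inside the moving bar in all six frames and are flagged as well. This is the kind of error that motivates weighting the short-term result by the temporary foreground mask.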

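We therefore weight the short-term background pixels using the temporary foreground mask obtained earlier, and a pixel is placed in the final stationary background image only when its accumulated weight exceeds a threshold. Continuing the example, Fig. 7(c) shows the weights derived from Fig. 7(a): black denotes the highest weight (the more reliable background), white the lowest weight (zero), and gray intermediate weights.

Fig. 6: Structure of the video segmentation algorithm based on background construction.

Fig. 7: Some intermediate results of the algorithm (panels (a) to (d)).

The above results are adequate when the background is static. Sometimes, however, the camera is moved, and the background must then be adjusted. One could discard the existing background image and compute a new one, but rebuilding it takes many frames. In videophone or videoconferencing applications, a large part of the background after camera motion may overlap with the earlier one. We therefore estimate the camera motion so that the overlapping part can be taken from the previously constructed background image, which speeds up construction of the new one. This is the role of Scene Change Detection, Global Motion Estimation, and the Panorama Background Buffer in Fig. 6. Scene Change Detection detects whether the camera has moved. Global Motion Estimation uses an affine motion model. The Panorama Background Buffer stores the complete background obtained so far (called a sprite in MPEG-4) and thus helps construct the current Stationary Background Buffer. Finally, each frame to be segmented is differenced with the Stationary Background Buffer and the result is refined slightly to obtain the segmented moving object. Fig. 7(d) shows an example.

Based on this algorithm, we implemented on a personal computer a system for real-time video input, segmentation, and display of results. Fig. 8 is a snapshot of its window interface. Because the windowing program takes up considerable time, the not-yet-optimized program on a 2.4 GHz Pentium 4 CPU currently reaches about 5 CIF frames per second when the camera is stationary and about 2 frames per second when the camera moves.

Fig. 8: Snapshot of the window interface of the real-time video segmentation system.

C. Efficient MPEG-4 Video Coding

Because we use personal computers as the platform of the conferencing system, we need an efficient MPEG-4 video encoder on them. Some existing public-domain MPEG-4 video coding software was not designed for execution efficiency on personal computers, so there is much room for speedup. Here we take a public-domain MPEG-4 video coding software package developed by Microsoft, try to improve the way the program is written, and use the MMX (multimedia extension) unit of Intel CPUs to accelerate its computation. The software is at the level of the Main Profile and the Simple Scalable Profile defined in MPEG-4 Visual.

The Intel multimedia unit is a processing unit separate from the main Pentium CPU. The first generation, MMX, has eight 64-bit registers, each of which can hold 8 bytes, 4 words, 2 doublewords, or 1 quadword. The second generation, SSE (streaming SIMD extensions), added eight 128-bit registers and supports some floating-point operations. Video coding and decoding make little use of floating point, so this capability is of limited benefit. The third generation, SSE2 (streaming SIMD extensions 2), lets the eight 128-bit registers perform fixed-point operations, which is very useful for video coding and decoding. In all, the unit has 16 registers of two sizes, and Intel provides efficient parallel assembly-language instructions that execute on it. Because the unit operates separately from the main Pentium CPU, and because the Intel C++ compiler can handle these instructions mixed with C++ code, writing parts of the program in assembly language is not too difficult.

We therefore analyzed the MPEG-4 video coding software to find its most time-consuming parts and then accelerated them with suitable MMX-family instructions. The analysis relied mainly on the Intel VTune performance analyzer. This tool provides a tree-structured call graph, the percentage of time used by each function, the clockticks consumed, and similar data. It helps greatly in locating the bottlenecks of a program and in comparing the program's behavior before and after modification.

As expected, the most time-consuming part of the program is motion estimation. For some videos it takes more than 90% of the computation time. We also experimented with faster (though slightly less effective) motion estimation methods. Fig. 9 shows the encoding speed for the Foreman sequence: the original program runs at about 3 frames per second; the modified program with the same motion estimation method (FS, full search) runs at about 7 frames per second; and with the faster motion estimation methods (DS = diamond search, NDS = new diamond search, 2DLS = two-dimensional logarithmic search) the speed reaches 20 frames per second.

Fig. 9: Comparison of program efficiency before and after modification. (Plot titled "Speed vs Kbits/frame Chart for Foreman without shape information", 300 frames, release mode; horizontal axis: Kbits/frame, 0 to 70; vertical axis: speed in fps, 0 to 25; curves: Optimized DS, Optimized NDS, Optimized 2DLS, Optimized FS, and Original Code.)

D. Preliminary Study of MPEG-4 Scene Composition and the Signal Transmission Interface

Scene composition in MPEG-4 can be achieved with its Binary Format for Scenes (BIFS) specification. BIFS was developed from VRML and expresses the behavior and the spatio-temporal relations of predefined audiovisual objects (media objects). BIFS carries the scene description: it specifies how the MPEG-4 scene graph is reproduced, realizes the animation and interactive behavior of objects, and sequences and synchronizes the generation of these elements. It also defines event handling, object grouping, and operating rules. The scene description has the tree structure shown in Fig. 10, in which nodes describe a complete scene and each node consists of a set of fields that express its properties. A field may hold a single value, such as the radius of a sphere node, or many values, such as the list of vertices of a polygon. Many nodes are contained in the fields of other nodes, which is why the scene description forms a tree.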
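The node-and-field structure just described can be pictured with a small data structure. This is only a sketch of the idea: the class `Node`, the helper `walk`, and the node and field names in the example scene are illustrative choices, not the node definitions of the BIFS specification.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A scene node: a type name plus named fields.

    A field holds a single value, a list of values, or child nodes.
    """
    kind: str
    fields: dict = field(default_factory=dict)

    def children(self):
        """Yield every Node stored directly in a field of this node."""
        for value in self.fields.values():
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, Node):
                    yield item

def walk(node, depth=0):
    """Depth-first traversal yielding (depth, node kind)."""
    yield depth, node.kind
    for child in node.children():
        yield from walk(child, depth + 1)

# A single-valued field (radius), multi-valued fields (translation,
# vertices), and fields that contain other nodes (children).
scene = Node("Group", {"children": [
    Node("Transform", {"translation": [1.0, 0.0, 0.0],
                       "children": [Node("Sphere", {"radius": 0.5})]}),
    Node("Polygon", {"vertices": [[0, 0], [1, 0], [0, 1]]}),
]})
outline = list(walk(scene))
```

Because child nodes live inside the fields of their parents, a traversal of the fields is all that is needed to recover the tree.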

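Fig. 10: Structure of the scene description (from [19]).

BIFS-Command allows properties in the scene graph to be changed; it is used to modify a set of scene properties at a given time. So that several commands can be sent in a single access unit, commands are grouped into CommandFrames. BIFS-Command defines four basic commands: replacement of an entire scene, insertion, deletion, and replacement.

BIFS-Anim provides continuous updating of certain fields of nodes in the scene and is used to compose animations. Although BIFS-Anim and BIFS-Command have the same elementary stream type, they are not carried in the same elementary stream: BIFS-Anim information is carried in its own elementary stream, separate from the one that carries BIFS-Command. In addition, BIFS introduces quantization, which can be applied to fields of various types to reduce the data volume and make network transmission more efficient.

Fig. 11 shows a complete multimedia system as described in the MPEG-4 standard. As the figure shows, MPEG-4 transmits audiovisual objects and scene description information as elementary streams and relies on the BIFS scene description to combine the audiovisual objects into an interactive multimedia scene.

Fig. 11: MPEG-4 system architecture (from [19]).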
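The effect of the four BIFS-Command operations listed above can be pictured with a toy scene held in a dictionary. The command tuples, the function names, and the flat `{node_id: {field: value}}` layout are assumptions for illustration only; the real commands are binary-coded and can also address indexed values and routes, which this sketch ignores.

```python
def apply_command(scene, command):
    """Apply one toy update command to a scene held as {node_id: {field: value}}.

    Returns the resulting scene; only "replace_scene" returns a new object.
    """
    op = command[0]
    if op == "replace_scene":          # replacement of an entire scene
        return dict(command[1])
    if op == "insert":                 # insertion of a new node
        _, node_id, fields = command
        scene[node_id] = dict(fields)
    elif op == "delete":               # deletion of a node
        del scene[command[1]]
    elif op == "replace":              # replacement of one field value
        _, node_id, name, value = command
        scene[node_id][name] = value
    else:
        raise ValueError(f"unknown command: {op}")
    return scene

def apply_frame(scene, command_frame):
    """Apply all commands grouped in one frame, in order."""
    for command in command_frame:
        scene = apply_command(scene, command)
    return scene

scene = {"table": {"position": (0, 0)}}
frame = [
    ("insert", "conferee1", {"position": (1, 0), "scale": 1.0}),
    ("replace", "conferee1", "scale", 0.5),
    ("delete", "table"),
]
scene = apply_frame(scene, frame)
```

Grouping the three updates in one frame mirrors the role of a CommandFrame: the receiver applies them together at the given time.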

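For signal transmission, MPEG-4 defines a part that handles network delivery so that the standard can be used widely over different networks. This part is the Delivery Multimedia Integration Framework (DMIF). Fig. 12 illustrates how MPEG-4 separates data and network delivery into layers. The data compression layer is at the top. After compression, the systems layer synchronizes the data and passes it to the DMIF layer below, which is responsible for network delivery. In MPEG-4, overall control rests mainly with the systems layer.

Fig. 12: Basic concept and architecture of the layered handling of data compression and signal transmission in MPEG-4 (from [20]).

Next we look at how DMIF operates and communicates. At the highest level (Fig. 13), when an application wants to exchange data with an application on a remote workstation, DMIF performs the preliminary negotiation. Once the negotiation is complete, a channel is set up between the applications, and the data is actually sent through this channel.

Fig. 13: High-level view of DMIF operation (from [21]).

What does actual use of DMIF look like? We analyzed the IM1 (implementation 1) test software from Microsoft (Fig. 14). DMIF treats storage on a hard disk and broadcast within a network as forms of network delivery too, which makes DMIF more comprehensive. DMIF can serve many kinds of networks mainly because the DMIF Filter makes a selection based on the information passed down from the systems layer. The DMIF Instance that follows corresponds roughly to the session layer of OSI; it performs checks, authorization, and similar actions, and actually sends the information out.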

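Fig. 14: DMIF functions as implemented in the IM1 software (from [21]).

To keep the layers separate, so that data can be processed independently of network delivery, the DAI (DMIF-Application Interface) specifies a few parameters. Data can be sent once the systems layer attaches these parameters as a header in one of the defined formats. In other words, the DAI recognizes only a few kinds of parameters: service, channel, and data parameters. First, the service parameters determine which remote workstation to connect to and what service to perform. Next, the channel parameters are used to set up a channel. Finally, the data parameters are used to set up the channel over which the data is sent.

How does the DMIF Filter choose which network to use? IM1 does this in a rather unsophisticated way: each DMIF Instance is written as a DLL file, the DLLs are attached to the DMIF Filter one by one, and DMIF tries them one at a time. Finally, once the data channel is set up and data is to be sent, TransMux gathers the data that require similar networks and sends them out together. At the receiving end, TransMux restores the data to the original separate streams. In this way, IM1 can effectively implement the MPEG-4 DMIF specification.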
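The trial-and-error selection described above can be sketched as follows. Everything in the block is hypothetical: the class names, the `open_service` method, and the URL schemes are invented for illustration and do not reproduce the IM1 interfaces or the DAI primitives.

```python
class FileInstance:
    """Hypothetical plug-in that handles local storage URLs."""
    def open_service(self, url):
        return url.startswith("file://")

class NetworkInstance:
    """Hypothetical plug-in that handles a remote network service."""
    def open_service(self, url):
        return url.startswith("x-net://")

class Filter:
    """Tries the registered plug-ins one by one, in registration order."""
    def __init__(self):
        self.instances = []

    def register(self, instance):
        self.instances.append(instance)

    def select(self, url):
        for instance in self.instances:
            if instance.open_service(url):
                return instance
        raise LookupError(f"no plug-in accepts {url}")

dmif_filter = Filter()
dmif_filter.register(FileInstance())
dmif_filter.register(NetworkInstance())
chosen = dmif_filter.select("x-net://host/conference")
```

The application sees only the filter; which plug-in ends up serving the request depends on the service address alone. One alternative to trying every plug-in in turn would be a lookup table keyed by the kind of service requested.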

III. References

[1] M. E. Lukacs and D. G. Boyer, “A universal broadband multipoint teleconferencing service for the 21st century,” IEEE Commun. Mag., vol. 33, no. 11, pp. 36-43, Nov. 1995.
[2] D. G. Boyer, M. E. Lukacs, and M. Mills, “The personal presence system experimental research prototype,” in IEEE Int. Conf. Commun. Conf. Rec., pp. 1112-1116, 1996.
[3] O. Schreer, M. Karl, and P. Kauff, “PCI-based multi-processor system for immersive videoconference terminals,” in IEEE Int. Conf. Multimedia Expo, pp. 181-184, 2002.
[4] O. Schreer, M. Karl, and P. Kauff, “A Trimedia based multi-processor system using PCI technology for immersive videoconference terminals,” in Int. Conf. Digital Signal Processing, pp. 289-293, 2002.
[5] Y.-J. Chang, C.-C. Chen, J.-C. Chou, and Y.-C. Chen, “Implementation of a virtual chat room for multimedia communications,” in IEEE Workshop Multimedia Signal Processing, pp. 599-604, 1999.
[6] Y.-J. Chang, C.-C. Chen, J.-C. Chou, and Y.-C. Chen, “Virtual Talk: a model-based virtual phone using layered audio-visual integration,” in IEEE Int. Conf. Multimedia Expo, pp. 415-418, 2000.
[7] C.-W. Lin, W.-H. Wang, M.-T. Sun, and J.-N. Hwang, “Implementation of H.323 multipoint video conference systems with personal presence control,” in IEEE Int. Conf. Consumer Electron. Digest of Tech. Papers, pp. 108-109, 2000.
[8] S. Battista, F. Casalino, and C. Lande, “MPEG-4: a multimedia standard for the third millennium,” in two parts, IEEE Multimedia, vol. 6, no. 4, pp. 74-83, Oct.-Dec. 1999, and vol. 7, no. 1, pp. 76-84, Jan.-Mar. 2000.
[9] D. Wang, “Unsupervised video segmentation based on watersheds and temporal tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 539-546, Sep. 1998.
[10] S.-Y. Chien, Y.-W. Huang, and L.-G. Chen, “Predictive watershed: a fast watershed algorithm for video segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 5, pp. 453-461, May 2003.
[11] S. Sun, D. R. Haynor, and Y. Kim, “Semiautomatic video object segmentation using VSnakes,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 75-82, Jan. 2003.
[12] A.-R. Mansouri and J. Konrad, “Multiple motion segmentation with level sets,” IEEE Trans. Image Processing, vol. 12, no. 2, pp. 201-220, Feb. 2003.
[13] T. Meier and K. N. Ngan, “Video segmentation for content-based coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp. 1190-1203, Dec. 1999.
[14] C. Kim and J. N. Hwang, “Fast and automatic video object segmentation and tracking for content-based applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 2, pp. 122-129, Feb. 2002.
[15] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Machine Intell., vol. 8, no. 6, pp. 679-698, Nov. 1986.
[16] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, vol. 1, pp. 269-271, 1959.
[17] Y.-H. Jan and D. W. Lin, “Edge-and-motion-based semantic video object extraction and application to scene composition,” in Proc. IEEE Int. Symp. Consumer Electronics, pp. 340-344, Sep. 2004.
[18] Y.-H. Lin and D. W. Lin, “Real-time video segmentation based on background modeling applicable to videoconferencing,” to appear in Proc. Workshop Consumer Electronics, Hsinchu, Taiwan, ROC, Nov. 2004.
[19] P. Daras et al., “MPEG-4 authoring tool using moving object segmentation and tracking in video shots,” EURASIP J. Applied Signal Processing, vol. 2003, no. 9, pp. 1-18.
[20] http://ailab.chonbuk.ac.kr/~sjmun/mpeg4/tutorial/6-DMIF_paper.html
[21] G. S. Tselikis, “An overview of the Delivery Multimedia Integration Framework for broadband networks,” IEEE Commun., vol. 2, no. 4, 1999.

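IV. Self-Evaluation of Project Results

Conformance of the research content to the original plan: The work conforms to the project theme. The main results include methods and software implementations for video segmentation, efficient MPEG-4 video coding, and a preliminary study of MPEG-4 scene composition and the signal transmission interface.

Achievement of the expected goals: The contributions of this subproject take the form of novel findings, improved technical capability, the establishment of an experimental system, and the training of personnel.

Academic and practical value of the results: Some results on video segmentation have been published in or submitted to conferences, and journal papers are being written and submitted. The experience and results from implementing MPEG-4 video coding will inform our later related research. The results are also available to industry for reference, although we do not currently plan paid technology transfer.

Overall assessment: The project has produced results of academic and practical value and has contributed to the training of personnel. The outcome is good.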

V. Publications

The following two papers are attached:

[1] Y.-H. Jan and D. W. Lin, “Edge-and-motion-based semantic video object extraction and application to scene composition,” in Proc. IEEE Int. Symp. Consumer Electronics, pp. 340-344, Sep. 2004.
[2] Y.-H. Lin and D. W. Lin, “Real-time video segmentation based on background modeling applicable to videoconferencing,” to appear in Proc. Workshop Consumer Electronics, Hsinchu, Taiwan, ROC, Nov. 2004.

Paper 1. Edge-and-Motion-Based Semantic Video Object Extraction and Application to Scene Composition

Yih-Haw Jan and David W. Lin, Senior Member, IEEE

Abstract: We consider automatic segmentation of natural video for content-based video applications. In particular, we present a segmentation algorithm employing both motion and edge information. It employs the edge-linking approach for accurate determination of object boundaries and bidirectional motion estimation for robust tracking of object motion and shape change. The algorithm is designed with computational complexity also in mind. We further consider scene composition using the segmented objects. We describe and evaluate a method to enlarge or shrink objects for such purpose. The method is amenable to spatial-scalable coding.

Index Terms: Object resizing, object tracking, scene composition, video segmentation.

This work was supported by the National Science Council of Republic of China under grant no. NSC 92-2219-E-009-009 and by the Lee and MTI Center for Networking Research at National Chiao Tung University. The authors are with the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan 30010, ROC.

I. INTRODUCTION

Video segmentation for object-based coding and video content manipulation has received much recent attention. In this work, we consider employing the segmented video objects in scene composition. Current segmentation techniques still need improvement in two areas to make them more practical for various applications: accuracy in identification of object boundaries (especially when the objects are grossly nonconvex in shape) and computational complexity.

In terms of object boundary identification, several major approaches are watershed analysis [1], [2], contour evolution [3], [4], and edge linking [5], [6]. We take the edge linking approach, which appears able to identify object boundaries with reasonable accuracy. We present a method that performs edge linking efficiently. To track robustly the motion and potential shape changes of the segmented objects, we conduct both forward and backward motion estimation. The motion estimation method is designed with computational complexity in mind.

In using the segmentation results in scene composition, we frequently need to enlarge or shrink the objects. We consider how these can be done efficiently with good performance, in a manner that is also suitable for spatial-scalable coding.

In what follows, Section II describes the video segmentation algorithm and the associated experimental results. Section III discusses the use of the segmentation results in scene composition. Section IV is the conclusion.

II. PROPOSED VIDEO SEGMENTATION ALGORITHM

The proposed video segmentation algorithm is shown in Fig. 1. The algorithm employs both motion analysis and edge analysis, where the motion analysis consists of the blocks marked “change detection,” “forward tracking,” and “backward validation,” and the edge analysis consists of the blocks marked “edge detection” and “video object extraction.” In edge detection, we employ the Canny edge detector [7]. In change detection, we employ a statistical significance test on the interframe pixel value differences [8], [9]. Details are omitted. The primary novelty of the algorithm lies in the forward tracking, backward validation, and video object extraction blocks.

Fig. 1. Proposed structure of video extraction and tracking algorithm. (Block diagram with the blocks Video Input, Edge Detection, Change Detection, Forward Tracking, Backward Validation, Video Object Extraction, and Frame Memory, and the signals Frame n, Frame n−1, and Frame n−p.)

A. Video Object Tracking

“Forward tracking” and “backward validation” apply to the second and subsequent frames in a video sequence. They are omitted for the first frame, for there is no prior presence of objects to track in the first frame.

“Forward tracking” tracks the motion and shape change of each object. It employs hierarchical, block-based motion estimation [10] to find the motion vectors, where the finest block size is $2 \times 2$. It is carried out for the segmented objects only and the effective search range is ±14 pixels. Hierarchical motion estimation can capture true object motion better at a reduced complexity compared to straightforward block-matching motion estimation [10]. Nevertheless, because we have used a small final block size for added accuracy in the estimated motion, the complexity of motion estimation becomes relatively high and it amounts to roughly 97% of full-search motion estimation with an equal search range (but only over the areas where motion estimation is conducted). Since subsequent backward validation and video object extraction will adjust the object shape, we can tolerate some inaccuracy in forward tracking. This can be used to reduce the complexity of motion estimation, but it is not pursued in the present work.

“Backward validation” projects (after motion compensation) those pixels in the change detection mask but not in the forward-tracked object footprint into the previous frame and verifies whether they are part of the segmented object. If not, they are deleted. This requires backward motion estimation, which is also done using the above-described hierarchical motion estimation method. But the complexity is low because it is needed only for a small portion of the pixels in a frame.

B. Video Object Extraction

After motion tracking, the resulting pixel mask for an object may not match the object accurately. For example, it may contain some holes in the interior and its boundary may contain cracks or bulges. “Video object extraction” attempts to rectify these problems by edge-based morphological processing. The process involves four steps as illustrated in Fig. 2, where the purpose of the first three steps is to attain a suitably tight sketch of the object mask so that the last step (edge linking) can obtain the final, precise object mask efficiently.

Fig. 2. Procedure of video object extraction. (Steps: orthogonal scans, mask trimming, segmental orthogonal scans, edge linking.)

We illustrate the steps using an arbitrary example of the backward-validated object mask shown in Fig. 3(a) (the gray pixels). For convenience, denote the mask $M$. Let the edge pixels (obtained by the Canny edge detector) in $M$ be as shown in black in Fig. 3(b). We assume that the edge pixels in $M$ and near its outer perimeter are the ones that define the object boundary. The problem is to identify these pixels and bridge all the gaps.

For this, we first stop the “holes” inside $M$ through the “orthogonal scans” step. In this step, similar to [11], we first conduct a “horizontal scan” over each row of $M$ to fill in the space between the leftmost and the rightmost pixels. Then the result is subject to a “vertical scan” that fills in the space between the topmost and the bottommost pixels of each column. For the arbitrary example, we obtain Fig. 3(c) as the result. Denote it $M_o$ for convenience. As can be seen, holes inside $M$ are stopped in $M_o$, but the mask may also be significantly expanded.

Thus in the “mask trimming” step, we trim the overgrowth by eroding $M_o$ from the outer side inwards, deleting every pixel that is not an eight-connected neighbor of the original backward-validated mask $M$. This can patch up any one- and two-pixel-wide “cracks” that may remain in $M_o$, but will also leave a one-pixel-wide “coating” around $M_o$ that is outside the original perimeter of $M$. Therefore, we conduct the erosion one more time, trimming away any pixel on the outer boundary of the remaining $M_o$ that does not belong to $M$. The result for the arbitrary example is shown in Fig. 3(d). Denote it $M_t$ for convenience.

The next step, “segmental orthogonal scans,” tries to tighten up the mask further for the benefit of the last step, “edge linking.” It first examines each connected horizontal line segment in $M_t$ and fills in the space between the two farthest edge pixels (see Fig. 3(e) for the arbitrary example). Then a similar operation is conducted in the vertical direction, but this time regarding pixels in the horizontal result as equivalent to edge pixels. Denote the final result $M_s$ for convenience. Fig. 3(f) illustrates this result for the arbitrary example.

Fig. 3. Arbitrary example for illustration of the first three steps in video object extraction. (a) Backward-validated object mask $M$. (b) Same as (a) with edge pixels marked in black. (c) $M_o$, the mask after orthogonal scans. (d) $M_t$, the mask after mask trimming. (e) After segmental horizontal scan. (f) $M_s$, the mask after segmental vertical scan.

After the foregoing steps, we have obtained an object mask that is solid inside and relatively tight around the (assumed) object boundary edges on the outer side. This can be seen from Fig. 4(a), which redraws $M_s$ in gray and black (the latter denoting edge pixels). We now perform the last step, “edge linking,” to finalize the boundary of the extracted object. For this we resort to the well-recognized Dijkstra shortest-path algorithm [12] to bridge all the gaps, including the “apparent gaps” AB and CD. In applying the Dijkstra algorithm, each gap is treated separately. An edge pixel in $M_s$ is assigned a distance $d_0$ and each nonedge pixel a distance $d_1$. (We set $d_0 = 1$ and $d_1 = 10$ in the experiments reported below.) A suitable search depth into $M_s$ (denoted $D_w$) is chosen. Fig. 4(b) illustrates the edge search windows for gaps AB and CD under $D_w = 4$. Since the complexity of the Dijkstra algorithm is typically $O(n^2)$, where $n$ is the number of pixels in the search window, the smaller the window, the higher the algorithm efficiency. Because we have suitably tightened the outer perimeter of $M_s$ around the (assumed) object boundary edges, the search depth $D_w$ can be quite small (experiments show that $D_w \le 5$ is enough) and yet we can find most of the desired edges. Fig. 4(c) shows the edge search result for gaps AB and CD, and Fig. 4(d) the final object mask obtained by video object extraction.

Fig. 4. Illustration of edge linking method. (a) $M_s$, the mask after segmental vertical scan. (b) Search windows for gap regions AB and CD, respectively, at search depth $D_w = 4$. (c) Edge search results for gaps AB and CD, respectively. (d) Final object mask after edge linking.

C. Experimental Results

We show some results from segmenting the Akiyo and the Mother-and-Daughter sequences in CIF format ($352 \times 288$). Fig. 5 shows the results for Akiyo at $D_w = 5$. Fig. 6 shows the fractional agreement of the segmented object mask compared to a reference mask, calculated as proposed in [13]. The agreement is always above 0.995. Fig. 7 shows the results for Mother and Daughter, also at $D_w = 5$. The identified object boundaries are rather accurate.

Fig. 5. Segmentation result of Akiyo. Top row: original frames; bottom row: extracted moving object. Numbers below are frame numbers.

Fig. 6. Fractional agreement between segmentation and reference mask for Akiyo sequence. (Plot titled “Objective Evaluation”; horizontal axis: frame number, 0 to 150; vertical axis: agreement, 0.995 to 0.997.)

Fig. 7. Segmentation result of Mother and Daughter. Top row: original frames; bottom row: extracted moving object. Numbers below are frame numbers.

III. APPLICATION IN SCENE COMPOSITION

A. Method for Object Enlargement and Shrinkage

Consider scene composition using the extracted video objects. This is expected to often require enlargement or shrinkage of the video objects. For convenience, we consider employing the interpolation and decimation filters specified in MPEG-2, which are [-12, 140, 140, -12]/256 for interpolation and [-29, 0, 88, 138, 88, 0, -29]/256 for decimation. The frequency responses of these filters are shown in Fig. 8. Several modifications are necessary, however. First, MPEG-2 only considers enlargement and shrinkage by a factor of two (both horizontally and vertically), but we consider arbitrary-factor interpolation and decimation. Second, MPEG-2 only considers interpolation and decimation for rectangular video frames, whereas we have arbitrarily shaped video objects.

Fig. 8. Amplitude responses of the MPEG-2 enlargement and shrinkage filters. (Plot; horizontal axis: frequency in units of π rad, 0 to 1; vertical axis: amplitude, 0 to 1.4; curves: enlarge filter and shrink filter.)

Consider enlargement by arbitrary integer factors first. We consider a simple interpolation method illustrated by example in Fig. 9. In essence, enlargement by an integer power of 2 (say $2^n$) is accomplished by applying the MPEG-2 interpolation filter $n$ times. Enlargement by a factor that is not an integer power of 2 is accomplished by enlarging to the nearest lower integer power of 2, followed by linear interpolation between two nearest pixels according to the relative pixel distances.

Fig. 9. Proposed enlargement algorithm, illustrated by example. (a) Enlargement by two or three times. (b) Enlargement by two, four, or five times. (Gray squares denote pixels in the original object mask or as obtained by interpolation using the MPEG-2 filter; white squares denote linearly interpolated pixels. Dashed lines indicate linear interpolation between nearest pixels.)

Shrinkage by arbitrary integer factors operates on a similar principle and is illustrated by an example in Fig. 10. Again, shrinkage by an integer power of 2 is accomplished by repeated application of the MPEG-2 decimation filter. Shrinkage by a factor that is not an integer power of 2 is accomplished by shrinking to the nearest lower integer power of 2, followed by linear interpolation between two nearest pixels according to the relative pixel distances.

Fig. 10. Proposed shrinkage algorithm, illustrated by example. (a) Shrinkage by two or three times. (b) Shrinkage by two, four, or five times. (Gray squares denote pixels in the original object mask or as obtained by decimation using the MPEG-2 filter; white squares denote linearly interpolated pixels. Dashed lines indicate linear interpolation between nearest pixels.)

A rational-factor (say, $q/p$ where $p$ and $q$ are coprime integers) enlargement or shrinkage can be obtained as follows. Find the nearest lower powers of 2 for $p$ and $q$, say, $2^m$ and $2^n$, respectively. Enlarge or shrink the object by $2^{n-m}$ times. Then perform linear interpolation between nearest pixels according to the factor $q/p$ in a similar way as that for integer-factor enlargement or shrinkage.

In either enlargement or shrinkage, when the filter memory extends beyond the object mask (this happens when operating on pixels near the object boundary), we repeat the boundary pixel values for filtering purposes, similar to the principle used in H.263 and MPEG-4 for motion estimation. Note also that the way object enlargement and shrinkage are conducted fits well into a context of spatial-scalable coding.

B. Experimental Results

Since we deal with arbitrarily shaped objects, the interpolation and decimation performance at object boundaries is of particular interest. In Fig. 11 we show some results on the similarity (in PSNR) between the original segmented object and the one that is enlarged X times and then shrunk back to the original size, where X is between 2 and 5. As expected, the interior pixels (inside a three-pixel-wide band at object boundary) are subject to less distortion from enlargement and shrinkage compared to pixels near the object boundary. But overall, the PSNR is reasonably high. Fig. 12 shows two synthesized scenes (two frames each) obtained by superimposing the moving objects segmented (and shrunk by two times) from the Claire, the Akiyo, and the Salesman sequences onto some background and middle ground, and with an additional foreground overlaid on them. The synthesized scenes appear natural.
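The integer-factor enlargement rule of Section III-A can be sketched in one dimension as follows. The sketch assumes that the four-tap MPEG-2 filter computes the new half-way samples while the original samples are kept unchanged, which is one reading of Fig. 9; it works on a plain row of pixel values and leaves out the handling of the object mask shape. The function names are illustrative.

```python
INTERP = (-12, 140, 140, -12)  # MPEG-2 interpolation taps, to be divided by 256

def upsample2(x):
    """Double the sampling density of a 1-D list of pixel values.

    Original samples are kept; each new half-way sample is filtered from its
    four nearest original neighbors, repeating the boundary value at the ends.
    """
    n = len(x)
    out = []
    for i in range(n - 1):
        out.append(x[i])
        taps = [x[min(max(j, 0), n - 1)] for j in (i - 1, i, i + 1, i + 2)]
        out.append(sum(w * v for w, v in zip(INTERP, taps)) / 256)
    out.append(x[-1])
    return out

def enlarge(x, m):
    """Enlarge a 1-D signal (at least two samples) by integer factor m >= 1."""
    p = 1
    grid = list(x)
    while 2 * p <= m:          # nearest power of 2 not exceeding m
        grid = upsample2(grid)
        p *= 2
    if p == m:
        return grid
    out = []
    for k in range(m * (len(x) - 1) + 1):
        pos = k * p / m                    # position on the p-times grid
        lo = min(int(pos), len(grid) - 2)  # index of the left neighbor
        frac = pos - lo
        out.append((1 - frac) * grid[lo] + frac * grid[lo + 1])
    return out

row = [0, 1, 2, 3]
doubled = upsample2(row)
tripled = enlarge(row, 3)
```

For a factor of 3, the row is first doubled with the filter and the output samples at one-third spacing are then taken by linear interpolation on the doubled grid. A two-dimensional version would apply the same steps along rows and then along columns.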

Fig. 11. Similarity, in PSNR, between the original segmented video object in the CIF Mother-and-Daughter sequence and that enlarged X times and shrunk to the original size, where X is between 2 and 5. (a) For the interior of the object (inside a three-pixel-wide band at the object boundary). (b) For the three-pixel-wide band at the object boundary. (Both panels plot PSNR in dB against frame number, with one curve each for X equal to two, three, four, and five.)

Fig. 12. Synthesized scenes using segmented (and shrunk by two times) video objects and other video contents.

IV. CONCLUSION

We proposed an automatic video segmentation (or video object extraction) algorithm for content-based video applications. The algorithm employed edge analysis for accurate determination of object boundaries and bidirectional motion estimation for robust tracking of object motion and shape change. The algorithm was designed with computational complexity in mind, but further reduction in computational complexity is still highly desirable, especially in motion estimation. We considered using the segmented video objects in scene composition. For this we discussed a way to enlarge or shrink arbitrarily shaped video objects. The method worked in a spatial-scalable manner. The synthesized scenes appeared rather natural.

REFERENCES

[1] D. Wang, “Unsupervised video segmentation based on watersheds and temporal tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 539-546, Sep. 1998.
[2] S.-Y. Chien, Y.-W. Huang, and L.-G. Chen, “Predictive watershed: a fast watershed algorithm for video segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 5, pp. 453-461, May 2003.
[3] S. Sun, D. R. Haynor, and Y. Kim, “Semiautomatic video object segmentation using VSnakes,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 75-82, Jan. 2003.
[4] A.-R. Mansouri and J. Konrad, “Multiple motion segmentation with level sets,” IEEE Trans. Image Processing, vol. 12, no. 2, pp. 201-220, Feb. 2003.
[5] T. Meier and K. N. Ngan, “Video segmentation for content-based coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp. 1190-1203, Dec. 1999.
[6] C. Kim and J. N. Hwang, “Fast and automatic video object segmentation and tracking for content-based applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 2, pp. 122-129, Feb. 2002.
[7] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Machine Intell., vol. 8, no. 6, pp. 679-698, Nov. 1986.
[8] M. Kim and J. Kim, “Moving video object segmentation using statistical hypothesis testing,” Electron. Lett., vol. 36, no. 2, pp. 128-129, Jan. 2000.
[9] M. Kim, J. G. Choi, D. Kim, H. Lee, M. H. Lee, C. Ahn, and Y. S. Ho, “A VOP generation tool: automatic segmentation of moving objects in image sequences based on spatio-temporal information,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp. 1216-1226, Dec. 1999.
[10] M. Bierling, “Displacement estimation by hierarchical blocking,” SPIE vol. 1001, Visual Commun. Image Processing, pp. 387-403, 1986.
[11] T. Meier and K. N. Ngan, “Automatic segmentation of moving objects for video object plane generation,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 525-538, Sep. 1998.
[12] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, vol. 1, pp. 269-271, 1959.
[13] M. Wollborn and R. Mech, “Refined procedure for objective evaluation of VOP generation algorithms,” Doc. ISO/IEC JTC1/SC29/WG11 MPEG98/3448, Mar. 1998.

Paper 2

REAL-TIME VIDEO SEGMENTATION BASED ON BACKGROUND MODELING APPLICABLE TO VIDEOCONFERENCING

Yueh-Hsien Lin and David W. Lin
Dept. of Electronics Engineering and Center for Telecommunications Research
National Chiao Tung University, Hsinchu, Taiwan 30010, R.O.C.

This work was supported by the National Science Council of R.O.C. under grant no. NSC 92-2219-E-009-009.

ABSTRACT

We design a video segmentation algorithm and conduct a real-time implementation on a personal computer (PC). The algorithm is based on background subtraction. To facilitate the setting of some thresholds used in the algorithm, we develop a novel method to estimate the camera noise. The algorithm builds a stationary background buffer by considering short-term background and temporary foreground masks. When camera motion occurs, we recover the stationary background from a panorama background buffer through global motion estimation. Simulation results show that the proposed techniques perform reasonably well. The real-time PC-software implementation employs a graphical user interface. With a 2.4-GHz Pentium 4 (P-4) CPU and 512-MB RAM, the current un-optimized implementation yields a speed of about 5 frames per second (fps) for CIF-size video when the camera is still. In the presence of camera motion, the speed is about 1.7 fps.

1. INTRODUCTION

We consider the design and implementation of a video segmentation algorithm on a personal computer (PC). The intended application is to support PC-based multipoint videoconferencing.

Our segmentation algorithm is based on the “background subtraction” approach. One way to obtain the background image is to find and remove all moving objects, but in many situations this may not be easy. In our method, the background is obtained by gathering the stationary regions. Because flat inner regions in moving objects may be mistaken for background, we use a temporary object mask to alleviate this problem. In the event of camera motion, we need to rebuild the background. In order to reduce the rebuilding delay, we use the idea of mosaic (“sprite”) to salvage most of the existing background.

In deciding whether a pixel should be considered background, some thresholds are needed, and these should be set taking camera noise into account. We thus develop a novel two-stage method for camera noise estimation.

In what follows, Section 2 describes the proposed segmentation method. Section 3 describes the PC-based implementation. Section 4 presents some simulation results. Finally, Section 5 contains the conclusion.

2. PROPOSED SEGMENTATION METHOD

Figure 1 shows a block diagram of the proposed segmentation method. We explain its functioning in the following subsections.

2.1. Two-Stage Noise Estimation

To facilitate the choice of the many noise-dependent thresholds needed in the segmentation algorithm, we develop a two-stage method for accurate camera noise estimation. Assume that the camera noise obeys a zero-mean Gaussian distribution. Since the camera noise is uncorrelated between frames, the interframe difference of a stationary pixel obeys a zero-mean Gaussian distribution whose variance is twice the variance of the camera noise. This is similar to [1] and other works.

To estimate this variance, we should use the pixels belonging to the stationary background and exclude those belonging to moving objects. In our experience (and as is intuitively reasonable), the pixels with larger interframe differences usually form a cluster when they belong to moving objects, whereas the larger frame differences caused by camera noise are usually randomly distributed. Therefore, similar to [2], we check for the existence of directional structure in the interframe difference at each pixel to detect pixels belonging to moving objects. Specifically, for each pixel, we calculate the four directional sums in the frame difference map using the masks shown in Fig. 2.
If one of them is larger than a certain threshold, we assume that the pixel belongs to a moving object.

Now we face a problem: how can we set the threshold without knowing the amount of camera noise in the first place? To solve this problem, we calculate the interframe pixel variance over the entire frame and use it to set an initial threshold to test the directional sums. After we remove the pixels with large directional interframe differences, the remaining pixels are used to compute a second interframe variance. We then use this second variance to set the threshold, classify the pixels again, and obtain the final estimate of the camera noise variance.

To verify the validity of our method, we first estimate the true variance from manually chosen pixels in video frames that belong to the stationary background. For example, the white areas in Fig. 3 mark the pixels used to estimate the true variance for the two images. Figures 4 and 5 (curves labeled “Stage 1”) show that using the initial threshold we can remove most pixels belonging to moving objects, but not completely satisfactorily. The result of stage 2 is closer to the exact value.

Fig. 1. Block diagram of the proposed segmentation method.

Fig. 2. Four masks for directional sums.

Fig. 3. Image areas used to estimate the true camera noise variance for (a) Mother-and-Daughter and (b) Claire sequences.

Fig. 4. Noise estimation for Mother-and-Daughter sequence.

Fig. 5. Noise estimation for Claire sequence.

2.2. Temporary Foreground Mask

We now generate a temporary foreground mask to be used in obtaining the stationary background buffer, detecting scene change, and estimating global motion.

2.2.1. Get Initial Object Mask

At first, we use a window around each pixel to calculate the mean-square frame difference at that pixel. If the result is larger than a threshold, the pixel is classified as belonging to a moving object. An example of the thresholded frame difference map is shown in Fig. 6(a). Next, we use the fill-in technique proposed in [3] to get a rough mask. For this, the pixels between the first and last white pixels (indicating pixels in moving objects) in each row are made white. Then this is done for each column and once again for each row. The step-by-step results are shown in Fig. 6.

Fig. 6. (a) Thresholded frame difference map. (b) Fill-in for each row. (c) Fill-in for each column. (d) Second fill-in for each row.

2.2.2. Refine Initial Object Mask

A rough mask obtained above is enough in some cases but not in others. In Fig. 7, for instance, there are two persons sitting side by side. The background between the two persons is included in the mask by the fill-in process. Hence we use edge information to refine the initial mask, where the edge information is obtained using the Canny operator [4]. The Canny operator performs a gradient operation on the image that has been convolved with a Gaussian filter. Then nonmaximum suppression is applied to thin the edges. Lastly, thresholding with hysteresis is used to find and link edges. The edge map after applying the Canny operator is shown in Fig. 7(b). The code for it is obtained from [5].

We refine the initial object mask by shrinking it to fit the edge map. Figure 7(c) shows the shrunk mask for Fig. 7(a). The example shows that the edge map may include many background edges and that these edges may impact the result adversely. To reduce their influence, we use a buffer to store the background edges. When an edge always appears at a certain position, we assume that it is a background edge. The edge map after removing background edges is shown in Fig. 7(d) and the resulting object mask is shown in Fig. 7(e). Comparing Figs. 7(c) and (e), we see that the overgrowth due to background edges can be effectively removed.

Fig. 7. (a) Initial object mask. (b) Edge map. (c) Refined mask. (d) Edge map after removing background edges. (e) Final object mask.

2.3. Short-Term Background Estimation

We consider six consecutive frames and calculate the frame differences at each pixel. For every pixel, we calculate the variance of the frame differences in a window. If the variance is smaller than a given threshold, we consider the changes in the six frames small and regard the pixel in the sixth frame as background. The result is shown in Fig. 8 for the earlier example.

Fig. 8. Result of short-term background estimation.

Fig. 9. Weighting mask for the Mother-and-Daughter sequence.

2.4. Construct Stationary Background Buffer

Most wrong decisions in the short term are due to flat inner object regions, because they do not show significant variations across frames. Such effects can be seen in Fig. 8.
To reduce this problem, we use the temporary foreground mask to weight every pixel before putting the short-term background into the final background buffer. A weighting mask for the earlier example is shown in Fig. 9, where black pixels indicate reliable background pixels and are given a higher weight, while white pixels indicate moving objects and are given zero weight. A pixel marked gray is one that is in the short-term background and also in the temporary foreground mask. It may suffer from the flat inner region problem, so we give it a lower weight. We accumulate the weights at every pixel, and the short-term background is put into the stationary background buffer when the accumulated weight meets a threshold.

2.5. Deal with Camera Motion

In background subtraction, the background should ideally be stationary. If the camera moves, the background buffer should be rebuilt. Typically, there may be a large overlap between the old and the new backgrounds. Thus we employ the image mosaic technique to make use of the overlap and speed up background reconstruction.
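The weight accumulation of Section 2.4 can be sketched as follows. This is an illustrative sketch only: the text states that reliable pixels get a higher weight, doubtful (gray) pixels a lower weight, and moving pixels zero weight, but it gives no numbers, so the weight values, the acceptance threshold, the function names, and the choice to fill each buffer pixel only once are all assumptions.

```python
# Assumed values; the paper specifies only "higher", "lower", and "zero".
W_RELIABLE = 2        # in short-term background, outside the foreground mask (black)
W_DOUBTFUL = 1        # in short-term background but inside the foreground mask (gray)
W_MOVING = 0          # not in the short-term background (white)
ACCEPT_THRESHOLD = 6  # accumulated weight needed before a pixel is accepted


def pixel_weight(in_short_term_bg, in_foreground_mask):
    """Weight of one pixel according to the three classes of the weighting mask."""
    if not in_short_term_bg:
        return W_MOVING
    return W_DOUBTFUL if in_foreground_mask else W_RELIABLE


def update_background(frame, short_term_bg, fg_mask, acc, background):
    """Process one frame; `acc` and `background` are updated in place.

    All arguments are 2-D lists of equal size. `background` holds None
    for pixels that have not yet been accepted into the buffer.
    """
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            acc[y][x] += pixel_weight(short_term_bg[y][x], fg_mask[y][x])
            # Assumption: a pixel is written once, when its weight first reaches the threshold.
            if background[y][x] is None and acc[y][x] >= ACCEPT_THRESHOLD:
                background[y][x] = value
```

With these assumed numbers, a reliable pixel enters the buffer after three frames, a doubtful pixel needs six, and a pixel that is never in the short-term background is never accepted. This reproduces the intended effect that flat inner object regions need much more evidence before being treated as background.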

2.5.1. Scene Change Detection

We use scene change detection to detect camera motion. When the difference between background frames at different times is large, we assume that camera motion has occurred. The background here is obtained by excluding the temporary foreground mask. Since a flat region may yield no frame difference under small camera motion, we only consider the regions near edges.

2.5.2. Global Motion Estimation

Figure 10 illustrates the method we use to estimate the global motion due to camera motion, which is based on the hierarchical architecture of [6] and [7]. The advantage of a hierarchical architecture is that it can handle large displacements and reduce computational complexity. We minimize the sum-squared difference $E$ between the current frame and the displaced reference frame. Considering both performance and simplicity, we adopt the affine motion model, whose six coefficients are the motion parameters, collected in the vector $\mathbf{a}$.

The gradient descent method [6], [8] needs a set of initial values for $\mathbf{a}$. We use stepped search to obtain the initial values of the two translational parameters. The search covers the same range in both coordinates at the reduced resolution, which corresponds to a proportionally wider range at full size. The other parameters are set to fixed initial values.

The gradient descent iterations are carried out according to

$$\mathbf{a}^{(t+1)} = \mathbf{a}^{(t)} + \mathbf{H}^{-1}\mathbf{b},$$

where $\mathbf{a}^{(t)}$ denotes $\mathbf{a}$ at iteration $t$, $\mathbf{H}$ is an $n \times n$ matrix equal to one-half times the Hessian of $E$, $\mathbf{b}$ is an $n$-vector equal to minus one-half times the gradient of $E$, and $n$ is the number of motion parameters:

$$H_{kl} = \frac{1}{2}\,\frac{\partial^2 E}{\partial a_k\,\partial a_l}, \qquad b_k = -\frac{1}{2}\,\frac{\partial E}{\partial a_k}.$$

Besides, in the first iteration of each level, the histogram of the pixel error magnitudes is computed to find a threshold $T$ such that the number of pixels whose error magnitude is bigger than $T$ is about 15% of the pixels considered. In the following iterations, the pixels whose error magnitudes are larger than $T$ are excluded from the gradient descent operation. By observing the speed of convergence for the Stefan sequence and the related results in [6], we set the maximum number of iterations at each level to 34. The transformed pixel coordinates are usually non-integer and therefore bilinear interpolation is used. The projection of motion parameters from one level to the next is effected by multiplying the translational parameters by two and keeping the others the same.

Fig. 10. Global motion estimation.

2.5.3. Panorama Background and Background Recovery

After we have obtained the camera motion, the background can be stored in the panorama background buffer according to the motion parameters. When camera motion occurs, the stationary background buffer can be rebuilt from the panorama background quickly. Bilinear interpolation is adopted to deal with non-integer motion.

2.6. Background Subtraction

The final object mask is obtained by differencing the current frame and the stationary background buffer. In general, the background in the current frame may be subject to lighting change and contain shadows, and the stationary background may contain some wrongly identified background pixels. Therefore, the differencing result may still contain some error regions. We therefore remove the small isolated pixel groups outside and inside the resulting object mask. Figure 11 shows an example.

Fig. 11. (a) 255th frame. (b) Stationary background buffer. (c) Mask after subtraction and thresholding. (d) Final object mask.

3. PC-BASED IMPLEMENTATION

The PC-based implementation employs a capturing camera and a PC, where the PC is used for system control, video segmentation, and result display. We develop a GUI (graphical user interface) employing the Windows SDK (software development kit) from Microsoft. We use the VfW (Video for Windows) 1.0 library, originally released by Microsoft in November 1992 for the Windows 3.1 operating system, to interface with the camera. The captured frame is embedded in an AVI (Audio Video Interleave) file with its beginning marked by “##db”. After the video segmentation, we display the result in a window created through the Windows SDK.

We implement two major control units: capture control and threshold control. The former controls the digital camera, including start, stop, image size, and luminance. The latter is used to adjust the thresholds in the temporary foreground mask, short-term background, and background subtraction. The application program interface is illustrated in Fig. 12. With a 2.4-GHz Pentium 4 (P-4) CPU and 512-MB RAM, the current un-optimized implementation yields a speed of about 5 frames per second (fps) for CIF-size (352 × 288) video when the camera is still. In the presence of camera motion, the speed is about 1.7 fps.

Fig. 12. The application program interface.

4. SIMULATION RESULTS

4.1. Segmented Image Masks

Figure 13 shows some results for the Mother-and-Daughter, Claire, and Akiyo sequences. The number of frames required to obtain enough background for them is observed to be about 260, 150, and 10, respectively. We can see that the object masks are more accurate as we obtain more background.

Fig. 13. Some segmentation results of the Mother-and-Daughter ((a)–(d)), the Claire ((e)–(h)), and the Akiyo ((i)–(l)) sequences. Panels alternately show stationary background buffer and segmented object mask. The frame numbers are as follows. (a), (b) 140. (c), (d) 260. (e), (f) 60. (g), (h) 150. (i), (j) 10. (k), (l) 165.
Two major factors influence the number of frames required to gather enough background. First, if the background is covered by moving objects for a long time, we need to wait until it becomes uncovered before we can gather it. Second, the number also depends on the amount of camera noise. For a sequence with low camera noise, we can set tighter thresholds for the short-term background and the temporary foreground, which can shorten the time needed to fill the stationary background buffer. The amount of camera noise in the three test sequences, in descending order, is Mother-and-Daughter, Claire, and Akiyo. The three sequences fall in the same order when ranked by the number of frames required to obtain enough background.

4.2. Global Motion Estimation and Mosaic

We now examine the effect of global motion estimation and background mosaic. First, Figure 14 shows two panorama background buffers obtained using the Stefan test sequence, for which a reference segmentation exists. Next, we show the benefit of background salvaging using a sequence captured in our lab. Figure 15 shows the segmented image masks for several frames after a camera motion when we simply reset the background upon camera motion. If we use background mosaic to salvage the existing background, the resulting image masks are as shown in Figure 16. It is obvious that the result with background mosaic is more accurate during the rebuilding of the new background.
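For reference, the estimator of Section 2.5.2 that is evaluated above can be written out compactly. The following is a sketch in assumed notation, not a transcription of the authors' equations: the symbols $I$, $I'$, $e$, $E$, and $R$ and the indexing of the affine parameters $a_1, \dots, a_6$ are illustrative choices, and the first-derivative (Gauss-Newton) approximation of the Hessian is the common choice for this kind of estimator rather than a form confirmed by the text.

```latex
% Assumed notation: I = current frame, I' = reference frame,
% R = set of pixels considered (background pixels that survive outlier rejection).
\begin{align}
  x' &= a_1 x + a_2 y + a_3, \qquad y' = a_4 x + a_5 y + a_6, \\
  e(x,y) &= I'(x',y') - I(x,y), \qquad
  E(\mathbf{a}) = \sum_{(x,y)\in R} e(x,y)^2, \\
  H_{kl} &= \frac{1}{2}\,\frac{\partial^2 E}{\partial a_k\,\partial a_l}
          \approx \sum_{(x,y)\in R} \frac{\partial e}{\partial a_k}\,\frac{\partial e}{\partial a_l}, \\
  b_k &= -\frac{1}{2}\,\frac{\partial E}{\partial a_k}
       = -\sum_{(x,y)\in R} e\,\frac{\partial e}{\partial a_k}, \\
  \mathbf{a}^{(t+1)} &= \mathbf{a}^{(t)} + \mathbf{H}^{-1}\mathbf{b}.
\end{align}
```

The last line is the Newton step $\mathbf{a} - (2\mathbf{H})^{-1}(-2\mathbf{b})$ written in terms of $\mathbf{H}$ and $\mathbf{b}$, which is why the factors of one-half cancel. With the indexing assumed here, $a_3$ and $a_6$ are the translational parameters, so projecting from one pyramid level to the next finer one doubles $a_3$ and $a_6$ and leaves the other four unchanged.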

Fig. 14. Mosaic results for the Stefan sequence. (a) From 1st to 13th frames. (b) From 40th to 73rd frames.

Fig. 15. Segmented image masks without background mosaic. Frame numbers are: (a) 145 (where camera motion is detected), (b) 146, (c) 147, (d) 148, (e) 149, (f) 150, and (g) 151.

Fig. 16. Segmented image masks with background mosaic. Frame numbers are as in the previous figure.

5. CONCLUSION

We developed a video segmentation algorithm based on the background subtraction approach and implemented a real-time video segmentation system on a PC based on the algorithm. The intended application was PC-based multipoint videoconferencing. The algorithm used a temporary foreground mask to reduce the influence of the inner flat regions of moving objects on background construction. It also used a panorama background buffer (mosaic or sprite) to improve the segmentation accuracy immediately after camera motion. A two-stage method for camera noise estimation was developed to facilitate the setting of various algorithm thresholds. Simulation results show that the algorithm performs relatively well, although further improvements are still desirable. With a 2.4-GHz Pentium 4 (P-4) CPU and 512-MB RAM, the current un-optimized implementation yields a speed of about 5 fps for CIF-size video when the camera is still. In the presence of camera motion, the speed is about 1.7 fps.

6. REFERENCES

[1] T. Aach, A. Kaup, and R. Mester, “Statistical model-based change detection in moving video,” Signal Processing, vol. 31, pp. 165–180, Mar. 1993.
[2] Y. H. Jan and D. W. Lin, “Video segmentation with extraction of overlaid objects via multi-tier spatio-temporal analysis,” Int. J. Elec. Eng., vol. 11, no. 3, Aug. 2004.
[3] T. Meier and K. N. Ngan, “Video segmentation for content-based coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp. 1190–1203, Dec. 1999.
[4] J. F. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Machine Intell., vol. 8, no. 6, pp. 679–698, Nov. 1986.
[5] “Canny operator code,” http://ouray.cudenver.edu/ na0alber/DataCompressionPaper.htm.
[6] F. Dufaux and J. Konrad, “Efficient, robust, and fast global motion estimation for video coding,” IEEE Trans. Image Processing, vol. 9, pp. 497–501, Mar. 2000.
[7] Y. Lu, W. Gao, and F. Wu, “Fast and robust sprite generation for MPEG-4 video coding,” in Proc. IEEE Pacific-Rim Conf. Multimedia, Oct. 2001, pp. 118–125.
[8] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge, England: Cambridge University Press, 1992.

Attachment: Report on Attendance at International Conferences (with Papers)

Reported by: David W. Lin (林大衛), Professor, Department of Electronics Engineering, National Chiao Tung University
NSC project number: NSC 92-2219-E-009-009

I. Foreword

With support from this project and from the Department of Electronics Engineering of National Chiao Tung University, I attended the two international conferences listed below this year. The expenses for the first conference were covered by this project. For the second, this project covered a small part and the department covered most. The body of this report is accordingly divided into two parts, which describe my attendance and observations at each conference. Three papers were presented at the two conferences. Two of them are results of an NSC project, though of another project rather than this one. Because the overseas travel budget was approved under this project, this project bore the travel expenses.

1. Conference: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)
Date and place: May 17-21, 2004, Montreal, Quebec, Canada
Papers presented:
(1) Variable-Step-Size Multimodulus Blind Decision-Feedback Equalization for High-Order QAM Based on Boundary MSE Estimation (full text in Appendix A)
(2) Chip-Interleaved WCDMA with Parallel-Interference-Cancellation Receiver in Multipath Rayleigh Fading Channels (result of NSC project NSC 92-2219-E-009-018; full text in Appendix B)

2. Conference: 2004 IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC 2004)
Date and place: July 11-14, 2004, Lisbon, Portugal
Paper presented: A chip-interleaved synchronous DS-CDMA technique enabling MAI-free and reduced-ISI transmission with low complexity receiving (result of NSC project NSC 92-2219-E-009-018; full text in Appendix C)

II. Part One: ICASSP 2004

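A. Conference Attendance

The International Conference on Acoustics, Speech, and Signal Processing (ICASSP), held annually by the IEEE Signal Processing Society since 1976, reached its 29th edition this year. It is one of the world's foremost large international conferences on signal processing and has long been highly regarded by the signal processing community. This year 2434 papers were submitted and 1261 were accepted. Because ICASSP 2003, originally scheduled for Hong Kong last year, was cancelled at short notice on account of SARS, this year's conference also included some papers originally scheduled for presentation last year. Two papers that I co-authored with students were accepted for presentation: “Variable-Step-Size Multimodulus Blind Decision-Feedback Equalization for High-Order QAM Based on Boundary MSE Estimation” (with 洪崑健, a student, and 柯俊男 of the Computer and Communications Research Laboratories, ITRI) and “Chip-Interleaved WCDMA with Parallel-Interference-Cancellation Receiver in Multipath Rayleigh Fading Channels” (with 林郁男, a student).

The conference was held on May 17-21, 2004, in Montreal, Canada. The first day (May 17) consisted mainly of tutorials, which required a separate registration fee; I did not attend them. The technical sessions took place on the remaining four days (May 18-21). In addition to four plenary talks (including one keynote speech), 13 parallel sessions ran concurrently, comprising 5 oral sessions and 8 poster sessions. Since poster sessions usually contained more papers than oral sessions, most papers were presented as posters. Among the technical sessions, the topics most directly related to my research were video processing and signal processing for transmission.

I left Chiang Kai-shek International Airport on May 17, transferred in Vancouver, and arrived in Montreal shortly after 7 a.m. on the 18th, a one-way trip of about 20 hours. On the return trip I left the Montreal airport on the evening of May 21, transferred in Vancouver, and arrived in Taiwan at about 6 a.m. on the 23rd, a one-way trip of about 22 hours. Besides attending the plenary talks and technical sessions, because I had served as chair of the IEEE Taipei Signal Processing Chapter in previous years, I also represented the chapter at the chapter chairs' luncheon, where I learned about chapter matters and had some exchanges with representatives of other chapters. The travel expenses for this conference were covered entirely by this project.

B. Observations

1. Plenary Talks

Prof. Nikil Jayant (Georgia Tech) spoke on “Pervasive Broadband: Opportunities for Signal Processing”. Communication networks are moving from the telephone era of the past into an era of multimedia broadband communication (which is what the word “broadband” in the title refers to). The talk mainly addressed the signal processing issues and applications that arise in this development. Prof. Jayant is world renowned in the signal processing field. Early in his career (in the 1970s) he worked at Bell Laboratories, where he made notable contributions to speech coding. He moved to Georgia Tech in 1998.

Dr. Gene Frantz (Texas Instruments) spoke on “Human Speech: The Alpha and Omega of Signal Processing”. More than twenty years ago, TI introduced a toy called Speak-and-Spell, which combined education with entertainment, helped children learn English spelling, and was very popular. The product is an early and successful example of a consumer electronics product that makes heavy use of speech signal processing, and Dr. Frantz played an important role in it. The talk reviewed the development of speech signal processing technology over the past 25 years, including speech coding, text-to-speech conversion, and speech recognition. It also touched on the development of the related digital signal processing (DSP) hardware and integrated circuits.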