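Receiver System Integration - MPEG-Based Multimedia Communication and Streaming Integration Platform and Its Applications, Subproject 6: Research on Multipoint Videoconferencing Technology (II)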

1. Overall System Structure

Figure 4-5 shows how the integrated receiver program works. The GUI block creates a window in which the user enters the ports and the display positions of the individual videos. Because several video and audio streams may have to be handled at once, we run the decoders in multiple threads and let the operating system schedule them.

Fig. 4-5: Flow diagram of the integrated receiver.

After the user has specified the ports and the video positions, the system creates the decoder threads. It also determines which decoders are currently active, identifies the highest-indexed video decoder among them, and passes the control responsibility (including video composition and display) to it. The video and audio decoders then begin decoding the data received through the RTP network interface. The composed video is displayed in a window and the composed audio is played.
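The start-up step described above can be sketched as follows. This is a minimal Python illustration under our own assumptions, not the receiver's actual Windows code: the names `pick_controller`, `decoder_worker`, and `start_decoders` are hypothetical, and the worker only records its role instead of decoding anything.

```python
import threading

def pick_controller(active_indices):
    """Return the highest index among the active video decoders, or None."""
    return max(active_indices) if active_indices else None

def decoder_worker(index, controller_index, roles, lock):
    """Stand-in for one video decoder thread (no real decoding is done)."""
    if index == controller_index:
        role = "decode+compose+display"  # controller also composes and displays
    else:
        role = "decode"
    with lock:
        roles[index] = role

def start_decoders(active_indices):
    """Start one thread per active decoder and wait until all have finished."""
    controller = pick_controller(active_indices)
    roles, lock = {}, threading.Lock()
    threads = [
        threading.Thread(target=decoder_worker, args=(i, controller, roles, lock))
        for i in active_indices
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return controller, roles

controller, roles = start_decoders([0, 1, 2])
```

With three active decoders, the thread with index 2 takes the composition and display role while the other two only decode.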

2. The RTP Network Interface

As in the transmitter, we employ the JRTPLIB 3.1.0 software [34] for the network interface.

Two important parameters must be set for each session: the timestamp and the portbase. The timestamp parameter is set to 1 section per second for a video stream and to 1 section per 4 seconds for an audio stream. (These values are not well suited to real-time applications; they were chosen empirically because they let the program run smoothly. Further work is needed to identify the underlying problem and a potential solution.) As for the portbase, each transmitter sends two streams (video and audio): the video stream uses a user-specified port number and the audio stream uses that number plus 100.
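The port rule can be illustrated with a short sketch. It is written in Python for brevity and does not call JRTPLIB; the function names and the collision check are our own assumptions. The check is included because, with a fixed offset of 100, two user-specified video ports that differ by exactly 100 would make one transmitter's audio port coincide with another's video port.

```python
AUDIO_PORT_OFFSET = 100  # from the text: audio port = video port + 100

def session_ports(video_port):
    """Return (video_port, audio_port) for one transmitter."""
    audio_port = video_port + AUDIO_PORT_OFFSET
    if video_port < 1 or audio_port > 65535:
        raise ValueError(f"port {video_port} leaves no valid audio port")
    return video_port, audio_port

def allocate_ports(video_ports):
    """Build the port table for all transmitters, rejecting reused ports."""
    used = set()
    table = []
    for video_port in video_ports:
        pair = session_ports(video_port)
        for port in pair:
            if port in used:
                raise ValueError(f"port {port} is assigned twice")
            used.add(port)
        table.append(pair)
    return table

table = allocate_ports([5000, 5002])
```

Here `allocate_ports([5000, 5100])` would be rejected, since port 5100 would serve both as the first transmitter's audio port and as the second transmitter's video port.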

After these parameters are set, our system receives the RTP packets in a main loop. For reasons not yet clear, the software loses the first two packets and receives the third and later packets successfully.

3. Video Decoding and Display

Our video decoder is taken from the public Microsoft MPEG-4 Video Reference Software. To integrate it into the overall system, we turn the original decoder program into a function called MPEG4VDecoder with two parameters: the handle of the display window and the thread index. The handle gives the decoder the information it needs to control the window and display the video stream.

To display the video, we first convert the decoder output from the original 4:2:0 format to the 4:4:4 format. We then calculate the RGB values of each pixel from its luminance and chrominance values, and use the SetPixelV function provided by the Windows SDK library to display the RGB values pixel by pixel. Experience shows that SetPixelV is very slow and can significantly reduce the overall speed of the receiver system. To reduce its use, we do not call it for every pixel in the display window; we call it only to update the pixels in the object areas of two successive frames.
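The format conversion described above can be sketched in a few lines. This is an illustration only: the report does not state how the chrominance is upsampled or which conversion matrix is used, so the sketch assumes sample replication for the 4:2:0 to 4:4:4 step and the common full-range BT.601 (JPEG-style) coefficients for the RGB step. All function names are hypothetical.

```python
def upsample_420_to_444(plane, width, height):
    """Replicate each chroma sample over a 2x2 block (assumed method)."""
    return [[plane[y // 2][x // 2] for x in range(width)] for y in range(height)]

def clamp8(value):
    """Round to the nearest integer and clip to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))

def yuv_to_rgb(y, cb, cr):
    """Convert one pixel using full-range BT.601 coefficients (assumed)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)

def frame_420_to_rgb(y_plane, cb_plane, cr_plane):
    """Turn 4:2:0 planes (lists of rows) into rows of (R, G, B) tuples."""
    height, width = len(y_plane), len(y_plane[0])
    cb444 = upsample_420_to_444(cb_plane, width, height)
    cr444 = upsample_420_to_444(cr_plane, width, height)
    return [
        [yuv_to_rgb(y_plane[r][c], cb444[r][c], cr444[r][c]) for c in range(width)]
        for r in range(height)
    ]

# A 2x2 luminance block shares a single chroma sample in 4:2:0.
rgb = frame_420_to_rgb([[16, 235], [128, 128]], [[128]], [[128]])
```

With neutral chrominance (128, 128), every output pixel is a gray level equal to its luminance, which gives a quick sanity check of the conversion.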

4. Audio Decoding and Composition

For audio decoding, we use the Freeware Advanced Audio Decoder (FAAD2) [33]. We only make use of the MAIN profile. After the audio stream is decoded, the result is saved as a temporary audio file in the WAV format. The Media Control Interface (MCI), a high-level open interface, is used to play the audio. Since each audio section lasts four seconds, the decoder waits for that long while its output is played by the MCI.

For audio composition, two intuitive methods are (1) to sum all audio streams and (2) to play only one stream. The first method suffers from an overflow problem, which can be solved by proper scaling; this is left to potential future work. For simplicity, the final system plays only the audio from the first transmitter site.
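The overflow problem of the first method, and one possible scaling, can be illustrated as follows. The report leaves the choice of scaling open; this sketch assumes 16-bit PCM samples and simply divides the sum by the number of streams. The function names are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def mix_sum(streams):
    """Method (1) without scaling: plain sample-wise sum."""
    return [sum(samples) for samples in zip(*streams)]

def mix_scaled(streams):
    """Sum scaled by 1/N, so the result always fits the 16-bit range."""
    n = len(streams)
    return [int(sum(samples) / n) for samples in zip(*streams)]

def overflows(samples):
    """True if any sample falls outside the signed 16-bit range."""
    return any(s < INT16_MIN or s > INT16_MAX for s in samples)

site_a = [30000, -20000, 1000]
site_b = [30000, -20000, -3000]
naive = mix_sum([site_a, site_b])      # two loud streams exceed 16 bits
scaled = mix_scaled([site_a, site_b])  # average stays within range
```

Dividing by the number of streams is the simplest safe choice, but it lowers the loudness of each speaker as more sites join; other scalings are possible.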

D. Some Experimental Results

For convenience, the video experiments use the CIF test sequence Bream with its associated binary shape information. Multiple instances of the sequence are treated as separate video transmissions. Figure 4-6 shows some typical composed scenes with two videos. We noted previously that the SetPixelV (and SetPixel) function is slow. Timing analysis shows that image display alone can take over 70% of the overall processing time. Hence, unless we can find more efficient ways to set the display pixels, we should minimize the use of the time-consuming SetPixelV and SetPixel functions.
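One way to limit the number of pixel-setting calls is to touch only the pixels that lie in the object area of the previous frame or of the current frame. The sketch below is one reading of that idea, not the receiver's actual code: `set_pixel` is a hypothetical callback standing in for SetPixelV, the masks are binary shape maps given as lists of rows, and pixels vacated by the object are assumed to be repainted with a single background color.

```python
def update_display(frame_rgb, background_rgb, prev_mask, curr_mask, set_pixel):
    """Redraw only pixels inside the previous or current object area.

    Returns the number of set_pixel calls that were made.
    """
    calls = 0
    for y, (prev_row, curr_row) in enumerate(zip(prev_mask, curr_mask)):
        for x, (was_object, is_object) in enumerate(zip(prev_row, curr_row)):
            if is_object:
                set_pixel(x, y, frame_rgb[y][x])  # draw the object pixel
            elif was_object:
                set_pixel(x, y, background_rgb)   # erase the vacated pixel
            else:
                continue                          # untouched background
            calls += 1
    return calls

screen = {}

def record_pixel(x, y, rgb):
    """Test double for the display call: remember what was drawn where."""
    screen[(x, y)] = rgb

prev_mask = [[0, 1, 0, 0], [0, 1, 0, 0]]
curr_mask = [[0, 0, 1, 0], [0, 0, 1, 0]]
frame = [[(200, 50, 50)] * 4 for _ in range(2)]
calls = update_display(frame, (0, 0, 0), prev_mask, curr_mask, record_pixel)
```

In this toy case the object moves one column to the right, so only four of the eight pixels are written.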

Fig. 4-6: Some typical composed scenes with two videos.

Experiments with the fully integrated receiver system show that the processing speed is approximately 11.2 fps with one video, 5.2 fps with two videos, and 3.6 fps with three videos (all Bream). Hence the frame rate is approximately inversely proportional to the number of videos.
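The inverse relationship can be checked with simple arithmetic on the three measured values. The short script below compares each measurement with the prediction of a 1/N model anchored at the one-video speed; the model and the function name are our own illustration.

```python
measured_fps = {1: 11.2, 2: 5.2, 3: 3.6}  # from the experiments above

def inverse_model(n, single_fps=measured_fps[1]):
    """Frame rate predicted if speed were exactly proportional to 1/N."""
    return single_fps / n

for n, fps in measured_fps.items():
    predicted = inverse_model(n)
    deviation = abs(predicted - fps) / predicted
    print(f"{n} video(s): measured {fps:.1f} fps, 1/N model {predicted:.2f} fps, "
          f"deviation {deviation:.1%}, aggregate {n * fps:.1f} frames/s")
```

The measured values stay within 8% of the 1/N prediction, and the aggregate decoding rate (number of videos times frame rate) stays between 10.4 and 11.2 frames per second.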

E. Conclusion

We considered the design and implementation of a novel type of software-based multipoint videoconference receiver on a PC, where some distinguishing features were the use of MPEG-4 object-based coding and the composition of decoded videos into one scene.

Further enhancement of the system in speed and robustness is being considered.

V. References

[1] M. E. Lukacs and D. G. Boyer, “A universal broadband multipoint teleconferencing service for the 21st century,” IEEE Commun. Mag., vol. 33, no. 11, pp. 36-43, Nov. 1995.

[2] D. G. Boyer, M. E. Lukacs, and M. Mills, “The personal presence system experimental research prototype,” in IEEE Int. Conf. Commun. Conf. Rec., pp. 1112-1116, 1996.

[3] O. Schreer, M. Karl, and P. Kauff, “A Trimedia based multi-processor system using PCI technology for immersive videoconference terminals,” in Int. Conf. Digital Signal Processing, pp. 289-293, 2002.

[4] MoMuSys, “MoMuSys final report,” Mar. 2001. Available from http://www.tnt.uni-hannover.de/project/eu/momusys.

[5] O. Schreer, H. Fuchs, W. IJsselsteijn, and H. Yasuda, eds., Special Issue on Immersive Telecommunications, IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 3, Mar. 2004.

[6] Y.-J. Chang, C.-C. Chen, J.-C. Chou, and Y.-C. Chen, “Implementation of a virtual chat room for multimedia communications,” in IEEE Workshop Multimedia Signal Processing, pp. 599-604, 1999.

[7] Y.-J. Chang, C.-C. Chen, J.-C. Chou, and Y.-C. Chen, “Virtual Talk: a model-based virtual phone using layered audio-visual integration,” in IEEE Int. Conf. Multimedia Expo, pp. 415-418, 2000.

[8] C.-W. Lin, W.-H. Wang, M.-T. Sun, and J.-N. Hwang, “Implementation of H.323 multipoint video conference systems with personal presence control,” in IEEE Int. Conf. Consumer Electron. Digest of Tech. Papers, pp. 108-109, 2000.

[9] C.-W. Lin, Y.-J. Chang, Y.-C. Chen, and M.-T. Sun, “Implementation of a realtime object-based virtual meeting system,” in IEEE Int. Conf. Multimedia Expo, pp. 565-568, 2001.

[10] S. Battista, F. Casalino, and C. Lande, “MPEG-4: a multimedia standard for the third millennium,” in two parts, IEEE Multimedia, vol. 6, no. 4, pp. 74-83, Oct.-Dec. 1999, and vol. 7, no. 1, pp. 76-84, Jan.-Mar. 2000.

[11] M. Bourges-Sevenier and E. S. Jang, “An introduction to the MPEG-4 Animation Framework eXtension,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 7, pp. 928-936, July 2004.

[12] Y.-H. Jan and D. W. Lin, “Edge-based morphological processing for efficient and accurate video object extraction,” IEICE Trans. Inf. & Syst., vol. 88-D, no. 2, pp. 335-340, Feb. 2005.

[13] Y.-H. Jan and D. W. Lin, “Automatic video segmentation with novel motion analysis and edge processing for accurate identification of object boundaries,” Int. J. Elec. Eng., vol. 12, no. 3, pp. 297-304, Aug. 2005.

[14] Y.-H. Jan, “Research in video segmentation techniques for object-oriented applications,” Ph.D. dissertation, Dept. Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., May 2005.

[15] C.-Y. Tsai, C.-K. Chien, and D. W. Lin, “A videoconference transmitter supporting object-based video encoding with real-time video segmentation,” to appear in Proc. Workshop Consumer Electronics Signal Processing, Nov. 2005.

[16] C.-Y. Tsai, “Integration of videoconference transmitter with MPEG-4 object-based video encoding,” M.S. thesis, Dept. Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., June 2005.

[17] C.-K. Chien, C.-Y. Tsai, and D. W. Lin, “A multipoint videoconference receiver based on MPEG-4 object video,” to appear in Proc. Int. Symp. Commun., Nov. 2005.

[18] C.-K. Chien, “A multipoint videoconference receiver for MPEG-4 object-based video,” M.S. thesis, Dept. Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., June 2005.

[19] S. Sun, D. R. Haynor, and Y. Kim, “Semiautomatic video object segmentation using VSnakes,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 75-82, Jan. 2003.

[20] A.-R. Mansouri and J. Konrad, “Multiple motion segmentation with level sets,” IEEE Trans. Image Processing, vol. 12, no. 2, pp. 201-220, Feb. 2003.

[21] D. Wang, “Unsupervised video segmentation based on watersheds and temporal tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 539-546, Sep. 1998.

[22] S.-Y. Chien, Y.-W. Huang, and L.-G. Chen, “Predictive watershed: a fast watershed algorithm for video segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 5, pp. 453-461, May 2003.

[23] T. Meier and K. N. Ngan, “Video segmentation for content-based coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp. 1190-1203, Dec. 1999.

[24] C. Kim and J. N. Hwang, “Fast and automatic video object segmentation and tracking for content-based applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 2, pp. 122-129, Feb. 2002.

[25] L. Atzori, D. D. Giusto, and C. Perra, “A novel block-based video segmentation algorithm,” in IEEE Int. Conf. Multimedia Expo, pp. 653-656, 2001.

[26] H. Luo and A. Eleftheriadis, “Rubberband: an improved graph search algorithm for interactive object segmentation,” in Proc. IEEE Int. Conf. Image Processing, vol. 1, pp. 101-104, 2002.

[27] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Machine Intell., vol. 8, no. 6, pp. 679-698, Nov. 1986.

[28] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numer. Math., vol. 1, pp. 269-271, 1959.

[29] Y.-H. Lin, “Real-time video segmentation based on background modeling for videoconferencing,” M.S. thesis, Dept. of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., June 2004.

[30] M.-Y. Liu, “Real-time implementation of MPEG-4 video encoder using SIMD-enhanced Intel processor,” M.S. thesis, Degree Program of Electrical Engineering and Computer Science, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., July 2004.

[31] “VidCap: Full-featured video capture application,” http://msdn.microsoft.com/library/devprods/vs6/visualc/vcsample/vcsmpvidcap.htm.

[32] “Capture without using disk storage,” http://msdn.microsoft.com/library/default.asp?url=/library/en-us/multimed/htm/_win32_capture_without_using_disk_storage.asp.

[33] “AudioCoding.com,” http://www.audiocoding.com/.

[34] “JRTPLIB 3.1.0,” http://research.edm.luc.ac.be/jori/jrtplib/jrtplib.html.

VI. Self-Evaluation of Project Results

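Conformity of the research content with the original plan: The work conforms to the project theme. The main results achieved include further research results on video segmentation techniques, the techniques and a simple prototype for transmitter system integration, and the techniques and a simple prototype for receiver system integration.

Achievement of the expected goals: The contributions of this subproject take the form of improvements to prior techniques, advancement of the technical level, establishment of an experimental system, and training of personnel.

Academic and application value of the results: Several results on video segmentation have been published as journal papers. The research results on the integration of the multipoint videoconference transmitter and receiver systems are also to be presented at domestic conferences, and can serve as a reference for research and development in the related industry. The experience and results gained in all of the above areas can also serve as a reference for our subsequent related research.

Overall assessment: This project has obtained some results of academic and application value and has achieved the goal of personnel training. The outcome is good.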

Data Sheet of R&D Results Available for Dissemination

English: At the transmitter, segment the input video and extract the conferee image. Encode it and the audio using MPEG-4 object-based video coding and audio coding, respectively. Transmit the result by RTP. At the receiver, employ RTP to receive the information from all other conference terminals and perform video and audio decoding. Then compose the videos into one scene for display, and play one chosen audio (or the sum of all audios).

Applicable Industries
