針對 MPEG 自由視角視訊之合成品質導向深度圖優化


A Synthesis-Quality-Oriented Depth Refinement Scheme for MPEG Free Viewpoint Television (FTV)

Student: Chun-Chi Chen

Advisor: Wen-Hsiao Peng

A Thesis

Submitted to the Institute of Multimedia Engineering, College of Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Computer Science

September 2009

Hsinchu, Taiwan, Republic of China


ABSTRACT

This thesis addresses the problem of refining depth information from the received reference and depth images within the MPEG FTV framework. An analytical model is first developed to approximate the per-pixel synthesis distortion (caused by depth-image compression) as a function of depth-error variances, intensity variations, ground-truth depth and virtual camera locations. We then follow the model to detect unreliable depth pixels by inspecting intensity gradients and to refine their values with a candidate-based block disparity search. Additional side information is transmitted to make both operations robust against compression effects. Experimental results show that our scheme offers an average PSNR improvement of 1.2 dB over MPEG FTV and consistently outperforms the state-of-the-art methods. Moreover, it can remove synthesis artifacts to a great extent, producing a result that is very close in appearance to the ground-truth view image.


List of Tables

5.1 Depth Estimation Settings. Columns (a) to (c) represent Smoothing Coefficient, Precision and Search Level, respectively.

5.2 Encoder Settings

List of Figures

2.1 View synthesis based on multi-view video plus depth.

2.2 Categorization of the synthesis errors observed in a nearby reference view [7].

2.3 Detected unreliable pixels [8].

3.1 Disparity-compensated interpolation using an impaired depth representation.

3.2 The ratio of $Z_p$ to $\sigma_n(\mathbf{p})$ over depth QP. Each point represents the average ratio of all test sequences.

3.3 Measuring the depth-error sensitivity under various settings of $Z_p$, $Z_q$ and $m_g^{(2)}(\mathbf{p})$.

3.4 A geometrical interpretation of the effect of $Z_q$ on depth-error sensitivity.

4.1 System Block Diagram.

4.2 A sample result of the proposed depth refinement algorithm: (a)(d) the original depth image, (b)(e) the decoded depth image, and (c)(f) the refined depth image.

5.1 PSNR of synthesized images as a function of the depth and reference QP. The reference view images are coded with QP=22.

5.4 Subjective quality comparison of synthesized images: (a) MPEG FTV (without depth refinement), (b) Tanimoto [7], (c) Sung [8] and (d) the proposed scheme. The depth QP of the Door Flowers sequence is set to 44.

… proposed scheme. The depth QP of the Newspaper sequence is set to 44.

… proposed scheme. The depth QP of the Dog sequence is set to 44.

5.7 Pixels whose depth values are judged unreliable: (a) Tanimoto [7] (category 2), (b) Sung [8] and (c) the proposed scheme. Top-to-down rows are the Door Flowers, Newspaper and Dog sequences, respectively.

CHAPTER 1 Research Overview

1.1 Introduction

Technology evolution in the capture and display of 3D videos will soon extend visual sensation from 2D to 3D while allowing unrestricted spatiotemporal scene navigation. In general, offering a 3D depth impression of a real-world scene requires two separate images captured from properly arranged viewing positions. To enable scene navigation, a multi-view video may have to be acquired through a dense camera set-up. However, due to the complexity involved in acquisition, storage and transmission, a large number of camera inputs is unlikely to be available. An efficient 3D data format is thus needed to allow the generation of intermediate views from a sparse sampling of the observed scene.

For this, the MPEG committee has recently defined a "multi-view video plus depth" data format [1], which specifies a way of multiplexing the coded representations of a multi-view video and its associated per-pixel depth information. With explicit scene geometry, an arbitrary virtual view can be generated at the receiver side by means of the so-called depth-image-based rendering (DIBR) [2][3][4][5], requiring only a small number of view images for scene navigation. Since depth images must be conveyed together with the corresponding view images, both types of scene representations are compressed, based mostly on H.264/AVC, for an efficient use of network bandwidth.

1.2 Problem Statement

Although block-based hybrid coding is equally applicable to depth-image compression, it causes undesirable synthesis artifacts. This is because depth images represent scene geometry information, the characteristics of which are very different from those of intensity data. It was shown in [6] that visually imperceptible depth errors can still have a profound effect on synthesis quality. To tackle this problem, we propose to regard both the received reference and depth images as sources of information about the ground-truth depth of the scene, and provide ways to detect and refine unreliable depth values.

1.3 Contributions

In this thesis, we propose a synthesis-quality-oriented depth refinement scheme. Rather than trying to minimize depth errors, our scheme intends to detect and refine only those depth pixels that are highly sensitive to errors. The main contributions of our work include:

1. An analytical model that formulates the per-pixel synthesis distortion (caused by depth-image compression) as a function of depth-error variances, intensity variations, ground-truth depth and virtual camera locations.

2. A detection scheme that discovers unreliable depth pixels by inspecting intensity gradients, where the detection threshold is determined by evaluating, at the sender side, the detection quality as perceived by the receiver.

3. A refinement scheme that performs depth refinement with a candidate-based block disparity search, where a uniform quantizer operates on the received depth and categorizes the unreliable depth pixels into several disjoint subsets, each of which is assigned a proper disparity search range.
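The categorization step in the third contribution can be pictured with a minimal sketch. It assumes 8-bit-style depth values and a hypothetical quantizer step size `step`; the function name and the idea of keying each subset by its quantization level are illustrative assumptions, and the mapping from each subset to a disparity search range is not shown because this chapter does not specify it.

```python
import numpy as np

def group_unreliable_pixels(depth, unreliable, step):
    """Bucket unreliable pixels by uniformly quantized depth level.

    depth      : 2-D array of received (decoded) depth values
    unreliable : 2-D boolean mask of pixels flagged as unreliable
    step       : hypothetical quantizer step size (an assumption)
    Returns a dict mapping quantization level -> list of (row, col).
    """
    depth = np.asarray(depth)
    rows, cols = np.nonzero(unreliable)
    levels = depth[rows, cols] // step  # uniform quantization
    groups = {}
    for r, c, level in zip(rows, cols, levels):
        groups.setdefault(int(level), []).append((int(r), int(c)))
    return groups

depth = np.array([[10, 70], [130, 200]])
mask = np.array([[True, False], [True, True]])
groups = group_unreliable_pixels(depth, mask, step=64)
```

Each key of `groups` identifies one disjoint subset; a refinement stage could then attach its own search range to every key.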

In order for the detection and refinement processes to be able to adapt to statistical changes due to compression effects, the settings of their control parameters are first determined at the sender side by evaluating the performance as perceived by the receiver over the range of all possible choices, and then sent to the receiver as the side information. Although extra bits are required for signaling, the overhead is negligible and justified by the significant improvement in synthesis quality. Experimental results show that our scheme offers an average PSNR improvement of 1.2 dB over MPEG FTV and consistently outperforms the state-of-the-art methods. Moreover, it can remove synthesis artifacts to a great extent, producing a result that is very close in appearance to the ground-truth view image.

1.4 Organization

This thesis is organized as follows: Chapter 2 contains a brief review of DIBR. Chapter 3 introduces an analytical model for characterizing synthesis distortions caused by depth-image compression. Chapter 4 describes our proposed synthesis-quality-oriented depth refinement scheme. Chapter 5 compares the proposed scheme with the state-of-the-art approaches in terms of synthesis quality. The thesis is concluded with a summary of our observations.

CHAPTER 2 Background

2.1 Depth-Image-based Rendering

Depth-image-based rendering (DIBR) is a view generation method that renders virtual views of a scene from a known reference image and its associated per-pixel depth information. Often referred to as 3D image warping, the process involves first reprojecting the reference image into the 3D space utilizing its depth information, followed by the projection of the reconstructed scene onto the image plane of a virtual view camera.

The warping defines a vector-valued function $\Psi$ that takes pixel coordinates $\mathbf{p} = [x, y]^T$ in the reference view as input and returns the corresponding coordinates $\mathbf{p}' = [x', y']^T$ in the virtual view as output:

$$\Psi: \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix} \mapsto \begin{bmatrix} \mathbf{p}' \\ 1 \end{bmatrix} = \mathbf{A}'\mathbf{R}\mathbf{A}^{-1} \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix} + \frac{1}{Z_p}\mathbf{A}'\mathbf{T}, \tag{2.1}$$

where the rotation and translation matrices, $\mathbf{R}$ and $\mathbf{T}$, specify the relative position of the virtual camera; $\mathbf{A}'$ and $\mathbf{A}$ indicate respectively the intrinsic parameters of the virtual and reference cameras; and $Z_p$ is the depth value associated with $\mathbf{p}$. In the above, we have tacitly assumed parallel camera configuration. The reader is referred to [2] for details. For brevity we use $\Psi(\mathbf{p}; Z_p)$ to denote the warping of the pixel $\mathbf{p}$.

Figure 2.1: View synthesis based on multi-view video plus depth.

Equation (2.1) establishes a depth-dependent relation between the pixel coordinates of corresponding points in an image pair. According to the equation, an arbitrary virtual view can in principle be generated, provided that the depth value $Z_p$ is known for every pixel $\mathbf{p}$ in the reference image and that camera parameters are available. In practice, however, the viewpoint navigation is constrained by disocclusion problems: "holes" appear in synthesized images if areas occluded in the reference view become visible in a virtual view. Such artifacts become most obvious when the virtual view is very far away from its reference.
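As a rough numerical illustration of Equation (2.1), the sketch below warps a single pixel from the reference view to the virtual view. The function name, argument names and the camera values in the usage lines are assumptions made for the example. The final division by the third homogeneous coordinate is a general normalization that leaves the result unchanged in the parallel configuration assumed by the text.

```python
import numpy as np

def warp_pixel(p, z_p, A_ref, A_virt, R, T):
    """Warp pixel p = (x, y) with depth z_p following Equation (2.1).

    A_ref, A_virt : 3x3 intrinsic matrices of reference / virtual camera
    R, T          : 3x3 rotation and length-3 translation of the virtual camera
    """
    p_h = np.array([p[0], p[1], 1.0])  # homogeneous pixel coordinates
    q_h = A_virt @ R @ np.linalg.inv(A_ref) @ p_h + (A_virt @ T) / z_p
    return q_h[:2] / q_h[2]            # back to inhomogeneous coordinates

# Hypothetical parallel set-up: identical intrinsics, no rotation,
# purely horizontal translation.
A = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])
q = warp_pixel((100.0, 50.0), 20.0, A, A, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

In this set-up the pixel only shifts horizontally, by the focal length times the baseline divided by the depth, which is the familiar disparity relation for parallel cameras.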

To reduce the effects of disocclusion, the MPEG committee has recently proposed a "multi-view video plus depth" data format that enables the generation of a virtual view to make use of more than one reference view. Figure 2.1 shows a classic illustration of the view synthesis based on such a data format. In the example, each pixel in the virtual view is formed by a weighted sum of its corresponding points in the two reference views, and depending on the disocclusion level, the weight vector can vary from one pixel to another. To find the corresponding points, depth images must be transmitted along with their video signals. Due to the enormous amount of data involved, both view and depth images must be compressed prior to transmission. The influence of depth-image compression on synthesis quality is the subject of the next chapter.
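The weighted-sum formation of a virtual-view pixel can be sketched as follows. This is a simplified illustration, not the MPEG FTV reference procedure: it assumes both reference views have already been warped to the virtual position, that validity masks mark disoccluded pixels, and that a single hypothetical parameter `alpha` (0 at the left camera, 1 at the right camera) sets the base weights.

```python
import numpy as np

def blend_views(left, right, left_valid, right_valid, alpha):
    """Blend two warped reference views into one virtual view.

    left, right             : warped intensity images (same shape)
    left_valid, right_valid : boolean masks, False where a view has a hole
    alpha                   : assumed normalized virtual-camera position
    Returns (blended image, mask of pixels covered by at least one view).
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    w_left = (1.0 - alpha) * np.asarray(left_valid)
    w_right = alpha * np.asarray(right_valid)
    total = w_left + w_right
    out = np.zeros_like(left)
    # Per-pixel weights are renormalized, so a pixel visible in only
    # one view is taken entirely from that view.
    np.divide(w_left * left + w_right * right, total, out=out, where=total > 0)
    return out, total > 0

left = np.full((1, 3), 10.0)
right = np.full((1, 3), 20.0)
lv = np.array([[True, False, False]])
rv = np.array([[True, True, False]])
out, filled = blend_views(left, right, lv, rv, alpha=0.25)
```

Pixels that are holes in both views stay unfilled and would need a separate hole-filling step.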

Figure 2.2: Categorization of the synthesis errors observed in a nearby reference view [7].

2.2 Compensation of Synthesis Errors

In [7], Tanimoto et al. proposed to compensate the synthesis errors (due to the use of an impaired depth representation) in a virtual view by estimating their magnitudes from the errors observed in a nearby reference view. In their approach, the synthesis errors are classified into two categories according to their magnitudes. As shown in Figure 2.2, the less significant errors fall within Category 1, whereas the more significant ones belong to Category 2. The former can be attributed to the depth errors in areas where the intensity changes smoothly, while the latter is due to the errors in regions with a sudden change in intensity. On account of their distinct characteristics, the synthesis errors of different categories are compensated separately.

As a general observation, the magnitude of synthesis errors is proportional to the distance between the reference and virtual cameras. This fact helps to predict the errors in a virtual view. For example, in Figure 2.1, if we warp the left view image to the right view, then the resulting errors can be linearly scaled to estimate the ones incurred when the left view image is warped to the virtual view. This approach, although useful for compensating the Category-1 errors, is less effective at dealing with the Category-2 errors.
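The linear scaling described above amounts to a one-line rule. The sketch below is only an illustration under the stated proportionality; the function and parameter names are assumptions, and the actual compensation procedure of [7] involves more than this scaling.

```python
import numpy as np

def scale_synthesis_error(error_left_to_right, baseline_left_right, baseline_left_virtual):
    """Predict the error of warping left -> virtual from the observed
    error of warping left -> right, assuming errors grow linearly
    with camera distance."""
    ratio = baseline_left_virtual / baseline_left_right
    return np.asarray(error_left_to_right, dtype=float) * ratio

# Virtual camera placed a quarter of the way from the left to the right camera.
predicted = scale_synthesis_error([8.0, -4.0], 10.0, 2.5)
```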

For this reason, the authors of [7] turned to the linear relation between the geometry distortions of different views to compensate the Category-2 errors. First, they spotted the Category-2 errors by observing the misalignment of vertical edges when one of the reference views was warped to the other. More specifically, the pixels between the misaligned vertical edges were marked as being distorted by the Category-2 errors, and their corresponding depth was refined to minimize the geometry misalignments.

Figure 2.3: Detected unreliable pixels [8].

In [7], the edge maps were manually generated in order to distinguish pixels corrupted by the errors of different categories. While the results were promising, the approach neglected to account for the effects of ill-formed edge maps. Besides, the bandwidth necessary for signaling the edge maps was not considered if the compensation were to be done at the receiver side.

2.3 Group-based Depth Refinement

Sung et al. [8], on the other hand, made use of the Lambertian condition to refine depth images. The process involves using the similarity between the depth and intensity values of corresponding pixels to detect unreliable depth pixels and then refining their values through a group-based disparity search.

This scheme first follows the FTV framework, warping the stereo reference and depth images to the position of the virtual view. A pixel in one warped reference view is considered to be unreliable if both the intensity and depth differences between itself and its correspondent in the other warped view exceed the given intensity and depth thresholds, respectively. Figure 2.3 marks the unreliable pixels in white. A group-based depth correction scheme first groups the unreliable pixels by connected-component analysis, and then independently corrects the depth pixels of each group. The correction optimizes an offset value for the depth pixels in a group so as to minimize the sum of synthesis errors, and adds the offset to these depth pixels. Finally, the virtual view image is synthesized by referring to the updated depth images.
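The two stages of this group-based scheme can be sketched loosely as below. This is not the implementation of [8]: the threshold names, the candidate offset list and the caller-supplied `synthesis_error` callback are all assumptions for illustration, and the connected-component grouping is left out (a labeling routine would supply `group_mask`).

```python
import numpy as np

def unreliable_mask(view_a, view_b, depth_a, depth_b, t_intensity, t_depth):
    """Flag pixels whose intensity AND depth differences between the two
    warped views both exceed their thresholds."""
    d_int = np.abs(np.asarray(view_a, float) - np.asarray(view_b, float))
    d_dep = np.abs(np.asarray(depth_a, float) - np.asarray(depth_b, float))
    return (d_int > t_intensity) & (d_dep > t_depth)

def best_group_offset(depth, group_mask, offsets, synthesis_error):
    """Try each candidate offset on one pixel group and return the offset
    that gives the smallest synthesis error (hypothetical callback)."""
    depth = np.asarray(depth, dtype=float)
    best, best_err = None, None
    for off in offsets:
        trial = depth.copy()
        trial[group_mask] += off
        err = synthesis_error(trial)
        if best_err is None or err < best_err:
            best, best_err = off, err
    return best

mask = unreliable_mask([[100, 100]], [[100, 150]], [[50, 50]], [[50, 90]], 20, 10)

target = np.array([[10.0, 13.0], [10.0, 13.0]])
group = np.array([[False, True], [False, True]])
offset = best_group_offset(np.full((2, 2), 10.0), group, range(-5, 6),
                           lambda d: float(np.sum((d - target) ** 2)))
```

Because one offset is shared by the whole group, a large group mixing several depth layers cannot be corrected well, which matches the limitation noted for this scheme.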

Although the group-based depth correction reduces many synthesis errors by its hybrid processes, this scheme is inefficient if a group covers too many pixels to meet the local condition, especially when the depth images are severely distorted.

In summary, only a few approaches have been proposed to alleviate synthesis artifacts caused by depth-image compression. Because these schemes rely entirely on the decoded information for intensity correction or depth refinement, their performance is greatly influenced by compression effects.

CHAPTER 3 Per-Pixel Synthesis Distortion Model

3.1 Synthesis Distortion Model

An analytical model is introduced for characterizing synthesis distortions caused by depth-image compression. The model is explained with reference to Figure 3.1, which illustrates an example of disparity-compensated interpolation based on an impaired depth representation. In the figure, $I_T$ denotes a virtual view image generated from the reference image $I_R$ utilizing its ground-truth, per-pixel depth information. As mentioned previously, the warping establishes a relation between the intensity values of the reference and virtual images: $\mathbf{p}' = \Psi(\mathbf{p}; Z_p)$ and $I_T(\mathbf{p}') = I_R(\mathbf{p})$. To simplify our discussion, we assume that the reference image $I_R$ is without coding errors. The more general case can be analyzed along the same lines as the derivations that follow.

To examine the influence of depth-image compression on synthesis quality, we approximate the coding effects of depth images by an additive noise model, i.e., $\hat{Z}_p = Z_p + n_p$. Through the warping function $\Psi$, the depth error $n_p$ causes the projection of the pixel $\mathbf{p}$ to move from $\mathbf{p}' = \Psi(\mathbf{p}; Z_p)$ to $\mathbf{q}' = \Psi(\mathbf{p}; \hat{Z}_p)$; the effect is known as geometry distortion. It then follows that $I_R(\mathbf{p})$ is substituted for $I_T(\mathbf{q}')$ as the intensity value of the pixel $\mathbf{q}'$; the squared difference indicates the synthesis distortion contributed by $n_p$:

$$\begin{aligned} \varepsilon_p &\triangleq \left(I_R(\mathbf{p}) - I_T(\mathbf{q}')\right)^2 = \left(I_R(\mathbf{p}) - I_R(\mathbf{q})\right)^2 \\ &\approx \left(I_R(\mathbf{p}) - I_R(\mathbf{p}) - \nabla I_R(\mathbf{p}) \cdot (\mathbf{q} - \mathbf{p})\right)^2 \\ &= \left(\nabla I_R(\mathbf{p}) \cdot (\mathbf{q} - \mathbf{p})\right)^2, \end{aligned} \tag{3.1}$$

where $\mathbf{q}'$ is inversely projected to $I_R$ by the inverse mapping function $\Psi^{-1}(\mathbf{q}'; Z_q)$ and a Taylor series expansion is used to approximate $I_R(\mathbf{q})$. Recognizing that $\mathbf{q}' = \Psi(\mathbf{q}; Z_q) = \Psi(\mathbf{p}; Z_p + n_p)$, we solve for the vector difference $(\mathbf{q} - \mathbf{p})$ as

$$\mathbf{q} - \mathbf{p} = \frac{n_p}{Z_q (Z_p + n_p)}\,\mathbf{c},$$

where $\mathbf{c} = \begin{bmatrix} \mathbf{I}_2 & \mathbf{0}_{2 \times 1} \end{bmatrix} \mathbf{A}\mathbf{R}^{-1}\mathbf{T}$ is a vector depending solely on camera parameters. Substituting this result into Equation (3.1) then yields

$$\varepsilon_p \approx \left(\frac{n_p}{Z_q (Z_p + n_p)}\,\nabla I_R(\mathbf{p}) \cdot \mathbf{c}\right)^2. \tag{3.2}$$

Figure 3.1: Disparity-compensated interpolation using an impaired depth representation.

Now let us consider parallel camera configuration, with which the vector $\mathbf{c}$ has the form of $[c, 0]^T$ where $|c|$ is proportional to the distance between the reference and virtual cameras. Then it is obvious that

$$\varepsilon_p \approx \left(\frac{n_p}{Z_q (Z_p + n_p)}\right)^2 \times g_x^2(\mathbf{p}) \times c^2, \tag{3.3}$$

where $g_x(\mathbf{p})$ denotes the $x$ component of the gradient $\nabla I_R(\mathbf{p}) = [g_x(\mathbf{p}), g_y(\mathbf{p})]^T$ computed at $\mathbf{p}$. To obtain the expected per-pixel synthesis distortion conditioned on ground-truth depth values, we take conditional expectations of both sides and expand $\left(n_p / Z_q(Z_p + n_p)\right)^2$ into its Taylor series in $n_p$:

$$\begin{aligned} E\{\varepsilon_p \mid Z_p, Z_q\} &\approx E\left\{\left(\frac{n_p}{Z_q (Z_p + n_p)}\right)^2 \,\middle|\, Z_p, Z_q\right\} \times m_g^{(2)}(\mathbf{p}) \times c^2 \\ &= \frac{1}{Z_q^2} \times \left(\frac{E\{n_p^2\}}{Z_p^2} - 2\,\frac{E\{n_p^3\}}{Z_p^3} + 3\,\frac{E\{n_p^4\}}{Z_p^4} - \cdots\right) \times m_g^{(2)}(\mathbf{p}) \times c^2 \\ &= \frac{1}{Z_q^2} \times \left(\frac{\sigma_n^2(\mathbf{p})}{Z_p^2} + \frac{9\,\sigma_n^4(\mathbf{p})}{Z_p^4} + \frac{75\,\sigma_n^6(\mathbf{p})}{Z_p^6} + \cdots\right) \times m_g^{(2)}(\mathbf{p}) \times c^2 \\ &\approx \frac{1}{Z_q^2} \times \frac{\sigma_n^2(\mathbf{p})}{Z_p^2} \times m_g^{(2)}(\mathbf{p}) \times c^2, \end{aligned} \tag{3.4}$$

where $m_g^{(2)}(\mathbf{p}) = E\{g_x^2(\mathbf{p})\}$ can be viewed as a measure of how rapidly the intensity changes along the horizontal direction at $\mathbf{p}$, and $\sigma_n^2(\mathbf{p})$ indicates the corresponding depth-error variance. In the above, $n_p$ is assumed to be independent of $g_x(\mathbf{p})$ and to obey the normal distribution, i.e., $n_p \sim N(0, \sigma_n^2(\mathbf{p}))$, so that $E\{n_p^2\} = \sigma_n^2(\mathbf{p})$, $E\{n_p^4\} = 3\sigma_n^4(\mathbf{p})$, $E\{n_p^6\} = 15\sigma_n^6(\mathbf{p})$ and all odd moments vanish. The last approximation in Equation (3.4) is justified because $Z_p$ is usually much greater than $\sigma_n(\mathbf{p})$. As we can see in Figure 3.2, the magnitudes $Z_p$ are, on average, about 40 to 250 times larger than $\sigma_n(\mathbf{p})$ when the depth QP is varied from 22 to 44.

Figure 3.2: The ratio of $Z_p$ to $\sigma_n(\mathbf{p})$ over depth QP (horizontal axis: depth QP, 22 to 44; vertical axis: average ratio, 0 to 300). Each point represents the average ratio of all test sequences.
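The approximation in Equation (3.4) can be checked numerically. The sketch below evaluates the closed-form first-order term, the three-term series, and a Monte Carlo estimate under the same Gaussian depth-error assumption. The function names and the parameter values in the usage lines (including the camera constant `c`) are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

def expected_distortion(sigma_n, z_p, z_q, m_g2, c):
    """First-order approximation, last line of Equation (3.4)."""
    return (sigma_n ** 2) / (z_q ** 2 * z_p ** 2) * m_g2 * c ** 2

def series_distortion(sigma_n, z_p, z_q, m_g2, c):
    """First three non-zero series terms of Equation (3.4)."""
    r = sigma_n / z_p
    return (r ** 2 + 9 * r ** 4 + 75 * r ** 6) / z_q ** 2 * m_g2 * c ** 2

def monte_carlo_distortion(sigma_n, z_p, z_q, m_g2, c, samples=200_000, seed=0):
    """Sample n_p ~ N(0, sigma_n^2) and average the geometric factor."""
    rng = np.random.default_rng(seed)
    n = rng.normal(0.0, sigma_n, samples)
    geometric = (n / (z_q * (z_p + n))) ** 2
    return float(geometric.mean()) * m_g2 * c ** 2

# Hypothetical setting with Z_p / sigma_n = 100.
args = (0.3, 30.0, 15.0, 64.0, 1000.0)
closed = expected_distortion(*args)
series = series_distortion(*args)
sampled = monte_carlo_distortion(*args)
```

When $Z_p$ is much larger than $\sigma_n(\mathbf{p})$, as in this setting, the higher-order terms add well under one percent, which is why the first-order term is retained.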

Figure 3.3: Measuring the depth-error sensitivity under various settings of $Z_p$, $Z_q$ and $m_g^{(2)}(\mathbf{p})$. Each panel plots the depth-error sensitivity (vertical axis, 0 to 500) against the expected synthesis distortion (horizontal axis, 0 to 40) for $m_g^{(2)}(\mathbf{p}) = 36$ and $m_g^{(2)}(\mathbf{p}) = 64$: (a) $Z_p = 30$, $Z_q = 0.5 Z_p$; (b) $Z_p = 60$, $Z_q = 0.5 Z_p$; (c) $Z_p = 30$, $Z_q = Z_p$; (d) $Z_p = 60$, $Z_q = Z_p$; (e) $Z_p = 30$, $Z_q = 1.5 Z_p$; (f) $Z_p = 60$, $Z_q = 1.5 Z_p$. The values annotated in the panels are 333.0 and 249.7 in (a), 166.5 and 124.8 in (b) and (c), 83.2 and 62.4 in (d), 111.0 and 83.2 in (e), and 55.5 and 41.6 in (f).

3.2 Observations

Equation (3.4) provides a non-stationary model for the expected per-pixel synthesis distortion, which suggests that the depth errors of different pixels contribute differently to the overall synthesis distortion. From the equation, the distortion caused by $n_p$ is determined by several factors measured at $\mathbf{p}$: the depth-error variance, the intensity variation, the (ground-truth) depth value, as well as the position of the virtual camera. Further insight into the combined effects of these factors is gained by looking at Figure 3.3, which displays the ratio of $Z_p$ to $\sigma_n(\mathbf{p})$ as a function of $E\{\varepsilon_p \mid Z_p, Z_q\}$, under various settings of $Z_p$, $Z_q$, and $m_g^{(2)}(\mathbf{p})$ simulating smoothly- or rapidly-changing depth/intensity fields. In the experiment, $\sigma_n(\mathbf{p})$ was varied to identify the highest level of error variance at which the specified distortion is achieved. The result is then used to compute $Z_p / \sigma_n(\mathbf{p})$. Intuitively, the ratio, which we call depth-error sensitivity, characterizes how sensitive a pixel is to its depth error in terms of the extent of synthesis distortions. A higher ratio (sensitivity) implies that a small error in depth can lead to a considerable distortion.
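Under the first-order approximation of Equation (3.4), the depth-error sensitivity has a simple closed form: solving the approximation for $Z_p / \sigma_n(\mathbf{p})$ at a target distortion $D$ gives $\sqrt{m_g^{(2)}(\mathbf{p})}\,|c| / (Z_q \sqrt{D})$. The sketch below encodes that expression; the function name is an assumption, and the camera constant `c` is set to an arbitrary value because it is not stated in this section.

```python
import math

def depth_error_sensitivity(distortion, z_q, m_g2, c):
    """Ratio Z_p / sigma_n at which the first-order model of
    Equation (3.4) reaches the given expected distortion."""
    return math.sqrt(m_g2) * abs(c) / (z_q * math.sqrt(distortion))

# Arbitrary camera constant and target distortion, for illustration only.
s_64 = depth_error_sensitivity(10.0, 15.0, 64.0, c=1.0)
s_36 = depth_error_sensitivity(10.0, 15.0, 36.0, c=1.0)
s_64_far = depth_error_sensitivity(10.0, 30.0, 64.0, c=1.0)
```

The ratio `s_64 / s_36` equals 4/3 and doubling $Z_q$ halves the sensitivity. Both trends agree with the values annotated in Figure 3.3, for example 333.0 against 249.7 in panel (a) and 333.0 against 166.5 between panels (a) and (c).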

From the figure, several important observations can be made:

1. Compare the curves produced with different settings of $m_g^{(2)}(\mathbf{p})$. The larger the value of $m_g^{(2)}(\mathbf{p})$, the more sensitive the pixel $\mathbf{p}$ is to its depth error; namely, when depth errors happen in areas with vertical edges or fine texture details, their effects on synthesis quality are more apparent. This observation is also corroborated by [7].

2. Compare parts (a)(c)(e) with parts (b)(d)(f). When a pixel corresponds to a farther clipping plane, it exhibits a lower depth-error sensitivity. In this case, the pixel has a larger depth value $Z_p$ and according to Equation (2.1), the resulting geometry distortion is less significant.

3. Compare part (e) with parts (a)(c) (or (f) with (b)(d)). When a pixel $\mathbf{p}$ is ill-warped to $\mathbf{q}'$, the resulting synthesis error is less observable if $Z_q$ is much greater than $Z_p$ (and hence $\hat{Z}_p$). The result can be explained using the example shown in Figure 3.4, where $\mathbf{q}_1$ and $\mathbf{q}_2$ denote respectively the inverse projections of $\mathbf{q}'$ for the two extreme cases: $Z_{q_1} \gg \hat{Z}_p$ and $Z_{q_2} \ll \hat{Z}_p$. Since $Z_{q_1} \gg \hat{Z}_p \approx Z_p \gg Z_{q_2}$, the artifact is more noticeable when a depth error causes warping to substitute a background pixel for a foreground pixel, which explains the less significant change in intensity when $Z_q \gg Z_p$.

4. Observe the reciprocal relation between $\sigma_n^2(\mathbf{p}) / Z_p^2$ and $c^2$ in Equation (3.4). It suggests that when a pixel $\mathbf{p}$ is warped to a virtual view that is farther away from the reference view, it is more sensitive to depth errors.

These observations remain valid for other camera configurations, except that the effects of the intensity variation and camera arrangement must jointly be considered by evaluating $E\{(\nabla I_R(\mathbf{p}) \cdot \mathbf{c})^2\}$.

Figure 3.4: A geometrical interpretation of the effect of $Z_q$ on depth-error sensitivity.

CHAPTER 4 Synthesis-Quality-Oriented Depth Refinement Scheme

The framework of MPEG FTV [9] views the transmitted depth images as deterministically specifying the depth information for the reference images. The compression effects of depth images were neglected during the rendering of virtual views. As seen from the analysis in §3, depth errors can cause disturbing synthesis artifacts, especially at areas with sharp edges or fine texture details. To tackle the problem, we propose to regard both the received view and depth images as sources of information about the ground-truth depth of the scene, and provide ways to detect and refine unreliable depth values.

4.1 System Architecture

To allow for an easier understanding of our algorithm, Figure 4.1 depicts the system block diagram with a highlight on the data communicated between functional blocks. As shown, for an economic use of network bandwidth, both reference images $\{I_1, I_2\}$ and their respective per-pixel depth information $\{D_1, D_2\}$ are compressed prior to transmission. These data are decoded and reconstructed at the receiver side before they are used for the creation of virtual views. The "prime" symbols in the figure differentiate the coded view and depth images from their original sources.

Figure 4.1: System Block Diagram.

Recognizing that depth-image compression may give rise to depth errors, we introduce a depth refinement mechanism at the receiver side. The objective is to improve synthesis quality by refining the depth values for those pixels (which we call unreliable pixels) that are highly sensitive to depth errors. The process consists of two sequentially operated steps: (1) the detection of unreliable pixels and (2) the refinement of their depth values, both of which need to access the coded view and depth images. To make their performance robust against compression effects, additional control parameters are transmitted to the receiver as the side information, with their settings being determined at the sender side by evaluating the detection and refinement quality as perceived by the receiver over the range of all possible choices. The details are elaborated in the subsequent sections.

4.2 Reliability Detection

The detection process at the receiver side aims to discover unreliable pixels, i.e., those that are highly sensitive to depth errors and hence require higher fidelity for their depth values in order to minimize rendering errors. From the theoretical analysis in §3, a pixel is likely to be unreliable if it is located in a region with large intensity variation, or if it represents a pixel in a near clipping plane. Although both facts can jointly be utilized to form detection criteria, we consider only the use of intensity variation because view images are generally better compressed than their depth representations, making the intensity information more reliable for decision-making.

To quantify intensity variation, we adopt the Gaussian derivative operator to compute the gradient for all the pixels in view images. A pixel $p$ is considered to be unreliable, and its depth value deserves refining, if the magnitude $\|\nabla L'_U(p)\|$ of its gradient exceeds a given level $W_G$.¹ According to Observation #1 in §3, such a pixel is highly sensitive to depth errors, hence requiring higher precision for its depth value. Evidently, the value of $W_G$ plays a pivotal role in determining the detection accuracy. Because the signal statistics are non-stationary, we propose to adapt $W_G$ on a frame-by-frame basis. This is realized by transmitting its value as frame-level side information.
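The detection rule can be illustrated with a short sketch. It is only a minimal example under stated assumptions: the kernel width `sigma`, the kernel `radius`, the border replication, and the function names are hypothetical choices that the text does not specify, and only the horizontal derivative is computed, as is appropriate for a parallel camera configuration.

```python
import math


def horizontal_gradient(image, sigma=1.0, radius=3):
    """Horizontal Gaussian-derivative response for a 2-D list of intensities.

    sigma and radius are assumed values, not settings from the thesis.
    """
    offsets = range(-radius, radius + 1)
    # Correlation weights proportional to k * exp(-k^2 / (2 sigma^2)).
    weights = [k * math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    # Normalise so that a ramp with slope 1 per pixel gives a response of 1.
    norm = sum(k * w for k, w in zip(offsets, weights))
    weights = [w / norm for w in weights]

    height, width = len(image), len(image[0])
    grad = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for k, w in zip(offsets, weights):
                xx = min(max(x + k, 0), width - 1)  # replicate the border pixel
                acc += w * image[y][xx]
            grad[y][x] = acc
    return grad


def detect_unreliable(image, level, sigma=1.0, radius=3):
    """Return the set of (x, y) positions whose gradient magnitude exceeds `level`."""
    grad = horizontal_gradient(image, sigma, radius)
    return {
        (x, y)
        for y, row in enumerate(grad)
        for x, g in enumerate(row)
        if abs(g) > level
    }
```

With `radius=1` the kernel reduces to a central difference, so a vertical step edge marks only the two columns adjacent to the edge as unreliable.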

In determining the value of $W_G$ for a particular frame, we wish to strike a good balance between the hit and false-alarm rates. The best setting of $W_G$, denoted by $W_G^{*}$, should have the subset of pixels $S(W_G) = \{p : \|\nabla L'_U(p)\| > W_G\}$ contain as many unreliable pixels as possible while keeping the number of reliable ones minimal. To find $W_G^{*}$, we first associate each plausible choice of $W_G$ and the corresponding set of pixels $S(W_G)$ with a matching score that weights the hit rate against the false-alarm rate:

\[ M(W_G) = \sum_{p \in S(W_G)} \bigl( 1_S(p) - \lambda \, (1 - 1_S(p)) \bigr), \]

where $\lambda$ is the weight given to false alarms and $1_S : p \to \{0, 1\}$ is an indicator function defined as

\[ 1_S(p) = \begin{cases} 1 & \text{if } \varepsilon_s \ge \tau, \\ 0 & \text{if } \varepsilon_s < \tau. \end{cases} \]

Then we choose, among all possible choices, the one that yields the highest matching score, i.e., $W_G^{*} = \arg\max_{W_G} M(W_G)$. The approach can be interpreted as evaluating, at the sender side, the detection quality as perceived by the receiver.
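The sender-side selection of the detection level can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionaries `grad_mag` and `is_hit`, the function names, and the `penalty` argument (standing in for the weight applied to false alarms in the matching score) are hypothetical and are not taken from the thesis.

```python
def matching_score(selected, is_hit, penalty):
    """Score a detected set: +1 for every hit, -penalty for every false alarm."""
    return sum(1.0 if is_hit[p] else -penalty for p in selected)


def select_level(grad_mag, is_hit, levels, penalty=1.0):
    """Exhaustively pick the detection level with the highest matching score.

    grad_mag: dict mapping a pixel to its gradient magnitude.
    is_hit:   dict mapping a pixel to True if its depth is judged unreliable.
    levels:   iterable of candidate detection levels.
    """
    best_level, best_score = None, float("-inf")
    for level in levels:
        selected = [p for p, g in grad_mag.items() if g > level]
        score = matching_score(selected, is_hit, penalty)
        if score > best_score:  # ties keep the earlier candidate
            best_level, best_score = level, score
    return best_level, best_score
```

A low level admits many false alarms and a high level misses hits, so the score typically peaks at an intermediate level.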

In the course of computing the matching score, it is necessary to decide whether a hit or a false alarm occurs. This is accomplished by evaluating the per-pixel synthesis distortion $\varepsilon_s$ at the sender side with $L_1$ and $L_2$ (or in the reverse order) being used in place of $L_U$ and $L_W$, respectively (cf. Equation (3.1)). Specifically, if $\varepsilon_s$ is greater than or equal to a threshold $\tau$, indicating that the depth associated with the pixel $p$ may be unreliable, a hit is identified; otherwise, a false alarm is signaled. Ideally, $\tau$ should be set to zero according to the Lambertian condition; in practice, however, a non-zero value was used to compensate for camera noise and illumination differences between view images. The settings of the false-alarm weight $\lambda$ and the threshold $\tau$ that yield the best synthesis quality (in terms of PSNR) are searched exhaustively at the sender side. Note that they need not be transmitted to the receiver.

¹ With a parallel camera configuration, only the $x$ component of the gradient is computed.

4.3 Depth Refinement

After we discover all the unreliable pixels, our next step is to refine their depth values. Because depth refinement is performed by the receiver, its operation must be computationally simple and efficient. For this reason, we adopt a candidate-based disparity estimation scheme to derive depth from the received view images. As in most block-based algorithms, a constant disparity is searched for each block of pixels (of size 7×7), centered on an unreliable pixel $p$, by minimizing the error between the two view images after disparity compensation. However, unlike those algorithms, which usually require examining a large number of disparities, ours restricts the search to only those disparities that correspond to an integer depth value in the interval $[\hat{Z}_s - U_s,\ \hat{Z}_s + U_s]$, where $\hat{Z}_s$ is the received depth value and $U_s$ is the refinement search range. On the one hand, this constraint is an expedient motivated by complexity considerations; on the other hand, it prevents the simple block-based search from settling on an improper disparity.
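The restricted search can be sketched as below. This is a minimal illustration, not the thesis implementation: the sum of absolute differences as the matching error, integer-pel disparities, 8-bit depth levels, border replication, the sign convention (a pixel at `x` in the first view matches `x - disparity` in the second), and the caller-supplied `depth_to_disparity` mapping are all assumptions introduced for the example.

```python
def block_sad(view_a, view_b, cx, cy, disparity, half=3):
    """Sum of absolute differences over a (2*half+1)^2 block centred on (cx, cy).

    view_b is sampled at x - disparity; coordinates are clamped to the image.
    """
    height, width = len(view_a), len(view_a[0])
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            y = min(max(cy + dy, 0), height - 1)
            x = min(max(cx + dx, 0), width - 1)
            xb = min(max(x - disparity, 0), width - 1)
            total += abs(view_a[y][x] - view_b[y][xb])
    return total


def refine_depth(view_a, view_b, cx, cy, decoded_depth, search_range,
                 depth_to_disparity, max_level=255):
    """Search integer depth levels in [decoded - range, decoded + range].

    depth_to_disparity is a hypothetical callable mapping a depth level to an
    integer disparity. The decoded depth is kept unless a candidate is
    strictly better.
    """
    lo = max(0, decoded_depth - search_range)
    hi = min(max_level, decoded_depth + search_range)
    best_depth = decoded_depth
    best_cost = block_sad(view_a, view_b, cx, cy, depth_to_disparity(decoded_depth))
    for level in range(lo, hi + 1):
        cost = block_sad(view_a, view_b, cx, cy, depth_to_disparity(level))
        if cost < best_cost:
            best_depth, best_cost = level, cost
    return best_depth
```

Because at most `2 * search_range + 1` candidates are examined per unreliable pixel, the cost grows with the search range rather than with the full disparity range.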

Although reducing the number of search candidates helps to simplify the disparity search, the remaining issues are how to determine a proper value of $U_s$ for each unreliable pixel and how to signal this information efficiently. As described previously, the value of $U_s$ determines the maximum modification of $\hat{Z}_s$ that can be caused by depth refinement; that is, it controls the strength of refinement. Our analysis found that the depth error sensitivity of a pixel is related to its ground-truth depth value, implying that the adaptation of $U_s$ should refer to the value of $\hat{Z}_s$ (which is an approximation of $Z_s$).


Figure 4.2: A sample result of the proposed depth refinement algorithm: (a)(d) the original depth image, (b)(e) the decoded depth image, and (c)(f) the refined depth image.

For a trade-off between quality and overhead, we divide the set $S(W_G)$ into $Q$ disjoint subsets $v_l(W_G)$, $1 \le l \le Q$, each of which is assigned a refinement search range $u_l$. A uniform quantizer that operates on the received depth $\hat{Z}_s$ is used to categorize the unreliable pixels in $S(W_G)$ into one of the $Q$ subsets. After that, the best settings of $\{u_l\}_{l=1}^{Q}$ are searched exhaustively at the sender side and transmitted to the receiver as side information.
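The assignment of a search range to each unreliable pixel can be sketched as follows. This is a minimal example under assumptions: 8-bit depth levels (`num_levels=256`), equal-width bins, 0-based subset indices, and the function names are illustrative choices that the text does not fix.

```python
def subset_index(decoded_depth, num_subsets, num_levels=256):
    """Uniformly quantise a depth level into one of num_subsets bins (0-based)."""
    step = num_levels / num_subsets
    return min(int(decoded_depth / step), num_subsets - 1)


def search_range_for(decoded_depth, ranges, num_levels=256):
    """Look up the search range signalled for the subset containing the depth.

    ranges holds one search range per subset, as received in the side
    information.
    """
    return ranges[subset_index(decoded_depth, len(ranges), num_levels)]
```

Only the list of per-subset ranges has to be signalled; the receiver derives the subset of each pixel from the depth it has already decoded.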

Figure 4.2 shows a sample result of our refinement process. Observe that depth compression introduces blocking artifacts in the decoded depth image (see parts (b) and (e) of Figure 4.2). With depth refinement, we can largely remove the artifacts (see parts (c) and (f) of Figure 4.2); note the clarity of object boundaries that are simply not visible in the decoded depth image. Interestingly, the refinement can even recover some details that were removed by the enforcement of depth smoothing (compare parts (a)(d) and (c)(f) of Figure 4.2).


CHAPTER 5 Experiments

Extensive simulations were carried out to demonstrate the performance of the proposed scheme, and the results were compared with those of [7] and [8]. All the refinement schemes were implemented with the MPEG committee software VSRS 2.1 [10]. All experiments used DERS 2.0 [10] to generate depth images and JMVC 3.0.1 [11] to encode multi-view videos and their depth images. The average PSNR of synthesized images was computed based on the first 100 frames of each test sequence. In implementing the method described in [7], we employed the magnitude of synthesis errors, rather than manually generated edge maps, to distinguish pixels of different categories. For a fair comparison, all the threshold values used in [7] and [8] were determined by optimizing the quality of synthesized images. Table 5.1 and Table 5.2 detail the depth estimation settings and the encoder settings, respectively.
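For reference, the quality metric can be sketched as below. This is a minimal example under assumptions: 8-bit samples (peak value 255), a single luma plane per frame, and averaging of per-frame PSNR values; the text does not state whether the average is taken over per-frame PSNR or over the mean squared error, and the function names are illustrative.

```python
import math


def psnr(reference, synthesized, peak=255.0):
    """PSNR in dB between two equally sized 2-D lists of sample values."""
    squared_error = 0.0
    count = 0
    for ref_row, syn_row in zip(reference, synthesized):
        for r, s in zip(ref_row, syn_row):
            squared_error += (r - s) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / mse)


def average_psnr(frame_pairs, peak=255.0):
    """Mean of per-frame PSNR over (reference, synthesized) frame pairs."""
    values = [psnr(ref, syn, peak) for ref, syn in frame_pairs]
    return sum(values) / len(values)
```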

Figures 5.1, 5.2 and 5.3 compare the PSNR of the various schemes when the depth QP is varied from 22 to 44. The curves associated with MPEG FTV were produced without depth refinement. To see the effects of reference quality, Figure 5.1 shows the results generated with high-quality references (QP=22), whereas Figure 5.2 shows their low-quality counterparts (QP=31). It can be seen that all three schemes outperform MPEG FTV in all test sequences and, as expected, the improvement is greatest when depth images are coarsely quantized. Moreover, ours has the highest gain of all the schemes: an average PSNR improvement of 1.2 dB over MPEG FTV. The results are consistent across different test conditions.

Table 5.1: Depth Estimation Settings. Columns (a) to (c) represent Smoothing Coefficient, Precision and Search Level, respectively.

Sequence         SL-SR   NL-NR   SearchRng (Min-Max)   DisparityRng (Min-Max)   (a)   (b)   (c)
Lovebird1        6-7     5-8     4-90                  1-110                    4     4     2
Newspaper        4-5     3-6     26-88                 20-90                    4     2     2
Alt Moabit       9-8     10-7    1-33                  1-32                     1     2     2
Book Arrival     9-8     10-7    30-70                 30-70                    2     2     2
Door Flowers     9-8     10-7    12-38                 10-0                     2     4     2
Leaving Laptop   9-8     10-7    15-33                 15-33                    2     4     4
Dog              39-41   38-42   1-20                  0-20                     4     4     1
Pantomime        39-40   38-41   0-20                  0-20                     2     4     4

Table 5.2: Encoder Settings

Reference Frame         2
Intra Period            15
CABAC                   on
8x8 Transform           on
BasisQP                 22, 25, 28, 31, 35, 38, 41, 44
Inter-view Prediction   on
Search Mode             4 (Fast Search)
Motion Search Range     ±32

Figures 5.4, 5.5 and 5.6 further compare the subjective quality of synthesized images. Part (a) illustrates what can happen if incorrect depth information is used for view synthesis. Parts (b) through (d) show the results obtained by correcting depth with one of the three schemes just described (i.e., [7], [8], and ours). As can be seen, "ghost effects" appear around object boundaries if the depth is not refined; in comparison, the visual results with depth refinement are considerably improved. Our scheme even produces a result that is very close in appearance to the ground-truth view image. The reason behind the superior performance can be explained with Figure 5.7, which makes visible the unreliable pixels detected by the three schemes. As expected, our scheme tends to correct more depth pixels located in areas with fine texture details or vertical edges, namely, those that will crucially affect synthesis quality.

Figure 5.1: PSNR of synthesized images as a function of the depth and reference QP. The reference view images are coded with QP=22. Panels: (a) Door Flowers, (b) Newspaper, (c) Pantomime, (d) Lovebird1, (e) Alt Moabit, (f) Book Arrival, (g) Dog, (h) Leaving Laptop. Each panel plots PSNR against depth QP (22 to 44) for MPEG FTV, Tanimoto [7], Sung [8] and the proposed scheme.

[Figure 5.2: PSNR against depth QP, with the same panels and curves as Figure 5.1.]

[Figure 5.3: PSNR against depth QP, with the same panels and curves as Figure 5.1.]

Figure 5.4: Subjective quality comparison of synthesized images: (a) MPEG FTV (without depth refinement), (b) Tanimoto [7], (c) Sung [8] and (d) the proposed scheme. The depth QP of the Door Flowers sequence is set to 44.

Figure 5.5: Subjective quality comparison of synthesized images: (a) MPEG FTV (without depth refinement), (b) Tanimoto [7], (c) Sung [8] and (d) the proposed scheme. The depth QP of the Newspaper sequence is set to 44.

Figure 5.6: Subjective quality comparison of synthesized images: (a) MPEG FTV (without depth refinement), (b) Tanimoto [7], (c) Sung [8] and (d) the proposed scheme. The depth QP of the Dog sequence is set to 44.

Figure 5.7: Pixels whose depth values are judged unreliable: (a) Tanimoto [7] (category 2), (b) Sung [8] and (c) the proposed scheme. From top to bottom, the rows show the Door Flowers, Newspaper and Dog sequences, respectively.

CHAPTER 6 Conclusion

To alleviate the coding effects of depth images, we proposed in this thesis a synthesis-quality-oriented depth refinement scheme. The approach is characterized by its attempt to refine only those depth pixels that are likely to cause noticeable synthesis artifacts. In the course of this work, we developed an analytical model to establish criteria for reliability detection and to form guidelines for depth refinement. Since both operate on the decoded information, additional side information is transmitted to make them robust against compression effects. Experimental results show that our scheme has the highest PSNR gain among the state-of-the-art methods compared. It also produces a result that is visually similar to the ground-truth image.

This work is still in its early stage. Neither the detection nor the refinement scheme has fully utilized all the factors suggested by the per-pixel synthesis distortion model, so further improvements can be expected. Possible extensions include more sophisticated disparity search, spatio-temporal consistency and signal restoration techniques. In addition, the analytical model may find application in developing depth compression algorithms.


Bibliography

[1] C. Fehn, R. Barre, and R. S. Pastoor, “Interactive 3-DTV: Concepts and Key Technologies,” Proceedings of the IEEE, vol. 94, pp. 524-538, March 2006.

[2] C. Fehn, “A 3D-TV Approach Using Depth-Image-Based Rendering (DIBR),” Proceedings of Visualization, Imaging, and Image Processing, September 2003.

[3] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High-Quality Video View Interpolation Using a Layered Representation,” ACM Transactions on Graphics, vol. 23, pp. 600-608, August 2004.

[4] A. Smolic, K. Muller, K. Dix, P. Merkle, P. Kauff, and T. Wiegand, “Intermediate View Interpolation based on Multiview Video plus Depth for Advanced 3D Video Systems,” IEEE Int’l Conf. on Image Processing, October 2008.

[5] E. Cooke, P. Kauff, and T. Sikora, “Multi-view Synthesis: A Novel View Creation Approach for Free Viewpoint Video,” Signal Processing: Image Communication, vol. 21, pp. 476-492, July 2006.

[6] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, “Multi-view Video plus Depth Representation and Coding,” IEEE Int’l Conf. on Image Processing, October 2007.


[7] M. Tanimoto, T. Fujii, M. P. Tehrani, M. Wildeboer, and H. Furihata, “Error Cancellation in Free-viewpoint Image Generation for FTV,” ISO/IEC JTC1/SC29/WG11, MPEG09/M16607, April 2009.

[8] J. Sung, Y. J. Jeon, J. H. Lim, and B. M. Jeon, “Improving View Synthesis Results based on Depth Quality Measure,” ISO/IEC JTC1/SC29/WG11, MPEG09/M16417, April 2009.

[9] “Applications and Requirements on 3D Video Coding,” ISO/IEC JTC1/SC29/WG11, MPEG09/N10570, April 2009.

[10] M. Tanimoto, T. Fujii, and K. Suzuki, “Reference Software of Depth Estimation and View Synthesis for FTV/3DV,” ISO/IEC JTC1/SC29/WG11, MPEG08/M15836, October 2008.

[11] “Text of ISO/IEC 14496-10:2008/FDAM 1 Multiview Video Coding,” ISO/IEC JTC1/SC29/WG11, MPEG09/N9978, July 2008.
