A pseudo-object-oriented very low bit-rate video coding system with cache VQ for detail compensation

(1)

A

PSEUDO OBJECT-ORIENTED VERY

LOW

BIT-RATE VIDEO CODING

SYSTEM WITH CACHE VQ FOR DETAIL COMPENSATION

Chung-

Wei Ku, Liang-Gee Chen, You-Mzng Chiu,

a,nd

Yung-Pin

L e e

DSP/IC

Design

Lab.,

Department

of

Electrical Engineering,

National Taiwan University, Taipei, Taiwan,

R.O.C.

email: {william,lgchen}@video.ee.ntu.edu.tw

URL

:

h t tp:

/

/video.ee.nt u.edu. tw

/

-william

ABSTRACT

In

this

paper, a pseudo object-oriented video coding system is proposed and implemented. In order to increase the coding efficiency, cache VQ algorithm is suggested t o further compress those areas where motion estimation fails. According t o our primary simulation results, the visual quality of long-timed sequences is still acceptable even for bit-rates below 10 Kbps. In addition t o the high compression ratio for very low bit-rates, content-based applications are also expected since the proposed system utilizes segmented motion field; furthermore, the occurrences of prediction errors generally locate at emotionally important parts, e.g. eyes a n d mouth, etc. All the coded components are not only useful for compression but also meaningful for video recognition.

1. INTRODUCTION

As early as May 1991 the Moving Picture Expert Group

(MPEG)

raised the issue of audio-visual standard tar-

geted at t h e bit-rate of 4.8-64 Kbps. These efforts was approved in July 1993 with the MPEG-4 nickname and the title “Very Low Bit-Rate Coding of Moving Pic- tures and Associated Audio”. Basically larger compression ratio is expected in order t o meet the modern mo- dem standard V.34 in which the bandwidth is defined as 28.8 Kbps. Currently the standard about visual tele- phone is covered by

ITU-T

H.324, where the video coding is defined in H.263. Besides, the goal of MPEG-4 moves t o the issues about content-based applications or

functionalities. As a result, variable kinds of approaches are all developing very fast. For example, the model- based approaches [l] or analysis-synthesis approaches

[2]. I n this paper, a pseudo object-oriented very low bit-rate video coding system is proposed. The goal of the proposed system is to combine GSTN communication, personal computers, and audio-visual multimedia

service. Compression of video for very low bit-rates is expected; in addition, content-based functionalities are another possible applicaiions.

2 . M O T I O N E S T T M A T I O N A N D CODING O F THE M O T I O N F I E L D

2.1. O p t i c a l Flow based Motion Estimation Similar t o most video coding system, thr proposed system removes the temporal redundancy in the video sequences via motion estimation. In order t o code the motion vectors more efficiently, we would like t o generate a motion field with less spatial variation and good prediction. Besides, the generated motion vectors had better be meanful about the real movements. For the above reason, we propose a modified cost function which is shown as follows:

( u , v ) is the motion vector and E, is the partial devi- ation of the spatial-temploral function of a pixel along the z axis. Similar explanation is for E,,. The addi- tional third term indicates the penalty on convoluted contours. Another problem is that the answers may be trapped in local minimum due to gradient descent approach. To reduce the probability of this situation, a pyramid approach is suggested. The higher level of the pyramid is composed of the subsampled pixels a t the lower level. Motion estimation is executed from t h e t o p level to the bottom level. The scale of‘ adjustment at higher level is larger than that, of lower ones to achieve

n coarse answer: in addition, more iterations are executed a t lower level to obtain the precise vector. AS a

result, most local minima a r e avoided and the possibil- ity of finding global minimum is increased. In addition, the post processing of edges and stationary regions are

(2)

also suggested. All the above is named modified optical flow algorithm (MOFA) [6] and an example is given in Figure 1.

2.2. Arbitrarily Shaped Transform of the Seg-

mented Motion Field

In order t o encode these motion vectors efficiently, the spatial redundancy in the motion field is exploited. The motion field is subsampled then segmented into several homogeneous regions. After t h a t , arbitrarily shaped transform (AST) is applied to these regions. Since the shape information is necessary for AST and chain code description costs much d a t a amount, a rectangle ap- proximation is suggested. According t o the simulation results, the saving of d a t a amount is obvious while the loss of performance is negligible 171. T h e segmented and approximated motion field is then applied with AST in the subsampled pixel domain. Polynomial based ker- nels are selected in our system. In order t o speed up the processing time, two heuristic rules are assumed t o define the coefficient selection function (CSF). In our

system, f ( z ) =

&

is selected as the CSF. Finally, for each region, two sets of the AST coefficients and one set of the knot points describing the shape are transmitted. Before being sent to channel, all the coded elements are variable length coded (VLC) by Huffman code t o remove the statistical redundancy.

3. DETAIL COMPENSATION OF PREDICTION ERRORS

According t o our experiment results, we found most of the errors occur in the position of eyes and mouth. The reason is that some detailed movement is not well compensated by the motion field; such as the open or close of eyes a n d mouth. However, the representation of fine emotion is especially important for conversation applications. In order t o encode these part more efficiently a n d avoid the straightforward approach such as refresh frame, the idea of “motion off objects” is proposed. T h a t is, for those not well compensated re-

gions, which are generally eyes or mouth, cache vector quantization is applied instead of original motion compensated method.

3.1. Localization of Motion Off Objects

The first step is t o find the appropriate position of those “motion off objects” (MFO’s). In our system, the pro- cedures of determining MFO is as follows:

1. A sliding window with size 8 by 8 is applied to the error frame in a way similar t o two dimensional

2

filter. The mean absolute error (MAE) of each sliding window is calculated.

If the MAE is too larger, in addition, 80% of the pixel errors in this windows are large enough, al- l the pixels in the window are marked as MFO regions.

As the sliding window operates to all the error frame,

the distribution of MFO’s are found. For example, according to the reconstructed error in Figure 2 (a), the locations of MFO are the white area in Figure 2 (b). We found the above scheme detects most of the motion

off areas such as eyes and mouth. In fact, the proposed method to locate MFO is very similar t o t h e morpho- logical filter approach [4] or median filter approach [5].

3.2. Cache Vector Quantization Scheme

In order t o encode MFO more efficiently, the temporal redundancy of the MFO’s are exploited. T h a t is, since most of the MFO’s a r e about eyes and mouth of the same person, generally the image content about these parts are strongly correlated as the time passed. If some typical patterns about the eyes and mouth can be detected and memorized, the coding of later MFO’s can be solved by sending the indices of the patterns rather than content of the image [3]. Basically, the above idea is an extension of cache vector quantization proposed in [SI. However, several problems still exist. The first problem is how to memorize these patterns

for later pattern matching. Therefore, the idea of image p a t t e r n p r o t o t y p e ( I P P ) is adopted in o u r system, T h a t is, several blocks with sizes 24 by 16 are extended from the MFO’s. The idea is shown in Figure 3. The growing of the boundary is from the center of the MFO. For example, an I P P including the mouth area is displayed in Figure 4.

The second problem is the way to memorize appropriate patterns and apply the matching scheme. In our’

system, all the IPP’s are saved in a cache-like codebook. T h a t is, whenever the I P P of current frame is generated, they a r e compared with the IPP’s in codebook. If the winner in the codebook is very similar to the input I P P , index of the winner I P P is transmitted; otherwise the input I P P is encoded by DCT similar t o intra mode coding and this I P P is put into the cache codebook. The update strategy we used for cache replenishment is least-recently-used

( L R U )

because its performance IS t h e best 181. However, t o find the best matched I P P from the whole codebook is time consuming and unreasonable because it is meaningless t o compared a “mouth” I P P with an “eye” I P P . There- fore, the content in the codebook should be classified

(3)

in a reasonable sort. Observing the sequence about IP- P, it is easy t o find t h a t the positions of each kind of the IPP’s are strongly correlated because the position of mouth or eyes changes quite little in the sequences. According to t he above idea, the position of the I P P is selected as the cache tag field in our system instead of some classification methods [3]. In other words, the position of the input IPP is compared with all the positions of the IPP’s in codebook; only those IPP’s n- ear t h e input will become the candidate IPP’s for later matching process. The matching criteria we used is mean absolute error (MAE).

Although most of the MFO’s are well enclosed in t h e IPP’s, t h e exact position of MFO must be adjusted to match the input MFO instead of input IPP. For this purpose, the motion vectors of I P P are also adopted [3]. T h a t is, the best matched I P P is the one whose MAE is t h e minimum with the IPP’s best motion vector. Full search within a pre-defined range similar to block matching motion estimation is utilized to find the best motion vector for each I P P candidate. Anoth- er important issue is the position tag of each IPP in codebook must be updated if it is the “hit” one. In order to keep the correlation of locations with the input IPP, the position of t h e “hit” I P P is updated toward the input IPP with the displacement of the motion vector. In brief, the proposed cache VQ for MFO coding is: if the input IPP really “hits” one of the codeword IPP, index of t h e codeword and a motion vector related t o this codeword are transmitted for d a t a compression; otherwise the input is D C T coded in a way similar t o intra mode coding.

4.

SIMULATION

RESULTS

The idea of cache VQ for MFO coding is verified by the simulation results. In order to determine the size of the

IPP

codebook, the simulation for several kind- s of codebook is executed and the results for a long- timed sequence “CMS’ which is generated in our Lab for simulation are listed in Table 1. According to the simulation results, as the size of the codebook increas-

es, t h e performance is improved because most of the typical IPP’s can be found in the codebook gradually; however, as the size of the codebook is larger than 128, the performance can not be improved any more. Be- sides, t h e performance degrades a little because some “hit” IPP’s are not good enough therefore the num- ber of MFO’s might increase in later sequences due

to

error propagation. In terms of bit-rate, the average bit-rate decreases as the size of codebook increases because lots of IPP’s can be found in the cache codebook and index of “hit” I P P uses less d a t a amount compared

with DCT coding; however, the index costs more d a t a

as the size of codebook increases therefore the average bit-rate increases finalby. According to the above phe- nomenon, the size of I P P codebook is selected as 128

in our system.

Since the I P P codebook is empty at the beginning, the hit ratio is not very high originally. However, as time passed, the cache is filled with those typical IP- P’s and the efficiency increases. In fact, the proposed system is simulated for several head and shoulders se-

quences. Some primary simulation results give 34 dB PSNR even a t lower than 10 Kbps bit-rate for long- timed sequences. For short-timed sequences, since the utilization of cache codiebook is not high enough, the average bit-rate will be between 10-20 Kbps with ac- ceptable quality. Since the IPP’s generally locates a t the important parts of human features, more efforts are spent on investigating the application about content analysis currently. An example of I P P codebook content after coding several frames for sequence “Miss

America” is shown in Figure 5.

5. C O N C L U S I O N

A pseudo object-oriented very low bit-rate video coding system is proposed in this paper. The proposed system uses modified optical flow based motion estimation algorithm to remlove the temporal redundancy; the spatial redundancy in the motion field are removed by the arbitrarily shaped transform and rectangle ap- proximation. Finally, for those parts which are not well predicted by motion compensation, cache VQ is suggested t o achieve further compression. Currently we are studying the advanced coding of MFO’s and how to use the coded information for video recognition. From our simulation results, we believe the proposed method is potential for very low bit-rate video coding or even future content-based applications.

6. R E F E R E N C E S

K. Aizawa, H. Haratshima and T. Saito, “Model- based analysis-synthesis image coding (MBASIC) system for a person’s face”, Signal Processing: Im-

age Communication, vol. 1, pp. 139-152, 1989.

J. Ostermann, “Object-based analysis-synthesis oding based on the source model of moving rigid 3D objects”, Signal Processing: Image Communi- cation, vol. 6, pp. 1413-161, 1994.

M. Wollborn, “Prototype prediction for color update in object-based analysis-synthesis coding”,

(4)

IEEE Trans. Circuit and System for Video Tech- nology, vol. 4, pp. 236-245, Jun. 1994.

[4] W. Li, V. Bhaskaran, and M. Kung, “Very low bit- rate video coding with DFD segmentation”, Signal Processing: Image Communication, vol. 7, no. 4 6 , pp. 419-434, NOV. 1995.

[5] D. Qian, “A motion compensated subband coder for very low bit-rates”

,

Signal Processing: Image Communication, vol. 7, no. 4-6, pp. 397-418, Nov. 1995.

[6] C.-W. Ku, Y.-M. Chiu, L.-G. Chen, and Y.-P. Lee, “Building a pseudo object-oriented very low bit- rate video coding system from a modified optical flow motion estimation algorithm”,

t o

appear zn ICASSP ’96.

171 C.-W. Ku, Y.-M. Chiu, L.-G. Chen, and Y.-P. Lee, “The arbitrarily shaped transform of segmented motion field for a pseudo object-oriented very low bit-rate video coding system”,

t o

appear in IS-

CAS’96.

[8] C.-W. Ku, L.-G. Chen, and

T.-D.

Chiueh, “Cache vector quantisation algorithm in video compression”, IEE Electronic Letters, vol. 29, No. 16, pp. 1423-1424, Aug. 1993.

Figure 1: Results of MOFA for“Miss American”: (a)

motion field, (b) reconstructed frame.

Table 1: Results for different codebook sizes. Size

1

PSNR

1

Bit-rate

1

Hit Ratio

32

I

33.332 dB

I

9.90 Kbps

I

69.9% 33.348 d B 8.33 Kbps 82.8% 128 33.382 dB 7.71 Kbps 85.7%

1

: : 6

I

33.341 d B

1

7.78 Kbps

1

84.9% 512 33.341 d B 9.40 Kbps 85.0%

Figure 2: Localization of MFO: (a) reconstructed error, (b) locations of MFO.

1

zxtr11drd MFO

Figure 3: Extend MFO into image pattern prototype.

Figure 4: Location of the I P P including t h e mouth.

’ “Ly

>...

I

Figure 5: Content of the IPP codebook for “Miss Amer- ica”.