A neural fuzzy system for image motion estimation


C.T. Lin∗, I.F. Chung, L.K. Martin Sheu

Department of Electrical and Control Engineering, National Chiao-Tung University, 1001 Ta Hsueh Road, Hsinchu, Taiwan

Received April 1998; received in revised form November 1998

Abstract

Many methods for computing optical flow (image motion vectors) have been proposed, and others continue to appear. Block-matching methods are widely used because they are simple and easy to implement. In block-matching methods, the motion vector is uniquely defined by the best fit of a small reference subblock from a previous image frame within a larger search region of the present image frame. Hence, this method is very sensitive to real environments (involving occlusion, specularity, shadowing, transparency, etc.). In this paper, a neural fuzzy system with robust characteristics and learning ability is combined with the block-matching method to make the system adaptive to different circumstances. In the neural fuzzy motion estimation system, each subblock in the search region is assigned a similarity membership that contributes to the motion vector to a different degree. This system is more reliable, robust, and accurate in motion estimation than many other methods, including Horn and Schunck’s optical flow, the fuzzy logic motion estimator (FME), best block matching, NR, and fast block matching. Since fast block-matching algorithms can be used to reduce search time, a three-step fast search method is employed to find the motion vector in our system. However, the candidate motion vector is often trapped in a local minimum, which yields an undesirable motion vector. An improved three-step fast search method is tested to reduce the effect of local minima, and several fast search algorithms are compared. In addition, a Quarter Compensation Algorithm is proposed for compensating the interframe image, to tackle the problem that the motion vector is not an integer but a floating-point value. Since our system gives an accurate motion vector, the motion information may be used in many different applications such as motion compensation, CCD camera auto-focusing or zooming, moving object extraction, etc. Two application examples are illustrated in this paper.
© 2000 Elsevier

Keywords: Optical flow; Motion vector; Block matching; Membership function; Backpropagation; Affine motion

1. Introduction

The measurement of optical flow (or image motion vectors) is a fundamental problem in the processing of image sequences. The goal is to compute an approximation to the 2-D motion field – a projection of the 3-D velocities of surface points onto the imaging surface. Several methods exist for computing optical flow: block-matching methods, differential techniques, etc. [3]. Regrettably, however, these systems are usually sensitive to noise, do not easily converge to the desired motion estimate, and have poor accuracy. This paper introduces a neural fuzzy system to attack these problems.

∗ Corresponding author.

Many methods for computing optical flow have been proposed and classified by Barron [3]. Differential techniques, phase-based and energy-based methods, and region-based matching techniques are the most widely used means for computing the 2-D motion field. Differential techniques compute velocity from spatiotemporal derivatives of image intensity or of filtered versions of the image (using low-pass or band-pass filters) [8,16,26,27,30,31]. One requirement of differential techniques is that I(x, t) (the image intensity at pixel location x and time t) must be differentiable. This implies that temporal smoothing at the sensor is needed to avoid aliasing and that numerical differentiation must be done carefully. Another problem of differential techniques is that more than two spatiotemporal patterns of image intensity are required if more accurate results are desired. Regrettably, storing more than two patterns occupies much memory and takes much time. In phase-based methods, image velocity is defined in terms of the phase behavior of band-pass filter outputs. The zero-crossing techniques [6] are classified as phase-based methods because a zero crossing can be viewed as a level phase crossing. Energy-based techniques, which are based on the output energy of velocity-tuned filters, are also called frequency-based methods owing to the design of the velocity-tuned filters in the Fourier domain [9]. Region-based matching methods attempt to find the best fit of a small reference subblock from a previous image frame in a larger search region of the present image frame [10,18,19]. More precisely, an image frame is segmented into small two-dimensional blocks of N × N pixels. Each block searches for the displacement that produces the best match among the candidate blocks in the present frame. Fast block-matching algorithms can be used to reduce search time [18,19]. However, fast search algorithms are easily trapped in a local minimum and thus produce considerable error.

Because of noise or aliasing in the image acquisition process, finding a perfect match is impossible. Since fuzzy logic offers greater generality, higher expressive power, an enhanced ability to model real-world problems and, most importantly, a methodology for exploiting the tolerance for imprecision, it can be applied to image motion estimation. The fuzzy logic motion estimator (FME) proposed by Lipp [23] is a method based on the block-matching algorithm linked with fuzzy logic. In the FME, for each subblock comparison, the fuzzy logic system classifies an (M × M) subblock in the search region according to its potential membership in the subblock being matched, using a fuzzy membership function (FMF). On real-image data with uniform and affine-modeled frame-to-frame image motion, the FME has produced more accurate motion vector estimates than Horn and Schunck’s optical flow algorithm [16] or Netravali and Robbins’ pel recursive algorithm [28]. In the FME system, however, the generation of membership for each subblock comparison is quite subjective; this subjectivity in turn harms the accuracy of the motion estimation, even though the FME technique performs better than conventional methods.

This paper integrates the conventional region-based matching method and a five-layered neural fuzzy network into a system for estimating image motion. The backpropagation (BP) learning rule is used to choose proper membership functions so that the system adapts to different environments. Since the fast block-matching algorithm is a good choice for reducing search time, it was adopted in this system. However, the local minimum problem appears in the search process of conventional fast block-matching algorithms and causes considerable error in the motion estimation, so an improved fast block-matching method is proposed to lessen this problem. Some comparisons of fast search algorithms are made. The system is more reliable and accurate for image motion estimation than other methods, including Horn and Schunck’s optical flow algorithm, best block-matching, Netravali and Robbins’ (NR) pel recursive algorithm, FME, and the three-step fast block-matching algorithm. In summary, the significant characteristics of the proposed system are that it is more robust than several previous methods; it estimates motion vectors more accurately than several other methods; its parameters are tunable through neural learning; each subblock in the search region is assigned a similarity membership that contributes to the motion estimate to a different degree; and it adopts a modified fast block-matching method that reduces the local minimum problem.


The architecture and learning algorithm of the proposed system are described in Section 2. Some performance comparisons are made in Section 3. In Section 4, the applications of the proposed system are illustrated. Conclusions are summarized in Section 5.

2. The architecture of the neural fuzzy motion estimator

Fig. 1 shows the system architecture of the proposed neural fuzzy motion estimator. It is a five-layered network structure. Fig. 2 is the functional block diagram of this system. First, two image sequences containing moving objects were processed by a region-based matching technique that attempts to find the best fit of a small reference subblock from a previous image frame in a larger search region of the present image frame. Second, the outputs from region-based matching were fed into a neural fuzzy motion estimator, which computed the motion field of the image sequences.

Nodes at Layer 1 were input units representing inputs in linguistic variables. Nodes at Layer 2 were input term nodes, which acted as membership functions performing the similarity measure of each subblock in the search region of the present image frame. Layer 3 represented the rule nodes. Layer 4 contained output term nodes, and Layer 5 contained two output nodes producing the crisp image velocity values in the x1 and x2 directions, respectively. In the proposed system, the motion vector was not decided merely by the best fit of the reference subblock from a previous image frame in the larger search region of the present image frame. For each subblock matching, a fuzzy membership function (FMF) gives a potential membership to each subblock in the search region. A potential membership represents the confidence measure of a motion vector. This was performed over all subblocks in the search region. The matching degrees then passed through the fuzzy reasoning process (Layers 3 and 4) and the defuzzification process (Layer 5) to obtain two crisp outputs; these outputs indicate the motion vector of the reference subblock in the search region. To train (or, more precisely, to calibrate) the proposed network, a known, uniformly displaced and/or affine-modeled frame-to-frame image was used to obtain the desired motion vectors for learning the proper parameters of the FMFs. Moreover, a modified fast search scheme was used in this system. In both learning and processing, the fast search scheme increased the learning rate and reduced the search time, provided it could correctly find the global minimum of the matching criterion in the region-based matching algorithm. Each part of this system will now be described in greater detail.

2.1. Block-matching algorithm

A block-matching algorithm attempts to find the best fit of a small reference subblock from a previous image frame in a larger search region of the present image frame, as depicted in Fig. 3. As the figure shows, this algorithm compares the (M × M) reference subblock with each same-sized portion of the (N × N) search region. In general, the search region is larger than the reference subblock but much smaller than the image frame containing it. There are many choices of criterion for determining the best match, e.g., the cross-correlation function, the mean square error (MSE), the mean absolute difference (MAD) [5], etc. If one chooses the mean square error (MSE) as the matching criterion, one has

$$\mathrm{MSE}(d_1, d_2) = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \left[ I_s(i + d_1, j + d_2) - I_r(i, j) \right]^2, \tag{1}$$

where $I_r$ is the reference subblock in the previous image frame, $I_s$ is the search region in the present image frame, N × N is the size of the search region, and M × M is the size of the reference subblock. The displacement $(d_1, d_2)$ that minimizes $\mathrm{MSE}(d_1, d_2)$ is selected as the motion vector.
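The exhaustive search implied by Eq. (1) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: images are plain nested lists, and the function names `mse` and `best_match` are hypothetical. It also assumes that the offset (d1, d2) indexes the top-left corner of the candidate window inside the search region, so offsets are non-negative here.

```python
def mse(ref, search, d1, d2):
    """Eq. (1): mean square error between the M x M reference block and the
    same-sized window of the search region at offset (d1, d2)."""
    m = len(ref)
    total = 0.0
    for i in range(m):
        for j in range(m):
            diff = search[i + d1][j + d2] - ref[i][j]
            total += diff * diff
    return total / (m * m)


def best_match(ref, search):
    """Full search: return the offset (d1, d2) with the smallest MSE."""
    m, n = len(ref), len(search)
    # Assumption: every window that fits inside the search region is a candidate.
    offsets = [(d1, d2) for d1 in range(n - m + 1) for d2 in range(n - m + 1)]
    return min(offsets, key=lambda d: mse(ref, search, d[0], d[1]))
```

Best-fit block matching keeps only the single winning offset returned by `best_match`; the neural fuzzy estimator described next instead uses the MSE of every candidate.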


Fig. 1. System architecture of the proposed neural fuzzy motion estimator (NFME).

Layer 1 (Input layer): The nodes in this layer just transmit input values to the next layer directly. That is,

$$f = u_i^{(1)} \quad \text{and} \quad a = f, \tag{2}$$

where $u_i^{(1)} = \mathrm{MSE}(d_1, d_2)$ and $i$ denotes the $i$th compared subblock, corresponding to displacement $(d_1, d_2)$ in the search region.

Layer 2 (Input term nodes): A single node was used to act as a membership function and a bell-shaped Gaussian function was selected as the input membership function,

$$B(x; m, \sigma) = e^{-\frac{1}{2} \cdot \frac{(x - m)^2}{\sigma^2}}, \tag{3}$$

where $x$ is the input to the node from Layer 1, and $m$ and $\sigma$ are the center and width of the bell-shaped function, respectively. In this layer, the center of the bell-shaped function was set to zero and the width was tuned in the learning procedure to fit different environments. Hence, we have

$$f = -\frac{1}{2} \cdot \frac{\mathrm{MSE}(d_1, d_2)}{\sigma^2} \quad \text{and} \quad a \equiv \mathrm{member}(d_1, d_2) = e^{f}, \tag{4}$$

where member$(d_1, d_2)$ is the membership value representing the confidence measure of each subblock displaced by $d_1$ and $d_2$ pixels along the horizontal and vertical axes, respectively. If member$(d_1, d_2)$ is close to 1, then the subblock’s motion vector probably equals $(d_1, d_2)$; if member$(d_1, d_2)$ is close to 0, then the subblock probably does not have a motion vector equal to $(d_1, d_2)$.

Fig. 2. The functional diagram of the proposed neural fuzzy system for image motion estimation.

Fig. 3. Block-matching algorithm.
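As a small illustration of the Layer 2 computation in Eq. (4), the sketch below turns MSE values into memberships. The MSE values are hypothetical, the function name `membership` is illustrative, and the width 6.067 is the learned value reported later in the paper.

```python
import math


def membership(mse_value, sigma):
    """Eq. (4): confidence that a candidate displacement is the true motion.

    The Gaussian center is fixed at zero; sigma is the tunable width.
    """
    return math.exp(-0.5 * mse_value / sigma ** 2)


# Hypothetical MSE values for three candidate displacements (d1, d2).
mse_by_disp = {(0, 0): 40.0, (-3, 2): 0.5, (-3, 3): 12.0}
members = {d: membership(x, sigma=6.067) for d, x in mse_by_disp.items()}
```

A larger width flattens the memberships, so more candidates contribute to the final estimate; a smaller width approaches best-fit block matching.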

Layer 3 (Rule nodes): The fuzzy inference rules for generating fuzzy outputs are of the form

If member$(d_1, d_2)$ then $y_{x_1} = D_1$ and $y_{x_2} = D_2$,

where member$(d_1, d_2)$ is given in Eq. (4), $D_i$ is the fuzzy number of $d_i$ (e.g., if $d_1 = 1$, then $D_1$ is the fuzzy number “1”), and $y_{x_1}$ and $y_{x_2}$ are the output linguistic variables representing the displacement (in pixels/time-step) along the horizontal and vertical axes, respectively. Uniformly spaced, overlapping Gaussian functions were used as the output membership functions in this system. The membership functions for the output linguistic variable $y_{x_1}$ are depicted in Fig. 4. The membership functions for $y_{x_2}$ are exactly the same.

Fig. 4. The output membership functions.

From the above fuzzy rule, there is only one input for each rule node in Layer 3 (see Fig. 1). Hence, the (rule) node in this layer just transmits the input values to the next layer directly. That is,

$$f = u_i^{(3)} \quad \text{and} \quad a = f. \tag{5}$$

In total, there are $(N - M)^2$ rules for all possible displacements (disp1, disp2) of the reference subblock in the search region. In other words, there are $(N - M)^2$ rule nodes in Layer 3.

Layer 4 (Output term nodes): The links at Layer 4 perform the fuzzy OR operation to integrate the fired rules that have the same consequence. Hence, we have

$$f = \sum \mathrm{member}(d_1, d_2) \quad \text{and} \quad a = f, \tag{6}$$

and the link weight is $w_i^{(4)} = 1$.

Layer 5 (Output nodes): The node at this layer performs the defuzzification process. The following function can be used to approximate the center of area (COA) defuzzification method:

$$f = \sum w_{ij}^{(5)} u_i^{(5)} \quad \text{and} \quad a = \frac{f}{\sum u_i^{(5)}}, \tag{7}$$

where $w_{ij}^{(5)}$ is the link weight, assigned the displacement value $d_i$.
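Layers 4 and 5 together amount to a membership-weighted average of the candidate displacements, which the following sketch makes concrete. It assumes the memberships are supplied as a dictionary keyed by displacement; the name `defuzzify` is hypothetical. Summing over the dictionary implicitly performs the Layer 4 aggregation of rules that share the same consequence.

```python
def defuzzify(members):
    """Weighted centroid of candidate displacements (Eq. (7), one per axis).

    members maps a candidate displacement (d1, d2) to its membership value.
    """
    total = sum(members.values())
    y1 = sum(d1 * a for (d1, _), a in members.items()) / total
    y2 = sum(d2 * a for (_, d2), a in members.items()) / total
    return y1, y2
```

Because the output is an average, it can take non-integer values even though every candidate displacement is an integer number of pixels.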

2.2. Learning unit

A Gaussian function is used as the fuzzy membership function (FMF) in Layer 2 (see Figs. 1 and 2), with the parameter σ to be determined adaptively for various environments. The backpropagation (BP) learning rule is used to find the proper value of this parameter. Consequently, training data indicating the desired motion vectors are required for learning. The training data are obtained from a real image synthetically displaced with affine motion, which is defined by

$$\begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} 1 - \cos(\theta_r) & \sin(\theta_r) \\ -\sin(\theta_r) & 1 - \cos(\theta_r) \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}, \tag{8}$$

where $[d_1, d_2]^T$ is the motion vector of the synthetic image, $\theta_r$ is the clockwise angle of rotation, and $u_1, u_2$ are the uniform components of the motion in the $x_1$ (vertical) and $x_2$ (horizontal) directions, respectively. The detailed derivation of the learning algorithm is shown below. The output error is defined as

$$E = \frac{1}{2}\left[y_{x_1} - \hat{y}_{x_1}\right]^2 + \frac{1}{2}\left[y_{x_2} - \hat{y}_{x_2}\right]^2, \tag{9}$$

where $y_{x_1}, y_{x_2}$ are the desired outputs in the $x_1$ and $x_2$ directions, respectively, and $\hat{y}_{x_1}, \hat{y}_{x_2}$ are the corresponding actual outputs, which are given by

$$\hat{y}_{x_1} = \frac{\sum_i \sum_j m_i w a_{ij}}{\sum_i \sum_j w a_{ij}}, \tag{10}$$

$$\hat{y}_{x_2} = \frac{\sum_i \sum_j m_j w a_{ij}}{\sum_i \sum_j w a_{ij}}, \tag{11}$$

$$a_{ij} = e^{-(1/2)\, x_{ij}^2 / \sigma_{ij}^2}, \tag{12}$$

$$x_{ij}^2 = \mathrm{MSE}(i, j), \tag{13}$$

where $\mathrm{MSE}(i, j)$ is the mean square error used as the matching criterion defined in Eq. (1), $w$ is the width of the output membership function, and $m_i, m_j$ are the centers of the output membership functions in the $x_1$ and $x_2$ directions, respectively. The backpropagation learning rule then gives

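$$\sigma'_{ij} = \sigma_{ij} + \Delta\sigma_{ij}, \tag{14}$$

$$\Delta\sigma_{ij} = -\eta \frac{\partial E}{\partial \sigma_{ij}} \tag{15}$$

$$= -\eta \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \sigma_{ij}} \tag{16}$$

$$= -\eta \left[ \frac{\partial E}{\partial \hat{y}_{x_1}} \frac{\partial \hat{y}_{x_1}}{\partial \sigma_{ij}} + \frac{\partial E}{\partial \hat{y}_{x_2}} \frac{\partial \hat{y}_{x_2}}{\partial \sigma_{ij}} \right] \tag{17}$$

$$= \Delta\sigma^{1}_{ij} + \Delta\sigma^{2}_{ij}, \tag{18}$$

where $\Delta\sigma^{1}_{ij} = -\eta\, \partial E/\partial \hat{y}_{x_1} \cdot \partial \hat{y}_{x_1}/\partial \sigma_{ij}$ and $\Delta\sigma^{2}_{ij} = -\eta\, \partial E/\partial \hat{y}_{x_2} \cdot \partial \hat{y}_{x_2}/\partial \sigma_{ij}$, which can be calculated as follows:

1. $\partial E / \partial \hat{y}_{x_1}$:

$$\frac{\partial E}{\partial \hat{y}_{x_1}} = -\left[y_{x_1} - \hat{y}_{x_1}\right]; \tag{19}$$

2. $\partial \hat{y}_{x_1} / \partial \sigma_{ij}$:

$$\frac{\partial \hat{y}_{x_1}}{\partial \sigma_{ij}} = \frac{\partial \hat{y}_{x_1}}{\partial a_{ij}} \cdot \frac{\partial a_{ij}}{\partial \sigma_{ij}} = \sum_{i=1}^{M} \sum_{j=1}^{M} \left\{ \frac{\sum_i \sum_j m_i \cdot \sum_i \sum_j a_{ij} - \sum_{i=1}^{M} \sum_{j=1}^{M} m_i a_{ij}}{\left( \sum_{i=1}^{M} \sum_{j=1}^{M} a_{ij} \right)^2} \times e^{-(1/2)\, x_{ij}^2 / \sigma_{ij}^2} \times \frac{x_{ij}^2}{\sigma_{ij}^3} \right\}. \tag{20}$$

In Eq. (20), the term $\sum_i \sum_j m_i = 0$, because the output membership functions are chosen as Gaussian functions placed symmetrically with respect to the origin. Hence, $\Delta\sigma^{1}_{ij}$ becomes

$$\Delta\sigma^{1}_{ij} = \eta \left[y_{x_1} - \hat{y}_{x_1}\right] \sum_{i=1}^{M} \sum_{j=1}^{M} \left\{ -\frac{\sum_{i=1}^{M} \sum_{j=1}^{M} m_i a_{ij}}{\left( \sum_{i=1}^{M} \sum_{j=1}^{M} a_{ij} \right)^2} \times e^{-(1/2)\, x_{ij}^2 / \sigma_{ij}^2} \times \frac{x_{ij}^2}{\sigma_{ij}^3} \right\}. \tag{21}$$

Similarly, $\Delta\sigma^{2}_{ij}$ can be derived as

$$\Delta\sigma^{2}_{ij} = \eta \left[y_{x_2} - \hat{y}_{x_2}\right] \sum_{i=1}^{M} \sum_{j=1}^{M} \left\{ -\frac{\sum_{i=1}^{M} \sum_{j=1}^{M} m_j a_{ij}}{\left( \sum_{i=1}^{M} \sum_{j=1}^{M} a_{ij} \right)^2} \times e^{-(1/2)\, x_{ij}^2 / \sigma_{ij}^2} \times \frac{x_{ij}^2}{\sigma_{ij}^3} \right\}. \tag{22}$$

The final update rule is

$$\Delta\sigma_{ij} = \Delta\sigma^{1}_{ij} + \Delta\sigma^{2}_{ij}. \tag{23}$$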

Eq. (23) is an update rule for the width of the FMFs. Since objects can move arbitrarily in image sequences, it is reasonable to let all the FMFs have the same width, σ. The common width parameter, σ, is updated by the average value of all $\Delta\sigma_{ij}$ in each epoch of learning.
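The idea of tuning one shared width by gradient descent on the output error can be sketched as below. This is not a transcription of Eqs. (20)–(22): it assumes a single common σ, uses the candidate displacements themselves as the output centers, and applies the ordinary chain rule to the weighted centroid (the output width w cancels). All function names, the learning rate `eta`, and the sample data in the usage note are hypothetical.

```python
import math


def forward(mse_by_disp, sigma):
    """Gaussian memberships (Eq. (12)) and their weighted centroid."""
    a = {d: math.exp(-0.5 * x / sigma ** 2) for d, x in mse_by_disp.items()}
    total = sum(a.values())
    y1 = sum(d[0] * v for d, v in a.items()) / total
    y2 = sum(d[1] * v for d, v in a.items()) / total
    return y1, y2, a


def loss(mse_by_disp, sigma, target):
    """Output error E of Eq. (9) for one training sample."""
    y1, y2, _ = forward(mse_by_disp, sigma)
    return 0.5 * (target[0] - y1) ** 2 + 0.5 * (target[1] - y2) ** 2


def loss_gradient(mse_by_disp, sigma, target):
    """dE/dsigma for one shared width, using da/dsigma = a * MSE / sigma**3."""
    y1, y2, a = forward(mse_by_disp, sigma)
    scale = sum(a.values()) * sigma ** 3
    dy1 = sum((d[0] - y1) * v * mse_by_disp[d] for d, v in a.items()) / scale
    dy2 = sum((d[1] - y2) * v * mse_by_disp[d] for d, v in a.items()) / scale
    return -(target[0] - y1) * dy1 - (target[1] - y2) * dy2


def train_sigma(samples, sigma, eta, epochs):
    """Per epoch, move sigma by the average update over all samples."""
    for _ in range(epochs):
        grads = [loss_gradient(m, sigma, t) for m, t in samples]
        sigma -= eta * sum(grads) / len(grads)
    return sigma
```

Each sample is a pair of an MSE table keyed by displacement and the desired motion vector for that block, for example `train_sigma([({(0, 0): 4.0, (1, 0): 1.0}, (1.0, 0.0))], sigma=2.0, eta=1.0, epochs=20)`.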

2.3. Fast search algorithm

In the proposed neural fuzzy motion estimator, the motion vectors in image sequences are determined by the confidence measures of all candidate locations in the search region, and especially by those locations with greater membership values. The greatest membership falls at the location whose MSE value is the global minimum in the search region. Moreover, it is reasonable to assume that the significant memberships lie at the locations around the one with the greatest membership. Hence, when the global minimum of the MSE is found in the search region, we can use the 3 × 3, 5 × 5, ... MSEs centered at the global minimum to calculate an approximate motion vector, which will be close to the desired motion vector. If the global minimum of the MSE can be found without a full search, the search time can be reduced and the learning rate can be increased. Hence, a modified fast search algorithm is incorporated into the learning unit of this system to speed up both learning and processing.

To reduce the heavy computational cost resulting from the massive number of candidate locations, the three-step fast block search algorithm [19] searches for the best motion vector in a coarse-to-fine manner. Fig. 5 illustrates the procedure of the three-step search with an example of motion vector (−7, 5). In the first step, nine sparsely located candidates are evaluated and the one with the minimal MSE is picked out. In the second step, the search focuses on the area centered at the winner of the previous step, but the distances between candidate locations are halved. In the same manner, the third step compares the MSEs of the nine locations around the winner found in the second step and then gives the final motion vector. The three-step search algorithm commonly uses the range d1 = d2 = 7. In this manner, the number of search locations decreases to 1/9 of that of the full search approach.
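The original three-step search described above can be sketched as follows. The function name `three_step_search` and the callable `cost` are illustrative assumptions; `cost(d1, d2)` stands for the matching error of Eq. (1) at a signed displacement, and step sizes of 4, 2, and 1 are assumed so that the search covers displacements up to 7 pixels in each direction.

```python
def three_step_search(cost, start=(0, 0), step=4):
    """Coarse-to-fine search; returns the best displacement and the number
    of cost evaluations performed."""
    best = start
    best_cost = cost(*best)
    evaluations = 1
    while step >= 1:
        center = best  # winner of the previous step
        for s1 in (-step, 0, step):
            for s2 in (-step, 0, step):
                if (s1, s2) == (0, 0):
                    continue  # the center has already been evaluated
                cand = (center[0] + s1, center[1] + s2)
                c = cost(*cand)
                evaluations += 1
                if c < best_cost:
                    best, best_cost = cand, c
        step //= 2  # halve the spacing between candidates
    return best, evaluations
```

With three steps the sketch evaluates 25 locations, against 15 × 15 = 225 for a full search over the same range, which matches the 1/9 ratio quoted above. On a cost surface with several minima the first coarse step can select the wrong basin, which is the local minimum problem discussed next.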

In practice, many fast block-matching algorithms are trapped in a local minimum, the three-step search algorithm included. This paper thus proposes a modified fast search algorithm. In the second and third steps, the coarse-to-fine search is replaced by a full search to reduce the extent to which the search process is trapped in a local minimum. With this modification, the number of search locations decreases to 1/5 of that of the exhaustive search. Some comparisons of the full search, the original 3-step search, and the modified 3-step search are made in the following.

2.4. Experimental results of learning

The target motion vectors of the image sequences, obtained from the affine motion equation (Eq. (8)), are used for updating the parameter σ of the fuzzy membership functions. The update rule is given by Eq. (23). Three cases are considered to show the learning ability of the proposed neural fuzzy motion estimator:

• Learning an image with different motion parameters: The training data are made using a real image [...] different uniform translation components (u1, u2), namely (2, 2), (−2, 2), (2, −2), and (−2, −2), are used to produce training patterns representing the general motion of moving objects when they move arbitrarily in the image.

• Learning two different images: The above image is replaced with a new one, the same experiment is repeated, and the value of σ is observed.

• Learning an image using different numbers of rules: Different numbers of rules, 3 × 3, 5 × 5, 7 × 7, and 9 × 9 (1 × 1 corresponds to best block matching), are used for an image to observe the value of σ.

Fig. 5. Three-step fast block search algorithm.

Fig. 6. (a), (b) Two different images for learning. (c), (d) Two rotated images for learning.

Two different images and two rotated ones for learning are shown in Fig. 6. Some training patterns, represented as motion field flow maps, are depicted in Fig. 7. The final values of σ updated by different training patterns made by affine motion from two different images are listed in Table 1.
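The desired motion fields used as training targets follow directly from Eq. (8), as the sketch below illustrates. The rotation angles (4°, 8°, −4°, −8°) and the translation components are the ones named in the paper; the grid centered on the origin, the dictionary layout, and the function names are assumptions made for illustration.

```python
import math


def affine_displacement(x1, x2, theta_deg, u1, u2):
    """Eq. (8): displacement (d1, d2) of the pixel at (x1, x2)."""
    t = math.radians(theta_deg)
    d1 = (1 - math.cos(t)) * x1 + math.sin(t) * x2 + u1
    d2 = -math.sin(t) * x1 + (1 - math.cos(t)) * x2 + u2
    return d1, d2


def training_fields(half_size, angles=(4, 8, -4, -8),
                    shifts=((2, 2), (-2, 2), (2, -2), (-2, -2))):
    """One desired motion field per (angle, shift) pair.

    Assumption: pixel coordinates are measured from the image center.
    """
    grid = range(-half_size, half_size + 1)
    return {
        (theta, shift): {
            (x1, x2): affine_displacement(x1, x2, theta, *shift)
            for x1 in grid for x2 in grid
        }
        for theta in angles for shift in shifts
    }
```

For small angles the rotational part contributes only a fraction of a pixel near the center and grows with distance from it, so the targets are generally non-integer.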

The average value of σ is obtained by averaging all the values learned from the different training patterns, and this average serves as the width parameter of the input membership functions. Table 1 shows two different average values of σ for the two images, indicating that the parameter of the fuzzy membership function must be chosen carefully. Hence, in practical applications, one should tune the parameters of the input membership functions to find a value that fits the circumstances at hand.

Fig. 7. The desired motion vectors. (a) (u1, u2) = (2, 2), θr = 4°. (b) (u1, u2) = (2, 2), θr = 8°. (c) (u1, u2) = (2, 2), θr = −4°. (d) (u1, u2) = (2, 2), θr = −8°.

As observed from Table 2, different numbers of rules result in different values of σ. That is, if one wants to produce crisp values of motion estimation using a different number of rules, then the value of σ must be chosen accordingly.

2.5. Comparisons of various fast block-matching algorithms

The performance of fast search algorithms is affected by local minima. Hence, an algorithm with a larger search range than the original three-step search algorithm was proposed to reduce the error. With the modified three-step search algorithm described previously, the number of search locations increases from 1/9 to 1/5 of that of the exhaustive approach. Two synthetically, uniformly translated images in Fig. 6 were used to examine the performance of the fast search and improved fast search algorithms. The full search algorithm was regarded as having perfect performance (i.e., a 0% error rate) in the search process, and the error percentages of the three-step and modified three-step methods were measured with respect to it. The original three-step search scheme had a 54.32% error rate, while the modified method with the larger search range had about 20.26% error. The local minimum problem still existed in the improved method. Hence, the learning rate could be increased effectively if the fast search algorithm worked well in the search process.

Table 2
The learned values of the width parameter σ when learning with different numbers of rules

Rule no.    Image 1    Image 2
3 × 3       5.341      5.92
5 × 5       7.597      6.01
7 × 7       19.91      6.61
9 × 9       19.85      8.43

3. Test and comparisons of synthetic image sequences

This section examines the performance of the proposed technique and some existing image motion estimation techniques on synthetic image sequences for which the 2-D motion fields are known. Before discussing the experimental results, however, it is essential to describe the image sequences used for comparison and the measures of error.

3.1. Synthetic image sequences

The main advantage of using synthetic inputs is that the 2-D motion fields and scene properties can be controlled and tested in a methodical way. One may use the known motion vectors to quantify the performance of a specific algorithm. On the other hand, it must be kept in mind that such inputs are usually clean signals (involving no specularity, shadowing, transparency, occlusion, etc.), and therefore this measure of performance should be taken as an optimistic bound on the expected errors with real-image sequences. The synthetic images are made by means of affine motion (shown in Eq. (8)), and the synthetic image sequences include

• Translation sequences: Results for the case with velocity v = (−3, 2) are reported (see Fig. 8). Hence, the desired motion vectors fall at location (−3, 2) of the 2-D plane when the motion vectors are represented in a Cartesian coordinate system. The better the motion estimation is, the closer the estimated motion vectors are to the point (−3, 2). This can be observed from the scatter map of the motion vectors.

• Rotation sequences: For this experiment, the image is rotated 6° clockwise using affine-modeled motion to produce the present image frame (see Fig. 8).

3.2. Image motion estimation techniques for comparisons

Fig. 8. Testing images. (a) Original image. (b) Image after translation, v = (−3, 2). (c) Image after rotation, θr = 6° clockwise.

Synthetic image sequences of moving brightness patterns as mentioned above have been processed by several different methods, including Horn and Schunck’s method, FME, Netravali and Robbins’ pel recursive algorithm, best-fit block-matching, and this paper’s neural fuzzy motion estimator. These methods are described as follows.

• Horn and Schunck (HS) [16]: Horn and Schunck [16] combined the gradient constraint with a global smoothness term to constrain the estimated velocity field $\mathbf{v} = (u(\mathbf{x}, t), v(\mathbf{x}, t))$ in minimizing the error equation. The gradient constraint is

$$\nabla I(\mathbf{x}, t) \cdot \mathbf{v} + I_t(\mathbf{x}, t) = 0, \tag{24}$$

where $I_t(\mathbf{x}, t)$ denotes the partial time derivative of $I(\mathbf{x}, t)$, $\nabla I \cdot \mathbf{v}$ denotes the usual dot product, and $\nabla I(\mathbf{x}, t) = (I_x(\mathbf{x}, t), I_y(\mathbf{x}, t))^T$. The total error to be minimized is

$$\int_D \left[ (\nabla I \cdot \mathbf{v} + I_t)^2 + \lambda^2 \left( \|\nabla u\|_2^2 + \|\nabla v\|_2^2 \right) \right] d\mathbf{x} \tag{25}$$

defined over a domain D, where the magnitude of λ reflects the influence of the smoothness term. This study used λ = 50 because it produced better results in most test cases. Iterative equations are used to minimize Eq. (25) and obtain the motion vectors:

$$u^{k+1} = \bar{u}^{k} - \frac{I_x \left[ I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \right]}{\lambda^2 + I_x^2 + I_y^2}, \tag{26}$$

$$v^{k+1} = \bar{v}^{k} - \frac{I_y \left[ I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \right]}{\lambda^2 + I_x^2 + I_y^2}, \tag{27}$$

where $k$ denotes the iteration number, $u^0$ and $v^0$ denote initial velocity estimates (here set to zero), and $\bar{u}^{k}, \bar{v}^{k}$ denote neighborhood averages of $u^{k}$ and $v^{k}$. At least 100 iterations were used in this study to obtain better results. The method with spatiotemporal aliasing was implemented and the subsequent derivative estimates [16] were also improved.

• Fuzzy logic motion estimator (FME) (Lipp [23]): Based on the block-matching algorithm, this technique calculates the displaced frame distortion (DFD) using

$$\mathrm{DFD}(d_1, d_2) = \sum_{i=1}^{M} \sum_{j=1}^{M} \left[ I_s(i + d_1, j + d_2) - I_r(i, j) \right]^2, \tag{28}$$

where $(d_1, d_2)$ is the motion vector, $M$ is the reference block size, $I_r(i, j)$ is the reference subblock in the previous image frame, and $I_s(i, j)$ is the search region in the present image frame (with origin [...] a normalization factor in defining the following membership value, which plays the same role as that in Eq. (4):

$$\mathrm{member}(d_1, d_2) = \begin{cases} 1 - \dfrac{5\,\mathrm{DFD}(d_1, d_2)}{\mathrm{DFD}(0, 0)} & \text{if } \dfrac{5\,\mathrm{DFD}(d_1, d_2)}{\mathrm{DFD}(0, 0)} < 1, \\[2ex] 0 & \text{otherwise.} \end{cases} \tag{29}$$

In defuzzification, uniformly spaced triangular membership functions with little overlap (similar to those shown in Fig. 4) are used for the output membership functions. The outputs are combined using fuzzy centroid defuzzification to produce sharp outputs for directions x1 (vertical) and x2 (horizontal).

• Netravali and Robbins’ pel recursive algorithm (NR) [28]: This motion estimation method attempts to recursively minimize a certain quantity (a function of the motion estimation error). If the changes in successive image frames are due to translation of an object, then the algorithm iterates in a gradient (steepest descent) direction such that consecutive estimates converge to an estimate of the translation. Note that since the NR approach considers only point-to-point displacement, it handles image translation only, not rotation. Assume $I(\mathbf{x}, t)$ and $I(\mathbf{x}, t - \tau)$ are the intensity values of two successive frames as a function of spatial location $\mathbf{x}$ (a two-dimensional vector) at time $t$. The time between the two frames is $\tau$. If an object moves in translation, then in the moving area one has

$$I(\mathbf{x}, t) = I(\mathbf{x} - \mathbf{D}, t - \tau), \tag{30}$$

where $\mathbf{D}$ is the translation vector of the object during the time interval $[t - \tau, t]$. The frame difference at spatial position $\mathbf{x}$ is given by

$$\mathrm{FDIF}(\mathbf{x}) = I(\mathbf{x}, t) - I(\mathbf{x}, t - \tau) \tag{31}$$

$$= I(\mathbf{x}, t) - I(\mathbf{x} + \mathbf{D}, t). \tag{32}$$

In the recursive estimation scheme, the displaced frame difference $\mathrm{DFD}(\mathbf{x}, \hat{\mathbf{D}}^{i-1})$, analogous to $\mathrm{FDIF}(\mathbf{x})$, is defined by

$$\mathrm{DFD}(\mathbf{x}, \hat{\mathbf{D}}^{i-1}) = I(\mathbf{x}, t) - I(\mathbf{x} - \hat{\mathbf{D}}^{i-1}, t - \tau), \tag{33}$$

$$\hat{\mathbf{D}}^{i} = \hat{\mathbf{D}}^{i-1} + \mathbf{U}^{i}, \tag{34}$$

where $\hat{\mathbf{D}}^{i}$ is the $i$th displacement estimate, $\hat{\mathbf{D}}^{i-1}$ is an initial estimate of $\hat{\mathbf{D}}^{i}$, and $\mathbf{U}^{i}$ is the update of $\hat{\mathbf{D}}^{i-1}$ that makes it more accurate (i.e., the estimate of $\mathbf{D} - \hat{\mathbf{D}}^{i-1}$). Then the estimator can be derived as

$$\hat{\mathbf{D}}^{i} = \hat{\mathbf{D}}^{i-1} - \varepsilon\, \mathrm{DFD}(\mathbf{x}_a, \hat{\mathbf{D}}^{i-1})\, \nabla I(\mathbf{x}_a - [\hat{\mathbf{D}}^{i-1}], t - \tau), \tag{35}$$

where $\nabla$ is the gradient with respect to $\mathbf{x}$, $\varepsilon$ is a positive scalar constant, and a pixel at location $\mathbf{x}_a$ is predicted with displacement $\hat{\mathbf{D}}^{i-1}$.

• BF (Best-fit block-matching): This technique is widely used for motion estimation because of its simplicity and its coding efficiency for motion information. A detailed description of this technique is given in Section 2.

• Neural fuzzy motion estimator (NFME): This is the new technique for image motion estimation proposed by the authors and introduced in Section 2. The system can learn proper system parameters for different circumstances and has been tuned using various training patterns obtained from the affine motion equation (Eq. (8)). The rotation angles θr (4°, 8°, −4°, −8°) and the uniform translation components (u1, u2)T, namely (2, 2), (−2, 2), (2, −2), [...] the value σ = 6.067 was obtained as the parameter of the input fuzzy membership functions. Although the synthetic images used for comparison were displaced uniformly by (−3, 2) and rotated 6° clockwise, which differs from the training patterns, the following comparisons show that this system produces more accurate motion estimates. The comparisons also show that the proposed system is more robust than the others.
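Of the methods listed above, the Horn and Schunck iteration of Eqs. (26) and (27) is compact enough to sketch for a single pixel. The function below is an illustrative assumption rather than the implementation used in the experiments: the smoothness weight is named `lam`, the neighborhood averages are passed in precomputed, and the image derivatives are assumed to be available.

```python
def hs_update(u_avg, v_avg, ix, iy, it, lam=50.0):
    """One Horn-Schunck update at a pixel.

    u_avg, v_avg: neighborhood averages of the current velocity estimate.
    ix, iy, it:   spatial and temporal intensity derivatives at the pixel.
    lam:          smoothness weight (50 is the value quoted in the text).
    """
    residual = ix * u_avg + iy * v_avg + it  # gradient-constraint error
    denom = lam ** 2 + ix ** 2 + iy ** 2
    return u_avg - ix * residual / denom, v_avg - iy * residual / denom
```

A full implementation would recompute the neighborhood averages over the whole frame before each of the (at least 100) iterations. When the averaged velocity already satisfies the gradient constraint, the residual is zero and the update leaves it unchanged.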

3.3. Comparison criteria and results

Some criteria must be defined to compare the performance of different methods. The five criteria used here are the root-mean-square velocity difference in the $x_1$-direction (err1), the root-mean-square velocity difference in the $x_2$-direction (err2), the average angle difference ($\theta_{err}$) of the motion vectors, the root-mean-square error of velocity (RMSE), and the average motion vector $(\bar{V}_{x_1}, \bar{V}_{x_2})$ in the translation test. These five criteria are defined as follows:

$$\mathrm{err1} = \sqrt{\frac{1}{n_1 \cdot n_2} \sum \sum \left[ D\_V_{x_1} - V_{x_1} \right]^2}, \tag{36}$$

$$\mathrm{err2} = \sqrt{\frac{1}{n_1 \cdot n_2} \sum \sum \left[ D\_V_{x_2} - V_{x_2} \right]^2}, \tag{37}$$

$$\theta_{err} = \frac{1}{n_1 \cdot n_2} \sum \sum \left| \tan^{-1} \frac{D\_V_{x_2}}{D\_V_{x_1}} - \tan^{-1} \frac{V_{x_2}}{V_{x_1}} \right|, \tag{38}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n_1 \cdot n_2} \sum \sum \left[ \sqrt{D\_V_{x_1}^2 + D\_V_{x_2}^2} - \sqrt{V_{x_1}^2 + V_{x_2}^2} \right]^2}, \tag{39}$$

$$(\bar{V}_{x_1}, \bar{V}_{x_2}) = \left( \frac{1}{n_1 \cdot n_2} \sum_i \sum_j V_{x_1}(i, j),\ \frac{1}{n_1 \cdot n_2} \sum_i \sum_j V_{x_2}(i, j) \right), \tag{40}$$

where $D\_V_{x_1}$ and $D\_V_{x_2}$ are the desired motion vectors, $V_{x_1}$ and $V_{x_2}$ are the actual motion vectors, and $n_1 \times n_2$ is the size of an image frame. Three types of performance representation are used: (i) motion vector flow map – represents the motion vector at every location $(x_1, x_2)$ by a small arrow whose length and direction are proportional to the motion vector’s magnitude and angle; (ii) motion vector scatter map – represents motion vectors on the Cartesian coordinate plane; and (iii) error table – lists all the errors in Eqs. (36)–(40) for comparison.
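The five criteria can be computed as in the sketch below, which takes the desired and estimated motion fields as flat lists of (v1, v2) pairs. The function name and the list layout are illustrative assumptions. The angle term uses `atan2` in place of the arctangent of a ratio so that a zero first component does not cause a division by zero; this is an assumption, and angle wrap-around is not handled.

```python
import math


def motion_errors(desired, actual):
    """Compare two equal-length lists of (v1, v2) motion vectors.

    Returns err1, err2 (pixels), the mean angle difference (degrees),
    the RMSE of the vector magnitudes, and the mean actual vector.
    """
    n = len(desired)
    pairs = list(zip(desired, actual))
    err1 = math.sqrt(sum((d[0] - a[0]) ** 2 for d, a in pairs) / n)
    err2 = math.sqrt(sum((d[1] - a[1]) ** 2 for d, a in pairs) / n)
    # Assumption: atan2(v2, v1) stands in for arctan(v2 / v1).
    angle = sum(
        abs(math.atan2(d[1], d[0]) - math.atan2(a[1], a[0])) for d, a in pairs
    ) / n
    rmse = math.sqrt(
        sum((math.hypot(*d) - math.hypot(*a)) ** 2 for d, a in pairs) / n
    )
    mean = (sum(a[0] for a in actual) / n, sum(a[1] for a in actual) / n)
    return err1, err2, math.degrees(angle), rmse, mean
```

Note that the RMSE of Eq. (39) compares only vector magnitudes, so an estimate with the right length but the wrong direction scores zero on it; the angle criterion covers that case.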

Fig. 9(a) and (b) show the flow map and scatter map, respectively, of the desired motion vectors when the image was uniformly displaced by −3 pixels in the vertical direction and 2 pixels in the horizontal direction. Fig. 9(c) is the flow map computed by the HS method, and Fig. 9(d) shows its scatter map. This motion estimate had err1 = 0.2283 pixels, err2 = 0.7839 pixels, θerr = 20.095 degrees, RMSE = 0.7191, and (V̄x1, V̄x2) = (−1.43, 1.35), as shown in Table 3. Figs. 9(e) and (f) show the motion estimate computed by the FME method, which had err1 = 0.3952 pixels, err2 = 0.6802 pixels, θerr = 12.448 degrees, RMSE = 0.7421, and (V̄x1, V̄x2) = (−2.51, 2.02). Figs. 10(g) and (h) are the flow map and scatter map of the motion estimate from the NR recursive algorithm, which had err1 = 0.5009 pixels, err2 = 0.3555 pixels, θerr = 6.258 degrees, RMSE = 0.6877, and (V̄x1, V̄x2) = (−2.91, 2.35). These errors are listed in Table 3. Figs. 10(i) and (j) show the motion estimate from the BF method; the corresponding errors are listed in Table 3. Fig. 10(j) and Table 3 indicate that the BF estimate was error-free in the translation test. This flawlessness arises because the signal from an integer-translated image is clean (with no shadowing, specularity, transparency, occlusion, etc.); consequently, BF finds a motion vector that is a perfect match in the search region of the present image. As will be seen later, BF showed its shortcoming in the rotation test, where it could not find a perfect match. Figs. 10(k) and (l) are the flow map and scatter map of the motion estimate from the NFME method, which had err1 = 0.0044 pixels, err2 = 0.0042 pixels, θerr = 0.065 degrees, RMSE = 0.0085, and (V̄x1, V̄x2) = (−3, 1.99). This result was very close to the desired one.

Fig. 9. Translation test. (a) Desired flow map. (b) Desired scatter map. (c) HS flow map. (d) HS scatter map. (e) FME flow map. (f) FME scatter map.
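
The error-free BF result on an integer translation can be illustrated with a small exhaustive block-matching sketch. The sum of absolute differences (SAD) as the matching criterion, the function name `full_search`, the random test image, and the wrap-around shift produced by `np.roll` are all assumptions for illustration; the 11 × 11 block and the ±10 pixel search radius (a 31 × 31 search region) follow the sizes quoted in this section.

```python
import numpy as np

def full_search(prev, curr, top, left, block=11, radius=10):
    """Exhaustive block matching: return ((d_row, d_col), cost) minimising SAD.

    The reference block is taken from `prev` at (top, left) and compared with
    every candidate block in `curr` displaced by at most `radius` pixels.
    """
    ref = prev[top:top + block, left:left + block]
    best, best_cost = (0, 0), np.inf
    for d_row in range(-radius, radius + 1):
        for d_col in range(-radius, radius + 1):
            r, c = top + d_row, left + d_col
            # Skip candidates that would leave the image.
            if r < 0 or c < 0 or r + block > curr.shape[0] or c + block > curr.shape[1]:
                continue
            cost = np.abs(curr[r:r + block, c:c + block] - ref).sum()
            if cost < best_cost:
                best, best_cost = (d_row, d_col), cost
    return best, best_cost

# Synthetic frame pair: the content moves by -3 rows and +2 columns.
rng = np.random.default_rng(0)
prev = rng.random((64, 64))
curr = np.roll(prev, shift=(-3, 2), axis=(0, 1))

motion, cost = full_search(prev, curr, top=26, left=26)
```

Because the shifted frame contains an exact copy of the reference block, the minimum SAD is zero at the true displacement. Under rotation no candidate block is an exact copy, so the minimum cost stays above zero and the integer answer is only an approximation.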

Fig. 11(a) is the flow map of the desired motion vectors when the image is rotated 6° clockwise. Figs. 11(b)–(f) are the flow maps estimated by the various motion estimation methods. The corresponding errors are listed in Table 3. As Table 3 shows, the NFME method performs best among all the compared methods. The BF estimate was no longer error-free, because perfect-match motion vectors do not always exist when pixels move under rotation. From the comparison results shown as flow maps, scatter maps, and an error table in Figs. 9–11 and Table 3, one can infer that the proposed NFME method is very robust and accurate in both the translation and rotation tests.

Table 3. Error table. Columns prefixed T refer to the translation test (−3, 2); columns prefixed R refer to the rotation test (6°).

| Method | T err1 | T err2 | T θerr | T RMSE | T (V̄x, V̄y) | R err1 | R err2 | R θerr | R RMSE |
|---|---|---|---|---|---|---|---|---|---|
| HS | 0.2283 | 0.7839 | 20.095 | 0.7191 | (−1.43, 1.35) | 0.8136 | 1.0729 | 31.459 | 1.5315 |
| FME | 0.3592 | 0.6802 | 12.448 | 0.7421 | (−2.51, 2.02) | 0.8849 | 0.8789 | 34.703 | 1.0837 |
| BF | 0 | 0 | 0 | 0 | (−3, 2) | 0.3251 | 0.3805 | 17.556 | 0.4155 |
| NR | 0.5009 | 0.3555 | 6.258 | 0.6877 | (−2.91, 2.35) | 2.7433 | 3.5037 | 63.68 | 1.9491 |
| NFME | 0.0044 | 0.0042 | 0.065 | 0.0085 | (−3, 1.99) | 0.2833 | 0.3152 | 12.648 | 0.3597 |
| 3-step | 0.9648 | 0.7344 | 18.652 | 1.1971 | (−2.59, 1.85) | 0.8094 | 1.2068 | 33.657 | 1.2785 |
| Modified 3-step | 0.5926 | 0.5360 | 12.346 | 0.7539 | (−2.53, 1.48) | 0.6646 | 0.8984 | 27.100 | 0.9192 |

Fig. 10. Translation test (continued). (g) NR flow map. (h) NR scatter map. (i) BF flow map. (j) BF scatter map. (k) NFME flow map. (l) NFME scatter map.

Fig. 11. Rotation test. (a) Desired flow map. (b) HS flow map. (c) FME flow map. (d) NR flow map. (e) BF flow map. (f) NFME flow map.

Performance comparisons of the three-step fast search and the modified three-step fast search algorithms are shown in Table 3 and Figs. 12(a)–(f). The modified three-step algorithm reduced the error caused by local minima. The number of search locations was 1/5 of that of the exhaustive search approach.

Affine motion with a large rotation angle, θr = 12°, was also tested in this study. Note that this rotation angle is outside the range of the training data used to train the NFME system in Section 2. The parameter, a reference block size of 11 × 11, and a search region of 31 × 31 were used to examine the performance of the various methods described in Section 3.2. The results are shown in Fig. 13 and Table 4. The same conclusion was reached: the NFME system provides the most accurate estimate when the image is rotated by a large angle. The average time taken by each motion estimation algorithm to compute the motion vectors of an image is listed in Table 5; the programs were run on a PC486-66. The table shows that the proposed NFME system takes the same time as the FME and BF methods and longer than the HS and NR methods, but it reaches the highest accuracy.

Fig. 12. Tests of fast search algorithms. (a) 3-step flow map. (b) 3-step scatter map. (c) Modified 3-step flow map. (d) Modified 3-step scatter map. (e) 3-step flow map (rotation test). (f) Modified 3-step flow map (rotation test).

4. Applications

4.1. Moving image compression

Many methods are used to compress image data for transmission or storage. Frame skipping is one of the simplest data compression methods for interframe motion images. For simplicity, suppose only alternate frames are skipped. With no knowledge of the motion trajectory of the pixels, a skipped frame is generally reproduced either by repeating the preceding frame or by interpolating between the preceding and the following frames. Both methods, however, harm the quality of motion reproduction: the former results in jerkiness in the reproduced motion, while the latter blurs the moving areas.

Fig. 13. Affine motion with large rotation angle. (a) Desired flow map. (b) HS flow map. (c) FME flow map. (d) NR flow map. (e) BF flow map. (f) NFME flow map.

Let $U_{2k}$ be the (2k)th frame, where frames 2, 4, ..., 2k, ... have been skipped. Then $U^*_{2k}$, the reproduced value of $U_{2k}$, is obtained (without motion compensation) as follows:

Frame repetition:

$$U^*_{2k}(m, n) = U_{2k-1}(m, n). \tag{41}$$

Frame interpolation:

$$U^*_{2k}(m, n) = \tfrac{1}{2}\{U_{2k-1}(m, n) + U_{2k+1}(m, n)\}. \tag{42}$$

Table 4. The performance of various estimation methods on the affine motion with large rotation angle (translation (−3, 2), rotation 12°).

| Method | err1 | err2 | θerr | RMSE |
|---|---|---|---|---|
| HS | 2.0931 | 2.0834 | 38.616 | 3.1053 |
| FME | 2.0217 | 1.7948 | 33.401 | 3.2401 |
| NR | 1.9745 | 3.1130 | 49.448 | 2.6358 |
| BF | 0.7207 | 0.6481 | 18.618 | 0.6922 |
| NFME | 0.6606 | 0.6179 | 15.798 | 0.6231 |

Table 5. Average time taken by each motion estimation algorithm for computing the motion vectors of an image.

| | HS | NR | FME | BF | NFME |
|---|---|---|---|---|---|
| Time (s) | 15 | 25 | 75 | 75 | 75 |

The disadvantages of frame repetition and interpolation can be overcome by predicting or interpolating the pixels of the skipped frame along its motion trajectory. Hence, with motion compensation, the frame repetition and frame interpolation equations are replaced by

$$U^*_{2k}(m, n) = U_{2k-1}(m + q, n + l) \tag{43}$$

and

$$U^*_{2k}(m, n) = \tfrac{1}{2}\{U_{2k-1}(m + q, n + l) + U_{2k+1}(m + q', n + l')\}, \tag{44}$$

respectively, where $(q, l)$ and $(q', l')$ are the motion vectors of $U_{2k}$ relative to the preceding and the following frames, respectively. Note that $q$, $l$, $q'$, and $l'$ in Eqs. (43) and (44) are all integers in image coordinates. How does one handle the case in which $q$, $l$, $q'$, and $l'$ are not integers, as with the motion vectors produced by the proposed NFME method?
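
For integer motion vectors, Eqs. (43) and (44) amount to an index shift followed, for interpolation, by an average. The sketch below assumes a single motion vector shared by the whole frame and wrap-around at the image borders (a consequence of using `np.roll`); the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def shift_lookup(frame, q, l):
    """Return an image whose pixel (m, n) equals frame[m + q, n + l].

    Assumption: indices wrap around at the borders (np.roll behaviour).
    """
    return np.roll(frame, shift=(-q, -l), axis=(0, 1))

def compensated_repetition(prev_frame, q, l):
    """Eq. (43): U*_2k(m, n) = U_{2k-1}(m + q, n + l)."""
    return shift_lookup(prev_frame, q, l)

def compensated_interpolation(prev_frame, next_frame, q, l, q_next, l_next):
    """Eq. (44): average of the two motion-compensated neighbouring frames."""
    return 0.5 * (shift_lookup(prev_frame, q, l)
                  + shift_lookup(next_frame, q_next, l_next))

prev_frame = np.arange(16.0).reshape(4, 4)
next_frame = prev_frame + 2.0

repeated = compensated_repetition(prev_frame, q=1, l=0)
interpolated = compensated_interpolation(prev_frame, next_frame, 0, 0, 0, 0)
```

Both functions index the neighbouring frames directly, which is only possible because the displacements are integers; a real-valued motion vector has no pixel to index, which is the problem raised at the end of the paragraph above.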

Coding and decoding systems (codecs) that use motion compensation with fractional-pel accuracy have been reported in [7,12–15,20,24]. Typically, fractional-pel accuracy is achieved by simple bilinear interpolation, which produces a spatially blurred prediction signal. The improvement gained in this way is referred to as the “filtering effect”. Sinc interpolation, bilinear interpolation, and Wiener filtering were compared at integer-pel, 1/2-pel, 1/4-pel, and 1/8-pel accuracies. In contrast, the motion vectors produced by the neural fuzzy motion estimator (NFME) are real-valued, with infinitesimal-subpixel accuracy. Hence, we propose a compensation method called the quarter compensation algorithm (QCA), an infinitesimal-subpixel compensation method, to compensate interframe images according to the NFME’s motion estimates.

Suppose the origin of an image is at the top-left corner and the positive directions are rightward and downward. Assume the motion vector $v = (v_{\mathrm{row}}, v_{\mathrm{col}})$ at some location $(m, n)$ has been evaluated, and both $v_{\mathrm{row}}$ and $v_{\mathrm{col}}$ are floating-point numbers. Without loss of generality, assume that both $v_{\mathrm{row}}$ and $v_{\mathrm{col}}$ are positive; i.e., the point at location $(m, n)$ moves toward the lower-right corner. The integer and decimal parts of the motion vector are separated as follows:

$$v_{\mathrm{row}} = q + d_{\mathrm{row}}, \tag{45}$$

$$v_{\mathrm{col}} = l + d_{\mathrm{col}}, \tag{46}$$

where $(q, l)$ and $(d_{\mathrm{row}}, d_{\mathrm{col}})$ are the integer and decimal parts of $(v_{\mathrm{row}}, v_{\mathrm{col}})$, respectively. The square that was originally at location $(m, n)$ now lies over the four locations $(m + q, n + l)$, $(m + q + 1, n + l)$, $(m + q, n + l + 1)$, and $(m + q + 1, n + l + 1)$. A moved square is thus divided into four portions, with the area of each portion determined by the decimal part $(d_{\mathrm{row}}, d_{\mathrm{col}})$ of the motion vector $v$. The area of each portion is viewed as a weighting factor in deciding the new gray values. In other words, the gray value of the pixel originally at location $(m, n)$ is distributed among the new gray values of the four locations (pixels) according to these weights.
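
The four areas are not written out explicitly in the text. Assuming each pixel is a unit square and the moved square is offset by $(d_{\mathrm{row}}, d_{\mathrm{col}})$ from the pixel grid, the overlap areas would be

$$
\begin{aligned}
w_{(m,n)\to(m+q,\;n+l)} &= (1 - d_{\mathrm{row}})(1 - d_{\mathrm{col}}), \\
w_{(m,n)\to(m+q+1,\;n+l)} &= d_{\mathrm{row}}\,(1 - d_{\mathrm{col}}), \\
w_{(m,n)\to(m+q,\;n+l+1)} &= (1 - d_{\mathrm{row}})\,d_{\mathrm{col}}, \\
w_{(m,n)\to(m+q+1,\;n+l+1)} &= d_{\mathrm{row}}\,d_{\mathrm{col}},
\end{aligned}
$$

which sum to 1. As a worked example under this assumption, $d_{\mathrm{row}} = 0.25$ and $d_{\mathrm{col}} = 0.5$ give the weights 0.375, 0.125, 0.375, and 0.125, respectively.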


Hence, for any location $(x_1, x_2)$ in the compensated image frame, the gray value $U^*_{2k}(x_1, x_2)$ is given by

$$U^*_{2k}(x_1, x_2) = \frac{\sum_{(m,n) \in D} w_{(m,n) \to (x_1,x_2)} \cdot U_{2k-1}(m, n)}{\sum_{(m,n) \in D} w_{(m,n) \to (x_1,x_2)}}, \tag{47}$$

where $(x_1, x_2)$ is the image coordinate, $D = \{(m, n) \mid$ one of the four portions of the moved point $U_{2k-1}(m, n)$ is located at position $(x_1, x_2)\}$, $w_{(m,n) \to (x_1,x_2)}$ is the area (weighting factor) of the portion of $U_{2k-1}(m, n)$ that falls in location $(x_1, x_2)$, and $\sum_{D} w_{(m,n) \to (x_1,x_2)}$ is the normalization factor.

The numerator of Eq. (47) is the sum of the products of the weighting factors and the gray values of the moved points that fall in location $(x_1, x_2)$; the denominator is the sum of all these weighting factors, which normalizes the recovered gray value.
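
Eq. (47) can be read as a forward "splatting" of each source pixel onto up to four target pixels, followed by a per-pixel normalization. The sketch below is one possible NumPy reading of that description, not the authors' implementation. It assumes unit-square overlap areas as the weights, uses `floor` so that negative components are also split into an integer part and a decimal part in [0, 1), drops contributions that fall outside the frame, and sets pixels that receive no contribution to zero; the function name `quarter_compensation` is hypothetical.

```python
import numpy as np

def quarter_compensation(prev_frame, v_row, v_col):
    """Forward-compensate prev_frame with real-valued motion (v_row, v_col).

    v_row and v_col may be scalars or per-pixel arrays of the frame's shape.
    """
    prev = np.asarray(prev_frame, dtype=float)
    rows, cols = prev.shape
    v_row = np.broadcast_to(np.asarray(v_row, dtype=float), prev.shape)
    v_col = np.broadcast_to(np.asarray(v_col, dtype=float), prev.shape)

    q = np.floor(v_row).astype(int)      # integer parts, Eqs. (45)-(46)
    l = np.floor(v_col).astype(int)
    d_row = v_row - q                    # decimal parts in [0, 1)
    d_col = v_col - l

    num = np.zeros((rows, cols))         # numerator of Eq. (47)
    den = np.zeros((rows, cols))         # denominator (normalization factor)
    m, n = np.indices((rows, cols))

    portions = (
        (0, 0, (1 - d_row) * (1 - d_col)),
        (1, 0, d_row * (1 - d_col)),
        (0, 1, (1 - d_row) * d_col),
        (1, 1, d_row * d_col),
    )
    for dr, dc, w in portions:
        x1, x2 = m + q + dr, n + l + dc
        ok = (x1 >= 0) & (x1 < rows) & (x2 >= 0) & (x2 < cols) & (w > 0)
        # np.add.at accumulates correctly when several sources hit one target.
        np.add.at(num, (x1[ok], x2[ok]), w[ok] * prev[ok])
        np.add.at(den, (x1[ok], x2[ok]), w[ok])

    out = np.zeros((rows, cols))
    covered = den > 0
    out[covered] = num[covered] / den[covered]
    return out

frame = np.arange(20.0).reshape(4, 5)
shifted = quarter_compensation(frame, 1.0, 2.0)                        # integer motion
flat = quarter_compensation(np.full((6, 6), 7.0), 0.25, 0.5)           # fractional motion
```

With an integer motion vector only the first portion has nonzero weight, so the result reduces to a plain shift. A constant image stays constant under fractional motion because the normalization cancels the weights.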

Experiment. Two real-image frames containing moving objects were tested using the quarter compensation algorithm (QCA) for interframe motion compensation. The result is compared with that of the integer compensation method described by Eq. (43), and the performance is measured by the normalized mean square error (NMSE):

$$\mathrm{NMSE} = \frac{1}{N_1 \times N_2} \sum_{n_1} \sum_{n_2} \left[ U_{2k}(n_1, n_2) - U^*_{2k}(n_1, n_2) \right]^2, \tag{48}$$

where $U_{2k}(n_1, n_2)$ is the original skipped image, $U^*_{2k}(n_1, n_2)$ is the compensated image obtained from $U_{2k-1}(n_1, n_2)$, and $N_1 \times N_2$ is the dimension of the image frame.
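
Eq. (48), as printed, is the squared difference averaged over all pixels, with no division by the signal energy. A minimal sketch that follows the printed form (the function name `nmse` and the toy arrays are illustrative only):

```python
import numpy as np

def nmse(original, compensated):
    """Eq. (48): squared gray-level difference averaged over the N1 x N2 frame."""
    diff = np.asarray(original, dtype=float) - np.asarray(compensated, dtype=float)
    return float(np.mean(diff ** 2))

original = np.array([[1, 2], [3, 4]])
compensated = np.array([[1, 2], [3, 2]])
error = nmse(original, compensated)   # one pixel differs by 2, so 4 / 4 = 1.0
```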

The results are shown in Fig. 14. As Fig. 14 shows, integrating the NFME with the QCA to compensate the interframe image gives better results than the integer compensation method. Without motion compensation (i.e., the interframe is skipped), the error (NMSE) is as high as 22.514. This error is reduced to 10.658 by the integer compensation method, to 7.342 by the bilinear interpolation scheme, and to 2.764 by the proposed QCA.

4.2. Multiple moving object extraction

Having a computer find a specific item among many uncertain moving objects is a difficult task. Human beings can easily distinguish different moving objects and grab them even when the light is dim or an object’s outline is vague. A computer, however, needs additional information, such as motion vectors, to catch the moving objects and discard the redundancies in an image frame. An image frame may contain more than one moving object when only one is sought; extracting that object then becomes an important step toward interpreting or recognizing it. Reasonable motion information clearly helps to extract a moving object from image sequences, and more accurate motion vectors provide more useful information about the moving objects. Hence, because the proposed NFME provides accurate dynamic information, it can be used to compute motion vectors and extract moving objects. An experiment on multiple moving object extraction was conducted for this application.

Two image sequences containing two moving objects are shown in Fig. 15. The motion estimate computed by the NFME system is shown in Fig. 16, where the two moving objects can be seen clearly. As this flow map shows, two different regions corresponding to the two different moving objects can be distinguished and extracted based on the combined information of the moving angles and velocities derived from the estimated motion vectors. In this way, a moving object heading toward the North with a velocity of about 2 pixels/frame was extracted, as shown in Fig. 15(d).
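
The extraction rule is described only qualitatively (a region is selected by its moving angle and velocity). A minimal sketch of such a selection is given below. The tolerance values, the function name `extract_moving_object`, and the convention that North is the negative row direction at 90° are assumptions for illustration, not values from the paper.

```python
import numpy as np

def extract_moving_object(v_row, v_col, angle_deg, speed,
                          angle_tol=20.0, speed_tol=0.5):
    """Return a boolean mask of pixels moving at roughly the given angle and speed.

    Rows grow downward, so "North" is the negative row direction; angles are
    measured counter-clockwise from East (0 deg), giving North = 90 deg.
    """
    speed_map = np.hypot(v_row, v_col)
    angle_map = np.degrees(np.arctan2(-v_row, v_col))
    # Smallest absolute difference between two angles, in [0, 180].
    angle_diff = np.abs((angle_map - angle_deg + 180.0) % 360.0 - 180.0)
    return (np.abs(speed_map - speed) <= speed_tol) & (angle_diff <= angle_tol)

# Toy motion field: one region moves North at 2 pixels/frame,
# another moves East at 3 pixels/frame, the rest is static.
v_row = np.zeros((6, 6))
v_col = np.zeros((6, 6))
v_row[1:3, 1:3] = -2.0
v_col[4:6, 3:6] = 3.0

mask = extract_moving_object(v_row, v_col, angle_deg=90.0, speed=2.0)
```

Static pixels are rejected by the speed test alone, so their undefined direction does not matter. In practice the tolerances would have to be tuned to the accuracy of the estimated motion field.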


Fig. 14. Motion estimation and compensation. (a) The previous image. (b) The present image. (c) The compensated image using the integer compensation method. (d) The compensated image using the QCA.

Fig. 15. The images for multiple moving objects extraction. (a) The previous image. (b) The present image. (c) The moving objects. (d) The extracted moving object.

5. Conclusions

The neural fuzzy motion estimator presented in this paper has been shown to provide accurate motion vector estimates for uniform and affine models of object motion. Based on block matching, each subblock in the search region is assigned a similarity membership that contributes a different degree to the estimated motion vector. This system is more reliable and robust in motion estimation than other methods such as Horn and Schunck’s optical flow, the fuzzy logic motion estimator (FME), NR, and fast block matching. Motion estimation and compensation are an integral part of video compression, since they enable the removal of naturally existing temporal redundancies. In this paper, a neural motion compensation method, the QCA, is proposed. The QCA can perform infinitesimal-subpixel interframe motion compensation, and it gives better results than the conventional methods based on integer motion vectors. The proposed neural fuzzy motion estimator can be applied in various dynamic image-related applications in which motion information is of concern. The proposed system is especially effective for extracting multiple moving objects. In this application, useful motion-related features were extracted from the estimated motion vectors. This is an important technique that can be applied at the early stages of vision, prior to higher-level tasks such as segmentation, recognition, or interpretation. Motion information has become basic knowledge for image understanding, image compression, vision-based control, etc. The accurate motion field estimated by the proposed system can provide useful motion information for making better decisions.

Fig. 16. The flow map of multiple moving objects.

References

[1] P.K. Allen, A. Timcenko, B. Yoshimi, P. Michelman, Automated tracking and grasping of a moving object with a robotic hand-eye system, IEEE Trans. Robotics Automat. 9 (1993) 152–165.

[2] R. Baker, Waveform based very low rate video coding, Keynote address presented at the Internat. Note Workshop Very Low Bit Rate Video Compression, Urbana, IL, 1993.

[3] J.L. Barron, D.J. Fleet, S.S. Beauchemin, Systems and experiment performance of optical flow techniques, Internat. J. Comput. Vision 12 (1) (1994) 43–77.

[4] G.C. Buttazzo, B. Allotta, P. Fanizza, Mousebuster: a robot for real-time catching, IEEE Control Systems 14 (1) (1994) 49–56.

[5] K.W. Chun, J.B. Ra, An improved block matching algorithm based on successive refinement of motion vector candidates, Signal Processing: Image Com. 2 (1994) 115–122.

[6] J.H. Duncan, T.C. Chou, Temporal edges: the detection of motion and the computation of optical flow, Proc. 2nd Internat. Conf. Comput. Vis., Tampa, 1988, pp. 374–382.

[7] S. Ericsson, Fixed and adaptive predictors for hybrid predictive/transform coding, IEEE Trans. Commun. COM-33 (1985) 1291–1302.


[8] C. Fennema, W. Thompson, Velocity determination in scenes containing several moving objects, Comput. Graph. Image Process. 9 (1979) 301–315.

[9] D.J. Fleet, Measurement of Image Velocity, Kluwer Academic Publishers, Norwell, MA, 1992.

[10] H. Gharavi, M. Mills, Blockmatching motion estimation algorithm – new results, IEEE Trans. Circuits Systems 37 (1990) 649–651.

[11] B. Girod, D.J. Le Gall et al., Guest editorial: introduction to the special issue on image sequence compression, IEEE Trans. Image Process. 3 (1994) 465–468.

[12] B. Girod, F. Joubert, Motion-compensating prediction with fractional pel accuracy for 64 kbit/s coding of moving video, Proc. Internat. Workshop 64 kbit/s Coding Moving Video, Hannover, Germany, 1988.

[13] B. Girod, The efficiency of motion-compensating prediction for hybrid coding of video sequences, IEEE J. Select. Areas Commun. SAC-5 (1987) 1140–1154.

[14] B. Girod, Motion-compensating prediction with fractional-pel accuracy, IEEE Trans. Commun. 41 (1993) 604–612.

[15] M. Gilge, A high quality videophone coder using hierarchical motion estimation and structure coding of the prediction error, Proc. SPIE Conf. Visual Commun. Image Proc. ’88 1001, Cambridge, MA, SPIE, 1988, pp. 864–874.

[16] B.K.P. Horn, B.G. Schunck, Determining optical flow, Artificial Intell. 17 (1981) 185–204.

[17] J. Hwang, Y. Ooi, S. Ozawa, An adaptive sensing system with tracking and zooming a moving object, IEICE Trans. Inform. Systems E76-D (1993) 926–934.

[18] J.R. Jain, A.K. Jain, Displacement measurement and its applications in interframe coding, IEEE Trans. Commun. COM-29 (1981) 1799–1808.

[19] H.M. Jong, L.G. Chen, T.D. Chiueh, Parallel architectures for 3-step hierarchical search blockmatching algorithm, IEEE Trans. Circuits Systems Video Technology 4 (1994) 407–416.

[20] C.D. Kuglin, D.C. Hines, The phase correlation image alignment method, Proc. IEEE 1975 Internat. Conf. Cybernet. Soc. (1975) 163–165.

[21] C.T. Lin, C.S. George Lee, Neural-network-based fuzzy logic control and decision system, IEEE Trans. Comput. 40 (1991) 1320–1366.

[22] H. Li, A. Lundmark, R. Forchheimer, Image sequence coding at very low bitrates: a review, IEEE Trans. Image Process. 3 (5) (1994) 589–609.

[23] J.I. Lipp, Frame-to-frame image motion estimation with a fuzzy logic system, Proc. Circ. and Syst. Conf., 1992, pp. 987–990.

[24] F. May, Model based movement compensation and interpolation for ISDN videotelephony, IEEE Int. Symp. Circuits Syst. (ISCAS 88), Espoo, Finland, 1988.

[25] A. Murat Tekalp, Digital Video Processing, Prentice-Hall, Englewood Cliffs, NJ, 1995.

[26] H.H. Nagel, Displacement vectors derived from second-order intensity variations in image sequences, Comp. Graph Image Process. 21 (1983) 85–117.

[27] H.H. Nagel, On the estimation of optical flow: relations between different approaches and some new results, Artificial Intell. 33 (1987) 299–324.

[28] A.N. Netravali, J.D. Robbins, Motion compensated television: Part I, Bell System Tech. J. 58 (1979) 631–670.

[29] R. Plompen, Motion video coding for visual telephony, PTT Research Neher Laboratories, 1989.

[30] O. Tretiak, L. Pastor, Velocity estimation from image sequences with second-order differential operators, Proc. 7th Internat. Conf. Patt. Recog., Montreal, 1984, pp. 20–22.