National Sun Yat-sen University Institutional Repository:Item 987654321/28896


1 Introduction

The phenomenal increase in the generation, transmission, and use of digital images and video in various applications is placing an enormous demand on storage space and communication bandwidth. Data compression is a viable approach to alleviating such storage and bandwidth demands, and it enables applications such as digital video broadcasting, video streaming over the Internet, and mobile videophones, which were impracticable only a few years ago [57, 22, 6, 18, 74]. Image/video compression aims to transmit image/video data at the highest possible quality and the lowest possible bandwidth, within a processing time acceptable to the user. Two types of compression are possible: lossless and lossy. With lossless compression, there is no loss of information and therefore the image can be perfectly recovered. Lossy compression, on the other hand, is a trade-off that yields a smaller amount of data at the expense of fidelity. MPEG-4 is an ISO/IEC standard used for encoding/decoding object-based compressed video. We demonstrate the use of fuzzy neural networks to extract human objects from video sequences [42, 70] and show how to use the extracted objects for MPEG-4 applications.

1.1 Image Compression

A number of image compression techniques have been developed. Generally, there are four directions in image compression: vector quantization, transform coding, predictive coding, and entropy coding. Vector quantization (VQ) tries to reduce spatial redundancy in an image [51]. Representative blocks are chosen as code-words from the image to be compressed. Each block of the image is compared with the code-words to find the most similar one and is labeled with the index of the corresponding code-word. As a result, the underlying image is represented by a series of indices which are then transmitted through the communication channel. At the receiving end, the code-words corresponding to the received indices are used to reconstruct an approximation of the original image. Since only table lookups are involved at the receiving end, VQ is particularly suitable for real-time decoding. This simplicity also results in a low-complexity decoder, which is attractive for low-power systems and applications.

Transform coding compresses an image by converting the image into a small number of coefficients. Discrete Cosine Transform (DCT) and wavelet transform are two well-known methods. DCT transforms a block of pixels into a matrix of coefficients which represent spatial frequencies. Because most high-frequency coefficients are nearly zero, compression is attained. However, the underlying block-based scheme generally degrades the performance at low bit-rates since correlation across the block boundaries is not eliminated. Recently, wavelet transform coding (also referred to as sub-band coding) has emerged as a cutting-edge technology in the field of image compression. The basic idea of the wavelet transform is to represent any arbitrary function as a superposition of a set of wavelets or basis functions which are obtained by scaling and shifting from the mother wavelet. Wavelet-based coding is more robust under transmission and decoding errors, and also facilitates progressive transmission of images. Because of this inherent multi-resolution nature, wavelet coding schemes are especially suitable for applications where scalability and tolerable degradation are important.

Predictive coding works on the basis that adjacent pixels in an image are generally very similar, so the redundancy between successive pixels can be removed and only the residual between the actual and predicted pixel values is encoded. Differential Pulse Code Modulation (DPCM) is one particular example of predictive coding. Entropy coding relies on an uneven distribution of the values to be encoded, and the code length of a value is inversely related to the probability that the value occurs in the image. Two popular entropy coding schemes are Huffman coding and arithmetic coding. Huffman coding aims to produce variable-length codes for symbols so that data are represented at a rate close to their entropy. From a given probability function for each value of image pixels, a tree is constructed by summing the two lowest probabilities iteratively until no merge can be done. Then each symbol is encoded by tracing the corresponding path in the tree. In this way, Huffman coding maps fixed-length symbols to variable-length codes. Arithmetic coding converts a variable number of symbols into a variable-length code-word. A sequence of symbols is represented by an interval with length equal to its probability. The interval is specified by its lower and upper boundaries. The code-word of the sequence is the common bits in the binary representations of its lower and upper boundaries.
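The tree construction described for Huffman coding can be illustrated with a minimal Python sketch. It is not taken from the chapter: the function name `huffman_code` and the toy pixel values are assumptions made for illustration, and ties between equal probabilities are broken arbitrarily by insertion order.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return a dict mapping each symbol to its Huffman bit string."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two subtrees with the lowest weights.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Toy "image": four gray values with uneven frequencies (hypothetical data).
pixels = [0] * 8 + [64] * 4 + [128] * 2 + [255] * 2
codes = huffman_code(pixels)
total_bits = sum(len(codes[p]) for p in pixels)  # 28 bits instead of 16 * 2 = 32
```

Frequent values receive short codes and rare values long ones, which is the inverse relation between code length and probability stated above.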

1.2 Video Compression

Video data usually contains streams of image frames, each stream consisting of a set of sequential image frames. Transmitting a video stream requires a large bit rate, so a high compression ratio is needed for real applications. Two types of approaches, block-based and object-based, have been developed for video compression. In block-based approaches, e.g., MPEG-2 [2], an image frame is encoded as a set of fixed-size blocks, and redundancies are removed by attempting to match and reuse blocks from the frames preceding the current frame. The most basic form of block-based approaches checks whether a block in the current frame is identical to a block around the same place in the previous frame. If they are the same, the data of the underlying block is not encoded. Otherwise, the best-matched block in the previous frame is found and the difference between the two blocks is encoded. The area in which matching is searched affects the quality of the reconstructed image frame. The larger the search area, the larger the chance that a match can be found. However, most matched blocks are found around the place of the original block. Besides, increasing the search area also increases the computation required.
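The block search described above can be sketched as follows. This is only an illustration under stated assumptions: frames are plain lists of rows of gray values, the matching cost is the sum of absolute differences (the text does not name a cost), and the names `sad`, `best_match`, and `radius` are hypothetical.

```python
def sad(frame_a, frame_b, ay, ax, by, bx, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(frame_a[ay + r][ax + c] - frame_b[by + r][bx + c])
        for r in range(size)
        for c in range(size)
    )

def best_match(prev, cur, y, x, size, radius):
    """Search prev within +/- radius of (y, x) for the block closest to
    the block of cur located at (y, x).

    Returns (dy, dx, cost); cost == 0 means an identical block was found,
    so only the displacement would have to be encoded.
    """
    height, width = len(prev), len(prev[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            py, px = y + dy, x + dx
            if 0 <= py <= height - size and 0 <= px <= width - size:
                cost = sad(cur, prev, y, x, py, px, size)
                if best is None or cost < best[2]:
                    best = (dy, dx, cost)
    return best
```

The number of candidate positions grows as (2 * radius + 1) squared, which is the computational cost of a larger search area mentioned in the text.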

Object-based approaches, e.g., MPEG-4, treat an image frame as a set of objects. Instead of coding each individual block of an object, we can represent the whole object by a simple code and thus transmission efficiency can be increased. The same objects in adjacent image frames can be described by the same identity plus the difference between them. Furthermore, object-based approaches allow more sophisticated content-based interactivity, such as tuning compression parameters for different objects and manipulation of various objects. In an image frame, a region is defined as a contiguous set of pixels that are homogeneous in terms of certain features, such as texture, color, motion, or shape. A video object is a collection of regions which have been grouped together under some criteria across several image frames. For instance, a shot of a walking person can be segmented into a collection of adjoining regions by different criteria, but all the regions may exhibit consistency in their motion attributes. Therefore, finding video objects in image frames is the key issue for object-based compression approaches.

1.3 Fuzzy Theory and Neural Networks

Clearly, many compression techniques require a mechanism for clustering or prediction. For example, VQ may apply a clustering mechanism to derive code-words. The blocks of an image can be separated into clusters, and one representative is decided for each cluster and is used as a code-word. A similar clustering mechanism can be applied in object-based video compression for determining various objects from a video stream. The predictive coding method for image compression may use a predicting mechanism to predict the value of the current pixel from the values of the previous pixels.

Quantitative approaches based on conventional mathematics can be used to group similar elements into categories, but they are not suitable when the underlying application is complex, ill-defined, or uncertain. Fuzzy theory [72] was proposed to deal with applications in which the categorical boundaries are not crisp. Bezdek et al. [9] proposed the fuzzy c-means (FCM) algorithm, which generalizes the hard c-means algorithm [20] to produce a soft partition for a given dataset. Lin et al. [43] obtained fuzzy partitions by iteratively cutting each dimension of the input space into two parts. Wong and Chen [66] proposed another idea for clustering, in which reference vectors attract one another and form different clusters according to a similarity measure. Juang and Lin [33, 34] obtain fuzzy clusters from a given dataset via an aligned clustering-based algorithm and a projection-based correlation measure.

Neural networks have learning capabilities and can learn clusters from given training data. Adaptive resonance theory (ART) [28, 13, 14] is a network for data clustering. The stored prototype of a category is adapted when an input training pattern is sufficiently similar to the prototype. When an input training pattern is not sufficiently similar to any existing prototype, a new category is formed with the input training pattern as the prototype. The Kohonen clustering network (KCN) [38], also known as the Kohonen self-organizing map (SOM), is a fully connected linear network. The output generally is organized in a one- or two-dimensional arrangement of neurons. The weights connecting the input to the output perform association between weights and inputs. Not only the winner of the competition but also its neighbors have their weights updated according to the competitive rule. The HEC network [44] performs a partitional clustering using the regularized Mahalanobis distance. It consists of two layers. The first layer employs a number of principal component analysis subnetworks to estimate the hyper-ellipsoidal shapes of currently formed clusters, and the second layer performs competitive learning using the cluster shape information provided by the first layer. The spiking neural network [11] consists of a fully connected feedforward network of spiking neurons. Each data-point is translated into a multidimensional vector of spike-times in the input layer. If the distance between clusters is sufficiently small, the winner-takes-all competition tunes output neurons to the spike-time vectors associated with the centers of the respective clusters. The cluster-detection-and-labeling (CDL) network [21] consists of two layers and a threshold calculation unit. The first layer performs similarity matching, whereas the second layer implements cluster assignments.

Recently, neuro-fuzzy approaches for data clustering have attracted a lot of attention [35, 66, 55, 69, 68]. Such approaches combine advantages of both fuzzy theory and neural networks. Fuzzy ART [15] is capable of learning categories in response to arbitrary sequences of analog or binary input patterns. Input vectors are normalized according to a complement coding process which makes the MIN operator and the MAX operator of fuzzy theory complementary to each other. The fuzzy min-max clustering neural network proposed in [59] adopts hyperbox fuzzy sets. Learning is done by creating and expanding/contracting hyperboxes in the pattern space. The fuzzy Kohonen clustering network (FKCN) [61] combines the fuzzy c-means algorithm and KCN. It offers automatic control of the learning rate distribution and the extent of the topological neighborhood using fuzzy membership values. The fuzzy bidirectional associative clustering network (FBACN) [64] is composed of two layers of recurrent networks, performing fuzzy-partition clustering according to the objective-functional method. The first layer of FBACN is implemented by a Hopfield network, while the second layer is implemented by a multi-synapse neural network with added stochastic elements of simulated annealing. The self-constructing fuzzy neural network (SCFNN) [41, 50] is able to partition a given dataset into a set of clusters based on similarity tests. Membership functions associated with each cluster are defined according to statistical means and variances of the data points included in the cluster. Besides, parameters can be refined to increase the precision of the resulting clusters.

2 Neuro-Fuzzy Techniques

As indicated earlier, a lot of neuro-fuzzy techniques have been developed. In this section, we describe three techniques: FKCN, Fuzzy-ART, and SCFNN. FKCN is a non-sequential fuzzy neural network, i.e., all the training data are considered together. Fuzzy-ART and SCFNN, on the other hand, are sequential ones in which data are considered one at a time.

2.1 Fuzzy Kohonen Clustering Networks (FKCN)

Assume that we have $N$ training vectors, each of dimension $n$. The task of FKCN is to find $c$ reference vectors that properly partition these $N$ training vectors (or patterns) into $c$ clusters. An FKCN network consists of three layers: an input layer, a distance layer, and an output (membership) layer, having $n$, $c$, and $c$ neurons, respectively, as shown in Fig. 1. The input layer receives the input training patterns and broadcasts them forward to the distance layer. The distance layer calculates distances between input patterns and reference vectors, and transmits the distances to the output layer. The output layer computes the membership degrees with which the input pattern belongs to each cluster. A matrix of trainable weights, $V$,

$$ V = \begin{bmatrix} v_1 & v_2 & \cdots & v_c \end{bmatrix} = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1c} \\ v_{21} & v_{22} & \cdots & v_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ v_{n1} & v_{n2} & \cdots & v_{nc} \end{bmatrix} \tag{1} $$

exists between the input layer and the distance layer, where $v_{ij}$ is the weight of the connection between input node $i$ and distance node $j$, so that column $v_j = [v_{1j}, \ldots, v_{nj}]^T$ is the reference vector of cluster $j$. Basically, $V$ is obtained by minimizing the following objective function:

$$ \sum_{k} \sum_{j} (u_{kj})^m \, \|x_k - v_j\|^2, \tag{2} $$

where $u_{kj}$ is the membership degree of $x_k$ belonging to cluster $j$ and $m$ is a controlling exponent to be calculated later.

During the training, each input pattern $x_k$, where $x_k = [x_{1k}, \ldots, x_{nk}]$ and $1 \le k \le N$, is presented to the input layer. Each node $j$, $j = 1, 2, \ldots, c$, in the distance layer calculates the distance $d_{kj}$ between $x_k$ and $v_j$ as follows:

$$ d_{kj} = \|x_k - v_j\|^2 = (x_k - v_j)^T (x_k - v_j). \tag{3} $$

Then all the weights are adjusted by

$$ v_j(t) = v_j(t-1) + \frac{\sum_{k=1}^{N} \eta_{kj} \, (x_k - v_j(t-1))}{\sum_{k=1}^{N} \eta_{kj}} \tag{4} $$

where $\eta_{kj}$ is the learning rate defined by

$$ \eta_{kj} = (u_{kj})^m, \tag{5} $$

$$ m = m_0 - z \, \Delta m, \tag{6} $$

$$ \Delta m = \frac{m_0 - 1}{z_{\max}} \tag{7} $$

with $m_0$ being a positive constant greater than 1, $z$ being the current epoch count, $z_{\max}$ being the limit of the epoch count, and $u_{kj}$ being calculated by

$$ u_{kj} = \frac{1}{\sum_{i=1}^{c} \left( \dfrac{d_{kj}}{d_{ki}} \right)^{\frac{2}{m-1}}}. \tag{8} $$

The process iterates until either the weight matrices $V$ of two consecutive epochs are close enough or the epoch count $z$ exceeds $z_{\max}$. The FKCN algorithm can be summarized below.

    procedure FKCN
      Set the number of clusters to be c and the limit of epoch count to be z_max;
      Initialize the weight matrix V and choose m0, where m0 > 1;
      Set z = 1;
      while z ≤ z_max do
        Compute m with Eq.(6) and Eq.(7), and all memberships u_kj with Eq.(8);
        Compute all learning rates η_kj, j = 1, 2, ..., c, k = 1, 2, ..., N,
          with Eq.(5);
        Update all weight vectors v_j, j = 1, 2, ..., c, with Eq.(4);
        Compute E = ||V^(z) − V^(z−1)||^2 = Σ_j ||v_j^(z) − v_j^(z−1)||^2;
        if E ≤ ε break; else z = z + 1;
      return with c clusters;
    end FKCN
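The FKCN procedure can be sketched in plain Python as below. This is an illustrative reading of Eqs. (3) to (8), not the authors' implementation: the function name `fkcn`, the default parameter values, the epoch counter starting at 0 (so that $m$ stays strictly above 1), and the small floor on distances are all assumptions. No safeguards are included for the large exponents that arise when $m$ approaches 1.

```python
def fkcn(data, centers, m0=2.0, z_max=50, eps=1e-6):
    """Batch FKCN sketch. Returns (reference vectors, membership matrix)."""
    n_clusters = len(centers)
    centers = [list(v) for v in centers]
    delta_m = (m0 - 1.0) / z_max                      # Eq. (7)
    u = []
    for z in range(z_max):                            # z starts at 0 (assumption)
        m = m0 - z * delta_m                          # Eq. (6)
        # Eq. (3): squared distances, floored to avoid division by zero.
        d = [[max(sum((a - b) ** 2 for a, b in zip(x, v)), 1e-12)
              for v in centers] for x in data]
        # Eq. (8): membership of pattern k in cluster j.
        u = [[1.0 / sum((dk[j] / dk[i]) ** (2.0 / (m - 1.0))
                        for i in range(n_clusters))
              for j in range(n_clusters)] for dk in d]
        new_centers = []
        for j, v in enumerate(centers):
            eta = [uk[j] ** m for uk in u]            # Eq. (5)
            total = sum(eta)
            new_centers.append([                      # Eq. (4)
                vi + sum(e * (x[i] - vi) for e, x in zip(eta, data)) / total
                for i, vi in enumerate(v)
            ])
        # E: squared change of the weight matrix between consecutive epochs.
        change = sum((a - b) ** 2
                     for v, w in zip(centers, new_centers)
                     for a, b in zip(v, w))
        centers = new_centers
        if change <= eps:
            break
    return centers, u

# Hypothetical toy data: two well-separated groups of 2-D points.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, u = fkcn(data, [[1, 1], [9, 9]])
labels = [row.index(max(row)) for row in u]           # highest membership wins
```

Each pattern is assigned to the cluster with the highest membership, matching the rule stated after the procedure.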

When the process terminates, $v_j$, $j = 1, 2, \ldots, c$, is treated as the center or mean of cluster $j$. A pattern $x_k$ belongs to cluster $p$ if $u_{kp}$ of node $p$ is the highest output at the output layer.

2.2 Fuzzy-ART Networks

Fuzzy-ART [15] networks are similar to ART networks [13, 14], but they accept continuous inputs between 0 and 1 that represent fuzzy membership values. A Fuzzy-ART network consists of three layers: an input layer, a choice layer, and a match layer, as shown in Fig. 2. Like FKCN networks, a matrix $V$ of weights exists between the input layer and the choice layer. However, unlike FKCN networks, clusters are generated as needed and the number of clusters is determined automatically. Initially, the network contains only the input layer and the number of clusters, $J$, is set to 0. When an input vector $x_k$ is presented to the network, we compute distance $d_j$ by

$$ d_j = \frac{\|x_k \wedge v_j\|}{\beta + \|v_j\|} \tag{9} $$

for each node $j$ in the choice layer, where $\wedge$ is the fuzzy min operator and $\beta$ is a constant ensuring that the denominator is larger than 1. Let $v_{j^*}$ be the reference vector of node $j^*$ having the largest distance, and call it the winner reference vector. Then we perform a vigilance test which checks the degree of similarity between $v_{j^*}$ and $x_k$ as follows:

$$ \frac{\|x_k \wedge v_{j^*}\|}{\|x_k\|} > \rho \tag{10} $$

where $\rho$ is the vigilance parameter specified by the user. Note that the test is done at the corresponding node, node $j^*$, at the match layer. If $v_{j^*}$ passes the vigilance test, it is adapted to $x_k$ by

$$ v_{j^*}(t) = (1 - \eta) \, v_{j^*}(t-1) + \eta \, (x_k \wedge v_{j^*}(t-1)). \tag{11} $$

Otherwise, the current winner node is deactivated and the next winner is chosen. This process repeats until either a reference vector passes the vigilance test or none passes the test. If the latter case occurs, we increase $J$ by one, i.e., $J = J + 1$, create a new subnet for $x_k$, as shown in Figure 3, and add it to the network. Note that $v_J$ is set to $x_k$ in this subnet.

Fig. 2. Architecture of Fuzzy-ART networks.

Fig. 3. Subnet J is created for Fuzzy-ART.

The Fuzzy-ART algorithm can be summarized below.

    procedure Fuzzy-ART
      Set parameters η, β, and ρ. Initialize J = 0;
      for k = 1, 2, ..., N
        Input the training vector x_k;
        Choose the winner node j* by Eq.(9);
        Do vigilance test of the winner node by Eq.(10);
        while fails the vigilance test, do
          Find the next winner node;
        if one node passes the vigilance test
          then adapt the winner reference vector to x_k by Eq.(11);
        else J = J + 1 and create a new subnet J for x_k;
      return with J clusters;
    end Fuzzy-ART
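A compact Python sketch of the Fuzzy-ART procedure is given below for illustration. It makes several assumptions that the text does not fix: the norm in Eqs. (9) and (10) is taken to be the sum of components (the usual choice for Fuzzy ART), inputs are non-zero vectors with components in [0, 1], and the names `fuzzy_art`, `fuzzy_and`, and `norm1` as well as the default parameter values are hypothetical.

```python
def fuzzy_and(x, v):
    """Fuzzy min operator applied component-wise."""
    return [min(a, b) for a, b in zip(x, v)]

def norm1(x):
    """Sum-of-components norm (an assumption; the text only writes ||.||)."""
    return sum(abs(a) for a in x)

def fuzzy_art(data, rho=0.75, beta=0.01, eta=1.0):
    """Sequential Fuzzy-ART sketch. Returns (reference vectors, labels)."""
    prototypes = []                       # v_1 .. v_J, created on demand
    labels = []
    for x in data:
        # Rank existing nodes by the choice value of Eq. (9), largest first.
        order = sorted(
            range(len(prototypes)),
            key=lambda j: norm1(fuzzy_and(x, prototypes[j]))
            / (beta + norm1(prototypes[j])),
            reverse=True,
        )
        for j in order:
            v = prototypes[j]
            matched = fuzzy_and(x, v)
            if norm1(matched) / norm1(x) > rho:               # Eq. (10)
                prototypes[j] = [(1 - eta) * a + eta * b      # Eq. (11)
                                 for a, b in zip(v, matched)]
                labels.append(j)
                break
        else:
            # No node passed the vigilance test: create a new subnet for x.
            prototypes.append(list(x))
            labels.append(len(prototypes) - 1)
    return prototypes, labels

# Hypothetical toy data with two obvious groups.
prototypes, labels = fuzzy_art([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
# labels == [0, 0, 1, 1]
```

Raising `rho` makes the vigilance test stricter and therefore tends to produce more clusters, which is how the number of clusters is controlled without being fixed in advance.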

When the process terminates, vj, j = 1, 2, · · · , J, is treated as the center or

mean of cluster j. A pattern xk belongs to cluster p if node p has the highest


2.3 Self-Constructing Fuzzy Neural Networks (SCFNN)

The task of SCFNN is to partition the given data set into fuzzy clusters, with the degree of association being strong for data within a cluster and weak for data in different clusters. Let $x = [x_1, x_2, \ldots, x_n]$ be an input vector of $n$ dimensions. A fuzzy cluster $C_j$ is defined as a Gaussian function of the following form:

$$ C_j = G(x; v_j, \sigma_j) = \prod_{i=1}^{n} g(x_i; v_{ij}, \sigma_{ij}) \tag{12} $$

where $g(x_i; v_{ij}, \sigma_{ij})$ is

$$ g(x_i; v_{ij}, \sigma_{ij}) = \exp\left[ -\left( \frac{x_i - v_{ij}}{\sigma_{ij}} \right)^2 \right], \tag{13} $$

$v_j = [v_{1j}, \ldots, v_{nj}]$ denotes the mean vector, and $\sigma_j = [\sigma_{1j}, \ldots, \sigma_{nj}]$ denotes the deviation vector for $C_j$. Gaussian functions are adopted in SCFNN for representing clusters because of their superiority over other functions in performance [71].

Like Fuzzy-ART networks, clusters are generated as needed and the number of clusters is determined automatically in SCFNN networks. An SCFNN network consists of three layers: an input layer, a fuzzification layer, and a competition layer, as shown in Figure 4 [48, 41]. The input layer contains $n$ nodes. It receives input patterns and broadcasts them to the fuzzification layer. The fuzzification layer contains $J$ groups, each of which contains $n$ nodes. The corresponding weight vector $[v_{1j}, v_{2j}, \ldots, v_{nj}]$ of each group $j$ represents the prototype of cluster $j$. The $i$th node of group $j$ calculates the Gaussian function value $g(x_i; v_{ij}, \sigma_{ij})$. The competition layer contains $J$ nodes. The output of node $j$ of this layer is the product of all its inputs from the previous layer, i.e.,

$$ \prod_{i=1}^{n} g(x_i; v_{ij}, \sigma_{ij}). \tag{14} $$

Note that the weights $\{(v_{ij}, \sigma_{ij}) \mid 1 \le i \le n, \ 1 \le j \le J\}$ between the input layer and the fuzzification layer are adjustable, and the other weights are fixed to 1.

Let $x_k$ be a training pattern. We define that $x_k$ belongs to cluster $C_j$ if $x_k$ contributes to the distribution of patterns in $C_j$, i.e., $v_j$ and $\sigma_j$ have to be recalculated due to the addition of $x_k$. Let $S_j$ indicate the size of $C_j$, i.e., the number of patterns that belong to $C_j$. Also, we define an operator, comb, to combine a cluster $C_j$ and $x_k$ to result in a new cluster $C'_j$, as follows:

$$ C'_j = \mathrm{comb}(C_j, x_k) = G(x; v'_j, \sigma'_j) \tag{15} $$

$$ \phantom{C'_j} = \prod_{i=1}^{n} g(x_i; v'_{ij}, \sigma'_{ij}) \tag{16} $$

where the mean and deviation vectors, $v'_j$ and $\sigma'_j$, associated with $C'_j$ are computed by:

$$ v'_{ij} = \frac{S_j v_{ij} + x_{ik}}{S_j + 1}, \tag{17} $$

$$ \sigma'_{ij} = \left[ \frac{(S_j - 1)(\sigma_{ij} - \sigma_0)^2 + S_j (v_{ij})^2 + (x_{ik})^2}{S_j} - \frac{S_j + 1}{S_j} \left( \frac{S_j v_{ij} + x_{ik}}{S_j + 1} \right)^2 \right]^{1/2} + \sigma_0 \tag{18} $$

for $1 \le i \le n$, with $\sigma_0$ being a user-defined constant.

Fig. 4. Architecture of SCFNN networks.

Let $J$ be the number of existing fuzzy clusters. Initially, $J$ is 0 since no cluster exists at the beginning. For a training pattern $x_k$ being applied to the network, we first find the winner node $j^*$ at the competition layer:

$$ j^* = \arg\max_{j} \left\{ \prod_{i=1}^{n} g(x_i; v_{ij}, \sigma_{ij}) \right\}, \quad 1 \le j \le J. \tag{19} $$

Then we check if $x_k$ is similar enough to cluster $C_{j^*}$ by the following similarity test:

$$ \prod_{i=1}^{n} g(x_i; v_{ij^*}, \sigma_{ij^*}) \ge \rho \tag{20} $$

where $\rho$, $0 \le \rho \le 1$, is a predefined threshold. If $x_k$ passes the similarity test on cluster $C_{j^*}$, we further check the variance of the resulting cluster $C'_{j^*} = \mathrm{comb}(C_{j^*}, x_k)$ induced by the addition of $x_k$. We say that pattern $x_k$ passes the variance test on cluster $C_{j^*}$ if

$$ \|\sigma'_{j^*}\| \le \tau \tag{21} $$

where $\sigma'_{j^*}$ is computed by Eq.(18) and $\tau$ is another user-defined threshold. If either Eq.(20) or Eq.(21) fails, the current winner node is deactivated and the next winner is chosen.

Two cases may occur. First, there are no existing fuzzy clusters on which pattern $x_k$ has passed both the similarity test and the variance test. In this case, pattern $x_k$ is not close enough to any existing cluster and a new fuzzy cluster $C_J$, $J = J + 1$, is created with

$$ v_J = x_k, \quad \sigma_J = [\sigma_0, \sigma_0, \ldots, \sigma_0] \tag{22} $$

as shown in Figure 5. Note that the new cluster $C_J$ contains only one member, pattern $x_k$. The reason that $\sigma_J$ is initialized to a non-zero vector is to avoid the null width of a singleton cluster. Of course, the number of clusters is increased by 1 and the size of cluster $C_J$ should be initialized, i.e.,

$$ J = J + 1, \quad S_J = 1. \tag{23} $$

On the other hand, if there is a winning node $j^*$ on which pattern $x_k$ has passed both the similarity test and the variance test, pattern $x_k$ is close enough to cluster $C_{j^*}$ and the weights of the winning node $j^*$ are modified to include pattern $x_k$ by Eq.(17) and Eq.(18), and the size of $C_{j^*}$ is increased by 1, i.e.,

$$ S_{j^*} = S_{j^*} + 1. \tag{24} $$

Note that $J$ is not changed in this case.

Fig. 5. Subnet J is created for SCFNN.

The above process is iterated until all the training patterns are processed. The SCFNN algorithm can be summarized below.

    procedure SCFNN
      Set parameters σ0, ρ, and τ;
      for k = 1, 2, ..., N
        Input the training vector x_k;
        Choose the winner node j* by Eq.(19);
        while fails either similarity test or variance test, do
          Find the next winner node;
        if one node passes both tests
          then add x_k into the winner node;
        else J = J + 1 and create a new subnet J for x_k;
      return with J clusters;
    end SCFNN
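The SCFNN clustering pass, including the incremental update of Eqs. (17) and (18), might be sketched in Python as follows. The sketch rests on assumptions not stated in the text: the norm in Eq. (21) is taken to be Euclidean, the search stops at the first candidate that fails the similarity test (lower-ranked clusters are even less similar), and the names `similarity`, `combine`, and `scfnn` and the dictionary layout are hypothetical.

```python
import math

def similarity(x, v, s):
    """Eqs. (12)-(13): product of one-dimensional Gaussian memberships."""
    return math.exp(-sum(((xi - vi) / si) ** 2 for xi, vi, si in zip(x, v, s)))

def combine(x, v, s, size, sigma0):
    """Eqs. (17)-(18): mean and deviation after adding pattern x."""
    new_v, new_s = [], []
    for xi, vi, si in zip(x, v, s):
        mean = (size * vi + xi) / (size + 1)                       # Eq. (17)
        var = ((size - 1) * (si - sigma0) ** 2 + size * vi ** 2 + xi ** 2) / size \
            - (size + 1) / size * mean ** 2
        new_v.append(mean)
        new_s.append(math.sqrt(max(var, 0.0)) + sigma0)            # Eq. (18)
    return new_v, new_s

def scfnn(data, rho, tau, sigma0):
    """One sequential pass over the data. Returns a list of clusters."""
    clusters = []                      # each: {"v": mean, "s": deviation, "size": S_j}
    for x in data:
        ranked = sorted(clusters,
                        key=lambda c: similarity(x, c["v"], c["s"]),
                        reverse=True)  # winner first, Eq. (19)
        placed = False
        for c in ranked:
            if similarity(x, c["v"], c["s"]) < rho:                # Eq. (20)
                break                  # remaining clusters are less similar
            v, s = combine(x, c["v"], c["s"], c["size"], sigma0)
            if math.sqrt(sum(si ** 2 for si in s)) <= tau:         # Eq. (21)
                c["v"], c["s"], c["size"] = v, s, c["size"] + 1    # Eq. (24)
                placed = True
                break
        if not placed:                 # Eqs. (22)-(23): new singleton cluster
            clusters.append({"v": list(x), "s": [sigma0] * len(x), "size": 1})
    return clusters

# Hypothetical toy data: two tight groups in two dimensions.
clusters = scfnn([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]],
                 rho=0.5, tau=2.0, sigma0=0.5)
```

For a cluster of size 1, Eq. (18) reduces to the sample standard deviation of the two points plus $\sigma_0$, which is a quick sanity check on the update.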

When the process terminates, $v_j$, $j = 1, 2, \ldots, J$, is the mean vector of cluster $C_j$. Furthermore, the deviation vector, $\sigma_j$, is also provided for each cluster $C_j$ by the SCFNN algorithm. A pattern $x_k$ belongs to cluster $p$ if node $p$ has the highest output at the competition layer.

Because of the similarity and variance tests, SCFNN can generate compact and dense clusters, and capture the real distribution of the training vectors. Thus, the clusters generated can represent training vectors appropriately. Besides, the network obtained by SCFNN can be tuned when applied to the supervised recognition problem. This is useful in identifying objects for video compression to be presented later.

3 Neuro-Fuzzy Based Vector Quantization for Image Compression

As mentioned, vector quantization (VQ) is attractive for image compression due to its simplicity in decoding at the receiving end. One of the key issues for VQ is the generation of code-words based on which image blocks are encoded and decoded. Many methods have been proposed for generating code-words for VQ [10, 32]. The LBG algorithm [52, 53] is one of the most famous methods. It starts with a set of randomly selected code-words which form initial clusters. According to the Euclidean distances from code-words, a training pattern is clustered to the code-word nearest to it. Data in the same cluster form a new code-word which replaces the old one. The process repeats until the variation of average distortion of all clusters is smaller than a predefined threshold. Modified ART2 [30, 62] is another approach for VQ based on the ART algorithm. Code-words are constructed gradually as the data are fed one by one. When the first block of image comes in, the first code-word is set up. Incoming blocks are compared with existing code-words. If a code-word is similar enough to an incoming block, the block is categorized to the code-word and the code-word is modified to include the block. The similarity threshold increases in each iteration. The algorithm terminates when the number of code-words reaches the desired one or the threshold reaches the predefined upper-bound.
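For comparison with the neuro-fuzzy methods, the LBG iteration described above can be sketched as follows. This is an illustrative reading of the prose, not the algorithm of [52, 53] verbatim: the function name `lbg`, the seeded random initialization, the threshold value, and the choice to leave an empty cluster's code-word unchanged are assumptions.

```python
import random

def lbg(vectors, n_codewords, threshold=1e-3, seed=0):
    """LBG-style code-book design. Returns a list of code-words."""
    rng = random.Random(seed)
    # Start from randomly selected training vectors as initial code-words.
    codebook = [list(v) for v in rng.sample(vectors, n_codewords)]
    previous = None
    while True:
        clusters = [[] for _ in codebook]
        distortion = 0.0
        for x in vectors:
            # Assign x to the nearest code-word (squared Euclidean distance).
            dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
            j = dists.index(min(dists))
            clusters[j].append(x)
            distortion += dists[j]
        distortion /= len(vectors)
        # Replace each code-word by the centroid of its cluster.
        for j, members in enumerate(clusters):
            if members:
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
        # Stop when the average distortion no longer changes noticeably.
        if previous is not None and abs(previous - distortion) <= threshold:
            return codebook
        previous = distortion
```

Unlike the sequential methods of Section 2, the number of code-words is fixed in advance here, and the result depends on the random initial selection.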

VQ can be combined naturally with neuro-fuzzy clustering methods for deriving the code-words for image compression. Any neuro-fuzzy clustering method presented in Section 2 can do the job. The code-book size is identical to the number of clusters, and all the $v_j$ vectors form the desired code-words. In this section we describe how the SCFNN clustering method is used to generate code-words for vector quantization. The fuzzy clusters obtained by SCFNN have a high degree of intra-cluster similarity and a low degree of inter-cluster similarity. The mean vector of each obtained fuzzy cluster naturally becomes a code-word. The advantages of using SCFNN are that the fuzzy clusters generated are compact and dense, the real distribution of image content can be captured, and image content can be represented by code-words more appropriately.

3.1 VQ Encoding/Decoding

For simplicity, we only focus on gray-level images. Extension to color images is obvious. Given an original image, $I$, of $N_x$ by $N_y$ pixels to be transmitted, namely,

$$ \begin{bmatrix} I_{11} & I_{12} & \cdots & I_{1N_x} \\ I_{21} & I_{22} & \cdots & I_{2N_x} \\ \vdots & \vdots & \ddots & \vdots \\ I_{N_y 1} & I_{N_y 2} & \cdots & I_{N_y N_x} \end{bmatrix} $$

where $I_{ij}$ represents the gray value of the pixel located at position $(i, j)$, we divide $I$ into non-overlapping blocks, each of which contains $p$ by $p$ pixels. Usually, $N_x$ and $N_y$ are both multiples of $p$. Therefore, $I$ can be divided into $N_x N_y / p^2$ blocks, which are numbered $1, 2, \ldots, N_x N_y / p^2$ from left to right and top to bottom, as shown in Figure 6. Similarly, the pixels in a block are numbered $1, 2, \ldots, p^2$ from left to right and top to bottom, as shown in Figure 7.

Fig. 6. Block numbering in an image.

Fig. 7. Pixel numbering in a block.

The VQ-based compression/decompression system we are concerned with is shown in Figure 8. The system consists of three major components: encoder, decoder, and code-book. Usually, the size of the code-book is $2^b$, i.e., it contains $2^b$ code-words. Each code-word is a block of pixels. We obtain the code-book using SCFNN from the blocks of the underlying image. Then at the transmitting end, each block of the image is encoded by comparing it with the code-words of the code-book. The code-word which is most similar to the block is chosen, and the index of the code-word is used to represent the block and is transmitted to the receiving end through the communication channel. At the receiving end, a received index is decoded by checking against the code-book, which is the same as that used by the encoder at the transmitting end. The corresponding code-word is recalled to become the reconstructed block. When all the received indices have been processed, the whole image is reconstructed; it is an approximate version of the original image at the transmitting end.

Fig. 8. The VQ-based compression system.

3.2 Clustering by SCFNN

As mentioned earlier, the pixels of a given image are divided into $N_x N_y / p^2$ blocks and each block contains $p^2$ pixels. We represent each block as a vector of size $p^2$. Let $N = N_x N_y / p^2$. Therefore, we have $N$ training vectors, $x_1, x_2, \ldots, x_N$, and each training vector $x_k$ has $p^2$ dimensions, i.e., $n = p^2$. These training vectors are given to the SCFNN algorithm of Section 2.3. Suppose $J$ clusters are obtained. Then the mean vector of each cluster obtained becomes a code-word for encoding/decoding.
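The block-to-vector conversion and the encoder/decoder of Section 3.1 can be illustrated with the following Python sketch. It assumes that the image is a list of rows whose dimensions are multiples of $p$, that "most similar" means smallest squared Euclidean distance, and that the code-book is simply a list of flattened code-words; the function names are hypothetical.

```python
def image_to_blocks(image, p):
    """Split an image into p x p blocks, numbered left to right and top to
    bottom; each block is flattened into a vector of p * p pixels."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(p) for j in range(p)]
        for r in range(0, rows, p)
        for c in range(0, cols, p)
    ]

def encode(blocks, codebook):
    """Replace each block by the index of its nearest code-word."""
    indices = []
    for x in blocks:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def decode(indices, codebook, rows, cols, p):
    """Rebuild an approximate image by table lookup of the received indices."""
    image = [[0] * cols for _ in range(rows)]
    blocks_per_row = cols // p
    for n, idx in enumerate(indices):
        r0, c0 = (n // blocks_per_row) * p, (n % blocks_per_row) * p
        for i in range(p):
            for j in range(p):
                image[r0 + i][c0 + j] = codebook[idx][i * p + j]
    return image

# Hypothetical 2 x 4 image, 2 x 2 blocks, and a two-word code-book.
image = [[10, 10, 200, 200], [10, 10, 200, 200]]
codebook = [[0, 0, 0, 0], [255, 255, 255, 255]]
indices = encode(image_to_blocks(image, 2), codebook)   # [0, 1]
restored = decode(indices, codebook, 2, 4, 2)
```

Decoding involves only table lookups, which is the property that makes VQ attractive at the receiving end.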

Note that SCFNN determines the number of clusters by itself. However, it is desirable that the size of the code-book be $2^b$. Therefore, we have to make the number of clusters obtained as close to $2^b$ as possible, i.e., $J \simeq 2^b$. This is achieved by iteratively adjusting the values of the two parameters involved, $\rho$ and $\tau$. Suppose we would like the number of final clusters to be $B = 2^b$. Initially, each training pattern is treated as a cluster and the number of clusters is $N$. We randomly choose some values for $\rho$ and $\tau$ and perform fuzzy clustering. Let the number of clusters generated be $L$. If $L > B$, then we decrease $\rho$ and increase $\tau$ by an amount proportional to $(L - B)/(N - L)$. If $L < B$, then we increase $\rho$ and decrease $\tau$ by an amount proportional to $(B - L)/(N - L)$. This process iterates until $L$ is close enough to $B$. The procedure for obtaining a code-book by SCFNN can be summarized below.

    procedure Code Book SCFNN
      Let B = N;
      while B is not close to 2^b do
        Adjust ρ and τ appropriately; B = 0;
        for k = 1, 2, ..., N
          W1 = {C_j | G(x_k; v_j, σ_j) ≥ ρ, 1 ≤ j ≤ B};
          W2 = {C_j | σ'_j ≤ τ, C_j ∈ W1};
          if W2 == ∅
            A new cluster C_B, B = B + 1, is created;
          else let C_a ∈ W2 be the cluster with the
            largest similarity measure;
            Incorporate x_k into C_a;
      return with the created B clusters;
    end Code Book SCFNN

3.3 Experimental Results

Three characteristics are usually considered for evaluating the effectiveness of an image compression algorithm, i.e., compression ratio, compression speed, and image quality. Compression ratio (CR) for an image is defined to be

$$ \mathrm{CR} = W_I / W_T \tag{25} $$

where $W_I$ is the number of bits in the original image and $W_T$ is the number of bits transmitted through the communication channel for the image. A larger CR means a smaller bandwidth requirement and more efficient transmission. Image quality concerns the quality of the reconstructed image at the receiving end and is usually indicated by the signal-to-noise ratio (SNR) defined below:

$$ \mathrm{SNR} = 10 \log_{10} \frac{\frac{1}{N_x N_y} \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} x(i, j)^2}{\mathrm{MSE}}, \tag{26} $$

$$ \mathrm{MSE} = \frac{1}{N_x N_y} \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} \left( x(i, j) - \hat{x}(i, j) \right)^2 \tag{27} $$

where $x(i, j)$ and $\hat{x}(i, j)$ are the original and reconstructed values, respectively, of the pixel located at position $(i, j)$. As usual, we assume that a reconstructed image with a higher SNR is of higher quality.
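Equations (26) and (27) translate directly into a few lines of Python. The sketch below assumes images stored as lists of rows of gray values; the function names `mse` and `snr_db` and the toy data are hypothetical.

```python
import math

def mse(original, reconstructed):
    """Eq. (27): mean squared error over all pixels."""
    n_pixels = len(original) * len(original[0])
    return sum(
        (a - b) ** 2
        for row_o, row_r in zip(original, reconstructed)
        for a, b in zip(row_o, row_r)
    ) / n_pixels

def snr_db(original, reconstructed):
    """Eq. (26): mean signal power over MSE, in decibels."""
    n_pixels = len(original) * len(original[0])
    signal_power = sum(a * a for row in original for a in row) / n_pixels
    return 10 * math.log10(signal_power / mse(original, reconstructed))

# Hypothetical 2 x 2 example: two pixels are off by 10 gray levels.
original = [[100, 100], [100, 100]]
reconstructed = [[90, 110], [100, 100]]
error = mse(original, reconstructed)       # 50.0
quality = snr_db(original, reconstructed)  # 10 * log10(10000 / 50), about 23.01 dB
```

A perfect reconstruction gives an MSE of zero, for which the SNR of Eq. (26) is undefined; the sketch does not handle that case.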

To demonstrate the effectiveness of SCFNN, we show the results of three experiments below. In these experiments, each block contains 8×8 = 64 pixels, i.e., p = 8. Firstly, the results obtained from four benchmark images are presented using local code-books. Next, we show the performance of image compression using a global code-book. Finally, we work with another two benchmark images contaminated with the white Gaussian noise.

Benchmark Images (Local Code-book)

A local code-book is derived from the blocks of the image to be transmitted, and has to be sent with the code-book indices to the decoder. Therefore, compression with local code-books gives better reconstruction quality, but results in a low compression ratio. We do compression with local code-books on four benchmark images, Elaine, Peppers, Man and Boat, as shown in Figure 9. All these images are 256×256 in resolution. MSE, SNR, and CR associated with these images are shown in Table 1. Note that $2^b$ in the first column of the tables indicates the size of the code-book, and the transmission of the associated code-book is considered in the calculation of CR for each image. Figure 10(a), Figure 10(b) and Figure 10(c) show the reconstructed images of Elaine, Man and Boat, respectively.

Fig. 9. The original benchmark images: (a) Elaine; (b) Peppers; (c) Man; (d) Boat.

Benchmark Images (Global Code-book)

Next, we show the performance of image compression using a global code-book. A global code-book is one that is used for encoding/decoding independently of the images to be transmitted. With this idea, the compression ratio can be increased since each image is transmitted without its own code-book. Usually, we test this idea by encoding one image with a code-book that is obtained from another image. We compress Peppers, Man, and Boat based on the code-book obtained from Elaine($2^8$), and the results are shown in Table 2. Figure 10(d), Figure 10(e) and Figure 10(f) show the reconstructed images of Peppers, Man and Boat, respectively, based on the code-book of Elaine($2^8$).

Fig. 10. Reconstructed images: (a) Elaine($2^9$); (b) Man($2^8$); (c) Boat($2^8$); (d) Peppers based on the Elaine($2^8$) code-book; (e) Man based on the Elaine($2^8$) code-book; (f) Boat based on the Elaine($2^8$) code-book.

Table 1. MSE, SNR, and CR for benchmark images using local code-books.

Image            MSE       SNR      CR
Elaine($2^8$)    72.28     24.62    3.76
Elaine($2^9$)    19.26     30.37    1.93
Man($2^8$)       211.37    17.28    3.76
Boat($2^8$)      125.14    21.86    3.76

Table 2. MSE, SNR, and CR for benchmark images using the code-book of Elaine($2^8$).

Image            MSE       SNR      CR
Peppers($2^8$)   365.45    16.83    64.00
Man($2^8$)       676.87    12.22    64.00
Boat($2^8$)      379.59    17.04    64.00
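The CR values reported in Tables 1 and 2 can be reproduced with a short calculation. The sketch below is ours and rests on assumptions that are not stated explicitly in this section: CR is taken as the number of raw image bits divided by the number of transmitted bits, pixels are stored with 8 bits, and each block index takes b bits for a code-book of $2^b$ code-words. The function name `compression_ratio` is hypothetical.

```python
def compression_ratio(width, height, block, index_bits, send_codebook, pixel_bits=8):
    """Raw image bits divided by transmitted bits for block-based VQ."""
    raw_bits = width * height * pixel_bits
    n_blocks = (width // block) * (height // block)
    index_stream = n_blocks * index_bits
    # A local code-book travels with the image; a global one does not.
    codebook_bits = (2 ** index_bits) * block * block * pixel_bits if send_codebook else 0
    return raw_bits / (index_stream + codebook_bits)

# 256x256 images, 8x8 blocks
print(round(compression_ratio(256, 256, 8, 8, True), 2))   # local, 2^8 code-words: 3.76
print(round(compression_ratio(256, 256, 8, 9, True), 2))   # local, 2^9 code-words: 1.93
print(round(compression_ratio(256, 256, 8, 8, False), 2))  # global, 2^8 code-words: 64.0
```

Under these assumptions, the local code-book dominates the transmitted data (131072 of 139264 bits for $2^8$ code-words), which is why the global code-book raises CR from 3.76 to 64.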

Benchmark Images with Noise

We test the effectiveness of SCFNN with another two benchmark images, Lena and Bird, contaminated with white Gaussian noise, as shown in Figure 11. Figure 12 shows the reconstructed images obtained using local code-books of size 256 and 128 of the original images, respectively. The values of SNR and MSE for these reconstructed images are given in Table 3.

Table 3. MSE, SNR, and CR for benchmark images contaminated with noise.

Image            MSE       SNR      CR
Lena($2^8$)      104.31    22.23    3.76
Bird($2^7$)

Fig. 11. Original benchmark images with noise: (a) Lena; (b) Bird.

Fig. 12. Reconstructed images: (a) Lena($2^8$); (b) Bird($2^7$).

4 Image Transmission by NITF

We demonstrate the usage of fuzzy neural networks in real communication of images in this section. We adopt NITF (National Imagery Transmission Format) [7, 1], which is a standard for encoding/decoding VQ compressed images. The standard is introduced first, and the functions of fuzzy neural networks in the standard are then described.

4.1 Introduction of NITF

NITF is a multi-component format which is designed to allow up to 999 images and symbols to be combined in a single file. Each component has metadata associated with it. This technique allows for overlays that can readily be removed through an application. NITF is more than just an image file format; it goes beyond supporting the core needed to share imagery between disparate systems. It facilitates the increasing need for greater flexibility in using multiple images with annotation in a composition that relates the images and annotation to one another.

NITF can accept and decompress data that has been compressed using a VQ compression scheme. Images contained in a NITF file can be in either color or gray scale. For simplicity, we only consider gray scale images here.

The components of a NITF file, as shown in Figure 13, include:

• NITF File Header. This gives the basic description of the file, e.g., how many subcomponents such as images, symbols, or texts, exist.

• Image Segments. NITF allows each image contained in the file to be compressed. Currently, bi-level, JPEG, JPEG2000 and VQ are the compression methods supported by NITF.

• Symbol Segments.

• Text Segments.

• Remaining segments.

Fig. 13. NITF file structure with VQ compressed images.

The subheader of each image identifies the image compression method used. If an image is compressed with VQ, then the code-book is placed in the VQ header, followed by the compressed image codes. The VQ header also provides information about the organization of the code-book, indicating how many code-words are included in the code-book, the size of each code-word, and how the data that makes up the code-words is organized. As a multi-component format, NITF can collect several images in a file and thus is suitable for transmitting a global code-book in one file.

4.2 Encoding a VQ compressed NITF Image

Fig. 14. VQ compression procedure.

To compress an image with vector quantization, we can use fuzzy neural networks to generate the code-book from the input image and then assign each image block to the nearest code-word, as shown in Figure 14. The code-book and the indices of image blocks together are encoded into a bitstream and become the compressed image data in a NITF file. Two encoding schemes are provided, block-based and row-based.

Encoding with Block-Based Code-books

For the block-based scheme, a code-book with each code-word of 4×4 pixels in size is created. To illustrate how it works, we do compression on Lena. The image is first partitioned into blocks of 4 × 4 pixels. Therefore, each block or code-word contains 16 pixels. A code-book of 512 code-words is generated and 64 × 64 = 4096 blocks are vector quantized through the network. Following the file header and the image subheader, the 512 code-words are encoded in ascending order, i.e., starting with code-word 1 and ending with code-word 512. Assume that the first code-word is

$$\begin{bmatrix} 153 & 153 & 153 & 153 \\ 155 & 155 & 155 & 155 \\ 155 & 155 & 155 & 155 \\ 154 & 154 & 154 & 154 \end{bmatrix}$$

and the second code-word is

$$\begin{bmatrix} 150 & 151 & 150 & 151 \\ 148 & 148 & 148 & 148 \\ 135 & 132 & 132 & 131 \\ 112 & 112 & 113 & 113 \end{bmatrix}.$$

Then the bitstream of the first code-word, i.e., [153 153 153 · · · 154 154 154], is followed by that of the second code-word, i.e., [150 151 150 · · · 112 113 113], and so on. The bitstream of the code-book is therefore encoded as [153 153 153 · · · 154 154 154 150 151 150 · · · 112 113 113 · · ·]. Following the code-book, the indices of the image blocks from left to right and top to bottom are encoded into another bitstream of compressed image data. Suppose we have the following image with each image block replaced by its index:

$$\begin{bmatrix} 127 & 55 & \cdots & 13 \\ 8 & 288 & \cdots & 512 \\ \vdots & \vdots & \ddots & \vdots \\ 64 & 175 & \cdots & 336 \end{bmatrix}.$$

The bitstream consisting of these indices becomes [127 55 · · · 13 8 288 · · · 512 · · · 64 175 · · · 336].
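The block-based ordering described above can be sketched as follows. This is only an illustration: the nearest code-word is found here by plain squared Euclidean distance, which is our stand-in for the classification done by the fuzzy neural network, and the helper names (`split_blocks`, `nearest_index`, `encode_block_based`) are hypothetical. Indices are 1-based, as in the text.

```python
def split_blocks(image, size=4):
    """Yield size x size blocks as flat lists, left to right and top to bottom."""
    for top in range(0, len(image), size):
        for left in range(0, len(image[0]), size):
            yield [image[top + r][left + c] for r in range(size) for c in range(size)]

def nearest_index(block, codebook):
    """1-based index of the code-word with the smallest squared Euclidean distance."""
    def dist(word):
        return sum((a - b) ** 2 for a, b in zip(block, word))
    return 1 + min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def encode_block_based(image, codebook, size=4):
    """Return (code-book stream, index stream) in the order described in the text."""
    codebook_stream = [pixel for word in codebook for pixel in word]
    index_stream = [nearest_index(b, codebook) for b in split_blocks(image, size)]
    return codebook_stream, index_stream

# The two example code-words from the text, flattened row by row.
w1 = [153] * 4 + [155] * 8 + [154] * 4
w2 = [150, 151, 150, 151, 148, 148, 148, 148, 135, 132, 132, 131, 112, 112, 113, 113]
# A 4x8 toy image: code-word 2 on the left, code-word 1 on the right.
image = [w2[4 * r:4 * r + 4] + w1[4 * r:4 * r + 4] for r in range(4)]
cb_stream, idx_stream = encode_block_based(image, [w1, w2])
print(idx_stream)  # [2, 1]
```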

Encoding with Row-Based Code-books

NITF allows the organization of the VQ code-book to be optimized for the specific use of the VQ data. For the row-based scheme, four code-books with each code-word of 4 pixels in size are created. That is, it stores different rows of 4×4 code-words in different code-books such that the image can be reconstructed line-by-line, instead of block-by-block.

The first code-book is used to group row 1 of all the 4×4 code-words together. For instance, for the previous example, the first code-book has the following form:

$$\begin{bmatrix} 153 & 153 & 153 & 153 \\ 150 & 151 & 150 & 151 \\ \vdots & \vdots & \vdots & \vdots \end{bmatrix}.$$

Row 2, row 3, and row 4 of all the 4×4 code-words are placed in respective code-books as above. Then these row-based code-books are encoded one by one. The bitstream of these code-books thus looks like [153 153 153 153 150 151 150 151 · · · 155 155 155 155 148 148 148 148 · · · 155 155 155 155 135 132 132 131 · · · 154 154 154 154 112 112 113 113 · · ·]. Note that the quantized image data remains the same as that compressed with the block-based code-book scheme.
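The reorganization into row-based code-books amounts to a simple regrouping of the block-based code-book. The following sketch (function names are ours) reproduces the bitstream given above for the two example code-words.

```python
def to_row_codebooks(codebook, size=4):
    """Split a block-based code-book into `size` row code-books.

    Row code-book r holds row r of every code-word, in code-word order.
    """
    return [[word[r * size:(r + 1) * size] for word in codebook] for r in range(size)]

def row_codebook_stream(codebook, size=4):
    """Serialize the row code-books one after another."""
    return [p for book in to_row_codebooks(codebook, size) for row in book for p in row]

# The two example code-words from the text, flattened row by row.
w1 = [153] * 4 + [155] * 8 + [154] * 4
w2 = [150, 151, 150, 151, 148, 148, 148, 148, 135, 132, 132, 131, 112, 112, 113, 113]
stream = row_codebook_stream([w1, w2])
print(stream[:8])    # row 1 of both code-words
print(stream[8:16])  # row 2 of both code-words
```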


4.3 Decoding a VQ compressed NITF Image

The decoding process of a VQ compressed image is shown in Figure 15. When a VQ compressed image is received, code-books are read and then the image blocks are reconstructed.

Fig. 15. VQ decompression procedure.

Decoding with Block-Based Code-books

For the block-based scheme, we only need to check with one code-book. After the code-book is read, the indices of image blocks are extracted. Through the table lookup operation with the code-book, the code-word indexed by the first vector-quantized image code is used to spatially decompress the 4 × 4 block at the upper left corner of the image. Decompression continues from left to right and top to bottom, as shown in Figure 16, until all the image blocks have been spatially decompressed.

Decoding with Row-Based Code-books

The process of decoding a VQ compressed NITF image with row-based code-books is very similar to that of the block-based code-book scheme, except that we have to check with four code-books. After the row-based code-books are read, the first 64 indices, i.e., quantized image data, are used to decompress the first row of the image by table lookup in the first code-book. The second, third, and fourth rows of the image are then decompressed by table lookup in the second, third, and fourth code-books, respectively. The decompression continues from left to right and top to bottom, as shown in Figure 17, until all the image rows have been spatially decompressed.
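Both decoding schemes reduce to table lookups, as the following sketch shows. It is an illustration under our own naming (`decode_block_based`, `decode_row_based`), with 1-based indices and `blocks_per_row` equal to 64 for the 256×256 example in the text. The two decoders should produce the same image; only the order in which pixels are emitted differs.

```python
def decode_block_based(indices, codebook, blocks_per_row, size=4):
    """Rebuild the image block by block from 1-based indices."""
    rows = []
    for start in range(0, len(indices), blocks_per_row):
        band = [[] for _ in range(size)]  # `size` pixel rows per row of blocks
        for idx in indices[start:start + blocks_per_row]:
            word = codebook[idx - 1]
            for r in range(size):
                band[r].extend(word[r * size:(r + 1) * size])
        rows.extend(band)
    return rows

def decode_row_based(indices, row_codebooks, blocks_per_row, size=4):
    """Rebuild the image line by line: pixel row r of a band uses row code-book r."""
    rows = []
    for start in range(0, len(indices), blocks_per_row):
        band_indices = indices[start:start + blocks_per_row]
        for r in range(size):
            line = []
            for idx in band_indices:
                line.extend(row_codebooks[r][idx - 1])
            rows.append(line)
    return rows

# The two example code-words from the text, and their row code-books.
w1 = [153] * 4 + [155] * 8 + [154] * 4
w2 = [150, 151, 150, 151, 148, 148, 148, 148, 135, 132, 132, 131, 112, 112, 113, 113]
codebook = [w1, w2]
row_books = [[w[r * 4:(r + 1) * 4] for w in codebook] for r in range(4)]
indices = [2, 1, 1, 2]  # a toy 8x8 image made of 2x2 blocks
by_block = decode_block_based(indices, codebook, blocks_per_row=2)
by_row = decode_row_based(indices, row_books, blocks_per_row=2)
print(by_block == by_row)  # True
```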

Fig. 16. Spatial decompression with the block-based code-book scheme.

5 Neuro-Fuzzy Based Video Compression

Many video applications, such as video conferencing, video phones, and distance education, run over the Internet and wireless networks. In these applications, facial expression and body gesture are usually the main focus of the video stream. Therefore, the most important topic in object segmentation is the extraction of human objects, including face and body, from image sequences.

Several approaches have been proposed for identifying human objects in a video stream. One approach [26, 27, 16, 56, 60, 37, 47, 65, 8, 19, 73] applies static features or spatial data, such as luminance, chrominance, location, or shape of human objects, to determine the foreground or background regions in a video frame. The advantage of this approach is that only one video frame is required for segmentation. However, finding unique features for identifying human objects is not easy, which leads to a high segmentation error. Another approach [12, 67, 29, 23, 36, 39, 40, 45, 46] uses motion information, or temporal data, to detect human objects. Human objects are assumed to have the most significant motion in a video sequence and are extracted by comparing two or more frames in a stream. However, when the motion is not obvious or other objects have more significant motion, the detected result will be wrong.

We combine spatial and temporal information [12, 67] and employ the SCFNN algorithm to overcome the above difficulties. The basic idea is that the base frame of a video stream is divided into segments and then each segment is categorized as foreground or background based on a combination of multiple criteria. Firstly, SCFNN is used to group similar pixels in the frame into clusters. Connected segments contained in the clusters are combined, and each segment is checked to see whether it is part of a human face using the values of chrominance and luminance. By referring to the position of the face region and related motion information, the corresponding body is located. Then, the obtained SCFNN network is further tuned by an SVD-based hybrid learning algorithm, and the tuned network can be used to precisely locate the human object in the base frame and the remaining frames of the video stream.

We have tested the proposed method on different color video sequences, including standard benchmarks Akiyo and Silent. The obtained results have shown that the method can improve the accuracy of the identification of human objects in video sequences. Also, the method can work well even when the human object presents no significant motion in a sequence.

5.1 System Overview

The system adopted for segmenting human objects combines temporal and spatial information, and consists of three main steps: clustering, detection, and refinement, as shown in Figure 18. For simplicity, we assume that at most one person appears in the video stream we are interested in. In the clustering stage, the SCFNN algorithm is used to group similar pixels in the base frame of a given video stream into fuzzy clusters. The number of clusters is determined automatically by the algorithm. Connected segments in the clusters are then combined to form larger segments. In the detection stage, each segment is checked to see whether it is part of a human face using chrominance values and the variance of luminance values. By referring to the position of the face region and related motion information, the corresponding body is found. Then the base frame is divided into three regions: foreground, background, and ambiguous. In the last step, a supervised network is constructed from the fuzzy clusters obtained by SCFNN and is trained by a highly efficient SVD-based hybrid learning algorithm using the data points obtained from the foreground and background regions. The trained fuzzy neural network is then used to decide the category of the pixels in the ambiguous region. Finally, the pixels belonging to the foreground region form the desired human object in the base frame. Note that the same neural network can be used for segmentation of the remaining frames in the same video stream.

Fig. 18. Block diagram of the human-object segmentation system.

By this approach, similar pixels are grouped together in the first stage and are processed collectively afterwards. The face region is determined by referring to both chrominance and luminance values. Also, motion information is used to help the determination of the corresponding body. The usage of such information is confined to the neighborhood of a specific area. By using a combination of multiple criteria in determining face and body, the difficulties associated with other methods can be alleviated. Therefore, the human object can be extracted more precisely in a video stream.

5.2 Clustering by SCFNN

We use composite signals, chrominance Cr and Cb and luminance Y, as the basis for clustering. Given an image frame of $N_1 \times N_2$ in size, we divide it into $(N_1/4) \times (N_2/4)$ blocks, each having 4×4 pixels. Each block is associated with a feature vector $x = [x_1, x_2, x_3]$ where $x_1$, $x_2$, and $x_3$ denote the average Cr, Cb, and Y values, respectively, of all the constituent pixels of the block. The Cr, Cb, and Y values of a pixel are obtained from its RGB values by

$$\begin{bmatrix} C_r \\ C_b \\ Y \end{bmatrix} = \begin{bmatrix} 0.500 & -0.419 & -0.081 \\ -0.169 & -0.331 & 0.500 \\ 0.299 & 0.587 & 0.114 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \tag{28}$$

where R is the red component, G is the green component, and B is the blue component of the RGB signal of the pixel [63].
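The feature extraction can be sketched as below. The sketch assumes a frame is given as a list of rows of (R, G, B) tuples whose dimensions are multiples of 4; the function names are ours. Following Eq. (28) literally, no offset is added to Cr and Cb, so a gray pixel maps to zero chrominance.

```python
def rgb_to_crcby(r, g, b):
    """Apply the matrix of Eq. (28) to one RGB pixel; returns (Cr, Cb, Y)."""
    cr = 0.500 * r - 0.419 * g - 0.081 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return cr, cb, y

def block_features(frame, size=4):
    """Average (Cr, Cb, Y) over each size x size block of an RGB frame."""
    features = []
    for top in range(0, len(frame), size):
        for left in range(0, len(frame[0]), size):
            pixels = [rgb_to_crcby(*frame[top + i][left + j])
                      for i in range(size) for j in range(size)]
            n = len(pixels)
            features.append(tuple(sum(p[k] for p in pixels) / n for k in range(3)))
    return features

# A 4x8 uniformly gray frame yields two blocks with (approximately) zero chrominance.
frame = [[(100, 100, 100)] * 8 for _ in range(4)]
features = block_features(frame)
print(len(features))  # 2
```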

Now, we have N training vectors, where $N = (N_1/4) \times (N_2/4)$. Each training vector has 3 dimensions, i.e., n = 3. We apply the SCFNN algorithm of Section 2.3 to find clusters. Since chrominance and luminance represent different characteristics of each block, we consider chrominance and luminance values separately. Therefore, we need to modify the two similarity tests in SCFNN. Firstly, we modify the similarity test associated with Eq.(20). For a training vector $(x_1, x_2, x_3)$, we calculate the following chrominance similarity measure:

$$d_1(x_1, x_2; C_j) = \prod_{i=1}^{2} g(x_i; v_{ij}, \sigma_{ij}) \tag{29}$$

for all 1 ≤ j ≤ J. We say that x passes the chrominance similarity test on cluster $C_j$ if

$$d_1(x_1, x_2; C_j) \ge \rho_1 \tag{30}$$

where $0 \le \rho_1 \le 1$ is a predefined threshold. Secondly, we modify the variance test associated with Eq.(21). We calculate the following luminance similarity measure:

$$d_2(x_3; C_j) = g(x_3; v_{3j}, \sigma_{3j}) \tag{31}$$

for every cluster $C_j$ on which x has passed the chrominance similarity test. We say that x passes the luminance similarity test on cluster $C_j$ if

$$d_2(x_3; C_j) \ge \rho_2 \tag{32}$$

where $0 \le \rho_2 \le 1$ is a predefined threshold. Finally, among the clusters on which x has passed both tests, we choose the cluster $C_j$ with the largest product $d_1(x_1, x_2; C_j)\, d_2(x_3; C_j)$ to be the one to which x is most similar.
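A minimal sketch of the two-stage test is given below. The membership function g is taken to have the Gaussian form of Eq. (42); since Section 2.3 is not reproduced here, this form, the data layout, and the function name `most_similar_cluster` should be read as our assumptions. The sketch returns `None` when no cluster passes both tests and leaves the handling of that case to the SCFNN algorithm.

```python
import math

def g(x, v, sigma):
    """Gaussian membership function, in the form used in Eq. (42)."""
    return math.exp(-((x - v) / sigma) ** 2)

def most_similar_cluster(x, clusters, rho1, rho2):
    """Index of the best cluster passing both tests, or None.

    `x` is (Cr, Cb, Y); each cluster is a pair (v, sigma) of 3-element sequences.
    """
    best, best_score = None, -1.0
    for j, (v, sigma) in enumerate(clusters):
        d1 = g(x[0], v[0], sigma[0]) * g(x[1], v[1], sigma[1])  # Eq. (29)
        if d1 < rho1:                                           # Eq. (30) fails
            continue
        d2 = g(x[2], v[2], sigma[2])                            # Eq. (31)
        if d2 < rho2:                                           # Eq. (32) fails
            continue
        if d1 * d2 > best_score:
            best, best_score = j, d1 * d2
    return best

clusters = [((10, 10, 100), (5, 5, 20)), ((50, 50, 100), (5, 5, 20))]
print(most_similar_cluster((11, 9, 105), clusters, 0.5, 0.5))   # 0
print(most_similar_cluster((30, 30, 100), clusters, 0.5, 0.5))  # None
```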

5.3 Labeling Segments

Suppose we have J fuzzy clusters, $C_1, C_2, \ldots, C_J$, after clustering. A cluster may consist of several parts which are not connected to each other, since we use composite signals, not positions, for clustering. Our desire is to let the pixels in a connected segment be processed collectively. We label each connected segment with a unique name. Let the connected segments be labeled as $S_1, S_2, \ldots, S_L$, where L is the total number of such segments. Note that L may be equal to or greater than J. Segments of very small size are considered to be noise or meaningless parts of the image. Therefore, they are merged into larger segments in order to reduce their influence on the final results. Let $n(S_i)$ denote the size of $S_i$, i.e., the number of blocks in $S_i$, and let $\kappa = \zeta N_1 N_2 / 16$ where ζ is a predefined parameter and 0 < ζ < 1. We check $n(S_i)$ for each segment $S_i$. If $n(S_i) \le \kappa$, then $S_i$ is merged into $S_j$, the smallest of the segments that are connected to $S_i$ and satisfy $n(S_j) > \kappa$. This process is iterated until all segments are bigger than κ. Let the resulting segments be labeled as $R_1, R_2, \ldots, R_Q$, where Q ≤ L is the total number of such segments. Later on, the image will be processed based on the connected segments, instead of individual pixels or blocks.

Let’s apply the above procedure on Figure 19(a) which is an image of 360×288 pixels consisting of 90×72 blocks. The image is divided into J = 24 clusters containing L = 931 connected segments. After the combination process, we have Q = 120 segments shown in Figure 19(b) in which different segments are represented by different gray values.

Fig. 19. (a) An example image; (b) Obtained connected segments.

5.4 Human Object Estimation

We apply chrominance values and the variation of luminance values to estimate the face region. By referring to the position of the face region and motion information, the body region is also estimated.

Face Estimation

Cr and Cb values of human skin have been found to occupy only a small region in the CrCb space, as the white area shown in Figure 20, with approximately

Fig. 20. Cr and Cb values of human skin.

possible face segments. Let $x_1(R_i)$ and $x_2(R_i)$ be the Cr and Cb values of segment $R_i$. If $(x_1(R_i), x_2(R_i))$ is located in the white area of Figure 20, then $R_i$ is regarded as a possible face segment. Let the possible face segments be denoted as $P_1, P_2, \ldots, P_K$. Also, a block is called a possible face block if it is contained in a possible face segment.

A human face is usually a round object with a smooth boundary. The technique of density map [17] can help eliminate branching or annoying parts. Then the image is divided into a set of maximally connected segments, i.e., segments that would no longer be connected if more elements were added to them. For convenience, these segments are labeled as $H_1, H_2, \ldots, H_G$, where G is the total number of such segments. Next, the variation of luminance is used to determine the segment which is most likely to be the human face. According to [17], the human face usually has the largest standard deviation of luminance in many MPEG-4 applications. We calculate the standard deviation of luminance for each segment $H_i$, 1 ≤ i ≤ G, as follows:

$$\sigma(H_i) = \sqrt{\frac{1}{N_B(H_i) - 1} \sum_{B \in H_i} \bigl(x_3(B) - x_3(H_i)\bigr)^2} \tag{33}$$

where B is any block in $H_i$, $N_B(H_i)$ is the number of blocks in $H_i$, $x_3(B)$ is the luminance value of B, and $x_3(H_i)$ is the mean luminance of $H_i$ defined by

$$x_3(H_i) = \frac{1}{N_B(H_i)} \sum_{B \in H_i} x_3(B). \tag{34}$$

Now, the segment $H_g$, 1 ≤ g ≤ G, with the maximum standard deviation, i.e., $g = \arg\max_{1 \le i \le G} \sigma(H_i)$, is chosen to be the estimation of the face region.
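Selecting the face candidate is then a matter of comparing sample standard deviations. In the sketch below, each segment is represented simply by the list of luminance values of its blocks, which is our own simplification; `statistics.stdev` uses the same $N_B(H_i) - 1$ denominator as Eq. (33) and requires at least two blocks per segment.

```python
from statistics import stdev

def face_candidate(segments):
    """Index g of the segment with the largest luminance standard deviation.

    `segments` is a list of lists of per-block luminance values x3(B).
    """
    return max(range(len(segments)), key=lambda i: stdev(segments[i]))

segments = [
    [100, 100, 101, 100],  # nearly uniform, e.g. a wall
    [80, 120, 95, 140],    # strongly varying luminance
    [60, 61, 60, 62],
]
print(face_candidate(segments))  # 1
```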

Note that $H_g$ is obtained from the calculation on blocks. We'd like to link it to segments and make a possible refinement on it. Consider a possible face segment $P_i$, 1 ≤ i ≤ K. Let $\Psi_i$ denote the set of blocks which belong to both $H_g$ and $P_i$, i.e.,

$$\Psi_i = H_g \cap P_i. \tag{35}$$

We check

$$N_B(\Psi_i) / N_B(P_i) \ge \lambda \tag{36}$$

where $N_B(\Psi_i)$ and $N_B(P_i)$ are the number of blocks in $\Psi_i$ and $P_i$, respectively, and λ, 0 < λ ≤ 1, is a predefined threshold. If Eq.(36) holds, then $P_i$ is accepted to be a part of the estimated face. Otherwise, it is not. This process is repeated for all possible face segments. Finally, we have a set of possible face segments that constitute the estimated face, and let these segments be labeled as $F_1, F_2, \ldots, F_W$. For convenience, we use $E_f$ to denote the estimated face region, i.e., $E_f = \{F_i \mid 1 \le i \le W\}$.
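The refinement of Eqs. (35) and (36) maps directly onto set operations. In this sketch, blocks are identified by hypothetical (row, column) tuples and segments are sets of such blocks; the function name `refine_face` is ours.

```python
def refine_face(h_g, possible_face_segments, lam):
    """Keep the segments P_i whose overlap with H_g satisfies Eq. (36)."""
    accepted = []
    for p in possible_face_segments:
        psi = h_g & p                 # Eq. (35): blocks shared by H_g and P_i
        if len(psi) / len(p) >= lam:  # Eq. (36)
            accepted.append(p)
    return accepted

h_g = {(0, 0), (0, 1), (1, 0), (1, 1)}
p1 = {(0, 0), (0, 1), (2, 2)}          # 2 of 3 blocks lie in H_g
p2 = {(1, 1), (5, 5), (5, 6), (5, 7)}  # 1 of 4 blocks lies in H_g
print(refine_face(h_g, [p1, p2], lam=0.5) == [p1])  # True
```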

Body Estimation

We assume that the body is located directly below the head. A circle below the face region is drawn to detect the corresponding body region. The circle is defined with center $(c_x, c_y + h)$ and radius $h/2$, where $(c_x, c_y)$ and h are the center and the height, respectively, of the face region $E_f$. Then, the labeled segments $R_i$, e.g., referring to Figure 19(b), covered partly or totally by the circle region are regarded as possible body segments. A block is called a possible body block if it belongs to a possible body segment. For convenience, we use $E_b$ to denote the estimated body region, which is the set of all possible body segments.

Based on the possible body segments obtained so far, we can add more segments, if any, to the estimated body by looking into the motion information associated with such segments. Let t represent the index of the current frame. We define the motion index of a segment $R_i$ as follows:

$$V(R_i) = \frac{\sum_{B \in R_i} \sum_{m=0}^{k-2} \sum_{j=1}^{3} \left| x_j^{t+m+1}(B) - x_j^{t+m}(B) \right|}{N_B(R_i)} \tag{37}$$

where k is the number of frames in a video sequence to be referenced, $x_j^m(B)$ denotes the $x_j$ value of block B in the mth frame, and $N_B(R_i)$ is the number of blocks in $R_i$. A segment $R_i$ is regarded as a possible body segment if

• $R_i$ is neither contained in $E_f$ nor in $E_b$;

• $R_i$ is connected to a segment in $E_b$;

• $V(R_i) \ge \beta$, where β is a user-defined constant.

When such an $R_i$ is found, $R_i$ is added to $E_b$. This process is iterated until no more such segments can be found.
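The motion index of Eq. (37) can be computed as in the following sketch. The data layout is our assumption: `frames[m][B]` holds the (Cr, Cb, Y) feature of block B in frame m, and frames t through t + k − 1 are referenced.

```python
def motion_index(segment_blocks, frames, t, k):
    """Motion index V(R_i) of Eq. (37) for the blocks of one segment."""
    total = 0.0
    for block in segment_blocks:
        for m in range(k - 1):  # m = 0 .. k-2
            nxt, cur = frames[t + m + 1][block], frames[t + m][block]
            total += sum(abs(a - b) for a, b in zip(nxt, cur))
    return total / len(segment_blocks)

# Three frames, two blocks 'a' and 'b' with (Cr, Cb, Y) features.
frames = [
    {"a": (0, 0, 10), "b": (0, 0, 20)},
    {"a": (1, 0, 12), "b": (0, 0, 20)},
    {"a": (1, 1, 12), "b": (0, 0, 26)},
]
print(motion_index(["a", "b"], frames, t=0, k=3))  # 5.0
```

In the example, block a contributes 3 + 1 and block b contributes 0 + 6, so the index is 10 / 2 = 5.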

Finally, we have the estimated human object, $E_u$, as follows:

$$E_u = E_f \cup E_b. \tag{38}$$

Figure 21(a) shows the estimated face region and Figure 21(b) shows the estimated human object for Figure 19(a).


Fig. 21. (a) The estimated face; (b) the estimated human object.

5.5 Human Object Refinement

As in [19], we divide the base image into foreground, background, and ambiguous regions. A fuzzy neural network is constructed and trained by the data points taken from the foreground and background regions. The blocks in the ambiguous region are then classified by the trained network to the foreground region or the background region. The blocks belonging to the foreground region form the desired human object.

Morphological erosion and dilation are used to find the foreground and background regions. Erosion is performed several times on $E_u$, and the resulting image is taken as the foreground region, denoted $Z_e$. Also, dilation is performed several times on $E_u$, and the resulting image is used to obtain the background region, denoted $Z_d$. The blocks that belong neither to $Z_e$ nor to $Z_d$ constitute the ambiguous region $Z_a$.
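A possible realization of this three-way split is sketched below. Several details are our assumptions, since the text does not fix them: a 4-neighbour structuring element is used, blocks are (row, column) tuples, and the background $Z_d$ is read as the part of the frame lying outside the dilated mask, so that the ambiguous region is the band between the eroded and the dilated masks.

```python
def neighbours(r, c):
    """4-neighbourhood of a block (assumed structuring element)."""
    return [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]

def erode(mask):
    """Keep a block only if all of its 4-neighbours are also in the mask."""
    return {b for b in mask if all(n in mask for n in neighbours(*b))}

def dilate(mask, rows, cols):
    """Add every in-frame 4-neighbour of a mask block."""
    grown = set(mask)
    for b in mask:
        grown.update((r, c) for r, c in neighbours(*b) if 0 <= r < rows and 0 <= c < cols)
    return grown

def split_regions(e_u, rows, cols, times=1):
    """Return (Z_e, Z_d, Z_a) as sets of (row, col) blocks."""
    z_e, grown = set(e_u), set(e_u)
    for _ in range(times):
        z_e, grown = erode(z_e), dilate(grown, rows, cols)
    everything = {(r, c) for r in range(rows) for c in range(cols)}
    z_d = everything - grown  # assumption: background = outside the dilated mask
    z_a = everything - z_e - z_d
    return z_e, z_d, z_a

# A 3x3 estimated object in the middle of a 7x7 grid of blocks.
e_u = {(r, c) for r in range(2, 5) for c in range(2, 5)}
z_e, z_d, z_a = split_regions(e_u, rows=7, cols=7)
print(len(z_e), len(z_d), len(z_a))  # 1 28 20
```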

Supervised Network Construction

Recall that each fuzzy cluster $C_j$ is represented by the product of $g(x_1; v_{1j}, \sigma_{1j})$, $g(x_2; v_{2j}, \sigma_{2j})$, and $g(x_3; v_{3j}, \sigma_{3j})$, representing Gaussian membership functions of Cr, Cb, and Y, respectively, and each $g(x_i; v_{ij}, \sigma_{ij})$, 1 ≤ i ≤ 3, has the center $v_{ij}$ and standard deviation $\sigma_{ij}$. $C_j$ can be interpreted as a fuzzy rule:

$$\text{IF } x_1 \text{ IS } g(x_1; v_{1j}, \sigma_{1j}) \text{ AND } x_2 \text{ IS } g(x_2; v_{2j}, \sigma_{2j}) \text{ AND } x_3 \text{ IS } g(x_3; v_{3j}, \sigma_{3j}) \text{ THEN } y \text{ IS } c_j \tag{39}$$

where $x_1$, $x_2$, $x_3$, and y are variables for Cr, Cb, Y, and the corresponding output, respectively. The output $c_j$ is set as follows:

$$c_j = \begin{cases} 1 & \text{if } C_j \text{ totally covers a block in } E_u, \\ 0 & \text{otherwise,} \end{cases} \tag{40}$$

for 1 ≤ j ≤ J. A rule with $c_j = 1$ specifies the conditions under which a block belongs to the foreground region. Note that we have J fuzzy rules. These rules form a rough discriminator for classification.

Based on the J rules, a four-layer supervised fuzzy neural network is constructed, as shown in Figure 22. The four layers are called the input layer, the fuzzification layer, the inference layer, and the output layer, respectively. Links between layer 1 and layer 2 are weighted by $(v_{ij}, \sigma_{ij})$, for all 1 ≤ i ≤ 3, 1 ≤ j ≤ J, links between layer 3 and layer 4 are weighted by $c_j$, for all 1 ≤ j ≤ J, and the other links are weighted by 1. Note that the first three layers are totally identical to those in an SCFNN network shown in Figure 4. Let $(x_1, x_2, x_3, y)$ be an input-output pattern where $(x_1, x_2, x_3)$ is the input vector and y is the corresponding desired output. The operation of the neural network is described as follows.

1. Layer 1. Layer 1 contains three nodes. Node i of this layer produces output $o^{(1)}_i$ by transmitting its input signal $x_i$ directly to layer 2, i.e.,

$$o^{(1)}_i = x_i \tag{41}$$

for all 1 ≤ i ≤ 3.

2. Layer 2. Layer 2 contains J groups and each group contains three nodes. Node (i, j) of this layer produces its output, $o^{(2)}_{ij}$, by computing the value of the corresponding Gaussian function, i.e.,

$$o^{(2)}_{ij} = G_{ij}(o^{(1)}_i) = \exp\left[-\left(\frac{o^{(1)}_i - v_{ij}}{\sigma_{ij}}\right)^2\right] \tag{42}$$

for all 1 ≤ i ≤ 3 and 1 ≤ j ≤ J.

3. Layer 3. Layer 3 contains J nodes. Node j's output, $o^{(3)}_j$, of this layer is the product of all its inputs from layer 2, i.e.,

$$o^{(3)}_j = \prod_{i=1}^{3} o^{(2)}_{ij} \tag{43}$$

for all 1 ≤ j ≤ J.

4. Layer 4. Layer 4 contains only one node whose output, $o^{(4)}$, represents the result of the centroid defuzzification, i.e.,

$$o^{(4)} = \frac{\sum_{j=1}^{J} o^{(3)}_j \cdot c_j}{\sum_{j=1}^{J} o^{(3)}_j} \tag{44}$$

Note that Layers 1–3 operate identically as the SCFNN network does. Apparently, $v_{ij}$, $\sigma_{ij}$, and $c_j$ are the parameters that can be tuned to improve the precision of the discriminator. The tuning is done by a hybrid learning described below.
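The four layers translate into a few lines of code. The sketch below follows Eqs. (41) to (44) directly; the parameter layout (`v[i][j]`, `sigma[i][j]`, `c[j]`) and the function name `forward` are our choices. For inputs far from every cluster center the denominator of Eq. (44) may underflow to zero, which a practical implementation would have to guard against.

```python
import math

def forward(x, v, sigma, c):
    """Output o(4) of the four-layer network for one input x = (x1, x2, x3).

    v[i][j], sigma[i][j]: center and deviation of input i for rule j; c[j]: rule output.
    """
    n_rules = len(c)
    o1 = list(x)                                                   # Eq. (41)
    o2 = [[math.exp(-((o1[i] - v[i][j]) / sigma[i][j]) ** 2)       # Eq. (42)
           for j in range(n_rules)] for i in range(3)]
    o3 = [o2[0][j] * o2[1][j] * o2[2][j] for j in range(n_rules)]  # Eq. (43)
    return sum(o * cj for o, cj in zip(o3, c)) / sum(o3)           # Eq. (44)

# Two rules: a foreground rule centered at 0 and a background rule centered at 10.
v = [[0.0, 10.0], [0.0, 10.0], [0.0, 10.0]]
sigma = [[5.0, 5.0], [5.0, 5.0], [5.0, 5.0]]
c = [1.0, 0.0]
print(forward((0.0, 0.0, 0.0), v, sigma, c))  # close to 1
print(forward((5.0, 5.0, 5.0), v, sigma, c))  # halfway between the rules: 0.5
```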

Hybrid Learning

As mentioned earlier, the training data for the network are taken from the foreground and background regions. Let the set of training data be denoted as $T = \{(x_{1k}, x_{2k}, x_{3k}, y_k) \mid 1 \le k \le N_T\}$ where $(x_{1k}, x_{2k}, x_{3k})$ is the input vector, denoting the Cr, Cb, and Y values, respectively, of the training data, and $y_k = 1$ for the data taken from the foreground region and $y_k = 0$ for the data taken from the background region.

The learning algorithm we use is a combination of a recursive SVD-based least squares estimator and the gradient descent method, which was demonstrated to be efficient for the network architecture of Figure 22 [49]. In each iteration, the learning of $v_{ij}$, $\sigma_{ij}$, and $c_j$ is treated separately. To optimize $c_j$, $v_{ij}$ and $\sigma_{ij}$ stay fixed, and the recursive SVD-based least squares estimator is applied. To refine $v_{ij}$ and $\sigma_{ij}$, $c_j$ stays fixed and the batch gradient descent method is used. The process is iterated until the desired approximation precision is achieved.

Let $k.o^{(4)}$ and $k.o^{(3)}_j$ denote the actual output of layer 4 and the actual output of node j in layer 3, respectively, for the kth training pattern. By Eq.(44), we have

$$k.o^{(4)} = a_{k1} c_1 + a_{k2} c_2 + \cdots + a_{kJ} c_J \tag{45}$$

where

$$a_{kj} = \frac{k.o^{(3)}_j}{\sum_{r=1}^{J} k.o^{(3)}_r} \tag{46}$$

for 1 ≤ j ≤ J. Apparently, we would like $|y_k - k.o^{(4)}|$ to be as small as possible for the kth training pattern. For all $N_T$ training patterns, we have $N_T$ equations in the form of Eq.(45). Clearly, we would like

$$J(X) = \| B - AX \| \tag{47}$$

to be as small as possible, where B, A, and X are matrices of $N_T \times 1$, $N_T \times J$, and $J \times 1$, respectively, and

$$B = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{N_T} \end{bmatrix}, \quad A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1J} \\ a_{21} & a_{22} & \cdots & a_{2J} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N_T 1} & a_{N_T 2} & \cdots & a_{N_T J} \end{bmatrix}, \quad X = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_J \end{bmatrix}. \tag{48}$$

As mentioned earlier, we treat $v_{ij}$ and $\sigma_{ij}$ as fixed, so X is the only variable vector in Eq.(47). The optimal X which minimizes Eq.(47) can be found by a recursive estimator based on the technique of singular value decomposition (SVD) [24, 25]. The method considers training patterns one by one, from the first pattern to the last, which reduces the time and space requirements [49].
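For orientation only, the batch form of this least squares problem has a well-known closed-form solution. Assuming the norm in Eq.(47) is the Euclidean norm, and writing the SVD of A as $A = U \Sigma V^{T}$ (the symbols U, Σ, V, and $X^{*}$ are introduced here and do not appear in the original derivation), the minimum-norm minimizer is

$$X^{*} = V \Sigma^{+} U^{T} B$$

where $\Sigma^{+}$ is obtained from Σ by transposing it and inverting its nonzero singular values. The recursive estimator of [24, 25] processes one training pattern at a time rather than forming A in full, so this expression is a reference point for the problem being solved, not a description of that algorithm.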

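On the other hand, parameters $v_{ij}$ and $\sigma_{ij}$, 1 ≤ i ≤ 3 and 1 ≤ j ≤ J, are refined by the gradient descent method, treating $c_j$, 1 ≤ j ≤ J, as fixed. The error function to be minimized is

$$E = \frac{1}{2 N_T} \sum_{k=1}^{N_T} \bigl(y_k - k.o^{(4)}\bigr)^2. \tag{49}$$

We adopt the batch backpropagation mode in order to work properly with the recursive SVD-based estimator. The learning rules for $v_{ij}$ and $\sigma_{ij}$ are [49]:

$$v_{ij}^{new} = v_{ij}^{old} - \eta_1 \left( \frac{\partial E}{\partial v_{ij}} \right) \tag{50}$$

$$\frac{\partial E}{\partial v_{ij}} = \frac{2}{N_T} \sum_{k=1}^{N_T} \left\{ \frac{[k.o^{(4)} - y_k]\,[c_j - k.o^{(4)}]\,[x_{ik} - v_{ij}]\; k.o^{(3)}_j}{\sigma_{ij}^{2} \sum_{r=1}^{J} k.o^{(3)}_r} \right\} \tag{51}$$

$$\sigma_{ij}^{new} = \sigma_{ij}^{old} - \eta_2 \left( \frac{\partial E}{\partial \sigma_{ij}} \right) \tag{52}$$

$$\frac{\partial E}{\partial \sigma_{ij}} = \frac{2}{N_T} \sum_{k=1}^{N_T} \left\{ \frac{[k.o^{(4)} - y_k]\,[c_j - k.o^{(4)}]\,[x_{ik} - v_{ij}]^{2}\; k.o^{(3)}_j}{\sigma_{ij}^{3} \sum_{r=1}^{J} k.o^{(3)}_r} \right\} \tag{53}$$

where $\eta_1$ and $\eta_2$ are learning rates.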
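The gradient rules can be checked numerically. The following sketch implements Eqs. (49) to (53) for a batch of patterns; the data layout (a list of `(x, y)` pairs, with `v[i][j]` and `sigma[i][j]` as nested lists) and all function names are our assumptions. Comparing `gradients` against a finite-difference estimate of `error` is a simple way to validate an implementation.

```python
import math

def outputs(x, v, sigma, c):
    """Return (o3, o4) for one pattern, following Eqs. (41)-(44)."""
    n_rules = len(c)
    o3 = [math.prod(math.exp(-((x[i] - v[i][j]) / sigma[i][j]) ** 2) for i in range(3))
          for j in range(n_rules)]
    return o3, sum(o * cj for o, cj in zip(o3, c)) / sum(o3)

def error(data, v, sigma, c):
    """Error function E of Eq. (49)."""
    return sum((y - outputs(x, v, sigma, c)[1]) ** 2 for x, y in data) / (2 * len(data))

def gradients(data, v, sigma, c):
    """Batch gradients dE/dv[i][j] and dE/dsigma[i][j], Eqs. (51) and (53)."""
    n_rules, n = len(c), len(data)
    dv = [[0.0] * n_rules for _ in range(3)]
    ds = [[0.0] * n_rules for _ in range(3)]
    for x, y in data:
        o3, o4 = outputs(x, v, sigma, c)
        total = sum(o3)
        for j in range(n_rules):
            common = (o4 - y) * (c[j] - o4) * o3[j] / total
            for i in range(3):
                diff = x[i] - v[i][j]
                dv[i][j] += 2 / n * common * diff / sigma[i][j] ** 2
                ds[i][j] += 2 / n * common * diff ** 2 / sigma[i][j] ** 3
    return dv, ds

def step(data, v, sigma, c, eta1, eta2):
    """One batch update of v and sigma, Eqs. (50) and (52); c stays fixed."""
    dv, ds = gradients(data, v, sigma, c)
    new_v = [[v[i][j] - eta1 * dv[i][j] for j in range(len(c))] for i in range(3)]
    new_s = [[sigma[i][j] - eta2 * ds[i][j] for j in range(len(c))] for i in range(3)]
    return new_v, new_s
```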

Final Human Object

After training is completed, the trained network is used to classify the blocks in the ambiguous region $Z_a$. Each block of the ambiguous region $Z_a$ is fed to the trained neural network. If the corresponding network output of a block is greater than or equal to a threshold φ, the block is categorized as foreground. Note that φ is a predefined parameter and 0 < φ < 1. Finally, the maximally connected segment [31] in the foreground region is taken to be the desired human object. Figure 23(a) shows the three regions $Z_e$, $Z_d$, and $Z_a$ for the example image of Figure 19(a), and the final human object obtained is shown in Figure 23(b). As in [19], the trained network is used for finding human objects in the other frames of the same video stream, without the necessity of reconstruction or retraining. Each block of such a frame is fed to the network. A block is categorized as foreground if the corresponding network output is greater than or equal to the threshold φ. The largest maximally connected segment is taken to be the desired human object of the underlying frame.

5.6 Experimental Results

We show segmentation results on two benchmark video streams, Akiyo (368×240) and Silent (360×288). The first frame of each stream is selected as the base frame. Note that the numbers shown in the parentheses indicate the resolution of each frame of the underlying stream. For example, each frame of Akiyo contains 368×240 pixels. Figures 24 and 25 show the human objects extracted from the base frames of these video streams. There are two sub-figures in each figure. The first sub-figure shows the original base frame image, and the second sub-figure shows the human object extracted.

Fig. 23. (a) Foreground, background, and ambiguous regions; (b) refined human object.

Fig. 24. The base frame of Akiyo: (a) original image; (b) extracted human object.

To evaluate the extraction accuracy quantitatively, we use the error index $E_I$ defined as the ratio of the number of mismatched pixels to the total number of pixels in an image, i.e.,

$$E_I = \frac{N_p^m}{N_1 N_2} \tag{54}$$

where $N_p^m$ is the number of mismatched pixels and $N_1 \times N_2$ is the total number of pixels in an image. A mismatched pixel is a non-human pixel mistaken for a human pixel or a human pixel mistaken for a non-human pixel. Obviously, the smaller $E_I$ is, the more accurate the extraction is. The error indices associated with Figures 24 and 25 are given in the first two columns in Table 4.
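The error index of Eq. (54) is straightforward to compute from two binary masks. In this sketch, masks are lists of rows with 1 for a human pixel and 0 otherwise; the representation and the function name `error_index` are our assumptions.

```python
def error_index(predicted, truth):
    """Fraction of mismatched pixels between two binary masks, Eq. (54)."""
    total = sum(len(row) for row in truth)
    mismatched = sum(p != t
                     for row_p, row_t in zip(predicted, truth)
                     for p, t in zip(row_p, row_t))
    return mismatched / total

predicted = [[1, 1, 0], [0, 0, 0]]
truth = [[1, 0, 0], [0, 0, 1]]
print(error_index(predicted, truth))  # 2 mismatches out of 6 pixels
```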

Table 4. Error indices for object extraction.

Akiyo, 1st    Silent, 1st    Akiyo, 50th    Silent, 50th
0.0245        0.0111         0.0236         0.0191

Next, we show the generalization capabilities of the approach. We extract the human object from the 50th image frames of these video streams. As mentioned before, we may use the trained network of the base frame to extract human objects in the other frames of the same stream, without reconstruction and retraining. The results are shown in Figures 26 and 27. The error indices associated with Figures 26 and 27 are given in the last two columns in Table 4.

Fig. 26. The 50th frame of Akiyo: (a) original image; (b) extracted human object.

Fig. 27. The 50th frame of Silent: (a) original image; (b) extracted human object.

6 Video Transmission by MPEG-4

We demonstrate the usage of fuzzy neural networks in real communication of videos in this section. We adopt MPEG-4 [58, 4, 5] as an example. In the following, MPEG-4 is introduced first, and the functions of fuzzy neural networks in MPEG-4 are then described.

6.1 Introduction of MPEG-4

MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), and is used for encoding/decoding object-based compressed video. It describes a multimedia system for communicating interactive scenes composed of natural, synthetic, audio or visual objects. The central concept is to divide each frame of an input video sequence into a number of arbitrarily shaped image regions called video object planes (VOPs). The input to be coded can be a VOP of arbitrary shape, and the shape and location of the region can vary from frame to frame. Successive VOPs belonging to the same physical object in a scene are referred to as a Video Object (VO). Thus, a VO can be defined as a sequence of VOPs of possibly arbitrary shape and position. Take the video sequence in Figure 28(a) as an example, which consists of three frames. Each frame can be divided into two VOPs belonging to different objects as shown in Figure 28(b) and Figure 28(c), respectively. Thus, each VOP in Figure 28(b) can be taken as an instance of the human object at a specific time.

To facilitate interaction, VOs in a video stream are organized in a hierarchical fashion and are characterized by intrinsic properties such as shape, texture, and motion. For example, a scene of a talking person, as shown in Figure 29, can be regarded as the composition of related objects, e.g., voice, human image, and background image. To handle the objects, each object has its own description element that allows the object to be combined with other objects or handled separately. Thus, we may access and manipulate VOs in different ways.

Fig. 28. (a) The video sequence of a talking person; (b) VOPs of the human object; (c) VOPs of the background object.

6.2 Encoding MPEG-4 Video

MPEG-4 does not specify how objects are to be extracted. We can use a fuzzy neural network in an MPEG-4 encoder, as shown in Figure 30, for creating VOs and their VOPs in each frame. The shape, motion, and texture information of the VOPs belonging to the same VO is then encoded into a separate Video Object Layer (VOL) to support separate decoding of VOs. In addition, relevant information needed to identify each VOL and how the VOLs are composed to reconstruct the video sequence is also included in the bitstream. This allows separate decoding of each VOP and flexible manipulation of the video sequence. The composition information about each VOP is also sent to indicate where and when the VOP is to be displayed.
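The step of creating VOPs and grouping them into layers can be sketched as follows. This is a minimal illustration under stated assumptions: the binary masks stand in for the output of the fuzzy neural network, frames are small lists of gray levels, and the function names `split_into_vops` and `build_vols` are hypothetical. No actual shape, motion, or texture coding is performed.

```python
def split_into_vops(frame, mask):
    """Split one frame into (object VOP, background VOP) using a binary mask.

    Pixels that lie outside a VOP's shape are stored as None.
    """
    obj = [[p if m else None for p, m in zip(frow, mrow)]
           for frow, mrow in zip(frame, mask)]
    bg = [[None if m else p for p, m in zip(frow, mrow)]
          for frow, mrow in zip(frame, mask)]
    return obj, bg


def build_vols(frames, masks):
    """Collect the VOPs of each object over time into its own layer."""
    vols = {"human": [], "background": []}
    for frame, mask in zip(frames, masks):
        obj, bg = split_into_vops(frame, mask)
        vols["human"].append(obj)
        vols["background"].append(bg)
    return vols


# Two 2x2 frames; the masks are assumed to come from the segmentation network.
frames = [[[10, 20], [30, 40]], [[11, 21], [31, 41]]]
masks = [[[0, 1], [0, 1]], [[1, 1], [0, 0]]]
vols = build_vols(frames, masks)
```

Keeping one list per object mirrors the idea that each VO is carried in its own layer and can be decoded separately from the others.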

Successive frames in a VOL stream are often similar, so the differences between VOPs, rather than each individual VOP, are encoded. Three major types of VOPs, i.e., I-VOP, P-VOP and B-VOP, are used to remove redundancies between frames. I-VOPs are self-contained and intra-frame coded. P-VOPs are predictively coded with respect to previously coded VOPs, while B-VOPs are bi-directionally coded using the differences between both the previous and next VOPs. I-VOPs must appear regularly in the stream since they are required to decode subsequent inter-coded VOPs such as P-VOPs and B-VOPs. Motion estimation is necessary for encoding P-VOPs and B-VOPs and works by matching blocks, with special attention being given to blocks that lie on the boundary of the VOP.

Fig. 29. The scene of a talking person.

Fig. 30. An MPEG-4 encoder.
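Block matching, on which the motion estimation above relies, can be illustrated with an exhaustive search that minimizes the sum of absolute differences (SAD). This is a simplified sketch: the SAD criterion, the full-search strategy, and the function names are assumptions chosen for illustration, and the special treatment of blocks on the VOP boundary is omitted.

```python
def sad(ref, cur, ry, rx, cy, cx, n):
    """Sum of absolute differences between an n x n block of ref and one of cur."""
    return sum(abs(ref[ry + i][rx + j] - cur[cy + i][cx + j])
               for i in range(n) for j in range(n))


def best_motion_vector(ref, cur, cy, cx, n, search):
    """Full search for the displacement (dy, dx) into the reference frame that
    best matches the n x n block of the current frame at (cy, cx).

    Returns (cost, dy, dx) for the lowest-cost candidate inside the frame.
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            if 0 <= ry <= h - n and 0 <= rx <= w - n:
                cost = sad(ref, cur, ry, rx, cy, cx, n)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best


# A bright 2x2 patch moves one pixel down and one pixel right between frames.
ref = [[9, 9, 0, 0],
       [9, 9, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
cur = [[0, 0, 0, 0],
       [0, 9, 9, 0],
       [0, 9, 9, 0],
       [0, 0, 0, 0]]
match = best_motion_vector(ref, cur, cy=1, cx=1, n=2, search=1)
```

With a perfect match the residual is zero, so only the motion vector needs to be coded for that block of a predicted VOP.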


The multiplexer in Figure 30 merges the bitstreams of different VOLs into a video bitstream. Generally, the multiplexed stream consists of the following components:

1. One initial object descriptor stream. This is the first object descriptor to be received. It describes how to locate the scene description and the associated streams.

2. One scene description stream. It specifies how the objects should be placed together in space and time to form an MPEG-4 scene, how certain objects in the scene should respond to user interaction, when and how the scene should be updated, etc.

3. One object descriptor stream. This associates an object in the scene with the actual media streams.

4. One or many VOs.
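The four components listed above can be pictured as a simple container. The sketch below is only a conceptual model: the class name `MultiplexedStream`, its field names, and the string identifiers are hypothetical and do not reflect the actual MPEG-4 bitstream syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MultiplexedStream:
    """Conceptual model of the multiplexed stream (not the real bitstream layout)."""
    # 1. Tells the receiver where to find the scene description and descriptors.
    initial_object_descriptor: Dict[str, str]
    # 2. Placement of objects in space and time (here reduced to a layer order).
    scene_description: List[Dict[str, object]]
    # 3. Maps each scene object to the identifier of its media stream.
    object_descriptors: Dict[str, str]
    # 4. The coded video objects, keyed by stream identifier.
    video_objects: Dict[str, bytes] = field(default_factory=dict)

    def stream_for(self, object_name: str) -> bytes:
        """Resolve a scene object to its media stream via the object descriptors."""
        return self.video_objects[self.object_descriptors[object_name]]


mux = MultiplexedStream(
    initial_object_descriptor={"scene": "scene-0", "descriptors": "od-0"},
    scene_description=[{"object": "background", "layer": 0},
                       {"object": "human", "layer": 1}],
    object_descriptors={"human": "vol-1", "background": "vol-2"},
    video_objects={"vol-1": b"\x01\x02", "vol-2": b"\x03"},
)
human_stream = mux.stream_for("human")
```

The indirection through the object descriptors is what lets the scene description refer to objects without knowing how their media streams are stored.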

6.3 Decoding MPEG-4 Video

The MPEG-4 decoding process is shown in Figure 31. The received bitstream is first split into a set of VOL bitstreams. Each VO is then decoded, and the results are composed. The composed scene is then presented to the user. Furthermore, at the decoder, a user may change the composition of the displayed scene by interacting with the composition information. For example, if the received bitstream is split into two VOs as shown in Figure 28(b) and Figure 28(c), respectively, then the composed video sequence will look like the one shown in Figure 28(a). However, if the user chooses to replace the original background object in Figure 28(c) with another background object as shown in Figure 32(a), then the reconstructed video sequence will look like the one shown in Figure 32(b).

Fig. 31. An MPEG-4 decoder.
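The composition step, including the replacement of the background object, can be sketched as an overlay of the decoded object VOP on a background frame. This is an illustration only: representing pixels outside the object shape as `None` and the function names `compose` and `compose_sequence` are assumptions made for this example.

```python
def compose(object_vop, background):
    """Overlay an object VOP on a background frame.

    Pixels of the object VOP that are None lie outside its shape, so the
    background pixel shows through at those positions.
    """
    return [[b if o is None else o for o, b in zip(orow, brow)]
            for orow, brow in zip(object_vop, background)]


def compose_sequence(object_vops, backgrounds):
    """Compose a whole sequence, one frame at a time."""
    return [compose(o, b) for o, b in zip(object_vops, backgrounds)]


# Decoded human VOPs for two frames (None marks pixels outside the shape).
human_vops = [[[None, 20], [None, 40]],
              [[11, 21], [None, None]]]
# A different background chosen by the user at the decoder.
new_background = [[[1, 2], [3, 4]],
                  [[5, 6], [7, 8]]]
reconstructed = compose_sequence(human_vops, new_background)
```

Because the human object is decoded independently of the background, swapping in another background requires no re-encoding of the human VOPs.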

Fig. 32. (a) VOPs of another background object; (b) reconstructed video sequence.

7 Conclusion

Finally, we summarize this chapter and offer some discussion of image/video compression.

7.1 Summary

With the rapid growth in the amount of multimedia communication, e.g., images and videos, relying solely on high-speed connections cannot guarantee the quality of service. At present, the only practical solution is data compression, which allows the user to access images or videos in a reasonable time with satisfactory quality.

Neuro-fuzzy techniques can be used for image/video compression due to their advantages of learning ability, resistance to noise, and parallel architectures. We have introduced three fuzzy neural networks for clustering, i.e., FKCN, Fuzzy-ART and SCFNN. In particular, we have described how SCFNN is used to generate code-words for vector quantization. The fuzzy clusters obtained by SCFNN have a high degree of intra-cluster similarity and a low degree of inter-cluster similarity. The mean vector of each obtained fuzzy cluster naturally becomes a code-word. The advantages of using SCFNN are that the fuzzy clusters generated are compact and dense, the real distribution of image content can be captured, and image content can be represented by code-words more appropriately. We have demonstrated the usage of fuzzy neural networks in real communication of images. NITF (National Imagery Transmission Format) [7, 1] was adopted for encoding/decoding VQ compressed images.