
Volume 2007, Article ID 83526, 10 pages
doi:10.1155/2007/83526

Research Article

A Learning State-Space Model for Image Retrieval

Cheng-Chieh Chiang,1,2 Yi-Ping Hung,3 and Greg C. Lee4

1 Department of Information and Computer Education, College of Education, National Taiwan Normal University, Taipei 106, Taiwan
2 Department of Information Technology, Takming College, Taipei 114, Taiwan
3 Graduate Institute of Networking and Multimedia, College of Electrical Engineering and Computer Science, National Taiwan University, Taipei 106, Taiwan
4 Department of Computer Science and Information Engineering, College of Science, National Taiwan Normal University, Taipei 106, Taiwan

Received 30 August 2006; Accepted 12 March 2007

Recommended by Ebroul Izquierdo

This paper proposes an approach based on a state-space model for learning the user concepts in image retrieval. We first design a scheme of region-based image representation built on concept units, which integrate different types of feature spaces with different region scales of image segmentation. The design of the concept units aims at describing characteristics that relevant images share from a certain perspective. We present the details of our proposed approach for interactive image retrieval, including the likelihood and transition models, and we also describe experiments that show the efficacy of the proposed model. This work demonstrates the feasibility of using a state-space model to estimate the user intuition in image retrieval.

Copyright © 2007 Cheng-Chieh Chiang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Image retrieval has become a very active research area since the 1990s due to the rapid increase in the use of digital images [1,2]. Estimating the user concepts is one of the most difficult tasks in image retrieval. Feature extraction obtains only low-level features such as color, texture, and shape from an image. However, people understand an image semantically, rather than via low-level visual features, and there is a large gap between the low-level features and the high-level concepts in image understanding [3].

The relevance feedback approach [4,5] is widely used for bridging this semantic gap. In each iteration of a retrieval task, the user assigns some relevant and irrelevant examples according to their concepts, from which the system learns to estimate what the user actually wants. Many types of learning models have been applied in relevance feedback for image retrieval, such as the Bayesian framework [6–8], SVM [9], and active learning [10]. Goh et al. also proposed several quantitative measures to model concept complexity in the learning of relevance feedback [10].

Image representation is another important issue that needs to be addressed when solving the above problem. It is necessary to design good units for image representation even if a perfect learning approach is applied to image retrieval. Many recent studies have adopted the region-based approach [9,11,12] for image representation, because region features can be more representative of user requests than global image features. Constructing a set of visual words [13,14], each of which collects similar region features into a representative unit, is appropriate for region-based image representation. Image annotation [15,16] is another method that labels an image with high-level information. Some researchers have also attempted to build a semantic space for describing the high-level concepts in images [17,18].

In this paper, we present a new scheme for image representation and propose a learning model for image retrieval. Instead of constructing a fixed semantic space for representing the user concepts, we have designed a flexible scheme based on concept units for region-based image representation that combines different types of feature spaces and different scales of image segmentation. We also propose an interactive approach for estimating the user concepts implicit in the user feedbacks in a query session, which is the period from when the first query is made until the corresponding relevance feedbacks are produced. Our basic idea is to track the behaviors of the user concepts of relevance feedbacks in image retrieval using a state-space model [19–21]. The state-space model has been well defined and widely applied to dynamic systems. However, we did not find studies in the literature that have applied the state-space model to the learning problem in relevance feedback. Our work aims at demonstrating the feasibility of solving the retrieval problem using a state-space model.

This paper is organized as follows. Section 2 introduces the motivation and the idea behind our proposed approach. Section 3 describes the proposed concept units used in region-based image representation, and the proposed learning model based on a state-space model is presented in Section 4. Section 5 presents the image ranking method used to determine the similarity of two images. Section 6 describes a strategy for handling negative examples. Section 7 details some experiments that applied our approach, and Section 8 draws conclusions and discusses future work.

2. MOTIVATION

We consider the problem of category search in image retrieval, which involves grouping images into the same category that the user perceives to be semantically relevant. For example, the image set from Corel Photo, a set of image data widely used in many studies, contains many types of semantic categories. Hence we consider a user, called "Corel Photo," who chooses relevant images to form these categories. Note that different users may assign different semantic categories within the same image set. The main challenge for category search is to estimate the user concepts, for example, those of "Corel Photo," from the interaction of the retrieval.

Let a query session comprise the first query and the corresponding relevance feedbacks. We assume that the user does not change the requested concepts, that is, the semantic concepts in a query session are constant. Ideally, we can view the process of obtaining relevance feedbacks as tracing the path from the first query to the retrieval goals, from which we can estimate the user concepts in a retrieval task.

During a retrieval task, the user could have a semantic goal but be unable to describe it explicitly; the retrieval target exists but is not explicit at the beginning of the retrieval. For example, the user may want to retrieve images of flowers but be unable to describe the wanted types until she/he looks at relevant images. For this scenario, we can model the tracing path of the user concepts as

\[ X_t = I_M \cdot X_{t-1} + \eta_{t-1}, \tag{1} \]

where X_t denotes the user state at the tth iteration, I_M is the identity matrix, and η_{t−1} is the noise term (i.e., the variations of user concepts in relevance feedbacks). We estimate each stage of the tracing path using the state X_t, which is determined from the previously estimated states and the various types of feedbacks specified by the user.

Figure 1 illustrates our idea of tracking the relevant region features in the feature space to estimate the user concepts in image retrieval. Figures 1(a) and 1(b) show the two sets of relevant images that are specified by the user at the tth and (t+1)th iterations, respectively. Figures 1(c) and 1(d) describe the process of tracking the movement of relevant regions in a visual feature space. At the tth iteration, it is assumed that the relevant region features involve three components, shown in Figure 1(c). Hence we can depict these region features using the centroids (i.e., means) of the three components. At the next iteration, the estimation of the state starts with the previous centroids, drawn as blue dots in Figure 1(d), and moves to the current relevant regions.

Figure 1: An illustration of tracking the movement of region features in relevance feedbacks. (a) Relevant examples at the tth iteration; (b) relevant examples at the (t+1)th iteration; (c) region features at the tth iteration; (d) tracking the movement of region features from the tth iteration to the (t+1)th iteration.

In this work, we aim at solving (1) to estimate the user concepts relevant to image retrieval. We assume that state X_t can be modeled using a Gaussian mixture [22] with means μ_t and variances σ_t, where μ_t represent the user concepts in state X_t, and σ_t are the variances of the user feedbacks in the noise term η_{t−1}. In the example of Figure 1, a pair of μ_t and σ_t corresponds to one component of the relevant region features at an iteration. Solving for the means μ_t and variances σ_t requires two major tasks: representation and estimation for the state.

We first have to design a scheme for representing the state, which intuitively handles the semantic gap between visual features and user concepts. We do not try to directly construct a semantic space for image retrieval because it is impossible to explicitly describe what the user wants before requests are made. In this work, we design a flexible scheme using concept units that are based on combinations of different types of region features and different scales of image segmentation. Any two images that are designated as relevant by the user should be similar from a certain perspective. The concept units are designed to represent unknown perspectives of relevant images based on the user perceptions.

We next design an iterative approach for learning and estimating the user state. The idea of estimating the tracing path of relevance feedbacks motivated us to design a state-space model of the user state described in (1). The state-space model has been widely applied to analyze and infer dynamic systems according to information on time sequences. In our proposed model, the time sequence for the state-space model is associated with the iteration process of relevance feedbacks, and the training data for learning or inferring the system are extracted from positive examples in the relevance feedbacks. Moreover, we design a simple strategy for handling negative examples in order to eliminate false alarms in the retrieval results.

3. CONCEPT UNITS FOR REGION AND IMAGE REPRESENTATION

3.1. Image segmentation and feature extraction

The region-based approach is widely used in the analysis of image contents. To extract regions, the first task is to partition an image into multiple regions using image segmentation. The most intuitive method for image segmentation is to segment objects (or foreground subjects) for region-based image matching [9,11–13]. However, this is very difficult, and the segmentation results greatly affect the performance of region-based tasks. Hence, some researchers have divided an image into rectangular grids [15] or a large number of overlapping circular regions [23].

Generally speaking, image segmentation may not be consistent with human perception. Our proposal is not to generate the perfect regions with segmentation, but rather to determine useful ones. We use the well-known watershed segmentation [24], which is an efficient, automatic, and unsupervised segmentation method for gray-level images, to partition an image into nonoverlapping regions. A color image is first converted to a gray image and then partitioned by the watershed segmentation. A watershed region is often homogeneous in the intensity space, which means that the pixels in a watershed region are not very diverse. Hence, the watershed regions are appropriate for representing the region units of an image. Wang proposed a multiscale approach for watershed segmentation that overcomes the problem of oversegmentation, the major drawback of the original watershed method, by controlling the scaling parameters [24]. Different scaling parameters result in different numbers of regions being segmented in the same image.

Assume that the database contains N images, denoted as {I_1, …, I_N}, and that v scales, denoted as S = {s_1, …, s_v}, are used for watershed segmentation. Given a scale s_q, we assume there are n_q regions partitioned over all images in the database. Thus, we can denote the set of regions as

\[ \left\{ r_1^{s_q}, \ldots, r_{n_q}^{s_q} \right\}. \tag{2} \]

Let the set of features F = {f_1, …, f_u} contain u different types of visual features. Given a feature type f_p, the feature vector extracted from region r_i^{s_q} is written as f_p(r_i^{s_q}). Thus, given a feature type f_p and a scale s_q, we have a set of feature vectors, denoted R_p^q, with respect to the set of watershed regions in (2):

\[ R_p^q = \bigcup_{i=1}^{n_q} \left\{ f_p\left(r_i^{s_q}\right) \right\}, \quad 1 \le p \le u,\ 1 \le q \le v. \tag{3} \]

Note that the region representation described above is independent of the choice of visual features and segmentation methods. We collect different scales and different features of regions for an image in order to represent unknown perspectives of relevant images. Using more types of visual features and more scales of regions covers a wider range of the image contents, but makes the computational complexity excessive. In this work, four types of visual features (i.e., u = 4) are used: (i) color histogram, (ii) color moments (both color features are in HSV space), (iii) cooccurrence texture, and (iv) Gabor texture. Moreover, we set v = 2, that is, two region scales, in the watershed segmentation.
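To make the construction of (3) concrete, here is a minimal sketch, not the authors' implementation: it assumes scikit-image's watershed transform with grid-seeded markers standing in for the multiscale method of [24], and uses two illustrative feature extractors (u = 2 here instead of the paper's u = 4).

```python
# A minimal sketch (not the authors' code) of building the region-feature
# sets R_p^q of Eq. (3): segment each image at every scale, then extract
# every feature type from every resulting region.
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import sobel
from skimage.segmentation import watershed

def color_histogram(pixels, bins=8):
    """A per-channel histogram as a stand-in for the HSV color histogram."""
    h = np.concatenate([np.histogram(pixels[:, c], bins=bins, range=(0, 1))[0]
                        for c in range(3)]).astype(float)
    return h / max(h.sum(), 1.0)

def color_moments(pixels):
    """Per-channel mean and standard deviation."""
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])

FEATURES = [color_histogram, color_moments]   # the set F = {f_1, ..., f_u}
SCALES = [200, 800]                           # v = 2 marker counts (hypothetical knob)

def segment(image, n_markers):
    """Marker-controlled watershed on the gray-level gradient; the marker
    count plays the role of the scaling parameter s_q."""
    gray = rgb2gray(image)
    markers = np.zeros(gray.shape, dtype=int)
    step = max(1, int((gray.size / n_markers) ** 0.5))
    ys, xs = np.mgrid[step // 2:gray.shape[0]:step, step // 2:gray.shape[1]:step]
    markers[ys, xs] = np.arange(1, ys.size + 1).reshape(ys.shape)
    return watershed(sobel(gray), markers)

def region_features(images):
    """Return R[p][q]: all vectors f_p(r_i^{s_q}) pooled over the database."""
    R = [[[] for _ in SCALES] for _ in FEATURES]
    for image in images:                      # image: float RGB array in [0, 1]
        for q, n_markers in enumerate(SCALES):
            labels = segment(image, n_markers)
            for lab in np.unique(labels):
                pixels = image[labels == lab]
                for p, f in enumerate(FEATURES):
                    R[p][q].append(f(pixels))
    return R
```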

3.2. Concept units

Since it is impossible to predict the best way to represent an image, for example, which type of features or which scale of image segmentation is better for image representation, before the user makes the query, we first collect different types of region representation, and then estimate which is best for characterizing the user's perceptions in relevance feedbacks. R_p^q, in (3), represents the collection of visual features of watershed regions that are observed using different scales and different features, hence giving a total of u×v types of region features with v scaling parameters and u types of visual features.

Given the feature type f_p and the scaling parameter s_q, we apply the K-means algorithm [22] to cluster the feature vectors R_p^q. That is, we partition the feature space into K areas. Suppose C_p^q(1), …, C_p^q(K) are the clusters for all regions with respect to s_q and f_p. Collecting all of the region features yields the clusters:

\[ \bigcup_{p,q} \bigcup_{k=1}^{K} C_p^q(k). \tag{4} \]

Figure 2: The probabilistic structure of the state-space model: states x_1, …, x_{t−1}, x_t, observations z_1, …, z_{t−1}, z_t, transitions p(x_t | x_{t−1}), and likelihoods p(z_t | x_t).

These u×v×K clusters are the concept units, for all 1 ≤ p ≤ u, 1 ≤ q ≤ v, and 1 ≤ k ≤ K, representing images in the entire image database with different scalings and different features. The definition of concept units is a variant of the so-called visual word [13,14], which draws the processing units in the space of the visual features. Generating the concept units with different types of feature spaces and with different region scales provides more possibilities to fit the different characteristics of the image contents for semantically relevant images. In our experiments, we set K at 400, hence giving u×v×K = 4×2×400 = 3200 concept units.
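As a concrete illustration, the following sketch (an assumption of ours, using scikit-learn's KMeans rather than anything specified by the paper) builds one codebook per (feature type, scale) pair from the region-feature sets of the previous sketch; the cluster centers are the concept units of (4).

```python
# A minimal sketch of building the u x v x K concept units of Eq. (4):
# one K-means codebook per (feature type, scale) combination.
import numpy as np
from sklearn.cluster import KMeans

K = 400  # clusters per (p, q) pair, as in the paper's experiments

def build_concept_units(R, k=K):
    """R[p][q]: list of region feature vectors (see region_features above).
    Returns units[p][q], a fitted KMeans whose centers are the concept units."""
    units = []
    for per_feature in R:
        row = []
        for vectors in per_feature:
            km = KMeans(n_clusters=k, n_init=10, random_state=0)
            km.fit(np.asarray(vectors))
            row.append(km)
        units.append(row)
    return units
```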

3.3. Region-based image representation

We can build the concept units in (4) for all images in the database in order to represent the types of contents that the user retrieves. Therefore, we design a region-based image representation based on the concept units. Let I be an image in the database. For each concept unit C_p^q(k), where 1 ≤ p ≤ u, 1 ≤ q ≤ v, and 1 ≤ k ≤ K, let the weight w_p^q(k) be the ratio of the number of regions belonging to C_p^q(k) to the number of regions in image I. Thus, we collect all weights w_p^q(k) to form a (u×v×K)-dimensional vector for representing image I:

\[ \left\{ w_p^q(k) \mid 1 \le p \le u,\ 1 \le q \le v,\ 1 \le k \le K \right\}. \tag{5} \]
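The weight vector of (5) can then be computed by assigning each region of an image to its nearest concept unit. A minimal sketch, continuing the previous ones (image_weights, region_vectors, and the default k = 400 are our hypothetical names and settings):

```python
# A minimal sketch of the (u*v*K)-dimensional representation of Eq. (5):
# w_p^q(k) = fraction of the image's regions assigned to concept unit C_p^q(k).
import numpy as np

def image_weights(region_vectors, units, k=400):
    """region_vectors[p][q]: feature vectors of ONE image's regions for the
    (p, q) combination; units: from build_concept_units above."""
    u, v = len(units), len(units[0])
    w = np.zeros((u, v, k))
    for p in range(u):
        for q in range(v):
            vectors = np.asarray(region_vectors[p][q])
            labels = units[p][q].predict(vectors)   # concept unit of each region
            w[p, q] = np.bincount(labels, minlength=k) / max(len(labels), 1)
    return w.ravel()                                # the vector of Eq. (5)
```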

4. LEARNING MODEL BASED ON A STATE-SPACE MODEL

4.1. State-space model

The state-space approach has been widely applied to the analysis of dynamic systems; it involves estimating the state of a system that changes over time from a sequence of noisy measurements [19]. Many papers have detailed state-space models [19–21], and hence here we only provide a brief summary of how the posterior probability of a state-space model is inferred.

Figure 2 depicts the probabilistic structure of the Bayesian network of a state-space model, which contains two types of nodes at time t: (i) x_t for the system state and (ii) z_t for the observation measurement. At time t, the dynamic system receives inputs z_t, for which we want to estimate the posterior probability of the system state x_t given the past observations; this is denoted as p(x_t | z_{1,…,t}), where z_{1,…,t} represents the collection of observations z_1 to z_t. Two assumptions are generally applied to a state-space model for simplicity. The first is the first-order Markov property, given by

\[ p\left(x_t \mid x_{1,\ldots,t-1}\right) = p\left(x_t \mid x_{t-1}\right), \tag{6} \]

where x_{1,…,t−1} represents the collection of states x_1 to x_{t−1}. The second is that the observations are mutually independent:

\[ p\left(z_t \mid x_t, z_{1,\ldots,t-1}\right) = p\left(z_t \mid x_t\right), \tag{7} \]

where z_{1,…,t−1} means the collection of the observations z_1 to z_{t−1}. By using the above two assumptions and Bayes' rule, the posterior probability of state x_t given the past observations can be inferred as

\[ p\left(x_t \mid z_{1,\ldots,t}\right) = \frac{p\left(z_t \mid x_t\right)\, p\left(x_t \mid z_{1,\ldots,t-1}\right)}{p\left(z_t \mid z_{1,\ldots,t-1}\right)}, \tag{8} \]

where

\[ p\left(x_t \mid z_{1,\ldots,t-1}\right) = \sum_{x_{t-1}} p\left(x_t \mid x_{t-1}\right)\, p\left(x_{t-1} \mid z_{1,\ldots,t-1}\right). \tag{9} \]

Thus, we can infer the posterior probability as

\[ p\left(x_t \mid z_{1,\ldots,t}\right) = \frac{p\left(z_t \mid x_t\right)}{p\left(z_t \mid z_{1,\ldots,t-1}\right)} \sum_{x_{t-1}} p\left(x_t \mid x_{t-1}\right)\, p\left(x_{t-1} \mid z_{1,\ldots,t-1}\right) \propto p\left(z_t \mid x_t\right) \sum_{x_{t-1}} p\left(x_t \mid x_{t-1}\right)\, p\left(x_{t-1} \mid z_{1,\ldots,t-1}\right). \tag{10} \]

In (10), the posterior probability p(x_t | z_{1,…,t}) in a state-space model is computed recursively from two factors: (i) a system model p(x_t | x_{t−1}), which describes the evolution of the state over time (called the transition function), and (ii) a measurement model p(z_t | x_t), which relates the observation and noise to the state (called the observation function). It is also necessary to define the prior probability of the state, p(x_1), at the beginning of the recursion.
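For a discrete state space such as the concept units of Section 3, the recursion in (10) is the classic forward/filtering update. A minimal sketch (our own illustration; the array layout is an assumption, not the paper's code):

```python
# One step of the recursion in Eq. (10) over n discrete states.
# trans[i, j] stands for p(x_t = j | x_{t-1} = i); likelihood[j] for p(z_t | x_t = j).
import numpy as np

def filter_step(belief, trans, likelihood):
    """Update p(x_{t-1} | z_{1..t-1}) -> p(x_t | z_{1..t})."""
    predicted = belief @ trans             # Eq. (9): sum over x_{t-1}
    posterior = likelihood * predicted     # numerator of Eq. (8)
    return posterior / posterior.sum()     # normalization plays the evidence role

# Usage: start from a uniform prior, as in Section 4.2.
n = 4
belief = np.full(n, 1.0 / n)
trans = np.full((n, n), 1.0 / n)           # placeholder transition matrix
belief = filter_step(belief, trans, np.array([0.1, 0.4, 0.3, 0.2]))
```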

4.2. The proposed learning model

The user intuition is usually implicit in the specification of positive and negative examples in the query session. Positive examples are generally used to estimate the user intuition, and negative examples are used as exceptions in the estimation. Hence, we use the positive examples of the tth iteration of relevance feedbacks as the observations z_t of the tth stage of the state-space model, and the negative examples are used to eliminate false alarms in the retrieval results. The strategy for handling the negative examples is described in Section 6. The user concepts X_t, stated in (1), can be approximated by a Gaussian mixture model with means μ_t and variances σ_t, where the means μ_t indicate the concept units for representing the user concepts, and the variances σ_t cover the varying scopes of the user concepts in the concept units. Intuitively, the state vector for the state-space model could be defined as a set of the pairs of means and variances for the Gaussian mixture model. However, this makes the model very complex, and we also do not have a huge training data set for learning and inferring the model because the number of positive examples in a query session is not large. Hence, it is necessary to simplify the design of the state-space model for image retrieval.

In this work, we simplify the definition of the state vector in two ways. The first is to ignore the variances σ_t. The definition of concept units already covers some variance because they are defined as clusters in the feature space. Ignoring the variances σ_t in defining the state vector means that we assume that the variance of concepts is limited to the radius of the concept units. The second is to define a single concept unit in the state vector, as a greedy method, instead of multiple concept units. Considering the tth iteration in a query session, let x_t be the most representative concept unit for the user concepts that we want to estimate, and let z_{1,…,t} be the collection of positive examples of relevance feedbacks. Thus, we want to find the maximal posterior estimation of state x_t given the past positive examples (observations z_{1,…,t}) in relevance feedbacks:

\[ x_t^{*} = \arg\max_{x_t} p\left(x_t \mid z_{1,\ldots,t}\right). \tag{11} \]

The user concepts in the query session generally comprise multiple rather than single factors, and hence we take the H concept units with the highest posterior probabilities to represent the user concepts. Below we define the state vector, observation function, and transition function that are used to construct the state-space model.

State vector

We define the state as the most representative concept unit for the query session. The definition of concept unit C_p^q(k) is associated with feature type p, region scale q, and cluster k, and thus we define the state vector as a three-dimensional vector denoted as (p, q, k), where 1 ≤ p ≤ u, 1 ≤ q ≤ v, and 1 ≤ k ≤ K.

Observation function

Let the positive images of relevance feedbacks be the observations of the state-space model. We define the observation function p(z_t | x_t) as the likelihood of the observation given each state:

\[ p\left(z_t \mid x_t\right) = \frac{\text{no. of regions in the positive images falling in the concept unit } x_t}{\text{no. of all regions in the positive images}}. \tag{12} \]

Consider an example in which there are 100 regions in the relevant images at an iteration of a query session. These observations contain 100 concept-unit occurrences because each region feature belongs to a concept unit. If 35 regions fall in the same concept unit, its observation measurement is 35/100 = 0.35.
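Because each region of the positive images is assigned to exactly one concept unit, the likelihoods of all states can be computed in one pass. A minimal sketch (observation_likelihoods and assignments are our hypothetical names):

```python
# A minimal sketch of Eq. (12): the likelihood of a state (one concept unit)
# is the fraction of the positive images' regions that fall in that unit.
import numpy as np

def observation_likelihoods(assignments, n_units):
    """assignments: concept-unit index of every region in the positive
    examples (e.g. from KMeans.predict); returns p(z_t | x_t) per unit."""
    counts = np.bincount(np.asarray(assignments), minlength=n_units)
    return counts / max(len(assignments), 1)

# The worked example above: 35 of 100 regions in one unit gives 0.35.
```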

Transition function

The transition model p(x_t | x_{t−1}) is designed to model the variations of the concept units representing the user concepts across iterations of relevance feedbacks. The transition function must record the changing cost between any two concept units. Given two state vectors v_1 = (p_1, q_1, k_1) and v_2 = (p_2, q_2, k_2) with p_1 ≠ p_2, the two units are from different feature spaces. Because different types of features capture different characteristics in images, it is inappropriate to estimate the state across different features. Hence we set the transition function Trans(v_1, v_2) to 0 if p_1 ≠ p_2.

We next consider the case in which the concept units are in the same feature space, that is, p_1 = p_2. Here we can compute a meaningful distance between the two concept units, whether or not they share the same region scale. However, the transition measurement of concept units crossing different scales should be less than that within the same scale. Let M(p_1, q_1, p_2, q_2) be a K×K matrix in which each element M_{ij} is the Euclidean distance between concept units (p_1, q_1, i) and (p_2, q_2, j). Note that M_{ij} corresponds to the Euclidean distance between the means of clusters C_{p_1}^{q_1}(i) and C_{p_2}^{q_2}(j). We then define the transition function as

\[ \mathrm{Trans}\left(v_1\left(p_1, q_1, k_1\right), v_2\left(p_2, q_2, k_2\right)\right) = \begin{cases} \dfrac{2 \exp\left(-M_{k_1 k_2}\right)}{\sum_y \exp\left(-M_{k_1 y}\right)} & \text{if } p_1 = p_2,\ q_1 = q_2, \\[2ex] \alpha \cdot \dfrac{2 \exp\left(-M_{k_1 k_2}\right)}{\sum_y \exp\left(-M_{k_1 y}\right)} & \text{if } p_1 = p_2,\ q_1 \ne q_2, \\[1ex] 0 & \text{if } p_1 \ne p_2, \end{cases} \tag{13} \]

where α is a scaling factor with 0 ≤ α ≤ 1. Note that α = 0.5 in our implementation.
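A minimal sketch of (13) (our own illustration; centers[p][q] is assumed to hold the K cluster centers for feature type p and scale q, e.g. the KMeans cluster_centers_ from Section 3.2):

```python
# Trans(v1, v2) of Eq. (13) for state vectors v = (p, q, k); alpha = 0.5
# discounts transitions that cross region scales.
import numpy as np

def transition(v1, v2, centers, alpha=0.5):
    p1, q1, k1 = v1
    p2, q2, k2 = v2
    if p1 != p2:
        return 0.0                          # never cross feature spaces
    # Row k1 of M(p1, q1, p2, q2): distances to every unit y of (p2, q2).
    row = np.linalg.norm(centers[p2][q2] - centers[p1][q1][k1], axis=1)
    value = 2.0 * np.exp(-row[k2]) / np.exp(-row).sum()
    return value if q1 == q2 else alpha * value
```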

Prior distribution

All of the prior probabilities of the states are set equal. This means that the tracking of the model starts at all concept units.

At the beginning of the iterations, all concept units have equal probabilities for representing the query concepts. During the process of relevance feedbacks in the query session, representative concept units from observations will receive higher probabilities based on the inference of the state-space model using (10). We take the first H concept units with maximal posterior probabilities to represent the user concepts at each iteration.

Two factors are involved in image retrieval based on the proposed state-space model: (i) the likelihoods of positive examples and (ii) the transitive conditions between any two concept units. The former is commonly applied in a Bayesian framework, whereas the latter is not common in image retrieval. An interesting approach to the transition is to use an ontological structure which represents a domain of knowledge in image retrieval [25,26]. Note that embedding these two factors in relevance feedbacks is one of the main contributions of our proposed model.

Figure 3: An illustration of the negative holes; d: distance to the nearest positive region; r: the radius of the negative hole, r = d/2. (The figure marks regions of positive images, regions of negative images, and untested regions.)

5. IMAGE RANKING

The proposed learning model uses H concept units to represent, to a large extent, the concepts the user retrieves in a query session. A similarity measure between the retrieval concepts and an image in the database is used for image matching and ranking. Without loss of generality, let the first H concept units with maximal posterior probabilities at the tth iteration be denoted by v_{τ(i)}, where 1 ≤ i ≤ H. The posterior probabilities of these H concepts are described by

\[ p_t(i) = p\left(x_t\left(v_{\tau(i)}\right) \mid z_t\right), \quad 1 \le i \le H, \tag{14} \]

where τ(i) is the index of the concept unit, and x_t(v_{τ(i)}) is the state with concept unit v_{τ(i)} at the tth iteration.

The idea of the similarity measure is to find images containing most of the H concept units in (14). Since an image I in the database can be represented as (5), we define a dissimilarity measure between the retrieval concepts of the query session and the image I at the tth iteration as follows:

\[ \mathrm{DisSim}(I, t) = \left[ \sum_{i=1}^{H} \left( w_{\tau(i)} - p_t(i) \right)^2 \right]^{1/2}. \tag{15} \]
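A minimal sketch of (15) (dissim, top_idx, and posteriors are our hypothetical names; w is an image's weight vector from (5)):

```python
# Ranking by Eq. (15): Euclidean distance between the image's weights on the
# H selected concept units and those units' posterior probabilities.
import numpy as np

def dissim(w, top_idx, posteriors):
    """Smaller values mean the image better matches the estimated concepts."""
    return float(np.linalg.norm(w[top_idx] - np.asarray(posteriors)))

# Ranking: sort database images by ascending dissimilarity.
# ranked = sorted(db, key=lambda img: dissim(weights[img], top_idx, posteriors))
```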

6. STRATEGY FOR HANDLING NEGATIVE EXAMPLES

The previous sections use only positive examples of feedbacks for learning the concepts that the user wants to retrieve. While negative examples could be applied in the learning model to decrease the rate of false retrieval results, handling them is difficult because they are diverse both in feature spaces and in semantic concepts. In our opinion, a negative example only removes some of the false retrieval results in a localized area. In this work, we adopt the strategy from [27] for handling negative examples. The basic idea is to excavate a "negative hole" in the feature space around the regions of each negative example. Figure 3 illustrates an example of negative holes. The center of a negative hole is a region feature of a negative image, and its radius is half the distance from the negative region to the nearest positive one. Each iteration of relevance feedbacks involves the generation of many negative holes associated with regions of negative examples. A region of a test image in the database is neglected in computing the weights w_p^q(k) in (5) if it falls in a negative hole.
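A minimal sketch of the negative-hole test (our own illustration; the function and variable names are hypothetical):

```python
# Negative holes (Section 6): a ball around each negative region feature whose
# radius is half the distance to the nearest positive region feature.
import numpy as np

def negative_holes(neg_regions, pos_regions):
    """Return (center, radius) pairs, one hole per negative region feature."""
    holes = []
    for n in neg_regions:
        d = min(np.linalg.norm(n - p) for p in pos_regions)
        holes.append((n, d / 2.0))
    return holes

def keep_region(feature, holes):
    """True if the feature lies outside every hole and may therefore count
    toward the weights w_p^q(k) of Eq. (5)."""
    return all(np.linalg.norm(feature - c) >= r for c, r in holes)
```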

7. EXPERIMENTAL RESULTS AND DISCUSSION

7.1. Datasets

In our experiments, we used three datasets (denoted DI, DII, and DIII), where DI and DII contain photo images collected from Corel Photo and DIII is the Caltech-101 Object Categories [28].

Dataset DI

DI contains 20 categories, and each category consists of 100 photo images. All images can be partitioned into over 70 000 regions with two scales of image segmentation. These images contain a wide range of contents, such as landscapes, animals, plants, and buildings. The categories are classified according to human concepts such as "beautiful rose," "autumn," and "doors in Paris," and hence even images in the same category may have diverse contents. However, all images in the same category are viewed as relevant to each other.

Dataset DII

We extended DI to the larger dataset DII, which contains 50 categories, each consisting of 100 photo images, giving a total of 5 000 images. All images can be partitioned into over 200 000 regions with two scales of image segmentation. For each category in DI and DII, we randomly chose 10 images as the query, so the sizes of the query sets are 200 and 500 images, respectively. Moreover, 10 iterations are performed for each query.

Dataset DIII

We took the Caltech-101 Object Categories [28], which is publicly available and involves 101 categories of objects with over 8 000 images, as the third dataset. The number of images per category varies. Over 300 000 regions are segmented with two scales of image segmentation. We randomly chose 10 images as the query for the larger categories, which contain more than 80 images, giving a total of 240 query images.

7.2. Evaluation and discussion

The precision and the recall are commonly used to evaluate the performance of a retrieval system. Note that precision = A/B and recall = A/C, where A is the number of relevant images that we retrieve, B is the number of returned images, and C is the number of relevant images in the database (C = 100 in DI and DII). We set B = 100 in our system, hence precision = recall in datasets DI and DII. Moreover, some of the categories contain more than 100 images in dataset DIII. Thus, we employ the recall instead of the precision to evaluate the performance of the proposed method in our experiments.

Table 1: The detailed precisions using DI without handling negative examples.

Cat ID  t=1    t=2    t=3    t=4    t=5    t=6    t=7    t=8    t=9    t=10
0       0.354  0.497  0.549  0.556  0.557  0.558  0.559  0.559  0.559  0.559
1       0.134  0.251  0.305  0.332  0.349  0.352  0.355  0.355  0.355  0.355
2       0.154  0.302  0.398  0.432  0.443  0.447  0.453  0.457  0.457  0.457
3       0.156  0.273  0.381  0.446  0.479  0.491  0.493  0.495  0.496  0.496
4       0.177  0.268  0.378  0.485  0.531  0.548  0.553  0.554  0.554  0.554
5       0.241  0.475  0.633  0.713  0.752  0.754  0.758  0.758  0.758  0.758
6       0.247  0.404  0.548  0.651  0.705  0.722  0.724  0.725  0.726  0.726
7       0.156  0.266  0.386  0.484  0.538  0.555  0.555  0.555  0.556  0.565
8       0.245  0.428  0.547  0.583  0.606  0.607  0.608  0.609  0.613  0.634
9       0.415  0.644  0.782  0.849  0.883  0.884  0.884  0.884  0.884  0.884
10      0.221  0.395  0.497  0.533  0.543  0.545  0.546  0.562  0.641  0.709
11      0.285  0.548  0.657  0.672  0.673  0.673  0.673  0.693  0.810  0.859
12      0.205  0.352  0.455  0.504  0.521  0.524  0.539  0.556  0.730  0.788
13      0.223  0.375  0.464  0.513  0.523  0.531  0.563  0.701  0.760  0.798
14      0.238  0.358  0.496  0.593  0.643  0.667  0.724  0.823  0.895  0.919
15      0.297  0.484  0.576  0.592  0.633  0.743  0.876  0.893  0.893  0.893
16      0.450  0.611  0.752  0.847  0.888  0.912  0.959  0.967  0.968  0.968
17      0.216  0.386  0.537  0.612  0.712  0.833  0.888  0.888  0.888  0.888
18      0.283  0.461  0.602  0.668  0.736  0.851  0.883  0.887  0.890  0.890
19      0.197  0.312  0.444  0.568  0.695  0.838  0.874  0.888  0.888  0.889
AVG     0.245  0.404  0.519  0.582  0.620  0.652  0.673  0.690  0.716  0.730

Figure 4 shows the average recalls at each iteration of relevance feedbacks in five cases: using DI without handling negative examples, and using DII and DIII each with and without handling negative examples. DI-pos exhibits the highest recalls because DI is smaller than DII and DIII. However, the performances of DII-pos+neg and DIII-pos+neg indicate that handling negative examples can significantly improve the retrieval.

Table 1 lists the detailed recalls for all categories of DI over the iterations of relevance feedbacks, using our proposed model without negative examples. The first row in Table 1 denotes the iteration of relevance feedback, and the last row indicates the average precisions over all image categories.

Both Figure 4 and Table 1 indicate that the retrieval performance is poor at the beginning of the retrieval. The reason is that only a few positive feedbacks are available at the beginning, and hence the training data are insufficient for accurately estimating the states. After several iterations, the efficacy of the proposed model becomes more manifest.

Figure 4: Average recalls for the three datasets. DI-pos, DII-pos, and DIII-pos: using these datasets without handling negative examples; DII-pos+neg and DIII-pos+neg: using the two datasets with handling negative examples. (The plot shows recall over iterations 1-10.)

We now discuss the experiments in detail. Figures 5 and 6 illustrate two cases that correspond to better and worse retrieval results, respectively, using DII without handling negative examples. Figure 5(a) shows some images of the categories "bus" and "butterfly," for which our proposed model produces better results, and Figure 5(b) lists the average precisions of the two categories at each iteration. Similarly, Figure 6(a) shows example images of the categories "in desert" and "snow mountain," which have worse results, and Figure 6(b) shows their average precisions.

Figure 5: Illustrations of better results using DII without handling negative examples. (a) The first and second rows are examples of the categories "bus" and "butterfly," respectively. (b) The detailed precisions of the two categories:

Cat.   t=1    t=2    t=3    t=4    t=5    t=6    t=7    t=8    t=9    t=10
Bus    0.179  0.316  0.437  0.543  0.658  0.758  0.824  0.863  0.878  0.896
Butt.  0.067  0.122  0.175  0.222  0.390  0.704  0.782  0.810  0.938  0.969

Figure 6: Illustrations of worse results using DII without handling negative examples. (a) The first and second rows are examples of the categories "in desert" and "snow mountain," respectively. (b) The detailed precisions of the two categories:

Cat.   t=1    t=2    t=3    t=4    t=5    t=6    t=7    t=8    t=9    t=10
Des.   0.057  0.090  0.118  0.151  0.178  0.190  0.193  0.194  0.194  0.194
Snow   0.048  0.090  0.116  0.146  0.151  0.170  0.180  0.186  0.188  0.188

In the better cases of Figure 5, images in the same category have the same semantic concepts but still look quite different. This shows the feasibility of using the proposed approach to model images with similar semantic concepts but diverse visual features. However, huge variations either in visual features or in semantic concepts are still very difficult to model. For example, the "snow mountain" images in Figure 6 are easily confused with those in other landscape categories.

Basically, our approach is appropriate for image retrieval with relevance feedbacks. The time sequences in the state-space model can be easily associated with the iterations of relevance feedbacks. The proposed model involves not only the likelihoods of positive images, but also the transition possibilities among concept units. However, two problems in our approach are worth solving. The first is the small number of positive examples at the beginning of the feedbacks. This is a common problem in image retrieval because no user enjoys manually assigning a huge number of positive examples in the feedback process. One method for solving this problem is to design a long-term strategy that includes all positive examples of previous query sessions as training data. The second problem is the huge variation between images in the same category. A possible method for solving this problem is to make our model more complex by embedding more information. However, this could result in overfitting, especially since we do not have much training data in relevance feedbacks. Constructing a knowledge structure such as the ontology-based approach [25,26] is promising for image retrieval if the retrieval task focuses on an application domain. After defining the transition model of the structure for the knowledge domain, our proposed model can consider both the low-level features (likelihood model) and high-level concepts (transition model) for bridging the semantic gap in image retrieval.

8. CONCLUSIONS AND FUTURE WORK

This work demonstrates the feasibility of using a state-space model to address the semantic gap in image retrieval. We design concept units, which integrate different types of visual features with different scales of image segmentation, for image representation. We also propose a state-space model for estimating the user concepts in a query session. Our approach involves both the likelihood model of positive examples and the transition model among concept units in image retrieval. Moreover, we have presented a strategy for handling negative feedbacks to refine the retrieval results.

Several future tasks are required to extend this work. The first is to define a long-term learning strategy for solving the problem of a small training set in the beginning iterations of relevance feedbacks. The second is to integrate a knowledge structure for a domain application with the transition model in our proposed approach. Moreover, the design of concept units could be revised to contain higher-level information rather than visual features. Other methods of machine learning, such as active learning or boosting, could also be integrated with the state-space model for image retrieval.

ACKNOWLEDGMENTS

This work was supported in part by the Ministry of Economic Affairs, Taiwan, under Grant 95-EC-17-A-02-S1-032 and by the Excellent Research Projects of National Taiwan University under Grant 95R0062-AE00-02.

REFERENCES

[1] R. Datta, J. Li, and J. Z. Wang, "Content-based image retrieval: approaches and trends of the new age," in Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR '05), pp. 253-262, Singapore, November 2005.
[2] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: state of the art and challenges," ACM Transactions on Multimedia Computing, Communications and Applications, vol. 2, no. 1, pp. 1-19, 2006.
[3] K. Goh, B. Li, and E. Y. Chang, "Semantics and feature discovery via confidence-based ensemble," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 1, no. 2, pp. 168-189, 2005.
[4] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: a power tool for interactive content-based image retrieval," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 5, pp. 644-655, 1998.
[5] X. S. Zhou and T. S. Huang, "Relevance feedback in image retrieval: a comprehensive review," Multimedia Systems, vol. 8, no. 6, pp. 536-544, 2003.
[6] I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, "The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments," IEEE Transactions on Image Processing, vol. 9, no. 1, pp. 20-37, 2000.
[7] Z. Su, H. Zhang, S. Li, and S. Ma, "Relevance feedback in content-based image retrieval: Bayesian framework, feature subspaces, and progressive learning," IEEE Transactions on Image Processing, vol. 12, no. 8, pp. 924-937, 2003.
[8] N. Vasconcelos and A. Lippman, "Learning from user feedback in image retrieval systems," in Proceedings of Advances in Neural Information Processing Systems (NIPS '99), pp. 977-986, Denver, Colo, USA, November-December 1999.
[9] F. Jing, M. Li, H.-J. Zhang, and B. Zhang, "An efficient and effective region-based image retrieval framework," IEEE Transactions on Image Processing, vol. 13, no. 5, pp. 699-709, 2004.
[10] K.-S. Goh, E. Y. Chang, and W.-C. Lai, "Multimodal concept-dependent active learning for image retrieval," in Proceedings of the 12th Annual ACM International Conference on Multimedia, pp. 564-571, New York, NY, USA, October 2004.
[11] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik, "Blobworld: a system for region-based image indexing and retrieval," in Proceedings of the 3rd International Conference on Visual Information and Information Systems (VISUAL '99), pp. 509-516, Amsterdam, The Netherlands, June 1999.
[12] J. Z. Wang, J. Li, and G. Wiederhold, "SIMPLIcity: semantics-sensitive integrated matching for picture libraries," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 947-963, 2001.
[13] K. Barnard and D. Forsyth, "Learning the semantics of words and pictures," in Proceedings of the 8th IEEE International Conference on Computer Vision (ICCV '01), vol. 2, pp. 408-415, Vancouver, BC, Canada, July 2001.
[14] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 2, pp. 524-531, San Diego, Calif, USA, June 2005.
[15] S. L. Feng, R. Manmatha, and V. Lavrenko, "Multiple Bernoulli relevance models for image and video annotation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 1002-1009, Washington, DC, USA, June-July 2004.
[16] J. Jeon, V. Lavrenko, and R. Manmatha, "Automatic image annotation and retrieval using cross-media relevance models," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 119-126, Toronto, Ont, Canada, July-August 2003.
[17] D. R. Heisterkamp, "Building a latent semantic index of an image database from patterns of relevance feedback," in Proceedings of the 16th International Conference on Pattern Recognition (ICPR '02), vol. 4, pp. 134-137, Quebec, Canada, August 2002.
[18] A. Shah-Hosseini and G. M. Knapp, "Learning image semantics from users relevance feedback," in Proceedings of the 12th Annual ACM International Conference on Multimedia, pp. 452-455, New York, NY, USA, October 2004.
[19] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174-188, 2002.
[20] Z. Ghahramani, "An introduction to hidden Markov models and Bayesian networks," International Journal of Pattern Recognition and Artificial Intelligence, vol. 15, no. 1, pp. 9-42, 2001.
[21] K. P. Murphy, Dynamic Bayesian networks: representation, inference and learning, Ph.D. thesis, University of California, Berkeley, Calif, USA, 2002.
[22] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, John Wiley & Sons, New York, NY, USA, 2nd edition, 2001.
[23] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, "Learning object categories from Google's image search," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1816-1823, Beijing, China, October 2005.
[24] D. Wang, "A multiscale gradient algorithm for image segmentation using watersheds," Pattern Recognition, vol. 30, no. 12, pp. 2043-2052, 1997.
[25] V. Mezaris, I. Kompatsiaris, and M. G. Strintzis, "An ontology approach to object-based image retrieval," in Proceedings of IEEE International Conference on Image Processing (ICIP '03), vol. 2, pp. 511-514, Barcelona, Spain, September 2003.
[26] M. Srikanth, J. Varner, M. Bowden, and D. I. Moldovan, "Exploiting ontologies for automatic image annotation," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05), pp. 552-558, Salvador, Brazil, August 2005.
[27] I. Atmosukarto, W. K. Leow, and Z. Huang, "Feature combination and relevance feedback for 3D model retrieval," in Proceedings of the 11th International Multimedia Modelling Conference (MMM '05), pp. 334-339, Melbourne, Australia, January 2005.
[28] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories," in Proceedings of IEEE CVPR Workshop of Generative Model Based Vision (WGMBV '04), Washington, DC, USA, June 2004.

Cheng-Chieh Chiang received his B.S. degree in applied mathematics from Tatung University, Taipei, Taiwan, in 1991, and his M.S. degree in computer science from National Chiao Tung University, Hsinchu, Taiwan, in 1993. He is currently working towards the Ph.D. degree in the Department of Information and Computer Education, National Taiwan Normal University, Taipei, Taiwan. His research interests include multimedia information indexing and retrieval, pattern recognition, machine learning, and computer vision.

Yi-Ping Hung received his B.S. degree in electrical engineering from the National Taiwan University in 1982. He received his M.S. degree from the Division of Engineering, his M.S. degree from the Division of Applied Mathematics, and his Ph.D. degree from the Division of Engineering, all at Brown University, in 1987, 1988, and 1990, respectively. He is currently a Professor in the Graduate Institute of Networking and Multimedia, and in the Department of Computer Science and Information Engineering, both at the National Taiwan University. From 1990 to 2002, he was with the Institute of Information Science, Academia Sinica, Taiwan, where he became a Tenured Research Fellow in 1997 and is now an Adjunct Research Fellow. He served as a Deputy Director of the Institute of Information Science from 1996 to 1997, and received the Young Researcher Publication Award from Academia Sinica in 1997. He has served as the Program Cochair of ACCV '00 and ICAT '00, as the Workshop Cochair of ICCV '03, and as a member of the editorial board of the International Journal of Computer Vision since 2004. His current research interests include computer vision, pattern recognition, image processing, virtual reality, multimedia, and human-computer interaction.

Greg C. Lee received his B.S. degree from Louisiana State University in 1985, and his M.S. and Ph.D. degrees from Michigan State University in 1988 and 1992, respectively, all in computer science. Since 1992, he has been with the National Taiwan Normal University, where he is currently a Professor at the Department of Computer Science and Information Engineering. His research interests are in the areas of image processing, video processing, computer vision, and computer science education. Dr. Lee is a Member of IEEE and ACM.
