Unsupervised Semantic Feature Discovery for Image Object Retrieval and Tag Refinement




Yin-Hsi Kuo, Wen-Huang Cheng, Hsuan-Tien Lin, Winston H. Hsu

Abstract—We have witnessed the exponential growth of images and videos with the prevalence of capture devices and the ease of social services such as Flickr and Facebook. Meanwhile, these enormous media collections come with rich contextual cues such as tags, geo-locations, descriptions, and time. To obtain desired images, users usually issue a query to a search engine using either an image or keywords. Therefore, the existing solutions for image retrieval rely on either the image contents (e.g., low-level features) or the surrounding texts (e.g., descriptions, tags) only. Those solutions usually suffer from low recall rates because small changes in lighting conditions, viewpoints, or occlusions, as well as missing or noisy tags, can degrade the performance significantly. In this work, we tackle the problem by leveraging both the image contents and associated textual information in the social media to approximate the semantic representations for the two modalities. We propose a general framework to augment each image with relevant semantic (visual and textual) features by using graphs among images. The framework automatically discovers relevant semantic features by propagation and selection in textual and visual image graphs in an unsupervised manner. We investigate the effectiveness of the framework when using different optimization methods for maximizing efficiency. The proposed framework can be directly applied to various applications, such as keyword-based image search, image object retrieval, and tag refinement. Experimental results confirm that the proposed framework effectively improves the performance for these emerging image retrieval applications.

Index Terms—semantic feature discovery, image graph, image object retrieval, tag refinement



Most of us are used to sharing personal photos on social services (or media) such as Flickr and Facebook. More and more users are also willing to contribute related tags or comments on the photos for photo management and social communication [1]. Such user-contributed contextual information provides promising research opportunities for understanding the images in social media. Image retrieval (either content-based or keyword-based) over large-scale photo collections is one of the key techniques for managing the exponentially growing media collections. Many applications, such as annotation by search [2][3] and geographical information estimation [4], depend on the accuracy and efficiency of content-based image retrieval (CBIR) [5][6]. Nowadays, the existing image search engines employ not only the surrounding texts but also the image contents to retrieve images (e.g., Google and Bing).

Y.-H. Kuo is with the Graduate Institute of Networking and Multimedia, National Taiwan University (e-mail: kuonini@cmlab.csie.ntu.edu.tw). W.-H. Cheng is with the Research Center for Information Technology Innovation, Academia Sinica, Taiwan (e-mail: whcheng@citi.sinica.edu.tw). H.-T. Lin is with the Department of Computer Science and Information Engineering, National Taiwan University (e-mail: htlin@csie.ntu.edu.tw). W. H. Hsu is with the Graduate Institute of Networking and Multimedia and the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 10617, Taiwan (e-mail: winston@csie.ntu.edu.tw). Prof. Hsu is the contact person.

[Figure 1 panels: (a) Object-level query image. (b) Traditional BoW model: visually similar to the query image; results at ranks 1 2 3 4 5 6 (incorrect). (c) The proposed method: visually similar but more diverse results; ranks with the BoW rank in parentheses: 1 (2), 2 (1), 3 (4), 4 (3), 5 (7863), 6 (2152), 7 (270), 8 (521), 9 (1677), 10 (1741), 11 (5805), 12 (6678).]

Fig. 1. Comparison of the image object retrieval performance of the traditional BoW model [5] and the proposed approach. (a) An example of an object-level query image. (b) The retrieval results of a BoW model, which generally suffers from a low recall rate. (c) The results of the proposed system, which obtains more accurate and diverse images with the help of automatically discovered visual features. Note that the number below each image is its rank in the retrieval results and the number in parentheses is the rank predicted by the BoW model.

For CBIR systems, the bag-of-words (BoW) model is popular and has been shown to be effective [5]. The BoW representation quantizes high-dimensional local features into discrete visual words (VWs).

However, traditional BoW-like methods fail to address issues related to noisily quantized visual features and vast variations in viewpoints, lighting conditions, occlusions, etc., commonly observed in large-scale image collections [6][7]. Thus, the methods suffer from low recall rate as shown in Figure 1(b).

Due to varying capture conditions and a large VW vocabulary (e.g., 1 million VWs), the features for the target images might have different VWs (cf. Figure 1(c)). Besides, it is also difficult to obtain these VWs through query expansion (e.g., [8]) or even varying quantization methods (e.g., [6]) because of the large differences in visual appearance between the query and the target objects.

For keyword-based image retrieval in social media, textual features such as tags are more semantically relevant than visual features. However, it is still difficult to retrieve all the target images by keywords only because users might annotate non-specific keywords such as “Travel” [9]. Meanwhile, in most photo-sharing websites, tags and other forms of text are freely entered and are not associated with any type of ontology or categorization. Tags are therefore often inaccurate, wrong or ambiguous [10].

[Figure 2 diagram: image contents and surrounding texts from social media (e.g., Flickr, Facebook) form an image database; unsupervised semantic feature discovery (Section IV) over the visual graph and the textual graph produces semantic features (e.g., AVW, refined tags); these serve various applications (e.g., image object retrieval, text-based image retrieval) that return content-based or keyword-based image retrieval results, illustrated with the query "Tower Bridge".]

Fig. 2. A system diagram of the proposed method. Based on multiple modalities such as image contents and tags from social media, we propose an unsupervised semantic feature discovery which exploits both textual and visual information. The general framework can discover semantic features (e.g., semantically related visual words and tags) in large-scale community-contributed photos. Therefore, we can apply semantic features to various applications.

In response to the above challenges for content-based and keyword-based image retrieval in social media, we propose a general framework, which integrates both visual and textual information¹, for unsupervised semantic feature discovery as shown in Figure 2. In particular, we augment each image in the image collections with semantic features—additional features that are semantically relevant to the search targets (cf. Figure 1(c))—such as specific VWs for certain landmarks or refined tags for certain scenes and events. Aiming at large-scale image collections for serving different queries, we mine the semantic features in an unsupervised manner by incorporating both visual and (noisy) textual information. We construct graphs of images from visual and textual information (if available), respectively. We then automatically propagate and select the informative semantic features across the visual and textual graphs (cf. Figure 5). The two processes are formulated as optimization problems and are applied iteratively through the subtopics in the image collections. Meanwhile, we also consider the scalability issues by leveraging distributed computation frameworks (e.g., MapReduce).

We demonstrate the effectiveness of the proposed framework by applying it to two specific tasks, i.e., image object retrieval and tag refinement. The first task—image object retrieval—is a challenging problem because the target object may cover only a small region in the database images as shown in Figure 1. We apply the semantic feature discovery framework to augment each image with auxiliary visual words (AVW). The second task is tag refinement, which augments each image with semantically related texts. Similarly, we apply the framework on the textual domain by exchanging the role of visual and textual graphs so that we can propagate (in the visual graph) and select (in the textual graph) relevant and representative tags for each image.

¹ We aim to integrate different contextual cues (e.g., visual and textual) to generate semantic (visual or textual) features for database images in the offline process. In dealing with the online query, i.e., when users issue either an image or keywords to the search engine, we can retrieve diverse search results as shown in Figure 1(c). Of course, if the query contains both image and keywords, we can utilize the two retrieval results or adopt advanced schemes like the re-ranking process for obtaining better retrieval accuracy [11][12][13].

Experiments show that the proposed method greatly improves the recall rate for image object retrieval. In particular, the unsupervised auxiliary visual words discovery greatly outperforms BoW models (by 111% relatively) and is complementary to conventional pseudo-relevance feedback. Meanwhile, AVW discovery can also derive very compact (i.e., ∼1.4% of the original features) and informative feature representations which will benefit the indexing structure [5][14]. Besides, experimental results for tag refinement show that the proposed method can improve text-based image retrieval results (by 10.7% relatively).

The primary contributions of the paper² include:

• Observing the problems in image object retrieval by the conventional BoW model (Section III).

• Proposing semantic feature discovery through visual and textual clusters in an unsupervised and scalable fashion, and deriving semantically related visual and textual features in large-scale social media (Section IV and Section VI).

• Investigating different optimization methods for efficiency and accuracy in semantic feature discovery (Section V).

• Conducting experiments on consumer photos and showing great improvement of retrieval accuracy for image object retrieval and tag refinement (Section VIII).


In order to utilize different kinds of features from social websites, we propose a general framework for semantic feature discovery through image graphs in an unsupervised manner. The semantic features can be visual words or user-provided tags. To evaluate the effect of semantic feature discovery, we apply the proposed framework to image object retrieval and tag refinement. Next, we introduce related work on these issues in the following paragraphs.

Most image object retrieval systems adopt the scale-invariant feature transform (SIFT) descriptor [16] to capture local information and adopt the bag-of-words (BoW) model [5] to conduct object matching [8][17]. The SIFT descriptors are quantized to visual words (VWs), such that indexing techniques well developed in the text domain can be directly applied.

The learned VW vocabulary directly affects the image object retrieval performance. The traditional BoW model adopts K-means clustering to generate the vocabulary. A few attempts try to impose extra information for visual word generation, such as visual constraints [18] or textual information [19]. However, they usually need extra (manual) information during the learning, which might be formidable in large-scale image collections.
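As a minimal sketch of the quantization step described above, the following NumPy code assigns each local descriptor to its nearest vocabulary center and counts the resulting visual words. The 2-D descriptors and the three-word vocabulary are made-up toy values (real systems use 128-D SIFT descriptors and a vocabulary of about 1 million VWs), and the function name `bow_histogram` is hypothetical rather than taken from the cited systems.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize local descriptors to visual words and count them.

    descriptors: (n, d) array of local features (SIFT in practice).
    vocabulary:  (k, d) array of cluster centers (e.g., from K-means).
    Returns a length-k histogram of visual-word occurrences.
    """
    # Squared Euclidean distance between every descriptor and every center.
    diff = descriptors[:, None, :] - vocabulary[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    # Each descriptor is mapped to its nearest center, i.e., its visual word.
    words = np.argmin(dist2, axis=1)
    return np.bincount(words, minlength=len(vocabulary))

# Toy data (assumed values for illustration only).
vocabulary = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.7, 5.2]])
hist = bow_histogram(descriptors, vocabulary)
print(hist)  # one descriptor on word 0, two on word 1, one on word 2
```

The brute-force distance computation is only practical for toy sizes; large vocabularies typically rely on approximate nearest-neighbor search instead.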

Instead of generating a new VW vocabulary, some studies work on the original VW vocabulary, such as [20]. It suggested selecting useful features from the neighboring images to enrich the feature description. However, its performance is limited for large-scale problems because of the need to perform spatial verification, which is computationally expensive. Moreover, it only considers neighboring images in the visual graph, which provides very limited semantic information. Other selection methods for useful features, such as [21] and [22], are based on different criteria (the number of inliers after spatial verification, and pairwise constraints for each image) and thus suffer from limited scalability and accuracy.

² Note that the preliminary results were presented in [15]. We extend the original method to a general framework and further apply it in the text domain.

[Figure 3 plot: x-axis, Images (%), from 0 to 1.2; y-axis, Visual words (%), from 0 to 100. Curves: Flickr11K (11,282 images) and Flickr550 (540,321 images). Annotation: half of the visual words occur in Flickr11K: 0.106% (12 images), Flickr550: 0.114% (617 images).]

Fig. 3. Cumulative distribution of the frequency of VW occurrence in two different image databases, cf. Section III-A. It shows that half of the VWs occur in less than 0.11% of the database images (i.e., 12 and 617 images, respectively). The statistics indicate that VWs are distributed sparsely over the database images. Note that the x-axis only shows partial values (0% to 1.2% of the images) because the cumulative distribution almost saturates (∼99%) at the 1.2% level, so we skip the remaining part (1.2% to 100%) in the figure.

Authors in [9] consider both visual and textual information and adopt unsupervised learning methods. However, they only use global features and adopt random-walk-like methods for post-processing in image retrieval. Similar limitations are observed in [23], where only the image similarity scores are propagated between textual and visual graphs. Different from the prior works, we use local features for image object retrieval and propagate features directly between the textual and visual graphs. The discovered semantic visual features are thus readily effective in retrieving diverse search results, eliminating the need to apply a random walk in the graphs again.

Similar to [9], we can also apply our general framework to augment keyword-based image retrieval by tag refinement to improve text (tag) quality for image collections. Through tag propagation and selection processes, we can annotate images and refine the original tags. Annotation by search [3] is a data-driven approach which relies on retrieving (near) duplicate images for better annotation results. The authors in [24] propose a voting-based approach to select proper tags via visually similar images. Different from annotation by search [3] and voting-based tag refinement [24], we propagate and select informative tags across images in the same image clusters. Meanwhile, the tag propagation step can also assign suitable tags to those images without any tags in the database.

[Figure 4 images and their annotations: Title: "Outline of the golden Gate Bridge", Tags: GoldenGatBridge, SanFrancisco, GoldenGatePark. Title: "Golden Gate Bridget, SFO", Tags: Golden Gate Bridge, Sun. Title: n/a, Tags: n/a.]

Fig. 4. Illustration of the roles of semantically related features in image object retrieval. Images in the blue rectangle are visually similar, whereas those images in the red dotted rectangle are textually similar. The semantic (textual) features are promising to establish the in-between connection (Section IV) to help the query image (the top-left one) retrieve the right-hand side image.

Both [25] and [26] focus on modifying the weights of the tags originally existing in the photo and only retain those reliable tags based on the voting by visually similar images.

Instead, our proposed method concentrates on obtaining more (new) semantically related tags from semantically related images. We further select those representative tags to suppress noisy or incorrect tags. [14] proposed to select the most descriptive visual words according to the TF-IDF weighting.

Different from [14], our selection process further considers similar images to retain more representative tags.



Nowadays, the bag-of-words (BoW) representation [5] is widely used in image retrieval and has shown promise in several content-based image retrieval (CBIR) tasks (e.g., [17]). However, most existing systems simply apply the BoW model without carefully considering the sparse effect of the VW space, as detailed in Section III-A. Another observation (explained in Section III-B) is that VWs merely describe visual appearances and lack the semantic descriptions needed for retrieving more diverse results (cf. Figure 1(b)). The proposed semantic feature discovery method is targeted at addressing these issues.

A. Sparseness of the Visual Words

[Figure 5 panels: (a) A visual cluster sample. (b) A textual cluster sample. (c) Visual and textual graphs, with visual clusters and textual clusters marked. (d) Extended visual clusters. (e) Extended textual clusters. Legend: images from visual clusters, images from textual clusters.]

Fig. 5. The visual cluster (a) groups visually similar images in the same cluster, whereas the textual cluster (b) favors semantic similarities. The two clusters facilitate representative feature selection and semantic feature propagation, e.g., of visual words and tags. Based on the visual and textual graphs in (c), we can propagate auxiliary features among the associated images in the extended visual or textual clusters. (d) shows the two extended visual clusters as the units for propagation; each extended visual cluster includes the visually similar images and those co-occurring with them in other textual clusters. Similarly, (e) shows three extended textual clusters, which include the semantically (by expanded tags) similar images and those co-occurring with them in other visual clusters.

For better retrieval accuracy, most systems adopt 1 million VWs for their image object retrieval system, as suggested in [17]. As mentioned in [27], one observation is the uniqueness of VWs: visual words in images usually do not appear more than once. Moreover, our statistics show that the occurrence of VWs in different images is very sparse. We calculate it on two image databases of different sizes, i.e., Flickr550 and Flickr11K (cf. Section VII-A), and obtain similar curves as shown in Figure 3. We find that half of the VWs occur in less than 0.11% of the database images and most of the VWs (i.e., around 96%) occur in less than 0.5% of them (i.e., 57 and 2702 images, respectively). That is to say, two images sharing one specific VW seldom contain similar features. In other words, those similar images might only have a few common VWs. This phenomenon is the sparseness of the VWs. It is partly due to quantization errors or noisy features. Therefore, in Section IV, we propose to augment each image with auxiliary visual words.
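The sparseness statistic discussed in Section III-A can be sketched as follows: for every VW, compute the fraction of database images in which it occurs, then ask what share of the VWs falls below a given fraction. The 5 × 4 histogram matrix is a made-up toy example, and the function names are hypothetical; the sketch only illustrates the kind of measurement behind Figure 3, not the exact procedure used for it.

```python
import numpy as np

def vw_image_fraction(X):
    """Fraction of images in which each visual word occurs.

    X: (num_images, num_vws) matrix of VW histograms.
    Words that never occur in the database are ignored.
    """
    doc_freq = (X > 0).sum(axis=0)   # number of images containing each VW
    used = doc_freq > 0
    return doc_freq[used] / X.shape[0]

def share_of_vws_below(fractions, threshold):
    """Share of VWs that occur in less than `threshold` of the images."""
    return float(np.mean(fractions < threshold))

# Toy database (assumed values): 5 images, 4 visual words.
X = np.array([
    [1, 0, 0, 2],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 3],
    [0, 0, 0, 1],
])
fractions = vw_image_fraction(X)
share = share_of_vws_below(fractions, 0.5)
print(fractions, share)
```

In this toy case three of the four VWs each appear in a single image, so most pairs of images share at most the one frequent word, which mirrors the sparseness observed above on a much smaller scale.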

B. Lacking Semantics Related Features

Since VWs are merely low-level visual features, it is very difficult to retrieve images with different viewing angles, lighting conditions, partial occlusions, etc. An example is shown in Figure 4. By using BoW models, the query image (e.g., the top-left one) can easily obtain visually similar results (e.g., the bottom-left one) but often fails to retrieve the ones in a different viewing angle (e.g., the right-hand side image).

This problem can be alleviated by taking advantage of the textual semantics. That is, by using the textual information associated with images, we are able to obtain semantically similar images as shown in the red dotted rectangle in Figure 4. If those semantically similar images can share (propagate) their VWs with each other, the query image can retrieve similar but more visually and semantically diverse results.


Based on the observations above, it is necessary to discover semantic features for each image. Unlike previous works that focus on constructing the features in one single domain, we propose a general framework for semantic feature discovery based on multiple modalities such as image contents and tags. Meanwhile, such a framework can also discover semantically related visual words and tags in large-scale community-contributed photos. In this section, we first illustrate the framework from the view of the visual domain. Then we adapt the framework for applications in the textual domain in Section VI.

As mentioned in Section III, it is important to propagate VWs to those visually or semantically similar images. We follow this intuition and propose an offline stage for unsupervised semantic feature discovery. We augment each image with auxiliary visual words (AVW)—additional and important features relevant to the target image—by considering semantically related VWs in its textual cluster and representative VWs in its visual cluster. When facing large-scale datasets, we can deploy the processes in parallel (e.g., with MapReduce [28]). Besides, AVW reduces the number of VWs to be indexed (i.e., better efficiency in time and memory [14]). Such AVWs might benefit further image queries and can greatly improve the recall rate, as demonstrated in Section VIII-A and in Figure 8. For mining AVWs, we first generate image graphs and image clusters in Section IV-A. Then, based on the image clusters, we propagate auxiliary VWs in Section IV-B and select representative VWs in Section IV-C. Finally, we combine both selection and propagation methods in Section IV-D.

A. Graph Construction and Image Clustering

The proposed framework starts by constructing a graph which embeds image similarities from the image collection. We adopt efficient algorithms to construct the large-scale image graph by MapReduce. We apply [29] to calculate the image similarity since we observe that most of the textual and visual features are sparse for each image and the correlation between images is sparse as well. We take advantage of the sparseness and use the cosine measure as the similarity measure. The measure is essentially an inner product of two feature vectors, and only the non-zero dimensions affect the similarity value, i.e., we skip the dimensions in which either feature has a zero value. To cluster images on the image graph, we apply affinity propagation (AP) [30] for graph-based clustering. AP passes and updates messages among nodes on the graph iteratively and locally, associating with the sparse neighbors only. AP’s advantages include automatically determining the number of clusters and automatically detecting an exemplar (canonical image) within each cluster.
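The sparse cosine measure described above can be sketched in a few lines. This is an illustrative single-machine version, assuming each image is stored as a dictionary from feature dimension to value; it is not the MapReduce algorithm of [29], and the name `sparse_cosine` and the toy vectors are hypothetical. The clustering step (AP) is not covered here.

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {dimension: value}."""
    # Iterate over the shorter vector. A dimension missing from either
    # vector contributes zero to the inner product, so it is skipped.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(value * b[dim] for dim, value in a.items() if dim in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy sparse features (assumed values): only non-zero dimensions are stored.
img1 = {3: 1.0, 17: 2.0, 42: 2.0}
img2 = {17: 2.0, 42: 2.0, 99: 1.0}
img3 = {5: 1.0}

sim12 = sparse_cosine(img1, img2)  # share dimensions 17 and 42
sim13 = sparse_cosine(img1, img3)  # no shared dimension
print(sim12, sim13)
```

The cost of one comparison depends on the number of non-zero entries rather than on the full dimensionality (1M VWs or 90K text tokens), which is what makes the sparse representation attractive for graph construction.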

In this work, the images are represented by 1M VWs and 90K text tokens expanded by Google snippets from their associated (noisy) tags. The image clustering results are sampled in Figure 5(a) and (b). Note that if an image is close to the canonical image (center image), it has a higher AP score, indicating that it is more strongly associated with the cluster.


B. Auxiliary Visual Word Propagation

Given the limitations of the BoW model, we propose to augment each image with additional VWs propagated from the visual and textual clusters (Figure 5(c)). Propagating the VWs from both visual and textual domains can enrich the visual descriptions of the images and be beneficial for further image object queries. For example, it is promising to derive more semantic VWs by simply exchanging the VWs among (visually diverse but semantically consistent) images of the same textual cluster (cf. Figure 5(b)).

We actually conduct the propagation on each extended visual cluster, which contains the images in a visual cluster and those additional ones co-occurring with these images in certain textual clusters. The intuition is to balance visual and semantic consistency for further VW propagation and selection (cf. Section IV-C). Figure 5(d) shows two extended visual clusters derived from Figure 5(c). More interestingly, image E has no tags and is thus singular in the textual cluster; however, E still belongs to a visual cluster and can receive AVWs in its associated extended visual cluster. Similarly, if there is a single image in a visual cluster, such as image H, it can also obtain auxiliary VWs (i.e., from images B and F) in the extended visual cluster.

Assume matrix $X \in \mathbb{R}^{N \times D}$ represents the $N$ image histograms in the extended visual cluster, where each image has $D$ (i.e., 1 million) dimensions. Let $X_i$ be the VW histogram of image $i$ and assume $M$ among the $N$ images are from the same visual cluster. For example, $N = 8$ and $M = 4$ in the left extended visual cluster in Figure 5(d). The visual propagation is conducted by the propagation matrix $P \in \mathbb{R}^{M \times N}$, which controls the contributions from different images in the extended visual cluster. $P(i, j)$ weights the whole features propagated from image $j$ to $i$. If we multiply the propagation matrix $P$ and $X$ (i.e., $PX$), we obtain $M \times D$ new VW histograms, which serve as the AVWs. Each row of $PX$ represents the new VW histogram of one image, augmented by the $N$ images.

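For each extended visual cluster, we desire to find a better propagation matrix $P$, given the initial propagation matrix $P_0$ (i.e., $P_0(i, j) = 1$, if both $i$ and $j$ are semantically related and within the same textual cluster). Here is an example of an initial propagation matrix $P_0$,

$$
P_0 =
\begin{array}{c|cccccccc}
  & A & B & C & D & E & F & G & H \\ \hline
A & 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\
C & 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\
E & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
F & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 1
\end{array}
$$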

Each row represents the relationship between the image and its semantically similar images (i.e., in the same textual cluster). For example, image A (the first row) is related to images A, C, D and G as shown in Figure 5(c). Note that we can also modify the weights in $P_0$ based on the similarity score or AP score. We propose to formulate the propagation operation as

$$
f_P = \min_{P} \; \alpha \frac{\|PX\|_F^2}{N_{P1}} + (1-\alpha) \frac{\|P - P_0\|_F^2}{N_{P2}}. \tag{1}
$$

[Figure 6 panels: (a) Common VWs selection, showing the histograms $X_i$ and the selection $S$ (weight on each dimension). (b) Two examples.]

Fig. 6. Illustration of the selection operation for auxiliary visual words. The VWs should be similar in the same visual cluster; therefore, we select those representative visual features (red rectangle). (b) illustrates the importance (or representativeness) for different VWs. And we can further remove some noisy features (less representative) which appeared on the people or boat. The similar idea can be used to select informative tags from the noisy ones for each image.

The goal of the first term is to avoid propagating too many VWs (i.e., propagating conservatively), since $PX$ becomes the new VW histogram matrix after the propagation. The second term keeps $P$ similar to the original propagation matrix (i.e., similar in the textual cluster). Here $\|\cdot\|_F$ stands for the Frobenius norm. $N_{P1} = \|P_0 X\|_F^2$ and $N_{P2} = \|P_0\|_F^2$ are two normalization terms, and $\alpha$ modulates the importance between the first and the second terms. We will investigate the effects of $\alpha$ in Section VIII-C. Note that the propagation process updates the propagation matrix $P$ on each extended visual cluster separately, as shown in Figure 5(d); therefore, this method is scalable to large-scale datasets and easy to run in parallel.
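To make the propagation formulation concrete, the sketch below builds the example $P_0$ (rows A, C, E, F; columns A to H), applies it to a small histogram matrix, and evaluates the objective of Eq. (1). The 8 × 4 matrix `X` is a made-up toy example (the real $D$ is about 1 million), and the function name `propagation_objective` is hypothetical.

```python
import numpy as np

def propagation_objective(P, P0, X, alpha):
    """Value of the objective in Eq. (1) for a given propagation matrix P."""
    n_p1 = np.linalg.norm(P0 @ X, "fro") ** 2   # normalization N_P1
    n_p2 = np.linalg.norm(P0, "fro") ** 2       # normalization N_P2
    term1 = np.linalg.norm(P @ X, "fro") ** 2 / n_p1
    term2 = np.linalg.norm(P - P0, "fro") ** 2 / n_p2
    return alpha * term1 + (1 - alpha) * term2

# Initial propagation matrix from the example: rows A, C, E, F; columns A..H.
P0 = np.array([
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1],
], dtype=float)

# Toy VW histograms for images A..H over D = 4 visual words (assumed values).
X = np.array([
    [1, 0, 0, 0],  # A
    [0, 1, 0, 0],  # B
    [1, 0, 1, 0],  # C
    [0, 0, 1, 0],  # D
    [0, 0, 0, 1],  # E
    [0, 1, 0, 0],  # F
    [1, 0, 0, 0],  # G
    [0, 1, 0, 1],  # H
], dtype=float)

avw = P0 @ X  # row A sums the histograms of A, C, D and G
print(avw)
print(propagation_objective(P0, P0, X, alpha=0.3))
print(propagation_objective(np.zeros_like(P0), P0, X, alpha=0.3))
```

With the normalization terms, the objective equals $\alpha$ at $P = P_0$ and $1 - \alpha$ at $P = 0$, so $\alpha$ directly trades off keeping the textual relations against propagating conservatively.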

C. Common Visual Word Selection

Though the propagation operation is important to obtain different VWs, it may include too many VWs and thus decrease the precision. To mitigate this effect and remove those irrelevant or noisy VWs, we propose to select those representative VWs in each visual cluster. We observe that images in the same visual cluster are visually similar to each other (cf. Figure 5(a)); therefore, the selection operation is to retain those representative VWs in each visual cluster.

As shown in Figure 6(a), $X_i$ ($X_j$) represents the VW histogram of image $i$ ($j$) and the selection $S$ indicates the weight on each dimension. So $XS$ indicates the total number of features retained after the selection. The goal of selection is to keep those common VWs in the same visual cluster (cf. Figure 6(b)). That is to say, if $S$ puts more emphasis on those common (representative) VWs, $XS$ will be relatively large. Then the selection operation can be formulated as

$$
f_S = \min_{S} \; \beta \frac{\|XS_0 - XS\|_F^2}{N_{S1}} + (1-\beta) \frac{\|S\|_F^2}{N_{S2}}. \tag{2}
$$

The second term is to reduce the number of selected features in the visual clusters. The selection is expected to be compact but should not incur too much distortion from the original features in the visual clusters; this is regularized by the first term, which measures the difference in feature numbers before ($S_0$) and after ($S$) the selection process. Note that $S_0$ is assigned ones, which means we select all the dimensions. $N_{S1} = \|XS_0\|_F^2$ and $N_{S2} = \|S_0\|_F^2$ are the normalization terms, and $\beta$ balances the influence of the first and the second terms; it will be investigated in Section VIII-C.

D. Iteration of Propagation and Selection

The propagation and selection operations described above can be performed iteratively. The propagation operation obtains semantically relevant VWs to improve the recall rate, whereas the selection operation removes visually irrelevant VWs and improves memory usage and efficiency. An empirical combination of propagation and selection methods is reported in Section VIII-A.


In this section, we study the solvers for the two formulations above (Eq. (1) and (2)). Before we start, note that the two formulations are very similar. In particular, letting $\tilde{S} = S - S_0$, the selection formulation (2) is equivalent to

$$
\min_{\tilde{S}} \; \beta \frac{\|X\tilde{S}\|_F^2}{N_{S1}} + (1-\beta) \frac{\|\tilde{S} + S_0\|_F^2}{N_{S2}}. \tag{3}
$$

Given the similarity between Eq. (1) and (3), we can focus on solving the former and then applying the same technique on the latter.

A. Convexity of the Formulations

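We shall start by computing the gradient and the Hessian of Eq. (1) with respect to the propagation matrix $P$. Consider the $M$ by $N$ matrices $P$ and $P_0$. We can first stack the columns of the matrices to form two vectors $p = \mathrm{vec}(P)$ and $p_0 = \mathrm{vec}(P_0)$, each of length $MN$. Then, we replace $\mathrm{vec}(PX)$ with $(X^T \otimes I_M)p$, where $I_M$ is an identity matrix of size $M$ and $\otimes$ is the Kronecker product. Let $\alpha_1 = \frac{\alpha}{N_{P1}} > 0$ and $\alpha_2 = \frac{1-\alpha}{N_{P2}} > 0$; the objective function of Eq. (1) becomes

$$
\begin{aligned}
f(p) &= \alpha_1 \|(X^T \otimes I_M)p\|_2^2 + \alpha_2 \|p - p_0\|_2^2 \\
     &= \alpha_1 p^T (X \otimes I_M)(X^T \otimes I_M) p + \alpha_2 (p - p_0)^T (p - p_0).
\end{aligned}
$$

Thus, the gradient and the Hessian are

$$
\nabla_p f(p) = 2\left(\alpha_1 (X \otimes I_M)(X^T \otimes I_M) p + \alpha_2 (p - p_0)\right), \tag{4}
$$

$$
\nabla_p^2 f(p) = 2\left(\alpha_1 (X \otimes I_M)(X^T \otimes I_M) + \alpha_2 I_{MN}\right). \tag{5}
$$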

Note that the Hessian (Eq. (5)) is a constant matrix. The first term of the Hessian is positive semi-definite, and the second term is positive definite because $\alpha_2 > 0$. Thus, Eq. (1) is strictly convex and enjoys a unique optimal solution.
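The vectorization identity and the positive definiteness of the Hessian can be checked numerically on a small random instance. The sizes, the random data, and the values of $\alpha_1$ and $\alpha_2$ below are arbitrary choices for illustration; `vec` stacks columns, matching the convention used above.

```python
import numpy as np

def vec(A):
    """Stack the columns of a matrix into one vector."""
    return A.reshape(-1, order="F")

rng = np.random.default_rng(0)
M, N, D = 2, 3, 4                      # toy sizes (assumed)
P = rng.standard_normal((M, N))
X = rng.standard_normal((N, D))

# vec(PX) = (X^T kron I_M) vec(P)
K = np.kron(X.T, np.eye(M))            # shape (D*M, N*M)
lhs = vec(P @ X)
rhs = K @ vec(P)

# Hessian of Eq. (5); K.T equals (X kron I_M).
alpha1, alpha2 = 0.3, 0.7              # arbitrary positive weights
H = 2 * (alpha1 * K.T @ K + alpha2 * np.eye(M * N))
min_eig = np.linalg.eigvalsh(H).min()

print(np.allclose(lhs, rhs))
print(min_eig >= 2 * alpha2 - 1e-9)    # eigenvalues are bounded below by 2 * alpha2
```

Because the first Hessian term is positive semi-definite, every eigenvalue of $H$ is at least $2\alpha_2$, which is the numerical counterpart of the strict convexity argument.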

From the analysis above, we see that Eq. (1) and (2) are strictly convex, unconstrained quadratic programming problems. Thus, any quadratic programming solver can be used to find their optimal solutions. Next, we study two specific solvers: the gradient descent solver which iteratively updates p and can easily scale up to large problems; the analytic one which obtains the optimal p by solving a linear equation and reveals a connection with the Tikhonov regularization technique in statistics and machine learning.

B. Gradient Descent Solver (GD)

The gradient descent solver optimizes Eq. (1) by starting from an arbitrary vector $p_{\text{start}}$ and iteratively updating the vector by

$$
p_{\text{new}} \leftarrow p_{\text{old}} - \eta \nabla_p f(p_{\text{old}}),
$$

where a small $\eta > 0$ is called the learning rate. We can then use Eq. (4) to compute the gradient for the updates. Nevertheless, computing $(X \otimes I_M)(X^T \otimes I_M)$ may be unnecessarily time- and memory-consuming. We can re-arrange the matrices and get

$$
(X \otimes I_M)(X^T \otimes I_M)p = (X \otimes I_M)\mathrm{vec}(PX) = \mathrm{vec}(PXX^T).
$$

Then,

$$
\begin{aligned}
\nabla_p f(p) &= 2\alpha_1 \mathrm{vec}(PXX^T) + 2\alpha_2 \mathrm{vec}(P - P_0) \\
              &= \mathrm{vec}\left(2\alpha_1 PXX^T + 2\alpha_2 (P - P_0)\right).
\end{aligned}
$$

That is, we can update $p_{\text{old}}$ as a matrix $P_{\text{old}}$ with the gradient also represented in its matrix form. Coupling the update scheme with an adaptive learning rate $\eta$, we update the propagation matrix by

$$
P_{\text{new}} = P_{\text{old}} - 2\eta\left(\alpha_1 P_{\text{old}}XX^T + \alpha_2 (P_{\text{old}} - P_0)\right). \tag{6}
$$

Note that we simply initialize $p_{\text{start}}$ to $\mathrm{vec}(P_0)$.
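A minimal sketch of the matrix-form update in Eq. (6) follows. It makes one simplifying assumption that differs from the text: instead of an adaptive learning rate, it uses a fixed step $\eta = 1/L$, where $L$ is the largest eigenvalue of the Hessian. The toy $X$ and $P_0$ and the function name `gd_propagation` are hypothetical.

```python
import numpy as np

def gd_propagation(X, P0, alpha, iters=500):
    """Gradient descent for Eq. (1) using the matrix-form update of Eq. (6)."""
    n_p1 = np.linalg.norm(P0 @ X, "fro") ** 2
    n_p2 = np.linalg.norm(P0, "fro") ** 2
    a1, a2 = alpha / n_p1, (1 - alpha) / n_p2
    G = X @ X.T                                   # N x N, computed once
    # Assumption: fixed step 1/L instead of an adaptive learning rate.
    L = 2 * (a1 * np.linalg.eigvalsh(G)[-1] + a2)
    eta = 1.0 / L
    P = P0.copy()                                 # initialize at P0
    for _ in range(iters):
        P = P - 2 * eta * (a1 * P @ G + a2 * (P - P0))
    return P

# Toy extended visual cluster (assumed values): 8 images, 4 visual words.
P0 = np.array([
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1],
], dtype=float)
X = np.array([
    [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 0],
    [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 1],
], dtype=float)

P = gd_propagation(X, P0, alpha=0.5)
print(P.shape)
```

Only the $N \times N$ matrix $XX^T$ is ever formed, so the Kronecker products of Eq. (4) never need to be materialized.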

For the selection formulation (Section IV-C), we can adopt similar steps with two changes. Let $\beta_1 = \frac{\beta}{N_{S1}}$ and $\beta_2 = \frac{1-\beta}{N_{S2}}$. First, Eq. (6) is replaced with

$$
S_{\text{new}} = S_{\text{old}} - 2\eta\left(-\beta_1 X^TX(S_0 - S_{\text{old}}) + \beta_2 S_{\text{old}}\right). \tag{7}
$$

Second, the initial point $S_{\text{start}}$ is set to a zero matrix since the goal of the selection formulation is to select representative visual words (i.e., retain a few dimensions).

There is one potential caveat of directly using Eq. (7) for updating. The matrix $X^T X$ can be huge (e.g., 1M × 1M). To speed up the computation, we can keep only the dimensions that occur in the same visual cluster, because the other dimensions contribute 0 to $X^T X$.
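One way to realize this speed-up is sketched below. It is an assumption-laden illustration: "dimensions that occur in the cluster" is read here as the visual-word columns of $X$ with at least one nonzero entry, and the function name `select_gd`, the fixed step `eta`, and the iteration count are hypothetical. Rows of $S$ belonging to unused visual words receive a zero data gradient and, starting from the zero matrix, stay at zero, so they can be skipped.

```python
import numpy as np

def select_gd(X, S0, beta1, beta2, eta, n_iter=200):
    """Gradient descent for the selection update, using only active dimensions."""
    S0 = np.asarray(S0, dtype=float)
    active = np.flatnonzero(np.abs(X).sum(axis=0) > 0)  # visual words present in the cluster
    Xa = X[:, active]                                    # N x d, with d much smaller than D
    S0a = S0[active, :]                                  # matching rows of S0
    Sa = np.zeros_like(S0a)                              # S_start = 0, as in the text
    for _ in range(n_iter):
        # X^T X (S0 - S), evaluated right to left so X^T X is never formed
        resid = Xa.T @ (Xa @ (S0a - Sa))
        Sa = Sa - 2.0 * eta * (-beta1 * resid + beta2 * Sa)
    S = np.zeros_like(S0)
    S[active, :] = Sa                                    # inactive rows remain zero
    return S
```

On a small example the result matches the update that uses the full $X^T X$, while the per-iteration cost depends on the number of active visual words rather than on the vocabulary size.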

C. Analytic Solver (AS)

Next, we compute the unique optimal solution $p^*$ of Eq. (1) analytically. The optimal solution must satisfy $\nabla_p f(p^*) = 0$. Note that, from Eq. (4),

$$\nabla_p f(p^*) = Hp^* - 2\alpha_2 p_0,$$

where $H$ is the constant and positive definite Hessian matrix. Thus,

$$p^* = 2\alpha_2 H^{-1} p_0.$$

Similar to the derivation in the gradient descent solver, we can write down the matrix form of the solution, which is

$$P^* = \alpha_2 P_0 (\alpha_1 XX^T + \alpha_2 I_M)^{-1}.$$
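As a rough illustration, the matrix-form solution can be computed with a single linear solve instead of an explicit inverse. The function name `propagate_analytic` is hypothetical, and $X$ is again assumed to have one row per image so that $XX^T$ is $M \times M$.

```python
import numpy as np

def propagate_analytic(X, P0, alpha1, alpha2):
    """Closed form P* = alpha2 * P0 (alpha1 X X^T + alpha2 I)^{-1}."""
    M = X.shape[0]
    A = alpha1 * (X @ X.T) + alpha2 * np.eye(M)   # symmetric positive definite
    # P* A = alpha2 P0 is equivalent to A P*^T = alpha2 P0^T because A is symmetric.
    return alpha2 * np.linalg.solve(A, np.asarray(P0, dtype=float).T).T
```

A quick sanity check is that the matrix-form gradient $2\alpha_1 P^* XX^T + 2\alpha_2 (P^* - P_0)$ vanishes, up to floating-point error, at the returned matrix.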

For the selection formulation, a direct solution from the steps above would lead to

$$S^* = \beta_1 (\beta_1 X^T X + \beta_2 I_D)^{-1} X^T X S_0. \tag{8}$$


Nevertheless, as mentioned in the previous subsection, the $X^T X$ matrix in Eq. (8) can be huge (e.g., 1M × 1M), and computing the inverse of a 1M × 1M matrix is time-consuming. Thus, instead of working with $X^T X$ directly, we transform it to $XX^T$, which is $N \times N$ and much smaller (e.g., 100 × 100). The transformation is based on the matrix inverse identity

$$(A + BB^T)^{-1} B = A^{-1} B (I + B^T A^{-1} B)^{-1}.$$

Then, we can re-write Eq. (8) as

$$S^* = \beta_1 X^T (\beta_1 XX^T + \beta_2 I_N)^{-1} X S_0. \tag{9}$$

Note that the analytic solutions of Eq. (1) and (2) have a form similar to the solutions of ridge regression (Tikhonov regularization) in statistics and machine learning. This is no coincidence. Generally speaking, we seek to obtain some parameters ($P$ and $S$) from some data ($X$, $P_0$, and $S_0$) while regularizing by the norm of the parameters. The regularization not only ensures the strict convexity of the optimization problem, but also eases the hazard of overfitting with a suitable choice of $\alpha$ and $\beta$.
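The equivalence of the two closed forms for the selection matrix can be checked numerically on a small problem. The sketch below is only an illustration: both function names are hypothetical, and the $D \times D$ version is included purely as a reference for small $D$.

```python
import numpy as np

def select_analytic_dual(X, S0, beta1, beta2):
    """S* = beta1 X^T (beta1 X X^T + beta2 I_N)^{-1} X S0, using an N x N system."""
    N = X.shape[0]
    K = beta1 * (X @ X.T) + beta2 * np.eye(N)
    return beta1 * X.T @ np.linalg.solve(K, X @ S0)

def select_analytic_primal(X, S0, beta1, beta2):
    """S* = beta1 (beta1 X^T X + beta2 I_D)^{-1} X^T X S0; practical only for small D."""
    D = X.shape[1]
    A = beta1 * (X.T @ X) + beta2 * np.eye(D)
    return beta1 * np.linalg.solve(A, X.T @ (X @ S0))
```

With $N$ images and $D$ visual words, the first function solves an $N \times N$ system and the second a $D \times D$ one, which is why the $XX^T$ form is preferable when $D$ is in the millions and $N$ is around a hundred.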


Textual features are generally semantically richer than visual features. However, tags (or photo descriptions) are often missing, inaccurate, or ambiguous when annotated by amateurs [10]; e.g., the tag “honeymoon” may be added to all images of a newly married couple’s trip. Traditional keyword-based image retrieval systems are thus limited in retrieving photos with noisy or missing textual descriptions. Hence, there is a strong need for effective image annotation and tag refinement.

To tackle this problem, most recent research focuses on annotation by search [2][3] or on discovering relevant tags from the votes of visually similar images [24][25]. These previous works rely solely on one feature modality to improve the tag quality. In this work, we further propose to annotate and refine tags by jointly leveraging the visual and textual information.

In Section IV, we propose a framework for semantic feature discovery, where we utilize the image graphs to propagate and select auxiliary visual words starting from the images’ textual relations, which introduces more diverse but semantically relevant visual features. In this section, we show that the proposed framework is general and can be extended to tag refinement and photo annotation by exchanging the roles of the visual and textual graphs. That is, starting from the visual graph, we propagate and then select representative tags in the textual graph. We introduce tag propagation in Section VI-A and representative tag selection, where we further consider the sparsity of tags, in Section VI-B. Note that we apply our proposed method on the same image graphs constructed in Section IV-A.

A. Tag Propagation

In order to obtain more semantically relevant tags for each image, we propose to propagate tags through its visually similar images. We will then remove noisy tags and preserve representative ones in Section VI-B. Following the auxiliary feature propagation in Section IV-B, we construct the extended textual cluster to propagate relevant tags. As shown in Figure 5(e), we conduct the propagation on each extended textual cluster, which contains the images in a textual cluster and those additional ones co-occurring with any of these images in certain visual clusters (in the image graph).

To find a proper propagation matrix for each extended textual cluster, we can adopt the same formulation as in Section IV-B. That is, we can directly apply Eq. (1) to propagate related tags on the extended textual clusters. The advantages discussed in Section IV-B also apply to the textual domain. For example, as shown in Figure 5(c), image E has no tags and thus is singular in the textual cluster. However, through tag propagation, image E can obtain some related tags from images A, C, and F (cf. Figure 5(e)). Note that this process is similar to image annotation. In the same way, although image H is singular in the visual cluster, we can still propagate related tags to image H through the extended textual cluster. For example, an image might obtain different tags such as “Tower bridge,” “London,” or “Travel.”
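The effect on an untagged image can be illustrated with a toy example. Everything below is hypothetical: the tag assignments, the propagation weights, and the assumption that propagated tag scores are obtained by multiplying a propagation matrix $P$ with the image-by-tag matrix, in analogy with the term $PX$ in Eq. (1). In practice $P$ would come from solving Eq. (1), not from hand-picked values.

```python
import numpy as np

tags = ["Tower bridge", "London", "Travel"]
images = ["A", "C", "E", "F"]

# Rows are images, columns are tags. Image E starts without any tags.
T = np.array([
    [1.0, 1.0, 0.0],   # A
    [1.0, 0.0, 1.0],   # C
    [0.0, 0.0, 0.0],   # E
    [0.0, 1.0, 0.0],   # F
])

# Hypothetical weights: P[i, j] is how much image i receives from image j.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.4, 0.4, 0.0, 0.2],   # E borrows tags from A, C, and F
    [0.0, 0.0, 0.0, 1.0],
])

T_new = P @ T                                  # propagated tag scores
e_scores = dict(zip(tags, T_new[images.index("E")]))
```

After propagation, image E holds nonzero scores for all three tags, with the largest score on the tag shared by most of its neighbors.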

B. Tag Selection and Sparsity of Tags

After the tag propagation step, each image can obtain a more diverse set of tags. However, some of them may be incorrect. Similar to visual feature selection in Section IV-C, we propose to retain important (representative) tags and suppress the incorrect ones. To select important tags for each image, we can directly adopt the same selection formulation (Eq. (2)) as in Section IV-C. Following Eq. (2), we select representative tags in each textual cluster, since images in the same textual cluster are semantically similar to each other. For example, in Figure 5(b), the more specific tag, “Tower bridge,” would have a higher score than a general one, “London.”

Through tag selection, we can highlight the representative tags and reject the noisy ones. However, as the system converges, we observed that each image tends to have many tags with very small confidence scores, so an ad hoc thresholding step is required to cut those low-confidence (probably noisy) tags. Meanwhile, users usually care about a few important (representative) tags for each image rather than a large number of tags. Thus, we need to further consider the sparsity of the selected tags. We do so by changing the original regularization term from the L2-norm to the L1-norm. That is, the objective function of Eq. (2) is adjusted as

$$f_{SS} = \min_S \|XS_0 - XS\|_F^2 + \lambda \|S\|_1, \tag{10}$$

where $\lambda$ is a regularization parameter. Since the L1-norm regularization term is non-differentiable, we cannot obtain an analytic solution directly. However, recent research has provided solutions for this problem [31][32], and we can derive the solution by way of [33] or SPAMS (SPArse Modeling Software).$^{3}$
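To make the L1-regularized objective concrete, the sketch below minimizes it with plain proximal gradient descent (ISTA). This is a generic textbook method substituted here for illustration; it is not the solver of [31]–[33] and does not use SPAMS. The function names, the fixed step size, and the iteration count are assumptions.

```python
import numpy as np

def soft_threshold(Z, t):
    """Element-wise proximal operator of t * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def select_l1_ista(X, S0, lam, n_iter=500):
    """Minimize ||X S0 - X S||_F^2 + lam * ||S||_1 by proximal gradient steps."""
    XS0 = X @ S0
    L = 2.0 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the smooth term's gradient
    S = np.zeros_like(S0, dtype=float)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ S - XS0)     # gradient of the squared-error term only
        S = soft_threshold(S - grad / L, lam / L)
    return S
```

The soft-thresholding step sets small entries of $S$ exactly to zero, which removes the need for the ad hoc thresholding described above; a larger $\lambda$ yields fewer retained tags.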






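TABLE I

| Solver | Iteration order | MAP | MAP after PRF (gain over BoW baseline) | #Features |
|---|---|---|---|---|
| Gradient descent solver (GD) | Propagation → Selection (propagation first) | 0.375 | 0.516 (+110.6%) | 0.3M |
| Gradient descent solver (GD) | Selection → Propagation (selection first) | 0.342 | 0.497 (+102.9%) | 0.2M |
| Analytic solver (AS) | Propagation → Selection (propagation first) | 0.384 | 0.483 (+97.1%) | 5.2M |
| Analytic solver (AS) | Selection → Propagation (selection first) | 0.377 | 0.460 (+87.8%) | 13.0M |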



Fig. 7. Query examples in the Flickr11K dataset used for evaluating image object retrieval and text-based image retrieval. The query objects are enclosed by the blue rectangles and the corresponding query keywords are listed below each object image.


A. Dataset

We use Flickr550$^{4}$ [34] as our main dataset in the experiments. To evaluate the proposed approach, we select 56 query images (1282 ground truth images) which belong to the following 7 query categories: Colosseum, Eiffel Tower (Eiffel), Golden Gate Bridge (Golden), Leaning tower of Pisa (Pisa), Starbucks logo (Starbucks), Tower Bridge (Tower), and Arc de Triomphe (Triomphe). Also, we randomly pick 10,000 images from Flickr550 to form a smaller subset called Flickr11K.$^{5}$ Some query examples are shown in Figure 7.

B. Performance Metrics

In the experiments, we use the average precision, a performance metric commonly used in the previous work [17], [34], to evaluate the retrieval accuracy. It approximates the area under a non-interpolated precision-recall curve for a query. A higher average precision indicates better retrieval accuracy. Since average precision only shows the performance for a single image query, we also compute the mean average precision (MAP) over all the queries to evaluate the overall system performance.
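For reference, a common way to compute non-interpolated average precision and MAP from ranked relevance labels is sketched below. The function names and the optional `n_relevant` argument are illustrative choices, and the exact evaluation script used in the experiments may differ in details such as how ground-truth images missing from the ranked list are handled.

```python
def average_precision(ranked_relevance, n_relevant=None):
    """Non-interpolated AP for one query.

    ranked_relevance: 0/1 labels of the returned images, best rank first.
    n_relevant: total number of ground-truth images; defaults to the hits found.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank      # precision at each relevant rank
    total = n_relevant if n_relevant is not None else hits
    return precision_sum / total if total else 0.0

def mean_average_precision(per_query_relevance):
    """MAP: the mean of the per-query average precisions."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)
```

For a ranked list with relevance labels `[1, 0, 1, 0]`, the precisions at the two relevant ranks are 1/1 and 2/3, so the average precision is their mean.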

C. Evaluation Protocols

As suggested by the previous work [17], our image object retrieval system adopts 1 million visual words as the basic vocabulary. The retrieval is then conducted by comparing (indexing) the AVW features for each database image. To further improve the recall rate of the retrieval results, we apply the query expansion technique of pseudo-relevance feedback (PRF) [8], which expands the image query set by taking the top-ranked results as the new query images. This step also helps us understand the impact of the discovered AVWs, because in our system the ranking of retrieved images is related to the associated auxiliary visual words. They are the key for our system to retrieve more diverse and accurate images, as shown in Figure 8 and Section VIII-A. We take the L1 distance as our baseline for the BoW model [5]. The MAP for the baseline is 0.245 with 22M (million) feature points, and the MAP after PRF is 0.297 (+21.2%).
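The two retrieval steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: images are dense feature vectors, the expanded query is the average of the original query and its top-ranked results (one common PRF variant, which may differ from the scheme in [8]), and the function names and `top_k` parameter are hypothetical. A real system would use an inverted file rather than dense arrays.

```python
import numpy as np

def rank_by_l1(query, database):
    """Return database indices sorted by increasing L1 distance to the query."""
    dists = np.abs(database - query).sum(axis=1)
    return np.argsort(dists, kind="stable")

def retrieve_with_prf(query, database, top_k=3):
    """First-pass ranking, then re-ranking with a query expanded by the top results."""
    first_pass = rank_by_l1(query, database)
    # Assumption: average the query with its top_k results to form the new query.
    expanded = np.vstack([query, database[first_pass[:top_k]]]).mean(axis=0)
    return rank_by_l1(expanded, database)
```

Because the expanded query absorbs features from the top-ranked images, any auxiliary visual words attached to those images directly influence the second-pass ranking.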

For evaluating tag refinement, we use text-based image retrieval to evaluate the overall tag quality. We also include semantic queries in the text-based image retrieval tasks. We use the following keywords as the queries for the 12 categories: Colosseum, Eiffel tower, Golden gate bridge, Leaning tower of Pisa, Starbucks, Tower bridge, Arc de Triomphe, Beach, Football, Horse, Louvre, and Park. Note that we use the same ground truth images as in content-based image retrieval for evaluation.


In this section, we conduct experiments on the proposed framework, unsupervised semantic feature discovery. Since we target a general framework that serves different applications, we first apply the proposed method to the visual domain for image object retrieval in Section VIII-A and then to the textual domain for tag refinement (by keyword-based retrieval and annotation) in Section VIII-B. Moreover, in Section VIII-C, we investigate the impact of different parameters in the formulations.

A. The Performance of Auxiliary Visual Words

The overall retrieval accuracy is listed in Table I. As mentioned in Section IV-D, we can iteratively update the features according to Eq. (1) and (2). The results show that the iteration with propagation first (propagation → selection) leads to the best results. This is because the first propagation shares all the VWs with related images, and the selection then chooses the common VWs as representative VWs. However, if we do the iteration with selection first (i.e., selection → propagation), we might lose some possible VWs after the first selection.

Experimental results show that we only need one or two iterations to achieve better results because the informative and representative VWs have been propagated or selected in the early iteration steps. Besides, the number of features is significantly reduced from 22.2M to 0.3M (only 1.4% retained), which is essential for indexing those features with an inverted file structure [5][14], because the required memory size for indexing is proportional to the number of features.

[Figure columns: query image, followed by results ranked 1 to 10.]

Fig. 8. More search results by auxiliary VWs. The number represents the retrieval ranking. The results show that the proposed AVW method, though conducted in an unsupervised manner in the image collections, can retrieve more diverse and semantically related results.

To obtain a timely solution with the gradient descent solver, we set a loose convergence criterion for both the propagation and selection operations. Therefore, the solutions of the two solvers might differ. Nevertheless, Table I still shows that the retrieval accuracies of the two solvers are very similar.

The learning time for the first propagation is 2720s (GD) and 123s (AS), whereas the first selection needs 1468s and 895s for GD and AS, respectively. Here we fixed α = 0.5 and β = 0.5 to evaluate the learning time.$^{6}$ With the analytic solver, we obtain a direct solution much faster than with the gradient descent method. Note that the number of features affects the running time directly; therefore, in the remaining iteration steps, the time required decreases further since the number of features is greatly reduced at each iteration. Meanwhile, only a very small portion of the visual features is retained.

Besides, we find that the proposed AVW method is complementary to PRF, since we obtain another significant improvement after conducting PRF on the AVW retrieval results. For example, the MAP of AVW is 0.375, and it rises to 0.516 (+37.6%) after applying PRF. This relative improvement is much higher than that of PRF over the traditional BoW model (i.e., 0.245 to 0.297, +21.2%). More retrieval results by AVW + PRF are illustrated in Figure 8, which shows that the proposed AVW method can even retrieve semantically consistent but visually diverse images. Note that the AVW is conducted in an unsupervised manner in the image collections and requires no manual labels.

$^{6}$The learning time is evaluated in MATLAB on a regular Linux server with an Intel CPU and 16 GB of RAM.

[Bar chart: average precision (0 to 1) for each query category: Colosseum, Eiffel, Golden, Pisa, Starbucks, Tower, Triomphe, and MAP.]

Fig. 9. Performance breakdown with auxiliary VWs (AVW) and PRF for image object retrieval. Consistent improvements across queries are observed. The rightmost group is the average performance across the seven queries (by MAP).

Figure 9 shows the performance breakdown for the seven queries. The combination of AVW and PRF consistently improves the performance across all query categories. In particular, the proposed method works well for small objects such as “Starbucks logo,” whereas the combination of BoW and PRF improves the retrieval accuracy only marginally. Besides, it is worth noting that the proposed method achieves large improvements for the “Tower bridge” query, although the ground-truth images of “Tower bridge” usually have various lighting conditions and viewpoint changes, as shown in Figure 5(b) and the fourth row of Figure 8.

B. The Performance of Tag Refinement

For the tag refinement task introduced in Section VI, we employed text-based image retrieval to evaluate the MAP by using predefined queries as mentioned in Section VII.

The goal is to evaluate the overall tag quality before and after tag refinement in the image collection. The overall retrieval accuracy is shown in Table II. It shows that our proposed method (Propagation + Selection) in general achieves better retrieval accuracy (+10.7%) because the tag propagation process obtains more semantically related tags and the tag selection process further preserves representative ones. However, the accuracy might degrade for some queries after tag refinement. For example, the “Starbucks” query does not gain from the proposed method because “Starbucks” images in the visual cluster tend to have more semantically diverse tags, as small objects do not necessarily correlate with semantically and visually similar images. In addition, incorrect (noisy) tags might be kept through the tag propagation process. Although the tag selection mechanism can help to alleviate this problem, it sometimes degrades the retrieval accuracy due to the loss of some important tags. For example, the “Triomphe” query obtains higher retrieval accuracy right after tag propagation (0.729) but decreases slightly after selection (0.701).

TABLE II

| Query | Original tags | Voting-based method [24] | Propagation + Selection | Voting + Ours |
|---|---|---|---|---|
| Colosseum | 0.694 | 0.716 | 0.790 | 0.831 |
| Eiffel | 0.467 | 0.468 | 0.676 | 0.699 |
| Golden | 0.463 | 0.463 | 0.671 | 0.699 |
| Pisa | 0.136 | 0.137 | 0.303 | 0.353 |
| Starbucks | 0.855 | 0.855 | 0.640 | 0.866 |
| Tower | 0.515 | 0.522 | 0.576 | 0.651 |
| Triomphe | 0.460 | 0.448 | 0.701 | 0.668 |
| Beach | 0.349 | 0.389 | 0.227 | 0.398 |
| Football | 0.543 | 0.655 | 0.628 | 0.686 |
| Horse | 0.784 | 0.783 | 0.601 | 0.774 |
| Louvre | 0.430 | 0.521 | 0.628 | 0.647 |
| Park | 0.281 | 0.285 | 0.178 | 0.301 |
| MAP | 0.498 | 0.520 (+4.4%) | 0.551 (+10.7%) | 0.631 (+26.7%) |

Besides, the voting-based method [24] reaches better accuracy in a few queries (e.g., “Beach”) since it merely reweights the tags originally existing in the photo. Different from [24], the proposed method aims to obtain more semantically related tags through the propagation process. Therefore, the proposed method might degrade in a few queries (e.g., “Park”) due to the limitation of the BoW feature for describing the visual graph among scene-related images.$^{7}$ Although the propagation process relies heavily on visual similarity, the selection process can alleviate this effect by retaining more representative tags (e.g., “Football”: from 0.366 (propagation) to 0.628 (+ selection)), so that the overall retrieval accuracy is still better. Moreover, since the voting-based method has the advantage mentioned above, we further combine it with our method (Voting + Ours) to achieve the best results (+26.7%).

We also show tag refinement examples in Figure 10. As mentioned above, each image can obtain more related (new) tags after tag propagation, as shown in Figure 10(b) (e.g., “Colosseum” or “Eiffel tower”). Each image can further retain the representative tags and reject incorrect (or less frequent) ones (e.g., “Visiteiffel”) after the tag selection in Figure 10(c). Interestingly, through these processes, we can also correct typos or seldom-used tags such as “Colloseum” (widely used: “Colosseum,” “Coliseum,” or “Colosseo”). To further consider tag sparsity, we can retain a few representative tags (e.g., “Eiffel tower”) as shown in Figure 10(d). However, it is possible to retain only some common tags such as “Paris” or “London.”

$^{7}$We believe the fusion of further visual features (e.g., texture, color) will enhance this part. In this work, we emphasize the proposed general framework for deriving semantic (visual or textual) features by leveraging more contextual cues along with the large-scale photo collections.

First example:
(a) Original tags: Visiteiffel
(b) + Tag propagation (Section VI-A): Visiteiffel, France, Eiffel tower, Paris, Trish, Sunset, Eiffel
(c) + Tag selection (Section VI-B): Paris, Eiffel tower, France, Visiteiffel, Eiffel, Vacation, Trish
(d) Or + Tag selection with sparsity: Eiffel tower

Second example:
(a) Original tags: Roma, Antique, Heat, Colloseum, Cameraman
(b) + Tag propagation (Section VI-A): Roma, Cameraman, Colloseum, Heat, Antique, Colosseum
(c) + Tag selection (Section VI-B): Roma, Colosseum, Rome, Italy, Cameraman, Colloseum
(d) Or + Tag selection with sparsity: Roma, Colosseum

Fig. 10. Examples for tag refinement by tag propagation and selection (Section VI). (a) shows the original tags from the Flickr website. After tag propagation, each image can have more related (new) tags (b). To reduce incorrect (noisy) tags, we adopt tag selection to select the informative and representative tags (c). We further consider sparsity for tag selection to retain a few (salient) tags (d). Note that the correct tags are indicated in bold in the original figure.

Tower bridge London Bridge England

Eiffel Tower France Paris

Arc de Triomphe l'Arc de Triomphe France Paris

Italy Europe Pisa Leaning

Failure case

saraoberlin california siena ronda Starbucks logo

Fig. 11. Example results for image annotation on those images originally not associated with any tags. Though initially each image is singular in the textual cluster, through the extended textual cluster and the following semantic feature (tag) discovery, each image can obtain semantically related tags. However, if the image object is too small to derive visually similar images (e.g., Starbucks logos), it might incur poor annotations.

Moreover, the tag refinement process can also annotate those images which initially do not have any tags. Figure 11 shows some image annotation results after the tag refinement process. During the tag propagation step, a single image (node) in the textual graph obtains tags via its related visual clusters. This approach is similar to annotation by search; however, we rely on the extended textual clusters to propagate tags rather than on the search results.$^{8}$ As shown in Figure 11, we can correctly annotate the images on the left-hand side; nevertheless, it is still possible to propagate some incorrect tags, as in the rightmost case, because the visual (textual) clusters might be noisy. This can be improved if more effective clustering methods and contextual cues are employed.

To provide another view for evaluating annotation quality, we first remove the original (user-provided) tags before con-

$^{8}$Note that image annotation is a by-product of tag refinement and it only annotates database images rather than a new query image. Therefore, we do not compare with other methods such as annotation by search [3].



