Introduction - Dynamic Quantization Techniques for Robust and Distributed Speech Recognition Systems

With the growing popularity of digital cameras, many people have accumulated huge collections of digital images. A resulting challenge is how to find exactly the desired photo, because browsing through the entire collection is impractical. This calls for an efficient photo retrieval approach.

Content-based image retrieval has been an active research area for years, and many successful approaches are based on low-level image features and implemented as “query by example” [62, 63]. However, this is not very attractive in practice, because it requires the user to provide an example photo as the query. In fact, most users prefer to query with words that give high-level semantic descriptions of photos, such as who, what (objects/events), when, and where; but again, this is not an attractive solution if it requires manual annotation of each individual photo. This observation has led to the idea of annotating photos with speech [64, 65]. When such a spoken photo annotation is taken as a spoken document, the problem becomes one of spoken document retrieval.

Many spoken document retrieval approaches have been successful in spotting the query term in the spoken documents, but these approaches usually suffer from the problem of word usage diversity, i.e., the query and its relevant documents may use different sets of words. This problem is especially serious for photo retrieval as considered here, because the annotation may describe a location (where) while the query asks for a person (who); both annotation and query are typically free-form and vary significantly. In spoken document retrieval, semantic matching strategies have been developed to solve the word usage diversity problem by discovering latent topics inherent in the query and documents.

Latent semantic indexing (LSI) and probabilistic latent semantic analysis (PLSA) are two typical examples [66, 67]. In both cases the relevance score between a query term and the spoken documents can be obtained via a set of latent topics, and relevant documents can be retrieved even using query terms that are completely different from those used in the documents. This is because common topics are usually found in sets of documents that each include a set of similar terms, or in sets of terms that each appear in a set of similar documents, and such topical information is used in retrieval.
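As a minimal sketch of how a latent-topic relevance score can be computed, the toy example below uses the standard PLSA decomposition P(term | doc) = sum over topics of P(term | topic) P(topic | doc). The vocabulary, the two topics, and all probability values are hypothetical and hand-picked for illustration; in practice these distributions would be estimated from data, typically with the EM algorithm.

```python
import numpy as np

# Hypothetical toy vocabulary; not taken from any real photo archive.
terms = ["beach", "surf", "temple", "shrine"]

# P(term | topic): one row per latent topic, each row sums to 1.
p_term_given_topic = np.array([
    [0.5, 0.5, 0.0, 0.0],   # topic 0: seaside words
    [0.0, 0.0, 0.5, 0.5],   # topic 1: temple words
])

# P(topic | doc): one row per document, each row sums to 1.
p_topic_given_doc = np.array([
    [0.9, 0.1],   # doc 0: mostly topic 0
    [0.1, 0.9],   # doc 1: mostly topic 1
    [0.5, 0.5],   # doc 2: mixed
])

def relevance(term):
    """Return P(term | doc) for every document, summed over latent topics."""
    t = terms.index(term)
    return p_topic_given_doc @ p_term_given_topic[:, t]

scores = relevance("surf")
ranking = np.argsort(-scores)   # document indices, most relevant first
print(scores, ranking)
```

The score depends only on the topic mixture of each document, so a document dominated by the seaside topic ranks first for the query "surf" even if that word never occurs in its annotation.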

The above semantic matching methods do not solve the photo retrieval problem considered here either. Assume that photo annotation can be formulated into six categories: who, what (object and event), when, where, and others. When labeling a photo, users typically select only one or two categories. As such, related photos may not be labeled using similar terms (e.g., some may be labeled by where and some by who), and the relationships among terms in different categories cannot be trained using latent topics. For example, given a where query, many photos taken at that location may not be retrieved if they are annotated with words in other categories. Also, users generally annotate far too few photos to train such topic models. Moreover, it is difficult even to define what a “topic” should be for photos. For example, should photos of different people taken at the same location belong to the same topic, or should photos of the same people taken at different locations belong to the same topic? In other words, the above six categories of labels are orthogonal, but user annotations are usually very sparse. Thus the photo retrieval problem is quite different from the well-investigated spoken document retrieval problem, even if photos have spoken annotations.

Considering all the above, user annotations alone cannot provide enough information to build the semantic relationships among photos. If we could extract similar “terms” from the image features of photos of the same topic, the semantic links among photos with sparse annotations would become stronger through the extracted image “terms.” Note that the terms used in semantic analysis are discrete, while low-level image features are continuous. Therefore, how to quantize these image features into “terms” is a key issue that must be addressed before semantic analysis.

The image feature quantization considered here aims to extract common “terms” from photos having the same topic and distinctive “terms” from photos with different topics. This is because common terms build stronger semantic relationships among photos with the same topic, while distinctive terms discriminate between photos with different topics.

Conventional quantization with a fixed, pre-trained codebook cannot represent image features well. On one hand, if the partition cells defining a color bin are fixed, the same scene taken with different cameras may have very different color histogram features. In this situation, photos of the same scene taken with different cameras could not be retrieved together, because their image “terms” would be quite different under a fixed quantization codebook. Therefore, it is important to apply the concept of dynamic quantization to define dynamic partition cells for photos taken with different cameras. On the other hand, if the representative codewords for the color histogram features and Gabor texture features are fixed, photos with different topics may lie close together in some feature dimensions, and they would be quantized to the same codewords in these dimensions. Extracting common terms from photos with different topics is harmful for semantic analysis, because the topical information of the photos would become less clear. Therefore, it is important to define the representative codewords dynamically so as to preserve the discriminative information in the quantization process.
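The effect of fixed versus data-dependent partition cells can be illustrated with a small sketch. It rests on two assumptions that are not taken from this chapter: the camera difference is modeled as a simple gain and offset on pixel intensities, and the dynamic cells are placed at quantiles of each photo's own values. Quantile boundaries are used here only as a simple stand-in for a dynamic quantizer and are not necessarily the scheme developed in Section 7.4.

```python
import numpy as np

def fixed_cells(values, n_bins, lo=0.0, hi=1.0):
    """Quantize with equal-width cells whose boundaries never change."""
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]   # interior boundaries
    return np.digitize(values, edges)

def dynamic_cells(values, n_bins):
    """Quantize with cells whose boundaries are quantiles of the data itself."""
    probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.quantile(values, probs)
    return np.digitize(values, edges)

rng = np.random.default_rng(0)
camera_a = rng.uniform(0.2, 0.6, size=1000)   # hypothetical intensities of a scene
camera_b = 0.8 * camera_a + 0.3               # same scene, assumed gain and offset

n_bins = 4
hist_a_fixed = np.bincount(fixed_cells(camera_a, n_bins), minlength=n_bins)
hist_b_fixed = np.bincount(fixed_cells(camera_b, n_bins), minlength=n_bins)
same_dynamic = np.array_equal(dynamic_cells(camera_a, n_bins),
                              dynamic_cells(camera_b, n_bins))

print("fixed cells, camera A:", hist_a_fixed)
print("fixed cells, camera B:", hist_b_fixed)
print("dynamic cell indices identical:", same_dynamic)
```

Under these assumptions the fixed cells assign the two versions of the scene to different bins, whereas the quantile-based cells move with the data and give both versions the same indices.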

Considering all the above, in this chapter we propose a user-friendly semantic-based photo retrieval approach using fused image/speech/text features. We use low-level image features to derive the basic links among photos, since these features are the universal language for describing photos, and we train semantic models with PLSA to analyze the topics of the photos. Because the “terms” in PLSA have to be discrete while low-level image features take continuous real values, for each given photo we use the low-level image features to select from the photo archive a group of “cohort photos” with similar image characteristics. These cohort photos serve as the “terms” describing the image characteristics of the photo, and they are fused with speech/text features if the user has added an annotation. The speech/text annotation can be very “sparse,” i.e., only a few words regarding the semantics (e.g., where or who) are needed, and for only a small portion of the photos. In this way, the image/speech/text features are fused through PLSA topic analysis and used in PLSA semantic-based retrieval. The sparse text/speech annotations serve as the interface through which the user accesses the whole photo archive, since the photos that are not annotated are linked by the semantics of the image features based on PLSA.
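The idea of using cohort photos as discrete “terms” can be sketched as follows. This is an illustration only: the Euclidean distance, the cohort size k, the exclusion of a photo from its own cohort, the toy feature vectors, and the helper names `cohort_terms` and `build_document` are all assumptions, not details given in this chapter.

```python
import numpy as np

def cohort_terms(features, k):
    """For each photo, return the indices of its k nearest other photos."""
    diff = features[:, None, :] - features[None, :, :]
    dist = (diff ** 2).sum(axis=-1)       # pairwise squared Euclidean distance
    np.fill_diagonal(dist, np.inf)        # a photo is not its own cohort
    return np.argsort(dist, axis=1)[:, :k]

def build_document(photo_id, cohorts, annotations):
    """Fuse cohort-photo terms with any (possibly absent) annotation words."""
    terms = [f"cohort_{j}" for j in cohorts[photo_id]]
    return terms + annotations.get(photo_id, [])

# Hypothetical 2-D image features for five photos: two visually distinct groups.
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.2],   # photos 0-2
    [5.0, 5.0], [5.1, 5.0],               # photos 3-4
])
annotations = {0: ["Taipei"]}             # only one photo is annotated

cohorts = cohort_terms(features, k=2)
doc0 = build_document(0, cohorts, annotations)
doc1 = build_document(1, cohorts, annotations)
print(doc0)
print(doc1)
```

In this toy archive, photo 1 carries no annotation but shares a cohort term with the annotated photo 0, which is the kind of link a topic model could then exploit.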

The rest of this chapter is organized as follows. Section 7.2 introduces the overall photo retrieval system. Section 7.3 describes the basic formulation of PLSA. Color feature extraction with dynamic partition cells and texture feature extraction are introduced in Section 7.4. In Section 7.5, we introduce how to extract image “terms” from low-level image features using dynamically defined representative codewords. In Section 7.6, we construct a document for each photo based on the photo annotations and the image “terms,” and use PLSA to analyze the topics of photos for photo retrieval. In Section 7.7, we perform image clustering based on the PLSA model. Experimental settings and results are presented in Section 7.8. Conclusions are given in the last section.
