1.1 Motivation
Whenever we listen to music, we can feel the emotion it expresses or picture a scene in our mind. For example, when we listen to Lisa Ono’s Bossa Nova, we may think of a beautiful young lady sitting leisurely on a sandy beach, with warm sunshine and a clear blue sky. In daily life, we connect scenes and music, consciously or unconsciously. For this reason, we want to construct an image retrieval system that uses music as the query. When a user listens to music and wants to find images matching the scene in his or her mind's eye, this system can retrieve images related to the music.
1.2 Related Work
Because of the explosive growth of the Internet, abundant multimedia materials are shared on the web, including images, video, music, text, and so on. How to search for the materials people need has become an important issue. In the past decade, the most popular applications, such as Google [1], YouTube [2], and Flickr [3], have allowed people to search text, images, video, and music by query keywords. These multimedia search systems are based on matching the query keywords against the text associated with the media.
In recent research, content-based image retrieval (CBIR) has been an active area. The main goal of CBIR is to narrow the semantic gap between visual signatures and semantic meaning. Many image retrieval techniques are based on a query example image instead of keywords. One interesting work is [9], which proposes the concept of a semantic space. Example images are used as hidden semantic features, instead of low-level visual features extracted from the image. A simple example is shown in Fig. 1.1(a); the semantic space is constructed from users' relevance feedback (RF). The process of image retrieval is treated as a matrix operation, as Fig. 1.1(b) shows.
(a) Simple example of semantic space
(b) Image retrieval in semantic space can be thought of as a matrix operation.
Fig. 1.1. Semantic space proposed in [9]
Recently, in addition to content-based image retrieval, multimodal fusion and retrieval techniques have also attracted much attention from researchers. A cross-media retrieval system is proposed in [10]. In that paper, a graph $W$ called the UCCG (Uniform Cross-media Correlation Graph) is proposed. $W$ can be interpreted as a matrix, where $W_{ij}$ is the distance between media objects $obj_i$ and $obj_j$. Media objects include images (I), audio (A), and text (T). The concept of the MMD (Multimedia Document) was also proposed in that paper. An MMD is a document that includes media objects (I, A, T), for example, a multimedia webpage. Objects in the same MMD are assumed to have the same semantic meaning. There are several steps to construct the UCCG. First, initialize $W$ as
$$W_{ij} = \infty \quad (1 \le i, j \le n). \tag{2.1}$$
Second, measure $W_{ij}$ for media objects within the same modality by their distance in the low-level feature space. Third, because objects in the same MMD have the same semantic meaning, set
$$W_{ij} = \varepsilon, \quad \text{if } obj_i, obj_j \in \Omega \wedge obj_i^h = obj_j^h, \tag{2.3}$$
where $\varepsilon$ is a small constant, $\Omega = I \cup T \cup A$, and $obj_i^h$ is the MMD of $obj_i$. Fourth, model the structure in the manifold view:
$$W_{ij} = W_{ij}, \quad \text{if } W_{ij} < \sigma$$
Instead of using keywords or example images as the query, the proposed image retrieval system allows the user to search for images by using music as the query. Inspired by the concept of semantic space proposed in [9], we propose a music-image semantic matrix, in which each entry is the relevance score of a music-image pair. The semantic matrix used in [9] is constructed from user relevance feedback and suffers from a cold-start problem: the semantic matrix is too sparse at the beginning. If the database is large, it is impractical to construct the semantic matrix from user relevance feedback. In our research, the textual information associated with music and images is used to measure the relevance between them. After the music-image semantic matrix is constructed, hidden semantic features (HSF) of music and images are extracted from it. HSF can be regarded as the bridge between music and images. Finally, music-image retrieval is based on HSF, and user relevance feedback is used to refine the retrieval results. The system framework is as follows:
‧Music-image semantic matrix construction
Each entry of the music-image semantic matrix is the relevance score of a music-image pair. The relevance score is calculated by applying a ranking function derived from Okapi BM25 [20] to the textual information associated with the music and the image. The textual information of an image is metadata collected from Flickr [3], and the textual information of music is metadata collected from AMG [5].
‧Hidden semantic feature extraction and prediction
PLSA (Probabilistic Latent Semantic Analysis) [18] is applied to the music-image semantic matrix to extract the HSF of each music piece and image in the database. The HSF is a distribution over the hidden topics of PLSA, so the relevance of music and image can be measured as their similarity in HSF space. A mapping function from audio features to HSF is trained with a neural network (NN), and the HSF of unknown music can be predicted by this mapping function.
‧Image retrieval using music as query and relevance feedback
The query music is transformed from an audio waveform into audio features, and then into HSF. The relevance between the query music and the images in the database is measured by HSF, and the top k most relevant images are retrieved. After the first round of image retrieval, the user can give relevance feedback to refine the retrieval results.
There are long-term learning and short-term learning from relevance feedback, which will be described in later chapters.
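The three framework steps above can be illustrated with a minimal Python sketch. It is not the implementation used in this work, only one way the pieces might fit together under stated assumptions. The function names (`bm25_scores`, `semantic_matrix`, `plsa_hsf`, `retrieve`) and the toy tag lists are hypothetical. The BM25 parameters `k1 = 1.2` and `b = 0.75` and the non-negative IDF variant are common defaults rather than values taken from [20]. Cosine similarity is assumed as the similarity measure in HSF space. The neural-network mapping from audio features to HSF and the relevance feedback loop are omitted, so the query HSF is simply taken from a music item already in the database.

```python
import math
import random


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each tokenized document in `docs` for `query`."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        s = 0.0
        for term in set(query):
            tf = doc.count(term)                    # term frequency in this doc
            if tf == 0:
                continue
            df = sum(1 for d in docs if term in d)  # document frequency
            # Assumption: "+1" inside the log keeps the IDF non-negative.
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(s)
    return scores


def semantic_matrix(music_texts, image_texts):
    """Rows are music items, columns are images, entries are BM25 scores."""
    return [bm25_scores(text, image_texts) for text in music_texts]


def _normalize(v, eps=1e-12):
    """Turn non-negative weights into a (lightly smoothed) distribution."""
    total = sum(v) + eps * len(v)
    return [(x + eps) / total for x in v]


def plsa_hsf(matrix, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM (n_iter >= 1); return P(z|music) and P(z|image) as HSF."""
    rng = random.Random(seed)
    n_m, n_i = len(matrix), len(matrix[0])
    p_z_m = [_normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_m)]
    p_i_z = [_normalize([rng.random() for _ in range(n_i)]) for _ in range(n_topics)]
    for _ in range(n_iter):
        cnt_mz = [[0.0] * n_topics for _ in range(n_m)]  # expected counts n(m, z)
        cnt_zi = [[0.0] * n_i for _ in range(n_topics)]  # expected counts n(z, i)
        for m in range(n_m):
            for i in range(n_i):
                # E-step: P(z | m, i) is proportional to P(z|m) * P(i|z).
                post = _normalize([p_z_m[m][z] * p_i_z[z][i] for z in range(n_topics)])
                for z in range(n_topics):
                    cnt_mz[m][z] += matrix[m][i] * post[z]
                    cnt_zi[z][i] += matrix[m][i] * post[z]
        # M-step: re-estimate P(z|m) and P(i|z) from the expected counts.
        p_z_m = [_normalize(row) for row in cnt_mz]
        p_i_z = [_normalize(row) for row in cnt_zi]
    # P(z|image) is proportional to the expected topic counts of that image.
    image_hsf = [_normalize([cnt_zi[z][i] for z in range(n_topics)]) for i in range(n_i)]
    return p_z_m, image_hsf


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve(query_hsf, image_hsf, k):
    """Indices of the k images whose HSF is most similar to the query HSF."""
    order = sorted(range(len(image_hsf)),
                   key=lambda i: cosine(query_hsf, image_hsf[i]), reverse=True)
    return order[:k]


# Toy metadata (invented for illustration only).
music_texts = [["bossa", "nova", "beach", "relaxed"], ["metal", "dark", "aggressive"]]
image_texts = [["beach", "sunshine", "sky"], ["dark", "storm", "night"], ["beach", "relaxed"]]
matrix = semantic_matrix(music_texts, image_texts)
music_hsf, image_hsf = plsa_hsf(matrix, n_topics=2)
top = retrieve(music_hsf[0], image_hsf, k=2)
```

In this sketch the semantic matrix is treated as a table of non-negative co-occurrence weights, which is what allows PLSA to be fitted to BM25 scores directly. Whether the scores need rescaling before PLSA is a design choice that the sketch does not address.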
1.4 Chapter Outline
The remainder of this paper is structured as follows. Chapter 2 describes the research background. Chapters 3, 4, and 5 describe music-image semantic matrix construction, HSF extraction and prediction, and the main image retrieval system, respectively. Chapter 6 describes the experiment design and presents and discusses the experimental results. Finally, Chapter 7 gives our conclusions.