Fig. 3.1. Flowchart of Music-Image Semantic Matrix Construction
The music-image semantic matrix S represents the semantic correlation between music and images: entry Sij is the relevance score of music Mi and image Ij. Fig. 3.1 shows the flowchart of its construction. Because the features extracted from music and from images are of different types, their relevance is hard to measure directly. To bridge this semantic gap, the textual information (metadata) of images and music is used as an intermediate. The metadata of each image and of each piece of music is regarded as a text document. After text preprocessing, the text information retrieval model Okapi BM25 is applied to calculate Sij for all i, j, which yields the music-image semantic matrix.
The images and their metadata are collected from Flickr; the image metadata types used in our research are listed in Table 3.1(a). The music metadata in our database is collected from AMG; the music metadata types used are listed in Table 3.1(b).
With image and music metadata available, measuring the semantic correlation between music and images becomes an information retrieval problem in textual space. Okapi BM25, a probabilistic model for text information retrieval, is applied to calculate the relevance score of music and image. Constructing the semantic matrix involves two main stages: (1) text preprocessing and (2) relevance score calculation.
Text preprocessing transforms the image and music metadata into image and music textual features, respectively. In the next stage, the relevance scores are calculated with the Okapi BM25 model.
(a). Image metadata types and weights
(b). Music metadata types and weights
Table 3.1. Image and music metadata types and weights
3.1 Text Preprocessing
To reduce noise and to transform the image and music metadata into textual features suitable for the Okapi BM25 model, the text preprocessing stage consists of several sub-stages.
a. Metadata Weight Adjusting
As Table 3.1 shows, three metadata types are used for images and eight for music. Different metadata types have different characteristics, so they are given different weights. For example, the image metadata types “title”, “description”, and “tags” all carry the semantic meaning of an image, but in our opinion “title” and “description” are noisier than “tags”, so “tags” is given a higher weight. The weight of each metadata type is shown in Table 3.1.
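The weight adjustment can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes that a weight simply multiplies the term counts of its metadata field, uses the image weights quoted for the worked example in this chapter (title 1, description 1, tags 5), and the metadata content and the function name `weighted_term_counts` are hypothetical.

```python
from collections import Counter

# Image weights quoted in the worked example: title 1, description 1, tags 5.
IMAGE_WEIGHTS = {"title": 1, "description": 1, "tags": 5}

def weighted_term_counts(metadata, weights):
    """Count terms across metadata fields, scaling each field's counts by its weight.

    Assumption: a weight w means every occurrence in that field counts w times.
    """
    counts = Counter()
    for field, text in metadata.items():
        weight = weights.get(field, 0)  # fields without a weight contribute nothing
        for term in text.lower().split():
            counts[term] += weight
    return counts

# Hypothetical image metadata, for illustration only.
example = {
    "title": "the sorrow",
    "description": "she is sorrowful",
    "tags": "sorrow rain",
}
counts = weighted_term_counts(example, IMAGE_WEIGHTS)
```

Here “sorrow” occurs once in the title and once in the tags, so its weighted frequency is 1 + 5 = 6, while a tag-only term such as “rain” gets 5.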
b. Stop Words Removing
In text information retrieval, some words in the documents do not help improve retrieval results. These so-called stop words are removed in this stage. For example, words like “a”, “the”, and “they” carry little information. The stop word list used in our research comes from [6].
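A minimal sketch of this sub-stage, assuming the document is already held as a term-frequency table. The four-word stop list below is a stand-in for illustration only (the list actually used comes from [6]), and `remove_stop_words` is a hypothetical helper name.

```python
from collections import Counter

# Tiny illustrative stop word list; NOT the list from [6].
STOP_WORDS = {"a", "the", "they", "is"}

def remove_stop_words(term_counts, stop_words=STOP_WORDS):
    """Return a copy of term_counts without any stop-word entries."""
    return Counter({t: f for t, f in term_counts.items() if t not in stop_words})

# Hypothetical weighted term frequencies.
weighted = Counter({"the": 1, "is": 1, "sorrow": 6, "sorrowful": 1})
filtered = remove_stop_words(weighted)
```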
c. Words Stemming
Stemming reduces words to their stems so that related words are mapped to the same root. For example, “group”, “grouping”, and “groups” all share the root “group”. The language used in this research is English, and the stemming algorithm used is Porter’s algorithm [22].
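The effect of stemming on a term-frequency table can be illustrated as below. Note that the stemmer here is deliberately a toy suffix stripper and is not Porter’s algorithm [22]; it only handles the few suffixes needed for the examples, and the names `toy_stem` and `stem_counts` are hypothetical. The point of the sketch is the merging of frequencies for terms that share a stem.

```python
from collections import Counter

# NOT Porter's algorithm: a toy suffix stripper for illustration only.
SUFFIXES = ("ing", "ful", "s")

def toy_stem(term):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def stem_counts(term_counts, stem=toy_stem):
    """Merge the frequencies of terms that map to the same stem."""
    merged = Counter()
    for term, freq in term_counts.items():
        merged[stem(term)] += freq
    return merged

# Hypothetical term frequencies before stemming.
stemmed = stem_counts(Counter({"sorrow": 6, "sorrowful": 1, "groups": 2, "grouping": 1}))
```

In a real pipeline the `stem` argument would be replaced by a full Porter stemmer implementation.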
d. Metadata Documents to Textual Features Transformation
In this stage, each metadata document is transformed into a textual feature, which is represented as a TF (term frequency) vector whose component tfi,j is the frequency of term ti in document dj. To save storage, the textual feature of a metadata document is kept as a (TID, TF) table, where TID is the term id of a specific term and TF is the corresponding term frequency. To save computation time in the music-image relevance score calculation stage, an inverted file is also constructed. The inverted file is an index data structure that records, for each term, the documents in which it occurs.
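The two data structures can be sketched as follows. This is one possible layout under stated assumptions: TIDs are assigned in first-seen order, the inverted file also stores the term frequency alongside each document id, and all function names and document contents are hypothetical.

```python
from collections import defaultdict

def build_vocabulary(documents):
    """Assign a term id (TID) to every distinct term, in first-seen order."""
    vocab = {}
    for term_counts in documents.values():
        for term in term_counts:
            vocab.setdefault(term, len(vocab))
    return vocab

def to_tid_tf_table(term_counts, vocab):
    """Sparse textual feature: sorted (TID, TF) pairs for one document."""
    return sorted((vocab[t], tf) for t, tf in term_counts.items())

def build_inverted_file(documents, vocab):
    """Map each TID to the documents containing it, with the term frequency."""
    inverted = defaultdict(dict)
    for doc_id, term_counts in documents.items():
        for term, tf in term_counts.items():
            inverted[vocab[term]][doc_id] = tf
    return dict(inverted)

# Hypothetical preprocessed documents (term -> frequency).
docs = {
    "img1": {"sorrow": 7, "rain": 5},
    "img2": {"rain": 1, "sun": 2},
}
vocab = build_vocabulary(docs)
features = {d: to_tid_tf_table(c, vocab) for d, c in docs.items()}
inverted = build_inverted_file(docs, vocab)
```

Only non-zero entries are stored in each (TID, TF) table, and looking up a query term in the inverted file returns just the documents that contain it, so documents sharing no term with the query never need to be scored.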
Fig. 3.2 illustrates an example of text preprocessing of image metadata. Fig. 3.2(a) is the original image metadata, which has three metadata types: title, description, and tags. Fig. 3.2(b) is the term frequency table of this image metadata after weight adjusting. Following the setting in Table 3.1(a), the weights of title, description, and tags are 1, 1, and 5, respectively. Fig. 3.2(c) is the term frequency table after removing stop words; the terms “the” and “is” are removed because they are stop words. Fig. 3.2(d) is the term frequency table after stemming. The terms “sorrow” and “sorrowful” share the root “sorrow”, so both are stemmed to “sorrow” and their term frequencies are added together. Fig. 3.2(e) is the vocabulary, which holds the term-to-TID mapping and is constructed from all the image and music metadata in the database. Finally, the textual feature of this image metadata is shown in Fig. 3.2(f).
(a) Original metadata (b) After weight adjusting (c) After stop words removing (d) After stemming (e) Vocabulary, term to TID mapping (f) Textual feature of this image metadata
Fig. 3.2. Example of text preprocessing of image metadata
3.2 Music-Image Relevant Score Calculation
In this stage, the relevance between music and image is evaluated with a measure based on the ranking function of Okapi BM25 (2.12). The database of image textual features is regarded as the document collection, and the database of music textual features is regarded as the query collection. The ranking function is modified to fit our problem, giving (3.1). In (3.1), Lavem, k3, and c are parameters for the music collection; together with the term frequency and the length of the music textual feature, they are defined analogously to tfti, Li, Lavei, k1, and b for the image collection. Thus (3.1) takes into account not only the image textual feature but also the length of the music textual feature, so a long music textual feature is penalized to some degree. With this measure, the music-image semantic matrix can be constructed: entry (m, i) of the semantic matrix is the relevance score of music m and image i.
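The scoring step can be sketched as below. The exact form of (3.1) is not reproduced here, so the sketch rests on assumptions that should be checked against it: the image side uses the usual BM25 term-frequency saturation with length normalization (k1, b), the music side mirrors it with k3 and c, the inverse document frequency is log(N / df), and the default parameter values are common BM25 choices rather than values taken from this thesis. All function names and example documents are hypothetical.

```python
import math

def bm25_relevance(music_tf, image_tf, df, n_images, avg_image_len, avg_music_len,
                   k1=1.2, b=0.75, k3=1.2, c=0.75):
    """Score one music textual feature (query) against one image (document).

    Assumed form, summed over terms t shared by both:
      log(N / df_t)
      * (k1 + 1) tf_ti / (k1 ((1 - b) + b L_i / Lave_i) + tf_ti)
      * (k3 + 1) tf_tm / (k3 ((1 - c) + c L_m / Lave_m) + tf_tm)
    """
    image_len = sum(image_tf.values())
    music_len = sum(music_tf.values())
    image_norm = k1 * ((1 - b) + b * image_len / avg_image_len)
    music_norm = k3 * ((1 - c) + c * music_len / avg_music_len)
    score = 0.0
    for term, tf_m in music_tf.items():
        tf_i = image_tf.get(term, 0)
        if tf_i == 0 or df.get(term, 0) == 0:
            continue  # term absent from this image: contributes nothing
        idf = math.log(n_images / df[term])
        score += (idf
                  * (k1 + 1) * tf_i / (image_norm + tf_i)
                  * (k3 + 1) * tf_m / (music_norm + tf_m))
    return score

def build_semantic_matrix(music_docs, image_docs, **params):
    """Return S as nested lists, with S[m][i] the score of music m and image i."""
    n_images = len(image_docs)
    df = {}
    for tf in image_docs:
        for term in tf:
            df[term] = df.get(term, 0) + 1  # document frequency over images
    avg_image_len = sum(sum(tf.values()) for tf in image_docs) / n_images
    avg_music_len = sum(sum(tf.values()) for tf in music_docs) / len(music_docs)
    return [[bm25_relevance(m, i, df, n_images, avg_image_len, avg_music_len, **params)
             for i in image_docs] for m in music_docs]

# Hypothetical textual features (term -> frequency) after preprocessing.
images = [{"sorrow": 7, "rain": 5}, {"sun": 2, "beach": 3}, {"rain": 1, "city": 4}]
music = [{"sorrow": 2, "sad": 1},
         {"sorrow": 2, "sad": 1, "slow": 6, "piano": 6}]
S = build_semantic_matrix(music, images)
```

Both music items contain “sorrow” twice, but the second has a much longer textual feature, so under this assumed form its score against the first image is lower than that of the first music item, which is the penalty on long music textual features described above.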
The textual information of music and images is used to bridge the semantic gap between them. However, according to our observation, the words frequently used to describe images differ from those used to describe music, so measuring the relevance score directly from the image and music textual information is not accurate. In this thesis, we also present a new approach for mapping between the words used in music metadata and those used in image metadata, called music-image descriptive words expansion, which is introduced in more detail in Section 5.5.