
A Study of Image-based Music Information Retrieval (基於圖像資訊之音樂資訊檢索研究) - 政大學術集成 (NCCU Academic Hub)


國立政治大學資訊科學系
Department of Computer Science, National Chengchi University

碩士論文 (Master's Thesis)

基於圖像資訊之音樂資訊檢索研究
A Study of Image-based Music Information Retrieval

研究生 (Student): 夏致群 (Chih-Chun Hsia)
指導教授 (Advisor): 蔡銘峰 (Ming-Feng Tsai)

中華民國一百零六年九月 (September 2017)


A thesis submitted to the Department of Computer Science, National Chengchi University, in partial fulfillment of the requirements for the degree of Master in Computer Science.

September 2017

記錄編號 (Record No.): G0104753001

國立政治大學博碩士論文全文上網授權書
National Chengchi University Letter of Authorization for Theses and Dissertations Full Text Upload
(Bind with the paper copy of the thesis/dissertation, following the title page.)

This form attests that the ________ Division of the Department of Computer Science at National Chengchi University has received a Master's degree thesis by the undersigned in the first semester of the 106 academic year.

Title: A Study of Image-based Music Information Retrieval (基於圖像資訊之音樂資訊檢索研究)
Supervisor: Ming-Feng Tsai (蔡銘峰)

The undersigned grants non-exclusive and gratis authorization to National Chengchi University to reproduce the above thesis/dissertation full-text material via digitization or any other means, to store it in a database, and to provide it for users to search, browse, download, transmit, and print online via single machine, the Internet, wireless Internet, or other public methods. National Chengchi University is entitled to re-authorize a third party to perform the above actions.

Time of thesis/dissertation full-text upload for Internet access: to be made public on September 15, 2018 (中華民國 107 年 9 月 15 日).

- The undersigned guarantees that this work is the original work of the undersigned, is therefore eligible to grant the authorizations in this letter, and does not infringe any intellectual property right of any third party.
- According to the resolution of the first Academic Affairs Meeting of the first semester on September 22, 2007, once a thesis/dissertation has passed the examiners' evaluation and been sent to the library, it is considered the university's record, and replacing it is disallowed; likewise, once authorization is granted, no further alteration is allowed.

立書人 (Signatory): 夏致群 (Chih-Chun Hsia)
Signature: ____________
Date of signature: __________/__________/__________ (dd/mm/yyyy)

致謝 (Acknowledgements)

Outside the window a fine rain was falling, and what drifted in stirred up memories. A crow smoothed its feathers and hopped down from the laser printer, seemingly drawn to the thick stack of theses in the distance, or rather, noticing something unusual. Another crow, it turned out, was napping behind the pile of paper, looking very tired. After studying it for a while, the first crow woke it with a couple of pecks. It got up rather reluctantly, carrying in its beak a few scattered scraps of paper on which there seemed to be objective functions I had never seen. The two crows hopped over to Professor Ming-Feng (銘峰) and Professor Chuan-Ju (釧茹); the one without the scraps did not seem very close to Professor Ming-Feng. I followed, bringing along the bouquet in my hand, and the two crows lingered for a while over the yellow lilies. Some say yellow lilies stand for gratitude, some use them for blessings, and others for apology; to me, they carry all of these messages: gratitude for my advisors' two years of guidance, which taught me how to do research and how to treat people; blessings that the lab will be full of talent and that everything will go smoothly in the years ahead; and apology and regret for not having worked harder and made the most of every minute to improve myself.

I sat down and took out my Apple laptop, thanking this companion for its company as I looked back over the past two years. When I had questions and complaints, thank you for always listening, proposing feasible solutions, and discussing them with me. My family and friends also encouraged me in different ways when I ran into setbacks, keeping me moving forward. The lab seems to have grown a great deal, and conversations with everyone always taught me something. The little scraps of paper the crows carried lay scattered about in piles, as if meant to be shared with everyone. Suddenly I was pecked, as if being urged to piece together the obscure formulas on the scraps; these footprints, too, will become hardbound volumes leaning against the wall.

After a long while came the sound of wings, and the two crows flew straight out into the distance. The rain kept falling, washing away what ought to be washed away. I chased after them, carrying my gratitude to everyone. The sea of learning knows no bounds; I only hope to grasp its way.

夏致群 (Chih-Chun Hsia)
Department of Computer Science, National Chengchi University
September 2017

基於圖像資訊之音樂資訊檢索研究

中文摘要 (Chinese Abstract)

Previous music information retrieval methods mostly use lyrics, genre, instruments, or a segment of an audio signal as the query medium. In some situations, however, users cannot clearly describe the songs they want to find, as in context-based music retrieval. This thesis proposes an image-based, contextual music information retrieval method that finds matching music from an input picture. In this method, we use convolutional neural network (CNN) techniques to process images and convert them into low-dimensional representations. To map heterogeneous multimedia information into the same vector space, network embedding techniques are also used, so that multimedia items related to the input image can be retrieved by distance calculation. We believe this approach can reduce the heterogeneity gap, i.e., the inability of different types of multimedia files to be converted into or interpreted by one another. For the experiments and evaluation, we first use keywords obtained from lyrics and song titles to search for a large number of images as the training dataset, then implement the proposed retrieval method and evaluate the experimental results. Besides testing the effectiveness of the method, user feedback also shows that the proposed retrieval method is effective compared with other methods. We have also implemented a web prototype in which users can upload an image and obtain the retrieved songs; actual use cases are demonstrated and described in this thesis.

A Study of Image-based Music Information Retrieval

Abstract

Listening to music is indispensable to everyone, and music information retrieval systems help users find their favorite music. A common scenario in music information retrieval is to search for songs based on a user's query. Most existing methods use descriptions (e.g., genre, instrument, and lyrics) or the audio signal of music as the query; the songs related to the query are then retrieved. The limitation of this scenario is that users might find it difficult to describe what they really want to search for. In this paper, we propose a novel method, called "image2song," which allows users to input an image to retrieve related songs. The proposed method consists of three modules: a convolutional neural network (CNN) module, a network embedding module, and a similarity calculation module. To process the images, the CNN is adopted to learn representations for images. To map each entity (e.g., image, song, and keyword) into the same embedding space, a heterogeneous representation is learned from an information graph by a network embedding algorithm. This method is flexible because other types of multimedia data can easily be added to the information graph. In the similarity calculation module, Euclidean distance and cosine distance are used as our criteria to compare similarity; we then retrieve the most relevant songs according to the similarity calculation. The experimental results show that the proposed method performs well. Furthermore, we also build an online image-based music information retrieval prototype system, which showcases some examples from our experiments.


Contents

論文全文上網授權書 (Letter of Authorization)
致謝 (Acknowledgements)
中文摘要 (Chinese Abstract)
Abstract
1 Introduction
2 Related Work
  2.1 Music Information Retrieval
  2.2 Cross-media Retrieval
  2.3 Convolution Neural Network
  2.4 Network Embedding
3 Methodology
  3.1 Terminology
  3.2 Convolutional Neural Network Module
  3.3 Network Embedding Module
  3.4 Similarity Calculation Module
4 Experimental Results
  4.1 The Implementation of Web-based Retrieval
  4.2 Experimental Settings
    4.2.1 Dataset
    4.2.2 Evaluation Metrics
    4.2.3 Baseline
  4.3 Experimental Results
  4.4 Case Study
5 Conclusions
Bibliography


List of Figures

1.1 The workflow of the proposed method
3.1 A framework of image2song
3.2 A deep CNN architecture of VGG-19
3.3 An example of information graph
3.4 An example of CNN vector space
3.5 An example of heterogeneous vector space
4.1 A demo of the online web-based retrieval system
4.2 An image labeled as angel, sky, and cloud
4.3 Questionnaire 1: snow forest
4.4 Questionnaire 2: sky with cloud
4.5 Questionnaire 3: coffee
4.6 Questionnaire 4: ocean
4.7 Questionnaire 5: fog
4.8 Case study


List of Tables

3.1 Terminology and definitions
4.1 The statistics of our dataset
4.2 Experimental results of image retrieval
4.3 Experimental results of heterogeneous entity retrieval
4.4 Experimental results: without chosen keywords
4.5 Result of questionnaire 1
4.6 Result of questionnaire 2
4.7 Result of questionnaire 3
4.8 Result of questionnaire 4
4.9 Result of questionnaire 5


Chapter 1
Introduction

Music holds the attention and interest of audiences, and it plays a major role in human entertainment. The traditional music industry has changed its business model from physical media formats to online products and services. With the advance of music streaming technology, users can download or listen to any available music on their devices. At the same time, a huge amount of music content is produced by millions of artists and composers, so it becomes more difficult for users to search for the music content they prefer. Furthermore, there is a lot of potentially interesting music that is also hard to discover. Therefore, music information retrieval systems are important.

There has been extensive research on music information retrieval, covering signal processing, pattern mining, and information retrieval. Traditional methods first perform preprocessing, such as data labeling or classification. It is common to use keywords or audio examples to retrieve music. For instance, users can type a genre, artist, or lyric as keywords to find relevant music, or record a fragment of humming or an audio signal to match similar music. However, in some cases it is hard for people to describe the query using only text information or audio examples.

In this paper, we propose an idea called "image2song," an image-based music information retrieval method. As an example of a real scenario: a user is sitting in a coffee shop and wants some background music. He takes a photo of the surrounding scenery with his phone and inputs this image as a query. The image-based music information retrieval system then returns some suitable music according to the scene.

Recently, in multimedia information retrieval, most approaches have moved toward heterogeneous retrieval [9, 25, 6, 17], the goal of which is to retrieve items via diverse information, such as texts and social relations. However, the heterogeneity gap between multiple types of data has been widely understood as a fundamental barrier; for example, an image cannot be transformed into music intuitively. In order to achieve the above objective

and reduce the gap, we develop an innovative music retrieval system that bridges the heterogeneity gap between music and image information. In our method, a deep convolutional neural network (CNN) and a network embedding algorithm are adopted. The proposed method is composed of three modules:

1. Convolutional Neural Network (CNN) Module
2. Network Embedding (NE) Module
3. Similarity Calculation (SC) Module

To experiment with the idea of image2song, we construct our image dataset by collecting images via search engines based on keywords. For the keyword extraction, we apply the term-frequency method to extract keywords for each song from its title and lyrics. Finally, we obtain a training dataset containing images, words, and their relations.

[Figure 1.1: The workflow of the proposed method]

The workflow of our approach is shown in Figure 1.1. In the training phase, we learn the CNN representation for the images in the training dataset. A CNN model with existing pre-trained weights is adopted, such as VGG-19 [20] or GoogLeNet [21]. In the network embedding module, a heterogeneous information graph is built. The types of entities are flexible; they can be words, songs, or even videos. There are three types of entities in our method: images, songs, and lyric keywords. Every data entity is treated as an individual vertex in the information graph, and vertices are connected to one another by observed relations (e.g., a song includes a keyword, or an image is retrieved by a keyword). In the information graph, the representation of each vertex is learned by a network embedding algorithm, such as DeepWalk [16] or LINE [22]; the learned representation is called the heterogeneous representation.
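As an illustration of this graph construction, the following minimal Python sketch assembles song-keyword and image-keyword edges into an adjacency map. The vertex names, the example data, and the use of term frequency as the song-keyword weight are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch of the heterogeneous information graph (hypothetical data).
from collections import defaultdict

graph = defaultdict(dict)  # vertex -> {neighbor: edge weight}

def add_edge(u, v, w):
    """Add an undirected weighted edge between two vertices."""
    graph[u][v] = w
    graph[v][u] = w

# Song-keyword edges: the weight reflects the keyword's relevance to the
# song, here assumed to be its frequency in the song's title and lyrics.
song_keywords = {"song:0001": {"雪": 3, "天空": 1}}
for song, kws in song_keywords.items():
    for kw, tf in kws.items():
        add_edge(song, "kw:" + kw, tf)

# Image-keyword edges: every image collected for a keyword receives the
# same constant weight C.
C = 1.0
image_keyword = {"img:0001.jpg": "雪", "img:0002.jpg": "天空"}
for img, kw in image_keyword.items():
    add_edge(img, "kw:" + kw, C)
```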

In the application phase, the CNN module is used to process the input image. The purpose of the similarity calculation module is to calculate the distance between representations; a shorter distance indicates that two representations are more relevant. This module has the following stages: relevant image searching, representation mapping, and relevant song searching. The input image can be denoted by one or more heterogeneous representations, from which we can obtain highly relevant songs.

In the experiments, an image dataset is crawled from well-known search engines, namely Google (https://images.google.com/), Bing (https://www.bing.com/?scope=images), and Baidu (https://image.baidu.com/), using the 72 extracted keywords. We then build an image-to-song dataset that contains 62,316 songs and 33,459 images. Since there are no ideal evaluation criteria, we evaluate the proposed method with three tasks:

1. Image Retrieval
2. Heterogeneous Entity Retrieval
3. User Feedback

For the image retrieval task, we create a labeled dataset for evaluation; the proposed method achieves a Precision@5 of 0.59. For the heterogeneous entity retrieval and user feedback tasks, the proposed method also outperforms the baseline. Furthermore, an online prototype of the image-based music information retrieval system is built for demonstration.

The rest of this paper is organized as follows. In Chapter 2, we introduce related work. In Chapter 3, we describe every module of our framework in detail. The experimental settings and analysis are in Chapter 4. Conclusions are given in Chapter 5.


Chapter 2
Related Work

Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music. MIR is a growing field of research with many real-world applications. Musicology, signal processing, and machine learning are common techniques in this area. With the rapid growth of multimedia data, cross-media retrieval has also become a feasible way to support MIR. In recent years, convolutional neural networks have shown good performance on image processing. To map different media data into the same feature space, network embedding algorithms are a very effective approach. In this chapter, we discuss the related work on these topics.

2.1 Music Information Retrieval

Music information retrieval can be regarded as part of the research on multimedia information retrieval [19]. The first research on audio signal analysis started with automatic speech recognition and discriminating music from speech content [7]. A survey of existing MIR systems was presented by Typke et al. [24]. Two main groups of content-based MIR systems can be distinguished: systems for searching audio data and systems for searching notated music. The authors defined four levels of retrieval tasks: genre level, artist level, work level, and instance level. Kaminskas and Ricci [10] investigated contextual music information retrieval and recommendation. The authors mentioned the traditional applications of MIR: query by example, query by humming, and genre classification. Query by example was one of the first applications of MIR techniques: users provide an audio signal as input, and the system returns information about the music. Query by humming takes as input a melody hummed by the user and retrieves the matching music. Determining the genre of music is a classification problem based on MIR techniques. This shows that audio signals played a major role in traditional methods.

There is also another way to retrieve music: we can use external information, such as

artist, lyrics, and reviews. When songs are labeled with such external information, text-based retrieval is feasible. In more recent work, Ogino and Yamashita [15] presented a study on emotion-based music information retrieval using lyrics information. Dieleman and Schrauwen [5] investigated whether it is possible to apply feature learning directly to raw audio signals. To improve performance, Raposo et al. [18] leveraged generic summarization algorithms, previously applied to text and speech, to summarize items in music datasets.

2.2 Cross-media Retrieval

Researchers in the area of multimedia information retrieval focus on retrieving information from different types of media content: images, video, and music. Cross-media retrieval can be regarded as a unified multimedia retrieval approach that tries to break through the modality barriers between different media objects. For example, when a user submits an image, the system returns related media objects of different modalities, such as articles or videos. Popular search engines, such as Google and Bing, perform well on text-based retrieval: a user can input text to find different types of media content. Therefore, a common way to deal with an image query is to transform the image into keywords and perform text-based retrieval.

Many studies have addressed cross-media retrieval [9, 25, 6, 17]. Jeon et al. [9] proposed an automatic approach to annotating and retrieving images based on a training set of images. Wu et al. [25] learned representations with random walks on a click graph that contains texts and images as vertices. Dong et al. [6] contributed Word2VisualVec, a deep neural network architecture that learns to predict a deep visual encoding of textual input based on sentence vectorization and a multi-layer perceptron. Qi et al. [17] proposed a deep multimodal learning method (DML) for two different media types. We also propose a cross-media retrieval method via a deep architecture, to which not only text but also other information can be added.

2.3 Convolution Neural Network

Current approaches to image recognition make essential use of machine learning methods. Krizhevsky et al. [12] trained a large, deep convolutional neural network to classify the 1.2 million images in the ImageNet LSVRC-2010 contest. They achieved top-1 and top-5 error rates of 37.5% and 17.0%, compared with 47.1% and 28.2% for the previous state of the art. Since then, deep convolutional neural networks have dominated the field of image recognition. Simonyan and Zisserman [20] proposed the VGGNet model, which achieves high accuracy. Szegedy et al. [21] proposed GoogLeNet, which requires

fewer parameters and less computation. In this work, we use a pre-trained CNN model on both the training images and the input image. Specifically, we take the hidden layer before the softmax layer as the representation of an image, since it is generally considered that this hidden layer contains more information.

2.4 Network Embedding

Various methods of network embedding have been proposed in the machine learning literature [23, 4, 1]. The idea behind network embedding techniques is to compress the contextual or surrounding information of an object into its vector representation. In the field of natural language processing, these techniques are usually referred to as word embedding, used for language modeling and feature learning to map words or phrases into a low-dimensional vector space. In social network analysis, the same idea has been applied to learn representations of the vertices of a social graph that preserve the graph structure for further tasks such as community detection. Most work first constructs an information graph whose vertices can be texts, images, or any other data. After construction, edge sampling, random walks, or other techniques are applied to embed the information graph, so that vertices with similar information are encoded into similar vector representations [22, 16, 8, 3]. Afterwards, the representations of the vertices can be used for information retrieval via similarity calculations. The learned representations are also very useful in a variety of applications such as visualization [14], node classification [2, 11], link prediction [13], and recommendation [26]. Flexibility and the ease of incorporating new information are further advantages of network embedding algorithms.


Chapter 3
Methodology

We propose a novel method, called "image2song," which allows the user to input an image and retrieve the corresponding songs. The framework of our approach is illustrated in Figure 3.1. The proposed method is composed of three modules. First, the input image and the training images are transformed into low-dimensional representations by the CNN module. Second, the network embedding module learns a representation for each entity from the information graph. Last, our method retrieves the corresponding entities through the similarity calculation module. Below, we describe each module of our framework in detail.

[Figure 3.1: A framework of image2song]

3.1 Terminology

In this section, the terminology and definitions of the quantities used in this paper are introduced. Denote by m the dimension of the image representation obtained from the pre-trained CNN model and by n the dimension of the representation learned by the network embedding algorithm. For the collected dataset, denote by K the set of keywords, which plays an important role in the cross-media connection. D = {I, S} is the set of multimedia data, where I are images and S are songs. The collected dataset is viewed as an information graph G = (V, E), where V = {D, K} are the vertices and E ⊆ (D × K) are the undirected edges. The weight w_{x,y} of an edge e_{x,y} = (v_x, v_y) is assigned a larger value if the edge is more important.

For the representations, Υ is a CNN representation and Φ is a heterogeneous representation. Every vertex v ∈ V has a heterogeneous representation; additionally, every vertex v ∈ I has a CNN representation. For clarity, we list the terminology and definitions in Table 3.1.

Table 3.1: Terminology and definitions

m: the dimension of the CNN representation.
n: the dimension of the heterogeneous representation.
K: the set of keywords.
I: the set of images.
i: i ∈ I, an image.
S: the set of songs.
D: D = {I, S}, the set of multimedia data.
V: V = {D, K}, the vertices of the information graph.
v: v ∈ V, a vertex of the information graph.
v_i, v_s, v_k: an image, song, or keyword vertex.
E: E ⊆ (D × K), the undirected edges of the information graph.
G: G = (V, E), the information graph.
e_{x,y}: e_{x,y} = (v_x, v_y), an edge between v_x and v_y.
w_{x,y}: w_{x,y} > 0, the weight of edge e_{x,y}.
Φ(v): the heterogeneous representation of v, learned from the information graph.
Υ(i): the CNN representation of an image i.

3.2 Convolutional Neural Network Module

In the past several years, many famous network architectures have been proposed, for example AlexNet, GoogLeNet, VGGNet, and ResNet. Network architecture is important in the design of deep convolutional neural networks (CNNs). Some trends emerged during this evolution, such as smaller convolutional kernel sizes and deeper network structures; these changes have evidently improved image recognition performance. We choose the VGG-19 network architecture [20] for our CNN module because it achieved about 7% top-5 error on the ILSVRC-2012 dataset (which was used for the ILSVRC 2012–2014 challenges). It is a 19-layer network used by the Visual Geometry Group (VGG) in the ILSVRC-2014 competition.

[Figure 3.2: A deep CNN architecture of VGG-19]

Figure 3.2 illustrates the main architecture of VGG-19. There are 16 convolutional layers and 3 fully connected (FC) layers in VGG-19, and it uses 3×3 kernels with stride 1 in all convolutional layers. In this work, we implement VGG-19 using Keras (https://github.com/fchollet/keras), an open-source neural network library written in Python. To obtain the representations, we can extract features from an arbitrary intermediate layer, for example the second FC layer. The output size of the chosen layer determines the dimension m of the CNN representation (m = 4096 if the second FC layer is chosen). Keras also provides the pre-trained weights file (https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg19_weights_tf_dim_ordering_tf_kernels.h5), trained for 1000-class object recognition on about 150,000 images of 224 × 224 pixels.
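As an illustration of this feature-extraction step, the following is a minimal Keras sketch that cuts VGG-19 at the second FC layer ("fc2") to produce a 4096-dimensional representation. The image path is a placeholder, and preprocessing details beyond resizing may differ from the thesis's actual setup.

```python
# Minimal sketch of extracting the CNN representation Upsilon(i) with Keras;
# "photo.jpg" is a placeholder path.
import numpy as np
from keras.applications.vgg19 import VGG19, preprocess_input
from keras.preprocessing import image
from keras.models import Model

base = VGG19(weights="imagenet")  # load the pre-trained weights
# Cut the network at the second fully connected layer ("fc2", 4096-dim),
# i.e., the hidden layer just before the softmax classifier.
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def cnn_representation(path):
    """Return the 4096-dim representation of one image file."""
    img = image.load_img(path, target_size=(224, 224))  # resize to VGG input
    x = image.img_to_array(img)[np.newaxis, ...]        # shape (1, 224, 224, 3)
    return extractor.predict(preprocess_input(x))[0]    # shape (4096,)

vec = cnn_representation("photo.jpg")
```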

During the training phase, the pre-trained weights are loaded into the model, and the training images are resized to the appropriate input size. Each image i receives a CNN representation Υ(i). During the application phase, the model with the same weights also generates a CNN representation for the input image.

3.3 Network Embedding Module

The idea behind network embedding techniques is to compress the contextual or surrounding information of an object into its vector representation. To implement this idea, building an information graph is necessary. As mentioned before, the correlation between songs and images is difficult to obtain directly. Therefore, we build a heterogeneous bipartite graph with the multimedia data D and the keywords K, shown in Figure 3.3, with the following connections:

1. The Song-Keyword Connection: A song is connected to a keyword if the keyword appears in its lyrics, title, or tags. The weight indicates the relevance between the song and the keyword.

2. The Image-Keyword Connection: For each keyword, we collect as many images as possible to build the connections. The weight C is the same for every such connection.

Each object is treated as an individual vertex v of the graph; the module then iteratively updates each vertex representation Φ according to its proximity to the sampled vertices in the graph. The update procedure can be summarized as minimizing the following objective over the set of sampled pairs (v_x, v_y):

O = -\sum_{e_{x,y} \in E} w_{x,y} \log p(v_y \mid v_x)    (3.1)

where p(v_y | v_x) is defined via the softmax function as follows:

p(v_y \mid v_x) = \frac{\exp(\Phi(v_y)^\top \Phi(v_x))}{\sum_{v} \exp(\Phi(v)^\top \Phi(v_x))}    (3.2)

where Φ(v) denotes the heterogeneous representation of the vertex v. The dimension of the heterogeneous representation is a setting fixed before training. In Figure 3.3, only pairs with a direct connection are sampled, so the objective function of the embedding process can be rewritten from Equation 3.1 as follows:

O_v = \begin{cases} -\sum_{(v_s, v_k) \in E} w_{v_s, v_k} \log p(v_s \mid v_k) & \text{if } v \in S \\ -\sum_{(v_i, v_k) \in E} C \log p(v_i \mid v_k) & \text{if } v \in I \end{cases}    (3.3)

After training, the learned representations of images Φ(v_i) and songs Φ(v_s) can be used in the similarity calculation module.
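The thesis adopts existing algorithms such as DeepWalk or LINE for this step; purely as an illustration, the sketch below minimizes an approximation of Equation 3.1 with LINE-style edge sampling and negative sampling (a stand-in for the full softmax of Equation 3.2). The toy edge list and all hyperparameters are assumptions.

```python
# Illustrative LINE-style embedding with edge sampling and negative sampling;
# a simplified stand-in for DeepWalk/LINE, not the thesis's actual code.
import numpy as np

rng = np.random.default_rng(0)
# Toy bipartite edges (vertex, keyword vertex, weight w_{x,y}).
edges = [("song:a", "kw:雪", 3.0), ("song:b", "kw:天空", 1.0),
         ("img:1", "kw:雪", 1.0), ("img:2", "kw:天空", 1.0)]
vertices = sorted({v for u, k, _ in edges for v in (u, k)})
n = 128  # dimension of the heterogeneous representation (assumed)
phi = {v: rng.normal(scale=0.1, size=n) for v in vertices}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, num_neg, iters = 0.025, 5, 10_000  # assumed hyperparameters
for _ in range(iters):
    u, k, w = edges[rng.integers(len(edges))]  # sample one edge
    # Positive pair: increase the inner product of Phi(u) and Phi(k),
    # scaled by the edge weight.
    g = w * (1.0 - sigmoid(phi[u] @ phi[k]))
    du, dk = lr * g * phi[k], lr * g * phi[u]
    # Negative samples: decrease similarity to randomly drawn vertices.
    for _ in range(num_neg):
        z = vertices[rng.integers(len(vertices))]
        gz = sigmoid(phi[u] @ phi[z])
        du -= lr * gz * phi[z]
        phi[z] -= lr * gz * phi[u]
    phi[u] += du
    phi[k] += dk
```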

3.4 Similarity Calculation Module

In the similarity calculation module, we apply the Euclidean distance and the cosine distance as our criteria:

EuclideanDistance(R_1, R_2) = \sqrt{\sum_{dim=1}^{d} (R_1^{dim} - R_2^{dim})^2}    (3.4)

CosineDistance(R_1, R_2) = \frac{\sum_{dim=1}^{d} R_1^{dim} R_2^{dim}}{\sqrt{\sum_{dim=1}^{d} (R_1^{dim})^2} \sqrt{\sum_{dim=1}^{d} (R_2^{dim})^2}}    (3.5)

where R_1 and R_2 are representations of the same type (Φ(v) or Υ(i)) and d is the dimension of the representation. There are three stages in the similarity calculation module:

1. Relevant Image Searching: After training, each image has two representations (Υ and Φ). In this stage, the similarity between Υ(input) and Υ(i) is calculated by Equations 3.4 and 3.5 with d = m, the dimension of the CNN representation, where input is the input image and i is a training image. The input image is compared

with all other training images, and the top N images with the shortest distances are selected.

2. Representation Mapping: We map each CNN representation selected in stage 1 to its heterogeneous representation (Υ(i) ⇒ Φ(i)) for the next stage.

3. Relevant Song Searching: The similarity between Φ(i) and Φ(s) is likewise calculated by Equations 3.4 and 3.5 with d = n, the dimension of the heterogeneous representation, where s is a training song. The top M songs with the shortest distances are selected.

Finally, N × M songs are retrieved as the most relevant songs to the input image.

[Figure 3.4: An example of CNN vector space]
[Figure 3.5: An example of heterogeneous vector space]

In the CNN vector space, shown in Figure 3.4, the input image is drawn in red. After the image comparison stage, the green and blue images are selected; in other words, these two images are the most similar to the input image. Both the green and blue images also have positions in the heterogeneous vector space, shown in Figure 3.5, and the nearby yellow songs are considered highly relevant to the input image. The retrieval system finally returns those songs as the output to the user.
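Putting the three stages together, the following numpy sketch shows one way the retrieval could be wired up. It assumes dictionaries cnn_repr (Υ) and het_repr (Φ) produced by the previous modules; the function names and the default values of N and M are placeholders, not the thesis's settings.

```python
# Sketch of the similarity calculation module (Equations 3.4 and 3.5).
import numpy as np

def euclidean(r1, r2):
    return np.sqrt(np.sum((r1 - r2) ** 2))  # Equation 3.4

def cosine(r1, r2):
    # Equation 3.5; larger values mean more similar, so a cosine-based
    # ranking would sort in descending order instead.
    return (r1 @ r2) / (np.linalg.norm(r1) * np.linalg.norm(r2))

def retrieve_songs(query_vec, cnn_repr, het_repr, song_ids, N=5, M=2):
    # Stage 1: relevant image searching in the CNN space (d = m).
    images = sorted(cnn_repr, key=lambda i: euclidean(query_vec, cnn_repr[i]))
    results = []
    for img in images[:N]:
        # Stage 2: representation mapping, Upsilon(img) => Phi(img).
        phi_img = het_repr[img]
        # Stage 3: relevant song searching in the heterogeneous space (d = n).
        top = sorted(song_ids,
                     key=lambda s: euclidean(phi_img, het_repr[s]))[:M]
        results.extend(top)
    return results  # the N x M most relevant candidate songs
```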


Chapter 4
Experimental Results

4.1 The Implementation of Web-based Retrieval

For the experiments, we implemented the proposed method, "image2song," and built an online web-based retrieval system that allows users to upload their images and receive recommended music. A feedback system is also designed to calculate the accuracy for evaluation. Figure 4.1 shows the actual screen of our website.

[Figure 4.1: A demo of the online web-based retrieval system]

4.2 Experimental Settings

4.2.1 Dataset

In this work, we focus on cross-media retrieval, especially between images and songs. Unfortunately, no existing dataset gathers the correlations between such cross-media entities. Therefore, we generated our own dataset by collecting images based on keywords extracted from Chinese song titles and lyrics. We use Jieba (https://github.com/fxsjy/jieba) for word segmentation and keep the nouns, then extract keywords by computing term frequency (TF); a minimal sketch of this extraction step appears below, after the evaluation metrics. Here are some of the 72 keywords:

傘, 光芒, 咖啡, 地球, 夜色, 大地, 天使, 天空, 宇宙, 山, 影子, 心, 手心, 日落, 星光, 書, 月亮, 流星, 流水, 浪, 海, 火, 煙, 燈, 玫瑰, 眼, 笑容, 翅膀, 背影, 船, 花朵, 茶, 蝴蝶, 行李, 街, 身影, 雪, 雲, 霧, 鳥, 黑夜

For each keyword, we collect as many images as possible from well-known search engines. With the chosen keywords used as the low-level feature, the correlation between images and songs can be built intuitively. Table 4.1 gives statistics of the dataset.

Table 4.1: The statistics of our dataset

Total Keywords: 72
Total Songs: 62,316
Training Images: 33,459

4.2.2 Evaluation Metrics

We employ hit rate and precision at k (P@k) to evaluate performance:

HitRate = \frac{Count_{hit}}{Count_{all}}    (4.1)

where the numerator counts correct answers and the denominator counts all data. In our information retrieval setting, recall is no longer a meaningful metric: each query has many relevant songs, and users will not be interested in all of them. Precision at k (P@k) is a useful metric:

P(u, o) = \frac{\sum_{p=1}^{k} r_{u,o(p)}}{k}    (4.2)

where k is the cut-off, o(p) = i means item i is ranked at position p in the ordered list o, and r_{u,i} indicates whether item i is relevant to user u (1 = yes, 0 = no).
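Returning to the dataset construction of Section 4.2.1, here is a minimal sketch of the keyword-extraction step with Jieba, under the assumption (one plausible reading of the text) that the final keyword set consists of the highest-TF nouns. The example lyrics and the token-length filter are illustrative.

```python
# Sketch of keyword extraction (Section 4.2.1): segment Chinese titles and
# lyrics with Jieba, keep nouns, and rank them by term frequency (TF).
from collections import Counter
import jieba.posseg as pseg  # segmentation with part-of-speech tags

songs = [("歌名一", "天空下著雪 我在街上走 ..."),   # hypothetical data
         ("歌名二", "海浪打上岸 月亮掛天空 ...")]

tf = Counter()
for title, lyrics in songs:
    for tok in pseg.cut(title + " " + lyrics):
        # POS tags starting with "n" are nouns; the length filter is an
        # illustrative way to drop single-character tokens.
        if tok.flag.startswith("n") and len(tok.word) > 1:
            tf[tok.word] += 1

# Assumption: the 72 most frequent nouns form the keyword set K.
keywords = [w for w, _ in tf.most_common(72)]
```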

4.2.3 Baseline

There are no established evaluation criteria for the retrieval task between images and songs. In music information retrieval, popular songs generally satisfy users, so we first collect a set of popular songs and then sample the needed songs for the evaluation tasks. The result is regarded as the baseline and compared with our method.

4.3 Experimental Results

In our experiments, we separate the evaluation into three tasks: image retrieval, heterogeneous entity retrieval, and user feedback.

1. Image Retrieval: For this task, a labeled dataset is created. We first collect images that are not in the training dataset and then label them with the corresponding context; for example, Figure 4.2 is labeled as angel, sky, and cloud.

[Figure 4.2: An image labeled as angel, sky, and cloud]

There are one hundred images in the labeled dataset. We transform these images into CNN representations using the pre-trained model and retrieve the nearest k training images. A result is marked as "relevant" if the two images are in the same class. The P@k scores are shown in Table 4.2; they show that input images can be discriminated well and similar images retrieved.

2. Heterogeneous Entity Retrieval: The heterogeneous representation is learned by the network embedding module. For this task, we pick 5 images and retrieve 2 songs for each image using the representation. We collect n concept words from ConceptNet for each keyword; a result is marked as a "hit" if a concept word appears in the song's lyrics. According to Table 4.3, the performance of our method is much higher than the baseline. In Table 4.4, we remove the chosen 72 keywords

and recalculate the hit rate; the results suggest that our retrieval method can encode concept information into the heterogeneous representation.

Table 4.2: Experimental results of image retrieval (P@k)

k     Euclidean   Cosine
5     0.59        0.638
10    0.562       0.591
20    0.51        0.552
50    0.445       0.506

Table 4.3: Experimental results of heterogeneous entity retrieval (hit rate)

Concept Words n   Our Method (Euclidean)   Our Method (Cosine)   Baseline
5                 0.913                    0.872                 0.124
10                0.918                    0.878                 0.157
20                0.923                    0.886                 0.239
50                0.943                    0.911                 0.356

Table 4.4: Experimental results without the chosen keywords (hit rate)

Concept Words n   Our Method (Euclidean)   Our Method (Cosine)   Baseline
5                 0.163                    0.176                 0.078
10                0.26                     0.280                 0.113
20                0.349                    0.369                 0.19
50                0.526                    0.541                 0.342

3. User Feedback: We designed a questionnaire to collect user feedback; 17 users took part in the test. For each image, 10 songs are retrieved by our method and the others are sampled from the set of popular songs, and the users pick out the songs that match the image. Figures 4.3 to 4.7 are the input images, which are not in the training data, and Tables 4.5 to 4.9 give the questionnaire results. For each image, the performance of our method is better than the baseline: the average P@10 of our method is 0.66, versus 0.47 for the baseline.

[Figure 4.3: Questionnaire 1: snow forest]
Table 4.5: Result of questionnaire 1
Method        P@10
Our Method    0.786
Baseline      0.414

[Figure 4.4: Questionnaire 2: sky with cloud]
Table 4.6: Result of questionnaire 2
Method        P@10
Our Method    0.643
Baseline      0.514

[Figure 4.5: Questionnaire 3: coffee]
Table 4.7: Result of questionnaire 3
Method        P@10
Our Method    0.671
Baseline      0.371

[Figure 4.6: Questionnaire 4: ocean]
Table 4.8: Result of questionnaire 4
Method        P@10
Our Method    0.729
Baseline      0.586

[Figure 4.7: Questionnaire 5: fog]
Table 4.9: Result of questionnaire 5
Method        P@10
Our Method    0.486
Baseline      0.457

4.4 Case Study

As shown in Figure 4.8, for each input image the proposed method first retrieves the top five most relevant images in the pre-trained image dataset and then outputs the top ten recommended songs based on the distances between the learned representations of the images and songs. Below we showcase two interesting cases to demonstrate the ability of the proposed framework to find relevant songs for a given image.

[Figure 4.8: Case study]

1. Flower: In the case of "flower," the image of a cluster of flowers is used as the input (see the left-hand example in Figure 4.8). Observe that the retrieved images are similar to the input image. Moreover, the lyrics of the recommended songs not only contain the keyword "flower" but also related concept words (available at http://conceptnet.io) such as "rose" and

"blossom."

2. Snow: In the case of "snow," the image of a snow scene is treated as the input (see the right-hand example in Figure 4.8). Although the retrieved images are similar to the input image, unlike in the first case, few of the lyrics and titles of the recommended songs involve the keyword "snow" directly.


Chapter 5
Conclusions

This paper presents an image-based music information retrieval system that bridges the heterogeneity gap between music and image information. Unlike traditional methods, it lets a user find their favorite music through images. We apply a convolutional neural network (CNN) to process images and use a network embedding technique for heterogeneous retrieval. An online prototype is also built for demonstration. The given examples show not only the novelty and potential of the proposed approach but also its ability to find relevant songs for a given image. Our experimental results show that each module of the proposed method is effective. For the CNN module, we evaluate the learned CNN representation; the method achieves 0.59 in terms of P@5. For the network embedding module, the result is two times better than the baseline. User feedback also plays an important role in a retrieval system; here the proposed method achieves an average P@10 of 0.66, compared with 0.47 for the popular-song baseline.

For future work, we believe that using labeled data can bring positive effects. The heterogeneous network in this task is built from keywords only; additional links created from labeled data would enrich it. It would also be interesting to include different types of multimedia data in the proposed framework for further investigation. Each entity in the network could then encode more information, which should help make the retrieved results more effective and diverse.


Bibliography

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pages 585–591, 2002.
[2] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social Network Data Analytics, pages 115–148. Springer, 2011.
[3] S. Cao, W. Lu, and Q. Xu. Deep neural networks for learning graph representations. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[4] T. F. Cox and M. A. Cox. Multidimensional scaling. CRC Press, 2000.
[5] S. Dieleman and B. Schrauwen. End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6964–6968. IEEE, 2014.
[6] J. Dong, X. Li, and C. G. Snoek. Word2VisualVec: Cross-media retrieval by visual feature prediction. arXiv preprint arXiv:1604.06838, 2016.
[7] J. Foote. An overview of audio information retrieval. Multimedia Systems, 7(1):2–10, 1999.
[8] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.
[9] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 119–126. ACM, 2003.
[10] M. Kaminskas and F. Ricci. Contextual music information retrieval and recommendation: State of the art and challenges. Computer Science Review, 6(2):89–119, 2012.
[11] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[13] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7):1019–1031, 2007.
[14] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[15] A. Ogino and Y. Yamashita. Emotion-based music information retrieval using lyrics. In IFIP International Conference on Computer Information Systems and Industrial Management, pages 613–622. Springer, 2015.
[16] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
[17] J. Qi, X. Huang, and Y. Peng. Cross-media retrieval by multimodal representation fusion with deep networks. In International Forum of Digital TV and Wireless Multimedia Communication, pages 218–227. Springer, 2016.
[18] F. Raposo, R. Ribeiro, and D. M. de Matos. Using generic summarization to improve music information retrieval tasks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(6):1119–1128, 2016.
[19] S. Rüger. Multimedia information retrieval. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1):1–171, 2009.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[22] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. ACM, 2015.
[23] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[24] R. Typke, F. Wiering, and R. C. Veltkamp. A survey of music information retrieval systems. In Proc. 6th International Conference on Music Information Retrieval, pages 153–160. Queen Mary, University of London, 2005.
[25] F. Wu, X. Lu, J. Song, S. Yan, Z. M. Zhang, Y. Rui, and Y. Zhuang. Learning of multimodal representations with random walks on the click graph. IEEE Transactions on Image Processing, 25(2):630–642, 2016.
[26] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 283–292. ACM, 2014.

