National Chengchi University
Figure 4.7: Questionnaire 5 (fog)
Table 4.9: Result of questionnaire 5
Method        P@10
Our Method    0.486
Baseline      0.457
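The P@10 values in the table can be read with the standard definition of precision at k, i.e., the fraction of the top k recommended songs that are judged relevant. The following is a minimal sketch of that computation, not the evaluation script used in this work. It assumes that each respondent marks which recommended songs they consider relevant and that the reported score is the mean over respondents; the function names `precision_at_k` and `mean_precision_at_k` are hypothetical.

```python
def precision_at_k(ranked_items, relevant_items, k=10):
    """Fraction of the top-k ranked items that appear in relevant_items."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_items)
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    # Divide by k (not by the list length), following the usual P@k convention.
    return hits / k


def mean_precision_at_k(rankings, judgments, k=10):
    """Average P@k over several respondents (assumed aggregation scheme)."""
    if not rankings:
        raise ValueError("at least one ranking is required")
    scores = [precision_at_k(r, j, k) for r, j in zip(rankings, judgments)]
    return sum(scores) / len(scores)
```

For example, if a respondent marks five of the ten recommended songs as relevant, `precision_at_k` returns 0.5 for that respondent.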
4.4 Case Study
As shown in Figure 4.8, for each input image, the proposed method first retrieves the five most relevant images from the pre-trained image dataset, and then outputs the top ten recommended songs based on the distances between the learned representations of the images and songs. Below we showcase two interesting cases to demonstrate the ability of the proposed framework to find relevant songs for a given image.
Figure 4.8: Case study
1. Flower: In the case of “flower,” an image of a cluster of flowers is used as the input (see the left-hand example in Figure 4.8). Observe that the retrieved images are similar to the input image. Moreover, the lyrics of the recommended songs contain not only the keyword “flower” but also related concept words² such as “rose” and “blossom.”
² The concept words are available at http://conceptnet.io.
2. Snow: In the case of “snow,” an image of a snow scene is used as the input (see the right-hand example in Figure 4.8). Although the retrieved images are similar to the input image, unlike in the first case, few of the lyrics or titles of the recommended songs contain the keyword “snow” directly.
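The two-stage retrieval described at the start of this section (five nearest images, then ten nearest songs) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this work: Euclidean distance is assumed for both stages, the query is represented by the mean embedding of the retrieved images (the text does not specify how the five images are combined), and all names (`top_k_nearest`, `recommend_songs`, and the array arguments) are hypothetical.

```python
import numpy as np


def top_k_nearest(query, candidates, k):
    """Indices of the k rows of `candidates` closest to `query` (Euclidean)."""
    distances = np.linalg.norm(candidates - query, axis=1)
    return np.argsort(distances)[:k]


def recommend_songs(query_feature, image_features, image_embeddings,
                    song_embeddings, n_images=5, n_songs=10):
    """Return (indices of retrieved images, indices of recommended songs).

    image_features:   CNN features of the dataset images, shape (N, d_cnn)
    image_embeddings: network embeddings of the same images, shape (N, d_emb)
    song_embeddings:  network embeddings of the songs, shape (M, d_emb)
    """
    # Stage 1: find the dataset images most similar to the input image.
    nearest_images = top_k_nearest(query_feature, image_features, n_images)
    # Stage 2 (assumption): summarize the retrieved images by their mean
    # embedding, then rank songs by distance to that point.
    query_embedding = image_embeddings[nearest_images].mean(axis=0)
    nearest_songs = top_k_nearest(query_embedding, song_embeddings, n_songs)
    return nearest_images, nearest_songs
```

Averaging the embeddings is one simple way to combine several retrieved images; ranking songs by their minimum distance to any retrieved image would be an equally plausible alternative.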
This paper presents an image-based music information retrieval system that bridges the heterogeneity gap between music and image information. Unlike traditional methods, it lets users find music they like through images. We apply a convolutional neural network (CNN) to process images and use a network embedding technique for heterogeneous retrieval. An online prototype is also built for demonstration. The given examples show not only the novelty and potential of the proposed approach but also its ability to find relevant songs for a given image. Our experimental results show that each module in the proposed method is effective. For the CNN module, we evaluate the learned CNN representation; the method achieves 0.59 in terms of P@5. For the network embedding module, the result is two times better than the baseline. User feedback also plays an important role in the retrieval system; the proposed method achieves an average P@10 of 0.66, compared with 0.47 for recommending popular songs.
For future work, we believe that using labeled data could have a positive effect. The heterogeneous network in this task is built from keywords only; the additional links created by labeled data would enrich it. It would also be interesting to include other types of multimedia data in the proposed framework for further investigation. Each entity in the network could then encode more information, which should make the retrieved results more effective and diverse.