
This section presents challenges for (1) image location identification, (2) image annotation, and (3) event discovery.

1.1.1 Image Location Identification

Image location identification aims to find where a given photo was taken.

Many studies of image location identification focus on landmark identification.

Given a photo, previous work on landmark identification typically matches it against photos of landmarks using the features and/or tags of these photos. Afterwards, the text tags and geo-tags of the landmark photo that is most similar to the given photo indicate, respectively, which landmark the photo shows and where that landmark is located (and thus where the photo was probably taken).
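The matching step described above can be sketched as a nearest-neighbor lookup. This is only a minimal illustration, not the method of any cited work: the feature vectors, the use of cosine similarity, the dictionary keys, and the names `cosine_similarity` and `identify_landmark` are all assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def identify_landmark(query_features, reference_photos):
    """Return the text tag and geo-tag of the most similar reference photo."""
    best = max(
        reference_photos,
        key=lambda photo: cosine_similarity(query_features, photo["features"]),
    )
    return best["text_tag"], best["geo_tag"]


# Hypothetical reference photos of two landmarks (all values are made up).
reference_photos = [
    {"features": [0.9, 0.1, 0.0], "text_tag": "landmark A", "geo_tag": (40.0, -74.0)},
    {"features": [0.1, 0.8, 0.3], "text_tag": "landmark B", "geo_tag": (41.0, -73.0)},
]

name, location = identify_landmark([0.8, 0.2, 0.1], reference_photos)
print(name, location)
```

The text tag answers which landmark the photo shows, and the geo-tag gives an estimate of where the photo was taken.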

However, people may take a photo of anything, anywhere, at any moment. In contrast to landmark identification, this dissertation therefore focuses on city-view image location identification.

City-view image location identification is challenging mainly because of four conditions: (1) a photo may cover only a small part of the target object; (2) photos may be taken under different operating conditions, such as weather conditions and shot sizes; (3) photos of a building may be taken indoors or outdoors; and (4) there may be a number of buildings in very close proximity. See Figure 1.2 for an illustration.¹ Among these four, conditions (1)–(3) may degrade the performance of existing techniques based on the visual features of photos, and conditions (3)–(4) may degrade the performance of those based on geo-tags. In particular, conditions (3)–(4) are more critical to city-view image location identification than to traditional landmark identification. According to a study of the positional accuracy of mobile phones, the locations provided by a mobile phone may have a root mean square error (RMSE) of 12.5 or 21.6 meters, depending on whether the phone is used outdoors or indoors [55]. Sometimes, an error of 47.9 meters may occur when a phone is used indoors [55]. Thus, previous works such as [28] that directly integrate the visual features and geo-tags of photos may be limited in distinguishing one building within a city from the others if the given photos were taken indoors and/or the buildings are in close proximity.

¹ All of the photos shown in this figure were obtained from Foursquare [2].

Figure 1.2: Illustration of the challenges of city-view image location identification. (a) Photos that cover a small part of a building. (b) Photos taken in the morning and in the evening, respectively. (c) Photos taken indoors and outdoors, respectively. (d) Photos that show several stores in close proximity. These conditions may degrade the performance of existing techniques that are based on visual features or geo-tags of photos.
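The effect of positional error on geo-tag-based identification can be illustrated with a small sketch. It makes several simplifying assumptions that are not from the source: the reported error figures are treated as a search radius around the reported position, the two RMSE values are read in the order given (12.5 m outdoors, 21.6 m indoors), and the store names and planar coordinates are hypothetical.

```python
import math

# Error figures quoted in the text [55], in meters.
RMSE_OUTDOORS_M = 12.5
RMSE_INDOORS_M = 21.6
LARGE_INDOOR_ERROR_M = 47.9


def candidates_within(reported_xy, buildings, radius_m):
    """Names of buildings whose position lies within radius_m of the reported position."""
    x, y = reported_xy
    return sorted(
        name
        for name, (bx, by) in buildings.items()
        if math.hypot(bx - x, by - y) <= radius_m
    )


# Hypothetical stores along one street, in planar coordinates (meters).
buildings = {"store A": (0.0, 0.0), "store B": (20.0, 0.0), "store C": (40.0, 0.0)}
reported = (5.0, 0.0)  # position reported by the phone

outdoors = candidates_within(reported, buildings, RMSE_OUTDOORS_M)
indoors = candidates_within(reported, buildings, RMSE_INDOORS_M)
worst_case = candidates_within(reported, buildings, LARGE_INDOOR_ERROR_M)
print(outdoors, indoors, worst_case)
```

In this toy setting the geo-tag alone narrows the answer to one store only under the outdoor error. With the larger indoor errors, several neighboring stores remain plausible.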

1.1.2 Image Annotation

Image annotation aims to add proper tags to the images of a given image dataset. It is challenging mainly for four reasons: (1) relatively few images in a general image dataset have tags, i.e., the resource is strictly limited; (2) an image dataset may contain two or more similar images of the same landscape but with different tags arising from different interpretations; (3) the number of images continues to grow rapidly, which complicates the learning process; and (4) traditional supervised approaches may not be practical because it is hard to build a model for every tag.

As a solution, semi-supervised learning (SSL) is widely used for image annotation. SSL is a learning technique that exploits a large number of untagged images in the presence of a small number of tagged images. Among the SSL approaches, graph-based approaches are quite popular because of their higher efficiency compared with other approaches [60]. A typical graph-based approach models both tagged and untagged images as vertices and then adds a weighted edge between each pair of vertices, where the weight of an edge is the similarity between its two end vertices (i.e., images). On this graph, a process called label propagation is then run to add tags to images (i.e., to label images).
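The graph-based procedure described above can be sketched as follows. This is a minimal illustration of one common scheme (iterative propagation with the tagged vertices clamped), not necessarily the algorithm used in the cited works or in this dissertation. The function name `propagate_labels`, the similarity matrix, and the seed tags are hypothetical, and each image ends up with a single tag.

```python
def propagate_labels(weights, seed_labels, num_tags, iterations=50):
    """Propagate tags over a similarity graph.

    weights[v][u] is the similarity between images v and u;
    seed_labels maps the index of each tagged image to its tag index.
    """
    n = len(weights)
    # One score per (vertex, tag); tagged vertices start with score 1 for their tag.
    scores = [[0.0] * num_tags for _ in range(n)]
    for v, tag in seed_labels.items():
        scores[v][tag] = 1.0

    for _ in range(iterations):
        updated = []
        for v in range(n):
            total = sum(weights[v][u] for u in range(n) if u != v)
            if v in seed_labels or total == 0.0:
                # Tagged vertices stay clamped; isolated vertices keep their scores.
                updated.append(scores[v][:])
                continue
            # Untagged vertex: similarity-weighted average of its neighbors' scores.
            updated.append([
                sum(weights[v][u] * scores[u][t] for u in range(n) if u != v) / total
                for t in range(num_tags)
            ])
        scores = updated

    # Assign each image the tag with the highest score.
    return [max(range(num_tags), key=lambda t: scores[v][t]) for v in range(n)]


# Hypothetical graph of four images: image 0 has tag 0, image 3 has tag 1,
# and images 1 and 2 are untagged.
weights = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
]
labels = propagate_labels(weights, seed_labels={0: 0, 3: 1}, num_tags=2)
print(labels)
```

Because an edge is added between each pair of vertices, every update in this sketch touches all n vertices, so one iteration costs on the order of n² operations per tag.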

Traditional label propagation is single-label propagation, in which only a single tag is considered for each image in order to reduce the complexity of propagation.

Recently, annotation with multiple tags has begun to attract research attention [51]. Moreover, the sizes of image datasets continue to increase. It is therefore desirable to develop a scalable approach to multi-label propagation.

1.1.3 Event Discovery

Event discovery aims to discover events of interest. Generally, event discovery can be achieved by observing how the volume of data in a given media stream changes over a period of time. Intuitively, every media stream carries unique information and has its own features. As a case study, our preliminary work [27] conducted an experiment that compared the similarity of the top-100 popular places across different media streams, including Twitter, Instagram, TripAdvisor, and an NYC open dataset of subway traffic. The result is shown in Table 1.1. As can be seen, the popular places of the media streams are diverse. Unifying different media streams can achieve better diversity, and even better performance, than using one media stream alone. However, combining different media streams is also what makes event discovery on cross-domain media challenging, because the meanings of their data, even of purely numerical data, may be quite different.

Table 1.1: Pairwise comparison of the percentages of overlapping attractions, i.e., similarity, among Twitter, Instagram, TripAdvisor, and subway traffic in our preliminary study [27]. As can be seen, their popular places differ to some extent. It is desirable to unify cross-domain media for high diversity.

              Twitter   Instagram   TripAdvisor   Subway
Twitter       1.00      0.35        0.08          0.22
Instagram     0.35      1.00        0.18          0.43
TripAdvisor   0.08      0.18        1.00          0.32
Subway        0.22      0.43        0.32          1.00
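The similarity in Table 1.1 is described as the share of overlapping attractions between two lists of popular places. One plausible way to compute such a value is sketched below. The exact definition used in [27] is not given here, so dividing the number of shared places by the list size is an assumption, and the function name and place names are hypothetical.

```python
def overlap_similarity(top_places_a, top_places_b):
    """Fraction of places that appear in both top-k lists."""
    set_a, set_b = set(top_places_a), set(top_places_b)
    k = max(len(set_a), len(set_b))
    if k == 0:
        return 0.0
    return len(set_a & set_b) / k


# Hypothetical top-4 lists from two media streams (the study used top-100 lists).
stream_a = ["place 1", "place 2", "place 3", "place 4"]
stream_b = ["place 3", "place 4", "place 5", "place 6"]

similarity = overlap_similarity(stream_a, stream_b)
print(similarity)
```

Under this definition the measure is symmetric and a list compared with itself scores 1.00, which matches the symmetric layout and the diagonal of Table 1.1.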
