
2. Literature Review


results of network analysis for information diffusion on Twitter by using users’ ongoing interactions as denoted by "@username", and they build the diffusion network and develop models for three dimensions of diffusion networks in Twitter [44]. Spasojevic et al. examine the flow of messages in the entire network and recommend best times for a user to post on social networks, and they collect users’ friend graphs on Facebook [2]. These studies are also based on the structure of social networks.

Social network analysis (SNA) is not a formal theory in sociology but rather a strategy for investigating social structures [48]. It is an idea that can be applied in many fields. SNA is the process of investigating social structures through the use of network and graph theories [48]. Examples of social structures commonly visualized through social network analysis include social media networks, meme spread, friendship and acquaintance networks, and collaboration graphs [49]. On the other hand, some studies address the issue of predicting the temporal dynamics of the information diffusion process [38], [39], [50]. In [38], the authors also construct a graph-based model for information diffusion prediction; they build a platform that helps in understanding social network users’ interests and activity by providing detection of emerging topics and events as well as network analysis functionalities. These approaches require constructing graph structures. However, our collected dataset does not contain such graph structures. We therefore develop a new method to handle SNA without graph structures.

2.7 Skyline query

The skyline operation was proposed as an extension to database systems [51]. It discovers a set of interesting tuples, namely those not dominated by any other tuple, from a potentially large set of tuples. The basic way to compute the skyline is to apply the block-nested-loop (BNL) algorithm and compare every tuple with every other tuple [52]. In [52], the authors also use the divide-and-conquer (D&C) algorithm [53] to implement the skyline query. Two progressive techniques are proposed in [54]: the Bitmap and the Index techniques. A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional storage space and extra writes to maintain the index. An index is used to locate data quickly without having to search every row each time the table is accessed. The nearest neighbor (NN) algorithm uses the results of nearest neighbor search to partition the data universe recursively [54]. NN executes a nearest neighbor query on R-trees. In [54], the authors mention that NN has some desirable features (such as high speed in returning the initial skyline tuples) but presents several inherent disadvantages (such as the need for duplicate elimination if the dimension is larger than 2, multiple accesses of the same node, and large space overhead). Therefore, the authors developed branch-and-bound skyline (BBS) [54], [55].
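The pairwise comparison behind BNL can be illustrated with a minimal in-memory sketch. It is not the implementation from [52]: the original algorithm works with a bounded window and temporary files, which are omitted here. The names `dominates` and `bnl_skyline` are hypothetical, and the sketch assumes that larger attribute values are preferred.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is at least as good as b in every dimension
    and strictly better in at least one (assumption: larger is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def bnl_skyline(points: List[Point]) -> List[Point]:
    """Simplified block-nested-loop skyline with an unbounded window."""
    window: List[Point] = []
    for p in points:
        # Discard p if a tuple already in the window dominates it.
        if any(dominates(w, p) for w in window):
            continue
        # Otherwise p evicts every window tuple that it dominates.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


if __name__ == "__main__":
    data = [(1, 5), (2, 2), (3, 3), (5, 1), (0, 0)]
    print(bnl_skyline(data))
```

In the worst case every tuple is compared with every tuple in the window, which is the quadratic behaviour that the sorting-based and index-based methods try to avoid.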

Some works sort the input data to speed up query processing [51], [56], [57], [58], [59], [60], [61], [62]. The sorting-based algorithms aim to optimize pivot ordering so that non-skyline tuples are pruned early. The first sorting-based algorithm is the sort-filter-skyline (SFS) algorithm [56]. In [55], the authors define a monotone scoring function (ordered from highest to lowest score), which yields a topological sort with respect to the skyline dominance partial order. We also define a monotone function in our study. Godfrey et al. mention that the maximal vector problem has been rediscovered in the database context with the introduction of the skyline query [58].

Computing the skyline is known as the maximal vector problem [57], [63], [64]. In [58], the authors present a new algorithm for maximal vector computation, linear elimination sort for skyline (LESS), which combines aspects of SFS, BNL, and fast linear expected-time (FLET) [62] but does not contain any aspects of D&C. LESS must sort the tuples initially; it is an optimized version of SFS [58]. In [61], the sort and limit skyline algorithm (SaLSa) exploits the idea of presorting the input data so as to limit the number of tuples to be read and compared. SaLSa strives to avoid scanning the complete set of sorted tuples; its distinguishing feature is the ability to compute the result without applying dominance tests to all the tuples in the input relation [61]. Many algorithms, such as SFS, LESS, and SaLSa, need to sort tuples first, and so do our methods.
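The presorting idea shared by these algorithms can be sketched as follows. This is a simplified illustration rather than the SFS implementation from [56]: the scoring function is assumed to be the plain sum of the attributes (any function that is strictly monotone with respect to dominance would serve), larger values are assumed to be preferred, and the names `dominates` and `sfs_skyline` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Assumption: larger values are better in every dimension."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def sfs_skyline(points: List[Point],
                score: Callable[[Point], float] = sum) -> List[Point]:
    """Sort by a monotone score (highest first), then filter.

    If a dominates b, then score(a) > score(b), so no tuple can be
    dominated by a tuple that appears later in the sorted order.
    A tuple that enters the skyline list is therefore never evicted.
    """
    skyline: List[Point] = []
    for p in sorted(points, key=score, reverse=True):
        if not any(dominates(s, p) for s in skyline):
            skyline.append(p)
    return skyline


if __name__ == "__main__":
    data = [(1, 5), (2, 2), (3, 3), (5, 1), (0, 0)]
    print(sfs_skyline(data))
```

Compared with the window in BNL, the list here only grows, so each incoming tuple needs a single pass of dominance tests against confirmed skyline tuples.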

In addition, partitioning-based algorithms group tuples into sub-regions, which are then used for region-based dominance tests. D&C [52] simply divides the problem into multiple sub-problems and merges the local skyline tuples into global ones. Zhang et al. [59] propose an object-based space partitioning (OSP) scheme, which recursively divides the data space into separate partitions with respect to a reference skyline tuple and facilitates progressive retrieval in high-dimensional spaces.
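The divide-and-merge step of D&C can be sketched in a few lines. This is only an illustration under simplifying assumptions: the input list is split in half by position rather than by a median along one dimension, larger values are assumed to be preferred, and the names `dominates` and `dc_skyline` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Assumption: larger values are better in every dimension."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def dc_skyline(points: List[Point]) -> List[Point]:
    """Compute local skylines of two halves and merge them."""
    if len(points) <= 1:
        return list(points)
    mid = len(points) // 2
    left = dc_skyline(points[:mid])
    right = dc_skyline(points[mid:])
    # A local skyline tuple is global only if no tuple of the
    # other half's local skyline dominates it.
    merged = [p for p in left if not any(dominates(q, p) for q in right)]
    merged += [q for q in right if not any(dominates(p, q) for p in left)]
    return merged


if __name__ == "__main__":
    data = [(1, 5), (2, 2), (3, 3), (5, 1), (0, 0)]
    print(dc_skyline(data))
```

Checking only against the other half's local skyline is sufficient because dominance is transitive: any tuple dominated by a non-skyline tuple is also dominated by some skyline tuple of the same half.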

Table 2 summarizes the features of skyline query processing algorithms in the literature, and Table 3 summarizes related works according to their features.

Table 2. The features of skyline query processing algorithms

| Features | Description | Abbreviation |
|---|---|---|
| Sorting technique | Researchers sort the input data by using some functions | ST |
| Dominance checking approach | Researchers use some methods to reduce calculations | DA |
| Indexing technique | Researchers build index to speed up | IT |
| Application | Researchers evaluate their algorithms with real data | Ap |

Table 3. The summary of related works of skyline query

Papers Algorithms ST DA IT Ap

In the typical application to which our methods can be applied, users assign scores to items or objects, and these scores are then transformed into multi-instance data by the proportion method. An object can thus contain a series of probability values. Many prior works show that the skyline query is very useful in multi-criteria decision-making applications [56], [53], [54], [55], [57], [58], [59], [60], [61], [62]. Uncertainty in data is inherent in many applications, such as sensor networks, scientific data management, and data integration, where data take different values with probabilities [65]. Probabilistic data are unavoidable in some important applications. The first work on supporting the skyline query on such data, called p-skyline, is reported in [66], in which the authors consider analyzing professional basketball players using multiple technical statistics criteria and attempt to find the player who can achieve the best performance in all aspects. In [66], the authors propose a probabilistic skyline model in which each uncertain tuple has some probability of being on the skyline; this model gives rise to the p-skyline.

Given a threshold p (0 ≤ p ≤ 1), the p-skyline is the set of uncertain objects, each of which has a probability of at least p of being on the skyline [67], [68]. In [67], the definition of an instance is different from that in our study. Atallah and Qi propose a general probabilistic skyline query that takes into account different user utilities without any restriction, but they do not use any probability threshold [65], [66]. Liu et al. propose a new uncertain skyline model called u-skyline [51], which aims to return an uncertain skyline answer set from a perspective complementary to p-skyline. Specifically, p-skyline returns individual data tuples with non-dominance probabilities greater than or equal to a specified threshold [51], while u-skyline focuses on returning an answer set that forms a valid skyline with the maximum probability [51].
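The threshold definition above can be made concrete with a small sketch. It assumes a discrete instance model that may not match the cited works exactly: each uncertain object is a list of instances, the instances of one object are equally likely, different objects are independent, and larger values are preferred. Under these assumptions, the skyline probability of an object is the average, over its instances, of the probability that no other object takes an instance dominating it. The names `skyline_probability` and `p_skyline` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Assumption: larger values are better in every dimension."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_probability(name: str, objects: Dict[str, List[Point]]) -> float:
    """Probability that object `name` is on the skyline
    (assumes equiprobable instances and independent objects)."""
    instances = objects[name]
    total = 0.0
    for u in instances:
        prob_u = 1.0
        for other, other_instances in objects.items():
            if other == name:
                continue
            # Fraction of the other object's instances that dominate u.
            n_dom = sum(dominates(v, u) for v in other_instances)
            prob_u *= 1.0 - n_dom / len(other_instances)
        total += prob_u
    return total / len(instances)


def p_skyline(objects: Dict[str, List[Point]], p: float) -> List[str]:
    """Objects whose skyline probability is at least the threshold p."""
    return [name for name in objects if skyline_probability(name, objects) >= p]


if __name__ == "__main__":
    objs = {"A": [(5, 5), (1, 1)], "B": [(3, 3)], "C": [(0, 0)]}
    for name in objs:
        print(name, skyline_probability(name, objs))
    print(p_skyline(objs, 0.5))
```

In this toy input, A and B each have skyline probability 0.5 and C has 0, so a threshold of 0.5 keeps A and B while any higher threshold returns an empty set.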

Most works assume that uncertainty exists only in attribute values [69]. Zhang et al. [69] address the skyline probability computation problem in scenarios where uncertainty resides in attribute preferences instead of values. The approach used in [69] assumes independent object dominance. The previous works discuss probabilistic skylines and skyline queries for probabilistic data; we summarize these works according to their features in Table 4. Our work differs from the others in that it targets multi-instance data rather than traditionally defined uncertain data. The dominance relation between two objects in our study is the sum of the probabilities that the higher score can dominate the lower score. In p-skyline, by contrast, a probability for a tuple is defined by aggregating over all the possible worlds within which the tuple is dominated. We calculate the dominance relation between two objects and then determine the one that could potentially be on the skyline. If the determined object is not dominated by others, it is a skyline object and is returned as an answer to the query.
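One possible reading of this dominance relation is sketched below for a single scored attribute. It is an interpretation for illustration only and is not taken from Section 5.1: the proportion method is assumed to turn a list of user scores into the fraction of users who gave each score, and the dominance value of object A over object B is assumed to be the sum, over all score pairs where A's score is strictly higher, of the product of the two proportions. The names `to_proportions` and `dominance_probability` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def to_proportions(scores: List[int]) -> Dict[int, float]:
    """Assumed 'proportion method': fraction of ratings at each score."""
    counts = Counter(scores)
    n = len(scores)
    return {score: count / n for score, count in counts.items()}


def dominance_probability(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Sum of probabilities of all score pairs in which a's score is
    strictly higher than b's (assumes the two objects are independent)."""
    return sum(pa * pb
               for sa, pa in a.items()
               for sb, pb in b.items()
               if sa > sb)


if __name__ == "__main__":
    item_a = to_proportions([5, 5, 4, 4])  # half the users gave 5, half gave 4
    item_b = to_proportions([4, 3])        # half gave 4, half gave 3
    print(dominance_probability(item_a, item_b))
    print(dominance_probability(item_b, item_a))
```

Under this reading the two directions need not sum to 1, because ties in score count for neither object.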

Table 4. The summary of related works of skyline query on probabilistic data

| Papers | Algorithms | ST | DA | IT | Ap |
|---|---|---|---|---|---|
| [67] | The bottom-up and the top-down algorithms | Yes | No | Yes | Yes |
| [65], [66] | Weighted dominance counting (WDC) | Yes | Yes | No | No |
| [70] | Skyline feature algorithm (SFA) | Yes | Yes | Yes | No |
| [68] | Bottom-up and top-down hybrid algorithm | Yes | Yes | Yes | Yes |
| [52] | Dynamic programming search algorithm | Yes | Yes | No | No |
| [69] | Monte Carlo estimation algorithm | Yes | Yes | No | Yes |
| Our study | Our methods (Section 5.1) | Yes | Yes | No | Yes |

We summarize comparisons between our study and related works in Table 5. Table 5 also describes the originality of our study.

Table 5. The summary of comparisons between our proposed skyline query and related works

| Features | Our study | Related works |
|---|---|---|
| Sorting technique | Ours and theirs sort the input data to reduce the computational cost and speed up the performance. | [51], [56], [57], [58], [59], [60], [61], [62] |
| Instance | The instance in their definition is different from that in ours. | [67] |
| Dominance relation | They define a probability for each tuple by aggregating over all the possible worlds within which the tuple is dominated, while we define the dominance relation to be the sum of the probabilities that the higher score can dominate the lower score. | [68] |
| Monotone function | They define the monotone scoring function, and that is different from what we use. | [56], [58] |
| Skyline object | If an object is not dominated by others, it is a skyline object and is returned as an answer to a query; this is consistent with the skyline query on certain data. | [52], [56], [53], [54], [55], [57], [58], [59], [60], [61], [62], [63], [64], [65] |

Considering the increasingly large amount of data, Cosgaya-Lozano et al. show that parallel computing is an effective way to speed up skyline query processing on large datasets [71]. Many works consider parallelized methods that utilize multiple processors or obtain useful partitions of the dataset for parallel processing [72], [73], [74], [75]. In our study, we focus on effectively processing skyline queries on a special type of data [76], [77]. Nevertheless, parallelizing our methods could be part of future work.

2.8 The left-wing and right-wing