Selecting the Best Group of Players for a Composite Competition (針對複合式競賽挑選最佳球員組合的方法) - NCCU Academic Hub (政大學術集成)


Department of Computer Science, National Chengchi University (國立政治大學資訊科學系)
Master's Thesis (碩士論文)

Selecting the Best Group of Players for a Composite Competition
(針對複合式競賽挑選最佳球員組合的方法)

Student: Ya-Wen Teng (鄧雅文)
Advisor: Arbee L. P. Chen (陳良弼)

July 2010 (中華民國九十九年七月)

Selecting the Best Group of Players for a Composite Competition (針對複合式競賽挑選最佳球員組合的方法)

Student: Ya-Wen Teng (鄧雅文)
Advisor: Arbee L. P. Chen (陳良弼)

Master's Thesis, Department of Computer Science, National Chengchi University (國立政治大學資訊科學系碩士論文)

A Thesis submitted to Department of Computer Science, National Chengchi University, in partial fulfillment of the Requirements for the degree of Master in Computer Science

July 2010 (中華民國九十九年七月)

Record number: G0097753034

National ChengChi University Letter of Authorization for Theses and Dissertations Full Text Upload
(Bind with paper copy thesis/dissertation following the title page)

This form attests that the ________________ Division of the Department of Computer Science at National ChengChi University has received a Master degree thesis by the undersigned in the second semester of the 98 academic year.

Title: Selecting the Best Group of Players for a Composite Competition (針對複合式競賽挑選最佳球員組合的方法)
Supervisor: Arbee L. P. Chen (陳良弼)

The undersigned grants non-exclusive and gratis authorization to National ChengChi University to reproduce the above thesis full text material via digitalization or any other way, and to store it in the database for users to search, browse, download, transmit and print online via single-machine, the Internet, wireless Internet or other public methods. National ChengChi University is entitled to reauthorize a third party to perform the above actions.

Time of Thesis/Dissertation Full Text Uploading for Internet Access: The Internet, publicly available from August 5, 2011 (中華民國 100 年 8 月 5 日).

● The undersigned guarantees that this work is the original work of the undersigned, is therefore eligible to grant various authorizations according to this letter of authorization, and does not infringe any intellectual property right of any third party.
● According to the resolution of the first Academic Affairs Meeting of the first semester of the 96 academic year on September 22nd, 2007, once the thesis/dissertation has passed the examiners' evaluation and has been sent to the library, it is considered a record of the university, and changing or replacing the record is disallowed. Likewise, once authorization is granted, any further alteration of the authorization is disallowed.

Undersigned: Ya-Wen Teng (鄧雅文)
Signature:
Date of signature: __________/__________/__________ (dd/mm/yyyy)

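Selecting the Best Group of Players for a Composite Competition (針對複合式競賽挑選最佳球員組合的方法)

Abstract (摘要, translated from the Chinese)

In database processing, top-k queries help users extract valuable objects from massive data: the objects in the database are scored by a scoring function, and the k objects with the highest scores are returned to the user. In most situations, however, an object may have more than one score, and how to still select the k objects with the highest overall scores becomes a new problem. In this study, we represent such objects as uncertain data, where the uncertainty of each object is its set of scores, each carrying a probability that expresses how likely that score is to occur, and we propose a new problem: the Best-kGROUP query. We model the situation as a composite competition with several sub-games, each of which requires a different number of participants and at most k participants; we wish to select the best group of k players for this composite competition. We define a group to be better than another group if it has a higher probability of taking the top places in more games, and the best group is one for which no better group exists. To speed up the selection, we use dynamic programming together with a filtering algorithm to eliminate the impossible groups first; the remaining groups are those with the skyline property, and among these skyline groups we can easily find the best group. In addition, the experiments show that the Best-kGROUP query performs very well at selecting the best group among all players.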

Selecting the Best Group of Players for a Composite Competition

Abstract

In a large database, the top-k query is an important mechanism to retrieve the most valuable information for the users. It ranks data objects with a ranking function and reports the k objects with the highest scores. However, when an object has multiple scores, how to rank objects without information loss becomes challenging. In this paper, we model an object with multiple scores as an uncertain data object and the uncertainty of the object as a distribution of the scores, and consider a novel problem named Best-kGROUP query. Imagine the following scenario. Assume there is a composite competition consisting of several games, each of which requires a distinct number of players. Suppose the largest number is k, and we want to select the best group of k players from all the players for the competition. A group x is considered better than another group y if x has a higher aggregated probability to be the top ones in more games than y. In order to speed up the selection process, the groups that are definitely worse than another group should be discarded first. We identify these groups using a dynamic programming based approach and a filtering algorithm. The remaining groups, none of which has a higher aggregated probability to be the top ones for all games against the other remaining groups, are called skyline groups. We can easily compare these skyline groups to select the best group for the composite competition. The experiments show that our approach outperforms the other approaches in selecting the best group to defeat the other groups in the composite competitions.

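Acknowledgements (誌謝)

First of all, I give the glory to God. Throughout my studies He has always made the best arrangements for me; especially amid the tension and pressure of my second year in the master's program, He kept encouraging and leading me with love. Although I met difficulties and setbacks, thank God, He still set me on high ground.

The person I thank most for the completion of this thesis is my advisor, Professor Arbee L. P. Chen (陳良弼). From my first year he taught me the rigorous attitude that research requires, and when we started discussing problems in my second year, the weekly discussion time he set aside was precious to me and I am very grateful for it; every discussion about revising the thesis was also a great help. Beyond that, whether I was injured or ill, he always showed concern. Thank you for everything you gave during these two years of graduate school.

I am also grateful to the senior and junior members of the lab. You gave me good suggestions at every presentation, and you accompanied me through two happy years of graduate life. Although our time together was not long, you left me with wonderful memories both in study and in play.

Thank you to my family, who have always been with me: my father, my mother, my aunt and my sister. Your care, intercession and encouragement are my support and strength; every phone call, every text message and every prayer together comforted me greatly. Thanks also to my boyfriend 家祺: although we often played video games together until we had to stay up late to finish our reports, and ate and drank together until our wallets were empty, I am happy and grateful for your company. To the pastors of the church, my small group leader, the sisters in my group, and my many friends: thank you for cheering me on when my research was not going well and for bearing with me in my low moments; may my joy be shared with you.

Finally, I once again thank all my teachers, family and friends. Having your company in the past days is the grace of God; may God greatly bless you too!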

TABLE OF CONTENTS

1. INTRODUCTION
2. RELATED WORK
   2.1 Top-k Queries
   2.2 Uncertain Data
      2.2.1 Uncertain Top-k Queries
      2.2.2 Uncertain Top-k Queries on Data Streams
      2.2.3 Uncertain Nearest Neighbor Queries
   2.3 Skyline
3. METHODOLOGY
   3.1 Problem Definition
   3.2 The Basic Algorithms
   3.3 The Heuristic Approaches
4. EXPERIMENTS
   4.1 Experiment Setup
   4.2 On Execution Time
   4.3 On Accuracy
   4.4 On Performance of the Composite Competition
5. CONCLUSIONS AND FUTURE WORK
6. REFERENCES

LIST OF TABLES

Table 1-1: Pin-fall records of A, B, and C.
Table 1-2: Bowling scores of A, B, and C.
Table 1-3: Records of player A, B, and C.
Table 1-4: Modeled as uncertain data.
Table 1-5: The aggregated probability to be the top-1 player.
Table 1-6: The aggregated probability to be the top-2 players.
Table 2-1: The example database D.
Table 2-2: All possible worlds of D.
Table 2-3: All possible U-top2 of D.
Table 2-4: All possible rank 1 and rank 2 answers.
Table 2-5: The top-k probability of all tuples.
Table 3-1: The tuple concept of D.
Table 3-2: The table with all groups and their aggregated probabilities.
Table 3-3: Information of selecting the best 2GROUPs of the dataset in D.

LIST OF FIGURES

Figure 3-1: Algorithm GroupGen.
Figure 3-2: Illustration of updating the dynamic programming table.
Figure 3-3: Algorithm SubsetFilter.
Figure 3-4: Algorithm BestGROUP.
Figure 4-1: Experiment process.
Figure 4-2: Execution time among different algorithms.
Figure 4-3: Accuracy of LimitGroupGen.
Figure 4-4: Probability of defeating other kSETs among different algorithms.

1. INTRODUCTION

There are many approaches to help users retrieve important data objects. Ranking objects and reporting the ones with the highest scores is called a top-k query. Traditional top-k queries identify the top-k objects by a scoring function which gives each object a unique score. For example, Table 1-1 shows three players and their pin-fall records of a bowling game. Bowling is scored by the number of pins knocked over, and a player gets a bonus in two cases. One is that he/she knocks down all 10 pins with the first ball, called a strike and recorded as an X, while the other is that no pins are left after the second ball, called a spare and recorded as a slash. The bonus is that the pins knocked down by the next two balls after a strike, or by the next ball after a spare, are counted a second time in the score of that frame. By the rules above, we obtain the scores for each frame, shown in Table 1-2, and the scoring function to rank these three players in a bowling game is the aggregated score. In this example, we can simply claim that player A gets the first place.

Table 1-1: Pin-fall records of A, B, and C.

| Player | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 10+1 | 10+2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | X | X | X | 9 / | X | X | X | X | X | X | X | X |
| B | X | X | 5 / | X | X | X | X | X | X | X | X | X |
| C | X | X | X | X | X | 8 / | 7 / | X | X | X | X | X |
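The scoring function described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `parse_balls` and `frame_scores` are hypothetical, frames are written as strings (`X` for a strike, `9/` for a spare, two digits for an open frame), and the last two entries of each record are the bonus balls after the tenth frame, as in Table 1-1.

```python
def parse_balls(frames):
    """Flatten frame marks such as 'X', '9/' or '81' into pins per ball."""
    balls = []
    for mark in frames:
        if mark == "X":                      # strike: all 10 pins with one ball
            balls.append(10)
        elif mark.endswith("/"):             # spare: second ball clears the rest
            first = int(mark[0])
            balls.extend([first, 10 - first])
        else:                                # open frame, e.g. '81' (assumed notation)
            balls.extend(int(ch) for ch in mark)
    return balls


def frame_scores(frames):
    """Return the score of each of the 10 frames."""
    balls = parse_balls(frames)
    scores, i = [], 0
    for _ in range(10):
        if balls[i] == 10:                   # strike: add the next two balls
            scores.append(10 + balls[i + 1] + balls[i + 2])
            i += 1
        elif balls[i] + balls[i + 1] == 10:  # spare: add the next ball
            scores.append(10 + balls[i + 2])
            i += 2
        else:                                # open frame: just the pins of this frame
            scores.append(balls[i] + balls[i + 1])
            i += 2
    return scores


# Pin-fall records from Table 1-1 (frames 1..10 plus the two bonus balls).
RECORDS = {
    "A": ["X", "X", "X", "9/"] + ["X"] * 8,
    "B": ["X", "X", "5/"] + ["X"] * 9,
    "C": ["X"] * 5 + ["8/", "7/"] + ["X"] * 5,
}

totals = {player: sum(frame_scores(marks)) for player, marks in RECORDS.items()}
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)    # {'A': 279, 'B': 275, 'C': 265}
print(ranking)   # ['A', 'B', 'C']
```

Because every player ends up with exactly one aggregated score, sorting by that score is all a traditional top-k query needs here.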

Table 1-2: Bowling scores of A, B, and C.

| Player | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 30 | 29 | 20 | 20 | 30 | 30 | 30 | 30 | 30 | 30 | 279 |
| B | 25 | 20 | 20 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 275 |
| C | 30 | 30 | 30 | 28 | 20 | 17 | 20 | 30 | 30 | 30 | 265 |

In the traditional top-k query, the answer to a top-k query is an extension of the answer to a top-(k-1) query. In the example of Table 1-2, the top-1 player is A and the top-2 players are A and B. When we define a composite competition consisting of various games, each of which requires a distinct number of players, and suppose the largest number of players in the games is k, we can simply submit a top-k query to the database of players and take the answer as the best group of k players for the competition, since these k players must contain the answers of the top-1, top-2, …, and top-(k-1) queries.

However, when a player has more competing experience, he/she has more than one score, as shown in Table 1-3. To rank such data objects with multiple scores, we can simply use another scoring function to obtain the average scores, in other words the expected scores. When we replace the multiple scores with a single score, however, some information is lost. For example, one player has the scores 100, 1000 and 1000, while another player has 50, 50 and 2000. The expected score is the same, 700, but the range and the distribution of the scores are not the same for these two players. Therefore, this way of modeling multi-score data is not appropriate. In this paper, we use an uncertainty model for the data objects with multiple scores. The uncertainty of each object is represented by its probabilistic scores, which keep the original score distribution, as shown in Table 1-4.
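The modeling step described above can be illustrated with a small sketch. It assumes, as Table 1-3 and Table 1-4 suggest, that the probability of a score is simply its relative frequency among the past records; the helper names `to_distribution` and `expected_score` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def to_distribution(records):
    """Turn a list of past scores into {score: probability} by relative frequency."""
    counts = Counter(records)
    return {score: Fraction(n, len(records)) for score, n in counts.items()}


def expected_score(dist):
    """Collapse a distribution into a single number (this is where information is lost)."""
    return sum(score * prob for score, prob in dist.items())


# The two players from the text: same expected score, different distributions.
p = to_distribution([100, 1000, 1000])
q = to_distribution([50, 50, 2000])
print(expected_score(p), expected_score(q))   # 700 700
print(p == q)                                 # False

# Past records from Table 1-3.
PAST = {
    "A": [279, 252, 252, 252, 266],
    "B": [275, 300, 275, 200, 200],
    "C": [265, 265, 265, 265, 265],
}
MODEL = {player: to_distribution(records) for player, records in PAST.items()}
for player, dist in MODEL.items():
    print(player, {s: float(pr) for s, pr in sorted(dist.items(), reverse=True)})
# A {279: 0.2, 266: 0.2, 252: 0.6}
# B {300: 0.2, 275: 0.4, 200: 0.4}
# C {265: 1.0}
```

Exact fractions are used only to keep the probabilities free of floating-point noise; plain floats would work as well.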

Table 1-3: Records of player A, B, and C.

| Player | Past records |
|---|---|
| A | 279, 252, 252, 252, 266 |
| B | 275, 300, 275, 200, 200 |
| C | 265, 265, 265, 265, 265 |

Table 1-4: Modeled as uncertain data.

| Player | Score | Probability |
|---|---|---|
| A | 279 | 0.2 |
| A | 266 | 0.2 |
| A | 252 | 0.6 |
| B | 300 | 0.2 |
| B | 275 | 0.4 |
| B | 200 | 0.4 |
| C | 265 | 1.0 |

After we model the objects as uncertain data, the top-k objects are defined as the k objects with the highest probability to be the top-k ones. For example, when a top-1 query is submitted to the dataset in Table 1-4, the probability to be the top-1 object needs to be obtained for each object first, and then the object with the highest probability can be decided. We compute the probability of A to be the top-1 player in three cases. One is when A gets 279 (with probability 0.2) and B and C get scores not greater than A (with probability 0.4+0.4=0.8 for B and probability 1.0 for C); in this case A has probability 0.2*0.8*1.0=0.16 to be the top-1 player. Another is when A gets 266, which contributes 0.2*0.4*1.0=0.08, and the other is when A gets 252, which contributes 0.6*0.4*0=0 since C cannot score less than A when A gets 252. Therefore, the total probability of A to be the top-1 player is 0.16+0.08+0=0.24. We can obtain the probability for each player in the same way to find the top-1 player.

When a top-k query is submitted, we view k objects as a group, compute the aggregated probability of these objects to be the top-k objects, and select the group with the highest aggregated probability as the answer. We again use the data in Table 1-4 as an example. When a top-2 query is submitted to the dataset in Table 1-4, we group these three players into 3 groups as follows: {A, B}, {A, C}, and {B, C}. The aggregated probability of each group is computed in the same way as in the example above. The aggregated probability of A and B to be the top-2 players is computed in nine cases, one for each combination of a score of A and a score of B. When A gets 279 (with probability 0.2), B gets 300 (with probability 0.2) and C gets a score not greater than both A and B (with probability 1.0), the probability of A and B being the top-2 players is 0.2*0.2*1.0=0.04. We can compute the other eight cases in the same way. Note that, when A gets 252 or B gets 200, C cannot score lower than both A and B. That is, A and B are not the top-2 players in these cases, so the probability is 0 and they do not contribute to the aggregated probability. The four remaining cases contribute 0.04+0.08+0.04+0.08. Hence, the aggregated probability of A and B to be the top-2 players is 0.24. Again, we compute the aggregated probability for the other groups to obtain the two players with the highest aggregated probability to be the top-2 ones.

However, we have an observation about top-k queries on uncertain data: the answer of a top-k query does not always include the answers of the top-i queries, where i is less than k. We take the data in Table 1-4 as an example. The answer of the top-1 query is B but that of the top-2 query is A and C, as shown in Table 1-5 and Table 1-6. That is, when we have a composite competition consisting of two games, one with 1 player and the other with 2 players, the traditional method is not appropriate since the top-2 query reports A and C, but neither of them is the answer of the top-1 query. In other words, the top-k query on uncertain data cannot satisfy the need for a composite competition.

Table 1-5: The aggregated probability to be the top-1 player.

| Player | Aggregated probability |
|---|---|
| {B} | 0.52 |
| {A} | 0.24 |
| {C} | 0.24 |

Table 1-6: The aggregated probability to be the top-2 players.

| Players | Aggregated probability |
|---|---|
| {A, C} | 0.40 |
| {B, C} | 0.36 |
| {A, B} | 0.24 |

Therefore, we propose a novel problem named Best-kGROUP query to retrieve the best groups for the composite competition. Suppose the largest number of players in the games of the composite competition is k. In a game with i players, if a group x of i players has a higher aggregated probability to be the top-i players than another group y, we say there is a preference of x over y. A group P is said to be better than another group Q in a composite competition if there are more preferences of the sub-groups in P than in Q. The best groups are those that are not worse than any other group.
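The top-1 and top-2 computations on Table 1-4 can be reproduced by brute force. The sketch below is an illustration, not the algorithm of this thesis: it enumerates every joint outcome of the three players, assumes the players are independent and that no two players tie (true for this dataset), and uses the hypothetical names `aggregated_probabilities` and `best_group`.

```python
from fractions import Fraction as F
from itertools import combinations, product

# Probabilistic scores from Table 1-4, as exact fractions.
DATA = {
    "A": {279: F(1, 5), 266: F(1, 5), 252: F(3, 5)},
    "B": {300: F(1, 5), 275: F(2, 5), 200: F(2, 5)},
    "C": {265: F(1)},
}


def aggregated_probabilities(data, i):
    """Map each i-group to the probability that its members are the top-i players."""
    players = sorted(data)
    result = {group: F(0) for group in combinations(players, i)}
    # One joint outcome = one (score, probability) choice per player.
    for outcome in product(*(data[pl].items() for pl in players)):
        prob = F(1)
        for _, pr in outcome:
            prob *= pr                     # players are assumed independent
        scores = {pl: s for pl, (s, _) in zip(players, outcome)}
        ranked = sorted(players, key=scores.get, reverse=True)
        result[tuple(sorted(ranked[:i]))] += prob
    return result


def best_group(data, i):
    """The i-group with the highest aggregated probability."""
    agg = aggregated_probabilities(data, i)
    return max(agg, key=agg.get)


for i in (1, 2):
    agg = aggregated_probabilities(DATA, i)
    print(i, {g: float(pr) for g, pr in agg.items()}, "->", best_group(DATA, i))
# 1 {('A',): 0.24, ('B',): 0.52, ('C',): 0.24} -> ('B',)
# 2 {('A', 'B'): 0.24, ('A', 'C'): 0.4, ('B', 'C'): 0.36} -> ('A', 'C')
```

The output matches Table 1-5 and Table 1-6 and makes the observation concrete: the best 1-group {B} is not contained in the best 2-group {A, C}.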

2. RELATED WORK

In the following, we introduce the background of uncertain data, top-k processing, and nearest-neighbor queries. We also discuss the work on skyline queries and the reason why we choose skyline to solve our problem.

2.1 Top-k Queries

Top-k queries are useful when users are interested in the most important objects, especially in large databases. In [1], I. F. Ilyas et al. classify the top-k processing techniques along five dimensions: query model, data and query certainty, data access, implementation level, and ranking function.

There are three types of query models. One is the Top-k Selection Query, which reports the k tuples with the highest scores according to some scoring function. Another is the Top-k Join Query, a variant of the Top-k Selection Query whose scoring function is attached to join results. The other is the Top-k Aggregate Query, which focuses on groups of tuples rather than single tuples.

On data and query certainty, the authors further classify the techniques into three types. The first two types are related to certain data and queries. The difference between them is that the first type reports the exact answers of queries while the second type reports approximate answers instead. The last type is related to uncertain data, which is discussed further in the following sections.

On data access, sorted access and random access are discussed. Different data access assumptions affect the methods used to retrieve the underlying data sources.

Implementations of top-k queries can be classified into two types. One is at the application level and the other is at the query engine level. The difference between these two types is whether the core of the database engine is modified.

The ranking functions in most techniques are assumed to be monotone, while a few techniques allow a generic form. In recent research, however, some queries use no scoring function at all. Such a query is called a skyline query, and we discuss it in Section 2.3.

2.2 Uncertain Data

Recently, the research on uncertain data has attracted a lot of interest, because problems related to uncertainty cannot be addressed by traditional approaches. In [2], C. C. Aggarwal et al. survey the sources of uncertainty: it arises from errors, incompleteness, and multiple records, so that the data objects have probabilistic attributes. The main research areas of uncertain data are of three types: data modeling, data management, and data mining. Here we focus on data management, and we use a discrete probability distribution to model our uncertain data.

2.2.1 Uncertain Top-k Queries

Traditional top-k processing retrieves the k tuples with the highest scores. In the uncertain top-k definitions, we need to consider the tradeoff between scores and probabilities. Here we introduce two definitions proposed in [3]. Before we explain the definitions, we need some preliminaries. Due to the uncertainty, an object may have different behaviors, and each behavior has its existence probability. Thus, when every object takes one of its own behaviors, we multiply the probabilities of the chosen behaviors to obtain p, and we call this case one of the possible worlds, with existence probability p. For instance, Table 2-1 shows the example uncertain database. We can obtain all possible worlds from it, shown in Table 2-2.

Table 2-1: The example database D.

| Player | Tuple ID | Score | Probability |
|---|---|---|---|
| A | t1 | 279 | 0.2 |
| A | t2 | 266 | 0.2 |
| A | t3 | 252 | 0.6 |
| B | t4 | 300 | 0.2 |
| B | t5 | 275 | 0.4 |
| B | t6 | 200 | 0.4 |
| C | t7 | 265 | 1.0 |

Table 2-2: All possible worlds of D.

| Possible World | Members | Probability |
|---|---|---|
| PW1 | B(t4), A(t1), C(t7) | 0.2*0.2*1.0 = 0.04 |
| PW2 | A(t1), B(t5), C(t7) | 0.2*0.4*1.0 = 0.08 |
| PW3 | A(t1), C(t7), B(t6) | 0.2*0.4*1.0 = 0.08 |
| PW4 | B(t4), A(t2), C(t7) | 0.2*0.2*1.0 = 0.04 |
| PW5 | B(t5), A(t2), C(t7) | 0.2*0.4*1.0 = 0.08 |
| PW6 | A(t2), C(t7), B(t6) | 0.2*0.4*1.0 = 0.08 |
| PW7 | B(t4), C(t7), A(t3) | 0.6*0.2*1.0 = 0.12 |
| PW8 | B(t5), C(t7), A(t3) | 0.6*0.4*1.0 = 0.24 |
| PW9 | C(t7), A(t3), B(t6) | 0.6*0.4*1.0 = 0.24 |

In U-Topk, for each vector of top-k tuples we sum up the probabilities of all possible worlds in which it forms the top-k. Note that U-Topk only considers tuples, not the objects they belong to. From Table 2-2, we can further obtain all possible U-Top2 answers, shown in Table 2-3. Therefore, <t5, t7> and <t7, t3>, which have the highest probability 0.24 of being the top-2 over all possible worlds, are the answers of U-Top2. M. Soliman et al. transform the U-Topk query into a state search problem. They start from the tuple with the highest score. Each time they scan a new tuple, they extend the current state with the highest probability, until the length of the retrieved state is k. The tuples kept in that state are the answer of U-Topk.
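As a small illustration of the possible-world semantics, the sketch below enumerates the worlds of Table 2-1 and accumulates the U-Top2 probabilities. It is a brute-force enumeration under the assumption that the tuples of one player are mutually exclusive and players are independent; it is not the state search of M. Soliman et al., and the names `possible_worlds` and `u_topk` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F
from itertools import product

# Table 2-1: tuple id -> (player, score, probability).
# Tuples of the same player are mutually exclusive alternatives.
TUPLES = {
    "t1": ("A", 279, F(1, 5)), "t2": ("A", 266, F(1, 5)), "t3": ("A", 252, F(3, 5)),
    "t4": ("B", 300, F(1, 5)), "t5": ("B", 275, F(2, 5)), "t6": ("B", 200, F(2, 5)),
    "t7": ("C", 265, F(1)),
}


def possible_worlds(tuples):
    """Yield (tuple ids ordered by descending score, probability) for every world."""
    by_player = defaultdict(list)
    for tid, (player, _, _) in tuples.items():
        by_player[player].append(tid)
    for choice in product(*by_player.values()):      # one tuple per player
        prob = F(1)
        for tid in choice:
            prob *= tuples[tid][2]
        ranked = tuple(sorted(choice, key=lambda t: tuples[t][1], reverse=True))
        yield ranked, prob


def u_topk(tuples, k):
    """Sum world probabilities per top-k vector; return all vectors tied for the maximum."""
    vectors = defaultdict(F)
    for ranked, prob in possible_worlds(tuples):
        vectors[ranked[:k]] += prob
    best = max(vectors.values())
    return vectors, sorted(v for v, pr in vectors.items() if pr == best)


worlds = list(possible_worlds(TUPLES))
vectors, answers = u_topk(TUPLES, 2)
print(len(worlds), sum(pr for _, pr in worlds))   # 9 1
print(answers)                                    # [('t5', 't7'), ('t7', 't3')]
```

The nine worlds and their probabilities correspond to Table 2-2, and the accumulated vectors correspond to Table 2-3. Enumeration grows exponentially with the number of objects, which is why a search-based method is needed in practice.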

Table 2-3: All possible U-top2 of D.

| Top-2 vector | Probability |
|---|---|
| <t5, t7> | 0.24 |
| <t7, t3> | 0.24 |
| <t4, t7> | 0.12 |
| <t1, t5> | 0.08 |
| <t1, t7> | 0.08 |
| <t5, t2> | 0.08 |
| <t2, t7> | 0.08 |
| <t4, t1> | 0.04 |
| <t4, t2> | 0.04 |

M. Soliman et al. also define U-kRanks, which is totally different from U-Topk. U-kRanks retrieves the top-k tuples separately. That is, for every ranking position we look for the tuple with the highest probability of appearing at that position. In this example, although the U-2Ranks answer, <t5, t7>, is the same as one of the U-Top2 answers, the semantic meaning is very different. Since the tuple of each rank is picked separately, we might choose the same tuple for different ranks, or tuples from the same object. That is, the U-kRanks definition does not take the relation between the tuples in the answer into account, and the answer might not be valid in any possible world.

Table 2-4: All possible rank 1 and rank 2 answers.

| Rank 1 | Probability | Rank 2 | Probability |
|---|---|---|---|
| t5 | 0.32 | t7 | 0.52 |
| t7 | 0.24 | t3 | 0.24 |
| t4 | 0.20 | t2 | 0.12 |
| t1 | 0.16 | t5 | 0.08 |
| t2 | 0.08 | t1 | 0.04 |

In addition, M. Hua et al. propose another definition of uncertain top-k queries in [4], called PT-k. For each tuple, they first define a top-k probability, which is the probability that the tuple is a member of the top-k ones over all possible worlds. When we compute the top-k probabilities, the straightforward approach suffers from the exclusive rules. M. Hua et al. use the compressed dominant set of each tuple to simplify the steps. In the compressed dominant set of a tuple t, all tuples are independent of t. As we know, the probabilities can simply be multiplied when all tuples are independent. Hence, the computing process can be simplified efficiently. The answer of PT-k is the set of all tuples whose top-k probability is higher than an input threshold p. In the example of Table 2-1, when p=0.5, the answer of PT-2 is only one tuple, {t7}. By definition, the answer of PT-k is simply a set of tuples with top-k probability higher than p; the relation between the tuples is not considered when PT-k reports the answer.
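The rank probabilities behind U-kRanks and the top-k probabilities behind PT-k can both be read off the possible worlds. The following sketch computes them by brute-force enumeration of Table 2-1; it deliberately ignores the compressed dominant set technique, and all function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F
from itertools import product

# Table 2-1 grouped by player: (tuple id, score, probability).
PLAYERS = {
    "A": [("t1", 279, F(1, 5)), ("t2", 266, F(1, 5)), ("t3", 252, F(3, 5))],
    "B": [("t4", 300, F(1, 5)), ("t5", 275, F(2, 5)), ("t6", 200, F(2, 5))],
    "C": [("t7", 265, F(1))],
}


def rank_probabilities(players):
    """rank_prob[j][tid] = probability that tuple tid appears at rank j + 1."""
    rank_prob = [defaultdict(F) for _ in range(len(players))]
    for world in product(*players.values()):        # one tuple per player
        prob = F(1)
        for _, _, pr in world:
            prob *= pr
        ordered = sorted(world, key=lambda t: t[1], reverse=True)
        for j, (tid, _, _) in enumerate(ordered):
            rank_prob[j][tid] += prob
    return rank_prob


def u_kranks(rank_prob, k):
    """For each rank 1..k, pick the most probable tuple independently."""
    return [max(rank_prob[j], key=rank_prob[j].get) for j in range(k)]


def topk_probabilities(rank_prob, k):
    """Top-k probability of a tuple = sum of its rank-1..rank-k probabilities."""
    tids = sorted({tid for level in rank_prob for tid in level})
    return {tid: sum((rank_prob[j][tid] for j in range(k)), F(0)) for tid in tids}


def pt_k(rank_prob, k, threshold):
    """Tuples whose top-k probability exceeds the threshold."""
    return {tid for tid, pr in topk_probabilities(rank_prob, k).items() if pr > threshold}


rp = rank_probabilities(PLAYERS)
print(u_kranks(rp, 2))                                   # ['t5', 't7']
print({t: float(pr) for t, pr in topk_probabilities(rp, 2).items()})
# {'t1': 0.2, 't2': 0.2, 't3': 0.24, 't4': 0.2, 't5': 0.4, 't6': 0.0, 't7': 0.76}
print(pt_k(rp, 2, F(1, 2)))                              # {'t7'}
```

The rank probabilities reproduce Table 2-4 and the top-2 probabilities reproduce Table 2-5. Note how U-2Ranks and PT-2 each pick tuples one at a time, so neither answer is guaranteed to form a consistent group of players.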

Table 2-5: The top-k probability of all tuples.

| Tuple ID | Top-k probability |
|---|---|
| t1 | 0.20 |
| t2 | 0.20 |
| t3 | 0.24 |
| t4 | 0.20 |
| t5 | 0.40 |
| t6 | 0.00 |
| t7 | 0.76 |

The uncertain top-k queries above focus on the tuples most likely to be the top-k. However, the answers might not have the highest scores. In [5], T. Ge et al. observe that the answers of uncertain top-k queries might be atypical, which means the answer might have a relatively low score. For example, the average score of the answer tuples might not be higher than the expected score. Or there may be many combinations of tuples whose aggregated scores are higher than that of the answer, with a total probability much higher than that of the answer. In these situations, the uncertain top-k answers are perhaps not appropriate in terms of score. Therefore, T. Ge et al. propose c-Typical-Topk to retrieve c combinations that represent the whole score distribution of the dataset. It first uses a dynamic programming based approach to generate the whole distribution. Then, it maps the process of retrieving c combinations from all combinations of the score distribution to the p-median problem and solves it with two recursive functions.

2.2.2 Uncertain Top-k Queries on Data Streams

Data streaming has received attention for a long time because it models the way data are collected in real-world applications. In a streaming problem, the challenge is the rapid growth of the amount of data. Both the execution time and the storage space make the approaches designed for static databases unusable. We can classify the streaming models into two types. One is unbounded streaming and the other is bounded streaming. The former collects data from the beginning, while the latter removes expired data. The latter is more challenging because we need not only to append newly arriving data but also to remove old data. We call this a sliding-window model, since the valid data only appear in the window and the window slides along the stream.

In [6], C. Jin et al. propose a framework to process different types of top-k queries on uncertain data streams. They claim that the previous definitions, such as U-Topk, U-kRanks, PT-k, and so on, can be plugged into this framework. In their framework, they first define a compact set, which keeps the minimal set of tuples needed to retrieve the uncertain top-k answers even if new tuples are appended. However, the problem is defined on the sliding-window model, so they use multiple compact sets at different time-stamps to implement the removing operation. The authors furthermore propose other compression approaches to reduce the space that the large number of compact sets would need.

2.2.3 Uncertain Nearest Neighbor Queries

In addition to the uncertain top-k queries, there are uncertain NN queries to retrieve the most important objects. An NN (nearest neighbor) query reports the object closest to the query object. However, in an uncertain database, every object has its own probability to be the NN, and we denote this probability as PNN. G. Beskales et al. propose Topk-PNN in [7] to report the k objects with the highest probability to be the NN. They consider both I/O and computing time to design an efficient algorithm for retrieving the k objects. Instead of computing the exact PNN of all objects, they use lazy bounds to simplify the computation and reduce the cost. Here we can transform the NN queries into top-k queries. We let the scoring function be the distance between the object and the query, where a lower score is better. Then the PNN of an object is the same as the probability of a player to be the top-1 team in a 1-game. We will take Topk-PNN for comparison in our experiments.

2.3 Skyline

The skyline query is also a useful approach to retrieve important data. Before we introduce the definition of skyline, we have to explain the term "dominate". Suppose that smaller values are better. A point A is said to dominate another point B if and only if, in every dimension, the value of A is less than or equal to that of B, and in at least one dimension the value of A is less than that of B. For example, let A be <5, 6, 7, 8> and B be <8, 6, 7, 8>. Then we can easily tell that A dominates B. Moreover, if a point cannot be dominated by any other point in the database, we call it a skyline point.

However, as the number of dimensions increases, it becomes more difficult for a point to be dominated. That is, the number of skyline points becomes huge. Therefore, M. L. Yiu et al. combine the advantages of skyline and top-k queries to define a new query named top-k dominating query in [8]. It computes the number of points dominated by each point as the ranking score, and then returns the points with the highest scores. Users do not need to worry about how to define a scoring function and can simply give a value k as a parameter to restrict the size of the answers.

Another work on high-dimensional skyline operations is proposed by C.-Y. Chan et al. in [9]. They define k-dominate to describe that a point A has the dominance property over another point B in a subspace of the original space. Based on k-dominance, they also define the k-dominant skyline, consisting of the points which cannot be k-dominated. Note that the problem is not simply a reduction of the dimensions to some subspace, since the dominance relationship between any two points is not specified in a fixed subspace. We have the property that if a point A (k+1)-dominates another point B, A must also k-dominate B. It implies that every k-dominant skyline point must be a member of the (k+1)-dominant skyline. Hence, the size of the k-dominant skyline cannot exceed that of the original skyline.
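The dominance notions of this section can be made concrete with a short sketch. It assumes smaller values are better, as in the text, and takes k-dominance to mean that a point is no worse in at least k dimensions and strictly better in at least one of them. The points C and D are made up for illustration; only A and B come from the text. All function names are hypothetical.

```python
def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def k_dominates(a, b, k):
    """a k-dominates b: no worse in at least k dimensions, strictly better in one of them."""
    no_worse = sum(x <= y for x, y in zip(a, b))
    return no_worse >= k and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def k_dominant_skyline(points, k):
    """Points that no other point k-dominates."""
    return [p for p in points if not any(k_dominates(q, p, k) for q in points)]


def top_k_dominating(points, k):
    """Rank points by how many other points they dominate and keep the best k."""
    score = {p: sum(dominates(p, q) for q in points) for p in points}
    return sorted(points, key=score.get, reverse=True)[:k]


A, B = (5, 6, 7, 8), (8, 6, 7, 8)        # the example from the text
C, D = (6, 5, 9, 9), (9, 9, 9, 9)        # hypothetical extra points
points = [A, B, C, D]

print(dominates(A, B))                   # True
print(skyline(points))                   # [(5, 6, 7, 8), (6, 5, 9, 9)]
print(k_dominant_skyline(points, 3))     # [(5, 6, 7, 8)]
print(top_k_dominating(points, 1))       # [(5, 6, 7, 8)]
```

In this toy dataset C survives in the ordinary skyline only because it beats A in one dimension; relaxing the requirement to 3 of the 4 dimensions removes it, which shows how the k-dominant skyline shrinks the answer.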

3. METHODOLOGY

In this section, we first define our problem and then present the basic algorithms to address it. The basic algorithms must compute the aggregated probability of all groups, so we use a dynamic programming based approach to achieve this goal. However, in consideration of the time and space complexity, we turn to heuristics to reduce the cost and compare their performance in the Experiment Section. Besides, we also exploit the property of skyline to prune those kGROUPs with no chance to be the answer. The number of the remaining ones is much smaller after the pruning process, so we can compare them directly to retrieve the best as the answer.

3.1 Problem Definition

Before giving the definition of our problem, we first define the terms used in the following.

Definition 3-1 i-game. A game with i players in one team.

Definition 3-2 i-group. A set of i players.

Definition 3-3 aggregated probability of an i-group. The probability is the sum of the probabilities of the players in this i-group to be the top-i players. In other words, we sum up the U-Topi probabilities of the tuples related to this i-group.

Definition 3-4 kGROUP. A kGROUP is a set of k groups, which includes one group of each size. For example, a 3GROUP includes three groups: a 1-group, a 2-group, and a 3-group. Besides, each i-group in the kGROUP must be a sub-group of the k-group in the same kGROUP, and its aggregated probability must be the highest among all sub-groups of size i of that k-group.

Definition 3-5 HiRank vector of a kGROUP. It is a vector with k dimensions. Assume that we sort all i-groups by aggregated probability, so that a group with a higher aggregated probability has a higher rank. The HiRank vector keeps, in dimension i, the rank of the i-group in this kGROUP. For example, in Table 3-3, the 2GROUP {{B}, {B, C}} has <1, 2> as its HiRank vector, since {B} has the highest aggregated probability and is ranked first in the 1-game, and {B, C} has the second highest aggregated probability and is ranked second in the 2-game.

Definition 3-6 better kGROUP. For an i-game, if an i-group x has a higher aggregated probability than another i-group y, we say there is a preference of x over y. Hence, a kGROUP P is a better kGROUP than another kGROUP Q if the i-groups in P have more preferences than those in Q.

Definition 3-7 best kGROUP. A kGROUP is the best kGROUP if no other kGROUP is better than it.

Next, we introduce the concept of skyline to help retrieve all kGROUPs with a chance to be the best one.

Definition 3-8 skyline kGROUP. For a kGROUP, if no other kGROUP dominates it with respect to the HiRank vector, we call it a skyline kGROUP. A kGROUP P is said to dominate another kGROUP Q if and only if, for their HiRank vectors $V_P$ and $V_Q$, $V_P.i$ is less than or equal to $V_Q.i$ for every dimension i, and $V_P.j$ is less than $V_Q.j$ in some dimension j. We also use the term non-skyline kGROUP for the opposite case.

Now, we can give the definition of our problem based on the following uncertainty model. For a database D, there are n uncertain players, each of which has probabilistic scores, as shown in Table 2-1. We can also represent D as a set of tuples, as shown in Table 3-1.

Table 3-1: The tuple concept of D.

Tuple ID   Related player   Score   Probability
t1         A                279     0.2
t2         A                266     0.2
t3         A                252     0.6
t4         B                300     0.2
t5         B                275     0.4
t6         B                200     0.4
t7         C                265     1.0

Definition 3-9 Best-kGROUP query. The Best-kGROUP query returns the best kGROUPs.

Here is an example to illustrate the Best-kGROUP query. According to the database D in Table 2-1, we can obtain a table keeping all groups sorted by their aggregated probabilities, shown in Table 3-2. Suppose we submit a Best-2GROUP query to D. Then, we need to find the HiRank vectors of all 2GROUPs. The information used to find the results of the Best-2GROUP query is shown in Table 3-3. Thus, {{A}, {A, C}} and {{B}, {B, C}} are both answers of the Best-2GROUP query.

Table 3-2: The table with all groups and their aggregated probabilities.

1-game        2-game          3-game
{B} 0.52      {A, C} 0.40     {A, B, C} 1.000
{A} 0.24      {B, C} 0.36
{C} 0.24      {A, B} 0.24

Table 3-3: Information of selecting the best 2GROUPs of the dataset in D.

2GROUP            HiRank vector   Note
{{A}, {A, C}}     <2, 1>
{{B}, {B, C}}     <1, 2>
{{B}, {A, B}}     <1, 3>          {{B}, {B, C}} is better than it

3.2 The Basic Algorithms

For comparison with our basic algorithms, we first introduce a naïve approach to discover the results of the Best-kGROUP query. In the naïve approach, we generate all the k-groups, and then, for each k-group, create a kGROUP containing this k-group and its sub-groups according to Definition 3-4. After we obtain the HiRank vectors of all kGROUPs, we can compare them with each other to decide which is better. Then, after those kGROUPs that are worse than others are removed, the remaining ones are the best.
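The naïve approach can be traced on the running example with a brute-force sketch. It enumerates every possible world of the tuples in Table 3-1 instead of using the thesis's algorithms, so it is only practical for tiny inputs. All function names are assumptions introduced here, ties in rank are assumed to share the same rank, and exact fractions are used so that equal probabilities compare as equal.

```python
from fractions import Fraction as F
from itertools import combinations, product

# Tuples of Table 3-1 grouped by player: (score, probability).
D = {
    "A": [(279, F("0.2")), (266, F("0.2")), (252, F("0.6"))],
    "B": [(300, F("0.2")), (275, F("0.4")), (200, F("0.4"))],
    "C": [(265, F("1.0"))],
}

def aggregated_probabilities(db):
    """Probability that each group is exactly the top-i set (world enumeration)."""
    players = sorted(db)
    agg = {}
    for world in product(*(db[p] for p in players)):
        prob = F(1)
        for _, pr in world:
            prob *= pr
        scores = {p: s for p, (s, _) in zip(players, world)}
        ranked = sorted(players, key=scores.get, reverse=True)
        for i in range(1, len(ranked) + 1):
            g = frozenset(ranked[:i])
            agg[g] = agg.get(g, F(0)) + prob
    return agg

def rank(group, agg, players):
    """Rank among all groups of the same size; 1 is best, ties share a rank."""
    p = agg.get(group, F(0))
    rivals = combinations(players, len(group))
    return 1 + sum(agg.get(frozenset(r), F(0)) > p for r in rivals)

def k_groups(agg, players, k):
    """One kGROUP per k-group: its best sub-group of every size (Definition 3-4)."""
    prob = lambda g: agg.get(frozenset(g), F(0))
    return [[frozenset(max(combinations(kg, i), key=prob)) for i in range(1, k + 1)]
            for kg in combinations(players, k)]

def is_better(vp, vq):
    """P is better than Q if it holds more preferences (a lower rank wins)."""
    return sum(a < b for a, b in zip(vp, vq)) > sum(b < a for a, b in zip(vp, vq))

def best_k_groups(db, k):
    """Map each best kGROUP (as sorted tuples) to its HiRank vector."""
    players = sorted(db)
    agg = aggregated_probabilities(db)
    vectors = {tuple(tuple(sorted(g)) for g in kg):
               tuple(rank(g, agg, players) for g in kg)
               for kg in k_groups(agg, players, k)}
    return {kg: v for kg, v in vectors.items()
            if not any(is_better(w, v) for w in vectors.values())}
```

Calling `best_k_groups(D, 2)` keeps the two 2GROUPs with HiRank vectors <2, 1> and <1, 2>, in line with Table 3-3.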

However, this approach suffers from a high computation cost. Therefore, we propose the basic algorithms to improve it.

First, we generate all groups with their aggregated probability based on dynamic programming. Our algorithm GroupGen is shown in Figure 3-1. The input of GroupGen is a database in the tuple concept ordered by score. In a dynamic programming based approach, tuples should be independent; otherwise, we cannot reuse the previous computation to reduce the computing time. However, tuples in D are often related to other tuples, i.e. they belong to the same player. That is, we cannot compute the correct probabilities from only one scan. Instead, we perform an incremental method. Each time we retrieve a new tuple in D, we perform a new scan in reverse, starting from the new tuple (line 5-8) and stopping at the first tuple in D. However, in each scan, tuples related to this new tuple should be removed, and the others can be unified into one per player. We use MEHash to keep the information of all unified probabilities (line 4). Hence, only the unified probabilities other than the one related to the new tuple need computing (line 9-12). Furthermore, in one scan, GroupGen generates only part of the aggregated probability of the groups related to the tuples scanned so far, so we keep these probabilities in cHash temporarily (line 1 and 13). After we retrieve all tuples in D, the aggregated probabilities of the groups in cHash become complete. Then, all groups are separated by size and sorted by aggregated probability as the result to return (line 15-16).
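The incremental idea can be illustrated with a simplified sketch. It is not a transcription of GroupGen in Figure 3-1: the names `group_gen`, `seen`, and `c_hash` are assumptions (the last two loosely mirror the roles of MEHash and cHash), scores are assumed to be distinct across players, and the tuples are those of Table 3-1. For each new tuple, every other player is reduced to a single unified probability of having scored higher, and the groups in which the new tuple's player is the lowest-ranked member are extended one player at a time.

```python
from fractions import Fraction as F

# Tuples of Table 3-1: (player, score, probability).
TUPLES = [
    ("A", 279, F("0.2")), ("A", 266, F("0.2")), ("A", 252, F("0.6")),
    ("B", 300, F("0.2")), ("B", 275, F("0.4")), ("B", 200, F("0.4")),
    ("C", 265, F("1.0")),
]

def group_gen(tuples, k):
    """Aggregated probability of every group of size 1..k (simplified sketch)."""
    ordered = sorted(tuples, key=lambda t: t[1], reverse=True)
    players = sorted({p for p, _, _ in ordered})
    seen = {p: F(0) for p in players}  # unified probability of scoring higher
    c_hash = {}                        # partial aggregated probabilities
    for player, _score, prob in ordered:
        # Start with the group holding only the new tuple's player.
        table = {frozenset([player]): prob}
        for other in players:
            if other == player:
                continue
            u = seen[other]
            nxt = {}
            for g, pr in table.items():
                if u > 0 and len(g) < k:   # 'other' scored higher: extend the group
                    ext = g | {other}
                    nxt[ext] = nxt.get(ext, F(0)) + pr * u
                if u < 1:                  # 'other' scored lower: group unchanged
                    nxt[g] = nxt.get(g, F(0)) + pr * (1 - u)
            table = nxt
        for g, pr in table.items():
            c_hash[g] = c_hash.get(g, F(0)) + pr
        seen[player] += prob
    return c_hash

result = group_gen(TUPLES, 3)
```

On this input the sketch reproduces the aggregated probabilities listed in Table 3-2, for example 0.52 for {B} and 0.40 for {A, C}. Skipping the branches whose unified probability is 0 or 1 avoids creating groups with probability 0.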

Figure 3-1: Algorithm GroupGen.

In Figure 3-2, we illustrate the process of updating the dynamic programming table. In each scan of a new tuple, we put the object related to the new tuple onto the last row and the other objects that have appeared so far onto the higher rows of the dynamic programming table. Note that the order of the other objects does not matter since they are all independent. We start from the bottom left cell and append a new group {O'} to the cell. Since no 2-group can be generated from only one object, we stop the update of the last row and scan upwards. For a cell X, the groups inside are generated in two ways. One is from the lower left cell A. In this case, all groups in A are extended with $O_j$ and appended to X, and the probability of each group is multiplied by the existence probability of $O_j$. The other is from the lower cell B. This time we simply append all groups in B to X with their probability multiplied by the non-existence probability of $O_j$, that is, $1 - \Pr(O_j)$. When we finish updating all cells in the table, the groups in the first row (the gray region) are the ones with the correct probability in this scan.

Figure 3-2: Illustration of updating the dynamic programming table.

Since the non-skyline kGROUPs must be worse than a specific kGROUP by Definition 3-6 and Definition 3-8, we can ignore them in this step. We will introduce SubsetFilter to retrieve the skyline kGROUPs without computing all HiRank vectors. However, before we discuss the algorithm SubsetFilter, we need to show some properties of the kGROUPs.

Property 1. For a kGROUP G and its HiRank vector $V_G$, any other kGROUP having an i-group whose rank is higher than $V_G.i$ (the value in dimension i of $V_G$) would never be dominated by G.

Proof: Assume that a kGROUP A is dominated by G and has some i-group whose rank is higher than that of the i-group in G. Here we denote the HiRank vector of A as $V_A$. We can obtain two formulas from this assumption. One is $\forall i,\ V_A.i \ge V_G.i \ \wedge\ \exists j,\ V_A.j > V_G.j$, and the other is $V_A.i < V_G.i$ for the dimension i of that i-group. However, these two formulas contradict each other. Hence, the assumption is false, and a kGROUP having some i-group whose rank is higher than $V_G.i$ cannot be dominated by G.

It is easy to verify the correctness of this property. Recall the definition of dominating and consider its contrapositive: a vector A does not dominate another vector B if the value of A is greater than that of B in some dimension. Hence, the kGROUPs whose k-group extends any i-group whose rank is higher than $V_G.i$ must have an i-group whose rank is higher than that of G in dimension i. This is why we claim they cannot be dominated by G.

Property 2. If a kGROUP G has a k-group with the highest aggregated probability and there is no other k-group with the same probability, this kGROUP G must be a skyline kGROUP.

Proof: We prove this by the contrapositive. If G is a non-skyline kGROUP, there must be a kGROUP A dominating G. That is, $\forall i,\ V_A.i \le V_G.i \ \wedge\ \exists j,\ V_A.j < V_G.j$. In dimension k, there are two situations. One is when $V_A.k < V_G.k$. In this case G does not have a k-group with the highest aggregated probability. The other is when $V_A.k = V_G.k$. In this case there are at least two kGROUPs having a k-group with the same aggregated probability. Since the contrapositive is true, Property 2 is proven.

To use this property, we assume that each time we identify a skyline kGROUP, we prune all kGROUPs dominated by it. That is, the kGROUPs not removed so far cannot be dominated by any reported skyline kGROUP. Moreover, the kGROUP having the k-group with the highest aggregated probability cannot be dominated by any other kGROUP whose k-group has a lower rank. Since this kGROUP cannot be dominated, it must be a skyline kGROUP. Note that if more than one kGROUP has a k-group with the highest aggregated probability, at least one of them must be a skyline kGROUP.

The algorithm SubsetFilter, shown in Figure 3-3, takes all groups as input and outputs all skyline kGROUPs. At first, S is used to keep all the possible answers (line 1). We start from the kGROUP G having the k-group with the highest aggregated probability according to Property 2 (line 3-8). Then, we use Property 1 to keep those k-groups that extend i-groups whose ranks are higher than v.i for some i-game, and remove the other k-groups, since the kGROUPs created from them must be non-skyline ones (line 15-19). After adding G into S (line 20), we repeat the whole process until the input set K is empty. Finally, we can return all skyline kGROUPs.
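The way Property 2 seeds the skyline can be illustrated with a small sort-then-filter sketch over HiRank vectors that are already known. This is a stand-in and not SubsetFilter itself, which avoids computing all HiRank vectors; the function names, the string labels, and the tie-breaking sort key are assumptions. The vectors are those of Table 3-3, with the third kGROUP labeled only by its k-group.

```python
def dominates(vp, vq):
    """HiRank dominance of Definition 3-8: a smaller rank value is better."""
    return all(a <= b for a, b in zip(vp, vq)) and any(a < b for a, b in zip(vp, vq))

def skyline_kgroups(vectors):
    """Return the labels of skyline kGROUPs given {label: HiRank vector}.

    kGROUPs are visited by the rank of their k-group (last dimension), so the
    first one visited is a skyline kGROUP, as Property 2 suggests. Breaking
    ties by the sum of ranks guarantees that a dominating kGROUP is always
    visited before the kGROUPs it dominates.
    """
    order = sorted(vectors, key=lambda g: (vectors[g][-1], sum(vectors[g])))
    sky = []
    for g in order:
        if not any(dominates(vectors[s], vectors[g]) for s in sky):
            sky.append(g)
    return sky

# HiRank vectors from Table 3-3.
table_3_3 = {
    "{{A}, {A, C}}": (2, 1),
    "{{B}, {B, C}}": (1, 2),
    "kGROUP of {A, B}": (1, 3),
}
sky = skyline_kgroups(table_3_3)
```

Here the kGROUP with vector <1, 3> is dropped because <1, 2> dominates it, and the other two remain as skyline kGROUPs.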

Figure 3-3: Algorithm SubsetFilter.

After we retrieve all skyline kGROUPs, we can directly compare them with each other to obtain the best one, since the number of skyline kGROUPs is much smaller than the number of original kGROUPs. The algorithm to retrieve the best kGROUPs is shown in Figure 3-4.

Figure 3-4: Algorithm BestGROUP.

Note that the best kGROUPs from all players are supposed to have the largest chance to be better than the other ones. That is, when we randomly select a kGROUP x from the other ones, the probability of the best kGROUPs being better than x should be high. From another point of view, the number of kGROUPs dominated by a skyline kGROUP can be viewed as a lower bound of the number of kGROUPs that are worse than it. A skyline kGROUP x that is better than another skyline kGROUP y must have a larger such number than y, since when x is better than y, x is also better than any kGROUP dominated by y. In this paper, the best kGROUPs are defined as the ones that are not worse than any other ones. That is, no other kGROUPs have a larger number than the best ones. This gives the answer the largest chance to be the best kGROUPs.

3.3 The Heuristic Approaches

We analyze the time complexity of GroupGen in terms of the cells of the dynamic programming table. However, in each iteration, a cell of the dynamic programming table contains up to $\binom{n}{i}$ groups, where i is the size of the groups. That is, when we consider the computation inside each cell, the total complexity grows by this combinatorial factor. This is why we turn to heuristics to reduce the cost.

An observation is that the probabilities of many groups are low or even equal to zero. Our first attempt, called IgnoreGroupGen, ignores the groups with probability 0. IgnoreGroupGen makes no difference in the membership of skyline kGROUPs. However, the cost of this attempt might still be high. Another attempt, LimitGroupGen, restricts the size of each cell. LimitGroupGen introduces some inaccuracy but reduces the cost greatly. It only keeps the m(α) groups with the highest probability in each cell, where m(α) is defined as follows.

$$
m(\alpha) =
\begin{cases}
\min\left(\binom{n}{k},\ \alpha \cdot \log\binom{n}{k}\right), & k < \frac{n}{2} \\
\min\left(\binom{n}{\lceil n/2 \rceil},\ \alpha \cdot \log\binom{n}{\lceil n/2 \rceil}\right), & k \ge \frac{n}{2}
\end{cases}
\qquad \text{(Equation 1)}
$$

In Equation 1, we define m(α) with respect to the log of the maximum value of $\binom{n}{i}$, where α is a scale parameter and i ranges from 1 to k. If k is less than $\frac{n}{2}$, the maximum value is $\binom{n}{k}$; otherwise it is $\binom{n}{\lceil n/2 \rceil}$. Note that when k or n is too small, α times the log of $\binom{n}{k}$ (or of $\binom{n}{\lceil n/2 \rceil}$) might exceed $\binom{n}{k}$ (or $\binom{n}{\lceil n/2 \rceil}$) itself, and we choose $\binom{n}{k}$ (or $\binom{n}{\lceil n/2 \rceil}$) in these cases. More analyses of the heuristic approaches will be shown in the Experiment Section.
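As a worked illustration of the cell limit, the sketch below evaluates m(α) for the n = 20 players used later in the experiments. The function name `cell_limit` is hypothetical, and three details are assumptions that the text does not fix: the natural logarithm, rounding down to an integer, and a lower guard of one group per cell.

```python
from math import ceil, comb, floor, log

def cell_limit(alpha: float, n: int, k: int) -> int:
    """Number of groups kept per cell under the limit m(alpha).

    Assumptions: natural logarithm, result rounded down, at least one group kept.
    """
    largest = comb(n, k) if k < n / 2 else comb(n, ceil(n / 2))
    limit = min(largest, alpha * log(largest))
    return max(1, floor(limit))

# With n = 20 and k = 3 there are comb(20, 3) = 1140 possible 3-groups.
small_alpha = cell_limit(5, 20, 3)      # alpha * log(1140) is about 35.2
large_alpha = cell_limit(1000, 20, 3)   # capped by the 1140 groups that exist
```

For a small α the limit prunes most of a cell, while for a large α the cap is the number of groups that actually exist, so no group is dropped.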

4. EXPERIMENTS

In this section, we set up experiments to compare our Best-kGROUP query with other uncertain top-k queries proposed in previous work. Here we consider the following two algorithms: U-Topk and Topk-PNN. Besides, we use the expected value approach as the baseline of our experiments since it is the most straightforward one.

In our experiments, we generate 20 players and their scores. Each player has at least one score and at most ten scores in the distribution. Note that no two players have the same score.

Here we compare the probability of defeating other groups in a composite competition and the execution time. In addition, we also examine the accuracy of LimitGroupGen.

4.1 Experiment Setup

Before we start the experiments, we should first define the inputs. The input is the kSET, which is a set of k groups and includes one group of each size. Each i-group in a kSET must be a sub-group of the k-group in the same kSET. From this definition, every kGROUP is also a kSET, so we can take the answer of the Best-kGROUP query as a kSET. For each algorithm except our Best-kGROUP query, the kSET consists of the i-groups that are the answers of the query with parameter i, for each i. However, when an answer is not a sub-group of the k-group in the kSET, we randomly choose one of the sub-groups of the k-group, since the algorithm does not provide other answers.

In the competition experiments, we run 20 composite competitions per iteration, and in each iteration we randomly generate a kSET to compete with the one produced by the algorithm being compared. These 20 composite competitions consist of different numbers of games. A competition with k players is composed of k distinct games, from the 1-game to the k-game. In a composite competition with k players, we input two kSETs and obtain the better one of this competition. Note that the better kSET here is defined in the same way as the better kGROUP, and so is the best kSET. Here the i-group of each kSET is for the i-game. We sum up the counts of preference of these two kSETs. The one with more preferences defeats the other in this competition. We use the probability of defeating other kSETs over 100 iterations to compare all algorithms.

Here we introduce the way to perform the experiment in each i-game. Since each player has many scores as a discrete distribution, we need to model the distribution rather than use the expected value to simplify it. We use a straightforward approach to model the distributions of all players. We randomly pick a score from a player's distribution as the performance of this player. Assuming that random picking fits the probability model when repeated in large quantities, we perform this step MAX_BOUND times (line 21). Then, in each iteration, we sort all players by their randomly picked scores. Any i-group that consists of exactly the top-i players obtains one count. The i-group with more counts after MAX_BOUND iterations is the one gaining a preference. The detailed process of the experiment is shown in Figure 4-1.
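The competition between two kSETs described above can be sketched as a small Monte Carlo simulation. This is only an illustration under stated assumptions and not the procedure of Figure 4-1: the names `sample_ranking` and `compete`, the default `max_bound`, and the fixed seed are all hypothetical, and each player's distribution is given as a list of (score, probability) pairs.

```python
import random

def sample_ranking(players, rng):
    """Draw one score per player and return the player names, best first."""
    drawn = {}
    for name, dist in players.items():
        scores = [s for s, _ in dist]
        weights = [p for _, p in dist]
        drawn[name] = rng.choices(scores, weights=weights)[0]
    return sorted(drawn, key=drawn.get, reverse=True)

def compete(kset_a, kset_b, players, max_bound=1000, seed=0):
    """Return the number of preferences won by each kSET over k games."""
    rng = random.Random(seed)
    k = len(kset_a)
    counts_a = [0] * k
    counts_b = [0] * k
    for _ in range(max_bound):
        ranking = sample_ranking(players, rng)
        for i in range(1, k + 1):
            top_i = frozenset(ranking[:i])
            # An i-group earns a count when it is exactly the top-i players.
            counts_a[i - 1] += top_i == frozenset(kset_a[i - 1])
            counts_b[i - 1] += top_i == frozenset(kset_b[i - 1])
    pref_a = sum(a > b for a, b in zip(counts_a, counts_b))
    pref_b = sum(b > a for a, b in zip(counts_a, counts_b))
    return pref_a, pref_b

# Hypothetical players with certain scores, so the outcome is deterministic.
players = {"X": [(3, 1.0)], "Y": [(2, 1.0)], "Z": [(1, 1.0)]}
outcome = compete([{"X"}, {"X", "Y"}], [{"Z"}, {"Y", "Z"}], players, max_bound=10)
```

With certain scores the first kSET matches the top-1 and top-2 players in every draw, so it wins both preferences and defeats the second kSET.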

Figure 4-1: Experiment process.

4.2 On Execution Time

In Figure 4-2, we first compare the execution time among these algorithms on two types of datasets. One follows a uniform distribution and the other follows a normal distribution. Here we do not run the algorithm GroupGen because it costs too much space, but we know that it must take longer than IgnoreGroupGen does.

The average execution time on the normal distribution datasets is larger, since we have to model the normal distribution with almost 10 scores to keep its characteristic, while the datasets of uniform distribution are unrestricted.

The baseline and Topk-PNN cost a constant time for every value of k on datasets of both distributions. However, on the datasets of uniform distribution, the time U-Topk costs rises sharply as k increases, because U-Topk needs to scan the whole database to find the answers. Also, the time costs of IgnoreGroupGen and LimitGroupGen are almost the same, because the score ranges of the players overlap a lot and the probability of a group might be small but is hardly ever 0. In this case, the drawback of LimitGroupGen, namely sorting all groups in each cell, becomes more obvious. On the other hand, on the datasets of normal distribution, overlapping is reduced, so U-Topk runs in constant time, too. Since the probability of a group becomes 0 when all the scores of some player not in this group have appeared, less overlapping leads to more groups with probability 0. Here, IgnoreGroupGen takes almost twice the time that LimitGroupGen does. In other words, the number of groups ignored by LimitGroupGen might be about twice that ignored by IgnoreGroupGen.

Figure 4-2: Execution time among different algorithms. Two panels, "Time cost on uniform distribution datasets" and "Time cost on normal distribution datasets", plot the execution time (millisecond) against the composite competitions with different k (1 to 20) for Baseline, U-Topk, Topk-PNN, Best-kGROUP (Naïve approach), Best-kGROUP (IgnoreGroupGen), and Best-kGROUP (LimitGroupGen).

4.3 On Accuracy

Here we compare the accuracy between GroupGen and LimitGroupGen. The accuracy is shown in Figure 4-3. The F1-measure is computed from the values of precision and recall, as their harmonic mean. We find all best kGROUPs from GroupGen and LimitGroupGen separately. The precision is the percentage of the kGROUPs matched in the two approaches over all those generated by LimitGroupGen, while the recall is the percentage over all those generated by GroupGen. Here the value of α can be 5, 100, 500, and 1000. The larger the value of α is, the higher the accuracy is. However, the difference is not obvious in Figure 4-3. It might be because of the small size of our datasets. The values of the F1-measure are almost 1, but when k is 8, the value is relatively low. This is because the groups ignored in the cells of a dynamic programming table affect the rankings of the related groups, so that some kGROUPs have wrong HiRank vectors and the algorithm reports the wrong answer. Hence, the value of the F1-measure depends only on this effect, and we cannot estimate a trend in the F1-measure.

Figure 4-3: Accuracy of LimitGroupGen. The panel "Accuracy difference of α" plots the average F1-measure against the composite competitions with different k (1 to 20) for α = 5, α = 100, α = 500, and α = 1000.
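The accuracy measure can be written out as a short sketch. The function name `f1_measure` and the sample kGROUP labels are hypothetical; the exact set stands for the best kGROUPs found with GroupGen and the approximate set for those found with LimitGroupGen.

```python
def f1_measure(exact: set, approx: set) -> float:
    """F1 of the approximate answer set against the exact answer set."""
    matched = len(exact & approx)
    if matched == 0:
        return 0.0
    precision = matched / len(approx)   # matched over all from LimitGroupGen
    recall = matched / len(exact)       # matched over all from GroupGen
    return 2 * precision * recall / (precision + recall)

# Hypothetical answer sets: one of three approximate answers is correct,
# and one of two exact answers is found.
exact_best = {"g1", "g2"}
approx_best = {"g1", "g3", "g4"}
score = f1_measure(exact_best, approx_best)
```

In this example the precision is 1/3 and the recall is 1/2, which gives an F1-measure of 0.4.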

4.4 On Performance of the Composite Competition

In this section, we compare the performance of the expected value approach (baseline), U-Topk, Topk-PNN, and Best-kGROUP. We use the experiment process described in Section 4.1.

Figure 4-4: Probability of defeating other kSETs among different algorithms. The panel "Performance on datasets over two distributions" plots the probability of defeating other kSETs against the composite competitions with different k (1 to 20) for Baseline, U-Topk, Topk-PNN, and Best-kGROUP.

From Figure 4-4, we can see that the probability of defeating other kSETs decreases as the value of k increases for the baseline, U-Topk, and Topk-PNN. The baseline reduces the multiple scores to an expected value, and although its answer in an i-game is the i players with the highest expected values, the real probability of being the top-i players is not considered. In Topk-PNN, although the real probability is computed, it does not consider putting k players in one group. Instead, it puts each player in a single group and independently computes the probability of each group. U-Topk, in contrast, considers the real probability and puts k players into one group, but its algorithm is tuple-wise and incoherent. The term "tuple-wise" means that each i-group represents a specific performance of i players, so the consideration is not comprehensive. Moreover, when a group is reported as the answer of U-Topk, the players of any of its sub-groups may not have a high probability of being the top-i players. When k increases and the competition becomes complicated, these three algorithms cannot give a good answer for the competition.

Therefore, the Best-kGROUP query outperforms the other algorithms in selecting the best kGROUP for a composite competition. Also, as the competition becomes more and more complicated, the advantage of the Best-kGROUP query becomes more distinct.
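The weakness of the expected value baseline can be seen on the small database of Table 3-1. The sketch below is illustrative only: the function names are hypothetical, and the 1-game probabilities are copied from Table 3-2 rather than recomputed.

```python
# Table 3-1 grouped by player: (score, probability).
D = {
    "A": [(279, 0.2), (266, 0.2), (252, 0.6)],
    "B": [(300, 0.2), (275, 0.4), (200, 0.4)],
    "C": [(265, 1.0)],
}

# Aggregated probabilities of the 1-groups, copied from Table 3-2.
TOP1_PROBABILITY = {"B": 0.52, "A": 0.24, "C": 0.24}

def expected_score(dist):
    """Expected value of a discrete score distribution."""
    return sum(score * prob for score, prob in dist)

def baseline_group(db, i):
    """The i players with the highest expected scores (the baseline answer)."""
    return sorted(db, key=lambda name: expected_score(db[name]), reverse=True)[:i]

baseline_top1 = baseline_group(D, 1)[0]
best_top1 = max(TOP1_PROBABILITY, key=TOP1_PROBABILITY.get)
```

The expected scores are roughly 260.2 for A, 250 for B, and 265 for C, so the baseline selects C for the 1-game. However, C is the top-1 player with probability 0.24 only, while B is the top-1 player with probability 0.52.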

5. CONCLUSIONS AND FUTURE WORK

In this paper, we propose a novel problem of selecting the best kGROUP in a composite competition. The concept of the composite competition is not considered in previous work, so their answers are not suitable for addressing this problem.

In addition, we use heuristics to reduce the cost of generating all groups with aggregated probability. With a skyline filter, we can remove kGROUPs with no chance to be the answer. The SubsetFilter algorithm furthermore exploits the relationship among dimensions to retrieve the skyline without redundant steps. Also, the experiments show the value of our heuristic approaches and the superior performance of our Best-kGROUP query.

Our future work will be solving the problem in a more realistic environment. The databases should be updated when players play new games and have new records. Reusing past computing results would reduce the cost but is challenging. Besides, the current uncertain model uses a discrete distribution. If we discuss objects with continuous distributions, such as moving objects, what the best uncertain model is should be considered first. Moreover, the algorithm to generate all groups with probability is costly. In this paper, we use heuristics to reduce the cost, and we are looking for other algorithms to improve the complexity. Overall, this paper only describes the first and simple step of our work, and we believe further research would also be interesting.

6. REFERENCES

[1] I. F. Ilyas, G. Beskales, and M. A. Soliman. 2008. A Survey of Top-k Query Processing Techniques in Relational Database System. ACM Computing Surveys, Vol. 40, No. 4, pp. 11:1-11:58.

[2] C. C. Aggarwal and P. S. Yu. 2009. A Survey of Uncertain Data Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 5, pp. 609-623.

[3] M. Soliman, I. Ilyas, and K. Chang. 2007. Top-k Query Processing in Uncertain Databases. In Proceedings of the 23rd International Conference on Data Engineering (ICDE), pp. 896-905.

[4] M. Hua, J. Pei, W. Zhang, and X. Lin. 2008. Efficiently Answering Probabilistic Threshold Top-k Queries on Uncertain Data. In Proceedings of the 24th International Conference on Data Engineering (ICDE), pp. 1403-1405.

[5] T. Ge, S. Zdonik, and S. Madden. 2009. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. In Proceedings of the 35th International Conference on Management of Data (SIGMOD), pp. 375-388.

[6] C. Jin, K. Yi, L. Chen, J. X. Yu, and X. Lin. 2008. Sliding-Window Top-k Queries on Uncertain Streams. In Proceedings of the VLDB Endowment, Vol. 1, No. 1, pp. 301-312.

[7] G. Beskales, M. A. Soliman, and I. F. Ilyas. 2008. Efficient Search for the Top-k Probable Nearest Neighbors in Uncertain Databases. In Proceedings of the VLDB Endowment, Vol. 1, No. 1, pp. 326-339.

[8] M. L. Yiu and N. Mamoulis. 2007. Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data. In Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 483-494.

[9] C.-Y. Chan, H. V. Jagadish, K.-L. Tan, A. K. H. Tung, and Z. Zhang. 2006. Finding k-Dominant Skylines in High Dimensional Space. In Proceedings of the 32nd International Conference on Management of Data (SIGMOD), pp. 503-514.
