

Related Work

5.2 Multi-dimensional Index

Because cloud data management systems scale well, a growing number of recent works build indexes on top of them. The B-tree is a commonly used index structure. The work in [23] presents a scalable B-tree based indexing scheme that builds a local B-tree over the dataset stored on each compute node and a Cloud Global index, called the CG-index, over the compute nodes. However, a B-tree index cannot support multi-dimensional queries effectively. Much work has also been done on R-tree index structures for multi-dimensional data, such as [21, 24, 13]. [21] presents RT-CAN, a multi-dimensional indexing scheme built on top of local R-tree indexes that dynamically selects a portion of the local R-tree nodes to publish to the global index. Although RT-CAN uses R-tree indexing, it builds the R-tree on the authors' own distributed system, epiC. [24] combines the R-tree and the k-d tree into one index structure, and the work in [13] presents an approach for constructing a block-based hierarchical R-tree index structure. These works all build an index structure on the Hadoop distributed file system or Google’s file system to support multi-dimensional queries.

MD-HBase is a data management system based on HBase that uses a Quad tree and a k-d tree coupled with Z-ordering to index multi-dimensional data for location-based services (LBSs). The keys of MD-HBase are the Z-values of the dimensions being indexed. It uses a trie-based approach to split the space into equal-sized subspaces and builds the Quad tree and k-d tree index structures on the key-value data model. Moreover, MD-HBase proposes a novel naming scheme, called longest common prefix naming, for efficient index maintenance and query processing. Although the experiments on MD-HBase show that the proposed indexing method is efficient for multi-dimensional data, MD-HBase has some constraints. Before describing them, we note a characteristic of data access in cloud data managements that we discovered through experiment: a trade-off exists between the number of points retrieved by getting one key and the number of keys that must be scanned. Reducing the number of points per key increases the number of keys to scan, and vice versa. The Quad tree and the k-d tree split space in a fixed way, which may leave some nodes storing zero points. In addition, neither tree can balance the number of points stored in each node, because neither restricts the minimum number of points in one subspace. Therefore, if we regard one node as one key, the keys will store unbalanced numbers of data points, especially when the data is not uniform. Figure 1.3 is a Quad tree example of space splitting for MD-HBase. According to the data points in the map, the Quad tree splits the whole space three times. The red lines show the splitting result, and each black grid cell has its Z-ordering value. For instance, the Z-ordering value of (0,0) is 000000 and that of (1,0) is 000010. The key of each region bounded by the red lines is then the common prefix of the Z-ordering values of its sub-regions. Consequently, there are 10 keys: 000000, 000001, 000010, 000011, 0001*, 0010*, 0011*, 01*, 10* and 11*. However, some regions may contain no data points. As mentioned above, the Quad tree and the k-d tree cannot handle non-uniformly distributed data efficiently.
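The Z-ordering and prefix naming described above can be illustrated with a minimal Python sketch. It is not MD-HBase's implementation. It assumes an 8x8 grid (3 bits per dimension) and an interleaving order that puts the x bit before the y bit, which is the order consistent with the two Z-values quoted in the text. The function names and the sample points are hypothetical and are not taken from Figure 1.3.

```python
from collections import Counter

BITS = 3  # assumed bits per dimension, giving the 6-bit Z-values used in the text

# The ten keys listed in the text, with the trailing '*' dropped.
KEYS = ["000000", "000001", "000010", "000011",
        "0001", "0010", "0011", "01", "10", "11"]


def z_value(x: int, y: int, bits: int = BITS) -> str:
    """Interleave the bits of x and y (x bit first) into a binary string."""
    out = []
    for i in range(bits - 1, -1, -1):
        out.append(str((x >> i) & 1))
        out.append(str((y >> i) & 1))
    return "".join(out)


def longest_common_prefix(a: str, b: str) -> str:
    """Return the longest prefix shared by two Z-values."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return a[:n]


def key_for(z: str, keys=KEYS) -> str:
    """Return the region key that is a prefix of the Z-value z."""
    for k in keys:
        if z.startswith(k):
            return k
    raise ValueError(f"no key covers {z}")


def points_per_key(points, keys=KEYS) -> dict:
    """Count how many points fall under each region key."""
    counts = Counter(key_for(z_value(x, y), keys) for x, y in points)
    return {k: counts[k] for k in keys}


# Hypothetical skewed data set: a few points near the origin, a cluster far away.
SAMPLE_POINTS = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 6), (6, 6), (7, 7), (6, 5)]

if __name__ == "__main__":
    print(z_value(0, 0), z_value(1, 0))
    print(longest_common_prefix(z_value(0, 2), z_value(1, 3)))
    for key, count in points_per_key(SAMPLE_POINTS).items():
        print(f"{key:<6} {count}")
```

With these sample points, several keys receive no points while one key receives half of them, which is the imbalance the paragraph above attributes to fixed space splitting.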

Chapter 6 Conclusion

We proposed a scalable multi-dimensional index, the KR+-index, built on existing CDMs such as HBase and Cassandra. It supports efficient multi-dimensional range queries and nearest neighbor queries. We used the R+-tree to construct the index structure and designed the key for efficient data access. In addition, we redefined the spatial query algorithms, including the range query and the k-NN query, for KR+. Because KR+ takes the characteristics of these CDMs into account, it proved much more efficient than the other index methods in our experiments. In our evaluation on a cluster of 8 nodes, KR+ handled range queries and k-NN queries efficiently. We also compared it with related work, MD-HBase, and the results showed that KR+ performs better than MD-HBase, especially for skewed data.

Bibliography

[1] http://en.wikipedia.org/wiki/box

[2] N. Beckmann, H.P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles, volume 19. ACM, 1990.

[3] J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

[4] T. Bially. Space-filling curves: Their generation and their application to bandwidth reduction. Information Theory, IEEE Transactions on, 15(6):658–664, 1969.

[5] A.R. Butz. Convergence with Hilbert’s space filling curve. Journal of Computer and System Sciences, 3(2):128–146, 1969.

[6] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R.E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):1–26, 2008.

[7] R.A. Finkel and J.L. Bentley. Quad trees: a data structure for retrieval on composite keys. Acta Informatica, 4(1):1–9, 1974.

[8] J. Gray et al. The transaction concept: Virtues and limitations. In Proceedings of the Very Large Database Conference, pages 144–154. Citeseer, 1981.

[9] A. Guttman. R-trees: a dynamic index structure for spatial searching, volume 14. ACM, 1984.

[10] I. Kamel and C. Faloutsos. Hilbert R-tree: An improved R-tree using fractals. 1993.

[11] A. Khetrapal and V. Ganesh. HBase and Hypertable for large scale distributed storage systems. Dept. of Computer Science, Purdue University.

[12] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.

[13] H. Liao, J. Han, and J. Fang. Multi-dimensional index on Hadoop distributed file system. In Networking, Architecture and Storage (NAS), 2010 IEEE Fifth International Conference on, pages 240–249. IEEE, 2010.

[14] G.M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. IBM, Ottawa, Canada, 1966.

[15] S. Nishimura, S. Das, D. Agrawal, and A. El Abbadi. MD-HBase: A scalable multi-dimensional data infrastructure for location aware services.

[16] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Computing Science Department, University of Alberta, Edmonton, Canada T6G 2E8, 2006.

[17] J.T. Robinson. The K-D-B-tree: a search structure for large multidimensional dynamic indexes. In Proceedings of the 1981 ACM SIGMOD international conference on Management of data, pages 10–18. ACM, 1981.

[18] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases, pages 507–518. Citeseer, 1987.

[19] M. Stonebraker. SQL databases v. NoSQL databases. Communications of the ACM, 53(4):10–11, 2010.

[20] J. Varia. Cloud architectures. White Paper of Amazon, jineshvaria.s3.amazonaws.com/public/cloudarchitectures-varia.pdf, 2008.

[21] J. Wang, S. Wu, H. Gao, J. Li, and B.C. Ooi. Indexing multi-dimensional data in a cloud system. In Proceedings of the 2010 international conference on Management of data, pages 591–602. ACM, 2010.

[22] E.W. Weisstein. Box-Muller transformation. MathWorld, Wolfram Research Inc, 1999.

[23] S. Wu, D. Jiang, B.C. Ooi, and K.L. Wu. Efficient B-tree based indexing for cloud data processing. Proceedings of the VLDB Endowment, 3(1-2):1207–1218, 2010.

[24] X. Zhang, J. Ai, Z. Wang, J. Lu, and X. Meng. An efficient multi-dimensional index for cloud data management. In Proceeding of the first international workshop on Cloud data management, pages 17–24. ACM, 2009.
