Big Data Processing Technologies
Chentao Wu
Associate Professor
Dept. of Computer Science and Engineering [email protected]
Schedule
• lec1: Introduction on big data and cloud computing
• Iec2: Introduction on data storage
• lec3: Data reliability (Replication/Archive/EC)
• lec4: Data consistency problem
• lec5: Block storage and file storage
• lec6: Object-based storage
• lec7: Distributed file system
• lec8: Metadata management
Collaborators
Contents
Metadata in DFS
1
Metadata
•
Metadata = structural information File/Objects: attributes in inode/onode
Main problem for metadata in DFS: indexing
Metadata Server in DFS (Lustre)
Metadata Server in DFS (Ceph)
Metadata Server in DFS (GFS)
Metadata Server in DFS (HDFS)
NameNode Metadata in HDFS
•
Metadata in Memory The entire metadata is in main memory
No demand paging of meta-data
•
Types of Metadata List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g creation time, replication factor
•
A Transaction Log Records file creations, file deletions. etc
Metadata level in DFS (Azure)
Partition Layer – Index Range Partitioning
Account Name
Container Name
Blob Name
aaaa aaaa aaaaa
…….. …….. ……..
…….. …….. ……..
…….. …….. ……..
…….. …….. ……..
…….. …….. ……..
…….. …….. ……..
…….. …….. ……..
…….. …….. ……..
…….. …….. ……..
…….. …….. ……..
…….. …….. ……..
zzzz zzzz zzzzz
• Split index into
RangePartitions based on load
• Split at PartitionKey boundaries
• PartitionMap tracks Index RangePartition assignment to partition servers
• Front-End caches the
PartitionMap to route user requests
• Each part of the index is assigned to only one Partition Server at a time
Storage Stamp
Partition Server Partition
Server
Account Name
Container Name
Blob Name richard videos tennis
……… ……… ………
……… ……… ………
zzzz zzzz zzzzz
Account Name
Container Name
Blob Name harry pictures sunset
……… ……… ………
……… ……… ………
richard videos soccer
Partition Server
Partition Master
Front-End Server
PS 2 PS 3
PS 1
A-H: PS1 H’-R: PS2 R’-Z: PS3
A-H: PS1 H’-R: PS2 R’-Z: PS3
Partition Map Blob Index
Partition Map Account
Name
Container Name
Blob Name
aaaa aaaa aaaaa
……… ……… ………
……… ……… ………
harry pictures sunrise
A-H
H’-R R’-Z
Metadata level in DFS (Pangu) Partition layer
Access Layer Restful Protocol
LB LVS
Partition Layer Key-Value Engine
Persistent Layer Pangu FS
Load Balancing
Protocol Manager &
Access Control Partition & Index
Persistent, Redundancy
& Fault-Tolerance
Contents
ISAM & B+ Tree
2
Tree Structures Indexes
•
Recall: 3 alternatives for data entries k*:• Data record with key value k
• <k, rid of data record with search key value k>
• <k, list of rids of data records with search key k>
•
Choice is orthogonal to the indexing technique used to locate data entries k*.•
Tree-structured indexing techniques support both range searches and equality searches. ISAM (Indexed Sequential Access Method): static structure
B+ tree: dynamic, adjusts gracefully under inserts and deletes.
Range Searches
•
Choose``Find all students with gpa > 3.0’’ If data is in sorted file, do binary search to find first such student, then scan to find others.
Cost of binary search can be quite high.
•
Simple idea: Create an `index’ file. Level of indirection again!
Page 1 Page 2 Page 3 Page N Data File
k2 kN
k1 Index File
Can do binary search on (smaller) index file!
ISAM
• Index file may still be quite large. But we can apply the idea repeatedly!
Leaf pages contain data entries
P0 K
1 P
1 K 2 P 2 K
m P m
index entry
Non-leaf Pages
Pages
Overflow
page Primary pages
Leaf
Comments on ISAM
Data PagesIndex Pages
Overflow pages
• File creation: Leaf (data) pages allocated sequentially, sorted by search key.
Then index pages allocated.
Then space for overflow pages.
• Index entries: <search key value, page id>; they `direct’
search for data entries, which are in leaf pages.
• Search: Start at root; use key comparisons to go to leaf.
Cost log F N ; F = # entries/index pg, N = # leaf pgs
• Insert: Find leaf where data entry belongs, put it there.
(Could be on an overflow page).
• Delete: Find and remove from leaf; if empty overflow page, de-allocate.
Static tree structure: inserts/deletes affect only leaf pages.
Example ISAM Tree
10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*
20 33 51 63
40 Root
• Each node can hold 2 entries; no need for `next-
leaf-page’ pointers.
After Inserting 23*, 48*, 41*, 42* ...
10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*
20 33 51 63
40 Root
23* 48* 41*
42*
Overflow Pages Leaf Index Pages
Pages Primary
... then Deleting 42*, 51*, 97*
10* 15* 20* 27* 33* 37* 40* 46* 55* 63*
20 33 51 63
40 Root
23* 48* 41*
Note that 51 appears in index levels , but 51* not in leaf!
Pros, Cons & Usage
•
Pros Simple and easy to implement
•
Cons Unbalanced overflow pages
Index redistribution
•
Usage MS Access
Berkeley DB
MySQL (before 3.23) MyISAM (not real ISAM)
B+ Tree: The Most Widely Used Index
• Insert/delete at log
FN cost; keep tree height-balanced.
(F = fanout, N = # leaf pages)
• Minimum 50% occupancy (except for root). Each
node contains d <= m <= 2d entries. The parameter d is called the order of the tree.
• Supports equality and range-searches efficiently.
Index Entries
Data Entries ("Sequence set") (Direct search)
Example B+ Tree
• Search begins at root, and key comparisons direct it to a leaf (as in ISAM).
• Search for 5*, 15*, all data entries >= 24* ...
Based on the search for 15*, we know it is not in the tree!
Root
17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13
B+ Tree in Practice
• Typical order: 100. Typical fill-factor: 67%.
• average fanout = 133
• Typical capacities:
• Height 4: 1334 = 312,900,700 records
• Height 3: 1333 = 2,352,637 records
• Can often hold top levels in buffer pool:
• Level 1 = 1 page = 8 Kbytes
• Level 2 = 133 pages = 1 Mbyte
• Level 3 = 17,689 pages = 133 MBytes
Inserting a Data Entry into a B+ Tree
• Find correct leaf L.
• Put data entry onto L.
• If L has enough space, done!
• Else, must split L (into L and a new node L2)
• Redistribute entries evenly, copy up middle key.
• Insert index entry pointing to L2 into parent of L.
• This can happen recursively
• To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
• Splits “grow” tree; root split increases height.
• Tree growth: gets wider or one level taller at top.
Example B+ Tree - Inserting 8*
Root
17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13
Example B+ Tree - Inserting 8*
Notice that root was split, leading to increase in height.
In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice.
2* 3*
Root
17
24 30
14* 16* 19* 20* 22* 24* 27*29* 33* 34* 38* 39*
13 5
7*
5* 8*
Inserting 8* into Example B+ Tree
• Observe how
minimum occupancy is guaranteed in both leaf and index pg
splits.
• Note difference
between copy-up and push-up; be sure you understand the
reasons for this.
2* 3* 5* 7* 8*
5
Entry to be inserted in parent node.
(Note that 5 is
continues to appear in the leaf.) s copied up and
appears once in the index. Contrast
5 24 30
17
13
Entry to be inserted in parent node.
(Note that 17 is pushed up and only this with a leaf split.)
…
…
Deleting a Data Entry from a B+ Tree
• Start at root, find leaf L where entry belongs.
• Remove the entry.
• If L is at least half-full, done!
• If L has only d-1 entries,
• Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).
• If re-distribution fails, merge L and sibling.
• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
• Merge could propagate to root, decreasing height.
Example Tree (including 8*) Delete 19* and 20* ...
2* 3*
Root
17
24 30
14* 16* 19* 20* 22* 24* 27*29* 33* 34* 38* 39*
13 5
7*
5* 8*
• Deleting 19* is easy.
Example Tree (including 8*) Delete 19* and 20* ...
• Deleting 19* is easy.
• Deleting 20* is done with re-distribution. Notice how middle key is copied up.
2* 3*
Root
17
30
14* 16* 33* 34* 38* 39*
13 5
7*
5* 8* 22* 24*
27
27* 29*
... And Then Deleting 24*
• Must merge.
• Observe `toss’ of index entry (on right), and `pull down’ of index entry
(below).
30
22* 27* 29* 33* 34* 38* 39*
2* 3* 5* 7* 8* 14* 16* 22* 27* 29* 33* 34* 38* 39*
Root
13 30
5 17
Example of Non-leaf Re-distribution
• Tree is shown below during deletion of 24*. (What could be a possible initial tree?)
• In contrast to previous example, can re-distribute entry from left child of root to right child.
Root
13
5 17 20
22
30
14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39*
7*
5* 8*
3*
2*
After Re-distribution
• Intuitively, entries are re-distributed by `pushing through’ the splitting entry in the parent node.
• It suffices to re-distribute index entry with key 20;
we’ve re-distributed 17 as well for illustration.
14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39*
7*
5* 8*
2* 3*
Root
13 5
17
20 22 30
Prefix Key Compression
• Important to increase fan-out. (Why?)
• Key values in index entries only `direct traffic’; can often compress them.
• E.g., If we have adjacent index entries with search key values Dannon Yogurt, David Smith and Devarakonda
Murthy, we can abbreviate David Smith to Dav. (The other keys can be compressed too ...)
• Is this correct? Not quite! What if there is a data entry Davey Jones? (Can only compress David Smith to Davi)
• In general, while compressing, must leave each index entry greater than every key value (in any subtree) to its left.
• Insert/delete must be suitably modified.
Bulk Loading of a B+ Tree
• If we have a large collection of records, and we want to create a B+ tree on some field, doing so by
repeatedly inserting records is very slow.
• Also leads to minimal leaf utilization --- why?
• Bulk Loading can be done much more efficiently.
• Initialization: Sort all data entries, insert pointer to first (leaf) page in a new (root) page.
3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44*
Sorted pages of data entries; not yet in B+ tree Root
Bulk Loading (Contd.)
• Index entries for leaf pages always entered into right- most index page just above leaf level. When this fills up, it splits. (Split may go up right-most path to the root.)
• Much faster than repeated inserts, especially when one considers locking!
3* 4* 6* 9* 10* 11* 12* 13* 20*22* 23* 31* 35*36* 38*41* 44*
Root
Data entry pages not yet in B+ tree 35
23 12
6
10 20
3* 4* 6* 9* 10* 11* 12* 13* 20*22* 23* 31* 35*36* 38*41* 44*
6
Root
10
12 23
20
35
38
not yet in B+ tree Data entry pages
Summary of Bulk Loading
• Option 1: multiple inserts.
• Slow.
• Does not give sequential storage of leaves.
• Option 2: Bulk Loading
• Has advantages for concurrency control.
• Fewer I/Os during build.
• Leaves will be stored sequentially (and linked, of course).
• Can control “fill factor” on pages.
Contents
Log Structured Merge (LSM) Tree
3
Structure of LSM Tree
• Two trees
• C0 tree: memory resident (smaller part)
• C1 tree: disk resident (whole part)
Rolling Merge (1)
• Merge new leaf nodes in C
0tree and C
1tree
Rolling Merge (2)
• Step 1: read the new leaf nodes from C1 tree, and store them as emptying block in memory
• Step 2: read the new leaf nodes from C0 tree, and make merge sort with the emptying block
Rolling Merge (3)
• Step 3: write the merge results into filling block, and delete the new leaf nodes in C0.
• Step 4: repeat step 2 and 3. When the filling block is full, write the filling block into C1 tree, and delete the corresponding leaf nodes.
• Step 5: after all new leaf nodes in C0 and C1 are merged, finish the rolling merge process.
Data temperature
• Data Type
• Hot/Warm/Cold Data different trees
A LSM tree with multiple components
• Data Type
• Hottest data C0 tree
• Hotter data C1 tree
• ……
• Coldest data CK tree
Rolling Merge among Disks
• Two emptying blocks and filling blocks
• New leaf nodes should be locked (write lock)
Search and deletion (based on temporal locality)
• Lastest Τ (0- Τ) accesses are in C
0tree
• Τ - 2Τ accesses are in C
1tree
• ……
Checkpointing
• Log Sequence Number (LSN0) of last insertion at Time T0
• Root addresses
• Merge cursor for each component
• Allocation information
Contents
Distributed Hash & DHT
4
Definition of a DHT
• Hash table supports two operations
• insert(key, value)
• value = lookup(key)
• Distributed
• Map hash-buckets to nodes
• Requirements
• Uniform distribution of buckets
• Cost of insert and lookup should scale well
• Amount of local state (routing table size) should scale well
Fundamental Design Idea - I
• Consistent Hashing
• Map keys and nodes to an identifier space; implicit assignment of responsibility
Identifiers
A B C D
Key
Mapping performed using hash functions (e.g., SHA-1)
Spread nodes and keys uniformly throughout
1111111111 0000000000
Fundamental Design Idea - II
• Prefix / Hypercube routing
Source
Destination
But, there are so many of them!
• Scalability trade-offs
• Routing table size at each node vs.
• Cost of lookup and insert operations
• Simplicity
• Routing operations
• Join-leave mechanisms
• Robustness
• DHT Designs
• Plaxton Trees, Pastry/Tapestry
• Chord
• Overview: CAN, Symphony, Koorde, Viceroy, etc.
• SkipNet
Plaxton Trees Algorithm (1)
9 A E 4 2 4 7 B
1. Assign labels to objects and nodes
Each label is of log2b n digits
Object Node
- using randomizing hash functions
Plaxton Trees Algorithm (2)
2 4 7 B
2. Each node knows about other nodes with varying prefix matches
Node
2 4 7 B
2 4 7 B
2 4 7 B
2 4 7 B
3 1
5 3
6
8 A
C
2
2
2 4 2 4
2 4 7
2 4 7
Prefix match of length 0
Prefix match of length 1
Prefix match of length 2 Prefix match of length 3
Plaxton Trees Algorithm (3) Object Insertion and Lookup
Given an object, route successively towards nodes with greater prefix matches
2 4 7 B
Node
9 A E 2
9 A 7 6
9 F 1 0
9 A E 4
Object
Store the object at each of these locations
Plaxton Trees Algorithm (4) Object Insertion and Lookup
Given an object, route successively towards nodes with greater prefix matches
2 4 7 B
Node
9 A E 2
9 A 7 6
9 F 1 0
9 A E 4
Object
Store the object at each of these locations
log(n) steps to insert or locate object
Plaxton Trees Algorithm (5) Why is it a tree?
2 4 7 B
9 F 1 0
9 A 7 6
9 A E 2
Object
Object
Object
Object
Plaxton Trees Algorithm (6) Network Proximity
• Overlay tree hops could be totally unrelated to the underlying network hops
USA
Europe
East Asia
•
Plaxton trees guarantee constant factor approximation!
• Only when the topology is uniform in some sense
Pastry (1)
• Based directly upon Plaxton Trees
• Exports a DHT interface
• Stores an object only at a node whose ID is closest to the object ID
• In addition to main routing table
• Maintains leaf set of nodes
• Closest L nodes (in ID space)
• L = 2(b + 1) ,typically -- one digit to left and right
Pastry (2)
2 4 7 B
9 F 1 0
9 A 7 6
9 A E 2 Object
Only at the root!
Key Insertion and Lookup = Routing to Root
Takes O(log n) steps
Pastry (3)
Self Organization
• Node join
• Start with a node “close” to the joining node
• Route a message to nodeID of new node
• Take union of routing tables of the nodes on the path
• Joining cost: O(log n)
• Node leave
• Update routing table
• Query nearby members in the routing table
• Update leaf set
Chord [Karger, et al] (1)
• Map nodes and keys to identifiers
• Using randomizing hash functions
• Arrange them on a circle
Identifier Circle
x succ(x)
010110110 010111110
pred(x)
010110000
Chord (2)
Efficient routing
• Routing table
• ith entry = succ(n + 2i)
• log(n) finger pointers
Identifier Circle
Exponentially spaced pointers!
Chord (3)
Key Insertion and Lookup
To insert or lookup a key ‘x’, route to succ(x)
x succ(x)
source
O(log n) hops for routing
Chord (4)
Self Organization
• Node join
• Set up finger i: route to succ(n + 2i)
• log(n) fingers ) O(log2 n) cost
• Node leave
• Maintain successor list for ring connectivity
• Update successor list and finger pointers
CAN [Ratnasamy, et al]
• Map nodes and keys to coordinates in a multi-dimensional cartesian space
source
key
Routing through shortest Euclidean path
For d dimensions, routing takes O(dn1/d) hops
Zone
Symphony [Manku, et al]
• Similar to Chord – mapping of nodes, keys
• ‘k’ links are constructed probabilistically!
x
This link chosen with probability P(x) = 1/(x ln n)
Expected routing guarantee: O(1/k (log2 n)) hops
SkipNet [Harvey, et al] (1)
• Previous designs distribute data uniformly throughout the system
• Good for load balancing
• But, my data can be stored in Timbuktu!
• Many organizations want stricter control over data placement
• What about the routing path?
• Should a Microsoft Microsoft end-to-end path pass through Sun?
SkipNet (2)
Content and Path Locality
Basic Idea: Probabilistic skip lists
Height
Nodes
• Each node choose a height at random
• Choose height ‘h’ with probability 1/2h
SkipNet (3)
Content and Path Locality
Height
Nodes
Nodes are lexicographically sorted
Still O(log n) routing guarantee!
Summary
# Links per node Routing hops
Pastry/Tapestry
O(2
blog
2bn) O(log
2bn)
Chord
log n O(log n)
CAN
d dn
1/dSkipNet
O(log n) O(log n)
Symphony
k O((1/k) log
2n)
Koorde
d log
dn
Viceroy
7 O(log n)
Optimal (= lower bound)
Ceph Controlled Replication Under Scalable Hashing (CRUSH) (1)
• CRUSH algorithm: pgid OSD ID?
• Devices: leaf nodes (weighted)
• Buckets: non-leaf nodes (weighted, contain any number of devices/buckets)
CRUSH (2)
• A partial view of a four- level cluster map
hierarchy consisting of rows, cabinets, and shelves of disks.
CRUSH (3)
• Reselection behavior of select(6,disk) when device r = 2 (b) is rejected, where the boxes contain the CRUSH output R of n = 6 devices numbered by rank. The left shows the “first n” approach in which device ranks of existing devices
(c,d,e,f) may shift. On the right, each rank has a probabilistically independent sequence of potential targets; here fr = 1 , and r′ =r+ frn=8 (device h).
CRUSH (4)
• Data movement in a binary hierarchy due to a node addition and the subsequent weight changes.
CRUSH (5)
•
Four types of Buckets Uniform buckets
List buckets
Tree buckets
Straw buckets
•
Summary of mapping speed and data reorganization efficiency of different bucket types when items are added to or removed from a bucket.CRUSH (6)
• Node labeling strategy used for the binary tree comprising each tree bucket
Contents
Project 4
5
Metadata Management in DFS (1)
• Design a simple metadata management module for a distributed file system. Establish a distributed metadata cluster and a POSIX API based client.
Metadata Management in DFS (2)
•
The metadata management has the following functions, Basic command set: support metadata operations via POSIX- based API
i.e., mkdir, create file, readdir, rm file, stat, etc.
file handle can be ignored
Distribution of metadata
Metadata are distributed among various metadata servers
Metadata Management in DFS (3)
•
Tests on the metadata management functions, Input: Input the specified files & directories by client
Output:
Traverse the files via readdir command
List the status of a file via stat command
Etc.
Write the metadata of these file operations into the metadata server
Give the data distribution information of the whole cluster
Consistent with other metadata servers
Metadata Management in DFS (4)
•
Additional scores Support metadata server failover (process level)
Support metadata server failure
No metadata lost in the failure
Implementation on the read/write operations of a file