Big Data Processing Technologies

(1)

Big Data Processing Technologies

Chentao Wu

Associate Professor

Dept. of Computer Science and Engineering [email protected]

(2)

Schedule

• lec1: Introduction on big data and cloud computing

• Iec2: Introduction on data storage

• lec3: Data reliability (Replication/Archive/EC)

• lec4: Data consistency problem

• lec5: Block storage and file storage

• lec6: Object-based storage

• lec7: Distributed file system

• lec8: Metadata management

(3)

Collaborators

(4)

Contents

Metadata in DFS

1

(5)

Metadata

•

Metadata = structural information

 File/Objects: attributes in inode/onode

 Main problem for metadata in DFS: indexing

(6)

Metadata Server in DFS (Lustre)

(7)

Metadata Server in DFS (Ceph)

(8)

Metadata Server in DFS (GFS)

(9)

Metadata Server in DFS (HDFS)

(10)

NameNode Metadata in HDFS

•

Metadata in Memory

 The entire metadata is in main memory

 No demand paging of meta-data

•

Types of Metadata

 List of files

 List of Blocks for each file

 List of DataNodes for each block

 File attributes, e.g creation time, replication factor

•

A Transaction Log

 Records file creations, file deletions. etc

(11)

Metadata level in DFS (Azure)

Partition Layer – Index Range Partitioning

Account Name

Container Name

Blob Name

aaaa aaaa aaaaa

…….. …….. ……..

zzzz zzzz zzzzz

• Split index into

RangePartitions based on load

• Split at PartitionKey boundaries

• PartitionMap tracks Index RangePartition assignment to partition servers

• Front-End caches the

PartitionMap to route user requests

• Each part of the index is assigned to only one Partition Server at a time

Storage Stamp

Partition Server Partition

Server

Blob Name richard videos tennis

……… ……… ………

zzzz zzzz zzzzz

Blob Name harry pictures sunset

……… ……… ………

richard videos soccer

Partition Server

Partition Master

Front-End Server

PS 2 PS 3

PS 1

A-H: PS1 H’-R: PS2 R’-Z: PS3

Partition Map Blob Index

Partition Map Account

Name

Blob Name

aaaa aaaa aaaaa

……… ……… ………

harry pictures sunrise

A-H

H’-R R’-Z

(12)

Metadata level in DFS (Pangu) Partition layer

Access Layer Restful Protocol

LB LVS

Partition Layer Key-Value Engine

Persistent Layer Pangu FS

Load Balancing

Protocol Manager &

Access Control Partition & Index

Persistent, Redundancy

& Fault-Tolerance

(13)

Contents

ISAM & B+ Tree

2

(14)

Tree Structures Indexes

•

Recall: 3 alternatives for data entries k*:

• Data record with key value k

• <k, rid of data record with search key value k>

• <k, list of rids of data records with search key k>

•

Choice is orthogonal to the indexing technique used to locate data entries k*.

•

Tree-structured indexing techniques support both range searches and equality searches.

 ISAM (Indexed Sequential Access Method): static structure

 B+ tree: dynamic, adjusts gracefully under inserts and deletes.

(15)

Range Searches

•

Choose``Find all students with gpa > 3.0’’

 If data is in sorted file, do binary search to find first such student, then scan to find others.

 Cost of binary search can be quite high.

•

Simple idea: Create an `index’ file.

 Level of indirection again!

Page 1 Page 2 Page 3 Page N Data File

k2 kN

k1 Index File

Can do binary search on (smaller) index file!

(16)

ISAM

• Index file may still be quite large. But we can apply the idea repeatedly!

Leaf pages contain data entries

P0 K

1 P

1 K 2 P 2 K

m P m

index entry

Non-leaf Pages

Pages

Overflow

page Primary pages

Leaf

(17)

Comments on ISAM

^{Data Pages}

Index Pages

Overflow pages

• File creation: Leaf (data) pages allocated sequentially, sorted by search key.

Then index pages allocated.

Then space for overflow pages.

• Index entries: <search key value, page id>; they `direct’

search for data entries, which are in leaf pages.

• Search: Start at root; use key comparisons to go to leaf.

Cost log _FN ; F = # entries/index pg, N = # leaf pgs

• Insert: Find leaf where data entry belongs, put it there.

(Could be on an overflow page).

• Delete: Find and remove from leaf; if empty overflow page, de-allocate.

Static tree structure: inserts/deletes affect only leaf pages.

(18)

Example ISAM Tree

10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*

20 33 51 63

40 Root

• Each node can hold 2 entries; no need for `next-

leaf-page’ pointers.

(19)

After Inserting 23, 48, 41, 42 ...

10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*

20 33 51 63

40 Root

23* 48* 41*

42*

Overflow Pages Leaf Index Pages

Pages Primary

(20)

... then Deleting 42, 51, 97*

10* 15* 20* 27* 33* 37* 40* 46* 55* 63*

20 33 51 63

40 Root

23* 48* 41*

Note that 51 appears in index levels , but 51* not in leaf!

(21)

Pros, Cons & Usage

•

Pros

 Simple and easy to implement

•

Cons

 Unbalanced overflow pages

 Index redistribution

•

Usage

 MS Access

 Berkeley DB

 MySQL (before 3.23)  MyISAM (not real ISAM)

(22)

B+ Tree: The Most Widely Used Index

• Insert/delete at log

_F

N cost; keep tree height-balanced.

(F = fanout, N = # leaf pages)

• Minimum 50% occupancy (except for root). Each

node contains d <= m <= 2d entries. The parameter d is called the order of the tree.

• Supports equality and range-searches efficiently.

Index Entries

Data Entries ("Sequence set") (Direct search)

(23)

Example B+ Tree

• Search begins at root, and key comparisons direct it to a leaf (as in ISAM).

• Search for 5, 15, all data entries >= 24* ...

Based on the search for 15*, we know it is not in the tree!

Root

17 24 30

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

13

(24)

B+ Tree in Practice

• Typical order: 100. Typical fill-factor: 67%.

• average fanout = 133

• Typical capacities:

• Height 4: 133⁴ = 312,900,700 records

• Height 3: 133³ = 2,352,637 records

• Can often hold top levels in buffer pool:

• Level 1 = 1 page = 8 Kbytes

• Level 2 = 133 pages = 1 Mbyte

• Level 3 = 17,689 pages = 133 MBytes

(25)

Inserting a Data Entry into a B+ Tree

• Find correct leaf L.

• Put data entry onto L.

• If L has enough space, done!

• Else, must split L (into L and a new node L2)

• Redistribute entries evenly, copy up middle key.

• Insert index entry pointing to L2 into parent of L.

• This can happen recursively

• To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)

• Splits “grow” tree; root split increases height.

• Tree growth: gets wider or one level taller at top.

(26)

Example B+ Tree - Inserting 8*

Root

17 24 30

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

13

(27)

Example B+ Tree - Inserting 8*

 Notice that root was split, leading to increase in height.

 In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice.

2* 3*

Root

17

24 30

14* 16* 19* 20* 22* 24* 27*29* 33* 34* 38* 39*

13 5

7*

5* 8*

(28)

Inserting 8* into Example B+ Tree

• Observe how

minimum occupancy is guaranteed in both leaf and index pg

splits.

• Note difference

between copy-up and push-up; be sure you understand the

reasons for this.

2* 3* 5* 7* 8*

5

Entry to be inserted in parent node.

(Note that 5 is

continues to appear in the leaf.) s copied up and

appears once in the index. Contrast

5 24 30

17

13

Entry to be inserted in parent node.

(Note that 17 is pushed up and only this with a leaf split.)

…

(29)

Deleting a Data Entry from a B+ Tree

• Start at root, find leaf L where entry belongs.

• Remove the entry.

• If L is at least half-full, done!

• If L has only d-1 entries,

• Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).

• If re-distribution fails, merge L and sibling.

• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.

• Merge could propagate to root, decreasing height.

(30)

Example Tree (including 8) Delete 19 and 20* ...

2* 3*

Root

17

24 30

14* 16* 19* 20* 22* 24* 27*29* 33* 34* 38* 39*

13 5

7*

5* 8*

• Deleting 19* is easy.

(31)

Example Tree (including 8) Delete 19 and 20* ...

• Deleting 19* is easy.

• Deleting 20* is done with re-distribution. Notice how middle key is copied up.

2* 3*

Root

17

30

14* 16* 33* 34* 38* 39*

13 5

7*

5* 8* 22* 24*

27

27* 29*

(32)

... And Then Deleting 24*

• Must merge.

• Observe `toss’ of index entry (on right), and `pull down’ of index entry

(below).

30

22* 27* 29* 33* 34* 38* 39*

2* 3* 5* 7* 8* 14* 16* 22* 27* 29* 33* 34* 38* 39*

Root

13 30

5 17

(33)

Example of Non-leaf Re-distribution

• Tree is shown below during deletion of 24. (What* could be a possible initial tree?)

• In contrast to previous example, can re-distribute entry from left child of root to right child.

Root

13

5 17 20

22

30

14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39*

7*

5* 8*

3*

2*

(34)

After Re-distribution

• Intuitively, entries are re-distributed by `pushing through’ the splitting entry in the parent node.

• It suffices to re-distribute index entry with key 20;

we’ve re-distributed 17 as well for illustration.

14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39*

7*

5* 8*

2* 3*

Root

13 5

17

20 22 30

(35)

Prefix Key Compression

• Important to increase fan-out. (Why?)

• Key values in index entries only `direct traffic’; can often compress them.

• E.g., If we have adjacent index entries with search key values Dannon Yogurt, David Smith and Devarakonda

Murthy, we can abbreviate David Smith to Dav. (The other keys can be compressed too ...)

• Is this correct? Not quite! What if there is a data entry Davey Jones? (Can only compress David Smith to Davi)

• In general, while compressing, must leave each index entry greater than every key value (in any subtree) to its left.

• Insert/delete must be suitably modified.

(36)

Bulk Loading of a B+ Tree

• If we have a large collection of records, and we want to create a B+ tree on some field, doing so by

repeatedly inserting records is very slow.

• Also leads to minimal leaf utilization --- why?

• Bulk Loading can be done much more efficiently.

• Initialization: Sort all data entries, insert pointer to first (leaf) page in a new (root) page.

3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44*

Sorted pages of data entries; not yet in B+ tree Root

(37)

Bulk Loading (Contd.)

• Index entries for leaf pages always entered into right- most index page just above leaf level. When this fills up, it splits. (Split may go up right-most path to the root.)

• Much faster than repeated inserts, especially when one considers locking!

3* 4* 6* 9* 10* 11* 12* 13* 20*22* 23* 31* 35*36* 38*41* 44*

Root

Data entry pages not yet in B+ tree 35

23 12

6

10 20

3* 4* 6* 9* 10* 11* 12* 13* 20*22* 23* 31* 35*36* 38*41* 44*

6

Root

10

12 23

20

35

38

not yet in B+ tree Data entry pages

(38)

Summary of Bulk Loading

• Option 1: multiple inserts.

• Slow.

• Does not give sequential storage of leaves.

• Option 2: Bulk Loading

• Has advantages for concurrency control.

• Fewer I/Os during build.

• Leaves will be stored sequentially (and linked, of course).

• Can control “fill factor” on pages.

(39)

Contents

Log Structured Merge (LSM) Tree

3

(40)

Structure of LSM Tree

• Two trees

• C₀ tree: memory resident (smaller part)

• C₁ tree: disk resident (whole part)

(41)

Rolling Merge (1)

• Merge new leaf nodes in C

₀

tree and C

₁

tree

(42)

Rolling Merge (2)

• Step 1: read the new leaf nodes from C₁ tree, and store them as emptying block in memory

• Step 2: read the new leaf nodes from C₀ tree, and make merge sort with the emptying block

(43)

Rolling Merge (3)

• Step 3: write the merge results into filling block, and delete the new leaf nodes in C_0.

• Step 4: repeat step 2 and 3. When the filling block is full, write the filling block into C₁ tree, and delete the corresponding leaf nodes.

• Step 5: after all new leaf nodes in C₀ and C₁are merged, finish the rolling merge process.

(44)

Data temperature

• Data Type

• Hot/Warm/Cold Data  different trees

(45)

A LSM tree with multiple components

• Data Type

• Hottest data  C₀ tree

• Hotter data  C₁ tree

• ……

• Coldest data  C_K tree

(46)

Rolling Merge among Disks

• Two emptying blocks and filling blocks

• New leaf nodes should be locked (write lock)

(47)

Search and deletion (based on temporal locality)

• Lastest Τ (0- Τ) accesses are in C

₀

tree

• Τ - 2Τ accesses are in C

₁

tree

• ……

(48)

Checkpointing

• Log Sequence Number (LSN0) of last insertion at Time T₀

• Root addresses

• Merge cursor for each component

• Allocation information

(49)

Contents

Distributed Hash & DHT

4

(50)

Definition of a DHT

• Hash table  supports two operations

• insert(key, value)

• value = lookup(key)

• Distributed

• Map hash-buckets to nodes

• Requirements

• Uniform distribution of buckets

• Cost of insert and lookup should scale well

• Amount of local state (routing table size) should scale well

(51)

Fundamental Design Idea - I

• Consistent Hashing

• Map keys and nodes to an identifier space; implicit assignment of responsibility

Identifiers

A B C D

Key



Mapping performed using hash functions (e.g., SHA-1)

 Spread nodes and keys uniformly throughout

1111111111 0000000000

(52)

Fundamental Design Idea - II

• Prefix / Hypercube routing

Source

Destination

(53)

But, there are so many of them!

• Scalability trade-offs

• Routing table size at each node vs.

• Cost of lookup and insert operations

• Simplicity

• Routing operations

• Join-leave mechanisms

• Robustness

• DHT Designs

• Plaxton Trees, Pastry/Tapestry

• Chord

• Overview: CAN, Symphony, Koorde, Viceroy, etc.

• SkipNet

(54)

Plaxton Trees Algorithm (1)

9 A E 4 2 4 7 B

1. Assign labels to objects and nodes

Each label is of log₂^bn digits

Object Node

- using randomizing hash functions

(55)

Plaxton Trees Algorithm (2)

2 4 7 B

2. Each node knows about other nodes with varying prefix matches

Node

2 4 7 B

3 1

5 3

6

8 A

C

2

2 4 2 4

2 4 7

Prefix match of length 0

Prefix match of length 1

Prefix match of length 2 Prefix match of length 3

(56)

Plaxton Trees Algorithm (3) Object Insertion and Lookup

Given an object, route successively towards nodes with greater prefix matches

2 4 7 B

Node

9 A E 2

9 A 7 6

9 F 1 0

9 A E 4

Object

Store the object at each of these locations

(57)

Plaxton Trees Algorithm (4) Object Insertion and Lookup

Given an object, route successively towards nodes with greater prefix matches

2 4 7 B

Node

9 A E 2

9 A 7 6

9 F 1 0

9 A E 4

Object

Store the object at each of these locations

log(n) steps to insert or locate object

(58)

Plaxton Trees Algorithm (5) Why is it a tree?

2 4 7 B

9 F 1 0

9 A 7 6

9 A E 2

Object

(59)

Plaxton Trees Algorithm (6) Network Proximity

• Overlay tree hops could be totally unrelated to the underlying network hops

USA

Europe

East Asia

•

Plaxton trees guarantee constant factor approximation!

• Only when the topology is uniform in some sense

(60)

Pastry (1)

• Based directly upon Plaxton Trees

• Exports a DHT interface

• Stores an object only at a node whose ID is closest to the object ID

• In addition to main routing table

• Maintains leaf set of nodes

• Closest L nodes (in ID space)

• L = 2^{(b + 1)},typically -- one digit to left and right

(61)

Pastry (2)

2 4 7 B

9 F 1 0

9 A 7 6

9 A E 2 Object

Only at the root!

Key Insertion and Lookup = Routing to Root

 Takes O(log n) steps

(62)

Pastry (3)

Self Organization

• Node join

• Start with a node “close” to the joining node

• Route a message to nodeID of new node

• Take union of routing tables of the nodes on the path

• Joining cost: O(log n)

• Node leave

• Update routing table

• Query nearby members in the routing table

• Update leaf set

(63)

Chord [Karger, et al] (1)

• Map nodes and keys to identifiers

• Using randomizing hash functions

• Arrange them on a circle

Identifier Circle

x succ(x)

010110110 010111110

pred(x)

010110000

(64)

Chord (2)

Efficient routing

• Routing table

• i^th entry = succ(n + 2ⁱ)

• log(n) finger pointers

Identifier Circle

Exponentially spaced pointers!

(65)

Chord (3)

Key Insertion and Lookup

To insert or lookup a key ‘x’, route to succ(x)

x succ(x)

source

O(log n) hops for routing

(66)

Chord (4)

Self Organization

• Node join

• Set up finger i: route to succ(n + 2ⁱ)

• log(n) fingers ) O(log² n) cost

• Node leave

• Maintain successor list for ring connectivity

• Update successor list and finger pointers

(67)

CAN [Ratnasamy, et al]

• Map nodes and keys to coordinates in a multi-dimensional cartesian space

source

key

Routing through shortest Euclidean path

For d dimensions, routing takes O(dn^1/d) hops

Zone

(68)

Symphony [Manku, et al]

• Similar to Chord – mapping of nodes, keys

• ‘k’ links are constructed probabilistically!

x

This link chosen with probability P(x) = 1/(x ln n)

Expected routing guarantee: O(1/k (log² n)) hops

(69)

SkipNet [Harvey, et al] (1)

• Previous designs distribute data uniformly throughout the system

• Good for load balancing

• But, my data can be stored in Timbuktu!

• Many organizations want stricter control over data placement

• What about the routing path?

• Should a Microsoft  Microsoft end-to-end path pass through Sun?

(70)

SkipNet (2)

Content and Path Locality

Basic Idea: Probabilistic skip lists

Height

Nodes

• Each node choose a height at random

• Choose height ‘h’ with probability 1/2^h

(71)

SkipNet (3)

Content and Path Locality

Height

Nodes



Nodes are lexicographically sorted

Still O(log n) routing guarantee!

(72)

Summary

# Links per node Routing hops

Pastry/Tapestry

O(2

^b

log

₂^b

n) O(log

₂^b

n)

Chord

log n O(log n)

CAN

d dn

^1/d

SkipNet

O(log n) O(log n)

Symphony

k O((1/k) log

²

n)

Koorde

d log

_d

n

Viceroy

7 O(log n)

Optimal (= lower bound)

(73)

Ceph Controlled Replication Under Scalable Hashing (CRUSH) (1)

• CRUSH algorithm: pgid  OSD ID?

• Devices: leaf nodes (weighted)

• Buckets: non-leaf nodes (weighted, contain any number of devices/buckets)

(74)

CRUSH (2)

• A partial view of a four- level cluster map

hierarchy consisting of rows, cabinets, and shelves of disks.

(75)

CRUSH (3)

• Reselection behavior of select(6,disk) when device r = 2 (b) is rejected, where the boxes contain the CRUSH output R of n = 6 devices numbered by rank. The left shows the “first n” approach in which device ranks of existing devices

(c,d,e,f) may shift. On the right, each rank has a probabilistically independent sequence of potential targets; here f_r = 1 , and r′ =r+ f_rn=8 (device h).

(76)

CRUSH (4)

• Data movement in a binary hierarchy due to a node addition and the subsequent weight changes.

(77)

CRUSH (5)

•

Four types of Buckets

 Uniform buckets

 List buckets

 Tree buckets

 Straw buckets

•

Summary of mapping speed and data reorganization efficiency of different bucket types when items are added to or removed from a bucket.

(78)

CRUSH (6)

• Node labeling strategy used for the binary tree comprising each tree bucket

(79)

Contents

Project 4

5

(80)

Metadata Management in DFS (1)

• Design a simple metadata management module for a distributed file system. Establish a distributed metadata cluster and a POSIX API based client.

(81)

Metadata Management in DFS (2)

•

The metadata management has the following functions,

 Basic command set: support metadata operations via POSIX- based API

 i.e., mkdir, create file, readdir, rm file, stat, etc.

 file handle can be ignored

 Distribution of metadata

 Metadata are distributed among various metadata servers

(82)

Metadata Management in DFS (3)

•

Tests on the metadata management functions,

 Input: Input the specified files & directories by client

 Output:

 Traverse the files via readdir command

 List the status of a file via stat command

 Etc.

 Write the metadata of these file operations into the metadata server

 Give the data distribution information of the whole cluster

 Consistent with other metadata servers

(83)

Metadata Management in DFS (4)

•

Additional scores

 Support metadata server failover (process level)

 Support metadata server failure

 No metadata lost in the failure

 Implementation on the read/write operations of a file

(84)

Big Data Processing Technologies