Big Data Processing Technologies

(1)

Big Data Processing Technologies

Chentao Wu

Associate Professor

Dept. of Computer Science and Engineering [email protected]

(2)

Schedule

• lec1: Introduction on big data and cloud computing

• Iec2: Introduction on data storage

• lec3: Data reliability (Replication/Archive/EC)

• lec4: Data consistency problem

• lec5: Block level storage and file storage

• lec6: Object-based storage

• lec7: Distributed file system

• lec8: Metadata management

(3)

Collaborators

(4)

Data Reliability Problem (1)

Google – Disk Annual Failure Rate

(5)

Data Reliability Problem (2)

Facebook-- Failure nodes in a 3000 nodes cluster

(6)

Contents

Introduction on Replication

1

(7)

What is Replication?

• Replication can be classified as

• Local replication

• Replicating data within the same array or data center

• Remote replication

• Replicating data at remote site

It is a process of creating an exact copy (replica) of data.

Replication

Source Replica (Target)

REPLICATION

(8)

File System Consistency: Flushing Host Buffer

File System Application

Memory Buffers

Logical Volume Manager Physical Disk Driver

Data

Flush Buffer

Source Replica

(9)

Database Consistency: Dependent Write I/O Principle

D

Inconsistent

C

^Consistent

4 4

3 3

2 2

1 1

4 4

3 3

2 1

C

(10)

Host-based Replication: LVM-based Mirroring

CC

^Host

Logical Volume

Physical Volume 1

Physical Volume 2

• LVM: Logical Volume Manager

(11)

Host-based Replication: File System Snapshot

CC

• Pointer-based replication

• Uses Copy on First

Write (CoFW) principle

• Uses bitmap and block map

• Requires a fraction of the space used by the production FS

Metadata Production FS

Metadata 1 Data a 2 Data b

FS Snapshot

3 no data 4 no data

BLK Bit

1-0 1-0 2-0 2-0

N Data N 3 Data C

2 Data c 3-1

4 Data D 1 Data d

4-1

3-2 4-1

(12)

Storage Array-based Local Replication

CC

• Replication performed by the array operating environment

• Source and replica are on the same array

• Types of array-based replication

• Full-volume mirroring

• Pointer-based full-volume replication

• Pointer-based virtual replication

BC Host Storage Array

Replica Source

Production Host

(13)

Full-Volume Mirroring

Source

Attached

Storage Array

Read/Write Not Ready

Production Host BC Host

Target

Detached – Point In Time

Read/Write Read/Write

Source

Storage Array

Target

(14)

Copy on First Access: Write to the Source

Source C’

Target

• When a write is issued to the source for the first time after replication session activation:

 Original data at that address is copied to the target

 Then the new data is updated on the source

 This ensures that original data at the point-in-time of activation is preserved on the target

C Write to

Source

A B

C’ C

(15)

Copy on First Access: Write to the Target

• When a write is issued to the target for the first time after replication session activation:

 The original data is copied from the source to the target

 Then the new data is updated on the target

Source

B’

Target

B

Write to Target

A B

C’ C

B’

(16)

Copy on First Access: Read from Target

• When a read is issued to the target for the first time after replication session activation:

 The original data is copied from the source to the target and is made available to the BC host

Source

A

Target

A

Read request for

data “A”

A B

C’ C

B’

A

(17)

Tracking Changes to Source and Target

Source Target

0 unchanged changed

Logical OR At PIT

Target Source After PIT…

0 0 0 0 0 0 0 0

1 0 0 1 0 1 0 0

0 0 1 1 0 0 0 1

1 0 1 1 0 1 0 1

1

For resynchronization/restore

(18)

Contents

Introduction to Erasure Codes

2

(19)

Erasure Coding Basis (1)

• You've got some data • And a collection of storage nodes.

• And you want to store the data on the storage nodes so that you can get the data back, even when the nodes fail..

(20)

Erasure Coding Basis (2)

• More concrete: You have k

disks worth of data • And n total disks.

• The erasure code tells you how to create n disks worth of data+coding so that when disks fail, you can still get the data

(21)

Erasure Coding Basis (3)

• You have k disks worth of

data • And n total disks.

• n = k + m

• A systematic erasure code stores the data in the clear on k of the n disks. There are k data disks, and m coding or “parity”

disks.  Horizontal Code

(22)

Erasure Coding Basis (4)

• n = k + m

• A non-systematic erasure code stores only coding information, but we still use k, m, and n to describe the code.  Vertical Code

(23)

Erasure Coding Basis (5)

• n = k + m

• When disks fail, their contents become unusable, and the storage system detects this. This failure mode is called an erasure.

(24)

Erasure Coding Basis (6)

• n = k + m

• An MDS (“Maximum Distance Separable”) code can reconstruct the data from any m failures.  Optimal

• Can reconstruct any f failures (f < m)  non-MDS code

(25)

Two Views of a Stripe (1)

• The Theoretical View:

– The minimum collection of bits that encode and decode together.

– r rows of w-bit symbols from each of n disks:

(26)

Two Views of a Stripe (2)

• The Systems View:

– The minimum partition of the system that encodes and decodes together.

– Groups together theoretical stripes for performance.

(27)

Horizontal & Vertical Codes

• Horizontal Code

• Vertical Code

(28)

Expressing Code with Generator Matrix (1)

(29)

Expressing Code with Generator Matrix (2)

(30)

Expressing Code with Generator Matrix (3)

(31)

Encoding— Linux RAID-6 (1)

(32)

Encoding— Linux RAID-6 (2)

(33)

Encoding— Linux RAID-6 (3)

(34)

Accelerate Encoding— Linux RAID-6

(35)

Encoding— RDP (1)

(36)

Encoding— RDP (2)

(37)

Encoding— RDP (3)

(38)

Encoding— RDP (4)

(39)

Encoding— RDP (5)

(40)

Encoding— RDP (6)

Horizontal Parity Diagonal Parity Data

0 1 2 3 4 5 6 7

0

1

2

3

4

5

• Horizontal parity layout (p=7, n=8)

(41)

Encoding— RDP (7)

• Diagonal parity layout (p=7, n=8)

Horizontal Parity Diagonal Parity Data

0 1 2 3 4 5 6 7

0

1

2

3

4

5

(42)

Arithmetic for Erasure Codes

• When w = 1: XOR's only.

• Otherwise, Galois Field Arithmetic GF(2w)

– w is 2, 4, 8, 16, 32, 64, 128 so that words fit evenly into computer words.

– Addition is equal to XOR.

Nice because addition equals subtraction.

– Multiplication is more complicated:

Gets more expensive as w grows.

Buffer-constant different from a * b.

Buffer * 2 can be done really fast.

Open source library support.

(43)

Decoding with Generator Matrices (1)

(44)

Decoding with Generator Matrices (2)

(45)

Decoding with Generator Matrices (3)

(46)

Decoding with Generator Matrices (4)

(47)

Decoding with Generator Matrices (5)

(48)

Erasure Codes — Reed Solomon (1)

• Given in 1960.

• MDS Erasure codes for any n and k.

– That means any m = (n-k) failures can be tolerated without data loss.

• r = 1

(Theoretical): One word per disk per stripe.

• w constrained so that n ≤ 2w.

• Systematic and non-systematic forms.

(49)

Erasure Codes —Reed Solomon (2)

Systematic RS -- Cauchy generator matrix

(50)

Erasure Codes —Reed Solomon (3)

Non-Systematic RS -- Vandermonde generator matrix

(51)

Erasure Codes —Reed Solomon (4)

Non-Systematic RS -- Vandermonde generator matrix

(52)

Erasure Codes —EVENODD 1995 (7 disks, tolerating 2 disk failures)

• Horizontal Parity Coding

• Calculated by the data

elements in the same row

• E.g. 𝐶_0,5 = 𝐶_0,0 ⊕ 𝐶_0,1 ⊕ 𝐶_0,2 ⊕ 𝐶_0,3

⊕ 𝐶_0,4

• Diagonal Parity Coding

• Calculated by the data elements and S

• E.g. 𝐶_0,6 = 𝐶_0,0 ⊕ 𝐶_3,2 ⊕ 𝐶_2,3 ⊕ 𝐶_1,4 ⊕ 𝑆

(53)

Erasure Codes —X-Code 1999 (1)

Diagonal Parity Anti-diagonal Parity Data

0 1 2 3 4 5 6

0

1

2

3

4

5

6

(54)

Erasure Codes —X-Code 1999 (2)

• Anti-diagonal parity layout (p=7, n=7)

Diagonal Parity Anti-diagonal Parity Data

0 1 2 3 4 5 6

0

1

2

3

4

5

6

(55)

Erasure Codes —H-Code (1)

• Horizontal parity layout (p=7, n=8)

Horizontal Parity Anti-diagonal Parity Data

0 1 2 3 4 5 6 7

0

1

2

3

4

5

(56)

Erasure Codes —H-Code (2)

• Anti-diagonal parity layout (p=7, n=8)

Horizontal Parity Anti-diagonal Parity Data

0 1 2 3 4 5 6 7

0

1

2

3

4

5

(57)

Erasure Codes —H-Code (3)

• Recover double disk failure by single recovery chain

Horizontal Parity Anti-diagonal Parity

Data Lost Data and Parity

Recovery Chain

1

2 3

4 5

6 7

8 9

10 11

X

12

F

H

J L

B D

K E A

I G

C

0 1 2 3 4 5 6 7

0

1

2

3

4

5

(58)

Erasure Codes —H-Code (4)

• Recover double disk failure by two recovery chains

5

Horizontal Parity Anti-diagonal Parity

Recovery Chain

1

2 3

4 5

X6

D E

L J

K I

H

F G

A

C B

0 1 2 3 4 5 6 7

0

1

2

3

4

5

1

2 3

4

X6

(59)

Erasure Codes —HDP Code (1)

0 1 2 3 4 5

0

1

2

3

4

5

HDP ADP

Data

(60)

Erasure Codes —HDP Code (2)

0 1 2 3 4 5

0

1

2

3

4

5

HDP ADP

Data

(61)

Erasure Codes —HDP Code (3)

• HDP reduces more than 30% average recovery time.

0 1 2 3 4 5

0

1

2

3

4

5

HDP ADP

A B 1

D C

F E

K L

I J

G H

F

F 2

3 4

5 6

1 2

3 4

5 6

(62)

Contents

Replication and EC in Cloud

3

(63)

Three Dimensions in Cloud Storage

(64)

Replication vs Erasure Coding (RS)

(65)

Fundamental Tradeoff

(66)

Pyramid Codes (1)

(67)

Pyramid Codes (2)

(68)

Pyramid Codes (3) Multiple Hierachies

(69)

Pyramid Codes (4) Multiple Hierachies

(70)

Pyramid Codes (5) Multiple Hierachies

(71)

Pyramid Codes (6)

(72)

Google GFS II – Based on RS

(73)

Microsoft Azure (1)

How to Reduce Cost?

(74)

Microsoft Azure (2)

Recovery becomes expensive

(75)

Microsoft Azure (3)

Best of both worlds?

(76)

Microsoft Azure (4)

Local Reconstruction Code (LRC)

(77)

Microsoft Azure (5)

Analysis LRC vs RS

(78)

Microsoft Azure (6)

Analysis LRC vs RS

(79)

Recovery problem in Cloud

• Recovery I/Os from 6 disks (high network bandwidth)

(80)

Optimizing Recovery Network I/O (1)

(81)

Optimizing Recovery Network I/O (1)

• Establish recovery relationships among disks

(82)

Optimizing Recovery I/O (3)

• ~20+% savings in general

(83)

Regenerating Codes (1)

• Data = {a,b,c}

(84)

Regenerating Codes (2)

• Optimal Repair

(85)

Regenerating Codes (3)

• Optimal Repair

(86)

Regenerating Codes (4)

• Optimal Repair

(87)

Regenerating Codes (4)

Analysis -- Regenerating vs RS

(88)

Facebook Xorbas Hadoop

Locally Repairable Codes

(89)

Combination of Two ECs (1)

Recovery Cost vs. Storage Overhead

(90)

Combination of Two ECs (2)

Fast Code and Compact Code

(91)

Combination of Two ECs (3)

Analysis

(92)

Combination of Two ECs (4)

Analysis

(93)

Combination of Two ECs (5)

Analysis

(94)

Combination of Two ECs (6) Conversion

• Horizontal parities require no re-computation

• Vertical parities require no data block transfer

• All parity updates can be done in parallel and in a distributed manner

(95)

Combination of Two ECs (7)

Results

(96)

Contents

Project 1

4

(97)

Erasure Code in Hadoop (1)

• Implement an erasure code into Hadoop system

• Hadoop Version: 2.7 or higher

• Erasure Code: you can select one, but not RS

• Test the storage efficiency of your proposed code

• Report and Source Code are required

• Source Code should be checked by TA

• Deadline: June 30^th

(98)

Erasure Code in Hadoop (2)

• References

• Jerasure

http://web.eecs.utk.edu/~plank/plank/www/software.html

• HDFS-Xorbas

http://smahesh.com/HadoopUSC/

(99)