• 沒有找到結果。

Big Data Processing Technologies

N/A
N/A
Protected

Academic year: 2022

Share "Big Data Processing Technologies"

Copied!
99
0
0

加載中.... (立即查看全文)

全文

(1)

Big Data Processing Technologies

Chentao Wu

Associate Professor

Dept. of Computer Science and Engineering [email protected]

(2)

Schedule

• lec1: Introduction on big data and cloud computing

• Iec2: Introduction on data storage

• lec3: Data reliability (Replication/Archive/EC)

• lec4: Data consistency problem

• lec5: Block level storage and file storage

• lec6: Object-based storage

• lec7: Distributed file system

• lec8: Metadata management

(3)

Collaborators

(4)

Data Reliability Problem (1)

Google – Disk Annual Failure Rate

(5)

Data Reliability Problem (2)

Facebook-- Failure nodes in a 3000 nodes cluster

(6)

Contents

Introduction on Replication

1

(7)

What is Replication?

• Replication can be classified as

• Local replication

• Replicating data within the same array or data center

• Remote replication

• Replicating data at remote site

It is a process of creating an exact copy (replica) of data.

Replication

Source Replica (Target)

REPLICATION

(8)

File System Consistency: Flushing Host Buffer

File System Application

Memory Buffers

Logical Volume Manager Physical Disk Driver

Data

Flush Buffer

Source Replica

(9)

Database Consistency: Dependent Write I/O Principle

D

Inconsistent

C

Consistent

Source Replica

4 4

3 3

2 2

1 1

4 4

3 3

2 1

C

Source Replica

(10)

Host-based Replication: LVM-based Mirroring

CC

Host

Logical Volume

Physical Volume 1

Physical Volume 2

• LVM: Logical Volume Manager

(11)

Host-based Replication: File System Snapshot

CC

• Pointer-based replication

• Uses Copy on First

Write (CoFW) principle

• Uses bitmap and block map

• Requires a fraction of the space used by the production FS

Metadata Production FS

Metadata 1 Data a 2 Data b

FS Snapshot

3 no data 4 no data

BLK Bit

1-0 1-0 2-0 2-0

N Data N 3 Data C

2 Data c 3-1

4 Data D 1 Data d

4-1

3-2 4-1

(12)

Storage Array-based Local Replication

CC

• Replication performed by the array operating environment

• Source and replica are on the same array

• Types of array-based replication

• Full-volume mirroring

• Pointer-based full-volume replication

• Pointer-based virtual replication

BC Host Storage Array

Replica Source

Production Host

(13)

Full-Volume Mirroring

Source

Attached

Storage Array

Read/Write Not Ready

Production Host BC Host

Target

Detached – Point In Time

Read/Write Read/Write

Source

Storage Array

Production Host BC Host

Target

(14)

Copy on First Access: Write to the Source

Source C’

Target

When a write is issued to the source for the first time after replication session activation:

Original data at that address is copied to the target

Then the new data is updated on the source

This ensures that original data at the point-in-time of activation is preserved on the target

Production Host BC Host

C Write to

Source

A B

C’ C

(15)

Copy on First Access: Write to the Target

When a write is issued to the target for the first time after replication session activation:

The original data is copied from the source to the target

Then the new data is updated on the target

Source

B’

Target

Production Host BC Host

B

Write to Target

A B

C’ C

B’

(16)

Copy on First Access: Read from Target

When a read is issued to the target for the first time after replication session activation:

The original data is copied from the source to the target and is made available to the BC host

Source

A

Target

Production Host BC Host

A

Read request for

data “A”

A B

C’ C

B’

A

(17)

Tracking Changes to Source and Target

Source Target

0 unchanged changed

Logical OR At PIT

Target Source After PIT…

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1 0 0 1 0 1 0 0

0 0 1 1 0 0 0 1

1 0 1 1 0 1 0 1

1

For resynchronization/restore

(18)

Contents

Introduction to Erasure Codes

2

(19)

Erasure Coding Basis (1)

• You've got some data • And a collection of storage nodes.

• And you want to store the data on the storage nodes so that you can get the data back, even when the nodes fail..

(20)

Erasure Coding Basis (2)

• More concrete: You have k

disks worth of data • And n total disks.

• The erasure code tells you how to create n disks worth of data+coding so that when disks fail, you can still get the data

(21)

Erasure Coding Basis (3)

• You have k disks worth of

data • And n total disks.

• n = k + m

• A systematic erasure code stores the data in the clear on k of the n disks. There are k data disks, and m coding or “parity”

disks.  Horizontal Code

(22)

Erasure Coding Basis (4)

• You have k disks worth of

data • And n total disks.

• n = k + m

• A non-systematic erasure code stores only coding information, but we still use k, m, and n to describe the code.  Vertical Code

(23)

Erasure Coding Basis (5)

• You have k disks worth of

data • And n total disks.

• n = k + m

• When disks fail, their contents become unusable, and the storage system detects this. This failure mode is called an erasure.

(24)

Erasure Coding Basis (6)

• You have k disks worth of

data • And n total disks.

• n = k + m

• An MDS (“Maximum Distance Separable”) code can reconstruct the data from any m failures.  Optimal

• Can reconstruct any f failures (f < m)  non-MDS code

(25)

Two Views of a Stripe (1)

• The Theoretical View:

– The minimum collection of bits that encode and decode together.

– r rows of w-bit symbols from each of n disks:

(26)

Two Views of a Stripe (2)

• The Systems View:

– The minimum partition of the system that encodes and decodes together.

– Groups together theoretical stripes for performance.

(27)

Horizontal & Vertical Codes

• Horizontal Code

• Vertical Code

(28)

Expressing Code with Generator Matrix (1)

(29)

Expressing Code with Generator Matrix (2)

(30)

Expressing Code with Generator Matrix (3)

(31)

Encoding— Linux RAID-6 (1)

(32)

Encoding— Linux RAID-6 (2)

(33)

Encoding— Linux RAID-6 (3)

(34)

Accelerate Encoding— Linux RAID-6

(35)

Encoding— RDP (1)

(36)

Encoding— RDP (2)

(37)

Encoding— RDP (3)

(38)

Encoding— RDP (4)

(39)

Encoding— RDP (5)

(40)

Encoding— RDP (6)

Horizontal Parity Diagonal Parity Data

0 1 2 3 4 5 6 7

0

1

2

3

4

5

• Horizontal parity layout (p=7, n=8)

(41)

Encoding— RDP (7)

• Diagonal parity layout (p=7, n=8)

Horizontal Parity Diagonal Parity Data

0 1 2 3 4 5 6 7

0

1

2

3

4

5

(42)

Arithmetic for Erasure Codes

• When w = 1: XOR's only.

• Otherwise, Galois Field Arithmetic GF(2w)

w is 2, 4, 8, 16, 32, 64, 128 so that words fit evenly into computer words.

– Addition is equal to XOR.

Nice because addition equals subtraction.

– Multiplication is more complicated:

Gets more expensive as w grows.

Buffer-constant different from a * b.

Buffer * 2 can be done really fast.

Open source library support.

(43)

Decoding with Generator Matrices (1)

(44)

Decoding with Generator Matrices (2)

(45)

Decoding with Generator Matrices (3)

(46)

Decoding with Generator Matrices (4)

(47)

Decoding with Generator Matrices (5)

(48)

Erasure Codes — Reed Solomon (1)

• Given in 1960.

• MDS Erasure codes for any n and k.

– That means any m = (n-k) failures can be tolerated without data loss.

• r = 1

(Theoretical): One word per disk per stripe.

• w constrained so that n ≤ 2w.

• Systematic and non-systematic forms.

(49)

Erasure Codes —Reed Solomon (2)

Systematic RS -- Cauchy generator matrix

(50)

Erasure Codes —Reed Solomon (3)

Non-Systematic RS -- Vandermonde generator matrix

(51)

Erasure Codes —Reed Solomon (4)

Non-Systematic RS -- Vandermonde generator matrix

(52)

Erasure Codes —EVENODD 1995 (7 disks, tolerating 2 disk failures)

• Horizontal Parity Coding

• Calculated by the data

elements in the same row

• E.g. 𝐶0,5 = 𝐶0,0 ⊕ 𝐶0,1 ⊕ 𝐶0,2 ⊕ 𝐶0,3

⊕ 𝐶0,4

• Diagonal Parity Coding

• Calculated by the data elements and S

• E.g. 𝐶0,6 = 𝐶0,0 ⊕ 𝐶3,2 ⊕ 𝐶2,3 𝐶1,4 ⊕ 𝑆

(53)

Erasure Codes —X-Code 1999 (1)

• Diagonal parity layout (p=7, n=7)

Diagonal Parity Anti-diagonal Parity Data

0 1 2 3 4 5 6

0

1

2

3

4

5

6

(54)

Erasure Codes —X-Code 1999 (2)

• Anti-diagonal parity layout (p=7, n=7)

Diagonal Parity Anti-diagonal Parity Data

0 1 2 3 4 5 6

0

1

2

3

4

5

6

(55)

Erasure Codes —H-Code (1)

• Horizontal parity layout (p=7, n=8)

Horizontal Parity Anti-diagonal Parity Data

0 1 2 3 4 5 6 7

0

1

2

3

4

5

(56)

Erasure Codes —H-Code (2)

• Anti-diagonal parity layout (p=7, n=8)

Horizontal Parity Anti-diagonal Parity Data

0 1 2 3 4 5 6 7

0

1

2

3

4

5

(57)

Erasure Codes —H-Code (3)

• Recover double disk failure by single recovery chain

Horizontal Parity Anti-diagonal Parity

Data Lost Data and Parity

Recovery Chain

1

2 3

4 5

6 7

8 9

10 11

X

12

F

H

J L

B D

K E A

I G

C

0 1 2 3 4 5 6 7

0

1

2

3

4

5

(58)

Erasure Codes —H-Code (4)

• Recover double disk failure by two recovery chains

5

Horizontal Parity Anti-diagonal Parity

Data Lost Data and Parity

Recovery Chain

1

2 3

4 5

X6

D E

L J

K I

H

F G

A

C B

0 1 2 3 4 5 6 7

0

1

2

3

4

5

1

2 3

4

X6

(59)

Erasure Codes —HDP Code (1)

• Diagonal parity layout (p=7, n=6)

0 1 2 3 4 5

0

1

2

3

4

5

HDP ADP

Data

(60)

Erasure Codes —HDP Code (2)

• Diagonal parity layout (p=7, n=6)

0 1 2 3 4 5

0

1

2

3

4

5

HDP ADP

Data

(61)

Erasure Codes —HDP Code (3)

• HDP reduces more than 30% average recovery time.

0 1 2 3 4 5

0

1

2

3

4

5

HDP ADP

Data Lost Data and Parity

A B 1

D C

F E

K L

I J

G H

F

F

F

F 2

3 4

5 6

1 2

3 4

5 6

(62)

Contents

Replication and EC in Cloud

3

(63)

Three Dimensions in Cloud Storage

(64)

Replication vs Erasure Coding (RS)

(65)

Fundamental Tradeoff

(66)

Pyramid Codes (1)

(67)

Pyramid Codes (2)

(68)

Pyramid Codes (3) Multiple Hierachies

(69)

Pyramid Codes (4) Multiple Hierachies

(70)

Pyramid Codes (5) Multiple Hierachies

(71)

Pyramid Codes (6)

(72)

Google GFS II – Based on RS

(73)

Microsoft Azure (1)

How to Reduce Cost?

(74)

Microsoft Azure (2)

Recovery becomes expensive

(75)

Microsoft Azure (3)

Best of both worlds?

(76)

Microsoft Azure (4)

Local Reconstruction Code (LRC)

(77)

Microsoft Azure (5)

Analysis LRC vs RS

(78)

Microsoft Azure (6)

Analysis LRC vs RS

(79)

Recovery problem in Cloud

• Recovery I/Os from 6 disks (high network bandwidth)

(80)

Optimizing Recovery Network I/O (1)

(81)

Optimizing Recovery Network I/O (1)

• Establish recovery relationships among disks

(82)

Optimizing Recovery I/O (3)

• ~20+% savings in general

(83)

Regenerating Codes (1)

• Data = {a,b,c}

(84)

Regenerating Codes (2)

• Optimal Repair

(85)

Regenerating Codes (3)

• Optimal Repair

(86)

Regenerating Codes (4)

• Optimal Repair

(87)

Regenerating Codes (4)

Analysis -- Regenerating vs RS

(88)

Facebook Xorbas Hadoop

Locally Repairable Codes

(89)

Combination of Two ECs (1)

Recovery Cost vs. Storage Overhead

(90)

Combination of Two ECs (2)

Fast Code and Compact Code

(91)

Combination of Two ECs (3)

Analysis

(92)

Combination of Two ECs (4)

Analysis

(93)

Combination of Two ECs (5)

Analysis

(94)

Combination of Two ECs (6) Conversion

• Horizontal parities require no re-computation

• Vertical parities require no data block transfer

• All parity updates can be done in parallel and in a distributed manner

(95)

Combination of Two ECs (7)

Results

(96)

Contents

Project 1

4

(97)

Erasure Code in Hadoop (1)

• Implement an erasure code into Hadoop system

• Hadoop Version: 2.7 or higher

• Erasure Code: you can select one, but not RS

• Test the storage efficiency of your proposed code

• Report and Source Code are required

• Source Code should be checked by TA

• Deadline: June 30th

(98)

Erasure Code in Hadoop (2)

• References

• Jerasure

http://web.eecs.utk.edu/~plank/plank/www/software.html

• HDFS-Xorbas

http://smahesh.com/HadoopUSC/

(99)

Thank you!

參考文獻

相關文件

vice versa.’ To verify the rule, you chose 100 days uniformly at random from the past 10 years of stock data, and found that 80 of them satisfy the rule. What is the best guarantee

In this section we define a general model that will encompass both register and variable automata and study its query evaluation problem over graphs. The model is essentially a

A quote from Dan Ariely, “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they

To complete the “plumbing” of associating our vertex data with variables in our shader programs, you need to tell WebGL where in our buffer object to find the vertex data, and

what is the most sophisticated machine learning model for (my precious big) data. • myth: my big data work best with most

“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced?. insight and

For the data sets used in this thesis we find that F-score performs well when the number of features is large, and for small data the two methods using the gradient of the

We showed that the BCDM is a unifying model in that conceptual instances could be mapped into instances of five existing bitemporal representational data models: a first normal