Database Systems (資料庫系統) Lecture #7

(1)

Database Systems

( 資料庫系統 )

November 7, 2005

Lecture #7

(2)

Announcement

• Assignment #2 solution will be posted on the

course homepage.

• Assignment #3 is due on Thur (11/10) outside

TA’s office.

– Do not accept late assignments.

– Assignment solution will be posted on the course

homepage on Friday (11/11).

• Practicum assignment #1 is available on cour

se homepage.

• Midterm next Monday

(3)

Storing Data: Disks and

Files

(4)

Disks and Files

• DBMS stores information on (“hard”) disks.

• This has major performance implications

for DB system design!

–

READ: transfer data from disk to main memory

(RAM).

–

WRITE: transfer data from RAM to disk.

–

Both are high-cost operations, relative to

in-memory operations, so must be planned

carefully!

(5)

Why Not Store Everything in Main

Memory?

• Costs too much.

– $100 for 1G of SDRAM

– $100 for 250 GB of HD (cost x250)

– $40 for 50 GB of tapes. (cost same as HD) ->

“Is Tape for backup dead?”

• Main memory is volatile.

– We want data to be saved between runs.

• Typical storage hierarchy:

–

Main memory (RAM) for currently used data.

–

Disk for the main database (secondary storage).

–

Tapes for archiving older versions of the data

(6)

Disks

• Secondary storage device of choice.

– Main advantage over tapes: random access vs.

sequential.

• Tapes are for data backup, not for operational

data.

– Access the last byte in a tape requires winding through the entire tape.

• Data is stored and retrieved in units called disk

blocks or pages.

• Unlike RAM, time to retrieve a disk page varies

depending upon location on disk.

– Therefore, relative placement of pages on disk has

(7)

Components of a Disk

• The platters spin.

• The arm assembly is mo

ved in or out to posit

ion a head on a desire

d track. Tracks under

heads make a

cylinder

.

• Only one head reads/wr

ites at any one time.

• Block size

is a multip

le of

sect

or size

(which is fixe

Platters Spindle Disk head Arm movement Arm assembly Tracks Sector

(8)

(9)

Accessing a Disk Page

• Time to access (read/write) a disk block, called access

time, is a sum of:

– _{seek time (}_{moving arm to position disk head on right track}₎ – _{rotational delay (}_{waiting for block to rotate under head}₎ – transfer time (_{actually moving data to/from disk surface})

• Seek time and rotational delay (mechanical parts) dominate the access time

– _{Seek time varies from about 1 to 20msec (avg 10msec)} – _{Rotational delay varies from 0 to 8msec (avg. 4msec)}

– _{Transfer rate is about 100MBps (0.025msec per 4KB page)}

• Key to lower I/O cost: reduce seek/rotation delays!

– If two pages of records are accessed together frequently, put them

(10)

(11)

Arranging Pages on Disk

• Next

block concept (measure the closeness of

blocks)

– _{(1) blocks on same track (no movement of arm), followed}

by

– (2) blocks on same cylinder (switch head, but almost no

movement of arm), followed by

– _{(3) blocks on adjacent cylinder (little movement of arm)}

• Blocks in a file should be arranged sequentially on

disk (by `next’), to minimize seek and rotational

delay.

• For a sequential scan, pre-fetching

several pages at

a time is a big win!

(12)

Platters Spindle Disk head Arm movement Arm assembly Tracks Sector 1 2 3

(13)

RAID

• RAID = Redundant Arrays of Independent

(Inexpensive) Disks

– Disk Array: Arrangement of several disks that gives abstraction of a single, large disk.

• Goals: Increase performance and reliability.

– Say you have D disks & each I/O request wants D blocks – How to improve the performance (data transfer rate)? – How to improve the performance (request service rate)? – How to improve reliability (in case of disk failure)?

(14)

Two main techniques in

RAID

•

Data striping

_{improves performance.}

– _{Data (e.g., in the same time file) is partitioned across}

multiple HDs; size of a partition is called the striping unit.

– _{Performance gain is from reading/writing multiple HDs at}

the same time.

•

Redundancy

_{improves reliability.}

– _{Data striping lowers reliability: More disks → more fail}

ures.

– _Storeredundant_{information on different disks. When a d}

(15)

RAID Levels

• Level 0: No redundancy (only data striping)

• Level 1: Mirrored (two identical copies)

• Level 0+1: Striping and Mirroring

• (Level 2: Error-Correcting Code)

• Level 3: Bit-Interleaved Parity

• Level 4: Block-Interleaved Parity

• Level 5: Block-Interleaved Distributed Parity

• (Level 6: Error-Correcting Code)

• More Levels (01-10, 03/30, …)

(16)

RAID Level 0

• Strip data across all drives (minimum 2 drives)

• Sequential blocks of data (in the same file) are

written across multiple disks in stripes.

• Two performance criterions:

– Data transfer rate: net transfer rate for a single (large) file – request service rate: rate at which multiple requests

(17)

RAID Level 0

• Improve data transfer rate:

– Read 10 blocks (1~10) takes only 2-block access time (worse of 5 disks).

– Theoretical speedup over single disk = N (number of disks)

• Improve request service rate:

– File 1 occupies blocks 1 and file 2 occupies block 2. Service two requests (two files) at the same time. – Theoretical speedup over single disk = N.

(18)

RAID Level 0

• Poor reliability:

– Mean Time To Failure (MTTF) of one disk = 50K hours (5.7 years).

– MTTF of a disk array of 100 disks is 50K/100 = 500 hours (21 days)!

– MTTF decreases linearly with the number of disks.

• No space redundancy

(19)

Mirrored (RAID Level 1)

• Redundancy by duplicating data on different disks:

– Mirror means copy each file to both disks – Simple but expensive.

• Fault-tolerant to a single disk failure

– Recovery by copying data from the other disk to new disk.

– The other copy can continue to service requests (availability) during recovery.

(20)

Mirrored (RAID Level 1)

• Performance is not the objective, but reliability.

– Mirroring frequently used when availability is more important than storage efficiency.

• Data transfer rate:

– Write performance may be slower than single disk, why?

• Worse of 2 disks

– Read performance can be faster than single disk, why?

• Consider reading block 1 from disk 0 and block 2 from disk 1 at the same time.

– Compare read performance to RAID Level 0?

(21)

Mirrored (RAID Level 1)

• Data reliability:

– Assume Mean-Time-To-Repair (MTTR) is 1 hour.

• Shorter with Hotswap HDs.

– MTTF of Mirrored 2-disks = 1 / (probability that 2 disks will fail within the same hour) = MTTR2/2 = (50K) 2/2 hour

s = many many years.

• Space redundancy overhead:

(22)

Striping and Mirrors (RAID

0+1)

(23)

Bit-Interleaved Parity (RAID

Level 3)

• Fine-grained striping at the bit level

• One parity disk:

– Parity bit value = XOR across all data bit values

• If one disk fails, recover the lost data:

– XOR across all good data bit values and parity bit value

? 1

0 0

(24)

Bit-Interleaved Parity (RAID

Level 3)

• Performance:

– Transfer rate

speedup?

• x32 of single disk

– Request service

rate improvement?

• Same as single disk (do one request at a time)

• Reliability:

– Can tolerate 1 disk

failure.

• Space overhead:

– One parity disk

(1/33 overhead)

(25)

Block-Interleaved Parity

(RAID Level 4)

• Coarse-grained striping at the block level

– Otherwise, it is similar to RAID 3

• If one disk fails, recovery the lost block:

– Read same block of all disks (including

parity disk) to reconstruct the lost block.

(26)

Block-Interleaved Parity

(RAID Level 4)

• Performance:

– If error, read/write of same block on all disks (worse-of-N on one block)

– If no error, write also needs to update (read-n-write) the parity block. (no need to read other disks)

• Can compute new parity based on old data, new data, and old parity

• New parity = (old data XOR new data) XOR old parity

– Result in bottleneck on the parity disk! (can do only one write at a time)

(27)

Block-Interleaved Parity

(RAID Level 4)

• Reliability:

– Can tolerate 1 disk failure.

• Space redundancy overhead:

(28)

Block-Interleaved

Distributed-Parity (RAID Level 5)

• Remove the parity disk bottleneck in RAID

L4 by distributing the parity uniformly over

all of the disks.

– No single parity disk as bottleneck; otherwise, it

is the same as RAID 4.

• Performance improvement in write.

– You can write to multiple disks (in 2-disk pairs)

in parallel.

• Reliability & space redundancy are the

same as RAID L4.

(29)

Structure of DBMS

• Disk Space Manager

– manage space (pages) on

disk.

• Buffer Manager

– manage traffic between

disk and main memory.

(bring in pages from disk

to main memory).

• File and Access Methods

– Organize records into

pages and files.

Query Optimization and Execution Relational Operators Files and Access Methods

Buffer Manager Disk Space Manager

Applications

(30)

Disk Space Manager

• Lowest layer of DBMS software manages space on

disk.

• Higher levels call upon this layer to:

– _{allocate/de-allocate a page} – _{read/write a page}

• Request for a sequence of pages should be satisfied

by allocating the pages sequentially on disk!

– Support the “Next” block concept (reduce I/O cost when multiple sequential pages are requested at the same time). – Higher levels (buffer manager) don’t need to know how this

(31)

More on Disk Space

Manager

• Keep track of free (used) blocks:

– List of free blocks + the pointer to the first free

block

– Bitmap with one bit for each disk block. Bit=1

(used), bit=0 (free)

– Bitmap approach can be used to identify

contiguous areas on disk.

(32)

Buffer Manager

• Typically, DBMS has more data than main memory. • Bring Data into main memory for DBMS to operate on it! • Table of <frame#, pageid> pairs is maintained.

DB

MAIN MEMORY DISK

disk page free frame

Page Requests from Higher Levels

BUFFER POOL

choice of frame dictated by replacement policy

(33)

When a Page is

Requested ...

• If the requested page is not in pool (and no

free frame):

–

Choose an occupied frame for replacement

• _{Page replacement policy (minimize page miss rate)}

–

_{If the replaced frame is dirty, write it to disk}

–

_{Read requested page into chosen frame}

–

Pin the page and return its address.

• For each frame, you maintain

– Pin_count: number of outstanding requests

– Dirty: modified and need to written back to disk

• If requests can be predicted (e.g., sequential

scans), pages can be pre-fetched several

(34)

More on Buffer Manager

• Requestor of page must unpin it (no longer

need it), and indicate whether the page

has been modified:

–

dirty bit is used for this.

• Page in pool may be requested many

times,

–

a pin count is used. A page is a candidate for

(35)

Buffer Replacement Policy

• Frame is chosen for replacement by a

replacement po

licy:

– _{Least-recently-used (LRU): have LRU queue of frames with p}

in_count = 0

– _{Clock (approximate LRU with less overhead)}

• Use an additional reference_bit per page; set to 1 when the frame

is accessed

• _{Clock hand}_{moving from frame 0 to frame n.}

• _{Reset reference_bit of recently accessed frames.}

• _{Replace frame(s) with reference_bit = 0 & pin_count = 0.}

– _{FIFO, MRU, Random, etc.}

(36)

Clock Algorithm Example

disk page free frame BUFFER POOL Pin_count=0 1 1 0 1 0 Clock Hand

(37)

Sequential Flooding

• Use LRU + repeated

sequential scans + (#

buffer frames < # pages in

file)

– What many page I/O replacements?

• Example:

– #buffer frames = 2

– #pages in a file = 3 (P1, P2, P3)

• Repeated scan of file

– Every scan of the file result in reading every page of the file.

Block

read Frame #1 Frame #2

P1 P1 P2 P1 P2 P3 P3 P2 P1 P3 P1 P2 P2 P1 P3 P2 P3

(38)

DBMS vs. OS File System

• OS does disk space & buffer mgmt: why not

let OS manage these tasks?

– DBMS can better predict the page reference

patterns & pre-fetch pages.

• Adjust replacement policy, and pre-fetch pages based on

access patterns in typical DB operations.

– DBMS needs to be able to pin a page in memory

and force a page to disk.

• Differences in OS support: portability issues

– DBMS can maintain a virtual file that spans

multiple disks.

(39)

Files of Records

• Higher levels of DBMS operate on

records

, and

f

iles of records.

• FILE

: A collection of pages, each containing a

collection of records. Must support:

–

_{Insert/delete/modify record(s)}

–

_{Read a particular record (specified using}

record id

₎

–

_{Scan all records (possibly with some conditions on th}

e records to be retrieved)

• To support record level operations, we must kee

p track of:

–

_{Fields in a record:}

Record format

–

_{Records on a page:}

Page format

(40)

L1

L2

L3

L4

F1

F2

F3

F4

Record Formats

(how to organize

fields in a record):

Fixed Length

• Information about field types and offset same

for all records in a file; stored in

system c

atalogs

.

• Finding i-th field requires adding offsets to

(41)

Fields Delimited by Special Symbols

Record Formats: Variable

Length

• Two alternative formats (# fields is fixed):

Second alternative offers direct access to the i-th field, e

4 $

$

Field Count

F1 F2 F3 F4

(42)

Page Formats

(How to store records in a

page):

Fixed Length Records

• Record id = <page id, slot #>.

• They differ on how deletion (which creates a hole) is handled. • In first alternative, shift remaining records to fill hole =>

changes rid; may not be acceptable given external reference.

Slot 1 Slot 2 Slot N

. . .

N . . . 0 1M M ... 3 2 1

PACKED _{UNPACKED, BITMAP} Slot 1 Slot 2 Slot N Free Space Slot M 1 1 number of records number of slots

(43)

Page Formats: Variable Length Records

 _{Slot directory contains} one slot per record.

 _{Each slot contains (record offset, record length)}  Deletion is by setting the record offset to -1.

 _{Can move records on page without changing rid (change the record} Page i Rid = (i,N) Rid = (i,2) Rid = (i,1) Pointer to start of free space SLOT DIRECTORY N . . . 2 1 20 16 24 N # slots

(44)

Unordered (Heap) Files

• Simplest file structure contains records in no

particular order.

• As file grows and shrinks, disk pages are

allocated and de-allocated.

• There are two ways to implement a heap file

– Double-Linked lists – Page directory

(45)

Heap File (Doubly Linked Lists)

• The header page id and Heap file name must be stored someplace. • Each page contains 2 `pointers’ plus data.

• The problem is that inserting a variable size record requires walking through free space list to find a page with enough space.

Header Page Data Page Data Page Data Page Data Page Data Page Data

Page Pages with Free Space

(46)

Heap File (Page Directory)

• The directory is a collection of pages.

– Each directory page contains multiple directory entries – one per data page.

– The directory entry contains <page id, free bytes on the page> Data Page 1 Data Page 2 Data Page N Header Page DIRECTORY