Database Systems
( 資料庫系統 )
November 8, 2004
Lecture #9
Announcement
• Midterm exam: November 20 (Sat), 2:30 PM, in CSIE 101/103
• Assignment #6 is available on the course homepage.
– It is due on 11/24
– It is very difficult
– We suggest you do it before the midterm exam
• Assignment #7 will be available on the course homepage
later this afternoon.
– It is due 11/16 (next Tuesday).
– It is easy.
Cool Ubicomp Project
Counter Intelligence (MIT)
• Smart kitchen & kitchen wares
• Talking Spoon
– Salty, sweet, hot?
• Talking Cutlery
– Bacteria?
• Smart fridge & counters
– RFID tags
– Tracking food from fridge to your mouth
Hash-Based Indexing
Introduction
• Recall that hash-based indexes are best for equality selections.
– Cannot support range searches.
– Equality selections are useful for join operations.
• Static and dynamic hashing techniques exist
– Trade-offs similar to ISAM vs. B+ trees.
– Static hashing technique
– Two dynamic hashing techniques
• Extendible Hashing
• Linear Hashing
Static Hashing
• # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
• h(k) mod N = bucket to which the data entry with key k belongs. (N = # of buckets)
[Figure: h hashes the key, and h(key) mod N selects one of the N primary bucket pages (0 to N-1), each of which may chain to overflow pages.]
Static Hashing (Contd.)
• Buckets contain data entries.
• Hash function works on search key field of record r.
– Ideally uniformly distribute values over range 0 ... N-1
– h(key) = (a * key + b) usually works well.
– a and b are constants; lots known about how to tune h.
• Cost for insert/delete/search:
– two/two/one disk page I/Os (assuming no overflow chains).
• Long overflow chains can develop and degrade performance.
– Why poor performance? Searches must scan through overflow chains linearly.
– Extendible and Linear Hashing: dynamic techniques to fix this problem.
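To make the costs concrete, here is a minimal Python sketch of static hashing with overflow chains; the page capacity and the constants a and b are illustrative assumptions, not values from the lecture:

PAGE_CAPACITY = 4      # data entries per page (assumed)
A, B = 37, 11          # h(key) = a * key + b, then take mod N

class StaticHashFile:
    def __init__(self, n_buckets):
        self.n = n_buckets
        # Each bucket is a list of pages; pages after the first are overflow pages.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _h(self, key):
        return (A * key + B) % self.n

    def insert(self, key):
        pages = self.buckets[self._h(key)]
        if len(pages[-1]) == PAGE_CAPACITY:   # last page of the chain is full
            pages.append([])                  # chain a new overflow page
        pages[-1].append(key)

    def search(self, key):
        # Cost grows with the length of the overflow chain for this bucket.
        return any(key in page for page in self.buckets[self._h(key)])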
Simple Solution
• Avoid creating overflow pages:
– When a bucket (primary page) becomes full, double #
of buckets & re-organize the file.
• What’s wrong with this simple solution?
– High cost: re-organizing the file means reading and writing every page, which is expensive.
Extendible Hashing
• The basic idea (another level of abstraction):
– Use a directory of pointers to buckets (hash to a directory entry).
– Double the # of buckets by doubling the directory.
– Split just the bucket that overflowed!
• The directory is much smaller than the file, so doubling it is much cheaper.
• Only one page of data entries is split.
– The page that overflows is rehashed into two pages.
• The trick lies in how the hash function is adjusted!
– Before doubling the directory, h(r) maps to buckets 0..N-1.
– After doubling the directory, h(r) maps to buckets 0..2N-1.
Example
• Directory is an array of size 4.
• To find the bucket for r, take the last (global depth) bits of h(r).
– Example: if h(r) = 5 = binary 101, it is in the bucket pointed to by directory entry 01.
• Global depth: # of bits used to index the directory.
• Local depth of a bucket: # of bits used to hash entries into that bucket.
• When can global depth be different from local depth?
[Figure: directory with global depth 2; entries 00, 01, 10, 11 point to Buckets A-D, each with local depth 2. Bucket A holds 4*, 12*, 32*, 16*; Bucket B holds 1*, 5*, 21*, 13*; Bucket C holds 10*; Bucket D holds 15*, 7*, 19*.]
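As a small sketch of the lookup rule just described (assuming the directory is simply a Python list indexed by the last global-depth bits of h(r)):

def find_bucket(directory, global_depth, hash_value):
    # Use the last `global_depth` bits of the hash value as the directory index.
    return directory[hash_value & ((1 << global_depth) - 1)]

# Slide example: h(r) = 5 = binary 101 and global depth 2 give index 0b01,
# i.e. find_bucket(directory, 2, 5) returns the bucket pointed to by entry 01.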
Insert h(r) = 20 (Causes Doubling)
[Figure: inserting 20* overflows Bucket A (32*, 16*, 4*, 12*). Since local depth = global depth = 2, the directory doubles to 8 entries (global depth 3). Bucket A keeps 32*, 16* and its split image A2 gets 4*, 12*, 20*, both with local depth 3 (4 = 0000 0100, 12 = 0000 1100, 20 = 0001 0100).]
Extendible Hashing Insert
• Check if the bucket is full.
– If no, done!
• Otherwise, check whether local depth = global depth.
– If no: rehash the entries and distribute them into two buckets, and increment the local depth.
– If yes: double the directory, then rehash the entries and distribute them into two buckets.
• The directory is doubled by copying it over and `fixing’ the pointer to the split image page.
– You can do this only by using the least significant bits.
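A minimal Python sketch of this insert procedure, assuming a small fixed page capacity and a plain list as the directory (both illustrative choices, not from the lecture):

PAGE_CAPACITY = 4   # assumed bucket (page) capacity

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def insert(self, key):
        bucket = self.directory[key & ((1 << self.global_depth) - 1)]
        if len(bucket.keys) < PAGE_CAPACITY:
            bucket.keys.append(key)               # bucket not full: done
            return
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory * 2   # double the directory by copying (LSB trick)
            self.global_depth += 1
        # Split the overflowing bucket: create its split image and fix pointers.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (bucket.local_depth - 1)) & 1:
                self.directory[i] = image
        # Rehash the old entries into the two buckets, then retry the new key.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[k & ((1 << self.global_depth) - 1)].keys.append(k)
        self.insert(key)    # may trigger another split if the data are skewed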
Insert 9
[Figure: inserting 9* into Bucket B (1*, 5*, 21*, 13*). Local depth 2 < global depth 3, so the directory is not doubled: Bucket B keeps 1*, 9* and its split image B2 gets 5*, 21*, 13*, both with local depth 3 (1 = 0000 0001, 9 = 0000 1001, 5 = 0000 0101, 21 = 0001 0101, 13 = 0000 1101).]
Directory Doubling
• Why use the least significant bits in the directory?
– It allows doubling the directory via copying!
[Figure: a directory of size 4 doubled to size 8 for data entry 6* (6 = 110). With least significant bits the new directory is simply a copy of the old one; with most significant bits the entries would have to be interleaved.]
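A tiny runnable illustration of doubling via copying with least-significant-bit indexing (the bucket names are placeholders):

# Old directory with global depth 2, indexed by the last 2 bits of h(r).
directory = ["A", "B", "C", "D"]
global_depth = 2

# With LSB indexing, index i and index i + 2**global_depth must point to the
# same bucket immediately after doubling, so the new directory is a plain copy.
directory = directory * 2        # ["A", "B", "C", "D", "A", "B", "C", "D"]
global_depth = 3
# With MSB indexing the new entries would have to be interleaved,
# so doubling could not be done by copying.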
Comments on Extendible
Hashing
• If the directory fits in memory, an equality search is answered with one disk access; else two.
– 100MB file, 100 bytes/rec: you have 1M data entries.
– A 4K page (a bucket) can hold 40 data entries, so you need about 25,000 directory elements; chances are high that the directory will fit in memory (a quick arithmetic check appears after this slide).
– If the distribution of hash values is skewed (concentrates on a few
buckets), directory can grow large.
• Delete: if removing a data entry makes a bucket empty, the bucket can be merged with its `split image’. If each directory element points to the same bucket as its split image, the directory can be halved.
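A quick arithmetic check of the numbers above (a sketch; it assumes a data entry is roughly the 100-byte record):

file_size   = 100 * 2**20      # 100 MB file
record_size = 100              # bytes per data entry
page_size   = 4 * 2**10        # 4 KB bucket page

n_entries        = file_size // record_size       # ~1M data entries
entries_per_page = page_size // record_size       # ~40 entries per bucket page
n_directory      = n_entries // entries_per_page  # roughly the "about 25,000" on the slide
print(n_entries, entries_per_page, n_directory)   # 1048576 40 26214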
Linear Hashing (LH)
• This is another dynamic hashing scheme, an alternative to
Extendible Hashing.
– LH fixes the problem of long overflow chains (from static hashing) without using a directory (as in extendible hashing).
• Basic idea: use a family of hash functions h0, h1, h2, ...
– Each function’s range is twice that of its predecessor.
– Pages are split when overflows occur – but not necessarily the overflowing page. (Splitting occurs in turn, in a round robin fashion.)
– Buckets are added gradually (one bucket at a time).
– When all the pages at one level (the current hash function) have been split, a new level is applied.
Levels of Linear Hashing
• Initial Stage.
– The initial level distributes entries into N0 buckets.
– Call the hash function to perform this h0.
• Splitting buckets.
– If a bucket overflows, its primary page is chained to an overflow page (same as in static hashing).
– Also when a bucket overflows, some bucket is split.
• The first bucket to be split is the first bucket in the file (not necessarily the bucket that overflows).
• The next bucket to be split is the second bucket in the file … and so on, until the N0-th bucket has been split.
• When buckets are split their entries (including those in overflow pages) are distributed using h1.
– To access split buckets the next level hash function (h1) is applied.
Levels of Linear Hashing
(Cont.)
• Level progression:
– Once all Ni buckets of the current level (i) are split, the hash function hi is replaced by hi+1.
– The splitting process starts again at the first bucket, and hi+2 is applied to buckets split at the new level.
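A minimal Python sketch of the split and level mechanics described above; the page capacity and the split trigger (split on any overflow) are illustrative assumptions:

PAGE_CAPACITY = 4     # assumed entries per primary page

class LinearHash:
    def __init__(self, n0=4):
        self.n0 = n0        # number of buckets at level 0
        self.level = 0      # current level i: h_i(key) = key % (n0 * 2**i)
        self.next = 0       # next bucket to split (round robin)
        self.buckets = [[] for _ in range(n0)]

    def _h(self, key, lvl):
        return key % (self.n0 * 2 ** lvl)

    def _bucket_of(self, key):
        b = self._h(key, self.level)
        if b < self.next:                     # bucket already split at this level
            b = self._h(key, self.level + 1)  # use the next-level hash function
        return b

    def insert(self, key):
        bucket = self.buckets[self._bucket_of(key)]
        overflow = len(bucket) >= PAGE_CAPACITY   # would need an overflow page
        bucket.append(key)                        # (overflow chaining elided in this sketch)
        if overflow:
            self._split()

    def _split(self):
        # Split the bucket pointed to by `next`, not necessarily the one that overflowed.
        old = self.buckets[self.next]
        self.buckets.append([])        # the new bucket goes at the end of the file
        self.buckets[self.next] = []
        for k in old:                  # redistribute entries with the next-level hash function
            self.buckets[self._h(k, self.level + 1)].append(k)
        self.next += 1
        if self.next == self.n0 * 2 ** self.level:   # all buckets of this level split
            self.level += 1
            self.next = 0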
Linear Hashing Example
• Initially, the index level equals 0 and N0 equals 4 (three entries fit on a page).
• h0 maps index entries to one of four buckets.
• h0 is used and no buckets have been split.
• Now consider what happens when 9 (1001) is inserted (which will not fit in the second bucket).
• Note that next indicates which bucket is to split next (round robin).
[Figure: four buckets 00-11 under h0, with next pointing to the first bucket; bucket 00 holds 64, 36; bucket 01 holds 1, 17, 5; bucket 10 holds 6; bucket 11 holds 31, 15.]
Linear Hashing Example 2
• An overflow page is chained to the primary page to contain the inserted value (9).
• If h0 maps a value to a bucket between zero and next – 1 (just the first bucket in this case), h1 must be used to insert the new entry.
• The page indicated by next is split (the first one), and next is incremented.
• Note how the new page falls naturally into the sequence as the fifth page.
[Figure: after the split, bucket 000 (h1) holds 64; bucket 01 (h0), now indicated by next, holds 1, 17, 5 plus 9 on an overflow page; bucket 10 (h0) holds 6; bucket 11 (h0) holds 31, 15; the new fifth bucket 100 (h1) holds 36.]
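A small sketch of this lookup rule, assuming h_i(key) = key mod (N0 * 2**i); the traced values in the comments follow the example above:

def lookup_bucket(key, level, next_split, n0=4):
    # Apply h_level first; if that bucket has already been split at this level
    # (its index is below next), apply h_{level+1} instead.
    b = key % (n0 * 2 ** level)
    if b < next_split:
        b = key % (n0 * 2 ** (level + 1))
    return b

# After the first split (level 0, next = 1):
#   key 64 -> h0 gives bucket 0, already split, so h1 is used: 64 % 8 = 0
#   key 36 -> h1 is used as well: 36 % 8 = 4 (the new fifth bucket)
#   key  9 -> h0 gives bucket 1, not yet split, so it stays in bucket 1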
Linear Hashing Example 3
• Assume inserts of 8, 7, 18, 14, 11, 32, 16, 10, 13, 23.
• After the 2nd split the base level is 1 (N1 = 8), so h1 is used.
• Subsequent splits will use h2 for inserts between the first bucket and next – 1.
[Figure: bucket contents after these inserts, with next advancing round robin (positions marked next1, next2, next3); buckets that have already been split are labeled h1 and unsplit buckets h0.]
Linear Hashing vs. Extendible Hashing
• What is the similarity?
– One round of round-robin splitting in LH corresponds to a one-step doubling of the directory in EH.
• What are the differences?
– Directory overhead vs. none
– Overflow pages vs. none
– Gradual splitting (of pages) vs. one-step doubling (of
directory)
– Pages are allocated in order vs. not in order
– Splitting (possibly) non-overflowing pages (LH) vs. splitting the overflowing page (EH)
Summary
• Hash-based indexes: best for equality searches, cannot
support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a
full bucket when a new data entry is to be added to it.
(Duplicates may require overflow pages.)
– Directory to keep track of buckets, doubles periodically.
– Can get large with skewed data; additional I/O if this does not fit
in main memory.
– A skewed data distribution is one in which the hash values of data entries are not uniformly distributed!
Summary (Contd.)
• Linear Hashing avoids directory by splitting buckets
round-robin, and using overflow pages.
– Overflow pages not likely to be long.
– Space utilization could be lower than Extendible Hashing, since
splits not concentrated on `dense’ data areas.
• Can tune the criterion for triggering splits to trade off slightly longer chains for better space utilization.