Database Systems
(Database Systems)
November 28, 2005
Lecture #9
Announcement
• Next week's reading: Chapter 12.
• Pick up your midterm exams at the end of the class.
• Pick up your assignments #1~3 outside the TA office (336/338).
• Assignment #4 & Practicum #2 are due in one week.
Interesting Talk
• Rachel Kern, “From Cell Phones To Monkeys: Research Projects in the Speech Interface Group at the M.I.T. Media Lab”, CSIE 102, Friday 2:20 ~ 3:30
Midterm Exam Score
Distribution
Ubicomp project of the week
• From Pervasive to Persuasive Computing
• Pervasive Computing (smart objects)
– Designed to be aware of people's behaviors
• Examples: smart dining table, smart chair, smart wardrobe, smart mirror, smart shoes, smart spoon, …
• Persuasive Computing
– Designed to change people's behaviors
Smart Device: Credit Card Barbie Doll (from Accenture)
• Barbie gets a wireless implant of a chip and sensors and becomes a decision-making object.
• When one Barbie meets another Barbie …
– She detects the presence of the other Barbie's clothing.
– If she does not have it … she can automatically send an online order through the wireless connection!
– You can give her a credit card limit.
• Good that this is just a concept toy.
• It illustrates the concept of autonomous purchasing objects: car, home, refrigerator, …
Hash-Based Indexing
Introduction
• Hash-based indexes are best for equality selections. They cannot support range searches.
– Equality selections are useful for join operations.
• Static and dynamic hashing techniques;
trade-offs similar to ISAM vs. B+ trees.
– Static hashing technique
– Two dynamic hashing techniques
• Extendible Hashing
• Linear Hashing
Static Hashing
• # primary pages fixed, allocated
sequentially, never de-allocated; overflow
pages if needed.
• h(k) mod N = bucket to which the data entry with key k belongs. (N = # of buckets)
(Diagram: key → h → h(key) mod N selects one of the N primary bucket pages, numbered 0 through N−1, each possibly chaining to overflow pages.)
Static Hashing (Contd.)
• Buckets contain data entries.
• Hash function works on search key field of record r. Must
distribute values over range 0 ... N-1.
– h(key) = (a * key + b) usually works well.
– a and b are constants; lots known about how to tune h.
• Cost for insertion/deletion/search: 2/2/1 disk page I/Os (assuming no overflow chains).
• Long overflow chains can develop and degrade performance.
– Why poor performance? Searches scan through the overflow chain linearly.
– Extendible and Linear Hashing: dynamic techniques to fix this.
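The static scheme can be sketched in a few lines. This is a minimal illustration, not the textbook's implementation: N, the constants in h(key) = a·key + b, and the page capacity are all assumed values chosen to make the overflow chain visible.

```python
# A minimal sketch of static hashing with overflow chaining.
# N, A, B, and PAGE_CAPACITY are illustrative choices, not tuned values.
N = 4              # number of primary bucket pages, fixed at creation
A, B = 31, 7       # constants for h(key) = a * key + b
PAGE_CAPACITY = 2  # data entries per page

def h(key):
    return A * key + B

# Each bucket is a list of pages; pages after the first form the overflow chain.
buckets = [[[]] for _ in range(N)]

def insert(key):
    pages = buckets[h(key) % N]
    if len(pages[-1]) == PAGE_CAPACITY:  # last page full -> add an overflow page
        pages.append([])
    pages[-1].append(key)

def search(key):
    # Scans the primary page and then the overflow chain linearly --
    # this linear scan is exactly what makes long chains slow.
    return any(key in page for page in buckets[h(key) % N])

for k in [5, 9, 13, 17, 21]:  # these all hash to the same bucket, so the chain grows
    insert(k)
print(search(13), search(6))  # True False
```

Note how five same-bucket inserts with a two-entry page capacity already produce a three-page chain; the dynamic schemes below exist to avoid exactly this.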
Extendible Hashing
• Simple solution (no overflow chains):
– When a bucket (primary page) becomes full, …
– Reorganize the file by doubling the # of buckets. Cost concern?
– High cost: rehashing all entries (reading and writing all pages) is expensive!
• How to reduce the high cost?
– Use a directory of pointers to buckets; double the # of buckets by doubling the directory, splitting just the bucket that overflowed!
– The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split.
– How to adjust the hash function? Before doubling the directory, h(r) → 0..N−1 buckets. After doubling the directory, h(r) → 0..2N−1.
Example
• The directory is an array of size 4.
• To find the bucket for r, take the last `global depth' # of bits of h(r).
– Example: if h(r) = 5, 5's binary is 101, so r is in the bucket pointed to by 01.
• Global depth: # of bits used for hashing directory entries.
• Local depth of a bucket: # of bits used for hashing within that bucket.
• When can the global depth differ from the local depth?
(Diagram: directory entries 00, 01, 10, 11 with global depth 2; Bucket A: 4*, 12*, 32*, 16*; Bucket B: 1*, 5*, 21*, 13*; Bucket C: 10*; Bucket D: 15*, 7*, 19*; each bucket has local depth 2.)
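Taking the last `global depth' bits of h(r) is just a bit mask. A one-line illustration (using the hash value directly, as the slides do):

```python
def bucket_index(hash_value, global_depth):
    # Keep only the last `global_depth` bits of the hash value.
    return hash_value & ((1 << global_depth) - 1)

# h(r) = 5 -> binary 101; with global depth 2, the last two bits are 01.
print(format(bucket_index(5, 2), '02b'))  # 01
```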
Insert 20 = 10100 (Causes Doubling)
(Diagram: before the insert, Bucket A (4*, 12*, 32*, 16*) is full and has local depth = global depth = 2. After the insert, the directory has entries 000–111 with global depth 3; Bucket A holds 32*, 16* and its `split image' A2 holds 4*, 12*, 20*, both with local depth 3; Buckets B, C, D keep local depth 2.)
Double the directory:
– Increment the global depth.
– Rehash bucket A.
– Increment the local depth. Why track local depth?
Insert 9 = 1001 (No Doubling)
(Diagram: 9* hashes to Bucket B (1*, 5*, 21*, 13*), which is full but has local depth 2 < global depth 3, so only Bucket B splits. B keeps 1*, 9*; its split image B2 gets 5*, 21*, 13*; both now have local depth 3. The directory is not doubled.)
Only split the bucket:
– Rehash bucket B.
Points to Note
• Global depth of directory: max # of bits needed to tell which bucket an entry belongs to.
• Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
• When does a bucket split cause directory doubling?
– Before the insert, the bucket is full & its local depth = global depth.
• The directory is doubled by copying it over and `fixing' the pointer to the split image page.
– You can do this only by using the least significant bits in the directory.
Directory Doubling
• Why use the least significant bits in the directory? It allows doubling via copying!
(Diagram: a directory of size 4 (00, 01, 10, 11, global depth 2) doubles to size 8 (000–111, global depth 3) by appending a copy of itself; only the pointers for the split buckets are then fixed up.)
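The whole insert path – directory lookup, doubling by copying, and splitting one bucket – fits in a compact sketch. This is my own illustration, not the textbook's code: it hashes keys by their own bits (identity hash) and assumes a four-entry page, matching the example slides.

```python
# A compact sketch of extendible hashing with least-significant-bit directory.
# Identity hash and a 4-entry page are assumptions matching the lecture example.
PAGE_CAPACITY = 4

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _bucket(self, key):
        # Use the last global_depth bits of the (identity) hash.
        return self.directory[key & ((1 << self.global_depth) - 1)]

    def insert(self, key):
        bucket = self._bucket(key)
        if len(bucket.entries) < PAGE_CAPACITY:
            bucket.entries.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # Double the directory by copying it over: with the LSB scheme,
            # the new half starts out pointing at the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        self._split(bucket)
        self.insert(key)  # retry; may split again on skewed data

    def _split(self, bucket):
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)        # the `split image'
        bit = 1 << (bucket.local_depth - 1)
        image.entries = [k for k in bucket.entries if k & bit]
        bucket.entries = [k for k in bucket.entries if not k & bit]
        # Fix directory pointers: entries whose new bit is set now point
        # at the split image instead of the old bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image

eh = ExtendibleHash()
for k in [32, 16, 4, 12, 1, 5, 21, 13, 10, 15, 7, 19, 20]:
    eh.insert(k)
print(eh.global_depth)  # 3 -- inserting 20 doubles the directory, as in the slides
```

Replaying the slides' keys ends with global depth 3, Bucket A = {32, 16} and its split image = {4, 12, 20}, exactly the state shown after "Insert 20".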
Comments on Extendible
Hashing
• If the directory fits in memory, an equality search is answered with one disk access; otherwise two.
• Problems with extendible hashing:
– If the distribution of hash values is skewed (concentrated on a few buckets), the directory can grow large.
– Can you come up with one insertion leading to multiple splits?
• Delete: if removing a data entry makes a bucket empty, the bucket can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.
Skewed data distribution (multiple splits)
• Assume each bucket holds one data entry.
• Insert 2 (binary 10): how many splits?
• Insert 16 (binary 10000): how many splits?
(Diagram: directory of size 2 with global depth 1; data entries 0* and 8*, local depths 1.)
Delete 10*
(Diagram: before, a directory of size 4 (global depth 2) with buckets holding 1*, 5*, 21*, 13*; 32*, 16*; 10*; 15*, 7*, 19*; and 4*, 12*. Deleting 10* empties its bucket, which is merged with its split image; the merged bucket's local depth drops to 1 while the directory remains size 4.)
Delete 15*, 7*, 19*
(Diagram: deleting 15*, 7*, and 19* empties their bucket, which is merged with its split image. Once every directory element points to the same bucket as its split image, the directory is halved to size 2 with global depth 1, leaving Bucket A: 1*, 5*, 21*, 13* and Bucket B: 32*, 16*, 4*, 12*.)
Linear Hashing (LH)
• This is another dynamic hashing scheme, an
alternative to Extendible Hashing.
– LH fixes the problem of long overflow chains (in static
hashing) without using a directory (in extendible hashing).
• Basic idea: use a family of hash functions h0, h1, h2, …
– Each function's range is twice that of its predecessor.
– Pages are split when overflows occur, but not necessarily the page that overflows.
– Splitting occurs in turn, in round-robin fashion.
– When all the pages at one level (the current hash function) have been split, a new level is applied.
– Splitting occurs gradually.
Levels of Linear Hashing
• Initial Stage.
– The initial level distributes entries into N0 buckets.
– Call the hash function to perform this h0.
• Splitting buckets.
– If a bucket overflows, its primary page is chained to an overflow page (same as in static hashing).
– Also when a bucket overflows, some bucket is split.
• The first bucket to be split is the first bucket in the file (not
necessarily the bucket that overflows).
• The next bucket to be split is the second bucket in the file … and so on until the N0th has been split.
• When buckets are split their entries (including those in overflow
pages) are distributed using h1.
– To access split buckets the next level hash function (h1) is
applied.
Levels of Linear Hashing (Cont.)
• Level progression:
– Once all Ni buckets of the current level (i) are split, the hash function hi is replaced by hi+1.
– The splitting process starts again at the first bucket, and hi+2 is applied to find entries in split buckets.
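The level progression above can be sketched concisely. This assumes the common choice h_i(key) = key mod (N0 · 2^i), which the slides never fix explicitly; the names and the page capacity (three entries, as in the example that follows) are mine.

```python
# A sketch of linear hashing: h_i(key) = key % (N0 * 2**i), with
# round-robin splitting driven by `next`. Overflow pages are modeled
# implicitly by letting a bucket list grow past PAGE_CAPACITY.
N0 = 4
PAGE_CAPACITY = 3  # "three entries fit on a page", as in the lecture example

class LinearHash:
    def __init__(self):
        self.level = 0
        self.next = 0  # next bucket to split (round-robin)
        self.buckets = [[] for _ in range(N0)]

    def _h(self, key, lvl):
        return key % (N0 * 2 ** lvl)

    def _bucket_of(self, key):
        i = self._h(key, self.level)
        if i < self.next:                    # bucket i was already split this round
            i = self._h(key, self.level + 1)  # so use the next-level hash
        return i

    def insert(self, key):
        b = self.buckets[self._bucket_of(key)]
        b.append(key)
        if len(b) > PAGE_CAPACITY:           # overflow -> split bucket `next`
            self._split()

    def _split(self):
        old = self.buckets[self.next]
        self.buckets.append([])              # new bucket at next + N0 * 2**level
        self.next += 1
        entries, old[:] = old[:], []
        for k in entries:                    # redistribute with h_{level+1}
            self.buckets[self._h(k, self.level + 1)].append(k)
        if self.next == N0 * 2 ** self.level:
            self.level += 1                  # all buckets split: new level
            self.next = 0

lh = LinearHash()
for k in [64, 36, 1, 17, 5, 6, 31, 15, 9]:   # the example's initial state, then 9
    lh.insert(k)
print(lh.next, lh.buckets[0], lh.buckets[4])  # 1 [64] [36]
```

Inserting 9 overflows bucket 01, yet it is bucket 00 (the one `next` points to) that splits, with 64 and 36 redistributed by h1 – the behavior the next two example slides walk through.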
Linear Hashing Example
• Initially, the index level equals 0 and N0 equals 4 (three entries fit on a page).
• h0 maps index entries to one of four buckets.
• h0 is used and no buckets have been split.
• Now consider what happens when 9 (1001) is inserted (it will not fit in the second bucket).
• Note that `next' indicates which bucket is to split next (round-robin).
(Diagram: next points to bucket 00; the h0 buckets hold 00: 64*, 36*; 01: 1*, 17*, 5*; 10: 6*; 11: 31*, 15*.)
Linear Hashing Example 2
• An overflow page is chained to the
primary page to contain the inserted value.
• Note that the split page is not necessarily the page that overflowed – round robin.
• If h0 maps a value from zero to next –
1 (just the first page in this case), h1
must be used to insert the new entry.
• Note how the new page falls naturally
into the sequence as the fifth page.
(Diagram: after inserting 9*, bucket 01 gets an overflow page holding 9*; the bucket indicated by next (bucket 00) is split using h1, redistributing 64* to bucket 000 and 36* to the new fifth bucket 100; buckets 01, 10, 11 still use h0, and next advances to bucket 01.)
• The page indicated by next
is split (the first one).
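The lookup rule on this slide – when h0 maps a value below `next', the bucket has already been split, so h1 must be used – can be checked in a few lines. N0 = 4, next = 1, and h_i(key) = key mod (N0 · 2^i) are taken from the example state above; the function name is mine.

```python
# State after the split shown above: one bucket split, still on level 0.
N0, next_split, level = 4, 1, 0

def bucket_of(key):
    i = key % (N0 * 2 ** level)            # h0
    if i < next_split:                     # bucket i was already split,
        i = key % (N0 * 2 ** (level + 1))  # so use h1 instead
    return i

print(bucket_of(64), bucket_of(36), bucket_of(9))  # 0 4 1
```

64 and 36 both map to 0 under h0, but since bucket 0 has been split, h1 sends 36 to the new bucket 100; 9 still uses h0 and lands in bucket 01's chain.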
Linear Hashing
• Assume inserts of 8, 7, 18, 14, 11 (1st split), 32, 16 (2nd split), 10, 13, 23 (3rd split).
• After the 2nd split the base level is 1 (N1 = 8); use h1.
• Subsequent splits will use h2 for inserts between the first bucket and next − 1.
(Diagram: bucket contents after each split, with buckets labeled h0 or h1 and `next' advancing round-robin.)
LH Described as a Variant of EH
• The two schemes are similar:
– Begin with an EH index where the directory has N elements.
– Use overflow pages; split buckets round-robin.
– The first split is at bucket 0. (Imagine the directory being doubled at this point.) But elements <1, N+1>, <2, N+2>, … are the same, so we need only create directory element N, which now differs from element 0.
• When bucket 1 splits, create directory element N+1, etc.