Database Systems
( 資料庫系統 )
November 8, 2004
Lecture #9
Announcement
• Midterm exam: November 20 (Sat), 2:30 PM, in CSIE 101/103
• Assignment #6 is available on the course homepage.
– It is due on 11/24
– It is very difficult
– We suggest you do it before the midterm exam
• Assignment #7 will be available on the course homepage
later this afternoon.
– It is due 11/16 (next Tuesday).
– It is easy.
Cool Ubicomp Project
Counter Intelligence (MIT)
• Smart kitchen & kitchen wares
• Talking Spoon
– Salty, sweet, hot?
• Talking Cutlery
– Bacteria?
• Smart fridge & counters
– RFID tags
– Tracking food from fridge to your mouth
Hash-Based Indexing
Introduction
• Recall that hash-based indexes are best for equality selections.
– Cannot support range searches.
– Equality selections are useful for join operations.
• Static and dynamic hashing techniques exist
– Trade-offs similar to ISAM vs. B+ trees.
– Static hashing technique
– Two dynamic hashing techniques
• Extendible Hashing
• Linear Hashing
Static Hashing
• # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
• h(k) mod N = bucket to which the data entry with key k belongs. (N = # of buckets)
[Figure: h hashes the key, and h(key) mod N selects one of the N primary bucket pages (0 to N-1), each of which may chain to overflow pages.]
Static Hashing (Contd.)
• Buckets contain data entries.
• Hash function works on search key field of record r.
– Ideally uniformly distribute values over range 0 ... N-1
– h(key) = (a * key + b) usually works well.
– a and b are constants; lots known about how to tune h.
• Cost for insert/delete/search:
– two/two/one disk page I/Os (assuming no overflow chains).
• Long overflow chains can develop and degrade performance.
– Why poor performance? Searches must scan through overflow chains linearly.
– Extendible and Linear Hashing: dynamic techniques to fix this problem.
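To make the costs concrete, here is a minimal Python sketch of static hashing with overflow chains; the page capacity and the constants a and b are illustrative assumptions, not values from the lecture:

PAGE_CAPACITY = 4      # data entries per page (assumed)
A, B = 37, 11          # h(key) = a * key + b, then take mod N

class StaticHashFile:
    def __init__(self, n_buckets):
        self.n = n_buckets
        # Each bucket is a list of pages; pages after the first are overflow pages.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _h(self, key):
        return (A * key + B) % self.n

    def insert(self, key):
        pages = self.buckets[self._h(key)]
        if len(pages[-1]) == PAGE_CAPACITY:   # last page of the chain is full
            pages.append([])                  # chain a new overflow page
        pages[-1].append(key)

    def search(self, key):
        # Cost grows with the length of the overflow chain for this bucket.
        return any(key in page for page in self.buckets[self._h(key)])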
Simple Solution
• Avoid creating overflow pages:
– When a bucket (primary page) becomes full, double #
of buckets & re-organize the file.
• What’s wrong with this simple solution?
– High cost: re-organizing the file means reading and writing every page, which is expensive.
Extendible Hashing
• The basic idea (another level of abstraction):
– Use a directory of pointers to buckets (hash to a directory entry).
– Double the # of buckets by doubling the directory.
– Split just the bucket that overflowed!
• The directory is much smaller than the file, so doubling it is much cheaper.
• Only one page of data entries is split.
– The page that overflows is rehashed into two pages.
• The trick lies in how the hash function is adjusted!
– Before doubling the directory, h(r) maps to buckets 0..N-1.
– After doubling the directory, h(r) maps to buckets 0..2N-1.
Example
• Directory is an array of size 4.
• To find the bucket for r, take the last (global depth) bits of h(r).
– Example: if h(r) = 5 = binary 101, it is in the bucket pointed to by directory entry 01.
• Global depth: # of bits used to index the directory.
• Local depth of a bucket: # of bits used to hash entries into that bucket.
• When can global depth be different from local depth?
[Figure: directory with global depth 2; entries 00, 01, 10, 11 point to Buckets A-D, each with local depth 2. Bucket A holds 4*, 12*, 32*, 16*; Bucket B holds 1*, 5*, 21*, 13*; Bucket C holds 10*; Bucket D holds 15*, 7*, 19*.]
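As a small sketch of the lookup rule just described (assuming the directory is simply a Python list indexed by the last global-depth bits of h(r)):

def find_bucket(directory, global_depth, hash_value):
    # Use the last `global_depth` bits of the hash value as the directory index.
    return directory[hash_value & ((1 << global_depth) - 1)]

# Slide example: h(r) = 5 = binary 101 and global depth 2 give index 0b01,
# i.e. find_bucket(directory, 2, 5) returns the bucket pointed to by entry 01.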
Insert h(r) = 20 (Causes Doubling)
[Figure: inserting 20* overflows Bucket A (32*, 16*, 4*, 12*). Since local depth = global depth = 2, the directory doubles to 8 entries (global depth 3). Bucket A keeps 32*, 16* and its split image A2 gets 4*, 12*, 20*, both with local depth 3 (4 = 0000 0100, 12 = 0000 1100, 20 = 0001 0100).]
Extendible Hashing Insert
• Check if the bucket is full.
– If no, done!
• Otherwise, check whether local depth = global depth.
– If no: rehash the entries and distribute them into two buckets, and increment the local depth.
– If yes: double the directory, then rehash the entries and distribute them into two buckets.
• The directory is doubled by copying it over and `fixing’ the pointer to the split image page.
– You can do this only by using the least significant bits.
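A minimal Python sketch of this insert procedure, assuming a small fixed page capacity and a plain list as the directory (both illustrative choices, not from the lecture):

PAGE_CAPACITY = 4   # assumed bucket (page) capacity

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def insert(self, key):
        bucket = self.directory[key & ((1 << self.global_depth) - 1)]
        if len(bucket.keys) < PAGE_CAPACITY:
            bucket.keys.append(key)               # bucket not full: done
            return
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory * 2   # double the directory by copying (LSB trick)
            self.global_depth += 1
        # Split the overflowing bucket: create its split image and fix pointers.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (bucket.local_depth - 1)) & 1:
                self.directory[i] = image
        # Rehash the old entries into the two buckets, then retry the new key.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[k & ((1 << self.global_depth) - 1)].keys.append(k)
        self.insert(key)    # may trigger another split if the data are skewed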
Insert 9
[Figure: inserting 9* into Bucket B (1*, 5*, 21*, 13*). Local depth 2 < global depth 3, so the directory is not doubled: Bucket B keeps 1*, 9* and its split image B2 gets 5*, 21*, 13*, both with local depth 3 (1 = 0000 0001, 9 = 0000 1001, 5 = 0000 0101, 21 = 0001 0101, 13 = 0000 1101).]
Directory Doubling
• Why use the least significant bits in the directory?
– It allows doubling the directory via copying!
[Figure: a directory of size 4 doubled to size 8 for data entry 6* (6 = 110). With least significant bits the new directory is simply a copy of the old one; with most significant bits the entries would have to be interleaved.]
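A tiny runnable illustration of doubling via copying with least-significant-bit indexing (the bucket names are placeholders):

# Old directory with global depth 2, indexed by the last 2 bits of h(r).
directory = ["A", "B", "C", "D"]
global_depth = 2

# With LSB indexing, index i and index i + 2**global_depth must point to the
# same bucket immediately after doubling, so the new directory is a plain copy.
directory = directory * 2        # ["A", "B", "C", "D", "A", "B", "C", "D"]
global_depth = 3
# With MSB indexing the new entries would have to be interleaved,
# so doubling could not be done by copying.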
Comments on Extendible
Hashing
• If the directory fits in memory, an equality search is answered with one disk access; else two.
– 100MB file, 100 bytes/rec: you have 1M data entries.
– A 4K page (a bucket) can hold 40 data entries, so you need about 25,000 directory elements; chances are high that the directory will fit in memory (a quick arithmetic check appears after this slide).
– If the distribution of hash values is skewed (concentrates on a few
buckets), directory can grow large.
• Delete: if removing a data entry makes a bucket empty, the bucket can be merged with its `split image’. If each directory element points to the same bucket as its split image, the directory can be halved.
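A quick arithmetic check of the numbers above (a sketch; it assumes a data entry is roughly the 100-byte record):

file_size   = 100 * 2**20      # 100 MB file
record_size = 100              # bytes per data entry
page_size   = 4 * 2**10        # 4 KB bucket page

n_entries        = file_size // record_size       # ~1M data entries
entries_per_page = page_size // record_size       # ~40 entries per bucket page
n_directory      = n_entries // entries_per_page  # roughly the "about 25,000" on the slide
print(n_entries, entries_per_page, n_directory)   # 1048576 40 26214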
Linear Hashing (LH)
• This is another dynamic hashing scheme, an alternative to
Extendible Hashing.
– LH fixes the problem of long overflow chains (from static hashing) without using a directory (as in extendible hashing).
• Basic idea: use a family of hash functions h0, h1, h2, ...
– Each function’s range is twice that of its predecessor.
– Pages are split when overflows occur – but not necessarily the overflowing page. (Splitting occurs in turn, in a round robin fashion.)
– Buckets are added gradually (one bucket at a time).
– When all the pages at one level (the current hash function) have been split, a new level is applied.
Levels of Linear Hashing
• Initial Stage.
– The initial level distributes entries into N0 buckets.
– Call the hash function to perform this h0.
• Splitting buckets.
– If a bucket overflows, its primary page is chained to an overflow page (same as in static hashing).
– Also when a bucket overflows, some bucket is split.
• The first bucket to be split is the first bucket in the file (not necessarily the bucket that overflows).
• The next bucket to be split is the second bucket in the file … and so on, until the N0-th bucket has been split.
• When buckets are split their entries (including those in overflow pages) are distributed using h1.
– To access split buckets the next level hash function (h1) is applied.
Levels of Linear Hashing
(Cont.)
• Level progression:
– Once all Ni buckets of the current level (i) are split, the hash function hi is replaced by hi+1.
– The splitting process starts again at the first bucket, and hi+2 is applied to buckets split at the new level.
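A minimal Python sketch of the split and level mechanics described above; the page capacity and the split trigger (split on any overflow) are illustrative assumptions:

PAGE_CAPACITY = 4     # assumed entries per primary page

class LinearHash:
    def __init__(self, n0=4):
        self.n0 = n0        # number of buckets at level 0
        self.level = 0      # current level i: h_i(key) = key % (n0 * 2**i)
        self.next = 0       # next bucket to split (round robin)
        self.buckets = [[] for _ in range(n0)]

    def _h(self, key, lvl):
        return key % (self.n0 * 2 ** lvl)

    def _bucket_of(self, key):
        b = self._h(key, self.level)
        if b < self.next:                     # bucket already split at this level
            b = self._h(key, self.level + 1)  # use the next-level hash function
        return b

    def insert(self, key):
        bucket = self.buckets[self._bucket_of(key)]
        overflow = len(bucket) >= PAGE_CAPACITY   # would need an overflow page
        bucket.append(key)                        # (overflow chaining elided in this sketch)
        if overflow:
            self._split()

    def _split(self):
        # Split the bucket pointed to by `next`, not necessarily the one that overflowed.
        old = self.buckets[self.next]
        self.buckets.append([])        # the new bucket goes at the end of the file
        self.buckets[self.next] = []
        for k in old:                  # redistribute entries with the next-level hash function
            self.buckets[self._h(k, self.level + 1)].append(k)
        self.next += 1
        if self.next == self.n0 * 2 ** self.level:   # all buckets of this level split
            self.level += 1
            self.next = 0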
Linear Hashing Example
• Initially, the index level equals 0 and N0 equals 4 (three entries fit on a page).
• h0 maps index entries to one of four buckets.
• h0 is used and no buckets have been split.
• Now consider what happens when 9 (1001) is inserted (which will not fit in the second bucket).
• Note that next indicates which bucket is to split next (round robin).
[Figure: four buckets 00-11 under h0, with next pointing to the first bucket; bucket 00 holds 64, 36; bucket 01 holds 1, 17, 5; bucket 10 holds 6; bucket 11 holds 31, 15.]
Linear Hashing Example 2
• An overflow page is chained to the primary page to contain the inserted value (9).
• If h0 maps a value to a bucket between zero and next – 1 (just the first bucket in this case), h1 must be used to insert the new entry.
• The page indicated by next is split (the first one), and next is incremented.
• Note how the new page falls naturally into the sequence as the fifth page.
[Figure: after the split, bucket 000 (h1) holds 64; bucket 01 (h0), now indicated by next, holds 1, 17, 5 plus 9 on an overflow page; bucket 10 (h0) holds 6; bucket 11 (h0) holds 31, 15; the new fifth bucket 100 (h1) holds 36.]
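A small sketch of this lookup rule, assuming h_i(key) = key mod (N0 * 2**i); the traced values in the comments follow the example above:

def lookup_bucket(key, level, next_split, n0=4):
    # Apply h_level first; if that bucket has already been split at this level
    # (its index is below next), apply h_{level+1} instead.
    b = key % (n0 * 2 ** level)
    if b < next_split:
        b = key % (n0 * 2 ** (level + 1))
    return b

# After the first split (level 0, next = 1):
#   key 64 -> h0 gives bucket 0, already split, so h1 is used: 64 % 8 = 0
#   key 36 -> h1 is used as well: 36 % 8 = 4 (the new fifth bucket)
#   key  9 -> h0 gives bucket 1, not yet split, so it stays in bucket 1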
Linear Hashing Example 3
• Assume inserts of 8, 7, 18, 14, 11, 32, 16, 10, 13, 23.
• After the 2nd split the base level is 1 (N1 = 8), so h1 is used.
• Subsequent splits will use h2 for inserts between the first bucket and next – 1.
[Figure: bucket contents after these inserts, with next advancing round robin (positions marked next1, next2, next3); buckets that have already been split are labeled h1 and unsplit buckets h0.]
Linear Hashing vs. Extendible Hashing
• What is the similarity?
– One round of round-robin splitting in LH corresponds to a one-step doubling of the directory in EH.
• What are the differences?
– Directory overhead vs. none
– Overflow pages vs. none
– Gradual splitting (of pages) vs. one-step doubling (of
directory)
– Pages are allocated in order vs. not in order
– Splitting (possibly) non-overflowing pages (LH) vs. splitting the overflowing page (EH)
Summary
• Hash-based indexes: best for equality searches, cannot
support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a
full bucket when a new data entry is to be added to it.
(Duplicates may require overflow pages.)
– Directory to keep track of buckets, doubles periodically.
– Can get large with skewed data; additional I/O if this does not fit
in main memory.
– A skewed data distribution is one in which the hash values of data entries are not uniformly distributed!
Summary (Contd.)
• Linear Hashing avoids directory by splitting buckets
round-robin, and using overflow pages.
– Overflow pages not likely to be long.
– Space utilization could be lower than Extendible Hashing, since
splits not concentrated on `dense’ data areas.
• Can tune the criterion for triggering splits to trade off slightly longer chains for better space utilization.