• 沒有找到結果。

Database Dystems (資料庫系統) Lecture #11

N/A
N/A
Protected

Academic year: 2021

Share "Database Dystems (資料庫系統) Lecture #11"

Copied!
27
0
0

加載中.... (立即查看全文)

全文

(1)

Database Systems

( 資料庫系統 )

December 12, 2005 Lecture #11

(2)

Announcement

• Next week reading: Chapter 14

• Assignment #5 is out on the course

homepage.

– It is a written assignment.

– It is due 12/19 (next Monday).

• Practicum #3 is also out on the

course homepage.

(3)

To Frame or not to Frame?

Architecture, Psychology, &

(4)

External Sorting

(5)

Why learn sorting again?

• O (n*n): bubble, insertion, selection, …

sorts

• O (n log n): heap, merge, quick, … sorts • Sorting huge dataset (say 10 GB)

• CPU time complexity may mean little on

practical systems

(6)

“External” Sorting Defined

• Refer to sorting methods when the data is too l arge to fit in main memory.

– E.g., sort 10 GB of data in 100 MB of main memory. • During sorting, some intermediate steps may req

uire data to be stored externally on disk.

• Disk I/O cost is much greater than CPU instruct ion cost

– Average disk page I/O cost: 10 ms vs. 3.8 GHz CPU clo ck: 0.25 ns.

– Minimize the disk I/Os (rather than number of compari sons).

(7)

Outline (easy chapter)

• Why does a DMBS sort data? • Simple 2-way merge sort

• Generalize B-way merge sort • Optimization

– Replacement sort

– Blocked I/O optimization – Double buffering

• Using an existing B+ tree index vs.

(8)

• Users may want answers to query in some

order

– E.g., students sorted by increasing age

• Sorting is the first step in bulk loading a B+

tree index

• Sorting is used for eliminating duplicate

copies

• Join requires a sorting step.

– Sort-join algorithm requires sorting.

When does a DBMS sort data?

(9)

Bulk Loading of a B+ Tree

• Step 1: Sort data entries. Insert pointer to first

(leaf) page in a new (root) page.

• Step 2: Build Index entries for leaf pages.

3* 4* 6* 9* 10*11* 12*13* 20*22* 23*31* 35*36* 38*41* 44*

Sorted pages of data entries; not yet in B+ tree Root

(10)

Example of Sort-Merge Join

(11)

A Simple Two-Way Merge Sort

• It uses only 3 buffer pages.

– Basic idea is divide and conquer.

– Sort smaller runs

and merge them into bigger runs.

• Pass 0: read each

page, sort records in each page, and write the page out to disk. (1 buffer page is used) Input file 1-page runs 2-page runs 4-page runs 8-page runs PASS 0 PASS 1 PASS 2 PASS 3 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2, 5,6 6 4,9 7,8 1,3 2 2,3 4,6 4,7 8,9 5,61,3 2 2,3 4,4 6,7 8,9 1,2 3,5 6 1,2 2,3 3,4 4,5 6,6 7,8

(12)

A Simple Two-Way Merge Sort

• Pass 1: read two p ages, merge them, and write them out to disk. (3 buffe r pages are used)

• Pass 2-3: repeat a bove step till one sorted 8-page run.

• Each run is define d as a sorted subf ile. Input file 1-page runs 2-page runs 4-page runs 8-page runs PASS 0 PASS 1 PASS 2 PASS 3 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2, 5,6 6 4,9 7,8 1,3 2 2,3 4,6 4,7 8,9 5,61,3 2 2,3 4,4 6,7 8,9 1,2 3,5 6 1,2 2,3 3,4 4,5 6,6

(13)

2-Way Merge Sort • Say the number of pages in a file is 2k:

– Pass 0 produces 2k sorted runs of one page each

– Pass 1 produces 2k-1 sorted runs of two pages each

– Pass 2 produces 2k-2 sorted runs of four pages each

– Pass k produces one sorted runs of 2k pages.

• Each pass requires read + write each page in file: 2*N

• For a N pages file,

– the number of passes = ceiling ( log2 N) + 1

• So total cost (disk I/Os) is

(14)

General External Merge Sort More than 3 buffer pages. How can we

utilize them?

• To sort a file with N pages using B buffer

pages:

Pass 0: use B buffer pages. Produce N / B

sorted runs of B pages each.

Pass 1..k: use B-1 buffer pages to merge B-1

runs, and use 1 buffer page for output.

B Main memory INPUT 1 INPUT B-1 OUTPUT Disk Disk INPUT 2 . . . . . . . . .

(15)

15

General External Merge Sort (B=4) 8,7 6,2 9,4 3,4 5,6 3,12,1015,3 16,513,819,132,8 8,9 4,4 6,7 2,3 1,2 3,3 5,610,15 1,5 8,813,1619,32 1,1

Pass 0: Read four unsorted pages, sort

them, and write them out. Produce 3 runs of 4 pages each.

Pass 1: Read three pages, one page from

each of 3 runs, merge them, and write them out. Produce 1 run of 12 pages.

2,2 3,3 …

merging

(16)

Cost of External Merge Sort

• # of passes: 1 + ceiling(log B-1 ceiling(N/B))

• Disk I/O Cost = 2*N*(# of passes)

• E.g., with 5 buffer pages, to sort 108 page file:

– Pass 0: ceiling(108/5) = 22 sorted runs of length 5 pag

es each (last run is only 3 pages)

– Pass 1: ceiling(22/4) = 6 sorted runs of length 20 page

s each (last run is only 8 pages)

– Pass 2: 2 sorted runs, of length 80 pages and 28 pages – Pass 3: Sorted file of 108 pages

– # of passes = 1 + ceiling (log 4 ceiling(108/5)) = 4 – Disk I/O costs = 2*108*4 = 864

(17)
(18)

Further Optimize Merge Sort

• Opportunity #1: sorting the first sorted-runs at pass 0

– First sorted run has length B.

– Is it possible to create bigger length in the first sorted

runs? How?

• Opportunity #2: consider block I/Os

– Block I/O: reading & writing consecutive blocks – Can the merge passes use block I/O?

• Opportunity #3: minimize CPU/disk idle time – How to keep both CPU & disks busy at the

(19)

Replacement Sort (Optimize Merge Sort)

• Pass 0 can output approx. 2B sorted pages on average. How?

• Divide B buffer pages into 3 parts:

– Current set buffer (B-2): unsorted or unmerged pages. – Input buffer (1 page): one unsorted page.

– Output buffer (1 page): output sorted page.

• Algorithm:

– Pick the tuple in the current set with the smallest k value > large st value in output buffer.

– Append k to output buffer.

– This creates a hole in current set, so move a tuple from input buff er to current set buffer.

– When the input buffer is empty of tuples, read in a new unsorted pa ge.

(20)

Replacement Sort Example (B=4) 8,7 6,2 9,4 3,4 5,6 3,1 2,1015,316,513,8 19,132,8 3,4 6,2 9,4 Current Set Buffer Input Buffer Output Buffer 3,4 6,2 9,4 3,4 6,9 4 2 4,4 6,9 2,3 8,7 8,4 6,9 4 7 2,3 On disk

When do you start a new run?

(21)

Minimizing I/O Cost vs. Number of I/Os

• So far, the cost metric is the number of disk

I/Os.

• This is inaccurate for two reasons:

– (1) Blocked I/O is a much cheaper (per I/O request) than equal number of individual I/O requests.

• Blocked I/O: read/write several consecutive pages at

the same time.

– (2) CPU cost may be significant.

• Keep CPU busy while we wait for disk I/Os. • Double Buffering Technique.

(22)

Blocked I/O

• Blocked access: read/write b pages as a unit.

• Assume the buffer pool has B pages, and file ha s N pages.

• Look at cost of external merge-sort (with repla cement optimization) using Blocked I/O:

– Blocked I/O has little affect on pass 0.

•Pass 0 produces initial N’ (= N/2B) runs of length 2B pag es.

– Pass 1..k, we can merge F = B/b – 1 runs.

– The total number of passes (to create one run of N p ages) is 1 + logF (N’).

(23)

Double Buffering

• Keep CPU busy, minimizes waiting for I/O requests. – While the CPU is working on the current run, start to p

refetch data for the next run (called shadow blocks). • Potentially, more passes; in practice, most files

still sorted in 2-3 passes.

OUTPUT OUTPUT' Disk Disk INPUT 1 INPUT k INPUT 2 INPUT 1' INPUT 2' INPUT k' block sizeb

(24)

Using B+ Trees for Sorting

• Assumption: Table to be sorted has B+ tree index on sorting column(s).

• Idea: Can retrieve records in order by traversing leaf pages.

• Is this a good idea?

• Cases to consider:

– B+ tree is clustered

(25)

Clustered B+ Tree Used for Sorting

• Cost: root to the

left-most leaf, then retrieve all leaf

pages (Alternative 1)

• If Alternative 2 is

used? Additional cost of retrieving data records: each page fetched just once.

• Cost better than

external sorting? (Directs search) Data Records Index Data Entries ("Sequence set")

(26)

Unclustered B+ Tree Used for

Sorting

• Alternative (2) for data entries;

each data entry contains rid of a

data record. In general, one I/O

per data record.

(Directs search)Index

Data Entries

(27)

27

External Sorting vs. Unclustered In dex

 p: # of records per page

參考文獻

相關文件

In addition, he says, universities should require secondary school teachers to provide more support for their students to better prepare them for university life and graduate

– Write special code to protect against system crashes – Optimize applications for efficient access and query – May often rewrite applications. • Easier to buy a DBMS to handle

– File and index layers organize records on files, and manage the indexing data

Lecture 4: Soft-Margin Support Vector Machine allow some margin violations ξ n while penalizing them by C; equivalent to upper-bounding α n by C Lecture 5: Kernel Logistic

1 Embedding Numerous Features: Kernel Models Lecture 1: Linear Support Vector Machine.. linear SVM: more robust and solvable with quadratic programming Lecture 2: Dual Support

A good way to lead students into reading poetry is to teach them how to write their own poems.. The boys love the musical quality of

Lecture 1: Introduction and overview of supergravity Lecture 2: Conditions for unbroken supersymmetry Lecture 3: BPS black holes and branes. Lecture 4: The LLM bubbling

Introduction to the following free CPS Apps/Platforms and illustration will be given to how to adopt them in carrying out classroom activities and collaborative learning.