Database Systems
( 資料庫系統 )
December 13, 2004
Chapter 15
Announcement
• Assignment #9 is due next Thursday.
Cool Ubicomp Project
Hyperdragging (Sony, 1999)
• Not enough working surface on your computer
screen .…
• No shared working surface when collaborating
with other people ….
A “Typical” Query Optimizer
How does a query optimizer work in
general?
• Decompose a SQL query (can have nested queries) into
query blocks
(without nested queries).
• Translate each query block into
relational algebra
expressions
.
• Optimize the relational algebra expression, one query
block at a time:
– Enumerate a
subset
of possible evaluation plans (also call
explore the
plan space
)
– Estimate the cost (disk I/Os) of each explored plan using
system
catalogs and statistics
Decompose a Query into Query
Blocks (Example)
SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid=R.sid AND R.bid=B.bid AND B.color=‘red’ AND S.rating = (SELECT MAX(S2.rating)
FROM Sailors S2)
GROUP BY S.sid
HAVING COUNT(*) > 1 SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B WHERE S.sid=R.sid AND R.bid=B.bid AND B.color=‘red’
AND S.rating = Reference to the nested block GROUP BY S.sid SELECT MAX(S2.rating) FROM Sailors S2
Nested block
Outer block
Optimizing A Query Block
• Query blocks are optimized one block at a time.
– Nested blocks are usually treated as calls to a subroutine, made
once per tuple in the outer block.
• This is an over-simplification, but good enough for now.
• To estimate I/O cost, the optimizer estimates the size of
(intermediate) results.
–
System catalogs about the lengths of (
projected
) fields
–
Statistics about referenced relationships (
file size & # tuples
)
–
Available
access methods
(
indexes
&
selection conditions
), for e
Translate Query Block into Relation
al Algebra Expr (1)
πS.sid,MIN(R.day) (
HAVING COUNT(*)>2
GROUP BY S.sid (
σ S.sid=R.sid AND R.bid=B.bid
AND B.color=‘red’
AND S.rating=value_from_nested_block ( SailorsχReserves χBoats))))
SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B WHERE S.sid=R.sid AND R.bid=B.bid AND B.color=‘red’
AND S.rating = Reference to the nested block
GROUP BY S.sid
HAVING COUNT(*) > 1
• Assume that GRO
UP BY and HAVIN
G are also operator
s.
Translate Query
Block (2)
• Try to simplify query block fur
ther into σπχ expression.
– Why do this? Easier to find equivalent σπχ expressions (alternative plans), and they may have cheaper costs.
• How about GROUP BY and
HAVING?
– They are carried out after the result of σπχ.
– Add attributes specified in G ROUP BY and HAVING into projection list.
π
S.sid,MIN(R.day)(
HAVING COUNT(*)>2
GROUP BY S.sid (
σπχ expression )
π
S.sid,R.day(
σ
S.sid=R.sidAND R.bid=B.bid AND B.color=‘red’
AND S.rating= value_from_nested_block (
Sailors χ Reserves χ Boats))))))
πS.sid,MIN(R.day) (
HAVING COUNT(*)>2
GROUP BY S.sid (
σ S.sid=R.sid AND R.bid=B.bid
AND B.color=‘red’
AND S.rating=value_from_nested_block ( SailorsχReserves χBoats))))
Relational Algebra Tree
• Represent a plan, which is a relational algebra (RA)
expression, as a
RA tree
.
Reserves Sailors sid=sidσ
bid=100 ^ rating > 5 sname SELECTS.sname
FROM
Reserves R, Sailors S
WHERER.sid=S.sid
ANDEstimate Cost of a Plan
• For each enumerated plan, estimate its cost:
– Each node in the tree involves a relational operator. We must
estimate the cost of evaluating a relational operator.
• Size of inputs (#pages), table statistics (selection conditions),
available indexes, and chosen algorithms for evaluating operators (Chapter 14).
– For each node in the tree, we need to estimate the
size
of the
results and whether the results are
sorted
or not.
• Since the results are inputs to the upper node, they are used to estimate the cost of the upper node (operator).
Notes on Query Optimizer
• Cost estimation is only an approximation.
– Consider the cost of Disk I/Os.
• Plan Space:
– Too large -> too many possible plans to enumerate and too expe
nsive to estimate the costs for all of them, must be restricted.
–
Consider only the space of
left-deep plans
. Why?
• Left-deep plans allow output of each operator to be pipelined into the next (parent) operator without physically storing it in a temporary rela tion.
Left Deep Join Trees
• Restrict the plan search space to only left-deep join trees.
–
As the number of joins increases, the number of alternative plans
grows rapidly; we need to restrict the search space.
–
Left-deep trees allow us to generate all
fully pipelined plans
(if we
choose so).
• Intermediate results not written to temporary files.
• Not all left-deep trees are fully pipelined, depending on the
choice of join algorithm (e.g., Sort-Merge join).
B A C D C D B A
⋈
⋈
⋈
⋈
⋈
⋈
Estimating Result Sizes
• How to estimate the size of result by an operator on given input
s?
– Use information from system catalogs and statistics.
– For each term, find tuple reduction factors (# expected input tuples / # expected qualified tuples).
SELECT
attribute list
FROMrelation list
WHERE
term_1 AND term_2 AND term_3 … AND term_n
Column = value 1 / NKeys(I), if column is index(I), or 10%.
Column1 = Column2 1 / MAX (NKeys(I1), NKeys(I2)), or 1 / NKeys(I), or 10% .
Column > value (High(I) – value) / (High(I) – Low(I)), or <50%. Column in (list of (reduction factor for column = value) * # values
Improved Statistics: Histograms
• The rough estimation assumes uniform distributions of v
alues.
– What if that assumption is not true? For more accurate estimatio
ns, use histograms.
• Histogram is a data structure to approximate data distrib
ution.
– Term # children with (age > 3): result size = 2 tuples.
1 2 3 4 5 ages
5 7
2
Relational Algebra Equivalences
• Allow us to choose different join orders and to
push
selecti
ons and projections ahead of joins.
• Selections (
cascading of selections
):
– Break a selection condition into many smaller selections.
– Combine several selections into one selection.
σ
c1 AND c2 AND … cn(R) ≡ σ
c1(σ
c2( …σ
cn(R)) …))
• Selection (
commutative
):
– Test conditions in either order.
σ
c1(σ
c2(R)) ≡ σ
c2(σ
c1(R))
Relational Algebra Equivalences
(Projections, Joins, and
Cross-Products)
• Projections (
cascading projections
)
– Successively eliminating columns is same as eliminating all but t
he columns of final projection.
π
a1.(R) ≡ π
a1.( π
a2.(…(π
anR)) ..)), where a
1is a set of attributes of
relation R, and a
iis a subset of a
i+1π
sid.(R) ≡ π
sid.( π
sid, bid(Reserves))
• Joins and Cross-Products (
commutative
)
– Freedom to choose inner or outer relations
RχS ≡ SχR
• Joins and Cross-Products (
associative
)
– Join pairs of relations in any order
Rχ(S χT) ≡ (RχS) χT
Relational Algebra Equivalences
Involving Two or More Operators (1)
• Commute a selection with a projection
– when the selection condition
c
involves only attributes retained by
the projection
a
.
π
a(σ
c(R)) ≡ σ
c(π
a(R))
π
sid(σ
sid=10(R)) ≡ σ
sid=10(π
sid(R))
π
sid(σ b
id=10(R)) ≠ σ b
id=10(π
sid(R))
• Combine a selection with a cross-product to form a join.
• Push a selection into a cross-product (join)
– When the selection condition
c
involves only attributes of one of th
e arguments to the cross-product (join).
σ
c(R χ S ) ≡ σ
c(R) χ S
σ
R.bid=10(R χ S ) ≡ σ
R.bid=10(R) χ S
Relational Algebra Equivalences
Involving Two or More Operators (2)
• Push
a selection into a cross-product (join)
– Replace a selection with cascading selections, and commute selections. – c1 involves attributes of both R & S (c2 attrs of R, c3 attrs of S).
σ c (R χ S ) ≡ σ c1 ^ c2 ^ c3 (R χ S )
≡ σ c1 (σ c2 (σ c3 (R χ S ))) ≡ σ c1 (σ c2 (R) χ σ c3 (S )) σ sid=10, rname=‘Jane’ ^ sname=‘Paul’ (R χ S )
≡ σ sid=10 (σ sname=‘Jane’ (R) χ σ rname=‘Paul’ (S ))
• Push
a projection into a cross-product (join)
– when subsets (a1,a2) of projection attribute a involves only attributes of one of the arguments to the cross-product (join).
– Same as to push a selection with cross-product (join) π a (R χ S ) ≡ π a1 (R) χ π a2 (S)
Relational Algebra Equivalences
Involving Two or More Operators (3)
• Push
a projection into a join
– When
a1
is subset of
a
in R,
a2
is a subset of
a
in S, and
c
is in
a
.
π
a(R ⋈
cS ) ≡ π
a1(R) ⋈
cπ
a2(S)
π
R.sid, S.sname, S.sid(R
⋈
R.sid=S.sidS )
≡ (π
R.sid(R) )
⋈
R.sid=S.sid(π
S.sname,S.sid(S) )
• Push
a projection into a join if
– a1
is subset of R that appear in
a
and
c
, and
a2
is a subset of S t
hat appear in
a
and
c
π
a(R
cS ) ≡ π
a(
π
a1(R)
cπ
a2(S))
π
R.sid, S.sname(R
R.sid=S.sidS ) ≡
π
R.sid,S.sname(
π
R.sid,R.sid(R)
R.sid=S.sidπ
S.sname,S.sid(S))
∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆
Enumeration of Alternative Plans
• There are two main cases:
–
Single-relation plans:
(SELECT … FROM Sailors S …)
–
Multiple-relation plans:
(SELECT … FROM Sailors S, Reserves R …)
• For queries over a single relation, queries consist of a combina
tion of selections, projections, and aggregate ops (no joins).
• The general strategy has two parts:
–
For selections, consider each available access path (file scan / index) a
nd pick the one with the least estimated cost.
–
Projections and aggregations are carried out together with selections.
• For example, if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate com putation.
Single-Relation Queries
• Example:
– “For each rating greater than 5, print the rating and the number of 20-year old sailors with that rating, provided that there are at last two such sailors with different names”
π
S.rating,COUNT(*)(
HAVING COUNT DISTINCT(S.sname)>2 GROUP BY S.rating (
π
S.raing,S.sname(
σ
S.rating>5 AND S.age=20 (Sailors)))))σ π χ expression
• Plan without indexes:
– Scan Sailors, apply sele ctions and projections. – Write out tuples.
– Sort tuples based on S.r ating (GROUP BY).
– Apply HAVING on the fly at the last sorting step.
Single-Relation Queries: Plans
Utilizing an Index
• Single-Index Access Path
– If several indexes match selection condition(s), pick the best access pat h. Apply projections and non-primary selections. Sort by grouping attrib utes. Do aggregations.
• Multiple-Index Access Path
– If several indexes match selection condition(s), use them to retrieve sets of RIDs. Take intersection of RID sets. Sort resulting RID by page ID. Retrieve all tuples on the same page, while applying projections, selecti ons. Sort by grouping attributes. Do aggregations.
• Sorted Index Access Path
– If GROUP BY attributes matches a tree index, use the index to retrieve t uples in order. Then apply selections, projections, and aggregation oper ations.
• Index-Only Access Path
– If all attributes mentioned in (SELECT, WHERE, GROUP BY, HAVING) are in the index, can do an index-only scan.
Example Using Single-Index Access
Path
• B+ tree index on rating, hash index on age, and B+ tree index on <r
ating, sname, age>
π
S.rating,COUNT(*)(
HAVING COUNT DISTINCT(S.sname)>2 GROUP BY S.rating (
π
S.raing,S.sname(
σ
S.rating>5 AND S.age=20 (Sailors)))))σ π χ expression
• Retrieve Sailors tuple (age= 20) using hash index on age . (most selective path)
• For retrieved tuple, apply (ra ting >5 ) and projection out a ttributes (rating and sname). • Write results to temp table. • Sort temp table on rating. A
t the last sorting step, apply HAVING and final projection .
Queries Over Multiple Relations
• The general strategy has three parts:
– Consider and enumerate
only left-deep join trees
. Why?
• Restrict the search space.
• Left-deep trees allow us to generate all fully pipelined plans.
–
Consider selections and projections as early as possible (Push
selections and projections into lower joins)
–
Estimate the cost for each left-deep plan. Pick the one with the
lowest cost.
B A C D B A D C B C A DEnumeration of Left-Deep Plans
• Left-deep plans differ from each other in
– The order of relations
– The access method for each relation
– The join algorithm for each join operation
• We have discussed how to enumerate different access
methods and estimate their costs for one relation.
• Different join algorithms (e.g., nested loop join, sort-merge
join, hash join, …) and their cost analysis are discussed in
Chapter 14.
Plan Enumeration Algorithm
• Enumerated using N passes (if N relations joined):
–
Pass 1:
Find best 1-relation plan for each relation.
• Push selection terms & projection attributes to that 1-relation (using equiv alences)
• Consider all different access paths, and pick the one with lowest cost. Th is is same algorithm as 1-relation query.
–
Pass 2:
Find best way to join result of each 1-relation plan (as outer)
to another relation.
(All 2-relation plans.)
• Again push selection terms & projection attributes into the inner relation u sing equivalences.
• Try to pipeline the selected/projected tuples from the inner relation into joi n operation with outer relation. (Sometimes cannot, e.g., sort-merge join) –
Pass N:
Find best way to join result of a (N-1)-relation plan (as outer)
Plan Enumeration Algorithm Illustration
C A B B A C Pass 1: Same as Single-Relation Query A B C Pass 2 B A A C outer inner push sel/proj pipelined into join consideralt join algs
Enumeration of Plans (Contd.)
• For each subset of relations, retain only:
–
Cheapest plan overall, plus
–
Cheapest plan for each
interesting order
of the tuples.
• ORDER BY, GROUP BY
, aggregates
etc. handled as a final ste
p. If tuples going into them are not sorted, apply additional
sorting.
• Consider left-deep trees, this approach is
still exponential
in
the # of relations.
Example
• Pass1:
– Sailors: B+ tree matches rating>5, and is probably ch
eapest. However, if this selection is expected to retrie ve a lot of tuples, and index is unclustered, file scan m ay be cheaper.
• Still, B+ tree plan kept (because tuples are in rating orde r).
– Reserves: B+ tree on bid matches bid=500; cheapest
.
Sailors:
B+ tree index on rating
Hash index on sid
Reserves:
B+ tree index on bid
•
Pass 2:
– We consider each plan retained from Pass 1 as the outer, and consider ho
w to join it with the (only) other relation.
• Reserves as outer (better): hash index can be used to get Sailors tuples that ma tch sid = outer tuple’s sid value (hash join)
• Sailors as outer (worse): B+ tree index can be used to get Reserves tuples that match bid. Then pipeline tuples into join, or write them out for sort-merge join.
Reserves Sailors
sid=sid
bid=100 rating > 5
Nested Subqueries
• Nested block is optimized independently
(sometimes just evaluated once).
• In general, nested queries are dealt with
using some form of nested loops evaluati
on.
– Main query as the outer loop, the subquer y as the inner loop
– This is necessary for correlated queries b elow (S is used both in the main query an d the subquery.). SELECT S.sname FROM Sailors S WHERE S.rating = (SELECT MAX(S2.ratings) FROM Sailors S2) SELECT S.sname FROM Sailors S WHERE S.sid IN (SELECT R.sid FROM Reserves R WHERE R.bid=103) SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R
Summary
• Query optimization is an important task in a relational
DBMS.
• Must understand optimization in order to understand the
performance impact of a given database design (relations,
indexes) on a workload (set of queries).
• Two parts to optimizing a query:
–
Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
–
Must estimate cost of each plan that is considered.
• Must estimate size of result and cost for each plan node.
Summary (Contd.)
• Single-relation queries:
–
All access paths considered, cheapest is chosen.
• Multiple-relation queries:
–
All single-relation plans are first enumerated.
• Selections/projections considered as early as possible
.–
Next, for each 1-relation plan, all different ways of joining another r
elation (as inner) are considered.
–
Next, for each 2-relation plan that is `retained’, all ways of joining a
nother relation (as inner) are considered, etc.
–