Database Systems (資料庫系統) Name: ___________________________________
January 10, 2005
Time: 2:30 PM ~ 5:30 PM Student ID: _______________________________
Final Examination Solution
I herewith state that I understand and will adhere to the following academic integrity statement: “I will not use or attempt to use any unauthorized assistance, material, or study aids in this final examination. If I violate academic integrity, I may receive a grade of 0.”
___________________________________________________________
Signature
(You get 5 points for filling in your name, student ID, and signature)
PART A: External Sorting (10 Points)
Assume that you want to sort a file of N pages. You have B buffer pages available, and you apply both the double-buffering and blocked-I/O optimizations, with a blocking factor of b pages (reading and writing b pages at a time).
Derive the cost (measured in number of page I/Os) of sorting the file using the external merge sort algorithm. You only have to write down the formula.
# runs we can merge in each pass (after the 1st pass): floor(B/2b) - 1
Number of passes: 1 + ceiling( log_{floor(B/2b) - 1} ceiling(2N/B) )
Cost of each pass: 2N (each pass reads and writes every page once)
Total cost: 2N * (1 + ceiling( log_{floor(B/2b) - 1} ceiling(2N/B) )) -> You will receive full credit if you get the total cost correct.
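As a sanity check, the formula can be evaluated numerically. The following Python sketch (the function name and example numbers are illustrative, not from the exam) computes the total I/O cost:

```python
import math

def external_sort_cost(N, B, b):
    # With double buffering, the initial sorting pass produces runs of
    # roughly B/2 pages, hence ceiling(2N/B) initial runs.
    runs = math.ceil(2 * N / B)
    # Each merge input run and the output each get a 2b-page double
    # buffer, so floor(B/(2b)) - 1 runs can be merged per pass.
    fanin = B // (2 * b) - 1
    passes = 1 + math.ceil(math.log(runs, fanin))
    # Every pass reads and writes each of the N pages once.
    return 2 * N * passes
```

For example, with N = 1000, B = 100, and b = 5: there are ceiling(2000/100) = 20 initial runs, a merge fan-in of floor(100/10) - 1 = 9, hence 1 + ceiling(log_9 20) = 3 passes and 2 * 1000 * 3 = 6000 page I/Os.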
PART B: Evaluating Relational Operators (20 Points) Suppose you have the following indices on the relation branch:
A clustered B+ tree index on assets
Hash index on manager and Hash index on branch-city
Describe an optimal way to evaluate each of the following selections. ¬ denotes negation. Please give short answers. (4 points each)
1) σ ¬ (assets < 5000) (branch)
Answer: Use the clustered B+ tree index on assets to locate the first tuple whose assets value is greater than or equal to 5000. From this tuple, scan forward and retrieve all remaining tuples.
2) σ ¬ (assets = 5000) (branch)
Answer: The indices are of no use here. We perform a sequential scan of the file and select all the branches whose assets are not equal to 5000.
3) σmanager = “Joe” or branch-city = “Taipei” (branch)
Answer: Use the hash index on manager to retrieve the set of all tuple RIDs with manager =
“Joe”. Use the hash index on branch-city to retrieve the set of tuple RIDs with branch-city =
“Taipei”. Take the union of two sets. Retrieve all tuples in this union set.
Let relations r1(A,B,C) and r2(C,D,E) have the following properties: r1 has 20,000 tuples, r2 has 45,000 tuples, 25 tuples of r1 fit on one block, and 30 tuples of r2 fit on one block. Assume the buffer pool has 101 pages. Estimate the minimum number of block accesses required, using each of the following join strategies for r1 ⋈ r2: (4 points each)
4) Block nested-loop join
Answer: r1 is the outer relation (800 blocks; r2 occupies 1,500 blocks). Cost = ceiling(800/(101-2)) * 1500 + 800 = 9 * 1500 + 800 = 14,300
5) Hash join (Assume no overflow occurs)
Answer: r1 is the (smaller) build relation. Cost = 3 * (1500 + 800) = 6,900 (partitioning reads and writes each relation once, and probing reads each once).
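The two cost figures can be checked with a quick Python sketch (variable names are illustrative; the block counts follow from the tuple counts and blocking factors in the problem):

```python
import math

# Block counts derived from the problem statement
b_r1 = 20_000 // 25   # r1: 800 blocks
b_r2 = 45_000 // 30   # r2: 1,500 blocks
B = 101               # available buffer pages

# 4) Block nested-loop join with r1 as the outer relation:
#    one page buffers the inner, one buffers output, so B-2 pages
#    hold a chunk of the outer during each scan of the inner.
bnl = math.ceil(b_r1 / (B - 2)) * b_r2 + b_r1
print(bnl)   # 14300

# 5) Hash join with no overflow: partitioning reads and writes both
#    relations once, probing reads both once => 3 * (b_r1 + b_r2).
hj = 3 * (b_r1 + b_r2)
print(hj)    # 6900
```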
PART C: Query Optimization (10 Points) Consider the following scenario:
Emp(eid: integer, sal: integer, did: integer)
DeptProj(did: integer, project-id: integer, budget: real)
Assume that each Emp record is 20 bytes long, and each DeptProj record is 40 bytes long on average. There are 20,000 tuples in Emp and 5000 tuples in DeptProj (note that did is not a key and each department on average has 10 projects – meaning that each did value can be
associated with 10 project-ids in the DeptProj relation). The file system supports 4000 byte pages, and 12 buffer pages are available. You can assume uniform distribution of values. State any additional assumptions. The cost metric to use is the number of page I/Os. Ignore the cost of writing out the final result.
Consider the following query:
SELECT *
FROM Emp E, DeptProj D WHERE E.did=D.did
1) Suppose that there is a clustered hash index on did on Emp. Identify a plan (including join algorithm & inner/outer relation) that is likely to give you the lowest estimated cost. You do not need to compute the actual cost. (5 points)
Answer: Use DeptProj (D) as the outer relation and perform an index nested-loop (INL) join, probing the clustered hash index on Emp.did.
If you say hash join, it will also be correct.
2) Assume that both relations are sorted on the join column, in addition to the clustered hash index on did on Emp. Show the plan that is likely to give you the lowest estimated cost. If this plan is different from the plan in (1), give a short reason why this plan is better than the plan in (1). You do not need to compute the actual cost. (5 points)
Answer: Sort-merge (SM) join is the best strategy because both relations are already sorted on the join column. Note that INL is worse than SM here: each tuple in Emp joins on average with 10 tuples of DeptProj, and must therefore be fetched 10 times by the index probes.
PART D: Concurrency Control (15 Points) Consider the following two transactions:
T1: read(A) T2: read(B)
read(B) read(A)
write(B) write(A)
1) Is there a concurrent execution (some interleaving of read-write actions) of T1|T2 that produces a serializable, recoverable schedule? If yes, show this schedule by listing actions in the table below. If not, state why not. (5 points)
Answer: No concurrent execution of T1|T2 produces a serializable, recoverable schedule. You may be able to come up with a serializable schedule with some interleaved actions from T1|T2, but that schedule would not be recoverable, because the last action of each transaction writes an object that is read by the first action of the other transaction (RW/WR conflicts; see the cascading-rollback problem in E2).
2) Add locking & unlocking instructions when executing T1 and T2 concurrently according to the strict 2PL (two-phase locking) protocol in the table below. Assume the DBMS starts executing T1 before T2, then interleaves one action from T1 with one action from T2 (T1’s 1st action, followed by T2’s 1st action, followed by T1’s 2nd action, and so forth). Show where the deadlock occurs in the table below. Use the following notation for the locking & unlocking instructions: lock-x(A)/unlock-x(A) for an exclusive lock/unlock on object A, and lock-s(A)/unlock-s(A) for a shared lock/unlock on object A. (5 points)
T1                                  T2
lock-s(A)
read(A)
                                    lock-s(B)
                                    read(B)
lock-x(B) // blocked, waits for T2
                                    lock-x(A) // blocked, waits for T1 -> deadlock
(Time flows downward.)
3) Apply the wait-die deadlock prevention policies to handle the deadlock in (2). Assume the same execution sequence as in (2). Specify all read-write-commit-abort actions from T1|T2 as well as locking/unlocking instructions in the table below. (5 points)
T1                                  T2
lock-s(A)
read(A)
                                    lock-s(B)
                                    read(B)
lock-x(B) // T1 is older, so T1 waits
                                    lock-x(A) // T2 is younger, so T2 dies: abort, unlock-s(B)
lock-x(B) granted
read(B)
write(B)
commit
unlock-s(A), unlock-x(B)
                                    (T2 restarts)
                                    lock-s(B)
                                    read(B)
                                    lock-x(A)
                                    read(A)
                                    write(A)
                                    commit
                                    unlock-s(B), unlock-x(A)
(Time flows downward.)
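The wait-die decision in (3) boils down to comparing transaction timestamps. A minimal sketch, assuming timestamps are assigned at transaction start so a smaller timestamp means an older transaction (the helper name and timestamp values are hypothetical):

```python
def wait_die(requester_ts, holder_ts):
    # Wait-die: smaller timestamp = older transaction.
    # An older requester is allowed to wait for a younger holder;
    # a younger requester "dies" (aborts) and later restarts with
    # its ORIGINAL timestamp, so it cannot starve forever.
    return "wait" if requester_ts < holder_ts else "die"

# In (3): T1 started first, so ts(T1)=1 < ts(T2)=2 (illustrative values).
print(wait_die(1, 2))  # T1 requests lock-x(B) held by T2 -> 'wait'
print(wait_die(2, 1))  # T2 requests lock-x(A) held by T1 -> 'die'
```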
PART E: More Concurrency Control (15 Points) Consider the following transactions:
T3: read(A) T4: read(A)
read(B) write(A)
write(A) read(C)
1) Show a concurrent execution of T3|T4 that is conflict serializable but not recoverable, and that can be produced by the non-strict 2PL (two-phase locking) protocol.
Specify (1) all read/write/commit actions and locking/unlocking instructions in the table below, (2) the abort action in one of the transactions that causes the schedule to be unrecoverable, and (3) the serial schedule that the schedule below is conflict-equivalent to. (10 points)
The conflict-equivalent serial schedule: __T3, T4______________________
Note that this is only one possible answer. You can also come up with a conflict serializable schedule that produces the same result as the serial schedule T4,T3.
T3                                  T4
lock-x(A)
read(A)
lock-s(B)
read(B)
write(A)
unlock-x(A)
                                    lock-x(A)
                                    read(A)
                                    write(A)
                                    lock-s(C)
                                    unlock-x(A)
                                    read(C)
                                    commit
                                    unlock-s(C)
abort
unlock-s(B)
(Time flows downward. T4 reads the value of A written by T3 and commits before T3 finishes, so T3’s abort makes the schedule unrecoverable.)
2) Even if a schedule is recoverable, to recover correctly from the failure of a transaction T we may have to roll back several transactions. Consider the partial schedule shown below.
Suppose that T3 fails; T4 must also be rolled back, since T4 has read a value written by T3.
This phenomenon, in which a single transaction failure leads to the rollback of other transactions, is called cascading rollback. Cascading rollback is obviously undesirable, since it leads to undoing a significant amount of work. Specify one restriction on the schedule such that no cascading rollback can occur. Give a short explanation as to why this restriction works. (5 points)
T3                                  T4
read(A)
read(B)
write(A)
                                    read(A)
                                    write(A)
(Time flows downward.)
Answer: If Tj (T4) reads a data item previously written by Ti (T3), the commit operation of Ti (T3) must appear before the read operation of Tj (T4). This avoids the cascading-rollback problem because Ti has already committed its modification of the data item before Tj reads it.
PART F: ARIES Recovery Algorithm (30 Points) Consider the write-ahead log shown below, as it appears at the time of the crash and restart.
LSN  LOG
00   Update: T1 writes P1
10   Update: T2 writes P2
20   Update: T3 writes P3
30   begin_checkpoint
40   end_checkpoint
50   T2 commit
60   Update: T3 writes P2
70   T2 end
80   Update: T1 writes P5
90   T3 abort
Crash, Restart
1) Draw the dirty page table & active transaction table at the end of the Analysis phase. Specify the redo point. (8 points)
Dirty Page Table            Active Transaction Table
pageID  recLSN              transID  lastLSN
P1      00                  T1       80
P2      10                  T3       90
P3      20
P5      80
Redo point: 00 (the smallest recLSN in the dirty page table)
2) Assume that only T2’s update (P2) has been forced to disk. In the Redo phase, which page updates will be redone? Specify the order of page updates. (2 points)
Answer: P1 (LSN 00), P3 (LSN 20), P2 (LSN 60), P5 (LSN 80). The update at LSN 10 is skipped because T2’s update to P2 is already on disk (pageLSN >= 10).
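The Redo-phase decision rule behind this answer can be sketched in Python (the data structures are illustrative; in real ARIES the pageLSN test requires fetching the page from disk):

```python
# Update log records: (LSN, transaction, page)
log = [(0, "T1", "P1"), (10, "T2", "P2"), (20, "T3", "P3"),
       (60, "T3", "P2"), (80, "T1", "P5")]
dpt = {"P1": 0, "P2": 10, "P3": 20, "P5": 80}  # dirty page table: pageID -> recLSN
page_lsn = {"P2": 10}  # on-disk pageLSN: only T2's update to P2 was flushed

redo_point = min(dpt.values())  # smallest recLSN -> 0
redone = []
for lsn, txn, page in log:
    if lsn < redo_point:
        continue  # before the redo point
    if page not in dpt or lsn < dpt[page]:
        continue  # page not dirty, or dirtied only after this update
    if page_lsn.get(page, -1) >= lsn:
        continue  # the on-disk page already reflects this update
    redone.append(page)

print(redone)  # ['P1', 'P3', 'P2', 'P5']
```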
3) Show the log after the recovery is completed. (4 points)
LSN      LOG
100      CLR: Undo T1 LSN 80
110      CLR: Undo T3 LSN 60
120/125  CLR: Undo T3 LSN 20; T3 end
130/135  CLR: Undo T1 LSN 00; T1 end
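The Undo phase repeatedly undoes the update with the largest LSN among the loser transactions (T1 and T3), which produces exactly this CLR order. A small sketch under that assumption (structures illustrative):

```python
# Loser transactions and the LSNs of their update records (from the log)
loser_updates = {"T1": [0, 80], "T3": [20, 60]}

# Undo in decreasing LSN order across all losers
to_undo = sorted([(lsn, t) for t, lsns in loser_updates.items() for lsn in lsns],
                 reverse=True)
clrs = [f"CLR: Undo {t} LSN {lsn:02d}" for lsn, t in to_undo]
print(clrs)
# ['CLR: Undo T1 LSN 80', 'CLR: Undo T3 LSN 60',
#  'CLR: Undo T3 LSN 20', 'CLR: Undo T1 LSN 00']
```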
4) Consider the following modifications to the ARIES recovery algorithm. To limit the amount of work during recovery, the redo point is set to the most recent checkpoint, and the undo phase does not check actions before the most recent checkpoint. Does the modified ARIES still work? If not, give a short description of its potential problem(s). (6 points)
Answer: REDO may miss updates made before the most recent checkpoint by committed transactions whose pages had not been written to disk at the time of the crash.
UNDO may also miss updates made before the most recent checkpoint by uncommitted transactions whose pages had been written to disk.
5) The ARIES recovery algorithm is designed to work for a DBMS using the non-force, steal write policy. Consider a DBMS that uses a force, steal write policy (meaning that all updates made by a transaction in the buffer pool are forced to stable storage when the transaction commits). Is it possible to simplify the ARIES recovery algorithm? If yes, please give a short description of your simplification(s). (5 points)
Answer: Yes, it is possible to simplify ARIES. There is no need to perform REDO on any of the actions from committed transactions because their updates have all been forced into stable storage.
6) Consider a DBMS that uses a non-force, non-steal write policy (meaning that no updates made by a transaction in the buffer pool can be forced to stable storage before the transaction commits). Is it possible to simplify the ARIES recovery algorithm? If yes, please give a short description of your simplification(s). (5 points)
Answer: Yes, it is possible to simplify ARIES. There is no need to perform UNDO on any of the actions from active (uncommitted) transactions because their updates cannot be forced into stable storage.
This page will be used for grading your exam. Do not write anything on this page.
PART                Score
Name/ID/Signature   5
A                   10
B                   20
C                   10
D                   15
E                   15
F                   30
Total               105