outline solutions

Transcription

outline solutions
CS346 Exercise Sheet 2: Indexing and Query Processing
For the seminars taking place in Week 4. Solutions due noon on Monday of Week 4.
Question:
1
2
3
4
Total
Points:
35
10
15
40
100
1. Consider the following B+ tree with p = 3 and pleaf = 2:
12 29
4 9
1 4
5 9
38 45
18
20 29
13 18
11 12
30 38
41 45
60 70
Show the consequence of the following operations in sequence:
(a)
(b)
(c)
(d)
(e)
Insert a key with a value of 19
Then insert a record with key 15
Then delete the records with keys 1 then 4
Then delete the record with key 5
Lastly, delete the records with keys 11 then 9
[5]
[10]
[5]
[5]
[10]
Solution: (a) search to location then insert there, propagate new split up
12 29
4 9
1 4
5 9
18 20
13 18
11 12
19 20
38 45
29
30 38
41 45
60 70
(b) need to reorganize parents
18
29
12
4 9
1 4
5 9
20
15
11 12
13 15
18
1
19 20
38 45
29
30 38
41 45
60 70
(c) borrow from siblings:
18
29
12
5 9
5
9
11 12
13 15
38 45
20
15
19 20
18
29
30 38
41 45
60 70
(d)propagate to parent
18
29
12
9
9
20
15
11 12
13 15
19 20
18
38 45
29
30 38
41 45
60 70
(e) rearrangement of grandparents to ensure properties met
18 29
20
12 15
12
13 15
18
19 20
38 45
29
30 38
41 45
60 70
2. A file of 4096 blocks is to be sorted with an available memory buffer space of 64 blocks.
How many passes will be needed in the merge phase of the external sort-merge algorithm?
Solution: The number of runs produced by the initial processing of the data file is
4096/64 = 64.
The degree of merging is 63: we can use 63 blocks to merge data, while we fill the
remaining block in memory with records to output.
Page 2
[10]
So we need two passes in the merge phase: the data is just a little too large to allow
one pass.
3. Describe how you would use a hashing based solution to implement the UNION operation
in SQL. What is the cost of your solution, in terms of number of disk operations?
[15]
Hint: refer to the section from lectures on Set Operations, and the preceding discussion on
duplicate elimination.
See http://en.wikipedia.org/wiki/Set_operations_%28SQL%29#UNION_operator
for clarification of this operator
Solution: The result of computing the UNION of two relations is to include all tuples
that are in either relation.
Assume for simplicity that we have already checked that the relations are compatible.
Further, assume that relations R and S to UNION do not contain duplicates.
For each tuple in R, we use a hash function to map it into a hash table. The hash
function is applied the full tuple. We also copy the tuple to a new output relation, T.
Then, for each tuple in S, we use the same hash function to map it into the hash table.
If there is a hash collision, and the exact same tuple is already stored there, then we do
not output it. Otherwise, we copy the tuple to the new output relation T.
This assumes that there is enough memory space to keep the hash table in memory. If
not, we could use hash partitioning of the inputs, similar to the discussion of partition
hash join. That is, we first partition the files into M buckets based on a hash function
mapping each tuple into one of M values. This creates M files containing the tuples
mapped into each of the M buckets (we can also note on each output tuple whether it
was from relation R or S). Then we can take each file, and apply a duplicate removal
algorithm to it, to remove tuples that have more than one copy.
The cost of the basic solution is to read both R and S, and write out T. Let bR denote the
number of blocks in R, and bS for S. The cost is then bR + bS to read through the files.
The size of T is between max(bR , bS ) and bR + bS , so we can approximate the overall
cost by the (upper) bound 2bR + 2bS .
4. Propose two different query plans for the following query and compare their cost:
σSalary>40000 (EMPLOYEE ./Dno = Dnumber DEPARTMENT)
Use the statistics from the table on the next page to make the cost estimations. Explain any
additional assumptions you need to make along the way.
Solution:
There are two natural query trees for this query.
Query Tree 1
The first tree performs the selection on Employee first, then performs the join on the
resulting relation.
Page 3
[40]
Low value and high value give the observed maximum and minimum data values for that
field.
Interpret salary as being in units of 100, so low value of salary = 1 corresponds to 100, and
high value of salary = 500 corresponds to 50,000.
Page 4
We can use the salary index on EMPLOYEE for the select operation: The table of statistics indicates that there are 500 unique salary values, with a low value of 1 and a high
value of 500. (It might be in reality that salary is in units of 1000 pounds, so 1 represents
£1000 and 500 represents £500,000.) The selectivity for (Salary > 400) can be estimated
as (500 - 400)/500 = 1/5 This assumes that salaries are spread evenly across employees.
So the cost (in block accesses) of accessing the index would be Blevel + (1/5) * (LEAF
BLOCKS) = 1 + (1/5)* 50 = 11.
Since the index is nonunique, the employees can be stored on any of the data blocks. So
the the number of data blocks to be accessed would be (1/5) * (NUM ROWS) = (1/5)
* 10,000 = 2000 Since 10,000 rows are stored in 2000 blocks, we have that 2000 rows
can be stored in 400 blocks. So the TEMPORARY table (i.e., the result of the selection
operator) would contain 400 blocks. The cost of writing the TEMPORARY table to disk
would be 400 blocks. Now, we can do a nested loop join of the temporary table and
the DEPARTMENT table. The cost of this would be
bDEPARTMENT + (bDEPARTMENT ∗ bTEMPORARY )
We can ignore the cost of writing the result for this comparision, since the cost would
be the same for both plans, we will consider. We have 5 + (5 * 400) = 2005 block accesses
Therefore, the total cost would be 11 + 2000 + 400 + 2005 = 4416 block accesses
NOTE: If we have 5 main memory buffer pages available during the join, then we
could store all 5 blocks of the DEPARTMENT table there. This would reduce the cost
of the join to 5 + 400 = 405 and the total cost would be reduced to 11 + 2000 + 400 + 405
= 2816.
Query Plan 2
A second plan might be for the following query tree to first do the join between DEPARTMENT and EMPLOYEE, and only then perform the selection on Salary.
Again, we could use a nested loop for the join but instead of creating a temporary
table for the result we can use a pipelining approach and pass the joining rows to
the select operator as they are computed. Using a nested loop join algorithm would
yield the following 50 + (50 * 2000) = 100,050 blocks We would pipeline the result to
the selection operator and it would choose only those rows whose salary value was
greater than 400.
NOTE: If we have 50 main memory buffer pages available during the join, then we
could store the entire DEPARTMENT table there. This would reduce the cost of the
join and the pipelined select to 50 + 2000 = 2050.
In either case, the cost of this approach is much higher, so it is of considerable advantage to “push down” the selection, and perform this first.
Page 5