Outline solutions
CS346 Exercise Sheet 2: Indexing and Query Processing

For the seminars taking place in Week 4. Solutions due noon on Monday of Week 4.

Question:   1    2    3    4   Total
Points:    35   10   15   40     100

1. Consider the following B+ tree with p = 3 and p_leaf = 2:

[B+ tree diagram not reproducible from the transcription; node keys as transcribed: 12 29 / 4 9 / 1 4 5 9 / 38 45 / 18 20 29 / 13 18 / 11 12 / 30 38 41 45 60 70]

Show the consequence of the following operations in sequence:

(a) Insert a key with a value of 19 [5]
(b) Then insert a record with key 15 [10]
(c) Then delete the records with keys 1 then 4 [5]
(d) Then delete the record with key 5 [5]
(e) Lastly, delete the records with keys 11 then 9 [10]

Solution:

(a) Search to the leaf location, insert there, and propagate the new split up.

[B+ tree diagram; keys as transcribed: 12 29 / 4 9 / 1 4 5 9 / 18 20 / 13 18 / 11 12 / 19 20 / 38 45 / 29 30 38 41 45 60 70]

(b) We need to reorganize the parents.

[B+ tree diagram; keys as transcribed: 18 29 / 12 / 4 9 / 1 4 5 9 / 20 / 15 / 11 12 13 15 18 / 1 19 20 / 38 45 / 29 30 38 41 45 60 70]

(c) Borrow from siblings.

[B+ tree diagram; keys as transcribed: 18 29 / 12 / 5 9 / 5 9 11 12 13 15 / 38 45 / 20 / 15 / 19 20 / 18 29 30 38 41 45 60 70]

(d) Propagate to the parent.

[B+ tree diagram; keys as transcribed: 18 29 / 12 / 9 / 9 / 20 / 15 / 11 12 13 15 / 19 20 / 18 38 45 / 29 30 38 41 45 60 70]

(e) Rearrangement of the grandparents to ensure the B+ tree properties are met.

[B+ tree diagram; keys as transcribed: 18 29 / 20 / 12 15 / 12 13 15 18 / 19 20 / 38 45 / 29 30 38 41 45 60 70]

2. A file of 4096 blocks is to be sorted with an available memory buffer space of 64 blocks. How many passes will be needed in the merge phase of the external sort-merge algorithm? [10]

Solution: The number of runs produced by the initial processing of the data file is 4096/64 = 64. The degree of merging is 63: we can use 63 blocks to merge data, while we fill the remaining block in memory with records to output. So we need two passes in the merge phase: the data is just a little too large to allow one pass.

3. Describe how you would use a hashing-based solution to implement the UNION operation in SQL. What is the cost of your solution, in terms of the number of disk operations? [15]

Hint: refer to the section from lectures on Set Operations, and the preceding discussion on duplicate elimination.
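As a sanity check on the arithmetic in Question 2 above, the run and pass counts can be computed directly. This is a minimal sketch; the function name and signature are illustrative, not part of any library:

```python
import math

def merge_passes(file_blocks, buffer_blocks):
    """Return (initial_runs, merge_passes) for external sort-merge."""
    # Run-generation phase: each initial run fills all available buffer blocks.
    runs = math.ceil(file_blocks / buffer_blocks)
    # Merge phase: one buffer block is reserved for output,
    # so the degree of merging is buffer_blocks - 1.
    degree = buffer_blocks - 1
    passes = math.ceil(math.log(runs, degree)) if runs > 1 else 0
    return runs, passes

print(merge_passes(4096, 64))  # (64, 2): 64 initial runs, two merge passes
```

With 64 runs and a merge degree of 63, one pass is not quite enough, which matches the answer above.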
See http://en.wikipedia.org/wiki/Set_operations_%28SQL%29#UNION_operator for clarification of this operator.

Solution: The result of computing the UNION of two relations includes all tuples that are in either relation. Assume for simplicity that we have already checked that the relations are compatible, and further that the relations R and S to be UNIONed do not contain duplicates.

For each tuple in R, we use a hash function to map it into a hash table; the hash function is applied to the full tuple. We also copy the tuple to a new output relation, T. Then, for each tuple in S, we use the same hash function to map it into the hash table. If there is a hash collision and the exact same tuple is already stored there, then we do not output it. Otherwise, we copy the tuple to the new output relation T.

This assumes that there is enough memory space to keep the hash table in memory. If not, we can use hash partitioning of the inputs, similar to the discussion of partition hash join. That is, we first partition the files into M buckets based on a hash function mapping each tuple to one of M values. This creates M files containing the tuples mapped into each of the M buckets (we can also mark on each output tuple whether it came from relation R or S). We can then take each file and apply a duplicate-removal algorithm to it, removing tuples that have more than one copy.

The cost of the basic solution is to read both R and S, and write out T. Let b_R denote the number of blocks in R, and b_S the number in S. The cost of reading through the files is then b_R + b_S. The size of T is between max(b_R, b_S) and b_R + b_S, so we can approximate the overall cost by the upper bound 2b_R + 2b_S.

4. Propose two different query plans for the following query and compare their cost: [40]

    σ_{Salary>40000} (EMPLOYEE ⋈_{Dno=Dnumber} DEPARTMENT)

Use the statistics from the table below to make the cost estimations. Explain any additional assumptions you need to make along the way.
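The in-memory hash-table approach from the solution to Question 3 can be sketched as follows. This is an illustrative sketch: `hash_union` is a hypothetical helper, and Python's built-in `hash` with a dict of buckets stands in for the hash function and hash table:

```python
def hash_union(R, S):
    """Hash-based UNION of two duplicate-free relations (in-memory sketch)."""
    table = {}  # hash table: bucket id -> list of tuples in that bucket
    T = []      # output relation
    for t in R:
        # Hash the full tuple and record it; every tuple of R is output.
        table.setdefault(hash(t), []).append(t)
        T.append(t)
    for t in S:
        # On a hash collision, compare full tuples: output t only if
        # an identical tuple was not already seen in R.
        bucket = table.get(hash(t), [])
        if t not in bucket:
            T.append(t)
    return T

R = [(1, 'a'), (2, 'b')]
S = [(2, 'b'), (3, 'c')]
print(hash_union(R, S))  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

The partitioned variant described above would replace the single in-memory dict with M on-disk bucket files, deduplicating each file separately.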
Solution: There are two natural query trees for this query.

[Statistics table not reproduced in the transcription. Low value and high value give the observed minimum and maximum data values for each field. Interpret Salary as being in units of 100, so a low value of 1 corresponds to 100 and a high value of 500 corresponds to 50,000.]

Query Tree 1

The first tree performs the selection on EMPLOYEE first, then performs the join on the resulting relation.

We can use the Salary index on EMPLOYEE for the select operation. The table of statistics indicates that there are 500 unique salary values, with a low value of 1 and a high value of 500. The selectivity for (Salary > 400) can be estimated as

    (500 - 400)/500 = 1/5,

assuming that salaries are spread evenly across employees. So the cost (in block accesses) of accessing the index would be

    Blevel + (1/5) * LEAF_BLOCKS = 1 + (1/5) * 50 = 11.

Since the index is nonunique, the matching employees can be stored on any of the data blocks, so the number of data blocks to be accessed would be

    (1/5) * NUM_ROWS = (1/5) * 10,000 = 2000.

Since 10,000 rows are stored in 2000 blocks, 2000 rows can be stored in 400 blocks. So the TEMPORARY table (i.e., the result of the selection operator) would contain 400 blocks, and the cost of writing it to disk would be 400 block accesses.

Now we can do a nested-loop join of the TEMPORARY table and the DEPARTMENT table. The cost of this would be

    b_DEPARTMENT + (b_DEPARTMENT * b_TEMPORARY) = 5 + (5 * 400) = 2005 block accesses.

We can ignore the cost of writing the result for this comparison, since it would be the same for both plans we consider. Therefore, the total cost would be

    11 + 2000 + 400 + 2005 = 4416 block accesses.

NOTE: If we have 5 main memory buffer pages available during the join, then we could store all 5 blocks of the DEPARTMENT table there.
This would reduce the cost of the join to 5 + 400 = 405, and the total cost would be reduced to 11 + 2000 + 400 + 405 = 2816.

Query Plan 2

A second plan is for the query tree to first do the join between DEPARTMENT and EMPLOYEE, and only then perform the selection on Salary. Again, we can use a nested loop for the join, but instead of creating a temporary table for the result we can use a pipelining approach and pass the joining rows to the select operator as they are computed. Using a nested-loop join algorithm would yield

    50 + (50 * 2000) = 100,050 block accesses.

We would pipeline the result to the selection operator, which would keep only those rows whose Salary value is greater than 400.

NOTE: If we have 50 main memory buffer pages available during the join, then we could store the entire DEPARTMENT table there. This would reduce the cost of the join and the pipelined select to 50 + 2000 = 2050.

Without enough buffer space to hold DEPARTMENT in memory, the cost of this approach is much higher, so it is of considerable advantage to "push down" the selection and perform it first.
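The cost arithmetic for the two plans above can be reproduced with a short script. This is a sketch: the function and parameter names are illustrative, and the block and row counts are taken directly from the solution text:

```python
def plan1_cost(num_rows=10_000, leaf_blocks=50, blevel=1,
               b_dept=5, selectivity=1/5, rows_per_block=5):
    """Select on Salary via the index, then nested-loop join with DEPARTMENT."""
    index_cost = blevel + int(selectivity * leaf_blocks)   # 1 + 10 = 11
    data_cost = int(selectivity * num_rows)                # 2000 (nonunique index)
    b_temp = int(selectivity * num_rows / rows_per_block)  # TEMPORARY: 400 blocks
    write_temp = b_temp                                    # write TEMPORARY to disk
    join_cost = b_dept + b_dept * b_temp                   # 5 + 5*400 = 2005
    return index_cost + data_cost + write_temp + join_cost

def plan2_cost(b_dept=50, b_emp=2000):
    """Nested-loop join first; the selection is pipelined at no extra I/O cost."""
    return b_dept + b_dept * b_emp

print(plan1_cost())  # 4416
print(plan2_cost())  # 100050
```

Running both confirms the totals of 4416 and 100,050 block accesses quoted in the solution.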