How to Improve the Pruning Ability of Dynamic Metric Access Methods
Caetano Traina Jr., Agma Traina, Roberto Santos Filho
Dept. of Computer Science and Statistics, University of Sao Paulo at Sao Carlos - Brazil
[caetano | agma | figueira]@icmc.sc.usp.br

Christos Faloutsos
Dept. of Computer Science, Carnegie Mellon University - USA
[email protected]

Abstract

Retrieval of complex data is accelerated through the use of indexing structures. These structures organize the data so as to minimize comparisons between objects, aiming to prune blocks of data during the search. In metric spaces, comparison operations can be especially expensive, so the pruning ability of indexing methods becomes especially significant. This paper shows how to measure the pruning power of metric access methods (MAMs), and defines a new measurement, called “prunability,” which indicates how well a pruning technique carries out the task of reducing distance calculations at each level of the tree. It also presents a new dynamic access method, which aims to retrieve objects in a metric space while minimizing the number of distance calculations required to answer similarity queries. We show that this structure is up to 3 times faster and requires less than 25% of the distance calculations to answer range and nearest-neighbor queries as compared with the Slim-tree, one of the most efficient metric access methods. This gain in performance is achieved by taking advantage of a set of global representatives. Although it uses multiple representatives, the whole structure remains dynamic and balanced. Experimenting with existing metric access methods, we have discovered that the use of one node representative works well at the upper levels of the existing MAMs, but is almost useless at the lower levels. Our proposed method assures higher rates of pruning at all the levels of the tree.

1 - Introduction

The amount of information stored in databases has grown at a very fast pace over the last years. Data today is far more diversified and complex than it was in the past, when all the information was represented by text and numbers. The massive use of multimedia information, such as images, audio, video, time series and DNA sequences, has generated new challenges for researchers in the database area. Thus, a cornerstone for the database area is the development of fast and efficient access methods, which can cope with large amounts of complex data in these domains.

In multimedia systems, it is important to search the database for similar data. In image databases, for example, the user looks for images that are similar to a given one according to specific criteria. These data domains do not possess an order relationship; in other words, relations such as “less than” and “greater than” are meaningless, rendering the usual access methods useless. However, if a similarity function is defined, similarity queries make sense. Note that a metric distance function measures the dissimilarity between objects. When a distance function is non-negative and symmetric and satisfies the triangular inequality, it is said to be a metric function, the domain is said to be metric, and metric access methods can be used.

In metric domains, the similarity queries most frequently used are the following:
• k-nearest neighbor or k-NN query: kNN=<sq, k>, which asks for the k objects that are the closest to a given query object center sq.
• range query: Rq=<sq, rq>, which searches for all the objects whose distance to the query object center sq is less than or equal to the query radius rq.

Note that calculating distances between complex objects is usually an expensive operation. Hence, minimizing these calculations is one of the most important aspects of achieving an acceptable response time in a metric access method (MAM).
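As a point of reference, both query types can be answered by a linear scan that calls the distance function once per stored object; this is the cost that a MAM tries to avoid. The sketch below is only illustrative: the function names, the `euclidean` helper and the sample points are assumptions and do not come from the paper.

```python
import heapq
import math

def range_query(data, d, sq, rq):
    """Rq=<sq, rq>: every object whose distance to sq is at most rq."""
    return [s for s in data if d(s, sq) <= rq]

def knn_query(data, d, sq, k):
    """kNN=<sq, k>: the k objects closest to sq, as (distance, object) pairs."""
    pairs = ((d(s, sq), s) for s in data)
    return heapq.nsmallest(k, pairs, key=lambda pair: pair[0])

def euclidean(a, b):
    # Any metric (non-negative, symmetric, triangular inequality) would do.
    return math.dist(a, b)

# Hypothetical two-dimensional sample data.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
in_range = range_query(points, euclidean, (0.0, 0.0), 1.0)
nearest = knn_query(points, euclidean, (0.0, 0.0), 3)
```

Both functions evaluate the distance function for every object in the dataset, so their cost grows linearly with the dataset size regardless of how selective the query is.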
This paper presents a way to reduce the number of distance calculations needed to answer similarity queries and introduces a new and efficient MAM, the DF-tree. The main approach is the use of global representatives for the whole tree. These global representatives work together with the representatives of the nodes in order to decrease the number of distance calculations, which are usually expensive for complex (multimedia) data. Global representatives are used in a new way that keeps the tree dynamic. We will show that this approach leads to substantial improvements in answering similarity queries. Being a dynamic MAM is important because applications usually continue inserting and deleting objects after the creation of the index structure.

We present the results of experiments executed over three real datasets, comparing the DF-tree with state-of-the-art MAMs. The results show that the DF-tree performs 25% fewer distance calculations and answers similarity queries up to 3 times faster. The results of the experiments also demonstrate that this new MAM is scalable to very large datasets, presenting a linear behavior over varying dataset sizes.

We also define a new measurement, called “prunability,” to estimate how many distance calculations can be pruned at each level of a tree. We have discovered that the use of one node representative works well at the upper levels of the existing MAMs, but is almost useless at the lower levels. Using this measurement, we confirmed that our newly proposed MAM assures higher rates of pruning at every level of the tree.

In addition to presenting the DF-tree and the improved algorithms to accelerate answering similarity queries, this paper also presents three support algorithms. The first one decides when to select and use the global representatives, the second one automatically detects if the current set is no longer appropriate, and the last one quickly selects a new set of global representatives.
The remainder of this paper is organized as follows. The next section summarizes the background for this work, while section 3 explains why using more than one representative per node to build a metric tree may lead to static structures. Section 4 presents the proposed metric access method, the DF-tree, as well as its supporting algorithms. Section 5 presents the results of the experiments performed on the DF-tree and compares it to other recent MAMs. Section 6 discusses the conclusions of this paper.

2 - Background

One of the underpinnings of a database system is the index structure that supports it, so the design of efficient access methods has long been pursued in the database area. Because some complex data can be viewed as points in a multidimensional domain, several methods to index data in a multidimensional space have been proposed in the literature [1] [6] [10] [11]. An excellent survey on multidimensional access methods is presented in [9]. Unfortunately, these methods were developed only for vector data. Note that many kinds of complex data do not have any vectorial property, but their elements can be treated as occupying a metric domain.

Previous work on metric methods focused on the static case, where no further insertions, deletions and updates are handled; examples are the structures proposed by Burkhard and Keller [5] and Uhlmann [15], and also the vp-tree [16], GNAT [4] and mvp-tree [3]. To overcome this inconvenience, dynamic MAM structures including the M-tree [7], the Slim-tree [14] and the Omni-family [12] have been developed.

Multiple representatives help reduce the number of distance calculations performed when answering similarity queries. They were used to good effect in the mvp-tree [3] and in the Omni-family of index structures [12]. In contrast to the mvp-tree, the MAMs of the Omni-family are dynamic, allowing for insertions and deletions.
However, the set of representatives used by the Omni-family is chosen during the early construction phase of the tree and cannot be changed afterwards. The main idea was that, given a dataset S of objects in a metric domain, an object si ∈ S can be represented by a set of distances to a set of global representative objects, the foci. A good algorithm to choose these foci, the HF algorithm, is also presented in [12]. This algorithm finds objects near the border of the dataset to be used as foci, with linear cost on the size of the dataset. The number of representatives required to maximize the gain obtained from the reduced number of distance calculations is associated with the intrinsic dimensionality D of the dataset. As in [8], the intrinsic dimensionality is defined as the exponent in the power law

$nb(r) \propto r^{D}$

where nb(r) is the average number of neighbors within distance r or less from an arbitrary object, and ∝ stands for “is proportional to”. We will use this result and the HF algorithm in this paper.

3 - Motivation

Metric access methods based on trees store data in sets of fixed-size nodes, using reference objects in each node to represent the other objects stored in that node and in its sub-trees. Precalculated distances between stored objects and these representatives are kept to speed up insertions and deletions and to answer queries. This design takes advantage of the triangular inequality property of metric domains to prune many costly distance calculations between the query centers and the stored objects, as explained below.

In a metric tree, each node stores a subset of the objects. Let us first consider the objects sj ∈ S stored in a leaf node i. One of these objects, tagged as sRi, is chosen as the “representative” for the objects stored in the node i. The distances from every object in this node to the representative are calculated, and the largest one is defined as the covering radius rRi of that node.
In this way any object farther than rRi from sRi cannot be found in node i. In the upper levels the same construct is employed, but the objects stored in a non-leaf node are the representatives of their descendant nodes. Again, one of these objects is chosen as the representative of this node, and we will tag it as sRep. The distances from every object stored in that node to its representative are calculated and also stored in the node. However, the covering radius of a non-leaf node is the distance from the representative to the farthest object stored in that node plus the covering radius of the node where this object is the representative.

Figure 1 shows a DF-tree indexing 17 objects, considering nodes with a maximum capacity of three objects. Note that the root node does not have a representative and the complete set S is stored in the leaves. The global representatives are not depicted in this figure because no query has been asked yet. Any two objects far from each other and on the “border” of the dataset can be used as global representatives, such as the objects K and P in this figure. Like the other illustrative figures presented in this paper, figure 1(a) assumes a dataset in a two-dimensional space with the Euclidean distance function, although the same concepts apply to any metric domain.

Figure 1 - A DF-tree with node capacity=3, indexing 17 objects.

Now, consider a range query asking for objects at a distance of up to rq from the query center sq, as shown in figure 2. The query defines three regions. Region 1 contains the objects that are farther from sRep than its distance to the query center sq plus the query radius rq. Region 3 contains the objects that are closer to sRep than its distance to the query center sq minus the query radius rq. Region 2 contains the objects that satisfy neither of these conditions. Objects in Regions 1 and 3 cannot be in the response set. Thus, nodes that are completely in Regions 1 or 3 can be pruned from the search process.
In this way, by the triangular inequality, every node i with representative sRi can be pruned if it satisfies one of the following conditions:

a) d(sRep, sq) + rq < d(sRep, sRi) - rRi    (Region 1)    (Eq. 1)
b) d(sRep, sq) - rq > d(sRep, sRi) + rRi    (Region 3)    (Eq. 2)

The same pruning concept can be applied at the leaf nodes, where the value of rRi is zero. Thus, the triangular inequality enables pruning both the traversal of subtrees in non-leaf nodes and the distance calculations from the query object to objects in the leaf nodes.

Figure 2 - Only objects in region 2 can be in the response set of the range query centered at sq with radius rq.

This mechanism leads to dynamic structures, because a new object can be inserted in any of the nodes that are able to “cover” it. If no node covers it, the covering radius of an existing node is enlarged. If the node capacity is exceeded, the node splits and a new pair of representatives is chosen. This changes one object in the upper level node and promotes one new representative to this node, which will eventually require a split too, in an iterative process. When the root node splits, the tree grows one level. Thus, the tree is constructed bottom-up, as in B-trees. Detailed algorithms to perform those operations can be found in [14] and [7].

Pruning can be enhanced by using more than one representative (a reference object). In fact, the “amount of information” embodied in a simple relation between three objects (the query center, the target object sj and the node representative) provided by the triangular inequality property is rather low, particularly in pure metric or in high-dimensional spaces [2]. The combined use of two or more references eliminates larger portions of the dataset than using each reference object individually. Figure 3 shows how much the region holding the candidate objects for the query shrinks when a global representative is added.
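The two pruning conditions, and the gain from testing more than one reference, can be illustrated with a small sketch. Only the two inequalities come from Eq. 1 and Eq. 2; the function names and the sample coordinates are hypothetical.

```python
import math

def can_prune(d_ref_q, rq, d_ref_obj, r_obj=0.0):
    """Test Eq. 1 and Eq. 2 for one reference object.

    d_ref_q   : d(sRep, sq), distance from the reference to the query center
    rq        : query radius
    d_ref_obj : d(sRep, sRi), stored distance from the reference to the entry
    r_obj     : covering radius rRi of the entry (zero for leaf objects)
    """
    in_region_1 = d_ref_q + rq < d_ref_obj - r_obj  # Eq. 1
    in_region_3 = d_ref_q - rq > d_ref_obj + r_obj  # Eq. 2
    return in_region_1 or in_region_3

def can_prune_any(ref_distances, rq, r_obj=0.0):
    """An entry is pruned as soon as one reference satisfies Eq. 1 or Eq. 2.

    ref_distances: iterable of (d(ref, sq), d(ref, entry)) pairs.
    """
    return any(can_prune(dq, rq, do, r_obj) for dq, do in ref_distances)

# Hypothetical Euclidean example: the node representative alone cannot
# discard the object, but a second reference can.
rep, extra_ref = (0.0, 0.0), (10.0, 0.0)
sq, rq, obj = (5.0, 0.0), 1.0, (0.0, 5.0)
by_rep = can_prune(math.dist(rep, sq), rq, math.dist(rep, obj))
by_both = can_prune_any(
    [(math.dist(rep, sq), math.dist(rep, obj)),
     (math.dist(extra_ref, sq), math.dist(extra_ref, obj))],
    rq,
)
```

In this example the object and the query center are equally far from the node representative, so the object falls in Region 2 for that reference, while the second reference places it in Region 1. The distance d(sq, obj) is never computed; only the distances from sq to the references are needed at query time, assuming the distances from each reference to the stored object were precalculated.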
If only the node representative is used for pruning, the shadowed region in figure 3(a) is where the answer objects for a range query with center object sq and radius rq could be found. In figure 3(b) one global representative is also used, and the darker shadowed region indicates where the answer objects could be for the same range query as in (a). Note that the resulting region is much smaller.

Figure 3 - Regions that cannot be pruned: (a) using only the node representative; (b) using also a global representative.

It is helpful to understand why multiple reference objects lead to static structures, so that we can overcome this undesired effect while retaining the strong pruning capacity achieved by a combination of many representatives. In fact, if the set of reference objects at a given node dictates where other objects should be stored at lower levels (i.e., in which of its descending subtrees), the structure becomes static, because whenever a reference is changed, the objects that were stored in a given subtree would have to be moved to another subtree. Hence, in order to be dynamic, a MAM must either allow for more than one place to store each object or choose the representatives in a bottom-up approach, changing the reference objects based on the subtrees rather than on the upper levels of the tree. Both approaches, however, have their drawbacks. Allowing for more than one place to store each object involves many attempts until the objects required in a query are successfully retrieved. Choosing the representatives in a bottom-up approach, on the other hand, inhibits the combined use of the set of representatives of the nodes whose path leads to that node.

The Slim-tree and the M-tree choose a compromise between these two approaches. First, the representative at each node restricts, rather than determines, where a given object can be stored. Second, the representative of a node is selected from among the representatives of its children nodes, and the covering radius of the node is calculated based on the covering radii of those children. However, this compromise means that only one representative can be used at each node. The Slim-tree improved on the M-tree by tackling the first compromise, as minimizing the intersection of nodes reduces the number of deep searches needed to find an object. The DF-tree further improves on the Slim-tree by allowing more representatives to be used to prune distances, tackling also the second compromise. Although we do not use the representatives of upper levels, nothing hinders the use of more than one object to prune subtrees, provided that only one is used to guide the creation of the tree. That is what we do through the metric access method DF-tree presented in the next section.

4 - The new DF-tree structure

We propose here the use of Global Representatives (GR) together with the fundamental concepts of a metric tree, aiming to reduce the number of distance calculations required to answer similarity queries. This paper demonstrates that the proper use of GRs allows for fewer distance calculations and thus a shorter time to answer similarity queries. The following definition is needed to explain this new structure.

Definition 1 - Global Representative Set (G): Let M=<𝔻, d> be a metric space, where 𝔻 is the domain of features, which are the indexing keys, and d( ) is a metric distance function. Given a dataset S ⊆ 𝔻 with N objects, a Global Representative Subset of S (or a GR set) is a set G={g1, g2, ..., gp | gk ∈ S, gk ≠ gj, p ≤ N}, where each gj is chosen to be a Global Representative, and p is the number of global representatives contained in G.

We call the single representative of each node of the tree the node representative in order to distinguish it from the global representatives for the whole tree.
Each global representative is independent from the others, including the node representatives, and is applied to every object si inserted in the structure, so we say that each global representative defines a distance field (DF) over the domain 𝔻. The distance field is represented by the distance from each object to the corresponding global representative, stored with that object. The main symbols used in this paper are shown in Table 1.

Symbol    Definition
𝔻         domain of objects
S         set of objects in domain 𝔻
D         intrinsic dimensionality of dataset S
sq        a query object (or query center)
rq        radius of a range query
k         number of neighbors in an NN query
h         level of the tree
G         set of global representatives
gj        a global representative of G
p         cardinality of G
d( )      distance function
mdj       maximum distance from the global representative gj to all the other global representatives
Ph(Q)     prunability at level h to answer a set Q of queries

Table 1 - Summary of symbols and definitions.

The general idea of the DF-tree is to build the structure combining a metric tree that uses one representative per node (e.g. the Slim-tree) with the distance fields generated by each element of G. The distance fields do not interfere with the creation of the tree structure, but can be used to prune distance calculations when answering queries. The selection of objects to act as global representatives and the calculation of their corresponding distances to each stored object can be postponed to any time before the first query is answered. For example, the distances can be calculated either when each object is being inserted or just after a bulk loading operation is completed.

4.1 - The “prunability” property

The purpose of using global representatives is to increase the pruning of distance calculations.
The number of distance calculations that can be pruned depends on the relative sizes of the areas defined by each representative (node or global), and the regions defined by the query center and the representative radii (recall figure 2). These sizes vary at different levels of a given tree, leading to methods that prune better at some levels than at others. Therefore, to gauge how efficiently a MAM can prune distance calculations based on a set of reference objects active in each node, we have defined the following measurement.

Definition 2 - “Prunability”: Given a large set Q of similarity queries over a tree, the prunability Ph(Q) is the average of the ratio Nub(qi)/Ntb(qi) applied to each node b accessed at a given level h to answer each query qi ∈ Q, where:

Ntb(qi) - is the total number of objects in node b accessed to answer the query qi not pruned by the references active in this node. Each access to a node imposes a distance calculation.

Nub(qi) - is the number of objects in the nodes accessed to answer the query qi that actually qualify, i.e., for which the distance calculation is unavoidable. We consider that a distance calculation is unavoidable in a given node at level h if at least one object in the children nodes at level h+1 requires further processing.

Figure 4(a) exemplifies the prunability when using just the node representative to answer the query depicted (prunability=2/20=0.1), and figure 4(b) shows the prunability when also using one global representative to answer a range query centered at sq (prunability=2/4=0.5). It can be seen that the prunability increased five times.

4.2 - Structure of the DF-tree

The structure of the proposed DF-tree has two components: the fields which are used to build the tree and the distance fields. The tree component has the same structure as the Slim-tree, and is maintained with the same algorithms.
The distance fields have valid values only after the GR set G has been selected, which occurs when the first query is asked. Each tree node of the DF-tree structure contains the number c of objects stored in this node, and an array of c sub-structures. Leaf and non-leaf nodes have slightly different formats.

The leaf-node substructure has the following format: <si, Oidi, DFi, d(si, sRi)>, where si is the indexed description of the object, Oidi is the identifier of object si, and d(si, sRi) is the distance between object si and the representative object sRi of this leaf node.

The index-node substructure has the following format: <si, Ri, d(si, sRi), DFi, Ptr(Tsi), NEntries(Ptr(Tsi))>, where si holds the object that is the representative of the sub-tree indicated by Ptr(Tsi), and Ri is the covering radius of that region. The distance between si and the representative of this node sRi is given by d(si, sRi). The pointer Ptr(Tsi) points to the root node of the subtree T rooted by si. The current number of entries in the node indicated by Ptr(Tsi) is stored in NEntries(Ptr(Tsi)).

Both leaf and index nodes store the distance field component in DFi, an array of the distances of object si to each global representative gj ∈ G. This component is used only to answer queries, and not in the update operations of the tree. Section 4.3 explains how the distance fields are used to answer queries, and section 4.4 explains how and when to calculate them, how many representatives constitute the distance fields, and how to determine if they need to be recalculated after many updates in the tree have taken place.

The distance field component could be stored in a separate file. We chose to store it together with the tree component to reduce disk accesses. Storing all the data together reduces the number of objects that fit in nodes of a given size, which, in turn, reduces the fan-out of the tree.
We have assumed that this side effect is negligible when the proposed structure is used with large objects, which is the most common situation.

Figure 4 - Exemplifying how the prunability works.

4.3 - Algorithms for range and nearest-neighbor queries using global representatives

Here we present the algorithms to execute both range and k-nearest neighbor queries using global representatives on the DF-tree. Let us first consider range queries, which are represented as Rq=<sq, rq>. These queries begin at the root node of the tree. When this node is read, its representative is set as sRep, and the distance from sq to sRep is calculated. In this way the three regions shown in figure 2 are generated for this representative, and the conditions expressed by Equations 1 and 2 are evaluated. If neither condition holds, one of the global representatives gj is set as sRep and both equations are re-evaluated. If neither equation holds, the next global representative is used, and so on, until every representative has been used. If one equation holds for any representative, the corresponding node can be pruned from the search. Nodes that cannot be pruned must be read and processed recursively. This algorithm is shown in figure 5.

(a) Process a subtree to answer a range query
Input: a DF-tree node i and a range query Rq=<sq, rq>
Output: the query response set.
Begin
1 - Calculate the distance from the query object sq to the representative of this node.
2 - For each object sj in node i:
3 -     Set sRep as the representative of this node
4 -     If Eq. (1) or Eq. (2) holds, get the next object. // object was pruned
5 -     else check if object sj can be pruned by the distance field (see Fig. 4b). If it can, get the next object. // object was pruned
6 -     If this is a leaf node, put object sj in the response set
7 -     else process the subtree rooted at this object.
End

(b) Check if an object can be pruned by the distance field
Input: the object sj to be checked, its covering radius rRi if it is in a non-leaf node or zero otherwise, and its distances to the global representatives
Output: true if it can be pruned, false otherwise.
Begin
1 - For each global representative gj ∈ G
2 -     Set sRep as gj
3 -     If Eq. (1) or Eq. (2) holds, return true, else get next global representative
4 - return false
End

Figure 5 - The range query algorithm for the DF-tree.

The k-nearest neighbor queries kNN=<sq, k> use a priority queue Pr of size k, where the candidate objects sj are kept sorted by their distance to the query center sq. The distance of the farthest object currently in Pr is set as the current query radius rc. The algorithm starts by reading the root node, with Pr empty, and until there are k objects in the queue, rc is set to infinity. A new object is inserted only if it is closer to sq than rc. Inserting a new object in a full Pr replaces the farthest one, and the next farthest object defines the new value of rc. Whenever a node is read, every object in it is stored in another priority queue Pw of unlimited size, which holds the unprocessed objects, also kept sorted by their distance to the query center sq. Objects in Pw are processed by an algorithm similar to the one used for range queries, using rc as the current query radius. The algorithm terminates when Pw becomes empty. Figure 6 depicts this algorithm.

Process a subtree to answer a k-nearest neighbor query
Input: a DF-tree node i, a query kNN=<sq, k>, the response queue Pr and the waiting queue Pw
Output: the query response set.
Begin
1 - If node i is a non-leaf node:
2 -     For each object sj in node i:
3 -         Set sRep as the representative of this node.
4 -         If Eq. (1) or Eq. (2) holds, get the next object. // object was pruned
5 -         else if object sj can be pruned by the distance field (see Fig 4b), get the next object // object was pruned
6 -         else insert sj in Pw with its distance from sq.
7 -     While Pw is not empty, get sj and rj from Pw:
8 -         Set sRep as sj
9 -         If Eq. (1) or Eq. (2) holds, get the next object. // object was pruned
10 -        else if object sj can be pruned by the distance field (see Fig 4b), get the next object // object was pruned
11 -        else process the subtree rooted at object sj.
12 - else // node is a leaf node
13 -     For each object sj in node i:
14 -         Set sRep as the representative of this node.
15 -         If Eq. (1) or Eq. (2) holds, get the next object. // object was pruned
16 -         else if object sj can be pruned by the distance field (see Fig 4b), get the next object // object was pruned
17 -         else insert sj in Pr with its distance from sq, and get the new rc.
End

Figure 6 - The k-nearest neighbors query algorithm for the DF-tree.

4.4 - Choosing the global representative set G

As discussed in the previous section, each representative (either the node or a global representative) defines a ring in the dataset domain where objects cannot be pruned using this reference. When more than one representative is used, only the objects in the intersection of the corresponding rings cannot be pruned. The node representative is defined by the tree construction algorithm, so it cannot be changed. Hence, the flexibility in the choice of references lies in the number and placement of the global representatives (GRs).

Determining the number of global representatives

It was shown in [12] that the maximum number p of references worth considering is one plus the intrinsic dimensionality ⌈D⌉ of the dataset. However, the way in which the DF-tree uses the GRs involves different requirements, so we have re-analyzed this number. As each node already has a representative, the number of global representatives can be only ⌈D⌉.
If the intersecting region of the rings provided by two GRs lies closer to those GRs than the distance between them, this region tends to be smaller than the intersection generated by rings far from the GRs. Therefore, among other requirements, the GR set must contain objects far from each other and close to the “border” of the dataset. In this way, almost every distance from the objects in the dataset to these GRs will be less than the distance between the GRs. Figure 7 illustrates this idea. Figure 7(a) shows a query occurring “between” the GRs, whereas the queries shown in figures 7(b) and 7(c) occur “outside” the GRs. The intersections of the rings in figure 7(a) are clearly smaller than those in figures 7(b) and 7(c).

Figure 7 - Where the GR set G is effective. (a) The most effective: the range query is circumscribed by G={A, B, C}. (b) The intersection provided by G to solve a query centered on the object sq2, which is not circumscribed by G but is close to at least one GR. (c) The least effective: the intersection provided by G to solve a query centered on the object sq3, which is also not circumscribed by G and is far from all the GRs.

It should be noted that increasing the number of GRs reduces the fan-out of the tree, which, in turn, increases the number of nodes and the height of the tree. Although the triangular inequality allows for pruning of distance calculations to objects in a previously read node, each disk access involves at least one additional distance calculation from the query center to the node representative. Therefore, increasing the number of nodes in a tree also increases the need for distance calculations, leading to a tradeoff between the increased prunability provided by a greater number of representatives and the resulting decreased fan-out. This limits the useful number of global representatives to ⌈D⌉.

When to choose the first GR set G

When the tree is empty, it is impossible to choose a GR set.
As the objects are inserted, we have to decide at which point to choose the GR set. The trade-off is clear: if we start too early, we may be 'stuck' with a bad GR set. If we start too late, we may not enjoy the speedups that the GR set provides. The DF-tree requires the GR only when a query is issued. If the query is posted on a tree that has only a few objects, the query can be answered without using this resource. Therefore, we have determined that the first GR set is calculated when the first query is answered, after the DF-tree already has at least 2 levels (the root level and another one). This ensures the presence of a reasonable number of objects when the global representatives are chosen. Deciding when to update the GR set: the WU algorithm An important consideration is the maintenance of a proper set of global representatives. The idea is that the GRs are chosen when the tree is first created. However, after the tree has been created and many objects were inserted and deleted, a possible change in data distribution may render the original GR set * inadequate. Therefore, in addition to having a fast algorithm to select the GR set in the current dataset, an algorithm is required to determine when the current * is no longer worthwhile. An ideal algorithm must preclude new distance calculations. Considering that the distance between each newly inserted object and each GR must be calculated, the algorithm should rely only on those distances. Figure 8 enable to visualize how this algorithm works through a two-dimensional space using a Euclidean distance function. This figure presents the regions delimited by the paired distances between the three GRs A, B and C. The shadowed area represents the region where the Figure 8 - The shadowed GRs prune the objects more r e g i o n s h o w s t h e effectively (see figure 7). The circumscribed area posed by distance between each object the global representative A, B, C. 
inside the shadowed area to each of the three GRs is less than the maximum distance between the GRs. This can be expressed by the following definition.

Definition 3 - Circumscribed objects: Let $md_j = \max \{ d(g_j, g_k) \mid g_k \neq g_j,\ \forall g_j, g_k \in G \}$ be the maximum distance from each global representative $g_j$ to all the other global representatives $g_k$. We say that an object $s_i$ that satisfies $d(s_i, g_j) \leq md_j,\ \forall g_j \in G$, is circumscribed by the GR set G.

The set G has a strong ability to prune its circumscribed objects because the intersection of the corresponding rings generated by queries in this region tends to be minimized. This prunability gradually decreases as the distance of an object outside this region increases. This point is illustrated in figures 7(b) and 7(c), which show the same space and the same global representative set G as those of figure 8. The objects in the shadowed area of figure 8 are those circumscribed by G = {A, B, C}. Figure 7(b) shows the intersection provided by the GRs A, B and C to solve a query centered on the object sq2, which is not circumscribed by them but is close to at least one GR. Figure 7(c) shows the intersection provided by G to solve a query centered on the object sq3, which is also uncircumscribed and is far away from every GR. As can be seen, the intersection of the rings in figure 7(c) is much larger than the one in figure 7(b).

It can be argued that this is true only for uniformly distributed datasets. However, in practical situations, the density of queries in a given region approximately follows the density of the dataset in that region. Hence, the overall result is independent of the density of the dataset over the whole space. The desired algorithm can be built based on this characteristic, and is called the WU (when to update) algorithm.

Note that, because the cardinality p of G is at least 2, the md set can always be calculated. The value $md_j$ is associated with the global representative $g_j$, and can easily be calculated when G is created.
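As an illustration of Definition 3, the following Python sketch computes the md set and tests whether an object is circumscribed. The function names, the toy coordinates and the use of the Euclidean distance are assumptions made for this example only; the definition itself applies to any metric.

```python
import math

def md_set(grs, dist):
    """For each GR g_j, return the largest distance to any other GR."""
    return [max(dist(gj, gk) for k, gk in enumerate(grs) if k != j)
            for j, gj in enumerate(grs)]

def is_circumscribed(dists_to_grs, md):
    """True when d(s_i, g_j) <= md_j holds for every GR g_j."""
    return all(d <= m for d, m in zip(dists_to_grs, md))

# Hypothetical toy data: three GRs in the plane, Euclidean distance.
grs = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
md = md_set(grs, math.dist)

inside = (1.0, 1.0)      # lies between the GRs
outside = (10.0, 10.0)   # far from every GR
inside_ok = is_circumscribed([math.dist(inside, g) for g in grs], md)
outside_ok = is_circumscribed([math.dist(outside, g) for g in grs], md)
```

Only the distances from the object to the GRs are needed for the test, which is why no extra distance calculation is incurred once those distances are known.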
Since the distance from each object to the global representatives is calculated anyway, no extra distance calculation is required to determine whether it is circumscribed by the current G. This allows for the development of an inexpensive algorithm to determine when the current G is no longer an efficient pruner, which is done by checking every newly inserted object and verifying how many of them are uncircumscribed by the current G. The algorithm to change G is triggered whenever this number exceeds a given limit.

Objects that are uncircumscribed by the current G and are far away from the circumscribed region have a stronger negative impact on prunability than do those that are closer to it (as shown in figures 7(b) and 7(c)). Therefore, we propose to count each uncircumscribed object $s_i$ by a weighted value, which depends on the distance from the object to each GR $g_j$, calculated as

$$w = \sqrt[p]{\prod_{j=1}^{p} \frac{d(s_i, g_j)}{md_j}}$$

As objects close to GRs lead to rings with smaller intersections (there are fewer false alarms) than do objects far away from all GRs, this weighted value is a better choice than a plain count. This weighted counting of new objects inserted in the tree is restarted whenever a new GR set is calculated, so it is associated with the tree. The trigger limit (threshold) is left as a tuning parameter of the DF-tree and can be experimentally determined for each data domain. We found empirically that the number of objects already indexed when the current G was chosen is a good starting point for the threshold. Note that this value is independent of the absolute scale of the dataset.

Figure 9(a) presents the algorithm to calculate the md set, while figure 9(b) shows the WU algorithm to detect when the current G no longer effectively prunes sub-trees.
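The weighted counting described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the name wu_update and its calling convention are hypothetical, and the early return for circumscribed objects is an assumption drawn from the prose, which says that only uncircumscribed objects are counted.

```python
def wu_update(dists_to_grs, md, cnc, threshold):
    """One WU step for a newly inserted object.

    dists_to_grs[j] is d(s_i, g_j), already known from the insertion;
    md[j] is md_j; cnc is the accumulated weighted count.
    Returns (new_cnc, needs_update).
    """
    # Assumption: circumscribed objects do not add to the count.
    if all(d <= m for d, m in zip(dists_to_grs, md)):
        return cnc, cnc > threshold
    w = 1.0
    for d, m in zip(dists_to_grs, md):
        w *= d / m
    # p-th root of the product, i.e. the geometric mean of the ratios.
    cnc += w ** (1.0 / len(md))
    return cnc, cnc > threshold

md = [4.0, 5.0, 5.0]
# Circumscribed object: the count is left unchanged.
cnc_a, flag_a = wu_update([1.0, 1.0, 1.0], md, 0.0, 1.5)
# Object twice as far from each GR as md_j: adds a weight of about 2.
cnc_b, flag_b = wu_update([8.0, 10.0, 10.0], md, 0.0, 1.5)
```

Because the weight is a geometric mean of ratios, scaling every distance in the dataset by a constant leaves it unchanged, which matches the remark that the threshold is independent of the absolute scale of the dataset.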
(a) Calculate the md set (executed when a new GR set G is chosen)
input: the set of distances d(g_j, g_k), for all g_j, g_k in the new G
output: the set md
Begin
1 - For each global representative g_j:
2 -     Set md_j = 0
3 -     For each global representative g_k distinct from g_j:
4 -         If d(g_j, g_k) > md_j then set md_j = d(g_j, g_k)
End

(b) WU Algorithm: detect when to update G (executed when an object is inserted in the tree)
input: the set md with cardinality p, the set of distances d(g_j, s_i) from the object s_i to each GR g_j, the threshold th, and the accumulated number of new uncircumscribed objects (cnc). The value cnc is reset to zero whenever a new G is calculated.
output: the updated number cnc. Returns true if G requires changing, false otherwise.
Begin
1 - Set w = 1
2 - For each global representative g_j:
3 -     Set w = w * d(g_j, s_i) / md_j
4 - Set cnc = cnc + w^(1/p)
5 - If cnc > th return true; otherwise return false
End

Figure 9 - (a) Algorithm to calculate the md set. (b) The WU algorithm.

How to choose a new GR set G: the HU algorithm

Here we describe an algorithm to update the set G, called the HU (how to update) algorithm. Let us assume that a new G with cardinality p has to be chosen from a set of objects. As the algorithm computes p+1 distances for each object in the candidate set, it is useful to reduce this set to the minimum possible. The set G is composed of objects close to the “border” of the dataset. Therefore, objects already circumscribed by the current G are not candidates. Thus, the candidate set includes the objects $s_i$ in the dataset whose distance to at least one global representative $g_j$ is greater than $md_j$, i.e., $d(s_i, g_j) > md_j$. If the uncircumscribed objects are concentrated in a specific region and are unevenly distributed around the current G, then the new G may leave uncircumscribed objects that are currently circumscribed. To overcome this problem, the current G is also included in the candidate set.
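The candidate selection just described can be sketched in Python as follows. The sketch covers only the construction of the candidate set; the subsequent choice of border objects is deliberately left out. The function name and the (object, distances) layout are hypothetical and assume that the distances from each stored object to the current GRs are available at the leaves.

```python
def hu_candidates(stored, md, current_grs):
    """Build the candidate set for choosing a new GR set.

    stored: iterable of (obj, dists) pairs, where dists[j] is the
            distance d(s_i, g_j) to the j-th current GR.
    md: list with md_j for each current GR.
    current_grs: the current GR set, always kept as candidates.
    """
    # Keep only objects that exceed md_j for at least one GR.
    candidates = [obj for obj, dists in stored
                  if any(d > m for d, m in zip(dists, md))]
    # Add the current GRs so circumscribed regions stay represented.
    candidates.extend(g for g in current_grs if g not in candidates)
    return candidates

md = [4.0, 5.0, 5.0]
stored = [
    ("a", [1.0, 1.0, 1.0]),  # circumscribed: not a candidate
    ("b", [6.0, 2.0, 2.0]),  # exceeds md_0: candidate
    ("c", [3.0, 5.0, 5.0]),  # on the limit: still circumscribed
]
cands = hu_candidates(stored, md, ["A", "B", "C"])
```

Filtering on stored distances means that the selection itself requires no new distance calculations; these are only spent later, on the reduced candidate set.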
The candidate set is then processed by the usual HF algorithm to identify the representatives close to the border of this set. The process is faster when only the uncircumscribed objects plus the current G are used as the candidate set instead of the full set of stored objects, which would be the normal procedure. In fact, our experiments have shown that the same objects are often selected whether the whole dataset or only this restricted set is used. The HU algorithm is described in figure 10.

HU Algorithm: update G (executed when the WU algorithm triggers)
input: the current G, its cardinality p, and the DF-tree
output: the new G
Begin
1 - Set the “candidate set” to empty
2 - Traverse the DF-tree:
3 -     For each object s_i stored at a leaf node:
4 -         If for any j, 1 ≤ j ≤ p, d(s_i, g_j) > md_j, then include s_i in the candidate set
5 - Include G in the candidate set
6 - Execute the HF algorithm on the candidate set
End

Figure 10 - The HU algorithm to update G.

4.5 - Properties of the DF-tree

In addition to optimizing query answering performance, the global representatives have interesting characteristics that lead to the following DF-tree properties:

Property 1: The costs involved in choosing and changing the global representatives G are low.

Property 2: The GR set G can either be used or not. It should be pointed out that, regardless of the existence of G, a query can be answered without using it. This is an important feature in a concurrent environment, because it allows the system to continue answering queries even if a change of G is in progress.

Property 3: Any number of the existing global representatives can be used. The greater the number of reference points used, up to the limit given by ⌈D⌉, the higher the structure’s prunability. Moreover, in combination with the previous property, G can be updated by incrementally changing the distances from the objects to each GR one at a time, thereby taking advantage of the idle time of the database manager.
Property 4: G can be replaced without affecting the tree structure, because the global representatives are not used to construct the tree. That is why the DF-tree remains a dynamic structure even when using global representatives.

5 - Experimental results

This section discusses the results of the experiments performed to evaluate the effectiveness of the DF-tree. The DF-tree was implemented in C++, and the experiments were run on a Pentium III 800 MHz with 256 Mbytes of RAM. The I/O device was an E-IDE Ultra ATA/66 7200 RPM hard disk. Although work was done on a variety of both real and synthetic datasets, for reasons of space we show in this paper only the measurements obtained with the following three real datasets.

EnglishWords - A set of 25,143 words from the English dictionary, using the Ledit distance function. As its intrinsic dimensionality is D=4.75 [13], the cardinality of G used for this dataset is 5.

Faces - A set of 11,900 faces from the Informedia project at Carnegie Mellon University. The faces are described by 16-dimensional vectors, using the Euclidean distance function. With D=5.27, the cardinality of G is 6.

PortugueseWords - A set of 214,717 words from the Portuguese dictionary. The distance function used is Ledit, D=6.69, and the cardinality of G used for this dataset is 7.

The number of distance calculations and the number of disk accesses shown in this section represent the average of the values obtained from sets of 500 queries. For example, distance plots of range queries represent the average number of distance calculations for 500 queries at each query radius or database size shown in the plot. Time measurements show the total time (in seconds) spent to answer each set of 500 queries. The selection of each query set is biased, i.e., objects sampled from the full dataset were used as query centers. The node size of every structure is 4 kbytes.
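The Ledit function used for the two word datasets is not defined in this section. Assuming it denotes the usual Levenshtein edit distance (the minimum number of single-character insertions, deletions and substitutions needed to turn one word into the other), a minimal Python sketch is given below; the function name ledit is chosen here for illustration only.

```python
def ledit(a, b):
    """Levenshtein edit distance between strings a and b (assumed Ledit)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

d1 = ledit("word", "ward")
d2 = ledit("kitten", "sitting")
```

Under this assumption, a range query with radius 1 on a word dataset returns the words that differ from the query word by at most one such edit.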
5.1 - A comparison between the DF-tree, the Slim-tree, the M-tree and the Omni-sequential

This section describes the results obtained by comparing the DF-tree with the Slim-tree, the M-tree, the Omni-sequential and the sequential scan. Figure 11 compares the performance of the four MAMs on the EnglishWords dataset when answering 500 range queries with a radius equal to one, as the stored dataset grows in increments of ten percent of the full dataset. This figure shows the plots of (a) the average number of distance calculations, (b) the average number of disk accesses, and (c) the total time. Figure 11(c) does not include a plot for the M-tree because it is implemented on another framework with much slower timing, which would render a comparison of time unfair to the M-tree.

As can be seen, all the MAMs display a linear behavior; however, the slope of the number of distance calculations in the DF-tree plot is significantly less steep. In fact, the DF-tree performs significantly fewer distance calculations than the others. The numbers of disk accesses of the DF-tree and the Omni-sequential are equivalent, and both perform better here than the M-tree, whereas, from this standpoint, the Slim-tree excels, confirming its original purpose. However, considering the overall time, which ultimately summarizes disk accesses, distance calculations and internal logic, the DF-tree is clearly the winner, performing almost twice as fast as the Omni-sequential, which takes second place. We performed the same experiments for radii 2, 4 and 5 and got equivalent results. It must be noted that the most common queries usually have small radii.

Figure 11 - Comparing the performance to answer 500 range queries on the EnglishWords dataset indexed on the DF-tree, the Slim-tree, the Omni-sequential, the M-tree and Sequential Scan, for increasing dataset sizes.
(a) Average number of distance calculations per query. (b) Average number of disk accesses per query. (c) Total time (500 queries) in seconds.

5.2 - Comparing the DF-tree and the Slim-tree

This section discusses the experiments involving comparisons between the DF-tree, the Slim-tree, and the sequential scan. It will be shown that the DF-tree requires fewer distance calculations to answer both range and k-NN queries.

Figure 12 - Distance calculations and total time comparing Sequential Scanning and Slim-tree with DF-tree for the following datasets. EnglishWords: (a) Average number of distance calculations for each range query. (b) Average number of distance calculations for each k-NN query. (d) Total time for 500 range queries for varying range radii. (e) Total time for 500 k-NN queries varying the number of neighbors. Faces: (c) Average number of distance calculations for each range query. (f) Total time for 500 range queries for varying range radii. The averages are for 500 queries.

Figures 12 and 13 show measurements comparing the average number of distance calculations and the total time required by both the Slim-tree and the DF-tree to answer the same set of range and k-NN queries over the full dataset. Figure 12(a) shows the average number of distance calculations for range queries on the EnglishWords dataset, while figure 12(d) shows the total time for these experiments at each given range radius. Figure 12(b) shows the average number of distance calculations for k-NN queries on the EnglishWords dataset, and figure 12(e) shows the total time for these experiments at each given number of neighbors. Figure 12(c) shows the average number of distance calculations for range queries on the Faces dataset, while figure 12(f) shows the total time for these experiments at each given range radius. These graphs show the meaningful improvement achieved by the DF-tree in such situations.
The results obtained are corroborated by the measurements of the prunability of these MAMs, as shown in Section 5.3. Figure 13(a) presents the average number of distance calculations for k-NN queries on the full Faces dataset for the Slim-tree and the DF-tree, emphasizing that both trees present a sub-linear behavior; in figure 13(d) the sequential scan results are also given. Figure 13(c) depicts the total time for these experiments for different numbers of neighbors for both trees, and the same sub-linear behavior is observed. Figure 13(f) includes the results for the sequential scan. Figures 13(b) and 13(e) show the average number of disk accesses for k-NN queries on the Faces dataset. As predicted, the average number of disk accesses required by the DF-tree is greater than that required by the Slim-tree. However, the gain in the number of distance calculations offsets this increase, and the total time shown in figures 13(c) and 13(f) confirms the superior overall performance of the DF-tree.

Figure 13 - Comparing the Slim-tree, the DF-tree and the sequential scan, for 500 k-NN queries on the Faces dataset, for varying numbers of neighbors. (a), (d) average number of distance calculations; (b), (e) average number of disk accesses; (c), (f) total time (in seconds).

Figures 12 and 13 show the impressive drop in the average number of distance calculations and in the total time required to answer similarity queries, confirming that the DF-tree outperforms the Slim-tree: it requires less than 25% of the distance calculations required by the Slim-tree. The DF-tree also processes range queries with small radii up to 3 times faster than the Slim-tree does. The same behavior is obtained for k-NN queries asking for few neighbors, that is, queries over small portions of the stored dataset. Such queries are usual in practical situations.
Moreover, the graphs show that the sub-linear behavior of the DF-tree with increasing radii or numbers of neighbors is maintained.

5.3 - Prunability

Table 2 shows the prunability of the DF-tree compared to that of the Slim-tree. Each measurement represents the average pruning obtained by the application of 500 range queries with a radius equal to 1% of the dataset diameter for the Faces dataset and one letter of distance for the EnglishWords dataset. Both radii are examples of small values, which are the most common in real queries. We tested other small values and the results were similar, so they are not presented here. The measurements also included 500 k-NN queries asking for 5 neighbors. It must be noted that the DF-tree requires one level more than the Slim-tree to store the 25,143 objects of the EnglishWords dataset. From this table one can see that, using only the node representative, the Slim-tree does almost no pruning at the leaf level (level 2). Nonetheless, the DF-tree maintains considerable prunability.

            Range queries                             k-Nearest neighbors queries
            EnglishWords        Faces                 EnglishWords        Faces
level       Slim-tree DF-tree   Slim-tree DF-tree     Slim-tree DF-tree   Slim-tree DF-tree
0 (root)    1.00      1.00      0.73      0.99        1.00      1.00      0.75      0.82
1           0.71      0.94      0.33      0.76        0.89      0.98      0.35      0.76
2           0.004     0.56      0.005     0.51        0.001     0.80      0.02      0.31
3           -         0.18      -         -           -         0.07      -         -

Table 2 - Prunability of the Slim-tree compared with prunability of the DF-tree.

Consider, for example, the prunability at the leaf level of the tree built over the Faces dataset when range queries are executed. In this case, for each set of 200 distance calculations that the single node representative of the Slim-tree cannot prune (prunability=0.005, i.e., 1 in 200), an average of only one will find an object that is really part of the answer. In contrast, for every two distance calculations that the node representative and the global representatives of the DF-tree cannot prune (prunability=0.56), on average more than one finds an object that is part of the answer.
The difference in prunability is less dramatic at the upper levels of the tree; however, since the pruning of whole branches at the upper levels is more valuable, the increased prunability is worthwhile there as well.

5.4 - Scalability

The DF-tree scales linearly with the dataset size regarding the number of distance calculations, the number of disk accesses and the time to answer range queries. Figure 14 illustrates this for our largest dataset, PortugueseWords, with increments of 10% up to the full dataset. The values shown represent the average number of distance calculations, the average number of disk accesses, and the total time to answer 500 range queries with a radius rq=1.

Figure 14 - Linear behavior of the DF-tree when answering 500 range queries with radius 1 on the PortugueseWords dataset, regarding: (a) average number of distance calculations, (b) average number of disk accesses, (c) total time.

5.5 - Automatic fine-tuning provided by the WU algorithm

The WU algorithm detects the degradation of G after considerable changes in the tree. The objects in our real datasets are randomly distributed over the entire dataset. Selecting G at the beginning of the tree construction usually yields a poor set; thus, the WU algorithm should find a better G as the tree grows. Figure 15(a) shows the plots of the average number of distance calculations with increasing database size when answering range queries of radius 1 on the DF-tree for the EnglishWords dataset. The first plot shows the average number of distance calculations using the early-chosen G, and the second plot shows this number when the algorithm is allowed to update the set with a threshold of 2000.

Figure 15 - Range queries of radius 1 on the EnglishWords dataset without changing representatives, and changing representatives using a threshold equal to 2000. (a) average number of distance calculations, and (b) total time.

In this second plot the algorithm triggered two
times: after 1,083 objects were inserted (before the first point in the plots of figure 15(a) was measured), and after 24,510 objects were inserted (before the last point in the plots of figure 15 was measured). The time spent to update G was 0.2 sec the first time and 1.88 sec the second time. As we can see, even for a dataset that has no marked change in its distribution, the improvement is steady over a large range of dataset sizes, both in the number of distance calculations and in time. Note that the time measurements, shown in figure 15(b), follow the graphs in figure 15(a).

6 - Conclusions

This paper presented new techniques aiming to improve the efficiency of MAMs to answer similarity queries. Based on them, we developed the DF-tree, which is dynamic and takes advantage of using multiple global representatives. The DF-tree offers substantial speed-ups, being up to 3 times faster than the state of the art in wall-clock time. The gain in the number of distance calculations is also impressive: the DF-tree requires less than a quarter of the computations to answer similarity queries. This improvement is achieved by taking advantage of a set of global representatives, which increase the prunability without interfering with the construction algorithm of the tree, because they only have to be calculated when the first query is asked. Moreover, the set of global representatives can be changed at any time without disrupting the response to ongoing queries over the tree. Additional contributions of this paper are:
- an algorithm to automatically detect when the global representatives require updating;
- an inexpensive algorithm to update the set of global representatives.

We have also presented a new measurement, called “prunability”, whose purpose is to evaluate how efficiently a set of representatives prunes distance calculations at each level of the tree. Using this measurement, we have found that the use of a single representative works well at the upper levels of the existing MAMs, but is less effective at the lower levels. The proposed DF-tree, however, is capable of pruning at high rates at every level of the tree.

Metric access methods have been intensely developed in recent years and now, with the proposed DF-tree, they have reached a level of performance that qualifies them for inclusion among the indexing methods used in current commercial database management systems, broadening their support to include metric datasets. The DF-tree makes it possible to support queries by content over large sets of images, time sequences and genetic data. This support still requires further development to enable MAMs to operate in the open transactional environment of commercial DBMSs, such as concurrent operations, generation of and recovery from logs, and support for interactions with selection clauses involving non-metric data. The concept of using global representatives to prune distance calculations accelerates query processing enough to make it practical without, however, increasing the complexity of the features yet to be developed.

7 - References

[1] S. Berchtold, D. A. Keim, H.-P. Kriegel, “The X-tree: An Index Structure for High-dimensional data,” in VLDB 1996, pp. 28-39.
[2] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, “When is "Nearest Neighbor" Meaningful?,” in ICDT 1999, pp. 217-235.
[3] T. Bozkaya and Z. M. Özsoyoglu, “Distance-Based Indexing for High-Dimensional Metric Spaces,” in ACM SIGMOD 1997, pp. 357-368.
[4] S. Brin, “Near neighbor search in large metric spaces,” in VLDB 1995, pp. 574-584.
[5] W. A. Burkhard and R. M. Keller, “Some Approaches to Best-Match File Searching,” CACM, vol. 16, pp. 230-236, 1973.
[6] K. Chakrabarti and S. Mehrotra, “The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces,” in IEEE ICDE 1999, pp. 440-447.
[7] P. Ciaccia, M. Patella, P. Zezula, “M-tree: An efficient access method for similarity search in metric spaces,” in VLDB 1997, pp. 426-435.
[8] C. Faloutsos, B. Seeger, A. J. M. Traina, C. Traina, Jr., “Spatial Join Selectivity Using Power Laws,” in ACM SIGMOD 2000, pp. 177-188.
[9] V. Gaede and O. Günther, “Multidimensional Access Methods,” ACM Computing Surveys, vol. 30, pp. 170-231, 1998.
[10] N. Katayama and S. Satoh, “The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries,” in ACM SIGMOD 1997, pp. 369-380.
[11] K.-I. D. Lin, H. V. Jagadish, C. Faloutsos, “The TV-Tree: An Index Structure for High-Dimensional Data,” VLDB Journal, vol. 3, pp. 517-542, 1994.
[12] R. F. Santos Filho, A. J. M. Traina, C. Traina, Jr., C. Faloutsos, “Similarity Search without Tears: The OMNI Family of All-purpose Access Methods,” in ICDE 2001, pp. 623-630.
[13] C. Traina, Jr., A. J. M. Traina, C. Faloutsos, “Distance exponent: a new concept for selectivity estimation in metric trees,” Research Paper CMU-CS-99-110, March 1999.
[14] C. Traina, Jr., A. J. M. Traina, B. Seeger, C. Faloutsos, “Slim-Trees: High Performance Metric Trees Minimizing Overlap Between Nodes,” in EDBT 2000, pp. 51-65.
[15] J. K. Uhlmann, “Satisfying General Proximity/Similarity Queries with Metric Trees,” Information Processing Letters, vol. 40, pp. 175-179, 1991.
[16] P. N. Yianilos, “Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces,” in ACM/SIGACT-SIAM - SODA 1993, pp. 311-321.