How to Improve the Pruning Ability of Dynamic Metric Access Methods

Caetano Traina Jr.
Agma Traina
Roberto Santos Filho
Dept. of Computer Science and Statistics
University of Sao Paulo at Sao Carlos - Brazil
[caetano|agma|figueira]@icmc.sc.usp.br

Christos Faloutsos
Dept. of Computer Science
Carnegie Mellon University - USA
[email protected]
Abstract
Retrieval of complex data is accelerated through
the use of indexing structures. They organize the
data in order to minimize comparisons between
objects, aiming to prune blocks of data during the
process. In metric spaces, comparison operations
can be especially expensive, so the pruning ability
of indexing methods turns out to be especially
significant. This paper shows how to measure the
pruning power of metric access methods (MAMs),
and defines a new property, called "prunability",
which indicates how well a pruning technique
carries out the task of reducing distance
calculations at each level of the tree. We also
give a new measurement for it, and present a
new dynamic access method, aiming to retrieve
objects in a metric space minimizing the number
of distance calculations required to answer
similarity queries. We show that this structure is
up to 3 times faster and requires less than 25% of
the distance calculations to answer range and
nearest-neighbor queries as compared with the
Slim-tree, one of the most efficient metric access
methods (MAMs). This gain in performance is
achieved by taking advantage of a set of global
representatives.
Although using multiple
representatives, the whole structure remains
dynamic and balanced. Experimenting with
existing metric access methods, we have
discovered that the use of one node representative
works well at the upper levels of the existing
MAMs, but is almost useless at the lower levels.
Our proposed method assures higher rates of
pruning at all the levels of the tree.
1 - Introduction
The amount of information stored in databases has grown
at a very fast pace over recent years. Data today is far
more diversified and complex than it was in the past, when
all the information was represented by text and numbers.
The massive use of multimedia information, such as
images, audio, video, time series and DNA sequences, has
generated new challenges for researchers in the database
area. Thus, a cornerstone for the database area is the
development of fast and efficient access methods, which
can cope with large amounts of complex data in these
domains.
In multimedia systems, it is important to search the
database for similar data. In image databases, for example,
the user looks for images that are similar to a given one
according to specific criteria. These data domains do not
possess an order relationship property; in other words,
relations such as "less than" and "greater than" are
meaningless, rendering the usual access methods useless.
However, if a similarity function is defined, similarity
queries make sense. Note that a metric distance function
points out the dissimilarity between objects. When a
distance function has non-negativity and symmetry
properties and satisfies triangular inequality, it is said to be
a metric function, the domain is said to be metric, and
metric access methods can be used. In metric domains, the
so-called similarity queries most frequently used are the
following:
• k-nearest neighbor or k-NN query: kNN=<sq, k>,
which asks for the k objects that are the closest to
a given query object center sq.
• range query: Rq=<sq, rq>, which searches for all the
objects whose distance to the query object center
sq is less than or equal to the query radius rq.
Note that calculating distances between complex
objects is usually an expensive operation. Hence,
minimizing these calculations is one of the most important
aspects of achieving acceptable response times in a metric
access method (MAM). This paper presents a way to
reduce the number of distance calculations to answer
similarity queries and introduces a new and efficient MAM,
the DF-tree.
The main approach is the use of global representatives
for the whole tree. These global representatives will work
together with the representatives of the nodes in order to
decrease the number of distance calculations, which are
usually expensive for complex (multimedia) data. Global
representatives use a new approach which maintains the
tree dynamic. We will show that this approach leads to
significant improvements in answering similarity queries.
Also, being a dynamic MAM is important because
applications usually continue inserting and deleting objects
after the creation of the index structure. We present the
results of experiments executed over three real datasets,
comparing the DF-tree with the state of the art MAMs.
The results show that the DF-tree requires as little as 25%
of the distance calculations and answers similarity queries up to
3 times faster. The results of the experiments also
demonstrate that this new MAM is scalable to very large
datasets, presenting a linear behavior over varying dataset
sizes.
We also defined a new measurement, called
"prunability", to estimate how many distance calculations
can be pruned at each level of a tree. We have discovered
that the use of one node representative works well at the
upper levels of the existing MAMs, but is almost useless at
the lower levels. Using this measurement, we confirmed
that our new proposed MAM assures higher rates of
pruning at every level of the tree.
In addition to presenting the DF-tree and the improved
algorithms to accelerate answering similarity queries, this
paper also presents three support algorithms. The first one
decides when to select and use the global representatives,
the second one automatically detects if the current set is no
longer appropriate, and the last algorithm quickly selects a
new set of global representatives.
The remainder of this paper is organized as follows.
The next section summarizes the background for this work,
while section 3 explains why using more than one
representative per node to build a metric tree may lead to
static structures. Section 4 presents the proposed metric
access method DF-tree, as well as its supporting
algorithms. Section 5 presents the results of the
experiments performed on the DF-tree and compares it to
other recent MAMs. Section 6 discusses the conclusions of
this paper.
2 - Background
One of the underpinnings of a database system is the index
structure that supports it, so the design of efficient access
methods has long been pursued in the database area.
Because some complex data can be viewed as points in a
multidimensional domain, several methods to index data in
a multidimensional space have been proposed in the
literature [1] [6] [10] [11]. An excellent survey on
multidimensional access methods is presented in [9].
Unfortunately, these methods were developed only for
vector data. Note that many complex data do not have any
vectorial property, but their elements can be treated as
occupying a metric domain.
Previous work on metric methods focused on the static
case, for example: the structures proposed by Burkhard and
Keller [5], Uhlmann [15], and also the vp-tree [16], GNAT
[4] and mvp-tree [3], where no further insertions, deletions
and updates are handled. Overcoming this inconvenience,
dynamic MAM structures including M-tree [7], Slim-tree
[14] and OMNI-family [12] have been developed.
Multiple representatives help to reduce the number of
distance calculations performed when answering similarity
queries. They were nicely used in the mvp-tree [3] and in
the Omni-family of index structures [12]. In contrast to
the mvp-tree, the MAMs of the Omni-family are
dynamic, allowing for insertions and deletions. However,
the set of representatives used by the Omni-family is chosen
during the early construction phase of the tree and cannot
be changed afterwards. The main idea was that, given a
dataset S of objects in a metric domain, an object si ∈ S can
be represented by a set of distances to a set of global
representative objects, the foci. A good algorithm to choose
these foci, the HF algorithm, is also presented in [12]. This
algorithm finds objects near the border of the dataset to be
used as foci, with linear cost on the size of the dataset.
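The HF algorithm itself is specified in [12]; the sketch below is only a simplified farthest-first heuristic in the same spirit (border-seeking, one linear scan per chosen focus). The function name `choose_foci` and the exact greedy criterion are illustrative assumptions, not the HF procedure itself:

```python
import random

def choose_foci(dataset, p, dist, seed=0):
    """Greedy farthest-first heuristic: pick p objects near the
    "border" of the dataset, one linear scan per focus."""
    rng = random.Random(seed)
    s = rng.choice(dataset)
    # First focus: the object farthest from a random start.
    # Such an object tends to lie on the border of the dataset.
    foci = [max(dataset, key=lambda x: dist(s, x))]
    while len(foci) < p:
        # Next focus: the object maximizing its minimum distance
        # to the foci already chosen, pushing the set apart.
        nxt = max(dataset, key=lambda x: min(dist(f, x) for f in foci))
        foci.append(nxt)
    return foci
```

For a uniform two-dimensional grid under the Euclidean distance, this heuristic picks corner points first, which matches the intuition of "objects far from each other and close to the border".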
The number of representatives required to maximize
the gain obtained from the reduced number of distance
calculations is associated with the intrinsic dimensionality
D of the dataset. As in [8], the intrinsic dimensionality is
defined as the exponent in the power law

    nb(r) ∝ r^D

where nb(r) is the average number of neighbors within
distance r or less from an arbitrary object, and ∝ stands for
"is proportional to". We will use this result and the HF
algorithm in this paper.
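Under this definition, D can be estimated by measuring nb(r) at several radii and fitting the slope of log nb(r) against log r. A minimal sketch follows; the brute-force pair counting and the choice of radii are illustrative assumptions (practical implementations use box-counting or sampling):

```python
import math

def intrinsic_dimensionality(dataset, dist, radii):
    """Estimate the exponent D of the power law nb(r) ∝ r^D by a
    least-squares fit of log nb(r) versus log r."""
    logs = []
    for r in radii:
        # nb(r): average number of neighbors within distance r.
        nb = sum(sum(1 for t in dataset if t is not s and dist(s, t) <= r)
                 for s in dataset) / len(dataset)
        if nb > 0:
            logs.append((math.log(r), math.log(nb)))
    # Ordinary least-squares slope over the (log r, log nb) pairs.
    n = len(logs)
    mx = sum(x for x, _ in logs) / n
    my = sum(y for _, y in logs) / n
    num = sum((x - mx) * (y - my) for x, y in logs)
    den = sum((x - mx) ** 2 for x, _ in logs)
    return num / den
```

For points lying on a line, for example, the estimate comes out close to 1 regardless of the embedding dimension, which is exactly what makes D a better guide than the embedding dimensionality.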
3 - Motivation
Metric access methods based on trees store data through
sets of fixed-size nodes, using reference objects in each
node to represent the other objects stored in that node and
in its sub-trees. Precalculated distances between stored
objects and these representatives are kept to speed up
insertions and deletions and to answer queries. This design
takes advantage of the triangular inequality property of
metric domains to prune many costly distance calculations
between the query centers and the stored objects, as
explained in the following.
In a metric tree, each node stores a subset of the
objects. Let us first consider the objects sj ∈ S stored in a leaf
node i. One of these objects, tagged as sRi, is chosen as the
"representative" for the objects stored in the node i. The
distances from every object in this node to the representative are calculated, and the largest one is defined as the
covering radius rRi of that node. In this way, any object
farther than rRi from sRi cannot be found in node i. In the
upper levels the same construct is employed, but the objects
stored in a non-leaf node are the representatives of their
descendant nodes. Again, one of these objects is chosen as
the representative of this node, and we will tag it as sRep.
The distances from every object stored in that node to its
representative are calculated and also stored in the node.
However, the covering radius of a non-leaf node is the
distance from the representative to the farthest object stored
in that node plus the covering radius of the node where this
object is the representative. Figure 1 shows a DF-tree
indexing 17 objects, considering nodes with a maximum
capacity of three objects. Note that the root node does not
have a representative and the complete set S is stored in the
leaves. The global representatives are not depicted in this
figure because no query has been asked yet. For example, any
two objects far from each other and on the "border" of the
dataset can be used as global representatives, such as the
objects K and P in this figure. As with the other illustrative
figures presented in this paper, figure 1(a) assumes a
dataset in a two-dimensional space with the Euclidean
distance function, although the same concepts apply to any
metric domain.
Figure 1 - A DF-tree with node capacity=3, indexing 17 objects.
Now, consider a range query asking for objects distant
up to rq from the query center sq, as shown in figure 2. The
query defines three regions: in Region 1 are the objects that
are farther from sRep than its distance to the query center sq
plus the query radius rq. In Region 3 are the objects that
are closer to sRep than its distance to the query center sq
minus the query radius rq. In Region 2 are the objects that
satisfy neither of these conditions. Objects in Regions
1 and 3 cannot be in the response set. Thus, nodes that are
completely in Regions 1 or 3 can be pruned from the search
process. In this way, by the triangular inequality, every
node i with representative sRi can be pruned if it satisfies one
of the following conditions:

a) d(sRep, sq) + rq < d(sRep, sRi) - rRi   (Region 1)   (Eq. 1)
b) d(sRep, sq) - rq > d(sRep, sRi) + rRi   (Region 3)   (Eq. 2)
The same pruning concept can be applied at the leaf nodes,
where the value of rRi is zero. Thus, the triangular
inequality enables pruning both the traversing of subtrees
in non-leaf nodes and distance calculations from the query
object to objects in the leaf nodes.
Figure 2 - Only objects in region 2 will
be in the response set of the range query
centered at sq with radius rq.
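Conditions (Eq. 1) and (Eq. 2) translate directly into a small predicate. The sketch below assumes the two distances d(sRep, sq) and d(sRep, sRi) have already been computed; the parameter names are illustrative:

```python
def can_prune(d_rep_q, rq, d_rep_i, r_i):
    """Eq. 1 / Eq. 2: a node with representative sRi and covering
    radius r_i can be pruned when it lies entirely in Region 1 or 3.
    d_rep_q = d(sRep, sq); d_rep_i = d(sRep, sRi)."""
    in_region_1 = d_rep_q + rq < d_rep_i - r_i  # node entirely beyond the query ball
    in_region_3 = d_rep_q - rq > d_rep_i + r_i  # node entirely inside the inner ball
    return in_region_1 or in_region_3
```

For leaf entries the same call is made with r_i = 0, so a single predicate covers both the subtree pruning and the pruning of individual distance calculations.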
This mechanism leads to dynamic structures, because
if a new object needs to be inserted, it can be inserted in
any of the nodes that are able to “cover” it. If no node
covers it, the covering radius of an existing node is
enlarged. If the node capacity is surpassed, the node splits
and a new pair of representatives is chosen. This changes
one object in the upper level node and promotes one new
representative to this node, which will eventually require a
split too, in an iterative process. When the root node splits,
the tree grows one level. Thus, the construction of the tree
occurs in a bottom-up approach, as in B-trees. Detailed
algorithms to perform those operations can be found in [14]
and [7].
Pruning can be enhanced by using more than one
representative (a reference object). In fact, the “amount of
information” embodied in a simple relation between three
objects (the query center, the target object sj and the node
representative) provided by the triangular inequality
property is rather low, particularly in pure metric or in
high-dimensional spaces [2]. The combined use of two or
more references eliminates larger portions of the dataset
than using each reference object individually. Figure 3
shows how much the region containing the candidate set of
objects to answer the query shrinks by adding a global
representative. If only the node representative is used for
pruning, the shadowed region in figure 3(a) is where the
answer objects for a range query with center object sq and
radius rq could be found. In figure 3(b) one Global
Representative is also used and the darker shadowed region
indicates where the answer objects could be for the same
range query of (a). Note that the size of the resulting
region is much smaller.
Figure 3 - Regions that cannot be pruned: (a) using only the node
representative; (b) using also a global representative.

It is helpful to understand why multiple reference
objects lead to static structures, allowing us to overcome
this undesired effect while still retaining the strong
pruning capacity achieved by a combination of many
representatives. In fact, if the set of reference objects at a
given node dictates where other objects should be stored at
lower levels (i.e., in which of its descending subtrees), the
structure becomes static, because whenever a reference is
changed, the objects that were stored in a given subtree
would have to be moved to another subtree. Hence, in
order to be dynamic, a MAM must either allow for more
than one place to store each object or choose the
representatives in a bottom-up approach, changing the
reference objects based on the subtrees rather than on the
upper levels of the tree. Both approaches, however, have
their drawbacks. Allowing for more than one place to store
each object involves many attempts until the objects
required in a query are successfully retrieved. Choosing the
representatives in a bottom-up approach, on the other hand,
inhibits the combined use of the set of representatives of the
nodes whose path leads to that node.
The Slim-tree and the M-tree choose a compromise
between these two approaches. First, the representative at
each node restricts, rather than determines, where a given
object can be stored. Second, the representative of a node
is selected from among the representatives of its children
nodes, and the covering radius of the node is calculated
based on the covering radii of those children. However,
this compromise means that only one representative can be
used at each node. The Slim-tree improved the M-tree by
tackling the first compromise, as minimizing the
intersection of nodes reduces the number of deep searches
needed to find an object. The DF-tree further improves the
Slim-tree by allowing more representatives to be used to prune
distances, tackling also the second compromise. Although
we do not use the representatives of upper levels, as a
matter of fact, nothing hinders the use of more than one
object to prune subtrees, provided that only one is used to
guide the creation of the tree. That is what we do through
the metric access method DF-tree presented in the next
section.

4 - The new DF-tree structure

We propose here the use of Global Representatives (GR)
together with the fundamental concepts of a metric tree,
aiming to reduce the number of distance calculations
required to answer similarity queries. This paper
demonstrates that the proper use of GRs allows for fewer
distance calculations and thus a shorter time to answer
similarity queries. The following definition is needed to
explain this new structure.
Definition 1 - Global Representative Set (G): Let M = <𝕊,
d> be a metric space, where 𝕊 is the domain of the
features, which are the indexing keys, and d() is a
metric distance function. Given a dataset S ⊂ 𝕊 with
N objects, a Global Representative Subset of S (or a GR
set) is a set G = {g1, g2, ..., gp | gk ∈ S, gk ≠ gj, p ≤ N}, where
each gj is chosen to be a Global Representative, and p
is the number of global representatives contained in G.
We call the single representative of each node of the
tree the node representative, in order to distinguish it
from the global representatives for the whole tree. Each
global representative is independent from the others,
including the node representatives, and is applied to every
object si inserted in the structure, so we say that each global
representative defines a distance field (DF) over the
domain 𝕊. The distance field is represented by the distance
from each object to the corresponding global representative,
stored with that object. The main symbols used in this
paper are shown in Table 1.
Symbol   Definition
𝕊        domain of objects
S        set of objects in domain 𝕊
D        intrinsic dimensionality of dataset S
sq       a query object (or query center)
rq       radius of a range query
k        number of neighbors in a k-NN query
h        level of the tree
G        set of global representatives
gj       a global representative of G
p        cardinality of G
d()      distance function
mdj      maximum distance from the global representative gj to all the other global representatives
Ph(Q)    prunability at level h to answer a set Q of queries

Table 1 - Summary of symbols and definitions.

The general idea of the DF-tree is to build the structure
combining a metric tree that uses one representative per
node (e.g., the Slim-tree) with the distance fields generated
by each element of G. The distance fields do not interfere
with the creation of the tree structure, but can be used to
prune distance calculations when answering queries. The
selection of objects to act as global representatives and the
calculation of their corresponding distances to each stored
object can be postponed to any time before the first query is
answered. For example, the distances can be calculated
either when each object is being inserted or just after a bulk
loading operation is completed.
4.1 - The “prunability” property
The purpose of using global representatives is to increase
the pruning of distance calculations. The number of
distance calculations that can be pruned depends on the
relative sizes of the areas defined by each representative
(node or global), and the regions defined by the query
center and the representative radii (recall figure 2). These
sizes vary at different levels of a given tree, leading to
methods that are better to prune at some levels than at
others. Therefore, to gauge how efficiently a MAM can
prune distance calculations based on a set of reference
objects active in each node, we have defined the following
measurement.
Definition 2 - "Prunability": Given a large set Q of
similarity queries over a tree, the prunability Ph(Q) is the
average of the ratio Nub(qi)/Ntb(qi) applied to each node
b accessed at a given level h to answer each query qi ∈ Q,
where:

Ntb(qi) - is the total number of objects in node b
accessed to answer the query qi not pruned by the
references active in this node. Each access to a
node imposes a distance calculation.

Nub(qi) - is the number of objects in the nodes accessed
to answer the query qi that actually qualify, i.e.,
for which the distance calculation is unavoidable.

We consider that a distance calculation is unavoidable in a
given node at level h if at least one object in the children
nodes at level h+1 requires further processing. Figure 4(a)
exemplifies the prunability when using just the node
representative to answer the query depicted
(prunability=2/20=0.1), and figure 4(b) shows it when
also using one global representative to answer a range
query centered at sq (prunability=2/4=0.5). It can be seen
that the prunability increased five times.
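As a minimal sketch, the measurement reduces to averaging per-node ratios. The list of (Nub, Ntb) pairs is a hypothetical input format, assumed to have been collected while executing the query set Q over the nodes accessed at one level h:

```python
def prunability(per_node_counts):
    """Prunability Ph(Q): average, over the accessed nodes, of
    Nub/Ntb, where Ntb is the number of distance calculations
    actually performed in a node and Nub the number of them that
    were unavoidable (the object truly required processing)."""
    ratios = [nu / nt for nu, nt in per_node_counts if nt > 0]
    return sum(ratios) / len(ratios)
```

With the counts from figure 4, a single node gives 2/20 = 0.1 using only the node representative, and 2/4 = 0.5 when one global representative is added.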
4.2 - Structure of the DF-tree
The structure of the proposed DF-tree has two components:
the fields which are used to build the tree and the distance
fields. The tree component follows the same structure
as the Slim-tree, and is used with the same
algorithms. The distance fields have valid values only after
the GR set G has been selected, which occurs when the first
query is asked. Each tree-node of the DF-tree structure
contains the number c of objects stored in this node, and an
array of c sub-structures. Leaf and non-leaf nodes have
slightly different formats. The leaf-node substructure has the
following format: <si, Oidi, DFi, d(si, sRi)>, where si is the
indexed description of the object, Oidi is the identifier of
object si, and d(si, sRi) is the distance between object si and
the representative object sRi of this leaf node. The index-node substructure has the following format:
<si, Ri, d(si, sRi), DFi, Ptr(Tsi), NEntries(Ptr(Tsi))>, where
si holds the object that is the representative of the sub-tree
indicated by Ptr(Tsi), and Ri is the covering radius of that
region. The distance between si and the representative of
this node sRi is given by d(si , sRi). The pointer Ptr (Tsi)
points to the root node of the subtree T rooted by si. The
current number of entries in the node indicated by Ptr(Tsi)
is stored in NEntries(Ptr(Tsi)).
Both leaf and index nodes store the distance field
component in DFi, an array of the distances of object si to
each global representative gj ∈ G. This component is used
only to answer queries, not in the update operations of
the tree. Section 4.3 explains how they are used to answer
queries, and section 4.4 explains how to calculate them,
when to calculate them, how many representatives constitute the
distance fields, and how to determine whether they need to be
recalculated after many updates to the tree have taken place.
The distance field component can be stored in a
separate file. We chose to store it together with the tree
component to reduce disk accesses. Storing all the data
together reduces the number of objects that fit in nodes of
a given size, which, in turn, reduces the fan-out of the tree.
We have assumed that this side effect is negligible when
the proposed structure is used with large objects, which is
the most common situation.
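The two substructures can be sketched as plain records. The Python field names below are illustrative assumptions; the actual on-disk layout of a DF-tree node is not specified here:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LeafEntry:
    """Leaf-node substructure <si, Oidi, DFi, d(si, sRi)>."""
    obj: object        # si: the indexed description of the object
    oid: int           # Oidi: identifier of object si
    df: List[float]    # DFi: distances from si to each global representative
    d_to_rep: float    # d(si, sRi): distance to the node representative

@dataclass
class IndexEntry:
    """Index-node substructure <si, Ri, d(si, sRi), DFi, Ptr(Tsi), NEntries>."""
    obj: object        # si: representative of the subtree
    radius: float      # Ri: covering radius of the subtree region
    d_to_rep: float    # d(si, sRi)
    df: List[float]    # DFi
    child: "Node"      # Ptr(Tsi): root node of the subtree rooted by si
    n_entries: int     # NEntries(Ptr(Tsi))

@dataclass
class Node:
    is_leaf: bool
    entries: list = field(default_factory=list)  # the c sub-structures
```

Note how the DF array appears in both entry kinds: the global representatives constrain the search in leaves and internal nodes alike, while only the tree fields (radius, d_to_rep, child) drive construction.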
Figure 4 - Exemplifying how the prunability works.

4.3 - Algorithms for range and nearest-neighbor queries using global representatives
Here we present the algorithms to execute both range and
k-nearest neighbor queries using global representatives on
the DF-tree. Let us first consider range queries, which are
represented as Rq=<sq, rq>. These queries begin by looking at
the root node of the tree. When this node is read, its
representative is set as sRep, and the distance from sq to sRep
is calculated. In this way the three regions shown in figure
2 are generated for this representative, and the conditions
expressed by Equations 1 and 2 are evaluated. If none of
the conditions expressed by those equations holds, one of
the global representatives gj is set as sRep and both equations
are re-evaluated. If none of the equations holds, the next
global representative is used, and so on, until every
representative is used. If one equation holds for any
representative, the corresponding node can be pruned from
the searching process. Nodes that cannot be pruned must
be read, and processed recursively. This algorithm is shown
in figure 5.
Process a subtree to answer a range query
Input: a DF-tree node i and a range query Rq=<sq, rq>
Output: the query response set.
Begin
1 - Calculate the distance from the query object sq to the
    representative of this node.
2 - For each object sj in node i:
3 -   Set sRep as the representative of this node
4 -   If Eq. (1) or Eq. (2) holds, get the next object. // object was pruned
5 -   else check if object sj can be pruned by the distance field
      (see Fig. 4b). If it can, get the next object. // object was pruned
6 -   If this is a leaf node, put object sj in the response set
7 -   else process the subtree rooted at this object.
End

Check if an object can be pruned by the distance field
Input: the object sj to be checked, its covering radius rRi if it is
in a non-leaf node or zero otherwise, and its distances to the
global representatives
Output: true if it can be pruned, false otherwise.
Begin
1 - For each global representative gj ∈ G
2 -   Set sRep as gj
3 -   If Eq. (1) or Eq. (2) holds, return true, else get the next
      global representative
4 - return false
End

Figure 5 - The range query algorithm for the DF-tree.
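A runnable sketch of the algorithm in figure 5 follows. The dictionary-based node layout and the precomputation of each d(gj, sq) once per query are illustrative assumptions, not the DF-tree's actual node format:

```python
def can_prune(d_rep_q, rq, d_rep_entry, r_entry):
    """Eq. (1) / Eq. (2): the entry lies entirely in Region 1 or Region 3."""
    return (d_rep_q + rq < d_rep_entry - r_entry or
            d_rep_q - rq > d_rep_entry + r_entry)

def range_query(node, sq, rq, dist, gr, result):
    """Figure 5, sketched. `node` is a dict {'rep', 'is_leaf', 'entries'};
    each entry is a dict {'obj', 'radius' (0 in leaves), 'd_to_rep',
    'df' (distances to the global representatives), 'child'}.
    `gr` holds the distances d(gj, sq), computed once per query."""
    d_rep_q = dist(sq, node['rep'])  # one distance per visited node
    for e in node['entries']:
        # 1) try to prune with the node representative (Eq. 1 / Eq. 2)
        if can_prune(d_rep_q, rq, e['d_to_rep'], e['radius']):
            continue
        # 2) try to prune with each global representative's distance field
        if any(can_prune(d_gq, rq, df_j, e['radius'])
               for d_gq, df_j in zip(gr, e['df'])):
            continue
        # 3) not pruned: report leaf objects, or recurse into the subtree
        if node['is_leaf']:
            if dist(sq, e['obj']) <= rq:
                result.append(e['obj'])
        else:
            range_query(e['child'], sq, rq, dist, gr, result)
    return result
```

The global representatives cost p extra distance calculations per query (the `gr` list), but each stored `df` entry is reused for free at every node the search visits.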
The k-nearest neighbor queries kNN=<sq, k> use a
priority queue Pr of size k, where the candidate objects sj
are kept sorted by their distances to the query center sq.
The distance of the farthest object currently in Pr is set as
the current query radius rc. The algorithm starts reading
the root node with Pr empty, and until there are k objects
in the queue, rc is set to infinity. A new object is inserted
only if it is closer to sq than rc. Inserting a new object into a
full Pr replaces the farthest one, and the next farthest object
defines the new value of rc. Whenever a node is read, every
object in it is stored in another priority queue Pw of
unlimited size, which holds the unprocessed objects, also
kept sorted by their distances to the query center sq.
Objects in Pw are processed by an algorithm
similar to the one that processes range queries, using rc as the
current query radius. The algorithm terminates when Pw
becomes empty. Figure 6 depicts this algorithm.
Process a subtree to answer a k-nearest neighbor query
Input: a DF-tree node i, a query kNN=<sq, k>, the response
queue Pr and the waiting queue Pw
Output: the query response set.
Begin
1 - If node i is a non-leaf node:
2 -   For each object sj in node i:
3 -     Set sRep as the representative of this node.
4 -     If Eq. (1) or Eq. (2) holds, get the next object. // object was pruned
5 -     else if object sj can be pruned by the distance field
        (see Fig. 4b), get the next object // object was pruned
6 -     else insert sj in Pw with its distance from sq.
7 -   While Pw is not empty, get sj and rj from Pw:
8 -     Set sRep as sj
9 -     If Eq. (1) or Eq. (2) holds, get the next object. // object was pruned
10 -    else if object sj can be pruned by the distance field
        (see Fig. 4b), get the next object // object was pruned
11 -    else process the subtree rooted at object sj.
12 - else // node is a leaf node
13 -   For each object sj in node i:
14 -     Set sRep as the representative of this node.
15 -     If Eq. (1) or Eq. (2) holds, get the next object. // object was pruned
16 -     else if object sj can be pruned by the distance field
         (see Fig. 4b), get the next object // object was pruned
17 -     else insert sj in Pr with its distance from sq,
         and get the new rc.
End

Figure 6 - The k-nearest neighbor query algorithm for the DF-tree.
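The interplay of the two queues can be sketched as follows. This simplified version orders Pw by the lower bound d(sq, si) - Ri, represents nodes as plain dictionaries (an illustrative assumption), and omits the distance-field check against the GRs, which would be applied exactly as in the range query:

```python
import heapq

def knn_query(root, sq, k, dist):
    """Best-first k-NN with two priority queues, as in figure 6:
    Pr keeps the k best candidates (a max-heap on distance, via
    negation, so the current worst is popped first); Pw holds
    unvisited subtrees ordered by a lower bound on their distance."""
    pr = []                       # max-heap of (-distance, object)
    pw = [(0.0, 0, root)]         # (lower bound, tiebreak, node)
    rc = float('inf')             # current (dynamic) query radius
    tiebreak = 1
    while pw:
        lb, _, node = heapq.heappop(pw)
        if lb > rc:               # this subtree cannot improve the answer
            continue
        d_rep_q = dist(sq, node['rep'])
        for e in node['entries']:
            # prune with the node representative (Eq. 1 / Eq. 2, rq = rc);
            # the distance-field check against each gj in G would go here too
            if (d_rep_q + rc < e['d_to_rep'] - e['radius'] or
                    d_rep_q - rc > e['d_to_rep'] + e['radius']):
                continue
            if node['is_leaf']:
                d = dist(sq, e['obj'])
                if d <= rc:
                    heapq.heappush(pr, (-d, e['obj']))
                    if len(pr) > k:
                        heapq.heappop(pr)   # drop the current farthest
                    if len(pr) == k:
                        rc = -pr[0][0]      # shrink the query radius
            else:
                # lower bound on the distance from sq to the subtree
                lb_child = max(0.0, dist(sq, e['obj']) - e['radius'])
                heapq.heappush(pw, (lb_child, tiebreak, e['child']))
                tiebreak += 1
    return sorted((-d, o) for d, o in pr)   # (distance, object) pairs
```

Note how rc starts at infinity and shrinks as Pr fills, which is what lets the Eq. 1 / Eq. 2 tests prune ever more aggressively as the search proceeds.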
4.4 - Choosing the global representative set G
As discussed in the previous section, each representative
(either the node or a global representative) defines a ring in
the dataset domain where objects cannot be pruned using
this reference. When more than one representative is used,
only the objects in the intersection of the corresponding
rings cannot be pruned. The node representative is defined
by the tree construction algorithm, so it cannot be changed.
Hence, the flexibility in the choice of references falls on the
number and placement of the global representatives (GRs).
Determining the number of global representatives
It was shown in [12] that the maximum number p of
references worth considering is one plus the intrinsic
dimensionality ⌈D⌉ of the dataset. However, the way in
which the DF-tree uses the GRs involves different requirements,
so we have re-analyzed this number. As each node already
has a representative, the number of global representatives
can be only ⌈D⌉.
If the distance of the intersecting regions of the rings
provided by two GRs is less than the distance between these
two GRs, this region tends to be smaller than the
intersection generated by the rings far from the GRs.
Therefore, among other requirements, the GR set must get
objects far from each other and close to the “border” of the
dataset. In this way, almost every distance from the objects
in the dataset to these GRs will be less than the distance
between the GRs. Figure 7 illustrates this idea. Figure 7(a)
shows a query occurring "between" the GRs, whereas the
queries shown in figures 7(b) and 7(c) occur "outside"
the GRs. The intersections of the rings in figure 7(a) are
clearly smaller than those in figures 7(b) and 7(c).
Figure 7 - Where the GR set G is effective. (a) The most
effective: the range query is circumscribed by G={ A, B, C}. (b)
The intersection provided by G to solve a query centered on the
object sq2 which is not circumscribed by G, but close to at least
one GR. (c) The least effective: intersection provided by G to
solve a query centered on the object sq3, which is also not
circumscribed by G and is far from all the GRs.
It should be noted that increasing the number of GRs
reduces the fan-out of the tree, which, in turn, increases the
number of nodes and the height of the tree. Although the
triangular inequality allows for pruning of distance
calculations to objects in a previously read node, each disk
access involves at least one additional distance calculation from the query center to the node representative.
Therefore, increasing the number of nodes in a tree also
increases the number of distance calculations, leading to a
tradeoff between the increased prunability provided by a
greater number of representatives and the resulting
decreased fan-out. This limits the useful number of global
representatives to ⌈D⌉.
When to choose the first GR set G
When the tree is empty, it is impossible to choose a GR set.
As the objects are inserted, we have to decide at which
point to choose the GR set. The trade-off is clear: if we start
too early, we may be 'stuck' with a bad GR set. If we start
too late, we may not enjoy the speedups that the GR set
provides.
The DF-tree requires the GRs only when a query is
issued. If the query is posed on a tree that has only a few
objects, the query can be answered without using this
resource. Therefore, we have determined that the first GR
set is calculated when the first query is answered, after the
DF-tree already has at least 2 levels (the root level and
another one). This ensures the presence of a reasonable
number of objects when the global representatives are
chosen.
Deciding when to update the GR set: the WU algorithm
An important consideration is the maintenance of a proper
set of global representatives. The idea is that the GRs are
chosen when the tree is first created. However, after the
tree has been created and many objects were inserted and
deleted, a possible change in data distribution may render
the original GR set G inadequate. Therefore, in addition to
having a fast algorithm to select the GR set from the current
dataset, an algorithm is required to determine when the
current G is no longer worthwhile. An ideal algorithm
must preclude new distance calculations. Considering that
the distance between each newly inserted object and each
GR must be calculated, the algorithm should rely only on
those distances.
Figure 8 helps visualize how this algorithm works, using a
two-dimensional space with the Euclidean distance function.
The figure presents the regions delimited by the pairwise
distances between the three GRs A, B and C. The shadowed
area represents the region where the GRs prune objects most
effectively (see figure 7): the distance from each object
inside the shadowed area to each of the three GRs is less
than the maximum distance between the GRs. This can be
expressed by the following definition.

Figure 8 - The shadowed region shows the area circumscribed
by the global representatives A, B and C.
Definition 3 - Circumscribed objects: Let mdj = max
{ d(gj, gk) | gk ≠ gj, ∀ gj, gk ∈ G } be the maximum
distance from each global representative gj to all
the other global representatives gk. We say that an
object si satisfying d(si, gj) ≤ mdj, ∀ gj ∈ G, is
circumscribed by the GR set G.
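Definition 3 can be checked without any extra distance computations, since the distances from each object to every GR are already known. A minimal sketch in Python (function names are ours, not the paper's):

```python
# Sketch of Definition 3: md_j is the maximum distance from GR g_j to the
# other GRs; an object is circumscribed iff its distance to every GR g_j
# is at most md_j.
def max_pairwise_distances(gr_dists):
    """gr_dists[j][k] = d(g_j, g_k); returns md_j = max over k != j."""
    p = len(gr_dists)
    return [max(gr_dists[j][k] for k in range(p) if k != j) for j in range(p)]

def is_circumscribed(obj_dists, md):
    """obj_dists[j] = d(s_i, g_j); circumscribed iff d(s_i, g_j) <= md_j for all j."""
    return all(d <= m for d, m in zip(obj_dists, md))
```

With three GRs whose pairwise distances are 4, 3 and 5, the md set is [4, 5, 5]; an object whose distances to the GRs stay below these bounds is circumscribed, while one far from every GR is not.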
The set G has a strong ability to prune its circumscribed
objects, because the intersection of the corresponding rings
generated by queries in this region tends to be minimized.
This prunability gradually decreases as an object moves
farther outside this region. The point is illustrated in
figures 7(b) and 7(c), which show the same space and the
global representative set G as figure 8. The objects in the
shadowed area of figure 8 are those circumscribed by
G = { A, B, C }. Figure 7(b) shows the intersection provided
by the GRs A, B and C to solve a query centered on the
object sq2, which is not circumscribed by them but is close
to at least one GR. Figure 7(c) shows the intersection
provided by G to solve a query centered on the object sq3,
which is also uncircumscribed and is far away from every
GR. As can be seen, the intersection of the rings in figure
7(c) is much larger than the one in figure 7(b). It can be
argued that this holds only for uniformly distributed
datasets. However, in practical situations, the density of
queries in a given region approximately follows the density
of the dataset in that region. Hence, the overall result is
independent of the density of the dataset over the whole
space. The desired algorithm can be built on this
characteristic, and is called the WU (when to update)
algorithm.
Note that, because the cardinality p of G is at least 2,
the md set can always be calculated. The mdj value is
associated with the global representative gj, and can easily
be calculated as G is created. Since the distance of each
object to the global representatives is calculated anyway,
no extra distance calculation is required to determine
whether an object is circumscribed by the current G. This
allows for an inexpensive algorithm to determine when the
current G is no longer an efficient pruner: every newly
inserted object is checked to verify how many of them are
uncircumscribed by the current G. The algorithm to change
G is triggered whenever this number exceeds a given limit.
Objects that are uncircumscribed by the current G and far
away from the circumscribed region have a stronger negative
impact on prunability than those in closer proximity (as
shown in figures 7(b) and 7(c)).
Therefore, we propose to count each uncircumscribed object
si by a weighted value, which depends on the distance of the
object to each GR gj, calculated as

    w = ( ∏_{j=1}^{p} d(si, gj) / mdj )^{1/p} .

As objects close to GRs lead to rings with smaller
intersections (there are fewer false alarms) than objects
far away from all GRs, this weighted value is a better
choice. This weighted counting of new objects inserted in
the tree is restarted whenever a new GR set is calculated,
so it is associated with the tree.
The trigger limit (threshold) used is left as a tuning
parameter of the DF-tree and can be experimentally
determined for each data domain. We found empirically
that using the number of objects already indexed when the
current G was chosen is a good starting point for the
threshold. Note that this value is independent of the
absolute scale of the dataset. Figure 9(a) presents the
algorithm to calculate the md set, while figure 9(b) shows
the WU algorithm to detect when the current G no longer
effectively prunes sub-trees.
(a) Calculate the md set (executed when a new GR set G
is chosen)
input: the set of distances d(gj, gk), gj, gk ∈ the new G
output: the set md
Begin
1 - For each global representative gj:
2 -   Set mdj = 0
3 -   For each global representative gk distinct from gj:
4 -     If d(gj, gk) > mdj then set mdj = d(gj, gk)
End

(b) WU Algorithm: Detect when to update G (executed
when an object is inserted in the tree)
input: the set md with cardinality p, the set of distances
d(gj, si) from the object si to each GR gj, the threshold
th, and the accumulated number of new
uncircumscribed objects (cnc). The value cnc is reset to
zero whenever a new G is calculated.
output: the updated number cnc. Returns true if G requires
changing, false otherwise
Begin
1 - Set w = 1
2 - For each global representative gj:
3 -   Set w = w * d(gj, si) / mdj
4 - Set cnc = cnc + w^(1/p)
5 - If cnc > th return true; otherwise return false
End

Figure 9 - (a) Algorithm to calculate the md set. (b) The WU
algorithm.
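The two routines of figure 9 can be sketched as follows, assuming the pairwise GR distances and the object-to-GR distances are supplied by the caller (Python used here for illustration; the paper's implementation is in C++, and the function names are ours):

```python
# Sketch of figure 9. calculate_md mirrors algorithm (a); wu_update mirrors
# the WU algorithm (b): it folds one newly inserted object into the
# accumulated counter cnc and reports whether the GR set should be replaced.
def calculate_md(gr_dists):
    """gr_dists[j][k] = d(g_j, g_k); md_j = max over k != j of d(g_j, g_k)."""
    p = len(gr_dists)
    return [max(gr_dists[j][k] for k in range(p) if k != j) for j in range(p)]

def wu_update(md, obj_dists, threshold, cnc):
    """One WU step: weight the new object by the geometric mean of
    d(s_i, g_j) / md_j, accumulate it into cnc, and test the threshold."""
    p = len(md)
    w = 1.0
    for d, m in zip(obj_dists, md):
        w *= d / m
    cnc += w ** (1.0 / p)
    return cnc, cnc > threshold
```

An object exactly on the border of the circumscribed region (d(si, gj) = mdj for every j) contributes weight 1; circumscribed objects contribute less, and objects far from every GR contribute more, which is what drives the trigger.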
How to choose a new GR set G: the HU algorithm
Here we describe an algorithm to update the set G, called
the HU (how to update) algorithm. Let us assume that a
new G with cardinality p has to be chosen from a set of
objects. As the algorithm computes p+1 distances for each
object in the candidate set, it is useful to reduce this set
as much as possible. The set G is composed of objects
close to the "border" of the dataset; therefore, objects
already circumscribed by the current G are not candidates.
Thus, the candidate set includes the objects si in the dataset
whose distance to a global representative gj is greater than
mdj, i.e., d(si, gj) > mdj for at least one global
representative gj.
If the uncircumscribed objects are concentrated in a
specific region and are unevenly distributed around the
current G, then the new G may leave uncircumscribed some
objects that are currently circumscribed. To overcome
this problem, the current G is also included in the
candidate set. The candidate set is then processed by the
usual HF algorithm to identify the representatives close to
the border of this set. This process is faster when only the
uncircumscribed objects plus the current G are used as the
candidate set, instead of the full set of stored objects,
which would be the normal procedure. In fact, our
experiments have shown that the same objects are often
selected whether the whole dataset or only this restricted
set is used. The HU algorithm is described in figure 10.
HU Algorithm: Update G
(executed when the WU algorithm triggers)
input: the current G, its cardinality p, and the DF-tree
output: the new G
Begin
1 - Set the "candidate set" to empty
2 - Traverse the DF-tree:
3 -   For each object si stored at a leaf node:
4 -     If for any j, 1 ≤ j ≤ p, d(si, gj) > mdj, then include si
        in the candidate set
5 - Include G in the candidate set
6 - Execute the HF algorithm on the candidate set
End

Figure 10 - The HU algorithm to update G.
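The HU traversal can be sketched as follows (Python for illustration; the HF algorithm is passed in as a function, since it is defined elsewhere in the paper, and all names are ours):

```python
# Sketch of the HU candidate-set construction: keep only the objects left
# uncircumscribed by the current GR set, add the current GRs themselves,
# and hand the reduced set to the HF selection routine.
def hu_update(leaf_objects, dist_to_grs, current_grs, md, hf_algorithm):
    """dist_to_grs(s) returns [d(s, g_1), ..., d(s, g_p)] for a stored object s;
    hf_algorithm(candidates) returns the new GR set (assumption)."""
    candidates = [s for s in leaf_objects
                  if any(d > m for d, m in zip(dist_to_grs(s), md))]
    candidates.extend(current_grs)   # keep the old GRs as candidates too
    return hf_algorithm(candidates)
```

The filtering step reuses the object-to-GR distances that are already stored, so no extra distance calculations are spent on deciding candidacy.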
4.5 - Properties of the DF-tree
In addition to optimizing query-answering performance,
the global representatives have interesting characteristics,
which lead to the following DF-tree properties:
Property 1: The costs involved in choosing and changing
the global representatives G are low.
Property 2: The GR set G can either be used or not.
Regardless of the existence of G, a query can be
answered without using it. This is an important feature
in a concurrent environment, because it allows the
system to continue answering queries even while a
change of G is in progress.
Property 3: Any number of the existing global
representatives can be used. The greater the number of
reference points used, up to the limit given by M'N, the
higher the structure's prunability. In combination with
the previous property, G can be updated by incrementally
changing the distances from the objects to each GR one
at a time, thereby taking advantage of the idle time of
the database manager.
Property 4: G can be replaced without affecting the tree
structure, because the global representatives are not
used to construct the tree. That is why the DF-tree
remains a dynamic structure even when using global
representatives.
5 - Experimental results
This section discusses the results of the experiments
performed to evaluate the effectiveness of the DF-tree. The
DF-tree was implemented in C++, and the experiments
were run on a Pentium III 800 MHz with 256 Mbytes of
RAM. The I/O device was an E-IDE Ultra ATA/66
7200 RPM hard disk. Although experiments were performed
on a variety of both real and synthetic datasets, for reasons
of space, this paper shows measurements obtained with the
following three real datasets.
EnglishWords - A set of 25,143 words from the
English dictionary, using the Ledit distance
function. As its intrinsic dimensionality is D=4.75
[13], the cardinality of G used for this dataset is 5.
Faces - A set of 11,900 faces from the Informedia
project at Carnegie Mellon University. The faces
are described by 16-dimensional vectors, using the
Euclidean distance function. With D=5.27, the
cardinality of G is 6.
PortugueseWords - A set of 214,717 words from the
Portuguese dictionary. The distance function used
is Ledit, D=6.69, and the cardinality of G used for
this dataset is 7.
The number of distance calculations and the number of
disk accesses shown in this section represent the average
of the values obtained from sets of 500 queries. For
example, distance plots of range queries represent the
average number of distance calculations for 500 queries in
each query radius or database size shown in the plot. Time
measurements show the total time (in seconds) spent to
answer each set of 500 queries. The selection of each query
set is biased, i.e., objects sampled from the full dataset were
used as query centers. The node size of every structure is
4 kbytes.
5.1 - A comparison between the DF-tree, the Slim-tree,
the M-tree and the Omni-sequential
This section describes the results obtained comparing the
DF-tree with the Slim-tree, the M-tree, the Omni-sequential
and the sequential scanning. Figure 11 shows the results
from the comparison of the performance of the four MAMs
for the EnglishWords dataset to answer 500 range queries
with a radius equal to one, as the dataset is stored in ten
percentile increments of the full dataset. This figure shows
the plots of (a) the average number of distance calculations,
(b) the average number of disk accesses, and (c) the total
time. Figure 11(c) does not include a plot for the M-tree,
because it is implemented on another framework with much
slower timing, which would render a time comparison unfair
to the M-tree. As can be seen, all the MAMs display a
linear behavior; however, the slope of the number of
distance calculations in the DF-tree plot is significantly
less steep. In fact, the difference between the number of
distance calculations performed by the DF-tree and the
number performed by the others is quite significant. The
numbers of disk accesses of the DF-tree and the
Omni-sequential are equivalent, and both perform better
here than the M-tree, whereas, from this standpoint, the
Slim-tree excels, confirming its original purpose. However,
considering the overall time factor, which ultimately
summarizes disk, distance and internal logic, the DF-tree is
clearly the winner, performing almost twice as fast as the
Omni-sequential, which takes second place. We performed
the same experiments for radii 2, 4 and 5 and obtained
equivalent results. It must be noted that the most common
queries usually have small radii.
Figure 11 - Comparing the performance to answer 500 range
queries on the EnglishWords dataset indexed on the DF-tree, the
Slim-tree, the Omni-sequential, the M-tree and Sequential Scan,
for increasing dataset sizes. (a) Average number of distance
calculations per query, (b) Average number of disk accesses per
query. (c) Total time (500 queries) in seconds.
5.2 - Comparing the DF-tree and the Slim-tree
This section discusses the experiments involving
comparisons between the DF-tree, the Slim-tree, and the
sequential scanning. It will be shown that the DF-tree
requires fewer distance calculations to answer both range
and k-NN queries. Figures 12 and 13 show measurements
comparing the average number of distance calculations and
the total time required by both the Slim-tree and the DF-tree
to answer the same set of range and k-NN queries over the
full dataset.

Figure 12 - Distance calculations and total time comparing
sequential scanning and the Slim-tree with the DF-tree for the
following datasets. EnglishWords: (a) average number of distance
calculations for each range query; (b) average number of distance
calculations for each k-NN query; (d) total time for 500 range
queries for varying range radii; (e) total time for 500 k-NN queries
varying the number of neighbors. Faces: (c) average number of
distance calculations for each range query; (f) total time for 500
range queries for varying range radii. The averages are over 500
queries.

Figure 12(a) shows the average number of distance
calculations for range queries on the EnglishWords dataset,
while figure 12(d) shows the total time for these
experiments at each given range radius. Figure 12(b)
shows the average number of distance calculations for
k-NN queries in the EnglishWords dataset, and figure 12(e)
shows the total time for these experiments at each given
number of neighbors. Figure 12(c) shows the average
number of distance calculations for range queries in the
Faces dataset, while figure 12(f) shows the total time for
these experiments at each given range radius. These
graphs show the meaningful improvement achieved by the
DF-tree in such situations. The results obtained are
corroborated by the prunability measurements of these
MAMs, as shown in Section 5.3.
Figure 13(a) presents the average number of distance
calculations for k-NN queries in the full Faces dataset,
regarding the Slim-tree and the DF-tree, emphasizing that
both trees present a sub-linear behavior; in figure 13(d) the
sequential scanning results are also given. Figure 13(c)
depicts the total time for these experiments for different
numbers of neighbors for both trees and the same sub-linear
behavior is achieved. Figure 13(f) includes the results for
the sequential scanning. Figures 13(b) and (e) show the
average number of disk accesses for k-NN queries on the
Faces dataset. As predicted, the average number of disk
accesses required by the DF-tree is greater than that
required by the Slim-tree. However, the gain in the number
of distance calculations offsets this increase, and the total
time shown in figures 13(c) and (f) confirms the superior
overall performance of the DF-tree.
Figure 13 - Comparing the Slim-tree, the DF-tree and the
sequential scan for 500 k-NN queries on the Faces dataset, for
varying numbers of neighbors: (a), (d) average number of
distance calculations; (b), (e) average number of disk accesses;
(c), (f) total time (in seconds).

Figures 12 and 13 show the impressive drop in the
average number of distance calculations and in the total
time required to answer similarity queries, confirming that
the DF-tree outperforms the Slim-tree, requiring less than
25% of the number of distance calculations required by the
Slim-tree. The DF-tree also processes small-radius range
queries up to 3 times faster than the Slim-tree does.
The same behavior is obtained for few neighbors in k-NN
queries over small portions of the stored dataset. Queries
over small portions of the stored dataset are usual in
practical situations. Moreover, the graphs show that the
sub-linear behavior of the DF-tree with increasing radii
sizes or number of neighbors is maintained.
5.3 - Prunability
Table 2 shows the prunability of the DF-tree compared to
that of the Slim-tree. Each measurement represents the
average pruning obtained by the application of 500 range
queries with a radius equal to 1% of the dataset diameter
for the Faces dataset and one letter of distance for the
EnglishWords dataset. Both radii are examples of small
values, which are the most common in real queries. We
tested other small values and the results were similar, so
they are not presented herein. The measurements also
included 500 k-NN queries asking for 5 neighbors. It must be
noted that the DF-tree requires one level more than the
Slim-tree to store the 25,143 objects of the EnglishWords
dataset. From this table one can see that, using only the
node representative, the Slim-tree does almost no pruning
at the leaf level (level 2). Nonetheless, the DF-tree
maintains considerable prunability.
                   Range queries               k-Nearest neighbor queries
             EnglishWords     Faces          EnglishWords     Faces
  level      Slim-   DF-    Slim-   DF-     Slim-   DF-    Slim-   DF-
             tree    tree   tree    tree    tree    tree   tree    tree
  0 (root)   1.00    1.00   0.73    0.99    1.00    1.00   0.75    0.82
  1          0.71    0.94   0.33    0.76    0.89    0.98   0.35    0.76
  2          0.004   0.56   0.005   0.51    0.001   0.80   0.02    0.31
  3                  0.18                           0.07

Table 2 - Prunability of the Slim-tree compared with the
prunability of the DF-tree.
Consider, for example, the prunability at the leaf level
of the tree built over the Faces dataset when range queries
are executed. In this case, for each set of 200 distance
calculations that the single node representative of the
Slim-tree cannot prune (prunability = 0.005), an average of only
one will find an object that is really part of the answer. In
contrast, for every two distance calculations that the node
representative and the global representatives of the DF-tree
cannot prune (prunability=0.56), an average of more than
one is part of the answer. The difference in prunability is
less dramatic at the upper levels of the tree; however, since
pruning of whole branches provided at the upper levels is
more valuable, the impact of increased prunability is always
worthwhile.
5.4 - Scalability
The DF-tree scales linearly with the dataset size regarding
the number of distance calculations, the number of disk
accesses and the time to answer range queries. Figure 14
illustrates this statement for our largest dataset, the
PortugueseWords dataset, with increments of 10% up to the full
dataset. The values shown represent the average number of
distance calculations, the average number of disk accesses,
and the total time to answer 500 range queries with a radius
rq=1.
Figure 14 - Linear behavior of the DF-tree when answering 500
range queries with radius 1 on the PortugueseWords dataset,
regarding: (a) average number of distance calculations, (b)
average number of disk accesses, (c) total time.
5.5 - Automatic fine-tuning provided by the WU
algorithm
The WU algorithm detects the degradation of G after
considerable changes to the tree. Our real datasets are
randomly distributed over their entire size. Selecting G at
the beginning of the tree construction usually yields a poor
set, so the WU algorithm should find a better G as the tree
grows. Figure 15(a) shows the plots for the average
number of distance calculations with increasing database
size when answering range queries of radius 1 on the
DF-tree for the EnglishWords dataset. The first plot shows
the average number of distance calculations using the
early-chosen G, and the second plot shows this number when
the algorithm is allowed to update the G set with a threshold
of 2000.

Figure 15 - Range queries of radius 1 on the EnglishWords
dataset without changing representatives, and changing
representatives using a threshold equal to 2000: (a) average
number of distance calculations, and (b) total time.

In this second plot the algorithm triggered two
times: after 1,083 objects were inserted (before the first
point in the plots of figure 15(a) was measured), and after
24,510 objects were inserted (before the last point in the
plots of figure 15 was measured). The time spent to update
G was 0.2 sec the first time and 1.88 sec the second time.
As we can see, even for a dataset that has no emphatic
change in its tendency, the improvement is steady over a
large range of dataset size, both in number of distance
calculations and time. Note that the time measurements,
shown in figure 15(b), follow the graphs in figure 15(a).
6 - Conclusions
This paper presented new techniques aiming to improve the
efficiency of MAMs to answer similarity queries. Based on
them, we developed the DF-tree, which is dynamic and
takes advantage of using multiple global representatives.
The DF-tree offers substantial speed-ups, being up to 3
times faster than the state of the art in wall-clock time. The
gain in the number of distance calculations is also
impressive, requiring less than a quarter of the
computations to answer similarity queries. This
improvement is achieved by taking advantage of a set of
global representatives, which increase the prunability
without interfering with the construction algorithm of the
tree, because they only have to be calculated when the first
query is asked. Moreover, the set of global representatives
can be changed at any time without ever disrupting the
response to ongoing queries over the tree. Additional
contributions of this paper are:
- an algorithm to automatically detect when the global
representatives require updating;
- an inexpensive algorithm to update the set of global
representatives.
We have also presented the new measurement, called
"prunability", whose purpose is to evaluate how efficiently a
set of representatives prunes distance calculations at
each level of the tree. Using this measurement, we have found
that the use of a single representative works well at the
upper levels of the existing MAMs, but is less effective at the
lower levels. The proposed DF-tree, however, is capable of
continuously pruning at high rates at every level of the tree.
Metric access methods have been intensely developed
in recent years and now, with the proposed DF-tree, they
have reached a level of performance that qualifies them for
inclusion among the indexing methods used in current
commercial database management systems, broadening
their support to include metric datasets. The DF-tree
makes it possible to support queries by content over large sets
of images, time sequences and genetic data. This support
still requires further development to enable MAMs to
operate in the open transactional environment of
commercial DBMSs, including concurrent operations,
generation of and recovery from logs, and interactions with
selection clauses involving non-metric data. The concept of
using global representatives to prune distance calculations
accelerates query processing enough to make it practical
without, however, increasing the complexity of the features
yet to be developed.
7 - References
[1] S. Berchtold, D. A. Keim, H.-P. Kriegel, “The X-tree: An
Index Structure for High-dimensional data,” in VLDB 1996,
pp. 28-39.
[2] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, "When is
"Nearest Neighbor" Meaningful?," in ICDT 1999, pp. 217-235.
[3] T. Bozkaya and Z. M. Özsoyoglu, “Distance-Based Indexing
for High-Dimensional Metric Spaces,” in ACM SIGMOD
1997, pp. 357-368.
[4] S. Brin, “Near neighbor search in large metric spaces,” in
VLDB 1995, pp. 574-584.
[5] W. A. Burkhard and R. M. Keller, "Some Approaches to Best-Match File Searching," CACM, vol. 16, pp. 230-236, 1973.
[6] K. Chakrabarti and S. Mehrotra, “The Hybrid Tree: An Index
Structure for High Dimensional Feature Spaces,” in IEEE
ICDE 1999, pp. 440-447.
[7] P. Ciaccia, M. Patella, P. Zezula, “M-tree: An efficient access
method for similarity search in metric spaces,” in VLDB
1997, pp. 426-435.
[8] C. Faloutsos, B. Seeger, A. J. M. Traina, C. Traina, Jr.,
“Spatial Join Selectivity Using Power Laws,” in ACM
SIGMOD 2000, pp. 177-188.
[9] V. Gaede and O. Günther, “Multidimensional Access
Methods,” ACM Computing Surveys, vol. 30, pp. 170-231,
1998.
[10] N. Katayama and S. Satoh, "The SR-tree: An Index
Structure for High-Dimensional Nearest Neighbor Queries,”
in ACM SIGMOD 1997, pp. 369-380.
[11] K.-I. D. Lin, H. V. Jagadish, C. Faloutsos, “The TV-Tree: An
Index Structure for High-Dimensional Data,” VLDB Journal,
vol. 3, pp. 517-542, 1994.
[12] R. F. Santos Filho, A. J. M. Traina, C. Traina, Jr., C.
Faloutsos, “Similarity Search without Tears: The OMNI
Family of All-purpose Access Methods,” in ICDE 2001, pp.
623-630.
[13] C. Traina, Jr., A. J. M. Traina, C. Faloutsos, "Distance
exponent: a new concept for selectivity estimation in metric
trees," Research Paper CMU-CS-99-110, March 1999.
[14] C. Traina, Jr., A. J. M. Traina, B. Seeger, C. Faloutsos,
“Slim-Trees: High Performance Metric Trees Minimizing
Overlap Between Nodes,” in EDBT 2000, pp. 51-65.
[15] J. K. Uhlmann, “Satisfying General Proximity/Similarity
Queries with Metric Trees," Information Processing Letters,
vol. 40, pp. 175-179, 1991.
[16] P. N. Yianilos, “Data Structures and Algorithms for Nearest
Neighbor Search in General Metric Spaces,” in
ACM/SIGACT-SIAM - SODA 1993, pp. 311-321.