An approach to the model-based fragmentation
and relational storage of XML-documents
Christian Süß
Fakultät für Mathematik und Informatik, Universität Passau, D-94030 Passau, Germany
Abstract
A flexible method for storing XML documents in relational or object-relational databases is presented
that is based on an adaptable fragmentation. Whereas most known approaches decompose XML documents
into minimal units, we propose to store fragments of variable granularity, ranging from single
elements to whole documents. Different fragmentation strategies, depending on the specific access and
query requirements, can be applied to the same XML documents. Experiments have shown that the
response times are much better than those for the complete decomposition. Furthermore, our storage model,
which is based on directed acyclic graphs, facilitates the reuse of XML subdocuments and supports
different views on XML documents.
1 Introduction
Today, there exist numerous approaches to storing and querying XML data. Besides storing XML data in
the file system, which is straightforward but does not support querying, object-oriented database
systems, systems based on semi-structured data models, and native XML systems are available. However,
comparative studies [4, 6, 7, 8] indicate that relational or object-relational database systems are still very
competitive. In general, there are many ways to store XML data in relational or object-relational database
systems. For example, in [1, 2] the user or database administrator can decide how to store XML elements
in relational tables. Appropriate relational schemata can also be derived automatically from a given XML
schema, e.g. a DTD [3, 8].
There are also generic approaches that store documents without any user interaction, do not require any
kind of schema, and support the storage and retrieval of different types of XML documents (e.g.
XSL documents) using the same relational schema. For example, [4] presents different strategies
to completely decompose arbitrary XML documents into relational tables. They show a good overall
performance, but reconstructing complete documents is rather expensive. In relational databases, clustering
and indexing are applied to compensate for the performance loss caused by the (more or less) rigorous
decomposition of the relational schemata which is required for normalization purposes. However, up to now it is
not clear what clustering and indexing of XML documents exactly means. In this paper we propose a flexible
fragmentation of XML documents that at least avoids unnecessary joins when reconstructing frequently
accessed parts of the document. Fragmentation has also been suggested in [9]. However, [9] relies on a
fragmentation specification supplied by the user and stored in the source document or the DTD, whereas our
generic approach is completely independent of such hard-wired directives. Furthermore, in our approach
multiple fragmentation strategies can be defined and applied to the document whenever appropriate. The
fragmentation strategies presented in this paper are guided by an underlying domain model which is used
to specify the fragments at a high level of abstraction. In contrast to [4], we do not lose the difference
between subelements and attributes or between subelements and references. Furthermore, when applying
our approach to object-relational databases we can benefit from specific user-defined datatypes [8] and
indexing techniques [1, 2]. Finally, our storage model is based on directed acyclic graphs. In contrast to
tree-based storage structures, it facilitates the reuse of XML subdocuments and supports different views on
XML documents.
The rest of this paper is organized as follows: In section 2 we formally define the fragmentation of XML
documents. In section 3 relational database schemata as well as algorithms to store and retrieve XML fragments are introduced. Section 4 focuses on model-based fragmentation strategies and presents experimental
results. The paper concludes with a short summary in section 5.
2 Fragmentation of XML documents
Definition 1 (Notation for XML documents) Let doc be an XML document. Then elements(doc) denotes
the set of elements in doc where elements having the same tag name are identified by unique subindices.
tree(doc) is the tree structure of the elements defined by doc. root(doc) is the root element of doc and
therefore of tree(doc), too. Let e ∈ elements(doc) be an element of doc. Then doc(e) denotes the subdocument of doc having e as its root element. tag(e) is the tag name of e. value(e, attr) is the value of the
attribute attr of e. xml(e) denotes the XML representation or serialization of e in doc including the opening
and closing tags of e and all of its contents. In particular, xml(root(doc)) is the XML representation of the
entire document doc. children(e) ⊆ elements(doc) denotes the set of elements directly contained in e excluding e itself. Conversely parent(e) ∈ elements(doc) denotes the element which contains e. Obviously,
parent is not defined for the root element.
□
<course title="XML Tutorial">
  <section title="Introduction">
    <motivation>
      <text>XML is needed for many reasons...</text>
      <image src="xml.gif"/>
    </motivation>
  </section>
  <section title="Basics">
    <definition>
      <text>An XML document is...</text>
    </definition>
    <example>
      <text>Consider the following XML document...</text>
    </example>
  </section>
</course>
Figure 1: Sample XML document
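The notation of definition 1 can be tried out on the sample document of figure 1. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the document model; the names DOC, children, value and the parent dictionary are illustrative choices and not part of the approach itself.

```python
import xml.etree.ElementTree as ET

# The sample document of figure 1.
DOC = """<course title="XML Tutorial">
<section title="Introduction">
<motivation>
<text>XML is needed for many reasons...</text>
<image src="xml.gif"/>
</motivation>
</section>
<section title="Basics">
<definition>
<text>An XML document is...</text>
</definition>
<example>
<text>Consider the following XML document...</text>
</example>
</section>
</course>"""

root = ET.fromstring(DOC)        # root(doc)
elements = list(root.iter())     # elements(doc), in document order

# ElementTree keeps no parent pointers, so parent(e) is held in a dictionary.
# The root element has no entry: parent is not defined for it.
parent = {c: p for p in elements for c in p}


def children(e):
    """children(e): the elements directly contained in e."""
    return list(e)


def value(e, attr):
    """value(e, attr): the value of attribute attr of e."""
    return e.get(attr)


motivation = root.find("./section/motivation")
print(root.tag, value(root, "title"))          # tag and title of root(doc)
print([c.tag for c in children(motivation)])   # ['text', 'image']
print(parent[motivation].get("title"))         # Introduction
print(root in parent)                          # False
```

Elements with the same tag name are told apart by object identity here, which plays the role of the unique subindices mentioned in definition 1.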
Definition 2 (Fragment) Let doc be an XML document. Then f ⊆ elements(doc) is a fragment of doc
iff the subgraph of tree(doc) which is induced by f is connected. This subgraph forms a tree denoted by
tree(f ) and having the root element root(f ).
□
Definition 3 (Fragmentation) Let doc be an XML document. A fragmentation F = {f1, ..., fn} of doc
is a partitioning of elements(doc) into fragments f1 to fn, i.e., the fi are pairwise disjoint and their union
equals elements(doc). roots(F) denotes the set of root elements of the fragments in F. The elements of
roots(F) uniquely determine F. We assume that each fragment fi in F, 1 ≤ i ≤ n, has a unique identifier
id(fi). The XML representation xml(f) of a fragment f ∈ F is the result of replacing in xml(root(f)) the
XML representation xml(e) of the root element e of any other fragment g ∈ F occurring as a subtree in f
by the element <tag(e) fragment-id="id(g)"/>, where tag(e) is the tag name of the root element of g
and id(g) is the unique identifier of g. Thus, essentially every fragment subtree is replaced by a reference to
the fragment.
□
Let doc be an XML document. Let F be a fragmentation of doc. Then the tree structure tree(doc) induces
a graph graph(F ) on the fragments of F which is a tree, too, because in XML each element can only be
contained in exactly one other element. However, we allow directed acyclic fragmentation graphs, in which
each fragment can have more than one parent fragment.
Definition 4 (Graph of a fragmentation) Let F be a fragmentation and f, g ∈ F . Then f is called a
parent fragment of g and g a child fragment of f in F iff in the tree of the original document doc it holds
that parent(root(g)) ∈ f . children(f ) denotes the child fragments of f in F and parents(f ) denotes the
parent fragments of f in F .
□
Example 5 (Fragments) Figure 2 shows the tree structure of the XML document of figure 1 using four
fragments with root elements course, motivation, definition and example identified by the unique identifiers
1 through 4. The XML representation of fragment 1 containing three references to the child fragments can
be found in figure 3.
□
[Figure 2: Tree structure with four fragments. Fragment 1: course and its two section elements; fragment 2: motivation (text, image); fragment 3: definition (text); fragment 4: example (text).]
<course title="XML Tutorial">
  <section title="Introduction">
    <motivation fragment-id="2" />
  </section>
  <section title="Basics">
    <definition fragment-id="3" />
    <example fragment-id="4" />
  </section>
</course>
Figure 3: XML representation of a fragment
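The construction of xml(f) in definition 3 can be illustrated with a short sketch. It assumes Python's xml.etree.ElementTree, a whitespace-free copy of the sample document, and identifiers assigned to the fragment roots in document order; the helper names fragment_tree and fragment_xml are hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample document of figure 1, written without whitespace between tags.
DOC = (
    '<course title="XML Tutorial">'
    '<section title="Introduction"><motivation>'
    '<text>XML is needed for many reasons...</text><image src="xml.gif"/>'
    '</motivation></section>'
    '<section title="Basics">'
    '<definition><text>An XML document is...</text></definition>'
    '<example><text>Consider the following XML document...</text></example>'
    '</section></course>'
)


def fragment_tree(e, ids):
    """Copy e, replacing each nested fragment root by a reference element."""
    copy = ET.Element(e.tag, dict(e.attrib))
    copy.text, copy.tail = e.text, e.tail
    for c in e:
        if c in ids:  # c is the root of another fragment g
            ref = ET.SubElement(copy, c.tag, {"fragment-id": str(ids[c])})
            ref.tail = c.tail
        else:
            copy.append(fragment_tree(c, ids))
    return copy


def fragment_xml(f_root, ids):
    """xml(f) for the fragment whose root element is f_root."""
    tree = fragment_tree(f_root, ids)
    tree.tail = None  # text following the fragment belongs to its parent
    return ET.tostring(tree, encoding="unicode")


doc_root = ET.fromstring(DOC)
ROOT_TAGS = {"course", "motivation", "definition", "example"}  # roots(F) of example 5
roots = [e for e in doc_root.iter() if e.tag in ROOT_TAGS]
ids = {e: i for i, e in enumerate(roots, start=1)}  # id(f), in document order

xml_of = {ids[e]: fragment_xml(e, ids) for e in roots}
print(xml_of[1])
```

For fragment 1 this prints the content of figure 3 on a single line, with the three child fragments replaced by references.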
3 Relational storage of XML fragments
Definition 6 (Relational schema) To store the graph of a fragmentation in a relational database we use a
relational schema consisting of the three tables fragment(id, tag, xml), attribute(id, name, value) and
child(parId, childId, pos) where underlined attributes denote primary keys. Attribute id of table attribute
and attribute parId as well as attribute childId of table child are foreign keys of the table fragment. Attribute
xml is of a type appropriate to store large character sequences (e.g. CLOB). Note that we show only the
essential attributes.
□
Algorithm 7 (Storage of an XML document) Let doc be an XML document and let F be a fragmentation
of doc. Then doc is stored according to F in a relational database using the schema of definition 6 by the
following algorithm:
1. For each fragment f ∈ F , insert into table fragment the tuple (id(f ), tag(root(f )), xml(f )).
2. For each attribute-value pair name=value of a root element root(f ) of a fragment f , insert into table
attribute the tuple (id(f ), name, value).
3. For each pair of fragments f and g, where g is the i-th child fragment of f according to the element
ordering in the original document, insert into table child the tuple (id(f ), id(g), i).
□
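The three steps of algorithm 7 can be sketched as follows. This is an illustration under several assumptions, not the implementation used in the experiments: it uses Python's sqlite3 and xml.etree.ElementTree modules, stores the xml column as TEXT instead of CLOB, and guesses the primary keys (id; id and name; parId and pos) because the key markings of definition 6 are not reproduced here. The names store and fragment_tree are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = (
    '<course title="XML Tutorial">'
    '<section title="Introduction"><motivation>'
    '<text>XML is needed for many reasons...</text><image src="xml.gif"/>'
    '</motivation></section>'
    '<section title="Basics">'
    '<definition><text>An XML document is...</text></definition>'
    '<example><text>Consider the following XML document...</text></example>'
    '</section></course>'
)

# Schema of definition 6; the primary keys are assumptions of this sketch.
SCHEMA = """
CREATE TABLE fragment (id INTEGER PRIMARY KEY, tag TEXT, xml TEXT);
CREATE TABLE attribute (id INTEGER REFERENCES fragment(id), name TEXT,
                        value TEXT, PRIMARY KEY (id, name));
CREATE TABLE child (parId INTEGER REFERENCES fragment(id),
                    childId INTEGER REFERENCES fragment(id),
                    pos INTEGER, PRIMARY KEY (parId, pos));
"""


def fragment_tree(e, ids):
    """Copy e, replacing each nested fragment root by a reference element."""
    copy = ET.Element(e.tag, dict(e.attrib))
    copy.text, copy.tail = e.text, e.tail
    for c in e:
        if c in ids:
            ET.SubElement(copy, c.tag, {"fragment-id": str(ids[c])}).tail = c.tail
        else:
            copy.append(fragment_tree(c, ids))
    return copy


def store(doc_root, root_tags, conn):
    """Store the document, using the given tag names to select fragment roots."""
    elements = list(doc_root.iter())
    parent = {c: p for p in elements for c in p}
    roots = [e for e in elements if e is doc_root or e.tag in root_tags]
    ids = {e: i for i, e in enumerate(roots, start=1)}
    next_pos = {}
    for f_root in roots:  # document order, so positions follow element order
        tree = fragment_tree(f_root, ids)
        tree.tail = None
        # Step 1: one tuple per fragment.
        conn.execute("INSERT INTO fragment VALUES (?, ?, ?)",
                     (ids[f_root], f_root.tag, ET.tostring(tree, encoding="unicode")))
        # Step 2: the attributes of the fragment's root element.
        for name, val in f_root.attrib.items():
            conn.execute("INSERT INTO attribute VALUES (?, ?, ?)",
                         (ids[f_root], name, val))
        # Step 3: link the fragment to the fragment containing parent(root(g)).
        if f_root is not doc_root:
            owner = parent[f_root]
            while owner not in ids:
                owner = parent[owner]
            pos = next_pos[owner] = next_pos.get(owner, 0) + 1
            conn.execute("INSERT INTO child VALUES (?, ?, ?)",
                         (ids[owner], ids[f_root], pos))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store(ET.fromstring(DOC), {"motivation", "definition", "example"}, conn)
print(conn.execute("SELECT parId, childId, pos FROM child ORDER BY pos").fetchall())
```

With the fragmentation of example 5, the child table receives the three tuples (1, 2, 1), (1, 3, 2) and (1, 4, 3).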
Example 8 (Storage of an XML document) Figure 4 shows the extension of the tables after storing the
XML document of figure 1 according to the fragmentation shown in figure 2. For the complete contents of
column xml in the first row of table fragment see figure 3.
□
Algorithm 9 (Retrieval of a fragment) Let the XML document doc be stored as described in algorithm 7
according to a fragmentation F. Let e = root(f) be the root element of a fragment f ∈ F. Then we obtain
the subdocument xml(e), which contains e and all its XML contents, using the following algorithm:
1. Execute the SQL query SELECT xml FROM fragment WHERE id = id(f). The result of this
query is the XML representation xml(f) of fragment f.
2. Replace each element <tag(e2) fragment-id="id"/> by the XML representation xml(e2) of the
root element e2 of the fragment with identifier id, obtained by a recursive application of this algorithm.
□
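The recursion of algorithm 9 can be sketched as below, assuming Python's sqlite3 and xml.etree.ElementTree modules and a fragment table filled by hand with the four fragments of example 5. The function names retrieve_tree and retrieve are hypothetical. The sketch also assumes that the source documents never use an attribute named fragment-id themselves, since such an element would be mistaken for a reference.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fragment (id INTEGER PRIMARY KEY, tag TEXT, xml TEXT)")
conn.executemany("INSERT INTO fragment VALUES (?, ?, ?)", [
    (1, "course",
     '<course title="XML Tutorial"><section title="Introduction">'
     '<motivation fragment-id="2" /></section><section title="Basics">'
     '<definition fragment-id="3" /><example fragment-id="4" /></section></course>'),
    (2, "motivation",
     '<motivation><text>XML is needed for many reasons...</text>'
     '<image src="xml.gif" /></motivation>'),
    (3, "definition", '<definition><text>An XML document is...</text></definition>'),
    (4, "example",
     '<example><text>Consider the following XML document...</text></example>'),
])


def retrieve_tree(conn, frag_id):
    """Return the element tree of the fragment with all references expanded."""
    # Step 1: fetch xml(f).
    row = conn.execute("SELECT xml FROM fragment WHERE id = ?", (frag_id,)).fetchone()
    tree = ET.fromstring(row[0])
    # Step 2: replace every reference element by the recursively retrieved subtree.
    for p in list(tree.iter()):
        for i, c in enumerate(list(p)):
            ref = c.get("fragment-id")
            if ref is not None:
                sub = retrieve_tree(conn, int(ref))
                sub.tail = c.tail
                p[i] = sub
    return tree


def retrieve(conn, frag_id):
    """xml(e) for the root element e of the fragment with identifier frag_id."""
    return ET.tostring(retrieve_tree(conn, frag_id), encoding="unicode")


document = retrieve(conn, 1)
print(document)
```

Retrieving fragment 1 issues one query per fragment, four in total here, which is the cost that coarser fragmentations reduce.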
According to algorithm 9 tables attribute and child are not necessary for retrieving a document, because
their information is also contained in the XML representation of the fragments. Nevertheless they are important for the efficient retrieval of fragments and navigation in the fragmentation graph.
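As an illustration of how the tables attribute and child might be used, the following sketch runs three queries against the extension shown in figure 4. It assumes Python's sqlite3 module; the queries themselves are hypothetical examples of fragment lookup and graph navigation and are not taken from the experiments. The xml column is abbreviated as in figure 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fragment (id INTEGER PRIMARY KEY, tag TEXT, xml TEXT);
CREATE TABLE attribute (id INTEGER, name TEXT, value TEXT);
CREATE TABLE child (parId INTEGER, childId INTEGER, pos INTEGER);
INSERT INTO fragment VALUES (1, 'course', '<course...'),
                            (2, 'motivation', '<motivation>...'),
                            (3, 'definition', '<definition>...'),
                            (4, 'example', '<example>...');
INSERT INTO attribute VALUES (1, 'title', 'XML Tutorial');
INSERT INTO child VALUES (1, 2, 1), (1, 3, 2), (1, 4, 3);
""")

# Find a fragment by tag name and attribute value without parsing any XML.
by_title = conn.execute(
    "SELECT f.id FROM fragment f JOIN attribute a ON a.id = f.id "
    "WHERE f.tag = 'course' AND a.name = 'title' AND a.value = ?",
    ("XML Tutorial",)).fetchall()

# children(f): the child fragments of fragment 1 in document order.
kids = conn.execute(
    "SELECT c.childId, f.tag FROM child c JOIN fragment f ON f.id = c.childId "
    "WHERE c.parId = ? ORDER BY c.pos", (1,)).fetchall()

# parents(f): the parent fragments of fragment 3.
parents = conn.execute(
    "SELECT parId FROM child WHERE childId = ?", (3,)).fetchall()

print(by_title, kids, parents)
```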
fragment
id  tag         xml
1   course      <course...
2   motivation  <motivation>...
3   definition  <definition>...
4   example     <example>...

attribute
id  name   value
1   title  XML Tutorial

child
parId  childId  pos
1      2        1
1      3        2
1      4        3

Figure 4: Extension of tables for fragmented sample XML document
[Figure 5: Relative retrieval time. (a) Uninformed Fragmentation: relative retrieval time versus levels per fragment (1 to 14). (b) Model-Based Fragmentation: relative retrieval time versus chapters per document (1 to 5) for the fragmentations Single Elements, ContentModules, Cou/Sec/Ex, Cou/Sec, Course and 1 large Fragment.]
4 Model-based fragmentation strategies
Definition 10 (Fragmentation strategy) Let doc be an XML document. A fragmentation strategy S ⊆
elements(doc) is a subset of the set of elements of doc specifying the root elements of the fragments.
□
Strategies specify those elements which are stored in separate fragments. They use predicates which every
element has to satisfy to qualify as a root element of a fragment, e.g. match patterns for tag names and
attribute values of elements (see example 12) or more sophisticated structural conditions (see example 11).
Strategies can also be based on the nesting depth, i.e., how many levels of nested elements are stored in a
fragment.
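A strategy based on the nesting depth could look like the following sketch. It assumes one possible reading of "levels per fragment", namely that an element becomes a fragment root whenever its depth is a multiple of k, with the document root at depth 0; the function name depth_strategy is hypothetical and Python's xml.etree.ElementTree is used as the document model.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<course title="XML Tutorial">'
    '<section title="Introduction"><motivation>'
    '<text>XML is needed for many reasons...</text><image src="xml.gif"/>'
    '</motivation></section>'
    '<section title="Basics">'
    '<definition><text>An XML document is...</text></definition>'
    '<example><text>Consider the following XML document...</text></example>'
    '</section></course>'
)


def depth_strategy(root, k):
    """Assumed rule: e is a fragment root iff depth(e) is a multiple of k."""
    roots = []

    def walk(e, depth):
        if depth % k == 0:
            roots.append(e)
        for c in e:
            walk(c, depth + 1)

    walk(root, 0)
    return roots


doc_root = ET.fromstring(DOC)
for k in (1, 2, 4):
    print(k, [e.tag for e in depth_strategy(doc_root, k)])
```

On the four-level sample document, k = 1 yields the complete decomposition into ten single-element fragments, k = 2 happens to yield the four fragment roots of example 5, and k = 4 stores the whole document in one fragment.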
Figure 5 (a) shows the experimental results of retrieving an entire XML document having 14 levels of element
nesting. As expected, retrieval with the rightmost strategy, where all 14 levels, i.e., the whole document,
are stored in one fragment, is about 15 times as fast as retrieval with the leftmost strategy, where each
fragment contains only one element, i.e., the document is completely decomposed. These strategies are
uninformed and therefore produce unpredictable fragmentations resulting in the different response times
shown in figure 5 (a).
To meet the specific access and query requirements of a given application we define fragmentation strategies
which are guided by the domain model of the application. For example, figure 6 shows a simplified part of
the teachware model presented in [10] which describes learning material using specialized DTDs [5] (see the
sample document in figure 1).
[Figure 6: Domain model for teachware. Classes shown: Course, Module, StructureModule (section), ContentModule (motivation, definition, example, paragraph, exercise, illustration, remark).]
Example 11 (Sequential Leaf Access) From the teachware domain model we know that a learner mostly
accesses the "leaf sections" of a course document doc in a sequential way. Those sections directly contain at
least one ContentModule while their enclosing sections do not directly contain any ContentModule. To meet
this specific access requirement, we now define a fragmentation strategy S1 to store such "leaf sections" in
separate fragments:
S1 = {root(doc)} ∪ {e ∈ elements(doc) |
       tag(e) = 'section' ∧ (∃c ∈ children(e) : tag(c) ∈ CM)
       ∧ (∀p ∈ ancestors(e) : ¬(∃c ∈ children(p) : tag(c) ∈ CM))}
ancestors(e) is the set of all ancestors of e in tree(doc) and CM = {motivation, definition, ...} is the set of
all tag names of ContentModule elements according to the given model.
□
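The strategy S1 can be written as a predicate over the document tree. The sketch below assumes Python's xml.etree.ElementTree, takes the members of CM from the ContentModule names of figure 6, and uses a small hypothetical course document with nested sections, because the sections of figure 1 are not nested. The helper names are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed to be the full set CM, read off figure 6.
CM = {"motivation", "definition", "example", "paragraph",
      "exercise", "illustration", "remark"}

# Hypothetical document: section A only groups the leaf sections A.1 and A.2.
DOC = (
    '<course>'
    '<section title="A">'
    '<section title="A.1"><definition/></section>'
    '<section title="A.2"><example/></section>'
    '</section>'
    '<section title="B"><motivation/></section>'
    '</course>'
)


def has_cm_child(e):
    """True iff e directly contains at least one ContentModule."""
    return any(c.tag in CM for c in e)


def ancestors(e, parent):
    """Yield all ancestors of e, from its parent up to the root."""
    while e in parent:
        e = parent[e]
        yield e


def s1(root):
    """Fragment roots according to S1, in document order."""
    elements = list(root.iter())
    parent = {c: p for p in elements for c in p}
    return [e for e in elements
            if e is root
            or (e.tag == "section"
                and has_cm_child(e)
                and not any(has_cm_child(p) for p in ancestors(e, parent)))]


doc_root = ET.fromstring(DOC)
print([e.get("title") for e in s1(doc_root)])  # [None, 'A.1', 'A.2', 'B']
```

Section A is not selected because it contains no ContentModule directly, so its content stays in the fragment of the course element.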
Example 12 (Supporting Reuse of Modules) We know that authors usually reuse ContentModules in more
than one course document. Thus, we define a corresponding strategy S2 = {e ∈ elements(doc) | tag(e) ∈
CM} which stores each ContentModule in a separate fragment. Note that our storage model, which is
based on directed acyclic graphs (see definition 4), directly supports the reuse of XML subdocuments.
□
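Applied to the sample document of figure 1, S2 selects exactly the three ContentModule elements. The following sketch assumes Python's xml.etree.ElementTree and again takes the members of CM from figure 6; the function name s2 is illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed to be the full set CM, read off figure 6.
CM = {"motivation", "definition", "example", "paragraph",
      "exercise", "illustration", "remark"}

DOC = (
    '<course title="XML Tutorial">'
    '<section title="Introduction"><motivation>'
    '<text>XML is needed for many reasons...</text><image src="xml.gif"/>'
    '</motivation></section>'
    '<section title="Basics">'
    '<definition><text>An XML document is...</text></definition>'
    '<example><text>Consider the following XML document...</text></example>'
    '</section></course>'
)


def s2(root):
    """Fragment roots according to S2: every ContentModule element."""
    return [e for e in root.iter() if e.tag in CM]


doc_root = ET.fromstring(DOC)
print([e.tag for e in s2(doc_root)])  # ['motivation', 'definition', 'example']
```

Together with the document root, these are the four fragment roots of example 5.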
Figure 5 (b) shows the experimental results of retrieving complete course documents containing one to
five chapters, i.e., top-level sections. The response time for S2, depicted in the second line from the top, is
significantly lower than the response time of the complete decomposition, depicted in the line on top.
From examples 11 and 12 we can see that there can be more than one fragmentation strategy for a single
XML document. Both strategies S1 and S2 can be applied whenever appropriate. Moreover, we can define a
combined strategy S1 ∪ S2 which produces a finer-grained fragmentation and which facilitates the sequential
leaf access of example 11 as well as the reuse of ContentModules of example 12.
Our approach allows the granularity of fragments to be adjusted when appropriate. For example, statistical
information such as the most often reused subdocuments can be used to dynamically determine the appropriate
fragmentation strategy. Thus, documents can be re-fragmented and re-organized in storage when needed without
having to change the document source itself. This supports the cooperative authoring of an XML document
base.
Example 13 (Dynamically Changing Strategies) To support reuse at the storage layer and to improve S2
of example 12 accordingly, we define the strategy S3 = {e ∈ ReusedElements} where ReusedElements
is the dynamically changing set of reused elements.
□
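How ReusedElements is maintained is left open here. One conceivable way, sketched below under that caveat, is to read it off the child table: a stored fragment counts as reused when it has more than one parent fragment in the fragmentation graph. The sketch assumes Python's sqlite3 module and a hypothetical database in which two course fragments (identifiers 1 and 5) share the definition fragment 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE child (parId INTEGER, childId INTEGER, pos INTEGER);
-- Course fragment 1 with its three ContentModule fragments.
INSERT INTO child VALUES (1, 2, 1), (1, 3, 2), (1, 4, 3);
-- Hypothetical second course fragment 5 that reuses definition fragment 3.
INSERT INTO child VALUES (5, 3, 1), (5, 6, 2);
""")

# Assumed criterion: a fragment is reused iff it has more than one parent.
reused = [row[0] for row in conn.execute(
    "SELECT childId FROM child "
    "GROUP BY childId HAVING COUNT(DISTINCT parId) > 1 "
    "ORDER BY childId")]
print(reused)  # [3]
```

Re-running such a query after each authoring step would let the set, and with it the strategy S3, change over time without touching the document sources.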
5 Conclusion
In this paper we have presented a generic, model-based approach to the relational storage of XML documents.
Arbitrary XML documents are automatically stored in a database where they are clustered in fragments of
different sizes that can be tailored to specific access and query requirements. We have specified different
strategies which make use of information provided by an underlying model. Experimental results show that
corresponding queries can be answered more efficiently than when using a complete decomposition.
The storage model is based on directed acyclic graphs. In contrast to a tree model it directly supports multiple
hierarchies which facilitate the reuse of XML subdocuments and allow the definition of different views on
the same XML document.
Future work will focus on the concept of views, which could only be touched upon in this paper. Furthermore, we
will study the application of our approach to the modularization of XML documents.
References
[1] S. Banerjee et al. Oracle8i - The XML Enabled Data Management System. In Proc. ICDE 2000, San Diego, USA, 2000.
[2] J. M. Cheng and J. Xu. XML and DB2. In Proc. ICDE 2000, San Diego, USA, 2000.
[3] A. Deutsch et al. Storing Semistructured Data with STORED. In Proc. ACM SIGMOD, Philadelphia, PA, 1999.
[4] D. Florescu and D. Kossmann. A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. Technical Report 3680, INRIA, France, 1999.
[5] C. Süß. Learning Material Markup Language 1.1. http://daisy.fmi.uni-passau.de/pakmas/lmml/.
[6] A. Schmidt et al. Efficient Relational Storage and Retrieval of XML Documents. In Proc. WebDB 2000, Dallas, USA, 2000.
[7] J. Shanmugasundaram et al. Relational Databases for Querying XML Documents: Limitations and Opportunities. In Proc. 25th VLDB Conference, Edinburgh, Scotland, 1999.
[8] T. Shimura et al. Storage and Retrieval of XML Documents Using Object-Relational Databases. In Proc. DEXA '99, Florence, Italy, 1999.
[9] B. Surjanto et al. XML Content Management based on Object-Relational Database Technology. In Proc. WISE 2000, Hong Kong, 2000.
[10] C. Süß et al. Metamodeling for Web-Based Teachware Management. In Advances in Conceptual Modeling. ER'99 Workshop on the World-Wide Web and Conceptual Modeling, Paris, France, 1999.