What is XML?

Informatica II
Anisa Rula
[email protected]
••• ITIS Lab ••• http://www.itis.disco.unimib.it
Agenda




XML
XML Schema
XML Path
XML Query Language
What is XML?

XML (Extensible Markup Language) is a metadata language - a
language for providing data about data
• W3C Recommendation since February 1998

It looks a bit like HTML, but with XML the tags are user-defined
and therefore extensible
• HTML marks up logical presentation
• CSS specifies presentation style
• XML marks up meaning (semantics)


XML has no mechanism to specify the format for presenting data
to the user
Precise definition of valid tags and their grammar.
• Document Type Definitions (DTD).
• XML Schema Definition (XSD).

System-independent and vendor-independent.
• Product of the World Wide Web Consortium (W3C), trademarked by
MIT.

An XML document resides in its own file with an '.xml' extension
Why XML?



Separates content from presentation
General - can be applied to anything
Adds value to semi-structured data
• E.g. Product Catalogue


Enables an enterprise to mark up all its data
Using XML greatly simplifies encoding of data
• (cf. ad hoc text representations)

Ubiquitous - everybody is using it!
Where does XML fit?


Why not put everything in a relational or OO database?
XML is a global standard:
• offers better information transfer between different applications
and enterprises than proprietary databases

XML is flexible and easily applied
• (which also presents dangers - data does NOT become more valuable just because it is marked up in XML - the XML structures have to be well designed).
Applications of XML

Data-oriented languages
• Used in web services
• Communication between applications
• Data export from databases

Document-oriented languages
• To add structure to natural language text documents
• E.g. content for web pages (XHTML), lecture notes, product
catalogues


Emerging XML databases such as Xindice
http://xml.apache.org/xindice/ store XML directly (don't have
to map to relational DB)
Protocols and programming languages
• E.g. XML schema, XSLT, WSDL
XML Basic Syntax






An XML document consists of a number of declarations
followed by a tree of elements.
Every document must contain a root element
Each element is delimited between begin and end tags.
Each element may contain attributes
Elements may contain text or other elements (or a mixture
of the two)
Attributes may only contain text
XML Sample
<?xml version="1.0"?>
<PUBLICATION>
<TITLE>Why I am Overworked</TITLE>
<AUTHOR role="author">
<FIRSTNAME>Fred</FIRSTNAME>
<LASTNAME>Smith</LASTNAME>
<COMPANY>Jones and Associates</COMPANY>
</AUTHOR>
<ABSTRACT>This is the abstract</ABSTRACT>
</PUBLICATION>
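The sample above can be read back as a tree of elements. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the helper name `describe` is introduced here for illustration only and is not part of the lecture material.

```python
import xml.etree.ElementTree as ET

# The XML sample from the slide, indented for readability.
SAMPLE = """<?xml version="1.0"?>
<PUBLICATION>
  <TITLE>Why I am Overworked</TITLE>
  <AUTHOR role="author">
    <FIRSTNAME>Fred</FIRSTNAME>
    <LASTNAME>Smith</LASTNAME>
    <COMPANY>Jones and Associates</COMPANY>
  </AUTHOR>
  <ABSTRACT>This is the abstract</ABSTRACT>
</PUBLICATION>"""

root = ET.fromstring(SAMPLE)  # the root element of the document


def describe(element, depth=0):
    """Return one line per element: indentation, tag, attributes, text."""
    text = (element.text or "").strip()
    lines = [f"{'  ' * depth}{element.tag} {element.attrib} {text!r}"]
    for child in element:  # iterating an element yields its child elements
        lines.extend(describe(child, depth + 1))
    return lines


tree_lines = describe(root)
print("\n".join(tree_lines))
```

Elements carry text or child elements, while the `role` attribute on AUTHOR is plain text stored in the element's attribute dictionary.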
XML Element




Has a name
Has a begin tag <elementName>
Then text and/or child elements
Has an end tag </elementName>
• E.g. <name> Simon </name>

Elements can also be empty
• E.g. <person name="Simon" />
Tips
• Do not use white space when creating names for elements
• Element names cannot begin with a digit, hyphen, or period, although names can contain them
• Only certain punctuation is allowed: periods, colons, hyphens, and underscores
Elements or Attributes




Information can either be stored in elements or attributes
Structured information is stored in elements
Primitive information (i.e. a single atomic value or list of
values) can either be stored in an element or an attribute
Perhaps better to store primitives in attributes
XML Attributes




Element start tags may also contain attributes
An attribute consists of an attribute name, an equals sign, and a
quoted attribute value
Attributes are only allowed in the start tags
E.g.:
<person email="[email protected]">
<name>Simon</name>
</person>
Well-Formed and Valid

Element tags must be properly nested
• E.g. <a> <b> text </b> </a> is ok
• But <a> <b> text </a> </b> is NOT

Attribute values enclosed in string quotes
• <item id="33905">


A document where all the tags are properly nested (and the other syntax
rules hold, e.g. a single root element and quoted attribute values) is well-formed
If a document is well-formed, and obeys the syntax rules of a
specified DTD, then it is also valid
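Well-formedness can be checked by simply trying to parse the text. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` is an illustrative choice. Note that this parser does not validate against a DTD, so it checks well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for any well-formedness violation (bad nesting, bad quoting, ...).
        return False


print(is_well_formed("<a> <b> text </b> </a>"))  # properly nested
print(is_well_formed("<a> <b> text </a> </b>"))  # tags cross each other
print(is_well_formed('<item id="33905"/>'))      # quoted attribute value
print(is_well_formed("<item id=33905/>"))        # unquoted attribute value
```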
Document Type Definition (DTD)




A Document Type Definition (DTD) allows the developer to
create a set of rules to specify legal content and place
restrictions on an XML file
If the XML document does not follow the rules contained
within the DTD, a parser generates an error
An XML document that conforms to the rules within a DTD is
said to be valid
A DTD
• Provides a concise way to specify the syntax of a given document
type
• Declares how the elements can include other elements
• And the attributes allowed for each element
• Special operators specify the order and cardinality of each item
(see below)
CDATA and PCDATA




CDATA – Character Data
Attributes declared with CDATA may contain any text
characters
PCDATA – Parsed Character Data
Elements declared #PCDATA do not contain other elements
• i.e. no other mark-up within them

In tree-terms, these are LEAF-nodes
Document Type Definition (DTD)
<?xml version="1.0" ?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Why Use a DTD?
• A single DTD ensures a common format for each XML document
that references it
• An application can use a standard DTD to verify that data that it
receives from the outside world is valid
• A description of legal, valid data further contributes to the
interoperability and efficiency of using XML
DTD Elements





All element declarations begin with <!ELEMENT and
end with >
The ELEMENT declaration is case sensitive
The programmer must declare all elements within an XML file
Elements declared with the #PCDATA content model cannot
have children
When describing sequences, the XML document must contain
exactly those elements in exactly that order
Operator         Meaning
+                One or more times
*                Zero or more times
?                Zero or once
|                Or: (a | b)? Either a or b or nothing
(no operator)    Exactly once: (a , b) Exactly one a followed by exactly one b
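The operators in the table have the same meaning as in regular expressions, so a content model can be mimicked by a regular expression over the list of child element names. The following is only an illustrative sketch, not a DTD validator: the helper `matches_model`, the hand-translated patterns, and the `entry` element are assumptions made for this example.

```python
import re


def matches_model(pattern, child_names):
    """Check a list of child element names against a regex-encoded content model."""
    # Each name is followed by a space so that names cannot run together.
    sequence = "".join(name + " " for name in child_names)
    return re.fullmatch(pattern, sequence) is not None


# (to, from, heading, body): each child exactly once, in this order.
NOTE_MODEL = r"to from heading body "
# (a | b)?: either a or b or nothing.
CHOICE_MODEL = r"(a |b )?"
# (entry+): one or more entry children (hypothetical element name).
ENTRY_MODEL = r"(entry )+"

print(matches_model(NOTE_MODEL, ["to", "from", "heading", "body"]))
print(matches_model(NOTE_MODEL, ["from", "to", "heading", "body"]))
print(matches_model(CHOICE_MODEL, []))
print(matches_model(ENTRY_MODEL, []))
```

A real validating parser derives such a check from the DTD itself; here the translation is done by hand to show what the operators mean.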
Some Example DTD Declarations
Example 1: The Empty Element
<!ELEMENT Bool EMPTY> <!-- DTD declaration of empty element -->
<Bool Value="True"></Bool> <!-- Usage with attribute in XML file -->
Example 2: Elements with Data
<!ELEMENT Month (#PCDATA)> <!-- DTD declaration of an element -->
<Month>April</Month> <!-- Valid usage within XML file -->
<Month>This is a month</Month> <!-- Valid usage within XML file -->
<Month> <!-- Invalid usage within XML file, can't have children! -->
<January>Jan</January>
<March>March</March>
</Month>
Some Example DTD Declarations
Example 3: Elements with Children
To specify that an element must have a single child element, include the element name
within the parenthesis.
<!ELEMENT House (Address)> <!-- A house has a single address -->
<House> <!-- Valid usage within XML file -->
<Address>1345 Preston Ave Charlottesville Va 22903</Address>
</House>

An element can have multiple children. A DTD describes multiple children using a
sequence, or a list of elements separated by commas. The XML file must contain one of
each element in the specified order.
<!--DTD declaration of an element-->
<!ELEMENT address (person,street,city, zip)>
<!ELEMENT person (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT zip (#PCDATA)>
<!-- Valid usage within XML file -->
<address>
<person>John Doe</person>
<street>1234 Preston Ave.</street>
<city>Charlottesville, Va</city>
<zip>22903</zip>
</address>
DTD for Address Book Example
<!-- DTD for simple address book -->
<!ELEMENT AddressBook (Title, Person*)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Person EMPTY>
<!ATTLIST Person name CDATA #REQUIRED>
<!ATTLIST Person email CDATA #IMPLIED>


Tip: Enter the Address Book DTD and XML as files in IntelliJ,
then use the tools -> validate command to perform validation
on the document.
Try to modify the DTD and/or XML document to make it
invalid.
Address Book – XML
<!DOCTYPE AddressBook SYSTEM "AddressBook.dtd">
<AddressBook>
<Title>Simon's address book</Title>
<Person name="Simon" email="[email protected]" />
<Person name="Anna" />
</AddressBook>
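Because every entry is a Person element with a required `name` and an optional `email` attribute, an application can process the address book uniformly. The sketch below uses Python's standard `xml.etree.ElementTree`; the DOCTYPE line is left out because this parser does not validate, and the e-mail address `simon@example.org` is a placeholder invented for the example.

```python
import xml.etree.ElementTree as ET

# Address book document; the e-mail value is a made-up placeholder.
ADDRESS_BOOK = """<AddressBook>
  <Title>Simon's address book</Title>
  <Person name="Simon" email="simon@example.org" />
  <Person name="Anna" />
</AddressBook>"""

book = ET.fromstring(ADDRESS_BOOK)

# get() returns None when an optional (#IMPLIED) attribute is absent.
people = [(p.get("name"), p.get("email")) for p in book.findall("Person")]

print(book.findtext("Title"))
for name, email in people:
    print(name, email if email is not None else "(no email)")
```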
Alternative Address Book
What about this version:
<AddressBook>
<Simon email="[email protected]" />
<Anna email="[email protected]" />
</AddressBook>
• Is it well formed?
• Is it valid (with respect to previous DTD?)
• Is it well designed?

XML Schema

An XML Schema:
• defines elements that can appear in a document
• defines attributes that can appear in a document
• defines which elements are child elements
• defines the order of child elements
• defines the number of child elements
• defines whether an element is empty or can include text
• defines data types for elements and attributes
• defines default and fixed values for elements and attributes
Sample XML Schema
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Schema vs. DTD

XML Schemas are the Successors of DTDs
• XML Schemas are extensible to future additions
• XML Schemas are richer and more useful than DTDs
• XML Schemas are written in XML
• XML Schemas support data types
• XML Schemas support namespaces
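Since schemas are themselves written in XML, ordinary XML tooling can read them. The sketch below parses the note schema from the sample slide with Python's standard `xml.etree.ElementTree` and lists the declared child elements. It assumes the fragment is wrapped in an `xs:schema` root bound to the standard XML Schema namespace, and it only reads the schema; the standard library has no XSD validator.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # the XML Schema namespace URI

SCHEMA = f"""<xs:schema xmlns:xs="{XS}">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="from" type="xs:string"/>
        <xs:element name="heading" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(SCHEMA)
ns = {"xs": XS}  # prefix-to-URI map used to resolve the path below

# Collect (name, type) for every child declared in the sequence of "note".
children = [
    (el.get("name"), el.get("type"))
    for el in schema.findall("xs:element/xs:complexType/xs:sequence/xs:element", ns)
]
for name, type_name in children:
    print(name, type_name)
```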
XML types






Includes primitive data types (integers, strings, dates, etc.)
Supports value-based constraints (integers > 100)
User-definable structured types
Inheritance (extension or restriction)
Foreign keys
Element-type reference constraints
XML namespaces

Problem: the meaning of a tag depends on its context
• Combining elements from different documents may give rise to
conflicting interpretations

Defining namespaces gives precise context to tags
• http://www.w3.org/1999/xhtml defines HTML tags

The notation {URI}tag fully qualifies a tag
• {http://www.w3.org/1999/xhtml}head

Namespace declarations increase readability
• <… xmlns:myns="http://www.w3.org/1999/xhtml">
• <myns:head> … </myns:head>
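Python's standard `xml.etree.ElementTree` happens to expose element names in exactly this {URI}tag notation, which makes the relationship between prefix and URI easy to see. A minimal sketch follows; the small document with an `html` root and a `title` element is made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# A made-up document that binds the prefix "myns" to the XHTML namespace.
doc = ET.fromstring(
    f'<myns:html xmlns:myns="{XHTML}">'
    "<myns:head><myns:title>Demo</myns:title></myns:head>"
    "</myns:html>"
)

print(doc.tag)  # the fully qualified name of the root element

# Two equivalent lookups: the fully qualified name, or a prefix of our choosing.
head = doc.find(f"{{{XHTML}}}head")
same_head = doc.find("x:head", {"x": XHTML})
print(head.tag == same_head.tag)
```

The prefix is only a local abbreviation: the lookup with the prefix `x` finds the same element that the document wrote with the prefix `myns`.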
XML namespaces




This XML carries HTML table information:
<table>
<tr>
<td>Apples</td>
<td>Bananas</td>
</tr>
</table>
This XML carries information about a table (a piece of
furniture):
<table>
<name>African Coffee Table</name>
<width>80</width>
<length>120</length>
</table>
Solving the Name Conflict Using a Prefix


This XML carries information about an HTML table, and a piece
of furniture:
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table>
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
XML Namespaces - The xmlns Attribute
<root>
<h:table xmlns:h="http://www.w3.org/TR/html4/">
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table xmlns:f="http://www.w3schools.com/furniture">
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
</root>
XML Namespaces - The xmlns Attribute
<root
xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="http://www.w3schools.com/furniture">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table>
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
</root>
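With both namespaces declared, an application can tell the two kinds of table apart. The sketch below uses Python's standard `xml.etree.ElementTree`; the prefixes `html` and `furn` in the lookup map are chosen freely for this example and deliberately differ from the `h` and `f` used in the document.

```python
import xml.etree.ElementTree as ET

DOC = """<root
    xmlns:h="http://www.w3.org/TR/html4/"
    xmlns:f="http://www.w3schools.com/furniture">
  <h:table>
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>"""

# Only the URIs must match the document; the prefixes here are our own.
NS = {
    "html": "http://www.w3.org/TR/html4/",
    "furn": "http://www.w3schools.com/furniture",
}

root = ET.fromstring(DOC)
cells = [td.text for td in root.findall("html:table/html:tr/html:td", NS)]
furniture_name = root.findtext("furn:table/furn:name", namespaces=NS)
unqualified = root.findall("table")  # no element is named plain "table"

print(cells)
print(furniture_name)
print(len(unqualified))
```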
What Makes XML Portable?

XSDs or DTDs associated with a document allow the receiver
to perform validation on the document.

Human-readable/writable.

Independent of presentation (formatting).
Syntactic vs Semantic Interoperability

While XML is portable, communicating parties still need to
agree on:
• Document type definitions
• Meaning of tags
• "Operations" on data (interfaces)
• Meaning of those operations

Semantic interoperability is still a problem!
Querying XML
XQuery concepts

A query in XQuery is an expression that:
• reads a sequence of XML fragments or atomic values
• returns a sequence of XML fragments or atomic values

The principal forms of XQuery expressions are:
• path expressions
• element constructors
• FLWOR ("flower") expressions (For-Let-Where-Order-Return)
• list expressions
• conditional expressions
• quantified expressions
• datatype expressions

Expressions are evaluated relative to a context:
• namespaces
• variables
• functions
• date and time
• context item (current node or atomic value)
• context position (in the sequence being processed)
• context size (of the sequence being processed)
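Python's standard library has no XQuery engine, so the following is only a rough analogue: `xml.etree.ElementTree` supports a small subset of path expressions, and a list comprehension with `sorted` plays the role of a FLWOR expression. The catalog document and all its values are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Invented sample data.
CATALOG = ET.fromstring("""<catalog>
  <book year="2001"><title>Alpha</title><price>30</price></book>
  <book year="1998"><title>Beta</title><price>55</price></book>
  <book year="2005"><title>Gamma</title><price>70</price></book>
</catalog>""")

# Path expression: all titles of all books (evaluated relative to the root).
titles = [t.text for t in CATALOG.findall("book/title")]

# Path expression with a predicate on an attribute value.
old = [t.text for t in CATALOG.findall("book[@year='1998']/title")]

# Roughly corresponds to the FLWOR expression
#   for $b in /catalog/book
#   where $b/price > 50
#   order by $b/title
#   return data($b/title)
expensive = sorted(
    b.findtext("title")
    for b in CATALOG.findall("book")
    if float(b.findtext("price")) > 50
)

print(titles, old, expensive)
```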
XML vs. Relational Data
Relation:
name    phone
John    3634
Sue     6343
Dick    6363

… in XML (tree): a root with three row children; each row has a name child and a phone child
row: name "John", phone 3634
row: name "Sue", phone 6343
row: name "Dick", phone 6363

{ row: { name: "John", phone: 3634 },
row: { name: "Sue", phone: 6343 },
row: { name: "Dick", phone: 6363 }
}
Relational to XML Data
• A relation instance is basically a tree with:
– Unbounded fanout at level 1 (i.e., any # of rows)
– Fixed fanout at level 2 (i.e., fixed # fields)
• XML data is essentially an arbitrary tree
– Unbounded fanout at all nodes/levels
– Any number of levels
– Variable # of children at different nodes, variable
path lengths
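The mapping from a relation to a two-level tree can be sketched in a few lines. The example below uses Python's standard `xml.etree.ElementTree` and the name/phone relation from the previous slide; the function name `relation_to_xml` and the root tag `relation` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

columns = ("name", "phone")
rows = [("John", 3634), ("Sue", 6343), ("Dick", 6363)]


def relation_to_xml(columns, rows, root_tag="relation"):
    """Build a tree with one <row> per tuple and one child per field."""
    root = ET.Element(root_tag)
    for row in rows:                      # level 1: any number of rows
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):  # level 2: fixed number of fields
            ET.SubElement(row_el, column).text = str(value)
    return root


relation = relation_to_xml(columns, rows)
xml_text = ET.tostring(relation, encoding="unicode")
print(xml_text)
```

The resulting tree has unbounded fanout at the row level and a fixed fanout of two below each row, which is the special case of an XML tree that a relation represents.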
Displaying XML with XSLT
• With XSLT you can transform an XML
document into HTML.
• XSLT is the recommended style sheet
language of XML.
• XSLT (eXtensible Stylesheet Language
Transformations) is far more sophisticated
than CSS.
• XSLT can be used to transform XML into
HTML, before it is displayed by a browser
Example 1
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Example 2 (not well-formed: the end tag of from does not match)
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</Ffrom>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>