DIADEM: domain-centric, intelligent, automated data extraction
WWW 2012 – European Projects Track
April 16–20, 2012, Lyon, France
DIADEM: Domain-centric, Intelligent, Automated Data
Extraction Methodology∗
Tim Furche, Georg Gottlob, Giovanni Grasso, Omer Gunes, Xiaonan Guo,
Andrey Kravchenko, Giorgio Orsi, Christian Schallhart, Andrew Sellers, Cheng Wang
Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD
[email protected]uk
ABSTRACT
Search engines are the sinews of the web. These sinews have become strained, however: where the web's function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics, a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers—never to be sure a better offer isn't just around the corner. What search engines are missing is an understanding of the objects and their attributes published on websites.

Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy, provided that we equip it with extensive knowledge about the ontology and phenomenology of the domain, i.e., about the entities (and relations) of the domain and about the representation of these entities in the textual, structural, and visual language of websites of this domain. In this demonstration, we show with a first prototype of DIADEM that, in contrast to the alchemists, DIADEM has developed a viable formula.
Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: On-line Information Services—Web-based services

General Terms

Languages, Experimentation

Keywords

data extraction, deep web, knowledge

1. INTRODUCTION

Search is failing. In our society, search engines play a crucial role as information brokers. They allow us to find web pages and by extension businesses, articles, or information about products wherever they are published. Visibility on search engines has become mission critical for most companies and persons. However, for searches where the answer is not just a single web page, search engines are failing: What is the best apartment for my needs? Who offers the best price for this product in my area? Such object queries have become an exercise in frustration: we users need to manually sift through, compare, and rank the offers from dozens of websites returned by a search engine. At the same time, businesses have to turn to stopgap measures, such as aggregators, that provide customers with search facilities in a specific domain (vertical search). This endangers the just emerging universal information market, "the great equalizer" of the Internet economy: visibility depends on deals with aggregators, the barrier to entry is raised, and the market fragments. Rather than further stopgap measures, the "publish on your website" model that has made the web such a success and so resilient against control by a few (government agents or companies) must be extended to object search: without additional technological burdens on publishers, users should be able to search and query objects such as properties or laptops based on attributes such as price, location, or brand. Increasing the burden on publishers is not an option, as it further widens the digital divide between those that can afford the necessary expertise and those that cannot.

DIADEM, an ERC advanced investigator grant, is developing a system that aims to provide this bridge: without human supervision, it finds, navigates, and analyses websites of a specific domain and extracts all contained objects using highly efficient, scalable, automatically generated wrappers. The analysis is parameterized with domain knowledge that DIADEM uses to replace human annotators in traditional wrapper induction systems and to refine and verify the generated wrappers. This domain knowledge describes the ontology as well as the phenomenology of the domain: what the entities and their relations are, and how they occur on websites. The latter describes, e.g., that real estate properties include a location and a price and that these are displayed prominently.

Figure 1: Exploration: Form

Figures 1 and 2 show examples (screenshots from the current prototype) of the kind of analysis required by DIADEM for fully automated data extraction: web sites need to be explored to locate relevant data, here real estate properties. This includes in particular forms as in Figure 1, which DIADEM automatically classifies such that each form field is associated with a type from the domain ontology, such as a "minimum price" field in a "price range" segment.

∗The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement DIADEM, no. 246858, http://diadem-project.info/.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2012 Companion, April 16–20, 2012, Lyon, France. ACM 978-1-4503-1230-1/12/04.
These form models are then used for filling the form for further analysis, but also to generate exhaustive queries for later extraction of all relevant data. Figure 2 illustrates a (partial) result of the analysis performed on a result page, i.e., a page containing relevant objects of the domain: DIADEM identifies the objects, their boundaries on the page, as well as their attributes. Objects and attributes are typed according to the domain ontology and verified against the domain constraints. In Figure 2, e.g., we identify real estate properties (for sale) together with price, location, legal status, number of bed and bathrooms, pictures, etc. The identified objects are generalised into a wrapper that can extract such objects from any page following the same template without further analysis.

Figure 2: Identification: Result Page

In the first year and a half, DIADEM has progressed beyond our expectations: the current prototype is already able to generate wrappers for most UK real estate and used car websites with higher accuracy than existing wrapper induction systems. In this demonstration, we outline the DIADEM approach, present first results, and describe the live demonstration of the current prototype.

2. DIADEM—THE METHODOLOGY

DIADEM is fundamentally different from previous approaches. The integration of state-of-the-art technology with reasoning using high-level expert knowledge at the scale envisaged by this project has not yet been attempted and has a chance to become the cornerstone of next generation web data extraction technology.

Standard wrapping approaches [17, 6] are limited by their dependence on templates for web pages or sites and by their lack of understanding of the extracted information. They cannot be generalised, and their dependence on human interaction prevents scaling to the size of the web. Several approaches that move towards fully automatic or generic wrapping of the existing World Wide Web have been proposed and are currently under active development. These approaches often try to iteratively learn new instances from known patterns and new patterns from known instances [5, 19], or to find generalised patterns on web pages such as record boundaries. Although current web harvesting and automated information extraction systems show significant progress, they still lack the combined recall and precision necessary to allow for very robust queries. We intend to build upon these approaches, but add deep knowledge into the extraction process and combine them in a new manner into our knowledge-based approach; we propose to use them as low-level fact finders, whose output facts may be revised in case more accurate information arises from other sources.

Figure 3: DIADEM Overview (a domain ontology and phenomenology drive exploration with OPAL, identification with AMBER, extraction with OXPath, and reasoning with GLUE over Datalog±)

DIADEM focuses on a dual approach: at the low level of web pattern recognition, we use machine learning complemented by linguistic analysis and basic ontological annotation. Several of these approaches, in addition to newly developed methods, are implemented as lower level building blocks (fact finders) to extract as much knowledge as possible and integrate it into a common knowledge base. At a higher level, we use goal-directed, domain-specific rules for reasoning on top of all generated lower-level knowledge, e.g., in order to identify the main input mask and the main extraction data structure, and to finalize the navigation process.

DIADEM approaches data extraction as a two stage process: in a first stage, we analyse a small fraction of a web site to generate a wrapper that is then executed in the second stage to extract all the relevant data on the site at high speed and low cost. Figure 3 gives an overview of the high-level architecture of DIADEM. On the left, we show the analysis stage, on the right the execution stage.

(1) Sampling analysis: In the first stage, a sample of the web pages of a site is used to fully automatically generate wrappers (i.e., extraction programs). The analysis is based on domain knowledge that describes the objects of interest (the "ontology") and how they appear on the web (the "phenomenology"). The result of the analysis is a wrapper program, i.e., a specification of how to extract all the data from the website without further analysis. Conceptually, it is divided into two major phases, though these are closely interwoven in the actual analysis:

(i) Exploration: DIADEM automatically explores a site to locate relevant objects. The major challenge here is web forms: DIADEM needs to understand such forms well enough to fill them for sampling, but also to generate exhaustive queries for the extraction stage such that all the relevant data is extracted (see [1]). DIADEM's form understanding engine OPAL [9, 10] uses a phenomenology of forms in the domain to classify the form fields. The exploration phase is supported by the page and block classification in MALACHITE, where we identify, e.g., next links in paginated results, navigation menus, and irrelevant data such as advertisements. We further cluster pages by structural and visual similarity to guide the exploration strategy and to avoid analysing many similar pages. Since such pages follow a common template, the analysis of one or two pages from a cluster usually suffices to generate a high confidence wrapper.

(ii) Identification: The exploration unearths those web pages that contain actual objects. But DIADEM still needs to identify the precise boundaries of these objects as well as their attributes. To that end, DIADEM's result page analysis AMBER [11] analyses the repeated structure within and among pages. It exploits the domain knowledge to distinguish noise from relevant data and is thus far more robust than existing data extraction approaches. AMBER is complemented by Oxtractor, which analyses the free text descriptions. It benefits in this task from the contextual knowledge in form of attributes already identified by AMBER and of background knowledge from the ontology.
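The identification of repeated records can be illustrated with a small Python sketch of ours (not AMBER's actual algorithm): sibling subtrees are grouped by structural signature, and domain annotations from the fact finders decide whether the dominant repeated group is relevant data or noise. The page encoding as nested `(tag, children)` tuples and the annotation map are invented for illustration:

```python
def signature(node):
    """Structural signature of a subtree: its tags in depth-first order."""
    tag, children = node
    return (tag,) + tuple(s for c in children for s in signature(c))

def find_records(siblings, annotations):
    """Identify record candidates as the largest group of structurally
    repeated siblings; keep the group only if its members carry domain
    annotations (e.g. a price), distinguishing data from noise."""
    groups = {}
    for i, node in enumerate(siblings):
        groups.setdefault(signature(node), []).append(i)
    best = max(groups.values(), key=len)
    if all(annotations.get(i) for i in best):
        return best
    return []

# Three repeated property records with one advertisement among them.
record = ("li", [("span", []), ("a", [])])
ad = ("li", [("img", [])])
siblings = [record, ad, record, record]
# A (hypothetical) fact finder annotated the records with prices.
annotations = {0: {"price"}, 2: {"price"}, 3: {"price"}}
print(find_records(siblings, annotations))  # → [0, 2, 3]
```

The advertisement is excluded both structurally (it does not belong to the largest repeated group) and semantically (it carries no domain annotation), mirroring the dual structural/ontological filtering described above.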
Figure 4: Exploration and Execution

(2) Large-scale extraction: The wrapper generated by the analysis stage can be executed independently and repeatedly. We have developed a new wrapper language, called OXPath [13], the first wrapper language for large scale, repeated (or even continuous) data extraction. OXPath is powerful enough to express nearly any extraction task, yet as a careful extension of XPath it maintains XPath's low data and combined complexity. In fact, it is so efficient that page retrieval and rendering time by far dominate the execution. For large scale execution, the aim is thus to minimize page rendering and retrieval by storing pages that are possibly needed for further processing. At the same time, memory should be independent from the number of pages visited, as otherwise large-scale or continuous extraction tasks become impossible. With OXPath we manage to obtain all these characteristics, as shown in Section 3.
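What "executed independently and repeatedly" means can be made concrete with a minimal Python sketch of ours: the wrapper produced by the analysis stage is pure data, so executing it on any page of the same template needs no further analysis. The regular-expression patterns merely stand in for OXPath expressions, and the page snippet is invented for illustration:

```python
import re

# A "wrapper" as produced by the analysis stage: pure data, here one
# pattern delimiting records and one pattern per attribute.
wrapper = {
    "record": re.compile(r"<li class='property'>(.*?)</li>"),
    "attributes": {
        "price": re.compile(r"<span class='price'>(.*?)</span>"),
        "location": re.compile(r"<span class='loc'>(.*?)</span>"),
    },
}

def execute(wrapper, page):
    """Execution stage: stream typed records out of a page; no analysis,
    just pattern application, hence high speed and low cost."""
    for record_html in wrapper["record"].findall(page):
        yield {name: (m.group(1) if (m := rx.search(record_html)) else None)
               for name, rx in wrapper["attributes"].items()}

page = ("<ul><li class='property'><span class='price'>£250,000</span>"
        "<span class='loc'>Oxford</span></li>"
        "<li class='property'><span class='price'>£180,000</span>"
        "<span class='loc'>Witney</span></li></ul>")

print(list(execute(wrapper, page)))
# → [{'price': '£250,000', 'location': 'Oxford'},
#    {'price': '£180,000', 'location': 'Witney'}]
```

Because the wrapper is a specification rather than a program that re-analyses each page, it can be shipped to the execution stage and run repeatedly over hundreds of thousands of pages.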
For a more detailed description of DIADEM's stages, see [12].

DIADEM's analysis uses a knowledge driven approach based on a domain ontology and phenomenology. To that end, most of DIADEM's analysis is implemented in logical rules on top of a thin layer of fact finders. We are currently developing a reasoning language targeted at highly dynamic, modular, expressive reasoning on top of a live browser. This language, called GLUE, builds on Datalog± [4, 16, 2, 3, 18], DIADEM's ontological query language. Datalog± allows us to reason on top of fairly complex ontologies, including value invention and equality dependencies, with little performance penalty over basic datalog (see [7] for a current state of datalog engines). We are also working on probabilistic extensions [14, 15] of Datalog± for use in DIADEM.
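The flavour of this rule layer can be conveyed by a tiny forward-chaining evaluator; this is a generic Datalog-style sketch of ours, not GLUE or Datalog± itself, and the facts and the single rule (a field whose own label, or whose enclosing segment's label, contains a price term is a price field) are invented for illustration:

```python
def infer(facts, rules):
    """Naive forward chaining: apply all rules until a fixpoint is reached."""
    facts = set(facts)
    while True:
        new = {f for rule in rules for f in rule(facts)} - facts
        if not new:
            return facts
        facts |= new

# Low-level fact-finder output: textual and structural observations.
facts = {
    ("field", "f1"),
    ("label", "f1", "min price"),
    ("in_segment", "f1", "s1"),
    ("segment_label", "s1", "price range"),
}

def price_field_rule(facts):
    """price_field(F) <- field(F), label(F, L), L contains "price".
       price_field(F) <- field(F), in_segment(F, S),
                         segment_label(S, L), L contains "price"."""
    fields = {f[1] for f in facts if f[0] == "field"}
    priced = {f[1] for f in facts if f[0] == "label" and "price" in f[2]}
    priced_segs = {f[1] for f in facts
                   if f[0] == "segment_label" and "price" in f[2]}
    for f in fields & priced:
        yield ("price_field", f)
    for fact in facts:
        if fact[0] == "in_segment" and fact[1] in fields \
                and fact[2] in priced_segs:
            yield ("price_field", fact[1])

derived = infer(facts, [price_field_rule])
print(("price_field", "f1") in derived)  # → True
```

Higher-level rules of this kind sit on top of the fact finders, so a fact such as a field's type can be derived from several independent observations and revised when more accurate information arrives.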
Figure 5: Identification: AMBER evaluation on the UK real estate domain

3. FIRST RESULTS

To give an impression of the achievements of DIADEM in its first year, we briefly summarise results on three components of DIADEM: its form understanding system, OPAL; its result page analysis, AMBER; and the OXPath extraction language. Figures 4 and 5 report on the quality of form understanding and result page analysis in DIADEM's first prototype.

Figure 4 shows that OPAL is able to identify about 99% of all form fields in the UK real estate and used car domains correctly. We also show the results on the ICQ and Tel-8 form benchmarks, where OPAL achieves > 96% accuracy (in contrast, recent approaches achieve at best 92% [8]). The latter result is without use of domain knowledge; with domain knowledge we could easily achieve close to 99% accuracy also in these cases. Figure 5 [11] shows the results of data area, record, and attribute identification on result pages for AMBER in the UK real estate domain. We report each attribute separately. AMBER achieves on average 98% accuracy for all these tasks, with a tendency to perform worse on attributes that occur less frequently (such as the number of reception rooms).

Figure 4 also summarises results for OXPath: it easily outperforms existing data extraction systems, often by a wide margin. Its high performance execution leaves page retrieval and rendering to dominate execution (> 85%) and thus makes avoiding page rendering imperative. We minimize page rendering by buffering any page that may be needed in further processing, yet manage to keep memory consumption constant in nearly all cases, even in a 12h extraction task extracting millions of records from hundreds of thousands of pages. For more details, see [13].

4. DEMONSTRATION

As its approach, the demonstration of DIADEM is split into two parts. The first part illustrates its analysis using the current DIADEM prototype. The second part illustrates wrapper execution with OXPath. To give the audience a better understanding of the challenges in automatically identifying wrappers, we start the demonstration with the execution.

The demonstration starts with Visual OXPath, a visual IDE for OXPath, shown in Figure 6b and illustrated in the screencast at diadem-project.info/oxpath/visual. With this UI, we demonstrate how a human would construct a wrapper and how supervised wrapper generation can generalise a wrapper from just a few examples. We provide a number of standard examples, but are also open to suggestions from the audience. Once the wrapper is created, we submit it and let it run in the background while we continue with the rest of the demonstration.
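The generalisation step behind this supervised mode can be illustrated with a small Python sketch: from a few example nodes, characterised here by their root-to-node tag paths, we derive a single selector-like pattern that matches all records of the template. This is our own toy reduction of the idea; Visual OXPath works on live DOMs and produces OXPath expressions, and the paths below are invented:

```python
def generalize(example_paths):
    """Derive a common pattern from example tag paths: positions where
    the examples disagree become wildcards ('*'), so a wrapper induced
    from a few records matches every record of the template."""
    assert len({len(p) for p in example_paths}) == 1, "examples must align"
    return tuple(steps[0] if len(set(steps)) == 1 else "*"
                 for steps in zip(*example_paths))

def matches(pattern, path):
    """A path matches if every step equals the pattern step or hits '*'."""
    return len(pattern) == len(path) and all(
        p in ("*", s) for p, s in zip(pattern, path))

# Two user-selected example prices on a result page ...
ex1 = ("html", "body", "div[1]", "span.price")
ex2 = ("html", "body", "div[2]", "span.price")
pattern = generalize([ex1, ex2])
print(pattern)  # → ('html', 'body', '*', 'span.price')

# ... generalise to every record of the same template:
print(matches(pattern, ("html", "body", "div[7]", "span.price")))  # → True
print(matches(pattern, ("html", "body", "div[7]", "span.loc")))    # → False
```

The same principle, lifted from tag paths to full DOM context and attribute conditions, is what lets a handful of user-marked examples stand in for a hand-written wrapper.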
(a) OPAL (b) OXPath
Figure 6: Demonstration UIs

In the second part of the demonstration, we focus on the DIADEM prototype. A screencast of the demo is available from diadem-project.info/screencast/diadem-01-april-2011.mp4. In the demonstration, we let the prototype run on several pages from the UK real estate and used car domain. The prototype runs in interactive mode, i.e., it visualizes its deliberations and conclusions and prompts the user before continuing to a new page, so that the user can inspect the results on the current page. With this, we can demonstrate first how DIADEM analyses forms and explores web sites based on these analyses. In particular, we show how DIADEM deals with form constraints such as mandatory fields, multi-stage forms, and other advanced navigation structures. The demo fully automatically explores the web site until it finds result pages, for which it performs a selective analysis sufficient to generate the intended wrapper. As before, the demonstration discusses a number of such result pages and shows in the interactive prototype how DIADEM extracts objects from these result pages, as well as identifies further exploration steps.

We conclude the demonstration with a brief outlook into applications of DIADEM technology beyond data extraction: the screencast at diadem-project.info/opal shows how we use DIADEM's form understanding to provide advanced assistive form filling: where current assistive technologies are limited to simple keyword matching or to filling of already encountered forms, DIADEM's technology allows us to fill any form of a domain with values from a master form with high precision.

5. REFERENCES

[1] M. Benedikt, G. Gottlob, and P. Senellart. Determining relevance of accesses at runtime. In PODS, 2011.
[2] A. Calì, G. Gottlob, and A. Pieris. New expressive languages for ontological query answering. In AAAI, 2011.
[3] A. Calì, G. Gottlob, and A. Pieris. Querying conceptual schemata with expressive equality constraints. In ER, 2011.
[4] A. Calì, G. Gottlob, and A. Pieris. Query answering under non-guarded rules in Datalog±. In RR, 2010.
[5] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. J. ACM, 51(5), 2004.
[6] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. In VLDB, 2011.
[7] O. de Moor, G. Gottlob, T. Furche, and A. Sellers, eds. Datalog Reloaded, Revised Selected Papers, LNCS, 2011.
[8] E. C. Dragut, T. Kabisch, C. Yu, and U. Leser. A hierarchical approach to model web query interfaces for web source integration. In VLDB, 2009.
[9] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, and C. Schallhart. Real understanding of real estate forms. In WIMS, 2011.
[10] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, and C. Schallhart. OPAL: Automated form understanding for the deep web. In WWW, 2012.
[11] T. Furche, G. Gottlob, G. Grasso, G. Orsi, C. Schallhart, and C. Wang. Little knowledge rules the web: Domain-centric result page extraction. In RR, 2011.
[12] T. Furche, G. Gottlob, X. Guo, C. Schallhart, A. Sellers, and C. Wang. How the minotaur turned into Ariadne: Ontologies in web data extraction. In ICWE (keynote), 2011.
[13] T. Furche, G. Gottlob, G. Grasso, C. Schallhart, and A. Sellers. OXPath: A language for scalable, memory-efficient data extraction from web applications. In VLDB, 2011.
[14] G. Gottlob, T. Lukasiewicz, and G. I. Simari. Answering threshold queries in probabilistic Datalog± ontologies. In SUM, 2011.
[15] G. Gottlob, T. Lukasiewicz, and G. I. Simari. Conjunctive query answering in probabilistic Datalog± ontologies. In RR, 2011.
[16] G. Gottlob, G. Orsi, and A. Pieris. Ontological queries: Rewriting and optimization. In ICDE, 2011.
[17] N. Kushmerick. Wrapper induction: efficiency and expressiveness. AI, 118, 2000.
[18] G. Orsi and A. Pieris. Optimizing query answering under ontological constraints. In VLDB, 2011.
[19] A. Yates, M. Cafarella, M. Banko, O. Etzioni, M. Broadhead, and S. Soderland. TextRunner: Open information extraction on the web. In NAACL, 2007.