OXPath: Little Language, Little Memory, Great Value∗
Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD
ABSTRACT
Data about everything is readily available on the web—but
often only accessible through elaborate user interactions.
For automated decision support, extracting that data is essential, but infeasible with existing heavy-weight data extraction systems. In this demonstration, we present OXPath, a novel approach to web extraction, with a system
that supports informed job selection and integrates information from several different web sites. By carefully extending XPath, OXPath exploits its familiarity and provides a
light-weight interface, which is easy to use and embed. We
highlight how OXPath guarantees optimal page buffering,
storing only a constant number of pages for non-recursive
queries.
Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services
General Terms
Languages, Algorithms
Keywords
Web extraction, web automation, XPath, AJAX
1. INTRODUCTION
Where do you want to work tomorrow? As three of your
job applications have been accepted, you have to pick one
where (1) your wife can find a good job as well, (2) you don’t
have to pay through the nose for a rental, (3) you can follow
your passions, and (4) your neighbors are used to children.
All this information is readily available on some webpage,
but manually extracting, aggregating, and ranking that information is tedious and often unmanageable due to the size
of the relevant data. We are forced to settle for approximate solutions, agonizing over whether some crucial data has been missed.
∗The research leading to these results has received funding
from the European Research Council under the European
Community’s Seventh Framework Programme (FP7/2007–
2013) / ERC grant agreement no. 246858. The views expressed in this article are solely those of the authors.
This paper is authored by an employee(s) of the United States Government
and is in the public domain.
WWW 2011, March 28–April 1, 2011, Hyderabad, India.
ACM 978-1-4503-0637-9/11/03.
doc("rightmove...")/desc::field()[1]/{"Oxford"}/foll::*#rent
/{click/}//*#maxPrice/{"1,250 PCM"/}/foll::field()[last()]
/{click/}/(//a[.#='next']/{click/})*//ol#summaries/li[
.//a[.#='details']/{click/}//text()[.='Furnished']]:<R>
Figure 1: OXPath for finding affordable rentals
Automating this process, however, is not affordable with
existing web querying, extraction, or automation tools. Web
querying tools do not support navigation between web pages,
form filling, or the simulation of user actions, all essential
for our task. Web extraction and automation approaches
[2, 4, 7, 3] address these issues, but are not affordable either: (1) They use specifically designed, complex languages
[2, 4, 7], which are unfamiliar to web developers. (2) Yet
most cannot deal properly with many modern, scripted web
sites as pointed out in [1]. (3) They rely on a sophisticated,
heavy-weight infrastructure, often including server and front-end components. (4) They require considerable hardware
resources: None of them provides strict memory guarantees, and
only [2] provides bounds for extraction time.
In this demonstration, we introduce OXPath, a careful
extension of XPath for simulating user interactions with web
sites to extract data from such sites. OXPath is
(1) affordable to learn as it extends XPath with only four
new features, sufficient for solving most extraction tasks.
(2) expressive enough for most web extraction tasks. In
particular, it deals with scripted web sites as it is based
on an existing browser engine.
(3) affordable to execute: Evaluating OXPath is polynomial
in the number of visited HTML elements. Its memory
consumption is independent of the number of visited
pages. To the best of our knowledge, OXPath is the
first web extraction tool that can guarantee constant
page buffering for non-recursive extraction queries.
(4) affordable to embed as it builds on existing technologies
and requires little additional processing over XPath.
We demonstrate these characteristics of OXPath in a decision support system for selecting the ideal job: The user can
specify UK locations and weighted criteria such as the number of jobs, the availability of affordable rentals, etc. Each
criterion has some parameters, e.g., the wage range for jobs.
For each location the relevant data is extracted with OXPath
(from different web sites) and the ranked list is presented to
the user. Figure 1 shows how to extract apartments from
rightmove.co.uk. We select the first (visible) form field on
the start page and enter “Oxford”. From there we navigate
to and click the rent submit button. On the returned page,
we fill the maximum price field and click the submit button
(the last visible form element). We iterate over all result
pages by clicking the next links and retrieve each furnished
rental (recognized on a details page).
In the second demo part, we profile OXPath’s evaluation
algorithm: We visualize the tree of pages as navigated by an
expression and show the time to render and select the relevant data, illustrating that often rendering time dominates
the evaluation. We also visualize OXPath’s buffer management and highlight when pages are dropped from the buffer.
This reinforces our theoretical result: Even for very large extraction tasks, OXPath rarely buffers more than a handful
of pages at any time. OXPath is available at diadem-project.info/oxpath
under a BSD license.
2. OXPATH: LANGUAGE
OXPath is an extension of XPath: XPath expressions are
also OXPath expressions and retain the same semantics,
computing sets of nodes, integers, strings, or Booleans.
We extend XPath with (1) a new kind of location step (for
actions and form filling), (2) a new axis (for selecting nodes
based on visual attributes), (3) a new node-test (for selecting
visible fields), and (4) a new kind of predicate (for marking
data to be extracted). For page navigation, we adapt the
notion of Kleene star over path expressions from [5]. Nodes
and values marked by extraction markers are streamed out
as records of the result tables. For efficient processing, we
cannot fix an a priori order on nodes from different pages.
Therefore, we do not allow access to the order of nodes in
sets that contain nodes from multiple pages.
Actions. For explicitly simulating user actions, such as clicks
or mouse-overs, OXPath introduces contextual action steps,
as in {click}, and absolute action steps with a trailing slash,
as in {click /}. Since actions may modify or replace the entire DOM, OXPath's semantics assumes that they produce a
new DOM. Absolute actions return DOM roots, while contextual actions return those nodes in the resulting DOMs
which are matched by the action-free prefix of the performed
action: The action-free prefix afp(action) of action is constructed by removing all intermediate contextual actions and
extraction markers from the segment starting at the previous
absolute action. Thus, the action-free prefix selects nodes on
the new page, if there are any. For instance, the following
expression enters “Oxford” into Google's search form using a
contextual action (thereby maintaining the position on the
page) and clicks its search button using an absolute action.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
Style Axis and Visible Field Access. We introduce two
extensions for lightweight visual navigation: a new axis for
accessing CSS DOM node properties and a new node test
for selecting only visible form fields. The style axis is similar
to the attribute axis, but navigates dynamic CSS properties
instead of static HTML properties. For example, the following expression selects the sources for the top story on Google
News based on visual information only:
doc("news.google.com")//*[style::color="#767676"]
The style axis provides access to the actual CSS properties
(as returned by the DOM style object), rather than only to
inline styles.
An essential application of the style axis is the navigation of visible fields. This excludes fields which have type or
visibility hidden, or have display property none set for themselves or in an ancestor. To ease field navigation, we introduce the node-test field() as an abbreviation. In the
above Google search for “Oxford”, we rely on the order
of the visible fields selected with descendant::field()[1] and
following::field()[1]. Such an expression is not only easier
to write, it is also far more robust against changes on the
web site. For it to fail, either the order or set of visible form
fields has to change.
Extraction Marker. Navigation and form filling are often
means to data extraction: While data extraction requires
records with many related attributes, XPath only computes
a single node set. Hence, we introduce a new kind of qualifier, the extraction marker, to identify nodes as representatives for records and to form attributes from extracted
data. For example, :<story> identifies the context nodes as
story records. To select the text of a node as title, we use
:<title=string(.)>. Therefore,
doc("news.google.com")//div[@class~="story"]:<story>
[.//h2:<title=string(.)>]
[.//span[style::color="#767676"]:<source=string(.)>]
extracts from Google News a story element for each current
story, containing its title and its sources, as in:
<story><title>Tax cuts ...</title>
<source>Washington Post</source>
<source>Wall Street Journal</source> ...
</story>
The nesting in the result above mirrors the structure of the
OXPath expression: An extraction marker in a predicate
represents an attribute to the (last) extraction marker outside the predicate.
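The way nested extraction markers turn into nested records can be illustrated with a small Python sketch. It is not part of OXPath: the `matches` structure and the `to_record` helper are hypothetical stand-ins for what an evaluator might hand to the host, and the values are the ones from the Google News example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written stand-in for the matches of one OXPath run:
# each :<story> match carries the attribute matches found in its predicates.
matches = [
    {"marker": "story", "attributes": [
        ("title", "Tax cuts ..."),
        ("source", "Washington Post"),
        ("source", "Wall Street Journal"),
    ]},
]

def to_record(match):
    """Build one XML record: the outer marker becomes the element,
    markers nested in its predicates become child elements."""
    record = ET.Element(match["marker"])
    for name, value in match["attributes"]:
        child = ET.SubElement(record, name)
        child.text = value
    return record

records = [to_record(m) for m in matches]
xml_out = ET.tostring(records[0], encoding="unicode")
print(xml_out)
```

Each marker outside a predicate opens a record, and every marker inside one of its predicates contributes a child, which is why a story with two matching spans yields two source children.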
Kleene Star. Finally, we add the Kleene star, as in [5],
to OXPath. For example, we use the following expression
to query Google for “Oxford”, traverse all accessible result
pages, and extract all contained links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
( (/desc::a.l:<Link=(@href)>)
/ancestor::*/desc::a[next]{click /} )*
To limit the range of the Kleene star, one can specify upper
and lower bounds on the multiplicity, e.g., (...)*{3,8}.
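A bounded repetition can be pictured with a short Python sketch over a toy site. Everything here is an assumption made for illustration: the `bounded_star` helper, the reading of the two bounds as {lower,upper}, and the five-page `NEXT` table are not taken from the OXPath implementation, and the sketch assumes the pages do not link back to each other in a cycle.

```python
def bounded_star(start, step, lower=0, upper=None):
    """Return the contexts reached by applying `step` k times,
    for every k with lower <= k <= upper (upper=None means unbounded).

    `step` maps one context to a list of successor contexts; an empty
    list means the step no longer matches (e.g. no "next" link left).
    """
    results, frontier, k = [], [start], 0
    while frontier and (upper is None or k <= upper):
        if k >= lower:
            results.extend(frontier)
        frontier = [nxt for ctx in frontier for nxt in step(ctx)]
        k += 1
    return results

# Toy result listing: five pages chained by a "next" link (made up).
NEXT = {"p1": "p2", "p2": "p3", "p3": "p4", "p4": "p5"}

def click_next(page):
    return [NEXT[page]] if page in NEXT else []

all_pages = bounded_star("p1", click_next)          # plain star
some_pages = bounded_star("p1", click_next, 1, 3)   # in the spirit of (...)*{1,3}
print(all_pages, some_pages)
```

The plain star stops on its own once the "next" step yields nothing, while the bounded form cuts the traversal off after the upper bound even if more pages exist.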
3. OXPATH: SYSTEM
Our implementation consists of the three layers shown in
Figure 2: the embedding layer provides a development API
and a host environment, the engine layer performs the evaluation of OXPath expressions, and the web access layer accesses the web in a browser-neutral fashion.
Embedding Layer. To evaluate an OXPath expression, we
need to provide the environment with bindings for all occurring XPath variables; the environment in turn provides the
final expression to the OXPath engine. Variables in OXPath expressions are commonly used to fill web forms with
multiple values. To this end, the host environment allows
value bindings based on databases, files, other OXPath expressions, or Java functions. In our default implementation,
we stream the extracted matches to a file without buffering, while other implementations may choose to store the
matches e.g. in a database instead. Finally, we offer a GUI
to support the visual design of the OXPath queries.
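As a rough illustration of variable bindings for form filling, the following Python sketch expands one expression template into a final expression per combination of bound values. How the actual host environment passes bindings is not described here, so the textual substitution, the `bind_all` helper, the template, and the example values are all assumptions.

```python
from itertools import product
from string import Template

# Hypothetical expression template; $location and $distance play the role
# of XPath variables that the host environment has to bind.
TEMPLATE = Template(
    'doc("reed.co.uk")//field()[@title#=\'town\']/{$location}'
    '/foll::field()[1]/{$distance}'
)

def bind_all(template, bindings):
    """Yield one final expression per combination of bound values."""
    names = sorted(bindings)
    for values in product(*(bindings[n] for n in names)):
        quoted = {n: '"%s"' % v for n, v in zip(names, values)}
        yield template.substitute(quoted)

expressions = list(bind_all(TEMPLATE, {
    "location": ["OX1", "SW1"],   # made-up values; could come from a file or DB
    "distance": ["5 miles"],
}))
for expression in expressions:
    print(expression)
```

Binding a variable to several values yields several final expressions, one per form submission.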
Figure 2: OXPath System Architecture (layers: Embedding, OXPath Engine, Web Access; components shown: Visual OXPath, Host Environment (e.g. Java, XQuery, CLI, …), XPath Variable Bindings for Form Filling, OXPath API / JAXP XPath API, OXPath Parser & Rewriting, PAAT Algorithm, Basic OXPath Evaluator, Page Buffer, Page Modification Manager, Node Manager, Web Access API, Browser XPath, HtmlUnit, SWT Mozilla)
Figure 3: OXPath Visual Profiler
Engine Layer. After parsing, the query optimizer expands
abbreviated expressions, such as field(), and feeds the resulting queries to our Page-At-A-Time (PAAT) algorithm [6].
This algorithm controls the overall evaluation strategy and
uses the browser's XPath engine to evaluate individual XPath
steps and a buffer manager to handle page modifications. In
this demonstration (see Section 4), we illustrate the inner
workings of the buffer manager using the OXPath visual
profiler shown in Figure 3. The visual profiler is an AJAX
webpage that uses the InfoVis Toolkit (thejit.org) to replay
or step through the evaluation of an OXPath expression. Its
main feature is a visualization of the page navigation tree,
showing how pages are traversed by OXPath.
Web Access Layer. For evaluating OXPath expressions on
web pages, we require programmatic access to a dynamic
DOM rendering engine, as employed by all modern web
browsers. We identified HtmlUnit (htmlunit.sourceforge.net),
the Mozilla-based JREX (jrex.mozdev.org), and the
likewise Mozilla-based SWT widget (eclipse.org/swt) as backends. To decouple our own implementation from their implementation details, we provide the web access layer as a
facade which allows for exchanging the underlying browser
engine independently of our other code. In our current implementation, we use HtmlUnit as backend, since it renders
pages efficiently (compared with the SWT Mozilla embedding) and is implemented entirely in Java.
4. DEMO DESCRIPTION
For this demonstration, we showcase OXPath with a proof-of-concept decision support system (diadem-project.info/oxpath/placerank):
It ranks a set of UK locations using
data extracted from a number of different web sites with
user-provided criteria and weights. Separate OXPath expressions navigate the interfaces of each site, filling the necessary values for each location. The criteria are realized
either through form filling, if the web site already provides
a corresponding option, or as OXPath predicates otherwise.
OXPath extracts all the relevant data; the actual weighting and ranking is done in the host language, in our case a
straightforward Java program.
We extract and aggregate data from several deep web
sites: Reed (reed.co.uk) for jobs, Rightmove (rightmove.co.uk)
for rentals, Yahoo's Upcoming (upcoming.yahoo.com) for
events, Thomson Local (thomsonlocal.com) for local businesses, and UpMyStreet (upmystreet.co.uk) for neighborhood assessment. For each site, the user may also provide a
list of criteria for ranking the postcode, though the system
proposes a default set. These criteria can be built from the
data of the above web sites using a number of additional parameters for each (criteria in italics are not available through
the search interface of the respective site and thus are realized through OXPath predicates):
- From Reed, we extract the number of jobs that match the
following criteria (in the given postcode): job category,
salary range, and number of applicants.
- From Upcoming, we extract the number of events in the
specified postcode area, allowing the user to configure
the period of time, and event category (e.g., “Jazz”).
- From Thomson Local, we extract local businesses if their
category and distance from our postcode match.
- From Rightmove, we extract all rentals that have the
chosen number of bedrooms, maximum rental, included
furnishings, and availability (e.g., long term).
- From UpMyStreet, we extract any of four neighborhood quality measures (averages): income, education
level, and number of children.
On the website, we also show the OXPath implementations for all five web sites. For space reasons, we highlight
here only the most interesting OXPath expressions. Figure 4
shows the navigation and extraction performed on Reed.
We first load Reed's start page and fill the search field
for the job category ➊, the location ➋, and the maximum
distance ➌ from the given postcode. To submit the search
form, we click the “Search Jobs” button ➍:
doc("reed.co.uk")//field()[@title#='keyword']/{"Java"/}➊
//field()[@title#='town']/{"OX1"}➋/foll::field()[1]
/{"5 miles"}➌/foll::field()[@title='SearchJobs']/{click/}➍
For space reasons, we abbreviate the XPath axes. The #
operator is a node test we adopt for convenience: A#B filters
the nodes in A to those with id attribute B. The node-test
field() is used to identify only visible form fields.
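The visibility rule behind field() (no hidden type, no hidden visibility, no display none on the field or an ancestor) can be sketched over a toy DOM in Python. The `Node` class, the `FORM_TAGS` set, and the sample form are assumptions made for this sketch; a real evaluation would read the computed styles from the browser engine.

```python
from dataclasses import dataclass, field as dc_field
from typing import Optional

FORM_TAGS = {"input", "select", "textarea", "button"}  # assumed set of form fields

@dataclass
class Node:
    tag: str
    attrs: dict = dc_field(default_factory=dict)
    style: dict = dc_field(default_factory=dict)   # computed CSS properties
    parent: Optional["Node"] = None

def is_visible_field(node):
    """Apply the rule from the text: reject type=hidden, visibility:hidden,
    and display:none on the node itself or on any ancestor."""
    if node.tag not in FORM_TAGS:
        return False
    if node.attrs.get("type") == "hidden":
        return False
    if node.style.get("visibility") == "hidden":
        return False
    current = node
    while current is not None:
        if current.style.get("display") == "none":
            return False
        current = current.parent
    return True

# Made-up form: a hidden token, a field inside a collapsed div, a normal field.
form = Node("form")
hidden_token = Node("input", {"type": "hidden"}, parent=form)
collapsed = Node("div", style={"display": "none"}, parent=form)
buried = Node("input", {"type": "text"}, parent=collapsed)
query = Node("input", {"type": "text"}, parent=form)

visible = [n for n in (hidden_token, buried, query) if is_visible_field(n)]
print(len(visible))
```

Only the last field survives the test, so a positional step such as field()[1] would select it no matter how many hidden helper fields the page adds around it.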
Figure 4: Finding an OXPath through reed.co.uk
Figure 5: Memory, results, and pages (memory [MB], extracted matches, and visited pages over time [sec])
Since the search may return a paginated result, we use
an OXPath Kleene star expression (➏+) to iterate over all
the result pages: On each result page, we identify the “Next”
link at the top of the results listing, until there is no further one:
/(//a[.#="Next"][1]/{click/})*➏+
On each result page, we identify and extract those job
listings that have fewer than the given maximum number of
applicants (in this case 10):
//div.resultEntry[.//label[.#="Applications"]
/foll::text()[1]<10]:<job>
The extraction marker :<job> extracts each div.resultEntry
with the label job if it matches the predicates.
For those jobs, we also access their details page to extract
the full job description (rather than the abbreviated one
shown on the result page):
[h4/a/{click/}➎//div.jobDescription:<desc=.>]
The extraction marker :<desc> extracts those descriptions
as children of the corresponding job element, as it occurs
nested in a predicate following :<job>.
Further criteria such as salary range or posting date can
be included in the same way (full example on the website).
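Once the per-site numbers are extracted, the ranking step mentioned at the start of this section takes place in the host language. The Python sketch below shows one plausible shape for it. The weighted-sum scoring, the scaling by the per-criterion maximum, the `rank_locations` helper, and all numbers are assumptions; the demo's own host program is written in Java and is not reproduced here.

```python
def rank_locations(counts, weights):
    """Rank locations by a weighted sum of per-criterion counts,
    each count scaled by the largest value seen for that criterion."""
    maxima = {c: max(loc[c] for loc in counts.values()) or 1 for c in weights}
    scores = {
        name: sum(weights[c] * values[c] / maxima[c] for c in weights)
        for name, values in counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up extraction results for two postcodes.
counts = {
    "OX1": {"jobs": 40, "rentals": 10},
    "SW1": {"jobs": 80, "rentals": 2},
}
ranking = rank_locations(counts, {"jobs": 0.5, "rentals": 0.5})
for name, score in ranking:
    print(name, round(score, 2))
```

Keeping this step outside the extraction expressions means that weights can be changed without extracting anything again.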
To visualize OXPath’s buffer management, we use the
OXPath profiler (Figure 3): It shows the tree of pages that
OXPath navigates while evaluating the expressions. The
evaluation can be replayed either automatically or stepwise.
At the top of the page ➊, a summary of the evaluation time
of the various components of OXPath is given. In the page
tree, we show for each page ➋ some meta-data and the
time spent rendering and navigating that page ➎. A page
is shaded gray ➌ if it is no longer buffered by OXPath. As
Figure 3 shows, OXPath keeps at most the pages on the path
from the start page to the current one in memory. This behavior is confirmed on a larger scale in Figure 5, where we
evaluate the above OXPath expression with location “London” and distance “30 miles”.
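The buffering behavior described above, keeping only the pages on the path from the start page to the current one, can be mimicked with a depth-first walk over a toy page tree. This Python sketch models only that policy and is not the PAAT algorithm itself; the `traverse` helper and the fixed-depth tree (one start page, 20 result pages, 10 details pages each) are made up.

```python
def traverse(page, children, buffer=None, stats=None):
    """Depth-first walk that buffers only the path from the start page
    to the current page and records the largest buffer size seen."""
    buffer = [] if buffer is None else buffer
    stats = {"visited": 0, "max_buffered": 0} if stats is None else stats
    buffer.append(page)                      # page is rendered and buffered
    stats["visited"] += 1
    stats["max_buffered"] = max(stats["max_buffered"], len(buffer))
    for child in children.get(page, []):     # pages reached by actions on `page`
        traverse(child, children, buffer, stats)
    buffer.pop()                             # page is dropped from the buffer
    return stats

# Toy navigation tree of fixed depth (hypothetical page names).
children = {"start": [f"result{i}" for i in range(20)]}
for i in range(20):
    children[f"result{i}"] = [f"result{i}/details{j}" for j in range(10)]

stats = traverse("start", children)
print(stats)
```

In this toy tree the walk visits 221 pages but never holds more than 3 at once: the buffer size follows the depth of the navigation, not the number of pages visited.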
We conclude the demonstration by going into details for
one more web site, to illustrate the versatility of OXPath:
extracting rentals from rightmove.co.uk. On Rightmove the
user has to perform a longer sequence of actions (than on
Reed) to obtain the results: Fill “Oxford” into the input field
and click on the “Rent” submit button. On the returned
page, fill in the maximum price field, and click submit. To
find all result pages, repeatedly click the next link.
doc("rightmove...")/desc::field()[1]/{"Oxford"}➊/foll::*#rent
/{click/}➋//*#maxPrice/{"1,250 PCM"/}➌/foll::field()[last()]
/{click/}➍/(//a[.#='next']/{click/})*➎//ol#summaries/li[
.//a[.#='details']/{click/}➏//text()[.='Furnished']]:<R>
Figure 6: OXPath for finding affordable rentals
The OXPath expression in Figure 6 realizes this sequence
of actions in Lines 1–4: To fill the first input field, we use
OXPath's field() node-test and a contextual action ➊ (denoted with just {} around the action). The navigation continues from the same field after a contextual action. The
other actions are absolute actions (with an added /) where
the navigation resumes at the root of the page retrieved by
the action. We locate and click ➋ the “Rent” button to submit the form. Rightmove does not directly return results,
but asks for a refinement of the query. On the refinement
page, we identify the drop-down list for the maximum price
and also submit the refinement form by clicking ➍ its submit button (the last visible web form field). Again we iterate
over all result pages. In order to extract only “Furnished”
apartments, we need to navigate to the details page ➏. We
extract the filtered results from the result page using the
:<R> extraction marker.
5. REFERENCES
[1] A. Alba, V. Bhagwan, and T. Grandison. Accessing the
deep web: when good ideas go bad. In OOPSLA, 2008.
[2] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web
information extraction with Lixto. In VLDB, 2001.
[3] J. P. Bigham, A. C. Cavender, R. S. Kaminsky, C. M.
Prince, and T. S. Robison. Transcendence: enabling a
personal view of the deep web. In IUI, 2008.
[4] M. Bolin, M. Webber, P. Rha, T. Wilson, and R. C.
Miller. Automation and customization of rendered web
pages. In UIST, 2005.
[5] M. Marx. Conditional XPath. ACM Trans. Database
Syst., 30(4), 2005.
[6] OXPath. http://www.diadem-project.info/oxpath.
[7] W. Shen, A. Doan, J. F. Naughton, and
R. Ramakrishnan. Declarative information extraction
using datalog with embedded extraction predicates. In
VLDB, 2007.