OXPath: Little Language, Little Memory, Great Value
Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD
[email protected]

ABSTRACT
Data about everything is readily available on the web, but often only accessible through elaborate user interactions. For automated decision support, extracting that data is essential, but infeasible with existing heavy-weight data extraction systems. In this demonstration, we present OXPath, a novel approach to web extraction, with a system that supports informed job selection and integrates information from several different web sites. By carefully extending XPath, OXPath exploits its familiarity and provides a light-weight interface, which is easy to use and embed. We highlight how OXPath guarantees optimal page buffering, storing only a constant number of pages for non-recursive queries.

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services - Web-based services

General Terms
Languages, Algorithms

Keywords
Web extraction, web automation, XPath, AJAX

1. INTRODUCTION
Where do you want to work tomorrow? As three of your job applications have been accepted, you have to pick one where (1) your wife can find a good job as well, (2) you don’t have to pay through the nose for a rental, (3) you can follow your passions, and (4) your neighbors are used to children. All this information is readily available on some webpage, but manually extracting, aggregating, and ranking that information is tedious and often unmanageable due to the size of the relevant data. We are forced to settle for approximate solutions, agonizing over whether we have missed some crucial data.

∗The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 246858.
The views expressed in this article are solely those of the authors. This paper is authored by an employee(s) of the United States Government and is in the public domain. WWW 2011, March 28–April 1, 2011, Hyderabad, India. ACM 978-1-4503-0637-9/11/03.

    doc("rightmove...")/desc::field()[1]/{"Oxford"}/foll::*#rent
    /{click/}//*#maxPrice/{"1,250 PCM"/}/foll::field()[last()]
    /{click/}/(//a[.#='next']/{click/})*//ol#summaries/li[
    .//a[.#='details']/{click/}//text()[.='Furnished']]:<R>

Figure 1: OXPath for finding affordable rentals

Automating this process, however, is not affordable with existing web querying, extraction, or automation tools. Web querying tools do not support navigation between web pages, form filling, or the simulation of user actions, all essential for our task. Web extraction and automation approaches [2, 4, 7, 3] address these issues, but are not affordable either:
(1) They use specifically designed, complex languages [2, 4, 7], which are unfamiliar to web developers.
(2) Yet most cannot deal properly with many modern, scripted web sites, as pointed out in [1].
(3) They rely on a sophisticated, heavy-weight infrastructure, often including server and frontend components.
(4) They require considerable hardware resources: None provides strict memory guarantees and only [2] provides bounds for extraction time.

In this demonstration, we introduce OXPath, a careful extension of XPath for simulating user interactions with web sites to extract data from such sites. OXPath is
(1) affordable to learn, as it extends XPath with only four new features, sufficient for solving most extraction tasks.
(2) expressive enough for most web extraction tasks. In particular, it deals with scripted web sites, as it is based on an existing browser engine.
(3) affordable to execute: Evaluating OXPath is polynomial in the number of visited HTML elements. Its memory consumption is independent of the number of visited pages.
To the best of our knowledge, OXPath is the first web extraction tool that can guarantee constant page buffering for non-recursive extraction queries.
(4) affordable to embed, as it builds on existing technologies and requires little additional processing over XPath.

We demonstrate these characteristics of OXPath in a decision support system for selecting the ideal job: The user can specify UK locations and weighted criteria such as the number of jobs, the availability of affordable rentals, etc. Each criterion has some parameters, e.g., the wage range for jobs. For each location, the relevant data is extracted with OXPath (from different web sites) and the ranked list is presented to the user.

Figure 1 shows how to extract apartments from rightmove.co.uk. We select the first (visible) form field on the start page and enter “Oxford”. From there we navigate to and click the rent submit button. On the returned page, we fill the maximum price field and click the submit button (the last visible form element). We iterate over all result pages by clicking the next links and retrieve each furnished rental (recognized on a details page).

In the second demo part, we profile OXPath’s evaluation algorithm: We visualize the tree of pages as navigated by an expression and show the time to render and select the relevant data, illustrating that rendering time often dominates the evaluation. We also visualize OXPath’s buffer management and highlight when pages are dropped from the buffer. This reinforces our theoretical result: Even for very large extraction tasks, OXPath rarely buffers more than a handful of pages at any time. OXPath is available on diadem-project.info/oxpath under a BSD license.

2. OXPATH: LANGUAGE
OXPath is an extension of XPath: XPath expressions are also OXPath expressions and retain the same semantics, computing sets of nodes, integers, strings, or Booleans. We extend XPath with (1) a new kind of location step (for actions and form filling), (2) a new axis (for selecting nodes based on visual attributes), (3) a new node-test (for selecting visible fields), and (4) a new kind of predicate (for marking data to be extracted). For page navigation, we adapt the notion of Kleene star over path expressions from [5]. Nodes and values marked by extraction markers are streamed out as records of the result tables. For efficient processing, we cannot fix an a priori order on nodes from different pages. Therefore, we do not allow access to the order of nodes in sets that contain nodes from multiple pages.

Actions. For explicitly simulating user actions, such as clicks or mouse-overs, OXPath introduces contextual action steps, as in {click}, and absolute action steps with a trailing slash, as in {click /}. Since actions may modify or replace the entire DOM, OXPath’s semantics assumes that they produce a new DOM. Absolute actions return DOM roots, while contextual actions return those nodes in the resulting DOMs which are matched by the action-free prefix of the performed action: The action-free prefix afp(action) of action is constructed by removing all intermediate contextual actions and extraction markers from the segment starting at the previous absolute action. Thus, the action-free prefix selects nodes on the new page, if there are any. For instance, the following expression enters “Oxford” into Google’s search form using a contextual action (thereby maintaining the position on the page) and clicks its search button using an absolute action.

    doc("google.com")/descendant::field()[1]/{"Oxford"}
    /following::field()[1]/{click /}

Style Axis and Visible Field Access. We introduce two extensions for lightweight visual navigation: a new axis for accessing CSS DOM node properties and a new node test for selecting only visible form fields. The style axis is similar to the attribute axis, but navigates dynamic CSS properties instead of static HTML properties. For example, the following expression selects the sources for the top story on Google News based on visual information only:

    doc("news.google.com")//*[style::color="#767676"]

The style axis provides access to the actual CSS properties (as returned by the DOM style object), rather than only to inline styles. An essential application of the style axis is the navigation of visible fields. This excludes fields which have type or visibility hidden, or have display property none set for themselves or in an ancestor. To ease field navigation, we introduce the node-test field() as an abbreviation. In the above Google search for “Oxford”, we rely on the order of the visible fields selected with descendant::field()[1] and following::field()[1]. Such an expression is not only easier to write, it is also far more robust against changes on the web site. For it to fail, either the order or the set of visible form fields has to change.

Extraction Marker. Navigation and form filling are often means to data extraction: While data extraction requires records with many related attributes, XPath only computes a single node set. Hence, we introduce a new kind of qualifier, the extraction marker, to identify nodes as representatives for records and to form attributes from extracted data. For example, :<story> identifies the context nodes as story records. To select the text of a node as title, we use :<title=string(.)>. Therefore,

    doc("news.google.com")//div[@class~="story"]:<story>
    [.//h2:<title=string(.)>]
    [.//span[style::color="#767676"]:<source=string(.)>]

extracts from Google News a story element for each current story, containing its title and its sources, as in:

    <story><title>Tax cuts ...</title>
    <source>Washington Post</source>
    <source>Wall Street Journal</source> ... </story>

The nesting in the result above mirrors the structure of the OXPath expression: An extraction marker in a predicate represents an attribute to the (last) extraction marker outside the predicate.

Kleene Star. Finally, we add the Kleene star, as in [5], to OXPath. For example, we use the following expression to query Google for “Oxford”, traverse all accessible result pages, and extract all contained links.

    doc("google.com")/descendant::field()[1]/{"Oxford"}
    /following::field()[1]/{click /}
    ( (/desc::a.l:<Link=(@href)>)
    /ancestor::*/desc::a[next]{click /} )*

To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}.

3. OXPATH: SYSTEM
Our implementation consists of the three layers shown in Figure 2: the embedding layer provides a development API and a host environment, the engine layer performs the evaluation of OXPath expressions, and the web access layer accesses the web in a browser-neutral fashion.

Embedding Layer. To evaluate an OXPath expression, we need to provide the environment with bindings for all occurring XPath variables; the environment in turn provides the final expression to the OXPath engine. Variables in OXPath expressions are commonly used to fill web forms with multiple values. To this end, the host environment allows value bindings based on databases, files, other OXPath expressions, or Java functions. In our default implementation, we stream the extracted matches to a file without buffering, while other implementations may choose to store the matches, e.g., in a database instead. Finally, we offer a GUI to support the visual design of OXPath queries.
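The output side of the Embedding Layer can be illustrated with a small sketch. It assumes, purely for illustration, that the engine hands matches to the host as (label, attributes) pairs; the names `stream_records` and `example_matches` are hypothetical and not part of the OXPath API. Each record is written as soon as it arrives, so nothing is accumulated, and attributes marked inside a predicate appear nested under the enclosing record, as in the Google News example of Section 2.

```python
import io
from xml.sax.saxutils import escape


def stream_records(records, out):
    """Write each record to `out` as soon as it is produced.

    `records` is any iterable of (label, attributes) pairs, where
    `attributes` is a list of (name, value) pairs. This shape is an
    assumption made for the sketch, not the engine's real interface.
    """
    for label, attributes in records:
        out.write(f"<{label}>")
        for name, value in attributes:
            # Escape &, < and > so extracted text cannot break the markup.
            out.write(f"<{name}>{escape(value)}</{name}>")
        out.write(f"</{label}>\n")


def example_matches():
    # Hypothetical match, shaped like the story record shown in Section 2.
    yield ("story", [("title", "Tax cuts ..."),
                     ("source", "Washington Post"),
                     ("source", "Wall Street Journal")])


# An in-memory sink stands in for the output file here.
sink = io.StringIO()
stream_records(example_matches(), sink)
print(sink.getvalue(), end="")
```

Because `stream_records` consumes a generator, swapping the in-memory sink for an open file or a database writer would not change the memory profile of the host side.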
Figure 2: OXPath System Architecture (layers shown: Embedding, OXPath Engine, Web Access)

Figure 3: OXPath Visual Profiler

Engine Layer. After parsing, the query optimizer expands abbreviated expressions, such as field(), and feeds the resulting queries to our Page-At-A-Time (PAAT) algorithm [6]. This algorithm controls the overall evaluation strategy and uses the browser’s XPath engine to evaluate individual XPath steps and a buffer manager to handle page modifications. In this demonstration (see Section 4), we illustrate the inner workings of the buffer manager using the OXPath visual profiler shown in Figure 3. The visual profiler is an AJAX webpage that uses the InfoVis Toolkit (thejit.org) to replay or step through the evaluation of an OXPath expression. Its main feature is a visualization of the page navigation tree, showing how pages are traversed by OXPath.

Web Access Layer. For evaluating OXPath expressions on web pages, we require programmatic access to a dynamic DOM rendering engine, as employed by all modern web browsers. We identified HtmlUnit (htmlunit.sourceforge.net), the Mozilla-based JREX (jrex.mozdev.org), and the also Mozilla-based SWT widget (eclipse.org/swt) as suitable backends. To decouple our own implementation from their implementation details, we provide the web access layer as a facade which allows for exchanging the underlying browser engine independently of our other code. In our current implementation, we use HtmlUnit as backend, since it renders pages efficiently (compared with the SWT Mozilla embedding) and is implemented entirely in Java.

4. DEMO DESCRIPTION
For this demonstration, we showcase OXPath with a proof-of-concept decision support system (diadem-project.info/oxpath/placerank): It ranks a set of UK locations using data extracted from a number of different web sites with user-provided criteria and weights. Separate OXPath expressions navigate the interfaces of each site, filling the necessary values for each location. The criteria are realized either through form filling, if the web site already provides a corresponding option, or as OXPath predicates otherwise. OXPath extracts all the relevant data; the actual weighting and ranking is done in the host language, in our case a straightforward Java program.

We extract and aggregate data from several deep web sites: Reed (reed.co.uk) for jobs, Rightmove (rightmove.co.uk) for rentals, Yahoo’s Upcoming (upcoming.yahoo.com) for events, Thomson Local (thomsonlocal.com) for local businesses, and UpMyStreet (upmystreet.co.uk) for neighborhood assessment. For each site, the user may also provide a list of criteria for ranking the postcode, though the system proposes a default set. These criteria can be built from the data of the above web sites using a number of additional parameters for each (criteria in italics are not available through the search interface of the respective site and thus are realized through OXPath predicates):
- From Reed, we extract the number of jobs that match the following criteria (in the given postcode): job category, salary range, and number of applicants.
- From Upcoming, we extract the number of events in the specified postcode area, allowing the user to configure the period of time and the event category (e.g., “Jazz”).
- From Thomson Local, we extract local businesses if their category and distance from our postcode match.
- From Rightmove, we extract all rentals that have the chosen number of bedrooms, maximum rental, included furnishings, and availability (e.g., long term).
- From UpMyStreet, we extract any of four neighborhood quality measures (averages): income, education level, and number of children.

On the website, we also show the OXPath implementations for all five web sites. For space reasons, we highlight here only the most interesting OXPath expressions. Figure 4 shows the navigation and extraction performed on Reed. We first load Reed’s start page and fill the search field for the job category ➊, the location ➋, and the maximum distance ➌ from the given postcode. To submit the search form, we click the “Search Jobs” button ➍:

    doc("reed.co.uk")//field()[@title#='keyword']/{"Java"/}➊
    //field()[@title#='town']/{"OX1"}➋/foll::field()[1]
    /{"5 miles"}➌/foll::field()[@title='SearchJobs']/{click/}➍

For space reasons, we abbreviate the XPath axes. The # operator is a node test we adopt for convenience: A#B filters the nodes in A to those with id attribute B. The node-test field() is used to identify only visible form fields.
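The visibility rule behind field() can be made concrete with a small model. The sketch below is an illustration under stated assumptions, not the OXPath implementation: the `Node` class, the set of tags treated as form fields, and the function names are all hypothetical. It applies the three exclusions named in Section 2: type hidden, visibility hidden, and display none on the field itself or on any ancestor.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumption: the text does not spell out which tags count as form fields.
FORM_TAGS = {"input", "select", "textarea", "button"}


@dataclass
class Node:
    """A toy DOM node carrying attributes and computed CSS properties."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    style: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Node"] = None


def is_visible_field(node: Node) -> bool:
    """Return True if `node` is a form field a user could see."""
    if node.tag not in FORM_TAGS:
        return False
    if node.attrs.get("type") == "hidden":
        return False
    if node.style.get("visibility") == "hidden":
        return False
    # display: none hides the whole subtree, so walk up to the root.
    current: Optional[Node] = node
    while current is not None:
        if current.style.get("display") == "none":
            return False
        current = current.parent
    return True


def visible_fields(nodes: List[Node]) -> List[Node]:
    """Keep document order, drop everything that is not a visible field."""
    return [n for n in nodes if is_visible_field(n)]


form = Node("form")
hidden_panel = Node("div", style={"display": "none"}, parent=form)
fields_in_order = [
    Node("input", {"type": "hidden", "name": "token"}, parent=form),
    Node("input", {"type": "text", "name": "q"}, parent=form),
    Node("input", {"type": "text", "name": "advanced"}, parent=hidden_panel),
    Node("input", {"type": "submit", "name": "go"}, parent=form),
]
names = [n.attrs["name"] for n in visible_fields(fields_in_order)]
print(names)
```

In this toy form, only the text box `q` and the submit button `go` survive the filter, so positional steps over visible fields keep working even if the site adds or reorders hidden inputs.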
Figure 4: Finding an OXPath through reed.co.uk

Figure 5: Memory, results, and pages (axes: time [sec], memory [MB], matches / pages; series: memory, extracted matches, visited pages)

Since the search may return a paginated result, we use an OXPath Kleene star expression (➏+) to iterate over all the result pages: On each result page, we identify the “Next” link at the top of the results listing and click it, until there is no further such link:

    /(//a[.#="Next"][1]/{click/})*➏+

On each result page, we identify and extract those job listings that have fewer than the given maximum number of applicants (in this case 10):

    //div.resultEntry[.//label[.#="Applications"]
    /foll::text()[1]<10]:<job>

The extraction marker :<job> extracts each div.resultEntry with the label job if it matches the predicates. For those jobs, we also access their details page to extract the full job description (rather than the abbreviated one shown on the result page):

    [h4/a/{click/}➎//div.jobDescription:<desc=.>]

The extraction marker :<desc> extracts those descriptions as children of the corresponding job element, as it occurs nested in a predicate following :<job>. Further criteria such as salary range or posting date can be included in the same way (full example on the website).

To visualize OXPath’s buffer management, we use the OXPath profiler (Figure 3): It shows the tree of pages that OXPath navigates while evaluating the expressions. The evaluation can be replayed either automatically or stepwise. At the top of the page ➊, a summary of the evaluation time of the various components of OXPath is given. In the page tree, we show for each page ➋ some meta-data and the time spent rendering and navigating that page ➎.
A page is shaded gray ➌ if it is no longer buffered by OXPath. As Figure 3 shows, OXPath keeps at most the pages on the path from the start page to the current one in memory. This behavior is confirmed on a larger scale in Figure 5, where we evaluate the above OXPath expression with location “London” and distance “30 miles”.

We conclude the demonstration by going into details for one more web site, to illustrate the versatility of OXPath: extracting rentals from rightmove.co.uk. On Rightmove the user has to perform a longer sequence of actions (than on Reed) to obtain the results: Fill “Oxford” into the input field and click on the “Rent” submit button. On the returned page, fill in the maximum price field, and click submit. To find all result pages, repeatedly click the next link.

    doc("rightmove...")/desc::field()[1]/{"Oxford"}➊/foll::*#rent
    /{click/}➋//*#maxPrice/{"1,250 PCM"/}➌/foll::field()[last()]
    /{click/}➍/(//a[.#='next']/{click/})*➎//ol#summaries/li[
    .//a[.#='details']/{click/}➏//text()[.='Furnished']]:<R>

Figure 6: OXPath for finding affordable rentals

The OXPath expression in Figure 6 realizes this sequence of actions in Lines 1–4: To fill the first input field, we use OXPath’s field() node-test and a contextual action ➊ (denoted with just {} around the action). The navigation continues from the same field after a contextual action. The other actions are absolute actions (with an added /), where the navigation resumes at the root of the page retrieved by the action. We locate and click ➋ the “Rent” button to submit the form. Rightmove does not directly return results, but asks for a refinement of the query. On the refinement page, we identify the drop-down list for the maximum price and also submit the refinement form by clicking ➍ its submit button (the last visible web form field). Again we iterate over all result pages. In order to extract only “Furnished” apartments, we need to navigate to the details page ➏.
We extract the filtered results from the result page using the :<R> extraction marker.

5. REFERENCES
[1] A. Alba, V. Bhagwan, and T. Grandison. Accessing the deep web: when good ideas go bad. In OOPSLA, 2008.
[2] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In VLDB, 2001.
[3] J. P. Bigham, A. C. Cavender, R. S. Kaminsky, C. M. Prince, and T. S. Robison. Transcendence: enabling a personal view of the deep web. In IUI, 2008.
[4] M. Bolin, M. Webber, P. Rha, T. Wilson, and R. C. Miller. Automation and customization of rendered web pages. In UIST, 2005.
[5] M. Marx. Conditional XPath. ACM Trans. Database Syst., 30(4), 2005.
[6] OXPath. http://www.diadem-project.info/oxpath.
[7] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, 2007.