FAST Enterprise Crawler Guide
FAST Enterprise Crawler version:6.7 Crawler Guide Document Number: ESP939, Document Revision: B, December 03, 2009 Copyright Copyright © 1997-2009 by Fast Search & Transfer ASA (“FAST”). Some portions may be copyrighted by FAST’s licensors. All rights reserved. The documentation is protected by the copyright laws of Norway, the United States, and other countries and international treaties. No copyright notices may be removed from the documentation. No part of this document may be reproduced, modified, copied, stored in a retrieval system, or transmitted in any form or any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser’s use, without the written permission of FAST. Information in this documentation is subject to change without notice. The software described in this document is furnished under a license agreement and may be used only in accordance with the terms of the agreement. Trademarks FAST ESP, the FAST logos, FAST Personal Search, FAST mSearch, FAST InStream, FAST AdVisor, FAST Marketrac, FAST ProPublish, FAST Sentimeter, FAST Scope Search, FAST Live Analytics, FAST Contextual Insight, FAST Dynamic Merchandising, FAST SDA, FAST MetaWeb, FAST InPerspective, NXT, FAST Unity, FAST Radar, RetrievalWare, AdMomentum, and all other FAST product names contained herein are either registered trademarks or trademarks of Fast Search & Transfer ASA in Norway, the United States and/or other countries. All rights reserved. This documentation is published in the United States and/or other countries. Sun, Sun Microsystems, the Sun Logo, all SPARC trademarks, Java, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Netscape is a registered trademark of Netscape Communications Corporation in the United States and other countries. Microsoft, Windows, Visual Basic, and Internet Explorer are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Red Hat is a registered trademark of Red Hat, Inc. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. AIX and IBM Classes for Unicode are registered trademarks or trademarks of International Business Machines Corporation in the United States, other countries, or both. HP and the names of HP products referenced herein are either registered trademarks or service marks, or trademarks or service marks, of Hewlett-Packard Company in the United States and/or other countries. Remedy is a registered trademark, and Magic is a trademark, of BMC Software, Inc. in the United States and/or other countries. XML Parser is a trademark of The Apache Software Foundation. All other company, product, and service names are the property of their respective holders and may be registered trademarks or trademarks in the United States and/or other countries. Restricted Rights Legend The documentation and accompanying software are provided to the U.S. government in a transaction subject to the Federal Acquisition Regulations with Restricted Rights. Use, duplication, or disclosure of the documentation and software by the government is subject to restrictions as set forth in FAR 52.227-19 Commercial Computer Software-Restricted Rights (June 1987). 
Contact Us

Web Site
Please visit us at: http://www.fastsearch.com/

Contacting FAST
FAST
Cutler Lake Corporate Center
117 Kendrick Street, Suite 100
Needham, MA 02492 USA
Tel: +1 (781) 304-2400 (8:30am - 5:30pm EST)
Fax: +1 (781) 304-2410

Technical Support and Licensing Procedures
Technical support for customers with active FAST Maintenance and Support agreements, e-mail: [email protected]
For obtaining FAST licenses or software, contact your FAST Account Manager or e-mail: [email protected]
For evaluations, contact your FAST Sales Representative or FAST Sales Engineer.

Product Training
E-mail: [email protected]
To access the FAST University Learning Portal, go to: http://www.fastuniversity.com/

Sales
E-mail: [email protected]

Contents

Preface
    Copyright
    Contact Us
Chapter 1: Introducing the FAST Enterprise Crawler
    New features
    Web concepts
    Crawler concepts
    Enterprise Crawler Architecture
    Configuring a crawl
        Where to begin?
        Where to go?
        How fast to crawl?
        How long to crawl?
        Excluding pages in other ways
        External limits on fetching pages
        Removal of old content
    Browser Engine
Chapter 2: Migrating the Crawler
    Overview
    Storage overview
        Document storage
        Meta database
        Postprocess Database
        Duplicate server database
        Configuration and Routing Databases
    CrawlerGlobalDefaults.xml file considerations
    The Migration Process
Chapter 3: Configuring the Enterprise Crawler
    Configuration via the Administrator Interface (GUI)
        Modifying an existing crawl via the administrator interface
        Basic Collection Specific Options
        Advanced Collection Specific Options
        Adaptive Crawlmode
        Authentication
        Cache Sizes
        Crawl Mode
        Crawling Thresholds
        Duplicate Server
        Feeding Destinations
        Focused Crawl
        Form Based Login
        HTTP Proxies
        Link Extraction
        Logging
        POST Payload
        Postprocess
        RSS
        Storage
        Sub Collections
        Work Queue Priority
    Configuration via XML Configuration Files
        Basic Collection Specific Options (XML)
        Crawling thresholds
        Refresh Mode Parameters
        Work Queue Priority Rules
        Adaptive Parameters
        HTTP Errors Parameters
        Logins parameters
        Storage parameters
        Password Parameters
        PostProcess Parameters
        Log Parameters
        Cache Size Parameters
        Link Extraction Parameters
        The ppdup Section
        Datastore Section
        Feeding destinations
        RSS
        Metadata Storage
        Writing a Configuration File
        Uploading a Configuration File
    Configuring Global Crawler Options via XML File
        CrawlerGlobalDefaults.xml options
        Sample CrawlerGlobalDefaults.xml file
    Using Options
        Setting Up Crawler Cookie Authentication
        Implementing a Crawler Document Plugin Module
        Configuring Near Duplicate Detection
        Configuring SSL Certificates
    Configuring a Multiple Node Crawler
        Removing the Existing Crawler
        Setting up a New Crawler with Existing Crawler
    Large Scale XML Crawler Configuration
        Node Layout
        Node Hardware
        Hardware Sizing
        Ubermaster Node Requirements
        Duplicate Servers
        Crawlers (Masters)
        Configuration and Tuning
        Duplicate Server Tuning
        Postprocess Tuning
        Crawler/Master Tuning
        Maximum Number of Open Files
        Large Scale XML Configuration Template
Chapter 4: Operating the Enterprise Crawler
    Stopping, Suspending and Starting the Crawler
        Starting in a Single Node Environment - administrator interface
        Starting in a Single Node Environment - command line
        Starting in a Multiple Node Environment - administrator interface
        Starting in a Multiple Node Environment - command line
        Suspending/Stopping in a Single Node Environment - administrator interface
        Suspending/Stopping in a Single Node Environment - command line
        Suspending/stopping in a Multiple Node Environment - administrator interface
        Suspending/stopping in a Multiple Node Environment - command line
    Monitoring
        Enterprise Crawler Statistics
    Backup and Restore
        Restore Crawler Without Restoring Documents
        Full Backup of Crawler Configuration and Data
        Full restore of Crawler Configuration and Data
        Re-processing Crawler Data Using postprocess
        Single node crawler re-processing
        Multiple node crawler re-processing
        Forced Re-crawling
        Purging Excluded URIs from the Index
        Aborting and Resuming of a Re-process
    Crawler Store Consistency
        Verifying Docstore and Metastore Consistency
        Rebuilding the Duplicate Server Database
    Redistributing the Duplicate Server Database
    Exporting and Importing Collection Specific Crawler Configuration
    Fault-Tolerance and Recovery
        Ubermaster
        Duplicate server
        Crawler Node
Chapter 5: Troubleshooting the Enterprise Crawler
    Troubleshooting the Crawler
        Reporting Issues
        Known Issues and Resolutions
Chapter 6: Enterprise Crawler - reference information
    Regular Expressions
        Using Regular Expressions
        Grouping Regular Expressions
        Substituting Regular Expressions
    Binaries
        crawler
        postprocess
        ppdup
    Tools
        crawleradmin
        crawlerdbtool
        crawlerconsistency
        crawlerwqdump
        crawlerdbexport
        crawlerstoreimport
    Crawler Port Usage
    Log Files
        Directory structure
        Log files and usage
        Enabling all Log Files
        Verbose and Debug Modes
        Crawler Log Messages
        PostProcess Log
        Crawler Fetch Logs
        Crawler Fetch Log Messages
        Crawler Screened Log Messages
        Crawler Site Log Messages

Chapter 1: Introducing the FAST Enterprise Crawler

Topics:
• New features
• Web concepts
• Crawler concepts
• Enterprise Crawler Architecture
• Configuring a crawl
• Browser Engine

This chapter introduces the FAST Enterprise Crawler (EC), version 6.7, for use with FAST ESP.

New features

New features since EC 6.3:
• Significant large-scale performance and robustness improvements. Through efficiency improvements and the addition of new configuration variables to reduce or eliminate inter-node communications, a large-scale web crawl of up to 2 billion documents can be supported with the crawler, with 25-30 million documents on over 60 dedicated crawler hosts.
• Multimedia enabled crawler.
• Document evaluator plugin.
• NTLM v1 server and digest authentication.

New features since EC 6.4:
• Introduction of the Browser Engine, which enables more links to be extracted from:
    • JavaScript. By default the Browser Engine extracts most links, but the customizable preprocessors and extractors allow even more links to be extracted.
    • Static Flash. By default links will be extracted from static flash (.swf) and flash video files (.flv).
• IDNA support.
• Authentication improvements. Full NTLM v1 support and improved form based authentication.
• Operational improvements including new crawl modes and tools to verify crawler store consistency and change the number of crawler nodes and duplicate servers.
• Near duplicate detection to evaluate patterns in the content to identify duplicates.

No new features since EC 6.5 as it was never officially released.
New features since EC 6.6:
• Comprehensive sitemap support, which includes:
    • Automatic detection of sitemaps, including support for the robots.txt directive.
    • Support for storing/indexing metadata from sitemaps.
    • Obey sitemap access rules.
    • Sitemap enabling/disabling per subdomain.
    • Use the lastmod attribute to determine what pages require re-crawling (non-adaptive crawl mode only).
    • Use the priority and changefreq attributes to score documents in adaptive crawl mode.
• Improved crawleradmin refetch and refeed options.
• Extended the crawleradmin verifyuri option to perform a more thorough verification.
• Passwords no longer stored or presented in plain text in exported crawler configurations.
• Postprocess supports auto-resume of the previous interrupted refeed.
• Improved flexibility in matching robots.txt user-agents through a regular expression.
• Configurable session cookie timeout.
• Support for overriding the Obey robots.txt setting in sub collections.
• Document plugins can now perform limited logging to the fetch log.

Web concepts

This section provides a list of definitions for terms that apply to the part of the Internet called the World Wide Web (www).

Web server
A web server is a network application using the HyperText Transfer Protocol (HTTP) to serve information to users. Human users utilize a client application called a browser to request, transfer and display documents from the web server. The documents may be web pages (encoded in HTML or XML markup languages), files stored on the web server's file system in any number of formats (Microsoft Word or Adobe Acrobat PDF documents, JPEG or other image files, MP3 or other audio files), or content generated dynamically based on the user's request (e-commerce products, search results, or database lookup results). The crawler responds to HTTP error codes. For extensive explanations of all HTTP/1.1 RFC codes, refer to the Hypertext Transfer Protocol -- HTTP/1.1 available at http://www.ietf.org/.

Web site vs. web server
A web site is a given hostname (for example, www.example.com), with an associated IP address (or, sometimes, a set of IP addresses, generally if a site gets a lot of traffic), which supports the HTTP protocol and serves content to user requests. A web server is the hardware system corresponding to this hostname. Several web sites may share a given web server, or even a single IP address.

Web page
A web page is the standard unit of content returned by a web server, which may be identified by one or more URIs. It may represent the formatted output of a markup language (for example, HTML or XML), or a document format stored on-disk (for example, Microsoft Office, Adobe PDF, plain text), or the dynamic representation of a database or other archive. In any case, the web server will return some header information along with the content, describing the format of the contents, using the Internet Standard MIME type conventions.

Links
Web pages may contain references to other web pages, either on the same web server or elsewhere in the network, called hyperlinks or simply links. These links are identified by various internal formatting tags.

Uniform Resource Identifier (URI) vs. Uniform Resource Locator (URL)
URI is the overall namespace for identifying resources. URL is a specific type that includes the location of the resource (for example, a web page, http://www.example.com/index.html).
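As an aside, the breakdown of this example URI that the next paragraph describes (scheme, hostname, implied port, and path) can be reproduced with any standard URI parser. The short Python sketch below is purely illustrative and uses only the standard library; it is not part of the Enterprise Crawler:

    # Illustrative only: split the example URI into the components discussed here.
    from urllib.parse import urlsplit

    parts = urlsplit("http://www.example.com/index.html")
    print(parts.scheme)     # http            (the network protocol)
    print(parts.hostname)   # www.example.com
    print(parts.port)       # None -> port 80 is implied for the http scheme
    print(parts.path)       # /index.html     (path and page on the server)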
Encoded within this example are the network protocol (scheme, HTTP), the hostname and implied port number (www.example.com, port 80), and a specific path and page on that server (/index.html). URI is the more general term, and is preferred. Extensive RFC 3986 details can be found at http://www.ietf.org/.

IDNA
Since normal DNS resolving doesn't support characters outside the ASCII scope of characters, a hostname containing these special characters has to be translated into an ASCII based format. This translation is defined by the Internationalizing Domain Names in Applications (IDNA) standard. An example of such a hostname would be www.blåbærsyltetøy.no. The DNS server doesn't understand this name, so the host is registered as the IDN encoded version of the host name: www.xn--blbrsyltety-y8ao3x.no. The Crawler will automatically translate these host names to IDN encoded names before DNS lookup is performed. When working with URIs that use special characters, please make sure the collection or the start URI files have been stored using UTF-8 or similar encoding. Extensive RFC 3490 details can be found at http://www.ietf.org/.

RSS
RSS is a family of web feed formats used to publish frequently updated digital content, such as blogs, news feeds or podcasts. Users of RSS content use programs called feed readers or aggregators. The user subscribes to a feed by supplying to their reader a link to the feed. The reader can then check the user's subscribed feeds to see if any of those feeds have new content since the last time it checked and if so, retrieve that content and present it to the user.

The following RSS formats/versions are supported by the crawler:
• RSS 0.9-2.0
• ATOM 0.3 and 1.0
• Channel Definition Format (CDF)

XML Sitemaps
An XML Sitemap (also known as Google Sitemap) is an XML format for specifying the links on the site, with associated meta data. This meta data includes the following per URI:
• The priority (importance)
• The change frequency
• The time it was last modified

The crawler can be configured to download such sitemaps, and make use of this information when deciding what URIs to crawl, and in what order, for a site. In non-adaptive refresh mode the crawler uses the lastmod attribute to determine whether a page has been modified since the last time the sitemap was retrieved. Pages that have not been modified will not be recrawled in this crawl cycle. In adaptive refresh mode the crawler will use the priority and changefreq attributes from the sitemap to score (weight) a page. Thus, assuming that the sitemap has sane values, the crawler will prioritize high priority content. The sitemap is only re-downloaded each major cycle however. See the Sitemap support configuration option for more information.

Crawler concepts

The crawler is a software application that gathers (or fetches) web pages from a network, typically a bounded institutional or corporate network, but potentially the entire Internet, in a controlled and reasonably deterministic manner. The crawler works, in many ways, like a web browser to download content from web servers. But unlike a browser that responds only to the user's input via mouse clicking or keyboard typing, the crawler works from a set of rules it must follow when requesting web pages, including how long to wait between requests for pages (Request rate), and how long to wait before checking for new/updated pages (Refresh interval).
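To picture the Request rate in isolation: the crawler does not issue the next request to a web site until the configured delay has passed since the previous fetch from that site. The sketch below is only a conceptual illustration of such a per-site politeness delay; the function names are invented for the example, and the 60 second value simply mirrors the default delay discussed later in this chapter, so none of this reflects the crawler's actual internals:

    import time

    def fetch(uri):
        # Placeholder for an HTTP request; a real crawler would use an HTTP client here.
        print("fetching", uri)

    def crawl_site_politely(uris, delay_seconds=60.0):
        # Fetch URIs from a single site, waiting delay_seconds between requests.
        last_fetch = None
        for uri in uris:
            if last_fetch is not None:
                remaining = delay_seconds - (time.monotonic() - last_fetch)
                if remaining > 0:
                    time.sleep(remaining)   # honor the per-site request rate
            last_fetch = time.monotonic()
            fetch(uri)

    crawl_site_politely(["http://www.example.com/", "http://www.example.com/a.html"],
                        delay_seconds=2)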
For each web page downloaded by the crawler, it makes a list of all the links to other pages, and then checks these links against the rules for what hosts, domains, or paths it is allowed to fetch. A brief description of the crawler's algorithm is that it will start by comparing the start URIs list against the include and (if defined) exclude rules. Valid URIs are then requested from their web servers at a rate determined by the specified request rate. If fetched successfully, the page is parsed for links, and information about the page stored in the meta database, with the contents stored in the crawler store. The URIs from the parsed links are each evaluated against the rules, fetched, and the process continues until all included content has been gathered, or the refresh interval is complete. Because of the many different situations in which the crawler is used, there are many different ways to adjust its configuration. This section identifies some of the fundamental elements used to set up and control web page collection: 12 Collection Named set of documents to be indexed together in ESP, this also identifies the crawler's configuration rules. Storage The crawler stores crawled content locally by default, to be passed to other ESP components later. If there is too much data to store, or a one-time index build is planned, pages can be deleted after having been indexed. It also builds a database of meta data, or details about a web page, such as what pages link to it or if there are any duplicate copies. Introducing the FAST Enterprise Crawler Include rules Settings that indicate the hosts and/or URIs that may be fetched. These can be quite specific such as a list of web servers or URIs, or general, for all the servers in one's network. It's important to keep in mind that this only specifies what may be fetched, it does not define where to start crawling (see start URIs list below). Exclude rules Optional settings that prevent hosts and/or URIs from being fetched, because they would otherwise match the include rules, but are not desired in the index. Start URIs list List of web pages (URIs) to be fetched first, from which additional links may be extracted, tested against the rules, and added to work queues for subsequent fetch attempts. As each is fetched, additional URIs on that site and others may be found. If there are URIs listed to more sites in the start URIs list than the number of sites the crawler can connect to simultaneously (Maximum number of concurrent sites configuration variable), then some will remain queued until a site completes crawling, at which point a new site can be processed. The start URIs list is sometimes referred to as a seed URIs list or simply seed list. Refresh interval Length of time the crawler will work before re-crawling a site to see if new or modified pages exist. The behavior of the crawler during this period depends upon the refresh mode. If the crawler is busy it will have work queues of pages yet to be fetched; the contents of the existing work queues may either be kept and crawled during the next refresh cycle, or it may be erased (scratched). In either case the start URIs are also added to the work queue. In the adaptive mode the overall refresh interval is called the major cycle. The major cycle is subdivided into multiple minor cycles, with goals and limits regarding the number of pages to be revisited. 
This interval may be quite short, measured in hours or minutes, for "fresh" data like news stories, but is more typically set as a number of days. The refresh interval is sometimes referred to as the crawl cycle, refresh cycle or simply refresh.

Request rate: The amount of time the crawler will "wait" after fetching a document before attempting another fetch from the same web site. For flexibility, different rates (variable delay) can be specified for different times of day, or days of the week. Setting this value very low can cause problems, as it increases the activity of both the web sites and the crawler system, along with the network links between them. The request rate is sometimes referred to as the page request delay, request delay or delay.

Concurrent sites: The crawler is capable of crawling a large number of unique web sites, however only a limited number of these can be crawled concurrently at any one time. Normally, the crawler will crawl a site to completion before continuing on the next site. You can however limit the amount of documents crawled from a single site by several means, see Excluding or limiting documents below. This can be used to ensure the crawler eventually gets time to crawl all the web sites it is configured to.

Crawl speed: Sometimes also referred to as crawl rate, this is the rate at which documents are fetched from the web sites for a given collection. The highest possible crawl rate can be calculated from the number of concurrent sites divided by the request rate. For example, if crawling 50 web sites with a request rate of 10 (10 seconds "delay" between each fetch) the total maximum achievable crawl rate will be 5 documents per second. However, if the network or web sites are slow the actual crawl rate may be less.

Excluding or limiting documents: Because the ultimate goal of crawling is indexing the textual content of web pages rather than viewing a fully detailed web page (as with a browser), the standard configuration of the crawler includes some exceptions to the rules of what to fetch. A common example is graphical content; JPEG, GIF and bitmap files are all excluded. There are several other controls that can be set to limit downloaded content. A per-page size limit can be set, with another option to control what happens when the size limit is exceeded (drop the page, or truncate it at the size limit). Another option, Maximum documents per site, limits the number of pages downloaded from a given web site; helpful if a large number of sites is being surveyed, and too many pages fetched from a "deep" site would limit the resources available and starve other sites.

Level or hops: This value indicates how many links have been followed from a start URI (Level 0) to reach the current page. It is used in evaluating a crawl in which a DEPTH value has been specified. For example, if the start URI http://www.example.com/index.html links to /sitemap.html on the same site, from which a link to http://www.example.com/test/setting/000/output_listing/three.txt is extracted, this latter URI will be Level 2. The number of path elements is not considered in determining the Level value. If you are running a DEPTH:0 crawl, the start URIs will be crawled, but redirects and frame links will also be allowed. To strictly enforce a start URI only crawl, specify DEPTH:-1 (minus-one).

Feed/refeed: The crawler will send fetched pages to FAST ESP in batches to be indexed, updated or deleted, a process known as feeding.
In normal operation it will automatically maintain the synchronization between what pages exist on web sites and what pages are available in the index. Under some circumstances it may be necessary to rebuild the collection in the index, or make major (bulk) changes in the contents. For example, a large number of deletions of sites or pages no longer desired, or significant changes in the processing pipeline would both require resending data to FAST ESP. In this case the crawler can be shut down and postprocess run manually, a process known as re-feeding. After restarting the crawler, it will continue to keep the index updated based on new pages fetched, or deleted pages discovered.

Duplicate documents: A web document may in some cases be represented by more than a single URI. In order to avoid indexing the same document multiple times a mechanism known as duplicate detection is used to ensure that only one copy of each unique document is indexed. The crawler supports two ways of identifying such duplicates. The first method is to strip all HTML markup and white space from the document, and then compute an MD5 checksum of the resulting content. For non-HTML content such as PDFs the MD5 checksum is generated directly from the binary content. Any documents sharing the same checksum are duplicates. A variation on this method is the near duplicate detection; refer to the Configuring Near Duplicate Detection chapter for more information. The set of documents with different URIs classified as duplicates will be indexed as one document in the index, but the field 'urls' will contain multiple URIs pointing to this document.

Note: The crawler's duplicate handling will only apply within collections, not across.

Enterprise Crawler Architecture

The Enterprise Crawler is typically a component within a FAST ESP installation, started and stopped by the Node Controller (nctrl). Internally the crawler is organized as a collection of processes and logical entities, which in most cases run on a single machine. Distributing the processes across multiple hosts is supported, allowing the crawler to gather and process a larger number of documents from numerous web sites.

Table 1: Crawler Processes

    Binary        Function
    crawler       Master/Ubermaster
    uberslave     Uberslave/Slave
    postprocess   Postprocess
    ppdup         Duplicate Server
    crawlerfs     File Server

In a single node installation the primary process is known as the master, and is implemented in the crawler binary. It has several tasks, including resolving DNS names to addresses, maintaining the collection configurations, and other "global" jobs. It also allocates sites to one of the uberslave processes. The master is started (or stopped) by the node controller, and is responsible for starting and stopping the other crawler processes. These include the uberslave processes (two by default), each of which creates multiple slave entities. The uberslave is responsible for creating the per-site work queues and databases; a slave is allocated to a single site at any given time, and is responsible for fetching pages, directly or through a proxy, computing the checksum of the page's content, storing the page to disk, and associated activities such as logging in to protected sites. The postprocess maintains a database of document content checksums, to determine duplicates (more than one URI corresponding to the same data content), and is responsible for feeding batches of documents to FAST ESP.
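The checksum-based duplicate detection described above can be pictured with a short sketch: strip markup and white space from HTML and take an MD5 of what remains, or checksum the raw bytes for non-HTML formats. The code below only illustrates the general idea; the normalization the crawler actually performs is not documented here and may differ:

    import hashlib
    import re

    def content_checksum(body, is_html):
        # Illustrative duplicate-detection checksum: MD5 of normalized content.
        if is_html:
            text = body.decode("utf-8", errors="replace")
            text = re.sub(r"<[^>]*>", "", text)   # strip HTML markup (crudely)
            text = re.sub(r"\s+", "", text)       # strip all white space
            data = text.encode("utf-8")
        else:
            data = body                           # e.g. a PDF: checksum the raw bytes
        return hashlib.md5(data).hexdigest()

    a = b"<html><body><p>Hello   world</p></body></html>"
    b = b"<HTML><BODY><P>Hello world</P></BODY></HTML>"
    print(content_checksum(a, True) == content_checksum(b, True))  # True -> duplicates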
Small documents are sent directly to the document processing pipelines, but larger documents are sent with only a reference to the document; the file server process is responsible for supplying the document contents to any pipeline stage that requests it. Figure 1: Single Node Crawler Architecture When the number of web sites, or total number of pages to be crawled, is large, the crawler can be scaled up by distributing the processes across multiple hosts. In this configuration some additional processes are required. An ubermaster is added, which takes on the role of DNS name/address resolution, centralized logging, and routing URIs to the appropriate master node. Each master node continues to have a postprocess locally, but each of these must now submit URI checksums to the duplicate server, which maintains a global database of URIs and content checksums. 15 FAST Enterprise Crawler Figure 2: Multiple Node Crawler Architecture Refer to the FAST ESP Product Overview Guide, Basic Concepts chapter for FAST ESP search engine concepts. Configuring a crawl The purpose of the crawler is to fetch the web pages that are desired for the index, so that users can search for and find the information they need. This section introduces how to limit and guide the crawler in selecting web pages to fetch and index, and describes the alternatives for what to do once the refresh interval has completed. In building an index it is important to include documents that have useful information that people need to find, but it is also critical to exclude content that is repetitive or otherwise less useful. For example, the automated pages of an on-line calendar system, with one page per day (typically empty), stretching off into the distant future may not be useful. Keep this in mind when setting up what the crawler will, and will NOT, fetch and process. At a minimum, a crawl is defined by two key issues: where to begin, and where to go; also important are determining how fast to crawl, and for how long. Where to begin? The start URIs list provides the initial set of URIs to web sites/pages for the crawler to consider. As each is fetched it generates additional URIs to that site and other sites. If there are URIs listed to more sites in the start URIs list than the number of sites the crawler can connect to simultaneously (Maximum number of concurrent sites), then some remain pending until a site completes crawling, at which point a new site can be processed. To prevent site starvation, the setting of Maximum documents before interleaving can force a different site to be scheduled after the specified value of pages are fetched. 16 Introducing the FAST Enterprise Crawler Note: This can be expensive with regard to queue structure and the possibility of overflowing file system limits. It is recommended that you thoroughly consider the implications on web scale crawls before implementing this feature. Where to go? The first factor to consider is what web sites should be crawled. If given no limitations, no rules to restrict it, the crawler will consider ANY URI to be valid. For most indexing projects, this is too much data. Generally, an index is being built for a limited number of known web sites, identified by their DNS domains or, more specifically, hostnames. For these sites, one or more start URIs is identified, giving the crawler a starting point within the web site. 
An include rule corresponding to the start URI can be quite specific, for example, an EXACT match for www.example.org, or it can be more general to match all websites in a given DNS domain, for example, any hostname matching the SUFFIX .example.com. Figure 3: Configuring a Crawl A crawl configured with these include rules, and a start URI to match each one, would attempt to download, store, and index all the pages available within the large circles shown in the illustration, corresponding to the www.example.com network and the www.example.org web site. It is often the case, though, that general rules such as these have specific exemptions, special cases of servers or documents that must not be indexed. Consider the host hidden.example.com in Site A containing documents that are not useful to the general public, or are otherwise deemed to be unworthy of indexing. To prevent any pages from this site being fetched, the crawler can be configured with a rule to exclude from consideration any URI with the EXACT hostname hidden.example.com. Another possibility is that only files from a particular part of the web site should be avoided; in such a case an Exclude URI rule could be entered, for example, any URI matching the PREFIX http://hidden.example.com/forbidden_fruit/. As the crawler fetches pages and looks through them for new URIs to crawl, it will evaluate each candidate URI against its configured rules. If a URI matches either an Include Hostname or Include URI filter rule, while NOT matching an Exclude Hostname or Exclude URI filter rule, then it is considered eligible for further processing, and possibly fetching. 17 FAST Enterprise Crawler Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3). In previous ESP releases these two rule types were evaluated as an AND operation, meaning that a URI has to match both rules (when rules are defined). As of ESP 5.1 and later (EC 6.6 and up), the rules processing has changed to an OR operator, meaning a URI now only needs to match one of the two rule types. Existing crawler configurations migrating from EC 6.3 must be updated, by removing or adjusting the hostname include rules that overlap with the URI include rules. Within a given site, you can configure the crawler to gather all pages (a FULL crawl), or limits can be set on either the depth of the crawl (how many levels of links are followed), or an overall limit on the number of pages allowed per site can be set (Maximum documents per site). How fast to crawl? Perhaps the key variable that affects how much work the crawler, the network, and the remote web sites must do is the page Request rate. This value is determined by the delay setting, which indicates how long the crawler should wait, after fetching a page, before requesting the next one. For each active site, the overall page request rate will be a function of the Delay, the Outstanding page requests setting, and the response time of the web site returning requested pages. The crawler's overall download rate depends on how many active sites are busy. The number of uberslaves per node can be modified, with a default of two and a maximum of eight. When crawling remote sites that are not part of the same organization running the crawl, using the default delay value of 60 seconds (or higher) is appropriate, so as not to burden the web site from which pages are being requested. 
For crawlers within the same organization/network, lower values may be used, though note that using very low values (for example, less than 5 seconds) can be stressful on the systems involved. How long to crawl? The Refresh interval determines the overall crawl cycle length; the period of time over which a crawl should run without revisiting a site to see if new or modified pages exist. Picking an appropriate interval depends on the amount of data to be fetched (which depends both on the number of web sites, and how many web pages each contains) and the update rate or freshness of the web sites. In some cases, there are many web sites with very static/stable data, and a few that are updated frequently; these can be configured as either separate collections, or given distinct settings through the use of a sub collection. The behavior of the crawler at the end of the crawl cycle depends upon the Refresh mode setting, the Refresh when idle setting, and the current level of activity. If the refresh cycle is long enough that all sites have been completely crawled, and the refresh when idle parameter is "no", the crawler will remain idle until the refresh interval ends. If the refresh when idle parameter is "yes", a new cycle will be started immediately. In the next cycle, the start URIs list is followed as in the first cycle. On the other hand, if the crawler is still busy, it will have work queues of pages yet to be fetched; in the default setting (scratch), the work queues are erased, and the cycles begin "from scratch", just as in the first cycle. Other options keep any existing work queues, and specify that the start URIs are to be placed at the end of the work queue list (append), or at the front (prepend). In the adaptive mode, the major cycle is subdivided into multiple micro cycles, with goals and limits regarding the number of pages to be revisited in each of these. It works by maintaining a scaled score for each page it retrieves, and this score is used to determine if a document should be re-crawled multiple times within a major cycle. This mode is mainly useful when crawling large bodies of data. For instance, if a site that is being crawled contains several million pages it can take, say, a month to completely crawl the site. If the "top" of 18 Introducing the FAST Enterprise Crawler the site changes frequently and contains high quality information it may be useful for these pages to be crawled more frequently than once a month. When in adaptive mode, the crawler will do exactly that. Excluding pages in other ways Because the ultimate goal of crawling in FAST ESP is indexing the textual content of web pages, rather than viewing a fully detailed web page (as with a browser), the standard configuration of the crawler includes some exceptions to the rules of what to fetch. A common example is graphical content; JPEG, GIF and bitmap files are excluded. Links to audio or video content are typically excluded to avoid downloading large amounts of content with no text content, although special multimedia crawls may choose to include this content for further processing. These restrictions can be implemented using either filename extensions (for example, any file that ends with ".jpg"), or via the Internet standard MIME type (for example, "image/jpeg"). Note that a MIME type screening requires the crawler to actually fetch part of the document whereas an extension exclude can be performed without any network access. 
There are several other controls that can be set to limit downloaded content. A per-page size limit can be set, with another option to control what happens when the size limit is exceeded (drop the page, or truncate it at the size limit). Another option limits the number of pages downloaded from a given web site, helpful if a large number of sites is being surveyed, and too many pages fetched from a "deep" site would limit the resources available and starve other sites. It is also an option to exclude pages based on the header information returned by web servers as part of the HTTP protocol. External limits on fetching pages Not every page that meets the crawler's configured rule set will be successfully fetched. In many cases, a "trivial" crawl configured with a single start URI and a rule including just that one site will start, then suddenly stop without any pages having been fetched. This is generally an issue when the site itself has signaled that it does not wish to be crawled. This section summarizes the ways that pages are NOT successfully crawled, and discuss how to recognize this situation. The first fact to consider is that not all pages exist! Documents can be removed from a web server, and due to the distributed nature of the web the links pointing to it may never disappear entirely. If such a "dead" link is provided to the crawler, by either harvesting it off a fetched page or listed as a start URI, it will result in an HTTP "404" error being returned by the remote web server, indicating "File Not Found". The HTTP status codes are logged in the Crawler Fetch Logs. Login Control One common mechanism for limiting access to pages, by either crawlers or browsers, is to require a login. Refer to Setting Up Crawler Cookie Authentication for more information. Robots Control Because the crawler (and programs similar to it, known collectively as "spiders" or "robots") collects web pages automatically, repetitively fetching pages from a web site, some techniques have been developed to give webmasters a measure of control over what can be fetched, and what pages can or cannot ultimately be indexed. This section will review these techniques, the site-wide robots.txt file, and the per-page robots META tags. The primary tool available to webmasters is the Robots Exclusion Standard (http://www.robotstxt.org), or more commonly known as the robots.txt standard. This was the first technique developed to organize the growing number of web crawlers, and is a commonly implemented method of restricting, or even prohibiting, crawlers from downloading pages. The way it works is that before a crawler fetches any page from a web site, it should first request the page /robots.txt. If the file doesn't exist, there are no restrictions on crawling documents from that server. If the file does exist, the crawler must download it and interpret the rules found there. A webmaster can choose to list rules specific to the FAST crawler, in which case the robots.txt file 19 FAST Enterprise Crawler would have an User-agent entry that matches what the crawler sends to identify itself, normally "FAST Enterprise Crawler 6", though in fact any string matching any prefix of this, such as "User-agent: fast", would be considered a match. In the most common case the webmaster can indicate that every crawler, "User-agent: *", is restricted from gathering any pages on the site, via the rule "Disallow: /". 
Any site blocked in this way is off-limits from crawling, unless the crawler is explicitly configured to override the block. This should only be done with the knowledge and permission of the webmaster. The Crawler Screened Log, which should be enabled for test crawls, would list any site blocked in this way with the entry DENY ROBOTS. The crawler supports some non-standard extensions to the robots.txt protocol. The directives are described in the following table: Table 2: robots.txt Directives Extension Comments Allow: This directive can override a Disallow: directive for a particular file or directory tree. Disallow: This directive is defined to be used with path prefixes (for example, "/samples/music/" would block any files from that directory), some sites specify particular file types to avoid(only excluding the extensions), such as Disallow: /*.pdf$, and the crawler obeys these entries. Crawl-delay: 120 This directive specifies the number of seconds to delay between page requests. If the crawler is configured with the Obey robots.txt crawl delay setting enabled (set to Yes/True), this value will override the collection-wide Delay setting for this site. Example: User-agent: * Crawl-delay: 5 Disallow: /files Disallow: /search Disallow: /book/print Allow: /files/ovation.txt Another tool that can be used to modify the behavior of visiting crawlers is known as robots META tags. Unlike the robots.txt file, which provides guidance for the entire web site, robots META tags can be embedded within any HTML web page, within the "head" section. For a META tag of name "robots", the content value will indicate the actions to take, or to avoid. While a page without such tags will be parsed to find new URIs before being indexed, the possible settings can prevent either or both of these actions by a crawler. In the following example, the page is being effectively blocked from further processing by any crawler that downloads it. Table 3: Robots META Tags Settings 20 Value Crawler Action index Accept the page contents for indexing. (Default) noindex Do not index the contents of this page. follow Parse the page for links (URIs) to other pages (Default) nofollow Do not follow any links (URIs) embedded in this page all All actions permitted (equivalent to "index, follow") none No further processing permitted (equivalent to "noindex,nofollow") Introducing the FAST Enterprise Crawler Example: <html> <head> <meta name="robots" content="noindex,nofollow"> <meta name="description" content="This page ...."> <title>...</title> </head> . . . <html> Removal of old content Over time the content on web sites change. Documents, and sometimes entire web sites, disappear and the crawler must be able to detect this. At the same time, web sites may also be unavailable for periods, and the crawler must be able to differentiate between these two scenarios. The crawler has two main methods of detecting removed content, which should be removed from the index. These two methods are broken links and document expiry. As the crawler follows links from documents it will inevitably come across a number of links that are not valid, i.e. the web server does not return a document but instead an HTTP error code. In some cases the web server may not even exist any more. If the document is currently in the crawler document store, i.e. it has been previously crawled, the crawler will take an appropriate action to either delete the document or retry the fetch later. 
The following table shows various HTTP codes and how they are handled by default. The action is configurable per collection.

Table 4: HTTP error handling

Error            Action taken
400-499          Delete document immediately.
500-599          Delete document on 10th failed fetch attempt.
Fetch timeout    Delete document on 3rd failed fetch attempt.
Network error    Delete document on 3rd failed fetch attempt. First retry is performed immediately.
Internal error   Keep the document.

The method of detecting dead links described above works well as long as the crawler locates links leading to the removed content. It is also sufficient in the adaptive refresh mode, since the crawler internally creates a work queue of all URIs it has previously seen and uses that for re-crawling. However, when not using adaptive refresh, a second method is necessary in order to correctly handle situations where portions of a site, or perhaps the entire web site, have disappeared from the web. In this case the crawler will most likely not discover links leading to each separate document.

The method used in this case is document expiry, usually referred to as DB switch in the crawler. The crawler keeps track internally of all documents seen in every refresh cycle. It is therefore able to create a list of documents not seen for the last X cycles, where X is defined as the DB switch interval. Under the assumption that the crawler is able to completely re-crawl every web site every crawl cycle, these documents no longer exist on the web servers. The action taken by the crawler on these documents depends on the DB switch delete option. The default value of this option is No, which instructs the crawler not to delete them immediately, but rather to place them on the work queue to verify that they are indeed removed from the web sites in question. Every document found to be irretrievable is subsequently deleted. This is the recommended setting; however, it is also possible to instruct the crawler to immediately discard these documents.

Care should be taken when adjusting the DB switch interval and especially the DB switch delete option. Setting the former too low while using a brief refresh cycle can lead to a situation where the crawler incorrectly interprets large numbers of documents as candidates for deletion. If the DB switch delete option is then set to Yes, it is entirely possible for the crawler to accidentally delete a large portion of the crawler store and index.

Browser Engine

The Browser Engine is a stand-alone component used by the Enterprise Crawler to extract information from JavaScript and Flash files. The flow from the crawler to the Browser Engine and back is explained below.

Normal processing

If the crawler detects a document containing one or more JavaScript or Flash files and the corresponding crawler option is enabled, the crawler submits the document to a Browser Engine for processing. When the Browser Engine receives the request, it picks a thread from its pool of threads and assigns the task to it. If the file is a Flash file, it is parsed for links. If the document contains JavaScript, the Browser Engine parses it, creates a DOM (document object model) tree and executes all inline JavaScript code. The DOM tree is then passed to a configurable pipeline within the Browser Engine. This pipeline constructs an HTML document, extracts cookies, generates a document checksum, simulates user interaction and extracts links. Finally the data is returned to the crawler.
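The following schematic sketch summarizes the dispatch just described. It is purely illustrative: the function name and document fields are hypothetical and do not correspond to any real crawler or Browser Engine API.

from concurrent.futures import ThreadPoolExecutor

def browser_engine_task(doc):
    # Flash content: the file is parsed for links only.
    if doc["kind"] == "flash":
        return {"links": ["http://www.example.com/found-in-flash"]}
    # JavaScript/HTML content: build a DOM tree, execute inline scripts, then
    # run the pipeline (HTML construction, cookie extraction, checksum,
    # simulated user interaction, link extraction).
    return {"html": "<html>...</html>", "cookies": [], "checksum": "...", "links": []}

# The Browser Engine assigns each incoming request to a thread from its pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    result = pool.submit(browser_engine_task, {"kind": "flash"}).result()
    print(result)   # this is the data handed back to the crawler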
Some documents processed by the Browser Engine require external documents (dependencies), such as scripts and frames. The Browser Engine will request these dependencies from the crawler, which in turn retrieves them as soon as possible. However, in order to reduce web server load the crawler still obeys the configured request rate for each of these dependencies. Once a dependency is resolved, a reply is sent back to the Browser Engine. In other words, the crawler functions as a download proxy for the Browser Engine.

The crawler stores the processed HTML document and sends it to the indexer. The crawler will also follow links from the generated HTML document, provided the URIs are permitted by the crawler configuration.

Overloading

If the Browser Engine has no available capacity when receiving a processing request, it attempts to queue the request. When the queue is full, the request is denied. The crawler automatically detects this situation and will attempt to send the request to another Browser Engine, if one is available. If none are available, the crawler uses an exponential back-off algorithm before resending the request, thus reducing the load on the Browser Engine. This means that for each failed request it will wait a bit longer before trying again. There is no upper limit on the number of retries.

A request to the Browser Engine is counted towards the maximum number of concurrent requests for the web site. The maximum number of pending requests to the Browser Engines is thus limited by this configuration option.

Chapter 2: Migrating the Crawler

Topics:
• Overview
• Storage overview
• CrawlerGlobalDefaults.xml file considerations
• The Migration Process

FAST Data Search (FDS) 4.1 and FAST ESP 5.0, and related products, included version 6.3 of the Enterprise Crawler (EC 6.3). If you are migrating an installation of a previous release, and need to preserve the crawler data store, this chapter outlines the necessary procedure. Refer to the FAST ESP Migration Guide for additional overall migration information.

Overview

FAST Data Search (FDS) 4.1, FAST ESP 5.0 and related products included version 6.3 of the Enterprise Crawler (EC 6.3). If you are migrating an installation from these releases, and need to preserve the crawler data store, this chapter outlines the necessary procedure. Upgrading from FAST ESP 5.1 or 5.2 can be done simply by preserving the crawler's data directory, as there are no changes to the storage backend between EC 6.6 and EC 6.7. Refer to the FAST ESP Migration Guide for additional overall migration information.

The EC 6.7 document storage is backwards compatible with that of EC 6.3 and EC 6.4, but the meta data store of EC 6.3 must be converted to be readable by the new version. More specifically, the meta data and configuration databases have new options or formats, to which existing data must be adapted. The document storage can be retained in the same format, or the format can be changed from flatfile to bstore with the migration tool.

The overall migration process consists of stopping the EC 6.3 crawler, so that its data becomes stable, then running an export tool in that installation to prepare the metadata for migration. In the new EC 6.7 installation, an import tool is run that can read the EC 6.3 databases and exported metadata, and copy, create, or recreate all necessary files.

Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3).
In previous ESP releases these two rule types were evaluated with an AND operator, meaning that a URI had to match both rule types (when both were defined). As of ESP 5.1 and later (EC 6.6 and up), rule processing uses an OR operator, meaning a URI now only needs to match one of the two rule types. For example, an older configuration for fetching pages with the prefix http://www.example.com/public would specify two rules:

• include_domains: www.example.com (exact)
• include_uris: http://www.example.com/public/ (prefix)

The first rule is no longer needed, and if not removed it would allow any URI from that host to be fetched, not only those under the /public path. Some configurations may be much more complex than this simple example, and require careful adjustment in order to restrict URIs to the same limits as before. Contact FAST Support for assistance in reviewing your configuration if in doubt. Existing crawler configurations migrating from EC 6.3 must be manually updated, by removing or adjusting the hostname include rules that overlap with URI include rules.

Note: While the EC 6.4 configuration and database are compatible with EC 6.7, and do not require special processing to convert them to an intermediate format, the import tool should still be used to copy the old installation to the new location. The one special case that requires a conversion is when the EC 6.4 installation uses the non-standard ppdup_format value "hashlog". The migration tool will recognize this case when copying an EC 6.4 installation, and automatically perform the conversion.

Storage overview

The crawler storage that is converted as part of the migration process is described in some detail in the following sections.

Document storage

In a typical crawler installation, all downloaded documents are stored on disk, and the content is retained even after having been sent to the search engine. The two datastore formats supported in an EC 6.3 installation are flatfile and bstore. In the flatfile method each downloaded URI is stored directly in the file system, with a base64 encoding of the URI used as the filename. While this can be expensive in terms of disk usage, it does allow obsolete documents to be deleted immediately. The alternative, bstore (block file storage), is more efficient in file system usage, writing documents into internally indexed 16MB files, though these also require additional processing overhead in terms of compaction.

Either bstore or flatfile may be specified when starting the import tool, allowing an installation to transition from one format to another. If a new storage format is not specified, the setting in the configuration database will be retained. Note that if your old data storage is in flatfile format, the migration process will be slower than migrating a bstore data storage; it is therefore suggested that you specify bstore format for the new crawler store. In either case, the number of clusters (that is, the number of subdirectories across which the data is spread, eight by default) cannot be modified during migration.

The original document storage will not be touched by the export operation. During the import operation, documents are read from the old storage and stored at their new location in the new format, one by one. Again, the original version of the storage will not be modified.

Path: $FASTSEARCH/data/crawler/store/<collection>/data/<cluster>

Meta database

The meta databases contain meta information about all URIs visited by the crawler.
Typical meta information for a URI includes document storage location, content type, outgoing links, document checksum, referrers, HTTP headers, and/or last modified date. The metadata store holds all the information the crawler has about the URIs it is, or has been, crawling. This information is organized into multiple databases that store the URIs and details about them, primarily the metadata database and the postprocess database. If a given URI has been successfully crawled, the metadata will also contain a reference to the document storage, a separate area where the actual contents of the downloaded page are kept on disk, available to be fed to the search engine.

The job of the migration tool is to transfer, and if necessary update or convert, the crawler store from an earlier crawler version so that it can be used by EC 6.7. During the export operation, all the meta databases will be dumped to an intermediary format and placed in the same directory as the original databases. The dumped versions of the databases are given the same filenames as the original databases, with the suffix .dumped_nn, where nn runs from 0 to the total number of dump files. In the import operation, all the dumped meta databases are loaded and stored, one by one, in the new database format. Optionally, each dump file can be deleted from the disk after processing.

Path: $FASTSEARCH/data/crawler/store/<collection>/db/<cluster>/

Postprocess Database

The postprocess (PP) databases contain a limited amount of metadata for each unique checksum produced by the postprocess process. For each item stored, it contains the checksum, the owner URI, duplicate URIs, and redirect URIs. During the import operation, all the EC 6.3 postprocess databases are copied to the new postprocess database directory.

Path: $FASTSEARCH/data/crawler/store/<collection>/PP/csum/

Duplicate server database

Duplicate servers are only used in multiple node crawler setups. They are used to perform duplicate detection across crawler nodes. The duplicate server database format is unchanged between versions EC 6.3 and EC 6.7. In the import operation, each database file is copied to the new duplicate server storage location.

Path: $FASTSEARCH/data/crawler/ppdup/

Configuration and Routing Databases

The configuration database contains all the crawler options set in a collection specification. The difference between EC 6.3 and EC 6.4-6.7 is the removal of several obsolete options, listed in the table below:

Table 5: Obsolete Database Options

Option                            Type      Comment
starturis (database cache size)   integer   Start URI database removed for 6.4 and later versions
starturis_usedb                   boolean   Start URI database removed for 6.4 and later versions
Compressdbs                       boolean   All databases compressed in 6.4 and later versions

When the configuration database is imported, the database will be read and all valid options will be used. Two potential modifications may also be made: if a proxy definition exists, it will be converted from a string to a list element (as EC 6.6 and 6.7 support multiple proxies), and if the data storage format is changed (via an import tool command line option) that configuration setting is updated.
Path: $FASTSEARCH/data/crawler/config/ (for crawler nodes)
Path: $FASTSEARCH/data/crawler/config_um/ (for ubermasters)

On an ubermaster, the routing database is migrated in addition to the configuration database. The routing database is not important in a single node installation; in a multiple node installation, however, it defines which crawler node each site is assigned to.

Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3). In previous ESP releases these two rule types were evaluated with an AND operator, meaning that a URI had to match both rule types (when both were defined). As of ESP 5.1 and later (EC 6.6 and up), rule processing uses an OR operator, meaning a URI now only needs to match one of the rule types. For example, an older configuration for fetching pages with the prefix http://www.example.com/public would specify two rules:

• include_domains: www.example.com (exact)
• include_uris: http://www.example.com/public/ (prefix)

The first rule is no longer needed, and if not removed it would allow any URI from that host to be fetched, not only those under the /public path. Note that more complex rules may require further adjustment. Existing crawler configurations migrating from EC 6.3 must be manually updated, by removing the hostname include rules that overlap with URI include rules.

Note: Be sure to manually copy any text or XML configuration files that might be used in the installation and that are located outside of the data/config file system, such as a start_uri_files listing.

CrawlerGlobalDefaults.xml file considerations

If you are using a CrawlerGlobalDefaults.xml file in your configuration, note that some options have been restructured from separate options in EC 6.3 into values within options, and possibly renamed, in EC 6.7. There are no such changes between EC 6.6 and 6.7.

The following tables list options that have been restructured for the EC 6.7 CrawlerGlobalDefaults.xml file, as well as how they were identified in the EC 6.3 version of this file. You will need to manually edit the CrawlerGlobalDefaults.xml file to use the new options.

Table 6: Domain Name System (DNS) Options Changes

The EC 6.7 option is dns. Refer to CrawlerGlobalDefaults.xml options on page 110 for detailed dns descriptions. Its valid values correspond to the following EC 6.3 options:

EC 6.7 value    EC 6.3 Option
min_rate        MinDNSRate
max_rate        MaxDNSRate
max_retries     MaxDNSRetries
timeout         DNSTimeout
min_ttl         DNSMinTTL
db_cachesize    DNSCachesize

Table 7: Feeding Options Changes

The EC 6.7 option is feeding. Refer to CrawlerGlobalDefaults.xml options on page 110 for detailed feeding descriptions. Its valid values, for the FDS feeding options related to postprocess and its behavior when submitting data to DataSearch, correspond to the following EC 6.3 options:

EC 6.7 value     EC 6.3 Option
priority         DSPriority
feeder_threads   Not available
max_outstanding  DSMaxOutstanding
max_cb_timeout   DSMaxCBTimeout
max_batch_size   Not available
fs_threshold     Not available

The Migration Process

The migration process consists of running two separate programs: the export tool and the import tool. The export tool is run in the EC 6.3 environment and the import tool is run in the EC 6.7 environment. If migrating an EC 6.4 installation, only the import tool needs to be run, from the EC 6.7 environment. This procedure is not necessary when going from EC 6.6 to 6.7.

• The export tool dumps all of the EC 6.3 databases to an intermediate data format on disk. The files will be placed alongside the original databases, named with the suffix .dumped_nn.
• The import tool loads these dump files one by one, creates new databases and migrates the document storage, producing a new 6.7 crawler store. This also includes the document store.

Note: Ensure that you have sufficient disk space to migrate your crawler store. This process requires significant amounts of free disk space, both to hold the intermediate format and to write the new (6.7) formatted data. The migration tool does not remove the old crawler store, and the new crawler store consumes approximately the same amount of disk space as the old one. Note that changing formats, for example bstore to flatfile, may result in an increase in disk usage.

To migrate the crawler:

1. Stop all crawler processes. Crawler processes include ubermaster, crawler, ppdup, and postprocess.

$FASTSEARCH/bin/nctrl stop crawler

Make sure the FASTSEARCH environment variable points to the old ESP installation (the one being migrated from).

2. Back up the crawler store or, as a minimum, back up the configuration database and files.

3. In an EC 6.3 installation, start the export tool. This example uses a crawler node with a collection named CollectionName:

$FASTSEARCH/bin/crawlerdbexport -m export -d /home/fast/esp50/data/crawler/ -g CollectionName

Make sure the FASTSEARCH environment variable points to the old installation (the one being migrated from). Observe the log messages output by the export tool to monitor processing progress. If you are running FAST ESP, the log messages will also appear under Logs in the administrator interface. Specifying the "-l debug" option will give more detailed information, but is not necessary in most cases. If no error messages are displayed, the export operation was successful. Skip this step if you are migrating an EC 6.4 installation.

Note: It is recommended to always redirect the stdout and stderr output of this tool to a log file on disk. Additionally, on UNIX use either screen or nohup in order to prevent the tool from terminating if the session is disconnected.

4. Set the FASTSEARCH environment variable to correspond to the new ESP installation. Refer to the FAST ESP Operations Guide, Chapter 1 for information on setting up the correct environment. It is advisable to start the crawler briefly in the new ESP installation, to verify that it is operating correctly, and then shut it down to prepare for migration.

5. Create all crawler collections in the new ESP installation, but leave data sources set to None:
a) Select Create Collection from the Collection Overview screen and the Description screen is displayed.
b) Enter the Name of the collection to be migrated (matching the crawler collection name exactly), and optional text for a description. This restores the original collection specification in the administrator interface; when you start the crawler, the configuration will be loaded automatically and the crawl will continue.
c) Proceed through the remaining steps to create a collection. Refer to Creating a Basic Web Collection in the Configuration Guide for a detailed procedure.
d) Leave the Data Source set to None; the crawler will be added once the collection has been migrated below. Click submit.

6. Run the import tool:

$FASTSEARCH/bin/crawlerstoreimport -m import -d /home/fast/esp50/data/crawler/ -g CollectionName -t master -n /home/fast/esp51/data/crawler/

Make sure the FASTSEARCH environment variable points to the new ESP installation (the one being migrated to).
Observe the log messages output by the import tool to monitor processing progress. If you are running FAST ESP, the log messages will also appear under Logs in the administrator interface. If no error messages are displayed, the import operation was successful.

Note: In addition to migrating the crawler store, the import tool outputs statistics, sites, URIs and the collection to separate files in the directory $FASTSEARCH/data/crawler/migrationstats. Once more, it is recommended to always redirect the stdout and stderr output of this tool to a log file on disk. Additionally, on UNIX use either screen or nohup in order to prevent the tool from terminating if the session is disconnected.

7. Start the crawler from the administrator interface or the console.

8. Associate the crawler with the collection in the FAST ESP administrator interface:
a) Select Edit Collection from the Collection Overview screen and the Collection Details screen is displayed.
b) Select Edit Data Sources from the Collection Details screen and the Edit Collection screen is displayed.
c) When you identify the crawler as a Data Source, carefully read through the collection specification to make sure everything is correct. Click submit.

9. To get the migrated documents into the new FAST ESP installation, you must run postprocess refeed, which requires the crawler to be shut down. Stop the crawler:

$FASTSEARCH/bin/nctrl stop crawler

10. To start the refeed, enter the following command:

$FASTSEARCH/bin/postprocess -R CollectionName -d $FASTSEARCH/data/crawler/

11. When the feeding has finished, check the logs to make sure the refeed was successful. Restart the crawler:

$FASTSEARCH/bin/nctrl start crawler

12. Shut down the old FAST ESP installation.

Note: If this migration process is terminated during processing, you must repeat the entire procedure. That is, delete all dump files and the new data directory generated by the tool. To delete all dumps, you may run any of the tools with the deldumps option. When this cleanup has finished, you must manually delete the new data directory from the disk. Upon completion, repeat the entire migration procedure. Contact FAST Technical Support for assistance.

Note: If only a subset of collections is migrated to the new version, the undesired collections will still be listed in the new configuration database, even though no data or metadata for those collections has been transferred. These lingering references must be deleted with the command crawleradmin -d oldCollection prior to starting the crawler in the new ESP installation.

Chapter 3: Configuring the Enterprise Crawler

Topics:
• Configuration via the Administrator Interface (GUI)
• Configuration via XML Configuration Files
• Configuring Global Crawler Options via XML File
• Using Options
• Configuring a Multiple Node Crawler
• Large Scale XML Crawler Configuration

This chapter describes how to limit and guide the crawler in selecting web pages to fetch and index, and describes the alternatives for what to do once the refresh interval has completed. It also describes how to modify an existing web data source, and how to configure and tune a large scale distributed crawler.

Configuration via the Administrator Interface (GUI)

Crawl collections may be configured using the administrator graphical user interface (GUI) or by using an XML based file. The administrator interface includes access to most of the crawler options.
However, some options are only available using XML.

Modifying an existing crawl via the administrator interface

Complete this procedure to modify an existing crawl collection.

1. From the Collection Overview screen, click the Edit button for the collection you want to modify. The following example selects collection1.

Note: The + Add Document function is not directly connected to the crawler, but rather attempts to add the specified document directly to the index. This may cause problems if the document already exists in the index and the crawler has found one or more duplicates of this document. In this case the submitted document may appear as a duplicate in the index, because the crawler is not involved in adding the document and duplicate detection is therefore not performed.

2. Click the Edit Data Sources button in the Control Panel and the following screen is displayed:

3. Click the Edit icon and the Edit Data Source Setup screen is displayed:

4. Work through the basic and advanced options, making modifications as necessary.
a) To add information, highlight or type information into the text box on the left, then click the add button; the selection is added to the text box on the right.
b) To remove information, highlight the information in the text box on the right, then click the remove button; the selected text is removed.

5. Click Submit and the Edit Collection collection1 Action screen is displayed. The modified data source crawler is now installed.

6. Click ok and you are returned to the Edit Collection collection1 Configuration screen. The configuration is now complete. This screen lists the name, description, pipeline, index, and data source information you have configured for collection1.

7. Click ok and you are returned to the Collection Overview screen.

Basic Collection Specific Options

The following table discusses the Basic collection specific options.

Table 8: Basic collection specific options

Start URIs
Enter start URIs in the Start URIs box. There is also a Start URI files option, which if specified must be an absolute path to a file existing on the crawler node. The format of the file is a text file containing URIs, separated by newlines. These options define a set of URIs from which to start crawling. At least one URI must be defined before any crawling can start. If the URI points to the root of a web site, make sure the URI ends with a slash (/). As URIs are added, exact hostname include filters are automatically generated and added to the list of allowed hosts in the Hostname include filters field. For example, if adding the URI http://www.example.com/ then all documents from the web site at www.example.com will be crawled.
Note: If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format.

Hostname include filters
Specify filters in the Hostname include filters field to specify the hostnames (web sites) to include in the crawl. Possible filter types are:
Exact - matches the identical hostname
File - identifies a local (to the crawler host) file containing include and/or exclude rules for the configuration. Note that in a multiple node configuration, the file must be present on all crawler hosts, in the same location.
IPmask - matches IP addresses of hostnames against specified dotted-quad or CIDR expression.
Prefix - matches the given hostname prefix (for example, "www" matches "www.example.com") Regexp - matches the given hostname against the specified regular expression in order from left to right Suffix - matches the given hostname suffix (for example, "com" matches "www.example.com") This option specifies which hostnames (web sites) to be crawled. When a new web site is found during a crawl, its hostname is checked against this set of rules. If it matches, the web site is crawled. If no hostname or URI include filters (see below) are specified then all web sites are allowed unless explicitly excluded (see below). If rules are specified, a hostname must match at least one of these filters in order to be crawled. For better crawler performance, use prefix, suffix or exact rules when possible instead of regular expressions. Note: If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format. Hostname exclude filters Specify filters in the Hostname exclude filters field to exclude a specific set of hostnames (web sites) from the crawl. The possible filter types are the same as for the Hostname include filters. This option specifies hosts you do not want to be crawled. If a hostname matches a filter in this list, the web site will not be crawled. If no setting is given, no sites are excluded. For better crawler performance, use prefix, suffix or exact rules when possible instead of regular expressions. Note: If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format. Request rate Select one of the options in the Request rate drop-down menu; then select the rate in seconds. This option specifies how often (the delay between each request) the crawler should access a single web site when crawling. Default: 60 seconds Note: FAST license terms do not allow a more frequent request rate setting than 60 seconds for external sites unless an agreement exists between the customer and the external site. 34 Configuring the Enterprise Crawler Option Refresh interval Description Specify the interval at which a single web site is scheduled for re-crawling in the Refresh interval field. The crawler retrieves documents from web servers. Since documents on web servers frequently change, are added or removed, the crawler must periodically crawl a site over again to reflect this. In the default crawler configuration, this refresh interval is one day (1440 minutes), meaning that the crawler will start over crawling a site every 24 hours. Since characteristics of web sites may differ, and customers may want to handle changes differently, the action performed at the time of refresh is also configurable, via the Refresh modesetting. Default: 1440 minutes Advanced Collection Specific Options This table describes the options in the advanced section of the administrator interface. Table 9: Overall Advanced Collection Specific Options Option URI include filters Description This option specifies rules on which URIs may be crawled. Leave this setting empty in order to allow all URIs, unless those excluded by other filters. The possible filter types are the same as for the Hostname include filters. The URI include filters field and the URI exclude filters field examine the complete URI (http://www.example.com/path.html) so the filter must include everything in the URI, and not just the path. An empty list of include filters will allow any URI, as long as it is allowed by the hostname include/exclude rules. 
For better crawler performance, use prefix, suffix or exact rules when possible instead of regular expressions. Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3). In previous ESP releases these two rule types were evaluated as an AND, meaning that a URI has to match both rules (when rules are defined). As of ESP 5.1 and later (EC 6.6 and up), the rules processing has changed to an OR operator, meaning a URI now only needs to match one of the two rule types. Existing crawler configurations migrating from EC 6.3 must be manually updated, by removing or adjusting the hostname include rules that overlap with URI include rules. If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format. URI exclude filters This option specifies which URIs you do not want to be crawled. If a URI matches one listed in the set, it will not be crawled. The possible filter types are the same as for the Hostname include filters. Note: If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format. Allowed schemes This option specifies which URI protocols (schemes) the crawler should follow. Select the protocol(s) you want to use from the drop-down menu. Valid schemes: http, https, ftp and multimedia formats MMS, RTSP. 35 FAST Enterprise Crawler Option Description Note: MMS and RTSP for multimedia crawl is supported via MM proxy. Default: http, https MIME types This option specifies which MIME types will be downloaded from a site. If the document header MIME type is different than specified here, then the document is not downloaded. Select a MIME type you want to download from the drop-down menu. You can manually enter additional MIME types directly as well. That the crawler supports wildcard expansion of an entire field (only), for example */example, text/* or */*, but not appl/ms* is allowed. No other regular expressions are supported. Note: When adding additional MIME types beyond the two default types make sure the corresponding file name extensions are not listed in the Extension excludes list. Default: text/html, text/plain MIME types to search for links This option specifies MIME types of documents that the crawler should attempt to extract links from. If not already listed in the default list, type in a MIME type you want to search for links. This option differs from the MIME types option in that the MIME types to search for links denotes which documents should be inspected for links to crawl further, whereas the latter indicates all formats the crawler should retrieve. In effect, MIME types is always a superset of MIME types to search for links. Note: Wildcard on type and subtype is allowed. For instance, text/* or */html are valid. No other regular expressions are supported. Furthermore, the link extraction within the crawler only works on textual markup documents, hence you should not specify any binary document formats. Default: text/vnd.wap.wml, text/wml, text/x-wap.wml, x-application/wml, text/html, text/x-hdml Extension excludes This option specifies a list of suffixes (file name extensions) to be excluded from the crawl. The extensions are suffix string matched with the path of the URIs, after first stripping away any query portion. URIs that match the indicated file extensions will not be crawled. If not already listed in the default list, type in the link extensions you want to be excluded from the crawl. 
This option is commonly used to avoid unnecessary bandwidth usage through the early exclusion of unwanted content, such as images. Default: .jpg, .jpeg, .ico, .tif, .png, .bmp, .gif, .wmf, .avi, .mpg, .wmv, .wma, .ram, .asx, .asf, .mp3, .wav, .ogg, .zip, .gz, .vmarc, .z, .tar, .iso, .img, .rpm, .cab, .rar, .ace, .swf, .exe, .java, .jar, .prz, .wrl, .midr, .css, .ps, .ttf, .mso URI rewrite rules This option specifies rewrite rules that allows the crawler to rewrite special URI patterns. A rewrite rule is a grouped match regular expression and an expression that denotes how the matched pattern should be rewritten. The rewrite expression can have references to numbered groups in the match regexp, using regexp repetition. URI rewrites are applied as the URIs are parsed out from the documents during crawling. No rewriting occurs during re-feeding. If you add URI rewrites after you have crawled the URIs you wanted to rewrite, you will have to wait X (dbswitch) refresh cycles before they are fully removed from the index (they are not crawled anymore). The rewritten ones are added in place as they are crawled. In other words, there will be a time period in which both the rewritten and the not-rewritten URIs may be in the index. Running postprocess refeed will not help, however you may manually delete the URIs using the crawleradmin tool. 36 Configuring the Enterprise Crawler Option Description Since URIs are rewritten as they are parsed out of the documents, adding new URI rewrites would in some cases seem to not take immediate effect. The reason for this is that if the crawler already has a work queue full of URIs that are not rewritten, it must empty the work queue before it can begin to crawl the URIs affected by the rewrite rules. The format is: <separator><matched pattern><separator><replacement string><separator> The separator can be any single non-whitespace character, but it is important that the separator is selected so that it does not occur in either the matched pattern or the replacement string. The separator is given explicit as the first character of each rule. This example is useful if a website generates new session IDs each time it is crawled (resulting in an infinite number of links over time), but that pages will be displayed correctly without this session ID. @(.*[&?])session_id=.*?(&|$)(.*)@\1\3@ The @-character is the separator. Considering the URI http://example.com/dynamic.php?par1=val1&session_id=123456789&par2=val2 the rewrite rule above would rewrite the URI to http://example.com/dynamic.php?par1=val1&par2=val2 Default: empty Start URI files This option specifies a list of one or more Start URI files for the collection to be configured. Each file must be an absolute path/filename on the crawler node. A Start URI file is specified as the absolute path to a text file (for example, C:\starturifile.txt). The format of the files is one URI per line. All entries in the start URI files must match the Hostname include filters or the URI include filters or they will not be crawled. Default: empty Mirror site files Map file of primary/secondary servers for a site. This parameter is a list of mirror site files for the specified web site (hostname). The file format is a plain text, whitespace-separated list of hostnames, with the preferred (primary) hostname listed first. Format example: www.fast.no fast.no www.example.com example.com mirror.example.com Note: In a multiple node configuration, the file must be available on all masters. 
Default: empty Extra HTTP Headers This option specifies any additional headers to send to each HTTP GET request to identify the crawler. When crawling public sites not owned by the FAST ESP customer, the HTTP header must include a User-Agent string which must contain information that can identify the customer, as well as basic contact information (email or web address). Format: <header field>:<header value> 37 FAST Enterprise Crawler Option Description Specifying an invalid value may prevent documents from being crawled and prevent you from viewing/editing your configuration. The recommended User-Agent string when crawling public web content is <Customer name> Crawler (email address / WWW address). User agent information (company and E-mail) suitable for intranet crawling is by default added during installation of FAST ESP. Default: User-Agent: FAST Enterprise Crawler 6 used by <example.com> ([email protected]) Refresh mode These refresh modes determine the actions taken by the crawler when a refresh occurs. Although no refreshes occur when the crawler is stopped, the time spent is still taken into consideration when calculating the time of the next refresh. Thus, if the refresh period is set to two days and the crawler is stopped after one day and restarted the next day, it will then refresh immediately since two days have elapsed. Refresh is on a per site (single hostname) basis. Even though Start URIs are fed at a specific (refresh) interval by the master, each site keeps a record of the last time it was refreshed. Since sites are scheduled randomly based on available resources/URIs, the site refreshes quickly get desynchronized with the master Start URI feeding interval. Refresh modes other than Scratch and Adaptive do not erase any existing queues at the time of refresh. If the site(s) being crawled generate an infinite amount of URIs, or the crawl is very loosely restricted, this may lead to the crawler work queues growing infinitely. Valid modes: • • • • • Append - the Start-URIs are added to the end of the crawler work queue at the start of every refresh. If there are URIs in the queue, Start URIs are appended and will not be crawled until those before them in the queue have been crawled. Prepend - the Start URIs are added to the beginning of the crawler work queue at every refresh. However, URIs extracted from the documents downloaded from the Start URIs will still be appended at the end of the queue. Scratch - the work queue is truncated at every refresh before the Start URIs are appended. This mode discards all outstanding work on each refresh event. It is useful when crawling sites with dynamic content that produce an infinite number of links. This is useful when sites generate an infinite number of links, as sometimes seen for sites with dynamic content. Soft - if the work queue is not empty at the end of a refresh period, the crawler will continue crawling into the next refresh period. A server will not be refreshed until the work queue is empty. This mode allows the crawler to ignore the refresh event for a site if it is not idle. This allows large sites to be crawled in conjunction with smaller sites, and the smaller sites can be refreshed more often than the larger sites. Adaptive - build work queue according to scoring of URIs and limits set by adaptive section parameters. 
Default: Scratch Automatically refresh when This option allows you to specify whether the crawler automatically should trigger a new idle refresh cycle when the crawler goes idle (all websites are finished crawling) in the current refresh cycle. Select Yesto automatically refresh when idle. Select No to wait the entire refresh cycle length. Default: No Note: This option cannot be used with a multi node crawler. 38 Configuring the Enterprise Crawler Option Max concurrent sites Description This option allows you to limit the maximum number of sites being crawled concurrently. The value of this option, together with the request rate, controls the aggregated crawl-rate for your collection. A request rate of 1 document every 60 seconds, crawling 128 sites concurrently yields a theoretical crawl-rate of about 2 (128/60) documents per second. This option also impacts CPU usage and memory consumption; the more sites crawled concurrently, the more CPU and memory will be used. It is recommended that values higher than 2048 is used cautiously. In a distributed setup, the value applies per crawler node. Default: 128 Max document count per site This option sets the maximum amount of documents (web pages) to download from a web site per refresh cycle. When this limit is reached any remaining URIs queued for the site will be erased, and crawling of the site will go idle. Note: The option only restricts the per-cycle count of documents, not the number of unique documents across cycles. Therefore it's possible for a web site to exceed this number in stored documents if the documents found each cycle changes. Over time, the excess documents will however be removed by the document expiry functionality (DB switch interval setting). Default: 1000000 Max document size This option sets the maximum size of a single document retrieved from any site in a collection. If this limit is exceeded, the remaining documents are discarded or truncated to the indicated maximum size (see the Discard or truncate option). If you have large documents (for example, PDF files) on your site, and want to index complete documents, make sure that this option is set high enough to handle the largest documents in the collection. Default: 5000000 bytes Discard or truncate This option discards or truncates documents exceeding the maximum document count size determined in the previous entry. It is not recommended to use the truncate option except for text document collections. Valid values: Discard, Truncate Default: Discard Checksum cut-off When crawling multimedia content through a multimedia proxy (schemes MMS or RTSP), use this setting to adjust how the crawler determines whether a document has been modified or not. Rather than downloading the entire document, only the number of bytes specified in this setting will be transferred and the checksum calculated on that initial portion of the file. Only if the checksum of the initial bytes have changed is the entire document downloaded. This saves bandwidth after the initial crawl cycle, and reduces the load on other system and network resources as well. A setting of 0 will disable this feature (checksum always computed on entire document). Default: 0 Fetch timeout This setting specifies the time, in seconds, that the download of a single document is allowed to spend, before being aborted. Set this value higher if you expect to download large documents from slow servers, and you observe high average download times in the crawler statistics reported by the crawleradmin tool. 
39 FAST Enterprise Crawler Option Description Valid values: Positive integer Default: 300 seconds Obey robots.txt A robots.txt file is a standardized way for web sites to direct a crawler (for example, to not crawl certain paths or pages on the site). If the file exists it must be located on the root of the web site, e.g. http://www.example.com/robots.txt, and contain a set of Allow/Disallow directives. This setting specifies whether or not the crawler should follow the directives in robots.txt files when found. If you do not control the site(s) being crawled, it is recommended that you use the default setting and obey these files. Select Yes to obey robots.txt directives. Select No to ignore robots.txt directives. Default: Yes Check meta robots A meta robots tag is a standardized way for web authors and administrators to direct the crawler not to follow links or to save content from a particular page; it indicates whether or not to follow the directives in the meta-robots tag (noindex or nofollow). This option allows you to specify whether or not the crawler should follow such rules. If you do not control the site(s) being crawled, it is recommended that you use the default setting. Select Yes to obey meta robots tags. Select No to ignore meta robots tags. Example (HTML): <META name="robots" content="noindex,nofollow"> Default: Yes Ignore robots on timeout Before crawling a site, the crawler will attempt to retrieve a robots.txt file from the server that describes areas of the site that should not be crawled. If you do not control the site being crawled, it is recommended that you use the default setting. This option specifies what action the crawler should take in the event that it is unable to retrieve the robots.txt file due to a timeout, unexpected HTTP error code (other than 404) or similar. If set to ignore then the crawler will proceed to crawl the site as if no robots.txt exists, otherwise the web site in question will not be crawled. Select Yesto obey robots on timeout. Select No to ignore robots on timeout. Default: No Ignore robots auth sites This option allows you to control whether the crawler should crawl sites returning 401/403 Authorization Required for their robots.txt from the crawl. The robots.txt standard lists this behavior as a hint for a crawler to ignore the web site altogether. However, incorrect configuration of web servers is widespread and can lead to a site being erroneously excluded from the crawl. Enabling this option makes the crawler ignore such indications and crawl the site anyway. Default: Yes 40 Configuring the Enterprise Crawler Option Obey robots.txt crawl delay Description This parameter indicates whether or not to follow the Crawl-delay directive in robots.txt files. In a site's robots.txt file, this non-standard directive may be specified (e.g. Crawl-Delay: 120, where the numerical value is the number of seconds to delay between page requests. If this setting is enabled, this value will override the collection-wide request rate (delay) setting for this web site. Default: No Robots refresh interval This option allows you to specify how often (in seconds) the crawler will re-download the robots.txt file from sites, in order to check if it has changed. Note that the robots.txt file may be retrieved less often if the site is not crawling continuously. the refresh interval of robots.txt files.The time period is on a per site basis and after it expires the robots.txt file will be downloaded again and the rules will be updated. 
Reduce this setting to pick up robots changes more quickly, at the expense of network bandwidth and additional web server requests. Default: 86400 seconds (24 hours) Robots timeout Before crawling a site, the crawler will attempt to retrieve a robots.txt file from the server that describes areas of the site that should not be crawled. This option allows you to specify the timeout to apply when attempting to retrieve robots.txt files. Set this value high if you expect to have comparably slow interactions requesting robots.txt. Default: 300 Near duplicate detection This option indicates whether or not to use the near duplicate detection scheme. This option can be specified per sub collection. Refer to Configuring Near Duplicate Detection on page 125 for more information. Default: No (disabled) Perform duplicate detection? This parameter indicates whether document duplicate detection should be enabled or not. Default: Yes Use HTTP/1.1 This option allows you to specify whether you want to use HTTP 1.1 or HTTP 1.0 requests when retrieving web data. HTTP/1.1 is required for the crawler to accept compressed documents from the server (the Accept Compression option) and enable ETag support (Send If-Modified-Since option). Select Yes to crawl using HTTP/1.1. Select No to crawl using HTTP/1.0. When using cookie authentication there may be instances where HTTP/1.1 is not supported and you should select No. Default: Yes Send If-Modified-Since If-Modified-Since headers allow bandwidth to be saved, as the web server only will send a document if the document has changed since the last time the crawler retrieved it. Also, if web servers report an ETag associated with a document, the crawler will set the If-None-Match header when this setting and HTTP/1.1 is enabled. Select Yes to send If-Modified-Since headers. Select No to not send If-Modified-Since headers. Web servers may give incorrect information whether or not a document has been modified since the last time the crawler retrieved it. In those instances select No to allow the crawler 41 FAST Enterprise Crawler Option Description to decide whether or not the document has been modified instead of the web server, at the expense of increased bandwidth usage. Default: Yes Accept compression Specify whether the crawler should use the Accept-Encoding header, thus accepting that the documents are compressed at the web server before returned to the crawler. This may save bandwidth. This option only applies if HTTP/1.1 is in use. Select Yes to accept compressed content. Select No to not accept compressed content. Default: Yes Send/receive cookies This feature enables limited support for cookies in the crawler, which might enable crawling cookie-based sessions for a site. Some limitations apply, mainly that cookies will only be visible across web sites handled within the same uberslave process. Note: Note that this feature is unrelated to cookie support as described in the Form Based Login on page 57 section. Select Yes to enable cookie support. Select No to disable cookie support. Default: No Extract links from duplicates Even though two documents have duplicate content, they may have different links. The reason for this is that all markup, including links, is stripped from the document prior to generating a checksums for use by the duplicate detection algorithm. This option lets you specify whether or not you want the crawler to extract links from documents detected as duplicates. 
If enabled, you may get an increased amount of duplicate links in the URI-queues. If duplicate documents contain duplicate links then you can disable this parameter. Note: Even though duplicate URIs exist on the work queues, a single URI is only downloaded once each refresh cycle. Select Yes to extract links from documents that are duplicates. Select No to not extract links from documents that are duplicates. Default: No Macromedia Flash support Select Yes to enable retrieval of Adobe Flash files, and limited link extraction within these. The flash files are indexed as separate files within the searchable index. Select No to disable Adobe Flash support. You may also want to enable JavaScript support, as many web servers only provide Flash content to clients that support JavaScript. Note: Flash processing is resource intensive and should not be enabled for large crawls. Note: Processing Macromedia Flash files requires an available Browser Engine. Please refer to the FAST ESP Browser Engine Guide for more information. Default: No 42 Configuring the Enterprise Crawler Option Sitemap support Description Enabling this option allows the crawler to detect and parse sitemaps. The crawler support sitemap and sitemap index files as defined by the specification at http://www.sitemaps.org/protocol.php. The crawler uses the 'lastmod' attribute in a sitemap to see if a page has been modified since the last time the sitemap was retrieved. Pages that have not been modified will not be recrawled. An exception to this is if the collection uses adaptive refresh mode. In adaptive refresh mode the crawler will use the 'priority' and 'changefreq' attributes of a sitemap in order to determine how often a page should be crawled. For more information see Adaptive Crawlmode on page 49. Custom tags found in sitemaps are stored in the crawlers meta database and can be submitted to document processing. Note: Most sitemaps are specified in robots.txt. Thus, 'obey robots.txt' should be enabled in order to get the best result. Default: No JavaScript support Select Yes to enable JavaScript support. The crawler will execute JavaScripts embedded within HTML documents, as well as retrieve and execute external JavaScripts. Select No to disable JavaScript support. Note: JavaScript processing is resource intensive and should not be enabled for large crawls. Note: Processing JavaScript requires an available Browser Engine. Please refer to the FAST ESP Browser Engine Guide for more information. Default: No JavaScript keep original HTML Specify whether to submit the original HTML document, or the HTML resulting from the JavaScript parsing, to document processing for indexing. When parsing a HTML document the Browser Engine executes all inlined and external JavaScripts, and thereby all document.write() statements, and includes these in its HTML output. By default it is this resulting document that is indexed. However it is possible to use this feature to limit the Browser Engine to link extraction only. This option has no effect if JavaScript crawling is not enabled Default: No JavaScript request rate Specify the request rate (delay) in seconds to be used when retrieving external JavaScripts referenced from a HTML document. By default this rate is the same as the normal request rate, but it may be set lower to speed up crawling of documents containing JavaScripts. To specify the default value leave the option blank. 
This option has no effect if JavaScript crawling is not enabled Default: Empty FTP passive mode This option determines if the FTP server (active) or the crawler (passive) should set up the data connection between the crawler and the server. Passive mode is recommended, and is required when crawling FTP content from behind a firewall. Select Yes to crawl ftp sites in passive mode. Select No to crawl ftp sites in active mode. Default: Yes 43 FAST Enterprise Crawler Option FTP search for links Description This option determines whether or not the crawler should run documents retrieved from an FTP server through the link parser to extract any links contained. Select Yes to search FTP documents for links. Select No to not search FTP documents for links. Default: Yes Include meta in csum The crawler differentiates between content and META tags when detecting duplicates and detecting whether a document has been changed. Select Yes and the crawler will detect changes in META tags in addition to the document content. This means that only documents with identical content and META tags are treated as duplicates. Select No and the crawler will detect changes in content only. This means that documents with the same content is treated as duplicates even if the META tags are different. Default: No Sort URI query params Example: If http://example.com/?a=1&b=2 is really the same URI as http://example.com/?b=2&a=1, then the URIs will be rewritten to be the same when this option is enabled. If not, the two URIs most likely will be screened as duplicates. The problem arises if the two URIs are crawled at different times, and the page has changed during the time of which the first one was crawled. In this case you can end up with both URIs in the index. Select Yes to enable sorting of URI query parameters. Select No to disable sorting of URI query parameters. Default: No Enforce request rate per IP This option allows you to control whether the crawler should enforce the request rate on a per IP basis. If enabled, a maximum of 10 sites sharing the same IP will be crawled in parallel. Additionally, at most Max pending requests will be issued to this IP in parallel. This prevents overloading the server(s) that host these sites. If disabled, sites sharing the same IP will be treated as unique sites, each hit with the configured request rate. Default: Yes Enforce MIME type detection This option allows you to decide whether or not the crawler should run its own MIME type detection on documents. In most cases web servers return the MIME type of documents when they are downloaded, as part of the HTTP header. If this option is enabled, documents will get tagged with the MIME type that looks most accurate; either the one received from the web server or the result of the crawlers determination. Default: No (disabled) Send logs to Ubermaster If enabled (as by default), all logging is sent to the ubermaster host for storage, as well as stored locally. In large multiple node configurations it can be disabled to reduce inter-node communications, reducing resource utilization, at the expense of having to check log files on individual masters. Default: Yes (enabled) Note: This option only applies to multi node crawlers 44 Configuring the Enterprise Crawler Option META refresh is redirect Description This option allows you to specify whether the crawler should treat META refresh URIs as HTTP redirects. Use together with META refresh threshold option which lets you specify the upper threshold of this option. 
Default: Yes META refresh threshold This option allows you to specify the upper limit on the refresh time for which a META refresh URI is considered a redirect (The META refresh is redirect option must be enabled.) Example: Setting this option to 3 will make the crawler treat every META refresh URI with a refresh of 3 seconds or lower as a redirect URI. Default: 3 seconds DB switch interval Specify the number of cycles a document is allowed to exist without having been seen by the crawler, before expiring. When a document expires, the action taken is determined by the DB switch delete setting. The age of a document is not affected by force re-fetches; only cycles where the crawler refreshes normally (by itself) increases the document's age if not found. This mechanism is used by the crawler to be able to purge documents that are no longer linked to from the index. It is not used to detect dead links such as documents returning an error code, e.g. 404. This check is performed at the beginning of each refresh cycle individually for each site. A similar check is performed for sites that have not been seen at the start of each collection level refresh Valid values: Positive integer Default: 5 Note: Setting this value very low, e.g 1-2, combined with a DB switch delete setting of Yes can result in documents being incorrectly identified as expired and deleted very suddenly. DB switch delete The crawler will at regular intervals perform an update of its internal database of retrieved documents, to detect documents that the crawler has not seen for DB switch interval number of refresh cycles. This option determines what to do with these documents; they can either be deleted right away or put in the work queue for a retrieval attempt to make certain they are actually removed from the web server. Select Yes to delete documents immediately. Select No to verify that the documents no longer exist on the web server before deleting them. Default: No Note: This setting should normally be left at the default setting of No in order to avoid situations where the crawler may incorrectly believe that a set of documents have been deleted and immediately deletes them from the crawler store and index Workqueue filter When this feature is enabled, the crawler will associate a Bloom filter with its work queues, thereby reducing the degree of duplicates that go onto the queue. This way the queues will grow more slowly and therefore use less disk I/O and space, plus save memory since Bloom filters are very memory efficient. The drawback with Bloom filters is that there is a very low probability of false positives, which means that there is a theoretical chance may lose some URIs that would be crawled if work queue filters were disabled. Disable this feature if this risk is a problem and added disk overhead is not problematic. Select Yes to enable use of Bloom filters with work queues. 45 FAST Enterprise Crawler Option Description Select No to disable use of Bloom filters with work queues. Default: Yes Master/Ubermaster filter This parameter enables a Bloom filter to screen links transferred between masters and the ubermaster. The value is the size of the filter, specifying the number of bits allocated, which should be between 10 and 20 times the number of URIs to be screened. Note that enabling this setting with a positive integer value disables the crosslinks cache. It is recommended that you turn on this filter for large crawls; recommended value is 500000000 (500 megabit). 
Default: 0 (disabled) Master/Slave filter When this feature is enabled, the crawler slave processes use a Bloom filter in the communication channel with the master process, which reduces Inter Process Communication (IPC) and memory overhead. The drawback with Bloom filters is that there is a non-zero chance of false positives, which may cause URIs to be lost by the crawler. Use this feature if this risk is not a concern and there is CPU and memory contention on the crawler nodes. It is recommended that you turn on this filter for large crawls; the recommended value is 50000000 (50 megabit). Valid values: Zero or positive integer Default: 0 (filter is disabled) Max docs before interleaving The crawler will by default crawl a site to exhaustion. However, the crawler can be configured to crawl "batches" of documents from sites at a time, thereby interleaving between sites. This option allows you to specify how many documents you want to be crawled from a server consecutively before the crawler interleaves and starts crawling other servers. The crawler will then return to crawling the former server as resources free up. Valid values: No value (empty) or positive integer Default: empty (disabled) Note: Since this feature will stop crawling web sites without fully emptying their work queues on disk first, it may lead to excessive amounts of work queue directories/files on large scale crawls. This can impact crawler performance if the underlying file system is not able to handle it properly. Max referrer links This option specifies the maximum number of referrer levels the crawler will track for each URI. As this feature is quite performance intensive, the setting should no longer be used; instead, the Web Analyzer should be queried to extract this information. It is recommended that you contact FAST Solution Services if you still decide to modify the default setting. Valid values: Positive integer Default: 0 Max pending requests Specify the maximum number of concurrent (outstanding) HTTP requests to a single site at any given time. The crawler may make overlapping requests to a site, and this setting determines the maximum degree of this overlapping. If you do not control the site(s) being crawled, it is recommended that you use the default setting. Keep in mind that regardless of this setting the crawler will not issue requests to a single web site more often than specified by the Request rate setting. Valid values: Positive integer Default: 2 Max pending proxy-requests Proxy open connection limit. This parameter specifies a limit on the number of outstanding open connections per HTTP proxy, per uberslave process in the configuration. Valid values: Positive integer Default: 2147483647 Max redirects This option allows you to specify the maximum number of redirects that should be followed from a URI. Example: http://example.com/path redirecting to http://example.com/path2 will be counted as 1. Default: 10 Max URI recursion This option allows you to specify the maximum number of times a repeating pattern is allowed to be appended to a URI's successors. Example: http://example.com/path/ linking to http://example.com/path/path/ will be counted as 1. A value of 0 disables the test. Default: 5 Max backoff count/delay Together these options control the adaptive algorithm by which a site experiencing connection failures (for example, network errors, timeouts, HTTP 503 "Server Unavailable" errors) is contacted less frequently.
For each consecutive instance of these errors, the inter-request delay for that site is incremented by the initial delay setting (the Request rate setting): increased delay = current delay + initial delay. The maximum delay for a site will be the Max backoff delay setting. If the number of failures reaches Max backoff count, crawling of the site will become idle. Should the network issues affecting the site be resolved, the internal backoff counter will start decreasing, and the inter-request delay is halved on each successful document fetch: decreased delay = current delay / 2. This continues until the original delay (the Request rate setting) is reached. Default: Max backoff count = 50; Max backoff delay = 600
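The following minimal Python sketch illustrates the backoff calculation described for Max backoff count/delay above. It is an illustration only, not crawler source code; the function names are invented here, and the example assumes a 60 second request rate and the default Max backoff delay of 600 seconds.

def next_delay_after_failure(current_delay, initial_delay, max_backoff_delay):
    # Each consecutive connection failure adds one initial delay, capped at the maximum.
    return min(current_delay + initial_delay, max_backoff_delay)

def next_delay_after_success(current_delay, initial_delay):
    # Each successful fetch halves the delay until the original request rate is reached.
    return max(current_delay / 2.0, initial_delay)

delay = 60.0
for _ in range(12):                              # twelve consecutive failures
    delay = next_delay_after_failure(delay, 60.0, 600.0)
print(delay)                                     # 600.0, capped at Max backoff delay
print(next_delay_after_success(delay, 60.0))     # 300.0 after one successful fetch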
SSL key/certificate file This option sets the filename for the file containing your client SSL key and certificate. Type in a path and filename; the path and filename must be an absolute path on the crawler node. Example: /etc/ssl/key.pem Default: empty Note: This option is not necessary to specify in order to crawl HTTPS web sites. It is only required if the web site requires the crawler to identify itself using a client certificate. Document evaluator plugin Specify a user-written Python module to be used for processing fetched documents and (optionally) URI redirections. The value specifies the Python module, ending with your class name. The crawler splits on the last '.' separator and converts this to the Python equivalent "from <module> import <class>". The source file should be located relative to $PYTHONPATH, which for FDS installations corresponds to ${FASTSEARCH}/lib/python2.3/. Refer to Implementing a Crawler Document Plugin Module on page 118 for more information. Variable request rate This option allows you to specify specific time slots when you want to use a higher or lower request rate than the main setting. Time slots are specified as starting and ending points, and cannot overlap. Time slot start and end points are specified with day of week and time of day (optionally with minute resolution). Note that no two time slots can have the same delay values; each value must be unique, for example 2.0, 2.1, and so forth. You can also enter the value Suspend in the Delay field, which will suspend the crawler so that there is no crawling for the time span specified. Example time slots crawling at a 60 second delay during weekends and no crawling during weekdays: Time span: Fri:23.59-Sun:23.59, Delay: 60; Time span: Mon:00-Fri:23, Delay: Suspend. Note: Entering very long delays (above 600 seconds) is not recommended as it may cause problems with sites requiring authentication. To suspend crawling for a period always use the Suspend value. HTTP errors This option allows you to specify how the crawler handles various HTTP response codes and errors. It is recommended that you contact FAST Solution Services if you decide to modify the default setting. The following actions can be configured for each condition: KEEP - no action is taken, the document is not deleted. DELETE[:X] - the document is deleted if the error condition persists over X retries; X refers to the number of refresh cycles the same error condition must occur before the document is considered deleted. If X is unspecified or 0, the document is deleted immediately. RETRY[:X] - X refers to the number of retries within the same refresh cycle that should be attempted before giving up. A value of DELETE:3, RETRY:1 would thus attempt to fetch a document with this error condition twice every refresh cycle, and after 3 refresh cycles the document, if it was at some point stored and added to the index, will be deleted. The protocol response codes are divided into general client-side errors (4xx) and general server-side errors (5xx). Behavior for individual 400/500 errors can also be specified. There are three classes of non-protocol errors that can be configured: ttl - specifies handling for connections that time out; net - specifies handling for network/socket-level errors; int - specifies handling for other internal errors. Example: To delete a document after 3 consecutive retries for an HTTP 503 error, enter 503 in the Error box, and DELETE:3, RETRY:1 in the Value box, then click on the right arrow. FTP errors This option is the equivalent of the HTTP errors option for FTP errors. Example: To delete a document after 3 consecutive retries for an FTP 550 error, enter 550 in the Error box, and DELETE:3, RETRY:1 in the Value box, then click on the right arrow. FTP accounts This option allows you to specify a number of FTP accounts required for crawling FTP URIs. If unspecified for a site, the default anonymous user will be used. Specify the hostname of the FTP site in the Hostname box, and the username and password in the Credentials box. The format of the Credentials is: <USERNAME>:<PASSWORD> Example (Credentials): myuser:secretpassword Crawl sites if login fails This parameter allows you to specify whether or not you want the crawler to continue crawling a site after a configured login specification has failed. Select Yes to attempt crawling of the site regardless. Select No to disallow crawling of the site. Default: No Domain clustering In a web scale crawl it is possible to optimize the crawler to take advantage of locality in the web link structure. Sub-domains on the same domain tend to link more internally than externally, just as a site tends to have mostly internal links. The domain clustering option enables clustering of sites on the same domain (for example, *.example.net) on the same master node and the same storage cluster (and thus uberslave process). This option also affects clustering within a single node, where all sites clustered in the same domain will be handled by the same uberslave process. This ensures cookies (if Send/receive cookies is enabled) can be used across a domain within the crawler. Default: No Note: This option is automatically turned on for multi node crawls by the ubermaster. Adaptive Crawlmode This section describes the adaptive scheduling options. Note that these parameters are only applicable if the Refresh mode is set to adaptive. Note: Extensive testing is strongly recommended before production use, to ensure that desired pages and sites are properly represented in the index. Table 10: Adaptive Crawlmode Options Minor Refresh count Number of minor cycles within the major cycle. A minor cycle is sometimes referred to as a micro cycle. Refresh quota Ratio of existing URIs re-crawled to new (unseen) URIs, expressed as a percentage. As long as the crawler has sufficient URIs of both types, this ratio is used. However, if it runs out of URIs of either type it will crawl only the other type from then on until refresh kicks in, or the site reaches some other limit (e.g. maximum document count for the cycle).
A high value favors re-crawling existing content (recommended); a low value favors fresh content. Minor Refresh Min Coverage Minimum number of URIs from a site to be crawled in a minor cycle. Used to guarantee some coverage for small sites. Minor Refresh Max Coverage Limits the percentage of a site re-crawled within a minor cycle. Ensures that small sites are not crawled fully in each minor cycle, starving large sites. When configuring this option, keep the number of minor cycles in mind. With e.g. 4 minor cycles this option should be 25% or higher, to ensure the entire site is re-crawled over the course of a major cycle. If the crawler detects that this value is set too low it may increase it internally. URI length weight Each URI is scored against a set of rules to determine its crawl rank value. The crawl rank value is used to determine the importance of the particular URI, and hence the frequency at which it is re-crawled (from at most once every minor cycle to only once every major cycle). Each rule is assigned a weight to determine its contribution towards the total rank value. Higher weights produce a higher rank contribution. A weight of 0 disables a rule altogether. The URI length scoring rule is based on the number of slashes (/) in the URI path. The document receives the maximum score if there is only a single slash, down to no score for 10 slashes or more. Increase this setting to boost the priority of URIs with few levels (slashes in the path). Default weight: 1.0 Valid Range: 0.0-2^32 URI depth weight The URI depth score is based on the number of link "hops" to this URI. Maximum score for none (for example, a start URI), no score for 10 or more. Use this setting to boost the priority of URIs linked closely from the top pages. Default weight: 1.0 Valid Range: 0.0-2^32 Landing page weight The landing page score awards a bonus score if the URI is determined to be a "landing page". A landing page is defined as any URI whose path ends in one of the following: /, index.html, index.htm, index.php, index.jsp, index.asp, default.asp, default.html, default.htm. Any URI with query parameters receives no score. Use this option to boost landing pages. Default weight: 1.0 Valid Range: 0.0-2^32 Markup document weight The markup document score awards a bonus score if the document at the URI is determined to be a "markup" page. A markup page is a document whose MIME type matches one of the MIME types listed in the MIME types to search for links option. This option is used to give preference to more dynamic content as opposed to static document types such as PDF, Word, etc. Default weight: 1.0 Valid Range: 0.0-2^32 Change history weight The change history rule scores a document on the basis of how often it changes over time. The crawler does this by keeping track of whether a document has changed, or remains unchanged, each time it is re-downloaded. An estimate is then made of how likely this document is to have changed the next time. Use this option to boost pages that change frequently, compared to static non-changing pages. Default weight: 10.0 Valid Range: 0.0-2^32 Sitemap weight The sitemap score is based on metadata found in sitemaps. The score is calculated by multiplying the value of the changefreq parameter with the priority parameter of a sitemap. Use this option to boost pages that are defined in sitemaps. Default weight: 10.0 Valid Range: 0.0-2^32
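The following minimal Python sketch illustrates how the sitemap score described above can be computed from the changefreq and priority attributes of a sitemap entry, using the numerical changefreq mappings listed under the Changefreq options below. It is an illustration only, not crawler source code; the function name is invented here, and scaling the result by the configured Sitemap weight is an assumption about how the score contributes to the total rank value.

# Documented default changefreq mappings (see the Changefreq options below).
CHANGEFREQ_VALUES = {
    "always": 1.0, "hourly": 0.64, "daily": 0.32, "weekly": 0.16,
    "monthly": 0.08, "yearly": 0.04, "never": 0.0,
}
CHANGEFREQ_DEFAULT = 0.16    # assigned when a sitemap entry has no changefreq attribute

def sitemap_score(changefreq, priority, sitemap_weight=10.0):
    # Sitemap score = changefreq value * priority, scaled by the configured weight (assumed).
    value = CHANGEFREQ_VALUES.get(changefreq, CHANGEFREQ_DEFAULT)
    return sitemap_weight * value * priority

print(sitemap_score("daily", 0.8))    # 2.56 for a daily entry with priority 0.8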
Changefreq always value This value is used to map the changefreq string value "always" in sitemaps to a numerical value. Default weight: 1.0 Valid Range: 0.0-2^32 Changefreq hourly value This value is used to map the changefreq string value "hourly" in sitemaps to a numerical value. Default weight: 0.64 Valid Range: 0.0-2^32 Changefreq daily value This value is used to map the changefreq string value "daily" in sitemaps to a numerical value. Default weight: 0.32 Valid Range: 0.0-2^32 Changefreq weekly value This value is used to map the changefreq string value "weekly" in sitemaps to a numerical value. Default weight: 0.16 Valid Range: 0.0-2^32 Changefreq monthly value This value is used to map the changefreq string value "monthly" in sitemaps to a numerical value. Default weight: 0.08 Valid Range: 0.0-2^32 Changefreq yearly value This value is used to map the changefreq string value "yearly" in sitemaps to a numerical value. Default weight: 0.04 Valid Range: 0.0-2^32 Changefreq never value This value is used to map the changefreq string value "never" in sitemaps to a numerical value. Default weight: 0.0 Valid Range: 0.0-2^32 Changefreq default value This value is assigned to all documents that have no changefreq attribute set in a sitemap. Default weight: 0.16 Valid Range: 0.0-2^32 Authentication This section of the Advanced Data Sources screen allows you to configure authentication credentials for the Basic, NTLM v1 or Digest schemes. Note: After an Authentication item has been added, it cannot be modified. To modify an existing item, save it under a new name and delete the old one. Table 11: Authentication Options URI Prefix or Realm An identifier based on either a URI prefix or an authentication realm. The corresponding credentials (Username, Password, and optionally Domain) will be used in an authentication attempt if either a URI matches the URI prefix string from left to right, or the specified Realm matches the value returned by the web server in a 401/Unauthorized response. Username Specify the username to use for the login attempt. This value will be sent to every URI that matches the specified URI prefix or realm. Password Specify the password to use for authentication attempts. This value will be sent to every URI that matches the specified prefix or realm. Domain Specify the domain value to use for authentication attempts. This value is optional. Authentication scheme Specify the scheme to use in authentication attempts. If auto is specified, the crawler selects one on its own. Note: If authentication fails, crawling of the site will stop. Cache Sizes It is recommended that you contact FAST Solution Services before changing any of the Cache sizes options. The default selections are shown in the following screen. The options with empty defaults are automatically adjusted by the crawler. Figure 4: Cache Size Options Crawl Mode This table describes the advanced options that apply to the Crawl mode. Table 12: Crawl Mode Options Crawl mode Select how web sites in the collection should be crawled from the Crawl mode drop-down menu. Highlight the type to be used. Possible modes are: Full - use if you want the crawler to crawl through all levels of a site. Level - use to indicate the depth of the crawl as defined in the Max levels option. The start level is the level of the start URI specified in the Start URI files.
The crawler assumes that all cross-site links are valid links and will follow these links until it reaches the number of levels specified in Max Levels. If the crawler crawls two sites that are closely interlinked, it may crawl both sites entirely, despite the given maximum level. You can prevent this by either: • • Limiting the included domains in Hostname Includes Selecting No in the Follow cross-site URIs Default: Full Max levels This option allows you to specify the maximum number of levels to crawl from a server. The crawler considers all cross-links to be valid and follows all cross-links the same amount of levels. If the sites you are crawling are heavily cross-linked, you may crawl entire sites. This option only applies when the Crawl mode option is set to Level. If unspecified, a Level Crawl mode will default to Max level 0. Example: 1 (the crawler crawls only the URI named in the Start URI files and any links from the Start URI) Default: empty 53 FAST Enterprise Crawler Option Description Note: Frame links, JavaScripts and redirects do not increase the level counter, therefore even a depth 0 crawl may follow these links. In this case it is possible to specify the depth as -1 instead, this will not follow any links. Follow cross-site URIs This option allows you to select whether the crawler is to follow cross-site URIs from one web site to another. Select Yes and the crawler will follow any links leading from the start URI sites as long as they fulfill the Hostname include filters criteria. Select No and the crawler will only follow "local" links with the same web site. It will not follow links from one web site to another even if the site is included by the Hostname include rules. Default: Yes Note: If cross-site link following is turned of it is necessary that each site to be crawled has an entry in the start URIs list. Note: The crawler treats a single hostname as a single web site, hence it will identify example.com and www.example.com as two different web sites, even though they may appear the same to the user. Follow cross-site redirects Specifies whether or not to follow external redirects from one web site to another. Default: Yes Reset crawl level This option allows you to select whether the crawler is to reset the crawl level when crawling cross-site. Select Yes to enable the crawler to reset the crawl level when leaving the start URI sites and crawling sites leading from there. The crawl mode will be reset to default (Crawl mode = Full). Select No to ensure that the crawler will not reset the crawl level, and the crawl mode and level set for the start URIs will also apply for external sites. Default: No Crawling Thresholds This option allows you to specify certain threshold limits for the crawler. When these limits are exceeded, the crawler will enter a special mode called refresh (not to be confused with the now removed refresh mode called refresh). The refresh crawl mode will make the crawler only crawl URIs that previously has been crawled. Figure 5: Crawling Thresholds The following table describes the crawling thresholds to be set Table 13: Crawling Threshold Options 54 Configuring the Enterprise Crawler Option Disk free percentage Description This option allows you to specify, as a percentage, the amount of free disk space that must be available for the crawler to operate in normal crawl mode. If the disk free percentage drops below this limit, the crawler enters the refresh crawl mode. 
While in the refresh crawl mode only documents previously seen will be re-crawled; no new documents will be downloaded. Default: 0% (0 == disabled) Disk free percentage slack This option allows you to specify, as a percentage, a slack to the disk free threshold defined by the Disk free percentage. By setting this option, you create a buffer zone above the disk free threshold. While the current free disk space remains in this zone, the crawler will not change the crawl mode back to normal. This prevents the crawler from switching back and forth between the crawl modes when the percentage of free disk space is close to the value specified by the Disk free percentage option. When the available disk space percentage rises above disk_free+disk_free_slack, the crawler will change back to normal crawl mode. Default: 3% Maximum documents This option allows you to specify, as a number of documents, the number of stored documents in the collection that will trigger the crawler to enter the refresh crawl mode. While in the refresh crawl mode only documents previously seen will be re-crawled; no new documents will be downloaded. Default: 0 documents (0 == disabled) Note: The threshold specified is not an exact limit, as the statistics reporting is somewhat delayed compared to the crawling. Note: This option should not be confused with the Max document count per site option. Maximum documents slack This option allows you to specify the number of documents which should act as a buffer zone between normal mode and refresh mode. The option is related to the Maximum documents setting. Whenever the refresh mode is activated because the number of documents has exceeded the maximum, a buffer zone is created between Maximum documents and Maximum documents minus Maximum documents slack. The crawler will not change back to normal mode while within the buffer zone. This prevents the crawler from switching back and forth between the crawl modes when the number of documents is close to the Maximum documents value. Default: 1000 documents
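The following minimal Python sketch illustrates the buffer zone (hysteresis) behavior described for the Disk free percentage and Disk free percentage slack options above; the Maximum documents and Maximum documents slack pair works the same way, counting documents instead of free disk space. It is an illustration only, not crawler source code; the function name and the mode strings are invented here.

def next_crawl_mode(current_mode, free_pct, disk_free, disk_free_slack):
    # Enter refresh mode when free disk space drops below the threshold.
    if free_pct < disk_free:
        return "refresh"
    # Stay in refresh mode while still inside the buffer zone above the threshold.
    if current_mode == "refresh" and free_pct <= disk_free + disk_free_slack:
        return "refresh"
    return "normal"

# Example with Disk free percentage = 10 and Disk free percentage slack = 3:
print(next_crawl_mode("normal", 9, 10, 3))    # refresh (below the threshold)
print(next_crawl_mode("refresh", 12, 10, 3))  # refresh (inside the buffer zone)
print(next_crawl_mode("refresh", 14, 10, 3))  # normal (risen above disk_free + slack)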
Duplicate Server This section of the Advanced Data Sources screen allows you to configure the Duplicate Server settings. Table 14: Duplicate Server Options Database format Specify the storage format to use for the duplicate server databases. Available formats are: Gigabase DB, Memory hash, Disk hash. Database Cachesize Specify the size of the cache of the duplicate server databases. If the database format is a hash format, the cache size specifies the initial size of the hash. Database stripe size Specify the number of stripes to use for the duplicate server databases. Nightly compaction? Specify whether nightly compaction should be enabled for the duplicate server databases. Note: If no duplicate server settings are specified, the defaults, or the values given on the duplicate server command line, are used. Feeding Destinations This table describes the options available for custom document feeding destinations. It is possible to submit documents to a collection with another name, to multiple collections, or even to another ESP installation. If no destinations are specified, the default is to feed into a collection with the same name in the current ESP installation. Table 15: Feeding Destination Options Name This parameter specifies a unique name that must be given for the feeding destination you are configuring. The name can later be used in order to specify a destination for refeeds. This field is required. Target collection This parameter specifies the ESP collection name to feed documents into. Normally this is the same as the collection name, unless you wish to feed into another collection. Ensure that the collection already exists on the ESP installation designated by Destination first. Each feeding destination you specify maps to a single collection, thus to feed the same crawl into multiple collections you need to specify multiple feeding destinations. It is also possible for multiple crawler collections to feed into the same target collection. This field is required. Destination This parameter specifies an ESP installation to feed to. The available ESP destinations are listed in the feeding section of the crawler's global configuration file, normally $FASTSEARCH/etc/CrawlerGlobalDefaults.xml. The XML file contains a list of named destinations, each with a list of content distributors. If no destinations are explicitly listed in the XML file you may specify "default" here, and the crawler will feed into the current ESP installation. The current ESP installation is the one specified by $FASTSEARCH/etc/contentdistributor.cfg. This field is required; it may be "default" unless the global XML file has been altered. Pause ESP feeding This option specifies whether or not the crawler should pause document feeding to FAST ESP. When paused, the feed will be written to stable storage on a queue. Note that the value of this setting can be changed via the crawleradmin tool options --suspendfeed/--resumefeed. Default: no Primary This parameter controls whether this feeding destination is considered a primary or secondary destination. Only the primary destination is allowed to act on callback information from the document feeding chain; secondary feeders are only permitted to log callbacks. Exactly one feeding destination must be specified as primary. This field is required. Focused Crawl This table describes the options to configure language focused crawling. Table 16: Focused Crawl Options Languages This option allows you to specify a list of languages that documents must match to be stored and indexed by FAST ESP. The crawler will only follow non-focused documents to a maximum depth set by the Focus depth option. Languages should be specified either as a two letter ISO-639-1 code, or the single word equivalent. Examples: english, en, german, de. Focus depth This option allows you to specify how many levels the crawler should follow links from URIs not matching the specified language of the crawl. For example, if you are doing an English only crawl, with a focus depth of 2, the URI chain would look like this (focus depth in parentheses, "-" means no depth assigned): English(-) -> French(2) -> French(1) -> English(1) -> English(1) -> German(0) In the example above, the crawler will not follow links from the last URI in the chain as the specified depth has been reached. Hostname exclude filters Use this parameter to specify certain domains where the language focus should not apply. For example, if performing a Spanish crawl it is possible to exclude the top level domain .es from the language focus checks, thereby crawling all of .es regardless of the language on individual pages. The format is the same as the Hostname exclude filters in the basic collection options. Form Based Login The crawler can crawl sites that rely on HTTP cookie authentication for access control of web pages.
Configuring the crawler to perform cookie authentication does, however, require a fair bit of insight into the details of how the authentication scheme works, and may take some trial and error to get correct. Studying the HTML or JavaScript source of the login page and HTTP protocol traces of a browser login session can be very helpful. Tools that perform such tasks are freely available, including the packet sniffer Ethereal (http://www.ethereal.com/). Note: When secure transport (HTTPS) is used, packet sniffing in general cannot be used, and some type of application level debugging tool must be used instead. We recommend the LiveHTTPHeaders utility (http://livehttpheaders.mozdev.org/) for the Mozilla browser. Note: Login Specification does not allow empty values. If you need to crawl cookie authenticated sites with empty values, contact FAST Technical Support for detailed instructions. Table 17: Form Based Login Options Name Required: Specify a unique name for the login specification you are configuring. Preload Optional: URI to fetch (in order to receive a cookie) before proceeding to the authentication form. May or may not be necessary, depending on how the authentication for that site works. HTML form Optional: URI to the HTML page containing the login form. Used by the Autofill option. If not specified, the crawler will assume the HTML page is specified by the Form action option. Form scheme Optional: Type of scheme used for login. Valid values: http, https Default: http Form site The hostname of the login form URI. Form action The path/file of the login form URI. Form method The HTTP action of the form. Valid values: GET, POST Default: GET Autofill Whether the crawler should download the HTML page, parse it, identify which form you are trying to log into by matching parameter names, and merge it with any form parameters you may have specified in the Form parameters option. Re-login if failed? Whether the crawler, after a failed login, should attempt to re-login to the web site after TTL seconds. During this time, the web site will be kept active within the crawler, thus occupying one available site resource. Form parameters The credentials, as a sequence of key/value parameters, that the form requires for a successful log on. These are typically different from form to form, and must be deduced by looking at the HTML source of the form. In general, only user-visible (via the browser) variables need be specified, e.g. username and password, or equivalent. The crawler will fetch the login form and read any "hidden" variables that must be sent when the form is submitted. If a variable and value are specified in the parameters section, this will override any value read from the form by the crawler. Login sites List of sites (i.e. hostnames) that should log into this form before being crawled. TTL Number of seconds before the crawler should re-authenticate itself. HTTP Proxies This topic specifies one or more proxy addresses to use for all HTTP/HTTPS communication. Table 18: HTTP Proxy Options Name Name of the proxy. Host Hostname of the proxy. Port Port number of the proxy. Default port: 3128 User Username. Password Password. Registered HTTP Proxies List of registered HTTP proxy names. Link Extraction This topic describes the advanced options available that apply to Link Extraction.
These options allow you to specify which HTML tags to extract links from, including whether or not to extract links from within comments or JavaScript code (applies only when the proper JavaScript support is turned off). The following display shows the default values for the various Link Extraction parameters. Figure 6: Link Extraction Options Logging This section describes the advanced options available that apply to Logging. The different logs can be enabled or disabled by selecting text or none respectively. Table 19: Logging Options 59 FAST Enterprise Crawler Option Document fetch log Description This log contains detailed information about every retrieved document. It contains status on whether the retrieval was a success, or if not, what went wrong. It will also tell you if the document was excluded after being downloaded, for instance if it was not of the correct document type. Inspecting this log is very useful if you suspect that your data should have been crawled but was not, or vice versa. It should be the first place to look after examining the crawler debugs for errors and warnings. Default location: $FASTSEARCH/var/log/crawler/fetch/<collection name>/<date>.log Default: text Site log The site log contains information about all sites being crawled in a collection, for instance when the crawler starts/stops crawling a site, as well as the time of refresh events. Examining this log can be useful when debugging site-wide issues, as this log is comparable to the fetch log only on a site basis. Default location: $FASTSEARCH/var/log/crawler/site/<collection name>/<date>.log Default: text Postprocess log This log contains a report of all documents, modifications or deletions sent to the FAST ESP indexing pipeline, and the outcome of these operations. Default location: $FASTSEARCH/var/log/crawler/PP/<collection name>/<date>.log Default: text Header log This log contains all HTTP headers send and received from the HTTP servers when documents are retrieved, and can be used for debugging purposes of your setup. This log is essential when debugging authentication related issues, but should be turned off for normal crawling. Default location for every web site crawled: $FASTSEARCH/var/log/crawler/header/<collection name>/<5 first chars of hostname>/<hostname>/<date>.log Default: none Screened log This log contains all URIs that are not attempted retrieved for any reason, including not falling within the scope of the configured include/exclude filters, robots.txt exclusion and so forth. This log is useful if you feel that content that should be crawled is not being crawled. As this is a very high volume log it should be turned off for normal crawling. Default location: $FASTSEARCH/var/log/crawler/screened/<collection name>/<date>.log Default: none Data Search feed log This log contains all URIs that have been submitted to document processing and their status. The log contain error messages reported by document processing stages and is the first place to look if a document is not in the index. Default location: $FASTSEARCH/var/log/crawler/dsfeed/<collection name>/<date>.log Default: text 60 Configuring the Enterprise Crawler Option Adaptive Scheduler log Description Logs adaptive rank score of documents, for debugging purposes only. 
Default location: $FASTSEARCH/var/log/crawler/scheduler/<collection name>/<date>.log Default: none POST Payload This section of the Advanced Data Sources screen allows you to configure POST payloads Table 20: POST Payload Options Option URI prefix Payload Description Specify a URI or URI prefix. Every URI that matches the URI or prefix will have the below associated Payload submitted to it using the HTTP POST method. A URI prefix must be indicated by the string prefix:, followed by the URI string to match. A URI alone will be used for an exact match. Specify the payload to be submitted by the HTTP POST method to every URI that matches the given URI prefix specified above. Postprocess It is recommended that you contact FAST Solution Services before changing any of the postprocess options. The default selections are shown in the following screen. The options with empty defaults are automatically adjusted by the crawler. Figure 7: Postprocess Options RSS This topic describes the parameters for RSS crawling. Note: Extensive testing is strongly recommended before production use, to insure that desired processing patterns are attained. Table 21: RSS Options 61 FAST Enterprise Crawler Option RSS start URIs Description This option allows you to specify a list of RSS start URIs for the collection to be configured. RSS documents (feeds) are treated a bit different than other documents by the crawler. First, RSS feeds typically contain links to articles and meta data which describes the articles. When the crawler parses these feeds, it will associate the metadata in the feeds with the articles they point to.This meta data will be sent to the processing pipeline together with the articles, and a RSS pipeline stage can be used to make this information searchable. Second, links found in RSS feeds will be tagged with a force flag. Thus, the crawler will crawl these links as soon as allowed (it will obey the collection's delay rate), and they will be crawled regardless if it they have been crawled already in this crawl cycle. Example: http://www.example.com/rss.xml Default: Not mandatory RSS start URI files This parameter requires you to specify a list of RSS start URI files for the collection to be configured. This option is not mandatory. The format of the files is one URI per line. Example: C:\MyDirectory\rss_starturis.txt (Windows) or /home/user/rss_starturis.txt (UNIX). Default: Not mandatory Discover new RSS feeds? This parameter allows you to specify if the crawler should attempt to find new RSS feeds. If this option is not set, only feeds specified in the RSS start URIs and/or the RSS start URIs files sections will be treated as feeds. Default: no Follow links from HTML? This option allows you to specify if the crawler should follow links from HTML documents, which is the normal crawler behavior. If this option is disabled, the crawler will only crawl one hop away from a feed. Disable this option if you only want to crawl feeds and documents referenced by feeds. Default: yes Ignore include/exclude rules? Use this option to specify if the crawler should crawl all documents referenced by feeds, regardless of being valid according to the collection's include/exclude rules. Default: no Index RSS feeds? This parameter allows you to specify if the crawler should send the RSS feed documents to the processing pipeline. Regardless of this option, meta data from RSS feeds will be sent to the processing pipeline together with the articles they link to. 
Default: no Max age for links in feeds This parameter allows you to specify the maximum age (in minutes) for a link in an RSS document. Expired links will be deleted if the 'Delete expired' option is enabled. 0 disables this option. Default: 0 (disabled) Max articles per feed This parameter allows you to specify the maximum number of links the crawler will remember for a feed. The list of links found in a feed will be treated in a FIFO manner. When links get pushed out of the list, they will be deleted if the 'Delete expired' option is set. 0 disables this option. Default: 128 62 Configuring the Enterprise Crawler Option Delete expired articles? Description This option allows you to specify if the crawler should delete articles when they expire. An article (link) will expire when it is affected by either 'Max articles per feed' or 'Max age for links in feeds'. Default: no Storage It is recommended that you contact FAST Solution Services before changing any of the Storage options. The default selections are shown in the following screen. The options with empty defaults are automatically adjusted by the crawler. Figure 8: Storage Options Sub Collections This topic describes how to define and configure Sub Collections in the crawler. Sub Collections is a mechanism that allows subsets of a collection to be specified differently in the crawler. An example is if a collection spans across several sites, and one wish to crawl a particular site or set of sites to be crawled more aggressively. In such a case, one can define a Sub Collection that includes this site and set a different request rate on that Sub Collection. Sub Collections should be considered as a separate work queue that is treated differently than the main collection queue. Note that Sub Collections can span several sites, or a particular subset of a site. The Sub Collection Hostname include/exclude filters and URI include/exclude filters determine what will be included in a Sub Collection; the filters have the same semantics found in the Data Source Basic Options and Data Source Advanced Options respectively. Note that whatever does not fall within a Sub Collection automatically falls within the main collection. Also note that what falls within a Sub Collection cannot be excluded in the main collection; it must be a subset. Sub Collections must be given their own start URI or start URI file. The options that are set for Sub Collections will contain the same semantics as those in the main collection; Sub Collection settings override main collection settings. One or more of the following settings are mandatory: • • • Hostname/URI include/exclude filters Start URI files/Start URIs Name The remaining settings are optional. 63 FAST Enterprise Crawler Figure 9: Sub Collection Basic Options Figure 10: Sub Collection Crawl Mode Options Figure 11: Sub Collection RSS Options 64 Configuring the Enterprise Crawler Figure 12: Sub Collection Advanced Options Creating a new Sub Collection Fill in the proper values in the fields for the Sub Collection. If values are already filled in, click New to get a blank template. Fill in the mandatory values, and click Add. Note: If a different Sub Collection has been viewed earlier, some options may not change. Make sure all options are correct before selecting Add. Modifying an existing Sub Collection Select the Sub Collection you wish to add in the Installed items select box, then click View. Modify the applicable settings. Before saving, select the same Sub Collection in the Installed Sub Collections box. 
Click Delete. Click Add. Removing an existing Sub Collection Select the Sub Collection you wish to remove in the Installed Sub Collections box, then click Delete. Work Queue Priority This topic describes the work queue priority parameter, which allows you to specify how many priority levels you want the work queue to consist of, as well as various rules and methods for how to insert and extract entries from the work queue. Note: Extensive testing is strongly recommended before production use, to ensure that desired processing patterns are attained. Table 22: Work Queue Priority Options Workqueue levels This option allows you to specify the number of priority levels you want the crawler work queue to have. Note: If this value is ever decreased (e.g. from 3 to 1), the on-disk storage for the work queues must be deleted manually to recover the disk space. Default: 1 Default queue This option allows you to specify the default priority level for extracting and inserting URIs from/to the work queue. Default: 1 Start URI priority This option allows you to specify the priority level for URIs coming from the Start URIs and Start URI files options. Default: 1 Pop Scheme This option allows you to specify which method you want the crawler to use when extracting URIs from the work queue. Valid values: rr - extract URIs from the priority levels in a round-robin fashion. wrr - extract URIs from the priority levels in a weighted round-robin fashion. The weights are based on the respective Share setting per priority level. URIs are extracted from the queue with the highest remaining share value; when all shares are 0, the shares are reset to their original settings. pri - extract URIs from the priority levels in priority order, always extracting from the highest priority level that still has entries available (1 being the highest). default - same as wrr. When using multiple work queue levels it is recommended to use either the wrr or pri pop scheme. Default: default Put Scheme This option allows you to specify which method you want the crawler to use when inserting URIs into the work queue. Valid values: default - always insert URIs with the default priority level. include - insert URIs with the priority level defined by the includes specified for every priority level. If no includes match, the default priority level will be used. Default: default Queue - Hostname include filters / Queue - URI include filters These options allow you to specify a set of include rules for each priority level, to be used when utilizing the include Put scheme of inserting entries into the queue. Queue - Share This option allows you to specify a share or weight for each queue, to be used when utilizing the wrr Pop scheme of extracting entries from the work queue.
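The following minimal Python sketch illustrates the wrr pop scheme described above: entries are popped from the priority level with the highest remaining share, and all shares are reset to their configured values once they have been spent. It is an illustration only, not crawler source code; the function name, the data structures and the handling of empty levels are assumptions.

def wrr_pop(queues, shares, configured_shares):
    # queues: dict mapping priority level -> list of URIs
    # shares: dict mapping priority level -> remaining share for this round
    for _ in range(2):                                   # allow at most one reset per call
        candidates = [lvl for lvl, q in queues.items() if q and shares.get(lvl, 0) > 0]
        if candidates:
            level = max(candidates, key=lambda lvl: shares[lvl])
            shares[level] -= 1                           # spend one share on this level
            return queues[level].pop(0)
        shares.update(configured_shares)                 # all shares spent; reset to original settings
    return None                                          # every queue is empty

# Example: level 1 configured with share 2, level 2 with share 1.
queues = {1: ["http://example.com/a", "http://example.com/b"], 2: ["http://example.org/c"]}
shares = {1: 2, 2: 1}
print(wrr_pop(queues, shares, {1: 2, 2: 1}))             # pops from level 1 (highest share)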
Configuration via XML Configuration Files The crawler may be configured using an XML based file format. This format allows you to manage files in a text based environment to create and manage multiple collections, as well as automate configuration changes. Furthermore, a few advanced features are only available in the XML format. Basic Collection Specific Options (XML) This section discusses the parameters available on a per collection basis should you decide to configure the crawler using an XML configuration file. To add or update a collection in the crawler from an XML file, use the following command: $FASTSEARCH/bin/crawleradmin -f <XML file path> Substitute <XML file path> with the full path to the XML file. Note: Sections that are not present in the submitted XML file are left unchanged. For example, if you want to delete the existing include_uris section, you should not simply remove that section from the XML file; instead, add an empty include_uris section to the XML file before importing the changes. This behavior allows partial configurations to be submitted in order to change a specific option while keeping the remaining configuration intact. Table 23: XML Configuration File Parameters info Collection information. This parameter specifies a string that can contain general-purpose information. <attrib name="info" type="string"> Test crawl for .no domains on W2k </attrib> fetch_timeout URI fetch timeout in seconds. The maximum period of time allowed for downloading a document. Set this value high if you expect to download large documents from slow servers. Default: 300 <attrib name="fetch_timeout" type="integer"> 300 </attrib> allowed_types Allowed document MIME types. Only download documents of the indicated MIME type(s). The MIME types specified here are included in the Accept header of each GET request that is sent. Note that some servers can return incorrect MIME types. Note that the format supports wildcard expansion of an entire field only, for example, */example, text/* or */*, but not appl/ms*. No other regular expression is supported. <attrib name="allowed_types" type="list-string"> <member> text/html </member> <member> application/msword </member> </attrib> force_mimetype_detection Force MIME type detection on documents. This option allows you to decide whether or not the crawler should run its own MIME type detection on documents. In most cases web servers return the MIME type of documents when they are downloaded, as part of the HTTP header. If this option is enabled, documents will get tagged with the MIME type that looks most accurate; either the one received from the web server or the result of the crawler's own determination. Default: no (disabled) <attrib name="force_mimetype_detection" type="boolean"> no </attrib> allowed_schemes Allowed schemes. Specify which URI schemes to allow. Valid schemes are HTTP, HTTPS and FTP, and the multimedia formats MMS and RTSP. Note that MMS and RTSP for multimedia crawls are supported via the MM proxy. <attrib name="allowed_schemes" type="list-string"> <member> http </member> <member> https </member> <member> ftp </member> </attrib> ftp_acct FTP accounts. Specify FTP accounts for crawling FTP URIs. If no site match is found here, the default is used. Note that changing this value may result in previously accessible content being (eventually) deleted from the index. Default: anonymous <section name="ftp_acct"> <attrib name="ftp.mysite.com" type="string"> user:pass </attrib> </section> ftp_passive FTP passive mode. Use FTP passive mode for retrieval from FTP sites. Default: yes <attrib name="ftp_passive" type="boolean"> yes </attrib> domain_clustering Route hosts from the same domain to the same slave.
If enabled in a multiple node configuration, sites from the same domain (for example, www.example.com and forums.example.com) will also be routed to the same master node. Default: no (disabled) for single node and yes (enabled) for multiple node <attrib name="domain_clustering" type="boolean"> yes </attrib> max_inter_docs Maximum number of docs before interleaving site. The crawler will by default crawl a site to exhaustion, or until the maximum number of documents per site is reached. However, the crawler can be configured to crawl "batches" of documents from sites at a time, thereby interleaving between sites. This parameter allows you to specify how many documents you want to be crawled from a server consecutively before the crawler interleaves and starts crawling other servers. The crawler will then return to crawling the former server as resources free up. Valid values: No value (empty) or positive integer Default: empty (disabled) Example: <attrib name="max_inter_docs" type="integer"> 3000 </attrib> max_redirects Maximum number of redirects to follow. This parameter allows you to specify the maximum number redirects that should be followed from an URI. For example, http://example.com/path redirecting to http://example.com/path2 will be counted as 1. Default: 10 <attrib name="max_redirects" type="integer"> 10 </attrib> near_duplicate_detection Enable near duplication detection algorithm. The near_duplicate_detection parameter is boolean, with values true or false, indicating whether or not to use the near duplicate detection scheme. The 69 FAST Enterprise Crawler Parameter Description near_duplicate_detection parameter can be used per domain (sub-domain). It is disabled (false) by default. Default: no <attrib name="near_duplicate_detection" type="boolean"> no </attrib> Refer to Configuring Near Duplicate Detection for more information. max_uri_recursion Screen for recursive patterns in new URIs. Use this parameter to check for repeating patterns in URIs, compared to their referrers, with repetitions beyond the specified being dropped. For example, http://www.example.com/wile linking to http://www.example.com/wile/wile is a repetition of 1 element. A value of 0 disables the test. Default: 5 <attrib name="max_uri_recursion" type="integer"> 5 </attrib> focused Language focused crawl (optional). Use this parameter to specify options to focus your crawl. languages: Use this parameter to specify a list of languages that documents must match to be stored and sent to FAST ESP. Documents that do not match the languages will follow a configured amount (depth) of levels before traversing stops.Those domains excluded from the language focused crawl are still eligible for the main crawl. Languages should be specified according to ISO-639-1. The depth and exclude_domains settings are used to limit the crawl: depth: Use this parameter to specify the number of levels to follow documents that do not match the language specification. exclude_domains: Use this parameter to exclude certain domains from which language focus should not apply. Format is the same as the exclude_domains option in the collection configuration. Note that domains will be crawled regardless of their language; they will be excluded from the language check, but not excluded from the crawl. 
<section name="focused"> <attrib name="depth" type="integer"> 3 </attrib> <section name="exclude_domains"> <attrib name="suffix" type="list-string"> <member> .tv </member> </attrib> </section> <attrib name="languages" type="list-string"> <member> norwegian </member> <member> no </member> <member> nb </member> <member> nn </member> <member> se </member> </attrib> </section> ftp_searchlinks 70 FTP search for links. Configuring the Enterprise Crawler Parameter Description Specify if you want the crawler to search the documents downloaded from FTP for links. Default: yes <attrib name="ftp_searchlinks" type="boolean"> yes </member> use_javascript Enable JavaScript support. Specify if you want to enable JavaScript support in the crawler. If enabled, the crawler will download, parse/execute and extract links from any external JavaScript. Note: JavaScript processing is resource intensive and should not be enabled for large crawls. Note: Processing JavaScript requires an available Browser Engine. For more information, please refer to the FAST ESP Browser Engine Guide. Default: no <attrib name="use_javascript" type="boolean"> no </attrib> javascript_keep_html Specify whether to submit the original HTML document, or the HTML resulting from the JavaScript parsing, to document processing for indexing. When parsing a HTML document the Browser Engine executes all inlined and external JavaScripts, and thereby all document.write() statements, and includes these in its HTML output. By default it is this resulting document that is indexed. However it is possible to use this feature to limit the Browser Engine to link extraction only. This option has no effect if JavaScript crawling is not enabled Default: no <attrib name="javascript_keep_html" type="boolean"> no </attrib> javascript_delay Specify the delay (in seconds) to be used when retrieving external JavaScripts referenced from a HTML document. The default (specified as an empty value) is the same as the normal crawl delay, but it may be useful to set it lower to speed up crawling of documents containing JavaScripts. This option has no effect if JavaScript crawling is not enabled Default: empty <attrib name="javascript_delay" type="real"> 60 </attrib> exclude_headers Exclude headers. 71 FAST Enterprise Crawler Parameter Description Specify which documents that you want to be excluded by identifying the document HTTP header fields. First specify the header name, then one or more regular expressions for the header value. <section name="exclude_headers"> <attrib name="Server" type="list-string"> <member> webserverexample1.* </member> <member> webserverexample2.* </member> </attrib> </section> exclude_exts Exclude extensions. Specify which documents you want to be excluded by identifying the document extensions. The extensions will be suffix string matched with the path of the URIs. <attrib name="exclude_exts" type="list-string"> <member> .gif </member> <member> .jpg </member> </attrib> use_http_1_1 Use HTTP/1.1. Specify whether the crawler should use HTTP/1.1 or not (HTTP/1.0). HTTP/1.1 is required for the crawler to accept compressed documents from the server (accept_compression) and enable ETag support (if_modified_since must be checked). Default: yes (to crawl using HTTP/1.1) <attrib name="use_http_1_1" type="boolean"> no </attrib> accept_compression Accept compression. Specify whether the crawler should use the Accept-Encoding header, thus accepting that the documents are compressed at the web server before returned to the crawler. 
Default: yes Only applicable if use_http_1_1 is enabled. <attrib name="accept_compression" type="boolean"> no </attrib>

dbswitch DB switch interval. Specify the number of cycles a document is allowed to complete before being deleted. When the DB interval is complete, the action taken on these deleted documents is determined by the dbswitch_delete parameter. Setting this value very low, such as to 1, can result in documents being deleted very suddenly. This parameter is not affected by force re-fetches; only cycles where the crawler refreshes normally (by itself) increase the document's cycle number count. <attrib name="dbswitch" type="integer"> 5 </attrib>

dbswitch_delete DB switch delete. The crawler will at regular intervals perform an update of its internal database of retrieved documents, to detect documents that may be removed from the web servers. This option determines what to do with these remaining documents; they can either be deleted right away or put in the work queue for retrieval to make certain they are actually removed. A dbswitch check occurs at the start of a refresh cycle, independently for each site. If set to yes, then documents found to be too old are deleted immediately. If set to no, then documents are scheduled for a re-retrieval and only deleted if they no longer exist on the server. Default: no <attrib name="dbswitch_delete" type="boolean"> yes </attrib>

html_redir_is_redir Treat META refresh HTTP tag contents as an HTTP redirect. Use this parameter in conjunction with html_redir_thresh to allow the crawler to treat META refresh tags inside HTML documents as if they were true HTTP redirects. When enabled, the document containing the META refresh will not itself be indexed. Default: yes <attrib name="html_redir_is_redir" type="boolean"> yes </attrib>

html_redir_thresh Upper bound for META refresh tag delay. Use this parameter in conjunction with html_redir_is_redir to specify the number of seconds delay (threshold) allowed for the tag to be considered a redirect. Anything less than this number is treated as a redirect; other values are treated as a link (and the document itself is also indexed). Default: 3 <attrib name="html_redir_thresh" type="integer"> 3 </attrib>

robots_ttl Robots time to live. Specifies how often (in seconds) the crawler will re-download the robots.txt file from sites, in order to check if it has changed. Note that the robots.txt file may be retrieved less often if the site is not crawling continuously. Default: 86400 (24 hours) <attrib name="robots_ttl" type="integer"> 86400 </attrib>

enable_flash Extract links from flash files. If enabled, extract links from Adobe Flash (.swf) files. You may also want to enable JavaScript support, as many web servers only provide Flash content to clients that support JavaScript. Note: Flash processing is resource intensive and should not be enabled for large crawls. Note: Processing Adobe Flash files requires an available Browser Engine. For more information, please refer to the FAST ESP Browser Engine Guide. Default: no <attrib name="enable_flash" type="boolean"> no </attrib>

use_sitemaps Extract links and metadata from sitemap files. Enabling this option allows the crawler to detect and parse sitemaps. The crawler supports sitemap and sitemap index files as defined by the specification at http://www.sitemaps.org/protocol.php.
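For reference, a typical entry in a sitemap following that specification looks like the one below (the URL and values are illustrative only, and are not part of the crawler configuration):

<url>
  <loc>http://www.example.com/news/index.html</loc>
  <lastmod>2009-11-30</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.8</priority>
</url>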
The crawler uses the 'lastmod' attribute in a sitemap to see if a page has been modified since the last time the sitemap was retrieved. Pages that have not been modified will not be recrawled. An exception to this is if the collection uses adaptive refresh mode. In adaptive refresh mode the crawler will use the 'priority' and 'changefreq' attributes of a sitemap in order to determine how often a page should be crawled. For more information see Adaptive Parameters on page 91. Custom tags found in sitemaps are stored in the crawler's meta database and can be submitted to document processing. Note: Most sitemaps are specified in robots.txt. Thus, 'robots' should be enabled in order to get the best result. Default: no <attrib name="use_sitemaps" type="boolean"> no </attrib>

max_reflinks Maximum referrer links. Specify the maximum number of referrer links to store per URI (redirects excluded). Note: This value can have a major impact on crawler performance. Default: 0 <attrib name="max_reflinks" type="integer"> 0 </attrib>

max_pending Maximum number of concurrent requests per site. Specify the maximum number of concurrent (outstanding) HTTP requests to a site at any given time. Default: 2 <attrib name="max_pending" type="integer"> 8 </attrib>

robots_auth_ignore Ignore robots.txt authentication errors. Specify whether or not the crawler should ignore robots.txt if an HTTP 40x authentication error is returned by the server. If disabled, the crawler will not crawl the site in question at this time. This option allows you to control whether sites returning 401/403 Authorization Required for their robots.txt should be excluded from the crawl. The robots standard lists this behavior as a hint for the spider to ignore the site altogether. However, incorrect configuration of web servers is widespread and can lead to a site being erroneously excluded from the crawl. Enabling this option makes the crawler ignore such indications and crawl the site anyway. Default: yes <attrib name="robots_auth_ignore" type="boolean"> yes </attrib>

robots_tout_ignore Ignore robots.txt timeout. Specify whether or not the crawler should ignore the robots.txt rules if the request for this file times out. Before crawling a site, the crawler will request the robots.txt file from the server, according to the rules for limiting what areas of a site may be crawled. According to these rules, if the request for this file times out the entire site should be considered off-limits to the crawler. Setting this parameter to yes indicates that the robots.txt rules should be ignored, and the site crawled. Keep this option set to no if you do not control the site being crawled. Default: no <attrib name="robots_tout_ignore" type="boolean"> no </attrib>

rewrite_rules Rewrite rules. Specify a number of rewrite rules that rewrite certain URIs. Typical usage is to rewrite URIs with session-ids by removing the session-id part. Sed-type format. The separator character is the first one encountered, in this example "@". <attrib name="rewrite_rules" type="list-string"> <member> <![CDATA[@(.*/servlet/.*[&?])r=.*?(&|$)(.*)@\1\3@ ]]> </member> <member> <![CDATA[@(.*);jsessionid=.*?(\?.*|$)@\1\2@ ]]> </member> </attrib>

extract_links_from_dupes Extract links from duplicates. Even though two documents have duplicate contents, they may have different links. Specify whether or not you want the crawler to extract links from duplicates.
If enabled, you may get duplicate links in the URI-queues. If duplicate documents contain duplicate links then you can disable this parameter. Default: no <attrib name="extract_links_from_dupes" type="boolean"> no </attrib>

use_meta_csum Include HTML META tag contents in checksum. Specify if you want the crawler to include the contents (values) of HTML META tags when generating the document checksum used for duplicate detection. Use this to find changes in the document META tags. Default: no <attrib name="use_meta_csum" type="boolean">no</attrib>

csum_cut_off Checksum cut-off limit. When crawling multimedia content through a multimedia proxy (schemes MMS or RTSP), use this setting to determine if a document has been modified. Rather than downloading an entire document, only the number of bytes specified in this setting will be transferred and the checksum calculated on that initial portion of the file. This saves bandwidth after the initial crawl cycle, and reduces the load on other system and network resources as well. Default: 0 (disabled) <attrib name="csum_cut_off" type="integer">0</attrib>

if_modified_since Send If-Modified-Since header. Specify if you want the crawler to send If-Modified-Since headers. Default: yes <attrib name="if_modified_since" type="boolean"> yes </attrib>

use_cookies Use cookies. Specify if you want the crawler to store/send cookies received in HTTP headers. This feature is automatically enabled for sites that use a login, but can also be turned on globally through this option. Default: no <attrib name="use_cookies" type="boolean"> no </attrib>

uri_search_mime Document MIME types to extract links from. This option specifies MIME types that should be searched for links. If not already listed in the default list, type in a MIME type you want to search for links. Note that wildcards on type and subtype are allowed. For instance, text/* or */html are valid. No other regular expression is supported. <attrib name="uri_search_mime" type="list-string"> <member> text/html </member> <member> text/plain </member> </attrib>

variable_delay Variable request rate. Specify time slots when you want to use a higher or lower request rate (delay) than the main setting. Time slots are specified as starting and ending points, and cannot overlap. Time slot start and endpoints are specified with day of week and time of day (optionally with minute resolution). You can also enter the value suspend in the delay field, which will suspend the crawler so that there is no crawling for the time span specified. <section name="variable_delay"> <!-- Crawl with delay 20 Wednesdays --> <attrib name="Wed:00-Wed:23" type="string">20 </attrib> <!-- Crawl with delay 2 during weekends --> <attrib name="Sat:08.00-Sun:20.30" type="string">2</attrib> <!-- Don't crawl Mondays --> <attrib name="Mon:00-Mon:23" type="string">suspend</attrib> </section>

site_clusters Explicit site clustering. Specify if you want to override normal routing of sites and force certain sites to be on the same uberslave. This is useful when cookies/login is enabled, since cookies are global only within an uberslave. Also, if you know certain sites are closely interlinked, you can reduce internal communication by clustering them.
<section name="site_clusters"> <attrib name="mycluster" type="list-string"> <member> site1.example.com </member> <member> site2.example.com </member> <member> site3.example.com </member> </attrib> </section> refresh_mode workqueue_priority Refer to Refresh Mode Parameters on page 89 for option information. Refer to Work Queue Priority Rules on page 89 for option information. adaptive Refer to Adaptive Parameters on page 91 for option information. max_backoff_counter and max_backoff_delay Maximum connection error backoff counter. and Maximum connection error backoff delay. 77 FAST Enterprise Crawler Parameter Description Together these options control the adaptive algorithm by which a site experiencing connection failures (for example, network errors, timeouts, HTTP 503 "Server Unavailable" errors) are contacted less frequently. For each consecutive instance of these errors, the inter-request delay for that site is incremented by the initial delay setting (delay setting): Increasing delay = current delay + delay The maximum delay for a site will be the max_backoff_delay setting. If the number of failures reaches max_backoff_counter, crawling of the site will become idle. Should the network issues affecting the site be resolved, the internal backoff counter will start decreasing, with the inter-request delay lowered on each successful document fetch by half: Decreasing delay = current delay / 2 This continues until the original delay (delay setting) is reached. Default: <attrib name="max_backoff_counter" type="integer"> 50 </attrib> <attrib name="max_backoff_delay" type="integer"> 600 </attrib> http_errors ftp_errors Refer to HTTP Errors Parameters on page 93 for option information. FTP error handling. Specify how various response codes and error conditions are handled for FTP URIs. Same XML structure as the http_errors section. Logins storage delay Refer to Logins parameters on page 94 for option information. Refer to Storage parameters on page 96 for option information. Delay between document requests (request rate). This option specifies how often (the delay between each request) the crawler should access a single web site when crawling. <attrib name="delay" type="real"> 60.0 </attrib> Note: FAST license terms do not allow a more frequent request rate setting than 60 seconds for external sites unless an agreement exists between the customer and the external site. refresh Refresh interval. refresh_mode <attrib name="refresh" type="real"> 1440 </attrib> The crawler retrieves documents from web servers. Since documents on web servers frequently change, are added or removed, the crawler must periodically crawl a site over again to reflect this. In the default crawler configuration, this 78 Configuring the Enterprise Crawler Parameter Description refresh interval is one day (1440 minutes), meaning that the crawler will start over crawling a site every 24 hours. Since characteristics of web sites may differ, and customers may want to handle changes differently, the action performed at the time of refresh is also configurable, via the refresh_mode <attrib name="refresh" type="real"> 1440 </attrib> robots Respect robot directives. This parameter indicates whether or not to follow the directives in robots.txt files. <attrib name="robots" type="boolean"> yes </attrib> include_domains Sites included in crawl. This parameter is a set of rules of which a hostname must match at least one in order to be crawled. An empty section matches all domains. 
Note: This setting is a primary control over the pages included in the crawl (and index), and should not be changed without care. Valid rule types are:
prefix: Matches the given sitename prefix (for example, www matches www.example.net, but not download.example.net)
exact: Matches the exact sitename
file: Identifies a local (to the crawler host) file containing include and/or exclude rules for the configuration. Note that in a multiple node configuration, the file must be present on all crawler hosts, in the same location.
suffix: Matches the given sitename suffix (for example, com matches www.example.com)
regexp: Matches the given sitename against the specified regular expression (left to right).
IP mask: Matches IP addresses of sites against the specified dotted-quad or CIDR expression.
<section name="include_domains"> <attrib name="suffix" type="list-string"> <member> example.net </member> <member> example.com </member> </attrib> <attrib name="regexp" type="list-string"> <member> .*\.alltheweb\.com </member> </attrib> </section>

exclude_domains Sites excluded from crawl. This parameter is a set of rules of which a hostname must not match any rules in order to be crawled. An empty section matches no domains (allowing all to be crawled). Syntax is identical to the include_domains parameter with only the section name being different. Note: This setting is a primary control over the pages included in the crawl (and index), and should not be changed without care.

include_uris Included URIs. This parameter is a set of rules of which a URI must match at least one rule in order to be crawled. An empty section matches all URIs. Syntax is identical to the include_domains parameter with only the section name being different. Note: This setting is a primary control over the pages included in the crawl (and index), and should not be changed without care. Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3). In previous ESP releases these two rule types were evaluated as an AND, meaning that a URI had to match both rules (when rules are defined). As of ESP 5.1 and later (EC 6.6 and up), the rules processing has changed to an OR operator, meaning a URI now only needs to match one of the two rule types. For example, an older configuration for fetching pages with the prefix http://www.example.com/public would specify two rules:
• include_domains: www.example.com (exact)
• include_uris: http://www.example.com/public/ (prefix)
The first rule is no longer needed, and if not removed would allow any URI from that host to be fetched, not only those from the /public path. Some configurations may be much more complex than this simple example, and require careful adjustment in order to restrict URIs to the same limits as before. Refer to Contact Us on page iii for assistance in reviewing your configuration, if in doubt. Existing crawler configurations migrating from EC 6.3 must be manually updated, by removing or adjusting the hostname include rules that overlap with URI include rules.

exclude_uris Excluded URIs. This parameter is a set of rules of which a URI must not match any rules in order to be crawled. An empty section matches no URIs (allowing all to be crawled). Syntax is identical to include_domains with only the section name being different. Note: This setting is a primary control over the pages included in the crawl (and index), and should not be changed without care.

start_uris Start URIs for the collection.
This parameter is a list of start URIs for the specified collection. The crawler needs either start_uris or start_uri_files specified to start crawling. Note: If your crawl includes any IDNA domain names, you should enter them using UTF-8 characters, and not in the DNS encoded format. <attrib name="start_uris" type="list-string"> <member> http://www.example.com/ </member> <member> http://example.øl.no/ </member> </attrib>

start_uri_files Start URI files for the collection. This parameter is a list of start URI files for the specified collection. The file format is plain text with one URI per line. The crawler needs either start_uris or start_uri_files specified to start crawling. Note: In a multiple node configuration, the file must be available on all masters. <attrib name="start_uri_files" type="list-string"> <member> urifile.txt </member> <member> urifile2.txt </member> </attrib>

mirror_site_files Map file of primary/secondary servers for a site. This parameter is a list of mirror site files for the specified domain. The file format is a plain text, whitespace-separated list of sites, with the preferred (primary) name listed first. Note: In a multiple node configuration, the file must be available on all masters. <attrib name="mirror_site_files" type="list-string"> <member> mirror_mappings.txt </member> </attrib>

max_sites Maximum number of concurrent sites. This parameter limits the maximum number of sites that can be handled concurrently by this crawler node. This value applies per crawler node in a distributed setup. Note: This value can have a major impact on system resource usage. <attrib name="max_sites" type="integer"> 128 </attrib>

proxy Proxy address. This parameter specifies a proxy through which to redirect all HTTP communication. The proxy can be specified in the format: (http://)?(user:pass@)?hostname(:port)? Default port: 3128 <attrib name="proxy" type="list-string"> <member> proxy1.example.com:3128 </member> <member> proxyB.example.com:8080 </member> </attrib>

proxy_max_pending Proxy open connection limit. This parameter specifies a limit on the number of outstanding open connections per proxy, per uberslave in the configuration. <attrib name="proxy_max_pending" type="integer"> 8 </attrib>

passwd Refer to Password Parameters on page 97 for option information.

document_plugin Specify user-defined document/redirect processing program. Specify a user-written Python module to be used for processing fetched documents and (optionally) URI redirections. The value specifies the Python class module, ending with your class name. The crawler splits on the last '.' separator and converts this to the Python equivalent "from <module> import <class>". The source file should be located relative to $PYTHONPATH, which for FDS installations corresponds to ${FASTSEARCH}/lib/python2.3/. <attrib name="document_plugin" type="string"> tests.plugins.plugins.helloworld</attrib> Refer to Implementing a Crawler Document Plugin Module on page 118 for more information.

headers List of additional HTTP headers to send. List of additional headers to add to the request sent to the web servers. Typically this is used to specify a user-agent header. <attrib name="headers" type="list-string"> <member> User-agent: FAST Enterprise Crawler 6 </member> </attrib>

cut_off Maximum document size in bytes. This parameter limits the maximum size of documents.
Documents larger than the specified number of bytes will be truncated or discarded (refer to the truncate setting). Default: no cut-off <attrib name="cut_off" type="integer"> 100000000 </attrib>

truncate Truncate/discard docs exceeding cut-off. This parameter specifies the action taken when a document exceeds the specified cut-off threshold. A value of "yes" truncates the document at that size and a value of "no" discards the document entirely. Default: yes <attrib name="truncate" type="boolean"> yes </attrib>

diffcheck Duplicate screening. This parameter indicates whether or not duplicate screening should be performed. <attrib name="diffcheck" type="boolean"> yes </attrib>

check_meta_robots Inspect META robots directive. This parameter indicates whether or not to follow the directives given in the META robots tag (noindex or nofollow). <attrib name="check_meta_robots" type="boolean"> yes </attrib>

obey_robots_delay Respect robots.txt crawl-delay directive. This parameter indicates whether or not to follow the crawl-delay directive in robots.txt files. In a site's robots.txt file, the non-standard directive Crawl-delay: 120 may be specified, where the numerical value is the number of seconds to delay between page requests. If this setting is enabled, this value will override the collection-wide delay setting for this site. <attrib name="obey_robots_delay" type="boolean"> no </attrib>

key_file SSL key file. An SSL key file to use for HTTPS connections. Note: In a multiple node configuration, the file must be on all masters. <attrib name="key_file" type="string"> key.pem </attrib>

cert_file SSL cert file. An SSL certificate file to use for HTTPS connections. Note: In a multiple node configuration, the file must be on all masters. <attrib name="cert_file" type="string"> cert.pem </attrib>

max_doc Maximum number of documents. This parameter indicates the maximum number of documents to download from a web site. <attrib name="max_doc" type="integer"> 5000 </attrib>

enforce_delay_per_ip Limit requests per target IP address. Use this parameter to force the crawler to limit requests (per the delay setting) to web servers whose names map to a shared IP address. Default: yes <attrib name="enforce_delay_per_ip" type="boolean"> yes </attrib>

wqfilter Enable work queue Bloom filter. This parameter enables filtering that screens duplicate URI entries from the per-site work queues. Sizing of the filter is automatic. Default: yes <attrib name="wqfilter" type="boolean"> yes </attrib>

smfilter Slave/Master Bloom filter. This parameter enables a Bloom filter to screen URI links transferred between slaves and master. The value is the size of the filter, specifying the number of bits allocated, which should be between 10 and 20 times the number of URIs to be screened. It is recommended that you turn on this filter for large crawls; the recommended value is 50000000 (50 megabit). Default: 0 <attrib name="smfilter" type="integer"> 0 </attrib>

mufilter Master/Ubermaster Bloom filter. This parameter enables a Bloom filter to screen URI links transferred between masters and the ubermaster. The value is the size of the filter, specifying the number of bits allocated, which should be between 10 and 20 times the number of URIs to be screened. Note: Enabling this setting with a positive integer value disables the crosslinks cache.
It is recommended that you turn on this filter for large crawls; the recommended value is 500000000 (500 megabit). Default: 0 (disabled) <attrib name="mufilter" type="integer"> 0 </attrib>

umlogs Ubermaster log file consolidation. If enabled (as by default), all logging is sent to the ubermaster host for storage. In large multiple node configurations it can be disabled to reduce inter-node communications, reducing resource utilization, at the expense of having to check log files on individual masters. <attrib name="umlogs" type="boolean"> yes </attrib>

crawlmode Specify crawl mode. This parameter indicates the crawl mode that should be applied to a collection. Note: This setting is the primary control over the pages included in the crawl (and index) and should not be changed without care. The following settings exist:
mode: Specifies either FULL or DEPTH:# (where # is the maximum number of levels to crawl from the start URIs). Default: FULL
fwdlinks: Specifies whether or not to follow external links from servers. Default: yes
fwdredirects: Specifies whether or not to follow external redirects received from servers. Default: yes
reset_level: Specifies whether or not to reset the level counter when following external links. Doing so will result in a deeper crawl and you will generally want this set to "no" when doing a DEPTH crawl. Default: yes
<section name="crawlmode"> <attrib name="mode" type="string"> DEPTH:1 </attrib> <attrib name="fwdlinks" type="boolean"> yes </attrib> <attrib name="reset_level" type="boolean"> no </attrib> </section>

Master Crawler node inclusion. In a multiple node crawler setup, each instance of this parameter specifies a crawler node to include in the crawl. The following example specifies use of the crawler node named "crawler_node1": <Master name="crawler_node1"> </Master> It is possible to override "global" FAST ESP parameters for a crawler node by including "local" values of the parameters within the <Master> tag: <Master name="crawler_node1"> <attrib name="delay" type="integer">60 </attrib> </Master> It is possible to specify "local" values for all "global" collection parameters. A specific sub collection may be bound to a crawler node by including a <subdomain> tag within the <Master> tag: <Master name="crawler_node1"> <attrib name="subdomain" type="list-string"> <member> subdomain1 </member> </attrib> </Master> Note: Having no masters specified means that whatever masters are connected when the configuration is initially added will be used.

sort_query_params Sort query parameters. This parameter tells the crawler whether or not it should sort the query parameters in URIs. For example, http://example.com/?a=1&b=2 is really the same URI as http://example.com/?b=2&a=1. If this parameter is enabled, then the URIs will be rewritten to be the same. If not, the two URIs will most likely be screened as duplicates. The problem, however, arises if the two URIs are crawled at different times and the page has changed in between. In this case you can end up with both URIs in the index. Note: Changing this setting after an initial crawl has been done might also lead to duplicates. <attrib name="sort_query_params" type="boolean"> no </attrib>

post_payload POST payload. This parameter can be used to submit data as the payload to a POST request made to a URI matching the specified URI prefix or exact match.
To specify a URI prefix, use the label prefix:, then the leading portion of the URIs to match. A URI alone will be tested for an exact match. The payload value can be any data accepted by the target web server, but often URL encoding of variables is required. <section name="post_payload"> <attrib name="prefix:http://vault.example.com/secure" type="string"> variable1=value1&variableB=valueB </attrib> </section> Note: Use of this option should be tested carefully, with header logs enabled, to ensure the expected response from remote server[s].

pp Refer to PostProcess Parameters on page 98 for option information.

SubDomain Specifies a sub collection (subdomain) within the collection. Within a collection, you can specify sub collections with individual configuration options. The following options are valid within a sub collection: ftp_passive, allowed_schemes, include_domains, exclude_domains, include_uris, exclude_uris, refresh, refresh_mode, use_http_1_1, accept_compression, delay, crawlmode, cut_off, start_uris, start_uri_files, headers, use_javascript, use_sitemaps, max_doc, proxy, enable_flash, rss and variable_delay. One of either include_domains, exclude_domains, include_uris or exclude_uris must be specified; the others are optional. This is used for directing URIs/sites to the sub collection. The refresh parameter of a sub collection must be set lower than the refresh rate of the main domain. Note: The following options can only have a domain granularity: use_javascript, enable_flash and max_doc. <SubDomain name="rabbagast"> <section name="include_uris"> <attrib name="prefix" type="list-string"> <member> http://www.example.net/index </member> </attrib> </section> <attrib name="refresh" type="real"> 60.0 </attrib> <attrib name="delay" type="real"> 10.0 </attrib> <attrib name="start_uris" type="list-string"> <member> http://www.example.net/ </member> </attrib> </SubDomain>

log Refer to Log Parameters on page 100 for option information.

cachesize Refer to Cache Size Parameters on page 101 for option information.

link_extraction Refer to Link Extraction Parameters on page 102 for option information.

robots_timeout Use this parameter to specify the maximum amount of time in seconds you want to allow for downloading a robots.txt file. Before crawling a site, the crawler will attempt to retrieve a robots.txt file from the server that describes areas of the site that should not be crawled. Set this value high if you expect to have comparably slow interactions requesting robots.txt. Default: 300 <attrib name="robots_timeout" type="integer"> 300 </attrib>

login_timeout Use this parameter to specify the maximum amount of time in seconds you want to allow for login requests. Set this value high if you expect to have comparably slow interactions with login requests. Default: 300 <attrib name="login_timeout" type="integer"> 300 </attrib>

post_payload Use this parameter to specify a data payload that will be submitted by HTTP POST to all URIs matching the specified URI prefix. <section name="post_payload"> <attrib name="http://www.example.com/testsubmit.php" type="string"> randomdatahere </attrib> </section>

send_links_to This parameter allows one collection to send all extracted links to another crawler collection. This can, for instance, be useful when setting up RSS crawling. You can do RSS crawling with a high refresh rate in one collection, and make it pass new URIs to another collection which does normal crawling.
<attrib name="send_links_to" type="string"> collection_name </attrib> Crawling thresholds The option allows you to specify fail-safe limits for the crawler. When the limits are exceeded, the crawler enters a mode called 'refresh', which makes sure that only URIs that have been crawled previously will be crawled. The following table describes the crawler thresholds to be set Table 24: Crawler thresholds 87 FAST Enterprise Crawler Parameter disk_free Description This option allows you to specify, in percentage, the amount of free disk space that must be available for the crawler to operate in normal crawl mode. If the disk free percentage drops below this limit, the crawler enters the 'refresh' crawl mode. Default: <attrib name="disk_free" type="integer"> 0 </attrib> (0 == disabled) disk_free_slack This option allows you to specify, in percentage, a slack to the disk_free threshold. By setting this option, you create a buffer zone above the 'disk_free' threshold. When the current free disk space is in this zone, the crawler will not change the crawl mode back to normal. This prevents the crawler from switching back and forth between the crawl modes when the percentage of free disk space is close to the value specified by the 'disk_free' parameter. When the available disk space percentage rises above disk_free + disk_free_slack, the crawler will change back to normal crawl mode.. Default: <attrib name="disk_free_slack" type="integer"> 3 </attrib> max_doc This option allows you to specify, in number of documents, the number of stored documents that will trigger the crawler to enter the 'refresh' crawl mode. Note: the threshold specified is not an *exact* limit, as the statistics reporting is somewhat delayed compared to the crawling. Default: <attrib name="max_doc" type="integer"> 0 </attrib> (0 == disabled) Note: This option should not be confused with Max document count per site option. max_doc_slack This option allows you to specify the number of documents which should act as a buffer zone between normal mode and 'refresh' mode. The option is related to the 'max_doc' parameter. Whenever the 'refresh' mode is activated, because the number of documents has exceeded the 'max_doc' parameter, a buffer zone is created between the 'max_doc' and 'max_doc'-'max_doc_slack'. The crawler will not change back to normal mode within the buffer zone. This prevents the crawler from switching back and forth between the crawl modes when the number of docs is close to the 'max_doc' threshold value. Default: <attrib name="max_doc_slack" type="integer"> 1000 </attrib> Example: <section name="limits"> <attrib name="disk_free" type="integer"> 0 </attrib> <attrib name="disk_free_slack" type="integer"> 3 </attrib> <attrib name="max_doc" type="integer"> 0 </attrib> <attrib name="max_doc_slack" type="integer"> 1000 </attrib> </section> Note: This special refresh crawl mode can also be user initiated enabled with the crawleradmin tool. 88 Configuring the Enterprise Crawler Refresh Mode Parameters The refresh_mode allows you to specify the refresh mode of the collection. The following table describes the valid refresh modes Table 25: Refresh Mode Parameter Options Option append prepend scratch Description The Start URIs are added to the end of the crawler work queue at the start of every refresh. If there are URIs in the queue, Start URIs are appended and will not be crawled until those before them in the queue have been crawled. The Start URIs are added to the beginning of the crawler work queue at every refresh. 
However, URIs extracted from the documents downloaded from the Start URIs will still be appended at the end of the queue.

scratch The work queue is truncated at every refresh before the Start URIs are appended. This mode discards all outstanding work on each refresh event. It is useful when crawling sites with dynamic content that produce an infinite number of links. Default: <attrib name="refresh_mode" type="string"> scratch </attrib>

soft If the work queue is not empty at the end of a refresh period, the crawler will continue crawling into the next refresh period. A server will not be refreshed until the work queue is empty. This mode allows the crawler to ignore the refresh event for a site if it is not idle. This allows large sites to be crawled in conjunction with smaller sites, and the smaller sites can be refreshed more often than the larger sites.

adaptive Build the work queue according to scoring of URIs and limits set by the adaptive section parameters. The overall refresh period can be subdivided into multiple intervals, and high-scoring URIs re-fetched during each interval, to maintain content freshness while still completing deep sites.

refresh_when_idle This option allows you to specify whether the crawler should automatically trigger a new refresh cycle when the crawler goes idle (all websites are finished crawling) in the current refresh cycle. Default: <attrib name="refresh_when_idle" type="boolean"> no </attrib> Note: This option cannot be used with a multiple node crawler.

Work Queue Priority Rules The workqueue_priority section allows you to specify how many priority levels you want the work queue to consist of, and various rules and methods for how to insert and extract entries from the work queue. Note: Extensive testing is strongly recommended before production use, to ensure that desired processing patterns are attained. The following table describes the possible options: Table 26: Work Queue Priority Parameter Options

levels This option allows you to specify the number of priority levels you want the crawler work queue to have. Note: If this value is ever decreased (e.g. from 3 to 1), the on-disk storage for the work queues must be deleted manually to recover the disk space. Default: 1

default This option allows you to specify the default priority level for extracting and inserting URIs from/to the work queue. Default: 1

start_uri_pri This option allows you to specify the priority level for URIs coming from the start_uris/start_uri_files option. Default: 1

pop_scheme This option allows you to specify which method you want the crawler to use when extracting URIs from the work queue. Available values are:
rr - extract URIs from the priority levels in a round-robin fashion.
wrr - extract URIs from the priority levels in a weighted round-robin fashion. The weights are based on their respective share setting per priority level. Basically, URIs are extracted from the queue with the highest share value; when all shares are 0 the shares are reset to their original settings.
pri - extract URIs from the priority levels in a priority fashion by always extracting from the highest priority level if there still are entries available (1 being the highest).
default - same as wrr.
Default: default

put_scheme This option allows you to specify which method you want the crawler to use when inserting URIs into the work queue. Available values are:
default - always insert URIs with the default priority level.
include - insert URIs with the priority level defined by the includes specified for every priority level. If no includes match, the default priority level will be used.
Default: default

For each priority level specified, you can define:
share - this value allows you to specify a share or weight for each queue, to be used when utilizing the wrr (weighted round robin) method of extracting entries from the work queue.
include_domains, include_uris - these values allow you to specify a set of inclusion rules for each priority level, to be used when utilizing the include method of inserting entries into the queue.

Work Queue Priority Parameter Example <section name="workqueue_priority"> <!-- Define a work queue with 2 priority levels --> <attrib name="levels" type="integer"> 2 </attrib> <!-- Default priority level is 2. For this specific setting it means that a URI that doesn't match the specified includes for the queues will be inserted with priority level 2 --> <attrib name="default" type="integer"> 2 </attrib> <!-- Default priority level of start URIs is 1 --> <attrib name="start_uri_pri" type="integer"> 1 </attrib> <!-- Use weighted round robin for extracting from the queue according to the share specified per queue below --> <attrib name="pop_scheme" type="string"> wrr </attrib> <!-- Use include based insertion scheme according to the include rules specified for each queue below --> <attrib name="put_scheme" type="string"> include </attrib> <!-- Settings for the first priority level queue (1) --> <section name="1"> <!-- This queue's share/weight is 10 --> <attrib name="share" type="integer"> 10 </attrib> <!-- These include rules define the URIs that should enter the 1st priority level --> <section name="include_domains"> <attrib name="suffix" type="list-string"> <member> web005.example.net </member> <member> web006.example.net </member> </attrib> </section> </section> <!-- Settings for the second priority level queue (2) --> <section name="2"> <attrib name="share" type="integer"> 10 </attrib> <section name="include_domains"> <attrib name="suffix" type="list-string"> <member> web002.example.net </member> <member> web003.example.net </member> </attrib> </section> </section> </section>

Adaptive Parameters The adaptive section allows you to configure adaptive scheduling options. Note: This section is only applicable if refresh_mode is set to adaptive. Note: Extensive testing is strongly recommended before production use, to ensure that desired pages and sites are properly represented in the index. The following table describes the possible options: Table 27: Adaptive Parameter Options

refresh_count Number of minor cycles within the major cycle.

refresh_quota Ratio of existing URIs re-crawled to new (unseen) URIs, expressed as a percentage. High value = re-crawling old content; low value = prefer fresh content.

coverage_max_pct Limit percentage of site re-crawled within a minor cycle. Ensures small sites do not crawl fully each minor cycle, starving large sites.

coverage_min Minimum number of URIs from a site to be crawled in a minor cycle. Used to guarantee some coverage for small sites.

weights Each URI is scored against a set of rules to determine its crawl rank value. The crawl rank value is used to determine the importance of the particular URI, and hence the frequency at which it is re-crawled (from every minor cycle to only once every major cycle).
Each rule is assigned a weight to determine its contribution towards the total rank value. Higher weights produce a higher rank contribution. A weight of 0 disables a rule altogether. Adaptive Crawling Scoring Rules:
inverse_length: Based on the number of slashes (/) in the URI path. Max score for 1, no score for 10 or more. Default weight: 1.0
inverse_depth: Based on the number of link "hops" to this URI. Max score for none (for example, a start_uri), no score for 10 or more. Default weight: 1.0
is_landing_page: Bonus score if a "landing page", ending in / or index.html. Any page with query parameters gets no score. Default weight: 1.0
is_mime_markup: Bonus score if a "markup" page listed in the uri_search_mime attribute. Preference to more dynamic content (vs. PDF, Word, other static docs). Default weight: 1.0
change_history: Scored on the basis of the last-modified value over time (or an estimate). Default weight: 10.0
sitemap: Score based on the metadata found in sitemaps. The score is calculated by multiplying the value of the changefreq parameter with the priority parameter in a sitemap. Default weight: 10.0

sitemap_weights Sitemap entries may contain a changefreq attribute. This attribute gives a hint on how often a page is changed. The value of this attribute is a string. This string value is mapped to a float value in order for the adaptive scheduler to calculate an adaptive rank. This mapping can be changed by configuring the sitemap_weights section. Note that in addition to the defined values a default attribute is defined. Documents with no changefreq attribute are given the value of the default weight for priority. Sitemap Changefreq Weights:
always: Map the changefreq value 'always' to a numerical value. Default weight: 1.0
hourly: Map the changefreq value 'hourly' to a numerical value. Default weight: 0.64
daily: Map the changefreq value 'daily' to a numerical value. Default weight: 0.32
weekly: Map the changefreq value 'weekly' to a numerical value. Default weight: 0.16
monthly: Map the changefreq value 'monthly' to a numerical value. Default weight: 0.08
yearly: Map the changefreq value 'yearly' to a numerical value. Default weight: 0.04
never: Map the changefreq value 'never' to a numerical value. Default weight: 0.0
default: This value is assigned to all documents that have no changefreq attribute. Default weight: 0.16

HTTP Errors Parameters The http_errors section specifies how various response codes and error conditions are handled for HTTP(S) URIs. The following table describes the possible options: Table 28: HTTP Errors Parameter Options

4xx or 5xx Specify handling for all 40x or 50x HTTP response codes. Valid options for handling individual response codes are:
"KEEP" - keep the document (leave unchanged)
"DELETE[:X]" - delete the document if the error condition occurs for X retries. If no X value is specified, deletion happens immediately.
For both of these options "RETRY[:X]" can be specified, for which the crawler will try to download the document again X times in the same refresh period before giving up.
Note: If different behavior is desired for a specific value within one of these ranges, e.g. for HTTP status 503, it may be given its own handling specification.
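For instance, an override entry for status 503 within the http_errors section might look like the following sketch (the specific retry count is only an illustration, not a recommended value):

<!-- 503 Service Unavailable: keep the document, retry 3 times in this refresh period -->
<attrib name="503" type="string"> KEEP, RETRY:3 </attrib>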
ttl, net or int Specify handling for:
• HTTP connections that time out (ttl)
• network (socket) errors (net)
• internal errors (int)

HTTP Errors Parameter Example <section name="http_errors"> <!-- 408 HTTP status code: Request Timeout --> <attrib name="408" type="string"> KEEP </attrib> <!-- 40x HTTP return codes: delete immediately --> <attrib name="4xx" type="string"> DELETE </attrib> <!-- 50x HTTP return codes: delete after 10 failed fetches, --> <!-- retry 3 times immediately --> <attrib name="5xx" type="string"> DELETE:10, RETRY:3 </attrib> <!-- fetch timeout: delete after 3 failed fetches --> <attrib name="ttl" type="string"> DELETE:3 </attrib> <!-- network error: delete after 3 failed fetches --> <attrib name="net" type="string"> DELETE:3 </attrib> <!-- Internal handling error (header error, etc): --> <!-- Never delete --> <attrib name="int" type="string"> KEEP </attrib> </section>

Logins parameters The Logins section allows you to configure HTML form based authentication. You can specify multiple sections to handle different site logins, but each must have a unique name. The following table describes the options that may be set: Table 29: Logins Parameter Options

preload Specify the full URI of a page to be fetched before attempting login form processing. Some sites require the user to first get a cookie from some page before proceeding with authentication. Often the Start URI for the site is an appropriate choice for preload.

scheme Which scheme the form you are trying to log into is using, e.g. http or https.

site The hostname of the login form page.

form The path to the form you are trying to log into.

Note: The three previous values, scheme + site + form, make up the URI of the login form page.

action The action of the form (GET or POST).

parameters The credentials as a sequence of key, value parameters the form requires for a successful log on. These are typically different from form to form, and must be deduced by looking at the HTML source of the form. In general, if 'autofill' is enabled, only user-visible (via the browser) variables need be specified, e.g. username and password, or equivalent. The crawler will fetch the HTML page (specified in 'html_form') containing the login form and read any "hidden" variables that must be sent when the form is submitted. If a variable and value are specified in the parameters section, this will override any value read from the form by the crawler.

sites A list of sites that should log into this form before being crawled. Note that this is a list of hostnames, not URIs.

ttl Time before you have to log in to the form once again, before continuing the crawl.

html_form The URI of the HTML page containing the login form. Used by the 'autofill' option. If not specified, the crawler will assume the HTML page is specified by the 'form' option.

autofill Whether the crawler should download the HTML page, parse it, identify which form you are trying to log into by matching parameter names, and merge it with any form parameters you may have specified in the 'parameters' option. Default on.

relogin_if_failed Whether the crawler, after a failed login, should attempt to re-login to the web site after 'ttl' seconds. During this time, the web site will be kept active within the crawler, thus occupying one available site resource.
Logins Parameter Example <Login name="mytestlogin"> <!-- Go fetch this URI first, and gather needed cookies --> <attrib name="preload" type="string">http://preload.companyexample.com/</attrib> <!-- The following elements make up the login form URI and what to do --> <attrib name="scheme" type="string"> https </attrib> <attrib name="site" type="string"> login.companyexample.com </attrib> <attrib name="form" type="string"> /path/to/some/form.cgi </attrib> <attrib name="action" type="string">POST</attrib> <!-- Specify any necessary variables/values from the HTML login form --> <section name="parameters"> <attrib name="user" type="string"> username </attrib> <attrib name="password" type="string"> password </attrib> <attrib name="target" type="string"> sometarget </attrib> </section> <!-- Attempt this login before fetching from the following sites --> <attrib name="sites" type="list-string"> <member> site1.companyexample.com </member> <member> site2.companyexample.com </member> </attrib> <!-- Consider login valid for the following lifetime (in seconds) --> <attrib name="ttl" type="integer"> 7200 </attrib> <!-- The html page containing the login form --> <attrib name="html_form" type="string"> http://login.companyexample.com/login.html </attrib> <!-- Let the crawler download, parse, and fill in any missing parameters from the html form --> <attrib name="autofill" type="boolean"> yes </attrib> <!-- Attempt to re-login after 'ttl' seconds if login failed --> <attrib name="relogin_if_failed" type="boolean"> yes </attrib> </Login>

Storage parameters The Storage parameter allows you to specify storage related options. The following table describes the possible options. Note: These values cannot be changed after a collection has been defined. Table 30: Storage Parameter Options

datastore Refer to Datastore Section on page 105 for more information. Default: bstore

store_http_header This option specifies if the received HTTP header should be stored as document metadata. If enabled, the HTTP header will be included when the document is submitted to the ESP document processing pipeline. Default: yes

store_dupes This option allows you to preserve disk space on the crawler nodes by instructing the crawler not to store to disk documents that are detected as duplicates at runtime. Duplicates detected by PostProcess are stored to disk initially, but will be deleted later. Default: no

compress This option specifies if downloaded documents should be compressed before being stored on disk. If enabled, gzip compression will be performed. Default: yes

compress_exclude_mime MIME types of documents not compressed in storage. Note that compressing multimedia documents can waste resources, as these documents are often already compressed. Use this setting to selectively skip compression of documents based on their MIME type, thus saving resources both in the crawler (no unnecessary compression) and in the pipeline (no unnecessary decompression).

remove_docs This option allows you to preserve disk space on the crawler nodes by instructing the crawler to remove docs from disk after they have been processed by the document processor. Default: no Note: Enabling this option will make it impossible to refeed the crawler store at a later time (e.g. to take advantage of changes made to the document processing pipeline) since the crawled documents are no longer available on disk.

clusters This option specifies how many storage clusters to use for the collection.
Default: 8 Note: In general, this value should not be modified unless so directed; refer to Contact Us on page iii.

defrag_threshold A non-zero value specifies the threshold value, in terms of used capacity, before defragmentation is initiated for any given data storage file. When the available capacity drops below this level the file is compacted to reclaim fragmented space caused by previously stored documents. Database files are compacted regardless of fragmentation. Default: 85 The default of 85% means there must be 15% reclaimable space in the data storage file to trigger defragmentation of a particular file. Setting this value to 0 will disable the nightly database/data compression routines. Note: The data storage format flatfile does not become fragmented, and this option does not apply to that format.

uri_dir All URIs extracted from a document by an uberslave process may be stored in a separate file on disk. This option indicates in which directory to place the URI files. The name of a URI file is constructed by concatenating the slave process PID with '.txt'. Default: The default is not to generate these files (empty directory path).

Storage Parameter Example <section name="storage"> <attrib name="uri_dir" type="string"> test/URIS </attrib> <!-- Document store type (flatfile|bstore) --> <attrib name="datastore" type="string"> bstore</attrib> <!-- Compress Documents --> <attrib name="compress" type="boolean"> yes </attrib> <!-- Do not compress docs with the following MIME types --> <attrib name="compress_exclude_mime" type="list-string"> <member> video/* </member> <member> audio/* </member> </attrib> <!-- Store HTTP header information --> <attrib name="store_http_header" type="boolean"> yes </attrib> <!-- Store duplicates in storage --> <attrib name="store_dupes" type="boolean"> no </attrib> <!-- Support removal of documents from disk after processing --> <attrib name="remove_docs" type="boolean"> no </attrib> <!-- Defragment data store files with more than 100-n % fragmentation --> <attrib name="defrag_threshold" type="integer"> 85 </attrib> </section>

Password Parameters Password specification for sites/paths that require Basic Authentication. Note: Changing the passwd value may result in previously accessible content eventually being deleted from the index. Support includes basic, digest and NTLM v1 authentication. Note: AD/Kerberos and NTLM v2 are not supported. Credentials can be keyed on either Realm or URI. A valid URI can be used as the parameter value, in which case it serves as a prefix value, as all links extracted from the URI at its level or deeper will also utilize the authentication settings. It is also possible to specify passwords for Realms. When a 401 Unauthorized is encountered, the crawler attempts to locate a matching realm, and if one exists, the URI will be fetched again with the corresponding user/passwd set. As this requires two HTTP transactions for each document, it is inherently less efficient than specifying a URI prefix. The credentials format is: user:password:realm:scheme, though you can still use the basic format of: user:password. Scheme can be any of the supported authentication schemes (basic, digest, ntlm), or auto, in which case the crawler tries to pick one on its own.
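For a realm-keyed entry, a minimal sketch might look like the following; this assumes the realm string is given as the attribute name in the same way a URI prefix is, and the realm name and credentials shown are hypothetical:

<section name="passwd">
  <!-- Matched when a server returns 401 with realm "Intranet Docs" -->
  <attrib name="Intranet Docs" type="string"> alice:alicesecret:Intranet Docs:digest </attrib>
</section>

The URI prefix form is shown in the following example.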
<section name="passwd"> <!-- username: "bob" password: "bobsecret" realm: "mysite" authentication scheme: "auto" --> <attrib name="http://www.example.net/confidential1/" type="string"> bob:bobsecret:mysite:auto </attrib> <!-- Escaping characters in the password may be necessary --> <!-- username: "bob" password: "bob:secret\" realm: "myotherdomain" authentication scheme: "basic" --> <attrib name="http://www.example.net/confidential2/" type="string"> bob:bob\:secret\\:myotherdomain:basic </attrib> </section> Note: Cookie authentication requires a separate setup. Refer to the Logins Parameters for more information.

PostProcess Parameters The pp section configures PostProcess. The following table describes the possible options: Table 31: PostProcess Parameter Options

dupservers This option specifies which duplicate servers should be used in a multiple node crawl. The crawler will automatically perform load-balancing between multiple duplicate servers. Values should be specified as hostname:port, e.g. dup01:18000 Default: none Note: This setting cannot be modified for a collection, once set.

max_dupes This option specifies the maximum number of duplicates to record along with the original document. Default: 10 Note: This setting has a severe performance impact and values above 3-4 are not recommended for large scale crawls.

stripe This option specifies the PostProcess database stripe size; the number of files to spread the available data across. A value of 1 puts everything in a single file. Default: 1 Note: This setting cannot be modified for a collection, once set.

ds_meta_info This option identifies the meta info that PostProcess should report to document processing. Available types: duplicates, redirects, referrers, intra links, inter links. Default: none

ds_max_ecl This option specifies the max URI equivalence class length, that is, the maximum number of duplicates, redirects or referrers to report to document processing. Default: 10

ds_send_links Send extracted links from the document to FAST ESP for document processing. Default: no

ds_paused This option specifies whether or not PostProcess should pause feeding to FAST ESP. When paused, the feed will be written to stable storage. Note that the value of this setting can be changed via the crawleradmin tool options, --suspendfeed/--resumefeed. Default: no

ecl_override This option specifies a regular expression used to identify URIs that should go into the URI equivalence class, even though ds_max_ecl is reached. Example: .*index\.html$ Default: none

PostProcess Parameter Example <section name="pp"> <attrib name="dupservers" type="list-string"> <member> node5:13000 </member> <member> node6:13000 </member> </attrib> <!-- Limit recorded duplicates for large crawl --> <attrib name="max_dupes" type="integer"> 2 </attrib> <!-- Spread data across several files to scale --> <attrib name="stripe" type="integer"> 8 </attrib> <attrib name="ds_meta_info" type="list-string"> <member> duplicates </member> </attrib> <attrib name="ds_max_ecl" type="integer"> 10 </attrib> <attrib name="ds_send_links" type="boolean"> yes </attrib> <!-- Feeding enabled...can override using crawleradmin --> <attrib name="ds_paused" type="boolean"> no </attrib> <!-- URI Equivalence class override regexp --> <attrib name="ecl_override" type="string"> .*index\.html$ </attrib> </section>

Log Parameters The log section provides crawler logging options.
Use this section to enable or disable various logs. Note: The use of screened and header logs can be very useful during crawl setup and testing, but should generally be disabled for production crawls as they can use a lot of disk space. It is sometimes necessary to enable these when debugging specific issues. The following table describes the possible options Table 32: Log Parameter Options Option fetch Description Document log (collection wide) format. Logs all documents downloaded with time stamp and response code/error values. Values: text or none. Default: text postprocess Postprocess log (collection wide) format. Logs all documents output by postprocess with info. Values: text, xml or none. Default: text header HTTP Header exchanges log (stored per-site). Logs all header exchanges with web servers. Useful for debugging, but should not be enabled for production crawls. Values: text or none. Default: none screened URI allow/deny log (collection wide) format. Log URIs that are screened for various reasons. Useful for debugging. Values: text or none. Default: none scheduler Provides details of adaptive scheduling algorithm processing. Values: text or none. Default: none dsfeed ESP document processing and indexing feed log. Logs all URIs that PostProcess receives callbacks on with information on failure/success. Values: text, none. Default: text site Provide statistics for per-site crawl sessions. Default: text 100 Configuring the Enterprise Crawler Log Parameter Example <section name="log"> <!-- The first two normally enabled --> <attrib name="fetch" type="string"> text </attrib> <attrib name="postprocess" type="string"> text </attrib> <!-- These others enabled for debugging --> <attrib name="header" type="string"> text </attrib> <attrib name="screened" type="string"> text </attrib> <attrib name="site" type="string"> text </attrib> </section> Cache Size Parameters The cachesize section allows configuration of the crawler cache sizes. All cache sizes represent number of entries unless otherwise noted. The following table describes possible cache options: Table 33: Cache Size Parameter Options Option duplicates Description Duplicate checksum cache. Default: automatic screened URIs screened during crawling. Default: automatic smcomm Slave/Master comm. channel. Default: automatic mucomm Master/Ubermaster comm. channel. Default: automatic wqcache Site work queue cache. Default: automatic crosslinks Crosslinks cache (number of links). Default: automatic Note: Defaults for the previous parameters are auto generated based on the max_sites and delay parameters. routetab Routing table cache (in bytes). Default: 1 MB pp PostProcess database cache (in bytes). Default: 1 MB 101 FAST Enterprise Crawler Option Description pp_pending PostProcess pending (in bytes). Default: 128 KB aliases Aliases mapping database cache (in bytes). 
Default: 1 MB
Cache Size Parameter Example
<section name="cachesize"> <!-- Override automatic settings --> <attrib name="duplicates" type="integer"> 128 </attrib> <attrib name="screened" type="integer"> 128 </attrib> <attrib name="smcomm" type="integer"> 128 </attrib> <attrib name="mucomm" type="integer"> 128 </attrib> <attrib name="wqcache" type="integer"> 4096 </attrib> <!-- Increase for large-scale crawl --> <attrib name="crosslinks" type="integer"> 5242880 </attrib> <attrib name="routetab" type="integer"> 5242880 </attrib> <attrib name="pp" type="integer"> 5242880 </attrib> <attrib name="pp_pending" type="integer"> 1048576 </attrib> <attrib name="aliases" type="integer"> 5242880 </attrib> </section>
Link Extraction Parameters
Use the link_extraction section to tell the crawler which links it should follow. The following table lists possible options:
Table 34: Link Extraction Parameter Options
a URIs found in anchor tags. Default: yes
action URIs found in action tags. Example: <form action="http://someaction.com/?submit" method="get"> Default: yes
area URIs found in area tags (related to image maps). Example: <map name="mymap"> <area src="http://link.com"> </map> Default: yes
comment URIs found within comment tags. The crawler extracts links from comments by looking for http://. Example: <!-- this URI is commented away; http://old.link.com/ --> Default: yes
frame URIs found in frame tags. Example: <frame src="http://topframe.com/"> </frame> Default: yes
go URIs found in go tags. Note that go tags are a feature of the WML specification. Example: <go href="http://link.com/"> Default: yes
img URIs found in image tags. Example: <img src="picture.jpg"> Default: no
layer URIs found in layer tags. Example: <layer src="http://www.link.com/"></layer> Default: yes
link URIs found in link tags. Example: <link href="http://link.com/"> Default: yes
meta URIs found in META tags. Example: <meta name="mymetatag" content="http://link.com/"/> Default: yes
meta_refresh URIs found in META refresh tags. Example: <meta name="refresh" content="20;URL=http://link.com/"/> Default: yes
object URIs found in object tags. Example: <object data="picture.png"> Default: yes
script URIs found within script tags. Example: <script> variable = "http://somelink.com/" </script> Default: yes
script_java URIs found within script tags that are JavaScript styled. Example: <script type="javascript"> window.location="http://somelink.com"</script> Default: yes
style URIs found within style tags. Default: yes
embed Typically used to insert links to audio files. Default: yes
card A link type used to define a card in a WML deck.
Default: yes
Link Extraction Parameter Example
<section name="link_extraction"> <attrib name="a" type="boolean"> yes </attrib> <attrib name="action" type="boolean"> yes </attrib> <attrib name="area" type="boolean"> yes </attrib> <attrib name="comment" type="boolean"> no </attrib> <attrib name="frame" type="boolean"> yes </attrib> <attrib name="go" type="boolean"> yes </attrib> <attrib name="img" type="boolean"> no </attrib> <attrib name="layer" type="boolean"> yes </attrib> <attrib name="link" type="boolean"> yes </attrib> <attrib name="meta" type="boolean"> yes </attrib> <attrib name="meta_refresh" type="boolean"> yes </attrib> <attrib name="object" type="boolean"> yes </attrib> <attrib name="script" type="boolean"> yes </attrib> <attrib name="script_java" type="boolean"> yes </attrib> <attrib name="style" type="boolean"> yes </attrib> <attrib name="embed" type="boolean"> yes </attrib> <attrib name="card" type="boolean"> no </attrib> </section>
The ppdup Section
Use the ppdup section to specify the duplicate server settings. The following table lists possible options:
Table 35: Duplicate Server Options
format The duplicate server database format. Available formats are: gigabase, hashlog, diskhashlog.
cachesize The duplicate server database cache size. If the duplicate server database format is a hash type, the cache size specifies the initial size of the hash. Note: Specified in MB.
stripes The duplicate server database stripe size.
compact Specify whether to perform nightly compaction of the duplicate server databases.
Duplicate Server Settings Example
<!-- This option allows you to configure per collection duplicate server database settings. --> <section name="ppdup"> <!-- The format of the duplicate server dbs --> <attrib name="format" type="string"> hashlog </attrib> <!-- The # of stripes of the duplicate server dbs --> <attrib name="stripes" type="integer"> 1 </attrib> <!-- The cache size of the duplicate server dbs - in MB --> <attrib name="cachesize" type="integer"> 128 </attrib> <!-- Whether to run nightly compaction of the duplicate server dbs --> <attrib name="compact" type="boolean"> no </attrib> </section>
Datastore Section
The datastore section specifies which format to use for the document data store. The crawler normally stores a collection's documents in the directory: $FASTSEARCH/data/crawler/store/collectionName/data/. The following table describes possible options:
Table 36: Datastore Parameter Options
bstore For each storage cluster that is crawled, a directory is created using the cluster number. The directory contains: 0-N[.N] - BStore segment files. Documents are stored within numbered files starting with '0' going up ad infinitum. After each compaction, the file is appended with a generation identifier, e.g. 0.1 replaces 0, and 17.253 replaces 17.252. The older generation file is retained for up to 24 hours as a read-only resource for postprocess. 0-N[.N].idx - BStore segment index files. Contains the index of each of the corresponding BStore segment files. master_index - Contains information about all existing BStore segments. The crawler will schedule a defragmentation of the data store to ensure that stale segments are cleaned up on a daily basis.
flatfiles The files are stored using a base64-encoded representation of the filenames.
Storing documents in this manner is more metadata-intensive on the underlying file system, as each retrieved document is stored in a separate physical file, but it allows the crawler to delete old versions of documents when a new version is retrieved from the web server.
Feeding destinations
This table describes the options available for custom document feeding destinations. It is possible to submit documents to a collection with a different name, to multiple collections, or even to another ESP installation. If no destinations are specified the default is to feed into a collection of the same name in the current ESP installation.
Table 37: Feeding destinations
name This parameter specifies a unique name that must be given for the feeding destination you are configuring. The name can later be used in order to specify a destination for refeeds. This field is required.
collection This parameter specifies the ESP collection name to feed documents into. Normally this is the same as the collection name, unless you wish to feed into another collection. Ensure that the collection already exists on the ESP installation designated by destination first. Each feeding destination you specify maps to a single collection, thus to feed the same crawl into multiple collections you need to specify multiple feeding destinations. It is also possible for multiple crawler collections to feed into the same target collection. This field is required.
destination This parameter specifies an ESP installation to feed to. The available ESP destinations are listed in the feeding section of the crawler's global configuration file, normally $FASTSEARCH/etc/CrawlerGlobalDefaults.xml. The XML file contains a list of named destinations, each with a list of content distributors. If no destinations are explicitly listed in the XML file you may specify "default" here, and the crawler will feed into the current ESP installation, that is, the installation specified by $FASTSEARCH/etc/contentdistributor.cfg. This field is required; it may be "default" unless the global XML file has been altered.
paused This option specifies whether or not the crawler should pause document feeding to FAST ESP. When paused, the feed will be written to stable storage on a queue. Note that the value of this setting can be changed via the crawleradmin tool options, --suspendfeed/--resumefeed. Default: no
primary This parameter controls whether this feeding destination is considered a primary or secondary destination. Only the primary destination is allowed to act on callback information from the document feeding chain; secondary feeders are only permitted to log callbacks. Exactly one feeding destination must be specified as primary. This field is required.
Example:
<section name="feeding"> <section name="collA"> <attrib name="collection" type="string"> collA </attrib> <attrib name="destination" type="string"> default </attrib> <attrib name="primary" type="boolean"> yes </attrib> <attrib name="paused" type="boolean"> no </attrib> </section> <section name="collB"> <attrib name="collection" type="string"> collB </attrib> <attrib name="destination" type="string"> default </attrib> <attrib name="primary" type="boolean"> no </attrib> <attrib name="paused" type="boolean"> no </attrib> </section> </section>
RSS
This table describes the parameters for RSS crawling.
Note: Extensive testing is strongly recommended before production use, to ensure that desired processing patterns are attained.
Table 38: RSS Options
start_uris This parameter allows you to specify a list of RSS start URIs for the collection to be configured. RSS documents (feeds) are treated somewhat differently from other documents by the crawler. First, RSS feeds typically contain links to articles and meta data which describes the articles. When the crawler parses these feeds, it will associate the metadata in the feeds with the articles they point to. This meta data will be sent to the processing pipeline together with the articles, and an RSS pipeline stage can be used to make this information searchable. Second, links found in RSS feeds will be tagged with a force flag. Thus, the crawler will crawl these links as soon as allowed (it will obey the collection's delay rate), and they will be crawled regardless of whether they have already been crawled in this crawl cycle. Example: http://www.example.com/rss.xml Default: none (not mandatory)
start_uri_files This parameter allows you to specify a list of RSS start URI files for the collection to be configured. This option is not mandatory. The format of the files is one URI per line. Example: C:\MyDirectory\rss_starturis.txt (Windows) or /home/user/rss_starturis.txt (UNIX). Default: none (not mandatory)
auto_discover This parameter allows you to specify if the crawler should attempt to find new RSS feeds. If this option is not set, only feeds specified in the RSS start URIs and/or the RSS start URI files sections will be treated as feeds. Default: no
follow_links This parameter allows you to specify if the crawler should follow links from HTML documents, which is the normal crawler behavior. If this option is disabled, the crawler will only crawl one hop away from a feed. Disable this option if you only want to crawl feeds and documents referenced by feeds. Default: yes
ignore_rules Use this parameter to specify if the crawler should crawl all documents referenced by feeds, regardless of whether they are valid according to the collection's include/exclude rules. Default: no
index_feed This parameter allows you to specify if the crawler should send the RSS feed documents to the processing pipeline. Regardless of this option, meta data from RSS feeds will be sent to the processing pipeline together with the articles they link to. Default: no
max_link_age This parameter allows you to specify the maximum age (in minutes) for a link in an RSS document. Expired links will be deleted if the 'Delete expired' option is enabled. 0 disables this option. Default: 0 (disabled)
max_link_count This parameter allows you to specify the maximum number of links the crawler will remember for a feed. The list of links found in a feed will be treated in a FIFO manner. When links get pushed out of the list, they will be deleted if the 'Delete expired' option is set. 0 disables this option. Default: 128
del_expired_links This option allows you to specify if the crawler should delete articles when they expire. An article (link) will expire when it is affected by either 'Max articles per feed' or 'Max age for links in feeds'.
Default: no Example: <section name="rss"> <attrib name="start_uris" type="list-string"> <member> http://www.example.com/rss.xml </member> </attrib> <attrib name="auto_discover" type="boolean"> yes </attrib> <attrib name="ignore_rules" type="boolean"> no </attrib> <attrib name="index_feed" type="boolean"> yes </attrib> <attrib name="follow_links" type="boolean"> yes </attrib> <attrib name="max_link_age" type="integer"> 14400 </attrib> <attrib name="max_link_count" type="integer"> 128 </attrib> <attrib name="del_expired_links" type="boolean"> yes </attrib> </section> Metadata Storage In addition to document storage being handled by the datastore section, the crawler maintains a set of data structures on disk to do bookkeeping regarding retrieved content and content not yet retrieved. These are maintained as databases and queues, and collectively referred to as metadata (as opposed to the actual data retrieved). The crawler stores a collection's site and document metadata in the directory: $FASTSEARCH/data/crawler/store/collectionname/db/. The following options are relevant to how the crawler handles metadata. Site Databases For each site from which pages are fetched, an entry is made in a site database, storing details including the IP address, any equivalent (mirror) sites, the number of documents stored for the site. Work Queue Files The work queues used by the uberslave process to store URIs (and related data) waiting to be fetched are stored on-disk in the following location and directory format. 108 Configuring the Enterprise Crawler Location: $FASTSEARCH/data/crawler/queues/slave/collectionname/XX/YY/sitename[:port] In addition to being organized by collection, additional layers of directory structure are introduced to avoid file system limits. Within each collection directory, subdirectories (shown as XX and YY above) are created, using the first 4 hexadecimal digits of the MD5 checksum of a site's name. For example, the site www.example.com has the MD5 checksum 7c:17:67:b3:05:12:b6:00:3f:d3:c2:e6:18:a8:65:22. The created directory path is therefore 7c/17/www.example.com. If a site uses a port number other than the default (80 for HTTP, 443 for HTTPS), it will be included in the sitename directory, and used in calculating the checksum. In case of a restart of the crawler, the work queues are reloaded from disk, and the crawl continues from where it left off. Pending Queues The master (or, in a multi node configuration, ubermaster) process utilizes several on-disk queues used for storing URI and site information while DNS address resolution is being performed, and prior to a site being assigned to an uberslave (or master, in the multi node case) for further processing. These are stored in the following locations, organized by collection: Location: $FASTSEARCH/data/crawler/queues/master/collectionname/ unresolved.uris Queue for URIs waiting for site (hostname) DNS resolution unresolved.sites Queue for sites waiting for DNS resolution (per the configured DNS rate limits) resolved.uris Queue of URIs pending assignment to slave work queue, or to a master (multi node case) Writing a Configuration File Note: This method described should only be used when the collections (and sub collections) have been created in the FAST ESP administrator interface. The “collection-name” must match the name given to the collection when it was created. The “sub collection-name” must match the name given to the sub collection when it was created. 
When adding or modifying a configuration parameter, the configuration file needs only to contain the modified configuration parameters. The crawler configuration files are XML files with the following structure: <?xml version="1.0"?> <CrawlerConfig> <DomainSpecification name="collection-name"> ... collection configuration directives ... </DomainSpecification> </CrawlerConfig> When configuring a sub collection, use the following structure: <?xml version="1.0"?> <CrawlerConfig> <DomainSpecification name="collection-name"> ... collection configuration directives ... <SubDomain name="sub collection-name"> ... sub collection configuration directives 109 FAST Enterprise Crawler ... </SubDomain> </DomainSpecification> </CrawlerConfig> Uploading a Configuration File After a configuration file has been created, it must be uploaded to the crawler. This is done via the crawleradmin tool which is located in the $FASTSEARCH/bin directory of your FAST ESP installation. The following command uploads the configuration to the crawler: crawleradmin -f configuration.xml Changes take place immediately; any errors in the configuration file will be reported. Configuring Global Crawler Options via XML File Many of the crawler configuration options specified at startup can also be specified in the crawler default XML-based configuration file. At startup, the crawler looks for this file (CrawlerGlobalDefaults.xml) at: $FASTSEARCH/etc/CrawlerGlobalDefaults.xml The configuration file can also be specified at startup using the -F option. CrawlerGlobalDefaults.xml options Table 39: CrawlerGlobalDefaults.xml options Option slavenumsites Description Number of sites per uberslave. Use this option to specify the initial number of sites (slave instances) for an uberslave. Default: 1024 <attrib name="slavenumsites" type="integer"> 1024 </attrib> dbtrace Enable database statistics. Use this option to specify whether or not you want to enable detailed statistics from the databases. Default: no <attrib name="dbtrace" type="boolean"> no </attrib> directio Enable direct disk I/O. Use this option to specify whether or not to enable (yes) direct I/O in postprocess and duplicate server. Use only if the operating system supports this functionality. Default: no <attrib name="directio" type="boolean"> no </attrib> 110 Configuring the Enterprise Crawler Option numprocs Description Number of uberslave processes to start. This value will be overridden by the -c command line option. Default: 2 <attrib name="numprocs" type="integer"> 2 </attrib> logfile_ttl Log file lifetime. This option specifies the number of days to keep old log files before deletion. Default: 365 <attrib name="logfile_ttl" type="integer"> 365 </attrib> store_cleanup Time when daily storage cleanup job begins. Format: HH:MM (24-hour clock) Default: 04:00 <attrib name="store_cleanup" type="string"> 04:00 </attrib> ppdup_dbformat Duplicate server database format Valid values: hashlog, diskhashlog or gigabase <attrib name="ppdup_dbformat" type="string"> hashlog </attrib> disk_suspend_threshold Specifies a threshold, in bytes, that when reached will make the crawler suspend all existing collections. Default: 500 MB <attrib name="disk_suspend_threshold" type="real">524288000 </attrib> disk_resume_threshold Specifies a threshold, in bytes, that when reached will make the crawler resume all existing collections, in the event they already have been suspended by the 'disk_suspend_threshold' option. 
Default: 600 MB <attrib name="disk_resume_threshold" type="real">629145600 </attrib>
browser_engines List of browser engines that the crawler will use to process JavaScript and Flash extracted from HTML documents. Default: none <attrib name="browser_engines" type="list-string"> <member> <host>:<port> </member> </attrib>
feeding Various feeding options for postprocess. Valid values:
• priority: ESP content feeder priority. Note that there must be a pipeline configured with the same priority setting. Default: 0
• feeder_threads: Number of content feeder threads to start. Must only be changed when the data/dsqueues directory is empty. Default: 1
• max_cb_timeout: Maximum time to wait for callbacks in postprocess (in seconds) when shutting down. Default: 1800
• max_batch_size: Number of documents in each batch submission. Smaller batches may be sent if not enough docs are available or if the memory size of the batch grows too large. Default: 128
• max_batch_datasize: Maximum size of a batch specified in bytes. Lower this limit if you have trouble with procservers using too much memory. Default: 52428800 (50 MB)
• fs_threshold: Specifies the crawler file system (crawlerfs) getpath threshold in kB. Documents larger than this value will be served using the crawlerfs HTTP server instead of being inserted in the batch itself. Default: 128
• waitfor_callback: FAST ESP 5.0 only; from ESP 5.1 this is configured in $FASTSEARCH/etc/dsfeeder.cfg. Feeding callback to wait for. Possible values are PROCESSED, PERSISTED and LIVE. Recovery of batches that fail will not be available when the PROCESSED callback is chosen. Default: PERSISTED
• destinations: Specifies a set of feeding destinations. Each destination is identified by a symbolic name and a list of associated content distributor locations (host:port format). The content distributors for an ESP installation can be found by looking in $FASTSEARCH/etc/contentdistributor.cfg of that installation. When no feeding destinations are explicitly defined the crawler will default to the current ESP installation, and use the symbolic name "default". Note: To make use of user-specified feeding destinations they must be referenced in the collection configuration.
dns Domain name system (DNS) tuning options for the resolver. This option allows various settings related to the crawler's use of the DNS as a client. In single node installations the master calls DNS to resolve hostnames. In a multiple node installation this job is done by the ubermaster. Valid values:
• min_rate: Minimum number of DNS requests to issue per second. Default: 5
• max_rate: Maximum number of DNS requests to issue per second. Default: 100
• max_retries: Maximum number of retries to issue for a failed DNS lookup. Default: 5
• timeout: DNS request timeout before retrying (in seconds). Default: 30
• min_ttl: Minimum lifetime of resolved names (in seconds). Default: 21600
• db_cachesize: DNS database cache size setting for master; ubermaster will use four times this value. Default: 10485760
near_duplicate_detection Near duplicate detection tuning options. Near duplicate detection is enabled on a per-collection basis and primarily works for Western languages. Valid values: min_token_size: Specifies the minimum number of characters a token must have to be included in the lexicon. Tokens that contain fewer characters than this value are excluded from the lexicon. Range: 0 - 2147483647.
Default: 5 max_token_size: Specifies the maximum character length for a token. Tokens that contain more characters than this value are excluded from the lexicon. Range: 1 - 2147483647. Default: 35 unique_tokens: Specifies the minimum number of unique tokens a lexicon must contain in order to perform advanced duplicate detection. Below this level the checksum is computed on the entire document. Range: 0 2147483647. Default: 10 high_freq_cut: Specifies the percentage of tokens with a high frequency to cut from the lexicon. Range: between 0 and 1. Default: 0.1 low_freq_cut: Specifies the percentage of tokens with a low frequency to cut from the lexicon. Range: between 0 and 1. Default: 0.2 Sample CrawlerGlobalDefaults.xml file <?xml version="1.0"?> <CrawlerConfig> <GlobalConfig> <!-- Crawler global configuration file --> <!-- Maximum number of sites per UberSlave --> <attrib name="slavenumsites" type="integer"> 1024 </attrib> <!-- Enable/disable DB tracing --> <attrib name="dbtrace" type="boolean"> no </attrib> <!-- Enable/disable direct I/O in postprocess and duplicate server --> <attrib name="directio" type="boolean"> no </attrib> <!-- Number of slave processes to start --> <attrib name="numprocs" type="integer"> 2 </attrib> <!-- Number of days to keep log files --> <attrib name="logfile_ttl" type="integer"> 365 </attrib> <!-- Time of the daily storage cleanup, HH:MM (24-hour clock) --> <attrib name="store_cleanup" type="string"> 04:00 </attrib> <!-- Duplicate Server DB format (hashlog, diskhashlog or gigabase) --> <attrib name="ppdup_dbformat" type="string"> hashlog </attrib> <!-- Specifies a threshold, in bytes, that when reached will make --> <!-- the crawler suspend all existing collections. --> <attrib name="disk_suspend_threshold" type="real"> 524288000 </attrib> 113 FAST Enterprise Crawler <!-- Specifies a threshold, in bytes, that when reached will make the --> <!-- crawler resume all existing collections, in the event they --> <!-- already have been suspended by the 'disk_suspend_threshold' option. --> <attrib name="disk_resume_threshold" type="real"> 629145600 </attrib> <!-- List of browser engines--> <attrib name="browser_engines" type="list-string"> <member> mymachine.fastsearch.com:14195 </member> </attrib> <!-- Various feeding options for postprocess --> <section name="feeding"> <!-- Feeder priority --> <attrib name="priority" type="integer"> 0 </attrib> <!-- Number of content feeder threads to start. Must only --> <!-- be changed when the data/dsqueues directory is empty --> <attrib name="feeder_threads" type="integer"> 1 </attrib> <!-- Maximum time to wait for callbacks in PP (in seconds) --> <!-- when shutting down --> <attrib name="max_cb_timeout" type="integer"> 1800 </attrib> <!-- The number of documents in each batch submission. Smaller --> <!-- batches may be sent if not enough docs are available or --> <!-- if the memory size of the batch grows too large --> <attrib name="max_batch_size" type="integer"> 128 </attrib> <!-- The maximum number of bytes in each batch submission. --> <!-- Default 50MB --> <attrib name="max_batch_datasize" type="integer"> 52428800 </attrib> <!-- Specifies the crawlerfs getpath threshold in kB. Documents --> <!-- larger than this value will be served using the crawlerfs --> <!-- HTTP server instead of being inserted in the batch itself --> <attrib name="fs_threshold" type="integer"> 128 </attrib> <!-- Feeding callback to wait for (ESP 5.0 only). Can be one of --> <!-- PROCESSED, PERSISTED and LIVE. 
Please note that recovery --> <!-- of batches that fail will not be available when the --> <!-- PROCESSED callback is chosen --> <attrib name="waitfor_callback" type="string"> PERSISTED </attrib> <!-- Content feeding destinations. Collections will by default <!-- feed into a destination by the name "default", and this <!-- destination should always be available. Additional <!-- destinations may be added and referenced by collections. <section name="destinations"> --> --> --> --> <!-- Default destination is current ESP install --> <section name="default"> <!-- Empty list, use $FASTSEARCH/etc/contentdistributor.cfg --> <attrib name="contentdistributors" type="list-string"> </attrib> </section> <!-- Sample alternate destination --> <section name="example"> <attrib name="contentdistributors" type="list-string"> <member> hostname1:port1 </member> <member> hostname2:port2 </member> </attrib> </section> </section> </section> <!-- Various DNS tuning options for the resolver --> <section name="dns"> <!-- Minimum/Lower number of DNS requests to issue per second --> <attrib name="min_rate" type="integer"> 5 </attrib> 114 Configuring the Enterprise Crawler <!-- Maximum/Upper number of DNS requests to issue per second --> <attrib name="max_rate" type="integer"> 100 </attrib> <!-- Maximum number of DNS retries to issue for a failed lookup --> <attrib name="max_retries" type="integer"> 5 </attrib> <!-- DNS request timeout before retrying (in seconds) --> <attrib name="timeout" type="integer"> 30 </attrib> <!-- Minimum lifetime of resolved names (in seconds) --> <attrib name="min_ttl" type="integer"> 21600 </attrib> <!-- DNS DB cache size (in bytes) for Master; an UberMaster --> <!-- will use four times this value --> <attrib name="db_cachesize" type="integer"> 10485760 </attrib> </section> <!-- Various options for tuning the Near Duplicate Detection --> <!-- feature, which must be enabled on a per-Collection basis --> <section name='near_duplicate_detection'> <!-- Minimum token size for lexicon --> <attrib name="min_token_size" type="integer"> 5 </attrib> <!-- Maximum token size for lexicon --> <attrib name="max_token_size" type="integer"> 35 </attrib> <!-- The minimum number of unique tokens required to perform <!-- advanced duplicate detection --> <attrib name="unique_tokens" type="integer"> 10 </attrib> --> <!-- High frequency cut-off for lexicon --> <attrib name="high_freq_cut" type="real"> 0.1 </attrib> <!-- Low frequency cut-off for lexicon --> <attrib name="low_freq_cut" type="real"> 0.2 </attrib> </section> </GlobalConfig> </CrawlerConfig> Using Options This section provides information on how to set up various crawler configuration options. Setting Up Crawler Cookie Authentication This section describes how to set up the Enterprise Crawler to do forms based authentication, which is sometimes referred to as cookie authentication. Login page To configure the crawler for forms based authentication, it is first necessary to understand the process of a user logging in to the site using a browser. A common mechanism is that a request for a specific (or "target") URI causes the browser to instead be directed to a page containing a login form, into which username and password values must be entered. After entering valid values, the data is submitted to the web server using an HTTP POST request, and once authenticated the browser is redirected back to the original target page. 1. Open the web browser. 2. Point the browser to the page where you want the crawler to log in. 
The following shows a sample login page:
HTML Login Form
The following shows a sample HTML source view of the login form (with excess HTML source removed): 1: <form method="POST" name="login" action="/path/to/form.cgi"> 2: <input type="text" name="username" size="20"> 3: <input type="password" name="password" size="20"> 4: <input type="hidden" name="redirURI" value="/"> 5: <input type="submit" value="Login" name="show"> 6: <input type="reset" value="Reset" name="B2"> 7: </form>
The information shown here can be used to configure the crawler to log in to this site successfully.
HTML Login Form Descriptions
This example assumes the login page is found by going to http://mysecuresite.example.com. To browse the site, log in with the Full name demo and Password demon.
Line 1: The method of the form is "POST", and the action is "/path/to/form.cgi". The form variables are posted to that URI.
Line 2: The form needs a parameter named "username". (This is the login page entry named Full name). Note that "username" is not a fixed parameter name; both the username and the password parameters can have any name. The names are specific to each form, even though most forms use names that suggest the purpose of the parameter.
Line 3: The form needs a parameter named "password". (This is the login page entry named Password).
Line 4: The form needs a parameter named "redirURI" set. Note that this parameter is hidden, and thus not shown when viewing the page in the browser. In general, this type of hidden parameter need not be specified in the crawler's configuration, as the crawler will read the form itself and determine the names and values of hidden variables.
Line 5: This line describes the Login button on the login page. There are no variables to extract from here since the button is of type submit, which means that the browser should submit the form when the button is pressed.
Line 6: This line describes the Reset button on the login page. There are no variables to extract from here since the button is of type reset, which means that the browser should clear all input fields when the button is pushed.
Crawler Login Form
The following shows a sample crawler login: 1: <Login name="mytestlogin"> 2: <attrib name="preload" type="string">http://site.com/</attrib> 3: <attrib name="scheme" type="string"> https </attrib> 4: <attrib name="site" type="string"> mysecuresite.example.com </attrib> 5: <attrib name="form" type="string"> /path/to/form.cgi </attrib> 6: <attrib name="action" type="string">POST</attrib> 7: <section name="parameters"> 8: <attrib name="user" type="string"> username </attrib> 9: <attrib name="password" type="string"> password </attrib> 10: </section> 11: <attrib name="sites" type="list-string"> 12: <member> site1.example.com </member> 13: <member> site2.example.com </member> 14: </attrib> 15: <attrib name="ttl" type="integer"> 7200 </attrib> 16: </Login>
Crawler Login Form Descriptions
Configure the crawler login specification by filling in the necessary values for the crawler configuration.
Line 1: The login name must be unique (all login specifications must have different names). This sample uses the name "mytestlogin".
Line 2: The preload step is not needed for this form, and is an optional parameter. If a target URI is used in the browser login (i.e.
in order to set initial cookie values), this URI should be used as the value of the preload attribute, to force the crawler to fetch this page before attempting login.
Note: Lines 3, 4, and 5 (scheme+site+form) together make up the URI of the login form page, e.g. http://mysecuresite.example.com/path/to/form.cgi
Line 3: The "scheme" of the page this URI was found on was "http". Note that some forms may be found on HTTP sites, but the URI in the form action may be absolute and point to an HTTPS site instead. For this example the form action URI was relative, so it will have the same scheme as the form URI. The "scheme" field is optional; if not set, "http" is assumed.
Line 4: The site (or hostname) of the web server on which the form URI resides. In this sample, the site is "mysecuresite.example.com".
Line 5: The actual form we are logging into is the form specified in the form action described in Line 1 of the HTML login form. In this sample action="/path/to/form.cgi".
Line 6: The method of the form was found to be "POST".
Lines 7+: Use the "parameters" section to describe the HTML login form. For this sample we need username and password. These credentials are a sequence of key/value parameters that the form requires for a successful logon; they differ from form to form, and must be deduced by looking at the HTML source of the form. In general, only user-visible (via the browser) variables need be specified, e.g. username and password, or equivalent. The crawler will fetch the login form and read any "hidden" variables that must be sent when the form is submitted. If a variable and value are specified in the parameters section, this will override any value read from the form by the crawler. <section name="parameters"> <attrib name="username" type="string"> demo </attrib> <attrib name="password" type="string"> demon </attrib> </section>
Lines 11+: Use the "sites" section to identify every site that needs to log in to the login form before starting to crawl. This sample lists two sites, site1.example.com and site2.example.com. When the crawler begins to crawl either of these sites, it will log in with the specified options before fetching pages. <attrib name="sites" type="list-string"> <member> site1.example.com </member> <member> site2.example.com </member> </attrib>
Line 15: The time to live (ttl) variable is optional, and the sample login page does not produce any time-limited cookies, so it is not covered in this description. Some forms may set expire times on the cookies they return, and require credentials to be verified after a period of time. For such forms you may specify a ttl value, specifying the number of seconds until the crawler logs in again.
Confirming Successful Login
The crawler will attempt login for each of the sites listed, and can generally be considered to have done so successfully if it proceeds to crawl the site's Start URI and other pages linked to from it. The fetch log would show this successful pattern, as in the following example.
2007-07-19-22:42:36 200 NEW http://www.example.com/index.php (Reading login form)[TestLogin]
2007-07-19-22:42:36 200 NEW http://www.example.com/login.php (Submitting login form)[TestLogin]
2007-07-19-22:42:39 200 NEW http://www.example.com/
2007-07-19-22:42:42 200 NEW http://www.example.com/faq.php
The site log should also show the status of the authentication attempt.
2007-07-19-22:42:35 STARTCRAWL N/A www.example.com
2007-07-19-22:42:35 LOGIN GET www.example.com Performing Authentication
2007-07-19-22:42:36 LOGIN POST www.example.com Performing Authentication
2007-07-19-22:42:36 LOGGEDIN N/A www.example.com Through
A failure to log in will be indicated by the site not being crawled extensively, as shown in the fetch log. More detailed information would be written to the crawler log file, especially in DEBUG mode. You can contact FAST Support for further troubleshooting details.
Implementing a Crawler Document Plugin Module
This section describes how to create a Python plugin module in order to provide an additional means of control over the internal processing of fetched documents after they have been downloaded and initial processing has completed. The scope of work performed by the plugin can vary widely, ranging from a read-only analyzer to very complex processing of each document, and can include the rejection of documents from the crawl.
Overview
To implement the plugin, as a minimum you need a Python class that defines a process() call that takes one argument, a document object provided by the crawler. An optional process_redirect() call may also be specified to evaluate redirections received in the course of following links. A basic implementation of a plugin is as follows:
class mycrawlerplugin:
    def process(self, doc):
        # XXX: Activity1.
        pass
    def process_redirect(self, doc):
        # XXX: Activity2.
        pass
The document object is an internal crawler data structure, and has a fixed set of attributes that you can utilize and modify in your processing call. Note that only a subset of the attributes can be modified; any changes will have an effect on subsequent crawler behavior. The process() call is invoked for each document that is processed for links within the crawler, that is, those whose MIME type matches the MIME types to search for links option. The process_redirect() call is invoked whenever the crawler encounters a redirect response from a server, that is, whenever the server returns ordinary redirect response codes (HTTP response codes 301 or 302) or when an HTML META "refresh" is encountered and is evaluated as a redirect according to the configuration settings.
Configuring the Crawler
Configure the crawler to use your plugin with the Document evaluator plugin option. Note that only one plugin can be active at a time per collection. The format for the option is: tests.plugins.plugins.helloworld The format specifies the Python module path, ending with your class name. The crawler splits on the last '.' separator and converts this to the Python equivalent "from <module> import <class>". Note: Whenever you create a Python module in a separate directory, you need to have an empty __init__.py file to be able to import it. This is a Python requirement, and failure to do so will result in error messages in the crawler.log file. The module must be available relative to the PYTHONPATH environment variable. For example, the file structure on disk of this sample plugin configuration is tests/plugins/plugins.py, which contains the class helloworld(). If used within an existing FAST ESP installation this would be relative to: ${FASTSEARCH}/lib/python2.3/ Note that only documents that are searched for links, that is, those matching the MIME types of the uri_search_mime option, are subject to processing by the defined document plugin.
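As a sketch of the module layout described above (the directory and class names are taken from the tests.plugins.plugins.helloworld example; the commands themselves are only illustrative):
cd $FASTSEARCH/lib/python2.3
mkdir -p tests/plugins
touch tests/__init__.py tests/plugins/__init__.py   # empty files, required so Python can import the packages
# place the plugin code, containing class helloworld, in tests/plugins/plugins.py;
# the collection option then references it as tests.plugins.plugins.helloworld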
Modifying Document Object Options Each URI downloaded by the crawler is processed internally to determine the values of various attributes, which are made available to the plugin as a "document" object. The following tables describe the "document" options for the process() and process_redirect() calls. Table 40: process () Options Option store_document [integer] Description This option specifies whether or not to store the current document. Valid values: 0 (no), 1 (yes) Documents that are not stored (option set to 0) will be logged in the fetch log as: 2006-04-19-14:53:42 200 IGNORED <URI> Excluded: plugin_document focus_crawl [integer] This option specifies whether or not the current document is out of focus. Valid values: 0 (no), 1 (yes) Example: To have an effect on the focus of the crawl, a focus section with a depth attribute must be defined in the collection configuration. <section name="focused"> <attrib name="depth" type="integer"> 2 </attrib> </section> links [list] This option contains all the URIs the crawler link parser was able to pull out of this document. The list consists of tuples for each link. A tuple contains either three or five objects. A five-tuple entry contains: uritype - type of URI as defined by pydocinfo (eg. pydocinfo.URI_A) 119 FAST Enterprise Crawler Option Description uriflag - attribute flag for URI. uri - the original URI uricomp - the parsed version of the URI as output by pyuriparse.uriparse(<uri>) metadata - dictionary containing the meta data that should be associated/tagged with this URI (optional) A three-tuple entry contains: uritype - type of URI as defined by pydocinfo (eg. pydocinfo.URI_A) uri - the original URI uricomp - the parsed version of the URI as output by pyuriparse.uriparse(<uri>) Note: Be sure to keep the same format. If you do not want any links to be followed from the current document, set this attribute to an empty list. cookies [list] Use this option to add any additional cookies your module wants to set to the crawler cookie store. The cookies document attribute is always an empty list from the crawler. Format Example (valid HTTP Set-Cookie header): Set-Cookie: <cookie> data [string] csum [string] This option contains the data of the current document. This option contains the checksum of the current document. Valid values: string of length 16 bytes referrer_data [dictionary] The referrer_data attribute contains the meta data which the parent document has appended to this URI. For instance, with RSS feeds the meta data from the RSS feed is forwarded with each URI found in the feed. Note that the meta data is forwarded automatically if the URI is a redirect. extra_data [dictionary] destination [list(2)] This attribute can be used to store additional meta data with this document, which will also be sent to the processing pipeline. The docproc example below shows how to extract this data in the pipeline. This attribute can be used to set the feeding destination of the current document. The list should contain 2 items [<destination>:<collection>]. <destination> refers to the pre-defined destination targets in your GlobalCrawlerConfig.xml file. <collection> refers to an existing collection on the target system. errmsg [string] 120 This attribute can be used to set a short description on why the current document was excluded due to being out of focus (focus_crawl = 1) or not being stored (store_document = 0) that will be output to the crawler fetch log. 
The errmsg will be prefixed by "document_plugin:" and appended to the document's fetch log entry.
Table 41: process_redirect () Options
links [list] This option is the same as the links [list] option in the process() call, but with the restriction that it contains only a single tuple entry. This tuple contains the URI that the redirect refers to. If this tuple is modified, then the crawler will use the updated location as target for the redirect.
store_document [integer] This option controls whether or not the redirect URI set in the links option should be followed. Valid values: 0 (no), 1 (yes) Documents that are not stored (option set to 0) will be logged in the fetch log as: 2006-04-19-14:53:42 200 IGNORED <URI> Excluded: plugin_document Note: When store_document is set to 0, no further processing of the redirect will take place, and any modifications to the links or cookies attributes are ignored.
cookies [list] Use this option to add any additional cookies your module wants to set to the crawler cookie store. The cookies document attribute is always an empty list from the crawler. Format Example (valid HTTP Set-Cookie header): Set-Cookie: <cookie>
errmsg [string] This attribute can be used to set a short description of why the current document was excluded due to not being stored (store_document = 0) that will be output to the crawler fetch log. The errmsg will be prefixed by "document_plugin:" and appended to the document's fetch log entry.
Static Document Object Options
The following document object attributes are included in the plugin, but should not be changed. They are available in both process() and process_redirect().
Table 42: Static process () and process_redirect() Attributes
site [string] This attribute contains the site/hostname of the current document.
ip [string] This attribute contains the IP of the site/hostname of the current document.
uri [string] This attribute contains the URI of the current document.
header [string] This attribute contains the HTTP headers of the current document.
referrer [string] This attribute contains the referrer of the current document. Note: An empty referrer means the current document was a start URI.
collection [string] This attribute contains the name of the collection the current document belongs to.
mimetype [string] process() only. This attribute contains the MIME type of the current document.
encoding [string] This attribute contains the auto-detected character encoding of the current document.
redirtrail [list] process_redirect() only. This attribute is a list that contains all the URIs of preceding redirects that were performed prior to the current one. Each URI is a two-tuple that contains: uri - referring redirect URI; flags - flag internal to the crawler.
is_rss [boolean] This attribute indicates whether the crawler has identified the document as an RSS feed.
is_sitemap [boolean] This attribute indicates whether the crawler has identified the document as a sitemap or sitemap index.
Hello world
Processing class that prints 'hello world' for every document that is put through.
class helloworld:
    def __init__(self):
        pass
    def process(self, doc):
        print "hello world"
Focus crawl
Processing class that focuses the crawl based on whether or not the content contains 'fast'. Note that this requires the global focus depth to be set in the configuration.
import re
class focusonfast:
    def __init__(self):
        # Regexp matching the string 'fast'
        self.re = re.compile(".*?fast", re.I)
    def process(self, doc):
        if not self.re.match(doc.data):
            # Change the focus crawl option of this document
            doc.focus_crawl = 1
Lowercase all URIs
Processing class that lowercases the path of every URI (for Windows web servers, which are case-insensitive). Note that this requires that all URIs input to the crawler (for example start URIs/crawleradmin -u) are also in lowercased form.
import pyuriparse   # URI parsing module provided by the crawler environment
class lowercase:
    def process(self, doc):
        newlinks = []
        # Handle no links
        if not doc.links:
            return
        for uritype, uri, uricomp in doc.links:
            # Parse uri into its 7-part components
            # lowercase path, params, query and fragment part of URI
            newuri = pyuriparse.uriunparse([
                uricomp[pyuriparse.URI_SCHEME],
                uricomp[pyuriparse.URI_NETLOC],
                uricomp[pyuriparse.URI_PATH].lower(),
                uricomp[pyuriparse.URI_PARAMS].lower(),
                uricomp[pyuriparse.URI_QUERY].lower(),
                uricomp[pyuriparse.URI_FRAGMENT].lower(),
                uricomp[pyuriparse.URI_USERPASS]])
            newuricomp = pyuriparse.uriparse(newuri)
            print "Before:", uri
            print "After:", newuri
            newlinks.append((uritype, newuri, newuricomp))
        # Change the links associated with this document
        doc.links = newlinks
Add text
Processing class that inserts 'good' at the beginning and end of the document.
class prefixsandpostfix:
    def process(self, doc):
        # Modify content of this document
        doc.data = 'good' + doc.data + 'good'
Modify Checksum
Processing class that makes every document a duplicate by forcing the same checksum.
class duplicate:
    def process(self, doc):
        # Modify checksum of all docs to 'a' x 16
        doc.csum = 'a' * 16
Add cookie
Processing class that adds a cookie to the crawler cookie store for every document.
import pyuriparse   # URI parsing module provided by the crawler environment
class cookieextenter:
    def process(self, doc):
        # Set a cookie for the hostname of the current URI
        uricomp = pyuriparse.uriparse(doc.uri)
        domain = uricomp[pyuriparse.URI_NETLOC].split(".", 1)
        doc.cookies = ['Set-Cookie: BogusCookie=f00bar; path=/; domain=.%s' % domain[1]]
Exclude links
Processing class that parses the document for links and anchor text, and based on the anchor text of the links excludes bad links associated with Apache directory listings, and so forth.
import htmllib, formatter
class directorylistingdetector:
    class myparser(htmllib.HTMLParser):
        def __init__(self, formatter):
            self.links = {}
            self.currenthref = None
            htmllib.HTMLParser.__init__(self, formatter)
        def anchor_bgn(self, href, name, type):
            if not href in self.links:
                self.links[href] = ''
            self.currenthref = href
            self.save_bgn()
        def anchor_end(self):
            if self.currenthref is not None:
                self.links[self.currenthref] = self.save_end()
                self.currenthref = None
        def reset(self):
            self.links = {}
            htmllib.HTMLParser.reset(self)
    def __init__(self):
        self.parser = self.myparser(formatter.NullFormatter())
        self.ignorelist = ("parent directory", "name", "last modified", "size", "description", "../")
    def process(self, doc):
        self.parser.reset()
        self.parser.feed(doc.data)
        existinglinks = map(lambda x: x[1], doc.links)
        for link in self.parser.links:
            normlink = self.normalize_link(doc.uri, link)
            if normlink not in existinglinks:
                # XXX: Crawler didn't find this link, ignore it
                print "Ignoring %s, mismatch with crawler" % link
                continue
            idx = existinglinks.index(normlink)
            if self.parser.links[link].lower() in self.ignorelist:
                # The link should not be followed, ignore
                print "Ignoring %s, anchor text=%s" % (link, self.parser.links[link])
                doc.links.pop(idx)
                existinglinks.pop(idx)
                continue
    def normalize_link(self, baseuri, uri):
        return pyuriparse.urijoin(baseuri, uri)
Extra data
Processing class that adds text to the extra_data parameter for each document.
class addMeta:
    def process(self, doc):
        doc.extra_data = "Extra data"
Docproc stage for extra_data
Docproc example that extracts the extra_data parameter set by a crawler plugin and adds it to the document attribute "generic3".
class CrawlerPluginProcessor(Processor.Processor):
    def Process(self, docid, document):
        extra_data = document.GetValue('extra_data', None)
        if extra_data:
            plugin_data = extra_data.get("docplugin", None)
            if plugin_data is not None:
                document.Set("generic3", plugin_data)
        else:
            return ProcessorStatus.OK_NoChange
        return ProcessorStatus.OK
Configuring Near Duplicate Detection
The default duplicate detection scheme in the crawler strips format tags from each new document and then creates a checksum based on the remaining content. If another document exists with the same checksum, that document is identified as a duplicate. This approach may sometimes be too rigid. There are many documents that are not exactly identical, but are perceived to contain the same content by the user. For example, two documents might have the same body of text but be marked with different timestamps. Or a document could have been copied but be missing some characters due to a copy-and-paste mistake. Since these documents are not exactly identical they are not registered as duplicates by the crawler. A near duplicate detection algorithm addresses this issue. This section describes how the near duplicate detection scheme works and how it can be used in the crawler.
Overview
Once the crawler retrieves a new document, it is parsed into a token stream and its markup code and punctuation are removed. Individual words (tokens) are separated by splitting the remaining content on whitespace and punctuation (a rudimentary tokenizer). If a new token is shorter or longer than some predefined values, then the token is discarded. Otherwise the token is lowercased and added to a lexicon, or collection of words. Note: Since CJK languages do not separate tokens by space, the text appears as a set of large continuous tokens.
Because languages such as Chinese, Japanese, Korean (CJK) and Thai cannot be split into tokens without a more complex algorithm, most of the detected tokens in such documents are discarded. When parsing is done, the constructed lexicon is trimmed. The most and least frequent tokens are removed from the lexicon. The goal of this process is to retain only the significant tokens in the lexicon. By removing the most frequent tokens the algorithm tries to get rid of the most common words in a language, for example 'the', 'for', etc. By removing infrequent tokens it tries to get rid of timestamps, tokens with spelling mistakes, and so forth. If the lexicon contains enough tokens then a digest string is constructed by traversing the original document for tokens that are in the lexicon. If a token is in the document and in the lexicon the token is added to the digest string. Once the entire document has been traversed, the digest is used to generate a signature checksum. However, if the trimmed lexicon does not have enough tokens, a checksum will be constructed from the entire document without format tags - just as in the existing duplicate detection scheme. As in the default scheme, documents with the same checksum are defined as duplicates.
Configuring the Crawler
The crawler is able to modify the behavior of its duplicate detection with the Near duplicate detection option. The variable can also be set in the collection specification: <attrib name="near_duplicate_detection" type="boolean"> yes </attrib>
Next, there are global options located in the $FASTSEARCH/etc/CrawlerGlobalDefaults.xml file that can be set. Changing these parameters will result in a different digest and consequently generate a different signature for the document. Refer to CrawlerGlobalDefaults.xml options on page 110 for more information. It is not recommended that you modify these parameters after the crawl has started.
<attrib name="max_token_size" type="integer"> 35 </attrib>
<attrib name="min_token_size" type="integer"> 5 </attrib>
<attrib name="high_freq_cut" type="real"> 0.1 </attrib>
<attrib name="low_freq_cut" type="real"> 0.2 </attrib>
<attrib name="unique_tokens" type="integer"> 10 </attrib>
Near Duplicate Detection Example
Original Document text: The current version of the Enterprise Crawler (EC) has a duplicate detection algorithm that strips format tags from each new document, and then creates a checksum based on the remaining content. If another document exists with the same checksum, that document is identified as a duplicate.
CrawlerGlobalDefaults configuration:
<attrib name="max_token_size" type="integer"> 35 </attrib>
<attrib name="min_token_size" type="integer"> 4 </attrib>
<attrib name="high_freq_cut" type="real"> 0.1 </attrib>
<attrib name="low_freq_cut" type="real"> 0.2 </attrib>
<attrib name="unique_tokens" type="integer"> 10 </attrib>
The previous configuration and document text yield the following adjustments to the full lexicon of 25 tokens. Note that the cutoff percentages as computed against the full token list are rounded down to the nearest whole number, and that terms of equal frequency are ordered alphabetically before trimming. As can be seen in the following, the low frequency tokens trimmed are those from the end of the alphabet.
• Removed due to high_freq_cut: the top 2 words (10% of 25 words, rounded down, is 2), in reverse alphabetical order (document, that):

Token        Frequency
document     3
that         2

• Removed due to low_freq_cut: the bottom 4 words (the 20% low frequency cut), in reverse alphabetical order (with, version, then, tags):

Token        Frequency
tags         1
then         1
version      1
with         1

The final trimmed lexicon meets the threshold limit of 10 tokens, and contains the following terms:

Token        Frequency
checksum     2
duplicate    2
algorithm    1
another      1
based        1
content      1
crawler      1
creates      1
current      1
detection    1
each         1
enterprise   1
exists       1
format       1
from         1
identified   1
remaining    1
same         1
strips       1

Based on this, the digest string, ordered according to the word order of the original document, reads:

currententerprisecrawlerduplicatedetectionalgorithmstripsformatfromeachcreateschecksumbasedremainingcontentanotherexistssamechecksumidentifiedduplicate

The checksum is then computed on this digest string, and is associated with the original fetched document.

Configuring SSL Certificates

In most cases no special configuration is necessary for the crawler to fetch from SSL protected sites (https). In some cases it is necessary to enable Cookie support in the crawler.

If a full SSL certificate chain must be presented to the web server, follow the steps below to prepare the certificate files and set up a certificate chain supporting a client certificate in the crawler. This is only required when using client certificates, and only when the client certificate itself cannot be verified directly by the server unless the complete certificate chain up to the trusted CA is attached.

1. Copy all certificates (including intermediate certificates) into the PEM certificate file specified for the crawler, with the other certificates at the beginning of the file and the root CA certificate last (no key is copied into the file).
2. Encode the certificate file in PKCS#7 format using the command: openssl crl2pkcs7 -nocrl -certfile file_with_certs -out combined.pem
   Note: The key point is the use of the PKCS#7 format for the certificate file specified to the crawler.
3. Specify the combined.pem file in the crawler certificate file configuration option.

Configuring a Multiple Node Crawler

The distributed crawler consists of one ubermaster process, one or more duplicate servers and one or more subordinate crawler (master) processes. The ubermaster controls all work going on in the crawler, and presents itself as a single data source in the FAST ESP administrator interface.

Before you begin you should decide:
• Which nodes should run the ubermaster, duplicate server(s) and master(s) processes.
• Whether you are removing the existing crawler, or setting up the new crawler so that it does not interfere with the existing crawler.

Go to Removing the Existing Crawler on page 128 if you are removing the existing crawler; go to Setting up a New Crawler with Existing Crawler on page 128 if you are setting up a new crawler while keeping the existing crawler.

Removing the Existing Crawler

If you are removing the existing crawler and replacing it with a new crawler configuration, complete this procedure. To remove the existing crawler:

1. On all nodes that run crawler processes (assuming the processes are named crawler), run the command: $FASTSEARCH/bin/nctrl stop crawler
2. On the node running the ubermaster process, run the command: $FASTSEARCH/bin/nctrl stop ubermaster
3.
Stop the nctrl process with the command: $FASTSEARCH/bin/nctrl stop nctrl. On Windows it is neccessary to stop the FAST ESP Service instead. 4. Remove the crawler process from the startorder list in $FASTSEARCH/etc/NodeConf.xml. 5. Remove the crawler process from $FASTSEARCH/etc/NodeState.xml. 6. Start the nctrl process by running the command: $FASTSEARCH/bin/nctrl start. On Windows this can be done by starting the FAST ESP Service instead. 7. On the node running the ubermaster process, run the command: $FASTSEARCH/bin/nctrl start ubermaster. 8. On all nodes that run crawler processes (assuming the processes are named crawler), run the command: $FASTSEARCH/bin/nctrl start crawler. Setting up a New Crawler with Existing Crawler If you are setting up the new crawler as a separate data source so that it does not interfere with the existing crawler, complete this procedure. 1. Modify the $FASTSEARCH/etc/NodeConf.xml files on the different nodes. Existing crawler entries in the file on your FAST ESP installation file can be used as templates. 2. When several crawler components are run on the same node, be they multiple instances of single node crawlers or several components of a multiple node crawler, always make sure that the following parameters do not overlap: -P (the port number used to communicate with the process) -d (the data directory designated to the process) -L (the log directory designated to that process) 128 Configuring the Enterprise Crawler Port numbers should be sufficiently far apart to avoid interference; incrementing by 100 per process should be sufficient. Always inspect the existing entries in the NodeConf.xml file to ensure the port numbers do not overlap with those allocated to other processes. 3. Add an ubermaster. An example of an ubermaster process entry is as follows: <process name="ubermaster" description="Master Crawler"> <start> <executable>crawler</executable> <parameters>-P $PORT -U -o -d $FASTSEARCH/data/crawler_um -L $FASTSEARCH/var/log/crawler_um</parameters> <port base="1000" increment="1000" count="1"/> </start> <outfile>ubermaster.scrap</outfile> <limits> <minimum_disk>1000</minimum_disk> </limits> </process> Note the -U parameter, then add <proc>ubermaster</proc> to the global startorder list near the top of the $FASTSEARCH/etc/NodeConf.xml file. 4. Add a subordinate master. An example of a subordinate master entry is as follows: <process name="master" description="Subordinate Crawler"> <start> <executable>crawler</executable> <parameters>-P $PORT -S <ubermaster_host>:14000 -o -d $FASTSEARCH/data/crawler -L $FASTSEARCH/var/log/crawler</parameters> <port base="1100" increment="1000" count="1"/> </start> <outfile>master.scrap</outfile> <limits> <minimum_disk>1000</minimum_disk> </limits> </process> A <proc>master </proc> should be added to the startorder list in the $FASTSEARCH/etc/NodeConf.xml file. 5. Add a duplicate server. The duplicate server can be set up in a number of different ways, including striped and replicated modes, but a simple standalone set up is as follows: <process name="dupserver" description="Duplicate server"> <start> <executable>ppdup</executable> <parameters>-P $PORT -I <Symbolic ID></parameters> <port base="1900" increment="1" count="1"/> </start> <outfile>dupserver.scrap</outfile> <limits> <minimum_disk>1000</minimum_disk> </limits> </process> The ppdup binary must be added to the configuration in the FAST ESP administrator interface with a host:port location (available in the advanced mode). 
Note that the ppdup does not have an -L parameter. Refer to Crawler/Master Tuning for information about cache size and storage tuning 129 FAST Enterprise Crawler A <proc>dupserver</proc> should be added to the startorder list in the $FASTSEARCH/etc/NodeConf.xml file. 6. Start the new crawler: $FASTSEARCH/bin/nctrl reloadcfg 7. Start the different processes on the relevant nodes in the following order: $FASTSEARCH/bin/nctrl start ubermaster $FASTSEARCH/bin/nctrl start dupserver $FASTSEARCH/bin/nctrl start master (on all master nodes) 8. To verify the new configuration: a) Check the ubermaster logs to verify all masters are connected and that there are no conflicts, for example, conflicts in -I names. b) Check to make sure the ubermaster appears as a Data Source in the FAST ESP administrator interface You can add collections by either using the FAST ESP administrator interface or by uploading the crawler XML specifications with the following command: $FASTSEARCH/bin/crawleradmin -C <ubermaster hostname>:<ubermaster portnumber> -f <path to xml specification> If you are uploading a web scale crawl, it is recommended that you add the collection with the Large Scale XML Configuration Template on page 141. 9. Refeed collections with postprocess. Re-feeding collections in a multiple node crawler is similar to performing it in a single node crawler, with some exceptions. Before starting the refeed make sure the duplicate server(s) are running. The master(s) must be stopped on the node(s) you wish to refeed, the ubermaster as well as masters on other nodes may continue to run. $FASTSEARCH/bin/postprocess -R "*" Large Scale XML Crawler Configuration This section provides information on how to configure and tune a large scale distributed crawler. Node Layout A large scale installation of the crawler consists of three different components: • • • One Ubermaster (UM), One or more Duplicate servers (DS) and Multiple Crawlers. Each crawler consists of a master process, multiple uberslave processes and a single postprocess. A typical 10 node layout may look like this (each square represents a server): 130 Configuring the Enterprise Crawler Figure 13: This configuration offers both duplicate server fault tolerance and load balancing. Node Hardware The following items (in prioritized order) should be considered when planning hardware: 1. Disk I/O performance 2. Amount of memory 3. CPU performance and dual vs. single processor Typical disk setup involves either RAID 0 or RAID 5 with a minimum of four drives. RAID 0 offers better performance, but no fault tolerance. RAID 5 has substantial penalty on write performance. Other options include RAID 0+1 (or RAID 1+0). When running with a replicated duplicate server setup, it may be that a non-fault tolerant setup (for example, RAID 0) is the best alternative for the duplicate server nodes, with all other nodes on a fault tolerant storage (RAID 5). Memory usage is very dependent on the cache configuration used, but both the duplicate server and postprocess (on each crawler node) can take advantage of large amounts for database caching purposes. CPU performance is much less important, and depends mainly on the configuration settings used. This is discussed in more detail in the Configuration and Tuning section. Hardware Sizing Due to the I/O-bound nature of crawling, the hardware sizing should be based primarily on hard disk capacity and performance. To calculate the disk usage of the crawler nodes, the following needs to be accounted for. 
• Crawled data (each document is compressed individually by default)
• Meta data structures
• Postprocess duplicate database
• Crawl and feed queues

Assuming an average compressed document size of 20kB (30kB if also crawling PDF and Office documents), 2kB of meta data per document (including the HTTP header) and 500 bytes per URI in the duplicate DB, we can calculate the disk space requirements for a single node. Note however that the document sizes (20kB and 30kB) are estimates, and depend largely on what is being crawled and on the document size cut-off specified in the configuration. Adding 30% slack on top to account for wasted space in the data structures, log files, queues and so on, we get the following guideline table.

Documents    Data size (HTML only)    Data size (HTML, PDF, Word++)
10M          290GB                    420GB
20M          585GB                    845GB
30M          880GB                    1.3TB

The rule of thumb is that a node should not hold more than 20-30M documents, as performance may degrade too much beyond that point.

The disk usage of the duplicate servers will be similar to that of the postprocess duplicate database. However, keep in mind that using replication (which is recommended) doubles the disk usage, as each node holds a mirror of its peer node's dataset.

Ubermaster Node Requirements

The UM can either run on a separate node or share a node with one of the duplicate servers. Place the UM on a dedicated node for large installations (20 masters and up).

Memory usage: 100-500MB
CPU usage: Moderate to high, depending on the number of masters
Disk I/O: Moderate to high, depending on the number of masters
Disk usage: Minimal

There are no global tuning parameters for this component.

Duplicate Servers

The duplicate server processes serve as the backbone of a multiple node crawler setup, and care should be taken when configuring them since they may be difficult to reconfigure at a later stage.

Memory usage: 70MB and up (tunable)
CPU usage: Minimal
Disk I/O: Heavy during the first cycle, moderate on subsequent cycles
Disk usage: Moderate

Non-replicated Mode

A simple duplicate server layout involves one or more duplicate server processes in a non-replicated mode. The advantage of this approach is the increased performance offered, with the drawback of no replication in case of failure (loss of data). It should therefore only be used if the underlying disk system is fault tolerant (and preferably more so than RAID 5). For each duplicate server set up this way you must also add it to the duplicate servers configuration section for each collection. Refer to ppdup for options information.

Two node (striped) example, running on servers node1 and node2 with symbolic IDs dup1 and dup2:

node1: ./ppdup -P 14900 -I dup1
node2: ./ppdup -P 14900 -I dup2

Replicated Mode

The duplicate server can be replicated in two ways: dedicated replica and cross-replication.

The dedicated replica mode works by setting up a second duplicate server that acts only as a replica, with no direct communication with postprocess, only with its duplicate server primary. Both processes in the following example use the same ID.

Dedicated replica example, running on two nodes:

node1: ./ppdup -P 14900 -I dup1 -R node2:14901
node2: ./ppdup -P 14900 -I dup1 -r 14901

An alternative means of getting both replication and load balancing is to use the cross-replication mode. In this setup each duplicate server acts as both primary and replica for another duplicate server.
Cross replication example running on two nodes: node1: ./ppdup -P 14900 -I dup1 -R node2:14901 -r 14901 node2: ./ppdup -P 14900 -I dup2 -R node1:14901 -r 14901 While it may seem that the last setup is preferable to a combined striped and replicated setup, it is not. Separating primary and replica into two processes has two distinct advantages; it allows the processes to use more memory for caching (max process size on RHEL3 is about 2.8GB) as well as placing the I/O and CPU tasks somewhat in parallel (the duplicate server uses blocking I/O). Crawlers (Masters) Memory usage 512MB and up (tunable) CPU usage Moderate to high, configuration dependent Disk I/O Heavy Disk usage High There are no global tuning parameters for this component. Configuration and Tuning When planning a large scale deployment configuration, first consider the number of collections and their sizes. An ideal large scale setup consists of a single collection, possibly with a few sub collections. Having multiple smaller collections on a multiple node crawler is generally not desired, especially if they fit on single node crawler. In this case it is better to set up one or several single node crawlers to handle the small collections. If you have to have multiple collections on a multiple node crawler, keep in mind that many tunable parameters such as cache sizes are configured per collection so they can add up for each collection. Furthermore, certain parameters are applied individually to each collection, but may only be configured through one global setting. This does not fit well with having both small and large collections on the same multiple node crawler and is generally not recommended. However, there are advantages to having multiple collections as opposed to one collection with sub collections. Individual collections make management easier, including configuration updates, re-feeds, and scratching. Having many sub collections add overhead since each URI must be checked against the include rules of each sub collection until a match, if any, is found. The best advice is to first divide your data into whatever logical collections make the most sense. If the setup calls for a mix of large and small collections (for example, web and news crawls) then it is advisable to place the small collections on a separate single node crawler. The remaining collections should generally be larger than what a single node could handle and it therefore makes sense to run them either separately or as a single collection on the multiple node crawler. Include /Exclude Rules The crawler uses a set of include and exclude rules to limit and control what is to be crawled. 133 FAST Enterprise Crawler In a small scale setting the performance considerations are few as there are generally a limited number of rules. However, in a large scale setting there may be tens of thousands of rules and care must be taken when selecting these. It is important to keep in mind that every URI extracted by the crawler is checked against some or all of these rules. The least expensive rules are the exact match hostname rules. Checking a URI against these involve a single hash table lookup, so the performance is the same regardless of the number of URIs. Memory usage depends on the number of rules. The suffix and prefix hostname includes are also generally inexpensive, as they are also implemented using hash structures. 
By dividing the URIs into subsets based on their lengths we get at most n lookups (where n is the number of different lengths), rather than one lookup per rule. While more expensive than the exact match rule, it is dependent on the number of unique lengths and not the number of rules. The regexp type rule should be avoided if at all possible. In general it will only be necessary as a URI rule, not a hostname rule. It is best to check with the crawler team if you have any questions regarding this. Note: One common pitfall is to use either a suffix URI rule or a regexp URI rule to exclude certain filename extensions. The former will fail if the URI contains for example, query parameters and the latter consumes much more CPU than it needs to. To exclude certain file extensions you should use the exclude_exts config option. Tip: Using exclude rules can potentially speed up the checks. Since exclude rules are applied first, you could for example, exclude all top level domains you do not wish to crawl. Includes and sub collections - When setting up sub collections it is vital to keep in mind that they are subsets of the main collection. Therefore, any URIs for the sub collection must match the include/exclude rules of the main collection first, and then the relevant sub collection. URI Seed Files It is necessary to specify one or more Start URIs for any size crawl, but it is not necessarily true that a large crawl requires a long list of Start URIs. Because the web is heavily interconnected, with links from site to site, you can usually start at a single URI (preferably a page with many external URIs) and allow the crawler to gather links and add sites to the crawl from there. This works well if your goal is to crawl a top-level domain, such as the .no domain. Adding numerous seeds will do little to improve or focus your crawl. However, if you do not wish to crawl an entire top-level domain, but rather selected sites only, then a seed list is useful.You also need to either be restrictive in the sites crawled (using include/exclude rules), or disable the following of cross site URIs altogether. If you do neither, then use a small seed list. Restricting the Size of the Crawl There are several techniques for restricting the size of your crawl. When setting up a large scale crawl, you often have requirements based on the number of URIs you would like in your index or how much data you should handle. The crawler has several configuration options that can restrict the size of a crawl, taken alone or together: • • • • Limit number of documents per site Specify maximum crawl depth per site Set a maximum number of documents per collection Require minimum free disk space limit The max_docs option specifies the maximum number of documents to retrieve per site per refresh cycle. This is useful to limit the crawling of deep (or perhaps endlessly recursive) sites. Keep in mind that this counter is reset per refresh, so that over time the total number of documents might exceed this limit, though documents not seen for a number of refresh cycles will be recognized and deleted, as described in the dbswitch configuration option. 134 Configuring the Enterprise Crawler An another configuration setting that can be used in conjunction with the attribute described above (or on its own) is level crawling. By specifying a crawlmode depth limitation, you can ensure that the crawler only follows links a certain number of hops from the Start URI for each site. This avoids the deep internal portions of a site. 
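As an illustration, the two restrictions just described can be combined in a collection specification. The fragment below is a sketch only, using the same attribute syntax as the Large Scale XML Configuration Template later in this chapter (which spells the per-site limit max_doc and notes that DEPTH:n enables level crawling); the values shown are arbitrary examples.

<!-- Retrieve at most 5000 documents per site per refresh cycle -->
<attrib name="max_doc" type="integer"> 5000 </attrib>

<!-- Only follow links up to 3 hops from the start URI of each site -->
<section name="crawlmode">
  <attrib name="mode" type="string"> DEPTH:3 </attrib>
</section>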
The amount of time spent crawling and the aggressiveness of the crawl are major factors in determining the volume of fetched documents. The configuration options refresh, max_sites and delay also play a major role. The crawler will fetch at most refresh * 60 * ( max_sites / delay ) URIs during a single refresh cycle. For multi node crawls this figure is per master node. Together with a refresh_mode set to scratch or adaptive this limits the number of documents that will be indexed. Note: No other refresh_mode value should be used for large-scale crawls, due to potentially large disk usage requirements. Keep in mind that subsequent refresh cycles may not fetch the exact same links as before, due to various reasons including the fact that web pages (and their links) change, the structure of the web changes, and network latencies change. If scratch refresh mode is used then the index may fluctuate slightly in size. However, as URIs not fetched for some time are deleted it should be fairly stable once the configuration is set. With refresh mode set to adaptive it will use the existing set of URIs in the crawler store as the basis for re-crawling, but some new content will also be crawled. The limits options allow you to specify threshold limits for the total number of documents and for the free disk space. If exceeded, the crawler enters a "refreshing" crawl mode, so that it will only crawl URIs that have been previously fetched. For each limit, one also has to specify a slack value, indicating the lower resource limit that must be reached before the crawler returns to its normal crawl mode. Duplicate Server Tuning Two different storage formats are supported in EC 6.7, GigaBASE and hashlog. Hashlog is recommended for most installations. However, if you have a lot of small collections (less than 10M each) then using GigaBASE may also work very well. GigaBASE The original format is based on GigaBASE and consists of a set of striped databases. The main motivation behind the striping is two-fold; decrease the size of database files on disk and reduce depths of the B-trees within the databases. Reducing the database size through striping may have a limited effect on the B-tree depths once the databases grow too large. The following table can be used as a guideline for selecting an appropriate GigaBASE stripe size. Keep in mind that the document count number relates to the number of documents (or rather document checksums) stored on this particular duplicate server. A load balanced setup will thus have total_count/ds_instances documents in a single duplicate server. Dedicated duplicate server replicas are not included in the ds_instances count. Document Stripes 25M 2 50M 4 100M 8 In addition to stripe size you can also tune the GigaBASE cache size. The value given will be divided among the stripes such that a cache size of 1GB will consume approximately 1GB of memory. If cross-replication is used the memory usage will be twice the specified cache size. The rule of thumb when selecting a cache size is to use as much memory as you can afford, as long as the process does not exceed the maximum process size allowed by the operating system. 135 FAST Enterprise Crawler Hashlog The newer format (the default format) is called hashlog and combines a hash (either disk or memory based) with a log structure. The advantage of a hash compared to a B-tree structure is that lookups in a hash are O(1) whereas a B-tree has O(log n). This means that as the data structure grows larger the hash is much more suitable. 
A disk based hash is selected by specifying "diskhashlog" as the format. The initial size of the hash (in MB) can be specified by the cache size option. In this mode each read/write results in 1 seek. This is suitable for very large structures where it is not feasible to hold the entire hash in memory. To select a memory based hash, specify the maximum amount of memory to be used with the cache size option and use the format "hashlog". If the amount of data exceeds the capacity the hash will automatically rehash to disk. The following formula calculates approximately how many elements the memory hash will hold: capacity = memory / (12 * 1.3) This yields the following approximate table: Memory Reserved Documents 100M ~6.5M 500M ~33M 1.0GB ~68M 1.5GB ~100M In addition to the hash a structured log is also used. Reads from the log require a single seek (bringing us up to 2 seeks for disk hash and 1 for memory hash). However, due to its nature the log grows in size even when replacing old elements. To counter this, the duplicate server has a built-in nightly compaction where the most fragmented log file is compacted. During this time the duplicate server will be unavailable. It does not affect crawling performance, but will delay any processing in postprocess during that time. Disk Hash vs. Memory Hash Note: If sufficient memory is available, a memory hash will give significant performance advantages. Keep in mind that every collection that uses the duplicate server will allocate a memory hash of the same size (regardless of the size of the collection), so this affects the memory hash size that can be used. This makes it impractical to use a memory hash if the collections differ greatly in size. In this case the best solution is to setup multiple duplicate servers, with memory hash, on the same node, for instance one duplicate server for a large collection and one duplicate server for the small/medium collections. Furthermore, the summation of all the cache sizes should not exceed the total amount of available physical memory on a node. Note: When using a disk based hash larger than 10M it is generally recommended to turn off the automatic compaction feature in the duplicate server. Compaction of such disk based hashes can take many hours, and is best performed manually using the crawlerdbtool. This compaction can be turned off using the -N command line option passed to the duplicate server. Postprocess Tuning "PostProcess" (PP) performs duplicate checking for a single node and feeds data through document processing to the indexer. Please note that in a multi-node crawler environment, the "duplicate server" handles cross-node duplicate checking. 136 Configuring the Enterprise Crawler Tuning duplicate checking Duplicate checking within a single node requires a database which consists of all the unique checksums "owned" by the node. The checksums map to a set of URIs (the URI equivalence class) from which one URI is designated the owner URI and the rest duplicate URIs. Some additional meta information is also stored. The parameters listed below are available for this purpose, and are tunable per collection in the configuration. The options are specified in the "postprocess" section of the configuration, unless otherwise noted. Postprocess Parameter Description dupservers Must be set to a list of primary duplicate servers. max_dupes Determines the maximum number of duplicate URIs corresponding to each checksum. 
This setting has a severe performance impact and values above 3-4 are not recommended for large scale crawls. stripe Please refer to the hashlog/GigaBASE discussion in the Duplicate Server Tuning section. A stripe size of 4 is typical in most cases. Note that for GigaBASE storage the amount of memory used by caching is defined as the postprocess cache size multiplied by the number of stripes. ds_paused Allows you to pause feeding to document processing and indexing. Useful if you would like to crawl first and index later. Can also be controlled with the --suspendfeed and the resumefeed crawleradmin option, but the value in the configuration overrides it if you feed. ds_max_ecl Maximum number of URIs in the equivalence class that is sent to document processing for indexing. The value should be set to the same value as max_dupes. pp (in cache size section) Specifies the amount of memory allocated to the checksum database cache for the collection. For GigaBASE this is the database cache *per stripe*, and for Hashlog it is the memory hash size. The value should be high (for example, 512MB or more for a 25M node), but keep in mind that this setting is separate per collection and that they add up. Use the most memory on the large collections. Tuning postprocess feeding By default crawlerfs in postprocess is run using a single thread. In order to increase the throughput, it is possible to configure multiple crawlerfs processes and multiple DocumentRetriever processes in the ESP document processing pipeline. This may significantly speed up the processing. If you need to accomplish this task, please contact FAST support. Crawler/Master Tuning The following sections outline the various parameters that should be modified for large scale crawls. Storage and Storage Tuning The following storage section attributes should be tuned. The remaining storage parameters should be left at default values (for example, not included in XML at all). The crawler performs large amounts of network and file I/O, and the number of available file descriptors can be a limiting factor, and lead to errors. Insure that sufficient file descriptors are available by running the limit (or ulimit) command from the account under which the crawler runs. If the value is too low (below 2048), increase the hard limit for descriptors to 8096 (8K). Check the operating system administrator documentation for details on doing so; it may be sufficient to run the limit/ulimit command as superuser, or a system resource configuration file might need to be modified, perhaps requiring a system reboot. 137 FAST Enterprise Crawler Storage Parameter Description store_http_header Can be disabled if you know that it will not be needed in the processing pipeline (it is sent in the 'http_header' attribute). Disabling saves some disk space in the databases and may give a slight performance boost. remove_docs Enabling this option will delete documents from disk once they have been feed to document processing and indexing. Disabled by default. Note: Re-feeding the crawler store will not be possible with this option enabled. Therefore, this mode should only be enabled for stable and well-tested installations. Cache Size Tuning The crawler automatically tunes certain cache sizes based on what it perceives as the size of your crawl. The main factors are the number of active sites and the delay value. 
The following caches are automatically tuned, and they should not be included in your XML configuration (and if they do they must have blank values): • • • • • screened smcomm mucomm wqcache crosslinks Refer to Cache Size Parameters on page 101 for additional information including parameter defaults.The only cache parameter to be configured is the postprocess (pp) cache which was previously discussed in the Postprocess Tuning on page 136 section.You may also use a larger cache size for the routetab and aliases cache if you crawl a lot of sites, especially multiple node crawls. The pp, routetab and aliases caches are all GigaBASE caches specified in bytes. Log Tuning Less logging means improved performance. However, it also means that is becomes more difficult to debug issues. Note that only some of the logs have large impact on the performance. In order of resource consumption you should adhere to the following recommendation: • • • • • • • DNS log: Should always be enabled. Screened log: Must be disabled! Site log: Should always be enabled. Fetch log: Should always be enabled. Postprocess log: Should be disabled unless you use it. DSFeed log: Should always be enabled. Scheduler log: Should be disabled unless you use it. General Tuning fetch_timeout, robots_timeout, Default is 300 seconds, increase if you experience more timeouts than expected (could be caused by bandwidth shapers) login_timeout 138 use_http_1_1 Enable to take advantage of content compression and If-Modified-Since, both saving bandwidth. accept_compression Enables the remote HTTP server to compress the document if desired before sending. Few servers actually do this, but some do and it will save bandwidth. robots Always adhere to robots.txt when doing large scale crawls. Configuring the Enterprise Crawler refresh_mode A large scale crawl should always use 'scratch' (default) or 'adaptive'. If you need to crawl everything, then you should initially set the 'refresh' to a high value. Once you know the time required for an entire cycle, you can tune it. If it is not possible to crawl everything within your time limit, you need to reduce the 'refresh' and/or use the 'max_doc' option to reduce the number of documents to download from each site. Note: The option to automatically refresh when idle is not available for multi node crawler setups. max_sites Together with the delay option this controls the maximum theoretical crawl speed you will be able to obtain. For example, a max_sites of 2000 and a delay of 60 will give you 2000/60 = 33 docs/sec. Please note that this value is *per node* so with 10 crawler nodes this would translate into 20000 max_sites total and 333 docs/sec. In practice you seldom get that close to the theoretical speed, and it also depends greatly on there being enough sites to crawl at any one time. To monitor your crawler with regard to the actual rate use crawleradmin -c and look at the "Active sites" count. headers While this setting can be used to specify arbitrary HTTP headers, it is usually used for only the crawler 'User-Agent'. The user agent header must specify a "name" for the crawler as well as contact information, either in the form of a web page and/or e-mail address. For example "User-Agent: YourCompany Crawler (crawler at yourcompany dot com)" cut_off Should be adjusted depending on the type of documents to be crawled. For example, PDFs and Word documents tend to be larger than HTML documents for example. 
max_doc This setting can be important when tuning the size of the crawl, as it limits the number of documents retrieved per site.Typical values might be in the 2000-5000 range. Can also be specified for sub collections if and only if the sub collections are only defined using hostname rules and not any URI rules. check_meta_robots META robots tags should be adhered to when doing a web scale crawl. html_redir_is_redir/html_redir_thresh The 'html_redir_is_redir' option lets the crawler treat META refresh tags inside HTML documents as if they were true HTTP redirects. When enabled the document containing the META refresh will not itself be indexed. The 'html_redir_thresh' option specifies the number of seconds delay which are allowed for the tag to be considered a redirect. Anything less than this number is treated as a redirect, other values are just treated as a link (and the document itself is indexed also). dbswitch/dbswitch_delete The 'dbswitch' option specifies the number of refreshes a given URI is allowed to stay in the index without being seen again by the crawler. URIs that have not been seen for this amount of refreshes will either be deleted or added to the queue for re-crawl, depending on the 'dbswitch_delete' option. This option should never be less than 2 and preferably at least 3. For example a in a 30 day cycle crawl with a dbswitch of 3 any given URI may remain unseen for 3 cycles before being removed/scheduled. Keep in mind that if the crawler was stopped for 30 days the cycles would still progress. One common method of limiting the amount of dead links in the index is to use what is known as a dead links crawler. The idea is to use click-through tracking to actively re-crawl the search results clicked on by users. Not only will the crawler quickly discover pages that have disappeared, but freshness for frequently clicked pages are also improved. 139 FAST Enterprise Crawler wqfilter/smfilter/mufilter These options decide whether to use a Bloom filter to weed out duplicate URIs before queuing in the slave ('wqfilter') and sending URIs from the slave to the master ('smfilter'). The former is a yes/no option and the size of the filter is calculated based on the max_docs setting and a very low probability of false positives. For large scale crawls this should definitely be on to reduce the number of duplicates in the queue. The 'smfilter' is specified by a capacity value, typically 50000000 (50M). The filter is automatically purged whenever it reaches a certain saturation, so there should be a very low probability of false positives. The default is off (0), but the 50M size should definitely be used for large crawls. The 'mufilter' is a similar filter present in the master and ubermaster. It should be even larger, typically 500000000 (500M) for wide crawls to prevent overloading the ubermaster with links. max_reflinks Must be set to 0 (the default value) for large-scale crawls, to disable the crawler from storing a list of URIs that link to each document. Disabling this reduces memory and disk usage within the crawler. The equivalent functionality is implemented by the WebAnalyzer component of FAST ESP. max_pending This option limits the number of concurrent requests allowed to a single site. For a large scale crawl with 60 seconds delay it should probably be set to 1 or 2 (the only time you would have more than one request would be if the first took more than 60 seconds to complete). 
extract_links_from_dupes Since duplicates generally link to more duplicates this option should be turned off, whether the crawl is large or small. if_modified_since Controls whether to send 'If-Modified-Since' requests, significantly reducing the bandwidth use subsequent crawl cycles. Should always be on for wide crawls. use_cookies The cookie support is intended for small intranet crawls and should always be disabled for large scale crawls. If you require cookie support for certain sites it may be best to place them in a separate collection, rather than enabling this feature for the entire crawl. rewrite_rules Rewrite rules can be used to rewrite links parsed out of documents by applying regular expression and repeating captured text. This implies that all rewrite rules are attempted applied for every link, and it can therefore be very expensive in terms of CPU usage depending on the number of rules and their complexity. It is therefore advised to limit the amount of rewrite rules for large scale crawls. use_javascript and enable_flash For performance reasons it is highly recommended to disable JavaScript and flash crawling for large crawls. If you require JavaScript and/or flash support, you should only enable it for a limited set of sites. You need to put these sites into a separate collection. Note: Enabling any of these options also requires that one or more Browser Engines be configured. For more information, please refer to the FAST ESP Browser Engine Guide. domain_clustering In a web scale crawl it is possible to optimize the crawler to take advantage of locality in the web link structure. Sub domains on the same domain tend to link more internally than externally, just as a site would have mostly interlinks. The domain clustering option enables clustering of sites on the same domain (for example, *.example.net) on the same master node and the same storage cluster (and thus uberslave process). For web crawls this feature should always be enabled. Note: This option is automatically turned on for multi node crawls by the ubermaster 140 Configuring the Enterprise Crawler Maximum Number of Open Files If you plan to do a large scale crawl you should increase the maximum number of open files (default 1024). This change is done in etc/NodeConf.xml. For example, to change from: <resourcelimits> <limit name="core" soft="unlimited"/> <limit name="nofile" soft="1024"/> </resourcelimits> to <resourcelimits> <limit name="core" soft="unlimited"/> <limit name="nofile" soft="4096"/> </resourcelimits> Note that this will only set the soft limit. In order for this to work the system hard limit must also be set to the same value or higher. Large Scale XML Configuration Template The following shows a largescale.xml collection configuration template. largescale.xml crawler collection configuration template <?xml version="1.0"?> <CrawlerConfig> <!-- Template --> <DomainSpecification name="LARGESCALE"> <!-<!-<!-<!-- Crawler Identification Modify the following options to identify the collections and the crawler. Make sure you specify valid contact information. --> --> --> --> <attrib name="info" type="string"> Sample LARGESCALE crawler config </attrib> <!-- Extra HTTP Headers --> <attrib name="headers" type="list-string"> <member> User-agent: COMPANYNAME Crawler (email address / WWW address) </member> </attrib> <!-- General options <!-- The following options are general options tuned for large <!-- scale crawling. 
You generally do not need to modify these --> --> --> <!-- Adhere to robots.txt rules --> <attrib name="robots" type="boolean"> yes </attrib> <!-- Adhere to meta robots tags in html headers --> <attrib name="" type="boolean"> yes </attrib> <!-- Adhere to crawl delay specified in robots.txt --> <attrib name="obey_robots_delay" type="boolean"> no </attrib> <!-- Don't track referrer links, as this is done in the --> <!-- pipeline by the WebAnalyzer component --> <attrib name="max_reflinks" type="integer"> 0 </attrib> <!-- Only have one outstanding request per site at any one time --> <attrib name="max_pending" type="integer"> 1 </attrib> <!-- Keep hostnames of the same DNS domain within one slave --> <attrib name="domain_clustering" type="boolean"> yes </attrib> 141 FAST Enterprise Crawler <!-- Maximum time for the retrieval of a single document --> <attrib name="fetch_timeout" type="integer"> 300 </attrib> <!--- Support HTML redirects --> <attrib name="html_redir_is_redir" type="boolean"> yes </attrib> <!-- Anything with delay 3 and lower is treated as a redirect --> <attrib name="html_redir_thresh" type="integer"> 3 </attrib> <!-- Enable near duplicate detection --> <attrib name="near_duplicate_detection" type="boolean"> no </attrib> <!-- Only log retrievals, not postprocess activity --> <section name="log"> <attrib name="fetch" type="string"> text </attrib> <attrib name="postprocess" type="string"> none </attrib> </section> <!-- Do not extract and follow links from duplicates --> <attrib name="extract_links_from_dupes" type="boolean"> no </attrib> <!-- Do not store duplicates, use a block-type storage, and compress documents on disk --> <section name="storage"> <attrib name="store_dupes" type="boolean"> no </attrib> <attrib name="datastore" type="string"> bstore </attrib> <attrib name="compress" type="boolean"> yes </attrib> </section> <!-- Do not retry retrieval of documents for common errors --> <section name="http_errors"> <attrib name="4xx" type="string"> DELETE </attrib> <attrib name="5xx" type="string"> DELETE </attrib> <attrib name="ttl" type="string"> DELETE:3 </attrib> <attrib name="net" type="string"> DELETE:3, RETRY:1 </attrib> <attrib name="int" type="string"> KEEP </attrib> </section> <!-- Rate and size options <!-- The following options tune the crawler rate and refresh <!-- cycle settings. Modify as desired. --> --> --> <!-- Only retrieve one document per minute --> <attrib name="delay" type="real"> 60 </attrib> <!-- Length of crawl cycle is 10 days (expressed in minutes) --> <attrib name="refresh" type="real"> 14400 </attrib> <!-- Available refresh modes: scratch (default), adaptive, soft, --> <!-- append and prepend. 
--> <attrib name="refresh_mode" type="string"> scratch </attrib> <!-- Let three cycles pass before cleaning out URIs not found --> <attrib name="dbswitch" type="integer"> 3 </attrib> <!-- Crawl Mode --> <section name="crawlmode"> <!-- Crawl depth (use DEPTH:n to do level crawling) --> <attrib name="mode" type="string"> FULL </attrib> <!-- Follow interlinks --> <attrib name="fwdlinks" type="boolean"> yes </attrib> <!-- Reset crawl level when following interlinks --> <attrib name="reset_level" type="boolean"> no </attrib> </section> <!-- Let each master crawl this many sites simultaneously --> <attrib name="max_sites" type="integer"> 6144 </attrib> <!-- Maximum size of a document --> <attrib name="cut_off" type="integer"> 500000 </attrib> <!-- Maximum number of bytes to use in checksum (0 == disable) --> 142 Configuring the Enterprise Crawler <attrib name="csum_cut_off" type="integer"> 0 </attrib> <!-- Maximum number of documents to retrieve from one site --> <attrib name="max_doc" type="integer"> 5000 </attrib> <!-- Enable HTTP version 1.1 to enable accept_compression --> <attrib name="use_http_1_1" type="boolean"> yes </attrib> <!-- Accept compressed documents from web servers, you have more cpu than bandwidth --> <attrib name="accept_compression" type="boolean"> yes </attrib> <!-- Performance tuning options --> <!-- Sizes of various caches --> <section name="cachesize"> <!-- UberMaster and Master routing tables cache (in bytes) --> <attrib name="routetab" type="integer"> 4194304 </attrib> <!-- PostProcess checksum database (per stripe) cache (in bytes) --> <attrib name="pp" type="integer"> 268435456 </attrib> </section> <!-- Slave work queue bloom filter enabled --> <attrib name="wqfilter" type="boolean"> yes </attrib> <!-- Slave -> Master bloom filter with capacity 50M --> <attrib name="smfilter" type="integer"> 50000000 </attrib> <!-- Master/UberMaster bloom filter with capacity 500M --> <attrib name="mufilter" type="integer"> 500000000 </attrib> <!-- Adaptive Scheduling. To enable un comment this section and --> <!-- change 'refresh_mode' to 'adaptive' --> <section name="adaptive"> <!-- Number of "micro" refresh cycle within a full refresh --> <attrib name="refresh_count" type="integer"> 4 </attrib> <!-- Ratio (in percent) of rescheduled URIs vs. new (unseen) --> <!-- URIs scheduled. - -> <attrib name="refresh_quota" type="integer"> 98 </attrib> <!-- The maximum percentage of a site to reschedule during a --> <!-- "micro" refresh cycle. --> <attrib name="coverage_max_pct" type="integer"> 25 </attrib> <!-- The minimum number of URIs on a site to reschedule <!-- during a refresh cycle "micro" refresh cycle. <attrib name="coverage_min" type="integer"> 10 </attrib> --> --> <!-- Ranking weights. Each scoring criteria adds a score between --> <!-- 0.0 and 1.0 which is then multiplied with the associated --> <!-- weight below. Use a weight of 0 to disable a scorer --> <section name="weights"> <!-- Score based on the number of /'es (segments) in the --> <!-- URI. Max score with one, no score with 10 or more --> <attrib name="inverse_length" type="real"> 1.0 </attrib> <!-- Score based on the number of link "levels" down to --> <!-- this URI. Max score with none, no score with >= 10 --> <attrib name="inverse_depth" type="real"> 1.0 </attrib> <!-- Score added if URI is determined as a "landing page" --> <!-- defined as e.g. ending in "/" or "index.html". 
URIs --> <!-- with query parameters are not given score --> <attrib name="is_landing_page" type="real"> 1.0 </attrib> 143 FAST Enterprise Crawler <!-- Score added if URI points to a markup document as --> <!-- defined by the "uri_search_mime" option. Assumption --> <!-- being that such content changes more often than e.g. --> <!-- "static" Word or PDF documents. --> <attrib name="is_mime_markup" type="real"> 1.0 </attrib> <!-- Score based on change history tracked over time by --> <!-- using an estimator based on last modified date given --> <!-- by the web server. If no modified date returned then --> <!-- one is estimated (based on whether the document has --> <!-- changed or not). --> <attrib name="change_history" type="real"> 10.0 </attrib> </section> </section> <!-- PostProcess options <!-- Duplicate servers must be specified also. Feeding is <!-- initially suspended below, can be turned on if desired. --> --> --> <section name="pp"> <!-- Use 4 database stripes --> <attrib name="stripe" type="integer"> 4 </attrib> <!-- Only track up to four duplicates for any document --> <attrib name="max_dupes" type="integer"> 4 </attrib> <!-- The address of the duplicate server --> <attrib name="dupservers" type="list-string"> <member> HOSTNAME1:PORT </member> <member> HOSTNAME2:PORT </member> </attrib> <!-- report only bare minimum of meta info to ESP/FDS --> <attrib name="ds_meta_info" type="list-string"> <member> duplicates </member> <member> redirects </member> </attrib> <!-- Feeding to ESP/FDS suspended --> <attrib name="ds_paused" type="boolean"> yes </attrib> </section> <!-<!-<!-<!-<!-- Inclusion and exclusion The following section sets up what content to crawl and not to crawl. Do not use regular expression rules unless absolutely necessary as they have a significant impact on performance. 
--> --> --> --> --> <!-- Only crawl http (ie, don't crawl https/ftp --> <attrib name="allowed_schemes" type="list-string"> <member> http </member> </attrib> <!-- Allow these MIME types to be retrieved --> <attrib name="allowed_types" type="list-string"> <member> text/html </member> <member> text/plain </member> <member> text/asp </member> <member> text/x-server-parsed-html </member> </attrib> <!-- List of included domains (may be regexp,prefix,suffix, exact) --> <section name="include_domains"> 144 Configuring the Enterprise Crawler <attrib name="exact" type="list-string"> </attrib> <attrib name="prefix" type="list-string"> </attrib> <attrib name="suffix" type="list-string"> </attrib> </section> <!-- List of excluded domains (may be regexp,prefix,suffix, exact) --> <section name="exclude_domains"> <attrib name="exact" type="list-string"> </attrib> <attrib name="prefix" type="list-string"> </attrib> <attrib name="suffix" type="list-string"> </attrib> </section> <!-- List of excluded URIs (may be regexp,prefix,suffix, exact) --> <section name="exclude_uris"> <attrib name="exact" type="list-string"> </attrib> <attrib name="prefix" type="list-string"> </attrib> <attrib name="suffix" type="list-string"> </attrib> </section> <!-- List of included URIs (may be regexp,prefix,suffix, exact) --> <section name="include_uris"> <attrib name="exact" type="list-string"> </attrib> <attrib name="prefix" type="list-string"> </attrib> <attrib name="suffix" type="list-string"> </attrib> </section> <!-- Exclude these filename extensions --> <attrib name="exclude_exts" type="list-string"> <member> .jpg </member> <member> .jpeg </member> <member> .ico </member> <member> .tif </member> <member> .png </member> <member> .bmp </member> <member> .gif </member> <member> .avi </member> <member> .mpg </member> <member> .wmv </member> <member> .wma </member> <member> .ram </member> <member> .asx </member> <member> .asf </member> <member> .mp3 </member> <member> .wav </member> <member> .ogg </member> <member> .zip </member> <member> .gz </member> <member> .vmarc </member> <member> .z </member> <member> .tar </member> <member> .swf </member> <member> .exe </member> <member> .java </member> <member> .jar </member> <member> .prz </member> <member> .wrl </member> 145 FAST Enterprise Crawler <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> </attrib> .midr </member> .css </member> .ps </member> .ttf </member> .xml </member> .mso </member> .rdf </member> .rss </member> .cab </member> .xsl </member> .rar </member> .wmf </member> .ace </member> .rar </member> <!-- List of start URIs --> <attrib name="start_uris" type="list-string"> <member> INSERT START URI HERE </member> </attrib> <!-- List of start URI files --> <attrib name="start_uri_files" type="list-string"> <member> INSERT START URI FILE HERE</member> </attrib> </DomainSpecification> </CrawlerConfig> 146 Chapter 4 Operating the Enterprise Crawler Topics: • • • • • • • Stopping, Suspending and Starting the Crawler Monitoring Backup and Restore Crawler Store Consistency Redistributing the Duplicate Server Database Exporting and Importing Collection Specific Crawler Configuration Fault-Tolerance and Recovery The crawler runs as an integrated component within FAST ESP, monitored by the node controller (nctrl) and started/stopped via the administrator interface or the nctrl command. 
FAST Enterprise Crawler Stopping, Suspending and Starting the Crawler Stopping, suspending and starting the crawler can be executed from the administrator interface or from the command line and differs depending on your environment. Starting in a Single Node Environment - administrator interface In a single node environment, to start the crawler from the administrator interface: 1. Select System Management on the navigation bar. 2. Locate the Enterprise Crawler on the Installed module list - Module name. Select the Start symbol to start the crawler. Starting in a Single Node Environment - command line Use the nctrl tool to start the crawler from the command line. Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information. Run the following command to start the crawler: 1. $FASTSEARCH/bin/nctrl start crawler Starting in a Multiple Node Environment - administrator interface In a multiple node environment, the ubermaster processes must be started up first, followed by individual crawler processes. In a multiple node environment, to start the crawler from the administrator interface: 1. Select System Management on the navigation bar. 2. Locate the ubermaster process, and select the Start symbol. 3. For all crawler processes, select the Start symbol. Starting in a Multiple Node Environment - command line In a multiple node environment, the ubermaster processes must be started up first, followed by individual crawler processes. Use the nctrl tool to start the crawler from the command line. Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information. To start the crawler from the command line: 1. On the node running the ubermaster process, run the command: $FASTSEARCH/bin/nctrl start crawler 2. On all nodes that run crawler processes (assuming the processes are named crawler), run the command: $FASTSEARCH/bin/nctrl start crawler Suspending/Stopping in a Single Node Environment - administrator interface The crawler is stopped (if running) and started when a configuration is updated. There is also a start/stop button for an existing crawler. In a single node environment, to stop the crawler from the administrator interface: 1. Select System Management on the navigation bar. 148 Operating the Enterprise Crawler 2. Locate the Enterprise Crawler on the Installed module list - Module name. Select the Stop symbol to stop the crawler. Suspending/Stopping in a Single Node Environment - command line Use the nctrl tool to stop the crawler from the command line. Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information. Run the following command to stop the crawler: 1. $FASTSEARCH/bin/nctrl stop crawler Suspending/stopping in a Multiple Node Environment - administrator interface In a multiple node environment, the individual crawler processes must be shut down first, followed by the ubermaster processes. In a multiple node environment, to stop the crawler from the administrator interface: 1. Select System Management on the navigation bar. 2. For all crawler processes, select the Stop symbol. 3. Locate the ubermaster process, and select the Stop symbol. The crawler will not stop completely before all outstanding content batches have been successfully submitted to FAST ESP and received by the indexer nodes. Monitor the crawler submit queue by waiting until the $FASTSEARCH/data/crawler/dsqueues folder (on the node running the crawler) is empty. 
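If you prefer to script the wait for the submit queue, the following small Python sketch (not part of the product) polls the default dsqueues location given above until it is empty; adjust the path if your data directory differs.

import os
import time

def wait_for_empty_dsqueues(poll_interval=30):
    # Default crawler submit queue location; adjust for your installation.
    path = os.path.join(os.environ["FASTSEARCH"], "data", "crawler", "dsqueues")
    while os.path.isdir(path) and os.listdir(path):
        print "%d entries left in %s, waiting..." % (len(os.listdir(path)), path)
        time.sleep(poll_interval)
    print "Submit queue is empty."

if __name__ == "__main__":
    wait_for_empty_dsqueues()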
Suspending/stopping in a Multiple Node Environment - command line In a multiple node environment, the individual crawler processes must be shut down first, followed by the ubermaster processes. Use the nctrl tool to stop the crawler from the command line. Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information. To stop the crawler from the command line: 1. On all nodes that run crawler processes (assuming the processes are named crawler), run the command: $FASTSEARCH/bin/nctrl stop crawler 2. On the node running the ubermaster process, run the command: $FASTSEARCH/bin/nctrl stop crawler The crawler will not stop completely before all outstanding content batches have been successfully submitted to FAST ESP and received by the indexer nodes. Monitor the crawler submit queue by waiting until the $FASTSEARCH/data/crawler/dsqueues folder (on the node running the crawler) is empty. Monitoring While the crawler is running, you can use the FAST ESP administrator interface or the crawleradmin tool to monitor and manage the crawler. Refer to the FAST ESP Configuration Guide for information about the administrator interface. Enterprise Crawler Statistics A detailed overview of statistics for each of the collections configured in the Enterprise Crawler is available in the FAST ESP administrator interface. 149 FAST Enterprise Crawler Navigating to the Data Sources tab will list all the available Enterprise Crawlers installed. For each Enterprise Crawler choosing List Collections will display all the collections associated with the particular Enterprise Crawler instance. For each collection there are a number of available options: Configuration - List the configured settings for the collection. Fetch log - See the last 5 minutes from the collection fetch log for the collection. Site log - See the last 5 minutes of the collection site log for the collection. Site statistics - View detailed statistics for a single web site in the collection. Input the web site you want to view detailed statistics for and choose Lookup. Note: Only web sites that have already been crawled can be viewed for statistics. Table 43: Site statistics for <web site> in collection <collection> Name Description Status The current crawl status for this web site. Possible values are: Crawling - The web site is currently being crawled. Idle - The web site is not being crawl at the moment. Document Store The number of documents in the crawler store for this web site. Statistics age The time since the last statistics update. Last URI The last URI crawled for this web site- Queue Length The current size of this web sites workqueue. For a description of the detailed statistics of a web site. See viewing detailed statistics about collection below. Statistiscs - View detailed statistics about collection. Table 44: Overall Collection Statistics Name Description Crawl status Displays the current crawl status of the crawler. Possible values are: Crawling, X sites active - The collection is crawling, X web sites are currently active. Idle - The collection is idle, no web sites are currently active. Suspended - The collection is suspended. Feed Status Possible values are: Feeding - The collection is currently feeding the content to ESP. Queueing - The collection is currently queueing content to disk and feeding to ESP is suspended. 150 Cycle Progress (%) The current collection refresh cycle progress. Calculated based on time until next refresh. 
Time until refresh The time until next refresh for this collection. Stored Documents The number of documents the crawler has stored. Unique Documents The number of unique documents the crawler has stored. Document Rate The current rate at which documents are downloaded. In Bandwidth The current inbound bandwidth the crawler is utilizing. Statistics Updated The time since the last statistics update. The Status for all collections link will display a summary of all the collections and some of their most interesting statistics. The Detailed Statistics link will display detailed statistics for the previous and the current crawl cycle, as well as the total for all crawl cycles. Table 45: Detailed Collection Statistics Processing Status Description Processed The number of documents requested by the crawler. Downloaded The number of documents downloaded by the crawler. Stored The number of documents stored by the crawler. Modified The number of stored documents that were modified. Unchanged The number of documents that were unchanged. Deleted The number of documents that were deleted by the crawler. Postprocess statistics Description ADD operations The number of ADD operations sent to ESP. DEL operations The number of DEL(ete) operations sent to ESP. MOD operations The number of MOD(ified) operations sent to ESP. Note: MODs are in reality sent as ADDs. URLSChange operations The number of URLSChange operations sent to ESP. URLSChanges contain updates to the URI equivalence class. Total Operations The total number of operations overall. Successful operations The number of successful operations overall. Failed operations The number of failed operations overall. Operation rate The rate, in operations per second, at which operations are sent to ESP. Network Description Document Rate The rate, in documents per second, at which documents are downloaded. In Bandwidth The current inbound bandwidth the crawler is utilizing. Out Bandwidth The current outbound bandwidth the crawler is utilizing. Downloaded bytes The total number of bytes the crawler has downloaded. Sent bytes The total number of bytes the crawler has sent. Average Document Size The average document size of the documents the crawler has downloaded. Max Document Size The maximum document size of the documents the crawler has downloaded. Download Time The accumulated time used to download documents. Average Download Time The average time to download a document. Maximum Download Time The maximum time to download a document. Mime Types Description <type>"/"<subtype> A breakdown of the various MIME types of the documents downloaded by the crawler. URIs Skipped Description NoFollow URIs skipped due to the link tag having a rel="NoFollow" attribute. Scheme URIs skipped due to not matching the collection Allowed Schemes setting. Robots URIs skipped due to being excluded by robots.txt. Domain URIs skipped due to not matching the collection domain include/exclude filters. URI URIs skipped due to not matching the collection URI include/exclude filters. Out of Focus URIs skipped due to being out of focus from the collection Focus crawl settings. Depth URIs skipped due to being outside the collection Crawl Mode depth settings. M/U Cache URIs skipped due to being screened by internal caches. Documents Skipped Description MIME Type Document skipped due to not matching the collection MIME-Types setting.
Header Exclude Document skipped due to matching the collection Header Excludes setting. Too Large Document skipped due to exceeding the collection Maximum Document Size setting. NoIndex RSS Document skipped due to the collection RSS setting Index RSS documents?. HTTP Header Document skipped due to errors with the HTTP header. Encoding Document skipped due to problems with the document encoding. Typically problems with compressed content. Chunk Error Document skipped due to problems with chunked encoding. Failure to de-chunk content. Incomplete Document skipped due to being incomplete. The web server did not return the complete document as indicated by the HTTP header. No 30x Target Document skipped due to not having a redirect target. Connect Error Document skipped due to a failure to connect() to the remote web server. Connect Timeout Document skipped because the connect() to the remote web server timed out. Timeout Document skipped because it took longer to download than the Fetch Timeout setting allows. Network Error Document skipped due to various network errors. NoIndex Document skipped due to containing a META robots No Index tag. Checksum Cache Document skipped due to being screened by the run-time checksum cache used for duplicate detection. Other Error Document skipped due to other reasons. Document Plugin Document skipped by the user specified document plugin. Empty Document Document skipped due to being 0 bytes. Protocol Response Codes Description <Response Code> <Response Info> A breakdown of the various protocol response codes received by the crawler. DNS Statistics (global) Description DNSRequests Number of issued DNS requests. DNSResponses Number of received DNS responses. DNSRetries Number of issued DNS request retries. DNSTimeout Number of issued DNS requests that timed out. DNS Statistics (global) Description <Response code> A breakdown of the DNS response codes received by the crawler. Possible responses are: NOERROR - The DNS server returned no error (hostname resolved). NXDOMAIN - The domain name referenced in the query does not exist (hostname did not resolve). FORMERR - The DNS server was unable to interpret the query. SERVFAIL - The DNS server was unable to process this query due to a problem on the server side. NOTIMP - The DNS server does not support the requested query. REFUSED - The DNS server refused to perform the specified operation. NOANSWER - The DNS record received from the DNS server did not contain an ANSWER section. PARTANSWER - The DNS record received from the DNS server contained only a partial ANSWER section. TIMEOUT - The DNS request timed out. UNKNOWN - An unknown DNS reply packet was received. Backup and Restore Crawler configuration is primarily concerned with collection specific settings. Backup of the crawler configuration will ensure that the crawler can be reconstructed to a state with an identical setup, but without knowledge of any documents. The crawler configuration is located in: $FASTSEARCH/data/crawler/config/config.hashdb To back up the configuration, stop the crawler, then save this file. It is also possible to export/import collection specific crawler configuration in XML format using the crawleradmin tool. This is not necessary for pure backup needs (as the config.hashdb file includes all the collection specific information, including statistics on previously gathered pages).
However, if a collection is to be completely recreated from scratch, having been deleted both from the crawler and the search engine, the XML-formatted settings should be used to recreate the collection, rather than using a restored crawler configuration database. Refer to the FAST ESP Operations Guide for overall system backup and restore information. Restore Crawler Without Restoring Documents To restore a node to the backed up configuration without restoring the documents: 1. Install the node according to the overall procedures in the Installing Nodes from an Install Profile chapter in the FAST ESP Operations Guide. 2. Restart the crawler. 3. Reload the backed up XML configuration file: $FASTSEARCH/bin/crawleradmin -f <configuration filename> Full Backup of Crawler Configuration and Data Backing up the crawler configuration only ensures that all information about individual collections and the setup of the crawler itself can be restored, but will trigger the sometimes unacceptable overhead of having to crawl and reprocess all documents over again. To be able to recover without this overhead, a full backup of the crawler is needed. To perform a full backup: 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler 2. Back up the complete directory on all nodes involved in crawling: $FASTSEARCH/data/crawler Note: Be sure to back up the duplicate server; this process often runs on a separate node from the crawler. 3. If keeping log files is desired, back up the log file directory. This is for reference only, and is not needed for the system backup. $FASTSEARCH/var/log/crawler Full restore of Crawler Configuration and Data To perform a full restore: 1. Install the node according to the overall procedures in the Installing Nodes from an Install Profile chapter in the FAST ESP Operations Guide. 2. Make sure the crawler is not running, then restore the backed up directory on each node to be restored: $FASTSEARCH/data/crawler 3. Start the crawler. The crawler will start re-crawling from the point where it was backed up, and according to the restored configuration. Re-processing Crawler Data Using postprocess This topic describes how to re-process the crawler data of one or several collections into the document processing pipeline without starting a re-crawl. The crawler stores the crawl data in meta storage and document storage. The metadata consists of a set of databases mapping URIs to their associated metadata (such as crawl time, MIME type, checksum, document storage location, and so forth). The crawler uses a pool of database clusters (usually 8) which in turn consist of a set of meta databases (one site database and multiple URI segment databases). Reprocessing the contents of collections involves the following process. Note that this is a somewhat simplified description of the actual inner workings of postprocess. Steps 1 and 2 run in parallel, and as step 1 is usually significantly faster, it also completes before step 2. 1. Meta databases are traversed site by site, extracting the URIs with associated document data on disk. Each site, along with the number of URIs stored in the meta database, is logged as traversal of that site commences. The number of URIs may therefore include duplicates. For each URI that is extracted, duplicate detection is performed by a lookup for the associated checksum in the postprocess database.
If the checksum does not exist (new/changed document) or the checksum is associated with the current URI, then the document is accepted as a unique document; otherwise it is treated as a duplicate. 2. Unique (non-duplicate) documents are queued for submission to document processing. 3. All documents due for submission are placed in a queue in $FASTSEARCH/data/crawler/dsqueues. Postprocess now serves the document processors from this queue, and terminates once all documents have been consumed. The duration of this phase is dictated by the capacity of the document processing subsystem. If you have a configuration scenario with the crawler on a separate node you may experience a recovery situation where index data has been lost (when running a single-row indexer). If the crawler node is fully operative, it is recommended to perform a full re-processing of crawled documents. This can be time-consuming, but may be the only way to ensure full recovery of documents submitted after the last backup. Single node crawler re-processing 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler 2. Delete the content of $FASTSEARCH/data/crawler/dsqueues. Only perform this step if you are re-feeding all your collections, otherwise you may lose content scheduled for submission for other collections. 3. Run the postprocess program in manual mode, using the -R (refeed) option: To do this: Use this command: Re-process a single collection $FASTSEARCH/bin/postprocess -R <collectionname> Re-process all collections (use an asterisk (*) with quotes instead of a <collectionname>) $FASTSEARCH/bin/postprocess -R "*" Re-process a single site by using the -r <site> option $FASTSEARCH/bin/postprocess -R <collectionname> -r <site> On UNIX make sure you either run postprocess in a screen, or use the nohup command, to ensure postprocess runs to completion. It is also considered good practice to redirect stdout and stderr to a log file. 4. Allow postprocess to finish running until all content has been queued and submitted. When postprocess has completed the processing, the following message is displayed: Waiting for ESP to process remaining data...Hit CTRL+C to abort If you want to start crawling immediately, then it is safe to shut down postprocess, since it has identified and enqueued all documents due for processing, as long as the crawler is later restarted so that the processing of the remaining documents can be serviced. The remaining documents will eventually be processed before any newly crawled data is processed. Otherwise, postprocess will eventually shut itself down when all documents have been processed. Press Ctrl+C, or send the process SIGINT if it is running in the background. Alternatively let postprocess run to completion and it will exit by itself. 5. Start the crawler: $FASTSEARCH/bin/nctrl start crawler. Multiple node crawler re-processing Re-processing the crawl data on a multiple node crawler is similar to the single node scenario, except that a multiple node crawl will include one or more duplicate servers. These must be running when executing postprocess. For more information on multiple node crawler setup, contact FAST Solution Services. Forced Re-crawling The procedure in section Re-processing Crawler Data Using postprocess on page 154 assumes that the crawler database is correct. This implies that it will only re-process already crawled documents to FAST ESP.
In case of a single-node failure or a crawler node failure, the last documents fetched by the crawler (after the last backup) will be lost. In this case you must instead perform a full re-crawl of all the collections. This will then re-fetch the remaining documents, assuming they are retrievable. In some cases documents may have disappeared from web servers, but still be present in the index. If this is the case these documents will have to be manually removed from the index. Use the following command to force a full re-crawl of a given collection: 1. crawleradmin -F <collection> 2. Repeat the command for each collection in the system. This will then ensure that all documents crawled after the last backup will be re-fetched. Note that the re-crawl may take a considerable amount of time to finish, but the index will be fully operative in the meantime. Purging Excluded URIs from the Index Normally postprocess will not check the validity of the URIs it processes, as this has already been done by the crawler during crawling. However, there are times when the include/exclude rules are altered and it is necessary to remove content that is no longer allowed (but was previously allowed) by the configuration. This can be accomplished by using the (uppercase) -X option. It will cause postprocess to traverse the meta databases as usual, but rather than processing the contents it will delete the contents that no longer match the configuration include and exclude rules. The contents that match are simply ignored, unless the (lowercase) -x option is also specified, in which case this content will be re-processed at the same time. Use the following command to remove excluded content from both the index and crawler store: 1. $FASTSEARCH/bin/postprocess -R <collectionname> -X The -X and -x options as described above assume that the crawler has already been updated with the new include/exclude rules. If you have the configuration in XML format, but have not yet uploaded the configuration, you can use the -u <XML config> option to tell postprocess to update the rules directly from the XML file (and store them in the crawler's persistent configuration database). Finally, the option -b instructs postprocess to re-check each URI against the robots.txt file for the corresponding server. The check uses the currently stored robots.txt file, rather than downloading a new one. The behavior for a URI that is no longer allowed for crawling by the robots.txt file is the same as if it had been excluded by the configuration. Aborting and Resuming of a Re-process To pause/stop postprocess while it is re-processing and then to resume postprocess, you can use one of the following procedures. Aborting and Resuming of a Re-process - scenario 1 1. Stop postprocess after it has traversed all meta databases. When postprocess has completed the processing, the following message is displayed: Waiting for ESP to process remaining data...Hit CTRL+C to abort This log message indicates that the traversing of the meta databases has finished, and the only remaining task is to submit all the queued data to FAST ESP, and wait for it to finish processing callbacks. 2. Press Ctrl+C or send SIGINT. 3. To resume the postprocess refeed, use: $FASTSEARCH/bin/postprocess -R <collectionname> -f Aborting and Resuming of a Re-process - scenario 2 1. Stop postprocess after it has traversed all meta databases.
When postprocess has completed the processing, the following message is displayed: Waiting for ESP to process remaining data...Hit CTRL+C to abort If this message is not displayed in the log then postprocess has not finished traversal, and is still logging the following message: "Processing site: <site> (16 URIs)" To resume postprocessing after stopping postprocess in this condition, you must use the -r <site> (resume after <site>) option in combination with the -R <collections> option. To determine which site to resume from, inspect the postprocess logs and find the site which was logged before the last Processing site log entry. For example, if the last two Processing site messages in the postprocess log are: Processing site: SiteA (X URIs) Processing site: SiteB (Y URIs) Start postprocess with -r SiteA to make sure that it will traverse the remaining sites. Since the log message is output before the site is traversed, this will ensure that SiteB is completely traversed. Crawler Store Consistency A consistency tool is included with the crawler to verify, and if necessary repair, consistency issues in the crawler store. The following sections describe how to use the consistency tool. Verifying Docstore and Metastore Consistency The following steps will first verify (and if necessary repair) the consistency between the document store and metadata store, and then perform the same verification between the verified metadata store and the postprocess checksum database. In case the tool removes documents we also ask it to keep the statistics in sync. The logs will be placed in a directory named after today's date under $FASTSEARCH/var/log/crawler/consistency. Make sure this directory exists before running the tool. In a multi node crawler setup these steps should be performed on all master nodes. Additionally you may also specify that the tool should verify the correct routing of sites to masters (details below). 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler. Verify that all crawler processes have exited before proceeding to the next step. 2. Run the command: $FASTSEARCH/bin/postprocess -R mytestcol -f . This will ensure there are no remaining documents to be fed to the indexer.
[2007-09-12 11:58:39] INFO systemmsg Feeding existing dsqueues only..
[2007-09-12 11:58:39] INFO systemmsg Waiting for ESP to process remaining data... Hit CTRL+C to abort
[2007-09-12 11:58:39] INFO systemmsg PostProcess Refeed exiting
3. (Optional) Copy the routing table from the ubermaster node. Note: This step only applies to multi node crawlers and is only necessary if you wish to verify the correct master routing of all sites. The routing table database can be found as $FASTSEARCH/data/crawler/config_um/mytestcoll/routetab.hashdb on the ubermaster and should overwrite $FASTSEARCH/data/crawler/config/mytestcoll/routetab.hashdb on the master nodes. 4. Create the directory $FASTSEARCH/var/log/crawler/consistency. The tool will create a sub directory inside this directory, in this example 20070912, where it will place the output logs in a separate directory per collection checked. The path to the output directory is logged by the tool on startup. 5. Run the command: $FASTSEARCH/bin/crawlerconsistency -C mytestcoll -M doccheck,metacheck,updatestat -O $FASTSEARCH/var/log/crawler/consistency Note: Ensure that you do not insert any space between the modes listed in the -M option. If the tool is being run on a master in a multi node crawler you may also add the routecheck mode.
6. Examine the output and log files generated.
[2007-09-11 16:23:10] INFO systemmsg Started EC Consistency Checker 6.7 (PID: 21542)
[2007-09-11 16:23:10] INFO systemmsg Copyright (c) 2008 FAST, A Microsoft(R) Subsidiary
[2007-09-11 16:23:10] INFO systemmsg Data directory: $FASTSEARCH/data/crawler
[2007-09-11 16:23:10] INFO systemmsg 1 collections specified
[2007-09-11 16:23:10] INFO systemmsg Mode(s): doccheck, metacheck, updatestat
[2007-09-11 16:23:10] INFO systemmsg Output directory: $FASTSEARCH/var/log/consistency/20070912
[2007-09-11 16:23:10] INFO mytestcoll Going to work on collection mytestcoll..
[2007-09-11 16:23:12] INFO mytestcoll Completed docstore check of collection mytestcoll in 1.6 seconds
[2007-09-11 16:23:12] INFO mytestcoll ## Processed sites : 5 (2.50 per second)
[2007-09-11 16:23:12] INFO mytestcoll ## Processed URIs : 5119 (2559.50 per second)
[2007-09-11 16:23:12] INFO mytestcoll ## OK URIs : 5119
[2007-09-11 16:23:12] INFO mytestcoll ## Deleted URIs : 0
[2007-09-11 16:23:12] INFO mytestcoll Document count in statistics left unchanged
[2007-09-11 16:23:12] INFO mytestcoll Processing 5119 checksums (all clusters)..
[2007-09-11 16:23:14] INFO mytestcoll Completed metastore check of collection mytestcoll in 1.8 seconds
[2007-09-11 16:23:14] INFO mytestcoll ## Processed csums : 5119 (2559.50 per second)
[2007-09-11 16:23:14] INFO mytestcoll ## OK csums : 5119
[2007-09-11 16:23:14] INFO mytestcoll ## Deleted csums : 0
[2007-09-11 16:23:14] INFO mytestcoll Finished work on collection mytestcoll
[2007-09-11 16:23:14] INFO systemmsg Done
In the example output above all URIs and checksums were found to be OK. If this was not the case then a mytestcol_deleted.txt file will contain the URIs deleted. Additionally, if a mytestcol_refeed.txt file was generated then the URIs listed there should be re-fed using postprocess (next step). 7. Run the command: $FASTSEARCH/bin/postprocess -R mytestcol -i <path to mytestcol_refeed.txt> Note: This step is only required in order to update the URI equivalence class of the listed URIs. 8. Start the crawler: $FASTSEARCH/bin/nctrl start crawler. Rebuilding the Duplicate Server Database This section explains the steps necessary to rebuild the duplicate server database, based on the contents of the postprocess database present on each master. It only applies to multi node crawlers, as single node crawlers do not require a duplicate server. Prior to performing this task it is recommended to first run the consistency tool as outlined in the previous section to ensure each node is in a consistent state. As part of this operation a set of log files will be generated and placed in a directory named after today's date under $FASTSEARCH/var/log/crawler/consistency. Make sure this directory exists before running the tool. To successfully rebuild the duplicate server databases it is vital that these steps be run on all master nodes. The crawler must not be restarted until all nodes have successfully completed the execution of the tool. 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler. Verify that all crawler processes have exited before proceeding to the next step. Perform this step on each master before proceeding to the next step. 2. Run the command: $FASTSEARCH/bin/postprocess -R mytestcol -f . This will ensure there are no remaining documents to be fed to the indexer. Perform this step on each master before proceeding to the next step. 3. Stop the duplicate server processes.
Wait until the processes have completed shutting down before moving to the next step. Depending on your configuration this may take several minutes. 4. Delete the per-collection duplicate server databases. These databases are usually located under $FASTSEARCH/data/crawler/ppdup/<collection> and should be deleted prior to running this tool to ensure there will not be "orphan" checksums recorded in the database. 5. Start the duplicate server processes. 6. Create the directory $FASTSEARCH/var/log/crawler/consistency. The tool will create a sub directory inside this directory, in this example 20070912, where it will place the output logs in a separate directory per collection rebuilt. The path to the output directory is logged by the tool on startup. 7. Run the command: $FASTSEARCH/bin/crawlerconsistency -C mytestcoll -M ppduprebuild -O $FASTSEARCH/var/log/crawler/consistency Note: This command will usually take several hours to complete. Progress information is logged regularly, but only applies per collection. Hence if you are processing multiple collections the subsequent collections are not accounted for in the reported ETA. 8. Examine the output and log files generated.
[2007-09-11 09:17:12] INFO systemmsg Started EC Consistency Checker 6.7 (PID: 18622)
[2007-09-11 09:17:12] INFO systemmsg Copyright (c) 2008 FAST, A Microsoft(R) Subsidiary
[2007-09-11 09:17:12] INFO systemmsg Data directory: $FASTSEARCH/data/crawler/
[2007-09-11 09:17:12] INFO systemmsg No collections specified, defaulting to all collections (2 found)
[2007-09-11 09:17:12] INFO systemmsg Mode(s): ppduprebuild
[2007-09-11 09:17:12] INFO systemmsg Connected to Duplicate Server at dupserver01:11100
[2007-09-11 09:17:13] INFO systemmsg Output directory: $FASTSEARCH/var/log/consistency/20070912
[2007-09-11 09:17:13] INFO mytestcoll Going to work on collection mytestcoll..
[2007-09-11 09:17:13] INFO mytestcoll Processing 5299429 checksums (all clusters)..
[2007-09-11 09:17:14] INFO systemmsg Received config ACK -> connection state OK
.....
[2007-09-11 12:01:21] INFO mytestcoll Duplicate Server rebuild status for mytestcoll:
[2007-09-11 12:01:21] INFO mytestcoll ## Processed csums : 5299429 (477.98 per second)
[2007-09-11 12:01:21] INFO mytestcoll ## OK csums : 5299429
[2007-09-11 12:01:21] INFO mytestcoll ## Deleted csums : 0
[2007-09-11 12:01:21] INFO mytestcoll ## Misrouted csums : 0
9. Start the crawler: $FASTSEARCH/bin/nctrl start crawler Redistributing the Duplicate Server Database This section explains the steps necessary to change the number of duplicate servers used by a collection in a multi node crawler setup. It only applies to multi node crawlers, as single node crawlers do not require a duplicate server. Prior to performing this task it is recommended to run a consistency check on the crawler store. Refer to Verifying Docstore and Metastore Consistency on page 157 for more information. 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler. Perform this step on each master as well as the ubermaster before proceeding to the next step. Note: Verify that all crawler processes have exited before proceeding to the next step. 2. Run the following command on each master node: $FASTSEARCH/bin/postprocess -R mytestcol -f . This will ensure there are no remaining documents to be fed to the indexer. 3. Stop the duplicate server processes. Wait until the processes have completed shutting down before moving to the next step. Note: Depending on your configuration this may take several minutes.
4. Delete the per-collection duplicate server databases. These databases are usually located under $FASTSEARCH/data/crawler/ppdup/<collection> and should be deleted prior to running this tool to ensure there will not be "orphan" checksums recorded in the database. 5. Create a partial XML configuration in order to specify a new set of duplicate servers. The following is an example XML configuration file for three duplicate servers. You should also specify appropriate duplicate server settings for the collection at this time.
<?xml version="1.0"?>
<CrawlerConfig>
  <DomainSpecification name="mytestcoll">
    <section name="pp">
      <attrib name="dupservers" type="list-string">
        <member> dupserver1:14200 </member>
        <member> dupserver2:14200 </member>
        <member> dupserver3:14200 </member>
      </attrib>
    </section>
    <section name="ppdup">
      <attrib name="format" type="string"> hashlog </attrib>
      <attrib name="stripes" type="integer"> 1 </attrib>
      <attrib name="cachesize" type="integer"> 512 </attrib>
      <attrib name="compact" type="boolean"> yes </attrib>
    </section>
  </DomainSpecification>
</CrawlerConfig>
Note: Make sure the collection name in the XML file matches the name of the collection you wish to update. 6. Update the configuration on the ubermaster node with the command $FASTSEARCH/bin/crawleradmin -f <path to XML file> -o $FASTSEARCH/data/crawler/config_um --forceoptions=dupservers. The --forceoptions argument allows the command to override the dupservers option, which is normally not changeable. 7. Update the configuration on each master node with the command $FASTSEARCH/bin/crawleradmin -f <path to XML file> -o $FASTSEARCH/data/crawler/config --forceoptions=dupservers 8. Start the duplicate server processes. 9. Rebuild the duplicate server. Refer to Rebuilding the Duplicate Server Database on page 159 for more information. Exporting and Importing Collection Specific Crawler Configuration The basic collection data is backed up (exported) and restored (imported) using the procedure described in the System Configuration Backup and Recovery section in the Operations Guide. This, however, does not include the data source configuration for the crawler. The crawler configuration can be set and read from the administrator interface, but it is also possible to export/import the crawler setup of a particular collection to XML format using the crawleradmin tool. If you intend to create a new collection using an exported crawler configuration, note the following: • The collection must be created prior to importing the crawler configuration. Create the collection in the normal way using the administrator interface but do not select a crawler, as this will import a default configuration from the administrator interface into the crawler. The effect of this is that some options in the XML configuration will not be able to take effect. Specifically: 1. Create the collection in the administrator interface but do not select a Data Source. 2. Import the XML configuration into the crawler using crawleradmin. 3. Edit the collection in the administrator interface to select the crawler. Select Edit Data Sources and add the crawler. Click OK on the Edit Collection screen and again on the Collections Details screen. Refer to the FAST ESP Configuration Guide, Basic Setup chapter for additional details on how to create a collection and integrate the crawler through the FAST ESP administrator interface. • Use the same name for the collection in the new system as in the old system.
The collection name is given by the content of the exported crawler configuration file. The collection name is also implicitly used within the configuration file, for example, related to folder names within the crawler folder structure. Note: If you want to use a different name for the new collection, you must change all references to the collection name within the exported XML file prior to importing it using crawleradmin -f. • In a multiple node crawler, the crawleradmin tool must always be run on the main crawler node (the node running the ubermaster process) to ensure that all nodes are updated. Fault-Tolerance and Recovery To increase fault-tolerance, the crawler may be configured to replicate the state of various components. The following sections describe how state is replicated in the different components, and how state may be recovered should an error occur. Ubermaster The ubermaster will incrementally replicate the information in its routing tables (the mapping of sites to masters) for a specific collection to all crawler nodes associated with that collection. If an ubermaster database is lost or becomes corrupted, the databases will be reconstructed automatically upon restarting the ubermaster. If the ubermaster enters a recovery mode it will query crawler nodes in that collection for their routing tables, which they will send back in batches. While in recovery mode, the ubermaster will accept URIs from crawler nodes, but will not distribute new sites to crawler nodes until recovery is complete for that collection. Duplicate server A duplicate server may be configured to replicate the state of another duplicate server. By starting it with the -R option, a duplicate server is configured to incrementally replicate its state: $FASTSEARCH/bin/ppdup -p <port> -R <host:port> -I <my_ID> where <host:port> specifies the address of the target duplicate server and <my_ID> specifies a symbolic identifier for the duplicate server. The target duplicate server will store replicated state in its working directory under a directory with the name <my_ID>. Conversely, a duplicate server is configured to replicate the state of another duplicate server by starting with the -r option: $FASTSEARCH/bin/ppdup -p <port> -r <port> When replication is activated, communication between a postprocess process and a duplicate server has transactional semantics. The duplicate server(s) performing replication on behalf of other duplicate servers may be used actively by postprocesses. The state of a duplicate server may be reconstructed by manually copying replicated state from the appropriate directory on the target duplicate server. A sketch of a primary/replica pair invocation is shown after the Crawler Node description below. Crawler Node There is no support for replicating the state stored on a crawler node. However, the crawler node state, if lost, will eventually be reconstructed by re-crawling the set of sites assigned to the node. In the course of crawling, the ubermaster will route URIs to the crawler, and from this the crawler node will gradually reconstruct its state with respect to crawled documents (assuming all documents are still available on the web servers). The postprocess databases on a crawler node will similarly recover over time, as each processed document will be checked against the duplicate server(s) in the installation. This will permit the URI checksum tables to be rebuilt, but it may not result in the same set of URI equivalences (duplicates) as had been previously indexed, leading to some unnecessary updates being sent to the search engine.
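To make the replication options above concrete, the following is a minimal sketch; the host names, ports, and identifier are illustrative assumptions, not values from this guide. One duplicate server runs as a replica listening for replication requests, and the primary points at it with -R while continuing to serve postprocess traffic on its own port.

# on the replica host (dupserver2), listen for incoming replication requests on port 14300
$FASTSEARCH/bin/ppdup -p 14200 -r 14300
# on the primary host (dupserver1), serve postprocess on 14200 and replicate state to dupserver2
$FASTSEARCH/bin/ppdup -p 14200 -R dupserver2:14300 -I dup1

With this pairing, the replica stores the primary's state under a directory named dup1 in its working directory, which is the directory you would copy back if the primary's state ever needs to be reconstructed.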
Chapter 5 Troubleshooting the Enterprise Crawler Topics: • Troubleshooting the Crawler This chapter describes how to troubleshoot problems you may encounter when using the crawler. Troubleshooting the Crawler This topic describes how to troubleshoot problems you may encounter when using the crawler. General Guidelines • Inspect logs The crawler logs a wide range of useful information, and these logs should always be inspected whenever a perceived error or misconfiguration occurs. These include the crawler log, which logs overall crawler status messages and exceptional conditions; the fetch log, which logs all attempted retrievals of documents; the screened log, which logs all documents for which retrieval is not attempted; the postprocess log, which logs the status of data feeding to FAST ESP; and the site and header logs. By default, all of these logs except the screened log are enabled. • Raise log level The level of detail in the crawler log is governed by the -l <level> option in the crawleradmin tool. Restarting the crawler with a given parameter propagates this setting to all components. • Inspect traffic trace of crawler network activity This can either be done by using a network packet trace utility such as ethereal or tcpdump on the crawler node, or by crawling through a proxy and inspecting the traffic passing through it. Both of these have shortcomings when encrypted transport, such as HTTPS, is used. • Inspect browser traffic If a particular behavior is expected from the crawler, a trace as suggested above can be examined alongside one generated by a web browser. For web browsers, client side debugging can be used to bypass the encryption for HTTPS. An example of such a utility is LiveHTTPHeaders (http://livehttpheaders.mozdev.org/). This is particularly useful when debugging cookie authentication schemes. Additional Information Reporting Issues When reporting operational issues, the following information is critical in order to get a fast and complete support response: • Crawler version Which version of the crawler you are running and, if applicable, which hotfixes to the crawler have been applied. • Platform Which operating system/platform you are running on. • FAST ESP version Which version of FAST ESP you are running the crawler against. • Crawler configuration All applicable <collection> configurations in XML format, as output by crawleradmin -G <collection> • Crawler log files All applicable <collection> crawler log files (fetch, pp, header, screened, dsfeed, site - as a minimum, fetch and dsfeed). These files are located in $FASTSEARCH/var/log/crawler/. • Crawler log files All available crawler.log* and any associated .scrap files. For multiple node installations, include dupserver.scrap.* (or equivalent). These files are located in $FASTSEARCH/var/log/crawler/. Known Issues and Resolutions This section provides resolutions to known issues for the crawler. #1: The crawler has problems reaching the license server or allocating a valid license.
A valid license served by the license manager (lmgrd) generates a log entry similar to the following in the $FASTSEARCH/var/log/lmgrd/lmgrd.scrap file: hh:mm:ss (FASTSRCH) OUT:"FastDataSearchCrawler" [email protected] If the crawler is having problems either reaching the license server (which may be on a remote node in a multiple node FAST ESP installation) or allocating a valid license, it will issue an error (Message A), and try a total of 3 times before exiting (Message B): Message A: "Unable to check out FLEXlm license. Server may be down or too many instances of the crawler are running. Retrying." Message B: "Unable to check out FLEXlm license. Shutting down. Contact Fast Search & Transfer (http://www.fastsearch.com/) for a new license." Please contact FAST Support for any licensing issues.
#2: How do I display, save, or import the configuration for a collection in XML format? To display the collection configuration, type the command: bin/crawleradmin -G <collection> You can save this collection configuration by redirecting or saving it to a file. To import a configuration from a file, type the command: bin/crawleradmin -f <filename>
#3: Postprocess reports that it is unable to obtain a database lock. The crawler is running on the system. If you stopped it, check the logs to make sure that the process has stopped. You may have to kill it manually if it still exists.
#4: The crawler does not fetch pages. The following areas can be checked:
• Verify the Start URIs list against the configured rules. Check the screened log. Check the URIs individually using the crawleradmin tool and the --verifyuri option: # crawleradmin --verifyuri <collection>:<start URI>
• If a site specified in the Start URIs list (or otherwise permitted under the rules) is not being crawled, it may be due to a robots.txt file on the remote web server. This is a mechanism that gives webmasters the ability to block access to content from some or all web crawlers. To check with a browser, request the page http://<site>/robots.txt. If it does not exist, the web server should return the HTTP status 404, Not Found; the same status will appear in the crawler fetch log.
• Check the DNS log in case the server does not resolve.
• Check the proxy if one is used.
• Check the log file.
If there are no clear errors, yet some pages are not being crawled, it may be due to the refresh cycle being too short to complete the crawl. Refer to Resolution #5 for resolution.
#5: Some documents are never crawled. Check your refresh interval (default = 1440 minutes) and refresh mode (default = scratch). If the interval is too short, some of the documents may never be crawled (depending on the refresh mode). You need to either increase the refresh interval or change the refresh mode of the crawler. The Refresh interval and Refresh mode can be changed from Edit Collection in the administrator interface. Note that the Refresh when idle option allows an idle crawl to start a new cycle immediately without waiting for the next scheduled refresh. All refresh modes include inserting the start URIs into the work queue. The work queue is the queue from which the crawler retrieves URIs. Refer to Refresh Mode Parameters on page 89 for valid refresh modes. If there are no clear errors, refer to Resolution #4 for additional checks.
#6: How do I back up the content retrieved by the crawler? Make a backup copy of the $FASTSEARCH/data/crawler/ folder. Complete the procedure described in Backup and Restore on page 153.
#7: Some documents get deleted from the index. Since the DB switch delete is set to off by default, no documents will be deleted unless they are irretrievable. Check to see if the DB switch delete has been turned on. There may also be a problem if the DB switch delete is on and the refresh interval is set too low. If so, then it is possible that the internal queue of your crawler is so large that certain documents do not get refreshed (or re-visited) by the crawler. If that is the case, you need to either change the mode of your crawler or increase the refresh rate or both.
#8: Documents are not removed when I change the exclude rules and refresh the collection. If you make changes to your configuration, only the configuration will be refreshed, not the collection. Collection refreshes are only triggered by time since the last refresh or by using the Force Re-fetch option. 1. In order to remove documents instantly, stop the crawler from System Management. With a multiple node crawler, it is recommended that you stop all instances of crawler and ubermaster. Do not stop the duplicate server, as this is required in order to run any postprocess refeed commands. 2. Run postprocess manually with the (uppercase) -X option (together with -R). By using the -X option, all URIs in the crawler store will be traversed and compared to the collection specification. URIs matching any excludes will then be deleted. Issue the command: $FASTSEARCH/bin/postprocess -R <collection> -X Note that to reprocess and delete documents that have been excluded by the configuration, you only need the -X (uppercase) switch as shown in this example. If you decide to add the -x (lowercase) switch, then everything else will be re-processed in addition to the verification and removal of excluded content. 3. Allow postprocess to finish running until all content has been queued and submitted. Note that it may take some time after postprocess exits before documents are fully removed from the index.
#9: The crawler uses a lot of resources, what can I do? If system resources are being overwhelmed because of the scale of the crawl being run, the ideal solutions are to:
• Ensure the correct configuration and caches
• Add hardware resources
• Reduce crawler impact
Refer to Configuring a Multiple Node Crawler on page 128 and Large Scale XML Crawler Configuration on page 130 for additional information. If configurations are correct, and it is not possible to add resources, then the next step is to try to reduce the impact of the crawl, either by reducing the scope of the crawl or by slowing the pace of the crawl. There are no definite answers to this issue. Go through your configuration and determine if you can:
• Suspend postprocess feeding of documents to FAST ESP: By default crawled pages are stored on disk, then fed to FAST ESP concurrently with the crawling of additional pages. By suspending the feeding of documents to FAST ESP, additional resources are made available to the crawling processes, thereby increasing their efficiency. Once the crawl is complete, feeding can be resumed to build the collection in the index. The commands to perform these tasks are: # crawleradmin --suspendfeed <collection> # crawleradmin --resumefeed <collection>
• Reduce the overall load on the crawler host by:
• increasing cache sizes, especially the postprocess cache size.
• reducing the number of complex include/exclude rules and rewrites.
• focusing the crawl on fewer sites/servers (include/exclude domain/URI paths).
• crawling fewer web sites at a given time by reducing Max concurrent sites (if I/O is overloaded then lowering this value may help increase performance).
• using a number of uberslaves equivalent to the number of CPUs (or even more, up to 8, for large scale crawls).
• lowering the frequency of page requests (request_rate, delay).
• lengthening the overall update cycle (refresh_interval).
• limiting the crawling schedule (variable_delay).
Depending on your answers, tune these parameters accordingly.
#10: I cannot locate all the documents in my index. Documents are kept a maximum time period of: dbswitch x refresh where dbswitch denotes the number of crawl cycles a previously fetched document is allowed to remain unseen by the crawler. If this limit is reached, the dbswitch-delete parameter will decide what happens to the document. If dbswitch-delete is set to yes, the document will be deleted, and if it is set to no, it will be scheduled for an explicit download check. If this check fails, the document will be removed. There are three approaches to avoid this situation: 1. Make sure all documents covered by the collection are crawled within the refresh period. 2. Set refresh_mode = scratch (default). The work queue will be emptied when a refresh starts, and the crawler starts from scratch. 3. Set dbswitch-delete = no (default).
#11: The crawler cookie authentication login does not work. To create a successful login configuration the goal is to have the crawler behave in a similar way to what a user and browser do when logging into the web site. In order to achieve this you can:
• Inspect traffic traces between the browser and server. Pay attention to the order in which pages are retrieved, what HTTP method is used, when the credentials are posted and what other variables are set.
• Use one of the available tools to do this:
• Mozilla LiveHTTPHeaders plugin, which lets you see the HTTP headers exchanged (even over encrypted transport as in HTTPS).
• Charles web proxy (shareware), which acts as a proxy and lets you inspect headers and bodies both as a tree and as a trace.
• Basic tools like tcpdump or ethereal can also be used. Note that only LiveHTTPHeaders will help you when HTTPS is used.
Remember to erase your browser's cache and cookies before obtaining a trace. Refer to Setting Up Crawler Cookie Authentication on page 115 for details on setting up the crawler cookie authentication; see the section Form Based Login on page 57 for additional information about setting up a forms-based login.
#12: The Browser Engine gets overloaded and sites get suspended. You should start by tuning the Browser Engine. Please refer to the FAST ESP Browser Engine Guide. In order to solve the problem you may need to tune the EC configuration. By decreasing the max_sites setting and/or increasing the delay, the number of documents sent from the EC to the Browser Engine may be reduced. The side effect is that the crawl speed may decrease. However, as the EC will start suspending sites if the Browser Engines get overloaded, the speed may not necessarily decrease. If this still does not solve the problem, you need to reduce the number of sites that use JavaScript and/or Flash processing. This is done by: 1. Disable JavaScript and/or Flash options in the main crawl collection. 2. Exclude the sites where you want to use JavaScript and/or Flash from the main collection. 3. Create a new collection. 4. Activate JavaScript and/or Flash in the new collection. 5.
Specify the sites that you want crawled using JavaScript/Flash in the new collection. 168 Chapter 6 Enterprise Crawler - reference information Topics: • • • • • Regular Expressions Binaries Tools Crawler Port Usage Log Files This chapter contains reference information for the Enterprise Crawler for the various binaries and tools. You will also find information about regular expressions, log files and ports. FAST Enterprise Crawler Regular Expressions Certain entries in the FAST ESP administrator interface collection specific screens request the use of regular expressions (regexp). Using Regular Expressions The following tables describe terminology used in this appendix. Table 46: Collection Specific Options Definitions Term Definition URI Uniform Resource Identifier - commonly known as a link and identifies a resource on the web. Example: http://subdomain.example.com/ Domain The domain/server portion of the URI - in the previous URI example, the Domain is the subdomain.example.com/. Path The path portion of the URI. For example, for the URI http://subdomain.example.com/shop, the path portion is /shop. Note: All patterns in the crawler are matched from the beginning of the line, unless specified otherwise Character Definition . Matches any character. * Repetition of the character 0 or more times. $ End of string. \ Escape characters that have a special meaning. .*\.gif$ Matches every string ending in .gif. .*/a/path/.* Matches any string with /a/path/ in the middle of the expression. .*\.example\.com All servers in the domain .example.com will be crawled. .*\.server\.com Matches any characters (string) followed by .server.com. Grouping Regular Expressions If the crawler needs to be configured with rewrite rules, as described in the URI rewrite rules entry in Table 5-1, then Perl-style grouping must be used. Grouping defines a regular expression as a set of sub-patterns organized into groups. A group is denoted by a sub-pattern enclosed in parenthesis. Example: If you want to capture the As and Cs in groups for the string: AABBCC then enclose the patterns for the As and Cs in parenthesis as shown in the following regular expression: (A*)B*(C*) 170 Enterprise Crawler - reference information Substituting Regular Expressions To perform regular expression substitution, you need a regular expression that is to be matched and a replacement string that should replace the matched text. The replacement string can contain back references to groups defined in the regular expression. With back references, the text matched by the group is used in the replacement. Back references are simply backslashes followed by an integer denoting the ordinal number of the group in the regular expression. Example: Using the regular expression described in the previous example, the following replacement string: \1XX\2 rewrites the string AABBCC to AAXXCC. Binaries The following sections describe the major Enterprise Crawler programs and the options and parameters they support. crawler The crawler binary is the master process, responsible for starting all other crawler processes. It also serves as the ubermaster process in a multiple node crawler installation. In addition to initialization of data directories and log files, the crawler is responsible for several centralized functions, including maintenance of the configuration database, handling communications with other FAST ESP components, resolving and caching hostnames and IP addresses, and routing sites to uberslave processes. 
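For orientation before the option reference, here is a hedged illustration of a standalone invocation that combines the data directory, uberslave count, and collection file options described in Table 47 below; the XML file name is a hypothetical example, not a value from this guide:

# start the crawler with 8 uberslave processes, an explicit data directory,
# and an XML file describing the collections to crawl (file name is illustrative)
$FASTSEARCH/bin/crawler -d $FASTSEARCH/data/crawler -c 8 -f $FASTSEARCH/etc/mycollections.xml

Within a FAST ESP installation the crawler is normally started through nctrl, as described in the Operating chapter; invoking the binary directly like this can be useful when experimenting with option combinations.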
Binary: $FASTSEARCH/bin/crawler [options] Table 47: crawler options Basic options -h Description Show usage information. Use this option to print a list with short description of the various options that are available. -P [<hostname>:] <crawlerbaseport> Use this option to specify an alternative crawler base port (XML-RPC interface). This option is useful if several instances of the crawler run on the same node. <hostname>: Set bind address for XML-RPC interfaces (optional). This field can be either a hostname or an explicit IP address. An actual IP address can also be used as some hosts have multiple IP addresses. <crawlerbaseport>: Set start of port number range that can be used by crawler. Default: 14000 9000 Note that uberslave processes will allocate ports from <port number>+10 and up. Furthermore, a specific interface to bind to can be specified. -d <path> Data storage directory. Use this option to store crawl data, runtime configuration and logs in subdirectories in the specified directory. Default: If the FAST environment variable is set then the default path is $FASTSEARCH/data/crawler; otherwise the default path is data. 171 FAST Enterprise Crawler Basic options Description -f <file> Specify collection(s). Use this option to specify the location of an XML file containing one or more collections. Read the contents of the file and start crawling the specified collection(s).The crawler will parse the contents of this file, add or update the collections contained within and start crawling. -c <number> Use this option to specify the number of uberslave processes to start. For larger crawls a process count of 8 is recommended. For larger crawls a process count equal to or greater than the number of CPUs is recommended. A maximum of 8 processes is supported. The number of processes should be equal to or less than the number of clusters defined in the collection specification. Default: 2 -v or -V Advanced options -D <number> This option prints the crawler version identifier and exits. Description Maximum DNS requests per second. The crawler has a built-in DNS lookup facility that may be configured to communicate with one or more DNS servers to perform DNS lookups. Use this option to limit the number of DNS requests per second that the crawler will send to the DNS server(s). The DNS resolver will automatically decrease the lookup rate if it detects that the DNS server is unable to handle the currently used rate. The actual rates can be seen in the collection statistics output. Default:100 requests -F <file> Specify the crawler global configuration file. Use this option to specify the location of an XML file containing the crawler global configuration. A crawler global configuration file is XML based and may contain default values for all command line options. Note that no command line switches may be specified in this configuration file. Also note that the crawler processes the command line switches in order. For example, if you use the -D option in ./crawler -F CrawlerGlobalDefaults.xml, the -D 20 will override any DNS request rate settings specified in the file. The crawler will on startup look for a startup file of default configuration settings. This option first attempts to locate the CrawlerGlobalDefaults.xml in the current directory. If not found it looks in $FASTSEARCH/etc directory. -n Shutdown crawler when idle. Use this option to signal that a crawler node should exit when it is idle. 
172 Enterprise Crawler - reference information Advanced options Description This option requires the refresh setting in a collection to be higher than the time required to crawl the entire collection. Default: disabled Logging options -L <path> Description Log storage directory. Use this option to store crawler specific logs in sub directories of the specified directory. Default: If the FAST environment variable is set then the default path is $FASTSEARCH/var/log/crawler; otherwise the default path is data/log. -q Disable verbose logging. Use this option to log CRITICAL, ERROR and WARNING log messages. -l <level> Log level. Use this option to specify the log level. This can be one of the following preset log levels: debug, verbose, info, warning, error Data search integration options -o Description DataSearch mode. Use this option when running the crawler in a FAST DataSearch or ESP setting. -i Ignore Config Server. Continue running even if the Config Server component is unreachable. Do not exit if Config Server cannot be reached. -p Publish Corba interface. Publish this address/interface for postprocess CORBA interfaces if enabled. Note: Applies to FDS 4.x only. Multiple node options -U Description Run as ubermaster in a multiple node setup. Start crawler as an ubermaster. Subordinate masters connect to the XML-RPC port by specifying the -S option. -S <ubermaster_host:port> Run as master in a multiple node setup. 173 FAST Enterprise Crawler Multiple node options Description Start crawler as subordinate (master) to another crawler (ubermaster). The <host:port> specifies the address of the ubermaster. Example: uber1.examplecrawl.net:27000 -s Survival mode. This option indicates that the subordinate master in a distributed setup should stay alive and try reconnecting to the ubermaster until a successful connection is made. This option only applies to the master. -I <ID> Symbolic name of crawler node. It is not normally necessary to use this option. In a multiple node crawler setup, each crawler node must be assigned a unique symbolic name, to be used in collection configurations when defining which crawler nodes to include in a crawl. This option only applies to the master. The default value is auto generated, and stored in the configuration database. If the option is used, and an alternative value is specified, this need only be done the first time the crawler is started. Environment variables FASTSEARCH_DNS Description The crawler will automatically attempt to detect the available DNS server(s). However, it is also possible to override the servers with this environment variable. The value of FASTSEARCH_DNS should be a semicolon separated list of DNS server IP addresses. Example: FASTSEARCH_DNS="10.0.1.33;10.0.1.34" An empty string may also be specified to force the use of the gethostbyname() API, rather than speaking directly with the DNS server(s). Example: FASTSEARCH_DNS="" postprocess Postprocess is used by the crawler to perform duplicate detection and document submissions to FAST ESP. It is, like the uberslave processes, automatically started with the crawler. The postprocess binary may also be run as stand alone - when the crawler is not running - to manually refeed documents in one or more collections. Postprocess is responsible for submission of new, modified and deleted documents as they are encountered by the crawler during a crawl. Before submission each document is checked against the duplicate database, unless duplicate detection is turned off. 
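As a hedged usage sketch before the option reference (the collection name and log file path are examples, not values from this guide), a typical standalone refeed on UNIX follows the earlier recommendation in the re-processing procedure to run postprocess under nohup and capture stdout and stderr:

# stop the crawler first, then refeed a single collection to FAST ESP;
# nohup plus output redirection lets postprocess run to completion unattended
$FASTSEARCH/bin/nctrl stop crawler
nohup $FASTSEARCH/bin/postprocess -R mycollection > /tmp/postprocess-refeed.log 2>&1 &

The crawler is restarted with nctrl once the refeed has completed.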
A URI equivalence class for each unique checksum is also maintained by postprocess, and updates to this class are submitted to FAST ESP in the form of changes to the 'urls' field. Only one document in a set of duplicates will be submitted and the rest will be part of the URI equivalence class. In addition to document submission, postprocess also outputs to the postprocess log. Refer to Log files and usage on page 197 for a description of the postprocess log. 174 Enterprise Crawler - reference information Binary: $FASTSEARCH/bin/postprocess [options] Table 48: postprocess options General options -h or --help Description Show usage information. Use this option to print a list with short description of the various options that are available. -l <level> Use this option to specify the log level. This can be one of the following preset log levels: debug, verbose, info, warning, error -P [<addr>:]<port number> Postprocess port. <port number> Set start of port number range that can be used by postprocess (default value is crawlerbaseport + 6). An optional IP address may be specified (by hostname or value). Default port: 9006 -U <file> Use the crawler global default configuration file. This option first attempts to locate the CrawlerGlobalDefaults.xml in the current directory. If not found it looks in $FASTSEARCH/etc directory. Conflicting options specified on the command line override the values in the configuration file if given. -d <path> Data storage directory. Use this option to store crawl data, runtime configuration and logs in subdirectories in the specified directory. Default: $FASTSEARCH/data/crawler -R <collections> Re-feed collections. Re-feed all documents to ESP even if documents have been added before. Specify <collections> as either a single collection or a comma separated list of collections (with no whitespace). Specify '*' to refeed all. Be sure to use the quote signs surrounding the asterisk, otherwise the shell will expand it. Refeed mode (-R) Only Description Note: You must stop the crawler before working in the refeed mode. Otherwise, postprocess will report a busy socket. -r <sitename> Resume re-feeding after the specified site (hostname0). This option may not be used at the same time as -s. Note: Specifying the special keyword @auto for <sitename> will make postprocess attempt to auto resume traversal from where your last refeed left off. 175 FAST Enterprise Crawler Refeed mode (-R) Only Description -s <sitename> Process only the specified sitename (hostname0). This option may not be used at the same time as -r. -x (lowercase x) Process all permitted URIs. Include all URIs matching the current collection include/exclude rules, while ignoring URIs that do not match. This is useful when also using the -u option to specify an updated collection specification XML file. -X (uppercase X) Issue delete for excluded URIs. Issues deletes for URIs that do not match the collection specification includes/excludes. All other URIs are ignored, unless combined with -x to also process all permitted URIs. This option is useful when -u is specified. -b Apply robots.txt exclusion to processing. Let -x and -X options apply to robots.txt exclusion as well. -u <file> Update includes/excludes from file. Updates the include and exclude regexps loaded from the configuration database with those from the specified collection specification XML file. -f Resume feeding existing dsqueues data. 
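For illustration, a few hedged postprocess refeed invocations that combine the refeed options above; the collection and file names are placeholders, and the crawler must be stopped first (for example with nctrl stop crawler).

Refeed a single collection:

$FASTSEARCH/bin/postprocess -R mycollection

Refeed all collections:

$FASTSEARCH/bin/postprocess -R '*'

Resume an interrupted refeed from where it left off:

$FASTSEARCH/bin/postprocess -R mycollection -r @auto

Refeed against an updated collection specification, processing all URIs that match the new rules and issuing deletes for those that do not:

$FASTSEARCH/bin/postprocess -R mycollection -u mycollection.xml -x -X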
-k <destination>:<collection> Override the feeding section specified in the collection configuration by specifying a destination (one specified in $FASTSEARCH/etc/CrawlerGlobalDefaults.xml) and a collection name. Alternatively specify the symbolic name of a feeding target as defined in the collection configuration, which then automatically maps down to feeding destination and collection name. ppdup In a multiple node crawler installation, a duplicate server is needed to provide a centralized duplicate detection function for each of the master/postprocessor hosts. The duplicate server can be configured using the ppdup binary. Binary: $FASTSEARCH/bin/ppdup [options] Table 49: ppdup options Option -h Description Show usage information. Use this option to print a list with short description of the various options that are available. 176 Enterprise Crawler - reference information Option -l <level> Description Use this option to specify the log level. This can be one of the following preset log levels: debug, verbose, info, warning, error -I <identifier> Symbolic duplicate server identifier. Use this option to assign a symbolic name to the duplicate server. This name is used when the state of the duplicate server is replicated by another duplicate server. -P [<addr>:]<port number> Port and optional interface. This option specifies the port to which postprocesses communicate to the Duplicate-Server in a multiple node setup. -r <port> Replication service port. This option enables "replica mode" for the duplicate server. The duplicate server will listen for incoming replication requests on the specified port. -R <host:port> Address of replication server. This option specifies the address of the duplicate server that should replicate the duplicate server state. The hostname specified must correspond to a server running the duplicate server with the -r option with the specified port. -d <path> Set current working data directory. This option specifies the working directory for the duplicate server. Default: If the FAST environment variable is set then the default path is $FASTSEARCH/data/crawler/ppdup; otherwise the default path is data. -c <cache size> Database cache size or hash size. When a storage format of "hashlog" is selected (see -S option) this value determines the size of the memory hash allocated. If the number of documents stored into the hash exceeds the available capacity the hash will automatically be converted into a disk hash and resized (2x increments). If a storage format of "diskhashlog" is selected the value determines the initial size of the hash on disk. For each overflow (whenever capacity is exceeded) the hash is resized, as described above. When the storage format is "gigabase" the value specifies the amount of memory to reserve for database caches. Note that this value is per collection. If multiple collections are used then each collection will allocate the specified amount of cache/memory/disk. Furthermore, if the duplicate server is being run as both a primary and a replica then twice the resources will be consumed. Default: 64 -s <stripes> Number of stripes. This option sets the number of stripes (separate files) that will be used by the duplicate server databases. 177 FAST Enterprise Crawler Option Description Default: 1 -D Direct I/O. This option specifies that the duplicate server should enable direct I/O for its databases. Enable only if supported by the operating system. -S This option specifies which database storage format to use. 
<hashlog|diskhashlog|gigabase> The "hashlog" format will initially allocate a memory based hash structure with a data log on disk. The size of the memory hash is specified by the -c option described separately. If the hash overflows it will automatically be converted into a "diskhashlog". The "diskhashlog" format is similar, but a disk based hash structure and the "gigabase" format is a database structure on disk. Default: hashlog -N -F Disable nightly compaction of duplicate server databases. Specify the crawler global configuration file. Use this option to specify the location of an XML file containing the crawler global configuration. A crawler global configuration file is XML based and may contain default values for all command line options. Note that no command line switches may be specified in this configuration file. Also note that the crawler processes the command line switches in order. For example, if you use the -D option in ./crawler -F CrawlerGlobalDefaults.xml, the -D 20 will override any DNS request rate settings specified in the file. The crawler will on startup look for a startup file of default configuration settings. This option first attempts to locate the CrawlerGlobalDefaults.xml in the current directory. If not found it looks in $FASTSEARCH/etc directory. -v or -V Print version ID. This option prints the ppdup version identifier. Tools The Enterprise Crawler has a suite of related tools that can be used to perform tasks ranging from quite general to extremely specific. Care should be exercised before using any of these programs, and backing up data is always a prudent consideration. crawleradmin The crawleradmin tool is used for configuring (XML configs), monitoring (statistics and various other calls) and managing (seeding, forcing of refreshing, reprocessing, suspending/resuming crawl/feed). Tool: $FASTSEARCH/bin/crawleradmin: option [options] Table 50: crawleradmin return codes 178 Enterprise Crawler - reference information Return code 0 1 2 3 4 5 6 10 11 Description Command successfully executed. An error occured. See error text for more details. Command line error. An unrecognized command was specified, or the arguments were incorrectly formatted. The collection specified on the command line does not exist. The command failed because it requires the crawler to be stopped and the --offline or -o flag to be specified. An error was encountered attempting to read a file, or some other I/O operation failed. See error text for more details. Statistics is not yet available for the specified collection/site. An error was reported by the master. See error text for details. A socket error was encountered trying to connect to the master. Table 51: crawleradmin options General options Description --crawlernode <hostname:port> Manage crawler at the specified hostname and port. or -C hostnameport Default: localhost:14000 <hostname:port> --offline or -o <configdir> Work in offline mode; crawler is stopped. Offline mode assumes the default configuration directory, $FASTSEARCH/data/crawler/config or just data/config if the FASTSEARCH environment variable is not set. This option can be used together with the following options: -a, -d, -c, -q, -G, -f, -d, --getdata and --verifyuri -l <log level> Specify log level. Use this option to specify the log level. This can be one of the following preset log levels: debug, verbose, info, warning, error --help or -h Print usage information. Use this option to print a list with short description of the various options that are available. 
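The return codes above make crawleradmin convenient to use from scripts. A couple of hedged examples follow; the host, port and collection names are placeholders, and the individual management and statistics options are described in the tables below.

Query a crawler running on a non-default port and check the result in a Bourne-type shell:

$FASTSEARCH/bin/crawleradmin -C crawlhost1:15000 -q mycollection
echo $?

A return code of 0 indicates success, while 6 indicates that statistics are not yet available for the collection.

Display crawl statistics for all collections on that node:

$FASTSEARCH/bin/crawleradmin -C crawlhost1:15000 -c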
Crawler configuration options --addconfig <file> or -f <file> Description Add or update collection configuration(s) from the specified XML file. 179 FAST Enterprise Crawler Crawler configuration options Description --collectionconfig <collection> or -g <collection> Display the configuration for the specified collection. --getcollection <collection> or -G <collection> --delcollection <collection> or -d <collection> Output the XML configuration for the specified collection to stdout. Redirect the stdout output to a file to save the configuration. Delete collection (including all crawler storage). Note that this has no effect on FDS/ESP collection configuration elements such as pipeline or index. Crawler control options --shutdown or -x Description Shutdown the crawler. Do not use this option when integrated with FAST ESP, as nctrl will restart crawler. Use nctrl stop crawler instead. --suspendcollection <collection> or -s <collection> --resumecollection <collection> or -r <collection> --suspendfeed <collection>[:targets] --resumefeed <collection>[targets] --enable-refreshing-crawlmode <collection> --disable-refreshing-crawlmode <collection> Suspend (pause) crawling of <collection>. Feeding will continue if there are documents in the feed queue. Resume crawling of <collection>. Suspend (pause) FAST ESP feeding for <collection>. Optionally specify a comma separated list of feeding targets (symbolic names found in the collection configuration). Resume FAST ESP feeding for <collection>, optionally the specified feeding targets. Enable the 'refresh' crawl mode for the specified collection. When enabled, the crawler will only crawl/refresh URIs that previously have been crawled. Disable the 'refresh' crawl mode for the specified collection, and resume to normal crawl mode. URI submission, refetching and refeeding Description -adduri <collection>:<URI> or -u <collection>:<URI> Append specified <URI> to <collection> work queue. Can be combined with the --force flag to prepend the URIs and crawl them immediately. -addurifile <collection>:<file> Append all URIs from the specified <file> to <collection> work queue. Can be combined with the --force flag to prepend the URIs and crawl them immediately. --refetch <collection> or -F <collection> 180 Force re-fetch of <collection>. Enterprise Crawler - reference information URI submission, refetching and refeeding Description This will cause the crawler to erase all existing work queues (regardless of refresh mode) and clear all caches, start a new crawl cycle and place all known start URIs on the work queue. This will not increment the counter used for orphan detection (dbswitch) unlike normal refreshes. --refetchuri <collection>:<URI> or -F <collection>:<URI> Force re-fetch of <URI> in <collection>. The URI does not need to be previously crawled. However, it must fall within the include/exclude rules for the <collection>. This also (as a side effect) triggers crawling of the site to which the URI belongs (unless this site has already been crawled in this refresh period). --refetchsite <collection>:<URI> --force --feed --refeedsite <collection>:<web site> Force re-fetch of site from <URI> in <collection>. Used with --adduri/addurifile/refetchuri/refetchsite to make sure the URI gets attention immediately (by potentially preempting active sites). Used with --refetchuri and --refetchsiteto also have the URIs refed to FAST ESP indexing, regardless of whether the documents have changed. 
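As further hedged examples of URI submission (the collection and file names are placeholders):

Seed a list of URIs from a file and crawl them immediately:

$FASTSEARCH/bin/crawleradmin --addurifile mycollection:/tmp/seed_uris.txt --force

Force a re-fetch of an entire collection, clearing work queues and caches and starting a new crawl cycle:

$FASTSEARCH/bin/crawleradmin --refetch mycollection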
Refeed all documents in the crawler store for <web site> to FAST ESP indexing. This is equivalent to running postprocess refeed on a single site, but does not require stopping the crawler. Due to the implementation of this feature, it is advisable to limit the amount of concurrent re-feeds at run time to prevent overloading the crawler. The URIs you refeed end up in a high priority queue. This means it doesn't have to wait for other docs currently waiting to be fed to the ESP. Feeding to the ESP will be done from both the high priority queue and the normal priority queue at the same time, so there might be a little delay before the document is visible in the search. --refeeduri <collection>:<URI> --refeedprefix <prefix> --refeedtarget <destination>:<collection> Refeed the specified URI from the crawler store to FAST ESP indexing. See --refeedsite above for more information. Specify a URI prefix (including scheme) that URIs must match to be re-fed. Only applicable with the --refeedsite option. Specify a feeding destination and collection to which the specified refeed command will feed URIS to. Only applicable with the --refeedsite option. 181 FAST Enterprise Crawler Preempting, blacklisting and deletion Description --preemptsite <collection>:<web site> or -p <collection>:<web site> Preempt crawling of site <web site> in <collection>. --blacklist <collection>:<web site>:<time> --unblacklist <collection>:<web site> --deletesite <collection>:<web site> --deluri <collection>:<URI> --delurifile <collection>:<file> Statistics options --collstats <collection> or -q <collection> --collstatsquiet <collection> or -Q <collection> --statistics or -c Blacklist <web site> from crawling in <collection> for <time> seconds. Remove blacklisting of <web site> in <collection>. Delete <web site> in <collection> from crawler. Delete <URI> in <collection>. Delete URIs in <file> from <collection>. Description Display crawl statistics for <collection>. Display abbreviated version of crawl statistics for <collection>. Display crawl statistics. Refer to crawleradmin statistics on page 183 for more information. --sitestats <collection>:<web site> --cycle (1,~) Monitoring options Statistics for <web site> in <collection>. Combine with any/all of the Statistics options listed in this table to display statistics for the specified refresh cycle. Use all to merge all refresh cycles. Default is current cycle. Description Note: id equals all or host:number or number. --status --nodestatus --active or -a --numslaves or -n --slavestatus <id> or -S <id> --numactiveslaves <id> or -N <id> --sites <id> or -t <id> 182 Display status for all collections. Display status (per node) for all collections. Display all active collection names. Display the number of sites currently being crawled. Show site status for uberslave process <id>. Show number of active sites for uberslave process <id>. List sites currently being crawled by uberslave <id>. Enterprise Crawler - reference information Monitoring options Description --starturistat Display feeding status of start URI files. Debugging options Description --verifyuri <collection>:<URI> Output information if an <URI> can be crawled and indexed in the <collection>. This option checks against the following crawler parameters: include_uris, include_domains, exclude_uris, exclude_domains, allowed_schemes, allowed_types, force_mimetype_detection, rewrite_rules, robots, max_redirects, refresh_redir_as_redir, max_uri_recursion, search_mimetype and check_meta_robots. 
Note that there still may be reasons why an URI is not crawled, e.g. DEPTH or due to an URI being dropped by a crawler document plugin. crawleradmin statistics Running crawleradmin -c provides statistics for all collections active in the crawler. Directing a statistics lookup to the administrator interface of the ubermaster will produce aggregated statistics for all crawler nodes. Statistics for a specific node in a multiple node crawler setup may be produced by directing the lookup to the administrator interface of the particular node. The following provides a sample statistics output: Brief statistics for collection <collection> ============================================ All cycles ========== Running time Average document rate Downloaded (tot/stored/mod/del) Document store (tot/unique) Document sizes (avg/max) : : : : : 20.21:29:38 44.69 dps 80,687,225 URIs / 41,886,951 / 10,702,819 / 6,186,202 24,997,930 URIs / ~24,254,800 24.14 kB / 488.28 kB Current cycle (57) ================== Running time Stats updated Status Progress Document rate (curr/avg) In bandwidth (curr/avg/tot) : : : : : : 01:46:40 22.6s ago Crawling, 4,482 sites active 26.9% 51.66 dps / 42.88 dps 6.28 Mbps / 5.38 Mbps / 4.01 GB Downloaded (tot/stored/mod/del) : 274,451 URIs / 101,831 / 49,743 / 15,156 Download times (avg/max/acc) : 19.7s / 07:37 / 59.11:56:00 DNS overview -----------Requests (tot/retries/timeouts) : 2,192,245 / 206,290 / 134,132 Request rate (curr/avg/limit) : 0.8 rps / 1.2 rps / 75 rps crawleradmin examples The following examples show some of the crawleradmin options being used for the collection named mytestcoll. 183 FAST Enterprise Crawler Extract crawler XML configuration To get crawler configuration file information: $FASTSEARCH/bin/crawleradmin -G mytestcoll > mytestcoll.xml Note that the name of the configuration file (mytestcoll.xml in this example) does not need to be the same as the collection name. When restoring the collection the actual name of the collection is given by the name of the DomainSpecification element in the configuration file. Add/update crawler XML configuration To restore or update a collection configuration from a saved file: $FASTSEARCH/bin/crawleradmin -f mytestcoll.xml Delete collection from crawler only To remove a collection from the crawler's configuration, and delete the stored data: $FASTSEARCH/bin/crawleradmin -d mytestcoll Note that this command has no effect on the collection in the index. Crawler collection statistics To display collection statistics: $FASTSEARCH/bin/crawleradmin -Q mytestcoll Replace uppercase Q with lowercase Q for more details. 
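The following additional examples are illustrative sketches built from options described earlier in this section; they use the same example collection mytestcoll, and host names are placeholders.

Display aggregated multiple node statistics
To display statistics aggregated across all crawler nodes, direct the lookup at the ubermaster administrator interface:

$FASTSEARCH/bin/crawleradmin -C uber1.examplecrawl.net:27000 -c

Verify whether a URI would be crawled
To check a URI against the include/exclude rules and related settings of the collection:

$FASTSEARCH/bin/crawleradmin --verifyuri mytestcoll:http://www.example.com/test_pages/x1.html

Temporarily blacklist a site
To blacklist a web site for one hour (3600 seconds), and later remove the blacklisting:

$FASTSEARCH/bin/crawleradmin --blacklist mytestcoll:www.example.com:3600
$FASTSEARCH/bin/crawleradmin --unblacklist mytestcoll:www.example.com

Enable refresh-only crawling
To restrict the crawler to refreshing previously crawled URIs only:

$FASTSEARCH/bin/crawleradmin --enable-refreshing-crawlmode mytestcoll

Use --disable-refreshing-crawlmode to return to normal crawl mode.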
Force re-crawling of a site To force a re-crawl (re-fetch) a site: $FASTSEARCH/bin/crawleradmin --refetchsite mytestcoll:www.example.com Force re-crawling a single URI To re-crawl a specific URI immediately: $FASTSEARCH/bin/crawleradmin --refetchuri mytestcoll:http://www.example.com/test_pages/x1.html --force Force re-crawling and refeeding a single URI To re-crawl and refeed a specific URI immediately: $FASTSEARCH/bin/crawleradmin --refetchuri mytestcoll:http://www.example.com/test_pages/x1.html --force --feed Refeed a site while crawling To refeed a site to ESP for processing and indexing: $FASTSEARCH/bin/crawleradmin --refeedsite mytestcoll:www.example.com You can also specify a different feeding destination on the command line: $FASTSEARCH/bin/crawleradmin --refeedsite mytestcoll:www.example.com --refeedtarget otheresp:mytestcoll 184 Enterprise Crawler - reference information Suspending/resuming crawling To suspend the crawling of a collection: $FASTSEARCH/bin/crawleradmin --suspendcollection mytestcoll To resume crawling use --resumecollection. Suspending/resuming content feeding To suspend the content feed to ESP processing and indexing: $FASTSEARCH/bin/crawleradmin --suspendfeed mytestcoll If the collection has multiple destinations specified in the configuration, you can suspend an individual destination by doing: $FASTSEARCH/bin/crawleradmin --suspendfeed mytestcoll:mydest To resume feeding use --resumefeed. crawlerdbtool The crawlerdbtool lists all documents/URLs that the crawler knows about for each collection. To use: 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler 2. On the crawler node, run this command: crawlerdbtool -m list -d datasearch/data/crawler/store/test/db/ -S all Table 52: crawlerdbtool Options Option -m <mode> Description Operation mode. Valid modes: check - Report corrupt databases only. repair - Attempt repair of corrupt databases by copying elements to new databases. New databases are verified before they replace the corrupt databases. delete - Delete corrupt databases. compact - Compacts a database (specify filename or directory) or document store cluster (specify cluster directory). list - Outputs all keys in a database. count - Counts the number of keys in a database. view - View an entry in a database based on the key specified with -k. If none specified then all keys are output. viewraw - As above but without any formatting. export - Export a database to marshalled data. import - Imports a database from marshalled data. analyze - Analyzes a meta database. pphl2gb - Convert a postprocess checksum database from hashlog to gigabase format. Default: check 185 FAST Enterprise Crawler Option Description -d <dir> Specifies the directory/file to process. Must be specified except in 'align' mode. The -f option below is ignored if a file is specified. -f <filemask> Specifies the filemask/wildcard to work on. Can be repeated. Default: * -c <cachesize> Specify the cache size (in bytes) to be used when opening databases. Default: 8388608 -s <frequency> Database sync frequency during repair. Specifies the number of operations between each sync. A value of 1 will sync after each operation. Default: 10 -t <timeout> Specify a timeout in seconds after which a database check/repair process is terminated. child is assumed dead and killed. The database will be assumed corrupt beyond repair and will be deleted. Caution: Use with caution. Default: none -k <key> Only applicable in view mode. Specifies the database key to view. 
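A couple of further hedged crawlerdbtool examples; remember to stop the crawler first, and note that the store path is illustrative.

Check all databases under a collection store for corruption (check is also the default mode):

crawlerdbtool -m check -d datasearch/data/crawler/store/test/db/

Attempt repair of corrupt databases in the same location:

crawlerdbtool -m repair -d datasearch/data/crawler/store/test/db/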
-K <key> Same as -k, but assumes key is repaired and will call eval() on it before using. Use this for checksums. -i <intermediate format> Only applicable in import/export mode. The selected format will be exported to or imported from. Valid formats: marshal - fast space-efficient format pickle - version and platform independent format Default: marshal -S <site> Specify a site to apply the current mode to. Use this for inspecting meta databases. If site is "all", all sites will be traversed. If site is "list" all sites will be listed. crawlerdbtool examples Note: Before running the crawlerdbtool make sure the crawler is stopped, as the tool cannot be run concurrently. List documents from a server The command to list all documents crawled from a server within a collection would be: crawlerdbtool -m list -d datasearch/data/crawler/store/test/db/ -S web001.example.net 186 Enterprise Crawler - reference information where web001.example.net is the server, and test is the name of the collection. Output: 'http://web001.example.net/Island/To.html' 'http://web001.example.net/in/and.html' 'http://web001.example.net/For/3).html' 'http://web001.example.net/for/services.html' List sites from a collection To list all known sites within a collection, use the command: crawlerdbtool -m list -d datasearch/data/crawler/store/test/db/ -S all Output: web001.example.net web000.example.net URI statistics for a collection To list statistics for all URIs crawled within a collection, use the command: crawlerdbtool -m analyze -d datasearch/data/crawler/store/test/db/ -S all Output: same as Example #3 showed for entire collection. URI statistics for a server To get statistics for URIs crawled from a specific server within a collection, use the command: crawlerdbtool -m analyze -d datasearch/data/crawler/store/test/db/ -S web001.example.net Output: Enterprise Crawler 6.7 - DB Check Utility Copyright (c) 2008 FAST, A Microsoft(R) Subsidiary Current options are: - Mode : analyze - Timeout : None - Directory : datasearch/data/crawler/store/test/db/ - File masks : * - Cachesize : 8388608 Site Report ================================================= Document and URIs Avg. Doc Size Data Volume JavaScript URIs Redirect URIs Total URIs Unique CSUMs : : : : : : 2.06 kB 18.43 MB 0 0 9141 9126 Mime-Types 187 FAST Enterprise Crawler text/html : 9141 List URIs (keys) from a database To list the URIs (or sites) within a given database file, use the list option as in the following command: crawlerdbtool -m list -d data/store/example/db/1/0.metadb2 'http://www.example.com/' 'http://www.example.com/maps.html' 'http://www.example.com/bart/bart.jsm' 'http://www.example.com/metro/metro.jsm' 'http://www.example.com/planimeter/planimeter.jsm' 'http://www.example.com/comments/' 'http://www.example.com/software/' 'http://www.example.com/software/micro_httpd/' 'http://www.example.com/software/mini_httpd/' 'http://www.example.com/software/thttpd/' 'http://www.example.com/software/spfmilter/' 'http://www.example.com/software/pbmplus/' 'http://www.example.com/software/globe/' 'http://www.example.com/software/phoon/' 'http://www.example.com/javascript/MapUtils.jsm' 'http://www.example.com/software/saytime/' 'http://www.example.com/javascript/Utils.jsm' Success View record of a specific database key The output of the previous command provides the keys to the data stored within each database. This can be specified with the -k option in view mode, to see all details associated with that URI or site, as in the following examples. 
crawlerdbtool -m view -d data/store/example/db/1/0.metadb2 -k 'http://www.example.com/maps.html' key (meta): 'http://www.example.com/maps.html' MIME type Crawl time Errors Compression Parent State flag Checksum : : : : : : : text/html 2006-12-21 18:54:09 None deflate None 0 c2f963f3b56e1495abad9c8b89ab41f5 Change history : (0, 0, 0, 1166723649) Links : http://mapper.example.com/ http://mapper.example.com/ http://www.example.com/ http://www.example.com/ http://www.example.com/GeoRSS/ http://www.example.com/GeoRSS/ http://www.example.com/bart/ http://www.example.com/bart/ http://www.example.com/javascript/ http://www.example.com/javascript/ http://www.example.com/jef/ggs/ http://www.example.com/jef/ggs/ http://www.example.com/jef/hotsprings/ http://www.example.com/jef/hotsprings/ http://www.example.com/jef/outlines/ http://www.example.com/jef/outlines/ http://www.example.com/jef/paris_forts/ 188 Enterprise Crawler - reference information http://www.example.com/jef/paris_forts/ http://www.example.com/jef/transpac2005/ http://www.example.com/jef/transpac2005/ http://www.example.com/mailto/?id=wa http://www.example.com/mailto/?id=wa http://www.example.com/mailto/wa.gif http://www.example.com/mailto/wa.gif http://www.example.com/metro/ http://www.example.com/metro/ http://www.example.com/planimeter/ http://www.example.com/planimeter/ http://www.example.com/resources/images/atom_ani.gif http://www.example.com/resources/images/atom_ani.gif http://www.google.com/apis/maps/ http://www.google.com/apis/maps/ Maxdoc counter : 2 Last-Modified : Tue, 11 Apr 2006 13:35:18 GMT Epoch ETag Flags Previous Checksum Referrers : : : : : 0 None 0 None http://www.example.com/ : Fileinfo : ('example/data/1', 1217, 65539) HTTP header : HTTP/1.1 200 OK Server: thttpd/2.26 ??apr2004 Content-Type: text/html; charset=iso-8859-1 Date: Thu, 21 Dec 2006 17:54:07 GMT Last-Modified: Tue, 11 Apr 2006 13:35:18 GMT Accept-Ranges: bytes Connection: close Content-Length: 4068 Adaptive epoch (upper) : 0 Adaptive rank : 7920 Level (min/current/max) : (1, 1, 1) # crawlerdbtool -m view -d data/store/example/db/1/site.metadb2 Site: 'www.example.com' Internal ID : 0 Hostname : www.example.com Alias : None Adaptive data : awo : (12, 0) awe : 2 Epoch details : Last refresh (upper) Clean epoch Last refresh Previous adaptive epoch Epoch Epoch (upper) Subdomain list IP address Mirrors Last seen Segment number Maxdoc limit : : : : : : : : : : : : 2007-01-09 17:34:49 2 2007-01-09 17:34:49 0 2 0 empty 192.168.178.28 None 0 0 0 crawlerconsistency The consistency tool is used for verifying and repairing the consistency of the crawler document and meta data structures on disk. 189 FAST Enterprise Crawler The consistency tool has two main uses. It can be used as a preventive measure to verify and maintain internal crawler store consistency, but also as part of recovering a damaged crawler store. The tool will detect, and by default also attempt to repair, the following inconsistencies: • • • • • Documents referenced in meta databases, but not found in the document store Invalid documents in the document store Unreferenced documents in the document store (requires docrebuild mode) Duplicate database checksums not found in meta databases Multiple checksums assigned to the same URI in the duplicate database The above list of inconsistencies are automatically corrected by running the tool in the doccheck or docrebuild mode, followed by the metacheck mode. 
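For illustration, a consistency run over a single collection combining these modes might look as follows; the -M, -O, -d and -C options are described in the options table below, and the paths and collection name are placeholders:

$FASTSEARCH/bin/crawlerconsistency -M doccheck,metacheck -O /tmp/consistency-logs -d $FASTSEARCH/data/crawler -C mycollection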
Any URIs found to be non-consistent will be output to a log file (see below), and a delete operation will also be issued to the indexer (can be disabled) to ensure it is in sync. Refer to Crawler Store Consistency on page 157 for more information. In a multi node crawler environment the tool can also be used to rebuild a duplicate server from the contents of per-master postprocess checksum databases, using the ppduprebuild mode. Since this mode builds the duplicate server from scratch it can also be used to change the number of duplicate servers in use, by first changing the configuration of the collection and then rebuilding. Refer to Redistributing the Duplicate Server Database on page 160 for more information. The following log files will be generated by the tool. Be aware that log files are only created once the first URI is written to a file, hence not all log files will be present. Table 53: Output log files Filename Description <mode>_ok.txt Lists every URI found during the check, that was not removed as a result of an inconsistency. The output from the metacheck mode in particular will list every URI with a unique checksum, and is therefore useful for comparing against the index. Be aware that documents may have been dropped by the pipeline, and thus this file may correctly list URIs not actually present in the index. However, URIs in the index that are not in this file may be safely removed from the index as it is not known by the crawler. <mode>_deleted.txt This file lists each URI deleted by the tool. Unless indexer deletes were disabled with the -n option they would also have been removed from the index. As these URIs were only deleted due to internal inconsistencies within the crawler it is entirely possible that they still exist on the web servers, and should thus rightly be indexed. Therefore, it is recommended that this list of URIs is subsequently re-crawled. This can be accomplished through the crawleradmin using the --addurifile option. To expedite crawling add the --force option. <mode>_deleted_reasons.txt The contents of this file will be the same as the previous file, with the addition of an "error code" preceding each URI. The error codes identify the reason for each URI being deleted. The following codes exist: • • • • • • • • <mode>_wrongnode.txt 190 101 - Document not found in document store 102 - Document found, but unreadable in document store 103 - Document found, but length does not match meta information 201 - Meta data for document not found 202 - Meta data found, but unreadable 203 - Meta data found, but does not match checksum in duplicate database 204 - Meta data found, but has no checksum 206 - URI's hostname not found in routing database Only ever present on a multi node crawler, this file will output all URIs removed from a particular node due to incorrect routing. This means that the URIs should Enterprise Crawler - reference information Filename Description be, and most likely also are, crawled by a different master node. Therefore, these URIs are only output to the log file, but not deleted from the index. <mode>_refeed.txt The URIs listed in this file have had their URI equivalence class updates as a result of running the tool. To bring the index in sync use postprocess refeed with the -i option to refeed the contents of this file. Alternatively perform a full refeed. It is recommended to always redirect the stdout and stderr output of this tool to a log file on disk. 
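For example, appending "> /tmp/consistency-run.log 2>&1" to the command line in a Bourne-type shell captures both output streams. Once a run has completed, the URIs listed in a <mode>_deleted.txt file can be queued for re-crawling as recommended above; the path and collection name below are placeholders, and the dated output sub directory is created under the -O directory:

$FASTSEARCH/bin/crawleradmin --addurifile mycollection:/tmp/consistency-logs/<date>/doccheck_deleted.txt --force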
Additionally, on UNIX use either screen or nohup in order to prevent the tool from terminating in case the session is disconnected. Tool: $FASTSEARCH/bin/crawlerconsistency: option [options] Table 54: crawlerconsistency options Mandatory options -M <mode>[,<mode>,..,<mode>] Description Selects the mode to run the tool in. The following modes are available • • doccheck - Verifies that all documents referenced in the meta databases also exist on disk. docrebuild - Same as above, but re-writes all referenced documents to a fresh document store, effectively getting rid of any orphans in the document store. Note: This can take a long time. • • • metacheck - Verifies that all checksums referenced in the PP databases also exist in the meta databases. metarebuild - Attempts to recovery a damaged metastore. Currently supports rebuilding a bad or lost site database based on segment databases. duprebuild: Rebuilds the contents of the Duplicate Server(s) from the local Post Process DB. Note: Exclusive mode. This mode must be run separately. Additionally, the following 'modifiers' can be specified: • updatestat - Updates the statistics document store counter. Note: Can only be used together with the doccheck/docrebuild mode. Only applies to the stored statistics counter. • routecheck - Verifies that sites/URIs are routed to the correct. Note: Only applies to multi-node crawlers. -O <path> Directory where the tool will place all output logs. The tool will create a sub directory here with a name matching the current date on the format <year><month><date>. If the directory already exists a counter will be appended, e.g. ".1" in order to ensure clean directories each time the tool is run. 191 FAST Enterprise Crawler Optional options -d <path> Description Location of crawl data, runtime configuration and logs in subdirectories in the specified directory. Default: data -C A comma separated list of collections to check. Default: All <collection>[,<collection>,...,<collection>]< collections -c <cluster>[,<cluster>,...,<cluster>] A comma separated list of clusters to check. Default: All clusters Note: Applies to: doccheck and docrebuild. -S <site>[,<site>,...,<site>] Only process the specified site(s). Default: All sites Note: Applies to: doccheck -z Compress documents in the document store when executing the docrebuild mode. This overrides the collection level option to compress documents if specified. Default: off Note: Applies to: docrebuild -i Skip free disk space checks. Normally the tool will check the amount of free disk space periodically and if it drops below 1GB it will abort the operation and exit. This option should be used with caution. -n Do not submit delete operations to the pipeline/indexer, only log them to files. In order to ensure removed documents are not present in the index afterwards it is recommended to manually delete the documents reported as deleted, or refeed the entire collection into an initially empty index. -F Load crawler global config from file. Conflicting options specified on the command line override the values in the configuration file if given. -T Test mode. Tool does not delete anything from disk or issue any deletes to the pipeline/indexer. crawlerwqdump The crawler work-queue-dump-tool writes the crawler queues that reside on disk to plain text files or to stdout. 
The following queues may be output: • • • • 192 the masters queue of resolved URIs the masters queue of unresolved URIs the masters queue of unresolved sites the slave work queues Enterprise Crawler - reference information Tool: $FASTSEARCH/bin/crawlerwqdump -d <dir> -c <collection> -t <target> -q <queue> All options must be specified. Each entry in the output contains the collection name and an URI, separated by ','. Table 55: crawlerdbtool Options Option Description -d <dir> -c <collection> Path to queue dir (data/crawler/queues). The name(s) of your collection(s). Use 'all' to process all collections. Separate collections with ',' if you specify more than one. -t <target> Output directory or 'stdout'. If you specify an output directory, the queues will be written to file and placed in <target> directory and named:<queue>.<time>.<collection>.txt ex: slavequeue.2005.12.21.11.9.mycollection.txt. -q <queue> Which queues to process: resolved/unresolved/slave/all. Example: crawlerwqdump -d $FASTSEARCH/data/crawler/queues/ -c mycollection,myothercollection -q slave" -t $FASTSEARCH/data/crawler/queuedumps/" crawlerdbexport The crawlerdbexport tool is used to dump the EC 6.3 databases to an intermediary format for subsequent import to an EC 6.7 installation, as part of the crawler store migration process. Dump files will be placed alongside the original databases, named with the suffix .dumped_nn. Tool: $FASTSEARCH/bin/crawlerdbexport [options] Table 56: crawlerdbexport options Option -m Description Required: Mode. Valid values: export, deldumps Default: export -d Required: Directory, path to crawler store ($FASTSEARCH/data/crawler). Default: none -g Name of your collection. If no collection is specified, then all collections are processed. Default: none -l Log level. 193 FAST Enterprise Crawler Option Description Valid values: normal, debug Default: normal -b Batch size. Maximum bytes per dump file. Default: 100MB crawlerstoreimport The crawlerstoreimport tool loads the crawlerdbexport dump files one by one, creates new databases and migrates the document storage, and a new 6.7 crawler store will be created. This also includes the documents stored. This section lists options for the import tool. Tool: $FASTSEARCH/bin/crawlerstoreimport [options] Table 57: crawlerstoreimport options Option -m Description Required: Mode. Valid values: import, deldumps Default: import -d Required: Directory, path to old crawler store ($FASTSEARCH/data/crawler.old). Default: none -n Required: Directory, path to new crawler store ($FASTSEARCH/data/crawler). Default: none -t Required: Node type. Valid values: ubermaster, master, ppdup Default: none -g Name of your collection. If no collection is specified, then all collections are processed. Default: none -s Storage format. Valid values: bstore, flatfile Default: current format -r Remove dump files. Valid values: 0,1 Default: 0 (no) -p 194 ppdup format. Enterprise Crawler - reference information Option Description Valid values: hashlog, gigabase Default:gigabase -l Log level. Valid values: normal, debug Default: normal Crawler Port Usage This appendix lists per process port usage for single node and multiple node crawlers. The crawler port is sometimes specified on the command line: -P <hostname>:<crawlerbaseport> <hostname> By default binds to all interfaces. <crawlerbaseport> If the FASTSEARCH environment variable is set, the port is read from the $FASTSEARCH/etc/NodeConf.xml file. If the FASTSEARCH variable is not set OR reading the port fails, port 14000 is used. 
Port range The maximum crawler port range is from <crawlerbaseport> to <crawlerbaseport>+299 Table 58: Crawler Port Usage (Single Node) Process name Purpose Port crawler XML-RPC <crawlerbaseport> Postprocess communication <crawlerbaseport> + 2 crawler Slave communication <crawlerbaseport> + 3 crawlerfs HTTP <crawlerbaseport> + 4 postprocess Slave communication <crawlerbaseport> + 5 postprocess XML-RPC <crawlerbaseport> + 6 uberslave XML-RPC <crawlerbaseport> + 7 and up to <crawlerbaseport> + 198 cglogdispatcher (GUI log dispatcher) XML-RPC <crawlerbaseport> + 199 Table 59: Crawler Port Usage (Multiple Node) Process name Purpose Port Ubermaster (crawler -U) XML-RPC <crawlerbaseport> Ubermaster (crawler -U) Master communication <crawlerbaseport>+1 Master (crawler -S) XML-RPC <crawlerbaseport> + 100 195 FAST Enterprise Crawler Process name Purpose Port Master (crawler -S) Postprocess communication <crawlerbaseport> + 102 crawler -S Slave communication <crawlerbaseport> + 103 crawlerfs HTTP <crawlerbaseport> + 104 postprocess Slave communication <crawlerbaseport> + 105 postprocess XML-RPC <crawlerbaseport> + 106 uberslave XML-RPC <crawlerbaseport> + 107 and up to <crawlerbaseport> + 198 cglogdispatcher (GUI log dispatcher) XML-RPC <crawlerbaseport> + 199 ppdup (Duplicate Server) Postprocess communication <crawlerbaseport> + 200 ppdup (Duplicate Server) Duplicate replication <crawlerbaseport> + 201 and up to <crawlerbaseport> + 298 Log Files The Enterprise Crawler creates numerous files in which to log information detailing the processing of URIs and collections. Some are created automatically, others must be enabled via configuration. Directory structure The following table describes the key directories and files in a crawler installation, relative to the FAST ESP installation root, $FASTSEARCH. Table 60: Crawler Directory Structure Structure Description $FASTSEARCH/bin /crawler Crawler executables /crawleradmin /postprocess $FASTSEARCH/lib Shared libraries and Python images $FASTSEARCH/etc FAST ESP configuration files read by crawler $FASTSEARCH/var/log/crawler Folders for detailed crawler logs /crawler.log /dns /dsfeed /header /fetch 196 Diagnostic and progress information Daily log files directories. Most of the directories are organized by collection. Enterprise Crawler - reference information Structure Description /screened /site /stats /PP $FASTSEARCH/data/crawler Folders for configuration, work queues and data/metadata store /config /queues /dsqueues $FASTSEARCH/data/crawler/store Temporary and permanent configuration data, work queues and batches; mostly binary data Data and metadata for crawled pages, organized in subdirectories by collection /db Metadata for each document gathered /data Document content /PP/csum Duplicate document checksum databases Log files and usage DNS log A directory that contains log files from DNS resolutions: $FASTSEARCH/var/log/crawler/dns Header log A directory that contains logs of HTTP request/response exchanges, separated into directories by sitename. The header log is disabled by default: $FASTSEARCH/var/log/crawler/header/<collection>/ Screened log A directory that contains log files of all URIs processed by the crawler and details for any given URI on whether or not it will be placed on the work queue: $FASTSEARCH/var/log/crawler/screened/<collection>/ The screened log is turned off by default. URIs that will be queued are logged as ALLOW others as DENY. Additionally all URIs logged as DENY will have a explanation code logged with it. 
Site log A directory that contains log files listing events in the processing of web sites. The logs contain entries listing a site being processed, a time stamp, and details of the transition in the state of that web site, such as STARTCRAWL, IDLE, REFRESH, and STOPCRAWL. $FASTSEARCH/var/log/crawler/site/<collection>/ 197 FAST Enterprise Crawler Fetch log A directory that contains log files for every collection that is populated by the crawler. The crawler logs attempted retrievals of documents to a per-collection log. Each log file describes actions taken for every URL along with a time stamp: $FASTSEARCH/var/log/crawler/fetch/<collection>/ Crawler log This file logs general diagnostic and progress information from the crawler process stdout and stderr output. The verbose level of this log is governed by the -l <level> option given to the crawler and can be modified in the crawler entry in $FASTSEARCH/etc/NodeConf.xml. Use the -l <level> option to specify the log level. Possible values are one of the following predefined log levels: debug, verbose, info, warning, error . If you adjust the level, reload the configuration file into the node controller (nctrl reloadcfg in $FASTSEARCH/bin) before stopping and starting the crawler for the change to take effect. $FASTSEARCH/var/log/crawler/crawler.log Postprocess log A directory that contains log files from postprocess. Postprocess performs duplicate detection of downloaded documents, and processes content to FAST ESP.The Postprocess log contains the URIs and referrer URI to every unique document together with their size, MIME type and URIs to any duplicates found: $FASTSEARCH/var/log/crawler/PP/<collection>/ DSfeed log A directory that contains log files for every collection that is populated by the crawler. The logs contain the status of each URI submitted to document processing. Deletes are also logged: $FASTSEARCH/var/log/crawler/dsfeed/<collection>/ Enabling all Log Files Logging options can be enabled via selection in the administrator interface, or by adding them to the XML configuration file and reloading that using the crawleradmin tool. An example of fully enabled log section from configuration file: <section name="log"> <attrib name="dsfeed" type="string"> text </attrib> <attrib name="fetch" type="string"> text </attrib> <attrib name="header" type="string"> text </attrib> <attrib name="postprocess" type="string"> text </attrib> <attrib name="screened" type="string"> text </attrib> <attrib name="site" type="string"> text </attrib> </section> Verbose and Debug Modes In cases where warnings or errors indicate that the crawler may have a problem, it may be helpful to obtain more detailed information than what is available in the daily crawler.log file. The options available for logging are the verbose mode (-v) and the debug mode (-l <value>, where <value> is often <debug>. To add these modes to the crawlers command line within FAST ESP: 1. Edit the NodeConf.xml file. To do so, find the Enterprise Crawler command specification, and add either “-v” or “-l debug” to the <parameters> string. 2. Save the change. 3. Force the node controller to reread the file. Run the command: nctrl reloadcfg The change will take effect when the crawler is next restarted 198 Enterprise Crawler - reference information Crawler Log Messages Below is a list of log messages that may be found in the $FASTSEARCH/var/log/crawler/crawler.log file. 
Severity CRITICAL Log Message(s) Cause(s) Another process, A process, most likely the crawler, is most likely another already running on the crawler port. crawler, is already running on the specified interface 'localhost:14000' Action(s) Ensure that the crawler is not already running on the port specified on the crawler command line. They may be killed if necessary. or Unable to bind master socket to interface 'localhost:14000' or Unable to open XML-RPC socket (%s:%s) or Another process, most likely another crawler, is already running and holding a lock on the file <filename> or Unable to create listen port for slave communication: socket.error: [Errno 98] Address already in use CRITICAL Unable to perform crawler license checkout: <text> The crawler was unable to retrieve a valid license. Your license may have expired. Refer to the licensing information listed in Contact Us on page iii for more information. or Unable to check out FLEXlm license. Shutting down. Contact FAST for a new license CRITICAL Lost connection to The uberslave process has detected Master. Taking down that the master is no longer running and is shutting down. This can occur if Slave either the master crashes or on normal shutdowns. Check the logs for additional information. If this was not the result of a normal shutdown please submit a bug report to FAST Technical Support. Include logs and core files if available. 199 FAST Enterprise Crawler Severity CRITICAL Log Message(s) Cause(s) Action(s) Unable to start Subordinate processes either could not Investigate system resources, process limits, check log files for error Slave/FileServer/PostProcess be started, or are failing repeatedly. messages. process or Too frequent process crashes. Shutting down CRITICAL No data directory specified Misconfiguration or ownership/permission problems. or Unable to create the data directory <directory> Verify that correct user is attempting to run crawler, and that ownership of crawler directories is correct. Recheck configuration files and command-line options. or Unable to write crawler pidfile to <directory> or Survival mode may only be used by a subordinate in a multi node setup CRITICAL Failed to load collection config <text>. Shutting down Unable to read configuration database Verify existence and or XML file. ownership/permission of configuration database or file. or Failed to load collection config specified on command line: <text> CRITICAL ERROR 200 DNS resolver error Crawler unable to contact DNS server Check system DNS configuration (e.g. to resolve names or addresses. /etc/resolv.conf), verify proper operation (e.g. using nslookup/dig or similar tool). Lost connection <name> (PID: <pid>), possibly due to crash Communication between two crawler processes has failed, possibly due to a process crash. None, the crawler will restart the process. Contact FAST Technical Support if it occurs repeatedly. Enterprise Crawler - reference information Severity ERROR ERROR ERROR ERROR WARNING WARNING WARNING WARNING WARNING Log Message(s) Cause(s) Action(s) Remote csum ownership, same URI: <URI> In a multiple node crawler setup this can occur if the same site has been routed to more than one master, or if stale data has not been properly deleted. Contact FAST Technical Support if it occurs repeatedly. 
Unable to load/create config DB '<path>': DBError: Failed to obtain database lock for <path>/config.hashdb The crawler is already running when an attempt was made to start another crawler or use a tool that requires the crawler to be stopped first. Stop the crawler. If, after waiting for at least 5 minutes, there are still crawler processes running they may need to be killed. Unable to connect to <name> process. Killing process (PID=<PID>) A process started by the master failed Check the logs for additional to connect properly to the master. The information. Contact FAST Technical process may have had startup Support if it occurs repeatedly. problems. Timeout waiting for <name> process to connect to Master, killing (PID=<PID>) A process started by the master failed Check the logs for additional to connect back within 60 seconds. The information. Contact FAST Technical process may have had startup Support if it occurs repeatedly. problems. <name> process (PID A crawler sub process identified by <pid>) terminated <name> has crashed and will be by signal <signal> restarted. (core dumped) Submit bug report to FAST Technical Support. Include logs and core file. Failed to read data from '<path>', discarding URI=<URI> The crawler was unable to read a None unless this occurs repeatedly. previously stored document from disk. Verify that there are no disk issues or This can occur if the document has other problems that could cause this. since been deleted from the web server and the crawler has a backlog. Unable to read block file index for block file <number> A document store file index was either Submit bug report to FAST Technical corrupt or missing on disk. Support. Include logs. Start URI <URI> is A start URI specified in the configuration did not pass the not valid include/exclude rule checks. Verify that all start URIs match the include/exclude rules as well as HTTP scheme and extension rules. Data Search Feeder The disk queue containing documents Delete the failed to process on disk for processing is corrupt. $FASTSEARCH/data/crawler/dsqueues packet: IOError directory and perform a PostProcess refeed as described in Re-processing Crawler Data Using postprocess on page 154. 201 FAST Enterprise Crawler Severity WARNING WARNING WARNING WARNING Log Message(s) Cause(s) KeepAlive ACK from In a multiple node crawler setup this If the log message repeats restart all can occur if the ubermaster and master crawler processes. unknown Master processes go out of sync. Unable to flush The master process work load is very If this repeats try to reduce the workload by reducing the number of Master comm channel high and communication between processes are suffering. concurrent sites being crawled or install on more powerful hardware or additional servers. <name> engine poll The specified process has a very high None, unless this occurs constantly. If workload. API calls may respond more so either decrease the work load by used <number> slowly. crawling fewer sites concurrently or seconds install on more powerful hardware or additional servers. Master ID '<ID>' already exists A master has been started with a symbolic ID that has already been specified for another in the same multiple node crawl. Stop the offending master and change the symbolic ID specified by the -I option WARNING The Browser Engine is shutdown or The Browser engine unavailable at <host>:<port> is down VERBOSE The Browser Engine is up and running The Browser engine after having been down. 
at <host>:<port> is up VERBOSE The Browser Engine is overloaded and Tune the EC to Browser Engine The Browser engine will not process new documents until communication, tune the Browser at <host>:<port> is the queue length is reduced. Engine. You may need to disable overloaded JavaScript and/or Flash processing or only enable JavaScript and/or Flash for certain sites. PROGRESS INFO Ignoring site: <sitename> (No URIs) Investigate why the Browser Engine is down and rectify it. See the FAST ESP Browser Engine Guide for more information. When postprocess re-feeding you may None needed. However, if the site get this message for sites containing should contain URIs you may wish to no URIs. try to re-crawl it or examine logs to determine why it has no URIs. Collection '<name>' The refresh cycle of the collection has You may want to increase the refresh period in order to completely crawl all is not idle by time completed and the crawler is not finished crawling all the sites. sites within the refresh period. of refresh PostProcess Log Below is a list of postprocess log messages that may be found in the $FASTSEARCH/var/log/crawler/PP/<collection>/ directory. 202 Action(s) Enterprise Crawler - reference information Severity CRITICAL CRITICAL STATUS STATUS PROGRESS Log Message Cause(s) Action(s) Must specify Master The postprocess process was run without the correct command line port arguments. The postprocess process can only be run manually in refeed mode (-R command line option), make sure the arguments are correct. Failed to start The PostProcess module failed to register with the configserver at PostProcess: ConfigServerError: initialization time. Failed to register with ConfigServer: Fault: (146, 'Connection refused') The configserver process is stopped or suspended. Restart it, wait a moment, and restart the PostProcess. Could not send batch to Data Search Content Distributor, will try again later. The error was: add_documents call with batch <batch ID> timed out The batch could not be forwarded to a None, the batch will be resent document processor since none were automatically. idle. This is a built-in throttling mechanism in FAST ESP. Waiting for Data Search to process remaining data... Hit CTRL+C to abort During refeed this message is logged Optionally signal postprocess to stop once all databases have been and resume crawler. traversed. Documents are still being sent to the Content Distributor, but it is safe to signal postprocess to stop and resume crawling as the crawler will then feed the remaining documents. Ignoring site: <sitename> (No URIs) During postprocess refeed traversal of The message can be ignored. If this the databases, a site was encountered site should have URIs associated with with no associated URIs. it, then sanity check the configuration rules and log files to discover why it has no URIs. Crawler Fetch Logs The crawler will log attempted retrievals of documents to a per-collection log located in $FASTSEARCH/var/log/crawler/fetch/<collection name>/<date>.log. The screened log is disabled by default. When enabled ("Screened" log enabled in the Logging section in the Advanced crawler collection configuration GUI), all URIs seen by the crawler and whether they will be attempted retrieved or not is located in $FASTSEARCH/var/log/crawler/screened/<collection name>/<date>.log The messages in these logs are in a whitespace-delimited format as follows: <time stamp> <status code> <outcome> <uri> [<auxiliary information>] where: 1. 
1. <time stamp> is a date/clock value denoting the time at which the request was completed or terminated.
2. <status code> contains a three-character code which describes the status of the outcome of the retrieval. When this code is a numerical value, it maps directly to the same status code in the corresponding protocol, as defined in RFC 2616 for HTTP/HTTPS and RFC 765 for FTP. The authoritative description of these status codes is always the respective protocol specification, but for convenience a subset of them is described informally below.
3. <outcome> is a somewhat more human-readable status word that describes the status of the document after retrieval.
4. <uri> denotes the URI that was requested.
5. [<auxiliary information>] contains additional information, such as descriptive error messages.

An excerpt from a fetch log is shown below (a minimal parsing sketch for this format is included at the end of this section):

2007-08-02-16:51:41 200 MODIFIED http://www.example.com/video/ JavaScript processing complete
2007-08-02-16:52:32 301 REDIRECT http://www.example.com/video/living Redirect URI=http://www.example.com/video/living/
2007-08-02-16:53:33 404 IGNORED http://www.example.com/video/living/
2007-08-02-16:54:33 200 PENDING http://www.example.com/video/live/live.html?stream=stream1 Javascript processing

Crawler Fetch Log Messages
Below is a list of log messages that may be found in the fetch log in the $FASTSEARCH/var/log/crawler/fetch/<collection>/ directory.

Table 61: Status Codes - Fetch Log

200 - HTTP 200 "OK": The request was successful. A document was retrieved following the request.

301 - HTTP 301 "Moved permanently": The document requested is available under a different URI. The target URI is shown in the auxiliary information field.

302 - HTTP 302 "Moved temporarily": The document requested is available under a different URI. The target URI is shown in the auxiliary information field. The crawler treats HTTP 301/302 identically.

303 - HTTP 303 "See Other": This method exists primarily to allow the output of a POST-activated script to redirect the crawler to a new URI.

304 - HTTP 304 "Not Modified": The document has been retrieved earlier, and was now requested conditionally so that it would only be returned if the server detected that it had changed since the last time. This is achieved by including in the request the Last-Modified time stamp or ETag given by the server the last time. "Not Modified" responses can be received when the "Send If-Modified-Since" setting is enabled in the crawler configuration GUI.

401 - HTTP 401 "Unauthorized": The web server requires that the crawler present a set of credentials when requesting the URI. This would occur using Basic or NTLM authentication.

403 - HTTP 403 "Forbidden": The web server denies the crawler access to the URI, either because the crawler presented a bad set of credentials or because none were given at all.

404 - HTTP 404 "Not Found": The web server does not know about the requested URI. Commonly, this is because a "dead link" was seen by the crawler.

406 - HTTP 406 "Not Acceptable": The web server has determined that the crawler is not configured to receive the type of page it has requested.

500 - HTTP 500 "Internal Server Error": Some unspecified error happened at the server when the request was serviced.

503 - HTTP 503 "Service unavailable": The server is currently unable to serve the request. This can, for instance, imply that the server is overloaded.
226 - FTP 226 "Closing data connection": An operation was performed to completion.

426 - FTP 426 "Connection closed; transfer aborted": The retrieval of the document was aborted.

ERR: A non-HTTP error occurred (crawler or network related). Details of the error are shown in the auxiliary information field.

TTL: The retrieval of the document exceeded the timeout setting. The number of seconds to wait before an uncompleted request is terminated is governed by the "Fetch timeout" setting in the crawler configuration GUI.

DSW: The document has been "garbage-collected" as it has not been seen for a number of crawler refresh cycles. This means that the crawler has crawled more data than it can deterministically re-crawl during one crawler refresh cycle, and that documents are periodically purged if they have not been seen recently. The number of refresh cycles to wait before purging documents is governed by the "DB switch interval" setting in the crawler configuration GUI. Alternatively, the documents are being deleted because they have become unreachable, either due to modifications in the crawler configuration or on the website itself.

STU: Start URI was deleted. The specified start URI had been removed and excluded from the configuration and has now been removed from the crawler store.

USC: The URI added from the crawler API (e.g. using crawleradmin) was deleted. A URI crawled earlier has now been excluded by the configuration, and upon re-adding it the URI is removed from the crawler store.

USR: The URI was deleted through the external API (e.g. using crawleradmin). Unless also excluded by the configuration it may be re-crawled later.

RSS: The URI was deleted due to the RSS settings. The document was deleted either because it was too old, or because the maximum allowed number of documents for the feed has been reached.

Table 62: Outcome Codes - Fetch Log

NEW: The retrieved document was seen for the first time by the crawler and will be further processed, pending final duplicate checks.

UNCHANGED: The retrieved document was seen before, and the retrieved version did not differ from the one retrieved the last time.

MODIFIED: The retrieved document was seen before, and the retrieved version differed from the one retrieved the last time. The updated version will be further processed, pending final duplicate checks.

REDIRECT: The retrieval of the document resulted in a redirect, that is, the server indicated that the document is available at a different location. The redirect target URI will be retrieved later, if applicable.

EXCLUDED: The document was retrieved, but properties of the response header or body caused it to be excluded by the crawler configuration. Details of the cause are shown in the auxiliary information field. Commonly, the data was of a MIME type not allowed in the crawler configuration.

DUPLICATE: The document was retrieved, but detected as a duplicate in the first level crawler duplicate check. The document will not be processed further.

IGNORED: The retrieval of the document failed because of a protocol or other error. If not evident from the status code (like HTTP 404, 403, and so on), details of the cause are shown in the auxiliary information field. If the document had been retried a number of times and failed in all attempts, the last attempt will be flagged as IGNORED.

DEPENDS: The document was retrieved, but has dependencies on external documents required for further processing. Currently, this means that JavaScript is enabled in the crawler configuration and that the document contained references to external JavaScripts. The document will be further processed when all dependencies have been retrieved.

DEP: The retrieved document was depended on by another document.

PENDING: A document was sent to an external component for processing. Examples include JavaScript and Flash documents, which are processed by the Browser Engine. There might be multiple pending messages for the same URI.

DELETED: The document had been retrieved earlier but is no longer available, or the document had been retrieved earlier but has not been seen for a number of refresh cycles (refer to the DSW status code). The document will be flagged as deleted.

RETRY: An error occurred and the retrieval of the document will be retried. The number of times a document is retried is governed by the "HTTP Errors"/"FTP Errors" settings in the crawler configuration GUI.

CANCELLED: The Browser Engine canceled processing of a document. This is caused either by the Browser Engine being shut down, or by processing of the document timing out. If the cancel operation was caused by a timeout, a text message will be logged in the auxiliary information field.

STOREFAIL: The crawler experienced an error saving the document contents to disk.

AUTHRETRY or AUTHPROXY: The crawler was denied access to the URI, either directly or via the proxy. If Basic or NTLM authentication has been configured, this may be a normal part of the protocol.

FAILED: The crawler was unable to complete internal processing of the document. Check the crawler log file to see if additional details were noted.

LIST: A directory listing was retrieved via FTP to obtain URIs for FTP documents.

RSSFEED: A new RSS feed has been detected by the crawler. These feeds are processed as specified by the RSS settings of the collection.

SITEMAP: A sitemap or sitemap index has been detected and parsed by the crawler.

Table 63: Crawler Fetch Log Auxiliary Field Messages

Referrer=<referrer uri>: The URI was referred by the given referrer URI.

Redirect URI=<target uri>: The retrieved URI redirects to the given target URI.

Empty document: The document was retrieved but contained no data.

META Robots <directives>: The document contained the given HTML META Robots directives.

MIME type: <MIME-type>: The document was of the specified MIME type (and was excluded because of this).

Connection reset by peer: The connection to the server was reset by the server (BSD socket error message).

Connection refused: The connection to the server could not be established (BSD socket error message).

Crawler Screened Log Messages
Below is a list of screened log messages that may be found in the $FASTSEARCH/var/log/crawler/screened/<collection>/ directory.

Table 64: Status Codes - Screened Log

ALLOW: The URI is eligible for retrieval according to the crawler configuration.

DENY: The URI is not eligible for retrieval according to either the crawler configuration, the robots.txt file, or HTML robots META tags.

Table 65: Outcome Codes - Screened Log

OK: The URI was allowed and is eligible for retrieval.

URI: The URI was disallowed due to URI inclusion/exclusion settings in the crawler configuration. These are the URI include/exclude filters and Extension excludes in the crawler configuration GUI.

DOMAIN: The URI was disallowed due to hostname inclusion/exclusion settings in the crawler configuration. This is governed by the Hostname include/exclude filters settings in the crawler configuration GUI.

ROBOTS: The URI was disallowed due to restrictions imposed by the robots.txt file on the web server for its site. Additional discussion of ROBOTS issues can be found in the section External limits on fetching pages on page 19.

LINKTYPE: The URI was disallowed due to the type of HTML tag it was extracted from. Allowed link types are governed by the Link extraction settings in the crawler configuration GUI.

NOFOLLOW: The URI was disallowed because the referring document contains an HTML META robots tag disallowing URI extraction. Additional discussion of ROBOTS issues can be found in the section External limits on fetching pages on page 19.

SCHEME: The URI had a disallowed scheme. Allowed schemes are specified in the Allowed schemes setting in the crawler configuration GUI.

FWDLINK: Forwarding of non-local links is disabled. The URI was disallowed because the crawler configuration disallows following links from one website to another. This is governed by the Follow cross-site URIs setting in the crawler configuration GUI.

PARSEERR: Failed to parse the URI into URI components, i.e. scheme, host, path, and other elements if specified.

LENGTH: The URI is too long to process and store.

WQMAXDOC: The maximum number of documents for the site has been reached.

WQEXISTS: The URI is already queued on the work queue and will not be queued again.

NOTNEW: In Adaptive refresh mode (only), this URI has already been queued as a previously fetched entry.

REFRESHONLY: In Refreshing mode (only), this URI is not a previously fetched entry, and so will be ignored.

RECURSION: The maximum level of URI recursion (i.e. repeated patterns in the path element) has been reached.

MAXREDIRS: The maximum number of redirects has been reached.

Crawler Site Log Messages
Below is a list of log messages that may be found in the site log in the $FASTSEARCH/var/log/crawler/site/<collection>/ directory.

Table 66: Status Codes - Site Log

STARTCRAWL <site>: The crawler will start crawling the specified site.

STOPCRAWL <site>: Crawling of the specified site stopped voluntarily. Look for the associated IDLE log message to determine the cause.

STOPCRAWL SHUTDOWN <site>: Crawling of the specified site was stopped due to the crawler being shut down.

REFRESH <site>: The specified site is refreshing.

REFRESH FORCE <site>: The specified site will be refreshed as a result of a user-initiated force re-fetch operation.

REFRESHSKIPPED NOTIDLE <site>: The specified site skipped refresh due to not yet having finished the previous refresh cycle. This event can occur only when refresh mode soft is used.

WQERASE MAXDOC <site> Max doc limit <count> reached, erasing work queue: The specified site has reached the maximum documents per cycle setting specified in the configuration. The remaining work queues have been erased.

IDLE <reason> <site> <detailed reason>: The specified site has gone idle (stopped crawling). The reason is given by <reason> and <detailed reason>.

REPROCESS <site> Ready for reprocessing: The crawler has been notified to reprocess (refeed) the specified site in the crawler store.
DELSITE DELSITECMD <site> Ready for deletion from crawler store: The crawler has been notified to initiate the deletion of the specified site from the crawler store.

DELURIS DELURICMD <site> <count> URIs ready for deletion from crawler store: The crawler has been notified to initiate the deletion of the specified URIs from the crawler store.

LOGIN GET/POST <site> Performing Authentication: The crawler has initiated the form login sequence for the specified site.

LOGGEDIN <site> Through <login site>: The crawler has successfully logged into the specified site through <login site>.

DELAYCHANGED ROBOTSDELAY <site> Set to <delay> seconds: The robots.txt file of the current site has changed the crawl delay to <delay> seconds by specifying the "Crawl-Delay" directive.

BLACKLIST <site> Blacklisted for <time span> seconds: The specified site was blacklisted for <time span> seconds. During this time no downloads will be performed for the site, but URIs will be kept on work queues. Once the blacklisting expires, URIs will be eligible for crawling again. A blacklist operation may have been the result of either an explicit user action (e.g. through the crawleradmin tool) or an internal backoff mechanism within the crawler itself.

UNBLACKLIST EXPIRED <site>: A site previously blacklisted is no longer blacklisted, and may resume crawling if there is available capacity.

JSENGINE DOWN <site> Crawling paused: The Browser Engine is down and the crawler has paused crawling the site.

JSENGINE OVERLOADED <site> Crawling paused: All the available Browser Engines are overloaded and the crawler has stopped sending requests to the Browser Engine.

JSENGINE UP <site> Crawling resumed: The Browser Engine is ready to process documents after having been down or overloaded.
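The fetch and screened logs share the whitespace-delimited line format described under Crawler Fetch Logs above. The following is a minimal sketch, not part of the crawler distribution, showing one way such a log could be summarized offline; the script name and the log path in the usage comment are examples only and should be replaced with an actual $FASTSEARCH/var/log/crawler/fetch/<collection>/<date>.log file.

#!/usr/bin/env python
# Minimal sketch (hypothetical helper, not shipped with the crawler):
# tally status and outcome codes from a fetch or screened log whose lines use
#   <time stamp> <status code> <outcome> <uri> [<auxiliary information>]
# Example usage:
#   python fetchlog_summary.py $FASTSEARCH/var/log/crawler/fetch/mycollection/2007-08-02.log

import sys
from collections import defaultdict

def summarize(path):
    status_counts = defaultdict(int)
    outcome_counts = defaultdict(int)
    log = open(path)
    try:
        for line in log:
            # The auxiliary information field may itself contain whitespace,
            # so split into at most five fields.
            fields = line.strip().split(None, 4)
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            status, outcome = fields[1], fields[2]
            status_counts[status] += 1
            outcome_counts[outcome] += 1
    finally:
        log.close()
    return status_counts, outcome_counts

if __name__ == "__main__":
    statuses, outcomes = summarize(sys.argv[1])
    print("Status codes:  %s" % dict(statuses))
    print("Outcome codes: %s" % dict(outcomes))

The same approach applies to the screened log, since it uses the same field layout; only the codes differ (ALLOW/DENY status codes and the outcome codes listed in Table 65).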