FAST Enterprise Crawler Guide
FAST Enterprise Crawler version:6.7 Crawler Guide Document Number: ESP939, Document Revision: B, December 03, 2009 Copyright Copyright © 1997-2009 by Fast Search & Transfer ASA (“FAST”). Some portions may be copyrighted by FAST’s licensors. All rights reserved. The documentation is protected by the copyright laws of Norway, the United States, and other countries and international treaties. No copyright notices may be removed from the documentation. No part of this document may be reproduced, modified, copied, stored in a retrieval system, or transmitted in any form or any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser’s use, without the written permission of FAST. Information in this documentation is subject to change without notice. The software described in this document is furnished under a license agreement and may be used only in accordance with the terms of the agreement. Trademarks FAST ESP, the FAST logos, FAST Personal Search, FAST mSearch, FAST InStream, FAST AdVisor, FAST Marketrac, FAST ProPublish, FAST Sentimeter, FAST Scope Search, FAST Live Analytics, FAST Contextual Insight, FAST Dynamic Merchandising, FAST SDA, FAST MetaWeb, FAST InPerspective, NXT, FAST Unity, FAST Radar, RetrievalWare, AdMomentum, and all other FAST product names contained herein are either registered trademarks or trademarks of Fast Search & Transfer ASA in Norway, the United States and/or other countries. All rights reserved. This documentation is published in the United States and/or other countries. Sun, Sun Microsystems, the Sun Logo, all SPARC trademarks, Java, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Netscape is a registered trademark of Netscape Communications Corporation in the United States and other countries. Microsoft, Windows, Visual Basic, and Internet Explorer are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Red Hat is a registered trademark of Red Hat, Inc. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. AIX and IBM Classes for Unicode are registered trademarks or trademarks of International Business Machines Corporation in the United States, other countries, or both. HP and the names of HP products referenced herein are either registered trademarks or service marks, or trademarks or service marks, of Hewlett-Packard Company in the United States and/or other countries. Remedy is a registered trademark, and Magic is a trademark, of BMC Software, Inc. in the United States and/or other countries. XML Parser is a trademark of The Apache Software Foundation. All other company, product, and service names are the property of their respective holders and may be registered trademarks or trademarks in the United States and/or other countries. Restricted Rights Legend The documentation and accompanying software are provided to the U.S. government in a transaction subject to the Federal Acquisition Regulations with Restricted Rights. Use, duplication, or disclosure of the documentation and software by the government is subject to restrictions as set forth in FAR 52.227-19 Commercial Computer Software-Restricted Rights (June 1987). 
Contact Us

Web Site
Please visit us at: http://www.fastsearch.com/

Contacting FAST
FAST
Cutler Lake Corporate Center
117 Kendrick Street, Suite 100
Needham, MA 02492 USA
Tel: +1 (781) 304-2400 (8:30am - 5:30pm EST)
Fax: +1 (781) 304-2410

Technical Support and Licensing Procedures
Technical support for customers with active FAST Maintenance and Support agreements, e-mail: [email protected]
For obtaining FAST licenses or software, contact your FAST Account Manager or e-mail: [email protected]
For evaluations, contact your FAST Sales Representative or FAST Sales Engineer.

Product Training
E-mail: [email protected]
To access the FAST University Learning Portal, go to: http://www.fastuniversity.com/

Sales
E-mail: [email protected]

Contents

Preface
    Copyright
    Contact Us
Chapter 1: Introducing the FAST Enterprise Crawler
    New features
    Web concepts
    Crawler concepts
    Enterprise Crawler Architecture
    Configuring a crawl
        Where to begin?
        Where to go?
        How fast to crawl?
        How long to crawl?
        Excluding pages in other ways
        External limits on fetching pages
        Removal of old content
    Browser Engine
Chapter 2: Migrating the Crawler
    Overview
    Storage overview
        Document storage
        Meta database
        Postprocess Database
        Duplicate server database
        Configuration and Routing Databases
    CrawlerGlobalDefaults.xml file considerations
    The Migration Process
Chapter 3: Configuring the Enterprise Crawler
    Configuration via the Administrator Interface (GUI)
        Modifying an existing crawl via the administrator interface
        Basic Collection Specific Options
        Advanced Collection Specific Options
        Adaptive Crawlmode
        Authentication
        Cache Sizes
        Crawl Mode
        Crawling Thresholds
        Duplicate Server
        Feeding Destinations
        Focused Crawl
        Form Based Login
        HTTP Proxies
        Link Extraction
        Logging
        POST Payload
        Postprocess
        RSS
        Storage
        Sub Collections
        Work Queue Priority
    Configuration via XML Configuration Files
        Basic Collection Specific Options (XML)
        Crawling thresholds
        Refresh Mode Parameters
        Work Queue Priority Rules
        Adaptive Parameters
        HTTP Errors Parameters
        Logins parameters
        Storage parameters
        Password Parameters
        PostProcess Parameters
        Log Parameters
        Cache Size Parameters
        Link Extraction Parameters
        The ppdup Section
        Datastore Section
        Feeding destinations
        RSS
        Metadata Storage
        Writing a Configuration File
        Uploading a Configuration File
    Configuring Global Crawler Options via XML File
        CrawlerGlobalDefaults.xml options
        Sample CrawlerGlobalDefaults.xml file
    Using Options
        Setting Up Crawler Cookie Authentication
        Implementing a Crawler Document Plugin Module
        Configuring Near Duplicate Detection
        Configuring SSL Certificates
    Configuring a Multiple Node Crawler
        Removing the Existing Crawler
        Setting up a New Crawler with Existing Crawler
    Large Scale XML Crawler Configuration
        Node Layout
        Node Hardware
        Hardware Sizing
        Ubermaster Node Requirements
        Duplicate Servers
        Crawlers (Masters)
        Configuration and Tuning
        Duplicate Server Tuning
        Postprocess Tuning
        Crawler/Master Tuning
        Maximum Number of Open Files
        Large Scale XML Configuration Template
Chapter 4: Operating the Enterprise Crawler
    Stopping, Suspending and Starting the Crawler
        Starting in a Single Node Environment - administrator interface
        Starting in a Single Node Environment - command line
        Starting in a Multiple Node Environment - administrator interface
        Starting in a Multiple Node Environment - command line
        Suspending/Stopping in a Single Node Environment - administrator interface
        Suspending/Stopping in a Single Node Environment - command line
        Suspending/stopping in a Multiple Node Environment - administrator interface
        Suspending/stopping in a Multiple Node Environment - command line
    Monitoring
        Enterprise Crawler Statistics
    Backup and Restore
        Restore Crawler Without Restoring Documents
        Full Backup of Crawler Configuration and Data
        Full restore of Crawler Configuration and Data
        Re-processing Crawler Data Using postprocess
        Single node crawler re-processing
        Multiple node crawler re-processing
        Forced Re-crawling
        Purging Excluded URIs from the Index
        Aborting and Resuming of a Re-process
    Crawler Store Consistency
        Verifying Docstore and Metastore Consistency
        Rebuilding the Duplicate Server Database
    Redistributing the Duplicate Server Database
    Exporting and Importing Collection Specific Crawler Configuration
    Fault-Tolerance and Recovery
        Ubermaster
        Duplicate server
        Crawler Node
Chapter 5: Troubleshooting the Enterprise Crawler
    Troubleshooting the Crawler
        Reporting Issues
        Known Issues and Resolutions
Chapter 6: Enterprise Crawler - reference information
    Regular Expressions
        Using Regular Expressions
        Grouping Regular Expressions
        Substituting Regular Expressions
    Binaries
        crawler
        postprocess
        ppdup
    Tools
        crawleradmin
        crawlerdbtool
        crawlerconsistency
        crawlerwqdump
        crawlerdbexport
        crawlerstoreimport
    Crawler Port Usage
    Log Files
        Directory structure
        Log files and usage
        Enabling all Log Files
        Verbose and Debug Modes
        Crawler Log Messages
        PostProcess Log
        Crawler Fetch Logs
        Crawler Fetch Log Messages
        Crawler Screened Log Messages
        Crawler Site Log Messages

Chapter 1: Introducing the FAST Enterprise Crawler

Topics:
• New features
• Web concepts
• Crawler concepts
• Enterprise Crawler Architecture
• Configuring a crawl
• Browser Engine

This chapter introduces the FAST Enterprise Crawler (EC), version 6.7, for use with FAST ESP.

New features

New features since EC 6.3:
• Significant large-scale performance and robustness improvements. Through efficiency improvements and the addition of new configuration variables to reduce or eliminate inter-node communications, a large-scale web crawl of up to 2 billion documents can be supported with the crawler, with 25-30 million documents on over 60 dedicated crawler hosts.
• Multimedia enabled crawler.
• Document evaluator plugin.
• NTLM v1 server and digest authentication.

New features since EC 6.4:
• Introduction of the Browser Engine, which enables more links to be extracted from:
    • JavaScript. By default the Browser Engine extracts most links, but the customizable preprocessors and extractors allow even more links to be extracted.
    • Static Flash. By default links will be extracted from static flash (.swf) and flash video files (.flv).
• IDNA support.
• Authentication improvements. Full NTLM v1 support and improved form based authentication.
• Operational improvements including new crawl modes and tools to verify crawler store consistency and change the number of crawler nodes and duplicate servers.
• Near duplicate detection to evaluate patterns in the content to identify duplicates.

No new features since EC 6.5 as it was never officially released.
New features since EC 6.6:
• Comprehensive sitemap support, which includes:
    • Automatic detection of sitemaps, including support for the robots.txt directive.
    • Support for storing/indexing metadata from sitemaps.
    • Obey sitemap access rules.
    • Sitemap enabling/disabling per subdomain.
    • Use the lastmod attribute to determine what pages require re-crawling (non-adaptive crawl mode only).
    • Use the priority and changefreq attributes to score documents in adaptive crawl mode.
• Improved crawleradmin refetch and refeed options.
• Extended the crawleradmin verifyuri option to perform a more thorough verification.
• Passwords no longer stored or presented in plain text in exported crawler configurations.
• Postprocess supports auto-resume of the previous interrupted refeed.
• Improved flexibility in matching robots.txt user-agents through a regular expression.
• Configurable session cookie timeout.
• Support for overriding the Obey robots.txt setting in sub collections.
• Document plugins can now perform limited logging to the fetch log.

Web concepts

This section provides a list of definitions for terms that apply to the part of the Internet called the World Wide Web (www).

Web server
A web server is a network application using the HyperText Transfer Protocol (HTTP) to serve information to users. Human users utilize a client application called a browser to request, transfer and display documents from the web server. The documents may be web pages (encoded in HTML or XML markup languages), files stored on the web server's file system in any number of formats (Microsoft Word or Adobe Acrobat PDF documents, JPEG or other image files, MP3 or other audio files), or content generated dynamically based on the user's request (e-commerce products, search results, or database lookup results). The crawler responds to HTTP error codes. For extensive explanations of all HTTP/1.1 RFC codes, refer to the Hypertext Transfer Protocol -- HTTP/1.1 available at http://www.ietf.org/.

Web site vs. web server
A web site is a given hostname (for example, www.example.com), with an associated IP address (or, sometimes, a set of IP addresses, generally if a site gets a lot of traffic), which supports the HTTP protocol and serves content to user requests. A web server is the hardware system corresponding to this hostname. Several web sites may share a given web server, or even a single IP address.

Web page
A web page is the standard unit of content returned by a web server, which may be identified by one or more URIs. It may represent the formatted output of a markup language (for example, HTML or XML), or a document format stored on-disk (for example, Microsoft Office, Adobe PDF, plain text), or the dynamic representation of a database or other archive. In any case, the web server will return some header information along with the content, describing the format of the contents, using the Internet Standard MIME type conventions.

Links
Web pages may contain references to other web pages, either on the same web server or elsewhere in the network, called hyperlinks or simply links. These links are identified by various internal formatting tags.

Uniform Resource Identifier (URI) vs. Uniform Resource Locator (URL)
URI is the overall namespace for identifying resources. URL is a specific type that includes the location of the resource (for example, a web page, http://www.example.com/index.html).
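As an aside, the breakdown of this example URI that the next paragraph describes (scheme, hostname, implied port, and path) can be reproduced with any standard URI parser. The short Python sketch below is purely illustrative and uses only the standard library; it is not part of the Enterprise Crawler:

    # Illustrative only: split the example URI into the components discussed here.
    from urllib.parse import urlsplit

    parts = urlsplit("http://www.example.com/index.html")
    print(parts.scheme)     # http            (the network protocol)
    print(parts.hostname)   # www.example.com
    print(parts.port)       # None -> port 80 is implied for the http scheme
    print(parts.path)       # /index.html     (path and page on the server)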
Encoded within this example are the network protocol (scheme, HTTP), the hostname and implied port number (www.example.com, port 80), and a specific path and page on that server (/index.html). URI is the more general term, and is preferred. Extensive RFC 3986 details can be found at http://www.ietf.org/.

IDNA
Since normal DNS resolving doesn't support characters outside the ASCII scope of characters, a hostname containing these special characters has to be translated into an ASCII based format. This translation is defined by the Internationalizing Domain Names in Applications (IDNA) standard. An example of such a hostname would be www.blåbærsyltetøy.no. The DNS server doesn't understand this name, so the host is registered as the IDN encoded version of the host name: www.xn--blbrsyltety-y8ao3x.no. The Crawler will automatically translate these host names to IDN encoded names before DNS lookup is performed. When working with URIs that use special characters, please make sure the collection or the start URI files have been stored using UTF-8 or similar encoding. Extensive RFC 3490 details can be found at http://www.ietf.org/.

RSS
RSS is a family of web feed formats used to publish frequently updated digital content, such as blogs, news feeds or podcasts. Users of RSS content use programs called feed readers or aggregators. The user subscribes to a feed by supplying to their reader a link to the feed. The reader can then check the user's subscribed feeds to see if any of those feeds have new content since the last time it checked and if so, retrieve that content and present it to the user.

The following RSS formats/versions are supported by the crawler:
• RSS 0.9-2.0
• ATOM 0.3 and 1.0
• Channel Definition Format (CDF)

XML Sitemaps
An XML Sitemap (also known as Google Sitemap) is an XML format for specifying the links on the site, with associated meta data. This meta data includes the following per URI:
• The priority (importance)
• The change frequency
• The time it was last modified

The crawler can be configured to download such sitemaps, and make use of this information when deciding what URIs to crawl, and in what order, for a site. In non-adaptive refresh mode the crawler uses the lastmod attribute to determine whether a page has been modified since the last time the sitemap was retrieved. Pages that have not been modified will not be recrawled in this crawl cycle. In adaptive refresh mode the crawler will use the priority and changefreq attributes from the sitemap to score (weight) a page. Thus, assuming that the sitemap has sane values, the crawler will prioritize high priority content. The sitemap is only re-downloaded each major cycle however. See the Sitemap support configuration option for more information.

Crawler concepts

The crawler is a software application that gathers (or fetches) web pages from a network, typically a bounded institutional or corporate network, but potentially the entire Internet, in a controlled and reasonably deterministic manner. The crawler works, in many ways, like a web browser to download content from web servers. But unlike a browser that responds only to the user's input via mouse clicking or keyboard typing, the crawler works from a set of rules it must follow when requesting web pages, including how long to wait between requests for pages (Request rate), and how long to wait before checking for new/updated pages (Refresh interval).
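To picture the Request rate in isolation: the crawler does not issue the next request to a web site until the configured delay has passed since the previous fetch from that site. The sketch below is only a conceptual illustration of such a per-site politeness delay; the function names are invented for the example, and the 60 second value simply mirrors the default delay discussed later in this chapter, so none of this reflects the crawler's actual internals:

    import time

    def fetch(uri):
        # Placeholder for an HTTP request; a real crawler would use an HTTP client here.
        print("fetching", uri)

    def crawl_site_politely(uris, delay_seconds=60.0):
        # Fetch URIs from a single site, waiting delay_seconds between requests.
        last_fetch = None
        for uri in uris:
            if last_fetch is not None:
                remaining = delay_seconds - (time.monotonic() - last_fetch)
                if remaining > 0:
                    time.sleep(remaining)   # honor the per-site request rate
            last_fetch = time.monotonic()
            fetch(uri)

    crawl_site_politely(["http://www.example.com/", "http://www.example.com/a.html"],
                        delay_seconds=2)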
For each web page downloaded by the crawler, it makes a list of all the links to other pages, and then checks these links against the rules for what hosts, domains, or paths it is allowed to fetch. A brief description of the crawler's algorithm is that it will start by comparing the start URIs list against the include and (if defined) exclude rules. Valid URIs are then requested from their web servers at a rate determined by the specified request rate. If fetched successfully, the page is parsed for links, and information about the page stored in the meta database, with the contents stored in the crawler store. The URIs from the parsed links are each evaluated against the rules, fetched, and the process continues until all included content has been gathered, or the refresh interval is complete. Because of the many different situations in which the crawler is used, there are many different ways to adjust its configuration. This section identifies some of the fundamental elements used to set up and control web page collection: 12 Collection Named set of documents to be indexed together in ESP, this also identifies the crawler's configuration rules. Storage The crawler stores crawled content locally by default, to be passed to other ESP components later. If there is too much data to store, or a one-time index build is planned, pages can be deleted after having been indexed. It also builds a database of meta data, or details about a web page, such as what pages link to it or if there are any duplicate copies. Introducing the FAST Enterprise Crawler Include rules Settings that indicate the hosts and/or URIs that may be fetched. These can be quite specific such as a list of web servers or URIs, or general, for all the servers in one's network. It's important to keep in mind that this only specifies what may be fetched, it does not define where to start crawling (see start URIs list below). Exclude rules Optional settings that prevent hosts and/or URIs from being fetched, because they would otherwise match the include rules, but are not desired in the index. Start URIs list List of web pages (URIs) to be fetched first, from which additional links may be extracted, tested against the rules, and added to work queues for subsequent fetch attempts. As each is fetched, additional URIs on that site and others may be found. If there are URIs listed to more sites in the start URIs list than the number of sites the crawler can connect to simultaneously (Maximum number of concurrent sites configuration variable), then some will remain queued until a site completes crawling, at which point a new site can be processed. The start URIs list is sometimes referred to as a seed URIs list or simply seed list. Refresh interval Length of time the crawler will work before re-crawling a site to see if new or modified pages exist. The behavior of the crawler during this period depends upon the refresh mode. If the crawler is busy it will have work queues of pages yet to be fetched; the contents of the existing work queues may either be kept and crawled during the next refresh cycle, or it may be erased (scratched). In either case the start URIs are also added to the work queue. In the adaptive mode the overall refresh interval is called the major cycle. The major cycle is subdivided into multiple minor cycles, with goals and limits regarding the number of pages to be revisited. 
This interval may be quite short, measured in hours or minutes, for "fresh" data like news stories, but is more typically set as a number of days. The refresh interval is sometimes referred to as the crawl cycle, refresh cycle or simply refresh.

Request rate: The amount of time the crawler will "wait" after fetching a document before attempting another fetch from the same web site. For flexibility, different rates (variable delay) can be specified for different times of day, or days of the week. Setting this value very low can cause problems, as it increases the activity of both the web sites and the crawler system, along with the network links between them. The request rate is sometimes referred to as the page request delay, request delay or delay.

Concurrent sites: The crawler is capable of crawling a large number of unique web sites, however only a limited number of these can be crawled concurrently at any one time. Normally, the crawler will crawl a site to completion before continuing on the next site. You can however limit the amount of documents crawled from a single site by several means, see Excluding or limiting documents below. This can be used to ensure the crawler eventually gets time to crawl all the web sites it is configured to.

Crawl speed: Sometimes also referred to as crawl rate, this is the rate at which documents are fetched from the web sites for a given collection. The highest possible crawl rate can be calculated from the number of concurrent sites divided by the request rate. For example, if crawling 50 web sites with a request rate of 10 (10 seconds "delay" between each fetch) the total maximum achievable crawl rate will be 5 documents per second. However, if the network or web sites are slow the actual crawl rate may be less.

Excluding or limiting documents: Because the ultimate goal of crawling is indexing the textual content of web pages rather than viewing a fully detailed web page (as with a browser), the standard configuration of the crawler includes some exceptions to the rules of what to fetch. A common example is graphical content; JPEG, GIF and bitmap files are all excluded. There are several other controls that can be set to limit downloaded content. A per-page size limit can be set, with another option to control what happens when the size limit is exceeded (drop the page, or truncate it at the size limit). Another option, Maximum documents per site, limits the number of pages downloaded from a given web site; helpful if a large number of sites is being surveyed, and too many pages fetched from a "deep" site would limit the resources available and starve other sites.

Level or hops: This value indicates how many links have been followed from a start URI (Level 0) to reach the current page. It is used in evaluating a crawl in which a DEPTH value has been specified. For example, if the start URI http://www.example.com/index.html links to /sitemap.html on the same site, from which a link to http://www.example.com/test/setting/000/output_listing/three.txt is extracted, this latter URI will be Level 2. The number of path elements is not considered in determining the Level value. If you are running a DEPTH:0 crawl, the start URIs will be crawled, but redirects and frame links will also be allowed. To strictly enforce a start URI only crawl, specify DEPTH:-1 (minus-one).

Feed/refeed: The crawler will send fetched pages to FAST ESP in batches to be indexed, updated or deleted, a process known as feeding.
In normal operation it will automatically maintain the synchronization between what pages exist on web sites and what pages are available in the index. Under some circumstances it may be necessary to rebuild the collection in the index, or make major (bulk) changes in the contents. For example, a large number of deletions of sites or pages no longer desired, or significant changes in the processing pipeline would both require resending data to FAST ESP. In this case the crawler can be shut down and postprocess run manually, a process known as re-feeding. After restarting the crawler, it will continue to keep the index updated based on new pages fetched, or deleted pages discovered.

Duplicate documents: A web document may in some cases be represented by more than a single URI. In order to avoid indexing the same document multiple times a mechanism known as duplicate detection is used to ensure that only one copy of each unique document is indexed. The crawler supports two ways of identifying such duplicates. The first method is to strip all HTML markup and white space from the document, and then compute an MD5 checksum of the resulting content. For non-HTML content such as PDFs the MD5 checksum is generated directly from the binary content. Any documents sharing the same checksum are duplicates. A variation on this method is the near duplicate detection; refer to the Configuring Near Duplicate Detection chapter for more information. The set of documents with different URIs classified as duplicates will be indexed as one document in the index, but the field 'urls' will contain multiple URIs pointing to this document.

Note: The crawler's duplicate handling will only apply within collections, not across.

Enterprise Crawler Architecture

The Enterprise Crawler is typically a component within a FAST ESP installation, started and stopped by the Node Controller (nctrl). Internally the crawler is organized as a collection of processes and logical entities, which in most cases run on a single machine. Distributing the processes across multiple hosts is supported, allowing the crawler to gather and process a larger number of documents from numerous web sites.

Table 1: Crawler Processes

    Binary        Function
    crawler       Master/Ubermaster
    uberslave     Uberslave/Slave
    postprocess   Postprocess
    ppdup         Duplicate Server
    crawlerfs     File Server

In a single node installation the primary process is known as the master, and is implemented in the crawler binary. It has several tasks, including resolving DNS names to addresses, maintaining the collection configurations, and other "global" jobs. It also allocates sites to one of the uberslave processes. The master is started (or stopped) by the node controller, and is responsible for starting and stopping the other crawler processes. These include the uberslave processes (two by default), each of which creates multiple slave entities. The uberslave is responsible for creating the per-site work queues and databases; a slave is allocated to a single site at any given time, and is responsible for fetching pages, directly or through a proxy, computing the checksum of the page's content, storing the page to disk, and associated activities such as logging in to protected sites. The postprocess maintains a database of document content checksums, to determine duplicates (more than one URI corresponding to the same data content), and is responsible for feeding batches of documents to FAST ESP.
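The checksum-based duplicate detection described above can be pictured with a short sketch: strip markup and white space from HTML and take an MD5 of what remains, or checksum the raw bytes for non-HTML formats. The code below only illustrates the general idea; the normalization the crawler actually performs is not documented here and may differ:

    import hashlib
    import re

    def content_checksum(body, is_html):
        # Illustrative duplicate-detection checksum: MD5 of normalized content.
        if is_html:
            text = body.decode("utf-8", errors="replace")
            text = re.sub(r"<[^>]*>", "", text)   # strip HTML markup (crudely)
            text = re.sub(r"\s+", "", text)       # strip all white space
            data = text.encode("utf-8")
        else:
            data = body                           # e.g. a PDF: checksum the raw bytes
        return hashlib.md5(data).hexdigest()

    a = b"<html><body><p>Hello   world</p></body></html>"
    b = b"<HTML><BODY><P>Hello world</P></BODY></HTML>"
    print(content_checksum(a, True) == content_checksum(b, True))  # True -> duplicates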
Small documents are sent directly to the document processing pipelines, but larger documents are sent with only a reference to the document; the file server process is responsible for supplying the document contents to any pipeline stage that requests it. Figure 1: Single Node Crawler Architecture When the number of web sites, or total number of pages to be crawled, is large, the crawler can be scaled up by distributing the processes across multiple hosts. In this configuration some additional processes are required. An ubermaster is added, which takes on the role of DNS name/address resolution, centralized logging, and routing URIs to the appropriate master node. Each master node continues to have a postprocess locally, but each of these must now submit URI checksums to the duplicate server, which maintains a global database of URIs and content checksums. 15 FAST Enterprise Crawler Figure 2: Multiple Node Crawler Architecture Refer to the FAST ESP Product Overview Guide, Basic Concepts chapter for FAST ESP search engine concepts. Configuring a crawl The purpose of the crawler is to fetch the web pages that are desired for the index, so that users can search for and find the information they need. This section introduces how to limit and guide the crawler in selecting web pages to fetch and index, and describes the alternatives for what to do once the refresh interval has completed. In building an index it is important to include documents that have useful information that people need to find, but it is also critical to exclude content that is repetitive or otherwise less useful. For example, the automated pages of an on-line calendar system, with one page per day (typically empty), stretching off into the distant future may not be useful. Keep this in mind when setting up what the crawler will, and will NOT, fetch and process. At a minimum, a crawl is defined by two key issues: where to begin, and where to go; also important are determining how fast to crawl, and for how long. Where to begin? The start URIs list provides the initial set of URIs to web sites/pages for the crawler to consider. As each is fetched it generates additional URIs to that site and other sites. If there are URIs listed to more sites in the start URIs list than the number of sites the crawler can connect to simultaneously (Maximum number of concurrent sites), then some remain pending until a site completes crawling, at which point a new site can be processed. To prevent site starvation, the setting of Maximum documents before interleaving can force a different site to be scheduled after the specified value of pages are fetched. 16 Introducing the FAST Enterprise Crawler Note: This can be expensive with regard to queue structure and the possibility of overflowing file system limits. It is recommended that you thoroughly consider the implications on web scale crawls before implementing this feature. Where to go? The first factor to consider is what web sites should be crawled. If given no limitations, no rules to restrict it, the crawler will consider ANY URI to be valid. For most indexing projects, this is too much data. Generally, an index is being built for a limited number of known web sites, identified by their DNS domains or, more specifically, hostnames. For these sites, one or more start URIs is identified, giving the crawler a starting point within the web site. 
An include rule corresponding to the start URI can be quite specific, for example, an EXACT match for www.example.org, or it can be more general to match all websites in a given DNS domain, for example, any hostname matching the SUFFIX .example.com. Figure 3: Configuring a Crawl A crawl configured with these include rules, and a start URI to match each one, would attempt to download, store, and index all the pages available within the large circles shown in the illustration, corresponding to the www.example.com network and the www.example.org web site. It is often the case, though, that general rules such as these have specific exemptions, special cases of servers or documents that must not be indexed. Consider the host hidden.example.com in Site A containing documents that are not useful to the general public, or are otherwise deemed to be unworthy of indexing. To prevent any pages from this site being fetched, the crawler can be configured with a rule to exclude from consideration any URI with the EXACT hostname hidden.example.com. Another possibility is that only files from a particular part of the web site should be avoided; in such a case an Exclude URI rule could be entered, for example, any URI matching the PREFIX http://hidden.example.com/forbidden_fruit/. As the crawler fetches pages and looks through them for new URIs to crawl, it will evaluate each candidate URI against its configured rules. If a URI matches either an Include Hostname or Include URI filter rule, while NOT matching an Exclude Hostname or Exclude URI filter rule, then it is considered eligible for further processing, and possibly fetching. 17 FAST Enterprise Crawler Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3). In previous ESP releases these two rule types were evaluated as an AND operation, meaning that a URI has to match both rules (when rules are defined). As of ESP 5.1 and later (EC 6.6 and up), the rules processing has changed to an OR operator, meaning a URI now only needs to match one of the two rule types. Existing crawler configurations migrating from EC 6.3 must be updated, by removing or adjusting the hostname include rules that overlap with the URI include rules. Within a given site, you can configure the crawler to gather all pages (a FULL crawl), or limits can be set on either the depth of the crawl (how many levels of links are followed), or an overall limit on the number of pages allowed per site can be set (Maximum documents per site). How fast to crawl? Perhaps the key variable that affects how much work the crawler, the network, and the remote web sites must do is the page Request rate. This value is determined by the delay setting, which indicates how long the crawler should wait, after fetching a page, before requesting the next one. For each active site, the overall page request rate will be a function of the Delay, the Outstanding page requests setting, and the response time of the web site returning requested pages. The crawler's overall download rate depends on how many active sites are busy. The number of uberslaves per node can be modified, with a default of two and a maximum of eight. When crawling remote sites that are not part of the same organization running the crawl, using the default delay value of 60 seconds (or higher) is appropriate, so as not to burden the web site from which pages are being requested. 
For crawlers within the same organization/network, lower values may be used, though note that using very low values (for example, less than 5 seconds) can be stressful on the systems involved. How long to crawl? The Refresh interval determines the overall crawl cycle length; the period of time over which a crawl should run without revisiting a site to see if new or modified pages exist. Picking an appropriate interval depends on the amount of data to be fetched (which depends both on the number of web sites, and how many web pages each contains) and the update rate or freshness of the web sites. In some cases, there are many web sites with very static/stable data, and a few that are updated frequently; these can be configured as either separate collections, or given distinct settings through the use of a sub collection. The behavior of the crawler at the end of the crawl cycle depends upon the Refresh mode setting, the Refresh when idle setting, and the current level of activity. If the refresh cycle is long enough that all sites have been completely crawled, and the refresh when idle parameter is "no", the crawler will remain idle until the refresh interval ends. If the refresh when idle parameter is "yes", a new cycle will be started immediately. In the next cycle, the start URIs list is followed as in the first cycle. On the other hand, if the crawler is still busy, it will have work queues of pages yet to be fetched; in the default setting (scratch), the work queues are erased, and the cycles begin "from scratch", just as in the first cycle. Other options keep any existing work queues, and specify that the start URIs are to be placed at the end of the work queue list (append), or at the front (prepend). In the adaptive mode, the major cycle is subdivided into multiple micro cycles, with goals and limits regarding the number of pages to be revisited in each of these. It works by maintaining a scaled score for each page it retrieves, and this score is used to determine if a document should be re-crawled multiple times within a major cycle. This mode is mainly useful when crawling large bodies of data. For instance, if a site that is being crawled contains several million pages it can take, say, a month to completely crawl the site. If the "top" of 18 Introducing the FAST Enterprise Crawler the site changes frequently and contains high quality information it may be useful for these pages to be crawled more frequently than once a month. When in adaptive mode, the crawler will do exactly that. Excluding pages in other ways Because the ultimate goal of crawling in FAST ESP is indexing the textual content of web pages, rather than viewing a fully detailed web page (as with a browser), the standard configuration of the crawler includes some exceptions to the rules of what to fetch. A common example is graphical content; JPEG, GIF and bitmap files are excluded. Links to audio or video content are typically excluded to avoid downloading large amounts of content with no text content, although special multimedia crawls may choose to include this content for further processing. These restrictions can be implemented using either filename extensions (for example, any file that ends with ".jpg"), or via the Internet standard MIME type (for example, "image/jpeg"). Note that a MIME type screening requires the crawler to actually fetch part of the document whereas an extension exclude can be performed without any network access. 
There are several other controls that can be set to limit downloaded content. A per-page size limit can be set, with another option to control what happens when the size limit is exceeded (drop the page, or truncate it at the size limit). Another option limits the number of pages downloaded from a given web site, helpful if a large number of sites is being surveyed, and too many pages fetched from a "deep" site would limit the resources available and starve other sites. It is also an option to exclude pages based on the header information returned by web servers as part of the HTTP protocol. External limits on fetching pages Not every page that meets the crawler's configured rule set will be successfully fetched. In many cases, a "trivial" crawl configured with a single start URI and a rule including just that one site will start, then suddenly stop without any pages having been fetched. This is generally an issue when the site itself has signaled that it does not wish to be crawled. This section summarizes the ways that pages are NOT successfully crawled, and discuss how to recognize this situation. The first fact to consider is that not all pages exist! Documents can be removed from a web server, and due to the distributed nature of the web the links pointing to it may never disappear entirely. If such a "dead" link is provided to the crawler, by either harvesting it off a fetched page or listed as a start URI, it will result in an HTTP "404" error being returned by the remote web server, indicating "File Not Found". The HTTP status codes are logged in the Crawler Fetch Logs. Login Control One common mechanism for limiting access to pages, by either crawlers or browsers, is to require a login. Refer to Setting Up Crawler Cookie Authentication for more information. Robots Control Because the crawler (and programs similar to it, known collectively as "spiders" or "robots") collects web pages automatically, repetitively fetching pages from a web site, some techniques have been developed to give webmasters a measure of control over what can be fetched, and what pages can or cannot ultimately be indexed. This section will review these techniques, the site-wide robots.txt file, and the per-page robots META tags. The primary tool available to webmasters is the Robots Exclusion Standard (http://www.robotstxt.org), or more commonly known as the robots.txt standard. This was the first technique developed to organize the growing number of web crawlers, and is a commonly implemented method of restricting, or even prohibiting, crawlers from downloading pages. The way it works is that before a crawler fetches any page from a web site, it should first request the page /robots.txt. If the file doesn't exist, there are no restrictions on crawling documents from that server. If the file does exist, the crawler must download it and interpret the rules found there. A webmaster can choose to list rules specific to the FAST crawler, in which case the robots.txt file 19 FAST Enterprise Crawler would have an User-agent entry that matches what the crawler sends to identify itself, normally "FAST Enterprise Crawler 6", though in fact any string matching any prefix of this, such as "User-agent: fast", would be considered a match. In the most common case the webmaster can indicate that every crawler, "User-agent: *", is restricted from gathering any pages on the site, via the rule "Disallow: /". 
Any site blocked in this way is off-limits from crawling, unless the crawler is explicitly configured to override the block. This should only be done with the knowledge and permission of the webmaster. The Crawler Screened Log, which should be enabled for test crawls, would list any site blocked in this way with the entry DENY ROBOTS. The crawler supports some non-standard extensions to the robots.txt protocol. The directives are described in the following table: Table 2: robots.txt Directives Extension Comments Allow: This directive can override a Disallow: directive for a particular file or directory tree. Disallow: This directive is defined to be used with path prefixes (for example, "/samples/music/" would block any files from that directory), some sites specify particular file types to avoid(only excluding the extensions), such as Disallow: /*.pdf$, and the crawler obeys these entries. Crawl-delay: 120 This directive specifies the number of seconds to delay between page requests. If the crawler is configured with the Obey robots.txt crawl delay setting enabled (set to Yes/True), this value will override the collection-wide Delay setting for this site. Example: User-agent: * Crawl-delay: 5 Disallow: /files Disallow: /search Disallow: /book/print Allow: /files/ovation.txt Another tool that can be used to modify the behavior of visiting crawlers is known as robots META tags. Unlike the robots.txt file, which provides guidance for the entire web site, robots META tags can be embedded within any HTML web page, within the "head" section. For a META tag of name "robots", the content value will indicate the actions to take, or to avoid. While a page without such tags will be parsed to find new URIs before being indexed, the possible settings can prevent either or both of these actions by a crawler. In the following example, the page is being effectively blocked from further processing by any crawler that downloads it. Table 3: Robots META Tags Settings 20 Value Crawler Action index Accept the page contents for indexing. (Default) noindex Do not index the contents of this page. follow Parse the page for links (URIs) to other pages (Default) nofollow Do not follow any links (URIs) embedded in this page all All actions permitted (equivalent to "index, follow") none No further processing permitted (equivalent to "noindex,nofollow") Introducing the FAST Enterprise Crawler Example: <html> <head> <meta name="robots" content="noindex,nofollow"> <meta name="description" content="This page ...."> <title>...</title> </head> . . . <html> Removal of old content Over time the content on web sites change. Documents, and sometimes entire web sites, disappear and the crawler must be able to detect this. At the same time, web sites may also be unavailable for periods, and the crawler must be able to differentiate between these two scenarios. The crawler has two main methods of detecting removed content, which should be removed from the index. These two methods are broken links and document expiry. As the crawler follows links from documents it will inevitably come across a number of links that are not valid, i.e. the web server does not return a document but instead an HTTP error code. In some cases the web server may not even exist any more. If the document is currently in the crawler document store, i.e. it has been previously crawled, the crawler will take an appropriate action to either delete the document or retry the fetch later. 
The following table shows various HTTP codes and how they are handled by default. The action is configurable per collection.

Table 4: HTTP error handling

Error            Action taken
400-499          Delete document immediately.
500-599          Delete document on 10th failed fetch attempt.
Fetch timeout    Delete document on 3rd failed fetch attempt.
Network error    Delete document on 3rd failed fetch attempt. First retry is performed immediately.
Internal error   Keep the document.

The method of detecting dead links described above works well as long as the crawler locates links leading to the removed content. It is also sufficient in the adaptive refresh mode, since the crawler internally creates a work queue of all URIs it has previously seen and uses that for re-crawling. However, when not using adaptive refresh, a second method is necessary in order to correctly handle situations where portions of a site, or perhaps the entire web site, have disappeared from the web. In this case the crawler will most likely not discover links leading to each separate document.

The method used in this case is document expiry, usually referred to as DB switch in the crawler. The crawler keeps track internally of all documents seen in every refresh cycle. It is therefore able to create a list of documents not seen for the last X cycles, where X is defined as the DB switch interval. Under the assumption that the crawler is able to completely re-crawl every web site every crawl cycle, these documents no longer exist on the web servers. The action taken by the crawler on these documents depends on the DB switch delete option. The default value of this option is No, which instructs the crawler not to delete them immediately, but rather to place them on the work queue to verify that they are indeed removed from the web sites in question. Every document found to be irretrievable is subsequently deleted. This is the recommended setting; however, it is also possible to instruct the crawler to immediately discard these documents.

Care should be taken when adjusting the DB switch interval and especially the DB switch delete option. Setting the former too low while using a brief refresh cycle can lead to a situation where the crawler incorrectly interprets large numbers of documents as candidates for deletion. If the DB switch delete option is then set to Yes, it is entirely possible for the crawler to accidentally delete a large portion of the crawler store and index.

Browser Engine

The Browser Engine is a stand-alone component used by the Enterprise Crawler to extract information from JavaScript and Flash files. The flow from the crawler to the Browser Engine and back is explained below.

Normal processing

If the crawler detects a document containing one or more JavaScript or Flash files and the corresponding crawler option is enabled, the crawler submits the document to a Browser Engine for processing. When the Browser Engine receives the request, it picks a thread from its pool of threads and assigns the task to it. If the file is a Flash file, it is parsed for links. If the document contains JavaScript, the Browser Engine parses it, creates a DOM (document object model) tree and executes all inline JavaScript code. The DOM tree is then passed to a configurable pipeline within the Browser Engine. This pipeline constructs an HTML document, extracts cookies, generates a document checksum, simulates user interaction and extracts links. Finally the data is returned to the crawler.
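The following schematic sketch summarizes the dispatch just described. It is purely illustrative: the function name and document fields are hypothetical and do not correspond to any real crawler or Browser Engine API.

from concurrent.futures import ThreadPoolExecutor

def browser_engine_task(doc):
    # Flash content: the file is parsed for links only.
    if doc["kind"] == "flash":
        return {"links": ["http://www.example.com/found-in-flash"]}
    # JavaScript/HTML content: build a DOM tree, execute inline scripts, then
    # run the pipeline (HTML construction, cookie extraction, checksum,
    # simulated user interaction, link extraction).
    return {"html": "<html>...</html>", "cookies": [], "checksum": "...", "links": []}

# The Browser Engine assigns each incoming request to a thread from its pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    result = pool.submit(browser_engine_task, {"kind": "flash"}).result()
    print(result)   # this is the data handed back to the crawler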
Some documents processed by the Browser Engine require external documents (dependencies), such as scripts and frames. The Browser Engine will request these dependencies from the crawler, which in turn retrieves them as soon as possible. However, in order to reduce web server load the crawler still obeys the configured request rate for each of these dependencies. Once a dependency is resolved, a reply is sent back to the Browser Engine. In other words, the crawler functions as a download proxy for the Browser Engine.

The crawler stores the processed HTML document and sends it to the indexer. The crawler will also follow links from the generated HTML document, provided the URIs are permitted by the crawler configuration.

Overloading

If the Browser Engine has no available capacity when receiving a processing request, it attempts to queue the request. When the queue is full, the request is denied. The crawler automatically detects this situation and will attempt to send the request to another Browser Engine, if one is available. If none are available, the crawler uses an exponential back-off algorithm before resending the request, thus reducing the load on the Browser Engine. This means that for each failed request it will wait a bit longer before trying again. There is no upper limit on the number of retries.

A request to the Browser Engine is counted towards the maximum number of concurrent requests for the web site. The maximum number of pending requests to the Browser Engines is thus limited by this configuration option.

Chapter 2: Migrating the Crawler

Topics:
• Overview
• Storage overview
• CrawlerGlobalDefaults.xml file considerations
• The Migration Process

FAST Data Search (FDS) 4.1 and FAST ESP 5.0, and related products, included version 6.3 of the Enterprise Crawler (EC 6.3). If you are migrating an installation of a previous release, and need to preserve the crawler data store, this chapter outlines the necessary procedure. Refer to the FAST ESP Migration Guide for additional overall migration information.

Overview

FAST Data Search (FDS) 4.1, FAST ESP 5.0 and related products included version 6.3 of the Enterprise Crawler (EC 6.3). If you are migrating an installation from these releases, and need to preserve the crawler data store, this chapter outlines the necessary procedure. Upgrading from FAST ESP 5.1 or 5.2 can be done simply by preserving the crawler's data directory, as there are no changes to the storage backend between EC 6.6 and EC 6.7. Refer to the FAST ESP Migration Guide for additional overall migration information.

The EC 6.7 document storage is backwards compatible with that of EC 6.3 and EC 6.4, but the meta data store of EC 6.3 must be converted to be readable by the new version. More specifically, the meta data and configuration databases have new options or formats, to which existing data must be adapted. The document storage can be retained in the same format, or the format can be changed from flatfile to bstore with the migration tool.

The overall migration process consists of stopping the EC 6.3 crawler, so that its data becomes stable, then running an export tool in that installation to prepare the metadata for migration. In the new EC 6.7 installation, an import tool is run that can read the EC 6.3 databases and exported metadata, and copy, create, or recreate all necessary files.

Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3).
In previous ESP releases these two rule types were evaluated with an AND operator, meaning that a URI had to match both rule types (when both were defined). As of ESP 5.1 and later (EC 6.6 and up), rule processing uses an OR operator, meaning a URI now only needs to match one of the two rule types. For example, an older configuration for fetching pages with the prefix http://www.example.com/public would specify two rules:

• include_domains: www.example.com (exact)
• include_uris: http://www.example.com/public/ (prefix)

The first rule is no longer needed, and if not removed it would allow any URI from that host to be fetched, not only those under the /public path. Some configurations may be much more complex than this simple example, and require careful adjustment in order to restrict URIs to the same limits as before. Contact FAST Support for assistance in reviewing your configuration if in doubt. Existing crawler configurations migrating from EC 6.3 must be manually updated, by removing or adjusting the hostname include rules that overlap with URI include rules.

Note: While the EC 6.4 configuration and database are compatible with EC 6.7, and do not require special processing to convert them to an intermediate format, the import tool should still be used to copy the old installation to the new location. The one special case that requires a conversion is when the EC 6.4 installation uses the non-standard ppdup_format value "hashlog". The migration tool will recognize this case when copying an EC 6.4 installation, and automatically perform the conversion.

Storage overview

The crawler storage that is converted as part of the migration process is described in some detail in the following sections.

Document storage

In a typical crawler installation, all downloaded documents are stored on disk, and the content is retained even after having been sent to the search engine. The two datastore formats supported in an EC 6.3 installation are flatfile and bstore. In the flatfile method each downloaded URI is stored directly in the file system, with a base64 encoding of the URI used as the filename. While this can be expensive in terms of disk usage, it does allow obsolete documents to be deleted immediately. The alternative, bstore (block file storage), is more efficient in file system usage, writing documents into internally indexed 16MB files, though these also require additional processing overhead in terms of compaction.

Either bstore or flatfile may be specified when starting the import tool, allowing an installation to transition from one format to another. If a new storage format is not specified, the setting in the configuration database will be retained. Note that if your old data storage is in flatfile format, the migration process will be slower than migrating a bstore data storage; it is therefore suggested that you specify bstore format for the new crawler store. In either case, the number of clusters (that is, the number of subdirectories across which the data is spread, eight by default) cannot be modified during migration.

The original document storage will not be touched by the export operation. During the import operation, documents are read from the old storage and stored at their new location in the new format, one by one. Again, the original version of the storage will not be modified.

Path: $FASTSEARCH/data/crawler/store/<collection>/data/<cluster>

Meta database

The meta databases contain meta information about all URIs visited by the crawler.
Typical meta information for a URI includes document storage location, content type, outgoing links, document checksum, referrers, HTTP headers, and/or last modified date. The metadata store holds all the information the crawler has about the URIs it is, or has been, crawling. This information is organized into multiple databases that store the URIs and details about them, primarily the metadata database and the postprocess database. If a given URI has been successfully crawled, the metadata will also contain a reference to the document storage, a separate area where the actual contents of the downloaded page are kept on disk, available to be fed to the search engine.

The job of the migration tool is to transfer, and if necessary update or convert, the crawler store from an earlier crawler version so that it can be used by EC 6.7. During the export operation, all the meta databases will be dumped to an intermediary format and placed in the same directory as the original databases. The dumped versions of the databases are given the same filenames as the original databases, with the suffix .dumped_nn, where nn runs from 0 to the total number of dump files. In the import operation, all the dumped meta databases are loaded and stored, one by one, in the new database format. Optionally, each dump file can be deleted from the disk after processing.

Path: $FASTSEARCH/data/crawler/store/<collection>/db/<cluster>/

Postprocess Database

The postprocess (PP) databases contain a limited amount of metadata for each unique checksum produced by the postprocess process. For each item stored, it contains the checksum, the owner URI, duplicate URIs, and redirect URIs. During the import operation, all the EC 6.3 postprocess databases are copied to the new postprocess database directory.

Path: $FASTSEARCH/data/crawler/store/<collection>/PP/csum/

Duplicate server database

Duplicate servers are only used in multiple node crawler setups. They are used to perform duplicate detection across crawler nodes. The duplicate server database format is unchanged between versions EC 6.3 and EC 6.7. In the import operation, each database file is copied to the new duplicate server storage location.

Path: $FASTSEARCH/data/crawler/ppdup/

Configuration and Routing Databases

The configuration database contains all the crawler options set in a collection specification. The difference between EC 6.3 and EC 6.4-6.7 is the removal of several obsolete options, listed in the table below:

Table 5: Obsolete Database Options

Option                            Type      Comment
starturis (database cache size)   integer   Start URI database removed for 6.4 and later versions
starturis_usedb                   boolean   Start URI database removed for 6.4 and later versions
Compressdbs                       boolean   All databases compressed in 6.4 and later versions

When the configuration database is imported, the database will be read and all valid options will be used. Two potential modifications may also be made: if a proxy definition exists, it will be converted from a string to a list element (as EC 6.6 and 6.7 support multiple proxies), and if the data storage format is changed (via an import tool command line option) that configuration setting is updated.
Path: $FASTSEARCH/data/crawler/config/ (for crawler nodes)
Path: $FASTSEARCH/data/crawler/config_um/ (for ubermasters)

On an ubermaster, the routing database is migrated in addition to the configuration database. The routing database is not important in a single node installation; in a multiple node installation, however, it defines which crawler node each site is assigned to.

Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3). In previous ESP releases these two rule types were evaluated with an AND operator, meaning that a URI had to match both rule types (when both were defined). As of ESP 5.1 and later (EC 6.6 and up), rule processing uses an OR operator, meaning a URI now only needs to match one of the rule types. For example, an older configuration for fetching pages with the prefix http://www.example.com/public would specify two rules:

• include_domains: www.example.com (exact)
• include_uris: http://www.example.com/public/ (prefix)

The first rule is no longer needed, and if not removed it would allow any URI from that host to be fetched, not only those under the /public path. Note that more complex rules may require further adjustment. Existing crawler configurations migrating from EC 6.3 must be manually updated, by removing the hostname include rules that overlap with URI include rules.

Note: Be sure to manually copy any text or XML configuration files that might be used in the installation and that are located outside of the data/config file system, such as a start_uri_files listing.

CrawlerGlobalDefaults.xml file considerations

If you are using a CrawlerGlobalDefaults.xml file in your configuration, note that some options have been restructured from separate options in EC 6.3 into values within options, and possibly renamed, in EC 6.7. There are no such changes between EC 6.6 and 6.7.

The following tables list options that have been restructured for the EC 6.7 CrawlerGlobalDefaults.xml file, as well as how they were identified in the EC 6.3 version of this file. You will need to manually edit the CrawlerGlobalDefaults.xml file to use the new options.

Table 6: Domain Name System (DNS) Options Changes

The EC 6.7 option is dns. Refer to CrawlerGlobalDefaults.xml options on page 110 for detailed dns descriptions. Its valid values correspond to the following EC 6.3 options:

EC 6.7 value    EC 6.3 Option
min_rate        MinDNSRate
max_rate        MaxDNSRate
max_retries     MaxDNSRetries
timeout         DNSTimeout
min_ttl         DNSMinTTL
db_cachesize    DNSCachesize

Table 7: Feeding Options Changes

The EC 6.7 option is feeding. Refer to CrawlerGlobalDefaults.xml options on page 110 for detailed feeding descriptions. Its valid values, for the FDS feeding options related to postprocess and its behavior when submitting data to DataSearch, correspond to the following EC 6.3 options:

EC 6.7 value     EC 6.3 Option
priority         DSPriority
feeder_threads   Not available
max_outstanding  DSMaxOutstanding
max_cb_timeout   DSMaxCBTimeout
max_batch_size   Not available
fs_threshold     Not available

The Migration Process

The migration process consists of running two separate programs: the export tool and the import tool. The export tool is run in the EC 6.3 environment and the import tool is run in the EC 6.7 environment. If migrating an EC 6.4 installation, only the import tool needs to be run, from the EC 6.7 environment. This procedure is not necessary when going from EC 6.6 to 6.7.

• The export tool dumps all of the EC 6.3 databases to an intermediate data format on disk. The files will be placed alongside the original databases, named with the suffix .dumped_nn.
• The import tool loads these dump files one by one, creates new databases and migrates the document storage, producing a new 6.7 crawler store. This also includes the document store.

Note: Ensure that you have sufficient disk space to migrate your crawler store. This process requires significant amounts of free disk space, both to hold the intermediate format and to write the new (6.7) formatted data. The migration tool does not remove the old crawler store, and the new crawler store consumes approximately the same amount of disk space as the old one. Note that changing formats, for example bstore to flatfile, may result in an increase in disk usage.

To migrate the crawler:

1. Stop all crawler processes. Crawler processes include ubermaster, crawler, ppdup, and postprocess.

$FASTSEARCH/bin/nctrl stop crawler

Make sure the FASTSEARCH environment variable points to the old ESP installation (the one being migrated from).

2. Back up the crawler store or, as a minimum, back up the configuration database and files.

3. In an EC 6.3 installation, start the export tool. This example uses a crawler node with a collection named CollectionName:

$FASTSEARCH/bin/crawlerdbexport -m export -d /home/fast/esp50/data/crawler/ -g CollectionName

Make sure the FASTSEARCH environment variable points to the old installation (the one being migrated from). Observe the log messages output by the export tool to monitor processing progress. If you are running FAST ESP, the log messages will also appear under Logs in the administrator interface. Specifying the "-l debug" option will give more detailed information, but is not necessary in most cases. If no error messages are displayed, the export operation was successful. Skip this step if you are migrating an EC 6.4 installation.

Note: It is recommended to always redirect the stdout and stderr output of this tool to a log file on disk. Additionally, on UNIX use either screen or nohup in order to prevent the tool from terminating if the session is disconnected.

4. Set the FASTSEARCH environment variable to correspond to the new ESP installation. Refer to the FAST ESP Operations Guide, Chapter 1 for information on setting up the correct environment. It is advisable to start the crawler briefly in the new ESP installation, to verify that it is operating correctly, and then shut it down to prepare for migration.

5. Create all crawler collections in the new ESP installation, but leave data sources set to None:
a) Select Create Collection from the Collection Overview screen and the Description screen is displayed.
b) Enter the Name of the collection to be migrated (matching the crawler collection name exactly), and optional text for a description. This restores the original collection specification in the administrator interface; when you start the crawler, the configuration will be loaded automatically and the crawl will continue.
c) Proceed through the remaining steps to create a collection. Refer to Creating a Basic Web Collection in the Configuration Guide for a detailed procedure.
d) Leave the Data Source set to None; the crawler will be added once the collection has been migrated below. Click submit.

6. Run the import tool:

$FASTSEARCH/bin/crawlerstoreimport -m import -d /home/fast/esp50/data/crawler/ -g CollectionName -t master -n /home/fast/esp51/data/crawler/

Make sure the FASTSEARCH environment variable points to the new ESP installation (the one being migrated to).
Observe the log messages output by the import tool to monitor processing progress. If you are running FAST ESP, the log messages will also appear under Logs in the administrator interface. If no error messages are displayed, the import operation was successful.

Note: In addition to migrating the crawler store, the import tool outputs statistics, sites, URIs and the collection to separate files in the directory $FASTSEARCH/data/crawler/migrationstats. Once more, it is recommended to always redirect the stdout and stderr output of this tool to a log file on disk. Additionally, on UNIX use either screen or nohup in order to prevent the tool from terminating if the session is disconnected.

7. Start the crawler from the administrator interface or the console.

8. Associate the crawler with the collection in the FAST ESP administrator interface:
a) Select Edit Collection from the Collection Overview screen and the Collection Details screen is displayed.
b) Select Edit Data Sources from the Collection Details screen and the Edit Collection screen is displayed.
c) When you identify the crawler as a Data Source, carefully read through the collection specification to make sure everything is correct. Click submit.

9. To get the migrated documents into the new FAST ESP installation, you must run postprocess refeed, which requires the crawler to be shut down. Stop the crawler:

$FASTSEARCH/bin/nctrl stop crawler

10. To start the refeed, enter the following command:

$FASTSEARCH/bin/postprocess -R CollectionName -d $FASTSEARCH/data/crawler/

11. When the feeding has finished, check the logs to make sure the refeed was successful. Restart the crawler:

$FASTSEARCH/bin/nctrl start crawler

12. Shut down the old FAST ESP installation.

Note: If this migration process is terminated during processing, you must repeat the entire procedure. That is, delete all dump files and the new data directory generated by the tool. To delete all dumps, you may run any of the tools with the deldumps option. When this cleanup has finished, you must manually delete the new data directory from the disk. Upon completion, repeat the entire migration procedure. Contact FAST Technical Support for assistance.

Note: If only a subset of collections is migrated to the new version, the undesired collections will still be listed in the new configuration database, even though no data or metadata for those collections has been transferred. These lingering references must be deleted with the command crawleradmin -d oldCollection prior to starting the crawler in the new ESP installation.

Chapter 3: Configuring the Enterprise Crawler

Topics:
• Configuration via the Administrator Interface (GUI)
• Configuration via XML Configuration Files
• Configuring Global Crawler Options via XML File
• Using Options
• Configuring a Multiple Node Crawler
• Large Scale XML Crawler Configuration

This chapter describes how to limit and guide the crawler in selecting web pages to fetch and index, and describes the alternatives for what to do once the refresh interval has completed. It also describes how to modify an existing web data source, and how to configure and tune a large scale distributed crawler.

Configuration via the Administrator Interface (GUI)

Crawl collections may be configured using the administrator graphical user interface (GUI) or by using an XML based file. The administrator interface includes access to most of the crawler options.
However, some options are only available using XML.

Modifying an existing crawl via the administrator interface

Complete this procedure to modify an existing crawl collection.

1. From the Collection Overview screen, click the Edit button for the collection you want to modify. The following example selects collection1.

Note: The + Add Document function is not directly connected to the crawler, but rather attempts to add the specified document directly to the index. This may cause problems if the document already exists in the index and the crawler has found one or more duplicates of this document. In this case the submitted document may appear as a duplicate in the index, because the crawler is not involved in adding the document and duplicate detection is therefore not performed.

2. Click the Edit Data Sources button in the Control Panel and the following screen is displayed:

3. Click the Edit icon and the Edit Data Source Setup screen is displayed:

4. Work through the basic and advanced options, making modifications as necessary.
a) To add information, highlight or type information into the text box on the left, then click the add button; the selection is added to the text box on the right.
b) To remove information, highlight the information in the text box on the right, then click the remove button; the selected text is removed.

5. Click Submit and the Edit Collection collection1 Action screen is displayed. The modified data source crawler is now installed.

6. Click ok and you are returned to the Edit Collection collection1 Configuration screen. The configuration is now complete. This screen lists the name, description, pipeline, index, and data source information you have configured for collection1.

7. Click ok and you are returned to the Collection Overview screen.

Basic Collection Specific Options

The following table discusses the Basic collection specific options.

Table 8: Basic collection specific options

Start URIs
Enter start URIs in the Start URIs box. There is also a Start URI files option, which if specified must be an absolute path to a file existing on the crawler node. The format of the file is a text file containing URIs, separated by newlines. These options define a set of URIs from which to start crawling. At least one URI must be defined before any crawling can start. If the URI points to the root of a web site, make sure the URI ends with a slash (/). As URIs are added, exact hostname include filters are automatically generated and added to the list of allowed hosts in the Hostname include filters field. For example, if adding the URI http://www.example.com/ then all documents from the web site at www.example.com will be crawled.
Note: If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format.

Hostname include filters
Specify filters in the Hostname include filters field to specify the hostnames (web sites) to include in the crawl. Possible filter types are:
Exact - matches the identical hostname
File - identifies a local (to the crawler host) file containing include and/or exclude rules for the configuration. Note that in a multiple node configuration, the file must be present on all crawler hosts, in the same location.
IPmask - matches IP addresses of hostnames against specified dotted-quad or CIDR expression.
Prefix - matches the given hostname prefix (for example, "www" matches "www.example.com") Regexp - matches the given hostname against the specified regular expression in order from left to right Suffix - matches the given hostname suffix (for example, "com" matches "www.example.com") This option specifies which hostnames (web sites) to be crawled. When a new web site is found during a crawl, its hostname is checked against this set of rules. If it matches, the web site is crawled. If no hostname or URI include filters (see below) are specified then all web sites are allowed unless explicitly excluded (see below). If rules are specified, a hostname must match at least one of these filters in order to be crawled. For better crawler performance, use prefix, suffix or exact rules when possible instead of regular expressions. Note: If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format. Hostname exclude filters Specify filters in the Hostname exclude filters field to exclude a specific set of hostnames (web sites) from the crawl. The possible filter types are the same as for the Hostname include filters. This option specifies hosts you do not want to be crawled. If a hostname matches a filter in this list, the web site will not be crawled. If no setting is given, no sites are excluded. For better crawler performance, use prefix, suffix or exact rules when possible instead of regular expressions. Note: If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format. Request rate Select one of the options in the Request rate drop-down menu; then select the rate in seconds. This option specifies how often (the delay between each request) the crawler should access a single web site when crawling. Default: 60 seconds Note: FAST license terms do not allow a more frequent request rate setting than 60 seconds for external sites unless an agreement exists between the customer and the external site. 34 Configuring the Enterprise Crawler Option Refresh interval Description Specify the interval at which a single web site is scheduled for re-crawling in the Refresh interval field. The crawler retrieves documents from web servers. Since documents on web servers frequently change, are added or removed, the crawler must periodically crawl a site over again to reflect this. In the default crawler configuration, this refresh interval is one day (1440 minutes), meaning that the crawler will start over crawling a site every 24 hours. Since characteristics of web sites may differ, and customers may want to handle changes differently, the action performed at the time of refresh is also configurable, via the Refresh modesetting. Default: 1440 minutes Advanced Collection Specific Options This table describes the options in the advanced section of the administrator interface. Table 9: Overall Advanced Collection Specific Options Option URI include filters Description This option specifies rules on which URIs may be crawled. Leave this setting empty in order to allow all URIs, unless those excluded by other filters. The possible filter types are the same as for the Hostname include filters. The URI include filters field and the URI exclude filters field examine the complete URI (http://www.example.com/path.html) so the filter must include everything in the URI, and not just the path. An empty list of include filters will allow any URI, as long as it is allowed by the hostname include/exclude rules. 
For better crawler performance, use prefix, suffix or exact rules when possible instead of regular expressions. Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3). In previous ESP releases these two rule types were evaluated as an AND, meaning that a URI has to match both rules (when rules are defined). As of ESP 5.1 and later (EC 6.6 and up), the rules processing has changed to an OR operator, meaning a URI now only needs to match one of the two rule types. Existing crawler configurations migrating from EC 6.3 must be manually updated, by removing or adjusting the hostname include rules that overlap with URI include rules. If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format. URI exclude filters This option specifies which URIs you do not want to be crawled. If a URI matches one listed in the set, it will not be crawled. The possible filter types are the same as for the Hostname include filters. Note: If the crawl includes any IDNA hostnames, they must be input using UTF-8 characters, and not in the DNS encoded format. Allowed schemes This option specifies which URI protocols (schemes) the crawler should follow. Select the protocol(s) you want to use from the drop-down menu. Valid schemes: http, https, ftp and multimedia formats MMS, RTSP. 35 FAST Enterprise Crawler Option Description Note: MMS and RTSP for multimedia crawl is supported via MM proxy. Default: http, https MIME types This option specifies which MIME types will be downloaded from a site. If the document header MIME type is different than specified here, then the document is not downloaded. Select a MIME type you want to download from the drop-down menu. You can manually enter additional MIME types directly as well. That the crawler supports wildcard expansion of an entire field (only), for example */example, text/* or */*, but not appl/ms* is allowed. No other regular expressions are supported. Note: When adding additional MIME types beyond the two default types make sure the corresponding file name extensions are not listed in the Extension excludes list. Default: text/html, text/plain MIME types to search for links This option specifies MIME types of documents that the crawler should attempt to extract links from. If not already listed in the default list, type in a MIME type you want to search for links. This option differs from the MIME types option in that the MIME types to search for links denotes which documents should be inspected for links to crawl further, whereas the latter indicates all formats the crawler should retrieve. In effect, MIME types is always a superset of MIME types to search for links. Note: Wildcard on type and subtype is allowed. For instance, text/* or */html are valid. No other regular expressions are supported. Furthermore, the link extraction within the crawler only works on textual markup documents, hence you should not specify any binary document formats. Default: text/vnd.wap.wml, text/wml, text/x-wap.wml, x-application/wml, text/html, text/x-hdml Extension excludes This option specifies a list of suffixes (file name extensions) to be excluded from the crawl. The extensions are suffix string matched with the path of the URIs, after first stripping away any query portion. URIs that match the indicated file extensions will not be crawled. If not already listed in the default list, type in the link extensions you want to be excluded from the crawl. 
This option is commonly used to avoid unnecessary bandwidth usage through the early exclusion of unwanted content, such as images. Default: .jpg, .jpeg, .ico, .tif, .png, .bmp, .gif, .wmf, .avi, .mpg, .wmv, .wma, .ram, .asx, .asf, .mp3, .wav, .ogg, .zip, .gz, .vmarc, .z, .tar, .iso, .img, .rpm, .cab, .rar, .ace, .swf, .exe, .java, .jar, .prz, .wrl, .midr, .css, .ps, .ttf, .mso URI rewrite rules This option specifies rewrite rules that allows the crawler to rewrite special URI patterns. A rewrite rule is a grouped match regular expression and an expression that denotes how the matched pattern should be rewritten. The rewrite expression can have references to numbered groups in the match regexp, using regexp repetition. URI rewrites are applied as the URIs are parsed out from the documents during crawling. No rewriting occurs during re-feeding. If you add URI rewrites after you have crawled the URIs you wanted to rewrite, you will have to wait X (dbswitch) refresh cycles before they are fully removed from the index (they are not crawled anymore). The rewritten ones are added in place as they are crawled. In other words, there will be a time period in which both the rewritten and the not-rewritten URIs may be in the index. Running postprocess refeed will not help, however you may manually delete the URIs using the crawleradmin tool. 36 Configuring the Enterprise Crawler Option Description Since URIs are rewritten as they are parsed out of the documents, adding new URI rewrites would in some cases seem to not take immediate effect. The reason for this is that if the crawler already has a work queue full of URIs that are not rewritten, it must empty the work queue before it can begin to crawl the URIs affected by the rewrite rules. The format is: <separator><matched pattern><separator><replacement string><separator> The separator can be any single non-whitespace character, but it is important that the separator is selected so that it does not occur in either the matched pattern or the replacement string. The separator is given explicit as the first character of each rule. This example is useful if a website generates new session IDs each time it is crawled (resulting in an infinite number of links over time), but that pages will be displayed correctly without this session ID. @(.*[&?])session_id=.*?(&|$)(.*)@\1\3@ The @-character is the separator. Considering the URI http://example.com/dynamic.php?par1=val1&session_id=123456789&par2=val2 the rewrite rule above would rewrite the URI to http://example.com/dynamic.php?par1=val1&par2=val2 Default: empty Start URI files This option specifies a list of one or more Start URI files for the collection to be configured. Each file must be an absolute path/filename on the crawler node. A Start URI file is specified as the absolute path to a text file (for example, C:\starturifile.txt). The format of the files is one URI per line. All entries in the start URI files must match the Hostname include filters or the URI include filters or they will not be crawled. Default: empty Mirror site files Map file of primary/secondary servers for a site. This parameter is a list of mirror site files for the specified web site (hostname). The file format is a plain text, whitespace-separated list of hostnames, with the preferred (primary) hostname listed first. Format example: www.fast.no fast.no www.example.com example.com mirror.example.com Note: In a multiple node configuration, the file must be available on all masters. 
Default: empty Extra HTTP Headers This option specifies any additional headers to send to each HTTP GET request to identify the crawler. When crawling public sites not owned by the FAST ESP customer, the HTTP header must include a User-Agent string which must contain information that can identify the customer, as well as basic contact information (email or web address). Format: <header field>:<header value> 37 FAST Enterprise Crawler Option Description Specifying an invalid value may prevent documents from being crawled and prevent you from viewing/editing your configuration. The recommended User-Agent string when crawling public web content is <Customer name> Crawler (email address / WWW address). User agent information (company and E-mail) suitable for intranet crawling is by default added during installation of FAST ESP. Default: User-Agent: FAST Enterprise Crawler 6 used by <example.com> ([email protected]) Refresh mode These refresh modes determine the actions taken by the crawler when a refresh occurs. Although no refreshes occur when the crawler is stopped, the time spent is still taken into consideration when calculating the time of the next refresh. Thus, if the refresh period is set to two days and the crawler is stopped after one day and restarted the next day, it will then refresh immediately since two days have elapsed. Refresh is on a per site (single hostname) basis. Even though Start URIs are fed at a specific (refresh) interval by the master, each site keeps a record of the last time it was refreshed. Since sites are scheduled randomly based on available resources/URIs, the site refreshes quickly get desynchronized with the master Start URI feeding interval. Refresh modes other than Scratch and Adaptive do not erase any existing queues at the time of refresh. If the site(s) being crawled generate an infinite amount of URIs, or the crawl is very loosely restricted, this may lead to the crawler work queues growing infinitely. Valid modes: • • • • • Append - the Start-URIs are added to the end of the crawler work queue at the start of every refresh. If there are URIs in the queue, Start URIs are appended and will not be crawled until those before them in the queue have been crawled. Prepend - the Start URIs are added to the beginning of the crawler work queue at every refresh. However, URIs extracted from the documents downloaded from the Start URIs will still be appended at the end of the queue. Scratch - the work queue is truncated at every refresh before the Start URIs are appended. This mode discards all outstanding work on each refresh event. It is useful when crawling sites with dynamic content that produce an infinite number of links. This is useful when sites generate an infinite number of links, as sometimes seen for sites with dynamic content. Soft - if the work queue is not empty at the end of a refresh period, the crawler will continue crawling into the next refresh period. A server will not be refreshed until the work queue is empty. This mode allows the crawler to ignore the refresh event for a site if it is not idle. This allows large sites to be crawled in conjunction with smaller sites, and the smaller sites can be refreshed more often than the larger sites. Adaptive - build work queue according to scoring of URIs and limits set by adaptive section parameters. 
Default: Scratch Automatically refresh when This option allows you to specify whether the crawler automatically should trigger a new idle refresh cycle when the crawler goes idle (all websites are finished crawling) in the current refresh cycle. Select Yesto automatically refresh when idle. Select No to wait the entire refresh cycle length. Default: No Note: This option cannot be used with a multi node crawler. 38 Configuring the Enterprise Crawler Option Max concurrent sites Description This option allows you to limit the maximum number of sites being crawled concurrently. The value of this option, together with the request rate, controls the aggregated crawl-rate for your collection. A request rate of 1 document every 60 seconds, crawling 128 sites concurrently yields a theoretical crawl-rate of about 2 (128/60) documents per second. This option also impacts CPU usage and memory consumption; the more sites crawled concurrently, the more CPU and memory will be used. It is recommended that values higher than 2048 is used cautiously. In a distributed setup, the value applies per crawler node. Default: 128 Max document count per site This option sets the maximum amount of documents (web pages) to download from a web site per refresh cycle. When this limit is reached any remaining URIs queued for the site will be erased, and crawling of the site will go idle. Note: The option only restricts the per-cycle count of documents, not the number of unique documents across cycles. Therefore it's possible for a web site to exceed this number in stored documents if the documents found each cycle changes. Over time, the excess documents will however be removed by the document expiry functionality (DB switch interval setting). Default: 1000000 Max document size This option sets the maximum size of a single document retrieved from any site in a collection. If this limit is exceeded, the remaining documents are discarded or truncated to the indicated maximum size (see the Discard or truncate option). If you have large documents (for example, PDF files) on your site, and want to index complete documents, make sure that this option is set high enough to handle the largest documents in the collection. Default: 5000000 bytes Discard or truncate This option discards or truncates documents exceeding the maximum document count size determined in the previous entry. It is not recommended to use the truncate option except for text document collections. Valid values: Discard, Truncate Default: Discard Checksum cut-off When crawling multimedia content through a multimedia proxy (schemes MMS or RTSP), use this setting to adjust how the crawler determines whether a document has been modified or not. Rather than downloading the entire document, only the number of bytes specified in this setting will be transferred and the checksum calculated on that initial portion of the file. Only if the checksum of the initial bytes have changed is the entire document downloaded. This saves bandwidth after the initial crawl cycle, and reduces the load on other system and network resources as well. A setting of 0 will disable this feature (checksum always computed on entire document). Default: 0 Fetch timeout This setting specifies the time, in seconds, that the download of a single document is allowed to spend, before being aborted. Set this value higher if you expect to download large documents from slow servers, and you observe high average download times in the crawler statistics reported by the crawleradmin tool. 
39 FAST Enterprise Crawler Option Description Valid values: Positive integer Default: 300 seconds Obey robots.txt A robots.txt file is a standardized way for web sites to direct a crawler (for example, to not crawl certain paths or pages on the site). If the file exists it must be located on the root of the web site, e.g. http://www.example.com/robots.txt, and contain a set of Allow/Disallow directives. This setting specifies whether or not the crawler should follow the directives in robots.txt files when found. If you do not control the site(s) being crawled, it is recommended that you use the default setting and obey these files. Select Yes to obey robots.txt directives. Select No to ignore robots.txt directives. Default: Yes Check meta robots A meta robots tag is a standardized way for web authors and administrators to direct the crawler not to follow links or to save content from a particular page; it indicates whether or not to follow the directives in the meta-robots tag (noindex or nofollow). This option allows you to specify whether or not the crawler should follow such rules. If you do not control the site(s) being crawled, it is recommended that you use the default setting. Select Yes to obey meta robots tags. Select No to ignore meta robots tags. Example (HTML): <META name="robots" content="noindex,nofollow"> Default: Yes Ignore robots on timeout Before crawling a site, the crawler will attempt to retrieve a robots.txt file from the server that describes areas of the site that should not be crawled. If you do not control the site being crawled, it is recommended that you use the default setting. This option specifies what action the crawler should take in the event that it is unable to retrieve the robots.txt file due to a timeout, unexpected HTTP error code (other than 404) or similar. If set to ignore then the crawler will proceed to crawl the site as if no robots.txt exists, otherwise the web site in question will not be crawled. Select Yesto obey robots on timeout. Select No to ignore robots on timeout. Default: No Ignore robots auth sites This option allows you to control whether the crawler should crawl sites returning 401/403 Authorization Required for their robots.txt from the crawl. The robots.txt standard lists this behavior as a hint for a crawler to ignore the web site altogether. However, incorrect configuration of web servers is widespread and can lead to a site being erroneously excluded from the crawl. Enabling this option makes the crawler ignore such indications and crawl the site anyway. Default: Yes 40 Configuring the Enterprise Crawler Option Obey robots.txt crawl delay Description This parameter indicates whether or not to follow the Crawl-delay directive in robots.txt files. In a site's robots.txt file, this non-standard directive may be specified (e.g. Crawl-Delay: 120, where the numerical value is the number of seconds to delay between page requests. If this setting is enabled, this value will override the collection-wide request rate (delay) setting for this web site. Default: No Robots refresh interval This option allows you to specify how often (in seconds) the crawler will re-download the robots.txt file from sites, in order to check if it has changed. Note that the robots.txt file may be retrieved less often if the site is not crawling continuously. the refresh interval of robots.txt files.The time period is on a per site basis and after it expires the robots.txt file will be downloaded again and the rules will be updated. 
Reduce this setting to pick up robots changes more quickly, at the expense of network bandwidth and additional web server requests. Default: 86400 seconds (24 hours) Robots timeout Before crawling a site, the crawler will attempt to retrieve a robots.txt file from the server that describes areas of the site that should not be crawled. This option allows you to specify the timeout to apply when attempting to retrieve robots.txt files. Set this value high if you expect to have comparably slow interactions requesting robots.txt. Default: 300 Near duplicate detection This option indicates whether or not to use the near duplicate detection scheme. This option can be specified per sub collection. Refer to Configuring Near Duplicate Detection on page 125 for more information. Default: No (disabled) Perform duplicate detection? This parameter indicates whether document duplicate detection should be enabled or not. Default: Yes Use HTTP/1.1 This option allows you to specify whether you want to use HTTP 1.1 or HTTP 1.0 requests when retrieving web data. HTTP/1.1 is required for the crawler to accept compressed documents from the server (the Accept Compression option) and enable ETag support (Send If-Modified-Since option). Select Yes to crawl using HTTP/1.1. Select No to crawl using HTTP/1.0. When using cookie authentication there may be instances where HTTP/1.1 is not supported and you should select No. Default: Yes Send If-Modified-Since If-Modified-Since headers allow bandwidth to be saved, as the web server only will send a document if the document has changed since the last time the crawler retrieved it. Also, if web servers report an ETag associated with a document, the crawler will set the If-None-Match header when this setting and HTTP/1.1 is enabled. Select Yes to send If-Modified-Since headers. Select No to not send If-Modified-Since headers. Web servers may give incorrect information whether or not a document has been modified since the last time the crawler retrieved it. In those instances select No to allow the crawler 41 FAST Enterprise Crawler Option Description to decide whether or not the document has been modified instead of the web server, at the expense of increased bandwidth usage. Default: Yes Accept compression Specify whether the crawler should use the Accept-Encoding header, thus accepting that the documents are compressed at the web server before returned to the crawler. This may save bandwidth. This option only applies if HTTP/1.1 is in use. Select Yes to accept compressed content. Select No to not accept compressed content. Default: Yes Send/receive cookies This feature enables limited support for cookies in the crawler, which might enable crawling cookie-based sessions for a site. Some limitations apply, mainly that cookies will only be visible across web sites handled within the same uberslave process. Note: Note that this feature is unrelated to cookie support as described in the Form Based Login on page 57 section. Select Yes to enable cookie support. Select No to disable cookie support. Default: No Extract links from duplicates Even though two documents have duplicate content, they may have different links. The reason for this is that all markup, including links, is stripped from the document prior to generating a checksums for use by the duplicate detection algorithm. This option lets you specify whether or not you want the crawler to extract links from documents detected as duplicates. 
If enabled, you may get an increased amount of duplicate links in the URI-queues. If duplicate documents contain duplicate links then you can disable this parameter. Note: Even though duplicate URIs exist on the work queues, a single URI is only downloaded once each refresh cycle. Select Yes to extract links from documents that are duplicates. Select No to not extract links from documents that are duplicates. Default: No Macromedia Flash support Select Yes to enable retrieval of Adobe Flash files, and limited link extraction within these. The flash files are indexed as separate files within the searchable index. Select No to disable Adobe Flash support. You may also want to enable JavaScript support, as many web servers only provide Flash content to clients that support JavaScript. Note: Flash processing is resource intensive and should not be enabled for large crawls. Note: Processing Macromedia Flash files requires an available Browser Engine. Please refer to the FAST ESP Browser Engine Guide for more information. Default: No 42 Configuring the Enterprise Crawler Option Sitemap support Description Enabling this option allows the crawler to detect and parse sitemaps. The crawler support sitemap and sitemap index files as defined by the specification at http://www.sitemaps.org/protocol.php. The crawler uses the 'lastmod' attribute in a sitemap to see if a page has been modified since the last time the sitemap was retrieved. Pages that have not been modified will not be recrawled. An exception to this is if the collection uses adaptive refresh mode. In adaptive refresh mode the crawler will use the 'priority' and 'changefreq' attributes of a sitemap in order to determine how often a page should be crawled. For more information see Adaptive Crawlmode on page 49. Custom tags found in sitemaps are stored in the crawlers meta database and can be submitted to document processing. Note: Most sitemaps are specified in robots.txt. Thus, 'obey robots.txt' should be enabled in order to get the best result. Default: No JavaScript support Select Yes to enable JavaScript support. The crawler will execute JavaScripts embedded within HTML documents, as well as retrieve and execute external JavaScripts. Select No to disable JavaScript support. Note: JavaScript processing is resource intensive and should not be enabled for large crawls. Note: Processing JavaScript requires an available Browser Engine. Please refer to the FAST ESP Browser Engine Guide for more information. Default: No JavaScript keep original HTML Specify whether to submit the original HTML document, or the HTML resulting from the JavaScript parsing, to document processing for indexing. When parsing a HTML document the Browser Engine executes all inlined and external JavaScripts, and thereby all document.write() statements, and includes these in its HTML output. By default it is this resulting document that is indexed. However it is possible to use this feature to limit the Browser Engine to link extraction only. This option has no effect if JavaScript crawling is not enabled Default: No JavaScript request rate Specify the request rate (delay) in seconds to be used when retrieving external JavaScripts referenced from a HTML document. By default this rate is the same as the normal request rate, but it may be set lower to speed up crawling of documents containing JavaScripts. To specify the default value leave the option blank. 
This option has no effect if JavaScript crawling is not enabled Default: Empty FTP passive mode This option determines if the FTP server (active) or the crawler (passive) should set up the data connection between the crawler and the server. Passive mode is recommended, and is required when crawling FTP content from behind a firewall. Select Yes to crawl ftp sites in passive mode. Select No to crawl ftp sites in active mode. Default: Yes 43 FAST Enterprise Crawler Option FTP search for links Description This option determines whether or not the crawler should run documents retrieved from an FTP server through the link parser to extract any links contained. Select Yes to search FTP documents for links. Select No to not search FTP documents for links. Default: Yes Include meta in csum The crawler differentiates between content and META tags when detecting duplicates and detecting whether a document has been changed. Select Yes and the crawler will detect changes in META tags in addition to the document content. This means that only documents with identical content and META tags are treated as duplicates. Select No and the crawler will detect changes in content only. This means that documents with the same content is treated as duplicates even if the META tags are different. Default: No Sort URI query params Example: If http://example.com/?a=1&b=2 is really the same URI as http://example.com/?b=2&a=1, then the URIs will be rewritten to be the same when this option is enabled. If not, the two URIs most likely will be screened as duplicates. The problem arises if the two URIs are crawled at different times, and the page has changed during the time of which the first one was crawled. In this case you can end up with both URIs in the index. Select Yes to enable sorting of URI query parameters. Select No to disable sorting of URI query parameters. Default: No Enforce request rate per IP This option allows you to control whether the crawler should enforce the request rate on a per IP basis. If enabled, a maximum of 10 sites sharing the same IP will be crawled in parallel. Additionally, at most Max pending requests will be issued to this IP in parallel. This prevents overloading the server(s) that host these sites. If disabled, sites sharing the same IP will be treated as unique sites, each hit with the configured request rate. Default: Yes Enforce MIME type detection This option allows you to decide whether or not the crawler should run its own MIME type detection on documents. In most cases web servers return the MIME type of documents when they are downloaded, as part of the HTTP header. If this option is enabled, documents will get tagged with the MIME type that looks most accurate; either the one received from the web server or the result of the crawlers determination. Default: No (disabled) Send logs to Ubermaster If enabled (as by default), all logging is sent to the ubermaster host for storage, as well as stored locally. In large multiple node configurations it can be disabled to reduce inter-node communications, reducing resource utilization, at the expense of having to check log files on individual masters. Default: Yes (enabled) Note: This option only applies to multi node crawlers 44 Configuring the Enterprise Crawler Option META refresh is redirect Description This option allows you to specify whether the crawler should treat META refresh URIs as HTTP redirects. Use together with META refresh threshold option which lets you specify the upper threshold of this option. 
Default: Yes META refresh threshold This option allows you to specify the upper limit on the refresh time for which a META refresh URI is considered a redirect (The META refresh is redirect option must be enabled.) Example: Setting this option to 3 will make the crawler treat every META refresh URI with a refresh of 3 seconds or lower as a redirect URI. Default: 3 seconds DB switch interval Specify the number of cycles a document is allowed to exist without having been seen by the crawler, before expiring. When a document expires, the action taken is determined by the DB switch delete setting. The age of a document is not affected by force re-fetches; only cycles where the crawler refreshes normally (by itself) increases the document's age if not found. This mechanism is used by the crawler to be able to purge documents that are no longer linked to from the index. It is not used to detect dead links such as documents returning an error code, e.g. 404. This check is performed at the beginning of each refresh cycle individually for each site. A similar check is performed for sites that have not been seen at the start of each collection level refresh Valid values: Positive integer Default: 5 Note: Setting this value very low, e.g 1-2, combined with a DB switch delete setting of Yes can result in documents being incorrectly identified as expired and deleted very suddenly. DB switch delete The crawler will at regular intervals perform an update of its internal database of retrieved documents, to detect documents that the crawler has not seen for DB switch interval number of refresh cycles. This option determines what to do with these documents; they can either be deleted right away or put in the work queue for a retrieval attempt to make certain they are actually removed from the web server. Select Yes to delete documents immediately. Select No to verify that the documents no longer exist on the web server before deleting them. Default: No Note: This setting should normally be left at the default setting of No in order to avoid situations where the crawler may incorrectly believe that a set of documents have been deleted and immediately deletes them from the crawler store and index Workqueue filter When this feature is enabled, the crawler will associate a Bloom filter with its work queues, thereby reducing the degree of duplicates that go onto the queue. This way the queues will grow more slowly and therefore use less disk I/O and space, plus save memory since Bloom filters are very memory efficient. The drawback with Bloom filters is that there is a very low probability of false positives, which means that there is a theoretical chance may lose some URIs that would be crawled if work queue filters were disabled. Disable this feature if this risk is a problem and added disk overhead is not problematic. Select Yes to enable use of Bloom filters with work queues. 45 FAST Enterprise Crawler Option Description Select No to disable use of Bloom filters with work queues. Default: Yes Master/Ubermaster filter This parameter enables a Bloom filter to screen links transferred between masters and the ubermaster. The value is the size of the filter, specifying the number of bits allocated, which should be between 10 and 20 times the number of URIs to be screened. Note that enabling this setting with a positive integer value disables the crosslinks cache. It is recommended that you turn on this filter for large crawls; recommended value is 500000000 (500 megabit). 
Default: 0 (disabled) Master/Slave filter When this feature is enabled, the crawler slave processes use a Bloom filter in the communication channel with the master process, which reduces Inter Process Communication (IPC) and memory overhead. The drawback with Bloom filters is that there is a non-zero chance of false positives, which may cause URIs to be lost by the crawler. Use this feature if this risk is not a concern and there is CPU and memory contention on the crawler nodes. It is recommended that you turn on this filter for large crawls; the recommended value is 50000000 (50 megabit). Valid values: Zero or positive integer Default: 0 (filter is disabled) Max docs before interleaving The crawler will by default crawl a site to exhaustion. However, the crawler can be configured to crawl "batches" of documents from sites at a time, thereby interleaving between sites. This option allows you to specify how many documents you want to be crawled from a server consecutively before the crawler interleaves and starts crawling other servers. The crawler will then return to crawling the former server as resources free up. Valid values: No value (empty) or positive integer Default: empty (disabled) Note: Since this feature will stop crawling web sites without fully emptying their work queues on disk first, it may lead to excessive amounts of work queue directories/files on large scale crawls. This can impact crawler performance if the underlying file system is not able to handle it properly. Max referrer links This option specifies the maximum number of referrer levels the crawler will track for each URI. As this feature is quite performance intensive, the setting should no longer be used; instead, the Web Analyzer should be queried to extract this information. It is recommended that you contact FAST Solution Services if you still decide to modify the default setting. Valid values: Positive integer Default: 0 Max pending requests Specify the maximum number of concurrent (outstanding) HTTP requests to a single site at any given time. The crawler may make overlapping requests to a site, and this setting determines the maximum degree of this overlapping. If you do not control the site(s) being crawled, it is recommended that you use the default setting. Keep in mind that regardless of this setting the crawler will not issue requests to a single web site more often than specified by the Request rate setting. Valid values: Positive integer Default: 2 Max pending proxy-requests Proxy open connection limit. This parameter specifies a limit on the number of outstanding open connections per HTTP proxy, per uberslave process in the configuration. Valid values: Positive integer Default: 2147483647 Max redirects This option allows you to specify the maximum number of redirects that should be followed from a URI. Example: http://example.com/path redirecting to http://example.com/path2 will be counted as 1. Default: 10 Max URI recursion This option allows you to specify the maximum number of times a repeating pattern is allowed to be appended to a URI's successors. Example: http://example.com/path/ linking to http://example.com/path/path/ will be counted as 1. A value of 0 disables the test. Default: 5 Max backoff count/delay Together these options control the adaptive algorithm by which a site experiencing connection failures (for example, network errors, timeouts, HTTP 503 "Server Unavailable" errors) is contacted less frequently.
For each consecutive instance of these errors, the inter-request delay for that site is incremented by the initial delay setting (the Request rate setting): increased delay = current delay + initial delay. The maximum delay for a site will be the Max backoff delay setting. If the number of failures reaches Max backoff count, crawling of the site will become idle. Should the network issues affecting the site be resolved, the internal backoff counter will start decreasing, and the inter-request delay is halved on each successful document fetch: decreased delay = current delay / 2. This continues until the original delay (the Request rate setting) is reached. Default: Max backoff count = 50; Max backoff delay = 600
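The following minimal Python sketch illustrates the backoff calculation described for Max backoff count/delay above. It is an illustration only, not crawler source code; the function names are invented here, and the example assumes a 60 second request rate and the default Max backoff delay of 600 seconds.

def next_delay_after_failure(current_delay, initial_delay, max_backoff_delay):
    # Each consecutive connection failure adds one initial delay, capped at the maximum.
    return min(current_delay + initial_delay, max_backoff_delay)

def next_delay_after_success(current_delay, initial_delay):
    # Each successful fetch halves the delay until the original request rate is reached.
    return max(current_delay / 2.0, initial_delay)

delay = 60.0
for _ in range(12):                              # twelve consecutive failures
    delay = next_delay_after_failure(delay, 60.0, 600.0)
print(delay)                                     # 600.0, capped at Max backoff delay
print(next_delay_after_success(delay, 60.0))     # 300.0 after one successful fetch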
SSL key/certificate file This option sets the filename for the file containing your client SSL key and certificate. Type in a path and filename; the path and filename must be an absolute path on the crawler node. Example: /etc/ssl/key.pem Default: empty Note: This option is not necessary to specify in order to crawl HTTPS web sites. It is only required if the web site requires the crawler to identify itself using a client certificate. Document evaluator plugin Specify a user-written Python module to be used for processing fetched documents and (optionally) URI redirections. The value specifies the Python module, ending with your class name. The crawler splits on the last '.' separator and converts this to the Python equivalent "from <module> import <class>". The source file should be located relative to $PYTHONPATH, which for FDS installations corresponds to ${FASTSEARCH}/lib/python2.3/. Refer to Implementing a Crawler Document Plugin Module on page 118 for more information. Variable request rate This option allows you to specify specific time slots when you want to use a higher or lower request rate than the main setting. Time slots are specified as starting and ending points, and cannot overlap. Time slot start and end points are specified with day of week and time of day (optionally with minute resolution). Note that no two time slots can have the same delay values; each value must be unique, for example 2.0, 2.1, and so forth. You can also enter the value Suspend in the Delay field, which will suspend the crawler so that there is no crawling for the time span specified. Example time slots crawling at a 60 second delay during weekends and no crawling during weekdays: Time span: Fri:23.59-Sun:23.59, Delay: 60; Time span: Mon:00-Fri:23, Delay: Suspend. Note: Entering very long delays (above 600 seconds) is not recommended as it may cause problems with sites requiring authentication. To suspend crawling for a period always use the Suspend value. HTTP errors This option allows you to specify how the crawler handles various HTTP response codes and errors. It is recommended that you contact FAST Solution Services if you decide to modify the default setting. The following actions can be configured for each condition: KEEP - no action is taken, the document is not deleted. DELETE[:X] - the document is deleted if the error condition persists over X retries; X refers to the number of refresh cycles the same error condition must occur before the document is considered deleted. If X is unspecified or 0, the document is deleted immediately. RETRY[:X] - X refers to the number of retries within the same refresh cycle that should be attempted before giving up. A value of DELETE:3, RETRY:1 would thus attempt to fetch a document with this error condition twice every refresh cycle, and after 3 refresh cycles the document, if it was at some point stored and added to the index, will be deleted. The protocol response codes are divided into general client-side errors (4xx) and general server-side errors (5xx). Behavior for individual 400/500 errors can also be specified. There are three classes of non-protocol errors that can be configured: ttl - specifies handling for connections that time out; net - specifies handling for network/socket-level errors; int - specifies handling for other internal errors. Example: To delete a document after 3 consecutive retries for an HTTP 503 error, enter 503 in the Error box, and DELETE:3, RETRY:1 in the Value box, then click on the right arrow. FTP errors This option is the equivalent of the HTTP errors option for FTP errors. Example: To delete a document after 3 consecutive retries for an FTP 550 error, enter 550 in the Error box, and DELETE:3, RETRY:1 in the Value box, then click on the right arrow. FTP accounts This option allows you to specify a number of FTP accounts required for crawling FTP URIs. If unspecified for a site, the default anonymous user will be used. Specify the hostname of the FTP site in the Hostname box, and the username and password in the Credentials box. The format of the Credentials is: <USERNAME>:<PASSWORD> Example (Credentials): myuser:secretpassword Crawl sites if login fails This parameter allows you to specify whether or not you want the crawler to continue crawling a site after a configured login specification has failed. Select Yes to attempt crawling of the site regardless. Select No to disallow crawling of the site. Default: No Domain clustering In a web scale crawl it is possible to optimize the crawler to take advantage of locality in the web link structure. Sub-domains on the same domain tend to link more internally than externally, just as a site tends to have mostly internal links. The domain clustering option enables clustering of sites on the same domain (for example, *.example.net) on the same master node and the same storage cluster (and thus uberslave process). This option also affects clustering within a single node, where all sites clustered in the same domain will be handled by the same uberslave process. This ensures cookies (if Send/receive cookies is enabled) can be used across a domain within the crawler. Default: No Note: This option is automatically turned on for multi node crawls by the ubermaster. Adaptive Crawlmode This section describes the adaptive scheduling options. Note that these parameters are only applicable if the Refresh mode is set to adaptive. Note: Extensive testing is strongly recommended before production use, to ensure that desired pages and sites are properly represented in the index. Table 10: Adaptive Crawlmode Options Minor Refresh count Number of minor cycles within the major cycle. A minor cycle is sometimes referred to as a micro cycle. Refresh quota Ratio of existing URIs re-crawled to new (unseen) URIs, expressed as a percentage. As long as the crawler has sufficient URIs of both types, this ratio is used. However, if it runs out of URIs of either type it will crawl only the other type from then on until refresh kicks in, or the site reaches some other limit (e.g. maximum document count for the cycle).
A high value favors re-crawling existing content (recommended); a low value favors fresh content. Minor Refresh Min Coverage Minimum number of URIs from a site to be crawled in a minor cycle. Used to guarantee some coverage for small sites. Minor Refresh Max Coverage Limits the percentage of a site re-crawled within a minor cycle. Ensures that small sites are not crawled fully in each minor cycle, starving large sites. When configuring this option, keep the number of minor cycles in mind. With e.g. 4 minor cycles this option should be 25% or higher, to ensure the entire site is re-crawled over the course of a major cycle. If the crawler detects that this value is set too low it may increase it internally. URI length weight Each URI is scored against a set of rules to determine its crawl rank value. The crawl rank value is used to determine the importance of the particular URI, and hence the frequency at which it is re-crawled (from at most once every minor cycle to only once every major cycle). Each rule is assigned a weight to determine its contribution towards the total rank value. Higher weights produce a higher rank contribution. A weight of 0 disables a rule altogether. The URI length scoring rule is based on the number of slashes (/) in the URI path. The document receives the maximum score if there is only a single slash, down to no score for 10 slashes or more. Increase this setting to boost the priority of URIs with few levels (slashes in the path). Default weight: 1.0 Valid Range: 0.0-2^32 URI depth weight The URI depth score is based on the number of link "hops" to this URI. Maximum score for none (for example, a start URI), no score for 10 or more. Use this setting to boost the priority of URIs linked closely from the top pages. Default weight: 1.0 Valid Range: 0.0-2^32 Landing page weight The landing page score awards a bonus score if the URI is determined to be a "landing page". A landing page is defined as any URI whose path ends in one of the following: /, index.html, index.htm, index.php, index.jsp, index.asp, default.asp, default.html, default.htm. Any URI with query parameters receives no score. Use this option to boost landing pages. Default weight: 1.0 Valid Range: 0.0-2^32 Markup document weight The markup document score awards a bonus score if the document at the URI is determined to be a "markup" page. A markup page is a document whose MIME type matches one of the MIME types listed in the MIME types to search for links option. This option is used to give preference to more dynamic content as opposed to static document types such as PDF, Word, etc. Default weight: 1.0 Valid Range: 0.0-2^32 Change history weight The change history rule scores a document on the basis of how often it changes over time. The crawler does this by keeping track of whether a document has changed, or remains unchanged, each time it is re-downloaded. An estimate is then made of how likely this document is to have changed the next time. Use this option to boost pages that change frequently, compared to static non-changing pages. Default weight: 10.0 Valid Range: 0.0-2^32 Sitemap weight The sitemap score is based on metadata found in sitemaps. The score is calculated by multiplying the value of the changefreq parameter with the priority parameter of a sitemap. Use this option to boost pages that are defined in sitemaps. Default weight: 10.0 Valid Range: 0.0-2^32
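The following minimal Python sketch illustrates how the sitemap score described above can be computed from the changefreq and priority attributes of a sitemap entry, using the numerical changefreq mappings listed under the Changefreq options below. It is an illustration only, not crawler source code; the function name is invented here, and scaling the result by the configured Sitemap weight is an assumption about how the score contributes to the total rank value.

# Documented default changefreq mappings (see the Changefreq options below).
CHANGEFREQ_VALUES = {
    "always": 1.0, "hourly": 0.64, "daily": 0.32, "weekly": 0.16,
    "monthly": 0.08, "yearly": 0.04, "never": 0.0,
}
CHANGEFREQ_DEFAULT = 0.16    # assigned when a sitemap entry has no changefreq attribute

def sitemap_score(changefreq, priority, sitemap_weight=10.0):
    # Sitemap score = changefreq value * priority, scaled by the configured weight (assumed).
    value = CHANGEFREQ_VALUES.get(changefreq, CHANGEFREQ_DEFAULT)
    return sitemap_weight * value * priority

print(sitemap_score("daily", 0.8))    # 2.56 for a daily entry with priority 0.8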
Changefreq always value This value is used to map the changefreq string value "always" in sitemaps to a numerical value. Default weight: 1.0 Valid Range: 0.0-2^32 Changefreq hourly value This value is used to map the changefreq string value "hourly" in sitemaps to a numerical value. Default weight: 0.64 Valid Range: 0.0-2^32 Changefreq daily value This value is used to map the changefreq string value "daily" in sitemaps to a numerical value. Default weight: 0.32 Valid Range: 0.0-2^32 Changefreq weekly value This value is used to map the changefreq string value "weekly" in sitemaps to a numerical value. Default weight: 0.16 Valid Range: 0.0-2^32 Changefreq monthly value This value is used to map the changefreq string value "monthly" in sitemaps to a numerical value. Default weight: 0.08 Valid Range: 0.0-2^32 Changefreq yearly value This value is used to map the changefreq string value "yearly" in sitemaps to a numerical value. Default weight: 0.04 Valid Range: 0.0-2^32 Changefreq never value This value is used to map the changefreq string value "never" in sitemaps to a numerical value. Default weight: 0.0 Valid Range: 0.0-2^32 Changefreq default value This value is assigned to all documents that have no changefreq attribute set in a sitemap. Default weight: 0.16 Valid Range: 0.0-2^32 Authentication This section of the Advanced Data Sources screen allows you to configure authentication credentials for the Basic, NTLM v1 or Digest schemes. Note: After an Authentication item has been added, it cannot be modified. To modify an existing item, save it under a new name and delete the old one. Table 11: Authentication Options URI Prefix or Realm An identifier based on either a URI prefix or an authentication realm. The corresponding credentials (Username, Password, and optionally Domain) will be used in an authentication attempt if either a URI matches the URI prefix string from left to right, or the specified Realm matches the value returned by the web server in a 401/Unauthorized response. Username Specify the username to use for the login attempt. This value will be sent to every URI that matches the specified URI prefix or realm. Password Specify the password to use for authentication attempts. This value will be sent to every URI that matches the specified prefix or realm. Domain Specify the domain value to use for authentication attempts. This value is optional. Authentication scheme Specify the scheme to use in authentication attempts. If auto is specified, the crawler selects one on its own. Note: If authentication fails, crawling of the site will stop. Cache Sizes It is recommended that you contact FAST Solution Services before changing any of the Cache sizes options. The default selections are shown in the following screen. The options with empty defaults are automatically adjusted by the crawler. Figure 4: Cache Size Options Crawl Mode This table describes the advanced options that apply to the Crawl mode. Table 12: Crawl Mode Options Crawl mode Select how web sites in the collection should be crawled from the Crawl mode drop-down menu. Highlight the type to be used. Possible modes are: Full - use if you want the crawler to crawl through all levels of a site. Level - use to indicate the depth of the crawl as defined in the Max levels option. The start level is the level of the start URI specified in the Start URI files.
The crawler assumes that all cross-site links are valid links and will follow these links until it reaches the number of levels specified in Max Levels. If the crawler crawls two sites that are closely interlinked, it may crawl both sites entirely, despite the given maximum level. You can prevent this by either: • • Limiting the included domains in Hostname Includes Selecting No in the Follow cross-site URIs Default: Full Max levels This option allows you to specify the maximum number of levels to crawl from a server. The crawler considers all cross-links to be valid and follows all cross-links the same amount of levels. If the sites you are crawling are heavily cross-linked, you may crawl entire sites. This option only applies when the Crawl mode option is set to Level. If unspecified, a Level Crawl mode will default to Max level 0. Example: 1 (the crawler crawls only the URI named in the Start URI files and any links from the Start URI) Default: empty 53 FAST Enterprise Crawler Option Description Note: Frame links, JavaScripts and redirects do not increase the level counter, therefore even a depth 0 crawl may follow these links. In this case it is possible to specify the depth as -1 instead, this will not follow any links. Follow cross-site URIs This option allows you to select whether the crawler is to follow cross-site URIs from one web site to another. Select Yes and the crawler will follow any links leading from the start URI sites as long as they fulfill the Hostname include filters criteria. Select No and the crawler will only follow "local" links with the same web site. It will not follow links from one web site to another even if the site is included by the Hostname include rules. Default: Yes Note: If cross-site link following is turned of it is necessary that each site to be crawled has an entry in the start URIs list. Note: The crawler treats a single hostname as a single web site, hence it will identify example.com and www.example.com as two different web sites, even though they may appear the same to the user. Follow cross-site redirects Specifies whether or not to follow external redirects from one web site to another. Default: Yes Reset crawl level This option allows you to select whether the crawler is to reset the crawl level when crawling cross-site. Select Yes to enable the crawler to reset the crawl level when leaving the start URI sites and crawling sites leading from there. The crawl mode will be reset to default (Crawl mode = Full). Select No to ensure that the crawler will not reset the crawl level, and the crawl mode and level set for the start URIs will also apply for external sites. Default: No Crawling Thresholds This option allows you to specify certain threshold limits for the crawler. When these limits are exceeded, the crawler will enter a special mode called refresh (not to be confused with the now removed refresh mode called refresh). The refresh crawl mode will make the crawler only crawl URIs that previously has been crawled. Figure 5: Crawling Thresholds The following table describes the crawling thresholds to be set Table 13: Crawling Threshold Options 54 Configuring the Enterprise Crawler Option Disk free percentage Description This option allows you to specify, as a percentage, the amount of free disk space that must be available for the crawler to operate in normal crawl mode. If the disk free percentage drops below this limit, the crawler enters the refresh crawl mode. 
While in the refresh crawl mode only documents previously seen will be re-crawled; no new documents will be downloaded. Default: 0% (0 == disabled) Disk free percentage slack This option allows you to specify, as a percentage, a slack to the disk free threshold defined by the Disk free percentage. By setting this option, you create a buffer zone above the disk free threshold. While the current free disk space remains in this zone, the crawler will not change the crawl mode back to normal. This prevents the crawler from switching back and forth between the crawl modes when the percentage of free disk space is close to the value specified by the Disk free percentage option. When the available disk space percentage rises above disk_free+disk_free_slack, the crawler will change back to normal crawl mode. Default: 3% Maximum documents This option allows you to specify, as a number of documents, the number of stored documents in the collection that will trigger the crawler to enter the refresh crawl mode. While in the refresh crawl mode only documents previously seen will be re-crawled; no new documents will be downloaded. Default: 0 documents (0 == disabled) Note: The threshold specified is not an exact limit, as the statistics reporting is somewhat delayed compared to the crawling. Note: This option should not be confused with the Max document count per site option. Maximum documents slack This option allows you to specify the number of documents which should act as a buffer zone between normal mode and refresh mode. The option is related to the Maximum documents setting. Whenever the refresh mode is activated because the number of documents has exceeded the maximum, a buffer zone is created between Maximum documents and Maximum documents minus Maximum documents slack. The crawler will not change back to normal mode while within the buffer zone. This prevents the crawler from switching back and forth between the crawl modes when the number of documents is close to the Maximum documents value. Default: 1000 documents
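The following minimal Python sketch illustrates the buffer zone (hysteresis) behavior described for the Disk free percentage and Disk free percentage slack options above; the Maximum documents and Maximum documents slack pair works the same way, counting documents instead of free disk space. It is an illustration only, not crawler source code; the function name and the mode strings are invented here.

def next_crawl_mode(current_mode, free_pct, disk_free, disk_free_slack):
    # Enter refresh mode when free disk space drops below the threshold.
    if free_pct < disk_free:
        return "refresh"
    # Stay in refresh mode while still inside the buffer zone above the threshold.
    if current_mode == "refresh" and free_pct <= disk_free + disk_free_slack:
        return "refresh"
    return "normal"

# Example with Disk free percentage = 10 and Disk free percentage slack = 3:
print(next_crawl_mode("normal", 9, 10, 3))    # refresh (below the threshold)
print(next_crawl_mode("refresh", 12, 10, 3))  # refresh (inside the buffer zone)
print(next_crawl_mode("refresh", 14, 10, 3))  # normal (risen above disk_free + slack)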
Duplicate Server This section of the Advanced Data Sources screen allows you to configure the Duplicate Server settings. Table 14: Duplicate Server Options Database format Specify the storage format to use for the duplicate server databases. Available formats are: Gigabase DB, Memory hash, Disk hash. Database Cachesize Specify the size of the cache of the duplicate server databases. If the database format is a hash format, the cache size specifies the initial size of the hash. Database stripe size Specify the number of stripes to use for the duplicate server databases. Nightly compaction? Specify whether nightly compaction should be enabled for the duplicate server databases. Note: If no duplicate server settings are specified, the defaults, or the values given on the duplicate server command line, are used. Feeding Destinations This table describes the options available for custom document feeding destinations. It is possible to submit documents to a collection with another name, to multiple collections, or even to another ESP installation. If no destinations are specified, the default is to feed into a collection with the same name in the current ESP installation. Table 15: Feeding Destination Options Name This parameter specifies a unique name that must be given for the feeding destination you are configuring. The name can later be used in order to specify a destination for refeeds. This field is required. Target collection This parameter specifies the ESP collection name to feed documents into. Normally this is the same as the collection name, unless you wish to feed into another collection. Ensure that the collection already exists on the ESP installation designated by Destination first. Each feeding destination you specify maps to a single collection, thus to feed the same crawl into multiple collections you need to specify multiple feeding destinations. It is also possible for multiple crawler collections to feed into the same target collection. This field is required. Destination This parameter specifies an ESP installation to feed to. The available ESP destinations are listed in the feeding section of the crawler's global configuration file, normally $FASTSEARCH/etc/CrawlerGlobalDefaults.xml. The XML file contains a list of named destinations, each with a list of content distributors. If no destinations are explicitly listed in the XML file you may specify "default" here, and the crawler will feed into the current ESP installation. The current ESP installation is the one specified by $FASTSEARCH/etc/contentdistributor.cfg. This field is required; it may be "default" unless the global XML file has been altered. Pause ESP feeding This option specifies whether or not the crawler should pause document feeding to FAST ESP. When paused, the feed will be written to stable storage on a queue. Note that the value of this setting can be changed via the crawleradmin tool options --suspendfeed/--resumefeed. Default: no Primary This parameter controls whether this feeding destination is considered a primary or secondary destination. Only the primary destination is allowed to act on callback information from the document feeding chain; secondary feeders are only permitted to log callbacks. Exactly one feeding destination must be specified as primary. This field is required. Focused Crawl This table describes the options to configure language focused crawling. Table 16: Focused Crawl Options Languages This option allows you to specify a list of languages that documents must match to be stored and indexed by FAST ESP. The crawler will only follow non-focused documents to a maximum depth set by the Focus depth option. Languages should be specified either as a two letter ISO-639-1 code, or the single word equivalent. Examples: english, en, german, de. Focus depth This option allows you to specify how many levels the crawler should follow links from URIs not matching the specified language of the crawl. For example, if you are doing an English only crawl, with a focus depth of 2, the URI chain would look like this (focus depth in parentheses, "-" means no depth assigned): English(-) -> French(2) -> French(1) -> English(1) -> English(1) -> German(0) In the example above, the crawler will not follow links from the last URI in the chain as the specified depth has been reached. Hostname exclude filters Use this parameter to specify certain domains where the language focus should not apply. For example, if performing a Spanish crawl it is possible to exclude the top level domain .es from the language focus checks, thereby crawling all of .es regardless of the language on individual pages. The format is the same as the Hostname exclude filters in the basic collection options. Form Based Login The crawler can crawl sites that rely on HTTP cookie authentication for access control of web pages.
Configuring the crawler to perform cookie authentication does, however, require a fair bit of insight into the details of how the authentication scheme works, and may take some trial and error to get correct. Studying the HTML or JavaScript source of the login page and HTTP protocol traces of a browser login session can be very helpful. Tools that perform such tasks are freely available, including the packet sniffer Ethereal (http://www.ethereal.com/). Note: When secure transport (HTTPS) is used, packet sniffing in general cannot be used, and some type of application level debugging tool must be used instead. We recommend the LiveHTTPHeaders utility (http://livehttpheaders.mozdev.org/) for the Mozilla browser. Note: Login Specification does not allow empty values. If you need to crawl cookie authenticated sites with empty values, contact FAST Technical Support for detailed instructions. Table 17: Form Based Login Options Name Required: Specify a unique name for the login specification you are configuring. Preload Optional: URI to fetch (in order to receive a cookie) before proceeding to the authentication form. May or may not be necessary, depending on how the authentication for that site works. HTML form Optional: URI to the HTML page containing the login form. Used by the Autofill option. If not specified, the crawler will assume the HTML page is specified by the Form action option. Form scheme Optional: Type of scheme used for login. Valid values: http, https Default: http Form site The hostname of the login form URI. Form action The path/file of the login form URI. Form method The HTTP action of the form. Valid values: GET, POST Default: GET Autofill Whether the crawler should download the HTML page, parse it, identify which form you are trying to log into by matching parameter names, and merge it with any form parameters you may have specified in the Form parameters option. Re-login if failed? Whether the crawler, after a failed login, should attempt to re-login to the web site after TTL seconds. During this time, the web site will be kept active within the crawler, thus occupying one available site resource. Form parameters The credentials, as a sequence of key/value parameters, that the form requires for a successful log on. These are typically different from form to form, and must be deduced by looking at the HTML source of the form. In general, only user-visible (via the browser) variables need be specified, e.g. username and password, or equivalent. The crawler will fetch the login form and read any "hidden" variables that must be sent when the form is submitted. If a variable and value are specified in the parameters section, this will override any value read from the form by the crawler. Login sites List of sites (i.e. hostnames) that should log into this form before being crawled. TTL Number of seconds before the crawler should re-authenticate itself. HTTP Proxies This topic specifies one or more proxy addresses to use for all HTTP/HTTPS communication. Table 18: HTTP Proxy Options Name Name of the proxy. Host Hostname of the proxy. Port Port number of the proxy. Default port: 3128 User Username. Password Password. Registered HTTP Proxies List of registered HTTP proxy names. Link Extraction This topic describes the advanced options available that apply to Link Extraction.
These options allow you to specify which HTML tags to extract links from, including whether or not to extract links from within comments or JavaScript code (applies only when the proper JavaScript support is turned off). The following display shows the default values for the various Link Extraction parameters. Figure 6: Link Extraction Options Logging This section describes the advanced options available that apply to Logging. The different logs can be enabled or disabled by selecting text or none respectively. Table 19: Logging Options 59 FAST Enterprise Crawler Option Document fetch log Description This log contains detailed information about every retrieved document. It contains status on whether the retrieval was a success, or if not, what went wrong. It will also tell you if the document was excluded after being downloaded, for instance if it was not of the correct document type. Inspecting this log is very useful if you suspect that your data should have been crawled but was not, or vice versa. It should be the first place to look after examining the crawler debugs for errors and warnings. Default location: $FASTSEARCH/var/log/crawler/fetch/<collection name>/<date>.log Default: text Site log The site log contains information about all sites being crawled in a collection, for instance when the crawler starts/stops crawling a site, as well as the time of refresh events. Examining this log can be useful when debugging site-wide issues, as this log is comparable to the fetch log only on a site basis. Default location: $FASTSEARCH/var/log/crawler/site/<collection name>/<date>.log Default: text Postprocess log This log contains a report of all documents, modifications or deletions sent to the FAST ESP indexing pipeline, and the outcome of these operations. Default location: $FASTSEARCH/var/log/crawler/PP/<collection name>/<date>.log Default: text Header log This log contains all HTTP headers send and received from the HTTP servers when documents are retrieved, and can be used for debugging purposes of your setup. This log is essential when debugging authentication related issues, but should be turned off for normal crawling. Default location for every web site crawled: $FASTSEARCH/var/log/crawler/header/<collection name>/<5 first chars of hostname>/<hostname>/<date>.log Default: none Screened log This log contains all URIs that are not attempted retrieved for any reason, including not falling within the scope of the configured include/exclude filters, robots.txt exclusion and so forth. This log is useful if you feel that content that should be crawled is not being crawled. As this is a very high volume log it should be turned off for normal crawling. Default location: $FASTSEARCH/var/log/crawler/screened/<collection name>/<date>.log Default: none Data Search feed log This log contains all URIs that have been submitted to document processing and their status. The log contain error messages reported by document processing stages and is the first place to look if a document is not in the index. Default location: $FASTSEARCH/var/log/crawler/dsfeed/<collection name>/<date>.log Default: text 60 Configuring the Enterprise Crawler Option Adaptive Scheduler log Description Logs adaptive rank score of documents, for debugging purposes only. 
Default location: $FASTSEARCH/var/log/crawler/scheduler/<collection name>/<date>.log Default: none POST Payload This section of the Advanced Data Sources screen allows you to configure POST payloads Table 20: POST Payload Options Option URI prefix Payload Description Specify a URI or URI prefix. Every URI that matches the URI or prefix will have the below associated Payload submitted to it using the HTTP POST method. A URI prefix must be indicated by the string prefix:, followed by the URI string to match. A URI alone will be used for an exact match. Specify the payload to be submitted by the HTTP POST method to every URI that matches the given URI prefix specified above. Postprocess It is recommended that you contact FAST Solution Services before changing any of the postprocess options. The default selections are shown in the following screen. The options with empty defaults are automatically adjusted by the crawler. Figure 7: Postprocess Options RSS This topic describes the parameters for RSS crawling. Note: Extensive testing is strongly recommended before production use, to insure that desired processing patterns are attained. Table 21: RSS Options 61 FAST Enterprise Crawler Option RSS start URIs Description This option allows you to specify a list of RSS start URIs for the collection to be configured. RSS documents (feeds) are treated a bit different than other documents by the crawler. First, RSS feeds typically contain links to articles and meta data which describes the articles. When the crawler parses these feeds, it will associate the metadata in the feeds with the articles they point to.This meta data will be sent to the processing pipeline together with the articles, and a RSS pipeline stage can be used to make this information searchable. Second, links found in RSS feeds will be tagged with a force flag. Thus, the crawler will crawl these links as soon as allowed (it will obey the collection's delay rate), and they will be crawled regardless if it they have been crawled already in this crawl cycle. Example: http://www.example.com/rss.xml Default: Not mandatory RSS start URI files This parameter requires you to specify a list of RSS start URI files for the collection to be configured. This option is not mandatory. The format of the files is one URI per line. Example: C:\MyDirectory\rss_starturis.txt (Windows) or /home/user/rss_starturis.txt (UNIX). Default: Not mandatory Discover new RSS feeds? This parameter allows you to specify if the crawler should attempt to find new RSS feeds. If this option is not set, only feeds specified in the RSS start URIs and/or the RSS start URIs files sections will be treated as feeds. Default: no Follow links from HTML? This option allows you to specify if the crawler should follow links from HTML documents, which is the normal crawler behavior. If this option is disabled, the crawler will only crawl one hop away from a feed. Disable this option if you only want to crawl feeds and documents referenced by feeds. Default: yes Ignore include/exclude rules? Use this option to specify if the crawler should crawl all documents referenced by feeds, regardless of being valid according to the collection's include/exclude rules. Default: no Index RSS feeds? This parameter allows you to specify if the crawler should send the RSS feed documents to the processing pipeline. Regardless of this option, meta data from RSS feeds will be sent to the processing pipeline together with the articles they link to. 
Default: no Max age for links in feeds This parameter allows you to specify the maximum age (in minutes) for a link in an RSS document. Expired links will be deleted if the 'Delete expired' option is enabled. 0 disables this option. Default: 0 (disabled) Max articles per feed This parameter allows you to specify the maximum number of links the crawler will remember for a feed. The list of links found in a feed will be treated in a FIFO manner. When links get pushed out of the list, they will be deleted if the 'Delete expired' option is set. 0 disables this option. Default: 128 62 Configuring the Enterprise Crawler Option Delete expired articles? Description This option allows you to specify if the crawler should delete articles when they expire. An article (link) will expire when it is affected by either 'Max articles per feed' or 'Max age for links in feeds'. Default: no Storage It is recommended that you contact FAST Solution Services before changing any of the Storage options. The default selections are shown in the following screen. The options with empty defaults are automatically adjusted by the crawler. Figure 8: Storage Options Sub Collections This topic describes how to define and configure Sub Collections in the crawler. Sub Collections is a mechanism that allows subsets of a collection to be specified differently in the crawler. An example is if a collection spans across several sites, and one wish to crawl a particular site or set of sites to be crawled more aggressively. In such a case, one can define a Sub Collection that includes this site and set a different request rate on that Sub Collection. Sub Collections should be considered as a separate work queue that is treated differently than the main collection queue. Note that Sub Collections can span several sites, or a particular subset of a site. The Sub Collection Hostname include/exclude filters and URI include/exclude filters determine what will be included in a Sub Collection; the filters have the same semantics found in the Data Source Basic Options and Data Source Advanced Options respectively. Note that whatever does not fall within a Sub Collection automatically falls within the main collection. Also note that what falls within a Sub Collection cannot be excluded in the main collection; it must be a subset. Sub Collections must be given their own start URI or start URI file. The options that are set for Sub Collections will contain the same semantics as those in the main collection; Sub Collection settings override main collection settings. One or more of the following settings are mandatory: • • • Hostname/URI include/exclude filters Start URI files/Start URIs Name The remaining settings are optional. 63 FAST Enterprise Crawler Figure 9: Sub Collection Basic Options Figure 10: Sub Collection Crawl Mode Options Figure 11: Sub Collection RSS Options 64 Configuring the Enterprise Crawler Figure 12: Sub Collection Advanced Options Creating a new Sub Collection Fill in the proper values in the fields for the Sub Collection. If values are already filled in, click New to get a blank template. Fill in the mandatory values, and click Add. Note: If a different Sub Collection has been viewed earlier, some options may not change. Make sure all options are correct before selecting Add. Modifying an existing Sub Collection Select the Sub Collection you wish to add in the Installed items select box, then click View. Modify the applicable settings. Before saving, select the same Sub Collection in the Installed Sub Collections box. 
Click Delete. Click Add. Removing an existing Sub Collection Select the Sub Collection you wish to remove in the Installed Sub Collections box, then click Delete. Work Queue Priority This topic describes the work queue priority parameter, which allows you to specify how many priority levels you want the work queue to consist of, as well as various rules and methods for how to insert and extract entries from the work queue. Note: Extensive testing is strongly recommended before production use, to ensure that desired processing patterns are attained. Table 22: Work Queue Priority Options Workqueue levels This option allows you to specify the number of priority levels you want the crawler work queue to have. Note: If this value is ever decreased (e.g. from 3 to 1), the on-disk storage for the work queues must be deleted manually to recover the disk space. Default: 1 Default queue This option allows you to specify the default priority level for extracting and inserting URIs from/to the work queue. Default: 1 Start URI priority This option allows you to specify the priority level for URIs coming from the Start URIs and Start URI files options. Default: 1 Pop Scheme This option allows you to specify which method you want the crawler to use when extracting URIs from the work queue. Valid values: rr - extract URIs from the priority levels in a round-robin fashion. wrr - extract URIs from the priority levels in a weighted round-robin fashion. The weights are based on the respective Share setting per priority level. URIs are extracted from the queue with the highest remaining share value; when all shares are 0, the shares are reset to their original settings. pri - extract URIs from the priority levels in priority order, always extracting from the highest priority level that still has entries available (1 being the highest). default - same as wrr. When using multiple work queue levels it is recommended to use either the wrr or pri pop scheme. Default: default Put Scheme This option allows you to specify which method you want the crawler to use when inserting URIs into the work queue. Valid values: default - always insert URIs with the default priority level. include - insert URIs with the priority level defined by the includes specified for every priority level. If no includes match, the default priority level will be used. Default: default Queue - Hostname include filters / Queue - URI include filters These options allow you to specify a set of include rules for each priority level, to be used when utilizing the include Put scheme of inserting entries into the queue. Queue - Share This option allows you to specify a share or weight for each queue, to be used when utilizing the wrr Pop scheme of extracting entries from the work queue.
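The following minimal Python sketch illustrates the wrr pop scheme described above: entries are popped from the priority level with the highest remaining share, and all shares are reset to their configured values once they have been spent. It is an illustration only, not crawler source code; the function name, the data structures and the handling of empty levels are assumptions.

def wrr_pop(queues, shares, configured_shares):
    # queues: dict mapping priority level -> list of URIs
    # shares: dict mapping priority level -> remaining share for this round
    for _ in range(2):                                   # allow at most one reset per call
        candidates = [lvl for lvl, q in queues.items() if q and shares.get(lvl, 0) > 0]
        if candidates:
            level = max(candidates, key=lambda lvl: shares[lvl])
            shares[level] -= 1                           # spend one share on this level
            return queues[level].pop(0)
        shares.update(configured_shares)                 # all shares spent; reset to original settings
    return None                                          # every queue is empty

# Example: level 1 configured with share 2, level 2 with share 1.
queues = {1: ["http://example.com/a", "http://example.com/b"], 2: ["http://example.org/c"]}
shares = {1: 2, 2: 1}
print(wrr_pop(queues, shares, {1: 2, 2: 1}))             # pops from level 1 (highest share)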
Configuration via XML Configuration Files The crawler may be configured using an XML based file format. This format allows you to manage files in a text based environment to create and manage multiple collections, as well as automate configuration changes. Furthermore, a few advanced features are only available in the XML format. Basic Collection Specific Options (XML) This section discusses the parameters available on a per collection basis should you decide to configure the crawler using an XML configuration file. To add or update a collection in the crawler from an XML file, use the following command: $FASTSEARCH/bin/crawleradmin -f <XML file path> Substitute <XML file path> with the full path to the XML file. Note: Sections that are not present in the submitted XML file are left unchanged. For example, if you want to delete the existing include_uris section, you should not simply remove that section from the XML file; instead, add an empty include_uris section to the XML file before importing the changes. This behavior allows partial configurations to be submitted in order to change a specific option while keeping the remaining configuration intact. Table 23: XML Configuration File Parameters info Collection information. This parameter specifies a string that can contain general-purpose information. <attrib name="info" type="string"> Test crawl for .no domains on W2k </attrib> fetch_timeout URI fetch timeout in seconds. The maximum period of time allowed for downloading a document. Set this value high if you expect to download large documents from slow servers. Default: 300 <attrib name="fetch_timeout" type="integer"> 300 </attrib> allowed_types Allowed document MIME types. Only download documents of the indicated MIME type(s). The MIME types specified here are included in the Accept header of each GET request that is sent. Note that some servers can return incorrect MIME types. Note that the format supports wildcard expansion of an entire field only, for example, */example, text/* or */*, but not appl/ms*. No other regular expression is supported. <attrib name="allowed_types" type="list-string"> <member> text/html </member> <member> application/msword </member> </attrib> force_mimetype_detection Force MIME type detection on documents. This option allows you to decide whether or not the crawler should run its own MIME type detection on documents. In most cases web servers return the MIME type of documents when they are downloaded, as part of the HTTP header. If this option is enabled, documents will get tagged with the MIME type that looks most accurate; either the one received from the web server or the result of the crawler's own determination. Default: no (disabled) <attrib name="force_mimetype_detection" type="boolean"> no </attrib> allowed_schemes Allowed schemes. Specify which URI schemes to allow. Valid schemes are HTTP, HTTPS and FTP, and the multimedia formats MMS and RTSP. Note that MMS and RTSP for multimedia crawls are supported via the MM proxy. <attrib name="allowed_schemes" type="list-string"> <member> http </member> <member> https </member> <member> ftp </member> </attrib> ftp_acct FTP accounts. Specify FTP accounts for crawling FTP URIs. If no site match is found here, the default is used. Note that changing this value may result in previously accessible content being (eventually) deleted from the index. Default: anonymous <section name="ftp_acct"> <attrib name="ftp.mysite.com" type="string"> user:pass </attrib> </section> ftp_passive FTP passive mode. Use FTP passive mode for retrieval from FTP sites. Default: yes <attrib name="ftp_passive" type="boolean"> yes </attrib> domain_clustering Route hosts from the same domain to the same slave.
If enabled in a multiple node configuration, sites from the same domain (for example, www.example.com and forums.example.com) will also be routed to the same master node. Default: no (disabled) for single node and yes (enabled) for multiple node <attrib name="domain_clustering" type="boolean"> yes </attrib> max_inter_docs Maximum number of docs before interleaving site. The crawler will by default crawl a site to exhaustion, or until the maximum number of documents per site is reached. However, the crawler can be configured to crawl "batches" of documents from sites at a time, thereby interleaving between sites. This parameter allows you to specify how many documents you want to be crawled from a server consecutively before the crawler interleaves and starts crawling other servers. The crawler will then return to crawling the former server as resources free up. Valid values: No value (empty) or positive integer Default: empty (disabled) Example: <attrib name="max_inter_docs" type="integer"> 3000 </attrib> max_redirects Maximum number of redirects to follow. This parameter allows you to specify the maximum number redirects that should be followed from an URI. For example, http://example.com/path redirecting to http://example.com/path2 will be counted as 1. Default: 10 <attrib name="max_redirects" type="integer"> 10 </attrib> near_duplicate_detection Enable near duplication detection algorithm. The near_duplicate_detection parameter is boolean, with values true or false, indicating whether or not to use the near duplicate detection scheme. The 69 FAST Enterprise Crawler Parameter Description near_duplicate_detection parameter can be used per domain (sub-domain). It is disabled (false) by default. Default: no <attrib name="near_duplicate_detection" type="boolean"> no </attrib> Refer to Configuring Near Duplicate Detection for more information. max_uri_recursion Screen for recursive patterns in new URIs. Use this parameter to check for repeating patterns in URIs, compared to their referrers, with repetitions beyond the specified being dropped. For example, http://www.example.com/wile linking to http://www.example.com/wile/wile is a repetition of 1 element. A value of 0 disables the test. Default: 5 <attrib name="max_uri_recursion" type="integer"> 5 </attrib> focused Language focused crawl (optional). Use this parameter to specify options to focus your crawl. languages: Use this parameter to specify a list of languages that documents must match to be stored and sent to FAST ESP. Documents that do not match the languages will follow a configured amount (depth) of levels before traversing stops.Those domains excluded from the language focused crawl are still eligible for the main crawl. Languages should be specified according to ISO-639-1. The depth and exclude_domains settings are used to limit the crawl: depth: Use this parameter to specify the number of levels to follow documents that do not match the language specification. exclude_domains: Use this parameter to exclude certain domains from which language focus should not apply. Format is the same as the exclude_domains option in the collection configuration. Note that domains will be crawled regardless of their language; they will be excluded from the language check, but not excluded from the crawl. 
<section name="focused"> <attrib name="depth" type="integer"> 3 </attrib> <section name="exclude_domains"> <attrib name="suffix" type="list-string"> <member> .tv </member> </attrib> </section> <attrib name="languages" type="list-string"> <member> norwegian </member> <member> no </member> <member> nb </member> <member> nn </member> <member> se </member> </attrib> </section> ftp_searchlinks 70 FTP search for links. Configuring the Enterprise Crawler Parameter Description Specify if you want the crawler to search the documents downloaded from FTP for links. Default: yes <attrib name="ftp_searchlinks" type="boolean"> yes </member> use_javascript Enable JavaScript support. Specify if you want to enable JavaScript support in the crawler. If enabled, the crawler will download, parse/execute and extract links from any external JavaScript. Note: JavaScript processing is resource intensive and should not be enabled for large crawls. Note: Processing JavaScript requires an available Browser Engine. For more information, please refer to the FAST ESP Browser Engine Guide. Default: no <attrib name="use_javascript" type="boolean"> no </attrib> javascript_keep_html Specify whether to submit the original HTML document, or the HTML resulting from the JavaScript parsing, to document processing for indexing. When parsing a HTML document the Browser Engine executes all inlined and external JavaScripts, and thereby all document.write() statements, and includes these in its HTML output. By default it is this resulting document that is indexed. However it is possible to use this feature to limit the Browser Engine to link extraction only. This option has no effect if JavaScript crawling is not enabled Default: no <attrib name="javascript_keep_html" type="boolean"> no </attrib> javascript_delay Specify the delay (in seconds) to be used when retrieving external JavaScripts referenced from a HTML document. The default (specified as an empty value) is the same as the normal crawl delay, but it may be useful to set it lower to speed up crawling of documents containing JavaScripts. This option has no effect if JavaScript crawling is not enabled Default: empty <attrib name="javascript_delay" type="real"> 60 </attrib> exclude_headers Exclude headers. 71 FAST Enterprise Crawler Parameter Description Specify which documents that you want to be excluded by identifying the document HTTP header fields. First specify the header name, then one or more regular expressions for the header value. <section name="exclude_headers"> <attrib name="Server" type="list-string"> <member> webserverexample1.* </member> <member> webserverexample2.* </member> </attrib> </section> exclude_exts Exclude extensions. Specify which documents you want to be excluded by identifying the document extensions. The extensions will be suffix string matched with the path of the URIs. <attrib name="exclude_exts" type="list-string"> <member> .gif </member> <member> .jpg </member> </attrib> use_http_1_1 Use HTTP/1.1. Specify whether the crawler should use HTTP/1.1 or not (HTTP/1.0). HTTP/1.1 is required for the crawler to accept compressed documents from the server (accept_compression) and enable ETag support (if_modified_since must be checked). Default: yes (to crawl using HTTP/1.1) <attrib name="use_http_1_1" type="boolean"> no </attrib> accept_compression Accept compression. Specify whether the crawler should use the Accept-Encoding header, thus accepting that the documents are compressed at the web server before returned to the crawler. 
Default: yes Only applicable if use_http_1_1 is enabled. <attrib name="accept_compression" type="boolean"> no </attrib>

dbswitch DB switch interval. Specify the number of cycles a document is allowed to complete before being deleted. When the DB interval is complete, the action taken on these deleted documents is determined by the dbswitch_delete parameter. Setting this value very low, such as to 1, can result in documents being deleted very suddenly. This parameter is not affected by force re-fetches; only cycles where the crawler refreshes normally (by itself) increase the document's cycle number count. <attrib name="dbswitch" type="integer"> 5 </attrib>

dbswitch_delete DB switch delete. The crawler will at regular intervals perform an update of its internal database of retrieved documents, to detect documents that may be removed from the web servers. This option determines what to do with these remaining documents; they can either be deleted right away or put in the work queue for retrieval to make certain they are actually removed. A dbswitch check occurs at the start of a refresh cycle, independently for each site. If set to yes, then documents found to be too old are deleted immediately. If set to no, then documents are scheduled for a re-retrieval and only deleted if they no longer exist on the server. Default: no <attrib name="dbswitch_delete" type="boolean"> yes </attrib>

html_redir_is_redir Treat META refresh HTTP tag contents as an HTTP redirect. Use this parameter in conjunction with html_redir_thresh to allow the crawler to treat META refresh tags inside HTML documents as if they were true HTTP redirects. When enabled, the document containing the META refresh will not itself be indexed. Default: yes <attrib name="html_redir_is_redir" type="boolean"> yes </attrib>

html_redir_thresh Upper bound for META refresh tag delay. Use this parameter in conjunction with html_redir_is_redir to specify the number of seconds delay (threshold) allowed for the tag to be considered a redirect. Anything less than this number is treated as a redirect; other values are treated as a link (and the document itself is also indexed). Default: 3 <attrib name="html_redir_thresh" type="integer"> 3 </attrib>

robots_ttl Robots time to live. Specifies how often (in seconds) the crawler will re-download the robots.txt file from sites, in order to check if it has changed. Note that the robots.txt file may be retrieved less often if the site is not crawling continuously. Default: 86400 (24 hours) <attrib name="robots_ttl" type="integer"> 86400 </attrib>

enable_flash Extract links from flash files. If enabled, extract links from Adobe Flash (.swf) files. You may also want to enable JavaScript support, as many web servers only provide Flash content to clients that support JavaScript. Note: Flash processing is resource intensive and should not be enabled for large crawls. Note: Processing Adobe Flash files requires an available Browser Engine. For more information, please refer to the FAST ESP Browser Engine Guide. Default: no <attrib name="enable_flash" type="boolean"> no </attrib>

use_sitemaps Extract links and metadata from sitemap files. Enabling this option allows the crawler to detect and parse sitemaps. The crawler supports sitemap and sitemap index files as defined by the specification at http://www.sitemaps.org/protocol.php.
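For reference, a typical entry in a sitemap following that specification looks like the one below (the URL and values are illustrative only, and are not part of the crawler configuration):

<url>
  <loc>http://www.example.com/news/index.html</loc>
  <lastmod>2009-11-30</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.8</priority>
</url>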
The crawler uses the 'lastmod' attribute in a sitemap to see if a page has been modified since the last time the sitemap was retrieved. Pages that have not been modified will not be recrawled. An exception to this is if the collection uses adaptive refresh mode. In adaptive refresh mode the crawler will use the 'priority' and 'changefreq' attributes of a sitemap in order to determine how often a page should be crawled. For more information see Adaptive Parameters on page 91. Custom tags found in sitemaps are stored in the crawler's meta database and can be submitted to document processing. Note: Most sitemaps are specified in robots.txt. Thus, 'robots' should be enabled in order to get the best result. Default: no <attrib name="use_sitemaps" type="boolean"> no </attrib>

max_reflinks Maximum referrer links. Specify the maximum number of referrer links to store per URI (redirects excluded). Note: This value can have a major impact on crawler performance. Default: 0 <attrib name="max_reflinks" type="integer"> 0 </attrib>

max_pending Maximum number of concurrent requests per site. Specify the maximum number of concurrent (outstanding) HTTP requests to a site at any given time. Default: 2 <attrib name="max_pending" type="integer"> 8 </attrib>

robots_auth_ignore Ignore robots.txt authentication errors. Specify whether or not the crawler should ignore robots.txt if an HTTP 40x authentication error is returned by the server. If disabled, the crawler will not crawl the site in question at this time. This option allows you to control whether sites returning 401/403 Authorization Required for their robots.txt should be excluded from the crawl. The robots standard lists this behavior as a hint for the spider to ignore the site altogether. However, incorrect configuration of web servers is widespread and can lead to a site being erroneously excluded from the crawl. Enabling this option makes the crawler ignore such indications and crawl the site anyway. Default: yes <attrib name="robots_auth_ignore" type="boolean"> yes </attrib>

robots_tout_ignore Ignore robots.txt timeout. Specify whether or not the crawler should ignore the robots.txt rules if the request for this file times out. Before crawling a site, the crawler will request the robots.txt file from the server, according to the rules for limiting what areas of a site may be crawled. According to these rules, if the request for this file times out the entire site should be considered off-limits to the crawler. Setting this parameter to yes indicates that the robots.txt rules should be ignored, and the site crawled. Keep this option set to no if you do not control the site being crawled. Default: no <attrib name="robots_tout_ignore" type="boolean"> no </attrib>

rewrite_rules Rewrite rules. Specify a number of rewrite rules that rewrite certain URIs. Typical usage is to rewrite URIs with session-ids by removing the session-id part. Sed-type format. The separator character is the first one encountered, in this example "@". <attrib name="rewrite_rules" type="list-string"> <member> <![CDATA[@(.*/servlet/.*[&?])r=.*?(&|$)(.*)@\1\3@ ]]> </member> <member> <![CDATA[@(.*);jsessionid=.*?(\?.*|$)@\1\2@ ]]> </member> </attrib>

extract_links_from_dupes Extract links from duplicates. Even though two documents have duplicate contents, they may have different links. Specify whether or not you want the crawler to extract links from duplicates.
If enabled, you may get duplicate links in the URI-queues. If duplicate documents contain duplicate links then you can disable this parameter. Default: no <attrib name="extract_links_from_dupes" type="boolean"> no </attrib>

use_meta_csum Include HTML META tag contents in checksum. Specify if you want the crawler to include the contents (values) of HTML META tags when generating the document checksum used for duplicate detection. Use this to find changes in the document META tags. Default: no <attrib name="use_meta_csum" type="boolean">no</attrib>

csum_cut_off Checksum cut-off limit. When crawling multimedia content through a multimedia proxy (schemes MMS or RTSP), use this setting to determine if a document has been modified. Rather than downloading an entire document, only the number of bytes specified in this setting will be transferred and the checksum calculated on that initial portion of the file. This saves bandwidth after the initial crawl cycle, and reduces the load on other system and network resources as well. Default: 0 (disabled) <attrib name="csum_cut_off" type="integer">0</attrib>

if_modified_since Send If-Modified-Since header. Specify if you want the crawler to send If-Modified-Since headers. Default: yes <attrib name="if_modified_since" type="boolean"> yes </attrib>

use_cookies Use cookies. Specify if you want the crawler to store/send cookies received in HTTP headers. This feature is automatically enabled for sites that use a login, but can also be turned on globally through this option. Default: no <attrib name="use_cookies" type="boolean"> no </attrib>

uri_search_mime Document MIME types to extract links from. This option specifies MIME types that should be searched for links. If not already listed in the default list, type in a MIME type you want to search for links. Note that wildcards on type and subtype are allowed. For instance, text/* or */html are valid. No other regular expression is supported. <attrib name="uri_search_mime" type="list-string"> <member> text/html </member> <member> text/plain </member> </attrib>

variable_delay Variable request rate. Specify time slots when you want to use a higher or lower request rate (delay) than the main setting. Time slots are specified as starting and ending points, and cannot overlap. Time slot start and endpoints are specified with day of week and time of day (optionally with minute resolution). You can also enter the value suspend in the delay field, which will suspend the crawler so that there is no crawling for the time span specified. <section name="variable_delay"> <!-- Crawl with delay 20 Wednesdays --> <attrib name="Wed:00-Wed:23" type="string">20 </attrib> <!-- Crawl with delay 2 during weekends --> <attrib name="Sat:08.00-Sun:20.30" type="string">2</attrib> <!-- Don't crawl Mondays --> <attrib name="Mon:00-Mon:23" type="string">suspend</attrib> </section>

site_clusters Explicit site clustering. Specify if you want to override normal routing of sites and force certain sites to be on the same uberslave. This is useful when cookies/login is enabled, since cookies are global only within an uberslave. Also, if you know certain sites are closely interlinked, you can reduce internal communication by clustering them.
<section name="site_clusters"> <attrib name="mycluster" type="list-string"> <member> site1.example.com </member> <member> site2.example.com </member> <member> site3.example.com </member> </attrib> </section> refresh_mode workqueue_priority Refer to Refresh Mode Parameters on page 89 for option information. Refer to Work Queue Priority Rules on page 89 for option information. adaptive Refer to Adaptive Parameters on page 91 for option information. max_backoff_counter and max_backoff_delay Maximum connection error backoff counter. and Maximum connection error backoff delay. 77 FAST Enterprise Crawler Parameter Description Together these options control the adaptive algorithm by which a site experiencing connection failures (for example, network errors, timeouts, HTTP 503 "Server Unavailable" errors) are contacted less frequently. For each consecutive instance of these errors, the inter-request delay for that site is incremented by the initial delay setting (delay setting): Increasing delay = current delay + delay The maximum delay for a site will be the max_backoff_delay setting. If the number of failures reaches max_backoff_counter, crawling of the site will become idle. Should the network issues affecting the site be resolved, the internal backoff counter will start decreasing, with the inter-request delay lowered on each successful document fetch by half: Decreasing delay = current delay / 2 This continues until the original delay (delay setting) is reached. Default: <attrib name="max_backoff_counter" type="integer"> 50 </attrib> <attrib name="max_backoff_delay" type="integer"> 600 </attrib> http_errors ftp_errors Refer to HTTP Errors Parameters on page 93 for option information. FTP error handling. Specify how various response codes and error conditions are handled for FTP URIs. Same XML structure as the http_errors section. Logins storage delay Refer to Logins parameters on page 94 for option information. Refer to Storage parameters on page 96 for option information. Delay between document requests (request rate). This option specifies how often (the delay between each request) the crawler should access a single web site when crawling. <attrib name="delay" type="real"> 60.0 </attrib> Note: FAST license terms do not allow a more frequent request rate setting than 60 seconds for external sites unless an agreement exists between the customer and the external site. refresh Refresh interval. refresh_mode <attrib name="refresh" type="real"> 1440 </attrib> The crawler retrieves documents from web servers. Since documents on web servers frequently change, are added or removed, the crawler must periodically crawl a site over again to reflect this. In the default crawler configuration, this 78 Configuring the Enterprise Crawler Parameter Description refresh interval is one day (1440 minutes), meaning that the crawler will start over crawling a site every 24 hours. Since characteristics of web sites may differ, and customers may want to handle changes differently, the action performed at the time of refresh is also configurable, via the refresh_mode <attrib name="refresh" type="real"> 1440 </attrib> robots Respect robot directives. This parameter indicates whether or not to follow the directives in robots.txt files. <attrib name="robots" type="boolean"> yes </attrib> include_domains Sites included in crawl. This parameter is a set of rules of which a hostname must match at least one in order to be crawled. An empty section matches all domains. 
Note: This setting is a primary control over the pages included in the crawl (and index), and should not be changed without care. Valid rule types are:
prefix: Matches the given sitename prefix (for example, www matches www.example.net, but not download.example.net)
exact: Matches the exact sitename
file: Identifies a local (to the crawler host) file containing include and/or exclude rules for the configuration. Note that in a multiple node configuration, the file must be present on all crawler hosts, in the same location.
suffix: Matches the given sitename suffix (for example, com matches www.example.com)
regexp: Matches the given sitename against the specified regular expression (left to right).
IP mask: Matches IP addresses of sites against the specified dotted-quad or CIDR expression.
<section name="include_domains"> <attrib name="suffix" type="list-string"> <member> example.net </member> <member> example.com </member> </attrib> <attrib name="regexp" type="list-string"> <member> .*\.alltheweb\.com </member> </attrib> </section>

exclude_domains Sites excluded from crawl. This parameter is a set of rules of which a hostname must not match any rules in order to be crawled. An empty section matches no domains (allowing all to be crawled). Syntax is identical to the include_domains parameter with only the section name being different. Note: This setting is a primary control over the pages included in the crawl (and index), and should not be changed without care.

include_uris Included URIs. This parameter is a set of rules of which a URI must match at least one rule in order to be crawled. An empty section matches all URIs. Syntax is identical to the include_domains parameter with only the section name being different. Note: This setting is a primary control over the pages included in the crawl (and index), and should not be changed without care. Note: The semantics of URI and hostname inclusion rules have changed since ESP 5.0 (EC 6.3). In previous ESP releases these two rule types were evaluated as an AND, meaning that a URI had to match both rules (when rules are defined). As of ESP 5.1 and later (EC 6.6 and up), the rules processing has changed to an OR operator, meaning a URI now only needs to match one of the two rule types. For example, an older configuration for fetching pages with the prefix http://www.example.com/public would specify two rules:
• include_domains: www.example.com (exact)
• include_uris: http://www.example.com/public/ (prefix)
The first rule is no longer needed, and if not removed would allow any URI from that host to be fetched, not only those from the /public path. Some configurations may be much more complex than this simple example, and require careful adjustment in order to restrict URIs to the same limits as before. Refer to Contact Us on page iii for assistance in reviewing your configuration, if in doubt. Existing crawler configurations migrating from EC 6.3 must be manually updated, by removing or adjusting the hostname include rules that overlap with URI include rules.

exclude_uris Excluded URIs. This parameter is a set of rules of which a URI must not match any rules in order to be crawled. An empty section matches no URIs (allowing all to be crawled). Syntax is identical to include_domains with only the section name being different. Note: This setting is a primary control over the pages included in the crawl (and index), and should not be changed without care.

start_uris Start URIs for the collection.
This parameter is a list of start URIs for the specified collection. The crawler needs either start_uris or start_uri_files specified to start crawling. Note: If your crawl includes any IDNA domain names, you should enter them using UTF-8 characters, and not in the DNS encoded format. <attrib name="start_uris" type="list-string"> <member> http://www.example.com/ </member> <member> http://example.øl.no/ </member> </attrib>

start_uri_files Start URI files for the collection. This parameter is a list of start URI files for the specified collection. The file format is plain text with one URI per line. The crawler needs either start_uris or start_uri_files specified to start crawling. Note: In a multiple node configuration, the file must be available on all masters. <attrib name="start_uri_files" type="list-string"> <member> urifile.txt </member> <member> urifile2.txt </member> </attrib>

mirror_site_files Map file of primary/secondary servers for a site. This parameter is a list of mirror site files for the specified domain. The file format is a plain text, whitespace-separated list of sites, with the preferred (primary) name listed first. Note: In a multiple node configuration, the file must be available on all masters. <attrib name="mirror_site_files" type="list-string"> <member> mirror_mappings.txt </member> </attrib>

max_sites Maximum number of concurrent sites. This parameter limits the maximum number of sites that can be handled concurrently by this crawler node. This value applies per crawler node in a distributed setup. Note: This value can have a major impact on system resource usage. <attrib name="max_sites" type="integer"> 128 </attrib>

proxy Proxy address. This parameter specifies a proxy through which to redirect all HTTP communication. The proxy can be specified in the format: (http://)?(user:pass@)?hostname(:port)? Default port: 3128 <attrib name="proxy" type="list-string"> <member> proxy1.example.com:3128 </member> <member> proxyB.example.com:8080 </member> </attrib>

proxy_max_pending Proxy open connection limit. This parameter specifies a limit on the number of outstanding open connections per proxy, per uberslave in the configuration. <attrib name="proxy_max_pending" type="integer"> 8 </attrib>

passwd Refer to Password Parameters on page 97 for option information.

document_plugin Specify user-defined document/redirect processing program. Specify a user-written Python module to be used for processing fetched documents and (optionally) URI redirections. The value specifies the Python class module, ending with your class name. The crawler splits on the last '.' separator and converts this to the Python equivalent "from <module> import <class>". The source file should be located relative to $PYTHONPATH, which for FDS installations corresponds to ${FASTSEARCH}/lib/python2.3/. <attrib name="document_plugin" type="string"> tests.plugins.plugins.helloworld</attrib> Refer to Implementing a Crawler Document Plugin Module on page 118 for more information.

headers List of additional HTTP headers to send. List of additional headers to add to the request sent to the web servers. Typically this is used to specify a user-agent header. <attrib name="headers" type="list-string"> <member> User-agent: FAST Enterprise Crawler 6 </member> </attrib>

cut_off Maximum document size in bytes. This parameter limits the maximum size of documents.
Documents larger than the specified number of bytes will be truncated or discarded (refer to the truncate setting). Default: no cut-off <attrib name="cut_off" type="integer"> 100000000 </attrib>

truncate Truncate/discard docs exceeding cut-off. This parameter specifies the action taken when a document exceeds the specified cut-off threshold. A value of "yes" truncates the document at that size and a value of "no" discards the document entirely. Default: yes <attrib name="truncate" type="boolean"> yes </attrib>

diffcheck Duplicate screening. This parameter indicates whether or not duplicate screening should be performed. <attrib name="diffcheck" type="boolean"> yes </attrib>

check_meta_robots Inspect META robots directive. This parameter indicates whether or not to follow the directives given in the META robots tag (noindex or nofollow). <attrib name="check_meta_robots" type="boolean"> yes </attrib>

obey_robots_delay Respect robots.txt crawl-delay directive. This parameter indicates whether or not to follow the crawl-delay directive in robots.txt files. In a site's robots.txt file, the non-standard directive Crawl-delay: 120 may be specified, where the numerical value is the number of seconds to delay between page requests. If this setting is enabled, this value will override the collection-wide delay setting for this site. <attrib name="obey_robots_delay" type="boolean"> no </attrib>

key_file SSL key file. An SSL key file to use for HTTPS connections. Note: In a multiple node configuration, the file must be on all masters. <attrib name="key_file" type="string"> key.pem </attrib>

cert_file SSL cert file. An SSL certificate file to use for HTTPS connections. Note: In a multiple node configuration, the file must be on all masters. <attrib name="cert_file" type="string"> cert.pem </attrib>

max_doc Maximum number of documents. This parameter indicates the maximum number of documents to download from a web site. <attrib name="max_doc" type="integer"> 5000 </attrib>

enforce_delay_per_ip Limit requests per target IP address. Use this parameter to force the crawler to limit requests (per the delay setting) to web servers whose names map to a shared IP address. Default: yes <attrib name="enforce_delay_per_ip" type="boolean"> yes </attrib>

wqfilter Enable work queue Bloom filter. This parameter enables filtering that screens duplicate URI entries from the per-site work queues. Sizing of the filter is automatic. Default: yes <attrib name="wqfilter" type="boolean"> yes </attrib>

smfilter Slave/Master Bloom filter. This parameter enables a Bloom filter to screen URI links transferred between slaves and master. The value is the size of the filter, specifying the number of bits allocated, which should be between 10 and 20 times the number of URIs to be screened. It is recommended that you turn on this filter for large crawls; the recommended value is 50000000 (50 megabit). Default: 0 <attrib name="smfilter" type="integer"> 0 </attrib>

mufilter Master/Ubermaster Bloom filter. This parameter enables a Bloom filter to screen URI links transferred between masters and the ubermaster. The value is the size of the filter, specifying the number of bits allocated, which should be between 10 and 20 times the number of URIs to be screened. Note: Enabling this setting with a positive integer value disables the crosslinks cache.
It is recommended that you turn on this filter for large crawls; the recommended value is 500000000 (500 megabit). Default: 0 (disabled) <attrib name="mufilter" type="integer"> 0 </attrib>

umlogs Ubermaster log file consolidation. If enabled (as by default), all logging is sent to the ubermaster host for storage. In large multiple node configurations it can be disabled to reduce inter-node communications, reducing resource utilization, at the expense of having to check log files on individual masters. <attrib name="umlogs" type="boolean"> yes </attrib>

crawlmode Specify crawl mode. This parameter indicates the crawl mode that should be applied to a collection. Note: This setting is the primary control over the pages included in the crawl (and index) and should not be changed without care. The following settings exist:
mode: Specifies either FULL or DEPTH:# (where # is the maximum number of levels to crawl from the start URIs). Default: FULL
fwdlinks: Specifies whether or not to follow external links from servers. Default: yes
fwdredirects: Specifies whether or not to follow external redirects received from servers. Default: yes
reset_level: Specifies whether or not to reset the level counter when following external links. Doing so will result in a deeper crawl and you will generally want this set to "no" when doing a DEPTH crawl. Default: yes
<section name="crawlmode"> <attrib name="mode" type="string"> DEPTH:1 </attrib> <attrib name="fwdlinks" type="boolean"> yes </attrib> <attrib name="reset_level" type="boolean"> no </attrib> </section>

Master Crawler node inclusion. In a multiple node crawler setup, each instance of this parameter specifies a crawler node to include in the crawl. The following example specifies use of the crawler node named "crawler_node1": <Master name="crawler_node1"> </Master> It is possible to override "global" FAST ESP parameters for a crawler node by including "local" values of the parameters within the <Master> tag: <Master name="crawler_node1"> <attrib name="delay" type="integer">60 </attrib> </Master> It is possible to specify "local" values for all "global" collection parameters. A specific sub collection may be bound to a crawler node by including a <subdomain> tag within the <Master> tag: <Master name="crawler_node1"> <attrib name="subdomain" type="list-string"> <member> subdomain1 </member> </attrib> </Master> Note: Having no masters specified means that whatever masters are connected when the configuration is initially added will be used.

sort_query_params Sort query parameters. This parameter tells the crawler whether or not it should sort the query parameters in URIs. For example, http://example.com/?a=1&b=2 is really the same URI as http://example.com/?b=2&a=1. If this parameter is enabled, then the URIs will be rewritten to be the same. If not, the two URIs will most likely be screened as duplicates. The problem, however, arises if the two URIs are crawled at different times and the page has changed in between. In this case you can end up with both URIs in the index. Note: Changing this setting after an initial crawl has been done might also lead to duplicates. <attrib name="sort_query_params" type="boolean"> no </attrib>

post_payload POST payload. This parameter can be used to submit data as the payload to a POST request made to a URI matching the specified URI prefix or exact match.
To specify a URI prefix, use the label prefix:, then the leading portion of the URIs to match. A URI alone will be tested for an exact match. The payload value can be any data accepted by the target web server, but often URL encoding of variables is required. <section name="post_payload"> <attrib name="prefix:http://vault.example.com/secure" type="string"> variable1=value1&variableB=valueB </attrib> </section> Note: Use of this option should be tested carefully, with header logs enabled, to ensure the expected response from remote server[s].

pp Refer to PostProcess Parameters on page 98 for option information.

SubDomain Specifies a sub collection (subdomain) within the collection. Within a collection, you can specify sub collections with individual configuration options. The following options are valid within a sub collection: ftp_passive, allowed_schemes, include_domains, exclude_domains, include_uris, exclude_uris, refresh, refresh_mode, use_http_1_1, accept_compression, delay, crawlmode, cut_off, start_uris, start_uri_files, headers, use_javascript, use_sitemaps, max_doc, proxy, enable_flash, rss and variable_delay. One of either include_domains, exclude_domains, include_uris or exclude_uris must be specified; the others are optional. This is used for directing URIs/sites to the sub collection. The refresh parameter of a sub collection must be set lower than the refresh rate of the main domain. Note: The following options can only have a domain granularity: use_javascript, enable_flash and max_doc. <SubDomain name="rabbagast"> <section name="include_uris"> <attrib name="prefix" type="list-string"> <member> http://www.example.net/index </member> </attrib> </section> <attrib name="refresh" type="real"> 60.0 </attrib> <attrib name="delay" type="real"> 10.0 </attrib> <attrib name="start_uris" type="list-string"> <member> http://www.example.net/ </member> </attrib> </SubDomain>

log Refer to Log Parameters on page 100 for option information.

cachesize Refer to Cache Size Parameters on page 101 for option information.

link_extraction Refer to Link Extraction Parameters on page 102 for option information.

robots_timeout Use this parameter to specify the maximum amount of time in seconds you want to allow for downloading a robots.txt file. Before crawling a site, the crawler will attempt to retrieve a robots.txt file from the server that describes areas of the site that should not be crawled. Set this value high if you expect to have comparably slow interactions requesting robots.txt. Default: 300 <attrib name="robots_timeout" type="integer"> 300 </attrib>

login_timeout Use this parameter to specify the maximum amount of time in seconds you want to allow for login requests. Set this value high if you expect to have comparably slow interactions with login requests. Default: 300 <attrib name="login_timeout" type="integer"> 300 </attrib>

post_payload Use this parameter to specify a data payload that will be submitted by HTTP POST to all URIs matching the specified URI prefix. <section name="post_payload"> <attrib name="http://www.example.com/testsubmit.php" type="string"> randomdatahere </attrib> </section>

send_links_to This parameter allows one collection to send all extracted links to another crawler collection. This can, for instance, be useful when setting up RSS crawling. You can do RSS crawling with a high refresh rate in one collection, and make it pass new URIs to another collection which does normal crawling.
<attrib name="send_links_to" type="string"> collection_name </attrib> Crawling thresholds The option allows you to specify fail-safe limits for the crawler. When the limits are exceeded, the crawler enters a mode called 'refresh', which makes sure that only URIs that have been crawled previously will be crawled. The following table describes the crawler thresholds to be set Table 24: Crawler thresholds 87 FAST Enterprise Crawler Parameter disk_free Description This option allows you to specify, in percentage, the amount of free disk space that must be available for the crawler to operate in normal crawl mode. If the disk free percentage drops below this limit, the crawler enters the 'refresh' crawl mode. Default: <attrib name="disk_free" type="integer"> 0 </attrib> (0 == disabled) disk_free_slack This option allows you to specify, in percentage, a slack to the disk_free threshold. By setting this option, you create a buffer zone above the 'disk_free' threshold. When the current free disk space is in this zone, the crawler will not change the crawl mode back to normal. This prevents the crawler from switching back and forth between the crawl modes when the percentage of free disk space is close to the value specified by the 'disk_free' parameter. When the available disk space percentage rises above disk_free + disk_free_slack, the crawler will change back to normal crawl mode.. Default: <attrib name="disk_free_slack" type="integer"> 3 </attrib> max_doc This option allows you to specify, in number of documents, the number of stored documents that will trigger the crawler to enter the 'refresh' crawl mode. Note: the threshold specified is not an *exact* limit, as the statistics reporting is somewhat delayed compared to the crawling. Default: <attrib name="max_doc" type="integer"> 0 </attrib> (0 == disabled) Note: This option should not be confused with Max document count per site option. max_doc_slack This option allows you to specify the number of documents which should act as a buffer zone between normal mode and 'refresh' mode. The option is related to the 'max_doc' parameter. Whenever the 'refresh' mode is activated, because the number of documents has exceeded the 'max_doc' parameter, a buffer zone is created between the 'max_doc' and 'max_doc'-'max_doc_slack'. The crawler will not change back to normal mode within the buffer zone. This prevents the crawler from switching back and forth between the crawl modes when the number of docs is close to the 'max_doc' threshold value. Default: <attrib name="max_doc_slack" type="integer"> 1000 </attrib> Example: <section name="limits"> <attrib name="disk_free" type="integer"> 0 </attrib> <attrib name="disk_free_slack" type="integer"> 3 </attrib> <attrib name="max_doc" type="integer"> 0 </attrib> <attrib name="max_doc_slack" type="integer"> 1000 </attrib> </section> Note: This special refresh crawl mode can also be user initiated enabled with the crawleradmin tool. 88 Configuring the Enterprise Crawler Refresh Mode Parameters The refresh_mode allows you to specify the refresh mode of the collection. The following table describes the valid refresh modes Table 25: Refresh Mode Parameter Options Option append prepend scratch Description The Start URIs are added to the end of the crawler work queue at the start of every refresh. If there are URIs in the queue, Start URIs are appended and will not be crawled until those before them in the queue have been crawled. The Start URIs are added to the beginning of the crawler work queue at every refresh. 
However, URIs extracted from the documents downloaded from the Start URIs will still be appended at the end of the queue.

scratch The work queue is truncated at every refresh before the Start URIs are appended. This mode discards all outstanding work on each refresh event. It is useful when crawling sites with dynamic content that produce an infinite number of links. Default: <attrib name="refresh_mode" type="string"> scratch </attrib>

soft If the work queue is not empty at the end of a refresh period, the crawler will continue crawling into the next refresh period. A server will not be refreshed until the work queue is empty. This mode allows the crawler to ignore the refresh event for a site if it is not idle. This allows large sites to be crawled in conjunction with smaller sites, and the smaller sites can be refreshed more often than the larger sites.

adaptive Build the work queue according to scoring of URIs and limits set by the adaptive section parameters. The overall refresh period can be subdivided into multiple intervals, and high-scoring URIs re-fetched during each interval, to maintain content freshness while still completing deep sites.

refresh_when_idle This option allows you to specify whether the crawler should automatically trigger a new refresh cycle when the crawler goes idle (all websites are finished crawling) in the current refresh cycle. Default: <attrib name="refresh_when_idle" type="boolean"> no </attrib> Note: This option cannot be used with a multiple node crawler.

Work Queue Priority Rules The workqueue_priority section allows you to specify how many priority levels you want the work queue to consist of, and various rules and methods for how to insert and extract entries from the work queue. Note: Extensive testing is strongly recommended before production use, to ensure that desired processing patterns are attained. The following table describes the possible options: Table 26: Work Queue Priority Parameter Options

levels This option allows you to specify the number of priority levels you want the crawler work queue to have. Note: If this value is ever decreased (e.g. from 3 to 1), the on-disk storage for the work queues must be deleted manually to recover the disk space. Default: 1

default This option allows you to specify the default priority level for extracting and inserting URIs from/to the work queue. Default: 1

start_uri_pri This option allows you to specify the priority level for URIs coming from the start_uris/start_uri_files option. Default: 1

pop_scheme This option allows you to specify which method you want the crawler to use when extracting URIs from the work queue. Available values are:
rr - extract URIs from the priority levels in a round-robin fashion.
wrr - extract URIs from the priority levels in a weighted round-robin fashion. The weights are based on their respective share setting per priority level. Basically, URIs are extracted from the queue with the highest share value; when all shares are 0 the shares are reset to their original settings.
pri - extract URIs from the priority levels in a priority fashion by always extracting from the highest priority level if there still are entries available (1 being the highest).
default - same as wrr.
Default: default

put_scheme This option allows you to specify which method you want the crawler to use when inserting URIs into the work queue. Available values are:
default - always insert URIs with the default priority level.
include - insert URIs with the priority level defined by the includes specified for every priority level. If no includes match, the default priority level will be used.
Default: default

For each priority level specified, you can define:
share - this value allows you to specify a share or weight for each queue, to be used when utilizing the wrr (weighted round robin) method of extracting entries from the work queue.
include_domains, include_uris - these values allow you to specify a set of inclusion rules for each priority level, to be used when utilizing the include method of inserting entries into the queue.

Work Queue Priority Parameter Example <section name="workqueue_priority"> <!-- Define a work queue with 2 priority levels --> <attrib name="levels" type="integer"> 2 </attrib> <!-- Default priority level is 2. For this specific setting it means that a URI that doesn't match the specified includes for the queues will be inserted with priority level 2 --> <attrib name="default" type="integer"> 2 </attrib> <!-- Default priority level of start URIs is 1 --> <attrib name="start_uri_pri" type="integer"> 1 </attrib> <!-- Use weighted round robin for extracting from the queue according to the share specified per queue below --> <attrib name="pop_scheme" type="string"> wrr </attrib> <!-- Use include based insertion scheme according to the include rules specified for each queue below --> <attrib name="put_scheme" type="string"> include </attrib> <!-- Settings for the first priority level queue (1) --> <section name="1"> <!-- This queue's share/weight is 10 --> <attrib name="share" type="integer"> 10 </attrib> <!-- These include rules define the URIs that should enter the 1st priority level --> <section name="include_domains"> <attrib name="suffix" type="list-string"> <member> web005.example.net </member> <member> web006.example.net </member> </attrib> </section> </section> <!-- Settings for the second priority level queue (2) --> <section name="2"> <attrib name="share" type="integer"> 10 </attrib> <section name="include_domains"> <attrib name="suffix" type="list-string"> <member> web002.example.net </member> <member> web003.example.net </member> </attrib> </section> </section> </section>

Adaptive Parameters The adaptive section allows you to configure adaptive scheduling options. Note: This section is only applicable if refresh_mode is set to adaptive. Note: Extensive testing is strongly recommended before production use, to ensure that desired pages and sites are properly represented in the index. The following table describes the possible options: Table 27: Adaptive Parameter Options

refresh_count Number of minor cycles within the major cycle.

refresh_quota Ratio of existing URIs re-crawled to new (unseen) URIs, expressed as a percentage. High value = re-crawling old content; low value = prefer fresh content.

coverage_max_pct Limit percentage of site re-crawled within a minor cycle. Ensures small sites do not crawl fully each minor cycle, starving large sites.

coverage_min Minimum number of URIs from a site to be crawled in a minor cycle. Used to guarantee some coverage for small sites.

weights Each URI is scored against a set of rules to determine its crawl rank value. The crawl rank value is used to determine the importance of the particular URI, and hence the frequency at which it is re-crawled (from every minor cycle to only once every major cycle).
Each rule is assigned a weight to determine its contribution towards the total rank value. Higher weights produce a higher rank contribution. A weight of 0 disables a rule altogether. Adaptive Crawling Scoring Rules:
inverse_length: Based on the number of slashes (/) in the URI path. Max score for 1, no score for 10 or more. Default weight: 1.0
inverse_depth: Based on the number of link "hops" to this URI. Max score for none (for example, a start_uri), no score for 10 or more. Default weight: 1.0
is_landing_page: Bonus score if a "landing page", ending in / or index.html. Any page with query parameters gets no score. Default weight: 1.0
is_mime_markup: Bonus score if a "markup" page listed in the uri_search_mime attribute. Preference to more dynamic content (vs. PDF, Word, other static docs). Default weight: 1.0
change_history: Scored on the basis of the last-modified value over time (or an estimate). Default weight: 10.0
sitemap: Score based on the metadata found in sitemaps. The score is calculated by multiplying the value of the changefreq parameter with the priority parameter in a sitemap. Default weight: 10.0

sitemap_weights Sitemap entries may contain a changefreq attribute. This attribute gives a hint on how often a page is changed. The value of this attribute is a string. This string value is mapped to a float value in order for the adaptive scheduler to calculate an adaptive rank. This mapping can be changed by configuring the sitemap_weights section. Note that in addition to the defined values a default attribute is defined. Documents with no changefreq attribute are given the value of the default weight for priority. Sitemap Changefreq Weights:
always: Map the changefreq value 'always' to a numerical value. Default weight: 1.0
hourly: Map the changefreq value 'hourly' to a numerical value. Default weight: 0.64
daily: Map the changefreq value 'daily' to a numerical value. Default weight: 0.32
weekly: Map the changefreq value 'weekly' to a numerical value. Default weight: 0.16
monthly: Map the changefreq value 'monthly' to a numerical value. Default weight: 0.08
yearly: Map the changefreq value 'yearly' to a numerical value. Default weight: 0.04
never: Map the changefreq value 'never' to a numerical value. Default weight: 0.0
default: This value is assigned to all documents that have no changefreq attribute. Default weight: 0.16

HTTP Errors Parameters The http_errors section specifies how various response codes and error conditions are handled for HTTP(S) URIs. The following table describes the possible options: Table 28: HTTP Errors Parameter Options

4xx or 5xx Specify handling for all 40x or 50x HTTP response codes. Valid options for handling individual response codes are:
"KEEP" - keep the document (leave unchanged)
"DELETE[:X]" - delete the document if the error condition occurs for X retries. If no X value is specified, deletion happens immediately.
For both of these options "RETRY[:X]" can be specified, for which the crawler will try to download the document again X times in the same refresh period before giving up.
Note: If different behavior is desired for a specific value within one of these ranges, e.g. for HTTP status 503, it may be given its own handling specification.
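For instance, an override entry for status 503 within the http_errors section might look like the following sketch (the specific retry count is only an illustration, not a recommended value):

<!-- 503 Service Unavailable: keep the document, retry 3 times in this refresh period -->
<attrib name="503" type="string"> KEEP, RETRY:3 </attrib>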
ttl, net or int Specify handling for:
• HTTP connections that time out (ttl)
• network (socket) errors (net)
• internal errors (int)

HTTP Errors Parameter Example <section name="http_errors"> <!-- 408 HTTP status code: Request Timeout --> <attrib name="408" type="string"> KEEP </attrib> <!-- 40x HTTP return codes: delete immediately --> <attrib name="4xx" type="string"> DELETE </attrib> <!-- 50x HTTP return codes: delete after 10 failed fetches, --> <!-- retry 3 times immediately --> <attrib name="5xx" type="string"> DELETE:10, RETRY:3 </attrib> <!-- fetch timeout: delete after 3 failed fetches --> <attrib name="ttl" type="string"> DELETE:3 </attrib> <!-- network error: delete after 3 failed fetches --> <attrib name="net" type="string"> DELETE:3 </attrib> <!-- Internal handling error (header error, etc): --> <!-- Never delete --> <attrib name="int" type="string"> KEEP </attrib> </section>

Logins parameters The Logins section allows you to configure HTML form based authentication. You can specify multiple sections to handle different site logins, but each must have a unique name. The following table describes the options that may be set: Table 29: Logins Parameter Options

preload Specify the full URI of a page to be fetched before attempting login form processing. Some sites require the user to first get a cookie from some page before proceeding with authentication. Often the Start URI for the site is an appropriate choice for preload.

scheme Which scheme the form you are trying to log into is using, e.g. http or https.

site The hostname of the login form page.

form The path to the form you are trying to log into.

Note: The three previous values, scheme + site + form, make up the URI of the login form page.

action The action of the form (GET or POST).

parameters The credentials as a sequence of key, value parameters the form requires for a successful log on. These are typically different from form to form, and must be deduced by looking at the HTML source of the form. In general, if 'autofill' is enabled, only user-visible (via the browser) variables need be specified, e.g. username and password, or equivalent. The crawler will fetch the HTML page (specified in 'html_form') containing the login form and read any "hidden" variables that must be sent when the form is submitted. If a variable and value are specified in the parameters section, this will override any value read from the form by the crawler.

sites A list of sites that should log into this form before being crawled. Note that this is a list of hostnames, not URIs.

ttl Time before you have to log in to the form once again, before continuing the crawl.

html_form The URI of the HTML page containing the login form. Used by the 'autofill' option. If not specified, the crawler will assume the HTML page is specified by the 'form' option.

autofill Whether the crawler should download the HTML page, parse it, identify which form you are trying to log into by matching parameter names, and merge it with any form parameters you may have specified in the 'parameters' option. Default on.

relogin_if_failed Whether the crawler, after a failed login, should attempt to re-login to the web site after 'ttl' seconds. During this time, the web site will be kept active within the crawler, thus occupying one available site resource.
Logins Parameter Example <Login name="mytestlogin"> <!-- Go fetch this URI first, and gather needed cookies --> <attrib name="preload" type="string">http://preload.companyexample.com/</attrib> <!-- The following elements make up the login form URI and what to do --> <attrib name="scheme" type="string"> https </attrib> <attrib name="site" type="string"> login.companyexample.com </attrib> <attrib name="form" type="string"> /path/to/some/form.cgi </attrib> <attrib name="action" type="string">POST</attrib> <!-- Specify any necessary variables/values from the HTML login form --> <section name="parameters"> <attrib name="user" type="string"> username </attrib> <attrib name="password" type="string"> password </attrib> <attrib name="target" type="string"> sometarget </attrib> </section> <!-- Attempt this login before fetching from the following sites --> <attrib name="sites" type="list-string"> <member> site1.companyexample.com </member> <member> site2.companyexample.com </member> </attrib> <!-- Consider login valid for the following lifetime (in seconds) --> <attrib name="ttl" type="integer"> 7200 </attrib> <!-- The html page containing the login form --> <attrib name="html_form" type="string"> http://login.companyexample.com/login.html </attrib> <!-- Let the crawler download, parse, and fill in any missing parameters from the html form --> <attrib name="autofill" type="boolean"> yes </attrib> <!-- Attempt to re-login after 'ttl' seconds if login failed --> <attrib name="relogin_if_failed" type="boolean"> yes </attrib> </Login>

Storage parameters The Storage parameter allows you to specify storage related options. The following table describes the possible options. Note: These values cannot be changed after a collection has been defined. Table 30: Storage Parameter Options

datastore Refer to Datastore Section on page 105 for more information. Default: bstore

store_http_header This option specifies if the received HTTP header should be stored as document metadata. If enabled, the HTTP header will be included when the document is submitted to the ESP document processing pipeline. Default: yes

store_dupes This option allows you to preserve disk space on the crawler nodes by instructing the crawler not to store to disk documents that are detected as duplicates at runtime. Duplicates detected by PostProcess are stored to disk initially, but will be deleted later. Default: no

compress This option specifies if downloaded documents should be compressed before being stored on disk. If enabled, gzip compression will be performed. Default: yes

compress_exclude_mime MIME types of documents not compressed in storage. Note that compressing multimedia documents can waste resources, as these documents are often already compressed. Use this setting to selectively skip compression of documents based on their MIME type, thus saving resources both in the crawler (no unnecessary compression) and in the pipeline (no unnecessary decompression).

remove_docs This option allows you to preserve disk space on the crawler nodes by instructing the crawler to remove docs from disk after they have been processed by the document processor. Default: no Note: Enabling this option will make it impossible to refeed the crawler store at a later time (e.g. to take advantage of changes made to the document processing pipeline) since the crawled documents are no longer available on disk.

clusters This option specifies how many storage clusters to use for the collection.
Default: 8 Note: In general, this value should not be modified unless so directed; refer to Contact Us on page iii.

defrag_threshold A non-zero value specifies the threshold value, in terms of used capacity, before defragmentation is initiated for any given data storage file. When the available capacity drops below this level the file is compacted to reclaim fragmented space caused by previously stored documents. Database files are compacted regardless of fragmentation. Default: 85 The default of 85% means there must be 15% reclaimable space in the data storage file to trigger defragmentation of a particular file. Setting this value to 0 will disable the nightly database/data compression routines. Note: The data storage format flatfile does not become fragmented, and this option does not apply to that format.

uri_dir All URIs extracted from a document by an uberslave process may be stored in a separate file on disk. This option indicates in which directory to place the URI files. The name of a URI file is constructed by concatenating the slave process PID with '.txt'. Default: The default is not to generate these files (empty directory path).

Storage Parameter Example <section name="storage"> <attrib name="uri_dir" type="string"> test/URIS </attrib> <!-- Document store type (flatfile|bstore) --> <attrib name="datastore" type="string"> bstore</attrib> <!-- Compress Documents --> <attrib name="compress" type="boolean"> yes </attrib> <!-- Do not compress docs with the following MIME types --> <attrib name="compress_exclude_mime" type="list-string"> <member> video/* </member> <member> audio/* </member> </attrib> <!-- Store HTTP header information --> <attrib name="store_http_header" type="boolean"> yes </attrib> <!-- Store duplicates in storage --> <attrib name="store_dupes" type="boolean"> no </attrib> <!-- Support removal of documents from disk after processing --> <attrib name="remove_docs" type="boolean"> no </attrib> <!-- Defragment data store files with more than 100-n % fragmentation --> <attrib name="defrag_threshold" type="integer"> 85 </attrib> </section>

Password Parameters Password specification for sites/paths that require Basic Authentication. Note: Changing the passwd value may result in previously accessible content eventually being deleted from the index. Support includes basic, digest and NTLM v1 authentication. Note: AD/Kerberos and NTLM v2 are not supported. Credentials can be keyed on either Realm or URI. A valid URI can be used as the parameter value, in which case it serves as a prefix value, as all links extracted from the URI at its level or deeper will also utilize the authentication settings. It is also possible to specify passwords for Realms. When a 401 Unauthorized is encountered, the crawler attempts to locate a matching realm, and if one exists, the URI will be fetched again with the corresponding user/passwd set. As this requires two HTTP transactions for each document, it is inherently less efficient than specifying a URI prefix. The credentials format is: user:password:realm:scheme, though you can still use the basic format of: user:password. Scheme can be any of the supported authentication schemes (basic, digest, ntlm), or auto, in which case the crawler tries to pick one on its own.
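For a realm-keyed entry, a minimal sketch might look like the following; this assumes the realm string is given as the attribute name in the same way a URI prefix is, and the realm name and credentials shown are hypothetical:

<section name="passwd">
  <!-- Matched when a server returns 401 with realm "Intranet Docs" -->
  <attrib name="Intranet Docs" type="string"> alice:alicesecret:Intranet Docs:digest </attrib>
</section>

The URI prefix form is shown in the following example.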
<section name="passwd"> <!-- username: "bob" password: "bobsecret" realm: "mysite" authentication scheme: "auto" --> <attrib name="http://www.example.net/confidential1/" type="string"> bob:bobsecret:mysite:auto </attrib> <!-- Escaping characters in the password may be necessary --> <!-- username: "bob" password: "bob:secret\" realm: "myotherdomain" authentication scheme: "basic" --> <attrib name="http://www.example.net/confidential2/" type="string"> bob:bob\:secret\\:myotherdomain:basic </attrib> </section> Note: Cookie authentication requires a separate setup. Refer to the Logins Parameters for more information.

PostProcess Parameters The pp section configures PostProcess. The following table describes the possible options: Table 31: PostProcess Parameter Options

dupservers This option specifies which duplicate servers should be used in a multiple node crawl. The crawler will automatically perform load-balancing between multiple duplicate servers. Values should be specified as hostname:port, e.g. dup01:18000 Default: none Note: This setting cannot be modified for a collection, once set.

max_dupes This option specifies the maximum number of duplicates to record along with the original document. Default: 10 Note: This setting has a severe performance impact and values above 3-4 are not recommended for large scale crawls.

stripe This option specifies the PostProcess database stripe size; the number of files to spread the available data across. A value of 1 puts everything in a single file. Default: 1 Note: This setting cannot be modified for a collection, once set.

ds_meta_info This option identifies the meta info that PostProcess should report to document processing. Available types: duplicates, redirects, referrers, intra links, inter links. Default: none

ds_max_ecl This option specifies the max URI equivalence class length, that is, the maximum number of duplicates, redirects or referrers to report to document processing. Default: 10

ds_send_links Send extracted links from the document to FAST ESP for document processing. Default: no

ds_paused This option specifies whether or not PostProcess should pause feeding to FAST ESP. When paused, the feed will be written to stable storage. Note that the value of this setting can be changed via the crawleradmin tool options, --suspendfeed/--resumefeed. Default: no

ecl_override This option specifies a regular expression used to identify URIs that should go into the URI equivalence class, even though ds_max_ecl is reached. Example: .*index\.html$ Default: none

PostProcess Parameter Example <section name="pp"> <attrib name="dupservers" type="list-string"> <member> node5:13000 </member> <member> node6:13000 </member> </attrib> <!-- Limit recorded duplicates for large crawl --> <attrib name="max_dupes" type="integer"> 2 </attrib> <!-- Spread data across several files to scale --> <attrib name="stripe" type="integer"> 8 </attrib> <attrib name="ds_meta_info" type="list-string"> <member> duplicates </member> </attrib> <attrib name="ds_max_ecl" type="integer"> 10 </attrib> <attrib name="ds_send_links" type="boolean"> yes </attrib> <!-- Feeding enabled...can override using crawleradmin --> <attrib name="ds_paused" type="boolean"> no </attrib> <!-- URI Equivalence class override regexp --> <attrib name="ecl_override" type="string"> .*index\.html$ </attrib> </section>

Log Parameters The log section provides crawler logging options.
Use this section to enable or disable various logs. Note: The use of screened and header logs can be very useful during crawl setup and testing, but should generally be disabled for production crawls as they can use a lot of disk space. It is sometimes necessary to enable these when debugging specific issues. The following table describes the possible options Table 32: Log Parameter Options Option fetch Description Document log (collection wide) format. Logs all documents downloaded with time stamp and response code/error values. Values: text or none. Default: text postprocess Postprocess log (collection wide) format. Logs all documents output by postprocess with info. Values: text, xml or none. Default: text header HTTP Header exchanges log (stored per-site). Logs all header exchanges with web servers. Useful for debugging, but should not be enabled for production crawls. Values: text or none. Default: none screened URI allow/deny log (collection wide) format. Log URIs that are screened for various reasons. Useful for debugging. Values: text or none. Default: none scheduler Provides details of adaptive scheduling algorithm processing. Values: text or none. Default: none dsfeed ESP document processing and indexing feed log. Logs all URIs that PostProcess receives callbacks on with information on failure/success. Values: text, none. Default: text site Provide statistics for per-site crawl sessions. Default: text 100 Configuring the Enterprise Crawler Log Parameter Example <section name="log"> <!-- The first two normally enabled --> <attrib name="fetch" type="string"> text </attrib> <attrib name="postprocess" type="string"> text </attrib> <!-- These others enabled for debugging --> <attrib name="header" type="string"> text </attrib> <attrib name="screened" type="string"> text </attrib> <attrib name="site" type="string"> text </attrib> </section> Cache Size Parameters The cachesize section allows configuration of the crawler cache sizes. All cache sizes represent number of entries unless otherwise noted. The following table describes possible cache options: Table 33: Cache Size Parameter Options Option duplicates Description Duplicate checksum cache. Default: automatic screened URIs screened during crawling. Default: automatic smcomm Slave/Master comm. channel. Default: automatic mucomm Master/Ubermaster comm. channel. Default: automatic wqcache Site work queue cache. Default: automatic crosslinks Crosslinks cache (number of links). Default: automatic Note: Defaults for the previous parameters are auto generated based on the max_sites and delay parameters. routetab Routing table cache (in bytes). Default: 1 MB pp PostProcess database cache (in bytes). Default: 1 MB 101 FAST Enterprise Crawler Option Description pp_pending PostProcess pending (in bytes). Default: 128 KB aliases Aliases mapping database cache (in bytes). 
Default: 1 MB
Cache Size Parameter Example
<section name="cachesize"> <!-- Override automatic settings --> <attrib name="duplicates" type="integer"> 128 </attrib> <attrib name="screened" type="integer"> 128 </attrib> <attrib name="smcomm" type="integer"> 128 </attrib> <attrib name="mucomm" type="integer"> 128 </attrib> <attrib name="wqcache" type="integer"> 4096 </attrib> <!-- Increase for large-scale crawl --> <attrib name="crosslinks" type="integer"> 5242880 </attrib> <attrib name="routetab" type="integer"> 5242880 </attrib> <attrib name="pp" type="integer"> 5242880 </attrib> <attrib name="pp_pending" type="integer"> 1048576 </attrib> <attrib name="aliases" type="integer"> 5242880 </attrib> </section>
Link Extraction Parameters
Use the link_extraction section to tell the crawler which links it should follow. The following table lists possible options:
Table 34: Link Extraction Parameter Options
a URIs found in anchor tags. Default: yes
action URIs found in action tags. Example: <form action="http://someaction.com/?submit" method="get"> Default: yes
area URIs found in area tags (related to image maps). Example: <map name="mymap"> <area src="http://link.com"> </map> Default: yes
comment URIs found within comment tags. The crawler extracts links from comments by looking for http://. Example: <!-- this URI is commented away; http://old.link.com/ --> Default: yes
frame URIs found in frame tags. Example: <frame src="http://topframe.com/"> </frame> Default: yes
go URIs found in go tags. Note that go tags are a feature of the WML specification. Example: <go href="http://link.com/"> Default: yes
img URIs found in image tags. Example: <img src="picture.jpg"> Default: no
layer URIs found in layer tags. Example: <layer src="http://www.link.com/"></layer> Default: yes
link URIs found in link tags. Example: <link href="http://link.com/"> Default: yes
meta URIs found in META tags. Example: <meta name="mymetatag" content="http://link.com/"/> Default: yes
meta_refresh URIs found in META refresh tags. Example: <meta name="refresh" content="20;URL=http://link.com/"/> Default: yes
object URIs found in object tags. Example: <object data="picture.png"> Default: yes
script URIs found within script tags. Example: <script> variable = "http://somelink.com/" </script> Default: yes
script_java URIs found within script tags that are JavaScript styled. Example: <script type="javascript"> window.location="http://somelink.com"</script> Default: yes
style URIs found within style tags. Default: yes
embed Typically used to insert links to audio files. Default: yes
card A link type used to define a card in a WML deck.
Default: yes
Link Extraction Parameter Example
<section name="link_extraction"> <attrib name="a" type="boolean"> yes </attrib> <attrib name="action" type="boolean"> yes </attrib> <attrib name="area" type="boolean"> yes </attrib> <attrib name="comment" type="boolean"> no </attrib> <attrib name="frame" type="boolean"> yes </attrib> <attrib name="go" type="boolean"> yes </attrib> <attrib name="img" type="boolean"> no </attrib> <attrib name="layer" type="boolean"> yes </attrib> <attrib name="link" type="boolean"> yes </attrib> <attrib name="meta" type="boolean"> yes </attrib> <attrib name="meta_refresh" type="boolean"> yes </attrib> <attrib name="object" type="boolean"> yes </attrib> <attrib name="script" type="boolean"> yes </attrib> <attrib name="script_java" type="boolean"> yes </attrib> <attrib name="style" type="boolean"> yes </attrib> <attrib name="embed" type="boolean"> yes </attrib> <attrib name="card" type="boolean"> no </attrib> </section>
The ppdup Section
Use the ppdup section to specify the duplicate server settings. The following table lists possible options:
Table 35: Duplicate Server Options
format The duplicate server database format. Available formats are: gigabase, hashlog, diskhashlog.
cachesize The duplicate server database cache size. If the duplicate server database format is a hash type, the cache size specifies the initial size of the hash. Note: Specified in MB.
stripes The duplicate server database stripe size.
compact Specify whether to perform nightly compaction of the duplicate server databases.
Duplicate Server Settings Example
<!-- This option allows you to configure per collection duplicate server database settings. --> <section name="ppdup"> <!-- The format of the duplicate server dbs --> <attrib name="format" type="string"> hashlog </attrib> <!-- The # of stripes of the duplicate server dbs --> <attrib name="stripes" type="integer"> 1 </attrib> <!-- The cache size of the duplicate server dbs - in MB --> <attrib name="cachesize" type="integer"> 128 </attrib> <!-- Whether to run nightly compaction of the duplicate server dbs --> <attrib name="compact" type="boolean"> no </attrib> </section>
Datastore Section
The datastore section specifies which format to use for the document data store. The crawler normally stores a collection's documents in the directory: $FASTSEARCH/data/crawler/store/collectionName/data/. The following table describes possible options:
Table 36: Datastore Parameter Options
bstore For each storage cluster that is crawled, a directory is created using the cluster number. The directory contains: 0-N[.N] - BStore segment files. Documents are stored within numbered files starting with '0' going up ad infinitum. After each compaction, the file is appended with a generation identifier, e.g. 0.1 replaces 0, and 17.253 replaces 17.252. The older generation file is retained for up to 24 hours as a read-only resource for postprocess. 0-N[.N].idx - BStore segment index files. Contains the index of each of the corresponding BStore segment files. master_index - Contains information about all existing BStore segments. The crawler will schedule a defragmentation of the data store to ensure that stale segments are cleaned up on a daily basis.
flatfiles The files are stored using a base64-encoded representation of the filenames.
Storing documents in this manner is more metadata-intensive on the underlying file system, as each retrieved document is stored in a separate physical file, but it allows the crawler to delete old versions of documents when a new version is retrieved from the web server.
Feeding destinations
This table describes the options available for custom document feeding destinations. It is possible to submit documents to a collection with a different name, to multiple collections, or even to another ESP installation. If no destinations are specified the default is to feed into a collection of the same name in the current ESP installation.
Table 37: Feeding destinations
name This parameter specifies a unique name that must be given for the feeding destination you are configuring. The name can later be used in order to specify a destination for refeeds. This field is required.
collection This parameter specifies the ESP collection name to feed documents into. Normally this is the same as the collection name, unless you wish to feed into another collection. Ensure that the collection already exists on the ESP installation designated by destination first. Each feeding destination you specify maps to a single collection, thus to feed the same crawl into multiple collections you need to specify multiple feeding destinations. It is also possible for multiple crawler collections to feed into the same target collection. This field is required.
destination This parameter specifies an ESP installation to feed to. The available ESP destinations are listed in the feeding section of the crawler's global configuration file, normally $FASTSEARCH/etc/CrawlerGlobalDefaults.xml. The XML file contains a list of named destinations, each with a list of content distributors. If no destinations are explicitly listed in the XML file you may specify "default" here, and the crawler will feed into the current ESP installation, that is, the installation specified by $FASTSEARCH/etc/contentdistributor.cfg. This field is required; it may be "default" unless the global XML file has been altered.
paused This option specifies whether or not the crawler should pause document feeding to FAST ESP. When paused, the feed will be written to stable storage on a queue. Note that the value of this setting can be changed via the crawleradmin tool options, --suspendfeed/--resumefeed. Default: no
primary This parameter controls whether this feeding destination is considered a primary or secondary destination. Only the primary destination is allowed to act on callback information from the document feeding chain; secondary feeders are only permitted to log callbacks. Exactly one feeding destination must be specified as primary. This field is required.
Example:
<section name="feeding"> <section name="collA"> <attrib name="collection" type="string"> collA </attrib> <attrib name="destination" type="string"> default </attrib> <attrib name="primary" type="boolean"> yes </attrib> <attrib name="paused" type="boolean"> no </attrib> </section> <section name="collB"> <attrib name="collection" type="string"> collB </attrib> <attrib name="destination" type="string"> default </attrib> <attrib name="primary" type="boolean"> no </attrib> <attrib name="paused" type="boolean"> no </attrib> </section> </section>
RSS
This table describes the parameters for RSS crawling.
Note: Extensive testing is strongly recommended before production use, to ensure that desired processing patterns are attained.
Table 38: RSS Options
start_uris This parameter allows you to specify a list of RSS start URIs for the collection to be configured. RSS documents (feeds) are treated somewhat differently from other documents by the crawler. First, RSS feeds typically contain links to articles and meta data which describes the articles. When the crawler parses these feeds, it will associate the metadata in the feeds with the articles they point to. This meta data will be sent to the processing pipeline together with the articles, and an RSS pipeline stage can be used to make this information searchable. Second, links found in RSS feeds will be tagged with a force flag. Thus, the crawler will crawl these links as soon as allowed (it will obey the collection's delay rate), and they will be crawled regardless of whether they have already been crawled in this crawl cycle. Example: http://www.example.com/rss.xml Default: none (not mandatory)
start_uri_files This parameter allows you to specify a list of RSS start URI files for the collection to be configured. This option is not mandatory. The format of the files is one URI per line. Example: C:\MyDirectory\rss_starturis.txt (Windows) or /home/user/rss_starturis.txt (UNIX). Default: none (not mandatory)
auto_discover This parameter allows you to specify if the crawler should attempt to find new RSS feeds. If this option is not set, only feeds specified in the RSS start URIs and/or the RSS start URI files sections will be treated as feeds. Default: no
follow_links This parameter allows you to specify if the crawler should follow links from HTML documents, which is the normal crawler behavior. If this option is disabled, the crawler will only crawl one hop away from a feed. Disable this option if you only want to crawl feeds and documents referenced by feeds. Default: yes
ignore_rules Use this parameter to specify if the crawler should crawl all documents referenced by feeds, regardless of whether they are valid according to the collection's include/exclude rules. Default: no
index_feed This parameter allows you to specify if the crawler should send the RSS feed documents to the processing pipeline. Regardless of this option, meta data from RSS feeds will be sent to the processing pipeline together with the articles they link to. Default: no
max_link_age This parameter allows you to specify the maximum age (in minutes) for a link in an RSS document. Expired links will be deleted if the 'Delete expired' option is enabled. 0 disables this option. Default: 0 (disabled)
max_link_count This parameter allows you to specify the maximum number of links the crawler will remember for a feed. The list of links found in a feed will be treated in a FIFO manner. When links get pushed out of the list, they will be deleted if the 'Delete expired' option is set. 0 disables this option. Default: 128
del_expired_links This option allows you to specify if the crawler should delete articles when they expire. An article (link) will expire when it is affected by either 'Max articles per feed' or 'Max age for links in feeds'.
Default: no Example: <section name="rss"> <attrib name="start_uris" type="list-string"> <member> http://www.example.com/rss.xml </member> </attrib> <attrib name="auto_discover" type="boolean"> yes </attrib> <attrib name="ignore_rules" type="boolean"> no </attrib> <attrib name="index_feed" type="boolean"> yes </attrib> <attrib name="follow_links" type="boolean"> yes </attrib> <attrib name="max_link_age" type="integer"> 14400 </attrib> <attrib name="max_link_count" type="integer"> 128 </attrib> <attrib name="del_expired_links" type="boolean"> yes </attrib> </section> Metadata Storage In addition to document storage being handled by the datastore section, the crawler maintains a set of data structures on disk to do bookkeeping regarding retrieved content and content not yet retrieved. These are maintained as databases and queues, and collectively referred to as metadata (as opposed to the actual data retrieved). The crawler stores a collection's site and document metadata in the directory: $FASTSEARCH/data/crawler/store/collectionname/db/. The following options are relevant to how the crawler handles metadata. Site Databases For each site from which pages are fetched, an entry is made in a site database, storing details including the IP address, any equivalent (mirror) sites, the number of documents stored for the site. Work Queue Files The work queues used by the uberslave process to store URIs (and related data) waiting to be fetched are stored on-disk in the following location and directory format. 108 Configuring the Enterprise Crawler Location: $FASTSEARCH/data/crawler/queues/slave/collectionname/XX/YY/sitename[:port] In addition to being organized by collection, additional layers of directory structure are introduced to avoid file system limits. Within each collection directory, subdirectories (shown as XX and YY above) are created, using the first 4 hexadecimal digits of the MD5 checksum of a site's name. For example, the site www.example.com has the MD5 checksum 7c:17:67:b3:05:12:b6:00:3f:d3:c2:e6:18:a8:65:22. The created directory path is therefore 7c/17/www.example.com. If a site uses a port number other than the default (80 for HTTP, 443 for HTTPS), it will be included in the sitename directory, and used in calculating the checksum. In case of a restart of the crawler, the work queues are reloaded from disk, and the crawl continues from where it left off. Pending Queues The master (or, in a multi node configuration, ubermaster) process utilizes several on-disk queues used for storing URI and site information while DNS address resolution is being performed, and prior to a site being assigned to an uberslave (or master, in the multi node case) for further processing. These are stored in the following locations, organized by collection: Location: $FASTSEARCH/data/crawler/queues/master/collectionname/ unresolved.uris Queue for URIs waiting for site (hostname) DNS resolution unresolved.sites Queue for sites waiting for DNS resolution (per the configured DNS rate limits) resolved.uris Queue of URIs pending assignment to slave work queue, or to a master (multi node case) Writing a Configuration File Note: This method described should only be used when the collections (and sub collections) have been created in the FAST ESP administrator interface. The “collection-name” must match the name given to the collection when it was created. The “sub collection-name” must match the name given to the sub collection when it was created. 
When adding or modifying a configuration parameter, the configuration file needs only to contain the modified configuration parameters. The crawler configuration files are XML files with the following structure: <?xml version="1.0"?> <CrawlerConfig> <DomainSpecification name="collection-name"> ... collection configuration directives ... </DomainSpecification> </CrawlerConfig> When configuring a sub collection, use the following structure: <?xml version="1.0"?> <CrawlerConfig> <DomainSpecification name="collection-name"> ... collection configuration directives ... <SubDomain name="sub collection-name"> ... sub collection configuration directives 109 FAST Enterprise Crawler ... </SubDomain> </DomainSpecification> </CrawlerConfig> Uploading a Configuration File After a configuration file has been created, it must be uploaded to the crawler. This is done via the crawleradmin tool which is located in the $FASTSEARCH/bin directory of your FAST ESP installation. The following command uploads the configuration to the crawler: crawleradmin -f configuration.xml Changes take place immediately; any errors in the configuration file will be reported. Configuring Global Crawler Options via XML File Many of the crawler configuration options specified at startup can also be specified in the crawler default XML-based configuration file. At startup, the crawler looks for this file (CrawlerGlobalDefaults.xml) at: $FASTSEARCH/etc/CrawlerGlobalDefaults.xml The configuration file can also be specified at startup using the -F option. CrawlerGlobalDefaults.xml options Table 39: CrawlerGlobalDefaults.xml options Option slavenumsites Description Number of sites per uberslave. Use this option to specify the initial number of sites (slave instances) for an uberslave. Default: 1024 <attrib name="slavenumsites" type="integer"> 1024 </attrib> dbtrace Enable database statistics. Use this option to specify whether or not you want to enable detailed statistics from the databases. Default: no <attrib name="dbtrace" type="boolean"> no </attrib> directio Enable direct disk I/O. Use this option to specify whether or not to enable (yes) direct I/O in postprocess and duplicate server. Use only if the operating system supports this functionality. Default: no <attrib name="directio" type="boolean"> no </attrib> 110 Configuring the Enterprise Crawler Option numprocs Description Number of uberslave processes to start. This value will be overridden by the -c command line option. Default: 2 <attrib name="numprocs" type="integer"> 2 </attrib> logfile_ttl Log file lifetime. This option specifies the number of days to keep old log files before deletion. Default: 365 <attrib name="logfile_ttl" type="integer"> 365 </attrib> store_cleanup Time when daily storage cleanup job begins. Format: HH:MM (24-hour clock) Default: 04:00 <attrib name="store_cleanup" type="string"> 04:00 </attrib> ppdup_dbformat Duplicate server database format Valid values: hashlog, diskhashlog or gigabase <attrib name="ppdup_dbformat" type="string"> hashlog </attrib> disk_suspend_threshold Specifies a threshold, in bytes, that when reached will make the crawler suspend all existing collections. Default: 500 MB <attrib name="disk_suspend_threshold" type="real">524288000 </attrib> disk_resume_threshold Specifies a threshold, in bytes, that when reached will make the crawler resume all existing collections, in the event they already have been suspended by the 'disk_suspend_threshold' option. 
Default: 600 MB <attrib name="disk_resume_threshold" type="real">629145600 </attrib>
browser_engines List of browser engines that the crawler will use to process JavaScript and Flash extracted from HTML documents. Default: none <attrib name="browser_engines" type="list-string"> <member> <host>:<port> </member> </attrib>
feeding Various feeding options for postprocess. Valid values:
• priority: ESP content feeder priority. Note that there must be a pipeline configured with the same priority setting. Default: 0
• feeder_threads: Number of content feeder threads to start. Must only be changed when the data/dsqueues directory is empty. Default: 1
• max_cb_timeout: Maximum time to wait for callbacks in postprocess (in seconds) when shutting down. Default: 1800
• max_batch_size: Number of documents in each batch submission. Smaller batches may be sent if not enough docs are available or if the memory size of the batch grows too large. Default: 128
• max_batch_datasize: Maximum size of a batch specified in bytes. Lower this limit if you have trouble with procservers using too much memory. Default: 52428800 (50 MB)
• fs_threshold: Specifies the crawler file system (crawlerfs) getpath threshold in kB. Documents larger than this value will be served using the crawlerfs HTTP server instead of being inserted in the batch itself. Default: 128
• waitfor_callback: FAST ESP 5.0 only; from ESP 5.1 this is configured in $FASTSEARCH/etc/dsfeeder.cfg. Feeding callback to wait for. Possible values are PROCESSED, PERSISTED and LIVE. Recovery of batches that fail will not be available when the PROCESSED callback is chosen. Default: PERSISTED
• destinations: Specifies a set of feeding destinations. Each destination is identified by a symbolic name and a list of associated content distributor locations (host:port format). The content distributors for an ESP installation can be found by looking in $FASTSEARCH/etc/contentdistributor.cfg of that installation. When no feeding destinations are explicitly defined the crawler will default to the current ESP installation, and use the symbolic name "default". Note: To make use of user-specified feeding destinations they must be referenced in the collection configuration.
dns Domain name system (DNS) tuning options for the resolver. This option allows various settings related to the crawler's use of the DNS as a client. In single node installations the master calls DNS to resolve hostnames. In a multiple node installation this job is done by the ubermaster. Valid values:
• min_rate: Minimum number of DNS requests to issue per second. Default: 5
• max_rate: Maximum number of DNS requests to issue per second. Default: 100
• max_retries: Maximum number of retries to issue for a failed DNS lookup. Default: 5
• timeout: DNS request timeout before retrying (in seconds). Default: 30
• min_ttl: Minimum lifetime of resolved names (in seconds). Default: 21600
• db_cachesize: DNS database cache size setting for master; ubermaster will use four times this value. Default: 10485760
near_duplicate_detection Near duplicate detection tuning options. Near duplicate detection is enabled on a per-collection basis and primarily works for Western languages. Valid values: min_token_size: Specifies the minimum number of characters a token must have to be included in the lexicon. Tokens that contain fewer characters than this value are excluded from the lexicon. Range: 0 - 2147483647.
Default: 5 max_token_size: Specifies the maximum character length for a token. Tokens that contain more characters than this value are excluded from the lexicon. Range: 1 - 2147483647. Default: 35 unique_tokens: Specifies the minimum number of unique tokens a lexicon must contain in order to perform advanced duplicate detection. Below this level the checksum is computed on the entire document. Range: 0 2147483647. Default: 10 high_freq_cut: Specifies the percentage of tokens with a high frequency to cut from the lexicon. Range: between 0 and 1. Default: 0.1 low_freq_cut: Specifies the percentage of tokens with a low frequency to cut from the lexicon. Range: between 0 and 1. Default: 0.2 Sample CrawlerGlobalDefaults.xml file <?xml version="1.0"?> <CrawlerConfig> <GlobalConfig> <!-- Crawler global configuration file --> <!-- Maximum number of sites per UberSlave --> <attrib name="slavenumsites" type="integer"> 1024 </attrib> <!-- Enable/disable DB tracing --> <attrib name="dbtrace" type="boolean"> no </attrib> <!-- Enable/disable direct I/O in postprocess and duplicate server --> <attrib name="directio" type="boolean"> no </attrib> <!-- Number of slave processes to start --> <attrib name="numprocs" type="integer"> 2 </attrib> <!-- Number of days to keep log files --> <attrib name="logfile_ttl" type="integer"> 365 </attrib> <!-- Time of the daily storage cleanup, HH:MM (24-hour clock) --> <attrib name="store_cleanup" type="string"> 04:00 </attrib> <!-- Duplicate Server DB format (hashlog, diskhashlog or gigabase) --> <attrib name="ppdup_dbformat" type="string"> hashlog </attrib> <!-- Specifies a threshold, in bytes, that when reached will make --> <!-- the crawler suspend all existing collections. --> <attrib name="disk_suspend_threshold" type="real"> 524288000 </attrib> 113 FAST Enterprise Crawler <!-- Specifies a threshold, in bytes, that when reached will make the --> <!-- crawler resume all existing collections, in the event they --> <!-- already have been suspended by the 'disk_suspend_threshold' option. --> <attrib name="disk_resume_threshold" type="real"> 629145600 </attrib> <!-- List of browser engines--> <attrib name="browser_engines" type="list-string"> <member> mymachine.fastsearch.com:14195 </member> </attrib> <!-- Various feeding options for postprocess --> <section name="feeding"> <!-- Feeder priority --> <attrib name="priority" type="integer"> 0 </attrib> <!-- Number of content feeder threads to start. Must only --> <!-- be changed when the data/dsqueues directory is empty --> <attrib name="feeder_threads" type="integer"> 1 </attrib> <!-- Maximum time to wait for callbacks in PP (in seconds) --> <!-- when shutting down --> <attrib name="max_cb_timeout" type="integer"> 1800 </attrib> <!-- The number of documents in each batch submission. Smaller --> <!-- batches may be sent if not enough docs are available or --> <!-- if the memory size of the batch grows too large --> <attrib name="max_batch_size" type="integer"> 128 </attrib> <!-- The maximum number of bytes in each batch submission. --> <!-- Default 50MB --> <attrib name="max_batch_datasize" type="integer"> 52428800 </attrib> <!-- Specifies the crawlerfs getpath threshold in kB. Documents --> <!-- larger than this value will be served using the crawlerfs --> <!-- HTTP server instead of being inserted in the batch itself --> <attrib name="fs_threshold" type="integer"> 128 </attrib> <!-- Feeding callback to wait for (ESP 5.0 only). Can be one of --> <!-- PROCESSED, PERSISTED and LIVE. 
Please note that recovery --> <!-- of batches that fail will not be available when the --> <!-- PROCESSED callback is chosen --> <attrib name="waitfor_callback" type="string"> PERSISTED </attrib> <!-- Content feeding destinations. Collections will by default <!-- feed into a destination by the name "default", and this <!-- destination should always be available. Additional <!-- destinations may be added and referenced by collections. <section name="destinations"> --> --> --> --> <!-- Default destination is current ESP install --> <section name="default"> <!-- Empty list, use $FASTSEARCH/etc/contentdistributor.cfg --> <attrib name="contentdistributors" type="list-string"> </attrib> </section> <!-- Sample alternate destination --> <section name="example"> <attrib name="contentdistributors" type="list-string"> <member> hostname1:port1 </member> <member> hostname2:port2 </member> </attrib> </section> </section> </section> <!-- Various DNS tuning options for the resolver --> <section name="dns"> <!-- Minimum/Lower number of DNS requests to issue per second --> <attrib name="min_rate" type="integer"> 5 </attrib> 114 Configuring the Enterprise Crawler <!-- Maximum/Upper number of DNS requests to issue per second --> <attrib name="max_rate" type="integer"> 100 </attrib> <!-- Maximum number of DNS retries to issue for a failed lookup --> <attrib name="max_retries" type="integer"> 5 </attrib> <!-- DNS request timeout before retrying (in seconds) --> <attrib name="timeout" type="integer"> 30 </attrib> <!-- Minimum lifetime of resolved names (in seconds) --> <attrib name="min_ttl" type="integer"> 21600 </attrib> <!-- DNS DB cache size (in bytes) for Master; an UberMaster --> <!-- will use four times this value --> <attrib name="db_cachesize" type="integer"> 10485760 </attrib> </section> <!-- Various options for tuning the Near Duplicate Detection --> <!-- feature, which must be enabled on a per-Collection basis --> <section name='near_duplicate_detection'> <!-- Minimum token size for lexicon --> <attrib name="min_token_size" type="integer"> 5 </attrib> <!-- Maximum token size for lexicon --> <attrib name="max_token_size" type="integer"> 35 </attrib> <!-- The minimum number of unique tokens required to perform <!-- advanced duplicate detection --> <attrib name="unique_tokens" type="integer"> 10 </attrib> --> <!-- High frequency cut-off for lexicon --> <attrib name="high_freq_cut" type="real"> 0.1 </attrib> <!-- Low frequency cut-off for lexicon --> <attrib name="low_freq_cut" type="real"> 0.2 </attrib> </section> </GlobalConfig> </CrawlerConfig> Using Options This section provides information on how to set up various crawler configuration options. Setting Up Crawler Cookie Authentication This section describes how to set up the Enterprise Crawler to do forms based authentication, which is sometimes referred to as cookie authentication. Login page To configure the crawler for forms based authentication, it is first necessary to understand the process of a user logging in to the site using a browser. A common mechanism is that a request for a specific (or "target") URI causes the browser to instead be directed to a page containing a login form, into which username and password values must be entered. After entering valid values, the data is submitted to the web server using an HTTP POST request, and once authenticated the browser is redirected back to the original target page. 1. Open the web browser. 2. Point the browser to the page where you want the crawler to log in. 
The following shows a sample login page:
HTML Login Form
The following shows a sample HTML source view of the login form (with excess HTML source removed): 1: <form method="POST" name="login" action="/path/to/form.cgi"> 2: <input type="text" name="username" size="20"> 3: <input type="password" name="password" size="20"> 4: <input type="hidden" name="redirURI" value="/"> 5: <input type="submit" value="Login" name="show"> 6: <input type="reset" value="Reset" name="B2"> 7: </form>
The information shown here can be used to configure the crawler to log in to this site successfully.
HTML Login Form Descriptions
This example assumes the login page is found by going to http://mysecuresite.example.com. To browse the site, log in with the Full name demo and Password demon.
Line 1: The method of the form is "POST", and the action is "/path/to/form.cgi". The form variables are posted to that URI.
Line 2: The form needs a parameter named "username". (This is the login page entry named Full name). Note that "username" is not a fixed parameter name; both the username and the password parameters can have any name. The names are specific to each form, even though most forms use names that suggest the purpose of the parameter.
Line 3: The form needs a parameter named "password". (This is the login page entry named Password).
Line 4: The form needs a parameter named "redirURI" set. Note that this parameter is hidden, and thus not shown when viewing the page in the browser. In general, this type of hidden parameter need not be specified in the crawler's configuration, as the crawler will read the form itself and determine the names and values of hidden variables.
Line 5: This line describes the Login button on the login page. There are no variables to extract from here since the button is of type submit, which means that the browser should submit the form when the button is pressed.
Line 6: This line describes the Reset button on the login page. There are no variables to extract from here since the button is of type reset, which means that the browser should clear all input fields when the button is pushed.
Crawler Login Form
The following shows a sample crawler login: 1: <Login name="mytestlogin"> 2: <attrib name="preload" type="string">http://site.com/</attrib> 3: <attrib name="scheme" type="string"> https </attrib> 4: <attrib name="site" type="string"> mysecuresite.example.com </attrib> 5: <attrib name="form" type="string"> /path/to/form.cgi </attrib> 6: <attrib name="action" type="string">POST</attrib> 7: <section name="parameters"> 8: <attrib name="user" type="string"> username </attrib> 9: <attrib name="password" type="string"> password </attrib> 10: </section> 11: <attrib name="sites" type="list-string"> 12: <member> site1.example.com </member> 13: <member> site2.example.com </member> 14: </attrib> 15: <attrib name="ttl" type="integer"> 7200 </attrib> 16: </Login>
Crawler Login Form Descriptions
Configure the crawler login specification by filling in the necessary values for the crawler configuration.
Line 1: The login name must be unique (all login specifications must have different names). This sample uses the name "mytestlogin".
Line 2: The preload step is not needed for this form, and is an optional parameter. If a target URI is used in the browser login (i.e.
in order to set initial cookie values), this URI should be used as the value of the preload attribute, to force the crawler to fetch this page before attempting login.
Note: Lines 3, 4, and 5 (scheme+site+form) together make up the URI of the login form page, e.g. http://mysecuresite.example.com/path/to/form.cgi
Line 3: The "scheme" of the page this URI was found on was "http". Note that some forms may be found on HTTP sites, but the URI in the form action may be absolute and point to an HTTPS site instead. For this example the form action URI was relative, so it will have the same scheme as the form URI. The "scheme" field is optional; if not set, "http" is assumed.
Line 4: The site (or hostname) of the web server on which the form URI resides. In this sample, the site is "mysecuresite.example.com".
Line 5: The actual form we are logging into is the form specified in the form action described in Line 1 of the HTML login form. In this sample action="/path/to/form.cgi".
Line 6: The method of the form was found to be "POST".
Lines 7+: Use the "parameters" section to describe the HTML login form. For this sample we need username and password. These credentials are a sequence of key/value parameters that the form requires for a successful logon; they differ from form to form, and must be deduced by looking at the HTML source of the form. In general, only user-visible (via the browser) variables need be specified, e.g. username and password, or equivalent. The crawler will fetch the login form and read any "hidden" variables that must be sent when the form is submitted. If a variable and value are specified in the parameters section, this will override any value read from the form by the crawler. <section name="parameters"> <attrib name="username" type="string"> demo </attrib> <attrib name="password" type="string"> demon </attrib> </section>
Lines 11+: Use the "sites" section to identify every site that needs to log in to the login form before starting to crawl. This sample lists two sites, site1.example.com and site2.example.com. When the crawler begins to crawl either of these sites, it will log in with the specified options before fetching pages. <attrib name="sites" type="list-string"> <member> site1.example.com </member> <member> site2.example.com </member> </attrib>
Line 15: The time to live (ttl) variable is optional, and the sample login page does not produce any time-limited cookies, so it is not covered in this description. Some forms may set expire times on the cookies they return, and require credentials to be verified after a period of time. For such forms you may specify a ttl value, specifying the number of seconds until the crawler logs in again.
Confirming Successful Login
The crawler will attempt login for each of the sites listed, and can generally be considered to have done so successfully if it proceeds to crawl the site's Start URI and other pages linked to from it. The fetch log would show this successful pattern, as in the following example.
2007-07-19-22:42:36 200 NEW http://www.example.com/index.php (Reading login form)[TestLogin]
2007-07-19-22:42:36 200 NEW http://www.example.com/login.php (Submitting login form)[TestLogin]
2007-07-19-22:42:39 200 NEW http://www.example.com/
2007-07-19-22:42:42 200 NEW http://www.example.com/faq.php
The site log should also show the status of the authentication attempt.
2007-07-19-22:42:35 STARTCRAWL N/A www.example.com
2007-07-19-22:42:35 LOGIN GET www.example.com Performing Authentication
2007-07-19-22:42:36 LOGIN POST www.example.com Performing Authentication
2007-07-19-22:42:36 LOGGEDIN N/A www.example.com Through
A failure to log in will be indicated by the site not being crawled extensively, as shown in the fetch log. More detailed information would be written to the crawler log file, especially in DEBUG mode. You can contact FAST Support for further troubleshooting details.
Implementing a Crawler Document Plugin Module
This section describes how to create a Python plugin module in order to provide an additional means of control over the internal processing of fetched documents after they have been downloaded and initial processing has completed. The scope of work performed by the plugin can vary widely, ranging from a read-only analyzer to very complex processing of each document, and can include the rejection of documents from the crawl.
Overview
To implement the plugin, as a minimum you need a Python class that defines a process() call that takes one argument, a document object provided by the crawler. An optional process_redirect() call may also be specified to evaluate redirections received in the course of following links. A basic implementation of a plugin is as follows:
class mycrawlerplugin:
    def process(self, doc):
        # XXX: Activity1.
        pass
    def process_redirect(self, doc):
        # XXX: Activity2.
        pass
The document object is an internal crawler data structure, and has a fixed set of attributes that you can utilize and modify in your processing call. Note that only a subset of the attributes can be modified; any changes will have an effect on subsequent crawler behavior. The process() call is invoked for each document that is processed for links within the crawler, that is, those whose MIME type matches the MIME types to search for links option. The process_redirect() call is invoked whenever the crawler encounters a redirect response from a server, that is, whenever the server returns ordinary redirect response codes (HTTP response codes 301 or 302) or when an HTML META "refresh" is encountered and is evaluated as a redirect according to the configuration settings.
Configuring the Crawler
Configure the crawler to use your plugin with the Document evaluator plugin option. Note that only one plugin can be active at a time per collection. The format for the option is: tests.plugins.plugins.helloworld The format specifies the Python module path, ending with your class name. The crawler splits on the last '.' separator and converts this to the Python equivalent "from <module> import <class>". Note: Whenever you create a Python module in a separate directory, you need to have an empty __init__.py file to be able to import it. This is a Python requirement, and failure to do so will result in error messages in the crawler.log file. The module must be available relative to the PYTHONPATH environment variable. For example, the file structure on disk of this sample plugin configuration is tests/plugins/plugins.py, which contains the class helloworld(). If used within an existing FAST ESP installation this would be relative to: ${FASTSEARCH}/lib/python2.3/ Note that only documents that are searched for links, that is, those matching the MIME types of the uri_search_mime option, are subject to processing by the defined document plugin.
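As a sketch of the module layout described above (the directory and class names are taken from the tests.plugins.plugins.helloworld example; the commands themselves are only illustrative):
cd $FASTSEARCH/lib/python2.3
mkdir -p tests/plugins
touch tests/__init__.py tests/plugins/__init__.py   # empty files, required so Python can import the packages
# place the plugin code, containing class helloworld, in tests/plugins/plugins.py;
# the collection option then references it as tests.plugins.plugins.helloworld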
Modifying Document Object Options Each URI downloaded by the crawler is processed internally to determine the values of various attributes, which are made available to the plugin as a "document" object. The following tables describe the "document" options for the process() and process_redirect() calls. Table 40: process () Options Option store_document [integer] Description This option specifies whether or not to store the current document. Valid values: 0 (no), 1 (yes) Documents that are not stored (option set to 0) will be logged in the fetch log as: 2006-04-19-14:53:42 200 IGNORED <URI> Excluded: plugin_document focus_crawl [integer] This option specifies whether or not the current document is out of focus. Valid values: 0 (no), 1 (yes) Example: To have an effect on the focus of the crawl, a focus section with a depth attribute must be defined in the collection configuration. <section name="focused"> <attrib name="depth" type="integer"> 2 </attrib> </section> links [list] This option contains all the URIs the crawler link parser was able to pull out of this document. The list consists of tuples for each link. A tuple contains either three or five objects. A five-tuple entry contains: uritype - type of URI as defined by pydocinfo (eg. pydocinfo.URI_A) 119 FAST Enterprise Crawler Option Description uriflag - attribute flag for URI. uri - the original URI uricomp - the parsed version of the URI as output by pyuriparse.uriparse(<uri>) metadata - dictionary containing the meta data that should be associated/tagged with this URI (optional) A three-tuple entry contains: uritype - type of URI as defined by pydocinfo (eg. pydocinfo.URI_A) uri - the original URI uricomp - the parsed version of the URI as output by pyuriparse.uriparse(<uri>) Note: Be sure to keep the same format. If you do not want any links to be followed from the current document, set this attribute to an empty list. cookies [list] Use this option to add any additional cookies your module wants to set to the crawler cookie store. The cookies document attribute is always an empty list from the crawler. Format Example (valid HTTP Set-Cookie header): Set-Cookie: <cookie> data [string] csum [string] This option contains the data of the current document. This option contains the checksum of the current document. Valid values: string of length 16 bytes referrer_data [dictionary] The referrer_data attribute contains the meta data which the parent document has appended to this URI. For instance, with RSS feeds the meta data from the RSS feed is forwarded with each URI found in the feed. Note that the meta data is forwarded automatically if the URI is a redirect. extra_data [dictionary] destination [list(2)] This attribute can be used to store additional meta data with this document, which will also be sent to the processing pipeline. The docproc example below shows how to extract this data in the pipeline. This attribute can be used to set the feeding destination of the current document. The list should contain 2 items [<destination>:<collection>]. <destination> refers to the pre-defined destination targets in your GlobalCrawlerConfig.xml file. <collection> refers to an existing collection on the target system. errmsg [string] 120 This attribute can be used to set a short description on why the current document was excluded due to being out of focus (focus_crawl = 1) or not being stored (store_document = 0) that will be output to the crawler fetch log. 
The errmsg will be prefixed by "document_plugin:" and appended to the document's fetch log entry.
Table 41: process_redirect () Options
links [list] This option is the same as the links [list] option in the process() call, but with the restriction that it contains only a single tuple entry. This tuple contains the URI that the redirect refers to. If this tuple is modified, then the crawler will use the updated location as target for the redirect.
store_document [integer] This option controls whether or not the redirect URI set in the links option should be followed. Valid values: 0 (no), 1 (yes) Documents that are not stored (option set to 0) will be logged in the fetch log as: 2006-04-19-14:53:42 200 IGNORED <URI> Excluded: plugin_document Note: When store_document is set to 0, no further processing of the redirect will take place, and any modifications to the links or cookies attributes are ignored.
cookies [list] Use this option to add any additional cookies your module wants to set to the crawler cookie store. The cookies document attribute is always an empty list from the crawler. Format Example (valid HTTP Set-Cookie header): Set-Cookie: <cookie>
errmsg [string] This attribute can be used to set a short description of why the current document was excluded due to not being stored (store_document = 0) that will be output to the crawler fetch log. The errmsg will be prefixed by "document_plugin:" and appended to the document's fetch log entry.
Static Document Object Options
The following document object attributes are included in the plugin, but should not be changed. They are available in both process() and process_redirect().
Table 42: Static process () and process_redirect() Attributes
site [string] This attribute contains the site/hostname of the current document.
ip [string] This attribute contains the IP of the site/hostname of the current document.
uri [string] This attribute contains the URI of the current document.
header [string] This attribute contains the HTTP headers of the current document.
referrer [string] This attribute contains the referrer of the current document. Note: An empty referrer means the current document was a start URI.
collection [string] This attribute contains the name of the collection the current document belongs to.
mimetype [string] process() only. This attribute contains the MIME type of the current document.
encoding [string] This attribute contains the auto-detected character encoding of the current document.
redirtrail [list] process_redirect() only. This attribute is a list that contains all the URIs of preceding redirects that were performed prior to the current one. Each URI is a two-tuple that contains: uri - referring redirect URI; flags - flag internal to the crawler.
is_rss [boolean] This attribute indicates whether the crawler has identified the document as an RSS feed.
is_sitemap [boolean] This attribute indicates whether the crawler has identified the document as a sitemap or sitemap index.
Hello world
Processing class that prints 'hello world' for every document that is put through.
class helloworld:
    def __init__(self):
        pass
    def process(self, doc):
        print "hello world"
Focus crawl
Processing class that focuses the crawl based on whether or not the content contains 'fast'. Note that this requires the global focus depth to be set in the configuration.
import re
class focusonfast:
    def __init__(self):
        # Regexp matching the string 'fast'
        self.re = re.compile(".*?fast", re.I)
    def process(self, doc):
        if not self.re.match(doc.data):
            # Change the focus crawl option of this document
            doc.focus_crawl = 1
Lowercase all URIs
Processing class that lowercases the path of every URI (for Windows web servers, which are case-insensitive). Note that this requires that all URIs input to the crawler (for example start URIs/crawleradmin -u) are also in lowercased form.
import pyuriparse   # URI parsing module provided by the crawler environment
class lowercase:
    def process(self, doc):
        newlinks = []
        # Handle no links
        if not doc.links:
            return
        for uritype, uri, uricomp in doc.links:
            # Parse uri into its 7-part components
            # lowercase path, params, query and fragment part of URI
            newuri = pyuriparse.uriunparse([
                uricomp[pyuriparse.URI_SCHEME],
                uricomp[pyuriparse.URI_NETLOC],
                uricomp[pyuriparse.URI_PATH].lower(),
                uricomp[pyuriparse.URI_PARAMS].lower(),
                uricomp[pyuriparse.URI_QUERY].lower(),
                uricomp[pyuriparse.URI_FRAGMENT].lower(),
                uricomp[pyuriparse.URI_USERPASS]])
            newuricomp = pyuriparse.uriparse(newuri)
            print "Before:", uri
            print "After:", newuri
            newlinks.append((uritype, newuri, newuricomp))
        # Change the links associated with this document
        doc.links = newlinks
Add text
Processing class that inserts 'good' at the beginning and end of the document.
class prefixsandpostfix:
    def process(self, doc):
        # Modify content of this document
        doc.data = 'good' + doc.data + 'good'
Modify Checksum
Processing class that makes every document a duplicate by forcing the same checksum.
class duplicate:
    def process(self, doc):
        # Modify checksum of all docs to 'a' x 16
        doc.csum = 'a' * 16
Add cookie
Processing class that adds a cookie to the crawler cookie store for every document.
import pyuriparse   # URI parsing module provided by the crawler environment
class cookieextenter:
    def process(self, doc):
        # Set a cookie for the hostname of the current URI
        uricomp = pyuriparse.uriparse(doc.uri)
        domain = uricomp[pyuriparse.URI_NETLOC].split(".", 1)
        doc.cookies = ['Set-Cookie: BogusCookie=f00bar; path=/; domain=.%s' % domain[1]]
Exclude links
Processing class that parses the document for links and anchor text, and based on the anchor text of the links excludes bad links associated with Apache directory listings, and so forth.
import htmllib, formatter
class directorylistingdetector:
    class myparser(htmllib.HTMLParser):
        def __init__(self, formatter):
            self.links = {}
            self.currenthref = None
            htmllib.HTMLParser.__init__(self, formatter)
        def anchor_bgn(self, href, name, type):
            if not href in self.links:
                self.links[href] = ''
            self.currenthref = href
            self.save_bgn()
        def anchor_end(self):
            if self.currenthref is not None:
                self.links[self.currenthref] = self.save_end()
                self.currenthref = None
        def reset(self):
            self.links = {}
            htmllib.HTMLParser.reset(self)
    def __init__(self):
        self.parser = self.myparser(formatter.NullFormatter())
        self.ignorelist = ("parent directory", "name", "last modified", "size", "description", "../")
    def process(self, doc):
        self.parser.reset()
        self.parser.feed(doc.data)
        existinglinks = map(lambda x: x[1], doc.links)
        for link in self.parser.links:
            normlink = self.normalize_link(doc.uri, link)
            if normlink not in existinglinks:
                # XXX: Crawler didn't find this link, ignore it
                print "Ignoring %s, mismatch with crawler" % link
                continue
            idx = existinglinks.index(normlink)
            if self.parser.links[link].lower() in self.ignorelist:
                # The link should not be followed, ignore
                print "Ignoring %s, anchor text=%s" % (link, self.parser.links[link])
                doc.links.pop(idx)
                existinglinks.pop(idx)
                continue
    def normalize_link(self, baseuri, uri):
        return pyuriparse.urijoin(baseuri, uri)
Extra data
Processing class that adds text to the extra_data parameter for each document.
class addMeta:
    def process(self, doc):
        doc.extra_data = "Extra data"
Docproc stage for extra_data
Docproc example that extracts the extra_data parameter set by a crawler plugin and adds it to the document attribute "generic3".
class CrawlerPluginProcessor(Processor.Processor):
    def Process(self, docid, document):
        extra_data = document.GetValue('extra_data', None)
        if extra_data:
            plugin_data = extra_data.get("docplugin", None)
            if plugin_data is not None:
                document.Set("generic3", plugin_data)
        else:
            return ProcessorStatus.OK_NoChange
        return ProcessorStatus.OK
Configuring Near Duplicate Detection
The default duplicate detection scheme in the crawler strips format tags from each new document and then creates a checksum based on the remaining content. If another document exists with the same checksum, that document is identified as a duplicate. This approach may sometimes be too rigid. There are many documents that are not exactly identical, but are perceived to contain the same content by the user. For example, two documents might have the same body of text but be marked with different timestamps. Or a document could have been copied but be missing some characters due to a copy-and-paste mistake. Since these documents are not exactly identical they are not registered as duplicates by the crawler. A near duplicate detection algorithm addresses this issue. This section describes how the near duplicate detection scheme works and how it can be used in the crawler.
Overview
Once the crawler retrieves a new document, it is parsed into a token stream and its markup code and punctuation are removed. Individual words (tokens) are separated by splitting the remaining content on whitespace and punctuation (a rudimentary tokenizer). If a new token is shorter or longer than some predefined values, then the token is discarded. Otherwise the token is lowercased and added to a lexicon, or collection of words. Note: Since CJK languages do not separate tokens by space, the text appears as a set of large continuous tokens.
Because languages such as Chinese, Japanese, Korean (CJK) and Thai cannot be split into tokens without a more complex algorithm, most of the detected tokens in such documents are discarded. When parsing is done, the constructed lexicon is trimmed. The most and least frequent tokens are removed from the lexicon. The goal of this process is to retain only the significant tokens in the lexicon. By removing the most frequent tokens the algorithm tries to get rid of the most common words in a language, for example 'the', 'for', etc. By removing infrequent tokens it tries to get rid of timestamps, tokens with spelling mistakes, and so forth. If the lexicon contains enough tokens then a digest string is constructed by traversing the original document for tokens that are in the lexicon. If a token is in the document and in the lexicon the token is added to the digest string. Once the entire document has been traversed, the digest is used to generate a signature checksum. However, if the trimmed lexicon does not have enough tokens, a checksum will be constructed from the entire document without format tags - just as in the existing duplicate detection scheme. As in the default scheme, documents with the same checksum are defined as duplicates.
Configuring the Crawler
The crawler is able to modify the behavior of its duplicate detection with the Near duplicate detection option. The variable can also be set in the collection specification: <attrib name="near_duplicate_detection" type="boolean"> yes </attrib>
Next, there are global options located in the $FASTSEARCH/etc/CrawlerGlobalDefaults.xml file that can be set. Changing these parameters will result in a different digest and consequently generate a different signature for the document. Refer to CrawlerGlobalDefaults.xml options on page 110 for more information. It is not recommended that you modify these parameters after the crawl has started.
<attrib name="max_token_size" type="integer"> 35 </attrib>
<attrib name="min_token_size" type="integer"> 5 </attrib>
<attrib name="high_freq_cut" type="real"> 0.1 </attrib>
<attrib name="low_freq_cut" type="real"> 0.2 </attrib>
<attrib name="unique_tokens" type="integer"> 10 </attrib>
Near Duplicate Detection Example
Original Document text: The current version of the Enterprise Crawler (EC) has a duplicate detection algorithm that strips format tags from each new document, and then creates a checksum based on the remaining content. If another document exists with the same checksum, that document is identified as a duplicate.
CrawlerGlobalDefaults configuration:
<attrib name="max_token_size" type="integer"> 35 </attrib>
<attrib name="min_token_size" type="integer"> 4 </attrib>
<attrib name="high_freq_cut" type="real"> 0.1 </attrib>
<attrib name="low_freq_cut" type="real"> 0.2 </attrib>
<attrib name="unique_tokens" type="integer"> 10 </attrib>
The previous configuration and document text yield the following adjustments to the full lexicon of 25 tokens. Note that the cutoff percentages as computed against the full token list are rounded down to the nearest whole number, and that terms of equal frequency are ordered alphabetically before trimming. As can be seen in the following, the low frequency tokens trimmed are those from the end of the alphabet.
• Removed due to high_freq_cut: the top 2 words (10% of 25 words, rounded down, is 2), in reverse alphabetical order (document, that):

Token        Frequency
document     3
that         2

• Removed due to low_freq_cut: the bottom 4 words (the 20% low frequency cut), in reverse alphabetical order (with, version, then, tags):

Token        Frequency
tags         1
then         1
version      1
with         1

The final trimmed lexicon meets the threshold limit of 10 tokens, and contains the following terms:

Token        Frequency
checksum     2
duplicate    2
algorithm    1
another      1
based        1
content      1
crawler      1
creates      1
current      1
detection    1
each         1
enterprise   1
exists       1
format       1
from         1
identified   1
remaining    1
same         1
strips       1

Based on this, the digest string, ordered according to the word order of the original document, reads:

currententerprisecrawlerduplicatedetectionalgorithmstripsformatfromeachcreateschecksumbasedremainingcontentanotherexistssamechecksumidentifiedduplicate

The checksum is then computed on this digest string, and is associated with the original fetched document.

Configuring SSL Certificates

In most cases no special configuration is necessary for the crawler to fetch from SSL protected sites (https). In some cases it is necessary to enable Cookie support in the crawler.

If a full SSL certificate chain must be presented to the web server, follow the steps below to prepare the certificate files and set up a certificate chain supporting a client certificate in the crawler. This is only required when using client certificates, and only when the client certificate itself cannot be verified directly by the server unless the complete certificate chain up to the trusted CA is attached.

1. Copy all certificates (including intermediate certificates) into the PEM certificate file specified for the crawler, with the other certificates at the beginning of the file and the root CA certificate last (no key is copied into the file).
2. Encode the certificate file in PKCS#7 format using the command: openssl crl2pkcs7 -nocrl -certfile file_with_certs -out combined.pem
   Note: The key point is the use of the PKCS#7 format for the certificate file specified to the crawler.
3. Specify the combined.pem file in the crawler certificate file configuration option.

Configuring a Multiple Node Crawler

The distributed crawler consists of one ubermaster process, one or more duplicate servers and one or more subordinate crawler (master) processes. The ubermaster controls all work going on in the crawler, and presents itself as a single data source in the FAST ESP administrator interface.

Before you begin you should decide:
• Which nodes should run the ubermaster, duplicate server(s) and master(s) processes.
• Whether you are removing the existing crawler, or setting up the new crawler so that it does not interfere with the existing crawler.

Go to Removing the Existing Crawler on page 128 if you are removing the existing crawler; go to Setting up a New Crawler with Existing Crawler on page 128 if you are setting up a new crawler while keeping the existing crawler.

Removing the Existing Crawler

If you are removing the existing crawler and replacing it with a new crawler configuration, complete this procedure. To remove the existing crawler:

1. On all nodes that run crawler processes (assuming the processes are named crawler), run the command: $FASTSEARCH/bin/nctrl stop crawler
2. On the node running the ubermaster process, run the command: $FASTSEARCH/bin/nctrl stop ubermaster
3.
Stop the nctrl process with the command: $FASTSEARCH/bin/nctrl stop nctrl. On Windows it is neccessary to stop the FAST ESP Service instead. 4. Remove the crawler process from the startorder list in $FASTSEARCH/etc/NodeConf.xml. 5. Remove the crawler process from $FASTSEARCH/etc/NodeState.xml. 6. Start the nctrl process by running the command: $FASTSEARCH/bin/nctrl start. On Windows this can be done by starting the FAST ESP Service instead. 7. On the node running the ubermaster process, run the command: $FASTSEARCH/bin/nctrl start ubermaster. 8. On all nodes that run crawler processes (assuming the processes are named crawler), run the command: $FASTSEARCH/bin/nctrl start crawler. Setting up a New Crawler with Existing Crawler If you are setting up the new crawler as a separate data source so that it does not interfere with the existing crawler, complete this procedure. 1. Modify the $FASTSEARCH/etc/NodeConf.xml files on the different nodes. Existing crawler entries in the file on your FAST ESP installation file can be used as templates. 2. When several crawler components are run on the same node, be they multiple instances of single node crawlers or several components of a multiple node crawler, always make sure that the following parameters do not overlap: -P (the port number used to communicate with the process) -d (the data directory designated to the process) -L (the log directory designated to that process) 128 Configuring the Enterprise Crawler Port numbers should be sufficiently far apart to avoid interference; incrementing by 100 per process should be sufficient. Always inspect the existing entries in the NodeConf.xml file to ensure the port numbers do not overlap with those allocated to other processes. 3. Add an ubermaster. An example of an ubermaster process entry is as follows: <process name="ubermaster" description="Master Crawler"> <start> <executable>crawler</executable> <parameters>-P $PORT -U -o -d $FASTSEARCH/data/crawler_um -L $FASTSEARCH/var/log/crawler_um</parameters> <port base="1000" increment="1000" count="1"/> </start> <outfile>ubermaster.scrap</outfile> <limits> <minimum_disk>1000</minimum_disk> </limits> </process> Note the -U parameter, then add <proc>ubermaster</proc> to the global startorder list near the top of the $FASTSEARCH/etc/NodeConf.xml file. 4. Add a subordinate master. An example of a subordinate master entry is as follows: <process name="master" description="Subordinate Crawler"> <start> <executable>crawler</executable> <parameters>-P $PORT -S <ubermaster_host>:14000 -o -d $FASTSEARCH/data/crawler -L $FASTSEARCH/var/log/crawler</parameters> <port base="1100" increment="1000" count="1"/> </start> <outfile>master.scrap</outfile> <limits> <minimum_disk>1000</minimum_disk> </limits> </process> A <proc>master </proc> should be added to the startorder list in the $FASTSEARCH/etc/NodeConf.xml file. 5. Add a duplicate server. The duplicate server can be set up in a number of different ways, including striped and replicated modes, but a simple standalone set up is as follows: <process name="dupserver" description="Duplicate server"> <start> <executable>ppdup</executable> <parameters>-P $PORT -I <Symbolic ID></parameters> <port base="1900" increment="1" count="1"/> </start> <outfile>dupserver.scrap</outfile> <limits> <minimum_disk>1000</minimum_disk> </limits> </process> The ppdup binary must be added to the configuration in the FAST ESP administrator interface with a host:port location (available in the advanced mode). 
Note that the ppdup does not have an -L parameter. Refer to Crawler/Master Tuning for information about cache size and storage tuning 129 FAST Enterprise Crawler A <proc>dupserver</proc> should be added to the startorder list in the $FASTSEARCH/etc/NodeConf.xml file. 6. Start the new crawler: $FASTSEARCH/bin/nctrl reloadcfg 7. Start the different processes on the relevant nodes in the following order: $FASTSEARCH/bin/nctrl start ubermaster $FASTSEARCH/bin/nctrl start dupserver $FASTSEARCH/bin/nctrl start master (on all master nodes) 8. To verify the new configuration: a) Check the ubermaster logs to verify all masters are connected and that there are no conflicts, for example, conflicts in -I names. b) Check to make sure the ubermaster appears as a Data Source in the FAST ESP administrator interface You can add collections by either using the FAST ESP administrator interface or by uploading the crawler XML specifications with the following command: $FASTSEARCH/bin/crawleradmin -C <ubermaster hostname>:<ubermaster portnumber> -f <path to xml specification> If you are uploading a web scale crawl, it is recommended that you add the collection with the Large Scale XML Configuration Template on page 141. 9. Refeed collections with postprocess. Re-feeding collections in a multiple node crawler is similar to performing it in a single node crawler, with some exceptions. Before starting the refeed make sure the duplicate server(s) are running. The master(s) must be stopped on the node(s) you wish to refeed, the ubermaster as well as masters on other nodes may continue to run. $FASTSEARCH/bin/postprocess -R "*" Large Scale XML Crawler Configuration This section provides information on how to configure and tune a large scale distributed crawler. Node Layout A large scale installation of the crawler consists of three different components: • • • One Ubermaster (UM), One or more Duplicate servers (DS) and Multiple Crawlers. Each crawler consists of a master process, multiple uberslave processes and a single postprocess. A typical 10 node layout may look like this (each square represents a server): 130 Configuring the Enterprise Crawler Figure 13: This configuration offers both duplicate server fault tolerance and load balancing. Node Hardware The following items (in prioritized order) should be considered when planning hardware: 1. Disk I/O performance 2. Amount of memory 3. CPU performance and dual vs. single processor Typical disk setup involves either RAID 0 or RAID 5 with a minimum of four drives. RAID 0 offers better performance, but no fault tolerance. RAID 5 has substantial penalty on write performance. Other options include RAID 0+1 (or RAID 1+0). When running with a replicated duplicate server setup, it may be that a non-fault tolerant setup (for example, RAID 0) is the best alternative for the duplicate server nodes, with all other nodes on a fault tolerant storage (RAID 5). Memory usage is very dependent on the cache configuration used, but both the duplicate server and postprocess (on each crawler node) can take advantage of large amounts for database caching purposes. CPU performance is much less important, and depends mainly on the configuration settings used. This is discussed in more detail in the Configuration and Tuning section. Hardware Sizing Due to the I/O-bound nature of crawling, the hardware sizing should be based primarily on hard disk capacity and performance. To calculate the disk usage of the crawler nodes, the following needs to be accounted for. 
• Crawled data (each document is compressed individually by default)
• Meta data structures
• Postprocess duplicate database
• Crawl and feed queues

Assuming an average compressed document size of 20kB (30kB if also crawling PDF and Office documents), 2kB of meta data per document (including the HTTP header) and 500 bytes per URI in the duplicate DB, we can calculate the disk space requirements for a single node. Note however that the document sizes (20kB and 30kB) are estimates, and depend largely on what is being crawled and on the document size cut-off specified in the configuration. Adding 30% slack on top to account for wasted space in the data structures, log files, queues and so on, we get the following guideline table.

Documents    Data size (HTML only)    Data size (HTML, PDF, Word++)
10M          290GB                    420GB
20M          585GB                    845GB
30M          880GB                    1.3TB

The rule of thumb is that a node should not hold more than 20-30M documents, as performance may degrade too much beyond that point.

The disk usage of the duplicate servers will be similar to that of the postprocess duplicate database. However, keep in mind that using replication (which is recommended) doubles the disk usage, as each node holds a mirror of its peer node's dataset.

Ubermaster Node Requirements

The UM can either run on a separate node or share a node with one of the duplicate servers. Place the UM on a dedicated node for large installations (20 masters and up).

Memory usage: 100-500MB
CPU usage: Moderate to high, depending on the number of masters
Disk I/O: Moderate to high, depending on the number of masters
Disk usage: Minimal

There are no global tuning parameters for this component.

Duplicate Servers

The duplicate server processes serve as the backbone of a multiple node crawler setup, and care should be taken when configuring them since they may be difficult to reconfigure at a later stage.

Memory usage: 70MB and up (tunable)
CPU usage: Minimal
Disk I/O: Heavy during the first cycle, moderate on subsequent cycles
Disk usage: Moderate

Non-replicated Mode

A simple duplicate server layout involves one or more duplicate server processes in a non-replicated mode. The advantage of this approach is the increased performance offered, with the drawback of no replication in case of failure (loss of data). It should therefore only be used if the underlying disk system is fault tolerant (and preferably more so than RAID 5). For each duplicate server set up this way you must also add it to the duplicate servers configuration section for each collection. Refer to ppdup for options information.

Two node (striped) example, running on servers node1 and node2 with symbolic IDs dup1 and dup2:

node1: ./ppdup -P 14900 -I dup1
node2: ./ppdup -P 14900 -I dup2

Replicated Mode

The duplicate server can be replicated in two ways: dedicated replica and cross-replication.

The dedicated replica mode works by setting up a second duplicate server that acts only as a replica, with no direct communication with postprocess, only with its duplicate server primary. Both processes in the following example use the same ID.

Dedicated replica example, running on two nodes:

node1: ./ppdup -P 14900 -I dup1 -R node2:14901
node2: ./ppdup -P 14900 -I dup1 -r 14901

An alternative means of getting both replication and load balancing is to use the cross-replication mode. In this setup each duplicate server acts as both primary and replica for another duplicate server.
Cross replication example running on two nodes: node1: ./ppdup -P 14900 -I dup1 -R node2:14901 -r 14901 node2: ./ppdup -P 14900 -I dup2 -R node1:14901 -r 14901 While it may seem that the last setup is preferable to a combined striped and replicated setup, it is not. Separating primary and replica into two processes has two distinct advantages; it allows the processes to use more memory for caching (max process size on RHEL3 is about 2.8GB) as well as placing the I/O and CPU tasks somewhat in parallel (the duplicate server uses blocking I/O). Crawlers (Masters) Memory usage 512MB and up (tunable) CPU usage Moderate to high, configuration dependent Disk I/O Heavy Disk usage High There are no global tuning parameters for this component. Configuration and Tuning When planning a large scale deployment configuration, first consider the number of collections and their sizes. An ideal large scale setup consists of a single collection, possibly with a few sub collections. Having multiple smaller collections on a multiple node crawler is generally not desired, especially if they fit on single node crawler. In this case it is better to set up one or several single node crawlers to handle the small collections. If you have to have multiple collections on a multiple node crawler, keep in mind that many tunable parameters such as cache sizes are configured per collection so they can add up for each collection. Furthermore, certain parameters are applied individually to each collection, but may only be configured through one global setting. This does not fit well with having both small and large collections on the same multiple node crawler and is generally not recommended. However, there are advantages to having multiple collections as opposed to one collection with sub collections. Individual collections make management easier, including configuration updates, re-feeds, and scratching. Having many sub collections add overhead since each URI must be checked against the include rules of each sub collection until a match, if any, is found. The best advice is to first divide your data into whatever logical collections make the most sense. If the setup calls for a mix of large and small collections (for example, web and news crawls) then it is advisable to place the small collections on a separate single node crawler. The remaining collections should generally be larger than what a single node could handle and it therefore makes sense to run them either separately or as a single collection on the multiple node crawler. Include /Exclude Rules The crawler uses a set of include and exclude rules to limit and control what is to be crawled. 133 FAST Enterprise Crawler In a small scale setting the performance considerations are few as there are generally a limited number of rules. However, in a large scale setting there may be tens of thousands of rules and care must be taken when selecting these. It is important to keep in mind that every URI extracted by the crawler is checked against some or all of these rules. The least expensive rules are the exact match hostname rules. Checking a URI against these involve a single hash table lookup, so the performance is the same regardless of the number of URIs. Memory usage depends on the number of rules. The suffix and prefix hostname includes are also generally inexpensive, as they are also implemented using hash structures. 
By dividing the URIs into subsets based on their lengths we get at most n lookups (where n is the number of different lengths), rather than one lookup per rule. While more expensive than the exact match rule, it is dependent on the number of unique lengths and not the number of rules. The regexp type rule should be avoided if at all possible. In general it will only be necessary as a URI rule, not a hostname rule. It is best to check with the crawler team if you have any questions regarding this. Note: One common pitfall is to use either a suffix URI rule or a regexp URI rule to exclude certain filename extensions. The former will fail if the URI contains for example, query parameters and the latter consumes much more CPU than it needs to. To exclude certain file extensions you should use the exclude_exts config option. Tip: Using exclude rules can potentially speed up the checks. Since exclude rules are applied first, you could for example, exclude all top level domains you do not wish to crawl. Includes and sub collections - When setting up sub collections it is vital to keep in mind that they are subsets of the main collection. Therefore, any URIs for the sub collection must match the include/exclude rules of the main collection first, and then the relevant sub collection. URI Seed Files It is necessary to specify one or more Start URIs for any size crawl, but it is not necessarily true that a large crawl requires a long list of Start URIs. Because the web is heavily interconnected, with links from site to site, you can usually start at a single URI (preferably a page with many external URIs) and allow the crawler to gather links and add sites to the crawl from there. This works well if your goal is to crawl a top-level domain, such as the .no domain. Adding numerous seeds will do little to improve or focus your crawl. However, if you do not wish to crawl an entire top-level domain, but rather selected sites only, then a seed list is useful.You also need to either be restrictive in the sites crawled (using include/exclude rules), or disable the following of cross site URIs altogether. If you do neither, then use a small seed list. Restricting the Size of the Crawl There are several techniques for restricting the size of your crawl. When setting up a large scale crawl, you often have requirements based on the number of URIs you would like in your index or how much data you should handle. The crawler has several configuration options that can restrict the size of a crawl, taken alone or together: • • • • Limit number of documents per site Specify maximum crawl depth per site Set a maximum number of documents per collection Require minimum free disk space limit The max_docs option specifies the maximum number of documents to retrieve per site per refresh cycle. This is useful to limit the crawling of deep (or perhaps endlessly recursive) sites. Keep in mind that this counter is reset per refresh, so that over time the total number of documents might exceed this limit, though documents not seen for a number of refresh cycles will be recognized and deleted, as described in the dbswitch configuration option. 134 Configuring the Enterprise Crawler An another configuration setting that can be used in conjunction with the attribute described above (or on its own) is level crawling. By specifying a crawlmode depth limitation, you can ensure that the crawler only follows links a certain number of hops from the Start URI for each site. This avoids the deep internal portions of a site. 
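As an illustration, the two restrictions just described can be combined in a collection specification. The fragment below is a sketch only, using the same attribute syntax as the Large Scale XML Configuration Template later in this chapter (which spells the per-site limit max_doc and notes that DEPTH:n enables level crawling); the values shown are arbitrary examples.

<!-- Retrieve at most 5000 documents per site per refresh cycle -->
<attrib name="max_doc" type="integer"> 5000 </attrib>

<!-- Only follow links up to 3 hops from the start URI of each site -->
<section name="crawlmode">
  <attrib name="mode" type="string"> DEPTH:3 </attrib>
</section>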
The amount of time spent crawling and the aggressiveness of the crawl are major factors in determining the volume of fetched documents. The configuration options refresh, max_sites and delay also play a major role. The crawler will fetch at most refresh * 60 * ( max_sites / delay ) URIs during a single refresh cycle. For multi node crawls this figure is per master node. Together with a refresh_mode set to scratch or adaptive this limits the number of documents that will be indexed. Note: No other refresh_mode value should be used for large-scale crawls, due to potentially large disk usage requirements. Keep in mind that subsequent refresh cycles may not fetch the exact same links as before, due to various reasons including the fact that web pages (and their links) change, the structure of the web changes, and network latencies change. If scratch refresh mode is used then the index may fluctuate slightly in size. However, as URIs not fetched for some time are deleted it should be fairly stable once the configuration is set. With refresh mode set to adaptive it will use the existing set of URIs in the crawler store as the basis for re-crawling, but some new content will also be crawled. The limits options allow you to specify threshold limits for the total number of documents and for the free disk space. If exceeded, the crawler enters a "refreshing" crawl mode, so that it will only crawl URIs that have been previously fetched. For each limit, one also has to specify a slack value, indicating the lower resource limit that must be reached before the crawler returns to its normal crawl mode. Duplicate Server Tuning Two different storage formats are supported in EC 6.7, GigaBASE and hashlog. Hashlog is recommended for most installations. However, if you have a lot of small collections (less than 10M each) then using GigaBASE may also work very well. GigaBASE The original format is based on GigaBASE and consists of a set of striped databases. The main motivation behind the striping is two-fold; decrease the size of database files on disk and reduce depths of the B-trees within the databases. Reducing the database size through striping may have a limited effect on the B-tree depths once the databases grow too large. The following table can be used as a guideline for selecting an appropriate GigaBASE stripe size. Keep in mind that the document count number relates to the number of documents (or rather document checksums) stored on this particular duplicate server. A load balanced setup will thus have total_count/ds_instances documents in a single duplicate server. Dedicated duplicate server replicas are not included in the ds_instances count. Document Stripes 25M 2 50M 4 100M 8 In addition to stripe size you can also tune the GigaBASE cache size. The value given will be divided among the stripes such that a cache size of 1GB will consume approximately 1GB of memory. If cross-replication is used the memory usage will be twice the specified cache size. The rule of thumb when selecting a cache size is to use as much memory as you can afford, as long as the process does not exceed the maximum process size allowed by the operating system. 135 FAST Enterprise Crawler Hashlog The newer format (the default format) is called hashlog and combines a hash (either disk or memory based) with a log structure. The advantage of a hash compared to a B-tree structure is that lookups in a hash are O(1) whereas a B-tree has O(log n). This means that as the data structure grows larger the hash is much more suitable. 
A disk based hash is selected by specifying "diskhashlog" as the format. The initial size of the hash (in MB) can be specified by the cache size option. In this mode each read/write results in 1 seek. This is suitable for very large structures where it is not feasible to hold the entire hash in memory. To select a memory based hash, specify the maximum amount of memory to be used with the cache size option and use the format "hashlog". If the amount of data exceeds the capacity the hash will automatically rehash to disk. The following formula calculates approximately how many elements the memory hash will hold: capacity = memory / (12 * 1.3) This yields the following approximate table: Memory Reserved Documents 100M ~6.5M 500M ~33M 1.0GB ~68M 1.5GB ~100M In addition to the hash a structured log is also used. Reads from the log require a single seek (bringing us up to 2 seeks for disk hash and 1 for memory hash). However, due to its nature the log grows in size even when replacing old elements. To counter this, the duplicate server has a built-in nightly compaction where the most fragmented log file is compacted. During this time the duplicate server will be unavailable. It does not affect crawling performance, but will delay any processing in postprocess during that time. Disk Hash vs. Memory Hash Note: If sufficient memory is available, a memory hash will give significant performance advantages. Keep in mind that every collection that uses the duplicate server will allocate a memory hash of the same size (regardless of the size of the collection), so this affects the memory hash size that can be used. This makes it impractical to use a memory hash if the collections differ greatly in size. In this case the best solution is to setup multiple duplicate servers, with memory hash, on the same node, for instance one duplicate server for a large collection and one duplicate server for the small/medium collections. Furthermore, the summation of all the cache sizes should not exceed the total amount of available physical memory on a node. Note: When using a disk based hash larger than 10M it is generally recommended to turn off the automatic compaction feature in the duplicate server. Compaction of such disk based hashes can take many hours, and is best performed manually using the crawlerdbtool. This compaction can be turned off using the -N command line option passed to the duplicate server. Postprocess Tuning "PostProcess" (PP) performs duplicate checking for a single node and feeds data through document processing to the indexer. Please note that in a multi-node crawler environment, the "duplicate server" handles cross-node duplicate checking. 136 Configuring the Enterprise Crawler Tuning duplicate checking Duplicate checking within a single node requires a database which consists of all the unique checksums "owned" by the node. The checksums map to a set of URIs (the URI equivalence class) from which one URI is designated the owner URI and the rest duplicate URIs. Some additional meta information is also stored. The parameters listed below are available for this purpose, and are tunable per collection in the configuration. The options are specified in the "postprocess" section of the configuration, unless otherwise noted. Postprocess Parameter Description dupservers Must be set to a list of primary duplicate servers. max_dupes Determines the maximum number of duplicate URIs corresponding to each checksum. 
This setting has a severe performance impact and values above 3-4 are not recommended for large scale crawls. stripe Please refer to the hashlog/GigaBASE discussion in the Duplicate Server Tuning section. A stripe size of 4 is typical in most cases. Note that for GigaBASE storage the amount of memory used by caching is defined as the postprocess cache size multiplied by the number of stripes. ds_paused Allows you to pause feeding to document processing and indexing. Useful if you would like to crawl first and index later. Can also be controlled with the --suspendfeed and the resumefeed crawleradmin option, but the value in the configuration overrides it if you feed. ds_max_ecl Maximum number of URIs in the equivalence class that is sent to document processing for indexing. The value should be set to the same value as max_dupes. pp (in cache size section) Specifies the amount of memory allocated to the checksum database cache for the collection. For GigaBASE this is the database cache *per stripe*, and for Hashlog it is the memory hash size. The value should be high (for example, 512MB or more for a 25M node), but keep in mind that this setting is separate per collection and that they add up. Use the most memory on the large collections. Tuning postprocess feeding By default crawlerfs in postprocess is run using a single thread. In order to increase the throughput, it is possible to configure multiple crawlerfs processes and multiple DocumentRetriever processes in the ESP document processing pipeline. This may significantly speed up the processing. If you need to accomplish this task, please contact FAST support. Crawler/Master Tuning The following sections outline the various parameters that should be modified for large scale crawls. Storage and Storage Tuning The following storage section attributes should be tuned. The remaining storage parameters should be left at default values (for example, not included in XML at all). The crawler performs large amounts of network and file I/O, and the number of available file descriptors can be a limiting factor, and lead to errors. Insure that sufficient file descriptors are available by running the limit (or ulimit) command from the account under which the crawler runs. If the value is too low (below 2048), increase the hard limit for descriptors to 8096 (8K). Check the operating system administrator documentation for details on doing so; it may be sufficient to run the limit/ulimit command as superuser, or a system resource configuration file might need to be modified, perhaps requiring a system reboot. 137 FAST Enterprise Crawler Storage Parameter Description store_http_header Can be disabled if you know that it will not be needed in the processing pipeline (it is sent in the 'http_header' attribute). Disabling saves some disk space in the databases and may give a slight performance boost. remove_docs Enabling this option will delete documents from disk once they have been feed to document processing and indexing. Disabled by default. Note: Re-feeding the crawler store will not be possible with this option enabled. Therefore, this mode should only be enabled for stable and well-tested installations. Cache Size Tuning The crawler automatically tunes certain cache sizes based on what it perceives as the size of your crawl. The main factors are the number of active sites and the delay value. 
The following caches are automatically tuned, and they should not be included in your XML configuration (and if they do they must have blank values): • • • • • screened smcomm mucomm wqcache crosslinks Refer to Cache Size Parameters on page 101 for additional information including parameter defaults.The only cache parameter to be configured is the postprocess (pp) cache which was previously discussed in the Postprocess Tuning on page 136 section.You may also use a larger cache size for the routetab and aliases cache if you crawl a lot of sites, especially multiple node crawls. The pp, routetab and aliases caches are all GigaBASE caches specified in bytes. Log Tuning Less logging means improved performance. However, it also means that is becomes more difficult to debug issues. Note that only some of the logs have large impact on the performance. In order of resource consumption you should adhere to the following recommendation: • • • • • • • DNS log: Should always be enabled. Screened log: Must be disabled! Site log: Should always be enabled. Fetch log: Should always be enabled. Postprocess log: Should be disabled unless you use it. DSFeed log: Should always be enabled. Scheduler log: Should be disabled unless you use it. General Tuning fetch_timeout, robots_timeout, Default is 300 seconds, increase if you experience more timeouts than expected (could be caused by bandwidth shapers) login_timeout 138 use_http_1_1 Enable to take advantage of content compression and If-Modified-Since, both saving bandwidth. accept_compression Enables the remote HTTP server to compress the document if desired before sending. Few servers actually do this, but some do and it will save bandwidth. robots Always adhere to robots.txt when doing large scale crawls. Configuring the Enterprise Crawler refresh_mode A large scale crawl should always use 'scratch' (default) or 'adaptive'. If you need to crawl everything, then you should initially set the 'refresh' to a high value. Once you know the time required for an entire cycle, you can tune it. If it is not possible to crawl everything within your time limit, you need to reduce the 'refresh' and/or use the 'max_doc' option to reduce the number of documents to download from each site. Note: The option to automatically refresh when idle is not available for multi node crawler setups. max_sites Together with the delay option this controls the maximum theoretical crawl speed you will be able to obtain. For example, a max_sites of 2000 and a delay of 60 will give you 2000/60 = 33 docs/sec. Please note that this value is *per node* so with 10 crawler nodes this would translate into 20000 max_sites total and 333 docs/sec. In practice you seldom get that close to the theoretical speed, and it also depends greatly on there being enough sites to crawl at any one time. To monitor your crawler with regard to the actual rate use crawleradmin -c and look at the "Active sites" count. headers While this setting can be used to specify arbitrary HTTP headers, it is usually used for only the crawler 'User-Agent'. The user agent header must specify a "name" for the crawler as well as contact information, either in the form of a web page and/or e-mail address. For example "User-Agent: YourCompany Crawler (crawler at yourcompany dot com)" cut_off Should be adjusted depending on the type of documents to be crawled. For example, PDFs and Word documents tend to be larger than HTML documents for example. 
max_doc This setting can be important when tuning the size of the crawl, as it limits the number of documents retrieved per site.Typical values might be in the 2000-5000 range. Can also be specified for sub collections if and only if the sub collections are only defined using hostname rules and not any URI rules. check_meta_robots META robots tags should be adhered to when doing a web scale crawl. html_redir_is_redir/html_redir_thresh The 'html_redir_is_redir' option lets the crawler treat META refresh tags inside HTML documents as if they were true HTTP redirects. When enabled the document containing the META refresh will not itself be indexed. The 'html_redir_thresh' option specifies the number of seconds delay which are allowed for the tag to be considered a redirect. Anything less than this number is treated as a redirect, other values are just treated as a link (and the document itself is indexed also). dbswitch/dbswitch_delete The 'dbswitch' option specifies the number of refreshes a given URI is allowed to stay in the index without being seen again by the crawler. URIs that have not been seen for this amount of refreshes will either be deleted or added to the queue for re-crawl, depending on the 'dbswitch_delete' option. This option should never be less than 2 and preferably at least 3. For example a in a 30 day cycle crawl with a dbswitch of 3 any given URI may remain unseen for 3 cycles before being removed/scheduled. Keep in mind that if the crawler was stopped for 30 days the cycles would still progress. One common method of limiting the amount of dead links in the index is to use what is known as a dead links crawler. The idea is to use click-through tracking to actively re-crawl the search results clicked on by users. Not only will the crawler quickly discover pages that have disappeared, but freshness for frequently clicked pages are also improved. 139 FAST Enterprise Crawler wqfilter/smfilter/mufilter These options decide whether to use a Bloom filter to weed out duplicate URIs before queuing in the slave ('wqfilter') and sending URIs from the slave to the master ('smfilter'). The former is a yes/no option and the size of the filter is calculated based on the max_docs setting and a very low probability of false positives. For large scale crawls this should definitely be on to reduce the number of duplicates in the queue. The 'smfilter' is specified by a capacity value, typically 50000000 (50M). The filter is automatically purged whenever it reaches a certain saturation, so there should be a very low probability of false positives. The default is off (0), but the 50M size should definitely be used for large crawls. The 'mufilter' is a similar filter present in the master and ubermaster. It should be even larger, typically 500000000 (500M) for wide crawls to prevent overloading the ubermaster with links. max_reflinks Must be set to 0 (the default value) for large-scale crawls, to disable the crawler from storing a list of URIs that link to each document. Disabling this reduces memory and disk usage within the crawler. The equivalent functionality is implemented by the WebAnalyzer component of FAST ESP. max_pending This option limits the number of concurrent requests allowed to a single site. For a large scale crawl with 60 seconds delay it should probably be set to 1 or 2 (the only time you would have more than one request would be if the first took more than 60 seconds to complete). 
extract_links_from_dupes Since duplicates generally link to more duplicates this option should be turned off, whether the crawl is large or small. if_modified_since Controls whether to send 'If-Modified-Since' requests, significantly reducing the bandwidth use subsequent crawl cycles. Should always be on for wide crawls. use_cookies The cookie support is intended for small intranet crawls and should always be disabled for large scale crawls. If you require cookie support for certain sites it may be best to place them in a separate collection, rather than enabling this feature for the entire crawl. rewrite_rules Rewrite rules can be used to rewrite links parsed out of documents by applying regular expression and repeating captured text. This implies that all rewrite rules are attempted applied for every link, and it can therefore be very expensive in terms of CPU usage depending on the number of rules and their complexity. It is therefore advised to limit the amount of rewrite rules for large scale crawls. use_javascript and enable_flash For performance reasons it is highly recommended to disable JavaScript and flash crawling for large crawls. If you require JavaScript and/or flash support, you should only enable it for a limited set of sites. You need to put these sites into a separate collection. Note: Enabling any of these options also requires that one or more Browser Engines be configured. For more information, please refer to the FAST ESP Browser Engine Guide. domain_clustering In a web scale crawl it is possible to optimize the crawler to take advantage of locality in the web link structure. Sub domains on the same domain tend to link more internally than externally, just as a site would have mostly interlinks. The domain clustering option enables clustering of sites on the same domain (for example, *.example.net) on the same master node and the same storage cluster (and thus uberslave process). For web crawls this feature should always be enabled. Note: This option is automatically turned on for multi node crawls by the ubermaster 140 Configuring the Enterprise Crawler Maximum Number of Open Files If you plan to do a large scale crawl you should increase the maximum number of open files (default 1024). This change is done in etc/NodeConf.xml. For example, to change from: <resourcelimits> <limit name="core" soft="unlimited"/> <limit name="nofile" soft="1024"/> </resourcelimits> to <resourcelimits> <limit name="core" soft="unlimited"/> <limit name="nofile" soft="4096"/> </resourcelimits> Note that this will only set the soft limit. In order for this to work the system hard limit must also be set to the same value or higher. Large Scale XML Configuration Template The following shows a largescale.xml collection configuration template. largescale.xml crawler collection configuration template <?xml version="1.0"?> <CrawlerConfig> <!-- Template --> <DomainSpecification name="LARGESCALE"> <!-<!-<!-<!-- Crawler Identification Modify the following options to identify the collections and the crawler. Make sure you specify valid contact information. --> --> --> --> <attrib name="info" type="string"> Sample LARGESCALE crawler config </attrib> <!-- Extra HTTP Headers --> <attrib name="headers" type="list-string"> <member> User-agent: COMPANYNAME Crawler (email address / WWW address) </member> </attrib> <!-- General options <!-- The following options are general options tuned for large <!-- scale crawling. 
You generally do not need to modify these --> --> --> <!-- Adhere to robots.txt rules --> <attrib name="robots" type="boolean"> yes </attrib> <!-- Adhere to meta robots tags in html headers --> <attrib name="" type="boolean"> yes </attrib> <!-- Adhere to crawl delay specified in robots.txt --> <attrib name="obey_robots_delay" type="boolean"> no </attrib> <!-- Don't track referrer links, as this is done in the --> <!-- pipeline by the WebAnalyzer component --> <attrib name="max_reflinks" type="integer"> 0 </attrib> <!-- Only have one outstanding request per site at any one time --> <attrib name="max_pending" type="integer"> 1 </attrib> <!-- Keep hostnames of the same DNS domain within one slave --> <attrib name="domain_clustering" type="boolean"> yes </attrib> 141 FAST Enterprise Crawler <!-- Maximum time for the retrieval of a single document --> <attrib name="fetch_timeout" type="integer"> 300 </attrib> <!--- Support HTML redirects --> <attrib name="html_redir_is_redir" type="boolean"> yes </attrib> <!-- Anything with delay 3 and lower is treated as a redirect --> <attrib name="html_redir_thresh" type="integer"> 3 </attrib> <!-- Enable near duplicate detection --> <attrib name="near_duplicate_detection" type="boolean"> no </attrib> <!-- Only log retrievals, not postprocess activity --> <section name="log"> <attrib name="fetch" type="string"> text </attrib> <attrib name="postprocess" type="string"> none </attrib> </section> <!-- Do not extract and follow links from duplicates --> <attrib name="extract_links_from_dupes" type="boolean"> no </attrib> <!-- Do not store duplicates, use a block-type storage, and compress documents on disk --> <section name="storage"> <attrib name="store_dupes" type="boolean"> no </attrib> <attrib name="datastore" type="string"> bstore </attrib> <attrib name="compress" type="boolean"> yes </attrib> </section> <!-- Do not retry retrieval of documents for common errors --> <section name="http_errors"> <attrib name="4xx" type="string"> DELETE </attrib> <attrib name="5xx" type="string"> DELETE </attrib> <attrib name="ttl" type="string"> DELETE:3 </attrib> <attrib name="net" type="string"> DELETE:3, RETRY:1 </attrib> <attrib name="int" type="string"> KEEP </attrib> </section> <!-- Rate and size options <!-- The following options tune the crawler rate and refresh <!-- cycle settings. Modify as desired. --> --> --> <!-- Only retrieve one document per minute --> <attrib name="delay" type="real"> 60 </attrib> <!-- Length of crawl cycle is 10 days (expressed in minutes) --> <attrib name="refresh" type="real"> 14400 </attrib> <!-- Available refresh modes: scratch (default), adaptive, soft, --> <!-- append and prepend. 
--> <attrib name="refresh_mode" type="string"> scratch </attrib> <!-- Let three cycles pass before cleaning out URIs not found --> <attrib name="dbswitch" type="integer"> 3 </attrib> <!-- Crawl Mode --> <section name="crawlmode"> <!-- Crawl depth (use DEPTH:n to do level crawling) --> <attrib name="mode" type="string"> FULL </attrib> <!-- Follow interlinks --> <attrib name="fwdlinks" type="boolean"> yes </attrib> <!-- Reset crawl level when following interlinks --> <attrib name="reset_level" type="boolean"> no </attrib> </section> <!-- Let each master crawl this many sites simultaneously --> <attrib name="max_sites" type="integer"> 6144 </attrib> <!-- Maximum size of a document --> <attrib name="cut_off" type="integer"> 500000 </attrib> <!-- Maximum number of bytes to use in checksum (0 == disable) --> 142 Configuring the Enterprise Crawler <attrib name="csum_cut_off" type="integer"> 0 </attrib> <!-- Maximum number of documents to retrieve from one site --> <attrib name="max_doc" type="integer"> 5000 </attrib> <!-- Enable HTTP version 1.1 to enable accept_compression --> <attrib name="use_http_1_1" type="boolean"> yes </attrib> <!-- Accept compressed documents from web servers, you have more cpu than bandwidth --> <attrib name="accept_compression" type="boolean"> yes </attrib> <!-- Performance tuning options --> <!-- Sizes of various caches --> <section name="cachesize"> <!-- UberMaster and Master routing tables cache (in bytes) --> <attrib name="routetab" type="integer"> 4194304 </attrib> <!-- PostProcess checksum database (per stripe) cache (in bytes) --> <attrib name="pp" type="integer"> 268435456 </attrib> </section> <!-- Slave work queue bloom filter enabled --> <attrib name="wqfilter" type="boolean"> yes </attrib> <!-- Slave -> Master bloom filter with capacity 50M --> <attrib name="smfilter" type="integer"> 50000000 </attrib> <!-- Master/UberMaster bloom filter with capacity 500M --> <attrib name="mufilter" type="integer"> 500000000 </attrib> <!-- Adaptive Scheduling. To enable un comment this section and --> <!-- change 'refresh_mode' to 'adaptive' --> <section name="adaptive"> <!-- Number of "micro" refresh cycle within a full refresh --> <attrib name="refresh_count" type="integer"> 4 </attrib> <!-- Ratio (in percent) of rescheduled URIs vs. new (unseen) --> <!-- URIs scheduled. - -> <attrib name="refresh_quota" type="integer"> 98 </attrib> <!-- The maximum percentage of a site to reschedule during a --> <!-- "micro" refresh cycle. --> <attrib name="coverage_max_pct" type="integer"> 25 </attrib> <!-- The minimum number of URIs on a site to reschedule <!-- during a refresh cycle "micro" refresh cycle. <attrib name="coverage_min" type="integer"> 10 </attrib> --> --> <!-- Ranking weights. Each scoring criteria adds a score between --> <!-- 0.0 and 1.0 which is then multiplied with the associated --> <!-- weight below. Use a weight of 0 to disable a scorer --> <section name="weights"> <!-- Score based on the number of /'es (segments) in the --> <!-- URI. Max score with one, no score with 10 or more --> <attrib name="inverse_length" type="real"> 1.0 </attrib> <!-- Score based on the number of link "levels" down to --> <!-- this URI. Max score with none, no score with >= 10 --> <attrib name="inverse_depth" type="real"> 1.0 </attrib> <!-- Score added if URI is determined as a "landing page" --> <!-- defined as e.g. ending in "/" or "index.html". 
URIs --> <!-- with query parameters are not given score --> <attrib name="is_landing_page" type="real"> 1.0 </attrib> 143 FAST Enterprise Crawler <!-- Score added if URI points to a markup document as --> <!-- defined by the "uri_search_mime" option. Assumption --> <!-- being that such content changes more often than e.g. --> <!-- "static" Word or PDF documents. --> <attrib name="is_mime_markup" type="real"> 1.0 </attrib> <!-- Score based on change history tracked over time by --> <!-- using an estimator based on last modified date given --> <!-- by the web server. If no modified date returned then --> <!-- one is estimated (based on whether the document has --> <!-- changed or not). --> <attrib name="change_history" type="real"> 10.0 </attrib> </section> </section> <!-- PostProcess options <!-- Duplicate servers must be specified also. Feeding is <!-- initially suspended below, can be turned on if desired. --> --> --> <section name="pp"> <!-- Use 4 database stripes --> <attrib name="stripe" type="integer"> 4 </attrib> <!-- Only track up to four duplicates for any document --> <attrib name="max_dupes" type="integer"> 4 </attrib> <!-- The address of the duplicate server --> <attrib name="dupservers" type="list-string"> <member> HOSTNAME1:PORT </member> <member> HOSTNAME2:PORT </member> </attrib> <!-- report only bare minimum of meta info to ESP/FDS --> <attrib name="ds_meta_info" type="list-string"> <member> duplicates </member> <member> redirects </member> </attrib> <!-- Feeding to ESP/FDS suspended --> <attrib name="ds_paused" type="boolean"> yes </attrib> </section> <!-<!-<!-<!-<!-- Inclusion and exclusion The following section sets up what content to crawl and not to crawl. Do not use regular expression rules unless absolutely necessary as they have a significant impact on performance. 
--> --> --> --> --> <!-- Only crawl http (ie, don't crawl https/ftp --> <attrib name="allowed_schemes" type="list-string"> <member> http </member> </attrib> <!-- Allow these MIME types to be retrieved --> <attrib name="allowed_types" type="list-string"> <member> text/html </member> <member> text/plain </member> <member> text/asp </member> <member> text/x-server-parsed-html </member> </attrib> <!-- List of included domains (may be regexp,prefix,suffix, exact) --> <section name="include_domains"> 144 Configuring the Enterprise Crawler <attrib name="exact" type="list-string"> </attrib> <attrib name="prefix" type="list-string"> </attrib> <attrib name="suffix" type="list-string"> </attrib> </section> <!-- List of excluded domains (may be regexp,prefix,suffix, exact) --> <section name="exclude_domains"> <attrib name="exact" type="list-string"> </attrib> <attrib name="prefix" type="list-string"> </attrib> <attrib name="suffix" type="list-string"> </attrib> </section> <!-- List of excluded URIs (may be regexp,prefix,suffix, exact) --> <section name="exclude_uris"> <attrib name="exact" type="list-string"> </attrib> <attrib name="prefix" type="list-string"> </attrib> <attrib name="suffix" type="list-string"> </attrib> </section> <!-- List of included URIs (may be regexp,prefix,suffix, exact) --> <section name="include_uris"> <attrib name="exact" type="list-string"> </attrib> <attrib name="prefix" type="list-string"> </attrib> <attrib name="suffix" type="list-string"> </attrib> </section> <!-- Exclude these filename extensions --> <attrib name="exclude_exts" type="list-string"> <member> .jpg </member> <member> .jpeg </member> <member> .ico </member> <member> .tif </member> <member> .png </member> <member> .bmp </member> <member> .gif </member> <member> .avi </member> <member> .mpg </member> <member> .wmv </member> <member> .wma </member> <member> .ram </member> <member> .asx </member> <member> .asf </member> <member> .mp3 </member> <member> .wav </member> <member> .ogg </member> <member> .zip </member> <member> .gz </member> <member> .vmarc </member> <member> .z </member> <member> .tar </member> <member> .swf </member> <member> .exe </member> <member> .java </member> <member> .jar </member> <member> .prz </member> <member> .wrl </member> 145 FAST Enterprise Crawler <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> <member> </attrib> .midr </member> .css </member> .ps </member> .ttf </member> .xml </member> .mso </member> .rdf </member> .rss </member> .cab </member> .xsl </member> .rar </member> .wmf </member> .ace </member> .rar </member> <!-- List of start URIs --> <attrib name="start_uris" type="list-string"> <member> INSERT START URI HERE </member> </attrib> <!-- List of start URI files --> <attrib name="start_uri_files" type="list-string"> <member> INSERT START URI FILE HERE</member> </attrib> </DomainSpecification> </CrawlerConfig> 146 Chapter 4 Operating the Enterprise Crawler Topics: • • • • • • • Stopping, Suspending and Starting the Crawler Monitoring Backup and Restore Crawler Store Consistency Redistributing the Duplicate Server Database Exporting and Importing Collection Specific Crawler Configuration Fault-Tolerance and Recovery The crawler runs as an integrated component within FAST ESP, monitored by the node controller (nctrl) and started/stopped via the administrator interface or the nctrl command. 
FAST Enterprise Crawler Stopping, Suspending and Starting the Crawler Stopping, suspending and starting the crawler can be executed from the administrator interface or from the command line and differs depending on your environment. Starting in a Single Node Environment - administrator interface In a single node environment, to start the crawler from the administrator interface: 1. Select System Management on the navigation bar. 2. Locate the Enterprise Crawler on the Installed module list - Module name. Select the Start symbol to start the crawler. Starting in a Single Node Environment - command line Use the nctrl tool to start the crawler from the command line. Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information. Run the following command to start the crawler: 1. $FASTSEARCH/bin/nctrl start crawler Starting in a Multiple Node Environment - administrator interface In a multiple node environment, the ubermaster processes must be started up first, followed by individual crawler processes. In a multiple node environment, to start the crawler from the administrator interface: 1. Select System Management on the navigation bar. 2. Locate the ubermaster process, and select the Start symbol. 3. For all crawler processes, select the Start symbol. Starting in a Multiple Node Environment - command line In a multiple node environment, the ubermaster processes must be started up first, followed by individual crawler processes. Use the nctrl tool to start the crawler from the command line. Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information. To start the crawler from the command line: 1. On the node running the ubermaster process, run the command: $FASTSEARCH/bin/nctrl start crawler 2. On all nodes that run crawler processes (assuming the processes are named crawler), run the command: $FASTSEARCH/bin/nctrl start crawler Suspending/Stopping in a Single Node Environment - administrator interface The crawler is stopped (if running) and started when a configuration is updated. There is also a start/stop button for an existing crawler. In a single node environment, to stop the crawler from the administrator interface: 1. Select System Management on the navigation bar. 148 Operating the Enterprise Crawler 2. Locate the Enterprise Crawler on the Installed module list - Module name. Select the Stop symbol to stop the crawler. Suspending/Stopping in a Single Node Environment - command line Use the nctrl tool to stop the crawler from the command line. Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information. Run the following command to stop the crawler: 1. $FASTSEARCH/bin/nctrl stop crawler Suspending/stopping in a Multiple Node Environment - administrator interface In a multiple node environment, the individual crawler processes must be shut down first, followed by the ubermaster processes. In a multiple node environment, to stop the crawler from the administrator interface: 1. Select System Management on the navigation bar. 2. For all crawler processes, select the Stop symbol. 3. Locate the ubermaster process, and select the Stop symbol. The crawler will not stop completely before all outstanding content batches have been successfully submitted to FAST ESP and received by the indexer nodes. Monitor the crawler submit queue by waiting until the $FASTSEARCH/data/crawler/dsqueues folder (on the node running the crawler) is empty. 
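If you prefer to script the wait for the submit queue, the following small Python sketch (not part of the product) polls the default dsqueues location given above until it is empty; adjust the path if your data directory differs.

import os
import time

def wait_for_empty_dsqueues(poll_interval=30):
    # Default crawler submit queue location; adjust for your installation.
    path = os.path.join(os.environ["FASTSEARCH"], "data", "crawler", "dsqueues")
    while os.path.isdir(path) and os.listdir(path):
        print "%d entries left in %s, waiting..." % (len(os.listdir(path)), path)
        time.sleep(poll_interval)
    print "Submit queue is empty."

if __name__ == "__main__":
    wait_for_empty_dsqueues()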
Suspending/stopping in a Multiple Node Environment - command line In a multiple node environment, the individual crawler processes must be shut down first, followed by the ubermaster processes. Use the nctrl tool to stop the crawler from the command line. Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information. To stop the crawler from the command line: 1. On all nodes that run crawler processes (assuming the processes are named crawler), run the command: $FASTSEARCH/bin/nctrl stop crawler 2. On the node running the ubermaster process, run the command: $FASTSEARCH/bin/nctrl stop crawler The crawler will not stop completely before all outstanding content batches have been successfully submitted to FAST ESP and received by the indexer nodes. Monitor the crawler submit queue by waiting until the $FASTSEARCH/data/crawler/dsqueues folder (on the node running the crawler) is empty. Monitoring While the crawler is running, you can use the FAST ESP administrator interface or the crawleradmin tool to monitor and manage the crawler. Refer to the FAST ESP Configuration Guide for information about the administrator interface. Enterprise Crawler Statistics A detailed overview of statistics for each of the collections configured in the Enterprise Crawler is available in the FAST ESP administrator interface. 149 FAST Enterprise Crawler Navigating to the Data Sources tab will list all the available Enterprise Crawlers installed. For each Enterprise Crawler choosing List Collections will display all the collections associated with the particular Enterprise Crawler instance. For each collection there are a number of available options: Configuration - List the configured settings for the collection. Fetch log - See the last 5 minutes from the collection fetch log for the collection. Site log - See the last 5 minutes of the collection site log for the collection. Site statistics - View detailed statistics for a single web site in the collection. Input the web site you want to view detailed statistics for and choose Lookup. Note: Only web sites that have already been crawled can be viewed for statistics. Table 43: Site statistics for <web site> in collection <collection> Name Description Status The current crawl status for this web site. Possible values are: Crawling - The web site is currently being crawled. Idle - The web site is not being crawl at the moment. Document Store The number of documents in the crawler store for this web site. Statistics age The time since the last statistics update. Last URI The last URI crawled for this web site- Queue Length The current size of this web sites workqueue. For a description of the detailed statistics of a web site. See viewing detailed statistics about collection below. Statistiscs - View detailed statistics about collection. Table 44: Overall Collection Statistics Name Description Crawl status Displays the current crawl status of the crawler. Possible values are: Crawling, X sites active - The collection is crawling, X web sites are currently active. Idle - The collection is idle, no web sites are currently active. Suspended - The collection is suspended. Feed Status Possible values are: Feeding - The collection is currently feeding the content to ESP. Queueing - The collection is currently queueing content to disk and feeding to ESP is suspended. 150 Cycle Progress (%) The current collection refresh cycle progress. Calculated based on time until next refresh. 
Time until refresh The time until next refresh for this collection. Stored Documents The number of documents the crawler has stored. Unique Documents The number of unique documents the crawler has stored. Document Rate The current rate at which documents are downloaded. In Bandwidth The current inbound bandwidth the crawler is utilizing. Statistics Updated The time since the last statistics update. The Status for all collections link will display a summary of all the collections and some of their most interesting statistics. The Detailed Statistics link will display detailed statistics for the previous and the current crawl cycle, as well as the total for all crawl cycles. Table 45: Detailed Collection Statistics Processing Status Description Processed The number of documents requested by the crawler. Downloaded The number of documents downloaded by the crawler. Stored The number of documents stored by the crawler. Modified The number of stored documents that were modified. Unchanged The number of documents that were unchanged. Deleted The number of documents that were deleted by the crawler. Postprocess statistics Description ADD operations The number of ADD operations sent to ESP. DEL operations The number of DEL(ete) operations sent to ESP. MOD operations The number of MOD(ified) operations sent to ESP. Note: MODs are in reality sent as ADDs. URLSChange operations The number of URLSChange operations sent to ESP. URLSChanges contain updates to the URI equivalence class. Total Operations The total number of operations overall. Successful operations The number of successful operations overall. Failed operations The number of failed operations overall. Operation rate The rate, in operations per second, at which operations are sent to ESP. Network Description Document Rate The rate, in documents per second, at which documents are downloaded. In Bandwidth The current inbound bandwidth the crawler is utilizing. Out Bandwidth The current outbound bandwidth the crawler is utilizing. Downloaded bytes The total number of bytes the crawler has downloaded. Sent bytes The total number of bytes the crawler has sent. Average Document Size The average document size of the documents the crawler has downloaded. Max Document Size The maximum document size of the documents the crawler has downloaded. Download Time The accumulated time used to download documents. Average Download Time The average time to download a document. Maximum Download Time The maximum time to download a document. Mime Types Description <type>"/"<subtype> A breakdown of the various MIME types of the documents downloaded by the crawler. URIs Skipped Description NoFollow URIs skipped due to the link tag having a rel="NoFollow" attribute. Scheme URIs skipped due to not matching the collection Allowed Schemes setting. Robots URIs skipped due to being excluded by robots.txt. Domain URIs skipped due to not matching the collection domain include/exclude filters. URI URIs skipped due to not matching the collection URI include/exclude filters. Out of Focus URIs skipped due to being out of focus from the collection Focus crawl settings. Depth URIs skipped due to being outside the collection Crawl Mode depth settings. M/U Cache URIs skipped due to being screened by internal caches. Documents Skipped Description MIME Type Document skipped due to not matching the collection MIME-Types setting.
Header Exclude Document skipped due to matching the collection Header Excludes setting. Too Large Document skipped due to exceeding the collection Maximum Document Size setting. NoIndex RSS Document skipped due to the collection RSS setting Index RSS documents?. HTTP Header Document skipped due to errors with the HTTP header. Encoding Document skipped due to problems with the document encoding. Typically problems with compressed content. Chunk Error Document skipped due to problems with chunked encoding. Failure to de-chunk content. Incomplete Document skipped due to being incomplete. The web server did not return the complete document as indicated by the HTTP header. No 30x Target Document skipped due to not having a redirect target. Connect Error Document skipped due to a failure to connect() to the remote web server. Connect Timeout Document skipped because the connect() to the remote web server timed out. Timeout Document skipped because it took longer to download than the Fetch Timeout setting allows. Network Error Document skipped due to various network errors. NoIndex Document skipped due to containing a META robots No Index tag. Checksum Cache Document skipped due to being screened by the run-time checksum cache used for duplicate detection. Other Error Document skipped due to other reasons. Document Plugin Document skipped by the user specified document plugin. Empty Document Document skipped due to being 0 bytes. Protocol Response Codes Description <Response Code> <Response Info> A breakdown of the various protocol response codes received by the crawler. DNS Statistics (global) Description DNSRequests Number of issued DNS requests. DNSResponses Number of received DNS responses. DNSRetries Number of issued DNS request retries. DNSTimeout Number of issued DNS requests that timed out. DNS Statistics (global) Description <Response code> A breakdown of the DNS response codes received by the crawler. Possible responses are: NOERROR - The DNS server returned no error (hostname resolved). NXDOMAIN - The domain name referenced in the query does not exist (hostname did not resolve). FORMERR - The DNS server was unable to interpret the query. SERVFAIL - The DNS server was unable to process this query due to a problem on the server side. NOTIMP - The DNS server does not support the requested query. REFUSED - The DNS server refused to perform the specified operation. NOANSWER - The DNS record received from the DNS server did not contain an ANSWER section. PARTANSWER - The DNS record received from the DNS server contained only a partial ANSWER section. TIMEOUT - The DNS request timed out. UNKNOWN - An unknown DNS reply packet was received. Backup and Restore Crawler configuration is primarily concerned with collection specific settings. Backup of the crawler configuration will ensure that the crawler can be reconstructed to a state with an identical setup, but without knowledge of any documents. The crawler configuration is located in: $FASTSEARCH/data/crawler/config/config.hashdb To back up the configuration, stop the crawler, then save this file. It is also possible to export/import collection specific crawler configuration in XML format using the crawleradmin tool. This is not necessary for pure backup needs (as the config.hashdb file includes all the collection specific information, including statistics on previously gathered pages).
However, if a collection is to be completely recreated from scratch, having been deleted both from the crawler and the search engine, the XML-formatted settings should be used to recreate the collection, rather than using a restored crawler configuration database. Refer to the FAST ESP Operations Guide for overall system backup and restore information. Restore Crawler Without Restoring Documents To restore a node to the backed up configuration without restoring the documents: 1. Install the node according to the overall procedures in the Installing Nodes from an Install Profile chapter in the FAST ESP Operations Guide. 2. Restart the crawler. 3. Reload the backed up XML configuration file: $FASTSEARCH/bin/crawleradmin -f <configuration filename> Full Backup of Crawler Configuration and Data Backing up the crawler configuration only ensures that all information about individual collections and the setup of the crawler itself can be restored, but will trigger the sometimes unacceptable overhead of having to crawl and reprocess all documents over again. To be able to recover without this overhead, a full backup of the crawler is needed. To perform a full backup: 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler 2. Back up the complete directory on all nodes involved in crawling: $FASTSEARCH/data/crawler Note: Be sure to back up the duplicate server; this process often runs on a separate node from the crawler. 3. If keeping log files is desired, back up the log file directory. This is for reference only, and is not needed for the system backup. $FASTSEARCH/var/log/crawler Full restore of Crawler Configuration and Data To perform a full restore: 1. Install the node according to the overall procedures in the Installing Nodes from an Install Profile chapter in the FAST ESP Operations Guide. 2. Make sure the crawler is not running, then restore the backed up directory on each node to be restored: $FASTSEARCH/data/crawler 3. Start the crawler. The crawler will start re-crawling from the point where it was backed up, and according to the restored configuration. Re-processing Crawler Data Using postprocess This topic describes how to re-process the crawler data of one or several collections into the document processing pipeline without starting a re-crawl. The crawler stores the crawl data in meta storage and document storage. The metadata consists of a set of databases mapping URIs to their associated metadata (such as crawl time, MIME type, checksum, document storage location, and so forth). The crawler uses a pool of database clusters (usually 8) which in turn consist of a set of meta databases (one site database and multiple URI segment databases). Reprocessing the contents of collections involves the following process. Note that this is a somewhat simplified description of the actual inner workings of postprocess. Steps 1 and 2 run in parallel, and as step 1 is usually significantly faster, it also completes before step 2. 1. Meta databases are traversed site by site, extracting the URIs with associated document data on disk. Each site, along with the number of URIs stored in the meta database, is logged as traversal of that site commences. The number of URIs may therefore include duplicates. For each URI that is extracted, duplicate detection is performed by a lookup for the associated checksum in the postprocess database.
If the checksum does not exist (new/changed document) or the checksum is associated with the current URI, then the document is accepted as a unique document; otherwise it is treated as a duplicate. 2. Unique (non-duplicate) documents are queued for submission to document processing. 3. All documents due for submission are placed in a queue in $FASTSEARCH/data/crawler/dsqueues. Postprocess now serves the document processors from this queue, and terminates once all documents have been consumed. The duration of this phase is dictated by the capacity of the document processing subsystem. If you have a configuration scenario with the crawler on a separate node you may experience a recovery situation where index data has been lost (when running a single-row indexer). If the crawler node is fully operative, it is recommended to perform a full re-processing of crawled documents. This can be time-consuming, but may be the only way to ensure full recovery of documents submitted after the last backup. Single node crawler re-processing 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler 2. Delete the content of $FASTSEARCH/data/crawler/dsqueues. Only perform this step if you are re-feeding all your collections, otherwise you may lose content scheduled for submission for other collections. 3. Run the postprocess program in manual mode, using the -R (refeed) option: To do this: Use this command: Re-process a single collection $FASTSEARCH/bin/postprocess -R <collectionname> Re-process all collections (use an asterisk (*) with quotes instead of a <collectionname>) $FASTSEARCH/bin/postprocess -R "*" Re-process a single site by using the -r <site> option $FASTSEARCH/bin/postprocess -R <collectionname> -r <site> On UNIX make sure you either run postprocess in a screen, or use the nohup command, to ensure postprocess runs to completion. It is also considered good practice to redirect stdout and stderr to a log file. 4. Allow postprocess to finish running until all content has been queued and submitted. When postprocess has completed the processing, the following message is displayed: Waiting for ESP to process remaining data...Hit CTRL+C to abort If you want to start crawling immediately, then it is safe to shut down postprocess, since it has identified and enqueued all documents due for processing, as long as the crawler is later restarted so that the processing of the remaining documents can be serviced. The remaining documents will eventually be processed before any newly crawled data is processed. Otherwise, postprocess will eventually shut itself down when all documents have been processed. Press Ctrl+C, or send the process SIGINT if it is running in the background. Alternatively let postprocess run to completion and it will exit by itself. 5. Start the crawler: $FASTSEARCH/bin/nctrl start crawler. Multiple node crawler re-processing Re-processing the crawl data on a multiple node crawler is similar to the single node scenario, except that a multiple node crawl will include one or more duplicate servers. These must be running when executing postprocess. For more information on multiple node crawler setup, contact FAST Solution Services. Forced Re-crawling The procedure in section Re-processing Crawler Data Using postprocess on page 154 assumes that the crawler database is correct. This implies that it will only re-process already crawled documents to FAST ESP.
In case of a single-node failure or a crawler node failure, the last documents fetched by the crawler (after the last backup) will be lost. In this case you must instead perform a full re-crawl of all the collections. This will then re-fetch the remaining documents, assuming they are retrievable. In some cases documents may have disappeared from web servers, but still be present in the index. If this is the case these documents will have to be manually removed from the index. Use the following command to force a full re-crawl of a given collection: 1. crawleradmin -F <collection> 2. Repeat the command for each collection in the system. This will then ensure that all documents crawled after the last backup will be re-fetched. Note that the re-crawl may take a considerable amount of time to finish, but the index will be fully operative in the meantime. Purging Excluded URIs from the Index Normally postprocess will not check the validity of the URIs it processes, as this has already been done by the crawler during crawling. However, there are times when the include/exclude rules are altered and it is necessary to remove content that is no longer allowed (but was previously allowed) by the configuration. This can be accomplished by using the (uppercase) -X option. It will cause postprocess to traverse the meta databases as usual, but rather than processing the contents it will delete the contents that no longer match the configuration include and exclude rules. The contents that match are simply ignored, unless the (lowercase) -x option is also specified, in which case this content will be re-processed at the same time. Use the following command to remove excluded content from both the index and crawler store: 1. $FASTSEARCH/bin/postprocess -R <collectionname> -X The -X and -x options as described above assume that the crawler has already been updated with the new include/exclude rules. If you have the configuration in XML format, but have not yet uploaded the configuration, you can use the -u <XML config> option to tell postprocess to update the rules directly from the XML file (and store them in the crawler's persistent configuration database). Finally, the option -b instructs postprocess to re-check each URI against the robots.txt file for the corresponding server. The check uses the currently stored robots.txt file, rather than downloading a new one. The behavior for a URI that is no longer allowed for crawling by the robots.txt file is the same as if it had been excluded by the configuration. Aborting and Resuming of a Re-process To pause/stop postprocess while it is re-processing and then to resume postprocess, you can use one of the following procedures. Aborting and Resuming of a Re-process - scenario 1 1. Stop postprocess after it has traversed all meta databases. When postprocess has completed the processing, the following message is displayed: Waiting for ESP to process remaining data...Hit CTRL+C to abort This log message indicates that the traversing of the meta databases has finished, and the only remaining task is to submit all the queued data to FAST ESP, and wait for it to finish processing callbacks. 2. Press Ctrl+C or send SIGINT. 3. To resume the postprocess refeed, use: $FASTSEARCH/bin/postprocess -R <collectionname> -f Aborting and Resuming of a Re-process - scenario 2 1. Stop postprocess after it has traversed all meta databases.
When postprocess has completed the processing, the following message is displayed: Waiting for ESP to process remaining data...Hit CTRL+C to abort If this message is not displayed in the log then postprocess has not finished traversal, and is still logging the following message: "Processing site: <site> (16 URIs)" To resume postprocessing after stopping postprocess in this condition, you must use the -r <site> (resume after <site>) option in combination with the -R <collections> option. To determine which site to resume from, inspect the postprocess logs and find the site which was logged before the last Processing site log entry. For example, if the last two Processing site messages in the postprocess log are: Processing site: SiteA (X URIs) Processing site: SiteB (Y URIs) Start postprocess with -r SiteA to make sure that it will traverse the remaining sites. Since the log message is output before the site is traversed, this will ensure that SiteB is completely traversed. Crawler Store Consistency A consistency tool is included with the crawler to verify, and if necessary repair, consistency issues in the crawler store. The following sections describe how to use the consistency tool. Verifying Docstore and Metastore Consistency The following steps will first verify (and if necessary repair) the consistency between the document store and metadata store, and then perform the same verification between the verified metadata store and the postprocess checksum database. In case the tool removes documents we also ask it to keep the statistics in sync. The logs will be placed in a directory named after today's date under $FASTSEARCH/var/log/crawler/consistency. Make sure this directory exists before running the tool. In a multi node crawler setup these steps should be performed on all master nodes. Additionally you may also specify that the tool should verify the correct routing of sites to masters (details below). 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler. Verify that all crawler processes have exited before proceeding to the next step. 2. Run the command: $FASTSEARCH/bin/postprocess -R mytestcol -f . This will ensure there are no remaining documents to be fed to the indexer.
[2007-09-12 11:58:39] INFO systemmsg Feeding existing dsqueues only..
[2007-09-12 11:58:39] INFO systemmsg Waiting for ESP to process remaining data... Hit CTRL+C to abort
[2007-09-12 11:58:39] INFO systemmsg PostProcess Refeed exiting
3. (Optional) Copy the routing table from the ubermaster node. Note: This step only applies to multi node crawlers and is only necessary if you wish to verify the correct master routing of all sites. The routing table database can be found as $FASTSEARCH/data/crawler/config_um/mytestcoll/routetab.hashdb on the ubermaster and should overwrite $FASTSEARCH/data/crawler/config/mytestcoll/routetab.hashdb on the master nodes. 4. Create the directory $FASTSEARCH/var/log/crawler/consistency. The tool will create a sub directory inside this directory, in this example 20070912, where it will place the output logs in a separate directory per collection checked. The path to the output directory is logged by the tool on startup. 5. Run the command: $FASTSEARCH/bin/crawlerconsistency -C mytestcoll -M doccheck,metacheck,updatestat -O $FASTSEARCH/var/log/crawler/consistency Note: Ensure that you do not insert any space between the modes listed in the -M option. If the tool is being run on a master in a multi node crawler you may also add the routecheck mode.
6. Examine the output and log files generated.
[2007-09-11 16:23:10] INFO systemmsg Started EC Consistency Checker 6.7 (PID: 21542)
[2007-09-11 16:23:10] INFO systemmsg Copyright (c) 2008 FAST, A Microsoft(R) Subsidiary
[2007-09-11 16:23:10] INFO systemmsg Data directory: $FASTSEARCH/data/crawler
[2007-09-11 16:23:10] INFO systemmsg 1 collections specified
[2007-09-11 16:23:10] INFO systemmsg Mode(s): doccheck, metacheck, updatestat
[2007-09-11 16:23:10] INFO systemmsg Output directory: $FASTSEARCH/var/log/consistency/20070912
[2007-09-11 16:23:10] INFO mytestcoll Going to work on collection mytestcoll..
[2007-09-11 16:23:12] INFO mytestcoll Completed docstore check of collection mytestcoll in 1.6 seconds
[2007-09-11 16:23:12] INFO mytestcoll ## Processed sites : 5 (2.50 per second)
[2007-09-11 16:23:12] INFO mytestcoll ## Processed URIs : 5119 (2559.50 per second)
[2007-09-11 16:23:12] INFO mytestcoll ## OK URIs : 5119
[2007-09-11 16:23:12] INFO mytestcoll ## Deleted URIs : 0
[2007-09-11 16:23:12] INFO mytestcoll Document count in statistics left unchanged
[2007-09-11 16:23:12] INFO mytestcoll Processing 5119 checksums (all clusters)..
[2007-09-11 16:23:14] INFO mytestcoll Completed metastore check of collection mytestcoll in 1.8 seconds
[2007-09-11 16:23:14] INFO mytestcoll ## Processed csums : 5119 (2559.50 per second)
[2007-09-11 16:23:14] INFO mytestcoll ## OK csums : 5119
[2007-09-11 16:23:14] INFO mytestcoll ## Deleted csums : 0
[2007-09-11 16:23:14] INFO mytestcoll Finished work on collection mytestcoll
[2007-09-11 16:23:14] INFO systemmsg Done
In the example output above all URIs and checksums were found to be OK. If this was not the case then a mytestcol_deleted.txt file will contain the URIs deleted. Additionally, if a mytestcol_refeed.txt file was generated then the URIs listed there should be re-fed using postprocess (next step). 7. Run the command: $FASTSEARCH/bin/postprocess -R mytestcol -i <path to mytestcol_refeed.txt> Note: This step is only required in order to update the URI equivalence class of the listed URIs. 8. Start the crawler: $FASTSEARCH/bin/nctrl start crawler. Rebuilding the Duplicate Server Database This section explains the steps necessary to rebuild the duplicate server database, based on the contents of the postprocess database present on each master. It only applies to multi node crawlers, as single node crawlers do not require a duplicate server. Prior to performing this task it is recommended to first run the consistency tool as outlined in the previous section to ensure each node is in a consistent state. As part of this operation a set of log files will be generated and placed in a directory named after today's date under $FASTSEARCH/var/log/crawler/consistency. Make sure this directory exists before running the tool. To successfully rebuild the duplicate server databases it is vital that these steps be run on all master nodes. The crawler must not be restarted until all nodes have successfully completed the execution of the tool. 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler. Verify that all crawler processes have exited before proceeding to the next step. Perform this step on each master before proceeding to the next step. 2. Run the command: $FASTSEARCH/bin/postprocess -R mytestcol -f . This will ensure there are no remaining documents to be fed to the indexer. Perform this step on each master before proceeding to the next step. 3. Stop the duplicate server processes.
Wait until the processes have completed shutting down before moving to the next step. Depending on your configuration this may take several minutes. 4. Delete the per-collection duplicate server databases. These databases are usually located under $FASTSEARCH/data/crawler/ppdup/<collection> and should be deleted prior to running this tool to ensure there will not be "orphan" checksums recorded in the database. 5. Start the duplicate server processes. 6. Create the directory $FASTSEARCH/var/log/crawler/consistency. The tool will create a sub directory inside this directory, in this example 20070912, where it will place the output logs in a separate directory per collection rebuilt. The path to the output directory is logged by the tool on startup. 7. Run the command: $FASTSEARCH/bin/crawlerconsistency -C mytestcoll -M ppduprebuild -O $FASTSEARCH/var/log/crawler/consistency Note: This command will usually take several hours to complete. Progress information is logged regularly, but only applies per collection. Hence if you are processing multiple collections the subsequent collections are not accounted for in the reported ETA. 8. Examine the output and log files generated.
[2007-09-11 09:17:12] INFO systemmsg Started EC Consistency Checker 6.7 (PID: 18622)
[2007-09-11 09:17:12] INFO systemmsg Copyright (c) 2008 FAST, A Microsoft(R) Subsidiary
[2007-09-11 09:17:12] INFO systemmsg Data directory: $FASTSEARCH/data/crawler/
[2007-09-11 09:17:12] INFO systemmsg No collections specified, defaulting to all collections (2 found)
[2007-09-11 09:17:12] INFO systemmsg Mode(s): ppduprebuild
[2007-09-11 09:17:12] INFO systemmsg Connected to Duplicate Server at dupserver01:11100
[2007-09-11 09:17:13] INFO systemmsg Output directory: $FASTSEARCH/var/log/consistency/20070912
[2007-09-11 09:17:13] INFO mytestcoll Going to work on collection mytestcoll..
[2007-09-11 09:17:13] INFO mytestcoll Processing 5299429 checksums (all clusters)..
[2007-09-11 09:17:14] INFO systemmsg Received config ACK -> connection state OK
.....
[2007-09-11 12:01:21] INFO mytestcoll Duplicate Server rebuild status for mytestcoll:
[2007-09-11 12:01:21] INFO mytestcoll ## Processed csums : 5299429 (477.98 per second)
[2007-09-11 12:01:21] INFO mytestcoll ## OK csums : 5299429
[2007-09-11 12:01:21] INFO mytestcoll ## Deleted csums : 0
[2007-09-11 12:01:21] INFO mytestcoll ## Misrouted csums : 0
9. Start the crawler: $FASTSEARCH/bin/nctrl start crawler Redistributing the Duplicate Server Database This section explains the steps necessary to change the number of duplicate servers used by a collection in a multi node crawler setup. It only applies to multi node crawlers, as single node crawlers do not require a duplicate server. Prior to performing this task it is recommended to run a consistency check on the crawler store. Refer to Verifying Docstore and Metastore Consistency on page 157 for more information. 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler. Perform this step on each master as well as the ubermaster before proceeding to the next step. Note: Verify that all crawler processes have exited before proceeding to the next step. 2. Run the following command on each master node: $FASTSEARCH/bin/postprocess -R mytestcol -f . This will ensure there are no remaining documents to be fed to the indexer. 3. Stop the duplicate server processes. Wait until the processes have completed shutting down before moving to the next step. Note: Depending on your configuration this may take several minutes.
4. Delete the per-collection duplicate server databases. These databases are usually located under $FASTSEARCH/data/crawler/ppdup/<collection> and should be deleted prior to running this tool to ensure there will not be "orphan" checksums recorded in the database. 5. Create a partial XML configuration in order to specify a new set of duplicate servers. The following is an example XML configuration file for three duplicate servers. You should also specify appropriate duplicate server settings for the collection at this time.
<?xml version="1.0"?>
<CrawlerConfig>
  <DomainSpecification name="mytestcoll">
    <section name="pp">
      <attrib name="dupservers" type="list-string">
        <member> dupserver1:14200 </member>
        <member> dupserver2:14200 </member>
        <member> dupserver3:14200 </member>
      </attrib>
    </section>
    <section name="ppdup">
      <attrib name="format" type="string"> hashlog </attrib>
      <attrib name="stripes" type="integer"> 1 </attrib>
      <attrib name="cachesize" type="integer"> 512 </attrib>
      <attrib name="compact" type="boolean"> yes </attrib>
    </section>
  </DomainSpecification>
</CrawlerConfig>
Note: Make sure the collection name in the XML file matches the name of the collection you wish to update. 6. Update the configuration on the ubermaster node with the command $FASTSEARCH/bin/crawleradmin -f <path to XML file> -o $FASTSEARCH/data/crawler/config_um --forceoptions=dupservers. The --forceoptions argument allows the command to override the dupservers option, which is normally not changeable. 7. Update the configuration on each master node with the command $FASTSEARCH/bin/crawleradmin -f <path to XML file> -o $FASTSEARCH/data/crawler/config --forceoptions=dupservers 8. Start the duplicate server processes. 9. Rebuild the duplicate server. Refer to Rebuilding the Duplicate Server Database on page 159 for more information. Exporting and Importing Collection Specific Crawler Configuration The basic collection data is backed up (exported) and restored (imported) using the procedure described in the System Configuration Backup and Recovery section in the Operations Guide. This, however, does not include the data source configuration for the crawler. The crawler configuration can be set and read from the administrator interface, but it is also possible to export/import the crawler setup of a particular collection to XML format using the crawleradmin tool. If you intend to create a new collection using an exported crawler configuration, note the following: • The collection must be created prior to importing the crawler configuration. Create the collection in the normal way using the administrator interface but do not select a crawler, as this will import a default configuration from the administrator interface into the crawler. The effect of this is that some options in the XML configuration will not be able to take effect. Specifically: 1. Create the collection in the administrator interface but do not select a Data Source. 2. Import the XML configuration into the crawler using crawleradmin. 3. Edit the collection in the administrator interface to select the crawler. Select Edit Data Sources and add the crawler. Click OK on the Edit Collection screen and again on the Collections Details screen. Refer to the FAST ESP Configuration Guide, Basic Setup chapter for additional details on how to create a collection and integrate the crawler through the FAST ESP administrator interface. • Use the same name for the collection in the new system as in the old system.
The collection name is given by the content of the exported crawler configuration file. The collection name is also implicitly used within the configuration file, for example, related to folder names within the crawler folder structure. Note: If you want to use a different name for the new collection, you must change all references to the collection name within the exported XML file prior to importing it using crawleradmin -f. • In a multiple node crawler, the crawleradmin tool must always be run on the main crawler node (the node running the ubermaster process) to ensure that all nodes are updated. Fault-Tolerance and Recovery To increase fault-tolerance, the crawler may be configured to replicate the state of various components. The following sections describe how state is replicated in the different components, and how state may be recovered should an error occur. Ubermaster The ubermaster will incrementally replicate the information in its routing tables (the mapping of sites to masters) for a specific collection to all crawler nodes associated with that collection. If an ubermaster database is lost or becomes corrupted, the databases will be reconstructed automatically upon restarting the ubermaster. If the ubermaster enters a recovery mode it will query crawler nodes in that collection for their routing tables, which they will send back in batches. While in recovery mode, the ubermaster will accept URIs from crawler nodes, but will not distribute new sites to crawler nodes until recovery is complete for that collection. Duplicate server A duplicate server may be configured to replicate the state of another duplicate server. By starting it with the -R option, a duplicate server is configured to incrementally replicate its state: $FASTSEARCH/bin/ppdup -p <port> -R <host:port> -I <my_ID> where <host:port> specifies the address of the target duplicate server and <my_ID> specifies a symbolic identifier for the duplicate server. The target duplicate server will store replicated state in its working directory under a directory with the name <my_ID>. Conversely, a duplicate server is configured to replicate the state of another duplicate server by starting with the -r option: $FASTSEARCH/bin/ppdup -p <port> -r <port> When replication is activated, communication between a postprocess process and a duplicate server has transactional semantics. The duplicate server(s) performing replication on behalf of other duplicate servers may be used actively by postprocesses. The state of a duplicate server may be reconstructed by manually copying replicated state from the appropriate directory on the target duplicate server. A sketch of a primary/replica pair invocation is shown after the Crawler Node description below. Crawler Node There is no support for replicating the state stored on a crawler node. However, the crawler node state, if lost, will eventually be reconstructed by re-crawling the set of sites assigned to the node. In the course of crawling, the ubermaster will route URIs to the crawler, and from this the crawler node will gradually reconstruct its state with respect to crawled documents (assuming all documents are still available on the web servers). The postprocess databases on a crawler node will similarly recover over time, as each processed document will be checked against the duplicate server(s) in the installation. This will permit the URI checksum tables to be rebuilt, but it may not result in the same set of URI equivalences (duplicates) as had been previously indexed, leading to some unnecessary updates being sent to the search engine.
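To make the replication options above concrete, the following is a minimal sketch; the host names, ports, and identifier are illustrative assumptions, not values from this guide. One duplicate server runs as a replica listening for replication requests, and the primary points at it with -R while continuing to serve postprocess traffic on its own port.

# on the replica host (dupserver2), listen for incoming replication requests on port 14300
$FASTSEARCH/bin/ppdup -p 14200 -r 14300
# on the primary host (dupserver1), serve postprocess on 14200 and replicate state to dupserver2
$FASTSEARCH/bin/ppdup -p 14200 -R dupserver2:14300 -I dup1

With this pairing, the replica stores the primary's state under a directory named dup1 in its working directory, which is the directory you would copy back if the primary's state ever needs to be reconstructed.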
Chapter 5 Troubleshooting the Enterprise Crawler Topics: • Troubleshooting the Crawler This chapter describes how to troubleshoot problems you may encounter when using the crawler. Troubleshooting the Crawler This topic describes how to troubleshoot problems you may encounter when using the crawler. General Guidelines • Inspect logs The crawler logs a wide range of useful information, and these logs should always be inspected whenever a perceived error or misconfiguration occurs. These include the crawler log, which logs overall crawler status messages and exceptional conditions; the fetch log, which logs all attempted retrievals of documents; the screened log, which logs all documents for which retrieval is not attempted; the postprocess log, which logs the status of data feeding to FAST ESP; and the site and header logs. By default, all of these logs except the screened log are enabled. • Raise log level The level of detail in the crawler log is governed by the -l <level> option in the crawleradmin tool. Restarting the crawler with a given parameter propagates this setting to all components. • Inspect traffic trace of crawler network activity This can either be done by using a network packet trace utility such as ethereal or tcpdump on the crawler node, or by crawling through a proxy and inspecting the traffic passing through it. Both of these have shortcomings when encrypted transport, such as HTTPS, is used. • Inspect browser traffic If a particular behavior is expected from the crawler, a trace as suggested above can be examined alongside one generated by a web browser. For web browsers, client side debugging can be used to bypass the encryption for HTTPS. An example of such a utility is LiveHTTPHeaders (http://livehttpheaders.mozdev.org/). This is particularly useful when debugging cookie authentication schemes. Additional Information Reporting Issues When reporting operational issues, the following information is critical in order to get a fast and complete support response: • Crawler version Which version of the crawler you are running and, if applicable, which hotfixes to the crawler have been applied. • Platform Which operating system/platform you are running on. • FAST ESP version Which version of FAST ESP you are running the crawler against. • Crawler configuration All applicable <collection> configurations in XML format, as output by crawleradmin -G <collection> • Crawler log files All applicable <collection> crawler log files (fetch, pp, header, screened, dsfeed, site - as a minimum, fetch and dsfeed). These files are located in $FASTSEARCH/var/log/crawler/. • Crawler log files All available crawler.log* and any associated .scrap files. For multiple node installations, include dupserver.scrap.* (or equivalent). These files are located in $FASTSEARCH/var/log/crawler/. Known Issues and Resolutions This section provides resolutions to known issues for the crawler. #1: The crawler has problems reaching the license server or allocating a valid license.
A valid license served by the license manager (lmgrd) generates a log entry similar to the following in the $FASTSEARCH/var/log/lmgrd/lmgrd.scrap file: hh:mm:ss (FASTSRCH) OUT:"FastDataSearchCrawler" [email protected] If the crawler is having problems either reaching the license server (which may be on a remote node in a multiple node FAST ESP installation) or allocating a valid license, it will issue an error (Message A), and try a total of 3 times before exiting (Message B): Message A: "Unable to check out FLEXlm license. Server may be down or too many instances of the crawler are running. Retrying." Message B: "Unable to check out FLEXlm license. Shutting down. Contact Fast Search & Transfer (http://www.fastsearch.com/) for a new license." Please contact FAST Support for any licensing issues.
#2: How do I display, save, or import the configuration for a collection in XML format? To display the collection configuration, type the command: bin/crawleradmin -G <collection> You can save this collection configuration by redirecting or saving it to a file. To import a configuration from a file, type the command: bin/crawleradmin -f <filename>
#3: Postprocess reports that it is unable to obtain a database lock. The crawler is running on the system. If you stopped it, check the logs to make sure that the process has stopped. You may have to kill it manually if it still exists.
#4: The crawler does not fetch pages. The following areas can be checked:
• Verify the Start URIs list against the configured rules. Check the screened log. Check the URIs individually using the crawleradmin tool and the --verifyuri option: # crawleradmin --verifyuri <collection>:<start URI>
• If a site specified in the Start URIs list (or otherwise permitted under the rules) is not being crawled, it may be due to a robots.txt file on the remote web server. This is a mechanism that gives webmasters the ability to block access to content from some or all web crawlers. To check with a browser, request the page http://<site>/robots.txt. If it does not exist, the web server should return the HTTP status 404, Not Found; the same status will appear in the crawler fetch log.
• Check the DNS log in case the server does not resolve.
• Check the proxy if one is used.
• Check the log file.
If there are no clear errors, yet some pages are not being crawled, it may be due to the refresh cycle being too short to complete the crawl. Refer to Resolution #5 for resolution.
#5: Some documents are never crawled. Check your refresh interval (default = 1440 minutes) and refresh mode (default = scratch). If the interval is too short, some of the documents may never be crawled (depending on the refresh mode). You need to either increase the refresh interval or change the refresh mode of the crawler. The Refresh interval and Refresh mode can be changed from Edit Collection in the administrator interface. Note that the Refresh when idle option allows an idle crawl to start a new cycle immediately without waiting for the next scheduled refresh. All refresh modes include inserting the start URIs into the work queue. The work queue is the queue from which the crawler retrieves URIs. Refer to Refresh Mode Parameters on page 89 for valid refresh modes. If there are no clear errors, refer to Resolution #4 for additional checks.
#6: How do I back up the content retrieved by the crawler? Make a backup copy of the $FASTSEARCH/data/crawler/ folder. Complete the procedure described in Backup and Restore on page 153.
#7: Some documents get deleted from the index. Since the DB switch delete is set to off by default, no documents will be deleted unless they are irretrievable. Check to see if the DB switch delete has been turned on. There may also be a problem if the DB switch delete is on and the refresh interval is set too low. If so, then it is possible that the internal queue of your crawler is so large that certain documents do not get refreshed (or re-visited) by the crawler. If that is the case, you need to either change the mode of your crawler or increase the refresh rate or both.
#8: Documents are not removed when I change the exclude rules and refresh the collection. If you make changes to your configuration, only the configuration will be refreshed, not the collection. Collection refreshes are only triggered by time since the last refresh or by using the Force Re-fetch option. 1. In order to remove documents instantly, stop the crawler from System Management. With a multiple node crawler, it is recommended that you stop all instances of crawler and ubermaster. Do not stop the duplicate server, as this is required in order to run any postprocess refeed commands. 2. Run postprocess manually with the (uppercase) -X option (together with -R). By using the -X option, all URIs in the crawler store will be traversed and compared to the collection specification. URIs matching any excludes will then be deleted. Issue the command: $FASTSEARCH/bin/postprocess -R <collection> -X Note that to reprocess and delete documents that have been excluded by the configuration, you only need the -X (uppercase) switch as shown in this example. If you decide to add the -x (lowercase) switch, then everything else will be re-processed in addition to the verification and removal of excluded content. 3. Allow postprocess to finish running until all content has been queued and submitted. Note that it may take some time after postprocess exits before documents are fully removed from the index.
#9: The crawler uses a lot of resources, what can I do? If system resources are being overwhelmed because of the scale of the crawl being run, the ideal solutions are to:
• Ensure the correct configuration and caches
• Add hardware resources
• Reduce crawler impact
Refer to Configuring a Multiple Node Crawler on page 128 and Large Scale XML Crawler Configuration on page 130 for additional information. If configurations are correct, and it is not possible to add resources, then the next step is to try to reduce the impact of the crawl, either by reducing the scope of the crawl or by slowing the pace of the crawl. There are no definite answers to this issue. Go through your configuration and determine if you can:
• Suspend postprocess feeding of documents to FAST ESP: By default crawled pages are stored on disk, then fed to FAST ESP concurrently with the crawling of additional pages. By suspending the feeding of documents to FAST ESP, additional resources are made available to the crawling processes, thereby increasing their efficiency. Once the crawl is complete, feeding can be resumed to build the collection in the index. The commands to perform these tasks are: # crawleradmin --suspendfeed <collection> # crawleradmin --resumefeed <collection>
• Reduce the overall load on the crawler host by:
• increasing cache sizes, especially the postprocess cache size.
• reducing the number of complex include/exclude rules and rewrites.
• focusing the crawl on fewer sites/servers (include/exclude domain/URI paths).
• crawling fewer web sites at a given time by reducing Max concurrent sites (if I/O is overloaded then lowering this value may help increase performance).
• using a number of uberslaves equivalent to the number of CPUs (or even more, up to 8, for large scale crawls).
• lowering the frequency of page requests (request_rate, delay).
• lengthening the overall update cycle (refresh_interval).
• limiting the crawling schedule (variable_delay).
Depending on your answers, tune these parameters accordingly.
#10: I cannot locate all the documents in my index. Documents are kept a maximum time period of: dbswitch x refresh where dbswitch denotes the number of crawl cycles a previously fetched document is allowed to remain unseen by the crawler. If this limit is reached, the dbswitch-delete parameter will decide what happens to the document. If dbswitch-delete is set to yes, the document will be deleted, and if it is set to no, it will be scheduled for an explicit download check. If this check fails, the document will be removed. There are three approaches to avoid this situation: 1. Make sure all documents covered by the collection are crawled within the refresh period. 2. Set refresh_mode = scratch (default). The work queue will be emptied when a refresh starts, and the crawler starts from scratch. 3. Set dbswitch-delete = no (default).
#11: The crawler cookie authentication login does not work. To create a successful login configuration the goal is to have the crawler behave in a similar way to what a user and browser do when logging into the web site. In order to achieve this you can:
• Inspect traffic traces between the browser and server. Pay attention to the order in which pages are retrieved, what HTTP method is used, when the credentials are posted and what other variables are set.
• Use one of the available tools to do this:
• Mozilla LiveHTTPHeaders plugin, which lets you see the HTTP headers exchanged (even over encrypted transport as in HTTPS).
• Charles web proxy (shareware), which acts as a proxy and lets you inspect headers and bodies both as a tree and as a trace.
• Basic tools like tcpdump or ethereal can also be used. Note that only LiveHTTPHeaders will help you when HTTPS is used.
Remember to erase your browser's cache and cookies before obtaining a trace. Refer to Setting Up Crawler Cookie Authentication on page 115 for details on setting up the crawler cookie authentication; see the section Form Based Login on page 57 for additional information about setting up a forms-based login.
#12: The Browser Engine gets overloaded and sites get suspended. You should start by tuning the Browser Engine. Please refer to the FAST ESP Browser Engine Guide. In order to solve the problem you may need to tune the EC configuration. By decreasing the max_sites setting and/or increasing the delay, the number of documents sent from the EC to the Browser Engine may be reduced. The side effect is that the crawl speed may decrease. However, as the EC will start suspending sites if the Browser Engines get overloaded, the speed may not necessarily decrease. If this still does not solve the problem, you need to reduce the number of sites that use JavaScript and/or Flash processing. This is done by: 1. Disable JavaScript and/or Flash options in the main crawl collection. 2. Exclude the sites where you want to use JavaScript and/or Flash from the main collection. 3. Create a new collection. 4. Activate JavaScript and/or Flash in the new collection. 5.
Specify the sites that you want crawled using JavaScript/Flash in the new collection. 168 Chapter 6 Enterprise Crawler - reference information Topics: • • • • • Regular Expressions Binaries Tools Crawler Port Usage Log Files This chapter contains reference information for the Enterprise Crawler for the various binaries and tools. You will also find information about regular expressions, log files and ports. FAST Enterprise Crawler Regular Expressions Certain entries in the FAST ESP administrator interface collection specific screens request the use of regular expressions (regexp). Using Regular Expressions The following tables describe terminology used in this appendix. Table 46: Collection Specific Options Definitions Term Definition URI Uniform Resource Identifier - commonly known as a link and identifies a resource on the web. Example: http://subdomain.example.com/ Domain The domain/server portion of the URI - in the previous URI example, the Domain is the subdomain.example.com/. Path The path portion of the URI. For example, for the URI http://subdomain.example.com/shop, the path portion is /shop. Note: All patterns in the crawler are matched from the beginning of the line, unless specified otherwise Character Definition . Matches any character. * Repetition of the character 0 or more times. $ End of string. \ Escape characters that have a special meaning. .*\.gif$ Matches every string ending in .gif. .*/a/path/.* Matches any string with /a/path/ in the middle of the expression. .*\.example\.com All servers in the domain .example.com will be crawled. .*\.server\.com Matches any characters (string) followed by .server.com. Grouping Regular Expressions If the crawler needs to be configured with rewrite rules, as described in the URI rewrite rules entry in Table 5-1, then Perl-style grouping must be used. Grouping defines a regular expression as a set of sub-patterns organized into groups. A group is denoted by a sub-pattern enclosed in parenthesis. Example: If you want to capture the As and Cs in groups for the string: AABBCC then enclose the patterns for the As and Cs in parenthesis as shown in the following regular expression: (A*)B*(C*) 170 Enterprise Crawler - reference information Substituting Regular Expressions To perform regular expression substitution, you need a regular expression that is to be matched and a replacement string that should replace the matched text. The replacement string can contain back references to groups defined in the regular expression. With back references, the text matched by the group is used in the replacement. Back references are simply backslashes followed by an integer denoting the ordinal number of the group in the regular expression. Example: Using the regular expression described in the previous example, the following replacement string: \1XX\2 rewrites the string AABBCC to AAXXCC. Binaries The following sections describe the major Enterprise Crawler programs and the options and parameters they support. crawler The crawler binary is the master process, responsible for starting all other crawler processes. It also serves as the ubermaster process in a multiple node crawler installation. In addition to initialization of data directories and log files, the crawler is responsible for several centralized functions, including maintenance of the configuration database, handling communications with other FAST ESP components, resolving and caching hostnames and IP addresses, and routing sites to uberslave processes. 
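For orientation before the option reference, here is a hedged illustration of a standalone invocation that combines the data directory, uberslave count, and collection file options described in Table 47 below; the XML file name is a hypothetical example, not a value from this guide:

# start the crawler with 8 uberslave processes, an explicit data directory,
# and an XML file describing the collections to crawl (file name is illustrative)
$FASTSEARCH/bin/crawler -d $FASTSEARCH/data/crawler -c 8 -f $FASTSEARCH/etc/mycollections.xml

Within a FAST ESP installation the crawler is normally started through nctrl, as described in the Operating chapter; invoking the binary directly like this can be useful when experimenting with option combinations.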
Binary: $FASTSEARCH/bin/crawler [options] Table 47: crawler options Basic options -h Description Show usage information. Use this option to print a list with short description of the various options that are available. -P [<hostname>:] <crawlerbaseport> Use this option to specify an alternative crawler base port (XML-RPC interface). This option is useful if several instances of the crawler run on the same node. <hostname>: Set bind address for XML-RPC interfaces (optional). This field can be either a hostname or an explicit IP address. An actual IP address can also be used as some hosts have multiple IP addresses. <crawlerbaseport>: Set start of port number range that can be used by crawler. Default: 14000 9000 Note that uberslave processes will allocate ports from <port number>+10 and up. Furthermore, a specific interface to bind to can be specified. -d <path> Data storage directory. Use this option to store crawl data, runtime configuration and logs in subdirectories in the specified directory. Default: If the FAST environment variable is set then the default path is $FASTSEARCH/data/crawler; otherwise the default path is data. 171 FAST Enterprise Crawler Basic options Description -f <file> Specify collection(s). Use this option to specify the location of an XML file containing one or more collections. Read the contents of the file and start crawling the specified collection(s).The crawler will parse the contents of this file, add or update the collections contained within and start crawling. -c <number> Use this option to specify the number of uberslave processes to start. For larger crawls a process count of 8 is recommended. For larger crawls a process count equal to or greater than the number of CPUs is recommended. A maximum of 8 processes is supported. The number of processes should be equal to or less than the number of clusters defined in the collection specification. Default: 2 -v or -V Advanced options -D <number> This option prints the crawler version identifier and exits. Description Maximum DNS requests per second. The crawler has a built-in DNS lookup facility that may be configured to communicate with one or more DNS servers to perform DNS lookups. Use this option to limit the number of DNS requests per second that the crawler will send to the DNS server(s). The DNS resolver will automatically decrease the lookup rate if it detects that the DNS server is unable to handle the currently used rate. The actual rates can be seen in the collection statistics output. Default:100 requests -F <file> Specify the crawler global configuration file. Use this option to specify the location of an XML file containing the crawler global configuration. A crawler global configuration file is XML based and may contain default values for all command line options. Note that no command line switches may be specified in this configuration file. Also note that the crawler processes the command line switches in order. For example, if you use the -D option in ./crawler -F CrawlerGlobalDefaults.xml, the -D 20 will override any DNS request rate settings specified in the file. The crawler will on startup look for a startup file of default configuration settings. This option first attempts to locate the CrawlerGlobalDefaults.xml in the current directory. If not found it looks in $FASTSEARCH/etc directory. -n Shutdown crawler when idle. Use this option to signal that a crawler node should exit when it is idle. 
172 Enterprise Crawler - reference information Advanced options Description This option requires the refresh setting in a collection to be higher than the time required to crawl the entire collection. Default: disabled Logging options -L <path> Description Log storage directory. Use this option to store crawler specific logs in sub directories of the specified directory. Default: If the FAST environment variable is set then the default path is $FASTSEARCH/var/log/crawler; otherwise the default path is data/log. -q Disable verbose logging. Use this option to log CRITICAL, ERROR and WARNING log messages. -l <level> Log level. Use this option to specify the log level. This can be one of the following preset log levels: debug, verbose, info, warning, error Data search integration options -o Description DataSearch mode. Use this option when running the crawler in a FAST DataSearch or ESP setting. -i Ignore Config Server. Continue running even if the Config Server component is unreachable. Do not exit if Config Server cannot be reached. -p Publish Corba interface. Publish this address/interface for postprocess CORBA interfaces if enabled. Note: Applies to FDS 4.x only. Multiple node options -U Description Run as ubermaster in a multiple node setup. Start crawler as an ubermaster. Subordinate masters connect to the XML-RPC port by specifying the -S option. -S <ubermaster_host:port> Run as master in a multiple node setup. 173 FAST Enterprise Crawler Multiple node options Description Start crawler as subordinate (master) to another crawler (ubermaster). The <host:port> specifies the address of the ubermaster. Example: uber1.examplecrawl.net:27000 -s Survival mode. This option indicates that the subordinate master in a distributed setup should stay alive and try reconnecting to the ubermaster until a successful connection is made. This option only applies to the master. -I <ID> Symbolic name of crawler node. It is not normally necessary to use this option. In a multiple node crawler setup, each crawler node must be assigned a unique symbolic name, to be used in collection configurations when defining which crawler nodes to include in a crawl. This option only applies to the master. The default value is auto generated, and stored in the configuration database. If the option is used, and an alternative value is specified, this need only be done the first time the crawler is started. Environment variables FASTSEARCH_DNS Description The crawler will automatically attempt to detect the available DNS server(s). However, it is also possible to override the servers with this environment variable. The value of FASTSEARCH_DNS should be a semicolon separated list of DNS server IP addresses. Example: FASTSEARCH_DNS="10.0.1.33;10.0.1.34" An empty string may also be specified to force the use of the gethostbyname() API, rather than speaking directly with the DNS server(s). Example: FASTSEARCH_DNS="" postprocess Postprocess is used by the crawler to perform duplicate detection and document submissions to FAST ESP. It is, like the uberslave processes, automatically started with the crawler. The postprocess binary may also be run as stand alone - when the crawler is not running - to manually refeed documents in one or more collections. Postprocess is responsible for submission of new, modified and deleted documents as they are encountered by the crawler during a crawl. Before submission each document is checked against the duplicate database, unless duplicate detection is turned off. 
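As a hedged usage sketch before the option reference (the collection name and log file path are examples, not values from this guide), a typical standalone refeed on UNIX follows the earlier recommendation in the re-processing procedure to run postprocess under nohup and capture stdout and stderr:

# stop the crawler first, then refeed a single collection to FAST ESP;
# nohup plus output redirection lets postprocess run to completion unattended
$FASTSEARCH/bin/nctrl stop crawler
nohup $FASTSEARCH/bin/postprocess -R mycollection > /tmp/postprocess-refeed.log 2>&1 &

The crawler is restarted with nctrl once the refeed has completed.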
A URI equivalence class for each unique checksum is also maintained by postprocess, and updates to this class are submitted to FAST ESP in the form of changes to the 'urls' field. Only one document in a set of duplicates will be submitted and the rest will be part of the URI equivalence class. In addition to document submission, postprocess also outputs to the postprocess log. Refer to Log files and usage on page 197 for a description of the postprocess log. 174 Enterprise Crawler - reference information Binary: $FASTSEARCH/bin/postprocess [options] Table 48: postprocess options General options -h or --help Description Show usage information. Use this option to print a list with short description of the various options that are available. -l <level> Use this option to specify the log level. This can be one of the following preset log levels: debug, verbose, info, warning, error -P [<addr>:]<port number> Postprocess port. <port number> Set start of port number range that can be used by postprocess (default value is crawlerbaseport + 6). An optional IP address may be specified (by hostname or value). Default port: 9006 -U <file> Use the crawler global default configuration file. This option first attempts to locate the CrawlerGlobalDefaults.xml in the current directory. If not found it looks in $FASTSEARCH/etc directory. Conflicting options specified on the command line override the values in the configuration file if given. -d <path> Data storage directory. Use this option to store crawl data, runtime configuration and logs in subdirectories in the specified directory. Default: $FASTSEARCH/data/crawler -R <collections> Re-feed collections. Re-feed all documents to ESP even if documents have been added before. Specify <collections> as either a single collection or a comma separated list of collections (with no whitespace). Specify '*' to refeed all. Be sure to use the quote signs surrounding the asterisk, otherwise the shell will expand it. Refeed mode (-R) Only Description Note: You must stop the crawler before working in the refeed mode. Otherwise, postprocess will report a busy socket. -r <sitename> Resume re-feeding after the specified site (hostname0). This option may not be used at the same time as -s. Note: Specifying the special keyword @auto for <sitename> will make postprocess attempt to auto resume traversal from where your last refeed left off. 175 FAST Enterprise Crawler Refeed mode (-R) Only Description -s <sitename> Process only the specified sitename (hostname0). This option may not be used at the same time as -r. -x (lowercase x) Process all permitted URIs. Include all URIs matching the current collection include/exclude rules, while ignoring URIs that do not match. This is useful when also using the -u option to specify an updated collection specification XML file. -X (uppercase X) Issue delete for excluded URIs. Issues deletes for URIs that do not match the collection specification includes/excludes. All other URIs are ignored, unless combined with -x to also process all permitted URIs. This option is useful when -u is specified. -b Apply robots.txt exclusion to processing. Let -x and -X options apply to robots.txt exclusion as well. -u <file> Update includes/excludes from file. Updates the include and exclude regexps loaded from the configuration database with those from the specified collection specification XML file. -f Resume feeding existing dsqueues data. 
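For illustration, a few hedged postprocess refeed invocations that combine the refeed options above; the collection and file names are placeholders, and the crawler must be stopped first (for example with nctrl stop crawler).

Refeed a single collection:

$FASTSEARCH/bin/postprocess -R mycollection

Refeed all collections:

$FASTSEARCH/bin/postprocess -R '*'

Resume an interrupted refeed from where it left off:

$FASTSEARCH/bin/postprocess -R mycollection -r @auto

Refeed against an updated collection specification, processing all URIs that match the new rules and issuing deletes for those that do not:

$FASTSEARCH/bin/postprocess -R mycollection -u mycollection.xml -x -X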
-k <destination>:<collection> Override the feeding section specified in the collection configuration by specifying a destination (one specified in $FASTSEARCH/etc/CrawlerGlobalDefaults.xml) and a collection name. Alternatively specify the symbolic name of a feeding target as defined in the collection configuration, which then automatically maps down to feeding destination and collection name. ppdup In a multiple node crawler installation, a duplicate server is needed to provide a centralized duplicate detection function for each of the master/postprocessor hosts. The duplicate server can be configured using the ppdup binary. Binary: $FASTSEARCH/bin/ppdup [options] Table 49: ppdup options Option -h Description Show usage information. Use this option to print a list with short description of the various options that are available. 176 Enterprise Crawler - reference information Option -l <level> Description Use this option to specify the log level. This can be one of the following preset log levels: debug, verbose, info, warning, error -I <identifier> Symbolic duplicate server identifier. Use this option to assign a symbolic name to the duplicate server. This name is used when the state of the duplicate server is replicated by another duplicate server. -P [<addr>:]<port number> Port and optional interface. This option specifies the port to which postprocesses communicate to the Duplicate-Server in a multiple node setup. -r <port> Replication service port. This option enables "replica mode" for the duplicate server. The duplicate server will listen for incoming replication requests on the specified port. -R <host:port> Address of replication server. This option specifies the address of the duplicate server that should replicate the duplicate server state. The hostname specified must correspond to a server running the duplicate server with the -r option with the specified port. -d <path> Set current working data directory. This option specifies the working directory for the duplicate server. Default: If the FAST environment variable is set then the default path is $FASTSEARCH/data/crawler/ppdup; otherwise the default path is data. -c <cache size> Database cache size or hash size. When a storage format of "hashlog" is selected (see -S option) this value determines the size of the memory hash allocated. If the number of documents stored into the hash exceeds the available capacity the hash will automatically be converted into a disk hash and resized (2x increments). If a storage format of "diskhashlog" is selected the value determines the initial size of the hash on disk. For each overflow (whenever capacity is exceeded) the hash is resized, as described above. When the storage format is "gigabase" the value specifies the amount of memory to reserve for database caches. Note that this value is per collection. If multiple collections are used then each collection will allocate the specified amount of cache/memory/disk. Furthermore, if the duplicate server is being run as both a primary and a replica then twice the resources will be consumed. Default: 64 -s <stripes> Number of stripes. This option sets the number of stripes (separate files) that will be used by the duplicate server databases. 177 FAST Enterprise Crawler Option Description Default: 1 -D Direct I/O. This option specifies that the duplicate server should enable direct I/O for its databases. Enable only if supported by the operating system. -S This option specifies which database storage format to use. 
<hashlog|diskhashlog|gigabase> The "hashlog" format will initially allocate a memory based hash structure with a data log on disk. The size of the memory hash is specified by the -c option described separately. If the hash overflows it will automatically be converted into a "diskhashlog". The "diskhashlog" format is similar, but a disk based hash structure and the "gigabase" format is a database structure on disk. Default: hashlog -N -F Disable nightly compaction of duplicate server databases. Specify the crawler global configuration file. Use this option to specify the location of an XML file containing the crawler global configuration. A crawler global configuration file is XML based and may contain default values for all command line options. Note that no command line switches may be specified in this configuration file. Also note that the crawler processes the command line switches in order. For example, if you use the -D option in ./crawler -F CrawlerGlobalDefaults.xml, the -D 20 will override any DNS request rate settings specified in the file. The crawler will on startup look for a startup file of default configuration settings. This option first attempts to locate the CrawlerGlobalDefaults.xml in the current directory. If not found it looks in $FASTSEARCH/etc directory. -v or -V Print version ID. This option prints the ppdup version identifier. Tools The Enterprise Crawler has a suite of related tools that can be used to perform tasks ranging from quite general to extremely specific. Care should be exercised before using any of these programs, and backing up data is always a prudent consideration. crawleradmin The crawleradmin tool is used for configuring (XML configs), monitoring (statistics and various other calls) and managing (seeding, forcing of refreshing, reprocessing, suspending/resuming crawl/feed). Tool: $FASTSEARCH/bin/crawleradmin: option [options] Table 50: crawleradmin return codes 178 Enterprise Crawler - reference information Return code 0 1 2 3 4 5 6 10 11 Description Command successfully executed. An error occured. See error text for more details. Command line error. An unrecognized command was specified, or the arguments were incorrectly formatted. The collection specified on the command line does not exist. The command failed because it requires the crawler to be stopped and the --offline or -o flag to be specified. An error was encountered attempting to read a file, or some other I/O operation failed. See error text for more details. Statistics is not yet available for the specified collection/site. An error was reported by the master. See error text for details. A socket error was encountered trying to connect to the master. Table 51: crawleradmin options General options Description --crawlernode <hostname:port> Manage crawler at the specified hostname and port. or -C hostnameport Default: localhost:14000 <hostname:port> --offline or -o <configdir> Work in offline mode; crawler is stopped. Offline mode assumes the default configuration directory, $FASTSEARCH/data/crawler/config or just data/config if the FASTSEARCH environment variable is not set. This option can be used together with the following options: -a, -d, -c, -q, -G, -f, -d, --getdata and --verifyuri -l <log level> Specify log level. Use this option to specify the log level. This can be one of the following preset log levels: debug, verbose, info, warning, error --help or -h Print usage information. Use this option to print a list with short description of the various options that are available. 
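The return codes above make crawleradmin convenient to use from scripts. A couple of hedged examples follow; the host, port and collection names are placeholders, and the individual management and statistics options are described in the tables below.

Query a crawler running on a non-default port and check the result in a Bourne-type shell:

$FASTSEARCH/bin/crawleradmin -C crawlhost1:15000 -q mycollection
echo $?

A return code of 0 indicates success, while 6 indicates that statistics are not yet available for the collection.

Display crawl statistics for all collections on that node:

$FASTSEARCH/bin/crawleradmin -C crawlhost1:15000 -c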
Crawler configuration options --addconfig <file> or -f <file> Description Add or update collection configuration(s) from the specified XML file. 179 FAST Enterprise Crawler Crawler configuration options Description --collectionconfig <collection> or -g <collection> Display the configuration for the specified collection. --getcollection <collection> or -G <collection> --delcollection <collection> or -d <collection> Output the XML configuration for the specified collection to stdout. Redirect the stdout output to a file to save the configuration. Delete collection (including all crawler storage). Note that this has no effect on FDS/ESP collection configuration elements such as pipeline or index. Crawler control options --shutdown or -x Description Shutdown the crawler. Do not use this option when integrated with FAST ESP, as nctrl will restart crawler. Use nctrl stop crawler instead. --suspendcollection <collection> or -s <collection> --resumecollection <collection> or -r <collection> --suspendfeed <collection>[:targets] --resumefeed <collection>[targets] --enable-refreshing-crawlmode <collection> --disable-refreshing-crawlmode <collection> Suspend (pause) crawling of <collection>. Feeding will continue if there are documents in the feed queue. Resume crawling of <collection>. Suspend (pause) FAST ESP feeding for <collection>. Optionally specify a comma separated list of feeding targets (symbolic names found in the collection configuration). Resume FAST ESP feeding for <collection>, optionally the specified feeding targets. Enable the 'refresh' crawl mode for the specified collection. When enabled, the crawler will only crawl/refresh URIs that previously have been crawled. Disable the 'refresh' crawl mode for the specified collection, and resume to normal crawl mode. URI submission, refetching and refeeding Description -adduri <collection>:<URI> or -u <collection>:<URI> Append specified <URI> to <collection> work queue. Can be combined with the --force flag to prepend the URIs and crawl them immediately. -addurifile <collection>:<file> Append all URIs from the specified <file> to <collection> work queue. Can be combined with the --force flag to prepend the URIs and crawl them immediately. --refetch <collection> or -F <collection> 180 Force re-fetch of <collection>. Enterprise Crawler - reference information URI submission, refetching and refeeding Description This will cause the crawler to erase all existing work queues (regardless of refresh mode) and clear all caches, start a new crawl cycle and place all known start URIs on the work queue. This will not increment the counter used for orphan detection (dbswitch) unlike normal refreshes. --refetchuri <collection>:<URI> or -F <collection>:<URI> Force re-fetch of <URI> in <collection>. The URI does not need to be previously crawled. However, it must fall within the include/exclude rules for the <collection>. This also (as a side effect) triggers crawling of the site to which the URI belongs (unless this site has already been crawled in this refresh period). --refetchsite <collection>:<URI> --force --feed --refeedsite <collection>:<web site> Force re-fetch of site from <URI> in <collection>. Used with --adduri/addurifile/refetchuri/refetchsite to make sure the URI gets attention immediately (by potentially preempting active sites). Used with --refetchuri and --refetchsiteto also have the URIs refed to FAST ESP indexing, regardless of whether the documents have changed. 
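As further hedged examples of URI submission (the collection and file names are placeholders):

Seed a list of URIs from a file and crawl them immediately:

$FASTSEARCH/bin/crawleradmin --addurifile mycollection:/tmp/seed_uris.txt --force

Force a re-fetch of an entire collection, clearing work queues and caches and starting a new crawl cycle:

$FASTSEARCH/bin/crawleradmin --refetch mycollection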
Refeed all documents in the crawler store for <web site> to FAST ESP indexing. This is equivalent to running postprocess refeed on a single site, but does not require stopping the crawler. Due to the implementation of this feature, it is advisable to limit the amount of concurrent re-feeds at run time to prevent overloading the crawler. The URIs you refeed end up in a high priority queue. This means it doesn't have to wait for other docs currently waiting to be fed to the ESP. Feeding to the ESP will be done from both the high priority queue and the normal priority queue at the same time, so there might be a little delay before the document is visible in the search. --refeeduri <collection>:<URI> --refeedprefix <prefix> --refeedtarget <destination>:<collection> Refeed the specified URI from the crawler store to FAST ESP indexing. See --refeedsite above for more information. Specify a URI prefix (including scheme) that URIs must match to be re-fed. Only applicable with the --refeedsite option. Specify a feeding destination and collection to which the specified refeed command will feed URIS to. Only applicable with the --refeedsite option. 181 FAST Enterprise Crawler Preempting, blacklisting and deletion Description --preemptsite <collection>:<web site> or -p <collection>:<web site> Preempt crawling of site <web site> in <collection>. --blacklist <collection>:<web site>:<time> --unblacklist <collection>:<web site> --deletesite <collection>:<web site> --deluri <collection>:<URI> --delurifile <collection>:<file> Statistics options --collstats <collection> or -q <collection> --collstatsquiet <collection> or -Q <collection> --statistics or -c Blacklist <web site> from crawling in <collection> for <time> seconds. Remove blacklisting of <web site> in <collection>. Delete <web site> in <collection> from crawler. Delete <URI> in <collection>. Delete URIs in <file> from <collection>. Description Display crawl statistics for <collection>. Display abbreviated version of crawl statistics for <collection>. Display crawl statistics. Refer to crawleradmin statistics on page 183 for more information. --sitestats <collection>:<web site> --cycle (1,~) Monitoring options Statistics for <web site> in <collection>. Combine with any/all of the Statistics options listed in this table to display statistics for the specified refresh cycle. Use all to merge all refresh cycles. Default is current cycle. Description Note: id equals all or host:number or number. --status --nodestatus --active or -a --numslaves or -n --slavestatus <id> or -S <id> --numactiveslaves <id> or -N <id> --sites <id> or -t <id> 182 Display status for all collections. Display status (per node) for all collections. Display all active collection names. Display the number of sites currently being crawled. Show site status for uberslave process <id>. Show number of active sites for uberslave process <id>. List sites currently being crawled by uberslave <id>. Enterprise Crawler - reference information Monitoring options Description --starturistat Display feeding status of start URI files. Debugging options Description --verifyuri <collection>:<URI> Output information if an <URI> can be crawled and indexed in the <collection>. This option checks against the following crawler parameters: include_uris, include_domains, exclude_uris, exclude_domains, allowed_schemes, allowed_types, force_mimetype_detection, rewrite_rules, robots, max_redirects, refresh_redir_as_redir, max_uri_recursion, search_mimetype and check_meta_robots. 
Note that there still may be reasons why an URI is not crawled, e.g. DEPTH or due to an URI being dropped by a crawler document plugin. crawleradmin statistics Running crawleradmin -c provides statistics for all collections active in the crawler. Directing a statistics lookup to the administrator interface of the ubermaster will produce aggregated statistics for all crawler nodes. Statistics for a specific node in a multiple node crawler setup may be produced by directing the lookup to the administrator interface of the particular node. The following provides a sample statistics output: Brief statistics for collection <collection> ============================================ All cycles ========== Running time Average document rate Downloaded (tot/stored/mod/del) Document store (tot/unique) Document sizes (avg/max) : : : : : 20.21:29:38 44.69 dps 80,687,225 URIs / 41,886,951 / 10,702,819 / 6,186,202 24,997,930 URIs / ~24,254,800 24.14 kB / 488.28 kB Current cycle (57) ================== Running time Stats updated Status Progress Document rate (curr/avg) In bandwidth (curr/avg/tot) : : : : : : 01:46:40 22.6s ago Crawling, 4,482 sites active 26.9% 51.66 dps / 42.88 dps 6.28 Mbps / 5.38 Mbps / 4.01 GB Downloaded (tot/stored/mod/del) : 274,451 URIs / 101,831 / 49,743 / 15,156 Download times (avg/max/acc) : 19.7s / 07:37 / 59.11:56:00 DNS overview -----------Requests (tot/retries/timeouts) : 2,192,245 / 206,290 / 134,132 Request rate (curr/avg/limit) : 0.8 rps / 1.2 rps / 75 rps crawleradmin examples The following examples show some of the crawleradmin options being used for the collection named mytestcoll. 183 FAST Enterprise Crawler Extract crawler XML configuration To get crawler configuration file information: $FASTSEARCH/bin/crawleradmin -G mytestcoll > mytestcoll.xml Note that the name of the configuration file (mytestcoll.xml in this example) does not need to be the same as the collection name. When restoring the collection the actual name of the collection is given by the name of the DomainSpecification element in the configuration file. Add/update crawler XML configuration To restore or update a collection configuration from a saved file: $FASTSEARCH/bin/crawleradmin -f mytestcoll.xml Delete collection from crawler only To remove a collection from the crawler's configuration, and delete the stored data: $FASTSEARCH/bin/crawleradmin -d mytestcoll Note that this command has no effect on the collection in the index. Crawler collection statistics To display collection statistics: $FASTSEARCH/bin/crawleradmin -Q mytestcoll Replace uppercase Q with lowercase Q for more details. 
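The following additional examples are illustrative sketches built from options described earlier in this section; they use the same example collection mytestcoll, and host names are placeholders.

Display aggregated multiple node statistics
To display statistics aggregated across all crawler nodes, direct the lookup at the ubermaster administrator interface:

$FASTSEARCH/bin/crawleradmin -C uber1.examplecrawl.net:27000 -c

Verify whether a URI would be crawled
To check a URI against the include/exclude rules and related settings of the collection:

$FASTSEARCH/bin/crawleradmin --verifyuri mytestcoll:http://www.example.com/test_pages/x1.html

Temporarily blacklist a site
To blacklist a web site for one hour (3600 seconds), and later remove the blacklisting:

$FASTSEARCH/bin/crawleradmin --blacklist mytestcoll:www.example.com:3600
$FASTSEARCH/bin/crawleradmin --unblacklist mytestcoll:www.example.com

Enable refresh-only crawling
To restrict the crawler to refreshing previously crawled URIs only:

$FASTSEARCH/bin/crawleradmin --enable-refreshing-crawlmode mytestcoll

Use --disable-refreshing-crawlmode to return to normal crawl mode.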
Force re-crawling of a site To force a re-crawl (re-fetch) a site: $FASTSEARCH/bin/crawleradmin --refetchsite mytestcoll:www.example.com Force re-crawling a single URI To re-crawl a specific URI immediately: $FASTSEARCH/bin/crawleradmin --refetchuri mytestcoll:http://www.example.com/test_pages/x1.html --force Force re-crawling and refeeding a single URI To re-crawl and refeed a specific URI immediately: $FASTSEARCH/bin/crawleradmin --refetchuri mytestcoll:http://www.example.com/test_pages/x1.html --force --feed Refeed a site while crawling To refeed a site to ESP for processing and indexing: $FASTSEARCH/bin/crawleradmin --refeedsite mytestcoll:www.example.com You can also specify a different feeding destination on the command line: $FASTSEARCH/bin/crawleradmin --refeedsite mytestcoll:www.example.com --refeedtarget otheresp:mytestcoll 184 Enterprise Crawler - reference information Suspending/resuming crawling To suspend the crawling of a collection: $FASTSEARCH/bin/crawleradmin --suspendcollection mytestcoll To resume crawling use --resumecollection. Suspending/resuming content feeding To suspend the content feed to ESP processing and indexing: $FASTSEARCH/bin/crawleradmin --suspendfeed mytestcoll If the collection has multiple destinations specified in the configuration, you can suspend an individual destination by doing: $FASTSEARCH/bin/crawleradmin --suspendfeed mytestcoll:mydest To resume feeding use --resumefeed. crawlerdbtool The crawlerdbtool lists all documents/URLs that the crawler knows about for each collection. To use: 1. Stop the crawler: $FASTSEARCH/bin/nctrl stop crawler 2. On the crawler node, run this command: crawlerdbtool -m list -d datasearch/data/crawler/store/test/db/ -S all Table 52: crawlerdbtool Options Option -m <mode> Description Operation mode. Valid modes: check - Report corrupt databases only. repair - Attempt repair of corrupt databases by copying elements to new databases. New databases are verified before they replace the corrupt databases. delete - Delete corrupt databases. compact - Compacts a database (specify filename or directory) or document store cluster (specify cluster directory). list - Outputs all keys in a database. count - Counts the number of keys in a database. view - View an entry in a database based on the key specified with -k. If none specified then all keys are output. viewraw - As above but without any formatting. export - Export a database to marshalled data. import - Imports a database from marshalled data. analyze - Analyzes a meta database. pphl2gb - Convert a postprocess checksum database from hashlog to gigabase format. Default: check 185 FAST Enterprise Crawler Option Description -d <dir> Specifies the directory/file to process. Must be specified except in 'align' mode. The -f option below is ignored if a file is specified. -f <filemask> Specifies the filemask/wildcard to work on. Can be repeated. Default: * -c <cachesize> Specify the cache size (in bytes) to be used when opening databases. Default: 8388608 -s <frequency> Database sync frequency during repair. Specifies the number of operations between each sync. A value of 1 will sync after each operation. Default: 10 -t <timeout> Specify a timeout in seconds after which a database check/repair process is terminated. child is assumed dead and killed. The database will be assumed corrupt beyond repair and will be deleted. Caution: Use with caution. Default: none -k <key> Only applicable in view mode. Specifies the database key to view. 
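A couple of further hedged crawlerdbtool examples; remember to stop the crawler first, and note that the store path is illustrative.

Check all databases under a collection store for corruption (check is also the default mode):

crawlerdbtool -m check -d datasearch/data/crawler/store/test/db/

Attempt repair of corrupt databases in the same location:

crawlerdbtool -m repair -d datasearch/data/crawler/store/test/db/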
-K <key> Same as -k, but assumes key is repaired and will call eval() on it before using. Use this for checksums. -i <intermediate format> Only applicable in import/export mode. The selected format will be exported to or imported from. Valid formats: marshal - fast space-efficient format pickle - version and platform independent format Default: marshal -S <site> Specify a site to apply the current mode to. Use this for inspecting meta databases. If site is "all", all sites will be traversed. If site is "list" all sites will be listed. crawlerdbtool examples Note: Before running the crawlerdbtool make sure the crawler is stopped, as the tool cannot be run concurrently. List documents from a server The command to list all documents crawled from a server within a collection would be: crawlerdbtool -m list -d datasearch/data/crawler/store/test/db/ -S web001.example.net 186 Enterprise Crawler - reference information where web001.example.net is the server, and test is the name of the collection. Output: 'http://web001.example.net/Island/To.html' 'http://web001.example.net/in/and.html' 'http://web001.example.net/For/3).html' 'http://web001.example.net/for/services.html' List sites from a collection To list all known sites within a collection, use the command: crawlerdbtool -m list -d datasearch/data/crawler/store/test/db/ -S all Output: web001.example.net web000.example.net URI statistics for a collection To list statistics for all URIs crawled within a collection, use the command: crawlerdbtool -m analyze -d datasearch/data/crawler/store/test/db/ -S all Output: same as Example #3 showed for entire collection. URI statistics for a server To get statistics for URIs crawled from a specific server within a collection, use the command: crawlerdbtool -m analyze -d datasearch/data/crawler/store/test/db/ -S web001.example.net Output: Enterprise Crawler 6.7 - DB Check Utility Copyright (c) 2008 FAST, A Microsoft(R) Subsidiary Current options are: - Mode : analyze - Timeout : None - Directory : datasearch/data/crawler/store/test/db/ - File masks : * - Cachesize : 8388608 Site Report ================================================= Document and URIs Avg. Doc Size Data Volume JavaScript URIs Redirect URIs Total URIs Unique CSUMs : : : : : : 2.06 kB 18.43 MB 0 0 9141 9126 Mime-Types 187 FAST Enterprise Crawler text/html : 9141 List URIs (keys) from a database To list the URIs (or sites) within a given database file, use the list option as in the following command: crawlerdbtool -m list -d data/store/example/db/1/0.metadb2 'http://www.example.com/' 'http://www.example.com/maps.html' 'http://www.example.com/bart/bart.jsm' 'http://www.example.com/metro/metro.jsm' 'http://www.example.com/planimeter/planimeter.jsm' 'http://www.example.com/comments/' 'http://www.example.com/software/' 'http://www.example.com/software/micro_httpd/' 'http://www.example.com/software/mini_httpd/' 'http://www.example.com/software/thttpd/' 'http://www.example.com/software/spfmilter/' 'http://www.example.com/software/pbmplus/' 'http://www.example.com/software/globe/' 'http://www.example.com/software/phoon/' 'http://www.example.com/javascript/MapUtils.jsm' 'http://www.example.com/software/saytime/' 'http://www.example.com/javascript/Utils.jsm' Success View record of a specific database key The output of the previous command provides the keys to the data stored within each database. This can be specified with the -k option in view mode, to see all details associated with that URI or site, as in the following examples. 
crawlerdbtool -m view -d data/store/example/db/1/0.metadb2 -k 'http://www.example.com/maps.html' key (meta): 'http://www.example.com/maps.html' MIME type Crawl time Errors Compression Parent State flag Checksum : : : : : : : text/html 2006-12-21 18:54:09 None deflate None 0 c2f963f3b56e1495abad9c8b89ab41f5 Change history : (0, 0, 0, 1166723649) Links : http://mapper.example.com/ http://mapper.example.com/ http://www.example.com/ http://www.example.com/ http://www.example.com/GeoRSS/ http://www.example.com/GeoRSS/ http://www.example.com/bart/ http://www.example.com/bart/ http://www.example.com/javascript/ http://www.example.com/javascript/ http://www.example.com/jef/ggs/ http://www.example.com/jef/ggs/ http://www.example.com/jef/hotsprings/ http://www.example.com/jef/hotsprings/ http://www.example.com/jef/outlines/ http://www.example.com/jef/outlines/ http://www.example.com/jef/paris_forts/ 188 Enterprise Crawler - reference information http://www.example.com/jef/paris_forts/ http://www.example.com/jef/transpac2005/ http://www.example.com/jef/transpac2005/ http://www.example.com/mailto/?id=wa http://www.example.com/mailto/?id=wa http://www.example.com/mailto/wa.gif http://www.example.com/mailto/wa.gif http://www.example.com/metro/ http://www.example.com/metro/ http://www.example.com/planimeter/ http://www.example.com/planimeter/ http://www.example.com/resources/images/atom_ani.gif http://www.example.com/resources/images/atom_ani.gif http://www.google.com/apis/maps/ http://www.google.com/apis/maps/ Maxdoc counter : 2 Last-Modified : Tue, 11 Apr 2006 13:35:18 GMT Epoch ETag Flags Previous Checksum Referrers : : : : : 0 None 0 None http://www.example.com/ : Fileinfo : ('example/data/1', 1217, 65539) HTTP header : HTTP/1.1 200 OK Server: thttpd/2.26 ??apr2004 Content-Type: text/html; charset=iso-8859-1 Date: Thu, 21 Dec 2006 17:54:07 GMT Last-Modified: Tue, 11 Apr 2006 13:35:18 GMT Accept-Ranges: bytes Connection: close Content-Length: 4068 Adaptive epoch (upper) : 0 Adaptive rank : 7920 Level (min/current/max) : (1, 1, 1) # crawlerdbtool -m view -d data/store/example/db/1/site.metadb2 Site: 'www.example.com' Internal ID : 0 Hostname : www.example.com Alias : None Adaptive data : awo : (12, 0) awe : 2 Epoch details : Last refresh (upper) Clean epoch Last refresh Previous adaptive epoch Epoch Epoch (upper) Subdomain list IP address Mirrors Last seen Segment number Maxdoc limit : : : : : : : : : : : : 2007-01-09 17:34:49 2 2007-01-09 17:34:49 0 2 0 empty 192.168.178.28 None 0 0 0 crawlerconsistency The consistency tool is used for verifying and repairing the consistency of the crawler document and meta data structures on disk. 189 FAST Enterprise Crawler The consistency tool has two main uses. It can be used as a preventive measure to verify and maintain internal crawler store consistency, but also as part of recovering a damaged crawler store. The tool will detect, and by default also attempt to repair, the following inconsistencies: • • • • • Documents referenced in meta databases, but not found in the document store Invalid documents in the document store Unreferenced documents in the document store (requires docrebuild mode) Duplicate database checksums not found in meta databases Multiple checksums assigned to the same URI in the duplicate database The above list of inconsistencies are automatically corrected by running the tool in the doccheck or docrebuild mode, followed by the metacheck mode. 
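For illustration, a consistency run over a single collection combining these modes might look as follows; the -M, -O, -d and -C options are described in the options table below, and the paths and collection name are placeholders:

$FASTSEARCH/bin/crawlerconsistency -M doccheck,metacheck -O /tmp/consistency-logs -d $FASTSEARCH/data/crawler -C mycollection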
Any URIs found to be non-consistent will be output to a log file (see below), and a delete operation will also be issued to the indexer (can be disabled) to ensure it is in sync. Refer to Crawler Store Consistency on page 157 for more information. In a multi node crawler environment the tool can also be used to rebuild a duplicate server from the contents of per-master postprocess checksum databases, using the ppduprebuild mode. Since this mode builds the duplicate server from scratch it can also be used to change the number of duplicate servers in use, by first changing the configuration of the collection and then rebuilding. Refer to Redistributing the Duplicate Server Database on page 160 for more information. The following log files will be generated by the tool. Be aware that log files are only created once the first URI is written to a file, hence not all log files will be present. Table 53: Output log files Filename Description <mode>_ok.txt Lists every URI found during the check, that was not removed as a result of an inconsistency. The output from the metacheck mode in particular will list every URI with a unique checksum, and is therefore useful for comparing against the index. Be aware that documents may have been dropped by the pipeline, and thus this file may correctly list URIs not actually present in the index. However, URIs in the index that are not in this file may be safely removed from the index as it is not known by the crawler. <mode>_deleted.txt This file lists each URI deleted by the tool. Unless indexer deletes were disabled with the -n option they would also have been removed from the index. As these URIs were only deleted due to internal inconsistencies within the crawler it is entirely possible that they still exist on the web servers, and should thus rightly be indexed. Therefore, it is recommended that this list of URIs is subsequently re-crawled. This can be accomplished through the crawleradmin using the --addurifile option. To expedite crawling add the --force option. <mode>_deleted_reasons.txt The contents of this file will be the same as the previous file, with the addition of an "error code" preceding each URI. The error codes identify the reason for each URI being deleted. The following codes exist: • • • • • • • • <mode>_wrongnode.txt 190 101 - Document not found in document store 102 - Document found, but unreadable in document store 103 - Document found, but length does not match meta information 201 - Meta data for document not found 202 - Meta data found, but unreadable 203 - Meta data found, but does not match checksum in duplicate database 204 - Meta data found, but has no checksum 206 - URI's hostname not found in routing database Only ever present on a multi node crawler, this file will output all URIs removed from a particular node due to incorrect routing. This means that the URIs should Enterprise Crawler - reference information Filename Description be, and most likely also are, crawled by a different master node. Therefore, these URIs are only output to the log file, but not deleted from the index. <mode>_refeed.txt The URIs listed in this file have had their URI equivalence class updates as a result of running the tool. To bring the index in sync use postprocess refeed with the -i option to refeed the contents of this file. Alternatively perform a full refeed. It is recommended to always redirect the stdout and stderr output of this tool to a log file on disk. 
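For example, appending "> /tmp/consistency-run.log 2>&1" to the command line in a Bourne-type shell captures both output streams. Once a run has completed, the URIs listed in a <mode>_deleted.txt file can be queued for re-crawling as recommended above; the path and collection name below are placeholders, and the dated output sub directory is created under the -O directory:

$FASTSEARCH/bin/crawleradmin --addurifile mycollection:/tmp/consistency-logs/<date>/doccheck_deleted.txt --force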
Additionally, on UNIX use either screen or nohup in order to prevent the tool from terminating in case the session is disconnected. Tool: $FASTSEARCH/bin/crawlerconsistency: option [options] Table 54: crawlerconsistency options Mandatory options -M <mode>[,<mode>,..,<mode>] Description Selects the mode to run the tool in. The following modes are available • • doccheck - Verifies that all documents referenced in the meta databases also exist on disk. docrebuild - Same as above, but re-writes all referenced documents to a fresh document store, effectively getting rid of any orphans in the document store. Note: This can take a long time. • • • metacheck - Verifies that all checksums referenced in the PP databases also exist in the meta databases. metarebuild - Attempts to recovery a damaged metastore. Currently supports rebuilding a bad or lost site database based on segment databases. duprebuild: Rebuilds the contents of the Duplicate Server(s) from the local Post Process DB. Note: Exclusive mode. This mode must be run separately. Additionally, the following 'modifiers' can be specified: • updatestat - Updates the statistics document store counter. Note: Can only be used together with the doccheck/docrebuild mode. Only applies to the stored statistics counter. • routecheck - Verifies that sites/URIs are routed to the correct. Note: Only applies to multi-node crawlers. -O <path> Directory where the tool will place all output logs. The tool will create a sub directory here with a name matching the current date on the format <year><month><date>. If the directory already exists a counter will be appended, e.g. ".1" in order to ensure clean directories each time the tool is run. 191 FAST Enterprise Crawler Optional options -d <path> Description Location of crawl data, runtime configuration and logs in subdirectories in the specified directory. Default: data -C A comma separated list of collections to check. Default: All <collection>[,<collection>,...,<collection>]< collections -c <cluster>[,<cluster>,...,<cluster>] A comma separated list of clusters to check. Default: All clusters Note: Applies to: doccheck and docrebuild. -S <site>[,<site>,...,<site>] Only process the specified site(s). Default: All sites Note: Applies to: doccheck -z Compress documents in the document store when executing the docrebuild mode. This overrides the collection level option to compress documents if specified. Default: off Note: Applies to: docrebuild -i Skip free disk space checks. Normally the tool will check the amount of free disk space periodically and if it drops below 1GB it will abort the operation and exit. This option should be used with caution. -n Do not submit delete operations to the pipeline/indexer, only log them to files. In order to ensure removed documents are not present in the index afterwards it is recommended to manually delete the documents reported as deleted, or refeed the entire collection into an initially empty index. -F Load crawler global config from file. Conflicting options specified on the command line override the values in the configuration file if given. -T Test mode. Tool does not delete anything from disk or issue any deletes to the pipeline/indexer. crawlerwqdump The crawler work-queue-dump-tool writes the crawler queues that reside on disk to plain text files or to stdout. 
The following queues may be output: • • • • 192 the masters queue of resolved URIs the masters queue of unresolved URIs the masters queue of unresolved sites the slave work queues Enterprise Crawler - reference information Tool: $FASTSEARCH/bin/crawlerwqdump -d <dir> -c <collection> -t <target> -q <queue> All options must be specified. Each entry in the output contains the collection name and an URI, separated by ','. Table 55: crawlerdbtool Options Option Description -d <dir> -c <collection> Path to queue dir (data/crawler/queues). The name(s) of your collection(s). Use 'all' to process all collections. Separate collections with ',' if you specify more than one. -t <target> Output directory or 'stdout'. If you specify an output directory, the queues will be written to file and placed in <target> directory and named:<queue>.<time>.<collection>.txt ex: slavequeue.2005.12.21.11.9.mycollection.txt. -q <queue> Which queues to process: resolved/unresolved/slave/all. Example: crawlerwqdump -d $FASTSEARCH/data/crawler/queues/ -c mycollection,myothercollection -q slave" -t $FASTSEARCH/data/crawler/queuedumps/" crawlerdbexport The crawlerdbexport tool is used to dump the EC 6.3 databases to an intermediary format for subsequent import to an EC 6.7 installation, as part of the crawler store migration process. Dump files will be placed alongside the original databases, named with the suffix .dumped_nn. Tool: $FASTSEARCH/bin/crawlerdbexport [options] Table 56: crawlerdbexport options Option -m Description Required: Mode. Valid values: export, deldumps Default: export -d Required: Directory, path to crawler store ($FASTSEARCH/data/crawler). Default: none -g Name of your collection. If no collection is specified, then all collections are processed. Default: none -l Log level. 193 FAST Enterprise Crawler Option Description Valid values: normal, debug Default: normal -b Batch size. Maximum bytes per dump file. Default: 100MB crawlerstoreimport The crawlerstoreimport tool loads the crawlerdbexport dump files one by one, creates new databases and migrates the document storage, and a new 6.7 crawler store will be created. This also includes the documents stored. This section lists options for the import tool. Tool: $FASTSEARCH/bin/crawlerstoreimport [options] Table 57: crawlerstoreimport options Option -m Description Required: Mode. Valid values: import, deldumps Default: import -d Required: Directory, path to old crawler store ($FASTSEARCH/data/crawler.old). Default: none -n Required: Directory, path to new crawler store ($FASTSEARCH/data/crawler). Default: none -t Required: Node type. Valid values: ubermaster, master, ppdup Default: none -g Name of your collection. If no collection is specified, then all collections are processed. Default: none -s Storage format. Valid values: bstore, flatfile Default: current format -r Remove dump files. Valid values: 0,1 Default: 0 (no) -p 194 ppdup format. Enterprise Crawler - reference information Option Description Valid values: hashlog, gigabase Default:gigabase -l Log level. Valid values: normal, debug Default: normal Crawler Port Usage This appendix lists per process port usage for single node and multiple node crawlers. The crawler port is sometimes specified on the command line: -P <hostname>:<crawlerbaseport> <hostname> By default binds to all interfaces. <crawlerbaseport> If the FASTSEARCH environment variable is set, the port is read from the $FASTSEARCH/etc/NodeConf.xml file. If the FASTSEARCH variable is not set OR reading the port fails, port 14000 is used. 
Port range The maximum crawler port range is from <crawlerbaseport> to <crawlerbaseport>+299 Table 58: Crawler Port Usage (Single Node) Process name Purpose Port crawler XML-RPC <crawlerbaseport> Postprocess communication <crawlerbaseport> + 2 crawler Slave communication <crawlerbaseport> + 3 crawlerfs HTTP <crawlerbaseport> + 4 postprocess Slave communication <crawlerbaseport> + 5 postprocess XML-RPC <crawlerbaseport> + 6 uberslave XML-RPC <crawlerbaseport> + 7 and up to <crawlerbaseport> + 198 cglogdispatcher (GUI log dispatcher) XML-RPC <crawlerbaseport> + 199 Table 59: Crawler Port Usage (Multiple Node) Process name Purpose Port Ubermaster (crawler -U) XML-RPC <crawlerbaseport> Ubermaster (crawler -U) Master communication <crawlerbaseport>+1 Master (crawler -S) XML-RPC <crawlerbaseport> + 100 195 FAST Enterprise Crawler Process name Purpose Port Master (crawler -S) Postprocess communication <crawlerbaseport> + 102 crawler -S Slave communication <crawlerbaseport> + 103 crawlerfs HTTP <crawlerbaseport> + 104 postprocess Slave communication <crawlerbaseport> + 105 postprocess XML-RPC <crawlerbaseport> + 106 uberslave XML-RPC <crawlerbaseport> + 107 and up to <crawlerbaseport> + 198 cglogdispatcher (GUI log dispatcher) XML-RPC <crawlerbaseport> + 199 ppdup (Duplicate Server) Postprocess communication <crawlerbaseport> + 200 ppdup (Duplicate Server) Duplicate replication <crawlerbaseport> + 201 and up to <crawlerbaseport> + 298 Log Files The Enterprise Crawler creates numerous files in which to log information detailing the processing of URIs and collections. Some are created automatically, others must be enabled via configuration. Directory structure The following table describes the key directories and files in a crawler installation, relative to the FAST ESP installation root, $FASTSEARCH. Table 60: Crawler Directory Structure Structure Description $FASTSEARCH/bin /crawler Crawler executables /crawleradmin /postprocess $FASTSEARCH/lib Shared libraries and Python images $FASTSEARCH/etc FAST ESP configuration files read by crawler $FASTSEARCH/var/log/crawler Folders for detailed crawler logs /crawler.log /dns /dsfeed /header /fetch 196 Diagnostic and progress information Daily log files directories. Most of the directories are organized by collection. Enterprise Crawler - reference information Structure Description /screened /site /stats /PP $FASTSEARCH/data/crawler Folders for configuration, work queues and data/metadata store /config /queues /dsqueues $FASTSEARCH/data/crawler/store Temporary and permanent configuration data, work queues and batches; mostly binary data Data and metadata for crawled pages, organized in subdirectories by collection /db Metadata for each document gathered /data Document content /PP/csum Duplicate document checksum databases Log files and usage DNS log A directory that contains log files from DNS resolutions: $FASTSEARCH/var/log/crawler/dns Header log A directory that contains logs of HTTP request/response exchanges, separated into directories by sitename. The header log is disabled by default: $FASTSEARCH/var/log/crawler/header/<collection>/ Screened log A directory that contains log files of all URIs processed by the crawler and details for any given URI on whether or not it will be placed on the work queue: $FASTSEARCH/var/log/crawler/screened/<collection>/ The screened log is turned off by default. URIs that will be queued are logged as ALLOW others as DENY. Additionally all URIs logged as DENY will have a explanation code logged with it. 
Site log A directory that contains log files listing events in the processing of web sites. The logs contain entries listing a site being processed, a time stamp, and details of the transition in the state of that web site, such as STARTCRAWL, IDLE, REFRESH, and STOPCRAWL. $FASTSEARCH/var/log/crawler/site/<collection>/ 197 FAST Enterprise Crawler Fetch log A directory that contains log files for every collection that is populated by the crawler. The crawler logs attempted retrievals of documents to a per-collection log. Each log file describes actions taken for every URL along with a time stamp: $FASTSEARCH/var/log/crawler/fetch/<collection>/ Crawler log This file logs general diagnostic and progress information from the crawler process stdout and stderr output. The verbose level of this log is governed by the -l <level> option given to the crawler and can be modified in the crawler entry in $FASTSEARCH/etc/NodeConf.xml. Use the -l <level> option to specify the log level. Possible values are one of the following predefined log levels: debug, verbose, info, warning, error . If you adjust the level, reload the configuration file into the node controller (nctrl reloadcfg in $FASTSEARCH/bin) before stopping and starting the crawler for the change to take effect. $FASTSEARCH/var/log/crawler/crawler.log Postprocess log A directory that contains log files from postprocess. Postprocess performs duplicate detection of downloaded documents, and processes content to FAST ESP.The Postprocess log contains the URIs and referrer URI to every unique document together with their size, MIME type and URIs to any duplicates found: $FASTSEARCH/var/log/crawler/PP/<collection>/ DSfeed log A directory that contains log files for every collection that is populated by the crawler. The logs contain the status of each URI submitted to document processing. Deletes are also logged: $FASTSEARCH/var/log/crawler/dsfeed/<collection>/ Enabling all Log Files Logging options can be enabled via selection in the administrator interface, or by adding them to the XML configuration file and reloading that using the crawleradmin tool. An example of fully enabled log section from configuration file: <section name="log"> <attrib name="dsfeed" type="string"> text </attrib> <attrib name="fetch" type="string"> text </attrib> <attrib name="header" type="string"> text </attrib> <attrib name="postprocess" type="string"> text </attrib> <attrib name="screened" type="string"> text </attrib> <attrib name="site" type="string"> text </attrib> </section> Verbose and Debug Modes In cases where warnings or errors indicate that the crawler may have a problem, it may be helpful to obtain more detailed information than what is available in the daily crawler.log file. The options available for logging are the verbose mode (-v) and the debug mode (-l <value>, where <value> is often <debug>. To add these modes to the crawlers command line within FAST ESP: 1. Edit the NodeConf.xml file. To do so, find the Enterprise Crawler command specification, and add either “-v” or “-l debug” to the <parameters> string. 2. Save the change. 3. Force the node controller to reread the file. Run the command: nctrl reloadcfg The change will take effect when the crawler is next restarted 198 Enterprise Crawler - reference information Crawler Log Messages Below is a list of log messages that may be found in the $FASTSEARCH/var/log/crawler/crawler.log file. 
Severity CRITICAL Log Message(s) Cause(s) Another process, A process, most likely the crawler, is most likely another already running on the crawler port. crawler, is already running on the specified interface 'localhost:14000' Action(s) Ensure that the crawler is not already running on the port specified on the crawler command line. They may be killed if necessary. or Unable to bind master socket to interface 'localhost:14000' or Unable to open XML-RPC socket (%s:%s) or Another process, most likely another crawler, is already running and holding a lock on the file <filename> or Unable to create listen port for slave communication: socket.error: [Errno 98] Address already in use CRITICAL Unable to perform crawler license checkout: <text> The crawler was unable to retrieve a valid license. Your license may have expired. Refer to the licensing information listed in Contact Us on page iii for more information. or Unable to check out FLEXlm license. Shutting down. Contact FAST for a new license CRITICAL Lost connection to The uberslave process has detected Master. Taking down that the master is no longer running and is shutting down. This can occur if Slave either the master crashes or on normal shutdowns. Check the logs for additional information. If this was not the result of a normal shutdown please submit a bug report to FAST Technical Support. Include logs and core files if available. 199 FAST Enterprise Crawler Severity CRITICAL Log Message(s) Cause(s) Action(s) Unable to start Subordinate processes either could not Investigate system resources, process limits, check log files for error Slave/FileServer/PostProcess be started, or are failing repeatedly. messages. process or Too frequent process crashes. Shutting down CRITICAL No data directory specified Misconfiguration or ownership/permission problems. or Unable to create the data directory <directory> Verify that correct user is attempting to run crawler, and that ownership of crawler directories is correct. Recheck configuration files and command-line options. or Unable to write crawler pidfile to <directory> or Survival mode may only be used by a subordinate in a multi node setup CRITICAL Failed to load collection config <text>. Shutting down Unable to read configuration database Verify existence and or XML file. ownership/permission of configuration database or file. or Failed to load collection config specified on command line: <text> CRITICAL ERROR 200 DNS resolver error Crawler unable to contact DNS server Check system DNS configuration (e.g. to resolve names or addresses. /etc/resolv.conf), verify proper operation (e.g. using nslookup/dig or similar tool). Lost connection <name> (PID: <pid>), possibly due to crash Communication between two crawler processes has failed, possibly due to a process crash. None, the crawler will restart the process. Contact FAST Technical Support if it occurs repeatedly. Enterprise Crawler - reference information Severity ERROR ERROR ERROR ERROR WARNING WARNING WARNING WARNING WARNING Log Message(s) Cause(s) Action(s) Remote csum ownership, same URI: <URI> In a multiple node crawler setup this can occur if the same site has been routed to more than one master, or if stale data has not been properly deleted. Contact FAST Technical Support if it occurs repeatedly. 
Unable to load/create config DB '<path>': DBError: Failed to obtain database lock for <path>/config.hashdb The crawler is already running when an attempt was made to start another crawler or use a tool that requires the crawler to be stopped first. Stop the crawler. If, after waiting for at least 5 minutes, there are still crawler processes running they may need to be killed. Unable to connect to <name> process. Killing process (PID=<PID>) A process started by the master failed Check the logs for additional to connect properly to the master. The information. Contact FAST Technical process may have had startup Support if it occurs repeatedly. problems. Timeout waiting for <name> process to connect to Master, killing (PID=<PID>) A process started by the master failed Check the logs for additional to connect back within 60 seconds. The information. Contact FAST Technical process may have had startup Support if it occurs repeatedly. problems. <name> process (PID A crawler sub process identified by <pid>) terminated <name> has crashed and will be by signal <signal> restarted. (core dumped) Submit bug report to FAST Technical Support. Include logs and core file. Failed to read data from '<path>', discarding URI=<URI> The crawler was unable to read a None unless this occurs repeatedly. previously stored document from disk. Verify that there are no disk issues or This can occur if the document has other problems that could cause this. since been deleted from the web server and the crawler has a backlog. Unable to read block file index for block file <number> A document store file index was either Submit bug report to FAST Technical corrupt or missing on disk. Support. Include logs. Start URI <URI> is A start URI specified in the configuration did not pass the not valid include/exclude rule checks. Verify that all start URIs match the include/exclude rules as well as HTTP scheme and extension rules. Data Search Feeder The disk queue containing documents Delete the failed to process on disk for processing is corrupt. $FASTSEARCH/data/crawler/dsqueues packet: IOError directory and perform a PostProcess refeed as described in Re-processing Crawler Data Using postprocess on page 154. 201 FAST Enterprise Crawler Severity WARNING WARNING WARNING WARNING Log Message(s) Cause(s) KeepAlive ACK from In a multiple node crawler setup this If the log message repeats restart all can occur if the ubermaster and master crawler processes. unknown Master processes go out of sync. Unable to flush The master process work load is very If this repeats try to reduce the workload by reducing the number of Master comm channel high and communication between processes are suffering. concurrent sites being crawled or install on more powerful hardware or additional servers. <name> engine poll The specified process has a very high None, unless this occurs constantly. If workload. API calls may respond more so either decrease the work load by used <number> slowly. crawling fewer sites concurrently or seconds install on more powerful hardware or additional servers. Master ID '<ID>' already exists A master has been started with a symbolic ID that has already been specified for another in the same multiple node crawl. Stop the offending master and change the symbolic ID specified by the -I option WARNING The Browser Engine is shutdown or The Browser engine unavailable at <host>:<port> is down VERBOSE The Browser Engine is up and running The Browser engine after having been down. 
at <host>:<port> is up VERBOSE The Browser Engine is overloaded and Tune the EC to Browser Engine The Browser engine will not process new documents until communication, tune the Browser at <host>:<port> is the queue length is reduced. Engine. You may need to disable overloaded JavaScript and/or Flash processing or only enable JavaScript and/or Flash for certain sites. PROGRESS INFO Ignoring site: <sitename> (No URIs) Investigate why the Browser Engine is down and rectify it. See the FAST ESP Browser Engine Guide for more information. When postprocess re-feeding you may None needed. However, if the site get this message for sites containing should contain URIs you may wish to no URIs. try to re-crawl it or examine logs to determine why it has no URIs. Collection '<name>' The refresh cycle of the collection has You may want to increase the refresh period in order to completely crawl all is not idle by time completed and the crawler is not finished crawling all the sites. sites within the refresh period. of refresh PostProcess Log Below is a list of postprocess log messages that may be found in the $FASTSEARCH/var/log/crawler/PP/<collection>/ directory. 202 Action(s) Enterprise Crawler - reference information Severity CRITICAL CRITICAL STATUS STATUS PROGRESS Log Message Cause(s) Action(s) Must specify Master The postprocess process was run without the correct command line port arguments. The postprocess process can only be run manually in refeed mode (-R command line option), make sure the arguments are correct. Failed to start The PostProcess module failed to register with the configserver at PostProcess: ConfigServerError: initialization time. Failed to register with ConfigServer: Fault: (146, 'Connection refused') The configserver process is stopped or suspended. Restart it, wait a moment, and restart the PostProcess. Could not send batch to Data Search Content Distributor, will try again later. The error was: add_documents call with batch <batch ID> timed out The batch could not be forwarded to a None, the batch will be resent document processor since none were automatically. idle. This is a built-in throttling mechanism in FAST ESP. Waiting for Data Search to process remaining data... Hit CTRL+C to abort During refeed this message is logged Optionally signal postprocess to stop once all databases have been and resume crawler. traversed. Documents are still being sent to the Content Distributor, but it is safe to signal postprocess to stop and resume crawling as the crawler will then feed the remaining documents. Ignoring site: <sitename> (No URIs) During postprocess refeed traversal of The message can be ignored. If this the databases, a site was encountered site should have URIs associated with with no associated URIs. it, then sanity check the configuration rules and log files to discover why it has no URIs. Crawler Fetch Logs The crawler will log attempted retrievals of documents to a per-collection log located in $FASTSEARCH/var/log/crawler/fetch/<collection name>/<date>.log. The screened log is disabled by default. When enabled ("Screened" log enabled in the Logging section in the Advanced crawler collection configuration GUI), all URIs seen by the crawler and whether they will be attempted retrieved or not is located in $FASTSEARCH/var/log/crawler/screened/<collection name>/<date>.log The messages in these logs are in a whitespace-delimited format as follows: <time stamp> <status code> <outcome> <uri> [<auxiliary information>] where: 1. 
1. <time stamp> is a date/clock value denoting the time at which the request was completed or terminated.
2. <status code> contains a three-character code which describes the status of the outcome of the retrieval. When this code is a numerical value, it maps directly to the same status code in the corresponding protocol, as defined in RFC 2616 for HTTP/HTTPS and RFC 765 for FTP. The authoritative description of these status codes is always the respective protocol specification, but for convenience a subset of them is described informally below.
3. <outcome> is a somewhat more human-readable status word that describes the status of the document after retrieval.
4. <uri> denotes the URI that was requested.
5. [<auxiliary information>] contains additional information, such as descriptive error messages.

An excerpt from a fetch log is shown below (a minimal parsing sketch for this format is included at the end of this section):

2007-08-02-16:51:41 200 MODIFIED http://www.example.com/video/ JavaScript processing complete
2007-08-02-16:52:32 301 REDIRECT http://www.example.com/video/living Redirect URI=http://www.example.com/video/living/
2007-08-02-16:53:33 404 IGNORED http://www.example.com/video/living/
2007-08-02-16:54:33 200 PENDING http://www.example.com/video/live/live.html?stream=stream1 Javascript processing

Crawler Fetch Log Messages
Below is a list of log messages that may be found in the fetch log in the $FASTSEARCH/var/log/crawler/fetch/<collection>/ directory.

Table 61: Status Codes - Fetch Log

200 - HTTP 200 "OK": The request was successful. A document was retrieved following the request.

301 - HTTP 301 "Moved permanently": The document requested is available under a different URI. The target URI is shown in the auxiliary information field.

302 - HTTP 302 "Moved temporarily": The document requested is available under a different URI. The target URI is shown in the auxiliary information field. The crawler treats HTTP 301/302 identically.

303 - HTTP 303 "See Other": This method exists primarily to allow the output of a POST-activated script to redirect the crawler to a new URI.

304 - HTTP 304 "Not Modified": The document has been retrieved earlier, and was now requested conditionally so that it would only be returned if the server detected that it had changed since the last time. This is achieved by including in the request the Last-Modified time stamp or ETag given by the server the last time. "Not Modified" responses can be received when the "Send If-Modified-Since" setting is enabled in the crawler configuration GUI.

401 - HTTP 401 "Unauthorized": The web server requires that the crawler present a set of credentials when requesting the URI. This would occur using Basic or NTLM authentication.

403 - HTTP 403 "Forbidden": The web server denies the crawler access to the URI, either because the crawler presented a bad set of credentials or because none were given at all.

404 - HTTP 404 "Not Found": The web server does not know about the requested URI. Commonly, this is because a "dead link" was seen by the crawler.

406 - HTTP 406 "Not Acceptable": The web server has determined that the crawler is not configured to receive the type of page it has requested.

500 - HTTP 500 "Internal Server Error": Some unspecified error happened at the server when the request was serviced.

503 - HTTP 503 "Service unavailable": The server is currently unable to serve the request. This can, for instance, imply that the server is overloaded.
226 - FTP 226 "Closing data connection": An operation was performed to completion.

426 - FTP 426 "Connection closed; transfer aborted": The retrieval of the document was aborted.

ERR: A non-HTTP error occurred (crawler or network related). Details of the error are shown in the auxiliary information field.

TTL: The retrieval of the document exceeded the timeout setting. The number of seconds to wait before an uncompleted request is terminated is governed by the "Fetch timeout" setting in the crawler configuration GUI.

DSW: The document has been "garbage-collected" as it has not been seen for a number of crawler refresh cycles. This means that the crawler has crawled more data than it can deterministically re-crawl during one crawler refresh cycle, and that documents are periodically purged if they have not been seen recently. The number of refresh cycles to wait before purging documents is governed by the "DB switch interval" setting in the crawler configuration GUI. Alternatively, the documents are being deleted because they have become unreachable, either due to modifications in the crawler configuration or on the website itself.

STU: Start URI was deleted. The specified start URI had been removed and excluded from the configuration and has now been removed from the crawler store.

USC: The URI added from the crawler API (e.g. using crawleradmin) was deleted. A URI crawled earlier has now been excluded by the configuration, and upon re-adding it the URI is removed from the crawler store.

USR: The URI was deleted through the external API (e.g. using crawleradmin). Unless also excluded by the configuration it may be re-crawled later.

RSS: The URI was deleted due to the RSS settings. The document was deleted either because it was too old, or because the maximum allowed number of documents for the feed has been reached.

Table 62: Outcome Codes - Fetch Log

NEW: The retrieved document was seen for the first time by the crawler and will be further processed, pending final duplicate checks.

UNCHANGED: The retrieved document was seen before, and the retrieved version did not differ from the one retrieved the last time.

MODIFIED: The retrieved document was seen before, and the retrieved version differed from the one retrieved the last time. The updated version will be further processed, pending final duplicate checks.

REDIRECT: The retrieval of the document resulted in a redirect, that is, the server indicated that the document is available at a different location. The redirect target URI will be retrieved later, if applicable.

EXCLUDED: The document was retrieved, but properties of the response header or body caused it to be excluded by the crawler configuration. Details of the cause are shown in the auxiliary information field. Commonly, the data was of a MIME type not allowed in the crawler configuration.

DUPLICATE: The document was retrieved, but detected as a duplicate in the first level crawler duplicate check. The document will not be processed further.

IGNORED: The retrieval of the document failed because of a protocol or other error. If not evident from the status code (like HTTP 404, 403, and so on), details of the cause are shown in the auxiliary information field. If the document had been retried a number of times and failed in all attempts, the last attempt will be flagged as IGNORED.

DEPENDS: The document was retrieved, but has dependencies on external documents required for further processing. Currently, this means that JavaScript is enabled in the crawler configuration and that the document contained references to external JavaScripts. The document will be further processed when all dependencies have been retrieved.

DEP: The retrieved document was depended on by another document.

PENDING: A document was sent to an external component for processing. Examples include JavaScript and Flash documents, which are processed by the Browser Engine. There might be multiple pending messages for the same URI.

DELETED: The document had been retrieved earlier but is no longer available, or the document had been retrieved earlier but has not been seen for a number of refresh cycles (refer to the DSW status code). The document will be flagged as deleted.

RETRY: An error occurred and the retrieval of the document will be retried. The number of times a document is retried is governed by the "HTTP Errors"/"FTP Errors" settings in the crawler configuration GUI.

CANCELLED: The Browser Engine canceled processing of a document. This is caused either by the Browser Engine being shut down, or by processing of the document timing out. If the cancel operation was caused by a timeout, a text message will be logged in the auxiliary information field.

STOREFAIL: The crawler experienced an error saving the document contents to disk.

AUTHRETRY or AUTHPROXY: The crawler was denied access to the URI, either directly or via the proxy. If Basic or NTLM authentication has been configured, this may be a normal part of the protocol.

FAILED: The crawler was unable to complete internal processing of the document. Check the crawler log file to see if additional details were noted.

LIST: A directory listing was retrieved via FTP to obtain URIs for FTP documents.

RSSFEED: A new RSS feed has been detected by the crawler. These feeds are processed as specified by the RSS settings of the collection.

SITEMAP: A sitemap or sitemap index has been detected and parsed by the crawler.

Table 63: Crawler Fetch Log Auxiliary Field Messages

Referrer=<referrer uri>: The URI was referred by the given referrer URI.

Redirect URI=<target uri>: The retrieved URI redirects to the given target URI.

Empty document: The document was retrieved but contained no data.

META Robots <directives>: The document contained the given HTML META Robots directives.

MIME type: <MIME-type>: The document was of the specified MIME type (and was excluded because of this).

Connection reset by peer: The connection to the server was reset by the server (BSD socket error message).

Connection refused: The connection to the server could not be established (BSD socket error message).

Crawler Screened Log Messages
Below is a list of screened log messages that may be found in the $FASTSEARCH/var/log/crawler/screened/<collection>/ directory.

Table 64: Status Codes - Screened Log

ALLOW: The URI is eligible for retrieval according to the crawler configuration.

DENY: The URI is not eligible for retrieval according to either the crawler configuration, the robots.txt file, or HTML robots META tags.

Table 65: Outcome Codes - Screened Log

OK: The URI was allowed and is eligible for retrieval.

URI: The URI was disallowed due to URI inclusion/exclusion settings in the crawler configuration. These are the URI include/exclude filters and Extension excludes in the crawler configuration GUI.

DOMAIN: The URI was disallowed due to hostname inclusion/exclusion settings in the crawler configuration. This is governed by the Hostname include/exclude filters settings in the crawler configuration GUI.

ROBOTS: The URI was disallowed due to restrictions imposed by the robots.txt file on the web server for its site. Additional discussion of ROBOTS issues can be found in the section External limits on fetching pages on page 19.

LINKTYPE: The URI was disallowed due to the type of HTML tag it was extracted from. Allowed link types are governed by the Link extraction settings in the crawler configuration GUI.

NOFOLLOW: The URI was disallowed because the referring document contains an HTML META robots tag disallowing URI extraction. Additional discussion of ROBOTS issues can be found in the section External limits on fetching pages on page 19.

SCHEME: The URI had a disallowed scheme. Allowed schemes are specified in the Allowed schemes setting in the crawler configuration GUI.

FWDLINK: Forwarding of non-local links is disabled. The URI was disallowed because the crawler configuration disallows following links from one website to another. This is governed by the Follow cross-site URIs setting in the crawler configuration GUI.

PARSEERR: Failed to parse the URI into URI components, i.e. scheme, host, path, and other elements if specified.

LENGTH: The URI is too long to process and store.

WQMAXDOC: The maximum number of documents for the site has been reached.

WQEXISTS: The URI is already queued on the work queue and will not be queued again.

NOTNEW: In Adaptive refresh mode (only), this URI has already been queued as a previously fetched entry.

REFRESHONLY: In Refreshing mode (only), this URI is not a previously fetched entry, and so will be ignored.

RECURSION: The maximum level of URI recursion (i.e. repeated patterns in the path element) has been reached.

MAXREDIRS: The maximum number of redirects has been reached.

Crawler Site Log Messages
Below is a list of log messages that may be found in the site log in the $FASTSEARCH/var/log/crawler/site/<collection>/ directory.

Table 66: Status Codes - Site Log

STARTCRAWL <site>: The crawler will start crawling the specified site.

STOPCRAWL <site>: Crawling of the specified site stopped voluntarily. Look for the associated IDLE log message to determine the cause.

STOPCRAWL SHUTDOWN <site>: Crawling of the specified site was stopped due to the crawler being shut down.

REFRESH <site>: The specified site is refreshing.

REFRESH FORCE <site>: The specified site will be refreshed as a result of a user-initiated force re-fetch operation.

REFRESHSKIPPED NOTIDLE <site>: The specified site skipped refresh due to not yet having finished the previous refresh cycle. This event can occur only when refresh mode soft is used.

WQERASE MAXDOC <site> Max doc limit <count> reached, erasing work queue: The specified site has reached the maximum documents per cycle setting specified in the configuration. The remaining work queues have been erased.

IDLE <reason> <site> <detailed reason>: The specified site has gone idle (stopped crawling). The reason is given by <reason> and <detailed reason>.

REPROCESS <site> Ready for reprocessing: The crawler has been notified to reprocess (refeed) the specified site in the crawler store.
DELSITE DELSITECMD <site> Ready for deletion from crawler store: The crawler has been notified to initiate the deletion of the specified site from the crawler store.

DELURIS DELURICMD <site> <count> URIs ready for deletion from crawler store: The crawler has been notified to initiate the deletion of the specified URIs from the crawler store.

LOGIN GET/POST <site> Performing Authentication: The crawler has initiated the form login sequence for the specified site.

LOGGEDIN <site> Through <login site>: The crawler has successfully logged into the specified site through <login site>.

DELAYCHANGED ROBOTSDELAY <site> Set to <delay> seconds: The robots.txt file of the current site has changed the crawl delay to <delay> seconds by specifying the "Crawl-Delay" directive.

BLACKLIST <site> Blacklisted for <time span> seconds: The specified site was blacklisted for <time span> seconds. During this time no downloads will be performed for the site, but URIs will be kept on work queues. Once the blacklisting expires, URIs will be eligible for crawling again. A blacklist operation may have been the result of either an explicit user action (e.g. through the crawleradmin tool) or an internal backoff mechanism within the crawler itself.

UNBLACKLIST EXPIRED <site>: A site previously blacklisted is no longer blacklisted, and may resume crawling if there is available capacity.

JSENGINE DOWN <site> Crawling paused: The Browser Engine is down and the crawler has paused crawling the site.

JSENGINE OVERLOADED <site> Crawling paused: All the available Browser Engines are overloaded and the crawler has stopped sending requests to the Browser Engine.

JSENGINE UP <site> Crawling resumed: The Browser Engine is ready to process documents after having been down or overloaded.
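The fetch and screened logs share the whitespace-delimited line format described under Crawler Fetch Logs above. The following is a minimal sketch, not part of the crawler distribution, showing one way such a log could be summarized offline; the script name and the log path in the usage comment are examples only and should be replaced with an actual $FASTSEARCH/var/log/crawler/fetch/<collection>/<date>.log file.

#!/usr/bin/env python
# Minimal sketch (hypothetical helper, not shipped with the crawler):
# tally status and outcome codes from a fetch or screened log whose lines use
#   <time stamp> <status code> <outcome> <uri> [<auxiliary information>]
# Example usage:
#   python fetchlog_summary.py $FASTSEARCH/var/log/crawler/fetch/mycollection/2007-08-02.log

import sys
from collections import defaultdict

def summarize(path):
    status_counts = defaultdict(int)
    outcome_counts = defaultdict(int)
    log = open(path)
    try:
        for line in log:
            # The auxiliary information field may itself contain whitespace,
            # so split into at most five fields.
            fields = line.strip().split(None, 4)
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            status, outcome = fields[1], fields[2]
            status_counts[status] += 1
            outcome_counts[outcome] += 1
    finally:
        log.close()
    return status_counts, outcome_counts

if __name__ == "__main__":
    statuses, outcomes = summarize(sys.argv[1])
    print("Status codes:  %s" % dict(statuses))
    print("Outcome codes: %s" % dict(outcomes))

The same approach applies to the screened log, since it uses the same field layout; only the codes differ (ALLOW/DENY status codes and the outcome codes listed in Table 65).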