The Web Robots Pages
Transcription
The Web Robots Pages
Web Robots are programs that traverse the Web automatically. Some people call them Web Wanderers, Crawlers, or Spiders. These pages have further information about these Web Robots.

● The Web Robots FAQ: Frequently Asked Questions about Web Robots, from Web users, Web authors, and Robot implementors.
● Robots Exclusion: Find out what you can do to direct robots that visit your Web site.
● A List of Robots: A database of currently known robots, with descriptions and contact details.
● The Robots Mailing List: An archived mailing list for discussion of technical aspects of designing, building, and operating Web Robots.
● Articles and Papers: Background reading for people interested in Web Robots.
● Related Sites: Some references to other sites that concern Web Robots.

The Web Robots FAQ

These are frequently asked questions about Web robots. Send suggestions and comments to Martijn Koster. This information is in the public domain.

Table of Contents

1. About WWW robots
   ❍ What is a WWW robot?
   ❍ What is an agent?
   ❍ What is a search engine?
   ❍ What kinds of robots are there?
   ❍ So what are Robots, Spiders, Web Crawlers, Worms, Ants?
   ❍ Aren't robots bad for the web?
   ❍ Are there any robot books?
   ❍ Where do I find out more about robots?
2. Indexing robots
   ❍ How does a robot decide where to visit?
   ❍ How does an indexing robot decide what to index?
   ❍ How do I register my page with a robot?
3. For Server Administrators
   ❍ How do I know if I've been visited by a robot?
   ❍ I've been visited by a robot! Now what?
   ❍ A robot is traversing my whole site too fast!
   ❍ How do I keep a robot off my server?
4. Robots exclusion standard
   ❍ Why do I find entries for /robots.txt in my log files?
   ❍ How do I prevent robots scanning my site?
   ❍ Where do I find out how /robots.txt files work?
   ❍ Will the /robots.txt standard be extended?
   ❍ What if I cannot make a /robots.txt?
5. Availability
   ❍ Where can I use a robot?
   ❍ Where can I get a robot?
   ❍ Where can I get the source code for a robot?
   ❍ I'm writing a robot, what do I need to be careful of?
   ❍ I've written a robot, how do I list it?

About Web Robots

What is a WWW robot?

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long period of time, it is still a robot. Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case, a robot simply visits sites by requesting documents from them.

What is an agent?

The word "agent" is used for lots of meanings in computing these days. Specifically:

Autonomous agents: programs that do travel between sites, deciding themselves when to move and what to do.
These can only travel between special servers and are currently not widespread in the Internet.

Intelligent agents: programs that help users with things, such as choosing a product, guiding a user through form filling, or even helping users find things. These generally have little to do with networking.

User-agent: a technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Navigator and Microsoft Internet Explorer, and Email User-agents like Qualcomm Eudora, etc.

What is a search engine?

A search engine is a program that searches through some dataset. In the context of the Web, the word "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot.

What other kinds of robots are there?

Robots can be used for a number of purposes:

● Indexing
● HTML validation
● Link validation
● "What's New" monitoring
● Mirroring

See the list of active robots to see what robot does what. Don't ask me -- all I know is what's on the list...

So what are Robots, Spiders, Web Crawlers, Worms, Ants?

They're all names for the same sort of thing, with slightly different connotations:

Robots: the generic name, see above.
Spiders: same as robots, but sounds cooler in the press.
Worms: same as robots, although technically a worm is a replicating program, unlike a robot.
Web crawlers: same as robots, but note WebCrawler is a specific robot.
WebAnts: distributed cooperating robots.

Aren't robots bad for the web?

There are a few reasons people believe robots are bad for the Web:

● Certain robot implementations can (and have in the past) overload networks and servers. This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.
● Robots are operated by humans, who make mistakes in configuration, or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.
● Web-wide indexing robots build a central database of documents, which doesn't scale too well to millions of documents on millions of sites.

But at the same time the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions.

So no, robots aren't inherently bad, nor inherently brilliant, but they do need careful attention.

Are there any robot books?

Yes:

Internet Agents: Spiders, Wanderers, Brokers, and Bots by Fah-Chun Cheong. This book covers Web robots, commerce transaction agents, Mud agents, and a few others. It includes source code for a simple Web robot built on top of libwww-perl4. Its coverage of HTTP, HTML, and Web libraries is a bit too thin to be a "how to write a web robot" book, but it provides useful background reading and a good overview of the state-of-the-art, especially if you haven't got the time to find all the info yourself on the Web. Published by New Riders, ISBN 1-56205-463-5.

Bots and Other Internet Beasties by Joseph Williams. I haven't seen this myself, but someone said: The Williams book 'Bots and other Internet Beasties' was quite disappointing.
It claims to be a 'how to' book on writing robots, but my impression is that it is nothing more than a collection of chapters, written by various people involved in this area and subsequently bound together. Published by Sams, ISBN 1-57521-016-9.

Web Client Programming with Perl by Clinton Wong. This O'Reilly book is planned for Fall 1996; check the O'Reilly Web site for the current status. It promises to be a practical book, but I haven't seen it yet.

A few others can be found on the Software Agents Mailing List FAQ.

Where do I find out more about robots?

There is a Web robots home page on:

    http://info.webcrawler.com/mak/projects/robots/robots.html

While this is hosted at one of the major robots' sites, it is an unbiased and reasonably comprehensive collection of information which is maintained by Martijn Koster <[email protected]>. Of course the latest version of this FAQ is there. You'll also find details and an archive of the robots mailing list, which is intended for technical discussions about robots.

Indexing robots

How does a robot decide where to visit?

This depends on the robot; each one uses different strategies. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web. Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot. Sometimes other sources for URLs are used, such as scanners through USENET postings, published mailing list archives, etc. Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.

How does an indexing robot decide what to index?

If an indexing robot knows about a document, it may decide to parse it, and insert it into its database. How this is done depends on the robot: some robots index the HTML Titles, or the first few paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML constructs, etc. Some parse the META tag, or other special hidden tags. We hope that as the Web evolves more facilities become available to efficiently associate meta data such as indexing information with a document. This is being worked on...

How do I register my page with a robot?

You guessed it, it depends on the service :-) Most services have a link to a URL submission form on their search page. Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you.

For Server Administrators

How do I know if I've been visited by a robot?

You can check your server logs for sites that retrieve many documents, especially in a short time. If your server supports User-agent logging you can check for retrievals with unusual User-agent header values. Finally, if you notice a site repeatedly checking for the file '/robots.txt', chances are that is a robot too.

I've been visited by a robot! Now what?

Well, nothing :-) The whole idea is they are automatic; you don't need to do anything. If you think you have discovered a new robot (i.e. one that is not listed on the list of active robots), and it does more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!
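As a concrete illustration of the log-checking advice above, here is a minimal Python sketch that flags likely robot visits. It is not part of the original FAQ, and it assumes a combined-format access log (host, request line, and quoted User-agent value) in a hypothetical file called access.log; adjust the pattern to your own server's log format.

    import re
    from collections import Counter

    # host ident user [date] "METHOD path HTTP/x.x" status size "referer" "user-agent"
    LINE = re.compile(r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
                      r'"(?:\S+) (?P<path>\S+)[^"]*" \S+ \S+ "[^"]*" "(?P<agent>[^"]*)"')

    requests_per_host = Counter()
    robots_txt_fetches = Counter()
    agents = Counter()

    with open("access.log") as log:                 # hypothetical file name
        for line in log:
            match = LINE.match(line)
            if not match:
                continue
            requests_per_host[match.group("host")] += 1
            agents[match.group("agent")] += 1
            if match.group("path") == "/robots.txt":
                robots_txt_fetches[match.group("host")] += 1

    print("Clients that asked for /robots.txt (probably robots):")
    for host, count in robots_txt_fetches.most_common():
        print(f"  {host}: {count} fetch(es), {requests_per_host[host]} requests in total")

    print("Most common User-agent values:")
    for agent, count in agents.most_common(10):
        print(f"  {count:6d}  {agent}")

Clients that fetch '/robots.txt', send an unusual User-agent, or account for a disproportionate share of requests are the ones worth a closer look.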
A robot is traversing my whole site too fast!

This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log file.

First of all check if it is a problem by checking the load of your server, and monitoring your server's error log and concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a load of even several requests per second, especially if the visits are quick.

However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.

If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps in investigating the problem later. Secondly, try and find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can identify a site this way, you can email the person responsible, and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.

If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.

How do I keep a robot off my server?

Read the next section...

Robots exclusion standard

Why do I find entries for /robots.txt in my log files?

They are probably from robots trying to see if you have specified any rules for them using the Standard for Robot Exclusion, see also below. If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server. Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)

How do I prevent robots scanning my site?

The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:

    User-agent: *
    Disallow: /

but it's easy to be more selective than that.

Where do I find out how /robots.txt files work?

You can read the whole standard specification, but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

    # /robots.txt file for http://webcrawler.com/
    # mail [email protected] for constructive criticism

    User-agent: webcrawler
    Disallow:

    User-agent: lycra
    Disallow: /

    User-agent: *
    Disallow: /tmp
    Disallow: /logs

The first two lines, starting with '#', specify a comment.

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note that '*' is a special token, meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
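To illustrate how a robot interprets the example above, here is a short sketch using Python's standard urllib.robotparser module, which implements this exclusion convention; the expected answers in the comments follow from the rules just described. The code itself is not part of the original page.

    import urllib.robotparser

    EXAMPLE = """\
    # /robots.txt file for http://webcrawler.com/

    User-agent: webcrawler
    Disallow:

    User-agent: lycra
    Disallow: /

    User-agent: *
    Disallow: /tmp
    Disallow: /logs
    """

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(EXAMPLE.splitlines())

    # 'webcrawler' has nothing disallowed: it may go anywhere.
    print(parser.can_fetch("webcrawler", "http://webcrawler.com/tmp/x.html"))  # True
    # 'lycra' has the entire site closed off.
    print(parser.can_fetch("lycra", "http://webcrawler.com/index.html"))       # False
    # Any other robot must stay away from URLs starting with /tmp or /logs.
    print(parser.can_fetch("somebot", "http://webcrawler.com/tmp/x.html"))     # False
    print(parser.can_fetch("somebot", "http://webcrawler.com/index.html"))     # True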
Two common errors:

● Wildcards are not supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp'.
● You shouldn't put more than one path on a Disallow line (this may change in a future version of the spec).

Will the /robots.txt standard be extended?

Probably... there are some ideas floating around. They haven't made it into a coherent proposal because of time constraints, and because there is little pressure. Mail suggestions to the robots mailing list, and check the robots home page for work in progress.

What if I can't make a /robots.txt file?

Sometimes you cannot make a /robots.txt file, because you don't administer the entire server. All is not lost: there is a new standard for using HTML META tags to keep robots out of your documents. The basic idea is that if you include a tag like:

    <META NAME="ROBOTS" CONTENT="NOINDEX">

in your HTML document, that document won't be indexed. If you do:

    <META NAME="ROBOTS" CONTENT="NOFOLLOW">

the links in that document will not be parsed by the robot.

Availability

Where can I use a robot?

If you mean a search service, check out the various directory pages on the Web, such as Netscape's Exploring the Net, or try one of the Meta search services such as MetaSearch.

Where can I get a robot?

Well, you can have a look at the list of robots; I'm starting to indicate their public availability slowly. In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and Verity's.

Where can I get the source code for a robot?

See above -- some may be willing to give out source code. Alternatively check out the libwww-perl5 package, which has a simple example.

I'm writing a robot, what do I need to be careful of?

Lots. First read through all the stuff on the robot page, then read the proceedings of past WWW Conferences, and the complete HTTP and HTML specs. Yes; it's a lot of work :-)

I've written a robot, how do I list it?

Simply fill in a form you can find on The Web Robots Database and email it to me.

Robots Exclusion

Sometimes people find they have been indexed by an indexing robot, or that a resource discovery robot has visited part of a site that for some reason shouldn't be visited by robots.

In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms:

The Robots Exclusion Protocol: a Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, in http://.../robots.txt.

The Robots META tag: a Web author can indicate if a page may or may not be indexed, or analysed for links, through the use of a special HTML META tag.

The remainder of this page provides full details on these facilities.

Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work for every Robot.
If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.

The Robots Exclusion Protocol

The Robots Exclusion Protocol is a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot.

In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents for records like:

    User-agent: *
    Disallow: /

to see if it is allowed to retrieve the document. The precise details on how these rules can be specified, and what they mean, can be found in:

● Web Server Administrator's Guide to the Robots Exclusion Protocol
● HTML Author's Guide to the Robots Exclusion Protocol
● The original 1994 protocol description, as currently deployed
● The revised Internet-Draft specification, which is not yet completed or implemented

The Robots META tag

The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links. No server administrator action is required.

Note that currently only a few robots implement this.

In this simple example:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

a robot should neither index this document, nor analyse it for links.

Full details on how this tag works are provided in:

● Web Server Administrator's Guide to the Robots META tag
● HTML Author's Guide to the Robots META tag
● The original notes from the May 1996 Indexing Workshop

The Web Robots Database

The List of Active Robots has been changed to a new format, called The Web Robots Database. This format will allow more information to be stored, updates to happen faster, and the information to be more clearly presented. Note that now that robot technology is being used in increasing numbers of end-user products, this list is becoming less useful and complete. For general information on robots see the Web Robots Pages.

The robot information is now stored in individual files, with several HTML tables providing different views of the data:

● View Names
● View Type Details using tables
● View Contact Details using tables

Browsers without support for tables can consult the overview of text files. The combined raw data in machine-readable format is available in a text file. To add a new robot, fill in this empty template, using this schema description, and email it to [email protected]

Others

There are robots out there that the database contains no details on. If/when I get those details they will be added; otherwise they'll remain on the list below, as unresponsive or unknown sites.

Services with no information

These services must use robots, but haven't replied to requests for an entry...

Magellan
    User-agent field: Wobot/1.00
    From: mckinley.mckinley.com (206.214.202.2) and galileo.mckinley.com (206.214.202.45)
    Honors "robots.txt": yes
    Contact: [email protected] (or possibly: [email protected])
    Purpose: Resource discovery for Magellan (http://www.mckinley.com/)

User Agents

These look like new robots, but have no contact info...
    BizBot04 kirk.overleaf.com
    HappyBot (gserver.kw.net)
    CaliforniaBrownSpider
    EI*Net/0.1 libwww/0.1
    Ibot/1.0 libwww-perl/0.40
    Merritt/1.0
    StatFetcher/1.0
    TeacherSoft/1.0 libwww/2.17
    WWW Collector
    processor/0.0ALPHA libwww-perl/0.20
    wobot/1.0 from 206.214.202.45
    Libertech-Rover www.libertech.com?
    WhoWhere Robot
    ITI Spider
    w3index
    MyCNNSpider
    SummyCrawler
    OGspider
    linklooker
    CyberSpyder ([email protected])
    SlowBot
    heraSpider
    Surfbot
    Bizbot003
    WebWalker
    SandBot
    EnigmaBot
    spyder3.microsys.com
    www.freeloader.com.

Hosts

These have no known user-agent, but have requested /robots.txt repeatedly or exhibited crawling patterns.

    205.252.60.71
    194.20.32.131
    198.5.209.201
    acke.dc.luth.se
    dallas.mt.cs.cmu.edu
    darkwing.cadvision.com
    waldec.com
    www2000.ogsm.vanderbilt.edu
    unet.ca
    murph.cais.net (rapid fire... sigh)
    spyder3.microsys.com
    www.freeloader.com.

Some other robots are mentioned in a list of Japanese Search Engines.

WWW Robots Mailing List

Note: this mailing list was formerly located at [email protected]. This list has moved to [email protected].

Charter

The [email protected] mailing list is intended as a technical forum for authors, maintainers and administrators of WWW robots. Its aim is to maximise the benefits WWW robots can offer while minimising drawbacks and duplication of effort. It is intended to address both development and operational aspects of WWW robots.

This list is not intended for general discussion of WWW development efforts, or as a first line of support for users of robot facilities.

Postings to this list are informal, and decisions and recommendations formulated here do not constitute any official standards. Postings to this list will be made available publicly through a mailing list archive. Neither the administrator of this list nor his company accepts any responsibility for the content of the postings.

Administrativa

These few rules of etiquette make the administrator's life easier, and this list (and others) more productive and enjoyable:

● When subscribing to this list, check any auto-responder ("vacation") software, and make sure it doesn't reply to messages from this list. X.400 and LAN email systems are notorious for positive delivery reports...
● If your email address changes, please unsubscribe and resubscribe rather than just let the subscription go stale: this saves the administrator work (and frustration).
● When first joining the list, glance through the archive (details below) or listen in for a while before posting, so you get a feel for the kind of traffic on the list.
● Never send "unsubscribe" messages to the list itself.
● Don't post unrelated or repeated advertising to the list.

Subscription Details

To subscribe to this list, send a mail message to [email protected], with the word subscribe on the first line of the body.

To unsubscribe from this list, send a mail message to [email protected], with the word unsubscribe on the first line of the body. Should this fail or should you otherwise need human assistance, send a message to [email protected].

To send a message to all subscribers on the list itself, mail [email protected].

The Archive

Messages to this list are archived.
The preferred way of accessing the archived messages is using the Robots Mailing List Archive provided by Hypermail.

Behind the scenes this list is currently managed by Majordomo, an automated mailing list manager written in Perl. Majordomo also allows access to archived messages; send mail to [email protected] with the word help in the body to find out how.

Martijn Koster

Articles and Papers about WWW Robots

This is a list of papers related to robots. Formatted suggestions gracefully accepted. See also the FAQ on books and information.

● Protocol Gives Sites Way To Keep Out The 'Bots, Jeremy Carl, Web Week, Volume 1, Issue 7, November 1995 (no longer online)
● Robots in the Web: threat or treat?, Martijn Koster, ConneXions, Volume 9, No. 4, April 1995
● Guidelines for Robot Writers, Martijn Koster, 1993
● Evaluation of the Standard for Robots Exclusion, Martijn Koster, 1996

WWW Robots Related Sites

● Bot Spot: "The Spot for All Bots on the Net".
● The Web Robots Pages: Martijn Koster's pages on robots, specifically robot exclusion.
● Japanese Search Engines: a comprehensive index for searching, submitting, and navigating using Japanese search engines.
● Search Engine Watch: a site with information about many search engines, including comparisons. Some information is available to subscribers only.
● RoboGen: a visual editor for Robot Exclusion Files; it allows one to create agent rules by logging onto your FTP server and selecting files and directories.

Martijn Koster's Home Page

My name is Martijn Koster. I currently work as a consultant software engineer at Excite, primarily working on Excite Inbox. You may be interested in my projects and publications, or a short biography.

DO NOT ASK ME TO REMOVE OR ADD YOUR URLS TO ANY SEARCH ENGINES! I can't even do it if I wanted to. To contact me about anything else, email [email protected].

Disclaimer: These pages represent my personal views; I do not speak for my employer.

Copyright 1995-2000 Martijn Koster. All rights reserved.

A Standard for Robot Exclusion

Table of contents:

● Status of this document
● Introduction
● Method
● Format
● Examples
● Example Code
● Author's Address

Status of this document

This document represents a consensus on 30 June 1994 on the robots mailing list ([email protected]) [Note: the Robots mailing list has relocated to WebCrawler. See the Robots pages at WebCrawler for details], between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list ([email protected]). This document is based on a previous working draft under the same title.

It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.
The latest version of this document can be found at http://info.webcrawler.com/mak/projects/robots/robots.html.

Introduction

WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.

In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

The Method

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt". The contents of this file are specified below.

This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval.

A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside of the scope of this document.

The choice of the URL was motivated by several criteria:

● The filename should fit in file naming restrictions of all common operating systems.
● The filename extension should not require extra server configuration.
● The filename should indicate the purpose of the file and be easy to remember.
● The likelihood of a clash with existing files should be minimal.

The Format

The format and semantics of the "/robots.txt" file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.

Comments can be included in the file using UNIX Bourne shell conventions: the '#' character is used to indicate that preceding space (if any) and the remainder of the line up to the line termination is discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.

User-agent

The value of this field is the name of the robot the record is describing access policy for. If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

Disallow

The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.

The presence of an empty "/robots.txt" file has no explicit associated semantics; it will be treated as if it was not present, i.e. all robots will consider themselves welcome.

Examples

The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or /foo.html:

    # robots.txt for http://www.example.com/

    User-agent: *
    Disallow: /cyberworld/map/ # This is an infinite virtual URL space
    Disallow: /tmp/ # these will soon disappear
    Disallow: /foo.html

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":

    # robots.txt for http://www.example.com/

    User-agent: *
    Disallow: /cyberworld/map/ # This is an infinite virtual URL space

    # Cybermapper knows where to go.
    User-agent: cybermapper
    Disallow:

This example indicates that no robots should visit this site further:

    # go away
    User-agent: *
    Disallow: /

Example Code

Although it is not part of this specification, some example code in Perl is available in norobots.pl. It is a bit more flexible in its parsing than this document specifies, and is provided as-is, without warranty.

Note: This code is no longer available. Instead I recommend using the robots exclusion code in the Perl libwww-perl5 library, available from CPAN in the LWP directory.

Author's Address

Martijn Koster <[email protected]>

Web Server Administrator's Guide to the Robots Exclusion Protocol

This guide is aimed at Web Server Administrators who want to use the Robots Exclusion Protocol. Note that this is not a specification -- for details and formal syntax and definition see the specification.

Introduction

The Robots Exclusion Protocol is very straightforward. In a nutshell it works like this: when a compliant Web Robot visits a site, it first checks for a "/robots.txt" URL on the site. If this URL exists, the Robot parses its contents for directives that instruct the robot not to visit certain parts of the site. As a Web Server Administrator you can create directives that make sense for your site. This page tells you how.

Where to create the robots.txt file

The Robot will simply look for a "/robots.txt" URL on your site, where a site is defined as an HTTP server running on a particular host and port number. For example:

    Site URL                     Corresponding robots.txt URL
    http://www.w3.org/           http://www.w3.org/robots.txt
    http://www.w3.org:80/        http://www.w3.org:80/robots.txt
    http://www.w3.org:1234/      http://www.w3.org:1234/robots.txt
    http://w3.org/               http://w3.org/robots.txt
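The mapping from a site URL to its "/robots.txt" URL is mechanical, as the table shows: keep the scheme, host and port, and replace the path. A minimal Python sketch of that rule (an illustration, not part of the original guide; the example path in the last line is made up):

    from urllib.parse import urlsplit, urlunsplit

    def robots_txt_url(site_url):
        """Return the /robots.txt URL for the site that serves site_url."""
        parts = urlsplit(site_url)
        return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

    assert robots_txt_url("http://www.w3.org/") == "http://www.w3.org/robots.txt"
    assert robots_txt_url("http://www.w3.org:1234/") == "http://www.w3.org:1234/robots.txt"
    assert robots_txt_url("http://w3.org/some/page.html") == "http://w3.org/robots.txt"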
Note that there can only be a single "/robots.txt" on a site. Specifically, you should not put "robots.txt" files in user directories, because a robot will never look at them. If you want your users to be able to create their own "robots.txt", you will need to merge them all into a single "/robots.txt". If you don't want to do this your users might want to use the Robots META Tag instead.

Also, remember that URLs are case sensitive, and "/robots.txt" must be all lower-case.

Pointless robots.txt URLs:

    http://www.w3.org/admin/robots.txt
    http://www.w3.org/~timbl/robots.txt
    ftp://ftp.w3.com/robots.txt

So, you need to provide the "/robots.txt" in the top-level of your URL space. How to do this depends on your particular server software and configuration. For most servers it means creating a file in your top-level server directory. On a UNIX machine this might be /usr/local/etc/httpd/htdocs/robots.txt

What to put into the robots.txt file

The "/robots.txt" file usually contains a record looking like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /~joe/

In this example, three directories are excluded.

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines in a record, as they are used to delimit multiple records.

Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server

    User-agent: *
    Disallow: /

To allow all robots complete access

    User-agent: *
    Disallow:

Or create an empty "/robots.txt" file.

To exclude all robots from part of the server

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/

To exclude a single robot

    User-agent: BadBot
    Disallow: /

To allow a single robot

    User-agent: WebCrawler
    Disallow:

    User-agent: *
    Disallow: /

To exclude all files except one

This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:

    User-agent: *
    Disallow: /~joe/docs/

Alternatively you can explicitly disallow all disallowed pages:

    User-agent: *
    Disallow: /~joe/private.html
    Disallow: /~joe/foo.html
    Disallow: /~joe/bar.html

HTML Author's Guide to the Robots Exclusion Protocol

The Robots Exclusion Protocol requires that instructions are placed in a URL "/robots.txt", i.e.
in the top-level of your server's document space. If you rent space for your HTML files on the server of your Internet Service Provider, or another third party, you are usually not allowed to install or modify files in the top-level of the server's document space.

This means that to use the Robots Exclusion Protocol, you have to liaise with the server administrator, and get him/her to add the rules to the "/robots.txt", using the Web Server Administrator's Guide to the Robots Exclusion Protocol. There is no way around this -- specifically, there is no point in providing your own "/robots.txt" files elsewhere on the server, like in your home directory or subdirectories; Robots won't look for them, and even if they did find them, they wouldn't pay attention to the rules there.

If your administrator is unwilling to install or modify "/robots.txt" rules on your behalf, and all you want is to prevent being indexed by indexing robots like WebCrawler and Lycos, you can add a Robots Meta Tag to all pages you don't want indexed. Note this functionality is not implemented by all indexing robots.

A Standard for Robot Exclusion

    Network Working Group                                      M. Koster
    INTERNET DRAFT                                             WebCrawler
    Category: Informational                                    November 1996
    <draft-koster-robots-00.txt>                               Dec 4, 1996
                                                               Expires June 4, 1997

                        A Method for Web Robots Control

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Table of Contents

    1.     Abstract
    2.     Introduction
    3.     Specification
    3.1    Access method
    3.2    File Format Description
    3.2.1  The User-agent line
    3.2.2  The Allow and Disallow lines
    3.3    Formal Syntax
    3.4    Expiration
    4.     Examples
    5.     Notes for Implementors
    5.1    Backwards Compatibility
    5.2    Interoperability
    6.     Security Considerations
    7.     Acknowledgements
    8.     References
    9.     Author's Address
1. Abstract

This memo defines a method for administrators of sites on the World-Wide Web to give instructions to visiting Web robots, most importantly what areas of the site are to be avoided.

This document provides a more rigid specification of the Standard for Robots Exclusion [1], which has been in widespread use by the Web community since 1994.

2. Introduction

Web Robots (also called "Wanderers" or "Spiders") are Web client programs that automatically traverse the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that "recursively" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long period of time, it qualifies to be called a robot.

Robots are often used for maintenance and indexing purposes, by people other than the administrators of the site being visited. In some cases such visits may have undesirable effects which the administrators would like to prevent, such as indexing of an unannounced site, traversal of parts of the site which require vast resources of the server, recursive traversal of an infinite URL space, etc.

The technique specified in this memo allows Web site administrators to indicate to visiting robots which parts of the site should be avoided. It is solely up to the visiting robot to consult this information and act accordingly. Blocking parts of the Web site regardless of a robot's compliance with this method is outside the scope of this memo.

3. The Specification

This memo specifies a format for encoding instructions to visiting robots, and specifies an access method to retrieve these instructions. Robots must retrieve these instructions before visiting other URLs on the site, and use the instructions to determine if other URLs on the site can be accessed.

3.1 Access method

The instructions must be accessible via HTTP [2] from the site that the instructions are to be applied to, as a resource of Internet Media Type [3] "text/plain" under a standard relative path on the server: "/robots.txt". For convenience we will refer to this resource as the "/robots.txt file", though the resource need in fact not originate from a filesystem.

Some examples of URLs [4] for sites and URLs for the corresponding "/robots.txt" files:

    http://www.foo.com/welcome.html     http://www.foo.com/robots.txt
    http://www.bar.com:8001/            http://www.bar.com:8001/robots.txt

If the server response indicates Success (HTTP 2xx Status Code), the robot must read the content, parse it, and follow any instructions applicable to that robot.

If the server response indicates the resource does not exist (HTTP Status Code 404), the robot can assume no instructions are available, and that access to the site is not restricted by /robots.txt.

Specific behaviors for other server responses are not required by this specification, though the following behaviours are recommended:

- On a server response indicating access restrictions (HTTP Status Code 401 or 403) a robot should regard access to the site as completely restricted.
- If the request attempt resulted in a temporary failure, a robot should defer visits to the site until such time as the resource can be retrieved.
- On a server response indicating Redirection (HTTP Status Code 3XX) a robot should follow the redirects until a resource can be found.
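The retrieval rules above translate almost directly into code. The following Python sketch is an illustration of those recommendations, not code from the draft: a 2xx response is parsed, 404 means the site is unrestricted, 401 and 403 mean the whole site should be regarded as restricted, redirects are followed (urllib does this automatically), and other failures cause the robot to defer its visit. The function name and return values are made up for the example.

    import urllib.error
    import urllib.request

    def fetch_robots_txt(site):
        """Retrieve <site>/robots.txt and classify the outcome."""
        url = site.rstrip("/") + "/robots.txt"
        try:
            # 3xx redirects are followed by the default opener.
            with urllib.request.urlopen(url, timeout=10) as response:
                return "parse", response.read().decode("utf-8", "replace")
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return "unrestricted", ""   # no instructions available
            if err.code in (401, 403):
                return "restricted", ""     # regard the whole site as off-limits
            return "defer", ""              # other failures: try again later
        except urllib.error.URLError:
            return "defer", ""              # temporary failure: defer the visit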
3.2 File Format Description

The instructions are encoded as a formatted plain text object, described here. A complete BNF-like description of the syntax of this format is given in section 3.3.

The format logically consists of a non-empty set of records, separated by blank lines. The records consist of a set of lines of the form:

    <Field> ":" <value>

In this memo we refer to lines with a Field "foo" as "foo lines".

The record starts with one or more User-agent lines, specifying which robots the record applies to, followed by "Disallow" and "Allow" instructions to that robot. For example:

    User-agent: webcrawler
    User-agent: infoseek
    Allow: /tmp/ok.html
    Disallow: /tmp
    Disallow: /user/foo

These lines are discussed separately below.

Lines with Fields not explicitly specified by this specification may occur in the /robots.txt, allowing for future extension of the format. Consult the BNF for restrictions on the syntax of such extensions. Note specifically that for backwards compatibility with robots implementing earlier versions of this specification, breaking of lines is not allowed.

Comments are allowed anywhere in the file, and consist of optional whitespace, followed by a comment character '#', followed by the comment, terminated by the end-of-line.

3.2.1 The User-agent line

Name tokens are used to allow robots to identify themselves via a simple product token. Name tokens should be short and to the point. The name token a robot chooses for itself should be sent as part of the HTTP User-agent header, and must be well documented.

These name tokens are used in User-agent lines in /robots.txt to identify to which specific robots the record applies. The robot must obey the first record in /robots.txt that contains a User-agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive. If no such record exists, it should obey the first record with a User-agent line with a "*" value, if present. If no record satisfied either condition, or no records are present at all, access is unlimited.

For example, a fictional company FigTree Search Services which names its robot "Fig Tree" might send HTTP requests like:

    GET / HTTP/1.0
    User-agent: FigTree/0.1 Robot libwww-perl/5.04

and might scan the "/robots.txt" file for records with:

    User-agent: figtree
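As a sketch of the record-selection rule in section 3.2.1 (an illustration, not code from the draft), assume the records have already been parsed into a list of (user-agent tokens, rule lines) pairs in file order; this pre-parsed representation and the helper name are made up for the example.

    def select_record(records, name_token):
        """Return the rule lines the robot must obey, or None if access
        is unlimited.  The first record whose User-agent value contains
        the robot's name token as a case-insensitive substring wins;
        failing that, the first record for "*"."""
        token = name_token.lower()
        for agents, rules in records:
            if any(agent != "*" and token in agent.lower() for agent in agents):
                return rules
        for agents, rules in records:
            if "*" in agents:
                return rules
        return None  # no record applies: access is unlimited

    records = [
        (["webcrawler", "infoseek"],
         [("Allow", "/tmp/ok.html"), ("Disallow", "/tmp"), ("Disallow", "/user/foo")]),
        (["*"], [("Disallow", "/")]),
    ]
    # The webcrawler robot obeys the first record; the fictional "figtree"
    # robot matches no specific record and falls back to the "*" record.
    print(select_record(records, "webcrawler") is records[0][1])  # True
    print(select_record(records, "figtree"))                      # [('Disallow', '/')]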
3.2.2 The Allow and Disallow lines

These lines indicate whether accessing a URL that matches the corresponding path is allowed or disallowed. Note that these instructions apply to any HTTP method on a URL.

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed. The /robots.txt URL is always allowed, and must not appear in the Allow/Disallow rules.

The matching process compares every octet in the path portion of the URL and the path from the record. If a %xx encoded octet is encountered it is unencoded prior to comparison, unless it is the "/" character, which has special meaning in a path. The match evaluates positively if and only if the end of the path from the record is reached before a difference in octets is encountered.

This table illustrates some examples:

    Record Path          URL path             Matches
    /tmp                 /tmp                 yes
    /tmp                 /tmp.html            yes
    /tmp                 /tmp/a.html          yes
    /tmp/                /tmp                 no
    /tmp/                /tmp/                yes
    /tmp/                /tmp/a.html          yes

    /a%3cd.html          /a%3cd.html          yes
    /a%3Cd.html          /a%3cd.html          yes
    /a%3cd.html          /a%3Cd.html          yes
    /a%3Cd.html          /a%3Cd.html          yes

    /a%2fb.html          /a%2fb.html          yes
    /a%2fb.html          /a/b.html            no
    /a/b.html            /a%2fb.html          no
    /a/b.html            /a/b.html            yes

    /%7ejoe/index.html   /~joe/index.html     yes
    /~joe/index.html     /%7Ejoe/index.html   yes

3.3 Formal Syntax

This is a BNF-like description, using the conventions of RFC 822 [5], except that "|" is used to designate alternatives. Briefly, literals are quoted with "", parentheses "(" and ")" are used to group elements, optional elements are enclosed in [brackets], and elements may be preceded with <n>* to designate n or more repetitions of the following element; n defaults to 0.

    robotstxt      = *blankcomment
                     | *blankcomment record *( 1*commentblank 1*record ) *blankcomment
    blankcomment   = 1*(blank | commentline)
    commentblank   = *commentline blank *(blankcomment)
    blank          = *space CRLF
    CRLF           = CR LF
    record         = *commentline agentline *(commentline | agentline)
                     1*ruleline *(commentline | ruleline)
    agentline      = "User-agent:" *space agent [comment] CRLF
    ruleline       = (disallowline | allowline | extension)
    disallowline   = "Disallow" ":" *space path [comment] CRLF
    allowline      = "Allow" ":" *space rpath [comment] CRLF
    extension      = token : *space value [comment] CRLF
    value          = <any CHAR except CR or LF or "#">
    commentline    = comment CRLF
    comment        = *blank "#" anychar
    space          = 1*(SP | HT)
    rpath          = "/" path
    agent          = token
    anychar        = <any CHAR except CR or LF>
    CHAR           = <any US-ASCII character (octets 0 - 127)>
    CTL            = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
    CR             = <US-ASCII CR, carriage return (13)>
    LF             = <US-ASCII LF, linefeed (10)>
    SP             = <US-ASCII SP, space (32)>
    HT             = <US-ASCII HT, horizontal-tab (9)>

The syntax for "token" is taken from RFC 1945 [2], reproduced here for convenience:

    token          = 1*<any CHAR except CTLs or tspecials>
    tspecials      = "(" | ")" | "<" | ">" | "@"
                   | "," | ";" | ":" | "\" | <">
                   | "/" | "[" | "]" | "?" | "="
                   | "{" | "}" | SP | HT

The syntax for "path" is defined in RFC 1808 [6], reproduced here for convenience:

    path           = fsegment *( "/" segment )
    fsegment       = 1*pchar
    segment        = *pchar

    pchar          = uchar | ":" | "@" | "&" | "="
    uchar          = unreserved | escape
    unreserved     = alpha | digit | safe | extra

    escape         = "%" hex hex
    hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                     "a" | "b" | "c" | "d" | "e" | "f"

    alpha          = lowalpha | hialpha
    lowalpha       = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
                     "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
                     "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
    hialpha        = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                     "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                     "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"

    digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                     "8" | "9"

    safe           = "$" | "-" | "_" | "." | "+"
    extra          = "!" | "*" | "'" | "(" | ")" | ","

3.4 Expiration

Robots should cache /robots.txt files, but if they do they must periodically verify the cached copy is fresh before using its contents.

Standard HTTP cache-control mechanisms can be used by both origin server and robots to influence the caching of the /robots.txt file. Specifically robots should take note of the Expires header set by the origin server. If no cache-control directives are present, robots should default to an expiry of 7 days.

4. Examples

This section contains an example of how a /robots.txt may be used. A fictional site may have the following URLs:

    http://www.fict.org/
    http://www.fict.org/index.html
    http://www.fict.org/robots.txt
    http://www.fict.org/server.html
    http://www.fict.org/services/fast.html
    http://www.fict.org/services/slow.html
    http://www.fict.org/orgo.gif
    http://www.fict.org/org/about.html
    http://www.fict.org/org/plans.html
    http://www.fict.org/%7Ejim/jim.html
    http://www.fict.org/%7Emak/mak.html

The site may in the /robots.txt have specific rules for robots that send an HTTP User-agent "UnhipBot/0.1", "WebCrawler/3.0", and "Excite/1.0", and a set of default rules:

    # /robots.txt for http://www.fict.org/
    # comments to [email protected]

    User-agent: unhipbot
    Disallow: /

    User-agent: webcrawler
    User-agent: excite
    Disallow:

    User-agent: *
    Disallow: /org/plans.html
    Allow: /org/
    Allow: /serv
    Allow: /~mak
    Disallow: /

The following matrix shows which robots are allowed to access URLs:

                                              unhipbot  webcrawler  other
                                                        & excite
    http://www.fict.org/                      No        Yes         No
    http://www.fict.org/index.html            No        Yes         No
    http://www.fict.org/robots.txt            Yes       Yes         Yes
    http://www.fict.org/server.html           No        Yes         Yes
    http://www.fict.org/services/fast.html    No        Yes         Yes
    http://www.fict.org/services/slow.html    No        Yes         Yes
    http://www.fict.org/orgo.gif              No        Yes         No
    http://www.fict.org/org/about.html        No        Yes         Yes
    http://www.fict.org/org/plans.html        No        Yes         No
    http://www.fict.org/%7Ejim/jim.html       No        Yes         No
    http://www.fict.org/%7Emak/mak.html       No        Yes         Yes
5. Notes for Implementors

5.1 Backwards Compatibility

Previous versions of this specification didn't provide the Allow line. The introduction of the Allow line causes robots to behave slightly differently under either specification:

If a /robots.txt contains an Allow which overrides a later occurring Disallow, a robot ignoring Allow lines will not retrieve those parts. This is considered acceptable because there is no requirement for a robot to access URLs it is allowed to retrieve, and it is safe, in that no URLs a Web site administrator wants to Disallow will be allowed. It is expected this may in fact encourage robots to upgrade compliance to the specification in this memo.

5.2 Interoperability

Implementors should pay particular attention to the robustness in parsing of the /robots.txt file. Web site administrators who are not aware of the /robots.txt mechanisms often notice repeated failing requests for it in their log files, and react by putting up pages asking "What are you looking for?".

As the majority of /robots.txt files are created with platform-specific text editors, robots should be liberal in accepting files with different end-of-line conventions, specifically CR and LF in addition to CRLF.

6. Security Considerations

There are a few risks in the method described here, which may affect either origin server or robot.

Web site administrators must realise this method is voluntary, and is not sufficient to guarantee some robots will not visit restricted parts of the URL space. Failure to use proper authentication or other restriction may result in exposure of restricted information. It is even possible that the occurrence of paths in the /robots.txt file may expose the existence of resources not otherwise linked to on the site, which may aid people guessing for URLs.

Robots need to be aware that the amount of resources spent on dealing with the /robots.txt is a function of the file contents, which is not under the control of the robot. For example, the contents may be larger in size than the robot can deal with. To prevent denial-of-service attacks, robots are therefore encouraged to place limits on the resources spent on processing of /robots.txt.

The /robots.txt directives are retrieved and applied in separate, possibly unauthenticated HTTP transactions, and it is possible that one server can impersonate another or otherwise intercept a /robots.txt, and provide a robot with false information. This specification does not preclude authentication and encryption from being employed to increase security.

7. Acknowledgements

The author would like to thank the subscribers to the robots mailing list for their contributions to this specification.

8. References

[1] Koster, M., "A Standard for Robot Exclusion", http://info.webcrawler.com/mak/projects/robots/norobots.html, June 1994.

[2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext Transfer Protocol -- HTTP/1.0", RFC 1945, MIT/LCS, May 1996.

[3] Postel, J., "Media Type Registration Procedure", RFC 1590, USC/ISI, March 1994.

[4] Berners-Lee, T., Masinter, L., and McCahill, M., "Uniform Resource Locators (URL)", RFC 1738, CERN, Xerox PARC, University of Minnesota, December 1994.
[5] Crocker, D., "Standard for the Format of ARPA Internet Text Messages", STD 11, RFC 822, UDEL, August 1982.

[6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808, UC Irvine, June 1995.

9. Author's Address

   Martijn Koster
   WebCrawler
   America Online
   690 Fifth Street
   San Francisco CA 94107

   Phone: 415-3565431
   EMail: [email protected]

Expires June 4, 1997

Web Server Administrator's Guide to the Robots META tag

Good news! As a Web Server Administrator you don't need to do anything to support the Robots META tag. Simply refer your users to the HTML Author's Guide to the Robots META tag.

HTML Author's Guide to the Robots META tag

The Robots META tag is a simple mechanism to indicate to visiting Web Robots whether a page should be indexed, or whether links on the page should be followed. It differs from the Protocol for Robots Exclusion in that you need no effort or permission from your Web Server Administrator.

Note: Currently only a few robots support this tag!

Where to put the Robots META tag

Like any META tag it should be placed in the HEAD section of an HTML page:

   <html>
   <head>
   <meta name="robots" content="noindex,nofollow">
   <meta name="description" content="This page ....">
   <title>...</title>
   </head>
   <body>
   ...

What to put into the Robots META tag

The content of the Robots META tag contains directives separated by commas. The currently defined directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive specifies if an indexing robot should index the page. The FOLLOW directive specifies if a robot is to follow links on the page. The defaults are INDEX and FOLLOW. The values ALL and NONE set all directives on or off: ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW.

Some examples:

   <meta name="robots" content="index,follow">
   <meta name="robots" content="noindex,follow">
   <meta name="robots" content="index,nofollow">
   <meta name="robots" content="noindex,nofollow">

Note that the "robots" name of the tag and its content are case insensitive.
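Because the content value is just a short, comma-separated, case-insensitive list, a robot's handling of it can be sketched in a few lines. The following is a minimal, hypothetical parser (the function name and the two-flag representation are my own; only the directive names, the INDEX/FOLLOW defaults, and the ALL/NONE shorthands come from the guide above). In this sketch a later directive simply overwrites an earlier one, which is one more reason to avoid the conflicting values discussed next.

   def parse_robots_meta(content):
       """Reduce a Robots META content value to (index, follow) flags (sketch only)."""
       index, follow = True, True          # defaults: INDEX, FOLLOW
       for directive in content.split(","):
           d = directive.strip().upper()   # directives are case insensitive
           if d == "ALL":
               index, follow = True, True
           elif d == "NONE":
               index, follow = False, False
           elif d == "INDEX":
               index = True
           elif d == "NOINDEX":
               index = False
           elif d == "FOLLOW":
               follow = True
           elif d == "NOFOLLOW":
               follow = False
           # unrecognised directives are ignored in this sketch
       return index, follow

   # Example: <meta name="robots" content="noindex,follow">
   print(parse_robots_meta("noindex,follow"))   # (False, True)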
You obviously should not specify conflicting or repeating directives, such as:

   <meta name="robots" content="INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW">

A formal syntax for the Robots META tag content is:

   content    = all | none | directives
   all        = "ALL"
   none       = "NONE"
   directives = directive ["," directives]
   directive  = index | follow
   index      = "INDEX" | "NOINDEX"
   follow     = "FOLLOW" | "NOFOLLOW"

Spidering BOF Report

[Note: This is an HTML version of the original notes from the Distributed Indexing/Searching Workshop]

Report by Michael Mauldin (Lycos) (later edited by Michael Schwartz)

While the overall workshop goal was to determine areas where standards could be pursued, the Spidering BOF attempted to reach actual standards agreements about some immediate-term issues facing robot-based search services, at least among the spider-based search service representatives who were in attendance at the workshop (Excite, InfoSeek, and Lycos). The agreements fell into four areas, but we report only three of them here, because the fourth area concerned a KEYWORDS tag that many workshop participants felt was not appropriate for specification by this BOF without the participation of other groups that have been working on that issue. The remaining three areas were:

ROBOTS meta-tag

   <META NAME="ROBOTS" CONTENT="ALL | NONE | NOINDEX | NOFOLLOW">

   default = empty = "ALL"
   "NONE" = "NOINDEX, NOFOLLOW"

The filler is a comma-separated list of terms: ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.

Discussion: This tag is meant to serve users who cannot control the robots.txt file at their sites; it provides a last chance to keep their content out of search services. It was decided not to add syntax to allow robot-specific permissions within the meta-tag.

INDEX means that robots are welcome to include this page in search services. FOLLOW means that robots are welcome to follow links from this page to find other pages. So a value of "NOINDEX" allows the subsidiary links to be explored, even though the page is not indexed. A value of "NOFOLLOW" allows the page to be indexed, but no links from the page are explored (this may be useful if the page is a free entry point into pay-per-view content, for example). A value of "NONE" tells the robot to ignore the page.

DESCRIPTION meta-tag

   <META NAME="DESCRIPTION" CONTENT="...text...">

The intent is that the text can be used by a search service when printing a summary of the document. The text should not contain any formatting information.

Other issues with ROBOTS.TXT

These are issues recommended for future standards discussion that could not be resolved within the scope of this workshop.
● Ambiguities in the current specification http://www.kollar.com/robots.html ● A means of canonicalizing sites, using: HTTP-EQUIV HOST ROBOTS.TXT ALIAS ● ways of supporting multiple robots.txt files per site ("robotsN.txt") ● ways of advertising content that should be indexed (rather than just restricting content that should not be indexed) ● Flow control information: retrieval interval or maximum connections open to server The Web Robots Pages http://info.webcrawler.com/mak/projects/robots/meta-notes.html (2 of 2) [18.02.2001 13:14:40] Database of Web Robots, Overview The Web Robots Pages Database of Web Robots, Overview In addition to this overview, you can View Contact or View Type. 1. Acme.Spider 2. Ahoy! The Homepage Finder 3. Alkaline 4. Walhello appie 5. Arachnophilia 6. ArchitextSpider 7. Aretha 8. ARIADNE 9. arks 10. ASpider (Associative Spider) 11. ATN Worldwide 12. Atomz.com Search Robot 13. AURESYS 14. BackRub 15. unnamed 16. Big Brother 17. Bjaaland 18. BlackWidow 19. Die Blinde Kuh 20. Bloodhound 21. bright.net caching robot 22. BSpider 23. CACTVS Chemistry Spider 24. Calif 25. Cassandra 26. Digimarc Marcspider/CGI 27. Checkbot 28. churl 29. CMC/0.01 30. Collective http://info.webcrawler.com/mak/projects/robots/active/html/index.html (1 of 8) [18.02.2001 13:14:44] Database of Web Robots, Overview 31. Combine System 32. Conceptbot 33. CoolBot 34. Web Core / Roots 35. XYLEME Robot 36. Internet Cruiser Robot 37. Cusco 38. CyberSpyder Link Test 39. DeWeb(c) Katalog/Index 40. DienstSpider 41. Digger 42. Digital Integrity Robot 43. Direct Hit Grabber 44. DNAbot 45. DownLoad Express 46. DragonBot 47. DWCP (Dridus' Web Cataloging Project) 48. e-collector 49. EbiNess 50. EIT Link Verifier Robot 51. Emacs-w3 Search Engine 52. ananzi 53. Esther 54. Evliya Celebi 55. nzexplorer 56. Fluid Dynamics Search Engine robot 57. Felix IDE 58. Wild Ferret Web Hopper #1, #2, #3 59. FetchRover 60. fido 61. Hämähäkki 62. KIT-Fireball 63. Fish search 64. Fouineur 65. Robot Francoroute http://info.webcrawler.com/mak/projects/robots/active/html/index.html (2 of 8) [18.02.2001 13:14:44] Database of Web Robots, Overview 66. Freecrawl 67. FunnelWeb 68. gazz 69. GCreep 70. GetBot 71. GetURL 72. Golem 73. Googlebot 74. Grapnel/0.01 Experiment 75. Griffon 76. Gromit 77. Northern Light Gulliver 78. HamBot 79. Harvest 80. havIndex 81. HI (HTML Index) Search 82. Hometown Spider Pro 83. Wired Digital 84. ht://Dig 85. HTMLgobble 86. Hyper-Decontextualizer 87. IBM_Planetwide 88. Popular Iconoclast 89. Ingrid 90. Imagelock 91. IncyWincy 92. Informant 93. InfoSeek Robot 1.0 94. Infoseek Sidewinder 95. InfoSpiders 96. Inspector Web 97. IntelliAgent 98. I, Robot 99. Iron33 100. Israeli-search http://info.webcrawler.com/mak/projects/robots/active/html/index.html (3 of 8) [18.02.2001 13:14:44] Database of Web Robots, Overview 101. JavaBee 102. JBot Java Web Robot 103. JCrawler 104. Jeeves 105. Jobot 106. JoeBot 107. The Jubii Indexing Robot 108. JumpStation 109. Katipo 110. KDD-Explorer 111. Kilroy 112. KO_Yappo_Robot 113. LabelGrabber 114. larbin 115. legs 116. Link Validator 117. LinkScan 118. LinkWalker 119. Lockon 120. logo.gif Crawler 121. Lycos 122. Mac WWWWorm 123. Magpie 124. Mattie 125. MediaFox 126. MerzScope 127. NEC-MeshExplorer 128. MindCrawler 129. moget 130. MOMspider 131. Monster 132. Motor 133. Muscat Ferret 134. Mwd.Search 135. Internet Shinchakubin http://info.webcrawler.com/mak/projects/robots/active/html/index.html (4 of 8) [18.02.2001 13:14:44] Database of Web Robots, Overview 136. 
NetCarta WebMap Engine 137. NetMechanic 138. NetScoop 139. newscan-online 140. NHSE Web Forager 141. Nomad 142. The NorthStar Robot 143. Occam 144. HKU WWW Octopus 145. Orb Search 146. Pack Rat 147. PageBoy 148. ParaSite 149. Patric 150. pegasus 151. The Peregrinator 152. PerlCrawler 1.0 153. Phantom 154. PiltdownMan 155. Pioneer 156. html_analyzer 157. Portal Juice Spider 158. PGP Key Agent 159. PlumtreeWebAccessor 160. Poppi 161. PortalB Spider 162. GetterroboPlus Puu 163. The Python Robot 164. Raven Search 165. RBSE Spider 166. Resume Robot 167. RoadHouse Crawling System 168. Road Runner: The ImageScape Robot 169. Robbie the Robot 170. ComputingSite Robi/1.0 http://info.webcrawler.com/mak/projects/robots/active/html/index.html (5 of 8) [18.02.2001 13:14:44] Database of Web Robots, Overview 171. Robozilla 172. Roverbot 173. SafetyNet Robot 174. Scooter 175. Search.Aus-AU.COM 176. SearchProcess 177. Senrigan 178. SG-Scout 179. ShagSeeker 180. Shai'Hulud 181. Sift 182. Simmany Robot Ver1.0 183. Site Valet 184. Open Text Index Robot 185. SiteTech-Rover 186. SLCrawler 187. Inktomi Slurp 188. Smart Spider 189. Snooper 190. Solbot 191. Spanner 192. Speedy Spider 193. spider_monkey 194. SpiderBot 195. SpiderMan 196. SpiderView(tm) 197. Spry Wizard Robot 198. Site Searcher 199. Suke 200. suntek search engine 201. Sven 202. TACH Black Widow 203. Tarantula 204. tarspider 205. Tcl W3 Robot http://info.webcrawler.com/mak/projects/robots/active/html/index.html (6 of 8) [18.02.2001 13:14:44] Database of Web Robots, Overview 206. TechBOT 207. Templeton 208. TeomaTechnologies 209. TitIn 210. TITAN 211. The TkWWW Robot 212. TLSpider 213. UCSD Crawl 214. UdmSearch 215. URL Check 216. URL Spider Pro 217. Valkyrie 218. Victoria 219. vision-search 220. Voyager 221. VWbot 222. The NWI Robot 223. W3M2 224. the World Wide Web Wanderer 225. WebBandit Web Spider 226. WebCatcher 227. WebCopy 228. webfetcher 229. The Webfoot Robot 230. weblayers 231. WebLinker 232. WebMirror 233. The Web Moose 234. WebQuest 235. Digimarc MarcSpider 236. WebReaper 237. webs 238. Websnarf 239. WebSpider 240. WebVac http://info.webcrawler.com/mak/projects/robots/active/html/index.html (7 of 8) [18.02.2001 13:14:44] Database of Web Robots, Overview 241. webwalk 242. WebWalker 243. WebWatch 244. Wget 245. whatUseek Winona 246. WhoWhere Robot 247. w3mir 248. WebStolperer 249. The Web Wombat 250. The World Wide Web Worm 251. WWWC Ver 0.2.5 252. WebZinger 253. XGET 254. Nederland.zoek The Web Robots Database http://info.webcrawler.com/mak/projects/robots/active/html/index.html (8 of 8) [18.02.2001 13:14:44] Database of Web Robots, View Type The Web Robots Pages Database of Web Robots, View Type Alternatively you can View Contact, or see the Overview. Name Details Acme.Spider Purpose: indexing maintenance statistics Availability: source Platform: java Ahoy! 
The Homepage Finder Purpose: maintenance Availability: none Platform: UNIX Alkaline Purpose: indexing Availability: binary Platform: unix windows95 windowsNT Walhello appie Purpose: indexing Availability: none Platform: windows98 Arachnophilia Purpose: Availability: Platform: ArchitextSpider Purpose: indexing, statistics Availability: Platform: Aretha Purpose: Availability: Platform: Macintosh ARIADNE Purpose: statistics, development of focused crawling strategies Availability: none Platform: java http://info.webcrawler.com/mak/projects/robots/active/html/type.html (1 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type arks Purpose: indexing Availability: data Platform: PLATFORM INDEPENDENT ASpider (Associative Spider) Purpose: indexing Availability: Platform: unix ATN Worldwide Purpose: indexing Availability: Platform: Atomz.com Search Robot Purpose: indexing Availability: service Platform: unix AURESYS Purpose: indexing,statistics Availability: Protected by Password Platform: Aix, Unix BackRub Purpose: indexing, statistics Availability: Platform: unnamed Purpose: Copyright Infringement Tracking Availability: 24/7 Platform: NT Big Brother Purpose: maintenance Availability: binary Platform: mac Bjaaland Purpose: indexing Availability: none Platform: unix BlackWidow Purpose: indexing, statistics Availability: Platform: Die Blinde Kuh Purpose: indexing Availability: none Platform: unix http://info.webcrawler.com/mak/projects/robots/active/html/type.html (2 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type Bloodhound Purpose: Web Site Download Availability: Executible Platform: Windows95, WindowsNT, Windows98, Windows2000 bright.net caching robot Purpose: caching Availability: none Platform: BSpider Purpose: indexing Availability: none Platform: Unix CACTVS Chemistry Spider Purpose: indexing. Availability: Platform: Calif Purpose: indexing Availability: none Platform: unix Cassandra Purpose: indexing Availability: none Platform: crossplatform Digimarc Marcspider/CGI Purpose: maintenance Availability: none Platform: windowsNT Checkbot Purpose: maintenance Availability: source Platform: unix,WindowsNT churl Purpose: maintenance Availability: Platform: CMC/0.01 Purpose: maintenance Availability: none Platform: unix http://info.webcrawler.com/mak/projects/robots/active/html/type.html (3 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type Collective Purpose: Collective is a highly configurable program designed to interrogate online search engines and online databases, it will ignore web pages that lie about there content, and dead url's, it can be super strict, it searches each web page it finds for your search terms to ensure those terms are present, any positive urls are added to a html file for your to view at any time even before the program has finished. Collective can wonder the web for days if required. 
Availability: Executible Platform: Windows95, WindowsNT, Windows98, Windows2000 Combine System Purpose: indexing Availability: source Platform: unix Conceptbot Purpose: indexing Availability: data Platform: unix CoolBot Purpose: indexing Availability: none Platform: unix Web Core / Roots Purpose: indexing, maintenance Availability: Platform: XYLEME Robot Purpose: indexing Availability: data Platform: unix Internet Cruiser Robot Purpose: indexing Availability: none Platform: unix Cusco Purpose: indexing Availability: none Platform: any http://info.webcrawler.com/mak/projects/robots/active/html/type.html (4 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type CyberSpyder Link Test Purpose: link validation, some html validation Availability: binary Platform: windows 3.1x, windows95, windowsNT DeWeb(c) Katalog/Index Purpose: indexing, mirroring, statistics Availability: Platform: DienstSpider Purpose: indexing Availability: none Platform: unix Digger Purpose: indexing Availability: none Platform: unix, windows Digital Integrity Robot Purpose: WWW Indexing Availability: none Platform: unix Direct Hit Grabber Purpose: Indexing and statistics Availability: Platform: unix DNAbot Purpose: indexing Availability: data Platform: unix, windows, windows95, windowsNT, mac DownLoad Express Purpose: graphic download Availability: binary Platform: win95/98/NT DragonBot Purpose: indexing Availability: none Platform: windowsNT DWCP (Dridus' Web Cataloging Project) Purpose: indexing, statistics Availability: source, binary, data Platform: java e-collector Purpose: email collector Availability: Binary Platform: Windows 9*/NT/2000 http://info.webcrawler.com/mak/projects/robots/active/html/type.html (5 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type EbiNess Purpose: statistics Availability: Open Source Platform: unix(Linux) EIT Link Verifier Robot Purpose: maintenance Availability: Platform: Emacs-w3 Search Engine Purpose: indexing Availability: Platform: ananzi Purpose: indexing Availability: Platform: Esther Purpose: indexing Availability: data Platform: unix (FreeBSD 2.2.8) Evliya Celebi Purpose: indexing turkish content Availability: source Platform: unix nzexplorer Purpose: indexing, statistics Availability: source (commercial) Platform: UNIX Fluid Dynamics Search Engine robot Purpose: indexing Availability: source;data Platform: unix;windows Felix IDE Purpose: indexing, statistics Availability: binary Platform: windows95, windowsNT Wild Ferret Web Hopper #1, #2, #3 Purpose: indexing maintenance statistics Availability: Platform: FetchRover Purpose: maintenance, statistics Availability: binary, source Platform: Windows/NT, Windows/95, Solaris SPARC http://info.webcrawler.com/mak/projects/robots/active/html/type.html (6 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type fido Purpose: indexing Availability: none Platform: Unix Hämähäkki Purpose: indexing Availability: no Platform: UNIX KIT-Fireball Purpose: indexing Availability: none Platform: unix Fish search Purpose: indexing Availability: binary Platform: Fouineur Purpose: indexing, statistics Availability: none Platform: unix, windows Robot Francoroute Purpose: indexing, mirroring, statistics Availability: Platform: Freecrawl Purpose: indexing Availability: none Platform: unix FunnelWeb Purpose: indexing, statisitics Availability: Platform: gazz Purpose: statistics Availability: none Platform: unix GCreep Purpose: indexing Availability: none Platform: linux+mysql GetBot Purpose: maintenance Availability: Platform: 
http://info.webcrawler.com/mak/projects/robots/active/html/type.html (7 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type GetURL Purpose: maintenance, mirroring Availability: Platform: Golem Purpose: maintenance Availability: none Platform: mac Googlebot Purpose: indexing statistics Availability: Platform: Grapnel/0.01 Experiment Purpose: Indexing Availability: None, yet Platform: WinNT Griffon Purpose: indexing Availability: none Platform: unix Gromit Purpose: indexing Availability: none Platform: unix Northern Light Gulliver Purpose: indexing Availability: none Platform: unix HamBot Purpose: indexing Availability: none Platform: unix, Windows95 Harvest Purpose: indexing Availability: Platform: havIndex Purpose: indexing Availability: binary Platform: Java VM 1.1 HI (HTML Index) Search Purpose: indexing Availability: Platform: http://info.webcrawler.com/mak/projects/robots/active/html/type.html (8 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type Hometown Spider Pro Purpose: indexing Availability: none Platform: windowsNT Wired Digital Purpose: indexing Availability: none Platform: unix ht://Dig Purpose: indexing Availability: source Platform: unix HTMLgobble Purpose: mirror Availability: Platform: Hyper-Decontextualizer Purpose: indexing Availability: Platform: IBM_Planetwide Purpose: indexing, maintenance, mirroring Availability: Platform: Popular Iconoclast Purpose: statistics Availability: source Platform: unix (OpenBSD) Ingrid Purpose: Indexing Availability: Commercial as part of search engine package Platform: UNIX Imagelock Purpose: maintenance Availability: none Platform: windows95 IncyWincy Purpose: Availability: Platform: Informant Purpose: indexing Availability: none Platform: unix http://info.webcrawler.com/mak/projects/robots/active/html/type.html (9 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type InfoSeek Robot 1.0 Purpose: indexing Availability: Platform: Infoseek Sidewinder Purpose: indexing Availability: Platform: InfoSpiders Purpose: search Availability: none Platform: unix, mac Inspector Web Purpose: maintentance: link validation, html validation, image size validation, etc Availability: free service and more extensive commercial service Platform: unix IntelliAgent Purpose: indexing Availability: Platform: I, Robot Purpose: indexing Availability: none Platform: unix Iron33 Purpose: indexing, statistics Availability: source Platform: unix Israeli-search Purpose: indexing. 
Availability: Platform: JavaBee Purpose: Stealing Java Code Availability: binary Platform: Java JBot Java Web Robot Purpose: indexing Availability: source Platform: Java http://info.webcrawler.com/mak/projects/robots/active/html/type.html (10 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type JCrawler Purpose: indexing Availability: none Platform: unix Jeeves Purpose: indexing maintenance statistics Availability: none Platform: UNIX Jobot Purpose: standalone Availability: Platform: JoeBot Purpose: Availability: Platform: The Jubii Indexing Robot Purpose: indexing, maintainance Availability: Platform: JumpStation Purpose: indexing Availability: Platform: Katipo Purpose: maintenance Availability: binary Platform: Macintosh KDD-Explorer Purpose: indexing Availability: none Platform: unix Kilroy Purpose: indexing,statistics Availability: none Platform: unix,windowsNT KO_Yappo_Robot Purpose: indexing Availability: none Platform: unix http://info.webcrawler.com/mak/projects/robots/active/html/type.html (11 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type LabelGrabber Purpose: Grabs PICS labels from web pages, submits them to a label bueau Availability: source Platform: windows, windows95, windowsNT, unix larbin Purpose: Your imagination is the only limit Availability: source (GPL), mail me for customization Platform: Linux legs Purpose: indexing Availability: none Platform: linux Link Validator Purpose: maintenance Availability: none Platform: unix, windows LinkScan Purpose: Link checker, SiteMapper, and HTML Validator Availability: Program is shareware Platform: Unix, Linux, Windows 98/NT LinkWalker Purpose: maintenance, statistics Availability: none Platform: windowsNT Lockon Purpose: indexing Availability: none Platform: UNIX logo.gif Crawler Purpose: indexing Availability: none Platform: unix Lycos Purpose: indexing Availability: Platform: Mac WWWWorm Purpose: indexing Availability: none Platform: Macintosh http://info.webcrawler.com/mak/projects/robots/active/html/type.html (12 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type Magpie Purpose: indexing, statistics Availability: Platform: unix Mattie Purpose: MP3 Spider Availability: None Platform: Windows 2000 MediaFox Purpose: indexing and maintenance Availability: none Platform: (Java) MerzScope Purpose: WebMapping Availability: binary Platform: (Java Based) unix,windows95,windowsNT,os2,mac etc .. 
NEC-MeshExplorer Purpose: indexing Availability: none Platform: unix MindCrawler Purpose: indexing Availability: none Platform: linux moget Purpose: indexing,statistics Availability: none Platform: unix MOMspider Purpose: maintenance, statistics Availability: source Platform: UNIX Monster Purpose: maintenance, mirroring Availability: binary Platform: UNIX (Linux) Motor Purpose: indexing Availability: data Platform: mac http://info.webcrawler.com/mak/projects/robots/active/html/type.html (13 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type Muscat Ferret Purpose: indexing Availability: none Platform: unix Mwd.Search Purpose: indexing Availability: none Platform: unix (Linux) Internet Shinchakubin Purpose: find new links and changed pages Availability: binary as bundled software Platform: Windows98 NetCarta WebMap Engine Purpose: indexing, maintenance, mirroring, statistics Availability: Platform: NetMechanic Purpose: Link and HTML validation Availability: via web page Platform: UNIX NetScoop Purpose: indexing Availability: none Platform: UNIX newscan-online Purpose: indexing Availability: binary Platform: Linux NHSE Web Forager Purpose: indexing Availability: Platform: Nomad Purpose: indexing Availability: Platform: The NorthStar Robot Purpose: indexing Availability: Platform: Occam Purpose: indexing Availability: none Platform: unix http://info.webcrawler.com/mak/projects/robots/active/html/type.html (14 of 25) [18.02.2001 13:15:10] Database of Web Robots, View Type HKU WWW Octopus Purpose: indexing Availability: Platform: Orb Search Purpose: indexing Availability: data Platform: unix Pack Rat Purpose: both maintenance and mirroring Availability: at the moment, none...source when developed. Platform: unix PageBoy Purpose: indexing Availability: none Platform: unix ParaSite Purpose: indexing Availability: none Platform: windowsNT Patric Purpose: statistics Availability: data Platform: unix pegasus Purpose: indexing Availability: source, binary Platform: unix The Peregrinator Purpose: Availability: Platform: PerlCrawler 1.0 Purpose: indexing Availability: source Platform: unix Phantom Purpose: indexing Availability: Platform: Macintosh PiltdownMan Purpose: statistics Availability: none Platform: windows95, windows98, windowsNT http://info.webcrawler.com/mak/projects/robots/active/html/type.html (15 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type Pioneer Purpose: indexing, statistics Availability: Platform: html_analyzer Purpose: maintainance Availability: Platform: Portal Juice Spider Purpose: indexing, statistics Availability: none Platform: unix PGP Key Agent Purpose: indexing Availability: none Platform: UNIX, Windows NT PlumtreeWebAccessor Purpose: indexing for the Plumtree Server Availability: none Platform: windowsNT Poppi Purpose: indexing Availability: none Platform: unix/linux PortalB Spider Purpose: indexing Availability: none Platform: unix GetterroboPlus Puu Purpose: Purpose of the robot. One or more of: - gathering: gather data of original standerd TAG for Puu contains the information of the sites registered my Search Engin. - maintenance: link validation Availability: none Platform: unix The Python Robot Purpose: Availability: none Platform: Raven Search Purpose: Indexing: gather content for commercial query engine. 
Availability: None Platform: Unix, Windows98, WindowsNT, Windows2000 http://info.webcrawler.com/mak/projects/robots/active/html/type.html (16 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type RBSE Spider Purpose: indexing, statistics Availability: Platform: Resume Robot Purpose: indexing. Availability: Platform: RoadHouse Crawling System Purpose: Availability: none Platform: Road Runner: The ImageScape Robot Purpose: indexing Availability: Platform: UNIX Robbie the Robot Purpose: indexing Availability: none Platform: unix, windows95, windowsNT ComputingSite Robi/1.0 Purpose: indexing,maintenance Availability: Platform: UNIX Robozilla Purpose: maintenance Availability: none Platform: Roverbot Purpose: indexing Availability: Platform: SafetyNet Robot Purpose: indexing. Availability: Platform: Scooter Purpose: indexing Availability: none Platform: unix Search.Aus-AU.COM Purpose: - indexing: gather content for an indexing service Availability: - none Platform: - mac - unix - windows95 - windowsNT http://info.webcrawler.com/mak/projects/robots/active/html/type.html (17 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type SearchProcess Purpose: Statistic Availability: none Platform: linux Senrigan Purpose: indexing Availability: none Platform: Java SG-Scout Purpose: indexing Availability: Platform: ShagSeeker Purpose: indexing Availability: data Platform: unix Shai'Hulud Purpose: mirroring Availability: source Platform: unix Sift Purpose: indexing Availability: data Platform: unix Simmany Robot Ver1.0 Purpose: indexing, maintenance, statistics Availability: none Platform: unix Site Valet Purpose: maintenance Availability: data Platform: unix Open Text Index Robot Purpose: indexing Availability: inquire to [email protected] (Mark Kraatz) Platform: UNIX SiteTech-Rover Purpose: indexing Availability: Platform: SLCrawler Purpose: To build the site map. 
Availability: none Platform: windows, windows95, windowsNT http://info.webcrawler.com/mak/projects/robots/active/html/type.html (18 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type Inktomi Slurp Purpose: indexing, statistics Availability: none Platform: unix Smart Spider Purpose: indexing Availability: data, binary, source Platform: windows95, windowsNT Snooper Purpose: Availability: none Platform: Solbot Purpose: indexing Availability: none Platform: unix Spanner Purpose: indexing,maintenance Availability: source Platform: unix Speedy Spider Purpose: indexing Availability: none Platform: Windows spider_monkey Purpose: gather content for a free indexing service Availability: bulk data gathered by robot available Platform: unix SpiderBot Purpose: indexing, mirroring Availability: source, binary, data Platform: unix, windows, windows95, windowsNT SpiderMan Purpose: user searching using IR technique Availability: binary&source Platform: Java 1.2 SpiderView(tm) Purpose: maintenance Availability: source Platform: unix, nt Spry Wizard Robot Purpose: indexing Availability: Platform: http://info.webcrawler.com/mak/projects/robots/active/html/type.html (19 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type Site Searcher Purpose: indexing Availability: binary Platform: winows95, windows98, windowsNT Suke Purpose: indexing Availability: source Platform: FreeBSD3.* suntek search engine Purpose: to create a search portal on Asian web sites Availability: available now Platform: NT, Linux, UNIX Sven Purpose: indexing Availability: none Platform: Windows TACH Black Widow Purpose: maintenance: link validation Availability: none Platform: UNIX, Linux Tarantula Purpose: indexing Availability: none Platform: unix tarspider Purpose: mirroring Availability: Platform: Tcl W3 Robot Purpose: maintenance, statistics Availability: Platform: TechBOT Purpose: statistics, maintenance Availability: none Platform: Unix Templeton Purpose: mirroring, mapping, automating web applications Availability: binary Platform: OS/2, Linux, SunOS, Solaris TeomaTechnologies Purpose: Availability: none Platform: http://info.webcrawler.com/mak/projects/robots/active/html/type.html (20 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type TitIn Purpose: indexing, statistics Availability: data, source on request Platform: unix TITAN Purpose: indexing Availability: no Platform: SunOS 4.1.4 The TkWWW Robot Purpose: indexing Availability: Platform: TLSpider Purpose: to get web sites and add them to the topiclink future directory Availability: none Platform: linux UCSD Crawl Purpose: indexing, statistics Availability: Platform: UdmSearch Purpose: indexing, validation Availability: source, binary Platform: unix URL Check Purpose: maintenance Availability: binary Platform: unix URL Spider Pro Purpose: indexing Availability: binary Platform: Windows9x/NT Valkyrie Purpose: indexing Availability: none Platform: unix Victoria Purpose: maintenance Availability: none Platform: unix http://info.webcrawler.com/mak/projects/robots/active/html/type.html (21 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type vision-search Purpose: indexing. 
Availability: Platform: Voyager Purpose: indexing, maintenance Availability: none Platform: unix VWbot Purpose: indexing Availability: source Platform: unix The NWI Robot Purpose: discovery,statistics Availability: none (at the moment) Platform: UNIX W3M2 Purpose: indexing, maintenance, statistics Availability: Platform: the World Wide Web Wanderer Purpose: statistics Availability: data Platform: unix WebBandit Web Spider Purpose: Resource Gathering / Server Benchmarking Availability: source, binary Platform: Intel - windows95 WebCatcher Purpose: indexing Availability: none Platform: unix, windows, mac WebCopy Purpose: mirroring Availability: Platform: webfetcher Purpose: mirroring Availability: Platform: The Webfoot Robot Purpose: Availability: Platform: http://info.webcrawler.com/mak/projects/robots/active/html/type.html (22 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type weblayers Purpose: maintainance Availability: Platform: WebLinker Purpose: maintenance Availability: Platform: WebMirror Purpose: mirroring Availability: Platform: Windows95 The Web Moose Purpose: statistics, maintenance Availability: data Platform: Windows NT WebQuest Purpose: indexing Availability: none Platform: unix Digimarc MarcSpider Purpose: maintenance Availability: none Platform: windowsNT WebReaper Purpose: indexing/offline browsing Availability: binary Platform: windows95, windowsNT webs Purpose: statistics Availability: none Platform: unix Websnarf Purpose: Availability: Platform: WebSpider Purpose: maintenance, link diagnostics Availability: Platform: WebVac Purpose: mirroring Availability: Platform: http://info.webcrawler.com/mak/projects/robots/active/html/type.html (23 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type webwalk Purpose: indexing, maintentance, mirroring, statistics Availability: Platform: WebWalker Purpose: maintenance Availability: source Platform: unix WebWatch Purpose: maintainance, statistics Availability: Platform: Wget Purpose: mirroring, maintenance Availability: source Platform: unix whatUseek Winona Purpose: Robot used for site-level search and meta-search engines. Availability: none Platform: unix WhoWhere Robot Purpose: indexing Availability: none Platform: Sun Unix w3mir Purpose: mirroring. Availability: Platform: UNIX, WindowsNT WebStolperer Purpose: indexing Availability: none Platform: unix, NT The Web Wombat Purpose: indexing, statistics. Availability: Platform: The World Wide Web Worm Purpose: indexing Availability: Platform: http://info.webcrawler.com/mak/projects/robots/active/html/type.html (24 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Type WWWC Ver 0.2.5 Purpose: maintenance Availability: binary Platform: windows, windows95, windowsNT WebZinger Purpose: indexing Availability: binary Platform: windows95, windowsNT 4, mac, solaris, unix XGET Purpose: mirroring Availability: binary Platform: X68000, X68030 Nederland.zoek Purpose: indexing Availability: none Platform: unix (Linux) The Web Robots Database http://info.webcrawler.com/mak/projects/robots/active/html/type.html (25 of 25) [18.02.2001 13:15:11] Database of Web Robots, View Contact The Web Robots Pages Database of Web Robots, View Contact Alternatively you can View Type, or see the Overview. Name Details Acme.Spider Agent: Due to a deficiency in Java it's not currently possible to set the User-Agent. Host: * Email: [email protected] Ahoy! The Homepage Finder Agent: 'Ahoy! 
The Homepage Finder' Host: cs.washington.edu Email: [email protected] Alkaline Agent: AlkalineBOT Host: * Email: [email protected] Walhello appie Agent: appie/1.1 Host: 213.10.10.116, 213.10.10.117, 213.10.10.118 Email: [email protected] Arachnophilia Agent: Arachnophilia Host: halsoft.com Email: [email protected] ArchitextSpider Agent: ArchitextSpider Host: *.atext.com Email: [email protected] Aretha Agent: Host: Email: [email protected] ARIADNE Agent: Due to a deficiency in Java it's not currently possible to set the User-Agent. Host: dbs.informatik.uni-muenchen.de Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (1 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact arks Agent: arks/1.0 Host: dpsindia.com Email: [email protected] ASpider (Associative Spider) Agent: ASpider/0.09 Host: nova.pvv.unit.no Email: [email protected] ATN Worldwide Agent: ATN_Worldwide Host: www.allthatnet.com Email: [email protected] Atomz.com Search Robot Agent: Atomz/1.0 Host: www.atomz.com Email: [email protected] AURESYS Agent: AURESYS/1.0 Host: crrm.univ-mrs.fr, 192.134.99.192 Email: [email protected] BackRub Agent: BackRub/*.* Host: *.stanford.edu Email: [email protected] unnamed Agent: BaySpider Host: Email: Big Brother Agent: Big Brother Host: * Email: [email protected] Bjaaland Agent: Bjaaland/0.5 Host: barry.bitmovers.net Email: [email protected] BlackWidow Agent: BlackWidow Host: 140.190.65.* Email: [email protected] Die Blinde Kuh Agent: Die Blinde Kuh Host: minerva.sozialwiss.uni-hamburg.de Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (2 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact Bloodhound Agent: None Host: * Email: [email protected] bright.net caching robot Agent: Mozilla/3.01 (compatible;) Host: 209.143.1.46 Email: BSpider Agent: BSpider/1.0 libwww-perl/0.40 Host: 210.159.73.34, 210.159.73.35 Email: [email protected] CACTVS Chemistry Spider Agent: CACTVS Chemistry Spider Host: utamaro.organik.uni-erlangen.de Email: [email protected] Calif Agent: Calif/0.6 ([email protected]; http://www.tnps.dp.ua) Host: cobra.tnps.dp.ua Email: [email protected] Cassandra Agent: Host: www.aha.ru Email: [email protected] Digimarc Marcspider/CGI Agent: Digimarc CGIReader/1.0 Host: 206.102.3.* Email: [email protected] Checkbot Agent: Checkbot/x.xx LWP/5.x Host: * Email: [email protected] churl Agent: Host: Email: [email protected] CMC/0.01 Agent: CMC/0.01 Host: haruna.next.ne.jp, 203.183.218.4 Email: [email protected] Collective Agent: LWP Host: * Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (3 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact Combine System Agent: combine/0.0 Host: *.ub2.lu.se Email: [email protected] Conceptbot Agent: conceptbot/0.3 Host: router.sifry.com Email: [email protected] CoolBot Agent: CoolBot Host: www.suchmaschine21.de Email: [email protected] Web Core / Roots Agent: root/0.1 Host: shiva.di.uminho.pt, from www.di.uminho.pt Email: [email protected] XYLEME Robot Agent: cosmos/0.3 Host: Email: [email protected] Internet Cruiser Robot Agent: Internet Cruiser Robot/2.1 Host: *.krstarica.com Email: [email protected] Cusco Agent: Cusco/3.2 Host: *.cusco.pt, *.viatecla.pt Email: [email protected] CyberSpyder Link Test Agent: CyberSpyder/2.1 Host: * Email: [email protected] DeWeb(c) Katalog/Index Agent: Deweb/1.01 Host: deweb.orbit.de Email: [email protected] DienstSpider Agent: dienstspider/1.0 
Host: sappho.csi.forth.gr Email: [email protected] Digger Agent: Digger/1.0 JDK/1.3.0 Host: Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (4 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact Digital Integrity Robot Agent: DIIbot Host: digital-integrity.com Email: [email protected] Direct Hit Grabber Agent: grabber Host: *.directhit.com Email: [email protected] DNAbot Agent: DNAbot/1.0 Host: xx.dnainc.co.jp Email: [email protected] DownLoad Express Agent: Host: * Email: [email protected] DragonBot Agent: DragonBot/1.0 libwww/5.0 Host: *.paczone.com Email: [email protected] DWCP (Dridus' Web Cataloging Project) Agent: DWCP/2.0 Host: *.dridus.com Email: [email protected] e-collector Agent: LWP:: Host: * Email: [email protected] EbiNess Agent: EbiNess/0.01a Host: Email: [email protected] EIT Link Verifier Robot Agent: EIT-Link-Verifier-Robot/0.2 Host: * Email: [email protected] Emacs-w3 Search Engine Agent: Emacs-w3/v[0-9\.]+ Host: * Email: [email protected] ananzi Agent: EMC Spider Host: bilbo.internal.empirical.com Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (5 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact Esther Agent: esther Host: *.falconsoft.com Email: [email protected] Evliya Celebi Agent: Evliya Celebi v0.151 - http://ilker.ulak.net.tr Host: 193.140.83.* Email: [email protected] nzexplorer Agent: explorersearch Host: bitz.co.nz Email: [email protected] Fluid Dynamics Search Engine robot Agent: Mozilla/4.0 (compatible: FDSE robot) Host: yes Email: [email protected] Felix IDE Agent: FelixIDE/1.0 Host: * Email: [email protected] Wild Ferret Web Hopper #1, #2, #3 Agent: Hazel's Ferret Web hopper, Host: Email: [email protected] FetchRover Agent: ESIRover v1.0 Host: * Email: [email protected] fido Agent: fido/0.9 Harvest/1.4.pl2 Host: fido.planetsearch.com, *.planetsearch.com, 206.64.113.* Email: [email protected] Hämähäkki Agent: Hämähäkki/0.2 Host: *.www.fi Email: [email protected] KIT-Fireball Agent: KIT-Fireball/2.0 libwww/5.0a Host: *.fireball.de Email: [email protected] Fish search Agent: Fish-Search-Robot Host: www.win.tue.nl Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (6 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact Fouineur Agent: Mozilla/2.0 (compatible fouineur v2.0; fouineur.9bit.qc.ca) Host: * Email: [email protected] Robot Francoroute Agent: Robot du CRIM 1.0a Host: zorro.crim.ca Email: [email protected] Freecrawl Agent: Freecrawl Host: *.freeside.net Email: [email protected] FunnelWeb Agent: FunnelWeb-1.0 Host: earth.planets.com.au Email: [email protected] gazz Agent: gazz/1.0 Host: *.nttrd.com, *.infobee.ne.jp Email: [email protected] GCreep Agent: gcreep/1.0 Host: mbx.instrumentpolen.se Email: [email protected] GetBot Agent: ??? 
Host: Email: [email protected] GetURL Agent: GetURL.rexx v1.05 Host: * Email: [email protected] Golem Agent: Golem/1.1 Host: *.quibble.com Email: [email protected] Googlebot Agent: Googlebot/2.0 beta (googlebot(at)googlebot.com) Host: *.googlebot.com Email: [email protected] Grapnel/0.01 Experiment Agent: Host: varies Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (7 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact Griffon Agent: griffon/1.0 Host: *.navi.ocn.ne.jp Email: [email protected] Gromit Agent: Gromit/1.0 Host: *.austlii.edu.au Email: [email protected] Northern Light Gulliver Agent: Gulliver/1.1 Host: scooby.northernlight.com, taz.northernlight.com, gulliver.northernlight.com Email: [email protected] HamBot Agent: Host: *.hamrad.com Email: [email protected] Harvest Agent: yes Host: bruno.cs.colorado.edu Email: havIndex Agent: havIndex/X.xx[bxx] Host: * Email: [email protected] HI (HTML Index) Search Agent: AITCSRobot/1.1 Host: Email: [email protected] Hometown Spider Pro Agent: Hometown Spider Pro Host: 63.195.193.17 Email: [email protected] Wired Digital Agent: wired-digital-newsbot/1.5 Host: gossip.hotwired.com Email: [email protected] ht://Dig Agent: htdig/3.1.0b2 Host: * Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (8 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact HTMLgobble Agent: HTMLgobble v2.2 Host: tp70.rz.uni-karlsruhe.de Email: [email protected] Hyper-Decontextualizer Agent: no Host: Email: [email protected] IBM_Planetwide Agent: IBM_Planetwide, Host: www.ibm.com www2.ibm.com Email: [email protected]" Popular Iconoclast Agent: gestaltIconoclast/1.0 libwww-FM/2.17 Host: gestalt.sewanee.edu Email: [email protected] Ingrid Agent: INGRID/0.1 Host: bart.ilse.nl Email: [email protected] Imagelock Agent: Mozilla 3.01 PBWF (Win95) Host: 209.111.133.* Email: [email protected] IncyWincy Agent: IncyWincy/1.0b1 Host: osiris.sunderland.ac.uk Email: [email protected] Informant Agent: Informant Host: informant.dartmouth.edu Email: [email protected] InfoSeek Robot 1.0 Agent: InfoSeek Robot 1.0 Host: corp-gw.infoseek.com Email: [email protected] Infoseek Sidewinder Agent: Infoseek Sidewinder Host: Email: [email protected] InfoSpiders Agent: InfoSpiders/0.1 Host: *.ucsd.edu Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (9 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact Inspector Web Agent: inspectorwww/1.0 http://www.greenpac.com/inspectorwww.html Host: www.corpsite.com, www.greenpac.com, 38.234.171.* Email: [email protected] IntelliAgent Agent: 'IAGENT/1.0' Host: sand.it.bond.edu.au Email: [email protected] I, Robot Agent: I Robot 0.4 ([email protected]) Host: *.mame.dk, 206.161.121.* Email: [email protected] Iron33 Agent: Iron33/0.0 Host: *.folon.ueda.info.waseda.ac.jp, 133.9.215.* Email: [email protected] Israeli-search Agent: IsraeliSearch/1.0 Host: dylan.ius.cs.cmu.edu Email: [email protected] JavaBee Agent: JavaBee Host: * Email: [email protected] JBot Java Web Robot Agent: JBot (but can be changed by the user) Host: * Email: [email protected] JCrawler Agent: JCrawler/0.2 Host: db.netimages.com Email: [email protected] Jeeves Agent: Jeeves v0.05alpha (PERL, LWP, [email protected]) Host: *.doc.ic.ac.uk Email: [email protected] Jobot Agent: Jobot/0.1alpha libwww-perl/4.0 Host: supernova.micrognosis.com Email: [email protected] 
http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (10 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact JoeBot Agent: JoeBot/x.x, Host: Email: [email protected] The Jubii Indexing Robot Agent: JubiiRobot/version# Host: any host in the cybernet.dk domain Email: [email protected] JumpStation Agent: jumpstation Host: *.stir.ac.uk Email: [email protected] Katipo Agent: Katipo/1.0 Host: * Email: [email protected] KDD-Explorer Agent: KDD-Explorer/0.1 Host: mlc.kddvw.kcom.or.jp Email: [email protected] Kilroy Agent: yes Host: *.oclc.org Email: [email protected] KO_Yappo_Robot Agent: KO_Yappo_Robot/1.0.4(http://yappo.com/info/robot.html) Host: yappo.com,209.25.40.1 Email: [email protected] LabelGrabber Agent: LabelGrab/1.1 Host: head.w3.org Email: [email protected] larbin Agent: larbin (+mail) Host: * Email: [email protected] legs Agent: legs Host: Email: [email protected] Link Validator Agent: Linkidator/0.93 Host: *.mitre.org Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (11 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact LinkScan Agent: LinkScan Server/5.5 | LinkScan Workstation/5.5 Host: * Email: [email protected] LinkWalker Agent: LinkWalker Host: *.seventwentyfour.com Email: [email protected] Lockon Agent: Lockon/xxxxx Host: *.hitech.tuis.ac.jp Email: [email protected] logo.gif Crawler Agent: logo.gif crawler Host: *.inm.de Email: [email protected] Lycos Agent: Lycos/x.x Host: fuzine.mt.cs.cmu.edu, lycos.com Email: [email protected] Mac WWWWorm Agent: Host: Email: [email protected] Magpie Agent: Magpie/1.0 Host: *.blueberry.co.uk, 194.70.52.*, 193.131.167.144 Email: [email protected] Mattie Agent: AO/A-T.IDRG v2.3 Host: mattie.mcw.aarkayn.org Email: [email protected] MediaFox Agent: MediaFox/x.y Host: 141.99.*.* Email: [email protected] MerzScope Agent: MerzScope Host: (Client Based) Email: NEC-MeshExplorer Agent: NEC-MeshExplorer Host: meshsv300.tk.mesh.ad.jp Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (12 of 24) [18.02.2001 13:15:50] Database of Web Robots, View Contact MindCrawler Agent: MindCrawler Host: * Email: [email protected] moget Agent: moget/1.0 Host: *.goo.ne.jp Email: [email protected] MOMspider Agent: MOMspider/1.00 libwww-perl/0.40 Host: * Email: [email protected] Monster Agent: Monster/vX.X.X -$TYPE ($OSTYPE) Host: wild.stu.neva.ru Email: [email protected] Motor Agent: Motor/0.2 Host: Michael.cybercon.technopark.gmd.de Email: [email protected] Muscat Ferret Agent: MuscatFerret/ Host: 193.114.89.*, 194.168.54.11 Email: [email protected] Mwd.Search Agent: MwdSearch/0.1 Host: *.fifi.net Email: [email protected] Internet Shinchakubin Agent: User-Agent: Mozilla/4.0 (compatible; sharp-info-agent v1.0; ) Host: * Email: [email protected] NetCarta WebMap Engine Agent: NetCarta CyberPilot Pro Host: Email: [email protected] NetMechanic Agent: NetMechanic Host: 206.26.168.18 Email: [email protected] NetScoop Agent: NetScoop/1.0 libwww/5.0a Host: alpha.is.tokushima-u.ac.jp, beta.is.tokushima-u.ac.jp Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (13 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact newscan-online Agent: newscan-online/1.1 Host: *newscan-online.de Email: [email protected] NHSE Web Forager Agent: NHSEWalker/3.0 Host: *.mcs.anl.gov Email: [email protected] Nomad Agent: Nomad-V2.x Host: *.cs.colostate.edu Email: [email protected] The NorthStar Robot 
Agent: NorthStar Host: frognot.utdallas.edu, utdallas.edu, cnidir.org Email: [email protected] Occam Agent: Occam/1.0 Host: gentian.cs.washington.edu, sekiu.cs.washington.edu, saxifrage.cs.washington.edu Email: [email protected] HKU WWW Octopus Agent: HKU WWW Robot, Host: phoenix.cs.hku.hk Email: [email protected] Orb Search Agent: Orbsearch/1.0 Host: cow.dyn.ml.org, *.dyn.ml.org Email: [email protected] Pack Rat Agent: PackRat/1.0 Host: cps.msu.edu Email: [email protected] PageBoy Agent: PageBoy/1.0 Host: *.webdocs.org Email: [email protected] ParaSite Agent: ParaSite/0.21 (http://www.ianett.com/parasite/) Host: *.ianett.com Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (14 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact Patric Agent: Patric/0.01a Host: *.nwnet.net Email: [email protected] pegasus Agent: web robot PEGASUS Host: * Email: [email protected] The Peregrinator Agent: Peregrinator-Mathematics/0.7 Host: Email: [email protected] PerlCrawler 1.0 Agent: PerlCrawler/1.0 Xavatoria/2.0 Host: server5.hypermart.net Email: [email protected] Phantom Agent: Duppies Host: Email: [email protected] PiltdownMan Agent: PiltdownMan/1.0 [email protected] Host: 62.36.128.*, 194.133.59.*, 212.106.215.* Email: [email protected] Pioneer Agent: Pioneer Host: *.uncfsu.edu or flyer.ncsc.org Email: [email protected] html_analyzer Agent: Host: Email: [email protected] Portal Juice Spider Agent: PortalJuice.com/4.0 Host: *.portaljuice.com, *.nextopia.com Email: [email protected] PGP Key Agent Agent: PGP-KA/1.2 Host: salerno.starnet.it Email: [email protected] PlumtreeWebAccessor Agent: PlumtreeWebAccessor/0.9 Host: Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (15 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact Poppi Agent: Poppi/1.0 Host: =20 Email: PortalB Spider Agent: PortalBSpider/1.0 ([email protected]) Host: spider1.portalb.com, spider2.portalb.com, etc. Email: [email protected] GetterroboPlus Puu Agent: straight FLASH!! GetterroboPlus 1.5 Host: straight FLASH!! Getterrobo-Plus, *.homing.net Email: [email protected] The Python Robot Agent: Host: Email: [email protected] Raven Search Agent: Raven-v2 Host: 192.168.1.* Email: [email protected] RBSE Spider Agent: Host: rbse.jsc.nasa.gov (192.88.42.10) Email: [email protected] Resume Robot Agent: Resume Robot Host: Email: [email protected] RoadHouse Crawling System Agent: RHCS/1.0a Host: stage.perceval.be Email: [email protected] Road Runner: The ImageScape Robot Agent: Road Runner: ImageScape Robot ([email protected]) Host: Email: [email protected] Robbie the Robot Agent: Robbie/0.1 Host: *.lmco.com Email: [email protected] ComputingSite Robi/1.0 Agent: ComputingSite Robi/1.0 ([email protected]) Host: robi.computingsite.com Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (16 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact Robozilla Agent: Robozilla/1.0 Host: directory.mozilla.org Email: [email protected] Roverbot Agent: Roverbot Host: roverbot.com Email: [email protected] SafetyNet Robot Agent: SafetyNet Robot 0.1, Host: *.urlabs.com Email: [email protected] Scooter Agent: Scooter/2.0 G.R.A.B. 
V1.1.0 Host: *.av.pa-x.dec.com Email: [email protected] Search.Aus-AU.COM Agent: not available Host: Search.Aus-AU.COM, 203.55.124.29, 203.2.239.29 Email: [email protected] SearchProcess Agent: searchprocess/0.9 Host: searchprocess.com Email: [email protected] Senrigan Agent: Senrigan/xxxxxx Host: aniki.olu.info.waseda.ac.jp Email: [email protected] SG-Scout Agent: SG-Scout Host: beta.xerox.com Email: [email protected], [email protected] ShagSeeker Agent: Shagseeker at http://www.shagseek.com /1.0 Host: shagseek.com Email: [email protected] Shai'Hulud Agent: Shai'Hulud Host: *.rdtex.ru Email: [email protected] Sift Agent: libwww-perl-5.41 Host: www.worthy.com Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (17 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact Simmany Robot Ver1.0 Agent: SimBot/1.0 Host: sansam.hnc.net Email: [email protected] Site Valet Agent: Site Valet Host: valet.webthing.com,valet.* Email: [email protected] Open Text Index Robot Agent: Open Text Site Crawler V1.0 Host: *.opentext.com Email: [email protected] SiteTech-Rover Agent: SiteTech-Rover Host: Email: [email protected] SLCrawler Agent: SLCrawler Host: n/a Email: [email protected] Inktomi Slurp Agent: Slurp/2.0 Host: *.inktomi.com Email: [email protected] Smart Spider Agent: ESISmartSpider/2.0 Host: 207.16.241.* Email: [email protected] Snooper Agent: Snooper/b97_01 Host: Email: [email protected] Solbot Agent: Solbot/1.0 LWP/5.07 Host: robot*.sol.no Email: [email protected] Spanner Agent: Spanner/1.0 (Linux 2.0.27 i586) Host: *.kluge.net Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (18 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact Speedy Spider Agent: Speedy Spider ( http://www.entireweb.com/speedy.html ) Host: router-00.sverige.net, 193.15.210.29, *.entireweb.com, *.worldlight.com Email: [email protected] spider_monkey Agent: mouse.house/7.1 Host: snowball.ionsys.com Email: [email protected] SpiderBot Agent: SpiderBot/1.0 Host: * Email: [email protected] SpiderMan Agent: SpiderMan 1.0 Host: NA Email: [email protected] SpiderView(tm) Agent: Mozilla/4.0 (compatible; SpiderView 1.0;unix) Host: bobmin.quad2.iuinc.com, * Email: [email protected] Spry Wizard Robot Agent: no Host: wizard.spry.com or tiger.spry.com Email: [email protected] Site Searcher Agent: ssearcher100 Host: * Email: [email protected] Suke Agent: suke/*.* Host: * Email: [email protected] suntek search engine Agent: suntek/1.0 Host: search.suntek.com.hk Email: [email protected] Sven Agent: Host: 24.113.12.29 Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (19 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact TACH Black Widow Agent: Mozilla/3.0 (Black Widow v1.1.0; Linux 2.0.27; Dec 31 1997 12:25:00 Host: *.theautochannel.com Email: [email protected] Tarantula Agent: Tarantula/1.0 Host: yes Email: [email protected] tarspider Agent: tarspider Host: Email: [email protected] Tcl W3 Robot Agent: dlw3robot/x.y (in TclX by http://hplyot.obspm.fr/~dl/) Host: hplyot.obspm.fr Email: [email protected] TechBOT Agent: TechBOT Host: techaid.net Email: [email protected] Templeton Agent: Templeton/{version} for {platform} Host: * Email: [email protected] TeomaTechnologies Agent: teoma_agent1 [[email protected]] Host: 63.236.92.145 Email: [email protected] TitIn Agent: TitIn/0.2 Host: barok.foi.hr Email: [email protected] TITAN Agent: TITAN/0.1 Host: nlptitan.isl.ntt.jp 
Email: [email protected] The TkWWW Robot Agent: Host: Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (20 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact TLSpider Agent: TLSpider/1.1 Host: tlspider.topiclink.com (not avalible yet) Email: [email protected] UCSD Crawl Agent: UCSD-Crawler Host: nuthaus.mib.org scilib.ucsd.edu Email: [email protected] UdmSearch Agent: UdmSearch/2.1.1 Host: * Email: [email protected] URL Check Agent: urlck/1.2.3 Host: * Email: [email protected] URL Spider Pro Agent: URL Spider Pro Host: * Email: [email protected] Valkyrie Agent: Valkyrie/1.0 libwww-perl/0.40 Host: *.c.u-tokyo.ac.jp Email: [email protected] Victoria Agent: Victoria/1.0 Host: Email: [email protected] vision-search Agent: vision-search/3.0' Host: dylan.ius.cs.cmu.edu Email: [email protected] Voyager Agent: Voyager/0.0 Host: *.lisa.co.jp Email: [email protected] VWbot Agent: VWbot_K/4.2 Host: vancouver-webpages.com Email: [email protected] The NWI Robot Agent: w3index Host: nwi.ub2.lu.se, mars.dtv.dk and a few others Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (21 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact W3M2 Agent: W3M2/x.xxx Host: * Email: [email protected] the World Wide Web Wanderer Agent: WWWWanderer v3.0 Host: *.mit.edu Email: [email protected] WebBandit Web Spider Agent: WebBandit/1.0 Host: ix.netcom.com Email: [email protected] WebCatcher Agent: WebCatcher/1.0 Host: oscar.lang.nagoya-u.ac.jp Email: [email protected] WebCopy Agent: WebCopy/(version) Host: * Email: [email protected] webfetcher Agent: WebFetcher/0.8, Host: * Email: [email protected] The Webfoot Robot Agent: Host: phoenix.doc.ic.ac.uk Email: [email protected] weblayers Agent: weblayers/0.0 Host: Email: [email protected] WebLinker Agent: WebLinker/0.0 libwww-perl/0.1 Host: Email: [email protected] WebMirror Agent: no Host: Email: [email protected] The Web Moose Agent: WebMoose/0.0.0000 Host: msn.com Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (22 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact WebQuest Agent: WebQuest/1.0 Host: 210.121.146.2, 210.113.104.1, 210.113.104.2 Email: [email protected] Digimarc MarcSpider Agent: Digimarc WebReader/1.2 Host: 206.102.3.* Email: [email protected] WebReaper Agent: WebReaper [[email protected]] Host: * Email: [email protected] webs Agent: [email protected] Host: lemon.recruit.co.jp Email: [email protected] Websnarf Agent: Host: Email: [email protected] WebSpider Agent: Host: several Email: [email protected] WebVac Agent: webvac/1.0 Host: Email: [email protected] webwalk Agent: webwalk Host: Email: WebWalker Agent: WebWalker/1.10 Host: * Email: [email protected] WebWatch Agent: WebWatch Host: Email: [email protected] Wget Agent: Wget/1.4.0 Host: * Email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (23 of 24) [18.02.2001 13:15:51] Database of Web Robots, View Contact whatUseek Winona Agent: whatUseek_winona/3.0 Host: *.whatuseek.com, *.aol2.com Email: [email protected] WhoWhere Robot Agent: Host: spica.whowhere.com Email: [email protected] w3mir Agent: w3mir Host: Email: [email protected] WebStolperer Agent: WOLP/1.0 mda/1.0 Host: www.suchfibel.de Email: [email protected] The Web Wombat Agent: no Host: qwerty.intercom.com.au Email: [email protected] The World Wide Web Worm Agent: Host: piper.cs.colorado.edu Email: [email protected] 
WWWC Ver 0.2.5 Agent: WWWC/0.25 (Win95) Host: Email: [email protected] WebZinger Agent: none Host: http://www.imaginon.com/wzindex.html * Email: [email protected] XGET Agent: XGET/0.7 Host: * Email: [email protected] Nederland.zoek Agent: Nederland.zoek Host: 193.67.110.* Email: [email protected] The Web Robots Database http://info.webcrawler.com/mak/projects/robots/active/html/contact.html (24 of 24) [18.02.2001 13:15:51] Database of Web Robots, Overview of Raw files The Web Robots Pages Database of Web Robots, Overview of Raw files 1. Acme.Spider 2. Ahoy! The Homepage Finder 3. Alkaline 4. Walhello appie 5. Arachnophilia 6. ArchitextSpider 7. Aretha 8. ARIADNE 9. arks 10. ASpider (Associative Spider) 11. ATN Worldwide 12. Atomz.com Search Robot 13. AURESYS 14. BackRub 15. unnamed 16. Big Brother 17. Bjaaland 18. BlackWidow 19. Die Blinde Kuh 20. Bloodhound 21. bright.net caching robot 22. BSpider 23. CACTVS Chemistry Spider 24. Calif 25. Cassandra 26. Digimarc Marcspider/CGI 27. Checkbot 28. churl 29. CMC/0.01 http://info.webcrawler.com/mak/projects/robots/active/html/raw.html (1 of 8) [18.02.2001 13:16:00] Database of Web Robots, Overview of Raw files 30. Collective 31. Combine System 32. Conceptbot 33. CoolBot 34. Web Core / Roots 35. XYLEME Robot 36. Internet Cruiser Robot 37. Cusco 38. CyberSpyder Link Test 39. DeWeb(c) Katalog/Index 40. DienstSpider 41. Digger 42. Digital Integrity Robot 43. Direct Hit Grabber 44. DNAbot 45. DownLoad Express 46. DragonBot 47. DWCP (Dridus' Web Cataloging Project) 48. e-collector 49. EbiNess 50. EIT Link Verifier Robot 51. Emacs-w3 Search Engine 52. ananzi 53. Esther 54. Evliya Celebi 55. nzexplorer 56. Fluid Dynamics Search Engine robot 57. Felix IDE 58. Wild Ferret Web Hopper #1, #2, #3 59. FetchRover 60. fido 61. Hämähäkki 62. KIT-Fireball 63. Fish search 64. Fouineur http://info.webcrawler.com/mak/projects/robots/active/html/raw.html (2 of 8) [18.02.2001 13:16:00] Database of Web Robots, Overview of Raw files 65. Robot Francoroute 66. Freecrawl 67. FunnelWeb 68. gazz 69. GCreep 70. GetBot 71. GetURL 72. Golem 73. Googlebot 74. Grapnel/0.01 Experiment 75. Griffon 76. Gromit 77. Northern Light Gulliver 78. HamBot 79. Harvest 80. havIndex 81. HI (HTML Index) Search 82. Hometown Spider Pro 83. Wired Digital 84. ht://Dig 85. HTMLgobble 86. Hyper-Decontextualizer 87. IBM_Planetwide 88. Popular Iconoclast 89. Ingrid 90. Imagelock 91. IncyWincy 92. Informant 93. InfoSeek Robot 1.0 94. Infoseek Sidewinder 95. InfoSpiders 96. Inspector Web 97. IntelliAgent 98. I, Robot 99. Iron33 http://info.webcrawler.com/mak/projects/robots/active/html/raw.html (3 of 8) [18.02.2001 13:16:00] Database of Web Robots, Overview of Raw files 100. Israeli-search 101. JavaBee 102. JBot Java Web Robot 103. JCrawler 104. Jeeves 105. Jobot 106. JoeBot 107. The Jubii Indexing Robot 108. JumpStation 109. Katipo 110. KDD-Explorer 111. Kilroy 112. KO_Yappo_Robot 113. LabelGrabber 114. larbin 115. legs 116. Link Validator 117. LinkScan 118. LinkWalker 119. Lockon 120. logo.gif Crawler 121. Lycos 122. Mac WWWWorm 123. Magpie 124. Mattie 125. MediaFox 126. MerzScope 127. NEC-MeshExplorer 128. MindCrawler 129. moget 130. MOMspider 131. Monster 132. Motor 133. Muscat Ferret 134. Mwd.Search http://info.webcrawler.com/mak/projects/robots/active/html/raw.html (4 of 8) [18.02.2001 13:16:00] Database of Web Robots, Overview of Raw files 135. Internet Shinchakubin 136. NetCarta WebMap Engine 137. NetMechanic 138. NetScoop 139. newscan-online 140. NHSE Web Forager 141. Nomad 142. 
The NorthStar Robot 143. Occam 144. HKU WWW Octopus 145. Orb Search 146. Pack Rat 147. PageBoy 148. ParaSite 149. Patric 150. pegasus 151. The Peregrinator 152. PerlCrawler 1.0 153. Phantom 154. PiltdownMan 155. Pioneer 156. html_analyzer 157. Portal Juice Spider 158. PGP Key Agent 159. PlumtreeWebAccessor 160. Poppi 161. PortalB Spider 162. GetterroboPlus Puu 163. The Python Robot 164. Raven Search 165. RBSE Spider 166. Resume Robot 167. RoadHouse Crawling System 168. Road Runner: The ImageScape Robot 169. Robbie the Robot http://info.webcrawler.com/mak/projects/robots/active/html/raw.html (5 of 8) [18.02.2001 13:16:00] Database of Web Robots, Overview of Raw files 170. ComputingSite Robi/1.0 171. Robozilla 172. Roverbot 173. SafetyNet Robot 174. Scooter 175. Search.Aus-AU.COM 176. SearchProcess 177. Senrigan 178. SG-Scout 179. ShagSeeker 180. Shai'Hulud 181. Sift 182. Simmany Robot Ver1.0 183. Site Valet 184. Open Text Index Robot 185. SiteTech-Rover 186. SLCrawler 187. Inktomi Slurp 188. Smart Spider 189. Snooper 190. Solbot 191. Spanner 192. Speedy Spider 193. spider_monkey 194. SpiderBot 195. SpiderMan 196. SpiderView(tm) 197. Spry Wizard Robot 198. Site Searcher 199. Suke 200. suntek search engine 201. Sven 202. TACH Black Widow 203. Tarantula 204. tarspider http://info.webcrawler.com/mak/projects/robots/active/html/raw.html (6 of 8) [18.02.2001 13:16:00] Database of Web Robots, Overview of Raw files 205. Tcl W3 Robot 206. TechBOT 207. Templeton 208. TeomaTechnologies 209. TitIn 210. TITAN 211. The TkWWW Robot 212. TLSpider 213. UCSD Crawl 214. UdmSearch 215. URL Check 216. URL Spider Pro 217. Valkyrie 218. Victoria 219. vision-search 220. Voyager 221. VWbot 222. The NWI Robot 223. W3M2 224. the World Wide Web Wanderer 225. WebBandit Web Spider 226. WebCatcher 227. WebCopy 228. webfetcher 229. The Webfoot Robot 230. weblayers 231. WebLinker 232. WebMirror 233. The Web Moose 234. WebQuest 235. Digimarc MarcSpider 236. WebReaper 237. webs 238. Websnarf 239. WebSpider http://info.webcrawler.com/mak/projects/robots/active/html/raw.html (7 of 8) [18.02.2001 13:16:00] Database of Web Robots, Overview of Raw files 240. WebVac 241. webwalk 242. WebWalker 243. WebWatch 244. Wget 245. whatUseek Winona 246. WhoWhere Robot 247. w3mir 248. WebStolperer 249. The Web Wombat 250. The World Wide Web Worm 251. WWWC Ver 0.2.5 252. WebZinger 253. XGET 254. Nederland.zoek The Web Robots Database http://info.webcrawler.com/mak/projects/robots/active/html/raw.html (8 of 8) [18.02.2001 13:16:00] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-id: robot-name: robot-cover-url: robot-details-url: robot-owner-name: robot-owner-url: robot-owner-email: robot-status: robot-purpose: robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: to set the User-Agent. robot-noindex: robot-host: robot-from: robot-useragent: to set the User-Agent. robot-language: robot-description: robot-history: robot-environment: modified-date: modified-by: Acme.Spider Acme.Spider http://www.acme.com/java/software/Acme.Spider.html http://www.acme.com/java/software/Acme.Spider.html Jef Poskanzer - ACME Laboratories http://www.acme.com/ [email protected] active indexing maintenance statistics standalone java source yes Due to a deficiency in Java it's not currently possible no * no Due to a deficiency in Java it's not currently possible java A Java utility class for writing your own robots. Wed, 04 Dec 1996 21:30:11 GMT Jef Poskanzer robot-id: ahoythehomepagefinder robot-name: Ahoy! 
The Homepage Finder robot-cover-url: http://www.cs.washington.edu/research/ahoy/ robot-details-url: http://www.cs.washington.edu/research/ahoy/doc/home.html robot-owner-name: Marc Langheinrich robot-owner-url: http://www.cs.washington.edu/homes/marclang robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: UNIX robot-availability: none robot-exclusion: yes robot-exclusion-useragent: ahoy robot-noindex: no robot-host: cs.washington.edu robot-from: no robot-useragent: 'Ahoy! The Homepage Finder' robot-language: Perl 5 robot-description: Ahoy! is an ongoing research project at the University of Washington for finding personal Homepages. robot-history: Research project at the University of Washington in 1995/1996 robot-environment: research modified-date: Fri June 28 14:00:00 1996 modified-by: Marc Langheinrich robot-id: Alkaline robot-name: Alkaline robot-cover-url: http://www.vestris.com/alkaline robot-details-url: http://www.vestris.com/alkaline robot-owner-name: Daniel Doubrovkine robot-owner-url: http://cuiwww.unige.ch/~doubrov5 robot-owner-email: [email protected] robot-status: development active robot-purpose: indexing robot-type: standalone robot-platform: unix windows95 windowsNT http://info.webcrawler.com/mak/projects/robots/active/all.txt (1 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: AlkalineBOT robot-noindex: yes robot-host: * robot-from: no robot-useragent: AlkalineBOT robot-language: c++ robot-description: Unix/NT internet/intranet search engine robot-history: Vestris Inc. search engine designed at the University of Geneva robot-environment: commercial research modified-date: Thu Dec 10 14:01:13 MET 1998 modified-by: Daniel Doubrovkine <[email protected]> robot-id: appie robot-name: Walhello appie robot-cover-url: www.walhello.com robot-details-url: www.walhello.com/aboutgl.html robot-owner-name: Aimo Pieterse robot-owner-url: www.walhello.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: windows98 robot-availability: none robot-exclusion: yes robot-exclusion-useragent: appie robot-noindex: yes robot-host: 213.10.10.116, 213.10.10.117, 213.10.10.118 robot-from: yes robot-useragent: appie/1.1 robot-language: Visual C++ robot-description: The appie-spider is used to collect and index web pages for the Walhello search engine robot-history: The spider was built in march/april 2000 robot-environment: commercial modified-date: Thu, 20 Jul 2000 22:38:00 GMT modified-by: Aimo Pieterse robot-id: arachnophilia robot-name: Arachnophilia robot-cover-url: robot-details-url: robot-owner-name: Vince Taluskie robot-owner-url: http://www.ph.utexas.edu/people/vince.html robot-owner-email: [email protected] robot-status: robot-purpose: robot-type: robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: halsoft.com robot-from: robot-useragent: Arachnophilia robot-language: robot-description: The purpose (undertaken by HaL Software) of this run was to collect approximately 10k html documents for testing automatic abstract generation robot-history: robot-environment: http://info.webcrawler.com/mak/projects/robots/active/all.txt (2 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt modified-date: modified-by: robot-id: architext robot-name: 
ArchitextSpider robot-cover-url: http://www.excite.com/ robot-details-url: robot-owner-name: Architext Software robot-owner-url: http://www.atext.com/spider.html robot-owner-email: [email protected] robot-status: robot-purpose: indexing, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: *.atext.com robot-from: yes robot-useragent: ArchitextSpider robot-language: perl 5 and c robot-description: Its purpose is to generate a Resource Discovery database, and to generate statistics. The ArchitextSpider collects information for the Excite and WebCrawler search engines. robot-history: robot-environment: modified-date: Tue Oct 3 01:10:26 1995 modified-by: robot-id: aretha robot-name: Aretha robot-cover-url: robot-details-url: robot-owner-name: Dave Weiner robot-owner-url: http://www.hotwired.com/Staff/userland/ robot-owner-email: [email protected] robot-status: robot-purpose: robot-type: robot-platform: Macintosh robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: robot-host: robot-from: robot-useragent: robot-language: robot-description: A crude robot built on top of Netscape and Userland Frontier, a scripting system for Macs robot-history: robot-environment: modified-date: modified-by: robot-id: ariadne robot-name: ARIADNE robot-cover-url: (forthcoming) robot-details-url: (forthcoming) robot-owner-name: Mr. Matthias H. Gross robot-owner-url: http://www.lrz-muenchen.de/~gross/ robot-owner-email: [email protected] robot-status: development robot-purpose: statistics, development of focused crawling strategies http://info.webcrawler.com/mak/projects/robots/active/all.txt (3 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-type: standalone robot-platform: java robot-availability: none robot-exclusion: yes robot-exclusion-useragent: ariadne robot-noindex: no robot-host: dbs.informatik.uni-muenchen.de robot-from: no robot-useragent: Due to a deficiency in Java it's not currently possible to set the User-Agent. robot-language: java robot-description: The ARIADNE robot is a prototype of a environment for testing focused crawling strategies. robot-history: This robot is part of a research project at the University of Munich (LMU), started in 2000. robot-environment: research modified-date: Mo, 13 Mar 2000 14:00:00 GMT modified-by: Mr. Matthias H. Gross robot-id:arks robot-name:arks robot-cover-url:http://www.dpsindia.com robot-details-url:http://www.dpsindia.com robot-owner-name:Aniruddha Choudhury robot-owner-url: robot-owner-email:[email protected] robot-status:development robot-purpose:indexing robot-type:standalone robot-platform:PLATFORM INDEPENDENT robot-availability:data robot-exclusion:yes robot-exclusion-useragent:arks robot-noindex:no robot-host:dpsindia.com robot-from:no robot-useragent:arks/1.0 robot-language:Java 1.2 robot-description:The Arks robot is used to build the database for the dpsindia/lawvistas.com search service . 
The robot runs weekly, and visits sites in a random order robot-history: grew out of a software development project for a portal robot-environment:commercial modified-date: 6 November 2000 modified-by:Aniruddha Choudhury robot-id: aspider robot-name: ASpider (Associative Spider) robot-cover-url: robot-details-url: robot-owner-name: Fred Johansen robot-owner-url: http://www.pvv.ntnu.no/~fredj/ robot-owner-email: [email protected] robot-status: retired robot-purpose: indexing robot-type: robot-platform: unix robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: nova.pvv.unit.no robot-from: yes robot-useragent: ASpider/0.09 robot-language: perl4 http://info.webcrawler.com/mak/projects/robots/active/all.txt (4 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-description: ASpider is a CGI script that searches the web for keywords given by the user through a form. robot-history: robot-environment: hobby modified-date: modified-by: robot-id: atn.txt robot-name: ATN Worldwide robot-details-url: robot-cover-url: robot-owner-name: All That Net robot-owner-url: http://www.allthatnet.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: ATN_Worldwide robot-noindex: robot-nofollow: robot-host: www.allthatnet.com robot-from: robot-useragent: ATN_Worldwide robot-language: robot-description: The ATN robot is used to build the database for the AllThatNet search service operated by All That Net. The robot runs weekly, and visits sites in a random order. robot-history: robot-environment: modified-date: July 09, 2000 17:43 GMT robot-id: atomz robot-name: Atomz.com Search Robot robot-cover-url: http://www.atomz.com/help/ robot-details-url: http://www.atomz.com/ robot-owner-name: Mike Thompson robot-owner-url: http://www.atomz.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: service robot-exclusion: yes robot-exclusion-useragent: Atomz robot-noindex: yes robot-host: www.atomz.com robot-from: no robot-useragent: Atomz/1.0 robot-language: c robot-description: Robot used for web site search service. robot-history: Developed for Atomz.com, launched in 1999. robot-environment: service modified-date: Tue Jul 13 03:50:06 GMT 1999 modified-by: Mike Thompson robot-id: auresys robot-name: AURESYS robot-cover-url: http://crrm.univ-mrs.fr robot-details-url: http://crrm.univ-mrs.fr robot-owner-name: Mannina Bruno robot-owner-url: ftp://crrm.univ-mrs.fr/pub/CVetud/Etudiants/Mannina/CVbruno.htm http://info.webcrawler.com/mak/projects/robots/active/all.txt (5 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-email: [email protected] robot-status: robot actively in use robot-purpose: indexing, statistics robot-type: Standalone robot-platform: Aix, Unix robot-availability: Protected by Password robot-exclusion: Yes robot-exclusion-useragent: robot-noindex: no robot-host: crrm.univ-mrs.fr, 192.134.99.192 robot-from: Yes robot-useragent: AURESYS/1.0 robot-language: Perl 5.001m robot-description: AURESYS is used to build a personal database for someone searching for information. The database is structured so that it can be analysed. AURESYS can find new servers by incrementing IP addresses. It also generates statistics...
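Many of the entries in this file record robot-exclusion: yes, meaning the robot checks /robots.txt before fetching and matches the rules against the token given as robot-exclusion-useragent (for example Atomz or ATN_Worldwide in the records just above). A minimal sketch of that check using Python's standard urllib.robotparser module; the site URL and the choice of token here are only illustrative and are not taken from any particular entry.

from urllib import robotparser

USER_AGENT = "Atomz"  # token matched against /robots.txt rules (illustrative choice)

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # illustrative host
rp.read()                                        # fetch and parse the exclusion file

url = "http://www.example.com/private/page.html"
if rp.can_fetch(USER_AGENT, url):
    print("allowed to fetch", url)
else:
    print("excluded by /robots.txt, skipping", url)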
robot-history: This robot finds its roots in a research project at the University of Marseille in 1995-1996 robot-environment: used for Research modified-date: Mon, 1 Jul 1996 14:30:00 GMT modified-by: Mannina Bruno robot-id: backrub robot-name: BackRub robot-cover-url: robot-details-url: robot-owner-name: Larry Page robot-owner-url: http://backrub.stanford.edu/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: *.stanford.edu robot-from: yes robot-useragent: BackRub/*.* robot-language: Java. robot-description: robot-history: robot-environment: modified-date: Wed Feb 21 02:57:42 1996. modified-by: robot-id: robot-name: bayspider robot-cover-url: http://www.baytsp.com robot-details-url: http://www.baytsp.com robot-owner-name: BayTSP.com,Inc robot-owner-url: robot-owner-email: [email protected] robot-status: Active robot-purpose: Copyright Infringement Tracking robot-type: Stand Alone robot-platform: NT robot-availability: 24/7 robot-exclusion: robot-exclusion-useragent: robot-noindex: robot-host: robot-from: robot-useragent: BaySpider robot-language: English http://info.webcrawler.com/mak/projects/robots/active/all.txt (6 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-description: robot-history: robot-environment: modified-date: 1/15/2001 modified-by: [email protected] robot-id: bigbrother robot-name: Big Brother robot-cover-url: http://pauillac.inria.fr/~fpottier/mac-soft.html.en robot-details-url: robot-owner-name: Francois Pottier robot-owner-url: http://pauillac.inria.fr/~fpottier/ robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: mac robot-availability: binary robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: not as of 1.0 robot-useragent: Big Brother robot-language: c++ robot-description: Macintosh-hosted link validation tool. 
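Every entry in all.txt follows the field template listed at the head of the file (robot-id, robot-name, robot-cover-url and so on), one field: value pair per field. A minimal sketch for loading such a file into one dictionary per robot; it assumes the original one-field-per-line layout of all.txt rather than the flattened transcription reproduced here, it does not reassemble wrapped description lines, and the file name is illustrative.

def parse_robot_database(path):
    # Each "robot-id:" line starts a new record; other "field: value" lines are
    # attached to the current record. Lines that do not look like a field are ignored.
    robots, current = [], None
    with open(path, encoding="latin-1") as fh:
        for line in fh:
            if ":" not in line:
                continue
            field, _, value = line.partition(":")
            field, value = field.strip(), value.strip()
            if field == "robot-id":
                current = {}
                robots.append(current)
            if current is not None:
                current[field] = value
    return robots

robots = parse_robot_database("all.txt")                      # illustrative path
print(len(robots), "records")
print([r.get("robot-name") for r in robots[:3]])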
robot-history: robot-environment: shareware modified-date: Thu Sep 19 18:01:46 MET DST 1996 modified-by: Francois Pottier robot-id: bjaaland robot-name: Bjaaland robot-cover-url: http://www.textuality.com robot-details-url: http://www.textuality.com robot-owner-name: Tim Bray robot-owner-url: http://www.textuality.com robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Bjaaland robot-noindex: no robot-host: barry.bitmovers.net robot-from: no robot-useragent: Bjaaland/0.5 robot-language: perl5 robot-description: Crawls sites listed in the ODP (see http://dmoz.org) robot-history: None, yet robot-environment: service modified-date: Monday, 19 July 1999, 13:46:00 PDT modified-by: [email protected] robot-id: blackwidow robot-name: BlackWidow robot-cover-url: http://140.190.65.12/~khooghee/index.html robot-details-url: robot-owner-name: Kevin Hoogheem robot-owner-url: robot-owner-email: [email protected] robot-status: robot-purpose: indexing, statistics http://info.webcrawler.com/mak/projects/robots/active/all.txt (7 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: robot-host: 140.190.65.* robot-from: yes robot-useragent: BlackWidow robot-language: C, C++. robot-description: Started as a research project and is now used to find links for a random link generator. It is also used to research the growth of specific sites. robot-history: robot-environment: modified-date: Fri Feb 9 00:11:22 1996. modified-by: robot-id: blindekuh robot-name: Die Blinde Kuh robot-cover-url: http://www.blinde-kuh.de/ robot-details-url: http://www.blinde-kuh.de/robot.html (german language) robot-owner-name: Stefan R. Mueller robot-owner-url: http://www.rrz.uni-hamburg.de/philsem/stefan_mueller/ robot-owner-email:[email protected] robot-status: development robot-purpose: indexing robot-type: browser robot-platform: unix robot-availability: none robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: minerva.sozialwiss.uni-hamburg.de robot-from: yes robot-useragent: Die Blinde Kuh robot-language: perl5 robot-description: The robot is used for indexing and checking the registered URLs in the German-language search engine for kids. It is a non-commercial one-woman project of Birgit Bachmann, who lives in Hamburg, Germany. robot-history: The robot was developed by Stefan R. Mueller to help with the manual checking of registered links. robot-environment: hobby modified-date: Mon Jul 22 1998 modified-by: Stefan R.
Mueller robot-id:Bloodhound robot-name:Bloodhound robot-cover-url:http://web.ukonline.co.uk/genius/bloodhound.htm robot-details-url:http://web.ukonline.co.uk/genius/bloodhound.htm robot-owner-name:Dean Smart robot-owner-url:http://web.ukonline.co.uk/genius/bloodhound.htm robot-owner-email:[email protected] robot-status:active robot-purpose:Web Site Download robot-type:standalone robot-platform:Windows95, WindowsNT, Windows98, Windows2000 robot-availability:Executable robot-exclusion:No robot-exclusion-useragent:Ukonline robot-noindex:No robot-host:* robot-from:No robot-useragent:None http://info.webcrawler.com/mak/projects/robots/active/all.txt (8 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-language:Perl5 robot-description:Bloodhound will download a whole web site, depending on the number of links to follow specified by the user. robot-history:First version was released on 1 July 2000 robot-environment:Commercial modified-date:1 July 2000 modified-by:Dean Smart robot-id: brightnet robot-name: bright.net caching robot robot-cover-url: robot-details-url: robot-owner-name: robot-owner-url: robot-owner-email: robot-status: active robot-purpose: caching robot-type: robot-platform: robot-availability: none robot-exclusion: no robot-noindex: robot-host: 209.143.1.46 robot-from: no robot-useragent: Mozilla/3.01 (compatible;) robot-language: robot-description: robot-history: robot-environment: modified-date: Fri Nov 13 14:08:01 EST 1998 modified-by: brian d foy <[email protected]> robot-id: bspider robot-name: BSpider robot-cover-url: not yet robot-details-url: not yet robot-owner-name: Yo Okumura robot-owner-url: not yet robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: Unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: bspider robot-noindex: yes robot-host: 210.159.73.34, 210.159.73.35 robot-from: yes robot-useragent: BSpider/1.0 libwww-perl/0.40 robot-language: perl robot-description: BSpider crawls inside the Japanese domain for indexing. robot-history: Started in Apr 1997 as a research project at Fuji Xerox Corp. Research Lab. robot-environment: research modified-date: Mon, 21 Apr 1997 18:00:00 JST modified-by: Yo Okumura robot-id: cactvschemistryspider robot-name: CACTVS Chemistry Spider robot-cover-url: http://schiele.organik.uni-erlangen.de/cactvs/spider.html robot-details-url: robot-owner-name: W. D. Ihlenfeldt robot-owner-url: http://schiele.organik.uni-erlangen.de/cactvs/ robot-owner-email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/all.txt (9 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-status: robot-purpose: indexing. robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: utamaro.organik.uni-erlangen.de robot-from: no robot-useragent: CACTVS Chemistry Spider robot-language: TCL, C robot-description: Locates chemical structures in Chemical MIME formats on WWW and FTP servers and downloads them into a database searchable with structure queries (substructure, fullstructure, formula, properties etc.) robot-history: robot-environment: modified-date: Sat Mar 30 00:55:40 1996.
modified-by: robot-id: calif robot-name: Calif robot-details-url: http://www.tnps.dp.ua/calif/details.html robot-cover-url: http://www.tnps.dp.ua/calif/ robot-owner-name: Alexander Kosarev robot-owner-url: http://www.tnps.dp.ua/~dark/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: calif robot-noindex: yes robot-host: cobra.tnps.dp.ua robot-from: yes robot-useragent: Calif/0.6 ([email protected]; http://www.tnps.dp.ua) robot-language: c++ robot-description: Used to build searchable index robot-history: In development stage robot-environment: research modified-date: Sun, 6 Jun 1999 13:25:33 GMT robot-id: cassandra robot-name: Cassandra robot-cover-url: http://post.mipt.rssi.ru/~billy/search/ robot-details-url: http://post.mipt.rssi.ru/~billy/search/ robot-owner-name: Mr. Oleg Bilibin robot-owner-url: http://post.mipt.rssi.ru/~billy/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: crossplatform robot-availability: none robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: www.aha.ru robot-from: no robot-useragent: robot-language: java robot-description: Cassandra search robot is used to create and maintain indexed http://info.webcrawler.com/mak/projects/robots/active/all.txt (10 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt database for widespread Information Retrieval System robot-history: Master of Science degree project at Moscow Institute of Physics and Technology robot-environment: research modified-date: Wed, 3 Jun 1998 12:00:00 GMT robot-id: cgireader robot-name: Digimarc Marcspider/CGI robot-cover-url: http://www.digimarc.com/prod_fam.html robot-details-url: http://www.digimarc.com/prod_fam.html robot-owner-name: Digimarc Corporation robot-owner-url: http://www.digimarc.com robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: 206.102.3.* robot-from: robot-useragent: Digimarc CGIReader/1.0 robot-language: c++ robot-description: Similar to Digimarc Marcspider, Marcspider/CGI examines image files for watermarks but more focused on CGI Urls. In order to not waste internet bandwidth with yet another crawler, we have contracted with one of the major crawlers/seach engines to provide us with a list of specific CGI URLs of interest to us. If an URL is to a page of interest (via CGI), then we access the page to get the image URLs from it, but we do not crawl to any other pages. 
robot-history: First operation in December 1997 robot-environment: service modified-date: Fri, 5 Dec 1997 12:00:00 GMT modified-by: Dan Ramos robot-id: checkbot robot-name: Checkbot robot-cover-url: http://www.xs4all.nl/~graaff/checkbot/ robot-details-url: robot-owner-name: Hans de Graaff robot-owner-url: http://www.xs4all.nl/~graaff/checkbot/ robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: unix,WindowsNT robot-availability: source robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: no robot-useragent: Checkbot/x.xx LWP/5.x robot-language: perl 5 robot-description: Checkbot checks links in a given set of pages on one or more servers. It reports links which returned an error code. robot-history: robot-environment: hobby modified-date: Tue Jun 25 07:44:00 1996 modified-by: Hans de Graaff http://info.webcrawler.com/mak/projects/robots/active/all.txt (11 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-id: churl robot-name: churl robot-cover-url: http://www-personal.engin.umich.edu/~yunke/scripts/churl/ robot-details-url: robot-owner-name: Justin Yunke robot-owner-url: http://www-personal.engin.umich.edu/~yunke/ robot-owner-email: [email protected] robot-status: robot-purpose: maintenance robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: robot-useragent: robot-language: robot-description: A URL-checking robot, which stays within one step of the local server robot-history: robot-environment: modified-date: modified-by: robot-id: cmc robot-name: CMC/0.01 robot-details-url: http://www2.next.ne.jp/cgi-bin/music/help.cgi?phase=robot robot-cover-url: http://www2.next.ne.jp/music/ robot-owner-name: Shinobu Kubota. robot-owner-url: http://www2.next.ne.jp/cgi-bin/music/help.cgi?phase=profile robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: CMC/0.01 robot-noindex: no robot-host: haruna.next.ne.jp, 203.183.218.4 robot-from: yes robot-useragent: CMC/0.01 robot-language: perl5 robot-description: This CMC/0.01 robot collects information about the pages that were registered with the music specialty search service. robot-history: This CMC/0.01 robot was made for the computer music center on November 4, 1997. robot-environment: hobby modified-date: Sat, 23 May 1998 17:22:00 GMT robot-id:Collective robot-name:Collective robot-cover-url:http://web.ukonline.co.uk/genius/collective.htm robot-details-url:http://web.ukonline.co.uk/genius/collective.htm robot-owner-name:Dean Smart robot-owner-url:http://web.ukonline.co.uk/genius/collective.htm robot-owner-email:[email protected] robot-status:development robot-purpose:Collective is a highly configurable program designed to interrogate online search engines and online databases. It ignores web pages that lie about their content, as well as dead URLs, and it can be very strict: it searches each web page http://info.webcrawler.com/mak/projects/robots/active/all.txt (12 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt it finds for your search terms to ensure those terms are present. Any positive URLs are added to an HTML file for you to view at any time, even before the program has finished. Collective can wander the web for days if required.
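Checkbot and churl above are both maintenance robots of the same kind: they fetch a set of pages and report links that return an error code. A minimal sketch of that sort of check, written here in Python purely for illustration (Checkbot itself is a Perl/LWP program); the starting URL is illustrative, and redirects, non-HTTP links and politeness delays are left out.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

class LinkExtractor(HTMLParser):
    # Collect the href targets of <a> tags on one page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

start = "http://www.example.com/index.html"   # illustrative starting page
page = urlopen(start).read().decode("latin-1", "replace")
parser = LinkExtractor()
parser.feed(page)

for link in parser.links:
    target = urljoin(start, link)
    if not target.startswith("http"):
        continue                               # skip mailto:, ftp: and similar links
    try:
        status = urlopen(target).getcode()
    except HTTPError as err:
        status = err.code
    except URLError:
        status = None
    if status is None or status >= 400:
        print("broken link:", target, status)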
robot-type:standalone robot-platform:Windows95, WindowsNT, Windows98, Windows2000 robot-availability:Executable robot-exclusion:No robot-exclusion-useragent: robot-noindex:No robot-host:* robot-from:No robot-useragent:LWP robot-language:Perl5 (With Visual Basic front-end) robot-description:Collective is a very clever Internet search tool: all found URLs are guaranteed to contain your search terms. robot-history:Development started on August 03, 2000 robot-environment:Commercial modified-date:August 03, 2000 modified-by:Dean Smart robot-id: combine robot-name: Combine System robot-cover-url: http://www.ub2.lu.se/~tsao/combine.ps robot-details-url: http://www.ub2.lu.se/~tsao/combine.ps robot-owner-name: Yong Cao robot-owner-url: http://www.ub2.lu.se/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: combine robot-noindex: no robot-host: *.ub2.lu.se robot-from: yes robot-useragent: combine/0.0 robot-language: c, perl5 robot-description: An open, distributed, and efficient harvester. robot-history: A complete re-design of the NWI robot (w3index) for the DESIRE project. robot-environment: research modified-date: Tue, 04 Mar 1997 16:11:40 GMT modified-by: Yong Cao robot-id: conceptbot robot-name: Conceptbot robot-cover-url: http://www.aptltd.com/~sifry/conceptbot/tech.html robot-details-url: http://www.aptltd.com/~sifry/conceptbot robot-owner-name: David L. Sifry robot-owner-url: http://www.aptltd.com/~sifry robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: data robot-exclusion: yes robot-exclusion-useragent: conceptbot robot-noindex: yes robot-host: router.sifry.com robot-from: yes robot-useragent: conceptbot/0.3 robot-language: perl5 http://info.webcrawler.com/mak/projects/robots/active/all.txt (13 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-description: The Conceptbot spider is used to research concept-based search indexing techniques. It uses a breadth-first search to spread out the number of hits on a single site over time. The spider runs at irregular intervals and is still under construction. robot-history: This spider began as a research project at Sifry Consulting in April 1996. robot-environment: research modified-date: Mon, 9 Sep 1996 15:31:07 GMT modified-by: David L. Sifry <[email protected]> robot-id: coolbot robot-name: CoolBot robot-cover-url: www.suchmaschine21.de robot-details-url: www.suchmaschine21.de robot-owner-name: Stefan Fischerlaender robot-owner-url: www.suchmaschine21.de robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: CoolBot robot-noindex: yes robot-host: www.suchmaschine21.de robot-from: no robot-useragent: CoolBot robot-language: perl5 robot-description: The CoolBot robot is used to build and maintain the directory of the German search engine Suchmaschine21.
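Conceptbot's entry above notes that it uses a breadth-first search to spread out the number of hits on a single site over time. One simple way to get that effect is to keep a separate first-in-first-out queue per host and serve the hosts in rotation; the sketch below shows only that scheduling idea (fetching, link extraction and delays are left out), the seed URLs are illustrative, and this is a generic illustration of the technique rather than Conceptbot's actual code.

from collections import OrderedDict, deque
from urllib.parse import urlparse

class RoundRobinFrontier:
    # Breadth-first frontier that rotates across hosts so no single site
    # receives a long burst of consecutive requests.
    def __init__(self):
        self.queues = OrderedDict()              # host -> deque of URLs

    def add(self, url):
        host = urlparse(url).netloc
        self.queues.setdefault(host, deque()).append(url)

    def next_url(self):
        for host in list(self.queues):
            queue = self.queues[host]
            if queue:
                url = queue.popleft()
                self.queues.move_to_end(host)    # this host goes to the back
                return url
            del self.queues[host]                # drop exhausted hosts
        return None

frontier = RoundRobinFrontier()
for seed in ("http://site-a.example/", "http://site-a.example/page2",
             "http://site-b.example/"):          # illustrative seeds
    frontier.add(seed)
while (url := frontier.next_url()) is not None:
    print("would fetch", url)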
robot-history: none so far robot-environment: service modified-date: Wed, 21 Jan 2001 12:16:00 GMT modified-by: Stefan Fischerlaender robot-id: core robot-name: Web Core / Roots robot-cover-url: http://www.di.uminho.pt/wc robot-details-url: robot-owner-name: Jorge Portugal Andrade robot-owner-url: http://www.di.uminho.pt/~cbm robot-owner-email: [email protected] robot-status: robot-purpose: indexing, maintenance robot-type: robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: shiva.di.uminho.pt, from www.di.uminho.pt robot-from: no robot-useragent: root/0.1 robot-language: perl robot-description: Parallel robot developed in Minho Univeristy in Portugal to catalog relations among URLs and to support a special navigation aid. robot-history: First versions since October 1995. robot-environment: modified-date: Wed Jan 10 23:19:08 1996. modified-by: robot-id: cosmos robot-name: XYLEME Robot http://info.webcrawler.com/mak/projects/robots/active/all.txt (14 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-cover-url: http://xyleme.com/ robot-details-url: robot-owner-name: Mihai Preda robot-owner-url: http://www.mihaipreda.com/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: data robot-exclusion: yes robot-exclusion-useragent: cosmos robot-noindex: no robot-nofollow: no robot-host: robot-from: yes robot-useragent: cosmos/0.3 robot-language: c++ robot-description: index XML, follow HTML robot-history: robot-environment: service modified-date: Fri, 24 Nov 2000 00:00:00 GMT modified-by: Mihai Preda robot-id: cruiser robot-name: Internet Cruiser Robot robot-cover-url: http://www.krstarica.com/ robot-details-url: http://www.krstarica.com/eng/url/ robot-owner-name: Internet Cruiser robot-owner-url: http://www.krstarica.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Internet Cruiser Robot robot-noindex: yes robot-host: *.krstarica.com robot-from: no robot-useragent: Internet Cruiser Robot/2.1 robot-language: c++ robot-description: Internet Cruiser Robot is Internet Cruiser's prime index agent. robot-history: robot-environment: service modified-date: Fri, 17 Jan 2001 12:00:00 GMT modified-by: [email protected] robot-id: cusco robot-name: Cusco robot-cover-url: http://www.cusco.pt/ robot-details-url: http://www.cusco.pt/ robot-owner-name: Filipe Costa Clerigo robot-owner-url: http://www.viatecla.pt/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standlone robot-platform: any robot-availability: none robot-exclusion: yes robot-exclusion-useragent: cusco robot-noindex: yes http://info.webcrawler.com/mak/projects/robots/active/all.txt (15 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-host: *.cusco.pt, *.viatecla.pt robot-from: yes robot-useragent: Cusco/3.2 robot-language: Java robot-description: The Cusco robot is part of the CUCE indexing sistem. It gathers information from several sources: HTTP, Databases or filesystem. At this moment, it's universe is the .pt domain and the information it gathers is available at the Portuguese search engine Cusco http://www.cusco.pt/. 
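The robot-exclusion-useragent field in these records (cusco in the entry just above, CoolBot and conceptbot a little earlier) is the token a site administrator names in /robots.txt to direct that particular robot. A minimal /robots.txt sketch using the cusco token; the path being disallowed is purely illustrative.

# Keep the Cusco robot out of a staging area, while leaving the rest of
# the site open to every robot that honours /robots.txt.
User-agent: cusco
Disallow: /staging/

User-agent: *
Disallow: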
robot-history: The Cusco search engine started in the company ViaTecla as a project to demonstrate our development capabilities and to fill the need of a portuguese-specific search engine. Now, we are developping new functionalities that cannot be found in any other on-line search engines. robot-environment:service, research modified-date: Mon, 21 Jun 1999 14:00:00 GMT modified-by: Filipe Costa Clerigo robot-id: cyberspyder robot-name: CyberSpyder Link Test robot-cover-url: http://www.cyberspyder.com/cslnkts1.html robot-details-url: http://www.cyberspyder.com/cslnkts1.html robot-owner-name: Tom Aman robot-owner-url: http://www.cyberspyder.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: link validation, some html validation robot-type: standalone robot-platform: windows 3.1x, windows95, windowsNT robot-availability: binary robot-exclusion: user configurable robot-exclusion-useragent: cyberspyder robot-noindex: no robot-host: * robot-from: no robot-useragent: CyberSpyder/2.1 robot-language: Microsoft Visual Basic 4.0 robot-description: CyberSpyder Link Test is intended to be used as a site management tool to validate that HTTP links on a page are functional and to produce various analysis reports to assist in managing a site. robot-history: The original robot was created to fill a widely seen need for a easy to use link checking program. robot-environment: commercial modified-date: Tue, 31 Mar 1998 01:02:00 GMT modified-by: Tom Aman robot-id: deweb robot-name: DeWeb(c) Katalog/Index robot-cover-url: http://deweb.orbit.de/ robot-details-url: robot-owner-name: Marc Mielke robot-owner-url: http://www.orbit.de/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing, mirroring, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: deweb.orbit.de robot-from: yes robot-useragent: Deweb/1.01 robot-language: perl 4 robot-description: Its purpose is to generate a Resource Discovery database, perform mirroring, and generate statistics. Uses combination http://info.webcrawler.com/mak/projects/robots/active/all.txt (16 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt of Informix(tm) Database and WN 1.11 serversoftware for indexing/ressource discovery, fulltext search, text excerpts. 
robot-history: robot-environment: modified-date: Wed Jan 10 08:23:00 1996 modified-by: robot-id: dienstspider robot-name: DienstSpider robot-cover-url: http://sappho.csi.forth.gr:22000/ robot-details-url: robot-owner-name: Antonis Sidiropoulos robot-owner-url: http://www.csi.forth.gr/~asidirop robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: robot-exclusion-useragent: robot-noindex: robot-host: sappho.csi.forth.gr robot-from: robot-useragent: dienstspider/1.0 robot-language: C robot-description: Indexing and searching the NCSTRL(Networked Computer Science Technical Report Library) and ERCIM Collection robot-history: The version 1.0 was the developer's master thesis project robot-environment: research modified-date: Fri, 4 Dec 1998 0:0:0 GMT modified-by: [email protected] robot-id: digger robot-name: Digger robot-cover-url: http://www.diggit.com/ robot-details-url: robot-owner-name: Benjamin Lipchak robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix, windows robot-availability: none robot-exclusion: yes robot-exclusion-useragent: digger robot-noindex: yes robot-host: robot-from: yes robot-useragent: Digger/1.0 JDK/1.3.0 robot-language: java robot-description: indexing web sites for the Diggit! search engine robot-history: robot-environment: service modified-date: modified-by: robot-id: diibot robot-name: Digital Integrity Robot robot-cover-url: http://www.digital-integrity.com/robotinfo.html robot-details-url: http://www.digital-integrity.com/robotinfo.html robot-owner-name: Digital Integrity, Inc. robot-owner-url: http://info.webcrawler.com/mak/projects/robots/active/all.txt (17 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-email: [email protected] robot-status: Production robot-purpose: WWW Indexing robot-type: robot-platform: unix robot-availability: none robot-exclusion: Conforms to robots.txt convention robot-exclusion-useragent: DIIbot robot-noindex: Yes robot-host: digital-integrity.com robot-from: robot-useragent: DIIbot robot-language: Java/C robot-description: robot-history: robot-environment: modified-date: modified-by: robot-id: directhit robot-name: Direct Hit Grabber robot-cover-url: www.directhit.com robot-details-url: http://www.directhit.com/about/company/spider.html robot-status: active robot-description: Direct Hit Grabber indexes documents and collects Web statistics for the Direct Hit Search Engine (available at www.directhit.com and our partners' sites) robot-purpose: Indexing and statistics robot-type: standalone robot-platform: unix robot-language: C++ robot-owner-name: Direct Hit Technologies, Inc. 
robot-owner-url: www.directhit.com robot-owner-email: [email protected] robot-exclusion: yes robot-exclusion-useragent: grabber robot-noindex: yes robot-host: *.directhit.com robot-from: yes robot-useragent: grabber robot-environment: service modified-by: [email protected] robot-id: dnabot robot-name: DNAbot robot-cover-url: http://xx.dnainc.co.jp/dnabot/ robot-details-url: http://xx.dnainc.co.jp/dnabot/ robot-owner-name: Tom Tanaka robot-owner-url: http://xx.dnainc.co.jp robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix, windows, windows95, windowsNT, mac robot-availability: data robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: xx.dnainc.co.jp robot-from: yes robot-useragent: DNAbot/1.0 robot-language: java robot-description: A search robot in 100 java, with its own built-in database engine and web server . Currently in Japanese. robot-history: Developed by DNA, Inc.(Niigata City, Japan) in 1998. http://info.webcrawler.com/mak/projects/robots/active/all.txt (18 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-environment: commercial modified-date: Mon, 4 Jan 1999 14:30:00 GMT modified-by: Tom Tanaka robot-id: download_express robot-name: DownLoad Express robot-cover-url: http://www.jacksonville.net/~dlxpress robot-details-url: http://www.jacksonville.net/~dlxpress robot-owner-name: DownLoad Express Inc robot-owner-url: http://www.jacksonville.net/~dlxpress robot-owner-email: [email protected] robot-status: active robot-purpose: graphic download robot-type: standalone robot-platform: win95/98/NT robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: downloadexpress robot-noindex: no robot-host: * robot-from: no robot-useragent: robot-language: visual basic robot-description: automatically downloads graphics from the web robot-history: robot-environment: commerical modified-date: Wed, 05 May 1998 modified-by: DownLoad Express Inc robot-id: dragonbot robot-name: DragonBot robot-cover-url: http://www.paczone.com/ robot-details-url: robot-owner-name: Paul Law robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: DragonBot robot-noindex: no robot-host: *.paczone.com robot-from: no robot-useragent: DragonBot/1.0 libwww/5.0 robot-language: C++ robot-description: Collects web pages related to East Asia robot-history: robot-environment: service modified-date: Mon, 11 Aug 1997 00:00:00 GMT modified-by: robot-id: dwcp robot-name: DWCP (Dridus' Web Cataloging Project) robot-cover-url: http://www.dridus.com/~rmm/dwcp.php3 robot-details-url: http://www.dridus.com/~rmm/dwcp.php3 robot-owner-name: Ross Mellgren (Dridus Norwind) robot-owner-url: http://www.dridus.com/~rmm robot-owner-email: [email protected] robot-status: development robot-purpose: indexing, statistics robot-type: standalone robot-platform: java http://info.webcrawler.com/mak/projects/robots/active/all.txt (19 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-availability: source, binary, data robot-exclusion: yes robot-exclusion-useragent: dwcp robot-noindex: no robot-host: *.dridus.com robot-from: [email protected] robot-useragent: DWCP/2.0 robot-language: java robot-description: The DWCP robot is used to gather information for Dridus' Web Cataloging 
Project, which is intended to catalog domains and URLs (no content). robot-history: Developed from scratch by Dridus Norwind. robot-environment: hobby modified-date: Sat, 10 Jul 1999 00:05:40 GMT modified-by: Ross Mellgren robot-id: e-collector robot-name: e-collector robot-cover-url: http://www.thatrobotsite.com/agents/ecollector.htm robot-details-url: http://www.thatrobotsite.com/agents/ecollector.htm robot-owner-name: Dean Smart robot-owner-url: http://www.thatrobotsite.com robot-owner-email: [email protected] robot-status: Active robot-purpose: email collector robot-type: Collector of email addresses robot-platform: Windows 9*/NT/2000 robot-availability: Binary robot-exclusion: No robot-exclusion-useragent: ecollector robot-noindex: No robot-host: * robot-from: No robot-useragent: LWP:: robot-language: Perl5 robot-description: e-collector is, in the simplest terms, an e-mail address collector, hence the name. Suppose you want the e-mail addresses of as many companies as possible that sell or supply, for example, "dried fruit": as an international distributor you could be doing business with thousands of shops and distributors, but you do not know who they are, because they are in other countries or even in the nearest town and you have never heard of them. e-collector finds them for you, giving you an Internet address and a contact person in each company, just by downloading and running the program. It is free and requires no legwork; note that the collected addresses should not be used for spam.
robot-history: robot-environment: Service modified-date: Weekly modified-by: Dean Smart robot-id:ebiness robot-name:EbiNess robot-cover-url:http://sourceforge.net/projects/ebiness robot-details-url:http://ebiness.sourceforge.net/ robot-owner-name:Mike Davis robot-owner-url:http://www.carisbrook.co.uk/mike http://info.webcrawler.com/mak/projects/robots/active/all.txt (20 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-email:[email protected] robot-status:Pre-Alpha robot-purpose:statistics robot-type:standalone robot-platform:unix(Linux) robot-availability:Open Source robot-exclusion:yes robot-exclusion-useragent:ebiness robot-noindex:no robot-host: robot-from:no robot-useragent:EbiNess/0.01a robot-language:c++ robot-description:Used to build a url relationship database, to be viewed in 3D robot-history:Dreamed it up over some beers robot-environment:hobby modified-date:Mon, 27 Nov 2000 12:26:00 GMT modified-by:Mike Davis robot-id: eit robot-name: EIT Link Verifier Robot robot-cover-url: http://wsk.eit.com/wsk/dist/doc/admin/webtest/verify_links.html robot-details-url: robot-owner-name: Jim McGuire robot-owner-url: http://www.eit.com/people/mcguire.html robot-owner-email: [email protected] robot-status: robot-purpose: maintenance robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: robot-useragent: EIT-Link-Verifier-Robot/0.2 robot-language: robot-description: Combination of an HTML form and a CGI script that verifies links from a given starting point (with some controls to prevent it going off-site or limitless) robot-history: Announced on 12 July 1994 robot-environment: modified-date: modified-by: robot-id: emacs robot-name: Emacs-w3 Search Engine robot-cover-url: http://www.cs.indiana.edu/elisp/w3/docs.html robot-details-url: robot-owner-name: William M. Perry robot-owner-url: http://www.cs.indiana.edu/hyplan/wmperry.html robot-owner-email: [email protected] robot-status: retired robot-purpose: indexing robot-type: browser robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: yes robot-useragent: Emacs-w3/v[0-9\.]+ robot-language: lisp http://info.webcrawler.com/mak/projects/robots/active/all.txt (21 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-description: Its purpose is to generate a Resource Discovery database This code has not been looked at in a while, but will be spruced up for the Emacs-w3 2.2.0 release sometime this month. It will honor the /robots.txt file at that time. robot-history: robot-environment: modified-date: Fri May 5 16:09:18 1995 modified-by: robot-id: emcspider robot-name: ananzi robot-cover-url: http://www.empirical.com/ robot-details-url: robot-owner-name: Hunter Payne robot-owner-url: http://www.psc.edu/~hpayne/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: bilbo.internal.empirical.com robot-from: yes robot-useragent: EMC Spider robot-language: java This spider is still in the development stages but, it will be hitting sites while I finish debugging it. robot-description: robot-history: robot-environment: modified-date: Wed May 29 14:47:01 1996. 
modified-by: robot-id: esther robot-name: Esther robot-details-url: http://search.falconsoft.com/ robot-cover-url: http://search.falconsoft.com/ robot-owner-name: Tim Gustafson robot-owner-url: http://www.falconsoft.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix (FreeBSD 2.2.8) robot-availability: data robot-exclusion: yes robot-exclusion-useragent: esther robot-noindex: no robot-host: *.falconsoft.com robot-from: yes robot-useragent: esther robot-language: perl5 robot-description: This crawler is used to build the search database at http://search.falconsoft.com/ robot-history: Developed by FalconSoft. robot-environment: service modified-date: Tue, 22 Dec 1998 00:22:00 PST robot-id: evliyacelebi robot-name: Evliya Celebi robot-cover-url: http://ilker.ulak.net.tr/EvliyaCelebi robot-details-url: http://ilker.ulak.net.tr/EvliyaCelebi http://info.webcrawler.com/mak/projects/robots/active/all.txt (22 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-name: Ilker TEMIR robot-owner-url: http://ilker.ulak.net.tr robot-owner-email: [email protected] robot-status: development robot-purpose: indexing turkish content robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: N/A robot-noindex: no robot-nofollow: no robot-host: 193.140.83.* robot-from: [email protected] robot-useragent: Evliya Celebi v0.151 - http://ilker.ulak.net.tr robot-language: perl5 robot-history: robot-description: crawles pages under ".tr" domain or having turkish character encoding (iso-8859-9 or windows-1254) robot-environment: hobby modified-date: Fri Mar 31 15:03:12 GMT 2000 robot-id: nzexplorer robot-name: nzexplorer robot-cover-url: http://nzexplorer.co.nz/ robot-details-url: robot-owner-name: Paul Bourke robot-owner-url: http://bourke.gen.nz/paul.html robot-owner-email: [email protected] robot-status: active robot-purpose: indexing, statistics robot-type: standalone robot-platform: UNIX robot-availability: source (commercial) robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: bitz.co.nz robot-from: no robot-useragent: explorersearch robot-language: c++ robot-history: Started in 1995 to provide a comprehensive index to WWW pages within New Zealand. Now also used in Malaysia and other countries. 
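Several robots in this file deliberately limit their scope to one national top-level domain: Evliya Celebi above crawls only under .tr, nzexplorer covers New Zealand, and Hämähäkki (a little further on) only the Finnish .fi domain. A minimal sketch of that kind of scope test; the chosen suffix and the URLs are illustrative.

from urllib.parse import urlparse

TLD = ".fi"   # illustrative: restrict the crawl to one national domain

def in_scope(url, tld=TLD):
    # True if the URL's host name falls under the chosen top-level domain.
    host = urlparse(url).hostname or ""
    return host.endswith(tld)

for url in ("http://www.example.fi/index.html",
            "http://www.example.com/index.html"):   # illustrative URLs
    print(url, "crawl" if in_scope(url) else "skip")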
robot-environment: service modified-date: Tues, 25 Jun 1996 modified-by: Paul Bourke robot-id:fdse robot-name:Fluid Dynamics Search Engine robot robot-cover-url:http://www.xav.com/scripts/search/ robot-details-url:http://www.xav.com/scripts/search/ robot-owner-name:Zoltan Milosevic robot-owner-url:http://www.xav.com/ robot-owner-email:[email protected] robot-status:active robot-purpose:indexing robot-type:standalone robot-platform:unix;windows robot-availability:source;data robot-exclusion:yes robot-exclusion-useragent:FDSE robot-noindex:yes robot-host:yes robot-from:* http://info.webcrawler.com/mak/projects/robots/active/all.txt (23 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-useragent:Mozilla/4.0 (compatible: FDSE robot) robot-language:perl5 robot-description:Crawls remote sites as part of a shareware search engine program robot-history:Developed in late 1998 over three pots of coffee robot-environment:commercial modified-date:Fri, 21 Jan 2000 10:15:49 GMT modified-by:Zoltan Milosevic robot-id: felix robot-name: Felix IDE robot-cover-url: http://www.pentone.com robot-details-url: http://www.pentone.com robot-owner-name: The Pentone Group, Inc. robot-owner-url: http://www.pentone.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing, statistics robot-type: standalone robot-platform: windows95, windowsNT robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: FELIX IDE robot-noindex: yes robot-host: * robot-from: yes robot-useragent: FelixIDE/1.0 robot-language: visual basic robot-description: Felix IDE is a retail personal search spider sold by The Pentone Group, Inc. It supports the proprietary exclusion "Frequency: ??????????" in the robots.txt file. Question marks represent an integer indicating number of milliseconds to delay between document requests. This is called VDRF(tm) or Variable Document Retrieval Frequency. Note that users can re-define the useragent name. robot-history: This robot began as an in-house tool for the lucrative Felix IDS (Information Discovery Service) and has gone retail. robot-environment: service, commercial, research modified-date: Fri, 11 Apr 1997 19:08:02 GMT modified-by: Kerry B. Rogers robot-id: ferret robot-name: Wild Ferret Web Hopper #1, #2, #3 robot-cover-url: http://www.greenearth.com/ robot-details-url: robot-owner-name: Greg Boswell robot-owner-url: http://www.greenearth.com/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing maintenance statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: robot-host: robot-from: yes robot-useragent: Hazel's Ferret Web hopper, robot-language: C++, Visual Basic, Java robot-description: The wild ferret web hopper's are designed as specific agents to retrieve data from all available sources on the internet. They work in an onion format hopping from spot to spot one level at a time over the internet. The information is gathered into different relational databases, known as http://info.webcrawler.com/mak/projects/robots/active/all.txt (24 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt "Hazel's Horde". The information is publicly available and will be free for the browsing at www.greenearth.com. Effective date of the data posting is to be announced. robot-history: robot-environment: modified-date: Mon Feb 19 00:28:37 1996. 
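The Felix IDE entry above describes a proprietary robots.txt extension, a Frequency line whose integer value is the number of milliseconds to wait between document requests (VDRF). Going by that description only, a robots.txt using it might look like the sketch below; the delay value and path are illustrative, and robots that do not know the extension should simply ignore the unrecognised line.

# Non-standard extension described in the Felix IDE entry: delay in
# milliseconds between requests. Only Felix IDE is documented to read it.
User-agent: FELIX IDE
Frequency: 5000
Disallow: /cgi-bin/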
modified-by: robot-id: fetchrover robot-name: FetchRover robot-cover-url: http://www.engsoftware.com/fetch.htm robot-details-url: http://www.engsoftware.com/spiders/ robot-owner-name: Dr. Kenneth R. Wadland robot-owner-url: http://www.engsoftware.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance, statistics robot-type: standalone robot-platform: Windows/NT, Windows/95, Solaris SPARC robot-availability: binary, source robot-exclusion: yes robot-exclusion-useragent: ESI robot-noindex: N/A robot-host: * robot-from: yes robot-useragent: ESIRover v1.0 robot-language: C++ robot-description: FetchRover fetches Web Pages. It is an automated page-fetching engine. FetchRover can be used stand-alone or as the front-end to a full-featured Spider. Its database can use any ODBC compliant database server, including Microsoft Access, Oracle, Sybase SQL Server, FoxPro, etc. robot-history: Used as the front-end to SmartSpider (another Spider product sold by Engineeering Software, Inc.) robot-environment: commercial, service modified-date: Thu, 03 Apr 1997 21:49:50 EST modified-by: Ken Wadland robot-id: fido robot-name: fido robot-cover-url: http://www.planetsearch.com/ robot-details-url: http://www.planetsearch.com/info/fido.html robot-owner-name: Steve DeJarnett robot-owner-url: http://www.planetsearch.com/staff/steved.html robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: Unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: fido robot-noindex: no robot-host: fido.planetsearch.com, *.planetsearch.com, 206.64.113.* robot-from: yes robot-useragent: fido/0.9 Harvest/1.4.pl2 robot-language: c, perl5 robot-description: fido is used to gather documents for the search engine provided in the PlanetSearch service, which is operated by the Philips Multimedia Center. The robots runs on an ongoing basis. robot-history: fido was originally based on the Harvest Gatherer, but has since evolved into a new creature. It still uses some support code from Harvest. http://info.webcrawler.com/mak/projects/robots/active/all.txt (25 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-environment: service modified-date: Sat, 2 Nov 1996 00:08:18 GMT modified-by: Steve DeJarnett robot-id: finnish robot-name: Hämähäkki robot-cover-url: http://www.fi/search.html robot-details-url: http://www.fi/www/spider.html robot-owner-name: Timo Metsälä robot-owner-url: http://www.fi/~timo/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: UNIX robot-availability: no robot-exclusion: yes robot-exclusion-useragent: Hämähäkki robot-noindex: no robot-host: *.www.fi robot-from: yes robot-useragent: Hämähäkki/0.2 robot-language: C robot-description: Its purpose is to generate a Resource Discovery database from the Finnish (top-level domain .fi) www servers. The resulting database is used by the search engine at http://www.fi/search.html. robot-history: (The name Hämähäkki is just Finnish for spider.) 
robot-environment: modified-date: 1996-06-25 modified-by: [email protected] robot-id: fireball robot-name: KIT-Fireball robot-cover-url: http://www.fireball.de robot-details-url: http://www.fireball.de/technik.html (in German) robot-owner-name: Gruner + Jahr Electronic Media Service GmbH robot-owner-url: http://www.ems.guj.de robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: KIT-Fireball robot-noindex: yes robot-host: *.fireball.de robot-from: yes robot-useragent: KIT-Fireball/2.0 libwww/5.0a robot-language: c robot-description: The Fireball robots gather web documents in the German language for the database of the Fireball search service. robot-history: The robot was developed by Benhui Chen in a research project at the Technical University of Berlin in 1996 and was re-implemented by its developer in 1997 for the present owner. robot-environment: service modified-date: Mon Feb 23 11:26:08 1998 modified-by: Detlev Kalb robot-id: fish robot-name: Fish search robot-cover-url: http://www.win.tue.nl/bin/fish-search robot-details-url: robot-owner-name: Paul De Bra robot-owner-url: http://www.win.tue.nl/win/cs/is/debra/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: binary robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: www.win.tue.nl robot-from: no robot-useragent: Fish-Search-Robot robot-language: c robot-description: Its purpose is to discover resources on the fly; a version exists that is integrated into the Tübingen Mosaic 2.4.2 browser (also written in C) robot-history: Originated as an addition to Mosaic for X robot-environment: modified-date: Mon May 8 09:31:19 1995 modified-by: robot-id: fouineur robot-name: Fouineur robot-cover-url: http://fouineur.9bit.qc.ca/ robot-details-url: http://fouineur.9bit.qc.ca/informations.html robot-owner-name: Joel Vandal robot-owner-url: http://www.9bit.qc.ca/~jvandal/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing, statistics robot-type: standalone robot-platform: unix, windows robot-availability: none robot-exclusion: yes robot-exclusion-useragent: fouineur robot-noindex: no robot-host: * robot-from: yes robot-useragent: Mozilla/2.0 (compatible fouineur v2.0; fouineur.9bit.qc.ca) robot-language: perl5 robot-description: This robot automatically builds a database that is used by our own search engine. It auto-detects the language (French, English & Spanish) used in the HTML page. Each database record generated by this robot includes: date, URL, title, total words, size and de-htmlized text. It also supports server-side and client-side IMAGEMAP. robot-history: No existing robot does everything that we need for our usage.
robot-environment: service modified-date: Thu, 9 Jan 1997 22:57:28 EST modified-by: [email protected] robot-id: francoroute robot-name: Robot Francoroute robot-cover-url: robot-details-url: robot-owner-name: Marc-Antoine Parent robot-owner-url: http://www.crim.ca/~maparent robot-owner-email: [email protected] robot-status: robot-purpose: indexing, mirroring, statistics robot-type: browser robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: zorro.crim.ca robot-from: yes robot-useragent: Robot du CRIM 1.0a robot-language: perl5, sqlplus robot-description: Part of the RISQ's Francoroute project for researching francophone resources. Uses the Accept-Language tag and reduces demand accordingly. robot-history: robot-environment: modified-date: Wed Jan 10 23:56:22 1996. modified-by: robot-id: freecrawl robot-name: Freecrawl robot-cover-url: http://euroseek.net/ robot-owner-name: Jesper Ekhall robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Freecrawl robot-noindex: no robot-host: *.freeside.net robot-from: yes robot-useragent: Freecrawl robot-language: c robot-description: The Freecrawl robot is used to build a database for the EuroSeek service. robot-environment: service robot-id: funnelweb robot-name: FunnelWeb robot-cover-url: http://funnelweb.net.au robot-details-url: robot-owner-name: David Eagles robot-owner-url: http://www.pc.com.au robot-owner-email: [email protected] robot-status: robot-purpose: indexing, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: earth.planets.com.au robot-from: yes robot-useragent: FunnelWeb-1.0 robot-language: c and c++ robot-description: Its purpose is to generate a Resource Discovery database, and generate statistics. Localised South Pacific Discovery and Search Engine, plus distributed operation under development. robot-history: robot-environment: modified-date: Mon Nov 27 21:30:11 1995 modified-by: robot-id: gazz robot-name: gazz robot-cover-url: http://gazz.nttrd.com/ robot-details-url: http://gazz.nttrd.com/ robot-owner-name: NTT Cyberspace Laboratories robot-owner-url: http://gazz.nttrd.com/ robot-owner-email: [email protected] robot-status: development robot-purpose: statistics robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: gazz robot-noindex: yes robot-host: *.nttrd.com, *.infobee.ne.jp robot-from: yes robot-useragent: gazz/1.0 robot-language: c robot-description: This robot is used for research purposes. robot-history: Its root is the TITAN project in NTT.
robot-environment: research modified-date: Wed, 09 Jun 1999 10:43:18 GMT modified-by: [email protected] robot-id: gcreep robot-name: GCreep robot-cover-url: http://www.instrumentpolen.se/gcreep/index.html robot-details-url: http://www.instrumentpolen.se/gcreep/index.html robot-owner-name: Instrumentpolen AB robot-owner-url: http://www.instrumentpolen.se/ip-kontor/eng/index.html robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: browser+standalone robot-platform: linux+mysql robot-availability: none robot-exclusion: yes robot-exclusion-useragent: gcreep robot-noindex: yes robot-host: mbx.instrumentpolen.se robot-from: yes robot-useragent: gcreep/1.0 robot-language: c robot-description: Indexing robot to learn SQL robot-history: Spare time project begun late '96, maybe early '97 robot-environment: hobby modified-date: Fri, 23 Jan 1998 16:09:00 MET modified-by: Anders Hedstrom robot-id: getbot robot-name: GetBot robot-cover-url: http://www.blacktop.com.zav/bots robot-details-url: robot-owner-name: Alex Zavatone robot-owner-url: http://www.blacktop.com/zav robot-owner-email: [email protected] robot-status: robot-purpose: maintenance robot-type: standalone robot-platform: robot-availability: robot-exclusion: no. robot-exclusion-useragent: robot-noindex: robot-host: http://info.webcrawler.com/mak/projects/robots/active/all.txt (29 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-from: no robot-useragent: ??? robot-language: Shockwave/Director. robot-description: GetBot's purpose is to index all the sites it can find that contain Shockwave movies. It is the first bot or spider written in Shockwave. The bot was originally written at Macromedia on a hungover Sunday as a proof of concept. Alex Zavatone 3/29/96 robot-history: robot-environment: modified-date: Fri Mar 29 20:06:12 1996. modified-by: robot-id: geturl robot-name: GetURL robot-cover-url: http://Snark.apana.org.au/James/GetURL/ robot-details-url: robot-owner-name: James Burton robot-owner-url: http://Snark.apana.org.au/James/ robot-owner-email: [email protected] robot-status: robot-purpose: maintenance, mirroring robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: no robot-useragent: GetURL.rexx v1.05 robot-language: ARexx (Amiga REXX) robot-description: Its purpose is to validate links, perform mirroring, and copy document trees. Designed as a tool for retrieving web pages in batch mode without the encumbrance of a browser. Can be used to describe a set of pages to fetch, and to maintain an archive or mirror. Is not run by a central site and accessed by clients - is run by the end user or archive maintainer robot-history: robot-environment: modified-date: Tue May 9 15:13:12 1995 modified-by: robot-id: golem robot-name: Golem robot-cover-url: http://www.quibble.com/golem/ robot-details-url: http://www.quibble.com/golem/ robot-owner-name: Geoff Duncan robot-owner-url: http://www.quibble.com/geoff/ robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: mac robot-availability: none robot-exclusion: yes robot-exclusion-useragent: golem robot-noindex: no robot-host: *.quibble.com robot-from: yes robot-useragent: Golem/1.1 robot-language: HyperTalk/AppleScript/C++ robot-description: Golem generates status reports on collections of URLs supplied by clients. 
Designed to assist with editorial updates of http://info.webcrawler.com/mak/projects/robots/active/all.txt (30 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt Web-related sites or products. robot-history: Personal project turned into a contract service for private clients. robot-environment: service,research modified-date: Wed, 16 Apr 1997 20:50:00 GMT modified-by: Geoff Duncan robot-id: googlebot robot-name: Googlebot robot-cover-url: http://googlebot.com/ robot-details-url: http://googlebot.com/ robot-owner-name: Google Inc. robot-owner-url: http://google.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: Googlebot robot-noindex: yes robot-host: *.googlebot.com robot-from: yes robot-useragent: Googlebot/2.0 beta (googlebot(at)googlebot.com) robot-language: Python robot-description: robot-history: Used to be called backrub and run from stanford.edu robot-environment: service modified-date: Wed, 29 Sep 1999 18:36:25 -0700 modified-by: Amit Patel <[email protected]> robot-id: grapnel robot-name: Grapnel/0.01 Experiment robot-cover-url: varies robot-details-url: mailto:[email protected] robot-owner-name: Philip Kallerman robot-owner-url: [email protected] robot-owner-email: [email protected] robot-status: Experimental robot-purpose: Indexing robot-type: robot-platform: WinNT robot-availability: None, yet robot-exclusion: Yes robot-exclusion-useragent: No robot-noindex: No robot-host: varies robot-from: Varies robot-useragent: robot-language: Perl robot-description: Resource Discovery Experimentation robot-history: None, hoping to make some robot-environment: modified-date: modified-by: 7 Feb 1997 robot-id:griffon robot-name:Griffon robot-cover-url:http://navi.ocn.ne.jp/ robot-details-url:http://navi.ocn.ne.jp/griffon/ robot-owner-name:NTT Communications Corporate Users Business Division robot-owner-url:http://navi.ocn.ne.jp/ robot-owner-email:[email protected] robot-status:active http://info.webcrawler.com/mak/projects/robots/active/all.txt (31 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-purpose:indexing robot-type:standalone robot-platform:unix robot-availability:none robot-exclusion:yes robot-exclusion-useragent:griffon robot-noindex:yes robot-nofollow:yes robot-host:*.navi.ocn.ne.jp robot-from:yes robot-useragent:griffon/1.0 robot-language:c robot-description:The Griffon robot is used to build database for the OCN navi search service operated by NTT Communications Corporation. It mainly gathers pages written in Japanese. robot-history:Its root is TITAN project in NTT. robot-environment:service modified-date:Mon,25 Jan 2000 15:25:30 GMT modified-by:[email protected] robot-id: gromit robot-name: Gromit robot-cover-url: http://www.austlii.edu.au/ robot-details-url: http://www2.austlii.edu.au/~dan/gromit/ robot-owner-name: Daniel Austin robot-owner-url: http://www2.austlii.edu.au/~dan/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Gromit robot-noindex: no robot-host: *.austlii.edu.au robot-from: yes robot-useragent: Gromit/1.0 robot-language: perl5 robot-description: Gromit is a Targetted Web Spider that indexes legal sites contained in the AustLII legal links database. 
robot-history: This robot is based on the Perl5 LWP::RobotUA module. robot-environment: research modified-date: Wed, 11 Jun 1997 03:58:40 GMT modified-by: Daniel Austin robot-id: gulliver robot-name: Northern Light Gulliver robot-cover-url: robot-details-url: robot-owner-name: Mike Mulligan robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: gulliver robot-noindex: yes robot-host: scooby.northernlight.com, taz.northernlight.com, gulliver.northernlight.com robot-from: yes robot-useragent: Gulliver/1.1 http://info.webcrawler.com/mak/projects/robots/active/all.txt (32 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-language: c robot-description: Gulliver is a robot to be used to collect web pages for indexing and subsequent searching of the index. robot-history: Oct 1996: development; Dec 1996-Jan 1997: crawl & debug; Mar 1997: crawl again; robot-environment: service modified-date: Wed, 21 Apr 1999 16:00:00 GMT modified-by: Mike Mulligan robot-id: hambot robot-name: HamBot robot-cover-url: http://www.hamrad.com/search.html robot-details-url: http://www.hamrad.com/ robot-owner-name: John Dykstra robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix, Windows95 robot-availability: none robot-exclusion: yes robot-exclusion-useragent: hambot robot-noindex: yes robot-host: *.hamrad.com robot-from: robot-useragent: robot-language: perl5, C++ robot-description: Two HamBot robots are used (stand alone & browser based) to aid in building the database for HamRad Search - The Search Engine for Search Engines. The robota are run intermittently and perform nearly identical functions. robot-history: A non commercial (hobby?) project to aid in building and maintaining the database for the the HamRad search engine. robot-environment: service modified-date: Fri, 17 Apr 1998 21:44:00 GMT modified-by: JD robot-id: harvest robot-name: Harvest robot-cover-url: http://harvest.cs.colorado.edu robot-details-url: robot-owner-name: robot-owner-url: robot-owner-email: robot-status: robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: robot-host: bruno.cs.colorado.edu robot-from: yes robot-useragent: yes robot-language: robot-description: Harvest's motivation is to index community- or topicspecific collections, rather than to locate and index all HTML objects that can be found. Also, Harvest allows users to control the enumeration several ways, including stop lists and depth and count limits. Therefore, Harvest provides a much more controlled way of indexing the Web than is typical of robots. Pauses 1 second between requests (by default). http://info.webcrawler.com/mak/projects/robots/active/all.txt (33 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-history: robot-environment: modified-date: modified-by: robot-id: havindex robot-name: havIndex robot-cover-url: http://www.hav.com/ robot-details-url: http://www.hav.com/ robot-owner-name: hav.Software and Horace A. 
(Kicker) Vallas robot-owner-url: http://www.hav.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: Java VM 1.1 robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: havIndex robot-noindex: yes robot-host: * robot-from: no robot-useragent: havIndex/X.xx[bxx] robot-language: Java robot-description: havIndex allows individuals to build searchable word index of (user specified) lists of URLs. havIndex does not crawl rather it requires one or more user supplied lists of URLs to be indexed. havIndex does (optionally) save urls parsed from indexed pages. robot-history: Developed to answer client requests for URL specific index capabilities. robot-environment: commercial, service modified-date: 6-27-98 modified-by: Horace A. (Kicker) Vallas robot-id: hi robot-name: HI (HTML Index) Search robot-cover-url: http://cs6.cs.ait.ac.th:21870/pa.html robot-details-url: robot-owner-name: Razzakul Haider Chowdhury robot-owner-url: http://cs6.cs.ait.ac.th:21870/index.html robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: yes robot-useragent: AITCSRobot/1.1 robot-language: perl 5 robot-description: Its purpose is to generate a Resource Discovery database. This Robot traverses the net and creates a searchable database of Web pages. It stores the title string of the HTML document and the absolute url. A search engine provides the boolean AND & OR query models with or without filtering the stop list of words. Feature is kept for the Web page owners to add the url to the searchable database. robot-history: robot-environment: modified-date: Wed Oct 4 06:54:31 1995 modified-by: http://info.webcrawler.com/mak/projects/robots/active/all.txt (34 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-id: hometown robot-name: Hometown Spider Pro robot-cover-url: http://www.hometownsingles.com robot-details-url: http://www.hometownsingles.com robot-owner-name: Bob Brown robot-owner-url: http://www.hometownsingles.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: * robot-noindex: yes robot-host: 63.195.193.17 robot-from: no robot-useragent: Hometown Spider Pro robot-language: delphi robot-description: The Hometown Spider Pro is used to maintain the indexes for Hometown Singles. 
robot-history: Innerprise URL Spider Pro robot-environment: commercial modified-date: Tue, 28 Mar 2000 16:00:00 GMT modified-by: Hometown Singles robot-id: wired-digital robot-name: Wired Digital robot-cover-url: robot-details-url: robot-owner-name: Bowen Dwelle robot-owner-url: robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: hotwired robot-noindex: no robot-host: gossip.hotwired.com robot-from: yes robot-useragent: wired-digital-newsbot/1.5 robot-language: perl-5.004 robot-description: this is a test robot-history: robot-environment: research modified-date: Thu, 30 Oct 1997 modified-by: [email protected] robot-id: htdig robot-name: ht://Dig robot-cover-url: http://www.htdig.org/ robot-details-url: http://www.htdig.org/howitworks.html robot-owner-name: Andrew Scherpbier robot-owner-url: http://www.htdig.org/author.html robot-owner-email: [email protected] robot-owner-name2: Geoff Hutchison robot-owner-url2: http://wso.williams.edu/~ghutchis/ robot-owner-email2: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: htdig robot-noindex: yes robot-host: * robot-from: no robot-useragent: htdig/3.1.0b2 robot-language: C,C++. robot-history: This robot was originally developed for use at San Diego State University. robot-environment: modified-date: Tue, 3 Nov 1998 10:09:02 EST modified-by: Geoff Hutchison <[email protected]> robot-id: htmlgobble robot-name: HTMLgobble robot-cover-url: robot-details-url: robot-owner-name: Andreas Ley robot-owner-url: robot-owner-email: [email protected] robot-status: robot-purpose: mirror robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: tp70.rz.uni-karlsruhe.de robot-from: yes robot-useragent: HTMLgobble v2.2 robot-language: robot-description: A mirroring robot. Configured to stay within a directory, sleeps between requests, and the next version will use HEAD to check if the entire document needs to be retrieved robot-history: robot-environment: modified-date: modified-by: robot-id: hyperdecontextualizer robot-name: Hyper-Decontextualizer robot-cover-url: http://www.tricon.net/Comm/synapse/spider/ robot-details-url: robot-owner-name: Cliff Hall robot-owner-url: http://kpt1.tricon.net/cgi-bin/cliff.cgi robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: robot-host: robot-from: no robot-useragent: no robot-language: Perl 5 robot-description: Takes an input sentence and marks up each word with an appropriate hyper-text link. robot-history: robot-environment: modified-date: Mon May 6 17:41:29 1996. modified-by:
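The robot-exclusion-useragent field in the records above is the token a site administrator would name in a /robots.txt file to direct that particular robot. As a rough illustration only (the /private/ path is hypothetical), a site that wanted to keep the ht://Dig robot listed above out of one directory while leaving all other robots unrestricted might use a /robots.txt such as:

    User-agent: htdig
    Disallow: /private/

    User-agent: *
    Disallow: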
robot-id: ibm robot-name: IBM_Planetwide robot-cover-url: http://www.ibm.com/%7ewebmaster/ robot-details-url: robot-owner-name: Ed Costello robot-owner-url: http://www.ibm.com/%7ewebmaster/ robot-owner-email: [email protected]" robot-status: robot-purpose: indexing, maintenance, mirroring robot-type: standalone and robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: www.ibm.com www2.ibm.com robot-from: yes robot-useragent: IBM_Planetwide, robot-language: Perl5 robot-description: Restricted to IBM owned or related domains. robot-history: robot-environment: modified-date: Mon Jan 22 22:09:19 1996. modified-by: robot-id: iconoclast robot-name: Popular Iconoclast robot-cover-url: http://gestalt.sewanee.edu/ic/ robot-details-url: http://gestalt.sewanee.edu/ic/info.html robot-owner-name: Chris Cappuccio robot-owner-url: http://sefl.satelnet.org/~ccappuc/ robot-owner-email: [email protected] robot-status: development robot-purpose: statistics robot-type: standalone robot-platform: unix (OpenBSD) robot-availability: source robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: gestalt.sewanee.edu robot-from: yes robot-useragent: gestaltIconoclast/1.0 libwww-FM/2.17 robot-language: c,perl5 robot-description: This guy likes statistics robot-history: This robot has a history in mathematics and english robot-environment: research modified-date: Wed, 5 Mar 1997 17:35:16 CST modified-by: [email protected] robot-id: Ilse robot-name: Ingrid robot-cover-url: robot-details-url: robot-owner-name: Ilse c.v. robot-owner-url: http://www.ilse.nl/ robot-owner-email: [email protected] robot-status: Running robot-purpose: Indexing robot-type: Web Indexer robot-platform: UNIX http://info.webcrawler.com/mak/projects/robots/active/all.txt (37 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-availability: Commercial as part of search engine package robot-exclusion: Yes robot-exclusion-useragent: INGRID/0.1 robot-noindex: Yes robot-host: bart.ilse.nl robot-from: Yes robot-useragent: INGRID/0.1 robot-language: C robot-description: robot-history: robot-environment: modified-date: 06/13/1997 modified-by: Ilse robot-id: imagelock robot-name: Imagelock robot-cover-url: robot-details-url: robot-owner-name: Ken Belanger robot-owner-url: robot-owner-email: [email protected] robot-status: development robot-purpose: maintenance robot-type: robot-platform: windows95 robot-availability: none robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: 209.111.133.* robot-from: no robot-useragent: Mozilla 3.01 PBWF (Win95) robot-language: robot-description: searches for image links robot-history: robot-environment: service modified-date: Tue, 11 Aug 1998 17:28:52 GMT modified-by: [email protected] robot-id: incywincy robot-name: IncyWincy robot-cover-url: http://osiris.sunderland.ac.uk/sst-scripts/simon.html robot-details-url: robot-owner-name: Simon Stobart robot-owner-url: http://osiris.sunderland.ac.uk/sst-scripts/simon.html robot-owner-email: [email protected] robot-status: robot-purpose: robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: osiris.sunderland.ac.uk robot-from: yes robot-useragent: IncyWincy/1.0b1 robot-language: C++ robot-description: Various Research projects at the University of Sunderland robot-history: robot-environment: modified-date: Fri Jan 19 21:50:32 1996. 
modified-by: robot-id: informant robot-name: Informant robot-cover-url: http://informant.dartmouth.edu/ robot-details-url: http://informant.dartmouth.edu/about.html robot-owner-name: Bob Gray robot-owner-name2: Aditya Bhasin robot-owner-name3: Katsuhiro Moizumi robot-owner-name4: Dr. George V. Cybenko robot-owner-url: http://informant.dartmouth.edu/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: no robot-exclusion-useragent: Informant robot-noindex: no robot-host: informant.dartmouth.edu robot-from: yes robot-useragent: Informant robot-language: c, c++ robot-description: The Informant robot continually checks the Web pages that are relevant to user queries. Users are notified of any new or updated pages. The robot runs daily, but the number of hits per site per day should be quite small, and these hits should be randomly distributed over several hours. Since the robot does not actually follow links (aside from those returned from the major search engines such as Lycos), it does not fall victim to the common looping problems. The robot will support the Robot Exclusion Standard by early December, 1996. robot-history: The robot is part of a research project at Dartmouth College. The robot may become part of a commercial service (at which time it may be subsumed by some other, existing robot). robot-environment: research, service modified-date: Sun, 3 Nov 1996 11:55:00 GMT modified-by: Bob Gray robot-id: infoseek robot-name: InfoSeek Robot 1.0 robot-cover-url: http://www.infoseek.com robot-details-url: robot-owner-name: Steve Kirsch robot-owner-url: http://www.infoseek.com robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: corp-gw.infoseek.com robot-from: yes robot-useragent: InfoSeek Robot 1.0 robot-language: python robot-description: Its purpose is to generate a Resource Discovery database. Collects WWW pages for both InfoSeek's free WWW search and commercial search. Uses a unique proprietary algorithm to identify the most popular and interesting WWW pages. Very fast, but never has more than one request per site outstanding at any given time. Has been refined for more than a year. robot-history: robot-environment: modified-date: Sun May 28 01:35:48 1995 modified-by: robot-id: infoseeksidewinder robot-name: Infoseek Sidewinder robot-cover-url: http://www.infoseek.com/ robot-details-url: robot-owner-name: Mike Agostino robot-owner-url: http://www.infoseek.com/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: robot-from: yes robot-useragent: Infoseek Sidewinder robot-language: C robot-description: Collects WWW pages for both InfoSeek's free WWW search services. Uses a unique, incremental, very fast proprietary algorithm to find WWW pages. robot-history: robot-environment: modified-date: Sat Apr 27 01:20:15 1996.
modified-by: robot-id: infospider robot-name: InfoSpiders robot-cover-url: http://www-cse.ucsd.edu/users/fil/agents/agents.html robot-owner-name: Filippo Menczer robot-owner-url: http://www-cse.ucsd.edu/users/fil/ robot-owner-email: [email protected] robot-status: development robot-purpose: search robot-type: standalone robot-platform: unix, mac robot-availability: none robot-exclusion: yes robot-exclusion-useragent: InfoSpiders robot-noindex: no robot-host: *.ucsd.edu robot-from: yes robot-useragent: InfoSpiders/0.1 robot-language: c, perl5 robot-description: application of artificial life algorithm to adaptive distributed information retrieval robot-history: UC San Diego, Computer Science Dept. PhD research project (1995-97) under supervision of Prof. Rik Belew robot-environment: research modified-date: Mon, 16 Sep 1996 14:08:00 PDT robot-id: inspectorwww robot-name: Inspector Web robot-cover-url: http://www.greenpac.com/inspector/ robot-details-url: http://www.greenpac.com/inspector/ourrobot.html robot-owner-name: Doug Green robot-owner-url: http://www.greenpac.com robot-owner-email: [email protected] robot-status: active: robot significantly developed, but still undergoing fixes robot-purpose: maintentance: link validation, html validation, image size validation, etc http://info.webcrawler.com/mak/projects/robots/active/all.txt (40 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-type: standalone robot-platform: unix robot-availability: free service and more extensive commercial service robot-exclusion: yes robot-exclusion-useragent: inspectorwww robot-noindex: no robot-host: www.corpsite.com, www.greenpac.com, 38.234.171.* robot-from: yes robot-useragent: inspectorwww/1.0 http://www.greenpac.com/inspectorwww.html robot-language: c robot-description: Provide inspection reports which give advise to WWW site owners on missing links, images resize problems, syntax errors, etc. robot-history: development started in Mar 1997 robot-environment: commercial modified-date: Tue Jun 17 09:24:58 EST 1997 modified-by: Doug Green robot-id: intelliagent robot-name: IntelliAgent robot-cover-url: http://www.geocities.com/SiliconValley/3086/iagent.html robot-details-url: robot-owner-name: David Reilly robot-owner-url: http://www.geocities.com/SiliconValley/3086/index.html robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: robot-host: sand.it.bond.edu.au robot-from: no robot-useragent: 'IAGENT/1.0' robot-language: C robot-description: IntelliAgent is still in development. Indeed, it is very far from completion. I'm planning to limit the depth at which it will probe, so hopefully IAgent won't cause anyone much of a problem. At the end of its completion, I hope to publish both the raw data and original source code. robot-history: robot-environment: modified-date: Fri May 31 02:10:39 1996. 
modified-by: robot-id: irobot robot-name: I, Robot robot-cover-url: http://irobot.mame.dk/ robot-details-url: http://irobot.mame.dk/about.phtml robot-owner-name: [mame.dk] robot-owner-url: http://www.mame.dk/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: irobot robot-noindex: yes robot-host: *.mame.dk, 206.161.121.* robot-from: no robot-useragent: I Robot 0.4 ([email protected]) robot-language: c robot-description: I Robot is used to build a fresh database for the emulation community. Primary focus is information on emulation and especially old arcade machines. Primarily English sites will be indexed, and only if they have their own domain. Sites are added manually based on submissions after they have been evaluated. robot-history: The robot was started in June 2000 robot-environment1: service robot-environment2: hobby modified-date: Fri, 27 Oct 2000 09:08:06 GMT modified-by: BombJack [email protected] robot-id: iron33 robot-name: Iron33 robot-cover-url: http://verno.ueda.info.waseda.ac.jp/iron33/ robot-details-url: http://verno.ueda.info.waseda.ac.jp/iron33/history.html robot-owner-name: Takashi Watanabe robot-owner-url: http://www.ueda.info.waseda.ac.jp/~watanabe/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing, statistics robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: Iron33 robot-noindex: no robot-host: *.folon.ueda.info.waseda.ac.jp, 133.9.215.* robot-from: yes robot-useragent: Iron33/0.0 robot-language: c robot-description: The robot "Iron33" is used to build the database for the WWW search engine "Verno". robot-history: robot-environment: research modified-date: Fri, 20 Mar 1998 18:34 JST modified-by: Watanabe Takashi robot-id: israelisearch robot-name: Israeli-search robot-cover-url: http://www.idc.ac.il/Sandbag/ robot-details-url: robot-owner-name: Etamar Laron robot-owner-url: http://www.xpert.com/~etamar/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: dylan.ius.cs.cmu.edu robot-from: no robot-useragent: IsraeliSearch/1.0 robot-language: C robot-description: A complete software package designed to collect information in a distributed workload and support context queries. Intended to be a complete, updated resource for Israeli sites and information related to Israel or Israeli society. robot-history: robot-environment: modified-date: Tue Apr 23 19:23:55 1996.
modified-by: http://info.webcrawler.com/mak/projects/robots/active/all.txt (42 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-id: javabee robot-name: JavaBee robot-cover-url: http://www.javabee.com robot-details-url: robot-owner-name:ObjectBox robot-owner-url:http://www.objectbox.com/ robot-owner-email:[email protected] robot-status:Active robot-purpose:Stealing Java Code robot-type:standalone robot-platform:Java robot-availability:binary robot-exclusion:no robot-exclusion-useragent: robot-noindex:no robot-host:* robot-from:no robot-useragent:JavaBee robot-language:Java robot-description:This robot is used to grab java applets and run them locally overriding the security implemented robot-history: robot-environment:commercial modified-date: modified-by: robot-id: JBot robot-name: JBot Java Web Robot robot-cover-url: http://www.matuschek.net/software/jbot robot-details-url: http://www.matuschek.net/software/jbot robot-owner-name: Daniel Matuschek robot-owner-url: http://www.matuschek.net robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: Java robot-availability: source robot-exclusion: yes robot-exclusion-useragent: JBot robot-noindex: no robot-host: * robot-from: robot-useragent: JBot (but can be changed by the user) robot-language: Java robot-description: Java web crawler to download web sites robot-history: robot-environment: hobby modified-date: Thu, 03 Jan 2000 16:00:00 GMT modified-by: Daniel Matuschek <[email protected]> robot-id: jcrawler robot-name: JCrawler robot-cover-url: http://www.nihongo.org/jcrawler/ robot-details-url: robot-owner-name: Benjamin Franz robot-owner-url: http://www.nihongo.org/snowhare/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes http://info.webcrawler.com/mak/projects/robots/active/all.txt (43 of 107) [18.02.2001 13:17:47] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-exclusion-useragent: jcrawler robot-noindex: yes robot-host: db.netimages.com robot-from: yes robot-useragent: JCrawler/0.2 robot-language: perl5 robot-description: JCrawler is currently used to build the Vietnam topic specific WWW index for VietGATE <URL:http://www.vietgate.net/>. It schedules visits randomly, but will not visit a site more than once every two minutes. It uses a subject matter relevance pruning algorithm to determine what pages to crawl and index and will not generally index pages with no Vietnam related content. Uses Unicode internally, and detects and converts several different Vietnamese character encodings. robot-history: robot-environment: service modified-date: Wed, 08 Oct 1997 00:09:52 GMT modified-by: Benjamin Franz robot-id: jeeves robot-name: Jeeves robot-cover-url: http://www-students.doc.ic.ac.uk/~lglb/Jeeves/ robot-details-url: robot-owner-name: Leon Brocard robot-owner-url: http://www-students.doc.ic.ac.uk/~lglb/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing maintenance statistics robot-type: standalone robot-platform: UNIX robot-availability: none robot-exclusion: no robot-exclusion-useragent: jeeves robot-noindex: no robot-host: *.doc.ic.ac.uk robot-from: yes robot-useragent: Jeeves v0.05alpha (PERL, LWP, [email protected]) robot-language: perl5 robot-description: Jeeves is basically a web-mirroring robot built as a final-year degree project. 
It will have many nice features and is already web-friendly. Still in development. robot-history: Still short (0.05alpha) robot-environment: research modified-date: Wed, 23 Apr 1997 17:26:50 GMT modified-by: Leon Brocard robot-id: jobot robot-name: Jobot robot-cover-url: http://www.micrognosis.com/~ajack/jobot/jobot.html robot-details-url: robot-owner-name: Adam Jack robot-owner-url: http://www.micrognosis.com/~ajack/index.html robot-owner-email: [email protected] robot-status: inactive robot-purpose: standalone robot-type: robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: supernova.micrognosis.com robot-from: yes robot-useragent: Jobot/0.1alpha libwww-perl/4.0 robot-language: perl 4 robot-description: Its purpose is to generate a Resource Discovery database. Intended to seek out sites of potential "career interest". Hence - Job Robot. robot-history: robot-environment: modified-date: Tue Jan 9 18:55:55 1996 modified-by: robot-id: joebot robot-name: JoeBot robot-cover-url: robot-details-url: robot-owner-name: Ray Waldin robot-owner-url: http://www.primenet.com/~rwaldin robot-owner-email: [email protected] robot-status: robot-purpose: robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: robot-from: yes robot-useragent: JoeBot/x.x, robot-language: java robot-description: JoeBot is a generic web crawler implemented as a collection of Java classes which can be used in a variety of applications, including resource discovery, link validation, mirroring, etc. It currently limits itself to one visit per host per minute. robot-history: robot-environment: modified-date: Sun May 19 08:13:06 1996. modified-by: robot-id: jubii robot-name: The Jubii Indexing Robot robot-cover-url: http://www.jubii.dk/robot/default.htm robot-details-url: robot-owner-name: Jakob Faarvang robot-owner-url: http://www.cybernet.dk/staff/jakob/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing, maintenance robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: any host in the cybernet.dk domain robot-from: yes robot-useragent: JubiiRobot/version# robot-language: visual basic 4.0 robot-description: Its purpose is to generate a Resource Discovery database, and validate links. Used for indexing the .dk top-level domain as well as other Danish sites for a Danish web database, as well as link validation.
robot-history: Will be in constant operation from Spring 1996 robot-environment: modified-date: Sat Jan 6 20:58:44 1996 modified-by: robot-id: jumpstation robot-name: JumpStation robot-cover-url: http://js.stir.ac.uk/jsbin/jsii robot-details-url: robot-owner-name: Jonathon Fletcher robot-owner-url: http://www.stir.ac.uk/~jf1 robot-owner-email: [email protected] robot-status: retired robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: *.stir.ac.uk robot-from: yes robot-useragent: jumpstation robot-language: perl, C, c++ robot-description: robot-history: Originated as a weekend project in 1993. robot-environment: modified-date: Tue May 16 00:57:42 1995. modified-by: robot-id: katipo robot-name: Katipo robot-cover-url: http://www.vuw.ac.nz/~newbery/Katipo.html robot-details-url: http://www.vuw.ac.nz/~newbery/Katipo/Katipo-doc.html robot-owner-name: Michael Newbery robot-owner-url: http://www.vuw.ac.nz/~newbery robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: Macintosh robot-availability: binary robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: yes robot-useragent: Katipo/1.0 robot-language: c robot-description: Watches all the pages you have previously visited and tells you when they have changed. robot-history: robot-environment: commercial (free) modified-date: Tue, 25 Jun 96 11:40:07 +1200 modified-by: Michael Newbery robot-id: kdd robot-name: KDD-Explorer robot-cover-url: http://mlc.kddvw.kcom.or.jp/CLINKS/html/clinks.html robot-details-url: not available robot-owner-name: Kazunori Matsumoto robot-owner-url: not available robot-owner-email: [email protected] robot-status: development (to be active in June 1997) robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: KDD-Explorer robot-noindex: no robot-host: mlc.kddvw.kcom.or.jp robot-from: yes robot-useragent: KDD-Explorer/0.1 robot-language: c robot-description: KDD-Explorer is used for indexing valuable documents which will be retrieved via an experimental cross-language search engine, CLINKS. robot-history: This robot was designed in the Knowledge-based Information Processing Laboratory, KDD R&D Laboratories, 1996-1997 robot-environment: research modified-date: Mon, 2 June 1997 18:00:00 JST modified-by: Kazunori Matsumoto robot-id: kilroy robot-name: Kilroy robot-cover-url: http://purl.org/kilroy robot-details-url: http://purl.org/kilroy robot-owner-name: OCLC robot-owner-url: http://www.oclc.org robot-owner-email: [email protected] robot-status: active robot-purpose: indexing, statistics robot-type: standalone robot-platform: unix, windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: * robot-noindex: no robot-host: *.oclc.org robot-from: no robot-useragent: yes robot-language: java robot-description: Used to collect data for several projects. Runs constantly and visits sites no faster than once every 90 seconds.
robot-history:none robot-environment:research,service modified-date:Thursday, 24 Apr 1997 20:00:00 GMT modified-by:tkac robot-id: ko_yappo_robot robot-name: KO_Yappo_Robot robot-cover-url: http://yappo.com/info/robot.html robot-details-url: http://yappo.com/ robot-owner-name: Kazuhiro Osawa robot-owner-url: http://yappo.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: ko_yappo_robot robot-noindex: yes robot-host: yappo.com,209.25.40.1 robot-from: yes robot-useragent: KO_Yappo_Robot/1.0.4(http://yappo.com/info/robot.html) robot-language: perl robot-description: The KO_Yappo_Robot robot is used to build the database for the Yappo search service by k,osawa (part of AOL). http://info.webcrawler.com/mak/projects/robots/active/all.txt (47 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt The robot runs random day, and visits sites in a random order. robot-history: The robot is hobby of k,osawa at the Tokyo in 1997 robot-environment: hobby modified-date: Fri, 18 Jul 1996 12:34:21 GMT modified-by: KO robot-id: labelgrabber.txt robot-name: LabelGrabber robot-cover-url: http://www.w3.org/PICS/refcode/LabelGrabber/index.htm robot-details-url: http://www.w3.org/PICS/refcode/LabelGrabber/index.htm robot-owner-name: Kyle Jamieson robot-owner-url: http://www.w3.org/PICS/refcode/LabelGrabber/index.htm robot-owner-email: [email protected] robot-status: active robot-purpose: Grabs PICS labels from web pages, submits them to a label bueau robot-type: standalone robot-platform: windows, windows95, windowsNT, unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: label-grabber robot-noindex: no robot-host: head.w3.org robot-from: no robot-useragent: LabelGrab/1.1 robot-language: java robot-description: The label grabber searches for PICS labels and submits them to a label bureau robot-history: N/A robot-environment: research modified-date: Wed, 28 Jan 1998 17:32:52 GMT modified-by: [email protected] robot-id: larbin robot-name: larbin robot-cover-url: http://para.inria.fr/~ailleret/larbin/index-eng.html robot-owner-name: Sebastien Ailleret robot-owner-url: http://para.inria.fr/~ailleret/ robot-owner-email: [email protected] robot-status: active robot-purpose: Your imagination is the only limit robot-type: standalone robot-platform: Linux robot-availability: source (GPL), mail me for customization robot-exclusion: yes robot-exclusion-useragent: larbin robot-noindex: no robot-host: * robot-from: no robot-useragent: larbin (+mail) robot-language: c++ robot-description: Parcourir le web, telle est ma passion robot-history: french research group (INRIA Verso) robot-environment: hobby modified-date: 2000-3-28 modified-by: Sebastien Ailleret robot-id: legs robot-name: legs robot-cover-url: http://www.MagPortal.com/ robot-details-url: robot-owner-name: Bill Dimm robot-owner-url: http://www.HotNeuron.com/ robot-owner-email: [email protected] robot-status: active http://info.webcrawler.com/mak/projects/robots/active/all.txt (48 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-purpose: indexing robot-type: standalone robot-platform: linux robot-availability: none robot-exclusion: yes robot-exclusion-useragent: legs robot-noindex: no robot-host: robot-from: yes robot-useragent: legs robot-language: perl5 robot-description: The legs robot is used to build 
the magazine article database for MagPortal.com. robot-history: robot-environment: service modified-date: Wed, 22 Mar 2000 14:10:49 GMT modified-by: Bill Dimm robot-id: linkidator robot-name: Link Validator robot-cover-url: robot-details-url: robot-owner-name: Thomas Gimon robot-owner-url: robot-owner-email: [email protected] robot-status: development robot-purpose: maintenance robot-type: standalone robot-platform: unix, windows robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Linkidator robot-noindex: yes robot-nofollow: yes robot-host: *.mitre.org robot-from: yes robot-useragent: Linkidator/0.93 robot-language: perl5 robot-description: Recursively checks all links on a site, looking for broken or redirected links. Checks all off-site links using HEAD requests and does not progress further. Designed to behave well and to be very configurable. robot-history: Built using WWW-Robot-0.022 perl module. Currently in beta test. Seeking approval for public release. robot-environment: internal modified-date: Fri, 20 Jan 2001 02:22:00 EST modified-by: Thomas Gimon robot-id:linkscan robot-name:LinkScan robot-cover-url:http://www.elsop.com/ robot-details-url:http://www.elsop.com/linkscan/overview.html robot-owner-name:Electronic Software Publishing Corp. (Elsop) robot-owner-url:http://www.elsop.com/ robot-owner-email:[email protected] robot-status:Robot actively in use robot-purpose:Link checker, SiteMapper, and HTML Validator robot-type:Standalone robot-platform:Unix, Linux, Windows 98/NT robot-availability:Program is shareware robot-exclusion:No robot-exclusion-useragent: robot-noindex:Yes robot-host:* robot-from: http://info.webcrawler.com/mak/projects/robots/active/all.txt (49 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-useragent:LinkScan Server/5.5 | LinkScan Workstation/5.5 robot-language:perl5 robot-description:LinkScan checks links, validates HTML and creates site maps robot-history: First developed by Elsop in January,1997 robot-environment:Commercial modified-date:Fri, 3 September 1999 17:00:00 PDT modified-by: Kenneth R. Churilla robot-id: linkwalker robot-name: LinkWalker robot-cover-url: http://www.seventwentyfour.com robot-details-url: http://www.seventwentyfour.com/tech.html robot-owner-name: Roy Bryant robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance, statistics robot-type: standalone robot-platform: windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: linkwalker robot-noindex: yes robot-host: *.seventwentyfour.com robot-from: yes robot-useragent: LinkWalker robot-language: c++ robot-description: LinkWalker generates a database of links. We send reports of bad ones to webmasters. robot-history: Constructed late 1997 through April 1998. In full service April 1998. robot-environment: service modified-date: Wed, 22 Apr 1998 modified-by: Roy Bryant robot-id:lockon robot-name:Lockon robot-cover-url: robot-details-url: robot-owner-name:Seiji Sasazuka & Takahiro Ohmori robot-owner-url: robot-owner-email:[email protected] robot-status:active robot-purpose:indexing robot-type:standalone robot-platform:UNIX robot-availability:none robot-exclusion:yes robot-exclusion-useragent:Lockon robot-noindex:yes robot-host:*.hitech.tuis.ac.jp robot-from:yes robot-useragent:Lockon/xxxxx robot-language:perl5 robot-description:This robot gathers only HTML document. 
robot-history: This robot was developed at the Tokyo University of Information Sciences in 1998. robot-environment: research modified-date: Tue. 10 Nov 1998 20:00:00 GMT modified-by: Seiji Sasazuka & Takahiro Ohmori robot-id: logo_gif robot-name: logo.gif Crawler robot-cover-url: http://www.inm.de/projects/logogif.html robot-details-url: robot-owner-name: Sevo Stille robot-owner-url: http://www.inm.de/people/sevo robot-owner-email: [email protected] robot-status: under development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: logo_gif_crawler robot-noindex: no robot-host: *.inm.de robot-from: yes robot-useragent: logo.gif crawler robot-language: perl robot-description: meta-indexing engine for corporate logo graphics. The robot runs at irregular intervals and will only pull a start page and its associated /.*logo\.gif/i (if any). It will be terminated once a statistically significant number of samples has been collected. robot-history: logo.gif is part of the design diploma of Markus Weisbeck, and tries to analyze the abundance of the logo metaphor in WWW corporate design. The crawler and image database were written by Sevo Stille and Peter Frank of the Institut für Neue Medien, respectively. robot-environment: research, statistics modified-date: 25.5.97 modified-by: Sevo Stille robot-id: lycos robot-name: Lycos robot-cover-url: http://lycos.cs.cmu.edu/ robot-details-url: robot-owner-name: Dr. Michael L. Mauldin robot-owner-url: http://fuzine.mt.cs.cmu.edu/mlm/home.html robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: fuzine.mt.cs.cmu.edu, lycos.com robot-from: robot-useragent: Lycos/x.x robot-language: robot-description: This is a research program in providing information retrieval and discovery in the WWW, using a finite memory model of the web to guide intelligent, directed searches for specific information needs robot-history: robot-environment: modified-date: modified-by: robot-id: macworm robot-name: Mac WWWWorm robot-cover-url: robot-details-url: robot-owner-name: Sebastien Lemieux robot-owner-url: robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: robot-platform: Macintosh robot-availability: none robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: robot-useragent: robot-language: hypercard robot-description: A French keyword-searching robot for the Mac. The author has decided not to release this robot to the public. robot-history: robot-environment: modified-date: modified-by: robot-id: magpie robot-name: Magpie robot-cover-url: robot-details-url: robot-owner-name: Keith Jones robot-owner-url: robot-owner-email: [email protected] robot-status: development robot-purpose: indexing, statistics robot-type: standalone robot-platform: unix robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: *.blueberry.co.uk, 194.70.52.*, 193.131.167.144 robot-from: no robot-useragent: Magpie/1.0 robot-language: perl5 robot-description: Used to
obtain information from a specified list of web pages for local indexing. Runs every two hours, and visits only a small number of sites. robot-history: Part of a research project. Alpha testing from 10 July 1996, Beta testing from 10 September. robot-environment: research modified-date: Wed, 10 Oct 1996 13:15:00 GMT modified-by: Keith Jones robot-id: mattie robot-name: Mattie robot-cover-url: http://www.mcw.aarkayn.org robot-details-url: http://www.mcw.aarkayn.org/web/mattie.asp robot-owner-name: Matt robot-owner-url: http://www.mcw.aarkayn.org robot-owner-email: [email protected] robot-status: Active robot-purpose: MP3 Spider robot-type: Standalone robot-platform: Windows 2000 robot-availability: None robot-exclusion: Yes robot-exclusion-useragent: mattie robot-noindex: N/A robot-nofollow: Yes robot-host: mattie.mcw.aarkayn.org robot-from: Yes robot-useragent: AO/A-T.IDRG v2.3 http://info.webcrawler.com/mak/projects/robots/active/all.txt (52 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-language: AO/A-T.IDRGL robot-description: Mattie?s sole purpose is to seek out MP3z for Matt. robot-history: Mattie was written 2000 Mar. 03 Fri. 18:48:00 -0500 GMT (e). He was last modified 2000 Nov. 08 Wed. 14:52:00 -0600 GMT (f). robot-environment: Hobby modified-date: Wed, 08 Nov 2000 20:52:00 GMT modified-by: Matt robot-id: mediafox robot-name: MediaFox robot-cover-url: none robot-details-url: none robot-owner-name: Lars Eilebrecht robot-owner-url: http://www.home.unix-ag.org/sfx/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing and maintenance robot-type: standalone robot-platform: (Java) robot-availability: none robot-exclusion: yes robot-exclusion-useragent: mediafox robot-noindex: yes robot-host: 141.99.*.* robot-from: yes robot-useragent: MediaFox/x.y robot-language: Java robot-description: The robot is used to index meta information of a specified set of documents and update a database accordingly. robot-history: Project at the University of Siegen robot-environment: research modified-date: Fri Aug 14 03:37:56 CEST 1998 modified-by: Lars Eilebrecht robot-id:merzscope robot-name:MerzScope robot-cover-url:http://www.merzcom.com robot-details-url:http://www.merzcom.com robot-owner-name:(Client based robot) robot-owner-url:(Client based robot) robot-owner-email: robot-status:actively in use robot-purpose:WebMapping robot-type:standalone robot-platform: (Java Based) unix,windows95,windowsNT,os2,mac etc .. robot-availability:binary robot-exclusion: yes robot-exclusion-useragent: MerzScope robot-noindex: no robot-host:(Client Based) robot-from: robot-useragent: MerzScope robot-language: java robot-description: Robot is part of a Web-Mapping package called MerzScope, to be used mainly by consultants, and web masters to create and publish maps, on and of the World wide web. 
robot-history: robot-environment: modified-date: Fri, 13 March 1997 16:31:00 modified-by: Philip Lenir, MerzScope lead developer robot-id: meshexplorer robot-name: NEC-MeshExplorer robot-cover-url: http://netplaza.biglobe.or.jp/ http://info.webcrawler.com/mak/projects/robots/active/all.txt (53 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-details-url: http://netplaza.biglobe.or.jp/keyword.html robot-owner-name: web search service maintenance group robot-owner-url: http://netplaza.biglobe.or.jp/keyword.html robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: NEC-MeshExplorer robot-noindex: no robot-host: meshsv300.tk.mesh.ad.jp robot-from: yes robot-useragent: NEC-MeshExplorer robot-language: c robot-description: The NEC-MeshExplorer robot is used to build the database for the NETPLAZA search service operated by NEC Corporation. The robot searches URLs around sites in Japan (JP domain). The robot runs every day, and visits sites in a random order. robot-history: Prototype version of this robot was developed in C&C Research Laboratories, NEC Corporation. Current robot (Version 1.0) is based on the prototype and has more functions. robot-environment: research modified-date: Jan 1, 1997 modified-by: Nobuya Kubo, Hajime Takano robot-id: MindCrawler robot-name: MindCrawler robot-cover-url: http://www.mindpass.com/_technology_faq.htm robot-details-url: robot-owner-name: Mindpass robot-owner-url: http://www.mindpass.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: linux robot-availability: none robot-exclusion: yes robot-exclusion-useragent: MindCrawler robot-noindex: no robot-host: * robot-from: no robot-useragent: MindCrawler robot-language: c++ robot-description: robot-history: robot-environment: modified-date: Tue Mar 28 11:30:09 CEST 2000 modified-by: robot-id:moget robot-name:moget robot-cover-url: robot-details-url: robot-owner-name:NTT-ME Information Xing, Inc. robot-owner-url:http://www.nttx.co.jp robot-owner-email:[email protected] robot-status:active robot-purpose:indexing,statistics robot-type:standalone robot-platform:unix robot-availability:none http://info.webcrawler.com/mak/projects/robots/active/all.txt (54 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-exclusion:yes robot-exclusion-useragent:moget robot-noindex:yes robot-host:*.goo.ne.jp robot-from:yes robot-useragent:moget/1.0 robot-language:c robot-description: This robot is used to build the database for the search service operated by goo. robot-history: robot-environment:service modified-date:Thu, 30 Mar 2000 18:40:37 GMT modified-by:[email protected] robot-id: momspider robot-name: MOMspider robot-cover-url: http://www.ics.uci.edu/WebSoft/MOMspider/ robot-details-url: robot-owner-name: Roy T. Fielding robot-owner-url: http://www.ics.uci.edu/dir/grad/Software/fielding robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance, statistics robot-type: standalone robot-platform: UNIX robot-availability: source robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: yes robot-useragent: MOMspider/1.00 libwww-perl/0.40 robot-language: perl 4 robot-description: to validate links, and generate statistics.
It's usually run from anywhere robot-history: Originated as a research project at the University of California, Irvine, in 1993. Presented at the First International WWW Conference in Geneva, 1994. robot-environment: modified-date: Sat May 6 08:11:58 1995 modified-by: [email protected] robot-id: monster robot-name: Monster robot-cover-url: http://www.neva.ru/monster.list/russian.www.html robot-details-url: robot-owner-name: Dmitry Dicky robot-owner-url: http://wild.stu.neva.ru/ robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance, mirroring robot-type: standalone robot-platform: UNIX (Linux) robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: wild.stu.neva.ru robot-from: robot-useragent: Monster/vX.X.X -$TYPE ($OSTYPE) robot-language: C robot-description: The Monster has two parts - Web searcher and Web analyzer. Searcher is intended to perform the list of WWW sites of desired domain (for example it can perform list of all WWW sites of mit.edu, com, org, etc... domain) http://info.webcrawler.com/mak/projects/robots/active/all.txt (55 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt In the User-agent field $TYPE is set to 'Mapper' for Web searcher and 'StAlone' for Web analyzer. robot-history: Now the full (I suppose) list of ex-USSR sites is produced. robot-environment: modified-date: Tue Jun 25 10:03:36 1996 modified-by: robot-id: motor robot-name: Motor robot-cover-url: http://www.cybercon.de/Motor/index.html robot-details-url: robot-owner-name: Mr. Oliver Runge, Mr. Michael Goeckel robot-owner-url: http://www.cybercon.de/index.html robot-owner-email: [email protected] robot-status: developement robot-purpose: indexing robot-type: standalone robot-platform: mac robot-availability: data robot-exclusion: yes robot-exclusion-useragent: Motor robot-noindex: no robot-host: Michael.cybercon.technopark.gmd.de robot-from: yes robot-useragent: Motor/0.2 robot-language: 4th dimension robot-description: The Motor robot is used to build the database for the www.webindex.de search service operated by CyberCon. 
The robot is under development - it runs at random intervals and visits sites in a priority-driven order (.de/.ch/.at first, root and robots.txt first) robot-history: robot-environment: service modified-date: Wed, 3 Jul 1996 15:30:00 +0100 modified-by: Michael Goeckel ([email protected]) robot-id: muscatferret robot-name: Muscat Ferret robot-cover-url: http://www.muscat.co.uk/euroferret/ robot-details-url: robot-owner-name: Olly Betts robot-owner-url: http://www.muscat.co.uk/~olly/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: MuscatFerret robot-noindex: yes robot-host: 193.114.89.*, 194.168.54.11 robot-from: yes robot-useragent: MuscatFerret/<version> robot-language: c, perl5 robot-description: Used to build the database for the EuroFerret <URL:http://www.muscat.co.uk/euroferret/> robot-history: robot-environment: service modified-date: Tue, 21 May 1997 17:11:00 GMT modified-by: [email protected] robot-id: mwdsearch robot-name: Mwd.Search robot-cover-url: (none) robot-details-url: (none) http://info.webcrawler.com/mak/projects/robots/active/all.txt (56 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-name: Antti Westerberg robot-owner-url: (none) robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix (Linux) robot-availability: none robot-exclusion: yes robot-exclusion-useragent: MwdSearch robot-noindex: yes robot-host: *.fifi.net robot-from: no robot-useragent: MwdSearch/0.1 robot-language: perl5, c robot-description: Robot for indexing Finnish (top-level domain .fi) web pages for a search engine called Fifi. Visits sites in random order. robot-history: (none) robot-environment: service (+ commercial) modified-date: Mon, 26 May 1997 15:55:02 EEST modified-by: [email protected] robot-id: myweb robot-name: Internet Shinchakubin robot-cover-url: http://naragw.sharp.co.jp/myweb/home/ robot-details-url: robot-owner-name: SHARP Corp. robot-owner-url: http://naragw.sharp.co.jp/myweb/home/ robot-owner-email: [email protected] robot-status: active robot-purpose: find new links and changed pages robot-type: standalone robot-platform: Windows98 robot-availability: binary as bundled software robot-exclusion: yes robot-exclusion-useragent: sharp-info-agent robot-noindex: no robot-host: * robot-from: no robot-useragent: User-Agent: Mozilla/4.0 (compatible; sharp-info-agent v1.0; ) robot-language: Java robot-description: makes a list of new links and changed pages based on the user's frequently clicked pages in the past 31 days. The client may run this software once or a few times every day, manually or at a specified time.
robot-history: shipped for SHARP's PC users since Feb 2000 robot-environment: commercial modified-date: Fri, 30 Jun 2000 19:02:52 JST modified-by: Katsuo Doi <[email protected]> robot-id: netcarta robot-name: NetCarta WebMap Engine robot-cover-url: http://www.netcarta.com/ robot-details-url: robot-owner-name: NetCarta WebMap Engine robot-owner-url: http://www.netcarta.com/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing, maintenance, mirroring, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: http://info.webcrawler.com/mak/projects/robots/active/all.txt (57 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-noindex: robot-host: robot-from: yes robot-useragent: NetCarta CyberPilot Pro robot-language: C++. robot-description: The NetCarta WebMap Engine is a general purpose, commercial spider. Packaged with a full GUI in the CyberPilo Pro product, it acts as a personal spider to work with a browser to facilitiate context-based navigation. The WebMapper product uses the robot to manage a site (site copy, site diff, and extensive link management facilities). All versions can create publishable NetCarta WebMaps, which capture the crawled information. If the robot sees a published map, it will return the published map rather than continuing its crawl. Since this is a personal spider, it will be launched from multiple domains. This robot tends to focus on a particular site. No instance of the robot should have more than one outstanding request out to any given site at a time. The User-agent field contains a coded ID identifying the instance of the spider; specific users can be blocked via robots.txt using this ID. robot-history: robot-environment: modified-date: Sun Feb 18 02:02:49 1996. modified-by: robot-id: netmechanic robot-name: NetMechanic robot-cover-url: http://www.netmechanic.com robot-details-url: http://www.netmechanic.com/faq.html robot-owner-name: Tom Dahm robot-owner-url: http://iquest.com/~tdahm robot-owner-email: [email protected] robot-status: development robot-purpose: Link and HTML validation robot-type: standalone with web gateway robot-platform: UNIX robot-availability: via web page robot-exclusion: Yes robot-exclusion-useragent: WebMechanic robot-noindex: no robot-host: 206.26.168.18 robot-from: no robot-useragent: NetMechanic robot-language: C robot-description: NetMechanic is a link validation and HTML validation robot run using a web page interface. robot-history: robot-environment: modified-date: Sat, 17 Aug 1996 12:00:00 GMT modified-by: robot-id: netscoop robot-name: NetScoop robot-cover-url: http://www-a2k.is.tokushima-u.ac.jp/search/index.html robot-owner-name: Kenji Kita robot-owner-url: http://www-a2k.is.tokushima-u.ac.jp/member/kita/index.html robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: UNIX robot-availability: none robot-exclusion: yes robot-exclusion-useragent: NetScoop http://info.webcrawler.com/mak/projects/robots/active/all.txt (58 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-host: alpha.is.tokushima-u.ac.jp, beta.is.tokushima-u.ac.jp robot-useragent: NetScoop/1.0 libwww/5.0a robot-language: C robot-description: The NetScoop robot is used to build the database for the NetScoop search engine. 
robot-history: The robot has been used in the research project at the Faculty of Engineering, Tokushima University, Japan., since Dec. 1996. robot-environment: research modified-date: Fri, 10 Jan 1997. modified-by: Kenji Kita robot-id: newscan-online robot-name: newscan-online robot-cover-url: http://www.newscan-online.de/ robot-details-url: http://www.newscan-online.de/info.html robot-owner-name: Axel Mueller robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: Linux robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: newscan-online robot-noindex: no robot-host: *newscan-online.de robot-from: yes robot-useragent: newscan-online/1.1 robot-language: perl robot-description: The newscan-online robot is used to build a database for the newscan-online news search service operated by smart information services. The robot runs daily and visits predefined sites in a random order. robot-history: This robot finds its roots in a prereleased software for news filtering for Lotus Notes in 1995. robot-environment: service modified-date: Fri, 9 Apr 1999 11:45:00 GMT modified-by: Axel Mueller robot-id: nhse robot-name: NHSE Web Forager robot-cover-url: http://nhse.mcs.anl.gov/ robot-details-url: robot-owner-name: Robert Olson robot-owner-url: http://www.mcs.anl.gov/people/olson/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: *.mcs.anl.gov robot-from: yes robot-useragent: NHSEWalker/3.0 robot-language: perl 5 robot-description: to generate a Resource Discovery database robot-history: robot-environment: modified-date: Fri May 5 15:47:55 1995 modified-by: http://info.webcrawler.com/mak/projects/robots/active/all.txt (59 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-id: nomad robot-name: Nomad robot-cover-url: http://www.cs.colostate.edu/~sonnen/projects/nomad.html robot-details-url: robot-owner-name: Richard Sonnen robot-owner-url: http://www.cs.colostate.edu/~sonnen/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: robot-host: *.cs.colostate.edu robot-from: no robot-useragent: Nomad-V2.x robot-language: Perl 4 robot-description: robot-history: Developed in 1995 at Colorado State University. robot-environment: modified-date: Sat Jan 27 21:02:20 1996. modified-by: robot-id: northstar robot-name: The NorthStar Robot robot-cover-url: http://comics.scs.unr.edu:7000/top.html robot-details-url: robot-owner-name: Fred Barrie robot-owner-url: robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: robot-host: frognot.utdallas.edu, utdallas.edu, cnidir.org robot-from: yes robot-useragent: NorthStar robot-language: robot-description: Recent runs (26 April 94) will concentrate on textual analysis of the Web versus GopherSpace (from the Veronica data) as well as indexing. 
robot-history: robot-environment: modified-date: modified-by: robot-id: occam robot-name: Occam robot-cover-url: http://www.cs.washington.edu/research/projects/ai/www/occam/ robot-details-url: robot-owner-name: Marc Friedman robot-owner-url: http://www.cs.washington.edu/homes/friedman/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes http://info.webcrawler.com/mak/projects/robots/active/all.txt (60 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-exclusion-useragent: Occam robot-noindex: no robot-host: gentian.cs.washington.edu, sekiu.cs.washington.edu, saxifrage.cs.washington.edu robot-from: yes robot-useragent: Occam/1.0 robot-language: CommonLisp, perl4 robot-description: The robot takes high-level queries, breaks them down into multiple web requests, and answers them by combining disparate data gathered in one minute from numerous web sites, or from the robots cache. Currently the only user is me. robot-history: The robot is a descendant of Rodney, an earlier project at the University of Washington. robot-environment: research modified-date: Thu, 21 Nov 1996 20:30 GMT modified-by: [email protected] (Marc Friedman) robot-id: octopus robot-name: HKU WWW Octopus robot-cover-url: http://phoenix.cs.hku.hk:1234/~jax/w3rui.shtml robot-details-url: robot-owner-name: Law Kwok Tung , Lee Tak Yeung , Lo Chun Wing robot-owner-url: http://phoenix.cs.hku.hk:1234/~jax robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: no. robot-exclusion-useragent: robot-noindex: robot-host: phoenix.cs.hku.hk robot-from: yes robot-useragent: HKU WWW Robot, robot-language: Perl 5, C, Java. robot-description: HKU Octopus is an ongoing project for resource discovery in the Hong Kong and China WWW domain . It is a research project conducted by three undergraduate at the University of Hong Kong robot-history: robot-environment: modified-date: Thu Mar 7 14:21:55 1996. modified-by: robot-id: orb_search robot-name: Orb Search robot-cover-url: http://orbsearch.home.ml.org robot-details-url: http://orbsearch.home.ml.org robot-owner-name: Matt Weber robot-owner-url: http://www.weberworld.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: data robot-exclusion: yes robot-exclusion-useragent: Orbsearch/1.0 robot-noindex: yes robot-host: cow.dyn.ml.org, *.dyn.ml.org robot-from: yes robot-useragent: Orbsearch/1.0 robot-language: Perl5 robot-description: Orbsearch builds the database for Orb Search Engine. http://info.webcrawler.com/mak/projects/robots/active/all.txt (61 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt It runs when requested. robot-history: This robot was started as a hobby. robot-environment: hobby modified-date: Sun, 31 Aug 1997 02:28:52 GMT modified-by: Matt Weber robot-id: packrat robot-name: Pack Rat robot-cover-url: http://web.cps.msu.edu/~dexterte/isl/packrat.html robot-details-url: robot-owner-name: Terry Dexter robot-owner-url: http://web.cps.msu.edu/~dexterte robot-owner-email: [email protected] robot-status: development robot-purpose: both maintenance and mirroring robot-type: standalone robot-platform: unix robot-availability: at the moment, none...source when developed. 
robot-exclusion: yes robot-exclusion-useragent: packrat or * robot-noindex: no, not yet robot-host: cps.msu.edu robot-from: robot-useragent: PackRat/1.0 robot-language: perl with libwww-5.0 robot-description: Used for local maintenance and for gathering web pages so that local statisistical info can be used in artificial intelligence programs. Funded by NEMOnline. robot-history: In the making... robot-environment: research modified-date: Tue, 20 Aug 1996 15:45:11 modified-by: Terry Dexter robot-id:pageboy robot-name:PageBoy robot-cover-url:http://www.webdocs.org/ robot-details-url:http://www.webdocs.org/ robot-owner-name:Chihiro Kuroda robot-owner-url:http://www.webdocs.org/ robot-owner-email:[email protected] robot-status:development robot-purpose:indexing robot-type:standalone robot-platform:unix robot-availability:none robot-exclusion:yes robot-exclusion-useragent:pageboy robot-noindex:yes robot-nofollow:yes robot-host:*.webdocs.org robot-from:yes robot-useragent:PageBoy/1.0 robot-language:c robot-description:The robot visits at regular intervals. robot-history:none robot-environment:service modified-date:Fri, 21 Oct 1999 17:28:52 GMT modified-by:webdocs robot-id: parasite robot-name: ParaSite robot-cover-url: http://www.ianett.com/parasite/ robot-details-url: http://www.ianett.com/parasite/ robot-owner-name: iaNett.com http://info.webcrawler.com/mak/projects/robots/active/all.txt (62 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-url: http://www.ianett.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: ParaSite robot-noindex: yes robot-nofollow: yes robot-host: *.ianett.com robot-from: yes robot-useragent: ParaSite/0.21 (http://www.ianett.com/parasite/) robot-language: c++ robot-description: Builds index for ianett.com search database. Runs continiously. robot-history: Second generation of ianett.com spidering technology, originally called Sven. 
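A note on the exclusion fields used throughout these entries: robot-exclusion-useragent gives the token a server administrator would match in a /robots.txt record, while robot-noindex and robot-nofollow say whether the robot honours the Robots META tag. A minimal illustration follows; the /private/ path and the choice of PageBoy and ParaSite are examples only, not recommendations.

    # /robots.txt -- keep these two robots out of a private area
    User-agent: pageboy
    Disallow: /private/

    User-agent: ParaSite
    Disallow: /private/

    # Per-page alternative, for robots whose entries list robot-noindex/robot-nofollow as "yes":
    # <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">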
robot-environment: service modified-date: July 28, 2000 modified-by: Marty Anstey robot-id: patric robot-name: Patric robot-cover-url: http://www.nwnet.net/technical/ITR/index.html robot-details-url: http://www.nwnet.net/technical/ITR/index.html robot-owner-name: [email protected] robot-owner-url: http://www.nwnet.net/company/staff/toney robot-owner-email: [email protected] robot-status: development robot-purpose: statistics robot-type: standalone robot-platform: unix robot-availability: data robot-exclusion: yes robot-exclusion-useragent: patric robot-noindex: yes robot-host: *.nwnet.net robot-from: no robot-useragent: Patric/0.01a robot-language: perl robot-description: (contained at http://www.nwnet.net/technical/ITR/index.html ) robot-history: (contained at http://www.nwnet.net/technical/ITR/index.html ) robot-environment: service modified-date: Thurs, 15 Aug 1996 modified-by: [email protected] robot-id: pegasus robot-name: pegasus robot-cover-url: http://opensource.or.id/projects.html robot-details-url: http://pegasus.opensource.or.id robot-owner-name: A.Y.Kiky Shannon robot-owner-url: http://go.to/ayks robot-owner-email: [email protected] robot-status: inactive - open source robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: source, binary robot-exclusion: yes robot-exclusion-useragent: pegasus robot-noindex: yes robot-host: * robot-from: yes http://info.webcrawler.com/mak/projects/robots/active/all.txt (63 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-useragent: web robot PEGASUS robot-language: perl5 robot-description: pegasus gathers information from HTML pages (7 important tags). The indexing process can be started based on starting URL(s) or a range of IP address. robot-history: This robot was created as an implementation of a final project on Informatics Engineering Department, Institute of Technology Bandung, Indonesia. robot-environment: research modified-date: Fri, 20 Oct 2000 14:58:40 GMT modified-by: A.Y.Kiky Shannon robot-id: perignator robot-name: The Peregrinator robot-cover-url: http://www.maths.usyd.edu.au:8000/jimr/pe/Peregrinator.html robot-details-url: robot-owner-name: Jim Richardson robot-owner-url: http://www.maths.usyd.edu.au:8000/jimr.html robot-owner-email: [email protected] robot-status: robot-purpose: robot-type: robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: yes robot-useragent: Peregrinator-Mathematics/0.7 robot-language: perl 4 robot-description: This robot is being used to generate an index of documents on Web sites connected with mathematics and statistics. It ignores off-site links, so does not stray from a list of servers specified initially. 
robot-history: commenced operation in August 1994 robot-environment: modified-date: modified-by: robot-id: perlcrawler robot-name: PerlCrawler 1.0 robot-cover-url: http://perlsearch.hypermart.net/ robot-details-url: http://www.xav.com/scripts/xavatoria/index.html robot-owner-name: Matt McKenzie robot-owner-url: http://perlsearch.hypermart.net/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: perlcrawler robot-noindex: yes robot-host: server5.hypermart.net robot-from: yes robot-useragent: PerlCrawler/1.0 Xavatoria/2.0 robot-language: perl5 robot-description: The PerlCrawler robot is designed to index and build a database of pages relating to the Perl programming language. robot-history: Originated in modified form on 25 June 1998 robot-environment: hobby modified-date: Fri, 18 Dec 1998 23:37:40 GMT modified-by: Matt McKenzie http://info.webcrawler.com/mak/projects/robots/active/all.txt (64 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-id: phantom robot-name: Phantom robot-cover-url: http://www.maxum.com/phantom/ robot-details-url: robot-owner-name: Larry Burke robot-owner-url: http://www.aktiv.com/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: Macintosh robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: robot-from: yes robot-useragent: Duppies robot-language: robot-description: Designed to allow webmasters to provide a searchable index of their own site as well as to other sites, perhaps with similar content. robot-history: robot-environment: modified-date: Fri Jan 19 05:08:15 1996. modified-by: robot-id: piltdownman robot-name: PiltdownMan robot-cover-url: http://profitnet.bizland.com/ robot-details-url: http://profitnet.bizland.com/piltdownman.html robot-owner-name: Daniel Vilà robot-owner-url: http://profitnet.bizland.com/aboutus.html robot-owner-email: [email protected] robot-status: active robot-purpose: statistics robot-type: standalone robot-platform: windows95, windows98, windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: piltdownman robot-noindex: no robot-nofollow: no robot-host: 62.36.128.*, 194.133.59.*, 212.106.215.* robot-from: no robot-useragent: PiltdownMan/1.0 [email protected] robot-language: c++ robot-description: The PiltdownMan robot is used to get a list of links from the search engines in our database. These links are followed, and the page that they refer is downloaded to get some statistics from them. The robot runs once a month, more or less, and visits the first 10 pages listed in every search engine, for a group of keywords. robot-history: To maintain a database of search engines, we needed an automated tool. That's why we began the creation of this robot. robot-environment: service modified-date: Mon, 13 Dec 1999 21:50:32 GMT modified-by: Daniel Vilà robot-id: pioneer http://info.webcrawler.com/mak/projects/robots/active/all.txt (65 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-name: Pioneer robot-cover-url: http://sequent.uncfsu.edu/~micah/pioneer.html robot-details-url: robot-owner-name: Micah A. 
Williams robot-owner-url: http://sequent.uncfsu.edu/~micah/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: *.uncfsu.edu or flyer.ncsc.org robot-from: yes robot-useragent: Pioneer robot-language: C. robot-description: Pioneer is part of an undergraduate research project. robot-history: robot-environment: modified-date: Mon Feb 5 02:49:32 1996. modified-by: robot-id: pitkow robot-name: html_analyzer robot-cover-url: robot-details-url: robot-owner-name: James E. Pitkow robot-owner-url: robot-owner-email: [email protected] robot-status: robot-purpose: maintainance robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: robot-useragent: robot-language: robot-description: to check validity of Web servers. I'm not sure if it has ever been run remotely. robot-history: robot-environment: modified-date: modified-by: robot-id: pjspider robot-name: Portal Juice Spider robot-cover-url: http://www.portaljuice.com robot-details-url: http://www.portaljuice.com/pjspider.html robot-owner-name: Nextopia Software Corporation robot-owner-url: http://www.portaljuice.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing, statistics robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: pjspider http://info.webcrawler.com/mak/projects/robots/active/all.txt (66 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-noindex: yes robot-host: *.portaljuice.com, *.nextopia.com robot-from: yes robot-useragent: PortalJuice.com/4.0 robot-language: C/C++ robot-description: Indexing web documents for Portal Juice vertical portal search engine robot-history: Indexing the web since 1998 for the purposes of offering our commerical Portal Juice search engine services. robot-environment: service modified-date: Wed Jun 23 17:00:00 EST 1999 modified-by: [email protected] robot-id: pka robot-name: PGP Key Agent robot-cover-url: http://www.starnet.it/pgp robot-details-url: robot-owner-name: Massimiliano Pucciarelli robot-owner-url: http://www.starnet.it/puma robot-owner-email: [email protected] robot-status: Active robot-purpose: indexing robot-type: standalone robot-platform: UNIX, Windows NT robot-availability: none robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: salerno.starnet.it robot-from: yes robot-useragent: PGP-KA/1.2 robot-language: Perl 5 robot-description: This program search the pgp public key for the specified user. robot-history: Originated as a research project at Salerno University in 1995. robot-environment: Research modified-date: June 27 1996. modified-by: Massimiliano Pucciarelli robot-id: plumtreewebaccessor robot-name: PlumtreeWebAccessor robot-cover-url: robot-details-url: http://www.plumtree.com/ robot-owner-name: Joseph A. 
Stanko robot-owner-url: robot-owner-email: [email protected] robot-status: development robot-purpose: indexing for the Plumtree Server robot-type: standalone robot-platform: windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: PlumtreeWebAccessor robot-noindex: yes robot-host: robot-from: yes robot-useragent: PlumtreeWebAccessor/0.9 robot-language: c++ robot-description: The Plumtree Web Accessor is a component that customers can add to the Plumtree Server to index documents on the World Wide Web. robot-history: robot-environment: commercial modified-date: Thu, 17 Dec 1998 http://info.webcrawler.com/mak/projects/robots/active/all.txt (67 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt modified-by: Joseph A. Stanko <[email protected]> robot-id: poppi robot-name: Poppi robot-cover-url: http://members.tripod.com/poppisearch robot-details-url: http://members.tripod.com/poppisearch robot-owner-name: Antonio Provenzano robot-owner-url: Antonio Provenzano robot-owner-email: robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix/linux robot-availability: none robot-exclusion: robot-exclusion-useragent: robot-noindex: yes robot-host:=20 robot-from: robot-useragent: Poppi/1.0 robot-language: C robot-description: Poppi is a crawler to index the web that runs weekly gathering and indexing hypertextual, multimedia and executable file formats robot-history: Created by Antonio Provenzano in the april of 2000, has been acquired from Tomi Officine Multimediali srl and it is next to release as service and commercial robot-environment: service modified-date: Mon, 22 May 2000 15:47:30 GMT modified-by: Antonio Provenzano robot-id: portalb robot-name: PortalB Spider robot-cover-url: http://www.portalb.com/ robot-details-url: robot-owner-name: PortalB Spider Bug List robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: PortalBSpider robot-noindex: yes robot-nofollow: yes robot-host: spider1.portalb.com, spider2.portalb.com, etc. robot-from: no robot-useragent: PortalBSpider/1.0 ([email protected]) robot-language: C++ robot-description: The PortalB Spider indexes selected sites for high-quality business information. robot-history: robot-environment: service robot-id: Puu robot-name: GetterroboPlus Puu robot-details-url: http://marunaka.homing.net/straight/getter/ robot-cover-url: http://marunaka.homing.net/straight/ robot-owner-name: marunaka robot-owner-url: http://marunaka.homing.net robot-owner-email: [email protected] robot-status: active: robot actively in use robot-purpose: Purpose of the robot. One or more of: http://info.webcrawler.com/mak/projects/robots/active/all.txt (68 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt - gathering: gather data of original standerd TAG for Puu contains the information of the sites registered my Search Engin. - maintenance: link validation robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes (Puu patrols only registered url in my Search Engine) robot-exclusion-useragent: Getterrobo-Plus robot-noindex: no robot-host: straight FLASH!! Getterrobo-Plus, *.homing.net robot-from: yes robot-useragent: straight FLASH!! 
GetterroboPlus 1.5 robot-language: perl5 robot-description: Puu robot is used to gater data from registered site in Search Engin "straight FLASH!!" for building anouncement page of state of renewal of registered site in "straight FLASH!!". Robot runs everyday. robot-history: This robot patorols based registered sites in Search Engin "straight FLASH!!" robot-environment: hobby modified-date: Fri, 26 Jun 1998 robot-id: python robot-name: The Python Robot robot-cover-url: http://www.python.org/ robot-details-url: robot-owner-name: Guido van Rossum robot-owner-url: http://www.python.org/~guido/ robot-owner-email: [email protected] robot-status: retired robot-purpose: robot-type: robot-platform: robot-availability: none robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: robot-useragent: robot-language: robot-description: robot-history: robot-environment: modified-date: modified-by: robot-id: raven robot-name: Raven Search robot-cover-url: http://ravensearch.tripod.com robot-details-url: http://ravensearch.tripod.com robot-owner-name: Raven Group robot-owner-url: http://ravensearch.tripod.com robot-owner-email: [email protected] robot-status: Development: robot under development robot-purpose: Indexing: gather content for commercial query engine. robot-type: Standalone: a separate program robot-platform: Unix, Windows98, WindowsNT, Windows2000 robot-availability: None robot-exclusion: Yes robot-exclusion-useragent: Raven robot-noindex: Yes robot-nofollow: Yes robot-host: 192.168.1.* http://info.webcrawler.com/mak/projects/robots/active/all.txt (69 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-from: Yes robot-useragent: Raven-v2 robot-language: Perl-5 robot-description: Raven was written for the express purpose of indexing the web. It can parallel process hundreds of URLS's at a time. It runs on a sporadic basis as testing continues. It is really several programs running concurrently. It takes four computers to run Raven Search. Scalable in sets of four. robot-history: This robot is new. First active on March 25, 2000. robot-environment: Commercial: is a commercial product. Possibly GNU later ;-) modified-date: Fri, 25 Mar 2000 17:28:52 GMT modified-by: Raven Group robot-id: rbse robot-name: RBSE Spider robot-cover-url: http://rbse.jsc.nasa.gov/eichmann/urlsearch.html robot-details-url: robot-owner-name: David Eichmann robot-owner-url: http://rbse.jsc.nasa.gov/eichmann/home.html robot-owner-email: [email protected] robot-status: active robot-purpose: indexing, statistics robot-type: robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: rbse.jsc.nasa.gov (192.88.42.10) robot-from: robot-useragent: robot-language: C, oracle, wais robot-description: Developed and operated as part of the NASA-funded Repository Based Software Engineering Program at the Research Institute for Computing and Information Systems, University of Houston - Clear Lake. robot-history: robot-environment: modified-date: Thu May 18 04:47:02 1995 modified-by: robot-id: resumerobot robot-name: Resume Robot robot-cover-url: http://www.onramp.net/proquest/resume/robot/robot.html robot-details-url: robot-owner-name: James Stakelum robot-owner-url: http://www.onramp.net/proquest/resume/java/resume.html robot-owner-email: [email protected] robot-status: robot-purpose: indexing. 
robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: robot-from: yes robot-useragent: Resume Robot robot-language: C++. robot-description: robot-history: robot-environment: modified-date: Tue Mar 12 15:52:25 1996. modified-by: http://info.webcrawler.com/mak/projects/robots/active/all.txt (70 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-id: rhcs robot-name: RoadHouse Crawling System robot-cover-url: http://stage.perceval.be (under developpement) robot-details-url: robot-owner-name: Gregoire Welraeds, Emmanuel Bergmans robot-owner-url: http://www.perceval.be robot-owner-email: [email protected] robot-status: development robot-purpose1: indexing robot-purpose2: maintenance robot-purpose3: statistics robot-type: standalone robot-platform1: unix (FreeBSD & Linux) robot-availability: none robot-exclusion: no (under development) robot-exclusion-useragent: RHCS robot-noindex: no (under development) robot-host: stage.perceval.be robot-from: no robot-useragent: RHCS/1.0a robot-language: c robot-description: robot used tp build the database for the RoadHouse search service project operated by Perceval robot-history: The need of this robot find its roots in the actual RoadHouse directory not maintenained since 1997 robot-environment: service modified-date: Fri, 26 Feb 1999 12:00:00 GMT modified-by: Gregoire Welraeds robot-id: roadrunner robot-name: Road Runner: The ImageScape Robot robot-owner-name: LIM Group robot-owner-email: [email protected] robot-status: development/active robot-purpose: indexing robot-type: standalone robot-platform: UNIX robot-exclusion: yes robot-exclusion-useragent: roadrunner robot-useragent: Road Runner: ImageScape Robot ([email protected]) robot-language: C, perl5 robot-description: Create Image/Text index for WWW robot-history: ImageScape Project robot-environment: commercial service modified-date: Dec. 1st, 1996 robot-id: robbie robot-name: Robbie the Robot robot-cover-url: robot-details-url: robot-owner-name: Robert H. Pollack robot-owner-url: robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix, windows95, windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Robbie robot-noindex: no robot-host: *.lmco.com robot-from: yes robot-useragent: Robbie/0.1 robot-language: java http://info.webcrawler.com/mak/projects/robots/active/all.txt (71 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-description: Used to define document collections for the DISCO system. Robbie is still under development and runs several times a day, but usually only for ten minutes or so. Sites are visited in the order in which references are found, but no host is visited more than once in any two-minute period. robot-history: The DISCO system is a resource-discovery component in the OLLA system, which is a prototype system, developed under DARPA funding, to support computer-based education and training. robot-environment: research modified-date: Wed, 5 Feb 1997 19:00:00 GMT modified-by: robot-id: robi robot-name: ComputingSite Robi/1.0 robot-cover-url: http://www.computingsite.com/robi/ robot-details-url: http://www.computingsite.com/robi/ robot-owner-name: Tecor Communications S.L. 
robot-owner-url: http://www.tecor.com/ robot-owner-email: [email protected] robot-status: Active robot-purpose: indexing,maintenance robot-type: standalone robot-platform: UNIX robot-availability: robot-exclusion: yes robot-exclusion-useragent: robi robot-noindex: no robot-host: robi.computingsite.com robot-from: robot-useragent: ComputingSite Robi/1.0 ([email protected]) robot-language: python robot-description: Intelligent agent used to build the ComputingSite Search Directory. robot-history: It was born in August 1997. robot-environment: service modified-date: Wed, 13 May 1998 17:28:52 GMT modified-by: Jorge Alegre robot-id: robozilla robot-name: Robozilla robot-cover-url: http://dmoz.org/ robot-details-url: http://www.dmoz.org/newsletter/2000Aug/robo.html robot-owner-name: "Rob O'Zilla" robot-owner-url: http://dmoz.org/profiles/robozilla.html robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-availability: none robot-exclusion: no robot-noindex: no robot-host: directory.mozilla.org robot-useragent: Robozilla/1.0 robot-description: Robozilla visits all the links within the Open Directory periodically, marking the ones that return errors for review. robot-environment: service robot-id: roverbot robot-name: Roverbot robot-cover-url: http://www.roverbot.com/ robot-details-url: robot-owner-name: GlobalMedia Design (Andrew Cowan & Brian Clark) http://info.webcrawler.com/mak/projects/robots/active/all.txt (72 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-url: http://www.radzone.org/gmd/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: roverbot.com robot-from: yes robot-useragent: Roverbot robot-language: perl5 robot-description: Targeted email gatherer utilizing user-defined seed points and interacting with both the webserver and MX servers of remote sites. robot-history: robot-environment: modified-date: Tue Jun 18 19:16:31 1996. modified-by: robot-id: safetynetrobot robot-name: SafetyNet Robot robot-cover-url: http://www.urlabs.com/ robot-details-url: robot-owner-name: Michael L. Nelson robot-owner-url: http://www.urlabs.com/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing. robot-type: standalone robot-platform: robot-availability: robot-exclusion: no. robot-exclusion-useragent: robot-noindex: robot-host: *.urlabs.com robot-from: yes robot-useragent: SafetyNet Robot 0.1, robot-language: Perl 5 robot-description: Finds URLs for K-12 content management. robot-history: robot-environment: modified-date: Sat Mar 23 20:12:39 1996. modified-by: robot-id: scooter robot-name: Scooter robot-cover-url: http://www.altavista.com/ robot-details-url: http://www.altavista.com/av/content/addurl.htm robot-owner-name: AltaVista robot-owner-url: http://www.altavista.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Scooter robot-noindex: yes robot-host: *.av.pa-x.dec.com robot-from: yes http://info.webcrawler.com/mak/projects/robots/active/all.txt (73 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-useragent: Scooter/2.0 G.R.A.B.
V1.1.0 robot-language: c robot-description: Scooter is AltaVista's prime index agent. robot-history: Version 2 of Scooter/1.0 developed by Louis Monier of WRL. robot-environment: service modified-date: Wed, 13 Jan 1999 17:18:59 GMT modified-by: [email protected] robot-id: search_au robot-name: Search.Aus-AU.COM robot-details-url: http://Search.Aus-AU.COM/ robot-cover-url: http://Search.Aus-AU.COM/ robot-owner-name: Dez Blanchfield robot-owner-url: not currently available robot-owner-email: [email protected] robot-status: - development: robot under development robot-purpose: - indexing: gather content for an indexing service robot-type: - standalone: a separate program robot-platform: - mac - unix - windows95 - windowsNT robot-availability: - none robot-exclusion: yes robot-exclusion-useragent: Search-AU robot-noindex: yes robot-host: Search.Aus-AU.COM, 203.55.124.29, 203.2.239.29 robot-from: no robot-useragent: not available robot-language: c, perl, sql robot-description: Search-AU is a development tool I have built to investigate the power of a search engine and web crawler to give me access to a database of web content ( html / url's ) and address's etc from which I hope to build more accurate stats about the .au zone's web content. the robot started crawling from http://www.geko.net.au/ on march 1st, 1998 and after nine days had 70mb of compressed ascii in a database to work with. i hope to run a refresh of the crawl every month initially, and soon every week bandwidth and cpu allowing. if the project warrants further development, i will turn it into an australian ( .au ) zone search engine and make it commercially available for advertising to cover the costs which are starting to mount up. --dez (980313 - black friday!) robot-environment: - hobby: written as a hobby modified-date: Fri Mar 13 10:03:32 EST 1998 robot-id: searchprocess robot-name: SearchProcess robot-cover-url: http://www.searchprocess.com robot-details-url: http://www.intelligence-process.com robot-owner-name: Mannina Bruno robot-owner-url: http://www.intelligence-process.com robot-owner-email: [email protected] robot-status: active robot-purpose: Statistic robot-type: browser robot-platform: linux robot-availability: none robot-exclusion: yes robot-exclusion-useragent: searchprocess robot-noindex: yes robot-host: searchprocess.com robot-from: yes robot-useragent: searchprocess/0.9 robot-language: perl robot-description: An intelligent Agent Online. SearchProcess is used to provide structured information to user. robot-history: This is the son of Auresys http://info.webcrawler.com/mak/projects/robots/active/all.txt (74 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-environment: Service freeware modified-date: Thus, 22 Dec 1999 modified-by: Mannina Bruno robot-id: senrigan robot-name: Senrigan robot-cover-url: http://www.info.waseda.ac.jp/search-e.html robot-details-url: robot-owner-name: TAMURA Kent robot-owner-url: http://www.info.waseda.ac.jp/muraoka/members/kent/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: Java robot-availability: none robot-exclusion: yes robot-exclusion-useragent:Senrigan robot-noindex: yes robot-host: aniki.olu.info.waseda.ac.jp robot-from: yes robot-useragent: Senrigan/xxxxxx robot-language: Java robot-description: This robot now gets HTMLs from only jp domain. 
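The robot-useragent values recorded in these entries are what a robot sends in the User-agent header of each request, so they are also what shows up in a server's access log. A short sketch of spotting such visits, assuming a log in the common "combined" format and a file named access.log (both assumptions), with tokens taken from entries in this list:

    import re

    # User-agent substrings taken from robot-useragent fields in this list.
    ROBOT_TOKENS = ("Scooter", "Senrigan", "MOMspider", "LinkWalker", "Slurp")

    # In the combined log format the user-agent is the last quoted field on the line.
    ua_pattern = re.compile(r'"([^"]*)"\s*$')

    with open("access.log") as log:
        for line in log:
            m = ua_pattern.search(line)
            if m and any(token in m.group(1) for token in ROBOT_TOKENS):
                print(line.rstrip())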
robot-history: It has been running since Dec 1994 robot-environment: research modified-date: Mon Jul 1 07:30:00 GMT 1996 modified-by: TAMURA Kent robot-id: sgscout robot-name: SG-Scout robot-cover-url: http://www-swiss.ai.mit.edu/~ptbb/SG-Scout/SG-Scout.html robot-details-url: robot-owner-name: Peter Beebee robot-owner-url: http://www-swiss.ai.mit.edu/~ptbb/personal/index.html robot-owner-email: [email protected], [email protected] robot-status: active robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: beta.xerox.com robot-from: yes robot-useragent: SG-Scout robot-language: robot-description: Does a "server-oriented" breadth-first search in a round-robin fashion, with multiple processes. robot-history: Run since 27 June 1994, for an internal XEROX research project robot-environment: modified-date: modified-by: robot-id:shaggy robot-name:ShagSeeker robot-cover-url:http://www.shagseek.com robot-details-url: robot-owner-name:Joseph Reynolds robot-owner-url:http://www.shagseek.com robot-owner-email:[email protected] robot-status:active robot-purpose:indexing http://info.webcrawler.com/mak/projects/robots/active/all.txt (75 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-type:standalone robot-platform:unix robot-availability:data robot-exclusion:yes robot-exclusion-useragent:Shagseeker robot-noindex:yes robot-host:shagseek.com robot-from: robot-useragent:Shagseeker at http://www.shagseek.com /1.0 robot-language:perl5 robot-description:Shagseeker is the gatherer for the Shagseek.com search engine and goes out weekly. robot-history:none yet robot-environment:service modified-date:Mon 17 Jan 2000 10:00:00 EST modified-by:Joseph Reynolds robot-id: shaihulud robot-name: Shai'Hulud robot-cover-url: robot-details-url: robot-owner-name: Dimitri Khaoustov robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: mirroring robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: *.rdtex.ru robot-from: robot-useragent: Shai'Hulud robot-language: C robot-description: Used to build mirrors for internal use robot-history: This robot finds its roots in a research project at RDTeX Perspective Projects Group in 1996 robot-environment: research modified-date: Mon, 5 Aug 1996 14:35:08 GMT modified-by: Dimitri Khaoustov robot-id: sift robot-name: Sift robot-cover-url: http://www.worthy.com/ robot-details-url: http://www.worthy.com/ robot-owner-name: Bob Worthy robot-owner-url: http://www.worthy.com/~bworthy robot-owner-email: [email protected] robot-status: development, active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: data robot-exclusion: yes robot-exclusion-useragent: sift robot-noindex: yes robot-host: www.worthy.com robot-from: robot-useragent: libwww-perl-5.41 robot-language: perl robot-description: Subject directed (via key phrase list) indexing. robot-history: Libwww of course, implementation using MySQL August, 1999. Indexing Search and Rescue sites. 
http://info.webcrawler.com/mak/projects/robots/active/all.txt (76 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-environment: research, service modified-date: Sat, 16 Oct 1999 19:40:00 GMT modified-by: Bob Worthy robot-id: simbot robot-name: Simmany Robot Ver1.0 robot-cover-url: http://simmany.hnc.net/ robot-details-url: http://simmany.hnc.net/irman1.html robot-owner-name: Youngsik, Lee(@L?5=D) robot-owner-url: robot-owner-email: [email protected] robot-status: development & active robot-purpose: indexing, maintenance, statistics robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: SimBot robot-noindex: no robot-host: sansam.hnc.net robot-from: no robot-useragent: SimBot/1.0 robot-language: C robot-description: The Simmany Robot is used to build the Map(DB) for the simmany service operated by HNC(Hangul & Computer Co., Ltd.). The robot runs weekly, and visits sites that have a useful korean information in a defined order. robot-history: This robot is a part of simmany service and simmini products. The simmini is the Web products that make use of the indexing and retrieving modules of simmany. robot-environment: service, commercial modified-date: Thu, 19 Sep 1996 07:02:26 GMT modified-by: Youngsik, Lee robot-id: site-valet robot-name: Site Valet robot-cover-url: http://valet.webthing.com/ robot-details-url: http://valet.webthing.com/ robot-owner-name: Nick Kew robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: unix robot-availability: data robot-exclusion: yes robot-exclusion-useragent: Site Valet robot-noindex: no robot-host: valet.webthing.com,valet.* robot-from: yes robot-useragent: Site Valet robot-language: perl robot-description: a deluxe site monitoring and analysis service robot-history: builds on cg-eye, the WDG Validator, and the Link Valet robot-environment: service modified-date: Tue, 27 June 2000 modified-by: [email protected] robot-id: sitegrabber robot-name: Open Text Index Robot robot-cover-url: http://index.opentext.net/main/faq.html robot-details-url: http://index.opentext.net/OTI_Robot.html robot-owner-name: John Faichney robot-owner-url: http://info.webcrawler.com/mak/projects/robots/active/all.txt (77 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: UNIX robot-availability: inquire to [email protected] (Mark Kraatz) robot-exclusion: yes robot-exclusion-useragent: Open Text Site Crawler robot-noindex: no robot-host: *.opentext.com robot-from: yes robot-useragent: Open Text Site Crawler V1.0 robot-language: perl/C robot-description: This robot is run by Open Text Corporation to produce the data for the Open Text Index robot-history: Started in May/95 to replace existing Open Text robot which was based on libwww robot-environment: commercial modified-date: Fri Jul 25 11:46:56 EDT 1997 modified-by: John Faichney robot-id: sitetech robot-name: SiteTech-Rover robot-cover-url: http://www.sitetech.com/ robot-details-url: robot-owner-name: Anil Peres-da-Silva robot-owner-url: http://www.sitetech.com robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: robot-from: 
yes robot-useragent: SiteTech-Rover robot-language: C++. robot-description: Originated as part of a suite of Internet Products to organize, search & navigate Intranet sites and to validate links in HTML documents. robot-history: This robot originally went by the name of LiberTech-Rover robot-environment: modified-date: Fri Aug 9 17:06:56 1996. modified-by: Anil Peres-da-Silva robot-id:slcrawler robot-name:SLCrawler robot-cover-url: robot-details-url: robot-owner-name:Inxight Software robot-owner-url:http://www.inxight.com robot-owner-email:[email protected] robot-status:active robot-purpose:To build the site map. robot-type:standalone robot-platform:windows, windows95, windowsNT robot-availability:none robot-exclusion:yes robot-exclusion-useragent:SLCrawler/2.0 robot-noindex:no robot-host:n/a robot-from: http://info.webcrawler.com/mak/projects/robots/active/all.txt (78 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-useragent:SLCrawler robot-language:Java robot-description:To build the site map. robot-history:It is SLCrawler to crawl html page on Internet. robot-environment: commercial: is a commercial product modified-date:Nov. 15, 2000 modified-by:Karen Ng robot-id: slurp robot-name: Inktomi Slurp robot-cover-url: http://www.inktomi.com/ robot-details-url: http://www.inktomi.com/slurp.html robot-owner-name: Inktomi Corporation robot-owner-url: http://www.inktomi.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing, statistics robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: slurp robot-noindex: yes robot-host: *.inktomi.com robot-from: yes robot-useragent: Slurp/2.0 robot-language: C/C++ robot-description: Indexing documents for the HotBot search engine (www.hotbot.com), collecting Web statistics robot-history: Switch from Slurp/1.0 to Slurp/2.0 November 1996 robot-environment: service modified-date: Fri Feb 28 13:57:43 PST 1997 modified-by: [email protected] robot-id: smartspider robot-name: Smart Spider robot-cover-url: http://www.travel-finder.com robot-details-url: http://www.engsoftware.com/robots.htm robot-owner-name: Ken Wadland robot-owner-url: http://www.engsoftware.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: windows95, windowsNT robot-availability: data, binary, source robot-exclusion: Yes robot-exclusion-useragent: ESI robot-noindex: Yes robot-host: 207.16.241.* robot-from: Yes robot-useragent: ESISmartSpider/2.0 robot-language: C++ robot-description: Classifies sites using a Knowledge Base. Robot collects web pages which are then parsed and feed to the Knowledge Base. The Knowledge Base classifies the sites into any of hundreds of categories based on the vocabulary used. Currently used by: //www.travel-finder.com (Travel and Tourist Info) and //www.golightway.com (Christian Sites). Several options exist to control whether sites are discovered and/or classified fully automatically, full manually or somewhere in between. robot-history: Feb '96 -- Product design begun. May '96 -- First data results published by Travel-Finder. Oct '96 -- Generalized and announced and a product for other sites. Jan '97 -- First data results published by GoLightWay. 
robot-environment: service, commercial http://info.webcrawler.com/mak/projects/robots/active/all.txt (79 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt modified-date: Mon, 13 Jan 1997 10:41:00 EST modified-by: Ken Wadland robot-id: snooper robot-name: Snooper robot-cover-url: http://darsun.sit.qc.ca robot-details-url: robot-owner-name: Isabelle A. Melnick robot-owner-url: robot-owner-email: [email protected] robot-status: part under development and part active robot-purpose: robot-type: robot-platform: robot-availability: none robot-exclusion: yes robot-exclusion-useragent: snooper robot-noindex: robot-host: robot-from: robot-useragent: Snooper/b97_01 robot-language: robot-description: robot-history: robot-environment: modified-date: modified-by: robot-id: solbot robot-name: Solbot robot-cover-url: http://kvasir.sol.no/ robot-details-url: robot-owner-name: Frank Tore Johansen robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: solbot robot-noindex: yes robot-host: robot*.sol.no robot-from: robot-useragent: Solbot/1.0 LWP/5.07 robot-language: perl, c robot-description: Builds data for the Kvasir search service. Only searches sites which ends with one of the following domains: "no", "se", "dk", "is", "fi"robot-history: This robot is the result of a 3 years old late night hack when the Verity robot (of that time) was unable to index sites with iso8859 characters (in URL and other places), and we just _had_ to have something up and going the next day... robot-environment: service modified-date: Tue Apr 7 16:25:05 MET DST 1998 modified-by: Frank Tore Johansen <[email protected]> robot-id: spanner robot-name: Spanner robot-cover-url: http://www.kluge.net/NES/spanner/ robot-details-url: http://www.kluge.net/NES/spanner/ robot-owner-name: Theo Van Dinter robot-owner-url: http://www.kluge.net/~felicity/ robot-owner-email: [email protected] robot-status: development http://info.webcrawler.com/mak/projects/robots/active/all.txt (80 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-purpose: indexing,maintenance robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: Spanner robot-noindex: yes robot-host: *.kluge.net robot-from: yes robot-useragent: Spanner/1.0 (Linux 2.0.27 i586) robot-language: perl robot-description: Used to index/check links on an intranet. robot-history: Pet project of the author since beginning of 1996. 
robot-environment: hobby modified-date: Mon, 06 Jan 1997 00:00:00 GMT modified-by: [email protected] robot-id:speedy robot-name:Speedy Spider robot-cover-url:http://www.entireweb.com/ robot-details-url:http://www.entireweb.com/speedy.html robot-owner-name:WorldLight.com AB robot-owner-url:http://www.worldlight.com robot-owner-email:[email protected] robot-status:active robot-purpose:indexing robot-type:standalone robot-platform:Windows robot-availability:none robot-exclusion:yes robot-exclusion-useragent:speedy robot-noindex:yes robot-host:router-00.sverige.net, 193.15.210.29, *.entireweb.com, *.worldlight.com robot-from:yes robot-useragent:Speedy Spider ( http://www.entireweb.com/speedy.html ) robot-language:C, C++ robot-description:Speedy Spider is used to build the database for the Entireweb.com search service operated by WorldLight.com (part of WorldLight Network). The robot runs constantly, and visits sites in a random order. robot-history:This robot is a part of the highly advanced search engine Entireweb.com, that was developed in Halmstad, Sweden during 1998-2000. robot-environment:service, commercial modified-date:Mon, 17 July 2000 11:05:03 GMT modified-by:Marcus Andersson robot-id: spider_monkey robot-name: spider_monkey robot-cover-url: http://www.mobrien.com/add_site.html robot-details-url: http://www.mobrien.com/add_site.html robot-owner-name: MPRM Group Limited robot-owner-url: http://www.mobrien.com robot-owner-email: [email protected] robot-status: robot actively in use robot-purpose: gather content for a free indexing service robot-type: FDSE robot robot-platform: unix robot-availability: bulk data gathered by robot available robot-exclusion: yes robot-exclusion-useragent: spider_monkey robot-noindex: yes robot-host: snowball.ionsys.com robot-from: yes robot-useragent: mouse.house/7.1 http://info.webcrawler.com/mak/projects/robots/active/all.txt (81 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-language: perl5 robot-description: Robot runs every 30 days for a full index and weekly = on a list of accumulated visitor requests robot-history: This robot is under development and currently active robot-environment: written as an employee / guest service modified-date: Mon, 22 May 2000 12:28:52 GMT modified-by: MPRM Group Limited robot-id: spiderbot robot-name: SpiderBot robot-cover-url: http://pisuerga.inf.ubu.es/lsi/Docencia/TFC/ITIG/icruzadn/cover.htm robot-details-url: http://pisuerga.inf.ubu.es/lsi/Docencia/TFC/ITIG/icruzadn/details.htm robot-owner-name: Ignacio Cruzado Nu.o robot-owner-url: http://pisuerga.inf.ubu.es/lsi/Docencia/TFC/ITIG/icruzadn/icruzadn.htm robot-owner-email: [email protected] robot-status: active robot-purpose: indexing, mirroring robot-type: standalone, browser robot-platform: unix, windows, windows95, windowsNT robot-availability: source, binary, data robot-exclusion: yes robot-exclusion-useragent: SpiderBot/1.0 robot-noindex: yes robot-host: * robot-from: yes robot-useragent: SpiderBot/1.0 robot-language: C++, Tcl robot-description: Recovers Web Pages and saves them on your hard disk. Then it reindexes them. robot-history: This Robot belongs to Ignacio Cruzado Nu.o End of Studies Thesis "Recuperador p.ginas Web", to get the titulation of "Management Tecnical Informatics Engineer" in the for the Burgos University in Spain. 
robot-environment: research modified-date: Sun, 27 Jun 1999 09:00:00 GMT modified-by: Ignacio Cruzado Nu.o robot-id:spiderman robot-name:SpiderMan robot-cover-url:http://www.comp.nus.edu.sg/~leunghok robot-details-url:http://www.comp.nus.edu.sg/~leunghok/honproj.html robot-owner-name:Leung Hok Peng , The School Of Computing Nus , Singapore robot-owner-url:http://www.comp.nus.edu.sg/~leunghok robot-owner-email:[email protected] robot-status:development & active robot-purpose:user searching using IR technique robot-type:stand alone robot-platform:Java 1.2 robot-availability:binary&source robot-exclusion:no robot-exclusion-useragent:nil robot-noindex:no robot-host:NA robot-from:NA robot-useragent:SpiderMan 1.0 robot-language:java robot-description:It is used for any user to search the web given a query string robot-history:Originated from The Center for Natural Product Research and The School of computing National University Of Singapore robot-environment:research modified-date:08/08/1999 modified-by:Leung Hok Peng and Dr Hsu Wynne robot-id: spiderview robot-name: SpiderView(tm) http://info.webcrawler.com/mak/projects/robots/active/all.txt (82 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-cover-url: http://www.northernwebs.com/set/spider_view.html robot-details-url: http://www.northernwebs.com/set/spider_sales.html robot-owner-name: Northern Webs robot-owner-url: http://www.northernwebs.com robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: unix, nt robot-availability: source robot-exclusion: no robot-exclusion-useragent: robot-noindex: robot-host: bobmin.quad2.iuinc.com, * robot-from: No robot-useragent: Mozilla/4.0 (compatible; SpiderView 1.0;unix) robot-language: perl robot-description: SpiderView is a server based program which can spider a webpage, testing the links found on the page, evaluating your server and its performance. robot-history: This is an offshoot http retrieval program based on our Medibot software. robot-environment: commercial modified-date: modified-by: robot-id: spry robot-name: Spry Wizard Robot robot-cover-url: http://www.spry.com/wizard/index.html robot-details-url: robot-owner-name: spry robot-owner-url: ttp://www.spry.com/index.html robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: robot-host: wizard.spry.com or tiger.spry.com robot-from: no robot-useragent: no robot-language: robot-description: Its purpose is to generate a Resource Discovery database Spry is refusing to give any comments about this robot robot-history: robot-environment: modified-date: Tue Jul 11 09:29:45 GMT 1995 modified-by: robot-id: ssearcher robot-name: Site Searcher robot-cover-url: www.satacoy.com robot-details-url: www.satacoy.com robot-owner-name: Zackware robot-owner-url: www.satacoy.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: winows95, windows98, windowsNT robot-availability: binary http://info.webcrawler.com/mak/projects/robots/active/all.txt (83 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: no robot-useragent: ssearcher100 robot-language: C++ robot-description: Site Searcher scans web sites for specific file types. 
(JPG, MP3, MPG, etc) robot-history: Released 4/4/1999 robot-environment: hobby modified-date: 04/26/1999 robot-id: suke robot-name: Suke robot-cover-url: http://www.kensaku.org/ robot-details-url: http://www.kensaku.org/ robot-owner-name: Yosuke Kuroda robot-owner-url: http://www.kensaku.org/yk/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: FreeBSD3.* robot-availability: source robot-exclusion: yes robot-exclusion-useragent: suke robot-noindex: no robot-host: * robot-from: yes robot-useragent: suke/*.* robot-language: c robot-description: This robot visits mainly sites in japan. robot-history: since 1999 robot-environment: service robot-id: suntek robot-name: suntek search engine robot-cover-url: http://www.portal.com.hk/ robot-details-url: http://www.suntek.com.hk/ robot-owner-name: Suntek Computer Systems robot-owner-url: http://www.suntek.com.hk/ robot-owner-email: [email protected] robot-status: operational robot-purpose: to create a search portal on Asian web sites robot-type: robot-platform: NT, Linux, UNIX robot-availability: available now robot-exclusion: robot-exclusion-useragent: robot-noindex: yes robot-host: search.suntek.com.hk robot-from: yes robot-useragent: suntek/1.0 robot-language: Java robot-description: A multilingual search engine with emphasis on Asia contents robot-history: robot-environment: modified-date: modified-by: robot-id: sven robot-name: Sven robot-cover-url: robot-details-url: http://marty.weathercity.com/sven/ http://info.webcrawler.com/mak/projects/robots/active/all.txt (84 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-name: Marty Anstey robot-owner-url: http://marty.weathercity.com/ robot-owner-email: [email protected] robot-status: Active robot-purpose: indexing robot-type: standalone robot-platform: Windows robot-availability: none robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: 24.113.12.29 robot-from: no robot-useragent: robot-language: VB5 robot-description: Used to gather sites for netbreach.com. Runs constantly. robot-history: Developed as an experiment in web indexing. 
robot-environment: hobby, service modified-date: Tue, 3 Mar 1999 08:15:00 PST modified-by: Marty Anstey robot-id: tach_bw robot-name: TACH Black Widow robot-cover-url: http://theautochannel.com/~mjenn/bw.html robot-details-url: http://theautochannel.com/~mjenn/bw-syntax.html robot-owner-name: Michael Jennings robot-owner-url: http://www.spd.louisville.edu/~mejenn01/ robot-owner-email: [email protected] robot-status: development robot-purpose: maintenance: link validation robot-type: standalone robot-platform: UNIX, Linux robot-availability: none robot-exclusion: yes robot-exclusion-useragent: tach_bw robot-noindex: no robot-host: *.theautochannel.com robot-from: yes robot-useragent: Mozilla/3.0 (Black Widow v1.1.0; Linux 2.0.27; Dec 31 1997 12:25:00 robot-language: C/C++ robot-description: Exhaustively recurses a single site to check for broken links robot-history: Corporate application begun in 1996 for The Auto Channel robot-environment: commercial modified-date: Thu, Jan 23 1997 23:09:00 GMT modified-by: Michael Jennings robot-id:tarantula robot-name: Tarantula robot-cover-url: http://www.nathan.de/nathan/software.html#TARANTULA robot-details-url: http://www.nathan.de/ robot-owner-name: Markus Hoevener robot-owner-url: robot-owner-email: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: yes robot-noindex: yes robot-host: yes robot-from: no robot-useragent: Tarantula/1.0 robot-language: C http://info.webcrawler.com/mak/projects/robots/active/all.txt (85 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-description: Tarantual gathers information for german search engine Nathanrobot-history: Started February 1997 robot-environment: service modified-date: Mon, 29 Dec 1997 15:30:00 GMT modified-by: Markus Hoevener robot-id: tarspider robot-name: tarspider robot-cover-url: robot-details-url: robot-owner-name: Olaf Schreck robot-owner-url: http://www.chemie.fu-berlin.de/user/chakl/ChaklHome.html robot-owner-email: [email protected] robot-status: robot-purpose: mirroring robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: [email protected] robot-useragent: tarspider robot-language: robot-description: robot-history: robot-environment: modified-date: modified-by: robot-id: tcl robot-name: Tcl W3 Robot robot-cover-url: http://hplyot.obspm.fr/~dl/robo.html robot-details-url: robot-owner-name: Laurent Demailly robot-owner-url: http://hplyot.obspm.fr/~dl/ robot-owner-email: [email protected] robot-status: robot-purpose: maintenance, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: hplyot.obspm.fr robot-from: yes robot-useragent: dlw3robot/x.y (in TclX by http://hplyot.obspm.fr/~dl/) robot-language: tcl robot-description: Its purpose is to validate links, and generate statistics. 
robot-history: robot-environment: modified-date: Tue May 23 17:51:39 1995 modified-by: robot-id: techbot robot-name: TechBOT robot-cover-url: http://www.techaid.net/ robot-details-url: http://www.echaid.net/TechBOT/ robot-owner-name: TechAID Internet Services robot-owner-url: http://www.techaid.net/ robot-owner-email: [email protected] robot-status: active http://info.webcrawler.com/mak/projects/robots/active/all.txt (86 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-purpose:statistics, maintenance robot-type: standalone robot-platform: Unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: TechBOT robot-noindex: yes robot-host: techaid.net robot-from: yes robot-useragent: TechBOT robot-language: perl5 robot-description: TechBOT is constantly upgraded. Currently he is used for Link Validation, Load Time, HTML Validation and much much more. robot-history: TechBOT started his life as a Page Change Detection robot, but has taken on many new and exciting roles. robot-environment: service modified-date: Sat, 18 Dec 1998 14:26:00 EST modified-by: [email protected] robot-id: templeton robot-name: Templeton robot-cover-url: http://www.bmtmicro.com/catalog/tton/ robot-details-url: http://www.bmtmicro.com/catalog/tton/ robot-owner-name: Neal Krawetz robot-owner-url: http://www.cs.tamu.edu/people/nealk/ robot-owner-email: [email protected] robot-status: active robot-purpose: mirroring, mapping, automating web applications robot-type: standalone robot-platform: OS/2, Linux, SunOS, Solaris robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: templeton robot-noindex: no robot-host: * robot-from: yes robot-useragent: Templeton/{version} for {platform} robot-language: C robot-description: Templeton is a very configurable robots for mirroring, mapping, and automating applications on retrieved documents. robot-history: This robot was originally created as a test-of-concept. robot-environment: service, commercial, research, hobby modified-date: Sun, 6 Apr 1997 10:00:00 GMT modified-by: Neal Krawetz robot-id:teoma_agent1 robot-name:TeomaTechnologies robot-cover-url:http://www.teoma.com/ robot-details-url: robot-owner-name: robot-owner-url: robot-owner-email:[email protected] robot-status:active robot-purpose: robot-type: robot-platform: robot-availability:none robot-exclusion:no robot-exclusion-useragent: robot-noindex:unknown robot-host:63.236.92.145 robot-from: robot-useragent:teoma_agent1 [[email protected]] robot-language:unknown robot-description:Unknown robot visiting pages and tacking "%09182837231" or http://info.webcrawler.com/mak/projects/robots/active/all.txt (87 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt somesuch onto the ends of URL's. 
robot-history:unknown robot-environment:unknown modified-date:Thu, 04 Jan 2001 09:05:00 PDT modified-by: kph robot-id: titin robot-name: TitIn robot-cover-url: http://www.foi.hr/~dpavlin/titin/ robot-details-url: http://www.foi.hr/~dpavlin/titin/tehnical.htm robot-owner-name: Dobrica Pavlinusic robot-owner-url: http://www.foi.hr/~dpavlin/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing, statistics robot-type: standalone robot-platform: unix robot-availability: data, source on request robot-exclusion: yes robot-exclusion-useragent: titin robot-noindex: no robot-host: barok.foi.hr robot-from: no robot-useragent: TitIn/0.2 robot-language: perl5, c robot-description: The TitIn is used to index all titles of Web servers in the .hr domain. robot-history: It was done as a result of a desperate need for a central index of Croatian web servers in December 1996. robot-environment: research modified-date: Thu, 12 Dec 1996 16:06:42 MET modified-by: Dobrica Pavlinusic robot-id: titan robot-name: TITAN robot-cover-url: http://isserv.tas.ntt.jp/chisho/titan-e.html robot-details-url: http://isserv.tas.ntt.jp/chisho/titan-help/eng/titan-help-e.html robot-owner-name: Yoshihiko HAYASHI robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: SunOS 4.1.4 robot-availability: no robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: nlptitan.isl.ntt.jp robot-from: yes robot-useragent: TITAN/0.1 robot-language: perl 4 robot-description: Its purpose is to generate a Resource Discovery database, and copy document trees. Our primary goal is to develop an advanced method for indexing the WWW documents. Uses libwww-perl. robot-history: robot-environment: modified-date: Mon Jun 24 17:20:44 PDT 1996 modified-by: Yoshihiko HAYASHI robot-id: tkwww robot-name: The TkWWW Robot robot-cover-url: http://fang.cs.sunyit.edu/Robots/tkwww.html http://info.webcrawler.com/mak/projects/robots/active/all.txt (88 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-details-url: robot-owner-name: Scott Spetka robot-owner-url: http://fang.cs.sunyit.edu/scott/scott.html robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: robot-useragent: robot-language: robot-description: It is designed to search Web neighborhoods to find pages that may be logically related. The Robot returns a list of links that looks like a hot list. The search can be by key word or all links at a distance of one or two hops may be returned. The TkWWW Robot is described in a paper presented at the WWW94 Conference in Chicago.
robot-history: robot-environment: modified-date: modified-by: robot-id: tlspider robot-name:TLSpider robot-cover-url: n/a robot-details-url: n/a robot-owner-name: topiclink.com robot-owner-url: topiclink.com robot-owner-email: [email protected] robot-status: not activated robot-purpose: to get web sites and add them to the topiclink future directory robot-type:development: robot under development robot-platform:linux robot-availability:none robot-exclusion:yes robot-exclusion-useragent:topiclink robot-noindex:no robot-host: tlspider.topiclink.com (not available yet) robot-from:no robot-useragent:TLSpider/1.1 robot-language:perl5 robot-description:This robot runs 2 days a week getting information for TopicLink.com robot-history:This robot was created to serve the internet search engine TopicLink.com robot-environment:service modified-date:September,10,1999 17:28 GMT modified-by: TopicLink Spider Team robot-id: ucsd robot-name: UCSD Crawl robot-cover-url: http://www.mib.org/~ucsdcrawl robot-details-url: robot-owner-name: Adam Tilghman robot-owner-url: http://www.mib.org/~atilghma robot-owner-email: [email protected] robot-status: robot-purpose: indexing, statistics robot-type: standalone robot-platform: http://info.webcrawler.com/mak/projects/robots/active/all.txt (89 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: nuthaus.mib.org scilib.ucsd.edu robot-from: yes robot-useragent: UCSD-Crawler robot-language: Perl 4 robot-description: Should hit ONLY within UC San Diego - trying to count servers here. robot-history: robot-environment: modified-date: Sat Jan 27 09:21:40 1996. modified-by: robot-id: udmsearch robot-name: UdmSearch robot-details-url: http://mysearch.udm.net/ robot-cover-url: http://mysearch.udm.net/ robot-owner-name: Alexander Barkov robot-owner-url: http://mysearch.udm.net/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing, validation robot-type: standalone robot-platform: unix robot-availability: source, binary robot-exclusion: yes robot-exclusion-useragent: UdmSearch robot-noindex: yes robot-host: * robot-from: no robot-useragent: UdmSearch/2.1.1 robot-language: c robot-description: UdmSearch is free web search engine software for intranet/small domain internet servers robot-history: Developed since 1998; its original purpose was a search engine for the republic of Udmurtia http://search.udm.net robot-environment: hobby modified-date: Mon, 6 Sep 1999 10:28:52 GMT robot-id: urlck robot-name: URL Check robot-cover-url: http://www.cutternet.com/products/webcheck.html robot-details-url: http://www.cutternet.com/products/urlck.html robot-owner-name: Dave Finnegan robot-owner-url: http://www.cutternet.com robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: unix robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: urlck robot-noindex: no robot-host: * robot-from: yes robot-useragent: urlck/1.2.3 robot-language: c robot-description: The robot is used to manage, maintain, and modify web sites. It builds a database detailing the site, builds HTML reports describing the site, and can be used to up-load pages to the site or to modify existing pages and URLs within the site.
http://info.webcrawler.com/mak/projects/robots/active/all.txt (90 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt It can also be used to mirror whole or partial sites. It supports HTTP, File, FTP, and Mailto schemes. robot-history: Originally designed to validate URLs. robot-environment: commercial modified-date: July 9, 1997 modified-by: Dave Finnegan robot-id: us robot-name: URL Spider Pro robot-cover-url: http://www.innerprise.net robot-details-url: http://www.innerprise.net/us.htm robot-owner-name: Innerprise robot-owner-url: http://www.innerprise.net robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: Windows9x/NT robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: * robot-noindex: yes robot-host: * robot-from: no robot-useragent: URL Spider Pro robot-language: delphi robot-description: Used for building a database of web pages. robot-history: Project started July 1998. robot-environment: commercial modified-date: Mon, 12 Jul 1999 17:50:30 GMT modified-by: Innerprise robot-id: valkyrie robot-name: Valkyrie robot-cover-url: http://kichijiro.c.u-tokyo.ac.jp/odin/ robot-details-url: http://kichijiro.c.u-tokyo.ac.jp/odin/robot.html robot-owner-name: Masanori Harada robot-owner-url: http://www.graco.c.u-tokyo.ac.jp/~harada/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Valkyrie libwww-perl robot-noindex: no robot-host: *.c.u-tokyo.ac.jp robot-from: yes robot-useragent: Valkyrie/1.0 libwww-perl/0.40 robot-language: perl4 robot-description: used to collect resources from Japanese Web sites for the ODIN search engine. robot-history: This robot has been used since Oct. 1995 for the author's research. robot-environment: service research modified-date: Thu Mar 20 19:09:56 JST 1997 modified-by: [email protected] robot-id: victoria robot-name: Victoria robot-cover-url: robot-details-url: robot-owner-name: Adrian Howard robot-owner-url: robot-owner-email: [email protected] http://info.webcrawler.com/mak/projects/robots/active/all.txt (91 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-status: development robot-purpose: maintenance robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Victoria robot-noindex: yes robot-host: robot-from: robot-useragent: Victoria/1.0 robot-language: perl,c robot-description: Victoria is part of a groupware product produced by Victoria Real Ltd. (voice: +44 [0]1273 774469, fax: +44 [0]1273 779960, email: [email protected]). Victoria is used to monitor changes in W3 documents, both intranet and internet based. Contact Victoria Real for more information. robot-history: robot-environment: commercial modified-date: Fri, 22 Nov 1996 16:45 GMT modified-by: [email protected] robot-id: visionsearch robot-name: vision-search robot-cover-url: http://www.ius.cs.cmu.edu/cgi-bin/vision-search robot-details-url: robot-owner-name: Henry A. Rowley robot-owner-url: http://www.cs.cmu.edu/~har robot-owner-email: [email protected] robot-status: robot-purpose: indexing.
robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: dylan.ius.cs.cmu.edu robot-from: no robot-useragent: vision-search/3.0 robot-language: Perl 5 robot-description: Intended to be an index of computer vision pages, containing all pages within <em>n</em> links (for some small <em>n</em>) of the Vision Home Page robot-history: robot-environment: modified-date: Fri Mar 8 16:03:04 1996 modified-by: robot-id: voyager robot-name: Voyager robot-cover-url: http://www.lisa.co.jp/voyager/ robot-details-url: robot-owner-name: Voyager Staff robot-owner-url: http://www.lisa.co.jp/voyager/ robot-owner-email: [email protected] robot-status: development robot-purpose: indexing, maintenance robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Voyager robot-noindex: no http://info.webcrawler.com/mak/projects/robots/active/all.txt (92 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-host: *.lisa.co.jp robot-from: yes robot-useragent: Voyager/0.0 robot-language: perl5 robot-description: This robot is used to build the database for the Lisa Search service. The robot is launched manually and visits sites in a random order. robot-history: robot-environment: service modified-date: Mon, 30 Nov 1998 08:00:00 GMT modified-by: Hideyuki Ezaki robot-id: vwbot robot-name: VWbot robot-cover-url: http://vancouver-webpages.com/VWbot/ robot-details-url: http://vancouver-webpages.com/VWbot/aboutK.shtml robot-owner-name: Andrew Daviel robot-owner-url: http://vancouver-webpages.com/~admin/ robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: VWbot_K robot-noindex: yes robot-host: vancouver-webpages.com robot-from: yes robot-useragent: VWbot_K/4.2 robot-language: perl4 robot-description: Used to index BC sites for the searchBC database. Runs daily. robot-history: Originally written fall 1995. Actively maintained. robot-environment: service commercial research modified-date: Tue, 4 Mar 1997 20:00:00 GMT modified-by: Andrew Daviel robot-id: w3index robot-name: The NWI Robot robot-cover-url: http://www.ub2.lu.se/NNC/projects/NWI/the_nwi_robot.html robot-owner-name: Sigfrid Lundberg, Lund university, Sweden robot-owner-url: http://nwi.ub2.lu.se/~siglun robot-owner-email: [email protected] robot-status: active robot-purpose: discovery, statistics robot-type: standalone robot-platform: UNIX robot-availability: none (at the moment) robot-exclusion: yes robot-noindex: No robot-host: nwi.ub2.lu.se, mars.dtv.dk and a few others robot-from: yes robot-useragent: w3index robot-language: perl5 robot-description: A resource discovery robot, used primarily for the indexing of the Scandinavian Web robot-history: It is about a year or so old. Written by Anders Ardö, Mattias Borrell, Håkan Ardö and myself.
robot-environment: service, research modified-date: Wed Jun 26 13:58:04 MET DST 1996 modified-by: Sigfrid Lundberg robot-id: w3m2 robot-name: W3M2 http://info.webcrawler.com/mak/projects/robots/active/all.txt (93 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-cover-url: http://tronche.com/W3M2 robot-details-url: robot-owner-name: Christophe Tronche robot-owner-url: http://tronche.com/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing, maintenance, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: yes robot-useragent: W3M2/x.xxx robot-language: Perl 4, Perl 5, and C++ robot-description: to generate a Resource Discovery database, validate links, validate HTML, and generate statistics robot-history: robot-environment: modified-date: Fri May 5 17:48:48 1995 modified-by: robot-id: wanderer robot-name: the World Wide Web Wanderer robot-cover-url: http://www.mit.edu/people/mkgray/net/ robot-details-url: robot-owner-name: Matthew Gray robot-owner-url: http://www.mit.edu:8001/people/mkgray/mkgray.html robot-owner-email: [email protected] robot-status: active robot-purpose: statistics robot-type: standalone robot-platform: unix robot-availability: data robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: *.mit.edu robot-from: robot-useragent: WWWWanderer v3.0 robot-language: perl4 robot-description: Run initially in June 1993, its aim is to measure the growth in the web. robot-history: robot-environment: research modified-date: modified-by: robot-id:webbandit robot-name:WebBandit Web Spider robot-cover-url:http://pw2.netcom.com/~wooger/ robot-details-url:http://pw2.netcom.com/~wooger/ robot-owner-name:Jerry Walsh robot-owner-url:http://pw2.netcom.com/~wooger/ robot-owner-email:[email protected] robot-status:active robot-purpose:Resource Gathering / Server Benchmarking robot-type:standalone application robot-platform:Intel - windows95 robot-availability:source, binary robot-exclusion:no robot-exclusion-useragent:WebBandit/1.0 robot-noindex:no http://info.webcrawler.com/mak/projects/robots/active/all.txt (94 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-host:ix.netcom.com robot-from:no robot-useragent:WebBandit/1.0 robot-language:C++ robot-description:multithreaded, hyperlink-following, resource-finding webspider robot-history:Inspired by reading the Internet Programming book by Jamsa/Cope robot-environment:commercial modified-date:11/21/96 modified-by:Jerry Walsh robot-id: webcatcher robot-name: WebCatcher robot-cover-url: http://oscar.lang.nagoya-u.ac.jp robot-details-url: robot-owner-name: Reiji SUZUKI robot-owner-url: http://oscar.lang.nagoya-u.ac.jp/~reiji/index.html robot-owner-email: [email protected] robot-owner-name2: Masatoshi SUGIURA robot-owner-url2: http://oscar.lang.nagoya-u.ac.jp/~sugiura/index.html robot-owner-email2: [email protected] robot-status: development robot-purpose: indexing robot-type: standalone robot-platform: unix, windows, mac robot-availability: none robot-exclusion: yes robot-exclusion-useragent: webcatcher robot-noindex: no robot-host: oscar.lang.nagoya-u.ac.jp robot-from: no robot-useragent: WebCatcher/1.0 robot-language: perl5 robot-description: WebCatcher gathers web pages that Japanese college students want to visit.
robot-history: This robot finds its roots in a research project at Nagoya University in 1998. robot-environment: research modified-date: Fri, 16 Oct 1998 17:28:52 JST modified-by: "Reiji SUZUKI" <[email protected]> robot-id: webcopy robot-name: WebCopy robot-cover-url: http://www.inf.utfsm.cl/~vparada/webcopy.html robot-details-url: robot-owner-name: Victor Parada robot-owner-url: http://www.inf.utfsm.cl/~vparada/ robot-owner-email: [email protected] robot-status: robot-purpose: mirroring robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: * robot-from: no robot-useragent: WebCopy/(version) robot-language: perl 4 or perl 5 robot-description: Its purpose is to perform mirroring. WebCopy can retrieve files recursively using the HTTP protocol. It can be used as a delayed browser or as a mirroring tool. It cannot jump from one site to another. http://info.webcrawler.com/mak/projects/robots/active/all.txt (95 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-history: robot-environment: modified-date: Sun Jul 2 15:27:04 1995 modified-by: robot-id: webfetcher robot-name: webfetcher robot-cover-url: http://www.ontv.com/ robot-details-url: robot-owner-name: robot-owner-url: http://www.ontv.com/ robot-owner-email: [email protected] robot-status: robot-purpose: mirroring robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: robot-host: * robot-from: yes robot-useragent: WebFetcher/0.8, robot-language: C++ robot-description: don't wait! OnTV's WebFetcher mirrors whole sites down to your hard disk on a TV-like schedule. Catch w3 documentation. Catch discovery.com without waiting! A fully operational web robot for NT/95 today, most UNIX soon, MAC tomorrow. robot-history: robot-environment: modified-date: Sat Jan 27 10:31:43 1996. modified-by: robot-id: webfoot robot-name: The Webfoot Robot robot-cover-url: robot-details-url: robot-owner-name: Lee McLoughlin robot-owner-url: http://web.doc.ic.ac.uk/f?/lmjm robot-owner-email: [email protected] robot-status: robot-purpose: robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: robot-host: phoenix.doc.ic.ac.uk robot-from: robot-useragent: robot-language: robot-description: robot-history: First spotted in mid-February 1994 robot-environment: modified-date: modified-by: robot-id: weblayers robot-name: weblayers robot-cover-url: http://www.univ-paris8.fr/~loic/weblayers/ robot-details-url: robot-owner-name: Loic Dachary robot-owner-url: http://www.univ-paris8.fr/~loic/ http://info.webcrawler.com/mak/projects/robots/active/all.txt (96 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-owner-email: [email protected] robot-status: robot-purpose: maintenance robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: robot-useragent: weblayers/0.0 robot-language: perl 5 robot-description: Its purpose is to validate, cache and maintain links. It is designed to maintain the cache generated by the emacs w3 mode (N*tscape replacement) and to support annotated documents (keep them in sync with the original document via diff/patch).
robot-history: robot-environment: modified-date: Fri Jun 23 16:30:42 FRE 1995 modified-by: robot-id: weblinker robot-name: WebLinker robot-cover-url: http://www.cern.ch/WebLinker/ robot-details-url: robot-owner-name: James Casey robot-owner-url: http://www.maths.tcd.ie/hyplan/jcasey/jcasey.html robot-owner-email: [email protected] robot-status: robot-purpose: maintenance robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: robot-host: robot-from: robot-useragent: WebLinker/0.0 libwww-perl/0.1 robot-language: robot-description: It traverses a section of the web, doing URN->URL conversion. It will be used as a post-processing tool on documents created by automatic converters such as LaTeX2HTML or WebMaker. At the moment it works at full speed, but is restricted to local sites. External GETs will be added, but these will be running slowly. WebLinker is meant to be run locally, so if you see it elsewhere let the author know! robot-history: robot-environment: modified-date: modified-by: robot-id: webmirror robot-name: WebMirror robot-cover-url: http://www.winsite.com/pc/win95/netutil/wbmiror1.zip robot-details-url: robot-owner-name: Sui Fung Chan robot-owner-url: http://www.geocities.com/NapaVally/1208 robot-owner-email: [email protected] robot-status: robot-purpose: mirroring robot-type: standalone robot-platform: Windows95 http://info.webcrawler.com/mak/projects/robots/active/all.txt (97 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: robot-host: robot-from: no robot-useragent: no robot-language: C++ robot-description: It downloads web pages to the hard drive for off-line browsing. robot-history: robot-environment: modified-date: Mon Apr 29 08:52:25 1996. modified-by: robot-id: webmoose robot-name: The Web Moose robot-cover-url: robot-details-url: http://www.nwlink.com/~mikeblas/webmoose/ robot-owner-name: Mike Blaszczak robot-owner-url: http://www.nwlink.com/~mikeblas/ robot-owner-email: [email protected] robot-status: development robot-purpose: statistics, maintenance robot-type: standalone robot-platform: Windows NT robot-availability: data robot-exclusion: no robot-exclusion-useragent: WebMoose robot-noindex: no robot-host: msn.com robot-from: no robot-useragent: WebMoose/0.0.0000 robot-language: C++ robot-description: This robot collects statistics and verifies links. It builds a graph of its visit path. robot-history: This robot is under development. It will support ROBOTS.TXT soon. robot-environment: hobby modified-date: Fri, 30 Aug 1996 00:00:00 GMT modified-by: Mike Blaszczak robot-id:webquest robot-name:WebQuest robot-cover-url: robot-details-url: robot-owner-name:TaeYoung Choi robot-owner-url:http://www.cosmocyber.co.kr:8080/~cty/index.html robot-owner-email:[email protected] robot-status:development robot-purpose:indexing robot-type:standalone robot-platform:unix robot-availability:none robot-exclusion:yes robot-exclusion-useragent:webquest robot-noindex:no robot-host:210.121.146.2, 210.113.104.1, 210.113.104.2 robot-from:yes robot-useragent:WebQuest/1.0 robot-language:perl5 robot-description:WebQuest will be used to build the databases for various web search service sites which will be in service by early 1998. Until the end of Jan. 1998, WebQuest will run from time to time.
Since then, it will run http://info.webcrawler.com/mak/projects/robots/active/all.txt (98 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt daily(for few hours and very slowly). robot-history:The developent of WebQuest was motivated by the need for a customized robot in various projects of COSMO Information & Communication Co., Ltd. in Korea. robot-environment:service modified-date:Tue, 30 Dec 1997 09:27:20 GMT modified-by:TaeYoung Choi robot-id: webreader robot-name: Digimarc MarcSpider robot-cover-url: http://www.digimarc.com/prod_fam.html robot-details-url: http://www.digimarc.com/prod_fam.html robot-owner-name: Digimarc Corporation robot-owner-url: http://www.digimarc.com robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: windowsNT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: robot-noindex: robot-host: 206.102.3.* robot-from: yes robot-useragent: Digimarc WebReader/1.2 robot-language: c++ robot-description: Examines image files for watermarks. In order to not waste internet bandwidth with yet another crawler, we have contracted with one of the major crawlers/seach engines to provide us with a list of specific URLs of interest to us. If an URL is to an image, we may read the image, but we do not crawl to any other URLs. If an URL is to a page of interest (ususally due to CGI), then we access the page to get the image URLs from it, but we do not crawl to any other pages. robot-history: First operation in August 1997. robot-environment: service modified-date: Mon, 20 Oct 1997 16:44:29 GMT modified-by: Brian MacIntosh robot-id: webreaper robot-name: WebReaper robot-cover-url: http://www.otway.com/webreaper robot-details-url: robot-owner-name: Mark Otway robot-owner-url: http://www.otway.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing/offline browsing robot-type: standalone robot-platform: windows95, windowsNT robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: webreaper robot-noindex: no robot-host: * robot-from: no robot-useragent: WebReaper [[email protected]] robot-language: c++ robot-description: Freeware app which downloads and saves sites locally for offline browsing. robot-history: Written for personal use, and then distributed to the public as freeware. robot-environment: hobby modified-date: Thu, 25 Mar 1999 15:00:00 GMT http://info.webcrawler.com/mak/projects/robots/active/all.txt (99 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt modified-by: Mark Otway robot-id: webs robot-name: webs robot-cover-url: http://webdew.rnet.or.jp/ robot-details-url: http://webdew.rnet.or.jp/service/shank/NAVI/SEARCH/info2.html#robot robot-owner-name: Recruit Co.Ltd, robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: statistics robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: webs robot-noindex: no robot-host: lemon.recruit.co.jp robot-from: yes robot-useragent: [email protected] robot-language: perl5 robot-description: The webs robot is used to gather WWW servers' top pages last modified date data. Collected statistics reflects the priority of WWW server data collection for webdew indexing service. Indexing in webdew is done by manually. 
robot-history: robot-environment: service modified-date: Fri, 6 Sep 1996 10:00:00 GMT modified-by: robot-id: websnarf robot-name: Websnarf robot-cover-url: robot-details-url: robot-owner-name: Charlie Stross robot-owner-url: robot-owner-email: [email protected] robot-status: retired robot-purpose: robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: robot-useragent: robot-language: robot-description: robot-history: robot-environment: modified-date: modified-by: robot-id: webspider robot-name: WebSpider robot-details-url: http://www.csi.uottawa.ca/~u610468 robot-cover-url: robot-owner-name: Nicolas Fraiji robot-owner-email: [email protected] robot-status: active, under further enhancement. robot-purpose: maintenance, link diagnostics http://info.webcrawler.com/mak/projects/robots/active/all.txt (100 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-type: standalone robot-exclusion: yes robot-noindex: no robot-exclusion-useragent: webspider robot-host: several robot-from: Yes robot-language: Perl4 robot-history: developped as a course project at the University of Ottawa, Canada in 1996. robot-environment: Educational use and Research robot-id: webvac robot-name: WebVac robot-cover-url: http://www.federated.com/~tim/webvac.html robot-details-url: robot-owner-name: Tim Jensen robot-owner-url: http://www.federated.com/~tim robot-owner-email: [email protected] robot-status: robot-purpose: mirroring robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: robot-host: robot-from: no robot-useragent: webvac/1.0 robot-language: C++ robot-description: robot-history: robot-environment: modified-date: Mon May 13 03:19:17 1996. modified-by: robot-id: webwalk robot-name: webwalk robot-cover-url: robot-details-url: robot-owner-name: Rich Testardi robot-owner-url: robot-owner-email: robot-status: retired robot-purpose: indexing, maintentance, mirroring, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: yes robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: yes robot-useragent: webwalk robot-language: c robot-description: Its purpose is to generate a Resource Discovery database, validate links, validate HTML, perform mirroring, copy document trees, and generate statistics. Webwalk is easily extensible to perform virtually any maintenance function which involves web traversal, in a way much like the '-exec' option of the find(1) command. Webwalk is usually used behind the HP firewall robot-history: robot-environment: modified-date: Wed Nov 15 09:51:59 PST 1995 http://info.webcrawler.com/mak/projects/robots/active/all.txt (101 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt modified-by: robot-id: webwalker robot-name: WebWalker robot-cover-url: robot-details-url: robot-owner-name: Fah-Chun Cheong robot-owner-url: http://www.cs.berkeley.edu/~fccheong/ robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: WebWalker robot-noindex: no robot-host: * robot-from: yes robot-useragent: WebWalker/1.10 robot-language: perl4 robot-description: WebWalker performs WWW traversal for individual sites and tests for the integrity of all hyperlinks to external sites. 
robot-history: A Web maintenance robot for expository purposes, first published in the book "Internet Agents: Spiders, Wanderers, Brokers, and Bots" by the robot's author. robot-environment: hobby modified-date: Thu, 25 Jul 1996 16:00:52 PDT modified-by: Fah-Chun Cheong robot-id: webwatch robot-name: WebWatch robot-cover-url: http://www.specter.com/users/janos/specter robot-details-url: robot-owner-name: Joseph Janos robot-owner-url: http://www.specter.com/users/janos/specter robot-owner-email: [email protected] robot-status: robot-purpose: maintainance, statistics robot-type: standalone robot-platform: robot-availability: robot-exclusion: no robot-exclusion-useragent: robot-noindex: no robot-host: robot-from: no robot-useragent: WebWatch robot-language: c++ robot-description: Its purpose is to validate HTML, and generate statistics. Check URLs modified since a given date. robot-history: robot-environment: modified-date: Wed Jul 26 13:36:32 1995 modified-by: robot-id: wget robot-name: Wget robot-cover-url: ftp://gnjilux.cc.fer.hr/pub/unix/util/wget/ robot-details-url: robot-owner-name: Hrvoje Niksic robot-owner-url: robot-owner-email: [email protected] robot-status: development http://info.webcrawler.com/mak/projects/robots/active/all.txt (102 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-purpose: mirroring, maintenance robot-type: standalone robot-platform: unix robot-availability: source robot-exclusion: yes robot-exclusion-useragent: wget robot-noindex: no robot-host: * robot-from: yes robot-useragent: Wget/1.4.0 robot-language: C robot-description: Wget is a utility for retrieving files using HTTP and FTP protocols. It works non-interactively, and can retrieve HTML pages and FTP trees recursively. It can be used for mirroring Web pages and FTP sites, or for traversing the Web gathering data. It is run by the end user or archive maintainer. robot-history: robot-environment: hobby, research modified-date: Mon, 11 Nov 1996 06:00:44 MET modified-by: Hrvoje Niksic robot-id: whatuseek robot-name: whatUseek Winona robot-cover-url: http://www.whatUseek.com/ robot-details-url: http://www.whatUseek.com/ robot-owner-name: Neil Mansilla robot-owner-url: http://www.whatUseek.com/ robot-owner-email: [email protected] robot-status: active robot-purpose: Robot used for site-level search and meta-search engines. robot-type: standalone robot-platform: unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: winona robot-noindex: yes robot-host: *.whatuseek.com, *.aol2.com robot-from: no robot-useragent: whatUseek_winona/3.0 robot-language: c++ robot-description: The whatUseek robot, Winona, is used for site-level search engines. It is also implemented in several meta-search engines. robot-history: Winona was developed in November of 1996. 
robot-environment: service modified-date: Wed, 17 Jan 2001 11:52:00 EST modified-by: Neil Mansilla robot-id: whowhere robot-name: WhoWhere Robot robot-cover-url: http://www.whowhere.com robot-details-url: robot-owner-name: Rupesh Kapoor robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: Sun Unix robot-availability: none robot-exclusion: yes robot-exclusion-useragent: whowhere robot-noindex: no robot-host: spica.whowhere.com robot-from: no http://info.webcrawler.com/mak/projects/robots/active/all.txt (103 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-useragent: robot-language: C/Perl robot-description: Gathers data for an email directory from web pages robot-history: robot-environment: commercial modified-date: modified-by: robot-id: wmir robot-name: w3mir robot-cover-url: http://www.ifi.uio.no/~janl/w3mir.html robot-details-url: robot-owner-name: Nicolai Langfeldt robot-owner-url: http://www.ifi.uio.no/~janl/w3mir.html robot-owner-email: [email protected] robot-status: robot-purpose: mirroring. robot-type: standalone robot-platform: UNIX, WindowsNT robot-availability: robot-exclusion: no. robot-exclusion-useragent: robot-noindex: robot-host: robot-from: yes robot-useragent: w3mir robot-language: Perl robot-description: W3mir uses the If-Modified-Since HTTP header and recurses only the directory and subdirectories of its start document. Known to work on U*ixes and Windows NT. robot-history: robot-environment: modified-date: Wed Apr 24 13:23:42 1996. modified-by: robot-id: wolp robot-name: WebStolperer robot-cover-url: http://www.suchfibel.de/maschinisten robot-details-url: http://www.suchfibel.de/maschinisten/text/werkzeuge.htm (in German) robot-owner-name: Marius Dahler robot-owner-url: http://www.suchfibel.de/maschinisten robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix, NT robot-availability: none robot-exclusion: yes robot-exclusion-useragent: WOLP robot-noindex: yes robot-host: www.suchfibel.de robot-from: yes robot-useragent: WOLP/1.0 mda/1.0 robot-language: perl5 robot-description: The robot gathers information about specified web projects and generates knowledge bases in JavaScript or its own format robot-environment: hobby modified-date: 22 Jul 1998 modified-by: Marius Dahler robot-id: wombat robot-name: The Web Wombat robot-cover-url: http://www.intercom.com.au/wombat/ http://info.webcrawler.com/mak/projects/robots/active/all.txt (104 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-details-url: robot-owner-name: Internet Communications robot-owner-url: http://www.intercom.com.au/ robot-owner-email: [email protected] robot-status: robot-purpose: indexing, statistics. robot-type: robot-platform: robot-availability: robot-exclusion: no. robot-exclusion-useragent: robot-noindex: robot-host: qwerty.intercom.com.au robot-from: no robot-useragent: no robot-language: IBM Rexx/VisualAge C++ under OS/2. robot-description: The robot is the basis of the Web Wombat search engine (Australian/New Zealand content ONLY). robot-history: robot-environment: modified-date: Thu Feb 29 00:39:49 1996.
modified-by: robot-id: worm robot-name: The World Wide Web Worm robot-cover-url: http://www.cs.colorado.edu/home/mcbryan/WWWW.html robot-details-url: robot-owner-name: Oliver McBryan robot-owner-url: http://www.cs.colorado.edu/home/mcbryan/Home.html robot-owner-email: [email protected] robot-status: robot-purpose: indexing robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: no robot-host: piper.cs.colorado.edu robot-from: robot-useragent: robot-language: robot-description: indexing robot, actually has quite flexible search options robot-history: robot-environment: modified-date: modified-by: robot-id: wwwc robot-name: WWWC Ver 0.2.5 robot-cover-url: http://www.kinet.or.jp/naka/tomo/wwwc.html robot-details-url: robot-owner-name: Tomoaki Nakashima. robot-owner-url: http://www.kinet.or.jp/naka/tomo/ robot-owner-email: [email protected] robot-status: active robot-purpose: maintenance robot-type: standalone robot-platform: windows, windows95, windowsNT robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: WWWC robot-noindex: no robot-host: http://info.webcrawler.com/mak/projects/robots/active/all.txt (105 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-from: yes robot-useragent: WWWC/0.25 (Win95) robot-language: c robot-description: robot-history: 1997 robot-environment: hobby modified-date: Tuesday, 18 Feb 1997 06:02:47 GMT modified-by: Tomoaki Nakashima ([email protected]) robot-id: wz101 robot-name: WebZinger robot-details-url: http://www.imaginon.com/wzindex.html robot-cover-url: http://www.imaginon.com robot-owner-name: ImaginOn, Inc robot-owner-url: http://www.imaginon.com robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: windows95, windowsNT 4, mac, solaris, unix robot-availability: binary robot-exclusion: no robot-exclusion-useragent: none robot-noindex: no robot-host: http://www.imaginon.com/wzindex.html * robot-from: no robot-useragent: none robot-language: java robot-description: commercial Web Bot that accepts plain text queries, uses webcrawler, lycos or excite to get URLs, then visits sites. If the user's filter parameters are met, downloads one picture and a paragraph of test. Playsback slide show format of one text paragraph plus image from each site. 
robot-history: developed by ImaginOn in 1996 and 1997 robot-environment: commercial modified-date: Wed, 11 Sep 1997 02:00:00 GMT modified-by: [email protected] robot-id: xget robot-name: XGET robot-cover-url: http://www2.117.ne.jp/~moremore/x68000/soft/soft.html robot-details-url: http://www2.117.ne.jp/~moremore/x68000/soft/soft.html robot-owner-name: Hiroyuki Shigenaga robot-owner-url: http://www2.117.ne.jp/~moremore/ robot-owner-email: [email protected] robot-status: active robot-purpose: mirroring robot-type: standalone robot-platform: X68000, X68030 robot-availability: binary robot-exclusion: yes robot-exclusion-useragent: XGET robot-noindex: no robot-host: * robot-from: yes robot-useragent: XGET/0.7 robot-language: c robot-description: Its purpose is to retrieve updated files.It is run by the end userrobot-history: 1997 robot-environment: hobby modified-date: Fri, 07 May 1998 17:00:00 GMT modified-by: Hiroyuki Shigenaga robot-id: Nederland.zoek robot-name: Nederland.zoek robot-cover-url: http://www.nederland.net/ http://info.webcrawler.com/mak/projects/robots/active/all.txt (106 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/all.txt robot-details-url: robot-owner-name: System Operator Nederland.net robot-owner-url: robot-owner-email: [email protected] robot-status: active robot-purpose: indexing robot-type: standalone robot-platform: unix (Linux) robot-availability: none robot-exclusion: yes robot-exclusion-useragent: Nederland.zoek robot-noindex: no robot-host: 193.67.110.* robot-from: yes robot-useragent: Nederland.zoek robot-language: c robot-description: This robot indexes all .nl sites for the search-engine of Nederland.net robot-history: Developed at Computel Standby in Apeldoorn, The Netherlands robot-environment: service modified-date: Sat, 8 Feb 1997 01:10:00 CET modified-by: Sander Steffann <[email protected]> http://info.webcrawler.com/mak/projects/robots/active/all.txt (107 of 107) [18.02.2001 13:17:48] http://info.webcrawler.com/mak/projects/robots/active/empty.txt robot-id: robot-name: robot-cover-url: robot-details-url: robot-owner-name: robot-owner-url: robot-owner-email: robot-status: robot-purpose: robot-type: robot-platform: robot-availability: robot-exclusion: robot-exclusion-useragent: robot-noindex: robot-host: robot-from: robot-useragent: robot-language: robot-description: robot-history: robot-environment: modified-date: modified-by: http://info.webcrawler.com/mak/projects/robots/active/empty.txt [18.02.2001 13:17:50] http://info.webcrawler.com/mak/projects/robots/active/schema.txt Database Format --------------Records ------Records are formatted like RFC 822 messages. Unless specified, values may not contain HTML, or empty lines, but may contain 8-bit values. Where a value contains "one or more" tokens, they are to be separated by a comma followed by a space. Fields can be repeated and grouped by appending number 2 and up, for example: robot-owner-name1: Mr A. RobotAuthor robot-owner-url1: http://webrobot.com/~a/a.html robot-owner-name2: Mr B. RobotCoAuthor robot-owner-name2: http://webrobot.com/~b/b.html Fields Schema -----robot-id: Short name for the robot, used internally as a unique reference. Should use [a-z-_]+ Example: webcrawler robot-name: Full name of the robot, for presentation purposes. Example: WebCrawler robot-details-url: URL of the robot home page, containing further technical details on the robot, background information etc. 
Example: http://webcrawler.com/WebCrawler/Facts/HowItWorks.html robot-cover-url: URL of the robot product, containing marketing details about either the robot, or the service to which the robot is related. Example: http://webcrawler.com/ robot-owner-name: Name of the owner. For service robots this is the person running the robot, who can be contacted in case of specific problems. In the case of robot products this is the person maintaining the product, who can be contacted if the robot has bugs. Example: Brian Pinkerton robot-owner-url: Home page of the robot-owner-name Example: http://info.webcrawler.com/bp/bp.html robot-owner-email: Email address of owner Example: [email protected] robot-status: http://info.webcrawler.com/mak/projects/robots/active/schema.txt (1 of 3) [18.02.2001 13:17:53] http://info.webcrawler.com/mak/projects/robots/active/schema.txt Deployment status of the robot. One of: - development: robot under development - active: robot actively in use - retired: robot no longer used robot-purpose: Purpose of the robot. One or more of: - indexing: gather content for an indexing service - maintenance: link validation, html validation etc. - statistics: used to gather statistics Further details can be given in the description robot-type: Type of robot software. One or more of: - standalone: a separate program - browser: built into a browser - plugin: a plugin for a browser robot-platform: Platform robot runs on. One or more of: - unix - windows, windows95, windowsNT - os2 - mac etc. robot-availability: Availability of robot to general public. One or more of: - source: source code available - binary: binary form available - data: bulk data gathered by robot available - none Details on robot-url or robot-cover-url. robot-exclusion: Standard for Robots Exclusion supported. yes or no robot-exclusion-useragent: Substring to use in /robots.txt Example: webcrawler robot-noindex: <meta name="robots" content="noindex"> directive supported: yes or no robot-nofollow: <meta name="robots" content="nofollow"> directive supported: yes or no robot-host: Host the robot is run from. Can be a pattern of DNS and/or IP. If the robot is available to the general public, add '*' Example: spidey.webcrawler.com, *.webcrawler.com, 192.216.46.* robot-from: The HTTP From field as defined in RFC 1945 can be set. yes or no robot-useragent: The HTTP User-Agent field as defined in RFC 1945 Example: WebCrawler/1.0 libwww/4.0 robot-language: Languages the robot is written in. One or more of: http://info.webcrawler.com/mak/projects/robots/active/schema.txt (2 of 3) [18.02.2001 13:17:53] http://info.webcrawler.com/mak/projects/robots/active/schema.txt c,c++,perl,perl4,perl5,java,tcl,python, etc. robot-description: Text description of the robot's functions. More details should go on robot-url. Example: The WebCrawler robot is used to build the database for the WebCrawler search service operated by GNN (part of AOL). The robot runs weekly, and visits sites in a random order. robot-history: Text description of the origins of the robot. Example: This robot finds its roots in a research project at the University of Washington in 1994. robot-environment: The environment the robot operates in. One or more of: - service: builds a commercial service - commercial: is a commercial product - research: used for research - hobby: written as a hobby modified-date: The date this record was last modified. 
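Given the record conventions this schema describes (RFC 822 style "field: value" lines, comma-plus-space separated tokens in multi-valued fields, and grouped fields numbered 2 and up), here is a minimal sketch of a parser for a local copy of the database. It assumes, as in the archived all.txt, that records are separated by blank lines; the file name and the latin-1 encoding are assumptions, not part of the schema.

    def parse_robot_records(path="all.txt"):
        """Parse RFC-822-style robot records into a list of field dictionaries."""
        records, current, last = [], {}, None
        with open(path, encoding="latin-1") as f:        # encoding is an assumption
            for raw in f:
                line = raw.rstrip("\n")
                if not line.strip():                     # blank line closes a record
                    if current:
                        records.append(current)
                    current, last = {}, None
                elif ":" in line and not line[0].isspace():
                    field, _, value = line.partition(":")
                    last = field.strip()
                    current[last] = value.strip()
                elif last is not None:
                    current[last] += " " + line.strip()  # wrapped continuation line
        if current:
            records.append(current)
        return records

    # Example: exclusion tokens of robots that claim to honour /robots.txt
    # robots = parse_robot_records()
    # tokens = [r.get("robot-exclusion-useragent") for r in robots
    #           if r.get("robot-exclusion", "").lower().startswith("yes")]

Multi-valued fields such as robot-host or robot-platform can then be split on ", "; grouped fields (robot-owner-name2 and so on) simply appear under their numbered names.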
Format as in HTTP Example: Fri, 21 Jun 1996 17:28:52 GMT http://info.webcrawler.com/mak/projects/robots/active/schema.txt (3 of 3) [18.02.2001 13:17:53] Robots Mailing List Archive by thread Robots Mailing List Archive by thread ● About this archive ● Most recent messages ● Messages sorted by: [ date ][ subject ][ author ] ● Other mail archives Starting: Wed 00 Jan 1970 - 16:31:48 PDT Ending: Thu 18 Dec 1997 - 14:33:60 PDT Messages: 2106 ● Announcement Michael G=?iso-8859-1?Q?=F6ckel ● RE: The Internet Archive robot Sigfrid Lundberg ● The robots mailing list at WebCrawler Martijn Koster ● Something that would be handy Tim Bray ● Site Announcement James ❍ Re: Site Announcement Martijn Koster ● How do I let spiders in? Roger Dearnaley ● Unfriendly robot at 205.177.10.2 Nick Arnett ❍ Re: Unfriendly robot at 205.177.10.2 Tim Bray ❍ Re: Unfriendly robot at 205.177.10.2 Nick Arnett ❍ Re: Unfriendly robot at 205.177.10.2 Reinier Post ❍ Re: Unfriendly robot at 205.177.10.2 Reinier Post ❍ Re: Unfriendly robot at 205.177.10.2 Leigh DeForest Dupee ● CORRECTION -- Re: Unfriendly robot Nick Arnett ● Looking for a spider Alain Desilets ❍ Re: Looking for a spider Alvaro Monge ❍ Re: Looking for a spider Xiaodong Zhang ❍ Re: Looking for a spider Alvaro Monge ❍ Re: Looking for a spider Reinier Post ❍ Re: Looking for a spider Alain Desilets ❍ Re: Looking for a spider Alain Desilets ❍ Re: Looking for a spider Alain Desilets ❍ Re: Looking for a spider Marilyn R Wulfekuhler ❍ Re: Looking for a spider Alain Desilets ❍ Re: Looking for a spider Marilyn R Wulfekuhler ❍ Re: Looking for a spider Alain Desilets ❍ Re: Looking for a spider Gene Essman http://info.webcrawler.com/mailing-lists/robots/index.html (1 of 61) [18.02.2001 13:19:24] Robots Mailing List Archive by thread ❍ Re: Looking for a spider Nick Arnett ❍ Re: Looking for a spider Ted Sullivan ❍ Re: Looking for a spider [email protected] ❍ Re: Looking for a spider Ted Sullivan ❍ Re: Looking for a spider i.bromwich ● Is it a robot or a link-updater? David Bakin ● Unfriendly robot owner identified! Nick Arnett ❍ Re: Unfriendly robot owner identified! Andrew Leonard ● Really fast searching Nick Arnett ● Sorry! Alain Desilets ● Re: Unfriendly robot at 205.252.60.50 Nick Arnett ❍ ● re: Lycos unfriendly robot Murray Bent ❍ ● Re: Unfriendly robot at 205.252.60.50 Andrew Leonard Re: Lycos unfriendly robot Reinier Post Re: Unfriendly robot at 205.252.60.50 Nick Arnett ❍ Re: Unfriendly robot at 205.252.60.50 Kim Davies ● Re: Unfriendly robot at 205.252.60.50 Nick Arnett ● Re: Proposed URLs that robots should search Nick Arnett ❍ ● Re: Proposed URLs that robots should search Kim Davies ❍ ● Re: Proposed URLs that robots should search Martijn Koster Re: Proposed URLs that robots should search Andrew Daviel lycos patents Murray Bent ❍ Re: lycos patents Scott Stephenson ❍ Re: lycos patents Martijn Koster ❍ Re: lycos patents Matthew Gray ❍ Re: lycos patents Reinier Post ❍ Re: lycos patents Roger Dearnaley ● re: Lycos patents Murray Bent ● Patents? Martijn Koster ● meta tag implementation Davide Musella ❍ Re: meta tag implementation Jeffrey C. 
Chen ❍ Simple load robot Jaakko Hyvatti ❍ Re: meta tag implementation Steve Nisbet ❍ Re: meta tag implementation Davide Musella ❍ Re: meta tag implementation Reinier Post http://info.webcrawler.com/mailing-lists/robots/index.html (2 of 61) [18.02.2001 13:19:24] Robots Mailing List Archive by thread ❍ ● Re: meta tag implementation Steve Nisbet Preliminary robot.faq (Please Send Questions or Comments) Keith Fischer ❍ Re: Preliminary robot.faq (Please Send Questions or Comments) Tim Bray ❍ Re: Preliminary robot.faq (Please Send Questions or Comments) Keith Fischer ❍ Re: Preliminary robot.faq (Please Send Questions or Comments) Reinier Post ❍ Re: Preliminary robot.faq (Please Send Questions or Comments) YUWONO BUDI ❍ Bad robot: WebHopper bounch! Owner: [email protected] Benjamin Franz ● BOUNCE robots: Admin request ● wwwbot.pl problem Andrew Daviel ❍ ● ● ● ● ● ● Re: wwwbot.pl problem Fred Douglis yet another robot Paul Francis ❍ yet another robot, volume 2 David Eagles ❍ Re: yet another robot, volume 2 James ❍ Re: yet another robot, volume 2 James Q: Cooperation of robots Byung-Gyu Chang ❍ Re: Q: Cooperation of robots David Eagles ❍ Re: Q: Cooperation of robots Nick Arnett ❍ Re: Q: Cooperation of robots Paul Francis ❍ Re: Q: Cooperation of robots Jaakko Hyvatti Smart Agent help Michael Goldberg ❍ Re: Smart Agent help Paul Francis ❍ Re: Smart Agent help [email protected] harvest John D. Pritchard ❍ Re: harvest Michael Goldberg ❍ mortgages with: Re: harvest John D. Pritchard How frequently should I check /robots.txt? Skip Montanaro ❍ Re: How frequently should I check /robots.txt? gil cosson ❍ Re: How frequently should I check /robots.txt? Martijn Koster ❍ Re: How frequently should I check /robots.txt? Martijn Koster McKinley Spider hit us hard Christopher Penrose ❍ Re: McKinley Spider hit us hard Michael Van Biesbrouck ● Mail failure Adminstrator ● Mail failure Adminstrator ● Mail failure Adminstrator ● Mail failure Adminstrator http://info.webcrawler.com/mailing-lists/robots/index.html (3 of 61) [18.02.2001 13:19:24] Robots Mailing List Archive by thread ● Small robot needed Karoly Negyesi ● New robot turned loose on an unsuspecting public... and a DNS question Skip Montanaro ● Re: New robot turned loose on an unsuspecting public... and a DNS question Thomas Maslen inquiry about robots Cristian Ionitoiu ● MacPower Lance Ogletree ❍ ❍ Re: MacPower Jaakko Hyvatti ❍ Re: MacPower (an apology, I am very sorry) Jaakko Hyvatti ● Re: Returned mail: Service unavailableHELP HELP! Julian Gorodsky ● Re: Returned mail: Service unavailableHELP AGAIN HELP AGAIN! Julian Gorodsky ● Indexing two-byte text Harry Munir Behrens ❍ Re: Indexing two-byte text John D. Pritchard ● Indexing two-byte text Harry Munir Behrens ● Either a spider or a hacker? ww2.allcon.com Randall Hill ● Indexing two-byte text Mark Schrimsher ❍ Re: Indexing two-byte text Paul Francis ❍ Re: Indexing two-byte text Mark Schrimsher ❍ Re: Indexing two-byte text Paul Francis ❍ Re: Indexing two-byte text Frank Smadja ❍ Re: Indexing two-byte text Paul Francis ❍ Re: Indexing two-byte text Mark Schrimsher ❍ Re: Indexing two-byte text Mark Schrimsher ❍ Re: Indexing two-byte text Paul Francis ❍ Re: Indexing two-byte text Mark Schrimsher ● RE: Indexing two-byte text [email protected] ● RE: Indexing two-byte text Mark Schrimsher ❍ RE: Indexing two-byte text Noboru Iwayama ● Freely available robot code in C available? [email protected] ● Freely available robot code in C available? Kevin Hoogheem ● ❍ Re: Freely available robot code in C available? 
Mark Schrimsher ❍ Re: Freely available robot code in C available? Nick Arnett ❍ Re: Freely available robot code in C available? Edwin Carp Harvest question Jim Meritt ❍ Re: Harvest question Mark Schrimsher ● Announcement and Help Requested Simon.Stobart ● Announcement and Help Requested Simon.Stobart http://info.webcrawler.com/mailing-lists/robots/index.html (4 of 61) [18.02.2001 13:19:24] Robots Mailing List Archive by thread ❍ Re: Announcement and Help Requested Martijn Koster ❍ Re: Announcement and Help Requested Jeremy.Ellman ❍ Re: Announcement and Help Requested Simon.Stobart ❍ Re: Announcement and Help Requested Jeremy.Ellman ● Re[2]: Harvest question Jim Meritt ● Robot on the Rampage [email protected] ❍ Re: Robot on the Rampage Susumu Shimizu ❍ Re: Robot on the Rampage Reinier Post ❍ Checking Log files Cees Hek ❍ Re: Checking Log files Kevin Hoogheem ❍ Re: Checking Log files Mark Schrimsher ❍ Re: Checking Log files Cees Hek ● [1]RE>Checking Log files Roger Dearnaley ● [2]RE>Checking Log files Roger Dearnaley ● [3]RE>Checking Log files Roger Dearnaley ● [4]RE>Checking Log files Roger Dearnaley ● [5]RE>Checking Log files Roger Dearnaley ● [5]RE>Checking Log files Mark Schrimsher ❍ Re: [5]RE>Checking Log files [email protected] ● [1]RE>[5]RE>Checking Log fi Roger Dearnaley ● [2]RE>[5]RE>Checking Log fi Roger Dearnaley ● ❍ Re: [2]RE>[5]RE>Checking Log fi Micah A. Williams ❍ Re: [2]RE>[5]RE>Checking Log fi Skip Montanaro ❍ Re: [2]RE>[5]RE>Checking Log fi Bjorn-Olav Strand ❍ Re: [2]RE>[5]RE>Checking Log fi Gordon Bainbridge ❍ Re: [2]RE>[5]RE>Checking Log fi Micah A. Williams ❍ Re: [2]RE>[5]RE>Checking Log fi [email protected] Wobot? Byung-Gyu Chang ❍ Re: Wobot? Nick Arnett ● Announcing NaecSpyr, a new. . . robot? Mordechai T. Abzug ● [3]RE>[5]RE>Checking Log fi Roger Dearnaley ❍ Re: [3]RE>[5]RE>Checking Log fi Micah A. Williams ● [1]Wobot? Roger Dearnaley ● [1]Announcing NaecSpyr, a n Roger Dearnaley ● [2]Announcing NaecSpyr, a n Roger Dearnaley http://info.webcrawler.com/mailing-lists/robots/index.html (5 of 61) [18.02.2001 13:19:24] Robots Mailing List Archive by thread ● [2]Wobot? Roger Dearnaley ❍ Contact for Intouchgroup.com Vince Taluskie ● [4]RE>[5]RE>Checking Log fi Roger Dearnaley ● [3]Wobot? Roger Dearnaley ● [3]Announcing NaecSpyr, a n Roger Dearnaley ● [1]RE>[3]RE>[5]RE>Checking Roger Dearnaley ● [5]RE>[5]RE>Checking Log fi Roger Dearnaley ● [1]RE>[2]RE>[5]RE>Checking Roger Dearnaley ● [4]Wobot? Roger Dearnaley ● [4]Announcing NaecSpyr, a n Roger Dearnaley ● [2]RE>[3]RE>[5]RE>Checking Roger Dearnaley ● Dearnaley Auto Reply Cannon? Micah A. Williams ● Dearnaley Auto Reply Cannon? Kevin Hoogheem ● [2]RE>[2]RE>[5]RE>Checking Roger Dearnaley ● [1]Contact for Intouchgroup Roger Dearnaley ● Re: [2]RE>[5]RE>Checking Lo Saul Jacobs ● [5]Wobot? Roger Dearnaley ● Re: [2]RE>[5]RE>Checking Lo Bonnie Scott ● Vacation wars Martijn Koster ❍ Re: Vacation wars Nick Arnett ● Re: [2]RE>[5]RE>Checking Lo David Henderson ● New Robot??? David Henderson ● test; please ignore Martijn Koster ❍ Re: test; please ignore Mark Schrimsher ❍ Re: test; please ignore David Henderson ● Unfriendly Lycos , again ... Murray Bent ● Inter-robot Comms Port David Eagles ● ❍ Re: Inter-robot Comms Port John D. Pritchard ❍ Re: Inter-robot Comms Port Super-User ❍ Re: Inter-robot Comms Port Carlos Baquero Re: Unfriendly Lycos , again ... Steven L. Baur ❍ ● Re: Unfriendly Lycos , again ... 
[email protected] Inter-robot Communications - Part II David Eagles ❍ Re: Inter-robot Communications - Part II Martijn Koster ❍ Re: Inter-robot Communications - Part II Mordechai T. Abzug http://info.webcrawler.com/mailing-lists/robots/index.html (6 of 61) [18.02.2001 13:19:24] Robots Mailing List Archive by thread ● unknown robot (no name) ❍ Re: unknown robot Luiz Fernando ❍ Re: unknown robot ❍ Re: unknown robot John Lindroth ● RE: Inter-robot Communications - Part II David Eagles ● RE: Inter-robot Communications - Part II Martijn Koster ❍ ● please add my site gil cosson ❍ ● Re: please add my site Martijn Koster Please Help ME!! Dong-Hyun Kim ❍ ● Re: Inter-robot Communications - Part II John D. Pritchard Re: Please Help ME!! Byung-Gyu Chang Infinite e-mail loop Roger Dearnaley ❍ Infinite e-mail loop Skip Montanaro ● Up to date list of Robots James ● Web Robot Matthew Gray ● Re: Web Robots James ● Re: Web Robots Jakob Faarvang ● Does this count as a robot? Thomas Stets ❍ Re: Does this count as a robot? Jeremy.Ellman ❍ Re: Does this count as a robot? Benjamin Franz ❍ Re: Does this count as a robot? Benjamin Franz ❍ Re: Does this count as a robot? YUWONO BUDI ❍ Re: Does this count as a robot? [email protected] ❍ Re: Does this count as a robot? YUWONO BUDI ❍ Recursing heuristics (Re: Does this..) Jaakko Hyvatti ❍ avoiding infinite regress for robots Reinier Post ● RE: avoiding infinite regress for robots David Eagles ● Recursion David Eagles ❍ Re: Recursion [email protected] ● Duplicate docs (was avoiding infinite regress...) Nick Arnett ● RE: Recursion David Eagles ● MD5 in HTTP headers - where? Skip Montanaro ❍ ● Re: MD5 in HTTP headers - where? Mordechai T. Abzug robots.txt extensions Adam Jack ❍ Re: robots.txt extensions Martijn Koster http://info.webcrawler.com/mailing-lists/robots/index.html (7 of 61) [18.02.2001 13:19:24] Robots Mailing List Archive by thread ● ❍ Re: robots.txt extensions Adam Jack ❍ Re: robots.txt extensions Jaakko Hyvatti ❍ Re: robots.txt extensions Adam Jack ❍ Re: robots.txt extensions Martin Kiff Does anyone else consider this irresponsible? Robert Raisch, The Internet Company ❍ Re: Does anyone else consider this irresponsible? Stan Norton ❍ Re: Does anyone else consider this irresponsible? ❍ Re: Does anyone else consider this irresponsible? Mark Norman ❍ Re: Does anyone else consider this irresponsible? Eric Hollander ❍ Re: Does anyone else consider this irresponsible? Super-User ❍ Re: Does anyone else consider this irresponsible? Robert Raisch, The Internet Company Responsible behavior, Robots vs. humans, URL botany... Skip Montanaro ❍ ❍ Re: Does anyone else consider this irresponsible? Robert Raisch, The Internet Company Re: Does anyone else consider this irresponsible? Ed Carp @ TSSUN5 ❍ Re: Does anyone else consider this irresponsible? ❍ ● FAQ again. Martijn Koster ● Robots / source availability? Martijn Koster ❍ ● ● Re: Robots / source availability? Erik Selberg Re: Does anyone else consider... Mark Norman ❍ Re: Does anyone else consider... ❍ Re: Does anyone else consider... Skip Montanaro Re: Does anyone else consider... Mark Schrimsher ❍ Re: Does anyone else consider... ● (no subject) Alison Gwin ● Robots not Frames savy David Henderson ● Re: Spam Software Sought Mark Schrimsher ● Re: Does anyone else consider... [email protected] ● Horror story ❍ Re: Horror story Skip Montanaro ❍ Re: Horror story Jaakko Hyvatti ❍ Re: Horror story Ted Sullivan ❍ Re: Horror story Brian Pinkerton ❍ Re: Horror story Murray Bent ❍ Re: Horror story Mordechai T. 
Abzug http://info.webcrawler.com/mailing-lists/robots/index.html (8 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ❍ Re: Horror story Steve Nisbet ❍ Re: Horror story Steve Nisbet ● Gopher Protocol Question Hal Belisle ● New Robot Announcement Larry Burke ● ❍ Re: New Robot Announcement Mordechai T. Abzug ❍ Re: New Robot Announcement Larry Burke ❍ Re: New Robot Announcement Jeremy.Ellman ❍ Re: New Robot Announcement Ed Carp @ TSSUN5 ❍ Re: New Robot Announcement David Levine ❍ Re: New Robot Announcement Larry Burke ❍ Re: New Robot Announcement Jakob Faarvang ❍ Re: New Robot Announcement John Lindroth ❍ Re: New Robot Announcement Ed Carp @ TSSUN5 ❍ Re: New Robot Announcement Kevin Hoogheem Re: robots.txt extensions Steven L Baur ❍ ● ● Re: robots.txt extensions Skip Montanaro Alta Vista searches WHAT?!? Ed Carp @ TSSUN5 ❍ Re: Alta Vista searches WHAT?!? Martijn Koster ❍ Re: Alta Vista searches WHAT?!? ❍ Re: Alta Vista searches WHAT?!? Adam Jack ❍ Re: Alta Vista searches WHAT?!? Tronche Ch. le pitre ❍ Re: Alta Vista searches WHAT?!? Mark Schrimsher ❍ Re: Alta Vista searches WHAT?!? Wayne Lamb ❍ Re: Alta Vista searches WHAT?!? Reinier Post ❍ Re: Alta Vista searches WHAT?!? Wayne Lamb ❍ Re: Alta Vista searches WHAT?!? Erik Selberg ❍ Re: Alta Vista searches WHAT?!? Edward Stangler ❍ Re: Alta Vista searches WHAT?!? Erik Selberg BOUNCE robots: Admin request Martijn Koster ❍ Re: BOUNCE robots: Admin request Nick Arnett ❍ Re: BOUNCE robots: Admin request Jim Meritt ● RE: Alta Vista searches WHAT?!? Ted Sullivan ● robots.txt , authors of robots , webmasters .... [email protected] ❍ Re: robots.txt , authors of robots , webmasters .... Reinier Post ❍ Re: robots.txt , authors of robots , webmasters ....OMOMOM[D Wayne Lamb http://info.webcrawler.com/mailing-lists/robots/index.html (9 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ● ❍ Re: robots.txt , authors of robots , webmasters ....OM Wayne Lamb ❍ Re: robots.txt , authors of robots , webmasters .... Benjamin Franz ❍ Re: robots.txt , authors of robots , webmasters .... Wayne Lamb ❍ ❍ Re: robots.txt , authors of robots , webmasters .... Robert Raisch, The Internet Company Re: robots.txt , authors of robots , webmasters .... Benjamin Franz ❍ Re: robots.txt , authors of robots , webmasters .... Carlos Baquero ❍ Re: robots.txt , authors of robots , webmasters .... Reinier Post ❍ Re: robots.txt , authors of robots , webmasters .... Kevin Hoogheem ❍ Re: robots.txt , authors of robots , webmasters .... Ed Carp @ TSSUN5 ❍ Re: robots.txt , authors of robots , webmasters .... Adam Jack ❍ Re: robots.txt , authors of robots , webmasters .... Nick Arnett Web robots and gopher space -- two separate worlds [email protected] ❍ Re: Web robots and gopher space -- two separate worlds Wayne Lamb ● [ANNOUNCE] CFP: AAAI-96 WS on Internet-based Information Systems Alexander Franz Robot Research Bhupinder S. Sran ● Re: Re: robots.txt , authors of robots , webmasters .... 
Larry Burke ● re: privacy, courtesy, protection John Lammers ● Server name in /robots.txt Martin Kiff ● ● ❍ Re: Server name in /robots.txt Tim Bray ❍ Re: Server name in /robots.txt Christopher Penrose ❍ Re: Server name in /robots.txt Martijn Koster ❍ Re: Server name in /robots.txt Reinier Post ❍ Re: Server name in /robots.txt Martin Kiff ❍ ❍ Canonical Names for documents (was Re: Server name in /robots.txt) Michael De La Rue Re: Server name in /robots.txt Martijn Koster ❍ Re: Server name in /robots.txt [email protected] Polite Request #2 to be Removed form List [email protected] ❍ Re: Polite Request #2 to be Removed form List [email protected] ● un-subcribe [email protected] ● RE: Server name in /robots.txt David Eagles ● RE: Server name in /robots.txt Tim Bray ❍ ● Re: Server name in /robots.txt Mordechai T. Abzug Who sets standards (was Server name in /robots.txt) Nick Arnett http://info.webcrawler.com/mailing-lists/robots/index.html (10 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ❍ ● Re: Who sets standards (was Server name in /robots.txt) Tim Bray HEAD request [was Re: Server name in /robots.txt] Davide Musella ❍ Re: HEAD request [was Re: Server name in /robots.txt] Martijn Koster ❍ Re: HEAD request [was Re: Server name in /robots.txt] Davide Musella ❍ Re: HEAD request [was Re: Server name in /robots.txt] Renato Mario Rossello ● Activity from 205.252.60.5[0-8] Martin Kiff ● test. please ignore. Mark Norman ● Any info on "E-mail America"? Bonnie Scott ● www.pl? The YakkoWakko. Webmaster ● New URL's from Equity Int'll Webcenter Mark Krell ● Requesting info on database engines Renato Mario Rossello ● News Clipper for newsgroups - Windows Richard Glenner ❍ ● ● ● Re: News Clipper for newsgroups - Windows Nick Arnett Wanted: Web Robot code - C/Perl Charlie Brown ❍ Re: Wanted: Web Robot code - C/Perl Patrick 'Zapzap' Lin ❍ Re: Wanted: Web Robot code - C/Perl Keith Fischer ❍ Re: Wanted: Web Robot code - C/Perl Kevin Hoogheem Perl Spiders Christopher Penrose ❍ Re: Perl Spiders dino ❍ Re: Perl Spiders Charlie Brown ❍ Re: Perl Spiders Christopher Penrose ❍ Re:Re: Perl Spiders [email protected] Here is WebWalker Christopher Penrose ❍ Re: Here is WebWalker dino ● The "Robot and Search Engine FAQ" Keith D. Fischer ● algorithms Kenneth DeMarse ❍ ● The Robot And Search Engine FAQ Keith D. Fischer ❍ ● ● Re: algorithms too Jose Raul Vaquero Pulido Re: The Robot And Search Engine FAQ Erik Selberg Money Spider WWW Robot for Windows John McGrath - Money Spider Ltd. ❍ Re: Money Spider WWW Robot for Windows Nick Arnett ❍ Re: Money Spider WWW Robot for Windows Simon.Stobart robots.txt changes how often? Tangy Verdell ❍ Re: robots.txt changes how often? Darrin Chandler ❍ Re: robots.txt changes how often? Jaakko Hyvatti http://info.webcrawler.com/mailing-lists/robots/index.html (11 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ● ❍ Re: robots.txt changes how often? Martin Kiff ❍ Re: robots.txt changes how often? Jeremy.Ellman robots.txt Tangy Verdell ❍ Re: robots.txt Jaakko Hyvatti ● Re: Commercial Robot Vendor Recoomendations Request Michael De La Rue ● fdsf Davide Musella ● Robots and search engines technical information. Hayssam Hasbini ❍ Re: Robots and search engines technical information. Jeremy.Ellman ❍ Re: Robots and search engines technical information. Tangy Verdell ❍ Re: Robots and search engines technical information. Erik Selberg ❍ Re: Robots and search engines technical information. 
joseph williams ● Tutorial Proposal for WWW95 http ● URL measurement studies? Darren R. Hardy ❍ Re: URL measurement studies? John D. Pritchard ● about robots.txt content errors [email protected] ● Dot dot problem... Sean Parker ● ● ❍ Re: Dot dot problem... Reinier Post ❍ Re: Dot dot problem... Nick Arnett Robot Databases Steve Livingston ❍ Re: Robot Databases Tronche Ch. le pitre ❍ Re: Robot Databases Ted Sullivan ❍ Re: Robot Databases ❍ Re: Robot Databases Skip Montanaro ❍ Re: Robot Databases Ted Sullivan Anyone doing a Java-based robot yet? Nick Arnett ❍ Re: Anyone doing a Java-based robot yet? David A Weeks ❍ Anyone doing a Java-based robot yet? Pertti Kasanen ❍ Re: Anyone doing a Java-based robot yet? Adam Jack ❍ Re: Anyone doing a Java-based robot yet? John D. Pritchard ❍ Re: Anyone doing a Java-based robot yet? Nick Arnett ❍ Re: Anyone doing a Java-based robot yet? Adam Jack ❍ Re: Anyone doing a Java-based robot yet? John D. Pritchard ❍ Re: Anyone doing a Java-based robot yet? Adam Jack ❍ Re: Anyone doing a Java-based robot yet? Frank Smadja ❍ Re: Anyone doing a Java-based robot yet?6 Mr David A Weeks http://info.webcrawler.com/mailing-lists/robots/index.html (12 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ● RE: Robot Databases Ted Sullivan ● Ingrid ready for prelim alpha testing.... Paul Francis ● Robot for Sun David Schnardthorst ● url locating Martijn De Boef ❍ Re: url locating Terry Smith ● Re[2]: Anyone doing a Java-based robot yet? Tangy Verdell ● Altavista indexing password files John Messerly ❍ Re: Altavista indexing password files ❍ Re: Altavista indexing password files [email protected] ● RE: Altavista indexing password files John Messerly ● BSE-Slurp/0.6 Gordon V. Cormack ❍ Re: BSE-Slurp/0.6 Mordechai T. Abzug ❍ Re: BSE-Slurp/0.6 Mark Schrimsher ● Can I retrieve image map files? Mark Norman ● Robots available for Intranet applications Douglas Summersgill ❍ Re: Robots available for Intranet applications Sylvain Duclos ❍ Re: Robots available for Intranet applications Mark Slabinski ❍ Re: Robots available for Intranet applications Jared Williams ❍ Re: Robots available for Intranet applications Josef Pellizzari ❍ Re: Robots available for Intranet applications Jared Williams ❍ Re: Robots available for Intranet applications Nick Arnett ● "What's new" in web pages is not necessarily reliable Mordechai T. Abzug ● verify URL Jim Meritt ❍ Re: verify URL Vince Taluskie ❍ Re: verify URL Carlos Baquero ❍ Re: verify URL Tronche Ch. le pitre ❍ Re: verify URL Reinier Post ● libww and robot source for Sequent Dynix/Ptx 4.1.3 (no name) ● Re[2]: verify URL Jim Meritt ● robot authentication parameters Sibylle Gonzales ❍ Re: robot authentication parameters Dan Gildor ❍ Re: robot authentication parameters Michael De La Rue ● RE: verify URL Debbie Swanson ● Re: Robots available for Intranet applications jon madison ❍ Re: Robots available for Intranet applications David Schnardthorst http://info.webcrawler.com/mailing-lists/robots/index.html (13 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ● How to...??? Francesco ❍ ● Re: How to...??? Jeremy.Ellman image map traversal Mark Norman ❍ Re: image map traversal Benjamin Franz ❍ Re: image map traversal Cees Hek ● ReHowto...??? Jared Williams ● Info on authoring a Web Robot Jared Williams ● ● ● ❍ Re: Info on authoring a Web Robot Detlev Kalb ❍ Re: Info on authoring a Web Robot Keith D. 
Fischer ❍ RCPT: Re: Info on authoring a Web Robot Jeannine Washington image map traversal [email protected] ❍ Re: image map traversal Nick Arnett ❍ Re: image map traversal jon madison Links Jared Williams ❍ Re: Links Martijn Koster ❍ Re: Links Jared Williams ❍ Re: Links Mordechai T. Abzug ❍ Re: Links (don't bother checking; I've done it for you) Michael De La Rue ❍ Re: Links (don't bother checking; I've done it for you) Chris Brown ❍ Re: Links (don't bother checking; I've done it for you) Jaakko Hyvatti ❍ Re: Links Jared Williams Limiting robots to top-level page only (via robots.txt)? Chuck Doucette ❍ ● Re: Limiting robots to top-level page only (via robots.txt)? Jaakko Hyvatti Image Maps Thomas Merlin ❍ Re: Image Maps ● Request for Source code in C for Robots ACHAKS ● robots that index comments Dan Gildor ❍ Re: robots that index comments murray bent ● Re: UNSUBSCRIBE ROBOTS [email protected] ● keywords in META-element Detlev Kalb ❍ Re: keywords in META-element Davide Musella ● Announce: ActiveX Search (IFilter) spec/sample Lee Fisher ● Re: Links This Site is about Robots Not Censorship Keith ❍ ● Re: Links This Site is about Robots Not Censorship Michael De La Rue Re: Links (don't bother checking; I've done it for you) Darrin Chandler http://info.webcrawler.com/mailing-lists/robots/index.html (14 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ❍ Re: Links (don't bother checking; I've done it for you) Benjamin Franz ● Re: Links (don't bother checking; I've done it for you) David Henderson ● Re: Links This Site is about Robots Not Censorship Rob Turk ● Re: Links (don't bother checking; I've done it for you) Martin Kiff ● Re: Links (don't bother checking; I've done it for you) Darrin Chandler ● The Letter To End All Letters Jared Williams ● Heuristics.... Martin Kiff ❍ ● Re: Heuristics.... Nick Arnett unscribe christophe grandjacquet ❍ Re: unscribe jon madison ● unsubscibe christophe grandjacquet ● Admin: how to get off this list Martijn Koster ● Search accuracy Nick Arnett ❍ Re: Search accuracy Benjamin Franz ❍ Re: Search accuracy Nick Arnett ❍ Re: Search accuracy Benjamin Franz ❍ Re: Search accuracy [email protected] ❍ Re: Search accuracy YUWONO BUDI ❍ Re: Search accuracy Judy Feder ❍ Re: Search accuracy Nick Arnett ❍ Re: Search accuracy John D. Pritchard ❍ Re: Search accuracy Daniel C Grigsby ❍ Re: Search accuracy David A Weeks ❍ Re: Search accuracy Nick Arnett ❍ Re: Search accuracy Robert Raisch, The Internet Company ❍ Re: Search accuracy Nick Arnett ❍ Re: Search accuracy Benjamin Franz ❍ Re: Search accuracy Ellen M Voorhees ❍ Re: Search accuracy Ted Sullivan ● Clean up Bots... 
Andy Warner ● Re: (Fwd) Re: Search accuracy Colin Goodier ● VB and robot development Mitchell Elster ❍ Re: VB and robot development [email protected] ❍ Re: VB and robot development [email protected] ❍ Re: VB and robot development Jakob Faarvang http://info.webcrawler.com/mailing-lists/robots/index.html (15 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ● ❍ Re: VB and robot development Ian McKellar ❍ Re: VB and robot development Jakob Faarvang ❍ Re: VB and robot development Darrin Chandler ❍ Re: VB and robot development Problem with your Index Mike Rodriguez ❍ ● ● ● Re: Problem with your Index Martijn Koster Handling keyword repetitions Mr David A Weeks ❍ Re: Handling keyword repetitions Alan ❍ Re: Handling keyword repetitions Alan word spam chris cobb ❍ Re: word spam arutgers ❍ Re: word spam Alan ❍ Re: word spam Trevor Jenkins ❍ Re: word spam Benjamin Franz ❍ Re: word spam Ken Wadland ❍ Re: word spam YUWONO BUDI ❍ Re: word spam Benjamin Franz ❍ Re: word spam Kevin Hoogheem ❍ Re: word spam Ken Wadland ❍ Re: word spam Reinier Post ❍ Re: word spam Alan ❍ Re: word spam [email protected] ❍ Re: word spam ❍ Re: word spam Andrey A. Krasov ❍ Re: word spam Nick Arnett http directory index request Mark Norman ❍ Re: http directory index request Mordechai T. Abzug ● RE: http directory index request David Levine ● Returned mail: Can't create output: Error 0 Mail Delivery Subsystem ● Re: word spam ● Web Robot Jared Williams ● Robots in the client? Ricardo Eito Brun ❍ Re: Robots in the client? Paul De Bra ❍ Re: Robots in the client? Bonnie Scott ❍ Re: Robots in the client? Ricardo Eito Brun http://info.webcrawler.com/mailing-lists/robots/index.html (16 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ❍ Re: Robots in the client? Michael Carnevali, Student, FHD ● General Information Ricardo Eito Brun ● Magic, Intelligence, and search engines Tim Bray ❍ ● ● Re: Magic, Intelligence, and search engines YUWONO BUDI default documents Jakob Faarvang ❍ Re: default documents Darrin Chandler ❍ Re: default documents Harry Munir Behrens ❍ Re: default documents Micah A. Williams ❍ Re: default documents [email protected] Mailing list Jared Williams ❍ Re: Mailing list Mordechai T. Abzug ❍ Re: Mailing list Jared Williams ❍ Re: Mailing list Mordechai T. Abzug ❍ Re: Mailing list Kevin Hoogheem ❍ Re: Mailing list Gordon Bainbridge ❍ Re: Mailing list Rob Turk ● RE: Mailing List Mitchell Elster ● About Mother of All Bulletin Boards Ricardo Eito Brun ❍ ● Re: About Mother of All Bulletin Boards Oliver A. McBryan search engine Jose Raul Vaquero Pulido ❍ Re: search engine Scott W. Wood ❍ Re: search engine Rob Turk ❍ Re: search engine Jakob Faarvang ● Quiz playing robots ? Andrey A. Krasov ● Try robot... Jared Williams ❍ ● Re: Try robot... Andy Warner (no subject) Jared Williams ❍ (no subject) Vince Taluskie ❍ (no subject) Michael De La Rue ● "Good Times" hoax Nick Arnett ● (no subject) Kevin Hoogheem ● RE: "Good Times" hoax David Eagles ● Re: Re: Bill Day ● About integrated search engines Ricardo Eito Brun ❍ Re: About integrated search engines Brian Ulicny http://info.webcrawler.com/mailing-lists/robots/index.html (17 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ❍ Re: About integrated search engines Keith D. Fischer ❍ Re: About integrated search engines Frank Smadja ● [ MERCHANTS ] My Sincerest Apologies Jared Williams ● Re: Apologies || communal bots Rob Turk ❍ Re: communal bots Bonnie Scott ❍ Re: communal bots John D. Pritchard ● Re: communal bots Rob Turk ● To: ???? 
Robot Mitchell Elster ❍ Re: To: ???? Robot Ken Wadland ❍ Re: To: ???? Robot John D. Pritchard ❍ Re: To: ???? Robot Mitchell Elster ❍ Re: To: ???? Robot John D. Pritchard ● RE: To: ???? Robot chris cobb ● Private Investigator Lists Scott W. Wood ● [Fwd: Re: To: ???? Robot] Rob Turk ● Re: To: ???? Robot] Terry Smith ● Re: To: ???? Robot Mitchell Elster ● Looking for a search engine Bhupinder S. Sran ● ❍ Re: Looking for a search engine Mark Schrimsher ❍ Re: Looking for a search engine Nick Arnett Admin: List archive is back Martijn Koster ❍ ● Re: Admin: List archive is back Nick Arnett topical search tool -- help?! Brian Fitzgerald ❍ Re: topical search tool -- help?! Brian Ulicny ❍ Re: topical search tool -- help?! Brian Ulicny ❍ Re: topical search tool -- help?! Paul Francis ❍ Re: topical search tool -- help?! Paul Francis ❍ Re: topical search tool -- help?! Paul Francis ❍ Re: topical search tool -- help?! Nick Arnett ❍ Re: topical search tool -- help?! Brian Ulicny ❍ Re: topical search tool -- help?! Brian Ulicny ❍ Re: topical search tool -- help?! Brian Ulicny ❍ Re: topical search tool -- help?! Nick Arnett ❍ Re: topical search tool -- help?! Robert Raisch, The Internet Company ❍ Re: topical search tool -- help?! Brian Ulicny http://info.webcrawler.com/mailing-lists/robots/index.html (18 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ❍ Re: topical search tool -- help?! Paul Francis ❍ Re: topical search tool -- help?! Robert Raisch, The Internet Company ● cc:Mail SMTPLINK Undeliverable Message ● Indexing a set of URL's Fred Melssen ● Political economy of distributed search (was topical search...) Nick Arnett ❍ Re: Political economy of distributed search (was topical search...) Erik Selberg ❍ Re: Political economy of distributed search (was topical search...) Jeremy.Ellman ❍ Re: Political economy of distributed search (was topical search...) Benjamin Franz ❍ Re: Political economy of distributed search (was topical search...) John D. Pritchard ● Re: Political economy of distributed search (was topical search Robert Raisch, The Internet Company (OTP) RE: Political economy of distributed search (was topical search...) David Levine ● Re: Political economy of distributed search (was topical Terry Smith ● Re: (OTP) RE: Political economy of distributed search (was topical Kevin Hoogheem ● Re: Political economy of distributed search (was topical Steve Jones ● Meta-seach engines Harry Munir Behrens ● ● ● ● ❍ Re: Meta-seach engines Greg Fenton ❍ Re: Meta-seach engines Erik Selberg any robots/search-engines which index links? Alex Chapman ❍ Re: any robots/search-engines which index links? Jakob Faarvang ❍ Re: any robots/search-engines which index links? Carlos Baquero ❍ Re: any robots/search-engines which index links? Benjamin Franz ❍ Re: any robots/search-engines which index links? Alex Chapman ❍ Re: any robots/search-engines which index links? Carlos Baquero ❍ Re: any robots/search-engines which index links? Carlos Baquero alta vista and virtualvin.com chris cobb ❍ Re: alta vista and virtualvin.com Benjamin Franz ❍ Re: alta vista and virtualvin.com Larry Gilbert ❍ Re: alta vista and virtualvin.com Benjamin Franz VB. page grabber... Marc's internet diving suit ❍ Re: VB. page grabber... Terry Smith ❍ Re: VB. page grabber... Ed Carp ❍ Re: VB. page grabber... Jakob Faarvang ● Re: Lead Time Myles Olson ● RE: VB. page grabber... 
Victor F Ribeiro ● Re: Political economy of distributed search (was topical Brian Ulicny http://info.webcrawler.com/mailing-lists/robots/index.html (19 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ● RE: VB. page grabber... chris cobb ● RE: any robots/search-engines which index links? Louis Monier ● Re[2]: verify URL Jim Meritt ● ANNOUNCE: Don Norman (Apple) LIVE! 15-May 5PM UK = noon EDT Marc Eisenstadt ● Web spaces of strange topology. Where? Michael De La Rue ● ❍ Re: Web spaces of strange topology. Where? Michael De La Rue ❍ Re: Web spaces of strange topology. Where? J.E. Fritz ❍ Re: Web spaces of strange topology. Where? Benjamin Franz ❍ Re: Web spaces of strange topology. Where? John Lindroth ❍ Re: Web spaces of strange topology. Where? Brian Clark ❍ Re: Web spaces of strange topology. Where? [email protected] Somebody is turning 23! Eric ❍ Re: Somebody is turning 23! [email protected] ❍ Re: Somebody is turning 23! John D. Pritchard ❍ Re: Somebody is turning 23! [email protected] ● robots and cookies Ken Nakagama ● Accept: Thomas Abrahamsson ● ❍ Re: Accept: Martijn Koster ❍ Re: Accept: [email protected] Defenses against bad robots [email protected] ❍ Re: Defenses against bad robots Larry Gilbert ❍ Re: Defenses against bad robots Mordechai T. Abzug ❍ Re: Defenses against bad robots Benjamin Franz ❍ Re: Defenses against bad robots Martijn Koster ❍ Re: Defenses against bad robots [email protected] ❍ Re: Defenses against bad robots [email protected] ❍ Re: Defenses against bad robots Benjamin Franz ❍ Book about robots (was Re: Defenses against bad robots) Tronche Ch. le pitre ❍ Re: Book about robots (was Re: Defenses against bad robots) Eric Knight ❍ Re: Defenses against bad robots ❍ Re: Defenses against bad robots Jaakko Hyvatti ❍ Re: Defenses against bad robots Steve Jones ❍ Re: Defenses against bad robots John D. Pritchard ❍ Re: Defenses against bad robots [email protected] ❍ Re: Defenses against bad robots Robert Raisch, The Internet Company http://info.webcrawler.com/mailing-lists/robots/index.html (20 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ❍ Re: Defenses against bad robots John D. Pritchard ❍ Re: Defenses against bad robots John D. Pritchard ❍ Re: Defenses against bad robots Robert Raisch, The Internet Company ● Re: Image Maps jon madison ● That wacky Wobot ● ??: reload problem Andrey A. Krasov ● Robot-HTML Web Page? J.Y.K. ❍ Re: Robot-HTML Web Page? Paul Francis ❍ Re: Robot-HTML Web Page? Darrin Chandler ❍ Re: Robot-HTML Web Page? Paul Francis ❍ Re: Robot-HTML Web Page? Mordechai T. Abzug ❍ Re: Robot-HTML Web Page? Ricardo Eito Brun ❍ Re: Robot-HTML Web Page? Alex Chapman ❍ Re: Robot-HTML Web Page? Nick Arnett ● Robot Exclusion Standard Revisited Charles P. Kollar ● McKinley robot Rafhael Cedeno ● McKinley robot Dave Rothwell ● RE: alta vista and virtualvin.com Louis Monier ● RE: alta vista and virtualvin.com Louis Monier ❍ RE: alta vista and virtualvin.com Ann Cantelow ❍ RE: alta vista and virtualvin.com Michael De La Rue ❍ RE: alta vista and virtualvin.com Ann Cantelow ❍ Re: alta vista and virtualvin.com John D. Pritchard ● RE: alta vista and virtualvin.com Louis Monier ● Specific searches The Wild ● RE: alta vista and virtualvin.com Louis Monier ● RE: alta vista and virtualvin.com Bakin, David ● RE: alta vista and virtualvin.com Paul Francis ❍ Re: alta vista and virtualvin.com Michael Van Biesbrouck ● A new robot...TOPjobs(tm) USA JOBbot 1.0a D. Williams ● Test server for robot development? D. 
Williams ● Re: Robot Exclusion Standard Revisited (LONG) Martijn Koster ❍ BackRub robot warning Roy T. Fielding ● Content based robot collectors Scott 'Webster' Wood ● Robot to collect web pages per site Henrik Fagrell http://info.webcrawler.com/mailing-lists/robots/index.html (21 of 61) [18.02.2001 13:19:25] Robots Mailing List Archive by thread ❍ Re: Robot to collect web pages per site Jeremy.Ellman ● Recherche de documentation sur les agents intelligents ou Robots. Mannina Bruno ● www.kollar.com/robots.html John D. Pritchard ● Tagging a document with language Tronche Ch. le pitre ❍ Re: Tagging a document with language Donald E. Eastlake 3rd ❍ Re: Tagging a document with language J.E. Fritz ● Re: Tagging a document with language Gen-ichiro Kikui ● RE: Tagging a document with language Henk Alles ● RE: Tagging a document with language Robert Raisch, The Internet Company ● implementation fo HEAD response with meta info Davide Musella ● ❍ Re: implementation fo HEAD response with meta info G. Edward Johnson ❍ Re: implementation fo HEAD response with meta info Davide Musella robots, what else! Fred K. Lenherr ❍ Re: robots, what else! Scott 'Webster' Wood ❍ Re: robots, what else! [email protected] ● Finding the canonical name for a server Jaakko Hyvatti ● Re: (book recommendation re: net agents) Rob Turk ● Content based search engine Scott 'Webster' Wood ❍ Re: Content based search engine Martijn Koster ❍ Re: Content based search engine Skip Montanaro ● (no subject) Digital Universe Inc. ● (no subject) Larry Stephen Burke ● (no subject) Fred K. Lenherr ● BackRub robot L a r r y P a g e ❍ Re: BackRub robot Ross Finlayson ❍ Re: BackRub robot Ron Kanagy ❍ Re: BackRub robot Issac Roth ❍ Re: BackRub robot Ann Cantelow ❍ Re: BackRub robot Martijn Koster ● robots.txt: allow directive Leslie Cuff ● looking for specific bot... Brenden Portolese ❍ ● Looking for News robot Michael Goldberg ❍ ● Re: looking for specific bot... Ed Carp Re: Looking for News robot Martijn Koster robot.polite Brian Hancock http://info.webcrawler.com/mailing-lists/robots/index.html (22 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ● ● ❍ Re: robot.polite Joe Nieten ❍ Re: robot.polite Martijn Koster ❍ Re: robot.polite Martijn Koster ❍ Re: robot.polite Terry O'Neill ❍ Re: robot.polite Leslie Cuff ❍ Re: robot.polite Leslie Cuff ❍ Re: robot.polite Ross Finlayson ❍ Re: robot.polite Martijn Koster Looking for good one JoongSub Lee ❍ Introducing myself Richard J. Rossi ❍ Re: Introducing myself Martijn Koster ❍ Re: Introducing myself Fred K. Lenherr ❍ Re: Introducing myself Siu-ki Wong ❍ Re: Introducing myself Leslie Cuff ❍ Re: Introducing myself Martijn Koster ❍ Re: Introducing myself Martijn Koster ❍ Re: Introducing myself Dr. Detlev Kalb ❍ Re: Introducing myself Dr. Detlev Kalb ❍ RCPT: Re: Introducing myself Hauke Loens ❍ RCPT: Re: Introducing myself Hauke Loens ❍ Re: Looking for good one Greg Fenton More dangers of spiders... dws ❍ HTML query to .ps? Scott 'Webster' Wood ❍ Re: HTML query to .ps? .... John W. Kulp ❍ Re: HTML query to .ps? .... John W. 
Kulp ● RE: Tagging a document with language Henk Alles ● Keyword indexing David Reilly ❍ Re: Keyword indexing Robert Raisch, The Internet Company ❍ Re: Keyword indexing Martijn Koster ❍ Re: Keyword indexing Scott 'Webster' Wood ❍ Re: Keyword indexing Sigfrid Lundberg ❍ Re: Keyword indexing Brian Ulicny ❍ Re: Keyword indexing David Reilly ❍ Re: Keyword indexing Dave White ❍ Re: Keyword indexing Brian Ulicny http://info.webcrawler.com/mailing-lists/robots/index.html (23 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ● ❍ Re: Keyword indexing Paul Francis ❍ Re: Keyword indexing Ricardo Eito Brun ❍ Re: Keyword indexing Fred K. Lenherr Robot books boogieoogie goobnie ❍ Re: Robot books joseph williams ❍ Re: Robot books Richard J. Rossi ❍ Re: Introducing myself Richard J. Rossi ❍ Re: Robot books Hauke Loens ❍ RCPT: Re: Robot books Netmode ❍ RCPT: Re: Robot books Hauke Loens ● Allow/deny robots from major search services boogieoogie goobnie ● Robot logic? [email protected] ❍ ● ● ● Re: Robot logic? Martijn Koster robots.txt usage Wiebe Weikamp ❍ Re: robots.txt usage Reinier Post ❍ Re: robots.txt usage Daniel Lo ❍ Re: robots.txt usage Brian Clark ❍ Re: robots.txt usage Ulrich Ruffiner robot vaiable list Marco Genua ❍ Re: robot vaiable list Martijn Koster ❍ The Web Robots Database (was Re: robot vaiable list Martijn Koster WebAnalyzer - introduction Craig McQueen ❍ Re: WebAnalyzer - introduction Martijn Koster ❍ Re: WebAnalyzer - introduction Paul De Bra ● (no subject) Martijn Koster ● RE: Introducing myself Anthony D. Thomas ● RE: WebAnalyzer - introduction Gregg Steinhilpert ● ● PERL Compilers & Interpretive Tools MPMC-Manhattan Premed Council & Hunter PBPMA-Post Bac PreMed Assoc 212-843-3701 Ext 2800 RE: robots.txt usage David Levine ● RE: WebAnalyzer - introduction Craig McQueen ● RE: Tagging a document with language Robert Raisch, The Internet Company ● Microsoft Tripoli Web Search Beta now available Lee Fisher ● RE: Tagging a document with language Henk Alles ● (no subject) Kevin Lew ● in-document directive to discourage indexing ? Denis McKeon http://info.webcrawler.com/mailing-lists/robots/index.html (24 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ❍ makerobots.perl (Re: in-document directive..) Jaakko Hyvatti ❍ Re: in-document directive to discourage indexing ? Martijn Koster ❍ Re: in-document directive to discourage indexing ? Nick Arnett ❍ Re: in-document directive to discourage indexing ? arutgers ❍ Re: in-document directive to discourage indexing ? Denis McKeon ❍ Re: in-document directive to discourage indexing ? Kevin Hoogheem ❍ Re: in-document directive to discourage indexing ? Benjamin Franz ❍ Re: in-document directive to discourage indexing ? Jaakko Hyvatti ❍ Re: in-document directive to discourage indexing ? Drew Hamilton ❍ Re: in-document directive to discourage indexing ? Rob Turk ❍ Re: in-document directive to discourage indexing ? Kevin Hoogheem ❍ Re: in-document directive to discourage indexing ? Kevin Hoogheem ❍ Re: in-document directive to discourage indexing ? Rob Turk ❍ Re: in-document directive to discourage indexing ? Convegno NATO HPC 1996 ❍ Re: your mail Ed Carp ❍ Req for ADMIM: How to sunsubscribe? [email protected] ❍ Re: in-document directive to discourage indexing ? Terry O'Neill ❍ Re: in-document directive to discourage indexing ? Kevin Hoogheem ❍ Re: in-document directive to discourage indexing ? Nick Arnett ❍ Re: in-document directive to discourage indexing ? 
Terry O'Neill ❍ Re: in-document directive to discourage indexing ? Rob Turk ● (no subject) [email protected] ● Client Robot 'Ranjan' Brenden Portolese ❍ Re: Client Robot 'Ranjan' Robert Raisch, The Internet Company ❍ Re: Client Robot 'Ranjan' John D. Pritchard ❍ Re: Client Robot 'Ranjan' Kevin Hoogheem ❍ Re: Client Robot 'Ranjan' Benjamin Franz ❍ Re: Client Robot 'Ranjan' Brenden Portolese ❍ Re: Client Robot 'Ranjan' Reinier Post ❍ Re: Client Robot 'Ranjan' Kevin Hoogheem ● multiple copies Steve Leibman ● RE: Client Robot 'Ranjan' chris cobb ● Inter-robot communication David Reilly ❍ Re: Inter-robot communication Darren R. Hardy ❍ Re: Inter-robot communication John D. Pritchard http://info.webcrawler.com/mailing-lists/robots/index.html (25 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ● ❍ Re: Inter-robot communication Harry Munir Behrens ❍ Re: Inter-robot communication John D. Pritchard ❍ Re: Inter-robot communication David Reilly ❍ Re: Inter-robot communication David Reilly ❍ Re: Inter-robot communication John D. Pritchard ❍ Re: Inter-robot communication Ross Finlayson Re: RCPT: Re: Introducing myself Steve Jones ❍ Re: RCPT: Re: Introducing myself Bj\xrn-Olav Strand ● Re: RCPT: Re: Introducing myself Fred K. Lenherr ● Looking for... Steven Frank ● Re: RCPT: Re: Introducing myself Rob Turk ❍ ● Re: RCPT: Re: Introducing myself Ann Cantelow Re: RCPT: Re: Introducing myself Daniel Williams ❍ People who live in SPAM houses... (Was re: RCPT Scott 'Webster' Wood ● RE: Looking for... Craig McQueen ● Collected information standards Scott 'Webster' Wood ● Re: RCPT: Re: Introducing myself Ross Finlayson ● Re: RCPT: Re: Introducing myself Chris Crowther ● Re: RCPT: Re: Introducing myself Chris Crowther ● Java Robot Fabio Arciniegas A. ❍ Re: Java Robot Joe Nieten ❍ Re: Java Robot John D. Pritchard ❍ Re: Java Robot Joe Nieten ❍ Re: Java Robot L a r r y P a g e ❍ Re: Java Robot John D. Pritchard ● Re: RCPT: Re: Introducing myself Hauke Loens ● Report of the Distributed Indexing/Searching Workshop Martijn Koster ● RE: Java Robot David Levine ● RE: Java Robot David Levine ● Re: Java Robot John D. Pritchard ● unscribe Jeffrey Kerns ● Dead account Anthony John Carmody ● web topology Fred K. Lenherr ❍ Re: web topology Nick Arnett ❍ Re: web topology L a r r y P a g e http://info.webcrawler.com/mailing-lists/robots/index.html (26 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ❍ Re: web topology Ed Carp ● A modest proposal...<snip> to discourage indexing ? Rob Turk ● Re: Unsubscribing from Robots (was "your mail") Ian Samson ● Robot's Book. Mannina Bruno ● ❍ Re: Robot's Book. Rob Turk ❍ Re: Robot's Book. joseph williams ❍ Re: Robot's Book. Victor Ribeiro ADMIN: unsubscribing (was: Re: Martijn Koster ❍ ● ● ● Search Engine end-users Stephen Kahn ❍ Re: Search Engine end-users Ted Resnick ❍ Re: Search Engine end-users Nick Arnett loc(SOIF) John D. Pritchard ❍ Re: loc(SOIF) Paul Francis ❍ Re: loc(SOIF) David Reilly ADMIN: Archive Martijn Koster ❍ ● Re: ADMIN: Archive Nick Arnett roverbot - perhaps the worst robot yet dws ❍ ● New engine on the loose? Scott 'Webster' Wood Re: roverbot - perhaps the worst robot yet Brian Clark we should help spiders and not say NO! Daniel Lo ❍ Re: we should help spiders and not say NO! Daniel Lo ❍ robot clusion; was Re: we should help spiders and not say NO! John D. Pritchard ❍ Re: robot clusion; was Re: we should help spiders and not say NO! Martijn Koster Re: robot clusion; was Re: we should help spiders and not say NO! John D. 
Pritchard Advice Alyne Mochan & Warren Baker ❍ ● ❍ Re: Advice John D. Pritchard ❍ Re: Advice Martijn Koster ❍ Re: Advice Alyne Mochan & Warren Baker ❍ Re: Advice Alyne Mochan & Warren Baker ● Robot Mirror with Username/Password feature (no name) ● (no subject) K.E. HERING ● Re: Robot Mirror with Username/Password feature Mannina Bruno ● ❍ Re: Announcement Mannina Bruno ❍ Re: Announcement Michael. Gunn (no subject) K.E. HERING http://info.webcrawler.com/mailing-lists/robots/index.html (27 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ● Announcement Michael G=?iso-8859-1?Q?=F6ckel ● Harvest-like use of spiders Fred Melssen ❍ Re: Harvest-like use of spiders Nick Arnett ● (no subject) Ronald Kanagy ● Re: hey man gimme a break Martijn Koster ● Identifying identical documents Daniel T. Martin ❍ ● ● ● Description or Abstract? G. Edward Johnson ❍ Re: Description or Abstract? Reinier Post ❍ Re: Description or Abstract? Davide Musella ❍ Re: Description or Abstract? Reinier Post ❍ Re: Description or Abstract? Davide Musella ❍ Re: Description or Abstract? Nick Arnett ❍ Re: Description or Abstract? Mike Agostino ❍ Re: Description or Abstract? Martijn Koster ❍ Re: Description or Abstract? Martijn Koster ❍ Re: Description or Abstract? Nick Arnett ❍ On the subject of abuse/pro-activeness Scott 'Webster' Wood ❍ Re: Description or Abstract? Davide Musella Should I index all ... CLEDER Catherine ❍ Re: Should I index all ... Michael G=?iso-8859-1?Q?=F6ckel ❍ Re: Should I index all ... Jaakko Hyvatti ❍ Re: Should I index all ... Terry O'Neill ❍ Re: Should I index all ... Cleder Catherine ❍ Re: Should I index all ... Michael G=?iso-8859-1?Q?=F6ckel ❍ Re: Should I index all ... Martijn Koster ❍ Re: Should I index all ... Nick Arnett ❍ Re: Should I index all ... Chris Crowther ❍ Re: Should I index all ... Terry O'Neill ❍ Re: Should I index all ... Terry O'Neill ❍ Re: Should I index all ... Chris Crowther ❍ Re: Should I index all ... Trevor Jenkins HTML Parser Ronald Kanagy ❍ ● Re: Identifying identical documents Jaakko Hyvatti HTML Parser Skip Montanaro desperately looking for a news searcher Saloum Fall http://info.webcrawler.com/mailing-lists/robots/index.html (28 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ❍ Re: desperately looking for a news searcher Reinier Post ● Re: Unsubscribing from Robots (was "your mail") Ian Samson ● A modest proposal...<snip> to discourage indexing ? Rob Turk ● Re: your mail Ed Carp ● Blackboard for Discussing Domain-specific Robots Fred K. Lenherr ● robots.txt unavailability Daniel T. Martin ● ● ❍ Re: robots.txt unavailability Jaakko Hyvatti ❍ Re: robots.txt unavailability Fred K. Lenherr ❍ Re: robots.txt unavailability Michael G=?iso-8859-1?Q?=F6ckel ❍ Re: robots.txt unavailability Daniel T. Martin ❍ Re: robots.txt unavailability [email protected] nastygram from xxx.lanl.gov Aaron Nabil ❍ Re: nastygram from xxx.lanl.gov Wiebe Weikamp ❍ Re: nastygram from xxx.lanl.gov Aaron Nabil Re: nastygram from xxx.lanl.gov Aaron Nabil ❍ ● Re: nastygram from xxx.lanl.gov Michael Schlindwein nastygram from xxx.lanl.gov [email protected] ❍ Re: nastygram from xxx.lanl.gov Paul Francis ❍ Re: nastygram from xxx.lanl.gov Paul Francis ❍ Re: nastygram from xxx.lanl.gov Aaron Nabil ❍ Re: nastygram from xxx.lanl.gov Paul Francis ❍ Re: nastygram from xxx.lanl.gov Steve Nisbet ❍ Re: nastygram from xxx.lanl.gov Daniel T. Martin ❍ Re: nastygram from xxx.lanl.gov Chris Crowther ❍ Re: nastygram from xxx.lanl.gov Roy T. 
Fielding ❍ Re: nastygram from xxx.lanl.gov Istvan ❍ Re: nastygram from xxx.lanl.gov Istvan ❍ Re: nastygram from xxx.lanl.gov Larry Gilbert ❍ Re: nastygram from xxx.lanl.gov Tim Bray ❍ Re: nastygram from xxx.lanl.gov Chris Crowther ❍ Re: nastygram from xxx.lanl.gov Rob Hartill ❍ Re: nastygram from xxx.lanl.gov Benjamin Franz ❍ Re: nastygram from xxx.lanl.gov Rob Hartill ❍ Re: nastygram from xxx.lanl.gov Aaron Nabil ❍ Re: nastygram from xxx.lanl.gov Rob Hartill http://info.webcrawler.com/mailing-lists/robots/index.html (29 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ● ❍ Re: nastygram from xxx.lanl.gov Gordon Bainbridge ❍ Re: nastygram from xxx.lanl.gov Benjamin Franz ❍ Re: nastygram from xxx.lanl.gov Denis McKeon ❍ Re: nastygram from xxx.lanl.gov Drew Hamilton ❍ Re: nastygram from xxx.lanl.gov Rob Hartill ❍ Re: nastygram from xxx.lanl.gov Istvan ❍ Re: nastygram from xxx.lanl.gov Fred K. Lenherr ❍ Re: nastygram from xxx.lanl.gov Steve Jones ❍ You found it... (was Re: nastygram from xxx.lanl.gov) Michael Schlindwein ❍ Re: nastygram from xxx.lanl.gov Garth T Kidd ❍ Re: nastygram from xxx.lanl.gov Rob Hartill ❍ Re: nastygram from xxx.lanl.gov Garth T Kidd ❍ Re: nastygram from xxx.lanl.gov [email protected] ❍ Re: nastygram from xxx.lanl.gov Istvan ❍ Re: nastygram from xxx.lanl.gov Garth T Kidd ❍ Re: nastygram from xxx.lanl.gov Jaakko Hyvatti ❍ Re: nastygram from xxx.lanl.gov Chris Crowther ❍ Re: nastygram from xxx.lanl.gov Chris Crowther ❍ Re: nastygram from xxx.lanl.gov Gordon Bainbridge ❍ Re: nastygram from xxx.lanl.gov ❍ Re: nastygram from xxx.lanl.gov Kevin Hoogheem ❍ Re: nastygram from xxx.lanl.gov Chris Crowther RE: nastygram from xxx.lanl.gov Elias Sideris ❍ ● RE: nastygram from xxx.lanl.gov Bj\xrn-Olav Strand RE: nastygram from xxx.lanl.gov Frank Wales ❍ Re: nastygram from xxx.lanl.gov Michael Schlindwein ● htaccess Steve Leibman ● RE: robots.txt unavailability Louis Monier ● RE: On the subject of abuse/pro-activeness Louis Monier ❍ ● ● ● ● Re: On the subject of abuse/pro-activeness Scott 'Webster' Wood NCSA Net Access_log Analysis Tool for Win95 MPMC-Manhattan Premed Council & Hunter PBPMA-Post Bac PreMed Assoc 212-843-3701 Ext 2800 Proxies Larry Stephen Burke NCSA Net Access_log Analysis Tool for Win95 MPMC-Manhattan Premed Council & Hunter PBPMA-Post Bac PreMed Assoc 212-843-3701 Ext 2800 RE: Should I index all ... David Levine http://info.webcrawler.com/mailing-lists/robots/index.html (30 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ● ● ● How long to cache robots.txt for? Aaron Nabil ❍ Re: How long to cache robots.txt for? Micah A. Williams ❍ Re: How long to cache robots.txt for? Jaakko Hyvatti ❍ Re: How long to cache robots.txt for? Martin Kiff ❍ Re: How long to cache robots.txt for? Greg Fenton Re: psycho at xxx.lanl.gov Rob Turk ❍ Re: psycho at xxx.lanl.gov Bonnie Scott ❍ Re: psycho at xxx.lanl.gov Scott 'Webster' Wood Re: Alta Vista getting stale? Denis McKeon ❍ Re: Alta Vista getting stale? Martin Kiff ❍ Re: Alta Vista getting stale? 
Denis McKeon ● RE: nastygram from xxx.lanl.gov Frank Wales ● RE: nastygram from xxx.lanl.gov Paul Francis ● ❍ Re: nastygram from xxx.lanl.gov Michael De La Rue ❍ Re: nastygram from xxx.lanl.gov Michael Schlindwein RE: nastygram from xxx.lanl.gov Rob Hartill ❍ Re: nastygram from xxx.lanl.gov Aaron Nabil ❍ Re: nastygram from xxx.lanl.gov Rob Hartill ❍ RE: nastygram from xxx.lanl.gov Bj\xrn-Olav Strand ● Updating Robots Ian Samson ● Re: nastygram from xxx.lanl.gov Aaron Nabil ❍ Re: nastygram from xxx.lanl.gov Rob Hartill ● Prasad Wagle: Webhackers: Java servlets and agents John D. Pritchard ● RE: nastygram from xxx.lanl.gov Istvan ● RE: nastygram from xxx.lanl.gov Tim Bray ● Re: nastygram for xxx.lanl.gov David Levine ● RE: nastygram from xxx.lanl.gov Istvan ● Re: nastygram from xxx.lanl.gov Aaron Nabil ❍ ● Re: nastygram from xxx.lanl.gov Rob Hartill dumb robots and xxx Rob Hartill ❍ Re: dumb robots and xxx Ron Wolf ❍ Re: dumb robots and xxx Randy Terbush ❍ Re: dumb robots and xxx Jim Ausman ❍ Re: dumb robots and xxx Michael Schlindwein ❍ Re: dumb robots and xxx Randy Terbush http://info.webcrawler.com/mailing-lists/robots/index.html (31 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ● ADMIN: Spoofing vs xxx.lanl.gov Martijn Koster ● RE: nastygram from xxx.lanl.gov Frank Wales ● RE: nastygram from xxx.lanl.gov Istvan ● forwarded e-mail Paul Ginsparg 505-667-7353 ● the POST myth... a web admin's opinions.. Rob Hartill ❍ Re: the POST myth... a web admin's opinions.. Benjamin Franz ❍ Re: the POST myth... a web admin's opinions.. Rob Hartill ❍ Re: the POST myth... a web admin's opinions.. Benjamin Franz ❍ Re: the POST myth... a web admin's opinions.. Bonnie Scott ❍ Re: the POST myth... a web admin's opinions.. Benjamin Franz ❍ Re: the POST myth... a web admin's opinions.. Rob Hartill ❍ Re: the POST myth... a web admin's opinions.. Istvan ● xxx.lanl.gov - The thread continues.... Chris Crowther ● xxx.lanl.gov a real threat? John Lammers ● Re: Alta Vista getting stale? Nick Arnett ● crawling accident Rafhael Cedeno ● xxx.lanl.gov/robots.txt Chris Crowther ● Stupid robots cache DNS and not IMS Roy T. Fielding ● Suggestion to help robots and sites coexist a little better Rob Hartill ❍ Java intelligent agents and compliance? Scott 'Webster' Wood ❍ Re: Java intelligent agents and compliance? John D. Pritchard ❍ Re: Java intelligent agents and compliance? 
Shiraz Siddiqui ❍ Re: Suggestion to help robots and sites coexist a little better Benjamin Franz ❍ Re: Suggestion to help robots and sites coexist a little better Rob Hartill ❍ Suggestion to help robots and sites coexist a little better Skip Montanaro ❍ Re: Suggestion to help robots and sites coexist a little better Dirk.vanGulik ❍ Re: Suggestion to help robots and sites coexist a little better Martijn Koster ❍ Re: Suggestion to help robots and sites coexist a little better Randy Terbush ❍ Re: Suggestion to help robots and sites coexist a little better Rob Hartill ❍ Re: Suggestion to help robots and sites coexist a little better Nick Arnett ❍ Re: Suggestion to help robots and sites coexist a little better Scott 'Webster' Wood ❍ Re: Suggestion to help robots and sites coexist a little better Rob Hartill ❍ Re: Suggestion to help robots and sites coexist a little better Nick Arnett ❍ Re: Suggestion to help robots and sites coexist a little better Rob Hartill ❍ Re: Suggestion to help robots and sites coexist a little better Nick Arnett http://info.webcrawler.com/mailing-lists/robots/index.html (32 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ● ❍ Re: Suggestion to help robots and sites coexist a little better Scott 'Webster' Wood ❍ Re: Suggestion to help robots and sites coexist a little better Jaakko Hyvatti ❍ Re: Suggestion to help robots and sites coexist a little better Rob Hartill ❍ Re: Suggestion to help robots and sites coexist a little better Brian Clark ❍ Re: Suggestion to help robots and sites coexist a little better Nick Arnett ❍ Re: Suggestion to help robots and sites coexist a little better Martijn Koster ❍ Apology -- I didn't mean to send that last message to the list Bonnie Scott ❍ Re: Suggestion to help robots and sites coexist a little better Rob Hartill ❍ Re: Suggestion to help robots and sites coexist a little better Nick Arnett ❍ Re: Suggestion to help robots and sites coexist a little better Rob Hartill PS Rob Hartill ❍ Re: PS Benjamin Franz ❍ Re: PS Brian Clark ❍ Re: PS Rob Hartill ❍ About the question what a robots is (was Re: PS) Michael Schlindwein ❍ Re: About the question what a robots is (was Re: PS) Rob Turk ❍ Re: About the question what a robots is (was Re: PS) Michael Schlindwein ● Linux and Robot development... root ● Re: Suggestion to help robots and sites coexist a little better Mark J Cox ● interactive generation of URL's Fred Melssen ● ❍ Re: interactive generation of URL's Benjamin Franz ❍ Re: interactive generation of URL's Chris Crowther Newbie question Dan Hurwitz ❍ Re: Newbie question David Eichmann ❍ Re: Newbie question Danny Sullivan ● Q: meta name="robots" content="noindex" ? John Bro, InterSoft Solutions, Inc ● Possible MSIIS bug? Jakob Faarvang ● HOST: header Rob Hartill ● ● ❍ Re: HOST: header Scott 'Webster' Wood ❍ Re: HOST: header Rob Hartill Robot Research Kosta Tombras ❍ Re: Robot Research David Eichmann ❍ Re: Robot Research Steve Nisbet ❍ Re: Robot Research Paul Francis Safe Methods Istvan http://info.webcrawler.com/mailing-lists/robots/index.html (33 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ❍ Re: Safe Methods Garth T Kidd ❍ Re: Safe Methods Rob Hartill ❍ Re: Safe Methods Benjamin Franz ❍ Re: Safe Methods Rob Hartill ❍ Re: Safe Methods Benjamin Franz ❍ Re: Safe Methods Randy Terbush ❍ Re: Safe Methods Rob Turk ❍ Re: Safe Methods Rob Hartill ● robots source code in C Ethan Lee ● Re: How long to cache robots.txt Daniel T. 
Martin ● *Help: Writing BOTS* Shiraz Siddiqui ● Re: How long to cache robots.txt Mike Agostino ● Search Engine article Danny Sullivan ❍ Re: Search Engine article James ● Unusual request - sorry! David Eagles ● Re: How long to cache robots.txt Daniel T. Martin ● Last message David Eagles ● Re: Social Responsibilities (was Safe Methods) David Eichmann ● AltaVista's Index is obsolete Ian Samson ❍ Re: AltaVista's Index is obsolete Patrick Lee ❍ Re: AltaVista's Index is obsolete Wiebe Weikamp ● RE: AltaVista's Index is obsolete; but what about the others Ted Sullivan ● RE: AltaVista's Index is obsolete; but what about the other Ian Samson ● Re: AltaVista's Index is obsolete; but what about the other Daniel T. Martin ● hoohoo.cac.washington = bad Andy Warner ❍ Re: hoohoo.cac.washington = bad David Reilly ❍ Re: hoohoo.cac.washington = bad Betsy Dunphy ❍ Re: hoohoo.cac.washington = bad Erik Selberg ❍ Re: hoohoo.cac.washington = bad Patrick Lee ❍ Re: hoohoo.cac.washington = bad Jeremy.Ellman ● PHP stops robots [email protected] ● netscape spec for RDM Jim Ausman ● ❍ Re: netscape spec for RDM Nick Arnett ❍ Re: netscape spec for RDM David Reilly Stop 'bots using apache, etc. or php? [email protected] http://info.webcrawler.com/mailing-lists/robots/index.html (34 of 61) [18.02.2001 13:19:26] Robots Mailing List Archive by thread ● ● ● Anyone know who owns this one? Betsy Dunphy ❍ Re: Anyone know who owns this one? Benjamin Franz ❍ Re: Anyone know who owns this one? Rob Turk ❍ Re: Anyone know who owns this one? Rob Hartill ❍ Re: Anyone know who owns this one? David Reilly ❍ Robo-phopbic Mailing list Shiraz Siddiqui ❍ Re: Anyone know who owns this one? Captain Napalm ❍ Re: Anyone know who owns this one? pinson Robot Gripes forum? (Was: Anyone know who owns this one?) Betsy Dunphy ❍ Re: Robot Gripes forum? (Was: Anyone know who owns this one?) Rob Turk ❍ Re: Robot Gripes forum? (Was: Anyone know who owns this one?) Bruce Rhodewalt ❍ Re: Robot Gripes forum? (Was: Anyone know who owns this one?) Issac Roth RE: Anyone know who owns this one? Martin.Soukup ❍ ● ● I vote NO (Was: Robot Gripes forum?) Nick Arnett ❍ Re: I vote NO (Was: Robot Gripes forum?) Rob Hartill ❍ Re: I vote NO (Was: Robot Gripes forum?) Nick Arnett ❍ Re: I vote NO (Was: Robot Gripes forum?) Rob Hartill Re: I vote NO (Was: Robot Gripes forum?) - I vote YES Betsy Dunphy ❍ ● ● Re: Anyone know who owns this one? Captain Napalm Re: I vote NO (Was: Robot Gripes forum?) - I vote YES Erik Selberg What is your favorite search engine - a survey Bhupinder S. Sran ❍ Re: What is your favorite search engine - a survey Sanna ❍ Re: What is your favorite search engine - a survey [email protected] robots on an intranet Adam Gaffin ❍ Re: robots on an intranet Tim Bray ❍ Re: robots on an intranet Tim Bray ❍ Re: robots on an intranet (replies to list...) Michael De La Rue ❍ Re: robots on an intranet Peter Small ❍ Re: robots on an intranet [email protected] ❍ Re: robots on an intranet Jane Doyle ❍ (no subject) [email protected] ❍ Re: robots on an intranet Ulla Sandberg ❍ Re: robots on an intranet Abderrezak Kamel ● grammar engines Ross A. Finlayson ● Re[2]: robots on an intranet Stacy Cannady http://info.webcrawler.com/mailing-lists/robots/index.html (35 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ● Add This Search Engine to your Results. Thomas Bedell ❍ ● Re: Add This Search Engine to your Results. James Search Engine Tutorial for Web Developers Edward Stangler ❍ How can IR Agents be evaluate ? 
Giacomo Fiorentini ● RE: How can IR Agents be evaluate ? Jim Harris ● RE: How can IR Agents be evaluate ? Dan Quigley ● Project Aristotle(sm) Gerry McKiernan ● Re: robots on an intranet (replies to list...) Martijn Koster ❍ ● Re: robots on an intranet (replies to list...) Reinier Post RE: Search Engine System Admin ❍ RE: Search Engine Marc Langheinrich ❍ RE: Search Engine WEBsmith Editor ❍ Re: Search Engine Harry Munir Behrens ● Re: robots on an intranet (replies to list...) Operator ● Re: Search Engine Operator ❍ Re: Search Engine siddiqui athar shiraz ● Re: Search Engine Stimpy ● Lynx. The one true browser. Ian McKellar ● email spider Richard A. Paris ❍ Re: email spider Scott 'Webster' Wood ❍ Re: email spider Jerry Walsh ● RE: email spider Jim Harris ● RE: How can IR Agents be evaluate ? Nick Arnett ● Re: Search Engine Nick Arnett ● Offline Agents for UNIX A. Shiraz Siddiqui ❍ ● Re: Offline Agents for UNIX Ian McKellar mini-robot Chiaki Ohta ❍ Re: mini-robot Joao Moreira ● Webfetch Edward Ooi ● Library Agents(sm): Library Applications of Intelligent Software Agents Gerry McKiernan Mail robot? Christopher J. Farrell IV ● ❍ ● Re: Mail robot? [email protected] depth first vs breadth first Robert Nicholson ❍ Re: depth first vs breadth first [email protected] ❍ Re: depth first vs breadth first David Eichmann http://info.webcrawler.com/mailing-lists/robots/index.html (36 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ❍ ● ● Quakebots Ross A. Finlayson ❍ Re: Quakebots Rob Torchon ❍ Re: Quakebots Ross A. Finlayson articles or URL's on search engines Werner Schweibenz, Mizzou ❍ ● ● ● Re: depth first vs breadth first David Eichmann Re: articles or URL's on search engines Peter Small HEAD Anna Torti ❍ Re: HEAD Daniel Lo ❍ Re: HEAD Captain Napalm ❍ Re: HEAD Captain Napalm The Internet Archive robot Mike Burner ❍ Re: The Internet Archive robot Tronche Ch. le pitre ❍ Re: The Internet Archive robot Fred Douglis The Internet Archive robot [email protected] ❍ Re: The Internet Archive robot Brian Clark ❍ Re: The Internet Archive robot Robert B. Turk ❍ Re: The Internet Archive robot Alex Strasheim ❍ Re: The Internet Archive robot Marilyn R Wulfekuhler ❍ Re: The Internet Archive robot Richard Gaskin - Fourth World ❍ Re: The Internet Archive robot Richard Gaskin - Fourth World ❍ Re: The Internet Archive robot Eric Kristoff ❍ Re: The Internet Archive robot Jeremy Sigmon ❍ Re: The Internet Archive robot Michael G=?iso-8859-1?Q?=F6ckel ❍ Re: The Internet Archive robot Eric Kristoff ❍ Re: The Internet Archive robot Gareth R White ❍ Re: The Internet Archive robot Jeremy Sigmon ❍ Re: The Internet Archive robot Todd Markle ❍ Re: The Internet Archive robot Jeremy Sigmon ❍ robots.txt buffer question. Jeremy Sigmon ❍ Re: The Internet Archive robot Martijn Koster ❍ Re: The Internet Archive robot Z Smith ❍ Re: The Internet Archive robot Rob Hartill ❍ Re: The Internet Archive robot Z Smith ● (no subject) Stacy Cannady ● "hidden text" vs. META tags for robots/search engines Todd Sellers http://info.webcrawler.com/mailing-lists/robots/index.html (37 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ● ❍ Re: "hidden text" vs. META tags for robots/search engines Can Ozturan ❍ Re: "hidden text" vs. META tags for robots/search engines Martijn Koster ❍ Re: "hidden text" vs. META tags for robots/search engines Martin Kiff ❍ Re: "hidden text" vs. META tags for robots/search engines Chad Zimmerman ❍ Re: "hidden text" vs. META tags for robots/search engines Davide Musella ❍ Re: "hidden text" vs. 
META tags for robots/search engines Martin Kiff Looking for subcontracting spider-programmers Richard Rossi ❍ Re: Looking for subcontracting spider-programmers Bob Worthy ❍ Re: Looking for subcontracting spider-programmers Hani Yakan ❍ tryme Marty Landman ❍ Re: Looking for subcontracting spider-programmers Kevin Hoogheem ● RE: The Internet Archive robot Ted Sullivan ● (no subject) Robert Stober ● RE: The Internet Archive robot (fwd) Brewster Kahle ❍ ● ● Re: The Internet Archive robot (fwd) Fred Douglis pointers for a novice? Alex Strasheim ❍ Re: pointers for a novice? Kevin Hoogheem ❍ Re: pointers for a novice? Elias Hatzis Extracting info from SIG forum archives Peter Small ❍ Re: Extracting info from SIG forum archives Denis McKeon ● Conceptbot spider David L. Sifry ● Re: The Internet Archive robot (fwd) Robert B. Turk ● Re: The Internet Archive robot Phil Hochstetler ● Copyrights (was Re: The Internet Archive robot) Brian Clark ● ❍ Re: Copyrights (was Re: The Internet Archive robot) Tim Bray ❍ Re: Copyrights (was Re: The Internet Archive robot) Robert B. Turk ❍ Re: Copyrights (was Re: The Internet Archive robot) Brian Clark ❍ Re: Copyrights (was Re: The Internet Archive robot) Robert B. Turk ❍ Re: Copyrights (was Re: The Internet Archive robot) Brian Clark Copyrights on the web Charlie Brown ❍ Re: Copyrights on the web Benjamin Franz ❍ Re: Copyrights on the web [email protected] ❍ Re: Copyrights on the web Chad Zimmerman ❍ Re: Copyrights on the web Denis McKeon ❍ Re: Copyrights on the web Richard Gaskin - Fourth World http://info.webcrawler.com/mailing-lists/robots/index.html (38 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ❍ Re: Copyrights on the web Eric Kristoff ● RE: Copyrights on the web Ted Sullivan ● More ways to spam search engines? G. Edward Johnson ❍ Re: More ways to spam search engines? Andrew Leonard ● RE: copyright, etc. Richard Gaskin - Fourth World ● Copyrights, let them be ! Joao Moreira ● ❍ Re: Copyrights, let them be ! Denis McKeon ❍ Re: Copyrights, let them be ! Brian Clark ❍ Re: Copyrights, let them be ! Richard Gaskin - Fourth World ❍ Re: Copyrights, let them be ! John D. Pritchard ❍ (Fwd) Re: Copyrights, let them be ! Robert Raisch, The Internet Company ❍ Re: (Fwd) Re: Copyrights, let them be ! Richard Gaskin - Fourth World crawling FTP sites Greg Fenton ❍ Re: crawling FTP sites Jaakko Hyvatti ❍ Re: crawling FTP sites James Black ● RE: Copyrights, let them be ! William Dan Terry ● RE: crawling FTP sites William Dan Terry ● Re: The Internet Archive robot (fwd) Brewster Kahle ● RE: crawling FTP sites William Dan Terry ● Re: Public Access Nodes / Copywrited Nodes Ross A. Finlayson ● FAQ? Frank Smadja ● RE: The Internet Archive robot David Levine ❍ RE: The Internet Archive robot Denis McKeon ❍ RE: The Internet Archive robot Sigfrid Lundberg ❍ Re: The Internet Archive robot Fred Douglis ● RE: Copyrights on the web Bryan Cort ● RE: The Internet Archive robot William Dan Terry ● RE: The Internet Archive robot David Levine ❍ RE: The Internet Archive robot Denis McKeon ● InfoSpiders/0.1 Filippo Menczer ● info for newbie Filippo Menczer ❍ MOMSpider problem. Broken Pipe Jeremy Sigmon ● Bad agent...A *very* bad agent. Benjamin Franz ● RE: The Internet Archive robot Nick Arnett ● A few copyright notes Nick Arnett http://info.webcrawler.com/mailing-lists/robots/index.html (39 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ● Topic drift (archive robot, copyright...) Nick Arnett ❍ Re: Topic drift (archive robot, copyright...) Ross A. 
Finlayson ❍ Re: Topic drift (archive robot, copyright...) Jeremy Sigmon ● Netscape Catalog Server: An Eval Eric Kristoff ● file retrieval Eddie Rojas ❍ ● Re: file retrieval Martijn Koster Preferred access time Martin.Soukup ❍ Re: Preferred access time John D. Pritchard ● White House/PARC "Leveraging Cyberspace" Nick Arnett ● RE: Preferred access time Martin.Soukup ● robot to get specific info only? Martin.Soukup ❍ re: robot to get specific info only? Steve Leibman ● Unregistered MIME types? Nick Arnett ● A bad agent? Nick Dearnaley ❍ Re: A bad agent? Rob Hartill ❍ Re: A bad agent? Nick Dearnaley ❍ Re: A bad agent? Rob Hartill ● Use of robots.txt to "check status"? Ed Costello ● Robot exclustion for for non-'unix file' hierarchy Hallvard B Furuseth ❍ Re: Robot exclustion for for non-'unix file' hierarchy Martijn Koster ● How to get listed #1 on all search engines (fwd) Bonnie ● RE: How to get listed #1 on all search engines (fwd) Ted Sullivan ● Cannot believe it "Morons" Cafe ● Possible robot? Chad Zimmerman ● Bye Bye HyperText: The End of the World (Wide Web) As We Know It! Gerry McKiernan ● Re: Bye Bye HyperText: The End of the World (Wide Web) As We Know It! Michael De La Rue Image Maps Harold Gibbs ❍ ❍ Re: Image Maps Martin Kiff ● Bug in LibWWW perl + Data::Dumper (libwwwperl refs are strange) Michael De La Rue ● do robots send HTTP_HOST? Joe Pruett ❍ ● ● Re: do robots send HTTP_HOST? Aaron Nabil The End of The World (Wide Web) / Part II Gerry McKiernan ❍ Re: The End of The World (Wide Web) / Part II Brian Clark ❍ Re: The End of The World (Wide Web) / Part II Nick Dearnaley Seeing is Believing: Candidate Web Resources for Information Visualization Gerry http://info.webcrawler.com/mailing-lists/robots/index.html (40 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ● McKiernan CitedSites(sm): Citation Indexing of Web Resources Gerry McKiernan ● Netscape Catalog Server Eric Kristoff ● Netscape Catalog Server Eric Kristoff ● Topic-specific robots Fred K. Lenherr ❍ ● ● ● Re: Topic-specific robots Nick Dearnaley robots.txt syntax Fred K. Lenherr ❍ Re: robots.txt syntax John D. Pritchard ❍ Re: robots.txt syntax Martijn Koster ❍ Re: robots.txt syntax Martijn Koster ❍ Re: robots.txt syntax Captain Napalm ❍ Re: robots.txt syntax John D. Pritchard ❍ Re: robots.txt syntax Captain Napalm ❍ Re: robots.txt syntax John D. Pritchard ❍ Re: robots.txt syntax Captain Napalm ❍ Re: robots.txt syntax John D. Pritchard Another rating scam! (And a proposal on how to fix it) Aaron Nabil ❍ Re: Another rating scam! (And a proposal on how to fix it) Martijn Koster ❍ Re: Another rating scam! (And a proposal on how to fix it) Paul Francis ❍ Re: Another rating scam! (And a proposal on how to fix it) Paul Francis ❍ Re: Another rating scam! (And a proposal on how to fix it) Aaron Nabil ❍ Re: Another rating scam! (And a proposal on how to fix it) Benjamin Franz META tag standards, search accuracy Nick Arnett ❍ Re: META tag standards, search accuracy Robert B. 
Turk ❍ Re: META tag standards, search accuracy Nick Arnett ❍ Re: META tag standards, search accuracy Nick Arnett ❍ Re: META tag standards, search accuracy Nick Arnett ❍ Re: META tag standards, search accuracy Nick Arnett ❍ Re: META tag standards, search accuracy Eric Miller ● ADMIN (was Re: hypermail archive not operational Martijn Koster ● ActiveAgent HipCrime ❍ Re: ActiveAgent Benjamin Franz ● non 2nn repsonses on robots.txt Aaron Nabil ● servers that don't return a 404 for "not found" Aaron Nabil ● Re: servers that don't return a 404 for "not found" Aaron Nabil ● Re: servers that don't return a 404 for "not found" Aaron Nabil http://info.webcrawler.com/mailing-lists/robots/index.html (41 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ● Re: Another rating scam! (And a proposal on how to fix it Nick Dearnaley ● Returned mail: Host unknown (Name server: webcrawler: host not found) Mail Delivery Subsystem Re: META tag standards, search accuracy Eric Miller ● ❍ Re: META tag standards, search accuracy Benjamin Franz ❍ Re: META tag standards, search accuracy Benjamin Franz ● ActiveAgent Larry Steinberg ● Re: ActiveAgent Benjamin Franz ● Re: META tag standards, search accuracy Eric Miller ● ● Returned mail: Host unknown (Name server: webcrawler: host not found) Mail Delivery Subsystem RE: ActiveAgent and E-Mail Spam Bryan Cromartie ● Re: META tag standards, search accuracy Eric Miller ● Re: ActiveAgent William Neuhauser ● agents ignoring robots.txt Rob Hartill ❍ Re: agents ignoring robots.txt Erik Selberg ❍ Re: agents ignoring robots.txt Captain Napalm ❍ Re: agents ignoring robots.txt Erik Selberg ❍ Re: agents ignoring robots.txt John D. Pritchard ● Re: agents ignoring robots.txt Rob Hartill ● Disallow/Allow by Action (Re: robots.txt syntax) Brian Clark ❍ Re: Disallow/Allow by Action (Re: robots.txt syntax) Nick Dearnaley ● CyberPromo shut down at last!!! Richard Gaskin - Fourth World ● infoseek Fred K. Lenherr ● Comparing robots/search sites Fred K. Lenherr ● Re: infoseek Rob Hartill ● McKinley -- 100% error rate Jennifer C. O'Brien ● Re: robots.txt buffer question. Brent Boghosian ● What to rate limit/lock on, name or IP address? Aaron Nabil ❍ Re: What to rate limit/lock on, name or IP address? [email protected] ● RE: What to rate limit/lock on, name or IP address? Greg Fenton ● RE: What to rate limit/lock on, name or IP address? Brent Boghosian ● Filtering queries on a robot-built database Fred K. Lenherr ● sockets in PERL HipCrime ● Re: sockets in PERL Otis Gospodnetic ● Re: Showbiz search engine Showbiz Information http://info.webcrawler.com/mailing-lists/robots/index.html (42 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ● Is a robot visiting? Tim Freeman ❍ Re: Is a robot visiting? Klaus Johannes Rusch ❍ Re: Is a robot visiting? Daniel T. Martin ❍ Re: Is a robot visiting? Aaron Nabil ❍ Re: Is a robot visiting? Tim Freeman ❍ Re: Is a robot visiting? Tim Freeman ❍ Re: Is a robot visiting? David Steele ❍ Re: Is a robot visiting? Tim Freeman ❍ Re: Is a robot visiting? Aaron Nabil ❍ Re: Is a robot visiting? Klaus Johannes Rusch ❍ Re: Is a robot visiting? Hallvard B Furuseth ❍ Re: Is a robot visiting? Hallvard B Furuseth ● RE: Is a robot visiting? Greg Fenton ● Tim Freeman Aaron Nabil ● Thanks! Tim Freeman ● Possible robots.txt addition Ian Graham ❍ Re: Possible robots.txt addition Issac Roth ❍ Re: Possible robots.txt addition Francois Rouaix ❍ Re: Possible robots.txt addition John D. Pritchard ❍ Re: Possible robots.txt addition John D. 
Pritchard ❍ Re: Possible robots.txt addition Martijn Koster ● RE: Is a robot visiting? Hallvard B Furuseth ● download robot LIAM GUINANE ● A new robot -- ask for advice Hrvoje Niksic ● Re: Possible robots.txt addition Martin Kiff ● technical descripton [D [D [D LIAM GUINANE ❍ ● ● Re: technical descripton [D [D [D P. Senthil Domains and HTTP_HOST Brian Clark ❍ Re: Domains and HTTP_HOST Ian Graham ❍ Re: Domains and HTTP_HOST DECLAN FITZPATRICK ❍ Re: Domains and HTTP_HOST Klaus Johannes Rusch ❍ Re: Domains and HTTP_HOST Brian Clark ❍ Re: Domains and HTTP_HOST Klaus Johannes Rusch ❍ Re: Domains and HTTP_HOST John D. Pritchard Source code Stephane Vaillancourt http://info.webcrawler.com/mailing-lists/robots/index.html (43 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ● Back of the envelope computations Francois Rouaix ❍ Re: Back of the envelope computations John D. Pritchard ❍ Re: Back of the envelope computations Sigfrid Lundberg ❍ Re: Possible robots.txt addition Ian Graham ● Re: Possible robots.txt addition Ian Graham ● Re: Possible robots.txt addition Ian Graham ● Possible robots.txt addition - did I say that? Martin Kiff ● Re: Possible robots.txt addition Klaus Johannes Rusch ● Re: Possible robots.txt addition (fwd) Ian Graham ● Belated notice of spider article Adam Gaffin ● Re: Possible robots.txt addition Klaus Johannes Rusch ● ActiveAgent HipCrime ● Matching the user-agent in /robots.txt Hrvoje Niksic ❍ Re: Domains and HTTP_HOST Benjamin Franz ● Re: ActiveAgent Aaron Nabil ● Re: ActiveAgent [email protected] ● Re: ActiveAgent Rob Hartill ● We need robot information Juan ● anti-robot regexps Hallvard B Furuseth ● anti-robot regexps Hallvard B Furuseth ● An extended verion of the robot exclusion standard Captain Napalm ❍ ● Re: An extended verion of the robot exclusion standard Hrvoje Niksic Re: An extended version of the Robots... Martijn Koster ❍ Re: An extended version of the Robots... Captain Napalm ❍ Re: An extended version of the Robots... Hrvoje Niksic ● Re: Domains and HTTP_HOST Klaus Johannes Rusch ● robot algorithm ? Otis Gospodnetic ● itelligent agents Wanca, Vincent ❍ ● ● Re: itelligent agents Hrvoje Niksic Get official! Hallvard B Furuseth ❍ Re: Get official! DA ❍ Re: Get official! Klaus Johannes Rusch ❍ Re: Get official! Hallvard B Furuseth Regexps (Was: Re: An extended version of the Robots...) Hallvard B Furuseth ❍ Regexps (Was: Re: An extended version of the Robots...) Skip Montanaro http://info.webcrawler.com/mailing-lists/robots/index.html (44 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ● Regexps (Was: Re: An extended version of the Robots...) Hallvard B Furuseth ● Is their a web site... [email protected] ❍ Re: Is their a web site... DA ● Re: Regexps (Was: Re: An extended version of the Robots...) Martijn Koster ● Re: An extended verion of the robot exclusion standard Captain Napalm ● Re: An extended version of the Robots... Captain Napalm ❍ ● An updated extended standard for robots.txt Captain Napalm ❍ ● ● Re: An extended version of the Robots... Hrvoje Niksic Re: An updated extended standard for robots.txt Art Matheny Notification protocol? Nick Arnett ❍ Re: Notification protocol? Fred K. Lenherr ❍ Re: Notification protocol? Ted Hardie ❍ Re: Notification protocol? John D. Pritchard ❍ Re: Notification protocol? Tony Barry ❍ Re: Notification protocol? Mike Schwartz ❍ Re: Notification protocol? Sankar Virdhagriswaran ❍ Re: Notification protocol? John D. Pritchard ❍ Re: Notification protocol? 
Peter Jurg Re: An extended version of the Robots... Hallvard B Furuseth ❍ Re: An extended version of the Robots... Art Matheny ● Re: An extended version of the Robots... (fwd) Vu Quoc HUNG ● changes to robots.txt Rob Hartill ● ❍ Re: changes to robots.txt DA ❍ Re: changes to robots.txt Rob Hartill ❍ Re: changes to robots.txt DA ❍ Re: changes to robots.txt Klaus Johannes Rusch ❍ Re: changes to robots.txt Steve DeJarnett ❍ Re: changes to robots.txt Rob Hartill Re: An updated extended standard for robots.txt Captain Napalm ❍ Re: An updated extended standard for robots.txt Art Matheny ● Re: An extended version of the Robots... Martijn Koster ● UN/LINK protocol is standardized! wasn't that quick! John D. Pritchard ● Re: UN/LINK protocol is standardized! wasn't that quick! John D. Pritchard ● [ROBOTS JJA] Lib-WWW-perl5 Juan Jose Amor ● Admitting the obvious I think therefore I spam http://info.webcrawler.com/mailing-lists/robots/index.html (45 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ❍ ● Re: Admitting the obvious Richard Gaskin how do they do that? Otis Gospodnetic ❍ Re: how do they do that? Martijn Koster ● databases for spiders Elias Hatzis ● databases for spiders Elias Hatzis ❍ Re: databases for spiders John D. Pritchard ❍ Re: databases for spiders Nick Arnett ❍ Re: databases for spiders DA ❍ Re: databases for spiders Nick Arnett ● Re: databases for spiders [email protected] ● indexing intranet-site Martin Paff ❍ Re: indexing intranet-site Nick Arnett ● Re: An extended version of the Robots... Hallvard B Furuseth ● Re: An extended version of the Robots... Hallvard B Furuseth ● RE: databases for spiders Larry Fitzpatrick ● RE: databases for spiders Nick Arnett ● infoseeks robot is dumb Otis Gospodnetic ❍ Re: infoseeks robot is dumb Matthew K Gray ❍ Re: infoseeks robot is dumb Otis Gospodnetic ● RE: changes to robots.txt Scott Johnson ● Not so Friendly Robot - Teleport David McGrath ● ❍ Re: infoseeks robot is dumb DA ❍ Re: infoseeks robot is dumb Hrvoje Niksic ❍ Re: infoseeks robot is dumb Otis Gospodnetic RFC, draft 1 Martijn Koster ❍ Re: RFC, draft 1 Captain Napalm ❍ Re: RFC, draft 1 Denis McKeon ❍ Re: RFC, draft 1 Hrvoje Niksic ❍ Re: RFC, draft 1 Darren Hardy ❍ Re: RFC, draft 1 Darren Hardy ❍ Re: RFC, draft 1 Hallvard B Furuseth ❍ Re: RFC, draft 1 Martijn Koster ❍ Re: RFC, draft 1 Hrvoje Niksic ❍ Re: RFC, draft 1 Martijn Koster ❍ Re: RFC, draft 1 Martijn Koster http://info.webcrawler.com/mailing-lists/robots/index.html (46 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ❍ Re: RFC, draft 1 Martijn Koster ❍ Re: RFC, draft 1 Captain Napalm ❍ Re: RFC, draft 1 Klaus Johannes Rusch ❍ Re: RFC, draft 1 Martijn Koster ❍ Re: RFC, draft 1 Klaus Johannes Rusch ❍ RFC, draft 2 (was Re: RFC, draft 1 Martijn Koster ❍ Re: RFC, draft 1 Martijn Koster ❍ Re: RFC, draft 1 Hallvard B Furuseth ❍ Re: RFC, draft 1 Martijn Koster ● Re: An extended version of the Robots... Martijn Koster ● Re: infoseeks robot is dumb DA ❍ Re: infoseeks robot is dumb Hrvoje Niksic ❍ Re: infoseeks robot is dumb Otis Gospodnetic ● Getting a Reply-to: field ... Captain Napalm ● Re: Getting a Reply-to: field ... Captain Napalm ● RE: RFC, draft 1 Keiji Kanazawa ● RE: RFC, draft 1 Keiji Kanazawa ❍ Re: Notification protocol? Peter Jurg ❍ Re: Notification protocol? Erik Selberg ❍ Re: Notification protocol? Issac Roth ● Re: An extended version of the Robots... 
Darren Hardy ● Re: indexing intranet-site Nick Arnett ● Re: RFC, draft 1 Klaus Johannes Rusch ● Hipcrime no more Martijn Koster ● RE: Notification protocol? Larry Fitzpatrick ● http://HipCrime.com HipCrime ● Re: http://HipCrime.com Nick Arnett ● RE: RFC, draft 1 Martijn Koster ● Re: Virtual (was: RFC, draft 1) Klaus Johannes Rusch ● Re: Washington again !!! Erik Selberg ● Re: Washington again !!! Rob Hartill ❍ Re: Washington again !!! Erik Selberg ● SetEnv a problem Anna Torti ● User-Agent David Banes ● Broadness of Robots.txt (Re: Washington again !!!) Brian Clark http://info.webcrawler.com/mailing-lists/robots/index.html (47 of 61) [18.02.2001 13:19:27] Robots Mailing List Archive by thread ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Thaddeus O. Cooper ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Brian Clark ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Brian Clark ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Hrvoje Niksic ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Martijn Koster ❍ ❍ Agent Categories (was Re: Broadness of Robots.txt (Re: Washington again !!!) Martijn Koster Re: Agent Categories (was Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg Re: Broadness of Robots.txt (Re: Washington again !!!) Martijn Koster ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) John D. Pritchard ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) John D. Pritchard ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) John D. Pritchard ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Hallvard B Furuseth ❍ RE: Broadness of Robots.txt (Re: Washington again !!!) Martin.Soukup ❍ ● Re: Washington again !!! Martijn Koster ● Re: Washington again !!! Gregory Lauckhart ● robots.txt HipCrime ❍ Re: robots.txt David M Banes ❍ Re: robots.txt Martijn Koster ❍ Re: robots.txt David Banes ● Re: User-Agent Klaus Johannes Rusch ● Re: Broadness of Robots.txt (Re: Washington again !!!) Captain Napalm ● Re: Broadness of Robots.txt (Re: Washington again !!!) Captain Napalm ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg ● must be something in the water Rob Hartill ● who/what uses robots.txt HipCrime ❍ Re: who/what uses robots.txt Martijn Koster ❍ Re: who/what uses robots.txt Erik Selberg ❍ Re: who/what uses robots.txt HipCrime ❍ Re: who/what uses robots.txt Erik Selberg ❍ Re: who/what uses robots.txt HipCrime ❍ Re: who/what uses robots.txt Erik Selberg http://info.webcrawler.com/mailing-lists/robots/index.html (48 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● ❍ Re: who/what uses robots.txt Hrvoje Niksic ❍ Re: who/what uses robots.txt Martijn Koster ❍ Re: who/what uses robots.txt Erik Selberg ❍ Re: who/what uses robots.txt [email protected] ❍ Re: who/what uses robots.txt Terry O'Neill Re: Broadness of Robots.txt (Re: Washington again !!!) Art Matheny ❍ ● Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg Re: Broadness of Robots.txt (Re: Washington again !!!) Rob Hartill ❍ Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg ● Koen Holtman: Content negotiation draft 04 submitted John D. 
Pritchard ● Re: who/what uses robots.txt Rob Hartill ● defining "robot" HipCrime ❍ Re: defining "robot" Matthew K Gray ❍ Re: defining "robot" Hrvoje Niksic ❍ Re: defining "robot" David M Banes ❍ Re: defining "robot" Martin Kiff ❍ Re: defining "robot" Martijn Koster ❍ Re: defining "robot" Hrvoje Niksic ❍ Re: defining "robot" Martin Kiff ❍ Re: defining "robot" Erik Selberg ❍ Re: defining "robot" Martijn Koster ❍ Re: defining "robot" Erik Selberg ❍ Re: defining "robot" Rob Hartill ❍ Re: defining "robot" Erik Selberg ❍ Re: defining "robot" Brian Clark ❍ Re: defining "robot" David M Banes ❍ Re: defining "robot" David M Banes ❍ Re: USER_AGENT and Apache 1.2 Klaus Johannes Rusch ❍ Re: defining "robot" Martijn Koster ❍ Re: defining "robot" Brian Clark ● Re: defining "robot" Art Matheny ● define a page? HipCrime ❍ ● Re: define a page? Ross A. Finlayson robots.txt (A *little* off the subject) Thaddeus O. Cooper ❍ Re: robots.txt (A *little* off the subject) Erik Selberg http://info.webcrawler.com/mailing-lists/robots/index.html (49 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ❍ Re: robots.txt (A *little* off the subject) Thaddeus O. Cooper ❍ Re: NetJet Rob Hartill ● Re: define a page? Rob Hartill ● robot defined HipCrime ❍ ● Re: robot defined Kim Davies not a robot HipCrime ❍ Re: not a robot Matthew K Gray ❍ Re: not a robot Hrvoje Niksic ● robot? HipCrime ● another suggestion Rob Hartill ● Re: not a robot Rob Hartill ● ActiveAgent Rob Hartill ● robot definition Ross Finlayson ● make people use ROBOTS.txt? HipCrime ❍ Re: make people use ROBOTS.txt? Richard Levitte - VMS Whacker ❍ Re: make people use ROBOTS.txt? Kim Davies ❍ Re: make people use ROBOTS.txt? Kim Davies ❍ Re: make people use ROBOTS.txt? ❍ Re: make people use ROBOTS.txt? Erik Selberg ❍ Re: make people use ROBOTS.txt? John D. Pritchard ❍ Re: make people use ROBOTS.txt? John D. Pritchard ❍ Re: make people use ROBOTS.txt? Nick Arnett ❍ Re: make people use ROBOTS.txt? Erik Selberg ❍ Re: make people use ROBOTS.txt? John D. Pritchard ● Re: make people use ROBOTS.txt? Benjamin Franz ● Hip Crime Thomas Bedell ❍ Re: Hip Crime Otis Gospodnetic ❍ Re: Hip Crime John D. Pritchard ● another rare attack Rob Hartill ● another dumb robot (possibly) Rob Hartill ● ActiveAgent Rob Hartill ❍ Re: ActiveAgent HipCrime ❍ Re: ActiveAgent Benjamin Franz ❍ Re: ActiveAgent Issac Roth ❍ Re: robots.txt syntax Captain Napalm http://info.webcrawler.com/mailing-lists/robots/index.html (50 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● ● ❍ Re: robots.txt syntax Brent Boghosian ❍ Re: ActiveAgent Kim Davies ❍ Re: ActiveAgent John D. Pritchard ❍ Re: ActiveAgent Brian Clark ❍ Re: ActiveAgent Betsy Dunphy ❍ Re: ActiveAgent John D. Pritchard ❍ Re: ActiveAgent HipCrime ❍ Re: ActiveAgent Nick Dearnaley ❍ Re: ActiveAgent Betsy Dunphy ❍ Re: ActiveAgent Richard Levitte - VMS Whacker ❍ Re: ActiveAgent Ross A. Finlayson ❍ Re: ActiveAgent Betsy Dunphy ❍ Re: ActiveAgent Richard Gaskin ❍ Re: ActiveAgent Richard Levitte - VMS Whacker ❍ Re: ActiveAgent Fred K. Lenherr ❍ Re: ActiveAgent Randy Terbush ❍ Re: ActiveAgent Richard Gaskin - Fourth World ❍ Re: ActiveAgent John Lindroth ❍ Re: ActiveAgent Randy Terbush ❍ Re: ActiveAgent [email protected] ❍ Re: ActiveAgent [email protected] ❍ Re: ActiveAgent Richard Levitte - VMS Whacker ❍ Re: ActiveAgent Richard Gaskin - Fourth World ❍ Re: ActiveAgent Richard Levitte - VMS Whacker ❍ Re: ActiveAgent Fred K. Lenherr ❍ Re: An extended version of the Robots... 
Hallvard B Furuseth ❍ Re: An extended version of the Robots... Captain Napalm ❍ Re: ActiveAgent John D. Pritchard ❍ Re: ActiveAgent Captain Napalm Lycos' HEAD vs. GET Otis Gospodnetic ❍ Re: Lycos' HEAD vs. GET Klaus Johannes Rusch ❍ Re: Lycos' HEAD vs. GET David Banes java applet sockets John D. Pritchard ❍ ● Re: java applet sockets Art Matheny Re: spam? (fwd) Otis Gospodnetic http://info.webcrawler.com/mailing-lists/robots/index.html (51 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● Re: spam? (fwd) Otis Gospodnetic ● hipcrime Rob Hartill ❍ Re: hipcrime [email protected] ● Re: robots.txt (A *little* off the subject) Erik Selberg ● RE: make people use ROBOTS.txt? Terry Coatta ❍ Re: make people use ROBOTS.txt? Hrvoje Niksic ❍ Re: make people use ROBOTS.txt? Erik Selberg ● IROS 97 Call for Papers John D. Pritchard ● Re[2]: Lycos' HEAD vs. GET Shadrach Todd ● Re: user-agent in Java Captain Napalm ● USER_AGENT and Apache 1.2 Rob Hartill ❍ ● NetJet [email protected] ❍ ● ● Re: USER_AGENT and Apache 1.2 Martijn Koster Re: NetJet Rob Hartill Standard Joseph Whitmore ❍ Re: Standard Michael Göckel ❍ Re: Standard John D. Pritchard ❍ Re: Standard Hrvoje Niksic ❍ Re: Standard Greg Fenton ❍ Re: Standard Joseph Whitmore Servers vs Agents Davis, Ian ❍ Re: Servers vs Agents David M Banes ❍ Re: Standard? Captain Napalm ❍ Who are you robots.txt? was Re: Servers vs Agents John D. Pritchard ❍ Re: Who are you robots.txt? was Re: Servers vs Agents david jost ● unix robot LIAM GUINANE ● Re: Servers vs Agents Rob Hartill ● Re: Servers vs Agents Martin Kiff ❍ ● Re: Servers vs Agents Erik Selberg Standard? [email protected] ❍ Re: Standard? Kim Davies ❍ Re: Standard? Martin Kiff ❍ Re: Standard? Brian Clark ❍ Re: Standard? Richard Levitte - VMS Whacker ❍ Re: Standard? http://info.webcrawler.com/mailing-lists/robots/index.html (52 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ❍ Re: Standard? ❍ Re: Standard? Nigel Rantor ❍ legal equivalence of fax and email was Re: Standard? John D. Pritchard ❍ Re: legal equivalence of fax and email was Re: Standard? Rob Hartill ❍ Junkie-Mail was Re: Standard? John D. Pritchard ❍ Re: Junkie-Mail was Re: Standard? Brian Clark ❍ Re: Junkie-Mail was Re: Standard? John D. Pritchard ❍ Re: Junkie-Mail was Re: Standard? DA ❍ Re: Junkie-Mail was Re: Standard? Gary L. Burt ❍ Re: Junkie-Mail was Re: Standard? Brian Clark ❍ Re: Standard? Nick Arnett ❍ Re: Standard? John D. Pritchard ❍ Re: Standard? Davis, Ian ❍ Re: Standard? Randy Fischer ❍ Re: Standard? Richard Gaskin ❍ Re: Standard? Joseph Whitmore ❍ Re: Standard? Gary L. Burt ● Cache Filler Nigel Rantor ● an article... (was: Re: Standard?) Greg Fenton ● [...]Re: Cache Filler Benjamin Franz ● Re: Cache Filler Ian Graham ● Re: an article... (was: Re: Standard?) Ian Graham ● Re: an article... (was: Re: Standard?) Rob Hartill ● Re: Servers vs Agents Art Matheny ● Re: an article... (was: Re: Standard?) Nigel Rantor ● Re: [...]Re: Cache Filler Carlos Horowicz ● Regexp Library Cook-off Tim Bunce ● Re: Standard? Nigel Rantor ❍ Re: Standard? Erik Selberg ❍ Re: Standard? Denis McKeon ● Re: robot ? Baron Timothy de Vallee ● Lets get on task! {was Re: Standard} Wes Miller ❍ Re: Lets get on task! {was Re: Standard} Ross A. Finlayson ● What Is wwweb =?ISO-8859-1?Q?Alvaro_Mu=F1oz-Aycuens_Martinez?= ● Re: robot ? 
David Banes http://info.webcrawler.com/mailing-lists/robots/index.html (53 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● ● WebCrawler & Excite Otis Gospodnetic ❍ Re: WebCrawler & Excite Nick Arnett ❍ Re: WebCrawler & Excite Brian Pinkerton ❍ Re: WebCrawler & Excite Nick Arnett [Fwd: WebCrawler & Excite] Wes Miller ❍ Re: WebCrawler & Excite Otis Gospodnetic ● Re: Standard? Nigel Rantor ● Re: WebCrawler & Excite Otis Gospodnetic ● Please take Uninvited Email discussion elsewhere Martijn Koster ● Just when you thought it might be interesting to standardize Dave Bakin ❍ ❍ ● ● Re: Just when you thought it might be interesting to standardize robots.txt... Klaus Johannes Rusch Re: Just when you thought it might be interesting to standardize Otis Gospodnetic Server Indexing -- Helping a Robot Out Ian Graham ❍ Re: Server Indexing -- Helping a Robot Out Martijn Koster ❍ Re: Server Indexing -- Helping a Robot Out Ian Graham stingy yahoo server? Mark Norman ❍ Re: stingy yahoo server? Klaus Johannes Rusch ❍ Re: stingy yahoo server? Klaus Johannes Rusch ❍ Re: stingy yahoo server? Dan Gildor ❍ Re: stingy yahoo server? Klaus Johannes Rusch ● IIS and If-modified-since [was Re: stingy yahoo server?] Dan Gildor ● Crawlers and "dynamic" urls David Koblas ❍ Re: Crawlers and "dynamic" urls Martijn Koster ❍ Re: Crawlers and "dynamic" urls [email protected] ❍ Re: Crawlers and "dynamic" urls [email protected] ● The Big Picture(sm): Visual Browsing in Web and non-Web Databases Gerry McKiernan ● Re: Crawlers and "dynamic" urls Klaus Johannes Rusch ❍ Re: Crawlers and "dynamic" urls Ian Graham ● Re: IIS and If-modified-since Lee Fisher ● Merry Christmas, spidie-boyz&bottie-girlz! Santa Claus ● Merry Christmas, HipXmas-SantaSpam! Santa Claus ● USER_AGENT spoofing Rob Hartill ● Web pages being served from an SQL database Eric Mackie ❍ Re: Web pages being served from an SQL database Randy Fischer ❍ Re: Web pages being served from an SQL database Brian Clark http://info.webcrawler.com/mailing-lists/robots/index.html (54 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● Re: Web pages being served from an SQL database Sigfrid Lundberg ● RE: Web pages being served from an SQL database Martin.Soukup ● Netscape-Catalog-Robot Rob Hartill ❍ ● Re: Netscape-Catalog-Robot Nick Arnett It's not only robots we have to worry about ... Captain Napalm ❍ Re: It's not only robots we have to worry about ... Rob Hartill ❍ Re: It's not only robots we have to worry about ... Simon Powell ● Re: It's not only robots we have to worry about ... Rob Hartill ● Re: Remember Canseco..... [email protected] ● Re: Remember Canseco..... Lilian Bartholo ● Re: RE: Scalpers (SJPD does crack down) Tim Simmons ● FS- Sharks Tkts- Front Row (2nd Deck) Jan 13 Howard Strachman ● FS-Jan 7 Row 1 Sec 211 Anne Greene ● For Sale for 12/26! Reynolds, Cathy A ● Game tonight Karmy T. Kays ❍ Re: Game tonight ● Good HREFs vs Bogus HREFs: 80/20 mike mulligan ● Returned mail: User unknown Mail Delivery Subsystem ● We need to Shut down Roenick [email protected] ● RE: shot clock?!.... Mark Fullerton ● Forsberg Laura M. Sebastian ● Quick--who knows listproc? 
Bonnie Scott ● Error Condition Re: Invalid request [email protected] ● Cyclones sign MacLeod [email protected] ● Error Condition Re: Invalid request [email protected] ● Re: WRITERS WANTED (re-post) Greg Tanzola ● ADMIN: mailing list attack :-( Martijn Koster ❍ Re: ADMIN: mailing list attack :-( Martijn Koster ● RE: Netscape-Catalog-Robot Ian King ● Referencing dynamic pages Thomas Merlin ● Re: help Martijn Koster ● Excite Authors? Randy Terbush ● Do robots have to follow links ? Thomas Merlin ❍ ● Re: Do robots have to follow links ? Theo Van Dinter Frames ? Lycos ? Thomas Merlin http://info.webcrawler.com/mailing-lists/robots/index.html (55 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● ❍ Re: Frames ? Lycos ? Mitch Allen ❍ Re: Frames ? Lycos ? Mitch Allen Do robots have to follow links ? Ross A. Finlayson ❍ Re: Do robots have to follow links ? Nick Arnett ● Re: Frames ? Lycos ? Theo Van Dinter ● Lycos Thomas Merlin ❍ ● ● Re: Lycos Klaus Johannes Rusch Meta refresh tags John Heard ❍ Re: Meta refresh tags Martijn Koster ❍ Re: Meta refresh tags Julian Smith Cron <robh@us2> /usr/home/robh/show_robots (fwd) Rob Hartill ❍ Re: Cron <robh@us2> /usr/home/robh/show_robots (fwd) [email protected] ● Re: Cron <robh@us2> /usr/home/robh/show_robots (fwd) Sigfrid Lundberg ● email address grabber Dorian Ellis ❍ Re: email address grabber Robert Raisch, The Internet Company ❍ Re: email address grabber Ken nakagama ❍ Re: email address grabber Wes Miller ● Re: email address grabber Art Matheny ● Re: email address grabber Klaus Johannes Rusch ● Re: email address grabber Jeff Drost ● Re: email address grabber Issac Roth ● robot source code v.sreekanth ❍ Re: robot source code =?ISO-8859-1?Q?Alvaro_Mu=F1oz-Aycuens_Martinez?= ❍ Re: robot source code Dan Howard ● Meta Tag Article Mitch Allen ● Re: email address grabber Jeff Drost ● re Email Grabber Steve Nisbet ● Re: email address grabber Captain Napalm ● Re: email address grabber Captain Napalm ● Re: email address grabber Art Matheny ● re: email grabber Rich Dorfman ● re: email grabber Richard Gaskin ● Re: email grabber Chris Brown ● SpamBots Mitch Allen ❍ Re: SpamBots Richard Gaskin http://info.webcrawler.com/mailing-lists/robots/index.html (56 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ❍ Re: SpamBots Mitch Allen ● an image observer Martin Paff ● Re: email grabber Andy Rollins ❍ More Robot Talk (was Re: email grabber) Captain Napalm ❍ Re: More Robot Talk (was Re: email grabber) [email protected] ● Re: an image observer David Steele ● Re: email grabber Wendell B. Kozak ● Re[2]: SpamBots Brad Fox ● Re: email grabber Robert Raisch, The Internet Company ● Re[3]: SpamBots ● Re: More Robot Talk (was Re: email grabber) Captain Napalm ● RE: email grabber Ian King ● Re: More Robot Talk Nick Arnett ● Too Many Admins (TMA) !!! HipCrime ● Re: More Robot Talk Theo Van Dinter ● Re: More Robot Talk [email protected] ● escaped vs unescaped urls Dan Gildor ● ● ● ❍ Re: escaped vs unescaped urls =?ISO-8859-1?Q?Jaakko_Hyv=E4tti?= ❍ Re: escaped vs unescaped urls =?ISO-8859-1?Q?Jaakko_Hyv=E4tti?= ❍ Re: Q: size of the web in bytes, comprehensive list DA Q: size of the web in bytes, comprehensive list Yossi Cohen ❍ Re: Q: size of the web in bytes, comprehensive list Thomas R. 
Bedell ❍ Re: Q: size of the web in bytes, comprehensive list Otis Gospodnetic ❍ Re: Q: size of the web in bytes, comprehensive list Noah Parker "real-time" spidering by Lycos Otis Gospodnetic ❍ Re: "real-time" spidering by Lycos Danny Sullivan ❍ Re: "real-time" spidering by Lycos Otis Gospodnetic Info on large scale spidering? Nick Craswell ❍ Re: Info on large scale spidering? Greg Fenton ❍ Re: Info on large scale spidering? Nick Arnett ❍ Re: Info on large scale spidering? Nick Craswell ● Re: Info on large scale spidering? Otis Gospodnetic ● AltaVista Meta Tag Rumour Mitch Allen ❍ Re: AltaVista Meta Tag Rumour Danny Sullivan ❍ Re: AltaVista Meta Tag Rumour Erik Selberg http://info.webcrawler.com/mailing-lists/robots/index.html (57 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● indexing via redirectors Patrick Berchtold ❍ Re: indexing via redirectors Sigfrid Lundberg ❍ Re: indexing via redirectors Hrvoje Niksic ❍ Re: indexing via redirectors [email protected] ❍ Re: indexing via redirectors Mike Burner ● fetching .map files Patrick Berchtold ● Re: indexing via redirectors Eric Miller ● Re: indexing via redirectors Theo Van Dinter ● Re: indexing via redirectors Sigfrid Lundberg ❍ Re: indexing via redirectors Sigfrid Lundberg ❍ Re: indexing via redirectors Hrvoje Niksic ● Re: indexing via redirectors Benjamin Franz ● Re: indexing via redirectors Eric Miller ● RE: indexing via redirectors Martin.Soukup ● Meta Tags =?ISO-8859-1?Q?Alvaro_Mu=F1oz-Aycuens_Martinez?= ❍ Re: Meta Tags Jeff Drost ❍ Re: Meta Tags Martijn Koster ❍ Re: Meta Tags Danny Sullivan ❍ Re: Meta Tags Eric Miller ● Re: indexing via redirectors Jeff Drost ● Re: indexing via redirectors Captain Napalm ❍ Re: indexing via redirectors Hrvoje Niksic ● Have you used the Microsoft Active-X Internet controls for Visual Basic? (Or know someone who does?) Richard Edwards ● Crawling & DNS issues Neil Cotty ● ❍ Re: Meta Tags Jeff Drost ❍ Re: Crawling & DNS issues Neil Cotty ❍ Re: Crawling & DNS issues David L. Sifry ❍ Re: Crawling & DNS issues Martin Beet ❍ Re: Crawling & DNS issues [email protected] robot meta tags Dan Gildor ❍ Re: robot meta tags Theo Van Dinter ● Re: Crawling & DNS issues Otis Gospodnetic ● Re: Crawling & DNS issues Srinivas Padmanabhuni ● Re: Info on large scale spidering? Otis Gospodnetic ❍ Re: Info on large scale spidering? Martin Hamilton http://info.webcrawler.com/mailing-lists/robots/index.html (58 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● Inktomi & large scale spidering Otis Gospodnetic ❍ Re: Inktomi & large scale spidering [email protected] ❍ Re: Inktomi & large scale spidering Martin Hamilton ❍ Re: Inktomi & large scale spidering Otis Gospodnetic ❍ Re: Inktomi & large scale spidering Nick Arnett ❍ Re: Inktomi & large scale spidering Erik Selberg ● Single tar (Re: Inktomi & large scale spidering) =?ISO-8859-1?Q?Jaakko_Hyv=E4tti?= ● Meta Tags only on home page ? Thomas Merlin ● Re: Single tar (Re: Inktomi & large scale spidering) Sigfrid Lundberg ❍ Re: Meta Tags only on home page ? Jon Knight ❍ Re: Meta Tags only on home page ? Klaus Johannes Rusch ❍ Re: Single tar (Re: Inktomi & large scale spidering) Erik Selberg ● Re: Single tar (Re: Inktomi & large scale spidering) Martin Hamilton ● robots & copyright law Tony Rose ❍ Re: robots & copyright law Wayne Rust ❍ Re: robots & copyright law Gary L. 
Burt ❍ Re: robots & copyright law Danny Sullivan ❍ Re: robots & copyright law Nick Arnett ❍ Re: robots & copyright law Mitch Allen ● Re: Single tar (Re: Inktomi & large sca Howard, Dan: CIO ● Re: Single tar (Re: Inktomi & large scale spidering) Rob Hartill ● Referencing dynamic pages Thomas Merlin ● Need help on Search Engine accuracy test. [email protected] ● ❍ Re: Referencing dynamic pages Klaus Johannes Rusch ❍ Re: Need help on Search Engine accuracy test. Nick Arnett ❍ Re: Need help on Search Engine accuracy test. Nick Craswell Need help again. [email protected] ❍ ● Re: Need help again. Nick Arnett Question about Robot.txt =?ISO-8859-1?Q?Alvaro_Mu=F1oz-Aycuens_Martinez?= ❍ Re: Question about Robot.txt Klaus Johannes Rusch ● Re: Analysing the Web (was Re: Info on large scale spidering?) Patrick Berchtold ● Robot Specifications. Paul Bingham ● Agent Specification Paul Bingham ● specialized searches Mike Fresener ● NaughtyRobot Martijn Koster http://info.webcrawler.com/mailing-lists/robots/index.html (59 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● Unfriendly robot at 192.115.187.2 Tim Holt ● Information about AltaVista and Excite Huaiyu Liu ❍ ● FILEZ [email protected] ❍ ● Re: Information about AltaVista and Excite Klaus Johannes Rusch Re: FILEZ Barry A. Dobyns Re: Single tar (Re: Inktomi & large scale spidering) Martin Hamilton ❍ Re: Single tar (Re: Inktomi & large scale spidering) Simon Wilkinson ❍ Re: Single tar (Re: Inktomi & large scale spidering) Sigfrid Lundberg ● i need a bot! Myles Weissleder ● robots: lycos's t-rex: strange behaviour Dinesh ● WININET caching Martin.Soukup ● Welcome to cypherpunks [email protected] ● Re: Welcome to cypherpunks Skip Montanaro ● More with the Cypherpunk antics Captain Napalm ❍ ● ● ● Re: More with the Cypherpunk antics Hrvoje Niksic The Metacrawler, Reborn Paul Phillips ❍ Re: More with the Cypherpunk antics Martijn Koster ❍ Re: More with the Cypherpunk antics Chad Zimmerman Java and robots... Manuel J. Kwak ❍ Re: Java and robots... [email protected] ❍ Re: Java and robots... Art Matheny Lack of support for "If-Modified-Since" Howard, Dan: CIO ❍ Re: Lack of support for "If-Modified-Since" John W. James ❍ Re: Lack of support for "If-Modified-Since" mike mulligan ● Re: Lack of support for "If-Modified-Since" Rob Hartill ● How to get the document info ? mannina bruno ● Thanks! Manuel Jesus Fernandez Blanco ● RE: How to get the document info ? Howard, Dan: CIO ● Re: message to USSA House of Representatives [email protected] ● New Site Aaron Stayton ● New Site Aaron Stayton ● Re: message to USSA Senate [email protected] ● Re: More with the Cypherpunk antics Benjamin Franz ● Re: More with the Cypherpunk antics Sigfrid Lundberg ● Re: More with the Cypherpunk antics Rob Hartill http://info.webcrawler.com/mailing-lists/robots/index.html (60 of 61) [18.02.2001 13:19:28] Robots Mailing List Archive by thread ● Re: More with the Cypherpunk antics Jeff Drost ● AMDIN: The list is dead Martijn Koster ❍ ● Re: AMDIN: The list is dead Martijn Koster [3]RE>[5]RE>Checking Log fi Roger Dearnaley ❍ Re: [3]RE>[5]RE>Checking Log fi Gordon Bainbridge Last message date: Thu 18 Dec 1997 - 14:33:60 PDT Archived on: Sun Aug 17 1997 - 19:13:25 PDT ● Messages sorted by: [ date ][ subject ][ author ] ● Other mail archives This archive was generated by hypermail 1.02. http://info.webcrawler.com/mailing-lists/robots/index.html (61 of 61) [18.02.2001 13:19:28] Robots in the Web: threat or treat? The Web Robots Pages Robots in the Web: threat or treat? 
Martijn Koster, NEXOR
April 1995 [1997: Updated links and addresses]

ABSTRACT

Robots have been operating in the World-Wide Web for over a year. In that time they have performed useful tasks, but have also on occasion wreaked havoc on the networks. This paper investigates the advantages and disadvantages of robots, with an emphasis on robots used for resource discovery. New alternative resource discovery strategies are discussed and compared. It concludes that while current robots will be useful in the immediate future, they will become less effective and more problematic as the Web grows.

INTRODUCTION

The World Wide Web [1] has become highly popular in the last few years, and is now one of the primary means of information publishing on the Internet. When the size of the Web increased beyond a few sites and a small number of documents, it became clear that manual browsing through a significant portion of the hypertext structure is no longer possible, let alone an effective method for resource discovery.

This problem has prompted experiments with automated browsing by "robots". A Web robot is a program that traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. These programs are sometimes called "spiders", "web wanderers", or "web worms". These names, while perhaps more appealing, may be misleading: the terms "spider" and "wanderer" give the false impression that the robot itself moves, and the term "worm" might imply that the robot multiplies itself, like the infamous Internet worm [2]. In reality robots are implemented as a single software system that retrieves information from remote sites using standard Web protocols.

ROBOT USES

Robots can be used to perform a number of useful tasks:

Statistical Analysis

The first robot [3] was deployed to discover and count the number of Web servers. Other statistics could include the average number of documents per server, the proportion of certain file types, the average size of a Web page, the degree of interconnectedness, and so on.

Maintenance

One of the main difficulties in maintaining a hypertext structure is that references to other pages may become "dead links" when the page referred to is moved or even removed. There is currently no general mechanism to proactively notify the maintainers of the referring pages of this change. Some servers, for example the CERN HTTPD, will log failed requests caused by dead links, along with the reference of the page where the dead link occurred, allowing for post-hoc manual resolution. This is not very practical, and in reality authors only find that their documents contain bad links when they notice it themselves, or in the rare case that a user notifies them by e-mail.

A robot that verifies references, such as MOMspider [4], can assist an author in locating these dead links, and so can assist in the maintenance of the hypertext structure. Robots can help maintain the content as well as the structure, by checking for HTML [5] compliance, conformance to style guidelines, regular updates, etc., but this is not common practice. Arguably this kind of functionality should be an integrated part of HTML authoring environments, as these checks can then be repeated whenever the document is modified, and any problems can be resolved immediately.
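To make the dead-link checking described above concrete, here is a minimal sketch of a reference verifier written in modern Python. It illustrates the technique only and is not MOMspider's actual implementation; the starting page, the use of HEAD requests, and the ten-second timeout are assumptions made for the example.

    # Minimal dead-link checker in the spirit of the maintenance robots
    # described above. Illustrative sketch only, not MOMspider's code:
    # it fetches one page, extracts its links, and reports those that fail.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def check_links(page_url):
        """Fetch page_url, then try each referenced URL and report dead links."""
        html = urlopen(page_url).read().decode("utf-8", errors="replace")
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(page_url, href)      # resolve relative references
            if not target.startswith("http"):
                continue                          # skip mailto:, ftp:, etc.
            try:
                urlopen(Request(target, method="HEAD"), timeout=10)
            except (HTTPError, URLError) as err:
                print(f"DEAD LINK on {page_url}: {target} ({err})")

    if __name__ == "__main__":
        check_links("http://example.com/")        # hypothetical starting page

Using HEAD rather than GET keeps the load placed on the referenced servers low, which matters for the operational costs discussed later in this paper.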
A mirror copies an entire directory tree recursively by FTP, and then regularly retrieves those documents that have changed. This allows load sharing, redundancy to cope with host failures, and faster and cheaper local access, and off-line access. In the Web mirroring can be implemented with a robot, but at the time of writing no sophisticated mirroring tools exist. There are some robots that will retrieve a subtree of Web pages and store it locally, but they don't have facilities for updating only those pages that have changed. A second problem unique to the Web is that the references in the copied pages need to be rewritten: where they reference pages that have also been mirrored they may need to changed to point to the copies, and where relative links point to pages that haven't been mirrored they need to be expanded into absolute links. The need for mirroring tools for performance reasons is much reduced by the arrival of sophisticated caching servers [6], which do offer selective updates, can guarantee that a cached document is up-to-date, and are largely self maintaining. However, it is expected that mirroring tools will be developed in due course. Resource discovery Perhaps the most exciting application of robots is their use in resource discovery. Where humans cannot cope with the amount of information it is attractive to let the computer do the work. There are several robots that summarise large parts of the Web, and provide access to a database with these results through a search engine. This means that rather than relying solely on browsing, a Web user can combine browsing and searching to locate information; even if the database doesn't contain the exact item you want to retrieve, it is likely to contain references to related pages, which in turn may reference the target item. The second advantage is that these databases can be updated automatically at regular intervals, so that dead links in the database will be detected and removed. This in contrast to manual document maintenance, where verification is often sporadic and not comprehensive. The use of robots for resource discovery will be further discussed below. http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (2 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat? Combined Uses A single robot can perform more than one of the above tasks. For example the RBSE Spider [7] does statistical analysis of the retrieved documents as well providing a resource discovery database. Such combined uses are unfortunately quite rare. OPERATIONAL COSTS AND DANGERS The use of robots comes at a price, especially when they are operated remotely on the Internet. In this section we will see that robots can be dangerous in that they place high demands on the Web. Network resource and server load Robots require considerable bandwidth. Firstly robots operate continually over prolonged periods of time, often months. To speed up operations many robots feature parallel retrieval, resulting in a consistently high use of bandwidth in the immediate proximity. Even remote parts of the network can feel the network resource strain if the robot makes a large number of retrievals in a short time ("rapid fire"). This can result in a temporary shortage of bandwidth for other uses, especially on low-bandwidth links, as the Internet has no facility for protocol-dependent load balancing. Traditionally the Internet has been perceived to be "free", as the individual users did not have to pay for its operation. 
This perception is coming under scrutiny, as especially corporate users do feel a direct cost associated with network usage. A company may feel that the service to its (potential) customers is worth this cost, but that automated transfers by robots are not. Besides placing demands on the network, a robot also places extra demand on servers. Depending on the frequency with which it requests documents from the server this can result in a considerable load, which results in a lower level of service for other Web users accessing the server. Especially when the host is also used for other purposes this may not be acceptable. As an experiment the author ran a simulation of 20 concurrent retrievals from his server running the Plexus server on a Sun 4/330. Within minutes the machine slowed down to a crawl and was unusable for anything. Even with only consecutive retrievals the effect can be felt. In the very week that this paper was written a robot visited the author's site with rapid fire requests. After 170 consecutive retrievals the server, which had been operating fine for weeks, crashed under the extra load. This shows that rapid fire needs to be avoided. Unfortunately even modern manual browsers (e.g. Netscape) contribute to this problem by retrieving in-line images concurrently. The Web's protocol, HTTP [8], has been shown to be inefficient for this kind of transfer [9], and new protocols are being designed to remedy this [10].

Updating overhead
It has been mentioned that databases generated by robots can be automatically updated. Unfortunately there is no efficient change control mechanism in the Web; there is no single request that can determine which of a set of URL's has been removed, moved, or modified. HTTP does provide the "If-Modified-Since" mechanism, whereby the user-agent can specify the modification time-stamp of a cached document along with a request for the document. The server will then only transfer the contents if the document has been modified since it was cached.

http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (3 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat?

This facility can only be used by a robot if it retains the relationship between the summary data it extracts from a document, its URL, and the timestamp of the retrieval. This places extra requirements on the size and complexity of the database, and is not widely implemented.

Client-side robots/agents
The load on the network is especially an issue with the category of robots that are used by end-users, and implemented as part of a general-purpose Web client (e.g. the Fish Search [11] and the tkWWW robot [12]). One feature that is common in these end-user robots is the ability to pass on search-terms to search engines found while traversing the Web. This is touted as improving resource discovery by querying several remote resource discovery databases automatically. However it is the author's opinion that this feature is unacceptable for two reasons. Firstly a search operation places a far higher load on a server than a simple document retrieval, so a single user can cause a considerable overhead on several servers in a far shorter period than normal. Secondly, it is a fallacy to assume that the same search-terms are relevant, syntactically correct, let alone optimal for a broad range of databases, and the range of databases is totally hidden from the user.
For example, the query "Ford and garage" could be sent to a database on 17th century literature, a database that doesn't support Boolean operators, or a database that specifies that queries specific to automobiles should start with the word "car:". And the user isn't even aware of this. Another dangerous aspect of a client-side robot is that once it is distributed no bugs can be fixed, no knowledge of problem areas can be added and no new efficient facilities can be taken advantage of, as not everyone will upgrade to the latest version. The most dangerous aspect however is the sheer number of possible users. While some people are likely to use such a facility sensibly, i.e. bounded by some maximum, on a known local area of the web, and for a short period of time, there will be people who will abuse this power, through ignorance or arrogance. It is the author's opinion that remote robots should not be distributed to end-users, and fortunately it has so far been possible to convince at least some robot authors to cancel releases [13]. Even without the dangers, client-side robots pose an ethical question: where the use of a robot may be acceptable to the community if its data is then made available to the community, client-side robots may not be acceptable as they operate only for the benefit of a single user. The ethical issues will be discussed further below. End-user "intelligent agents" [14] and "digital assistants" are currently a popular research topic in computing, and often viewed as the future of networking. While this may indeed be the case, and it is already apparent that automation is invaluable for resource discovery, a lot more research is required for them to be effective. Simplistic user-driven Web robots are far removed from intelligent network agents: an agent needs to have some knowledge of where to find specific kinds of information (i.e. which services to use) rather than blindly traversing all information. Compare the situation where a person is searching for a book shop; they use the Yellow Pages for a local area, find the list of shops, select one or a few, and visit those. A client-side robot would walk into all shops in the area asking for books. On a network, as in real life, this is inefficient on a small scale, and prohibitive on a larger scale.

Bad Implementations
The strain placed on the network and hosts is sometimes increased by bad implementations, especially of newly written robots. Even if the protocol and URL's sent by the robot are correct, and the robot correctly deals with the returned protocol (including more advanced features such as redirection), there are some less-obvious problems.

http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (4 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat?

The author has observed several identical robot runs accessing his server. While in some cases this was caused by people using the site for testing (instead of a local server), in some cases it became apparent that this was caused by lax implementation. Repeated retrievals can occur when either no history of accessed locations is stored (which is unforgivable), or when a robot does not recognise cases where several URL's are syntactically equivalent, e.g. where different DNS aliases for the same IP address are used, or where URL's aren't canonicalised by the robot, e.g. "foo/bar/../baz.html" is equivalent to "foo/baz.html". Some robots sometimes retrieve document types, such as GIF's and Postscript, which they cannot handle and thus ignore.
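The canonicalisation problem just described takes only a little care to avoid. As a minimal sketch (in Python, a present-day convenience that obviously post-dates this paper; the base URL http://www.example.com/ is a made-up placeholder), a robot can normalise every link it finds and keep a set of the URL's it has already requested:

    from urllib.parse import urljoin, urlsplit, urlunsplit

    def canonicalise(base, href):
        # Resolve the link against the page it appeared on; urljoin also
        # removes "." and ".." segments, so "foo/bar/../baz.html" and
        # "foo/baz.html" come out identical.
        absolute = urljoin(base, href)
        scheme, netloc, path, query, _fragment = urlsplit(absolute)
        # Lower-case the host name and drop any fragment; folding DNS
        # aliases onto a single address would need an extra lookup step.
        return urlunsplit((scheme, netloc.lower(), path or "/", query, ""))

    visited = set()
    for link in ("foo/bar/../baz.html", "foo/baz.html", "foo/baz.html#top"):
        url = canonicalise("http://www.example.com/", link)
        if url in visited:
            continue        # already retrieved once; don't ask the server again
        visited.add(url)
        # ... retrieve url here ...

All three links above collapse to the same canonical URL, so the document would be requested only once.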
Another danger is that some areas of the web are near-infinite. For example, consider a script that returns a page with a link to one level further down. This will start with for example "/cgi-bin/pit/", and continue with "/cgi-bin/pit/a/", "/cgi-bin/pit/a/a/", etc. Because such URL spaces can trap robots that fall into them, they are often called "black holes". See also the discussion of the Proposed Standard for Robot Exclusion below.

CATALOGUING ISSUES
That resource discovery databases generated by robots are popular is undisputed. The author himself regularly uses such databases when locating resources. However, there are some issues that limit the applicability of robots to Web-wide resource discovery.

There is too much material, and it's too dynamic
One measure of effectiveness of an information retrieval approach is "recall", the fraction of all relevant documents that were actually found. Brian Pinkerton [15] states that recall in Internet indexing systems is adequate, as finding enough relevant documents is not the problem. However, if one considers the complete set of information available on the Internet as a basis, rather than the database created by the robot, recall cannot be high, as the amount of information is enormous, and changes are very frequent. So in practice a robot database may not contain a particular resource that is available, and this will get worse as the Web grows.

Determining what to include/exclude
A robot cannot automatically determine if a given Web page should be included in its index. Web servers may serve documents that are only relevant to a local context (for example an index of an internal library), that exist only temporarily, etc. To a certain extent the decision of what is relevant also depends on the audience, which may not have been identified at the time the robot operates. In practice robots end up storing almost everything they come across. Note that even if a robot could decide whether a particular page is to be excluded from its database, it has already incurred the cost of retrieving the file; a robot that decides to ignore a high percentage of documents is very wasteful. In an attempt to alleviate this situation somewhat the robot community has adopted "A Standard for Robot Exclusion" [16]. This standard describes the use of a simple structured text file available at a well-known place on a server ("/robots.txt") to specify which parts of its URL space should be avoided by robots (see Figure 1). This facility can also be used to warn robots about black holes. Individual robots can be given specific instructions, as some may behave more sensibly than others, or are known to specialise in a particular area. This standard is voluntary, but is very simple to implement, and there is considerable public pressure for robots to comply.

http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (5 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat?

Determining how to traverse the Web is a related problem. Given that most Web servers are organised hierarchically, a breadth-first traversal from the top to a limited depth is likely to more quickly find a broader and higher-level set of documents and services than a depth-first traversal, and is therefore much preferable for resource discovery. However, a depth-first traversal is more likely to find individual users' home pages with links to other, potentially new, servers, and is therefore more likely to find new sites to traverse.
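To make the trade-off concrete, here is a minimal breadth-first, depth-limited traversal sketch (in Python, purely illustrative: it does no rate limiting, ignores "/robots.txt", and assumes reasonably well-formed HTML); taking items from the other end of the queue would turn it into a depth-first traversal:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # Collect the HREF of every <A> tag on a page.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def breadth_first(start_url, max_depth=2):
        # Visit pages level by level, never going deeper than max_depth.
        seen = {start_url}
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            try:
                with urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue                  # dead link or unreachable host
            yield url
            if depth >= max_depth:
                continue
            extractor = LinkExtractor()
            extractor.feed(html)
            for href in extractor.links:
                absolute = urljoin(url, href)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))

    # for page in breadth_first("http://www.example.com/"):
    #     print(page)

A real robot would of course combine this skeleton with the canonicalisation and exclusion checks discussed elsewhere in this paper.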
# /robots.txt for http://www.site.com/
User-agent: *               # attention all robots:
Disallow: /cyberworld/map   # infinite URL space
Disallow: /tmp/             # temporary files

Figure 1: An example robots.txt file

Summarising documents
It is very difficult to index an arbitrary Web document. Early robots simply stored document titles and anchor texts, but newer robots use more advanced mechanisms and generally consider the entire content. These methods are good general measures, and can be automatically applied to all Web pages, but cannot be as effective as manual indexing by the author. HTML provides a facility to attach general meta information to documents, by specifying a <META> element, e.g. <META NAME="Keywords" CONTENT="Ford Car Maintenance">. However, no semantics have (yet) been defined for specific values of the attributes of this tag, and this severely limits its acceptance, and therefore its usefulness. This results in a low "precision", the proportion of the total number of documents retrieved that is relevant to the query. Advanced features such as Boolean operators, weighted matches like WAIS, or relevance feedback can improve this, but given that the information on the Internet is enormously diverse, this will continue to be a problem.

Classifying documents
Web users often ask for a "subject hierarchy" of documents in the Web. Projects such as GENVL [17] allow these subject hierarchies to be manually maintained, which presents a number of problems that fall outside the scope of this paper. It would be useful if a robot could present a subject hierarchy view of its data, but this requires some automated classification of documents [18]. The META tag discussed above could provide a mechanism for authors to classify their own documents. The question then arises which classification system to use, and how to apply it. Even traditional libraries don't use a single universal system, but adopt one of a few, and adopt their own conventions for applying them. This gives little hope for an immediate universal solution for the Web.

Determining document structures
Perhaps the most difficult issue is that the Web doesn't consist of a flat set of files of equal importance. Often services on the Web consist of a collection of Web pages: there is a welcome page, maybe some pages with forms, maybe some pages with background information, and some pages with individual data points. The service provider announces the service by referring to the welcome page,
http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (6 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat?
which is designed to give structured access to the rest of the information. A robot however has no way of distinguishing these pages, and may well find a link into for example one of the data points or background files, and index those rather than the main page. So it can happen that rather than storing a reference to "The Perl FAQ", it stores some random subset of the questions addressed in the FAQ. If there were a facility in the Web for specifying, per document, that people shouldn't link to that page but to another specified one, this problem could be avoided. Related to the above problem is that the content of Web pages is often written for a specific context, provided by the access structure, and may not make sense outside that context. For example, a page describing the goals of a project may refer to "The project", without fully specifying the name, or giving a link to the welcome page.
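As an aside, honouring an exclusion file like the one in Figure 1 amounts to a simple prefix comparison per URL. In a present-day robot the check fits in a few lines of Python against the standard library (the agent name "ExampleRobot" is invented; the rules are copied from Figure 1):

    from urllib.robotparser import RobotFileParser

    # In a real robot the rules would be fetched from
    # http://www.site.com/robots.txt rather than given inline.
    parser = RobotFileParser()
    parser.parse([
        "User-agent: *",
        "Disallow: /cyberworld/map",
        "Disallow: /tmp/",
    ])

    print(parser.can_fetch("ExampleRobot", "http://www.site.com/cyberworld/map/area.html"))  # False
    print(parser.can_fetch("ExampleRobot", "http://www.site.com/welcome.html"))              # True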
Another problem is that of moved URL's. Often when service administrators reorganise their URL structure they will provide mechanisms for backward compatibility with the previous URL structure, to prevent broken links. In some servers this can be achieved by specifying redirection configuration, which results in the HTTP negotiating a new URL when users try to access the old URL. However, when symbolic links are used it is not possible to tell the difference between the two. An indexing robot can in these cases store the deprecated URL, prolonging the requirement for a web administrator to provide backward compatibility. A related problem is that a robot might index a mirror of a particular service, rather than the original site. If both source and mirror are visited there will be duplicate entries in the database, and bandwidth is being wasted repeating identical retrievals to different hosts. If only the mirror is visited users may be referred to out-of-date information even when up-to-date information is available elsewhere. ETHICS We have seen that robots are useful, but that they can place high demands on bandwidth, and that they have some fundamental problems when indexing the Web. Therefore a robot author needs to balance these issues when designing and deploying a robot. This becomes an ethical question "Is the cost to others of the operation of a robot justified". This is a grey area, and people have very different opinions on what is acceptable. When some of the acceptability issues first became apparent (after a few incidents with robots doubling the load on servers) the author developed a set of Guidelines for Robot Writers [19], as a first step to identify problem areas and promote awareness. These guidelines can be summarised as follows: ● Reconsider: Do you really need a new robot? ● Be accountable: Ensure the robot can be identified by server maintainers, and the author can be easily contacted. ● Test extensively on local data ● Moderate resource consumption: Prevent rapid fire and eliminate redundant and pointless retrievals. ● Follow the Robot Exclusion Standard. ● Monitor operation: Continuously analyse the robot logs. ● Share results: Make the robot's results available to others, the raw results as well as any intended high-level results. David Eichman [20] makes a further distinction between Service Agents, robots that build information bases that will be publicly available, and User Agents, robots that benefit only a single user such as http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (7 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat? client-side robots, and has identified separate high-level ethics for each. The fact that most Robot writers have already implemented these guidelines indicates that they are conscious of the issues, and eager to minimise any negative impact. The public discussion forum provided by the robots mailing list speeds up the discussion of new problem areas, and the public overview of the robots on the Active list provides a certain community pressure on robot behaviour [21]. This maturation of the Robot field means there have recently been fewer incidents where robots have upset information providers. Especially the standard for robot exclusion means that people who don't approve of robots can prevent being visited. 
Experiences from several projects that have deployed robots have been published, especially at the World-Wide Web conferences at CERN in July 1994 and Chicago in October 1994, and these help to educate, and discourage, would-be Robot writers. However, with the increasing popularity of the Internet in general, and the Web in particular, it is inevitable that more Robots will appear, and it is likely that some will not behave appropriately.

ALTERNATIVES FOR RESOURCE DISCOVERY
Robots can be expected to continue to be used for network information retrieval on the Internet. However, we have seen that there are practical, fundamental and ethical problems with deploying robots, and it is worth considering research into alternatives, such as ALIWEB [22] and Harvest [23]. ALIWEB has a simple model for human distributed indexing of services in the Web, loosely based on Archie [24]. In this model aggregate indexing information is available from hosts on the Web. This information indexes only local resources, not resources available from third parties. In ALIWEB this is implemented with IAFA templates [25], which give typed resource information in a simple text-based format (see Figure 2). These templates can be produced manually, or can be constructed by automated means, for example from titles and META elements in a document tree. The ALIWEB gathering engine retrieves these index files through normal Web access protocols, and combines them into a searchable database. Note that it is not a robot, as it doesn't recursively retrieve documents found in the index.

Template-Type:  SERVICE
Title:          The ArchiePlex Archie Gateway
URL:            /public/archie/archieplex/archieplex.html
Description:    A Full Hypertext interface to Archie.
Keywords:       Archie, Anonymous FTP.

Template-Type:  DOCUMENT
Title:          The Perl Page
URL:            /public/perl/perl.html
Description:    Information on the Perl Programming Language.
                Includes hypertext versions of the Perl 5 Manual
                and the latest FAQ.
Keywords:       perl, programming language, perl-faq

Figure 2: An IAFA index file

There are several advantages to this approach. The quality of human-generated index information is combined with the efficiency of automated update mechanisms. The integrity of the information is
http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (8 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat?
higher than with traditional "hotlists", as only local index information is maintained. Because the information is typed in a computer-readable format, search interfaces can offer extra facilities to constrain queries. There is very little network overhead, as the index information is retrieved in a single request. The simplicity of the model and the index file means any information provider can immediately participate. There are some disadvantages. The manual maintenance of indexing information can appear to place a large burden on the information provider, but in practice indexing information for major services doesn't change often. There have been experiments with index generation from TITLE and META tags in the HTML, but this requires the local use of a robot, and has the danger that the quality of the index information suffers. A second limitation is that in the current implementation information providers have to register their index files at a central registry, which limits scalability. Finally, updates are not optimally efficient, as an entire index file needs to be retrieved even if only one of its records was modified.
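A gathering engine needs very little code to consume index files in the format of Figure 2. The sketch below (Python, illustrative only; real IAFA index files have more record types and formatting rules than this handles) splits a file into templates on blank lines, and each template into field/value pairs:

    def parse_iafa(text):
        # Templates are separated by blank lines; each non-blank line is
        # "Field: value", and indented lines continue the previous field.
        templates = []
        for block in text.split("\n\n"):
            record = {}
            last_field = None
            for line in block.splitlines():
                if not line.strip():
                    continue
                if line[0].isspace() and last_field:
                    record[last_field] += " " + line.strip()
                elif ":" in line:
                    field, _, value = line.partition(":")
                    last_field = field.strip()
                    record[last_field] = value.strip()
            if record:
                templates.append(record)
        return templates

    example = """Template-Type:  SERVICE
    Title:          The ArchiePlex Archie Gateway
    URL:            /public/archie/archieplex/archieplex.html
    Description:    A Full Hypertext interface to Archie.
    Keywords:       Archie, Anonymous FTP."""

    for template in parse_iafa(example):
        print(template["Template-Type"], template["URL"])

Run over the first template of Figure 2, this prints "SERVICE /public/archie/archieplex/archieplex.html"; the resulting dictionaries can then be loaded into whatever searchable database the gatherer maintains.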
ALIWEB has been in operation since October 1993, and the results have been encouraging. The main operational difficulties appeared to be lack of understanding; initially people often attempted to register their own HTML files instead of IAFA index files. The other problem is that as a personal project ALIWEB is run on a spare-time basis and receives no funding, so further development is slow. Harvest is a distributed resource discovery system recently released by the Internet Research Task Force Research Group on Resource Discovery (IRTF-RD), and offers software systems for automated indexing of document contents, efficient replication and caching of such index information on remote hosts, and finally searching of this data through an interface in the Web. Initial reactions to this system have been very positive. One disadvantage of Harvest is that it is a large and complex system which requires considerable human and computing resources, making it less accessible to information providers. The use of Harvest to form a common platform for the interworking of existing databases is perhaps its most exciting aspect. It is reasonably straightforward for other systems to interwork with Harvest; experiments have shown that ALIWEB for example can operate as a Harvest broker. This gives ALIWEB the caching and searching facilities Harvest offers, and offers Harvest a low-cost entry mechanism. These two systems show attractive alternatives to the use of robots for resource discovery: ALIWEB provides a simple and high-level index, Harvest provides a comprehensive indexing system that uses low-level information. However, neither system is targeted at indexing of third parties that don't actively participate, and it is therefore expected that robots will continue to be used for that purpose, but in co-operation with other systems such as ALIWEB and Harvest.

CONCLUSIONS
In today's World-Wide Web, robots are used for a number of different purposes, including global resource discovery. There are several practical, fundamental, and ethical problems involved in the use of robots for this task. The practical and ethical problems are being addressed as experience with robots increases, but are likely to continue to cause occasional problems. The fundamental problems limit the amount of growth there is for robots. Alternative strategies such as ALIWEB and Harvest are more efficient, and give authors and sites control of the indexing of their own information. It is expected that this type of system will increase in popularity, and will operate alongside robots and
http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (9 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat?
interwork with them. In the longer term complete Web-wide traversal by robots will become prohibitively slow, expensive, and ineffective for resource discovery.

REFERENCES
1 Berners-Lee, T., R. Cailliau, A. Luotonen, H.F. Nielsen and A. Secret. "The World-Wide Web". Communications of the ACM, v. 37, n. 8, August 1994, pp. 76-82.
2 Seeley, Donn. "A tour of the worm". USENIX Association Winter Conference 1989 Proceedings, January 1989, pp. 287-304.
3 Gray, M. "Growth of the World-Wide Web," Dec. 1993. <URL: http://www.mit.edu:8001/aft/sipb/user/mkgray/ht/web-growth.html >
4 Fielding, R. "Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web". Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.
5 Berners-Lee, T., D.
Connolly et al., "HyperText Markup Language Specification 2.0". Work in progress of the HTML working group of the IETF. <URL: ftp://nic.merit.edu/documents/internet-drafts/draft-ietf-html-spec-00.txt >
6 Luotonen, A., K. Altis. "World-Wide Web Proxies". Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.
7 Eichmann, D. "The RBSE Spider - Balancing Effective Search against Web Load". Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.
8 Berners-Lee, T., R. Fielding, F. Nielsen. "HyperText Transfer Protocol". Work in progress of the HTTP working group of the IETF. <URL: ftp://nic.merit.edu/documents/internet-drafts/draft-fielding-http-spec-00.txt >
9 Spero, S. "Analysis of HTTP Performance problems," July 1994. <URL: http://sunsite.unc.edu/mdma-release/http-prob.html >
10 Spero, S. "Progress on HTTP-NG". <URL: http://info.cern.ch/hypertext/www/Protocols/HTTP-NG/http-ng-status.html >
11 De Bra, P.M.E. and R.D.J. Post. "Information Retrieval in the World-Wide Web: Making Client-based searching feasible". Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.

http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (10 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat?

12 Spetka, Scott. "The TkWWW Robot: Beyond Browsing". Proceedings of the Second International World-Wide Web Conference, Chicago, United States, October 1994.
13 Slade, R., "Risks of client search tools," RISKS-FORUM Digest, v. 16, n. 37, Weds 31 August 1994.
14 Riechen, Doug. "Intelligent Agents". Communications of the ACM, Vol. 37, No. 7, July 1994.
15 Pinkerton, B., "Finding What People Want: Experiences with the WebCrawler," Proceedings of the Second International World-Wide Web Conference, Chicago, United States, October 1994.
16 Koster, M., "A Standard for Robot Exclusion," <URL: http://info.webcrawler.com/mak/projects/robots/exclusion.html >
17 McBryan, A., "GENVL and WWWW: Tools for Taming the Web," Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.
18 Kent, R.E., Neus, C., "Creating a Web Analysis and Visualization Environment," Proceedings of the Second International World-Wide Web Conference, Chicago, United States, October 1994.
19 Koster, Martijn. "Guidelines for Robot Writers". 1993. <URL: http://info.webcrawler.com/mak/projects/robots/guidelines.html >
20 Eichmann, D., "Ethical Web Agents," Proceedings of the Second International World-Wide Web Conference, Chicago, United States, October 1994.
21 Koster, Martijn. "WWW Robots, Wanderers and Spiders". <URL: http://info.webcrawler.com/mak/projects/robots/robots.html >
22 Koster, Martijn, "ALIWEB - Archie-Like Indexing in the Web," Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.
23 Bowman, Mic, Peter B. Danzig, Darren R. Hardy, Udi Manber and Michael F. Schwartz. "Harvest: Scalable, Customizable Discovery and Access System". Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder, July 1994. <URL: http://harvest.cs.colorado.edu/ >
24 Deutsch, P., A. Emtage, "Archie - An Electronic Directory Service for the Internet", Proc. Usenix Winter Conf., pp. 93-110, Jan 92.
25

http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (11 of 12) [18.02.2001 13:19:53] Robots in the Web: threat or treat?

Deutsch, P., A. Emtage, M. Koster, and M. Stumpf.
"Publishing Information on the Internet with Anonymous FTP". Work in progress of the Integrated Internet Information Retrieval working group. <URL: ftp://nic.merit.edu/documents/internet-drafts/draft-ietf-iiir-publishing-02.txt > MARTIJN KOSTER holds a B.Sc. in Computer Science from Nottingham University (UK). During his national service he worked on as 2nd lieutenant of the Dutch Army at the Operations Research group of STC, NATO's research lab in the Netherlands. Since 1992 he has worked for NEXOR as software engineer on X.500 Directory User Agents, and maintains NEXOR's World-Wide Web service. He is also author of the ALIWEB and CUSI search tools, and maintains a mailing-list dedicated to World-Wide Web robots. Reprinted with permission from ConneXions, Volume 9, No. 4, April 1995. ConneXions--The Interoperability Report is published monthly by: Interop Company, a division of SOFTBANK Expos 303 Vintage Park Drive, Suite 201 Foster City, CA 94404-1138 USA Phone: +1 415 578-6900 FAX: +1 415 525-0194 Toll-free (in USA): 1-800-INTEROP E-mail: [email protected] Free sample issue and list of back issues available upon request. The Web Robots Pages http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (12 of 12) [18.02.2001 13:19:53] Guidelines for Robot Writers The Web Robots Pages Guidelines for Robot Writers Martijn Koster, 1993 This document contains some suggestions for people who are thinking about developing Web Wanderers (Robots), programs that traverse the Web. Reconsider Are you sure you really need a robot? They put a strain on network and processing resources all over the world, so consider if your purpose is really worth it. Also, the purpose for which you want to run your robot are probably not as novel as you think; there are already many other spiders out there. Perhaps you can make use of the data collected by one of the other spiders (check the list of robots and the mailing list). Finally, are you sure you can cope with the results? Retrieving the entire Web is not a scalable solution, it is just too big. If you do decide to do it, don't aim to traverse then entire web, only go a few levels deep. Be Accountable If you do decide you want to write and/or run one, make sure that if your actions do cause problems, people can easily contact you and start a dialog. Specifically: Identify your Web Wanderer HTTP supports a User-agent field to identify a WWW browser. As your robot is a kind of WWW browser, use this field to name your robot e.g. "NottinghamRobot/1.0". This will allow server maintainers to set your robot apart from human users using interactive browsers. It is also recommended to run it from a machine registered in the DNS, which will make it easier to recognise, and will indicate to people where you are. Identify yourself HTTP supports a From field to identify the user who runs the WWW browser. Use this to advertise your email address e.g. "[email protected]". This will allow server maintainers to contact you in case of problems, so that you can start a dialogue on better terms than if you were hard to track down. Announce It Post a message to comp.infosystems.www.providers before running your robots. If people know in advance they can keep an eye out. I maintain a list of active Web Wanderers, so that people who wonder about access from a certain site can quickly check if it is a known robot -- please help me keep it up-to-date by informing me of any missing ones. 
Announce it to the target
If you are only targeting a single site, or a few, contact its administrator and inform him/her.
Be informative
http://info.webcrawler.com/mak/projects/robots/guidelines.html (1 of 5) [18.02.2001 13:20:03] Guidelines for Robot Writers
Server maintainers often wonder why their server is hit. If you use the HTTP Referer field you can tell them. This costs no effort on your part, and may be informative.
Be there
Don't set your Web Wanderer going and then go on holiday for a couple of days. If in your absence it does things that upset people you are the only one who can fix it. It is best to remain logged in to the machine that is running your robot, so people can use "finger" and "talk" to contact you. Suspend the robot when you're not there for a number of days (e.g. at the weekend), and only run it in your presence. Yes, it may be better for the performance of the machine if you run it overnight, but that implies you don't think about the performance overhead of other machines. Yes, it will take longer for the robot to run, but this is more an indication that robots are not the way to do things anyway, than an argument for running it continually; after all, what's the rush?
Notify your authorities
It is advisable to tell your system administrator / network provider what you are planning to do. You will be asking a lot of the services they offer, and if something goes wrong they like to hear it from you first, not from external people.
Test Locally
Don't run repeated tests on remote servers; instead run a number of servers locally and use them to test your robot first. When going off-site for the first time, stay close to home first (e.g. start from a page with local servers). After doing a small run, analyse your performance and your results, and estimate how they scale up to thousands of documents. It may soon become obvious you can't cope.
Don't hog resources
Robots consume a lot of resources. To minimise the impact, keep the following in mind:
Walk, don't run
Make sure your robot runs slowly: although robots can handle hundreds of documents per minute, this puts a large strain on a server, and is guaranteed to infuriate the server maintainer. Instead, put a sleep in, or if you're clever rotate queries between different servers in a round-robin fashion. Retrieving 1 document per minute is a lot better than one per second. One per 5 minutes is better still. Yes, your robot will take longer, but what's the rush, it's only a program.
Use If-modified-since or HEAD where possible
If your application can use the HTTP If-modified-since header, or the HEAD method, for its purposes, that gives less overhead than full GETs.
Ask for what you want
HTTP has an Accept field in which a browser (or your robot) can specify the kinds of data it can handle. Use it: if you only analyse text, specify so. This will allow clever servers to not bother sending you data you can't handle and have to throw away anyway. Also, make use of URL suffixes if they're there.
Ask only for what you want
You can build in some logic yourself: if a link refers to a ".ps", ".zip", ".Z", ".gif" etc, and you
http://info.webcrawler.com/mak/projects/robots/guidelines.html (2 of 5) [18.02.2001 13:20:03] Guidelines for Robot Writers
only handle text, then don't ask for it. Although they are not the modern way to do things (Accept is), there is an enormous installed base out there that uses it (especially FTP sites). Also look out for gateways (e.g. url's starting with finger), News gateways, WAIS gateways etc.
And think about other protocols ("news:", "wais:") etc. Don't forget the sub-page references (<A HREF="#abstract">) -- don't retrieve the same page more than once. It's imperative to make a list of places not to visit before you start...
Check URL's
Don't assume the HTML documents you are going to get back are sensible. When scanning for URL's be wary of things like <A HREF=" http://somehost.somedom/doc>. A lot of sites don't put the trailing / on urls for directories, and a naive strategy of concatenating the names of sub urls can result in bad names.
Check the results
Check what comes back. If a server refuses a number of documents in a row, check what it is saying. It may be that the server refuses to let you retrieve these things because you're a robot.
Don't Loop or Repeat
Remember all the places you have visited, so you can check that you're not looping. Check to see if the different machine addresses you have are not in fact the same box (e.g. web.nexor.co.uk is the same machine as "hercules.nexor.co.uk" and 128.243.219.1) so you don't have to go through it again. This is imperative.
Run at opportune times
On some systems there are preferred times of access, when the machine is only lightly loaded. If you plan to do many automatic requests from one particular site, check with its administrator(s) when the preferred time of access is.
Don't run it often
How often people find acceptable differs, but I'd say once every two months is probably too often. Also, when you re-run it, make use of your previous data: you know which url's to avoid. Make a list of volatile links (like the what's new page, and the meta-index). Use this to get pointers to other documents, and concentrate on new links -- this way you will get a high initial yield, and if you stop your robot for some reason at least it has spent its time well.
Don't try queries
Some WWW documents are searchable (ISINDEX) or contain forms. Don't follow these. The Fish Search does this for example, which may result in a search for "cars" being sent to databases with computer science PhD's, people in the X.500 directory, or botanical data. Not sensible.
Stay with it
It is vital you know what your robot is doing, and that it remains under control.
Log
Make sure it provides ample logging, and it wouldn't hurt to keep certain statistics, such as the number of successes/failures, the hosts accessed recently, and the average size of recent files, and keep an eye on it. This ties in with the "Don't Loop" section -- you need to log where you have been to prevent looping. Again, estimate the required disk-space; you may find you can't cope.
Be interactive
http://info.webcrawler.com/mak/projects/robots/guidelines.html (3 of 5) [18.02.2001 13:20:03] Guidelines for Robot Writers
I have actually had this happen to me; and although I'm not normally violent, I was ready to strangle this person as he was deliberatly wasting my time. I have written a standard practice proposal for a simple method of excluding servers. Please implement this practice, and respect the wishes of the server maintainers. Share results OK, so you are using the resources of a lot of people to do this. Do something back: Keep results This may sound obvious, but think about what you are going to do with the retrieved documents. Try and keep as much info as you can possibly store. This will the results optimally useful. Raw Result Make your raw results available, from FTP, or the Web or whatever. This means other people can use it, and don't need to run their own servers. Polished Result You are running a robot for a reason; probably to create a database, or gather statistics. If you make these results available on the Web people are more likely to think it worth it. And you might get in touch with people with similar interests. Report Errors Your robot might come accross dangling links. You might as well publish them on the Web somewhere (after checking they really are. If you are convinced they are in error (as opposed to restricted), notify the administrator of the server. Examples This is not intended to be a public flaming forum or a "Best/Worst Robot" league-table. But it shows the problems are real, and the guidelines help aleviate them. He, maybe a league table isn't too bad an idea anyway. http://info.webcrawler.com/mak/projects/robots/guidelines.html (4 of 5) [18.02.2001 13:20:03] Guidelines for Robot Writers Examples of how not to do it The robot which retrieved the same sequence of about 100 documents on three occasions in four days. And the machine couldn't be fingered. The results were never published. Sigh. The robot run from phoenix.doc.ic.ac.uk in Jan 94. It provides no User-agent or From fields, one can't finger the host, and it is not part of a publicly known project. In addition it has been reported to retrieve documents it can't handle. Has since improved. The Fish search capability added to Mosaic. One instance managed to retrieve 25 documents in under one minute. Better examples The RBSE-Spider, run in December 93. It had a User-agent field, and after a finger to the host it was possible to open a dialogue with the robot writers. Their web server explained the purpose of it. Jumpstation: the results are presented in a searchable database, the author announced it, and is considering making the raw results available. Unfortunately some people complained about the high rate with which documents were retrieved. Why? Why am I rambling on about this? Because it annoys me to see that people cause other people unnecessary hassle, and the whole discussion can be so much gentler. And because I run a server that is regularly visited by robots, and I am worried they could make the Web look bad. This page has been contributed to by Jonathon Fletcher JumpStation Robot author, Lee McLoughlin ([email protected]), and others. The Web Robots Pages http://info.webcrawler.com/mak/projects/robots/guidelines.html (5 of 5) [18.02.2001 13:20:03] Evaluation of the Standard for Robots Exclusion The Web Robots Pages Evaluation of the Standard for Robots Exclusion Martijn Koster, 1996 Abstract This paper contains an evaluation of the Standard for Robots Exclusion, identifies some of its problems and feature requsts, and recomends future work. 
● Introduction
● Architecture
● Problems and Feature Requests
● Recommendations

Introduction
The Standard for Robots Exclusion (SRE) was first proposed in 1994, as a mechanism for keeping robots out of unwanted areas of the Web. Such unwanted areas included:
● infinite URL spaces in which robots could get trapped ("black holes").
● resource intensive URL spaces, e.g. dynamically generated pages.
● documents which would attract unmanageable traffic, e.g. erotic material.
● documents which could represent a site unfavourably, e.g. bug archives.
● documents which aren't useful for world-wide indexing, e.g. local information.

The Architecture
The main design considerations to achieve this goal were:
● simple to administer,
● simple to implement, and
● simple to deploy.
This specifically ruled out special network-level protocols, platform-specific solutions, or changes to clients or servers. Instead, the mechanism uses a specially formatted resource, at a known location in the server's URL space. In its simplest form the resource could be a text file produced with a text editor, placed in the root-level server directory.
http://info.webcrawler.com/mak/projects/robots/eval.html (1 of 7) [18.02.2001 13:20:16] Evaluation of the Standard for Robots Exclusion
This formatted-file approach satisfied the design considerations: The administration was simple, because the format of the file was easy to understand, and required no special software to produce. The implementation was simple, because the format was simple to parse and apply. The deployment was simple, because no client or server changes were required. Indeed the majority of robot authors rapidly embraced this proposal, and it has received a great deal of attention in both Web-based documentation and the printed press. This in turn has promoted awareness and acceptance amongst users.

Problems and Feature Requests
In the years since the initial proposal, a lot of practical experience with the SRE has been gained, and a considerable number of suggestions for improvement or extensions have been made. They broadly fall into the following categories:
1. operational problems
2. general Web problems
3. further directives for exclusion
4. extensions beyond exclusion
I will discuss some of the most frequent suggestions in that order, and give some arguments in favour of or against them. One main point to keep in mind is that it is difficult to gauge how much of an issue these problems are in practice, and how wide-spread support for extensions would be. When considering further development of the SRE it is important to prevent second-system syndrome.

Operational problems
These relate to the administration of the SRE, and as such the effectiveness of the approach for the purpose.
Administrative access to the /robots.txt resource
The SRE specifies a location for the resource, in the root level of a server's URL space. Modifying this file generally requires administrative access to the server, which may not be granted to a user who would like to add exclusion directives to the file. This is especially common in large multi-user systems. It can be argued this is not a problem with the SRE, which after all does not specify how the resource is administered. It is for example possible to programmatically collect individuals' '~/robots.txt' files, combining them into a single '/robots.txt' file on a regular basis. How this could be implemented depends on the operating system, server software, and publishing process.
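As a minimal sketch of that collection idea (Python; the '/home/<user>/public_html/' layout, the '/~<user>/' URL mapping, and the output path are invented for illustration, and the sketch assumes each user's file holds only Disallow lines meant for all robots), a periodic job could rewrite each user's rules into the site-wide file:

    import glob
    import os

    # Assumed layout: /home/<user>/public_html/robots.txt holds Disallow
    # lines relative to that user's own URL space, and the server maps
    # /~<user>/ onto /home/<user>/public_html/.
    output = ["User-agent: *"]
    for path in sorted(glob.glob("/home/*/public_html/robots.txt")):
        user = path.split(os.sep)[2]          # the <user> path component
        with open(path) as user_file:
            for line in user_file:
                line = line.strip()
                if line.lower().startswith("disallow:"):
                    rule = line.split(":", 1)[1].strip()
                    output.append("Disallow: /~%s%s" % (user, rule))

    with open("/var/www/robots.txt", "w") as site_file:
        site_file.write("\n".join(output) + "\n")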
In practice users find their administrators unwilling or incapable of providing such a solution. This indicates again how important it is to stress simplicity; even if the extra effort required is minuscule, requiring changes in practices, procedures, or software is a major barrier for deployment. Suggestions to alleviate the problem have included producing a CGI script which combines multiple individual files on the fly, or listing multiple referral files in the '/robots.txt' which the robot can retrieve and combine. Both these options suffer from the same problem; some administrative access is
http://info.webcrawler.com/mak/projects/robots/eval.html (2 of 7) [18.02.2001 13:20:16] Evaluation of the Standard for Robots Exclusion
still required. This is the most painful operational problem, and cannot be sufficiently addressed in the current design. It seems that the only solution is to move the robot policy closer to the user, in the URL space they do control.
File specification
The SRE allows only a single method for specifying parts of the URL space: by substring anchored at the front. People have asked for substrings anchored at the end, as in "Disallow: *.shtml", as well as generalised regular expression parsing, as in 'Disallow: *sex*'. The issue with this extension is that it increases complexity of both administration and implementation. In this case I feel this may be justified.
Redundancy for specific robots
The SRE allows for specific directives for individual robots. This may result in considerable repetition of rules common to all robots. It has been suggested that an OO inheritance scheme could address this. In practice the per-robot distinction is not that widely used, and the need seems to be sporadic. The increased complexity of both administration and implementation seems prohibitive in this case.
Scalability
The SRE groups all rules for the server into a single file. This doesn't scale well to thousands or millions of individually specified URL's. This is a fundamental problem, and one that can only be solved by moving beyond a single file, and bringing the policy closer to the individual resources.
Web problems
These are problems faced by the Web at large, which could be addressed (at least for robots) separately using extensions to the SRE. I am against following that route, as it is fixing the problem in the wrong place. These issues should be addressed by a proper general solution separate from the SRE.
"Wrong" domain names
The use of multiple domain names sharing a logical network interface is a common practice (even without vanity domains), which often leads to problems with indexing robots, which may end up using an undesired domain name for a given URL. This could be addressed by adding a "preferred" address, or even encoding "preferred" domain names for certain parts of a URL space. This again increases complexity, and doesn't solve the problem for non-robots which can suffer the same fate. The issue here is that deployed HTTP software doesn't have a facility to indicate the host part of the HTTP URL, and a server therefore cannot use that to decide the availability of a URL. HTTP 1.1 and later address this using a Host header and full URI's in the request line. This will address this problem across the board, but will take time to be deployed and used.
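For reference, the mechanism referred to is simply an extra request header (the host name below is a placeholder): an HTTP/1.1 client names the intended domain explicitly, so that one IP address can serve several domain names unambiguously, for example:

    GET /index.html HTTP/1.1
    Host: www.example.com

HTTP/1.1 alternatively allows the full URI to appear on the request line itself, which carries the same information.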
http://info.webcrawler.com/mak/projects/robots/eval.html (3 of 7) [18.02.2001 13:20:16] Evaluation of the Standard for Robots Exclusion Mirrors Some servers, such as "webcrawler.com", run identical URL spaces on several different machines, for load balancing or redundancy purposes. This can lead to problems when a robot uses only the IP address to uniquely identify a server; the robot would traverse and list each instance of the server separately. It is possible to list alternative IP addresses in the /robots.txt file, indicating equivalency. However, in the common case where a single domain name is used for these separate IP addresses this information is already obtainable from the DNS. Updates Currently robots can only track updates by frequent revisits. There seem to be a few: the robot could request a notification when a page changes, the robot could ask for modification information in bulk, or the SRE could be extended to suggest expirations on URL's. This is a more general problem, and ties in to caching issues and the link consistency. I will not go into the first two options as they donot concern the SRE. The last option would duplicate existing HTTP-level mechanisms such as Expires, only because they are currently difficult to configure in servers. It seems to me this is the wrong place to solve that problem. Further directives for exclusion These concern further suggestions to reduce robot-generated problems for a server. All of these are easy to add, at the cost of more complex administration and implementation. It also brings up the issue of partial compliance; not all robot may be willing or able to support all of these. Given that the importance of these extensions is secondary to the SRE's purpose, I suggest they are to be listed as MAY or SHOULD, not MUST options. Multiple prefixes per line The SRE doesn't allow multiple URL prefixes on a single line, as in "Disallow: /users /tmp". In practice people do this, so the implementation (if not the SRE) could be changed to condone this practice. Hit rate This directive could indicate to a robot how long to wait between requests to the server. Currently it is accepted practice to wait at least 30 seconds between requests, but this is too fast for some sites, too slow for others. A limitation is that this would specify a value for the entire site, whereas the value may depend on specific parts of the URL space. ReVisit frequency This directive could indicate how long a robot should wait before revisiting pages on the server. A limitation is that this would specify a value for the entire site, whereas the value may depend on specific parts of the URL space. http://info.webcrawler.com/mak/projects/robots/eval.html (4 of 7) [18.02.2001 13:20:16] Evaluation of the Standard for Robots Exclusion This appears to duplicate some of the existing (and future) cache-consistency measures such as Expires. Visit frequency for '/robots.txt' This is a special version of the directive above; specifying how often the '/robots.txt' file should be refreshed. Again Expires could be used to do this. Visiting hours It has often been suggested to list certain hours as "preferred hours" for robot accesses. These would be given in GMT, and would probably list local low-usage time. A limitation is that this would specify a value for the entire site, whereas the value may depend on specific parts of the URL space. Visiting vs indexing The SRE specifies URL prefixes that are not to be retrieved. 
In practice we find it is used both for URL's that are not to be retrieved and for ones that are not to be indexed, and that the distinction is not explicit. For example, a page with links to a company's employees' pages may not be all that desirable to appear in an index, whereas the employees' pages themselves are desirable; the robot should be allowed to recurse on the parent page to get to the child pages and index them, without indexing the parent. This could be addressed by adding a "DontIndex" directive.
Extensions beyond exclusion
The SRE's aim was to reduce abuses by robots, by specifying what is off-limits. It has often been suggested to add more constructive information. I strongly believe such constructive information would be of immense value, but I contest that the '/robots.txt' file is the best place for this. In the first place, there may be a number of different schemes for providing such information; keeping exclusion and "inclusion" separate allows multiple inclusion schemes to be used, or the inclusion scheme to be changed without affecting the exclusion parts. Given the broad debates on meta information this seems prudent. Some of you may actually not be aware of ALIWEB, a separate pilot project I set up in 1994 which used a '/site.idx' file in IAFA format, as one way of making such inclusive information available. A full analysis of ALIWEB is beyond the scope of this document, but as it used the same concept as the '/robots.txt' (single resource on a known URL), it shares many of the problems outlined in this document. In addition there were issues with the exact nature of the meta data, the complexity of administration, the restrictiveness of the RFC822-like format, and internationalisation issues. That experience suggests to me that this does not belong in the '/robots.txt' file, except possibly in its most basic form: a list of URL's to visit. For the record, people's suggestions for inclusive information included:
http://info.webcrawler.com/mak/projects/robots/eval.html (5 of 7) [18.02.2001 13:20:16] Evaluation of the Standard for Robots Exclusion
● list of URI's to visit
● per-URL meta information
● site administrator contact information
● description of the site
● geographic information

Recommendations
I have outlined most of the problems and missed features of the SRE. I also have indicated that I am against most of the extensions to the current scheme, because of increased complexity, or because the '/robots.txt' is the wrong place to solve the problem. Here is what I believe we can do to address these issues.
Moving policy closer to the resources
To address the issues of scaling and administrative access, it is clear we must move beyond the single resource per server. There is currently no effective way in the Web for clients to consider collections (subtrees) of documents together. Therefore the only option is to associate policy with the resources themselves, i.e. the pages identified with a URL. This association can be done in a few ways:
Embedding the policy in the resource itself
This could be done using the META tag, e.g. <META NAME="robotpolicy" CONTENT="dontindex">. While this would only work for HTML, it would be extremely easy for a user to add this information to their documents. No software or administrative access is required for the user, and it is really easy to support in the robot.
Embedding a reference to the policy in the resource
This could be done using the LINK tag, e.g.
<LINK REL="robotpolicy" HREF="public.pol">
This would give the extra flexibility of sharing a policy among documents, and supporting different policy encodings which could move beyond RFC822-like syntax. The drawback is increased traffic (using regular caching) and complexity.
Using an explicit protocol for the association
This could be done using PEP, in a similar fashion to PICS. It may even be possible or beneficial to use the PICS framework as the infrastructure, and express the policy as a rating. Note that this can be deployed independently of, and used together with, a site '/robots.txt'.
I suggest the first option should be an immediate first step, with the other options possibly following later.
Meta information
The same three steps can be used for descriptive META information:
Embedding the meta information in the resource itself
This could be done using the META tag, e.g. <META NAME="description" CONTENT="...">. The nature of the META information could be the Dublin Core set, or even just "description" and "keywords". While this would only work for HTML, it would be extremely easy for a user
http://info.webcrawler.com/mak/projects/robots/eval.html (6 of 7) [18.02.2001 13:20:16] Evaluation of the Standard for Robots Exclusion
to add this information to their documents. No software or administrative access is required for the user, and it is really easy to support in the robot.
Embedding a reference to the meta information in the resource
This could be done using the LINK tag, e.g. <LINK REL="meta" HREF="doc.meta">. This would give the extra flexibility of sharing meta information among documents, and supporting different meta encodings which could move beyond RFC822-like syntax (which can even be negotiated using HTTP content type negotiation!). The drawback is increased traffic (using regular caching) and complexity.
Using an explicit protocol for the association
This could be done using PEP, in a similar fashion to PICS. It may even be possible or beneficial to use the PICS framework as the infrastructure, and express the meta information as a rating.
I suggest the first option should be an immediate first step, with the other options possibly following later.
Extending the SRE
The measures above address some of the problems in the SRE in a more scalable and flexible way than by adding a multitude of directives to the '/robots.txt' file. I believe that of the suggested additions, this one will have the most benefit, without adding complexity:
PleaseVisit
To suggest relative URL's to visit on the site
Standards...
I believe any future version of the SRE should be documented either as an RFC or a W3C-backed standard.

The Web Robots Pages http://info.webcrawler.com/mak/projects/robots/eval.html (7 of 7) [18.02.2001 13:20:16]