>Archive.org had a spider indexing my web site last night. It
>ignored the robots.txt file and was sucking down things it shouldn't
>have.
Hi Chip,
My name is Mike Burner, the author of the robot in question. We regret any
difficulties the visit of our robot may have caused, but it is incorrect to
say that the robot ignored your "robots.txt" file. We at the Archive are
very sensitive to the wishes of the site developers, and make every effort
to adhere to the proposed standard for robot exclusion.
Your "robots.txt" is inconsistent with the standard and was not parsed as
you intended. Specifically, your file has two "User-agent: *" lines:
    # $Id: robots.txt,v 1.4 1996/11/23 17:57:27 chip Exp $
    # Stay out of the spambait pages.
    User-agent: *
    Disallow: /spambait/bait-
    # Flush the old gn-type prefixes from search engines.
    User-agent: *
    Disallow: /0/
    Disallow: /0h/
    Disallow: /1/
    Disallow: /1h/
    Disallow: /I/
To quote Martijn's specification:
    User-agent
        The value of this field is the name of the robot the record is
        describing access policy for.

        If more than one User-agent field is present the record describes
        an identical access policy for more than one robot. At least one
        field needs to be present per record.

        The robot should be liberal in interpreting this field. A case
        insensitive substring match of the name without version information
        is recommended.

        If the value is '*', the record describes the default access policy
        for any robot that has not matched any of the other records.
        It is not allowed to have multiple such records in the "/robots.txt" file.
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    (http://info.webcrawler.com/mak/projects/robots/norobots.html#format)
Our parser honors only the last such entry in "robots.txt", so the
"Disallow: /spambait/bait-" rule in your first record was discarded.
I can see how such a mistake could be made, and I will change our parser to
recognize multiple "User-agent: *" sections. I would recommend, however,
that you modify your file to comply with the Standard for Robot Exclusion
(SRE) so that other crawlers will not delve into areas you would rather not
have indexed.
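To illustrate the difference, here is a minimal Python sketch of the two behaviors: honoring only the last "User-agent: *" record versus merging all of them. It is not the Archive's actual parser. The function names, the `merge` flag, and the sample path "/spambait/bait-1.html" are hypothetical, and the sketch simplifies the standard by treating a User-agent line that follows a Disallow line as the start of a new record.

```python
from typing import List

# The file from the message above, minus the $Id comment.
ROBOTS_TXT = """\
# Stay out of the spambait pages.
User-agent: *
Disallow: /spambait/bait-
# Flush the old gn-type prefixes from search engines.
User-agent: *
Disallow: /0/
Disallow: /0h/
Disallow: /1/
Disallow: /1h/
Disallow: /I/
"""


def default_disallows(text: str, merge: bool) -> List[str]:
    """Collect Disallow prefixes from the "User-agent: *" records.

    merge=False: a later "*" record replaces an earlier one (last wins).
    merge=True:  rules from every "*" record are combined.
    """
    rules: List[str] = []
    applies = False    # does the current record name "*"?
    in_agents = False  # still reading the record's User-agent lines?
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not in_agents:
                # Simplification: a User-agent line after rules opens a new record.
                applies = False
                in_agents = True
            if value == "*":
                if not applies and not merge:
                    rules = []  # last-wins: forget earlier "*" records
                applies = True
        elif field == "disallow":
            in_agents = False
            if applies and value:
                rules.append(value)
    return rules


def is_allowed(path: str, rules: List[str]) -> bool:
    """A path is allowed unless it starts with a disallowed prefix."""
    return not any(path.startswith(prefix) for prefix in rules)


last_wins = default_disallows(ROBOTS_TXT, merge=False)
merged = default_disallows(ROBOTS_TXT, merge=True)
print(is_allowed("/spambait/bait-1.html", last_wins))  # True: first record was dropped
print(is_allowed("/spambait/bait-1.html", merged))     # False: both records apply
```

Placing all six Disallow lines under a single "User-agent: *" line gives the same result under either strategy, which is why a file with one default record is the safer choice.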
Again, I regret any difficulty this episode has caused you or your group.
In the future, please contact me directly about any difficulties you have
with our robot.
Mike Burner